___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment 

### Goal: Given a set of movie reviews that have been labeled negative or positive, build a model that predicts the label of a review from its text

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

## Complete the tasks in bold below!

**Task: Perform imports and load the dataset into a pandas DataFrame**
For this exercise you can load the dataset from `'../DATA/moviereviews.csv'`.

In [1]:
# CODE HERE

In [1]:
import numpy as np
import pandas as pd

In [17]:
df = pd.read_csv('../DATA/moviereviews.csv')

In [3]:
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


**TASK: Check to see if there are any missing values in the dataframe.**

In [7]:
#CODE HERE

In [24]:
df.isnull().sum()

label     0
review    0
dtype: int64

**TASK: Remove any reviews that are NaN**

In [18]:
df.dropna(inplace=True)

**TASK: Check to see if any reviews are blank strings and not just NaN. Note: This means a review text could just be: "" or "  " or some other larger blank string. How would you check for this? Note: There are many ways! Once you've discovered the reviews that are blank strings, go ahead and remove them as well. [Click me for a big hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.isspace.html)**

In [23]:
df['review'].str.isspace().sum()

0

In [19]:
df = df[~df['review'].str.isspace()]
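
One caveat: `str.isspace()` returns `False` for the empty string `""`, so the check above catches `"  "` but not `""`. Below is a minimal sketch of one way to catch both, by stripping whitespace before comparing. The `toy` DataFrame and its contents are made up for illustration and are not taken from `moviereviews.csv`.

```python
import pandas as pd

# Hypothetical toy data, not from the real dataset
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['great film', '', '   ', 'dull plot'],
})

# isspace() flags whitespace-only strings, but not the empty string
only_space = toy['review'].str.isspace()

# Stripping first flags both "" and "   "
blank = toy['review'].str.strip() == ''

# Keep only the rows whose review has some non-whitespace text
cleaned = toy[~blank]

print(only_space.tolist())
print(blank.tolist())
print(len(cleaned))
```

If the dataset contains no empty strings, both masks select the same rows, so either approach works for this exercise.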

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27 entries, 57 to 1993
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   27 non-null     object
 1   review  27 non-null     object
dtypes: object(2)
memory usage: 648.0+ bytes


**TASK: Confirm the value counts per label:**

In [23]:
#CODE HERE

In [11]:
df['label'].value_counts()

neg    14
pos    13
Name: label, dtype: int64

## EDA on Bag of Words

**Bonus Task: Can you figure out how to use a CountVectorizer model to get the top 20 words (that are not English stop words) per label type? Note: this is a bonus task, as we did not show this in the lectures, but a quick Google search should put you on the right path. [Click me for a big hint](https://stackoverflow.com/questions/16288497/find-the-most-common-term-in-scikit-learn-classifier)**

In [45]:
#CODE HERE

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

In [27]:
neg_reviews = df[df['label'] == 'neg']['review']
pos_reviews = df[df['label'] == 'pos']['review']

# Create a CountVectorizer
cv = CountVectorizer(stop_words='english')

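# Learn the vocabulary from the negative reviews and count their words
neg_matrix = cv.fit_transform(neg_reviews)
# Reuse that same vocabulary for the positive reviews so the columns line up.
# Note: words that appear only in positive reviews are not counted.
pos_matrix = cv.transform(pos_reviews)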

# Get the vocabulary learned from CountVectorizer
vocabulary = cv.get_feature_names_out()

# Calculate word frequencies for negative and positive reviews
neg_word_freqs = zip(vocabulary, neg_matrix.sum(axis=0).tolist()[0])
pos_word_freqs = zip(vocabulary, pos_matrix.sum(axis=0).tolist()[0])

# Define a function to get top words
def get_top_words(word_freqs):
    sorted_words = sorted(word_freqs, key=lambda x: x[1], reverse=True)
    top_words = [word for word, freq in sorted_words[:20]]
    return top_words

print("Top 20 words for Negative reviews:")
print(get_top_words(neg_word_freqs))

print("\nTop 20 words for Positive reviews:")
print(get_top_words(pos_word_freqs))


Top 20 words for Negative reviews:
['film', 'movie', 'like', 'just', 'time', 'good', 'bad', 'character', 'story', 'plot', 'characters', 'make', 'really', 'way', 'little', 'don', 'does', 'doesn', 'action', 'scene']

Top 20 words for Positive reviews:
['film', 'movie', 'like', 'just', 'story', 'good', 'time', 'character', 'life', 'characters', 'way', 'films', 'does', 'best', 'people', 'make', 'little', 'really', 'man', 'new']


### Training and Data

**TASK: Split the data into features and a label (X and y) and then perform a train/test split. You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.20, random_state=101`**

In [None]:
#CODE HERE

In [28]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
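
To sanity-check a split, it can help to look at the size and class balance of each part. The sketch below does this on made-up data. It also passes `stratify=y`, an optional `train_test_split` argument that the solution above does not use, which keeps the label proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 positive and 5 negative reviews
toy = pd.DataFrame({
    'label': ['pos', 'neg'] * 5,
    'review': [f'review number {i}' for i in range(10)],
})

X = toy['review']
y = toy['label']

# stratify=y is an optional extra, not part of the solution's settings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

print(len(X_train), len(X_test))
print(y_test.value_counts())
```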

### Training a Model

**TASK: Create a Pipeline that will both create a TF-IDF vector out of the raw text data and fit a supervised learning model of your choice. Then fit that pipeline on the training data.**

In [None]:
#CODE HERE

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

In [30]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', LinearSVC()),
])

**TASK: Create a classification report based on the results of your Pipeline.**

In [77]:
#CODE HERE

In [31]:
pipe.fit(X_train, y_train)



In [32]:
from sklearn.metrics import classification_report

In [33]:
preds = pipe.predict(X_test)

In [34]:
print(classification_report(y_test,preds))

              precision    recall  f1-score   support

         neg       0.81      0.86      0.83       191
         pos       0.85      0.81      0.83       197

    accuracy                           0.83       388
   macro avg       0.83      0.83      0.83       388
weighted avg       0.83      0.83      0.83       388
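
Because the pipeline bundles the TF-IDF step with the classifier, a fitted pipeline can be given raw strings directly. The sketch below shows this on a tiny made-up training set. The reviews and the names `toy_pipe` and `new_reviews` are hypothetical, and a model trained on four sentences is for illustration only.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data, not from the real dataset
train_text = [
    'great wonderful film',
    'loved this great movie',
    'terrible awful film',
    'hated this awful movie',
]
train_labels = ['pos', 'pos', 'neg', 'neg']

# Same two steps as the solution: vectorize the text, then classify
toy_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', LinearSVC()),
])
toy_pipe.fit(train_text, train_labels)

# predict() takes raw text; the pipeline applies TF-IDF internally
new_reviews = ['an awful and terrible script', 'a wonderful, great cast']
new_preds = toy_pipe.predict(new_reviews)
print(list(new_preds))
```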



## Great job!