___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [12]:

import pandas as pd

df = pd.read_csv("../TextFiles/moviereviews2.tsv", sep="\t")
df.head()


| | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |


### Task #2: Check for missing values:

In [13]:
# Check for NaN values:
df.isnull().sum()


label      0
review    20
dtype: int64

In [14]:
# Check for whitespace strings (it's OK if there aren't any!):
idx_list = []

for idx, lb, rv in df.itertuples():
    # NaN reviews are floats, so only test real strings
    if isinstance(rv, str):
        if rv.isspace():
            idx_list.append(idx)
print(len(idx_list), 'blanks', idx_list)


0 blanks []
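The same whitespace check can also be written without an explicit loop. The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its contents are illustrative assumptions, not rows from **moviereviews2.tsv**). It relies on the pandas `.str.isspace()` accessor, which yields a missing value for `NaN` entries, so the comparison with `True` keeps only real whitespace-only strings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from moviereviews2.tsv
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Great film", "   ", np.nan, "\t\n"],
})

# .str.isspace() is True only for non-empty, whitespace-only strings.
# NaN entries do not compare equal to True, so they are excluded here.
is_blank = toy["review"].str.isspace().eq(True)
blank_idx = toy.index[is_blank].tolist()

print(len(blank_idx), "blanks", blank_idx)
```

Note that an empty string `''` is not counted as whitespace by `isspace()`, in either the loop or the vectorized form.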


### Task #3: Remove NaN values:

In [15]:
df.dropna(inplace=True)
len(df)

5980

### Task #4: Take a quick look at the `label` column:

In [16]:
df.label.value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

In [34]:
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [18]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

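# random_state is not set, so the split and the scores below vary between runs;
# add random_state=42 to compare against the solution notebook.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)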
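To illustrate what `random_state` does, here is a small sketch on made-up data (the `toy` frame and the `split` helper are hypothetical names used only for this example). It shows that two calls with the same seed return the same test rows. The `stratify` argument is an optional extra, not something the assessment asks for; it keeps the `pos`/`neg` ratio the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 6 'pos' and 6 'neg' rows
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(12)],
    "label": ["pos", "neg"] * 6,
})

def split(seed):
    # Same settings as the task, plus an optional stratified split
    return train_test_split(
        toy["review"], toy["label"],
        test_size=0.33, random_state=seed, stratify=toy["label"],
    )

X_train_a, X_test_a, y_train_a, y_test_a = split(42)
X_train_b, X_test_b, y_train_b, y_test_b = split(42)

# Identical seeds select identical rows for the test set
same_rows = X_test_a.index.tolist() == X_test_b.index.tolist()
test_counts = y_test_a.value_counts().to_dict()
print(same_rows, test_counts)
```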



### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [46]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

svc_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stopwords)),
    ('svc', LinearSVC())
])

svc_pipe.fit(X_train, y_train)

nb_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stopwords)),
    ('nb', MultinomialNB())
])

nb_pipe.fit(X_train, y_train)


Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',
                                             'as', 'at', 'be', 'been', 'but',
                                             'by', 'can', 'even', 'ever', 'for',
                                             'from', 'get', 'had', 'has',
                                             'have', 'he', 'her', 'hers', 'his',
                                             'how', 'i', 'if', 'in', 'into',
                                             'is', ...])),
                ('nb', MultinomialNB())])
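A fitted pipeline accepts raw strings directly, because the vectorizer step transforms the text before the classifier sees it. The sketch below uses a tiny made-up corpus and a shortened stopword list (`toy_reviews`, `toy_labels`, `toy_stopwords` and `toy_pipe` are illustrative assumptions, not part of the assessment data) to show two things: the words passed as `stop_words` do not end up in the learned vocabulary, and `predict` can be called on plain text.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, far too small for a real model
toy_reviews = [
    "I loved this movie, it was wonderful",
    "A wonderful and touching film",
    "This was a terrible waste of time",
    "Terrible acting and a boring plot",
]
toy_labels = ["pos", "pos", "neg", "neg"]
toy_stopwords = ["a", "and", "i", "it", "of", "this", "was"]

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=toy_stopwords)),
    ("svc", LinearSVC()),
])
toy_pipe.fit(toy_reviews, toy_labels)

# The vocabulary learned by the vectorizer step, with stopwords filtered out
vocab = toy_pipe.named_steps["tfidf"].vocabulary_
print(sorted(vocab))

# Raw strings go straight into predict(); no manual vectorizing needed
toy_pred = toy_pipe.predict(["what a wonderful movie", "boring and terrible"])
print(toy_pred)
```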

### Task #7: Run predictions and analyze the results

In [49]:
# Form a prediction set
y_pred = svc_pipe.predict(X_test)
y_pred_1 = nb_pipe.predict(X_test)



In [51]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Evaluate the LinearSVC pipeline: confusion matrix, per-class report, accuracy
print(confusion_matrix(y_test, y_pred))

report = classification_report(y_test, y_pred)
print(report)
accuracy_score(y_test, y_pred)

[[893  98]
 [ 64 919]]
              precision    recall  f1-score   support

         neg       0.93      0.90      0.92       991
         pos       0.90      0.93      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



0.9179331306990881

In [52]:
# Repeat the evaluation for the Naive Bayes pipeline

print(confusion_matrix(y_test, y_pred_1))
report = classification_report(y_test, y_pred_1)
print(report)

accuracy_score(y_test, y_pred_1)

[[938  53]
 [126 857]]
              precision    recall  f1-score   support

         neg       0.88      0.95      0.91       991
         pos       0.94      0.87      0.91       983

    accuracy                           0.91      1974
   macro avg       0.91      0.91      0.91      1974
weighted avg       0.91      0.91      0.91      1974



0.9093211752786221
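The figures in the two classification reports can be recomputed by hand from the confusion matrices, which helps when reading them. The sketch below copies both matrices from the outputs above (rows are true labels, columns are predicted labels, in the order `neg`, `pos`) and derives accuracy, precision and recall with NumPy. The `summarize` helper is a hypothetical name introduced for this example.

```python
import numpy as np

# Confusion matrices copied from the outputs above
cm_svc = np.array([[893, 98], [64, 919]])    # LinearSVC
cm_nb = np.array([[938, 53], [126, 857]])    # MultinomialNB

def summarize(cm):
    """Return accuracy plus per-class precision and recall (order: neg, pos)."""
    accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
    precision = np.diag(cm) / cm.sum(axis=0)  # correct / everything predicted as that class
    recall = np.diag(cm) / cm.sum(axis=1)     # correct / everything truly in that class
    return accuracy, precision, recall

acc_svc, prec_svc, rec_svc = summarize(cm_svc)
acc_nb, prec_nb, rec_nb = summarize(cm_nb)

print("SVC", round(acc_svc, 4), prec_svc.round(2), rec_svc.round(2))
print("NB ", round(acc_nb, 4), prec_nb.round(2), rec_nb.round(2))
```

On this particular split, Naive Bayes labels more reviews as `neg` (higher `neg` recall, lower `pos` recall), while LinearSVC is more balanced and slightly more accurate overall.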

## Great job!