# Text Classification Assessment - Solution
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`. 

We've included 20 reviews that either contain `NaN` data or consist entirely of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')
df.head()

| | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |


### Task #2: Check for missing values:

In [2]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [4]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # skip NaN values (they are floats, not strings)
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
len(blanks)

0
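The same check can be done without an explicit loop. The sketch below is one possible alternative, not part of the original solution: it fills missing values with an empty string (which `str.isspace()` reports as `False`) and then tests every review at once. The small `demo` DataFrame is made up for illustration and stands in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real reviews
demo = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', '   ', np.nan, 'Dull.'],
})

# Replace NaN with '' so every entry is a string, then flag whitespace-only rows.
# An empty string is not considered whitespace, so NaN rows are not flagged here.
is_blank = demo['review'].fillna('').str.isspace()

# Collect the index labels of the whitespace-only reviews
blank_idx = demo.index[is_blank].tolist()

print(blank_idx)
```

On the real data you would apply the same two lines to `df['review']`; if the resulting list is non-empty, `df.drop(blank_idx, inplace=True)` removes those rows.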

### Task #3:  Remove NaN values:

In [5]:
df.dropna(inplace=True)

### Task #4: Take a quick look at the `label` column:

In [6]:
df['label'].value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`.

In [7]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
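As a quick sanity check on the split sizes, here is a small arithmetic sketch. It assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample (worth confirming against your installed version) and that 5980 rows remain after dropping the 20 `NaN` reviews.

```python
import math

n_samples = 5980     # 2990 pos + 2990 neg remaining after dropna()
test_size = 0.33     # fraction requested in train_test_split

# Assumed rounding rule: the test set size is rounded up
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the test set holds 1974 reviews, which agrees with the total support shown in the classification report further down.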

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
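Once fitted, a pipeline like this accepts raw strings directly, because the vectorizer step transforms the text before it reaches the classifier. The sketch below shows the idea on a made-up four-review corpus; `toy_reviews`, `toy_labels`, and `demo_clf` are illustrative names and are not part of the assessment.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical miniature training set, for illustration only
toy_reviews = [
    "A wonderful, moving film with a great cast",
    "Great story and wonderful acting",
    "A terrible, boring waste of time",
    "Boring plot and terrible dialogue",
]
toy_labels = ['pos', 'pos', 'neg', 'neg']

# Same two-step structure as text_clf above
demo_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])
demo_clf.fit(toy_reviews, toy_labels)

# predict() takes an iterable of raw strings and returns an array of labels
new_pred = demo_clf.predict(["What a wonderful movie"])
print(new_pred)
```

With the real model, the equivalent call would be `text_clf.predict([...])` on any list of review strings.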

### Task #7: Run predictions and analyze the results

In [9]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [10]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[900  91]
 [ 63 920]]


In [11]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [12]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9219858156028369
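The figures in the report can be re-derived by hand from the confusion matrix. In scikit-learn's layout, rows are the true labels and columns are the predicted labels, here ordered `neg`, `pos`. A short sketch using the counts printed above (the variable names are ours):

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
neg_as_neg, neg_as_pos = 900, 91    # actual neg reviews
pos_as_neg, pos_as_pos = 63, 920    # actual pos reviews

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos

# Precision: of everything predicted as a class, how much was correct
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)
precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)

# Recall: of everything actually in a class, how much was found
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)

# Accuracy: correct predictions over all predictions
accuracy = (neg_as_neg + pos_as_pos) / total

print(round(precision_neg, 2), round(recall_neg, 2))
print(round(precision_pos, 2), round(recall_pos, 2))
print(accuracy)
```

The row sums (991 and 983) are the `support` values in the report, and the diagonal (900 + 920 correct out of 1974) gives the overall accuracy.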


## Great job!