# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews: 3000 positive and 3000 negative. The text has been preprocessed and stored as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist entirely of whitespace.

For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [43]:
import pandas as pd
import numpy as np

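# Note: the task text names moviereviews2.tsv, but this cell loads moviereviews.tsv,
# so the outputs in this notebook reflect that file.
df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')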
df.head()

|   | label | review |
|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


### Task #2: Check for missing values:

In [44]:
# Check for NaN values:
df.isnull().sum()

label      0
review    35
dtype: int64

In [45]:
# Check for whitespace strings (it's OK if there aren't any!):
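blanks = []

# itertuples() yields (index, label, review) for each row
for i, lb, rv in df.itertuples():
    # NaN reviews are floats, so only test actual strings
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)

len(blanks)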


27
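Both checks are needed because `read_csv` treats the two cases differently: with default settings an empty field is parsed as `NaN`, while a field containing only spaces is kept as an ordinary string. The sketch below illustrates this on a small in-memory TSV. The `toy_tsv` string and its rows are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row TSV: one normal review, one empty field,
# and one field containing only spaces
toy_tsv = "label\treview\npos\ta fine film\nneg\t\npos\t   \n"

toy = pd.read_csv(io.StringIO(toy_tsv), sep='\t')

# The empty field is read as NaN...
n_missing = toy['review'].isnull().sum()
print(n_missing)

# ...while the whitespace-only field survives as a string
whitespace_only = [rv for rv in toy['review'] if isinstance(rv, str) and rv.isspace()]
print(len(whitespace_only))
```

This is why `isnull()` alone does not catch every unusable review.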

In [46]:
# blanks holds index labels, so select with .loc rather than .iloc
df.loc[blanks]

|   | label | review |
|---|---|---|
| 57 | neg | |
| 71 | pos | |
| 147 | pos | |
| 151 | pos | |
| 283 | pos | |
| 307 | pos | |
| 313 | neg | |
| 323 | pos | |
| 343 | pos | |
| 351 | neg | |


In [47]:
df.drop(blanks,inplace=True)

### Task #3: Remove NaN values:

In [48]:
df.dropna(inplace=True)
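As an alternative to handling the two cases in separate steps, a single boolean mask can flag both kinds of unusable review. This is a minimal sketch on a hypothetical `toy` DataFrame, not the solution notebook's method. Note that `str.strip().eq('')` also flags empty strings, which the `isspace()` loop in Task #2 does not.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['a fine film', '   ', np.nan, 'dull and overlong'],
})

# True where the review is missing or contains nothing but whitespace.
# str.strip() leaves NaN as NaN, and NaN never equals '', so the two
# conditions do not interfere with each other.
unusable = toy['review'].isna() | toy['review'].str.strip().eq('')

# Keep only the usable rows; the original index labels are preserved
clean = toy[~unusable]
print(clean)
```

Filtering with a mask returns a new DataFrame instead of modifying `toy` in place, which makes the step safe to re-run in a notebook.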

### Task #4: Take a quick look at the `label` column:

In [49]:
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64
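With 969 reviews in each class the data is perfectly balanced, so a classifier that always predicts one label would score 50% accuracy. The sketch below reproduces that baseline from the counts shown above. The `labels` Series is rebuilt from those counts rather than loaded from the file.

```python
import pandas as pd

# Rebuild the label column from the reported counts (969 neg, 969 pos)
labels = pd.Series(['neg'] * 969 + ['pos'] * 969, name='label')

# normalize=True returns class proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)

# Accuracy of always predicting the most frequent class
baseline_accuracy = proportions.max()
print(baseline_accuracy)
```

Any model worth keeping should beat this baseline by a clear margin.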

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`.

In [50]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
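`train_test_split` shuffles the rows before splitting, so class proportions in the test set can drift slightly from those of the full data. If that matters, the `stratify` argument keeps them aligned. A minimal sketch on hypothetical toy data follows; the `toy_X` and `toy_y` names are not part of the assessment, and the solution settings above do not use stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 positive and 10 negative "reviews"
toy_X = pd.Series([f'review {i}' for i in range(20)])
toy_y = pd.Series(['pos'] * 10 + ['neg'] * 10)

# stratify=toy_y preserves the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te))
print(y_te.value_counts())
```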


### Task #6: Build a pipeline to vectorize the data, then fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [51]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

model = Pipeline([
    ('tfidf', TfidfVectorizer()),  # convert raw text to TF-IDF features
    ('clf', LinearSVC()),          # linear support vector classifier
])

model.fit(X_train,y_train)



Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
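Because the vectorizer and classifier live in one `Pipeline`, the fitted object accepts raw strings directly, with no separate transform step at prediction time. The sketch below shows this on four made-up reviews. The `toy_*` names and texts are illustrative only, so the prediction it prints says nothing about the real model's quality.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data, far too small for a real model
toy_reviews = [
    'a wonderful, moving film',
    'great acting and a great script',
    'boring and far too long',
    'a dull, lifeless mess',
]
toy_labels = ['pos', 'pos', 'neg', 'neg']

toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])
toy_model.fit(toy_reviews, toy_labels)

# Raw text goes in; the pipeline vectorizes it before classifying
toy_pred = toy_model.predict(['a great film'])
print(toy_pred)

# Individual steps stay accessible by name after fitting
vocab_size = len(toy_model.named_steps['tfidf'].vocabulary_)
print(vocab_size)
```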

### Task #7: Run predictions and analyze the results

In [52]:
# Form a prediction set
predictions = model.predict(X_test)

In [53]:
# Report the confusion matrix
from sklearn import metrics

print(metrics.confusion_matrix(y_test,predictions))


[[259  49]
 [ 49 283]]
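The raw array is easier to read with labels attached. scikit-learn orders rows by true class and columns by predicted class, with classes sorted alphabetically (`neg`, then `pos`). A small sketch that wraps the matrix printed above in a DataFrame follows; the row and column names are chosen here for readability.

```python
import numpy as np
import pandas as pd

# Values copied from the confusion matrix printed above
cm = np.array([[259, 49],
               [49, 283]])

# Rows are true labels, columns are predicted labels
cm_df = pd.DataFrame(
    cm,
    index=['true neg', 'true pos'],
    columns=['predicted neg', 'predicted pos'],
)
print(cm_df)

# Off-diagonal cells are the misclassified reviews
n_errors = int(cm.sum() - np.trace(cm))
print(n_errors)
```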


In [54]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.84      0.84      0.84       308
         pos       0.85      0.85      0.85       332

    accuracy                           0.85       640
   macro avg       0.85      0.85      0.85       640
weighted avg       0.85      0.85      0.85       640
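Every figure in the report can be derived from the confusion matrix. As a worked check, the sketch below recomputes precision, recall, and accuracy from the four counts printed earlier, treating rows as true labels and columns as predictions. The variable names are chosen here for illustration.

```python
# Counts from the confusion matrix: rows = true class, columns = predicted class
neg_as_neg, neg_as_pos = 259, 49
pos_as_neg, pos_as_pos = 49, 283

# Precision: of everything predicted as a class, how much was correct
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)
precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)

# Recall: of everything truly in a class, how much was found
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)

# Accuracy: correct predictions over all predictions
total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
accuracy = (neg_as_neg + pos_as_pos) / total

print(f'neg: precision={precision_neg:.2f} recall={recall_neg:.2f}')
print(f'pos: precision={precision_pos:.2f} recall={recall_pos:.2f}')
print(f'accuracy={accuracy}')
```

Precision and recall coincide within each class here because the two off-diagonal counts happen to be equal.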



In [55]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.846875
