**Aim:** Develop a classification model that predicts the Positive/Negative label of a review from its text content.

**Steps to develop a classification model:**
* Read in a collection of documents (a corpus)
* Transform the text into numerical vector data using a pipeline
* Create a classifier
* Fit/train the classifier
* Test the classifier on new data
* Evaluate performance

**Perform imports and load the dataset:**

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('data/moviereviews.tsv', sep='\t')
df.head()

In [None]:
len(df)

**Check for missing values:**  
* Detect & remove NaN values   
* Detect & remove empty strings


**Detect & remove NaN values:**

In [None]:
# Count the NaN values in each column:
df.isnull().sum()

In [None]:
df.dropna(inplace=True)

len(df)

**Detect & remove empty strings:**

In [None]:
blanks = []  # index labels of rows whose review is empty or whitespace-only

for row in df.itertuples():           # each row exposes .Index, .label, .review
    rv = row.review
    if isinstance(rv, str):           # skip any non-string (NaN) values
        if not rv.strip():            # true for '' and for whitespace-only text
            blanks.append(row.Index)  # record the matching index label

print(len(blanks), 'blanks: ', blanks)

In [None]:
df.drop(blanks, inplace=True)
len(df)
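
The same check can also be written without an explicit loop. The following is a minimal sketch on a small made-up DataFrame (the rows and the `'pos'`/`'neg'` label values are illustrative, not taken from the dataset). It assumes the same `label` and `review` column names and uses pandas string methods to flag reviews that are empty or contain only whitespace.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Loved it.', '   ', '', 'Dull and slow.'],
})

# True where the review is '' or whitespace-only
is_blank = toy['review'].str.strip().eq('')

# Keep only the rows with real text
cleaned = toy[~is_blank]
print(is_blank.sum(), 'blanks removed;', len(cleaned), 'rows remain')
```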

In [None]:
# Take a quick look at how many reviews carry each label
df['label'].value_counts()

**Split the data into train & test sets:**

In [None]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
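
If the two labels are unevenly represented, a plain random split can leave the train and test sets with different class ratios. As a hedged sketch, `train_test_split` accepts a `stratify` argument that preserves the label proportions in both parts. The toy data below (texts and `'pos'`/`'neg'` labels) is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 6 'pos' and 6 'neg' reviews
X_toy = pd.Series([f'review text {i}' for i in range(12)])
y_toy = pd.Series(['pos', 'neg'] * 6)

# stratify=y_toy keeps the pos/neg ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

print(y_tr.value_counts())
print(y_te.value_counts())
```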

**Build pipelines that vectorize the text and then apply a classifier:**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naïve Bayes:
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])
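
To see what the `tfidf` step hands to the classifier, here is a small sketch that runs `TfidfVectorizer` on its own over a three-sentence corpus invented for this example. With default settings, each document becomes one row of a sparse matrix with one column per vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from the dataset
corpus = [
    'a great movie',
    'a terrible movie',
    'great acting great plot',
]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)    # sparse matrix: documents x terms

print(X_tfidf.shape)                   # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # learned terms; default tokenization drops 1-letter words
print(X_tfidf.toarray().round(2))      # TF-IDF weights, one row per document
```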

**Feed the training data through the first pipeline:**

In [None]:
# Fit the naïve Bayes pipeline on the training data

text_clf_nb.fit(X_train, y_train)


**Run predictions and analyze the results (naïve Bayes):**

In [None]:
# Form a prediction set
predictions = text_clf_nb.predict(X_test)

In [None]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))
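
The raw matrix is easier to read with row and column labels. In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, ordered by sorted label value unless `labels` is given. A minimal sketch with made-up labels (`'neg'` and `'pos'` are assumed here for illustration):

```python
import pandas as pd
from sklearn import metrics

# Hypothetical true and predicted labels
y_true = ['neg', 'neg', 'pos', 'pos', 'pos']
y_pred = ['neg', 'pos', 'pos', 'pos', 'neg']

labels = ['neg', 'pos']
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)

# Wrap the array in a DataFrame so each cell is labeled
cm_df = pd.DataFrame(
    cm,
    index=[f'true {name}' for name in labels],
    columns=[f'predicted {name}' for name in labels],
)
print(cm_df)
```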

In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

**Feed the training data through the second pipeline:**

In [None]:
# Fit the Linear SVC pipeline on the training data
text_clf_lsvc.fit(X_train, y_train)

**Run predictions and analyze the results (Linear SVC):**

In [None]:
# Form a prediction set
predictions = text_clf_lsvc.predict(X_test)

In [None]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))
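
Because each pipeline bundles the vectorizer with the classifier, a fitted pipeline accepts raw strings directly. The sketch below trains a small Linear SVC pipeline on a handful of invented sentences (the texts, the `'pos'`/`'neg'` labels, and the name `demo_clf` are placeholders) and then labels unseen text. In this notebook, the same `predict` call could be made on `text_clf_lsvc`.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data for illustration only
train_texts = [
    'a wonderful and moving film',
    'great acting and a great story',
    'terrible plot and awful acting',
    'boring and a waste of time',
]
train_labels = ['pos', 'pos', 'neg', 'neg']

demo_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])
demo_clf.fit(train_texts, train_labels)

# predict() takes an iterable of raw strings and returns one label per string
new_reviews = ['what a wonderful story', 'awful and boring']
print(demo_clf.predict(new_reviews))
```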
