### Word Embedding & Sentiment Classification
**Note:** The dataset used here is obtained from initial preprocessing on the [original dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

In [1]:
import warnings
warnings.simplefilter('ignore')

In [2]:
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import cohen_kappa_score

from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

from imblearn.over_sampling import SMOTE
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import nltk
nltk.download('punkt')
nltk.download('stopwords')

Using TensorFlow backend.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\harshil\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\harshil\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Import and prepare dataset

In [3]:
data = pd.read_csv('./datasets/data.csv')

Let's see what's inside the *data*!

In [4]:
data.head()

Unnamed: 0,review,label
0,this was painful i made myself watch it until ...,0.0
1,once again mr costner has dragged out a movie ...,0.0
2,by strange coincidence i ve started to watch t...,0.0
3,well the hero and the terror is slightly below...,0.0
4,well the hero and the terror is slightly below...,0.0


Here, Label $0.0$ means **_negative_** *sentiment*, while $1.0$ means **_positive_** *sentiment*.

Now, let's check if our dataset contains any null values, if yes, drop those rows!

In [5]:
data.isnull().sum()

review    2
label     5
dtype: int64

In [6]:
data = data.dropna().reset_index(drop=True)

In [7]:
data.isnull().sum()

review    0
label     0
dtype: int64

Now it looks good!

Next step, split the dataset in train-test!

In [8]:
X_train, X_test, y_train, y_test = train_test_split(data['review'], data['label'], test_size = .4, shuffle = False)

In [9]:
len(X_train), len(y_train), len(X_test), len(y_test)

(29446, 29446, 19632, 19632)

#### A little about number of sentiments!

In [10]:
sentiments = data['label'].value_counts()
print('Sentiments in entire dataset:\n Positive: {}\n Negative: {}'.format(sentiments[1], sentiments[0]))

Sentiments in entire dataset:
 Positive: 24536
 Negative: 24542


After split,
- Number of **_positive sentiments_** in **_train data + test data_** should be equal to the **_total number positive sentiments_** in entire dataset,
- Likewise, Number of **_neagetive sentiments_** in **_train data + test data_** should be equal to the **_total number negative sentiments_** in entire dataset.

In [11]:
def get_sentiments(d, _d):
    positive = (d==1).sum()
    negative = (d==0).sum()
    print('Sentiments in {}:\n Positive: {}\n Negative: {}'.format(_d, positive, negative))

In [12]:
get_sentiments(y_train, 'Train data')
get_sentiments(y_test, 'Test data')

Sentiments in Train data:
 Positive: 12333
 Negative: 17113
Sentiments in Test data:
 Positive: 12203
 Negative: 7429


Everything seems right! Let's proceed further!

#### TF-IDF

Now, as we all know, we cannot feed the data to a classifier as it is. We should first convert these data into a numerical form known as vectors.

Here, we will use **TF-IDF** (**T**erm **F**requency–**I**nverse **D**ocument **F**requency), a numerical statistic that reflects how important a word is to a document in a collection or corpus by assigning some weight to it.

In [13]:
def tokenize(text):
    return [word for word in word_tokenize(text.lower()) if word not in stopwords.words('english')]

Since our dataset contains in total $49078$ reviews, generating vectors will take a lot of time. Doing so every time will be a time-consuming task.

Therefore, once we generate vectors in the first run, we can _store_ the **vocabulary** in a separate file. We can then use this saved vocabulary to _fit_ our _train_ and _test data_ in future runs (*of this program, of course!*).

The following method will **_initialize vectorizer_** based on the chosen option.

In [14]:
def choose_vectorizer(option):
    if option == 'generate':
        vectorizer = TfidfVectorizer(tokenizer = tokenize)
    elif option == 'load':
        vectorizer = TfidfVectorizer(vocabulary = pickle.load(open('vocabulary.pkl', 'rb')))
    
    return vectorizer

In following code cell, choose the option **_generate_** if you want to generate vectors of train-test data again (*this will also store the vocabulary in the same directory as this project*) otherwise go with the option **_load_** if you already have vocabulary (*saved in the same directory as this project*).

In [15]:
%%time
options = ['generate', 'load']

# 0 to generate, 1 to load (choose wisely, your life depends on it!)
option = options[0] 

vectorizer = choose_vectorizer(option)
vectorized_train_data = vectorizer.fit_transform(X_train)
vectorized_test_data = vectorizer.transform(X_test)
    
if option == 'generate':
    pickle.dump(vectorizer.vocabulary_, open('vocabulary.pkl', 'wb'))

Wall time: 2h 4s


#### Training and Validation

In the training dataset, there are more negative reviews than positive ones. Therefore, we will first make both the sides equal using **SMOTE** (**S**ynthetic **M**inority **O**ver-sampling **TE**chnique).

In [16]:
%%time
sm = SMOTE(random_state=42, ratio=1.0)
X_train, y_train = sm.fit_sample(vectorized_train_data, y_train)

Wall time: 29.3 s


We will use **Logistic Regression** classifier here. So, let's train this guy!

In [17]:
clf = LogisticRegression()

In [18]:
%%time
clf.fit(X_train, y_train)

Wall time: 3.22 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Of course, we will also do some **_cross-validation_**!

In [19]:
%%time
kf = KFold(n_splits=10, random_state = 42, shuffle = True)
scores = cross_val_score(clf, X_train, y_train, cv = kf)

Wall time: 21.2 s


In [20]:
print('Cross-validation scores:', scores)
print('Cross-validation accuracy: {:.4f} (+/- {:.4f})'.format(scores.mean(), scores.std() * 2))

Cross-validation scores: [0.90914403 0.90855974 0.915279   0.91002045 0.90739118 0.90943617
 0.90531853 0.92314436 0.91642314 0.92197545]
Cross-validation accuracy: 0.9127 (+/- 0.0118)


The classifier is trained, so let's check its performance on the **_validation set_**.

In [21]:
predictions = clf.predict(vectorized_test_data)

validation = dict()

validation['accuracy'] = accuracy_score(y_test, predictions)
validation['precision'] = precision_score(y_test, predictions, average='macro')
validation['recall'] = recall_score(y_test, predictions, average='macro')
validation['f1'] = f1_score(y_test, predictions, average='macro')

In [22]:
print('Training evaluation:\n', '-' * 12)
for v in validation:
    print('{}: {:.5f}'.format(v.title(), validation[v]))

Training evaluation:
 ------------
Accuracy: 0.87322
Precision: 0.86305
Recall: 0.87579
F1: 0.86780


#### Cohen-Kappa score

**Cohen’s kappa** statistic measures **_interrater reliability_** which is more robust than a simple percent agreement calculation.

In simple words, it will show us a level of agreement between the labels predicted by the classifier and the actual labels.

In [23]:
p = predictions.tolist()
ck = cohen_kappa_score(y_test, p)
print('C-K Score: {:.5f}'.format(ck))

C-K Score: 0.73606


#### It's time for a tiny test!

In [24]:
example_reviews = ['An honest, engaging, and surprisingly funny look back at one of modern television\'s greatest achievements.',
          'Excellent movie! Inspiring and very entertaining for all especially youth and anyone inspired by today\'s modern age of tech entrepreneurship!',
          'Honestly even the trailer made me uncomfortable.',
          'I never write movie reviews, but this one was such a stinker, I feel I owe it to everyone to at least provide a warning.',
          'This movie was a good movie by standard and a lil beyond standard. It was written very well, The acting was great, each characters performance was clever and the comedic timing was spot on. The story line is very real and relatable. Enjoyable for adults and completely appropriate for pre-teens up to 20. Go support, my family loved it.']

As one can tell, **_first_**, **_second_** and **_fifth_** reviews in **_example_reviews_** are **_positive_** while **_third_** and **_fourth_** reviews are **_negative_**. _Let's see what classifier predicts!_

In [25]:
example_preds = clf.predict(vectorizer.transform(example_reviews))
print(' '.join(str(int(p)) for p in example_preds))

1 1 0 0 1


**Perfect**, Classifier also says the same!