# Movie Review Sentiment Baseline Performance
This notebook establishes baseline performance on binary sentiment classification of movie reviews using simple linear models.  The data is the Large Movie Review Dataset from [Andrew Maas](http://ai.stanford.edu/~amaas/data/sentiment/).  We use sklearn to tokenize the reviews, build TF-IDF feature vectors, and fit a ridge classifier whose regularization strength is chosen by cross-validation.  There is also a custom implementation of the NB-SVM model from the Baselines and Bigrams paper, which gets slightly better performance.

### Download Data

In [1]:
import os, tarfile
import urllib.request  # `import urllib` alone does not load the `request` submodule

In [2]:
DATA_URL = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
DATA_DIR = './data'

if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

if not os.path.isfile(os.path.join(DATA_DIR,'movie_data.tar.gz')):
    urllib.request.urlretrieve(DATA_URL, os.path.join(DATA_DIR,'movie_data.tar.gz'))
else:
    print("Data already downloaded.")

if os.path.isfile(os.path.join(DATA_DIR,'movie_data.tar.gz')) and not os.path.exists(os.path.join(DATA_DIR,'aclImdb')):
    # The context manager closes the archive even if extraction fails
    with tarfile.open(os.path.join(DATA_DIR,'movie_data.tar.gz')) as f:
        f.extractall(path=DATA_DIR)
else:
    print("Tar file already extracted.")

Tar file already extracted.


### Create Train/Test IMDB Dataframes

In [3]:
import numpy as np
import pandas as pd

In [4]:
TRAIN_DATA_FOLDER = 'data/aclImdb/train/'
TEST_DATA_FOLDER = 'data/aclImdb/test/'

In [18]:
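def create_dataframe_from_files(data_folder):
    '''Read every review under data_folder/pos and data_folder/neg into a shuffled dataframe'''
    examples = list()
    for d in ['pos','neg']:
        label = 1 if d=='pos' else 0
        for f in os.listdir(os.path.join(data_folder,d)):
            # Use a context manager so each review file is closed after reading
            with open(os.path.join(data_folder,d,f),'r', encoding='utf-8') as _tmp:
                examples.append((_tmp.read(),f,label))
    df_tmp = pd.DataFrame(examples, columns=['text','file','target'])
    # Shuffle the rows so positive and negative reviews are interleaved
    df_tmp = df_tmp.sample(frac=1)
    df_tmp = df_tmp.reset_index(drop=True)
    return df_tmp
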
df_train = create_dataframe_from_files(TRAIN_DATA_FOLDER)
df_test = create_dataframe_from_files(TEST_DATA_FOLDER)

print(df_train.shape)
print(df_test.shape)

(25000, 3)
(25000, 3)
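
The shuffle in `create_dataframe_from_files` is unseeded, so the row order differs between runs. A minimal sketch, assuming a fixed seed is acceptable here, shows how `random_state` (42 is an arbitrary assumed value) makes the shuffle repeatable and how to confirm that the label counts are unchanged. The `toy` frame is hypothetical and only mirrors the notebook's column names.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_train; column names match the notebook.
toy = pd.DataFrame({
    'text': ['great film', 'awful plot', 'loved it', 'boring'],
    'file': ['0_9.txt', '1_2.txt', '2_8.txt', '3_1.txt'],
    'target': [1, 0, 1, 0],
})

# random_state (an assumed seed) makes the shuffle repeatable across runs.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Shuffling reorders rows but leaves the label counts unchanged.
class_counts = shuffled_a['target'].value_counts().to_dict()
print(class_counts)
print(shuffled_a.equals(shuffled_b))
```

The same `value_counts()` call on `df_train['target']` should report 12,500 rows per class, matching the support column in the classification reports.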


### Vectorize Text and Targets

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
tfidf_vectorizer = TfidfVectorizer(encoding='utf-8',
                                   decode_error='strict',
                                   strip_accents=None,
                                   lowercase=False,
                                   preprocessor=None,
                                   tokenizer=None,
                                   stop_words='english',
                                   ngram_range=(1, 2),
                                   max_features=100000)

In [53]:
X_train = tfidf_vectorizer.fit_transform(df_train['text'])
X_test = tfidf_vectorizer.transform(df_test['text'])
y_train = df_train['target'].values
y_test = df_test['target'].values
# Shuffled copy of the test labels, scored later as a random-guess baseline
y_test_rand = df_test['target'].sample(frac=1).values

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(25000, 100000)
(25000, 100000)
(25000,)
(25000,)


### Baseline Ridge Model

In [54]:
from sklearn.linear_model import RidgeClassifierCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

In [55]:
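def print_baseline_performance(y_test, y_test_rand, y_hat, y_hat_proba=None):
    '''Print accuracy, confusion matrix, and classification report for hard predictions y_hat.

    ROC AUC is printed only when positive-class scores (y_hat_proba) are supplied.
    The accuracy label always reads "Ridge model", whichever model produced y_hat.
    '''
    print('Random baseline: {0:.3f}'.format(accuracy_score(y_test, y_test_rand)))
    print('Ridge model accuracy: {0:.3f}'.format(accuracy_score(y_test, y_hat)))
    if y_hat_proba is not None:
        print('ROC AUC: {0:.3f}'.format(roc_auc_score(y_test, y_hat_proba)))
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_hat))
    print('Report: \n')
    print(classification_report(y_test, y_hat))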

In [56]:
# `normalize` is left out: recent scikit-learn releases no longer accept it, and False was the default
ridge_model = RidgeClassifierCV(alphas=(0.1, 0.5, 2.0, 5.0, 10.0),
                                fit_intercept=True,
                                cv=5,
                                class_weight=None)

In [57]:
ridge_model.fit(X_train, y_train)

RidgeClassifierCV(alphas=(0.1, 0.5, 2.0, 5.0, 10.0), class_weight=None, cv=5,
         fit_intercept=True, normalize=False, scoring=None)

In [59]:
y_hat = ridge_model.predict(X_test)

In [60]:
print_baseline_performance(y_test, y_test_rand, y_hat)

Random baseline: 0.505
Ridge model accuracy: 0.873
Confusion Matrix:
[[11031  1469]
 [ 1702 10798]]
Report: 

             precision    recall  f1-score   support

          0       0.87      0.88      0.87     12500
          1       0.88      0.86      0.87     12500

avg / total       0.87      0.87      0.87     25000
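
The summary numbers in the report can be recomputed from the confusion matrix. The sketch below does that arithmetic for the positive class, assuming sklearn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels (0, 1), columns are predictions.
cm = np.array([[11031, 1469],
               [1702, 10798]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all reviews classified correctly
precision_pos = tp / (tp + fp)         # of reviews predicted positive, share truly positive
recall_pos = tp / (tp + fn)            # of truly positive reviews, share recovered
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print('accuracy: {0:.3f}'.format(accuracy))
print('precision (class 1): {0:.3f}'.format(precision_pos))
print('recall (class 1): {0:.3f}'.format(recall_pos))
print('f1 (class 1): {0:.3f}'.format(f1_pos))
```

Rounded to two decimals, these values agree with the class 1 row of the report (0.88, 0.86, 0.87), and the accuracy matches the 0.873 printed by the helper.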



### NB-SVM Model

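This model is inspired by the [Baselines and Bigrams](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf) paper, and this [Kaggle notebook](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline).  The implementation below fits a logistic regression, not an SVM, on features scaled by Naive Bayes log-count ratios.  On this dataset it reaches 0.888 test accuracy versus 0.873 for Ridge with TF-IDF, and it would be a good baseline on any real project.  Its results are printed with the same helper as before, so the accuracy line is still labeled "Ridge model accuracy".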

In [28]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from scipy import sparse

In [29]:
class NbSvmClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs
        self.coef_ = None

    def predict(self, x):
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict(x.multiply(self._r))

    def predict_proba(self, x):
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict_proba(x.multiply(self._r))

    def fit(self, x, y):
        # Validate shapes; sparse input is accepted because the TF-IDF matrices are sparse
        x, y = check_X_y(x, y, accept_sparse=True)

        def pr(x, y_i, y):
            '''Sum the feature columns over the rows of class y_i, add 1 for smoothing, and divide by (class count + 1)'''
            p = x[y==y_i].sum(0)
            return (p+1) / ((y==y_i).sum()+1)

        # Log-count ratio: positive where a feature carries more weight in class 1 than in class 0
        self._r = sparse.csr_matrix(np.log(pr(x,1,y) / pr(x,0,y)))
        # Scale every feature by its ratio, then fit logistic regression on the scaled matrix
        x_nb = x.multiply(self._r)
        self._clf = LogisticRegression(C=self.C, dual=self.dual, n_jobs=self.n_jobs).fit(x_nb, y)
        self.coef_ = self._clf.coef_
        return self
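
The log-count ratio computed in `fit` is easier to follow on a tiny dense example. The sketch below repeats the same arithmetic with NumPy only; `x_toy`, `y_toy`, and `smoothed_feature_mean` are hypothetical names introduced for illustration, and the real class works on sparse matrices instead.

```python
import numpy as np

# Hypothetical 4-document, 3-feature matrix (dense for readability).
x_toy = np.array([[1.0, 0.0, 2.0],
                  [2.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0],
                  [0.0, 1.0, 2.0]])
y_toy = np.array([1, 1, 0, 0])

def smoothed_feature_mean(x, y, label):
    """Per-feature column sum over one class plus 1, divided by (class size + 1)."""
    return (x[y == label].sum(axis=0) + 1) / ((y == label).sum() + 1)

# Same ratio as pr(x,1,y) / pr(x,0,y) in the class, followed by the log.
r_toy = np.log(smoothed_feature_mean(x_toy, y_toy, 1) /
               smoothed_feature_mean(x_toy, y_toy, 0))

# Element-wise scaling, the dense counterpart of x.multiply(self._r).
x_nb_toy = x_toy * r_toy
print(r_toy)
```

Feature 0 appears only in positive documents and gets a positive ratio, feature 1 appears only in negative documents and gets a negative ratio, and feature 2 has the same total in both classes, so its ratio is zero and the scaling removes it from the model's input.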

In [30]:
nbsvm_model = NbSvmClassifier(C=8.0)

In [31]:
nbsvm_model.fit(X_train, y_train)

NbSvmClassifier(C=8.0, dual=False, n_jobs=1)

In [32]:
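# Column 1 of predict_proba holds the probability of the positive class
y_hat_proba = nbsvm_model.predict_proba(X_test)[:,1]
# Threshold at 0.5 to get hard 0/1 labels for the accuracy and report metrics
y_hat = (y_hat_proba > 0.5).astype(int)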

In [33]:
print_baseline_performance(y_test, y_test_rand, y_hat, y_hat_proba)

Random baseline: 0.503
Ridge model accuracy: 0.888
ROC AUC: 0.956
Confusion Matrix:
[[11091  1409]
 [ 1392 11108]]
Report: 

             precision    recall  f1-score   support

          0       0.89      0.89      0.89     12500
          1       0.89      0.89      0.89     12500

avg / total       0.89      0.89      0.89     25000
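
No ROC AUC was printed for the ridge model because `RidgeClassifierCV` has no `predict_proba`. One option, sketched here on synthetic data rather than the IMDB features, is to pass the signed scores from `decision_function` to `roc_auc_score`, which only needs a ranking of the examples. The `make_classification` data and all `toy_` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifierCV
from sklearn.metrics import roc_auc_score

# Synthetic binary data standing in for the TF-IDF features (an assumption).
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=0)
toy_X_train, toy_X_test = toy_X[:150], toy_X[150:]
toy_y_train, toy_y_test = toy_y[:150], toy_y[150:]

toy_ridge = RidgeClassifierCV(alphas=(0.1, 1.0, 10.0), cv=5)
toy_ridge.fit(toy_X_train, toy_y_train)

# decision_function returns one signed score per sample; larger means more class 1.
toy_scores = toy_ridge.decision_function(toy_X_test)
toy_auc = roc_auc_score(toy_y_test, toy_scores)
print('ROC AUC: {0:.3f}'.format(toy_auc))
```

In the notebook this would amount to calling `print_baseline_performance(y_test, y_test_rand, y_hat, ridge_model.decision_function(X_test))` for the ridge predictions, which would make the two models comparable on ROC AUC as well as accuracy.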

