# Bag of Words model (Movie reviews)

- Convert a collection of text documents to a matrix of token counts

- This implementation produces a sparse representation of the counts using
scipy.sparse.csr_matrix.

1. Tokenize the text 
2. Identify the unique words, which form the vocabulary
3. Count occurrence of each unique word in text
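
The three steps above can be sketched in plain Python. This is a minimal illustration on two made-up sentences, not the scikit-learn implementation; the `tokenize` helper and the `docs` list are hypothetical, and the regular expression is chosen to mirror CountVectorizer's default token pattern (lowercased runs of two or more word characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: lowercase, then keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical toy documents, not taken from the review data
docs = ["the movie was good", "the movie was bad , really bad"]
tokenized = [tokenize(d) for d in docs]

# Step 2: the sorted set of unique tokens is the vocabulary
vocab = sorted({tok for toks in tokenized for tok in toks})

# Step 3: one row per document, one column per vocabulary word
counts = []
for toks in tokenized:
    c = Counter(toks)
    counts.append([c[w] for w in vocab])

print(vocab)
print(counts)
```

Each row of `counts` lines up with `vocab`, so word order within a document is discarded and only the frequencies remain.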

Scikit-learn provides a class called CountVectorizer, which implements the BoW model.
Fitting it on a list of phrases builds the vocabulary; transforming the phrases then produces the count matrix.

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Read data

In [3]:
df = pd.read_csv('../moviereviews.tsv', sep='\t')
df.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [4]:
#np.bincount(df.label)
df['label'].value_counts()

neg    1000
pos    1000
Name: label, dtype: int64

In [5]:
df.isnull().sum()

label      0
review    35
dtype: int64

In [6]:
df = df.dropna()

In [7]:
df.shape

(1965, 2)

## Split data into train and test

In [38]:
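def split_data(data, y, length, split_mark=0.8):
    """Split data and y in their existing order (no shuffling) into train and test parts."""
    if 0.0 < split_mark < 1.0:
        # Fractional split_mark: proportion of rows used for training
        n = int(split_mark * length)
    else:
        # Otherwise split_mark is an absolute number of training rows
        n = int(split_mark)
    xtrain = data[:n].copy()
    xtest = data[n:].copy()
    ytrain = y[:n].copy()
    ytest = y[n:].copy()
    return xtrain, xtest, ytrain, ytest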

In [39]:
xtrain, xtest, ytrain, ytest = split_data(df.review, df.label, len(df))
print(xtrain.shape, xtest.shape)

(1572,) (393,)
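
The split above slices the rows in file order. If a file happened to be sorted by label, that could leave one class under-represented in the test set. A possible alternative, shown here only as a sketch, is scikit-learn's `train_test_split` with shuffling and stratification. The `toy` frame below is hypothetical stand-in data that reuses the `review` and `label` column names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df, with the same column names
toy = pd.DataFrame({
    "review": ["bad film", "awful plot", "dull and slow", "poor acting", "weak script",
               "great film", "lovely plot", "fun and fast", "fine acting", "sharp script"],
    "label": ["neg"] * 5 + ["pos"] * 5,
})

# Shuffle before splitting and keep the class proportions in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    toy["review"], toy["label"],
    test_size=0.2, shuffle=True, stratify=toy["label"], random_state=0)

print(x_tr.shape, x_te.shape)
print(y_te.value_counts())
```

Fixing `random_state` makes the shuffle reproducible between runs.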


In [40]:
xtrain

0       how do films like mouse hunt get into theatres...
1       some talented actresses are blessed with a dem...
2       this has been an extraordinary year for austra...
3       according to hollywood movies made in last few...
4       my first press screening of 1998 and already i...
                              ...                        
1598    usually when one is debating who the modern qu...
1599    aliens ! ! well , that is what this movie is a...
1601     " mission to mars " is one of those annoying ...
1602    martin scorsese's triumphant adaptation of edi...
1603    like the great musical pieces of mozart himsel...
Name: review, Length: 1572, dtype: object

Let's build the model two ways: first as separate steps, then with a Pipeline object.

## Separate Steps

### Feature Extraction

In [41]:
vectorizer = CountVectorizer()

In [43]:
xtrain_bow = vectorizer.fit_transform(xtrain)
xtest_bow = vectorizer.transform(xtest)

In [44]:
# 35629 unique words in the training vocabulary
xtrain_bow.shape

(1572, 35629)

In [50]:
xtrain_bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
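
Calling `toarray()` on the full training matrix materialises 1572 × 35629 cells, about 56 million values or roughly 448 MB at int64, almost all of them zero. The sketch below uses a small hand-made matrix (hypothetical values) to show how the sparse form can be inspected and summed without densifying it.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document, 5-word count matrix
dense = np.array([[0, 1, 0, 0, 2],
                  [0, 0, 0, 0, 0],
                  [3, 0, 0, 1, 0]])
sparse = csr_matrix(dense)

# nnz is the number of stored (non-zero) entries
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)

# Column sums (total count per word) work directly on the sparse matrix
col_totals = np.asarray(sparse.sum(axis=0)).ravel()
print(col_totals)
```

The same `nnz` and `sum(axis=0)` calls apply to the matrix returned by `fit_transform`, since it is also a SciPy CSR matrix.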

In [51]:
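# get_feature_names() was removed in scikit-learn 1.2; on 1.0+ use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
feature_names[19500:19520].tolist()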

['mastering',
 'masterless',
 'mastermind',
 'masterminded',
 'masterminds',
 'masterpeice',
 'masterpiece',
 'masterpieces',
 'masters',
 'masterson',
 'masterwork',
 'mastery',
 'mastrantonio',
 'masturbates',
 'masturbation',
 'masturbatory',
 'masur',
 'mat',
 'matador',
 'matarazzo']

### Model Training & Prediction

In [54]:
parameters = {'C':[0.001,0.01,0.1,1,10], 'max_iter':[50, 75, 100, 150]}
gs_clf = GridSearchCV(estimator=LogisticRegression(),param_grid=parameters, n_jobs=-1, cv=5)

In [55]:
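# gs_clf is itself a GridSearchCV, so this is nested cross-validation:
# the inner 5-fold search tunes C and max_iter, the outer 5 folds score the result
scores = cross_val_score(estimator=gs_clf, X=xtrain_bow, y=ytrain, cv=5)
print("Mean score: {:.2f}".format(np.mean(scores)))

Mean score: 0.83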


In [67]:
log_model = gs_clf.fit(xtrain_bow, ytrain)

In [69]:
preds = log_model.predict(xtest_bow)

In [70]:
accuracy_score(ytest, preds)

0.8371501272264631

### Single data prediction

In [71]:
raw = df['review'][2]
data = vectorizer.transform([raw])

In [72]:
# 35629 unique words as features
data.shape

(1, 35629)

In [73]:
log_model.predict(data)

array(['pos'], dtype=object)

In [74]:
df['label'][2]

'pos'

Now, let's combine all these steps into a single Pipeline object.

## Pipeline Methodology

In [11]:
# CountVectorizer stores raw counts; it does not normalise them as count(word) / total words in doc

t = Pipeline([
    ('vect',CountVectorizer()),
    ('clf',LogisticRegression())])

In [12]:
t.fit(xtrain, ytrain)

Pipeline(steps=[('vect', CountVectorizer()), ('clf', LogisticRegression())])

In [14]:
#np.mean(preds==ytest)
t.score(xtest,ytest)

0.8269720101781171

### Hyper-parameter Tuning

In [20]:
pipe = Pipeline([
    ('vect',CountVectorizer()),
    ('gs_clf',gs_clf)])
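# Pipeline.fit returns the pipeline itself, so gs_clf now refers to the whole
# fitted pipeline (vectorizer + grid search), not only the GridSearchCV object
gs_clf = pipe.fit(xtrain,ytrain)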

In [23]:
gs_clf.score(xtest,ytest)

0.8371501272264631
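
In the pipeline above, the grid search sits inside the pipeline, so the vectorizer is fitted once on all training text before the search makes its internal folds. A common alternative, sketched here on hypothetical toy data, is to wrap the whole pipeline in `GridSearchCV` and address the classifier's parameters with the `step__param` naming convention, so the vocabulary is rebuilt inside each fold. The names `text_clf`, `search`, `texts` and `labels` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, six examples per class
texts = ["bad film", "awful plot", "dull and slow", "poor acting", "weak script", "boring film",
         "great film", "lovely plot", "fun and fast", "fine acting", "sharp script", "superb film"]
labels = ["neg"] * 6 + ["pos"] * 6

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])

# "clf__C" means: parameter C of the pipeline step named "clf"
param_grid = {"clf__C": [0.1, 1, 10]}
search = GridSearchCV(text_clf, param_grid=param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["great film"]))
```

With this arrangement, vectorizer options such as `vect__ngram_range` could be tuned in the same grid.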

## Model Evaluation

In [None]:
preds = gs_clf.predict(xtest)

In [26]:
accuracy_score(ytest, preds)

0.8371501272264631

In [27]:
print(classification_report(ytest, preds))

              precision    recall  f1-score   support

         neg       0.83      0.86      0.84       201
         pos       0.84      0.82      0.83       192

    accuracy                           0.84       393
   macro avg       0.84      0.84      0.84       393
weighted avg       0.84      0.84      0.84       393
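
The per-class columns in the report are derived from counts of true positives, false positives and false negatives. The hand-rolled sketch below shows the arithmetic on made-up toy labels (not the review data); the `prf` helper is hypothetical and assumes the class appears at least once in both the true and the predicted labels.

```python
# Hypothetical toy labels
y_true = ["neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg"]
y_pred = ["neg", "neg", "pos", "pos", "pos", "neg", "pos", "neg"]

def prf(y_true, y_pred, cls):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted cls
    fp = sum(t != cls and p == cls for t, p in pairs)  # predicted cls, actually other
    fn = sum(t == cls and p != cls for t, p in pairs)  # actually cls, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls in ("neg", "pos"):
    print(cls, prf(y_true, y_pred, cls))
```

Support in the report is simply the number of true examples of each class, `tp + fn` in the notation above.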



### Raw data prediction

We haven't applied any separate text preprocessing, so the pipeline model can predict directly on raw text.

In [29]:
pipe.predict([" Not best"])

array(['neg'], dtype=object)
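
Raw strings can be passed straight to the pipeline because CountVectorizer does its own lowercasing and tokenizing. As a small sketch, `build_analyzer()` exposes that step so the tokens produced for an input can be inspected; a fresh default vectorizer is used here, so no fitted model is needed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer is the callable CountVectorizer applies to each raw document
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer(" Not best")
print(tokens)
```

With the default unigram settings, "not" and "best" are counted as independent features, so the model has no feature for the phrase as a whole.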

# END