# Assignment - Sentiment Analysis of Movie Reviews

![](https://i.imgur.com/6Wfmf2S.png)

> **Problem Statement**: Apply the TF-IDF technique to train ML models for sentiment analysis using data from the "[Sentiment Analysis on Movie Reviews](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews)" Kaggle competition.


Outline:

1. Download and Explore Dataset
2. Implement the TF-IDF Technique
3. Train baseline model & submit to Kaggle
4. Train & finetune different ML models
5. Document & submit your notebook


Dataset: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews


## Download and Explore the Data

Outline:

1. Download Dataset from Kaggle
2. Explore and visualize data

### Download Dataset from Kaggle

- Read the "Description", "Evaluation" and "Data" sections on the Kaggle competition page carefully
- Download the `kaggle.json` file from your [Kaggle account](https://kaggle.com/me/account) and upload it to Colab
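Before starting the download, it can help to confirm that the uploaded token file is readable. The sketch below assumes the file is named `kaggle.json` and holds a JSON object with `username` and `key` fields; `read_kaggle_credentials` is a hypothetical helper written for this notebook, not part of `opendatasets`.

```python
import json
from pathlib import Path


def read_kaggle_credentials(path="kaggle.json"):
    """Return (username, key) from a Kaggle API token file, or raise a clear error."""
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(f"{token_path} not found; upload it before downloading")
    creds = json.loads(token_path.read_text())
    # The token file is expected to carry exactly these two fields.
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise KeyError(f"kaggle.json is missing fields: {sorted(missing)}")
    return creds["username"], creds["key"]


# Example usage (assumes kaggle.json sits in the working directory):
# username, _ = read_kaggle_credentials()
# print("Credentials found for", username)
```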

In [9]:
!pip install opendatasets --quiet
!pip install pandas --quiet

In [10]:
import opendatasets as od
od.download("https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: sanketec87
Your Kaggle Key: ··········
Downloading sentiment-analysis-on-movie-reviews.zip to ./sentiment-analysis-on-movie-reviews


100%|██████████| 1.90M/1.90M [00:00<00:00, 142MB/s]


Extracting archive ./sentiment-analysis-on-movie-reviews/sentiment-analysis-on-movie-reviews.zip to ./sentiment-analysis-on-movie-reviews





In [11]:
import zipfile

with zipfile.ZipFile("/content/sentiment-analysis-on-movie-reviews/train.tsv.zip", 'r') as zip_ref:
   zip_ref.extractall("/content/sentiment-analysis-movie-review")

In [12]:
with zipfile.ZipFile("/content/sentiment-analysis-on-movie-reviews/test.tsv.zip", 'r') as zip_ref:
   zip_ref.extractall("/content/sentiment-analysis-movie-review")

In [13]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

### Explore and Visualize Data

* Load the train, test, and submission files using Pandas
* Explore rows, columns, sample values etc.
* Visualize distribution of target columns

In [21]:
train_fname = "/content/sentiment-analysis-movie-review/train.tsv"
test_fname = "/content/sentiment-analysis-movie-review/test.tsv"
sub_fname = "/content/sentiment-analysis-on-movie-reviews/sampleSubmission.csv"

In [15]:
raw_df = pd.read_csv(train_fname,sep='\t')

In [16]:
raw_df.head(2)

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |


In [17]:
raw_df.Phrase.sample(5)

76382    Intensely romantic , thought-provoking and eve...
37837    a mere plot pawn for two directors with far le...
46313    re-assess the basis for our lives and evaluate...
94045    the most part a useless movie , even with a gr...
28776                            run-of-the-mill revulsion
Name: Phrase, dtype: object

In [18]:
test_df = pd.read_csv(test_fname,sep='\t')

In [19]:
test_df.sample(10)

| | PhraseId | SentenceId | Phrase |
|---|---|---|---|
| 63324 | 219385 | 11687 | from a large group of your relatives |
| 7519 | 163580 | 8842 | has made from its other animated TV series |
| 25570 | 181631 | 9709 | of all slashers |
| 60058 | 216119 | 11513 | crafted and |
| 44783 | 200844 | 10682 | CliffsNotes version |
| 33531 | 189592 | 10107 | penetrates with a rawness that that is both un... |
| 24507 | 180568 | 9656 | For starters , the story is just too slim . |
| 10307 | 166368 | 8971 | The film hinges on its performances , and both... |
| 32795 | 188856 | 10069 | a patch somewhere between mirthless Todd Solon... |
| 36132 | 192193 | 10230 | ultimately pulls up lame |


In [22]:
sub_df = pd.read_csv(sub_fname)

In [23]:
sub_df.sample(10)

| | PhraseId | Sentiment |
|---|---|---|
| 65724 | 221785 | 2 |
| 31415 | 187476 | 2 |
| 49406 | 205467 | 2 |
| 16303 | 172364 | 2 |
| 32781 | 188842 | 2 |
| 18479 | 174540 | 2 |
| 53932 | 209993 | 2 |
| 44370 | 200431 | 2 |
| 44136 | 200197 | 2 |
| 45522 | 201583 | 2 |


In [24]:
raw_df.Sentiment.value_counts(normalize=False,sort=True).sort_index()

0     7072
1    27273
2    79582
3    32927
4     9206
Name: Sentiment, dtype: int64

In [25]:
fig = px.bar(
    raw_df.Sentiment.value_counts(normalize=False, sort=True).sort_index(),
    title="Sentiment Distribution Train Data",
    labels={"index": "Sentiment Class", "value": "Counts"},
)
fig.show()

In [26]:
# Count of very short phrases (length < 2, i.e. a single character)
raw_df['Phrase'][raw_df['Phrase'].apply(len) < 2].reset_index(drop=True).describe()

count     45
unique    45
top        A
freq       1
Name: Phrase, dtype: object

Summarize your insights and learnings from the dataset below:

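* The sentiment classes are imbalanced: the distribution is roughly bell-shaped and centered on class 2, which accounts for about half of the 156,060 training phrases.
* 45 phrases are a single character long (length < 2).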
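To put numbers on the class imbalance, the sketch below recomputes class shares from the `value_counts()` output shown earlier. It also derives the accuracy of a trivial model that always predicts the most frequent class, which is a useful floor when judging later models. The counts are copied by hand from that output rather than read from `raw_df`.

```python
# Class counts copied from the value_counts() output above.
sentiment_counts = {0: 7072, 1: 27273, 2: 79582, 3: 32927, 4: 9206}

total = sum(sentiment_counts.values())
proportions = {label: count / total for label, count in sentiment_counts.items()}

# Accuracy (on the training data) of always predicting the most frequent class.
majority_class = max(sentiment_counts, key=sentiment_counts.get)
majority_baseline = proportions[majority_class]

for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```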

## Implement TF-IDF Technique

![](https://i.imgur.com/5VbUPup.png)

Outline:

1. Learn the vocabulary using `TfidfVectorizer`
2. Transform training and test data
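As a rough illustration of the weighting involved, the sketch below computes TF-IDF by hand on a made-up three-document corpus. It uses the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is how scikit-learn documents its default `TfidfVectorizer` settings. The corpus and the helper `tfidf_vector` are illustrative assumptions, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset).
docs = [
    "good movie",
    "bad movie",
    "good good fun",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}


def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


print(vocabulary)
for doc, tokens in zip(docs, tokenized):
    print(doc, [round(w, 3) for w in tfidf_vector(tokens)])
```

Terms that occur in fewer documents ("bad", "fun") receive a higher IDF than terms shared across documents ("good", "movie"), so they dominate the vector of any phrase containing them.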

### Learn Vocabulary using `TfidfVectorizer`

* Create custom tokenizer with stemming
* Create a list of stop words
* Configure and create `TfidfVectorizer`
* Learn vocabulary from training set
* View sample entries from vocabulary

In [27]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [28]:
word_tokenize("Hello This is Word Tokenizer")

['Hello', 'This', 'is', 'Word', 'Tokenizer']

In [29]:
stemmer = SnowballStemmer(language='english')

In [30]:
stemmer.stem("working")

'work'

In [31]:
def tokenize(text):
  return [stemmer.stem(token) for token in word_tokenize(text) if token.isalpha()]

In [32]:
tokenize("This is great example of working; life loving it ?")

['this', 'is', 'great', 'exampl', 'of', 'work', 'life', 'love', 'it']

In [33]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [34]:
english_stopwords = stopwords.words('english')

In [35]:
", ".join(english_stopwords)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [36]:
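# NLTK's English list has fewer than 709 entries, so this slice keeps every stop word.
# Note: the tokenizer stems tokens but these stop words are unstemmed, so a few
# (e.g. "because", which stems to "becaus") may slip through the filter.
selected_stopwords = english_stopwords[:709]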

In [38]:
vectorizer = TfidfVectorizer(tokenizer=tokenize,
                             stop_words=selected_stopwords,
                             ngram_range=(1, 2),
                             max_features=4000)

In [39]:
vectorizer.fit(raw_df.Phrase)

TfidfVectorizer(max_features=4000, ngram_range=(1, 2),
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function tokenize at 0x7fda06134680>)

In [40]:
vectorizer.get_feature_names_out()[:200]

array(['abandon', 'abil', 'abl', 'abov', 'absolut', 'absorb', 'abstract',
       'absurd', 'abund', 'abus', 'academi', 'academi award', 'accent',
       'accept', 'access', 'acclaim', 'accompani', 'accomplish',
       'account', 'accumul', 'accur', 'ach', 'achiev', 'acknowledg',
       'acquir', 'across', 'act', 'act like', 'action', 'action film',
       'action flick', 'action hero', 'action movi', 'action sequenc',
       'activ', 'actor', 'actress', 'actual', 'ad', 'adam',
       'adam sandler', 'adapt', 'add', 'addict', 'addit', 'address',
       'adequ', 'adher', 'admir', 'admit', 'adolesc', 'ador', 'adrenalin',
       'adult', 'advanc', 'advantag', 'adventur', 'advic', 'aesthet',
       'affair', 'affect', 'affirm', 'afraid', 'african', 'afternoon',
       'age', 'agent', 'aggress', 'ago', 'ah', 'ahead', 'ai', 'aid',
       'aim', 'aimless', 'air', 'aisl', 'alabama', 'albeit', 'album',
       'alert', 'alic', 'alien', 'aliv', 'alleg', 'allegori', 'allen',
       'allow', 'allow 

In [41]:
inputs = vectorizer.transform(raw_df.Phrase)

In [42]:
inputs.shape

(156060, 4000)

In [43]:
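# Densify only the slice being inspected; inputs.toarray() would allocate
# the full 156060 x 4000 dense matrix (about 5 GB of float64 values).
inputs[0, :100].toarray()[0]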

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

### Transform Training & Test Data

* Transform phrases from training set
* Transform phrases from test set
* Look at some example values

In [44]:
test_df.sample(5)

| | PhraseId | SentenceId | Phrase |
|---|---|---|---|
| 7384 | 163445 | 8834 | monument |
| 54928 | 210989 | 11221 | High on melodrama . |
| 43749 | 199810 | 10629 | Baran is n't the most transporting or gripping... |
| 47938 | 203999 | 10849 | of hearts |
| 1793 | 157854 | 8604 | a movie |


In [45]:
test_inputs = vectorizer.transform(test_df.Phrase)

In [46]:
test_inputs.shape

(66292, 4000)

## Train Baseline Model & Submit to Kaggle

1. Split training and validation sets
2. Train logistic regression model
3. Study predictions on sample phrases
4. Make predictions and submit to Kaggle




### Split Training and Validation Sets

Tip: Don't use a random sample for the validation set (why?)

In [47]:
train_size = 110_000
train_inputs = inputs[:train_size]
train_targets = raw_df.Sentiment[:train_size]

In [48]:
train_inputs.shape,train_targets.shape

((110000, 4000), (110000,))

In [49]:
val_inputs = inputs[train_size:]
val_targets = raw_df.Sentiment[train_size:]

In [50]:
val_inputs.shape,val_targets.shape

((46060, 4000), (46060,))
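One likely reason for the tip above: many phrases share a `SentenceId` because they are fragments of the same sentence, so a random split would put near-duplicates of training phrases into the validation set and inflate the score. The sketch below illustrates the effect on a toy list of sentence ids. `shared_groups` is a hypothetical helper, and the toy ids stand in for `raw_df.SentenceId`, which appears to be in ascending order in the training file.

```python
import random


def shared_groups(group_ids, train_size):
    """Return the group ids that appear on both sides of a positional split."""
    train_groups = set(group_ids[:train_size])
    val_groups = set(group_ids[train_size:])
    return train_groups & val_groups


# Toy stand-in for raw_df.SentenceId: each sentence contributes several phrases.
sentence_ids = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4]

# Positional split on ordered data: at most one sentence straddles the cut.
print("ordered split overlap:", shared_groups(sentence_ids, train_size=6))

# Random-sample split: phrases from the same sentence can land on both sides.
shuffled = sentence_ids[:]
random.Random(0).shuffle(shuffled)
print("shuffled split overlap:", shared_groups(shuffled, train_size=6))
```

With the real data, the same check would be `shared_groups(list(raw_df.SentenceId), train_size)`.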

### Train Logistic Regression Model



In [51]:
model = LogisticRegression()

In [52]:
model.fit(train_inputs,train_targets)

LogisticRegression()

In [53]:
train_preds = model.predict(train_inputs)

In [54]:
train_targets

0         1
1         2
2         2
3         2
4         2
         ..
109995    1
109996    0
109997    1
109998    0
109999    2
Name: Sentiment, Length: 110000, dtype: int64

In [55]:
train_preds

array([3, 2, 2, ..., 0, 0, 2])

In [56]:
accuracy_score(train_targets,train_preds)

0.6546909090909091

In [57]:
val_preds = model.predict(val_inputs)

In [58]:
accuracy_score(val_targets,val_preds)

0.5790490664350847
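A single accuracy number hides which classes the model confuses. Since class 2 makes up about half of the training phrases, a confusion matrix and per-class recall can show whether the model mostly falls back on that class. The sketch below uses small made-up label lists in place of `val_targets` and `val_preds`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels standing in for val_targets and val_preds.
y_true = [0, 1, 2, 2, 2, 3, 4, 2, 1, 3]
y_pred = [1, 2, 2, 2, 2, 2, 3, 2, 1, 3]

labels = [0, 1, 2, 3, 4]
cm = confusion_matrix(y_true, y_pred, labels=labels)

print("accuracy:", accuracy_score(y_true, y_pred))
print(cm)  # rows are true classes, columns are predicted classes

# Per-class recall: correct predictions divided by the number of true examples.
for label, row in zip(labels, cm):
    recall = row[label] / row.sum() if row.sum() else 0.0
    print(f"class {label}: recall {recall:.2f}")
```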

### Study Predictions on Sample Inputs

In [59]:
small_df = raw_df.sample(20)

In [60]:
small_df

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 18871 | 18872 | 828 | like a year late | 2 |
| 126562 | 126563 | 6805 | snail-like | 1 |
| 29270 | 29271 | 1356 | Does a good job of establishing a time and pla... | 3 |
| 123332 | 123333 | 6619 | much funnier than anything | 3 |
| 142398 | 142399 | 7727 | outtakes in which most of the characters forge... | 1 |
| 36936 | 36937 | 1750 | , Treasure Planet is truly gorgeous to behold . | 4 |
| 98890 | 98891 | 5187 | gaping enough to pilot an entire Olympic swim ... | 2 |
| 113357 | 113358 | 6023 | sucking | 1 |
| 109525 | 109526 | 5801 | artistic merits | 3 |
| 117957 | 117958 | 6301 | A movie so bad that it quickly enters the pant... | 2 |


In [61]:
small_inputs = vectorizer.transform(small_df.Phrase)

In [62]:
small_inputs.shape

(20, 4000)

In [63]:
small_preds = model.predict(small_inputs)

In [64]:
small_preds

array([2, 2, 3, 2, 2, 4, 2, 2, 2, 0, 3, 3, 2, 1, 1, 2, 2, 2, 4, 2])

### Make Predictions & Submit to Kaggle

1. Make predictions on Test Dataset
2. Generate & submit CSV on Kaggle
3. Add screenshot of your score 



In [65]:
test_df.sample(5)

| | PhraseId | SentenceId | Phrase |
|---|---|---|---|
| 27169 | 183230 | 9782 | charged with the impossible task of making the... |
| 46560 | 202621 | 10773 | at times , and lots of fun |
| 55212 | 211273 | 11239 | to two completely different |
| 42556 | 198617 | 10561 | just another kung-fu sci-fi movie with silly a... |
| 60386 | 216447 | 11532 | makes up for in compassion |


In [66]:
test_inputs

<66292x4000 sparse matrix of type '<class 'numpy.float64'>'
	with 197760 stored elements in Compressed Sparse Row format>

In [67]:
test_preds = model.predict(test_inputs)

In [68]:
test_preds

array([3, 3, 2, ..., 2, 2, 1])

In [69]:
sub_df.Sentiment = test_preds

In [70]:
sub_df.Sentiment

0        3
1        3
2        2
3        3
4        2
        ..
66287    1
66288    1
66289    2
66290    2
66291    1
Name: Sentiment, Length: 66292, dtype: int64

In [71]:
sub_df.to_csv('/content/sentiment-analysis-movie-review/submission.csv',index=None)

In [72]:
!head '/content/sentiment-analysis-movie-review/submission.csv'

PhraseId,Sentiment
156061,3
156062,3
156063,2
156064,3
156065,2
156066,3
156067,3
156068,2
156069,3


## Train & Finetune Different ML Models

Train & finetune at least 2 other types of models

Models to try:
- Decision Trees
- Random Forests
- Gradient Boosting
- Naive Bayes
- SVM

Optional: 
* Use PCA for dimensionality reduction
* Compare classification vs regression models
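For the dimensionality-reduction option, note that the TF-IDF features are stored as a sparse matrix, and depending on the scikit-learn version `PCA` may not accept sparse input. `TruncatedSVD` is a closely related technique that works on sparse matrices directly, so the sketch below swaps it in for PCA. The random matrix stands in for `inputs`, and `n_components=10` is an arbitrary illustrative value, not a tuned one.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for the TF-IDF matrix `inputs`.
X_sparse = sparse_random(200, 50, density=0.05, format="csr", random_state=0)

# n_components=10 is an arbitrary illustrative choice.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(X_reduced.shape)
print("explained variance kept:", svd.explained_variance_ratio_.sum())
```

In the notebook, the fitted `svd` would be applied with `svd.transform(...)` to the validation and test matrices so that all splits share the same projection.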


Tips: 

- If using a regression model, make sure to round predictions to the nearest integer and clip them to the range `[0, 4]` (the sentiment labels in this dataset)
- Track your progress in a copy of [this experiment tracking spreadsheet](https://docs.google.com/spreadsheets/d/1X-tifxAOAYeIA2J32hBGP5B0MPnZy_o-zOz1NbS-1Ig/edit?usp=sharing)
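The rounding-and-clipping tip can be sketched as follows. The `Sentiment` column in this dataset takes the values 0 to 4 (see the value counts earlier), so the sketch clips to that range by default. `to_sentiment_labels` is a hypothetical helper name, and the input values are made up.

```python
import numpy as np


def to_sentiment_labels(raw_predictions, low=0, high=4):
    """Round continuous predictions and clip them to the valid label range."""
    rounded = np.rint(np.asarray(raw_predictions, dtype=float))
    return np.clip(rounded, low, high).astype(int)


# Made-up regression outputs, including values outside the label range.
print(to_sentiment_labels([-0.7, 1.2, 2.4, 3.6, 5.3]))
```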


### Model 1

In [73]:
# Naive Bayes
from sklearn.naive_bayes import MultinomialNB

In [74]:
model_nb = MultinomialNB()

In [75]:
model_nb.fit(train_inputs,train_targets)

MultinomialNB()

In [76]:
accuracy_score(train_targets,model_nb.predict(train_inputs))

0.6019727272727273

### Model 2

In [77]:
from sklearn.ensemble import RandomForestClassifier

In [78]:
model_rf = RandomForestClassifier()

In [79]:
model_rf.fit(train_inputs,train_targets)

RandomForestClassifier()

In [80]:
accuracy_score(train_targets,model_rf.predict(train_inputs))

0.8066818181818182
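The scores for Naive Bayes and Random Forest above are measured on the training data only, and a high training accuracy can simply reflect overfitting. A small helper that reports both training and validation accuracy makes models comparable with the baseline. The sketch below is an assumption-laden illustration: `train_val_accuracy` is a hypothetical helper, and synthetic data stands in for `train_inputs`, `train_targets`, `val_inputs` and `val_targets`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def train_val_accuracy(model, X_train, y_train, X_val, y_val):
    """Fit the model and report accuracy on both the training and validation data."""
    model.fit(X_train, y_train)
    return {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "val": accuracy_score(y_val, model.predict(X_val)),
    }


# Synthetic stand-in data; in the notebook, pass the TF-IDF splits instead.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, y_train, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

scores = train_val_accuracy(RandomForestClassifier(n_estimators=50, random_state=0),
                            X_train, y_train, X_val, y_val)
print(scores)
```

A large gap between the two numbers suggests limiting model capacity, for example through `max_depth` on the forest, before submitting.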

In [81]:
test_preds = model_rf.predict(test_inputs)

In [82]:
test_preds

array([1, 1, 2, ..., 2, 2, 1])

In [83]:
sub_df.Sentiment = test_preds

In [84]:
sub_df.Sentiment

0        1
1        1
2        2
3        1
4        3
        ..
66287    1
66288    1
66289    2
66290    2
66291    1
Name: Sentiment, Length: 66292, dtype: int64

In [87]:
sub_df.to_csv('/content/sentiment-analysis-movie-review/submission.csv',index=None)

### Model 3

Best Model:

??? 

(include Kaggle score screenshot)

## Submission and Future Work


How to make a submission:

- Add documentation and explanations
- Clean up any stray code/comments
- Include a screenshot of your best score
- Make a submission on the assignment page
- Review evaluation criteria carefully


Future work:
- Try more machine learning models
- Try configuring `TfidfVectorizer` differently
- Try approaches other than bag of words
