<a href="https://colab.research.google.com/github/rahiakela/machine-learning-research-and-practice/blob/main/grokking-machine-learning/06-logistic-regression/02_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Sentiment analysis using IMDB

In this notebook, we apply the logistic classifier to a real-life task: sentiment analysis.

We use Turi Create to build a model that classifies movie reviews from the popular IMDB site as positive or negative.

##Setup

In [None]:
!pip -q install turicreate

In [2]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

import turicreate as tc

random.seed(0)

In [None]:
!wget https://github.com/luisguiserrano/manning/raw/master/Chapter_6_Logistic_Regression/IMDB_Dataset.csv

##Defining dataset

First, let's load the dataset from the CSV file into an SFrame.

In [4]:
movies = tc.SFrame("IMDB_Dataset.csv")
movies

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


review,sentiment
One of the other reviewers has mentioned ...,positive
A wonderful little production. <br /><br ...,positive
I thought this was a wonderful way to spend ...,positive
Basically there's a family where a little ...,negative
"Petter Mattei's ""Love in the Time of Money"" is a ...",positive
"Probably my all-time favorite movie, a story ...",positive
I sure would like to see a resurrection of a up ...,positive
"This show was an amazing, fresh & innovative idea ...",negative
Encouraged by the positive comments about ...,negative
If you like original gut wrenching laughter you ...,positive


In [5]:
# add a new column called "words" holding, for each review, a dictionary that maps each word to its count
movies["words"] = tc.text_analytics.count_words(movies["review"])
movies

review,sentiment,words
One of the other reviewers has mentioned ...,positive,"{'darker': 1.0, 'touch': 1.0, 'thats': 1.0, ..."
A wonderful little production. <br /><br ...,positive,"{'done': 1.0, 'surface': 1.0, 'every': 1.0, ..."
I thought this was a wonderful way to spend ...,positive,"{'go': 1.0, 'superman': 1.0, 'interesting': 1.0, ..."
Basically there's a family where a little ...,negative,"{'them': 1.0, 'ignore': 1.0, 'dialogs': 1.0, ..."
"Petter Mattei's ""Love in the Time of Money"" is a ...",positive,"{'work': 1.0, 'for': 1.0, 'anxiously': 1.0, ..."
"Probably my all-time favorite movie, a story ...",positive,"{'for': 1.0, 'dozen': 1.0, 'i': 1.0, 'if': ..."
I sure would like to see a resurrection of a up ...,positive,"{'do': 1.0, 'go': 1.0, 'must': 1.0, 'new': 1.0, ..."
"This show was an amazing, fresh & innovative idea ...",negative,"{'awful': 1.0, 'just': 1.0, 'huge': 1.0, 'a': ..."
Encouraged by the positive comments about ...,negative,"{'effort': 1.0, 'an': 1.0, 'making': 1.0, ..."
If you like original gut wrenching laughter you ...,positive,"{'camp': 1.0, 'great': 1.0, 'br': 2.0, 'movie': ..."
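
To make the `words` column concrete, here is a minimal sketch of a bag-of-words count in plain Python. It is only an approximation: the helper name `count_words_sketch`, the regular expression, and the sample sentence are assumptions made for illustration, and Turi Create's own tokenization rules may differ in detail.

```python
import re
from collections import Counter


def count_words_sketch(text):
    """Approximate bag-of-words: lowercase the text, then count alphanumeric tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(tokens))


# Illustrative sentence (not taken from the dataset), with the kind of
# leftover HTML line breaks that appear in the reviews.
review = "A wonderful little production. <br /><br />A wonderful cast."
counts = count_words_sketch(review)
print(counts)
```

Note that the leftover `<br />` tags are counted as the word `br`, which matches the `'br': 2.0` entry visible in the table above.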


We are ready to train our model!

In [6]:
model = tc.logistic_classifier.create(movies, features=["words"], target="sentiment")

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [7]:
model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 101631
Number of examples             : 47500
Number of classes              : 2
Number of feature columns      : 1
Number of unpacked features    : 101630

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 10
Solver status                  : Completed (Iteration limit reached).
Training time (sec)            : 4.1963

Settings
--------
Log-likelihood                 : 307.5779

Highest Positive Coefficients
-----------------------------
words[retrained]               : 32.3741
words[cozzie]                  : 32.3741
words[publican]                : 32.3741
words[phworr]                  : 32.3741
words[workplaces]              : 32.3741

Lowest Negative Coefficients
----------------------------
words[unfortuntately]         

Now we can look at the weight the model assigned to each word, using the model's `coefficients` attribute.

In [8]:
weights = model.coefficients
weights 

name,index,class,value,stderr
(intercept),,positive,0.0759701927312501,
words,darker,positive,0.6460312144900954,
words,touch,positive,0.2197435271262215,
words,thats,positive,-0.5175719504167362,
words,your,positive,-0.0065758127151575,
words,viewing,positive,0.1305325786326617,
words,their,positive,0.0101980304680989,
words,into,positive,-0.007736536977814,
words,turned,positive,-0.3201473671770852,
words,being,positive,-0.0042587608548417,


In [9]:
# sort in ascending order: the most negative weights come first
weights.sort("value")

name,index,class,value,stderr
words,unfortuntately,positive,-18.85498417710016,
words,sierre,positive,-18.399875454021643,
words,toyshop,positive,-18.28146314562758,
words,newlwed,positive,-18.04209639210535,
words,lodz,positive,-17.558567713153312,
words,bsed,positive,-16.991589574585284,
words,phantasmagoria,positive,-16.520585161600895,
words,taylors,positive,-16.316050931116973,
words,carribean,positive,-16.042933780103915,
words,graziano,positive,-15.981636540382032,


In [10]:
# sort in descending order: the most positive weights come first
weights.sort("value", ascending=False)

name,index,class,value,stderr
words,publican,positive,32.37407896191911,
words,schwa,positive,32.37407896191911,
words,laddishness,positive,32.37407896191911,
words,righties,positive,32.37407896191911,
words,workplaces,positive,32.37407896191911,
words,retrained,positive,32.37407896191911,
words,cozzie,positive,32.37407896191911,
words,phworr,positive,32.37407896191911,
words,distrusting,positive,25.936105885911104,
words,widdecombe,positive,25.936105885911104,


In [11]:
# the weight of the word wonderful is positive
weights[weights["index"] == "wonderful"]

name,index,class,value,stderr
words,wonderful,positive,1.141119067538709,


In [12]:
# the weight of the word horrible is negative
weights[weights["index"] == "horrible"]

name,index,class,value,stderr
words,horrible,positive,-1.1194574095528114,


In [13]:
# the weight of the word "the" is close to zero
weights[weights["index"] == "the"]

name,index,class,value,stderr
words,the,positive,0.0005643378930812,


This makes sense: "wonderful" is a positive word, "horrible"
is a negative word, and "the" is a neutral word.
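
To see how these weights turn into a prediction, here is a minimal sketch of the usual logistic scoring rule: add the intercept to the sum of each word's weight times its count, then apply the sigmoid function. The intercept and the three weights are copied from the outputs above. Treating every other word as having weight 0, and the helper names `sigmoid` and `positive_probability`, are simplifying assumptions for illustration and not part of Turi Create.

```python
import math

# Values copied from the coefficient tables printed above.
INTERCEPT = 0.0759701927312501
WEIGHTS = {
    "wonderful": 1.141119067538709,
    "horrible": -1.1194574095528114,
    "the": 0.0005643378930812,
}


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def positive_probability(word_counts):
    """Score a review given as {word: count}; unknown words get weight 0 (assumption)."""
    score = INTERCEPT + sum(
        WEIGHTS.get(word, 0.0) * count for word, count in word_counts.items()
    )
    return sigmoid(score)


p_good = positive_probability({"the": 1, "wonderful": 1})
p_bad = positive_probability({"the": 1, "horrible": 1})
print(f"'the wonderful': {p_good:.3f}   'the horrible': {p_bad:.3f}")
```

With only these weights, a review containing "wonderful" scores above 0.5 and one containing "horrible" scores below it, while "the" barely moves the result in either direction.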

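As a last step, let’s find the most positive and most negative reviews. We start by storing, for each review, the model's predicted probability that the review is positive.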

In [14]:
movies["prediction"] = model.predict(movies, output_type="probability")

Sorting by this probability gives the reviews that the model considers most positive and most negative.

In [16]:
movies.sort("prediction", ascending=False)[0]

{'prediction': 1.0,
 'review': 'This movie is stuffed full of stock Horror movie goodies: chained lunatics, pre-meditated murder, a mad (vaguely lesbian) female scientist with an even madder father who wears a mask because of his horrible disfigurement, poisoning, spooky castles, werewolves (male and female), adultery, slain lovers, Tibetan mystics, the half-man/half-plant victim of some unnamed experiment, grave robbing, mind control, walled up bodies, a car crash on a lonely road, electrocution, knights in armour - the lot, all topped off with an incredibly awful score and some of the worst Foley work ever done.<br /><br />The script is incomprehensible (even by badly dubbed Spanish Horror movie standards) and some of the editing is just bizarre. In one scene where the lead female evil scientist goes to visit our heroine in her bedroom for one of the badly dubbed: "That is fantastical. I do not understand. Explain to me again how this is..." exposition scenes that litter this movie, 

In [17]:
movies.sort("prediction", ascending=True)[0]

{'prediction': 4.4577661704483655e-102,
 'sentiment': 'negative',
 'words': {'1960s': 1.0,
  '70s': 2.0,
  'a': 26.0,
  'about': 1.0,
  'absolute': 1.0,
  'absorb': 1.0,
  'academics': 1.0,
  'accepted': 1.0,
  'achieved': 2.0,
  'acting': 1.0,
  'action': 2.0,
  'addict': 1.0,
  'admits': 1.0,
  'adult': 1.0,
  'after': 1.0,
  'against': 1.0,
  'all': 2.0,
  'allowed': 2.0,
  'also': 1.0,
  'ambiguity': 1.0,
  'ambivalence': 1.0,
  'america': 1.0,
  'american': 8.0,
  'americaness': 1.0,
  'among': 1.0,
  'an': 5.0,
  'and': 24.0,
  'anti': 1.0,
  'anywhere': 1.0,
  'are': 5.0,
  'arthouse': 1.0,
  'as': 3.0,
  'at': 3.0,
  'attempt': 1.0,
  'avoid': 1.0,
  'away': 1.0,
  'baddie': 1.0,
  'barthelmy': 1.0,
  'be': 2.0,
  'becomes': 1.0,
  'bed': 1.0,
  'before': 1.0,
  'begins': 1.0,
  'being': 2.0,
  'best': 1.0,
  'block': 1.0,
  'body': 1.0,
  'bonnie': 1.0,
  'boom': 1.0,
  'br': 22.0,
  'burning': 1.0,
  'but': 4.0,
  'by': 6.0,
  'can': 3.0,
  'cannot': 1.0,
  'career': 1.0,
  '