 <h2 align="center"> Sentiment Analysis </h2>

### Introduction
___

- IMDB movie reviews dataset
- http://ai.stanford.edu/~amaas/data/sentiment
- Contains 25000 positive and 25000 negative reviews
<img src="https://i.imgur.com/lQNnqgi.png" align="center">
- Contains at most 30 reviews per movie
- At least 7 stars out of 10 $\rightarrow$ positive (label = 1)
- At most 4 stars out of 10 $\rightarrow$ negative (label = 0)
- 50/50 train/test split
- Evaluation metric: accuracy
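
The labeling rule in the list above can be sketched as a small function. The name `label_from_stars` is hypothetical and not part of the dataset tooling. Ratings of 5 or 6 stars count as neutral and are not included in the labeled data, so the sketch returns `None` for them.

```python
from typing import Optional


def label_from_stars(stars: int) -> Optional[int]:
    """Map a 1-10 star rating to a sentiment label (hypothetical helper)."""
    if stars >= 7:
        return 1      # positive review
    if stars <= 4:
        return 0      # negative review
    return None       # 5-6 stars: neutral, not part of the labeled set


print([label_from_stars(s) for s in (10, 7, 6, 5, 4, 1)])
```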

<b>Features: bag of 1-grams with TF-IDF values</b>:
- Extremely sparse feature matrix: close to 97% of the entries are zeros
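
As a rough illustration of what sparsity means here, the fraction of zero entries in a TF-IDF matrix can be measured from the number of stored non-zeros. The three toy documents are an assumption made for this sketch, not data from the reviews, so the resulting number is unrelated to the 97% figure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (assumption: stand-ins for the real reviews)
toy_docs = ["good movie", "bad movie", "great acting"]

X = TfidfVectorizer().fit_transform(toy_docs)   # SciPy sparse matrix
n_rows, n_cols = X.shape
sparsity = 1.0 - X.nnz / (n_rows * n_cols)      # fraction of zero entries
print(f"shape={X.shape}, sparsity={sparsity:.2f}")
```

Here 6 of the 15 cells are non-zero, so the sparsity is 0.60. On a real review corpus the vocabulary is far larger than any single review, which pushes the fraction of zeros much higher.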

 <b>Model: Logistic regression</b>
- $p(y = 1|x) = \sigma(w^{T}x)$
- Linear classification model
- Can handle sparse data
- Fast to train
- Weights can be interpreted
<img src="https://i.imgur.com/VieM41f.png" align="center" width=500 height=500>
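
A minimal sketch of the prediction rule $p(y = 1|x) = \sigma(w^{T}x)$ follows. The weight vector `w` and feature vector `x` are made-up values for illustration; a fitted model would learn `w` from the training data and would usually also include an intercept term.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights and one mostly-zero feature vector
w = np.array([1.5, -2.0, 0.0, 0.7])
x = np.array([0.4, 0.0, 0.0, 0.9])

p_positive = sigmoid(w @ x)          # p(y = 1 | x)
label = int(p_positive >= 0.5)       # predict positive when p >= 0.5
print(round(float(p_positive), 3), label)
```

Because the score is a weighted sum, each weight shows how strongly its term pushes a review toward the positive or negative class, which is what makes the model interpretable.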

Loading the dataset
---

In [1]:
import pandas as pd
movie=pd.read_csv("data/movie_data.csv")


In [2]:
movie.head()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |
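
If `data/movie_data.csv` is not available, one possible way to build a similar file is to walk the extracted dataset archive. This is only a sketch: it assumes the archive was unpacked into a folder with `train`/`test` and `pos`/`neg` subfolders, and the function name `build_movie_csv`, its arguments, and the shuffling seed are all hypothetical. The resulting file has only `review` and `sentiment` columns, without the extra index column shown above.

```python
import csv
import os
import random


def build_movie_csv(basepath, out_path, seed=0):
    """Collect the review text files into one shuffled CSV (hypothetical helper)."""
    labels = {"pos": 1, "neg": 0}
    rows = []
    for split in ("test", "train"):
        for folder, label in labels.items():
            dirpath = os.path.join(basepath, split, folder)
            for name in sorted(os.listdir(dirpath)):
                with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                    rows.append((fh.read(), label))
    # Shuffle so positive and negative reviews are interleaved in the file
    random.Random(seed).shuffle(rows)
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["review", "sentiment"])
        writer.writerows(rows)
    return len(rows)
```

A call such as `build_movie_csv("aclImdb", "data/movie_data.csv")` would then produce a file that `pd.read_csv` can load, provided both paths exist on your machine.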


<h2 align="center">Bag of words / Bag of N-grams model</h2>

In [3]:
movie.review[1]


"OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low energy style and he will steal a scene effortlessly. But, Disappearance is his misstep. Holy Moly, this was a bad movie! <br /><br />I must give kudos to the cinematography and and the actors, including Kris, for trying their darndest to make sense from this goofy, confusing story! None of it made sense and Kris probably didn't understand it either and he was just going through the motions hoping someone would come up to him and tell him what it was all about! <br /><br />I don't care that everyone on this movie was doing out of love for the project, or some such nonsense... I've seen low budget movies that had a plot for goodness sake! This had none, zilcho, nada, zippo, empty of reason... a complete waste of good talent, scenery and celluloid! <br /><br />I rented this piece of garbage for a buck, and I want my money back! I want my 2 hou

### Transforming documents into feature vectors

Below, we will call the fit_transform method on CountVectorizer. This will construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two
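
To make the idea concrete before using the library, the vocabulary building and counting can be sketched by hand with the standard library. This only roughly mirrors the default behavior of `CountVectorizer` (lowercasing, tokens of two or more word characters, alphabetical column order). The helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]


def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


# Vocabulary: each distinct token gets a column index in alphabetical order
all_tokens = sorted({tok for d in docs for tok in tokenize(d)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# One row of raw term counts per document
bag = []
for d in docs:
    counts = Counter(tokenize(d))
    bag.append([counts.get(tok, 0) for tok in vocabulary])

print(vocabulary)
for row in bag:
    print(row)
```

Each row has one entry per vocabulary term, so the third sentence, which repeats "is" three times, gets a 3 in the "is" column.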


In [4]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


count= CountVectorizer()

# Three separate documents, one string per sentence
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [5]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [6]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


Raw term frequencies: *tf(t, d)* is the number of times a term *t* occurs in a document *d*.

###  Word relevancy using term frequency-inverse document frequency

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and $\text{df}(d, t)$ is the number of documents $d$ that contain the term $t$.

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2)
tfidf = TfidfTransformer(use_idf=True, norm="l2", smooth_idf=True)

print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


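The equations for idf and tf-idf that scikit-learn implements (with `smooth_idf=True`) differ slightly from the textbook definitions above. The idf uses the natural logarithm and adds 1 to both numerator and denominator:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf then adds 1 to the idf, so terms that occur in every document are not zeroed out:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

With `norm="l2"`, each document's tf-idf vector is finally scaled to unit length.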
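
The two scikit-learn equations can be checked by hand with NumPy. The sketch below assumes the raw count matrix of the three example sentences, each treated as its own document, and applies the smoothed idf followed by L2 normalization. Variable names such as `tfidf_raw` and `tfidf_l2` are chosen only for this illustration.

```python
import numpy as np

# Raw term counts for the three example sentences
# (columns: and, is, one, shining, sun, sweet, the, two, weather)
tf = np.array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
               [0, 1, 0, 0, 0, 1, 1, 0, 1],
               [2, 3, 2, 1, 1, 1, 2, 1, 1]], dtype=float)

n_d = tf.shape[0]                          # total number of documents
df = np.count_nonzero(tf, axis=0)          # documents containing each term
idf = np.log((1 + n_d) / (1 + df))         # smoothed idf (natural log)
tfidf_raw = tf * (idf + 1)                 # tf-idf before normalization
# Scale every row (document) to unit Euclidean length
tfidf_l2 = tfidf_raw / np.linalg.norm(tfidf_raw, axis=1, keepdims=True)

np.set_printoptions(precision=2)
print(tfidf_l2)
```

For the word "is" in the third sentence, $\text{df} = 3$ gives $\text{idf} = \log(4/4) = 0$, so the raw tf-idf is $3 \times (0 + 1) = 3$. Dividing by the row norm of about 6.74 gives roughly 0.45.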

###  Data Preparation

In [8]:
movie.loc[0,"review"][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [24]:
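import re
def preprocessor(text):
    # Strip HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-) =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters into single spaces, then
    # append the emoticons with the "nose" character (-) removed
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text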

In [10]:
preprocessor(movie.loc[0,"review"][-50:])

'is seven title brazil not available'

In [11]:
preprocessor("</a>This :) is a :( test :-)!")

'this is a test :) :( :)'

In [12]:
movie["review"] = movie["review"].apply(preprocessor)

###  Tokenization of documents

In [13]:
from nltk.stem.porter import PorterStemmer
porter= PorterStemmer()

In [14]:
def tokenizer(text):
    return text.split()


In [15]:
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]


In [16]:
tokenizer("runner like running")

['runner', 'like', 'running']

In [17]:
tokenizer_porter("runners like running")

['runner', 'like', 'run']

In [18]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/rhyme/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
from nltk.corpus import stopwords
stop=stopwords.words('english')
[w for w in tokenizer_porter('a running like running and runs a lot')[-10:] if w not in stop]

['run', 'like', 'run', 'run', 'lot']

###  Transform Text Data into TF-IDF Vectors

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
# lowercase=False: the reviews were already lowercased by preprocessor()
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None,
                        tokenizer=tokenizer_porter, use_idf=True,
                        norm='l2', smooth_idf=True)
y = movie.sentiment.values
x = tfidf.fit_transform(movie.review)
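
An alternative arrangement, shown here only as a sketch on made-up toy reviews, is to chain the vectorizer and the classifier in a scikit-learn `Pipeline`. The vocabulary and idf values are then learned from whatever text is passed to `fit`, so they can be restricted to the training portion. The step names `"tfidf"` and `"clf"` and the toy data are assumptions for illustration, and a plain `LogisticRegression` stands in for the cross-validated model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy reviews and labels (assumption: illustrative only)
reviews = ["great film great acting", "wonderful great story",
           "terrible film terrible plot", "boring terrible story"]
labels = [1, 1, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),      # raw text -> sparse tf-idf features
    ("clf", LogisticRegression()),     # linear classifier on those features
])
pipe.fit(reviews, labels)

print(pipe.predict(["a great story"]))
print(pipe.predict_proba(["a great story"]))
```

The same pipeline object accepts raw strings at prediction time, so no separate `transform` call is needed.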


### Document Classification using Logistic Regression

In [21]:
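from sklearn.model_selection import train_test_split
# shuffle=False keeps the row order: the first half trains, the second half tests
# (random_state has no effect when shuffle=False)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.5, shuffle=False)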
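
The effect of `shuffle=False` can be seen on a small stand-in array. This sketch uses ten row positions instead of the real feature matrix; the names `rows`, `train_idx`, and `test_idx` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(10)          # stand-in for row positions in the dataset
train_idx, test_idx = train_test_split(rows, test_size=0.5, shuffle=False)

print(train_idx)              # first half, in original order
print(test_idx)               # second half, in original order
```

This prints `[0 1 2 3 4]` and `[5 6 7 8 9]`. An ordered split like this gives balanced classes only if the rows of the CSV are already in random order, which is assumed to be the case for the file used here.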

In [22]:
import pickle

from sklearn.linear_model import LogisticRegressionCV

# 5-fold cross-validation selects the regularization strength C
clf = LogisticRegressionCV(cv=5, scoring='accuracy', random_state=0, n_jobs=-1, verbose=3,
                           max_iter=300).fit(x_train, y_train)
# Persist the fitted model; the with-block closes the file automatically
with open('saved_model.sav', 'wb') as saved_model:
    pickle.dump(clf, saved_model)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.6min finished


###  Model Evaluation

In [25]:
filename = 'saved_model.sav'
# Load the fitted model back; the with-block closes the file automatically
with open(filename, 'rb') as f:
    saved_clf = pickle.load(f)

In [27]:
saved_clf.score(x_test,y_test)



0.89604