# Sentiment Analysis on Movie Reviews

Using Logistic Regression, SGD, Naive Bayes, and OneVsOne models.

The sentiment labels are:

- 0 - negative

- 1 - somewhat negative

- 2 - neutral

- 3 - somewhat positive

- 4 - positive

## Load Libraries

In [1]:
import nltk
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

from sklearn.metrics import classification_report, confusion_matrix

## Load & Read Datasets

In [2]:
train = pd.read_csv('train.tsv', delimiter='\t')
test = pd.read_csv('test.tsv', delimiter='\t')

In [3]:
train.shape, test.shape

((156060, 4), (66292, 3))

In [4]:
train.head()

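|   | PhraseId | SentenceId | Phrase | Sentiment |
|---|----------|------------|--------|-----------|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |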


In [5]:
test.head()

|   | PhraseId | SentenceId | Phrase |
|---|----------|------------|--------|
| 0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... |
| 1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... |
| 2 | 156063 | 8545 | An |
| 3 | 156064 | 8545 | intermittently pleasing but mostly routine effort |
| 4 | 156065 | 8545 | intermittently pleasing but mostly routine |


In [6]:
# unique sentiment labels
train.Sentiment.unique()

array([1, 2, 3, 4, 0])

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [8]:
train.Sentiment.value_counts()

2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64

In [9]:
train.Sentiment.value_counts() / train.Sentiment.count()

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64
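
The proportions above show a strong class imbalance: about half of all phrases are labeled neutral. As a rough sanity check, the sketch below uses the counts copied from the output above to compute the accuracy a trivial model would reach by always predicting the most common label. Any classifier trained later should beat this figure.

```python
# Label counts copied from the value_counts() output above
label_counts = {2: 79582, 3: 32927, 1: 27273, 4: 9206, 0: 7072}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of a model that always predicts the majority label
baseline_accuracy = label_counts[majority_label] / total

print(total)                        # 156060
print(majority_label)               # 2
print(round(baseline_accuracy, 6))  # 0.509945
```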

## Extracting features

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

### Bags of words

The most intuitive way to do so is the bags of words representation:

- assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).

- for each document $i$, count the number of occurrences of each word $w$ and store it in $X[i, j]$ as the value of feature $j$, where $j$ is the index of word $w$ in the dictionary

*Reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html*

The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears.
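
To make the two steps concrete, here is a minimal pure-Python sketch of a bag-of-words encoder on a three-phrase toy corpus. The helper names (`tokenize`, `build_vocabulary`, `count_matrix`) and the toy phrases are illustrative assumptions, not part of scikit-learn. The tokenizer keeps lowercase tokens of two or more word characters, which mirrors the default token pattern of *CountVectorizer*; that is why a one-letter word such as "A" produces no feature.

```python
import re


def tokenize(text):
    # Lowercase, then keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs):
    # Assign a fixed integer id to every word, in alphabetical order
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}


def count_matrix(docs, vocab):
    # X[i][j] = number of times word j occurs in document i
    X = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in tokenize(doc):
            X[i][vocab[tok]] += 1
    return X


toy_docs = ["A series of escapades", "A series", "series of series"]
toy_vocab = build_vocabulary(toy_docs)
toy_X = count_matrix(toy_docs, toy_vocab)

print(toy_vocab)  # {'escapades': 0, 'of': 1, 'series': 2}
print(toy_X)      # [[1, 1, 1], [0, 0, 1], [0, 1, 2]]
```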

We'll be using the *CountVectorizer* feature extractor module from *scikit-learn* to create *bag-of-words* features.

In [10]:
X_train = train['Phrase']
y_train = train['Sentiment']

In [11]:
# Convert a collection of text documents to a matrix of token counts
count_vect = CountVectorizer() 

# Fit followed by Transform
# Learn the vocabulary dictionary and return term-document matrix
X_train_counts = count_vect.fit_transform(X_train)

In [12]:
# Dense conversion is left disabled: a 156060 x 15240 int64 array needs roughly 19 GB
# X_train_counts = X_train_counts.toarray()

In [13]:
# 156060 rows of train data & 15240 features (one for each vocabulary word)
X_train_counts.shape

(156060, 15240)

In [14]:
# get all words in the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vocab = count_vect.get_feature_names_out()
print (vocab)



In [15]:
# get index of any word
count_vect.vocabulary_.get(u'100')

2

In [16]:
# Sum up the counts of each vocabulary word
dist = np.sum(X_train_counts, axis=0)
# print (dist) # matrix

dist = np.squeeze(np.asarray(dist))
print (dist) # array

# zip() returns an iterator in Python 3, so use sorted() instead of .sort()
zipped = sorted(zip(vocab, dist), key=lambda t: t[1], reverse=True) # most frequent words first

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zipped:
    print (count, tag)

[ 23 179  70 ...  15   9   5]

## Convert Occurrence to Frequency

Problem with occurrence count of words:
- longer documents will have higher average count values than shorter documents, even though they might talk about the same topics

Solution:
- divide the number of occurrences of each word in a document by the total number of words in the document
- new features formed by this method are called **tf** (***Term Frequencies***)

Refinement on *tf*:
- downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus
- this downscaling is called **tf-idf** (***Term Frequency times Inverse Document Frequency***)

Let's compute *tf* and *tf-idf*:

In [None]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

In [None]:
# 156060 rows of train data & 15240 features (one for each vocabulary word)
X_train_tf.shape

In [None]:
# print some values of the tf transformed feature vector
print (X_train_tf[1:2])

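In the above code, we first used the *fit()* method to fit our estimator and then the *transform()* method to transform our count matrix to a *tf* representation (idf weighting is switched off by *use_idf=False*).

These two steps can be combined using the *fit_transform()* method, which we use next to compute the full *tf-idf* representation.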

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
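
To see what the weighting does, the sketch below recomputes tf-idf by hand on a toy count matrix. It assumes the defaults documented for *TfidfTransformer*: a smoothed idf of $\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$ followed by scaling each row to unit Euclidean (L2) length, rather than dividing by the plain word total. The function name `tfidf` and the toy matrix are illustrative assumptions.

```python
import math


def tfidf(counts):
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Scale each document vector to unit L2 length
        rows.append([v / norm if norm else 0.0 for v in weighted])
    return idf, rows


# Toy counts for three documents over the terms ["escapades", "of", "series"]
toy_counts = [[1, 1, 1], [0, 0, 1], [0, 1, 2]]
toy_idf, toy_tfidf = tfidf(toy_counts)

print([round(w, 3) for w in toy_idf])  # [1.693, 1.288, 1.0]
```

The term that appears in every document ("series") receives the smallest idf weight, while the term that appears in only one document ("escapades") receives the largest.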

## Train Classifier

We train our classifier by inputting our features together with the known sentiment labels; the trained classifier is then expected to predict the sentiment value for each phrase in the test dataset.

### Naive Bayes Classifier

In [None]:
clf = MultinomialNB().fit(X_train_tfidf, y_train)

In [None]:
predicted = clf.predict(X_train_tfidf)

In [None]:
np.mean(predicted == y_train)
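
For intuition about what *MultinomialNB* fits, here is a from-scratch sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, which is also the library default). It is a simplified illustration on raw counts, not the library's implementation; the function names and the toy data are assumptions. The last line computes accuracy in the same way as `np.mean(predicted == y_train)` above.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        # Class prior: share of training rows with label c
        log_prior[c] = math.log(len(rows) / len(X))
        # Smoothed per-class feature totals
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(totals)
        log_likelihood[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_likelihood


def predict_multinomial_nb(X, log_prior, log_likelihood):
    predictions = []
    for row in X:
        scores = {
            c: log_prior[c] + sum(x * w for x, w in zip(row, log_likelihood[c]))
            for c in log_prior
        }
        predictions.append(max(scores, key=scores.get))
    return predictions


# Hypothetical toy data: columns are counts of ["bad", "good", "movie"]
toy_X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
toy_y = [0, 0, 4, 4]

prior, likelihood = fit_multinomial_nb(toy_X, toy_y)
toy_pred = predict_multinomial_nb([[1, 0, 0], [0, 1, 1]], prior, likelihood)
print(toy_pred)  # [0, 4]

# Accuracy = fraction of predictions that match the true labels
train_pred = predict_multinomial_nb(toy_X, prior, likelihood)
train_accuracy = sum(p == t for p, t in zip(train_pred, toy_y)) / len(toy_y)
```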

### Building a Pipeline

In order to make the **vectorizer => transformer => classifier** chain easier to work with, scikit-learn provides a **Pipeline** class that behaves like a compound classifier.

Compare the accuracy obtained above without a Pipeline against the accuracy obtained below with the Pipeline class: they are the same, because the pipeline runs exactly the same steps. The Pipeline class therefore simplifies the tokenizing and tf-idf conversion without changing the result.

In [None]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [None]:
text_clf.fit(X_train, y_train)

In [None]:
predicted = text_clf.predict(X_train)

In [None]:
np.mean(predicted == y_train)
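
The idea of a compound classifier can be sketched in a few lines of plain Python. The classes below (`LowercaseTokenizer`, `LookupClassifier`, `MiniPipeline`) are hypothetical stand-ins written for this illustration and are far simpler than scikit-learn's *Pipeline*: during `fit` every transformer is fitted and applied in order before the final estimator is trained, and during `predict` the same transformations are replayed before the estimator predicts.

```python
from collections import Counter


class LowercaseTokenizer:
    """Stateless transformer: phrase -> tuple of lowercase tokens."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [tuple(text.lower().split()) for text in X]


class LookupClassifier:
    """Toy estimator: recalls the label seen for an input, else the majority label."""

    def fit(self, X, y):
        self.table_ = dict(zip(X, y))
        self.default_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table_.get(x, self.default_) for x in X]


class MiniPipeline:
    """Chains (name, step) pairs; every step but the last is a transformer."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = MiniPipeline([("tok", LowercaseTokenizer()), ("clf", LookupClassifier())])
pipe.fit(["Good movie", "Bad movie", "A series", "series"], [3, 1, 2, 2])
toy_predictions = pipe.predict(["good MOVIE", "unseen phrase"])
print(toy_predictions)  # [3, 2]
```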

Let's use the stop words filter in *CountVectorizer* and see how it affects the classifier's accuracy. We see that this increases the accuracy (measured, as before, on the training set).

In [None]:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_train)
np.mean(predicted == y_train)
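
What the stop words filter does to a single phrase can be shown with a small sketch. The stop list below is a tiny hypothetical one chosen for illustration; the built-in `'english'` list in scikit-learn is much longer. The phrase is taken from the test data shown earlier.

```python
import re

# Hypothetical, deliberately tiny stop list (not scikit-learn's list)
TOY_STOP_WORDS = {"a", "an", "the", "of", "but"}


def tokenize(text, stop_words=frozenset()):
    # Lowercase, keep tokens of 2+ word characters, then drop stop words
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]


phrase = "intermittently pleasing but mostly routine effort"

print(tokenize(phrase))
# ['intermittently', 'pleasing', 'but', 'mostly', 'routine', 'effort']
print(tokenize(phrase, TOY_STOP_WORDS))
# ['intermittently', 'pleasing', 'mostly', 'routine', 'effort']
```

Removing stop words shrinks the vocabulary, so the count matrix has fewer columns for the classifier to weigh.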

### Classification Report (precision, recall, f1-score)

In [None]:
# class names for the report, in sorted label order (0-4)
target_names = ['0', '1', '2', '3', '4']

print (classification_report(
    y_train,
    predicted,
    target_names=target_names
))

### Confusion Matrix

In [None]:
print (confusion_matrix(y_train, predicted))
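
To read these two reports, it helps to see how they are related. The sketch below builds a confusion matrix by hand, using the same convention as *confusion_matrix* (rows are true labels, columns are predicted labels), and derives precision and recall for one class from it. The helper names and the toy labels are assumptions made for this illustration.

```python
def confusion_counts(y_true, y_pred, labels):
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1  # row = true label, column = predicted
    return matrix


def precision_recall(matrix, k):
    true_positives = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum
    actual_k = sum(matrix[k])                    # row sum
    precision = true_positives / predicted_k if predicted_k else 0.0
    recall = true_positives / actual_k if actual_k else 0.0
    return precision, recall


# Hypothetical toy labels
toy_true = [2, 2, 2, 3, 3, 1]
toy_pred = [2, 2, 3, 3, 2, 2]
toy_labels = [1, 2, 3]

cm = confusion_counts(toy_true, toy_pred, toy_labels)
print(cm)  # [[0, 1, 0], [0, 2, 1], [0, 1, 1]]

# Precision and recall for label 2 (index 1 in toy_labels)
precision_2, recall_2 = precision_recall(cm, 1)
print(precision_2)  # 0.5
```

Here label 2 is predicted four times but is correct only twice (precision 0.5), and two of the three true label-2 phrases are recovered (recall 2/3).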

### Stochastic Gradient Descent (SGD) Classifier

In [None]:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='modified_huber', shuffle=True, penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_train)
np.mean(predicted == y_train)

### Logistic Regression Classifier

In [None]:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english', max_features=5000)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_train)
np.mean(predicted == y_train)

### OneVsOne Classifier

In [None]:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english', max_features=5000)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsOneClassifier(LinearSVC()))
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_train)
np.mean(predicted == y_train)
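
The one-vs-one strategy trains one binary classifier per pair of classes, so the five sentiment labels need 5 * 4 / 2 = 10 classifiers, and the final label is decided by voting. The sketch below illustrates only the pairing and the voting; the duel outcomes are hypothetical, and the tie-breaking rule (smallest label wins) is a simplification chosen for this example rather than scikit-learn's rule.

```python
from collections import Counter
from itertools import combinations

labels = [0, 1, 2, 3, 4]
pairs = list(combinations(labels, 2))
print(len(pairs))  # 10


def ovo_vote(pairwise_winners):
    # Count how many pairwise duels each label won
    votes = Counter(pairwise_winners.values())
    # Ties go to the smallest label (simplifying assumption)
    return max(sorted(votes), key=votes.get)


# Hypothetical duel outcomes for one phrase: pair -> winning label
toy_winners = {
    (0, 1): 1, (0, 2): 2, (0, 3): 3, (0, 4): 4, (1, 2): 2,
    (1, 3): 3, (1, 4): 1, (2, 3): 3, (2, 4): 2, (3, 4): 3,
}
predicted_label = ovo_vote(toy_winners)
print(predicted_label)  # 3
```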

## Create Submission

In [None]:
test.info()

In [None]:
X_test = test['Phrase']
phraseIds = test['PhraseId']
predicted = text_clf.predict(X_test)
output = pd.DataFrame( data={"PhraseId":phraseIds, "Sentiment":predicted} )
#output.to_csv( "submission.csv", index=False, quoting=3 )
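
The `quoting=3` argument in the commented-out line is the numeric value of `csv.QUOTE_NONE`, which tells the writer never to wrap fields in quotes. As a rough sketch of the resulting file layout, the example below writes a few rows with the standard `csv` module into an in-memory buffer. The sentiment values are hypothetical placeholders, not model output.

```python
import csv
import io

print(csv.QUOTE_NONE)  # 3

# Hypothetical placeholder predictions for the first three test PhraseIds
rows = [(156061, 2), (156062, 2), (156063, 2)]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerow(["PhraseId", "Sentiment"])  # header row
writer.writerows(rows)

submission_text = buffer.getvalue()
print(submission_text)
```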