# Sentiment Analysis, the ML way

You are given a collection of short movie reviews, each of which reads as positive or negative about the movie. Humans can easily identify the sentiment of a text by reading its words, but it is much harder to teach a machine to do the same.

One way is the ML way. A ground truth is created for a corpus, i.e., we have both positive and negative reviews tagged with their respective classes. This labelled data forms the base: after it is converted to a structured form, an algorithm is trained on it and learns to classify reviews from the words they use (the machine tries to find a pattern in the data).

Another way is the dictionary approach, where we create a dictionary of words and explicitly mark each one as positive or negative. We can then count the positive and negative words in a sentence and compute a score. If the score is positive, the sentence is labelled positive; otherwise it is labelled negative.

In either case, some manual work is involved: creating the ground truth in the first approach, or creating the dictionary in the second.
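The dictionary approach is not used in the rest of this notebook, so here is a minimal sketch of it for comparison. The word lists and the helper names `lexicon_score` and `lexicon_label` are hypothetical and chosen only for illustration; a usable dictionary would need far more entries.

```python
import re

# Hypothetical, tiny word lists for illustration only; a real dictionary
# would hold many more manually curated entries.
POSITIVE_WORDS = {"good", "great", "gorgeous", "enjoyable", "brilliant"}
NEGATIVE_WORDS = {"bad", "boring", "dull", "awful", "mess"}


def lexicon_score(text):
    """Return (number of positive words) - (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos_hits = sum(token in POSITIVE_WORDS for token in tokens)
    neg_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    return pos_hits - neg_hits


def lexicon_label(text):
    """Label a text 'pos' when its score is above zero, otherwise 'neg'."""
    return "pos" if lexicon_score(text) > 0 else "neg"


print(lexicon_label("a gorgeous and enjoyable film"))  # prints: pos
print(lexicon_label("a dull , boring mess"))           # prints: neg
```

Note that a review containing no dictionary words scores zero and therefore falls into the negative class under the rule described above.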

In [6]:
import os
import re
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix

In [7]:
os.getcwd()

'/Users/maheshkumar/DataBox/INSOFE/Academics/PGP/Batch 45/CSE 7124c/20180916_Batch_45_CSE7124c_SentimentAnalysis_LSA_Day02_Lab_Content/20180516_Batch45_CSE7124c_SentimentAnalysis_Code'

In [8]:
path = '/Users/maheshkumar/DataBox/INSOFE/Academics/PGP/Batch 45/CSE 7124c/20180916_Batch_45_CSE7124c_SentimentAnalysis_LSA_Day02_Lab_Content/20180916_Bacth45_CSE7124c_TextMining_Lab02_Datasets'
os.chdir(path)
os.getcwd()

'/Users/maheshkumar/DataBox/INSOFE/Academics/PGP/Batch 45/CSE 7124c/20180916_Batch_45_CSE7124c_SentimentAnalysis_LSA_Day02_Lab_Content/20180916_Bacth45_CSE7124c_TextMining_Lab02_Datasets'

In [9]:
# Read the positive reviews, one per line; the with-block closes the file afterwards
with open("short_reviews/positive.txt", "r", encoding="latin") as f1:   # "r" is for reading
    short_pos = f1.readlines()

In [10]:
short_pos[1]

'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . \n'

In [11]:
type(short_pos)

list

In [12]:
# Strip the newline from each review, then keep the first 1000 reviews
short_pos = [re.sub("\n", "", i) for i in short_pos]
x_short_pos = short_pos[:1000]

In [13]:
# Same steps for the negative reviews
with open("short_reviews/negative.txt", "r", encoding="latin") as f2:
    short_neg = f2.readlines()
short_neg = [re.sub("\n", "", i) for i in short_neg]
x_short_neg = short_neg[:1000]

In [14]:
#print(short_neg[:1000])
#print(short_pos[:1000])

In [15]:
#combine the first 1000 positive reviews and first 1000 negative reviews to form a corpus
data = x_short_pos + x_short_neg

#create the target variable representing 1000 'pos' and 'neg' instances each, wrt the data created above
target = ['pos']*1000+['neg']*1000


In [16]:
# Only the last expression in a cell is displayed, so show both lengths together
len(data), len(target)

(2000, 2000)

In [17]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=1234)

In [18]:
#Get doc-term matrix on train data
cv=CountVectorizer(stop_words='english',lowercase=True,
                   strip_accents='unicode',decode_error='ignore')

tdm_train = cv.fit_transform(X_train)
Mat = tdm_train.todense()   # dense copy for inspection only; the models below use the sparse matrix
Mat.shape

(1600, 6435)

In [19]:
#Get doc-term matrix for test data
tdm_test = cv.transform(X_test)
Mat_test = tdm_test.todense()
Mat_test.shape

(400, 6435)
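Both matrices have the same 6435 columns because the vocabulary is learned from the training reviews only (`fit_transform`) and then reused on the test reviews (`transform`). A minimal pure-Python sketch of that idea follows. The helper names `fit_vocabulary` and `transform` and the toy documents are hypothetical, and the sketch ignores what `CountVectorizer` does in addition (stop-word removal, its own tokenization rules, sparse storage).

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map every distinct word in the training docs to a column index."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: idx for idx, word in enumerate(words)}


def transform(docs, vocabulary):
    """Turn docs into rows of word counts, one column per vocabulary word."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc.lower().split()).items():
            if word in vocabulary:  # words unseen during fitting are dropped
                row[vocabulary[word]] = count
        rows.append(row)
    return rows


# Toy documents, made up for illustration
train_docs = ["good movie", "bad movie", "good good plot"]
test_docs = ["bad plot twist"]

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)

print(vocab)         # {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
print(train_matrix)  # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 2, 0, 1]]
print(test_matrix)   # [[1, 0, 0, 1]]
```

The word "twist" appears only in the test document, so it has no column and is ignored. The same thing happens to test-set words that `cv` never saw during fitting.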

In [20]:
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore") #just to ignore any warnings

#Training the model
logreg = LogisticRegression()
lr_clf = logreg.fit(tdm_train, y_train)


#Predicting on train data
train_pred = lr_clf.predict(tdm_train)


#Predicting on test data
test_pred=lr_clf.predict(tdm_test)


In [21]:
print("Train_Confusion Matrix: \n", confusion_matrix(y_train,train_pred))
print("Test_Confusion Matrix: \n", confusion_matrix(y_test,test_pred))

Train_Confusion Matrix: 
 [[794   4]
 [  5 797]]
Test_Confusion Matrix: 
 [[145  57]
 [ 62 136]]


In [22]:
print("A glance of first 10 values:\n", "\n Test Actuals: ", y_test[:10],"\n", "Test Predictions: ", test_pred[:10])

A glance of first 10 values:
 
 Test Actuals:  ['neg', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos'] 
 Test Predictions:  ['neg' 'neg' 'neg' 'pos' 'pos' 'neg' 'pos' 'neg' 'neg' 'pos']


In [23]:
#Train Metrics
from sklearn.metrics import recall_score, precision_score, accuracy_score

acc = accuracy_score(y_train, train_pred)
rec = recall_score(y_train, train_pred, pos_label='neg')
prec = precision_score(y_train, train_pred, pos_label='neg')

print("Results of logistic regression on train data:","\nAccuracy:",acc, "\nRecall:",rec, "\nPrecision:",prec)

Results of logistic regression on train data: 
Accuracy: 0.994375 
Recall: 0.994987468672 
Precision: 0.993742177722


In [24]:
#Test Metrics
acc = accuracy_score(y_test, test_pred)
rec = recall_score(y_test, test_pred, pos_label='neg')
prec = precision_score(y_test, test_pred, pos_label='neg')

print("Results of logistic regression on test data:","\nAccuracy:",acc, "\nRecall:",rec, "\nPrecision:",prec)

Results of logistic regression on test data: 
Accuracy: 0.7025 
Recall: 0.717821782178 
Precision: 0.700483091787
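The three test metrics can be recomputed by hand from the test confusion matrix printed earlier, which makes it clearer what each one measures. The sketch below copies those four counts; the variable names are illustrative. Rows are actual classes and columns are predicted classes, both in sorted label order ('neg', 'pos'), which is how scikit-learn orders them when no labels are passed.

```python
# Test confusion matrix from the logistic regression run:
# row 0 = actual 'neg', row 1 = actual 'pos';
# column 0 = predicted 'neg', column 1 = predicted 'pos'.
cm = [[145, 57],
      [62, 136]]

neg_as_neg, neg_as_pos = cm[0]
pos_as_neg, pos_as_pos = cm[1]
total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos

# Accuracy: share of all reviews that were classified correctly
accuracy = (neg_as_neg + pos_as_pos) / total

# Recall for 'neg': share of actual negatives that were predicted 'neg'
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)

# Precision for 'neg': share of 'neg' predictions that were actually negative
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)

print(round(accuracy, 4), round(recall_neg, 4), round(precision_neg, 4))
# prints: 0.7025 0.7178 0.7005
```

These match the values reported by `accuracy_score`, `recall_score` and `precision_score` with `pos_label='neg'`.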


# Try another classification model and check whether the accuracy improves

In [25]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(tdm_train, y_train)
test_pred = classifier.predict(tdm_test)
confusion_matrix(y_test,test_pred)

array([[137,  65],
       [ 55, 143]])

In [26]:
#Test Metrics
acc = accuracy_score(y_test, test_pred)
rec = recall_score(y_test, test_pred, pos_label='neg')
prec = precision_score(y_test, test_pred, pos_label='neg')

print("Results of Multinomial_NB on test data:","\nAccuracy:",acc, "\nRecall:",rec, "\nPrecision:",prec)

Results of Multinomial_NB on test data: 
Accuracy: 0.7 
Recall: 0.678217821782 
Precision: 0.713541666667


What else could be done to improve the accuracy?

Solution: Some words are common to both positive and negative reviews. To avoid such words, we can use only adjectives as features and check whether the accuracy improves.
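One way to try the adjective-only idea is to part-of-speech tag each review and keep the words tagged as adjectives. The sketch below assumes Penn Treebank tags, in which adjective tags start with `JJ` (`JJ`, `JJR`, `JJS`). `keep_adjectives` is a hypothetical helper, and the tagged review is hand-written for illustration rather than real tagger output.

```python
def keep_adjectives(tagged_tokens):
    """Keep the words whose part-of-speech tag marks them as adjectives."""
    return [word.lower() for word, tag in tagged_tokens if tag.startswith("JJ")]


# Hand-tagged example in the (word, tag) format that a POS tagger returns;
# the tags here are assumptions made for illustration.
tagged_review = [
    ("a", "DT"),
    ("gorgeous", "JJ"),
    ("but", "CC"),
    ("overlong", "JJ"),
    ("film", "NN"),
]

adjectives = keep_adjectives(tagged_review)
print(" ".join(adjectives))  # prints: gorgeous overlong
```

In the notebook, the tagged tokens would come from something like `nltk.pos_tag(word_tokenize(review))`, which needs the NLTK tokenizer and tagger data to be downloaded first. The filtered strings can then be passed to `CountVectorizer` in place of the raw reviews, and the models retrained to see whether the test accuracy changes.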