# Sentiment Analysis

In this project we analyse movie reviews and try to predict the sentiment of each review. We compare the following models and look for the one with the highest accuracy score on a held-out test set:

* Naive Bayes
* Support Vector Machines
* K Nearest Neighbors
* Logistic Regression

Import the required modules

In [27]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

Next, read the data file. The first row is skipped, and the two columns are named Sentiment and Reviews.

In [28]:
data=pd.read_table('opinions.tsv',header=None,skiprows=1,names=['Sentiment','Reviews'])
X=data.Reviews
y=data.Sentiment
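
The `read_table` call above can be tried without the real file. The following is a minimal sketch that assumes `opinions.tsv` is tab-separated with one header row, a numeric label in the first column and the review text in the second; the two sample rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for opinions.tsv: a header row followed by
# label<TAB>review lines. The real file's contents are not shown here.
sample = "label\ttext\n1\tA wonderful film\n0\tDull and far too long\n"

# Same arguments as the notebook: skip the header row and supply column names
data = pd.read_table(io.StringIO(sample), header=None, skiprows=1,
                     names=['Sentiment', 'Reviews'])

print(data)
# Checking the class balance is a useful sanity step before training
print(data.Sentiment.value_counts())
```

On the real data, `value_counts()` shows how many reviews fall into each class, which helps put the accuracy scores in context.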

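Now split the data into training and test sets and use CountVectorizer to convert the text into a document-term matrix of token counts. The vectorizer is fitted on the training set only, so the test set does not influence the vocabulary.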

In [29]:
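# Hold out 20% of the reviews for testing
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1,test_size=0.2)
# Unigrams only; drop English stop words, terms that appear in more than 80%
# of the reviews, and terms that appear in fewer than 4 reviews
v=CountVectorizer(stop_words='english',ngram_range=(1,1),max_df=.80,min_df=4)
# Learn the vocabulary from the training set, then build both matrices
v.fit(X_train)
X_train_dtm=v.transform(X_train)
X_test_dtm=v.transform(X_test)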
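
To make the document-term matrix concrete, here is a small sketch on a made-up three-review corpus (not taken from `opinions.tsv`). It uses the same `CountVectorizer` class; `min_df=2` is chosen only so that its effect is visible on such a tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews for illustration only; not from opinions.tsv
docs = ["I loved this movie", "I hated this movie", "Loved it, loved it"]

# English stop words ("this", "it") are dropped; one-letter tokens such as
# "I" are ignored by the default token pattern
v_all = CountVectorizer(stop_words='english')
dtm_all = v_all.fit_transform(docs)
vocab_all = sorted(v_all.vocabulary_)   # columns follow this alphabetical order
print(vocab_all)
print(dtm_all.toarray())                # one row per review, one column per term

# min_df=2 keeps only the terms that occur in at least two reviews
v_min = CountVectorizer(stop_words='english', min_df=2)
dtm_min = v_min.fit_transform(docs)
vocab_min = sorted(v_min.vocabulary_)
print(vocab_min)
print(dtm_min.toarray())
```

Each row is one review and each column counts one vocabulary term, which is the form the classifiers below expect.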

**Testing the Naive Bayes model**

In [30]:
NB=MultinomialNB()
NB.fit(X_train_dtm,y_train)
y_pred=NB.predict(X_test_dtm)
print('\nNaive Bayes')
print('Accuracy Score: ',metrics.accuracy_score(y_test,y_pred)*100,'%',sep='')
print('Confusion Matrix: ',metrics.confusion_matrix(y_test,y_pred),sep='\n')


Naive Bayes
Accuracy Score: 98.91618497109826%
Confusion Matrix: 
[[586  12]
 [  3 783]]
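
The accuracy score can be recovered from the confusion matrix by hand. In scikit-learn's convention, rows are true labels and columns are predicted labels. The sketch below assumes the labels are 0 for negative and 1 for positive, as the `tags` list later in the notebook suggests, and also derives precision and recall for the positive class.

```python
# Confusion matrix reported above for Naive Bayes
# (rows: true label, columns: predicted label)
cm = [[586, 12],
      [3, 783]]

tn, fp = cm[0]   # true negatives, false positives (assuming 0 = negative)
fn, tp = cm[1]   # false negatives, true positives (assuming 1 = positive)
total = tn + fp + fn + tp

accuracy = (tp + tn) / total     # share of all test reviews classified correctly
precision = tp / (tp + fp)       # share of predicted positives that are positive
recall = tp / (tp + fn)          # share of actual positives that were found

print(f"Test reviews: {total}")
print(f"Accuracy:  {accuracy:.4%}")
print(f"Precision: {precision:.4%}")
print(f"Recall:    {recall:.4%}")
```

The same arithmetic applies to the confusion matrices of the other three models.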


**Testing the Logistic Regression model**

In [31]:
LR=LogisticRegression()
LR.fit(X_train_dtm,y_train)
y_pred=LR.predict(X_test_dtm)
print('\nLogistic Regression')
print('Accuracy Score: ', metrics.accuracy_score(y_test,y_pred)*100,'%',sep='')
print('Confusion Matrix: ',metrics.confusion_matrix(y_test,y_pred),sep='\n')


Logistic Regression
Accuracy Score: 99.34971098265896%
Confusion Matrix: 
[[593   5]
 [  4 782]]


**Testing the SVM model**

In [32]:
SVM=LinearSVC()
SVM.fit(X_train_dtm,y_train)
y_pred=SVM.predict(X_test_dtm)
print('\nSupport Vector Machine')
print('Accuracy Score: ',metrics.accuracy_score(y_test,y_pred)*100,'%',sep='')
print('Confusion Matrix: ',metrics.confusion_matrix(y_test,y_pred),sep='\n')


Support Vector Machine
Accuracy Score: 99.0606936416185%
Confusion Matrix: 
[[592   6]
 [  7 779]]


**Testing the K Nearest Neighbors model**

In [33]:
KNN=KNeighborsClassifier()
KNN.fit(X_train_dtm,y_train)
y_pred=KNN.predict(X_test_dtm)
print('\nK Nearest Neighbors')
print('Accuracy Score: ',metrics.accuracy_score(y_test,y_pred)*100,'%',sep='')
print('Confusion Matrix: ',metrics.confusion_matrix(y_test,y_pred),sep='\n')


K Nearest Neighbors
Accuracy Score: 98.48265895953757%
Confusion Matrix: 
[[585  13]
 [  8 778]]


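Logistic regression has the highest accuracy of the four models on this test split: 99.35%, compared with 99.06% for the SVM, 98.92% for Naive Bayes and 98.48% for K Nearest Neighbors. We will now retrain this model on the full dataset and use it to predict the sentiment of custom input. Note that the vectorizer below uses min_df=5, whereas the evaluation above used min_df=4.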
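
All four accuracy scores lie within about one percentage point of each other on a single 80/20 split, so the ranking could change with a different `random_state`. One way to check this is k-fold cross-validation. The following is a minimal sketch rather than part of the original analysis: the toy corpus is made up, and only two of the four models are shown. Wrapping the vectorizer and classifier in a pipeline means the vocabulary is refitted on each training fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus for illustration; replace with data.Reviews / data.Sentiment
positive = ["great movie", "loved the acting", "wonderful story", "great fun",
            "loved the movie", "wonderful acting", "great story", "loved the fun",
            "wonderful movie", "great acting"]
negative = ["awful movie", "hated the acting", "terrible story", "awful fun",
            "hated the movie", "terrible acting", "awful story", "hated the fun",
            "terrible movie", "awful acting"]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    # The pipeline refits CountVectorizer inside every fold
    pipe = make_pipeline(CountVectorizer(), model)
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean and spread across folds gives a more stable basis for choosing a model than one split alone.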

In [34]:
trainingVector=CountVectorizer(stop_words='english',ngram_range=(1,1),max_df=.80,min_df=5)
trainingVector.fit(X)
X_dtm=trainingVector.transform(X)
LR_complete=LogisticRegression()
LR_complete.fit(X_dtm,y)

print("Enter review to be analysed ",end=" ")
test=[]
test.append(input())
test_dtm=trainingVector.transform(test)
predLabel=LR_complete.predict(test_dtm)
tags=['Negative','Positive']
print("The predicted review is ",tags[predLabel[0]])

Enter review to be analysed  The predicted review is  Negative
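
The vectorize-then-predict steps in the cell above can also be bundled into a single scikit-learn pipeline, so that raw text goes in and a tag comes out. This is a sketch under assumptions: the training texts are made up, the helper name `predict_sentiment` is hypothetical, and the labels are taken to be 0 for negative and 1 for positive, matching the `tags` list.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration; the notebook trains on opinions.tsv
train_texts = ["great film", "great acting", "great story", "great fun",
               "awful film", "awful acting", "awful story", "awful fun"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline applies the fitted vectorizer automatically before predicting
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

tags = ['Negative', 'Positive']

def predict_sentiment(review):
    """Return the tag for one raw review string (hypothetical helper)."""
    label = model.predict([review])[0]
    return tags[label]

for review in ["great film", "awful film"]:
    print(review, "->", predict_sentiment(review))
```

Keeping the vectorizer and classifier in one object avoids accidentally transforming new input with a different vocabulary from the one used in training.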
