Reading the dataset and storing it in a pandas DataFrame.


In [1]:
import pandas as pd
import numpy as np
import matplotlib 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics


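# Upload the CSV through the Colab file picker (works only inside Google Colab)
from google.colab import files
files.upload()

data = pd.read_csv('Part4_Dataset.csv')
#print(data[0:2])

# First column holds the review text; second column holds the sentiment,
# mapped to +1 (positive) and -1 (negative)
reviews = data.iloc[:,0]
labels = data.iloc[:,1].map({'positive':1, 'negative':-1})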

#print(reviews[0:10])
#print(labels[0:10])

print(len(reviews))
print(reviews[1])

Saving Part4_Dataset.csv to Part4_Dataset.csv
50000
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals

Cleaning the reviews: lowercasing the text and removing punctuation marks and HTML line-break tags using regular expressions.

In [2]:
import re

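# Punctuation that is deleted outright (raw strings avoid invalid-escape warnings)
NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# Double <br /> tags, hyphens and slashes are replaced by a single space
WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def tagremoval(reviews):
    # Lowercase each review and strip punctuation, then insert spaces
    reviews = [NO_SPACE.sub("", review.lower()) for review in reviews]
    reviews = [WITH_SPACE.sub(" ", review) for review in reviews]

    return reviews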

X_clean = tagremoval(reviews)

print(X_clean[1])

a wonderful little production  the filming technique is very unassuming  very old time bbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece  the actors are extremely well chosen  michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life  the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done
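
To make the effect of the two patterns concrete, here is a minimal sketch that applies the same substitutions to a short made-up string. The sample text and the `clean` helper are illustrative assumptions and are not part of the notebook. The second call shows a limitation: a lone `<br />` is not matched by the double-tag alternative, so only its slash is replaced.

```python
import re

# Same patterns as in the cleaning cell, written as raw strings
NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def clean(text):
    """Hypothetical single-string version of tagremoval()."""
    text = NO_SPACE.sub("", text.lower())
    return WITH_SPACE.sub(" ", text)

# Made-up sample review
sample = "Great film!<br /><br />Well-made, 10/10."
cleaned = clean(sample)
print(cleaned)

# A lone <br /> is only partly handled: the slash becomes a space,
# but the rest of the tag stays in the text
single = clean("a<br />b")
print(repr(single))
```

If single tags matter for a given dataset, one option would be to broaden the first alternative to something like `<br\s*/?>`, which is an assumption about the input rather than something the notebook does.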


Converting the cleaned text into a TF-IDF matrix so that it can be used as input to the classifiers below.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

vect = TfidfVectorizer(stop_words='english', ngram_range=(1,1), max_df=0.8, min_df=5)
X_vect = vect.fit_transform(X_clean) 
print(X_vect.shape)


(50000, 38291)
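
As a rough illustration of what the vectorizer produces, the sketch below runs `TfidfVectorizer` with default settings on a three-document toy corpus (the corpus and the `toy_vect` / `X_toy` names are assumptions for illustration). In the cell above, `min_df=5` drops terms that occur in fewer than 5 reviews and `max_df=0.8` drops terms that occur in more than 80% of them; those filters are left out here because the toy corpus is too small for them.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus
docs = ["good movie", "bad movie", "good acting good plot"]

toy_vect = TfidfVectorizer()
X_toy = toy_vect.fit_transform(docs)   # sparse matrix: one row per document

vocab = sorted(toy_vect.vocabulary_)   # one column per distinct term
print(vocab)
print(X_toy.shape)

# With the default norm='l2', every row has unit Euclidean length
row_norms = np.asarray(np.sqrt(X_toy.multiply(X_toy).sum(axis=1))).ravel()
print(row_norms)
```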


Splitting the dataset into train and test sets in the ratio 80:20

In [4]:
X_train, X_test, Y_train, Y_test = train_test_split(X_vect, labels, test_size=0.2)

print(X_train.shape)
print(X_test.shape)

(40000, 38291)
(10000, 38291)
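
Two details of the split above are worth noting: no `random_state` is set, so each run produces a different split, and the vectorizer was fitted on all 50,000 reviews before splitting, so the test reviews contribute to the vocabulary and IDF weights. The sketch below shows one possible alternative on made-up data: split the raw text first with a fixed seed and stratification, then fit the vectorizer on the training part only. The toy texts and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up toy data: 5 positive and 5 negative snippets
texts = ["great film", "loved it", "wonderful acting", "superb plot", "really good",
         "terrible film", "hated it", "awful acting", "boring plot", "really bad"]
y = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]

# Fixed seed for reproducibility; stratify keeps the class balance in both parts
txt_train, txt_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, random_state=42, stratify=y)

# Learn vocabulary and IDF weights from the training text only,
# then reuse them unchanged for the test text
split_vect = TfidfVectorizer()
X_tr = split_vect.fit_transform(txt_train)
X_te = split_vect.transform(txt_test)

print(X_tr.shape, X_te.shape)
```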


Using truncated SVD to reduce the dimensionality of the sparse TF-IDF matrix. Truncated SVD was chosen because it lets us specify the number of components in the reduced matrix and it works directly on sparse input.

In [5]:
from sklearn.decomposition import TruncatedSVD

SVD_model = TruncatedSVD(n_components=1000, algorithm='arpack')
X_train_SVD = SVD_model.fit_transform(X_train)
X_test_SVD = SVD_model.transform(X_test)
print(X_train_SVD.shape)

(40000, 1000)
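
One way to judge whether a given number of components is reasonable is to look at how much variance the reduced matrix retains. The sketch below does this on a small random sparse matrix; the matrix, its size, and the choice of 10 components are arbitrary assumptions, so the printed figure says nothing about the review data.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Made-up sparse, non-negative matrix standing in for a TF-IDF matrix
rng = np.random.default_rng(0)
dense = rng.random((200, 50)) * (rng.random((200, 50)) < 0.1)
X_sparse = sparse.csr_matrix(dense)

# Fit on the sparse matrix directly, as in the cell above
svd = TruncatedSVD(n_components=10, algorithm="arpack")
X_reduced = svd.fit_transform(X_sparse)
print(X_reduced.shape)

# Fraction of the total variance captured by the kept components
retained = svd.explained_variance_ratio_.sum()
print(f"variance retained: {retained:.2%}")
```

On the real data the same attribute, `SVD_model.explained_variance_ratio_`, could be inspected after fitting.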


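Trying various algorithms for sentiment analysis. K-nearest neighbors is trained on the SVD-reduced features; logistic regression, the linear SVM, the decision tree and the random forest are trained on the full TF-IDF matrix.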

In [6]:
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(n_neighbors = 300, algorithm='kd_tree')
KNN.fit(X_train_SVD, Y_train)
y_pred = KNN.predict(X_test_SVD)
print('For K Nearest Neighbors')
print('Accuracy Score: {:.2f}%'.format(metrics.accuracy_score(Y_test,y_pred)*100))
print('Confusion Matrix: ',metrics.confusion_matrix(Y_test,y_pred))


For K Nearest Neighbors
Accuracy Score: 72.74%
Confusion Matrix:  [[2758 2276]
 [ 450 4516]]
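
The confusion matrix says more than the accuracy alone. The sketch below recomputes a few per-class figures from the matrix printed above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (-1, then 1). It suggests that this KNN model recovers most positive reviews but misclassifies a large share of the negative ones.

```python
import numpy as np

# KNN confusion matrix copied from the output above
# rows: true label (-1, 1); columns: predicted label (-1, 1)
cm = np.array([[2758, 2276],
               [ 450, 4516]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()    # share of all reviews classified correctly
recall_neg = tn / (tn + fp)        # share of negative reviews found
recall_pos = tp / (tp + fn)        # share of positive reviews found
precision_pos = tp / (tp + fp)     # share of "positive" predictions that are right

print(f"accuracy:      {accuracy:.4f}")
print(f"recall (-1):   {recall_neg:.4f}")
print(f"recall (+1):   {recall_pos:.4f}")
print(f"precision (+1): {precision_pos:.4f}")
```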


In [15]:
"""from sklearn.naive_bayes import MultinomialNB

NB = MultinomialNB()
NB.fit(X_train_SVD, Y_train)
y_pred = NB.predict(X_test_SVD)
print('For Naive Bayes')
print('Accuracy Score: {:.2f}%'.format(metrics.accuracy_score(Y_test,y_pred)*100))
print('Confusion Matrix: ',metrics.confusion_matrix(Y_test,y_pred))"""

# I tried Multinomial NB, but its behavior differed between runs. Some runs
# failed with the error "negative values in data passed to Multinomial NB".
# Matrix factorization methods such as SVD produce negative values, so
# Multinomial NB shouldn't be used on the SVD-reduced features.

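Since the TF-IDF matrix itself contains no negative values, one possible workaround is to train Multinomial NB on the TF-IDF features rather than on the SVD output. The sketch below illustrates this on a made-up four-document corpus and also reproduces the failure on a matrix with negative entries; the corpus, the shifted matrix and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: two positive and two negative snippets
docs = ["great wonderful film", "loved it great acting",
        "terrible boring film", "awful plot boring"]
y = [1, 1, -1, -1]

nb_vect = TfidfVectorizer()
X_tfidf = nb_vect.fit_transform(docs)     # all entries are >= 0

# Multinomial NB accepts the non-negative TF-IDF matrix
nb = MultinomialNB().fit(X_tfidf, y)
preds = nb.predict(nb_vect.transform(["great acting", "boring awful"]))
print(preds)

# A matrix with negative entries (as SVD output can have) is rejected
X_neg = X_tfidf.toarray() - 0.5
try:
    MultinomialNB().fit(X_neg, y)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```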
In [8]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(X_train, Y_train)
y_pred = LR.predict(X_test)
print('\nLogistic Regression')
print('Accuracy Score: {:.2f}%'.format(metrics.accuracy_score(Y_test,y_pred)*100))
print('Confusion Matrix: ',metrics.confusion_matrix(Y_test,y_pred))


Logistic Regression
Accuracy Score: 89.38%
Confusion Matrix:  [[4442  592]
 [ 470 4496]]


In [9]:
from sklearn.svm import LinearSVC

SVM = LinearSVC()
SVM.fit(X_train, Y_train)
y_pred = SVM.predict(X_test)
print('Support Vector Machine')
print('Accuracy Score: {:.2f}%'.format(metrics.accuracy_score(Y_test,y_pred)*100))
print('Confusion Matrix: ',metrics.confusion_matrix(Y_test,y_pred))

Support Vector Machine
Accuracy Score: 88.98%
Confusion Matrix:  [[4464  570]
 [ 532 4434]]


In [10]:
from sklearn import tree

DT = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
DT.fit(X_train, Y_train)
y_pred = DT.predict(X_test)
print("Decision Tree")
print('Accuracy Score: {:.2f}%'.format(metrics.accuracy_score(Y_test,y_pred)*100))
print('Confusion Matrix: ',metrics.confusion_matrix(Y_test,y_pred))

Decision Tree
Accuracy Score: 72.42%
Confusion Matrix:  [[3628 1406]
 [1352 3614]]


In [13]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators=100, criterion='gini')
RF.fit(X_train, Y_train)
y_pred = RF.predict(X_test)
print('For Random Forest algorithm')
print('Accuracy Score: {:.2f}%'.format(metrics.accuracy_score(Y_test,y_pred)*100))
print('Confusion Matrix: ',metrics.confusion_matrix(Y_test,y_pred))

For Random Forest algorithm
Accuracy Score: 85.38%
Confusion Matrix:  [[4333  701]
 [ 761 4205]]
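
To compare the runs side by side, the sketch below collects the accuracies printed above into a small table. The figures are copied from the outputs in this notebook; the DataFrame layout and column names are assumptions. Because the split was not seeded, small gaps such as the one between logistic regression and the linear SVM may not hold across runs.

```python
import pandas as pd

# Accuracies copied from the cell outputs above
results = pd.DataFrame({
    "model": ["K Nearest Neighbors", "Logistic Regression",
              "Support Vector Machine", "Decision Tree", "Random Forest"],
    "features": ["SVD (1000 components)", "TF-IDF", "TF-IDF", "TF-IDF", "TF-IDF"],
    "accuracy_pct": [72.74, 89.38, 88.98, 72.42, 85.38],
})

# Best-performing model first
results = results.sort_values("accuracy_pct", ascending=False).reset_index(drop=True)
print(results.to_string(index=False))
```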
