# Part 1: Using the TextBlob Sentiment Analyzer

In [73]:
# Load the dataset as a Pandas data frame.

import pandas as pd

movies_df = pd.read_csv(r'C:\Users\Riaz\Desktop\MSDS\Data Mining\Week 3\labeledTrainData.tsv\labeledTrainData.tsv', sep='\t')

# read_csv loads the TSV file into a DataFrame; sep='\t' tells pandas that the columns are tab-separated.
movies_df

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
...,...,...,...
24995,3453_3,0,It seems like more consideration has gone into...
24996,5064_1,0,I don't believe they made this film. Completel...
24997,10905_3,0,"Guy is a loser. Can't get girls, needs to buil..."
24998,10194_3,0,This 30 minute documentary Buñuel made in the ...


In [74]:
# How many of each positive and negative reviews are there?

# value_counts() on the sentiment column returns the number of rows for each distinct value.

movies_df['sentiment'].value_counts()



sentiment
1    12500
0    12500
Name: count, dtype: int64

In the sentiment column, 1 marks a positive review and 0 marks a negative review.  
The output above shows that the 25000 rows in the dataframe are evenly split: 12500 positive and 12500 negative reviews.  

TextBlob is used to classify each movie review as positive or negative.   
A polarity score greater than zero is treated as positive sentiment, and a score of zero or less as negative sentiment.  
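
How a score of exactly zero is labeled depends on whether the comparison is strict (`> 0`) or inclusive (`>= 0`). The following is a minimal sketch of the thresholding rule on a few hand-picked scores; `label_from_polarity` and `include_zero` are hypothetical names used only for illustration and are not part of TextBlob.

```python
def label_from_polarity(score, include_zero=False):
    """Map a polarity score in [-1, 1] to 1 (positive) or 0 (negative)."""
    if include_zero:
        return 1 if score >= 0 else 0
    return 1 if score > 0 else 0

# Two scores taken from the rounded values shown later, plus an exact zero.
scores = [0.256349, 0.0, -0.053941]

strict = [label_from_polarity(s) for s in scores]
inclusive = [label_from_polarity(s, include_zero=True) for s in scores]

print(strict)     # [1, 0, 0]
print(inclusive)  # [1, 1, 0]
```

Only reviews whose polarity is exactly zero change label between the two conventions.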

In [75]:
# Install TextBlob and download the corpora it requires.

!pip install textblob
!python -m textblob.download_corpora
from textblob import TextBlob

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Riaz\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Riaz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Riaz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Riaz\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\Riaz\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Riaz\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [76]:
# Return the TextBlob polarity and subjectivity for one review.

def analysis(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

# apply() yields one (polarity, subjectivity) tuple per review; zip(*...) unpacks the tuples into two new columns.

movies_df['polarity'], movies_df['subjectivity'] = zip(*movies_df['review'].apply(analysis))
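
The `zip(*...)` idiom used here turns a sequence of pairs into a pair of sequences. A small standalone sketch, using the rounded values of the first three rows shown in the next output as sample pairs:

```python
# Each tuple is (polarity, subjectivity) for one review.
pairs = [(0.001277, 0.606746), (0.256349, 0.531111), (-0.053941, 0.562933)]

# The star unpacks the list into separate arguments, so zip pairs up
# all first elements and all second elements.
polarity, subjectivity = zip(*pairs)

print(polarity)      # (0.001277, 0.256349, -0.053941)
print(subjectivity)  # (0.606746, 0.531111, 0.562933)
```

Each result is a tuple with one entry per review, which pandas accepts as a column.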


In [77]:
# The polarity scores are continuous, so add a column that maps them to binary labels:
# 1 if the score is greater than 0, otherwise 0.

movies_df['polarity_result'] = movies_df['polarity'].apply(lambda x: 1 if x > 0 else 0)
movies_df.head()


Unnamed: 0,id,sentiment,review,polarity,subjectivity,polarity_result
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0


In [78]:
# Import accuracy_score from scikit-learn and compare the TextBlob labels with the true labels.

from sklearn.metrics import accuracy_score
print("The accuracy score of the TextBlob model is", accuracy_score(movies_df['sentiment'], movies_df['polarity_result']))

The accuracy score of the TextBlob model is 0.68528


Yes, this model is better than random guessing. Because the dataset is evenly split between positive and negative reviews, random guessing would be expected to reach about 50% accuracy, whereas the TextBlob model reaches about 68.5%.
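
Accuracy here is the fraction of reviews whose predicted label equals the true label. The sketch below recomputes the 50% baseline by hand from the class counts reported earlier; the `accuracy` helper is a hypothetical stand-in for `accuracy_score`, and a constant "always positive" guess is used as the simplest uninformed baseline.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Class counts reported by value_counts() earlier in the notebook.
n_pos, n_neg = 12500, 12500
y_true = [1] * n_pos + [0] * n_neg

# A constant guess is right only on the reviews of the guessed class.
always_positive = [1] * len(y_true)
baseline = accuracy(y_true, always_positive)

print(baseline)  # 0.5
```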

Next, the same classification is repeated with another prebuilt sentiment analyzer, VADER, and the results are compared.

In [79]:
# Importing the required libraries

import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Riaz\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [80]:
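# Create the analyzer once and reuse it instead of rebuilding it for every review.
sid = SentimentIntensityAnalyzer()

def analysis_vader(text):
    # The compound score summarizes the sentiment of the text in the range [-1, 1].
    return sid.polarity_scores(text)['compound']

movies_df['polarity_vader'] = movies_df['review'].apply(analysis_vader)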

In [81]:
# The compound scores are continuous, so add a column that maps them to binary labels:
# 1 if the score is greater than 0, otherwise 0.

movies_df['polarity_vader_result'] = movies_df['polarity_vader'].apply(lambda x: 1 if x > 0 else 0)
movies_df.head()

Unnamed: 0,id,sentiment,review,polarity,subjectivity,polarity_result,polarity_vader,polarity_vader_result
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,-0.8278,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,0.9819,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,-0.9883,0
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,-0.2189,0
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0.796,1


In [82]:
# Compare the VADER labels with the true labels.

print("The accuracy score of the Vader model is", accuracy_score(movies_df['sentiment'], movies_df['polarity_vader_result']))

The accuracy score of the Vader model is 0.69364


The accuracy of the VADER model is 0.69364, slightly higher than TextBlob's 0.68528.
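
Similar overall accuracy does not mean the two models label the same reviews the same way. As an illustration only, the sketch below uses the five rows displayed in the `head()` output above; five rows are far too few to say anything about the full dataset, and the `accuracy` helper is a hypothetical stand-in for `accuracy_score`.

```python
# Labels copied from the five rows shown in the head() output.
sentiment = [1, 1, 0, 0, 1]
textblob_pred = [1, 1, 0, 1, 0]
vader_pred = [0, 1, 0, 0, 1]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Row positions where the two models assign different labels.
disagreements = [i for i, (a, b) in enumerate(zip(textblob_pred, vader_pred)) if a != b]

print(accuracy(sentiment, textblob_pred))  # 0.6
print(accuracy(sentiment, vader_pred))     # 0.8
print(disagreements)                       # [0, 3, 4]
```

On these five rows the models disagree on three reviews, even though their accuracies on the full dataset differ by less than one percentage point.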

# Part 2: Prepping Text for a Custom Model

In [83]:
# Convert all the movie reviews to lower case with the pandas .str.lower() string method.

movies_df['review']=movies_df['review'].str.lower()
movies_df.head()

Unnamed: 0,id,sentiment,review,polarity,subjectivity,polarity_result,polarity_vader,polarity_vader_result
0,5814_8,1,with all this stuff going down at the moment w...,0.001277,0.606746,1,-0.8278,0
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",0.256349,0.531111,1,0.9819,1
2,7759_3,0,the film starts with a manager (nicholas bell)...,-0.053941,0.562933,0,-0.9883,0
3,3630_4,0,it must be assumed that those who praised this...,0.134753,0.492901,1,-0.2189,0
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0.796,1


In [84]:
# Remove punctuation: build a translation table that maps every Unicode code point whose category
# starts with 'P' (punctuation) to None, then apply str.translate to each review.
# The result is stored in a new column, 'review_trans', for easier reference.

import unicodedata
import sys

punctuation = dict.fromkeys((i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P')), None)

def trans(text_data):
    return text_data.translate(punctuation)

movies_df['review_trans'] = movies_df['review'].apply(trans)
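
To see what this translation table does and does not remove, here is a small standalone sketch on made-up sample strings. Only characters in the Unicode punctuation categories (those starting with `P`) are dropped; symbols such as `$` belong to a symbol category and are kept.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po) to None.
punctuation = dict.fromkeys(
    (i for i in range(sys.maxunicode + 1)
     if unicodedata.category(chr(i)).startswith("P")),
    None,
)

sample = "i've seen it, twice (really)!"
cleaned = sample.translate(punctuation)
print(cleaned)  # ive seen it twice really

# The dollar sign is a currency symbol (category Sc), so it survives.
print("$5!".translate(punctuation))  # $5
```

Note that contractions are joined rather than split: `i've` becomes `ive`.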



In [85]:
# Remove stop words with NLTK: tokenize each review into words with word_tokenize,
# drop the tokens that appear in the English stop-word list, and rejoin the rest.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

# A set gives faster membership tests than a list.
stop_words = set(stopwords.words('english'))


def stopw(text_data):
    words = word_tokenize(text_data)
    filtered_text = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_text)


movies_df['review_trans'] = movies_df['review_trans'].apply(stopw)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Riaz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
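
The order of the preprocessing steps affects which tokens are filtered. Because punctuation is stripped first, a contraction such as `i've` reaches the stop-word filter as `ive`, which a stop list may not contain; this is one plausible reason the token `ive` survives in the processed text. The sketch below illustrates the effect with a deliberately tiny, hypothetical stop list and plain whitespace splitting instead of `word_tokenize`.

```python
# Hypothetical stop list for illustration only; NLTK's English list is much longer.
STOP_WORDS = {"i", "the", "a", "this", "is", "have", "i've"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

# With the apostrophe intact, the contraction matches the stop list.
print(remove_stop_words("i've seen this film"))  # seen film

# After punctuation removal it no longer matches and is kept.
print(remove_stop_words("ive seen this film"))   # ive seen film
```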


In [86]:
# Apply the NLTK Porter stemmer to each token after tokenizing.

from nltk.stem.porter import PorterStemmer

# Create the stemmer once and reuse it for every review.
stemmer = PorterStemmer()

def porter(text_data):
    words = word_tokenize(text_data)
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)


movies_df['review_trans'] = movies_df['review_trans'].apply(porter)

print(movies_df.head(5))

       id  sentiment                                             review  \
0  5814_8          1  with all this stuff going down at the moment w...   
1  2381_9          1  \the classic war of the worlds\" by timothy hi...   
2  7759_3          0  the film starts with a manager (nicholas bell)...   
3  3630_4          0  it must be assumed that those who praised this...   
4  9495_8          1  superbly trashy and wondrously unpretentious 8...   

   polarity  subjectivity  polarity_result  polarity_vader  \
0  0.001277      0.606746                1         -0.8278   
1  0.256349      0.531111                1          0.9819   
2 -0.053941      0.562933                0         -0.9883   
3  0.134753      0.492901                1         -0.2189   
4 -0.024842      0.459818                0          0.7960   

   polarity_vader_result                                       review_trans  
0                      0  stuff go moment mj ive start listen music watc...  
1                   

In [87]:
# Create the bag-of-words matrix with scikit-learn's CountVectorizer.
# fit_transform returns a SciPy sparse matrix; csr_matrix() makes the CSR format explicit when printing.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
movies = vectorizer.fit_transform(movies_df['review_trans'])
print("The dimensions of sparse matrix are \n", movies.shape, '\n')
print("The sparse matrix is\n", csr_matrix(movies), '\n')



The dimensions of sparse matrix are 
 (25000, 82838) 

The sparse matrix is
   (0, 70119)	1
  (0, 30205)	3
  (0, 47585)	1
  (0, 47382)	11
  (0, 37797)	2
  (0, 69106)	2
  (0, 42551)	1
  (0, 49059)	2
  (0, 79282)	3
  (0, 51891)	1
  (0, 20911)	1
  (0, 80877)	1
  (0, 47870)	2
  (0, 45387)	3
  (0, 79058)	3
  (0, 29616)	1
  (0, 13409)	1
  (0, 36851)	1
  (0, 31743)	2
  (0, 73290)	1
  (0, 59761)	2
  (0, 16530)	2
  (0, 22951)	1
  (0, 44236)	2
  (0, 46886)	1
  :	:
  (24999, 23441)	1
  (24999, 59431)	1
  (24999, 48161)	1
  (24999, 50381)	1
  (24999, 80936)	3
  (24999, 14158)	3
  (24999, 17437)	1
  (24999, 11143)	1
  (24999, 33068)	1
  (24999, 74174)	1
  (24999, 25905)	1
  (24999, 14316)	1
  (24999, 31186)	1
  (24999, 71811)	1
  (24999, 76633)	1
  (24999, 11140)	1
  (24999, 37167)	1
  (24999, 19520)	1
  (24999, 19182)	1
  (24999, 17719)	1
  (24999, 76615)	1
  (24999, 40879)	1
  (24999, 67742)	1
  (24999, 76541)	1
  (24999, 14283)	2 
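
Each printed entry `(row, column)  count` says that the review in that row contains the vocabulary term in that column the given number of times; zero counts are not stored. The following is a minimal pure-Python sketch of the same idea on two made-up documents. It splits on whitespace, which is a simplification of `CountVectorizer`'s default tokenizer, and the variable names are illustrative.

```python
from collections import Counter

docs = ["good film good cast", "bad film"]

# Vocabulary: unique tokens in alphabetical order, each mapped to a column index.
vocab = {tok: j for j, tok in enumerate(sorted({t for d in docs for t in d.split()}))}

# Sparse representation: only nonzero counts are stored, keyed by (row, column).
sparse = {}
for i, doc in enumerate(docs):
    for tok, n in Counter(doc.split()).items():
        sparse[(i, vocab[tok])] = n

shape = (len(docs), len(vocab))
print(vocab)   # {'bad': 0, 'cast': 1, 'film': 2, 'good': 3}
print(shape)   # (2, 4)
print(sparse[(0, vocab["good"])])  # 2
```

The shape is (number of documents, vocabulary size), which is how 25000 reviews produce a matrix with 82838 columns.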



In [88]:
# Create the tf-idf matrix with scikit-learn's TfidfVectorizer.
# csr_matrix() (imported in the previous cell) makes the CSR format explicit when printing.

# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
movies_tfidf_matrix = tfidf.fit_transform(movies_df['review_trans'])

# Show tf-idf feature matrix
print("The dimensions of tfidf sparse matrix are \n", movies_tfidf_matrix.shape, '\n')
print("The sparse tfidf matrix is\n", csr_matrix(movies_tfidf_matrix))

The dimensions of tfidf sparse matrix are 
 (25000, 82838) 

The sparse tfidf matrix is
   (0, 41245)	0.043725100179534154
  (0, 34706)	0.028004292890656174
  (0, 42048)	0.06090031485862757
  (0, 66141)	0.07382701969817851
  (0, 70166)	0.031383838501484636
  (0, 25035)	0.03246385013925293
  (0, 22999)	0.03040025882904578
  (0, 25234)	0.0252263160449798
  (0, 21249)	0.0410342540809113
  (0, 15097)	0.03153700591378916
  (0, 8349)	0.03356816689513565
  (0, 20089)	0.027365141847489616
  (0, 21196)	0.019406600525887933
  (0, 70289)	0.08513943090777847
  (0, 29237)	0.03355349030884784
  (0, 6417)	0.035743700698454256
  (0, 79688)	0.03703344101521702
  (0, 56027)	0.04332696366859673
  (0, 30830)	0.042901118540963616
  (0, 24382)	0.021765801837110497
  (0, 71645)	0.031145689444850307
  (0, 75259)	0.030962185606040454
  (0, 29892)	0.026493026546793894
  (0, 11435)	0.04499150205708709
  (0, 8741)	0.08183076273260166
  :	:
  (24999, 77237)	0.1032011653188209
  (24999, 49847)	0.06184906567580395
 

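As the output above shows, the bag-of-words matrix and the tf-idf matrix have the same dimensions, (25000, 82838): one row per review and one column per vocabulary term. Only the stored values differ (raw counts versus tf-idf weights).  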
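
The shapes match because tf-idf only reweights the counts; it does not add or remove rows or columns. The sketch below shows this on two made-up documents in pure Python. It assumes the smoothed idf formula and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`), and it splits on whitespace rather than using the library's tokenizer.

```python
import math
from collections import Counter

docs = ["good film good cast", "bad film"]
vocab = sorted({t for d in docs for t in d.split()})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for d in docs for t in set(d.split()))

# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    counts = Counter(doc.split())
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(d) for d in docs]
shape = (len(matrix), len(vocab))
print(shape)  # (2, 4)
```

A term that is absent from a document has weight zero in both matrices, so the nonzero positions also coincide.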