Sentiment Analysis for Online Reviews

In [25]:
# libraries to import
import string
import pandas as pd
import nltk as nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer  

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\evatr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\evatr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\evatr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


a) Downloading, reading and analyzing datasets

In [26]:
# load data in the right> format according to readme files
yelp=pd.read_csv("sentiment_labelled_sentences\yelp_labelled.txt",delimiter="\t", names=["Sentence", "Label"])
imdb=pd.read_csv("sentiment_labelled_sentences\imdb_labelled.txt",delimiter="\t", names=["Sentence", "Label"])
amazon=pd.read_csv("sentiment_labelled_sentences\labelled_amazon.txt",delimiter="\t", names=["Sentence", "Label"])

In [27]:
# check if data is balance in all three dataframes

# yelp
ones_yelp = len(yelp[yelp['Label'] == 1])
zeros_yelp = len(yelp[yelp['Label'] == 0])
print('Number of 1s in Yelp:', ones_yelp)
print('Number of 0s in Yelp:', zeros_yelp)

#imdb
ones_imdb = len(imdb[imdb['Label'] == 1])
zeros_imdb = len(imdb[imdb['Label'] == 0])
print('Number of 1s in Imdb:', ones_imdb)
print('Number of 0s in Imdb:', zeros_imdb)

#amazon
ones_amazon = len(amazon[amazon['Label'] == 1])
zeros_amazon = len(amazon[amazon['Label'] == 0])
print('Number of 1s in Amazon:', ones_amazon)
print('Number of 0s in Amazon:', zeros_amazon)

Number of 1s in Yelp: 500
Number of 0s in Yelp: 500
Number of 1s in Imdb: 386
Number of 0s in Imdb: 362
Number of 1s in Amazon: 500
Number of 0s in Amazon: 500


The data in the Yelp and Amazon files is balanced because there are the same number of 1s and 0s as labels. 
The data in the Imdb file can be considered almost balancen because the number of 1s and 0s is almost the same (386 and 362, respectively). The ratio of 1s to 0s is 386/362 = 1.067.

b) Pre-processing datasets

In [28]:
# convert all letters to lower case
yelp = yelp.applymap(lambda s:s.lower() if type(s) == str else s)
imdb = imdb.applymap(lambda s:s.lower() if type(s) == str else s)
amazon = amazon.applymap(lambda s:s.lower() if type(s) == str else s)

# lemmatize, remove punctuation, remove stop words

stop_words = set(stopwords.words('english')) # find stop words in English language
lemmatizer = WordNetLemmatizer() # declare nltk lemmatizer

# iterate through every sentence and replace it by itself lemmatized, without punctuation and without stop words
for i in yelp['Sentence'].index:
    
    # remove punctuation
    sentence_no_punct = ''
    for char in (yelp.at[i, 'Sentence']):
        if char not in string.punctuation:
            sentence_no_punct = sentence_no_punct + char
    (yelp.at[i, 'Sentence']) = sentence_no_punct
    
    # remove stop words
    word_tokens = word_tokenize(yelp.at[i, 'Sentence'])
    (yelp.at[i, 'Sentence']) = ''
    for word in word_tokens: 
        if word not in stop_words: 
            (yelp.at[i, 'Sentence']) += (' ' + lemmatizer.lemmatize(word)) # check why it's not lemmatizing!
            
    # lemmatize words
    #(yelp.at[i, 'Sentence']) = lemmatizer.lemmatize(yelp.at[i, 'Sentence'])
    
print(yelp['Sentence']) # check it worked correctly

0                                        wow loved place
1                                             crust good
2                                    tasty texture nasty
3       stopped late may bank holiday rick steve reco...
4                             selection menu great price
5                            getting angry want damn pho
6                             honeslty didnt taste fresh
7       potato like rubber could tell made ahead time...
8                                              fry great
9                                            great touch
10                                        service prompt
11                                         would go back
12      cashier care ever say still ended wayyy overp...
13       tried cape cod ravoli chickenwith cranberrymmmm
14                      disgusted pretty sure human hair
15                            shocked sign indicate cash
16                                    highly recommended
17                          wai

For this part, we decided to convert all sentences to lower case, so that the same word with some upper case letters and without them would not be detected as different words since we are using the string type which takes into account their differences. We also stripped the sentences of stop words because they do not add any meaning as the same stop words appear in many different sentences. Additionally, we removed the punctuation because it does not add any meaning to the word analysis exercise we will do in this question. We also lemmatized all the words because we are interested in knowing which class of words they belong to in order to understand the meaning of the sentence and not whether they are a noun, adjective, etc.

c) Split training and testing data

In [29]:
# split the three datasets into training and testing data according to the specifications

# split yelp
training_yelp = (yelp.query('Label == 1' )).head(400)
training_yelp.append((yelp.query('Label == 0' )).head(400))
testing_yelp = (yelp.query('Label == 1' )).tail(100)
testing_yelp.append((yelp.query('Label == 0' )).tail(100))

# split imdb
training_imdb = (imdb.query('Label == 1' )).head(400)
training_imdb.append((imdb.query('Label == 0' )).head(400))
testing_imdb = (imdb.query('Label == 1' )).tail(100)
testing_imdb.append((imdb.query('Label == 0' )).tail(100))

# split amazon
training_amazon = (amazon.query('Label == 1' )).head(400)
training_amazon.append((amazon.query('Label == 0' )).head(400))
testing_amazon = (amazon.query('Label == 1' )).tail(100)
testing_amazon.append((amazon.query('Label == 0' )).tail(100))

Unnamed: 0,Sentence,Label
778,this is a great deal.,1
787,it is simple to use and i like it.,1
788,"it's a great tool for entertainment, communica...",1
791,i own 2 of these cases and would order another.,1
792,great phone.,1
793,i bought this battery with a coupon from amazo...,1
795,perfect for the ps3.,1
796,"five star plus, plus.",1
797,a good quality bargain.. i bought this after i...,1
800,"good , works fine.",1


d) Bag of Words model

For this question we cannot use the testing set to create the dictionary of unique words because the model needs to be created with the training set so that we can use the testing set as new data to test our model's ability to generalize. If we create the dictionary with the testing data, we are esentially using all the data as training data and would need to look for another set of new data to test the classifier.

In [37]:
# DID IT ONLY FOR YELP - NEED TO CHECK IF WE SHOULD DO THE 3 OF THEM SEPARATELY OR JUST ONE FOR ALL OF THEM,
# IT IS THE SAME DOUBT IN C)


# create dictionary of unique words in training set
word_dictionary = {}

# iterate through every word or every sentence and store it in dictionary with count 0 (the count will be updated
# later when we iterate through both testing and training set
for i in training_yelp.index:
    word_tokens_training = word_tokenize(training_yelp.at[i, 'Sentence'])
    for word in word_tokens_training:
        if word not in word_dictionary.keys():
            word_dictionary[word] = 0
            
# count the number of occurences of each word in dictionary in training set
for i in training_yelp.index:
    word_tokens_training = word_tokenize(training_yelp.at[i, 'Sentence'])
    for word in word_tokens_training:
        if word in word_dictionary.keys():
            word_dictionary[word] += 1
            
# count the number of occurences of each word in dictionary in testing set  
for i in testing_yelp.index:
    word_tokens_testing = word_tokenize(testing_yelp.at[i, 'Sentence'])
    for word in word_tokens_testing:
        if word in word_dictionary.keys():
            word_dictionary[word] += 1

# create one feature vector per review
feature_column = [] # list to store the feature vectors and add to dataframe at the end
j = 0; # to indicate position in iterating through feature vector

for i in training_yelp.index:
    word_tokens_training = word_tokenize(training_yelp.at[i, 'Sentence'])
    feature_vector = [0] * len(word_dictionary.keys()) # to store feature vector in each iteration
    for dict_word in word_dictionary.keys():
        if dict_word in word_tokens_training:
            feature_vector[j] = word_dictionary[dict_word]
            j += 1
    feature_column.append(feature_vector) 
    j = 0

In [71]:
word_dictionary # just checking here

{'wow': 3,
 'loved': 10,
 'place': 60,
 'stopped': 2,
 'late': 1,
 'may': 3,
 'bank': 1,
 'holiday': 1,
 'rick': 1,
 'steve': 1,
 'recommendation': 3,
 'selection': 10,
 'menu': 13,
 'great': 70,
 'price': 13,
 'fry': 6,
 'touch': 2,
 'service': 46,
 'prompt': 1,
 'tried': 5,
 'cape': 1,
 'cod': 1,
 'ravoli': 1,
 'chickenwith': 1,
 'cranberrymmmm': 1,
 'highly': 3,
 'recommended': 2,
 'food': 60,
 'amazing': 21,
 'also': 18,
 'cute': 2,
 'could': 8,
 'care': 2,
 'le': 1,
 'interior': 1,
 'beautiful': 3,
 'performed': 1,
 'thats': 1,
 'rightthe': 1,
 'red': 1,
 'velvet': 1,
 'cakeohhh': 1,
 'stuff': 2,
 'good': 73,
 'hole': 1,
 'wall': 3,
 'mexican': 2,
 'street': 1,
 'taco': 4,
 'friendly': 23,
 'staff': 15,
 'combo': 1,
 'like': 18,
 'burger': 6,
 'beer': 8,
 '23': 1,
 'decent': 1,
 'deal': 4,
 'found': 3,
 'accident': 1,
 'happier': 1,
 'overall': 4,
 'lot': 5,
 'redeeming': 1,
 'quality': 5,
 'restaurant': 17,
 'inexpensive': 2,
 'ample': 1,
 'portion': 5,
 'first': 13,
 'visit': 4,

In [38]:
feature_column # just checking here

[[3,
  10,
  60,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0