Sentiment Analysis for Online Reviews

In [21]:
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\evatr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\evatr\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


a) Downloading, reading and analyzing datasets

In [105]:
# load data in the right format according to the readme files
# (raw strings keep the backslashes in the Windows paths from being read as escape sequences)
yelp=pd.read_csv(r"sentiment_labelled_sentences\yelp_labelled.txt",delimiter="\t", names=["Sentence", "Label"])
imdb=pd.read_csv(r"sentiment_labelled_sentences\imdb_labelled.txt",delimiter="\t", names=["Sentence", "Label"])
amazon=pd.read_csv(r"sentiment_labelled_sentences\labelled_amazon.txt",delimiter="\t", names=["Sentence", "Label"])

In [7]:
# check if data is balanced in all three dataframes

# yelp
onesYelp = len(yelp[yelp['Label'] == 1])
zerosYelp = len(yelp[yelp['Label'] == 0])
print('Number of 1s in Yelp:', onesYelp)
print('Number of 0s in Yelp:', zerosYelp)

#imdb
onesImdb = len(imdb[imdb['Label'] == 1])
zerosImdb = len(imdb[imdb['Label'] == 0])
print('Number of 1s in Imdb:', onesImdb)
print('Number of 0s in Imdb:', zerosImdb)

#amazon
onesAmazon = len(amazon[amazon['Label'] == 1])
zerosAmazon = len(amazon[amazon['Label'] == 0])
print('Number of 1s in Amazon:', onesAmazon)
print('Number of 0s in Amazon:', zerosAmazon)

Number of 1s in Yelp: 500
Number of 0s in Yelp: 500
Number of 1s in Imdb: 386
Number of 0s in Imdb: 362
Number of 1s in Amazon: 500
Number of 0s in Amazon: 500


The data in the Yelp and Amazon files is balanced because each file contains the same number of 1 and 0 labels (500 each).
The data in the Imdb file can be considered almost balanced because the numbers of 1s and 0s are close (386 and 362, respectively). The ratio of 1s to 0s is 386/362 ≈ 1.066.
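
The balance check above can be written once and reused for each dataset. The following is a minimal sketch, not the notebook's own code: `label_balance` is a hypothetical helper name, and it assumes the labels are the integers 1 and 0. It works on any iterable of labels, so a column such as `yelp['Label']` should be accepted as well.

```python
from collections import Counter

def label_balance(labels):
    """Return (number of 1s, number of 0s, ratio of 1s to 0s) for binary labels."""
    counts = Counter(labels)
    ones, zeros = counts[1], counts[0]
    # Guard against division by zero when a dataset has no 0 labels.
    ratio = ones / zeros if zeros else float("inf")
    return ones, zeros, ratio

# Reproduce the Imdb figures reported above: 386 ones and 362 zeros.
ones, zeros, ratio = label_balance([1] * 386 + [0] * 362)
print(ones, zeros, round(ratio, 3))  # prints: 386 362 1.066
```

A ratio of exactly 1.0 corresponds to a perfectly balanced file, as in the Yelp and Amazon counts.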

b) Pre-processing datasets

In [106]:
# convert all letters to lower case (only the Sentence column holds text)
yelp['Sentence'] = yelp['Sentence'].str.lower()
imdb['Sentence'] = imdb['Sentence'].str.lower()
amazon['Sentence'] = amazon['Sentence'].str.lower()

# remove stop words

stopWords = set(stopwords.words('english')) # English stop words from NLTK

# replace every Yelp sentence with its list of tokens minus the stop words;
# each cell then holds a Python list, which is why the printed output
# appears in brackets with commas between the tokens
for i in yelp['Sentence'].index:
    wordTokens = word_tokenize(yelp.at[i, 'Sentence'])
    yelp.at[i, 'Sentence'] = [word for word in wordTokens if word not in stopWords]

print(yelp['Sentence']) # check it worked correctly

0                            [wow, ..., loved, place, .]
1                                       [crust, good, .]
2                             [tasty, texture, nasty, .]
3      [stopped, late, may, bank, holiday, rick, stev...
4                    [selection, menu, great, prices, .]
5                   [getting, angry, want, damn, pho, .]
6                    [honeslty, n't, taste, fresh, ., )]
7      [potatoes, like, rubber, could, tell, made, ah...
8                                      [fries, great, .]
9                                      [great, touch, .]
10                                  [service, prompt, .]
11                                  [would, go, back, .]
12     [cashier, care, ever, say, still, ended, wayyy...
13     [tried, cape, cod, ravoli, ,, chicken, ,, cran...
14             [disgusted, pretty, sure, human, hair, .]
15                   [shocked, signs, indicate, cash, .]
16                              [highly, recommended, .]
17                  [waitress, 

For this part, we converted all sentences to lower case so that the same word written with and without upper-case letters is not treated as two different words; string comparison is case-sensitive, so "Great" and "great" would otherwise be counted separately. We also stripped stop words from the sentences because they appear in many sentences regardless of sentiment and therefore add little information for telling the two labels apart.
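
The two pre-processing steps can be illustrated on a single sentence. This is a minimal sketch under stated assumptions: `strip_stop_words` and `DEMO_STOP_WORDS` are hypothetical names, the tiny stop-word set stands in for `stopwords.words('english')`, and a simple regular expression stands in for `word_tokenize`, so the tokens may differ from NLTK's on harder inputs.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses
# NLTK's English stop-word list instead.
DEMO_STOP_WORDS = {"the", "were", "was", "too", "a", "and", "this"}

def strip_stop_words(sentence, stop_words=DEMO_STOP_WORDS):
    """Lower-case a sentence, split it into tokens, and drop stop words."""
    # Simplified tokenizer (assumption): runs of letters, digits, and
    # apostrophes form words; any other non-space character is its own token.
    tokens = re.findall(r"[a-z0-9']+|[^\sa-z0-9']", sentence.lower())
    return [tok for tok in tokens if tok not in stop_words]

print(strip_stop_words("The fries were GREAT too."))  # prints: ['fries', 'great', '.']
```

Because the sentence is lower-cased before the lookup, "GREAT" and "great" produce the same token, and the stop-word check only needs lower-case entries.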

c) Split training and testing data

In [78]:
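# per label: first 400 rows for training, last 100 rows for testing
# (DataFrame.append returned a new frame that was discarded, and it was removed
# in pandas 2.0, so the two label groups are combined with pd.concat instead)
training_yelp = pd.concat([yelp.query('Label == 1').head(400),
                           yelp.query('Label == 0').head(400)])
testing_yelp = pd.concat([yelp.query('Label == 1').tail(100),
                          yelp.query('Label == 0').tail(100)])
testing_yelp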

Unnamed: 0,Sentence,Label
715,only pros : large seating area/ nice bar area/...,1
716,they have a really nice atmosphere.,1
718,"after one bite, i was hooked.",1
720,"cute, quaint, simple, honest.",1
721,the chicken was deliciously seasoned and had t...,1
722,"the food was great as always, compliments to t...",1
723,special thanks to dylan t. for the recommendat...,1
724,awesome selection of beer.,1
725,great food and awesome service!,1
726,one nice thing was that they added gratuity on...,1


In [79]:
training_yelp

Unnamed: 0,Sentence,Label
0,wow... loved this place.,1
3,stopped by during the late may bank holiday of...,1
4,the selection on the menu was great and so wer...,1
8,the fries were great too.,1
9,a great touch.,1
10,service was very prompt.,1
13,"i tried the cape cod ravoli, chicken,with cran...",1
16,highly recommended.,1
21,"the food, amazing.",1
22,service is also cute.,1
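
The split in this section takes, for each label, the first rows for training and the last rows for testing. A minimal sketch of that idea on plain Python data follows; `split_by_label` is a hypothetical helper, and the rows are assumed to be `(sentence, label)` pairs with labels 1 and 0. It illustrates the logic only and is not the notebook's implementation.

```python
def split_by_label(rows, n_train, n_test):
    """Split (sentence, label) rows into train and test lists, taking the
    first n_train and the last n_test rows of each label."""
    train, test = [], []
    for label in (1, 0):
        group = [row for row in rows if row[1] == label]
        if n_train + n_test > len(group):
            raise ValueError(f"not enough rows with label {label}")
        train.extend(group[:n_train])
        # Slice from an explicit start index so that n_test == 0 yields no rows.
        test.extend(group[len(group) - n_test:])
    return train, test

# Toy data: five positive and five negative rows.
rows = [(f"pos {i}", 1) for i in range(5)] + [(f"neg {i}", 0) for i in range(5)]
train, test = split_by_label(rows, n_train=4, n_test=1)
print(len(train), len(test))  # prints: 8 2
```

Splitting each label separately keeps the training and testing sets as balanced as the original file, which matters here because the Yelp data has exactly 500 rows of each label.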
