# Imports and reading in the Amazon review data

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

%matplotlib inline

In [2]:
# None means "no truncation"; the older -1 value is deprecated in recent pandas.
pd.set_option('display.max_colwidth', None)

In [3]:
# pd.set_option('display.max_columns', None)

In [4]:
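# Note: header=0 treats the file's first line as a header and replaces it with `names`.
# If train.csv has no header row, header=None keeps that first line as data.
amazon = pd.read_csv('./datasets/training_data/amazon_reviews/train.csv', header=0, names=['sentiment', 'title', 'review'])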
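
A minimal sketch of how `header=0` and `header=None` differ when `names=` is supplied. The two-line CSV below is made up and only stands in for `train.csv`; whether the real file has a header row is an assumption worth checking.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with no header line, standing in for train.csv.
raw = '5,"Inspiring","Great vocals"\n1,"Awful","Do not buy"\n'
cols = ['sentiment', 'title', 'review']

# header=0: the first line is treated as a header and replaced by `names`.
with_header0 = pd.read_csv(io.StringIO(raw), header=0, names=cols)

# header=None: every line is kept as data.
with_header_none = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(with_header0), len(with_header_none))
```

The 2,999,999 rows and the 599,999 three-star reviews reported further down would be consistent with one data line being consumed as a header, though that is only an inference from the counts.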

# EDA

In [5]:
amazon.head(2)

Unnamed: 0,sentiment,title,review
0,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature."
1,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."


## Checking for nulls

In [6]:
amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2999999 entries, 0 to 2999998
Data columns (total 3 columns):
sentiment    int64
title        object
review       object
dtypes: int64(1), object(2)
memory usage: 68.7+ MB
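
The `info()` output above lists dtypes but no non-null counts, so it does not confirm by itself that the columns are free of nulls. A minimal sketch of an explicit check, using a tiny made-up frame (`demo`) in place of `amazon`:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for `amazon`; one title is missing.
demo = pd.DataFrame({
    'sentiment': [5, 1, 3],
    'title': ['Inspiring', np.nan, 'Hard to clean!'],
    'review': ['Great vocals', 'Do not buy', 'It looks nice'],
})

# Per-column count of missing values.
null_counts = demo.isnull().sum()
print(null_counts)
```

Running `amazon.isnull().sum()` on the real frame would show whether the later `dropna` call removes anything.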


## Classes are balanced across the five star ratings

In [7]:
amazon.sentiment.value_counts()

5    600000
4    600000
2    600000
1    600000
3    599999
Name: sentiment, dtype: int64

In [8]:
soap = amazon[amazon.review.str.contains('soap', case=False)]

In [9]:
soap = soap[~soap.review.str.contains('opera', case=False)]

In [10]:
soap = soap[~soap.review.str.contains('operish', case=False)]

In [11]:
soap = soap[~soap.review.str.contains('soapbox', case=False)]

In [12]:
soap = soap[~soap.review.str.contains('soap box', case=False)]
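
The five filtering cells above can be collapsed into one include mask and one exclude mask. This is a sketch on made-up reviews (`demo`), not the real data; it applies the same substrings as the cells above.

```python
import pandas as pd

# Hypothetical reviews standing in for `amazon.review`.
demo = pd.DataFrame({'review': [
    'Washed it with soap and water.',
    'Reads like a soap opera.',
    'He gets on his Soapbox again.',
    'A bit soap operish for me.',
    'Stands on a soap box all book long.',
    'No cleaning products mentioned.',
]})

has_soap = demo.review.str.contains('soap', case=False)
# One pattern covering every exclusion used in the cells above.
excluded = demo.review.str.contains(r'opera|operish|soapbox|soap box', case=False)

soap_demo = demo[has_soap & ~excluded]
print(soap_demo)
```

As in the original cells, any review containing "opera" is dropped, even when the word is unrelated to "soap opera".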

## About 0.17% of the data (~5,100 rows) contains the word soap without opera, operish, soapbox, or soap box

In [13]:
soap.head()

Unnamed: 0,sentiment,title,review
715,5,Best Tale of Scandal and Deceptjon,"I was required to read this book for English class, and it's my favorite book we've read all year :) full of juicy drama, scandal, and humor, the book is more like an entertaining Spanish soap than a boring classic novel some find it to be. If you look at the boon with the right eyes, you can enjoy it :))"
2919,2,Worse than Stevia.,Mild sweeetness - though ultimately tastes like soap - or something reminiscent of soap.
3959,2,Greasy Grapefruit!,"A sugar and oil mix with an explosive citrus fragrance. Yes, your skin will feel softer, but it's not worth it. I had to wash it off with soap and water, then clean the tub before I got out so I wouldn't fall. Take a shower and spray yourself with oil spray - it's not messy or stinky, and certainly not as slippery!"
4161,5,LOVED THEM,"All of us girls loved the card holders. It did have an odor, but I sprayed them with Fabreeze and let them dry, then washed with soap and water and let them dry again. After that all was fine. We find that we even use them when we have 3 cards left. You can set it on the table and it leaves your hands free for important things--like snacks!"
7460,2,Adds very little to the online docu,"This book adds very little to the online documentation provided by the tomcat team and seems to be published in a hurry just to be the first tomcat book in place. E.g. on p.103 at the end of chapter 5 you can read: ""In the next chapter, we cover securing a Web application using the Secure Sockets Layer (SSL)."" Well, there is no such chapter in the whole book. The same at the end of chapter 12, where Goodwill promises a Chapter on integrating the XML Apache Soap project into Tomcat. Again, no such Chapter.Covering the basics, the book does a good job, but as said, the provided ducumentation does it as well."


In [14]:
soap.describe()

Unnamed: 0,sentiment
count,5103.0
mean,2.907505
std,1.407151
min,1.0
25%,2.0
50%,3.0
75%,4.0
max,5.0


## Searching for some initial brands

### Mrs. Meyer's soap stats below

In [15]:
amazon[amazon.review.str.contains('mrs. meyer', case=False)].describe()

Unnamed: 0,sentiment
count,39.0
mean,3.102564
std,1.569359
min,1.0
25%,2.0
50%,3.0
75%,5.0
max,5.0


In [16]:
amazon[amazon.review.str.contains('mrs meyer', case=False)].describe()

Unnamed: 0,sentiment
count,6.0
mean,2.833333
std,1.602082
min,1.0
25%,1.5
50%,3.0
75%,3.75
max,5.0
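
`str.contains` treats its pattern as a regular expression by default, so the `.` in `'mrs. meyer'` matches any single character rather than a literal period. A sketch of one pattern that covers both spellings searched above, on made-up review text:

```python
import pandas as pd

# Hypothetical review snippets.
demo = pd.Series([
    "Mrs. Meyer's lavender soap is lovely.",
    "I buy Mrs Meyer hand soap every month.",
    "Mrs Meyers dish soap smells great.",
    "Dove bar soap works fine.",
])

# `\.?` makes the escaped period optional, so both spellings match.
mask = demo.str.contains(r'mrs\.? meyer', case=False)
print(mask.tolist())
```

Applied to `amazon.review`, a single mask like this would avoid counting the two spellings in separate cells.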


In [17]:
soap[soap.review.str.contains('meyer', case=False)].describe()

Unnamed: 0,sentiment
count,25.0
mean,2.92
std,1.497776
min,1.0
25%,1.0
50%,3.0
75%,4.0
max,5.0


### Dove

In [18]:
soap[soap.review.str.contains('dove', case=False)].describe()

Unnamed: 0,sentiment
count,36.0
mean,3.138889
std,1.290687
min,1.0
25%,2.0
50%,3.0
75%,4.0
max,5.0


### Softsoap

In [19]:
soap[soap.review.str.contains('softsoap', case=False)].describe()

Unnamed: 0,sentiment
count,11.0
mean,2.818182
std,1.537412
min,1.0
25%,1.5
50%,3.0
75%,4.0
max,5.0


### Seventh Generation (searched as '7th')

In [20]:
soap[soap.review.str.contains('7th', case=False)].describe()

Unnamed: 0,sentiment
count,8.0
mean,2.75
std,1.581139
min,1.0
25%,1.75
50%,2.5
75%,3.5
max,5.0


# VADER sentiment analysis experiment

In [21]:
amazon.dropna(inplace=True)

In [22]:
amazon.head(1)

Unnamed: 0,sentiment,title,review
0,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature."


## Adding VADER sentiment score columns to the amazon DataFrame

In [26]:
analyser = SentimentIntensityAnalyzer()

In [27]:
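# polarity_scores returns a dict with the keys neg, neu, pos and compound
sentiment = amazon['review'].apply(analyser.polarity_scores)
# Expand each dict into four columns and append them; axis must be passed by keyword
amazon = pd.concat([amazon, sentiment.apply(pd.Series)], axis=1)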
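
Calling `.apply(pd.Series)` on millions of dicts can be slow. A common alternative is to build the score columns in one step from a list of dicts. The sketch below uses a made-up scorer, `fake_polarity_scores`, so that it runs without the VADER package; it only mimics the shape of the real output.

```python
import pandas as pd

# Hypothetical stand-in for analyser.polarity_scores: it returns a dict with
# the same four keys but is NOT the real VADER scorer.
def fake_polarity_scores(text):
    score = 0.5 if 'great' in text.lower() else -0.5
    return {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': score}

demo = pd.DataFrame({'review': ['Great vocals', 'Awful remix']}, index=[10, 11])

scores = demo['review'].apply(fake_polarity_scores)
# Build the score columns in one step, reusing the original index so the
# concat aligns row by row.
score_cols = pd.DataFrame(scores.tolist(), index=demo.index)
demo = pd.concat([demo, score_cols], axis=1)
print(demo)
```

Passing `index=demo.index` matters once rows have been dropped, because the frame's index is then no longer a clean 0..n-1 range.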

# Comparing VADER results to the Amazon star ratings (y actual)

In [29]:
# 2 is a placeholder meaning "not yet evaluated"; it is overwritten with 1 or 0 below
amazon['vader_correct'] = 2
amazon.head(1)

Unnamed: 0,sentiment,title,review,neg,neu,pos,compound,vader_correct
0,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature.",0.0,0.505,0.495,0.9786,2


In [30]:
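# VADER counts as correct when its compound score falls in the band matching the star rating:
# 4-5 stars -> compound >= 0.05, 3 stars -> between -0.05 and 0.05, 1-2 stars -> compound <= -0.05
conditions = [
    ((amazon['sentiment'] > 3) & (amazon['compound'] >= 0.05)) |
    ((amazon['sentiment'] == 3) & (amazon['compound'] < 0.05) & (amazon['compound'] > -0.05)) |
    ((amazon['sentiment'] < 3) & (amazon['compound'] <= -0.05))
]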
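
The three-way rule in the cell above can also be written as two label columns that are then compared. This is a sketch on made-up rows with the same ±0.05 cut-offs; `rating_label` and `vader_label` are illustrative names, not columns from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings and compound scores.
demo = pd.DataFrame({
    'sentiment': [5, 3, 1, 2, 4],
    'compound': [0.9786, 0.01, -0.7351, 0.3829, -0.2],
})

# Star rating -> class: above 3 is positive, 3 is neutral, below 3 is negative.
rating_label = np.select(
    [demo['sentiment'] > 3, demo['sentiment'] < 3],
    ['pos', 'neg'],
    default='neu',
)

# Compound score -> class, using the same +/-0.05 cut-offs as the cell above.
vader_label = np.select(
    [demo['compound'] >= 0.05, demo['compound'] <= -0.05],
    ['pos', 'neg'],
    default='neu',
)

# 1 where the two labels agree, 0 otherwise.
demo['vader_correct'] = (rating_label == vader_label).astype(int)
print(demo)
```

Keeping the labels as separate columns makes it possible to see which kind of disagreement occurs, for example low star ratings that VADER scores as positive.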

In [31]:
amazon[conditions[0]]

Unnamed: 0,sentiment,title,review,neg,neu,pos,compound,vader_correct
0,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature.",0.000,0.505,0.495,0.9786,2
1,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.",0.019,0.851,0.129,0.8481,2
2,4,Chrono Cross OST,"The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; ""Scars Left by Time, The Girl who Stole the Stars, and Another World"".",0.000,0.816,0.184,0.9260,2
3,5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!,0.027,0.665,0.307,0.9766,2
4,5,There's a reason for the price,"There's a reason this CD is so expensive, even the version that's not an import.Some of the best music ever. I could listen to every track every minute of every day. That's about all i can say.",0.000,0.887,0.113,0.6369,2
...,...,...,...,...,...,...,...,...
2999988,5,Tyler rocks!!!,This music is awsome!!!!! I first heard Tyler when I was at a Hanson concert he was the opening act. Before that I had never heard of him. His performance was amazing and so when he was done we went out to get his autograph. He was so sweet he signed the date and my name as well and was gracious enough to do it on my Hanson tour book becuase I had not been able to get anything of his yet. Anyway he is a great person and his music is absolutely awsome. I highly recomend this cd!!!!!Melinda,0.000,0.824,0.176,0.9619,2
2999989,5,Tyler Hilton is the best!,"Tyler Hilton's EP may only have a couple songs, but all of them are sooo amazing and good. All of them are my favorites. Tyler Hilton has a really good voice. His music is pop. Def. buy this CD....i promice you wont regret it!!!!",0.000,0.641,0.359,0.9711,2
2999990,1,What A Slap In The Face To Masami Ueda,"Do NOT buy this cd. Ever. This was probably just released as a test to see if REfans would buy anything that had ""Resident Evil/Biohazard"" on it.How dare this metamorphis guy ruin such perfect scores.Masami Ueda probably had a heart attack when he heard these awful remixes of his songs.Trance....bah.",0.172,0.764,0.064,-0.7351,2
2999991,2,Too simplistic,"While Mr. Harrison makes some extremely valid arguments in this book , I wish he had also explored why the Anglo-protestant culture , which he holds up as""best in class"" went about enslaving the world and what impact this has had on various countries , whom they enslaved.Perhaps some of the progress those Anglo-protestant societies have made, is due to the fact that they exploited other countries and other peoples and not so much their work ethic as Mr. Harrison seems to suggest",0.069,0.867,0.064,-0.1226,2


In [32]:
amazon.loc[conditions[0], 'vader_correct'] = 1
amazon.head()

Unnamed: 0,sentiment,title,review,neg,neu,pos,compound,vader_correct
0,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature.",0.0,0.505,0.495,0.9786,1
1,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.",0.019,0.851,0.129,0.8481,1
2,4,Chrono Cross OST,"The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; ""Scars Left by Time, The Girl who Stole the Stars, and Another World"".",0.0,0.816,0.184,0.926,1
3,5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!,0.027,0.665,0.307,0.9766,1
4,5,There's a reason for the price,"There's a reason this CD is so expensive, even the version that's not an import.Some of the best music ever. I could listen to every track every minute of every day. That's about all i can say.",0.0,0.887,0.113,0.6369,1


In [33]:
amazon.loc[~conditions[0], 'vader_correct'] = 0

In [34]:
amazon

Unnamed: 0,sentiment,title,review,neg,neu,pos,compound,vader_correct
0,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature.",0.000,0.505,0.495,0.9786,1
1,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.",0.019,0.851,0.129,0.8481,1
2,4,Chrono Cross OST,"The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; ""Scars Left by Time, The Girl who Stole the Stars, and Another World"".",0.000,0.816,0.184,0.9260,1
3,5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!,0.027,0.665,0.307,0.9766,1
4,5,There's a reason for the price,"There's a reason this CD is so expensive, even the version that's not an import.Some of the best music ever. I could listen to every track every minute of every day. That's about all i can say.",0.000,0.887,0.113,0.6369,1
...,...,...,...,...,...,...,...,...
2999994,1,Don't do it!!,"The high chair looks great when it first comes out of the box but it is all down hill after that. It is impossible to keep clean. The finish is flaking off, after less than 6 months of use. It is not worth the struggle to keep up. If you already have it, call the customer service number and order a second chair pad. It is only $12 and does help to be able to switch out when one is in the wash.",0.022,0.828,0.150,0.8896,0
2999995,2,"Looks nice, low functionality","I have used this highchair for 2 kids now and finally decided to sell it because I did not like it. It's a beautiful chair and looks great in the kitchen. It's much nicer looking than the 'plastic' ones you see. What I don't like about this chair is the day-to-day functionality of it. It does not adjust, at all. As the baby gets older, I want a chair that tilts, adjusts, etc-- and this does not do that. Also, it is very heavy and hard to move; it does not slide on the floor well at all. The seatbelt is clumsy and a pain to put through the fabric cover when you take the cover on or off for cleaning. After being used multiple times, the velcro that fastens the fabric cover down will no longer stick so every time you lift your child out of the highchair, the cover comes with them. I found myself very frustrated with this chair.I would not recommend this chair to anyone. I do think it's pretty though if you are going strictly by looks!",0.089,0.820,0.092,0.3829,0
2999996,2,"compact, but hard to clean","We have a small house, and really wanted two of these high chairs for our twins. Space-wize, they are great; they have a small footprint and look nice along side the rest of our furniture.The 2nd tray and the slide-out cup holders are a great addition to this high chair. The shelf under the seat is a great place to store bibs.The downside of the Eddie Bauer high chair is that its difficult to clean. I have toddler twins, I don't have a lot of time to spend cleaning their high chairs with a toothbrush! The slats on the side of the seat are close together and food gets slopped in between them at every meal. If you don't throughly clean this up after each meal, it dries and is difficult to get clean.Another downside is my kids never really fit well into these highchairs. They were always too short for the tray, and the tray was always too far away as opposed to other highchairs. This helped create even more food mess!",0.085,0.793,0.122,0.8767,0
2999997,3,Hard to clean!,"I agree with everyone else who says this chair is hard to clean. I bought the lighter colored wood chair, which really makes the grime build up noticeable, especially on the foot rest. We clean the high chair all the time, but the wood almost has an absorbency about it and there's just so much you're able to get clean. And I'm not sure why some areas that have been cleaned have a sticky feeling to them except that maybe it's a reaction of the wood to the chemicals in disinfectant wipes. It looks nice, but I wish I had gone with a plastic, easier to clean high chair.",0.046,0.745,0.209,0.9624,0


In [35]:
amazon.vader_correct.value_counts(normalize=True)

1    0.537101
0    0.462899
Name: vader_correct, dtype: float64
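
The overall agreement rate hides how it varies by star rating. A sketch of a per-rating breakdown on made-up rows; the numbers below are illustrative and are not results from the real data.

```python
import pandas as pd

# Hypothetical ratings with a 0/1 correctness flag.
demo = pd.DataFrame({
    'sentiment':     [1, 1, 2, 3, 4, 5, 5],
    'vader_correct': [1, 0, 0, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the share of correct rows in each group.
accuracy_by_rating = demo.groupby('sentiment')['vader_correct'].mean()
print(accuracy_by_rating)
```

On the real frame the same call would be `amazon.groupby('sentiment')['vader_correct'].mean()`.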

In [38]:
amazon.groupby(['sentiment'])['compound'].mean()

sentiment
1   -0.028551
2    0.190786
3    0.418094
4    0.633814
5    0.711505
Name: compound, dtype: float64

# Exporting the VADER-scored Amazon reviews to CSV

In [39]:
#amazon.to_csv('./datasets/training_data/amazon_reviews/vader_amazon.csv')
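
If the export is re-enabled, writing without the row index avoids an extra unnamed column when the file is read back. A sketch using an in-memory buffer instead of a real path, with a made-up one-row frame:

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for the scored `amazon` frame.
demo = pd.DataFrame({'sentiment': [5], 'review': ['Great vocals'], 'compound': [0.9786]})

buffer = io.StringIO()
# index=False skips the row index, so reading the file back does not create
# an extra "Unnamed: 0" column.
demo.to_csv(buffer, index=False)

buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
```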