# Amazon Fine Food Reviews

This dataset consists of reviews of fine foods from Amazon. The data span more than 10 years, including all ~500,000 reviews up to October 2012. Each review includes product and user information, a rating, and a plain-text review. The dataset also includes reviews from all other Amazon categories.

Data includes:
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

### Columns
- Id - Row Id
- ProductId - Unique identifier for the product
- UserId - Unique identifier for the user
- ProfileName - Profile name of the user
- HelpfulnessNumerator - Number of users who found the review helpful
- HelpfulnessDenominator - Number of users who indicated whether they found the review helpful or not
- Score - Rating between 1 and 5
- Time - Timestamp for the review
- Summary - Brief summary of the review
- Text - Text of the review
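
Two of these columns usually need a small transformation before they are easy to read. The sketch below assumes that `Time` holds Unix timestamps in seconds (UTC) and that the helpfulness of a review is the numerator divided by the denominator. The helper names `review_date` and `helpfulness_ratio` are made up for this illustration and are not used elsewhere in the notebook.

```python
from datetime import datetime, timezone

def review_date(unix_seconds):
    """Convert a Time value (assumed Unix seconds, UTC) to a calendar date."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()

def helpfulness_ratio(numerator, denominator):
    """Share of voters who found the review helpful; None when nobody voted."""
    if denominator == 0:
        return None
    return numerator / denominator

# 1303862400 is the Time value of the first review in this dataset
print(review_date(1303862400))
print(helpfulness_ratio(3, 3))
print(helpfulness_ratio(0, 0))
```

Returning `None` for reviews without votes keeps them distinguishable from reviews that were voted unhelpful.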

## Objective
Our objective here is to clean up and preprocess the text data so that it is ready to be used by a predictive model later.

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# Read the SQLite table
con = sqlite3.connect('database.sqlite')
filtered_data = pd.read_sql_query('select * from Reviews where score != 3',con)
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [2]:
# # Pandas Profiling
# import pandas_profiling

# pandas_profiling.ProfileReport(filtered_data.head(10000))

In [3]:
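def score_label(score):
    # Reviews with a score of 3 were excluded by the SQL query above,
    # so scores above 3 are positive and everything else is negative
    if score > 3:
        return 'postive'
    return 'negative'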

filtered_data['sentiment'] = filtered_data.Score.apply(score_label)
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,sentiment
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,postive
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,negative
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,postive
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,negative
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,postive


## Data Cleaning

In [4]:
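# Removing duplicates

# Sort the data by product (note: sorted_reviews is not used by the next step)
sorted_reviews = filtered_data.sort_values('ProductId')

# Drop rows that share the same user, profile name, timestamp and review text,
# keeping the first occurrence
distinct_revirews = filtered_data.drop_duplicates(subset=['UserId','ProfileName','Time','Text'])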
distinct_revirews.shape

(364173, 11)

In [5]:
print('Remaining data: {:.4f} %'.format((float(distinct_revirews.Id.size)*100/float(filtered_data.Id.size))))

Remaining data: 69.2589 %


In [6]:
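# Keep only rows where HelpfulnessNumerator <= HelpfulnessDenominator
# (the numerator can never legitimately exceed the denominator).
# .copy() gives an independent DataFrame, so later column assignments
# do not operate on a view of distinct_revirews

final_data = distinct_revirews[distinct_revirews.HelpfulnessNumerator<=distinct_revirews.HelpfulnessDenominator].copy()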
final_data.shape

(364171, 11)

In [7]:
final_data.sentiment.value_counts()

postive     307061
negative     57110
Name: sentiment, dtype: int64
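
The counts above show that the two classes are far from balanced, which matters for any predictive model trained on this data later. As a quick sketch, the shares can be worked out directly from the printed numbers (the dictionary below is copied by hand from the output above, including the label spelling used in this notebook):

```python
# Counts copied from the value_counts() output above
counts = {"postive": 307061, "negative": 57110}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```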

## Text Pre-Processing

In [8]:
# Remove HTML Tags
from bs4 import BeautifulSoup
def remove_html(text):
    soup = BeautifulSoup(text,'lxml')
    html_free_text = soup.get_text()
    return html_free_text


In [9]:
# First, find a review whose text contains HTML tags
import re

for i, text in enumerate(final_data.Text.values):
    if re.search('<.*?>', text):
        print(i)
        print(text)
        break

10
I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan.  Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!


In [10]:
final_data.Text = final_data.Text.apply(lambda x : remove_html(x))
final_data.Text.head()

0    I have bought several of the Vitality canned d...
1    Product arrived labeled as Jumbo Salted Peanut...
2    This is a confection that has been around a fe...
3    If you are looking for the secret ingredient i...
4    Great taffy at a great price.  There was a wid...
Name: Text, dtype: object

In [11]:
# Let's check the review at position 10 to see whether the cleanup worked
final_data.iloc[10].Text

"I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  When we realized that we simply couldn't find it anywhere in our city we were bummed.Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan.  Just realize that once you taste it, you will never want to use any other sauce.Thank you for the personal, incredible service!"

#### HTML cleanup was successful !!
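
One side effect is visible in the cleaned review above: the `<br />` tags are removed without leaving any whitespace, so sentences fuse together ("bummed.Now"). A minimal sketch of one way to avoid this, assuming it is acceptable to replace each tag with a space and collapse repeated whitespace. It uses only the standard library rather than BeautifulSoup, and `remove_html_keep_spacing` is a hypothetical helper, not part of this notebook's pipeline.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")

def remove_html_keep_spacing(text):
    """Replace each HTML tag with a space, decode entities, collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()

sample = "we were bummed.<br /><br />Now, because of the magic of the internet"
print(remove_html_keep_spacing(sample))
```

A regular expression is a rough tool for HTML in general; it is only a reasonable choice here because the reviews contain simple inline tags.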

In [12]:
# Remove punctuation
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
def punctuation_remover(text):
    punctuation_free_text = "".join([char for char in text if char \
                                    not in string.punctuation])
    return punctuation_free_text

In [14]:
final_data.Text = final_data.Text.apply(lambda x : punctuation_remover(x))
final_data.Text.head()

0    I have bought several of the Vitality canned d...
1    Product arrived labeled as Jumbo Salted Peanut...
2    This is a confection that has been around a fe...
3    If you are looking for the secret ingredient i...
4    Great taffy at a great price  There was a wide...
Name: Text, dtype: object

##### Tokenization
Tokenization breaks a string up into a list of words or pieces based on a
specified pattern, using regular expressions (RegEx). The pattern I
chose to use this time (`r'\w+'`) also removes punctuation and is a
better option for this data in particular. We can also add `.lower()`
in the lambda function to make everything lowercase.



In [15]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
final_data.Text = final_data.Text.apply(lambda x : tokenizer.tokenize(x.lower()))
final_data.Text.head()

0    [i, have, bought, several, of, the, vitality, ...
1    [product, arrived, labeled, as, jumbo, salted,...
2    [this, is, a, confection, that, has, been, aro...
3    [if, you, are, looking, for, the, secret, ingr...
4    [great, taffy, at, a, great, price, there, was...
Name: Text, dtype: object

Some other examples of RegEx patterns are:
- `\w+|\$[\d\.]+|\S+` = matches words, dollar amounts such as $3.50, or any other run of non-space characters
- `\s+` with `gaps=True` = splits on whitespace, so everything except spaces becomes a token
- `[A-Z]\w+` = only words that begin with a capital letter
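
To see what these patterns do, the sketch below applies them with the standard `re` module. This only approximates `RegexpTokenizer`: `re.findall` stands in for the default mode and `re.split` for `gaps=True`. The sample sentence is made up for illustration.

```python
import re

sample = "Great taffy at $3.50, isn't it?"

# \w+ : runs of letters, digits and underscores; punctuation is dropped
word_tokens = re.findall(r"\w+", sample)

# \w+|\$[\d\.]+|\S+ : words, dollar amounts, or any other run of non-space characters
mixed_tokens = re.findall(r"\w+|\$[\d\.]+|\S+", sample)

# \s+ used as a separator: split on whitespace and drop empty pieces
gap_tokens = [tok for tok in re.split(r"\s+", sample) if tok]

# [A-Z]\w+ : only words that start with a capital letter
capital_tokens = re.findall(r"[A-Z]\w+", "Great taffy from Amazon")

for tokens in (word_tokens, mixed_tokens, gap_tokens, capital_tokens):
    print(tokens)
```

Note how `\w+` splits "isn't" into "isn" and "t", which is worth keeping in mind when stop words are removed afterwards.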

##### Removing Stop Words

In [16]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
print('Stop Words in English:')
print('----------------------')
print(stop)

Stop Words in English:
----------------------
{'for', 'an', 'did', 'so', "weren't", 'can', 'hadn', 've', 'about', "should've", 'mustn', 'y', 'your', 'now', 'he', "that'll", 'most', "wouldn't", 'have', 'her', 'because', 'those', "shouldn't", "needn't", 'after', 'itself', "mustn't", "you've", 'whom', 'both', 'some', "mightn't", 'at', 'haven', 'while', 'why', 'being', "won't", "shan't", 'they', 'i', 'didn', 'with', 'more', 'd', 'them', 'into', 'who', "hasn't", 'yours', 't', 'she', "you'll", 'before', 'out', 'shouldn', 'wouldn', 'theirs', 'ourselves', 'herself', 'their', 'ours', 'same', 'won', 'yourselves', 'his', 'couldn', 'through', 'there', 'having', "she's", 'aren', 'here', 'this', 'by', 'what', 'weren', 'shan', 'yourself', 'when', 'against', 'himself', 'and', 'myself', "hadn't", 'of', 'under', 'other', 'its', 'down', 'how', 'll', 'up', 'doing', 'should', "you'd", 'is', 'until', 'am', 'him', 'but', 'which', 'own', 'from', 'it', 'these', 'below', 'we', 'only', "you're", 'in', 'don', 'ne

In [17]:
# Cached once: provides 70x speedup. Stored as a set for fast membership tests
cached_stop_words = set(stopwords.words('english'))
def stop_words_remover(text):
    words = [word for word in text if word not in cached_stop_words]
    return words

In [18]:
final_data.Text = final_data.Text.apply(lambda x: stop_words_remover(x))
final_data.Text.head()

0    [bought, several, vitality, canned, dog, food,...
1    [product, arrived, labeled, jumbo, salted, pea...
2    [confection, around, centuries, light, pillowy...
3    [looking, secret, ingredient, robitussin, beli...
4    [great, taffy, great, price, wide, assortment,...
Name: Text, dtype: object
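
One thing to keep in mind about the order of the steps: punctuation was stripped before the stop words were removed, so a contraction such as "don't" has already become "dont" and no longer matches any stop-word entry. The sketch below illustrates the effect with a small hand-picked stand-in for the NLTK list (`STOP_SUBSET` is hypothetical and only contains a few entries) and plain whitespace splitting instead of `RegexpTokenizer`.

```python
import string

# Hypothetical stand-in for a few entries of NLTK's English stop-word list
STOP_SUBSET = {"i", "don", "don't", "t", "it"}

def strip_punct(text):
    """Remove every character listed in string.punctuation."""
    return "".join(ch for ch in text if ch not in string.punctuation)

review = "I don't like it"

# Order used in this notebook: strip punctuation first, then filter stop words
stripped_first = [w for w in strip_punct(review).lower().split()
                  if w not in STOP_SUBSET]

# Alternative order: filter stop words first, then strip punctuation
filtered_first = [strip_punct(w) for w in review.lower().split()
                  if w not in STOP_SUBSET]

print(stripped_first)
print(filtered_first)
```

Whether a token like "dont" should survive is a modelling decision; the sketch only shows that the ordering changes the result.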

##### Stemming & Lemmatization 
Both techniques reduce words to a root form. Stemming is the more aggressive of the two: it cuts off common prefixes and/or endings of words. That can be helpful, but not always, because the result is often cut down so far that it loses its actual meaning. Lemmatization, on the other hand, maps inflected forms of a word to one base form and, unlike stemming, always returns a proper word that can be found in the dictionary. I like to compare the two to see which one works better for what I need. I usually prefer the lemmatizer, but surprisingly, this time stemming seemed to have more of an effect.
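
The difference can be illustrated with two deliberately tiny stand-ins. This is a toy sketch only: `toy_stem` is a naive suffix stripper, not the Porter algorithm, and `toy_lemmatize` is a three-entry lookup table, not WordNet. Both names and the word lists are made up for this example.

```python
# Toy suffix list, checked in order (NOT the Porter stemmer's rules)
SUFFIXES = ("ies", "ing", "ed", "es", "s")

def toy_stem(word):
    """Chop the first matching suffix, as long as at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy dictionary of base forms (NOT WordNet)
TOY_LEMMAS = {"arrived": "arrive", "centuries": "century", "peanuts": "peanut"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["arrived", "centuries", "peanuts"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmed forms need not be real words, while the lemmatized forms always are, at the cost of needing a dictionary.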

In [19]:
# Lemmatization
# from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()

# The function needs its own name so it does not shadow the lemmatizer object
# def word_lemmatizer(text):
#     lem = [lemmatizer.lemmatize(word) for word in text]
#     return lem

In [20]:
# final_data.Text.apply(lambda x : word_lemmatizer(x) )

In [44]:
# Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
def word_stemmer(text):
    stem_text = " ".join(stemmer.stem(word) for word in text )
    return stem_text

In [46]:
final_data.Text = final_data.Text.apply(lambda x : word_stemmer(x))
final_data.Text.head()

0    bought sever vital can dog food product found ...
1    product arriv label jumbo salt peanutsth peanu...
2    confect around centuri light pillowi citru gel...
3    look secret ingredi robitussin believ found go...
4    great taffi great price wide assort yummi taff...
Name: Text, dtype: object

#### Now let's save the cleaned data

In [47]:
final_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,sentiment
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,bought sever vital can dog food product found ...,postive
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,product arriv label jumbo salt peanutsth peanu...,negative
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",confect around centuri light pillowi citru gel...,postive
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,look secret ingredi robitussin believ found go...,negative
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,great taffi great price wide assort yummi taff...,postive


In [48]:
# Write the cleaned reviews to a new SQLite database, replacing any existing table
conn = sqlite3.connect('final_cleaned.sqlite')
conn.text_factory = str
final_data.to_sql('Reviews', conn, if_exists='replace')
conn.close()

In [49]:
# Also save a CSV
final_data.to_csv('final_cleaned.csv')