# This notebook outlines pre-processing of Amazon Reviews

Most of the work is done in classes that I wrote, but this notebook outlines what I did for pre-processing. Running pre-processing inside the notebook took a lot longer than running it from the command line, so I use the command line to iterate faster.

All code is checked into my github repo: https://github.com/sv650s/sb-capstone


* preprocess_amazon.py - python program that calls TextPreprocessor with parameters set to handle the amazon review file
* TextPreprocessor.py - processor class that uses various utilities to pre-process the file
* text_util.py - has a bunch of text processing methods to clean the data
* file_util.py - functions to handle files (i.e., convert tsv to csv)
* df_util.py - utility to handle pandas DataFrames


Unit Tests:
* TestTextUtil.py - tests text_util.py

To Be Implemented:
* unit tests for file_util.py
* unit tests for df_util.py


## Before pre-processing, I had to convert the TSV to CSV because Pandas was not reading the columns correctly: it was putting multiple rows into a single column, resulting in headline columns that had over 30k words

The original Amazon file had 9 million reviews, so I added a sampling parameter to reduce the size of the file. Currently, the sampling is pretty dumb: it just grabs every nth line in the file and puts it in the final CSV file. I will probably rewrite this later so that rows are sampled at random.
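A minimal sketch of what random sampling could look like, using only the standard library. `sample_tsv_to_csv` and its parameters are hypothetical names for illustration; this is not the `convert_tsv_to_csv` in file_util.py. It also assumes that unbalanced quote characters in the review text were what made rows merge when reading the TSV, which is why the reader is told to treat quotes as ordinary characters.

```python
import csv
import random


def sample_tsv_to_csv(tsv_path, csv_path, fraction, seed=None):
    """Copy the header plus a random `fraction` of data rows from a TSV to a CSV.

    Hypothetical helper for illustration. Returns the number of data rows written.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE treats quote characters as plain text, so a stray quote
        # in a review cannot swallow the rows that follow it.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)  # default quoting escapes commas and quotes
        writer.writerow(next(reader))  # always keep the header row
        for row in reader:
            # keep each row independently with probability `fraction`
            if rng.random() < fraction:
                writer.writerow(row)
                written += 1
    return written
```

With this approach the output size is only approximately `fraction` times the input size, but the file is read in a single pass and never has to fit in memory.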

I already ran this via command line because it was faster. The commented-out lines show how the other sizes would be generated.

In [1]:
from file_util import convert_tsv_to_csv

# only need to run this one time
CONVERT_FILE=False

# full 9mil Wireless reviews - not enough memory locally to do this
ORIG_FILE_WIRELESS="dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00.tsv"

# about 22k reviews
DATA_FILE_TEST = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-test.csv"
# about 100023 reviews
DATA_FILE_TINY = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-tiny.csv"
# about 300068 reviews
DATA_FILE_SMALL = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-small.csv"
# about 450101 reviews
DATA_FILE_MEDIUM = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-medium.csv"
# about 900203 reviews
DATA_FILE_LARGE = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-large.csv"

# already ran this; the third argument is the sampling interval (keep every nth line)
if CONVERT_FILE:
    convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_TEST, 400)
    # convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_TINY, 90)
    # convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_SMALL, 25)
    # convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_MEDIUM, 20)
    # convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_LARGE, 10)


## sample run and output of preprocess_amazon.py

I'm using a file that has only 25k entries so I can run it in the notebook quickly

Note that the final output has 19889 rows: after stemming and removing stop words, some of the review headlines end up blank

Pre-processing entails the following (in order):
* make everything lowercase
* remove newlines
* remove amazon tags - amazon embeds tags like [[VIDEOID:dsfljlsjf]] and [[ASIN:sdfjlsjdfl]] that need to be removed
* remove html tags - line breaks, etc are represented in reviews as HTML tags
* remove accent characters
* expand contractions - expands contractions like he's; this needs to be done before removing special characters because we want to expand don't into do not for our text processing
* remove special characters - anything that is not alphanumeric or spaces
* remove stop words - see text_util.py for stop words that I removed from nltk stop words because I think they will be important for sentiment analysis
* stem or lemmatize words - ONLY Porter stemming is implemented currently (THIS IS TURNED OFF)
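
The ordering of the steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in text_util.py: `preprocess`, the regular expressions, and the tiny `CONTRACTIONS` and `STOP_WORDS` tables are all hypothetical stand-ins, and stemming is left out because it is turned off.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "he's": "he is", "it's": "it is", "can't": "cannot"}
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "i", "he", "do"}  # "not" is kept

AMAZON_TAG = re.compile(r"\[\[[a-z]+:[^\]]*\]\]", re.IGNORECASE)  # e.g. [[ASIN:...]]
HTML_TAG = re.compile(r"<[^>]+>")
CONTRACTION = re.compile(r"\b(?:" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
SPECIAL = re.compile(r"[^a-zA-Z0-9\s]")


def preprocess(text):
    text = text.lower()                                  # 1. lowercase
    text = text.replace("\r", " ").replace("\n", " ")    # 2. remove newlines
    text = AMAZON_TAG.sub(" ", text)                     # 3. remove amazon tags
    text = HTML_TAG.sub(" ", text)                       # 4. remove html tags
    # 5. remove accents: decompose characters, then drop anything non-ASCII
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 6. expand contractions while the apostrophes are still present
    text = CONTRACTION.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    text = SPECIAL.sub(" ", text)                        # 7. remove special characters
    # 8. remove stop words; splitting on whitespace also collapses extra spaces
    return " ".join(t for t in text.split() if t not in STOP_WORDS)


print(preprocess("He's happy<br />but I DON'T like the café case!"))
# → happy but not like cafe case
```

If step 7 ran before step 6, don't would become don t and the negation would be lost. Because lowercasing runs before accent removal in this sketch, a trademark sign comes out as an uppercase TM, which is consistent with the product title output further down.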

Columns that it drops right off the bat: marketplace, vine, verified_purchase

Columns that it is pre-processing: product_title, review_headline, review_body

Also, for convenience, there is a flag to retain the original column so we can see the original text next to the pre-processed text and look for errors. I will not be using this flag for the final data files


Here is the list of default stop words from nltk:


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
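
As a rough illustration of trimming that default list, the sketch below removes a keep-list from a stop word list before filtering text. The keep-list here is an assumption (negations and intensifiers); the words actually retained are defined in text_util.py. `drop_stop_words` is a hypothetical name, and only a handful of entries are copied from the list above so that the example runs without nltk.

```python
# A few entries copied from the nltk list above; real code would load the full list.
default_stop_words = ['i', 'me', 'the', 'and', 'but', 'no', 'nor', 'not',
                      'very', 'too', "don't", "wasn't"]

# Hypothetical keep-list: words assumed to carry sentiment, so they are
# taken OUT of the stop word list and therefore stay in the text.
keep_for_sentiment = {'no', 'nor', 'not', 'very', 'too', "don't", "wasn't"}

stop_words = {w for w in default_stop_words if w not in keep_for_sentiment}


def drop_stop_words(text, stop_words=stop_words):
    """Remove stop words from whitespace-separated text."""
    return " ".join(t for t in text.split() if t not in stop_words)


print(drop_stop_words("i do not like the case"))
# → do not like case
```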

In [2]:
!python preprocess_amazon.py -l INFO -r \
    -o dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testout.csv \
    dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv


2019-05-02 21:38:56,472 INFO    __main__.main [65] - loading data frame from dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv
2019-05-02 21:38:56,644 INFO    __main__.main [67] - finished loading dataframe dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv
2019-05-02 21:38:56,644 INFO    __main__.main [68] - original dataframe length: 22505
Traceback (most recent call last):
  File "preprocess_amazon.py", line 83, in <module>
    main()
  File "preprocess_amazon.py", line 74, in main
    custom_preprocessor=remove_amazon_tags)
  File "/Users/vinceluk/Dropbox/0_springboard/capstone/TextPreprocessor.py", line 69, in __init__
    tu.remove_stop_words(self.stop_word_remove_list)
  File "/Users/vinceluk/Dropbox/0_springboard/capstone/text_util.py", line 98, in remove_stop_words
    tokens = wpt.tokenize(text)
  File "/Users/vinceluk/anaconda3/envs/capstone/lib/python3.7/site-packages/nltk/tokenize/regexp.py", line 136, in tokenize
    return self._regexp.fi

In [3]:
import pandas as pd

PREPROCESSED_CSV = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testout.csv"


## Reading the output file back in to look at some data

In [4]:
review_df = pd.read_csv(PREPROCESSED_CSV, parse_dates=["review_date"])
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22410 entries, 0 to 22409
Data columns (total 15 columns):
customer_id             22410 non-null int64
review_id               22410 non-null object
product_id              22410 non-null object
product_parent          22410 non-null int64
product_title_orig      22410 non-null object
product_title           22410 non-null object
product_category        22410 non-null object
star_rating             22410 non-null int64
helpful_votes           22410 non-null int64
total_votes             22410 non-null int64
review_headline_orig    22410 non-null object
review_headline         22410 non-null object
review_body_orig        22410 non-null object
review_body             22410 non-null object
review_date             22410 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(9)
memory usage: 2.6+ MB


In [5]:
# let's sample the dataframe so we can look at some of the data
sample_df = review_df.sample(n=10)

## Product Title

In [6]:
df = review_df
df["product_title_orig_len"] = df["product_title_orig"].str.len()
df["product_title_len"] = df["product_title"].str.len()

df["product_title_diff"] = df["product_title_len"] - df["product_title_orig_len"]
df["product_title_percent_diff"] = df["product_title_diff"] / df["product_title_orig_len"]



df[["product_title_orig_len", "product_title_len", "product_title_diff", "product_title_percent_diff"]].describe()

| stat | product_title_orig_len | product_title_len | product_title_diff | product_title_percent_diff |
|---|---|---|---|---|
| count | 22410.0 | 22410.0 | 22410.0 | 22410.0 |
| mean | 104.419277 | 94.322445 | -10.096832 | -0.088215 |
| std | 61.652263 | 54.613601 | 9.13249 | 0.053863 |
| min | 3.0 | 3.0 | -86.0 | -0.470588 |
| 25% | 60.0 | 55.0 | -14.0 | -0.120879 |
| 50% | 90.0 | 82.0 | -8.0 | -0.086957 |
| 75% | 136.0 | 122.0 | -4.0 | -0.053191 |
| max | 400.0 | 400.0 | 0.0 | 0.0 |


In [7]:
import numpy as np

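rows = len(sample_df)
# first let's look at a few randomly chosen rows (drawn with replacement, so repeats are possible)
for _ in range(3):
    # randint(rows) returns an integer in [0, rows), so it is always a valid position
    row_index = np.random.randint(rows)
    row = sample_df.iloc[row_index]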
    print(f'product_title_orig\t[{row["product_title_orig"]}]')
    print(f'product_title\t\t[{row["product_title"]}]')




product_title_orig	[HTC Desire 510 Case, ToPerk Cyber Grid Armor Case + Free HD Screen Protector & ToPerk ™ Stylus Pen As Bundle Sale]
product_title		[htc desire 510 case toperk cyber grid armor case free hd screen protector toperk TM stylus pen bundle sale]
product_title_orig	[Install Bay Copper Ring Terminal Connector 8 Gauge 1/4 Inch 25 Pack -]
product_title		[install bay copper ring terminal connector 8 gauge 1 4 inch 25 pack]
product_title_orig	[HTC Desire 510 Case, ToPerk Cyber Grid Armor Case + Free HD Screen Protector & ToPerk ™ Stylus Pen As Bundle Sale]
product_title		[htc desire 510 case toperk cyber grid armor case free hd screen protector toperk TM stylus pen bundle sale]


## I did notice that in product titles, things like 5.5 get converted to 5 5 - will have to do something about this if we decide to use product title
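
One possible way to handle this, sketched under the assumption that special characters are removed with a regular expression: keep a period only when it sits between two digits. `remove_special_keep_decimals` is a hypothetical name and is not part of text_util.py.

```python
import re

# Drop everything except letters, digits, whitespace, and periods (for now).
NOT_ALNUM_OR_DOT = re.compile(r"[^a-zA-Z0-9\s.]")
# A period that is NOT sandwiched between two digits.
NON_DECIMAL_DOT = re.compile(r"(?<!\d)\.|\.(?!\d)")


def remove_special_keep_decimals(text):
    text = NOT_ALNUM_OR_DOT.sub(" ", text)
    text = NON_DECIMAL_DOT.sub(" ", text)
    # collapse the runs of spaces left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()


print(remove_special_keep_decimals("5.5 inch screen, v1.0. great!"))
# → 5.5 inch screen v1.0 great
```

Fractions such as 1/4 would still be split into 1 4 by this sketch; they would need a similar rule for the slash.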

## review headlines

Average 22% reduction in length with stemming turned on

Average 17% reduction in length with stemming turned off

In [8]:

df = review_df
df["review_headline_orig_len"] = df["review_headline_orig"].str.len()
df["review_headline_len"] = df["review_headline"].str.len()

df["review_headline_diff"] = df["review_headline_len"] - df["review_headline_orig_len"]
df["review_headline_percent_diff"] = df["review_headline_diff"] / df["review_headline_orig_len"]

df[["review_headline_orig_len", "review_headline_len", "review_headline_diff", "review_headline_percent_diff"]].describe()

| stat | review_headline_orig_len | review_headline_len | review_headline_diff | review_headline_percent_diff |
|---|---|---|---|---|
| count | 22410.0 | 22410.0 | 22410.0 | 22410.0 |
| mean | 22.041187 | 16.727755 | -5.313432 | -0.168167 |
| std | 18.062227 | 12.014138 | 8.247019 | 0.199324 |
| min | 1.0 | 1.0 | -68.0 | -0.904762 |
| 25% | 10.0 | 10.0 | -8.0 | -0.309524 |
| 50% | 15.0 | 12.0 | -1.0 | -0.083333 |
| 75% | 28.0 | 21.0 | 0.0 | 0.0 |
| max | 128.0 | 125.0 | 0.0 | 0.0 |


In [9]:
# let's now look at review_headline
# first let's randomly look at a couple rows
for _ in range(5):
    # randint(rows) returns an integer in [0, rows), so it is always a valid position
    row_index = np.random.randint(rows)
    row = sample_df.iloc[row_index]
    print(f'review_headline_orig\t[{row["review_headline_orig"]}]')
    print(f'review_headline\t\t[{row["review_headline"]}]')

review_headline_orig	[One Star]
review_headline		[one star]
review_headline_orig	[Slim enough to keep in my portfolio along with my ...]
review_headline		[slim enough keep portfolio along]
review_headline_orig	[Looks good on paper, but cut corners make it a poor product.]
review_headline		[looks good paper cut corners make poor product]
review_headline_orig	[My experience so far]
review_headline		[experience far]
review_headline_orig	[Looks good on paper, but cut corners make it a poor product.]
review_headline		[looks good paper cut corners make poor product]


## review body - pretty significant reduction in length

Average 40% reduction in size of review body if stemming is turned on

With stemming turned off, about 35% average reduction in body size

In [10]:
df = review_df
df["review_body_orig_len"] = df["review_body_orig"].str.len()
df["review_body_len"] = df["review_body"].str.len()
df["review_body_diff"] = df["review_body_len"] - df["review_body_orig_len"]
df["review_body_percent_diff"] = df["review_body_diff"] / df["review_body_orig_len"]
df[["review_body_orig_len", "review_body_len", "review_body_diff", "review_body_percent_diff"]].describe()

| stat | review_body_orig_len | review_body_len | review_body_diff | review_body_percent_diff |
|---|---|---|---|---|
| count | 22410.0 | 22410.0 | 22410.0 | 22410.0 |
| mean | 256.35444 | 158.150959 | -98.203481 | -0.34504 |
| std | 420.598432 | 258.101204 | 164.786304 | 0.128854 |
| min | 1.0 | 1.0 | -3426.0 | -0.891892 |
| 25% | 70.0 | 45.0 | -109.0 | -0.423313 |
| 50% | 141.0 | 89.0 | -53.0 | -0.368421 |
| 75% | 281.0 | 171.0 | -22.0 | -0.295858 |
| max | 9888.0 | 6519.0 | 0.0 | 0.0 |


In [11]:
# let's now look at review_body
# first let's randomly look at a couple rows
for _ in range(10):
    # randint(rows) returns an integer in [0, rows), so it is always a valid position
    row_index = np.random.randint(rows)
    row = sample_df.iloc[row_index]
    print(f'review_body_orig\t[{row["review_body_orig"]}]')
    print(f'review_body\t\t[{row["review_body"]}]\n')

review_body_orig	[ok I recieved the PLD70BT unit last week on friday, it arrived in perfect condition and on time!   (ordered from amazon not a third party)   I bought this unit for my 2003 Dodge Ram 1500 Quad cab.    a few years back I bought a Pyle PLD45MUT (3.5 inch screen and tv tuner built in.) which after about 2 or 3 years stopped playing the DVD's or CD's and thus was not much use as my kids love to bring CD's in the car and torture me! anyway, I had to buy a wiring adaptor for the old unit and I kept it all wired after I took it out of the truck and shelved it. so when I recieved the new unit I was thrilled to see thaat the Pyle wiring harness was exactly the same from unit to unit and I just popped it off the old unit and onto the new one and it slid right into place in the dash! so if you have a Pyle unit already You can upgrade to this one pretty easy!     ok, that being said let's talk about this unit. it's a good unit for the price and so far I am very satisfied! the soun