# This notebook outlines pre-processing of Amazon Reviews

Most of the work is done in classes that I wrote, but I'll outline what I did for pre-processing in this notebook. Running pre-processing inside the notebook took a lot longer, so I run it from the command line instead, which lets me iterate more quickly.

All code is checked into my GitHub repo: https://github.com/sv650s/sb-capstone


* preprocess_amazon.py - python program that calls TextProcessor with parameters set to handle the amazon review file
* TextProcessor.py - processor class that uses various utilities to pre-process the file
* text_util.py - has a bunch of text processing methods to clean the data
* file_util.py - functions to handle files (i.e., convert tsv to csv)
* df_util.py - utility to handle pandas DataFrames


Unit Tests:
* TestTextUtil.py - tests text_util.py

To Be Implemented:
* unit tests for file_util.py
* unit tests for df_util.py


## Before pre-processing, I had to convert the TSV to CSV because pandas was not reading the columns correctly: it was putting multiple rows into a single column, resulting in headline columns that had over 30k words

The original Amazon file has about 9 million reviews. I added a sampling parameter to reduce the size of the file. Currently, the sampling is pretty dumb: it just grabs every nth line in the file and puts it in the final csv file. I will probably rewrite this later so it's based on random sampling (e.g. `np.random.rand`).
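As a rough sketch of what the random version could look like, the function below keeps each data row with a given probability instead of taking every nth line. The name `sample_tsv_to_csv` and its parameters are hypothetical and are not the `convert_tsv_to_csv` in file_util.py. Reading with `csv.QUOTE_NONE` is also an assumption: it treats double quotes as ordinary characters, which is one common way to stop rows from running together when fields contain unbalanced quotes.

```python
import csv
import random


def sample_tsv_to_csv(tsv_path, csv_path, fraction, seed=None):
    """Copy the header plus a random `fraction` of data rows from a TSV to a CSV.

    Hypothetical helper for illustration. Returns the number of data rows written.
    """
    rng = random.Random(seed)  # a seed makes the sample reproducible
    written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE: double quotes in the TSV are treated as plain text (assumption)
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        # the default writer quotes any field containing commas or quotes
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty input file
        writer.writerow(header)
        for row in reader:
            # rng.random() is uniform on [0, 1), so each row is kept
            # independently with probability `fraction`
            if rng.random() < fraction:
                writer.writerow(row)
                written += 1
    return written
```

With this approach `fraction=1/400` would give roughly the same file size as sampling every 400th line, but the exact row count varies from run to run unless a seed is fixed.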

I already ran this via command line because it was faster. So the commented out lines are how we would generate the other sizes.

In [1]:
from file_util import convert_tsv_to_csv


# full 9mil Wireless reviews - not enough memory locally to do this
ORIG_FILE_WIRELESS="dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00.tsv"

# about 22k reviews
DATA_FILE_TEST = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-test.csv"
# about 100023 reviews
DATA_FILE_TINY = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-tiny.csv"
# about 300068 reviews
DATA_FILE_SMALL = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-small.csv"
# about 450101 reviews
DATA_FILE_MEDIUM = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-medium.csv"
# about 900203 reviews
DATA_FILE_LARGE = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-large.csv"

# already ran this
convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_TEST, 400)
# convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_TINY, 90)
# convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_SMALL, 25)
# convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_MEDIUM, 20)
# convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_LARGE, 10)


## sample run and output of preprocess_amazon.py

I'm using a file that has only about 22.5k entries so I can run it in the notebook quickly

Notice that the final output has fewer rows than the input (22,409 versus 22,505 in the run below) because after stemming and removing stop words some of the review headlines are now blank

Pre-processing entails the following (in order):
* make everything lowercase
* remove newlines
* remove amazon tags - amazon embeds these [[VIDEO:dsfljlsjf]] and [[ASIN:sdfjlsjdfl]] tags that need to be removed
* remove html tags - line breaks, etc are represented in reviews as HTML tags
* remove accent characters
* expand contractions - THIS IS NOT YET IMPLEMENTED but needs to be done before removing special characters because we want to expand don't into do not for our text processing
* remove special characters - anything that is not alphanumeric or spaces
* stem or lemmatize words - ONLY Porter stemming is implemented currently
* remove stop words - see text_util.py for stop words that I removed from nltk stop words because I think they will be important for sentiment analysis
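
The steps above can be sketched roughly as follows using only the standard library. This is not the code in text_util.py: the regular expressions and the tiny `STOP_WORDS` set are assumptions made for illustration, and stemming and contraction expansion are left out.

```python
import re
import unicodedata

# Hypothetical, tiny stop-word set for illustration; "not" is deliberately absent
STOP_WORDS = {"i", "it", "is", "a", "the", "my", "this", "and"}

# Matches tags of the form [[name:...]] after lowercasing (assumed tag shape)
AMAZON_TAG = re.compile(r"\[\[[a-z]+:[^\]]*\]\]")
HTML_TAG = re.compile(r"<[^>]+>")
NON_ALNUM = re.compile(r"[^a-z0-9 ]")


def clean_text(text):
    text = text.lower()                                   # make everything lowercase
    text = text.replace("\r", " ").replace("\n", " ")     # remove newlines
    text = AMAZON_TAG.sub(" ", text)                      # remove amazon tags
    text = HTML_TAG.sub(" ", text)                        # remove html tags
    # remove accents: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = NON_ALNUM.sub(" ", text)                       # remove special characters
    # remove stop words; split() also collapses repeated spaces
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


print(clean_text("Love it!<br />Café case [[ASIN:B00ABC123 Some Case]]\nNot bad"))
# love cafe case not bad
```

The order matters in this sketch: the tag and HTML patterns rely on the brackets still being present, so they have to run before special characters are removed.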

Columns that it drops right off the bat: marketplace, vine, verified_purchase
Columns that it is pre-processing: product_title, review_headline, review_body

Also, for convenience, there is a flag to retain the original column so we can see the original text next to the pre-processed text and look for errors. I will not be using this flag for the final data files.
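
A minimal sketch of how such a flag could work with pandas is shown below. The function name `preprocess_columns`, its parameters, and the `toy_clean` stand-in are hypothetical and are not the TextProcessor API. The rule for dropping rows (any processed column ending up empty) is also an assumption.

```python
import pandas as pd


def preprocess_columns(df, columns, clean, retain_original=False):
    """Apply `clean` to each text column; optionally keep the raw text in <col>_orig."""
    out = df.copy()
    for col in columns:
        if retain_original:
            out[f"{col}_orig"] = out[col]  # raw text kept for side-by-side checks
        out[col] = out[col].astype(str).map(clean)
    # assumption: drop a row when any processed column ends up empty
    not_blank = out[columns].ne("").all(axis=1)
    return out[not_blank].reset_index(drop=True)


def toy_clean(text):
    # stand-in for the real cleaning: lowercase and drop the word "it"
    return " ".join(w for w in text.lower().split() if w != "it")


reviews = pd.DataFrame({"review_headline": ["Love it", "It"], "star_rating": [5, 1]})
result = preprocess_columns(reviews, ["review_headline"], toy_clean, retain_original=True)
print(result)
```

Here the second headline becomes an empty string after cleaning, so only one row survives, with both `review_headline` and `review_headline_orig` present.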


Here is the list of default stop words from nltk:


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
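
As an illustration of trimming that list, the sketch below removes a few words from the stop-word set so that they survive pre-processing. The words in `KEEP_FOR_SENTIMENT` are a hypothetical choice (the actual list is in text_util.py), although keeping "not" and "very" is consistent with the sample outputs further down in this notebook.

```python
# A handful of entries copied from the NLTK list above. With NLTK installed and
# its stopwords corpus downloaded, the full list would come from
# nltk.corpus.stopwords.words("english").
NLTK_SAMPLE = ["i", "it", "is", "the", "this", "no", "nor", "not", "very", "don't", "isn't"]

# Hypothetical choice of words to keep because they carry sentiment
KEEP_FOR_SENTIMENT = {"no", "nor", "not", "very", "don't", "isn't"}

custom_stop_words = set(NLTK_SAMPLE) - KEEP_FOR_SENTIMENT


def remove_stop_words(text, stop_words=custom_stop_words):
    # keep only the words that are not in the stop-word set
    return " ".join(w for w in text.split() if w not in stop_words)


print(remove_stop_words("this is not very good"))
# not very good
```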

In [2]:
!python preprocess_amazon.py -l INFO -r \
    -o dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testout.csv \
    dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv


2019-05-02 00:28:39,725 INFO    __main__.main [41] - loading data frame from dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv
2019-05-02 00:28:39,894 INFO    __main__.main [43] - finished loading dataframe dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv
2019-05-02 00:28:39,894 INFO    __main__.main [44] - original dataframe length: 22505
2019-05-02 00:28:39,894 INFO    TextPreprocessor.preprocess_data [119] - start preprocessing data
2019-05-02 00:28:39,894 INFO    TextPreprocessor.preprocess_data [121] - column count before dropping columns: 15
2019-05-02 00:28:39,897 INFO    TextPreprocessor.preprocess_data [124] - column count after dropping columnes: 12
2019-05-02 00:28:39,897 INFO    TextPreprocessor.preprocess_data [127] - original row count: 22505
2019-05-02 00:28:39,910 INFO    TextPreprocessor.preprocess_data [129] - row count after dropping na: 22505
2019-05-02 00:28:39,911 INFO    TextPreprocessor.preprocess_data [131] - column count befo

In [3]:
import pandas as pd

PREPROCESSED_CSV = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testout.csv"


## Reading the output file back in to look at some data

In [4]:
review_df = pd.read_csv(PREPROCESSED_CSV, parse_dates=["review_date"])
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22409 entries, 0 to 22408
Data columns (total 15 columns):
customer_id             22409 non-null int64
review_id               22409 non-null object
product_id              22409 non-null object
product_parent          22409 non-null int64
product_title_orig      22409 non-null object
product_title           22409 non-null object
product_category        22409 non-null object
star_rating             22409 non-null int64
helpful_votes           22409 non-null int64
total_votes             22409 non-null int64
review_headline_orig    22409 non-null object
review_headline         22409 non-null object
review_body_orig        22409 non-null object
review_body             22409 non-null object
review_date             22409 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(9)
memory usage: 2.6+ MB


In [5]:
# let's sample the dataframe so we can look at some of the data
sample_df = review_df.sample(n=10)

## Product Title

In [6]:
df = review_df
# character length of each title before and after pre-processing
df["product_title_orig_len"] = df["product_title_orig"].str.len()
df["product_title_len"] = df["product_title"].str.len()

# negative values mean the pre-processed title is shorter
df["product_title_diff"] = df["product_title_len"] - df["product_title_orig_len"]
df["product_title_percent_diff"] = df["product_title_diff"] / df["product_title_orig_len"]

df[["product_title_orig_len", "product_title_len", "product_title_diff"]].describe()

| | product_title_orig_len | product_title_len |
|---|---|---|
| count | 22409.0 | 22409.0 |
| mean | 104.411933 | 88.198313 |
| std | 61.650591 | 51.114203 |
| min | 3.0 | 2.0 |
| 25% | 60.0 | 51.0 |
| 50% | 90.0 | 77.0 |
| 75% | 136.0 | 114.0 |
| max | 400.0 | 393.0 |


In [7]:
import numpy as np

rows = len(sample_df)
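# first let's randomly look at a few rows
for i in range(3):
    # randint(rows) returns an index in 0..rows-1; round(rand() * rows) could
    # return rows itself, which is out of bounds for iloc
    row_index = np.random.randint(rows)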
    row = sample_df.iloc[row_index]
    print(f'product_title_orig\t[{row["product_title_orig"]}]')
    print(f'product_title\t\t[{row["product_title"]}]')




product_title_orig	[Nokia Lumia 822 GSM  Verizon CDMA 4G LTE Windows Smartphone -Black]
product_title		[nokia lumia 822 gsm verizon cdma 4g lte window smartphon black]
product_title_orig	[TRENDE - Apple iPhone 5C Case Patchwork Owl Rhinestone (Bling) Design Snap-on Hard Cover + Free Gift Box (Compatible Models: ONLY for iPhone 5C - NOT 5 or 5S!)]
product_title		[trend appl iphon 5c case patchwork owl rhineston bling design snap hard cover free gift box compat model onli iphon 5c not 5 5s]
product_title_orig	[Designer Hard case for at&T iphone 4 bulk package--please see additonal comment of listing]
product_title		[design hard case iphon 4 bulk packag plea see additon comment list]


## I did notice that in product titles, things like 5.5 get converted to 5 5. I will have to do something about this if we decide to use product title
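
One possible way to handle this, sketched below, is to keep a period only when it sits between two digits. This is an idea for a future fix rather than what text_util.py does today, and the function name is hypothetical.

```python
import re


def remove_special_chars_keep_decimals(text):
    # drop everything except lowercase letters, digits, spaces and periods
    text = re.sub(r"[^a-z0-9 .]", " ", text.lower())
    # drop any period that is not both preceded and followed by a digit
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", " ", text)
    # collapse the repeated spaces left behind by the substitutions
    return " ".join(text.split())


print(remove_special_chars_keep_decimals("5.5 inch screen."))
# 5.5 inch screen
```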

## review headlines

Average reduction in length is about 23% (mean `review_headline_percent_diff` of -0.227)
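
The same four length columns are computed for each text field in this notebook, so the calculation could be wrapped in a small helper. The sketch below uses a hypothetical function name and a two-row toy DataFrame, not the real data. Note that the `_percent_diff` column is a fraction, so -0.25 means a 25% reduction.

```python
import pandas as pd


def add_length_stats(df, column):
    """Add <column>_orig_len, <column>_len, <column>_diff and <column>_percent_diff."""
    orig_len = df[f"{column}_orig"].str.len()
    new_len = df[column].str.len()
    df[f"{column}_orig_len"] = orig_len
    df[f"{column}_len"] = new_len
    # negative diff means the pre-processed text is shorter than the original
    df[f"{column}_diff"] = new_len - orig_len
    df[f"{column}_percent_diff"] = (new_len - orig_len) / orig_len
    return df[[f"{column}_orig_len", f"{column}_len",
               f"{column}_diff", f"{column}_percent_diff"]]


toy = pd.DataFrame({
    "review_headline_orig": ["Love it", "Not good product"],
    "review_headline": ["love", "not good product"],
})
stats = add_length_stats(toy, "review_headline")
print(stats.describe())
```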

In [16]:

df = review_df
# character length of each headline before and after pre-processing
df["review_headline_orig_len"] = df["review_headline_orig"].str.len()
df["review_headline_len"] = df["review_headline"].str.len()

# negative values mean the pre-processed headline is shorter
df["review_headline_diff"] = df["review_headline_len"] - df["review_headline_orig_len"]
df["review_headline_percent_diff"] = df["review_headline_diff"] / df["review_headline_orig_len"]

df[["review_headline_orig_len", "review_headline_len", "review_headline_diff", "review_headline_percent_diff"]].describe()

| | review_headline_orig_len | review_headline_len | review_headline_diff | review_headline_percent_diff |
|---|---|---|---|---|
| count | 22409.0 | 22409.0 | 22409.0 | 22409.0 |
| mean | 22.042438 | 15.547726 | -6.494712 | -0.227335 |
| std | 18.062178 | 11.171589 | 8.646664 | 0.18807 |
| min | 1.0 | 1.0 | -73.0 | -0.896552 |
| 25% | 10.0 | 9.0 | -9.0 | -0.366667 |
| 50% | 15.0 | 11.0 | -3.0 | -0.175 |
| 75% | 28.0 | 20.0 | -1.0 | -0.1 |
| max | 128.0 | 121.0 | 0.0 | 0.0 |


In [9]:
# let's now look at review_headline
# first let's randomly look at a couple rows
for i in range(5):
    # randint(rows) stays within 0..rows-1, so iloc cannot go out of bounds
    row_index = np.random.randint(rows)
    row = sample_df.iloc[row_index]
    print(f'review_headline_orig\t[{row["review_headline_orig"]}]')
    print(f'review_headline\t\t[{row["review_headline"]}]')

review_headline_orig	[Love it]
review_headline		[love]
review_headline_orig	[Not good product]
review_headline		[not good product]
review_headline_orig	[Not good product]
review_headline		[not good product]
review_headline_orig	[Not good product]
review_headline		[not good product]
review_headline_orig	[Good Product - Protects my iPhone]
review_headline		[good product protect iphon]


## review body - pretty significant reduction in length

The review body shrinks by about 38% on average (mean `review_body_percent_diff` of -0.382); the median reduction is about 40%

In [15]:
df = review_df
# character length of each review body before and after pre-processing
df["review_body_orig_len"] = df["review_body_orig"].str.len()
df["review_body_len"] = df["review_body"].str.len()
# negative values mean the pre-processed body is shorter
df["review_body_diff"] = df["review_body_len"] - df["review_body_orig_len"]
df["review_body_percent_diff"] = df["review_body_diff"] / df["review_body_orig_len"]
df[["review_body_orig_len", "review_body_len", "review_body_diff", "review_body_percent_diff"]].describe()

| | review_body_orig_len | review_body_len | review_body_diff | review_body_percent_diff |
|---|---|---|---|---|
| count | 22409.0 | 22409.0 | 22409.0 | 22409.0 |
| mean | 256.354099 | 148.489446 | -107.864653 | -0.382181 |
| std | 420.607621 | 238.309817 | 183.931886 | 0.119247 |
| min | 1.0 | 1.0 | -3916.0 | -0.891892 |
| 25% | 70.0 | 43.0 | -118.0 | -0.451477 |
| 50% | 141.0 | 85.0 | -57.0 | -0.40404 |
| 75% | 281.0 | 163.0 | -26.0 | -0.33945 |
| max | 9888.0 | 6029.0 | 0.0 | 0.0 |


In [11]:
# let's now look at review_body
# first let's randomly look at a couple rows
for i in range(5):
    # randint(rows) stays within 0..rows-1, so iloc cannot go out of bounds
    row_index = np.random.randint(rows)
    row = sample_df.iloc[row_index]
    print(f'review_body_orig\t[{row["review_body_orig"]}]')
    print(f'review_body\t\t[{row["review_body"]}]')

review_body_orig	[it very handed he keep his phone case  on his pantloop he alway know were it is.i like that becase it were it belong]
review_body		[veri hand keep hi phone case hi pantloop alway know like beca belong]
review_body_orig	[&#34;Plastic&#34; top, side and bottom gold &#34;paint&#34; is rubbing off. This is the third case I have bought, which aren't cheap, and although I like the sleek design I have not been happy with this defect.]
review_body		[plastic top side bottom gold paint rub thi third case bought cheap although like sleek design not happi thi defect]
review_body_orig	[All is useful, great value.]
review_body		[use great valu]
review_body_orig	[I have now purchased this product for both my iPad and my iPhone and it is a very good product and protects the screen from scratches. The clarity is so good that you can hardly even tell that anything has been applied to the screen. I do highly recommend you read the directions, clean your screen thoroughly, and take your 