# This notebook outlines pre-processing of Amazon Reviews

Most of the work is done in classes that I wrote, but I'll outline the pre-processing steps in this notebook. Running the pre-processing inside the notebook took a lot longer, so I run it from the command line instead, which lets me iterate more quickly.

All code is checked into my GitHub repo: https://github.com/sv650s/sb-capstone


* preprocess_amazon.py - python program that calls TextProcessor with parameters set to handle the amazon review file
* TextProcessor.py - processor class that uses various utilities to pre-process the file
* text_util.py - has a bunch of text processing methods to clean the data
* file_util.py - functions to handle files (i.e., convert TSV to CSV)
* df_util.py - utility to handle pandas DataFrames


Unit Tests:
* TestTextUtil.py - tests text_util.py

To Be Implemented:
* unit tests for file_util.py
* unit tests for df_util.py


## Before pre-processing, I had to convert the TSV to CSV because Pandas was not reading the columns correctly: it was putting multiple rows into a single column, resulting in headline columns with over 30k words

The original Amazon file had 9 million reviews, so I added a sampling parameter to reduce the size of the file. Currently, the sampling is pretty dumb: it just grabs every nth line in the file and puts it in the final CSV file. I will probably rewrite this later so it's based on random.rand.

I already ran this from the command line because it was faster, so the commented-out lines show how the other sizes would be generated.

In [6]:
from file_util import convert_tsv_to_csv


# full 9mil Wireless reviews - not enough memory locally to do this
ORIG_FILE_WIRELESS="dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00.tsv"

# about 22k reviews
DATA_FILE_TEST = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-test.csv"
# about 100023 reviews
DATA_FILE_TINY = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-tiny.csv"
# about 300068 reviews
DATA_FILE_SMALL = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-small.csv"
# about 450101 reviews
DATA_FILE_MEDIUM = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-medium.csv"
# about 900203 reviews
DATA_FILE_LARGE = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-large.csv"

# already ran this
convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_TEST, 400)
# convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_TINY, 90)
# convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_SMALL, 30)
# convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_MEDIUM, 20)
# convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_LARGE, 10)
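
As a rough sketch of the random sampling mentioned above (not the actual `convert_tsv_to_csv` implementation), each data row could be kept with probability 1/n instead of taking every nth line. The function name `sample_tsv_to_csv`, its `seed` parameter, and the choice of `csv.QUOTE_NONE` are assumptions made for illustration only.

```python
import csv
import random


def sample_tsv_to_csv(tsv_path, csv_path, n, seed=None):
    """Copy the header plus a random ~1/n of the data rows from a TSV to a CSV.

    Hypothetical helper: returns the number of data rows written.
    """
    rng = random.Random(seed)  # a seed makes the sample reproducible
    kept = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # Assumption: stray quote characters in the review text are what
        # confuse the parser, so QUOTE_NONE treats them as ordinary text.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)  # quotes fields on output where needed

        header = next(reader, None)
        if header is None:
            return 0  # empty input file
        writer.writerow(header)

        for row in reader:
            # Keep each row independently with probability 1/n
            if rng.random() < 1.0 / n:
                writer.writerow(row)
                kept += 1
    return kept
```

Unlike every-nth-line sampling, the output size is only approximately 1/n of the input, but the sample no longer depends on the order of rows in the file.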


## sample run and output of preprocess_amazon.py

I'm using a file with only about 22.5k entries so I can run it quickly in the notebook

Notice that the final output has 19889 rows because, after stemming and removing stop words, some of the review headlines are now blank

Pre-processing entails the following (in order):
* make everything lowercase
* remove newlines
* remove amazon tags - amazon embeds these [[VIDEOID:dsfljlsjf]] and [[ASIN:sdfjlsjdfl]] tags that need to be removed
* remove html tags - line breaks, etc are represented in reviews as HTML tags
* remove accent characters
* expand contractions - THIS IS NOT YET IMPLEMENTED but needs to be done before removing special characters because we want to expand "don't" into "do not" for our text processing
* remove special characters - anything that is not alphanumeric or spaces
* stem or lemmatize words - ONLY Porter stemming is implemented currently
* remove stop words - see text_util.py for stop words that I removed from nltk stop words because I think they will be important for sentiment analysis
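
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in text_util.py: the function name `clean_text`, the tiny `CONTRACTIONS` and `STOP_WORDS` tables, and the optional `stem` callable are all hypothetical. A stemmer such as NLTK's `PorterStemmer().stem` could be passed in as `stem`.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
# "not" and "no" are left out on purpose so negations survive.
STOP_WORDS = {"the", "a", "an", "is", "it", "at", "all", "for", "you"}


def clean_text(text, stem=None):
    """Apply the pre-processing steps in the order listed above."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"[\r\n]+", " ", text)                 # remove newlines
    text = re.sub(r"\[\[[a-z]+:[^\]]*\]\]", " ", text)   # [[tag:value]] markers
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    # Strip accents: decompose characters, then drop the non-ASCII marks
    text = (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore").decode("ascii"))
    for contraction, expansion in CONTRACTIONS.items():  # expand contractions
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # special characters
    tokens = text.split()
    if stem is not None:                                 # optional stemming
        tokens = [stem(token) for token in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop words
    return " ".join(tokens)


print(clean_text("Don't buy!<br />Café [[ASIN:B00X]] is GREAT\n"))
```

In this sketch, tags are replaced with a space rather than an empty string, so words on either side of a `<br />` stay separate.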

Columns that it drops right off the bat: marketplace, vine, verified_purchase

Columns that it is pre-processing: product_title, review_headline, review_body

Also, for convenience, there is a flag to retain the original column, so we can see the original text next to the pre-processed text and look for errors. I will not be using this flag for the final data files.

In [12]:
!python preprocess_amazon.py -l INFO -r \
    -o dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testout.csv \
    dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv


2019-05-01 23:51:47,402 INFO    __main__.main [41] - loading data frame from dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv
2019-05-01 23:51:47,573 INFO    __main__.main [43] - finished loading dataframe dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testin.csv
2019-05-01 23:51:47,573 INFO    __main__.main [44] - original dataframe length: 22505
2019-05-01 23:51:47,573 INFO    TextPreprocessor.preprocess_data [119] - start preprocessing data
2019-05-01 23:51:47,573 INFO    TextPreprocessor.preprocess_data [121] - column count before dropping columns: 15
2019-05-01 23:51:47,577 INFO    TextPreprocessor.preprocess_data [124] - column count after dropping columnes: 12
2019-05-01 23:51:47,577 INFO    TextPreprocessor.preprocess_data [127] - original row count: 22505
2019-05-01 23:51:47,591 INFO    TextPreprocessor.preprocess_data [129] - row count after dropping na: 22505
2019-05-01 23:51:47,591 INFO    TextPreprocessor.preprocess_data [131] - column count befo

In [13]:
import pandas as pd

PREPROCESSED_CSV = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-testout.csv"


## Reading the output file back in to look at some data

In [17]:
review_df = pd.read_csv(PREPROCESSED_CSV, parse_dates=["review_date"])
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22409 entries, 0 to 22408
Data columns (total 15 columns):
customer_id             22409 non-null int64
review_id               22409 non-null object
product_id              22409 non-null object
product_parent          22409 non-null int64
product_title_orig      22409 non-null object
product_title           22409 non-null object
product_category        22409 non-null object
star_rating             22409 non-null int64
helpful_votes           22409 non-null int64
total_votes             22409 non-null int64
review_headline_orig    22409 non-null object
review_headline         22409 non-null object
review_body_orig        22409 non-null object
review_body             22409 non-null object
review_date             22409 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(5), object(9)
memory usage: 2.6+ MB


In [19]:
# let's sample the dataframe so we can look at some of the data
sample_df = review_df.sample(n=10)

In [26]:
import numpy as np

rows = len(sample_df)
# first let's randomly look at a few rows
for _ in range(3):
    # randint(rows) draws from 0..rows-1; round(rand() * rows) could return rows, which is out of bounds
    row_index = np.random.randint(rows)
    row = sample_df.iloc[row_index]
    print(f'product_title_orig\t[{row["product_title_orig"]}]')
    print(f'product_title\t\t[{row["product_title"]}]')




product_title_orig	[HTC Desire 816 Black (Virgin mobile) - 5.5 inch S-LCD Display]
product_title		[htc desir 816 black virgin mobil 5 5 inch lcd display]
product_title_orig	[Samsung Galaxy Ace 4 G313M Unlocked GSM HSPA+ Android Smartphone]
product_title		[samsung galaxi ace 4 g313m unlock gsm hspa android smartphon]
product_title_orig	[Mini Adjustable Tripod+camera Holder for Iphone and Other Cellphone]
product_title		[mini adjust tripod camera holder iphon cellphon]


## I did notice that in product titles, things like 5.5 get converted to 5 5. I will have to do something about this if we decide to use product title
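
One possible way to handle this, sketched under the assumption that the special-character step is a single regular-expression substitution, is to spare a period only when it sits between two digits. The function name `remove_special_keep_decimals` is hypothetical and is not part of text_util.py.

```python
import re

# Matches: a period not preceded by a digit, a period not followed by a
# digit, or any character that is not a-z, 0-9, whitespace, or a period.
# A period between two digits matches none of these and is kept.
_SPECIAL = re.compile(r"(?<!\d)\.|\.(?!\d)|[^a-z0-9\s.]")


def remove_special_keep_decimals(text):
    """Lowercase, replace special characters with spaces, keep decimals."""
    cleaned = _SPECIAL.sub(" ", text.lower())
    return " ".join(cleaned.split())  # collapse repeated whitespace


title = "HTC Desire 816 Black (Virgin mobile) - 5.5 inch S-LCD Display"
print(remove_special_keep_decimals(title))
# htc desire 816 black virgin mobile 5.5 inch s lcd display
```

Sentence-ending periods are still removed, including one that directly follows a number, because they are not followed by another digit.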

In [27]:
# let's now look at review_headline by printing a few random rows
for _ in range(5):
    # randint(rows) keeps the index within 0..rows-1
    row_index = np.random.randint(rows)
    row = sample_df.iloc[row_index]
    print(f'review_headline_orig\t[{row["review_headline_orig"]}]')
    print(f'review_headline\t\t[{row["review_headline"]}]')

review_headline_orig	[Not at all useful.]
review_headline		[not use]
review_headline_orig	[Everyone fits]
review_headline		[everyon fit]
review_headline_orig	[The case does not work because the smart chip has been removed]
review_headline		[case doe not work becau smart chip ha remov]
review_headline_orig	[No more annoying AUX cable in my car.]
review_headline		[no annoy aux cabl car]
review_headline_orig	[Bulky but protective]
review_headline		[bulki protect]


In [28]:
# let's now look at review_body by printing a few random rows
for _ in range(5):
    # randint(rows) keeps the index within 0..rows-1
    row_index = np.random.randint(rows)
    row = sample_df.iloc[row_index]
    print(f'review_body_orig\t[{row["review_body_orig"]}]')
    print(f'review_body\t\t[{row["review_body"]}]')

review_body_orig	[This product eliminates the need for an AUX cable , and because it converts your radio into a blue tooth<br />receiver you get better sound as well. The charging is accurate and lasts about 6 hours. In about 30<br />minutes its fully charged again. Syncs easily to my phone and is more affordable than several other<br />models.]
review_body		[thi product elimin need aux cabl becau convert radio blue toothreceiv get better sound well charg accur last 6 hour 30minut fulli charg sync easili phone afford sever othermodel]
review_body_orig	[Does not have front-facing VGA camera as it says in the product specs.]
review_body		[doe not front face vga camera say product spec]
review_body_orig	[Great gift new baby everyone in picture now.]
review_body		[great gift new babi everyon pictur]
review_body_orig	[It was exactly what was advertised. Looks good. Holds good. For the price, you can't complain. I would recommend this to anyone.]
review_body		[wa exactli wa adverti look good