# This notebook outlines pre-processing of Amazon Reviews

Most of the work is done in classes that I wrote, but I'll outline what I did for pre-processing in this notebook. Running pre-processing inside the notebook took a lot longer, so I run it from the command line instead, which lets me iterate faster.

Code is located in util folder:
* preprocess_amazon.py - python program that calls TextProcessor with parameters set to handle the amazon review file
* TextProcessor.py - processor class that uses various utilities to pre-process the file
* text_util.py - has functions to do text processing
* file_util.py - functions to handle files (i.e., convert tsv to csv)
* df_util.py - utility to handle pandas DataFrames


Unit Tests:
* TestTextUtil.py - tests text_util.py

To Be Implemented:
* unit tests for file_util.py
* unit tests for df_util.py


## Before we pre-process, I had to convert the tsv to csv because Pandas was not reading the columns correctly: it was putting multiple rows into a single column, resulting in headline columns that had over 30k words

The original Amazon file had 9 million reviews. I added a sampling parameter to reduce the size of the file. Currently, the sampling is pretty dumb: it just grabs every nth line in the file and puts it in the final csv file. I will probably rewrite this later so it is based on random sampling.
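
To make the difference between the two sampling strategies concrete, here is a minimal sketch. The class and function names (`EveryNthSampler`, `RandomSampler`, `sample_rows`) are hypothetical and are not the ones in file_util.py; the random variant is just one possible way the rewrite could look.

```python
import random


class EveryNthSampler:
    """Hypothetical sampler: keeps every nth row (deterministic)."""

    def __init__(self, n):
        self.n = n
        self.seen = 0

    def keep(self):
        self.seen += 1
        return self.seen % self.n == 0


class RandomSampler:
    """Hypothetical sampler: keeps each row independently with probability 1/n."""

    def __init__(self, n, seed=None):
        self.p = 1.0 / n
        self.rng = random.Random(seed)  # seed makes the sample reproducible

    def keep(self):
        return self.rng.random() < self.p


def sample_rows(rows, sampler):
    """Yield only the rows the sampler decides to keep."""
    for row in rows:
        if sampler.keep():
            yield row


rows = [f"review {i}" for i in range(1, 1001)]
nth = list(sample_rows(rows, EveryNthSampler(100)))
rnd = list(sample_rows(rows, RandomSampler(100, seed=42)))
print(len(nth))  # 10
print(len(rnd))  # close to 10 on average, but not guaranteed
```

Every-nth sampling gives an exact output size but can pick up any ordering pattern in the file; random sampling avoids that at the cost of a slightly variable row count.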

I already ran the conversion via the command line because it was faster, so the conversion calls in this notebook are switched off by the `CONVERT_FILE` flag and only show how the sampled files would be generated.

In [3]:
# import sibling utilities
import sys
sys.path.append('..')

from util.file_util import convert_tsv_to_csv
import pandas as pd

# only need to run this one time
CONVERT_FILE = False

# full 9 million Wireless reviews - not enough memory locally to do this
ORIG_FILE_WIRELESS = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00.tsv"

# about 22k reviews
DATA_FILE_TEST = "dataset/amazon_reviews_us_Wireless_v1_00-50k-preprocessed.csv"
# about 100023 reviews
DATA_FILE_50K = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-50k.csv"
# about 300068 reviews
DATA_FILE_100K = "dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-100k.csv"




# First we have to convert Amazon files to csv format

Pandas had issues with the original tsv format and would merge multiple lines together, so I had to convert the file to csv format.

I wrote the `convert_tsv_to_csv` function to fix this. It is called as follows:


In [4]:
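if CONVERT_FILE:
    # NOTE: SimpleSampler is not imported in the cell above; import it from the
    # util module that defines it before setting CONVERT_FILE to True.
    # The sampler argument is n: keeping every 90th line of ~9 million gives ~100k rows,
    # and every 180th line gives ~50k rows.
    convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_100K, SimpleSampler(90))
    convert_tsv_to_csv(ORIG_FILE_WIRELESS, DATA_FILE_50K, SimpleSampler(180))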
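
The body of `convert_tsv_to_csv` is not shown in this notebook. As a rough sketch of what such a conversion can look like, the hypothetical `tsv_to_csv` below uses only the standard `csv` module. It assumes the merged rows come from unbalanced double quotes inside the tab-separated fields, which is a common cause of this symptom but is a guess here.

```python
import csv
import io


def tsv_to_csv(src, dst, keep_every=1):
    """Copy tab-separated rows from src to dst as CSV.

    Hypothetical stand-in for convert_tsv_to_csv: keep_every=n keeps the
    header plus every nth data row. Returns the number of data rows written.
    """
    # QUOTE_NONE treats quote characters in the TSV as ordinary text, so an
    # unbalanced quote cannot swallow the lines that follow it.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    # QUOTE_MINIMAL quotes only fields containing commas, quotes, or newlines.
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(next(reader))  # header row
    kept = 0
    for i, row in enumerate(reader, start=1):
        if i % keep_every == 0:
            writer.writerow(row)
            kept += 1
    return kept


# Two rows with stray quotes and a comma, similar to what free-text reviews contain.
tsv = io.StringIO(
    'review_headline\treview_body\n'
    'Great "case\tfits well\n'
    'Bad, broke\tstopped working" after a week\n'
)
out = io.StringIO()
kept = tsv_to_csv(tsv, out)
print(out.getvalue())
```

With real files, open both with `newline=""` as the `csv` module documentation recommends.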

## Preprocessing Amazon Review File

To run pre-processing, execute the following:
```
cd ../tools
python amazon_review_preprocessor.py -l INFO -o ../dataset/amazon_reviews ../dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-100k.csv
```

I'm using a file that has only 25k entries so I can run it in the notebook quickly.

Notice that the final output is 19889 rows because, after lemmatizing and removing stop words, some of the review headlines are now blank and those rows are dropped.

Pre-processing entails the following (in order):
* make everything lowercase
* remove newlines
* remove amazon tags - Amazon embeds tags such as [[VIDEOID:dsfljlsjf]] and [[ASIN:sdfjlsjdfl]] that need to be removed
* remove html tags - line breaks, etc. are represented in reviews as HTML tags
* remove accent characters
* expand contractions - expands contractions like he's; this needs to be done before removing special characters because we want to expand don't into do not, which is no longer possible once the apostrophe is gone
* remove special characters - anything that is not alphanumeric or a space
* remove stop words - see text_util.py for the stop words that I removed from the nltk stop word list because I think they will be important for sentiment analysis
* lemmatize words - reduce each word to its base form using WordNet
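
The ordering of these steps can be sketched with the standard library alone. This is an illustration, not the code in text_util.py: `preprocess`, `CONTRACTIONS`, and `STOP_WORDS` are hypothetical names with tiny made-up tables, and the lemmatization step is left out because it needs NLTK and its WordNet data.

```python
import re
import unicodedata

# Tiny illustrative tables (assumptions); the real lists are in text_util.py.
CONTRACTIONS = {"don't": "do not", "he's": "he is", "i'm": "i am", "can't": "cannot"}
STOP_WORDS = {"i", "he", "is", "am", "the", "a", "it", "this", "do", "and", "to"}

# \b anchors make sure only whole words are expanded.
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)


def preprocess(text, stop_words=STOP_WORDS):
    text = text.lower()                                 # 1. lowercase
    text = re.sub(r"[\r\n]+", " ", text)                # 2. remove newlines
    text = re.sub(r"\[\[[a-z]+:[^\]]*\]\]", " ", text)  # 3. any [[TAG:value]] marker
    text = re.sub(r"<[^>]+>", " ", text)                # 4. html tags
    # 5. strip accents: decompose, then drop the non-ASCII combining marks
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 6. expand contractions while the apostrophes are still present
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # 7. special characters
    words = [w for w in text.split() if w not in stop_words]  # 8. stop words
    return " ".join(words)


raw = "I'm impressed!<br />Don't buy the café case [[ASIN:B00XYZ]]\nHe's happy"
print(preprocess(raw))  # impressed not buy cafe case happy
```

Swapping steps 6 and 7 would turn `don't` into `don t` before it could be expanded. The word-boundary anchors matter as well: tokens such as `ampressed` and `swi amming` in the sampled output further down look like what an unanchored `im` to `i am` replacement would produce, though that is only a guess.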

Columns that it drops right off the bat: marketplace, vine, verified_purchase

Columns that it pre-processes: product_title, review_headline, review_body

Also, for convenience, there is a flag to retain the original column so we can see the original text next to the pre-processed text and look for errors. I will not be using this flag for the final data files.


Here is the list of default stop words from nltk:


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
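
As a small sketch of how a default list like this can be trimmed for sentiment work: the words kept below are an assumption for illustration (the sampled output later in this notebook does retain `not` and `very`), and the actual set is defined in text_util.py.

```python
# A few entries copied from the NLTK list above; the full list works the same way.
nltk_stop_words = ["i", "the", "is", "no", "nor", "not", "very", "too", "don't", "wasn't"]

# Assumption: illustrative choice of words to keep, not the list in text_util.py.
KEEP_FOR_SENTIMENT = {"no", "nor", "not", "very", "too"}

# Set difference: everything in the default list except the words we keep.
stop_words = set(nltk_stop_words) - KEEP_FOR_SENTIMENT


def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)


print(remove_stop_words("the screen is not very sharp", stop_words))
# screen not very sharp
```

Keeping negations and intensifiers means "not sharp" and "sharp" stay distinguishable after pre-processing.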

# Look at our Pre-processed file

In [5]:
import pandas as pd

PREPROCESSED_CSV = "../dataset/amazon_reviews/amazon_reviews_us_Wireless_v1_00-100k-preprocessed.csv"


## Reading the output file back in to look at some data

In [6]:
review_df = pd.read_csv(PREPROCESSED_CSV, parse_dates=["review_date"])
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99567 entries, 0 to 99566
Data columns (total 6 columns):
star_rating        99567 non-null int64
helpful_votes      99567 non-null int64
total_votes        99567 non-null int64
review_headline    99567 non-null object
review_body        99567 non-null object
review_date        99567 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 4.6+ MB


### Let's sample the data

In [9]:
# let's sample the dataframe so we can look at some of the data
sample_df = review_df.sample(n=10)
sample_df

| | star_rating | helpful_votes | total_votes | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|
| 9018 | 5 | 0 | 0 | five star | essential equipment ditch bag | 2015-06-26 |
| 67099 | 1 | 0 | 0 | illegal artwork stolen | please remove item others artwork stole sellin... | 2013-11-14 |
| 79379 | 2 | 0 | 0 | screen blue hue not sharp | replaced quite screen small electronics shop f... | 2013-03-21 |
| 98600 | 5 | 0 | 0 | far good | purchased product day ago very ampressed great... | 2007-12-14 |
| 95730 | 5 | 0 | 0 | nice little case | quality case top notch zen fit perfectly leath... | 2010-09-10 |
| 4348 | 5 | 0 | 0 | useful case | person like go swi amming everything wet want ... | 2015-07-31 |
| 86589 | 5 | 0 | 0 | very nice screen portector | ordered screen protector wife iphone got first... | 2012-10-31 |
| 6647 | 5 | 0 | 0 | five star | fit perfect love | 2015-07-14 |
| 83575 | 3 | 0 | 0 | tinny sound | read review product decided try price not bad ... | 2013-01-06 |
| 34398 | 5 | 0 | 1 | five star | great product | 2014-12-22 |
