# Amazon Review Summarizer
### Cleaning and Pre-processing

In this notebook, we build a prototype of our Amazon Review Summarizer by preprocessing the reviews of one specific category: "Clothing, Shoes, and Jewelry." We chose this category because its dataset is large enough that we can filter it down and still keep a reasonably sized dataset for testing the summarizer.

### Steps:
1. **Data Selection**: We start by selecting the "Clothing, Shoes, and Jewelry" category from the Amazon review dataset. This category has a substantial number of reviews, which will be useful for building a robust summarizer.

2. **Data Filtering**: Because the complete dataset is large, we filter the reviews in the selected category down to a more manageable subset. This lets prototype development focus on a representative sample.

3. **Data Sampling**: After filtering, we randomly sample a subset of reviews from the "Clothing, Shoes, and Jewelry" category. Sampling gives us enough data to start preprocessing while keeping the dataset size manageable.

By following these steps, we aim to produce a well-prepared dataset tailored to the needs of our Amazon Review Summarizer prototype. Let's proceed with the data preprocessing for this category.
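The filtering and sampling steps above can be sketched as follows. This is a minimal illustration and not the notebook's actual pipeline: the function name 'filter_and_sample', the filter criteria (verified purchases with non-empty text), and the seed are assumptions chosen for the example. Only the column names 'reviewText' and 'verified' come from the dataset.

```python
import pandas as pd

def filter_and_sample(reviews: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Keep verified reviews that have text, then draw a reproducible random sample."""
    # Assumed filter criteria: non-empty review text and a verified purchase
    has_text = reviews["reviewText"].notna() & (reviews["reviewText"].str.strip() != "")
    filtered = reviews[has_text & reviews["verified"]]
    # Never request more rows than remain after filtering
    return filtered.sample(n=min(n, len(filtered)), random_state=seed)

# Toy data standing in for the real review DataFrame
toy = pd.DataFrame({
    "reviewText": ["Fits well", None, "Too small", "   ", "Great shoes"],
    "verified": [True, True, False, True, True],
})
sample = filter_and_sample(toy, n=2)
print(sample)
```

Fixing 'random_state' makes the sample reproducible across notebook runs, which helps when comparing summarizer outputs later.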

### Importing Custom Module for Data Processing

Here we import a custom module named 'sanitization', which provides the 'getDF' function used for data processing and sanitization. The module lives in a separate directory, '/Users/williamfussell/Documents/Github/amasum/src/'.

#### Steps:
1. **Setting Custom Module Path**: We set the variable 'new_path' to the path where the custom module 'sanitization' is located.

2. **Checking and Appending Custom Module Path**: We check if the custom module path is already included in the system path using the 'sys.path' list. If it is not already present, we append the custom module path to the system path. This step ensures that Python can access the custom module and its functions.

3. **Importing the Custom Module**: After appending the custom module path to the system path, we import the contents of the 'sanitization' module with 'from sanitization import *'. This makes the 'getDF' function available; it will be used later for data processing and sanitization.

By importing the custom module, we can now use the 'getDF' function in our current code to perform specific data processing tasks that the custom module provides.

In [1]:
import sys
import os
import pandas as pd
import numpy as np

# Set the path to the custom module directory
new_path = '/Users/williamfussell/Documents/Github/amasum/src/'

# Check if the custom module path is not already in the system path
if new_path not in sys.path:
    # Append the custom module path to the system path to access the custom module
    sys.path.append(new_path)

# Import everything from the custom module 'sanitization', including the function 'getDF'
from sanitization import *

In [3]:
df = getDF('/Users/williamfussell/Documents/local_capstone_master/capstone_data/capstone_reviewdata/Automotive_5.json.gz')
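The source of 'getDF' is not shown in this notebook. As a rough idea of what such a loader might do, the sketch below reads a gzipped JSON-lines file (one review object per line) into a DataFrame. The name 'load_reviews' is hypothetical, and the real 'getDF' may perform additional sanitization.

```python
import gzip
import json
import pandas as pd

def load_reviews(path: str) -> pd.DataFrame:
    """Read a gzipped JSON-lines file (one JSON object per line) into a DataFrame."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Skip blank lines; parse every other line as one review record
        records = [json.loads(line) for line in handle if line.strip()]
    # Keys missing from a record become NaN in the resulting columns
    return pd.DataFrame.from_records(records)
```

Fields that appear in only some reviews end up as sparse columns with many missing values.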

In [4]:
df.head()

| | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | False | 05 1, 2015 | A8WEXFRWX1ZHH | 209688726 | {'Color:': ' AC'} | Goldengate | After I wrote the below review, the manufactur... | Works well if you place phone in horizontally ... | 1430438400 | | |
| 1 | 1.0 | True | 04 19, 2018 | ABCA1A8E4DGV1 | 209688726 | {'Color:': ' Blue'} | noe | It sucks barely picks up anything definitely n... | sucks | 1524096000 | | |
| 2 | 1.0 | True | 04 16, 2018 | A1NX8HM89FRQ32 | 209688726 | {'Color:': ' Black'} | Eduard | Well to write a short one, it blew 2 fuses of ... | Defective | 1523836800 | | |
| 3 | 3.0 | True | 04 13, 2018 | A1X77G023NY0KY | 209688726 | {'Color:': ' CA'} | Lauren | I have absolutely no memory of buying this but... | Looks cool! Probably works | 1523577600 | | |
| 4 | 5.0 | True | 04 8, 2018 | A3GK37JO2MGW6Q | 209688726 | {'Color:': ' Black'} | danny | it ok it does it job | Five Stars | 1523145600 | | |


In [5]:
df.count()

overall           1711519
verified          1711519
reviewTime        1711519
reviewerID        1711519
asin              1711519
style              593415
reviewerName      1711379
reviewText        1710653
summary           1711177
unixReviewTime    1711519
vote               190868
image               42694
dtype: int64
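Because 'count()' reports non-null values, the lower counts for 'style', 'vote', and 'image' indicate sparse columns. A small helper can make the gaps explicit. The sketch below is illustrative only: 'missing_report' is a hypothetical name and the toy DataFrame stands in for the real data.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the number and percentage of missing values per column, worst first."""
    missing = frame.isna().sum()
    report = pd.DataFrame({
        "missing": missing,
        "percent": (100 * missing / len(frame)).round(2),
    })
    return report.sort_values("missing", ascending=False)

# Toy data standing in for the real review DataFrame
toy = pd.DataFrame({
    "overall": [5.0, 4.0, 1.0, 3.0],
    "vote": ["2", None, None, None],
    "style": [None, {"Color:": " Black"}, None, {"Color:": " Blue"}],
})
report = missing_report(toy)
print(report)
```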

In [6]:
df['overall'].value_counts()

5.0    1234163
4.0     242409
3.0     102649
1.0      80208
2.0      52090
Name: overall, dtype: int64
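The ratings are heavily skewed: 5-star reviews make up about 72% of the total (1,234,163 of 1,711,519). Relative shares can be read directly with 'value_counts(normalize=True)'. The snippet below uses a toy Series in place of 'df['overall']'.

```python
import pandas as pd

# Toy ratings standing in for df['overall']
ratings = pd.Series([5.0, 5.0, 5.0, 4.0, 1.0])

# normalize=True returns fractions instead of raw counts; sort by rating value
shares = ratings.value_counts(normalize=True).sort_index()
print(shares)
```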

In [6]:
df.describe()

| | overall | unixReviewTime |
|---|---|---|
| count | 11285460.0 | 11285460.0 |
| mean | 4.277177 | 1454477000.0 |
| std | 1.130252 | 45344330.0 |
| min | 1.0 | 1048378000.0 |
| 25% | 4.0 | 1425773000.0 |
| 50% | 5.0 | 1458778000.0 |
| 75% | 5.0 | 1486685000.0 |
| max | 5.0 | 1538698000.0 |
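Summary statistics on 'unixReviewTime' are hard to read as raw seconds since the epoch. Converting the column to datetimes makes the date range easier to interpret. The sketch below uses two timestamps from the 'df.head()' output; the column name 'reviewDate' is an assumption introduced for the example.

```python
import pandas as pd

# Two timestamps taken from the head() output above
toy = pd.DataFrame({"unixReviewTime": [1430438400, 1524096000]})

# unit="s" interprets the integers as seconds since 1970-01-01 (UTC)
toy["reviewDate"] = pd.to_datetime(toy["unixReviewTime"], unit="s")
print(toy)
```

The converted dates agree with the 'reviewTime' strings of the same rows ("05 1, 2015" and "04 19, 2018").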
