## Data Prep

Download a sample from the [Amazon Customer Reviews](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) dataset.

In [2]:
!aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Electronics_v1_00.tsv.gz reviews.tsv.gz

download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Electronics_v1_00.tsv.gz to ./reviews.tsv.gz


Load the compressed reviews into pandas, selecting the product ID and title, review headline, review body, star rating and vote counts (this should take approximately 30 seconds).

In [5]:
%%time
import pandas as pd

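# Load every row and skip malformed lines; uncomment nrows to load only a sample.
# on_bad_lines='skip' needs pandas >= 1.3 (older versions use error_bad_lines=False).
df_reviews = pd.read_csv('reviews.tsv.gz', compression='gzip', on_bad_lines='skip', #nrows=100000,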
                         sep='\t', usecols=['product_id', 'product_title',
                                            'review_headline', 'review_body', 'star_rating',
                                            'helpful_votes', 'total_votes']).dropna()
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3090980 entries, 0 to 3091102
Data columns (total 7 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   product_id       object
 1   product_title    object
 2   star_rating      int64 
 3   helpful_votes    int64 
 4   total_votes      int64 
 5   review_headline  object
 6   review_body      object
dtypes: int64(3), object(4)
memory usage: 188.7+ MB
CPU times: user 27.3 s, sys: 1.19 s, total: 28.5 s
Wall time: 28.6 s


Inspect the first few rows of the dataset

In [6]:
df_reviews.head()

|  | product_id | product_title | star_rating | helpful_votes | total_votes | review_headline | review_body |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | B00428R89M | yoomall 5M Antenna WIFI RP-SMA Female to Male ... | 5 | 0 | 0 | Five Stars | As described. |
| 1 | B000068O48 | Hosa GPM-103 3.5mm TRS to 1/4" TRS Adaptor | 5 | 0 | 0 | It works as advertising. | It works as advertising. |
| 2 | B000GGKOG8 | Channel Master Titan 2 Antenna Preamplifier | 5 | 1 | 1 | Five Stars | Works pissa |
| 3 | B000NU4OTA | LIMTECH Wall charger + USB Hotsync & Charging ... | 1 | 0 | 0 | One Star | Did not work at all. |
| 4 | B00JOQIO6S | Skullcandy Air Raid Portable Bluetooth Speaker | 5 | 1 | 1 | Overall pleased with the item | Works well. Bass is somewhat lacking but is pr... |


### Feature engineering

Filter to reviews that have at least 5 votes, then calculate a helpful score (helpful votes divided by total votes) and a sentiment label based on the star rating.

In [None]:
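# Copy the filtered frame so the new columns are not assigned to a slice of the original
df_reviews = df_reviews[df_reviews['total_votes'] >= 5].copy()
df_reviews['helpful_score'] = df_reviews['helpful_votes'] / df_reviews['total_votes']
# Right-closed bins: 1-2 stars are Negative, 3 stars Neutral, 4-5 stars Positive
df_reviews['sentiment'] = pd.cut(df_reviews['star_rating'], bins=[0,2,3,6], labels=['Negative','Neutral','Positive'])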
df_reviews.describe()
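
To make the two derived columns concrete, here is a minimal plain-Python sketch of the same logic on a few made-up reviews. The function names and toy rows are illustrative assumptions, not part of the notebook; the bin edges follow the `pd.cut` call above.

```python
# Plain-Python stand-in for the pandas expressions above (toy data, not from the dataset).
def helpful_score(helpful_votes, total_votes):
    """Share of votes that marked the review as helpful."""
    return helpful_votes / total_votes

def sentiment(star_rating):
    """Mirror the right-closed bins (0, 2], (2, 3], (3, 6]."""
    if star_rating <= 2:
        return 'Negative'
    if star_rating == 3:
        return 'Neutral'
    return 'Positive'

toy_reviews = [  # (star_rating, helpful_votes, total_votes)
    (1, 9, 10),
    (3, 2, 8),
    (5, 4, 5),
    (4, 1, 3),   # fewer than 5 votes, so it is filtered out
]
rows = [(sentiment(s), helpful_score(h, t)) for s, h, t in toy_reviews if t >= 5]
print(rows)
```

Only the first three toy reviews survive the vote filter, each paired with its sentiment label and helpful score.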

Visualize the helpful score grouped by sentiment. This lets us check whether a high helpful score is associated with strongly negative or strongly positive reviews.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use("dark_background")
sns.displot(df_reviews, x='helpful_score', col='sentiment', hue='star_rating', kind='kde', palette='icefire')

Group by the product and get the count of reviews, as well as sum of helpful and total votes.

In [10]:
df_votes =  df_reviews.groupby('product_id').agg({'product_id': 'count', 'helpful_votes': 'sum', 'total_votes': 'sum'})
df_votes.describe()

|  | product_id | helpful_votes | total_votes |
| --- | --- | --- | --- |
| count | 52376.0 | 52376.0 | 52376.0 |
| mean | 5.559187 | 88.608828 | 109.393654 |
| std | 12.490246 | 371.332497 | 437.566803 |
| min | 1.0 | 0.0 | 5.0 |
| 25% | 1.0 | 6.0 | 9.0 |
| 50% | 2.0 | 17.0 | 22.0 |
| 75% | 5.0 | 58.0 | 73.0 |
| max | 638.0 | 43288.0 | 46228.0 |


In [11]:
df_reviews = df_reviews.merge(df_votes, how='inner', left_on='product_id', right_index=True, suffixes=('','_total'))
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 291168 entries, 18 to 3091057
Data columns (total 12 columns):
 #   Column               Non-Null Count   Dtype   
---  ------               --------------   -----   
 0   product_id           291168 non-null  object  
 1   product_title        291168 non-null  object  
 2   star_rating          291168 non-null  int64   
 3   helpful_votes        291168 non-null  int64   
 4   total_votes          291168 non-null  int64   
 5   review_headline      291168 non-null  object  
 6   review_body          291168 non-null  object  
 7   helpful_score        291168 non-null  float64 
 8   sentiment            291168 non-null  category
 9   product_id_total     291168 non-null  int64   
 10  helpful_votes_total  291168 non-null  int64   
 11  total_votes_total    291168 non-null  int64   
dtypes: category(1), float64(1), int64(6), object(4)
memory usage: 26.9+ MB


In [12]:
df_reviews['is_helpful'] = (df_reviews['helpful_score'] > 0.80)
df_reviews['is_helpful'].sum()/df_reviews['is_helpful'].count()

0.5421577920650621

### SageMaker starts from here

In [None]:
## Split the dataset into training, validation and test sets

In [None]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df_reviews, test_size=0.1, random_state=42) 
val_df, test_df = train_test_split(val_df, test_size=0.5, random_state=42)
print('split train: {}, val: {}, test: {} '.format(train_df.shape[0], val_df.shape[0], test_df.shape[0]))
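
The two calls above first hold out 10% of the rows, then split that holdout in half, giving roughly a 90/5/5 split. The sketch below works through the arithmetic for the 291168 rows reported earlier. It assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training side; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

n_rows = 291168  # row count reported by df_reviews.info() above
n_train, n_holdout = split_sizes(n_rows, 0.1)   # first call: 90% train, 10% holdout
n_val, n_test = split_sizes(n_holdout, 0.5)     # second call: holdout halved into val and test
print(n_train, n_val, n_test)
```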

In [None]:
test_df.to_csv('test.csv', index=False, header=True)

In [None]:
from spacy.lang.en import English

index_to_label = {0: 'NotHelpful', 1: 'Helpful'} 
nlp = English()
tokenizer = nlp.tokenizer

def labelize_df(df):
    return '__label__' + df['is_helpful'].apply(lambda is_helpful: index_to_label[is_helpful])

def tokenize_sent(sent, max_length=1000):
    return ' '.join([token.text for token in tokenizer(sent)])[:max_length]

def tokenize_df(df):
    return (df['review_headline'].apply(tokenize_sent) + ' ' + 
            df['review_body'].apply(tokenize_sent))

In [None]:
labelize_df(train_df.head(3)) + ' ' + tokenize_df(train_df.head(3))
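
Each training line produced this way has the form `__label__<Label> <space-separated tokens>`, which is the supervised input format BlazingText expects. The sketch below builds one such line without spaCy; the regex tokenizer and the helper names are stand-in assumptions for illustration, and their token boundaries may differ from spaCy's.

```python
import re

INDEX_TO_LABEL = {False: 'NotHelpful', True: 'Helpful'}

def simple_tokenize(text, max_length=1000):
    """Regex stand-in for the spaCy tokenizer: split words from punctuation, then truncate."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return ' '.join(tokens)[:max_length]

def to_blazingtext_line(is_helpful, headline, body):
    """Build one '__label__<Label> <tokens>' training line."""
    label = '__label__' + INDEX_TO_LABEL[is_helpful]
    return label + ' ' + simple_tokenize(headline) + ' ' + simple_tokenize(body)

line = to_blazingtext_line(True, 'Five Stars', 'As described.')
print(line)
```

Note that, as in the notebook's `tokenize_sent`, the truncation applies to characters of the joined string, not to a number of tokens.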

In [None]:
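%%time
# Build the labelled, tokenized text for each split before writing it out
train_text = labelize_df(train_df) + ' ' + tokenize_df(train_df)
val_text = labelize_df(val_df) + ' ' + tokenize_df(val_df)
train_text.to_csv('train.txt', index=False, header=False)
val_text.to_csv('validation.txt', index=False, header=False)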

In [None]:
## Upload the dataset to an S3 bucket

In [None]:
import sagemaker

# Get the session and default bucket
role = sagemaker.get_execution_role()
session = sagemaker.session.Session()
bucket = session.default_bucket()

# Set the prefix for this dataset
prefix = 'mab-reviews-helpfulness'

s3_train_uri = session.upload_data('train.txt', bucket, prefix + '/data/training')
s3_val_uri = session.upload_data('validation.txt', bucket, prefix + '/data/validation')
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

### Model training with hyperparameters

In [None]:
import boto3
from sagemaker.estimator import Estimator

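region_name = boto3.Session().region_name
image_uri = sagemaker.image_uris.retrieve("blazingtext", region_name)

# Notebook parameters; the values below are assumptions that mirror the tuning estimator later on
training_instance_count = 1
training_instance_type = 'ml.c5.4xlarge'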

estimator = Estimator(image_uri=image_uri,
                      role=role, 
                      instance_count=training_instance_count, # Param
                      instance_type=training_instance_type, # Param
                      volume_size = 30,
                      max_run = 360000,
                      input_mode= 'File',
                      output_path=s3_output_location,
                      sagemaker_session=session)

estimator.set_hyperparameters(mode="supervised",
                              epochs=10,
                              min_epochs=5, # Min epochs before early stopping is introduced
                              early_stopping=True,
                              patience=2,
                              learning_rate=0.01,
                              min_count=2, # words that appear less than min_count are discarded 
                              word_ngrams=1, # the number of word n-gram features to use.
                              vector_dim=16, # dimensions of embedding layer
                              )

In [None]:
from sagemaker.inputs import TrainingInput

input_train = TrainingInput(s3_data=s3_train_uri, content_type="text/plain")
input_val = TrainingInput(s3_data=s3_val_uri, content_type="text/plain")
data_channels = {'train': input_train, 'validation': input_val}

estimator.fit(data_channels)

## Run Tuning Job

To try to improve on our model, let's run a tuning job to find the hyperparameters that maximize validation accuracy, and register this model.

### Set Up Hyperparameter Tuning

Create the [BlazingText](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html) binary classifier for review helpfulness.

In [None]:
import boto3
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

region_name = boto3.Session().region_name
image_uri = sagemaker.image_uris.retrieve("blazingtext", region_name)
print(f'Using container: {image_uri}')

estimator = Estimator(image_uri,
                      role, 
                      instance_count=1, 
                      instance_type='ml.c5.4xlarge',
                      volume_size = 30,
                      max_run = 360000,
                      input_mode= 'File',
                      output_path=s3_output_location,
                      sagemaker_session=session)

estimator.set_hyperparameters(mode="supervised",
                              epochs=10,
                              min_epochs=5, # Min epochs before early stopping is introduced
                              early_stopping=False,
                              learning_rate=0.01,
                              min_count=2, # words that appear less than min_count are discarded 
                              word_ngrams=1, # the number of word n-gram features to use.
                              vector_dim=32, # dimensions of embedding layer
                             )

Tune an Amazon SageMaker BlazingText text classification model with the following [hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext-tuning.html).


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| buckets |  `IntegerParameterRange`  |  \[1000000\-10000000\]  | 
| epochs |  `IntegerParameterRange`  |  \[5\-15\]  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 0\.005, MaxValue: 0\.01  | 
| min\_count |  `IntegerParameterRange`  |  \[0\-100\]  | 
| mode |  `CategoricalParameterRange`  |  \[`'supervised'`\]  | 
| vector\_dim |  `IntegerParameterRange`  |  \[32\-300\]  | 
| word\_ngrams |  `IntegerParameterRange`  |  \[1\-3\]  | 

In [None]:
# vector_dim appears once: the earlier ContinuousParameter(1, 10) entry was a duplicate key
# that the later IntegerParameter(32, 300) entry silently overwrote
hyperparameter_ranges = {'epochs': IntegerParameter(5, 50),
                         'learning_rate': ContinuousParameter(0.005, 0.01),
                         'min_count': IntegerParameter(0, 100),
                         'word_ngrams': IntegerParameter(1, 3),
                         'vector_dim': IntegerParameter(32, 300)}
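
As a quick sanity check, the chosen ranges can be compared against the recommended ranges in the table above. The sketch below does this with plain tuples instead of SDK parameter objects; `outside_recommended` is a hypothetical helper written for this illustration.

```python
# Recommended ranges copied from the table above: name -> (low, high).
recommended = {
    'epochs': (5, 15),
    'learning_rate': (0.005, 0.01),
    'min_count': (0, 100),
    'vector_dim': (32, 300),
    'word_ngrams': (1, 3),
}
# Ranges used in this notebook, written as plain tuples.
chosen = {
    'epochs': (5, 50),
    'learning_rate': (0.005, 0.01),
    'min_count': (0, 100),
    'vector_dim': (32, 300),
    'word_ngrams': (1, 3),
}

def outside_recommended(chosen, recommended):
    """Return the names whose chosen range extends beyond the recommended one."""
    return sorted(
        name for name, (low, high) in chosen.items()
        if low < recommended[name][0] or high > recommended[name][1]
    )

print(outside_recommended(chosen, recommended))
```

Only `epochs` is flagged: the notebook searches up to 50 epochs while the table recommends 5 to 15, which may be a deliberate choice but lengthens each training job.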

Now, we'll create a `HyperparameterTuner` object, to which we pass:

* The `BlazingText` estimator we created above
* Our hyperparameter ranges
* Objective metric name and definition
* Tuning resource configuration, such as the total number of training jobs to run and how many of them can run in parallel

In [None]:
max_jobs = 9
objective_name = 'validation:accuracy'
tuner = HyperparameterTuner(estimator, 
                            objective_name,
                            hyperparameter_ranges,
                            tags=project_tags,
                            max_jobs=max_jobs,
                            max_parallel_jobs=3)

In [None]:
input_train = TrainingInput(s3_data=s3_train_uri, content_type="text/plain")
input_val = TrainingInput(s3_data=s3_val_uri, content_type="text/plain")
data_channels = {'train': input_train, 'validation': input_val}

tuner.fit(inputs=data_channels)

In [None]:
## Deploy the endpoint

In [None]:
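# Real-time endpoint. Deploy through the tuner so the best training job is used;
# the estimator re-created for tuning was never fitted directly.
predictor = tuner.deploy(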
    initial_instance_count=1,
    instance_type="ml.m5.large",
    # wait=False,  # Remember, predictor.predict() won't work until deployment finishes!
)

In [None]:
## test the results

In [None]:
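import boto3

sm_client = boto3.client('sagemaker')
# project_name and stage_name must already be defined earlier in the notebook
endpoint_name = f'sagemaker-{project_name}-{stage_name}' 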

endpoint_status = sm_client.describe_endpoint(EndpointName = endpoint_name)['EndpointStatus']
if endpoint_status != 'InService':
    raise Exception(f'Endpoint {endpoint_name} status is: {endpoint_status}')
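
The check above fails immediately if the endpoint is still being created. A gentler alternative is to poll until the status settles. The sketch below shows the idea with a generic status callable so it can run without AWS access; `wait_for_in_service` and its parameters are hypothetical names, and in the notebook the callable could wrap the `describe_endpoint` call shown above.

```python
import time

def wait_for_in_service(get_status, poll_seconds=30, max_polls=60):
    """Poll get_status() until it returns 'InService'; raise on 'Failed' or timeout."""
    for _ in range(max_polls):
        status = get_status()
        if status == 'InService':
            return status
        if status == 'Failed':
            raise RuntimeError('Endpoint creation failed')
        time.sleep(poll_seconds)
    raise TimeoutError('Endpoint did not reach InService in time')

# Usage with a fake status source standing in for the SageMaker API.
fake_statuses = iter(['Creating', 'Creating', 'InService'])
result = wait_for_in_service(lambda: next(fake_statuses), poll_seconds=0)
print(result)
```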