# Sentiment Analysis using the BlazingText Algorithm
In this exercise we will use the SageMaker BlazingText algorithm, which provides highly optimized implementations of the Word2vec and text classification algorithms. Text classification is a Natural Language Processing (NLP) task that can help determine the sentiment of a statement.

We will use a public dataset for training. The BlazingText algorithm requires the input to be in a standard format: each line contains a statement and its corresponding label.
If your training dataset is spread across multiple files, concatenate them into a single file that contains all the text lines for the algorithm.
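
As an illustration of the format described above, the following sketch builds one training line from a label and a statement, and concatenates several part files into a single file. The helper names `to_blazingtext_line` and `concatenate_files`, and the example label `3`, are hypothetical and not part of this notebook.

```python
from pathlib import Path


def to_blazingtext_line(label: str, text: str) -> str:
    """Return one training line: the prefixed label, a space, then the text."""
    # Collapse newlines and repeated whitespace so the statement stays on one line.
    single_line_text = " ".join(text.split())
    return f"__label__{label} {single_line_text}"


def concatenate_files(parts, destination):
    """Copy every non-empty line from each part file into a single output file."""
    with open(destination, "w", encoding="utf-8") as out:
        for part in parts:
            for line in Path(part).read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line + "\n")


print(to_blazingtext_line("3", "Great tutu,\nnot cheaply made"))
```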

You can read more about this algorithm [here](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

In [62]:
# import section
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import csv
import numpy as np
# The session variable will be used to access the default bucket, which will host the
# training and validation datasets along with the output

sess = sagemaker.Session()

role = get_execution_role()
print(
    role
)  # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = sess.default_bucket()  # Replace with your own bucket name if needed
print(bucket)
prefix = "lakehouse"  # Replace with the prefix under which you want to store the data if needed


arn:aws:iam::034677018494:role/service-role/AmazonSageMaker-ExecutionRole-20220111T161445
sagemaker-us-east-1-034677018494


## Product Review Dataset
In order to train our model we will be using a public dataset. There are several public datasets available that could be used. One such dataset is the [Amazon Product Review dataset](http://jmcauley.ucsd.edu/data/amazon/). 

We will be using the Clothing, Shoes and Jewelry dataset which has 278,677 reviews.

Each line in the product review file is a JSON line with the following attributes:
* reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
* asin - ID of the product, e.g. 0000013714
* reviewerName - name of the reviewer
* helpful - helpfulness rating of the review, e.g. 2/3
* **reviewText** - text of the review
* **overall** - rating of the product (a value from 1 to 5)
* summary - summary of the review
* unixReviewTime - time of the review (unix time)
* reviewTime - time of the review (raw)
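
To make the record layout concrete, here is a minimal sketch that parses one JSON line and pulls out the review text and rating. The sample record is abbreviated for illustration, and the helper name `extract_text_and_rating` is hypothetical.

```python
import json

# Abbreviated record in the same shape as the dataset (values shortened for illustration).
sample_line = (
    '{"reviewerID": "A1KLRMWW2FWPL4", "asin": "0000031887", '
    '"reviewText": "This is a great tutu.", "overall": 5.0, '
    '"summary": "Great tutu"}'
)


def extract_text_and_rating(json_line):
    """Parse one JSON line and return (reviewText, overall as an int)."""
    record = json.loads(json_line)
    return record["reviewText"], int(record["overall"])


text, rating = extract_text_and_rating(sample_line)
print(rating, text)
```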

The two attributes in this dataset that are key to sentiment analysis are
**reviewText** and **overall**.
Let's download and decompress the gzip file.

In [63]:
# Download the gz file, overwrite if the file exists and name the file as amazon_pr.json.gz.
!wget -q http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Clothing_Shoes_and_Jewelry_5.json.gz -O amazon_pr.json.gz

# Decompress the Clothing, Shoes and Jewelry review file (overwrite if it already exists)
!gzip -fd amazon_pr.json.gz 


In [64]:
# Let's view the first 10 lines of the decompressed dataset
!head amazon_pr.json

{"reviewerID": "A1KLRMWW2FWPL4", "asin": "0000031887", "reviewerName": "Amazon Customer \"cameramom\"", "helpful": [0, 0], "reviewText": "This is a great tutu and at a really great price. It doesn't look cheap at all. I'm so glad I looked on Amazon and found such an affordable tutu that isn't made poorly. A++", "overall": 5.0, "summary": "Great tutu-  not cheaply made", "unixReviewTime": 1297468800, "reviewTime": "02 12, 2011"}
{"reviewerID": "A2G5TCU2WDFZ65", "asin": "0000031887", "reviewerName": "Amazon Customer", "helpful": [0, 0], "reviewText": "I bought this for my 4 yr old daughter for dance class, she wore it today for the first time and the teacher thought it was adorable. I bought this to go with a light blue long sleeve leotard and was happy the colors matched up great. Price was very good too since some of these go for over $15.00 dollars.", "overall": 5.0, "summary": "Very Cute!!", "unixReviewTime": 1358553600, "reviewTime": "01 19, 2013"}
{"reviewerID": "A1RLQXYNCMWRWN",

In [106]:
# Import the shuffle function to randomly shuffle the dataset and avoid ordering bias
from random import shuffle
# Import the Natural Language Toolkit(NLTK). 
# The NLTK data package includes a pre-trained Punkt tokenizer for English.
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Prepare Data
Preparing a dataset is a key step in the Machine Learning process, and the preparation required is specific to the dataset used to train the model. Some of the reasons to prepare data prior to training are to:
* Avoid noise - Drop columns that are not relevant to the machine learning problem.
* Meet the input requirements of the Machine Learning algorithm - As we will see for the BlazingText algorithm below.
* Avoid sparse data problems - Datasets vary in quality, and missing data should be remediated.

In our example, we will:
* Create a pandas DataFrame and load the Product Review file, which is made up of JSON Lines.
* Drop columns that are not relevant to the business problem.
* Introduce a new label column based on the value of an existing column. The BlazingText algorithm requires the label to be prefixed by **\_\_label\_\_**.
* Do a random shuffle to remove ordering bias from the data.
* Split the data 90:10 into training and validation datasets.

**Note:**
For ``supervised`` mode, the training/validation file should contain a training sentence per line along with the labels. Labels are words that are prefixed by the string ``__label__``. 
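
The shuffle and split steps listed above can be sketched with the standard library alone. This is an illustrative variant rather than the notebook's own method; the helper name `shuffle_and_split`, the fixed seed, and the toy rows are assumptions. A fixed seed makes the split reproducible between runs.

```python
import random


def shuffle_and_split(rows, train_fraction=0.9, seed=42):
    """Shuffle a copy of the rows, then cut it into training and validation parts."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy rows in (label, text) form, purely for illustration.
rows = [(f"__label__{1 + i % 3}", f"review text {i}") for i in range(20)]
train_rows, val_rows = shuffle_and_split(rows)
print(len(train_rows), len(val_rows))
```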

In [88]:

import pandas as pd
data_df = pd.read_json('amazon_pr.json', lines=True)

# Remove unnecessary columns
# Retain only the reviewText Column and the Overall column. 
# The Overall column stores the rating provided by the reviewer on a scale from 1-5

data_df=data_df.drop(['reviewerID', 'asin','reviewerName','helpful','summary','unixReviewTime','reviewTime'], axis=1)
# Display the top5 rows of the dataframe
data_df.head()

Unnamed: 0,reviewText,overall
0,This is a great tutu and at a really great pri...,5
1,I bought this for my 4 yr old daughter for dan...,5
2,What can I say... my daughters have it in oran...,5
3,"We bought several tutus at once, and they are ...",5
4,Thank you Halo Heaven great product for Little...,5


In [118]:
# Check whether any values are NULL.
# A value of True means the column contains NULLs. These values can either be filled in or the rows
# removed, depending on the number of rows affected.
# A value of False means the column has no missing values.
data_df.isna().any()

reviewText    False
label         False
dtype: bool

In [107]:
# Define a method label_create that will create a categorisation based on the overall rating field.
# You can choose your own categorisation. 
# For this activity we have followed the below label creation logic:
    # BlazingText algorithm requires the label value to be prefixed by "__label__"
    # If the overall (rating) is 1 or 2 Set the Label to __label__1 (or Negative Sentiment)
    # If the overall (rating) is 3 or 4 Set the Label to __label__2 (or Neutral Sentiment)
    # If the overall (rating) is 5 Set the Label to __label__3 (or Positive Sentiment)
# You could change this logic to create your own text labels(classifications)

def label_create(row):
    rating = row['overall']
    if rating in (1, 2):
        return '__label__1'
    if rating in (3, 4):
        return '__label__2'
    if rating == 5:
        return '__label__3'
    # Fail loudly on an unexpected rating instead of returning an unset variable
    raise ValueError(f"Unexpected overall rating: {rating!r}")


In [90]:
# Create a new dataframe column called 'label' and use the label_create method to set values
data_df['label'] = data_df.apply(label_create, axis=1)

# Drop the overall column from the dataframe as this is now replaced with label column
data_df=data_df.drop(['overall'], axis=1)

# Change the reviewText to lowercase
data_df["reviewText"] = data_df["reviewText"].str.lower()

In [91]:
# Once done, let's look at the spread of the text classification labels.
# For an effective model, the training dataset should ideally contain adequate representation
# of each label

data_df['label'].value_counts()

__label__3    163240
__label__2     88782
__label__1     26655
Name: label, dtype: int64
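
To quantify the spread shown above, the following sketch computes each label's share of the dataset and the ratio between the largest and smallest class. The counts are copied from the output above; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above.
label_counts = {"__label__3": 163240, "__label__2": 88782, "__label__1": 26655}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Ratio between the most and least frequent class, a rough indicator of imbalance.
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio suggests the least frequent label may be under-represented, which is worth keeping in mind when reading per-label accuracy.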

In [108]:
# At this stage it is a good idea to shuffle and then split your training dataset 
# into training and validation dataset.

# Use 90:10 split for training:validation
fractions = np.array([0.9, 0.1])
# shuffle your input
data_df = data_df.sample(frac=1) 
# split into 2 parts
train, val = np.array_split(data_df, (fractions[:-1].cumsum() * len(data_df)).astype(int))

In [119]:
# Display the first 5 rows of train dataframe.
# Notice how the index values have been shuffled (out of order)
train.head()

Unnamed: 0,reviewText,label
203092,"from the photo, these appear to fit like skinn...",__label__2
33341,i got these as a present for my sister because...,__label__2
38748,this is a great slimline carry-on style garmen...,__label__3
22528,i ordered this watch few days back from jomash...,__label__3
234687,it gave me the look i was looking for. a must ...,__label__3


In [94]:
# Create 2 csv files, val (validation) and train(training) to start training the ML model
val.to_csv('val.csv', mode='w', sep=' ', columns=['label','reviewText'], index=False, header=False)
train.to_csv('train.csv', mode='w', sep=' ', columns=['label','reviewText'], index=False, header=False)
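
One detail worth checking in the generated files: a CSV writer using minimal quoting wraps any field that contains the separator in double quotes, and with a space separator that applies to almost every review. The sketch below reproduces the effect with Python's standard `csv` module and contrasts it with writing plain `label text` lines. The helper names are hypothetical, and that `to_csv` behaves the same way with its default settings is an assumption to confirm by inspecting `train.csv`.

```python
import csv
import io

rows = [("__label__3", "great fit and very comfortable")]


def write_space_separated_csv(rows):
    """Write rows with a space delimiter and the csv module's default minimal quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=" ", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def write_plain_lines(rows):
    """Write 'label text' lines directly, with no quoting applied."""
    return "".join(f"{label} {' '.join(text.split())}\n" for label, text in rows)


print(write_space_separated_csv(rows), end="")
print(write_plain_lines(rows), end="")
```

If the quotes turn out to be present and unwanted, writing the lines directly as in `write_plain_lines` is one way to avoid them.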

### Set the Training and Validation channel
Once the dataset is shuffled and split, the training file should be uploaded to the train channel and the validation dataset should be uploaded under the validation channel (Using a validation channel is optional).

In [120]:
%%time
# Set the channel names
train_channel = prefix + "/train"
validation_channel = prefix + "/validation"

# Upload the training and validation dataset to the default bucket
sess.upload_data(path="train.csv", bucket=bucket, key_prefix=train_channel)
sess.upload_data(path="val.csv", bucket=bucket, key_prefix=validation_channel)
# Set the location for the training & validation data.
s3_train_data = "s3://{}/{}".format(bucket, train_channel)
s3_validation_data = "s3://{}/{}".format(bucket, validation_channel)
# Set the location for the output data. This is where the model once generated will be stored.
s3_output_location = "s3://{}/{}/output".format(bucket, prefix)

CPU times: user 747 ms, sys: 409 ms, total: 1.16 s
Wall time: 1.25 s


### Fetch Container image
Get the container image name for the SageMaker BlazingText algorithm in the current region.

In [97]:
# Get the container image name for the SageMaker BlazingText algorithm
region_name = boto3.Session().region_name
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print("Using SageMaker BlazingText container: {} ({})".format(container, region_name))

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:1 (us-east-1)


### Define the resource configuration 
Define the resource configuration and hyperparameters to perform text classification using ``supervised`` mode on an ``ml.c4.4xlarge`` instance.

See the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters. As this is a text classification problem, look for the Text Classification Hyperparameters.

SageMaker supports [hyperparameter tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext-tuning.html) to find the best set of hyperparameters for the machine learning problem.

In [98]:
bt_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.c4.4xlarge",
    volume_size=30,
    max_run=360000,
    input_mode="File",
    output_path=s3_output_location,
    hyperparameters={
        "mode": "supervised",
        "epochs": 5,
        "min_count": 2,
        "learning_rate": 0.05,
        "vector_dim": 10,
        "early_stopping": True,
        "patience": 4,
        "min_epochs": 5,
        "word_ngrams": 2,
    },
)

### Other configurations
Set the training inputs and data channels before training the ML model.

In [99]:
train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

### Train the model
Once the input channels are defined and the hyperparameters are set, training can begin.
Call the ``fit`` method to start the training process.

In [100]:
bt_model.fit(inputs=data_channels, logs=True)

2022-01-16 07:04:25 Starting - Starting the training job...
2022-01-16 07:04:49 Starting - Launching requested ML instancesProfilerReport-1642316665: InProgress
......
2022-01-16 07:05:49 Starting - Preparing the instances for training.........
2022-01-16 07:07:14 Downloading - Downloading input data
2022-01-16 07:07:14 Training - Downloading the training image..
Arguments: train
[01/16/2022 07:07:31 INFO 139977912178048] nvidia-smi took: 0.02518439292907715 secs to identify 0 gpus
[01/16/2022 07:07:31 INFO 139977912178048] Running single machine CPU BlazingText training using supervised mode.
Number of CPU sockets found in instance is  1
[01/16/2022 07:07:31 INFO 139977912178048] Processing /opt/ml/input/data/train/train.csv . File size: 78.7041711807251 MB
[01/16/2022 07:07:31 INFO 139977912178048] Processing /opt/ml/input/data/validation/val.csv . File size: 8.823238372802734 MB
Read 10M words
Read 15M words

## Deploy the Model
Now that the model is trained, we can deploy this model as a SageMaker Endpoint using the SageMaker hosting services. 
This can be done with the estimator's ``deploy`` method (``bt_model.deploy`` in this notebook) or via the SageMaker console.

This will take a few minutes to execute.

Note: SageMaker now supports deploying a [serverless endpoint in preview](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html), which provides a cost-effective, pay-per-use pricing model.

In [135]:
from sagemaker.serializers import JSONSerializer

text_classifier = bt_model.deploy(
    initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=JSONSerializer()
)

--------!

## Test the SageMaker Endpoint
Once deployed, the SageMaker Endpoint can be invoked from a Jupyter Notebook. 

In [138]:
sentences = [
    "The fit is narrower than the other Croc products that I use and I could not get it on to my foot. I sent it back."
]

# Tokenize each sentence with the NLTK Punkt tokenizer downloaded earlier
tokenized_sentences = [" ".join(nltk.word_tokenize(sent)) for sent in sentences]
payload = {"instances": tokenized_sentences}
print(payload)
response = text_classifier.predict(payload)



predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

{'instances': ['The fit is narrower than the other Croc products that I use and I could not get it on to my foot . I sent it back .']}
[
  {
    "label": [
      "__label__1"
    ],
    "prob": [
      0.5706868171691895
    ]
  }
]
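
The response above can be turned into a readable sentiment with a small lookup. This sketch assumes the response shape shown above and the label meanings defined earlier in this notebook (``__label__1`` negative, ``__label__2`` neutral, ``__label__3`` positive); the names `SENTIMENT_BY_LABEL` and `top_sentiment` are hypothetical.

```python
import json

# Mapping that mirrors the label creation logic used during data preparation.
SENTIMENT_BY_LABEL = {
    "__label__1": "negative",
    "__label__2": "neutral",
    "__label__3": "positive",
}


def top_sentiment(prediction):
    """Return (sentiment name, probability) for the first label of one prediction."""
    label = prediction["label"][0]
    probability = prediction["prob"][0]
    return SENTIMENT_BY_LABEL.get(label, "unknown"), probability


# Response body copied from the endpoint output above.
response_body = '[{"label": ["__label__1"], "prob": [0.5706868171691895]}]'
for prediction in json.loads(response_body):
    sentiment, probability = top_sentiment(prediction)
    print(f"{sentiment} ({probability:.2f})")
```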


## Conclusion
Once the endpoint is deployed, it can be invoked by a Lambda function fronted by Amazon API Gateway. The API calls to the SageMaker Endpoint can be secured using AWS security best practices.
As part of the LakeHouse formation Immersion day, we will use Athena to call a User Defined Function to invoke a SageMaker Endpoint hosting an ML Model.
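
As a rough sketch of the Lambda approach mentioned above, the handler below forwards sentences to the endpoint through the `sagemaker-runtime` client's `invoke_endpoint` call. The environment variable `ENDPOINT_NAME`, the request body shape (`{"sentences": [...]}`), and the optional `runtime_client` parameter are assumptions made for illustration. Tokenization is omitted, so the caller is assumed to send text prepared the same way as in the test above.

```python
import json
import os


def lambda_handler(event, context, runtime_client=None):
    """Forward sentences from an API Gateway proxy event to a SageMaker endpoint."""
    if runtime_client is None:
        # Imported lazily so the module can be loaded without the AWS SDK installed.
        import boto3

        runtime_client = boto3.client("sagemaker-runtime")

    # Assumed request body: {"sentences": ["first sentence", "second sentence"]}
    sentences = json.loads(event["body"])["sentences"]

    response = runtime_client.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],  # hypothetical configuration variable
        ContentType="application/json",
        Body=json.dumps({"instances": sentences}),
    )

    # The response body is a stream containing the JSON predictions.
    predictions = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(predictions)}
```

Accepting the client as an optional argument keeps the handler easy to exercise locally with a stand-in object, without calling AWS.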