# Sentiment Analysis with SageMaker
  <i><b>using the BlazingText supervised algorithm</b></i>

## Contents
1. Introduction
2. Setup  
     A. Fetching the Dataset  
     B. Exploring the data
     C. Data Ingestion 
3. Training the BlazingText supervised model
4. Deploying the endpoint for testing

## Introduction

This notebook demonstrates how to use the supervised mode of the Amazon SageMaker BlazingText algorithm for sentiment analysis. BlazingText provides highly optimized implementations of the Word2vec and text classification algorithms. Its supervised multi-class, multi-label text classifier extends the fastText text classifier to use GPU acceleration with custom CUDA kernels.

## Setup

This notebook was created and tested on an ml.t2.medium notebook instance and was run with the Python 3 (Data Science) kernel.

Let's start by specifying:

   1. The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.  
   2. The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).  


In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()
role = get_execution_role()
region = boto3.Session().region_name

bucket = sess.default_bucket()

prefix = 'sentiment-analysis/supervised'

print(region)

### Fetching the Dataset

We use the Amazon product reviews dataset. The downloaded file is not a single valid JSON document that Python can load in one call: each row is a standalone dictionary. To make the file valid JSON, we add a square bracket at the start and end of the file and a comma between rows.

1. Download the dataset
2. Convert data into correct JSON format
3. Convert JSON to CSV format

In [None]:
import json
import pandas as pd
import csv

!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Toys_and_Games_5.json.gz    
!gzip -d reviews_Toys_and_Games_5.json.gz

# read the entire file into a python array
with open("reviews_Toys_and_Games_5.json",'r') as f:
    data = f.readlines()
    
# remove the trailing "\n" from each line  
data = map(lambda x: x.rstrip(), data)

data_json_str = "[" + ','.join(data) + "]"

#write in another JSON file
with open("data_json_reviews.json",'w')as f:
    f.write(data_json_str)

#load in to Pandas
data_df=pd.read_json("data_json_reviews.json")

# Convert JSON to CSV File 
data_df.to_csv('output_reviews_top.csv')
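
The bracket-and-comma conversion above can be illustrated on a tiny in-memory sample. This is a minimal sketch using only the standard library. The two records are invented for illustration and carry only a subset of the dataset's fields. It checks that wrapping the rows in brackets yields the same records as parsing each row on its own.

```python
import json

# Two made-up records in the same one-object-per-line layout as the download.
raw_lines = [
    '{"overall": 5.0, "summary": "Great toy", "reviewText": "My kids love it"}\n',
    '{"overall": 1.0, "summary": "Broke fast", "reviewText": "Fell apart in a day"}\n',
]

# Approach used in this notebook: join the rows with commas and wrap in brackets.
as_array = json.loads("[" + ",".join(line.rstrip() for line in raw_lines) + "]")

# Equivalent approach: parse each non-empty line as its own JSON document.
line_by_line = [json.loads(line) for line in raw_lines if line.strip()]

print(as_array == line_by_line)
print(as_array[0]["summary"])
```

If the intermediate file is not needed, `pd.read_json(path, lines=True)` can read this line-delimited layout directly.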

### Exploring the data

The dataset has no sentiment column, so we derive the sentiment from the star rating in the `overall` column: ratings of 1 or 2 map to negative (-1), 3 to neutral (0), and 4 or 5 to positive (1). Once the sentiment column is added, we plot its distribution using matplotlib.

In [None]:
data_csv=pd.read_csv('output_reviews_top.csv')
print(data_csv.head(3))

In [None]:

import matplotlib.pyplot as plt 

print("Number of rows per star rating:")
print(data_df['overall'].value_counts())

# Function to map stars to sentiment
def map_sentiment(stars_received):
    if stars_received <= 2:
        return -1
    elif stars_received == 3:
        return 0
    else:
        return 1
# Mapping stars to sentiment into three categories
data_df['sentiment'] = [ map_sentiment(x) for x in data_df['overall']]

# Plotting the sentiment distribution
plt.figure()
data_df['sentiment'].value_counts().plot.bar(title="Sentiment distribution in df")
plt.xlabel("Sentiment")
plt.ylabel("No. of rows in df")
plt.show()

In [None]:
# Function to retrieve top few number of each category
def get_top_data(top_n = 5000):
    data_df_positive = data_df[data_df['sentiment'] == 1].head(top_n)
    data_df_negative = data_df[data_df['sentiment'] == -1].head(top_n)
    data_df_neutral = data_df[data_df['sentiment'] == 0].head(top_n)
    data_df_small = pd.concat([data_df_positive, data_df_negative, data_df_neutral])
    return data_df_small

# Function call to get the top 10000 from each sentiment
top_data_df_small = get_top_data(top_n=10000)

# After selecting top few samples of each sentiment
print("After segregating and taking equal number of rows for each sentiment:")
print(top_data_df_small['sentiment'].value_counts())
top_data_df_small.head(10)

In [None]:
# Remove irrelevant information  
sparse_df = top_data_df_small.drop(['reviewerID','asin','reviewerName','helpful','overall','unixReviewTime','reviewTime'],axis=1)
sparse_df = sparse_df.reindex(columns=['sentiment','reviewText','summary'])
print(sparse_df.head(5))

In [None]:
# Attach label to each Sentiment 
index_to_label ={'1':'Positive','0':'neutral','-1':'Negative'}
print(index_to_label)

In [None]:
# Shuffle data and split
from sklearn.model_selection import train_test_split

train_path = 'train.csv'
test_path = 'test.csv'

train, test = train_test_split(sparse_df, test_size=0.2, random_state=42, shuffle=True)

train.to_csv(train_path)
test.to_csv(test_path)

print(train.head(5))

In [None]:
from random import shuffle
import multiprocessing
from multiprocessing import Pool
import csv
import nltk
nltk.download('punkt')

def transform_instance(row):
    cur_row = []
    label = "__label__" + index_to_label[row[1]]  #Prefix the index-ed label with __label__
    cur_row.append(label)
    cur_row.extend(nltk.word_tokenize(row[2].lower()))
    cur_row.extend(nltk.word_tokenize(row[3].lower()))
    return cur_row


def preprocess(input_file, output_file, keep=1):
    all_rows = []
    with open(input_file, 'r') as csvinfile:
        csv_reader = csv.reader(csvinfile, delimiter=',')
        next(csv_reader)
        for row in csv_reader: 
            all_rows.append(row)
    shuffle(all_rows)
    all_rows = all_rows[:int(keep*len(all_rows))]
    pool = Pool(processes=multiprocessing.cpu_count())
    transformed_rows = pool.map(transform_instance, all_rows)
    pool.close() 
    pool.join()
    
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        csv_writer.writerows(transformed_rows)
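
To make the target file format concrete, the sketch below builds one training line the way `transform_instance` does: a `__label__` prefix on the class name, followed by the lowercased tokens of the review text and the summary, all separated by spaces. It is a simplified illustration only. `to_blazingtext_line` is a hypothetical helper, the sample review is invented, and plain `str.split` stands in for `nltk.word_tokenize` so that the snippet runs without NLTK.

```python
# Same label mapping as defined earlier in this notebook.
index_to_label = {'1': 'Positive', '0': 'neutral', '-1': 'Negative'}

def to_blazingtext_line(sentiment, review_text, summary):
    """Build one '__label__<Class> token token ...' line (whitespace tokenizer)."""
    # Assumption: str.split replaces nltk.word_tokenize for illustration only.
    tokens = review_text.lower().split() + summary.lower().split()
    return " ".join(["__label__" + index_to_label[sentiment]] + tokens)

example_line = to_blazingtext_line('1', "My kids love it", "Great toy")
print(example_line)  # __label__Positive my kids love it great toy
```

Each line of `sentiment.train` and `sentiment.validation` has this shape, with NLTK additionally splitting punctuation into separate tokens.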

### Data Ingestion

Next, we convert the train and test CSV files into the BlazingText supervised format with the `preprocess` function defined above, and upload the results to S3 as the train and validation channels.

In [None]:
%%time

# Preparing the training dataset
preprocess('train.csv', 'sentiment.train')
        
# Preparing the validation dataset        
preprocess('test.csv', 'sentiment.validation')

# create train and validation channels 
train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data('sentiment.train', bucket = bucket, key_prefix = train_channel )
sess.upload_data('sentiment.validation', bucket = bucket, key_prefix = validation_channel )

s3_train_data = 's3://{}/{}'.format(bucket,train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket,validation_channel)
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

## Training the BlazingText supervised model

Now that we have prepared the dataset, we are ready to train the model. We first create an estimator, which specifies:

   * The container image for the algorithm (blazingtext)
   * Configuration for the output of the training jobs
   * The values of static algorithm hyperparameters (any that are not specified are given default values)
   * The type and number of instances to use for the training jobs
 
Then we create data channels that read from the S3 source, and fit the model on the training and validation data.


In [None]:
# get_image_uri is deprecated in SageMaker Python SDK v2; image_uris.retrieve replaces it
container = sagemaker.image_uris.retrieve(framework="blazingtext", region=region, version="latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region))

sa_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         instance_count=1, 
                                         instance_type='ml.c4.4xlarge',
                                         volume_size = 30,
                                         max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

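# Note: with epochs=1, the early-stopping settings below (min_epochs=5, patience=4)
# cannot take effect; increase epochs if you want early stopping to apply.
sa_model.set_hyperparameters(mode='supervised',
                            epochs=1,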
                            min_count=2,
                            learning_rate= 0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)

In [None]:
train_data = sagemaker.inputs.TrainingInput(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.inputs.TrainingInput(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

In [None]:
sa_model.fit(inputs=data_channels, logs=True)

## Deploying the endpoint for testing

We deploy the trained model to a real-time endpoint and test it on a few sample sentences.

In [None]:
from sagemaker.serializers import JSONSerializer

text_classifier = sa_model.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    serializer=JSONSerializer()
)

In [None]:
sentences = ["Great product to have.",
            "unsatisfied."]

# use the same lowercasing and nltk tokenizer that we used during data preparation for training
tokenized_sentences = [' '.join(nltk.word_tokenize(sent.lower())) for sent in sentences]

payload = {"instances" : tokenized_sentences}

response = text_classifier.predict(payload)

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))
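
The endpoint returns one JSON object per input sentence, each carrying a `label` list and a matching `prob` list. As a rough sketch of post-processing, the snippet below strips the `__label__` prefix and pairs each top label with its probability. The response body is hand-written to show the shape only, so its probabilities are placeholders rather than real model output, and `top_labels` is a hypothetical helper.

```python
import json

# Hand-written sample in the shape described above for two inputs;
# the probabilities are placeholders, not real model output.
sample_response = (
    b'[{"label": ["__label__Positive"], "prob": [0.9]},'
    b' {"label": ["__label__Negative"], "prob": [0.8]}]'
)

def top_labels(response_body):
    """Return (label, probability) for the top prediction of each input."""
    results = []
    for item in json.loads(response_body):
        # Drop the '__label__' prefix that was added during preprocessing.
        label = item["label"][0].replace("__label__", "", 1)
        results.append((label, item["prob"][0]))
    return results

print(top_labels(sample_response))
```

With a live endpoint, `top_labels(response)` could be applied directly to the value returned by `text_classifier.predict(payload)`.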

In [None]:
# Delete the endpoint through the predictor returned by deploy() to stop incurring charges
text_classifier.delete_endpoint()