## Introduction



## Setup



In [57]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()
print(role) 

bucket = 'phdata-nlp-s3-bucket' 
print(bucket)
prefix = 'blazingtext/supervised' 

arn:aws:iam::023375022819:role/service-role/AmazonSageMaker-ExecutionRole-20191220T213935
phdata-nlp-s3-bucket


### Data Preparation



In [6]:
!aws s3 cp s3://phdata-nlp-s3-bucket/labelled_reviews.csv `pwd`

download: s3://phdata-nlp-s3-bucket/labelled_reviews.csv to ./labelled_reviews.csv


Inspection 

In [58]:
import pandas as pd

!head labelled_reviews.csv -n 3

reviews = pd.read_csv('labelled_reviews.csv')['Review']


,Review,topic
0,"__label__book Buyer beware This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a ""worst book"" contest. I can't believe Amazon even sells this kind of thing. Maybe I can offer them my 8th grade term paper on ""To Kill a Mockingbird""--a book I am quite sure Ms. Haddon never heard of. Anyway, unless you are in a mood to send a book to someone as a joke---stay far, far away from this one!",book
1,"__label__book The Worst! A complete waste of time. Typographical errors, poor grammar, and a totally pathetic plot add up to absolutely nothing. I'm embarrassed for this author and very disappointed I actually paid f

In [39]:
#split data into 70% train and 30% test
train = reviews[:420001]
test = reviews[420001:600000]

print(str(len(train) ) + ":::" + str ( len(test) ) )
 
train.to_csv("reviews.train.csv", index=False)
test.to_csv("reviews.test.csv", index=False)

420001:::179999
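The slice boundaries above are hard-coded for a 600,000-row file. As a minimal sketch, the same 70/30 split can be derived from the row count instead. The helper name `split_index` and the toy `rows` list are hypothetical stand-ins for the `reviews` Series, and the computed boundary (420000) differs by one row from the hard-coded 420001 used in the cell above.

```python
def split_index(n_rows, train_fraction=0.7):
    """Return the row index at which the training slice ends."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # round() avoids an off-by-one caused by floating-point error.
    return round(n_rows * train_fraction)


# Toy stand-in for the `reviews` Series (assumption: any sliceable sequence works).
rows = ["review %d" % i for i in range(10)]
cut = split_index(len(rows))
train_rows, test_rows = rows[:cut], rows[cut:]
print(len(train_rows), len(test_rows))
```

With the notebook's data this would read `cut = split_index(len(reviews))`, followed by `reviews[:cut]` and `reviews[cut:]`.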




In [59]:
!head reviews.train.csv -n 3


"__label__book Buyer beware This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's family and friends--or perhaps, by herself! I can't imagine anyone reading the whole thing--I spent an evening with the book and a friend and we were in hysterics reading bits and pieces of it to one another. It is most definitely bad enough to be entered into some kind of a ""worst book"" contest. I can't believe Amazon even sells this kind of thing. Maybe I can offer them my 8th grade term paper on ""To Kill a Mockingbird""--a book I am quite sure Ms. Haddon never heard of. Anyway, unless you are in a mood to send a book to someone as a joke---stay far, far away from this one!"
"__label__book The Worst! A complete waste of time. Typographical errors, poor grammar, and a totally pathetic plot add up to absolutely nothing. I'm embarrassed for this author and very disappointed I actually paid for this book."
"__label

In [60]:
!head reviews.test.csv  -n 3


"__label__movie Not worth seeing My wife and I tried, really tried to watch this movie, and it went nowhere. It was probably the biggest disappointment we've seen in years. Boring, and so s-l-o-w you feel like cutting off your own limbs for entertainment. We give this an all toes down."
"__label__book Black comedy at its worst Jon Keyes directed this black-humored and odd black comedy that is not scary or funny but dumb and dull with unlikable characters,a badly written plot,and a very low budget. The movie is very reminiscent to Paul Bartel's Eating Raoul and Bob Balaban's Parents but the movie is a couple that argue about everything and then the wife gets mad and harasses a teenage girl in the basement and then tries to kill her husband as well as the husband trying to kill her. For a low-budget black comedy this is a unwatchable mess with no redeeming values. Despite the title the movie takes place in the couple's dining room,kitchen,and the living room and the front cover has noth

## Data Preprocessing


Import the required libraries and download the NLTK `punkt` tokenizer.

In [40]:
from random import shuffle
import multiprocessing
from multiprocessing import Pool
import csv
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [41]:
def transform_instance(row):
    cur_row = []
    cur_row.extend(nltk.word_tokenize(row[0].lower()))
    return cur_row

The `transform_instance` function lowercases a review and tokenizes it with NLTK. It is applied to each data instance in parallel using Python's `multiprocessing` module.

In [42]:
def preprocess(input_file, output_file, keep=1):
    all_rows = []
    with open(input_file, 'r') as csvinfile:
        csv_reader = csv.reader(csvinfile, delimiter=',')
        for row in csv_reader:
            all_rows.append(row)
    shuffle(all_rows)
    all_rows = all_rows[:int(keep*len(all_rows))]
    pool = Pool(processes=multiprocessing.cpu_count())
    transformed_rows = pool.map(transform_instance, all_rows)
    pool.close() 
    pool.join()
    
    with open(output_file, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        csv_writer.writerows(transformed_rows)

In [43]:
%%time

# Preparing the training dataset
preprocess('reviews.train.csv', 'reviews.train', keep=1)
        
# Preparing the validation dataset        
preprocess('reviews.test.csv', 'reviews.validation')

CPU times: user 22.3 s, sys: 3.04 s, total: 25.4 s
Wall time: 3min 23s


In [61]:
!head reviews.train  -n 3


__label__movie a horrible movie of titanic proportions this animated adaption of this so-called `` movie '' should not be seen by the human eye . it 's because of the infamous traumatizing , horrible `` rapping dog '' scene . it rots both your brain and your very innocent soul . this movie should sink into the depths of the ocean like the titanic .
__label__music soundgarden fan hes an awesome singser but he needs the guys from audioslave or soundgarden to back him up . not that he needs it but when they are together they make the best music known ! ! !
__label__review prejudice on parade this book is not a work of history . it is the painful autobiography of an excommnicated ex-priest on why he hates the catholic church.as the end of the book makes perfectly clear , the author hates the church 's teaching on revelation , morality , authority , sexuality -- -just about everything . it is in light of this ardent hate and resentment that the author risibly distorts the record on the ca
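BlazingText's supervised mode expects one example per line, with the label written as a token prefixed by `__label__` and followed by the space-separated text. The following is a rough sanity check one might run before uploading, under that assumption. The function name `find_malformed_lines` and the sample lines are hypothetical, and the check is deliberately simple rather than a full validator.

```python
LABEL_PREFIX = "__label__"


def find_malformed_lines(lines):
    """Return (line_number, reason) pairs for lines that break the expected layout."""
    problems = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens:
            problems.append((number, "empty line"))
        elif not tokens[0].startswith(LABEL_PREFIX) or tokens[0] == LABEL_PREFIX:
            # Catches, for example, a stray CSV quote in front of the label.
            problems.append((number, "missing label"))
        elif all(tok.startswith(LABEL_PREFIX) for tok in tokens):
            problems.append((number, "label without text"))
    return problems


sample = [
    "__label__movie a horrible movie of titanic proportions",
    '"__label__book the worst ! a complete waste of time ."',
    "",
]
print(find_malformed_lines(sample))
```

Because the function accepts any iterable of strings, an open file handle such as `open('reviews.train')` could be passed in directly.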

In [62]:
!head reviews.test  -n 3


"__label__movie Not worth seeing My wife and I tried, really tried to watch this movie, and it went nowhere. It was probably the biggest disappointment we've seen in years. Boring, and so s-l-o-w you feel like cutting off your own limbs for entertainment. We give this an all toes down."
"__label__book Black comedy at its worst Jon Keyes directed this black-humored and odd black comedy that is not scary or funny but dumb and dull with unlikable characters,a badly written plot,and a very low budget. The movie is very reminiscent to Paul Bartel's Eating Raoul and Bob Balaban's Parents but the movie is a couple that argue about everything and then the wife gets mad and harasses a teenage girl in the basement and then tries to kill her husband as well as the husband trying to kill her. For a low-budget black comedy this is a unwatchable mess with no redeeming values. Despite the title the movie takes place in the couple's dining room,kitchen,and the living room and the front cover has noth

The data preprocessing cell can take a few minutes to run (about three and a half minutes of wall time in the run above). Once preprocessing is complete, we upload the data to S3 so that SageMaker can consume it in training jobs. We'll use the SageMaker Python SDK to upload the two files to the bucket and prefix that we set above.

In [45]:
%%time

train_channel = prefix + '/train'
validation_channel = prefix + '/validation'

sess.upload_data(path='reviews.train', bucket=bucket, key_prefix=train_channel)
sess.upload_data(path='reviews.validation', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)

CPU times: user 2.55 s, sys: 1.77 s, total: 4.32 s
Wall time: 4.51 s


Next, we need to set up an output location in S3 where the model artifacts will be written. These artifacts are the output of the algorithm's training job.

In [46]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

## Training
Now that the setup is complete, we are ready to train our multi-class review classifier. To begin, let us create a ``sagemaker.estimator.Estimator`` object. This estimator will launch the training job.

In [47]:
region_name = boto3.Session().region_name

In [48]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest (us-east-1)


## Training the BlazingText model for supervised text classification

In [49]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         base_job_name='phdataBlazingText',
                                         train_instance_count=1, 
                                         train_instance_type='ml.c4.4xlarge',
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

For the complete list of hyperparameters, refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html).

In [50]:
bt_model.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)
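The `word_ngrams=2` setting asks the algorithm to use word bigrams as features in addition to single words. As an illustration only, the sketch below shows what the bigrams of a tokenized sentence look like. The helper `word_ngrams` is hypothetical and does not reproduce how BlazingText builds or hashes its features internally.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token sequences, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "not worth seeing".split()
# Unigrams plus bigrams: the kind of feature set that word_ngrams=2 refers to.
features = tokens + word_ngrams(tokens, 2)
print(features)
```

Bigrams such as "not worth" carry word-order information that a plain bag of words loses, which is often useful for review text.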

Now that the hyper-parameters are set, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [51]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', 
                             content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}

We now have our `Estimator` object, its hyper-parameters are set, and our data channels are linked to the algorithm. The only remaining step is to train the model with the command below. Training involves a few steps. First, the instance that we requested when creating the `Estimator` is provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data download take some time, depending on the size of the data, so it might be a few minutes before training logs start to appear. After the training job has run for `min_epochs` epochs, the logs also report accuracy on the validation data for every epoch. This metric is a proxy for the quality of the model.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 location that was set as `output_path` in the estimator.
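The `early_stopping`, `patience`, and `min_epochs` hyper-parameters set earlier interact with this per-epoch validation accuracy. The sketch below is a toy model of that interaction, not BlazingText's actual implementation: it assumes validation accuracy is only consulted from epoch `min_epochs` onward and that training stops once `patience` consecutive epochs bring no improvement. The function name and the accuracy values are made up for illustration.

```python
def stopping_epoch(accuracies, patience, min_epochs):
    """Return the 1-based epoch at which training would stop under this toy rule."""
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, accuracy in enumerate(accuracies, start=1):
        if epoch < min_epochs:
            continue  # assumption: validation is not consulted before min_epochs
        if accuracy > best:
            best = accuracy
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
        if epochs_without_gain >= patience:
            return epoch
    return len(accuracies)  # never triggered: all epochs run


# Hypothetical validation accuracies for epochs 1..10.
history = [0.80, 0.85, 0.88, 0.90, 0.92, 0.91, 0.92, 0.91, 0.90, 0.90]
print(stopping_epoch(history, patience=4, min_epochs=5))
```

With these made-up numbers the best accuracy is reached at epoch 5, four epochs pass without improvement, and the toy rule stops at epoch 9 instead of running all 10 epochs.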

In [52]:
bt_model.fit(inputs=data_channels, logs=True)

2020-01-29 23:53:47 Starting - Starting the training job...
2020-01-29 23:53:48 Starting - Launching requested ML instances......
2020-01-29 23:54:49 Starting - Preparing the instances for training...
2020-01-29 23:55:43 Downloading - Downloading input data...
2020-01-29 23:56:10 Training - Training image download completed. Training in progress..Arguments: train
[01/29/2020 23:56:11 INFO 139977443522368] nvidia-smi took: 0.0251870155334 secs to identify 0 gpus
[01/29/2020 23:56:11 INFO 139977443522368] Running single machine CPU BlazingText training using supervised mode.
[01/29/2020 23:56:11 INFO 139977443522368] Processing /opt/ml/input/data/train/reviews.train . File size: 186 MB
[01/29/2020 23:56:11 INFO 139977443522368] Processing /opt/ml/input/data/validation/reviews.validation . File size: 79 MB
Read 10M words
Read 20M words
Read 30M words
Read 39M words
Number of words:  151682
Loadi

## Hosting / Inference
Once training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inferences) from the model. Note that we don't have to host on the same type of instance that we used to train. Because endpoint instances stay up and running for a long time, it's advisable to choose a cheaper instance type for inference.

In [53]:
text_classifier = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')

-------------------!

In [54]:
text_classifier = sagemaker.predictor.RealTimePredictor(endpoint='phdataBlazingText-2020-01-29-23-53-47-082', sagemaker_session=sess)


#### Use JSON format for inference
BlazingText supports `application/json` as the content type for inference. The payload passed to the endpoint should contain a list of sentences under the key "**instances**".

In [55]:
sentences = ["Not worth seeing My wife and I tried, really tried to watch this movie, and it went nowhere. It was probably the biggest disappointment we've seen in years. Boring, and so s-l-o-w you feel like cutting off your own limbs for entertainment. We give this an all toes down", "comedy at its worst Jon Keyes directed this black-humored and odd black comedy that is not scary or funny but dumb and dull with unlikable characters,a badly written plot,and a very low budget. The movie is very reminiscent to Paul Bartel's Eating Raoul and Bob Balaban's Parents but the movie is a couple that argue about everything and then the wife gets mad and harasses a teenage girl in the basement and then tries to kill her husband as well as the husband trying to kill her. For a low-budget black comedy this is a unwatchable mess with no redeeming values. Despite the title the movie takes place in the couple's dining room,kitchen,and the living room and the front cover has nothing to do with this movie because it is not a slasher flick but a bizarre soap opera-ish movie that looks like a Lifetime movie."] 
             
             

             
             
# Tokenize with the same NLTK tokenizer that we used during data preparation.
# Note: the training text was also lowercased in transform_instance; this cell does not lowercase.
tokenized_sentences = [' '.join(nltk.word_tokenize(sent)) for sent in sentences]

payload = {"instances" : tokenized_sentences}

response = text_classifier.predict(json.dumps(payload))

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

[
  {
    "prob": [
      0.8382528424263
    ],
    "label": [
      "__label__movie"
    ]
  },
  {
    "prob": [
      0.96994948387146
    ],
    "label": [
      "__label__movie"
    ]
  }
]


By default, the model will return only one prediction, the one with the highest probability. For retrieving the top k predictions, we can set `k` in the configuration as shown below:

In [56]:
payload = {"instances" : tokenized_sentences,
          "configuration": {"k": 2}}

response = text_classifier.predict(json.dumps(payload))

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))

[
  {
    "prob": [
      0.8382528424263,
      0.07862398773431778
    ],
    "label": [
      "__label__movie",
      "__label__quality"
    ]
  },
  {
    "prob": [
      0.96994948387146,
      0.015944872051477432
    ],
    "label": [
      "__label__movie",
      "__label__review"
    ]
  }
]
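Each element of the response holds parallel `label` and `prob` lists. A small sketch of turning them into readable (label, probability) pairs follows. The helper name `ranked_labels` is hypothetical, and the response string is copied from the `k=2` output above rather than fetched from a live endpoint.

```python
import json

LABEL_PREFIX = "__label__"


def ranked_labels(prediction):
    """Pair each label with its probability, stripping the `__label__` prefix."""
    pairs = []
    for label, prob in zip(prediction["label"], prediction["prob"]):
        name = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        pairs.append((name, prob))
    return pairs


# Response body copied from the k=2 output above.
response = """[
  {"prob": [0.8382528424263, 0.07862398773431778],
   "label": ["__label__movie", "__label__quality"]},
  {"prob": [0.96994948387146, 0.015944872051477432],
   "label": ["__label__movie", "__label__review"]}
]"""

for prediction in json.loads(response):
    print(ranked_labels(prediction))
```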


### Stop / Close the Endpoint 
Finally, if we no longer need the endpoint to serve real-time predictions, we should delete it before closing the notebook so that it stops incurring charges.

In [55]:
#sess.delete_endpoint(text_classifier.endpoint)
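As an alternative to the commented-out session call, the endpoint can also be removed through a boto3 SageMaker client, whose `delete_endpoint` method takes an `EndpointName` argument. The wrapper below is a hypothetical sketch. It accepts the client as a parameter so that it can be exercised without AWS credentials, and the usage lines are left as comments because they assume a live endpoint.

```python
def delete_endpoint(sm_client, endpoint_name):
    """Delete a hosted endpoint through a SageMaker client and return its name."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    return endpoint_name


# Assumed usage inside the notebook (requires boto3 and AWS credentials):
# import boto3
# delete_endpoint(boto3.client("sagemaker"), text_classifier.endpoint)
```

Deleting the endpoint does not delete the model artifacts in S3, so the model can be redeployed later from the training output.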