# Sentiment Analysis

## Using PyTorch in SageMaker

_Deep Learning Nanodegree Program | Deployment_

---

Now that we've seen what SageMaker can do with a built-in algorithm, we will try constructing our own model. In this case, we will use PyTorch to construct a recurrent neural network model for the sentiment analysis problem we tackled in the previous notebook.

> **NOTE**: To complete this notebook, your notebook instance needs permission to access the Elastic Container Registry (ECR). You can grant it by modifying the SageMaker Execution Role that was generated when the notebook instance was created. Go to the IAM Roles dashboard and find the role that was automatically created when you created the notebook instance. Click on 'add policy' and add the `AmazonEC2ContainerRegistryFullAccess` policy.

## Instructions

Some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this notebook. You will not need to modify the included code beyond what is requested. Sections that begin with '**TODO**' in the header indicate that you need to complete or implement some portion within them. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `# TODO: ...` comment. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions for you to answer which relate to the task and your implementation. Each section where you will answer a question is preceded by a '**Question:**' header. Carefully read each question and provide your answer below the '**Answer:**' header by editing the Markdown cell.

> **Note**: Code and Markdown cells can be executed using the **Shift+Enter** keyboard shortcut. In addition, a cell can typically be edited by clicking it (double-click for Markdown cells) or by pressing **Enter** while it is highlighted.

## Step 1: Loading the data

This notebook should be thought of as a continuation of the XGBoost in SageMaker notebook. As such, we will be using some of the prepared data that was processed in the first notebook. If you have not yet run the first notebook, do so now so that the IMDB sentiment data will have been downloaded and processed.

In [None]:
import os
import pickle

cache_dir = os.path.join("cache", "sentiment_analysis") # where we will be reading the pre-computed data from

def load_data(cache_dir = cache_dir, cache_file = "preprocessed_data.pkl"):
    
    # We will read in the cached data and then return the dataset
    cache_data = None
    with open(os.path.join(cache_dir, cache_file), "rb") as f:
        cache_data = pickle.load(f)
    print("Read preprocessed data from cache file:", cache_file)
    
    return cache_data['words_train'], cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test']

In [None]:
train_X, test_X, train_y, test_y = load_data()

## Step 2: Transform the data

In the XGBoost notebook we transformed the data from its word representation to a bag-of-words feature representation. This time we would like to think of the various words as categorical variables, that is, we will represent each word as an integer. Of course, some of the words that appear in the reviews occur very infrequently and so likely don't contain much information for the purposes of sentiment analysis. We will deal with this problem by fixing the size of our working vocabulary and including only the words that appear most frequently. We will then combine all of the infrequent words into a single category which, in our case, we will label `1`.

Furthermore, since we will be using a recurrent neural network, it will be convenient if the length of each review is the same. To do this, we will fix a size for our reviews and then pad short reviews with the category 'no word' (which we will label `0`) and truncate long reviews.
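
To make the two ideas above concrete, here is a minimal sketch on a toy corpus. The helper name `toy_encode` and the tiny `vocab_size` and `pad` values are assumptions chosen for illustration; the notebook's own functions appear in the cells that follow.

```python
from collections import Counter

NOWORD, INFREQ = 0, 1  # the same labels described in the text above

def toy_encode(reviews, vocab_size, pad):
    """Map words to integers, keeping only the most frequent words."""
    counts = Counter(word for review in reviews for word in review)
    # Reserve two ids for the 'no word' and 'infrequent' categories
    kept = [word for word, _ in counts.most_common(vocab_size - 2)]
    word_to_id = {word: idx + 2 for idx, word in enumerate(kept)}

    encoded = []
    for review in reviews:
        # Truncate long reviews, map unknown words to INFREQ
        ids = [word_to_id.get(word, INFREQ) for word in review[:pad]]
        # Pad short reviews with NOWORD
        ids += [NOWORD] * (pad - len(ids))
        encoded.append(ids)
    return word_to_id, encoded

toy_reviews = [["good", "movie"],
               ["good", "good", "plot", "twist", "ending"]]
toy_dict, toy_encoded = toy_encode(toy_reviews, vocab_size=3, pad=4)
print(toy_dict)     # {'good': 2}
print(toy_encoded)  # [[2, 1, 0, 0], [2, 2, 1, 1]]
```

With a vocabulary of size 3 only one real word fits, so every other word collapses into the 'infrequent' category, the short review is padded, and the long review is truncated.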

**NOTE:** As in the XGBoost notebook, where we created the bag-of-words features, we can only create our feature transformer using the training data; otherwise we are cheating by looking at the answers.

### Create a word dictionary

To begin with, we need to construct a way to map words that appear in the reviews to integers. Here we fix the size of our vocabulary (including the 'no word' and 'infrequent' categories) to be `5000` but you may wish to change this to see how it affects the model.

In [None]:
import numpy as np

def build_dict(data, vocab_size = 5000):
    word_count = {} # A dict storing the words that appear in the reviews along with how often they occur
    
    for sentence in data:
        for word in sentence:
            if word not in word_count: # We haven't come across this word yet
                word_count[word] = 1
            else:                      # Otherwise, increase the count
                word_count[word] += 1
                
    # We only want to keep the most frequent words
    sorted_words = sorted(word_count, key=word_count.get, reverse=True)
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 saves room for the 'no word' and
        word_dict[word] = idx + 2                              # 'infrequent' labels (0 and 1)
        
    return word_dict

In [None]:
word_dict = build_dict(train_X)

In the next notebook when we deploy our sentiment analysis model and make it accessible to the outside world, we will need to make use of this word dictionary. As such, we save it for future use.

In [None]:
data_dir = 'data/pytorch'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [None]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform the reviews

Now that we have a word dictionary that lets us transform the words appearing in the reviews into integers, we can convert each review to its integer representation, padding or truncating it to a fixed length. In our case that length is `500`, but this could be changed.

In [None]:
def convert_and_pad(data, word_dict, pad=500):
    NOWORD = 0 # We will use 0 to represent the no word category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words that do not appear in word_dict
    
    result = []
    lengths = []
    
    for sentence in data:
        working_sentence = [NOWORD] * pad
        
        # We go through each word in the (possibly truncated) sentence and convert the words to integers
        for word_index, word in enumerate(sentence[:pad]):
            if word in word_dict:
                working_sentence[word_index] = word_dict[word]
            else:
                working_sentence[word_index] = INFREQ
                
        result.append(working_sentence)
        lengths.append(min(len(sentence), pad)) # We will need to keep track of the length of each review for use later
            
    return np.array(result), np.array(lengths)

In [None]:
train_X, train_X_len = convert_and_pad(train_X, word_dict)
test_X, test_X_len = convert_and_pad(test_X, word_dict)

### Save the processed training dataset

As in the XGBoost notebook, we will need to upload the training dataset to S3 in order for our training code to access it. For now we will save it locally and then we will upload to S3 later on.

It is important to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form `label`, `length`, `review` where `review` is a sequence of `500` integers representing the words in the review.
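
As an illustration of that row format, the sketch below parses rows of the form `label`, `length`, `review` using only the standard library. The helper `parse_row` and the shortened reviews (5 integers instead of `500`) are assumptions for the example; the provided `train` script may read the file differently.

```python
import csv
import io

def parse_row(row, pad=500):
    """Split one CSV row into (label, length, review)."""
    values = [int(v) for v in row]
    label, length, review = values[0], values[1], values[2:]
    if len(review) != pad:
        raise ValueError("expected {} review entries, got {}".format(pad, len(review)))
    return label, length, review

# Two toy rows with reviews padded to 5 entries instead of 500
sample = io.StringIO("1,3,5,9,2,0,0\n0,5,4,4,1,7,3\n")
parsed = [parse_row(row, pad=5) for row in csv.reader(sample)]

for label, length, review in parsed:
    print(label, length, review)
```

Note that `length` counts the real words in the review, so the first toy row has three word ids followed by two padding zeros.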

In [None]:
import pandas as pd

datadir = './data/sentiment/pytorch'
if not os.path.exists(datadir):
    os.makedirs(datadir)
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(datadir, 'train.csv'), header=False, index=False)

## Step 3: Build and push the training container

In the XGBoost notebook we discussed what a model is in the SageMaker framework. In particular, a model comprises three objects

 - Model Artifacts,
 - Training Code (Container), and
 - Inference Code (Container),
 
which interact with one another. In the XGBoost example we used training and inference code that was provided by Amazon. However, since we would now like to construct a custom model using PyTorch, we must write the code ourselves.

Amazon SageMaker uses packages called 'containers' created using the Docker utility. Essentially, a Docker Container contains the complete specification of a computing environment along with the code which you want executed. In the example we are looking at now we have provided a Dockerfile which specifies the computing environment along with the training code.

**Note:** The compute environment we are using here requires a compute instance containing a GPU. We will discuss how to run this code on a compute instance which only has a CPU later in this notebook.

To construct the required container we can use the provided `build_and_push.sh` shell script, which will create the container and then upload it to Amazon's Elastic Container Registry (ECR). Once it has been uploaded (pushed) we can create a SageMaker estimator object which uses our custom code.

To see the contents of the `build_and_push.sh` shell script, execute the cell below.

**Note:** The `build_and_push.sh` script uploads the container using the name `sentiment-pytorch-gpu`. This will be important later when we construct the SageMaker estimator and tell it to use our custom code. You are free to change the name of the container, but note that the name of a container used in a SageMaker estimator object can only contain the characters a-z, A-Z, 0-9 and -.
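
If you do rename the container, a quick check like the following can catch a bad name before a build. This is a minimal sketch: the helper `is_valid_container_name` is hypothetical, and it enforces only the character rule stated in the note above. ECR or SageMaker may impose further constraints that are not checked here.

```python
import re

# Only the characters named in the note above: a-z, A-Z, 0-9 and -
_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")

def is_valid_container_name(name):
    """Return True if `name` uses only the permitted characters."""
    return _NAME_PATTERN.fullmatch(name) is not None

for candidate in ["sentiment-pytorch-gpu", "sentiment_pytorch_gpu"]:
    print(candidate, is_valid_container_name(candidate))
```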

In [None]:
!cat ./train_container/build_and_push.sh

It is important to note that the `build_and_push.sh` script creates two executable scripts, `train` and `serve`, when building the Docker container. The `serve` script will be discussed later on; for now we are interested in the `train` script. The `train` file is a Python script which is run when the container is executed in 'training mode', that is, when it is being used to fit a model to some training data.

The provided `train` script is set up so that any modifications you would like to make to the RNN model for sentiment analysis can be made in the `model.py` file instead.

It is certainly worth taking a look at both the `train` script and the `model.py` script to see how they work. The actual details of the PyTorch implementation of a simple RNN model for sentiment analysis are not as important as understanding how the model is being used, trained, saved, and so on.

### Executing the shell script

Once any changes have been made to the model it is time to actually build the container and upload it to the Elastic Container Registry.

In [None]:
%cd train_container
!chmod +x ./build_and_push.sh
!./build_and_push.sh gpu
%cd ..

## Step 4: Build and train the model

Now that we have created the docker container we will be using to train our model, it is time to actually use it.

### Uploading Training files to S3

As in the XGBoost notebook, the training code that we have uploaded will have access to the training data that we choose by way of Amazon's S3 service. To give our training code access we will need to first upload the training data to an S3 bucket.

In [None]:
import sagemaker as sage

sess = sage.Session() # Store the current SageMaker session

# S3 prefix (the folder we will use)
prefix = 'sentiment-pytorch'

train_location = sess.upload_data(os.path.join(datadir, 'train.csv'), key_prefix=prefix)

### Creating the RNN model

Now that the data has been uploaded, it is time to construct the SageMaker estimator object. This will proceed much the same as the XGBoost example except that instead of using one of Amazon's containers we will be using the one that we constructed earlier.
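
The estimator locates our container through its full image name, which combines the account number, the region and the container name. The sketch below shows how such a name is assembled. The helper `ecr_image_uri` is hypothetical and the account number is a placeholder; in the notebook these values come from the current session.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Assemble a full ECR image name from its parts."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)

# Placeholder account number and region, for illustration only
example_uri = ecr_image_uri("123456789012", "us-east-1", "sentiment-pytorch-gpu")
print(example_uri)
```

The same pattern applies to any other container pushed to the same account, so only the repository name needs to change.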

In [None]:
# To construct the model, remember that SageMaker needs to know our current IAM role
from sagemaker import get_execution_role
role = get_execution_role()

# We need to get our working account number and region in order to fully specify the name of
# the docker container we uploaded.
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name

# This is the full name of our docker container. Remember that 'sentiment-pytorch-gpu' is the
# name we used earlier when we created and pushed the container.
training_image = '{}.dkr.ecr.{}.amazonaws.com/sentiment-pytorch-gpu'.format(account, region)

# These are some additional hyperparameters which are passed to our custom code. To see how
# this is used, see the code contained in model.py 
trainingParams = {
    'batch_size': 512,
}

pytorch_model = sage.estimator.Estimator(training_image, role,  # We need to provide a link to our custom code
                                        1, 'ml.p2.xlarge',      # This is the compute instance we are using; note that
                                                                # p2 instances are GPU instances
                                        output_path="s3://{}/output".format(sess.default_bucket()),
                                        hyperparameters=trainingParams, # Some model hyperparameters
                                        sagemaker_session=sess)

In [None]:
pytorch_model.fit(train_location)

Now that we have fit our model to the training data we can use it to perform inference. In the next notebook we will use this same model to create a web app. To do this we will need to know where the model data is stored, which can be determined using the following member variable. Make sure to record this as we will need it in the next notebook.

In [None]:
pytorch_model.model_data

## Step 5: Build and push the inference container

As we discussed earlier in Step 3, a SageMaker model comprises three parts: the model artifacts, the training code and the inference code. So far we have used the first two. In particular, we have built, pushed and used a training container and stored the results, which then become the model artifacts. However, what about the last part, the inference code?

We also mentioned when we created the earlier Docker container that there were two executable scripts included, `train` and `serve`, and that the `train` script was used for training. It should come as no surprise that the `serve` script is responsible for performing inference and so it is executed when the container is run in 'serving' or 'deployed' mode.

So, if we wished, we could use the container we already created and, just like in the XGBoost notebook, we could deploy it and send it our test data. However, as noted earlier, the container that we created requires the compute instance to have a GPU. This seems excessive for our needs, as performing prediction with an already fit model doesn't take nearly as many resources as actually training the model. Also, in practice we don't know how long we will have the model deployed, so we may wish to reduce costs, and compute instances with GPUs are much more costly than CPU-only compute instances.

Fortunately, we do not need to change any of the code that we have written; we only need to change the environment in which the code is run. That is, we need to change the Dockerfile. A CPU-based Dockerfile has been provided and we can construct the CPU version of our container using the same `build_and_push.sh` script as before. Note that in this case we call our container `sentiment-pytorch-cpu`.

In [None]:
%cd train_container
!chmod +x ./build_and_push.sh
!./build_and_push.sh cpu
%cd ..

## Step 6: Deploy and test the model

Now that we have created a container for inference we can test the model that we created earlier. This essentially proceeds much the same as in the XGBoost example, the difference being that we explicitly tell SageMaker to use a different container when we deploy the model.

### Deploy the model

Here we deploy the model we constructed earlier using the new CPU container we created. Note that if we leave out the optional parameter `image` to the deploy method then SageMaker defaults to using the container that was used during training.

In [None]:
from sagemaker.predictor import csv_serializer # This helps specify how we want to send data to our inference code

# As in the training step, this is the full name for the cpu container
inference_image = '{}.dkr.ecr.{}.amazonaws.com/sentiment-pytorch-cpu:latest'.format(account, region)

In [None]:
pytorch_predictor = pytorch_model.deploy(1, 'ml.m4.xlarge', # The type of compute instance used, note that m4 is a cpu instance
                                        serializer=csv_serializer, # How do we want the data sent
                                        image=inference_image) # Which container to use

### Test the model

Now that the model has been deployed, it is running on an Amazon server somewhere. Now we will send it data and record the results so that we can see how well it performs on our test set.
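
The `predict` helper in the next cell sends the test set in chunks rather than in one request. The sketch below reproduces only the chunk-size arithmetic, without NumPy or an endpoint, so you can see how many requests are made. The function `chunk_sizes` is hypothetical and the 25000-row test set is an assumed size used for illustration.

```python
def chunk_sizes(n_rows, rows=1000):
    """Sizes of the chunks obtained by splitting n_rows the way `predict` does."""
    n_chunks = int(n_rows / float(rows) + 1)  # same expression as in `predict`
    base, extra = divmod(n_rows, n_chunks)
    # np.array_split gives the first `extra` chunks one additional row
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(25000)  # 25000 is an assumed test-set size
print(len(sizes), max(sizes), min(sizes))
```

Because of the `+ 1` in the chunk count, every chunk stays strictly below `rows` entries, even when the number of rows is an exact multiple of `rows`.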

In [None]:
def predict(data, rows=1000):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1)) # Break the data up into chunks
    predictions = np.array([])
    
    for array in split_array:
        chunk_predictions = pytorch_predictor.predict(array).decode('utf-8') # Get the predictions
        chunk_predictions = np.fromstring(chunk_predictions, sep='\n')       # Convert it to a numpy array
        predictions = np.append(predictions, chunk_predictions)
        
    return predictions

Remember that our custom inference code requires each input row to have the form `length`, `review` where `length` is the number of non-zero entries in `review` and `review` is a sequence of `500` integers.
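
To illustrate that input format, the sketch below builds a CSV payload by hand from toy data, with reviews shortened to 3 integers instead of `500`. The helper `to_csv_payload` is hypothetical. It only approximates what a CSV serializer sends, and the exact formatting used by the SageMaker SDK may differ.

```python
def to_csv_payload(lengths, reviews):
    """Join each (length, review) pair into one comma-separated line."""
    lines = []
    for length, review in zip(lengths, reviews):
        row = [length] + list(review)  # `length` comes first, then the review
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines)

# Toy reviews padded to 3 entries instead of 500
payload = to_csv_payload([2, 3], [[5, 9, 0], [4, 4, 1]])
print(payload)
```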

In [None]:
test_comb = np.hstack((test_X_len.reshape([-1,1]), test_X))

In [None]:
predictions = predict(test_comb)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

### Delete the endpoint

Now that we are done testing our model we need to delete the endpoint so that it is no longer running.

In [None]:
sess.delete_endpoint(pytorch_predictor.endpoint)