# Sentiment Analysis Web App Using AWS SageMaker
---

In this project, I am creating a simple web page which a user can use to enter a movie review. The web page will then send the review off to our deployed model which will predict the sentiment of the entered review.

### Steps Involved are:

1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3 (AWS SageMaker bucket).
4. Train a chosen model (RNN-LSTM).
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.

## Step 1: Downloading the data

In this project, we will be using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/)

In [45]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2019-06-15 09:54:16--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2019-06-15 09:54:20 (22.0 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Step 2: Preparing and Processing the data

In this step, we will do some initial data processing. The first few steps include reading in each of the reviews and combining them into a single input structure. Then, we will split the dataset into a training set and a testing set.

In [46]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [47]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


Now that I've read the raw training and testing data from the downloaded dataset, I will combine the positive and negative reviews and shuffle the resulting records.

In [48]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test

In [49]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


Now that I have my training and testing sets unified and prepared, I'll do a quick check and see an example of the data my model will be trained on. This is generally a good idea as it allows us to see how each of the further processing steps affects the reviews and it also ensures that the data has been loaded correctly.

In [50]:
print(train_X[100])
print(train_y[100])

I loved this show growing up and I still watch the first season DVD at age 19 today. What can I say? I grew up in a house much like the one on Full House. I had a dad, two sisters, and a dog. I guess the only difference was that I did not live with my uncle and my dad's best friend. Also, I grew up with my mom in the house. I don't know what I would have done without Full House on television. I think that Stephanie (played by Jodie Sweetin), D.J. (played by Kirk Cameron's sister Candace), and Michelle (Played by Mary-Kate and Ashley Olsen) are my favorite characters. I can relate to each of them because I am the middle child of my family like Steph, I am a younger sister like Michelle, and I am an older sister like D.J. I really like how the show always has moral values because I don't really like any of the O.C.-like shows today. I like the comedy of Full House, too. Uncle Jesse (John Stamos), Joey (Dave Coulier), and Danny (Bob Saget) are hilarious as the girls' uncle, dad's friend, 

The first step in processing the reviews is to remove any HTML tags that appear. In addition, we wish to tokenize and stem our input, so that words such as *entertained* and *entertaining* are considered the same with regard to sentiment analysis.

In [51]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import *

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Load the stopword list once per review
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case and replace non-alphanumeric characters
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word
    
    return words


The `review_to_words` method defined above uses `BeautifulSoup` to remove any HTML tags that appear and uses the `nltk` package to remove stopwords and stem the reviews. As a check to ensure I know how everything is working, I apply `review_to_words` to one of the reviews in the training set.

In [52]:
# Apply review_to_words to a review (train_X[100] or any other review)

review_to_words(train_X[100])

['love',
 'show',
 'grow',
 'still',
 'watch',
 'first',
 'season',
 'dvd',
 'age',
 '19',
 'today',
 'say',
 'grew',
 'hous',
 'much',
 'like',
 'one',
 'full',
 'hous',
 'dad',
 'two',
 'sister',
 'dog',
 'guess',
 'differ',
 'live',
 'uncl',
 'dad',
 'best',
 'friend',
 'also',
 'grew',
 'mom',
 'hous',
 'know',
 'would',
 'done',
 'without',
 'full',
 'hous',
 'televis',
 'think',
 'stephani',
 'play',
 'jodi',
 'sweetin',
 'j',
 'play',
 'kirk',
 'cameron',
 'sister',
 'candac',
 'michel',
 'play',
 'mari',
 'kate',
 'ashley',
 'olsen',
 'favorit',
 'charact',
 'relat',
 'middl',
 'child',
 'famili',
 'like',
 'steph',
 'younger',
 'sister',
 'like',
 'michel',
 'older',
 'sister',
 'like',
 'j',
 'realli',
 'like',
 'show',
 'alway',
 'moral',
 'valu',
 'realli',
 'like',
 'c',
 'like',
 'show',
 'today',
 'like',
 'comedi',
 'full',
 'hous',
 'uncl',
 'jess',
 'john',
 'stamo',
 'joey',
 'dave',
 'coulier',
 'danni',
 'bob',
 'saget',
 'hilari',
 'girl',
 'uncl',
 'dad',
 'frien

The *review_to_words* method above removes HTML formatting, converts the text to lower case, replaces special characters such as punctuation with spaces, splits the text into individual words, removes English stopwords, and stems what remains. Stemming converts, for example, *entertained* and *entertaining* into *entertain*, so that they are treated as though they are the same word. Assigning each unique word a numeric value (building a vocabulary) is not part of this method; it happens in a later step.
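The text-cleaning part of `review_to_words` can be tried in isolation. The sketch below covers only lowercasing, replacing non-alphanumeric characters and splitting; HTML removal, stopword filtering and stemming are left out because they need `bs4` and `nltk`. The helper name `clean_and_split` and the sample sentence are made up for illustration.

```python
import re

def clean_and_split(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return text.split()

sample = "This movie was GREAT, wasn't it?"
tokens = clean_and_split(sample)
print(tokens)  # ['this', 'movie', 'was', 'great', 'wasn', 't', 'it']
```

Because every non-alphanumeric character becomes a space, the apostrophe splits "wasn't" into two tokens. The same effect explains the single-letter tokens `'j'` and `'c'` in the output above, which come from "D.J." and "O.C.".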

The method below applies the `review_to_words` method to each of the reviews in the training and testing datasets. In addition it caches the results. This is because performing this processing step can take a long time. This way if I am unable to complete the notebook in the current session, I can come back without needing to process the data a second time.

In [53]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [54]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl


## Transform the data

In this step, I will transform the data from its word representation into an integer sequence representation. To start, I will represent each word as an integer. Of course, some of the words that appear in the reviews occur very infrequently and so likely don't contain much information for the purposes of sentiment analysis. To deal with this, I will fix the size of my working vocabulary and only include the words that appear most frequently. I'll then combine all of the infrequent words into a single category which, in my case, I'll label as `1`.

Since I'll be using a recurrent neural network, it will be convenient if the length of each review is the same. To do this, I'll fix a size for reviews and then pad short reviews with the category 'no word' (which we will label `0`) and truncate long reviews.

### Creating a word dictionary

To begin, I need a way to map words that appear in the reviews to integers. Here I'll fix the size of the vocabulary (including the 'no word' and 'infrequent' categories) to be `5000`. Note that even though `vocab_size` is set to `5000`, I only construct a mapping for the `4998` most frequently appearing words. This is because I want to reserve the special labels `0` for 'no word' and `1` for 'infrequent word'.

In [55]:
import numpy as np

def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # Determine how often each word appears in `data`. Note that `data` is a list of sentences and that a
    # sentence is a list of words.
    
    word_count = {} # A dict storing the words that appear in the reviews along with how often they occur
    
    for sentences in data:
        for words in sentences:
            if words in word_count:
                word_count[words] += 1
            else:
                word_count[words] = 1
    
    
    # Sort the words found in `data` so that sorted_words[0] is the most frequently appearing word and
    # sorted_words[-1] is the least frequently appearing word.
    
    # This will generate a list of (word, count) tuples sorted by decreasing count
    sorted_words = sorted(word_count.items(), key = lambda values : values[1], reverse = True)
    
    
    sorted_words = [words[0] for words in sorted_words]
    
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 is so that we save room for the 'no word'
        word_dict[word] = idx + 2                              # and 'infrequent' labels
        
    return word_dict

In [56]:
word_dict = build_dict(train_X)

#print(word_dict)   #For debugging purpose

As the following cell shows, the 5 most frequently appearing words are *`movi`*, *`film`*, *`one`*, *`like`* and *`time`*. It makes sense that these words appear frequently in the training set, since they are among the most common words used when writing a movie review.

In [57]:
# determining the five most frequently appearing words in the training set.

l = []

for word in list(word_dict)[0:5]:
    l.append(word)

print(l)   

['movi', 'film', 'one', 'like', 'time']
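The counting and ranking inside `build_dict` can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the code used in this notebook; the function name `build_dict_counter` and the toy reviews are assumptions for illustration.

```python
from collections import Counter

def build_dict_counter(data, vocab_size=5000):
    """Map the (vocab_size - 2) most frequent words to integers starting at 2.

    Labels 0 ('no word') and 1 ('infrequent word') stay reserved.
    """
    counts = Counter(word for review in data for word in review)
    most_common = counts.most_common(vocab_size - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}

# Toy data: 'movi' appears 3 times, 'good' twice, 'film' and 'bad' once each.
toy_reviews = [["movi", "good", "movi"], ["film", "good", "movi"], ["bad"]]
toy_dict = build_dict_counter(toy_reviews, vocab_size=4)
print(toy_dict)  # {'movi': 2, 'good': 3}
```

With `vocab_size=4` only two words receive their own label; `film` and `bad` would later be encoded as the 'infrequent' label `1`.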


### Saving `word_dict`

Later on when I construct an endpoint which processes a submitted review I'll need to make use of the `word_dict` which I have created. As such, I will save it to a file now for future use.

In [58]:
data_dir = '../data/pytorch' # The folder where I will store data
if not os.path.exists(data_dir): # Making sure that the folder exists
    os.makedirs(data_dir)

In [59]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transforming the reviews

Now that I have the word dictionary which allows me to transform the words appearing in the reviews into integers, it is time to make use of it and convert the reviews to their integer sequence representation, making sure to pad or truncate to a fixed length, which in my case is `500`.

In [60]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [61]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

In [62]:
# Use this cell to examine one of the processed reviews to make sure everything is working as intended.

print(train_X[60])

[ 534 1324    1  709  362 3671    1   42  239 2176   42  101 4643   86
  960    3  285   42   13   39 3966   28  715  855    4   79 2457    2
   28 1183  110    1  792  909  283   49  260 3185  230  665   91  245
   75   13  268  107   55  157    1    1   61  145    1   94  997 3447
 1139 1577  721  835  941  314 1682  919 4286   42 1175 3672  715   45
  223  405    1 1047  819    1   55 1980 2568  377  844 1355  419  124
 1049 1966  916    1  534 1324  156    1 3788  362  583   91  419  391
  169 2354 1530  319  271   60   63  991  431  568  144    3 3820    1
 2465   96 1835  159    1    1 4176  827 1772  346   14 4099  100   14
   74  677   39    5 2941 2665  917   23   16   37   96    8   20  137
 1469    1  230    1  251    1    1  423    9  270    1  162   19   17
   42  239    1 2925  349    1  391  447    1 1268  230    1   36 2166
  584    1  124  463 2666    6  387  119  230  426   36   57 4362    8
  262  145   30  332    3  270 1734  454  955 1883    1 4320   14  308
 1364 

With *preprocess_data*, I apply the `review_to_words` method to every review in the training and testing sets, which removes unnecessary elements (HTML tags, punctuation, special characters, etc.). With *convert_and_pad_data*, I pad shorter reviews with `0` and truncate longer reviews to a fixed length of `500`. I do not expect applying these methods to be a problem, since the sentiment of a review can usually be understood from its first 500 words.
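The padding and truncation behaviour is easiest to see on a toy input. The sketch below is a compact re-implementation of the same idea as `convert_and_pad`, written only for illustration; the name `encode`, the three-word dictionary and the pad length of 5 are assumptions.

```python
def encode(word_dict, words, pad=500):
    """Encode words as integers, padding with 0 and truncating at `pad`."""
    NOWORD, INFREQ = 0, 1
    encoded = [word_dict.get(w, INFREQ) for w in words[:pad]]
    length = len(encoded)
    return encoded + [NOWORD] * (pad - length), length

toy_dict = {"movi": 2, "film": 3, "like": 4}

# A short review: the unknown word becomes 1, the tail is padded with 0.
short_seq, short_len = encode(toy_dict, ["like", "movi", "zzz"], pad=5)
print(short_seq, short_len)  # [4, 2, 1, 0, 0] 3

# A long review: everything after the first 5 words is dropped.
long_seq, long_len = encode(toy_dict, ["film"] * 8, pad=5)
print(long_seq, long_len)    # [3, 3, 3, 3, 3] 5
```

The returned length is what lets the model tell real words apart from padding.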

## Step 3: Uploading the data to S3

In this step, I am uploading the training dataset to S3 so that my training code can access it. First I will save it locally, and then I will upload it to S3.

### Saving the processed training dataset locally

It is important to note the format of the data that I am saving, as I will need to know it when I write the training code. Here, each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.

In [63]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
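To make the row layout concrete, the sketch below parses two made-up rows in the same `label`, `length`, `review[...]` order using only the standard library. The sample values and the shortened review length of 3 (instead of 500) are assumptions chosen for readability.

```python
import csv
import io

PAD = 3  # the notebook uses 500; 3 keeps the example readable

# Two rows in the same layout as train.csv: label, length, then PAD integers.
csv_text = "1,2,45,7,0\n0,3,12,1,9\n"

parsed = []
for row in csv.reader(io.StringIO(csv_text)):
    values = [int(v) for v in row]
    label, length, review = values[0], values[1], values[2:]
    parsed.append((label, length, review))
    print(label, length, review)
```

The first row describes a positive review of length 2 whose third position is padding; the second a negative review that fills all three positions.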

### Uploading the training data


Next, I need to upload the training data to the SageMaker default S3 bucket so that I can provide access to it while training the model.

In [64]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [65]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

**NOTE:** The cell above uploads the entire contents of my data directory. This includes the `word_dict.pkl` file. This is fortunate, as I will need it later on when I create an endpoint that accepts an arbitrary review. For now, I will just take note of the fact that it resides in the data directory (and so also in the S3 training bucket) and that I'll need to make sure it gets saved in the model directory.

## Step 4: Building and Training the PyTorch Model

Now, a model comprises three objects

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
each of which interacts with the others. In this step, I will use the training and inference code containers provided by Amazon, with the added benefit of being able to include my own custom code. I will start by implementing a neural network in PyTorch along with a training script.

In [66]:
!pygmentize train/model.py

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """
    This is the simple RNN model we will be using to perform Sentiment Analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by setting up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
        self.sig = nn.Sigm

The important takeaway from the implementation provided is that there are three parameters that I may wish to tweak to improve the performance of the model. These are the `embedding dimension`, the `hidden dimension` and the `size of the vocabulary`. I will likely want to make these parameters configurable in the training script so that if I wish to modify them then I do not need to modify the script itself.

First, I will load a small portion of the training data set to use as a sample. It would be very time consuming to train the model completely in the notebook, as I do not have access to a GPU and the compute instance that I am using is not particularly powerful. However, I can work on a small part of the data to get a feel for how the training script behaves.

In [67]:
import torch
import torch.utils.data

# Reading in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turning the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Building the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Building the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

### The training method:

In [68]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:         
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            # Training step for one batch.
              
            optimizer.zero_grad()       # Reset gradients so they do not accumulate across batches
            
            output = model(batch_X)     # Forward pass: compute predictions for this batch
            
            loss = loss_fn(output, batch_y) # Compute the loss between the predictions and the true labels
            
            loss.backward()             # Backward pass: compute gradients of the loss w.r.t. the parameters
            
            optimizer.step()            # Update the parameters using the computed gradients
            
            total_loss += loss.item()   # Accumulate the batch loss for the epoch average
        
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

In [69]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6936596155166626
Epoch: 2, BCELoss: 0.6825093269348145
Epoch: 3, BCELoss: 0.6729172229766845
Epoch: 4, BCELoss: 0.6621326684951783
Epoch: 5, BCELoss: 0.648519766330719


In order to construct a PyTorch model using SageMaker, I have to provide SageMaker with a training script. I may optionally include a directory which will be copied to the container and from which my training code will be run. When the training container is executed, it will check the uploaded directory (if there is one) for a `requirements.txt` file and install any required Python libraries, after which the training script will be run.

### Training the model

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained. Inside the `train` directory is a file called `train.py` which has been provided and which contains most of the necessary code to train the model. I will paste my training method from above into that file.

The way that SageMaker passes hyperparameters to the training script is by way of arguments. These arguments can then be parsed and used in the training script.
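A training script can read such arguments with `argparse`. The sketch below is not the contents of the provided `train.py`; it only illustrates the mechanism. The flags `--epochs` and `--hidden_dim` mirror the hyperparameters used in this notebook, while `--embedding_dim`, `--vocab_size`, the helper name `parse_hyperparameters` and all default values are assumptions.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Parse hyperparameters passed to the script as --name value pairs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--hidden_dim", type=int, default=100)
    # The two flags below and every default are illustrative assumptions.
    parser.add_argument("--embedding_dim", type=int, default=32)
    parser.add_argument("--vocab_size", type=int, default=5000)
    return parser.parse_args(argv)

# Simulate the command line a training container might pass to the script.
args = parse_hyperparameters(["--epochs", "10", "--hidden_dim", "200"])
print(args.epochs, args.hidden_dim, args.embedding_dim, args.vocab_size)
# 10 200 32 5000
```

Making every tunable value a flag means the model can be reconfigured from the notebook without editing the script.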

In [70]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [71]:
estimator.fit({'training': input_data})

2019-06-15 09:55:08 Starting - Starting the training job...
2019-06-15 09:55:09 Starting - Launching requested ML instances...
2019-06-15 09:56:07 Starting - Preparing the instances for training.........
2019-06-15 09:57:17 Downloading - Downloading input data...
2019-06-15 09:57:50 Training - Downloading the training image...
2019-06-15 09:58:23 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-06-15 09:58:24,036 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2019-06-15 09:58:24,060 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2019-06-15 09:58:24,286 sagemaker_pytorch_container.training INFO     Invoking user training script.
2019-06-15 09:58:24,519 sagemaker-containers INFO     Module train does not provide a setup.py.

Epoch: 1, BCELoss: 0.6760185506879067
Epoch: 2, BCELoss: 0.6241538244850782
Epoch: 3, BCELoss: 0.523089519568852
Epoch: 4, BCELoss: 0.4732413790663894
Epoch: 5, BCELoss: 0.4147447706485281
Epoch: 6, BCELoss: 0.3695293981201795
Epoch: 7, BCELoss: 0.3502051313312686
Epoch: 8, BCELoss: 0.3186509895081423
Epoch: 9, BCELoss: 0.321619415161561

2019-06-15 10:01:41 Uploading - Uploading generated training model
2019-06-15 10:01:41 Completed - Training job completed
Epoch: 10, BCELoss: 0.35252104425916864
2019-06-15 10:01:35,168 sagemaker-containers INFO     Reporting training SUCCESS
Billable seconds: 265


## Step 5: Testing the model

In this step, I will test the model by first deploying it and then sending the testing data to the deployed endpoint. This lets me make sure that the deployed model is working correctly.

## Step 6: Deploying the model for testing

Now that I have trained the model, I would like to test it to see how it performs. Currently the model takes input of the form `review_length, review[500]` where `review[500]` is a sequence of `500` integers which describe the words present in the review, encoded using `word_dict`. Fortunately, SageMaker provides built-in inference code for models with simple inputs such as this.

There is one thing that I need to provide, however, and that is a function which loads the saved model. This function must be called `model_fn()` and takes as its only parameter a path to the directory where the model artifacts are stored. This function must also be present in the python file which I have specified as the entry point.

**NOTE**: When the built-in inference code is run, it must import the `model_fn()` method from the `train.py` file. This is why the training code is wrapped in a main guard (i.e., `if __name__ == '__main__':`).
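The effect of the main guard can be checked with a small experiment that needs neither PyTorch nor SageMaker. The sketch below writes a hypothetical stand-in for `train.py` to a temporary file, imports it as a module, and confirms that the guarded block did not run while `model_fn` is still available. The file name `train_demo.py`, the `TRAINING_RAN` flag and the dummy `model_fn` body are all assumptions for illustration; the real `train.py` is not reproduced here.

```python
import importlib.util
import os
import tempfile
import textwrap

# Hypothetical stand-in for train.py: a loader function plus guarded training code.
SOURCE = textwrap.dedent('''
    TRAINING_RAN = False

    def model_fn(model_dir):
        """Pretend to load a model from model_dir."""
        return {"loaded_from": model_dir}

    if __name__ == '__main__':
        # Only runs when the file is executed as a script.
        TRAINING_RAN = True
''')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_demo.py")
    with open(path, "w") as f:
        f.write(SOURCE)

    # Import the file as a module, not as a script.
    spec = importlib.util.spec_from_file_location("train_demo", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

print(module.TRAINING_RAN)              # False
print(module.model_fn("some/model/dir"))  # {'loaded_from': 'some/model/dir'}
```

On import, `__name__` is the module name rather than `'__main__'`, so only the definitions are executed. Without the guard, importing the file just to reach `model_fn` would start a training run.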

Since I don't need to change anything in the code that was uploaded during training, I can simply deploy the current model as-is.

**NOTE:** When deploying a model, I am asking SageMaker to launch a compute instance that will wait for data to be sent to it. As a result, this compute instance will continue to run until *I* shut it down. This is important to know, since the cost of a deployed endpoint depends on how long it has been running.

In other words **If you are no longer using a deployed endpoint, shut it down!**

In [72]:
# Deploy the trained model

predictor = estimator.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

---------------------------------------------------------------------------!

## Step 7 - Using the deployed model for testing

Once the model is deployed, I can read in the test data and send it off to the deployed model to get some results. Once I have collected all of the results, I can determine how accurate the model is.

In [73]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [74]:
# We split the data into chunks and send each chunk separately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions

In [75]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [76]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.83148
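The chunking arithmetic in the `predict` helper above can be worked through without NumPy. The sketch below computes the chunk sizes that `np.array_split` produces for the 25,000 test reviews with `rows=512`, relying on its documented behaviour of giving the first few chunks one extra row when the split is uneven. The function name `chunk_sizes` is an assumption for illustration.

```python
def chunk_sizes(n_rows, rows=512):
    """Sizes of the chunks produced by np.array_split(data, int(n_rows / rows + 1))."""
    n_chunks = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, n_chunks)
    # array_split gives the first `extra` chunks one additional row.
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(25000)
print(len(sizes), max(sizes), min(sizes), sum(sizes))  # 49 511 510 25000
```

So `rows` acts as an upper bound rather than an exact size: the test set is sent in 49 requests of 510 or 511 reviews each.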

### More testing

I now have a trained, deployed model to which I can send processed reviews and which returns the predicted sentiment. Ultimately, however, I would like to be able to send the model an unprocessed review. That is, I would like to send the review itself as a string. For example, suppose I wish to send the following review to the model.

In [77]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

Now the question is - how would I send this review to the model?

Recall in the first section of this notebook I did a bunch of data processing to the IMDb dataset. In particular, I did two specific things to the provided reviews.
 - Removed any html tags and stemmed the input
 - Encoded the review as a sequence of integers using `word_dict`
 
In order to process the review I will need to repeat these two steps.

Using the `review_to_words` and `convert_and_pad` methods from section one, I will convert `test_review` into a numpy array `test_data` suitable to send to the model. Remember that the model expects input of the form `review_length, review[500]`.

In [78]:
# Convert test_review into a form usable by the model and save the result in test_data

# Convert test_review into a list of stemmed words
words = review_to_words(test_review)

# Map each word to its integer label and pad with 0 up to a length of 500.
# convert_and_pad returns a tuple: (encoded review, review length)
word_with_occurence = convert_and_pad(word_dict, words)

# The first element of the tuple is the encoded review; [None] adds a leading batch dimension
review = np.array(word_with_occurence[0])[None]

# The second element of the tuple is the review length (capped at 500)
review_length = np.array(word_with_occurence[1])[None]


test_data = pd.concat([pd.DataFrame(review_length), pd.DataFrame(review)], axis = 1)

Now that I have processed the review, I can send the resulting array to the model to predict the sentiment of the review.

In [79]:
predictor.predict(test_data)

array(0.82010174, dtype=float32)

Since the return value of the model is close to `1`, I can be reasonably confident that the model considers the review I have submitted to be positive.
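
To turn the raw score into a label, one simple convention is a cutoff at 0.5, in the same spirit as the `round` call applied to the batch predictions earlier. The following is a minimal sketch; the helper name `score_to_label` and the default cutoff are illustrative assumptions, not part of the project code.

```python
def score_to_label(score, threshold=0.5):
    """Map a model output in [0, 1] to a sentiment label (hypothetical helper)."""
    # Scores above the cutoff are treated as positive, everything else as negative
    return 'positive' if score > threshold else 'negative'


print(score_to_label(0.82010174))
```

A score near the cutoff would deserve less trust than one near `0` or `1`, so the label alone hides how confident the model was.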

### Deleting the endpoint

Once I've deployed an endpoint, it continues to run until I tell it to shut down. Since I am done using the endpoint for now, it's safe to delete it.

In [80]:
estimator.delete_endpoint()

## Step 6 (again) - Deploying the model for the web app

Now that I know that the model is working, it's time to create some custom inference code so that I can send the model a review which has not been processed and have it determine the sentiment of the review.

As mentioned above, by default the estimator which I've created, when deployed, will use the entry script and directory which I provided when creating the model. However, since I now wish to accept a string as input and the model expects a processed review, I need to write some custom inference code.

I will store the code that I write in the `serve` directory. Provided in this directory is the `model.py` file that I've used to construct the model, a `utils.py` file which contains the `review_to_words` and `convert_and_pad` pre-processing functions which I've used during the initial data processing, and `predict.py`, the file which will contain the custom inference code. The directory also contains `requirements.txt`, which tells SageMaker what Python libraries are required by the custom inference code.

When deploying a PyTorch model in SageMaker, there are four functions which the SageMaker inference container will use.
 - `model_fn`: This function is the same function that I've used in the training script and it tells SageMaker how to load the model.
 - `input_fn`: This function receives the raw serialized input that has been sent to the model's endpoint and its job is to de-serialize and make the input available for the inference code.
 - `output_fn`: This function takes the output of the inference code and its job is to serialize this output and return it to the caller of the model's endpoint.
 - `predict_fn`: The heart of the inference script, this is where the actual prediction is done.

For the simple website that I am constructing during this project, the `input_fn` and `output_fn` methods are relatively straightforward. I only require being able to accept a string as input and I expect to return a single value as output.
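
A minimal sketch of what such an `input_fn` / `output_fn` pair could look like, assuming the endpoint only ever receives `text/plain` request bodies as UTF-8 bytes. The error type and message are illustrative choices and not necessarily what `serve/predict.py` contains.

```python
def input_fn(serialized_input_data, content_type):
    """Deserialize the request body; only plain-text reviews are accepted."""
    if content_type == 'text/plain':
        # The body arrives as bytes, so decode it into a Python string
        return serialized_input_data.decode('utf-8')
    raise ValueError('Unsupported content type: ' + content_type)


def output_fn(prediction_output, accept):
    """Serialize the single prediction value as a string for the caller."""
    return str(prediction_output)
```

Rejecting unknown content types early keeps malformed requests from ever reaching the model.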

### Writing inference code

In [81]:
!pygmentize serve/predict.py

import argparse
import json
import os
import pickle
import sys
import sagemaker_containers
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data

from model import LSTMClassifier

from utils import review_to_words, 

As mentioned earlier, the `model_fn` method is the same as the one provided in the training code, and the `input_fn` and `output_fn` methods are very simple, so the only method I need to write is `predict_fn`. The completed file must be saved as `predict.py` in the `serve` directory.
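
The overall flow of `predict_fn` can be sketched as follows. This is a simplified, framework-free illustration and not the contents of `serve/predict.py`: the `tokenize`, `encode`, and `score_fn` parameters are hypothetical stand-ins for `review_to_words`, `convert_and_pad`, and a forward pass through the loaded LSTM. A real version would also wrap the row in a `torch` tensor before calling the model. The toy dictionary assumes that `0` marks padding and `1` marks an unknown word.

```python
def predict_sentiment(review_text, word_dict, tokenize, encode, score_fn):
    # 1. Clean and tokenize the raw review string
    words = tokenize(review_text)
    # 2. Encode as a fixed-length integer sequence and get the review length
    encoded, length = encode(word_dict, words)
    # 3. Build one row in the layout the model expects: length first, then the review
    row = [length] + list(encoded)
    # 4. Score the row and round the result to a hard 0/1 label
    return int(round(score_fn(row)))


# Toy stand-ins, for illustration only
toy_dict = {'great': 2, 'movie': 3}

def toy_encode(word_dict, words, pad=5):
    ids = [word_dict.get(w, 1) for w in words][:pad]   # 1 = unknown word (assumed)
    return ids + [0] * (pad - len(ids)), len(ids)      # 0 = padding (assumed)

label = predict_sentiment('great movie', toy_dict, str.split, toy_encode,
                          lambda row: 0.9)
print(label)
```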

### Deploying the model

Now that the custom inference code has been written, I will create and deploy the model. To begin with, I need to construct a new PyTorchModel object which points to the model artifacts created during training and also points to the inference code that I wish to use. Then I can call the deploy method to launch the deployment container.

**NOTE**: The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a `numpy` array. In my case I want to send a string, so I need to construct a simple wrapper around the `RealTimePredictor` class to accommodate simple strings.

In [82]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------------------------------------------------------------------------------!

### Testing the model

Now that I have deployed the model with the custom inference code, I will test whether everything is working. Here I test the model by loading the first `250` positive and negative reviews, sending them to the endpoint, and collecting the results. I only send some of the data because processing the input and performing inference takes quite a long time, so testing the entire data set would be prohibitive.

In [83]:
import glob

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # Making sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Reading in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Sending the review to the predictor and store the results
                results.append(float(predictor.predict(review_input)))
                
            # Sending reviews to the endpoint one at a time takes a while so I
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [84]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [85]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.854
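
For intuition, the accuracy reported above is simply the fraction of predictions that match the ground truth. A minimal pure-Python equivalent is sketched below on made-up labels rather than the actual endpoint results; the function name `accuracy` is an illustrative assumption.

```python
def accuracy(ground, results):
    """Fraction of predictions equal to the corresponding ground-truth label."""
    matches = sum(1 for g, r in zip(ground, results) if g == r)
    return matches / len(ground)


# Made-up example: the endpoint returns floats such as 1.0 and 0.0
toy_ground = [1, 1, 0, 0]
toy_results = [1.0, 0.0, 0.0, 0.0]
print(accuracy(toy_ground, toy_results))  # 0.75
```

This only works because the endpoint returns already-rounded values; raw probabilities would first need to be converted to `0` or `1`.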

As an additional test, I will try sending the `test_review` that I've looked at earlier.

In [86]:
predictor.predict(test_review)

b'1.0'

Now that I know the endpoint is working as expected, I can set up the web page that will interact with it.

## Step 7 (again): Using the model for the web app

So far I have been accessing the model endpoint by constructing a predictor object which uses the endpoint and then using that predictor object to perform inference. What if I wanted to create a web app which accessed the model? The way things are set up currently makes that impossible, since in order to access a SageMaker endpoint the app would first have to authenticate with AWS using an IAM role which includes access to SageMaker endpoints. However, there is an easier way: I just need to use two additional AWS services.

The first is a Lambda function, which can be thought of as a straightforward Python function that is executed whenever a specified event occurs. I will give this function permission to send data to and receive data from a SageMaker endpoint.

The second is API Gateway, which I will use to create a new endpoint that executes the Lambda function. This endpoint is a URL that listens for data to be sent to it. Once it gets some data, it passes that data on to the Lambda function and then returns whatever the Lambda function returns. Essentially, it acts as an interface that lets the web app communicate with the Lambda function.

### Setting up a Lambda function

The first thing I am going to do is set up a Lambda function. This Lambda function will be executed whenever the public API has data sent to it. When it is executed it will receive the data, perform any sort of processing that is required, send the data (the review) to the SageMaker endpoint I've created and then return the result.

#### Part A: Creating an IAM Role for the Lambda function - steps

Since I want the Lambda function to call a SageMaker endpoint, I need to make sure that it has permission to do so. To do this, I will construct a role that I can later give the Lambda function.

Using the AWS Console, navigate to the **IAM** page and click on **Roles**. Then, click on **Create role**. Make sure that the **AWS service** is the type of trusted entity selected and choose **Lambda** as the service that will use this role, then click **Next: Permissions**.

In the search box type `sagemaker` and select the check box next to the **AmazonSageMakerFullAccess** policy. Then, click on **Next: Review**.

Lastly, give this role a name. Make sure you use a name that you will remember later on, for example `LambdaSageMakerRole`. Then, click on **Create role**.
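
The managed **AmazonSageMakerFullAccess** policy grants far more than this Lambda function needs. As a hedged alternative, a narrower custom policy that allows only the `sagemaker:InvokeEndpoint` action could look like the sketch below. The region, account ID, and endpoint name are placeholders rather than values from this project, and the helper name `invoke_only_policy` is hypothetical.

```python
import json


def invoke_only_policy(region, account_id, endpoint_name):
    """Build an IAM policy document that only allows invoking one endpoint."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Action': 'sagemaker:InvokeEndpoint',
            'Resource': 'arn:aws:sagemaker:{}:{}:endpoint/{}'.format(
                region, account_id, endpoint_name),
        }],
    }


# Placeholder values, for illustration only
print(json.dumps(invoke_only_policy('us-east-1', '123456789012', 'my-endpoint'),
                 indent=2))
```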

#### Part B: Creating a Lambda function - steps

Now it is time to actually create the Lambda function.

Using the AWS Console, navigate to the AWS Lambda page and click on **Create a function**. When you get to the next page, make sure that **Author from scratch** is selected. Now, name your Lambda function, using a name that you will remember later on, for example `sentiment_analysis_func`. Make sure that the **Python 3.6** runtime is selected and then choose the role that you created in the previous part. Then, click on **Create Function**.

On the next page you will see some information about the Lambda function you've just created. If you scroll down you should see an editor in which you can write the code that will be executed when your Lambda function is triggered. In my example, I will use the code below. 

```python
# We need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now I'll use the SageMaker runtime to invoke our endpoint, sending the review received in the request
    response = runtime.invoke_endpoint(EndpointName = '**ENDPOINT NAME HERE**',    # The name of the endpoint we created
                                       ContentType = 'text/plain',                 # The data format that is expected
                                       Body = event['body'])                       # The actual review

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
```

I will copy and paste the code above into the Lambda code editor, then replace the `**ENDPOINT NAME HERE**` portion with the name of the endpoint that I deployed earlier. The name of the endpoint can be checked using the code cell below.

In [87]:
predictor.endpoint

'sagemaker-pytorch-2019-06-15-10-12-45-765'

After adding the endpoint name to the Lambda function, I click on **Save**. The Lambda function is now up and running. Next I need to create a way for my web app to execute the Lambda function.
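
The handler logic can also be exercised locally before any other service is involved. The sketch below is a stand-in under stated assumptions, not AWS code: `FakeRuntime` is a hypothetical stub that mimics only the `invoke_endpoint` call used in the handler, and `handle` repeats the handler body with the client and endpoint name passed in, so no AWS credentials are needed. It assumes the review text arrives under the `body` key of the event, as the handler expects.

```python
import io


class FakeRuntime:
    """Hypothetical stub for the 'sagemaker-runtime' client (illustration only)."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # The real client returns a streaming body; BytesIO offers the same .read()
        return {'Body': io.BytesIO(b'1.0')}


def handle(event, runtime, endpoint_name):
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='text/plain',
                                       Body=event['body'])
    result = response['Body'].read().decode('utf-8')
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'text/plain',
                    'Access-Control-Allow-Origin': '*'},
        'body': result,
    }


fake_event = {'body': 'That movie was really amazing!!'}
print(handle(fake_event, FakeRuntime(), 'my-endpoint'))
```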

### Setting up API Gateway - steps

Now that the Lambda function is set up, it is time to create a new API using API Gateway that will trigger the Lambda function I have just created.

Using AWS Console, navigate to **Amazon API Gateway** and then click on **Get started**.

On the next page, make sure that **New API** is selected and give the new API a name, for example, `sentiment_analysis_api`. Then, click on **Create API**.

I have now created an API, but it doesn't currently do anything. What I want it to do is to trigger the Lambda function that I created earlier.

Select the **Actions** dropdown menu and click **Create Method**. A new blank method will be created, select its dropdown menu and select **POST**, then click on the check mark beside it.

For the integration point, make sure that **Lambda Function** is selected and check the **Use Lambda Proxy integration** box. This option makes sure that the data that is sent to the API is then sent directly to the Lambda function with no processing. It also means that the return value must be a proper response object as it will also not be processed by API Gateway.

Type the name of the Lambda function created earlier into the **Lambda Function** text entry box and then click on **Save**. Click on **OK** in the pop-up box that then appears, giving permission to API Gateway to invoke the Lambda function created earlier.

The last step in creating the API Gateway is to select the **Actions** dropdown and click on **Deploy API**. I will need to create a new deployment stage and give it a name, for example `prod`.

I have now successfully set up a public API to access my SageMaker model. Make sure to copy or write down the URL provided to invoke the newly created public API as this will be needed in the next step. This URL can be found at the top of the page, highlighted in blue next to the text **Invoke URL**.
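
To check the public API outside the browser, a POST request with the review as a plain-text body can be sent from Python. This is a minimal sketch using only the standard library; the helper names are hypothetical and the URL in the comment is a made-up placeholder that must be replaced with the actual **Invoke URL**.

```python
import urllib.request


def build_review_request(invoke_url, review_text):
    """Build a POST request carrying the review as a UTF-8 plain-text body."""
    return urllib.request.Request(
        invoke_url,
        data=review_text.encode('utf-8'),
        headers={'Content-Type': 'text/plain'},
        method='POST',
    )


def send_review(invoke_url, review_text, timeout=30):
    """Send the review and return the response body as a string."""
    request = build_review_request(invoke_url, review_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8')


# Example call (placeholder URL, not executed here):
# send_review('https://example.execute-api.us-east-1.amazonaws.com/prod',
#             'That movie was really amazing!!')
```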

## Step 4: Deploying our web app

Now that I have a publicly available API, I can start using it in a web app. For this, I have created a simple static HTML file, `index.html`, which makes use of the public API created earlier. I will add the API URL to that HTML file and then open it in my browser.

Because the page is a static file, it can be opened directly from the local computer without a separate web server, and the page can then be used to interact with the SageMaker model.

**Some Examples:** 

 > **Review :** That movie was really amazing!!
 
 > **Your review was POSITIVE!**
 
 > **Review :** That movie was fake
 
 > **Your review was NEGATIVE!**

### Deleting the endpoint - MOST IMPORTANT STEP

In [44]:
predictor.delete_endpoint()