# Sentiment Analysis

## Updating a Model in SageMaker

_Deep Learning Nanodegree Program | Deployment_

---

In this notebook we will consider a situation in which a model that we constructed is no longer working as we intended. In particular, we will look at the XGBoost sentiment analysis model that we constructed earlier. In this case, however, we have some new data that our model doesn't seem to perform very well on. As a result, we will re-train our model and update an existing endpoint so that it uses our new model.

This notebook starts by re-creating the XGBoost sentiment analysis model that was created in the deployment section. This means the cells up to the end of Step 4 are the same as before. _The new content in this notebook begins at Step 5._

## Step 1: Downloading the data

In [1]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2019-12-30 14:09:09--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2019-12-30 14:09:20 (7.43 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Step 2: Preparing the data

In [1]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [2]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


In [3]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test

In [4]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


In [5]:
train_X[100]

"I like this movie because it is a fine work of cinema, made by people who care enough to make it art and not just home movies. It is filled with Super-surfer Greg Noll's home movies, and a boatload of amateur video from others who align themselves with his 50-year passion. Nevertheless, it has been expanded to the degree that it approaches aesthetic glory. It is filled with artistic talent, and athletic talent, however trivial you might think surfing to be athletic. Surfers are not astronauts nor test-pilots. Nor are they surgeons(perhaps) or Ph.d's(again, perhaps). It believes in the quest of the surfer. It believes in the beauty of human goofiness. It believes in the great gift of peace, which comes from the cessation of war. Surfers celebrate the cessation of war on the north beach of an Hawaiian island attacked by Japanese zeroes fifteen years before. It celebrates the down-time of a country which fought a cold war-instead of a hot-war - with the Russian socialists. Surfing is the

## Step 3: Processing the data

(Alternative approach, as we have more flexibility than we had when using a lambda function.)

In [6]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.porter import *
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
import re
from bs4 import BeautifulSoup

def review_to_words(review):
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Lower-case and replace non-alphanumeric characters with spaces
    words = text.split() # Split string into words
    stop_words = set(stopwords.words("english")) # Build the stopword set once per review, not once per word
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word, reusing the stemmer created above
    
    return words

In [8]:
review_to_words(train_X[100])

['like',
 'movi',
 'fine',
 'work',
 'cinema',
 'made',
 'peopl',
 'care',
 'enough',
 'make',
 'art',
 'home',
 'movi',
 'fill',
 'super',
 'surfer',
 'greg',
 'noll',
 'home',
 'movi',
 'boatload',
 'amateur',
 'video',
 'other',
 'align',
 '50',
 'year',
 'passion',
 'nevertheless',
 'expand',
 'degre',
 'approach',
 'aesthet',
 'glori',
 'fill',
 'artist',
 'talent',
 'athlet',
 'talent',
 'howev',
 'trivial',
 'might',
 'think',
 'surf',
 'athlet',
 'surfer',
 'astronaut',
 'test',
 'pilot',
 'surgeon',
 'perhap',
 'ph',
 'perhap',
 'believ',
 'quest',
 'surfer',
 'believ',
 'beauti',
 'human',
 'goofi',
 'believ',
 'great',
 'gift',
 'peac',
 'come',
 'cessat',
 'war',
 'surfer',
 'celebr',
 'cessat',
 'war',
 'north',
 'beach',
 'hawaiian',
 'island',
 'attack',
 'japanes',
 'zero',
 'fifteen',
 'year',
 'celebr',
 'time',
 'countri',
 'fought',
 'cold',
 'war',
 'instead',
 'hot',
 'war',
 'russian',
 'socialist',
 'surf',
 'ultim',
 'narciss',
 'danger',
 'slightli',
 'histor'

In [9]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [10]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl


### Extract Bag-of-Words features

For the model we will be implementing, rather than using the reviews directly, we are going to transform each review into a Bag-of-Words feature representation. Keep in mind that 'in the wild' we will only have access to the training set so our transformer can only use the training set to construct a representation.
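
To make the idea concrete, here is a minimal pure-Python sketch of a Bag-of-Words encoding whose vocabulary is built from the training documents only. The names `build_vocabulary`, `encode` and the `toy_*` data are hypothetical and exist only for illustration; the notebook itself relies on scikit-learn's `CountVectorizer`, whose tie-breaking and internals may differ from this sketch.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, vocabulary_size):
    """Map the most frequent training words to column indices."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    # Sort by descending count, then alphabetically, so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(word for word, _ in ranked[:vocabulary_size])
    return {word: index for index, word in enumerate(kept)}

def encode(doc, vocabulary):
    """Count occurrences of each vocabulary word; unknown words are ignored."""
    row = [0] * len(vocabulary)
    for word in doc:
        if word in vocabulary:
            row[vocabulary[word]] += 1
    return row

# Hypothetical toy data standing in for the tokenized reviews
toy_train = [["great", "movi", "great", "cast"], ["bad", "movi"]]
toy_test = [["great", "plot", "movi"]]  # 'plot' never appears in training

toy_vocabulary = build_vocabulary(toy_train, vocabulary_size=3)
toy_train_rows = [encode(doc, toy_vocabulary) for doc in toy_train]
toy_test_rows = [encode(doc, toy_vocabulary) for doc in toy_test]
```

Because the vocabulary is fixed by the training data, the test review's unseen word simply contributes nothing to its feature row.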

In [11]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
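try:
    import joblib  # standalone package; needed on scikit-learn 0.23 and later
except ImportError:
    from sklearn.externals import joblib  # older scikit-learn releases
# joblib is an alternative to pickle that is more efficient for storing NumPy arrays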

def extract_BoW_features(words_train, words_test, vocabulary_size=5000,
                         cache_dir=cache_dir, cache_file="bow_features.pkl"):
    """Extract Bag-of-Words for a given set of documents, already preprocessed into words."""
    
    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = joblib.load(f)
            print("Read features from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Fit a vectorizer to training documents and use it to transform them
        # NOTE: Training documents have already been preprocessed and tokenized into words;
        #       pass in dummy functions to skip those steps, e.g. preprocessor=lambda x: x
        vectorizer = CountVectorizer(max_features=vocabulary_size,
                preprocessor=lambda x: x, tokenizer=lambda x: x)  # already preprocessed
        features_train = vectorizer.fit_transform(words_train).toarray()

        # Apply the same vectorizer to transform the test documents (ignore unknown words)
        features_test = vectorizer.transform(words_test).toarray()
        
        # NOTE: Remember to convert the sparse features to a dense NumPy array using .toarray()
        
        # Write to cache file for future runs (store vocabulary as well)
        if cache_file is not None:
            vocabulary = vectorizer.vocabulary_
            cache_data = dict(features_train=features_train, features_test=features_test,
                             vocabulary=vocabulary)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                joblib.dump(cache_data, f)
            print("Wrote features to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        features_train, features_test, vocabulary = (cache_data['features_train'],
                cache_data['features_test'], cache_data['vocabulary'])
    
    # Return both the extracted features as well as the vocabulary
    return features_train, features_test, vocabulary

In [12]:
# Extract Bag of Words features for both training and test datasets
train_X, test_X, vocabulary = extract_BoW_features(train_X, test_X)

Read features from cache file: bow_features.pkl


In [13]:
len(train_X[100])

5000

## Step 4: Classification using XGBoost

### Writing the dataset

In [14]:
import pandas as pd

# Earlier we shuffled the training dataset so to make things simple we can just assign
# the first 10 000 reviews to the validation set and use the remaining reviews for training.
val_X = pd.DataFrame(train_X[:10000])
train_X = pd.DataFrame(train_X[10000:])

val_y = pd.DataFrame(train_y[:10000])
train_y = pd.DataFrame(train_y[10000:])

In [15]:
# First we make sure that the local directory in which we'd like to store the training and validation csv files exists.
data_dir = '../data/sentiment_update'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [16]:
pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

pd.concat([val_y, val_X], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([train_y, train_X], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

In [17]:
# To save a bit of memory we can set test_X, train_X, val_X, train_y and val_y to None.
test_X = train_X = val_X = train_y = val_y = None
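
The files written above have no header row, and the training and validation files put the label in the first column, which is the CSV layout that SageMaker's built-in XGBoost algorithm expects (the test file contains features only). A minimal sketch of that layout using only the standard library follows; `write_labelled_csv` and the `toy_*` values are hypothetical names for illustration, not part of the notebook.

```python
import csv
import io

def write_labelled_csv(labels, feature_rows, handle):
    """Write one row per example: label first, then features, no header."""
    writer = csv.writer(handle, lineterminator="\n")
    for label, features in zip(labels, feature_rows):
        writer.writerow([label, *features])

# Hypothetical toy data: two examples with three features each
toy_labels = [1, 0]
toy_features = [[0, 2, 1], [1, 0, 1]]

buffer = io.StringIO()
write_labelled_csv(toy_labels, toy_features, buffer)
csv_text = buffer.getvalue()
```

Here `csv_text` holds one line per example, with the label followed by its three feature counts.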

### Uploading Training / Validation files to S3

In [18]:
import sagemaker

session = sagemaker.Session() # Store the current SageMaker session

# S3 prefix (which folder will we use)
prefix = 'sentiment-update'

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

### Creating the XGBoost model

- Model Artifacts
- Training Code (Container)
- Inference Code (Container)

In [19]:
from sagemaker import get_execution_role

# Our current execution role is required when creating the model as the training
# and inference code will need to access the model artifacts.
role = get_execution_role()

In [20]:
# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.
# As a matter of convenience, the training and inference code both use the same container.
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(session.boto_region_name, 'xgboost')

	get_image_uri(region, 'xgboost', '0.90-1').


In [21]:
# First we create a SageMaker estimator object for our model.
xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # What is our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

# And then set the algorithm specific parameters.
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)

### Fit the XGBoost model

In [22]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

In [23]:
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2019-12-30 15:45:20 Starting - Starting the training job...
2019-12-30 15:45:21 Starting - Launching requested ML instances...
2019-12-30 15:46:19 Starting - Preparing the instances for training......
2019-12-30 15:47:12 Downloading - Downloading input data...
2019-12-30 15:47:33 Training - Downloading the training image..
Arguments: train
[2019-12-30:15:47:54:INFO] Running standalone xgboost training.
[2019-12-30:15:47:54:INFO] File size need to be processed in the node: 238.47mb. Available memory size in the node: 8523.48mb
[2019-12-30:15:47:54:INFO] Determined delimiter of CSV input is ','
[15:47:54] S3DistributionType set as FullyReplicated
[15:47:55] 15000x5000 matrix with 75000000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-12-30:15:47:55:INFO] Determined delimiter of CSV input is ','
[15:47:55] S3DistributionType set as FullyReplicated
[15:47:57] 10000x5000 ma

[15:48:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 4 pruned nodes, max_depth=5
[37]#011train-error:0.154067#011validation-error:0.1772
[15:48:49] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 6 pruned nodes, max_depth=5
[38]#011train-error:0.152867#011validation-error:0.1759
[15:48:50] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 2 pruned nodes, max_depth=5
[39]#011train-error:0.150467#011validation-error:0.1751
[15:48:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 10 pruned nodes, max_depth=5
[40]#011train-error:0.149067#011validation-error:0.1744
[15:48:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 10 pruned nodes, max_depth=5
[41]#011train-error:0.148667#011validation-error:0.1741
[15:48:54] src/tree/updater_prune.cc:74: tree pruning end, 1 ro

[15:49:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 10 pruned nodes, max_depth=5
[85]#011train-error:0.118133#011validation-error:0.1507
[15:49:50] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 8 pruned nodes, max_depth=5
[86]#011train-error:0.117467#011validation-error:0.15
[15:49:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 6 pruned nodes, max_depth=5
[87]#011train-error:0.116267#011validation-error:0.1498
[15:49:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 10 pruned nodes, max_depth=5
[88]#011train-error:0.116#011validation-error:0.1501
[15:49:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 10 pruned nodes, max_depth=5
[89]#011train-error:0.115133#011validation-error:0.1494
[15:49:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots,

[15:50:49] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 8 pruned nodes, max_depth=5
[133]#011train-error:0.097867#011validation-error:0.1401
[15:50:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 2 pruned nodes, max_depth=5
[134]#011train-error:0.0976#011validation-error:0.14
[15:50:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 4 pruned nodes, max_depth=5
[135]#011train-error:0.097333#011validation-error:0.1404
[15:50:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 10 pruned nodes, max_depth=5
[136]#011train-error:0.096933#011validation-error:0.1403
Stopping. Best iteration:
[126]#011train-error:0.100067#011validation-error:0.1398

2019-12-30 15:51:01 Uploading - Uploading generated training model
2019-12-30 15:51:01 Completed - Training job completed
Training seconds: 229
Billable
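
The log stops at round 136 and reports round 126 as the best iteration, which is consistent with `early_stopping_rounds=10`: training ends once ten rounds pass without a better validation error. A minimal sketch of that rule follows; `early_stopping_round` and `toy_errors` are hypothetical names, and XGBoost's own implementation may differ in its details.

```python
def early_stopping_round(validation_errors, patience):
    """Return (stop_round, best_round) for a sequence of validation errors.

    Training stops once `patience` rounds have passed without improving on
    the best error seen so far; otherwise it runs to the end of the list.
    """
    best_round = 0
    best_error = validation_errors[0]
    for current_round, error in enumerate(validation_errors):
        if error < best_error:
            best_round, best_error = current_round, error
        elif current_round - best_round >= patience:
            return current_round, best_round
    return len(validation_errors) - 1, best_round

# Hypothetical validation errors; the final 0.21 is never reached
toy_errors = [0.30, 0.25, 0.22, 0.23, 0.24, 0.22, 0.26, 0.21]
stop_round, best_round = early_stopping_round(toy_errors, patience=3)
```

With a patience of 3 the toy run stops at round 5 and keeps round 2 as its best, even though a later round would have been better. This is the trade-off early stopping makes in exchange for shorter training.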

### Testing the model

In [24]:
xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

In [25]:
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
xgb_transformer.wait()

....................
Arguments: serve
[2019-12-30 15:54:46 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2019-12-30 15:54:46 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2019-12-30 15:54:46 +0000] [1] [INFO] Using worker: gevent
[2019-12-30 15:54:46 +0000] [37] [INFO] Booting worker with pid: 37
[2019-12-30 15:54:46 +0000] [38] [INFO] Booting worker with pid: 38
[2019-12-30 15:54:46 +0000] [39] [INFO] Booting worker with pid: 39
[2019-12-30:15:54:46:INFO] Model loaded successfully for worker : 37
[2019-12-30 15:54:46 +0000] [40] [INFO] Booting worker with pid: 40
[2019-12-30:15:54:46:INFO] Model loaded successfully for worker : 39
[2019-12-30:15:54:46:INFO] Model loaded successfully for worker : 38
[2019-12-30:15:54:46:INFO] Model loaded successfully for worker : 40
2019-12-30T15:55:10.634:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=

[2019-12-30:15:55:32:INFO] Determined delimiter of CSV input is ','
[2019-12-30:15:55:32:INFO] Determined delimiter of CSV input is ','
[2019-12-30:15:55:34:INFO] Sniff delimiter as ','
[2019-12-30:15:55:34:INFO] Determined delimiter of CSV input is ','
[2019-12-30:15:55:34:INFO] Sniff delimiter as ','
[2019-12-30:15:55:34:INFO] Determined delimiter of CSV input is ','
[2019-12-30:15:55:34:INFO] Sniff delimiter as ','
[2019-12-30:15:55:34:INFO] Determined delimiter of CSV input is ','
[2019-12-30:15:55:34:INFO] Sniff delimiter as ','
[2019-12-30:15:55:34:INFO] Determined delimiter of CSV input is ','
[2019-12-30:15:55:34:INFO] Sniff delimiter as ','
[2019-12-30:15:55:34:INFO] Determined delimiter of CSV input is ','
[2019-12-30:15:55:34:INFO] Sniff delimiter as ','
[2019-12-30:15:55:34:INFO] Determined delimiter of CSV input is ','
[2019-12-30:15:55:35:INFO

In [26]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

Completed 256.0 KiB/369.2 KiB (3.3 MiB/s) with 1 file(s) remaining
Completed 369.2 KiB/369.2 KiB (4.7 MiB/s) with 1 file(s) remaining
download: s3://sagemaker-eu-west-1-873674308518/xgboost-2019-12-30-15-51-34-273/test.csv.out to ../data/sentiment_update/test.csv.out


In [27]:
predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [28]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.85632
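
The last two cells turn the model's predicted probabilities into 0/1 labels and score them against the true labels. As a minimal plain-Python sketch of that step (the helper names `to_labels` and `accuracy` and the `toy_*` values are hypothetical; the notebook uses `round` and scikit-learn's `accuracy_score`):

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels with an explicit threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hypothetical model outputs and ground truth
toy_probabilities = [0.91, 0.08, 0.50, 0.73]
toy_truth = [1, 0, 0, 0]

toy_predictions = to_labels(toy_probabilities)
toy_accuracy = accuracy(toy_truth, toy_predictions)
```

The strict `>` mirrors Python's `round`, which rounds a value of exactly `0.5` down to `0`. Writing the threshold out explicitly makes that choice visible and easy to change.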

## Step 5: Looking at New Data

So now we have an XGBoost sentiment analysis model that we believe is working pretty well. As a result, we deployed it and we are using it in some sort of app.

However, as we allow users to use our app we periodically record submitted movie reviews so that we can perform some quality control on our deployed model. Once we've accumulated enough reviews we go through them by hand and evaluate whether they are positive or negative (there are many ways you might do this in practice aside from by hand). The reason for doing this is so that we can check to see how well our model is doing.

In [30]:
import new_data
new_X, new_Y = new_data.get_new_data()

**NOTE:** The `new_data` module assumes that the cache created earlier in Step 3 is still stored in `../cache/sentiment_analysis`.

### Testing the current model on new data

First, note that the data that has been loaded has already been pre-processed so that each entry in `new_X` is a list of words that have been processed using `nltk`. However, we have not yet constructed the bag of words encoding, which we will do now.

To do so, we use the vocabulary that we constructed earlier from the original training data to construct a `CountVectorizer`, which we will use to transform our new data into its bag of words encoding.

In [31]:
# Create CountVectorizer using the previously constructed vocabulary
vectorizer = CountVectorizer(vocabulary=vocabulary,
                             preprocessor=lambda x: x,  # already preprocessed
                             tokenizer=lambda x: x)  # already tokenized

# Transform our new data set and store the transformed data in the variable new_XV
new_XV = vectorizer.transform(new_X).toarray()

As a quick sanity check, we make sure that the length of each of our bag of words encoded reviews is correct. In particular, it must be the same size as the vocabulary which in our case is `5000`.

In [32]:
len(new_XV[100])

5000
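
The reason the length always equals the vocabulary size is that each vocabulary word owns exactly one column, and words outside the vocabulary are dropped. A minimal sketch of encoding with a fixed word-to-column mapping follows; `encode_with_vocabulary` and the `toy_*` values are hypothetical stand-ins for `vocabulary` and an entry of `new_X`, not the `CountVectorizer` implementation.

```python
def encode_with_vocabulary(doc, vocabulary):
    """Bag-of-words row for `doc` using a fixed word -> column mapping."""
    row = [0] * len(vocabulary)
    for word in doc:
        column = vocabulary.get(word)
        if column is not None:  # words outside the vocabulary are dropped
            row[column] += 1
    return row

# Hypothetical stand-ins for `vocabulary` and one entry of `new_X`
toy_vocabulary = {"bad": 0, "great": 1, "movi": 2}
toy_new_review = ["great", "sequel", "movi", "sequel"]

toy_row = encode_with_vocabulary(toy_new_review, toy_vocabulary)
dropped = [w for w in toy_new_review if w not in toy_vocabulary]
```

Note that the dropped words leave no trace in `toy_row`, so a model trained on the old vocabulary cannot react to words it has never seen, however often they occur.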

Now that we've performed the data processing that is required by our model we can save it locally and then upload it to S3 so that we can construct a batch transform job in order to see how well our model is working.

In [33]:
# Save the data contained in new_XV locally in the data_dir with the file name new_data.csv
pd.DataFrame(new_XV).to_csv(os.path.join(data_dir, 'new_data.csv'), header=False, index=False)

In [35]:
# Upload to S3 and save the resulting URI as new_data_location
new_data_location = session.upload_data(os.path.join(data_dir, 'new_data.csv'), key_prefix=prefix)

Then, once the new data has been uploaded to S3, we create and run the batch transform job to get our model's predictions about the sentiment of the new movie reviews (using the `xgb_transformer` object that was created earlier).

In [36]:
# Using xgb_transformer, transform the new_data_location data and wait until
#       the batch transform job has finished.
xgb_transformer.transform(new_data_location, content_type='text/csv', split_type='Line')
xgb_transformer.wait()

....................
Arguments: serve
[2019-12-30 16:03:58 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2019-12-30 16:03:58 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2019-12-30 16:03:58 +0000] [1] [INFO] Using worker: gevent
[2019-12-30 16:03:58 +0000] [38] [INFO] Booting worker with pid: 38
[2019-12-30 16:03:58 +0000] [39] [INFO] Booting worker with pid: 39
[2019-12-30 16:03:58 +0000] [40] [INFO] Booting worker with pid: 40
[2019-12-30 16:03:58 +0000] [41] [INFO] Booting worker with pid: 41
[2019-12-30:16:03:58:INFO] Model loaded successfully for worker : 38
[2019-12-30:16:03:58:INFO] Model loaded successfully for worker : 39
[2019-12-30:16:03:58:INFO] Model loaded successfully for worker : 40
[2019-12-30:16:03:58:INFO] Model loaded successfully for worker : 41
[2019-12-30:16:04:31:INFO] Sniff delimiter as ','
[2019-12-30:16:04:31:INFO] Determined deli




As usual, we copy the results of the batch transform job to our local instance.

In [37]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

Completed 256.0 KiB/369.4 KiB (4.4 MiB/s) with 1 file(s) remaining
Completed 369.4 KiB/369.4 KiB (6.2 MiB/s) with 1 file(s) remaining
download: s3://sagemaker-eu-west-1-873674308518/xgboost-2019-12-30-16-00-48-533/new_data.csv.out to ../data/sentiment_update/new_data.csv.out


Read in the results of the batch transform job.

In [38]:
predictions = pd.read_csv(os.path.join(data_dir, 'new_data.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

And check the accuracy of our current model.

In [39]:
accuracy_score(new_Y, predictions)

0.72748

So it would appear that *something* has changed since our model is no longer (as) effective at determining the sentiment of a user provided review.

In a real life scenario you would check a number of different things to see what exactly is going on. In our case, we are only going to check one: whether some aspect of the underlying distribution has changed. In other words, we want to see if the words that appear in our new collection of reviews match the words that appear in the original training set. Of course, we want to narrow our scope a little bit, so we will only look at the `5000` most frequently appearing words in each data set, or in other words, the vocabulary generated by each data set.

Before doing that, however, let's take a look at some of the incorrectly classified reviews in the new data set.

To start, we will deploy the original XGBoost model. We will then use the deployed model to infer the sentiment of some of the new reviews. This will also serve as a nice excuse to deploy our model so that we can mimic a real life scenario where we have a model that has been deployed and is being used in production.

In [40]:
# Deploy the model that was created earlier. Recall that the object name is 'xgb'.
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')



--------------------------------------------------------------------------------------!

### Diagnose the problem

Now that we have our deployed "production" model, we can send some of our new data to it and filter out some of the incorrectly classified reviews.

In [41]:
from sagemaker.predictor import csv_serializer

# We need to tell the endpoint what format the data we are sending is in so that SageMaker can perform the serialization.
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
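
Conceptually, serializing a feature row as `text/csv` amounts to joining its values with commas, and the endpoint's reply for a single row can be read back as a number. The sketch below illustrates that idea only; `to_csv_payload` and `parse_score` are hypothetical helpers, not the SageMaker SDK's implementation, and the single-value byte-string response is an assumption based on how the result is passed to `float` in the next cell.

```python
def to_csv_payload(feature_row):
    """Join one feature row into the comma-separated text sent to the endpoint."""
    return ",".join(str(value) for value in feature_row)

def parse_score(response_body):
    """Turn a single-value response such as b'0.73' into a float."""
    return float(response_body.decode("utf-8"))

# Hypothetical feature row and response body
toy_payload = to_csv_payload([0, 2, 1])
toy_score = parse_score(b"0.73")
```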

It will be useful to look at a few different examples of incorrectly classified reviews so we will start by creating a *generator* which we will use to iterate through some of the new reviews and find ones that are incorrect.

In [42]:
def get_sample(in_X, in_XV, in_Y):
    for idx, smp in enumerate(in_X):
        res = round(float(xgb_predictor.predict(in_XV[idx])))
        if res != in_Y[idx]:
            yield smp, in_Y[idx]

Calling `get_sample(new_X, new_XV, new_Y)` returns a *generator*, which we store in `gn`, that yields the samples from the new data set which are not classified correctly. To get the *next* sample we simply call the built-in `next` function on the generator. (Note: The reason we use a generator here is that it only sends reviews to the endpoint until it finds the next misclassified one, so we don't have to iterate through all of the new reviews up front.)
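
The lazy behaviour can be seen without a deployed endpoint. In the minimal sketch below, `misclassified`, `fake_predict` and the `toy_*` data are hypothetical stand-ins for `get_sample`, the endpoint and the new reviews; the `calls` list records how many predictions were actually requested.

```python
def misclassified(samples, labels, predict):
    """Lazily yield (sample, label) pairs that `predict` gets wrong."""
    for sample, label in zip(samples, labels):
        if predict(sample) != label:
            yield sample, label

calls = []

def fake_predict(sample):
    """Hypothetical stand-in for the endpoint: always answers 1."""
    calls.append(sample)
    return 1

toy_samples = ["a", "b", "c", "d"]
toy_labels = [1, 0, 1, 0]

toy_gn = misclassified(toy_samples, toy_labels, fake_predict)
first_miss = next(toy_gn)       # runs only until the first wrong prediction
calls_after_first = len(calls)  # predictions requested so far
```

Only two predictions are made before the first misclassified sample is returned; the remaining samples are not touched until `next` is called again.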

In [51]:
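# Create the generator only on the first run so that re-running this cell advances it.
if 'gn' not in globals():
    gn = get_sample(new_X, new_XV, new_Y)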
print(next(gn))

(['real', 'charact', 'stori', 'driven', 'drama', 'level', 'shame', 'see', 'tv', 'mo', 'impress', 'right', 'start', 'put', 'sci', 'fi', 'nut', 'like', 'could', 'happen', 'earth', 'fact', 'anoth', 'galaxi', 'make', 'show', 'interest', 'space', 'ship', 'laser', 'gun', 'none', 'yet', 'anyway', 'far', 'seen', 's01', 'e04', 'grip', 'wonder', 'what', 'go', 'happen', 'next', 'mani', 'possibl', 'cast', 'play', 'role', 'pasion', 'eric', 'stoltz', 'especi', 'strong', 'show', 'realli', 'stand', 'alon', 'well', 'matter', 'watch', 'bsg', 'fact', 'quit', 'differ', 'read', 'neg', 'review', 'sci', 'fi', 'geek', 'expect', 'less', 'drama', 'alien', 'ray', 'gun', 'etc', 'would', 'say', 'ignor', 'realli', 'posit', 'start', 'show', 'let', 'hope', 'cann', '1', '2', 'season', 'like', 'normal', 'good', 'show', 'day'], 1)


After looking at a few examples, we might decide to compare the `5000` most frequently appearing words in each data set: the original training data set and the new data set. The reason for looking at this is that we might expect the frequency of different words to have changed. Perhaps some new slang has been introduced, or some other artifact of popular culture has changed the way that people write movie reviews.

To do this, we start by fitting a `CountVectorizer` to the new data.

In [45]:
new_vectorizer = CountVectorizer(max_features=5000,
                preprocessor=lambda x: x, tokenizer=lambda x: x)
new_vectorizer.fit(new_X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function <lambda> at 0x7efee051c7b8>,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function <lambda> at 0x7efee051c840>, vocabulary=None)

Now that we have this new `CountVectorizer` object, we can check to see if the corresponding vocabulary has changed between the two data sets.

In [46]:
original_vocabulary = set(vocabulary.keys())
new_vocabulary = set(new_vectorizer.vocabulary_.keys())

We can look at the words that were in the original vocabulary but not in the new vocabulary.

In [47]:
print(original_vocabulary - new_vocabulary)

{'ghetto', 'playboy', '21st', 'reincarn', 'weari', 'victorian', 'spill'}


And similarly, we can look at the words that are in the new vocabulary but which were not in the original vocabulary.

In [48]:
print(new_vocabulary - original_vocabulary)

{'banana', 'optimist', 'masterson', 'dubiou', 'orchestr', 'sophi', 'omin'}
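
To see why a single frequent word can push another word out of a size-limited vocabulary, here is a small sketch using only the standard library. The helper `top_k_vocabulary` and the toy documents are assumptions made for illustration; the helper only mimics the idea behind `max_features` (keep the most frequent terms) and is not scikit-learn's implementation.

```python
from collections import Counter

def top_k_vocabulary(documents, k):
    """Return the set of the k most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, _ in counts.most_common(k)}

# Toy corpora: in the "new" data the word 'banana' suddenly appears often.
old_docs = [["good", "film", "plot"], ["good", "film", "plot"], ["good", "bad"]]
new_docs = [["good", "film", "banana"], ["good", "film", "banana"], ["banana", "plot"]]

old_vocab = top_k_vocabulary(old_docs, k=3)
new_vocab = top_k_vocabulary(new_docs, k=3)

dropped = old_vocab - new_vocab   # words that fell out of the top k
added = new_vocab - old_vocab     # words that entered the top k
print(dropped, added)
```

Because the vocabulary size is fixed, every word that enters the top `k` must displace another one, which is why the two set differences in the notebook have the same number of elements.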


These words themselves don't tell us much. However, if one of these words occurred with a large frequency, that might tell us something. In particular, we wouldn't really expect any of the words above to appear very frequently.

**Question:** What exactly is going on here? Which words, if any, appear with a larger than expected frequency, and what does this mean? What has changed about the world that our original model no longer takes into account?

In [71]:
# Check frequencies of new vocab
new_vocab_flat = [word for review in new_X for word in review]

from collections import Counter
count_dict = Counter(new_vocab_flat)

for word in (new_vocabulary - original_vocabulary):
    print(word+", "+str(count_dict[word]))

banana, 5042
optimist, 62
masterson, 62
dubiou, 62
orchestr, 62
sophi, 62
omin, 62


**Answer:** The word banana has been introduced quite often ...

### Build a new model

Supposing that we believe something has changed about the underlying distribution of the words that our reviews are made up of, we need to create a new model. This way our new model will take into account whatever it is that has changed.

To begin with, we will use the new vocabulary to create a bag of words encoding of the new data. We will then use this data to train a new XGBoost model.

**NOTE:** Because we believe that the underlying distribution of words has changed, it follows that the original vocabulary that we used to construct a bag of words encoding of the reviews is no longer valid. This means that we need to be careful with our data. If we send a bag of words encoded review using the *original* vocabulary to the new model, we should not expect any sort of meaningful results.

In particular, this means that if we had deployed our XGBoost model like we did in the Web App notebook then we would need to implement this vocabulary change in the Lambda function as well.
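
The following sketch shows why mixing vocabularies is a problem. The helper `bow_encode` and both tiny vocabularies are hypothetical and exist only for this illustration; the word-to-column mapping has the same shape as the `vocabulary_` dictionary of a fitted `CountVectorizer`.

```python
def bow_encode(words, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for word in words:
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1
    return vector

# Two vocabularies of the same size whose columns mean different things.
original_vocab = {"bad": 0, "good": 1, "plot": 2}
new_vocab = {"bad": 0, "banana": 1, "good": 2}

review = ["good", "good", "banana"]
old_vec = bow_encode(review, original_vocab)
new_vec = bow_encode(review, new_vocab)
print(old_vec, new_vec)
```

Both vectors have the same length, so a model would accept either one without complaint. However, column `1` counts `good` under the original vocabulary and `banana` under the new one, so a model trained with one encoding would misread input produced by the other.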

In [72]:
new_XV = new_vectorizer.transform(new_X).toarray()

In [73]:
# Check result
len(new_XV[0])

5000

Now that we have our newly encoded, newly collected data, we can split it up into a training and validation set so that we can train a new XGBoost model. As usual, we first split up the data, then save it locally and then upload it to S3.

In [74]:
import pandas as pd

# Earlier we shuffled the training dataset so to make things simple we can just assign
# the first 10 000 reviews to the validation set and use the remaining reviews for training.
new_val_X = pd.DataFrame(new_XV[:10000])
new_train_X = pd.DataFrame(new_XV[10000:])

new_val_y = pd.DataFrame(new_Y[:10000])
new_train_y = pd.DataFrame(new_Y[10000:])
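
Slicing off the first rows only gives a fair validation set if the rows are in random order and the labels were permuted together with the features. A minimal sketch of that idea, using a hypothetical helper `shuffle_and_split` and toy data (neither is part of the notebook), might look like this:

```python
import random

def shuffle_and_split(features, labels, n_validation, seed=0):
    """Shuffle features and labels together, then slice off a validation set."""
    paired = list(zip(features, labels))
    random.Random(seed).shuffle(paired)   # each row keeps its own label
    validation, training = paired[:n_validation], paired[n_validation:]
    val_X = [x for x, _ in validation]
    val_y = [y for _, y in validation]
    train_X = [x for x, _ in training]
    train_y = [y for _, y in training]
    return (train_X, train_y), (val_X, val_y)

# Toy data: the label of each row is the parity of its first feature.
features = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]

(train_X, train_y), (val_X, val_y) = shuffle_and_split(features, labels, n_validation=4)
print(len(train_X), len(val_X))
```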

In order to save some memory we will effectively delete the `new_X` variable. Remember that this contained a list of reviews and each review was a list of words. Note that once this cell has been executed you will need to read the new data in again if you want to work with it.

In [75]:
new_X = None

Next we save the new training and validation sets locally. Note that the cell below writes them to new files (`new_validation.csv` and `new_train.csv`) rather than overwriting the training and validation sets used earlier. Keep in mind that the amount of space that we have available on our notebook instance is limited. Of course, you can increase this if you'd like, but doing so may increase the cost of running the notebook instance.

In [76]:
pd.DataFrame(new_XV).to_csv(os.path.join(data_dir, 'new_data.csv'), header=False, index=False)

pd.concat([new_val_y, new_val_X], axis=1).to_csv(os.path.join(data_dir, 'new_validation.csv'), header=False, index=False)
pd.concat([new_train_y, new_train_X], axis=1).to_csv(os.path.join(data_dir, 'new_train.csv'), header=False, index=False)
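
The files written above have no header row and no index column, and the label sits in the first column. That is the layout the training job reads (the training log further down reports `label_column=0`). As a small illustration of that layout, here is a sketch using only the standard library; the helper `write_labeled_csv` and the toy rows are hypothetical.

```python
import csv
import io

def write_labeled_csv(labels, features, handle):
    """Write one row per example: the label first, then the feature values."""
    writer = csv.writer(handle, lineterminator="\n")
    for label, row in zip(labels, features):
        writer.writerow([label] + list(row))

# Two toy examples with three features each, written to an in-memory buffer.
buffer = io.StringIO()
write_labeled_csv([1, 0], [[0, 2, 1], [3, 0, 0]], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```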

Now that we've saved our data to the local instance, we can safely delete the variables to save on memory.

In [77]:
new_val_y = new_val_X = new_train_y = new_train_X = new_XV = None

Lastly, we make sure to upload the new training and validation sets to S3.

In [78]:
# Upload the new data and the new validation and training CSV files in the data_dir directory to S3.
new_data_location = session.upload_data(os.path.join(data_dir, 'new_data.csv'), key_prefix=prefix)
new_val_location = session.upload_data(os.path.join(data_dir, 'new_validation.csv'), key_prefix=prefix)
new_train_location = session.upload_data(os.path.join(data_dir, 'new_train.csv'), key_prefix=prefix)

Once our new training data has been uploaded to S3, we can create a new XGBoost model that will take into account the changes that have occurred in our data set.

In [79]:
# Create a SageMaker estimator object for our model.
new_xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # What is our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

# Set the algorithm specific parameters.
new_xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)
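
The `early_stopping_rounds=10` setting asks XGBoost to stop adding trees once the validation metric has failed to improve for that many rounds, even though `num_round` allows up to 500. As a rough illustration of that rule, here is a plain-Python sketch. The function `early_stopping` is hypothetical and is not XGBoost's actual implementation, and the error values are made up for the example.

```python
def early_stopping(validation_errors, patience):
    """Return (round at which training stops, round with the best error)."""
    best_error = float("inf")
    best_round = -1
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            # A strict improvement resets the patience window.
            best_error, best_round = error, round_index
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return round_index, best_round
    return len(validation_errors) - 1, best_round

# Made-up validation errors: the best value is reached in round 1.
errors = [0.20, 0.19, 0.19, 0.195, 0.21]
stopped_at, best_round = early_stopping(errors, patience=2)
print(stopped_at, best_round)
```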

Once the model has been created, we can train it with our new data.

In [80]:
# Create s3 input objects
s3_new_input_train = sagemaker.s3_input(s3_data=new_train_location, content_type='csv')
s3_new_input_validation = sagemaker.s3_input(s3_data=new_val_location, content_type='csv')

In [81]:
# 'Fit' your new model.
new_xgb.fit({'train': s3_new_input_train, 'validation': s3_new_input_validation})

2019-12-30 16:51:20 Starting - Starting the training job...
2019-12-30 16:51:22 Starting - Launching requested ML instances...
2019-12-30 16:52:21 Starting - Preparing the instances for training.........
2019-12-30 16:53:50 Downloading - Downloading input data
2019-12-30 16:53:50 Training - Downloading the training image...
2019-12-30 16:54:11 Training - Training image download completed. Training in progress.
Arguments: train
[2019-12-30:16:54:12:INFO] Running standalone xgboost training.
[2019-12-30:16:54:12:INFO] File size need to be processed in the node: 238.47mb. Available memory size in the node: 8516.01mb
[2019-12-30:16:54:12:INFO] Determined delimiter of CSV input is ','
[16:54:12] S3DistributionType set as FullyReplicated
[16:54:14] 15000x5000 matrix with 75000000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-12-30:16:54:14:INFO] Determined delimiter of CSV input is ','

[16:55:10] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 8 pruned nodes, max_depth=5
[40] train-error:0.161133 validation-error:0.1881
[16:55:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 8 pruned nodes, max_depth=5
[41] train-error:0.161133 validation-error:0.1867
[16:55:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 2 pruned nodes, max_depth=5
[42] train-error:0.1598 validation-error:0.1867
[16:55:14] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 14 pruned nodes, max_depth=5
[43] train-error:0.158 validation-error:0.1867
[16:55:15] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 8 pruned nodes, max_depth=5
[44] train-error:0.1572 validation-error:0.1869
[16:55:16] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22

[16:56:10] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 8 pruned nodes, max_depth=5
[88] train-error:0.128067 validation-error:0.1739
[16:56:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 10 pruned nodes, max_depth=5
[89] train-error:0.1274 validation-error:0.1743
[16:56:13] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 14 pruned nodes, max_depth=5
[90] train-error:0.127133 validation-error:0.1737
[16:56:14] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 6 pruned nodes, max_depth=5
[91] train-error:0.126733 validation-error:0.1728
[16:56:15] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 2 pruned nodes, max_depth=5
[92] train-error:0.126333 validation-error:0.1724
[16:56:17] src/tree/updater_prune.cc:74: tree pruning end, 1 root

### Check the new model

So now we have a new XGBoost model that we believe more accurately represents the state of the world at this time, at least in how it relates to the sentiment analysis problem that we are working on. The next step is to double check that our model is performing reasonably. To do this, we will first test our model on the new data.

**Note:** In practice this is a pretty bad idea. We already trained our model on the new data, so testing it shouldn't really tell us much. In fact, this is sort of a textbook example of leakage. We are only doing it here so that we have a numerical baseline.

In [82]:
# Create a transformer object from the new_xgb model
new_xgb_transformer = new_xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

Next we test our model on the new data.

In [83]:
# Transform the new_data_location data
new_xgb_transformer.transform(new_data_location, content_type='text/csv', split_type='Line')
new_xgb_transformer.wait()

....................Arguments: serve
[2019-12-30 17:00:43 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2019-12-30 17:00:43 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2019-12-30 17:00:43 +0000] [1] [INFO] Using worker: gevent
[2019-12-30 17:00:43 +0000] [37] [INFO] Booting worker with pid: 37
[2019-12-30:17:00:43:INFO] Model loaded successfully for worker : 37
[2019-12-30 17:00:43 +0000] [38] [INFO] Booting worker with pid: 38
[2019-12-30 17:00:43 +0000] [39] [INFO] Booting worker with pid: 39
[2019-12-30 17:00:43 +0000] [40] [INFO] Booting worker with pid: 40
[2019-12-30:17:00:43:INFO] Model loaded successfully for worker : 39
[2019-12-30:17:00:43:INFO] Model loaded successfully for worker : 38
[2019-12-30:17:00:43:INFO] Model loaded successfully for worker : 40
[2019-12-30:17:00:51:INFO] Sniff delimiter as ','
[2019-12-30:17:00:51:INFO] Determined deli

[2019-12-30:17:01:12:INFO] Sniff delimiter as ','
[2019-12-30:17:01:12:INFO] Determined delimiter of CSV input is ','
[2019-12-30:17:01:13:INFO] Sniff delimiter as ','
[2019-12-30:17:01:13:INFO] Determined delimiter of CSV input is ','



Copy the results to our local instance.

In [84]:
!aws s3 cp --recursive $new_xgb_transformer.output_path $data_dir

download: s3://sagemaker-eu-west-1-873674308518/xgboost-2019-12-30-16-57-34-398/new_data.csv.out to ../data/sentiment_update/new_data.csv.out


And see how well the model did.

In [85]:
predictions = pd.read_csv(os.path.join(data_dir, 'new_data.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [86]:
accuracy_score(new_Y, predictions)

0.85908
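
The two cells above turn the predicted probabilities into class labels and then compute the fraction of correct labels. A dependency-free sketch of the same two steps is shown below. The helpers `to_labels` and `accuracy` and the toy numbers are hypothetical and only illustrate the idea. One detail worth knowing: Python's built-in `round` rounds exact ties such as `0.5` to the nearest even integer, so `round(0.5)` is `0`; the explicit threshold here makes that choice visible.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it exceeds the threshold, otherwise 0."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy values: the third prediction is confidently wrong.
probabilities = [0.91, 0.12, 0.55, 0.40]
truth = [1, 0, 0, 0]

predicted = to_labels(probabilities)
score = accuracy(truth, predicted)
print(predicted, score)
```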

As expected, since we trained the model on this data, our model performs pretty well. So, we have reason to believe that our new XGBoost model is a "better" model.

However, before we start changing our deployed model, we should first make sure that our new model isn't too different. In other words, if our new model performed really poorly on the original test data then this might be an indication that something else has gone wrong.

To start with, since we got rid of the variable that stored the original test reviews, we will read them in again from the cache that we created in Step 3. Note that we need to make sure that we read in the original test data after it has been pre-processed with `nltk` but before it has been bag of words encoded. This is because we need to use the new vocabulary instead of the original one.

In [87]:
cache_data = None
with open(os.path.join(cache_dir, "preprocessed_data.pkl"), "rb") as f:
    cache_data = pickle.load(f)
    print("Read preprocessed data from cache file:", "preprocessed_data.pkl")
            
test_X = cache_data['words_test']
test_Y = cache_data['labels_test']

# Here we set cache_data to None so that it doesn't occupy memory
cache_data = None

Read preprocessed data from cache file: preprocessed_data.pkl


Once we've loaded the original test reviews, we need to create a bag of words encoding of them using the new vocabulary that we created, based on the new data.

In [88]:
# Use the new_vectorizer object that you created earlier to transform the test_X data.
test_X = new_vectorizer.transform(test_X).toarray()

Now that we have correctly encoded the original test data, we can write it to the local instance, upload it to S3 and test it.

In [89]:
pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

In [90]:
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)

In [91]:
new_xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
new_xgb_transformer.wait()

....................Arguments: serve
[2019-12-30 17:07:16 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2019-12-30 17:07:16 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2019-12-30 17:07:16 +0000] [1] [INFO] Using worker: gevent
[2019-12-30 17:07:16 +0000] [39] [INFO] Booting worker with pid: 39
[2019-12-30 17:07:16 +0000] [40] [INFO] Booting worker with pid: 40
[2019-12-30 17:07:16 +0000] [41] [INFO] Booting worker with pid: 41
[2019-12-30 17:07:16 +0000] [42] [INFO] Booting worker with pid: 42
[2019-12-30:17:07:16:INFO] Model loaded successfully for worker : 39
[2019-12-30:17:07:16:INFO] Model loaded successfully for worker : 41
[2019-12-30:17:07:16:INFO] Model loaded successfully for worker : 40
[2019-12-30:17:07:16:INFO] Model loaded successfully for worker : 42
2019-12-30T17:07:35.369:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=




In [92]:
!aws s3 cp --recursive $new_xgb_transformer.output_path $data_dir

download: s3://sagemaker-eu-west-1-873674308518/xgboost-2019-12-30-17-04-14-420/test.csv.out to ../data/sentiment_update/test.csv.out


In [93]:
predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [94]:
accuracy_score(test_Y, predictions)

0.83992

It would appear that our new XGBoost model is performing quite well on the old test data. This gives us some indication that our new model should be put into production and replace our original model.

## Step 6: Updating the Model

So we have a new model that we'd like to use instead of one that is already deployed. Furthermore, we are assuming that the model that is already deployed is being used in some sort of application. As a result, what we want to do is update the existing endpoint so that it uses our new model.

Of course, to do this we need to create an endpoint configuration for our newly created model.

First, note that we can access the name of the model that we created above using the `model_name` property of the transformer. The reason for this is that in order for the transformer to create a batch transform job it needs to first create the model object inside of SageMaker. Since we've sort of already done this we should take advantage of it.

In [95]:
new_xgb_transformer.model_name

'xgboost-2019-12-30-16-51-20-594'

Next, we create an endpoint configuration using the low level approach of creating the dictionary object which describes the endpoint configuration we want.

In [96]:
from time import gmtime, strftime

# Give the endpoint configuration a unique name
new_xgb_endpoint_config_name = "sentiment-update-xgboost-endpoint-config-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Using the SageMaker Client, construct the endpoint configuration
new_xgb_endpoint_config_info = session.sagemaker_client.create_endpoint_config(
                            EndpointConfigName = new_xgb_endpoint_config_name,
                            ProductionVariants = [{
                                "InstanceType": "ml.m4.xlarge",
                                "InitialVariantWeight": 1,
                                "InitialInstanceCount": 1,
                                "ModelName": new_xgb_transformer.model_name,
                                "VariantName": "XGB-Model"
                            }])

Once the endpoint configuration has been constructed, it is a straightforward matter to ask SageMaker to update the existing endpoint so that it uses the new endpoint configuration.

Of note here is that SageMaker does this in such a way that there is no downtime. Essentially, SageMaker deploys the new model and then updates the original endpoint so that it points to the newly deployed model. After that, the original model is shut down. This way, whatever app is using our endpoint won't notice that we've changed the model that is being used.

In [97]:
# Update the xgb_predictor.endpoint so that it uses new_xgb_endpoint_config_name.
session.sagemaker_client.update_endpoint(EndpointName=xgb_predictor.endpoint, 
                                         EndpointConfigName=new_xgb_endpoint_config_name)

{'EndpointArn': 'arn:aws:sagemaker:eu-west-1:873674308518:endpoint/xgboost-2019-12-30-15-45-20-409',
 'ResponseMetadata': {'RequestId': '0f8ce405-8c89-4730-be37-0215f0c7a530',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0f8ce405-8c89-4730-be37-0215f0c7a530',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '99',
   'date': 'Mon, 30 Dec 2019 17:08:29 GMT'},
  'RetryAttempts': 0}}

And, as is generally the case with SageMaker requests, this is being done in the background so if we want to wait for it to complete we need to call the appropriate method.

In [98]:
session.wait_for_endpoint(xgb_predictor.endpoint)

--------------------------------------------------------------------------------------------------!

{'EndpointName': 'xgboost-2019-12-30-15-45-20-409',
 'EndpointArn': 'arn:aws:sagemaker:eu-west-1:873674308518:endpoint/xgboost-2019-12-30-15-45-20-409',
 'EndpointConfigName': 'sentiment-update-xgboost-endpoint-config-2019-12-30-17-08-28',
 'ProductionVariants': [{'VariantName': 'XGB-Model',
   'DeployedImages': [{'SpecifiedImage': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1',
     'ResolvedImage': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost@sha256:5fe3063b6797a14fec0da6c3c6d6b8cb484773c595864e91b85fa1e6168d3a38',
     'ResolutionTime': datetime.datetime(2019, 12, 30, 17, 8, 33, 45000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 1,
   'DesiredInstanceCount': 1}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2019, 12, 30, 16, 5, 33, 141000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2019, 12, 30, 17, 16, 45, 595000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '8ccdf
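
Waiting for a background update amounts to polling the endpoint's status until it settles. The sketch below shows the general shape of such a loop. It is not how `session.wait_for_endpoint` is implemented; the helper `wait_until_in_service` and the `FakeSageMakerClient` stand-in are hypothetical and exist only so the example runs without AWS access. With a real client, `describe_endpoint` and its `EndpointStatus` field would come from the SageMaker API.

```python
import time

def wait_until_in_service(client, endpoint_name, poll_seconds=30, max_polls=120):
    """Poll describe_endpoint until the endpoint is InService, fails, or we give up."""
    for _ in range(max_polls):
        description = client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return description
        if status == "Failed":
            raise RuntimeError("Endpoint update failed: " + endpoint_name)
        time.sleep(poll_seconds)
    raise TimeoutError("Gave up waiting for endpoint: " + endpoint_name)


class FakeSageMakerClient:
    """Stand-in for a real client: reports 'Updating' twice, then 'InService'."""

    def __init__(self):
        self._statuses = iter(["Updating", "Updating", "InService"])

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


result = wait_until_in_service(FakeSageMakerClient(), "demo-endpoint", poll_seconds=0)
print(result["EndpointStatus"])
```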

## Step 7: Delete the Endpoint

Of course, since we are done with the deployed endpoint we need to make sure to shut it down, otherwise we will continue to be charged for it.

In [99]:
xgb_predictor.delete_endpoint()

## Optional: Clean up

In [100]:
!rm $data_dir/*
!rmdir $data_dir
!rm $cache_dir/*
!rmdir $cache_dir

---