Note: The BlazingText portion of this notebook does not work on the default `t2.medium` instance.  It requires `t2.large` or larger because the BlazingText model is 750MB.

# Building a Recommender System with Amazon SageMaker BlazingText

- What happens when a new movie is added?
  - No feature to set to "1" in the dataset
  - No previous ratings to find similar items
  - Cold start problem is hard with factorization machines
- Word2vec
  - Word embeddings for natural language processing (similar words get similar vectors)
  - Use concatenated product titles as words, customer review history as sentences
  - SageMaker BlazingText is an extremely fast implementation that can work with subwords

In [2]:
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
base = 'DEMO-loft-recommender'
prefix = 'sagemaker/' + base

role = sagemaker.get_execution_role()

In [3]:
import sagemaker
import os
import pandas as pd
import numpy as np
import boto3
import json
import io
import matplotlib.pyplot as plt
import sagemaker.amazon.common as smac
from sagemaker.predictor import json_deserializer
from scipy.sparse import csr_matrix

---

## Data

[Amazon Reviews AWS Public Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html)
- 1 to 5 star ratings
- 2M+ Amazon customers
- 160K+ digital videos 

In [4]:
!mkdir -p /tmp/recsys/
!aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz /tmp/recsys/

download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz to ../../../../../tmp/recsys/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz


In [5]:
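# Malformed rows are skipped. `error_bad_lines` was removed in pandas 2.0;
# on pandas >= 1.3, pass on_bad_lines='skip' instead.
df = pd.read_csv('/tmp/recsys/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz', delimiter='\t', error_bad_lines=False)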
df.head()

b'Skipping line 92523: expected 15 fields, saw 22\n'
b'Skipping line 343254: expected 15 fields, saw 22\n'
b'Skipping line 524626: expected 15 fields, saw 22\n'
b'Skipping line 623024: expected 15 fields, saw 22\n'
b'Skipping line 977412: expected 15 fields, saw 22\n'
b'Skipping line 1496867: expected 15 fields, saw 22\n'
b'Skipping line 1711638: expected 15 fields, saw 22\n'
b'Skipping line 1787213: expected 15 fields, saw 22\n'
b'Skipping line 2395306: expected 15 fields, saw 22\n'
b'Skipping line 2527690: expected 15 fields, saw 22\n'


Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,12190288,R3FU16928EP5TC,B00AYB1482,668895143,Enlightened: Season 1,Digital_Video_Download,5,0,0,N,Y,I loved it and I wish there was a season 3,I loved it and I wish there was a season 3... ...,2015-08-31
1,US,30549954,R1IZHHS1MH3AQ4,B00KQD28OM,246219280,Vicious,Digital_Video_Download,5,0,0,N,Y,As always it seems that the best shows come fr...,As always it seems that the best shows come fr...,2015-08-31
2,US,52895410,R52R85WC6TIAH,B01489L5LQ,534732318,After Words,Digital_Video_Download,4,17,18,N,Y,Charming movie,"This movie isn't perfect, but it gets a lot of...",2015-08-31
3,US,27072354,R7HOOYTVIB0DS,B008LOVIIK,239012694,Masterpiece: Inspector Lewis Season 5,Digital_Video_Download,5,0,0,N,Y,Five Stars,excellant this is what tv should be,2015-08-31
4,US,26939022,R1XQ2N5CDOZGNX,B0094LZMT0,535858974,On The Waterfront,Digital_Video_Download,5,0,0,N,Y,Brilliant film from beginning to end,Brilliant film from beginning to end. All of t...,2015-08-31


Dataset columns:

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital videos).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.

Drop some fields that won't be used

In [6]:
df = df[['customer_id', 'product_id', 'product_title', 'star_rating', 'review_date']]

Most users rate only a few movies, and most movies receive only a few ratings. Check the long tail of both distributions.

In [7]:
customers = df['customer_id'].value_counts()
products = df['product_id'].value_counts()

quantiles = [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1]
print('customers\n', customers.quantile(quantiles))
print('products\n', products.quantile(quantiles))

customers
 0.00       1.0
0.01       1.0
0.02       1.0
0.03       1.0
0.04       1.0
0.05       1.0
0.10       1.0
0.25       1.0
0.50       1.0
0.75       2.0
0.90       4.0
0.95       5.0
0.96       6.0
0.97       7.0
0.98       9.0
0.99      13.0
1.00    2704.0
Name: customer_id, dtype: float64
products
 0.00        1.00
0.01        1.00
0.02        1.00
0.03        1.00
0.04        1.00
0.05        1.00
0.10        1.00
0.25        1.00
0.50        3.00
0.75        9.00
0.90       31.00
0.95       73.00
0.96       95.00
0.97      130.00
0.98      199.00
0.99      386.67
1.00    32790.00
Name: product_id, dtype: float64


Filter out customers with fewer than 5 ratings and products with fewer than 10 ratings

In [8]:
customers = customers[customers >= 5]
products = products[products >= 10]

reduced_df = df.merge(pd.DataFrame({'customer_id': customers.index})).merge(pd.DataFrame({'product_id': products.index}))
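
The chained `merge` calls above act as filters: an inner join against a one-column frame of surviving IDs keeps only the matching rows. A minimal sketch on made-up data (the toy frame and the threshold of 2 are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy ratings; the IDs and thresholds are invented for illustration.
toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 3, 3],
    'product_id':  ['a', 'b', 'c', 'a', 'a', 'b'],
})

cust_counts = toy['customer_id'].value_counts()
prod_counts = toy['product_id'].value_counts()

# Keep customers with at least 2 ratings and products with at least 2 ratings.
keep_customers = cust_counts[cust_counts >= 2]
keep_products = prod_counts[prod_counts >= 2]

# merge() defaults to an inner join on the shared column name,
# so rows whose ID is absent from the right-hand frame are dropped.
toy_reduced = (
    toy.merge(pd.DataFrame({'customer_id': keep_customers.index}))
       .merge(pd.DataFrame({'product_id': keep_products.index}))
)
print(toy_reduced)
```

Because the counts are taken before filtering, a surviving product can end up with fewer ratings than its threshold once sparse customers are removed.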

Create a sequential index for customers and movies

In [9]:
customers = reduced_df['customer_id'].value_counts()
products = reduced_df['product_id'].value_counts()

In [10]:
customer_index = pd.DataFrame({'customer_id': customers.index, 'user': np.arange(customers.shape[0])})
product_index = pd.DataFrame({'product_id': products.index, 
                              'item': np.arange(products.shape[0]) + customer_index.shape[0]})

reduced_df = reduced_df.merge(customer_index).merge(product_index)
reduced_df.head()

Unnamed: 0,customer_id,product_id,product_title,star_rating,review_date,user,item
0,27072354,B008LOVIIK,Masterpiece: Inspector Lewis Season 5,5,2015-08-31,10463,140450
1,16030865,B008LOVIIK,Masterpiece: Inspector Lewis Season 5,5,2014-06-20,489,140450
2,44025160,B008LOVIIK,Masterpiece: Inspector Lewis Season 5,5,2014-05-27,32100,140450
3,18602179,B008LOVIIK,Masterpiece: Inspector Lewis Season 5,5,2014-12-23,2237,140450
4,14424972,B008LOVIIK,Masterpiece: Inspector Lewis Season 5,5,2015-08-31,32340,140450
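
Item indices start where user indices end, presumably so that a user and an item can occupy distinct columns of one shared one-hot feature space, as a factorization machine would expect. A small sketch with hypothetical IDs (the helper `one_hot_row` is not part of the notebook):

```python
import numpy as np
import pandas as pd

customer_ids = [501, 502, 503]   # hypothetical customers
product_ids = ['B1', 'B2']       # hypothetical products

# Users get indices 0..2; items continue from 3 so the two never collide.
customer_index = pd.DataFrame({'customer_id': customer_ids,
                               'user': np.arange(len(customer_ids))})
product_index = pd.DataFrame({'product_id': product_ids,
                              'item': np.arange(len(product_ids)) + len(customer_ids)})

n_features = len(customer_ids) + len(product_ids)

def one_hot_row(user, item):
    """Build one feature row with a 1 in the user column and a 1 in the item column."""
    row = np.zeros(n_features, dtype=int)
    row[user] = 1
    row[item] = 1
    return row

row = one_hot_row(user=1, item=3)   # customer 502 rating product B1
print(row)
```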


Count days since first review (included as a feature to capture trend)

In [11]:
reduced_df['review_date'] = pd.to_datetime(reduced_df['review_date'])
customer_first_date = reduced_df.groupby('customer_id')['review_date'].min().reset_index()
customer_first_date.columns = ['customer_id', 'first_review_date']

In [12]:
reduced_df = reduced_df.merge(customer_first_date)
reduced_df['days_since_first'] = (reduced_df['review_date'] - reduced_df['first_review_date']).dt.days
reduced_df['days_since_first'] = reduced_df['days_since_first'].fillna(0)
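
An equivalent, more compact way to compute the same feature is `groupby(...).transform('min')`, which broadcasts each customer's first review date back onto their rows and avoids the intermediate merge. A sketch on invented dates:

```python
import pandas as pd

# Invented review dates for two customers.
toy = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'review_date': pd.to_datetime(['2015-01-01', '2015-01-11', '2015-06-30']),
})

# transform('min') returns a Series aligned with the original rows.
first_review = toy.groupby('customer_id')['review_date'].transform('min')
toy['days_since_first'] = (toy['review_date'] - first_review).dt.days
print(toy)
```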

---

## Data

Concatenate product titles to treat each one as a single word

In [13]:
reduced_df['product_title'] = reduced_df['product_title'].apply(lambda x: x.lower().replace(' ', '-'))

Write customer purchase histories

In [14]:
first = True
with open('customer_purchases.txt', 'w') as f:
    for customer, data in reduced_df.sort_values(['customer_id', 'review_date']).groupby('customer_id'):
        if first:
            first = False
        else:
            f.write('\n')
        f.write(' '.join(data['product_title'].tolist()))
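
To illustrate the file format the loop above produces, the sketch below builds the same kind of corpus in memory from a toy frame: one line per customer, titles separated by spaces and ordered by review date. The rows are invented for the example.

```python
import pandas as pd

# Invented purchase history; titles are already lowercased and hyphenated.
toy = pd.DataFrame({
    'customer_id':   [2, 1, 1, 2],
    'product_title': ['sherlock-season-2', 'vicious', 'after-words', 'sherlock-season-1'],
    'review_date':   pd.to_datetime(['2015-02-01', '2015-01-05', '2015-03-01', '2015-01-01']),
})

# Each customer becomes one "sentence"; each title is one "word".
sentences = [
    ' '.join(group['product_title'])
    for _, group in toy.sort_values(['customer_id', 'review_date']).groupby('customer_id')
]
corpus = '\n'.join(sentences)
print(corpus)
```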

Write to S3 so SageMaker training can use it

In [15]:
inputs = sess.upload_data('customer_purchases.txt', bucket, '{}/word2vec/train'.format(prefix))

---

## Train

Create a SageMaker estimator:
- Specify training job arguments
- Set hyperparameters
  - Remove titles that occur fewer than 5 times
  - Embed in a 100-dimensional space
  - Use subwords to capture similarity in titles

In [24]:
bt = sagemaker.estimator.Estimator(
    sagemaker.amazon.amazon_estimator.get_image_uri(boto3.Session().region_name, 'blazingtext', 'latest'),
    role, 
    train_instance_count=1, 
    train_instance_type='ml.c5.4xlarge',
    train_volume_size = 5,
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sess)

bt.set_hyperparameters(mode="skipgram",
    epochs=10,
    min_count=5,
    sampling_threshold=0.0001,
    learning_rate=0.05,
    window_size=5,
    vector_dim=100,
    negative_samples=5,
    min_char=5,
    max_char=10,
    evaluation=False,
    subwords=True)

bt.fit({'train': sagemaker.s3_input(inputs, distribution='FullyReplicated', content_type='text/plain')})

2020-01-09 05:52:10 Starting - Starting the training job...
2020-01-09 05:52:12 Starting - Launching requested ML instances......
2020-01-09 05:53:14 Starting - Preparing the instances for training...
2020-01-09 05:54:10 Downloading - Downloading input data...
2020-01-09 05:54:34 Training - Training image download completed. Training in progress....
Arguments: train
[01/09/2020 05:54:35 INFO 140526547863360] nvidia-smi took: 0.0251200199127 secs to identify 0 gpus
[01/09/2020 05:54:35 INFO 140526547863360] Running single machine CPU BlazingText training using skipgram mode.
[01/09/2020 05:54:35 INFO 140526547863360] Processing /opt/ml/input/data/train/customer_purchases.txt . File size: 23 MB
Read 1M words
Number of words:  17990
##### Alpha: 0.0490  Progress: 2.08%  Million Words/sec: 0.14 #####
##### Alpha: 0.0464  Progress: 7.21%  Million Words/sec: 0.27 #####
##### Alpha: 0.0438  Progress: 12.34%  Million W

---

## Model

- Bring in and extract the model from S3
- Take a look at the embeddings

In [25]:
!aws s3 cp $bt.model_data ./

download: s3://sagemaker-us-east-1-362377691630/sagemaker/DEMO-loft-recommender/output/blazingtext-2020-01-09-05-52-10-284/output/model.tar.gz to ./model.tar.gz


In [26]:
!tar -xvzf model.tar.gz

vectors.txt
vectors.bin


In [27]:
vectors = pd.read_csv('vectors.txt', delimiter=' ', skiprows=2, header=None)

Do the embeddings appear to have meaning?

In [28]:
vectors.sort_values(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,100,101
2530,numb3rs-season-6,-0.90651,0.127700,0.667570,0.055374,1.181100,-0.819990,0.234570,0.439920,0.499160,...,0.348260,-0.568520,0.366540,0.515380,0.274960,-0.138550,-0.383820,-0.901550,0.105460,
1034,numb3rs-season-1,-0.86358,-0.194940,0.754700,0.043741,1.180700,-0.554540,0.274390,0.584430,0.735490,...,0.170100,-0.988190,0.174950,0.417200,0.256700,-0.190390,-0.104640,-0.824950,0.326300,
3496,cursed,-0.85070,0.209000,0.265480,-0.267140,0.037559,-0.110060,-0.180380,-0.068710,0.042822,...,0.140200,-0.164850,-0.151460,-0.145750,-0.162990,-0.133210,-0.375560,-0.092639,-0.599520,
1425,numb3rs-season-2,-0.84059,0.001696,0.762740,0.189050,1.089400,-0.699110,0.430100,0.645460,0.834560,...,0.105020,-0.967700,0.207410,0.330840,0.240110,-0.301450,-0.030282,-0.963380,0.346920,
2573,numb3rs-season-5,-0.80754,0.199250,0.637460,0.160100,1.222200,-0.675070,0.216450,0.606940,0.625380,...,0.246740,-0.610580,0.379970,0.500300,0.265790,-0.014594,-0.125530,-0.973740,0.158150,
5466,fire-from-below,-0.78355,0.192180,0.055132,0.021436,-0.066111,-0.956950,-0.072112,0.041428,-0.171080,...,0.261230,-0.772530,0.391030,0.485670,-0.455830,-0.540120,-0.246230,-0.044138,-0.126760,
6325,end-times-revelation,-0.77774,0.423640,-0.232110,0.022216,0.338800,-0.686190,0.145520,-0.020163,-0.103700,...,0.281180,-0.149970,0.045357,0.392350,-0.678900,-0.128510,-0.464420,-0.582490,-0.122270,
2547,numb3rs-season-4,-0.75168,0.146380,0.803450,0.053697,1.123100,-0.631200,0.290460,0.612800,0.773750,...,0.110720,-0.761690,0.114120,0.666170,0.339160,-0.085017,-0.209920,-0.959330,0.404270,
3509,stigmata,-0.74468,0.182770,-0.442090,0.111600,0.083292,-0.137630,-0.082389,-0.206520,-0.150950,...,-0.015477,-0.691730,0.137230,0.214650,-0.175680,-0.379170,-0.095327,-0.336800,0.197770,
1479,bram-stoker's-dracula,-0.73966,-0.288070,-0.291150,0.142630,0.341260,0.134500,-0.519610,-0.004268,0.546000,...,0.678790,0.128390,-0.281790,-0.076962,-0.306970,0.022116,0.089595,0.007248,0.323410,


In [29]:
vectors.sort_values(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,100,101
995,the-adventures-of-tintin,-0.093420,-1.08160,0.543510,-0.239500,0.039762,-0.634300,0.294580,0.426340,-0.194960,...,-0.041992,-0.999320,-0.206640,0.054267,-0.463410,-0.057899,0.758520,-0.000526,0.089425,
1593,how-i-met-your-mother-season-1,0.740590,-1.07860,0.660890,-0.676670,0.914080,-0.183640,0.820360,0.257430,-0.771640,...,-0.368380,-0.089059,0.141970,-0.076741,-0.164710,0.271620,0.157230,-0.943230,0.539290,
394,season-1,0.506310,-1.03370,0.571990,-0.900620,-0.216040,-0.061494,0.258290,0.422750,0.449360,...,-0.298190,-0.110980,-0.292620,0.142680,-0.280000,0.217430,-0.149720,-0.282150,-0.134690,
3091,how-i-met-your-mother-season-3,0.893200,-0.96857,0.684930,-0.666690,0.881920,-0.183030,0.850290,0.272530,-0.793600,...,-0.427020,-0.003645,0.147100,-0.078973,-0.128570,0.281360,0.213420,-0.989230,0.555720,
3142,how-i-met-your-mother-season-2,0.781610,-0.95721,0.678780,-0.639380,0.902510,-0.229770,0.896270,0.292640,-0.724170,...,-0.413400,-0.061734,0.187520,-0.141080,-0.177840,0.261650,0.190810,-1.009000,0.557620,
1414,how-i-met-your-mother-season-8,0.739540,-0.95655,0.694340,-0.622020,0.851040,-0.187440,0.829920,0.303930,-0.831650,...,-0.496270,0.006360,0.186970,-0.076398,-0.159600,0.328270,0.126170,-1.077100,0.493330,
2593,the-adventures-of-milo-and-otis,0.216880,-0.92064,0.023902,-0.285050,0.326150,-0.604130,0.521820,0.401390,-0.054085,...,-0.074290,-0.642430,-0.379080,-0.255500,-0.680350,-0.073363,0.338810,-0.361110,-0.071190,
97,24-season-1,-0.162100,-0.91065,0.580510,-0.054211,0.377230,-0.566230,0.164400,-0.120790,-0.312160,...,-0.208760,-0.436370,0.175400,0.279250,-0.159910,0.541640,-0.233200,0.033374,0.001321,
881,how-i-met-your-mother-season-9,0.782110,-0.90954,0.564600,-0.665030,0.831250,-0.137770,0.817150,0.413600,-0.620930,...,-0.527870,0.114100,0.261150,0.000578,-0.009337,0.308180,0.062584,-0.991760,0.504320,
3779,how-i-met-your-mother-season-7,0.832710,-0.90893,0.589650,-0.594010,0.892870,-0.262890,0.730840,0.267520,-0.853150,...,-0.351100,0.032584,0.250850,-0.012182,-0.202830,0.272330,0.145820,-1.010800,0.356690,


Yes, but there are 100 dimensions.  Let's reduce them to two with t-SNE and map the top 100 titles.

In [30]:
product_titles = vectors[0]
vectors = vectors.drop([0, 101], axis=1)

In [31]:
from sklearn.manifold import TSNE

tsne = TSNE(perplexity=40, n_components=2, init='pca', n_iter=10000)
embeddings = tsne.fit_transform(vectors.values[:100, ])

ImportError: cannot import name 'asmatrix'

In [None]:
from matplotlib import pylab
%matplotlib inline

def plot(embeddings, labels):
    pylab.figure(figsize=(20,20))
    for i, label in enumerate(labels):
        x, y = embeddings[i,:]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()

plot(embeddings, product_titles[:100])
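
As a complementary check that does not depend on a 2-D projection, nearest neighbors by cosine similarity can be computed directly on the embedding matrix. The sketch below uses made-up 3-dimensional vectors in place of the real 100-dimensional ones, and the helper name `nearest` is hypothetical. With the notebook's data, `titles` would roughly correspond to `product_titles.tolist()` and `emb` to `vectors.values`.

```python
import numpy as np

titles = ['show-season-1', 'show-season-2', 'other-movie']   # invented titles
emb = np.array([[1.0, 0.0, 0.2],
                [0.9, 0.1, 0.3],
                [-0.5, 1.0, 0.0]])                            # invented vectors

def nearest(query, titles, emb, k=1):
    """Return the k titles most cosine-similar to `query`, excluding itself."""
    # Normalize rows so a dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = titles.index(query)
    sims = unit @ unit[q]
    order = [i for i in np.argsort(-sims) if i != q]
    return [titles[i] for i in order[:k]]

print(nearest('show-season-1', titles, emb))
```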

---

## Host

Deploy our model to a real-time endpoint.

In [None]:
bt_endpoint = bt.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-------

Try generating predictions for a set of titles (some of which are real, some of which are made up).

In [None]:
words = ["sherlock-season-1", 
         "sherlock-season-2",
         "sherlock-season-5",
         'arbitrary-sherlock-holmes-string',
         'the-imitation-game',
         "abcdefghijklmn",
         "keeping-up-with-the-kardashians-season-1"]

payload = {"instances" : words}

response = bt_endpoint.predict(json.dumps(payload))

vecs_df = pd.DataFrame(json.loads(response))

Calculate correlation and squared Euclidean distance.

In [None]:
vecs_df = pd.DataFrame(vecs_df['vector'].values.tolist(), index=vecs_df['word'])

In [None]:
vecs_df = vecs_df.transpose()
vecs_df.corr()

In [None]:
for column in vecs_df.columns:
    print(column + ':', np.sum((vecs_df[column] - vecs_df['sherlock-season-1']) ** 2))

Relative to 'sherlock-season-1':
- 'sherlock-season-5' is made up, but relates well with 'sherlock-season-1' and 'sherlock-season-2'
- 'arbitrary-sherlock-holmes-string' is also made up and relates less well, but still fairly strongly
- 'the-imitation-game' is another popular Prime Video title starring Benedict Cumberbatch and has a moderate relationship, but a weaker one than the arbitrary Sherlock title
- 'abcdefghijklmn' is made up and relates even worse
- 'keeping-up-with-the-kardashians-season-1' somehow manages to relate even worse
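
A plausible reason the made-up 'sherlock-season-5' still lands near the real seasons is that, with `subwords=True`, a title's vector is built from character n-grams it shares with titles seen in training. The sketch below enumerates fastText-style n-grams for the `min_char=5`, `max_char=10` range used earlier and compares their overlap. The `<`/`>` boundary markers and the helpers `char_ngrams` and `jaccard` are assumptions for illustration, not BlazingText's actual internals.

```python
def char_ngrams(word, min_n=5, max_n=10):
    """Return the set of character n-grams of `word`, padded with boundary markers."""
    padded = '<' + word + '>'
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(padded) - n + 1):
            grams.add(padded[start:start + n])
    return grams

def jaccard(a, b):
    """Share of n-grams two sets have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

real = char_ngrams('sherlock-season-1')
made_up = char_ngrams('sherlock-season-5')
unrelated = char_ngrams('abcdefghijklmn')

print(jaccard(real, made_up))     # large overlap: most n-grams are shared
print(jaccard(real, unrelated))   # no n-grams of length 5+ in common
```

N-gram overlap is only a rough proxy: the learned vectors also reflect which titles co-occur in purchase histories, which this sketch does not model.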

Clean up the endpoint

In [None]:
bt_endpoint.delete_endpoint()


# Beyond the Basics
- Add more features to improve the recommendations
- Compare to other methods
- Ensemble two model recommendations