# Overview

This notebook and associated files will be used to train and deploy a custom container in SageMaker.  This notebook uses the Q&A dataset from Kaggle where the objective is to find duplicate questions in the dataset.  We'll use the SimHash library to encode the text into a 64 bit representation and the Spotify Annoy library to build an index.  Once the index is built we'll deploy the index for nearest neighbor search via SageMaker

In [146]:
# install simhash
!pip install simhash



In [147]:
import sagemaker
from sagemaker.estimator import Estimator
import pandas as pd
import numpy as np
from simhash import Simhash
import re
import boto3
import time
import json

In [148]:
session = sagemaker.Session()
bucket_name = session.default_bucket()
prefix = 'simhash-annoy'

# Training and Inference Container

To build the custom continer we'll use a Dockerfile, training script, and inference script.  We'll build the container on the notebook instance and upload to ECR.  We are following documentation - https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-create.html


First create the ECR repository

In [149]:
repo_name = 'simhash-annoy'

In [185]:
!aws ecr create-repository --repository-name {repo_name} --region {boto3.Session().region_name}


An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'simhash-annoy' already exists in the registry with id '431615879134'



Next, build the container using docker, tag the continer, and push it to the repo above.  

*Update the code below with the correct repo and version name*


In [150]:
%%sh
export REGION=us-east-1
export ACCOUNT_ID=`aws sts get-caller-identity --query Account --output text`
export IMAGE_ID=`docker images -q simhash-annoy:traindeploy`
echo $ACCOUNT_ID
docker build -t simhash-annoy:traindeploy -f Dockerfile ./  
docker tag $IMAGE_ID $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/simhash-annoy:traindeploy
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/simhash-annoy:traindeploy
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/simhash-annoy:traindeploy

431615879134
Sending build context to Docker daemon    240MB
Step 1/15 : FROM python:3.7
 ---> 9b5e75b69a4f
Step 2/15 : ENV PATH="/root/miniconda3/bin:${PATH}"
 ---> Using cache
 ---> 03da77366816
Step 3/15 : ARG PATH="/root/miniconda3/bin:${PATH}"
 ---> Using cache
 ---> 98632610c687
Step 4/15 : RUN apt-get update
 ---> Using cache
 ---> 676785a28291
Step 5/15 : RUN apt-get install -y wget && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 6030d4c921d9
Step 6/15 : RUN wget     https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh     && mkdir /root/.conda     && bash Miniconda3-latest-Linux-x86_64.sh -b     && rm -f Miniconda3-latest-Linux-x86_64.sh
 ---> Using cache
 ---> 21ce0c332477
Step 7/15 : RUN conda --version
 ---> Using cache
 ---> 5d2192b628d2
Step 8/15 : RUN pip3 install --no-cache numpy pandas
 ---> Using cache
 ---> 5391e01be4ad
Step 9/15 : RUN pip3 install --no-cache simhash
 ---> Using cache
 ---> d30c0e4e40f6
Step 10/15 : RUN conda install -c conda-

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



# Data
Let's prepare our dataset for training and inference.  This dataset is from Kaggle.  We are taking the first 200k question pairs.  Note that not all question pairs are duplicates.  In total we'll use 400k questions

In [151]:
q_df = pd.read_csv('questions.csv')
data = np.concatenate((q_df['question1'].values[0:200000], q_df['question2'].values[0:200000]))

In [152]:
# save text array as compressed numpy format.  Compression is reducing the file size by ~1/3
np.savez('data.npz',data=data)

In [153]:
# upload data to s3
data_path = session.upload_data('data.npz',bucket=bucket_name,key_prefix=prefix+'/data')

In [154]:
data[0]

'What is the step by step guide to invest in share market in india?'

# Estimator

Let's define the estimator using our custom container and hyperparameters

In [155]:
output_path = f's3://{bucket_name}/{prefix}/output/'

In [156]:
estimator = Estimator(
    role=sagemaker.get_execution_role(),
    image_uri='431615879134.dkr.ecr.us-east-1.amazonaws.com/simhash-annoy:traindeploy',
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output=output_path,
    hyperparameters={
        'dimension': 64,
        'distance': 'hamming',
        'topk': 25,
        'numtrees': 10
    })

In [157]:
%%time
estimator.fit({'training':data_path})

2021-06-03 16:38:17 Starting - Starting the training job...
2021-06-03 16:38:41 Starting - Launching requested ML instancesProfilerReport-1622738296: InProgress
......
2021-06-03 16:39:41 Starting - Preparing the instances for training......
2021-06-03 16:40:42 Downloading - Downloading input data
2021-06-03 16:40:42 Training - Downloading the training image......
2021-06-03 16:41:42 Training - Training image download completed. Training in progress.....[34mhyperparameters: 64 hamming 25 10[0m
[34mhyperparameters parsed[0m
[34mData loaded from .npz in 0.31 seconds[0m
[34mData converted to Simhash in 41.53 seconds[0m
[34mAnnoy Index built in 6.61 seconds[0m
[34mModel Saved[0m

2021-06-03 16:42:42 Uploading - Uploading generated training model
2021-06-03 16:42:42 Completed - Training job completed
Training seconds: 128
Billable seconds: 128
CPU times: user 518 ms, sys: 32.5 ms, total: 551 ms
Wall time: 4min 42s


# Model Artifacts

Now that we've trained our model, we can find the compressed model artifacts in s3.  Let's copy them to the notebook instance for inspection

In [158]:
!aws s3 cp {estimator.model_data} .

download: s3://sagemaker-us-east-1-431615879134/simhash-annoy-2021-06-03-16-38-16-876/output/model.tar.gz to ./model.tar.gz


In [159]:
estimator.model_data

's3://sagemaker-us-east-1-431615879134/simhash-annoy-2021-06-03-16-38-16-876/output/model.tar.gz'

In [160]:
# uncompress
!tar xvf model.tar.gz

test.ann


# Inference

We'll use sagemaker to deploy the model and send some data through the real time endpoint

In [138]:
# optional - code to load a model object from the provided trained model
# estimator = sagemaker.model.Model(
#     model_data='s3://sagemaker-us-east-1-431615879134/simhash-annoy-2021-06-03-16-38-16-876/output/model.tar.gz',
#     image_uri='431615879134.dkr.ecr.us-east-1.amazonaws.com/simhash-annoy:traindeploy',
#     role=sagemaker.get_execution_role(),
#     predictor_cls = sagemaker.predictor.Predictor)

In [161]:
from sagemaker.serializers import CSVSerializer
predictor = estimator.deploy(instance_type='ml.m5.xlarge',initial_instance_count=1,serializer=CSVSerializer())

-------------!

In [134]:
# optional - connect to endpoint that has already been deployed
# end_point_name = ''
# predictor = sagemaker.predictor.Predictor(end_point_name,sagemaker_session=session)

In [162]:
# character n-gram function
def get_features(s):
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

In [163]:
test_text = 'What is the step by step guide to invest in share market in india?'
text_hash = Simhash(get_features(str(test_text))).value
binary_hash = format(int(text_hash), f'0{64}b')
vector_hash = ''
for i,char in enumerate(binary_hash):
    if i != len(binary_hash)-1:
        vector_hash = vector_hash + char + ", "
    else:
        vector_hash = vector_hash + char

In [164]:
t = time.time()
response = predictor.predict(vector_hash)
print(f'Prediction Latency {1000*(time.time()-t):.2f} ms')

Prediction Latency 184.72 ms


In [165]:
for q,res in enumerate(json.loads(response.decode(encoding='utf8'))):
    print(data[res])
    if q == 0: print('-------------------------')

What is the step by step guide to invest in share market in india?
-------------------------
What is the step by step guide to invest in share market?
Which movies are the best examples of the Hero's Journey?
How can I invest $100 into myself?
How can I invest $100 into myself?
How can I invest $100 into myself?
What is the biggest animal with more than four legs?
Which is the best way to invest in stock market?
Which is the best way to invest in stock market?
What are the best places to visit in San Francisco at night?
What are the best long term investment options in India?
Which are the best love story novels?
Which are the best love story novels?
What are the best musical venues in San Francisco?
What are the best long term investment options in India?
How can I cultivate patience?
Which are the best love story novels?
What are the best ways to trade in share markets in India?
Which are the best love story novels?
What are the best examples of "girls will be girls"?
Which are the b

# Cleanup (Optional)

In [105]:
# used if docker gets in a strange state
!sudo service docker restart

Stopping docker: [60G[[0;32m  OK  [0;39m]
Starting docker:	.[60G[[0;32m  OK  [0;39m]


In [69]:
predictor.delete_endpoint()

Gracefully stopping... (press Ctrl+C again to force)
