# Training a TensorFlow Multitask Recommender on a SageMaker Cluster

In this tutorial, we build a simple matrix factorization model on the [MovieLens 100K dataset](https://grouplens.org/datasets/movielens/100k/) using TensorFlow Recommenders (TFRS) and train it on Amazon SageMaker.

We will use this model to recommend movies for a given user.
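
A matrix factorization model represents each user and each movie as a vector (an embedding) and scores a user-movie pair with the dot product of the two vectors. The sketch below illustrates that scoring step in plain Python. The two-dimensional embeddings and the movie titles are made up for illustration; the trained model learns its embeddings from the MovieLens ratings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_vec, item_vecs, k=2):
    """Return the k item names whose embeddings score highest for the user."""
    ranked = sorted(item_vecs, key=lambda name: dot(user_vec, item_vecs[name]), reverse=True)
    return ranked[:k]


# Hypothetical toy embeddings (not taken from the trained model)
user_embedding = [0.9, 0.1]
movie_embeddings = {
    'Movie A': [1.0, 0.0],
    'Movie B': [0.0, 1.0],
    'Movie C': [0.7, 0.7],
}

top_movies = recommend(user_embedding, movie_embeddings, k=2)
print(top_movies)
```

The model trained later in this notebook uses 32-dimensional embeddings (`embedding_dimension=32`) rather than the two dimensions used here.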

In [1]:
!pip install -q sagemaker==2.9.2
!pip install -q sagemaker-experiments==0.1.24
!pip install -q tensorflow==2.3.0
!pip install -q tensorflow-recommenders==0.2.0
!pip install -q tensorflow-datasets==4.0.0

In [2]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Specify Input Data S3 URI and `Distribution Strategy`

In [3]:
from sagemaker.inputs import TrainingInput

input_train_data_s3_uri = 's3://{}/tensorflow_datasets/train/'.format(bucket)

s3_input_train_data = TrainingInput(s3_data=input_train_data_s3_uri,
                                    distribution='ShardedByS3Key')
print(s3_input_train_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-835319576252/tensorflow_datasets/train/', 'S3DataDistributionType': 'ShardedByS3Key'}}}
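
With `ShardedByS3Key`, each training instance receives only a subset of the S3 objects under the prefix, whereas the default `FullyReplicated` copies every object to every instance. The sketch below illustrates the idea with a round-robin split. The round-robin rule and the object keys are assumptions made for illustration; the exact assignment SageMaker uses may differ.

```python
def shard_keys(keys, instance_count):
    """Split S3 object keys across instances (illustrative round-robin)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical object keys under the training prefix
example_keys = [
    'tensorflow_datasets/train/part-0.tfrecord',
    'tensorflow_datasets/train/part-1.tfrecord',
    'tensorflow_datasets/train/part-2.tfrecord',
    'tensorflow_datasets/train/part-3.tfrecord',
]

two_instance_shards = shard_keys(example_keys, instance_count=2)
one_instance_shards = shard_keys(example_keys, instance_count=1)

for instance, shard in enumerate(two_instance_shards):
    print('instance {}: {}'.format(instance, shard))
```

Because this notebook trains on a single instance (`train_instance_count=1`), that instance receives all of the data either way; sharding only matters once the instance count is increased.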


# Set Up Metrics to Track Model Performance

A sample log line such as this one...
```
499/500 [=====>..] - ETA: 3s - root_mean_squared_error: 1.1198 - factorized_top_k/top_10_categorical_accuracy: 0.481 - factorized_top_k/top_50_categorical_accuracy: 0.607 - factorized_top_k/top_100_categorical_accuracy: 0.885
```
...produces the following metrics in CloudWatch:

`root_mean_squared_error` = 1.1198

`factorized_top_k/top_10_categorical_accuracy` = 0.481

`factorized_top_k/top_50_categorical_accuracy` = 0.607

`factorized_top_k/top_100_categorical_accuracy` = 0.885
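
For intuition about what these numbers measure: the root mean squared error summarizes how far predicted ratings are from the true ratings, and a top-k categorical accuracy is roughly the fraction of examples whose true movie appears among the k highest-scored candidates. The sketch below computes both on made-up toy data; it is an illustration of the definitions, not the TFRS implementation.

```python
import math


def root_mean_squared_error(true_ratings, predicted_ratings):
    """Square root of the mean squared difference between ratings."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def top_k_accuracy(examples, k):
    """Fraction of (scores_by_movie, true_movie) pairs with the true movie in the top k."""
    hits = 0
    for scores_by_movie, true_movie in examples:
        top_k = sorted(scores_by_movie, key=scores_by_movie.get, reverse=True)[:k]
        hits += true_movie in top_k
    return hits / len(examples)


# Hypothetical toy data
toy_rmse = root_mean_squared_error([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])

toy_scores = {'A': 0.9, 'B': 0.2, 'C': 0.5}
toy_examples = [
    (toy_scores, 'A'),  # true movie is ranked first
    (toy_scores, 'C'),  # true movie is ranked second
    (toy_scores, 'B'),  # true movie is ranked third
]
top_1 = top_k_accuracy(toy_examples, k=1)
top_2 = top_k_accuracy(toy_examples, k=2)

print(toy_rmse, top_1, top_2)
```

Lower is better for the error, higher is better for the accuracies, and the top-k accuracy can only grow as k grows.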

In [4]:
metrics_definitions = [    
     {'Name': 'root_mean_squared_error', 'Regex': 'root_mean_squared_error: ([0-9\\.]+)'},
     {'Name': 'top_10_categorical_accuracy', 'Regex': 'factorized_top_k/top_10_categorical_accuracy: ([0-9\\.]+)'},
     {'Name': 'top_50_categorical_accuracy', 'Regex': 'factorized_top_k/top_50_categorical_accuracy: ([0-9\\.]+)'},
     {'Name': 'top_100_categorical_accuracy', 'Regex': 'factorized_top_k/top_100_categorical_accuracy: ([0-9\\.]+)'}
]
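
SageMaker applies each `Regex` to the training job's log output and publishes the captured group as the metric value. Before launching a job, it can be worth checking the patterns locally. The sketch below runs the same patterns against the sample log line with Python's `re` module; the `extract_metrics` helper is our own illustration, not part of the SageMaker SDK.

```python
import re

sample_log_line = (
    '499/500 [=====>..] - ETA: 3s - root_mean_squared_error: 1.1198 '
    '- factorized_top_k/top_10_categorical_accuracy: 0.481 '
    '- factorized_top_k/top_50_categorical_accuracy: 0.607 '
    '- factorized_top_k/top_100_categorical_accuracy: 0.885'
)

# Same names and patterns as in metrics_definitions
local_metric_definitions = [
    {'Name': 'root_mean_squared_error', 'Regex': 'root_mean_squared_error: ([0-9\\.]+)'},
    {'Name': 'top_10_categorical_accuracy', 'Regex': 'factorized_top_k/top_10_categorical_accuracy: ([0-9\\.]+)'},
    {'Name': 'top_50_categorical_accuracy', 'Regex': 'factorized_top_k/top_50_categorical_accuracy: ([0-9\\.]+)'},
    {'Name': 'top_100_categorical_accuracy', 'Regex': 'factorized_top_k/top_100_categorical_accuracy: ([0-9\\.]+)'},
]


def extract_metrics(line, definitions):
    """Return {metric name: value} for every pattern that matches the line."""
    extracted = {}
    for definition in definitions:
        match = re.search(definition['Regex'], line)
        if match:
            extracted[definition['Name']] = float(match.group(1))
    return extracted


parsed_metrics = extract_metrics(sample_log_line, local_metric_definitions)
print(parsed_metrics)
```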

# Set Up Hyperparameters for the Recommender Model

In [5]:
epochs=10
learning_rate=0.00003
dataset_variant='100k' # movielens 100k, 1m, 20m, 25m, etc
embedding_dimension=32 # dimension (k) of our user and item embeddings
enable_tensorboard=True
train_instance_count=1
train_instance_type='ml.p3.2xlarge'

# Set Up Our TensorFlow Script to Run on SageMaker
We configure a `TensorFlow` estimator so that our training script, `train_multitask.py`, runs on the managed SageMaker service.

In [6]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train_multitask.py',
                       source_dir='src',
                       role=role,
                       instance_count=train_instance_count,
                       instance_type=train_instance_type,
                       py_version='py37',
                       framework_version='2.3.0',
                       hyperparameters={
                           'epochs': epochs,
                           'learning_rate': learning_rate,
                           'dataset_variant': dataset_variant,
                           'embedding_dimension': embedding_dimension,                           
                           'enable_tensorboard': enable_tensorboard
                       },
                       metric_definitions=metrics_definitions,
                       debugger_hook_config=False
            )
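
SageMaker script mode passes the `hyperparameters` dictionary to the entry point as command-line arguments. The `train_multitask.py` script is not shown in this notebook, so the following is only a sketch of how such a script might parse them with `argparse`. The defaults, the `str_to_bool` helper, and the example `--model_dir` value are assumptions.

```python
import argparse


def str_to_bool(value):
    """Booleans arrive as strings such as 'True', so convert them explicitly."""
    return str(value).lower() in ('true', '1', 'yes')


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--learning_rate', type=float, default=0.00003)
    parser.add_argument('--dataset_variant', type=str, default='100k')
    parser.add_argument('--embedding_dimension', type=int, default=32)
    parser.add_argument('--enable_tensorboard', type=str_to_bool, default=True)
    parser.add_argument('--model_dir', type=str, default=None)
    # Ignore any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


# Simulated command line; the model_dir value is a hypothetical placeholder
args = parse_args([
    '--epochs', '10',
    '--learning_rate', '3e-05',
    '--dataset_variant', '100k',
    '--embedding_dimension', '32',
    '--enable_tensorboard', 'True',
    '--model_dir', 's3://example-bucket/model',
])
print(args)
```

Parsing booleans through a helper matters here: `type=bool` would treat any non-empty string, including `'False'`, as true.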

# Create the Experiment

In [7]:
import time
from smexperiments.experiment import Experiment

timestamp = int(time.time())

recommender_experiment = Experiment.create(
                         experiment_name='MovieLens-Recommender-{}'.format(timestamp),
                         description='MovieLens Recommender', 
                         sagemaker_boto_client=sm)

recommender_experiment_name = recommender_experiment.experiment_name
print('Experiment name: {}'.format(recommender_experiment_name))

Experiment name: MovieLens-Recommender-1606694566


In [8]:
import time
from smexperiments.trial import Trial

timestamp = int(time.time())

trial_name = 'trial-{}-{}-{}-{}'.format(timestamp, epochs, dataset_variant, embedding_dimension)

trial = Trial.create(trial_name=trial_name,
                     experiment_name=recommender_experiment_name,
                     sagemaker_boto_client=sm)

trial_name = trial.trial_name
print('Trial name: {}'.format(trial_name))

Trial name: trial-1606694566-10-100k-32


In [9]:
recommender_experiment_config = {
    'ExperimentName': recommender_experiment_name,
    'TrialName': trial.trial_name,
    'TrialComponentDisplayName': 'train'
}

# Train the Model on SageMaker

In [10]:
estimator.fit(
              inputs={
                  'train': s3_input_train_data, 
              },              
              experiment_config=recommender_experiment_config,                   
              wait=False)

INFO:sagemaker:Creating training-job with name: tensorflow-training-2020-11-30-00-02-47-224


In [11]:
recommender_multitask_training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(recommender_multitask_training_job_name))

Training Job Name:  tensorflow-training-2020-11-30-00-02-47-224


In [12]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a></b>'.format(region, recommender_multitask_training_job_name)))


In [13]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a></b>'.format(region, recommender_multitask_training_job_name)))


In [14]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, recommender_multitask_training_job_name, region)))


# Wait for Training Job to Finish

In [15]:
%%time
    
estimator.latest_training_job.wait(logs=False)


2020-11-30 00:02:48 Starting - Starting the training job
2020-11-30 00:02:52 Starting - Launching requested ML instances.............
2020-11-30 00:04:04 Starting - Preparing the instances for training............
2020-11-30 00:05:07 Downloading - Downloading input data.............
2020-11-30 00:06:17 Training - Downloading the training image.......
2020-11-30 00:06:58 Training - Training image download completed. Training in progress.......................................
2020-11-30 00:10:13 Uploading - Uploading generated training model
2020-11-30 00:10:20 Completed - Training job completed
CPU times: user 440 ms, sys: 28.5 ms, total: 469 ms
Wall time: 7min 32s
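
The `wait()` call above blocks until the job reaches a terminal state. A roughly equivalent loop can be written against the boto3 `describe_training_job` call, which returns a `TrainingJobStatus` field. The sketch below is an illustration, not the SDK's implementation; `FakeSageMakerClient` is a hypothetical stand-in so that the loop can run without AWS credentials.

```python
import time

TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll the job status until it is terminal, then return that status."""
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response['TrainingJobStatus']
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('Training job {} did not finish in time'.format(job_name))


class FakeSageMakerClient:
    """Hypothetical stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_training_job(self, TrainingJobName):
        return {'TrainingJobStatus': next(self._statuses)}


fake_client = FakeSageMakerClient(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_training_job(
    fake_client,
    'tensorflow-training-2020-11-30-00-02-47-224',
    poll_seconds=0,
)
print(final_status)
```

In the notebook, the real client `sm` could be passed in place of the fake one.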


# Copy the Trained Model from S3

In [16]:
!aws s3 cp s3://$bucket/$recommender_multitask_training_job_name/output/model.tar.gz ./model.tar.gz

download: s3://sagemaker-us-east-1-835319576252/tensorflow-training-2020-11-30-00-02-47-224/output/model.tar.gz to ./model.tar.gz


In [17]:
!mkdir -p ./model/
!tar -xvzf ./model.tar.gz -C ./model/

code/
code/inference.py
tensorflow/
tensorflow/saved_model/
tensorflow/saved_model/0/
tensorflow/saved_model/0/saved_model.pb
tensorflow/saved_model/0/assets/
tensorflow/saved_model/0/variables/
tensorflow/saved_model/0/variables/variables.data-00000-of-00001
tensorflow/saved_model/0/variables/variables.index
tensorboard/
tensorboard/train/
tensorboard/train/events.out.tfevents.1606694851.ip-10-2-103-179.ec2.internal.33.545.v2
tensorboard/train/events.out.tfevents.1606694955.ip-10-2-103-179.ec2.internal.33.7808.v2
tensorboard/train/plugins/
tensorboard/train/plugins/profile/
tensorboard/train/plugins/profile/2020_11_30_00_09_17/
tensorboard/train/plugins/profile/2020_11_30_00_09_17/ip-10-2-103-179.ec2.internal.xplane.pb
tensorboard/train/plugins/profile/2020_11_30_00_09_17/ip-10-2-103-179.ec2.internal.kernel_stats.pb
tensorboard/train/plugins/profile/2020_11_30_00_09_17/ip-10-2-103-179.ec2.internal.tensorflow_stats.pb
tensorboard/train/plugins/profile/2020_11_30_00_09_17/ip-10-2-103-17
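
The archive can also be inspected from Python with the standard `tarfile` module, for example to locate the SavedModel directory programmatically. The sketch below is a minimal illustration: the `find_saved_model_dirs` helper is our own, and the demo builds a tiny placeholder archive that mimics the layout above rather than reading the real `model.tar.gz`.

```python
import io
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def find_saved_model_dirs(archive_path):
    """Return the directories in the archive that contain a saved_model.pb file."""
    with tarfile.open(str(archive_path), 'r:gz') as archive:
        return sorted(
            str(PurePosixPath(name).parent)
            for name in archive.getnames()
            if PurePosixPath(name).name == 'saved_model.pb'
        )


# Build a small stand-in archive with placeholder file contents
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_archive = Path(tmp_dir) / 'model.tar.gz'
    with tarfile.open(str(demo_archive), 'w:gz') as archive:
        for name in ('code/inference.py', 'tensorflow/saved_model/0/saved_model.pb'):
            payload = b'placeholder'
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    saved_model_dirs = find_saved_model_dirs(demo_archive)

print(saved_model_dirs)
```

Pointing the helper at `./model.tar.gz` should report the same `tensorflow/saved_model/0` directory that the next section inspects.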

# Inspect the Model

In [18]:
!saved_model_cli show --all --dir ./model/tensorflow/saved_model/0/

2020-11-30 00:10:26.902779: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-11-30 00:10:26.903864: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_1'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: serving_default_in

# Make a Sample Prediction

In [19]:
user_id = "42"

In [20]:
!saved_model_cli run --input_exprs 'input_1=np.array(["$user_id"])' --tag_set serve --signature_def serving_default --dir ./model/tensorflow/saved_model/0

2020-11-30 00:10:34.464143: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-11-30 00:10:34.464182: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-11-30 00:10:37.821908: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-11-30 00:10:37.827935: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-11-30 00:10:37.828553: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395): /proc/driver/nvidia/version does not exist
2020-11-30 00:10:37

# Show the Experiment Tracking Lineage

In [21]:
from sagemaker.analytics import ExperimentAnalytics

lineage_table = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=recommender_experiment_name,
    metric_names=[
        'root_mean_squared_error',
        'top_10_categorical_accuracy',
        'top_50_categorical_accuracy',
        'top_100_categorical_accuracy'
    ],
    sort_by="CreationTime",
    sort_order="Ascending",
)

lineage_df = lineage_table.dataframe()
lineage_df.shape

(1, 48)

In [22]:
lineage_df.columns

Index(['TrialComponentName', 'DisplayName', 'SourceArn', 'SageMaker.ImageUri',
       'SageMaker.InstanceCount', 'SageMaker.InstanceType',
       'SageMaker.VolumeSizeInGB', 'dataset_variant', 'embedding_dimension',
       'enable_tensorboard', 'epochs', 'learning_rate', 'model_dir',
       'sagemaker_container_log_level', 'sagemaker_job_name',
       'sagemaker_program', 'sagemaker_region', 'sagemaker_submit_directory',
       'top_100_categorical_accuracy - Min',
       'top_100_categorical_accuracy - Max',
       'top_100_categorical_accuracy - Avg',
       'top_100_categorical_accuracy - StdDev',
       'top_100_categorical_accuracy - Last',
       'top_100_categorical_accuracy - Count', 'root_mean_squared_error - Min',
       'root_mean_squared_error - Max', 'root_mean_squared_error - Avg',
       'root_mean_squared_error - StdDev', 'root_mean_squared_error - Last',
       'root_mean_squared_error - Count', 'top_10_categorical_accuracy - Min',
       'top_10_categorical_accuracy -

In [23]:
lineage_df

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,SageMaker.ImageUri,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,dataset_variant,embedding_dimension,enable_tensorboard,...,top_50_categorical_accuracy - Avg,top_50_categorical_accuracy - StdDev,top_50_categorical_accuracy - Last,top_50_categorical_accuracy - Count,train - MediaType,train - Value,SageMaker.ModelArtifact - MediaType,SageMaker.ModelArtifact - Value,Trials,Experiments
0,tensorflow-training-2020-11-30-00-02-47-224-aw...,train,arn:aws:sagemaker:us-east-1:835319576252:train...,763104351884.dkr.ecr.us-east-1.amazonaws.com/t...,1.0,ml.p3.2xlarge,30.0,"""100k""",32.0,True,...,0.031139,0.003402,0.0276,33,,s3://sagemaker-us-east-1-835319576252/tensorfl...,,s3://sagemaker-us-east-1-835319576252/tensorfl...,[trial-1606694566-10-100k-32],[MovieLens-Recommender-1606694566]
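
Each tracked metric appears in the table as several summary columns (`Min`, `Max`, `Avg`, `StdDev`, `Last`, `Count`). To compare trials, the ` - Last` columns are often the most useful because they hold the final reported value. The sketch below pulls those out of a plain dictionary built from the `top_50_categorical_accuracy` values shown in the row above; the `final_metrics` helper is our own illustration.

```python
def final_metrics(row):
    """Map metric name to its last reported value, based on the ' - Last' suffix."""
    suffix = ' - Last'
    return {
        column[:-len(suffix)]: value
        for column, value in row.items()
        if column.endswith(suffix)
    }


# Values copied from the displayed lineage row
lineage_row = {
    'top_50_categorical_accuracy - Avg': 0.031139,
    'top_50_categorical_accuracy - StdDev': 0.003402,
    'top_50_categorical_accuracy - Last': 0.0276,
    'top_50_categorical_accuracy - Count': 33,
}

last_values = final_metrics(lineage_row)
print(last_values)
```

On the real DataFrame, `lineage_df.filter(like=' - Last')` should select the same columns directly.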


In [24]:
sm.describe_trial_component(TrialComponentName=lineage_df.TrialComponentName[0])

{'TrialComponentName': 'tensorflow-training-2020-11-30-00-02-47-224-aws-training-job',
 'TrialComponentArn': 'arn:aws:sagemaker:us-east-1:835319576252:experiment-trial-component/tensorflow-training-2020-11-30-00-02-47-224-aws-training-job',
 'DisplayName': 'train',
 'Source': {'SourceArn': 'arn:aws:sagemaker:us-east-1:835319576252:training-job/tensorflow-training-2020-11-30-00-02-47-224',
  'SourceType': 'SageMakerTrainingJob'},
 'Status': {'PrimaryStatus': 'Completed',
  'Message': 'Status: Completed, secondary status: Completed, failure reason: .'},
 'StartTime': datetime.datetime(2020, 11, 30, 0, 5, 7, tzinfo=tzlocal()),
 'EndTime': datetime.datetime(2020, 11, 30, 0, 10, 20, tzinfo=tzlocal()),
 'CreationTime': datetime.datetime(2020, 11, 30, 0, 2, 49, 541000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:835319576252:user-profile/d-dsxoghy6ztwy/default-1602368497083',
  'UserProfileName': 'default-1602368497083',
  'DomainId': 'd-dsxoghy6ztwy'},
 '

# Pass Variables to the Next Notebook(s)

In [25]:
%store recommender_multitask_training_job_name

Stored 'recommender_multitask_training_job_name' (str)
