# Using TensorFlow Scripts in SageMaker - Quickstart

Starting with TensorFlow version 1.11, you can use SageMaker's TensorFlow containers to train TensorFlow scripts the same way you would train outside SageMaker. This feature is named **Script Mode**. 

This example uses 
[Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow). 
You can use the same technique for other scripts or repositories, including 
[TensorFlow Model Zoo](https://github.com/tensorflow/models) and 
[TensorFlow benchmark scripts](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks).

## Test locally using SageMaker Python SDK TensorFlow Estimator

You can use the SageMaker Python SDK [`TensorFlow`](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/README.rst#training-with-tensorflow) estimator to easily train locally and in SageMaker. 

Let's start by setting the training script arguments as hyperparameters, here `train_steps` and `model_name`. We don't need to provide `--model_dir`, because it is passed to the script automatically:

In [1]:
from sagemaker import get_execution_role

role = get_execution_role()

In [5]:
hyperparameters = {'train_steps': 10, 'model_name': 'DeepFM'}

This notebook shows how to use the SageMaker Python SDK to run your code in a local container before deploying to SageMaker's managed training or hosting environments. Just change your estimator's `train_instance_type` to `local` or `local_gpu`. For more information, see: https://github.com/aws/sagemaker-python-sdk#local-mode.

To use this feature, you need to install docker-compose (and nvidia-docker if training with a GPU). Running the following script installs docker-compose or nvidia-docker-compose and configures the notebook environment for you.

Note that you can only run a single local notebook at a time.

In [6]:
!/bin/bash ./utils/setup.sh

The user has root access.
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


To train locally, you set `train_instance_type` to [local](https://github.com/aws/sagemaker-python-sdk#local-mode):

In [7]:
import subprocess

train_instance_type='local'

if subprocess.call('nvidia-smi') == 0:
    ## Set type to GPU if one is present
    train_instance_type = 'local_gpu'
    
print("Train instance type = " + train_instance_type)

Train instance type = local
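The check above raises `FileNotFoundError` on machines where the `nvidia-smi` executable is not installed at all. A slightly more defensive variant is sketched below. The helper name `detect_local_instance_type` is hypothetical and not part of the SageMaker SDK; the sketch assumes that a zero exit code from `nvidia-smi` means a usable GPU is present.

```python
import shutil
import subprocess


def detect_local_instance_type() -> str:
    """Return 'local_gpu' if nvidia-smi exists and exits with 0, else 'local'."""
    # shutil.which returns None when the executable is not on PATH, which
    # avoids the FileNotFoundError that subprocess would otherwise raise.
    if shutil.which("nvidia-smi") is None:
        return "local"
    result = subprocess.run(
        ["nvidia-smi"],
        stdout=subprocess.DEVNULL,  # hide the GPU table in notebook output
        stderr=subprocess.DEVNULL,
        check=False,                # a non-zero exit code is handled below
    )
    return "local_gpu" if result.returncode == 0 else "local"


train_instance_type = detect_local_instance_type()
print("Train instance type = " + train_instance_type)
```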


We create the `TensorFlow` estimator with the flag `script_mode=True`. When Git integration is used (through the `git_config` argument, shown in the Git Support section), `source_dir` should be a relative path inside the Git repo; otherwise it should be a relative or absolute local path. The `TensorFlow` estimator is created as follows:


In [8]:
import os

import sagemaker
from sagemaker.tensorflow import TensorFlow


estimator = TensorFlow(entry_point='train_estimator.py',
                       source_dir='.',
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=role,
                       framework_version='2.2.0',
                       py_version='py37',
                       script_mode=True,
                       model_dir='/opt/ml/model')

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
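The warnings above are emitted by version 2 of the SageMaker Python SDK, which renamed `train_instance_type` to `instance_type` and `train_instance_count` to `instance_count` (see the linked migration guide). As a rough sketch, the renaming can be expressed as a plain dictionary translation. `to_v2_kwargs` is a hypothetical helper written for this illustration, not an SDK function, and it covers only the two renames reported in the warnings.

```python
# Renames reported by the deprecation warnings (sagemaker>=2).
V2_RENAMES = {
    "train_instance_type": "instance_type",
    "train_instance_count": "instance_count",
}


def to_v2_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with v1 argument names replaced by v2 names."""
    return {V2_RENAMES.get(name, name): value for name, value in kwargs.items()}


# A shortened set of the estimator arguments used in this notebook.
v1_kwargs = {
    "entry_point": "train_estimator.py",
    "train_instance_type": "local",
    "train_instance_count": 1,
}
v2_kwargs = to_v2_kwargs(v1_kwargs)
print(v2_kwargs)
```

The translated dictionary could then be unpacked into the estimator constructor with `TensorFlow(**v2_kwargs, ...)` when running against SDK v2.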


To start a training job, we call `estimator.fit(inputs)`, where `inputs` is a dictionary whose keys, named **channels**, have values pointing to the data location. `estimator.fit(inputs)` downloads the TensorFlow container (TensorFlow with Python 3, CPU version) locally and simulates a SageMaker training job.
When training starts, the TensorFlow container executes the entry point script (**train_estimator.py** in this notebook), passing `hyperparameters` and `model_dir` as script arguments. For a script named **train.py** that accepts `--num-epochs` and `--data_dir`, the resulting command looks as follows:
```bash
python -m train --num-epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model
```
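To make the argument passing concrete, the sketch below flattens the `hyperparameters` dictionary from this notebook into command-line tokens and parses them with `argparse`, roughly as a training script would. Both helper functions are hypothetical, the argument names in the parser are assumptions about what `train_estimator.py` accepts, and the exact ordering and formatting used by the container may differ.

```python
import argparse


def hyperparameters_to_args(hyperparameters: dict, model_dir: str) -> list:
    """Flatten a hyperparameter dict into '--key value' command-line tokens."""
    args = []
    for name, value in hyperparameters.items():
        args += ["--" + name, str(value)]
    return args + ["--model_dir", model_dir]


def parse_args(argv: list) -> argparse.Namespace:
    """Parse the arguments the way a training script might (assumed names)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_steps", type=int, default=1)
    parser.add_argument("--model_name", type=str, default="DeepFM")
    parser.add_argument("--model_dir", type=str, default="/opt/ml/model")
    # parse_known_args tolerates extra arguments instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


argv = hyperparameters_to_args(
    {"train_steps": 10, "model_name": "DeepFM"}, "/opt/ml/model"
)
args = parse_args(argv)
print(argv)
print(args.train_steps, args.model_name, args.model_dir)
```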


In [9]:
inputs = {'training': 'file:///home/ec2-user/SageMaker/deepctr_sagemaker/data/'}

estimator.fit(inputs)

Creating tmpnkeyfy1i_algo-1-hvde6_1 ... done
Attaching to tmpnkeyfy1i_algo-1-hvde6_1
algo-1-hvde6_1  | 2021-01-19 11:09:52.402282: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:425] Initializing the SageMaker Profiler.
algo-1-hvde6_1  | 2021-01-19 11:09:52.402408: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:106] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
algo-1-hvde6_1  | 2021-01-19 11:09:52.421586: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:425] Initializing the SageMaker Profiler.
algo-1-hvde6_1  | 2021-01-19 11:09:53,746 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
algo-1-hvde6_1  | 2021-01-19 11:09:53,753 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-hvde6_1  | 2021-01-19 11:09:53,941 sagemaker-training-t

algo-1-hvde6_1  | Collecting tensorboard~=2.4
algo-1-hvde6_1  |   Downloading tensorboard-2.4.1-py3-none-any.whl (10.6 MB)
algo-1-hvde6_1  | Collecting typing-extensions~=3.7.4
algo-1-hvde6_1  |   Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
algo-1-hvde6_1  | Collecting tensorflow-estimator<2.5.0,>=2.4.0rc0
algo-1-hvde6_1  |   Downloading tensorflow_estimator-2.4.0-py2.py3-none-any.whl (462 kB)
algo-1-hvde6_1  | Collecting flatbuffers~=1.12.0
algo-1-hvde6_1  |   Downloading flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
algo-1-hvde6_1  | Collecting absl-py~=0.10
algo-1-hvde6_1  |   Downloading absl_py-0.11.0-py3-none-any.whl (127 kB)
algo-1-hvde6_1  | I

Let's explain the values of `--data_dir` and `--model_dir` in more detail:

- **/opt/ml/input/data/training** is the directory inside the container where the training data is downloaded. The data is downloaded to this folder because `training` is the channel name defined in `estimator.fit({'training': inputs})`. See [training data](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-trainingdata) for more information. 

- **/opt/ml/model** is the directory where your script should save models, checkpoints, or any other data. Any data saved in this folder is uploaded to the S3 bucket defined for training. See [model data](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-envvariables) for more information.
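The relationship between channel names and container directories can be sketched as a small helper. `channel_dir` is a hypothetical function written for this illustration, and the `validation` channel and bucket names are placeholders that do not appear in this notebook.

```python
import posixpath

# Root directory for input channels inside a SageMaker training container.
INPUT_DATA_ROOT = "/opt/ml/input/data"


def channel_dir(channel_name: str) -> str:
    """Return the container directory that receives a channel's data."""
    return posixpath.join(INPUT_DATA_ROOT, channel_name)


# 'validation' is a hypothetical second channel, added only for illustration.
inputs = {
    "training": "s3://my-bucket/train",         # placeholder location
    "validation": "s3://my-bucket/validation",  # placeholder location
}
for name in inputs:
    print(name, "->", channel_dir(name))
```

Each key of the dictionary passed to `estimator.fit` therefore determines the folder name the script reads from.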

### Reading additional information from the container

Often, a user script needs additional information from the container that is not available in `hyperparameters`.
SageMaker containers write this information as **environment variables** that are available inside the script.

For example, the example above can read information about the `training` channel provided in the training job request by adding the environment variable `SM_CHANNEL_TRAINING` as the default value for the `--data_dir` argument:

```python
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # read the location of the `training` input channel from the environment
    parser.add_argument('--data_dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
```

Script mode displays the list of available environment variables in the training logs. You can find the [entire list here](https://github.com/aws/sagemaker-containers/blob/master/README.rst#list-of-provided-environment-variables-by-sagemaker-containers).
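Indexing `os.environ` directly raises `KeyError` when the script runs outside a SageMaker container. A minimal sketch that falls back to local paths is shown below. `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR` are taken from the environment variable list linked above, while `build_parser` and the fallback paths `./data` and `./model` are assumptions made for this example.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose defaults come from SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    # os.environ.get falls back to a local path when the variable is unset,
    # so the same script also starts outside a SageMaker container.
    parser.add_argument("--data_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    return parser


if __name__ == "__main__":
    # parse_known_args ignores arguments this parser does not define.
    args, _unknown = build_parser().parse_known_args()
    print(args.data_dir, args.model_dir)
```

Explicit command-line values still take precedence over both the environment variables and the fallbacks.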

## Training in SageMaker

After you test the training job locally, upload the dataset to an S3 bucket so SageMaker can access the data during training:

In [10]:
import sagemaker

inputs = sagemaker.Session().upload_data(path='/home/ec2-user/SageMaker/deepctr_sagemaker/data', key_prefix='DEMO-tensorflow-deepctr')
print(inputs)

s3://sagemaker-us-east-1-579019700964/DEMO-tensorflow-deepctr


The returned variable `inputs` above is a string with an S3 location that SageMaker Training has permission
to read data from.
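If the bucket and key prefix are needed separately, for example to list the uploaded files, the URI can be split with the standard library. This is a small sketch; `split_s3_uri` is a hypothetical helper, and the URI is the one printed in the output above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: " + uri)
    # The path component starts with '/', which is not part of the key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri(
    "s3://sagemaker-us-east-1-579019700964/DEMO-tensorflow-deepctr"
)
print(bucket, prefix)
```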

To train in SageMaker:
- Change the estimator argument `train_instance_type` to any SageMaker ML instance type available for training.
- Set the `training` channel to an S3 location.

In [11]:
estimator = TensorFlow(entry_point='train_estimator.py',
                       source_dir='.',
                       train_instance_type='ml.p3.2xlarge', # Executes training in a ml.p2.xlarge/ml.p3.2xlarge/ml.p3.8xlarge instance
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=role,
                       framework_version='2.2.0',
                       py_version='py37',
                       script_mode=True,
                       model_dir='/opt/ml/model')

estimator.fit({'training': inputs})

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-01-19 11:10:36 Starting - Starting the training job...
2021-01-19 11:11:00 Starting - Launching requested ML instancesProfilerReport-1611054636: InProgress
......
2021-01-19 11:12:02 Starting - Preparing the instances for training.........
2021-01-19 11:13:23 Downloading - Downloading input data
2021-01-19 11:13:23 Training - Downloading the training image.........
2021-01-19 11:15:05 Training - Training image download completed. Training in progress..
2021-01-19 11:15:03,455 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2021-01-19 11:15:03,893 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/local/bin/python3.7 -m pip install -r requirements.txt
Collecting deepctr[gpu]
  Downloading deepctr-0.8.3-py3-none-any.whl (114 kB)
Installing collected packages: deepctr
Successfully installed deepctr-0.8.3
2021-01-19 11:15:05,644 sagemaker-traini

## Git Support

In [44]:
git_config = {'repo': 'https://github.com/whn09/deepctr_sagemaker.git', 'branch': 'main'}

estimator = TensorFlow(entry_point='train.py',
                       source_dir='.',
                       git_config=git_config,
                       train_instance_type='ml.p3.2xlarge', # Executes training in a ml.p3.2xlarge instance
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=role,
                       framework_version='2.2.0',
                       py_version='py37',
                       script_mode=True,
                       model_dir='/opt/ml/model')

estimator.fit({'training': inputs})

train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-01-09 18:38:55 Starting - Starting the training job...
2021-01-09 18:39:19 Starting - Launching requested ML instancesProfilerReport-1610217534: InProgress
......
2021-01-09 18:40:25 Starting - Preparing the instances for training.........
2021-01-09 18:41:49 Downloading - Downloading input data
2021-01-09 18:41:49 Training - Downloading the training image.........
2021-01-09 18:43:22 Training - Training image download completed. Training in progress..
2021-01-09 18:43:20,787 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2021-01-09 18:43:21,167 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/local/bin/python3.7 -m pip install -r requirements.txt
Collecting deepctr[gpu]
  Downloading deepctr-0.8.3-py3-none-any.whl (114 kB)
Installing collected packages: deepctr
Successfully installed deepctr-0.8.3
2021-01-09 18:43:22,774 sagemaker-traini

## Deploy the trained model to an endpoint

The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time. Because we trained with script mode, we use the TensorFlow Serving container for the endpoint. This serving container runs a web server that is compatible with the SageMaker hosting protocol. The "Using your own inference code" document explains how SageMaker runs inference containers.

In [13]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


-----------------!

## Invoke the endpoint

Let's use a few hard-coded sample records, with the same feature names as the training data, as input for inference.

In [34]:
import json

def test_REST_serving():
    '''
    Send five sample feature records to the endpoint through the REST API and return its predictions.
    '''
    fea_dict1 = {'I1':[0.0],'I2':[0.001332],'I3':[0.092362],'I4':[0.0],'I5':[0.034825],'I6':[0.0],'I7':[0.0],'I8':[0.673468],'I9':[0.0],'I10':[0.0],'I11':[0.0],'I12':[0.0],'I13':[0.0],'C1':[0],'C2':[4],'C3':[96],'C4':[146],'C5':[1],'C6':[4],'C7':[163],'C8':[1],'C9':[1],'C10':[72],'C11':[117],'C12':[127],'C13':[157],'C14':[7],'C15':[127],'C16':[126],'C17':[8],'C18':[66],'C19':[0],'C20':[0],'C21':[3],'C22':[0],'C23':[1],'C24':[96],'C25':[0],'C26':[0]}
    fea_dict2 = {'I1':[0.0],'I2':[0.0],'I3':[0.00675],'I4':[0.402298],'I5':[0.059628],'I6':[0.117284],'I7':[0.003322],'I8':[0.714284],'I9':[0.154739],'I10':[0.0],'I11':[0.03125],'I12':[0.0],'I13':[0.343137],'C1':[11],'C2':[1],'C3':[98],'C4':[98],'C5':[1],'C6':[6],'C7':[179],'C8':[0],'C9':[1],'C10':[89],'C11':[58],'C12':[97],'C13':[79],'C14':[7],'C15':[72],'C16':[26],'C17':[7],'C18':[52],'C19':[0],'C20':[0],'C21':[47],'C22':[0],'C23':[7],'C24':[112],'C25':[0],'C26':[0]}
    fea_dict3 = {'I1':[0.0],'I2':[0.000333],'I3':[0.00071],'I4':[0.137931],'I5':[0.003968],'I6':[0.077873],'I7':[0.019934],'I8':[0.714284],'I9':[0.505803],'I10':[0.0],'I11':[0.09375],'I12':[0.0],'I13':[0.17647],'C1':[0],'C2':[18],'C3':[39],'C4':[52],'C5':[3],'C6':[4],'C7':[140],'C8':[2],'C9':[1],'C10':[93],'C11':[31],'C12':[122],'C13':[16],'C14':[7],'C15':[129],'C16':[97],'C17':[8],'C18':[49],'C19':[0],'C20':[0],'C21':[25],'C22':[0],'C23':[6],'C24':[53],'C25':[0],'C26':[0]}
    fea_dict4 = {'I1':[0.0],'I2':[0.004664],'I3':[0.000355],'I4':[0.045977],'I5':[0.033185],'I6':[0.094967],'I7':[0.016611],'I8':[0.081632],'I9':[0.028046],'I10':[0.0],'I11':[0.0625],'I12':[0.0],'I13':[0.039216],'C1':[0],'C2':[45],'C3':[7],'C4':[117],'C5':[1],'C6':[0],'C7':[164],'C8':[1],'C9':[0],'C10':[20],'C11':[61],'C12':[104],'C13':[36],'C14':[1],'C15':[43],'C16':[43],'C17':[8],'C18':[37],'C19':[0],'C20':[0],'C21':[156],'C22':[0],'C23':[0],'C24':[32],'C25':[0],'C26':[0]}
    fea_dict5 = {'I1':[0.0],'I2':[0.000333],'I3':[0.036945],'I4':[0.310344],'I5':[0.003922],'I6':[0.067426],'I7':[0.013289],'I8':[0.65306],'I9':[0.035783],'I10':[0.0],'I11':[0.03125],'I12':[0.0],'I13':[0.264706],'C1':[0],'C2':[11],'C3':[59],'C4':[77],'C5':[1],'C6':[5],'C7':[18],'C8':[1],'C9':[1],'C10':[45],'C11':[171],'C12':[162],'C13':[96],'C14':[4],'C15':[36],'C16':[121],'C17':[8],'C18':[14],'C19':[5],'C20':[3],'C21':[9],'C22':[0],'C23':[0],'C24':[5],'C25':[1],'C26':[47]}

    data = {"instances": [fea_dict1,fea_dict2,fea_dict3,fea_dict4,fea_dict5]}
    # print(data)

    json_response = predictor.predict(data)
    predictions = json_response['predictions']
    # print(predictions)
    return predictions
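The request body used above follows the `{"instances": [...]}` layout, where every feature value is wrapped in a one-element list. When records are available as flat dictionaries, that wrapping can be automated. The helpers `to_instance` and `build_request` below are hypothetical, and the record is shortened to four of the features for readability.

```python
import json


def to_instance(record: dict) -> dict:
    """Wrap every scalar feature value in a one-element list."""
    return {name: [value] for name, value in record.items()}


def build_request(records: list) -> dict:
    """Build a request body in the {"instances": [...]} layout used above."""
    return {"instances": [to_instance(record) for record in records]}


# Shortened example record; a real request needs every feature (I1..I13, C1..C26).
records = [{"I1": 0.0, "I2": 0.001332, "C1": 0, "C2": 4}]
payload = build_request(records)
print(json.dumps(payload))
```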

In [35]:
test_REST_serving()

[{'logits': [-5.87468815], 'pred': [0.00280179805]},
 {'logits': [-7.26665306], 'pred': [0.000697958225]},
 {'logits': [0.18031919], 'pred': [0.544958055]},
 {'logits': [-2.74384737], 'pred': [0.0604350679]},
 {'logits': [-5.67439556], 'pred': [0.003421]}]

## Delete the endpoint

Let's delete the endpoint we just created to prevent incurring any extra costs.

In [36]:
sagemaker.Session().delete_endpoint(predictor.endpoint)

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
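The warning above refers to the v2 attribute `endpoint_name`, which replaces `endpoint`. Code that has to run against both SDK versions can look up whichever attribute is present. `resolve_endpoint_name` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins used only to illustrate it; a real predictor comes from `estimator.deploy()`.

```python
from types import SimpleNamespace


def resolve_endpoint_name(predictor) -> str:
    """Prefer the v2 attribute `endpoint_name`, fall back to v1 `endpoint`."""
    name = getattr(predictor, "endpoint_name", None)
    return name if name is not None else predictor.endpoint


# Stand-in objects for illustration only, with a made-up endpoint name.
v2_like = SimpleNamespace(endpoint_name="demo-endpoint")
v1_like = SimpleNamespace(endpoint="demo-endpoint")
print(resolve_endpoint_name(v2_like), resolve_endpoint_name(v1_like))
```

With a real predictor, the result could be passed to `sagemaker.Session().delete_endpoint(...)` as in the cell above.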
