This notebook introduces development using TensorFlow on SageMaker. It starts with some "hello world"-style code examples running in local mode, and progresses to a full TensorFlow tutorial example running on SageMaker Training in the cloud.

Training code

The first example script we'll look at demonstrates how to access information from the SageMaker environment, as well as parameters specified at runtime, from within your script. Here is the code:

```python
import tensorflow as tf
import os
import json
import argparse


if __name__ == '__main__':
    # Training environment metadata is available from environment variables.
    # It holds more information than this example uses, such as the number of GPUs and CPUs.
    # Once loaded from JSON it is a regular Python dict, so it is easy to work with.
    train_env = json.loads(os.environ['SM_TRAINING_ENV'])
    print(train_env)

    # You can read script parameters either from SM_TRAINING_ENV or by using argparse.
    # Both approaches are demonstrated here.
    foo = float(train_env['hyperparameters']['foo'])

    parser = argparse.ArgumentParser()
    parser.add_argument('--bar', type=float)
    args, unknown = parser.parse_known_args()
    bar = args.bar

    # Load training data file. SageMaker downloads it from S3 for you, and makes it
    # accessible from the local file system.
    train_data_dir = train_env['channel_input_dirs']['training']
    data_file = os.path.join(train_data_dir, 'data.txt')
    with open(data_file, 'r') as f:
        print(f.readlines())

    # TODO: output artifacts?

    # Use script parameters to do a trivial TensorFlow operation using eager execution,
    # similar to the example at: https://www.tensorflow.org/guide/eager#setup_and_basic_usage
    tf.enable_eager_execution()

    m = tf.matmul([[foo]], [[bar]])
    print("hello, {}".format(m))  # with foo=3 and bar=4, m is a 1x1 tensor holding 12.0
```

You can modify this script by opening the file located in the same directory as this notebook through the Jupyter UI.
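To try the parameter-reading logic without starting a container, you can fake the environment. The sketch below is not part of the notebook's script: `read_script_inputs` is a hypothetical helper that mirrors the script's steps (minus TensorFlow), the environment JSON is a hand-written subset of what SageMaker provides, and the `--foo 3 --bar 4` argument list is an assumption about how the hyperparameters arrive on the command line.

```python
import argparse
import json
import os
import tempfile


def read_script_inputs(argv):
    """Mirror the parameter-reading steps of the script above, without TensorFlow."""
    train_env = json.loads(os.environ['SM_TRAINING_ENV'])

    # Hyperparameter read straight from the environment metadata.
    foo = float(train_env['hyperparameters']['foo'])

    # Hyperparameter read from the command line; undeclared flags land in `unknown`.
    parser = argparse.ArgumentParser()
    parser.add_argument('--bar', type=float)
    args, unknown = parser.parse_known_args(argv)

    # Read the file from the 'training' channel directory.
    data_file = os.path.join(train_env['channel_input_dirs']['training'], 'data.txt')
    with open(data_file, 'r') as f:
        lines = f.readlines()

    return foo, args.bar, unknown, lines


# Build a stand-in environment: a temp directory plays the role of the channel directory.
with tempfile.TemporaryDirectory() as channel_dir:
    with open(os.path.join(channel_dir, 'data.txt'), 'w') as f:
        f.write('first line\nsecond line\n')

    # Hand-written subset of the metadata; the real variable carries many more keys.
    os.environ['SM_TRAINING_ENV'] = json.dumps({
        'hyperparameters': {'foo': '3', 'bar': '4'},
        'channel_input_dirs': {'training': channel_dir},
    })

    # Assumed argument format; only --bar is declared, so --foo is left unparsed.
    foo, bar, unknown, lines = read_script_inputs(['--foo', '3', '--bar', '4'])

print(foo, bar, unknown, lines)
```

Using `parse_known_args` rather than `parse_args` is what lets the script declare only `--bar` without failing on arguments it does not recognize.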

Control flow code

The code below uses the TensorFlow class from the SageMaker Python SDK to configure and manage the training environment. The actual training code, contained in 'hello.py', will be executed in a container running the official SageMaker TensorFlow image. Note that because the 'train_instance_type' parameter is 'local', the container runs locally rather than on SageMaker's cloud infrastructure.

This approach may seem cumbersome for initial development, and indeed it may be easier at first to iterate on your training code directly in the notebook. However, the setup shown here lets you verify that your script integrates with the SageMaker training environment while still iterating quickly on your local machine. Once you're ready, moving to the cloud is as simple as changing a few parameters.

In [32]:
from sagemaker.tensorflow import TensorFlow
import sagemaker

tfe = TensorFlow(entry_point='hello.py', role='SageMakerRole',
                #image_name='826912895975.dkr.ecr.us-west-2.amazonaws.com/preprod-tensorflow:scriptmode',
                image_name='tf-scriptmode',
                train_instance_type='local', train_instance_count=1,
                hyperparameters={'foo': '3', 'bar': '4'})

# TODO: remove S3 usage for local stuff
sess = sagemaker.Session()
s3_data = sess.upload_data('data.txt', key_prefix='')
tfe.fit({'training': s3_data})

INFO:sagemaker:Creating training-job with name: tf-scriptmode-2018-10-02-18-05-31-608


Creating tmp04zy0_qu_algo-1-3KW1K_1 ... done
Attaching to tmp04zy0_qu_algo-1-3KW1K_1
algo-1-3KW1K_1  | 2018-10-02 18:05:36,683 sagemaker-containers INFO     Imported framework tf_container.training
algo-1-3KW1K_1  | 2018-10-02 18:05:36,693 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-3KW1K_1  | 2018-10-02 18:05:37,262 sagemaker-containers INFO     Module hello does not provide a setup.py.
algo-1-3KW1K_1  | Generating setup.py
algo-1-3KW1K_1  | 2018-10-02 18:05:37,262 sagemaker-containers INFO     Generating setup.cfg
algo-1-3KW1K_1  | 2018-10-02 18:05:37,263 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-3KW1K_1  | 2018-10-02 18:05:37,263 sagemaker-containers INFO     Installing module with the following command:
algo-1-3KW1K_1  | /usr/bin/python -m pip install -vvv -U .
algo-1-3KW1K_1  | Created temporary directory: /tmp/pip-ephem-wheel-cache

algo-1-3KW1K_1  | Installing collected packages: hello
algo-1-3KW1K_1  |
algo-1-3KW1K_1  |   Removing source in /tmp/pip-install-HiLj0m/hello
algo-1-3KW1K_1  | Successfully installed hello-1.0.0
algo-1-3KW1K_1  | Cleaning up...
algo-1-3KW1K_1  | Removed build tracker '/tmp/pip-req-tracker-_d5x5w'
algo-1-3KW1K_1  | 2018-10-02 18:05:38,956 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-3KW1K_1  | 2018-10-02 18:05:38,975 sagemaker-containers INFO     Invoking user script
algo-1-3KW1K_1  |
algo-1-3KW1K_1  | Training Env:
algo-1-3KW1K_1  |
algo-1-3KW1K_1  | {
algo-1-3KW1K_1  |     "network_interface_name": "ethwe",
algo-1-3KW1K_1  |     "log_level": 20,
algo-1-3KW1K_1  |     "model_dir": "/opt/ml/model",
algo-1-3KW1K_1  |     "num_gpus": 0,
algo-1-3KW1K_1  |     "channel_input_dirs": {
algo-1

You can see the SageMaker environment information printed out, as well as the result of our TensorFlow computation.
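The printed environment includes a `model_dir` entry (`/opt/ml/model` in the log above), which is one possible answer to the script's open question about output artifacts. The sketch below assumes the usual SageMaker convention that files written to that directory are packaged and uploaded when the job finishes. The helper `save_result` and the file name `result.json` are hypothetical, and a temporary directory stands in for the real path so the example can run anywhere.

```python
import json
import os
import tempfile


def save_result(train_env, result):
    """Write a small JSON artifact into the directory named by 'model_dir'."""
    model_dir = train_env['model_dir']
    os.makedirs(model_dir, exist_ok=True)
    out_path = os.path.join(model_dir, 'result.json')  # hypothetical file name
    with open(out_path, 'w') as f:
        json.dump({'product': result}, f)
    return out_path


# Stand-in for the parsed SM_TRAINING_ENV; in the container, model_dir is /opt/ml/model.
with tempfile.TemporaryDirectory() as tmp:
    fake_env = {'model_dir': os.path.join(tmp, 'model')}
    path = save_result(fake_env, 12.0)
    with open(path) as f:
        saved = json.load(f)

print(saved)
```

Inside the training script, the same helper could be called with the real `train_env` dict after the TensorFlow computation.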

Running on the SageMaker Cloud

Once you have a script working to your satisfaction in local mode, change the train_instance_type parameter to run it on SageMaker's infrastructure; in the cell below, image_name also changes, pointing at the image in Amazon ECR instead of the local 'tf-scriptmode' image. The SageMaker Python SDK creates a training job for you behind the scenes. Once the instances are launched and your script has started running, its output is streamed in real time below.

In [35]:
tfe = TensorFlow(entry_point='hello.py', role='SageMakerRole',
                image_name='826912895975.dkr.ecr.us-west-2.amazonaws.com/preprod-tensorflow:scriptmode',
                train_instance_type='ml.c4.xlarge',  # changed from 'local'
                train_instance_count=1,
                hyperparameters={'foo': '3', 'bar': '4'})

tfe.fit({'training': s3_data})

INFO:sagemaker:Creating training-job with name: preprod-tensorflow-2018-10-02-18-07-58-250


2018-10-02 18:08:01 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-02 18:09:42 Downloading - Downloading input data
2018-10-02 18:09:48 Training - Downloading the training image...
2018-10-02 18:10:35 Uploading - Uploading generated training model
2018-10-02 18:10:40 Completed - Training job completed

2018-10-02 18:10:28,985 sagemaker-containers INFO     Imported framework tf_container.training
2018-10-02 18:10:28,988 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2018-10-02 18:10:29,374 sagemaker-containers INFO     Module hello does not provide a setup.py.
Generating setup.py
2018-10-02 18:10:29,375 sagemaker-containers INFO     Generating setup.cfg
2018-10-02 18:10:29,375 sagemaker-containers INFO     Generating MANIFEST.in
2018-10-02 18:10:29,375 sagemaker-containers INFO     Installing module with the following c

Billable seconds: 59
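Since the local and cloud runs differ in only a couple of estimator arguments, one way to keep them in sync is to build the arguments in one place. This is a minimal sketch, not something the notebook itself does: `estimator_kwargs` is a hypothetical helper, and `'<ecr-image-uri>'` is a placeholder for whichever image you use in the cloud.

```python
def estimator_kwargs(run_locally, local_image, cloud_image):
    """Return estimator keyword arguments for either local mode or the cloud."""
    kwargs = {
        'entry_point': 'hello.py',
        'role': 'SageMakerRole',
        'train_instance_count': 1,
        'hyperparameters': {'foo': '3', 'bar': '4'},
    }
    if run_locally:
        kwargs.update(image_name=local_image, train_instance_type='local')
    else:
        kwargs.update(image_name=cloud_image, train_instance_type='ml.c4.xlarge')
    return kwargs


local_cfg = estimator_kwargs(True, 'tf-scriptmode', '<ecr-image-uri>')
cloud_cfg = estimator_kwargs(False, 'tf-scriptmode', '<ecr-image-uri>')

# Collect the keys whose values differ between the two configurations.
changed = sorted(k for k in local_cfg if local_cfg[k] != cloud_cfg[k])
print(changed)
```

With the SDK version used in this notebook, the result could be passed along as `TensorFlow(**estimator_kwargs(...))`, so that switching environments means flipping a single flag.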
