# Gluon CIFAR-10 Trained in Local Mode
_**ResNet model in Gluon trained locally in a notebook instance**_

---

_This notebook was created and tested on an ml.p3.8xlarge notebook instance._

## Setup

Import the SageMaker libraries and retrieve the IAM execution role ARN.

In [1]:
import sagemaker
from sagemaker.mxnet import MXNet

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

Install pre-requisites for local training.

In [2]:
!/bin/bash setup.sh

The user has root access.
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


---

## Data

We use the helper scripts to download CIFAR-10 training data and sample images.

In [4]:
from cifar10_utils import download_training_data
download_training_data()

downloading training data...
done


We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies that location; we will use it later when we start the training job.

Even though we are training within our notebook instance, we'll continue to use the S3 data location since it will allow us to easily transition to training in SageMaker's managed environment.

In [5]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-gluon-cifar10')
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-west-2-272949293984/data/DEMO-gluon-cifar10
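The printed input spec is a plain S3 URI made up of a bucket name followed by the key prefix passed to `upload_data`. As a minimal sketch, the URI can be split back into those two parts with the standard library. The helper name `split_s3_uri` is hypothetical and is not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key_prefix).

    Hypothetical helper for illustration; not a SageMaker SDK function.
    """
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: {}'.format(uri))
    # netloc holds the bucket; path holds the key with a leading slash
    return parsed.netloc, parsed.path.lstrip('/')


bucket, prefix = split_s3_uri(
    's3://sagemaker-us-west-2-272949293984/data/DEMO-gluon-cifar10')
print('bucket: {}'.format(bucket))
print('key prefix: {}'.format(prefix))
```

The key prefix recovered here is the same `key_prefix` value that was passed to `upload_data`.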


---

## Script

We need to provide a training script that can run on the SageMaker platform. When SageMaker calls the script's `train` function, it passes in arguments that describe the training environment. The `train` function checks the validation accuracy at the end of every epoch and checkpoints the best model so far, along with the optimizer state, in the folder `/opt/ml/checkpoints` if that folder exists; otherwise it skips checkpointing. Check the script below to see how this works.

The network itself is a pre-built version contained in the [Gluon Model Zoo](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/model_zoo.html).
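The checkpointing behavior described above (save only when validation accuracy improves, and only if the checkpoint folder exists) can be sketched independently of MXNet. This is an illustrative sketch, not the code in `cifar10.py`: the function `maybe_checkpoint` and its `save_fn` callback are hypothetical names. In a Gluon script, `save_fn` would presumably write the network parameters and the trainer state into the folder.

```python
import os
import tempfile


def maybe_checkpoint(val_acc, best_acc, save_fn,
                     checkpoint_dir='/opt/ml/checkpoints'):
    """Checkpoint when val_acc beats best_acc; return the best accuracy so far.

    save_fn is a hypothetical callback that receives the checkpoint folder
    and persists the model and optimizer state there.
    """
    if val_acc <= best_acc:
        return best_acc  # no improvement, nothing to save
    if os.path.isdir(checkpoint_dir):
        save_fn(checkpoint_dir)
    # the best accuracy is tracked even when checkpointing is skipped
    return val_acc


# Simulate three epochs; a temporary folder stands in for /opt/ml/checkpoints.
saved = []
with tempfile.TemporaryDirectory() as ckpt_dir:
    best = 0.0
    for acc in [0.3, 0.5, 0.4]:
        best = maybe_checkpoint(acc, best, saved.append, checkpoint_dir=ckpt_dir)

print('best accuracy: {}, checkpoints written: {}'.format(best, len(saved)))
```

Only the first two epochs improve on the running best, so the callback fires twice. If the folder does not exist, the callback is never invoked and training simply continues.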

In [6]:
!cat 'cifar10.py'

from __future__ import print_function

import json
import logging
import os
import time

import mxnet as mx
from mxnet import autograd as ag
from mxnet import gluon
from mxnet.gluon.model_zoo import vision as models


# ------------------------------------------------------------ #
# Training methods                                             #
# ------------------------------------------------------------ #

def train(current_host, hosts, num_cpus, num_gpus, channel_input_dirs, model_dir, hyperparameters, **kwargs):
    # retrieve the hyperparameters we set in notebook (with some defaults)
    batch_size = hyperparameters.get('batch_size', 128)
    epochs = hyperparameters.get('epochs', 100)
    learning_rate = hyperparameters.get('learning_rate', 0.1)
    momentum = hyperparameters.get('momentum', 0.9)
    log_interval = hyperparameters.get('log_interval', 1)
    wd = hyperparameters.get('wd', 0.0001)

    if len(hosts) == 1:
        kvstore = 'device' if 

---

## Train (Local Mode)

The `MXNet` estimator will create our training job. To switch from training in SageMaker's managed environment to training within a notebook instance, just set `train_instance_type` to `local_gpu`.

In [13]:
m = MXNet('cifar10.py',
          py_version='py3',
          role=role, 
          train_instance_count=1,
          train_instance_type='local_gpu',
          framework_version='1.1.0',
          hyperparameters={'batch_size': 1024,
                           'epochs': 50,
                           'learning_rate': 0.1,
                           'momentum': 0.9})

After we've constructed our `MXNet` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [14]:
m.fit(inputs)

Creating tmphruntu_algo-1-i0df6_1 ... error
ERROR: for tmphruntu_algo-1-i0df6_1  Cannot start service algo-1-i0df6: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=10148 /var/lib/docker/overlay2/e05b3ace24682a5facb97af35f65eef28074df608cdb3bf648c4c6e45e86e45d/merged]\\\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\\\n\\\"\"": unknown

ERROR: for algo-1-i0df6  Cannot start service algo-1-i0df6: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: runni

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmphrunTU/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
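The error above reports that `nvidia-container-cli` found no CUDA-capable device, which means the training container could not see a GPU. One way to guard against this is to check for a GPU before constructing the estimator and fall back to CPU local mode (`'local'`) when none is found. This is a sketch under stated assumptions: the helper `pick_local_instance_type` is hypothetical, and it relies on the `nvidia-smi -L` command being a reasonable proxy for what the container runtime will detect.

```python
import shutil
import subprocess


def pick_local_instance_type():
    """Return 'local_gpu' if nvidia-smi lists at least one GPU, else 'local'.

    Hypothetical helper; detection via nvidia-smi is an assumption and may
    not match what the Docker runtime sees in every setup.
    """
    if shutil.which('nvidia-smi') is None:
        return 'local'
    try:
        result = subprocess.run(['nvidia-smi', '-L'],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL,
                                timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return 'local'
    # 'nvidia-smi -L' prints one line per device, each starting with "GPU <n>:"
    has_gpu = result.returncode == 0 and b'GPU ' in result.stdout
    return 'local_gpu' if has_gpu else 'local'


instance_type = pick_local_instance_type()
print('local mode instance type: {}'.format(instance_type))
```

The returned value could then be passed as `train_instance_type` when creating the `MXNet` estimator, and as `instance_type` when deploying.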

---

## Host

After training, we use the MXNet estimator object to deploy an endpoint. Because we trained locally, we'll also deploy the endpoint locally. The predictor object returned by `deploy` lets us call the endpoint and perform inference on our sample images.

In [None]:
predictor = m.deploy(initial_instance_count=1, instance_type='local_gpu')

### Evaluate

We'll use these CIFAR-10 sample images to test the service:

<img style="display: inline; height: 32px; margin: 0.25em" src="images/airplane1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/automobile1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/bird1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/cat1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/deer1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/dog1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/frog1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/horse1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/ship1.png" />
<img style="display: inline; height: 32px; margin: 0.25em" src="images/truck1.png" />



In [None]:
# load the CIFAR-10 samples and convert them into a format we can use with the prediction endpoint
from cifar10_utils import read_images

filenames = ['images/airplane1.png',
             'images/automobile1.png',
             'images/bird1.png',
             'images/cat1.png',
             'images/deer1.png',
             'images/dog1.png',
             'images/frog1.png',
             'images/horse1.png',
             'images/ship1.png',
             'images/truck1.png']

image_data = read_images(filenames)

The predictor runs inference on our input data and returns the predicted class label (as a float value, so we convert to int for display).

In [None]:
for i, img in enumerate(image_data):
    response = predictor.predict(img)
    print('image {}: class: {}'.format(i, int(response)))
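The integer class printed above is easier to read as a label name. The sketch below maps an index to a name, assuming the model's class indices follow the standard CIFAR-10 ordering (which matches the order of the sample image filenames). The names `CIFAR10_CLASSES` and `label_name` are hypothetical and are not part of `cifar10_utils`.

```python
# Standard CIFAR-10 class order; assumed to match the model's output indices.
CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']


def label_name(response):
    """Convert a predicted class (float or int) into a CIFAR-10 label name."""
    index = int(response)
    if not 0 <= index < len(CIFAR10_CLASSES):
        raise ValueError('class index out of range: {}'.format(index))
    return CIFAR10_CLASSES[index]


# Example with a float value, the type the endpoint is described as returning.
print('class 3.0 -> {}'.format(label_name(3.0)))
```

Inside the prediction loop, `label_name(response)` could replace `int(response)` in the `print` call.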

---

## Cleanup

After you have finished with this example, remember to delete the prediction endpoint. Only one local endpoint can be running at a time.

In [None]:
m.delete_endpoint()