# ML Training in SageMaker

The model used for this notebook is a basic Convolutional Neural Network (CNN).  
We'll train the CNN to classify images using the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), a well-known computer vision dataset.

![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)

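## Demo outline
We will run training in the following order.
1. Build development environment
2. Prepare input data
3. Modify SOTA code (input data, output model)
4. Verify that training runs correctly in the development environment
5. Submit as a training job
6. Monitor training (status, logs, etc.)
7. Deploy saved models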

## 1. Build development environment

We will use SageMaker Studio as the development environment for this demo.

- SageMaker Studio is an extension of JupyterLab
- SageMaker Studio is an ML development environment that makes SageMaker features easy to use
- To load the notebook kernel, a compute instance must be launched

The training script for this demo is based on the `PyTorch` framework.
So, before starting the demo, we have to set the instance spec and the kernel gateway image.
- instance spec: ml.g4dn.xlarge (4 vCPU + 16 GB + 1 GPU)
- kernel image: PyTorch 1.6 Python 3.6 GPU Optimized

## 2. Prepare input data

In [None]:
from get_cifar10 import get_train_data_loader, get_test_data_loader, imshow, classes

trainloader = get_train_data_loader()
testloader = get_test_data_loader()

In [None]:
import numpy as np
import torchvision, torch

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))

# print labels
print(" ".join("%9s" % classes[labels[j]] for j in range(4)))

In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

inputs = sagemaker_session.upload_data(
    path="data",
    bucket=sagemaker_session.default_bucket(),
    key_prefix="data/cifar10",
)

print(inputs)

## 3. Modify SOTA code

For a training job created through SageMaker to receive the input data we specify, and for SageMaker to manage the model it produces, we have to set the input data path and the output model path using environment variables.

See the SageMaker [environment variables guide](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md).

In [None]:
!diff source/train.py source/train_sagemaker.py
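
The diff output is not reproduced here. As a rough sketch of what this kind of change usually looks like, a training script can take its paths from the `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR` environment variables described in the guide. The `parse_args` helper and the fallback paths are assumptions for illustration, not the actual contents of `train_sagemaker.py`. The channel name `training` is what the SDK uses when `fit` is given a single S3 URI.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths (hypothetical helper)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters are passed to the script as command-line arguments,
    # e.g. --epochs 1 --lr 0.01 --batch 64.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--batch", type=int, default=64)

    # Paths come from environment variables set inside the training container.
    # The fallbacks are the usual container locations (assumed here).
    parser.add_argument(
        "--data-dir",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    return parser.parse_args(argv)
```

With this pattern the script would read its dataset from `args.data_dir` and save the trained model under `args.model_dir`, so the same code can run both in a training job and on a development machine.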

## 4. Verify that training runs correctly in the development environment

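Once the SOTA code has been modified, we need to verify that it actually runs correctly.  
To do so, we submit a job with the number of epochs set to 1.

However, because this approach creates a new job every time we want to check the behavior, it takes a long time.
(With mlp, it is possible to connect to a job interactively and run tests.)

For faster testing, the test can also be run in the user's local environment.  
To train in Local Mode, it is necessary to have `docker-compose` or `nvidia-docker-compose` (for GPU) installed in the notebook instance.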
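
As a minimal sketch of Local Mode: the SageMaker Python SDK treats the instance types `local` and `local_gpu` as a request to run the training container on the current machine, and it accepts `file://` URIs as training input. The `local_mode_settings` helper below is hypothetical and only assembles those two values. It assumes the CIFAR-10 files are in the local `data` directory used earlier.

```python
import os


def local_mode_settings(use_gpu=False, data_dir="data"):
    """Return (instance_type, inputs) for a Local Mode run (hypothetical helper)."""
    # 'local' runs the container on CPU, 'local_gpu' uses the machine's GPU.
    instance_type = "local_gpu" if use_gpu else "local"
    # Local Mode can read training data directly from disk via a file:// URI.
    inputs = "file://" + os.path.abspath(data_dir)
    return instance_type, inputs


instance_type, local_inputs = local_mode_settings(use_gpu=False)
print(instance_type, local_inputs)

# These values would take the place of the ones passed to the estimator, e.g.:
#   PyTorch(..., instance_type=instance_type)
#   pytorch_estimator.fit(local_inputs)
```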

In [None]:
from sagemaker.pytorch.estimator import PyTorch

instance_type = 'ml.p2.xlarge'

pytorch_estimator = PyTorch(
    source_dir='source',
    entry_point='train_sagemaker.py',
    framework_version="1.7.1",
    py_version='py3',
    role=sagemaker.get_execution_role(),
    base_job_name='06231855-cifar10-test',
    instance_count=1,
    instance_type=instance_type,
    hyperparameters={'epochs': 1, 'lr': 0.01, 'batch': 64},
    metric_definitions=[
        # raw string so that \S is not treated as a string escape
        {'Name': 'accuracy', 'Regex': r'Test Accuracy: (\S+)'}
    ]
)

In [None]:
pytorch_estimator.fit(inputs, logs=False)

In [None]:
# use the public job-name accessor rather than the private _current_job_name
sagemaker.analytics.TrainingJobAnalytics(
    pytorch_estimator.latest_training_job.name, metric_names=['accuracy']
).dataframe()
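
The `accuracy` metric queried here comes from the `metric_definitions` regex, which SageMaker applies to the training job's log output, recording the captured group as the metric value. A small sketch of that matching with Python's `re` module follows. The sample log lines are made up for illustration, and the conversion to `float` assumes the script prints a plain number; the real format depends on what `train_sagemaker.py` prints.

```python
import re

# Same pattern as in metric_definitions; the capture group is the metric value.
ACCURACY_PATTERN = re.compile(r"Test Accuracy: (\S+)")


def extract_accuracy(log_lines):
    """Return every accuracy value found in the given log lines, as floats."""
    values = []
    for line in log_lines:
        match = ACCURACY_PATTERN.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


# Hypothetical log output from a training script.
sample_logs = [
    "Epoch 1 finished",
    "Test Accuracy: 0.41",
    "Epoch 2 finished",
    "Test Accuracy: 0.52",
]
print(extract_accuracy(sample_logs))  # [0.41, 0.52]
```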

## 5. Submit as a training job

Once the check passes, set the number of epochs back to the intended value and create the job again on an instance type with the desired spec.

In [None]:
from sagemaker.pytorch.estimator import PyTorch

instance_type = 'ml.p3.2xlarge'

pytorch_estimator = PyTorch(
    source_dir='source',
    entry_point='train_sagemaker.py',
    framework_version="1.7.1",
    py_version='py3',
    role=sagemaker.get_execution_role(),
    base_job_name='06231930-cifar10-train',
    instance_count=1,
    instance_type=instance_type,
    hyperparameters={'epochs': 10, 'lr': 0.01, 'batch': 64},
    metric_definitions=[
        # raw string so that \S is not treated as a string escape
        {'Name': 'accuracy', 'Regex': r'Test Accuracy: (\S+)'}
    ]
)

pytorch_estimator.fit(inputs, logs=False)

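## 6. Monitor training (status, logs, etc.)

Training progress can be checked through the web console or the SDK.

Once training has completed, the metric results can also be retrieved through the SageMaker SDK and used for further analysis.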

In [None]:
# use the public job-name accessor rather than the private _current_job_name
sagemaker.analytics.TrainingJobAnalytics(
    pytorch_estimator.latest_training_job.name, metric_names=['accuracy']
).dataframe()
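
For the status check through the SDK, one option is the `describe_training_job` call of the boto3 SageMaker client, whose response includes a `TrainingJobStatus` field. The sketch below polls that field until the job reaches a terminal state. The `wait_for_job` helper and its polling interval are hypothetical, and the client is passed in as an argument so the helper can be tried without AWS access.

```python
import time

# Terminal values of TrainingJobStatus.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(sm_client, job_name, poll_seconds=30, max_polls=120):
    """Poll a training job until it finishes; return the last status seen."""
    status = None
    for _ in range(max_polls):
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status


# Usage with a real client (requires AWS credentials):
#   import boto3
#   wait_for_job(boto3.client("sagemaker"), "<training-job-name>")
```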

## 7. Deploy saved models

After a PyTorch Estimator has been fit, we can host the newly created model in SageMaker.

After calling `fit`, we can call `deploy` on a `PyTorch` Estimator to create a SageMaker Endpoint.  
The Endpoint runs a <U>SageMaker-provided PyTorch model server</U> and hosts the <U>model produced by our training script</U> (the model we saved to `model_dir`).

`deploy` returns a Predictor object, which we can use to run inference against the Endpoint hosting the PyTorch model.  
Each Predictor provides a `predict` method that accepts NumPy arrays or Python lists.  
The input arrays or lists are serialized and sent to the PyTorch model server.

`predict` returns the result of inference against your model. By default, the inference result is a NumPy array.
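
As a sketch of how such a result can be interpreted: assuming each row of the returned array holds one score per CIFAR-10 class, the predicted label is the index of the largest score in that row. The example uses plain Python lists and a hypothetical `CLASSES` tuple standing in for the `classes` object imported from `get_cifar10`, whose exact contents are not shown here.

```python
# Hypothetical label list: the official CIFAR-10 class names in index order.
CLASSES = ("airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")


def top_classes(scores):
    """Map each row of per-class scores to the name of its highest-scoring class."""
    labels = []
    for row in scores:
        best_index = max(range(len(row)), key=lambda i: row[i])
        labels.append(CLASSES[best_index])
    return labels


# Made-up scores for two images (10 values each).
fake_outputs = [
    [0.1, 0.0, 0.2, 2.5, 0.3, 1.1, 0.0, 0.2, 0.1, 0.0],
    [0.0, 0.2, 0.1, 0.0, 0.1, 0.0, 0.3, 0.1, 3.0, 0.4],
]
print(top_classes(fake_outputs))  # ['cat', 'ship']
```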

In [None]:
cifar10_predictor = pytorch_estimator.deploy(initial_instance_count=1, instance_type='ml.p2.xlarge')

In [None]:
# get some test images
dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print("GroundTruth: ", " ".join("%4s" % classes[labels[j]] for j in range(4)))

outputs = cifar10_predictor.predict(images.numpy())

_, predicted = torch.max(torch.from_numpy(np.array(outputs)), 1)

print("Predicted: ", " ".join("%4s" % classes[predicted[j]] for j in range(4)))

In [None]:
# clean up: delete the endpoint through the Predictor returned by deploy()
cifar10_predictor.delete_endpoint()