# ML Training in MLP

We'll train the same CNN model in MLP.

## Demo sequence

Training is done in the following order:

1. Build development environment
2. Prepare input data
3. Modify training code (input data, output model)
4. Test training code
5. Run training job
6. Monitoring (job status, logs, etc.)
7. Analyze saved models

## 1. Build development environment

### Set up the Testing Chamber

The Testing Chamber is used to prepare for training, not to run the actual ML training:

- Set up a custom environment.
- Modify and debug code.

The training environment used here is "ubuntu18.04-python3.8-jupyter-cuda10.1-pytorch1.7.1".

## 2. Prepare input data

### Download the CIFAR-10 dataset

I prepared a `get_cifar10.py` file, which downloads the CIFAR-10 dataset and converts it to PyTorch `DataLoader` objects.

In [None]:
from get_cifar10 import get_train_data_loader, get_test_data_loader, imshow, classes

trainloader = get_train_data_loader()
testloader = get_test_data_loader()

### Data preview

In [None]:
import numpy as np
import torchvision, torch

# get some random training images
dataiter = iter(trainloader)
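images, labels = next(dataiter)  # the built-in next() works across PyTorch versions; the .next() method was removed from newer DataLoader iterators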

# show images
imshow(torchvision.utils.make_grid(images))

# print labels
print(" ".join("%9s" % classes[labels[j]] for j in range(4)))

### Save the dataset to the GPFS volume

We will save our dataset to the user volume (GPFS).  
Data saved in GPFS can be reused across multiple jobs.

In [None]:
!mv ./data /gpfs-volume/cifar10
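
Because later jobs read the dataset from the shared volume, it can be worth confirming that the move succeeded before continuing. The following is a minimal sketch that assumes only that the dataset directory should exist and be non-empty; `dataset_ready` is a hypothetical helper, not part of MLP or `get_cifar10.py`.

```python
from pathlib import Path


def dataset_ready(data_dir):
    """Return True if data_dir is an existing directory with at least one entry.

    Hypothetical helper: it does not validate the CIFAR-10 files themselves.
    """
    path = Path(data_dir)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    # Path taken from the `mv` command above.
    print(dataset_ready("/gpfs-volume/cifar10"))
```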

## 3. Modify training code

### Pre-developed code
```
source
├── my_cnn.py    # CNN Model network code
└── train.py     # training and evaluation code
```

To use this code in MLP, I created the `train_mlp.py` file.

In [None]:
!diff source/train.py source/train_mlp.py

### What to modify

- To register the output model after the job finishes, add a code block that uses the `mltracker` SDK.
- Add logging of parameters, metrics, and model artifacts.
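
The notebook does not show the `mltracker` calls themselves, so the sketch below does not use that SDK. It uses a hypothetical stand-in class, `RunRecord`, only to illustrate the three kinds of information the modified script needs to record: parameters, metrics, and model artifacts. Every name in it is an assumption.

```python
import json
from pathlib import Path


class RunRecord:
    """Hypothetical stand-in for a tracking client. This is NOT the mltracker API."""

    def __init__(self):
        self.params = {}      # hyperparameters, e.g. learning rate
        self.metrics = {}     # metric name -> list of {"step", "value"} entries
        self.artifacts = []   # paths of saved model files

    def log_param(self, name, value):
        self.params[name] = value

    def log_metric(self, name, value, step):
        self.metrics.setdefault(name, []).append({"step": step, "value": value})

    def log_artifact(self, path):
        self.artifacts.append(str(Path(path)))

    def to_json(self):
        """Serialize everything recorded so far."""
        return json.dumps(
            {"params": self.params, "metrics": self.metrics, "artifacts": self.artifacts},
            indent=2,
        )


if __name__ == "__main__":
    record = RunRecord()
    record.log_param("lr", 0.01)
    record.log_param("batch", 64)
    record.log_metric("loss", 2.0, step=1)  # illustrative value, not a real result
    record.log_artifact("model.pt")         # hypothetical file name
    print(record.to_json())
```

In the real `train_mlp.py`, the equivalent calls would go through `mltracker`; the diff above shows where they were added.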

## 4. Test training code

After modifying the code, check that it actually works.  
For the test, run the execution command in this environment with a single epoch.

In [None]:
!python3 ~/testcode/0623_demo/source/train_mlp.py --epochs=1 --lr=0.01 --batch=64 --data-dir='/gpfs-volume/cifar10'
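
The command above implies that `train_mlp.py` accepts `--epochs`, `--lr`, `--batch`, and `--data-dir`. The script's actual argument handling is not shown in this notebook, so the following is only a minimal sketch of how such flags could be parsed with `argparse`; the default values and the `parse_args` helper name are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the test command.

    Defaults are assumed for illustration; they are not taken from train_mlp.py.
    Passing argv=None makes argparse read the real command line.
    """
    parser = argparse.ArgumentParser(description="CIFAR-10 CNN training (sketch)")
    parser.add_argument("--epochs", type=int, default=1, help="number of training epochs")
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--batch", type=int, default=64, help="mini-batch size")
    parser.add_argument("--data-dir", type=str, default="./data", help="dataset directory")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Same flags as the test command, passed explicitly for demonstration.
    args = parse_args(["--epochs=1", "--lr=0.01", "--batch=64", "--data-dir=/gpfs-volume/cifar10"])
    print(args.epochs, args.lr, args.batch, args.data_dir)
```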

## 5. Run training job

### Save the Testing Chamber environment as a custom image
After confirming that the test run works, we can save the current environment as is to a custom image.

### Define the training job spec
Create a training job that uses the same environment as the test, changing only the GPU spec and the execution command.

- Container image
- Job name
- GPU Pool & Core
- GPU count
- Output Model Set
- Execution command

```sh
python3 ~/testcode/0623_demo/source/train_mlp.py --epochs=50 --lr=0.01 --batch=64 --data-dir='/gpfs-volume/cifar10'
```
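
The test run and the full job differ only in the `--epochs` value, so it can help to generate both commands from one place and avoid copy-paste drift. The sketch below is one possible way to do that locally; `build_command` is a hypothetical helper, not part of MLP, and the job itself is still defined through the MLP interface.

```python
import shlex

# Script path taken from the commands in this notebook; left unquoted so the shell expands "~".
SCRIPT = "~/testcode/0623_demo/source/train_mlp.py"


def build_command(epochs, lr=0.01, batch=64, data_dir="/gpfs-volume/cifar10"):
    """Return the execution command as a single shell string (hypothetical helper)."""
    flags = [
        f"--epochs={epochs}",
        f"--lr={lr}",
        f"--batch={batch}",
        f"--data-dir={shlex.quote(data_dir)}",  # quoted only if the path needs it
    ]
    return " ".join(["python3", SCRIPT, *flags])


if __name__ == "__main__":
    print(build_command(epochs=1))   # test run
    print(build_command(epochs=50))  # full training job
```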

## 6. Monitoring (job status, logs, etc.)

Users can check the job status and job logs through the web interface.

## 7. Analyze saved models

Users can inspect the model produced by the training job on the Model Set screen.