# Fashion-MNIST PyTorch image classification with TensorBoard
Source
- https://tutorials.pytorch.kr/intermediate/tensorboard_tutorial.html
- https://github.com/aws/amazon-sagemaker-examples/blob/master/frameworks/pytorch/get_started_mnist_train.ipynb

## Initial setup

In [2]:
install_needed = True
# install_needed = False

In [3]:
import sys
import IPython

if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U 'sagemaker[local]'
    !{sys.executable} -m pip install -U sagemaker-experiments # SageMaker Experiments SDK 
    !{sys.executable} -m pip install -U sagemaker             # SageMaker Python SDK
    !/bin/bash ./local/local_mode_setup.sh
    IPython.Application.instance().kernel.do_shutdown(True)

installing deps and restarting kernel
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting sagemaker[local]
  Downloading sagemaker-2.72.1.tar.gz (473 kB)
     |████████████████████████████████| 473 kB 1.9 MB/s
  Preparing metadata (setup.py) ... done
Collecting PyYAML<6,>=5.3
  Using cached PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting jsonschema<4,>=2.5.1
  Using cached jsonschema-3.2.0-py2.py3-none-any.whl (56 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.72.1-py2.py3-none-any.whl size=650653 sha256=4b32cd38aef73debbd81b1aa311f360c09dd733c402663d379088c3c86480899
  Stored in directory: /home/ec2-user/.cache/pip/wheels/2d/2f/42/cb1762fdbe69d2b678e36c3b81e9f6fa2c03b08d5de5471edd
Successfully built sagemaker
Installing collected packages: PyYAML, jsonschema, sagemaker
  Attempting

In [1]:
# imports
import matplotlib.pyplot as plt
import numpy as np

import torch
import torchvision
import torchvision.transforms as transforms

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from time import strftime

## Prepare dataset

In [2]:
# transforms
transform = transforms.Compose(
    [transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))])

# datasets
trainset = torchvision.datasets.FashionMNIST('./data',
    download=True,
    train=True,
    transform=transform)
testset = torchvision.datasets.FashionMNIST('./data',
    download=True,
    train=False,
    transform=transform)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/26421880 [00:00<?, ?it/s]

Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/29515 [00:00<?, ?it/s]

Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/4422102 [00:00<?, ?it/s]

Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/5148 [00:00<?, ?it/s]

Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Processing...
Done!


  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
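
As a rough illustration of the `transforms.Normalize((0.5,), (0.5,))` step above: `ToTensor()` scales pixel values into `[0.0, 1.0]`, and normalizing with mean 0.5 and standard deviation 0.5 then maps them into `[-1.0, 1.0]`. The sketch below reproduces that arithmetic in plain Python. The helper names `normalize_pixel` and `unnormalize_pixel` are hypothetical and not part of torchvision.

```python
# Pure-Python sketch of the per-pixel arithmetic behind
# Normalize((0.5,), (0.5,)); helper names are hypothetical.
MEAN = 0.5
STD = 0.5


def normalize_pixel(value, mean=MEAN, std=STD):
    """Apply (value - mean) / std to a pixel already scaled to [0, 1]."""
    return (value - mean) / std


def unnormalize_pixel(value, mean=MEAN, std=STD):
    """Invert the transform, e.g. before plotting an image."""
    return value * std + mean


for raw in (0.0, 0.5, 1.0):
    print(raw, "->", normalize_pixel(raw))
```

The inverse helper is the step usually needed before displaying a normalized image with matplotlib.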


## Set up the SageMaker environment

In [3]:
import os
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

from smexperiments.experiment import Experiment ### SM Experiment
from smexperiments.trial import Trial           ### SM Experiment

from sagemaker.debugger import TensorBoardOutputConfig ### For TensorBoard 

sagemaker_session = sagemaker.Session()

role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = "tensorboard_pytorch_fashion_mnist"
tensorboard_logs_path = "s3://{}/{}/logs".format(bucket, prefix) ### For TensorBoard
output_path = "s3://{}/{}/output".format(bucket, prefix)

print("Bucket: {}".format(bucket))
print("SageMaker ver: " + sagemaker.__version__)
print("Tensorboard log path: {}".format(tensorboard_logs_path))

Bucket: sagemaker-ap-northeast-2-889750940888
SageMaker ver: 2.68.0
Tensorboard log path: s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/logs


## Upload the data to S3

In [4]:
!aws s3 cp ./data/FashionMNIST/raw s3://{bucket}/{prefix}/data --recursive

upload: data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/data/train-labels-idx1-ubyte.gz
upload: data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/data/t10k-labels-idx1-ubyte.gz
upload: data/FashionMNIST/raw/t10k-labels-idx1-ubyte to s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/data/t10k-labels-idx1-ubyte
upload: data/FashionMNIST/raw/train-labels-idx1-ubyte to s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/data/train-labels-idx1-ubyte
upload: data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/data/t10k-images-idx3-ubyte.gz
upload: data/FashionMNIST/raw/t10k-images-idx3-ubyte to s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/data/t10k-images-idx3-ubyte
upload: data

In [5]:
train_location = 's3://{}/{}/data'.format(bucket, prefix)
test_location = 's3://{}/{}/data'.format(bucket, prefix)

In [6]:
!aws s3 ls {train_location} --recursive

2021-12-22 00:12:46    7840016 tensorboard_pytorch_fashion_mnist/data/t10k-images-idx3-ubyte
2021-12-22 00:12:46    4422102 tensorboard_pytorch_fashion_mnist/data/t10k-images-idx3-ubyte.gz
2021-12-22 00:12:46      10008 tensorboard_pytorch_fashion_mnist/data/t10k-labels-idx1-ubyte
2021-12-22 00:12:46       5148 tensorboard_pytorch_fashion_mnist/data/t10k-labels-idx1-ubyte.gz
2021-12-22 00:12:46   47040016 tensorboard_pytorch_fashion_mnist/data/train-images-idx3-ubyte
2021-12-22 00:12:46   26421880 tensorboard_pytorch_fashion_mnist/data/train-images-idx3-ubyte.gz
2021-12-22 00:12:46      60008 tensorboard_pytorch_fashion_mnist/data/train-labels-idx1-ubyte
2021-12-22 00:12:46      29515 tensorboard_pytorch_fashion_mnist/data/train-labels-idx1-ubyte.gz


In [7]:
!aws s3 ls {test_location} --recursive

2021-12-22 00:12:46    7840016 tensorboard_pytorch_fashion_mnist/data/t10k-images-idx3-ubyte
2021-12-22 00:12:46    4422102 tensorboard_pytorch_fashion_mnist/data/t10k-images-idx3-ubyte.gz
2021-12-22 00:12:46      10008 tensorboard_pytorch_fashion_mnist/data/t10k-labels-idx1-ubyte
2021-12-22 00:12:46       5148 tensorboard_pytorch_fashion_mnist/data/t10k-labels-idx1-ubyte.gz
2021-12-22 00:12:46   47040016 tensorboard_pytorch_fashion_mnist/data/train-images-idx3-ubyte
2021-12-22 00:12:46   26421880 tensorboard_pytorch_fashion_mnist/data/train-images-idx3-ubyte.gz
2021-12-22 00:12:46      60008 tensorboard_pytorch_fashion_mnist/data/train-labels-idx1-ubyte
2021-12-22 00:12:46      29515 tensorboard_pytorch_fashion_mnist/data/train-labels-idx1-ubyte.gz


## Local mode training

In [8]:
# from sagemaker.local import LocalSession
# sagemaker_session = LocalSession()

In [9]:
from sagemaker.debugger import TensorBoardOutputConfig ### For TensorBoard 

# Note: CreateTrainingJob fails with a ValidationException if "LocalPath" of
# "TensorBoardOutputConfig" starts with a reserved path: /opt/ml, /tmp, or /usr/local/nvidia.

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=tensorboard_logs_path,
    container_local_output_path='/pytorch/tensors' # See code/train.py
) 
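
The reserved-path restriction quoted in the comment above can be checked locally before submitting a job. This is a minimal sketch that assumes the service does a plain prefix match, as the error message suggests. `RESERVED_PREFIXES` and `is_allowed_local_path` are hypothetical names, not part of the SageMaker SDK.

```python
# Hypothetical pre-flight check mirroring the ValidationException message:
# the container-local TensorBoard path must not start with a reserved prefix.
RESERVED_PREFIXES = ("/opt/ml", "/tmp", "/usr/local/nvidia")


def is_allowed_local_path(path, reserved=RESERVED_PREFIXES):
    """Return True if `path` does not start with any reserved prefix."""
    return not any(path.startswith(prefix) for prefix in reserved)


print(is_allowed_local_path("/pytorch/tensors"))
print(is_allowed_local_path("/opt/ml/output/tensorboard"))
```

Running such a check in the notebook fails fast instead of waiting for the `CreateTrainingJob` call to be rejected.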

In [10]:
hyperparameters_local = {"batch-size": 128,
                         "epochs": 1,
                         "learning-rate": 1e-3,
                         "log-interval": 100,
                         "tensorboard-logs-path": tensorboard_logs_path} # Not working in local mode

In [11]:
# set local_mode to be True if you want to run the training script
# on the machine that runs this notebook

local_mode = True

if local_mode:
    instance_type = "local"
else:
    instance_type = "ml.c5.xlarge"

est_local = PyTorch(
            entry_point="train.py",
            source_dir="code",  # directory of your training script
            role=role,
            framework_version="1.8.1",
            py_version="py3",
            instance_type=instance_type,
            instance_count=1,
            output_path=output_path,
            hyperparameters=hyperparameters_local,
            tensorboard_output_config=tensorboard_output_config
)

In [12]:
channels = {"training": train_location, "testing": test_location}
est_local.fit(inputs=channels)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: pytorch-training-2021-12-22-00-13-14-352
INFO:sagemaker.local.local_session:Starting training job
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-jxhg8:
    command: train
    container_name: tuagljbb8u-algo-1-jxhg8
    environment:
    - '[Masked]'
    - '[Masked]'
    image: 763104351884.dkr.ecr.ap-northeast-2.amazonaws.com/pytorch-training:1.8.1-cpu-py3
    networks:
      sagemaker-local:
        aliases:
        - algo-1-jxhg8
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmp77n_ozch/algo-1-jxhg8/output/data:/opt/ml/output/data
    - /tmp/tmp77n_ozch/algo-1-jxhg8/outpu

Creating tuagljbb8u-algo-1-jxhg8 ... 
Creating tuagljbb8u-algo-1-jxhg8 ... done
Attaching to tuagljbb8u-algo-1-jxhg8
tuagljbb8u-algo-1-jxhg8 | 2021-12-22 00:14:22,018 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
tuagljbb8u-algo-1-jxhg8 | 2021-12-22 00:14:22,020 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
tuagljbb8u-algo-1-jxhg8 | 2021-12-22 00:14:22,029 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
tuagljbb8u-algo-1-jxhg8 | 2021-12-22 00:14:22,031 sagemaker_pytorch_container.training INFO     Invoking user training script.
tuagljbb8u-algo-1-jxhg8 | 2021-12-22 00:14:22,162 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
tuagljbb8u-algo-1-jxhg8 | /opt/conda/bin/python3.6 -m pip install -r requirements.txt
tuagljbb8u-algo-1-jxhg8 | Collecting tensorboard<2.4
tua

## Managed training

### Experiments

In [19]:
def create_experiment(experiment_name):
    # Reuse the experiment if it already exists; otherwise create it.
    try:
        sm_experiment = Experiment.load(experiment_name)
    except Exception:
        sm_experiment = Experiment.create(experiment_name=experiment_name,
                                          tags=[
                                              {
                                                  'Key': 'modelname',
                                                  'Value': 'fashion-mnist'
                                              },
                                          ])
    return sm_experiment

In [20]:
def create_trial(experiment_name, i_type, i_cnt, spot):
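    # Note: lowercase "%s" is a platform-specific extension (epoch seconds on Linux);
    # use "%S" instead if two-digit seconds are intended.
    create_date = strftime("%m%d-%H%M%s")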
    
    algo = 'dp'
    
    spot = 's' if spot else 'd'
    i_tag = 'test'
    
    if i_type == 'ml.p3.16xlarge':
        i_tag = 'p3'
    elif i_type == 'ml.p2.8xlarge':
        i_tag = 'p2'
    elif i_type == 'ml.p3dn.24xlarge':
        i_tag = 'p3dn'
    elif i_type == 'ml.p4d.24xlarge':
        i_tag = 'p4d'
    else:
        i_tag = 'others'
        
    trial = "-".join([i_tag,str(i_cnt),algo, spot])
       
    sm_trial = Trial.create(trial_name=f'{experiment_name}-{trial}-{create_date}',
                            experiment_name=experiment_name)

    job_name = f'{sm_trial.trial_name}'
    return job_name
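
The tag logic in `create_trial` can also be written as a dictionary lookup. The sketch below is one possible equivalent of the `if`/`elif` chain, kept separate from the SageMaker calls so it runs without AWS access. `INSTANCE_TAGS` and `build_trial_suffix` are hypothetical names introduced for illustration.

```python
# Hypothetical lookup table mirroring the if/elif chain in create_trial.
INSTANCE_TAGS = {
    "ml.p3.16xlarge": "p3",
    "ml.p2.8xlarge": "p2",
    "ml.p3dn.24xlarge": "p3dn",
    "ml.p4d.24xlarge": "p4d",
}


def build_trial_suffix(i_type, i_cnt, spot, algo="dp"):
    """Build the '<tag>-<count>-<algo>-<s|d>' part of the trial name."""
    i_tag = INSTANCE_TAGS.get(i_type, "others")  # unknown types fall back to 'others'
    spot_tag = "s" if spot else "d"              # 's' = spot, 'd' = on-demand
    return "-".join([i_tag, str(i_cnt), algo, spot_tag])


print(build_trial_suffix("ml.p3.2xlarge", 1, False))  # others-1-dp-d
```

Note that `ml.p3.2xlarge`, the instance type used later in this notebook, is not in the table, so it maps to `others`.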

### Debugger rules

In [21]:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

### Debugger Profiling

In [22]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)

### Training environments

In [23]:
metric_definitions = [{'Name': 'average loss',
                       'Regex': 'Average loss: ([0-9\\.]+)'},
                      {'Name': 'accuracy',
                       'Regex': 'Accuracy: [0-9]+/[0-9]+, ([0-9\\.]+)'}]
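
Before launching a long job, it can help to confirm that the metric regexes capture what is expected. The sketch below applies them with Python's `re` module to a hypothetical log line. The actual line format depends on what `code/train.py` prints, and `extract_metrics` is an illustrative helper, not a SageMaker API.

```python
import re

# Same definitions as in the cell above.
metric_definitions = [
    {"Name": "average loss", "Regex": "Average loss: ([0-9\\.]+)"},
    {"Name": "accuracy", "Regex": "Accuracy: [0-9]+/[0-9]+, ([0-9\\.]+)"},
]

# Hypothetical log line; the real format is whatever code/train.py prints.
sample_log = "Test set: Average loss: 0.4321, Accuracy: 8450/10000, 84.5"


def extract_metrics(line, definitions):
    """Return {metric name: captured value} for every regex that matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(sample_log, metric_definitions))
```

If a regex returns nothing for a real log line, the corresponding metric will likely be missing from the training job's metrics as well.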

In [17]:
type(metric_definitions)

list

In [41]:
hyperparameters = {"batch-size": 128,
                   "epochs": 100,
                   "learning-rate": 1e-3,
                   "log-interval": 100,}

In [42]:
# set local_mode to be True if you want to run the training script
# on the machine that runs this notebook

local_mode = False

instance_count = 1

if local_mode:
    instance_type = "local"
else:
    instance_type = "ml.p3.2xlarge"

estimator = PyTorch(
            entry_point="train.py",
            source_dir="code",  # directory of your training script
            role=role,
            framework_version="1.8.1",
            py_version="py3",
            instance_type=instance_type,
            instance_count=instance_count,
            output_path=output_path,
            hyperparameters=hyperparameters,
            tensorboard_output_config=tensorboard_output_config,
            base_job_name='pytorch-tensorboard',
            metric_definitions=metric_definitions,
            profiler_config=profiler_config,
            rules=rules,
            disable_profiler=False # default: False
)

In [43]:
experiment_name = 'pytorch-tensorboard'
do_spot_training=False

create_experiment(experiment_name)
job_name = create_trial(experiment_name, instance_type, instance_count, do_spot_training)
job_name

'pytorch-tensorboard-others-1-dp-d-1222-00401640133602'

In [44]:
channels = {"training": train_location, "testing": test_location}
estimator.fit(inputs=channels,
              experiment_config={
                  'TrialName': job_name,
                  'TrialComponentDisplayName': job_name,
                },
              wait=False)

INFO:sagemaker:Creating training-job with name: pytorch-tensorboard-2021-12-22-00-40-07-255


In [45]:
job_name=estimator.latest_training_job.name
sagemaker_session.logs_for_job(job_name=job_name, wait=True)

2021-12-22 00:40:10 Starting - Starting the training job...
2021-12-22 00:40:33 Starting - Launching requested ML instances
LossNotDecreasing: InProgress
LowGPUUtilization: InProgress
ProfilerReport: InProgress
......
2021-12-22 00:41:33 Starting - Preparing the instances for training......
2021-12-22 00:42:40 Downloading - Downloading input data...
2021-12-22 00:43:10 Training - Downloading the training image.........................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-12-22 00:47:12,817 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-12-22 00:47:12,842 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-12-22 00:47:12,862 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-12-22 00:47:13,494 sagemaker-training-toolkit INFO     Installing d

### Download profiler report

In [29]:
rule_output_path = estimator.output_path + '/' + estimator.latest_training_job.job_name + "/rule-output"
rule_output_path

's3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output'

In [30]:
! aws s3 ls {rule_output_path} --recursive

2021-12-22 00:29:06     424537 tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output/ProfilerReport/profiler-output/profiler-report.html
2021-12-22 00:29:06     281005 tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb
2021-12-22 00:29:00        536 tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
2021-12-22 00:29:00      11696 tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
2021-12-22 00:29:00       2041 tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json
2021-12-22 00:29:00        130 tensorboard_pytorch_fashion_mn

In [31]:
os.makedirs('./profiler', exist_ok=True)

In [32]:
! aws s3 cp {rule_output_path} ./profiler --recursive

download: s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json to profiler/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json
download: s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json to profiler/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
download: s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboard-2021-12-22-00-15-45-032/rule-output/ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json to profiler/ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json
download: s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/output/pytorch-tensorboar

## TensorBoard

In [46]:
# https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tensorboard.html
!pip install tensorboard==2.3

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


In [47]:
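aws_region = sagemaker_session.boto_region_name
# Set the region in the kernel's environment so later `!` commands inherit it;
# a bare `!AWS_REGION=...` only affects its own short-lived subshell.
os.environ["AWS_REGION"] = aws_region
!echo tensorboard --logdir {tensorboard_logs_path}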

tensorboard --logdir s3://sagemaker-ap-northeast-2-889750940888/tensorboard_pytorch_fashion_mnist/logs
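
Each `!` line in a notebook runs in its own shell process, so a variable assigned on one line is gone by the next. The sketch below illustrates this with `subprocess` on a POSIX system. `DEMO_REGION` and `run_shell` are hypothetical names chosen to avoid touching real AWS settings.

```python
import subprocess


def run_shell(command):
    """Run one command in its own `sh` process and return its stdout."""
    result = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
    return result.stdout.strip()


# The assignment only exists inside the first shell process...
run_shell("DEMO_REGION=ap-northeast-2")
# ...so a second, separate shell does not see it.
separate = run_shell('echo "${DEMO_REGION:-unset}"')

# Prefixing the command with the assignment passes the variable to that command.
same_line = run_shell("DEMO_REGION=ap-northeast-2 sh -c 'echo $DEMO_REGION'")

print(separate)
print(same_line)
```

In practice this means the region should either be set in the kernel's environment or placed on the same command line as `tensorboard`.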


[**Click here to access TensorBoard instance**](/proxy/6006/)

To access TensorBoard from outside the SageMaker notebook environment, open `https://<notebook instance hostname>/proxy/6006/`.

## Screenshots

![tensorboard](image/01.tensorboard.png)

![SM-experiments](image/02.SM-experiments.png)

![SM-debugger](image/03.SM-debugger.png)