# TABLE OF CONTENTS:
---
* [Setup](#Setup)
* [Data](#Data)
* [Compute Target](#Compute-Target)
* [Training Artifacts](#Training-Artifacts)
* [Development Environment](#Development-Environment)
* [Experiment & Run Configuration](#Experiment-&-Run-Configuration)
    * [Option 1: Normal Script Run](#Option-1-Script-Run)
    * [Option 2: Hyperdrive Run](#Option-2-Hyperdrive-Run)
* [Model Registration](#Model-Registration)
* [Resource Clean Up](#Resource-Clean-Up)
---

# Setup

Append the parent directory to `sys.path` so that modules can be imported from the `src` directory.

In [1]:
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath("")))

In [6]:
import azureml.core
from dotenv import load_dotenv
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import os
import scipy.io
import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import uuid

from azureml.core.authentication import MsiAuthentication
from azureml.core import Dataset, Environment, Experiment, Keyvault, Model, ScriptRunConfig, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.model import InferenceConfig 
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling
from azureml.train.hyperdrive import choice, uniform
from azureml.widgets import RunDetails
from torchvision import datasets

from src.training.data_utils import load_data, show_image
from src.training.download_utils import download_file, extract_stanford_dogs_archive

print(f"azureml.core version: {azureml.core.VERSION}")

azureml.core version: 1.14.0


Automatically reload modules when changes are made.

In [7]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Create a training directory. This directory will contain all artifacts needed for model training. For AML remote training this directory will be copied to the remote compute at runtime.

In [8]:
training_folder = os.path.join(os.getcwd(), "../src/training")
os.makedirs(training_folder, exist_ok=True)
print(f"Training folder {training_folder} has been created.")

Training folder /mnt/batch/tasks/shared/LS_root/mounts/clusters/amlbriksevnetci/code/Users/BRIKSE/pytorch-use-cases-azure-ml/template_project/notebooks/../src/training has been created.


### Connect to Workspace

In order to connect and communicate with the Azure Machine Learning (AML) workspace, a workspace object needs to be instantiated using the Azure ML SDK.

In [9]:
# Connect to the AML workspace. MsiAuthentication only works out of the box on the AML Compute Instance.
# For alternative connection options see the aml_snippets directory.
msi_auth = MsiAuthentication()

ws = Workspace(subscription_id="bf088f59-f015-4332-bd36-54b988be7c90",
               resource_group="amlbrikserg",
               workspace_name="amlbriksews",
               auth=msi_auth)

# Data

The [Stanford Dogs dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/) is an image dataset that will be used to train a multiclass dog breed classification model. It contains 20580 images across 120 dog breeds (classes) and was built from ImageNet images and annotations for the task of fine-grained image categorization. The images are three-channel color images of varying size. The website provides a file with a predefined train/test split; here the test set is split further into a validation set and a final test set (50:50). This results in the following data distribution:
- 12000 training images (58.31%)
- 4290 validation images (20.85%)
- 4290 test images (20.85%)

Create a directory to store all data.

In [None]:
data_folder = os.path.join(os.getcwd(), "../data")
os.makedirs(data_folder, exist_ok=True)
print(f"Data folder {data_folder} has been created.")

### Download Data

Create a utility file with functions to download the dogs dataset archive files from the Stanford Vision website and extract them into the directory layout expected by [torchvision.datasets.ImageFolder](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder).

In [10]:
%%writefile $training_folder/download_utils.py
import os
import scipy.io
import shutil
import tarfile
import tqdm
import urllib.request


def download_file(download_url: str,
                  file_dir: str,
                  file_name: str,
                  skip_if_dir_exists: bool = False,
                  force_dir_deletion: bool = False) -> None:
    """
    Download a file
    :param download_url: url from where to download
    :param file_dir: directory to which to download
    :param file_name: name of the file
    :param skip_if_dir_exists: flag that indicates whether to skip the download if the directory already exists
    :param force_dir_deletion: flag that indicates whether to delete the existing directory before the download
    """
    
    # Remove file directory if it exists
    if force_dir_deletion and os.path.exists(file_dir):
        shutil.rmtree(file_dir)
        print(f"Directory {file_dir} has been removed.")
    
    # Check if download should be triggered
    if not os.path.exists(file_dir) or not skip_if_dir_exists:
    
        # Create file directory if it does not exist
        os.makedirs(file_dir, exist_ok=True)
    
        # Download the file
        file_path = os.path.join(file_dir, file_name)
        print("Downloading " + download_url + " to " + file_path + ".")
        urllib.request.urlretrieve(download_url, filename=file_path, reporthook=generate_bar_updater())
        

def extract_stanford_dogs_archive(archive_dir_path: str = "../data",
                                  target_dir_path: str = "../data",
                                  remove_archives: bool = True) -> None:
    """
    Extract the stanford dogs image archive and separate the images into training,
    validation and test set
    :param archive_dir_path: path of the "image.tar" and "lists.tar" files to be extracted
    :param target_dir_path: path of the target directory where the files should be extracted to
    :param remove_archives: flag that indicates whether the archives are removed after extraction
    """
 
    # Specify directory paths
    training_dir = os.path.join(target_dir_path, "train")
    validation_dir = os.path.join(target_dir_path, "val")
    test_dir = os.path.join(target_dir_path, "test")    
    
    # Remove directories if they exist
    for directory in [training_dir, validation_dir, test_dir]:
        if os.path.exists(directory):
            shutil.rmtree(directory)
            print(f"Directory {directory} has been removed.")

    # Extract lists.tar archive
    with tarfile.open(os.path.join(archive_dir_path, "lists.tar"), "r") as lists_tar:
        lists_tar.extractall(path=archive_dir_path)
                             
    print("Lists.tar archive has been extracted successfully.")
    
    # Load list.mat files
    train_list_mat = scipy.io.loadmat(os.path.join(archive_dir_path, "train_list.mat"))
    test_list_mat = scipy.io.loadmat(os.path.join(archive_dir_path, "test_list.mat"))
    
    # Collect training file names in a set for fast membership checks
    training_files = {str(array[0][0]) for array in train_list_mat["file_list"]}

    # Collect test file names (later split into validation and test data)
    test_and_val_files = {str(array[0][0]) for array in test_list_mat["file_list"]}
                             
    print("File lists have been read successfully.")
    print("Extracting images.tar archive...")
                             
    # Extract images.tar archive
    with tarfile.open(os.path.join(archive_dir_path, "images.tar"), "r") as images_tar:
        test_val_idx = 0
        for member in tqdm.tqdm(images_tar.getmembers()):
            if member.isreg(): # Skip if TarInfo is not files
                member.name = member.name.split("/", 1)[1] # Retrieve only relevant part of file name
                
                # Extract files to corresponding directories
                if member.name in training_files:
                    images_tar.extract(member, training_dir)
                    
                elif member.name in test_and_val_files: # Every 2nd file goes to the validation data
                    test_val_idx+=1
                    if test_val_idx % 2 != 0:
                        images_tar.extract(member, validation_dir)
                    else:
                        images_tar.extract(member, test_dir)
                             
    print("Images.tar archive has been extracted successfully.")

    # Remove list.mat files
    os.remove(os.path.join(archive_dir_path, "file_list.mat"))
    os.remove(os.path.join(archive_dir_path, "test_list.mat"))
    os.remove(os.path.join(archive_dir_path, "train_list.mat"))
    
    # Remove archive files if flag is set to true
    if remove_archives:
        print("Removing archive files.")
        os.remove(os.path.join(archive_dir_path, "lists.tar"))
        os.remove(os.path.join(archive_dir_path, "images.tar"))

                             
def generate_bar_updater():
    """
    Create a tqdm reporthook function for urlretrieve
    :returns: bar_update function which can be used by urlretrieve 
              to display and update a progress bar
    """
    
    pbar = tqdm.tqdm(total=None)

    # Define progress bar update function
    def bar_update(count, block_size, total_size):
        if pbar.total is None and total_size:
            pbar.total = total_size
        progress_bytes = count * block_size
        pbar.update(progress_bytes - pbar.n)

    return bar_update

Overwriting /mnt/batch/tasks/shared/LS_root/mounts/clusters/amlbriksevnetci/code/Users/BRIKSE/pytorch-use-cases-azure-ml/template_project/notebooks/../src/training/download_utils.py
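
The validation/test separation in `extract_stanford_dogs_archive` relies on a running counter: the first, third, fifth, ... file from the test list goes to the validation set and every other file goes to the test set. The following minimal sketch reproduces that logic on a toy list; `split_alternating` and the file names are hypothetical and not part of `download_utils.py`.

```python
from typing import List, Tuple


def split_alternating(files: List[str]) -> Tuple[List[str], List[str]]:
    """Assign files alternately to a validation and a test list."""
    validation, test = [], []
    for idx, name in enumerate(files, start=1):
        # Odd positions go to validation, even positions to test,
        # mirroring the `test_val_idx % 2 != 0` check in the extraction code
        if idx % 2 != 0:
            validation.append(name)
        else:
            test.append(name)
    return validation, test


val_files, test_files = split_alternating(["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"])
print(val_files)   # ['a.jpg', 'c.jpg', 'e.jpg']
print(test_files)  # ['b.jpg', 'd.jpg']
```

Because the assignment follows the order of the archive members, the split is deterministic; with an odd number of files the validation set ends up one image larger than the test set.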


Download the data to the local compute.

In [11]:
archive_file_list = ["images.tar", "lists.tar"]
force_dir_deletion_list = [True, False] # Delete directory before starting to download images.tar

# Download archive files from the stanford vision website
for i, archive_file in enumerate(archive_file_list):
    download_file(download_url="http://vision.stanford.edu/aditya86/ImageNetDogs/" + archive_file,
                  file_dir="../data",
                  file_name=archive_file,
                  skip_if_dir_exists=False,
                  force_dir_deletion=force_dir_deletion_list[i])

Directory ../data has been removed.
Downloading http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar to ../data/images.tar.
Downloading http://vision.stanford.edu/aditya86/ImageNetDogs/lists.tar to ../data/lists.tar.


In [12]:
# Extract archive files
extract_stanford_dogs_archive()

Lists.tar archive has been extracted successfully.
File lists have been read successfully.
Extracting images.tar archive...



Images.tar archive has been extracted successfully.
Removing archive files.


### Load Data

Create a utility file with functions to generate train, val and test dataloaders and to show example images.

In [None]:
%%writefile $training_folder/data_utils.py
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os
import torch
import torchvision.transforms as transforms

from torchvision import datasets
from typing import Tuple


def load_data(data_dir: str) -> Tuple[dict, dict, list]:
    """
    Load the train, val and test data.
    :param data_dir: path where the images are stored
    :return (dataloaders, dataset_sizes, class_names)
        dataloaders: dictionary of train, val, and test torch dataloaders
        dataset_sizes: dictionary of train, val and test torch dataset lengths
        class_names: list of all classes
    """

    # Data augmentation and normalization for training set
    # Just normalization for validation and test set
    data_transforms = {
        "train": transforms.Compose([
            transforms.RandomResizedCrop(224, scale=(0.5, 1)),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
        ]),
        "val": transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
        ]),
        "test": transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
        ]),
    }
    
    # Dictionary of image datasets
    image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                              data_transforms[x])
                      for x in ["train", "val", "test"]}
    
    # Dictionary of image dataloaders
    dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                                  shuffle=True, num_workers=2) 
                         for x in ["train", "val", "test"]}
    
    # Dictionary of dataset sizes
    dataset_sizes = {x: len(image_datasets[x]) for x in ["train", "val", "test"]}
    
    # List of class names
    class_names = image_datasets["train"].classes
    
    return dataloaders, dataset_sizes, class_names


def show_image(image_path: str) -> None:
    """
    Load and show an example image
    :param image_path: path of the image to be loaded
    """
    # Read in example image
    img = mpimg.imread(image_path)

    # Check format of image
    print(f"Image shape: {img.shape}")

    # Show example image
    imgplot = plt.imshow(img)

In [None]:
# Load data
dataloaders, dataset_sizes, class_names = load_data("../data")

### Explore Data

Display an example image. Note that image shapes vary across the dataset.

In [None]:
show_image(image_path="../data/train/n02085620-Chihuahua/n02085620_11140.jpg")

Display the first batch of 4 images.

In [None]:
def imshow(img):
    img = img / 2 + 0.5 # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0))) # transpose dimensions from Pytorch format to default numpy format
    plt.show()

# Get some random training images
dataiter = iter(dataloaders["train"])
images, labels = next(dataiter)

# Show images
imshow(torchvision.utils.make_grid(images))
# Print labels
print("\n".join("%s" % class_names[labels[j]].split('-')[1] for j in range(4)))
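
`transforms.Normalize` with a mean and standard deviation of 0.5 per channel maps pixel values from [0, 1] to [-1, 1], which is why `imshow` applies `img / 2 + 0.5` before plotting. A small sketch of that round trip on plain floats, where `normalize` and `unnormalize` are illustrative helper names rather than torchvision functions:

```python
def normalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply (value - mean) / std, the per-channel formula used by Normalize."""
    return (value - mean) / std


def unnormalize(value: float) -> float:
    """Invert the normalization for mean=0.5 and std=0.5."""
    return value / 2 + 0.5


for pixel in (0.0, 0.25, 1.0):
    scaled = normalize(pixel)
    restored = unnormalize(scaled)
    print(f"{pixel} -> {scaled} -> {restored}")
```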

Check number of classes.

In [None]:
print(f"Number of classes: {len(class_names)}")

### Upload Data

Upload the data to the default AML datastore.

In [15]:
datastore = ws.get_default_datastore()
datastore.upload(src_dir="../data", target_path="data/stanford_dogs", overwrite=True, show_progress=False)

$AZUREML_DATAREFERENCE_c2732f6b964349b499d99f2cc857dd1e

### Create and Register AML Datasets

Register the data as a dataset in the AML workspace.

In [16]:
# Create dataset object from datastore location
dataset = Dataset.File.from_files(path=(datastore, "data/stanford_dogs"))

In [17]:
# Register the dataset
dataset = dataset.register(workspace=ws,
                           name="stanford_dogs",
                           description="Stanford dogs dataset",
                           create_new_version=True)

# Compute Target

Create a remote compute target to run experiments on. The code below first checks whether a compute target named `cluster_name` already exists; if it does, that target is reused instead of creating a new one.

In [18]:
# Choose a name for the CPU cluster
cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", # CPU
                                                           # vm_size='STANDARD_NC6', # GPU
                                                           max_nodes=4,
                                                           idle_seconds_before_scaledown=2400)
    
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Use get_status() to get a detailed status for the current cluster
print(compute_target.get_status().serialize())

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-18T15:17:11.553000+00:00', 'errors': None, 'creationTime': '2021-01-15T09:55:01.226729+00:00', 'modifiedTime': '2021-01-15T09:55:16.691497+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT2400S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}


# Training Artifacts

Create a training script in the training directory. The training script uses transfer learning with a pretrained ResNet-18 model. The final fully connected layer of this model is replaced to support multiclass classification with 120 target classes. All parameters of the model are then trained on the Stanford Dogs dataset.

In [19]:
%%writefile $training_folder/train.py
import argparse
import copy
import numpy as np
import os
import time
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import urllib

from azureml.core import Run
from torch.optim import lr_scheduler
from torchvision import datasets, models, transforms
from zipfile import ZipFile

from data_utils import load_data
from model import Net

run = Run.get_context()


def train_model(model: torchvision.models,
                criterion: torch.nn.modules.loss,
                optimizer: torch.optim,
                scheduler: torch.optim.lr_scheduler,
                num_epochs: int,
                dataloaders: dict,
                dataset_sizes: dict) -> torchvision.models:
    """
    Train the model on the stanford dogs dataset and track training
    and validation loss and accuracy.
    :param model: pretrained model which will be trained further
    :param criterion: torch loss function
    :param optimizer: torch optimizer
    :param scheduler: torch learning rate scheduler
    :param num_epochs: number of epochs to train the model
    :param dataloaders: dictionary of torch dataloaders
    :param dataset_sizes: dictionary with lengths of the training, val and test set
    :return model: pretrained model with tuned final fully connected layer
    """
    
    # Leverage GPU if available
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    start_time = time.time()

    # Load in weights of model
    best_model_weights = copy.deepcopy(model.state_dict())
    
    best_acc = 0.0

    for epoch in range(num_epochs):
        print("-" * 20)
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print("-" * 20)

        # Each epoch has a training and validation phase
        for phase in ["train", "val"]:
            if phase == "train":
                model.train() # Set model to training mode
            else:
                model.eval() # Set model to evaluate mode

            running_loss = 0.0
            running_correct_preds = 0

            # Iterate over data
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Zero the parameter gradients
                optimizer.zero_grad()

                # Forward pass
                # Track history only if in training phase
                with torch.set_grad_enabled(phase == "train"):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # Backward pass and gradient optimization only if in training phase
                    if phase == "train":
                        loss.backward()
                        optimizer.step()

                # Calculate statistics
                running_loss += loss.item() * inputs.size(0)
                running_correct_preds += torch.sum(preds == labels.data)
                
            # Update learning rate if in training phase
            if phase == "train":
                scheduler.step() 

            # Average loss and accuracy over examples
            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_correct_preds.double() / dataset_sizes[phase]

            print(f"{phase.capitalize()} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}")

            # Deep copy the model
            if phase == "val" and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_weights = copy.deepcopy(model.state_dict())

            # Log the best val accuracy to AML run
            run.log("best_val_acc", float(best_acc))
            print("-" * 20)

    time_elapsed = time.time() - start_time
    
    print(f"Training completed in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s.")
    print(f"Best Val Acc: {best_acc:4f}")
          
    # Load best model weights
    model.load_state_dict(best_model_weights)
          
    return model


def fine_tune_model(num_epochs: int,
                    num_classes: int,
                    dataloaders: dict,
                    dataset_sizes: dict,
                    learning_rate: float,
                    momentum: float) -> torchvision.models:
    """
    Load a pretrained model and reset the final fully connected layer.
    :param num_epochs: number of epochs to train the model
    :param num_classes: number of target classes 
        (supports binary and multiclass classification)
    :param dataloaders: dictionary of torch dataloaders
    :param dataset_sizes: dictionary with lengths of the training, val and test set
    :param learning_rate: learning rate hyperparameter
    :param momentum: momentum hyperparameter
    :return model: pretrained model with tuned final fully connected layer
    """

    print("-" * 20)
    print("START TRAINING")
    print("-" * 20)
    
    # Log the hyperparameter metrics to the AML run
    run.log("lr", float(learning_rate))
    run.log("momentum", float(momentum))

    # Load pretrained model and reset final fully connected layer to have num_classes output neurons
    model_ft = models.resnet18(pretrained=True)
    num_ftrs = model_ft.fc.in_features
    model_ft.fc = nn.Linear(num_ftrs, num_classes)

    # Leverage GPU if available
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model_ft = model_ft.to(device)

    # Specify loss function
    criterion = nn.CrossEntropyLoss()

    # Create SGD optimizer to optimize all parameters
    optimizer_ft = optim.SGD(model_ft.parameters(),
                             lr=learning_rate,
                             momentum=momentum)
                            
    # Create scheduler to decay LR by a factor of 0.1 every 7 epochs
    exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft,
                                           step_size=7,
                                           gamma=0.1)
    
    # Start model training
    model = train_model(model_ft, criterion, optimizer_ft,
                        exp_lr_scheduler, num_epochs, dataloaders,
                        dataset_sizes)

    return model


def main():
    
    print("Torch version:", torch.__version__)
    
    # Retrieve command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, help="Path where the images are stored")
    parser.add_argument("--num_epochs", type=int, default=25, help="Number of epochs to train")
    parser.add_argument("--output_dir", type=str, help="Output directory")
    parser.add_argument("--learning_rate", type=float, default=0.001, help="Learning rate")
    parser.add_argument("--momentum", type=float, default=0.9, help="Momentum")
    args = parser.parse_args()
         
    print("-" * 20)
    print("LOAD DATA")      
    print("-" * 20)
          
    # Load training and validation data
    dataloaders, dataset_sizes, class_names = load_data(args.data_path)
          
    print("Data has been load successfully.")
        
    # Train the model
    model = fine_tune_model(num_epochs=args.num_epochs,
                            num_classes=len(class_names),
                            dataloaders=dataloaders,
                            dataset_sizes=dataset_sizes,
                            learning_rate=args.learning_rate,
                            momentum=args.momentum)
    
    # Save the model
    os.makedirs(args.output_dir, exist_ok=True)
    torch.save(model, os.path.join(args.output_dir, "model.pt"))
    print("-" * 20)
    print(f"Model saved in {args.output_dir}.")


if __name__ == "__main__":
    main()

Overwriting /mnt/batch/tasks/shared/LS_root/mounts/clusters/amlbrikseci/code/Users/BRIKSE/pytorch-use-cases-azure-ml/template_project/notebooks/../src/training/train.py
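
The `StepLR` scheduler configured in `fine_tune_model` multiplies the learning rate by `gamma=0.1` after every `step_size=7` epochs. Assuming the scheduler is stepped once per epoch, as in `train_model`, the learning rate used in a given epoch can be sketched in closed form; `step_lr` is a hypothetical helper for illustration only:

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 7, gamma: float = 0.1) -> float:
    """Learning rate used during a zero-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Default settings of train.py: 25 epochs, initial learning rate 0.001
for epoch in (0, 6, 7, 14, 21, 24):
    print(f"epoch {epoch + 1:2d}: lr = {step_lr(0.001, epoch):.0e}")
```

With these defaults the learning rate drops three times over 25 epochs, so the last four epochs run at roughly one thousandth of the initial rate.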


As an alternative to transfer learning, create a model file that contains the network architecture of a new model.

In [20]:
%%writefile $training_folder/model.py
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Overwriting /mnt/batch/tasks/shared/LS_root/mounts/clusters/amlbrikseci/code/Users/BRIKSE/pytorch-use-cases-azure-ml/template_project/notebooks/../src/training/model.py
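
The layer sizes in `Net` (a flattened feature size of `16 * 5 * 5` and 10 output neurons) correspond to 32x32 inputs and 10 classes, so the architecture would need adjusting before it could be trained on the 224x224 crops and 120 classes used here. The sketch below, assuming the default stride of 1 and no padding for `nn.Conv2d`, traces the spatial size through the two convolution and pooling stages; `conv_out`, `pool_out` and `flattened_features` are hypothetical helpers:

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Spatial output size of max pooling along one dimension."""
    return (size - kernel) // stride + 1


def flattened_features(input_size: int, channels: int = 16) -> int:
    """Number of features entering fc1 for a square input image."""
    size = pool_out(conv_out(input_size, kernel=5))  # conv1 + pool
    size = pool_out(conv_out(size, kernel=5))        # conv2 + pool
    return channels * size * size


print(flattened_features(32))   # 400, i.e. 16 * 5 * 5
print(flattened_features(224))  # 44944, i.e. 16 * 53 * 53
```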


Run the training script locally for 1 epoch for debugging purposes.

In [21]:
!python ../src/training/train.py --data_path "../data" --num_epochs 1 --output_dir "../outputs" --learning_rate 0.003 --momentum 0.9

Torch version: 1.6.0
--------------------
LOAD DATA
--------------------
Data has been load successfully.
--------------------
START TRAINING
--------------------
Attempted to log scalar metric lr:
0.003
Attempted to log scalar metric momentum:
0.9
--------------------
Epoch 1/1
--------------------
Train Loss: 3.6719 Acc: 0.1603
Attempted to log scalar metric best_val_acc:
0.0
--------------------
Val Loss: 2.8961 Acc: 0.3287
Attempted to log scalar metric best_val_acc:
0.32867132867132864
--------------------
Training completed in 27m 14s.
Best Val Acc: 0.328671
Model saved in ../outputs.


# Development Environment

Create an **environment.yml** file which contains all packages needed to create a conda environment for development, training and deployment. If the different stages (development, training and deployment) vary greatly, a separate conda environment file for each stage can be created. In that case they should be prefixed with their respective stage, e.g. **training_environment.yml**.

In [None]:
%%writefile ../environments/conda/environment.yml
name: pytorch-aml-env
dependencies:
- python=3.7.1
- pytorch::pytorch=1.7.0
- pytorch::torchvision=0.8.1
- pip:
    - azureml-defaults
    - azureml-sdk
    - azureml-widgets
    - python-dotenv==0.15.0
channels:
- pytorch

By instantiating an environment object, this conda environment can be used for the remote training run. Alternatively, AML curated environments can also be used. AML curated environments cover common ML scenarios and are backed by cached Docker images. Cached Docker images make the first remote run preparation faster.

In [None]:
# # Display AML Curated Environments
# envs = Environment.list(workspace=ws)

# for env in envs:
#     if env.startswith("AzureML"):
#         print("Name", env)
#         print("packages", envs[env].python.conda_dependencies.serialize_to_string())

In [None]:
# # List workspace environments
# for name, env in ws.environments.items():
#     print(f"Name {name} \t version {env.version}")

# # Retrieve an environment
# env = Environment.get(workspace=ws, name="AzureML-PyTorch-1.3-CPU", version="1")

# # Get base image of retrieved environment
# print(env.docker.base_image)

# print("\n Attributes of retrieved environment:")
# env

On the first run in a given environment, Azure ML spends some time building the environment. On the subsequent runs, Azure ML keeps track of changes and uses the existing environment, resulting in faster run completion.

In [None]:
env = Environment.from_conda_specification(name="pytorch-aml-env",
                                           file_path="../environments/conda/environment.yml")

# Attribute docker.enabled controls whether to use Docker container or host OS for execution.
# This is only relevant for local execution as execution on AML Compute Cluster will always use Docker container.
# env.docker.enabled = True

# Use Python dependencies from your Docker image (as opposed to from conda specification)
# env.python.user_managed_dependencies=True

## Uncomment only one of the options below
# OPTION 1: Use mcr base image
#env.docker.base_image = "mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04:20201113.v1"
#env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04' # GPU base image

# OPTION 2: Use custom base image from workspace-native ACR
#env.docker.base_image = "eafc0c3ef9714c74a4fa655ee90531ba.azurecr.io/base/pytorch"

# OPTION 3: Use custom base image from standalone ACR and use admin user credentials. For this you need to enable admin user in the ACR.
# env.docker.base_image = "sbirkacr.azurecr.io/base/pytorch"
# env.docker.base_image_registry.address = "sbirkacr.azurecr.io"
# env.docker.base_image_registry.username = "sbirkacr"
# env.docker.base_image_registry.password = "<acr-admin-password>" # do not hardcode, retrieve from Key Vault instead

# OPTION 4: Use custom base image from standalone ACR and use service principal authentication.
#           The service principal needs the AcrPull permission on the standalone ACR.
keyvault = ws.get_default_keyvault() # Key Vault associated with the workspace
env.docker.base_image = "sbirkacr.azurecr.io/base/pytorch"
env.docker.base_image_registry.address = "sbirkacr.azurecr.io"
env.docker.base_image_registry.username = keyvault.get_secret(name="sbirk-acr-sp-username")
env.docker.base_image_registry.password = keyvault.get_secret(name="sbirk-acr-sp-password")

# Option 5: Use custom base image from standalone ACR with anonymous access preview feature.
# env.docker.base_image = "sbirkacr.azurecr.io/base/pytorch"

# Create an environment variable.
# This can be retrieved in the training script with os.environ.get("MESSAGE").
# env.environment_variables = {"MESSAGE": "Hello from Azure Machine Learning"}

env.register(workspace=ws)

# Experiment & Run Configuration

Now that the training artifacts are prepared, a model can be trained on the remote compute cluster. You can take advantage of Azure compute to leverage GPUs to cut down your training time. 

### Option 1: "Normal" Script Run

In [None]:
# Create the experiment
experiment = Experiment(workspace=ws, 
                        name="cifar-image-classification-pytorch")

# Create the script run configuration
config = ScriptRunConfig(source_directory="../src/training",
                         script="train.py",
                         compute_target=cpu_cluster_name,
                         arguments=[
                             "--data_path", dataset.as_named_input("input").as_mount(),
                             "--num_epochs", 25,
                             "--output_dir", "./outputs",
                             "--learning_rate", 0.001,
                             "--momentum", 0.9])

config.run_config.environment = env

# Submit the run
run = experiment.submit(config)
RunDetails(run).show()
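
The arguments in the run configuration are passed to the training script on the command line. The following is a minimal sketch of how a script could parse them with `argparse`. The flag names come from the run configuration above, but `train.py` itself is not shown here, so the function `parse_args`, the defaults, and the example mount path `/mnt/input` are assumptions for illustration only.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the run configuration passes to the script.

    Hypothetical helper: the real train.py may parse its arguments differently.
    """
    parser = argparse.ArgumentParser(description="Sketch of train.py argument parsing")
    parser.add_argument("--data_path", type=str, required=True,
                        help="Path where the dataset is mounted on the compute")
    parser.add_argument("--num_epochs", type=int, default=25)
    parser.add_argument("--output_dir", type=str, default="./outputs")
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--momentum", type=float, default=0.9)
    return parser.parse_args(argv)


# Simulate the argument list from the run configuration.
# "/mnt/input" is a made-up stand-in for the mounted dataset path.
args = parse_args(["--data_path", "/mnt/input",
                   "--num_epochs", "25",
                   "--output_dir", "./outputs",
                   "--learning_rate", "0.001",
                   "--momentum", "0.9"])
print(args.num_epochs, args.learning_rate, args.momentum, args.output_dir)
```

Each flag must be followed by exactly one value in the `arguments` list. A missing comma between two string literals silently concatenates them in Python, which would turn two arguments into one.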

### Option 2: Hyperdrive Run

Hyperparameters can be tuned using AML's hyperdrive capability.

The initial learning rate is tuned. The training script uses a LR schedule to decay the learning rate every several epochs starting from that initial learning rate.
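
As a rough illustration of such a schedule, the sketch below computes a step-decayed learning rate in plain Python. The decay interval (`step_size=7`) and the decay factor (`gamma=0.1`) are assumptions chosen for the example; the actual values used by the training script are not stated here.

```python
def step_decay_lr(initial_lr, epoch, step_size=7, gamma=0.1):
    """Return the learning rate for a given epoch under step decay.

    The rate is multiplied by `gamma` once every `step_size` epochs.
    `step_size` and `gamma` are illustrative assumptions.
    """
    return initial_lr * gamma ** (epoch // step_size)


# Show how the rate changes over a few epochs for an initial value of 0.001.
for epoch in (0, 6, 7, 14):
    print(epoch, step_decay_lr(0.001, epoch))
```

Because every later rate is a multiple of the initial one, tuning only the initial learning rate shifts the whole schedule up or down.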

Random sampling is used to try different configuration sets of hyperparameters to maximize the primary metric, the best validation accuracy (best_val_acc).
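
Conceptually, random sampling draws each hyperparameter independently from its search space. The sketch below mimics this with Python's `random` module, using the same ranges as the search space in this section. It is only an illustration of the idea, not how HyperDrive is implemented, and `sample_configuration` is a hypothetical helper.

```python
import random


def sample_configuration(rng):
    """Draw one hyperparameter configuration from the search space."""
    return {
        "num_epochs": rng.choice([10, 15, 20]),          # discrete choice
        "learning_rate": rng.uniform(0.0005, 0.005),     # continuous uniform
        "momentum": rng.uniform(0.9, 0.99),              # continuous uniform
    }


# Seeded generator so the sketch is reproducible; 8 mirrors max_total_runs.
rng = random.Random(0)
configs = [sample_configuration(rng) for _ in range(8)]
for config in configs:
    print(config)
```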

An early termination policy is specified to stop poorly performing runs early. The BanditPolicy is used, which terminates any run whose primary metric does not fall within the slack factor of the best-performing run. In this tutorial, the policy is applied every epoch (since the best_val_acc metric is reported every epoch and evaluation_interval=1). The first policy evaluation is delayed until after the first 10 epochs (delay_evaluation=10).
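
The slack-factor rule can be sketched as follows, assuming a metric that is maximized and the rule as described in the Azure ML documentation: a run is terminated when its best metric is below the best metric across all runs divided by `1 + slack_factor`. The function name and the accuracy values are hypothetical, and this is not the HyperDrive implementation.

```python
def bandit_should_terminate(run_best, overall_best, slack_factor):
    """Decide whether a run is cut under a slack-factor bandit rule.

    Assumes the primary metric is maximized. A run is terminated when its
    best metric so far is below overall_best / (1 + slack_factor).
    """
    threshold = overall_best / (1 + slack_factor)
    return run_best < threshold


# Made-up best_val_acc values for three runs at one evaluation interval.
best_val_acc = {"run_a": 0.80, "run_b": 0.72, "run_c": 0.60}
overall_best = max(best_val_acc.values())

decisions = {
    name: bandit_should_terminate(acc, overall_best, slack_factor=0.15)
    for name, acc in best_val_acc.items()
}
print(decisions)
```

With these values the threshold is 0.80 / 1.15, roughly 0.696, so only `run_c` is terminated.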

In [None]:
param_sampling = RandomParameterSampling({
    "num_epochs": choice(10, 15, 20),
    "learning_rate": uniform(0.0005, 0.005),
    "momentum": uniform(0.9, 0.99),
})

early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=10)

hyperdrive_config = HyperDriveConfig(run_config=config,
                                     hyperparameter_sampling=param_sampling, 
                                     policy=early_termination_policy,
                                     primary_metric_name="best_val_acc",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=8,
                                     max_concurrent_runs=4)

In [None]:
# Start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_config)

In [None]:
# Get portal URL
run.get_portal_url()

In [None]:

RunDetails(hyperdrive_run).show()

In [None]:
run.wait_for_completion(show_output=False)

In [None]:
# Check run metrics, details and file names
print(run.get_metrics())
print(run.get_details())
print(run.get_file_names())

# Model Registration

In [None]:
model_path = "outputs/cifar_net.pt"

model = run.register_model(model_name="cifar10-model",
                           model_path=model_path,
                           model_framework=Model.Framework.PYTORCH,
                           description="cifar10 model")

print(model.name, model.id, model.version, sep="\t")

In [None]:
# Download the model from the run's outputs to a local path
run.download_file(name=model_path, output_file_path=os.path.join("../", model_path))

# Resource Clean Up

In [None]:
# cpu_cluster.delete()


In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication
interactive_auth = InteractiveLoginAuthentication(tenant_id="your-tenant-id")

Additional details on authentication can be found here: https://aka.ms/aml-notebook-auth

> <span style="color:purple; font-weight:bold">! NOTE <br>
> The very first run will take 5-10 minutes to complete. This is because, in the background, a Docker image is built in the cloud, the compute cluster is resized from 0 to 1 node, and the Docker image is downloaded to the compute. Subsequent runs are much quicker (~15 seconds) because the Docker image is cached on the compute. You can test this by resubmitting the code below after the first run has completed.</span>

> <span style="color:purple; font-weight:bold">! NOTE <br>
> The first time you run this script, Azure Machine Learning will build a new docker image from your PyTorch environment. The whole run could take 5-10 minutes to complete. You can see the docker build logs in the widget by selecting the `20_image_build_log.txt` in the log files dropdown. This image will be reused in future runs making them run much quicker.</span>

## Monitor a remote run

In total, the first run takes **approximately 10 minutes**. For subsequent runs, as long as the dependencies in the Azure ML environment don't change, the same image is reused, so the container start-up time is much shorter.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the Azure ML environment. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading takes **about 5 minutes**. 

  This stage happens once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes.**

- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.
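
To make use of these stages, a training script only needs to write its results to the `./outputs` directory and its log files to `./logs`. The sketch below shows one way to do that. The helper `write_run_files`, the file names, and the metric value are assumptions for illustration, and a temporary directory stands in for the script's working directory.

```python
import json
import os
import tempfile


def write_run_files(base_dir, metrics):
    """Write metrics to <base_dir>/outputs and a log line to <base_dir>/logs.

    Hypothetical helper; the file names are not prescribed by Azure ML.
    """
    outputs_dir = os.path.join(base_dir, "outputs")
    logs_dir = os.path.join(base_dir, "logs")
    os.makedirs(outputs_dir, exist_ok=True)
    os.makedirs(logs_dir, exist_ok=True)

    # Files under outputs are copied to the run history after the run.
    metrics_path = os.path.join(outputs_dir, "metrics.json")
    with open(metrics_path, "w") as f:
        json.dump(metrics, f)

    # Files under logs are streamed to the run history during the run.
    log_path = os.path.join(logs_dir, "train.log")
    with open(log_path, "a") as f:
        f.write("training finished\n")

    return metrics_path, log_path


# A temporary directory stands in for the working directory on the compute.
base_dir = tempfile.mkdtemp()
metrics_path, log_path = write_run_files(base_dir, {"best_val_acc": 0.9})  # made-up value
print(metrics_path, log_path)
```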

To run this notebook you will need to create an Azure Machine Learning _compute instance_. The benefits of a compute instance over a local machine (e.g. laptop) or cloud VM are as follows:

* It is pre-configured with the latest data science libraries (e.g. pandas, scikit-learn, TensorFlow, PyTorch) and tools (Jupyter, RStudio). In this tutorial we make extensive use of PyTorch, the AzureML SDK and matplotlib, and we do not need to install these components on a compute instance.
* Notebooks are separate from the compute instance. This means that you can develop your notebook on a small VM size, and then seamlessly scale up the machine (and/or switch to a GPU-enabled one) when needed to train a model.
* You can easily turn on/off the instance to control costs. 