# Pipeline of Digits

This is a starting notebook for solving the "Pipeline of Digits" assignment.


This notebook was created by [Sushant Gautam](https://www.linkedin.com/in/susan-gautam/) as part of the [Machine Learning School Assignment](https://www.ml.school) program.

Let's make sure we are running the latest version of the SageMaker SDK. **Restart the notebook** after you upgrade the library.

In [2]:
!pip install -q --upgrade awscli
!pip install -q --upgrade pip
!pip install -q --upgrade sagemaker
!pip show sagemaker

Name: sagemaker
Version: 2.148.0
Summary: Open source library for training and deploying models on Amazon SageMaker.
Home-page: https://github.com/aws/sagemaker-python-sdk/
Author: Amazon Web Services
Author-email: 
License: Apache License 2.0
Location: /opt/conda/lib/python3.8/site-packages
Requires: attrs, boto3, cloudpickle, google-pasta, importlib-metadata, jsonschema, numpy, packaging, pandas, pathos, platformdirs, protobuf, protobuf3-to-dict, PyYAML, schema, smdebug-rulesconfig, tblib
Required-by: 


In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
import boto3
import sagemaker
import pandas as pd

from pathlib import Path

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()

## Creating the S3 Bucket

Let's create an S3 bucket where you will upload all the information generated by the pipeline. Make sure you set `BUCKET` to the name of the bucket you want to use. This name has to be unique.

If you want to create a bucket in a region other than `us-east-1`, use this command instead:

```
!aws s3api create-bucket --bucket $BUCKET --create-bucket-configuration LocationConstraint=$region
```

The `LocationConstraint` argument should specify the region where you want to create the bucket.
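
The same region handling can be sketched in Python with boto3. The helper name `create_bucket_kwargs` is hypothetical; it only builds the arguments for boto3's S3 `create_bucket` call, assuming that `us-east-1` is the default location and is created without a `LocationConstraint`.

```python
def create_bucket_kwargs(bucket, region):
    """Build keyword arguments for boto3's S3 `create_bucket` call."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location, so it takes no LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("mlschooldata", "eu-west-1"))

# Usage (requires AWS credentials and a globally unique bucket name):
#   import boto3
#   boto3.client("s3", region_name=region).create_bucket(
#       **create_bucket_kwargs(BUCKET, region)
#   )
```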

In [5]:
BUCKET = "mlschooldata"

!aws s3api create-bucket --bucket $BUCKET

{
    "Location": "/mlschooldata"
}


## Loading the dataset

We have two CSV files containing the MNIST dataset. These files come from the [MNIST in CSV](https://www.kaggle.com/datasets/oddrationale/mnist-in-csv) Kaggle dataset.

The `mnist_train.csv` file contains 60,000 training examples and labels. The `mnist_test.csv` contains 10,000 test examples and labels. Each row consists of 785 values: the first value is the label (a number from 0 to 9) and the remaining 784 values are the pixel values (a number from 0 to 255).
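
To make the row layout concrete, here is a minimal sketch that splits one row into its label and a 28x28 image. The helper `split_row` and the synthetic row are illustrative only, and the row-major pixel order is an assumption based on the column names (`1x1`, `1x2`, ..., `28x28`).

```python
def split_row(row):
    """Split one 785-value MNIST row into (label, 28x28 image)."""
    if len(row) != 785:
        raise ValueError(f"expected 785 values, got {len(row)}")
    label, pixels = row[0], row[1:]
    # Scale the 0-255 pixel values to [0, 1] and rebuild 28 rows of 28 pixels.
    scaled = [p / 255.0 for p in pixels]
    image = [scaled[r * 28:(r + 1) * 28] for r in range(28)]
    return label, image


# Synthetic row: label 7 followed by 784 pixels, all zero except one.
row = [7] + [0] * 784
row[1 + 28 * 3 + 5] = 255  # pixel at image row 3, column 5
label, image = split_row(row)
print(label, len(image), len(image[0]), image[3][5])
```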

Let's extract the `dataset.tar.gz` file.

In [6]:
MNIST_FOLDER = "mnist"
DATASET_FOLDER = Path(MNIST_FOLDER) / "dataset"

!tar -xvzf $MNIST_FOLDER/dataset.tar.gz -C $MNIST_FOLDER --no-same-owner

dataset/
dataset/mnist_test.csv
dataset/mnist_train.csv


Let's load both files and look at the first 10 rows of the training set.

In [7]:
train_df = pd.read_csv(DATASET_FOLDER / "mnist_train.csv")
test_df = pd.read_csv(DATASET_FOLDER / "mnist_test.csv")
train_df.head(10)


Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
print(train_df.shape)
print(test_df.shape)

(60000, 785)
(10000, 785)


## Step 1: Preprocessing Step

In [9]:
%%writefile preprocessor.py

import os
import numpy as np
import pandas as pd
import tempfile

from pathlib import Path
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit


# This is the location where the SageMaker Processing job
# will save the input dataset.
BASE_DIR = "/opt/ml/processing"
DATA_FILEPATH_TRAIN = Path(BASE_DIR) / "input" / "mnist_train.csv"
DATA_FILEPATH_TEST = Path(BASE_DIR) / "input" / "mnist_test.csv"


def save_splits(base_dir, train, validation, test):
    """
    One of the goals of this script is to output the three
    dataset splits. This function will save each of these
    splits to disk.
    """
    
    train_path = Path(base_dir) / "train" 
    validation_path = Path(base_dir) / "validation" 
    test_path = Path(base_dir) / "test"
    
    train_path.mkdir(parents=True, exist_ok=True)
    validation_path.mkdir(parents=True, exist_ok=True)
    test_path.mkdir(parents=True, exist_ok=True)
    
    pd.DataFrame(train).to_csv(train_path / "train.csv", header=False, index=False)
    pd.DataFrame(validation).to_csv(validation_path / "validation.csv", header=False, index=False)
    pd.DataFrame(test).to_csv(test_path / "test.csv", header=False, index=False)
    
    
def preprocess(base_dir, train_data_filepath, test_data_filepath):
    """
    Preprocesses the supplied raw dataset and splits it into a train, validation,
    and a test set.
    """
    
    train_df = pd.read_csv(train_data_filepath)
    test_df = pd.read_csv(test_data_filepath)
    
    x_train = (train_df.drop(['label'], axis=1).values)/255.
    x_test = (test_df.drop(['label'], axis=1).values)/255.
    y_train = train_df['label'].values
    y_test = test_df['label'].values
    
    # Hold out 25% of the training data as a stratified validation set.
    validation_split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=46)
    training_idx, validation_idx = next(validation_split.split(x_train, y_train))
    
    x_training = x_train[training_idx]
    y_training = y_train[training_idx]

    x_validation = x_train[validation_idx]
    y_validation = y_train[validation_idx]
    
    # Put the label in the first column, which is the layout train.py expects.
    # The train split excludes the validation rows to avoid data leakage.
    train = np.concatenate((np.expand_dims(y_training, axis=1), x_training), axis=1)
    validation = np.concatenate((np.expand_dims(y_validation, axis=1), x_validation), axis=1)
    test = np.concatenate((np.expand_dims(y_test, axis=1), x_test), axis=1)
    
    save_splits(base_dir, train, validation, test)
        

if __name__ == "__main__":
    preprocess(BASE_DIR, DATA_FILEPATH_TRAIN, DATA_FILEPATH_TEST)


Overwriting preprocessor.py


## Uploading dataset to S3

In [10]:
S3_FILEPATH = f"s3://{BUCKET}/{MNIST_FOLDER}"


TRAIN_SET_S3_URI = sagemaker.s3.S3Uploader.upload(
    local_path=str(DATASET_FOLDER / "mnist_train.csv"), 
    desired_s3_uri=S3_FILEPATH,
)

TEST_SET_S3_URI = sagemaker.s3.S3Uploader.upload(
    local_path=str(DATASET_FOLDER / "mnist_test.csv"), 
    desired_s3_uri=S3_FILEPATH,
)

print(f"Dataset location: {S3_FILEPATH}")
print(f"Train set S3 location: {TRAIN_SET_S3_URI}")
print(f"Test set S3 location: {TEST_SET_S3_URI}")

Dataset location: s3://mlschooldata/mnist
Train set S3 location: s3://mlschooldata/mnist/mnist_train.csv
Test set S3 location: s3://mlschooldata/mnist/mnist_test.csv


## Step 5 - Pipeline Configuration

When we create a SageMaker Pipeline, we can specify a list of parameters that we can use on individual pipeline steps. To read more about these parameters, check [Pipeline Parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-parameters.html).

These are the parameters that we need right now:

* `dataset_location`: This parameter represents the location of the dataset in S3. We will use this parameter to tell the SageMaker Processing Job where the dataset is located. The Processing Job will download the dataset from S3 and make it available on the instance running the script.
* `preprocessor_destination`: We need to define the location where the SageMaker Processing Job will store the output. When it finishes, the Processing Job will copy the script's output to the S3 location specified by this parameter. By default, SageMaker uploads the output of a job to an automatically generated location in S3, but unfortunately, if we rely on that functionality, we can't cache the Processing Step in the Pipeline.
* `baseline_destination`: This parameter represents the location where we will store the baseline data. We will use this baseline data in Session 6 to compute general statistics about the model. This will be helpful to monitor the quality of the model results.
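
Parameter defaults can be overridden for a single execution by passing a dictionary keyed by parameter name. The sketch below only builds that dictionary; the helper `build_overrides` is hypothetical, and the commented usage assumes `pipeline` is a `sagemaker.workflow.pipeline.Pipeline` that declares these two parameters.

```python
def build_overrides(bucket, folder="mnist"):
    """Map pipeline parameter names to S3 locations for one execution."""
    base = f"s3://{bucket}/{folder}"
    return {
        "dataset_location": base,
        "preprocessor_destination": f"{base}/preprocessing",
    }


overrides = build_overrides("mlschooldata")
print(overrides)

# Usage, once the pipeline has been created in SageMaker:
#   execution = pipeline.start(parameters=overrides)
```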

In [23]:
import os
import json
import argparse
import tempfile
import urllib.request

import boto3
import numpy as np
import pandas as pd
import sagemaker

from pathlib import Path

from botocore.exceptions import ClientError
from sagemaker.inputs import FileSystemInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import ScriptProcessor
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.parameters import ParameterInteger, ParameterString, ParameterFloat
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CacheConfig
from sagemaker.pytorch import PyTorch


In [17]:
# train_dataset_location = ParameterString(
#     name="dataset_location",
#     default_value=TRAIN_SET_S3_URI,
# )

# test_dataset_location = ParameterString(
#     name="dataset_location",
#     default_value=TEST_SET_S3_URI,
# )
dataset_location = ParameterString(
    name="dataset_location",
    default_value=S3_FILEPATH,
)

preprocessor_destination = ParameterString(
    name="preprocessor_destination",
    default_value=f"{S3_FILEPATH}/preprocessing",
)

### Setting the Git username and email

In [18]:
! git config --global user.email "sushant@gmail.com"
! git config --global user.name "sushant"

! git config --global credential.helper '!aws codecommit credential-helper $@'
! git config --global credential.UseHttpPath true

### Caching

In [19]:
cache_config = CacheConfig(
    enable_caching=True, 
    expire_after="15d"
)

## Setting up a Processing Step

The first step we need in the pipeline is a [Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) to run the preprocessing script. Check the [ProcessingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.ProcessingStep) SageMaker's SDK documentation for more information. This Processing Step will create a SageMaker Processing Job in the background, run the script, and upload the output to S3. You can use Processing Jobs to perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation.

In [20]:
sklearn_processor = SKLearnProcessor(
    base_job_name="mnist-preprocessing",
    framework_version="0.23-1",
    instance_type="ml.t3.large",
    instance_count=1,
    role=role,
)

In [21]:
preprocess_step = ProcessingStep(
    name="preprocessing",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=dataset_location, destination="/opt/ml/processing/input"),  
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train", destination=preprocessor_destination),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation", destination=preprocessor_destination),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test", destination=preprocessor_destination),
    ],
    code="preprocessor.py",
    cache_config=cache_config
)

## Running the Pipeline

Let's define and run the SageMaker Pipeline. Check [Pipeline Structure and Execution](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-pipeline.html) for more information about how to define a pipeline and [Run a Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/run-pipeline.html) for information about how to run it.

The pipeline uses the parameters we defined before and a single step: the Preprocess Step that will preprocess the dataset.

In [24]:
session1_pipeline = Pipeline(
    name="mnist-session1-pipeline",
    parameters=[
        dataset_location, 
        preprocessor_destination,
    ],
    steps=[
        preprocess_step, 
    ]
)

Submit the pipeline definition to the SageMaker Pipelines service to create a pipeline if it doesn't exist, or update the pipeline if it does.

In [40]:
# session1_pipeline.upsert(role_arn=role)
# execution = session1_pipeline.start()

### Import

### Train Step

In [25]:
# set learning rate
learning_rate = ParameterString(name="learning_rate", default_value="0.001")


Model Creation Reference: https://www.kaggle.com/code/mercedeszkistoth/digit-classifier-vanilla-nn-pytorch

In [26]:
%%writefile train.py

import os
import argparse

import numpy as np
import torch
import torch.nn as nn

from pathlib import Path
from sklearn import metrics
from torch.utils.data import TensorDataset, DataLoader


# Expects a headerless .csv file where the first column is the label.
# Returns a TensorDataset holding the pixel data and the labels.
def csv_to_tensor(file, delimiter=',', skip_header=False):
    # The preprocessing script writes the splits without a header row.
    np_array = np.genfromtxt(file, delimiter=delimiter, skip_header=int(skip_header))
    tensor_labels = torch.from_numpy(np_array[:,0]).long()
    tensor_data = torch.from_numpy(np_array[:,1:]).float()
    return TensorDataset(tensor_data, tensor_labels)

# DataLoaders used for mini-batch training and for evaluation.
# Also returns the input size and the number of classes.
def make_data_loader(train_file, val_file, test_file, batch_size):
    train_dataset = csv_to_tensor(train_file)
    test_dataset = csv_to_tensor(test_file)
    val_dataset = csv_to_tensor(val_file)
    
    input_size = len(val_dataset.tensors[0][0])
    tensor_labels = val_dataset.tensors[1]
    labels = set(label.item() for label in tensor_labels)
    num_classes = len(labels)
    
    train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(dataset=val_dataset)
    test_loader = DataLoader(dataset=test_dataset)
    return train_loader, val_loader, test_loader, input_size, num_classes
    
# Our model, the heart
class DigitClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.hidden_layer_1 = nn.Linear(input_size, hidden_size) 
        self.activation_1 = nn.ReLU()
        self.hidden_layer_2 = nn.Linear(hidden_size, hidden_size) 
        self.activation_2 = nn.ReLU()
        self.output_layer = nn.Linear(hidden_size, num_classes)
        self.probabilities = nn.Softmax(dim=0)
    
    def forward(self, x):
        hidden_1 = self.hidden_layer_1(x)
        hidden_activated_1 = self.activation_1(hidden_1)
        hidden_2 = self.hidden_layer_2(hidden_activated_1)
        hidden_activated_2 = self.activation_2(hidden_2)
        out_layer = self.output_layer(hidden_activated_2)
        return out_layer
    

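def calculate_loss(model, device, input_data, expected_output, loss_fn):
    # Tensor.to() returns a new tensor, so the result must be reassigned.
    input_data = input_data.to(device)
    expected_output = expected_output.to(device)
    output = model(input_data)
    loss = loss_fn(output, expected_output)
    return loss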

def train_model(model, train_dataset, loss_fn, optimizer, batch_size=100, n_epochs=2):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)
    model.train()    
    training_losses = []
    
    for epoch in range(n_epochs):
        for i, (x_batch, y_batch) in enumerate(train_dataset):
            loss = calculate_loss(model, device, x_batch, y_batch, loss_fn)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            training_losses.append(loss.item())
            
            if (i) % batch_size == 0:
                print(f'Epoch: {epoch+1}/{n_epochs}, '+
                      f'Batch_num:{i}/{len(train_dataset)}, '+
                      f'Loss: {loss.item():.4f}')
        print(f'Epoch: {epoch+1}/{n_epochs}, '+
              f'Batch_num:{i}/{len(train_dataset)}, '+
              f'Loss: {loss.item():.4f}')
            
    
    return model, training_losses

def predict(model, input_tensor):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)
    input_tensor = input_tensor.to(device)
    
    with torch.no_grad():
        raw_output = model(input_tensor)
        output_probabilities = model.probabilities(raw_output)
        pred_category = torch.argmax(output_probabilities).item()
        pred_probability = torch.max(output_probabilities).item()
    return pred_category, round(pred_probability, 4)

def evaluate(model, test_dataset):
    actual = []
    predicted = []
    n_correct = 0
    incorrect = []
    
    for i, (x, y) in enumerate(test_dataset):
        actual.append(y.item())
        label, prob = predict(model, x)
        predicted.append(label)
        
        if (label == y.item()):
            n_correct+=1
        else:
            incorrect.append([i, y.item(), label])
    
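    accuracy = n_correct / len(test_dataset)
    print(f'Accuracy on evaluation set: {accuracy*100:.2f}%')
    # Emit the metric in the format the tuner's regex looks for in the logs.
    print(f'val_accuracy: {accuracy:.4f}')

    confusion_matrix = metrics.confusion_matrix(actual, predicted)
    print(f"Confusion Matrix: {confusion_matrix}")
    return accuracy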
    

# Hyperparameters
hidden_size = 16


def train(base_directory, train_path, validation_path, test_path=None, epochs=50, batch_size=32, learning_rate=0.001):
    # The pipeline only wires the train and validation channels, so the test
    # split is optional; without it, the validation file stands in for it.
    if test_path:
        test_file = Path(test_path) / "test.csv"
    else:
        test_file = Path(validation_path) / "validation.csv"

    train_loader, val_loader, test_loader, input_size, num_classes = make_data_loader(
        Path(train_path) / "train.csv",
        Path(validation_path) / "validation.csv",
        test_file,
        batch_size,
    )
   
    model = DigitClassifier(input_size, hidden_size, num_classes)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    model, training_loss = train_model(model, train_loader, loss_fn, optimizer, batch_size, epochs)
    
    evaluate(model, val_loader.dataset)
    
    # Make sure the output directory exists before saving the model.
    model_filepath = Path(base_directory) / "model" / "001"
    model_filepath.parent.mkdir(parents=True, exist_ok=True)
    torch.save(model, model_filepath)
    
if __name__ == "__main__":
    # Any hyperparameters provided by the training job are passed to the entry point
    # as script arguments. SageMaker will also provide a list of special parameters
    # that you can capture here. Here is the full list: 
    # https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/params.py
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_directory", type=str, default="/opt/ml/")
    parser.add_argument("--train_path", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", None))
    parser.add_argument("--validation_path", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION", None))
    parser.add_argument("--test_path", type=str, default=os.environ.get("SM_CHANNEL_TEST", None))
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    args, _ = parser.parse_known_args()
    
    train(
        base_directory=args.base_directory,
        train_path=args.train_path,
        validation_path=args.validation_path,
        test_path=args.test_path,
        epochs=args.epochs,
        batch_size=args.batch_size,
        learning_rate=args.learning_rate,
    )

Overwriting train.py


# Session 2 - Training and Tuning

This session extends the [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) we built in the previous session with a step to train a model. We'll explore the [Training](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-training) and the [Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-tuning) steps.


In [31]:
from sagemaker.tuner import HyperparameterTuner
from sagemaker.inputs import TrainingInput
from sagemaker.parameter import IntegerParameter
from sagemaker.workflow.steps import TrainingStep, TuningStep
from sagemaker.workflow.pipeline_context import PipelineSession

## Step 3 - Switching Between Training and Tuning

There are two ways we can create a model: Using a [Training Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-training) or using a [Tuning Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-tuning).

In this notebook we are going to alternate between both methods, and we'll use the `USE_TUNING_STEP` flag to indicate which method we want to run.

In [32]:
USE_TUNING_STEP = True

## Step 4 - Setting up a Training Step

We can now create a [Training Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-training) that we can add to the pipeline. Check the [TrainingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TrainingStep) SageMaker's SDK documentation for more information. This Training Step will create a SageMaker Training Job in the background, run the training script, and upload the output to S3. 

In [33]:
hyperparameters = {
    "epochs": 50,
    "batch_size": 32,
}

estimator = PyTorch(
    entry_point="train.py",
    hyperparameters=hyperparameters,
    framework_version="1.11.0",
    py_version="py38",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    script_mode=True,
    disable_profiler=True,
    role=role,
)

We can now create the [TrainingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TrainingStep) using the estimator we defined before.

This step will receive the train and validation splits from the preprocessing step as inputs. Notice how we reference both splits using the `preprocess_step` variable. This creates a dependency between the Training Step and the Processing Step that we defined in Session 1. When we build a new Pipeline, we'll see that the Training Step won't run until the Processing Step finishes.
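
The ordering this dependency implies can be illustrated with a small topological sort. This is only a sketch of the idea, not how SageMaker resolves the pipeline graph internally; the function `execution_order` and the step names in the dictionary are illustrative.

```python
def execution_order(dependencies):
    """Return step names so that every step follows its dependencies."""
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    order = []
    while remaining:
        # Steps whose dependencies have all completed can run now.
        ready = sorted(s for s, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("circular dependency between steps")
        for step in ready:
            order.append(step)
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


# "training" reads the outputs of "preprocessing", so it depends on it.
steps = {"preprocessing": set(), "training": {"preprocessing"}}
print(execution_order(steps))
```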

In [34]:
training_step = TrainingStep(
    name="training",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=preprocess_step.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv"
        ),
        "validation": TrainingInput(
            s3_data=preprocess_step.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv"
        )
    },
    cache_config=cache_config
)

## Step 5 - Setting up a Tuning Step

Let's now create a [Tuning Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-tuning) to add it to our pipeline. Check the [TuningStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TuningStep) SageMaker's SDK documentation for more information. This Tuning Step will create a SageMaker Hyperparameter Tuning Job in the background and use the training script to train different variants of the model and choose the best one.

The Tuning Step requires a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) reference to configure the Hyperparameter Tuning Job. In this example, the tuner will use the same `Estimator` we defined to train the model.

Here is the configuration that we'll use to find the best model:

1. `objective_metric_name`: This is the name of the metric the tuner will use to determine the best model.
2. `objective_type`: This is the objective of the tuner. Should it "Minimize" the metric or "Maximize" it? In this example, since we are using the validation accuracy of the model, we want the objective to be "Maximize." If we were using the loss of the model, we would set the objective to "Minimize."
3. `metric_definitions`: Defines how the tuner will determine the value of the metric by looking at the output logs of the training process.
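
To see how a metric definition picks a value out of the training logs, here is a minimal sketch using Python's `re` module with the same pattern this notebook uses for `val_accuracy`. The helper `extract_metric` and the sample log lines are hypothetical; the point is that the training script has to print the metric in a form the regex matches.

```python
import re

# Same pattern as the "Regex" entry used for the objective metric.
pattern = re.compile(r"val_accuracy: ([0-9\.]+)")


def extract_metric(log_line):
    """Return the metric value found in a log line, or None."""
    match = pattern.search(log_line)
    return float(match.group(1)) if match else None


print(extract_metric("Epoch 10 - val_accuracy: 0.9712"))  # matches
print(extract_metric("Accuracy on test set: 97.12%"))     # no match
```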

The tuner expects the list of the hyperparameters you want to explore. You can use subclasses of the [Parameter](https://sagemaker.readthedocs.io/en/stable/api/training/parameter.html#sagemaker.parameter.ParameterRange) class to specify different types of hyperparameters. In this example, we are exploring different values for the `epochs` and `learning_rate` hyperparameters.

Finally, you can control the number of jobs and how many of them will run in parallel using the following two arguments:

* `max_jobs`: Defines the maximum total number of training jobs to start for the hyperparameter tuning job.
* `max_parallel_jobs`: Defines the maximum number of parallel training jobs to start.
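
A rough way to reason about these two settings is to count how many sequential rounds of jobs they imply. The sketch below assumes, for simplicity, that every job in a round takes about the same time; the helper `tuning_rounds` is illustrative and not part of the SageMaker SDK.

```python
import math


def tuning_rounds(max_jobs, max_parallel_jobs):
    """Minimum number of sequential rounds needed to run all jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)


print(tuning_rounds(3, 3))
print(tuning_rounds(10, 3))
```

When `max_jobs` equals `max_parallel_jobs`, every job starts in the same round, so a search strategy that learns from earlier results has nothing to learn from yet. Lowering `max_parallel_jobs` trades wall-clock time for more informed later rounds.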

In [35]:
from sagemaker.tuner import (
    IntegerParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

In [37]:
hyperparameter_ranges = {
    "epochs": IntegerParameter(10, 50),
    "learning_rate": ContinuousParameter(0.01, 0.03),
}

objective_metric_name = "val_accuracy"
objective_type = "Maximize"
metric_definitions = [{"Name": objective_metric_name, "Regex": "val_accuracy: ([0-9\\.]+)"}]
    
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    objective_type=objective_type,
    max_jobs=3,
    max_parallel_jobs=3,
)

We can now create the [TuningStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TuningStep). 

This step will use the tuner we configured before and will receive the train and validation splits from the preprocessing step as inputs. Notice how we reference both splits using the `preprocess_step` variable. This creates a dependency between the Tuning Step and the Processing Step that we defined in Session 1. When we build a new Pipeline, we'll see that the Tuning Step won't run until the Processing Step finishes.

In [38]:
tuning_step = TuningStep(
    name="tuning",
    tuner=tuner,
    inputs={
        "train": TrainingInput(
            s3_data=preprocess_step.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv"
        ),
        "validation": TrainingInput(
            s3_data=preprocess_step.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv"
        )
    },
    cache_config=cache_config
)

## Step 6 - Running the Pipeline

We can now define and run the SageMaker Pipeline, this time using the Training Step or the Tuning Step.

In [39]:
session2_pipeline = Pipeline(
    name="mnist_pipeline",
    parameters=[
        dataset_location, 
        preprocessor_destination,
        # baseline_destination,
    ],
    steps=[
        preprocess_step, 
        tuning_step if USE_TUNING_STEP else training_step
    ]
)

Submit the pipeline definition to the SageMaker Pipelines service to create a pipeline if it doesn't exist, or update the pipeline if it does.

In [None]:
# session2_pipeline.upsert(role_arn=role)
# execution = session2_pipeline.start()

In [None]:
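# NOTE: pipeline_session, tensorboard_output_config, model_name, optim_name,
# and batch_size are not defined in this notebook; define them before running this cell.
pt_estimator = PyTorch(
    base_job_name="training-mnist",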
    entry_point="train.py",
    sagemaker_session=pipeline_session,
    role=role,
    py_version="py38",
    framework_version="1.11.0",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    tensorboard_output_config=tensorboard_output_config,
    use_spot_instances=True,
    max_wait=2000,
    max_run=1800,
    environment={
        "ModelName":model_name,
        "OptimName":optim_name,
        "Learning_rate":learning_rate,
        "Batch_size":batch_size,
        # "ModelName":"resnet18",
        # "OptimName":"RMS",
        # "Learning_rate":"3.5999257898500047e-05",
        # "Batch_size":"256",
        "GIT_USER": "sushant",
        "GIT_EMAIL": "sushantgautm@gmail.com",
    },
)