Image Classification MLOps using Amazon Sagemaker
This notebook lists all the steps that you need to complete this project.

In [17]:

!pip install "smdebug==1.0.12" "bokeh==2.3.3"

Collecting smdebug
  Using cached smdebug-1.0.34-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting protobuf<=3.20.3,>=3.20.0 (from smdebug)
  Using cached protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Collecting pyinstrument==3.4.2 (from smdebug)
  Using cached pyinstrument-3.4.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting pyinstrument-cext>=0.2.2 (from pyinstrument==3.4.2->smdebug)
  Using cached pyinstrument_cext-0.2.4-cp312-cp312-linux_x86_64.whl
Using cached smdebug-1.0.34-py2.py3-none-any.whl (280 kB)
Using cached pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
Using cached protobuf-3.20.3-py2.py3-none-any.whl (162 kB)
Installing collected packages: pyinstrument-cext, pyinstrument, protobuf, smdebug
  Attempting uninstall: protobuf
    Found existing installation: protobuf 5.28.3
    Uninstalling protobuf-5.28.3:
      Successfully uninstalled protobuf-5.28.3

In [1]:
import sagemaker
import os, time, json
import boto3
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import ProfilerRule, FrameworkProfile, ProfilerConfig, rule_configs, DebuggerHookConfig, CollectionConfig, Rule

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

print("Role:", role)
print("Default S3 bucket:", sess.default_bucket())

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Role: arn:aws:iam::106660882488:role/service-role/AmazonSageMaker-ExecutionRole-20251027T142948
Default S3 bucket: sagemaker-us-east-1-106660882488


## Dataset
This project uses the Dog Breed Classification dataset provided in the Udacity classroom. The dataset contains images from 133 different dog breeds, covering a wide range of sizes, coat types, and geographic origins. The dataset is already split into training, validation, and testing sets, which supports a clean and reproducible ML workflow. Images vary in lighting, pose, and background, making the classification task more realistic and challenging. This variety encourages strong generalization and helps evaluate the effectiveness of transfer learning when adapting a pre trained model like ResNet to a multi class image classification problem.
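
Before uploading, it can help to confirm that every split really contains the expected number of breed folders. The sketch below is one way to do that. It assumes the unzipped archive has the layout `dogImages/<split>/<class folder>/<image files>`, which is not shown in this notebook, and the helper names `summarize_split` and `summarize_dataset` are hypothetical.

```python
from pathlib import Path

# File extensions counted as images (an assumption about the archive contents).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_split(split_dir):
    """Return {class_folder_name: image_count} for one split directory."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def summarize_dataset(root, splits=("train", "valid", "test")):
    """Summarize every split that exists under root; missing splits are skipped."""
    root = Path(root)
    return {s: summarize_split(root / s) for s in splits if (root / s).is_dir()}


if __name__ == "__main__":
    # Prints nothing if the dogImages folder has not been downloaded yet.
    for split, counts in summarize_dataset("dogImages").items():
        print(f"{split}: {len(counts)} classes, {sum(counts.values())} images")
```

If any split reports fewer than 133 classes, the `num_classes` hyperparameter used later would no longer match the data.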

In [2]:
# Command to download and unzip data
# Uncomment and run the below two lines of code only the first time when you want to download and upload the data to s3

#!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
#!unzip dogImages.zip


# Upload Data to S3
# Run this cell only the first time, to upload the data once.
LOCAL_DIR = 'dogImages'
S3_BUCKET = sess.default_bucket()
DATA_PREFIX = "dogimages"
input_data_path = sess.upload_data(path=LOCAL_DIR, bucket=S3_BUCKET, key_prefix=DATA_PREFIX)

print(f"input data path: {input_data_path}")

input data path: s3://sagemaker-us-east-1-106660882488/dogimages


In [3]:

# Set input and output paths on S3 for the project
train = f"s3://{S3_BUCKET}/{DATA_PREFIX}/train"
valid = f"s3://{S3_BUCKET}/{DATA_PREFIX}/valid"
test = f"s3://{S3_BUCKET}/{DATA_PREFIX}/test"

timestamp = time.strftime("%Y%m%d-%H%M%S")
output_path = f"s3://{S3_BUCKET}/{DATA_PREFIX}/outputs/{timestamp}"
code_location = f"s3://{S3_BUCKET}/{DATA_PREFIX}/code/{timestamp}"

print(f"output path: {output_path}")
print(f"code location: {code_location}")



output path: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251105-224251
code location: s3://sagemaker-us-east-1-106660882488/dogimages/code/20251105-224251
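
The path layout used here keeps the dataset under fixed prefixes and gives each run its own timestamped output and code prefixes, so reruns do not overwrite earlier artifacts. A minimal sketch of that layout as a reusable helper follows. `build_run_paths` is a hypothetical name, and the `test` prefix assumes the uploaded folder contains a `test` split next to `train` and `valid`.

```python
import time


def build_run_paths(bucket, prefix, timestamp=None):
    """Build dataset and per-run S3 URIs that share one bucket and prefix."""
    # A fresh timestamp per call keeps each run's artifacts separate.
    ts = timestamp or time.strftime("%Y%m%d-%H%M%S")
    base = f"s3://{bucket}/{prefix}"
    return {
        "train": f"{base}/train",
        "valid": f"{base}/valid",
        "test": f"{base}/test",
        "output_path": f"{base}/outputs/{ts}",
        "code_location": f"{base}/code/{ts}",
    }


if __name__ == "__main__":
    # "my-bucket" is a placeholder, not the bucket used in this notebook.
    for name, uri in build_run_paths("my-bucket", "dogimages").items():
        print(f"{name}: {uri}")
```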


## Hyperparameter Tuning
This section focuses on fine-tuning a pretrained ResNet-50 using SageMaker Hyperparameter Optimization (HPO).
The goal is to systematically explore parameter combinations that improve validation performance.
I use hpo.py as the training entry script so SageMaker can run multiple jobs in parallel with different settings.
Key hyperparameters tuned include learning rate, batch size, and epochs.
Learning rate controls convergence speed, batch size affects stability and generalization, and epochs balance training time versus overfitting.
I chose these ranges (learning rate 1e-4 to 1e-2, batch size 8–32, epochs 3–10) to stay within GPU memory and runtime limits.
The objective metric for HPO is validation loss (val_loss), since it measures generalization without leaking test data.
SageMaker extracts the metrics printed by the training script (val_loss, test_loss, test_accuracy, train_loss) by matching the regular expressions in metric_definitions against the job logs.
All training artifacts and logs are stored in versioned S3 paths to ensure full reproducibility.
After tuning completes, the best model and its optimal hyperparameters are retrieved for final evaluation on the test set.
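
To make the metric tracking concrete, here is a small sketch of how a `name=value;` regular expression pulls numbers out of log text. The sample log lines are invented for illustration. The real format printed by hpo.py is not shown in this notebook and is only inferred from the metric regexes. `extract_metrics` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

# One pattern per metric, mirroring the "name=value;" shape of the notebook's regexes.
METRIC_PATTERNS = {
    name: re.compile(rf"{name}=([-+0-9.eE]+);")
    for name in ("val_loss", "test_loss", "test_accuracy", "train_loss")
}


def extract_metrics(log_text):
    """Return {metric_name: [values in order of appearance]} for metrics found in the text."""
    found = {}
    for name, pattern in METRIC_PATTERNS.items():
        values = [float(match) for match in pattern.findall(log_text)]
        if values:
            found[name] = values
    return found


# Hypothetical log lines, not real output from hpo.py.
sample_log = (
    "epoch=1 train_loss=2.3145; val_loss=1.9872;\n"
    "epoch=2 train_loss=1.2e+00; val_loss=9.5e-01;\n"
)

if __name__ == "__main__":
    print(extract_metrics(sample_log))
```

A metric that never appears in the logs, such as test_loss in this sample, is simply absent from the result. In a tuning job, an objective metric that never matches leaves the tuner with nothing to optimize, so the print format and the regex need to agree exactly.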

In [4]:
#Declare your HP ranges, metrics etc.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-4, 1e-2),  
    "batch_size": IntegerParameter(8, 32),             
    "epochs": IntegerParameter(3, 10),                 
}

metric_definitions = [
    # Inside a character class, "+-e" is a range from "+" to "e"; placing "-" first keeps it literal.
    {"Name": "val_loss",       "Regex": r"val_loss=([-+0-9.eE]+);"},
    {"Name": "test_loss",      "Regex": r"test_loss=([-+0-9.eE]+);"},
    {"Name": "test_accuracy",  "Regex": r"test_accuracy=([-+0-9.eE]+);"},
    {"Name": "train_loss",     "Regex": r"train_loss=([-+0-9.eE]+);"},
]

objective_metric_name = "val_loss"
objective_type = "Minimize"
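
The learning-rate range spans two orders of magnitude, so the scale on which values are drawn matters. `ContinuousParameter` accepts a `scaling_type` argument (this notebook leaves it at the default). The sketch below only illustrates the difference between linear and logarithmic sampling using plain random draws. It is not SageMaker's search strategy, and the sampler function names are hypothetical.

```python
import math
import random


def sample_linear(low, high, rng):
    """Draw uniformly on the raw scale."""
    return rng.uniform(low, high)


def sample_log_uniform(low, high, rng):
    """Draw uniformly on the log10 scale, then map back to the raw scale."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 10_000
linear_draws = [sample_linear(1e-4, 1e-2, rng) for _ in range(n)]
log_draws = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(n)]

# Share of draws that land in the lower decade, 1e-4 to 1e-3.
frac_linear = sum(x < 1e-3 for x in linear_draws) / n
frac_log = sum(x < 1e-3 for x in log_draws) / n

if __name__ == "__main__":
    print(f"linear scale: {frac_linear:.1%} of draws below 1e-3")
    print(f"log scale:    {frac_log:.1%} of draws below 1e-3")
```

Linear sampling places only about 9% of draws below 1e-3, while log sampling places about half there, which is why logarithmic scaling is a common choice for learning rates.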

In [5]:
# Create estimators for your HPs

INSTANCE_TYPE = "ml.g4dn.xlarge"  

estimator = PyTorch(
    entry_point="hpo.py",
    source_dir=".",
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_type=INSTANCE_TYPE,
    instance_count=1,
    output_path=output_path,         
    code_location=code_location,     
    metric_definitions=metric_definitions,
    hyperparameters={
        "num_classes": 133, # Dataset consists of 133 classes
        "image_size": 224, # Input requirement for the pre trained ResNet-50 model
        "device": "cuda",            
    },
)

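tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name=objective_metric_name,
    objective_type=objective_type,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    early_stopping_type="Auto",  # let SageMaker stop jobs that are unlikely to improve on the best one
    max_jobs=8,                  # total training jobs in the tuning run
    max_parallel_jobs=2,         # training jobs running at the same time
)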

print([m["Name"] for m in estimator.metric_definitions])

['val_loss', 'test_loss', 'test_accuracy', 'train_loss']


In [8]:
single = PyTorch(
    entry_point="hpo.py",           
    source_dir=".",                 
    role=role,                      
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge", 
    instance_count=1,
    output_path=output_path,
    code_location=code_location,
    metric_definitions=metric_definitions,
    hyperparameters={
        "num_classes": 133, # Dataset consists of 133 classes
        "image_size": 224, # Input requirement for the pre trained ResNet-50 model
        "device": "cuda",            
    },
)

inputs = {
    "training": TrainingInput(s3_data=input_data_path, distribution="FullyReplicated")
}

In [None]:
#launch a single job
single.fit(inputs, logs="All")

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2025-11-04-20-46-39-624


2025-11-04 20:48:15 Starting - Starting the training job
2025-11-04 20:48:15 Pending - Training job waiting for capacity......
2025-11-04 20:49:10 Pending - Preparing the instances for training...
2025-11-04 20:49:44 Downloading - Downloading input data......................
2025-11-04 20:53:47 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-11-04 20:54:01,131 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-11-04 20:54:01,156 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-11-04 20:54:01,171 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-11-04 20:54:01,180 sagemaker_pytorch_container.train

In [None]:
# Launch HPO tuner
tuner.fit(inputs, wait=True)


No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config



In [10]:
# Get the best estimators and the best HPs

best_estimator = tuner.best_estimator()

# Get the hyperparameters of the best trained model
print("Best training job name:", best_estimator.latest_training_job.name)
print("\nBest hyperparameters:")
print(best_estimator.hyperparameters())



2025-11-05 23:44:19 Starting - Starting the training job
2025-11-05 23:44:19 Pending - Found matching resource for reuse
2025-11-05 23:44:19 Downloading - Downloading the training image
2025-11-05 23:44:19 Training - Training image download completed. Training in progress.
2025-11-05 23:44:19 Uploading - Uploading generated training model
2025-11-05 23:44:19 Completed - Resource released due to keep alive period expiry
Best training job name: pytorch-training-251105-2246-007-3aabdfab

Best hyperparameters:
{'_tuning_objective_metric': '"val_loss"', 'batch_size': '15', 'device': '"cuda"', 'epochs': '10', 'image_size': '224', 'learning_rate': '0.0004999469127974824', 'num_classes': '133', 'sagemaker_container_log_level': '20', 'sagemaker_estimator_class_name': '"PyTorch"', 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"', 'sagemaker_job_name': '"pytorch-training-2025-11-05-22-44-21-933"', 'sagemaker_program': '"hpo.py"', 'sagemaker_region': '"us-east-1"', 'sagemaker_submit_
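
The dictionary printed above shows that every value comes back as a JSON-encoded string (for example `'"cuda"'` and `'15'`), mixed with SageMaker-internal keys. A minimal sketch of turning it into clean Python values is shown below. `decode_hyperparameters` is a hypothetical helper, and the sample input reuses a subset of the values from the output above.

```python
import json


def decode_hyperparameters(raw, drop_prefixes=("sagemaker_", "_tuning_")):
    """JSON-decode each value and drop keys that SageMaker adds for its own bookkeeping."""
    decoded = {}
    for key, value in raw.items():
        if key.startswith(drop_prefixes):
            continue
        try:
            decoded[key] = json.loads(value)
        except (TypeError, ValueError):
            # Leave the value untouched if it is not valid JSON.
            decoded[key] = value
    return decoded


# Subset of the hyperparameters printed by the cell above.
raw = {
    "_tuning_objective_metric": '"val_loss"',
    "batch_size": "15",
    "device": '"cuda"',
    "epochs": "10",
    "learning_rate": "0.0004999469127974824",
    "sagemaker_program": '"hpo.py"',
}

if __name__ == "__main__":
    print(decode_hyperparameters(raw))
```

Decoding with `json.loads` removes the embedded quotes from string values and restores integers and floats in one step, so no per-key `int()` or `float()` casts are needed.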

## Model Profiling and Debugging
In this step, I fine-tuned the model using the best hyperparameters identified from hyperparameter tuning. The train_model.py script was used to configure SageMaker Debugger and Profiler for monitoring. A DebuggerHookConfig was added with save intervals for training and evaluation metrics.
Profiler configuration tracked system metrics every 500 ms for CPU, GPU, and memory usage.
Rules were added to detect vanishing gradients, overfitting, overtraining, and poor initialization.
The ProfilerReport rule automatically generated detailed performance summaries.
Debugger hooks collected losses, gradients, and weights to analyze model convergence.
All profiling and debugging data were stored in S3 for reproducibility and further analysis.
This setup helps verify that the final model trains stably and uses the instance efficiently, in addition to being accurate.

In [11]:
# Choose the best hyperparameters

best_hps = best_estimator.hyperparameters()

# Fixed (dataset/model-specific)
num_classes = int(best_hps.get('num_classes'))
image_size  = int(best_hps.get('image_size'))
# Hyperparameter values come back JSON-encoded, so str() would keep the
# embedded quotes ('"cuda"'); json.loads decodes the value to 'cuda'.
device      = json.loads(best_hps.get('device'))

# Tuned values from the best estimator

epochs        = int(best_hps.get('epochs'))
batch_size    = int(best_hps.get('batch_size'))
learning_rate = float(best_hps.get('learning_rate'))

best_hyperparameters={
        "num_classes":   num_classes,
        "image_size":    image_size,
        "device":        device,
        "epochs":        epochs,
        "batch_size":    batch_size,
        "learning_rate": learning_rate,
    }

print(best_hyperparameters)

{'num_classes': 133, 'image_size': 224, 'device': 'cuda', 'epochs': 10, 'batch_size': 15, 'learning_rate': 0.0004999469127974824}


In [12]:
# Set up debugging and profiling rules and hooks
debugger_hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100",  
        "eval.save_interval": "10"     
    }
)

profiler_config = ProfilerConfig(system_monitor_interval_millis=500)
rules = [
    # Profiler rule
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),

    # Debugger rules
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]
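
The two save intervals control how often the Debugger hook writes tensors: every 100th step in training mode and every 10th step in evaluation mode. The sketch below lists which step numbers would be saved, assuming a step is saved when its number is a multiple of the interval (my understanding of the default behaviour, not verified against this job). The step counts are hypothetical because the number of batches per epoch is not shown in this notebook.

```python
def saved_steps(total_steps, save_interval):
    """Step numbers (starting at 0) that are multiples of the save interval."""
    return [step for step in range(total_steps) if step % save_interval == 0]


# Hypothetical step counts, chosen only to illustrate the effect of each interval.
TRAIN_STEPS = 450
EVAL_STEPS = 55

train_saved = saved_steps(TRAIN_STEPS, save_interval=100)
eval_saved = saved_steps(EVAL_STEPS, save_interval=10)

if __name__ == "__main__":
    print("train steps saved:", train_saved)
    print("eval steps saved: ", eval_saved)
```

A shorter interval gives the rules more data points to evaluate but increases the volume of tensors written to S3, so the evaluation interval can afford to be tighter because evaluation runs far fewer steps.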

In [13]:
# Create an estimator

profile_estimator = PyTorch(
    entry_point="train_model.py",
    source_dir=".",
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    output_path=output_path,
    code_location=code_location,
    metric_definitions=[
        # "-" is placed first in the character class so it is a literal, not a range.
        {"Name": "val_loss",      "Regex": r"val_loss=([-+0-9.eE]+);"},
        {"Name": "test_loss",     "Regex": r"test_loss=([-+0-9.eE]+);"},
        {"Name": "test_accuracy", "Regex": r"test_accuracy=([-+0-9.eE]+);"},
    ],
    debugger_hook_config=debugger_hook_config,
    profiler_config=profiler_config,
    rules=rules,
    hyperparameters=best_hyperparameters,
)

In [None]:
# Fit the estimator
profile_estimator.fit(inputs, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2025-11-06-03-29-08-196


2025-11-06 03:30:45 Starting - Starting the training job...
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
...
2025-11-06 03:31:46 Pending - Preparing the instances for training...
2025-11-06 03:32:20 Downloading - Downloading input data...........................
2025-11-06 03:36:44 Downloading - Downloading the training image...
2025-11-06 03:37:21 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-11-06 03:37:16,852 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-11-06 03:37:16,873 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-11-06 03:37:16,887 sagemak

In [40]:
# Locate the debugger output on S3

session = boto3.session.Session()

sm = sess.sagemaker_client

job_name = profile_estimator.latest_training_job.name  # or put your job name string here
region = session.region_name
print("Latest Job name:", job_name)
desc = sm.describe_training_job(TrainingJobName=job_name)

# Debugger output S3 path (where tensors are stored)
debug_s3 = desc["DebugHookConfig"]["S3OutputPath"]
print("Debugger S3 path:", debug_s3)

Latest Job name: pytorch-training-2025-11-06-03-29-08-196
Debugger S3 path: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251105-224251
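
The same `describe_training_job` response also reports the state of each Debugger and Profiler rule, which is a quick way to check for anomalous behaviour once the job finishes. The sketch below reads the `DebugRuleEvaluationStatuses` and `ProfilerRuleEvaluationStatuses` lists from such a response. The helper names are hypothetical, and the sample response is invented for illustration; it is not the result of this notebook's job.

```python
def summarize_rule_statuses(desc):
    """Return (rule name, status, details) tuples from a describe_training_job response."""
    rows = []
    for key in ("DebugRuleEvaluationStatuses", "ProfilerRuleEvaluationStatuses"):
        for status in desc.get(key, []):
            rows.append(
                (
                    status.get("RuleConfigurationName", "?"),
                    status.get("RuleEvaluationStatus", "?"),
                    status.get("StatusDetails", ""),
                )
            )
    return rows


def rules_needing_attention(desc):
    """Names of rules whose evaluation found issues or failed to run."""
    return [
        name
        for name, state, _ in summarize_rule_statuses(desc)
        if state in ("IssuesFound", "Error")
    ]


# Invented response fragment, for illustration only.
sample_desc = {
    "DebugRuleEvaluationStatuses": [
        {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "NoIssuesFound"},
        {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "IssuesFound",
         "StatusDetails": "example detail text"},
    ],
    "ProfilerRuleEvaluationStatuses": [
        {"RuleConfigurationName": "ProfilerReport", "RuleEvaluationStatus": "NoIssuesFound"},
    ],
}

if __name__ == "__main__":
    for name, state, details in summarize_rule_statuses(sample_desc):
        print(f"{name}: {state} {details}".rstrip())
    print("Needs attention:", rules_needing_attention(sample_desc))
```

With the live response, the call would be `summarize_rule_statuses(desc)` using the `desc` variable from the cell above.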


In [33]:
!pip show smdebug

Name: smdebug
Version: 1.0.34
Summary: Amazon SageMaker Debugger is an offering from AWS which helps you automate the debugging of machine learning training jobs.
Home-page: https://github.com/awslabs/sagemaker-debugger
Author: AWS DeepLearning Team
Author-email: 
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.12/site-packages
Requires: boto3, numpy, packaging, protobuf, pyinstrument
Required-by: 


**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [60]:
!pip show bokeh

Name: bokeh
Version: 2.4.3
Summary: Interactive plots and applications in the browser from Python
Home-page: https://github.com/bokeh/bokeh
Author: Bokeh Team
Author-email: info@bokeh.org
License: BSD-3-Clause
Location: /opt/conda/lib/python3.12/site-packages
Requires: Jinja2, numpy, packaging, pillow, PyYAML, tornado, typing-extensions
Required-by: 


In [66]:
!pip uninstall -y bokeh


Found existing installation: bokeh 2.4.3
Uninstalling bokeh-2.4.3:
  Successfully uninstalled bokeh-2.4.3


In [None]:
!pip uninstall -y smdebug

In [63]:
!pip install "bokeh<3"

Collecting bokeh<3
  Using cached bokeh-2.4.3-py3-none-any.whl.metadata (14 kB)
Using cached bokeh-2.4.3-py3-none-any.whl (18.5 MB)
Installing collected packages: bokeh
Successfully installed bokeh-2.4.3


In [64]:
import bokeh.plotting as bp
from bokeh.plotting.figure import figure as bokeh_figure

def figure(*args, **kwargs):
    if "plot_height" in kwargs:
        kwargs["height"] = kwargs.pop("plot_height")
    if "plot_width" in kwargs:
        kwargs["width"] = kwargs.pop("plot_width")
    return bokeh_figure(*args, **kwargs)

bp.figure = figure



In [65]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from urllib.parse import urlparse

# job_name, region, and desc were set in the cell that located the Debugger output.

# Create the TrainingJob object
tj = TrainingJob(training_job_name=job_name, region=region)
s3_output = desc["OutputDataConfig"]["S3OutputPath"].rstrip("/")  
u = urlparse(s3_output)
bucket, base_prefix = u.netloc, u.path.lstrip("/")
profiler_prefix = f"s3://{bucket}/{base_prefix}/{job_name}/profiler-output/"
print("Profiler S3 prefix:", profiler_prefix)

# Wait until profiling data becomes available
tj.wait_for_sys_profiling_data_to_be_available()

# Get the trial (profiler output) path; it is the same prefix computed above
trial_path = profiler_prefix
print("Profiler output path:", trial_path)

# Initialize the system metrics reader
system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

# Plot CPU/GPU utilization timeline
view_timeline_charts = TimelineCharts(
    system_metrics_reader=system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],  
    select_events=["total"],          
)

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251105-224251', 'ProfilingIntervalInMilliseconds': 500, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251105-224251/pytorch-training-2025-11-06-03-29-08-196/profiler-output
Profiler S3 prefix: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251105-224251/pytorch-training-2025-11-06-03-29-08-196/profiler-output/


Profiler data from system is available
Profiler output path: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251105-224251/pytorch-training-2025-11-06-03-29-08-196/profiler-output/
select events:['total']
select dimensions:['CPU', 'GPU']
filtered_events:{'total'}
filtered_dimensions:{'CPUUtilization-nodeid:algo-1', 'GPUMemoryUtilization-nodeid:algo-1', 'GPUUtilization-nodeid:algo-1'}
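
The profiler prefix above is assembled by splitting the job's S3 output path into bucket and key prefix, then appending the job name and `profiler-output/`. A minimal sketch of that step as a standalone helper follows. The function names are hypothetical, and the example values are the ones printed in this notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def profiler_output_prefix(s3_output_path, job_name):
    """Build the profiler-output prefix for a job under its S3 output path."""
    bucket, prefix = split_s3_uri(s3_output_path)
    # Skip empty parts so an output path without a key prefix has no double slash.
    parts = [part for part in (prefix, job_name, "profiler-output") if part]
    return f"s3://{bucket}/" + "/".join(parts) + "/"


if __name__ == "__main__":
    print(
        profiler_output_prefix(
            "s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251105-224251",
            "pytorch-training-2025-11-06-03-29-08-196",
        )
    )
```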


## Model Deployment

In [None]:
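# TODO: Deploy your model to an endpoint

# deploy() needs an instance count and an instance type; "ml.m5.large" is an assumed choice, adjust as needed.
# profile_estimator is used here because it was fitted directly with the best hyperparameters.
predictor = profile_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")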

In [None]:
# TODO: Run a prediction on the endpoint

image = None  # TODO: Your code to load and preprocess image to send to endpoint for prediction
response = predictor.predict(image)

In [None]:
# TODO: Remember to shut down/delete your endpoint once your work is done
predictor.delete_endpoint()
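
Once the endpoint returns a response, it still has to be turned into a breed prediction. The sketch below assumes the response can be reduced to a flat list of raw class scores (logits), one per class. The actual response format depends on the inference code, which is not shown in this notebook, and `softmax` and `top_k` are hypothetical helpers.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3, class_names=None):
    """Return the k most likely classes as (label or index, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(class_names[i] if class_names else i, probs[i]) for i in ranked]


if __name__ == "__main__":
    # Four made-up scores stand in for the 133 class scores of the real model.
    example_logits = [0.1, 2.0, -1.0, 0.5]
    for label, prob in top_k(example_logits, k=2):
        print(f"class {label}: {prob:.3f}")
```

Passing the sorted class folder names as `class_names` would map indices back to breed labels, provided the training script assigned class indices in that same order.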