Image Classification MLOps using Amazon SageMaker
This notebook lists all the steps that you need to complete this project.

In [1]:

!pip install smdebug

Collecting smdebug
  Using cached smdebug-1.0.34-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting protobuf<=3.20.3,>=3.20.0 (from smdebug)
  Using cached protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Collecting pyinstrument==3.4.2 (from smdebug)
  Using cached pyinstrument-3.4.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting pyinstrument-cext>=0.2.2 (from pyinstrument==3.4.2->smdebug)
  Using cached pyinstrument_cext-0.2.4-cp312-cp312-linux_x86_64.whl
Using cached smdebug-1.0.34-py2.py3-none-any.whl (280 kB)
Using cached pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
Using cached protobuf-3.20.3-py2.py3-none-any.whl (162 kB)
Installing collected packages: pyinstrument-cext, pyinstrument, protobuf, smdebug
  Attempting uninstall: protobuf
    Found existing installation: protobuf 5.28.3
    Uninstalling protobuf-5.28.3:
      Successfully uninstalled protobuf-5.28.3

In [21]:
import sagemaker
import os, time, json
import boto3
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import ProfilerRule, FrameworkProfile, ProfilerConfig, rule_configs, DebuggerHookConfig, CollectionConfig, Rule

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

print("Role:", role)
print("Default S3 bucket:", sess.default_bucket())

Role: arn:aws:iam::106660882488:role/service-role/AmazonSageMaker-ExecutionRole-20251027T142948
Default S3 bucket: sagemaker-us-east-1-106660882488


## Dataset
This project uses the Dog Breed Classification dataset provided in the Udacity classroom. The dataset contains images from 133 different dog breeds, covering a wide range of sizes, coat types, and geographic origins. The dataset is already split into training, validation, and testing sets, which supports a clean and reproducible ML workflow. Images vary in lighting, pose, and background, making the classification task more realistic and challenging. This variety encourages strong generalization and helps evaluate the effectiveness of transfer learning when adapting a pretrained model like ResNet to a multi-class image classification problem.

In [None]:
# Command to download and unzip data
# Uncomment and run the below two lines of code only the first time when you want to download and upload the data to s3

#!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
#!unzip dogImages.zip


# Upload Data to S3
# Run this cell only the first time, to upload the data once.
LOCAL_DIR = 'dogImages'
S3_BUCKET = sess.default_bucket()
DATA_PREFIX = "dogimages"
input_data_path = sess.upload_data(path=LOCAL_DIR, bucket=S3_BUCKET, key_prefix=DATA_PREFIX)

print(f"input data path: {input_data_path}")

In [9]:

# Set Input and Output path on S3 for the project
train = f"s3://{S3_BUCKET}/{DATA_PREFIX}/train"
valid = f"s3://{S3_BUCKET}/{DATA_PREFIX}/valid"
test  = f"s3://{S3_BUCKET}/{DATA_PREFIX}/test"  # was pointing at /valid

timestamp = time.strftime("%Y%m%d-%H%M%S")
output_path = f"s3://{S3_BUCKET}/{DATA_PREFIX}/outputs/{timestamp}"
code_location = f"s3://{S3_BUCKET}/{DATA_PREFIX}/code/{timestamp}"

print(f"output path: {output_path}")
print(f"code location: {code_location}")



output path: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251104-201831
code location: s3://sagemaker-us-east-1-106660882488/dogimages/code/20251104-201831


## Hyperparameter Tuning
This section focuses on fine-tuning a pretrained ResNet-50 using SageMaker Hyperparameter Optimization (HPO).
The goal is to systematically explore parameter combinations that improve validation performance.
I use hpo.py as the training entry script so SageMaker can run multiple jobs in parallel with different settings.
Key hyperparameters tuned include learning rate, batch size, and epochs.
Learning rate controls convergence speed, batch size affects stability and generalization, and epochs balance training time versus overfitting.
I chose these ranges (learning rate 1e-4 to 1e-2, batch size 8 to 32, and epochs 3 to 10) to stay within GPU memory and runtime limits.
The objective metric for HPO is validation loss (val_loss), since it measures generalization without leaking test data.
SageMaker extracts the metrics printed by the training script (val_loss, test_loss, test_accuracy, train_loss) using the regular expressions listed in metric_definitions.
All training artifacts and logs are stored in versioned S3 paths to ensure full reproducibility.
After tuning completes, the best model and its optimal hyperparameters are retrieved for final evaluation on the test set.
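
As a rough illustration of how a printed metric reaches SageMaker, the sketch below formats a metric line in the name=value; shape that the metric regexes expect and parses it back with Python's re module. The format_metric helper and the sample value are hypothetical; the real print statements live in hpo.py, which is not shown in this notebook.

```python
import re

# Pattern in the same spirit as the notebook's metric regexes:
# a number (optionally in scientific notation) terminated by ';'.
VAL_LOSS_PATTERN = re.compile(r"val_loss=([0-9.eE+-]+);")


def format_metric(name, value):
    """Hypothetical helper: render one metric as 'name=value;'."""
    return f"{name}={value:.6f};"


def parse_val_loss(log_line):
    """Return the val_loss found in a log line, or None if it is absent."""
    match = VAL_LOSS_PATTERN.search(log_line)
    return float(match.group(1)) if match else None


# Sample value chosen only for illustration.
line = "epoch=1 " + format_metric("val_loss", 1.25)
print(line)
print(parse_val_loss(line))
```

If a metric never shows up in the training job's metrics, a quick check like this against one real log line is usually enough to tell whether the print format or the regex is at fault.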

In [10]:
#Declare your HP ranges, metrics etc.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-4, 1e-2),  
    "batch_size": IntegerParameter(8, 32),             
    "epochs": IntegerParameter(3, 10),                 
}

# '-' is placed last in the character class so it is a literal minus sign,
# not the start of an unintended '+' to 'e' character range.
metric_definitions = [
    {"Name": "val_loss",       "Regex": r"val_loss=([0-9.eE+-]+);"},
    {"Name": "test_loss",      "Regex": r"test_loss=([0-9.eE+-]+);"},
    {"Name": "test_accuracy",  "Regex": r"test_accuracy=([0-9.eE+-]+);"},
    {"Name": "train_loss",     "Regex": r"train_loss=([0-9.eE+-]+);"},
]

objective_metric_name = "val_loss"
objective_type = "Minimize"
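
The learning-rate range spans two orders of magnitude, so it is commonly explored on a log scale rather than a linear one. ContinuousParameter accepts a scaling_type argument for this, which the cell above leaves at its default. The sketch below only illustrates what log-uniform sampling looks like; sample_log_uniform is a hypothetical helper and does not reproduce how SageMaker's tuner chooses points.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value so that its log10 is uniform between log10(low) and log10(high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(1000)]

# With log-uniform sampling, about half of the draws fall below the
# geometric midpoint of the range (1e-3 here).
below_midpoint = sum(1 for s in samples if s < 1e-3)
print(min(samples), max(samples), below_midpoint)
```

Under linear sampling of the same range, only about 9% of draws would fall below 1e-3, so small learning rates would rarely be tried.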

In [11]:
# Create estimators for your HPs

INSTANCE_TYPE = "ml.g4dn.xlarge"  

estimator = PyTorch(
    entry_point="hpo.py",
    source_dir=".",
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_type=INSTANCE_TYPE,
    instance_count=1,
    output_path=output_path,         
    code_location=code_location,     
    metric_definitions=metric_definitions,
    hyperparameters={
        "num_classes": 133, # Dataset consists of 133 classes
        "image_size": 224, # Input requirement for the pre trained ResNet-50 model
        "device": "cuda",            
    },
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name=objective_metric_name,
    objective_type=objective_type,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    early_stopping_type="Auto",
    max_jobs=8,
    max_parallel_jobs=2,
)

print([m["Name"] for m in estimator.metric_definitions])

['val_loss', 'test_loss', 'test_accuracy', 'train_loss']


In [14]:
single = PyTorch(
    entry_point="hpo.py",           
    source_dir=".",                 
    role=role,                      
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge", 
    instance_count=1,
    output_path=output_path,
    code_location=code_location,
    metric_definitions=metric_definitions,
    hyperparameters={
        "num_classes": 133, # Dataset consists of 133 classes
        "image_size": 224, # Input requirement for the pre trained ResNet-50 model
        "device": "cuda",            
    },
)

inputs = {
    "training": TrainingInput(s3_data=input_data_path, distribution="FullyReplicated")
}

In [None]:
#launch a single job
single.fit(inputs, logs="All")

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2025-11-04-20-46-39-624


2025-11-04 20:48:15 Starting - Starting the training job
2025-11-04 20:48:15 Pending - Training job waiting for capacity......
2025-11-04 20:49:10 Pending - Preparing the instances for training...
2025-11-04 20:49:44 Downloading - Downloading input data......................
2025-11-04 20:53:47 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-11-04 20:54:01,131 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-11-04 20:54:01,156 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-11-04 20:54:01,171 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-11-04 20:54:01,180 sagemaker_pytorch_container.train

In [None]:
# Launch HPO tuner
tuner.fit(inputs, wait=True)


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating hyperparameter tuning job with name: pytorch-training-251104-2116


.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................!


In [17]:
# Get the best estimators and the best HPs

best_estimator = tuner.best_estimator()

# Get the hyperparameters of the best trained model
print("Best training job name:", best_estimator.latest_training_job.name)
print("\nBest hyperparameters:")
print(best_estimator.hyperparameters())



2025-11-04 21:51:04 Starting - Found matching resource for reuse
2025-11-04 21:51:04 Downloading - Downloading the training image
2025-11-04 21:51:04 Training - Training image download completed. Training in progress.
2025-11-04 21:51:04 Uploading - Uploading generated training model
2025-11-04 21:51:04 Completed - Resource reused by training job: pytorch-training-251104-2116-005-0cd206d9
Best training job name: pytorch-training-251104-2116-003-f46de529

Best hyperparameters:
{'_tuning_objective_metric': '"val_loss"', 'batch_size': '30', 'device': '"cuda"', 'epochs': '6', 'image_size': '224', 'learning_rate': '0.0005854322292922792', 'num_classes': '133', 'sagemaker_container_log_level': '20', 'sagemaker_estimator_class_name': '"PyTorch"', 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"', 'sagemaker_job_name': '"pytorch-training-2025-11-04-21-14-10-378"', 'sagemaker_program': '"hpo.py"', 'sagemaker_region': '"us-east-1"', 'sagemaker_submit_directory': '"s3://sagemaker-us-
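
The values returned by hyperparameters() are all strings, and string-valued entries are JSON-encoded (note 'device': '"cuda"' in the output above). A minimal sketch of decoding them with json.loads follows. decode_hyperparameters is a hypothetical helper, and the raw dictionary copies only a subset of the printed output.

```python
import json


def decode_hyperparameters(raw, keys):
    """Decode the selected JSON-encoded hyperparameter strings into Python values."""
    decoded = {}
    for key in keys:
        value = raw[key]
        try:
            decoded[key] = json.loads(value)
        except (TypeError, json.JSONDecodeError):
            # Leave values that are not JSON strings untouched.
            decoded[key] = value
    return decoded


# Subset of the dictionary printed by best_estimator.hyperparameters().
raw = {
    "batch_size": "30",
    "device": '"cuda"',
    "epochs": "6",
    "image_size": "224",
    "learning_rate": "0.0005854322292922792",
    "num_classes": "133",
}

decoded = decode_hyperparameters(raw, sorted(raw))
print(decoded)
```

Decoding once, up front, avoids passing a value such as '"cuda"' (with embedded quotes) into a later training job.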

## Model Profiling and Debugging
In this step, I fine-tuned the model using the best hyperparameters identified from hyperparameter tuning.
The train_model.py script was used to configure SageMaker Debugger and Profiler for monitoring.
A DebuggerHookConfig was added with save intervals for training and evaluation metrics.
Profiler configuration tracked system metrics every 500 ms for CPU, GPU, and memory usage.
Rules were added to detect vanishing gradients, overfitting, overtraining, and poor initialization.
The ProfilerReport rule automatically generated detailed performance summaries.
Debugger hooks collected losses, gradients, and weights to analyze model convergence.
All profiling and debugging data were stored in S3 for reproducibility and further analysis.
This setup ensures the final model is not only accurate but also computationally efficient and stable.
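
To give a sense of what a rule such as vanishing_gradient looks for, the sketch below flags the steps where the mean absolute gradient falls below a threshold. It is a simplified illustration and not SageMaker's implementation; the function name, the threshold, and the gradient values are all made up for the example.

```python
def vanishing_steps(gradients_by_step, threshold=1e-7):
    """Return the steps whose mean absolute gradient is below the threshold.

    gradients_by_step maps a step number to a flat list of gradient values.
    """
    flagged = []
    for step, grads in sorted(gradients_by_step.items()):
        mean_abs = sum(abs(g) for g in grads) / len(grads)
        if mean_abs < threshold:
            flagged.append(step)
    return flagged


# Illustrative values only: the gradients shrink toward zero by step 200.
example = {
    0: [0.02, -0.01, 0.03],
    100: [1e-4, -2e-4, 5e-5],
    200: [1e-9, -3e-9, 2e-9],
}
print(vanishing_steps(example))
```

When a check like this fires early in training, common remedies include a larger learning rate, a different weight initialization, or unfreezing fewer layers at once.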

In [22]:
# Choose the best hyperparameters

best_hps = best_estimator.hyperparameters()

# Fixed (dataset/model-specific)
num_classes = int(best_hps.get('num_classes'))
image_size  = int(best_hps.get('image_size'))
# String hyperparameters come back JSON-encoded ('"cuda"'), so decode
# the value instead of wrapping it in str(), which keeps the extra quotes.
device      = json.loads(best_hps.get('device'))

# Tuned values from the best estimator

epochs        = int(best_hps.get('epochs'))
batch_size    = int(best_hps.get('batch_size'))
learning_rate = float(best_hps.get('learning_rate'))

best_hyperparameters={
        "num_classes":   num_classes,
        "image_size":    image_size,
        "device":        device,
        "epochs":        epochs,
        "batch_size":    batch_size,
        "learning_rate": learning_rate,
    }

print(best_hyperparameters)

{'num_classes': 133, 'image_size': 224, 'device': 'cuda', 'epochs': 6, 'batch_size': 30, 'learning_rate': 0.0005854322292922792}


In [27]:
# Set up debugging and profiling rules and hooks
debugger_hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100",  
        "eval.save_interval": "10"     
    }
)

profiler_config = ProfilerConfig(system_monitor_interval_millis=500)
rules = [
    # Profiler rule
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),

    # Debugger rules
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]

In [28]:
# Create an estimator

profile_estimator = PyTorch(
    entry_point="train_model.py",
    source_dir=".",
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    output_path=output_path,
    code_location=code_location,
    # '-' is last in each character class so it is a literal minus sign, not a range.
    metric_definitions=[
        {"Name": "val_loss",      "Regex": r"val_loss=([0-9.eE+-]+);"},
        {"Name": "test_loss",     "Regex": r"test_loss=([0-9.eE+-]+);"},
        {"Name": "test_accuracy", "Regex": r"test_accuracy=([0-9.eE+-]+);"},
    ],
    debugger_hook_config=debugger_hook_config,
    profiler_config=profiler_config,
    rules=rules,
    hyperparameters=best_hyperparameters,
)

In [None]:
# Fit the estimator
profile_estimator.fit(inputs, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO

2025-11-04 23:23:57 Starting - Starting the training job
2025-11-04 23:23:57 Pending - Training job waiting for capacity...
2025-11-04 23:24:26 Pending - Preparing the instances for training
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
...
2025-11-04 23:25:00 Downloading - Downloading input data...............
2025-11-04 23:27:21 Downloading - Downloading the training image...........
2025-11-04 23:29:22 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-11-04 23:29:29,202 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-11-04 23:29:29,225 sagemaker-training-toolkit INFO     No Neurons detected (normal if no ne

In [None]:
# TODO: Plot a debugging output.

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [None]:
# TODO: Display the profiler output

## Model Deploying

In [None]:
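# TODO: Deploy your model to an endpoint

# TODO: Adjust the deployment configuration (instance type and number of instances).
# The values below are placeholder assumptions, not a tested configuration.
# profile_estimator is used because it was fitted directly; estimator was only
# passed to the tuner and has no training job of its own to deploy.
predictor = profile_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")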

In [None]:
# TODO: Run a prediction on the endpoint

image = None  # TODO: Your code to load and preprocess image to send to endpoint for prediction
response = predictor.predict(image)

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()