Image Classification MLOps using Amazon Sagemaker
This notebook walks through all the steps needed to complete this project.

In [2]:

!pip install "smdebug==1.0.12" "bokeh==2.3.3"

Collecting smdebug==1.0.12
  Downloading smdebug-1.0.12-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting bokeh==2.3.3
  Downloading bokeh-2.3.3.tar.gz (10.7 MB)
  Preparing metadata (setup.py) ... done
Collecting pyinstrument==3.4.2 (from smdebug==1.0.12)
  Using cached pyinstrument-3.4.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting pyinstrument-cext>=0.2.2 (from pyinstrument==3.4.2->smdebug==1.0.12)
  Using cached pyinstrument_cext-0.2.4-cp312-cp312-linux_x86_64.whl
Downloading smdebug-1.0.12-py2.py3-none-any.whl (270 kB)
Using cached pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
Building wheels for collected packages: bokeh
  DEPRECATION: Building 'bokeh' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized buil

In [3]:
import sagemaker
import os, time, json
import boto3
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import ProfilerRule, FrameworkProfile, ProfilerConfig, rule_configs, DebuggerHookConfig, CollectionConfig, Rule

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

print("Role:", role)
print("Default S3 bucket:", sess.default_bucket())

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Role: arn:aws:iam::106660882488:role/service-role/AmazonSageMaker-ExecutionRole-20251027T142948
Default S3 bucket: sagemaker-us-east-1-106660882488


## Dataset
This project uses the Dog Breed Classification dataset provided in the Udacity classroom. The dataset contains images from 133 different dog breeds, covering a wide range of sizes, coat types, and geographic origins. The dataset is already split into training, validation, and testing sets, which supports a clean and reproducible ML workflow. Images vary in lighting, pose, and background, making the classification task more realistic and challenging. This variety encourages strong generalization and helps evaluate the effectiveness of transfer learning when adapting a pre trained model like ResNet to a multi class image classification problem.
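
A quick way to confirm the split structure described above is to count class folders and images per split. The sketch below assumes the unzipped archive follows an ImageFolder-style layout (`dogImages/<split>/<class_name>/<image>`); the helper name `summarize_splits` and the list of image suffixes are illustrative assumptions, not part of the project code.

```python
from pathlib import Path

# Assumed set of image extensions; adjust if the dataset uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_splits(root, splits=("train", "valid", "test")):
    """Return {split: (num_classes, num_images)} for an ImageFolder-style tree."""
    summary = {}
    for split in splits:
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            continue  # skip splits that are not present locally
        class_dirs = [d for d in split_dir.iterdir() if d.is_dir()]
        num_images = sum(
            1
            for d in class_dirs
            for f in d.iterdir()
            if f.suffix.lower() in IMAGE_SUFFIXES
        )
        summary[split] = (len(class_dirs), num_images)
    return summary


if __name__ == "__main__":
    # "dogImages" is the local folder created by unzipping the archive.
    for split, (n_classes, n_images) in summarize_splits("dogImages").items():
        print(f"{split}: {n_classes} classes, {n_images} images")
```

If every split reports 133 classes, the folder-to-index mapping should be consistent across training, validation, and testing.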

In [6]:
LOCAL_DIR = 'dogImages'
S3_BUCKET = sess.default_bucket()
DATA_PREFIX = "dogimages"

In [5]:
# Download and unzip the data, then upload it to S3.
# Run this cell only the first time; afterwards, use the next cell
# to set the input path without repeating the download and upload.

!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip
input_data_path = sess.upload_data(path=LOCAL_DIR, bucket=S3_BUCKET, key_prefix=DATA_PREFIX)
print(f"input data path: {input_data_path}")

input data path: s3://sagemaker-us-east-1-106660882488/dogimages/


In [8]:
# to set the input data path without having to download and upload data each time
input_data_path = f"s3://{S3_BUCKET}/{DATA_PREFIX}/"
print(f"input data path: {input_data_path}")

input data path: s3://sagemaker-us-east-1-106660882488/dogimages/


In [9]:

# Set Input and Output path on S3 for the project
train = f"s3://{S3_BUCKET}/{DATA_PREFIX}/train"
valid = f"s3://{S3_BUCKET}/{DATA_PREFIX}/valid"
test  = f"s3://{S3_BUCKET}/{DATA_PREFIX}/test"  # was pointing at the valid split

timestamp = time.strftime("%Y%m%d-%H%M%S")
output_path = f"s3://{S3_BUCKET}/{DATA_PREFIX}/outputs/{timestamp}"
code_location = f"s3://{S3_BUCKET}/{DATA_PREFIX}/code/{timestamp}"

print(f"output path: {output_path}")
print(f"code location: {code_location}")



output path: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121
code location: s3://sagemaker-us-east-1-106660882488/dogimages/code/20251106-172121


## Hyperparameter Tuning
This section focuses on fine-tuning a pretrained ResNet-50 using SageMaker Hyperparameter Optimization (HPO).
The goal is to systematically explore parameter combinations that improve validation performance.
I use hpo.py as the training entry script so SageMaker can run multiple jobs in parallel with different settings.
Key hyperparameters tuned include learning rate, batch size, and epochs.
Learning rate controls convergence speed, batch size affects stability and generalization, and epochs balance training time versus overfitting.
I chose these ranges (learning rate 1e-4 to 1e-2, batch size 8–32, epochs 3–10) to stay within GPU memory and runtime limits.
The objective metric for HPO is validation loss (val_loss), since it measures generalization without leaking test data.
SageMaker tracks the metrics that the training script prints and that match the regexes in metric_definitions (val_loss, train_loss, test_loss, test_accuracy).
All training artifacts and logs are stored in versioned S3 paths to ensure full reproducibility.
After tuning completes, the best model and its optimal hyperparameters are retrieved for final evaluation on the test set.
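
To illustrate how a printed metric is picked up, here is a minimal sketch of the regex matching step using Python's `re` module. The log line is hypothetical; the real format depends on what hpo.py prints. Placing the hyphen last in the character class keeps it a literal `-` rather than the start of a character range.

```python
import re

# Hyphen is last in the class, so it is a literal '-' and not a range.
NUMBER = r"[0-9.eE+-]+"
VAL_LOSS = re.compile(rf"val_loss=({NUMBER});")

# Hypothetical log line; the actual output of hpo.py may look different.
log_line = "epoch=3; train_loss=0.8421; val_loss=1.2e-01; test_accuracy=0.9630;"

match = VAL_LOSS.search(log_line)
val_loss = float(match.group(1)) if match else None
print("parsed val_loss:", val_loss)
```

The capture group must contain only the number, because SageMaker converts the captured text to a float when it records the metric.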

In [10]:
#Declare your HP ranges, metrics etc.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-4, 1e-2),  
    "batch_size": IntegerParameter(8, 32),             
    "epochs": IntegerParameter(3, 10),                 
}

metric_definitions = [
    # The hyphen is placed last in each character class so it is a literal '-', not a range
    {"Name": "val_loss",       "Regex": r"val_loss=([0-9.eE+-]+);"},
    {"Name": "test_loss",      "Regex": r"test_loss=([0-9.eE+-]+);"},
    {"Name": "test_accuracy",  "Regex": r"test_accuracy=([0-9.eE+-]+);"},
    {"Name": "train_loss",     "Regex": r"train_loss=([0-9.eE+-]+);"},
]

objective_metric_name = "val_loss"
objective_type = "Minimize"
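
The learning-rate range above spans two orders of magnitude, so it matters how values are drawn from it. The sketch below compares linear and log-uniform sampling with plain Python; it does not reproduce SageMaker's search strategy, and the helper names are illustrative. In the SageMaker Python SDK, `ContinuousParameter` also accepts a `scaling_type` argument (for example `"Logarithmic"`), which may be worth trying for this range.

```python
import math
import random


def sample_linear(low, high, rng):
    """Draw uniformly on the original scale."""
    return rng.uniform(low, high)


def sample_log_uniform(low, high, rng):
    """Draw uniformly in log10 space, then map back to the original scale."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 10_000
linear = [sample_linear(1e-4, 1e-2, rng) for _ in range(n)]
log_uniform = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(n)]

# Share of samples that fall in the lower decade (below 1e-3).
frac_linear = sum(x < 1e-3 for x in linear) / n
frac_log = sum(x < 1e-3 for x in log_uniform) / n
print(f"share below 1e-3: linear={frac_linear:.2f}, log-uniform={frac_log:.2f}")
```

Linear sampling spends most of its draws between 1e-3 and 1e-2, while log-uniform sampling covers both decades about equally.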

In [11]:
# Create estimators for your HPs

INSTANCE_TYPE = "ml.g4dn.xlarge"  

estimator = PyTorch(
    entry_point="hpo.py",
    source_dir=".",
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_type=INSTANCE_TYPE,
    instance_count=1,
    output_path=output_path,         
    code_location=code_location,     
    metric_definitions=metric_definitions,
    hyperparameters={
        "num_classes": 133, # Dataset consists of 133 classes
        "image_size": 224, # Input requirement for the pre trained ResNet-50 model
        "device": "cuda",            
    },
)

tuner = HyperparameterTuner(
    estimator=estimator,
    metric_definitions=metric_definitions,
    early_stopping_type="Auto",
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type=objective_type,
    max_jobs=8,            
    max_parallel_jobs=2,   
)

print([m["Name"] for m in estimator.metric_definitions])

['val_loss', 'test_loss', 'test_accuracy', 'train_loss']


In [12]:
single = PyTorch(
    entry_point="hpo.py",           
    source_dir=".",                 
    role=role,                      
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge", 
    instance_count=1,
    output_path=output_path,
    code_location=code_location,
    metric_definitions=metric_definitions,
    hyperparameters={
        "num_classes": 133, # Dataset consists of 133 classes
        "image_size": 224, # Input requirement for the pre trained ResNet-50 model
        "device": "cuda",            
    },
)

In [13]:
# Format the input data
inputs = {
    "training": TrainingInput(s3_data=input_data_path, distribution="FullyReplicated")
}

In [None]:
#launch a single job, to test first if everything is setup correctly
single.fit(inputs, logs="All")

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2025-11-04-20-46-39-624


2025-11-04 20:48:15 Starting - Starting the training job
2025-11-04 20:48:15 Pending - Training job waiting for capacity......
2025-11-04 20:49:10 Pending - Preparing the instances for training...
2025-11-04 20:49:44 Downloading - Downloading input data......................
2025-11-04 20:53:47 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-11-04 20:54:01,131 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-11-04 20:54:01,156 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-11-04 20:54:01,171 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-11-04 20:54:01,180 sagemaker_pytorch_container.train

In [None]:
# Launch HPO tuner
tuner.fit(inputs, wait=True)


No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................!


In [15]:
# Get the best estimators and the best HPs

best_estimator = tuner.best_estimator()

# Get the hyperparameters of the best trained model
print("Best training job name:", best_estimator.latest_training_job.name)
print("\nBest hyperparameters:")
print(best_estimator.hyperparameters())



2025-11-06 17:47:56 Starting - Starting the training job
2025-11-06 17:47:56 Pending - Preparing the instances for training
2025-11-06 17:47:56 Downloading - Downloading the training image
2025-11-06 17:47:56 Training - Training image download completed. Training in progress.
2025-11-06 17:47:56 Uploading - Uploading generated training model
2025-11-06 17:47:56 Completed - Resource reused by training job: pytorch-training-251106-1724-004-d2763ed2
Best training job name: pytorch-training-251106-1724-002-46c315c1

Best hyperparameters:
{'_tuning_objective_metric': '"val_loss"', 'batch_size': '29', 'device': '"cuda"', 'epochs': '9', 'image_size': '224', 'learning_rate': '0.0003014176165055876', 'num_classes': '133', 'sagemaker_container_log_level': '20', 'sagemaker_estimator_class_name': '"PyTorch"', 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"', 'sagemaker_job_name': '"pytorch-training-2025-11-06-17-22-31-386"', 'sagemaker_program': '"hpo.py"', 'sagemaker_region': '"us-e

## Model Profiling and Debugging
In this step, I fine-tuned the model using the best hyperparameters identified during hyperparameter tuning.
The train_model.py script was used to configure SageMaker Debugger and Profiler for monitoring.
A DebuggerHookConfig was added with save intervals for training and evaluation metrics.
Profiler configuration tracked system metrics every 500 ms for CPU, GPU, and memory usage.
Rules were added to detect vanishing gradients, overfitting, overtraining, and poor initialization.
The ProfilerReport rule automatically generated detailed performance summaries.
Debugger hooks collected losses, gradients, and weights to analyze model convergence.
All profiling and debugging data were stored in S3 for reproducibility and further analysis.
This setup ensures the final model is not only accurate but also computationally efficient and stable.
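
To make the vanishing-gradient rule more concrete, here is a simplified sketch of the kind of check such a rule performs: flag layers whose mean absolute gradient falls below a threshold. It is an illustration in plain Python, not the SageMaker Debugger implementation. The function names, the layer names, the gradient values, and the `1e-7` threshold are all assumptions made for the example.

```python
from statistics import mean


def mean_abs(values):
    """Mean of the absolute values of a sequence of numbers."""
    return mean(abs(v) for v in values)


def vanishing_layers(gradients_by_layer, threshold=1e-7):
    """Return names of layers whose mean absolute gradient is below threshold."""
    return [
        name
        for name, grads in gradients_by_layer.items()
        if mean_abs(grads) < threshold
    ]


# Hypothetical gradient samples for two layers at one training step.
step_gradients = {
    "layer4.conv3.weight": [1e-9, -3e-9, 2e-9],
    "fc.weight": [0.02, -0.01, 0.015],
}
print("layers flagged:", vanishing_layers(step_gradients))
```

In the real job, the Debugger hook saves gradient tensors at the configured intervals and the rule evaluates them in a separate processing container.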

In [16]:
# Choose the best hyperparameters

best_hps = best_estimator.hyperparameters()

# Fixed (dataset/model-specific)
num_classes = int(best_hps.get('num_classes'))
image_size  = int(best_hps.get('image_size'))
# String hyperparameters come back JSON-encoded ('"cuda"'), so decode to get plain 'cuda'
device      = json.loads(best_hps.get('device'))

# Tuned values from the best estimator

epochs        = int(best_hps.get('epochs'))
batch_size    = int(best_hps.get('batch_size'))
learning_rate = float(best_hps.get('learning_rate'))

best_hyperparameters={
        "num_classes":   num_classes,
        "image_size":    image_size,
        "device":        device,
        "epochs":        epochs,
        "batch_size":    batch_size,
        "learning_rate": learning_rate,
    }

print(best_hyperparameters)

{'num_classes': 133, 'image_size': 224, 'device': 'cuda', 'epochs': 9, 'batch_size': 29, 'learning_rate': 0.0003014176165055876}


In [17]:
# Set up debugging and profiling rules and hooks
debugger_hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100",  
        "eval.save_interval": "10"     
    }
)

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,  # sample system metrics every 500 ms
    framework_profile_params=FrameworkProfile(num_steps=10),
)
rules = [
    # Profiler rule
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),

    # Debugger rules
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]

Framework profiling will be deprecated from tensorflow 2.12 and pytorch 2.0 in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [18]:
# Create an estimator

profile_estimator = PyTorch(
    entry_point="train_model.py",
    source_dir=".",
    role=role,
    framework_version="1.13",
    py_version="py39",
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    output_path=output_path,
    code_location=code_location,
    metric_definitions=[
        # The hyphen is placed last in each character class so it is a literal '-', not a range
        {"Name": "val_loss",      "Regex": r"val_loss=([0-9.eE+-]+);"},
        {"Name": "test_loss",     "Regex": r"test_loss=([0-9.eE+-]+);"},
        {"Name": "test_accuracy", "Regex": r"test_accuracy=([0-9.eE+-]+);"},
    ],
    debugger_hook_config=debugger_hook_config,
    profiler_config=profiler_config,
    rules=rules,
    hyperparameters=best_hyperparameters,
)

In [19]:
# Fit the estimator
profile_estimator.fit(inputs, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2025-11-06-18-48-36-639


2025-11-06 18:50:14 Starting - Starting the training job...
2025-11-06 18:50:27 Pending - Training job waiting for capacity
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
.........
2025-11-06 18:52:20 Downloading - Downloading input data....................................
2025-11-06 18:58:21 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2025-11-06 18:58:10,860 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-11-06 18:58:10,883 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-11-06 18:58:10,896 sagemaker_pytorch_container.training INFO     Block until all host 

In [27]:
# Locate the debugger output on S3

session = boto3.session.Session()

sm = sess.sagemaker_client

job_name = profile_estimator.latest_training_job.name  
output_path = profile_estimator.output_path
region = session.region_name
print("Latest Job name:", job_name)
print("output path :", output_path)
print("region :", region)
desc = sm.describe_training_job(TrainingJobName=job_name)

# Debugger output S3 path (where tensors are stored)
debug_s3 = desc["DebugHookConfig"]["S3OutputPath"]
print("Debugger S3 path:", debug_s3)

Latest Job name: pytorch-training-2025-11-06-18-48-36-639
output path : s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121
region : us-east-1
Debugger S3 path: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121


In [21]:
!pip show smdebug

Name: smdebug
Version: 1.0.12
Summary: Amazon SageMaker Debugger is an offering from AWS which helps you automate the debugging of machine learning training jobs.
Home-page: https://github.com/awslabs/sagemaker-debugger
Author: AWS DeepLearning Team
Author-email: 
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.12/site-packages
Requires: boto3, numpy, packaging, protobuf, pyinstrument
Required-by: 


In [22]:
!pip show bokeh

Name: bokeh
Version: 2.3.3
Summary: Interactive plots and applications in the browser from Python
Home-page: http://github.com/bokeh/bokeh
Author: Bokeh Team
Author-email: info@bokeh.org
License: BSD-3-Clause
Location: /opt/conda/lib/python3.12/site-packages
Requires: Jinja2, numpy, packaging, pillow, python-dateutil, PyYAML, tornado, typing_extensions
Required-by: 


In [25]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from urllib.parse import urlparse

# Create the TrainingJob object (job_name, region, and desc come from the earlier cell)
tj = TrainingJob(training_job_name=job_name, region=region)

# Build the profiler output prefix from the job's S3 output path
s3_output = desc["OutputDataConfig"]["S3OutputPath"].rstrip("/")
u = urlparse(s3_output)
bucket, base_prefix = u.netloc, u.path.lstrip("/")
profiler_prefix = f"s3://{bucket}/{base_prefix}/{job_name}/profiler-output/"
print("Profiler S3 prefix:", profiler_prefix)

# Wait until profiling data becomes available
tj.wait_for_sys_profiling_data_to_be_available()

# The trial (profiler output) path is the same prefix
trial_path = profiler_prefix
print("Profiler output path:", trial_path)

# Initialize the system metrics reader
system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

# Plot CPU/GPU utilization timeline
view_timeline_charts = TimelineCharts(
    system_metrics_reader=system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],  
    select_events=["total"],          
)

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121/pytorch-training-2025-11-06-18-48-36-639/profiler-output
Profiler S3 prefix: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121/pytorch-t

# SageMaker collected:

## GPUMemoryUtilization (how much GPU memory is being used):

The X-axis (time in ms) covers the entire training run. The Y-axis (0–100) shows GPU memory usage as a percentage.
The blue line shows usage fluctuating between 10% and 40%, with many dips to zero.

This pattern indicates:
Low and unstable GPU memory utilization, meaning the GPU is not fully engaged most of the time, i.e. it is waiting for input data (an I/O bottleneck).
The quick oscillations show that the model processes batches intermittently; each spike represents a batch being processed. The flat zero portions and sharp drops show data stalls, likely while waiting on the DataLoader.

Improvements:
Increase batch_size if memory allows.
Tune the DataLoader:
train_loader = DataLoader(
    train_data,
    batch_size=args.batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True
)
Use mixed precision (torch.autocast) to improve throughput.

## CPUUtilization (percentage of CPU usage)
X-axis: time during training (in milliseconds). Y-axis: CPU utilization percentage.
The blue line shows CPU usage over time, fluctuating mostly between 35% and 45%, with periodic short spikes up to 60–70%.

This pattern indicates:

Moderate, consistent CPU usage (30–50%)
The CPU is actively preparing batches for the GPU, which is normal. It indicates the data loader is doing work (image augmentation, normalization, etc.) at each step.

Regular short spikes
These are likely at the start of each epoch or between validation phases, when data loading or logging briefly peaks.

No sustained CPU saturation (near 100%)
Overall CPU usage is not saturated. However, since GPU utilization is low and erratic, the input pipeline is still not fast enough to keep the GPU continuously fed; aggregate CPU usage can stay moderate even when individual DataLoader workers are the limiting factor.

Improvements:
Keep num_workers moderate, for example close to the 4 vCPUs of ml.g4dn.xlarge, so that worker processes do not oversubscribe the CPU.

## GPUUtilization (GPU compute usage)

X-axis: time in milliseconds during training. Y-axis: GPU utilization (% of compute capacity).
Blue line: GPU usage over time, fluctuating roughly between 10% and 40%, with frequent drops to near zero.

This pattern indicates:

Underutilized GPU
The GPU rarely exceeds 40% utilization and often drops to 0%, meaning it spends a lot of time idle. This is a clear signal of an I/O or CPU data bottleneck: the GPU is waiting for the next batch to arrive from the CPU/DataLoader. With the default File input mode, the data is copied from S3 to the instance before training starts (the "Downloading input data" phase in the log), so the stalls more likely come from local reads, image decoding, and augmentation than from S3 itself.

Unstable compute pattern
The irregular spikes indicate batches are processed in bursts rather than continuously.
This can happen if the DataLoader or augmentation pipeline is slow, or if the batch size is too small.

No signs of compute saturation
Ideally, GPU utilization should hover between 70–90% during training. That indicates efficient GPU use and minimal waiting.
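
The visual reading of the chart can be turned into numbers with a small summary over the sampled utilization values. The sketch below is a minimal illustration: the sample list is hypothetical (it is not the profiler output of this job), and the 5% idle cutoff and 70% target are assumptions chosen to match the discussion above.

```python
def utilization_summary(samples, idle_below=5.0, target=70.0):
    """Summarize utilization samples given in percent (0-100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    n = len(samples)
    return {
        "mean": sum(samples) / n,
        # Share of samples where the device was effectively idle.
        "idle_fraction": sum(s < idle_below for s in samples) / n,
        # Share of samples at or above the desired utilization level.
        "at_target_fraction": sum(s >= target for s in samples) / n,
    }


# Hypothetical GPU utilization samples, one per profiling interval.
gpu_samples = [0, 35, 12, 0, 40, 28, 0, 0, 33, 18]
summary = utilization_summary(gpu_samples)
print(summary)
```

A high idle fraction combined with a zero at-target fraction is the numeric counterpart of the bursty, underutilized pattern described here.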

Improvement:
Increase data throughput and batch efficiency

1. Optimize the Data Loader

train_loader = DataLoader(
    train_data,
    batch_size=args.batch_size,
    shuffle=True,
    num_workers=4,          # or more depending on instance cores
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True
)
2. Increase the batch size



In [49]:
# Download profiler report

# rule output path
rule_output_path = f"{output_path}/{job_name}/rule-output/"
print(f"profiler report path in s3 {rule_output_path}")

# Download the profiler report from S3
# ! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive


profiler report path in s3 s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121/pytorch-training-2025-11-06-18-48-36-639/rule-output/
download: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121/pytorch-training-2025-11-06-18-48-36-639/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb to ProfilerReport/profiler-output/profiler-report.ipynb
download: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121/pytorch-training-2025-11-06-18-48-36-639/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json to ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json
download: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121/pytorch-training-2025-11-06-18-48-36-639/rule-output/ProfilerReport/profiler-output/profiler-report.html to ProfilerReport/profiler-output/profiler-report.html
download: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/202

## Model Deploying
I packaged the trained weights as model.pth inside the SageMaker model artifact (model.tar.gz) produced by the training job.

I wrote inference.py, which reconstructs ResNet-50 with the correct head size (133 classes), loads model.pth, applies the same eval transforms, and returns top-k predictions.

I pulled the artifact URI from the training job description with desc["ModelArtifacts"]["S3ModelArtifacts"] and stored it in s3_model_path.

I created a PyTorchModel with entry_point="inference.py" and deployed a real-time endpoint with one instance.

I invoked the endpoint by reading an image directly from S3 with the boto3 S3 client's get_object(...), setting the serializer content type to image/jpeg, and calling predictor.predict(payload).

The test image was s3://sagemaker-us-east-1-106660882488/dogimages/test/001.Affenpinscher/Affenpinscher_00003.jpg.

The first response returned numeric class IDs because no label mapping was included.

I built labels.json from the training folder using ImageFolder.class_to_idx, reversed it to idx_to_class, and saved it so that it can be packaged next to model.pth and predictions can map indices to breed names.

The readable output shows top-1 “Affenpinscher” at ~98% confidence, confirming that the mapping and the serving pipeline work as intended.

The result is an endpoint that accepts raw images and returns top-k predictions, with a clear path to update the model artifact and redeploy after retraining.
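
The post-processing step (logits to labeled top-k predictions) can be sketched in plain Python as follows. This is not the project's inference.py, which works on PyTorch tensors; the function names, the logits, and the three-entry label map are hypothetical. The response keys mirror those returned by the endpoint, and an index missing from the label map falls back to its numeric ID, which is the behavior seen in the first response.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, idx_to_class, k=5):
    """Return the k most probable classes with indices, labels, and probabilities."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return {
        "topk_indices": ranked,
        # Fall back to the numeric ID when no label is available.
        "topk_labels": [idx_to_class.get(str(i), str(i)) for i in ranked],
        "topk_probs": [probs[i] for i in ranked],
    }


# Hypothetical 4-class example; the real model has 133 outputs.
idx_to_class = {
    "0": "001.Affenpinscher",
    "1": "002.Afghan_hound",
    "2": "003.Airedale_terrier",
}
result = top_k([4.0, 1.0, 0.5, 0.2], idx_to_class, k=2)
print(result)
```

Keys are strings because JSON object keys are always strings, which matches how labels.json is written.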


In [60]:
# Deploy your model to an endpoint

import sagemaker, time
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import IdentitySerializer
from sagemaker.deserializers import JSONDeserializer

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Get model artifact path from the completed training job
s3_model_path = desc["ModelArtifacts"]["S3ModelArtifacts"]
print("S3 model artifact path:", s3_model_path)

model_name = f"dogbreed-resnet50-{int(time.time())}"

pytorch_model = PyTorchModel(
    name=model_name,
    role=role,
    model_data=s3_model_path,
    entry_point="inference.py",   
    source_dir=".",               
    framework_version="1.13",
    py_version="py39",
)

# Choose instance_type: "ml.m5.xlarge" for CPU, "ml.g4dn.xlarge" for GPU
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

print("Endpoint:", predictor.endpoint_name)

S3 model artifact path: s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121/pytorch-training-2025-11-06-18-48-36-639/output/model.tar.gz


INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-106660882488/dogimages/outputs/20251106-172121/pytorch-training-2025-11-06-18-48-36-639/output/model.tar.gz), script artifact (.), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-106660882488/dogbreed-resnet50-1762464167/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: dogbreed-resnet50-1762464167
INFO:sagemaker:Creating endpoint-config with name dogbreed-resnet50-1762464167-2025-11-06-21-24-32-347
INFO:sagemaker:Creating endpoint with name dogbreed-resnet50-1762464167-2025-11-06-21-24-32-347


------!Endpoint: dogbreed-resnet50-1762464167-2025-11-06-21-24-32-347


In [62]:
predictor.serializer = IdentitySerializer(content_type="image/jpeg")  
predictor.deserializer = JSONDeserializer()

s3_uri = "s3://sagemaker-us-east-1-106660882488/dogimages/test/001.Affenpinscher/Affenpinscher_00003.jpg"  
u = urlparse(s3_uri)
bucket, key = u.netloc, u.path.lstrip("/")

s3 = boto3.client("s3")
obj = s3.get_object(Bucket=bucket, Key=key)
payload = obj["Body"].read()

result = predictor.predict(payload)

print(json.dumps(result, indent=2))

{
  "topk_indices": [
    0,
    41,
    35,
    25,
    32
  ],
  "topk_labels": [
    "0",
    "41",
    "35",
    "25",
    "32"
  ],
  "topk_probs": [
    0.982056736946106,
    0.005460810381919146,
    0.0027177054435014725,
    0.002272977028042078,
    0.0018866917816922069
  ]
}


In [69]:
from torchvision import datasets
import json, os

# Recreate the dataset from the same root used in training; transforms do not affect class_to_idx
train_dir = "./dogImages/train"
train_data = datasets.ImageFolder(train_dir)

idx_to_class = {v: k for k, v in train_data.class_to_idx.items()}

# Write labels.json locally; it still has to be packaged with the model artifact and the model redeployed
os.makedirs("./model", exist_ok=True)
with open("./model/labels.json", "w") as f:
    json.dump({str(k): v for k, v in idx_to_class.items()}, f)

In [73]:
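pred = {
  "topk_indices": [0, 41, 35, 25, 32],
  "topk_probs": [0.9820, 0.00546, 0.00272, 0.00227, 0.00189]
}

def pretty(name):
    # "001.Affenpinscher" -> "Affenpinscher"; underscores become spaces
    return name.split(".", 1)[-1].replace("_", " ")

# Pair each predicted index with its class name: readable = [(label, prob), ...]
readable = [(idx_to_class[i], p) for i, p in zip(pred["topk_indices"], pred["topk_probs"])]

for k, p in readable:
    print(f"{pretty(k):22s} {p:.4f}")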



Affenpinscher          0.9820
Cairn terrier          0.0055
Briard                 0.0027
Black russian terrier  0.0023
Bouvier des flandres   0.0019


In [74]:
THRESH = 0.80
top1_label, top1_p = readable[0]
if top1_p < THRESH:
    msg = "Low confidence — consider asking for another photo."
else:
    msg = f"Predicted: {pretty(top1_label)} ({top1_p:.1%})"
print(msg)

Predicted: Affenpinscher (98.2%)


In [75]:
# shutdown/delete your endpoint 
endpoint_name = predictor.endpoint_name
predictor.delete_endpoint()


INFO:sagemaker:Deleting endpoint configuration with name: dogbreed-resnet50-1762464167-2025-11-06-21-24-32-347
INFO:sagemaker:Deleting endpoint with name: dogbreed-resnet50-1762464167-2025-11-06-21-24-32-347
