# TODO: Title

This notebook lists all the steps that you need to complete for this project. You will need to complete all the TODOs in this notebook as well as in the README and the two Python scripts included with the starter code.


**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.

**Note:** This notebook has a number of code and markdown cells with TODOs that you have to complete. These are meant to be helpful guidelines for you to finish your project while meeting the requirements in the project rubrics. Feel free to change the order of the TODOs and use more than one TODO code cell to do all your tasks.

In [51]:
# TODO: Install any packages that you might need
# For instance, you will need the smdebug package
!pip install smdebug
# !pip install protobuf==3.20.*



In [52]:
# TODO: Import any packages that you might need
# For instance you will need Boto3 and Sagemaker
import sagemaker
import boto3
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from sagemaker.pytorch import PyTorch

from sagemaker.debugger import DebuggerHookConfig, ProfilerConfig, FrameworkProfile
from sagemaker.debugger import Rule, ProfilerRule, rule_configs



In [3]:
# !pip install protobuf==3.20.3

In [53]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

## Dataset
TODO: Explain what dataset you are using for this project. Maybe even give a small overview of the classes, class distributions, etc. that can help anyone not familiar with the dataset get a better understanding of it.

In [54]:
sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "finalProject"

role = sagemaker.get_execution_role()

In [None]:
inputs = sagemaker_session.upload_data(path="dogImages", bucket=bucket, key_prefix=prefix)
print("input spec (in this case, just an S3 path): {}".format(inputs))

In [55]:
session = boto3.Session()
region = session.region_name
print("Current region:", region)

Current region: us-east-1


In [29]:
from sagemaker import image_uris

uri = image_uris.retrieve(
    framework='pytorch',
    region='us-east-1',
    version='2.0',         # PyTorch version
    py_version='py310',     # Python version
    image_scope='training',  # or 'inference'
    instance_type='ml.m5.xlarge'
)

print("Image URI:", uri)

Image URI: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0-cpu-py310


## Hyperparameter Tuning
**TODO:** This is the part where you will fine-tune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters. However, you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and the ranges.

**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.
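
For reference, SageMaker script mode passes each tuned hyperparameter to the entry point as a command-line flag (for example `--batch-size 128`) and exposes the `training` channel through the `SM_CHANNEL_TRAINING` environment variable. The sketch below shows one way `hpo.py` might read them. The function name `parse_args`, the `--data-dir` flag, and all default values are assumptions, not the contents of the starter script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters that the tuner passes as CLI flags."""
    parser = argparse.ArgumentParser()
    # Flag names match the keys of hyperparameter_ranges in this notebook.
    # The defaults are illustrative assumptions.
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--dropout", type=float, default=0.2)
    parser.add_argument("--epochs", type=int, default=1)
    # SM_CHANNEL_TRAINING points at the local copy of the 'training' channel.
    parser.add_argument(
        "--data-dir",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAINING", "dogImages"),
    )
    # parse_known_args ignores any extra flags this sketch does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--lr", "0.0025", "--batch-size", "128", "--epochs", "2"])
print(args.lr, args.batch_size, args.dropout, args.epochs)
```

Note that `argparse` turns the flag `--batch-size` into the attribute `args.batch_size`.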

In [30]:
#Done: Declare your HP ranges, metrics etc.
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch-size": CategoricalParameter([32, 64, 128, 256]),
    "dropout": ContinuousParameter(0.0, 0.5),  # Randomly drops neurons during training to reduce overfitting.
    "epochs": IntegerParameter(1, 2)
}


In [36]:
#Done: Create estimators for your HPs

# Done: Your estimator here
estimator = PyTorch(
    entry_point="hpo.py",
    role=role,
    py_version='py310',
    framework_version="2.0",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    base_job_name="hpo-job-tunning-main"
)

objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]

# Done: Your HP tuner here
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=2,
    objective_type=objective_type,
) 
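
The `Regex` in `metric_definitions` is applied to the training job's log output, and the captured group becomes the metric value. A quick local check can confirm that the pattern captures the number before a tuning job is launched. The sketch below uses a made-up log line; the exact format printed by `hpo.py` is an assumption.

```python
import re

# Same pattern as in metric_definitions above.
pattern = re.compile(r"Test set: Average loss: ([0-9\.]+)")

# Hypothetical log line; hpo.py must print something of this shape
# for the tuner to pick up the metric.
sample_log = "Test set: Average loss: 0.4321, Accuracy: 812/836 (97%)"

match = pattern.search(sample_log)
average_loss = float(match.group(1)) if match else None
print(average_loss)
```

If `average_loss` comes back as `None` for a real log line, the tuner would not be able to read the objective metric from that job.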

In [37]:
# TODO: Fit your HP Tuner
tuner.fit({'training': inputs}, wait=True) # Done: Remember to include your data channels

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating hyperparameter tuning job with name: pytorch-training-250725-2356


......................................................................................................................................................................................................................................................!


In [38]:
# TODO: Get the best estimators and the best HPs

best_estimator = tuner.best_estimator()

#Get the hyperparameters of the best trained model
best_estimator.hyperparameters()


2025-07-26 00:10:00 Starting - Preparing the instances for training
2025-07-26 00:10:00 Downloading - Downloading the training image
2025-07-26 00:10:00 Training - Training image download completed. Training in progress.
2025-07-26 00:10:00 Uploading - Uploading generated training model
2025-07-26 00:10:00 Completed - Resource reused by training job: pytorch-training-250725-2356-004-02ec834e


{'_tuning_objective_metric': '"average test loss"',
 'batch-size': '"128"',
 'dropout': '0.4062010099739338',
 'epochs': '2',
 'lr': '0.002498486418890782',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"hpo-job-tunning-main-2025-07-25-23-56-25-634"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-376129869387/hpo-job-tunning-main-2025-07-25-23-56-25-634/source/sourcedir.tar.gz"'}

In [42]:
best_hyperparams = best_estimator.hyperparameters()
cleaned_hps = {
    "batch-size": int(best_hyperparams["batch-size"].strip('"')),
    "lr": float(best_hyperparams["lr"]),
    "epochs": int(best_hyperparams["epochs"]),
    "dropout": float(best_hyperparams["dropout"])
}
cleaned_hps

{'batch-size': 128,
 'lr': 0.002498486418890782,
 'epochs': 2,
 'dropout': 0.4062010099739338}
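
`best_estimator.hyperparameters()` returns every value as a string, and some values (such as `batch-size` here) carry an extra pair of JSON quotes. As an alternative to stripping quotes by hand, the sketch below decodes each value with `json.loads`. The helper name `decode_hyperparameter` is hypothetical, and the input dictionary simply copies the values printed above.

```python
import json


def decode_hyperparameter(raw):
    """Decode a string-encoded hyperparameter, unwrapping nested JSON quotes."""
    value = raw
    # Keep decoding while the value is still a string that parses as JSON,
    # so '"128"' becomes "128" and then 128.
    while isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            break  # a plain string such as hpo.py stays a string
    return value


raw_hps = {
    "batch-size": '"128"',
    "dropout": "0.4062010099739338",
    "epochs": "2",
    "lr": "0.002498486418890782",
}
decoded = {name: decode_hyperparameter(value) for name, value in raw_hps.items()}
print(decoded)
```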

## Model Profiling and Debugging
TODO: Using the best hyperparameters, create and finetune a new model

**Note:** You will need to use the `train_model.py` script to perform model profiling and debugging.

In [43]:
# Done: Set up debugging and profiling rules and hooks

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]



profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)
debugger_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "100", "eval.save_interval": "10"},
    s3_output_path=f"s3://{bucket}/{prefix}/debugger/"
)
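
The `save_interval` values control how often the Debugger hook writes tensors in each mode. As a rough illustration, the sketch below lists which steps would be recorded, assuming a step is saved whenever its number is divisible by the interval. That divisibility rule is an assumption about smdebug's defaults and is worth confirming against its documentation. The step counts are made-up examples, not measurements from this job.

```python
def saved_steps(total_steps, save_interval):
    """Return the step numbers saved under the divisibility assumption."""
    return [step for step in range(total_steps) if step % save_interval == 0]


# Hypothetical run lengths, using the intervals from debugger_config above.
train_saved = saved_steps(total_steps=250, save_interval=100)
eval_saved = saved_steps(total_steps=25, save_interval=10)
print(train_saved)
print(eval_saved)
```

Under this assumption, a short run with a large `train.save_interval` yields very few saved loss values, which limits what a loss plot can show.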



In [None]:
# TODO: Create and fit an estimator

estimator = PyTorch(
    entry_point="train_model.py",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.8",
    py_version="py36",
    debugger_hook_config=debugger_config,
    profiler_config=profiler_config,
    rules=rules,
    base_job_name="debugger-profiler-job",
    hyperparameters=cleaned_hps
)

estimator.fit({'training': inputs}, wait=True)

# objective_metric_name = "average test loss"
# objective_type = "Minimize"
# metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]


INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: debugger-profiler-job-2025-07-26-01-47-45-166


2025-07-26 01:47:47 Starting - Starting the training job...
2025-07-26 01:48:15 Starting - Preparing the instances for training
LossNotDecreasing: InProgress
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
LowGPUUtilization: InProgress
ProfilerReport: InProgress
...
2025-07-26 01:48:50 Downloading - Downloading input data....

In [27]:
from smdebug.trials import create_trial
import matplotlib.pyplot as plt

trial = create_trial("s3://sagemaker-us-east-1-376129869387/finalProject/debugger/debugger-profiler-job-2025-07-25-19-36-00-266/debug-output")

# Get loss values at the steps where this tensor was actually saved
loss_tensor = trial.tensor("loss_output_0")
steps = loss_tensor.steps()
loss_values = [loss_tensor.value(step) for step in steps]

# Plot
plt.plot(steps, loss_values)
plt.xlabel("Step")
plt.ylabel("Training Loss")
plt.title("Training Loss over Time")
plt.grid(True)
plt.show()

INFO:matplotlib.font_manager:generated new fontManager


[2025-07-25 21:17:47.907 default:10969 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-376129869387/finalProject/debugger/debugger-profiler-job-2025-07-25-19-36-00-266/debug-output


In [16]:
from smdebug.trials import create_trial
# trial = create_trial(f's3://{bucket}/{prefix}/{estimator.latest_training_job.name}/debug-output')
# print(trial.tensor_names())

In [17]:
print(type(estimator))
print(type(hyperparameter_ranges))
print(type(metric_definitions))
print(type(metric_definitions))


<class 'sagemaker.pytorch.estimator.PyTorch'>
<class 'dict'>
<class 'list'>
<class 'list'>


In [18]:
session = boto3.session.Session()
region = session.region_name

training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")
print(f"Region: {region}")

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

print(trial.tensor_names())
print(len(trial.tensor("NLLLoss_output_0").steps(mode=ModeKeys.TRAIN)))
print(len(trial.tensor("NLLLoss_output_0").steps(mode=ModeKeys.EVAL)))

Training jobname: debugger-profiler-job-2025-07-25-19-36-00-266
Region: us-east-1
[2025-07-25 21:11:51.120 default:10969 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-376129869387/finalProject/debugger/debugger-profiler-job-2025-07-25-19-36-00-266/debug-output
['NLLLoss_output_0', 'gradient/ResNet_fc.1.bias', 'gradient/ResNet_fc.1.weight', 'layer1.0.relu_input_0', 'layer1.0.relu_input_1', 'layer1.1.relu_input_0', 'layer1.1.relu_input_1', 'layer2.0.relu_input_0', 'layer2.0.relu_input_1', 'layer2.1.relu_input_0', 'layer2.1.relu_input_1', 'layer3.0.relu_input_0', 'layer3.0.relu_input_1', 'layer3.1.relu_input_0', 'layer3.1.relu_input_1', 'layer4.0.relu_input_0', 'layer4.0.relu_input_1', 'layer4.1.relu_input_0', 'layer4.1.relu_input_1', 'loss_output_0', 'relu_input_0', 'val_loss_output_0']
1
1


**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?
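
One anomaly worth checking for is a training loss that stops decreasing. The sketch below is a simplified local check on a list of loss values. It is not the implementation of SageMaker's built-in `loss_not_decreasing` rule, and the function name, window size, threshold, and sample values are all illustrative assumptions.

```python
def loss_is_decreasing(losses, window=3, min_relative_drop=0.01):
    """Compare the mean of the last `window` losses with the mean of the
    `window` losses before them. Return True if the recent mean is lower
    by at least `min_relative_drop` (a fraction of the earlier mean)."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return (previous - recent) >= min_relative_drop * abs(previous)


# Made-up loss curves for illustration.
healthy = [2.0, 1.6, 1.3, 1.1, 0.9, 0.8]
stalled = [1.30, 1.31, 1.29, 1.30, 1.31, 1.30]

print(loss_is_decreasing(healthy))
print(loss_is_decreasing(stalled))
```

A stalled curve like the second one might point to a learning rate that is too low or too high, which is one possible direction for the answer to the questions above.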

In [25]:
!pip install "bokeh<3.0" --upgrade
!pip install "smdebug>=1.0.12" --upgrade

Collecting bokeh<3.0
  Downloading bokeh-2.4.3-py3-none-any.whl.metadata (14 kB)
Downloading bokeh-2.4.3-py3-none-any.whl (18.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.5/18.5 MB 97.5 MB/s eta 0:00:00
Installing collected packages: bokeh
  Attempting uninstall: bokeh
    Found existing installation: bokeh 3.7.3
    Uninstalling bokeh-3.7.3:
      Successfully uninstalled bokeh-3.7.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
panel 1.7.2 requires bokeh<3.8.0,>=3.5.0, but you have bokeh 2.4.3 which is incompatible.
Successfully installed bokeh-2.4.3


In [26]:
# TODO: Display the profiler output
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-376129869387/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-376129869387/debugger-profiler-job-2025-07-25-19-36-00-266/profiler-output


Profiler data from system is available
[2025-07-25 21:15:25.110 default:10969 INFO metrics_reader_base.py:134] Getting 99 event files
select events:['total']
sel
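
The profiler report is written under the `rule-output` prefix built in the cell above. To fetch it with `boto3`, the S3 URI first has to be split into a bucket name and a key prefix. The sketch below does that split with the standard library only. The helper name `split_s3_uri` and the example URI are placeholders, not values from this job.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI; substitute the rule_output_path from the cell above.
bucket_name, key_prefix = split_s3_uri("s3://example-bucket/example-job/rule-output")
print(bucket_name, key_prefix)
```

The resulting pair could then be passed to `boto3.client("s3").list_objects_v2(Bucket=bucket_name, Prefix=key_prefix)` to locate the report files.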

## Model Deploying

In [None]:
# TODO: Deploy your model to an endpoint

# Deployment configuration: one instance; the instance type is an assumed example, adjust as needed
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

In [None]:
# TODO: Run a prediction on the endpoint

image = ...  # TODO: Your code to load and preprocess image to send to endpoint for prediction
response = predictor.predict(image)
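
Assuming the endpoint returns one score per class for a single image (this depends on how the inference script is written), the predicted class is the index of the largest score. A minimal sketch with made-up scores, where `predicted_class` is a hypothetical helper:

```python
def predicted_class(scores):
    """Return the index of the highest score (argmax) in a flat list."""
    return max(range(len(scores)), key=lambda index: scores[index])


# Made-up scores for four classes; a real response would come from predictor.predict.
example_scores = [-1.2, 0.3, 2.7, 0.9]
best_index = predicted_class(example_scores)
print(best_index)
```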

In [None]:
# TODO: Remember to shut down/delete your endpoint once your work is done
predictor.delete_endpoint()