# TODO: Dog classification

This notebook lists all the steps that you need to complete this project. You will need to complete all the TODOs in this notebook as well as in the README and the two Python scripts included with the starter code.


**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.

In this project, I use AWS SageMaker to fine-tune a pretrained model for image classification, making use of SageMaker's profiling, debugging, and hyperparameter tuning features. The model is trained on the dog breed classification dataset to distinguish between dog breeds in images.

**Note:** This notebook has a bunch of code and markdown cells with TODOs that you have to complete. These are meant to be helpful guidelines for you to finish your project while meeting the requirements in the project rubrics. Feel free to change the order of the TODOs and use more than one TODO code cell to do all your tasks.

In [1]:
# TODO: Install any packages that you might need
# For instance, you will need the smdebug package
!pip install smdebug

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting smdebug
  Downloading smdebug-1.0.12-py2.py3-none-any.whl (270 kB)
Collecting pyinstrument==3.4.2
  Downloading pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
Collecting pyinstrument-cext>=0.2.2
  Downloading pyinstrument_cext-0.2.4.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... done
Collecting botocore<1.30.0,>=1.29.44
  Downloading botocore-1.29.48-py3-none-any.whl (10.3 MB)
Building wheels for collected packages: pyinstrument-cext
  Building wheel for pyinstrument-cext (setup.py) ... done


In [2]:
# TODO: Import any packages that you might need
# For instance you will need Boto3 and Sagemaker
import sagemaker
import boto3
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role
import os
import fnmatch
import numpy as np
import pandas as pd

In [3]:
from sagemaker.debugger import (
    Rule,
    ProfilerRule,
    DebuggerHookConfig,
    rule_configs,
    ProfilerConfig, 
    FrameworkProfile
)

from sagemaker.session import TrainingInput

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
]

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)
debugger_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "100", "eval.save_interval": "10"}
)

In [4]:
# session = boto3.session.Session()
session = sagemaker.Session()
bucket = session.default_bucket() 

sess = boto3.session.Session()

region = sess.region_name


## Dataset
TODO: Explain what dataset you are using for this project. Maybe even give a small overview of the classes, class distributions, etc. that can help anyone not familiar with the dataset get a better understanding of it.

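I am using the dogImages dataset, which contains images of different dog breeds. The images are organized into `train`, `valid`, and `test` folders, with one subfolder per breed (for example, `001.Affenpinscher`).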
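To give readers a feel for the class distribution, the images in each breed folder can be counted directly from disk. The following is a minimal sketch, assuming the archive has been unzipped to `./dogImages` and that every image uses the `.jpg` extension; the helper name `count_images_per_class` is made up for this example.

```python
import os
from collections import Counter


def count_images_per_class(split_dir, extension=".jpg"):
    """Count image files in each class subfolder of one dataset split."""
    counts = Counter()
    for root, _dirnames, filenames in os.walk(split_dir):
        n_images = sum(1 for name in filenames if name.lower().endswith(extension))
        if n_images:
            # The folder that directly holds the images is treated as the class name.
            counts[os.path.basename(root)] += n_images
    return counts


if __name__ == "__main__":
    # Assumed location of the unzipped training split.
    train_dir = "./dogImages/train"
    if os.path.isdir(train_dir):
        counts = count_images_per_class(train_dir)
        print(f"{len(counts)} classes, {sum(counts.values())} images")
        for name, n in counts.most_common(5):
            print(f"{name}: {n}")
```

Running the same helper on the `valid` and `test` folders shows whether the three splits cover the same breeds.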

In [5]:
# #TODO: Fetch and upload the data to AWS S3

# # Command to download and unzip data
# !wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
# !unzip dogImages.zip

In [6]:

# # I have commented out the data upload and fetching to avoid running the same command multiple times. 

# train_s3_path = sagemaker.Session().upload_data(bucket=bucket, 
#                                                   path='dogImages/train', 
#                                                   key_prefix='dogImages/train')

In [7]:
# valid_s3_path = sagemaker.Session().upload_data(bucket=bucket, 
#                                                   path='dogImages/valid', 
#                                                   key_prefix='dogImages/valid')

In [8]:
# test_s3_path = sagemaker.Session().upload_data(bucket=bucket, 
#                                                   path='dogImages/test', 
#                                                   key_prefix='dogImages/test')

In [9]:
def read_image_lst_info(srcdir):
    """Walk through the base folder and collect the paths of all image files.

    Returns a dataframe with the columns sampid (sample index), classid
    (class index), path (relative path), and classname."""
    
    fileexts=['*.jpg']

    # search through source folder for sample files
    relpath = []
    subdirname = []
    for ext in fileexts:
        for root, dirnames, filenames in os.walk(srcdir):
            for filename in fnmatch.filter(filenames, ext):
                # use the folder name as the class name; basename works on any OS
                subdir = os.path.basename(root)
                relpath.append( subdir + '/' + filename)
                subdirname.append(subdir)
                
    # make sample id
    sampid = np.arange(len(subdirname))
    
    # subdir names will be used as class names
    classnames = np.unique(subdirname)
    
    # generate class id for each sample
    d = dict(zip(classnames,np.arange(len(classnames))))
    classid = [d[x] for x in subdirname]
    
    # return dataframe with file info
    return pd.DataFrame({'sampid': sampid, 
                        'classid':  classid,
                        'path': relpath,
                        'classname': subdirname} )

In [10]:
srcdir_train = './dogImages/train'
srcdir_valid = './dogImages/valid'
srcdir_test = './dogImages/test'

read_image_lst_info(srcdir_train).to_csv("train.lst", sep="\t", index=False, header=False)
read_image_lst_info(srcdir_valid).to_csv("valid.lst", sep="\t", index=False, header=False)
read_image_lst_info(srcdir_test).to_csv("test.lst", sep="\t", index=False, header=False)

import boto3

# # Upload files
# boto3.Session().resource('s3').Bucket(
#     bucket).Object('train.lst').upload_file('./train.lst')
# boto3.Session().resource('s3').Bucket(
#     bucket).Object('valid.lst').upload_file('./valid.lst')
# boto3.Session().resource('s3').Bucket(
#     bucket).Object('test.lst').upload_file('./test.lst') 
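The breed folders carry a numeric prefix (for example `001.Affenpinscher`), so a label and a readable breed name can also be read straight from the folder name. The sketch below assumes every folder follows the `NNN.Breed_name` pattern seen in the dataset paths; `parse_class_folder` is a hypothetical helper, not part of the starter code.

```python
def parse_class_folder(folder_name):
    """Split a folder name like '001.Affenpinscher' into (class_index, breed)."""
    prefix, _, breed = folder_name.partition(".")
    # Folder prefixes start at 001, so subtract 1 for a zero-based label.
    return int(prefix) - 1, breed.replace("_", " ")


print(parse_class_folder("001.Affenpinscher"))
print(parse_class_folder("002.Afghan_hound"))
```

If the numeric prefixes are contiguous, these labels should agree with the `classid` column written to the `.lst` files, which is derived from the sorted folder names.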

## Hyperparameter Tuning
**TODO:** This is the part where you will finetune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters. However you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and the ranges.

**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.

In [11]:
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner
)

In [12]:
#TODO: Declare your HP ranges, metrics etc.

# hyperparameter_ranges = {
#     # "lr": ContinuousParameter(0.001, 0.1),
#     "batch-size": CategoricalParameter([32, 64, 128, 256, 512])
# }

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.15, 0.2),
    "epochs": CategoricalParameter([1, 2])
}

In [13]:
objective_metric_name = "Accuracy"
objective_type = "Maximize"
# metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]
metric_definitions = [{"Name": "Accuracy", "Regex": "Val_Accuracy=([0-9\\.]+)"}]
# Accuracy=([0-9\\.]+)%
# objective_metric_name = "Accuracy"
# objective_type = "Maximize"
# metric_definitions = [{"Name": "Accuracy", "Regex": "Accuracy=([0-9\\.]+)"}]
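The tuner can only optimize the objective if the regex in `metric_definitions` actually matches a line that `hpo.py` prints. A quick local check is sketched below; the sample log lines are hypothetical, since the real format depends on the print statements in the training script.

```python
import re

# Same pattern as in metric_definitions above.
metric_regex = "Val_Accuracy=([0-9\\.]+)"

# Hypothetical log lines; replace them with lines copied from a real training log.
sample_logs = [
    "Epoch 1: Val_Loss=1.2345 Val_Accuracy=0.8123",
    "Test set: Average loss: 0.4321",
]


def extract_metric(line, pattern):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(pattern, line)
    return float(match.group(1)) if match else None


for line in sample_logs:
    print(line, "->", extract_metric(line, metric_regex))
```

A line that yields `None` here would not contribute a metric value to the tuning job.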

In [14]:
#TODO: Create estimators for your HPs

#estimator = # TODO: Your estimator here

## TROUBLESHOOT 
# maybe I need to add hyperparameters here
estimator = PyTorch(
    entry_point="hpo.py",
    role=get_execution_role(),
    py_version="py3",
    framework_version="1.8.0",
    instance_count=1,
    instance_type="ml.m5.xlarge")


# tuner = # TODO: Your HP tuner here
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=1,
    max_parallel_jobs=1,
    objective_type=objective_type,
)

In [15]:
import os
training_path= "s3://{}/{}/".format(bucket, "dogImages")
s3_output_dir = "s3://{}/{}/".format(bucket, "output")
s3_model_dir = "s3://{}/{}/".format(bucket, "model")

os.environ['SM_CHANNEL_TRAIN']=training_path
os.environ['SM_MODEL_DIR']=s3_model_dir
os.environ['SM_OUTPUT_DATA_DIR']=s3_output_dir

In [11]:
tuner.fit({"train": training_path},wait=True) # TODO: Remember to include your data channels

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


......................................................................................................................................................................................................................................................!


In [16]:
# TODO: Get the best estimators and the best HPs

best_estimator = tuner.best_estimator()

#best_estimator = #TODO

#Get the hyperparameters of the best trained model
best_estimator.hyperparameters()

ValueError: No tuning job available

In [17]:
best_estimator_hyperparameters = {'_tuning_objective_metric': '"Accuracy"',
 'epochs': '"1"',
 'lr': '0.1853535122882066',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"pytorch-training-2023-01-11-19-30-40-438"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-777192073018/pytorch-training-2023-01-11-19-30-40-438/source/sourcedir.tar.gz"'}
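The dictionary above mixes the tuned values with SageMaker bookkeeping entries, and its string values are JSON-encoded (note the nested quotes in `'"1"'`). Before reusing it for a new training job, it may help to keep only the tuned values. This is a minimal sketch, assuming that keys starting with `sagemaker_` or `_tuning` are not needed by `train_model.py`; `clean_hyperparameters` is a made-up helper name.

```python
import json


def clean_hyperparameters(raw):
    """Drop SageMaker-internal keys and decode JSON-encoded values."""
    cleaned = {}
    for key, value in raw.items():
        if key.startswith("sagemaker_") or key.startswith("_tuning"):
            continue  # bookkeeping entries, not model hyperparameters
        try:
            cleaned[key] = json.loads(value)
        except (TypeError, ValueError):
            cleaned[key] = value  # keep values that are not valid JSON as they are
    return cleaned


raw_hyperparameters = {
    "_tuning_objective_metric": '"Accuracy"',
    "epochs": '"1"',
    "lr": "0.1853535122882066",
    "sagemaker_program": '"hpo.py"',
}
print(clean_hyperparameters(raw_hyperparameters))
```

Categorical values such as `epochs` come back as strings, so the training script still has to convert them to integers when parsing its arguments.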

## Model Profiling and Debugging
TODO: Using the best hyperparameters, create and finetune a new model

**Note:** You will need to use the `train_model.py` script to perform model profiling and debugging.

In [18]:
# TODO: Set up debugging and profiling rules and hooks

In [31]:
# TODO: Create and fit an estimator

# estimator = # TODO: Your estimator here

estimator = PyTorch(
    entry_point="train_model.py",
    role=get_execution_role(),
    py_version="py3",
    framework_version="1.8.0",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    profiler_config=profiler_config,
    debugger_hook_config=debugger_config,
    rules=rules, 
    hyperparameters=best_estimator_hyperparameters)


estimator.fit({"train": training_path},wait=True)
# estimator.fit(wait=True)

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-01-12-17-43-10-589


2023-01-12 17:43:11 Starting - Starting the training job...
2023-01-12 17:43:38 Starting - Preparing the instances for training
LossNotDecreasing: InProgress
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
LowGPUUtilization: InProgress
ProfilerReport: InProgress
......
2023-01-12 17:44:39 Downloading - Downloading input data......
2023-01-12 17:45:39 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-01-12 17:45:42,863 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-01-12 17:45:42,865 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-12 17:45:42,875 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-01-12 17:45:42

In [32]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys

[2023-01-12 18:09:13.283 ip-172-16-124-116.ec2.internal:7399 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None


In [33]:
trial = create_trial(estimator.latest_job_debugger_artifacts_path())
trial.tensor_names()

[2023-01-12 18:09:15.448 ip-172-16-124-116.ec2.internal:7399 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-777192073018/pytorch-training-2023-01-12-17-43-10-589/debug-output


INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


[2023-01-12 18:09:15.827 ip-172-16-124-116.ec2.internal:7399 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2023-01-12 18:09:16.845 ip-172-16-124-116.ec2.internal:7399 INFO trial.py:210] Loaded all steps


['CrossEntropyLoss_output_0',
 'gradient/ResNet_fc.0.bias',
 'gradient/ResNet_fc.0.weight',
 'gradient/ResNet_fc.2.bias',
 'gradient/ResNet_fc.2.weight',
 'layer1.0.relu_input_0',
 'layer1.0.relu_input_1',
 'layer1.0.relu_input_2',
 'layer1.1.relu_input_0',
 'layer1.1.relu_input_1',
 'layer1.1.relu_input_2',
 'layer1.2.relu_input_0',
 'layer1.2.relu_input_1',
 'layer1.2.relu_input_2',
 'layer2.0.relu_input_0',
 'layer2.0.relu_input_1',
 'layer2.0.relu_input_2',
 'layer2.1.relu_input_0',
 'layer2.1.relu_input_1',
 'layer2.1.relu_input_2',
 'layer2.2.relu_input_0',
 'layer2.2.relu_input_1',
 'layer2.2.relu_input_2',
 'layer2.3.relu_input_0',
 'layer2.3.relu_input_1',
 'layer2.3.relu_input_2',
 'layer3.0.relu_input_0',
 'layer3.0.relu_input_1',
 'layer3.0.relu_input_2',
 'layer3.1.relu_input_0',
 'layer3.1.relu_input_1',
 'layer3.1.relu_input_2',
 'layer3.2.relu_input_0',
 'layer3.2.relu_input_1',
 'layer3.2.relu_input_2',
 'layer3.3.relu_input_0',
 'layer3.3.relu_input_1',
 'layer3.3.rel

In [34]:
print(len(trial.tensor('CrossEntropyLoss_output_0').steps(mode=ModeKeys.TRAIN)))
print(len(trial.tensor('CrossEntropyLoss_output_0').steps(mode=ModeKeys.EVAL)))

1
1
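Only one step was recorded per mode. One plausible explanation is the hook configuration: with `train.save_interval` set to `100`, a run with fewer than about 100 training steps would only save step 0. The sketch below illustrates that arithmetic under the assumption that a step is saved when its index is a multiple of the save interval; the step counts passed in are illustrative, not measured from this job.

```python
def saved_steps(total_steps, save_interval):
    """List the step indices that would be saved for a given interval."""
    return [step for step in range(total_steps) if step % save_interval == 0]


# Illustrative step counts, not taken from the training job.
print(saved_steps(total_steps=87, save_interval=100))
print(saved_steps(total_steps=87, save_interval=10))
```

Under this assumption, lowering the interval would give the loss curve more than a single point to plot.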


In [35]:
def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals

import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot

INFO:matplotlib.font_manager:generated new fontManager


In [36]:
import os
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")
print(f"Region: {region}")

Training jobname: pytorch-training-2023-01-12-17-43-10-589
Region: us-east-1


In [37]:
tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-777192073018/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-777192073018/pytorch-training-2023-01-12-17-43-10-589/profiler-output


Profiler data from system is available


[2023-01-12 18:09:31.007 ip-172-16-124-116.ec2.internal:7399 INFO metrics_reader_base.py:134] Getting 25 event files
select events:['total']
select dimensions:['CPU', 'GPU']
filtered_events:{'total'}
filtered_dimensions:{'CPUUtilization-nodeid:algo-1'}


You will find the profiler report in s3://sagemaker-us-east-1-777192073018/pytorch-training-2023-01-12-17-43-10-589/rule-output
2023-01-12 18:08:30     364794 pytorch-training-2023-01-12-17-43-10-589/rule-output/ProfilerReport/profiler-output/profiler-report.html
2023-01-12 18:08:30     211844 pytorch-training-2023-01-12-17-43-10-589/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb
2023-01-12 18:08:25        192 pytorch-training-2023-01-12-17-43-10-589/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
2023-01-12 18:08:25        200 pytorch-training-2023-01-12-17-43-10-589/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
2023-01-12 18:08:25       1981 pytorch-training-2023-01-12-17-43-10-589/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json
2023-01-12 18:08:25        127 pytorch-training-2023-01-12-17-43-10-589/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.jso

In [41]:
# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]

In [42]:
# TODO: Display the profiler output
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

| Rule | Description | Recommendation | Number of times rule triggered | Number of datapoints | Rule parameters |
|---|---|---|---|---|---|
| GPUMemoryIncrease | Measures the average GPU memory footprint and triggers if there is a large increase. | Choose a larger instance type with more memory if footprint is close to maximum available memory. | 0 | 0 | increase:5  patience:1000  window:10 |
| LowGPUUtilization | Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. | Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. | 0 | 0 | threshold_p95:70  threshold_p5:10  window:500  patience:1000 |
| BatchSize | Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. | The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. | 0 | 2837 | cpu_threshold_p95:70  gpu_threshold_p95:70  gpu_memory_threshold_p95:70  patience:1000  window:500 |
| CPUBottleneck | Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Consider increasing the number of data loaders or applying data pre-fetching. | 0 | 2842 | threshold:50  cpu_threshold:90  gpu_threshold:10  patience:1000 |
| LoadBalancing | Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. | Choose a different distributed training strategy or a different distributed training framework. | 0 | 0 | threshold:0.2  patience:1000 |
| StepOutlier | Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. | Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. | 0 | 87 | threshold:3  mode:None  n_outliers:10  stddev:3 |
| MaxInitializationTime | Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. | Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. | 0 | 87 | threshold:20 |
| IOBottleneck | Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. | 0 | 2842 | threshold:50  io_threshold:50  gpu_threshold:10  patience:1000 |
| Dataloader | Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. | Change the number of data loader processes. | 0 | 10 | min_threshold:70  max_threshold:200 |


**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  

Yes. The VanishingGradient and PoorWeightInitialization rules reported issues.

**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

I would examine the training job's logs in Amazon CloudWatch to find the cause.
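The rule outcomes can also be listed programmatically from the summaries returned by `estimator.latest_training_job.rule_job_summary()`, which this notebook already uses to locate the profiler report. The sketch below filters such summaries for rules that found issues. The example data is a made-up stand-in that mirrors the answer above, not output copied from the training job, and `rules_with_issues` is a hypothetical helper.

```python
def rules_with_issues(rule_summaries):
    """Return the names of rules whose evaluation status is 'IssuesFound'."""
    return [
        summary["RuleConfigurationName"]
        for summary in rule_summaries
        if summary.get("RuleEvaluationStatus") == "IssuesFound"
    ]


# Illustrative stand-in for the list returned by rule_job_summary();
# these statuses are invented for the example.
example_summaries = [
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "IssuesFound"},
    {"RuleConfigurationName": "PoorWeightInitialization", "RuleEvaluationStatus": "IssuesFound"},
]

print(rules_with_issues(example_summaries))
```

In the notebook, the real summaries would be passed in place of `example_summaries`.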

In [46]:
estimator

<sagemaker.pytorch.estimator.PyTorch at 0x7efe9f456f20>

In [45]:
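# attach() needs the name of an existing training job; the earlier call with an
# empty string produced the ValidationException shown below.
estimator_loaded = PyTorch.attach(training_job_name, sagemaker_session=session)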

ClientError: An error occurred (ValidationException) when calling the DescribeTrainingJob operation: Requested resource not found.

## Model Deploying

In [47]:
estimator.model_data

's3://sagemaker-us-east-1-777192073018/pytorch-training-2023-01-12-17-43-10-589/output/model.tar.gz'

In [48]:
model_location = estimator.model_data

In [49]:
from sagemaker.pytorch import PyTorchModel
role = get_execution_role()
from sagemaker.predictor import Predictor

In [50]:
jpeg_serializer = sagemaker.serializers.IdentitySerializer("image/jpeg")
json_deserializer = sagemaker.deserializers.JSONDeserializer()


class ImagePredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(ImagePredictor, self).__init__(
            endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=jpeg_serializer,
            deserializer=json_deserializer,
        )

In [110]:
# sagemaker.Session().upload_data(bucket=bucket, path = ".", key_prefix='code/inference2.py')

In [51]:
model = PyTorchModel(model_data=model_location, role=role, entry_point='code/inference2.py',py_version='py3',
                             framework_version='1.8.0', predictor_cls=ImagePredictor)

In [52]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium")

INFO:sagemaker:Creating model with name: pytorch-inference-2023-01-12-18-14-07-808
INFO:sagemaker:Creating endpoint-config with name pytorch-inference-2023-01-12-18-14-08-460
INFO:sagemaker:Creating endpoint with name pytorch-inference-2023-01-12-18-14-08-460


-----------------------------------------------------*

UnexpectedStatusException: Error hosting endpoint pytorch-inference-2023-01-12-18-14-08-460: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

In [79]:
# from sagemaker.serializers import IdentitySerializer
# import base64

# predictor.serializer = IdentitySerializer("image/jpeg")
# with open("s3://sagemaker-us-east-1-777192073018/dogImages/test/001.Affenpinscher/Affenpinscher_00003.jpg", "rb") as f:
#     payload = f.read()

    
# inference_1 = predictor.predict(payload)
# print(inference_1)

FileNotFoundError: [Errno 2] No such file or directory: 's3://sagemaker-us-east-1-777192073018/dogImages/test/001.Affenpinscher/Affenpinscher_00003.jpg'
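The `FileNotFoundError` above occurs because Python's built-in `open()` only reads local paths and does not understand `s3://` URIs. One way around this is to split the URI into a bucket and a key and fetch the object through the S3 API. The sketch below covers the parsing step using only the standard library; `split_s3_uri` is a hypothetical helper, and the boto3 call in the trailing comment is shown as an assumption rather than tested code.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key.jpg' into ('bucket', 'some/key.jpg')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key = split_s3_uri(
    "s3://sagemaker-us-east-1-777192073018/dogImages/test/001.Affenpinscher/Affenpinscher_00003.jpg"
)
print(bucket_name)
print(key)

# With boto3 available, the image bytes could then be read with something like:
#   payload = boto3.client("s3").get_object(Bucket=bucket_name, Key=key)["Body"].read()
```

The resulting bytes can be sent to the endpoint in the same way as the `img_bytes` payload used later in the notebook.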

In [39]:
import requests

In [40]:
request_dict={ "url": "https://sagemaker-us-east-1-777192073018.s3.amazonaws.com/dogImages/test/002.Afghan_hound/Afghan_hound_00116.jpg" }

img_bytes = requests.get(request_dict['url']).content
type(img_bytes)

bytes

In [41]:
response=predictor.predict(img_bytes, initial_args={"ContentType": "image/jpeg"})

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/pytorch-inference-2023-01-12-14-17-23-644 in account 777192073018 for more information.

In [91]:
# # TODO: Run a prediction on the endpoint

# image = "s3://sagemaker-us-east-1-777192073018/dogImages/test/001.Affenpinscher/Affenpinscher_00003.jpg"# TODO: Your code to load and preprocess image to send to endpoint for prediction
# response = predictor.predict(image,{"ContentType": "application/x-image", "Accept": "application/json;verbose"}
# )

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/pytorch-inference-2023-01-12-03-57-14-244 in account 777192073018 for more information.

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()