# Models Benchmark

ResNet is a popular family of pre-trained models in computer vision, widely used for image classification. In this project, a quick benchmark is run during the experimental phase to compare the performance of ResNeXt against a traditional ResNet model.

* To evaluate and compare the accuracy of both pre-trained models, each one is fine-tuned on the project's custom dataset.
* The SageMaker JumpStart SDK is used for this experimentation phase. The samples in `data_final/train` are split into training and validation datasets: `{'train': 3436, 'val': 859}`.
* A generic script provided by SageMaker is used for the transfer learning jobs.
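
The split sizes above add up to 4295 samples, with 859 of them (20%) held out for validation. The sketch below reproduces that arithmetic with a shuffled index split. It is only an illustration: how the provided training script actually performs the split is not shown in this notebook, and the helper name `split_indices`, the 0.2 fraction, and the seed are assumptions made here.

```python
import random


def split_indices(n_samples, val_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction for validation.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    # The first n_val shuffled indices form the validation set
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(3436 + 859)
print({"train": len(train_idx), "val": len(val_idx)})
# → {'train': 3436, 'val': 859}
```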

In [2]:
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.estimator import Estimator
from sagemaker.session import Session
from sagemaker import hyperparameters

bucket = "ml-capstone-project"
prefix = "data_final"

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [3]:
# More implementation details: https://sagemaker.readthedocs.io/en/stable/overview.html#fine-tune-a-pre-trained-model-on-a-custom-dataset-using-the-sagemaker-estimator-class
# Training execution: $python3 transfer_learning.py --batch_size 5 --epochs 10 --learning_rate 0.001

def create_estimator(model_id, model_version):
    training_instance_type = "ml.p3.2xlarge"
    #inference_instance_type = "ml.p3.2xlarge"
    instance_count = 1

    # Retrieve the JumpStart base model S3 URI
    base_model_uri = model_uris.retrieve(
        model_id=model_id, model_version=model_version, model_scope="training"
    )

    # Retrieve the training script and Docker image
    training_script_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope="training"
    )
    training_image_uri = image_uris.retrieve(
        region=None,
        framework=None,
        image_scope="training",
        model_id=model_id,
        model_version=model_version,
        instance_type=training_instance_type,
    )

    # Get the default JumpStart hyperparameters
    default_hyperparameters = hyperparameters.retrieve_default(
        model_id=model_id,
        model_version=model_version,
    )
    # [Optional] Override default hyperparameters with custom values
    default_hyperparameters["epochs"] = "10"
    default_hyperparameters["batch_size"] = "5"

    # SageMaker Estimator instance
    estimator = Estimator(
        image_uri=training_image_uri,
        source_dir=training_script_uri,
        model_uri=base_model_uri,
        entry_point="transfer_learning.py",
        role=Session().get_caller_identity_arn(),
        hyperparameters=default_hyperparameters,
        instance_count=instance_count,
        instance_type=training_instance_type,
        enable_network_isolation=True,
    )
    return estimator

def fit_estimator():
    # Fits the module-level `estimator`, so it must be (re)assigned before each call
    # URI of the training dataset
    training_dataset_s3_path = f's3://{bucket}/{prefix}/train'
    
    # S3 location of training data for the training channel
    estimator.fit(
        {
            "training": training_dataset_s3_path
        }
    )
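
Because `fit_estimator` reads `estimator` from module scope, it fits whichever estimator was assigned last. A minimal sketch of a variant that takes the estimator as an argument is given below. The names `training_channels`, `fit_estimator_explicit`, and the `DryRunEstimator` stand-in are hypothetical and introduced only for illustration; the only thing taken from the notebook is the `fit` call with a `"training"` channel.

```python
def training_channels(bucket, prefix):
    """Build the input-channel mapping passed to `Estimator.fit`."""
    return {"training": f"s3://{bucket}/{prefix}/train"}


def fit_estimator_explicit(estimator, bucket, prefix):
    """Fit the given estimator on the training channel and return the channels used."""
    channels = training_channels(bucket, prefix)
    estimator.fit(channels)
    return channels


class DryRunEstimator:
    """Stand-in that records the inputs instead of launching a training job."""

    def __init__(self):
        self.inputs = None

    def fit(self, inputs):
        self.inputs = inputs


dry_run = DryRunEstimator()
used = fit_estimator_explicit(dry_run, "ml-capstone-project", "data_final")
print(used)
# → {'training': 's3://ml-capstone-project/data_final/train'}
```

With a real estimator the call would be `fit_estimator_explicit(create_estimator("pytorch-ic-resnet50", "2.2.4"), bucket, prefix)`, which makes it explicit which model each training job belongs to.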

In [4]:
estimator = create_estimator("pytorch-ic-resnet50", "2.2.4")
fit_estimator()



INFO:sagemaker:Creating training-job with name: sagemaker-jumpstart-2023-11-21-15-09-01-000


2023-11-21 15:09:01 Starting - Starting the training job...
2023-11-21 15:09:30 Starting - Preparing the instances for training.........
2023-11-21 15:10:39 Downloading - Downloading input data...
2023-11-21 15:11:24 Training - Downloading the training image...........................
2023-11-21 15:15:35 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-11-21 15:15:58,982 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-11-21 15:15:59,009 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-11-21 15:15:59,013 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-11-21 15:15:59,095 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_fram

ResNet50 results:

```
Epoch 0/9
train Loss: 1.7046 train Acc: 0.2602
val Loss: 1.6468 val Acc: 0.3166
Epoch 1/9
train Loss: 1.5470 train Acc: 0.2907
val Loss: 1.5144 val Acc: 0.3166
Epoch 2/9
train Loss: 1.5067 train Acc: 0.3178
val Loss: 1.5174 val Acc: 0.3120
Epoch 3/9
train Loss: 1.5082 train Acc: 0.3137
val Loss: 1.5087 val Acc: 0.3132
Epoch 4/9
train Loss: 1.5012 train Acc: 0.3225
val Loss: 1.5036 val Acc: 0.3143
Epoch 5/9
train Loss: 1.5038 train Acc: 0.3155
val Loss: 1.5029 val Acc: 0.3073
Epoch 6/9
train Loss: 1.5090 train Acc: 0.3102
val Loss: 1.5178 val Acc: 0.2899
Epoch 7/9
train Loss: 1.5030 train Acc: 0.3225
val Loss: 1.5190 val Acc: 0.3178
Epoch 8/9
train Loss: 1.4952 train Acc: 0.3242
val Loss: 1.5216 val Acc: 0.3003
Epoch 9/9
train Loss: 1.4966 train Acc: 0.3146
val Loss: 1.5155 val Acc: 0.3062
Training complete in 3m 43s
Best val Acc: 0.317811
```

In [6]:
estimator = create_estimator("pytorch-ic-resnext101-32x8d", "2.2.4")
fit_estimator()



INFO:sagemaker:Creating training-job with name: sagemaker-jumpstart-2023-11-21-15-46-12-043


2023-11-21 15:46:12 Starting - Starting the training job...
2023-11-21 15:46:38 Starting - Preparing the instances for training.........
2023-11-21 15:48:01 Downloading - Downloading input data......
2023-11-21 15:48:46 Training - Downloading the training image........................
2023-11-21 15:52:53 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-11-21 15:53:16,818 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-11-21 15:53:16,845 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-11-21 15:53:16,849 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-11-21 15:53:16,931 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_fram

ResNeXt101 results:

```
Epoch 0/9
train Loss: 1.7258 train Acc: 0.2456
val Loss: 1.6224 val Acc: 0.2922
Epoch 1/9
train Loss: 1.5247 train Acc: 0.3021
val Loss: 1.4895 val Acc: 0.3318
Epoch 2/9
train Loss: 1.5108 train Acc: 0.3041
val Loss: 1.4934 val Acc: 0.3120
Epoch 3/9
train Loss: 1.5061 train Acc: 0.3164
val Loss: 1.5069 val Acc: 0.2992
Epoch 4/9
train Loss: 1.4976 train Acc: 0.3332
val Loss: 1.5359 val Acc: 0.2992
Epoch 5/9
train Loss: 1.4945 train Acc: 0.3277
val Loss: 1.5283 val Acc: 0.2934
Epoch 6/9
train Loss: 1.5092 train Acc: 0.3193
val Loss: 1.5037 val Acc: 0.3248
Epoch 7/9
train Loss: 1.4985 train Acc: 0.3245
val Loss: 1.5257 val Acc: 0.3190
Epoch 8/9
train Loss: 1.5046 train Acc: 0.3228
val Loss: 1.5419 val Acc: 0.2957
Epoch 9/9
train Loss: 1.5066 train Acc: 0.3161
val Loss: 1.5270 val Acc: 0.2992
Training complete in 3m 43s
Best val Acc: 0.331781
```

In summary, the proposed ResNeXt101 model slightly outperforms ResNet50: its best validation accuracy is *33.18%*, reached in the second epoch, against *31.78%* for ResNet50, a margin of about 1.4 percentage points.
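
As a rough illustration of how the best epoch can be read off the training logs, the sketch below parses the `val Acc` lines and converts the two reported best accuracies into image counts on the 859-sample validation set. The helper name `best_validation_epoch` and the regular expression are assumptions made for this example; the log excerpt is the first three epochs of the ResNeXt101 output above.

```python
import re

# Matches lines such as "val Loss: 1.4895 val Acc: 0.3318"
VAL_LINE = re.compile(r"val Loss: (\d+\.\d+) val Acc: (\d+\.\d+)")


def best_validation_epoch(log_text):
    """Return (epoch_index, accuracy) for the highest validation accuracy."""
    accuracies = [float(match.group(2)) for match in VAL_LINE.finditer(log_text)]
    best = max(range(len(accuracies)), key=accuracies.__getitem__)
    return best, accuracies[best]


resnext_log = """
Epoch 0/9
train Loss: 1.7258 train Acc: 0.2456
val Loss: 1.6224 val Acc: 0.2922
Epoch 1/9
train Loss: 1.5247 train Acc: 0.3021
val Loss: 1.4895 val Acc: 0.3318
Epoch 2/9
train Loss: 1.5108 train Acc: 0.3041
val Loss: 1.4934 val Acc: 0.3120
"""

print(best_validation_epoch(resnext_log))
# → (1, 0.3318)

# Best accuracies reported by the two jobs, expressed as correctly classified images
n_val = 859
correct_resnext = round(0.331781 * n_val)
correct_resnet = round(0.317811 * n_val)
print(correct_resnext, correct_resnet, correct_resnext - correct_resnet)
# → 285 273 12
```

Expressed this way, the gap between the two models corresponds to 12 validation images.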