## Project Description

Imagine you're developing a deep learning system for sentiment analysis of product reviews at a newly established online beauty product retailer. The goal is to help the company make informed inventory decisions: which products to keep in stock and which to remove. Keen on improving customer satisfaction, the company has been monitoring comments on its website and has paid annotators to label their sentiment. They hand you a dataset of 80,000 customer reviews, each labeled 0 (negative) or 1 (positive). After extensive effort and refinement, you train and deploy a classifier that predicts sentiment from online comments and excitedly report 86% accuracy on a held-out test set. To your disappointment, management is not satisfied: they require at least 90% accuracy before considering a wider rollout of the model.
You suspect that some annotators made labeling errors that are holding the model back. Empowered by a newfound "confidence," you turn to confident learning to find and correct mislabeled examples in the dataset before retraining.
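In its basic form, confident learning compares each example's given label with a model's out-of-sample predicted probabilities, i.e. probabilities from a model that did not train on that example (for instance via cross-validation). Below is a minimal NumPy sketch of that idea, not the project's main.py: the per-class threshold is the average predicted probability of class j over the examples labeled j, and an example is flagged when the class it confidently belongs to differs from its label. The function name `find_label_issues_sketch` and the toy data are hypothetical; the open-source cleanlab library provides a fuller implementation with several filtering and ranking options.

```python
import numpy as np


def find_label_issues_sketch(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """Return a boolean mask of examples whose given label looks wrong.

    labels:     shape (n,), the annotators' integer labels (0 = negative, 1 = positive)
    pred_probs: shape (n, k), out-of-sample predicted class probabilities
    Assumes every class appears at least once in `labels`.
    """
    n_classes = pred_probs.shape[1]
    # Per-class threshold t_j: mean self-confidence of the examples labeled j
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    # Classes whose probability reaches their own threshold, per example
    above = pred_probs >= thresholds
    # Among those classes, the most probable one is the "confident" class
    confident_class = np.where(above, pred_probs, -np.inf).argmax(axis=1)
    has_confident_class = above.any(axis=1)
    # Off-diagonal cells of the confident joint: confident class != given label
    return has_confident_class & (confident_class != labels)


labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled negative, model is confident it is positive
    [0.20, 0.80],
    [0.30, 0.70],
    [0.95, 0.05],  # labeled positive, model is confident it is negative
])
issues = find_label_issues_sketch(labels, pred_probs)
print(np.flatnonzero(issues))  # → [2 5]
```

Flagged examples can then be relabeled or dropped before retraining; which of the two this project does is decided in main.py.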

First, we prepare the environment for AWS SageMaker operations by setting up clients and retrieving essential configuration details like the default S3 bucket, execution role, and AWS region. 

In [1]:
import logging
import json

import boto3
import botocore.config  # explicit import so botocore.config.Config is available
import pandas as pd
import sagemaker
from botocore.exceptions import ClientError

config = botocore.config.Config(user_agent_extra='dlai-pds/c2/w3')

# low-level service client of the boto3 session
sm = boto3.client(service_name='sagemaker', 
                  config=config)

sm_runtime = boto3.client('sagemaker-runtime',
                          config=config)

sess = sagemaker.Session(sagemaker_client=sm,
                         sagemaker_runtime_client=sm_runtime)

bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = sess.boto_region_name

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


We then define the training input for the SageMaker job: where the training data lives (an S3 prefix) and the content type SageMaker should report for it.

In [2]:
from sagemaker.inputs import TrainingInput

# S3 prefix that holds the training data (passed to fit() as the "train" channel)
train_data = TrainingInput(
    's3://rajat369007bucket/data/', 
    content_type='application/x-sagemaker-training-data'
)
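
Inside the training container, each channel passed to `fit` is copied to local disk, and SageMaker's framework containers expose its location through an `SM_CHANNEL_<NAME>` environment variable (by convention `/opt/ml/input/data/<name>`). A minimal sketch of how a script such as main.py could locate the `train` channel; the helper name `train_channel_dir` is hypothetical, and how main.py actually reads its data is not shown in this notebook.

```python
import os


def train_channel_dir(channel: str = "train") -> str:
    """Return the local directory of a SageMaker input channel (hypothetical helper).

    SageMaker training containers set SM_CHANNEL_<NAME> (e.g. SM_CHANNEL_TRAIN);
    fall back to the conventional mount point when the variable is absent.
    """
    default_dir = os.path.join("/opt/ml/input/data", channel)
    return os.environ.get(f"SM_CHANNEL_{channel.upper()}", default_dir)


# /opt/ml/input/data/train unless SM_CHANNEL_TRAIN is set
print(train_channel_dir())
```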


Next, we create a PyTorch estimator that configures the SageMaker training job: it runs the entry-point script on the specified instance type and writes the trained model to the S3 location given by `output_path`. The entry-point script, main.py, contains the main steps of this project.

In [3]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="main.py",
    source_dir="./",  # local directory uploaded to the job (contains main.py)
    base_job_name="sagemaker-script-mode",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",
    py_version="py310",
    dependencies=None,
    # S3 prefix under which SageMaker stores the job's model.tar.gz and output.tar.gz
    output_path="s3://sagemaker-us-east-1-456165959068/output/",
    environment={'PYTHONPATH': 'src'}
)

The following cell defines a ModelCheckpoint configuration that keeps only the best model (lowest development loss) in /opt/ml/model/, the container directory that SageMaker packages as model.tar.gz and uploads to S3 when the job ends. The settings reach main.py as the `callbacks` hyperparameter, so they take effect only if the training script reads them and builds the callback.

In [4]:
# Keep only the best checkpoint, ranked by development loss (lower is better).
# /opt/ml/model/ is local to the training container; when the job ends, SageMaker
# uploads its contents to the estimator's output_path as model.tar.gz.
model_checkpoint = {
    'ModelCheckpoint': {
        'monitor': 'dev_loss',
        'dirpath': '/opt/ml/model/',
        'filename': 'best_model',
        'save_top_k': 1,
        'mode': 'min'
    }
}

# Pass the settings to main.py as the `callbacks` hyperparameter
# (`_hyperparameters` is a private SDK attribute; main.py must read this value)
estimator._hyperparameters['callbacks'] = [model_checkpoint]
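
Because the checkpoint settings travel as a hyperparameter, main.py has to turn them back into a callback. A minimal sketch, assuming the script reads the `SM_HPS` environment variable (a JSON dictionary of all hyperparameters set by SageMaker's training toolkit) and that `callbacks` arrives as the list built above; the helper name `checkpoint_kwargs` and its fallback defaults are hypothetical.

```python
import json
import os


def checkpoint_kwargs(model_dir: str = "/opt/ml/model/") -> dict:
    """Build ModelCheckpoint keyword arguments from the `callbacks` hyperparameter.

    Hypothetical helper: expects a list such as [{"ModelCheckpoint": {...}}],
    either already decoded or as a JSON string.
    """
    hyperparameters = json.loads(os.environ.get("SM_HPS", "{}"))
    callbacks = hyperparameters.get("callbacks", [])
    if isinstance(callbacks, str):  # tolerate a JSON-encoded string
        callbacks = json.loads(callbacks)
    for callback in callbacks:
        if "ModelCheckpoint" in callback:
            # Values sent from the notebook override the default dirpath
            return {"dirpath": model_dir, **callback["ModelCheckpoint"]}
    # Fallback mirroring the notebook's settings: keep the lowest dev_loss model
    return {"dirpath": model_dir, "monitor": "dev_loss", "mode": "min", "save_top_k": 1}


# Simulate what the container would see for the configuration defined above
os.environ["SM_HPS"] = json.dumps({"callbacks": [{"ModelCheckpoint": {
    "monitor": "dev_loss", "dirpath": "/opt/ml/model/",
    "filename": "best_model", "save_top_k": 1, "mode": "min"}}]})
print(checkpoint_kwargs()["filename"])  # best_model
```

In main.py these keyword arguments would then go to PyTorch Lightning, e.g. `ModelCheckpoint(**checkpoint_kwargs())`, with the callback passed to the `Trainer`.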


Start the training job, mapping the S3 input to the `train` channel:

In [5]:
estimator.fit({"train": train_data})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: sagemaker-script-mode-2024-05-07-20-23-36-142


2024-05-07 20:23:39 Starting - Starting the training job...
2024-05-07 20:23:40 Pending - Training job waiting for capacity............
2024-05-07 20:25:45 Pending - Preparing the instances for training...
2024-05-07 20:26:37 Downloading - Downloading the training image.....................
2024-05-07 20:29:49 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-05-07 20:30:08,864 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-05-07 20:30:08,882 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-05-07 20:30:08,894 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-05-07 20:30:08,896 sagemaker_pytorch_container.training INFO     Invoking user training script.
...

GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /opt/ml/code/lightning_logs
| Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 49.3 K
-------------------------------------
49.3 K    Trainable params
0         Non-trainable params
49.3 K    Total params
0.197     Total estimated model params size (MB)
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 302.07it/s]

Epoch 0:  26%|██▋       | 525/2000 [00:03<00:10, 134.54it/s, loss=0.272, v_num=0, train_loss=0.198, train_acc=0.969]
...

Testing DataLoader 0:  18%|█▊        | 91/500 [00:00<00:01, 238.63it/s]
...

Predicting DataLoader 0:  38%|███▊      | 316/834 [00:00<00:00, -4000.16it/s]
...

Epoch 0:  10%|█         | 174/1667 [00:00<00:06, 220.13it/s, loss=0.516, v_num=3, train_loss=0.467, train_acc=0.844]
...

GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
| Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 49.3 K
-------------------------------------
49.3 K    Trainable params
0         Non-trainable params
49.3 K    Total params
0.197     Total estimated model params size (MB)
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 359.16it/s]
Training: 0it [00:00, ?it/s]
...


2024-05-07 20:31:44 Uploading - Uploading generated training model
Epoch 0:  30%|███       | 606/2000 [00:04<00:10, 132.60it/s, loss=0.691, v_num=4, train_loss=0.688, train_acc=0.562]
...

Epoch 0:  55%|█████▍    | 1091/2000 [00:08<00:06, 132.27it/s, loss=0.694, v_num=4, train_loss=0.690, train_acc=0.438]
...

Testing DataLoader 0:  36%|███▌      | 179/500 [00:00<00:01, 228.82it/s]
...


2024-05-07 20:32:00 Completed - Training job completed
Training seconds: 343
Billable seconds: 343


## Model Deployment

We copy the job's output archive, output.tar.gz, from S3 to the current working directory. It holds whatever the training script wrote to /opt/ml/output/data; the model itself is packaged separately as model.tar.gz.

In [8]:
# copy the training artifacts from the S3 bucket to the current working directory
!aws s3 cp "s3://sagemaker-us-east-1-456165959068/output/sagemaker-script-mode-2024-05-07-20-23-36-142/output/output.tar.gz" .     

download: s3://sagemaker-us-east-1-456165959068/output/sagemaker-script-mode-2024-05-07-20-23-36-142/output/output.tar.gz to ./output.tar.gz
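
Rather than hard-coding the S3 URI, it can be built from the estimator's `output_path` and the training job name: SageMaker writes a job's artifacts under `<output_path>/<job-name>/output/`, with the model in `model.tar.gz` and the output data in `output.tar.gz` (the deployment log further down shows the same layout for model.tar.gz). A small sketch; the helper `training_artifact_uris` is hypothetical, and the arguments are the values from this run.

```python
def training_artifact_uris(output_path: str, job_name: str) -> dict:
    """Return the S3 URIs SageMaker uses for a training job's artifacts."""
    prefix = f"{output_path.rstrip('/')}/{job_name}/output"
    return {
        "model": f"{prefix}/model.tar.gz",    # contents of /opt/ml/model
        "output": f"{prefix}/output.tar.gz",  # contents of /opt/ml/output/data
    }


uris = training_artifact_uris(
    "s3://sagemaker-us-east-1-456165959068/output/",
    "sagemaker-script-mode-2024-05-07-20-23-36-142",
)
print(uris["output"])
```

In the notebook itself, `estimator.model_data` already returns the model URI after `fit`; the output archive sits in the same prefix.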


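We then decompress the archive into `extracted_training_artifacts` for further exploration.

In [9]:
# create the target directory first: tar -C fails if it does not exist
!mkdir -p extracted_training_artifacts && tar -xzf output.tar.gz -C extracted_training_artifacts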

tar: Ignoring unknown extended header keyword `LIBARCHIVE.creationtime'
tar: Ignoring unknown extended header keyword `LIBARCHIVE.creationtime'
tar: Ignoring unknown extended header keyword `LIBARCHIVE.creationtime'
tar: Ignoring unknown extended header keyword `LIBARCHIVE.creationtime'
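
The same extraction can be done from Python with the standard-library `tarfile` module, which also takes care of creating the target directory. A sketch; the function name `extract_artifacts` is hypothetical. The archive comes from your own training job, so it is treated as trusted; the `"data"` extraction filter is applied when the running Python supports it.

```python
import os
import tarfile


def extract_artifacts(archive_path: str, dest_dir: str = "extracted_training_artifacts") -> list:
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)  # tar -C needs an existing directory
    with tarfile.open(archive_path, "r:gz") as archive:
        # Use the safer "data" extraction filter where this Python provides it
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest_dir, **kwargs)
        return archive.getnames()


# Example: extract_artifacts("output.tar.gz")
```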


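We then deploy the trained model to a real-time endpoint. Because no `endpoint_name` is passed to `deploy`, SageMaker reuses the generated model name, so the endpoint is called sagemaker-script-mode-2024-05-07-20-46-45-230, as the log below shows.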

In [12]:
# deploy the trained model
predictor = estimator.deploy(instance_type="ml.p3.2xlarge", initial_instance_count=1)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-456165959068/output/sagemaker-script-mode-2024-05-07-20-23-36-142/output/model.tar.gz), script artifact (s3://sagemaker-us-east-1-456165959068/sagemaker-script-mode-2024-05-07-20-23-36-142/source/sourcedir.tar.gz), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-456165959068/sagemaker-script-mode-2024-05-07-20-46-45-230/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: sagemaker-script-mode-2024-05-07-20-46-45-230
INFO:sagemaker:Creating endpoint-config with name sagemaker-script-mode-2024-05-07-20-46-45-230
INFO:sagemaker:Creating endpoint with name sagemaker-script-mode-2024-05-07-20-46-45-230


-----------!
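
Once the endpoint is `InService`, it can be called with the low-level `sagemaker-runtime` client created at the start (`sm_runtime`). The request format depends on the inference handler packaged with the model, which this notebook does not show, so the `{"inputs": [...]}` JSON payload and the `application/json` content types below are assumptions, and `predict_sentiment` is a hypothetical helper. It takes the client as a parameter so it can be exercised without AWS.

```python
import json


def predict_sentiment(runtime_client, endpoint_name: str, reviews: list) -> object:
    """Send reviews to a SageMaker endpoint and return the decoded JSON response.

    Assumes (hypothetically) that the endpoint accepts {"inputs": [...]} as JSON
    and replies with JSON.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"inputs": reviews}),
    )
    # The response body is a stream; read it fully before decoding
    return json.loads(response["Body"].read())


# Example call against the endpoint created above (requires AWS credentials):
# predict_sentiment(sm_runtime, "sagemaker-script-mode-2024-05-07-20-46-45-230",
#                   ["Love this serum!", "Broke me out in a rash."])
```

When you are done experimenting, `predictor.delete_endpoint()` removes the endpoint so it stops accruing charges.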