<h1>Fine-tuning DistilBERT with the Amazon Polarity dataset</h1>

We will fine-tune a pre-trained DistilBERT model on the Amazon Reviews Polarity dataset. The fine-tuned model will then be deployed for inference on a SageMaker endpoint.

The data
The Amazon Reviews Polarity dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. It is available as the amazon_polarity dataset on Hugging Face.

Setup
This notebook was tested in Amazon SageMaker Studio on an ml.m5.large instance with the Python 3 (Data Science) kernel.

Dependencies


In [10]:
# !pip install -qq "sagemaker>=2.48.0" --upgrade
# # !pip install -qq torch==1.7.1 --upgrade
# !pip install -qq sagemaker-huggingface-inference-toolkit 
# !pip install transformers
# !pip install "datasets[s3]"
# !pip install -qq ipywidgets
# !pip install -qq watermark 
# !pip install -qq "seaborn>=0.11.0"

In [9]:
# Execute one by one
# %pip install torch torchvision
# %pip install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html

In [2]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

In [11]:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs
import sagemaker.huggingface
from sagemaker.huggingface import HuggingFace
import transformers
from transformers import AutoTokenizer
from datasets import load_dataset


import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from textwrap import wrap

import boto3
import pprint
import time
import torch

In [4]:
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
rcParams['figure.figsize'] = 17, 8

<h4>Set up SageMaker session and bucket</h4>

In [5]:
sess = sagemaker.Session()
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::632619531167:role/service-role/AmazonSageMaker-ExecutionRole-20230802T212620
sagemaker bucket: sagemaker-us-east-2-632619531167
sagemaker session region: us-east-2


Data Preparation

In [6]:
dataset_name = 'amazon_polarity'

train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])
train_dataset = train_dataset.shuffle().select(range(5000)) # limiting the dataset size to speed up the training during the demo
test_dataset = test_dataset.shuffle().select(range(1000))

In [20]:
print(train_dataset.column_names)

['label', 'title', 'content']


In [21]:
train_dataset

Dataset({
    features: ['label', 'title', 'content'],
    num_rows: 5000
})

In [22]:
train_dataset[0]

{'label': 0,
 'title': 'Laptop lite and Fan U14',
 'content': 'I bought this product as a gift for my daughters and they now have them so I can not tell you how they are but from the looks of them when I was wrapping them they looked like they might help see the key board of laptop a lot better especially in low light as for the fan without trying it would be hard to be able to say how it works.Thank YouSandra Ayotte'}

<h2>Preparing the dataset to be used with PyTorch</h2>

In [7]:
tokenizer_name = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

In [8]:
# tokenize our training and testing datasets and then set them to the PyTorch format:
def tokenize(batch):
    return tokenizer(batch['content'], padding='max_length', truncation=True)

# Tokenize
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# Set the format to PyTorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])



In [12]:
train_dataset

Dataset({
    features: ['labels', 'title', 'content', 'input_ids', 'attention_mask'],
    num_rows: 5000
})
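
The two new columns, input_ids and attention_mask, come from the tokenizer. As a rough, library-free illustration of what padding='max_length' and truncation=True do to a sequence of token IDs, here is a small sketch. The helper name pad_or_truncate, the pad ID 0, and the toy token IDs are assumptions made for this example; this is not the tokenizer's actual implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: drop anything beyond max_length.
    kept = token_ids[:max_length]
    # Padding: fill the remainder with pad_id.
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_or_truncate([101, 1188, 1110, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Unlike this sketch, the real tokenizer keeps special tokens such as the closing [SEP] when it truncates. The point here is only that every example ends up with the same length, which lets the rows be stacked into fixed-size PyTorch tensors.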

In [15]:
import botocore
from datasets.filesystems import S3FileSystem

# Upload to S3
s3 = S3FileSystem()
s3_prefix = f'samples/datasets/{dataset_name}'
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)

print(f'Uploaded training data to {training_input_path}')
print(f'Uploaded testing data to {test_input_path}')


Uploaded training data to s3://sagemaker-us-east-2-632619531167/samples/datasets/amazon_polarity/train
Uploaded testing data to s3://sagemaker-us-east-2-632619531167/samples/datasets/amazon_polarity/test


<h3>Fine-tuning and starting a SageMaker training job</h3>


To create a SageMaker training job you need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator you define which fine-tuning script to use as entry_point, which instance_type to run on, and which hyperparameters to pass in.

huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.6',
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = hyperparameters)
When you create a SageMaker training job, SageMaker starts and manages all the required compute instances with the Hugging Face container, uploads the provided fine-tuning script train.py, and downloads the data from our sagemaker_session_bucket into the container's local storage at /opt/ml/input/data. It then starts the training job by running a command such as:

/opt/conda/bin/python train.py --epochs 5 --model_name distilbert-base-cased --tokenizer_name distilbert-base-cased --train_batch_size 32
The hyperparameters you define in the HuggingFace estimator are passed in as named arguments. The training script expects the Hugging Face model name and tokenizer name so that it can retrieve them.
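
The mapping from a hyperparameters dictionary to named arguments can be pictured with a small sketch. The helper to_cli_args is hypothetical and only mimics the effect; it is not SageMaker's actual code.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameters dict into ['--key', 'value', ...]."""
    args = []
    for name, value in hyperparameters.items():
        # Each entry becomes a flag followed by its value as a string.
        args.extend([f"--{name}", str(value)])
    return args


example = {"epochs": 1, "train_batch_size": 32, "model_name": "distilbert-base-cased"}
cli_args = to_cli_args(example)
print(" ".join(cli_args))
```

Every value reaches the script as a string, so the script has to convert it back to the type it needs, for example with argparse's type=int.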

SageMaker provides other useful properties about the training environment through environment variables, including the following:

SM_MODEL_DIR: A string that represents the path where the training job writes the model artifacts. After training, artifacts in this directory are uploaded to S3 for model hosting.

SM_NUM_GPUS: An integer representing the number of GPUs available to the host.

SM_CHANNEL_XXXX: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named train and test, the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.
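
The following is a minimal sketch of how a training script might combine the named arguments with these environment variables. It is not the contents of scripts/train.py; the argument names --model_dir, --n_gpus, --training_dir, and --test_dir and the local fallback paths are assumptions chosen for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters passed by the estimator as named arguments.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-cased")
    parser.add_argument("--tokenizer_name", type=str, default="distilbert-base-cased")
    # Paths and hardware info; SageMaker sets these variables inside the container.
    # The fallback values are placeholders for running the sketch locally.
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--n_gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "./data/test"))
    # parse_known_args ignores any extra arguments this sketch does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "3", "--train_batch_size", "16"])
print(args.epochs, args.train_batch_size, args.model_dir)
```

Using the environment variables as argument defaults keeps the script runnable both inside the SageMaker container and on a local machine.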

You can inspect the training script by running the next cell

<h4>Creating an Estimator and starting a training job</h4>


Name your training job so you can follow it:

In [18]:
import datetime

model_name = 'distilbert-base-cased'
# Timestamp as YYYY-MM-DD-HH-MM-SS so the job name contains only letters, digits, and hyphens
current_time = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
training_job_name = f'finetune-{model_name}-{current_time}'
print(training_job_name)

finetune-distilbert-base-cased-2023-08-08-10-43-27
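
Training job names may contain only letters, digits, and hyphens and are limited to 63 characters, which is why the timestamp avoids colons and spaces. A small check along these lines can catch an invalid name before the job is submitted; the helper is_valid_job_name is a hypothetical convenience, not part of the SageMaker SDK.

```python
import re

# Alphanumeric characters separated by hyphens; must start and end with an alphanumeric.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def is_valid_job_name(name):
    """Return True if name fits the length limit and the allowed character pattern."""
    return len(name) <= MAX_JOB_NAME_LENGTH and bool(JOB_NAME_PATTERN.match(name))


good = is_valid_job_name("finetune-distilbert-base-cased-2023-08-08-10-43-27")
bad = is_valid_job_name("finetune-distilbert-base-cased-2023-08-08 10:43:27")
print(good, bad)
```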


In [19]:
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name': model_name,
                 'tokenizer_name': tokenizer_name,
                 'output_dir':'/opt/ml/checkpoints',
                 }

In [21]:
# Raw strings keep the backslash intact; \. matches a literal decimal point
metric_definitions=[
    {'Name': 'loss', 'Regex': r"'loss': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': r"'learning_rate': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': r"'eval_loss': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': r"'eval_accuracy': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': r"'eval_f1': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': r"'eval_precision': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': r"'eval_recall': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': r"'eval_runtime': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': r"'eval_samples_per_second': ([0-9]+(\.|e-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': r"'epoch': ([0-9]+(\.|e-)[0-9]+),?"}]
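
Each Regex is applied to the training job's log output, and the captured number becomes the metric value. A minimal sketch of that extraction with Python's re module is shown here. The log line is a made-up example in the dictionary-like format these patterns expect, and extract_metric is a hypothetical helper, not a SageMaker API.

```python
import re

# Hypothetical log line in the format the metric patterns expect.
log_line = "{'loss': 0.3521, 'learning_rate': 5e-05, 'epoch': 1.0}"

# Same shape as the metric patterns, with the decimal point escaped.
number = r"([0-9]+(\.|e-)[0-9]+)"


def extract_metric(name, line):
    """Return the metric value found in line, or None if the metric is absent."""
    match = re.search(f"'{name}': {number},?", line)
    # Group 1 holds the captured number.
    return float(match.group(1)) if match else None


loss = extract_metric("loss", log_line)
learning_rate = extract_metric("learning_rate", log_line)
eval_loss = extract_metric("eval_loss", log_line)
print(loss, learning_rate, eval_loss)
```

Because each pattern starts with the quoted metric name, the 'loss' pattern does not accidentally match an 'eval_loss' entry.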

In [27]:
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p2.xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.6', 
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = hyperparameters,
                            metric_definitions=metric_definitions,
                            max_run=36000, # expected max run in seconds
                        )

In [29]:
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path}, wait=False, job_name=training_job_name )
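
Because fit is called with wait=False, the call returns as soon as the job is submitted. The cells that follow assume the job has already produced metrics and, for deployment, has completed. One way to block until then is to poll the job status; the sketch below does this with a hypothetical helper, wait_for_training_job, that takes the describe function as a parameter so it can be tried without AWS access.

```python
import time

# Job states after which the status no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll until the job reaches a terminal state and return that state.

    `describe` is expected to behave like
    boto3.client("sagemaker").describe_training_job.
    """
    while True:
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)


# In the notebook this would be called roughly as:
#   sm_client = boto3.client("sagemaker")
#   wait_for_training_job(sm_client.describe_training_job, training_job_name)

# Stand-in for the real client so the sketch runs without AWS access.
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_fake_statuses)}


final_status = wait_for_training_job(
    fake_describe, "demo-job", poll_seconds=0, sleep=lambda seconds: None
)
print(final_status)
```

A returned status of Failed or Stopped means there is no model artifact to deploy, so it is worth checking the result before moving on.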

<h2>Training Metrics</h2>

In [None]:
from sagemaker import TrainingJobAnalytics

# Captured metrics can be accessed as a Pandas dataframe
df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)

In [None]:
evals = df[df.metric_name.isin(['eval_accuracy','eval_precision', 'eval_f1'])]
losses = df[df.metric_name.isin(['loss', 'eval_loss'])]

sns.lineplot(
    x='timestamp', 
    y='value', 
    data=evals, 
    style='metric_name',
    markers=True,
    hue='metric_name'
)

ax2 = plt.twinx()
sns.lineplot(
    x='timestamp', 
    y='value', 
    data=losses, 
    hue='metric_name',
    ax=ax2)

<h2>Endpoint</h2>

In [None]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=training_job_name)

<h2>Predictions</h2>

In [None]:
data = {
   "inputs": [
       "Good product!",
       "Product is not good at all",
       "Idea is good, but product quality is poor"
   ]
}

# request
predictor.predict(data)
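
For a text-classification model the endpoint typically returns one label and score per input. The sketch below shows one way to turn such a response into readable sentiments. The response values are invented for illustration, and the LABEL_0/LABEL_1 names assume the training script did not set custom label names in the model configuration.

```python
# Hypothetical response, shaped like a text-classification result (scores are made up).
example_response = [
    {"label": "LABEL_1", "score": 0.98},
    {"label": "LABEL_0", "score": 0.95},
    {"label": "LABEL_0", "score": 0.71},
]

# amazon_polarity uses 0 for negative and 1 for positive reviews.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def to_sentiments(response):
    """Map raw label IDs to readable names, leaving unknown labels unchanged."""
    return [
        (LABEL_NAMES.get(item["label"], item["label"]), item["score"])
        for item in response
    ]


sentiments = to_sentiments(example_response)
for sentiment, score in sentiments:
    print(f"{sentiment}: {score:.2f}")
```

When you are done experimenting, predictor.delete_endpoint() removes the endpoint so that it stops incurring charges.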