<img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" />

<!--- @wandbcode{sagemaker-hf} -->

<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/sagemaker/text_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification with SageMaker & Weights & Biases

This notebook will demonstrate how to:
- log the datasets to W&B Tables for EDA
- train on the [`banking77`](https://huggingface.co/datasets/banking77) dataset
- log experiment results to Weights & Biases
- log the validation predictions to W&B Tables for model evaluation
- save the raw dataset, processed dataset and model weights to W&B Artifacts

Note: this notebook should be run in a SageMaker notebook instance.

## SageMaker

<img src="https://i.imgur.com/Za9P1sr.png" width="400" alt="Weights & Biases" />

SageMaker is a comprehensive machine learning service. It is a tool that helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning (ML) models by providing a rich set of orchestration tools and features.

## Credit

This notebook is based on the Hugging Face & AWS SageMaker examples that can be [found here](https://huggingface.co/docs/sagemaker)

## Setup

In [1]:
!pip install -qqq wandb --upgrade
!pip install -qq "sagemaker>=2.48.0" "transformers>=4.6.1" "datasets[s3]>=1.6.2" --upgrade

In [8]:
from pathlib import Path
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

## Weights & Biases Setup for AWS SageMaker

In [3]:
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: morgan (use `wandb login --relogin` to force relogin)


True

The **only** additional piece of setup needed to use W&B with SageMaker is to make your W&B API key available to SageMaker. In this case we save it to a file in the same directory as our training script. This will be named `secrets.env` and W&B will then use this to authenticate on each of the instances that SageMaker spins up.

In [4]:
wandb.sagemaker_auth(path="scripts")
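
For illustration only, the sketch below shows the general idea behind a `secrets.env` file, assuming it holds plain `KEY=VALUE` lines that a process loads into its environment at startup. W&B handles this itself on the SageMaker instances, so nothing like it is needed in the training script. The helper name `load_env_file` and the placeholder key are assumptions made for this example.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path, environ=None):
    """Read KEY=VALUE lines from `path` into `environ` (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        environ[key.strip()] = value.strip()
    return environ


# Demo with a throwaway file and a placeholder key (not a real credential)
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / "secrets.env"
    env_path.write_text("WANDB_API_KEY=placeholder-key\n")
    demo_env = load_env_file(env_path, environ={})

print(demo_env)
```

Because the file contains a credential, it is worth keeping `scripts/secrets.env` out of version control.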

# Log Dataset for Exploratory Analysis in W&B Tables

Here we log the `train` and `eval` datasets to separate W&B Tables. After this is run, we can explore these tables in the W&B UI.

In [12]:
wandb.init(name='log_dataset_to_table', project='hf-sagemaker', job_type='TableLogging')

raw_dataset = load_dataset('banking77')
label_list = raw_dataset['train'].features["label"].names

# ✍️ Log the training and eval datasets as Weights & Biases Tables ✍️
for split in ['train','test']:
    
    ds = raw_dataset[split]
    
    # Create W&B Table
    dataset_table = wandb.Table(columns=['id', 'label_id', 'label', 'text'])

    # Ensure different row ids when logging train and eval data
    if split == 'test':
        idx_step = len(raw_dataset['train'])
        nm = 'eval'
    else:
        idx_step = 0
        nm = 'train'

    # Add each row of data to the table
    for index in range(len(ds)):
        idx = index + idx_step
        lbl = ds[index]['label']
        row = [idx, lbl, label_list[lbl], ds[index]['text']]
        dataset_table.add_data(*row)

    wandb.log({f'{nm} table': dataset_table})
    
wandb.finish()

Using custom data configuration default
Reusing dataset banking77 (/home/ec2-user/.cache/huggingface/datasets/banking77/default/1.1.0/17ffc2ed47c2ed928bee64127ff1dbc97204cb974c2f980becae7c864007aed9)
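
The row layout used for these tables can be sketched in plain Python, without W&B or the real dataset. The point is the `id_offset`: eval rows continue numbering where the train rows stop, so ids never collide across the two tables. `build_rows` and the toy texts and labels are made up for this example.

```python
def build_rows(texts, label_ids, label_names, id_offset=0):
    """Return [id, label_id, label, text] rows, with ids starting at id_offset."""
    return [
        [id_offset + i, label_id, label_names[label_id], text]
        for i, (text, label_id) in enumerate(zip(texts, label_ids))
    ]


# Toy data standing in for the real splits (not actual banking77 rows)
label_names = ["label_a", "label_b"]
train_rows = build_rows(["first text", "second text"], [0, 1], label_names)

# Offset the eval ids by the number of train rows
eval_rows = build_rows(["third text"], [0], label_names, id_offset=len(train_rows))

print(train_rows)
print(eval_rows)
```

Each row could then be passed to `dataset_table.add_data(*row)` as in the cell above.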




# Training with SageMaker and W&B

### SageMaker Role

First we need to get our SageMaker role permissions. If you are going to use SageMaker in a local environment, you need access to an IAM role with the required SageMaker permissions. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

In [25]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::618469898284:role/jeff-sagemaker
sagemaker bucket: sagemaker-us-east-2-618469898284
sagemaker session region: us-east-2


### Create an Estimator and start a training job
Here we will use the `HuggingFace` estimator from SageMaker, which includes an image with the main libraries needed to train Hugging Face models.

In [26]:
from sagemaker.huggingface import HuggingFace

model = 'distilbert-base-uncased'
warmup_steps = 100
lr = 1e-4

# hyperparameters, which are passed into the training job
hyperparameters={
    'output_dir': 'tmp',
    'overwrite_output_dir': True,
    'model_name_or_path': model,
    'dataset_name': 'banking77',
    'do_train': True,
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16,
    'gradient_accumulation_steps': 2,
    'learning_rate': lr,
    'warmup_steps': warmup_steps,
    'fp16': True,
    'logging_steps': 10,
    'max_steps': 1200,
    'eval_steps': 100,
    'evaluation_strategy' : 'steps',
    'save_steps': 600,
    'save_total_limit' : 2,
    'load_best_model_at_end': True,
    'metric_for_best_model': 'accuracy',
    'report_to': 'wandb',    # ✍️
    }

hyperparameters['run_name'] = f"{model}_{lr}_{warmup_steps}"
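
SageMaker hands these hyperparameters to the entry point script as command-line arguments. The sketch below shows roughly what that looks like from the script's side, using `argparse`. The real `run_text_classification.py` is not shown in this notebook and may parse its arguments differently, so `to_cli_args` and `str_to_bool` are hypothetical helpers for illustration.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...] (illustrative)."""
    args = []
    for key, value in hyperparameters.items():
        args += [f"--{key}", str(value)]
    return args


def str_to_bool(value):
    """Parse 'True'/'False' strings, since every value arrives as text."""
    return str(value).lower() in {"true", "1", "yes"}


parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path", type=str)
parser.add_argument("--learning_rate", type=float)
parser.add_argument("--warmup_steps", type=int)
parser.add_argument("--fp16", type=str_to_bool)

cli_args = to_cli_args({
    "model_name_or_path": "distilbert-base-uncased",
    "learning_rate": 1e-4,
    "warmup_steps": 100,
    "fp16": True,
})
parsed = parser.parse_args(cli_args)
print(parsed)
```

The practical consequence is that every value arrives as a string, so the script has to convert types itself.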

In [27]:
huggingface_estimator = HuggingFace(entry_point='run_text_classification.py',
                            source_dir='./scripts',
                            instance_type= 'ml.p3.2xlarge', 
                            instance_count=1,
                            role=role,
                            transformers_version='4.6',
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = hyperparameters)

In [28]:
huggingface_estimator.fit(wait=False)

# HyperParameter Tuning with SageMaker and Weights & Biases

We can also use SageMaker's `HyperparameterTuner` to run a hyperparameter search and log the results to Weights & Biases.

In [35]:
import sagemaker
from sagemaker.huggingface import HuggingFace
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

In [36]:
max_jobs=50
max_parallel_jobs=2

In [37]:
dataset_name = 'banking77_artifacts'  # Pre-tokenized dataset that will be downloaded from W&B Artifacts
model = 'roberta-large'
warmup_steps = None
lr = 1e-5

# hyperparameters, which are passed into the training job
hyperparameters={
    'output_dir': 'tmp',
    'overwrite_output_dir': True,
    'model_name_or_path': model,
    'dataset_name': dataset_name,
    'do_train': True,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'gradient_accumulation_steps': 8,
    'learning_rate': lr,
    'warmup_steps': warmup_steps,
    'fp16': True,
    'logging_steps': 10,
    'max_steps': 1200,
    'evaluation_strategy' : 'steps',
    'eval_steps': 100,
    'save_strategy': "steps", # "no"
    'save_steps': 600,
    'save_total_limit' : 1,
    'load_best_model_at_end': True,
    'metric_for_best_model': 'accuracy',
    'report_to': 'wandb',    # ✍️
    'run_name': 'hpt'  # will set run name 
    }
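
The smaller per-device batch size here is offset by more gradient accumulation steps. As a quick arithmetic sketch, assuming one GPU per training instance (`num_devices=1` is an assumption of this example), both this configuration and the earlier `distilbert-base-uncased` one reach the same effective batch size:

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Earlier single training job: batch size 16, accumulation 2
single_job_batch = effective_batch_size(16, 2)

# Tuning jobs: batch size 4, accumulation 8
tuning_job_batch = effective_batch_size(4, 8)

# Examples processed over max_steps=1200 optimizer steps
examples_seen = tuning_job_batch * 1200

print(single_job_batch, tuning_job_batch, examples_seen)
```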

In [39]:
huggingface_estimator = HuggingFace(
    entry_point='run_text_classification.py',
    source_dir='./scripts',
    instance_type= 'ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters = hyperparameters
)

In [40]:
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(1e-5, 1e-4),
    'warmup_steps': IntegerParameter(48, 320),
    'model_name_or_path': CategoricalParameter(["google/electra-large-discriminator",
                                                "roberta-large", 
                                                "albert-large-v2",
                                               ])
}

objective_metric_name = 'eval_accuracy'
objective_type = 'Maximize'
metric_definitions = [
    # Raw strings keep `\D` literal; the trailing `$` makes the lazy group capture the value
    {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
    {"Name": "eval_accuracy", "Regex": r"eval_accuracy.*=\D*(.*?)$"},
    {"Name": "eval_loss", "Regex": r"eval_loss.*=\D*(.*?)$"},
]

tuner = HyperparameterTuner(
    huggingface_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=max_jobs,
    max_parallel_jobs=max_parallel_jobs,
    objective_type=objective_type
)
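
Before launching a tuning job, a metric regex can be checked locally against a sample log line. In this sketch Python's `re` module stands in for whatever matcher SageMaker applies, and the sample line is a made-up example of training output, so treat the result as a sanity check rather than a guarantee. It also shows why a lazy group such as `(.*?)` needs something after it, for example `$`, in order to capture anything.

```python
import re


def extract_metric(pattern, log_line):
    """Return the first capture group as a float, or None if nothing usable matched."""
    match = re.search(pattern, log_line)
    if match is None or match.group(1) == "":
        return None
    return float(match.group(1))


sample_line = "eval_accuracy = 0.9312"  # hypothetical log line

# A lazy group at the very end of a pattern is satisfied by the empty string
unanchored = r"eval_accuracy.*=\D*(.*?)"
# Anchoring to end-of-line forces the group to extend over the number
anchored = r"eval_accuracy.*=\D*(.*?)$"

print(extract_metric(unanchored, sample_line))
print(extract_metric(anchored, sample_line))
```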

In [41]:
tuner.fit(wait=False)

# Dataset Versioning with W&B Artifacts

Weights and Biases Artifacts enable you to log end-to-end training pipelines to ensure your experiments are always reproducible.

Data privacy is critical to Weights & Biases, so we support creating Artifacts from reference locations in your own private cloud, such as AWS S3 or Google Cloud Storage. Local, on-premises installations of W&B are also available upon request.

By default, W&B stores artifact files in a private Google Cloud Storage bucket located in the United States. All files are encrypted at rest and in transit. For sensitive files, we recommend a private W&B installation or the use of reference artifacts.

### Artifacts - Log Raw Dataset

In [61]:
dataset_name = 'banking77'
dataset_path = Path('data')
raw_dataset_path = dataset_path/f'{dataset_name}_raw'

Log to W&B Artifacts

In [36]:
wandb.init(project='hf-sagemaker', name='log_raw_dataset', job_type='dataset-logging')

# Download data and save to disk
raw_datasets = load_dataset(dataset_name)
raw_datasets.save_to_disk(raw_dataset_path)

# Upload data to W&B Artifacts
dataset_artifact = wandb.Artifact(f'{dataset_name}_raw', type='raw_dataset')
dataset_artifact.add_dir(raw_dataset_path)
wandb.log_artifact(dataset_artifact)

Using custom data configuration default
Reusing dataset banking77 (/home/ec2-user/.cache/huggingface/datasets/banking77/default/1.1.0/17ffc2ed47c2ed928bee64127ff1dbc97204cb974c2f980becae7c864007aed9)
wandb: Adding directory to artifact (./data/banking77_raw)... Done. 0.1s


<wandb.sdk.wandb_artifacts.Artifact at 0x7f85ca81dc88>

In [38]:
wandb.finish()

### Artifacts - Log Train/Eval Split

In [39]:
# Define our train/eval paths
train_dataset_path = dataset_path/f'{dataset_name}_train'
eval_dataset_path = dataset_path/f'{dataset_name}_eval'

Log to W&B Artifacts

In [42]:
wandb.init(project='hf-sagemaker', name='log_train_eval_split', job_type='train-eval-split')

# Download the raw dataset from W&B Artifacts
artifact = wandb.use_artifact('morgan/hf-sagemaker/banking77_raw:v0', type='raw_dataset')
artifact_dir = artifact.download(raw_dataset_path)

# Load the raw dataset into a Hugging Face Datasets object
raw_datasets = load_from_disk(artifact_dir)

# Log the train and eval datasets as separate objects
for split in ['train', 'test']:
    ds = raw_datasets[split]
    
    if split == 'test':
        split = 'eval'
        
    nm = f'{dataset_name}_{split}'    
    
    # Save the Hugging Face Datasets object to disk
    ds.save_to_disk(dataset_path/nm)

    # Upload the train or eval split to W&B Artifacts
    artifact = wandb.Artifact(nm, type=f'{split}_dataset')
    artifact.add_dir(dataset_path/nm)
    wandb.log_artifact(artifact)

wandb: Adding directory to artifact (./data/banking77_train)... Done. 0.1s
wandb: Adding directory to artifact (./data/banking77_eval)... Done. 0.1s


In [43]:
wandb.finish()


### Artifacts - Dataset Preprocessing: Tokenization

In [72]:
def preprocess_function(examples):
    # Tokenize the texts
    result = tokenizer(examples['text'], padding=padding, max_length=max_seq_length, truncation=True)

    # Map labels to IDs (not necessary for GLUE tasks)
    if "label" in examples:
        result["label"] = examples["label"]
    return result

In [73]:
# Define the models we'll be using
models = ["google/electra-large-discriminator", "roberta-large", "albert-large-v2"]
padding = "max_length"    
max_seq_length=512

In [74]:
# Load Dataset
wandb.init(project='hf-sagemaker', name='tokenization', job_type='train-eval-tokenization')

for split in ['train', 'eval']:
    # Define our train/eval paths
    ds_path = dataset_path/f'{dataset_name}_{split}'
    
    # Download the raw dataset from W&B Artifacts and load to HF Datasets object
    artifact = wandb.use_artifact(f'morgan/hf-sagemaker/banking77_{split}:v0', type=f'{split}_dataset')
    artifact_dir = artifact.download(ds_path)
    dataset = load_from_disk(artifact_dir)
    
    for model_name in models:
        nm = f"{split}_{model_name.split('/')[-1]}_tokenized"
        pth = ds_path/f'{dataset_name}_{nm}'
        
        # Get tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
        max_seq_length = min(max_seq_length, tokenizer.model_max_length)

        # Do tokenization
        tok_dataset = dataset.map(preprocess_function, batched=True)
        
        # Save the Hugging Face Datasets object to disk
        tok_dataset.save_to_disk(pth)

        # Upload the train or eval split to W&B Artifacts
        artifact = wandb.Artifact(nm, type=f'{split}_tokenized_dataset')
        artifact.add_dir(pth)
        wandb.log_artifact(artifact)

wandb: Adding directory to artifact (./data/banking77_train/banking77_train_electra-large-discriminator_tokenized)... Done. 0.1s
train_tokenized_dataset
wandb: Adding directory to artifact (./data/banking77_train/banking77_train_roberta-large_tokenized)... Done. 0.1s
train_tokenized_dataset
wandb: Adding directory to artifact (./data/banking77_train/banking77_train_albert-large-v2_tokenized)... Done. 0.1s
train_tokenized_dataset
wandb: Adding directory to artifact (./data/banking77_eval/banking77_eval_electra-large-discriminator_tokenized)... Done. 0.1s
eval_tokenized_dataset
wandb: Adding directory to artifact (./data/banking77_eval/banking77_eval_roberta-large_tokenized)... Done. 0.1s
eval_tokenized_dataset
wandb: Adding directory to artifact (./data/banking77_eval/banking77_eval_albert-large-v2_tokenized)... Done. 0.1s
eval_tokenized_dataset


In [75]:
wandb.finish()
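
The naming scheme used in the tokenization loop can be sketched on its own, which makes it easy to list the artifacts a run will produce before uploading anything. The two helpers below are hypothetical; they only reproduce the string formatting from the cells above, including the `entity/project/name:alias` form passed to `wandb.use_artifact`.

```python
def tokenized_artifact_name(split, model_name):
    """Artifact name for a tokenized split, dropping any 'org/' prefix from the model."""
    short_name = model_name.split("/")[-1]
    return f"{split}_{short_name}_tokenized"


def artifact_reference(entity, project, name, alias="v0"):
    """Fully qualified artifact reference string."""
    return f"{entity}/{project}/{name}:{alias}"


models = ["google/electra-large-discriminator", "roberta-large", "albert-large-v2"]

# One artifact per (split, model) pair: 2 splits x 3 models
names = [tokenized_artifact_name(split, model) for split in ["train", "eval"] for model in models]

for name in names:
    print(artifact_reference("morgan", "hf-sagemaker", name))
```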


### Mini sweep

In [13]:
from sagemaker.huggingface import HuggingFace

for i in range(12):
    
    # hyperparameters, which are passed into the training job
    hyperparameters={
        'output_dir': 'tmp',
        'overwrite_output_dir': True,
        'model_name_or_path': 'albert-large-v2', #'distilbert-base-uncased',  # microsoft/deberta-base , roberta-base
        'dataset_name': 'banking77',
        'do_train': True,
        'per_device_train_batch_size': 16,
        'per_device_eval_batch_size': 16,
        'gradient_accumulation_steps': 4,
        'learning_rate': 1e-4,
        'warmup_steps': 100,
        'fp16': True,
        'logging_steps': 10,
        'max_steps': 1200,
        'eval_steps': 100,
        'save_steps': 2000,
        'save_total_limit' : 1,
        'evaluation_strategy' : 'steps',
        'load_best_model_at_end': True,
        'metric_for_best_model': 'accuracy',
        'report_to': 'wandb',    # ✍️
    }
    
#     if i == 1: 
#         hyperparameters['model_name_or_path'] = "roberta-large"
#     if i == 2: 
#         hyperparameters['model_name_or_path'] = "roberta-base"        
    if i == 0: 
        hyperparameters['learning_rate'] = 3e-4
    elif i == 1: 
        hyperparameters['model_name_or_path'] = "roberta-base"
        hyperparameters['learning_rate'] = 3e-4
    elif i == 2: 
        hyperparameters['model_name_or_path'] = "roberta-large"
        hyperparameters['learning_rate'] = 3e-4
    elif i == 3: 
        hyperparameters['learning_rate'] = 3e-5
    elif i == 4: 
        hyperparameters['model_name_or_path'] = "roberta-base"
        hyperparameters['learning_rate'] = 3e-5
    elif i == 5: 
        hyperparameters['model_name_or_path'] = "roberta-large"
        hyperparameters['learning_rate'] = 3e-5
    elif i == 6: 
        hyperparameters['warmup_steps'] = None
    elif i == 7: 
        hyperparameters['warmup_steps'] = None
        hyperparameters['model_name_or_path'] = "roberta-base"
    elif i == 8: 
        hyperparameters['warmup_steps'] = None
        hyperparameters['model_name_or_path'] = "roberta-large"

    hyperparameters['run_name'] = hyperparameters['model_name_or_path']    # ✍️ name the W&B run after the model
        
    huggingface_estimator = HuggingFace(entry_point='run_text_classification.py',
                                source_dir='./scripts',
                                instance_type= 'ml.p3.2xlarge', #'ml.p3.8xlarge', #'ml.p3.2xlarge', ##'g4dn.12xlarge',
                                instance_count=1,
                                role=role,
                                transformers_version='4.6',
                                pytorch_version='1.7',
                                py_version='py36',
                                hyperparameters = hyperparameters)

    huggingface_estimator.fit()

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.2xlarge for training job usage' is 2 Instances, with current utilization of 2 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
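
The `if`/`elif` chain in the loop above walks a 3 x 3 grid: three override settings (learning rate 3e-4, learning rate 3e-5, no warmup) crossed with three models. As a sketch only, the same nine override sets could be generated with `itertools.product`. `build_configs` and the trimmed `base` dict are hypothetical names for this example, and indices 9 to 11 of the original loop, which keep the defaults, are not reproduced.

```python
from itertools import product


def build_configs(base, settings, models):
    """Cross every override setting with every model, in loop-index order."""
    configs = []
    for setting, model in product(settings, models):
        config = {**base, **setting, "model_name_or_path": model}
        config["run_name"] = model  # name each run after its model
        configs.append(config)
    return configs


# Only the keys that vary are shown; the full dict would carry the other hyperparameters too
base = {"model_name_or_path": "albert-large-v2", "learning_rate": 1e-4, "warmup_steps": 100}
settings = [{"learning_rate": 3e-4}, {"learning_rate": 3e-5}, {"warmup_steps": None}]
models = ["albert-large-v2", "roberta-base", "roberta-large"]

configs = build_configs(base, settings, models)
for index, config in enumerate(configs):
    print(index, config)
```

Each config could then be passed as `hyperparameters` to a `HuggingFace` estimator. Note that the account-level instance limit reported in the error above still applies however the configs are generated.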

# W&B Sweep

In [14]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={
    'output_dir': 'tmp',
    'model_name_or_path': 'distilbert-base-uncased',  # microsoft/deberta-base , roberta-base
    'dataset_name': 'banking77',
    'do_train': True,
    'do_eval': True,
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16,
    'gradient_accumulation_steps': 4,
    'learning_rate': 1e-4,
    'warmup_steps': 100,
    'fp16': True,
    'logging_steps': 10,
    'max_steps': 1200,
    'eval_steps': 100,
    'save_steps': 10000,
    'evaluation_strategy' : 'steps',
    'report_to': 'wandb',    # ✍️
}

huggingface_estimator = HuggingFace(
    entry_point='run_text_cls_sweep.py',
    source_dir='./scripts',
    instance_type= 'ml.p3.2xlarge', #'ml.p3.8xlarge', #'g4dn.12xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters = hyperparameters
)

huggingface_estimator.fit(wait=False)