<img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" />

<!--- @wandbcode{sagemaker-hf} -->

# Text Classification with SageMaker & Weights & Biases

This notebook will demonstrate how to:
- log the datasets to W&B Tables for EDA
- train on the [`banking77`](https://huggingface.co/datasets/banking77) dataset
- train in distributed mode
- log experiment results to Weights & Biases
- log the validation predictions to W&B Tables for model evaluation
- save the model weights to W&B Artifacts

## SageMaker

<img src="https://i.imgur.com/Za9P1sr.png" width="400" alt="Weights & Biases" />

SageMaker is a comprehensive machine learning service. It is a tool that helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning (ML) models by providing a rich set of orchestration tools and features.

## Setup

In [9]:
!pip install -qqq wandb --upgrade
!pip install -qq "sagemaker>=2.48.0" "transformers>=4.6.1" "datasets[s3]>=1.6.2" --upgrade

_If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [10]:
import sagemaker

sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::618469898284:role/jeff-sagemaker
sagemaker bucket: sagemaker-us-east-2-618469898284
sagemaker session region: us-east-2


## Uploading data to `sagemaker_session_bucket`

After processing the datasets, we use the `datasets` library's `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload them to S3.

In [None]:
import botocore
from datasets.filesystems import S3FileSystem

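s3 = S3FileSystem()

# NOTE: `s3_prefix`, `train_dataset` and `test_dataset` are not defined in this
# notebook; they are assumed to come from an earlier preprocessing step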

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)

## Fine-tuning & starting a SageMaker Training Job

To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script should be used as the `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, and so on.



```python
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            transformers_version='4.4',
                            pytorch_version='1.6',
                            py_version='py36',
                            role=role,
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })
```

When we create a SageMaker training job, SageMaker starts and manages all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. It then starts the training job by running:

```bash
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
```

The `hyperparameters` you define in the `HuggingFace` estimator are passed in as named arguments. 
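
To make the mapping from the `hyperparameters` dictionary to the command line concrete, here is a minimal sketch. The helper `to_cli_args` is hypothetical and written only for illustration; it is not SageMaker's own implementation, and the alphabetical ordering is an assumption chosen to match the command shown above.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into a list of `--name value` arguments.

    Hypothetical helper for illustration only, not part of the SageMaker SDK.
    """
    args = []
    for name in sorted(hyperparameters):
        args.extend([f"--{name}", str(hyperparameters[name])])
    return args


hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
}

# Prints: train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
print(" ".join(["train.py"] + to_cli_args(hyperparameters)))
```

Every value reaches the script as a string, so the training script has to convert it back to the intended type.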

SageMaker exposes useful properties of the training environment through environment variables, including the following:

* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator's `fit` call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
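
A training script typically reads these variables as defaults for its command-line arguments. The sketch below shows one way to do that with `argparse`. The argument names (`--model_dir`, `--n_gpus`, `--training_dir`, `--test_dir`) and the fallback paths are assumptions for illustration so the snippet also runs outside SageMaker; they are not taken from the `train.py` used here.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as named command-line arguments.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")

    # Paths and hardware info come from environment variables set by SageMaker.
    # The fallback values are assumptions for running the script locally.
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--n_gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"))

    # parse_known_args tolerates extra flags the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "1", "--model_name", "distilbert-base-uncased",
                   "--train_batch_size", "32"])
print(args.epochs, args.model_name, args.model_dir)
```

Reading the environment inside `parse_args` keeps the script usable both as a SageMaker entry point and from a local shell, where the variables can be overridden by passing the flags explicitly.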


To run your training job locally, set `instance_type='local'`, or `instance_type='local_gpu'` to use a GPU. _Note: this does not work within SageMaker Studio._


In [1]:
from datasets import load_dataset
dataset = load_dataset('banking77')

dataset['train'][0]

# dataset['train'].features['label'].names

Using custom data configuration default
Reusing dataset banking77 (/home/ec2-user/.cache/huggingface/datasets/banking77/default/1.1.0/17ffc2ed47c2ed928bee64127ff1dbc97204cb974c2f980becae7c864007aed9)


{'label': 11, 'text': 'I am still waiting on my card?'}

In [7]:
# `ClassLabel` does not support indexing (`features["label"][11]` raises a TypeError);
# use `int2str` to map a label id to its class name
dataset["train"].features["label"].int2str(11)

In [2]:
label_list = dataset["train"].features["label"].names[:5]
label_list

['activate_my_card',
 'age_limit',
 'apple_pay_or_google_pay',
 'atm_support',
 'automatic_top_up']

In [127]:
label_list.sort()  # Let's sort it for determinism
# num_labels = len(label_list)
label_list

['activate_my_card',
 'age_limit',
 'apple_pay_or_google_pay',
 'atm_support',
 'automatic_top_up']

In [None]:

is_regression = False  # banking77 is a classification task

if is_regression:
    num_labels = 1
else:
    # A useful fast method:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
    label_list = dataset["train"].unique("label")
    label_list.sort()  # Let's sort it for determinism
    num_labels = len(label_list)
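
Sorting the label list matters because the position of each label becomes its class id. As a dependency-free sketch of that idea, the snippet below builds the two lookup tables a classifier usually needs from the five label names printed earlier. The names `label2id` and `id2label` are chosen for illustration and the five-item list is only a sample, not the full label set.

```python
# Sample of label names taken from the output above, deliberately shuffled.
label_list = [
    "atm_support",
    "activate_my_card",
    "automatic_top_up",
    "age_limit",
    "apple_pay_or_google_pay",
]

label_list.sort()  # deterministic order, independent of how labels were collected
num_labels = len(label_list)

# Position in the sorted list becomes the class id.
label2id = {label: idx for idx, label in enumerate(label_list)}
id2label = {idx: label for label, idx in label2id.items()}

print(num_labels, label2id["atm_support"], id2label[0])
```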

In [95]:
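from transformers import AutoTokenizer

eval_dataset = dataset['test']
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# tokenize the text column in batches; padding is left to the data collator
eval_dataset = eval_dataset.map(lambda x: tokenizer(x['text'], truncation=True), batched=True)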

eval_dataset

Dataset({
    features: ['attention_mask', 'input_ids', 'label', 'text'],
    num_rows: 3080
})

In [128]:
validation_targets = eval_dataset['label']
# convert labels to their respective class
validation_targets = [eval_dataset.features['label'].int2str(x) for x in validation_targets]
validation_targets[:10]

['card_arrival',
 'card_arrival',
 'card_arrival',
 'card_arrival',
 'card_arrival',
 'card_arrival',
 'card_arrival',
 'card_arrival',
 'card_arrival',
 'card_arrival']

In [102]:
validation_inputs = eval_dataset.remove_columns(['label', 'attention_mask', 'input_ids'])

validation_targets = eval_dataset['label']
# convert labels to their respective class
validation_targets = [eval_dataset.features['label'].int2str(x) for x in validation_targets]

from wandb.sdk.integration_utils.data_logging import ValidationDataLogger

validation_logger = ValidationDataLogger(
    inputs = validation_inputs[:],
    targets = validation_targets
)

(OrderedDict([('text',
               ['How do I locate my card?',
                'I still have not received my new card, I ordered over a week ago.',
                'I ordered a card but it has not arrived. Help please!',
                'Is there a way to know when my card will arrive?',
                'My card has not arrived yet.',
                'When will I get my card?',
                'Do you know if there is a tracking number for the new card you sent me?',
                'i have not received my card',
                'still waiting on that card',
                'Is it normal to have to wait over a week for my new card?'])]),
 ['card_arrival',
  'card_arrival',
  'card_arrival',
  'card_arrival',
  'card_arrival',
  'card_arrival',
  'card_arrival',
  'card_arrival',
  'card_arrival',
  'card_arrival'])

In [29]:
# from transformers import EvalPrediction
# from typing import Callable

# class ComputeMetrics:
#     def __init__(self, train_len, eval_steps):
#         self.train_len = train_len
#         self.eval_steps = eval_steps
#         self.eval_step_count = eval_steps
        
#     def __call__(self, p: EvalPrediction):
#             preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
#             preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)

#     #         # ✍️ log predictions to W&B Validation Logger ✍️ 
#     #         preds_labels = [model.config.id2label[x.item()] for x in preds] # convert id to class, (0, 1, 2…) to label (Health, Science…)
#     #         validation_logger.log_predictions(preds_labels)

#             # Create W&B Table
#             validation_table = wandb.Table(columns=['id', 'step', 'pred_label_id'])
#             for i,p in enumerate(preds):
#                 idx = i + len(train_dataset)
#                 row = [idx, self.eval_step_count, p]                    
#                 validation_table.add_data(*row)

#             # Log the table to Weights & Biases
#             wandb.log(
#                 {f'Eval Predictions/{data_args.dataset_name}_{self.eval_step_count}' : validation_table}, 
#                 commit=False
#             )
            
#             self.eval_step_count+=self.eval_steps
            
#             return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
        
# compute_metrics = ComputeMetrics(100, 100)

True
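
The commented-out `ComputeMetrics` class above derives accuracy by taking the argmax of the model outputs and comparing it with the label ids. The following is a minimal pure-Python sketch of just that step, without NumPy or the `Trainer`. The function names and the toy logits are made up for illustration.

```python
def argmax(row):
    """Return the index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the target id."""
    if len(logits) != len(label_ids):
        raise ValueError("logits and label_ids must have the same length")
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, label_ids))
    return correct / len(label_ids)


# Toy example: three samples, two classes; the last prediction is wrong.
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]
print(accuracy(toy_logits, toy_labels))
```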


## Creating an Estimator and starting a training job

In [70]:
import wandb
wandb.login()
wandb.sagemaker_auth(path="scripts")

In [74]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={
    'output_dir': 'tmp',
    'model_name_or_path': 'distilbert-base-uncased',  # microsoft/deberta-base , roberta-base
    'dataset_name': 'banking77',
    'do_train': True,
    'do_eval': True,
    'per_device_train_batch_size': 64,
    'per_device_eval_batch_size': 64,
    'learning_rate': 1e-4,
    'warmup_steps': 100,
    'fp16': True,
    'logging_steps': 10,
    'max_steps': 1000,
    'eval_steps': 100,
    'save_steps': 200,
    'evaluation_strategy' : 'steps',
    'load_best_model_at_end': True,
    'metric_for_best_model': 'accuracy',
    'report_to': 'wandb',    # ✍️
    'run_name': 'distilbert2'    # ✍️
}
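
The step-based settings interact: `max_steps`, `eval_steps` and `save_steps` together determine how often the model is evaluated and checkpointed. The sketch below works this out for the values above. The helper `summarize_schedule` is hypothetical, and the single-GPU count is an assumption based on the `ml.p3.2xlarge` instance type. Some `transformers` releases also require `save_steps` to be a multiple of `eval_steps` when `load_best_model_at_end` is set; the `aligned` flag is a sketch of that rule, not the library's own check.

```python
def summarize_schedule(hp, n_gpus=1):
    """Derive rough schedule figures from step-based hyperparameters.

    Hypothetical helper for illustration; it only does arithmetic on the dict.
    """
    return {
        "evaluations": hp["max_steps"] // hp["eval_steps"],
        "checkpoints": hp["max_steps"] // hp["save_steps"],
        # every checkpoint should coincide with an evaluation step
        "aligned": hp["save_steps"] % hp["eval_steps"] == 0,
        "examples_seen": hp["max_steps"] * hp["per_device_train_batch_size"] * n_gpus,
    }


schedule = summarize_schedule({
    "max_steps": 1000,
    "eval_steps": 100,
    "save_steps": 200,
    "per_device_train_batch_size": 64,
})
print(schedule)
```

With these values the run evaluates 10 times, saves 5 checkpoints, and processes 64,000 training examples on a single GPU.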

In [75]:
huggingface_estimator = HuggingFace(entry_point='run_text_classification.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',  # alternatives: 'ml.p3.8xlarge', 'ml.g4dn.12xlarge'
                            instance_count=1,
                            role=role,
                            transformers_version='4.6',
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = hyperparameters)

In [76]:
huggingface_estimator.fit()

2021-08-11 20:24:34 Starting - Starting the training job...
2021-08-11 20:24:57 Starting - Launching requested ML instancesProfilerReport-1628713474: InProgress
......
2021-08-11 20:25:58 Starting - Preparing the instances for training.........
2021-08-11 20:27:31 Downloading - Downloading input data...
2021-08-11 20:27:58 Training - Downloading the training image..............
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-08-11 20:30:16,077 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-08-11 20:30:16,104 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-08-11 20:30:19,141 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-08-11 20:30:19,446 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bi

### Estimator Parameters

In [70]:
# container image used for training job
print(f"container image used for training job: \n{huggingface_estimator.image_uri}\n")

# s3 uri where the trained model is located
print(f"s3 uri where the trained model is located: \n{huggingface_estimator.model_data}\n")

# latest training job name for this estimator
print(f"latest training job name for this estimator: \n{huggingface_estimator.latest_training_job.name}\n")



container image used for training job: 
558105141721.dkr.ecr.us-east-1.amazonaws.com/huggingface-training:pytorch1.6.0-transformers4.2.2-tokenizers0.9.4-datasets1.2.1-py36-gpu-cu110

s3 uri where the trained model is located: 
s3://philipps-sagemaker-bucket-us-east-1/huggingface-training-2021-02-04-16-47-39-189/output/model.tar.gz

latest training job name for this estimator: 
huggingface-training-2021-02-04-16-47-39-189

