# Getting started with Hugging Face and Amazon SageMaker

## Binary classification on movie reviews

* https://huggingface.co/distilbert-base-uncased
* https://huggingface.co/transformers/model_doc/distilbert.html
* https://huggingface.co/datasets/imdb

# Setup

In [None]:
!pip -q install "sagemaker>=2.31.0" "transformers>=4.4.2" "datasets[s3]>=1.5.0" --upgrade

In [None]:
!pip -q install torch tensorflow --upgrade

In [1]:
import sagemaker

print(sagemaker.__version__)

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

2.32.1


# Preprocessing

We use the `datasets` library to download and preprocess the `imdb` dataset. After preprocessing, the dataset is uploaded to the default SageMaker `bucket` defined above, where the training job can read it. The [imdb](http://ai.stanford.edu/~amaas/data/sentiment/) dataset consists of 25,000 training and 25,000 test movie reviews, all highly polar.

In [2]:
from datasets import load_dataset

train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])

print(train_dataset.shape)
print(test_dataset.shape)

Reusing dataset imdb (/home/ec2-user/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)


(25000, 2)
(25000, 2)


In [3]:
print(train_dataset[0])

{'label': 1, 'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'}


In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-d6a52d59d516280b.arrow
Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3/cache-10ed9b7a4517e667.arrow
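The `padding='max_length'` and `truncation=True` arguments in the `tokenize` function give every review the same length, and the attention mask records which positions hold real tokens. The sketch below imitates that behaviour in plain Python. It is a toy illustration, not the tokenizer's implementation: `pad_or_truncate` and the token ids are hypothetical, and a real tokenizer also keeps its special end token when truncating.

```python
# Toy illustration of fixed-length padding and truncation.
# PAD_ID matches the pad_token_id of 0 used by DistilBERT; the token ids are made up.
PAD_ID = 0

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)                  # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A sequence shorter than max_length is padded
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
# A sequence longer than max_length is truncated
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With the real tokenizer the fixed length is the model maximum (512 positions for DistilBERT), which is why the printed attention mask for a review is a run of ones followed by zeros.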


In [5]:
print(train_dataset[0])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [6]:
# Rename 'label' to 'labels', the argument name the model's forward pass expects
train_dataset = train_dataset.rename_column("label", "labels")
test_dataset = test_dataset.rename_column("label", "labels")

In [7]:
print(train_dataset[0])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

# Upload data to S3

In [8]:
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

s3_prefix = 'hugging-face/demo'

# Save each split as an Arrow dataset directly to S3
training_input_path = f's3://{bucket}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

test_input_path = f's3://{bucket}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path, fs=s3)

In [9]:
print(training_input_path)
print(test_input_path)

s3://sagemaker-us-east-1-613904931467/hugging-face/demo/train
s3://sagemaker-us-east-1-613904931467/hugging-face/demo/test


# Fine-tuning & starting a SageMaker Training Job

In [10]:
!pygmentize ./scripts/train.py

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_from_disk
import random
import logging
import sys
import argparse
import os
import torch

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument(

## Fine-tune the Hugging Face model on SageMaker

In [11]:
hyperparameters={
    'epochs': 1,
    'train_batch_size': 32,
    'model_name':'distilbert-base-uncased'
}
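SageMaker hands each entry of `hyperparameters` to the entry point as a `--name value` command-line argument, and exposes the input channels and the model output directory through environment variables such as `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST` and `SM_MODEL_DIR`. The sketch below shows one way a script could read them. It is not the actual `train.py` (which is truncated above): the function `parse_args`, the argument names `model_dir`, `training_dir` and `test_dir`, and the default values are assumptions made for illustration.

```python
import argparse
import os

def parse_args(argv=None):
    """Hypothetical argument parsing for a SageMaker training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as --name value pairs (defaults here are illustrative)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    # Data and output locations default to variables set inside the training container
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    # parse_known_args ignores any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args

# Simulate the command line built from the hyperparameters dictionary
demo_hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
}
argv = []
for name, value in demo_hyperparameters.items():
    argv += [f"--{name}", str(value)]

args = parse_args(argv)
print(argv)
print(args.epochs, args.train_batch_size, args.model_name)
```

The channel names `train` and `test` come from the keys of the dictionary passed to `fit()`, so renaming a key there changes the name of the matching `SM_CHANNEL_*` variable.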

In [12]:
import sagemaker.huggingface
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    role=role,
    # Fine-tuning script
    entry_point='train.py',
    source_dir='./scripts',
    hyperparameters=hyperparameters,
    # Infrastructure
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    # Managed Spot Training
    use_spot_instances=True,
    max_wait=3600,
    max_run=3600,
    # Disable profiling
    disable_profiler=True
)

In [13]:
huggingface_estimator.fit(
    {'train': training_input_path, 'test': test_input_path}
)

2021-04-06 11:08:28 Starting - Starting the training job...
2021-04-06 11:08:51 Starting - Launching requested ML instancesProfilerReport-1617707307: InProgress
...
2021-04-06 11:09:14 Starting - Insufficient capacity error from EC2 while launching instances, retrying!......
2021-04-06 11:10:11 Starting - Preparing the instances for training......
2021-04-06 11:11:14 Downloading - Downloading input data...
2021-04-06 11:11:52 Training - Downloading the training image..................
2021-04-06 11:15:01 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-04-06 11:15:04,498 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-04-06 11:15:04,523 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-04-06 11:15:04,533 sagemaker_pytorch

[2021-04-06 11:15:28.658 algo-1:25 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2021-04-06 11:15:29.151 algo-1:25 INFO profiler_config_parser.py:102] User has disabled profiler.
[2021-04-06 11:15:29.152 algo-1:25 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2021-04-06 11:15:29.153 algo-1:25 INFO hook.py:199] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2021-04-06 11:15:29.883 algo-1:25 INFO hook.py:253] Saving to /opt/ml/output/tensors
[2021-04-06 11:15:29.883 algo-1:25 INFO state_store.py:67] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2021-04-06 11:15:30.663 algo-1:25 INFO hook.py:550] name:distilbert.embeddings.word_embeddings.weight count_params:23440896
[2021-04-06 11:15:30.663 algo-1:25 INFO hook.py:550] name:distilbert.embeddings.position_embeddings

{'loss': 0.3726, 'learning_rate': 5e-05, 'epoch': 0.64}
{'eval_loss': 0.20122800767421722, 'eval_accuracy': 0.92364, 'eval_f1': 0.9246556419465604, 'eval_precision': 0.9125185012074473, 'eval_recall': 0.93712, 'eval_runtime': 110.0394, 'eval_samples_per_second': 227.191, 'epoch': 1.0}
{'train_runtime': 499.8767, 'train_samples_per_second': 1.564, 'epoch': 1.0}
***** Eval results *****


2021-04-06 11:25:55 Uploading - Uploading generated training model
2021-04-06 11:27:57 Completed - Training job completed
Training seconds: 1003
Billable seconds: 301
Managed Spot Training savings: 70.0%
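The `eval_*` values in the log are not independent: F1 is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the precision and recall reported for this job. The helper name `f1_from` is only illustrative.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation line of the training log above
eval_precision = 0.9125185012074473
eval_recall = 0.93712
eval_f1_reported = 0.9246556419465604

f1 = f1_from(eval_precision, eval_recall)
print(f1, eval_f1_reported)
```

The recomputed value agrees with the reported `eval_f1` to within floating-point rounding.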


In [14]:
huggingface_estimator.sagemaker_session.logs_for_job(
    huggingface_estimator.latest_training_job.name)

2021-04-06 11:28:15 Starting - Preparing the instances for training
2021-04-06 11:28:15 Downloading - Downloading input data
2021-04-06 11:28:15 Training - Training image download completed. Training in progress.
2021-04-06 11:28:15 Uploading - Uploading generated training model
2021-04-06 11:28:15 Completed - Training job completed
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-04-06 11:15:04,498 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-04-06 11:15:04,523 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-04-06 11:15:04,533 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-04-06 11:15:04,969 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_in

[2021-04-06 11:15:30.673 algo-1:25 INFO hook.py:550] name:distilbert.transformer.layer.5.ffn.lin2.weight count_params:2359296
[2021-04-06 11:15:30.673 algo-1:25 INFO hook.py:550] name:distilbert.transformer.layer.5.ffn.lin2.bias count_params:768
[2021-04-06 11:15:30.673 algo-1:25 INFO hook.py:550] name:distilbert.transformer.layer.5.output_layer_norm.weight count_params:768
[2021-04-06 11:15:30.673 algo-1:25 INFO hook.py:550] name:distilbert.transformer.layer.5.output_layer_norm.bias count_params:768
[2021-04-06 11:15:30.673 algo-1:25 INFO hook.py:550] name:pre_classifier.weight count_params:589824
[2021-04-06 11:15:30.673 algo-1:25 INFO hook.py:550] name:pre_classifier.bias count_params:768
[2021-04-06 11:15:30.673 algo-1:25 INFO hook.py:550] name:classifier.weight count_params:1536
[2021-04-06 11:15:30.673 algo-1:25 INFO hook.py:550] name:classifier.bias count_params:2
[2021-04-06 11:15:30.673 algo-1:25 INFO

## Retrieve model, load it and predict

In [15]:
%%sh -s $huggingface_estimator.model_data
aws s3 cp $1 .
mkdir -p model
tar -xvzf model.tar.gz -C model

Completed 256.0 KiB/910.5 MiB (2.4 MiB/s) with 1 file(s) remaining
...

In [16]:
from transformers import AutoConfig, DistilBertForSequenceClassification

# Load the fine-tuned weights together with the configuration extracted from model.tar.gz
config = AutoConfig.from_pretrained('./model/config.json')
model = DistilBertForSequenceClassification.from_pretrained('./model/pytorch_model.bin', config=config)

print(config)
print(model)

DistilBertConfig {
  "_name_or_path": "./model/pytorch_model.bin",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.4.2",
  "vocab_size": 30522
}

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (atten

In [42]:
#inputs = tokenizer("The Phantom Menace was a really bad movie. What a waste of my life.", return_tensors='pt')
inputs = tokenizer("The Phantom Menace was an amazing movie. Jar Jar rocks!", return_tensors='pt')

print(inputs.input_ids)
print(inputs.attention_mask)

tensor([[  101,  1996, 11588, 19854,  2001,  2019,  6429,  3185,  1012, 15723,
         15723,  5749,   999,   102]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


In [43]:
outputs = model(**inputs)

print(outputs.logits)

tensor([[-2.2089,  2.4441]], grad_fn=<AddmmBackward>)


In [44]:
import numpy as np
import torch

def top_class(logits):
    # Convert logits to class probabilities, print them, and return the most likely class index
    probs = torch.nn.Softmax(dim=1)(logits)
    print(probs)
    return np.argmax(probs.detach().numpy(), axis=1)

In [45]:
print(top_class(outputs.logits))

tensor([[0.0094, 0.9906]], grad_fn=<SoftmaxBackward>)
[1]
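The softmax step can be reproduced with the standard library alone, which makes the arithmetic behind the prediction easy to follow. The sketch below starts from the two logits printed above. The `id2label` mapping follows the IMDB convention (0 for negative, 1 for positive) and is added here for readability; it is not part of the notebook's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.2089, 2.4441]  # copied from the model output above
probs = softmax(logits)

# Index of the largest probability is the predicted class
predicted = max(range(len(probs)), key=probs.__getitem__)
id2label = {0: "negative", 1: "positive"}  # IMDB label convention

print([round(p, 4) for p in probs], id2label[predicted])
```

Rounded to four decimals, the probabilities match the tensor printed by `top_class`, and the predicted class is 1, the positive label.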


## Fine-tune the Hugging Face model on SageMaker with Distributed Training

In [33]:
hyperparameters={
    'epochs': 16,
    'train_batch_size': 32,
    'model_name':'distilbert-base-uncased'
}

In [None]:
huggingface_estimator = HuggingFace(
    role=role,
    # Fine-tuning script
    entry_point='train.py',
    source_dir='./scripts',
    hyperparameters=hyperparameters,
    # Infrastructure
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    # Managed Spot Training
    use_spot_instances=True,
    max_wait=3600,
    max_run=3600,
    # Disable profiling
    disable_profiler=True
)

huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

2021-04-06 13:14:57 Starting - Starting the training job...
2021-04-06 13:14:59 Starting - Launching requested ML instances.........
2021-04-06 13:16:49 Starting - Preparing the instances for training.........
2021-04-06 13:18:15 Downloading - Downloading input data...
2021-04-06 13:18:47 Training - Downloading the training image............
2021-04-06 13:20:57 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-04-06 13:20:56,897 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-04-06 13:20:56,976 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-04-06 13:20:58,137 sagemaker-training-toolkit INFO

2021-04-06 13:21:06,147 - __main__ - INFO -  loaded train_dataset length is: 25000
2021-04-06 13:21:06,148 - __main__ - INFO -  loaded test_dataset length is: 25000
2021-04-06 13:21:06,340 - filelock - INFO - Lock 140522520212032 acquired on /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361.lock
2021-04-06 13:21:06,376 - filelock - INFO - Lock 140522520212032 released on /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.d423bdf2f58dc8b77d5f5d18028d7ae4a72dcfd8f468e81fe979ada957a8c361.lock
2021-04-06 13:21:06,397 - filelock - INFO - Lock 140522520212032 acquired on /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a.lock
2021-04-06 13:21:08,139 - __main__

NCCL version 2.4.8+cuda11.0
NCCL version 2.4.8+cuda11.0
{'eval_loss': 0.27522990107536316, 'eval_accuracy': 0.89416, 'eval_f1': 0.8923164577567964, 'eval_precision': 0.9081345261762757, 'eval_recall': 0.87704, 'eval_runtime': 27.8937, 'eval_samples_per_second': 896.259, 'epoch': 1.0}


{'eval_loss': 0.27700275182724, 'eval_accuracy': 0.89432, 'eval_f1': 0.8917390591706277, 'eval_precision': 0.9140625, 'eval_recall': 0.87048, 'eval_runtime': 29.3815, 'eval_samples_per_second': 850.875, 'epoch': 1.0}
{'eval_loss': 0.2115076184272766, 'eval_accuracy': 0.91848, 'eval_f1': 0.9177894312222671, 'eval_precision': 0.9256305939788446, 'eval_recall': 0.91008, 'eval_runtime': 27.6871, 'eval_samples_per_second': 902.947, 'epoch': 2.0}
{'eval_loss': 0.2133331447839737, 'eval_accuracy': 0.917, 'eval_f1': 0.9151225099194175, 'eval_precision': 0.9363020005022181, 'eval_recall': 0.89488, 'eval_runtime': 29.0302, 'eval_samples_per_second': 861.171, 'epoch': 2.0}
{'eval_loss': 0.20018154382705688, 'eval_accuracy': 0.92448, 'eval_f1': 0.9232145762160404, 'eval_precision': 0.9389477167438782, 'eval_recall': 0.908, 'eval_runtime': 27.7887, 'eval_samples_per_second': 899.647, 'epoch': 3.0}
{'eval_loss': 0.20067191123962402, 'eval_accuracy': 0.923, 'e


2021-04-06 13:51:14 Uploading - Uploading generated training model
2021-04-06 13:59:53 Completed - Training job completed
Training seconds: 4996
Billable seconds: 1498
Managed Spot Training savings: 70.0%


## Fine-tune the Hugging Face model on SageMaker with Data Parallelism

In [None]:
huggingface_estimator = HuggingFace(
    role=role,
    # Fine-tuning script
    entry_point='train.py',
    source_dir='./scripts',
    hyperparameters=hyperparameters,
    # Infrastructure
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    # Managed Spot Training
    use_spot_instances=True,
    max_wait=3600,
    max_run=3600,
    # Disable profiling
    disable_profiler=True,
    # Data Parallelism
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}}
)

huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})
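With data parallelism enabled, one training process runs per GPU and each process handles its own mini-batch. The sketch below works out the resulting number of processes and the effective global batch size for this configuration. An `ml.p3.16xlarge` instance has 8 GPUs; treating `train_batch_size` as a per-GPU value is an assumption, since the relevant part of `train.py` is not shown.

```python
import math

GPUS_PER_INSTANCE = 8        # an ml.p3.16xlarge instance has 8 GPUs
instance_count = 2           # from the estimator configuration above
per_device_batch_size = 32   # ASSUMPTION: train_batch_size is applied per GPU
train_samples = 25000        # size of the IMDB training split

# One data-parallel worker per GPU across all instances
world_size = GPUS_PER_INSTANCE * instance_count
# Samples consumed per optimizer step across all workers
global_batch_size = world_size * per_device_batch_size
# Optimizer steps needed to see the whole training set once
steps_per_epoch = math.ceil(train_samples / global_batch_size)

print(world_size, global_batch_size, steps_per_epoch)
```

Under these assumptions an epoch takes far fewer optimizer steps than on a single GPU, which is worth keeping in mind when comparing per-epoch metrics between the runs.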

2021-04-06 14:00:05 Starting - Starting the training job...
2021-04-06 14:00:20 Starting - Launching requested ML instances.........
2021-04-06 14:02:02 Starting - Preparing the instances for training.........
2021-04-06 14:03:19 Downloading - Downloading input data...
2021-04-06 14:03:50 Training - Downloading the training image...............
2021-04-06 14:06:31 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-04-06 14:06:31,092 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-04-06 14:06:31,169 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-04-06 14:06:34,222 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2021-04-06 14:06:34,222 sagemaker_pytorch_container.training INFO     Invoking use

[1,13]<stdout>:2021-04-06 14:06:47,397 - filelock - INFO - Lock 140291343550616 released on /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a.lock
[1,14]<stdout>:2021-04-06 14:06:47,416 - filelock - INFO - Lock 140359338155424 acquired on /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a.lock
[1,14]<stdout>:2021-04-06 14:06:47,416 - filelock - INFO - Lock 140359338155424 released on /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a.lock
[1,11]<stdout>:2021-04-06 14:06:47,417 - filelock - INFO - Lock 139906430984032 acquired on /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b6

[1,0]<stdout>:Running smdistributed.dataparallel v1.0.0
[1,7]<stdout>:[2021-04-06 14:06:58.678 algo-1:61 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,1]<stdout>:[2021-04-06 14:06:58.678 algo-1:51 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,2]<stdout>:[2021-04-06 14:06:58.678 algo-1:54 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,5]<stdout>:[2021-04-06 14:06:58.678 algo-1:59 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,0]<stdout>:[2021-04-06 14:06:58.678 algo-1:50 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,3]<stdout>:[2021-04-06 14:06:58.678 algo-1:56 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,10]<stdout>:[2021-04-06 14:06:58.660 algo-2:60 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,13]<stdout>:[2021-04-06 14:06:58.659 algo-2:66 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[1,8]<stdout>:[2021-04-06 14:06:

[1,8]<stdout>:{'eval_loss': 0.6650556921958923, 'eval_accuracy': 0.75004, 'eval_f1': 0.7653486538244902, 'eval_precision': 0.7211803835538886, 'eval_recall': 0.81528, 'eval_runtime': 7.6834, 'eval_samples_per_second': 3253.784, 'epoch': 1.0}
[1,0]<stdout>:{'eval_loss': 0.6650556921958923, 'eval_accuracy': 0.75004, 'eval_f1': 0.7653486538244902, 'eval_precision': 0.7211803835538886, 'eval_recall': 0.81528, 'eval_runtime': 7.6834, 'eval_samples_per_second': 3253.783, 'epoch': 1.0}
[1,0]<stdout>:{'eval_loss': 0.265903115272522, 'eval_accuracy': 0.89548, 'eval_f1': 0.8973966309341501, 'eval_precision': 0.8812369861957277, 'eval_recall': 0.91416, 'eval_runtime': 7.2734, 'eval_samples_per_second': 3437.162, 'epoch': 2.0}
[1,8]<stdout>:{'eval_loss': 0.265903115272522, 'eval_accuracy': 0.89548, 'eval_f1': 0.8973966309341501, 'eval_precision': 0.8812369861957277, 'eval_recall': 0.91416, 'eval_runtime': 7.2789, 'eval_samples_per_second': 3434.602, 'epoch': 2.0}

[1,0]<stdout>:{'eval_loss': 0.2790542244911194, 'eval_accuracy': 0.92256, 'eval_f1': 0.9244576244732324, 'eval_precision': 0.9023461304082876, 'eval_recall': 0.94768, 'eval_runtime': 7.3801, 'eval_samples_per_second': 3387.507, 'epoch': 11.0}
[1,8]<stdout>:{'eval_loss': 0.2790542244911194, 'eval_accuracy': 0.92256, 'eval_f1': 0.9244576244732324, 'eval_precision': 0.9023461304082876, 'eval_recall': 0.94768, 'eval_runtime': 7.3799, 'eval_samples_per_second': 3387.599, 'epoch': 11.0}
[1,8]<stdout>:{'eval_loss': 0.29104533791542053, 'eval_accuracy': 0.92412, 'eval_f1': 0.9257272620492542, 'eval_precision': 0.9065255731922398, 'eval_recall': 0.94576, 'eval_runtime': 7.388, 'eval_samples_per_second': 3383.886, 'epoch': 12.0}
[1,0]<stdout>:{'eval_loss': 0.29104533791542053, 'eval_accuracy': 0.92412, 'eval_f1': 0.9257272620492542, 'eval_precision': 0.9065255731922398, 'eval_recall': 0.94576, 'eval_runtime': 7.3846, 'eval_samples_per_second': 3385.42, 'epoch': 12


2021-04-06 14:17:16 Uploading - Uploading generated training model
2021-04-06 14:17:14,101 sagemaker-training-toolkit INFO     MPI process finished.
2021-04-06 14:17:14,101 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2021-04-06 14:19:31 Completed - Training job completed
Training seconds: 1944
Billable seconds: 584
Managed Spot Training savings: 70.0%
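The savings figure printed at the end of each job follows from the two durations reported with it: it is the share of training seconds that was not billed. The sketch below recomputes it for the three jobs in this notebook. The helper `spot_savings_percent` and the job labels are illustrative names; the numbers are copied from the job summaries above.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage of training time that was not billed."""
    return (1 - billable_seconds / training_seconds) * 100

# (training seconds, billable seconds) copied from the three job summaries
jobs = {
    "1 x ml.p3.2xlarge": (1003, 301),
    "2 x ml.p3.16xlarge": (4996, 1498),
    "2 x ml.p3.16xlarge, data parallel": (1944, 584),
}

savings = {
    name: round(spot_savings_percent(training, billable), 1)
    for name, (training, billable) in jobs.items()
}

for name, percent in savings.items():
    print(f"{name}: {percent}%")
```

All three jobs come out at the reported 70.0% once rounded to one decimal place.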
