In [1]:
!pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker --upgrade

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting transformers==4.26.0
  Using cached transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting datasets[s3]==2.9.0
  Using cached datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting sagemaker
  Using cached sagemaker-2.145.0.tar.gz (714 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Using cached huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Collecting responses<0.19
  Using cached responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).
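
Outside of SageMaker, `get_execution_role()` usually has no role to infer, so the ARN has to be looked up by name. The helper below is a minimal sketch of that lookup. The function name `resolve_role_arn` and the `iam_client` parameter are introduced here for illustration, and the role name you pass in is whatever role you created in your own account.

```python
def resolve_role_arn(role_name, iam_client=None):
    """Return the ARN of an IAM role, looked up by its name.

    If no client is passed, a default boto3 IAM client is created,
    which requires local AWS credentials.
    """
    if iam_client is None:
        import boto3  # imported lazily so the helper can be exercised offline
        iam_client = boto3.client("iam")
    # The GetRole response nests the ARN under Role -> Arn
    return iam_client.get_role(RoleName=role_name)["Role"]["Arn"]
```

In a local setup, `role = resolve_role_arn("<your-role-name>")` could stand in for the `get_execution_role()` call in the next cell.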



In [2]:
import sagemaker
from sagemaker import get_execution_role
import boto3

sess = sagemaker.Session()
role = get_execution_role()

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::687912291502:role/webui-notebook-stack-ExecutionRole-62U5FV4LJQS
sagemaker bucket: sagemaker-us-west-2-687912291502
sagemaker session region: us-west-2


## 1. Process dataset and upload to S3

We prepare our training data from the [CNN Dailymail Dataset](https://huggingface.co/datasets/cnn_dailymail).


In [3]:
# experiment config
model_id = "EleutherAI/gpt-j-6b" # Hugging Face Model Id
dataset_id = "cnn_dailymail" # Hugging Face Dataset Id
dataset_config = "3.0.0" # config/version of the dataset
save_dataset_path = "data" # local path to save processed dataset
text_column = "article" # column containing the input text
summary_column = "highlights" # column containing the output text
# custom instruct prompt template
prompt_template = f"Summarize the following news article:\n{{input}}\nSummary:\n"
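
To make the template concrete, here is a small sketch of how one article is wrapped in the prompt. The helper `build_prompt` and the sample sentence are made up for illustration; the template string has the same value as `prompt_template` in the config cell.

```python
# Same string value as `prompt_template` in the config cell
prompt_template = "Summarize the following news article:\n{input}\nSummary:\n"

def build_prompt(article):
    """Wrap a single article in the instruct prompt."""
    return prompt_template.format(input=article)

example_prompt = build_prompt("Stocks rose on Tuesday.")
print(example_prompt)
```

The model sees the instruction, then the article, then the `Summary:` cue it is expected to complete.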

We process (tokenize) the dataset, upload it to S3, and pass it to our managed training job.

In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np 

dataset = load_dataset(dataset_id,name=dataset_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

# Train dataset size: 287113
# Test dataset size: 11490

Found cached dataset cnn_dailymail (/home/ec2-user/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)


  0%|          | 0/3 [00:00<?, ?it/s]

Train dataset size: 287113
Test dataset size: 11490


We defined a `prompt_template` in our config, which we will use to construct an instruct prompt for better model performance. Our `prompt_template` has a “fixed” start and end, with the document in the middle. This means the “fixed” template parts plus the document must not exceed the max length of the model. Therefore, we calculate the max length of the document, which we will later use for padding and truncation.

In [5]:
prompt_length = len(tokenizer(prompt_template.format(input=""))["input_ids"])
max_sample_length = tokenizer.model_max_length - prompt_length
print(f"Prompt length: {prompt_length}")
print(f"Max input length: {max_sample_length}")

# Prompt length: 12
# Max input length: 500

Prompt length: 12
Max input length: 500
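
The budget calculation is plain subtraction, sketched below without a tokenizer. The function name is hypothetical, and the limit of 512 is not stated in the notebook; it is only what the printed values (12 + 500) imply.

```python
def max_document_tokens(model_max_length, prompt_token_count):
    """Tokens left for the article after reserving room for the fixed prompt."""
    remaining = model_max_length - prompt_token_count
    if remaining <= 0:
        raise ValueError("the prompt template alone exceeds the model's max length")
    return remaining

# 12 prompt tokens and an assumed tokenizer limit of 512 (implied by the output above)
budget = max_document_tokens(model_max_length=512, prompt_token_count=12)
print(budget)  # 500
```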


We now know that our documents can be up to 500 tokens long and still fit our `prompt_template`. In addition to the input, we need to understand the “target” sequence length, that is, how long the summaries in our dataset are. Therefore, we iterate over the dataset and calculate the max input length (capped at 500) and the max target length. (This takes a few minutes.)

In [6]:
from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
max_source_length = min(max_source_length, max_sample_length)
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use 95th percentile as max target length
max_target_length = int(np.percentile(target_lengths, 95))
print(f"Max target length: {max_target_length}")

Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de/cache-7304467cfe0961a4.arrow


Max source length: 500


  0%|          | 0/299 [00:00<?, ?ba/s]

Max target length: 129
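
Using the 95th percentile rather than the maximum keeps a few very long summaries from inflating the padded length of every sample. The sketch below illustrates the effect on made-up lengths. It uses a small pure-Python percentile with linear interpolation, which is also the default method of `np.percentile`, and all values are hypothetical.

```python
import math

def percentile(values, q):
    """q-th percentile using linear interpolation between the closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical target lengths: 1, 2, ..., 100 tokens
toy_lengths = list(range(1, 101))
cap = int(percentile(toy_lengths, 95))

# Share of summaries that would be truncated at this cap
truncated_share = sum(length > cap for length in toy_lengths) / len(toy_lengths)
print(cap, truncated_share)  # 95 0.05
```

Only the longest 5% of targets get truncated, while every padded label sequence is shorter than it would be at the true maximum.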


We now have everything needed to process our dataset.

In [7]:
def preprocess_function(sample, padding="max_length"):
    # create the prompted input
    inputs = [prompt_template.format(input=item) for item in sample[text_column]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# process dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset["train"].features))

  0%|          | 0/288 [00:00<?, ?ba/s]

  0%|          | 0/14 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]
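
The `-100` replacement in `preprocess_function` relies on the fact that PyTorch's cross-entropy loss ignores targets equal to `-100` by default. A stand-alone sketch of that masking step, using a made-up pad id of `0` and hypothetical token ids:

```python
def mask_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in each label sequence with ignore_index."""
    return [
        [(tok if tok != pad_token_id else ignore_index) for tok in seq]
        for seq in label_ids
    ]

# Hypothetical batch: two summaries padded to length 5 with pad id 0
padded = [[17, 42, 9, 0, 0], [5, 0, 0, 0, 0]]
masked = mask_padding(padded, pad_token_id=0)
print(masked)  # [[17, 42, 9, -100, -100], [5, -100, -100, -100, -100]]
```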

After processing the datasets, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload them to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 paths later in our training script.

In [8]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/train'
tokenized_dataset["train"].save_to_disk(training_input_path)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/test'
tokenized_dataset["test"].save_to_disk(test_input_path)


print("uploaded data to:")
print(f"training dataset to: {training_input_path}")
print(f"test dataset to: {test_input_path}")

Saving the dataset (0/3 shards):   0%|          | 0/287113 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/11490 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://sagemaker-us-west-2-687912291502/processed-404/cnn_dailymail/train
test dataset to: s3://sagemaker-us-west-2-687912291502/processed-404/cnn_dailymail/test
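
If you store several datasets or splits, it can help to build the S3 URIs in one place. The helper below is a hypothetical convenience, not part of the notebook; it reproduces the path layout used above, and the bucket name in the example is a placeholder.

```python
def dataset_s3_uri(bucket, dataset_id, split, prefix="processed-404"):
    """Build the S3 URI for one processed split, following the layout used above."""
    return f"s3://{bucket}/{prefix}/{dataset_id}/{split}"

# "my-bucket" is a placeholder; in the notebook this would be sess.default_bucket()
train_uri = dataset_s3_uri("my-bucket", "cnn_dailymail", "train")
test_uri = dataset_s3_uri("my-bucket", "cnn_dailymail", "test")
print(train_uri)
print(test_uri)
```

A dataset written with `save_to_disk` can be read back with `datasets.load_from_disk`. Inside a SageMaker training job, the channel is normally already downloaded to local disk before the script starts.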


## 2. Prepare training script and deepspeed launcher

Here we use `torch.distributed.launch` to launch DeepSpeed on multiple nodes. First, `start.py` configures some environment variables and invokes the shell script `torch_launch.sh`. Second, `torch_launch.sh` sets all the parameters required by both `torch.distributed.launch` and the training script `run_seq2seq_deepspeed.py`.
In addition, we create a DeepSpeed config file named `ds_flan_t5_z3_config_bf16.json` to configure our training setup.  
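
The contents of the DeepSpeed config file are not shown in this notebook. As a rough orientation only, a ZeRO stage 3 config with `bf16` and no offloading might look like the sketch below. Every value here is an assumption, and the `"auto"` entries only work if the training script uses the Hugging Face `Trainer` DeepSpeed integration.

```python
import json

# Hypothetical minimal ZeRO stage 3 + bf16 config; the real file may differ
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # gather the sharded weights into one 16-bit checkpoint on save
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # "auto" lets the Hugging Face Trainer fill these in from its own arguments
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

There are no `offload_optimizer` or `offload_param` sections in this sketch, so parameters and optimizer state stay on the GPUs.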

We are going to use the p4d.24xlarge AWS EC2 instance type, which includes 8x NVIDIA A100 40GB GPUs. This means we can leverage `bf16`, which reduces the memory footprint of the model by almost 2x and allows us to train efficiently without offloading.
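
A back-of-the-envelope calculation shows where the roughly 2x saving comes from. It assumes about 6 billion parameters and counts the weights only; gradients and optimizer state need additional memory, which ZeRO shards across the GPUs.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Memory needed to hold the model weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3

params = 6_000_000_000  # assumption: roughly 6B parameters

fp32_gib = weight_memory_gib(params, 4)  # 4 bytes per fp32 value
bf16_gib = weight_memory_gib(params, 2)  # 2 bytes per bf16 value
print(f"fp32: {fp32_gib:.1f} GiB, bf16: {bf16_gib:.1f} GiB")
```

Under these assumptions the `bf16` weights fit on a single 40GB A100, while the `fp32` weights alone would take up more than half of it.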


## 3. Fine-tune gpt-j-6b with deepspeed + torch.distributed.launch on Amazon SageMaker



In [15]:
import time
from sagemaker.huggingface import HuggingFace
from sagemaker import get_execution_role

role = get_execution_role()
# define the training job name
job_name = f'huggingface-gpt-j-deepspeed-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
# define the S3 path that will store the trained model assets
# Note: replace this with your own S3 path
model_s3_path = 's3://{}/llm/models/gpt-j/deepspeed/'.format(sess.default_bucket())

instance_count = 2
# define the environment variables for your scripts
environment = {
    'NODE_NUMBER': str(instance_count),
    'MODEL_S3_PATH': model_s3_path,
}
# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'start.py',        # entry point script
    source_dir           = 'src_deepspeed',   # directory with all the files needed for training
    instance_type        = 'ml.p4d.24xlarge', # instance type used for the training job
    instance_count       = instance_count,    # number of instances used for training
    base_job_name        = job_name,          # base name of the training job
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    transformers_version = '4.17',            # transformers version used in the training job
    pytorch_version      = '1.10',            # PyTorch version used in the training job
    py_version           = 'py38',            # Python version used in the training job
    environment          = environment,       # environment variables passed to the training container
)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
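
The actual `start.py` and `torch_launch.sh` are not shown in this notebook. As an illustration only, a launcher could assemble the `torch.distributed.launch` command from `NODE_NUMBER` and the `SM_HOSTS` / `SM_CURRENT_HOST` variables that the SageMaker training toolkit sets. The function name, the port, and the choice of the first sorted host as master are assumptions.

```python
import json

def build_launch_command(environ, script="run_seq2seq_deepspeed.py",
                         gpus_per_node=8, master_port=29500):
    """Build a torch.distributed.launch command line for the current node."""
    hosts = sorted(json.loads(environ["SM_HOSTS"]))  # e.g. ["algo-1", "algo-2"]
    current_host = environ["SM_CURRENT_HOST"]
    node_count = int(environ.get("NODE_NUMBER", len(hosts)))
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={gpus_per_node}",
        f"--nnodes={node_count}",
        f"--node_rank={hosts.index(current_host)}",  # rank follows sorted host order
        f"--master_addr={hosts[0]}",                 # assumption: first host is master
        f"--master_port={master_port}",
        script,
    ]

# Simulated environment of the second node in a two-node job
fake_env = {
    "SM_HOSTS": '["algo-1", "algo-2"]',
    "SM_CURRENT_HOST": "algo-2",
    "NODE_NUMBER": "2",
}
print(" ".join(build_launch_command(fake_env)))
```

Each node runs the same command with its own `--node_rank`, so all 16 processes (2 nodes x 8 GPUs) join one process group.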


We created our `HuggingFace` estimator with `start.py` as the `entry_point`. We can now start our training job with the `.fit()` method, passing our S3 paths to the training script.

In [None]:
# define a data input dictionary with our uploaded S3 URIs
# Here we use test_input_path for both the training and test channels to quickly verify the whole training procedure.
data = {
    'training': test_input_path,
    'test': test_input_path
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
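
Each key of the `data` dictionary becomes an input channel. SageMaker downloads the channel to `/opt/ml/input/data/<channel>` inside the container and exposes the path as `SM_CHANNEL_<CHANNEL>`. A training script might resolve the directories as sketched below; the helper name `channel_dir` is hypothetical and the notebook's own scripts may do this differently.

```python
import os

def channel_dir(channel, environ=None):
    """Return the local directory of a SageMaker input channel."""
    environ = os.environ if environ is None else environ
    # Fall back to the conventional mount point if the variable is not set
    return environ.get(f"SM_CHANNEL_{channel.upper()}", f"/opt/ml/input/data/{channel}")

print(channel_dir("training"))
print(channel_dir("test"))
```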

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-flan-t5-deepspeed-2023-04-1-2023-04-12-12-22-04-927


2023-04-12 12:22:08 Starting - Starting the training job......
2023-04-12 12:22:55 Starting - Preparing the instances for training.........
2023-04-12 12:24:35 Downloading - Downloading input data...
2023-04-12 12:24:51 Training - Downloading the training image..................
2023-04-12 12:28:07 Training - Training image download completed. Training in progress.......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-04-12 12:29:12,294 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-04-12 12:29:12,393 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-04-12 12:29:12,395 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-04-12 12:29:12,732 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/pytho

In [18]:
print(model_s3_path)

s3://sagemaker-us-west-2-687912291502/llm/models/


In [19]:
!aws s3 ls s3://sagemaker-us-west-2-687912291502/llm/models/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2023-04-12 13:05:41        766 config.json
2023-04-12 13:05:41        142 generation_config.json
2023-04-12 13:05:41 22270852157 pytorch_model.bin
2023-04-12 13:05:41       2201 special_tokens_map.json
2023-04-12 13:05:41     791656 spiece.model
2023-04-12 13:05:41    2422164 tokenizer.json
2023-04-12 13:05:41       2537 tokenizer_config.json
2023-04-12 13:05:41       4655 training_args.bin
