In [1]:
!pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker --upgrade

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).
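Outside of SageMaker, `get_execution_role()` may not be able to resolve a role on its own. One possible workaround is to look up the role ARN by name through IAM. The sketch below assumes you already have such a role; the helper `role_arn_from_name` and the role name in the usage comment are hypothetical and not part of this notebook.

```python
def role_arn_from_name(iam_client, role_name):
    """Look up the ARN of an IAM role by its name.

    `iam_client` is expected to behave like boto3.client("iam").
    """
    response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Usage outside SageMaker (requires AWS credentials; the role name is a placeholder):
# import boto3
# role = role_arn_from_name(boto3.client("iam"), "my-sagemaker-execution-role")
```

The resulting ARN can then be passed as `role` wherever the notebook uses the value returned by `get_execution_role()`.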



In [2]:
import sagemaker
from sagemaker import get_execution_role
import boto3

sess = sagemaker.Session()
role = get_execution_role()

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::687912291502:role/webui-notebook-stack-ExecutionRole-62U5FV4LJQS
sagemaker bucket: sagemaker-us-west-2-687912291502
sagemaker session region: us-west-2


## 1. Process the dataset and upload it to S3

We prepare the [tiny_shakespeare dataset](https://huggingface.co/datasets/tiny_shakespeare) for fine-tuning.


In [3]:
# experiment config
model_id = "EleutherAI/gpt-j-6b" # Hugging Face Model Id
dataset_id = "tiny_shakespeare" # Hugging Face Dataset Id
save_dataset_path = "data" # local path to save processed dataset
text_column = "text" # name of the column that holds the input text
prompt_template = f"Summarize the following news article:\n{{input}}\nSummary:\n"

We process (tokenize) the dataset, upload it to S3, and pass it into our managed training job.
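The tokenization cell further down uses `truncation=True` with `block_size = 512`, so each row keeps at most its first 512 tokens. On the Hugging Face Hub, each split of tiny_shakespeare appears to hold the text as a single long row, which means most of the text would be dropped. A common alternative for causal language modeling is to tokenize without truncation and cut the token stream into fixed-size blocks. A minimal sketch of that chunking step in plain Python, where `chunk_token_ids` is a hypothetical helper and not part of this notebook:

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of `block_size`.

    Tokens left over after the last full block are dropped.
    """
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


# Example with a tiny block size to make the behaviour visible
blocks = chunk_token_ids(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Dropping the remainder keeps every block the same length, so no padding is needed for these blocks.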

In [4]:
from datasets import concatenate_datasets
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np 

dataset = load_dataset(dataset_id)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Found cached dataset tiny_shakespeare (/home/ec2-user/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e)


  0%|          | 0/3 [00:00<?, ?it/s]

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
block_size = 512

def tokenizer_func(examples):
    ret = tokenizer(
        examples["text"],
        truncation=True,
        max_length=block_size,
        padding="max_length",
        return_tensors="np",
    )
    ret["labels"] = ret["input_ids"].copy()
    return dict(ret)

train_tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(
    tokenizer_func, batched=True, remove_columns=["text"]
)
validation_tokenized_inputs = dataset["validation"].map(
    tokenizer_func, batched=True, remove_columns=["text"]
)
max_source_length = max(len(x) for x in train_tokenized_inputs["input_ids"])
max_validation_length = max(len(x) for x in validation_tokenized_inputs["input_ids"])
print(f"Max source length: {max_source_length}")

Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-42618c1362fa5d3b.arrow
Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-f3754588cdde1a88.arrow


Max source length: 512
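The tokenization function copies `input_ids` into `labels` unchanged, so positions filled with padding (here the EOS token) would also count toward the loss unless the training script handles them. A common variant replaces the labels at padded positions with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; `mask_padding_labels` is a hypothetical helper, and whether the training script in `src_deepspeed` needs it is not known from this notebook.

```python
def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Build labels from input_ids, replacing padded positions with ignore_index.

    A position counts as padding when its attention_mask entry is 0, so real
    EOS tokens inside the text (mask 1) are kept as labels.
    """
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Example: two real tokens followed by two padding tokens (id 0 here)
labels = mask_padding_labels([[5, 6, 0, 0]], [[1, 1, 0, 0]])
print(labels)  # [[5, 6, -100, -100]]
```

Using the attention mask rather than the pad token id matters here because the pad token is set to the EOS token, and genuine EOS tokens should still be learned.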


With the dataset tokenized, we can now save it directly to S3.

In [8]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed-gpt-j/{dataset_id}/train'
validation_input_path = f's3://{sess.default_bucket()}/processed-gpt-j/{dataset_id}/validation'
train_tokenized_inputs.save_to_disk(training_input_path)
validation_tokenized_inputs.save_to_disk(validation_input_path)
print("uploaded data to:")
print(f"training dataset to: {training_input_path}")
print(f"validation dataset to: {validation_input_path}")

Saving the dataset (0/1 shards):   0%|          | 0/2 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://sagemaker-us-west-2-687912291502/processed-gpt-j/tiny_shakespeare/train
validation dataset to: s3://sagemaker-us-west-2-687912291502/processed-gpt-j/tiny_shakespeare/validation


## 2. Fine-tune gpt-j-6b with DeepSpeed + torch.distributed.launch on Amazon SageMaker



In [18]:
import time
from sagemaker.huggingface import HuggingFace
from sagemaker import get_execution_role

role = get_execution_role()
# define the training job name
job_name = f'huggingface-gpt-j-deepspeed-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
# define the S3 path where the trained model assets will be stored
# Note: replace this with your own S3 path
model_s3_path = 's3://{}/llm/models/gpt-j/deepspeed/'.format(sess.default_bucket())

instance_count = 2
# define the environment variables passed to the training scripts
environment = {
    'NODE_NUMBER': str(instance_count),
    'MODEL_S3_PATH': model_s3_path,
}
# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'start.py',          # entry point script
    source_dir           = 'src_deepspeed',     # directory that contains all files needed for training
    instance_type        = 'ml.p4d.24xlarge',   # instance type used for the training job
    instance_count       = instance_count,      # number of instances used for training
    base_job_name        = job_name,            # prefix of the training job name (SageMaker appends a timestamp)
    role                 = role,                # IAM role used by the training job to access AWS resources, e.g. S3
    transformers_version = '4.17',              # transformers version used in the training job
    pytorch_version      = '1.10',              # PyTorch version used in the training job
    py_version           = 'py38',              # Python version used in the training job
    environment          = environment,         # environment variables set in the training container
)

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
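The notebook does not show `start.py`, so how it consumes `NODE_NUMBER` and `MODEL_S3_PATH` is not known. As a rough illustration only, an entry point could combine `NODE_NUMBER` with the `SM_HOSTS`, `SM_CURRENT_HOST`, and `SM_NUM_GPUS` variables that the SageMaker training toolkit sets in the container to build a `torch.distributed.launch` command. The function name `build_launch_command`, the script name `train.py`, and the port are assumptions made for this sketch.

```python
import json
import sys


def build_launch_command(env, train_script="train.py", master_port=29500):
    """Build a torch.distributed.launch command line from environment variables.

    `train_script` and `master_port` are placeholders for this sketch.
    """
    hosts = sorted(json.loads(env["SM_HOSTS"]))      # e.g. ["algo-1", "algo-2"]
    node_rank = hosts.index(env["SM_CURRENT_HOST"])  # rank of this node among all hosts
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nnodes", env["NODE_NUMBER"],
        "--node_rank", str(node_rank),
        "--nproc_per_node", env["SM_NUM_GPUS"],
        "--master_addr", hosts[0],                   # first host acts as the rendezvous master
        "--master_port", str(master_port),
        train_script,
    ]


# Inside the training container this could be run with, for example:
# import os, subprocess
# subprocess.run(build_launch_command(os.environ), check=True)
```

Sorting the host list gives every node the same ordering, so all nodes agree on which host is the master and on their own rank.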


We created our `HuggingFace` estimator with `start.py` as the `entry_point`. We can now start the training job with the `.fit()` method, passing the S3 paths of our datasets to the training script.

In [19]:
# define a data input dictionary with our uploaded S3 URIs
# Here the validation split is passed as the 'test' channel to quickly verify the whole training procedure.
data = {
    'training': training_input_path,
    'test':validation_input_path
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-gpt-j-deepspeed-2023-04-23--2023-04-23-01-57-08-086


2023-04-23 01:57:10 Starting - Starting the training job......
2023-04-23 01:58:01 Starting - Preparing the instances for training.........
2023-04-23 01:59:28 Downloading - Downloading input data...
2023-04-23 01:59:43 Training - Downloading the training image...............
2023-04-23 02:02:39 Training - Training image download completed. Training in progress.......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-04-23 02:03:37,066 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-04-23 02:03:37,133 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-04-23 02:03:37,135 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-04-23 02:03:37,423 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.
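The keys of the `data` dictionary passed to `.fit()` become input channels. Inside the training container, SageMaker downloads each channel to `/opt/ml/input/data/<channel>` and exposes the path through an `SM_CHANNEL_<CHANNEL>` environment variable. A training script could locate the datasets roughly as follows; `channel_dir` is a hypothetical helper, and the actual scripts in `src_deepspeed` may do this differently.

```python
import os


def channel_dir(channel, env=None):
    """Return the local directory of a SageMaker input channel.

    Falls back to the default mount point when the environment
    variable is not set (for example, when testing locally).
    """
    env = os.environ if env is None else env
    return env.get(f"SM_CHANNEL_{channel.upper()}", f"/opt/ml/input/data/{channel}")


# In a training script, the datasets saved with save_to_disk could then be loaded with:
# from datasets import load_from_disk
# train_dataset = load_from_disk(channel_dir("training"))
# eval_dataset = load_from_disk(channel_dir("test"))
```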

In [23]:
print(huggingface_estimator)

<sagemaker.huggingface.estimator.HuggingFace object at 0x7efc0f7e7970>


In [24]:
print(model_s3_path)

s3://sagemaker-us-west-2-687912291502/llm/models/gpt-j/deepspeed/


In [25]:
!aws s3 ls s3://sagemaker-us-west-2-687912291502/llm/models/gpt-j/deepspeed/

2023-04-23 02:33:07       4328 added_tokens.json
2023-04-23 02:33:07        996 config.json
2023-04-23 02:33:07     456356 merges.txt
2023-04-23 02:33:07 24321080387 pytorch_model.bin
2023-04-23 02:33:07        470 special_tokens_map.json
2023-04-23 02:33:07    2135258 tokenizer.json
2023-04-23 02:33:07        763 tokenizer_config.json
2023-04-23 02:33:07       4527 training_args.bin
2023-04-23 02:33:07     798156 vocab.json
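To use the fine-tuned model, the files listed above can be downloaded to a local directory and loaded with `transformers`. The sketch below only covers splitting the S3 URI into bucket and prefix, which the boto3 calls in the comments would need; `split_s3_uri` and the local directory name are hypothetical, and downloading the roughly 24 GB `pytorch_model.bin` requires enough disk space and memory.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Possible follow-up (requires AWS credentials, boto3 and transformers):
# import os, boto3
# from transformers import AutoModelForCausalLM, AutoTokenizer
# bucket, prefix = split_s3_uri(model_s3_path)
# s3 = boto3.client("s3")
# os.makedirs("gpt-j-finetuned", exist_ok=True)
# for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
#     target = os.path.join("gpt-j-finetuned", os.path.basename(obj["Key"]))
#     s3.download_file(bucket, obj["Key"], target)
# model = AutoModelForCausalLM.from_pretrained("gpt-j-finetuned")
# tokenizer = AutoTokenizer.from_pretrained("gpt-j-finetuned")
```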
