References:
    
- https://gitlab.com/juliensimon/huggingface-demos/-/tree/main/summarization-t5-qlora

- https://www.youtube.com/watch?v=ebQ2wyn8RGM&list=WL&index=3

# Summarizing legal documents with Hugging Face and Amazon SageMaker

In [1]:
# Number of parameters for flan-t5 family: small 80M, base 250M, large 780M, xl 3B, xxl 11B
model_id = "google/flan-t5-large"  # copy-paste from HF

# https://huggingface.co/datasets/billsum
dataset_id = "billsum"  # copy-paste from HF

# Setup

In [1]:
# !pip -q install transformers datasets sagemaker --upgrade

In [2]:
# !pip -q install widgetsnbextension ipywidgets

In [2]:
import sagemaker

print(sagemaker.__version__)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
2.188.0


In [23]:
# sess = sagemaker.Session(default_bucket="anurag-finetune-llm")
# bucket = sess.default_bucket()

In [6]:
bucket = "anurag-finetune-llm"

In [7]:
import transformers
import datasets

print(transformers.__version__)
print(datasets.__version__)

4.34.1
2.14.5


# Preprocessing

## Load dataset

In [8]:
from datasets import load_dataset, load_from_disk

dataset = load_dataset(dataset_id)

In [9]:
type(dataset)

datasets.dataset_dict.DatasetDict

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 18949
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 3269
    })
    ca_test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 1237
    })
})

## Preprocess dataset 

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [12]:
prefix = "summarize: "
input_max_length = 1024
output_max_length = 128


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    
    model_inputs = tokenizer(inputs,
                             max_length=input_max_length,
                             truncation=True)
    labels = tokenizer(
                       text_target=examples["summary"],
                       max_length=output_max_length,
                       truncation=True
                      )
    
    model_inputs["labels"] = labels["input_ids"]
    
    return model_inputs
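With `truncation=True`, any bill longer than `input_max_length` tokens is cut off, so it can be worth checking how much text is lost before training. The following is a minimal sketch of such a check. The helper `truncation_report` and the example token counts are hypothetical and not part of the original notebook.

```python
from typing import Dict, List


def truncation_report(token_counts: List[int], max_length: int) -> Dict[str, float]:
    """Summarize how many sequences exceed max_length and how many tokens are dropped."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    total = len(token_counts)
    # Sequences strictly longer than the limit are the ones that get truncated.
    truncated = [n for n in token_counts if n > max_length]
    tokens_total = sum(token_counts)
    tokens_dropped = sum(n - max_length for n in truncated)
    return {
        "sequences": total,
        "truncated": len(truncated),
        "truncated_fraction": len(truncated) / total if total else 0.0,
        "dropped_token_fraction": tokens_dropped / tokens_total if tokens_total else 0.0,
    }


# Hypothetical token counts for illustration only, not measured on billsum.
example_counts = [300, 1024, 1500, 4096]
report = truncation_report(example_counts, max_length=1024)
print(report)
```

In the notebook, the real counts could be obtained by tokenizing without truncation, for example `[len(ids) for ids in tokenizer(inputs)["input_ids"]]`, and passed in together with `input_max_length`.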

In [13]:
tokenized_dataset = dataset.map(
                                preprocess_function,
                                batched=True,
                                remove_columns=["title", "text", "summary"]
                                )

In [14]:
tokenized_dataset.save_to_disk("billsum-t5-tokenized")

# Upload processed dataset to S3

In [15]:
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

s3_prefix = "huggingface/billsum-t5-summarization"

dataset_input_path = "s3://{}/{}".format(bucket, s3_prefix)
train_input_path = "{}/train".format(dataset_input_path)
valid_input_path = "{}/validation".format(dataset_input_path)

print(dataset_input_path)
print(train_input_path)
print(valid_input_path)

s3://anurag-finetune-llm/huggingface/billsum-t5-summarization
s3://anurag-finetune-llm/huggingface/billsum-t5-summarization/train
s3://anurag-finetune-llm/huggingface/billsum-t5-summarization/validation


In [17]:
type(tokenized_dataset)

datasets.dataset_dict.DatasetDict

In [18]:
tokenized_dataset.keys()

dict_keys(['train', 'test', 'ca_test'])

Get the IAM execution role attached to the SageMaker notebook instance. (It is set when the notebook instance is created.)

In [28]:
sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


Couldn't call 'get_role' to get Role ARN from role name sm_jumpstart_flan_bot_endpoint_role to get Role path.


'arn:aws:iam::195565468328:role/sm_jumpstart_flan_bot_endpoint_role'

Make sure this role has all the required permissions on the S3 bucket we are going to use. Then run the following cell.

Save the train and test splits to S3 (the test split serves as the validation set)

In [29]:
tokenized_dataset["train"].save_to_disk(train_input_path,
                                        fs=s3)
tokenized_dataset["test"].save_to_disk(valid_input_path,
                                       fs=s3)

In [30]:
%%sh -s $dataset_input_path
aws s3 ls --recursive $1

2023-10-20 23:22:00  114509384 huggingface/billsum-t5-summarization/train/data-00000-of-00001.arrow
2023-10-20 23:22:04       2064 huggingface/billsum-t5-summarization/train/dataset_info.json
2023-10-20 23:22:04        250 huggingface/billsum-t5-summarization/train/state.json
2023-10-20 23:22:04   19763920 huggingface/billsum-t5-summarization/validation/data-00000-of-00001.arrow
2023-10-20 23:22:05       2064 huggingface/billsum-t5-summarization/validation/dataset_info.json
2023-10-20 23:22:05        249 huggingface/billsum-t5-summarization/validation/state.json
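The upload and listing above only succeed if the execution role can list the bucket and read and write objects under the dataset prefix. As a rough sketch, a policy granting that access could be generated as follows. The helper `s3_dataset_policy` is hypothetical, and the exact set of actions your setup needs is an assumption (overwriting or cleaning up objects may also require `s3:DeleteObject`).

```python
import json


def s3_dataset_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document for reading/writing one prefix of a bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so the resource is the bucket ARN.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions apply to keys under the dataset prefix.
                "Sid": "ReadWriteDatasetObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


policy = s3_dataset_policy("anurag-finetune-llm", "huggingface/billsum-t5-summarization")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy on the role, or adapted to whatever broader policy the role already carries.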


Load the datasets from S3 back onto the local machine/VM

In [31]:
train_ds = load_from_disk(train_input_path)
valid_ds = load_from_disk(valid_input_path)

In [32]:
type(train_ds)

datasets.arrow_dataset.Dataset

# Fine-tune on SageMaker with a Hugging Face Deep Learning Container

In [33]:
!pygmentize train.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


import argparse
import logging
import os
import torch

import evaluate
import numpy as np
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    BitsAndBytesConfig
)
from peft import LoraCon

In [34]:
hyperparameters = {
    "epochs": 1,
    "learning-rate": 1e-6,
    "train-batch-size": 1,
    "eval-batch-size": 8,
    "model-name": model_id,
}
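In SageMaker script mode, each entry of the `hyperparameters` dict reaches the entry point as a `--key value` command-line argument, and each input channel passed to `fit()` is exposed through an `SM_CHANNEL_<NAME>` environment variable. Since the `train.py` listing above is truncated, the following is only a sketch of how such a script might read these values. The argument names `--train-dir`, `--valid-dir`, and `--model-dir` are assumptions, not taken from the original script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and data locations the way a SageMaker entry point might."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: keys mirror the notebook's hyperparameters dict.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=1e-6)
    parser.add_argument("--train-batch-size", type=int, default=1)
    parser.add_argument("--eval-batch-size", type=int, default=8)
    parser.add_argument("--model-name", type=str, default="google/flan-t5-large")
    # Data and output locations: SageMaker sets these environment variables in the
    # training container; the argument names themselves are hypothetical.
    parser.add_argument("--train-dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--valid-dir", type=str, default=os.environ.get("SM_CHANNEL_VALID"))
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the command line SageMaker would build from the hyperparameters dict.
args = parse_args([
    "--epochs", "1",
    "--learning-rate", "1e-6",
    "--train-batch-size", "1",
    "--eval-batch-size", "8",
    "--model-name", "google/flan-t5-large",
])
print(args.epochs, args.learning_rate, args.model_name)
```

Note that argparse converts dashes to underscores, so `--learning-rate` is read as `args.learning_rate`.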

In [35]:
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    role=sagemaker.get_execution_role(),
    # Fine-tuning script
    entry_point="train.py",
    dependencies=["requirements.txt"],
    hyperparameters=hyperparameters,
    # Infrastructure
    transformers_version="4.28.1",
    pytorch_version="2.0.0",
    py_version="py310",
    instance_type="ml.g5.xlarge",
    instance_count=1,
    #use_spot_instances=True,
    #max_run=86400, # 24 hours
    #max_wait=86400,
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


Couldn't call 'get_role' to get Role ARN from role name sm_jumpstart_flan_bot_endpoint_role to get Role path.


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
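The estimator above leaves the spot-instance settings commented out. If they are enabled, `max_wait` must be at least as large as `max_run`, because the wait time includes the training time. A small sketch of a helper that validates this before building the keyword arguments is shown below. The function name `spot_training_kwargs` is hypothetical.

```python
def spot_training_kwargs(max_run_seconds: int, max_wait_seconds: int) -> dict:
    """Return estimator keyword arguments for managed spot training."""
    if max_run_seconds <= 0:
        raise ValueError("max_run must be positive")
    # max_wait covers both waiting for spot capacity and the training run itself.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
    }


# Same values as the commented-out lines in the estimator: 24 hours each.
spot_kwargs = spot_training_kwargs(max_run_seconds=86400, max_wait_seconds=86400)
print(spot_kwargs)
```

The resulting dict could then be unpacked into the estimator call, as in `HuggingFace(..., **spot_kwargs)`.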


Make sure the IAM role has all the required permissions to submit training jobs on SageMaker. Then run the following cell.

In [None]:
huggingface_estimator.fit({"train": train_input_path,
                           "valid": valid_input_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-10-21-00-26-37-166


Using provided s3_resource
2023-10-21 00:26:37 Starting - Starting the training job...
2023-10-21 00:26:52 Starting - Preparing the instances for training......
2023-10-21 00:28:03 Downloading - Downloading input data......
2023-10-21 00:28:43 Training - Downloading the training image....................................
2023-10-21 00:34:55 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-10-21 00:35:07,452 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-10-21 00:35:07,465 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-10-21 00:35:07,474 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-10-21 00:35:07,480 sagemaker_pytorch_container.training INFO     Invoking user tra

In [40]:
huggingface_estimator.model_data

's3://sagemaker-us-east-1-195565468328/huggingface-pytorch-training-2023-10-21-00-26-37-166/output/model.tar.gz'

The model has now been fine-tuned and quantized, and the artifacts have been saved to S3.

Next steps:

- Upload the model artifacts to the Hugging Face Hub.

- Compare the base model with the fine-tuned, quantized model on 1) the size of the artifacts and 2) the response to the same payload.
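For the artifact size comparison, one simple starting point is to list the files inside a downloaded `model.tar.gz` together with their sizes. The sketch below assumes the archive has already been copied locally (for example with `aws s3 cp`). The helper `artifact_sizes` is hypothetical, and the demo runs on a tiny throwaway archive whose file names are made up, not on the real training output.

```python
import io
import os
import tarfile
import tempfile


def artifact_sizes(tar_path: str) -> dict:
    """Map each regular file in a .tar.gz archive to its uncompressed size in bytes."""
    with tarfile.open(tar_path, mode="r:gz") as archive:
        return {m.name: m.size for m in archive.getmembers() if m.isfile()}


# Build a small demo archive so the sketch runs without access to S3.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(demo_path, mode="w:gz") as archive:
        # Hypothetical file names and contents, for illustration only.
        for name, payload in [("adapter_config.json", b"{}"),
                              ("adapter_model.bin", b"\x00" * 1024)]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    sizes = artifact_sizes(demo_path)

total_bytes = sum(sizes.values())
print(sizes, total_bytes)
```

Running the same function on the base model's archive and on the fine-tuned one would give per-file and total sizes to compare side by side.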