# End-to-end fine-tuning notebook

1. Data preparation
2. Tokenization
3. Fine-tuning with QLoRA
4. Evaluation

# Data preparation

## Package installation and key object instantiation

In [1]:
!pip install huggingface_hub --upgrade --quiet
!pip install "transformers==4.30.2" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet

If you plan to use SageMaker in a local environment, you need access to an IAM role with the required SageMaker permissions. See the [SageMaker roles documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for details.
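
Outside of SageMaker, `sagemaker.get_execution_role()` may not be able to resolve a role, so the ARN is often looked up by role name instead. A minimal sketch, assuming a boto3-style IAM client is passed in by the caller; the helper name `resolve_execution_role` and the role name in the usage comment are placeholders, not part of this notebook:

```python
def resolve_execution_role(role_name, iam_client):
    """Return the ARN of an IAM role, looked up by name through an IAM client."""
    response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]

# Usage in a local environment (requires AWS credentials; the role name is a placeholder):
# import boto3
# role = resolve_execution_role("my-sagemaker-execution-role", boto3.client("iam"))
```

The resulting `role` string can then be used wherever the notebook uses the output of `sagemaker.get_execution_role()`.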



In [2]:
#required to work in local_mode on your notebook instance for development/debugging purpose
#!pip install 'sagemaker[local]' --upgrade --quiet
#!pip install docker-compose --quiet

In [68]:
import sagemaker
import boto3
import os

#uncomment to run in local mode
#from sagemaker import LocalSession
#sess = LocalSession()
#the config below sets the container's root on the EBS volume of your instance.
#sess.config = {'local' : {'local_code' : True, 'container_root' : '/home/ec2-user/SageMaker/'}}
#if you run out of disk space in local mode, consider running docker_scripts/prepare-docker.sh to move the docker root under /home/ec2-user/SageMaker

sess = sagemaker.Session()
region = sess.boto_region_name

#replace with a specific bucket if needed
sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
s3_client = boto3.client("s3")
s3_prefix = "model-fine-tuning"

#local notebook path
notebook_home = "/home/ec2-user/SageMaker/"

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {region}")

sagemaker role arn: arn:aws:iam::327216439222:role/Sagemaker
sagemaker bucket: sagemaker-us-east-1-327216439222
sagemaker session region: us-east-1


### Model selection

Choose the model you want to fine tune.

In [4]:
model_id = "tiiuae/falcon-7b"
#model_id = "tiiuae/falcon-7b-instruct"
model_name = model_id.split("/")[-1]

## Data Preparation

### Structure of our BBC Dataset

We fine-tune on BBC news articles, which are contained in a local zip file.

In [5]:
import zipfile

#name of the zip file that we'll use
data_zip = "BBC_news_summary.zip"

base_dir = os.getcwd()

path_to_file = os.path.join(base_dir, "data", data_zip)

#unzipping the file
with zipfile.ZipFile(path_to_file, 'r') as zip_ref:
    zip_ref.extractall(os.path.join(notebook_home, "data"))

#Folders that we'll iterate through after unzipping.
articles_folder = "News Articles"
summaries_folder = "Summaries"
sub_folders = ["business", "entertainment", "politics", "sport", "tech"]

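#os.path.join avoids the double slash produced by notebook_home's trailing "/"
articles_folders = os.path.join(notebook_home, "data", "BBC_news_summary", articles_folder)
summaries_folder = os.path.join(notebook_home, "data", "BBC_news_summary", summaries_folder)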

### Transform folder-based data into JSON Lines

Each line of the output file should hold one JSON object of the following form:

{

  "id": "13818513",
  
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  
  "content": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
  
}
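
In a JSON Lines file, each record of that shape sits on its own line. A small sketch using only the standard library; the `records` list and the in-memory buffer are illustrative and not part of the notebook's data:

```python
import io
import json

records = [
    {"id": "business_001", "content": "Article text.", "summary": "Short summary."},
    {"id": "sport_002", "content": "Line one.\nLine two.", "summary": "Two lines."},
]

# Write one JSON object per line; json.dumps escapes embedded newlines as \n
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the records back, one line at a time
buffer.seek(0)
parsed = [json.loads(line) for line in buffer]
```

Because newlines inside a field are escaped, an article spanning several paragraphs still occupies exactly one line of the file.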

In [6]:
import json

with open(os.path.join(notebook_home, "data", "data_jsonlines.jsonl"), 'w') as outfile:
    for folder in os.scandir(path = articles_folders):
        for filename in os.scandir(path = articles_folders + "/" + str(folder.name)):
            if filename.is_file():
                try:
                    #create article id of the form folder_001
                    id_article = str(folder.name) + "_" + str(filename.name).split(".")[0]

                    #get article content
                    content = ""
                    with open(filename, 'rb') as file:
                        content = file.read()
                    #get article summary
                    summary = ""
                    equivalent_summary_file = summaries_folder + "/" + str(folder.name) + "/" + str(filename.name)
                    with open(equivalent_summary_file, 'rb') as file:
                        summary = file.read()

                    #create json object
                    data = {}
                    data['id'] = id_article
                    data['content'] = content.decode("utf-8")
                    data['summary'] = summary.decode("utf-8")
   
                    json.dump(data, outfile)
                    outfile.write('\n')

                except UnicodeDecodeError:
                    print(f"skipping:{id_article} due to UnicodeDecodeError")

skipping:sport_199 due to UnicodeDecodeError
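
The skipped file is presumably not valid UTF-8. One possible alternative to skipping it is to fall back to a second codec. The sketch below assumes such files use a single-byte Western encoding such as cp1252; this is a guess, not something the dataset documents, and `decode_with_fallback` is a hypothetical helper:

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text and mark undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")
```

Whether a fallback is appropriate depends on how much a few mis-decoded characters matter for the training data; skipping one article out of a few thousand, as done above, is also a reasonable choice.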


## Train/test split

We split the JSON Lines file into train and test sets before further processing.

In [69]:
from sklearn.model_selection import train_test_split

with open(os.path.join(notebook_home, "data", "data_jsonlines.jsonl")) as f:
    lines = f.readlines()
    
train, test = train_test_split(lines, test_size=0.2)

with open(os.path.join(notebook_home, "data", "data_jsonlines_train.jsonl"), 'w') as outfile:
    for t in train:
        #no need to add a newline character as each line already ends with one.
        outfile.write(t)
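
`train_test_split` shuffles by default, so each run produces a different split unless `random_state` is fixed. The sketch below reproduces the idea with the standard library only; the function name `split_lines` and the seed value are illustrative assumptions, not part of the notebook:

```python
import random

def split_lines(lines, test_size=0.2, seed=42):
    """Shuffle a copy of `lines` deterministically and split off a test fraction."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_lines, test_lines = split_lines([f"record {i}\n" for i in range(10)])
```

With scikit-learn itself, passing `random_state=42` to `train_test_split` makes the split repeatable in the same way.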

In [72]:
test_data = []
for t in test:
    t = json.loads(t)
    #payload = {"inputs": t["content"], "parameters":{ "do_sample": True, "top_p": 0.9, "temperature": 0.3, "max_new_tokens": 1024}}
    payload = {"inputs": t["content"]}
    test_data.append(payload)

In [73]:
with open(os.path.join(notebook_home, "data", "data_jsonlines_test.jsonl"), 'w') as outfile:
    for t in test_data:
        json.dump(t, outfile)
        outfile.write('\n')

### Upload the test JSON Lines file for later testing/validation

In [74]:
s3_client.upload_file(os.path.join(notebook_home, "data", "data_jsonlines_test.jsonl"), sagemaker_session_bucket, os.path.join(s3_prefix, "test", "data_jsonlines_test.jsonl"))
test_input_path = os.path.join("s3://", sagemaker_session_bucket, s3_prefix, "test", "data_jsonlines_test.jsonl")
test_input_path

's3://sagemaker-us-east-1-327216439222/model-fine-tuning/test/data_jsonlines_test.jsonl'

### Load into HF dataset object

In [15]:
from datasets import load_dataset, load_from_disk

#we load the data into a dataset object
dataset = load_dataset('json', data_files=os.path.join(notebook_home, "data", "data_jsonlines_train.jsonl"), split="train")

Downloading and preparing dataset json/default to /home/ec2-user/.cache/huggingface/datasets/json/default-bfa0b3dcf2f75f20/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...



Dataset json downloaded and prepared to /home/ec2-user/.cache/huggingface/datasets/json/default-bfa0b3dcf2f75f20/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


In [17]:
dataset

Dataset({
    features: ['id', 'content', 'summary'],
    num_rows: 1779
})

Applying the template to the dataset

## Data preparation for domain adaptation

We prepare the data for domain adaptation. In this scenario we merge content and summary into a single "text" feature, since we only care about the language-modelling value of the training data.

In [19]:
from transformers import AutoTokenizer

# Load tokenizer of falcon
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048 #you might want to reduce that depending on GPU memory available


In [20]:
from random import randint
from itertools import chain
from functools import partial

# prompt template used for domain adaptation
prompt_template_domain = "Provide a summary for the following article:\n{content}\n---\nSummary:\n{summary}{eos_token}"

# template dataset to add prompt to each sample
def template_dataset_domain_tuning(sample):
    sample["text"] = prompt_template_domain.format(content=sample["content"],
                                            summary=sample["summary"],
                                            eos_token=tokenizer.eos_token)
    return sample

# apply prompt template per sample

dataset_domain = dataset.map(template_dataset_domain_tuning, remove_columns=list(dataset.features))


In [21]:
dataset_domain

Dataset({
    features: ['text'],
    num_rows: 1779
})

In [22]:
#printing a random example of the domain fine-tuning data
#randint is inclusive on both ends, so the upper bound must be len - 1
print(dataset_domain[randint(0, len(dataset_domain) - 1)]["text"])

Provide a summary for the following article:
Tsunami slows Sri Lanka's growth

Sri Lanka's president has launched a reconstruction drive worth $3.5bn (£1.8bn) by appealing for peace and national unity.

President Kumaratunga said it was now important to find a peaceful solution to years of internal conflict. Meanwhile, the International Monetary Fund (IMF) said damage from the tsunami would cut one percentage point from Sri Lanka's economic growth this year. It estimated the wave left physical damage equal to 6.5% of the economy.

Separately, the International Labour Organisation (ILO) said that at least one million people have lost their livelihoods in Sri Lanka and Indonesia alone. It called for action to create jobs. President Kumaratunga attended a ceremony in the southern town of Hambantota. She was joined by government and opposition politicians, together with Buddhist, Hindu, Muslim and Christian clergy.

Prime Minister Mahinda Rajapakse laid the foundation stone on a new housin

### Tokenizer and "chunking"

We retrieve the tokenizer for our specific model using the Hugging Face from_pretrained() API.

We then concatenate the tokenized text and cut it into chunks of a fixed size. A chunk may contain the end of one article and the start of the next. That is acceptable here: the goal is to lose no information and to give the model training samples of uniform length.

In [23]:
# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "token_type_ids": [], "attention_mask": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    #print(concatenated_examples.keys())
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # number of tokens that fit into full chunks (0 if the batch is shorter than one chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result

# tokenize and chunk dataset
def tokenize_chunk(dataset, tokenizer):
    lm_dataset = dataset.map(
        lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
    ).map(
        partial(chunk, chunk_length=2048),
        batched=True,
    )
    return lm_dataset
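
To see the concatenate-then-chunk logic in isolation, here is a toy version on plain lists of integers. It is a simplified sketch (a single feature, and an explicit `remainder` argument instead of a global variable), not the function used in this notebook:

```python
from itertools import chain

def chunk_tokens(batch, chunk_length, remainder=()):
    """Concatenate token lists, cut fixed-size chunks, and return the leftover."""
    tokens = list(remainder) + list(chain.from_iterable(batch))
    usable = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i:i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, tokens[usable:]

# Two "articles" of 5 and 4 tokens, chunked into blocks of 4
chunks, leftover = chunk_tokens([[1, 2, 3, 4, 5], [6, 7, 8, 9]], chunk_length=4)
# The leftover is prepended to the next batch, so no token is dropped
next_chunks, next_leftover = chunk_tokens([[10, 11, 12]], 4, remainder=leftover)
```

The second chunk of the first batch mixes tokens from both articles, which is the behaviour described above: chunk boundaries ignore article boundaries.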

In [24]:
# tokenize and chunk the domain adaptation dataset
lm_dataset_domain = tokenize_chunk(dataset_domain, tokenizer)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset_domain)}")


Token indices sequence length is longer than the specified maximum sequence length for this model (2555 > 2048). Running this sequence through the model will result in indexing errors



Total number of samples: 640


The tokenizer transformed our "text" feature into token ids compatible with our model, and the chunking step grouped them into fixed-length samples.

In [25]:
lm_dataset_domain

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 640
})

In [26]:
#sample of the input_ids feature
print(lm_dataset_domain[0]['input_ids'][:100])

[47945, 241, 11055, 312, 248, 1863, 2507, 37, 193, 6273, 424, 14522, 648, 204, 18, 55298, 5676, 18, 193, 193, 52096, 5485, 2821, 4562, 271, 980, 4881, 3824, 11701, 312, 10629, 272, 248, 2574, 6675, 23, 1815, 10374, 47958, 25, 193, 193, 44, 6414, 312, 248, 13730, 8305, 204, 6186, 16, 275, 648, 24, 3690, 94, 1112, 506, 1304, 5676, 272, 241, 2679, 6675, 4354, 335, 645, 204, 1392, 16, 275, 204, 1121, 271, 204, 1463, 649, 39763, 25, 35051, 9744, 17808, 393, 678, 1342, 746, 565, 241, 204, 13, 10129, 54763, 10137, 13, 398, 11272, 388, 248, 6675, 334]


In [27]:
#sample of the token_type_ids feature
print(lm_dataset_domain[0]['token_type_ids'][:100])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [28]:
#sample of the attention_mask feature
print(lm_dataset_domain[0]['attention_mask'][:100])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [29]:
#sample of the labels feature
print(lm_dataset_domain[0]['labels'][:100])

[47945, 241, 11055, 312, 248, 1863, 2507, 37, 193, 6273, 424, 14522, 648, 204, 18, 55298, 5676, 18, 193, 193, 52096, 5485, 2821, 4562, 271, 980, 4881, 3824, 11701, 312, 10629, 272, 248, 2574, 6675, 23, 1815, 10374, 47958, 25, 193, 193, 44, 6414, 312, 248, 13730, 8305, 204, 6186, 16, 275, 648, 24, 3690, 94, 1112, 506, 1304, 5676, 272, 241, 2679, 6675, 4354, 335, 645, 204, 1392, 16, 275, 204, 1121, 271, 204, 1463, 649, 39763, 25, 35051, 9744, 17808, 393, 678, 1342, 746, 565, 241, 204, 13, 10129, 54763, 10137, 13, 398, 11272, 388, 248, 6675, 334]


Notice that all input_ids sequences have the same length, which is the chunk length.

In [30]:
print(len(lm_dataset_domain[0]['input_ids']))
print(len(lm_dataset_domain[1]['input_ids']))

2048
2048


## Data preparation for instruction fine-tuning
Now we take the same data but transform it differently, to instruction fine-tune our model for summarisation.
This time each input is truncated if it exceeds the limit, rather than being concatenated with other inputs. We also pad data points that are under the limit, so that the input length stays constant for the model.

In [31]:
from transformers import AutoTokenizer

# Load tokenizer of falcon
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048 #you might want to reduce that depending on GPU memory available
tokenizer.pad_token = tokenizer.eos_token

In [32]:
from random import randint
from itertools import chain
from functools import partial
    
def format_dataset_instruction_tuning(sample):
    
    prompt_template_instruction = f"prompt: Provide a summary for the following text article:. Text: {{content}} completion:{{summary}}"
    full_prompt = prompt_template_instruction.format(content=sample["content"], summary= sample["summary"])
    
    #note that we're asking the tokenizer to do padding till max length of our chunks and also truncate if it's above the limit.
    return tokenizer(full_prompt, padding='max_length', truncation=True, max_length=2048)

def add_labels_column(sample):
    return {'labels': sample['input_ids']}
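
What `padding='max_length'` and `truncation=True` do to a single sequence can be mimicked on plain lists. This is an illustrative sketch, not the tokenizer's implementation; `pad_id=0` and right-side padding are assumptions made for the example:

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])      # truncate if too long
    mask = [1] * len(ids)                   # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
```

Unlike the chunking approach used for domain adaptation, anything beyond `max_length` is discarded here, so very long articles may lose part or all of their summary.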

In [33]:
# tokenize the dataset
lm_dataset_instruction = dataset.shuffle().map(format_dataset_instruction_tuning)


Note that a labels column is still missing. It is required for training and, in this scenario, is simply a copy of input_ids.
Note as well that you can achieve a similar result by passing a DataCollatorForLanguageModeling as the data_collator of the Trainer object in the run_clm.py file, instead of using the default one.

    trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            data_collator=default_data_collator,
            #data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    
More info here: https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling
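
One difference between the two approaches is worth keeping in mind: copying `input_ids` keeps padding positions in the labels, whereas `DataCollatorForLanguageModeling(tokenizer, mlm=False)` replaces label positions equal to the pad token id with `-100`, which the loss ignores. A plain-Python sketch of that masking; the token ids and `pad_id` are made-up values, and `mask_pad_labels` is a hypothetical helper:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_pad_labels(input_ids, pad_id):
    """Copy input_ids into labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_id else token for token in input_ids]

input_ids = [12, 48, 7, 11, 11, 11]   # 11 plays the role of the pad token here
labels = mask_pad_labels(input_ids, pad_id=11)
```

Because this notebook sets `pad_token = eos_token`, a collator that masks by pad id would also mask any genuine end-of-sequence tokens.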

In [34]:
lm_dataset_instruction = lm_dataset_instruction.map(add_labels_column)


In [35]:
lm_dataset_instruction

Dataset({
    features: ['id', 'content', 'summary', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 1779
})

Let's print a row from the dataset to understand what happened and also make sure that we've got consistent chunk size across our data points.

In [36]:
print(lm_dataset_instruction.features)
print(lm_dataset_instruction.shape)

{'id': Value(dtype='string', id=None), 'content': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
(1779, 7)


In [37]:
print(lm_dataset_instruction[0]['id'])
print(lm_dataset_instruction[0]['content'])
print(lm_dataset_instruction[0]['summary'])
print("\n")
print(lm_dataset_instruction[0]['input_ids'][:100])
print(lm_dataset_instruction[0]['token_type_ids'][:100])
print(lm_dataset_instruction[0]['attention_mask'][:100])
print(lm_dataset_instruction[0]['labels'][:100])
print("\n")
print(len(lm_dataset_instruction[0]['input_ids']))
print(len(lm_dataset_instruction[1]['input_ids']))

business_481
Christmas sales worst since 1981

UK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.

Retail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of retailers have already reported poor figures for December. Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth, according to the ONS.

The last time retailers endured a tougher Christmas was 23 years previously, when sales plunged 1.7%.

The ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures. Some analysts put a positive gloss on the figures, pointing out that the non-seasonally-adjusted figures showed a performance comparable with 2003. T

### Uploading tokenized and chunked datasets to S3

In [38]:
lm_dataset_domain

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 640
})

In [43]:
def upload_dataset_to_s3(lm_dataset, bucket, s3_prefix, dataset_name):

    training_input_path = os.path.join("s3://", bucket, s3_prefix, dataset_name, "train", "")
    lm_dataset.save_to_disk(training_input_path)
    
    return training_input_path

In [44]:
training_input_path_domain = upload_dataset_to_s3(lm_dataset_domain, sagemaker_session_bucket, s3_prefix, "tokenized-domain")


In [45]:
training_input_path_instruct = upload_dataset_to_s3(lm_dataset_instruction, sagemaker_session_bucket, s3_prefix, "tokenized-instruct")


In [46]:
%store test_input_path
%store training_input_path_domain
%store training_input_path_instruct

Stored 'training_input_path_domain' (str)
Stored 'training_input_path_instruct' (str)
