### Bloom 1b1 finetuning with PEFT prompt tuning

Import libraries

In [1]:
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer, TrainerCallback



Load the dataset. The cleaned Alpaca dataset (`yahma/alpaca-cleaned`) is used, restricted to the first 8,000 rows of the training split.

In [2]:
ds = load_dataset("yahma/alpaca-cleaned", split="train[:8000]")
ds # inspect dataset structure

Dataset({
    features: ['output', 'instruction', 'input'],
    num_rows: 8000
})

Inspect the first 3 rows of the data

In [3]:
ds[:3]

{'output': ['1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.',
  'The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the add

Pre-process the dataset with the tokenizer of the pretrained model. First, load the tokenizer.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")
tokenizer

BloomTokenizerFast(name_or_path='bigscience/bloom-1b1', vocab_size=250680, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

Define the preprocessing function and map it over the dataset

In [5]:
def process_func(example):
    MAX_LENGTH = 256  # maximum number of tokens kept per example
    # Prompt: "Human: <instruction>\n<input>" followed by the "Assistant: " cue
    instruction = tokenizer("\n".join(["Human: " + example["instruction"], example["input"]]).strip() + "\n\nAssistant: ")
    # Target: the reference answer, terminated with the EOS token
    response = tokenizer(example["output"] + tokenizer.eos_token)
    input_ids = instruction["input_ids"] + response["input_ids"]
    attention_mask = instruction["attention_mask"] + response["attention_mask"]
    # Mask the prompt tokens with -100 so the loss is computed on the response only
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"]
    # Truncate over-long examples (this can cut off the trailing EOS token)
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
tokenized_ds = ds.map(process_func, remove_columns=ds.column_names)
tokenized_ds

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 8000
})
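
The masking and truncation logic in `process_func` can be tried without a tokenizer. The sketch below uses made-up token IDs and a hypothetical helper name (`build_example`); it only mirrors the list operations above and is not part of the original notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids, response_ids, max_length):
    """Concatenate prompt and response IDs and mask the prompt in the labels."""
    input_ids = prompt_ids + response_ids
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    # Slice all three lists to the same maximum length.
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


# Hypothetical token IDs, chosen only for illustration (2 stands in for EOS).
toy = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22, 2], max_length=5)
print(toy)
```

In this toy case the final ID (the stand-in for EOS) is dropped by truncation, which is the same effect the `MAX_LENGTH` cut has on long examples in the real dataset.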

Inspect the processed dataset

In [6]:
tokenizer.decode(tokenized_ds[1]["input_ids"])

'Human: What are the three primary colors?\n\nAssistant: The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB).</s>'

In [7]:
tokenizer.decode(list(filter(lambda x: x != -100, tokenized_ds[1]["labels"])))

'The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB).</s>'

Load the pre-trained model Bloom-1b1

In [11]:
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", low_cpu_mem_usage=True)
model # inspect model structure

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1536)
    (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=1536, out_features=4608, bias=True)
          (dense): Linear(in_features=1536, out_features=1536, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=1536, out_features=6144, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=6144, out_features=1536, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
  )
  (

Set the configuration for prompt tuning using the Hugging Face PEFT library

In [12]:
from peft import PromptTuningConfig, get_peft_model, TaskType, PromptTuningInit

# Text used to initialize the virtual tokens; its token count sets their number
init_text = "Below is a conversation between a person and a chatbot."

config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM,
                            prompt_tuning_init=PromptTuningInit.TEXT,
                            prompt_tuning_init_text=init_text,
                            num_virtual_tokens=len(tokenizer(init_text)["input_ids"]),
                            tokenizer_name_or_path="bigscience/bloom-1b1")
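
As a rough mental model of what this configuration sets up: prompt tuning keeps the base model frozen and learns a small matrix of "virtual token" embeddings that is placed in front of the embedded input. The sketch below illustrates that idea for a single sequence with toy two-dimensional vectors. It is a conceptual illustration only, not PEFT's actual implementation, and the function name `prepend_virtual_tokens` is hypothetical.

```python
def prepend_virtual_tokens(prompt_embeddings, input_embeddings, attention_mask, labels):
    """Conceptual view of prompt tuning for one sequence (no batching)."""
    n_virtual = len(prompt_embeddings)
    return {
        # Trainable prompt vectors go in front of the frozen token embeddings.
        "inputs_embeds": prompt_embeddings + input_embeddings,
        # The virtual tokens are always attended to.
        "attention_mask": [1] * n_virtual + attention_mask,
        # No loss is computed on the virtual-token positions.
        "labels": [-100] * n_virtual + labels,
    }


# Toy values: 2 virtual tokens, 3 real tokens, hidden size 2.
prompt = [[0.1, 0.2], [0.3, 0.4]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = prepend_virtual_tokens(prompt, tokens, attention_mask=[1, 1, 1], labels=[-100, 7, 8])
print(len(out["inputs_embeds"]), out["attention_mask"], out["labels"])
```

With `PromptTuningInit.TEXT`, the prompt vectors start from the embeddings of the initialization text rather than from random values, which is why `num_virtual_tokens` is set to the token count of that text.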

Wrap the base model for PEFT training using this configuration

In [13]:
model = get_peft_model(model, config)
model 

PeftModelForCausalLM(
  (base_model): BloomForCausalLM(
    (transformer): BloomModel(
      (word_embeddings): Embedding(250880, 1536)
      (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
      (h): ModuleList(
        (0-23): 24 x BloomBlock(
          (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          (self_attention): BloomAttention(
            (query_key_value): Linear(in_features=1536, out_features=4608, bias=True)
            (dense): Linear(in_features=1536, out_features=1536, bias=True)
            (attention_dropout): Dropout(p=0.0, inplace=False)
          )
          (post_attention_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          (mlp): BloomMLP(
            (dense_h_to_4h): Linear(in_features=1536, out_features=6144, bias=True)
            (gelu_impl): BloomGelu()
            (dense_4h_to_h): Linear(in_features=6144, out_features=1536, bias=True)
          )
        )
      

Check the number of parameters that will be trained during finetuning

In [18]:
model.print_trainable_parameters()

trainable params: 18,432 || all params: 1,065,332,736 || trainable%: 0.0017301636734835116
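
These figures can be reproduced by hand from the layer shapes printed above. The sketch below assumes that the output head shares its weights with `word_embeddings` (so it is not counted twice) and that there are 12 virtual tokens, a value inferred from 18,432 / 1,536 rather than stated in the notebook.

```python
HIDDEN = 1536          # hidden size from the printed model structure
VOCAB = 250880         # rows of word_embeddings
N_LAYERS = 24          # number of BloomBlock layers
N_VIRTUAL_TOKENS = 12  # assumption: inferred from 18,432 / 1,536


def linear(n_in, n_out):
    """Parameter count of a Linear layer with bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameter count of a LayerNorm (scale and bias)."""
    return 2 * n


block = (
    layer_norm(HIDDEN)                # input_layernorm
    + linear(HIDDEN, 3 * HIDDEN)      # query_key_value
    + linear(HIDDEN, HIDDEN)          # dense
    + layer_norm(HIDDEN)              # post_attention_layernorm
    + linear(HIDDEN, 4 * HIDDEN)      # dense_h_to_4h
    + linear(4 * HIDDEN, HIDDEN)      # dense_4h_to_h
)
base = VOCAB * HIDDEN + layer_norm(HIDDEN) + N_LAYERS * block + layer_norm(HIDDEN)

trainable = N_VIRTUAL_TOKENS * HIDDEN  # one learned vector per virtual token
total = base + trainable
percent = 100 * trainable / total
print(f"trainable params: {trainable:,} || all params: {total:,} || trainable%: {percent}")
```

Under these assumptions the only trained weights are the virtual-token embeddings, one 1,536-dimensional vector per token.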


Set arguments for trainer

In [19]:
args = TrainingArguments(
    output_dir="./chatbot", # Save checkpoints to a folder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1,
    save_steps=20,
    disable_tqdm=True
)
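
The step count implied by these arguments can be worked out directly. The sketch below assumes a single device and the `Trainer` defaults of a 5e-5 learning rate with linear decay and no warmup (none of these are set explicitly in the notebook). Under those assumptions it agrees with the `global_step=1000` and the first logged learning rate of 4.95e-05 that appear in the training output.

```python
num_rows = 8000          # size of the training subset
per_device_batch = 1     # per_device_train_batch_size
grad_accum = 8           # gradient_accumulation_steps
num_devices = 1          # assumption: single GPU
num_epochs = 1
logging_steps = 10
save_steps = 20

# Samples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices
# 8000 divides evenly by 8, so no rounding question arises here.
total_steps = (num_rows // effective_batch) * num_epochs

log_events = total_steps // logging_steps
checkpoints = total_steps // save_steps


def linear_lr(step, base_lr=5e-5, total=total_steps):
    """Learning rate under linear decay to zero with no warmup (assumed defaults)."""
    return base_lr * (1 - step / total)


print(effective_batch, total_steps, log_events, checkpoints)
print(linear_lr(10), linear_lr(20))
```

With `save_steps=20` and no limit on retained checkpoints, this run would write about 50 checkpoint folders under `./chatbot`, which is worth keeping in mind for disk usage.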

Create the trainer for finetuning the model

In [20]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),    
)

# Callback that appends each logged training loss to a text file
class LossLoggingCallback(TrainerCallback):
    def __init__(self, output_dir):
        self.output_dir = output_dir
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # logs may be None, and the final summary log has no 'loss' key
        if logs and 'loss' in logs:
            self.losses.append(logs['loss'])
            with open(f"{self.output_dir}/losses.txt", "a") as f:
                f.write(f"{state.global_step}: {logs['loss']}\n")

trainer.add_callback(LossLoggingCallback(output_dir="./"))
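
The callback writes one `step: loss` line per logging event, so the file is easy to read back for plotting or smoothing. A minimal sketch follows, with hypothetical helper names (`parse_loss_lines`, `moving_average`). The sample lines use the first three loss values from the training log, with step numbers that assume `logging_steps=10`.

```python
def parse_loss_lines(lines):
    """Parse 'step: loss' lines into a list of (step, loss) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        step, loss = line.split(":", 1)
        records.append((int(step), float(loss)))
    return records


def moving_average(values, window):
    """Simple moving average over a fixed window."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# In practice the lines would come from open("./losses.txt").
sample_lines = ["10: 2.4391\n", "20: 2.4893\n", "30: 2.4627\n"]
records = parse_loss_lines(sample_lines)
smoothed = moving_average([loss for _, loss in records], window=2)
print(records)
print(smoothed)
```

Because the callback opens the file in append mode, rerunning training adds new lines after the old ones; deleting `losses.txt` between runs avoids mixing them.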

Train the model

In [21]:
trainer.train()

{'loss': 2.4391, 'grad_norm': 0.6870361566543579, 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.01}
{'loss': 2.4893, 'grad_norm': 0.6608906388282776, 'learning_rate': 4.9e-05, 'epoch': 0.02}
{'loss': 2.4627, 'grad_norm': 1.5243343114852905, 'learning_rate': 4.85e-05, 'epoch': 0.03}
{'loss': 2.5278, 'grad_norm': 1.57200026512146, 'learning_rate': 4.8e-05, 'epoch': 0.04}
{'loss': 2.6109, 'grad_norm': 1.3944616317749023, 'learning_rate': 4.75e-05, 'epoch': 0.05}
{'loss': 2.3909, 'grad_norm': 1.9582581520080566, 'learning_rate': 4.7e-05, 'epoch': 0.06}
{'loss': 2.4162, 'grad_norm': 1.15701425075531, 'learning_rate': 4.6500000000000005e-05, 'epoch': 0.07}
{'loss': 2.5592, 'grad_norm': 4.310897350311279, 'learning_rate': 4.600000000000001e-05, 'epoch': 0.08}
{'loss': 2.4701, 'grad_norm': 1.0848174095153809, 'learning_rate': 4.55e-05, 'epoch': 0.09}
{'loss': 2.2615, 'grad_norm': 1.9100693464279175, 'learning_rate': 4.5e-05, 'epoch': 0.1}
{'loss': 2.1659, 'grad_norm': 1.8064520359039307,

TrainOutput(global_step=1000, training_loss=2.0148623447418212, metrics={'train_runtime': 995.5488, 'train_samples_per_second': 8.036, 'train_steps_per_second': 1.004, 'train_loss': 2.0148623447418212, 'epoch': 1.0})

Test the finetuned model with a sample input

In [47]:
model = model.cuda()
ipt = tokenizer("Human: {}\n{}".format("How to prepare for an exam?", "").strip() + "\n\nAssistant: ", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**ipt, max_length=128, do_sample=True)[0], skip_special_tokens=True))

Human: How to prepare for an exam?

Assistant: Use prepared material for the exam to reinforce the material being taught, and think through any questions that may come up or are coming up. For example, you might find yourself thinking about the question "What is a free event?" and want to give a specific answer to it, or perhaps you can look up other relevant topics for your area.


In [34]:
ipt2 = tokenizer("Human: {}\n{}".format("Give me some tips for an exam.", "").strip() + "\n\nAssistant: ", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**ipt2, max_length=128, do_sample=True)[0], skip_special_tokens=True))

Human: Give me some tips for an exam.

Assistant: An important thing to take into consideration is the stress of reading. The question is not a simple academic one but a life-help question with significant consequences in the real world. Here are recommendations to try to avoid:
1. Keep a daily plan and work on your mindlist. This will help you to stay consistent.
2. Create an environment that encourages you to focus on the test, as this will reduce stress.
3. Remember to set realistic goals for your exam;
4. It is also advisable to keep a record of where exactly you failed while
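
The prompts used for testing follow the same template as `process_func`, and the two need to stay in sync for the learned virtual tokens to see familiar input. One way to avoid repeating the format string is a small helper. The sketch below is an illustration, and `build_prompt` is a hypothetical name that does not appear in the notebook.

```python
def build_prompt(instruction, input_text=""):
    """Build the 'Human: ... Assistant: ' prompt used for training and inference."""
    # Same construction as in process_func: instruction, optional input, then the cue.
    body = "\n".join(["Human: " + instruction, input_text]).strip()
    return body + "\n\nAssistant: "


prompt_no_input = build_prompt("How to prepare for an exam?")
# Hypothetical instruction/input pair, for illustration only.
prompt_with_input = build_prompt("Translate to French.", "Good morning")
print(repr(prompt_no_input))
print(repr(prompt_with_input))
```

The returned string can be passed to the tokenizer in place of the inline `"Human: {}\n{}".format(...)` expression. Since `do_sample=True` is used in `generate`, the model's answers will still differ from run to run.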
