<a href="https://colab.research.google.com/github/asokraju/LangChainDatasetForge/blob/main/Finetuning_Falcon_7b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading Data

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [0]:
df = pd.read_csv('dataset.csv', encoding='utf-8', usecols=['user_plan', 'user_input'])
df = df.drop(df.index[0])
# df = df.dropna(subset=['summary'])
df.to_csv('dataset_clean.csv')


Sample data:    

    *Unstructured*: So, I'm looking for a winter backpacking adventure in Tokyo, and I'm not afraid to go alone! I'm hoping to find a hostel to stay in, and I'm going to book it through an online platform. I'm looking for a private tour guide who speaks English, and I'm sure I'll have a great time! After all, what could possibly go wrong with a budget of $5000?
    *Structured*: Travelers: 1 persons, Accommodation: hostel, Booking Mode: online platform, Travel Type: backpacking, Duration: 2 weeks, Season: winter, Language Preference: English, Guide Preference: private tour, Destinations: Tokyo, Budget: $5000.

    *Unstructured*: Ah, what a grand adventure I have planned! Seven of us are setting off to explore the wonders of Paris, Tokyo, and beyond in the glorious springtime! We shall be traveling for 23 days and have a budget of $15,000, so there will be plenty of room for sightseeing, folk performances, and other exciting activities. We shall be sure to take in the sights, sounds, and cultures of these two amazing cities! Who knows what wonders await us?
    *Structured*: Cultural Interest: folk performances, Budget: $15000, Destinations: Paris, Tokyo, Duration: 23 days, Season: spring, Activities: sightseeing, Travelers: 7 persons.

# Install packages

First, let's install the required modules: bitsandbytes, transformers, and accelerate. Please note that you'll need a GPU with at least 16GB of memory for this to function correctly.
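
As a rough sanity check on the memory requirement, the weights-only footprint can be estimated from the parameter count and the bits stored per parameter. This is a minimal sketch: the figure of 7 billion parameters is an assumption based on the model name, the helper `weight_memory_gb` is hypothetical, and activations, optimizer state, and framework overhead are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

n_params = 7e9  # assumption: roughly 7 billion parameters

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
int4_gb = weight_memory_gb(n_params, 4)   # 4-bit quantized weights

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

The gap between the two numbers is the reason the notebook loads the model in 4-bit before fine-tuning.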

In [0]:
!pip install -Uqqq pip
!pip install -qqq bitsandbytes==0.39.0
!pip install -qqq torch==2.0.1
!pip install -qqq -U git+https://github.com/huggingface/transformers.git@e03a9cc
!pip install -qqq -U git+https://github.com/huggingface/peft.git@42a184f
!pip install -qqq -U git+https://github.com/huggingface/accelerate.git@c9fbb71
!pip install -qqq datasets==2.12.0
!pip install -qqq loralib==0.1.1
!pip install -qqq einops==0.6.1

# Import packages
Let us import the necessary packages.

In [0]:
import json
import os
from pprint import pprint
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from huggingface_hub import notebook_login
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

We will log in to Hugging Face so that we can save the fine-tuned model. Note that only the adapter config and the adaptation matrices are saved, not the base model weights.

In [0]:
notebook_login()

# Loading model and preparing config

We'll be working with the falcon-7b-instruct model. We will define two configurations, one for QLORA and another for LORA.

## QLORA config
Neural network weights are often approximately normally distributed. QLORA exploits this to store weights compactly: instead of keeping each weight in 32-bit or 16-bit floating point, it assigns the weight to one of a small number of "bins" according to its value and stores only the bin index. This is quantization. With 4 bits there are 16 bins, and the 4-bit NormalFloat (NF4) data type chooses the bin values to suit normally distributed weights. Double quantization additionally quantizes the per-block quantization constants to save more memory. The quantized base weights stay frozen: during the forward and backward passes they are dequantized on the fly to the compute dtype (bfloat16 in the configuration below), the dequantized values are discarded after use, and only the LORA adapter weights are updated. For more on quantization, see the Hugging Face guide.

With the assistance of bitsandbytes, we'll load our model in 4-bit format using normal float 4 and employ double quantization.
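
The binning idea can be sketched in plain Python. This is an illustration only, not the bitsandbytes implementation: the 16 code values here are built from evenly spaced quantiles of a standard normal distribution, which approximates the idea behind NF4 but is not the exact NF4 table, and the function names are hypothetical. Double quantization (quantizing the per-block scale itself) is not shown.

```python
from statistics import NormalDist

def make_levels(bits=4):
    """Build 2**bits code values from quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    # Midpoint quantiles avoid the infinite values at probabilities 0 and 1.
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]

def quantize(block, levels):
    """Return (absmax scale, list of level indices) for one block of weights."""
    scale = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(levels)), key=lambda k: abs(levels[k] - w / scale))
        for w in block
    ]
    return scale, codes

def dequantize(scale, codes, levels):
    """Map the stored indices back to approximate float weights."""
    return [levels[c] * scale for c in codes]

levels = make_levels(4)
weights = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31]  # made-up example values
scale, codes = quantize(weights, levels)
restored = dequantize(scale, codes, levels)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print([round(r, 3) for r in restored], round(max_error, 4))
```

Each weight is stored as a 4-bit index plus one shared scale per block, and the restored values are close to, but not identical to, the originals.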

In [0]:
# model = "tiiuae/falcon-7b-instruct"
MODEL_NAME = "tiiuae/falcon-7b-instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

The tokenizer's padding token is set to its end-of-sequence (EOS) token. Gradient checkpointing trades computation time for memory. Instead of storing all intermediate activations from the forward pass (as back-propagation normally requires), it stores only a subset of them and recomputes the rest on the fly during the backward pass. Memory consumption drops significantly at the cost of additional computation.
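
The trade-off can be illustrated with a toy chain of scalar layers. This is a sketch under simplifying assumptions (each layer is `tanh(w * x)`, all function names are hypothetical), not how PyTorch implements checkpointing. One version stores every layer input, the other stores only every third input and recomputes the rest one segment at a time.

```python
import math

def forward_layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    """d/dx tanh(w * x), evaluated at the layer's input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_full(ws, x):
    """Store every layer input, then backpropagate through the chain."""
    inputs = []
    for w in ws:
        inputs.append(x)
        x = forward_layer(w, x)
    g = 1.0
    for w, a in zip(reversed(ws), reversed(inputs)):
        g *= layer_grad(w, a)
    return g, len(inputs)

def grad_checkpointed(ws, x, every=2):
    """Store only every `every`-th layer input and recompute the rest per segment."""
    checkpoints = {}
    for i, w in enumerate(ws):
        if i % every == 0:
            checkpoints[i] = x
        x = forward_layer(w, x)
    g = 1.0
    for start in sorted(checkpoints, reverse=True):
        segment = ws[start:start + every]
        # Recompute this segment's inputs from its stored checkpoint.
        a, inputs = checkpoints[start], []
        for w in segment:
            inputs.append(a)
            a = forward_layer(w, a)
        for w, a_in in zip(reversed(segment), reversed(inputs)):
            g *= layer_grad(w, a_in)
    return g, len(checkpoints)

ws = [0.5, -1.2, 0.8, 1.5, -0.3, 0.9]  # made-up layer weights
g_full, stored_full = grad_full(ws, 0.7)
g_ckpt, stored_ckpt = grad_checkpointed(ws, 0.7, every=3)

print(stored_full, stored_ckpt)
```

Both versions produce the same gradient. The checkpointed one keeps fewer activations between the forward and backward passes and pays for it with a second forward computation per segment.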

In [0]:
def print_trainable_parameters(model):
  """
  Prints the number of trainable parameters in the model.
  """
  trainable_params = 0
  all_param = 0
  for _, param in model.named_parameters():
    all_param += param.numel()
    if param.requires_grad:
      trainable_params += param.numel()
  print(
      f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
  )

This step enables gradient checkpointing and prepares the quantized model for training.

In [0]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

## LORA config
Next, we set up the configuration for LORA. For an in-depth understanding, refer to the LORA paper and the Hugging Face blog. When using LORA, we must choose where the adaptation matrices are attached. Here they are attached to query_key_value, the fused query, key, and value projection in each attention block. We must also choose the rank r of the adaptation matrices; we set r to 16. A scaling parameter, alpha, controls the magnitude of the learned update: the update is scaled by alpha/r, so a higher alpha gives the adapter more influence. We use an alpha of 32, which gives a scale of 2. We also set the dropout rate to 5% and designate the task as causal language modeling.
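
The low-rank update and its scaling can be written out with small matrices. This is a minimal sketch in plain Python with toy dimensions and made-up values, not the PEFT implementation. `A` and `B` are the two adaptation matrices, and the effective weight is the frozen weight plus `(alpha / r) * B @ A`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(A, B, alpha, r):
    """Return the low-rank update (alpha / r) * B @ A."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

d_out, d_in, r, alpha = 8, 8, 2, 4  # toy sizes; the notebook uses r=16, alpha=32

# A is (r x d_in) and B is (d_out x r).
A = [[0.1 * (i + j + 1) for j in range(d_in)] for i in range(r)]
# B starts at zero, so the adapted model initially matches the base model.
B_init = [[0.0] * r for _ in range(d_out)]
# Made-up values standing in for B after some training.
B_trained = [[0.5 if i == j else 0.0 for j in range(r)] for i in range(d_out)]

delta_init = lora_delta(A, B_init, alpha, r)
delta_trained = lora_delta(A, B_trained, alpha, r)
delta_double_alpha = lora_delta(A, B_trained, 2 * alpha, r)

full_params = d_out * d_in        # parameters in the frozen weight matrix
lora_params = r * (d_in + d_out)  # parameters in A and B together

print(full_params, lora_params)
print(delta_trained[0][0], delta_double_alpha[0][0])
```

Doubling alpha doubles the update, and the number of trainable values grows with r rather than with the full matrix size.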

In [0]:
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

In [0]:
prompt = """
<human>: So, I'm looking for a winter backpacking adventure in Tokyo,
and I'm not afraid to go alone! I'm hoping to find a hostel to stay in,
and I'm going to book it through an online platform. I'm looking for a private
tour guide who speaks English, and I'm sure I'll have a great time! After all,
what could possibly go wrong with a budget of $5000?
""".strip()
prompt

## Generation config
Next, we will set up the generation configuration for the LLM. This configuration guides the model when generating new text sequences. We limit generation to 200 new tokens and set the temperature to 0.7; values below 1 sharpen the token distribution, and values closer to 0 make generation more deterministic. With "nucleus sampling", the model does not sample from the entire token distribution. Instead, it samples from the smallest set of most probable tokens whose combined probability reaches a threshold, set here at 0.7. This keeps a controlled element of randomness while filtering out very improbable tokens. We also ask for a single output sequence by setting num_return_sequences to 1. Lastly, both the padding token ID and the end-of-sequence (EOS) token ID are set to the tokenizer's EOS token ID, so padding uses the EOS token and generation stops once the EOS token is produced.
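
The effect of temperature and nucleus sampling can be sketched on a handful of made-up logits. This is an illustration in plain Python with hypothetical function names. The Transformers library applies the same ideas through its own logits processors, whose exact tie-breaking and cutoff details may differ.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

def sample(logits, temperature=0.7, top_p=0.7, rng=random):
    """Draw one token index using temperature scaling followed by nucleus filtering."""
    dist = nucleus(softmax_with_temperature(logits, temperature), top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

logits = [1.0, 0.8, 0.5, -1.0, -3.0]  # made-up scores for five tokens
probs_t1 = softmax_with_temperature(logits, 1.0)
probs_t07 = softmax_with_temperature(logits, 0.7)
kept = nucleus(probs_t07, 0.7)

print([round(p, 3) for p in probs_t1])
print([round(p, 3) for p in probs_t07])
print(sorted(kept), sample(logits, rng=random.Random(0)))
```

Lowering the temperature shifts probability toward the top token, and the nucleus step then discards the low-probability tail before sampling.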

In [0]:
generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [0]:
%%time
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
  outputs = model.generate(
      input_ids = encoding.input_ids,
      attention_mask = encoding.attention_mask,
      generation_config = generation_config
  )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [0]:
data = load_dataset("csv", data_files="dataset_clean.csv")

## Prompt template

We're next focusing on the creation of a textual template, which makes use of unstructured text labeled as 'user_input' and structured text labeled as 'user_plan'.

In [0]:
def generate_prompt(data_point):
  return f"""
<user_input>: {data_point["user_input"]}
<user_plan>: {data_point["user_plan"]}
""".strip()


def generate_and_tokenize_prompt(data_point):
  full_prompt = generate_prompt(data_point)
  tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
  return tokenized_full_prompt

data = data["train"].shuffle().map(generate_and_tokenize_prompt)
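
At inference time, a prompt in the same format can be built with the plan left empty so that the model completes it. A minimal sketch follows. The helper name `build_inference_prompt` and the example sentence are illustrative and not part of the notebook.

```python
def build_inference_prompt(user_input: str) -> str:
    """Render the training template with the <user_plan> field left open."""
    return f"""
<user_input>: {user_input}
<user_plan>:
""".strip()

# Made-up request used only to show the rendered prompt.
example = build_inference_prompt("Two of us want a week in Rome in autumn.")
print(example)
```

Keeping the tags identical between training and inference means the fine-tuned model sees the format it was trained on.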

# Training
First and foremost, we need to configure the training parameters. This includes settings related to batch size, learning rate, optimization technique, learning schedule, among others.

With everything set up, we can dive into the actual training. The Transformers library simplifies this process for us, abstracting away most of the complexity.
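
Two of these settings are easy to work out by hand: the effective batch size under gradient accumulation, and the learning rate under a cosine schedule with warmup. The sketch below uses the values that appear in the training arguments in this notebook (batch size 1, 4 accumulation steps, learning rate 2e-4, warmup ratio 0.05). The total step count and the function name `lr_at` are assumptions for illustration.

```python
import math

def lr_at(step, total_steps, base_lr=2e-4, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

per_device_batch, grad_accum = 1, 4
# Gradients from 4 micro-batches of size 1 are summed before each optimizer step.
effective_batch = per_device_batch * grad_accum

total_steps = 1000  # hypothetical number of optimizer steps
schedule = [lr_at(s, total_steps) for s in range(total_steps + 1)]

print(effective_batch)
print(schedule[0], max(schedule), schedule[-1])
```

The learning rate starts at zero, peaks at the configured value once warmup ends, and decays smoothly toward zero by the final step.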

In [0]:
import datetime

# Write each run to its own timestamped directory
now = datetime.datetime.now()
output_dir = f"experiments/{now.strftime('%b%d_%H-%M-%S')}"
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=output_dir,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # the generation cache is not needed during training
trainer.train()


# import torch
# from torch.utils.data import Dataset, DataLoader

# class MyDataset(Dataset):
#     def __init__(self, data):
#         # data is a list of input-output pairs
#         self.inputs = []
#         self.outputs = []
#         for d in data:
#             if 'input' in d and 'output' in d:
#                 self.inputs.append(d['input'])
#                 self.outputs.append(d['output'])
#             else:
#                 # handle missing keys
#                 print('Missing input or output key:', d)
#         self.inputs = torch.tensor(self.inputs)
#         self.outputs = torch.tensor(self.outputs)

#     def __len__(self):
#         return len(self.inputs)

#     def __getitem__(self, idx):
#         input_ids = self.inputs[idx]
#         output_ids = self.outputs[idx]
#         return input_ids, output_ids

# # create dataset object with properly formatted tensors
# dataset = MyDataset(data)
# # create dataloader object for efficient batch processing
# dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

# training_args = transformers.TrainingArguments(
#       per_device_train_batch_size=1,
#       gradient_accumulation_steps=4,
#       num_train_epochs=1,
#       learning_rate=2e-4,
#       fp16=True,
#       save_total_limit=3,
#       logging_steps=1,
#       output_dir="experiments",
#       optim="paged_adamw_8bit",
#       lr_scheduler_type="cosine",
#       warmup_ratio=0.05,
# )

# model.config.use_cache = False

# # Enable gradient accumulation for bigger batches
# model.train()
# accumulation_step = 4
# optimizer = transformers.AdamW(model.parameters(), lr=training_args.learning_rate)

# for epoch in range(training_args.num_train_epochs):
#   for step, batch in enumerate(dataloader):
#     input_ids, output_ids = batch
#     # clear gradients
#     optimizer.zero_grad()
#     loss = model(input_ids, labels=output_ids)
#     loss = loss.mean()
#     # backpropagate
#     loss.backward()
#     if step % accumulation_step == 0:
#         # update weights
#         optimizer.step()
#         # zero out gradients
#         optimizer.zero_grad()
#     if step % training_args.logging_steps == 0:
#         print(f"Epoch {epoch}, Step {step}, Loss {loss.item()}")
import torch
import transformers

# Standalone BERT question-answering toy example. It uses qa_-prefixed names so it
# does not overwrite the Falcon `model`, `tokenizer`, and `data` used in later cells.
class MyModel(transformers.BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # The attribute must be called `bert` so the pretrained weights are matched on load
        self.bert = transformers.BertModel(config)
        self.qa_outputs = torch.nn.Linear(config.hidden_size, 2)
        self.post_init()

    def forward(self, input_ids, token_type_ids=None, attention_mask=None,
                start_positions=None, end_positions=None, **kwargs):
        outputs = self.bert(
            input_ids,
            token_type_ids=token_type_ids,
            attention_mask=attention_mask,
            **kwargs
        )
        sequence_output = outputs[0]
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
        if start_positions is None or end_positions is None:
            return start_logits, end_logits
        # Span loss: mean of the cross-entropy over start and end positions
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = (loss_fct(start_logits, start_positions) + loss_fct(end_logits, end_positions)) / 2
        return loss, start_logits, end_logits

qa_model = MyModel.from_pretrained('bert-base-cased')
qa_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')

qa_training_args = transformers.TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
# Toy examples: one answer span (start, end) per sequence, both inside the sequence
qa_data = [
    {'input_id': [101, 1045, 2031, 1996, 17858, 2004, 2025, 1000, 1234, 1000, 102], 'start_pos': 5, 'end_pos': 7},
    {'input_id': [101, 1045, 2228, 2059, 2507, 2062, 1997, 2049, 102], 'start_pos': 5, 'end_pos': 7},
]
# Pad to a common length so the ids stack into one tensor, and mask out the padding
max_len = max(len(d['input_id']) for d in qa_data)
pad_id = qa_tokenizer.pad_token_id
input_ids = torch.tensor([d['input_id'] + [pad_id] * (max_len - len(d['input_id'])) for d in qa_data])
attention_mask = (input_ids != pad_id).long()
start_positions = torch.tensor([d['start_pos'] for d in qa_data])
end_positions = torch.tensor([d['end_pos'] for d in qa_data])
dataset = torch.utils.data.TensorDataset(input_ids, attention_mask, start_positions, end_positions)
dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=qa_training_args.per_device_train_batch_size, shuffle=True
)

optimizer = torch.optim.AdamW(qa_model.parameters(), lr=5e-5)

qa_model.train()
for epoch in range(int(qa_training_args.num_train_epochs)):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad()
        inputs = {'input_ids': batch[0], 'attention_mask': batch[1],
                  'start_positions': batch[2], 'end_positions': batch[3]}
        outputs = qa_model(**inputs)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        print(f'Epoch {epoch+1}, Step {step+1}, loss = {loss.item()}')

# Saving and loading the model


We can save the model locally using:


In [0]:
model.save_pretrained("trained-model")

It can also be pushed to the Hugging Face Hub:

In [0]:
PEFT_MODEL = "asokraju/finetuned-falcon-7b"

model.push_to_hub(
    PEFT_MODEL, use_auth_token=True
)

In [0]:
PEFT_MODEL = "asokraju/finetuned-falcon-7b"

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer=AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PEFT_MODEL)

In [0]:
generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [0]:
import numpy as np

In [0]:
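%%time

# Pick the row first: `i` must exist before the prompt is built and must be a valid index.
i = int(np.random.randint(len(data)))

# Use the same tags as the training template so the model sees a familiar format.
prompt = f"""
<user_input>: {data["user_input"][i]}
<user_plan>:
""".strip()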

device = "cuda:0"
encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
  outputs = model.generate(
      input_ids = encoding.input_ids,
      attention_mask = encoding.attention_mask,
      generation_config = generation_config
  )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

data["user_input"][i]