<a href="https://colab.research.google.com/github/rishabh6936/Thesis_code/blob/main/Falcon_training_tumail.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-tune Falcon-7b on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the Falcon-7b model on a single Google Colab instance and turn it into a chatbot.

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA, for more memory-efficient fine-tuning.

## Setup

Run the cells below to set up and install the required libraries. For this experiment we need `accelerate`, `peft`, `transformers`, `datasets` and `trl`, the last of which provides the [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes), and we install `einops` because it is required to load Falcon models.

In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
PATH = F"/content/gdrive/My Drive/Falcon_FT/Falcon_model1"
PATH_MAIL = F"/content/gdrive/My Drive/TU_mails.pkl"
PATH_MODEL = F"/content/gdrive/My Drive/Falcon_FT/Falcon_model1/checkpoint-250"

In [None]:
import pandas as pd
import torch
from transformers import DefaultDataCollator
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
import pickle
from transformers import AutoTokenizer
from datasets import Dataset
import re
import nltk
from nltk import tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Dataset




In [None]:
from datasets import load_dataset

dataset = pd.read_pickle(PATH_MAIL)
train_datasets = dataset[['Body', 'Subject']][250:20000]
test_datasets = dataset[['Body', 'Subject']][20000:22000]

In [None]:
def convert_to_squad_format(email_record):
    """Turn one email into a question/answer record (a simplified, SQuAD-inspired layout)."""
    body = email_record['Body']
    subject = email_record['Subject']

    # Split the body into sentences with NLTK's sentence tokenizer.
    sentences = tokenize.sent_tokenize(body)

    if len(sentences) < 2:
        raise ValueError("The email body does not contain enough sentences to form a question and an answer.")

    # The question is the first sentence.
    question = sentences[0]

    # The answer is the next (up to) four sentences, joined into a single string.
    answer = ' '.join(sentences[1:5])

    # The title is the email subject.
    return {
        'question': question,
        'answer': answer,
        'title': subject,
    }



def build_qa_frame(emails):
    """Keep emails with at least two sentences and convert them to question/answer records."""
    records = []
    for _, row in emails.iterrows():
        body = row['Body']

        # Missing bodies arrive as float NaN; convert to a string so they can be tokenized
        # (the resulting single "sentence" is then dropped by the filter below).
        if isinstance(body, float):
            body = str(body)

        body = body.strip()

        # Keep only bodies with at least 2 sentences
        if len(tokenize.sent_tokenize(body)) >= 2:
            records.append(convert_to_squad_format({'Body': body, 'Subject': row['Subject']}))

    return pd.DataFrame(records)


# Apply the same filtering and conversion to both splits
train_dataset = Dataset.from_pandas(build_qa_frame(train_datasets))
test_dataset = Dataset.from_pandas(build_qa_frame(test_datasets))
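
To make the conversion concrete, here is a minimal, self-contained sketch of what happens to a single email. It uses a made-up email body and a simple regex splitter as a stand-in for NLTK's `sent_tokenize`, so the splits may differ from NLTK's on real emails. The names `split_sentences` and `to_record` are illustrative only and are not defined elsewhere in this notebook.

```python
import re


def split_sentences(text):
    # Assumption: a naive splitter on ., ! or ? followed by spaces,
    # used here only so the example runs without NLTK.
    return re.split(r'(?<=[.!?]) +', text.strip())


def to_record(body, subject):
    sentences = split_sentences(body)
    if len(sentences) < 2:
        raise ValueError("Need at least two sentences.")
    return {
        'question': sentences[0],            # first sentence
        'answer': ' '.join(sentences[1:5]),  # up to four following sentences
        'title': subject,
    }


# Hypothetical email, invented for illustration
sample_body = (
    "Can you send the report? Yes, it is attached. "
    "Let me know if anything is missing. Thanks! See you Monday. Bye."
)
record = to_record(sample_body, "Report")
print(record['question'])
print(record['answer'])
```

The sixth sentence ("Bye.") is dropped because only sentences two to five are kept for the answer.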


## Loading the model

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM
save_path = F"/content/gdrive/My Drive/Falcon_FT/Falcon_model1/checkpoint-250"

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


Below we define the configuration used to create the LoRA model. According to the QLoRA paper, it is important to adapt all linear layers in the transformer block for maximum performance. We therefore add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules, in addition to the fused `query_key_value` layer.

In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
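
To get a feel for how many parameters this configuration trains, recall that LoRA adds two low-rank matrices to each targeted linear layer, giving `r * (d_in + d_out)` extra weights per layer. The sketch below applies that formula. The layer shapes and the layer count are assumptions about Falcon-7b (hidden size 4544, fused query/key/value output of 4672, 4x MLP expansion, 32 blocks) and should be checked against the loaded model's config before relying on the result.

```python
def lora_param_count(layer_shapes, r):
    """Number of weights LoRA adds for a list of (d_in, d_out) linear layers."""
    return sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)


lora_r = 64
lora_alpha = 16

# Assumed shapes of the four targeted layers in one transformer block
HIDDEN = 4544
block_shapes = [
    (HIDDEN, 4672),        # query_key_value (assumed fused output size)
    (HIDDEN, HIDDEN),      # dense
    (HIDDEN, 4 * HIDDEN),  # dense_h_to_4h
    (4 * HIDDEN, HIDDEN),  # dense_4h_to_h
]
NUM_BLOCKS = 32  # assumption

per_block = lora_param_count(block_shapes, lora_r)
total_trainable = per_block * NUM_BLOCKS

# LoRA scales the low-rank update by alpha / r
scaling = lora_alpha / lora_r

print(f"LoRA weights per block: {per_block:,}")
print(f"LoRA weights in total:  {total_trainable:,}")
print(f"LoRA scaling factor:    {scaling}")
```

After the trainer has been created, calling `print_trainable_parameters()` on the PEFT-wrapped model is a more reliable way to obtain the actual count.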

## Loading the trainer

Here we use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [None]:
from transformers import TrainingArguments

output_dir = PATH
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 500
logging_steps = 50
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 2500
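warmup_ratio = 0.03
# Note: the "constant" schedule applies no warmup, so warmup_ratio has no effect here;
# "constant_with_warmup" would be needed for the warmup to be used.
lr_scheduler_type = "constant"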

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
)
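
These settings determine how much data the run actually sees. The short calculation below works out the effective batch size and the approximate number of passes over the training set. It assumes a single GPU and takes the training-set size of 19,436 examples from the `Map` progress line printed when the trainer is created in this notebook.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 2500

num_devices = 1             # assumption: a single Colab GPU
num_train_examples = 19436  # from the trainer's "Map" output in this notebook

# Sequences contributing to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total sequences processed over the whole run
sequences_seen = effective_batch_size * max_steps

# Approximate number of passes over the training set
approx_epochs = sequences_seen / num_train_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Sequences seen:       {sequences_seen}")
print(f"Approximate epochs:   {approx_epochs:.2f}")
```

With these values the run covers the training set about twice.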

Then, finally, pass everything to the trainer.

In [None]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="answer",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)
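
Note that with `dataset_text_field="answer"` the trainer reads only the `answer` column, so `question` and `title` are not part of the training text. If one wanted the model to see the question as well, one option would be to merge the columns into a single text column beforehand. The sketch below is a hypothetical illustration, not what this notebook does: `format_example`, the `text` key and the `### Question:` / `### Answer:` template are all assumptions.

```python
def format_example(example):
    """Merge question and answer into one training string (hypothetical template)."""
    text = (
        f"### Question: {example['question']}\n"
        f"### Answer: {example['answer']}"
    )
    return {'text': text}


# Made-up record in the same layout as the converted emails
sample = {
    'question': "Can you send the report?",
    'answer': "Yes, it is attached.",
    'title': "Report",
}
formatted = format_example(sample)
print(formatted['text'])
```

With a `datasets.Dataset`, such a function could be applied via `train_dataset.map(format_example)`, and the trainer would then be pointed at the new column with `dataset_text_field="text"`.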


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/19436 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


We also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [None]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() converts the module's parameters in place, so no reassignment is needed.
        module.to(torch.float32)


## Train the model

Now let's train the model! Simply call `trainer.train()`

In [None]:
trainer.train()







| Step | Training Loss |
|------|---------------|
| 50   | 2.2013        |
| 100  | 2.0322        |
| 150  | 1.9017        |
| 200  | 1.8334        |
| 250  | 1.797         |
| 300  | 1.7947        |
| 350  | 1.7708        |
| 400  | 1.7697        |
| 450  | 1.6746        |
| 500  | 1.6795        |
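
One way to read these numbers is to convert them to perplexity. The sketch below does this under the assumption that the logged value is the mean per-token cross-entropy in nats, which is the usual convention for causal language-model training; perplexity is then `exp(loss)`. The loss values are copied from the log above.

```python
import math

# Training loss by step, copied from the log above
losses = {
    50: 2.2013, 100: 2.0322, 150: 1.9017, 200: 1.8334, 250: 1.797,
    300: 1.7947, 350: 1.7708, 400: 1.7697, 450: 1.6746, 500: 1.6795,
}

# Assumption: loss is mean cross-entropy in nats, so perplexity = exp(loss)
perplexities = {step: math.exp(loss) for step, loss in losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step:>3}: loss {losses[step]:.4f} -> perplexity ~{ppl:.2f}")
```

These are training-set figures only; evaluating on `test_dataset` would be needed to judge generalization.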


