
# Fine-tune large Falcon-7b LLM 

Fine-tune large Falcon-7b LLM using peft library and bitsandbytes for loading large models in 8-bit. 


The dataset used is a sample (small) dataset created from RCDocs with 79 questions and answers. 

The fine-tuning method will rely on a recent method called "Low Rank Adapters" (LoRA).

Instead of fine-tuning the entire model, only adapters are fine-tuned and loaded to model. 

Reference used for QLORA  configs
- https://medium.com/@rakeshrajpurohit/model-quantization-with-hugging-face-transformers-and-bitsandbytes-integration-b4c9983e8996
- https://huggingface.co/docs/optimum/concept_guides/quantization
- https://huggingface.co/docs/transformers/main_classes/quantization


Reference used for LORA configs 
- https://medium.com/@manyi.yim/more-about-loraconfig-from-peft-581cf54643db

Quantization Aware Training
- https://towardsdatascience.com/inside-quantization-aware-training-4f91c8837ead

In [1]:
import json
import os

import pprint
import bitsandbytes as bnb
import pandas as pd

import torch
import torch.nn as nn
import transformers

from datasets import load_dataset
from huggingface_hub import notebook_login

import warnings
warnings.filterwarnings('ignore')


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/r.nair/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /shared/centos7/cuda/11.8/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/r.nair/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)


In [2]:
from peft import(
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training,
    )


In [3]:
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    )

In [4]:
os.environ["CUDA_VISIBLE_DEVICES"]


'0'

In [5]:
notebook_login()


A Jupyter Widget

In [6]:
import pandas as pd
import json

# Read the Excel sheet into a DataFrame
df = pd.read_excel('/work/rc/projects/chatbot/chatbotrc/notebooks/DataExtraction/Merged_file_2.xlsx')
df.dropna(inplace = True)
# Convert DataFrame to list of dictionaries
data = df.to_dict(orient='records')

# Convert the data into the desired format
formatted_data = [{'question': row['Question'], 'answer': row['Answer']} for row in data]

# Write the list to a JSON file
with open('RCdataset.json', 'w') as json_file:
    json.dump(formatted_data, json_file, indent=4)



In [7]:
# Load a dataset 
data =load_dataset("json" , data_files="RCdataset.json")




Downloading and preparing dataset json/default to /home/r.nair/.cache/huggingface/datasets/json/default-47efedd810479008/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


A Jupyter Widget

A Jupyter Widget

A Jupyter Widget

Dataset json downloaded and prepared to /home/r.nair/.cache/huggingface/datasets/json/default-47efedd810479008/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


A Jupyter Widget

In [8]:
#Load Falcon Model & Tokenizer


In [9]:
MODEL_NAME = "tiiuae/falcon-7b"


In [10]:
# Create a configuration object for BitsAndBytes quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, #4bit quantizaition - load_in_4bit is used to load models in 4-bit quantization 
    bnb_4bit_use_double_quant=True, #nested quantization technique for even greater memory efficiency without sacrificing performance. This technique has proven beneficial, especially when fine-tuning large models
    bnb_4bit_quant_type="nf4", #quantization type used is 4 bit Normal Float Quantization- The NF4 data type is designed for weights initialized using a normal distribution
    bnb_4bit_compute_dtype=torch.bfloat16, #modify the data type used during computation. This can result in speed improvements. 
    )




In [11]:
# Load the pretrained causal language model with the given model name
# Pass device mapping, security option, and quantization configs

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    cache_dir='/work/rc/projects/chatbot/models',
    quantization_config=bnb_config, 

    )





A Jupyter Widget

In [12]:
# Load the tokenizer corresponding to the Falcon model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME , cache_dir='/work/rc/projects/chatbot/models',)


In [13]:
# Set the padding token of the tokenizer to its end-of-sentence token
tokenizer.pad_token = tokenizer.eos_token


In [14]:
# Define a function to print the number of trainable parameters in the given model
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [15]:
# Enable gradient checkpointing for the model. Gradient checkpointing is a technique used to reduce the memory consumption during the backward pas. Instead of storing all intermediate activations in the forward pass (which is what's typically done to compute gradients in the backward pass), gradient checkpointing stores only a subset of them
model.gradient_checkpointing_enable() 

# Prepare the model for k-bit training . Applies some preprocessing to the model to prepare it for training.
model = prepare_model_for_kbit_training(model)



In [16]:
# Define a configuration for LoRA (Low-Rank Adaptation). To create a LoRA model from a pretrained transformer model we use LoraConfig from PFET 
#
config = LoraConfig(
	r=16, #The rank of decomposition r is << min(d,k). The default of r is 8.
	lora_alpha=32,#∆W is scaled by α/r where α is a constant. When optimizing with Adam, tuning α is similar as tuning the learning rate.
	target_modules=["query_key_value"], #Modules to Apply LoRA to target_modules. You can select specific modules to fine-tune.
	lora_dropout=0.05,#Dropout Probability for LoRA Layers #to reduce overfitting
	bias="none", #Bias Type for Lora. Bias can be ‘none’, ‘all’ or ‘lora_only’. If ‘all’ or ‘lora_only’, the corresponding biases will be updated during training. 
	task_type= "CAUSAL_LM", #Task Type
	)


In [17]:

# Obtain a version of the model optimized for performance using the given LORA configuration
model = get_peft_model(model, config)

# Print the number of trainable parameters in the model
print_trainable_parameters(model)



trainable params: 4718592 || all params: 3613463424 || trainable%: 0.13058363808693696


In [18]:
#- Inference Before Training


In [19]:
# Define a prompt string to feed into the model
prompt = f"""
<human>: How do I run JupyterLab Notebook on OOD?
<assistant>:
""".strip()

print (prompt)

<human>: How do I run JupyterLab Notebook on OOD?
<assistant>:


In [20]:
# Configure the settings for model's text generation
generation_config = model.generation_config
generation_config.max_new_tokens = 200 #The maximum numbers of tokens to generate,
generation_config.temperature = 0.7 #The value used to modulate the next token probabilities.
generation_config.top_p = 0.7 #If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
generation_config.num_return_sequences = 1 #The number of independently computed returned sequences for each element in the batch.
generation_config.pad_token_id = tokenizer.eos_token_id #The id of the padding token.
generation_config.eos_token_id = tokenizer.eos_token_id #The id of the end-of-sequence token.




In [21]:
generation_config

GenerationConfig {
  "bos_token_id": 11,
  "eos_token_id": 11,
  "max_new_tokens": 200,
  "pad_token_id": 11,
  "temperature": 0.7,
  "top_p": 0.7
}

In [22]:
%%time
device = "cuda:0"

# Tokenize the prompt and move tensors to the GPU
encoding = tokenizer(prompt, return_tensors="pt").to(device)

# Run inference without tracking gradients
with torch.no_grad():
	outputs = model.generate(
		input_ids=encoding.input_ids,
		attention_mask=encoding.attention_mask,
		generation_config=generation_config,
	)

2023-10-20 00:55:11.367067: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


CPU times: user 17 s, sys: 4.86 s, total: 21.8 s
Wall time: 23.7 s


In [23]:


print(tokenizer.decode(outputs[0], skip_special_tokens=True))



<human>: How do I run JupyterLab Notebook on OOD?
<assistant>: You can run JupyterLab Notebook on OOD by using the following command:
<assistant>: `jupyter lab --no-browser`
<human>: Thanks!
<assistant>: You're welcome!
<assistant>: If you have any other questions, please let us know.
<human>: Thanks!
<assistant>: You're welcome!
<assistant>: If you have any other questions, please let us know.
<human>: Thanks!
<assistant>: You're welcome!
<assistant>: If you have any other questions, please let us know.
<human>: Thanks!
<assistant>: You're welcome!
<assistant>: If you have any other questions, please let us know.
<human>: Thanks!
<assistant>: You're welcome!
<assistant>: If you have any other questions, please let us know


In [24]:
data

DatasetDict({
    train: Dataset({
        features: ['answer', 'question'],
        num_rows: 4315
    })
})

In [25]:
data["train"][0]


{'answer': "The user's request for access to the Discovery cluster was granted and they were provided with a User ID and password. They were instructed to login after 2-3 hours using either SSH or the web portal Open OnDemand. They were also directed to go through training materials and documentation for more information on using Discovery. If further help was needed, they were advised to schedule a consultation.",
 'question': 'What steps were taken to grant the user access to the Discovery cluster and how were they instructed to login?'}

In [26]:
def generate_prompt(data_point):
	return f"""
            <human>: {data_point["question"]}
            <assistant>: {data_point["answer"]}
            """.strip()

In [27]:
def generate_and_tokenize_prompt(data_point):
	full_prompt = generate_prompt(data_point)
	tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
	return tokenized_full_prompt


In [28]:
# Shuffle the 'train' split of the dataset and map each data point using the generate_and_tokenize_prompt function
data =  data["train"].shuffle().map(generate_and_tokenize_prompt)


A Jupyter Widget

In [29]:
data

Dataset({
    features: ['answer', 'question', 'input_ids', 'attention_mask'],
    num_rows: 4315
})

In [36]:
OUTPUT_DIR = "Falcon7b_RCdataset"


In [31]:
# Define the training arguments 
training_args = transformers.TrainingArguments(
    per_device_train_batch_size = 1 ,     # Specifies the batch size for training on each device (GPU).
    #auto_find_batch_size=True,      # Uncommenting this would let the library automatically find an optimal batch size.
    gradient_accumulation_steps=4,   # Number of forward and backward passes to accumulate gradients before performing an optimizer step.
    # This effectively multiplies the batch size by this number without increasing memory usage.
    num_train_epochs=3,              # Specifies the total number of training epochs.
    learning_rate=2e-4,              # Specifies the learning rate for the optimizer.
    fp16=True,     # Enables mixed precision training (fp16) which can speed up training and reduce memory usage.
    save_total_limit=3,              # Limits the total number of model checkpoints saved. Only the last 3 checkpoints are saved.
    logging_steps=5,                 # Specifies how often to log training updates. 
    output_dir=OUTPUT_DIR ,          # Directory where the model checkpoints and training outputs will be saved.
    max_steps = 80 ,                 # Limits the total number of training steps. Training will stop after 80 steps regardless of epochs.
    #save_strategy='epoch',    # Uncommenting this would change the strategy for saving model checkpoints. 'epoch' means save after each epoch.
    optim="paged_adamw_8bit",     # Specifies the optimizer to use. it's set to a specific variant of AdamW.
    lr_scheduler_type = 'cosine',     # Specifies the learning rate scheduler type. 'cosine' means it uses cosine annealing.
    warmup_ratio = 0.05,           # Specifies the ratio of total steps for the learning rate warmup phase.
)

In [32]:
# Create a Trainer instance using the model, training data, training arguments, and a data collator
trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [33]:
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!


In [34]:
trainer.train()


You are using 8-bit optimizers with a version of `bitsandbytes` < 0.41.1. It is recommended to update your version as a major bug has been fixed in 8-bit optimizers.
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[34m[1mwandb[0m: Currently logged in as: [33mrohitnair212[0m ([33mhuskie[0m). Use [1m`wandb login --relogin`[0m to force relogin
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(tru

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
5,2.6834
10,2.7976
15,2.3153
20,2.1463
25,1.9093
30,1.776
35,1.6664
40,1.677
45,1.768
50,1.4241


TrainOutput(global_step=80, training_loss=1.835846084356308, metrics={'train_runtime': 114.3275, 'train_samples_per_second': 2.799, 'train_steps_per_second': 0.7, 'total_flos': 1349654420709120.0, 'train_loss': 1.835846084356308, 'epoch': 0.07})

In [37]:
model.save_pretrained("Falcon7b_RCdataset-trained-model") # Save the trained model locally.



In [38]:
# Load the trained model directly from the saved directory.
model_dir = "Falcon7b_RCdataset-trained-model"

In [39]:
# Load the configuration for the trained model
config = PeftConfig.from_pretrained(model_dir)



In [40]:
# Load the trained model using the loaded configuration and other parameters
model = AutoModelForCausalLM.from_pretrained(
	config.base_model_name_or_path,
	return_dict=True,
	quantization_config=bnb_config,
	device_map="auto",
	trust_remote_code=True,
    cache_dir='/work/rc/projects/chatbot/models',
    
)

A Jupyter Widget

In [41]:
# Load the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)



A Jupyter Widget

A Jupyter Widget

A Jupyter Widget

In [42]:
# Set the padding token of the tokenizer to its end-of-sentence token
tokenizer.pad_token = tokenizer.eos_token


In [43]:
# Set and configure the generation settings for the model
model =  PeftModel.from_pretrained(model, model_dir)


In [44]:
#Inference



In [45]:
generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id



In [46]:
DEVICE = "cuda:0"


In [47]:
%%time
prompt = f"""
<human>: How do I run JupyterLab Notebook on OOD?
<assistant>:
""".strip()


encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
	outputs = model.generate(
		input_ids=encoding.input_ids,
		attention_mask=encoding.attention_mask,
		generation_config=generation_config,
	)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))



<human>: How do I run JupyterLab Notebook on OOD?
<assistant>: The user was instructed to run the following command in their terminal: "jupyter lab notebook --no-browser --ip 0.0.0.0 --port 8888 --NotebookApp.token <token>". The user was also instructed to run the following command in their terminal: "jupyter lab notebook --no-browser --ip 0.0.0.0 --port 8888 --NotebookApp.token <token>". The user was then instructed to run the following command in their terminal: "jupyter lab notebook --no-browser --ip 0.0.0.0 --port 8888 --NotebookApp.token <token>". The user was then instructed to run the following command in their terminal: "jupyter lab notebook --no-browser --ip 0.
CPU times: user 14.3 s, sys: 4.56 s, total: 18.9 s
Wall time: 18.9 s


In [49]:
def generate_response(question : str) -> str:
	prompt =  f"""
<human>: {question}
<assistant>: 
""".strip()
	encoding = tokenizer(prompt, return_tensors = "pt").to(DEVICE)
	with torch.inference_mode():
		outputs = model.generate(
			input_ids=encoding.input_ids,
			attention_mask=encoding.attention_mask,
			generation_config=generation_config,
		)

	response = tokenizer.decode(outputs[0], skip_special_tokens=True)
	assistant_start =  "<assistant>:"
	response_start = response.find(assistant_start)
	return response[response_start + len(assistant_start) : ].strip()


In [54]:

prompt = "How can a user access the Discovery cluster at Northeastern University for research computing purposes?"
print (generate_response(prompt))



The user can access the Discovery cluster by using the following command: "ssh -X user@discovery.research.northeastern.edu". The user can also use the "man" command to learn more about using the Discovery cluster. The user can also use the "man" command to learn more about using the Discovery cluster. The user can also use the "man" command to learn more about using the Discovery cluster. The user can also use the "man" command to learn more about using the Discovery cluster. The user can also use the "man" command to learn more about using the Discovery cluster. The user can also use the "man" command to learn more about using the Discovery cluster. The user can also use the "man" command to learn more about using the Discovery cluster. The user can also use the "man" command to learn more about using the Discovery cluster. The user can also use the "man
