<a href="https://colab.research.google.com/github/shouraykumra/LLM/blob/FineTuningLLM/FineTuning_using_QLora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-Tuning

Fine-tuning means taking an existing pretrained model and continuing its training on a dataset of your choice so that it suits a specific use case.

## Use Case
In this scenario, we use a question-answer dataset built from YouTube comments (`shawhin/shawgpt-youtube-comments`). The model is prompted with a question or comment and is expected to produce a matching answer. Prompt engineering supplies the instructions that shape the tone and length of that answer.

We will use the QLoRA fine-tuning method, which combines quantization with LoRA. QLoRA makes it feasible to fine-tune a large model on limited hardware while largely preserving its performance.

Quantization maps continuous values onto a small set of discrete buckets. QLoRA stores each weight in 4 bits, which gives 16 buckets (pow(2,4)). Because model weights are not uniformly distributed (most lie close to zero), the buckets are not equally wide; they are placed so that each one holds roughly the same number of weights. The final component of QLoRA, LoRA, freezes the quantized base parameters and trains a relatively small number of newly added parameters.
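
The bucket idea can be illustrated with a minimal, standard-library-only sketch. It is a simplification and not the actual 4-bit scheme used by bitsandbytes: it assumes the weights follow a standard normal distribution, places 16 levels at evenly spaced quantiles of that distribution, and maps each weight to its nearest level. All function names here are hypothetical.

```python
import random
from statistics import NormalDist

def quantile_levels(num_bits=4):
    """Return 2**num_bits levels at evenly spaced quantiles of N(0, 1)."""
    n = 2 ** num_bits
    normal = NormalDist()
    # Use the midpoint of each of the n equal-probability intervals.
    return [normal.inv_cdf((i + 0.5) / n) for i in range(n)]

def quantize(weights, levels):
    """Map each weight to the index (a 4-bit code) of its nearest level."""
    return [min(range(len(levels)), key=lambda i: abs(levels[i] - w))
            for w in weights]

def dequantize(codes, levels):
    """Recover approximate weights by looking up each code's level."""
    return [levels[c] for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]  # toy "weights"

levels = quantile_levels(4)
codes = quantize(weights, levels)
restored = dequantize(codes, levels)

# How many weights landed in each of the 16 buckets, and how much was lost.
counts = [codes.count(i) for i in range(len(levels))]
mean_abs_error = sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)
print("levels:", len(levels))
print("bucket counts:", counts)
print("mean absolute error:", mean_abs_error)
```

The levels are packed densely near zero, where most weights lie, and spread out in the tails, so the bucket counts come out roughly (not exactly) balanced.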


In [1]:
%%capture
# [1] The AutoGPTQ library, integrated with Hugging Face Transformers, addresses resource-intensive LLM training and deployment challenges.
!pip install auto-gptq # runs LLMs quantized with the GPTQ algorithm, which brings the memory savings
!pip install optimum # an extension of transformers that provides performance optimization tools to train and run models with maximum efficiency
!pip install bitsandbytes # Python wrapper used for 8-bit optimizers, matrix multiplication, and 8-bit and 4-bit quantization
!pip install pynvml
!pip install accelerate
!git clone https://github.com/shaankhosla/NLP_with_LLMs/
%cd "NLP_with_LLMs"

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import transformers
import gpu_utilities, generate_data

In [3]:
# #loading the model without unsloth and unsmashed version of the model.
# model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
# model = AutoModelForCausalLM.from_pretrained(model_name,
#                                              device_map = 'auto', #for best use of CPU or GPU.
#                                              trust_remote_code=False, #not to run custom files on local machine.
#                                              revision="main") #version of the model.

In [4]:
gpu_utilities.print_gpu_utilization()

Device 0 : Tesla T4
GPU memory occupied: 260 MB.


In [5]:
# from_pretrained() returns the correct tokenizer class instance based on the model_type property of the config object or, when it is missing, by pattern matching on the model name or path.
# tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True) #setting use_fast = True makes the transformer load fast version of tokenizer.

In [6]:
#loading the model without unsloth but with smashed version of the model.
model_name = 'PrunaAI/mistralai-Mistral-7B-Instruct-v0.2-bnb-4bit-smashed'
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map = 'auto', #for best use of CPU or GPU.
                                             trust_remote_code=False, #not to run custom files on local machine.
                                             revision="main") #version of the model.

Unused kwargs: ['quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


In [7]:
gpu_utilities.print_gpu_utilization()

Device 0 : Tesla T4
GPU memory occupied: 5290 MB.


In [8]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

In [9]:
#base model
model.eval() # model in evaluation mode (dropout modules are deactivated)

# craft prompt
comment = "Great content, thank you!"
prompt=f'''[INST] {comment} [/INST]'''

# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# generate output
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] Great content, thank you! [/INST] I'm glad you found the content helpful! If you have any specific questions or topics you'd like me to cover in the future, please don't hesitate to ask. I'm here to provide accurate and informative responses to help answer your questions and support your learning journey.

In the meantime, if you have any general feedback or suggestions for future content, I'd be happy to hear from you. Your input helps me create content that is relevant, engaging, and valuable to you.

Thank you for being a part of this community, and I look forward to continuing to provide you with high-quality content. If you have any other questions or topics


In [10]:
instructions_string = """ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

prompt = prompt_template(comment)
print(prompt)


[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Great content, thank you! 
[/INST]


In [11]:
gpu_utilities.print_gpu_utilization()

Device 0 : Tesla T4
GPU memory occupied: 5650 MB.


In [12]:
%%time
# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# generate output
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Great content, thank you! 
[/INST] Thank you for your kind words! I'm glad you're finding the content helpful. If you have any questions or if there's a specific topic you'd like me to cover in more detail, just let me know –ShawGPT.</s>
CPU times: user 4.31 s, sys: 242 ms, total: 4.56 s
Wall time: 4.56 s


In [13]:
gpu_utilities.print_gpu_utilization()

Device 0 : Tesla T4
GPU memory occupied: 5660 MB.


### Gradient checkpointing
Gradient checkpointing saves only a strategically selected subset of activations during the forward pass; the remaining activations are recomputed from the nearest saved one when the gradients are computed in the backward pass.
This technique works well because there is no need to keep all the activations from the forward pass in memory, which would create a large memory overhead. The cost is some extra computation during the backward pass.
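
To make the trade-off concrete, here is a toy, pure-Python sketch of the idea on a chain of scalar `tanh` layers. It is not how PyTorch or Transformers implement checkpointing; the layer function, the segment size, and all names are assumptions chosen for illustration. The checkpointed version keeps only every fourth layer input and recomputes the rest segment by segment during the backward pass.

```python
import math

def forward_layer(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)

def backward_layer(w, x, grad_out):
    """Chain rule through tanh(w * x) with respect to the input x."""
    y = math.tanh(w * x)
    return grad_out * (1.0 - y * y) * w

def grad_full_storage(weights, x0):
    """Store every layer input during the forward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = backward_layer(w, x_in, grad)
    return grad, len(inputs)

def grad_checkpointed(weights, x0, segment=4):
    """Store only every `segment`-th input; recompute the rest when needed."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = forward_layer(w, x)
    grad, peak_stored = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Recompute the inputs of this segment from its checkpoint.
        seg_inputs, x = [], checkpoints[start]
        for i in range(start, end):
            seg_inputs.append(x)
            x = forward_layer(weights[i], x)
        peak_stored = max(peak_stored, len(checkpoints) + len(seg_inputs))
        for i in reversed(range(start, end)):
            grad = backward_layer(weights[i], seg_inputs[i - start], grad)
    return grad, peak_stored

weights = [0.5 + 0.05 * i for i in range(16)]
grad_a, stored_a = grad_full_storage(weights, 0.3)
grad_b, stored_b = grad_checkpointed(weights, 0.3, segment=4)
print(grad_a, stored_a)
print(grad_b, stored_b)
```

Both functions return the same gradient, but the checkpointed version holds fewer values at any one time, at the price of running each layer's forward computation a second time.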

In [14]:
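# Prepare the model for training
model.train() # model in training mode (dropout modules are activated)
model.gradient_checkpointing_enable() # recompute activations in the backward pass to save memory
model = prepare_model_for_kbit_training(model) # prepare the quantized model for k-bit training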

### LoRA
Low-Rank Adaptation (LoRA) is a PEFT method that keeps the original weight matrices frozen and represents the update to each targeted matrix (typically in the attention layers) as the product of two much smaller low-rank matrices. This drastically reduces the number of parameters that need to be fine-tuned. [3]

Rank 'r' sets the size of the adaptation matrices and therefore the number of trainable parameters -- the more parameters, the better the adapter remembers, and the more complex things it can pick up. Whether this falls into the overfitting regime is debated, but if you have a lot of data covering a variety of types of knowledge and tasks, then you'll need a higher rank for that information to "stick". [4]

'Alpha' is a scaling factor -- the adapter's update is multiplied by alpha / r before it is added to the base model's weights. Higher alpha means the LoRA layers act more strongly on the base model. [4]
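
The roles of rank and alpha can be seen in a small, pure-Python sketch of the usual LoRA formulation, where the effective weight is `W + (alpha / r) * B @ A`. The matrix sizes and values below are toy assumptions, and the helper names are hypothetical; `B` starts at zero, so the adapter initially leaves the base model unchanged.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha, r):
    """Frozen weight W plus the scaled low-rank update (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

d_out, d_in, r, alpha = 8, 8, 2, 8  # toy sizes, far smaller than a real layer
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]      # frozen
A = [[0.01 * (i + 1) * (j + 1) for j in range(d_in)] for i in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: no change at the start

W_start = lora_effective_weight(W, A, B, alpha, r)

B[0][0] = 1.0  # pretend training has changed one entry of B
W_after = lora_effective_weight(W, A, B, alpha, r)

full_params = d_out * d_in        # parameters in the full matrix
lora_params = r * (d_in + d_out)  # parameters in A and B together
print(full_params, lora_params)
```

With a fixed rank, raising alpha raises the scale applied to the same `B @ A` product, which is why a higher alpha makes the adapter act more strongly.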

In [15]:
#LoRA config
#why fp16 is used during training - [2]
config = LoraConfig(
    r=8, #rank of the two low-rank update matrices; a higher rank adds trainable parameters, so the adapter can remember more.
    lora_alpha=32,
    target_modules=["q_proj"], #adapt only the query projection (a linear layer); if the dataset is relatively small, one or two linear layers are enough, otherwise we risk overfitting.
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
# LoRA trainable version of model
model = get_peft_model(model, config)

# trainable parameter count
model.print_trainable_parameters()

trainable params: 2,097,152 || all params: 7,243,829,248 || trainable%: 0.028950875679172275
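
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes that each `q_proj` layer in Mistral-7B maps 4096 inputs to 4096 outputs and that the model has 32 transformer blocks; these two numbers are assumptions not stated in this notebook. The total parameter count is taken from the output above.

```python
hidden_size = 4096   # assumption: q_proj input and output width
num_layers = 32      # assumption: number of transformer blocks
r = 8                # LoRA rank from the config above

# Each adapted layer adds A (r x hidden_size) and B (hidden_size x r).
params_per_layer = r * hidden_size + hidden_size * r
trainable_params = params_per_layer * num_layers

all_params = 7_243_829_248  # "all params" as reported in the output above
trainable_percent = 100 * trainable_params / all_params

print(trainable_params)   # 2097152
print(trainable_percent)
```

Under these assumptions the result matches the reported 2,097,152 trainable parameters, roughly 0.029% of the model.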


In [16]:
data = load_dataset('shawhin/shawgpt-youtube-comments')

Generating train split:   0%|          | 0/50 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/9 [00:00<?, ? examples/s]

In [21]:
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["example"]

    #tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

# tokenize training and validation datasets
tokenized_data = data.map(tokenize_function, batched=True)

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Map:   0%|          | 0/9 [00:00<?, ? examples/s]

### Why set a padding token
Sequences in a batch must all have the same length, so shorter ones are padded, and the tokenizer needs a padding token to do this. Here the end-of-sequence (EOS) token is reused as the padding token. In some models and contexts, it can be beneficial to treat the end of a sequence and padding as the same, especially in models where the distinction between padding and the end of a sequence is not crucial or might interfere with how the model processes the text. For example, in some generative models, having distinct padding and EOS tokens might lead to unwanted behavior during text generation, and using the same token for both can mitigate such issues. [5]

In [18]:
# setting pad token
tokenizer.pad_token = tokenizer.eos_token #This means that the same token will be used both to pad shorter sequences and to signify the end of a sequence.
# data collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
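
As a rough illustration of what padding with the EOS token means for a batch, the pure-Python sketch below pads two sequences to the same length and builds labels in which padded positions are ignored by the loss. It is a simplified stand-in for what a causal-LM data collator does, not the library code itself. The token ids are made up, padding is applied on the right for simplicity, and the function names are hypothetical.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss

def pad_batch(sequences, pad_token_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(s + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return input_ids, attention_mask

def make_labels(input_ids, pad_token_id):
    """Copy the inputs as labels, replacing pad-token positions with IGNORE_INDEX."""
    return [[IGNORE_INDEX if t == pad_token_id else t for t in row]
            for row in input_ids]

eos_token_id = 2  # the EOS id shown in the generation warning earlier
batch = [[1, 10, 11, 12, 2], [1, 10, 2]]  # made-up ids, each ending in EOS

input_ids, attention_mask = pad_batch(batch, pad_token_id=eos_token_id)
labels = make_labels(input_ids, pad_token_id=eos_token_id)

print(input_ids)       # [[1, 10, 11, 12, 2], [1, 10, 2, 2, 2]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
print(labels)          # [[1, 10, 11, 12, -100], [1, 10, -100, -100, -100]]
```

One consequence worth noticing in this sketch: because padding and EOS share an id, the genuine EOS at the end of each sequence is also excluded from the labels.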

In [23]:
# hyperparameters
lr = 2e-4
batch_size = 4
num_epochs = 10

# define training arguments
training_args = transformers.TrainingArguments(
    output_dir= "/content/YT_Results",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    fp16=True,
    optim="paged_adamw_8bit",

)
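
The batch size and gradient accumulation settings above combine into an effective batch size. The short calculation below assumes training on a single device and uses the 50-example train split reported when the dataset was loaded.

```python
import math

batch_size = 4                    # per_device_train_batch_size
gradient_accumulation_steps = 4   # batches accumulated before each weight update
num_train_examples = 50           # size of the train split

# Examples that contribute to each weight update (single-device assumption).
effective_batch_size = batch_size * gradient_accumulation_steps

# Forward/backward passes needed to see every training example once.
micro_batches_per_epoch = math.ceil(num_train_examples / batch_size)

print(effective_batch_size)     # 16
print(micro_batches_per_epoch)  # 13
```

Gradient accumulation gives the optimizer the smoother gradient of a 16-example batch while only 4 examples at a time have to fit in GPU memory.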

In [24]:
# configure trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    args=training_args,
    data_collator=data_collator
)


# train model
model.config.use_cache = False  # disable the key/value cache during training to silence the warnings. Please re-enable for inference!
trainer.train()

# re-enable the cache for inference
model.config.use_cache = True



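| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0 | 4.5563 | 3.926272 |
| 1 | 4.0352 | 3.424262 |
| 2 | 3.4577 | 2.972856 |
| 4 | 2.6253 | 2.267962 |
| 5 | 2.2751 | 2.036241 |
| 6 | 2.0117 | 1.87391 |
| 8 | 1.749 | 1.682974 |
| 9 | 1.2249 | 1.677719 |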




In [16]:
#after training, the model should be uploaded to a location from which it can later be fetched with 'from_pretrained'


In [26]:
hf_name = 'shouray' # your hf username or org name
model_id = hf_name + "/" + "youtube-comments"

In [27]:
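model.push_to_hub(model_id) # upload the LoRA adapter to the model_id repo
trainer.push_to_hub(model_id) # note: the first argument of Trainer.push_to_hub is the commit message; the target repo is derived from output_dir (YT_Results)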

CommitInfo(commit_url='https://huggingface.co/shouray/YT_Results/commit/956ed109e4c5514614323acfdf7228977a0a8cd3', commit_message='shouray/youtube-comments', commit_description='', oid='956ed109e4c5514614323acfdf7228977a0a8cd3', pr_url=None, pr_revision=None, pr_num=None)

In [29]:
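from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM
# adapter configuration (includes the name of the base model the adapter was trained on)
config = PeftConfig.from_pretrained("shouray/YT_Results")
# attach the fine-tuned adapter; in a fresh session, `model` should first be the base model loaded with AutoModelForCausalLM.from_pretrained
model = PeftModel.from_pretrained(model, "shouray/YT_Results")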

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

References:
1. https://www.marktechpost.com/2023/08/26/meet-autogptq-an-easy-to-use-llms-quantization-package-with-user-friendly-apis-based-on-gptq-algorithm/
2. https://www.quora.com/What-is-the-difference-between-FP16-and-FP32-when-doing-deep-learning
3. https://huggingface.co/docs/peft/en/package_reference/lora#peft.LoraConfig.lora_alpha
4. https://www.reddit.com/r/LocalLLaMA/comments/17pw7bv/eternal_question_what_rank_r_and_alpha_to_use_in/
5. https://www.linkedin.com/pulse/demystifying-tokenization-preparing-data-large-models-rany-2nebc#:~:text=tokenizer.,the%20end%20of%20a%20sequence.