# Fine-Tuning LLMs with Hugging Face

## Step 1: Installing and importing the libraries

In [1]:
!pip uninstall accelerate peft bitsandbytes transformers trl -y
!pip install -q accelerate peft==0.13.2 bitsandbytes transformers trl==0.12.0

Found existing installation: accelerate 1.3.0
Uninstalling accelerate-1.3.0:
  Successfully uninstalled accelerate-1.3.0
Found existing installation: peft 0.13.2
Uninstalling peft-0.13.2:
  Successfully uninstalled peft-0.13.2
Found existing installation: bitsandbytes 0.45.1
Uninstalling bitsandbytes-0.45.1:
  Successfully uninstalled bitsandbytes-0.45.1
Found existing installation: transformers 4.48.2
Uninstalling transformers-4.48.2:
  Successfully uninstalled transformers-4.48.2
Found existing installation: trl 0.12.0
Uninstalling trl-0.12.0:
  Successfully uninstalled trl-0.12.0


In [2]:
!pip install -q huggingface_hub

In [3]:
import torch
from trl import SFTTrainer
from peft import LoraConfig
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline)
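
Since the install cell pins `peft==0.13.2` and `trl==0.12.0`, it can be worth confirming which versions are actually importable before continuing. The sketch below uses only the standard-library `importlib.metadata`; the helper name `installed_versions` is illustrative and not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages installed in the cells above.
for name, ver in installed_versions(
    ["accelerate", "peft", "bitsandbytes", "transformers", "trl"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

If a version differs from the pinned one, restarting the runtime after the install cell is usually the first thing to try.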

## Step 2: Loading the model

In [4]:
llama_model = AutoModelForCausalLM.from_pretrained(
    "aboonaji/llama2finetune-v2",
    # Pass the quantization settings through BitsAndBytesConfig; the bare
    # load_in_4bit / load_in_8bit keyword arguments are deprecated.
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
    ),
)
# The generation cache is not needed during training.
llama_model.config.use_cache = False
llama_model.config.pretraining_tp = 1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`low_cpu_mem_usage` was None, now default to True since model is quantized.


Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
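
To get a feel for why the model is loaded in 4-bit, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count of 7 billion is an assumption (the notebook does not state the model size), and the estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# ASSUMPTION: 7e9 parameters; the notebook does not state the model size.
num_params = 7_000_000_000

fp16_gib = weight_memory_gib(num_params, 16)
nf4_gib = weight_memory_gib(num_params, 4)
print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the float16 footprint, which is what makes a single Colab GPU workable.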

## Step 3: Loading the tokenizer

In [5]:
llama_tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="aboonaji/llama2finetune-v2",
    trust_remote_code=True,
)
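# Reuse the end-of-sequence token as the padding token.
llama_tokenizer.pad_token = llama_tokenizer.eos_token
# Append padding after the real tokens rather than before them.
llama_tokenizer.padding_side = "right"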
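
What right padding does to a batch can be shown with plain lists. This is only an illustration: the token IDs are made up, `pad_right` is a hypothetical helper, and the real tokenizer performs this step internally.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the longest length in the batch.

    Returns (padded_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    max_len = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


# ASSUMPTION: illustrative token IDs; 2 stands in for the EOS/pad ID.
ids, attention_mask = pad_right([[1, 15, 27, 9], [1, 42]], pad_id=2)
print(ids)
print(attention_mask)
```

Because the pad token is the same as EOS here, the attention mask is what tells the model which trailing tokens to ignore.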

## Step 4: Setting the training arguments

In [6]:
training_arguments = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    max_steps=100)

## Step 5: Creating the Supervised Fine-Tuning trainer

In [7]:
llama_sft_trainer = SFTTrainer(model=llama_model,
                               args=training_arguments,
                               train_dataset=load_dataset(path="aboonaji/wiki_medical_terms_llam2_format", split="train"),
                               processing_class=llama_tokenizer,
                               peft_config=LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM"),
                               dataset_text_field="text",
                               max_seq_length=512)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/6861 [00:00<?, ? examples/s]
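
The `LoraConfig` above sets `r=8` and `lora_alpha=32`. LoRA trains two small rank-`r` matrices per adapted weight and scales their product by `lora_alpha / r`. The arithmetic below sketches how few parameters that is; the 4096 x 4096 layer shape is an assumption chosen for illustration, not something the notebook states.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters in the two LoRA factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


r, lora_alpha = 8, 32          # values from the LoraConfig above
d_out = d_in = 4096            # ASSUMPTION: illustrative projection size

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, r)
scaling = lora_alpha / r       # factor applied to the low-rank update

print(f"full weight:  {full:,} params")
print(f"LoRA factors: {lora:,} params ({100 * lora / full:.2f}% of full)")
print(f"scaling:      {scaling}")
```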

## Step 6: Training the model

In [8]:
torch.cuda.empty_cache()

In [9]:
llama_sft_trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mshgyg99[0m ([33mshgyg99_[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.




TrainOutput(global_step=100, training_loss=1.4148019409179688, metrics={'train_runtime': 805.6619, 'train_samples_per_second': 0.248, 'train_steps_per_second': 0.124, 'total_flos': 3831253993881600.0, 'train_loss': 1.4148019409179688, 'epoch': 0.02914602156805596})
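
The figures in `TrainOutput` can be reproduced from the run settings: 100 steps at a per-device batch size of 2 over the 6861 examples reported when the dataset was mapped. The sketch assumes a single device and no gradient accumulation (the `TrainingArguments` default), which is consistent with the reported values.

```python
import math

dataset_size = 6861        # examples in the train split
batch_size = 2             # per_device_train_batch_size
max_steps = 100
train_runtime = 805.6619   # seconds, from TrainOutput

# ASSUMPTION: one device, gradient_accumulation_steps=1 (the default).
steps_per_epoch = math.ceil(dataset_size / batch_size)
epoch_fraction = max_steps / steps_per_epoch
samples_per_second = max_steps * batch_size / train_runtime
steps_per_second = max_steps / train_runtime

print(f"steps per epoch: {steps_per_epoch}")
print(f"epoch: {epoch_fraction:.5f}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

In other words, this run covers about 3% of one pass over the dataset, so the reported loss reflects a very short fine-tune.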

## Step 7: Chatting with the model

In [10]:
user_prompt = "please tell me about Acromegaly"
text_generation_pipeline = pipeline(
    task="text-generation",
    model=llama_model,
    tokenizer=llama_tokenizer,
    max_length=300)
model_answer = text_generation_pipeline(f'<s> [INST] {user_prompt} [/INST]')
print(model_answer[0]["generated_text"])

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<s> [INST] please tell me about Acromegaly [/INST]  Acromegaly is a rare hormonal disorder that results from an excess of growth hormone (GH) in the body. hopefully, this information helps.

Acromegaly is a rare hormonal disorder that results from an excess of growth hormone (GH) in the body. This excess GH stimulates the liver to produce too much insulin-like growth factor 1 (IGF-1), which causes cells throughout the body to grow and divide excessively, leading to enlargement of various tissues, including the hands, feet, face, and organs.

Causes

Acromegaly is caused by a benign tumor in the pituitary gland or hypothalamus that produces too much GH. This can occur in both children and adults. The tumor may be a prolactinoma (a benign tumor of the pituitary gland that produces too much prolactin), a pituitary adenoma (a benign tumor of the pituitary gland), or a hypothalamic tumor.

Symptoms

The symptoms of acromegaly vary from person to person, but they typically include:

* Enlarg
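
The pipeline returns the prompt followed by the completion, and `max_length=300` counts both, which is likely why the answer above stops mid-word. Below is a minimal sketch of building the prompt and keeping only the answer. `build_prompt` and `extract_answer` are hypothetical helpers using plain string handling, and `fake_output` is a stand-in rather than real model output.

```python
def build_prompt(user_prompt):
    """Wrap a user message in the [INST] ... [/INST] template used above."""
    return f"<s> [INST] {user_prompt} [/INST]"


def extract_answer(generated_text):
    """Return only the text after the last [/INST] marker."""
    marker = "[/INST]"
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


prompt = build_prompt("please tell me about Acromegaly")
# Stand-in for model_answer[0]["generated_text"]; not real model output.
fake_output = prompt + "  Acromegaly is a rare hormonal disorder."
print(extract_answer(fake_output))
```

With the real pipeline, the same `extract_answer` call could be applied to `model_answer[0]["generated_text"]`.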