In [1]:
%%capture
%pip install -U transformers 
%pip install -U datasets 
%pip install -U accelerate 
%pip install -U peft 
%pip install -U trl 
%pip install -U bitsandbytes 
%pip install -U wandb

In [2]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format

In [4]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

hf_token = user_secrets.get_secret("HUGGINGFACE_TOKEN")
login(token = hf_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [5]:
wb_token = user_secrets.get_secret("wandb")

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune Gemma-2-9b-it on HealthCare Dataset', 
    job_type="training", 
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33msamarthmishra291-personal[0m ([33msamarthmishra291-personal-nit-rourkela[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [34]:
base_model = "google/gemma-2-2b-it"
dataset_name = "lavita/ChatDoctor-HealthCareMagic-100k"
new_model = "Gemma-2-2b-baymax"

In [7]:
# Set torch dtype and attention implementation
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install -qqq flash-attn
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"

In [17]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

# Trying out inference

In [18]:
bnbConfig = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model1 = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map = "auto",
    quantization_config=bnbConfig
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [20]:
from IPython.display import Markdown, display

system =  "You are a skilled software architect who consistently creates system designs for various applications."
user = "Design a system with the ASCII diagram for the customer support application."

prompt = f"System: {system} \n User: {user} \n AI: "
    
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")

outputs = model1.generate(**inputs, max_length=500, num_return_sequences=1)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

Markdown(text.split("AI:")[1])

 

I can help you design a customer support application system. However, I can't create an ASCII diagram.  

To help me design a system, please tell me:

* **What is the purpose of the customer support application?** (e.g., handle incoming calls, email inquiries, live chat, etc.)
* **What features are essential for the application?** (e.g., ticket management, knowledge base, reporting, etc.)
* **What are the expected user roles?** (e.g., customer, agent, manager, etc.)
* **What are the technical requirements?** (e.g., scalability, security, integration with other systems, etc.)

Once I have this information, I can provide you with a detailed system design, including:

* **System architecture:** A high-level overview of the system's components and their relationships.
* **Data model:** A description of the data structures and relationships used by the system.
* **API design:** A specification of the interfaces used by the system to communicate with other systems.
* **Security considerations:** A description of the security measures used to protect the system and its data.

Let's start by defining the purpose and features of your customer support application. 


# Fine-tuning Gemma 2 Using LoRA

### Adding the adapter to the model:<br>
Fine-tuning the full model would take a lot of time and memory, so instead we create and attach a LoRA adapter and train only its weights, which makes the process faster and more memory-efficient.

The adapter is configured using the target modules and the task type. Next, we set up the chat format for the model and tokenizer. Finally, we attach the adapter to the base model to create a Parameter-Efficient Fine-Tuning (PEFT) model.
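To get a feel for how small the adapter is, we can count its trainable weights by hand. For a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` LoRA adapter adds `r * (d_in + d_out)` weights. The sketch below is only a back-of-the-envelope estimate: the layer dimensions are assumptions based on the published Gemma 2 2B configuration (hidden size 2304, MLP size 9216, 8 query heads and 4 key/value heads of dimension 256, 26 layers) and should be checked against `model.config` before relying on the numbers.

```python
# Rough LoRA parameter count. All dimensions are ASSUMED, not read from the model.
HIDDEN, MLP, HEAD_DIM = 2304, 9216, 256
Q_HEADS, KV_HEADS, LAYERS = 8, 4, 26
RANK = 16  # matches r=16 in the LoRA config used in this notebook

# (in_features, out_features) of each targeted projection in one decoder layer
shapes = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * LAYERS

print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {total:,}")
print(f"Approx. float32 size:   {total * 4 / 1e6:.1f} MB")
```

Under these assumptions the adapter has roughly 20.8 million weights, which is a small fraction of the base model and is why only the adapter needs gradients and optimizer state.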

In [21]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

In [22]:
modules = find_all_linear_names(model)

In [23]:
modules

['k_proj', 'o_proj', 'gate_proj', 'q_proj', 'v_proj', 'up_proj', 'down_proj']

In [24]:
# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules
)
# The Gemma 2 tokenizer already ships with a chat template, and setup_chat_format
# raises a ValueError in that case, so only call it when no template is present.
if tokenizer.chat_template is None:
    model, tokenizer = setup_chat_format(model, tokenizer)

In [25]:
model = get_peft_model(model, peft_config)

### Loading the dataset:<br>
We will now load the lavita/ChatDoctor-HealthCareMagic-100k dataset from the Hugging Face hub. The dataset consists of three columns:

* instruction: Consists of system instruction. 
* input: Detailed patient query.
* output: The doctor's response to the patient's query.

After loading the dataset, we will shuffle it and select 1000 samples to reduce the training time even further. In the end, we will create the chat format using the default chat template and use it to create the “text” column. 
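To make the "text" column less of a black box, here is a hand-written approximation of the layout that the Gemma chat template produces. It is not the tokenizer's real Jinja template; `gemma_style_chat` is a hypothetical helper written only for illustration, and the example messages are placeholders rather than rows from the dataset.

```python
def gemma_style_chat(messages, add_generation_prompt=False):
    """Approximate the Gemma chat layout (illustrative, not the real template)."""
    text = "<bos>"
    for message in messages:
        # Assistant turns are labelled "model" in the Gemma layout
        role = "model" if message["role"] == "assistant" else message["role"]
        text += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so that generation continues from here
        text += "<start_of_turn>model\n"
    return text


example = [
    {"role": "user", "content": "Patient question goes here."},
    {"role": "assistant", "content": "Doctor response goes here."},
]

formatted = gemma_style_chat(example)
print(formatted)
```

For training we want both turns in the text, so no generation prompt is added. At inference time the conversation contains only the user turn, and `add_generation_prompt=True` leaves an open model turn for the model to complete.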

In [27]:
#Importing the dataset
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=65).select(range(1000)) # Only use 1000 samples for quick demo

def format_chat_template(row):
    row_json = [{"role": "user", "content": row["input"]},
                {"role": "assistant", "content": row["output"]}]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc= 4,
)

dataset

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 1000
})

In [28]:
dataset['text'][3]

"<bos><start_of_turn>user\nSoreness on right side directly below rib cage, not under the ribs and bad back and hip pain. What could be going on? I am 46 years old, weigh 240 and am 5 11. Last year I tore my left acl and meniscus. Received an allograft and meniscus shaving. Anxiety became present then but after medication it went away.<end_of_turn>\n<start_of_turn>model\nDear-thanks for using our service, I reviewed the question in details and will give you my medical advice. The pain below the rib cage can be gas, muscular of from the gallbladder. However, you are overweight and that is aggravating the pain that you are experiencing, putting all the weight on your knee aggravated the meniscus problem. Anxiety increases intestinal gas and that can give you more pain. I recommend you to have a healthy diet, free of irritants and start doing exercise daily. If after diet modification and exercise you don't feel better, you might need a reevaluation of your problem with your primary care d

In [29]:
dataset = dataset.train_test_split(test_size=0.1)

### Compiling and training the model:<br>
We will now set the training arguments and SFT parameters and then start the training process.

In [30]:
# Hyperparameters
training_arguments = TrainingArguments(
    output_dir=new_model,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="wandb"
)
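
As a sanity check on these settings, we can estimate how many optimizer steps one epoch takes and how often evaluation runs when `eval_steps` is given as a fraction. This is a simplified sketch of the bookkeeping, assuming a single data-parallel device and 900 training rows (1000 samples minus the 10% test split); the exact rounding inside the trainer may differ between library versions.

```python
import math

train_samples = 900               # assumed: 1000 rows minus the 10% test split
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
num_train_epochs = 1
eval_ratio = 0.2                  # eval_steps < 1 is read as a fraction of total steps
num_devices = 1                   # assumption: no data parallelism

# Mini-batches seen per epoch
batches_per_epoch = math.ceil(
    train_samples / (per_device_train_batch_size * num_devices)
)
# One optimizer step is taken every `gradient_accumulation_steps` mini-batches
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_train_epochs
eval_every = math.ceil(total_steps * eval_ratio)

print(f"Optimizer steps in total: {total_steps}")
print(f"Evaluate every {eval_every} steps")
print(f"Effective batch size: {per_device_train_batch_size * gradient_accumulation_steps}")
```

With these values the estimate is 450 optimizer steps with an evaluation every 90 steps, so we should expect five evaluation rows in the training log.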


In [31]:
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    max_seq_length= 512,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]



In [32]:
model.config.use_cache = False
trainer.train()



Step,Training Loss,Validation Loss
90,2.6002,2.718037
180,2.9021,2.677273
270,2.3122,2.64641
360,2.3803,2.621952
450,2.6171,2.612842


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


TrainOutput(global_step=450, training_loss=2.6965235212114123, metrics={'train_runtime': 479.4825, 'train_samples_per_second': 1.877, 'train_steps_per_second': 0.939, 'total_flos': 2616512146168320.0, 'train_loss': 2.6965235212114123, 'epoch': 1.0})

### Model evaluation

In [33]:
# Finish the W&B run and re-enable the KV cache for inference
wandb.finish()
model.config.use_cache = True

0,1
eval/loss,█▅▃▂▁
eval/runtime,█▃▃▁▆
eval/samples_per_second,▁▆▆█▃
eval/steps_per_second,▁▆▆█▃
train/epoch,▁▁▂▂▂▂▂▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇█████
train/grad_norm,█▃▄▆█▂▃▃▄▄▂▁▄▄▅▂▂▂▄▄▂▃▃▃▂▅▄▅▃▃▂▃▃▄▄▂▂▂▃▃
train/learning_rate,██▇▇▇▇▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁
train/loss,█▆▆▆▆▅▇▅▅▆▆▆▆▅▆▅▆▇▅▄▅▆▇▆▆▇▄▆▄▆▄▇▅▄▅▁▃▆▄▅

0,1
eval/loss,2.61284
eval/runtime,16.8759
eval/samples_per_second,5.926
eval/steps_per_second,5.926
total_flos,2616512146168320.0
train/epoch,1.0
train/global_step,450.0
train/grad_norm,2.66102
train/learning_rate,0.0
train/loss,2.6171


In [35]:
# Save the fine-tuned adapter locally and push it to the Hugging Face Hub
trainer.model.save_pretrained(new_model)
trainer.model.push_to_hub(new_model, use_temp_dir=False)

adapter_model.safetensors:   0%|          | 0.00/83.1M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/samarth1029/Gemma-2-2b-baymax/commit/46756ee948c6cdb4f8770e6664d15ea03bbaaa35', commit_message='Upload model', commit_description='', oid='46756ee948c6cdb4f8770e6664d15ea03bbaaa35', pr_url=None, repo_url=RepoUrl('https://huggingface.co/samarth1029/Gemma-2-2b-baymax', endpoint='https://huggingface.co', repo_type='model', repo_id='samarth1029/Gemma-2-2b-baymax'), pr_revision=None, pr_num=None)

In [37]:
# import pkg_resources
# pkg_resources.require("torch==2.3.0")
import torch
messages = [
    {"role": "user", "content": "Hello, I am in the middle of a severe anxiety/panic attack. Could you help me?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text)

user
Hello, I am in the middle of a severe anxiety/panic attack. Could you help me?
model
Hi, I am Chat Doctor answering your query. I have gone through your query and understand your concern. I can understand your anxiety and panic attack. I would suggest you to take deep breaths and relax. You can also take some anti-anxiety Chat Doctor.  You can also take some anti-depressant Chat Doctor.  You can also take some anti-anxiety Chat Doctor.  You can also take some anti-depressant Chat Doctor.  You can also take some anti-anxiety Chat Doctor.  You can also take some anti-depressant Chat Doctor.  You can also take some anti-anxiety Chat Doctor.  You can also take some anti-depressant Chat Doctor.  You can also take some anti-anxiety Chat


# Merging the Base Model with the Adapter<br>
Now, we will merge the adapter with the base model and push the full model to the Hugging Face hub.

In [41]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel
import torch
from trl import setup_chat_format


# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model)

base_model_reload= AutoModelForCausalLM.from_pretrained(
    base_model,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

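If the tokenizer has no chat template, set the chat format on the newly loaded base model first (this step is commented out below because the Gemma tokenizer already includes one). Then we load the adapter on top of the base model and merge the two.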

The merge_and_unload() function will help us merge the adapter weights with the base model and use it as a standalone model.
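
Conceptually, merging folds the low-rank update into the frozen weight so that no adapter is needed at inference time: `W_merged = W + (alpha / r) * B @ A`. The toy example below illustrates this arithmetic with plain Python lists. It is a minimal sketch of the idea, not the PEFT implementation, and all matrices and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, x):
    """Multiply a matrix by a vector."""
    return [sum(mi * xi for mi, xi in zip(row, x)) for row in m]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy layer with 3 inputs, 2 outputs, and a rank-1 adapter (made-up values)
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
lora_a = [[0.5, -0.5, 1.0]]   # r x in
lora_b = [[2.0], [-1.0]]      # out x r
alpha, r = 2, 1
x = [1.0, 2.0, 3.0]

# Forward pass with a separate adapter: W x + (alpha / r) * B (A x)
low_rank = matvec(lora_b, matvec(lora_a, x))
with_adapter = [b + (alpha / r) * l for b, l in zip(matvec(weight, x), low_rank)]

# Forward pass through the merged weight
merged = merge_lora(weight, lora_a, lora_b, alpha, r)
with_merged = matvec(merged, x)

print(with_adapter, with_merged)
```

Both forward passes give the same result, which is why the merged model can be saved and loaded as an ordinary standalone checkpoint.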

In [43]:
#base_model_reload, tokenizer = setup_chat_format(base_model_reload, tokenizer)
model = PeftModel.from_pretrained(base_model_reload, new_model)

model = model.merge_and_unload()

In [44]:
model.save_pretrained("Gemma-2-2b-baymax")
tokenizer.save_pretrained("Gemma-2-2b-baymax")

('Gemma-2-2b-baymax/tokenizer_config.json',
 'Gemma-2-2b-baymax/special_tokens_map.json',
 'Gemma-2-2b-baymax/tokenizer.model',
 'Gemma-2-2b-baymax/added_tokens.json',
 'Gemma-2-2b-baymax/tokenizer.json')

In [46]:
model.push_to_hub("Gemma-2-2b-baymax", use_temp_dir=False)
tokenizer.push_to_hub("Gemma-2-2b-baymax", use_temp_dir=False)

README.md:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/241M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/34.4M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/samarth1029/Gemma-2-2b-baymax/commit/51c81591e7b1e0c96db70dc06e3fc92664566e31', commit_message='Upload tokenizer', commit_description='', oid='51c81591e7b1e0c96db70dc06e3fc92664566e31', pr_url=None, repo_url=RepoUrl('https://huggingface.co/samarth1029/Gemma-2-2b-baymax', endpoint='https://huggingface.co', repo_type='model', repo_id='samarth1029/Gemma-2-2b-baymax'), pr_revision=None, pr_num=None)

In [47]:
import torch
messages = [
    {"role": "user", "content": "Hello, I am in the middle of a severe anxiety/panic attack. Could you help me?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


user
Hello, I am in the middle of a severe anxiety/panic attack. Could you help me?
model
Hi, Thanks for your query. I can understand your concern. I would suggest you to take deep breaths and relax. You can also try to focus on your breathing. You can also try to distract yourself by doing something else. You can also try to relax your muscles. You can also try to meditate. You can also try to do some exercise. You can also try to eat something healthy. You can also try to get some sleep. You can also try to take some medication. I hope this helps.


# Loading from Pretrained

In [53]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "samarth1029/Gemma-2-2b-baymax"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [54]:
text = "The cure to fever is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The cure to fever is to give the patient a cold drink. This is
