Note: It is important to specify the versions of libraries used after getting the code to run first time. These libraries are updated quite often and this usually results in error and incompatibilty between liberaries.

Use: 
"pip freeze > requirements.txt"
to generate a file containing all the details about libraries used and their versions. This will come handy in case of replicating code.

On kaggle keep the environment pinned to original to avoid conflict in future

In [2]:
!pip install -q torch peft bitsandbytes transformers trl accelerate

In [3]:
!pip install -q -U datasets

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 15.0.2 which is incompatible.
beatrix-jupyterlab 2023.814.150030 requires jupyter-server~=1.16, but you have jupyter-server 2.12.1 which is incompatible.
beatrix-jupyterlab 2023.814.150030 requires jupyterlab~=3.4, but you have jupyterlab 4.0.5 which is incompatible.
cudf 23.8.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.0.3 which is incompatible.
cudf 23.8.0 requires protobuf<5,>=4.21, but you have protobuf 3.20.3 which is inco

In [1]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer
from datasets import load_dataset



In [2]:
# Load csv file with QA data divide into train and test

from sklearn.model_selection import train_test_split
import pandas as pd
from datasets import Dataset

data_name = "/kaggle/input/diabetes-qa/QA_dataset - Sheet1.csv" #replace with file path
df = pd.read_csv(data_name)

# last 40 kept as test data
# first 40 for training
test_data = df.tail(40)
train_data = df.head(len(df) - 40)

# It is import to convert data into Dataset of Dataset library to be used with 
# finetuning trainer provided by Hugging face
train_dataset = Dataset.from_pandas(train_data)

In [None]:
# Standard Procedure to load a LLM using Hugging Face libraries
#llama-2 7b-chat is used for both tokenization and generation
#To access llama-2 in kaggle go to +Add input -> Models -> llama-2
#Copy it's path here as base_model_name
#The model can also be accessed through a hugging face 
#After that it can be loaded on google colab through hugging face login

base_model_name = "/kaggle/input/llama-2/pytorch/7b-chat-hf/1"

# The fine tuned model will be saved with this name
refined_model = "llama-2-7b-medical-enhanced"

# The tokenizer defined here will be used later to generate responses from refined model
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True, do_sample=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"
llama_tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True

# This is used to break the model down to reduce memory usage passed on to the LLM loading function
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)

# Here the LLM is configuration are set
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    do_sample=True,
    device_map="auto"
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

In [8]:
# This defines PEFT parameters for fine tuning training
peft_parameters = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)

#other training arguments taken from web
#Things like epochs, LR etc. can be adjusted here
train_params = TrainingArguments(
    output_dir="results_modified",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

#Trainer for fine tuning defined here, provided by HuggingFace
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_parameters,
    dataset_text_field="Questions",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Trainer function called
fine_tuning.train()

# fine tuned model saved
fine_tuning.model.save_pretrained(refined_model)



Map:   0%|          | 0/162 [00:00<?, ? examples/s]

Step,Training Loss
25,2.4675
50,1.674
75,1.4565
100,1.1825
125,1.0467
150,0.7737
175,0.709
200,0.5739




In [3]:
# pipeline from transformer library defined to generate responses using refined model
# max_length should be adjusted according to data
text_gen = pipeline(task="text-generation", model=refined_model, tokenizer=llama_tokenizer, max_length=350)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [7]:
# Sample tested from test data
# test_data["Questions"].iloc[40] is a query of string data type
output = text_gen(test_data["Questions"].iloc[40]) # any other question string can be passed
print(output[0]['generated_text']) #output printed

What are the common symptoms of type 1 diabetes in children, and how does it differ from type 2 diabetes?

Type 1 diabetes is an autoimmune disease in which the immune system attacks and destroys the cells in the pancreas that produce insulin, a hormone that regulates blood sugar levels. This results in a complete deficiency of insulin production and a reliance on insulin injections or an insulin pump to manage blood sugar levels.

The common symptoms of type 1 diabetes in children include:

1. Increased thirst and urination: When there is too much glucose in the blood, the body tries to flush it out by producing more urine, leading to increased thirst and frequent urination.
2. Fatigue: High blood sugar levels can cause fatigue, lethargy, and a lack of energy.
3. Weight loss: Despite increased thirst and urination, children with type 1 diabetes may experience weight loss due to the body's inability to use glucose for energy.



In [11]:
# This line is used to generate a text file containing list of all the libraries used along with their version number
# !pip freeze > requirements.txt