<a href="https://colab.research.google.com/github/satrajitgithub/NRG_AI_NeuroOnco_preproc/blob/master/Intro_to_llm_rice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Introduction to Large Language Models**
### *Michaela Buchanan - Mark III Systems*

---

In this notebook we will look at using an LLM for inference as well as finetuning an LLM for a specific task. Please note that these are large models, so downloading and training them will take several hours.

Imports for all the code are below. Please run this cell first, before anything else!

In [None]:
!pip install torch
!pip install transformers
!pip install accelerate
!pip install bitsandbytes
!pip install peft
!pip install datasets
!pip install einops

import torch
from transformers import AutoTokenizer, FalconForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
import transformers
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset, Dataset

!git clone https://github.com/michaelabuchanan/llm_education_session_lora.git


---

###Inference with Falcon-7B

We are going to start by using `Falcon-7B` to perform inference on some prompts we come up with. This will demonstrate how you can use Hugging Face's Transformers library to easily pull down and start using a model. The documentation for the model is [linked here](https://huggingface.co/ybelkada/falcon-7b-sharded-bf16). This is not the original version of Falcon-7B. Instead it's a sharded version, which means the files for the model are broken up into smaller pieces. This allows us to run this notebook in environments where system RAM is limited.

To make this workshop happen we will be using the methodology outlined in the paper *QLoRA: Efficient Finetuning of Quantized LLMs* [linked here](https://arxiv.org/abs/2305.14314). It boils down to using reduced precision for our parameters, which means we will need significantly less GPU RAM to load the model. In our case we are going to reduce the parameters from 16-bit precision to 4-bit precision, which means we will need approximately 1/4 of the GPU RAM to load our model.
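
To make the "approximately 1/4" figure concrete, here is a small back-of-the-envelope sketch. It counts only the memory needed for the weights themselves and assumes a round 7 billion parameters; activations, quantization constants, and framework overhead are ignored, so real usage will be somewhat higher. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # assumption: round figure for a "7B" model

mem_16bit = weight_memory_gb(NUM_PARAMS, 16)
mem_4bit = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: {mem_16bit:.1f} GB")
print(f" 4-bit weights: {mem_4bit:.1f} GB")
print(f"ratio: {mem_4bit / mem_16bit:.2f}")
```

Under these assumptions the 16-bit weights need about 14 GB while the 4-bit weights need about 3.5 GB, which is the difference between not fitting and fitting on a typical free Colab GPU.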

Notice the `model` variable in the code cell below. You can change the model specified in the variable to use other text-generation models available on Hugging Face.

The first step is to create our tokenizer so that we can feed our input into the model. Remember that we can't just pass a query into the model; it has to be converted into a numeric format first. To create our tokenizer we are going to use the `AutoTokenizer` class from Hugging Face. [Visit this link](https://huggingface.co/docs/transformers/v4.32.0/en/model_doc/auto#transformers.AutoTokenizer) to see the documentation for this class. We can pass the name of the model we would like to use to `AutoTokenizer` and it will automatically set up the tokenizer we need to interact with the model.
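
As a rough picture of what "converted into a numeric format" means, the toy sketch below maps whole words to integer ids and back. This is only an illustration: the helper names (`build_vocab`, `encode`, `decode`) and the tiny corpus are invented here, and real tokenizers such as the one loaded in the next cell work on subword pieces with a much larger, pretrained vocabulary.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct lowercase word in the corpus."""
    vocab = {"<unk>": 0}  # id 0 is reserved for words we have never seen
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn a string into a list of ids, using <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Turn a list of ids back into a string."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


toy_vocab = build_vocab(["what is the national bird", "tell me a joke"])
toy_ids = encode("What is the national joke", toy_vocab)

print(toy_ids)
print(decode(toy_ids, toy_vocab))
```

The model only ever sees the list of ids; the tokenizer is what translates between that list and readable text in both directions.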

In [None]:
model = "ybelkada/falcon-7b-sharded-bf16"

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
print(tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


PreTrainedTokenizerFast(name_or_path='ybelkada/falcon-7b-sharded-bf16', vocab_size=65024, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['>>TITLE<<', '>>ABSTRACT<<', '>>INTRODUCTION<<', '>>SUMMARY<<', '>>COMMENT<<', '>>ANSWER<<', '>>QUESTION<<', '>>DOMAIN<<', '>>PREFIX<<', '>>SUFFIX<<', '>>MIDDLE<<']}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken(">>TITLE<<", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken(">>ABSTRACT<<", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken(">>INTRODUCTION<<", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken(">>SUMMARY<<", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken(">>COMMENT<<", rstrip=False, lstri

Next we need to set up our BitsAndBytesConfig. We are using the BitsAndBytes library to make our parameter precision changes so that we can load our 7 billion parameter model without running out of GPU RAM. As you can see below we are loading the model in 4 bit format by using the load_in_4bit parameter.

In [None]:
bb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
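
To give some intuition for what storing a weight in 4 bits involves, here is a simplified sketch of block-wise absmax quantization in plain Python. It is not the NF4 scheme selected by `bnb_4bit_quant_type="nf4"`: NF4 uses non-uniformly spaced levels, whereas this sketch assumes evenly spaced integer codes in the range -7 to 7. The function names and the example weights are made up for illustration.

```python
def quantize_block(values, bits=4):
    """Map floats to small signed integer codes plus one scale per block."""
    levels = 2 ** (bits - 1) - 1                   # 7 for signed 4-bit codes
    scale = max(abs(v) for v in values) or 1.0     # avoid dividing by zero
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def dequantize_block(codes, scale, bits=4):
    """Recover approximate floats from the integer codes and the scale."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scale for c in codes]


toy_weights = [0.12, -0.5, 0.33, 0.0, 0.07, -0.21, 0.5, -0.02]
toy_codes, toy_scale = quantize_block(toy_weights)
toy_restored = dequantize_block(toy_codes, toy_scale)

print("codes:   ", toy_codes)
print("restored:", [round(v, 3) for v in toy_restored])
print("max error:", max(abs(a - b) for a, b in zip(toy_weights, toy_restored)))
```

Each weight is now one of only 15 possible values, so some precision is lost, but the restored numbers stay close to the originals. Computation still happens in a higher precision (here `torch.bfloat16`, per `bnb_4bit_compute_dtype`) after the weights are dequantized.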

Below you can see the code for actually loading our model. We use the FalconForCausalLM class from HuggingFace to do this. We give it the name of the model we want to load and the BitsAndBytes configuration we just defined. When you run this code cell you should see it load in the sharded portions of the model which will take a while to download and load into memory.

In [None]:
falcon7b_model = FalconForCausalLM.from_pretrained(
    model,
    quantization_config=bb_config,
    use_cache=False
)

In the slideshow we talked about how an LLM's performance should improve by simply increasing the number of parameters. To explore this we are also going to initialize a 1 billion parameter version of Falcon. We can then give both the 7B and 1B models the same input prompts and see how their outputs differ. Note that we initialize a separate tokenizer for the 1B model. If you look at the description of the tokenizer when it is printed, it is a different kind of tokenizer than the one used for the 7B model. Therefore, if we used the first tokenizer for this model, the code would run but the output would most likely not be correct.

In [None]:
tokenizer2 = AutoTokenizer.from_pretrained("tiiuae/falcon-rw-1b", use_fast=True)
tokenizer2.pad_token = tokenizer2.eos_token
print(tokenizer2)

GPT2TokenizerFast(name_or_path='tiiuae/falcon-rw-1b', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


In [None]:
falcon1b_model = FalconForCausalLM.from_pretrained(
    "tiiuae/falcon-rw-1b",
    quantization_config=bb_config,
    use_cache=False
)

Now that we have our models loaded, let's do some inference examples with them. Below you can see that we have a prompt for our model in the `text` variable, in this case a question about the national bird of the United States. We need to use our tokenizer to convert our prompt into something we can feed the model. Then we can use the `generate` function to pass the tokenized prompt to the model and save the result as `outputs`. For the 7B model the printed results say that the national bird is a bald eagle, which is correct. However, the 1B model does not fare as well...
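
Before running the real models, it may help to picture what `generate` does with `max_new_tokens`. The sketch below is a toy stand-in, assuming greedy decoding (always picking the highest-scoring next token); `generate` supports other strategies too. The score table and the name `greedy_generate` are invented for this illustration, and a real model scores the next token with a neural network that looks at the whole sequence, not just the last token.

```python
# Hypothetical next-token scores for a tiny vocabulary.
TOY_NEXT_TOKEN_SCORES = {
    "Answer:": {"The": 0.7, "A": 0.3},
    "The": {"bald": 0.6, "eagle": 0.4},
    "bald": {"eagle.": 0.9, "The": 0.1},
    "eagle.": {"Question:": 0.8, "The": 0.2},
    "Question:": {"What": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens):
    """Append the highest-scoring next token until the token budget is spent."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = TOY_NEXT_TOKEN_SCORES.get(tokens[-1])
        if scores is None:  # nothing known about this token, so stop early
            break
        tokens.append(max(scores, key=scores.get))
    return tokens


print(" ".join(greedy_generate(["Answer:"], max_new_tokens=4)))
```

Note that the loop does not know when the answer is "finished". It simply keeps appending tokens until the budget runs out, which is why the outputs in the cells below can run on into a new question or stop mid-sentence.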

In [None]:
print("Falcon 1B:")
text = "Question: What is the national bird of the United States? \n Answer: "

inputs = tokenizer2(text, return_tensors="pt").to("cuda:0")
outputs = falcon1b_model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, pad_token_id=tokenizer2.pad_token_id, max_new_tokens=6)
print("\n" + str(tokenizer2.decode(outputs[0], skip_special_tokens=True)))


Falcon 1B:

Question: What is the national bird of the United States? 
 Answer: 
Question: What is the


In [None]:
print("Falcon 7B:")

inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
outputs = falcon7b_model.generate(input_ids=inputs.input_ids, pad_token_id=tokenizer.pad_token_id, max_new_tokens=6)
print("\n" + str(tokenizer.decode(outputs[0], skip_special_tokens=True)))

Falcon 7B:

Question: What is the national bird of the United States? 
 Answer:  The bald eagle.
 Question


Below is another similar example with a new prompt. Feel free to change the prompt to experiment with the model. Creating a good prompt is more of an art than a science and you may find that small differences in the way you phrase a question can have large impacts on how the model interprets it and responds.

In [None]:
print("Falcon 1B:")
text2 = "Tell me a joke about a bear."

inputs = tokenizer2(text2, return_tensors="pt").to("cuda:0")
outputs = falcon1b_model.generate(input_ids=inputs.input_ids, pad_token_id=tokenizer2.pad_token_id, max_new_tokens=40)
print("\n" + str(tokenizer2.decode(outputs[0], skip_special_tokens=True)))

Falcon 1B:

Tell me a joke about a bear.
I’m not sure what to say about this.
I’m not sure what to say about this.
I’m not sure what to say about this.



In [None]:
print("Falcon 7B:")

inputs = tokenizer(text2, return_tensors="pt").to("cuda:0")
outputs = falcon7b_model.generate(input_ids=inputs.input_ids, pad_token_id=tokenizer.pad_token_id, max_new_tokens=160)
print("\n" + str(tokenizer.decode(outputs[0], skip_special_tokens=True)))

Falcon 7B:

Tell me a joke about a bear.
A bear walks into a bar.
The bartender says, "What'll it be?"
The bear says, "I'll have a beer."
The bartender says, "What kind?"
The bear says, "A Bud."
The bartender says, "You're a bear, you can't drink beer."
The bear says, "I'm not a bear, I'm a polar bear."
The bartender says, "You're still not allowed to drink beer."
The bear says, "I'm not a bear, I'm a polar bear."
The bartender says, "You're still not allowed to drink beer."
The bear says, "I'


We can see in this example that while the 7B model doesn't exactly give us a joke it makes much more of an attempt at it than the 1B model.

Now let's change things up a bit. For the finetuning portion of this workshop we are going to be using the MedText dataset. This dataset contains short descriptions of patient symptoms, each with an associated diagnosis and action plan to treat the symptoms. The hope with our finetuning is that we can give the model a description of symptoms as the prompt and have it produce a reasonable response as output. Let's test our current models to see how they perform on this task as a baseline. Below you can see one of the entries in the MedText dataset.

In [None]:
dataset = load_dataset("BI55/MedText", split="train")

In [None]:
import pandas as pd

df = pd.DataFrame(dataset)
print(df['Prompt'][0])
print("\n")
print(df['Completion'][0])
prompt = df.pop("Prompt")
comp = df.pop("Completion")
df["Info"] = prompt + "\n" + comp

A 50-year-old male presents with a history of recurrent kidney stones and osteopenia. He has been taking high-dose vitamin D supplements due to a previous diagnosis of vitamin D deficiency. Laboratory results reveal hypercalcemia and hypercalciuria. What is the likely diagnosis, and what is the treatment?


This patient's history of recurrent kidney stones, osteopenia, and high-dose vitamin D supplementation, along with laboratory findings of hypercalcemia and hypercalciuria, suggest the possibility of vitamin D toxicity. Excessive intake of vitamin D can cause increased absorption of calcium from the gut, leading to hypercalcemia and hypercalciuria, which can result in kidney stones and bone loss. Treatment would involve stopping the vitamin D supplementation and potentially providing intravenous fluids and loop diuretics to promote the excretion of calcium.


In [None]:
print("Falcon 1B:")
text3 = "A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps? "

inputs = tokenizer2(text3, return_tensors="pt").to("cuda:0")
outputs = falcon1b_model.generate(input_ids=inputs.input_ids, pad_token_id=tokenizer2.pad_token_id, max_new_tokens=20)
print(tokenizer2.decode(outputs[0], skip_special_tokens=True))

Falcon 1B:
A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps? 
-
-
-
-
-
-
-
-
-
-


In [None]:
print("Falcon 7B:")

inputs = tokenizer(text3, return_tensors="pt").to("cuda:0")
outputs = falcon7b_model.generate(input_ids=inputs.input_ids, pad_token_id=tokenizer.pad_token_id, max_new_tokens=70)
print("\n" + str(tokenizer.decode(outputs[0], skip_special_tokens=True)))

Falcon 7B:

A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps? (A) Ankle sprain (B) Ankle fracture (C) Ankle dislocation (D) Ankle fracture with dislocation (E) Ankle fracture with dislocation and ankle sprain
Ankle sprain
Ankle fracture
Ankle dislocation
Ankle fracture with dislocation and ankle sp


Once again the 1B model doesn't attempt to respond to the prompt. We can see that the response from our 7B model is reasonable but not in the format or detail that we are hoping for. It just generates a list of possible conditions that could be associated with the symptoms in our prompt. Keep this response in mind, as we will compare it to the output generated by our finetuned model at the end of this workshop.

---


###Finetuning on the MedText Dataset

As promised, we are going to finetune our Falcon model using the MedText dataset. We will be sticking with Falcon-7B for the remainder of this workshop, as we have seen it is much better at handling complex prompts. The first step is to define some arguments for training. Below you can see the configuration we are going to use, which is saved as `training_args`.

In [None]:
training_args = TrainingArguments(
    output_dir="./finetuned_falcon",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    optim="paged_adamw_8bit"
)
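
To get a feel for what these settings imply, the sketch below estimates how many optimizer steps one epoch takes. It assumes a single GPU, no gradient accumulation, and that the final partial batch is kept; the figure of 1129 training examples is the size of the training split created later in this notebook. The helper name `steps_per_epoch` is made up for this illustration.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1, grad_accum_steps=1):
    """Number of optimizer steps needed to see every example once."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch)


est_steps = steps_per_epoch(num_examples=1129, per_device_batch_size=2)
print(est_steps)
```

With `per_device_train_batch_size=2` and `num_train_epochs=1`, that works out to a few hundred steps in total, each of which runs a forward and backward pass through the 7B model.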



We are going to be using the `peft` library to make finetuning possible on the hardware we have in this notebook. This will allow us to finetune a small set of "adapter" parameters that are incorporated into our existing model, rather than attempting to finetune all 7 billion parameters we already have. This makes training much more feasible computationally and also helps keep the model from "forgetting" things it already knows during the finetuning process.
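
The idea behind LoRA adapters can be sketched in a few lines of plain Python. A frozen weight matrix `W` is left untouched, and a trainable low-rank pair `A` and `B` adds a correction scaled by `alpha / r`, following the formulation in the LoRA paper. The tiny matrices and the helper names below are invented for illustration; the `peft` library implements this with PyTorch tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: 3 inputs, 2 outputs, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]  # frozen, shape (2, 3)
A = [[1.0, 1.0, 1.0]]                    # trainable, shape (r, 3)
B_init = [[0.0], [0.0]]                  # trainable, shape (2, r), starts at zero
x = [1.0, 2.0, 3.0]

# With B at zero the adapted layer behaves exactly like the original one.
print(lora_forward(W, A, B_init, x, alpha=2.0, r=1))

# After training moves B away from zero, the output shifts.
B_trained = [[1.0], [0.0]]
print(lora_forward(W, A, B_trained, x, alpha=2.0, r=1))
```

Because `B` starts at zero, finetuning begins from the pretrained model's behavior, and only the small `A` and `B` matrices ever receive gradient updates.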

In [None]:
print(falcon7b_model)

FalconForCausalLM(
  (transformer): FalconModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x FalconDecoderLayer(
        (self_attention): FalconAttention(
          (rotary_emb): FalconRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): FalconMLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELUActivation()
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)


Above you can see the model structure for the Falcon-7B model. We can see which modules we need to target in our LoRA configuration by looking for the layers with `Linear` in the name. In the 7B model these would be the layers listed in `target_modules` in the configuration below.

In [None]:
falcon7b_model.gradient_checkpointing_enable()
falcon7b_model = prepare_model_for_kbit_training(falcon7b_model)

lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)

lora_model = get_peft_model(falcon7b_model, lora_config)

Above you can see the configuration we are going to use. We use that configuration along with the `get_peft_model` method to generate a new model that has adapters we can finetune, which we store into the variable `lora_model`.

The function below comes from Chris Kuo's blog about using LoRA, which is linked in the comment below and at the end of this notebook. It counts and prints the number and percentage of parameters we will be training compared to the total number of parameters in the model. You should see that we will be training only ~0.2% of the total parameters, which tells us our LoRA setup was successful.

In [None]:
# https://dataman-ai.medium.com/fine-tune-a-gpt-lora-e9b72ad4ad3
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
print_trainable_parameters(lora_model)

trainable params: 8159232 || all params: 3616904064 || trainable%: 0.22558607736409123
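
The trainable count above can be reproduced by hand from the layer shapes in the printed model and the `r=4` setting in `lora_config`. For a linear layer with `in_features` inputs and `out_features` outputs, a rank-`r` adapter adds `r * (in_features + out_features)` parameters. The sketch below assumes the only trainable parameters are the LoRA matrices on the four targeted modules in each of the 32 decoder layers.

```python
R = 4            # rank from lora_config
NUM_LAYERS = 32  # number of FalconDecoderLayer blocks in the printed model

# (in_features, out_features) for each targeted module, read off the model printout.
TARGET_SHAPES = {
    "query_key_value": (4544, 4672),
    "dense": (4544, 4544),
    "dense_h_to_4h": (4544, 18176),
    "dense_4h_to_h": (18176, 4544),
}


def lora_params(in_features, out_features, r):
    """Parameters in one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total_lora = per_layer * NUM_LAYERS

print(per_layer)
print(total_lora)
```

The total matches the `trainable params` figure printed above. The `all params` figure is well below 7 billion, most likely because the 4-bit weights are stored in a packed form so that `numel()` undercounts them; if so, the true trainable percentage is even smaller than the one shown.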


Now we need the dataset that we are going to use for finetuning. We are using MedText, which is hosted on the Hugging Face Hub. This is why we were able to import it earlier with the `load_dataset` function.

For finetuning we are going to combine our prompts and responses. This was done in the earlier cell where we created a Pandas DataFrame from our dataset, took the `Prompt` and `Completion` columns, and combined both in a new `Info` column.

A lot of the work that goes into finetuning is preprocessing the data you wish to feed the model. Not only do we need to run all our data through our tokenizer, but we also need to ensure that we keep the dimensions of our data in line with what the model expects. Below you can see the function I used to handle the tokenization and return the tokenized results. This function is based on the tokenizing function written by Harveen Singh Chadha, which is linked in the comment below.

In [None]:
# https://www.kaggle.com/code/harveenchadha/tokenize-train-data-using-bert-tokenizer
def tokenizing(text, tokenizer, chunk_size, maxlen):
    """Tokenize a list of strings in chunks, padding/truncating each to maxlen tokens."""
    input_ids = []
    at_ids = []

    # Process the texts chunk_size at a time to keep each tokenizer call small
    for i in range(0, len(text), chunk_size):
        text_chunk = text[i:i+chunk_size]
        encs = tokenizer(
            text_chunk,
            max_length=maxlen,
            padding='max_length',
            truncation=True
        )

        input_ids.extend(encs['input_ids'])
        at_ids.extend(encs['attention_mask'])

    return {'input_ids': input_ids, 'attention_mask': at_ids}
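
To show what `padding='max_length'` and `truncation=True` do to a single example, here is a minimal sketch in plain Python. It assumes padding is added on the right, which matches the `padding_side='right'` shown when the tokenizer was printed. The function name `pad_and_truncate` and the example ids are made up for illustration.

```python
def pad_and_truncate(token_ids, max_length, pad_id):
    """Force a list of token ids to exactly max_length and build its attention mask."""
    ids = token_ids[:max_length]         # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


short_ids, short_mask = pad_and_truncate([11, 12, 13], max_length=5, pad_id=0)
long_ids, long_mask = pad_and_truncate([11, 12, 13, 14, 15, 16], max_length=5, pad_id=0)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

After this step every example has the same length, so they can be stacked into a batch, and the attention mask tells the model which positions are real tokens and which are padding.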

Now we can use this function to preprocess our data. We pass the `Info` column we created earlier into this function as a list, along with the tokenizer we initialized earlier, and receive tokens as output. We can use the Hugging Face `Dataset` object to easily create a dataset from these tokens and then split the dataset into train and test subsets.

In [None]:
tokens = tokenizing(list(df["Info"]), tokenizer, 256, 2048)
tokens_dataset = Dataset.from_dict(tokens)
split_dataset = tokens_dataset.train_test_split(test_size=0.2)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1129
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 283
    })
})
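
The row counts above follow from the `test_size=0.2` argument. The sketch below reproduces them in plain Python, assuming the test size is rounded up to a whole number of rows, which is consistent with the counts shown. The shuffling here uses Python's `random` module and a made-up helper name, so the actual rows selected by `train_test_split` will differ.

```python
import math
import random


def split_indices(num_rows, test_size, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    n_test = math.ceil(num_rows * test_size)
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


toy_train_idx, toy_test_idx = split_indices(1412, 0.2)
print(len(toy_train_idx), len(toy_test_idx))
```

Holding out a test split lets the `Trainer` report a loss on examples the model did not train on, which is what `evaluation_strategy="epoch"` in `training_args` asks for.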

We now only need one more piece before we can start training - the Trainer itself. Below you can see that we initialize the `trainer` by giving it the LoRA model we created, our training argument configuration, and our dataset.

In [None]:
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
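
The `DataCollatorForLanguageModeling(tokenizer, mlm=False)` argument is what supplies the training targets. As I understand it, for causal language modeling the collator copies `input_ids` into `labels` and replaces padding positions with `-100`, a value the loss function ignores. The sketch below imitates that step in plain Python with made-up token ids and a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def make_causal_lm_labels(input_ids, pad_id):
    """Copy the inputs as labels, masking out padding so it does not affect the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


toy_pad_id = 0
toy_input_ids = [11, 12, 13, 0, 0]
toy_labels = make_causal_lm_labels(toy_input_ids, toy_pad_id)
print(toy_labels)
```

The shift by one position (predicting token `t+1` from tokens up to `t`) happens inside the model, so the labels have the same length as the inputs. One side effect worth knowing: because this notebook sets `pad_token` to `eos_token`, genuine end-of-text tokens would be masked in the same way.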



If you would like to perform the finetuning in this notebook, go ahead and uncomment the two lines below. These will perform the training and then save the low-rank adapters to the `finetuned_falcon` directory. However, note that training will take about two and a half hours. Therefore this notebook is set up to use already finetuned low-rank adapters from a previous finetuning run. This allows us to skip the training time and see how you would use the resulting files to reinstantiate the model for inference. These already finetuned files are cloned from a GitHub repository in the first cell of this notebook. To use the pretrained files you should not have to change anything in this notebook.

In [None]:
# trainer.train()
# trainer.model.save_pretrained("./finetuned_falcon")

---


###Testing the Finetuned Model

Now that our finetuning is complete it's time to see if it gives a better response to the MedText data than it did before. We first need to load in our finetuned model as seen below.

In [None]:
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained('./llm_education_session_lora')

# if using already finetuned files
finetuned_model = PeftModel.from_pretrained(falcon7b_model, './llm_education_session_lora')

# comment out line 6 and uncomment this if you are finetuning in this notebook
# finetuned_model = PeftModel.from_pretrained(falcon7b_model, './finetuned_falcon')



Then we will do inference just like we did before with the same prompt used previously.

In [None]:
text4 = "A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps?"

inputs = tokenizer(text4, return_tensors="pt").to("cuda:0")
outputs = finetuned_model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, pad_token_id=tokenizer.pad_token_id, max_new_tokens=75)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps?
This patient's symptoms suggest a lateral ankle sprain, which is common in sports like basketball where sudden changes in direction and landing on an outstretched foot are common. The next steps would include rest, ice, compression, and elevation (RICE protocol) to reduce pain and swelling. If symptoms persist, an X-ray or MRI may be considered to rule


Here is one more example. Feel free to change the `text5` variable to your own prompt to see how the model responds.

In [None]:
text5 = "A 28-year-old man presents with a mild fever which broke after a few days and nausea which comes back at night and after he eats. He then experiences a sore throat and cough. What is the likely diagnosis and next steps?"

inputs = tokenizer(text5, return_tensors="pt").to("cuda:0")
outputs = finetuned_model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, pad_token_id=tokenizer.pad_token_id, max_new_tokens=75)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


A 28-year-old man presents with a mild fever which broke after a few days and nausea which comes back at night and after he eats. He then experiences a sore throat and cough. What is the likely diagnosis and next steps?
This patient's symptoms suggest a common cold, which is usually caused by a virus. The next steps would be to encourage rest, hydration, and over-the-counter pain relievers for the fever and sore throat. If the cough persists or worsens, or if the patient has difficulty breathing, he should seek medical attention. Antibiotics are not effective against viruses


As you can see, the response is now much more similar to the MedText responses. Not only are the recommendations the model gives reasonable, but the way the responses are written is also much more cohesive and professional than the list it gave us before. This shows how finetuning can be used to change the behavior of an LLM pretty dramatically.

---

###Helpful Resources for Further Study

As mentioned in the lecture portion of this workshop, we did not have time to cover many important aspects of how LLMs work. Below I have links to resources I found useful in understanding how Transformers work as well as other concepts mentioned in this session.

**Papers:**
*   8 Things to Know about LLMs paper: https://arxiv.org/abs/2304.00612
*   Attention is All You Need (original transformers paper): https://arxiv.org/abs/1706.03762
*   LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685
*   QLoRA: Efficient Finetuning of Quantized LLMs: https://arxiv.org/abs/2305.14314

**Blogs:**
*   Blog explaining building a Transformer model from scratch: https://peterbloem.nl/blog/transformers
*   Blog on Bits and Bytes: https://huggingface.co/blog/hf-bitsandbytes-integration
*   Hugging Face's Blog on Falcon: https://huggingface.co/blog/falcon
*   Using LoRA for finetuning: https://dataman-ai.medium.com/fine-tune-a-gpt-lora-e9b72ad4ad3
*   LoRA Parameter Overview: https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms
*   Blog on Transformer Variants (Encoder vs Decoder vs Encoder-Decoder): https://vaclavkosar.com/ml/Encoder-only-Decoder-only-vs-Encoder-Decoder-Transfomer


**YouTube videos:**
*   University of Waterloo lecture on Attention and Transformers: https://www.youtube.com/watch?v=OyFJWRnt_AY
*   Visual Guide to Transformers series: https://www.youtube.com/watch?v=dichIcUZfOw
*   Attention is All You Need paper explanation: https://www.youtube.com/watch?v=w76Dpp7b3B4
*   Visual Guide to Transformers: https://www.youtube.com/watch?v=mMa2PmYJlCo


