# **Falcon Fine-Tuning & Medical RAG Project**
### *Developed for FalconForge*

---

In this notebook I will use an LLM for inference and then finetune it for a specific task. We will finish by working through a RAG example and comparing it to the finetuning approach.

Imports for all the code are below. Please run first before anything else!

In [7]:
rm -rf gtc_llm_session

In [None]:
!pip install -q transformers accelerate bitsandbytes peft datasets einops torch sentence-transformers pandas

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset, Dataset

import sys
import os
from sentence_transformers.util import semantic_search, dot_score
from sentence_transformers import SentenceTransformer
import pandas as pd

Cloning into 'gtc_llm_session'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 22 (delta 5), reused 0 (delta 0), pack-reused 6 (from 1)
Receiving objects: 100% (22/22), 39.15 MiB | 12.88 MiB/s, done.
Resolving deltas: 100% (5/5), done.


In [9]:
import os
print("Current folder:", os.getcwd())
print("Files here:", os.listdir('.'))

Current folder: /content
Files here: ['.config', 'gtc_llm_session', 'sample_data']


---

### Inference with Falcon3-7B-Base

We are going to start by using `Falcon3-7B-Base` to perform inference on some prompts we come up with. This will demonstrate how you can use Hugging Face's Transformers library to easily pull down and start using a model. The documentation for the model is [linked here](https://huggingface.co/tiiuae/Falcon3-7B-Base).

While using the compute environment provided for GTC we should not have a problem loading a model this size. However, I usually provide this notebook on Google Colab and want to make sure it can be run on the free GPU tier. You may be wondering how we can load and finetune a model with 7 billion parameters using only the T4 GPU that tier provides. A T4 has 16 GB of GPU RAM, which may seem like a lot, but loading a 7 billion parameter model at full (32-bit) precision takes at least 28 GB of GPU memory for the weights alone, almost twice what we have available.

To make this project happen I will be using the methodology outlined in the paper *QLoRA: Efficient Finetuning of Quantized LLMs* [linked here](https://arxiv.org/abs/2305.14314). It boils down to storing the parameters at reduced precision, which means we need significantly less GPU RAM to load the model. In this case we reduce the parameters from 16-bit to 4-bit precision, so the weights need approximately 1/4 of the GPU RAM of a 16-bit load (and about 1/8 of the 28 GB full-precision figure above). This allows us to use the model for inference with the T4 GPU available to us.
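The memory figures above are simple arithmetic, and a rough sketch makes them easy to check. It assumes exactly 7 billion parameters and counts only the weights; real usage is higher once activations, quantization metadata, and CUDA overhead are included. The function name `weight_memory_gb` is made up for this illustration.

```python
# Back-of-the-envelope estimate of the memory needed just to hold the weights.
# Assumption: exactly 7e9 parameters; activations and overhead are ignored.
NUM_PARAMS = 7_000_000_000
T4_MEMORY_GB = 16


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return the weight footprint in GB, using 1 GB = 1e9 bytes."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 4):
    gb = weight_memory_gb(NUM_PARAMS, bits)
    share = gb / T4_MEMORY_GB
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB ({share:.0%} of a {T4_MEMORY_GB} GB T4)")
```

Even at 16-bit the weights alone would take most of the T4, which leaves little room for anything else; at 4-bit there is headroom for inference and for the adapter training later in the notebook.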

Notice the `model` variable in the code cell below. If you would like to try out a different model, replace the contents of this variable with the model you would like to try. Just remember to be mindful of the computational resources you have available.

The first step is to create a tokenizer so that we can feed input into the model. Remember that we can't just pass a query into the model as text; it has to be converted into a numeric format first. To create the tokenizer I am going to use the `AutoTokenizer` class from HuggingFace. [Visit this link](https://huggingface.co/docs/transformers/v4.32.0/en/model_doc/auto#transformers.AutoTokenizer) to see the documentation for this class. We can pass the name of the model we would like to use to `AutoTokenizer` and it will automatically set up the tokenizer we need to interact with the model.

In [10]:
model = "tiiuae/Falcon3-7B-Base"

tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



Next I need to set up a `BitsAndBytesConfig`. We use the BitsAndBytes library to change the parameter precision so that the 7 billion parameter model can be loaded without running out of GPU RAM in compute-limited environments. As you can see below, the model is loaded in 4-bit format by setting the `load_in_4bit` parameter.

In [11]:
bb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
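To give a feel for what storing a weight in 4 bits means, here is a toy sketch. It uses plain uniform "absmax" rounding, which is a deliberate simplification and not the NF4 scheme selected in the config above (NF4 uses non-uniformly spaced levels). The weight values and function names are made up for illustration.

```python
# Toy illustration of 4-bit quantization: map floats onto a small set of integers.
# Assumption: uniform absmax rounding, NOT the NF4 scheme used by bitsandbytes.

def quantize_4bit(weights):
    """Return (integer codes in [-7, 7], scale) for a non-zero list of floats."""
    scale = max(abs(w) for w in weights) / 7  # the largest weight maps to +/-7
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.97, -0.08, 0.31]  # made-up example values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"original={w:+.2f}  code={c:+d}  restored={r:+.3f}")
```

Each code fits in 4 bits, and the restored values differ from the originals by at most half a quantization step. That rounding error is the price paid for the memory savings.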

Below you can see the code for actually loading my model. We use another Auto class from HuggingFace to do this, this time `AutoModelForCausalLM`. We give it the name of the model we want to load and the BitsAndBytes configuration we just defined. When you run this code cell you should see it load the model which will take a while to download and load into memory.

In [12]:
falcon_model = AutoModelForCausalLM.from_pretrained(
    model,
    quantization_config=bb_config,
    use_cache=False,
    low_cpu_mem_usage=True
)


Now that the model is loaded let's do some inference examples with it. Below you can see a prompt for the model in the `text` variable, in this case a question about the national bird of the United States. We use the tokenizer to convert the prompt into something we can feed the model. Then we use the `generate` function to give the model the tokenized prompt and save the result as `outputs`. The printed result says that the national bird is the bald eagle, which is correct.

In [13]:
text = "Question: What is the national bird of the United States? \n Answer: "

inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
outputs = falcon_model.generate(input_ids=inputs.input_ids, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Question: What is the national bird of the United States? 
 Answer: 

The national bird of the United States is the Bald Eagle. The Bald Eagle is a symbol of strength, freedom, and the American spirit.


Below is another similar example with a new prompt. Feel free to change the prompt to experiment with the model. Creating a good prompt is more of an art than a science and you may find that small differences in the way you phrase a question can have large impacts on how the model interprets it and responds.

In [14]:
text2 = "How do I make teriyaki sauce?"

inputs = tokenizer(text2, return_tensors="pt").to("cuda:0")
outputs = falcon_model.generate(input_ids=inputs.input_ids, max_new_tokens=500, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

How do I make teriyaki sauce?
OrObject: Teriyaki sauce is a popular Japanese condiment made from soy sauce, mirin, sake, and sugar.

Step 1: Gather your ingredients.
You will need:
- 1/2 cup soy sauce
- 1/2 cup mirin (sweet rice wine)
- 1/4 cup sake
- 1/4 cup granulated sugar
- 1 tablespoon grated ginger
- 1 clove garlic, minced

Step 2: Combine soy sauce, mirin, sake, and sugar in a medium saucepan.
Whisk the ingredients together until the sugar is completely dissolved.

Step 3: Add grated ginger and minced garlic to the saucepan.
Stir well to combine.

Step 4: Bring the mixture to a boil over medium heat.
Once boiling, reduce the heat to low and simmer for 10 minutes, stirring occasionally.

Step 5: Strain the sauce through a fine-mesh sieve into a clean container.
This will remove any solids and give you a smooth sauce.

Step 6: Store the sauce in the refrigerator for up to 2 weeks.
Use it as a marinade or dipping sauce for meats, vegetables, and rice dishes.

Step 7: Adjust the sea

Now let's change things up a bit. For the finetuning portion of this project I am going to be using the MedText dataset. This dataset contains blurbs of patient symptoms and then an associated diagnosis and action plan to treat the symptoms. The hope with my finetuning is that I can give the model a description of symptoms as the prompt and then produce a reasonable response as output. Let's test my current model to see how it performs on this task as a baseline. Below you can see one of the entries in the MedText dataset as my prompt.  

In [15]:
text3 = "A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps? "

inputs = tokenizer(text3, return_tensors="pt").to("cuda:0")
outputs = falcon_model.generate(input_ids=inputs.input_ids, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps? (Choose all that apply)

A. Ankle sprain
B. Lateral ankle ligament tear
C. Medial malleolar fracture
D. Peroneal tendon injury
E. Achilles tendon rupture

**Answer:**
A. Ankle sprain
B. Lateral ankle ligament tear
D. Peroneal tendon injury

**Explanation:**
- **Ankle sprain:** The patient's symptoms of swelling, pain, and inability to bear weight on the ankle are consistent with


We can see that the response from my model is reasonable but not in the format or detail that I am hoping for. It tried to format a response as a multiple choice question which is not what we want in this case. Keep this response in mind as I will compare it to the output generated by my finetuned model at the end of this project.

---


### Finetuning on the MedText Dataset

As promised I am going to finetune my Falcon model using the MedText dataset. The first step is going to be to define some arguments for training. Below you can see the configuration I am going to be using for this which is saved as `training_args`.

In [16]:
training_args = TrainingArguments(
    output_dir="./finetuned_falcon",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16 = True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    logging_steps=1,
    num_train_epochs=1,
    optim = "paged_adamw_8bit",
    report_to="none"
)

We are going to use the `peft` library to make finetuning possible on the hardware available in this notebook. It lets us finetune a small set of "adapter" parameters that are incorporated into the existing model, rather than attempting to finetune all 7 billion parameters we already have. This makes training much more feasible computationally and also helps keep the model from "forgetting" things it already knows during finetuning. Below you can see the configuration I am going to use to make this happen. To determine what to put in `target_modules` it can be helpful to print out the model and find all the `Linear4bit` layers. We use that configuration along with the `get_peft_model` function to generate a new model that has adapters we can finetune, which we store in the variable `lora_model`.

In [17]:
print(falcon_model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(131072, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=23040, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=23040, bias=False)
          (down_proj): Linear4bit(in_features=23040, out_features=3072, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-06)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-06)
      )
    )
    (norm): LlamaRM

In [18]:
falcon_model.gradient_checkpointing_enable()
falcon_model = prepare_model_for_kbit_training(falcon_model)

lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ]
)

lora_model = get_peft_model(falcon_model, lora_config)

The method below counts and prints the number and percentage of parameters that will be trained compared to the total number of parameters in the model. You should see that we will be training only ~0.15% of the total parameters, which tells us the LoRA setup was successful.

In [19]:
lora_model.print_trainable_parameters()

trainable params: 11,067,392 || all params: 7,466,617,856 || trainable%: 0.1482
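The trainable-parameter count printed above can be checked by hand from the layer shapes in the `print(falcon_model)` output. The sketch below assumes the usual LoRA layout, where each adapted `Linear(in, out)` layer gains two small matrices of shape `(r, in)` and `(out, r)`. The total parameter count is copied from the printed output rather than computed.

```python
# Reproduce the trainable-parameter count from the printed layer shapes.
# Each LoRA adapter on a Linear(in, out) layer adds r * (in + out) weights.
R = 4            # LoRA rank from lora_config
NUM_LAYERS = 28  # decoder layers in the printed model

# (in_features, out_features) of every targeted module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 23040),
    "up_proj": (3072, 23040),
    "down_proj": (23040, 3072),
}

per_layer = sum(R * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS

ALL_PARAMS = 7_466_617_856  # "all params" value copied from the output above
percent = 100 * trainable / ALL_PARAMS

print(f"adapter weights per layer: {per_layer:,}")
print(f"trainable params: {trainable:,}")
print(f"trainable%: {percent:.4f}")
```

Because the adapter size scales linearly with `r`, doubling the rank would double the number of trainable parameters while leaving the frozen base model untouched.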


Now I need my dataset that I am going to use for finetuning. We are going to use MedText which is hosted on the HuggingFace Hub. This means I can use the `load_dataset` function to import my dataset as seen below.

In [20]:
dataset = load_dataset("BI55/MedText", split="train")

README.md: 0.00B [00:00, ?B/s]

medtext_2.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/1412 [00:00<?, ? examples/s]

For finetuning I am going to combine my prompts and responses. Below you can see where we create a Pandas DataFrame from my dataset, grab the Prompt and Completion columns, and combine both in a new Info column.

In [21]:
import pandas as pd

df = pd.DataFrame(dataset)
prompt = df.pop("Prompt")
comp = df.pop("Completion")
df["Info"] = prompt + "\n" + comp

In [22]:
print(df["Info"][0])

A 50-year-old male presents with a history of recurrent kidney stones and osteopenia. He has been taking high-dose vitamin D supplements due to a previous diagnosis of vitamin D deficiency. Laboratory results reveal hypercalcemia and hypercalciuria. What is the likely diagnosis, and what is the treatment?
This patient's history of recurrent kidney stones, osteopenia, and high-dose vitamin D supplementation, along with laboratory findings of hypercalcemia and hypercalciuria, suggest the possibility of vitamin D toxicity. Excessive intake of vitamin D can cause increased absorption of calcium from the gut, leading to hypercalcemia and hypercalciuria, which can result in kidney stones and bone loss. Treatment would involve stopping the vitamin D supplementation and potentially providing intravenous fluids and loop diuretics to promote the excretion of calcium.


A lot of the work that goes into finetuning is preprocessing the data you wish to feed the model. Not only do we need to run all the data through the tokenizer, but we also need to keep the dimensions of the data in line with what the model expects. Below you can see the function I used to handle the tokenization and return the tokenized results. This function is based on the tokenizing function written by Harveen Singh Chadha, which is linked in the comment below.

In [24]:
# https://www.kaggle.com/code/harveenchadha/tokenize-train-data-using-bert-tokenizer
def tokenizing(text, tokenizer, chunk_size, maxlen):
    """Tokenize a list of strings in chunks, padding or truncating each to maxlen tokens."""
    input_ids = []
    # Use the end-of-sequence token for padding.
    tokenizer.pad_token = tokenizer.eos_token

    # Process the texts chunk_size entries at a time.
    for i in range(0, len(text), chunk_size):
        text_chunk = text[i:i+chunk_size]
        encs = tokenizer(
            text_chunk,
            max_length=maxlen,
            padding='max_length',
            truncation=True,
        )

        input_ids.extend(encs['input_ids'])

    return {'input_ids': input_ids}
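To see what the chunking, padding, and truncation do without loading the real tokenizer, here is a toy sketch that uses only the standard library. The whitespace "tokenizer", the ids it assigns, and `PAD_ID = 0` are all hypothetical stand-ins; the Falcon tokenizer splits text into subword tokens and pads with its end-of-sequence token instead.

```python
# Toy stand-in for the chunked tokenization above.
# Assumptions: a made-up whitespace tokenizer and a made-up padding id of 0.
PAD_ID = 0


def toy_encode(sentence, vocab):
    """Map each whitespace-separated word to an integer id, growing vocab as needed."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in sentence.split()]


def pad_or_truncate(ids, max_length):
    """Force a sequence to exactly max_length ids."""
    ids = ids[:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


def toy_tokenizing(texts, chunk_size, max_length):
    """Encode texts chunk by chunk so every row ends up with the same length."""
    vocab = {}
    input_ids = []
    for i in range(0, len(texts), chunk_size):
        chunk = texts[i:i + chunk_size]
        input_ids.extend(pad_or_truncate(toy_encode(t, vocab), max_length) for t in chunk)
    return {"input_ids": input_ids}


texts = [
    "patient reports ankle pain",
    "rest and ice",
    "order an x-ray to rule out a fracture",
]
result = toy_tokenizing(texts, chunk_size=2, max_length=6)
for row in result["input_ids"]:
    print(row)
```

Short entries are padded and long ones are cut off, so every row has the same length. That uniform shape is what lets the rows be stacked into batches during training.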

Now we can use this function to preprocess the data. We pass the `Info` column we created earlier into this function as a list, along with the tokenizer we initialized earlier, and receive tokens as output. We can use the HuggingFace `Dataset` class to easily create a dataset from these tokens and then split it into `train` and `test` subsets.

In [25]:
tokens = tokenizing(list(df["Info"]), tokenizer, 256, 2048)
tokens_dataset = Dataset.from_dict(tokens)
split_dataset = tokens_dataset.train_test_split(test_size=0.2)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 1129
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 283
    })
})


We now only need one more piece before I can start training - the Trainer itself. Below you can see that we initialize the `trainer` by giving it the LoRA model we created, my training arguments configuration, and my dataset.

In [29]:
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

I will now perform the finetuning. This process will take some time depending on your hardware.


In [None]:
trainer.train()
trainer.model.save_pretrained("./finetuned_falcon")



---


### Testing the Finetuned Model

Now that my finetuning is complete it's time to see if it gives a better response to the MedText data than it did before. We first need to load in my finetuned model as seen below.

In [30]:
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained('./gtc_llm_session')

# if using already finetuned files
finetuned_model = PeftModel.from_pretrained(falcon_model, './gtc_llm_session')

# comment out the PeftModel line above and uncomment this if you are finetuning in this notebook
#finetuned_model = PeftModel.from_pretrained(falcon_model, './finetuned_falcon')



Then I will do inference just like we did before with the same prompt used previously.

In [31]:
text4 = "A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps?"

inputs = tokenizer(text4, return_tensors="pt").to("cuda:0")
outputs = finetuned_model.generate(input_ids=inputs.input_ids, max_new_tokens=75, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

A 25-year-old female presents with swelling, pain, and inability to bear weight on her left ankle following a fall during a basketball game where she landed awkwardly on her foot. The pain is on the outer side of her ankle. What is the likely diagnosis and next steps?
This patient's symptoms are suggestive of an ankle sprain, specifically an inversion sprain, which is common in basketball players. The next steps would include a physical examination to assess the severity of the sprain, including range of motion, swelling, and tenderness. An X-ray may be ordered to rule out any fractures. Treatment typically involves rest, ice, compression, and elevation


As you can see the response is now much more similar to the MedText responses. Not only are the recommendations the model gives reasonable, but the way the response is written is also much more cohesive and professional than the multiple choice question format it gave us before. This shows how finetuning can be used to change the behavior of an LLM pretty dramatically.

# **RAG Example**

In this section I will be going through a basic RAG (Retrieval-Augmented Generation) workflow using a dataset of recipes and [Falcon3-7B-Base](https://huggingface.co/tiiuae/Falcon3-7B-Base). The goal will be to see the model use specific information from [this recipe dataset](https://huggingface.co/datasets/Shengtao/recipe) in its responses after we build out the RAG pipeline.

---
### Baseline Inference

Let's set a baseline of what kind of responses my model gives us without using RAG so that I can hopefully see a difference with RAG. First we set up the prompts we would like to use. Feel free to edit or add to these if you are curious about another prompt.

In [32]:
prompt_task = "Question: "
prompt_end = "\nAnswer: "

prompts = [
    "What two ingredients do I need for two-ingredient pizza dough?",
    "What do I need for a quick tartar sauce?",
    "What should I preheat the oven to to make blueberry muffins?"
]

Now we feed them into my model to see what responses we get.

In [33]:
for prompt in prompts:
  prompt = prompt_task + prompt + prompt_end
  inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

  generate_ids = falcon_model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=25, pad_token_id=tokenizer.eos_token_id)

  print("\n")
  print(tokenizer.decode(generate_ids[0]))
  print("\n------------------------------------------------------------")

print("\n\nAll done")



Question: What two ingredients do I need for two-ingredient pizza dough?
Answer: 2 cups of flour and 1 cup of water.

Question: What is the main ingredient in a pizza?


------------------------------------------------------------


Question: What do I need for a quick tartar sauce?
Answer: 1/2 cup mayonnaise, 1 tablespoon lemon juice, 1 tablespoon sweet pickle relish, 1/2 teaspoon salt

------------------------------------------------------------


Question: What should I preheat the oven to to make blueberry muffins?
Answer: 350 degrees Fahrenheit.

Question: What is the capital of France?
Answer: Paris.

Question

------------------------------------------------------------


All done


---
### Import Recipe Dataset

The results seem reasonable. However, we would like the model to use recipe information from a recipe dataset rather than whatever it remembers from pretraining. To start that process we first need to import the dataset. We download it from Hugging Face and keep only the first 500 entries. This is done purely to save time: creating embeddings for every entry takes a long time, and a demo does not need the full dataset.

In [34]:
df = pd.read_csv("hf://datasets/Shengtao/recipe/recipe.csv")
df = df.head(500)

df

Unnamed: 0,title,url,category,author,description,rating,rating_count,review_count,ingredients,directions,...,vitamin_k_mcg,biotin_mcg,vitamin_b12_mcg,mono_fat_g,poly_fat_g,trans_fatty_acid_g,omega_3_fatty_acid_g,omega_6_fatty_acid_g,instructions_list,image
0,Simple Macaroni and Cheese,https://www.allrecipes.com/recipe/238691/simpl...,main-dish,g0dluvsugly,A very quick and easy fix to a tasty side-dish...,4.42,834,575,1 (8 ounce) box elbow macaroni ; ¼ cup butter ...,Bring a large pot of lightly salted water to a...,...,,,,,,,,,['Bring a large pot of lightly salted water to...,https://www.allrecipes.com/thmb/GZrTl8DBwmRuor...
1,Gourmet Mushroom Risotto,https://www.allrecipes.com/recipe/85389/gourme...,main-dish,Myleen Sagrado Sjödin,Authentic Italian-style risotto cooked the slo...,4.80,3388,2245,"6 cups chicken broth, divided ; 3 tablespoons ...","In a saucepan, warm the broth over low heat. W...",...,,,,,,,,,"['Warm broth in a saucepan over low heat.', 'M...",https://www.allrecipes.com/thmb/xCk4IEjfAYBikO...
2,Dessert Crepes,https://www.allrecipes.com/recipe/19037/desser...,breakfast-and-brunch,ANN57,Essential crepe recipe. Sprinkle warm crepes ...,4.80,1156,794,"4 eggs, lightly beaten ; 1 ⅓ cups milk ; 2 ta...","In large bowl, whisk together eggs, milk, melt...",...,,,,,,,,,"['Whisk together eggs, milk, flour, melted but...",https://www.allrecipes.com/thmb/VwULr05JFDluPI...
3,Pork Steaks,https://www.allrecipes.com/recipe/70463/pork-s...,meat-and-poultry,BABYLOVE1222,My mom came up with this recipe when I was a c...,4.57,689,539,¼ cup butter ; ¼ cup soy sauce ; 1 bunch green...,"Melt butter in a skillet, and mix in the soy s...",...,,,,,,,,,['Melt butter in a skillet over medium heat; s...,https://www.allrecipes.com/thmb/mYkvln7o9pb35l...
4,Quick and Easy Pizza Crust,https://www.allrecipes.com/recipe/20171/quick-...,bread,CHEF RIDER,This is a great recipe when you don't want to ...,4.70,3741,2794,1 (.25 ounce) package active dry yeast ; 1 tea...,Preheat oven to 450 degrees F (230 degrees C)....,...,,,,,,,,,['Preheat oven to 450 degrees F (230 degrees C...,https://www.allrecipes.com/thmb/V3Llo-ottudIs_...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Spicy Grilled Shrimp,https://www.allrecipes.com/recipe/12775/spicy-...,seafood,SUBEAST,"So fast and easy to prepare, these shrimp are ...",4.60,1121,836,1 large clove garlic ; 1 teaspoon coarse salt ...,Preheat grill for medium heat. In a small bowl...,...,,,,,,,,,"['Preheat a grill for medium heat.', 'Crush ga...",https://www.allrecipes.com/thmb/bH-TT4Cifwew96...
496,Grilled Chicken Marinade,https://www.allrecipes.com/recipe/241890/grill...,meat-and-poultry,Jennifer,Simply the best chicken marinade for any occas...,4.77,236,167,¼ cup red wine vinegar ; ¼ cup reduced-sodium ...,"Whisk vinegar, soy sauce, olive oil, parsley, ...",...,,,,,,,,,"['Whisk vinegar, soy sauce, olive oil, parsley...",https://www.allrecipes.com/thmb/b8GwNUs4HIz9-X...
497,Wool Roll Bread,https://www.allrecipes.com/recipe/284058/wool-...,uncategorized,Chef John,I've spun quite a few yarns but one thing I've...,0.00,0,0,½ cup water ; ¼ cup all-purpose flour ; ½ cup ...,"To make ""water roux,"" whisk together water and...",...,,,,,,,,,"['To make ""water roux,"" whisk together water a...",https://www.allrecipes.com/thmb/BL8waIMOSis0wl...
498,Easy Pizza Sauce I,https://www.allrecipes.com/recipe/11771/easy-p...,side-dish,Frank Sweterlitsch,A simple pizza sauce used by many pizzerias. T...,4.34,570,430,1 (6 ounce) can tomato paste ; 1 ½ cups water ...,"Mix together the tomato paste, water, and oliv...",...,,,,,,,,,"['Mix together water, tomato paste, and olive ...",https://www.allrecipes.com/thmb/aYLD9MOpFl7p7A...


Now I am going to combine the `title`, `directions`, and `ingredients` columns into one entry per row. This will be what we embed for my model to reference using RAG.

In [35]:
text = []
for i, row in df.iterrows():
    text.append(row['title'] + ": " + row['directions'] + " Ingredients list: " + row['ingredients'])

print("Made recipe list")

Made recipe list


---
### Embedding

Now I am going to set up the embedding model. We will be using the [SentenceTransformers](https://sbert.net/) library for this example but there are lots of other libraries/frameworks out there for doing RAG.

We will be using the `all-mpnet-base-v2` model which offers a nice balance of size and performance. There are many models available through SentenceTransformers. Some are more accurate but larger and slower to run while others are more lightweight but may suffer in performance. If you would like to learn more or try out a different model see the [documentation linked here](https://sbert.net/docs/sentence_transformer/pretrained_models.html).

In [36]:
st_model = SentenceTransformer("all-mpnet-base-v2")


MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



In the cell below we take each entry in my list of recipes and embed it using the `all-mpnet-base-v2` model. `.encode` creates my embeddings which we append to a list called `embeddings`.

In [37]:
embeddings = []

for line in text:
    embedding = st_model.encode(line, convert_to_tensor=True)
    embeddings.append(embedding)

print("Made recipe embeddings")

Made recipe embeddings


---
### Inference Using RAG

**Note on Storing Embeddings:**

In this demo, embeddings are stored in memory in a plain Python list (`embeddings = []`) for simplicity. However, for real-world RAG systems with large datasets, it's highly recommended to use a **vector database** (or vector store). These databases are optimized for efficiently storing and searching large collections of high-dimensional embeddings, enabling fast and scalable retrieval.

Now I am ready to see the results of using my embeddings in my prompts for `Falcon3-7B-Base`. For each prompt in my list of prompts we run through the loop below. We first encode the prompt and then use `semantic_search` from SentenceTransformers to find the most similar embedded recipes. With the `top_k=2` argument we only get the top two results. Then we reference which index these results correspond to and fetch the appropriate elements from my `text` recipe list. These results are included in the prompt and the prompt is passed to my LLM just like before.

In [38]:
for prompt in prompts:
    prompt = prompt_task + prompt + prompt_end

    prompt_embed = st_model.encode(prompt, convert_to_tensor=True)
    hits = semantic_search(prompt_embed, embeddings, top_k=2)

    result_line1 = hits[0][0]['corpus_id']
    result_line2 = hits[0][1]['corpus_id']

    prompt = text[result_line1] + "\n" + text[result_line2] + "\n\n" + prompt

    inputs = tokenizer(prompt, return_tensors="pt")

    generate_ids = falcon_model.generate(input_ids=inputs.input_ids.to('cuda'), attention_mask=inputs.attention_mask.to('cuda'), max_new_tokens=25, pad_token_id=tokenizer.eos_token_id)

    print("\n")
    print(tokenizer.decode(generate_ids[0]))
    print("\n------------------------------------------------------------")

print("\n\nAll done")



Two-Ingredient Pizza Dough: Mix flour and Greek yogurt together in a bowl; transfer to a work surface floured with self-rising flour. Knead dough, adding more flour as needed to keep dough from being too sticky, for 8 to 10 minutes. Spray a 12-inch pizza pan with cooking spray and spread dough to edges of pan. Ingredients list: 1 ½ cups self-rising flour, plus more for kneading ; 1 cup plain Greek yogurt ;   cooking spray
No-Yeast Pizza Crust: Mix flour, baking powder, and salt together in a bowl; stir in milk and olive oil until a soft dough forms. Turn dough onto a lightly floured surface and knead 10 times. Shape dough into a ball. Cover dough with an inverted bowl and let sit for 10 minutes. Roll dough into a 12-inch circle on a baking sheet. Ingredients list: 1 ⅓ cups all-purpose flour ; 1 teaspoon baking powder ; ½ teaspoon salt ; ½ cup fat-free milk ; 2 tablespoons olive oil

Question: What two ingredients do I need for two-ingredient pizza dough?
Answer: 1 ½ cups self-rising 

In these results you can see the dataset entries that were retrieved, prepended to the beginning of each prompt. The LLM then answers the questions using these bits of information from the dataset, giving different answers than in the previous inference example without RAG. This demonstrates how we can use RAG to add information we would like the model to reference when answering questions.
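The retrieval step itself is small enough to sketch with the standard library alone. The example below ranks a few corpus entries by cosine similarity to a query and prepends the best matches to the prompt, mirroring the loop above. The three-dimensional vectors and the short recipe strings are made up for illustration; the real embeddings come from `all-mpnet-base-v2` and are much longer, and `top_k_search` is a hypothetical helper, not the library's `semantic_search`.

```python
import math

# Minimal top-k retrieval sketch.
# Assumptions: tiny hand-written vectors stand in for real sentence embeddings.


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_search(query_vec, corpus_vecs, top_k=2):
    """Return the top_k corpus entries as dicts with 'corpus_id' and 'score'."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


corpus_text = ["pizza dough recipe", "tartar sauce recipe", "blueberry muffin recipe"]
corpus_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]

question = "Question: What do I need for pizza dough?\nAnswer: "
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the question

hits = top_k_search(query_vec, corpus_vecs, top_k=2)
context = "\n".join(corpus_text[hit["corpus_id"]] for hit in hits)
rag_prompt = context + "\n\n" + question
print(rag_prompt)
```

A vector database does the same ranking, but with index structures that avoid comparing the query against every stored embedding.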

---

### Helpful Resources for Further Study

As mentioned in the lecture portion of this project, we did not have time to cover many important aspects of how LLMs work. Below I have links to resources I found useful in understanding how Transformers work as well as other concepts mentioned in this session.

Papers:
*   Eight Things to Know about Large Language Models paper: https://arxiv.org/abs/2304.00612
*   Attention is All You Need (original transformers paper): https://arxiv.org/abs/1706.03762
*   LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685
*   QLoRA: Efficient Finetuning of Quantized LLMs: https://arxiv.org/abs/2305.14314

Blogs:
*   Blog explaining building a Transformer model from scratch: https://peterbloem.nl/blog/transformers
*   Blog on Bits and Bytes: https://huggingface.co/blog/hf-bitsandbytes-integration
*   Hugging Face's Blog on Falcon: https://huggingface.co/blog/falcon
*   Using LoRA for finetuning: https://dataman-ai.medium.com/fine-tune-a-gpt-lora-e9b72ad4ad3
*   LoRA Parameter Overview: https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms


YouTube videos:
*   University of Waterloo lecture on Attention and Transformers: https://www.youtube.com/watch?v=OyFJWRnt_AY
*   Attention is All You Need paper explanation: https://www.youtube.com/watch?v=w76Dpp7b3B4


