# Establishing a threshold

This notebook was run with the following configuration (in `parameters.toml`):
```
input_text_folder           = "data/raw_input_files"

client_source               = "https://75ce65ac-3f32-4f99-9994-130310c38fc1.europe-west3-0.gcp.cloud.qdrant.io"
encoder_name                = "intfloat/e5-base"
collection_name             = "nutrition_faq"
force_replace_collection    = "False"

distance_type               = "COSINE"
retrieve_k                  = 5
min_similarity_threshold    = 0

llm_model                   = "gpt-4o-mini"
llm_temperature             = 0
llm_system_prompt_path      = "prompts/llm_system_prompt.txt"
```

It also relied on parts of the `setup_vector_db()` function in `src/nutritionrag/rag_pipeline.py` that have since been commented out.

The notebook documents an alternative methodology that has since been abandoned. It cannot be run with the current setup without changing the codebase.


## Setup


In [20]:
import os
import pandas as pd
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 5)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 200)

from nutritionrag.rag_pipeline import rag_setup_qdrant, query_vector_db_list_qdrant, rag_query_list_qdrant

In [2]:
%cd ../..

/home/szaboildi/code/szaboildi/nutrition-rag




In [3]:
try:
    import tomllib # type: ignore
except ModuleNotFoundError:
    import tomli as tomllib

with open(os.path.join("parameters.toml"), mode="rb") as fp:
    config = tomllib.load(fp)

config_name = "default"
from_scratch = False

In [4]:
eval_df = pd.read_csv(os.path.join("data", "eval", "test_questions_raw.csv"))
query_list = eval_df["user_question"].to_list()

In [5]:
vector_db_client, encoder, llm_client = rag_setup_qdrant(
    config=config[config_name])

Text cleaned in data/raw_input_files/faq_data.json
Vector database created
RAG setup complete


## Retrieval

In [7]:
raw_answers = query_vector_db_list_qdrant(
    vector_db_client, encoder, query_list, config=config[config_name])

In [8]:
# Data formatting
processed_answers = []

# Unpack the payloads into one row per (question, retrieved document) pair
for answer in raw_answers:
    for doc in answer["retrieved"]:
        processed_answers.append(
            {"user_question": answer["user_question"], **doc})

# Attach the evaluation labels, then summarize the cosine range per question
processed_answers = pd.DataFrame(processed_answers).merge(eval_df, how="inner")
processed_answers_grouped = (
    processed_answers
    .groupby(["user_question", "answerable"])
    .agg({"cosine": ["min", "max"]})
    .reset_index()
)
processed_answers_grouped.columns = ["user_question", "answerable", "min_cosine", "max_cosine"]

In [9]:
# processed_answers.loc[~(processed_answers.answerable)]

In [22]:
processed_answers_grouped

| | user_question | answerable | min_cosine | max_cosine |
|---|---|---|---|---|
| 0 | Are any foods no-go for someone with diabetes? | True | 0.87305 | 0.891398 |
| 1 | As a diabetic, should I choose an apple or a cake for dessert? | True | 0.861635 | 0.870484 |
| 2 | As a diabetic, should I skip either lunch or dinner? | True | 0.88339 | 0.915602 |
| 3 | Can I drink a caramel cappuccino, if I have diabetes? | True | 0.84244 | 0.862628 |
| 4 | Can I eat white bread as a diabetic? | True | 0.85452 | 0.885428 |
| 5 | Can you eat berries with diabetes? | True | 0.864016 | 0.889805 |
| 6 | Can you eat pineapple with diabetes? | True | 0.833887 | 0.850812 |
| 7 | I'm considering intermittent fasting. Could it help me maintain my blood sugar? | True | 0.871183 | 0.896058 |
| 8 | Is it better to have a high blood sugar or a low blood sugar? | True | 0.8981 | 0.90801 |
| 9 | Should I not eat carbohydrates at all as a diabetic? | True | 0.870366 | 0.900012 |


Based on these questions, no consistent minimum cosine similarity can be established as a cutoff (with these embeddings). If the cutoff is set at, for example, 0.89 (the highest cosine similarity for a question that cannot be answered from the dataset, question #13), then questions #1 ("As a diabetic, should I choose an apple or a cake for dessert?") and #4 ("Can I eat white bread as a diabetic?") would be treated as unanswerable, which is incorrect. The answer to #1 has to be pieced together from two answers, and the topic of #4 is mentioned in an answer to differently structured questions.
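
The check described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than code from the pipeline: the helper name `answerable_below_cutoff` is hypothetical, and the records are a few rows copied from the grouped table above (in the notebook they could be obtained with `processed_answers_grouped.to_dict("records")`).

```python
def answerable_below_cutoff(records, cutoff):
    """Return answerable questions whose best hit scores below the cutoff.

    Hypothetical helper: each record is a dict with the keys
    "user_question", "answerable" and "max_cosine".
    """
    return [
        r["user_question"]
        for r in records
        if r["answerable"] and r["max_cosine"] < cutoff
    ]


# A few rows copied from the grouped table above
records = [
    {"user_question": "Are any foods no-go for someone with diabetes?",
     "answerable": True, "max_cosine": 0.891398},
    {"user_question": "As a diabetic, should I choose an apple or a cake for dessert?",
     "answerable": True, "max_cosine": 0.870484},
    {"user_question": "As a diabetic, should I skip either lunch or dinner?",
     "answerable": True, "max_cosine": 0.915602},
    {"user_question": "Can I eat white bread as a diabetic?",
     "answerable": True, "max_cosine": 0.885428},
]

# Answerable questions that a 0.89 cutoff would wrongly reject
wrongly_rejected = answerable_below_cutoff(records, cutoff=0.89)
for question in wrongly_rejected:
    print(question)
```

Any non-empty result means the chosen cutoff rejects at least one question that the dataset can answer.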

## RAG

In [None]:
rag_responses = rag_query_list_qdrant(
    query_list, vector_db_client, encoder, llm_client, config[config_name])

In [12]:
qa_df = pd.DataFrame({"user_question": rag_responses[0], "llm_response": rag_responses[1]})
qa_df_meta = pd.DataFrame([{**item, "user_question": row["user_question"]} for row in rag_responses[2] for item in row["retrieved"]])

rag_df_processed = qa_df.merge(qa_df_meta, how="inner")

In [21]:
rag_df_processed

| | user_question | llm_response | question | answer | cosine |
|---|---|---|---|---|---|
| 0 | Are any foods no-go for someone with diabetes? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | What are some unhealthy foods for people with diabetes? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | 0.891398 |
| 1 | Are any foods no-go for someone with diabetes? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | What foods should I avoid as a diabetic? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | 0.883724 |
| 2 | Are any foods no-go for someone with diabetes? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | Are there any foods I should stay away from with diabetes? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | 0.883347 |
| 3 | Are any foods no-go for someone with diabetes? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | What should diabetics not eat? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | 0.882046 |
| 4 | Are any foods no-go for someone with diabetes? | Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts. | Are there low-sugar snacks that are good for people with diabetes? | Healthy snack options include Greek yogurt, almonds, boiled eggs, and vegetables with hummus. | 0.873050 |
| ... | ... | ... | ... | ... | ... |
| 70 | Should I not eat carbohydrates at all as a diabetic? | Yes, but focus on complex carbs like whole grains, legumes, and vegetables, and control portions. | Should I avoid all carbs with diabetes? | Yes, but focus on complex carbs like whole grains, legumes, and vegetables, and control portions. | 0.900012 |
| 71 | Should I not eat carbohydrates at all as a diabetic? | Yes, but focus on complex carbs like whole grains, legumes, and vegetables, and control portions. | Are carbohydrates bad for diabetics? | Yes, but focus on complex carbs like whole grains, legumes, and vegetables, and control portions. | 0.886881 |
| 72 | Should I not eat carbohydrates at all as a diabetic? | Yes, but focus on complex carbs like whole grains, legumes, and vegetables, and control portions. | Can I eat carbs if I have diabetes? | Yes, but focus on complex carbs like whole grains, legumes, and vegetables, and control portions. | 0.882225 |
| 73 | Should I not eat carbohydrates at all as a diabetic? | Yes, but focus on complex carbs like whole grains, legumes, and vegetables, and control portions. | What kind of carbs can I eat with diabetes? | Yes, but focus on complex carbs like whole grains, legumes, and vegetables, and control portions. | 0.873397 |


In [17]:
for i, row in rag_df_processed[["user_question", "llm_response"]].drop_duplicates().iterrows():
    print(row["user_question"])
    print(row["llm_response"])
    print("")

Are any foods no-go for someone with diabetes?
Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts.

Can you eat berries with diabetes?
Yes, you can eat berries with diabetes. They are a good option due to their fiber content. Just watch portion sizes.

Can you eat pineapple with diabetes?
Sorry, I don't have information on that. Please try a different question.

Can I eat white bread as a diabetic?
Yes, it is advisable to avoid white bread as a diabetic.

What's a good lunch for someone with diabetes?
Sorry, I don't have information on that. Please try a different question.

Can I drink a caramel cappuccino, if I have diabetes?
Sorry, I don't have information on that. Please try a different question.

I'm considering intermittent fasting. Could it help me maintain my blood sugar?
Sorry, I don't have information on that. Please try a different question.

What's your favorite snack?
Sorry, I don't have information on that. Please try a different question.

As a d

This approach yields the same number of false negatives (5/15), i.e. questions where the fallback answer is given although the question should have been answerable from the dataset.
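
A false-negative count of this kind could be computed roughly as follows. This is a sketch under assumptions, not the notebook's own evaluation code: `count_false_negatives` is a hypothetical helper, the rows are a small subset of the responses printed above, and the `answerable` label for "What's your favorite snack?" is assumed for illustration because that question does not appear in the grouped table.

```python
FALLBACK = "Sorry, I don't have information on that. Please try a different question."


def count_false_negatives(rows, fallback=FALLBACK):
    """Count answerable questions that still received the fallback answer."""
    return sum(
        1
        for row in rows
        if row["answerable"] and row["llm_response"].strip() == fallback
    )


# Subset of the printed responses; "answerable" comes from the grouped table,
# except for the last row, whose label is an assumption.
rows = [
    {"user_question": "Can you eat berries with diabetes?",
     "answerable": True,
     "llm_response": "Yes, you can eat berries with diabetes. They are a good "
                     "option due to their fiber content. Just watch portion sizes."},
    {"user_question": "Can you eat pineapple with diabetes?",
     "answerable": True, "llm_response": FALLBACK},
    {"user_question": "Can I drink a caramel cappuccino, if I have diabetes?",
     "answerable": True, "llm_response": FALLBACK},
    {"user_question": "I'm considering intermittent fasting. Could it help me maintain my blood sugar?",
     "answerable": True, "llm_response": FALLBACK},
    {"user_question": "What's your favorite snack?",
     "answerable": False,  # assumed label, for illustration only
     "llm_response": FALLBACK},
]

false_negatives = count_false_negatives(rows)
print(f"{false_negatives}/{len(rows)} false negatives in this subset")
```

The fallback given to an unanswerable question is the intended behavior, so it is not counted.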

The LLM responses are also less well suited to the given questions than with the question-to-question similarity embeddings. For example:
Q: "Can I eat white bread as a diabetic?" A: "Yes, it is advisable to avoid white bread as a diabetic."
This could be circumvented by further processing the question-answer pairs, for example by removing polarity markers.
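
One possible form of the polarity cleanup is sketched below. It is only a heuristic written for illustration: the function name `strip_polarity` is hypothetical, and it assumes that a polarity marker is a leading "Yes" or "No" followed by punctuation, which the dataset may not always follow.

```python
import re

# Leading "Yes"/"No" followed by punctuation, e.g. "Yes, " or "No. "
_POLARITY = re.compile(r"^\s*(?:yes|no)\s*[,.;:!]\s*", re.IGNORECASE)


def strip_polarity(answer):
    """Drop a leading yes/no marker and re-capitalize the remainder."""
    stripped = _POLARITY.sub("", answer, count=1)
    if stripped == answer or not stripped:
        # Nothing to remove, or the answer consisted only of the marker
        return answer
    return stripped[0].upper() + stripped[1:]


print(strip_polarity(
    "Yes, but focus on complex carbs like whole grains, legumes, "
    "and vegetables, and control portions."))
print(strip_polarity(
    "Avoid sugary drinks, processed snacks, white bread, and high-sugar desserts."))
```

Answers that start with words such as "No-go" are left untouched because the pattern requires punctuation directly after the marker. Whether the remaining text still reads naturally (e.g. an answer that now starts with "But") would need to be checked per answer.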