# Dependencies

Besides the default Colab libraries, the only dependencies of this code are the bm25s and Transformers libraries.

In [1]:
!pip install bm25s -qqq
!pip install transformers -qqq

# Dataset Reader

Here we write the reader that loads the dataset from a CSV file (with "question" and "answer" columns) and converts it to the format used by the rest of the pipeline.

The dataset is sliced into train, validation, and test splits. The reader supports K-fold cross-validation, but we use only the first fold as a single train/validation split.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, KFold

class Dataset(object):

    def __init__(self, source):
        self.df = pd.read_csv(source)

    def split_data(self, test_size=0.2, random_state=42):
        train_df, test_df = train_test_split(self.df, test_size=test_size, random_state=random_state)
        return train_df, test_df

    def kfold_split(self, train_df, n_splits=5, random_state=42):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        folds = []
        for train_index, val_index in kf.split(train_df):
            train_fold = train_df.iloc[train_index]
            val_fold = train_df.iloc[val_index]
            folds.append((train_fold, val_fold))
        return folds

    def read_corpus(self, df, include_questions=True):
        corpus = []
        records = df.to_dict("records")
        for row in records:
            output_row = ""
            question = str(row["question"])
            answer = str(row["answer"])
            if include_questions:
                output_row += f"{question} \n"
            output_row += f"{answer}"
            corpus.append(output_row)
        return corpus
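To make the corpus format concrete, the sketch below reproduces the string-building logic of `read_corpus` on two made-up records, using plain dictionaries instead of a DataFrame. The helper name `format_record` and the example rows are hypothetical and are not part of the dataset.

```python
# Illustrative only: mirrors the string layout produced by Dataset.read_corpus.
def format_record(row, include_questions=True):
    """Return the corpus entry for one record (hypothetical helper)."""
    output_row = ""
    if include_questions:
        # The question, when included, is followed by a space and a newline.
        output_row += f"{row['question']} \n"
    output_row += str(row["answer"])
    return output_row

# Made-up records standing in for rows of the CSV file.
records = [
    {"question": "What is X?", "answer": "X is a placeholder."},
    {"question": "What is Y?", "answer": "Y is another placeholder."},
]

with_questions = [format_record(r, include_questions=True) for r in records]
answers_only = [format_record(r, include_questions=False) for r in records]

print(repr(with_questions[0]))
print(repr(answers_only[0]))
```

With `include_questions=False`, each corpus entry is the answer text alone, which is the setting the notebook uses when building the retrieval corpus.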

# Retriever

We define the Retriever module of our RAG system. Since BM25 is a strong baseline, we use it as our retriever of choice, specifically the BM25+ variant with fixed, untuned parameters (k1=1.5, b=0.75, delta=1.5).

In [3]:
import bm25s

class Retriever(object):

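  def __init__(self, corpus, k1=1.5, b=0.75, delta=1.5):
    # BM25+ variant; `delta` is the lower-bound term added by BM25+.
    self.model = bm25s.BM25(method="bm25+", k1=k1, b=b, delta=delta)
    self.corpus = corpus
    self.tokens = None

  def tokenize(self):
    # Tokenize the corpus (English stopwords removed, no stemming) and build the index.
    corpus_tokens = bm25s.tokenize(self.corpus, stopwords="en", stemmer=None)
    self.model.index(corpus_tokens)

  def query(self, query, k=2):
    # No corpus is attached to the model, so the results are corpus indices:
    # one row of k indices per query.
    query_tokens = bm25s.tokenize(query, stemmer=None)
    results = self.model.retrieve(query_tokens, k=k, return_as="documents")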
    return results
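For intuition about what the `k1`, `b`, and `delta` parameters control, here is a minimal pure-Python sketch of a textbook BM25+ scoring function. It is an assumption-laden illustration, not the bm25s implementation: the function name `bm25_plus_scores`, the IDF form `log((N + 1) / df)`, and the choice to skip query terms absent from a document are all choices made for this sketch, and bm25s may differ in those details.

```python
import math
from collections import Counter

def bm25_plus_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75, delta=1.5):
    """Score every document against the query with a textbook BM25+ formula."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term_freq[term] == 0:
                continue  # in this sketch, absent terms contribute nothing
            idf = math.log((n_docs + 1) / doc_freq[term])
            # b controls how strongly long documents are penalized.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            # k1 controls how quickly repeated occurrences saturate.
            tf_part = (k1 + 1) * term_freq[term] / (norm + term_freq[term])
            # delta is the BM25+ lower bound added for every matching term.
            score += idf * (tf_part + delta)
        scores.append(score)
    return scores

# Toy, already-tokenized corpus (made up for illustration).
corpus_tokens = [
    ["diabetes", "symptoms", "include", "thirst"],
    ["kidney", "disease", "treatment"],
    ["diabetes", "treatment", "insulin"],
]
scores = bm25_plus_scores(["diabetes", "symptoms"], corpus_tokens)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The document matching both query terms ranks first, the one matching a single term second, and the one matching neither last.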

# Generator

The generator module of our RAG will take the documents supplied by the retriever and generate an answer. We are opting for lightweight LLMs available on Hugging Face Hub for this module.

In [4]:
from transformers import pipeline
class Generator(object):

  def __init__(self, model="Qwen/Qwen2.5-1.5B-Instruct", device="cuda"):
    self.pipe = pipeline("text-generation", model=model, device=device)

  def predict(self, messages):
    return self.pipe(messages, max_new_tokens=512)
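`predict` returns the raw pipeline output, which for chat-style input is a nested structure rather than a plain string. The sketch below shows one way to pull out the reply, assuming the pipeline appends the model's reply as a final message with role `assistant` inside `generated_text`. The helper name `extract_answer` and the mocked output are hypothetical.

```python
def extract_answer(pipeline_output):
    """Return the assistant's text from a chat-style pipeline result (hypothetical helper)."""
    conversation = pipeline_output[0]["generated_text"]
    last_message = conversation[-1]
    if last_message["role"] != "assistant":
        raise ValueError("no assistant message found at the end of the conversation")
    return last_message["content"]

# Mocked output with the nesting assumed above: a list with one dict whose
# "generated_text" holds the conversation as a list of role/content messages.
mock_output = [{
    "generated_text": [
        {"role": "user", "content": "User Query: example question"},
        {"role": "assistant", "content": "Example answer."},
    ]
}]
print(extract_answer(mock_output))
```

Raising on an unexpected final role keeps a silent format change from being mistaken for an answer.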

# Preprocessing

We preprocess the dataset, building the train and validation answer corpora (answers only, without the questions) that will be inserted into our BM25 index.

In [5]:
reader = Dataset("dataset.csv")
train_val, test = reader.split_data()
folds = reader.kfold_split(train_val)
train, validation = folds[0][0], folds[0][1]

In [6]:
train_corpus = reader.read_corpus(train, include_questions=False)
validation_corpus = reader.read_corpus(validation, include_questions=False)
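The split proportions follow from the defaults `test_size=0.2` and `n_splits=5`: 20% of the rows are held out for testing, and each fold then uses one fifth of the remainder for validation. The stdlib-only sketch below works through that arithmetic on 100 made-up row indices. It is an approximation for illustration, not scikit-learn's implementation (for example, it assigns validation rows by striding rather than in contiguous blocks), and the function name `holdout_then_kfold` is hypothetical.

```python
import math
import random

def holdout_then_kfold(n_rows, test_size=0.2, n_splits=5, seed=42):
    """Partition row indices into a test set plus K (train, validation) folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    test, train_val = indices[:n_test], indices[n_test:]
    folds = []
    for fold in range(n_splits):
        # Every n_splits-th row of the remainder goes to this fold's validation set.
        validation = train_val[fold::n_splits]
        held_out = set(validation)
        train = [i for i in train_val if i not in held_out]
        folds.append((train, validation))
    return test, folds

test_idx, fold_list = holdout_then_kfold(100)
first_train, first_validation = fold_list[0]
print(len(test_idx), len(first_train), len(first_validation))
```

Under these defaults the first fold therefore holds roughly 64% of all rows for training and 16% for validation, with 20% reserved for testing.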

# Model Training

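BM25 has no trainable weights, so "training" here means tokenizing the corpus (train plus validation answers) and building the index.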

In [7]:
retriever = Retriever(train_corpus + validation_corpus)
retriever.tokenize()

Split strings:   0%|          | 0/13124 [00:00<?, ?it/s]

DEBUG:bm25s:Building index from IDs objects


BM25S Count Tokens:   0%|          | 0/13124 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/13124 [00:00<?, ?it/s]

# Evaluation

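Here, we evaluate the accuracy of our BM25 retriever on the validation set using the top 20 results (ACC@20). With the default `limit=10`, only the first 10 validation questions are scored.

The retriever reached an ACC@20 of 0.9 on this sample, meaning that the correct answer appeared in the top 20 results for 9 of the 10 questions.

This suggests that the retriever performs at an adequate level for our RAG system, although a sample of 10 questions gives only a rough estimate.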

In [8]:
def evaluate_retriever(retriever, validation, limit=10):
  # Score only the first `limit` validation rows.
  validation_records = validation.to_dict("records")
  questions = [ str(row["question"]) for row in validation_records ][0:limit]
  answers = [ str(row["answer"]) for row in validation_records ][0:limit]
  # One row of 20 corpus indices per question, ranked by BM25 score.
  predictions = retriever.query(questions, k=20)
  acc_array = []
  for prediction_idx, query_result in enumerate(predictions):
    for ranked_doc in query_result:
      # A hit is an exact string match with the gold answer.
      if retriever.corpus[ranked_doc] == answers[prediction_idx]:
        acc_array.append(True)
        break
    else:
      # for/else: runs only when no retrieved document matched.
      acc_array.append(False)
  acc_20 = sum(acc_array) / len(acc_array)
  return acc_20

evaluate_retriever(retriever, validation)

Split strings:   0%|          | 0/10 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/10 [00:00<?, ?it/s]

0.9
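The ACC@k metric used above can be stated compactly: a query counts as a hit when its gold document appears anywhere in the top k results, and the score is the fraction of hits. The sketch below is a minimal illustration on made-up rankings. It compares document ids, whereas the notebook compares answer strings, and the function name `accuracy_at_k` is hypothetical.

```python
def accuracy_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold document id appears in the top-k ranking."""
    hits = [gold in ranking[:k] for ranking, gold in zip(ranked_ids, gold_ids)]
    return sum(hits) / len(hits)

# Made-up rankings for three queries (document ids, best first).
ranked_ids = [[4, 1, 7], [2, 9, 3], [5, 6, 8]]
gold_ids = [1, 3, 0]

acc_at_1 = accuracy_at_k(ranked_ids, gold_ids, k=1)
acc_at_3 = accuracy_at_k(ranked_ids, gold_ids, k=3)
print(acc_at_1, acc_at_3)
```

Because the metric ignores the position of the hit within the top k, it grows (or stays flat) as k increases and says nothing about how high the correct answer was ranked.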

# Inference

In this step we simply put together the components of our RAG system into a "Model" class. The retriever supplies the context for a prompt, and this prompt is then forwarded to our generator (LLM).

In [9]:
class Model(object):

  def __init__(self, retriever, generator):
    self.retriever = retriever
    self.generator = generator

  def build_prompt(self, query, k=3):
    prompt = f"You are a medical question-answering system that can effectively answer user queries related to medical diseases. \n \n User Query: {query} \n \n"
    context = [self.retriever.corpus[x] for x in self.retriever.query(query, k=k)[0]]
    prompt += "Context: \n \n"
    for item in context:
      prompt += f"{item} \n "
    prompt += "\n \n \n Your answer:"
    return prompt

  def predict(self, query, k=3):
    prompt = self.build_prompt(query, k=k)
    prediction = self.generator.predict([{
        "role": "user", "content": prompt
    }])
    return prediction
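The data flow through the class can be checked without a GPU or a BM25 index by swapping in stand-ins. The sketch below is a simplified illustration, not the notebook's code: `StubRetriever`, `StubGenerator`, and `build_prompt_sketch` are hypothetical names, the prompt omits the system instruction, and the stub generator assumes the chat output format described by the real pipeline's results.

```python
class StubRetriever:
    """Stand-in for Retriever: returns fixed indices instead of running BM25."""
    def __init__(self, corpus):
        self.corpus = corpus

    def query(self, query, k=2):
        # Same shape as the real retriever: one row of indices per query.
        return [list(range(min(k, len(self.corpus))))]

class StubGenerator:
    """Stand-in for Generator: appends a canned reply in chat format."""
    def predict(self, messages):
        reply = {"role": "assistant", "content": "stub answer"}
        return [{"generated_text": messages + [reply]}]

def build_prompt_sketch(retriever, query, k=3):
    # Simplified prompt with the same ordering as Model.build_prompt:
    # query first, retrieved context next, answer cue last.
    context = [retriever.corpus[i] for i in retriever.query(query, k=k)[0]]
    prompt = f"User Query: {query} \n \nContext: \n \n"
    prompt += "".join(f"{item} \n " for item in context)
    return prompt + "\n \n \n Your answer:"

stub_retriever = StubRetriever(["doc A", "doc B", "doc C", "doc D"])
stub_generator = StubGenerator()
stub_prompt = build_prompt_sketch(stub_retriever, "example question", k=2)
stub_output = stub_generator.predict([{"role": "user", "content": stub_prompt}])
print(stub_prompt)
```

Only the first `k` retrieved documents reach the prompt, so `k` directly trades context coverage against prompt length.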

In [10]:
generator = Generator(model="Qwen/Qwen2-0.5B-Instruct")

In [11]:
model = Model(retriever, generator)

# Inference Examples

In [12]:
print(model.build_prompt("What are symptoms of diabetes?", k=3))

You are a medical question-answering system that can effectively answer user queries related to medical diseases. 
 
 User Query: What are symptoms of diabetes? 
 
Context: 
 
People can manage their diabetes with meal planning, physical activity, and if needed, medications. More information about taking care of type 1 or type 2 diabetes is provided in the NIDDK health topics:
                
- What I need to know about Diabetes Medicines  - What I need to know about Eating and Diabetes  - Your Guide to Diabetes: Type 1 and Type 2
                
These NDIC publications are available at http://www.niddk.nih.gov/health-information/health-topics/Diabetes/Pages/default.aspx or by calling 18008608747. 
 What causes nephrogenic diabetes insipidus? Nephrogenic diabetes insipidus can be either acquired or hereditary. The acquired form can result from chronic kidney disease, certain medications (such as lithium), low levels of potassium in the blood (hypokalemia), high levels of calcium in t

Question: "What are symptoms of diabetes?"

Model answer: "Diabetes can cause a variety of symptoms, including:\n\n  1. High blood sugar levels: People with type 1 diabetes have high blood sugar levels, which can lead to nerve damage and blindness.\n  2. Fatigue and weakness: People with type 2 diabetes may experience fatigue and weakness, especially when they exercise.\n  3. Heart problems: People with type 2 diabetes have higher risk of heart attacks and strokes than those without it.\n  4. Kidney problems: People with type 1 diabetes have higher risk of kidney failure than those without it.\n\nSymptoms can vary depending on the specific type of diabetes, so it's important for people with diabetes to talk to their healthcare provider about their symptoms and how they might be affecting their daily life."

In [13]:
model.predict("What are symptoms of diabetes?", k=3)

[{'generated_text': [{'role': 'user',
    'content': 'You are a medical question-answering system that can effectively answer user queries related to medical diseases. \n \n User Query: What are symptoms of diabetes? \n \nContext: \n \nPeople can manage their diabetes with meal planning, physical activity, and if needed, medications. More information about taking care of type 1 or type 2 diabetes is provided in the NIDDK health topics:\n                \n- What I need to know about Diabetes Medicines  - What I need to know about Eating and Diabetes  - Your Guide to Diabetes: Type 1 and Type 2\n                \nThese NDIC publications are available at http://www.niddk.nih.gov/health-information/health-topics/Diabetes/Pages/default.aspx or by calling 18008608747. \n What causes nephrogenic diabetes insipidus? Nephrogenic diabetes insipidus can be either acquired or hereditary. The acquired form can result from chronic kidney disease, certain medications (such as lithium), low levels of 

Question: "How to diagnose keratitis-ichthyosis-deafness syndrome?"

Model answer: "To diagnose KID syndrome, it's important to first rule out other potential causes of the symptoms. For example, if you suspect that your symptoms might be caused by another underlying disease, such as diabetes or thyroid disorders, you should seek medical attention immediately. Additionally, you may need to undergo additional tests, such as an eye examination or imaging studies, to confirm the diagnosis.\nFor example, to confirm the diagnosis of KID syndrome, your doctor may order an eye exam and possibly perform an MRI scan of the brain to see if there is any damage or injury to the optic nerve or retina. They may also order a CT scan or X-ray of the skull to check for any signs of damage or injury to the facial bones.\nOnce确诊了KID syndrome，接下来需要进行一系列的检查以确认诊断。例如，如果怀疑可能由其他潜在原因引起的症状，如糖尿病或甲状腺疾病等，你应该立即寻求医疗帮助。此外，你可能还需要进行一些额外的测试，如眼部检查或脑部影像学检查，以确认诊断。\n例如，如果您怀疑可能存在KID综合征，您的医生可能会安排眼科检查，并可能进行MRI扫描来查看是否有任何神经损伤或视力损伤；或者他们可能会进行CT扫描或X光扫描来检查面部骨骼是否有损伤或其他损害。"

In [14]:
model.predict("How to diagnose keratitis-ichthyosis-deafness syndrome?")

[{'generated_text': [{'role': 'user',
    'content': 'You are a medical question-answering system that can effectively answer user queries related to medical diseases. \n \n User Query: How to diagnose keratitis-ichthyosis-deafness syndrome? \n \nContext: \n \nKeratitis-ichthyosis-deafness (KID) syndrome is characterized by eye problems, skin abnormalities, and hearing loss.  People with KID syndrome usually have keratitis, which is inflammation of the front surface of the eye (the cornea). The keratitis may cause pain, increased sensitivity to light (photophobia), abnormal blood vessel growth over the cornea (neovascularization), and scarring. Over time, affected individuals experience a loss of sharp vision (reduced visual acuity); in severe cases the keratitis can lead to blindness.  Most people with KID syndrome have thick, hard skin on the palms of the hands and soles of the feet (palmoplantar keratoderma). Affected individuals also have thick, reddened patches of skin (erythroker

Question: "What is the treatment for Guillain-Barr syndrome?"

Model answer: 'Guillain-Barr Syndrome Treatment: Plasma Exchange and High-Dose Immunoglobulin'

In [15]:
model.predict("What is the treatment for Guillain-Barr syndrome?")

[{'generated_text': [{'role': 'user',
    'content': "You are a medical question-answering system that can effectively answer user queries related to medical diseases. \n \n User Query: What is the treatment for Guillain-Barr syndrome? \n \nContext: \n \nGuillain-Barr syndrome is a rare disorder in which the body's immune system attacks part of the peripheral nervous system. Symptoms include muscle weakness, numbness, and tingling sensations, which can increase in intensity until the muscles cannot be used at all. Usually Guillain-Barr syndrome occurs a few days or weeks after symptoms of a viral infection. Occasionally, surgery or vaccinations will trigger the syndrome. It remains unclear why only some people develop Guillain-Barr syndrome but there may be a genetic predisposition in some cases. Diagnosed patients should be admitted to a hospital for early treatment. There is no cure for Guillain-Barr syndrome, but treatments such as plasma exchange (plasmapheresis) and high dose immu