# Baseline Model Notebook
This notebook loads the PubMedQA medical QA dataset and runs the baseline medical-LLM inference.

# Retrieval-Augmented Generation (RAG) Experiment


## Project Overview:

This notebook demonstrates the implementation of a basic RAG pipeline using Hugging Face models and external knowledge sources.  
The model is tested with and without retrieval to observe hallucination behavior.


## Notebook Outline:

**Importing Libraries**  
Initialization of required modules such as Transformer models, retrievers, and utility functions.  


**Loading Model & Tokenizer**  
Preparing the language model for generation.  


**Wikipedia Retriever Setup**  
Connecting a retriever to fetch relevant real-world information.  


**Baseline Model Response (Without Retrieval)**  
Generating a response directly from the model to observe hallucinations.  


**Retrieved Context + Model Response (With RAG)**  
Generating a grounded response using external retrieved data.  


**Observation:**  
Comparing both outputs to analyze whether hallucination is reduced.  


**DuckDuckGo Retriever Attempt (Optional – Not Used in Final Output)**  
This section shows an attempt to fetch web search results using a secondary retriever.  
The search returned an empty result list in this run, so it is not part of the final evaluation; it is kept to document the experiment.  


## Objective:

To compare model responses with vs. without external retrieval and identify cases of hallucination, demonstrating how RAG improves factual accuracy.


In [1]:
!pip install datasets transformers sentencepiece accelerate wikipedia


Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=a476c5d5c3e4b40113a70b8db2b9bfa004f05d848fc4a04f76d9d14ef0d9c477
  Stored in directory: /root/.cache/pip/wheels/63/47/7c/a9688349aa74d228ce0a9023229c6c0ac52ca2a40fe87679b8
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [2]:
from transformers import pipeline


In [3]:
qa_pipeline = pipeline(
    "question-answering",
    model="dmis-lab/biobert-base-cased-v1.1",
    tokenizer="dmis-lab/biobert-base-cased-v1.1"
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at dmis-lab/biobert-base-cased-v1.1 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


In [4]:
baseline_answer = qa_pipeline({
    "question": "Can antibiotics help treat a viral infection?",
    "context": "Antibiotics are medications designed to treat bacterial infections. They do not work against viruses, such as the common cold or flu."
})

baseline_answer




{'score': 0.0031653267797082663,
 'start': 56,
 'end': 113,
 'answer': 'infections. They do not work against viruses, such as the'}

In [6]:
import pprint
pprint.pprint(baseline_answer)

print("Answer:", baseline_answer['answer'])
print("Confidence:", round(baseline_answer['score'], 3))


{'answer': 'infections. They do not work against viruses, such as the',
 'end': 113,
 'score': 0.0031653267797082663,
 'start': 56}
Answer: infections. They do not work against viruses, such as the
Confidence: 0.003


In [8]:
hallucinated_answer = qa_pipeline({
    "question": "Can antibiotics help treat a viral infection?",
    "context": "The capital of France is Paris. Tigers are carnivores and live in forests. The sun is a star."
})

hallucinated_answer



{'score': 0.005624846206046641,
 'start': 43,
 'end': 78,
 'answer': 'carnivores and live in forests. The'}

In [9]:
# Define the irrelevant context once so the pipeline input and the
# span check below use exactly the same string.
context = "The capital of France is Paris. Tigers are carnivores and live in forests. The sun is a star."

hallucinated_answer = qa_pipeline({
    "question": "Can antibiotics help treat a viral infection?",
    "context": context
})

print("\n--- Hallucinated Model Output ---")
print("Answer:", hallucinated_answer['answer'])
print("Confidence:", round(hallucinated_answer['score'], 3))
print("Start Index:", hallucinated_answer['start'])
print("End Index:", hallucinated_answer['end'])

# The answer is a character slice of the context: context[start:end].
print("\nExtracted Span from Context:")
print(context[hallucinated_answer['start']:hallucinated_answer['end']])



--- Hallucinated Model Output ---
Answer: carnivores and live in forests. The
Confidence: 0.006
Start Index: 43
End Index: 78

Extracted Span from Context:
carnivores and live in forests. The


**Observation**

The model returned an answer span even when the provided context contained no medically relevant information. This behavior demonstrates a form of hallucination, where the model produces output without factual grounding. In this run, however, the confidence score for the irrelevant context (0.006) was not lower than the score for the correct context (0.003); both are close to zero. The load warning above also reports that the question-answering head weights were newly initialized, so these scores come from an untrained head. On this evidence, confidence cannot yet be treated as a reliable signal for detecting hallucinations; that would need to be re-tested with a model fine-tuned for question answering.
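To check the score comparison directly, the two recorded scores can be ranked side by side. The sketch below only reuses the values printed by cells In [6] and In [9]; the labels and the helper name `rank_by_score` are illustrative and not part of the original notebook.

```python
# Scores copied from the outputs of cells In [6] and In [9] in this notebook.
recorded = {
    "relevant context": 0.0031653267797082663,
    "irrelevant context": 0.005624846206046641,
}


def rank_by_score(scores):
    """Return (label, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_score(recorded)
for label, score in ranking:
    print(f"{label:>20}: {score:.4f}")

# Ratio of the highest to the lowest score, to see how far apart they are.
ratio = ranking[0][1] / ranking[-1][1]
print(f"highest / lowest = {ratio:.2f}")
```

Running a check like this before drawing conclusions makes the direction and the size of the gap explicit, which matters when both scores are this close to zero.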

In [10]:
!pip install duckduckgo-search


Collecting duckduckgo-search
  Downloading duckduckgo_search-8.1.1-py3-none-any.whl.metadata (16 kB)
Collecting primp>=0.15.0 (from duckduckgo-search)
  Downloading primp-0.15.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading duckduckgo_search-8.1.1-py3-none-any.whl (18 kB)
Downloading primp-0.15.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: primp, duckduckgo-search
Successfully installed duckduckgo-search-8.1.1 primp-0.15.0


In [11]:
from duckduckgo_search import DDGS

query = "Do antibiotics work for viral infections PubMed medical research"
search_results = list(DDGS().text(query, max_results=3))

search_results




[]
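Because a retriever can return an empty list, as it did here, one option is to try several retrievers in order and keep the first non-empty result. The following is a minimal sketch under stated assumptions: `first_nonempty`, `empty_search`, and `canned_search` are hypothetical names, and the two stub functions stand in for real retrievers so the example runs offline.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Retriever = Callable[[str], List[str]]


def first_nonempty(
    query: str, retrievers: Iterable[Tuple[str, Retriever]]
) -> Tuple[Optional[str], List[str]]:
    """Try each (name, retriever) pair in order; return the first non-empty result."""
    for name, retrieve in retrievers:
        try:
            results = retrieve(query)
        except Exception as exc:  # broad on purpose: network and rate-limit errors vary
            print(f"{name} failed: {exc}")
            continue
        if results:
            return name, results
    return None, []


# Hypothetical stand-ins: the first returns nothing, like the search call above.
def empty_search(query: str) -> List[str]:
    return []


def canned_search(query: str) -> List[str]:
    return [f"stub passage about {query}"]


source, passages = first_nonempty(
    "antibiotics viral infection",
    [("web", empty_search), ("encyclopedia", canned_search)],
)
print(source, passages)
```

Returning the retriever name alongside the passages keeps a record of which source the context came from, which helps when comparing runs.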



In [12]:
!pip install wikipedia



In [14]:
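import wikipedia

# Note: `query` is defined here but is not passed to the lookup below;
# the summary is fetched for the page title "Antibiotic misuse".
query = "Antibiotics viral infection"
result_text = wikipedia.summary("Antibiotic misuse", sentences=3)
result_text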




'Antibiotic misuse, sometimes called antibiotic abuse or antibiotic overuse, refers to the misuse or overuse of antibiotics, with potentially serious effects on health. It is a contributing factor to the development of antibiotic resistance, including the creation of multidrug-resistant bacteria, informally called "super bugs": relatively harmless bacteria (such as Staphylococcus, Enterococcus and Acinetobacter) can develop resistance to multiple antibiotics and cause life-threatening infections.\n\n\n== History of antibiotic regulation ==\nAntibiotics have been around since 1928 when penicillin was discovered by Alexander Fleming.'



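### Summary of Retrieved Information

The following text is the three-sentence summary of the Wikipedia page *"Antibiotic misuse"*, fetched using the Wikipedia library. The `query` string *"Antibiotics viral infection"* was defined in the cell above but was not used for the lookup.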
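A retrieval step driven by the query string itself could first search for page titles and then fetch the first summary that loads. The sketch below is an assumption about how that might look, not the notebook's method: `retrieve_summary` is a hypothetical helper, and the search and summary functions are passed in as arguments so the example runs offline with stand-ins.

```python
from typing import Callable, List, Optional


def retrieve_summary(
    query: str,
    search_fn: Callable[[str], List[str]],
    summary_fn: Callable[..., str],
    sentences: int = 3,
) -> Optional[str]:
    """Search for page titles, then return the first summary that loads."""
    for title in search_fn(query):
        try:
            return summary_fn(title, sentences=sentences)
        except Exception:
            # Skip titles that fail to load (for example missing or ambiguous pages).
            continue
    return None


# Offline stand-ins (hypothetical) so the sketch runs without network access.
def fake_search(query: str) -> List[str]:
    return ["Missing page", "Antibiotic misuse"]


def fake_summary(title: str, sentences: int = 3) -> str:
    if title == "Missing page":
        raise LookupError(title)
    return f"Summary of {title} ({sentences} sentences)."


result = retrieve_summary("Antibiotics viral infection", fake_search, fake_summary)
print(result)
```

With the `wikipedia` package used in this notebook, `wikipedia.search` and `wikipedia.summary` could be passed as `search_fn` and `summary_fn`; the first search hit is not guaranteed to be relevant, so the returned text should still be inspected.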


In [15]:
print(result_text)


Antibiotic misuse, sometimes called antibiotic abuse or antibiotic overuse, refers to the misuse or overuse of antibiotics, with potentially serious effects on health. It is a contributing factor to the development of antibiotic resistance, including the creation of multidrug-resistant bacteria, informally called "super bugs": relatively harmless bacteria (such as Staphylococcus, Enterococcus and Acinetobacter) can develop resistance to multiple antibiotics and cause life-threatening infections.


== History of antibiotic regulation ==
Antibiotics have been around since 1928 when penicillin was discovered by Alexander Fleming.
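The retrieved text includes a wiki section heading (`== History of antibiotic regulation ==`) and runs of blank lines. One possible cleanup before using the text as context is sketched below; `strip_section_headings` and the regular expression are illustrative choices, not part of the original notebook.

```python
import re

# Matches a whole line of the form "== Heading ==" (two or more "=" on each side).
HEADING = re.compile(r"^[ \t]*={2,}[^=\n]+={2,}[ \t]*$", re.MULTILINE)


def strip_section_headings(text: str) -> str:
    """Remove wiki-style section heading lines and collapse whitespace runs."""
    without_headings = HEADING.sub("", text)
    return re.sub(r"\s+", " ", without_headings).strip()


sample = (
    "First sentence.\n\n\n"
    "== History of antibiotic regulation ==\n"
    "Antibiotics have been around since 1928."
)
cleaned = strip_section_headings(sample)
print(cleaned)
```

Cleaning changes character positions, so any `start` and `end` offsets returned by the pipeline refer to the cleaned string rather than the raw summary.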




In [16]:
rag_answer = qa_pipeline({
    "question": "Can antibiotics help treat a viral infection?",
    "context": result_text
})

rag_answer



{'score': 0.00013549314462579787,
 'start': 264,
 'end': 314,
 'answer': 'of multidrug-resistant bacteria, informally called'}



**Observation**

The model returned an answer even though the retrieved summary, which covers antibiotic misuse and resistance, does not state whether antibiotics treat viral infections.
The extracted span is unrelated to the question, and the confidence score is extremely low (about 0.0001), indicating uncertainty.
This behavior demonstrates a hallucination, where the model produces an answer despite insufficient or irrelevant context.
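One way to avoid returning an unsupported span is to abstain when the score falls below a cutoff. The sketch below assumes a hypothetical helper `answer_or_abstain` and an arbitrary threshold of 0.5; neither comes from the notebook, and the threshold would need tuning on labeled examples. The input dictionary is copied from the output of cell In [16].

```python
ABSTAIN_MESSAGE = "The context does not contain enough information to answer the question."


def answer_or_abstain(result: dict, threshold: float = 0.5) -> str:
    """Return the extracted answer only when its score reaches `threshold`."""
    if result.get("score", 0.0) < threshold:
        return ABSTAIN_MESSAGE
    return result["answer"]


# Pipeline output copied from cell In [16].
rag_result = {
    "score": 0.00013549314462579787,
    "start": 264,
    "end": 314,
    "answer": "of multidrug-resistant bacteria, informally called",
}

print(answer_or_abstain(rag_result))
```

Every score recorded in this notebook is below 0.01, so any threshold that rejects this span would also reject the span extracted from the correct context; the cutoff is only meaningful once the model produces well-separated scores.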

**Expected Behavior**

The ideal response would have been:

"The context does not contain enough information to answer the question."