# Baseline Model Notebook
This notebook loads the PubMedQA medical QA dataset and runs the baseline medical-LLM inference.

# Retrieval-Augmented Generation (RAG) Experiment


## Project Overview:

This notebook demonstrates the implementation of a basic RAG pipeline using Hugging Face models and external knowledge sources.  
The model is tested with and without retrieval to observe hallucination behavior.


## Notebook Outline:

**Importing Libraries**  
Initialization of required modules such as Transformer models, retrievers, and utility functions.  


**Loading Model & Tokenizer**  
Preparing the language model for generation.  


**Wikipedia Retriever Setup**  
Connecting a retriever to fetch relevant real-world information.  


**Baseline Model Response (Without Retrieval)**  
Generating a response directly from the model to observe hallucinations.  


**Retrieved Context + Model Response (With RAG)**  
Generating a grounded response using external retrieved data.  


**Observation:**  
Comparing both outputs to analyze whether hallucination is reduced.  


**DuckDuckGo Retriever Attempt (Optional – Not Used in Final Output)**  
This section shows an attempt to fetch web search results using a secondary retriever.  
Because of API behavior and result limitations, it is not part of the final evaluation, but it is kept to document the experimentation.  


## Objective:

To compare model responses with vs. without external retrieval and identify cases of hallucination, demonstrating how RAG improves factual accuracy.
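
The comparison described above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not this notebook's implementation: `compare_with_and_without_retrieval`, `toy_reader`, and `toy_retriever` are hypothetical names, and the stand-in functions only mimic the shape of a question-answering result (`answer`, `score`).

```python
def compare_with_and_without_retrieval(question, fixed_context, reader, retriever):
    """Run the same reader on a fixed context and on a retrieved context."""
    without_retrieval = reader(question, fixed_context)
    with_retrieval = reader(question, retriever(question))
    return {"without_retrieval": without_retrieval, "with_retrieval": with_retrieval}


# Stand-in components (assumptions for illustration only).
def toy_retriever(question):
    # A real retriever would look the question up in an external source.
    return "Antibiotics treat bacterial infections and do not work against viruses."


def toy_reader(question, context):
    # Pretend the reader is confident only when the context mentions the topic.
    relevant = "antibiotic" in context.lower()
    return {"answer": context.split(".")[0], "score": 0.9 if relevant else 0.1}


results = compare_with_and_without_retrieval(
    "Can antibiotics help treat a viral infection?",
    "The capital of France is Paris.",
    toy_reader,
    toy_retriever,
)
for name, result in results.items():
    print(name, "->", result["answer"], f"(score={result['score']})")
```

In the cells that follow, the reader role could be played by a thin wrapper such as `lambda q, c: qa_pipeline({"question": q, "context": c})`, once `qa_pipeline` has been created.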


In [None]:
!pip install datasets transformers sentencepiece accelerate wikipedia


In [None]:
from transformers import pipeline


In [None]:
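# Build an extractive question-answering pipeline on top of BioBERT.
# Note: this checkpoint is the base BioBERT encoder. If transformers warns that
# the QA head weights are newly initialized, the scores below come from an
# untrained head, and a SQuAD-fine-tuned BioBERT checkpoint may be a better choice.
qa_pipeline = pipeline(
    "question-answering",
    model="dmis-lab/biobert-base-cased-v1.1",
    tokenizer="dmis-lab/biobert-base-cased-v1.1"
)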


In [None]:
baseline_answer = qa_pipeline({
    "question": "Can antibiotics help treat a viral infection?",
    "context": "Antibiotics are medications designed to treat bacterial infections. They do not work against viruses, such as the common cold or flu."
})

baseline_answer


In [None]:
import pprint
pprint.pprint(baseline_answer)

print("Answer:", baseline_answer['answer'])
print("Confidence:", round(baseline_answer['score'], 3))


In [None]:
hallucinated_answer = qa_pipeline({
    "question": "Can antibiotics help treat a viral infection?",
    "context": "The capital of France is Paris. Tigers are carnivores and live in forests. The sun is a star."
})

hallucinated_answer



In [None]:
# Irrelevant context: it says nothing about antibiotics or viruses.
context = "The capital of France is Paris. Tigers are carnivores and live in forests. The sun is a star."

hallucinated_answer = qa_pipeline({
    "question": "Can antibiotics help treat a viral infection?",
    "context": context
})

print("\n--- Hallucinated Model Output ---")
print("Answer:", hallucinated_answer['answer'])
print("Confidence:", round(hallucinated_answer['score'], 3))
print("Start Index:", hallucinated_answer['start'])
print("End Index:", hallucinated_answer['end'])

# 'start' and 'end' are character offsets into the context string.
print("\nExtracted Span from Context:")
print(context[hallucinated_answer['start']:hallucinated_answer['end']])


**Observation**

The model returned an answer even though the provided context contained no medically relevant information. Because the question-answering pipeline is extractive, it selects a span from whatever context it is given, so the output here has no factual grounding in the question. This is the hallucination-like behavior the experiment is meant to expose. Although the response was incorrect, its confidence score was noticeably lower than that of the response produced with the correct context, which suggests that confidence can act as a useful signal for detecting hallucinations in medical LLMs.
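
One way to make this confidence comparison explicit is to compute the gap between two results and rank them by score. The sketch below is only an illustration: `confidence_gap` and `rank_by_confidence` are hypothetical helpers, and the example dictionaries hold placeholder scores, not outputs of this notebook.

```python
def confidence_gap(grounded, ungrounded):
    """Difference between the grounded and ungrounded confidence scores."""
    return round(grounded["score"] - ungrounded["score"], 3)


def rank_by_confidence(labelled_results):
    """Return the labels sorted from most to least confident."""
    ordered = sorted(
        labelled_results.items(),
        key=lambda item: item[1]["score"],
        reverse=True,
    )
    return [label for label, _ in ordered]


# Placeholder values for illustration only; not outputs of this notebook.
example_grounded = {"answer": "bacterial infections", "score": 0.8}
example_ungrounded = {"answer": "Paris", "score": 0.2}

print("Gap:", confidence_gap(example_grounded, example_ungrounded))
print("Ranking:", rank_by_confidence(
    {"relevant context": example_grounded, "irrelevant context": example_ungrounded}
))
```

With the real outputs, the same helpers could be called on `baseline_answer` and `hallucinated_answer`.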

In [None]:
!pip install duckduckgo-search


In [None]:
from duckduckgo_search import DDGS

query = "Do antibiotics work for viral infections PubMed medical research"
search_results = list(DDGS().text(query, max_results=3))

search_results
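
Search hits like those in `search_results` need to be merged into a single string before they can serve as context for the question-answering pipeline. The sketch below assumes each hit is a dictionary with a `body` snippet (alongside `title` and `href`). `results_to_context` and the example hits are hypothetical placeholders, not real search output.

```python
def results_to_context(results, max_chars=1000):
    """Join the 'body' snippets of search hits into one context string."""
    snippets = [hit.get("body", "").strip() for hit in results]
    joined = " ".join(snippet for snippet in snippets if snippet)
    # Truncate so the context stays within a manageable length.
    return joined[:max_chars]


# Placeholder hits shaped like search results (not real search output).
example_results = [
    {"title": "Example A", "href": "https://example.com/a", "body": "First snippet."},
    {"title": "Example B", "href": "https://example.com/b", "body": ""},
    {"title": "Example C", "href": "https://example.com/c", "body": "Second snippet."},
]

web_context = results_to_context(example_results)
print(web_context)
```

The character cap is a crude choice made for brevity. Truncating at sentence boundaries would avoid cutting a snippet mid-word.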


In [None]:
!pip install wikipedia


In [None]:
import wikipedia

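# Topic of interest. Note that the lookup below uses a fixed article title
# ("Antibiotic misuse") rather than this query string.
query = "Antibiotics viral infection"
result_text = wikipedia.summary("Antibiotic misuse", sentences=3)
result_text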



### Summary of Retrieved Information

The following text is the three-sentence summary of the Wikipedia article *"Antibiotic misuse"*, fetched with the `wikipedia` library as context for the topic *"Antibiotics viral infection"*.


In [None]:
print(result_text)


In [None]:
rag_answer = qa_pipeline({
    "question": "Can antibiotics help treat a viral infection?",
    "context": result_text
})

rag_answer


**Observation**

The model returned an answer even though the retrieved context did not contain information that answers the question.
The extracted span is unrelated to the question, and the confidence score is extremely low, indicating uncertainty.
This behavior demonstrates a hallucination: the model produces an answer despite insufficient or irrelevant context.

**Expected Behavior**

The ideal response would have been:

"The context does not contain enough information to answer the question."