Goal: build an AI-powered medical explanation system that:

1. Extracts medical terms using NLP.
2. Retrieves medically accurate definitions using RAG.
3. Ensures source faithfulness (no hallucination).
4. Incorporates expert validation (human review).
5. Computes a final confidence score using weighted scoring.

User Input  
→ NLP Medical Term Extraction  
→ Vector Database Retrieval (RAG)  
→ Explanation Generation  
→ Source Faithfulness Check  
→ Human Validation Layer  
→ Final Confidence Score Calculation  


Final Confidence Score:

C_final = (w1 × S_retrieval)  
          + (w2 × S_source)  
          + (w3 × S_human)

Where:

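- **S_retrieval** → Retrieval similarity score from the vector database (cosine similarity in the design; the notebook below derives it from the FAISS distance as 1 / (1 + distance))  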
- **S_source** → Faithfulness score (LLM answer vs original source)  
- **S_human** → Expert validation score (1.0 if verified)  
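The weighted sum above can be sketched in a few lines. The weights `(0.3, 0.4, 0.3)` and the input scores below are hypothetical placeholders chosen for illustration; the notebook does not specify values for w1, w2, or w3. The sketch also assumes every component score lies in [0, 1] and that the weights sum to 1, so that C_final stays in [0, 1].

```python
def final_confidence(s_retrieval, s_source, s_human, weights=(0.3, 0.4, 0.3)):
    """Compute C_final = w1*S_retrieval + w2*S_source + w3*S_human.

    The default weights are illustrative assumptions, not values from the notebook.
    """
    w1, w2, w3 = weights
    # Assumption: weights sum to 1 so the result stays on the same 0-1 scale.
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    # Assumption: each component score is already normalized to [0, 1].
    for name, value in (("s_retrieval", s_retrieval),
                        ("s_source", s_source),
                        ("s_human", s_human)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w1 * s_retrieval + w2 * s_source + w3 * s_human


# Hypothetical component scores, for illustration only
score = final_confidence(s_retrieval=0.6, s_source=0.8, s_human=1.0)
print(round(score, 2))
```

Keeping the weights as a parameter makes it easy to re-balance the three signals later, for example to give expert validation more influence.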

---


In [8]:
import sqlite3
import pandas as pd

# Connect to database
conn = sqlite3.connect("medical_jargon.db")

# See available tables
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table';", conn
)

tables


Unnamed: 0,name
0,medical_terms
1,medical_fts
2,medical_fts_data
3,medical_fts_idx
4,medical_fts_content
5,medical_fts_docsize
6,medical_fts_config


In [9]:
df = pd.read_sql_query("SELECT * FROM medical_terms;", conn)

df.head()


Unnamed: 0,term,content,__index_level_0__,term_lower,content_length,extracted_date,summary
0,Paracetamol poisoning,"Paracetamol poisoning, also known as acetamino...",0,paracetamol poisoning,23666,2026-01-23,"Paracetamol poisoning, also known as acetamino..."
1,Acromegaly,Acromegaly is a disorder that results from exc...,1,acromegaly,21318,2026-01-23,Acromegaly is a disorder that results from exc...
2,Actinic keratosis,"Actinic keratosis (AK), sometimes called solar...",2,actinic keratosis,33330,2026-01-23,"Actinic keratosis (AK), sometimes called solar..."
3,Congenital adrenal hyperplasia,Congenital adrenal hyperplasia (CAH) is a grou...,3,congenital adrenal hyperplasia,19416,2026-01-23,Congenital adrenal hyperplasia (CAH) is a grou...
4,Adrenocortical carcinoma,Adrenocortical carcinoma (ACC) is an aggressi...,4,adrenocortical carcinoma,8252,2026-01-23,Adrenocortical carcinoma (ACC) is an aggressi...


In [11]:
df.columns


Index(['term', 'content', '__index_level_0__', 'term_lower', 'content_length',
       'extracted_date', 'summary'],
      dtype='str')

In [12]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("medical_jargon.db")

df = pd.read_sql_query("SELECT * FROM medical_terms;", conn)

print("Columns:")
print(df.columns)

print("\nTotal rows:", len(df))
df.head()


Columns:
Index(['term', 'content', '__index_level_0__', 'term_lower', 'content_length',
       'extracted_date', 'summary'],
      dtype='str')

Total rows: 901


Unnamed: 0,term,content,__index_level_0__,term_lower,content_length,extracted_date,summary
0,Paracetamol poisoning,"Paracetamol poisoning, also known as acetamino...",0,paracetamol poisoning,23666,2026-01-23,"Paracetamol poisoning, also known as acetamino..."
1,Acromegaly,Acromegaly is a disorder that results from exc...,1,acromegaly,21318,2026-01-23,Acromegaly is a disorder that results from exc...
2,Actinic keratosis,"Actinic keratosis (AK), sometimes called solar...",2,actinic keratosis,33330,2026-01-23,"Actinic keratosis (AK), sometimes called solar..."
3,Congenital adrenal hyperplasia,Congenital adrenal hyperplasia (CAH) is a grou...,3,congenital adrenal hyperplasia,19416,2026-01-23,Congenital adrenal hyperplasia (CAH) is a grou...
4,Adrenocortical carcinoma,Adrenocortical carcinoma (ACC) is an aggressi...,4,adrenocortical carcinoma,8252,2026-01-23,Adrenocortical carcinoma (ACC) is an aggressi...


In [13]:
# Keep only needed columns (copy so the assignment below does not write to a view)
df = df[['term', 'content', 'summary']].copy()

# Remove nulls & duplicates
df = df.dropna().drop_duplicates()

# Clean text
df['content'] = df['content'].str.lower().str.strip()

print("Cleaned rows:", len(df))
df.head()


Cleaned rows: 901


Unnamed: 0,term,content,summary
0,Paracetamol poisoning,"paracetamol poisoning, also known as acetamino...","Paracetamol poisoning, also known as acetamino..."
1,Acromegaly,acromegaly is a disorder that results from exc...,Acromegaly is a disorder that results from exc...
2,Actinic keratosis,"actinic keratosis (ak), sometimes called solar...","Actinic keratosis (AK), sometimes called solar..."
3,Congenital adrenal hyperplasia,congenital adrenal hyperplasia (cah) is a grou...,Congenital adrenal hyperplasia (CAH) is a grou...
4,Adrenocortical carcinoma,adrenocortical carcinoma (acc) is an aggressi...,Adrenocortical carcinoma (ACC) is an aggressi...


In [15]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content=row['content'],
        metadata={
            "term": row['term'],
            "summary": row['summary']
        }
    )
    for _, row in df.iterrows()
]

print("Total documents:", len(documents))


Total documents: 901


In [18]:
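# Lowercased set of every term (article title) in the table
medical_terms = set(df['term'].str.lower())

def extract_medical_terms(text):
    """Return the known terms that appear verbatim (as substrings) in `text`."""
    text = text.lower()
    # A term matches only when its full title occurs in the text,
    # so "hypertension" alone does not match "secondary hypertension".
    return [term for term in medical_terms if term in text]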


In [19]:
user_input = "This patient has diabetes and hypertension"

terms = extract_medical_terms(user_input)
print("Extracted terms:", terms)


Extracted terms: []
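The empty result follows from the matching rule: a term is returned only if its full title is a substring of the input, and the table stores titles such as "Secondary hypertension" rather than the bare word "hypertension". One possible alternative, sketched below under assumptions, is to index every term by its individual words and return any term that shares a word with the input. The helper names `build_word_index` and `extract_candidate_terms`, the stopword list, and the three sample terms (taken from rows shown elsewhere in this notebook) are illustrative; this is a recall-oriented sketch, not the notebook's method, and it can return loosely related terms.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; extend as needed
STOPWORDS = {"a", "an", "and", "of", "the", "in", "with", "has", "this"}


def build_word_index(terms):
    """Map each lowercased word to the set of terms whose title contains it."""
    index = defaultdict(set)
    for term in terms:
        for word in re.findall(r"[a-z0-9]+", term.lower()):
            if word not in STOPWORDS:
                index[word].add(term)
    return index


def extract_candidate_terms(text, index):
    """Return terms that share at least one non-stopword with the input text."""
    candidates = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOPWORDS:
            continue
        candidates |= index.get(word, set())
    return sorted(candidates)


# Small sample of terms for illustration; in the notebook this would be df['term']
sample_terms = ["Secondary hypertension", "HELLP syndrome", "Acromegaly"]
word_index = build_word_index(sample_terms)

found = extract_candidate_terms("This patient has diabetes and hypertension", word_index)
print("Candidate terms:", found)
```

With the sample terms above, "hypertension" now leads to "Secondary hypertension", while "diabetes" still finds nothing because no sample title contains that word.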


In [20]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

split_docs = text_splitter.split_documents(documents)

print("Total chunks:", len(split_docs))


Total chunks: 36691


In [21]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vectorstore = FAISS.from_documents(split_docs, embedding_model)

print("Vector database created ✅")


Loading weights: 100%|██████████| 103/103 [00:00<00:00, 1194.24it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Vector database created ✅


In [22]:
conn = sqlite3.connect("medical_jargon.db")

# Use a separate name so the cleaned `df` from the earlier cells is not overwritten
preview_df = pd.read_sql_query("SELECT * FROM medical_terms LIMIT 10", conn)
print("First 10 rows:")
display(preview_df)

print("\nColumn types:")
print(preview_df.dtypes)

print("\nMissing values:")
print(preview_df.isna().sum())

First 10 rows:


Unnamed: 0,term,content,__index_level_0__,term_lower,content_length,extracted_date,summary
0,Paracetamol poisoning,"Paracetamol poisoning, also known as acetamino...",0,paracetamol poisoning,23666,2026-01-23,"Paracetamol poisoning, also known as acetamino..."
1,Acromegaly,Acromegaly is a disorder that results from exc...,1,acromegaly,21318,2026-01-23,Acromegaly is a disorder that results from exc...
2,Actinic keratosis,"Actinic keratosis (AK), sometimes called solar...",2,actinic keratosis,33330,2026-01-23,"Actinic keratosis (AK), sometimes called solar..."
3,Congenital adrenal hyperplasia,Congenital adrenal hyperplasia (CAH) is a grou...,3,congenital adrenal hyperplasia,19416,2026-01-23,Congenital adrenal hyperplasia (CAH) is a grou...
4,Adrenocortical carcinoma,Adrenocortical carcinoma (ACC) is an aggressi...,4,adrenocortical carcinoma,8252,2026-01-23,Adrenocortical carcinoma (ACC) is an aggressi...
5,Alcohol withdrawal syndrome,Alcohol withdrawal syndrome (AWS) is a set of ...,5,alcohol withdrawal syndrome,16646,2026-01-23,Alcohol withdrawal syndrome (AWS) is a set of ...
6,Alopecia areata,"Alopecia areata, also known as spot baldness, ...",6,alopecia areata,11883,2026-01-23,"Alopecia areata, also known as spot baldness, ..."
7,Altitude sickness,"Altitude sickness, the mildest form being acut...",7,altitude sickness,20260,2026-01-23,"Altitude sickness, the mildest form being acut..."
8,Amblyopia,"Amblyopia, also called lazy eye, is a disorder...",8,amblyopia,12923,2026-01-23,"Amblyopia, also called lazy eye, is a disorder..."
9,Amoebiasis,"Amoebiasis, or amoebic dysentery, is an infect...",9,amoebiasis,15410,2026-01-23,"Amoebiasis, or amoebic dysentery, is an infect..."



Column types:
term                   str
content                str
__index_level_0__    int64
term_lower             str
content_length       int64
extracted_date         str
summary                str
dtype: object

Missing values:
term                 0
content              0
__index_level_0__    0
term_lower           0
content_length       0
extracted_date       0
summary              0
dtype: int64


In [27]:
query = "What is hypertension?"

results = vectorstore.similarity_search(query, k=3)

for i, doc in enumerate(results):
    print(f"\nResult {i+1}:")
    print("Content:", doc.page_content[:300])
    print("Source:", doc.metadata)



Result 1:
Content: de novo manifestation of hypertension with systolic pressure and diastolic pressure above 160mmhg and 110 mmhg, respectively.
proteinuria, leucocytosis and elevated uric acid concentrations > 7.8 mg.
decreased serum haptoglobin and haemoglobin levels.
Source: {'term': 'HELLP syndrome', 'summary': 'HELLP syndrome is a complication of pregnancy; the acronym stands for hemolysis, elevated liver enzymes, and low platelet count. It usually begins during the last three months of pregnancy or shortly after childbirth.'}

Result 2:
Content: secondary hypertension (or, less commonly, inessential hypertension) is a type of hypertension which by definition is caused by an identifiable underlying primary cause. it is much less common than the other type, called essential hypertension, affecting only 5-10% of hypertensive patients. it has m
Source: {'term': 'Secondary hypertension', 'summary': 'Secondary hypertension (or, less commonly, inessential hypertension) is a type of hyp

In [28]:
results = vectorstore.similarity_search_with_score(query, k=3)

for i, (doc, score) in enumerate(results):
    print(f"\nResult {i+1}")
    print("Term:", doc.metadata["term"])
    print("Distance:", score)
    print(doc.page_content[:200])



Result 1
Term: HELLP syndrome
Distance: 0.70721364
de novo manifestation of hypertension with systolic pressure and diastolic pressure above 160mmhg and 110 mmhg, respectively.
proteinuria, leucocytosis and elevated uric acid concentrations > 7.8 mg.


Result 2
Term: Secondary hypertension
Distance: 0.71290946
secondary hypertension (or, less commonly, inessential hypertension) is a type of hypertension which by definition is caused by an identifiable underlying primary cause. it is much less common than th

Result 3
Term: Secondary hypertension
Distance: 0.7433505
hypertension secondary to endocrine disorders
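Results 2 and 3 above are two chunks of the same article, so the top-k list contains fewer distinct terms than k. A minimal sketch of collapsing chunk-level hits to one best hit per term is shown below. It works on plain `(term, distance, text)` tuples so it runs without the vector store; the helper name `best_chunk_per_term` is hypothetical, and the sample values are copied from the output above (with the longer snippets shortened).

```python
def best_chunk_per_term(scored_chunks):
    """Keep the lowest-distance chunk for each term, ordered by distance."""
    best = {}
    for term, distance, text in scored_chunks:
        # Lower distance means a closer match, so keep the minimum per term
        if term not in best or distance < best[term][0]:
            best[term] = (distance, text)
    rows = [(term, distance, text) for term, (distance, text) in best.items()]
    return sorted(rows, key=lambda row: row[1])


# Values taken from the similarity_search_with_score output above
hits = [
    ("HELLP syndrome", 0.70721364, "de novo manifestation of hypertension"),
    ("Secondary hypertension", 0.71290946, "secondary hypertension (or, less commonly"),
    ("Secondary hypertension", 0.7433505, "hypertension secondary to endocrine disorders"),
]

for term, distance, text in best_chunk_per_term(hits):
    print(term, distance)
```

In the notebook, the tuples could be built from `(doc.metadata["term"], score, doc.page_content)` for each `(doc, score)` pair in `results`.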


In [29]:
def convert_distance_to_similarity(distance):
    return 1 / (1 + distance)

top_doc, distance = results[0]
S_retrieval = convert_distance_to_similarity(distance)

print("S_retrieval:", S_retrieval)


S_retrieval: 0.58574975
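Note that `1 / (1 + distance)` is a monotone rescaling of the distance rather than a cosine similarity, which is how the outline describes S_retrieval. If two assumptions hold, the cosine can be recovered directly: (a) the embeddings are unit-normalized, and (b) the reported score is a squared L2 distance, which is what FAISS flat L2 indexes return. Under those assumptions, ||a - b||² = 2 - 2·cos(a, b), so cos = 1 - distance / 2. Both assumptions should be verified against the actual index configuration before relying on this; the function name below is illustrative.

```python
import math


def cosine_from_squared_l2(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Assumes unit-norm embeddings and a squared (not plain) L2 distance.
    """
    cosine = 1.0 - distance / 2.0
    # Clamp to the valid cosine range to absorb floating-point noise
    return max(-1.0, min(1.0, cosine))


# Sanity check of the identity on two hand-built unit vectors
a = (1.0, 0.0)
b = (math.cos(0.5), math.sin(0.5))
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
dot_product = sum(x * y for x, y in zip(a, b))
identity_error = abs(cosine_from_squared_l2(squared_l2) - dot_product)
print("identity error:", identity_error)

# Applied to the top distance reported above
s_retrieval_cosine = cosine_from_squared_l2(0.70721364)
print("cosine-based S_retrieval:", s_retrieval_cosine)
```

For the top hit this gives roughly 0.646, compared with 0.586 from the `1 / (1 + distance)` conversion, so the choice of conversion shifts the final confidence score.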


In [30]:
vectorstore.save_local("faiss_index")
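The S_source term from the confidence formula is not computed in the cells above. As a rough placeholder, a lexical proxy can measure what fraction of the answer's tokens also occur in the retrieved source text. This is an assumption-laden stand-in, not the LLM-based faithfulness check the outline describes: it ignores word order, negation, and paraphrase, and the function name `token_support_score` is hypothetical.

```python
import re


def token_support_score(answer, source):
    """Fraction of answer tokens that also appear in the source text (0.0 to 1.0)."""
    answer_tokens = re.findall(r"[a-z0-9]+", answer.lower())
    source_tokens = set(re.findall(r"[a-z0-9]+", source.lower()))
    if not answer_tokens:
        return 0.0
    supported = sum(1 for token in answer_tokens if token in source_tokens)
    return supported / len(answer_tokens)


# Source text shortened from the retrieved chunk shown earlier in the notebook
source_text = (
    "secondary hypertension is a type of hypertension which by definition "
    "is caused by an identifiable underlying primary cause"
)

# Hypothetical generated answers, for illustration only
grounded = "Secondary hypertension is caused by an underlying cause"
ungrounded = "Secondary hypertension is caused by stress"

print(token_support_score(grounded, source_text))
print(token_support_score(ungrounded, source_text))
```

A score near 1.0 only shows that the answer reuses the source vocabulary; a stricter check, such as an entailment model or an LLM judge, would be needed to confirm that the claims themselves are supported.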


In [1]:
!pip3 install transformers huggingface_hub datasets wandb evaluate rouge_score accelerate

Collecting datasets
  Using cached datasets-4.5.0-py3-none-any.whl.metadata (19 kB)
Collecting wandb
  Downloading wandb-0.25.0-py3-none-macosx_12_0_arm64.whl.metadata (12 kB)
Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting accelerate
  Using cached accelerate-1.12.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Using cached pyarrow-23.0.1-cp313-cp313-macosx_12_0_arm64.whl.metadata (3.1 kB)
Collecting dill<0.4.1,>=0.3.0 (from datasets)
  Using cached dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess<0.70.19 (from datasets)
  Using cached multiprocess-0.70.18-py313-none-any.whl.metadata (7.2 kB)
Collecting fsspec>=2023.5.0 (from huggingface_hub)
  Using

In [2]:
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import numpy as np
import evaluate
import torch
import wandb
import os



In [3]:
from getpass import getpass

hf_token = getpass("Enter your Hugging Face token: ")
wandb_key = getpass("Enter your wandb key: ")

KeyboardInterrupt: Interrupted by user

In [None]:
from huggingface_hub import login

login(token=hf_token)
wandb.login(key=wandb_key)

In [None]:
from datasets import load_dataset, DatasetDict

data_files = 'created_simplification_data.json'

dataset = load_dataset("json", data_files=data_files)

dataset