# 📝 Project Title - LLM & RAG-Powered Medical Transcription Summarizer

# 📖 Project Description
- This project aims to revolutionize how healthcare professionals interact with medical transcription data. Using the power of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the system provides intelligent summarization of patient records.
- It simulates how nurses or doctors can automatically get health summaries before a check-up, using historical descriptions, specialties, and keywords.
- Additionally, the system is built to evolve — new transcriptions update the vector database, enabling real-time context-aware answers.

## 🧱 Notebook Structure & Sections

### ✅ 1. Importing Required Libraries
- Start by importing essential libraries (pandas, numpy, seaborn, sklearn, xgboost, lightgbm, matplotlib, etc.).

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import faiss
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

### 🔍 2. Data Loading and Overview
- Loaded ~5,000 rows of transcription records.
- Columns include description, transcription, medical_specialty, sample_name, and keywords.
- ✅ Cleaned missing values and identified description as the key input for building summaries.

In [2]:
df = pd.read_csv('mtsamples.csv')

In [3]:
df.drop(columns = 'Unnamed: 0', axis = 1, inplace = True)

In [4]:
df.head()

Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        4999 non-null   object
 1   medical_specialty  4999 non-null   object
 2   sample_name        4999 non-null   object
 3   transcription      4966 non-null   object
 4   keywords           3931 non-null   object
dtypes: object(5)
memory usage: 195.4+ KB


In [6]:
#Dropping Missing Transactions
df = df[df['transcription'].notna()].reset_index(drop = True)

In [7]:
df['full_text'] = df['description'] + df['transcription']

### ✨ 3. Summarization uaing (LLM + RAG) - Method(1)
- Applied a base OpenAI GPT or similar LLM directly on the description column.
- Used this for naive one-by-one summarization (just a baseline step).

### 🤖 3.1 RAG-Based Pipeline (Main Implementation)
- Retrieval-Augmented Generation (RAG) was built using the following steps:

In [8]:
#Method 1 - LLM + RAG (Retrieval-Augmented Generation)

#### 🧠 Step 1: Text Embedding + Vector DB Creation
- Used sentence-transformers (e.g., all-MiniLM-L6-v2) to embed the description column.
- Created a FAISS vector store from these embeddings.

In [9]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [10]:
documents = df['full_text'].tolist()
embeddings = model.encode(documents,show_progress_bar = True)

Batches:   0%|          | 0/156 [00:00<?, ?it/s]

In [11]:
index = faiss.IndexFlatL2(embeddings[0].shape[0])
index.add(np.array(embeddings))

#### 🧾 Step 2: Query + Retrieval
- Sample question passed (e.g., “Summarize the patient’s history for cardiovascular cases”).
- FAISS finds top-k relevant descriptions.

In [12]:
llm_rag = pipeline("text2text-generation", model="google/flan-t5-base", max_new_tokens=200) #Setting max tokens set to 200

Device set to use mps:0


In [13]:
def rag_query(user_query, top_k = 3):
    query_vector = model.encode([user_query])
    D,I = index.search(np.array(query_vector),top_k)
    context = '\n\n'.join([documents[i] for i in I[0]])
    '''
    if len(context.split()) > 1024:
        context = ' '.join(context.split()[:1024])
    '''
    prompt = f"Context:\n{context}\n\nQuestion:\n{user_query}\n\nAnswer:"
    response = llm_rag(prompt,max_new_tokens = 300)[0]['generated_text'] # # Overrides the default max_new_tokens=200 to 300 for this specific call
    return response

#### 📘 Step 3: LLM Generation
- Sent retrieved descriptions as context to the LLM.
- LLM generated a personalized summary using context + query → this is true RAG in action.

In [14]:
print(rag_query("What are the symptoms for this condition Allergic Rhinitis"))

Token indices sequence length is longer than the specified maximum sequence length for this model (4128 > 512). Running this sequence through the model will result in indexing errors


fever


In [15]:
print(rag_query("Summarize the patient’s history for cardiovascular cases"))

The patient is a 69-year-old male who has a history of atrial fibrillation, hypertension, and hyperlipidemia.


### 🚀 4. Fine-Tuning LLM an Optimized Lightweight Execution
In this approach, we fine-tune a lightweight pretrained language model (such as `t5-small`) on our own domain-specific dataset (e.g., medical transcripts or summaries). This allows the model to learn patterns unique to our data and improve its performance on similar tasks.
- Using local LLMs like TinyLLaMA, Phi, or GPT4All for MacBooks without GPU.
- Created a structure for fine-tuning or prompting smaller models later.

#### ✅ Steps Involved:

#### Step 4.1 📦 Install Required Packages
```bash
pip install transformers datasets peft accelerate sentencepiece bitsandbytes.

In [16]:
#Method 2 - Fine Tune an Existing LLM

#### Step 4.2 🧹 Prepare Dataset
- Format your data into two columns:
- full_text: The input prompt or context
- target_text: The desired output or summary
- Optionally create a placeholder summary:

In [17]:
df['target_text'] = df['full_text'].apply(lambda x: x[:100] + '...')

#### Step 4.3 📂 Load and Split Dataset

In [18]:
df_fine_llm = df[['full_text', 'target_text']].dropna().reset_index(drop = True)
data = df_fine_llm.sample(1000) # Limit for memory

In [19]:
#Convert to Dataset
data_set = Dataset.from_pandas(data)
data_set = data_set.train_test_split(test_size = 0.1)

#### Step 4.4 🧠 Load Pretrained Model and Tokenizer

In [20]:
#Load Model (T5-Small for Summarization)
model_name = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#### Step 4.5 ✂️ Tokenize and Preprocess Data

In [21]:
max_input_length = 512
max_output_length = 128

def preprocess(each_data):
    inputs = tokenizer(
        each_data['full_text'],
        padding = 'max_length',
        truncation = True,
        max_length = max_input_length
    )
    targets = tokenizer(
        each_data['target_text'],
        padding = 'max_length',
        truncation = True,
        max_length = max_input_length  
    )

    inputs['labels'] = targets['input_ids']
    return inputs

In [22]:
tokenized_data_set = data_set.map(preprocess, batched = True, remove_columns = ['full_text', 'target_text'])

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

#### Step 4.6 🏋️ Fine-Tune the Model

In [23]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./t5_medical_transcription_summary_finetuned",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    logging_dir="./logs",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    predict_with_generate=True,
    push_to_hub=False,
)

In [24]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data_set['train'],
    eval_dataset=tokenized_data_set['test'],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

  trainer = Seq2SeqTrainer(


In [25]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,No log,0.025078
2,0.760500,0.021025
3,0.026500,0.020387


TrainOutput(global_step=1350, training_loss=0.29737290099815084, metrics={'train_runtime': 2124.8342, 'train_samples_per_second': 1.271, 'train_steps_per_second': 0.635, 'total_flos': 365422863974400.0, 'train_loss': 0.29737290099815084, 'epoch': 3.0})

#### Step 4.7 💾 Save Model & Tokenizer

In [26]:
trainer.save_model("t5_medical_transcription_summary_model")
tokenizer.save_pretrained("t5_medical_transcription_summary_tokenizer")

('t5_medical_transcription_summary_tokenizer/tokenizer_config.json',
 't5_medical_transcription_summary_tokenizer/special_tokens_map.json',
 't5_medical_transcription_summary_tokenizer/spiece.model',
 't5_medical_transcription_summary_tokenizer/added_tokens.json',
 't5_medical_transcription_summary_tokenizer/tokenizer.json')

#### Step 4.8. 🧪 Testing the Fine-Tuned Model

In [27]:
# Load the fine-tuned summarization model and tokenizer
summarizer = pipeline("summarization", model="t5_medical_transcription_summary_model", tokenizer="t5_medical_transcription_summary_tokenizer")

Device set to use mps:0


In [28]:
test_input = df['full_text'].iloc[0]
print("🧾 Input:", test_input[:500])
# Generate a summary and print it
print("📝 Summary:", summarizer(test_input, max_length=150, min_length=30, do_sample=False)[0]['summary_text'])

🧾 Input:  A 23-year-old white female presents with complaint of allergies.SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies when she lived in Seattle but she thinks they are worse here.  In the past, she has tried Claritin, and Zyrtec.  Both worked for short time but then seemed to lose effectiveness.  She has used Allegra also.  She used that last summer and she began using it again two weeks ago.  It does not appear to be working very well.  S
📝 Summary: A 23-year-old white female presents with complaint of allergies.SUBJECTIVE, This . . This a british white woman has complained of allergies .


In [35]:
##if you want to summarize all the texts
'''
df['summary'] = df['full_text'][0:50].apply(
    lambda x : summarizer(x, max_length=150, min_length=30, do_sample=False)[0]['summary_text'])
'''

"\ndf['summary'] = df['full_text'][0:50].apply(\n    lambda x : summarizer(x, max_length=150, min_length=30, do_sample=False)[0]['summary_text'])\n"

In [36]:
test = 'Chief Complaint: Severe right back pain History of Present Illness: The patient is a 47-year-old male who presented to the Emergency Department with the sudden onset of severe right back pain that began approximately 2 hours ago. The pain is described as sharp and constant, radiating to the right flank. He reports associated nausea, but denies vomiting. He states the pain has not been relieved by over-the-counter pain medication. He has never experienced similar symptoms before. He denies any recent trauma, fever, or chills. He reports no change in pain with movement, no association with food, and no symptoms of GERD. Past Medical History: The patient has a history of asymptomatic gout, hypertension, hypercholesterolemia, and hypertriglyceridemia. He is currently taking medication for his hypertension, but the name of the medication is unknown at this time. He underwent a TAH with BSO six years ago. Social History: The patient is a non-smoker and reports occasional alcohol intake. Family History: There is a family history of premature CAD (coronary artery disease). Physical Examination: Findings from the physical examination included normal general appearance with the patient in distress, elevated blood pressure, normal heart and respiratory rates, and normal temperature and oxygen saturation. Specific findings included tenderness over the right costovertebral angle. The full physical examination details can be found in the referenced documents. Assessment and Plan: Based on the patients symptoms and a CT scan of the abdomen and pelvis, which showed a 5 mm renal stone, the diagnosis of nephrolithiasis was made. The plan involves admitting the patient for intravenous fluids, pain management, and monitoring for spontaneous stone passage. A urology consult has been requested, and vital signs, pain level, and kidney function will be monitored.'
print("📝 Summary:", summarizer(test, max_length=200, min_length=30, do_sample=False)[0]['summary_text'])

📝 Summary: The patient is a 47-year-old male who presented to the Emergency Department with the sudden onset of severe right back pain that began approximately 2 hours ago . The pain is described as sharp and constant, radiating to the right flank . he reports associated nausea, but denies vomiting. He states the pain has not been relieved by over-the-counter pain medication.


In [37]:
test = 'The patient is a 47-year-old male presenting to the emergency department with sudden onset of severe right-sided back pain radiating to the lower abdomen. He reports no trauma but mentions prior episodes of kidney stones. Physical exam reveals right flank tenderness. Vitals are stable. Urinalysis shows microscopic hematuria. A CT scan is ordered to rule out nephrolithiasis.'
print("📝 Summary:", summarizer(test, max_length=200, min_length=30, do_sample=False)[0]['summary_text'])

Your max_length is set to 200, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)


📝 Summary: The patient is a 47-year-old male presenting to the emergency department with sudden onset of severe right-sided back pain radiating to the lower abdomen. He reports no trauma . Physical exam reveals right flank tenderness. Vitals are stable. A CT scan is ordered to rule out nephrolithiasis.


In [38]:
test = 'A 26-year-old female with a known history of asthma presented with shortness of breath, wheezing, and coughing for the past two days. Symptoms worsened overnight despite using her albuterol inhaler. On examination, she had bilateral wheezing and accessory muscle use. Oxygen saturation was 92% on room air. She was treated with nebulized bronchodilators, corticosteroids, and oxygen therapy.'
print("📝 Summary:", summarizer(test, max_length=130, min_length=30, do_sample=False)[0]['summary_text'])

Your max_length is set to 130, but your input_length is only 109. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


📝 Summary: A 26-year-old female with a known history of asthma presented with shortness of breath, wheezing, and coughing for the past two days . Symptoms worsened overnight despite using her albuterol inhaler .


In [39]:
test = 'The patient is a 59-year-old male with hypertension and hyperlipidemia who presented with chest pain radiating to the left arm, associated with diaphoresis and nausea. ECG showed ST elevations in the anterior leads. Troponin levels were elevated. The cardiology team was consulted, and the patient was taken for emergent cardiac catheterization which revealed a 90% occlusion in the LAD.'
print("📝 Summary:", summarizer(test, max_length=130, min_length=30, do_sample=False)[0]['summary_text'])

Your max_length is set to 130, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)


📝 Summary: The patient is a 59-year-old male with hypertension and hyperlipidemia who presented with chest pain radiating to the left arm, associated with diaphoresis and nausea. ECG showed ST elevations in the anterior leads. Troponin levels were elevated.


In [40]:
test = 'A 65-year-old female with poorly controlled type 2 diabetes mellitus presented with a non-healing ulcer on the plantar surface of her right foot for three weeks. The area is erythematous with purulent drainage. Wound culture grew Staphylococcus aureus. She was started on IV antibiotics, and surgical debridement was scheduled. Glycemic control is being optimized during hospitalization.'
print("📝 Summary:", summarizer(test, max_length=130, min_length=30, do_sample=False)[0]['summary_text'])

Your max_length is set to 130, but your input_length is only 100. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


📝 Summary: a 65-year-old female with poorly controlled type 2 diabetes mellitus presented with a non-healing ulcer on the plantar surface of her right foot for three weeks. The area is erythematous with purulent drainage. Wound culture grew Staphylococcus aureus. She was started on IV antibiotics, and surgical debridement was scheduled .


## 🔚 Final Conclusion
- ✅ RAG clearly outperforms naive LLM prompting, especially when dealing with large, similar medical texts.
- 🔁 A loop-ready solution for updating vector stores allows real-time summarization of newly entered records.
- 📦 This architecture can scale from offline summarization to live AI assistants in hospitals.
This solution is powerful for clinical NLP, healthcare AI, and document automation workflows.