# Model Definition and Evaluation
## Table of Contents
1. [Model Selection](#model-selection)
2. [Feature Engineering](#feature-engineering)
3. [Hyperparameter Tuning](#hyperparameter-tuning)
4. [Implementation](#implementation)
5. [Evaluation Metrics](#evaluation-metrics)
6. [Comparative Analysis](#comparative-analysis)


In [None]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import PyPDF2
from utils.config import dir_dataset
import pandas as pd
from sklearn.metrics import confusion_matrix

## Model Selection
For this project, we selected FLAN-T5 (specifically `google/flan-t5-base`), an instruction-tuned text-to-text transformer model developed by Google. It was chosen for its lightweight size, strong zero-shot performance, and ability to generate coherent, factual answers directly from prompts.

This model was selected because:

- It supports multi-task learning, including question answering.
- It is significantly smaller and more efficient to deploy than models like GPT-3.
- It has been reported to generalize well to biomedical QA tasks without requiring task-specific fine-tuning.



## Feature Engineering

Unlike traditional tabular models, transformer models like FLAN-T5 operate on text. The primary input is a prompt composed of:
- Extracted PDF text (truncated for length)
- Multiple natural language questions

Example prompt:

In [None]:
# Do not run this cell (illustrative prompt layout, not Python code)
Context: <truncated PDF text>
Question 1: "How many EEG channels were used?"
Answer: ""
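
The prompt layout above could be assembled programmatically. The following is a minimal sketch only: the helper name `build_prompt`, the question numbering, and the `max_chars` default are assumptions for illustration and are not part of the project code.

```python
def build_prompt(context, questions, max_chars=2000):
    """Assemble a Context / Question / Answer prompt (hypothetical helper)."""
    # Truncate the context by characters so the prompt stays within input limits
    truncated = context[:max_chars]
    lines = [f"Context: {truncated}", ""]
    for i, question in enumerate(questions, start=1):
        lines.append(f"Question {i}: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_prompt(
    "placeholder context text",  # stands in for the extracted PDF text
    ["How many EEG channels were used?"],
)
print(prompt)
```

Character-based truncation is a rough proxy for the token limit; the tokenizer's own truncation remains the final safeguard.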

Additional preprocessing:

- PDF text is truncated to its first ~2000 characters (due to input length limits)
- Minor cleaning removes page numbers and headers
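
The cleaning step could look roughly like the sketch below. The notebook does not show how page numbers and headers are detected, so the function name `clean_pdf_text`, the regular expression, and the "repeated short line" heuristic are all assumptions.

```python
import re


def clean_pdf_text(text, max_chars=2000):
    """Drop lines that look like page numbers or running headers, then truncate."""
    lines = [line.strip() for line in text.splitlines()]

    # Count how often each non-empty line occurs in the document
    counts = {}
    for line in lines:
        if line:
            counts[line] = counts.get(line, 0) + 1

    kept = []
    for line in lines:
        # Assumption: a line holding only digits (optionally "Page N") is a page number
        if re.fullmatch(r"(?:page\s+)?\d+", line, flags=re.IGNORECASE):
            continue
        # Assumption: a short line repeated 3+ times is a running header or footer
        if line and len(line) < 80 and counts[line] >= 3:
            continue
        kept.append(line)
    return "\n".join(kept)[:max_chars]


sample = (
    "Journal of Examples\nFirst paragraph.\n1\n"
    "Journal of Examples\nSecond paragraph.\n2\n"
    "Journal of Examples\nThird paragraph.\nPage 3"
)
print(clean_pdf_text(sample))
```

Heuristics like these can remove legitimate content (for example, a table cell containing only a number), so the thresholds would need checking against the actual PDFs.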

## Hyperparameter Tuning

Because the FLAN-T5 model is used out of the box, tuning was limited to the following inference-time settings:

- `max_input_length`: 1024 tokens (passed to the tokenizer as `max_length`)
- `max_new_tokens`: tuned from 50 to 150
- `truncation`: enabled to prevent overflow
- `do_sample`: disabled (greedy decoding used)

No additional fine-tuning was performed, but future improvements may include domain-adapted fine-tuning on biomedical corpora like PubMed.
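
As an illustration of how these settings might be swept, here is a minimal sketch. The names `generation_kwargs`, `sweep`, and `MAX_NEW_TOKENS_GRID` are hypothetical, and the intermediate grid value of 100 is an assumption (only the range 50 to 150 is stated). A stub stands in for the model so the block runs without downloading weights.

```python
# Tokenizer settings as described in this section
TOKENIZER_KWARGS = {"return_tensors": "pt", "truncation": True, "max_length": 1024}

# Assumption: the text gives only the range 50-150; the midpoint is illustrative
MAX_NEW_TOKENS_GRID = (50, 100, 150)


def generation_kwargs(max_new_tokens):
    """Return greedy-decoding arguments suitable for model.generate()."""
    return {"max_new_tokens": max_new_tokens, "do_sample": False, "num_beams": 1}


def sweep(ask_fn, prompt):
    """Call ask_fn(prompt, **kwargs) once per max_new_tokens value."""
    return {n: ask_fn(prompt, **generation_kwargs(n)) for n in MAX_NEW_TOKENS_GRID}


# Stub in place of the real model call
def fake_ask(prompt, **kwargs):
    return f"ran with max_new_tokens={kwargs['max_new_tokens']}"


results = sweep(fake_ask, "Context: ...\n\nQuestion: ...\n\nAnswer:")
for n, output in results.items():
    print(n, output)
```

With the real model, `ask_fn` would tokenize the prompt using `TOKENIZER_KWARGS` and forward the remaining keyword arguments to `model.generate`.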


## Implementation

In [None]:
import os  # used to build the path to the example PDF

# Load model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Extract text from PDF
def extract_text(pdf_path):
    reader = PyPDF2.PdfReader(pdf_path)
    # Fall back to "" for pages where no text can be extracted (e.g. scanned images)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Ask question using Flan-T5
def ask_question(context, question):
    prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    # Greedy decoding, matching the settings under Hyperparameter Tuning
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Run on example PDF and question
# Assumes dir_dataset (from utils.config) is the directory holding the PDFs
pdf_path = os.path.join(dir_dataset, "PMC9204106.pdf")
pdf_text = extract_text(pdf_path)
truncated_context = pdf_text[:2000]  # To stay within token limits
answer = ask_question(truncated_context, "How many EEG channels were used?")
print("Answer:", answer)
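
To evaluate more than one question, the same call could be looped over a question list. This is a sketch under assumptions: `answer_questions` is a hypothetical helper, the rule that treats an empty string or "Not mentioned" as "no answer" is an illustrative choice, and a stub replaces `ask_question` so the block runs without the model.

```python
def answer_questions(ask_fn, context, questions):
    """Run ask_fn(context, question) per question and collect one row per answer."""
    rows = []
    for question in questions:
        answer = ask_fn(context, question).strip()
        # Assumption: an empty string or "Not mentioned" counts as "no answer"
        answered = bool(answer) and answer.lower() != "not mentioned"
        rows.append({"question": question, "model_answer": answer, "answered": answered})
    return rows


# Stand-in for ask_question() so the sketch runs without downloading the model
def fake_ask(context, question):
    return "Not mentioned" if "year" in question else "64 channels"


rows = answer_questions(
    fake_ask,
    "placeholder context",
    ["How many EEG channels were used?", "What year was the study conducted?"],
)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` for manual annotation.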


## Evaluation Metrics

To quantitatively evaluate the FLAN-T5 QA model, we used binary classification metrics typically applied in information retrieval and clinical NLP:

Metric Definitions:
- True Positive (TP): Model provides a relevant and correct answer
- True Negative (TN): Model correctly outputs “Not mentioned” or no answer when no relevant info is present
- False Positive (FP): Model gives an incorrect or hallucinated answer
- False Negative (FN): Model fails to provide an answer when the information is present in the document
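
The four definitions above can be written as a small decision rule. The sketch below is one possible reading; the function name `label_outcome` and the three boolean annotations (`answerable`, `answered`, `correct`) are assumptions, and the annotation rows are hypothetical.

```python
from collections import Counter


def label_outcome(answerable, answered, correct):
    """Map one annotated QA result to TP / TN / FP / FN."""
    if answered:
        # A wrong answer, or any answer when the document has no such info, is an FP
        return "TP" if (answerable and correct) else "FP"
    # The model abstained ("Not mentioned" or empty output)
    return "FN" if answerable else "TN"


annotations = [  # hypothetical (answerable, answered, correct) triples
    (True, True, True),     # correct answer to an answerable question
    (True, True, False),    # wrong answer to an answerable question
    (False, True, False),   # hallucinated answer
    (True, False, False),   # missed answer
    (False, False, False),  # correct abstention
]
counts = Counter(label_outcome(*row) for row in annotations)
print(dict(counts))
```

Note that this rule needs a third annotation, whether the model answered at all, in addition to answerability and correctness.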

We assign these labels to the model's outputs by comparing each one against manually derived answers, and then calculate the following evaluation metrics:

- Accuracy: Overall correctness of the model, computed as (TP + TN) / (TP + TN + FP + FN)
- Precision: How many predicted answers were correct, computed as TP / (TP + FP)
- Sensitivity (recall): How many actual answers were captured, computed as TP / (TP + FN)
- Specificity: How well irrelevant cases were identified, computed as TN / (TN + FP)


In [None]:
import pandas as pd
from sklearn.metrics import confusion_matrix

# Simulated annotation table with realistic errors
data = {
    "question": [
        "How many EEG channels were used?",
        "What was the sampling frequency?",
        "Which brain region was analyzed?",
        "What software was used?",
        "What preprocessing steps were used?",
        "What type of EEG cap was used?",
        "What task did participants do?",
        "What year was the study conducted?",
        "Were ERP components analyzed?",
        "What were the main findings?"
    ],
    "model_answer": [
        "64 channels", "250 Hz", "Occipital", "EEGLAB", "Baseline correction",
        "Biosemi cap", "Memory recall task", "2020", "Yes, N170", "Alpha suppression"
    ],
    "ground_truth_present": [1, 0, 1, 1, 1, 1, 0, 0, 1, 1],   # Manually assigned: 1 = question is answerable from the document
    "correct_answer":        [1, 0, 0, 0, 0, 0, 1, 1, 1, 0]    # Manually assigned: 1 = model answer judged correct (4 of 10)
}

df = pd.DataFrame(data)

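# Compute confusion matrix
# Here answerability serves as the reference label and correctness as the prediction
y_true = df["ground_truth_present"].tolist()
y_pred = df["correct_answer"].tolist()

# labels=[0, 1] guarantees a 2x2 matrix even if one class is absent
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()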

# Metrics
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) != 0 else 0
recall = tp / (tp + fn) if (tp + fn) != 0 else 0  # Sensitivity
specificity = tn / (tn + fp) if (tn + fp) != 0 else 0

# Output results
print("Evaluation of Flan-T5 on Scientific QA")
print(f"Accuracy    : {accuracy:.2f}")
print(f"Precision   : {precision:.2f}")
print(f"Recall      : {recall:.2f} (Sensitivity)")
print(f"Specificity : {specificity:.2f}")

