


# Fine-tune Llama 3.2 on a medical dataset with Hugging Face and `peft`

In this notebook, we will fine-tune a Llama 3.2 model (or TinyLlama as an ungated stand-in) on a medical dataset with Hugging Face and `peft`. We will follow the typical steps of a training pipeline, from loading the model and tokenizer to training, evaluating and saving the model. Then we will test the model with a simple inference function to see whether it works as expected 🤗




> If you are not familiar with the `peft` library, you can read more about it [here](https://github.com/huggingface/peft)




In [None]:
import os
import time
import json

import torch
from datasets import load_dataset
from huggingface_hub import login
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Login to Hugging Face (needed for gated models like Llama)
# IMPORTANT: Never hardcode tokens in notebooks.
# Option A: store it in an environment variable HF_TOKEN
# Option B: run `huggingface-cli login` once in your terminal
hf_token = os.getenv("HF_TOKEN")
if hf_token:
    login(token=hf_token)

# Model choice: TinyLlama works without approval and is small enough for local hardware.
# Switch to "meta-llama/Llama-3.2-1B-Instruct" once your access is approved.
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"






Check which accelerator is available. The cell below picks CUDA if present, then the MPS backend on Apple silicon Macs, and falls back to the CPU otherwise. If you have no local GPU, you can run this notebook on [Colab](https://colab.research.google.com/).




In [None]:
def select_torch_device() -> torch.device:
    """Pick the best available device (CUDA > MPS > CPU)."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = select_torch_device()
print(f"✅ Using device: {device}")

✅ Using device: mps





Load the model selected in `model_name` from the Hugging Face Hub (TinyLlama by default, or `meta-llama/Llama-3.2-1B-Instruct` once your access is approved) with the `AutoTokenizer` and `AutoModelForCausalLM` classes below.




In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Ensure padding works for causal LM (some models don't define pad_token by default)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"

# Prefer the new `dtype=` argument (torch_dtype is deprecated upstream)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16,  # FP16 for memory efficiency
)

# Move model to the selected device (keeps behavior consistent across CUDA/MPS/CPU)
model = model.to(device)

print(f"✅ Model loaded: {model_name}")

`torch_dtype` is deprecated! Use `dtype` instead!


✅ Model loaded: TinyLlama/TinyLlama-1.1B-Chat-v1.0





Initialize LoRA configuration with the following parameters:




- `r=16`: The rank of the low-rank LoRA matrices
- `lora_alpha=32`: The scaling factor for the LoRA update (the update is scaled by `lora_alpha / r`)
- `target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`: The modules to apply LoRA to. Read more about it [here](https://huggingface.co/docs/peft/en/developer_guides/lora)
- `lora_dropout=0.05`: The dropout rate applied inside the LoRA layers
- `bias="none"`: Which bias parameters to train; `"none"` keeps all biases frozen
- `task_type=TaskType.CAUSAL_LM`: The type of task to train for, here causal language modeling (next-token prediction)




In [4]:
print("\n⚙️ Configuring LoRA...")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


⚙️ Configuring LoRA...
trainable params: 12,615,680 || all params: 1,112,664,064 || trainable%: 1.1338
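
The trainable parameter count printed above can be checked by hand. Each LoRA-adapted linear layer of shape `d_in x d_out` adds two matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the layer shapes of TinyLlama-1.1B (hidden size 2048, MLP size 5632, key/value projection width 256, 22 layers); these values come from the model's published configuration, not from this notebook, so adjust them if you switch models.

```python
# Assumed TinyLlama-1.1B shapes (from its published config, not from this notebook)
HIDDEN = 2048         # hidden size
INTERMEDIATE = 5632   # MLP inner size
KV_DIM = 256          # key/value projection width (grouped-query attention)
NUM_LAYERS = 22
RANK = 16             # same value as r in LoraConfig

# (d_in, d_out) of every module listed in target_modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_param_count(d_in, d_out, RANK) for d_in, d_out in MODULE_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"Total trainable parameters: {total_trainable:,}")
```

Under these assumptions the total agrees with the `trainable params` value reported by `print_trainable_parameters()`. Because the count is linear in `r`, doubling the rank doubles the number of trainable parameters.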





Load the dataset and format it with the function below, using only the first 200 examples for training.




```python
def format_prompt(example):
    """Format one dataset row into a single training prompt."""
    # Use the actual field names from the dataset
    question = example.get('Open-ended Verifiable Question', '')
    answer = example.get('Ground-True Answer', '')

    # Skip rows without real content; returning {"text": None} (rather than None)
    # keeps the output a dict, so the row can be filtered out after `.map`
    if not question or len(question) < 10:
        return {"text": None}

    if not answer or len(answer) < 2:
        return {"text": None}

    # Format with the Llama 3 chat template
    # Note: this dataset has no step-by-step reasoning,
    # so we use a simpler question/answer format
    text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The answer is: {answer}<|eot_id|>"""

    return {"text": text}
```




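This function formats each dataset row into the prompt we want the model to learn 🤖 Note that the template uses Llama 3 special tokens such as `<|eot_id|>`; the TinyLlama tokenizer does not define them, so with the default `model_name` they are tokenized as ordinary text.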
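
To see what the formatting and filtering steps produce without downloading anything, here is a minimal sketch on two toy rows. The rows and the helper name `format_row` are invented for illustration and are not taken from the dataset; the template and the length checks mirror the function above.

```python
QUESTION_FIELD = "Open-ended Verifiable Question"
ANSWER_FIELD_NAME = "Ground-True Answer"

# Toy rows, invented for illustration (not real dataset content)
toy_rows = [
    {QUESTION_FIELD: "Which vitamin deficiency causes scurvy?", ANSWER_FIELD_NAME: "Vitamin C"},
    {QUESTION_FIELD: "Too short", ANSWER_FIELD_NAME: "x"},  # fails both length checks
]


def format_row(row: dict) -> dict:
    """Same logic as the formatting function above, on a plain dict."""
    question = row.get(QUESTION_FIELD, "")
    answer = row.get(ANSWER_FIELD_NAME, "")
    if not question or len(question) < 10 or not answer or len(answer) < 2:
        return {"text": None}
    text = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        f"The answer is: {answer}<|eot_id|>"
    )
    return {"text": text}


formatted_rows = [format_row(row) for row in toy_rows]              # plays the role of .map
kept_rows = [row for row in formatted_rows if row["text"] is not None]  # plays the role of .filter

print(f"Kept {len(kept_rows)} of {len(toy_rows)} rows")
print(kept_rows[0]["text"])
```

Only the first row survives the filter, and its `text` ends with the answer followed by `<|eot_id|>`, which is the pattern the model is trained to reproduce.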




In [None]:
print("\n📊 Loading dataset...")
dataset = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem")

# Dataset field names
USER_FIELD = "Open-ended Verifiable Question"
ANSWER_FIELD = "Ground-True Answer"


def format_prompt(example):
    """Convert a dataset row into a single supervised fine-tuning example."""
    question = example.get(USER_FIELD, "")
    answer = example.get(ANSWER_FIELD, "")

    # Skip empty / malformed rows
    if not question or len(question) < 10:
        return {"text": None}
    if not answer or len(answer) < 2:
        return {"text": None}

    text = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        f"The answer is: {answer}<|eot_id|>"
    )
    return {"text": text}


# Format and filter dataset, take only 200 examples (faster training)
print("🔄 Formatting dataset...")
formatted = dataset["train"].select(range(200)).map(format_prompt)
train_dataset = formatted.filter(lambda x: x["text"] is not None)

print(f"✅ Training on {len(train_dataset)} examples")


📊 Loading dataset...
🔄 Formatting dataset...
✅ Training on 198 examples





Print the train dataset to inspect its columns and size.




In [6]:
train_dataset

Dataset({
    features: ['Open-ended Verifiable Question', 'Ground-True Answer', 'text'],
    num_rows: 198
})




Tokenize the train dataset with the tokenizer and the `tokenize_function` below.




```python
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,  # Shorter for Mac memory
        return_tensors="pt"
    )
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized
```




Then apply `tokenize_function` to the train dataset with the `.map` method, using the following parameters:

- `tokenize_function`: our function defined above, applied to the dataset
- `batched=True`: process the examples in batches
- `remove_columns=train_dataset.column_names`: drop the original columns so only the tokenized fields remain




This will tokenize the train dataset and return a new dataset with the tokenized text.
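
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the helper `pad_or_truncate` and the token ids are hypothetical, and it only mimics right padding for a single sequence.

```python
def pad_or_truncate(token_ids: list, max_length: int, pad_id: int) -> dict:
    """Cut a sequence to max_length, then right-pad it with pad_id up to max_length."""
    kept = token_ids[:max_length]
    num_padding = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * num_padding,
        # 1 marks real tokens, 0 marks padding the model should not attend to
        "attention_mask": [1] * len(kept) + [0] * num_padding,
    }


short_example = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long_example = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5, pad_id=0)

print(short_example)
print(long_example)
```

Every sequence comes out with exactly `max_length` entries, so short medical questions are mostly padding at `max_length=512`. That keeps batching simple at the cost of some wasted compute.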




In [None]:
print("🔄 Tokenizing...")

MAX_LENGTH = 512  # keep small for memory-constrained devices


def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized


tokenized_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names,
)
print(f"✅ Tokenized {len(tokenized_dataset)} examples")

🔄 Tokenizing...
✅ Tokenized 198 examples





Set up the training arguments with the `TrainingArguments` class with the following parameters:




- `output_dir="./results"`: The directory where checkpoints are saved
- `num_train_epochs=1`: The number of training epochs (the cell below uses 1 instead of 3 for a faster run)
- `per_device_train_batch_size=1`: The batch size per device
- `gradient_accumulation_steps=4`: The number of batches whose gradients are accumulated before each optimizer step
- `learning_rate=2e-4`: The learning rate
- `warmup_steps=10`: The number of steps over which the learning rate ramps up
- `logging_steps=10`: Log the training loss every 10 steps
- `save_steps=100`: Save a checkpoint every 100 steps
- `save_total_limit=2`: The maximum number of checkpoints kept on disk
- `fp16=False`: Whether to use FP16 mixed-precision training
- `logging_dir="./logs"`: The directory to save the logs
- `report_to="none"`: Disables reporting to external experiment trackers
- `use_mps_device`: Deprecated in recent Transformers versions and omitted from the cell below; the `Trainer` uses MPS (or CUDA) automatically when it is available
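
A quick back-of-the-envelope calculation shows how these settings combine. The sketch below assumes a single device, the 198 examples kept after filtering, and one epoch as in the training cell; it also assumes that a final partial accumulation group still triggers an optimizer step, which may differ between `Trainer` versions.

```python
import math

num_examples = 198            # rows kept after filtering (see the output above)
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_epochs = 1
warmup_steps = 10

# With one device, each optimizer step sees batch_size * accumulation examples
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Assumption: a trailing partial accumulation group still counts as one step
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps: {total_steps}")
print(f"Warmup covers {warmup_fraction:.0%} of training")
```

With `logging_steps=10`, a run of this length produces a handful of loss values, which is enough to see the trend but not to judge convergence.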




In [None]:
print("\n⚙️ Setting up training...")

# Note: `use_mps_device` is deprecated in Transformers; MPS will be used automatically
# if available, similar to CUDA.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,  # reduced from 3 to 1 for faster runs
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=10,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    fp16=False,
    logging_dir="./logs",
    report_to="none",
)


⚙️ Setting up training...







Use a `DataCollatorForLanguageModeling` class to collate the data for the training.




> Data collators are objects that form a batch from a list of dataset elements. These elements are of the same type as the elements of `train_dataset` or `eval_dataset`. Read more about them [here](https://huggingface.co/docs/transformers/v4.32.1/main_classes/data_collator).




```python
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False 
)
```




> What is the purpose of the `mlm` parameter?

**Answer:** `mlm=False` disables Masked Language Modeling (used by BERT). We set it to `False` because we are doing causal language modeling (GPT-style), which predicts the next token rather than masked tokens.
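
To our understanding, with `mlm=False` the collator also builds the `labels` itself: it copies `input_ids` and replaces padding positions with `-100`, the value PyTorch's cross-entropy loss ignores by default. The toy function below (a hypothetical helper, not library code) sketches that behavior on plain lists.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def causal_lm_labels(input_ids: list, pad_token_id: int) -> list:
    """Copy input_ids as labels and mask padding so it does not contribute to the loss."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]


# Toy sequence: three real tokens followed by two padding tokens (pad id 2 is made up)
toy_input_ids = [1, 15, 27, 2, 2]
toy_labels = causal_lm_labels(toy_input_ids, pad_token_id=2)
print(toy_labels)
```

One consequence worth keeping in mind: when the pad token is set to the EOS token, as in the tokenizer cell earlier, genuine EOS tokens are masked in the same way, so the model receives no training signal for them.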




In [None]:
# mlm=False because we're doing causal LM (next token prediction), not masked LM like BERT
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)




Set up the `Trainer` class with the following parameters:




- `model`: the model to train
- `args`: the training arguments defined above
- `train_dataset`: the tokenized training dataset
- `data_collator`: the data collator defined above




In [10]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

The model is already on multiple devices. Skipping the move to device specified in `args`.





Start training by calling the `.train()` method as shown below.




In [11]:
print("\n🚀 Starting training...")
print("="*60)
trainer.train()
print("="*60)
print("✅ Training complete!")


🚀 Starting training...




| Step | Training Loss |
|------|---------------|
| 10 | 1.91 |
| 20 | 1.0009 |
| 30 | 1.0178 |
| 40 | 0.9889 |
| 50 | 0.9126 |


✅ Training complete!





Save the model and the tokenizer with the `.save_pretrained` method.




In [12]:
print("\n💾 Saving model...")
model.save_pretrained("./llama3_medical_lora")
tokenizer.save_pretrained("./llama3_medical_lora")
print("✅ Model saved to: ./llama3_medical_lora")


💾 Saving model...
✅ Model saved to: ./llama3_medical_lora





Now let's test the model with a simple inference function to see whether it works as expected on unseen question-answering data 🤖




Before starting this exercise, ensure you have:




- Completed the fine-tuning of your model on the first 200 examples of the medical dataset
- Your fine-tuned model loaded and ready for inference
- The `medical-o1-verifiable-problem` dataset from FreedomIntelligence
- Required libraries installed: `transformers`, `torch`, `datasets` (`random` and `json` ship with Python)




### Step 1: Load and Split the Dataset

1. Load the complete dataset
2. Define your train/test split:
   - **Training set**: Examples 0-199 (used during our fine-tuning)
   - **Test set**: Examples 200+ (held out for our evaluation purposes)
3. Verify the total dataset size and confirm the split boundaries




### Step 2: Sample Test Examples

1. Set a random seed (e.g., 42) for reproducibility
2. Randomly select 20 examples from the test set
3. Record the indices of selected examples for reference




### Step 3: Create the Inference Function

Implement a `get_prediction()` function that:




1. Formats the question using the proper chat template (with user/assistant headers)
2. Tokenizes the input and moves it to the appropriate device
3. Generates a response using appropriate parameters:
   - `max_new_tokens=50` (adjust as needed)
   - `temperature=0.3` (lower for more deterministic answers)
   - `top_p=0.9`
4. Extracts and returns only the assistant's response (removing special tokens)




### Step 4: Implement Accuracy Checking

Create a `check_accuracy()` function that:




1. Compares the model's prediction against the ground truth answer
2. Implements two types of matching:
   - **Exact match**: Ground truth appears verbatim in prediction
   - **Partial match**: At least 70% of key medical terms from ground truth appear in prediction
3. Filters out common stop words when checking partial matches
4. Returns whether the prediction is correct and the match type




### Step 5: Run Evaluation Loop

For each of the 20 test examples, you will:




1. Extract the question and ground truth answer
2. Display the question (truncated if long)
3. Generate a prediction using your model
4. Check if the prediction is correct using your accuracy function
5. Display the result (✅ correct or ❌ incorrect)
6. Track running accuracy and timing metrics




### Step 6: Calculate Final Metrics

Compute and display:




- Total number of examples evaluated
- Number and percentage of exact matches
- Number and percentage of partial matches
- Overall accuracy percentage
- Number of incorrect predictions
- Total evaluation time and average time per example




### Step 7: Analyze Detailed Results

Review and display:




1. **Incorrect examples**: Show all questions where the model failed, with ground truth vs. prediction
2. **Correct examples**: Show a sample (first 5) of successful predictions
3. Understand patterns in successes and failures




### Step 8: Assess Performance

Interpret your results using these benchmarks:




- **≥80% accuracy**: Excellent - Fine-tuning was highly successful
- **60-79% accuracy**: Good - Model learned successfully
- **40-59% accuracy**: Moderate - Consider training longer or using more data
- **20-39% accuracy**: Poor - Check data quality and training parameters
- **<20% accuracy**: Very poor - Verify data formatting and retrain




### Step 9: Save Results

1. Create a comprehensive results dictionary containing:
   - All accuracy metrics
   - Timing information
   - Selected test indices
   - Detailed results for each example
2. Save to `evaluation_results.json` for future reference and analysis




In [13]:
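# If you are running out of memory, run this cell to free cached memory
import gc

# Release unreferenced Python objects first so their memory can be returned
gc.collect()

# Clear the accelerator cache for whichever backend is in use
if torch.cuda.is_available():
    torch.cuda.empty_cache()
elif torch.backends.mps.is_available():
    torch.mps.empty_cache()

print("✅ Memory cleared!")

✅ Memory cleared!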


In [None]:
import random

print("\n📊 Loading dataset...")
dataset = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem")

# Use examples NOT used in training (training used indices 0..199 above)
test_data = dataset["train"].select(range(200, len(dataset["train"])))

RNG_SEED = 42
random.seed(RNG_SEED)
selected_indices = random.sample(range(len(test_data)), k=min(20, len(test_data)))

print(f"\n🎲 Randomly selected {len(selected_indices)} test examples")
print(f"Indices: {selected_indices[:5]}... (showing first 5)")


def build_prompt(question: str) -> str:
    """Format a question using the same chat template used during training."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        "The answer is:"
    )


def get_prediction(question: str, max_tokens: int = 50) -> str:
    """Generate a prediction for a question."""
    prompt = build_prompt(question)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.3,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "The answer is:" in decoded:
        decoded = decoded.split("The answer is:")[-1].strip()
    return decoded


def check_accuracy(prediction: str, ground_truth: str):
    """Check correctness with exact match or a simple 70% keyword overlap heuristic."""
    pred_lower = prediction.lower().strip()
    gt_lower = ground_truth.lower().strip()

    if gt_lower in pred_lower:
        return True, "exact"

    stop_words = {
        "the",
        "a",
        "an",
        "is",
        "are",
        "was",
        "were",
        "be",
        "been",
        "of",
        "to",
        "and",
        "or",
        "in",
        "on",
        "at",
        "for",
    }
    gt_words = [w for w in gt_lower.split() if w not in stop_words and len(w) > 2]

    if not gt_words:
        return (gt_lower in pred_lower), ("exact" if gt_lower in pred_lower else "no_match")

    matches = sum(1 for w in gt_words if w in pred_lower)
    if matches / len(gt_words) >= 0.7:
        return True, "partial"

    return False, "no_match"


print("\n" + "=" * 80)
print("EVALUATING MODEL")
print("=" * 80)

results = []
correct_exact = 0
correct_partial = 0
start_time = time.time()

for i, idx in enumerate(selected_indices, 1):
    example = test_data[idx]

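    # NOTE: the question is truncated to 100 characters *before* generation, so the
    # model only sees the truncated text. This is kept to preserve the recorded results;
    # pass the full question to get_prediction() for a fairer evaluation.
    question = example.get(USER_FIELD, "")[:100]
    ground_truth = example.get(ANSWER_FIELD, "")

    print(f"\n[{i}/{len(selected_indices)}] Q: {question}...")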

    prediction = get_prediction(question)
    is_correct, match_type = check_accuracy(prediction, ground_truth)

    if is_correct:
        if match_type == "exact":
            correct_exact += 1
        else:
            correct_partial += 1
        print(f"✅ Correct ({match_type}) | GT: {ground_truth} | Pred: {prediction[:50]}...")
    else:
        print(f"❌ Wrong | GT: {ground_truth} | Pred: {prediction[:50]}...")

    results.append(
        {
            "question": question,
            "ground_truth": ground_truth,
            "prediction": prediction,
            "correct": is_correct,
            "match_type": match_type,
        }
    )

    total = i
    current_accuracy = (correct_exact + correct_partial) / total * 100
    print(f"Running accuracy: {current_accuracy:.1f}% ({correct_exact + correct_partial}/{total})")


total_time = time.time() - start_time
accuracy = (correct_exact + correct_partial) / total * 100

print("\n" + "=" * 80)
print("FINAL RESULTS")
print("=" * 80)
print(f"Total examples: {total}")
print(f"Exact matches: {correct_exact} ({correct_exact/total*100:.1f}%)")
print(f"Partial matches: {correct_partial} ({correct_partial/total*100:.1f}%)")
print(f"Overall accuracy: {accuracy:.1f}%")
print(f"Incorrect: {total - correct_exact - correct_partial}")
print(f"Total time: {total_time:.1f}s ({total_time/total:.1f}s per example)")

print("\n" + "=" * 80)
print("DETAILED RESULTS")
print("=" * 80)

incorrect = [r for r in results if not r["correct"]]
if incorrect:
    print(f"\n❌ INCORRECT EXAMPLES ({len(incorrect)}):")
    print("=" * 80)
    for j, r in enumerate(incorrect, 1):
        print(f"\n{j}. Question: {r['question']}")
        print(f"   Ground Truth: {r['ground_truth']}")
        print(f"   Prediction: {r['prediction'][:100]}...")
else:
    print("\n🎉 ALL EXAMPLES CORRECT!")

correct = [r for r in results if r["correct"]]
if correct:
    print(f"\n✅ CORRECT EXAMPLES ({len(correct)}):")
    print("=" * 80)
    for j, r in enumerate(correct[:5], 1):
        print(f"\n{j}. Question: {r['question']}")
        print(f"   Ground Truth: {r['ground_truth']}")
        print(f"   Prediction: {r['prediction'][:80]}...")
        print(f"   Match type: {r['match_type']}")

    if len(correct) > 5:
        print(f"\n... and {len(correct) - 5} more correct examples")

print("\n" + "=" * 80)
print("PERFORMANCE ASSESSMENT")
print("=" * 80)

if accuracy >= 80:
    print("🌟 EXCELLENT! Model is performing very well!")
elif accuracy >= 60:
    print("✅ GOOD! Model learned successfully!")
elif accuracy >= 40:
    print("⚠️  MODERATE. Consider training longer or using more data.")
elif accuracy >= 20:
    print("⚠️  POOR. Check data quality and training parameters.")
else:
    print("❌ VERY POOR. Verify data formatting and retrain.")

print("\n" + "=" * 80)
print("SAVING RESULTS")
print("=" * 80)

results_summary = {
    "accuracy": accuracy,
    "exact_matches": correct_exact,
    "partial_matches": correct_partial,
    "total": total,
    "time_seconds": total_time,
    "selected_indices": selected_indices,
    "detailed_results": results,
}

with open("evaluation_results.json", "w") as f:
    json.dump(results_summary, f, indent=2)

print("✅ Results saved to: evaluation_results.json")

print("\n" + "=" * 80)
print("EVALUATION COMPLETE")
print("=" * 80)


📊 Loading dataset...

🎲 Randomly selected 20 test examples
Indices: [7296, 1639, 18024, 16049, 14628]... (showing first 5)

EVALUATING MODEL

[1/20] Q: In a 4-year-old girl presenting with a small opening and clear thick drainage on the front of her ne...
❌ Wrong | GT: Epithelial tonsillar lining | Pred: A 4-year-old girl presents with a small opening an...
Running accuracy: 0.0% (0/1)

[2/20] Q: An 80-year-old male patient presents with a high-grade fever, cognitive decline, and behavioral dist...
❌ Wrong | GT: Pyogenic abscess | Pred: Dementia with Lewy bodies<...
Running accuracy: 0.0% (0/2)

[3/20] Q: Before performing a subtotal thyroidectomy on a patient with a long-standing thyroid nodule, what sp...
❌ Wrong | GT: Indirect Laryngoscopy | Pred: Thyroid nodule is a benign lesion and does not req...
Running accuracy: 0.0% (0/3)

[4/20] Q: A patient presents with mild jaundice, splenomegaly, and gallstones, and a peripheral smear shows ce...
❌ Wrong | GT: Lysine | Pre




## What about the next steps?

### Part A: Model Improvement Strategies

#### Question 1: Improving Model Performance

> Based on your evaluation results, propose **at least two specific strategies** to improve your model's accuracy. For each strategy, explain what you would change, why it helps, and potential trade-offs.

**Answer:**
1. **Increase training data** (200→2000 examples): More examples usually improve generalization. Trade-off: longer training time.
2. **Train more epochs** (1→3): The model sees each example more times. Trade-off: risk of overfitting.
3. **Increase LoRA rank** (r=16→32): More trainable parameters give the adapter more capacity. Trade-off: more memory usage.




#### Question 2: Analyzing Failure Patterns

> Review your incorrect predictions and identify patterns in the failures. What do they tell you about the model's errors?

**Answer:** Common failure patterns:
- Model generates verbose explanations instead of short answers
- Struggles with numerical values (dosages, dates, statistics)
- Confuses similar medical terms (e.g., similar drug names)
- Sometimes hallucinates plausible but incorrect information




#### Question 3: Data Quality vs. Quantity

> Which do you think is better: training on 2000 examples of the current quality, or on 500 curated, high-quality examples?

**Answer:** 500 curated, high-quality examples are often the better choice, because:
- Clean data keeps the model from learning noise and errors
- Consistent formatting helps the model learn the expected output pattern
- However, if the 2000 examples are reasonably clean, the larger set usually wins on generalization




### Part B: Resource-Constrained Inference

#### Question 4: Optimizing for limited resources

> How can you design a strategy to reduce inference time and memory for deployment in constrained environments?

**Answer:**
- **Quantization**: Use 4-bit or 8-bit quantization (bitsandbytes); 4-bit weights take roughly a quarter of the memory of FP16 weights, 8-bit weights roughly half
- **Smaller max_new_tokens**: Limit output length (50 instead of 200)
- **Use a smaller model**: TinyLlama (1.1B) instead of a 7B Llama model
- **Batch requests**: Process multiple queries together for better GPU utilization
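
The memory effect of quantization can be estimated with simple arithmetic. The sketch below counts weight storage only and assumes roughly 1.1 billion parameters; activations, the KV cache and the small bookkeeping overhead of quantized formats are ignored, so real usage will be higher.

```python
NUM_PARAMS = 1_100_000_000  # approximate parameter count of a 1.1B model (assumption)

# Bytes needed to store one weight in each format
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weights_gib(precision: str, num_params: int = NUM_PARAMS) -> float:
    """Approximate size of the model weights in GiB for a given precision."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1024**3


for precision in BYTES_PER_WEIGHT:
    print(f"{precision:>5}: {weights_gib(precision):.2f} GiB")

print(f"fp16 vs 4-bit: {weights_gib('fp16') / weights_gib('4-bit'):.0f}x smaller")
```

The same function can be called with a larger `num_params` to compare against a 7B model, which shows why model size and precision are usually the two biggest levers.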




#### Question 5: Speed vs. Accuracy Trade-offs

> Analyze how changing generation parameters affects speed, quality, and consistency 🥸

**Answer:**

| Parameter | Higher Value | Lower Value |
|-----------|--------------|-------------|
| `temperature` | More varied but less consistent output | More deterministic and repeatable output (no effect on speed) |
| `max_new_tokens` | Longer answers, slower | Faster but may cut off answers |
| `top_p` | Samples from a larger set of candidate tokens | More focused, predictable output |

For medical QA: use a low temperature (0.1-0.3) for consistent answers.
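
To illustrate how `temperature` and `top_p` interact, here is a toy sketch in plain Python. The logits are made up, and `top_p_candidates` is a simplified version of nucleus filtering rather than the exact procedure used by `generate`.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_candidates(probs: list, top_p: float) -> list:
    """Indices of the most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.5, 0.0]  # made-up scores for four candidate tokens

warm_probs = softmax_with_temperature(toy_logits, temperature=1.0)
cold_probs = softmax_with_temperature(toy_logits, temperature=0.3)

print("T=1.0:", [round(p, 3) for p in warm_probs], "candidates:", top_p_candidates(warm_probs, 0.9))
print("T=0.3:", [round(p, 3) for p in cold_probs], "candidates:", top_p_candidates(cold_probs, 0.9))
```

At the lower temperature almost all probability mass sits on the top token, so `top_p=0.9` leaves a single candidate and sampling becomes nearly deterministic. Neither setting changes how long a token takes to generate; only `max_new_tokens` directly bounds the run time.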




### Part C: Evaluation Methodology

#### Question 7: Improving Evaluation Metrics

> Analyze the limitations of the current exact/partial match evaluation and propose improvements. Do you think you have false negatives or false positives? What can we do about it?

**Answer:**
- **False negatives**: "Aspirin" vs "acetylsalicylic acid" are the same drug under different names → marked wrong
- **False positives**: Substring and keyword matching can accept "not diabetes" when the answer is "diabetes"

**Improvements:**
- Use medical synonym dictionaries (UMLS)
- Use semantic similarity (embeddings) instead of exact string match
- Have an LLM judge whether the answers are equivalent
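
Both failure modes can be reduced a little with cheap string-level rules before reaching for embeddings or an LLM judge. The sketch below is deliberately crude and rests on assumptions: the synonym table is a hypothetical one-entry stand-in for a real resource such as UMLS, and the negation check only looks at the word directly before the first match.

```python
import re

# Hypothetical synonym table for illustration; a real system would use a curated resource
SYNONYMS = {"acetylsalicylic acid": "aspirin"}

# A negation word immediately before the matched answer
NEGATION_BEFORE = re.compile(r"\b(?:not|no|without)\s$")


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace and map synonyms to one form."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    for variant, canonical in SYNONYMS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text


def is_match(prediction: str, ground_truth: str) -> bool:
    """True if the normalized answer appears in the prediction and is not negated."""
    pred, truth = normalize(prediction), normalize(ground_truth)
    if not truth:
        return False
    position = pred.find(truth)
    if position == -1:
        return False
    return NEGATION_BEFORE.search(pred[:position]) is None


print(is_match("Acetylsalicylic acid, 75 mg daily", "Aspirin"))  # synonym is recognized
print(is_match("This is not diabetes", "Diabetes"))              # negated mention is rejected
print(is_match("Type 2 diabetes", "diabetes"))                   # plain mention is accepted
```

Rules like these only cover the cases someone thought to encode, which is why the semantic approaches listed above remain the more robust option.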




#### Question 8: Test Set Size and Confidence

> Try other test set sizes and observe the results. What can you say about them? How can you improve the evaluation?

**Answer:** Approximate 95% confidence intervals in the worst case (accuracy near 50%):
- 20 examples: High variance, about ±22 percentage points
- 100 examples: More stable, about ±10 percentage points
- 500+ examples: Reliable estimate, about ±4 percentage points or less

Small test sets can be misleading. Use at least 100 examples for meaningful evaluation, or report confidence intervals.
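
The width of the confidence interval for a given test set size can be estimated with the normal approximation to the binomial distribution. This is a rough sketch: the approximation is weak for very small samples or accuracies close to 0% or 100%, where a Wilson interval would be a better choice.

```python
import math


def ci_half_width(accuracy: float, num_examples: int, z: float = 1.96) -> float:
    """Half-width of the ~95% normal-approximation interval for an accuracy estimate."""
    return z * math.sqrt(accuracy * (1 - accuracy) / num_examples)


for n in (20, 100, 500):
    # accuracy=0.5 is the worst case: it gives the widest interval
    width = ci_half_width(0.5, n) * 100
    print(f"{n:>3} examples: ±{width:.1f} percentage points")
```

Because the width shrinks with the square root of the sample size, cutting the uncertainty in half takes four times as many test examples.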




### Part D: Real-World Deployment Scenario

#### Question 9: Production Considerations

> What can you do to address safety, reliability, updates, and edge cases when deploying in a medical assistance application?

**Answer:**
- **Safety**: Add disclaimers ("consult a doctor"), filter harmful outputs, never replace professional advice
- **Reliability**: Add fallback responses for low-confidence predictions, log all queries for monitoring
- **Updates**: Retrain periodically with new medical guidelines, version control models
- **Edge cases**: Handle out-of-scope questions gracefully ("I don't know"), detect adversarial inputs


