# 07 - Custom Interview Testing

**Previous:** [06_Results_Analysis_and_Comparison.ipynb](06_Results_Analysis_and_Comparison.ipynb)

---

## What This Notebook Covers

The **interactive playground** - test your models on custom medical conversations!

**Key Features:**
1. Load custom doctor-patient conversations from JSON
2. Test multiple models side-by-side
3. Compare predictions with ground truth
4. Create your own test cases
5. Analyze model behavior on edge cases

**Use Cases:**
- Validate models on real-world conversations
- Test edge cases and rare diagnoses
- Demonstrate to stakeholders
- Quality assurance before deployment

**Custom Interview Format:**
```json
{
  "name": "Case 001 - Upper Respiratory Infection",
  "expected_icd10": "J06.9",
  "description": "Patient with cold symptoms",
  "messages": [
    {"role": "doctor", "message": "What brings you here today?"},
    {"role": "patient", "message": "I have a sore throat and fever."}
  ]
}
```

---

## Setup

In [None]:
import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Critical for GPU memory management
os.environ['PYTORCH_ALLOC_CONF'] = 'expandable_segments:True'

# Add src to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))

print(f"✅ Project Root: {project_root}")

In [None]:
# Import libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import json
import pandas as pd
from typing import Dict, List
from IPython.display import display, HTML
import gc

print("✅ All libraries imported")

In [None]:
# Check GPU
if torch.cuda.is_available():
    print(f"✅ CUDA Available: {torch.cuda.get_device_name(0)}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    device = "cuda"
else:
    print("⚠️  CUDA not available - using CPU (slower)")
    device = "cpu"

---

## 1. Loading Custom Interview Files

Place custom interview files in the `custom_interviews/` directory at the project root:

In [None]:
# Custom interviews directory
interviews_dir = project_root / "custom_interviews"

print(f"Looking for custom interviews in: {interviews_dir}\n")

# Find all JSON files
interview_files = list(interviews_dir.glob("*.json"))

if interview_files:
    print(f"✅ Found {len(interview_files)} interview file(s):\n")
    for file in interview_files:
        print(f"   • {file.name}")
else:
    print("❌ No interview files found!")
    print("\nCreate JSON files in custom_interviews/ with format:")
    print('''
{
  "name": "Case Name",
  "expected_icd10": "J06.9",
  "description": "Short description",
  "messages": [
    {"role": "doctor", "message": "..."},
    {"role": "patient", "message": "..."}
  ]
}
    ''')

### Load Interview Function

In [None]:
def load_interview(file_path: Path) -> Dict:
    """
    Load a custom interview from JSON file.
    """
    # Read as UTF-8 explicitly so results do not depend on the platform default
    with open(file_path, 'r', encoding='utf-8') as f:
        interview = json.load(f)
    
    # Validate format
    required_fields = ['name', 'expected_icd10', 'messages']
    for field in required_fields:
        if field not in interview:
            raise ValueError(f"Missing required field: {field}")
    
    return interview

# Load all interviews
interviews = []
for file in interview_files:
    try:
        interview = load_interview(file)
        interviews.append(interview)
        print(f"✅ Loaded: {interview['name']}")
    except Exception as e:
        print(f"❌ Error loading {file.name}: {e}")

print(f"\nTotal interviews loaded: {len(interviews)}")
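
`load_interview` only checks that the top-level fields exist. As a minimal sketch, the messages themselves could be checked as well. The helper name `validate_messages` and the `ALLOWED_ROLES` set are assumptions for illustration, based on the two roles and the `message`/`content` keys used elsewhere in this notebook.

```python
from typing import Dict, List

# Assumption: "doctor" and "patient" are the only roles used in this notebook
ALLOWED_ROLES = {"doctor", "patient"}

def validate_messages(interview: Dict) -> List[str]:
    """Return a list of problems found in interview['messages'] (empty if none)."""
    messages = interview.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages, 1):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: expected an object, got {type(msg).__name__}")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        # The notebook accepts either 'message' or 'content' as the text key
        text = msg.get("message", msg.get("content"))
        if not isinstance(text, str) or not text.strip():
            problems.append(f"message {i}: missing or empty 'message'/'content'")
    return problems
```

One way to use it is to call `validate_messages(interview)` right after loading and skip, or report, any file that returns a non-empty list.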

### Display Interview

In [None]:
def display_interview(interview: Dict):
    """
    Pretty print an interview.
    """
    print("="*70)
    print(f"📋 {interview['name']}")
    print("="*70)
    if 'description' in interview:
        print(f"Description: {interview['description']}")
    print(f"Expected ICD-10: {interview['expected_icd10']}")
    print(f"\nConversation ({len(interview['messages'])} messages):")
    print("-"*70)
    
    for i, msg in enumerate(interview['messages'], 1):
        role = msg['role'].upper()
        content = msg['message'] if 'message' in msg else msg['content']
        print(f"\n[{i}] {role}:")
        print(f"    {content}")
    
    print("\n" + "="*70)

# Show first interview as example
if interviews:
    display_interview(interviews[0])

---

## 2. Model Loading Utilities

Helper functions to load and test models:

In [None]:
# Quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

@torch.no_grad()
def predict_diagnosis(interview: Dict, model, tokenizer) -> str:
    """
    Predict ICD-10 code from interview.
    """
    system_prompt = (
        "You are a medical diagnosis assistant. "
        "Based on the doctor-patient conversation, predict the ICD-10 diagnosis code. "
        "Respond with ONLY the code (e.g., 'J06.9'), nothing else."
    )
    
    # Format conversation
    conversation_text = "\n".join([
        f"{msg['role'].capitalize()}: {msg.get('message', msg.get('content'))}"
        for msg in interview['messages']
    ])
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": conversation_text}
    ]
    
    # Apply chat template
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize and generate
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    
    # Decode
    generated = outputs[0][inputs['input_ids'].shape[1]:]
    prediction = tokenizer.decode(generated, skip_special_tokens=True).strip()
    
    # Extract code (first word)
    prediction = prediction.split()[0] if prediction.split() else ""
    
    return prediction

print("✅ Prediction utilities ready")
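
Taking the first whitespace-separated word works when the model returns a bare code, but output such as `J06.9.` or `'j06.9'` would then fail the exact-match comparison. The sketch below shows one possible alternative: a regex that pulls the first ICD-10-like token from the raw text. The pattern is a simplified assumption about code shape, not a full ICD-10 validator, and `extract_icd10` is a hypothetical helper, not part of the original notebook.

```python
import re

# Assumption: a code is a letter, a digit, one alphanumeric, then an optional
# dot followed by up to four alphanumerics (e.g. "I10", "J06.9", "S52.521A").
ICD10_PATTERN = re.compile(r"\b([A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?)\b")

def extract_icd10(raw_output: str) -> str:
    """Return the first ICD-10-like code in raw_output, upper-cased, or ''."""
    match = ICD10_PATTERN.search(raw_output.upper())
    return match.group(1) if match else ""

print(extract_icd10("'j06.9'."))
print(extract_icd10("Code: I10"))
```

If this behavior is wanted, `extract_icd10(prediction)` could replace the first-word split inside `predict_diagnosis`.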

---

## 3. Testing Finetuned Model

Load and test the finetuned Llama 3.2 3B model:

In [None]:
# Load finetuned model
print("Loading finetuned Llama 3.2 3B model...\n")

base_model_name = "meta-llama/Llama-3.2-3B-Instruct"
adapter_path = project_root / "models" / "llama-3.2-3b-medical-lora" / "final_model"

if adapter_path.exists():
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=quantization_config,
        device_map="auto",
        torch_dtype=torch.bfloat16
    )
    
    # Load LoRA adapters
    finetuned_model = PeftModel.from_pretrained(base_model, str(adapter_path))
    finetuned_model.eval()
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print("✅ Finetuned model loaded successfully")
    
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        print(f"   GPU Memory: {allocated:.2f} GB")
else:
    print(f"❌ Adapter not found at {adapter_path}")
    print("   Run notebook 04 to train the model first.")
    finetuned_model = None
    tokenizer = None

### Test on Custom Interviews

In [None]:
if finetuned_model and interviews:
    print("\nTesting finetuned model on custom interviews...\n")
    print("="*80)
    
    results = []
    
    for interview in interviews:
        print(f"\n📋 {interview['name']}")
        print("-"*80)
        
        # Make prediction
        prediction = predict_diagnosis(interview, finetuned_model, tokenizer)
        expected = interview['expected_icd10']
        
        # Check if correct
        correct = prediction == expected
        
        # Store result
        results.append({
            'interview': interview['name'],
            'expected': expected,
            'predicted': prediction,
            'correct': correct
        })
        
        # Display result
        print(f"Expected:   {expected}")
        print(f"Predicted:  {prediction}")
        print(f"Result:     {'✅ CORRECT' if correct else '❌ INCORRECT'}")
        
        # Show conversation snippet (first two messages, truncated to 60 characters)
        print("\nConversation snippet:")
        for msg in interview['messages'][:2]:
            content = msg.get('message', msg.get('content', ''))
            snippet = f"{content[:60]}..." if len(content) > 60 else content
            print(f"  {msg['role']:8s}: {snippet}")
    
    print("\n" + "="*80)
    
    # Summary
    correct_count = sum(r['correct'] for r in results)
    total_count = len(results)
    accuracy = correct_count / total_count if total_count > 0 else 0
    
    print(f"\n📊 SUMMARY:")
    print(f"   Total interviews: {total_count}")
    print(f"   Correct:          {correct_count}")
    print(f"   Accuracy:         {accuracy:.1%}")
    print("="*80)
else:
    if not finetuned_model:
        print("⚠️  Finetuned model not loaded")
    if not interviews:
        print("⚠️  No interviews to test")

---

## 4. Comparing Multiple Models

Test both finetuned and zero-shot models side-by-side:

In [None]:
def compare_models_on_interview(interview: Dict, models_dict: Dict) -> pd.DataFrame:
    """
    Test multiple models on one interview.
    
    Args:
        interview: Interview dict
        models_dict: {name: (model, tokenizer)}
    
    Returns:
        DataFrame with results
    """
    expected = interview['expected_icd10']
    results = []
    
    for model_name, (model, tokenizer) in models_dict.items():
        prediction = predict_diagnosis(interview, model, tokenizer)
        correct = prediction == expected
        
        results.append({
            'Model': model_name,
            'Prediction': prediction,
            'Match': '✅' if correct else '❌'
        })
    
    # Add expected
    results.append({
        'Model': 'Ground Truth',
        'Prediction': expected,
        'Match': '-'
    })
    
    return pd.DataFrame(results)

# For demonstration, we only have finetuned model loaded
# In practice, you could load multiple models

if finetuned_model and interviews:
    print("\nDetailed Comparison (first interview):\n")
    
    models_to_test = {
        'Llama 3.2 3B (Finetuned)': (finetuned_model, tokenizer)
    }
    
    # Test first interview
    interview = interviews[0]
    display_interview(interview)
    
    print("\n" + "="*70)
    print("MODEL PREDICTIONS:")
    print("="*70 + "\n")
    
    comparison_df = compare_models_on_interview(interview, models_to_test)
    display(comparison_df)
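
`compare_models_on_interview` handles one interview at a time. A rough sketch of extending the same idea to every interview is shown below. To keep it runnable without a GPU, each model is represented by a plain callable that maps an interview to a code. The function names and the stand-in predictors are assumptions for illustration only.

```python
from typing import Callable, Dict, List

def compare_models_on_all(
    interviews: List[Dict],
    predictors: Dict[str, Callable[[Dict], str]],
) -> List[Dict]:
    """Build one row per interview with a prediction column per model."""
    rows = []
    for interview in interviews:
        row = {"Interview": interview["name"], "Expected": interview["expected_icd10"]}
        for model_name, predict in predictors.items():
            row[model_name] = predict(interview)
        rows.append(row)
    return rows

def accuracy_per_model(rows: List[Dict], model_names: List[str]) -> Dict[str, float]:
    """Fraction of rows where each model's prediction equals the expected code."""
    if not rows:
        return {name: 0.0 for name in model_names}
    return {
        name: sum(row[name] == row["Expected"] for row in rows) / len(rows)
        for name in model_names
    }

# Stand-in predictors so the sketch runs without loading any model
demo_interviews = [
    {"name": "Case A", "expected_icd10": "J06.9"},
    {"name": "Case B", "expected_icd10": "I10"},
]
demo_predictors = {
    "always_J06.9": lambda interview: "J06.9",
    "oracle": lambda interview: interview["expected_icd10"],
}
rows = compare_models_on_all(demo_interviews, demo_predictors)
print(accuracy_per_model(rows, list(demo_predictors)))
```

In this notebook a real predictor could wrap the existing function, for example `lambda iv: predict_diagnosis(iv, finetuned_model, tokenizer)`, and the returned rows can be passed to `pd.DataFrame` for display.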

---

## 5. Interactive Testing

Create your own interview on the fly:

In [None]:
# Create a custom interview programmatically
custom_interview = {
    "name": "Custom Test Case - Hypertension",
    "expected_icd10": "I10",
    "description": "Patient with high blood pressure",
    "messages": [
        {
            "role": "doctor",
            "message": "What brings you to the clinic today?"
        },
        {
            "role": "patient",
            "message": "I've been experiencing headaches and my blood pressure has been high."
        },
        {
            "role": "doctor",
            "message": "How high is your blood pressure?"
        },
        {
            "role": "patient",
            "message": "It's been around 150/95 for the past week."
        },
        {
            "role": "doctor",
            "message": "Do you have any family history of hypertension?"
        },
        {
            "role": "patient",
            "message": "Yes, both my parents have high blood pressure."
        }
    ]
}

# Test it
if finetuned_model:
    print("\nTesting custom interview:\n")
    display_interview(custom_interview)
    
    prediction = predict_diagnosis(custom_interview, finetuned_model, tokenizer)
    expected = custom_interview['expected_icd10']
    
    print("\n" + "="*70)
    print(f"Expected:  {expected}")
    print(f"Predicted: {prediction}")
    print(f"Result:    {'✅ CORRECT' if prediction == expected else '❌ INCORRECT'}")
    print("="*70)

---

## 6. Batch Testing on All Interviews

Generate a comprehensive report:

In [None]:
if finetuned_model and interviews:
    # Test all interviews
    all_results = []
    
    for interview in interviews:
        prediction = predict_diagnosis(interview, finetuned_model, tokenizer)
        expected = interview['expected_icd10']
        
        all_results.append({
            'Interview': interview['name'],
            'Expected': expected,
            'Predicted': prediction,
            'Correct': prediction == expected
        })
    
    # Create DataFrame
    results_df = pd.DataFrame(all_results)
    results_df['Status'] = results_df['Correct'].apply(lambda x: '✅' if x else '❌')
    
    print("\n" + "="*80)
    print(" "*25 + "COMPREHENSIVE TEST REPORT")
    print("="*80 + "\n")
    
    display(results_df[['Interview', 'Expected', 'Predicted', 'Status']])
    
    # Summary statistics
    print("\n" + "="*80)
    print("SUMMARY STATISTICS:")
    print("="*80)
    
    total = len(results_df)
    correct = results_df['Correct'].sum()
    accuracy = correct / total if total > 0 else 0
    
    print(f"\nTotal Interviews Tested:  {total}")
    print(f"Correct Predictions:      {correct}")
    print(f"Incorrect Predictions:    {total - correct}")
    print(f"Accuracy:                 {accuracy:.1%}")
    print("\n" + "="*80)
    
    # Save results
    output_path = project_root / "outputs" / "custom_interview_results.csv"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    results_df.to_csv(output_path, index=False)
    print(f"\n✅ Results saved to: {output_path}")
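
The report above counts only exact matches. Because the first three characters of an ICD-10 code identify its category, a report could also separate near misses (right category, wrong subcode) from complete misses. The following is a sketch under that assumption. `match_level` and `summarize_matches` are hypothetical helpers, not part of the original pipeline.

```python
from collections import Counter
from typing import Iterable, Tuple

def match_level(expected: str, predicted: str) -> str:
    """Classify a prediction as 'exact', 'category' (same 3-character prefix), or 'miss'."""
    exp, pred = expected.strip().upper(), predicted.strip().upper()
    if exp and exp == pred:
        return "exact"
    if len(exp) >= 3 and exp[:3] == pred[:3]:
        return "category"
    return "miss"

def summarize_matches(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count match levels over (expected, predicted) pairs."""
    return Counter(match_level(expected, predicted) for expected, predicted in pairs)

print(summarize_matches([("J06.9", "J06.9"), ("J06.9", "J06.0"), ("I10", "E11.9")]))
```

With the DataFrame above, the pairs could come from `zip(results_df['Expected'], results_df['Predicted'])`.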

---

## 7. Creating New Test Cases

Template for creating your own interview JSON files:

In [None]:
# Template generator
def create_interview_template(name: str, icd10: str, description: str = "") -> Dict:
    """
    Create a template for a new interview.
    """
    template = {
        "name": name,
        "expected_icd10": icd10,
        "description": description,
        "messages": [
            {
                "role": "doctor",
                "message": "What brings you here today?"
            },
            {
                "role": "patient",
                "message": "[Patient describes symptoms]"
            },
            {
                "role": "doctor",
                "message": "[Doctor asks follow-up questions]"
            },
            {
                "role": "patient",
                "message": "[Patient provides more details]"
            }
        ]
    }
    return template

def save_interview(interview: Dict, filename: str):
    """
    Save interview to JSON file.
    """
    filepath = interviews_dir / filename
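    # Write UTF-8 explicitly and keep non-ASCII text readable in the file
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(interview, f, indent=2, ensure_ascii=False)
    print(f"✅ Saved interview to: {filepath}")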

# Example: Create a new template
print("\nExample Interview Template:\n")
new_template = create_interview_template(
    name="Case XXX - Condition Name",
    icd10="XXX.X",
    description="Brief description of the case"
)

print(json.dumps(new_template, indent=2))

print("\n💡 To create your own interview:")
print("   1. Copy the template above")
print("   2. Fill in the conversation")
print("   3. Save as .json in custom_interviews/")
print("   4. Re-run this notebook to test!")

---

## 8. Cleanup

In [None]:
# Free GPU memory
if finetuned_model:
    del finetuned_model
    del base_model
    del tokenizer

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    torch.cuda.ipc_collect()

print("✅ Memory freed")

---

## 9. Key Takeaways 💡

### What We Learned

1. **Custom Testing is Essential**
   - Standard metrics don't capture everything
   - Real-world cases reveal edge cases
   - Stakeholder demonstrations need concrete examples

2. **JSON Format is Flexible**
   - Easy to create test cases
   - Human-readable and version-controllable
   - Can be generated programmatically

3. **Model Comparison**
   - Side-by-side comparison is powerful
   - Shows concrete differences between models
   - Helps identify model strengths/weaknesses

4. **Interactive Testing**
   - Immediate feedback loop
   - Can test hypotheses quickly
   - Useful for model debugging

### Best Practices

✅ **Do:**
- Create diverse test cases (common + rare conditions)
- Include edge cases and ambiguous scenarios
- Document why each test case matters
- Version control your test cases
- Regularly expand your test suite

❌ **Don't:**
- Only test happy paths
- Ignore model failures (they're learning opportunities!)
- Use production data without proper anonymization
- Skip validation of expected ICD codes

### Next Steps

1. **Expand Test Suite**
   - Add more diverse medical conditions
   - Include multi-diagnosis cases
   - Test rare diseases

2. **Automate Testing**
   - Integrate with CI/CD pipeline
   - Set accuracy thresholds
   - Monitor regression

3. **Gather Real Data**
   - Collaborate with medical professionals
   - Collect anonymized real conversations
   - Validate on clinical data
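
As a sketch of the "set accuracy thresholds" step, a CI job could read the CSV written in section 6 and fail when accuracy drops below a chosen bar. The function names and the 0.8 default are assumptions for illustration. The only thing taken from the notebook is the `Correct` column, which pandas writes as `True`/`False`.

```python
import csv
import io
from typing import TextIO

def accuracy_from_results(stream: TextIO) -> float:
    """Compute accuracy from a results CSV with a 'Correct' column of True/False."""
    rows = list(csv.DictReader(stream))
    if not rows:
        raise ValueError("results file contains no rows")
    correct = sum(row["Correct"].strip().lower() == "true" for row in rows)
    return correct / len(rows)

def meets_threshold(stream: TextIO, min_accuracy: float = 0.8) -> bool:
    """True when accuracy is at least min_accuracy (0.8 is an arbitrary example value)."""
    return accuracy_from_results(stream) >= min_accuracy

# In-memory sample standing in for outputs/custom_interview_results.csv
sample = io.StringIO(
    "Interview,Expected,Predicted,Correct\n"
    "Case 001,J06.9,J06.9,True\n"
    "Case 002,I10,I11.9,False\n"
)
print(meets_threshold(sample, min_accuracy=0.5))
```

In a pipeline, the same functions could be given an open file handle, with a non-zero exit code when `meets_threshold` returns `False`.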

---

## Summary

In this notebook, we:

- ✅ Loaded custom medical interviews from JSON
- ✅ Tested finetuned models on custom cases
- ✅ Compared predictions with ground truth
- ✅ Created interactive testing framework
- ✅ Generated comprehensive test reports
- ✅ Learned how to create new test cases

**You can now:**
- Test models on your own medical conversations
- Validate model performance on specific cases
- Demonstrate model capabilities to stakeholders
- Build confidence before deployment

**Related Files:**
- `custom_interviews/` - Your test cases directory
- `test_custom_interviews.py` - Command-line testing script
- `CUSTOM_INTERVIEW_TESTING.md` - Detailed documentation

---

## 🎉 Congratulations!

You've completed the entire notebook series! You now understand:

1. ✅ The complete ML pipeline (data → training → evaluation)
2. ✅ How LoRA finetuning works
3. ✅ Size vs specialization trade-offs
4. ✅ How to evaluate and compare models
5. ✅ How to test on custom cases

**Ready for production?** Check out the main project files:
- `main.py` - Automated pipeline
- `GUIDE.md` - Complete project guide
- `QUICKSTART.md` - Fast deployment guide

🚀 **Happy model building!**