# Notebook 06: Results Analysis and Visualization

## Goal of This Notebook

In this notebook we will:

1. **Load all results** - LLM baseline + finetuned SLM
2. **Analyze them in depth** - identify strengths and weaknesses
3. **Create visualizations** - publication-ready plots
4. **Generate an HTML report** - for presentation
5. **Formulate a conclusion** - scientific findings

---

## Research Question

> **Can specialized small language models (3B) outperform large general-purpose models (7-8B) at ICD-10 classification through finetuning?**

---

In [None]:
# ============================================================
# SETUP: Imports and environment
# ============================================================

import json
import warnings
from pathlib import Path
from dataclasses import dataclass, field
from datetime import datetime

warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Plot style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

print("Imports successful!")

In [None]:
# ============================================================
# CONFIGURATION (standalone)
# ============================================================

@dataclass
class PathConfig:
    project_root: Path = field(default_factory=lambda: Path.cwd().parent)
    outputs_dir: Path = field(default_factory=lambda: Path.cwd().parent / "outputs")
    plots_dir: Path = field(default_factory=lambda: Path.cwd().parent / "outputs" / "plots")
    reports_dir: Path = field(default_factory=lambda: Path.cwd().parent / "outputs" / "reports")
    
    def create_directories(self):
        """Create every directory stored in a Path-typed field."""
        for value in vars(self).values():
            if isinstance(value, Path):
                value.mkdir(parents=True, exist_ok=True)

paths = PathConfig()
paths.create_directories()

print("Paths configured!")

## 1. Load Results
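The loading cell below reads two JSON files from `outputs/reports` and falls back to an empty dictionary when a file is missing. The same pattern can be sketched as a small reusable helper. This assumes each file holds a JSON object that maps a model key (such as `LLM_...` or `SLM_...`) to a flat metrics dictionary; the helper name `load_results` and the demo key are hypothetical and not part of the pipeline.

```python
import json
import tempfile
from pathlib import Path


def load_results(path: Path) -> dict:
    """Return the parsed results file, or an empty dict if it is missing."""
    if not path.exists():
        return {}
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError(f"{path} must contain a JSON object, got {type(data).__name__}")
    return data


# Demonstration with a temporary file; the key and value are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_results.json"
    demo_path.write_text(
        json.dumps({"LLM_example_untrained": {"exact_match_accuracy": 0.25}}),
        encoding="utf-8",
    )
    loaded = load_results(demo_path)
    missing = load_results(Path(tmp) / "does_not_exist.json")

print(loaded)
print(missing)
```

Checking that the top-level value is a dictionary catches a malformed file early, before the merge step silently produces an unexpected table.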

In [None]:
# ============================================================
# LOAD ALL RESULTS
# ============================================================

# LLM baseline
llm_path = paths.reports_dir / "llm_baseline_results.json"
if llm_path.exists():
    with open(llm_path, 'r', encoding='utf-8') as f:
        llm_results = json.load(f)
    print(f"LLM baseline: {len(llm_results)} models")
else:
    print("LLM baseline not found!")
    llm_results = {}

# Finetuned SLM
slm_path = paths.reports_dir / "slm_finetuned_results.json"
if slm_path.exists():
    with open(slm_path, 'r', encoding='utf-8') as f:
        slm_results = json.load(f)
    print(f"SLM finetuned: {len(slm_results)} models")
else:
    print("SLM finetuned not found!")
    slm_results = {}

# Combined (on duplicate keys, SLM entries overwrite LLM entries)
all_results = {}
all_results.update(llm_results)
all_results.update(slm_results)

print(f"\nTotal: {len(all_results)} models")

In [None]:
# ============================================================
# BUILD RESULTS DATAFRAME
# ============================================================

results_list = []

for key, metrics in all_results.items():
    model_type = "LLM" if key.startswith("LLM") else "SLM"
    short_name = key.replace("LLM_", "").replace("SLM_", "").replace("_untrained", "").replace("_finetuned", "")
    
    results_list.append({
        "model_key": key,
        "model_name": short_name,
        "model_type": model_type,
        "model_size": metrics.get("model_size", ""),
        "training": metrics.get("training", ""),
        "accuracy": metrics.get("exact_match_accuracy", 0),
        "prefix_3": metrics.get("prefix_match_3", 0),
        "prefix_1": metrics.get("prefix_match_1", 0),
        "precision": metrics.get("precision", 0),
        "recall": metrics.get("recall", 0),
        "f1": metrics.get("f1", 0),
        "n_samples": metrics.get("n_samples", 0),
        "eval_time": metrics.get("eval_time_seconds", 0),
        "samples_per_sec": metrics.get("samples_per_second", 0),
    })

df = pd.DataFrame(results_list)
df = df.sort_values("accuracy", ascending=False).reset_index(drop=True)

print("Results table:")
print(df[["model_name", "model_type", "model_size", "training", "accuracy", "f1"]].to_string(index=False))

## 2. Main Comparison: LLM vs. SLM
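The plots in this section compare exact-match accuracy and prefix-match scores. Those metrics are computed in earlier notebooks, so the following is only a sketch of one plausible definition. It assumes that predictions and references are ICD-10 code strings and that `prefix_match_3` and `prefix_match_1` mean agreement on the first three characters and the first character, respectively. The function names and sample codes are illustrative.

```python
from typing import List


def normalize(code: str) -> str:
    """Uppercase the code and drop surrounding whitespace and the dot."""
    return code.strip().upper().replace(".", "")


def exact_match(preds: List[str], refs: List[str]) -> float:
    """Share of predictions that equal the reference code after normalization."""
    if not refs:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)


def prefix_match(preds: List[str], refs: List[str], n: int) -> float:
    """Share of predictions whose first n characters equal the reference's."""
    if not refs:
        return 0.0
    hits = sum(normalize(p)[:n] == normalize(r)[:n] for p, r in zip(preds, refs))
    return hits / len(refs)


# Made-up example codes, not taken from the dataset.
preds = ["J45.9", "I10", "E11.9", "K21.0"]
refs = ["J45.0", "I10", "E10.9", "M54.5"]

acc = exact_match(preds, refs)
p3 = prefix_match(preds, refs, 3)
p1 = prefix_match(preds, refs, 1)
print(acc, p3, p1)
```

Under this definition the three scores can only grow as the prefix gets shorter, which is a quick sanity check on any results table.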

In [None]:
# ============================================================
# MAIN VISUALIZATION
# ============================================================

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Colors
colors_dict = {
    "LLM": "#E74C3C",   # red for LLM
    "SLM": "#27AE60",   # green for SLM
}

# 1. Accuracy comparison
ax1 = axes[0]
colors = [colors_dict[t] for t in df["model_type"]]
bars = ax1.bar(range(len(df)), df["accuracy"], color=colors, edgecolor='black', linewidth=1.2)
ax1.set_xticks(range(len(df)))
ax1.set_xticklabels(df["model_name"], rotation=45, ha='right', fontsize=10)
ax1.set_ylabel('Exact Match Accuracy')
ax1.set_title('ICD-10 Classification Accuracy')
ax1.set_ylim(0, 1)
ax1.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='Baseline 50%')

for bar, acc in zip(bars, df["accuracy"]):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
             f'{acc:.1%}', ha='center', va='bottom', fontsize=9, fontweight='bold')

# 2. Metric comparison (grouped bars in place of a radar chart)
ax2 = axes[1]
metrics = ["Accuracy", "Prefix-3", "Precision", "Recall", "F1"]
x = np.arange(len(metrics))
width = 0.8 / len(df)

for i, (idx, row) in enumerate(df.iterrows()):
    values = [row["accuracy"], row["prefix_3"], row["precision"], row["recall"], row["f1"]]
    color = colors_dict[row["model_type"]]
    # Clamp alpha to 1.0; matplotlib rejects values outside [0, 1]
    ax2.bar(x + i*width, values, width, label=row["model_name"], color=color, alpha=min(1.0, 0.7 + 0.1*i))

ax2.set_xticks(x + width * (len(df)-1) / 2)
ax2.set_xticklabels(metrics)
ax2.set_ylabel('Score')
ax2.set_title('Metric Comparison')
ax2.legend(fontsize=8, loc='lower right')
ax2.set_ylim(0, 1)

# 3. Model size vs. accuracy
ax3 = axes[2]
sizes_numeric = {"3B": 3, "7B": 7, "8B": 8}
for idx, row in df.iterrows():
    size = sizes_numeric.get(row["model_size"], 5)  # unknown sizes are plotted at 5
    color = colors_dict[row["model_type"]]
    marker = 'o' if row["model_type"] == "LLM" else 's'
    ax3.scatter(size, row["accuracy"], s=200, c=color, marker=marker, 
                edgecolor='black', linewidth=1.5, label=row["model_name"], zorder=5)

ax3.set_xlabel('Model size (billions of parameters)')
ax3.set_ylabel('Exact Match Accuracy')
ax3.set_title('Size vs. Performance')
ax3.set_xlim(0, 10)
ax3.set_ylim(0, 1)
ax3.legend(fontsize=8)

# Legend
legend_elements = [
    Patch(facecolor='#E74C3C', edgecolor='black', label='LLM (Zero-Shot)'),
    Patch(facecolor='#27AE60', edgecolor='black', label='SLM (Finetuned)'),
]
fig.legend(handles=legend_elements, loc='upper center', ncol=2, bbox_to_anchor=(0.5, 1.02), fontsize=11)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.savefig(paths.plots_dir / 'main_comparison.png', dpi=200, bbox_inches='tight')
plt.show()

print(f"Plot saved: {paths.plots_dir / 'main_comparison.png'}")

## 3. Detailed Analysis
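The statistics cell below reports the gain from finetuning as a relative change, `(slm_best - llm_best) / llm_best * 100`. That figure is easy to confuse with a difference in percentage points, so the sketch below computes both side by side. The function name `improvement` and the input accuracies are hypothetical and chosen only for illustration.

```python
from typing import Optional, Tuple


def improvement(slm_acc: float, llm_acc: float) -> Tuple[float, Optional[float]]:
    """Return (absolute gain in percentage points, relative gain in percent).

    The relative gain is None when the LLM accuracy is 0, because the
    ratio is undefined in that case.
    """
    points = (slm_acc - llm_acc) * 100
    relative = (slm_acc - llm_acc) / llm_acc * 100 if llm_acc > 0 else None
    return points, relative


# Hypothetical accuracies, not results from this project.
points, relative = improvement(slm_acc=0.60, llm_acc=0.50)
print(f"{points:+.1f} percentage points, {relative:+.1f}% relative")
```

With these made-up inputs the absolute gain is about 10 percentage points while the relative gain is about 20%, so a report should state which of the two it means.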

In [None]:
# ============================================================
# STATISTICAL ANALYSIS
# ============================================================

print("📈 Statistical analysis")
print("=" * 60)

# LLM vs. SLM comparison
llm_df = df[df["model_type"] == "LLM"]
slm_df = df[df["model_type"] == "SLM"]

# Note: std() is NaN when a group contains only one model.
print("\nLLM (zero-shot) statistics:")
if len(llm_df) > 0:
    print(f"   Accuracy: {llm_df['accuracy'].mean():.2%} (±{llm_df['accuracy'].std():.2%})")
    print(f"   F1: {llm_df['f1'].mean():.4f}")
    print(f"   Best: {llm_df.iloc[0]['model_name']} ({llm_df['accuracy'].max():.2%})")

print("\nSLM (finetuned) statistics:")
if len(slm_df) > 0:
    print(f"   Accuracy: {slm_df['accuracy'].mean():.2%} (±{slm_df['accuracy'].std():.2%})")
    print(f"   F1: {slm_df['f1'].mean():.4f}")
    print(f"   Best: {slm_df.iloc[0]['model_name']} ({slm_df['accuracy'].max():.2%})")

# Relative improvement of the best SLM over the best LLM
if len(llm_df) > 0 and len(slm_df) > 0:
    llm_best = llm_df['accuracy'].max()
    slm_best = slm_df['accuracy'].max()

    if llm_best > 0:
        improvement = (slm_best - llm_best) / llm_best * 100
        print("\nImprovement from finetuning (relative to the best LLM):")
        if improvement > 0:
            print(f"   +{improvement:.1f}% accuracy gain")
        else:
            print(f"   {improvement:.1f}% accuracy loss")
    else:
        print("\nRelative improvement is undefined: best LLM accuracy is 0.")

In [None]:
# ============================================================
# EFFICIENCY ANALYSIS
# ============================================================

print("\n⚡ Efficiency analysis")
print("=" * 60)

fig, ax = plt.subplots(figsize=(10, 6))

for idx, row in df.iterrows():
    size = {"3B": 3, "7B": 7, "8B": 8}.get(row["model_size"], 5)
    color = colors_dict[row["model_type"]]
    
    # Bubble size is based on inference speed (samples per second)
    bubble_size = max(50, row["samples_per_sec"] * 50) if row["samples_per_sec"] > 0 else 100
    
    ax.scatter(size, row["accuracy"], s=bubble_size, c=color, 
               alpha=0.7, edgecolor='black', linewidth=1.5)
    ax.annotate(row["model_name"], (size, row["accuracy"]), 
                xytext=(5, 5), textcoords='offset points', fontsize=8)

ax.set_xlabel('Model size (billions of parameters)')
ax.set_ylabel('Exact Match Accuracy')
ax.set_title('Efficiency: Size vs. Performance\n(bubble size = inference speed)')
ax.set_xlim(0, 10)
ax.set_ylim(0, 1)

# Legend
legend_elements = [
    Patch(facecolor='#E74C3C', edgecolor='black', label='LLM (Zero-Shot)'),
    Patch(facecolor='#27AE60', edgecolor='black', label='SLM (Finetuned)'),
]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.savefig(paths.plots_dir / 'efficiency_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

## 4. Generate HTML Report
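The report function below interpolates values such as model names directly into an HTML string. If those values can ever contain characters like `<` or `&`, escaping them first keeps the markup valid. This is a small sketch using the standard library's `html.escape`; the helper `table_row` and the sample model name are hypothetical.

```python
from html import escape


def table_row(model_name: str, model_type: str, accuracy: float) -> str:
    """Build one table row with text content escaped for HTML."""
    css_class = "llm" if model_type == "LLM" else "slm"
    return (
        "<tr>"
        f"<td><strong>{escape(model_name)}</strong></td>"
        f'<td class="{css_class}">{escape(model_type)}</td>'
        f"<td>{accuracy:.2%}</td>"
        "</tr>"
    )


# A made-up name containing characters that would otherwise break the markup.
row = table_row("demo<3B>&co", "SLM", 0.5)
print(row)
```

Only the text content is escaped here; the CSS class is chosen from two fixed strings, so it needs no escaping.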

In [None]:
# ============================================================
# GENERATE HTML REPORT
# ============================================================

def generate_html_report(df: pd.DataFrame, llm_results: dict, slm_results: dict, output_path: Path):
    """Generate an HTML report; df must be sorted by accuracy, best model first."""

    # Determine the best models
    best_overall = df.iloc[0]
    llm_rows = df[df["model_type"] == "LLM"]
    slm_rows = df[df["model_type"] == "SLM"]
    best_llm = llm_rows.iloc[0] if len(llm_rows) > 0 else None
    best_slm = slm_rows.iloc[0] if len(slm_rows) > 0 else None

    # Relative improvement of the best SLM over the best LLM
    # (left as "N/A" when the best LLM accuracy is 0, where the ratio is undefined)
    improvement = "N/A"
    if best_llm is not None and best_slm is not None and best_llm["accuracy"] > 0:
        imp = (best_slm["accuracy"] - best_llm["accuracy"]) / best_llm["accuracy"] * 100
        improvement = f"+{imp:.1f}%" if imp > 0 else f"{imp:.1f}%"
    
    html = f"""
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Medical Diagnosis Finetuning - Evaluation Report</title>
    <style>
        body {{
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
            line-height: 1.6;
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
            background-color: #f5f5f5;
        }}
        .header {{
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            padding: 30px;
            border-radius: 10px;
            margin-bottom: 30px;
        }}
        .header h1 {{
            margin: 0;
            font-size: 2em;
        }}
        .header p {{
            margin: 10px 0 0 0;
            opacity: 0.9;
        }}
        .card {{
            background: white;
            border-radius: 10px;
            padding: 20px;
            margin-bottom: 20px;
            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
        }}
        .card h2 {{
            color: #333;
            border-bottom: 2px solid #667eea;
            padding-bottom: 10px;
        }}
        table {{
            width: 100%;
            border-collapse: collapse;
            margin: 20px 0;
        }}
        th, td {{
            padding: 12px;
            text-align: left;
            border-bottom: 1px solid #ddd;
        }}
        th {{
            background-color: #667eea;
            color: white;
        }}
        tr:hover {{
            background-color: #f5f5f5;
        }}
        .metric-grid {{
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
            gap: 20px;
            margin: 20px 0;
        }}
        .metric-box {{
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            padding: 20px;
            border-radius: 10px;
            text-align: center;
        }}
        .metric-box .value {{
            font-size: 2em;
            font-weight: bold;
        }}
        .metric-box .label {{
            opacity: 0.9;
        }}
        .llm {{
            color: #E74C3C;
            font-weight: bold;
        }}
        .slm {{
            color: #27AE60;
            font-weight: bold;
        }}
        .highlight {{
            background-color: #e8f5e9;
        }}
        .conclusion {{
            background: linear-gradient(135deg, #11998e 0%, #38ef7d 100%);
            color: white;
            padding: 20px;
            border-radius: 10px;
            margin-top: 30px;
        }}
        .conclusion h2 {{
            color: white;
            border-bottom: 2px solid rgba(255,255,255,0.5);
        }}
        img {{
            max-width: 100%;
            border-radius: 10px;
            margin: 10px 0;
        }}
    </style>
</head>
<body>
    <div class="header">
        <h1>Medical Diagnosis Finetuning</h1>
        <p>Evaluation Report - ICD-10 classification with LLMs and SLMs</p>
        <p>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
    </div>
    
    <div class="card">
        <h2>Key Metrics</h2>
        <div class="metric-grid">
            <div class="metric-box">
                <div class="value">{len(df)}</div>
                <div class="label">Models evaluated</div>
            </div>
            <div class="metric-box">
                <div class="value">{best_overall['accuracy']:.1%}</div>
                <div class="label">Best accuracy</div>
            </div>
            <div class="metric-box">
                <div class="value">{best_overall['model_name']}</div>
                <div class="label">Best model</div>
            </div>
            <div class="metric-box">
                <div class="value">{improvement}</div>
                <div class="label">SLM vs LLM</div>
            </div>
        </div>
    </div>
    
    <div class="card">
        <h2>Results Overview</h2>
        <table>
            <tr>
                <th>Model</th>
                <th>Type</th>
                <th>Size</th>
                <th>Training</th>
                <th>Accuracy</th>
                <th>F1</th>
            </tr>
"""
    
    for idx, row in df.iterrows():
        highlight = 'highlight' if idx == 0 else ''
        typ_class = 'llm' if row['model_type'] == 'LLM' else 'slm'
        html += f"""
            <tr class="{highlight}">
                <td><strong>{row['model_name']}</strong></td>
                <td class="{typ_class}">{row['model_type']}</td>
                <td>{row['model_size']}</td>
                <td>{row['training']}</td>
                <td><strong>{row['accuracy']:.2%}</strong></td>
                <td>{row['f1']:.4f}</td>
            </tr>
"""
    
    html += """
        </table>
    </div>
    
    <div class="card">
        <h2>Visualizations</h2>
        <p>The following plots were generated:</p>
        <ul>
            <li><strong>main_comparison.png</strong> - main comparison of all models</li>
            <li><strong>efficiency_analysis.png</strong> - efficiency analysis (size vs. performance)</li>
            <li><strong>slm_vs_llm_comparison.png</strong> - detailed comparison of SLM vs. LLM</li>
        </ul>
        <img src="../plots/main_comparison.png" alt="Main comparison">
    </div>
"""
    
    # Conclusion
    if best_slm is not None and best_llm is not None:
        if best_slm["accuracy"] > best_llm["accuracy"]:
            conclusion = f"""
            <p><strong>Hypothesis confirmed:</strong> The finetuned SLM ({best_slm['model_name']}) 
            outperforms the best LLM ({best_llm['model_name']}) by {improvement} (relative accuracy).</p>
            <p>This shows that <strong>specialization &gt; size</strong> holds for domain-specific tasks.</p>
            """
        else:
            conclusion = f"""
            <p><strong>Hypothesis not confirmed:</strong> The LLM ({best_llm['model_name']}) 
            still performs at least as well as the finetuned SLM.</p>
            <p>Possible improvements: more training, better hyperparameters, larger datasets.</p>
            """
    else:
        conclusion = "<p>Not enough data to draw a conclusion.</p>"
    
    html += f"""
    <div class="conclusion">
        <h2>🎯 Conclusion</h2>
        {conclusion}
    </div>
    
    <div class="card">
        <h2>Methodology</h2>
        <ul>
            <li><strong>Dataset:</strong> MedSynth (Ahmad0067/MedSynth) - synthetic medical dialogues</li>
            <li><strong>Split:</strong> 70% train, 15% val, 15% test</li>
            <li><strong>LLM evaluation:</strong> zero-shot with 4-bit quantization</li>
            <li><strong>SLM training:</strong> LoRA (r=64, alpha=128), 3 epochs</li>
            <li><strong>Metrics:</strong> exact match accuracy, prefix match, F1 score</li>
        </ul>
    </div>
    
    <footer style="text-align: center; padding: 20px; color: #666;">
        <p>Medical Diagnosis Finetuning Pipeline - {datetime.now().year}</p>
    </footer>
</body>
</html>
"""
    
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html)
    
    return output_path

# Generate the report
report_path = generate_html_report(df, llm_results, slm_results, paths.reports_dir / "evaluation_report.html")
print(f"HTML report generated: {report_path}")

## 5. Final Summary
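The summary written below is plain JSON, so other tools can consume it without pandas. The following sketch shows one such consumer. It assumes the key layout used in this notebook (`llm_results` and `slm_results`, each with `count` and `best_accuracy`); the function `verdict` and the demo numbers are hypothetical.

```python
import json


def verdict(summary: dict) -> str:
    """Classify the outcome from the best accuracies stored in the summary."""
    llm = summary["llm_results"]
    slm = summary["slm_results"]
    if llm["count"] == 0 or slm["count"] == 0:
        return "insufficient data"
    if slm["best_accuracy"] > llm["best_accuracy"]:
        return "SLM ahead"
    if slm["best_accuracy"] < llm["best_accuracy"]:
        return "LLM ahead"
    return "tie"


# Made-up summary with the same structure as final_summary.json.
demo = json.loads(
    """
    {
      "llm_results": {"count": 2, "best_accuracy": 0.4, "mean_accuracy": 0.35},
      "slm_results": {"count": 1, "best_accuracy": 0.5, "mean_accuracy": 0.5}
    }
    """
)
result = verdict(demo)
print(result)
```

Checking `count` first matters because the notebook stores a best accuracy of 0 for an empty group, which would otherwise look like a real result.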

In [None]:
# ============================================================
# FINAL JSON SUMMARY
# ============================================================

final_summary = {
    "project": "Medical Diagnosis Finetuning",
    "description": "ICD-10 classification with LLMs and finetuned SLMs",
    "dataset": {
        "name": "Ahmad0067/MedSynth",
        "split": {"train": 0.70, "val": 0.15, "test": 0.15},
    },
    "models_evaluated": len(all_results),
    "best_model": {
        "name": df.iloc[0]["model_name"],
        "type": df.iloc[0]["model_type"],
        "accuracy": float(df.iloc[0]["accuracy"]),
        "f1": float(df.iloc[0]["f1"]),
    },
    "llm_results": {
        "count": len(llm_df),
        "best_accuracy": float(llm_df["accuracy"].max()) if len(llm_df) > 0 else 0,
        "mean_accuracy": float(llm_df["accuracy"].mean()) if len(llm_df) > 0 else 0,
    },
    "slm_results": {
        "count": len(slm_df),
        "best_accuracy": float(slm_df["accuracy"].max()) if len(slm_df) > 0 else 0,
        "mean_accuracy": float(slm_df["accuracy"].mean()) if len(slm_df) > 0 else 0,
    },
    "generated_files": [
        str(paths.reports_dir / "evaluation_report.html"),
        str(paths.plots_dir / "main_comparison.png"),
        str(paths.plots_dir / "efficiency_analysis.png"),
    ],
    "generated_at": datetime.now().isoformat(),
}

# Save
summary_path = paths.reports_dir / "final_summary.json"
with open(summary_path, 'w', encoding='utf-8') as f:
    json.dump(final_summary, f, indent=2)

print(f"Final summary saved: {summary_path}")

In [None]:
# ============================================================
# FINAL OUTPUT
# ============================================================

print("=" * 70)
print("FINAL SUMMARY: Medical Diagnosis Finetuning")
print("=" * 70)
print(f"""
🎯 Research question:
   Can specialized 3B SLMs outperform 7-8B LLMs through finetuning?

📊 Models evaluated: {len(all_results)}
   LLM (zero-shot): {len(llm_df)}
   SLM (finetuned): {len(slm_df)}

Ranking:
""")

for i, (idx, row) in enumerate(df.iterrows()):
    medal = "🥇" if i == 0 else "🥈" if i == 1 else "🥉" if i == 2 else "  "
    typ = "🔴" if row["model_type"] == "LLM" else "🟢"
    print(f"   {medal} {typ} {row['model_name']}: {row['accuracy']:.2%} (F1: {row['f1']:.4f})")

print(f"""
📈 Key findings:
""")

if len(llm_df) > 0 and len(slm_df) > 0:
    llm_best = llm_df["accuracy"].max()
    slm_best = slm_df["accuracy"].max()
    
    if slm_best > llm_best:
        improvement = (slm_best - llm_best) / llm_best * 100
        print(f"   Finetuned SLM outperforms LLM by {improvement:.1f}% (relative)")
        print(f"   → Specialization beats size on domain-specific tasks")
        print(f"   → 3B model with LoRA beats the 7-8B zero-shot baselines")
    else:
        print(f"   LLM baseline still in the lead")
        print(f"   → More training or better data required")

print(f"""
Generated files:
   - {paths.reports_dir / 'evaluation_report.html'}
   - {paths.reports_dir / 'final_summary.json'}
   - {paths.plots_dir / 'main_comparison.png'}
   - {paths.plots_dir / 'efficiency_analysis.png'}

Pipeline complete!
""")

---

# Pipeline Complete!

## Summary of the Notebooks:

| Notebook | Content |
|----------|---------|
| **00** | Project overview and configuration |
| **01** | Data loading and exploration |
| **02** | Data processing and tokenization |
| **03** | LLM evaluation (zero-shot baseline) |
| **04** | SLM training with LoRA |
| **05** | SLM evaluation (finetuned) |
| **06** | Results analysis and reporting |

## Next Steps:

1. **Hyperparameter tuning**: optimize LoRA r, alpha, and learning rate
2. **More data**: extend or augment the dataset
3. **Other models**: test additional SLMs (Phi-3, Gemma, etc.)
4. **Deployment**: move the best model into production
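
For the first item, a small grid over LoRA settings is one simple way to organize the search. The sketch below only enumerates candidate configurations and performs no training. The value ranges are assumptions chosen for illustration, and alpha is fixed at twice the rank to mirror the r=64, alpha=128 setting listed in the report's methodology.

```python
from itertools import product

# Hypothetical search ranges; adjust them to the available compute budget.
ranks = (16, 32, 64)
learning_rates = (1e-4, 2e-4)

grid = [
    {"r": r, "alpha": 2 * r, "learning_rate": lr}
    for r, lr in product(ranks, learning_rates)
]

for cfg in grid:
    print(cfg)
```

Tying alpha to the rank keeps the grid at six runs instead of growing with a third independent axis; whether that coupling is appropriate is itself something to verify experimentally.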

---

**Author**: Medical Diagnosis Finetuning Pipeline
**Date**: {datetime.now().strftime('%Y-%m-%d')}