# Fine-tune a Domain-Restricted LLM in Colab

This notebook prepares data based on `details.txt` and fine-tunes a model to answer **only** Math, Physics, Economics, and Chemistry questions. It also prevents data overlap and plots training metrics dynamically.

## 📋 Quick Start: Execution Steps

### **Complete Workflow in 11 Steps** ⏱️ Total Time: 30-45 minutes

| Step | Section | Action | Time | GPU Memory |
|------|---------|--------|------|------------|
| **1** | Setup | Install libraries (torch, transformers, peft, accelerate) | 2-3 min | n/a |
| **2** | Data | Load domain data (Math, Physics, Economics, Chemistry) | 1 min | 1 GB |
| **3** | Preprocess | Check overlaps (Jaccard ≥ 0.95), tokenize | 1-2 min | 1 GB |
| **4** | Model | Download google/gemma-2b-it + Configure LoRA | 5 min | 4 GB |
| **5** | Loaders | Create train/val splits (80/20), batch_size=8 | <1 min | 1 GB |
| **6** | **Train ⭐** | **Fine-tune model with early stopping** | **10-30 min** | **12-16 GB** |
| **7** | Evaluate | Compute metrics, plot loss curves | 2-3 min | 8 GB |
| **8** | Test | Generate predictions on new inputs | 1 min | 8 GB |
| **9** | QC | Run quality validators (1000+ guidelines) | <1 min | n/a |
| **10** | Review | Read execution guide & troubleshooting | n/a | n/a |
| **11** | Deploy | Upload to Hugging Face Hub or share | 2-5 min | n/a |

---
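The overlap check in step 3 of the table above relies on Jaccard similarity. The sketch below shows one way such a check could work, assuming word-level token sets and the 0.95 threshold from the table. The names `tokens`, `jaccard`, and `find_overlaps` are illustrative and may not match the notebook's own Section 3 implementation.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the token sets of a and b."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def find_overlaps(texts: list[str], threshold: float = 0.95) -> list[tuple[int, int, float]]:
    """Return (i, j, score) for every pair whose similarity meets the threshold."""
    return [
        (i, j, score)
        for (i, a), (j, b) in combinations(enumerate(texts), 2)
        if (score := jaccard(a, b)) >= threshold
    ]


questions = [
    "What is the first law of thermodynamics?",
    "What is the FIRST law of thermodynamics",
    "Define price elasticity of demand.",
]
print(find_overlaps(questions))  # [(0, 1, 1.0)]
```

The pairwise comparison is quadratic in the number of examples, which is fine for small datasets but would likely need hashing or blocking for large ones.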

### **Before You Start:**

1. **Open in Google Colab**: File → Open notebook, or click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com)
2. **Enable GPU**: Runtime → Change runtime type → GPU (T4 or A100) → Save
3. **Verify GPU**: Run `!nvidia-smi` in a code cell to confirm GPU is active

---

### **Execution Order:**

```
Section 1: Install & Setup                 →  Run cell
Section 2-3: Data Loading & Preprocessing  →  Run cells
Section 4: Model & LoRA Configuration      →  Run cell
Section 5: Create Data Loaders             →  Run cell
Section 6: Training Loop ⭐                →  Run cell (MAIN STEP - 10-30 min)
Section 7: Evaluation                      →  Run cell
Section 8: Testing                         →  Run cell
Section 9: QC Validation                   →  Run cells (optional but recommended)
Section 10: Review Steps                   →  Read guide
Section 11: Deploy                         →  Follow upload instructions
```

---

### **Expected Outputs:**

- ✅ **Section 1**: "✓ All libraries installed successfully"
- ✅ **Section 2**: 2 sample text examples displayed
- ✅ **Section 3**: "No significant overlaps detected"
- ✅ **Section 4**: Model loaded + LoRA config printed
- ✅ **Section 5**: "Train batches: X, Val batches: Y"
- ✅ **Section 6**: Training loss decreases, best model saved
- ✅ **Section 7**: Validation accuracy + loss curve graph
- ✅ **Section 8**: 5 generated examples with domain labels
- ✅ **Section 9**: All QC checks passed (overlap, hierarchy, redundancy)

---
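Section 6 is expected to save the best model and stop early. The sketch below illustrates the usual patience-based logic behind that behaviour. The class name `EarlyStopping`, the `patience` and `min_delta` values, and the loss sequence are all illustrative assumptions, not the notebook's actual training loop.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss  # improvement: a checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([1.20, 0.90, 0.85, 0.86, 0.87, 0.80], start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)  # 5 0.85
```

With a patience of 2, training halts at epoch 5 and never sees the later improvement to 0.80, which is the trade-off a small patience value makes.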

### **Troubleshooting Quick Fixes:**

| Issue | Solution |
|-------|----------|
| Out of Memory | Reduce `batch_size=8` to `batch_size=4` in Section 5 |
| No GPU detected | Runtime → Change runtime type → GPU (T4) |
| Training too slow | Switch to A100 GPU (Colab Pro) or reduce epochs |
| Model download fails | Wait 1 min and retry, or check internet connection |

---
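Halving the batch size, as the out-of-memory fix above suggests, also halves the number of examples per optimizer step. One common way to compensate is gradient accumulation, where the effective batch size is `batch_size * accumulation_steps`. The helper below is a hypothetical sketch of that arithmetic; whether the notebook's training loop supports accumulation is an assumption.

```python
def shrink_batch(batch_size: int, accumulation_steps: int = 1) -> tuple[int, int]:
    """Halve the per-step batch size and double accumulation to keep the effective batch size."""
    if batch_size < 2:
        raise ValueError("batch_size is already 1; reduce sequence length instead")
    if batch_size % 2:
        raise ValueError("batch_size must be even to halve it exactly")
    return batch_size // 2, accumulation_steps * 2


batch_size, accumulation_steps = shrink_batch(8)
print(batch_size, accumulation_steps, batch_size * accumulation_steps)  # 4 2 8
```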

**🚀 Ready? Start with Section 1 below!**

In [5]:
# ============================================================================
# EXECUTION ROADMAP: Visual Step-by-Step Guide
# ============================================================================

print("╔" + "═"*78 + "╗")
print("║" + " "*20 + "🎯 FINE-TUNING EXECUTION ROADMAP" + " "*25 + "║")
print("╚" + "═"*78 + "╝")

steps = [
    {
        "number": "1️⃣",
        "name": "Setup Environment",
        "action": "Install libraries & verify CUDA",
        "time": "2-3 min",
        "cell": "Section 1"
    },
    {
        "number": "2️⃣",
        "name": "Load Data",
        "action": "Import domain examples (Math/Physics/Econ/Chem)",
        "time": "1 min",
        "cell": "Section 2"
    },
    {
        "number": "3️⃣",
        "name": "Preprocess & Check",
        "action": "Tokenize + overlap detection (Jaccard)",
        "time": "1-2 min",
        "cell": "Section 3"
    },
    {
        "number": "4️⃣",
        "name": "Load Model",
        "action": "Download Gemma-2b + LoRA config (r=16)",
        "time": "5 min",
        "cell": "Section 4"
    },
    {
        "number": "5️⃣",
        "name": "Create Loaders",
        "action": "Split dataset 80/20, batch_size=8",
        "time": "<1 min",
        "cell": "Section 5"
    },
    {
        "number": "6️⃣",
        "name": "⭐ TRAIN MODEL ⭐",
        "action": "Fine-tune with early stopping",
        "time": "10-30 min",
        "cell": "Section 6"
    },
    {
        "number": "7️⃣",
        "name": "Evaluate",
        "action": "Compute accuracy + plot loss curves",
        "time": "2-3 min",
        "cell": "Section 7"
    },
    {
        "number": "8️⃣",
        "name": "Test Predictions",
        "action": "Generate outputs on new inputs",
        "time": "1 min",
        "cell": "Section 8"
    },
    {
        "number": "9️⃣",
        "name": "QC Validation",
        "action": "Run 1000+ guideline checks",
        "time": "<1 min",
        "cell": "Section 9"
    },
    {
        "number": "🔟",
        "name": "Review Guide",
        "action": "Read troubleshooting & tips",
        "time": "n/a",
        "cell": "Section 10"
    },
    {
        "number": "1️⃣1️⃣",
        "name": "Deploy & Share",
        "action": "Upload to Hugging Face Hub",
        "time": "2-5 min",
        "cell": "Section 11"
    }
]

print("\n")
for i, step in enumerate(steps, 1):
    print(f"{step['number']} {step['name']}")
    print(f"   📌 Action: {step['action']}")
    print(f"   ⏱️  Time: {step['time']}")
    print(f"   📍 Location: {step['cell']}")
    if i < len(steps):
        print("   │")
        print("   ↓")
    print()

print("─"*80)
print("⏱️  TOTAL TIME: ~30-45 minutes (Step 6 is the longest)")
print("💾 GPU MEMORY: Peak 16GB during training (Step 6)")
print("─"*80)

# Visual checkpoint tracker
print("\n✅ CHECKPOINT TRACKER (mark as you go):\n")
checkpoints = [
    "[ ] GPU enabled (Runtime → Change runtime type)",
    "[ ] Section 1: Libraries installed",
    "[ ] Section 2-3: Data loaded & preprocessed",
    "[ ] Section 4: Model loaded (google/gemma-2b-it)",
    "[ ] Section 5: Data loaders created",
    "[ ] Section 6: Training complete (best model saved)",
    "[ ] Section 7: Evaluation complete (metrics computed)",
    "[ ] Section 8: Predictions generated",
    "[ ] Section 9: QC validation passed",
    "[ ] Section 11: Model uploaded to Hugging Face"
]

for checkpoint in checkpoints:
    print(f"  {checkpoint}")

print("\n" + "="*80)
print("🚀 START EXECUTING: Run Section 1 now!")
print("="*80)


╔══════════════════════════════════════════════════════════════════════════════╗
║                    🎯 FINE-TUNING EXECUTION ROADMAP                         ║
╚══════════════════════════════════════════════════════════════════════════════╝


1️⃣ Setup Environment
   📌 Action: Install libraries & verify CUDA
   ⏱️  Time: 2-3 min
   📍 Location: Section 1
   │
   ↓

2️⃣ Load Data
   📌 Action: Import domain examples (Math/Physics/Econ/Chem)
   ⏱️  Time: 1 min
   📍 Location: Section 2
   │
   ↓

3️⃣ Preprocess & Check
   📌 Action: Tokenize + overlap detection (Jaccard)
   ⏱️  Time: 1-2 min
   📍 Location: Section 3
   │

---

### 🎓 What This Notebook Does

This notebook fine-tunes **google/gemma-2b-it** to generate educational content for **4 specific domains**:
- 📐 **Mathematics** (algebra, calculus, geometry, statistics)
- ⚛️ **Physics** (mechanics, thermodynamics, electromagnetism, quantum)
- 💰 **Economics** (microeconomics, macroeconomics, finance, trade)
- 🧪 **Chemistry** (organic, inorganic, physical, biochemistry)

The model is trained to **refuse** questions outside these domains.

---

### 🛠️ Key Features

✅ **LoRA Fine-tuning**: Parameter-efficient training (r=16, alpha=32)  
✅ **Overlap Detection**: Prevents duplicate content (Jaccard ≥ 0.95)  
✅ **1000+ QC Guidelines**: Automated quality validation across 6 categories  
✅ **Hierarchy Validation**: Ensures logical content structure  
✅ **Early Stopping**: Stops training when validation loss stops improving  
✅ **Visualization**: Safe plotting with collision detection  
✅ **Deployment Ready**: Hugging Face Hub upload + Gradio demo

---
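The LoRA settings listed above (r=16, alpha=32) determine both how many parameters are trained and how strongly the low-rank update is scaled. For a weight matrix of shape `d_out x d_in`, LoRA trains two factors with `r * (d_out + d_in)` parameters in total and scales their product by `alpha / r`. The worked example below uses a 2048 x 2048 matrix purely as an assumed size for illustration; `lora_stats` is a hypothetical helper.

```python
def lora_stats(d_out: int, d_in: int, r: int = 16, alpha: int = 32) -> dict[str, float]:
    """Compare a full d_out x d_in weight update with its rank-r LoRA factorization."""
    full_params = d_out * d_in
    lora_params = r * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "trainable_fraction": lora_params / full_params,
        "scaling": alpha / r,  # factor applied to the low-rank update B @ A
    }


# 2048 is an assumed layer size, used here only for illustration.
stats = lora_stats(2048, 2048)
print(stats)
```

For this assumed layer, the adapter trains 65,536 parameters instead of 4,194,304 (about 1.6%), and the update is scaled by 2.0.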

### 💡 Quick Tips

- **First time?** Just run cells **1 → 2 → 3 → 4 → 5 → 6** in order
- **Training slow?** Check that the GPU is enabled (the runtime type should show "T4" or "A100")
- **Out of memory?** Reduce batch_size in Section 5 from 8 → 4
- **Need help?** Use the Gemini prompt in Section 11
- **Want to share?** Follow Hugging Face upload guide in Section 11

---

In [6]:
# ============================================================================
# CHEAT SHEET: Quick Command Reference
# ============================================================================

print("┏" + "━" * 72 + "┓")
print("┃                    🎯 QUICK REFERENCE CHEAT SHEET                      ┃")
print("┗" + "━" * 72 + "┛")

print("\n📦 ESSENTIAL COMMANDS:\n")

commands = {
    "Check GPU": "!nvidia-smi",
    "Install package": "!pip install transformers peft accelerate",
    "Check PyTorch+CUDA": "import torch; print(torch.cuda.is_available())",
    "List files": "!ls -lh",
    "Check disk space": "!df -h",
    "Download from Colab": "from google.colab import files; files.download('model.pt')",
    "Mount Google Drive": "from google.colab import drive; drive.mount('/content/drive')",
    "Kill process": "!kill -9 <PID>",
    "Clear output": "from IPython.display import clear_output; clear_output()"
}

for name, cmd in commands.items():
    print(f"  • {name:<20} {cmd}")

print("\n" + "─"*76)
print("🔧 CONFIGURATION:\n")

config = {
    "Model": "google/gemma-2b-it (2.2GB)",
    "LoRA rank (r)": "16",
    "LoRA alpha": "32",
    "LoRA dropout": "0.05",
    "Batch size": "8 (reduce to 4 if OOM)",
    "Learning rate": "2e-4",
    "Epochs": "10 (with early stopping)",
    "Domains": "Math, Physics, Economics, Chemistry"
}

for key, value in config.items():
    print(f"  • {key:<20} {value}")

print("\n" + "─"*76)
print("📂 KEY FILE PATHS:\n")

paths = {
    "Best model": "/tmp/best_model/",
    "Checkpoint": "/tmp/checkpoint/",
    "Training logs": "./training.log (if saved)",
    "Plots": "./loss_curve.png (from Section 7)"
}

for name, path in paths.items():
    print(f"  • {name:<20} {path}")

print("\n" + "─"*76)
print("🐛 COMMON ERRORS & FIXES:\n")

errors = [
    ("CUDA out of memory", "→ Reduce batch_size=8 to batch_size=4"),
    ("No module named 'peft'", "→ Run: !pip install peft"),
    ("RuntimeError: Expected...", "→ Restart kernel & re-run setup"),
    ("Model download timeout", "→ Wait 1 min and retry cell"),
    ("Validation loss not improving", "→ Early stopping will trigger automatically")
]

for error, fix in errors:
    print(f"  ✗ {error:<30} {fix}")

print("\n" + "─"*76)
print("📊 EXPECTED METRICS (after training):\n")

metrics = {
    "Training Loss (final)": "< 0.5",
    "Validation Loss": "< 0.8",
    "Validation Accuracy": "> 90%",
    "Perplexity": "< 5.0",
    "Training Time": "10-30 min (depends on data & GPU)"
}

for metric, value in metrics.items():
    print(f"  • {metric:<25} {value}")

print("\n" + "─"*76)
print("🔗 USEFUL LINKS:\n")

links = [
    "Hugging Face Hub: https://huggingface.co/models",
    "Gemini API: https://ai.google.dev/gemini-api/docs",
    "LoRA Paper: https://arxiv.org/abs/2106.09685",
    "PEFT Docs: https://huggingface.co/docs/peft",
    "Accelerate Docs: https://huggingface.co/docs/accelerate"
]

for link in links:
    print(f"  • {link}")

print("\n┏" + "━" * 72 + "┓")
print("┃  💡 TIP: Bookmark this cell for quick reference during execution!       ┃")
print("┗" + "━" * 72 + "┛")


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                    🎯 QUICK REFERENCE CHEAT SHEET                      ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

📦 ESSENTIAL COMMANDS:

  • Check GPU            !nvidia-smi
  • Install package      !pip install transformers peft accelerate
  • Check PyTorch+CUDA   import torch; print(torch.cuda.is_available())
  • List files           !ls -lh
  • Check disk space     !df -h
  • Download from Colab  from google.colab import files; files.download('model.pt')
  • Mount Google Drive   from google.colab import drive; drive.mount('/content/drive')
  • Kill process
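
The expected metrics in the cheat sheet list both a validation loss and a perplexity target. For a language model these are linked: perplexity is the exponential of the mean per-token cross-entropy loss, assuming the loss is measured in nats (the usual convention for PyTorch's cross-entropy). A short worked example under that assumption:

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(mean_loss)


for loss in (0.5, 0.8, math.log(5.0)):
    print(f"loss={loss:.3f} -> perplexity={perplexity(loss):.3f}")
```

A validation loss below 0.8 corresponds to a perplexity below about 2.23, so under this assumption the loss target is the stricter of the two; a perplexity of 5.0 corresponds to a loss of about 1.609.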

## 1. Setup Colab Environment and Install Dependencies

In [7]:
# If running in Colab, uncomment the next line to mount Drive.
# from google.colab import drive
# drive.mount("/content/drive")

# Install dependencies
%pip -q install "transformers>=4.40" "datasets>=2.18" "accelerate>=0.27" "peft>=0.10" "bitsandbytes>=0.43" "torch>=2.2" "matplotlib>=3.8"

import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

CUDA available: False


## 2. Load and Prepare Domain-Specific Dataset

In [8]:
from pathlib import Path
import json
import random
import re
from datasets import Dataset

# Paths: upload details.txt to /content or update this path
DETAILS_PATH = Path("/content/details.txt")
if not DETAILS_PATH.exists():
    # Fallback to local workspace path if running outside Colab
    DETAILS_PATH = Path(r"c:\Users\SUDISH_DEUJA\Desktop\Phiversity-main\details.txt")

# Optional: Read details.txt if it exists (currently unused in this cell)
# Uncomment below if you have a details.txt file to load
# if DETAILS_PATH.exists():
#     text = DETAILS_PATH.read_text(encoding="utf-8", errors="ignore")

ALLOWED_DOMAINS = ["physics", "math", "economics", "chemistry"]

SYSTEM_PROMPT = (
    "You are a domain-restricted tutor. Answer ONLY questions in Physics, Math, "
    "Economics, or Chemistry. If the question is out of domain, refuse politely. "
    "Provide step-by-step reasoning, validate numerical results, and cite academic sources."
)

def normalize_question(q: str) -> str:
    q = q.lower()
    q = re.sub(r"[^a-z0-9\s]", " ", q)
    q = re.sub(r"\s+", " ", q).strip()
    return q

def load_raw_data(raw_path: Path | None = None):
    """
    Load question-answer pairs from a JSONL file, or return demo data.

    Args:
        raw_path: Optional path to a JSONL file of {question, answer, domain} objects

    Returns:
        The rows read from raw_path if it exists; otherwise four demo examples,
        one per domain
    """
    if raw_path and raw_path.exists():
        rows = []
        with raw_path.open("r", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines so json.loads does not fail on them
                rows.append(json.loads(line))
        return rows
    
    # Demo data for Math, Physics, Economics, and Chemistry
    # Replace with your own academic dataset for production use
    return [
        {
            "question": "Solve 2x^2 + 3x + 1 = 0.",
            "answer": "Identify a=2, b=3, c=1. Use quadratic formula: x = (-b ± sqrt(b^2-4ac)) / (2a). Discriminant: 9-8=1. Solutions: (-3±1)/4 -> x=-1/2, x=-1.\nCitations: [Standard Algebra Text]",
            "domain": "math",
        },
        {
            "question": "What is the first law of thermodynamics?",
            "answer": "The first law states that energy is conserved: the change in internal energy equals heat added minus work done, dU = dQ - dW.\nCitations: [Thermodynamics Text]",
            "domain": "physics",
        },
        {
            "question": "Define price elasticity of demand.",
            "answer": "Price elasticity of demand is the percentage change in quantity demanded divided by the percentage change in price, holding other factors constant.\nCitations: [Microeconomics Text]",
            "domain": "economics",
        },
        {
            "question": "What is a nucleophile in organic chemistry?",
            "answer": "A nucleophile is an electron-rich species that donates a pair of electrons to form a chemical bond.\nCitations: [Organic Chemistry Text]",
            "domain": "chemistry",
        },
    ]

# Optional: point this at your JSONL dataset with fields: question, answer, domain
RAW_DATA_PATH = None  # Example: Path("/content/domain_qa.jsonl")
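raw_examples = load_raw_data(RAW_DATA_PATH)

# Report where the examples actually came from
source = str(RAW_DATA_PATH) if RAW_DATA_PATH and RAW_DATA_PATH.exists() else "demo data"
print(f"✓ Loaded {len(raw_examples)} examples (using {source})")
print(f"  Domains: {set(ex.get('domain') for ex in raw_examples)}")

# Normalize domain labels to lowercase, then filter to allowed domains
raw_examples = [{**ex, "domain": ex.get("domain", "").lower()} for ex in raw_examples]
raw_examples = [ex for ex in raw_examples if ex["domain"] in ALLOWED_DOMAINS]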

# Add refusal examples (out-of-domain)
ood_questions = [
    "Who won the last football world cup?",
    "Write a poem about the ocean.",
    "Give me travel tips for Japan.",
]
refusal_answer = "Sorry, I can only answer questions about Physics, Math, Economics, or Chemistry."
raw_examples += [{"question": q, "answer": refusal_answer, "domain": "refusal"} for q in ood_questions]

# Deduplicate by normalized question to prevent overlap
seen = set()
deduped = []
for ex in raw_examples:
    key = normalize_question(ex["question"])
    if key in seen:
        continue
    seen.add(key)
    deduped.append(ex)

# Balance domains (ignore refusal during balancing)
domain_groups = {d: [] for d in ALLOWED_DOMAINS}
refusals = [ex for ex in deduped if ex["domain"] == "refusal"]
for ex in deduped:
    d = ex["domain"]
    if d in domain_groups:
        domain_groups[d].append(ex)

min_count = min((len(v) for v in domain_groups.values()), default=0)
if min_count == 0:
    raise ValueError("Each domain needs at least one example.")

balanced = []
random.seed(42)
for d in ALLOWED_DOMAINS:
    balanced.extend(random.sample(domain_groups[d], min_count))

balanced.extend(refusals)
random.shuffle(balanced)

# Train/val split with no overlap
split_idx = int(0.9 * len(balanced))
train_examples = balanced[:split_idx]
val_examples = balanced[split_idx:]

train_keys = {normalize_question(ex["question"]) for ex in train_examples}
val_examples = [ex for ex in val_examples if normalize_question(ex["question"]) not in train_keys]

print("\n" + "="*60)
print("📊 DATA PREPARATION COMPLETE")
print("="*60)
print(f"✓ Train set: {len(train_examples)} examples")
print(f"✓ Validation set: {len(val_examples)} examples")
print(f"✓ Domains: {ALLOWED_DOMAINS}")
print(f"✓ Total unique questions: {len(balanced)}")
print("="*60)

train_ds = Dataset.from_list(train_examples)
val_ds = Dataset.from_list(val_examples)

# Display sample examples
print("\n📖 Sample Training Examples:")
for i, ex in enumerate(train_examples[:2], 1):
    print(f"\n  Example {i}:")
    print(f"    Domain: {ex['domain']}")
    print(f"    Question: {ex['question'][:60]}...")
    print(f"    Answer: {ex['answer'][:80]}...")


✓ Loaded 4 examples (using demo data)
  Domains: {'physics', 'math', 'chemistry', 'economics'}

📊 DATA PREPARATION COMPLETE
✓ Train set: 6 examples
✓ Validation set: 1 examples
✓ Domains: ['physics', 'math', 'economics', 'chemistry']
✓ Total unique questions: 7

📖 Sample Training Examples:

  Example 1:
    Domain: chemistry
    Question: What is a nucleophile in organic chemistry?...
    Answer: A nucleophile is an electron-rich species that donates a pair of electrons to fo...

  Example 2:
    Domain: refusal
    Question: Write a poem about the ocean....
    Answer: Sorry, I can only answer questions about Physics, Math, Economics, or Chemistry....
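
The run above uses the built-in demo data. To train on a real dataset, `RAW_DATA_PATH` needs to point at a JSONL file with one `{question, answer, domain}` object per line. The sketch below writes a small sample file and reads it back with a field check. The `read_jsonl` helper, the `REQUIRED_FIELDS` tuple, and the two sample rows are illustrative assumptions rather than part of the notebook.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("question", "answer", "domain")


def read_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per line, skipping blank lines and checking required fields."""
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            obj = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in obj]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            rows.append(obj)
    return rows


# Write a two-row sample file (hypothetical content), then read it back.
sample = [
    {"question": "What is the derivative of x^2?", "answer": "2x", "domain": "math"},
    {"question": "State Ohm's law.", "answer": "V = IR", "domain": "physics"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "domain_qa.jsonl"
    path.write_text("\n".join(json.dumps(row) for row in sample) + "\n\n", encoding="utf-8")
    rows = read_jsonl(path)
print(len(rows), [row["domain"] for row in rows])  # 2 ['math', 'physics']
```

Validating fields at load time surfaces malformed rows with a line number, instead of a `KeyError` later in the balancing or deduplication steps.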


In [9]:
OUTPUT_GUIDELINES = [
    "Define clear learning objectives before starting video creation.",
    "Break content into logical sections or chapters.",
    "Use topic hierarchy to arrange concepts from simple to complex.",
    "Outline subtopics under each main topic.",
    "Allocate time per section based on complexity.",
    "Include an introduction that summarizes the video.",
    "Plan transitions between topics for smooth flow.",
    "Include key takeaways at the start.",
    "Use bullet points for outlining concepts.",
    "Apply backward design to plan outcomes first.",
    "Segment long videos into shorter chapters.",
    "Prioritize critical concepts at the beginning.",
    "Include a 'why this matters' statement for engagement.",
    "Map overlapping topics to reduce redundancy.",
    "Identify visual content needed for each section.",
    "Schedule recurring concepts to reinforce learning.",
    "Include examples for abstract concepts.",
    "Predefine exercises or practice questions.",
    "Annotate topic dependencies to maintain hierarchy.",
    "Include optional deep-dive sections for advanced learners.",
    "Determine pacing for each segment.",
    "Use audience persona to guide content complexity.",
    "Highlight common misconceptions per topic.",
    "Include real-world applications of concepts.",
    "Predefine storytelling techniques to enhance memory retention.",
    "Schedule summary slides after each topic.",
    "Plan cue points for interactive elements.",
    "Include self-assessment checkpoints.",
    "Predefine visual cues for key concepts.",
    "Map out examples vs theory balance.",
    "Include context for diagrams before showing them.",
    "Avoid introducing multiple topics simultaneously.",
    "Predefine transitions for overlap-heavy content.",
    "Ensure logical progression between sections.",
    "Identify sections that need reinforcement.",
    "Map topic dependencies to avoid skipping steps.",
    "Include periodic recaps every 5-10 minutes.",
    "Segment content to match attention span limits.",
    "Include intro hooks to engage learners.",
    "Predefine concept summaries for each chapter.",
    "Prioritize visuals for high-complexity topics.",
    "Include mnemonic aids in planning.",
    "Predefine interactive questions for each section.",
    "Highlight keywords in planning stage.",
    "Define glossary terms for technical topics.",
    "Track cross-topic references to maintain hierarchy.",
    "Schedule rest points to reduce cognitive load.",
    "Plan alternative examples for complex concepts.",
    "Predefine voice modulation points.",
    "Outline multiple methods to explain a single concept.",
    "Map potential confusion points and clarify in plan.",
    "Include context slides before data-heavy visuals.",
    "Highlight prerequisite knowledge for each topic.",
    "Plan demonstration or simulation segments.",
    "Include storyboarding for concept animations.",
    "Schedule pacing adjustments for difficult topics.",
    "Include summary slides with visual emphasis.",
    "Predefine chapter opening lines for engagement.",
    "Include reinforcement exercises in planning.",
    "Predefine quiz placement for active recall.",
    "Track topic coverage completeness.",
    "Plan backup examples for complex topics.",
    "Predefine color coding for hierarchy.",
    "Include context for formula-heavy sections.",
    "Map diagrams to exact narration points.",
    "Predefine annotations for charts.",
    "Include step-by-step instructions for problem-solving.",
    "Identify sections that need slower pacing.",
    "Include storytelling cues in plan.",
    "Map content to Bloom's taxonomy levels.",
    "Include prompts for learners to pause and reflect.",
    "Track topic repetition to reinforce memory.",
    "Predefine voice emphasis points for key terms.",
    "Include analogies for abstract topics.",
    "Map cross-references between chapters.",
    "Include scaffolding for difficult concepts.",
    "Predefine slide transitions for clarity.",
    "Highlight step-wise logic in planning.",
    "Include margin notes for potential improvements.",
    "Plan mini-recaps every 3-5 slides.",
    "Predefine examples for multiple learning styles.",
    "Include optional advanced exercises.",
    "Map visuals to spoken content to prevent overlap.",
    "Predefine highlight points for critical data.",
    "Include summary questions at end of topic.",
    "Track time allocation per section.",
    "Include micro-learning segments for retention.",
    "Predefine end-of-video call-to-action.",
    "Map content flow for cognitive load management.",
    "Include visual hierarchy for diagrams.",
    "Predefine figure references in narration.",
    "Track repeated themes to avoid redundancy.",
    "Include pauses for reflection after complex explanations.",
    "Plan for consistency in tone and pacing.",
    "Map overlapping charts to avoid misinterpretation.",
    "Include cue words for emphasis in scripts.",
    "Plan dynamic visuals for engagement.",
    "Predefine font sizes for readability.",
    "Include chapter-wise learning outcomes.",
    "Map audio cues to visual changes.",
    "Plan transitions for overlapping topics.",
    "Include annotations for misaligned graphs.",
    "Track visual density per slide.",
    "Predefine captions for clarity.",
    "Include reminders to reinforce hierarchy.",
    "Map overlapping equations for clarity.",
    "Include error-spotting prompts in planning.",
    "Predefine problem-solving demonstrations.",
    "Include cross-topic example integration.",
    "Map diagram labeling for clarity.",
    "Track redundant phrases to avoid repetition.",
    "Include pacing markers in storyboard.",
    "Predefine color coding for overlapping topics.",
    "Include alternate explanations for diverse learners.",
    "Plan for voice clarity in technical sections.",
    "Map visual elements to hierarchy levels.",
    "Include narrative emphasis for key takeaways.",
    "Predefine summary charts.",
    "Track concept coverage for completeness.",
    "Include 'next topic' hints to maintain flow.",
    "Plan alignment between text and graphics.",
    "Map overlapping steps in problem-solving.",
    "Include repetition for reinforcement.",
    "Predefine slide labels for reference.",
    "Track audience comprehension checkpoints.",
    "Include visual hierarchy in diagrams.",
    "Map redundant explanations for removal.",
    "Predefine cue cards for narrator.",
    "Include analogies aligned to topic level.",
    "Track formula introduction order.",
    "Plan visual spacing for clarity.",
    "Map content redundancy to prevent overlap.",
    "Include guided question prompts.",
    "Predefine animation timings.",
    "Track overlapping terms across topics.",
    "Include emphasis on key concepts.",
    "Map figures to correct narration timing.",
    "Predefine voice tone for difficult topics.",
    "Include transitions between overlapping charts.",
    "Plan chapter summaries with hierarchy emphasis.",
    "Track logical step progression.",
    "Include visual cues for problem-solving steps.",
    "Predefine font consistency.",
    "Map color coding to topic hierarchy.",
    "Include time markers for pacing.",
    "Track recurring examples for reinforcement.",
    "Plan alternative visual examples.",
    "Include summary points in bullet form.",
    "Map slide content density.",
    "Predefine alignment between visuals and text.",
    "Write scripts in simple, conversational language.",
    "Predefine key terms to emphasize in narration.",
    "Use active voice for clarity.",
    "Break long sentences into shorter ones for readability.",
    "Include rhetorical questions to engage viewers.",
    "Add examples immediately after introducing a concept.",
    "Predefine intonation markers for AI voice.",
    "Include pauses after important points.",
    "Use repetition of keywords to reinforce learning.",
    "Highlight formulas in speech for clarity.",
    "Predefine emphasis points in narration script.",
    "Include analogies for abstract concepts.",
    "Align narration with visual content.",
    "Predefine chapter opening and closing statements.",
    "Use stories or real-life examples to illustrate concepts.",
    "Include summary statements at the end of each segment.",
    "Track common student mistakes and address them.",
    "Predefine voice speed variations for complex sections.",
    "Use consistent terminology throughout the video.",
    "Include pronunciation guides for technical terms.",
    "Predefine filler-free narration to maintain focus.",
    "Track narration clarity using AI speech analysis.",
    "Include 'think-aloud' demonstrations for problem-solving.",
    "Predefine Q&A sections in narration.",
    "Use voice modulation to indicate importance.",
    "Include periodic recaps in script.",
    "Highlight contrasting concepts verbally.",
    "Predefine storytelling hooks at key points.",
    "Use rhetorical emphasis to reinforce hierarchy.",
    "Include guiding questions in narration for active thinking.",
    "Track pacing to maintain attention.",
    "Predefine tone shifts for transitions.",
    "Include repetition of essential steps in problem-solving.",
    "Highlight relationships between topics verbally.",
    "Predefine script sections for graphics references.",
    "Use metaphorical language for abstract ideas.",
    "Include verbal cues for interactive exercises.",
    "Predefine explanation for overlapping topics.",
    "Use synonyms to avoid monotony but keep clarity.",
    "Include reinforcement questions in narration.",
    "Track audience comprehension cues through AI analysis.",
    "Predefine key takeaway statements in script.",
    "Use voice emphasis for hierarchically important topics.",
    "Include stepwise verbal breakdowns for procedures.",
    "Predefine narration for visual-only content.",
    "Use clear transitions like 'next, we will...' or 'then...'.",
    "Include mini-quizzes verbally in script.",
    "Predefine explanations for potential misconceptions.",
    "Use summaries before introducing a new subtopic.",
    "Include repetition of topic hierarchy verbally.",
    "Predefine narration for overlapping diagrams.",
    "Track and reduce filler words using AI analysis.",
    "Use analogies aligned to learner level.",
    "Include pronunciation emphasis for foreign terms.",
    "Predefine voice pauses for note-taking.",
    "Include reflective questions for active engagement.",
    "Highlight formulas in verbal explanation.",
    "Predefine script markers for AI-generated voice pitch.",
    "Use pacing variations to match concept difficulty.",
    "Include verbal summaries of previous topics.",
    "Predefine alternative phrasing for clarity.",
    "Include motivational reinforcement in narration.",
    "Highlight cross-topic connections verbally.",
    "Predefine section introductions to set context.",
    "Include storytelling for historical context.",
    "Track repetition to reinforce key points.",
    "Predefine cues for overlapping content explanation.",
    "Use emphasis to indicate importance in hierarchy.",
    "Include clear stepwise instructions in narration.",
    "Predefine examples for visual-only slides.",
    "Highlight potential student pitfalls in narration.",
    "Include rhetorical devices to maintain attention.",
    "Predefine key questions to ask viewers verbally.",
    "Use consistent script tone for cohesion.",
    "Include verbal analogies for formulas and graphs.",
    "Predefine narration for animated sequences.",
    "Track listener comprehension using AI speech metrics.",
    "Include repetition of key takeaways verbally.",
    "Predefine cues for figure and diagram references.",
    "Use verbal scaffolding for complex topics.",
    "Include mini-recaps after each subsection.",
    "Highlight relationships between concepts verbally.",
    "Predefine clarification statements for ambiguous content.",
    "Include thought prompts in narration.",
    "Use pauses before introducing critical formulas.",
    "Predefine verbal cues for transitions between topics.",
    "Include historical context for discoveries or formulas.",
    "Track clarity and simplicity of phrasing.",
    "Predefine narration for overlapping word topics.",
    "Include reinforcement of topic hierarchy verbally.",
    "Highlight important definitions verbally.",
    "Predefine cues for animated figure explanations.",
    "Include storytelling to illustrate abstract principles.",
    "Use intonation to signal hierarchy changes.",
    "Include reflective pauses for problem-solving steps.",
    "Predefine narration for interactive quizzes.",
    "Highlight differences between similar concepts verbally.",
    "Include summaries before moving to next major topic.",
    "Track redundant phrases and remove them from script.",
    "Predefine key question prompts for engagement.",
    "Include verbal warnings for common mistakes.",
    "Use analogies for better conceptual understanding.",
    "Highlight hierarchy relationships verbally.",
    "Predefine narration for charts with overlapping labels.",
    "Include stepwise instructions for procedural tasks.",
    "Use consistent voice style throughout sections.",
    "Predefine explanations for difficult-to-understand graphs.",
    "Include rhetorical devices to increase retention.",
    "Track flow of narration with AI tools.",
    "Predefine reinforcement statements after each topic.",
    "Use pause markers for learners to take notes.",
    "Include verbal cues for overlapping topics.",
    "Highlight key points using voice emphasis.",
    "Predefine alternative examples for complex concepts.",
    "Include mini-challenges verbally to engage learners.",
    "Track script readability using AI tools.",
    "Predefine step-by-step narration for calculations.",
    "Include analogies for complex problem-solving.",
    "Highlight topic connections verbally for coherence.",
    "Predefine narration for graphs with multiple layers.",
    "Include periodic verbal summaries of key points.",
    "Use intonation to differentiate main vs subtopics.",
    "Predefine narration for overlapping word and figure references.",
    "Include repetition of crucial steps in problem-solving.",
    "Track clarity of formula explanation verbally.",
    "Predefine questions for self-assessment.",
    "Include verbal tips to prevent common errors.",
    "Use examples aligned with learner familiarity.",
    "Highlight main topic transitions verbally.",
    "Predefine narration for stepwise diagram walkthroughs.",
    "Include emphasis markers for hierarchical importance.",
    "Track topic coverage to avoid missing key points.",
    "Predefine reinforcement statements after critical formulas.",
    "Include rhetorical questions to maintain engagement.",
    "Highlight contrast between concepts verbally.",
    "Predefine voice markers for AI-based narration.",
    "Include verbal analogies for abstract visual content.",
    "Track pacing adjustments for comprehension.",
    "Predefine narration for hierarchical topic introductions.",
    "Include mini-recaps before moving to advanced sections.",
    "Use intonation to differentiate overlapping topics.",
    "Predefine narration for chart interpretation.",
    "Include repetition for memory retention.",
    "Highlight key transitions in problem-solving verbally.",
    "Predefine script cues for animations with voice.",
    "Include reflective prompts for active thinking.",
    "Track redundant explanations to remove from script.",
    "Predefine narration for figures with overlapping labels.",
    "Include reinforcement of learning objectives verbally.",
    "Highlight key takeaways for each section in narration.",
    "Predefine figure types for each concept (chart, diagram, graph).",
    "Use consistent color schemes to indicate hierarchy.",
    "Ensure axes are labeled clearly in all graphs.",
    "Avoid clutter in graphs by limiting data points per figure.",
    "Include legends for multi-series graphs.",
    "Predefine figure placement relative to narration timing.",
    "Align visual emphasis with verbal emphasis.",
    "Include callouts for critical data points.",
    "Use animation to highlight stepwise changes in graphs.",
    "Predefine figure dimensions to maintain clarity.",
    "Apply gridlines for reference without overwhelming the figure.",
    "Include arrows or highlights for directional flows.",
    "Track overlapping labels and automatically adjust positions.",
    "Predefine font size and type for consistency.",
    "Use contrasting colors for overlapping data.",
    "Include numbered steps for multi-part diagrams.",
    "Apply layering to separate overlapping elements.",
    "Predefine figure captions to reinforce hierarchy.",
    "Use symbols consistently across figures.",
    "Highlight key trends visually.",
    "Predefine sequence for multiple figures to avoid cognitive overload.",
    "Include interactive layers for optional detailed exploration.",
    "Use whitespace strategically to reduce clutter.",
    "Predefine figure references in script for cross-referencing.",
    "Apply zoom or pan animations for complex diagrams.",
    "Include comparison charts for similar concepts.",
    "Track overlapping visuals and resolve them automatically.",
    "Predefine visual hierarchy in multi-layered diagrams.",
    "Highlight cause-and-effect relationships in charts.",
    "Include transitional effects between overlapping figures.",
    "Use color gradients to show progression.",
    "Predefine templates for recurring figure types.",
    "Include stepwise buildup for complex visuals.",
    "Track figure-to-topic alignment to maintain logical flow.",
    "Include arrows or paths to indicate process flow.",
    "Predefine icons to represent recurring elements.",
    "Use highlights to focus attention on key areas.",
    "Include labels for sub-parts of complex figures.",
    "Track figure density per slide to avoid overload.",
    "Predefine consistent spacing between overlapping elements.",
    "Use visual cues for hierarchy (size, color, boldness).",
    "Include interactive toggles for layered information.",
    "Predefine figure legends for clarity.",
    "Highlight overlapping data using transparency.",
    "Track visual redundancy to avoid repetition.",
    "Predefine figure introduction sequence in narration.",
    "Include animated transitions for overlapping charts.",
    "Use arrows to indicate relationships or correlations.",
    "Predefine color codes for topic hierarchy.",
    "Include annotations directly on figures to clarify points.",
    "Track alignment between figures and captions.",
    "Apply consistent axis scaling across similar graphs.",
    "Predefine figure references in learning materials.",
    "Use callout boxes for overlapping text on visuals.",
    "Highlight important trends using motion or animation.",
    "Predefine figure order to match narration flow.",
    "Include zoom-in effects for critical graph areas.",
    "Track overlapping curves or lines and offset them.",
    "Use consistent shapes for recurring elements.",
    "Predefine figure templates for different difficulty levels.",
    "Include interactive overlays for optional extra info.",
    "Highlight changes over time in stepwise fashion.",
    "Predefine figure padding to prevent visual overlap.",
    "Track figure hierarchy to emphasize important parts.",
    "Use contrasting colors for overlapping text labels.",
    "Predefine size ratio for figure elements.",
    "Include callouts for exceptions or anomalies.",
    "Track figure repetition and reduce redundancy.",
    "Use shadow or outline effects for overlapping items.",
    "Predefine figure transitions for clarity.",
    "Highlight key relationships using lines or arrows.",
    "Include layered visuals to separate concepts.",
    "Track axis scaling consistency across multiple charts.",
    "Predefine figure highlights to match narration cues.",
    "Use fading effects to reveal overlapping data progressively.",
    "Include numbered labels for hierarchical topics.",
    "Track color contrast for accessibility.",
    "Predefine figure update timing for animations.",
    "Highlight key trends using thicker lines or larger points.",
    "Include comparison panels for before-and-after visuals.",
    "Track alignment between visuals and spoken keywords.",
    "Predefine figure margin sizes to avoid overlap with text.",
    "Use icons to indicate repeated concepts.",
    "Include motion paths to show process flow.",
    "Track overlapping elements in dense diagrams.",
    "Predefine layering order for clarity.",
    "Highlight cause-effect chains in process diagrams.",
    "Include stepwise construction of complex charts.",
    "Track legend placement for readability.",
    "Predefine figure background contrast for clarity.",
    "Use highlighting to direct attention sequentially.",
    "Include annotations for overlapping figures.",
    "Track figure size consistency across slides.",
    "Predefine templates for different visual types.",
    "Highlight major trends using color or animation.",
    "Include micro-labels for detailed areas.",
    "Track overlapping text and adjust dynamically.",
    "Predefine hierarchy markers in multi-level visuals.",
    "Use motion cues to guide visual attention.",
    "Include overlay boxes for optional explanations.",
    "Track visual progression to match narration.",
    "Predefine color codes for overlapping data points.",
    "Highlight exceptions or anomalies visually.",
    "Include arrows to show sequential flow.",
    "Track figure placement to prevent overlap with text.",
    "Predefine shape usage for recurring concepts.",
    "Use transparency to differentiate overlapping elements.",
    "Include pop-up explanations for complex visuals.",
    "Track figure readability for all device sizes.",
    "Predefine layer order for complex graphics.",
    "Highlight connections between concepts visually.",
    "Include animation to show stepwise formula development.",
    "Track figure scaling to maintain proportion.",
    "Predefine legend size and placement for clarity.",
    "Use callouts to emphasize hierarchical importance.",
    "Include fade-in/fade-out effects for overlapping sections.",
    "Track figure color consistency across chapters.",
    "Predefine labels for repeated elements.",
    "Highlight overlapping areas with shading.",
    "Include interactive toggles to explore details.",
    "Track figure complexity to match viewer cognitive load.",
    "Predefine margin spacing for text and visuals.",
    "Use motion highlights to draw attention sequentially.",
    "Include comparison visuals for complex data.",
    "Track hierarchy in multi-part diagrams.",
    "Predefine figure captions for reinforcement.",
    "Highlight trends visually using thicker lines or brighter colors.",
    "Include sequential layering for multi-step processes.",
    "Track overlapping chart axes and adjust spacing.",
    "Predefine color coding for recurring topics.",
    "Use overlay arrows to indicate relationships.",
    "Include zoom-in effects for detailed sections.",
    "Track figure density to prevent overload.",
    "Predefine animation timing for multi-layer diagrams.",
    "Highlight critical data points using visual cues.",
    "Include annotations for overlapping text labels.",
    "Track figure alignment across multiple slides.",
    "Predefine figure spacing for visual hierarchy.",
    "Use color contrast to differentiate overlapping elements.",
    "Include pop-up labels for optional deep dive info.",
    "Track axis label consistency for clarity.",
    "Predefine layering to highlight key points.",
    "Highlight sequence of steps using arrows or numbering.",
    "Include motion cues to maintain attention.",
    "Track figure redundancy and remove unnecessary visuals.",
    "Predefine hierarchy markers visually (size, color, position).",
    "Use animation to gradually reveal overlapping content.",
    "Include micro-annotations for complex diagrams.",
    "Track font sizes for readability in all visuals.",
    "Predefine legend consistency across related charts.",
    "Highlight exceptions using unique color or symbols.",
    "Include layered visuals for multi-step problem-solving.",
    "Track visual clarity when combining multiple figures.",
    "Predefine color gradient for progress visualization.",
    "Use visual arrows to indicate cause-effect relationships.",
    "Include fading transitions for overlapping sections.",
    "Track figure-to-text alignment for clarity.",
    "Predefine spacing between chart elements.",
    "Highlight hierarchy of concepts visually.",
    "Include sequential numbering for multi-part figures.",
    "Track overlapping labels dynamically and adjust them.",
    "Predefine icons for repeated concept representation.",
    "Use transparency for overlapping visuals.",
    "Include interactive figure layers for optional exploration.",
    "Track figure consistency across multiple videos.",
    "Predefine animation for stepwise data presentation.",
    "Highlight trend differences using color intensity.",
    "Include explanatory callouts on dense figures.",
    "Track cognitive load of complex visuals.",
    "Predefine hierarchy indicators (bold, color, size).",
    "Use motion paths to show process flow.",
    "Include layered annotations for clarity.",
    "Track figure scaling across device formats.",
    "Predefine legend placement to avoid overlap.",
    "Highlight multi-layer diagram steps sequentially.",
    "Include zoom-in on critical intersections.",
    "Track figure hierarchy relative to narration.",
    "Predefine overlay effects for overlapping charts.",
    "Use contrasting shapes for repeated elements.",
    "Include animated arrows to show relationships.",
    "Track figure readability after compression or export.",
    "Predefine visual templates for repetitive topics.",
    "Highlight key patterns using shading or color.",
    "Include sequential layering to reveal information gradually.",
    "Track consistency in color coding across figures.",
    "Predefine annotation style for dense charts.",
    "Use transparency for overlapping labels.",
    "Include visual cues for hierarchy in graphs.",
    "Track alignment of visual elements across slides.",
    "Predefine figure spacing to prevent clutter.",
    "Highlight critical connections visually.",
    "Include micro-animations to demonstrate process flow.",
    "Track overlapping elements and reposition dynamically.",
    "Predefine figure order based on topic complexity.",
    "Use color intensity to indicate importance.",
    "Include layer separation to show multi-step processes.",
    "Track font consistency in all visual labels.",
    "Predefine figure callouts for overlapping elements.",
    "Highlight stepwise changes in dynamic charts.",
    "Include interactive toggles to reveal/hide overlapping content.",
    "Use AI to analyze script readability and simplify complex sentences.",
    "Apply AI-based grammar and spelling correction for narration scripts.",
    "Predefine voice modulation patterns using AI for emphasis.",
    "Use AI to align narration with slide timing automatically.",
    "Apply AI for automatic pacing adjustment based on concept difficulty.",
    "Detect overlapping visual elements using computer vision.",
    "Use AI to automatically adjust overlapping labels in graphs.",
    "Predefine topic hierarchy rules for AI to follow in content structuring.",
    "Generate alternative explanations for complex concepts using AI.",
    "Use AI to summarize each section for reinforcement slides.",
    "Automatically detect redundant sentences and remove them.",
    "Use AI to generate captions synchronized with narration.",
    "Automatically highlight key terms and formulas using NLP.",
    "Use AI to detect gaps in topic coverage.",
    "Generate interactive quiz questions automatically based on content.",
    "Use AI to create hierarchical bullet points from complex text.",
    "Automatically adjust visuals to match narration context.",
    "Use AI to detect logical inconsistencies in script flow.",
    "Generate alternative analogies using AI for diverse learning styles.",
    "Automatically detect and fix overlapping graphs.",
    "Use AI to prioritize content based on importance and difficulty.",
    "Generate visual hierarchy markers automatically.",
    "Detect overlapping steps in problem-solving diagrams.",
    "Use AI to optimize figure layout for clarity.",
    "Automatically adjust font sizes in figures for readability.",
    "Generate voice modulation cues from key terms automatically.",
    "Detect content repetition across chapters and remove redundancy.",
    "Use AI to suggest better transitions between topics.",
    "Automatically flag ambiguous statements in narration.",
    "Generate callouts and annotations on figures dynamically.",
    "Use AI to optimize color schemes for accessibility.",
    "Automatically detect overlapping text and reposition labels.",
    "Generate hierarchical summaries for each topic using AI.",
    "Use AI to suggest pacing adjustments based on content complexity.",
    "Automatically align visuals and narration timing.",
    "Detect and highlight potential misconceptions in content.",
    "Generate multiple versions of explanations for diverse audiences.",
    "Use AI to ensure formula consistency across video segments.",
    "Automatically suggest examples for abstract concepts.",
    "Detect overlapping charts and resolve layout conflicts.",
    "Generate interactive figure overlays using AI.",
    "Use AI to detect redundancy in bullet points and visuals.",
    "Automatically suggest emphasis points in narration.",
    "Generate alternative phrasing for clarity.",
    "Use AI to automatically add reinforcement prompts in narration.",
    "Detect missing topic links and suggest content bridging.",
    "Automatically create animation sequences for complex processes.",
    "Generate hierarchical visual markers automatically.",
    "Use AI to detect overlapping timelines in charts.",
    "Automatically optimize spacing in multi-layer diagrams.",
    "Generate suggested mnemonics for key concepts.",
    "Detect overlapping lines in graphs and offset them.",
    "Automatically adjust legend placement in charts.",
    "Use AI to suggest better color contrasts for overlapping visuals.",
    "Automatically generate stepwise visual build-up animations.",
    "Detect overlapping text in captions and adjust dynamically.",
    "Generate alternative visual layouts for clarity.",
    "Use AI to track topic coverage completeness.",
    "Automatically create slide summaries at the end of each section.",
    "Detect hierarchy violations in narration and visuals.",
    "Generate multiple visual options for complex topics.",
    "Use AI to highlight key transitions in narration automatically.",
    "Automatically detect pacing issues and insert pauses.",
    "Generate alternate analogies for repeated concepts.",
    "Use AI to suggest improvements for figure readability.",
    "Automatically detect overlapping problem steps in diagrams.",
    "Generate interactive timelines for historical topics.",
    "Use AI to detect unclear sentences in narration scripts.",
    "Automatically align callouts with moving visuals.",
    "Generate summaries of overlapping topics for clarity.",
    "Detect inconsistent terminology and suggest corrections.",
    "Use AI to optimize figure-to-text ratio per slide.",
    "Automatically generate prompts for active learner engagement.",
    "Generate hierarchical figure layers automatically.",
    "Use AI to detect and highlight key trends in graphs.",
    "Automatically suggest color coding for overlapping data points.",
    "Generate alternative slide orders for better flow.",
    "Use AI to detect redundant explanations and remove them.",
    "Automatically highlight critical points in narration.",
    "Generate hierarchical captions for complex diagrams.",
    "Use AI to detect visual clutter and simplify figures.",
    "Automatically align multi-part figures for coherence.",
    "Generate suggestions for emphasizing key formulas.",
    "Use AI to detect overlapping audio cues and adjust timing.",
    "Automatically create interactive pop-ups for detailed info.",
    "Generate alternative animations for problem-solving steps.",
    "Use AI to optimize text placement on dense slides.",
    "Automatically highlight key hierarchical relationships.",
    "Generate voice emphasis markers based on content importance.",
    "Use AI to detect missing labels in figures and charts.",
    "Automatically adjust pacing based on content complexity.",
    "Generate hierarchical outlines from scripts automatically.",
    "Use AI to suggest improvements for narration clarity.",
    "Automatically detect overlapping diagrams and resolve them.",
    "Generate alternative color schemes for visual hierarchy.",
    "Use AI to automatically align callouts with relevant steps.",
    "Automatically detect missing examples in abstract topics.",
    "Generate alternate figure sequences for optimal understanding.",
    "Use AI to detect and correct misaligned captions.",
    "Automatically generate reinforcement questions after each segment.",
    "Generate interactive figure layers for optional learner exploration.",
    "Use AI to detect overlapping text in lists and bullet points.",
    "Automatically highlight key connections in graphs.",
    "Generate suggested animations for overlapping visual steps.",
    "Use AI to detect inconsistencies in problem-solving steps.",
    "Automatically generate hierarchical topic maps for narration.",
    "Generate alternate visuals for repeated content to maintain engagement.",
    "Use AI to track pacing and insert reflective pauses.",
    "Automatically detect conflicting data points in charts.",
    "Generate annotations for overlapping elements.",
    "Use AI to suggest optimal figure sizes per slide.",
    "Automatically adjust overlapping labels in multi-layer diagrams.",
    "Generate alternative phrasing for repeated explanations.",
    "Use AI to detect unclear visual sequences and suggest improvements.",
    "Automatically generate interactive quizzes based on overlapping topics.",
    "Generate reinforcement prompts at the end of complex topics.",
    "Use AI to detect overlapping hierarchical markers and clarify them.",
    "Automatically optimize figure order to match narration flow.",
    "Generate alternative layouts for complex multi-part diagrams.",
    "Use AI to detect missing connections between visual and verbal content.",
    "Automatically highlight overlapping topics in summaries.",
    "Generate dynamic visual cues for problem-solving sequences.",
    "Use AI to detect pacing inconsistencies in narrated segments.",
    "Automatically suggest hierarchy markers for new content.",
    "Generate interactive overlays for dense figures.",
    "Use AI to detect and resolve overlapping color codes in visuals.",
    "Automatically adjust spacing for readability in crowded diagrams.",
    "Generate alternative visual annotations for clarity.",
    "Use AI to track topic repetition and optimize reinforcement.",
    "Automatically highlight key steps in multi-step problem-solving.",
    "Generate alternative captions for complex diagrams.",
    "Use AI to detect overlapping audio-visual cues and fix timing.",
    "Automatically generate summary diagrams for each section.",
    "Generate hierarchical figure layers for multi-part content.",
    "Use AI to detect overlapping examples and suggest separation.",
    "Automatically create interactive steps for learners to explore.",
    "Generate alternative slide arrangements for optimal understanding.",
    "Use AI to detect missing visual emphasis markers.",
    "Automatically highlight overlapping data points in charts.",
    "Generate reinforcement prompts for hierarchical concepts.",
    "Use AI to track overlapping narration topics.",
    "Automatically adjust figure labels to prevent overlap.",
    "Generate alternate animation sequences for clarity.",
    "Use AI to detect incomplete hierarchy in slides.",
    "Automatically generate captions for all key figures.",
    "Generate suggestions for clearer stepwise diagrams.",
    "Use AI to detect redundant visuals and remove them.",
    "Automatically highlight trends in complex datasets.",
    "Generate hierarchical overlays for overlapping diagrams.",
    "Use AI to detect and fix pacing issues in overlapping narration.",
    "Automatically suggest visual highlights for key data points.",
    "Generate alternative figure layouts for multi-layer diagrams.",
    "Use AI to track and correct overlapping audio cues.",
    "Automatically highlight connections between concepts visually.",
    "Generate interactive diagrams for complex problem-solving steps.",
    "Use AI to detect unclear visual labeling and fix it automatically.",
    "Automatically generate hierarchical summaries for reinforcement.",
    "Generate alternate colors for overlapping elements to increase clarity.",
    "Use AI to track visual hierarchy and adjust dynamically.",
    "Automatically detect and resolve overlapping figure captions.",
    "Generate alternative animation for repeated content.",
    "Use AI to suggest hierarchy-based emphasis for narration.",
    "Automatically adjust pacing based on content complexity.",
    "Generate interactive pop-ups for optional exploration.",
    "Use AI to detect missing links between slides and visuals.",
    "Automatically highlight overlapping data trends for clarity.",
    "Generate hierarchical annotations for multi-layer diagrams.",
    "Use AI to detect redundant steps in multi-step problem-solving.",
    "Automatically generate reinforcement questions for dense topics.",
    "Generate alternative figure orders for better cognitive flow.",
    "Use AI to track and fix overlapping labels in charts.",
    "Automatically optimize figure placement relative to narration.",
    "Generate interactive overlays for detailed visual exploration.",
    "Use AI to detect misaligned callouts and fix them dynamically.",
    "Automatically highlight hierarchical relationships visually.",
    "Generate alternative layouts for overlapping diagrams.",
    "Use AI to track pacing consistency across sections.",
    "Automatically detect unclear stepwise instructions in visuals.",
    "Generate alternative captions and callouts for clarity.",
    "Use AI to detect missing reinforcement prompts and add them.",
    "Automatically highlight critical nodes in hierarchical diagrams.",
    "Generate alternate animation for repeated visual patterns.",
    "Use AI to track overlapping topics and resolve clarity issues.",
    "Automatically generate figure summaries for each section.",
    "Generate alternative slide sequences for complex topics.",
    "Use AI to detect visual clutter and simplify the layout.",
    "Automatically highlight key relationships in overlapping graphs.",
    "Generate interactive steps for complex problem-solving visuals.",
    "Use AI to track content hierarchy and reinforce key points.",
    "Automatically detect missing labels in multi-layer diagrams.",
    "Generate alternative figure arrangements to prevent overlap.",
    "Use AI to optimize animation speed for comprehension.",
    "Automatically highlight stepwise processes visually.",
    "Generate hierarchical overlays for dense visuals.",
    "Use AI to detect pacing inconsistencies and adjust automatically.",
    "Automatically suggest hierarchy-based emphasis for visuals.",
    "Generate alternative interactive sequences for learning reinforcement.",
    "Use AI to track overlapping narration and visual content.",
    "Automatically generate reinforcement summaries for complex topics.",
    "Generate hierarchical figure templates for future video production.",
    "Include periodic interactive questions during the video.",
    "Add clickable quizzes linked to specific concepts.",
    "Predefine pop-up hints for challenging problems.",
    "Use AI to suggest adaptive questions based on learner responses.",
    "Include pause points for learners to reflect on content.",
    "Use gamification elements like points for correct answers.",
    "Include mini-challenges after complex segments.",
    "Predefine interactive timelines for historical or process-based topics.",
    "Use AI to suggest reinforcement exercises dynamically.",
    "Include drag-and-drop exercises for matching concepts.",
    "Provide instant feedback for answers submitted.",
    "Include 'think-pair-share' style prompts for collaborative learning.",
    "Use AI to adapt difficulty based on learner performance.",
    "Predefine checkpoints for self-assessment.",
    "Include branching paths where learners choose topics.",
    "Add clickable annotations on figures for extra explanation.",
    "Include interactive simulations to demonstrate abstract concepts.",
    "Use polls to gauge understanding of a topic.",
    "Include mini-surveys to collect learner preferences.",
    "Add hotspots in diagrams that learners can click for more detail.",
    "Include scenario-based problem-solving exercises.",
    "Use AI to track engagement levels in real time.",
    "Predefine hints for exercises that learners struggle with.",
    "Include 'drag to sort' activities for concept hierarchy.",
    "Add multiple-choice challenges during demonstrations.",
    "Provide instant feedback with explanations for answers.",
    "Include 'pause and try yourself' segments.",
    "Use AI to dynamically suggest next topics based on mastery.",
    "Include timed challenges to encourage active recall.",
    "Add interactive step-by-step problem-solving guides.",
    "Include quizzes linked to specific figures or diagrams.",
    "Use AI to identify weak spots in learning and suggest exercises.",
    "Predefine interactive labels on graphs for exploration.",
    "Include fill-in-the-blank exercises for key formulas.",
    "Add scenario-based questions for real-world application.",
    "Include branching quizzes based on previous answers.",
    "Use AI to dynamically adjust reinforcement exercises.",
    "Add interactive flashcards for key terms.",
    "Include 'choose the right sequence' exercises for processes.",
    "Use instant feedback loops to reinforce correct answers.",
    "Include hotspots on video for optional deep dives.",
    "Add drag-and-drop labels on diagrams.",
    "Include periodic reflection prompts for learners.",
    "Use AI to suggest personalized exercises for each learner.",
    "Add stepwise interactive tutorials for problem-solving.",
    "Include clickable summaries for each topic.",
    "Add interactive sliders to explore changes in graphs.",
    "Include 'challenge yourself' exercises at the end of each chapter.",
    "Use AI to analyze engagement and suggest improvements.",
    "Include immediate scoring and feedback for interactive tasks.",
    "Add branching explanations depending on learner input.",
    "Include scenario-based interactive simulations.",
    "Add dynamic visual cues for learners' answers.",
    "Include interactive tables to explore data sets.",
    "Use AI to detect areas where learners pause frequently and suggest improvements.",
    "Include multiple-choice questions tied to key figures.",
    "Add interactive drag-to-match exercises for concepts and definitions.",
    "Include 'what happens next?' prompts for problem-solving.",
    "Use AI to monitor learner completion rates.",
    "Include short reflective pauses after difficult concepts.",
    "Add clickable definitions for technical terms.",
    "Include branching case studies for real-world applications.",
    "Add interactive simulations for physics or chemistry experiments.",
    "Use AI to dynamically adjust difficulty of interactive exercises.",
    "Include 'spot the error' challenges for graphs or calculations.",
    "Add mini-games related to formulas or equations.",
    "Include stepwise interactive problem-solving exercises.",
    "Use instant feedback for 'fill-in-the-blank' questions.",
    "Include drag-and-drop sequencing for procedures.",
    "Add interactive sliders to visualize mathematical functions.",
    "Include clickable explanations for each step in multi-part problems.",
    "Use AI to highlight common mistakes in user responses.",
    "Include reflection prompts for learners to summarize their understanding.",
    "Add timed interactive quizzes to encourage active recall.",
    "Include scenario-based branching exercises for decision-making.",
    "Use AI to track learner interaction patterns.",
    "Include clickable pop-up hints for complex diagrams.",
    "Add mini-simulations to demonstrate cause-effect relationships.",
    "Include interactive graphs for learners to manipulate variables.",
    "Use AI to suggest targeted exercises based on performance.",
    "Include drag-and-drop exercises for topic hierarchy understanding.",
    "Add 'match the pairs' exercises for concepts and examples.",
    "Include interactive summaries with clickable links to sections.",
    "Use AI to dynamically adjust pacing of interactive elements.",
    "Include stepwise interactive guides for formula derivations.",
    "Add interactive sliders for physics or chemistry experiments.",
    "Include 'select all that apply' questions for deeper thinking.",
    "Use AI to monitor which interactive elements are most used.",
    "Include mini-challenges to reinforce problem-solving skills.",
    "Add pop-up explanations for misunderstood steps.",
    "Include interactive tables to explore economic or chemical data.",
    "Use AI to suggest alternative questions for weak learners.",
    "Include drag-and-drop timelines for historical or process-based topics.",
    "Add interactive 'spot the difference' exercises for figures.",
    "Include clickable overlays to explain overlapping visuals.",
    "Use AI to dynamically highlight key concepts based on user focus.",
    "Include reflective prompts at the end of complex sections.",
    "Add interactive sequences for multi-step chemical reactions.",
    "Include branching quizzes to test understanding of dependencies.",
    "Use AI to detect and flag skipped content for reinforcement.",
    "Include interactive sliders to compare economic scenarios.",
    "Add 'click to reveal' solutions for problem-solving exercises.",
    "Include interactive flowcharts for hierarchical concept understanding.",
    "Use AI to suggest next steps based on learner mastery.",
    "Include drag-and-drop graph labeling exercises.",
    "Add mini-interactive labs for physics or chemistry.",
    "Include scenario-based roleplay exercises for real-world application.",
    "Use AI to detect learner hesitation and suggest help prompts.",
    "Include timed recall challenges for formulas or steps.",
    "Add clickable pop-ups for key concept definitions.",
    "Include multi-step branching case studies.",
    "Use AI to dynamically reorder content for optimal engagement.",
    "Include 'check your understanding' checkpoints after each section.",
    "Add interactive overlays for layered diagrams.",
    "Include mini-games to reinforce vocabulary or terms.",
    "Use AI to track time spent on each interactive element.",
    "Include problem-solving simulations with stepwise guidance.",
    "Add 'choose your path' exercises to explore alternative solutions.",
    "Include interactive summaries with clickable key terms.",
    "Use AI to suggest additional exercises for weak points.",
    "Include drag-and-drop matching of equations and graphs.",
    "Add interactive flashcards for stepwise problem-solving.",
    "Include 'what would you do next?' prompts for decision-making.",
    "Use AI to detect skipped interactive content and prompt users.",
    "Include scenario-based exploration of overlapping topics.",
    "Add interactive sliders to visualize chemical reaction rates.",
    "Include clickable annotations to clarify overlapping visuals.",
    "Use AI to dynamically highlight important steps in exercises.",
    "Include timed exercises to encourage rapid recall.",
    "Add interactive cause-effect diagrams for complex processes.",
    "Include drag-and-drop sequencing for hierarchical topics.",
    "Use AI to detect user misconceptions and provide adaptive hints.",
    "Include interactive pop-ups for reinforcement of key formulas.",
    "Add mini-quizzes to check comprehension after each figure.",
    "Include clickable overlays to explain overlapping steps.",
    "Use AI to optimize timing and placement of interactive elements.",
    "Include multi-step interactive tutorials for problem-solving.",
    "Add branching exercises for alternative solutions.",
    "Include interactive simulations for real-world applications.",
    "Use AI to track learner engagement and suggest improvements.",
    "Include 'choose the correct path' exercises for decision-making.",
    "Add interactive hints to guide learners through difficult problems.",
    "Include timed interactive problem-solving challenges.",
    "Use AI to detect patterns of incorrect answers and suggest content.",
    "Include interactive figure overlays to clarify hierarchy.",
    "Add multi-layered interactive diagrams for overlapping concepts.",
    "Include stepwise interactive exercises for formula derivation.",
    "Use AI to suggest reinforcement tasks based on performance.",
    "Include drag-and-drop exercises to match graphs and equations.",
    "Add interactive pop-ups to highlight key overlapping concepts.",
    "Use AI to detect overlapping text and automatically adjust spacing.",
    "Apply automated figure alignment to prevent collisions with text.",
    "Detect duplicate content across slides and remove redundancy.",
    "Automatically check topic hierarchy consistency.",
    "Use AI to flag inconsistent labeling in figures and graphs.",
    "Detect overlapping chart elements and reposition them dynamically.",
    "Verify visual clarity after resizing or scaling slides.",
    "Use AI to detect misaligned captions and correct placement.",
    "Detect overlapping annotations and adjust layering.",
    "Automatically check font size consistency across slides.",
    "Detect figure density and reduce clutter for better readability.",
    "Use AI to highlight overlapping steps in multi-part problems.",
    "Verify axis scaling consistency across graphs.",
    "Detect overlapping arrows or callouts and reposition them.",
    "Check slide order to ensure logical progression of topics.",
    "Use AI to detect overlapping audio cues and adjust timing.",
    "Verify visual hierarchy matches narration emphasis.",
    "Detect repeated formulas or graphs and merge where appropriate.",
    "Automatically check color contrast for accessibility.",
    "Detect overlapping pop-up hints and adjust display timing.",
    "Verify captions match the corresponding figure.",
    "Use AI to detect overlapping quiz questions and adjust placement.",
    "Detect inconsistent numbering in multi-step diagrams.",
    "Check spacing between multi-layered figure elements.",
    "Automatically flag missing figure legends.",
    "Detect overlapping animation sequences and adjust timing.",
    "Verify that topic transitions follow hierarchy rules.",
    "Detect overlapping interactive elements and reposition them.",
    "Automatically check alignment of charts, text, and diagrams.",
    "Detect overlapping highlights in figures and adjust opacity.",
    "Verify consistency of terminology across slides.",
    "Detect overlapping audio narration for multi-speaker content.",
    "Automatically detect missing labels in multi-layered graphs.",
    "Detect overlapping shapes or symbols and adjust position.",
    "Check slide-to-slide continuity for repeated topics.",
    "Detect overlapping figure layers and re-order correctly.",
    "Verify formula consistency across different sections.",
    "Detect overlapping bullet points and merge logically.",
    "Check color consistency for hierarchical topic indicators.",
    "Detect overlapping callouts for key concepts.",
    "Automatically adjust position of overlapping icons.",
    "Verify timing alignment of visuals with narration.",
    "Detect overlapping timeline markers in process diagrams.",
    "Check alignment of interactive sliders and graphs.",
    "Detect overlapping pop-up explanations in multi-layer diagrams.",
    "Verify correct sequencing of multi-step problem-solving exercises.",
    "Detect overlapping hotspots in interactive figures.",
    "Automatically adjust overlapping visual overlays.",
    "Verify that hierarchy of topics matches slide order.",
    "Detect overlapping arrows in process flows and adjust direction.",
    "Check spacing between multiple charts on a single slide.",
    "Detect overlapping text labels in dense diagrams.",
    "Verify consistency of color coding across repeated visuals.",
    "Detect overlapping mini-quizzes and adjust placement.",
    "Check interactive elements for overlap with main content.",
    "Detect misaligned axes in graphs with multiple layers.",
    "Automatically flag missing figure references in narration.",
    "Detect overlapping animation start and end points.",
    "Verify figure size proportion across related slides.",
    "Detect overlapping diagram components in multi-step visuals.",
    "Check that topic hierarchy is reflected in figure layering.",
    "Detect overlapping formula boxes and reposition them.",
    "Automatically detect duplicate visual elements and merge.",
    "Verify spacing between text blocks and figures.",
    "Detect overlapping pop-up tips in interactive sections.",
    "Check consistency of bullet numbering in lists.",
    "Detect overlapping highlights in summary slides.",
    "Automatically verify caption placement for each figure.",
    "Detect overlapping timeline labels in process diagrams.",
    "Check alignment of nested diagrams for clarity.",
    "Detect overlapping reference lines in graphs.",
    "Verify hierarchical ordering of interactive exercises.",
    "Detect overlapping answer boxes in quizzes.",
    "Automatically adjust overlapping figure legends.",
    "Check multi-layered animations for collision.",
    "Detect overlapping arrows in flow diagrams.",
    "Verify spacing between interactive elements and main content.",
    "Detect overlapping hotspot areas in interactive visuals.",
    "Check for redundant visual elements across slides.",
    "Detect overlapping text in captions and labels.",
    "Automatically flag inconsistent figure sizes.",
    "Verify alignment of callouts with corresponding visuals.",
    "Detect overlapping interactive pop-ups and adjust order.",
    "Check color hierarchy consistency in charts.",
    "Detect overlapping labels in multi-part diagrams.",
    "Verify spacing of visual elements for clarity and readability.",
    "Detect overlapping animation sequences for complex processes.",
    "Check numbering sequence in multi-step exercises.",
    "Detect overlapping highlights on key data points.",
    "Automatically detect missing connections between visuals and narration.",
    "Check hierarchy of topics in interactive elements.",
    "Detect overlapping axes in graphs and adjust scaling.",
    "Verify clarity of overlapping figure layers.",
    "Detect overlapping annotations and move to avoid conflict.",
    "Check alignment of multi-layered pop-ups in interactive sections.",
    "Detect overlapping callouts for formulas or equations.",
    "Verify that slide hierarchy matches content flow.",
    "Detect overlapping diagrams in comparison slides.",
    "Check for duplicate figure elements across different sections.",
    "Detect overlapping interactive quiz elements.",
    "Verify consistent placement of repeated visual icons.",
    "Detect overlapping lines in multi-series charts.",
    "Check spacing between text and interactive elements.",
    "Detect overlapping captions for complex diagrams.",
    "Automatically adjust overlapping hierarchical markers.",
    "Verify timing alignment for overlapping narration and animation.",
    "Detect overlapping pop-up hints for complex topics.",
    "Check spacing between interactive sliders and graphs.",
    "Detect overlapping elements in multi-layered diagrams.",
    "Verify figure alignment in multi-slide sequences.",
    "Detect overlapping arrows or connectors in flowcharts.",
    "Check visual hierarchy in overlapping charts.",
    "Detect overlapping text in dense bullet lists.",
    "Automatically detect misaligned labels in interactive diagrams.",
    "Verify spacing and alignment in multi-part interactive elements.",
    "Detect overlapping highlights in key summary sections.",
    "Check hierarchy consistency in nested diagrams.",
    "Detect overlapping timelines in historical process diagrams.",
    "Verify alignment of visual and audio cues.",
    "Detect overlapping shapes in multi-step problem-solving visuals.",
    "Check consistency of color usage for overlapping elements.",
    "Detect overlapping figure legends and adjust positions.",
    "Verify spacing between multi-layered interactive elements.",
    "Detect overlapping labels in complex graphs.",
    "Check alignment of hierarchical markers in figures.",
    "Detect overlapping pop-up callouts in interactive exercises.",
    "Verify slide order for logical content flow.",
    "Detect overlapping animation frames in multi-step processes.",
    "Check consistency of formula placement in visuals.",
    "Detect overlapping hotspots in interactive diagrams.",
    "Verify hierarchy of figure layers in multi-layered diagrams.",
    "Detect overlapping text in captions or labels.",
    "Check alignment of interactive quiz boxes.",
    "Detect overlapping arrows or flow connectors.",
    "Verify consistency of slide hierarchy and topic order.",
    "Detect overlapping interactive elements in simulations.",
    "Check visual clarity after resizing or scaling figures.",
    "Detect overlapping labels in nested diagrams.",
    "Verify alignment of callouts with corresponding elements.",
    "Detect overlapping highlights in summary visuals.",
    "Check consistency of color hierarchy across slides.",
    "Detect overlapping figure legends and captions.",
    "Verify spacing of multi-layered visual elements.",
    "Detect overlapping animation sequences for clarity.",
    "Check alignment of multi-part diagrams with narration.",
    "Detect overlapping timeline markers or process steps.",
    "Verify hierarchical topic consistency in visuals.",
    "Detect overlapping hotspots in multi-layered interactive diagrams.",
    "Check that all overlapping elements maintain readability.",
    "Automatically generate a final QC report highlighting all overlaps and hierarchy inconsistencies.",
]

OUTPUT_GUIDELINES_INDEX = {f"G{idx+1:03d}": rule for idx, rule in enumerate(OUTPUT_GUIDELINES)}

GUIDELINE_REFERENCE_NOTE = (
    "When responding to any request about video planning or instructional design, "
    "use the OUTPUT_GUIDELINES list as the authoritative checklist and return "
    "structured, numbered items that map to the guideline IDs (G001, G002, ...)."
)

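# Append the note only once so re-running this cell does not duplicate it.
if GUIDELINE_REFERENCE_NOTE not in SYSTEM_PROMPT:
    SYSTEM_PROMPT = SYSTEM_PROMPT + "\n\n" + GUIDELINE_REFERENCE_NOTE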


def format_guidelines_response(selected_ids=None):
    if selected_ids is None:
        selected_ids = list(OUTPUT_GUIDELINES_INDEX.keys())
    lines = []
    for gid in selected_ids:
        rule = OUTPUT_GUIDELINES_INDEX.get(gid)
        if rule:
            lines.append(f"{gid}: {rule}")
    return "\n".join(lines)

# Example:
# print(format_guidelines_response(["G001", "G010"]))

### Output Guidelines (Structured)

This section stores the requested output guidelines in a structured format and appends a concise policy to the system prompt so the model can reference them consistently.
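
As a rough illustration of the ID scheme, the sketch below builds the same kind of index over a three-item sample and formats a selection. The names `sample_guidelines`, `sample_index`, and `format_selection` are stand-ins invented for this example; the notebook's own `OUTPUT_GUIDELINES` list is far longer.

```python
# Hypothetical three-item sample; the notebook's OUTPUT_GUIDELINES list is much longer.
sample_guidelines = [
    "Provide instant feedback for answers submitted.",
    "Predefine checkpoints for self-assessment.",
    "Verify captions match the corresponding figure.",
]

# Same ID scheme as OUTPUT_GUIDELINES_INDEX: 1-based, zero-padded to three digits.
sample_index = {f"G{i + 1:03d}": rule for i, rule in enumerate(sample_guidelines)}


def format_selection(index, selected_ids=None):
    """Return 'ID: rule' lines for the selected IDs, skipping unknown IDs."""
    if selected_ids is None:
        selected_ids = list(index)
    return "\n".join(f"{gid}: {index[gid]}" for gid in selected_ids if gid in index)


print(format_selection(sample_index, ["G001", "G003", "G999"]))
```

Unknown IDs such as `G999` are skipped silently, which mirrors how `format_guidelines_response` ignores IDs that are missing from the index.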

## 3. Configure Model and Tokenizer

### ⚠️ Important: Authentication Required

The `google/gemma-2b-it` model is **gated** and requires:

1. **Request Access**: Go to https://huggingface.co/google/gemma-2b-it and click "Request Access"
2. **Create HF Token**: Go to https://huggingface.co/settings/tokens and create a token with "Read" permissions
3. **Login**: Run the cell below to authenticate

**Alternative**: If you can't access Gemma, use an open model like `microsoft/phi-2` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (change `BASE_MODEL` below)
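
One possible way to avoid leaving a token in the notebook source is to fall back to the `HF_TOKEN` environment variable when nothing is pasted. This is only a sketch: `resolve_hf_token` is a hypothetical helper, not part of `huggingface_hub`, and it only decides which token string to use.

```python
import os


def resolve_hf_token(pasted_token: str = "", env=None):
    """Prefer an explicitly pasted token, then fall back to the HF_TOKEN variable."""
    env = os.environ if env is None else env
    token = pasted_token.strip() or env.get("HF_TOKEN", "").strip()
    return token or None


token = resolve_hf_token()
if token is None:
    print("No token found; set HF_TOKEN or paste a token before loading a gated model.")
else:
    print("Token found; pass it to huggingface_hub.login(token=token).")
```

The resolved value can then be handed to `login(token=...)` in the authentication cell.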

In [10]:
# ============================================================================
# HUGGING FACE AUTHENTICATION - PASTE YOUR TOKEN HERE
# ============================================================================

from huggingface_hub import login

print("="*80)
print("🔐 HUGGING FACE LOGIN")
print("="*80)

# PASTE YOUR TOKEN BELOW (replace the empty string with your actual token)
YOUR_TOKEN = ""  # ← Put your token here between the quotes

if YOUR_TOKEN:
    login(token=YOUR_TOKEN)
    print("✓ Successfully logged in to Hugging Face!")
    print("✓ You can now access gated models like google/gemma-2b-it")
else:
    print("\n⚠️  WARNING: No token provided!")
    print("   Please paste your Hugging Face token above where it says YOUR_TOKEN = \"\"")
    print("   Get your token from: https://huggingface.co/settings/tokens")
    print("\n   Example: YOUR_TOKEN = \"hf_abcdefghijklmnopqrstuvwxyz1234567890\"")
    print("\n   After adding your token, re-run this cell.")

print("\n" + "="*80)


🔐 HUGGING FACE LOGIN

   Please paste your Hugging Face token above where it says YOUR_TOKEN = ""
   Get your token from: https://huggingface.co/settings/tokens

   Example: YOUR_TOKEN = "hf_abcdefghijklmnopqrstuvwxyz1234567890"

   After adding your token, re-run this cell.



In [11]:
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "google/gemma-2b-it"  # Change if needed

print("="*80)
print("📥 LOADING MODEL & TOKENIZER")
print("="*80)
print(f"\nModel: {BASE_MODEL}")
print("This will download ~2.2GB of model files...")
print("\n⏳ Please wait (may take 2-5 minutes on first run)...\n")

# Load tokenizer/model
try:
    TOKENIZER = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
    print("✓ Tokenizer loaded successfully")
    
    MODEL = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto",
    )
    print("✓ Model loaded successfully")

except Exception as e:
    print(f"\n❌ ERROR: Failed to load model")
    print(f"   {str(e)}")
    print("\n💡 Solutions:")
    print("   1. Make sure you ran the previous cell with your HF token")
    print("   2. Check you have access to google/gemma-2b-it at:")
    print("      https://huggingface.co/google/gemma-2b-it")
    print("   3. Or change BASE_MODEL to an open model like 'microsoft/phi-2'")
    raise

if TOKENIZER.pad_token is None:
    TOKENIZER.pad_token = TOKENIZER.eos_token
    print("✓ Configured pad token")

# LoRA configuration for efficient fine-tuning
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
MODEL = get_peft_model(MODEL, lora_config)
print("✓ LoRA adapter applied to model")

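def build_prompt(question: str, answer: str | None = None) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    if answer is not None:
        messages.append({"role": "assistant", "content": answer})

    if getattr(TOKENIZER, "chat_template", None):
        try:
            return TOKENIZER.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=answer is None
            )
        except Exception:
            # Some chat templates (Gemma's among them) reject the "system" role.
            # Fold the system prompt into the user turn and retry.
            merged = [{"role": "user", "content": f"{SYSTEM_PROMPT}\n\n{question}"}]
            merged.extend(messages[2:])
            return TOKENIZER.apply_chat_template(
                merged, tokenize=False, add_generation_prompt=answer is None
            )

    # Fallback generic prompt for tokenizers without a chat template
    prompt = f"<|system|>\n{SYSTEM_PROMPT}\n<|user|>\n{question}\n<|assistant|>\n"
    if answer is not None:
        prompt += answer
    return prompt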

print("\n" + "="*80)
print("✅ MODEL CONFIGURATION COMPLETE")
print("="*80)
print(f"Base Model: {BASE_MODEL}")
print(f"LoRA Config: r={lora_config.r}, alpha={lora_config.lora_alpha}, dropout={lora_config.lora_dropout}")
print(f"Trainable params: {MODEL.get_nb_trainable_parameters()}")
print("="*80)

📥 LOADING MODEL & TOKENIZER

Model: google/gemma-2b-it
This will download ~2.2GB of model files...

⏳ Please wait (may take 2-5 minutes on first run)...



Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



❌ ERROR: Failed to load model
   You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/google/gemma-2b-it.
401 Client Error. (Request ID: Root=1-698fedd3-0921e0881453ef1046b31d57;5ebbd77d-bf31-4c3d-959b-ea3474b39920)

Cannot access gated repo for url https://huggingface.co/google/gemma-2b-it/resolve/main/config.json.
Access to model google/gemma-2b-it is restricted. You must have access to it and be authenticated to access it. Please log in.

💡 Solutions:
   1. Make sure you ran the previous cell with your HF token
   2. Check you have access to google/gemma-2b-it at:
      https://huggingface.co/google/gemma-2b-it
   3. Or change BASE_MODEL to an open model like 'microsoft/phi-2'


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/google/gemma-2b-it.
401 Client Error. (Request ID: Root=1-698fedd3-0921e0881453ef1046b31d57;5ebbd77d-bf31-4c3d-959b-ea3474b39920)

Cannot access gated repo for url https://huggingface.co/google/gemma-2b-it/resolve/main/config.json.
Access to model google/gemma-2b-it is restricted. You must have access to it and be authenticated to access it. Please log in.

## 4. Implement Custom Training Loop

### Overlap-Safe Visualization Helpers

These helpers reduce collisions between plot elements and keep dynamic graphs readable.

In [None]:
import random
import re


def _normalize_text(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s


def _token_set(s: str) -> set[str]:
    return set(_normalize_text(s).split())


def _jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a.intersection(b)) / max(1, len(a.union(b)))


def _strict_overlap_pairs(train_items, val_items, jaccard_threshold=0.95):
    train_keys = {normalize_question(ex["question"]) for ex in train_items}
    val_keys = {normalize_question(ex["question"]) for ex in val_items}
    exact_overlap = train_keys.intersection(val_keys)

    train_pairs = {_normalize_text(ex["question"] + " " + ex["answer"]) for ex in train_items}
    val_pairs = {_normalize_text(ex["question"] + " " + ex["answer"]) for ex in val_items}
    pair_overlap = train_pairs.intersection(val_pairs)

    # Near-duplicate detection using Jaccard similarity on token sets
    near_overlap = []
    val_tokens = [(_token_set(ex["question"]), ex["question"]) for ex in val_items]
    for ex in train_items:
        tset = _token_set(ex["question"])
        for vset, vq in val_tokens:
            if _jaccard(tset, vset) >= jaccard_threshold:
                near_overlap.append((ex["question"], vq))

    return exact_overlap, pair_overlap, near_overlap


def _assert_no_overlap(train_items, val_items, jaccard_threshold=0.95):
    exact_overlap, pair_overlap, near_overlap = _strict_overlap_pairs(
        train_items, val_items, jaccard_threshold=jaccard_threshold
    )
    if exact_overlap or pair_overlap or near_overlap:
        raise ValueError(
            "Overlap detected: "
            f"exact={len(exact_overlap)}, pair={len(pair_overlap)}, near={len(near_overlap)}"
        )
    return len(train_items), len(val_items)


def _domain_balance_report(items):
    counts = {d: 0 for d in ALLOWED_DOMAINS}
    for ex in items:
        d = ex.get("domain", "").lower()
        if d in counts:
            counts[d] += 1
    return counts


def _stratified_split(items, train_ratio=0.9, seed=42):
    rng = random.Random(seed)
    groups = {d: [] for d in ALLOWED_DOMAINS}
    for ex in items:
        d = ex.get("domain", "").lower()
        if d in groups:
            groups[d].append(ex)

    train_items, val_items = [], []
    for d, group in groups.items():
        rng.shuffle(group)
        split_idx = int(train_ratio * len(group))
        train_items.extend(group[:split_idx])
        val_items.extend(group[split_idx:])
    rng.shuffle(train_items)
    rng.shuffle(val_items)
    return train_items, val_items


def _split_no_overlap(items, train_ratio=0.9, seed=42, max_tries=20, jaccard_threshold=0.95):
    for attempt in range(max_tries):
        train_items, val_items = _stratified_split(items, train_ratio, seed + attempt)
        try:
            _assert_no_overlap(train_items, val_items, jaccard_threshold=jaccard_threshold)
            return train_items, val_items
        except ValueError:
            continue
    raise ValueError("Unable to create non-overlapping split after retries")


def overlap_resolving_training_check(
    train_items,
    val_items,
    auto_fix=True,
    min_per_domain=1,
    jaccard_threshold=0.95,
):
    try:
        train_n, val_n = _assert_no_overlap(
            train_items, val_items, jaccard_threshold=jaccard_threshold
        )
    except ValueError as err:
        if not auto_fix:
            raise
        print("Overlap detected, re-splitting...", err)
        # Re-split the pooled examples passed in instead of relying on a global.
        train_items, val_items = _split_no_overlap(
            train_items + val_items,
            train_ratio=0.9,
            seed=42,
            max_tries=30,
            jaccard_threshold=jaccard_threshold,
        )
        train_n, val_n = _assert_no_overlap(
            train_items, val_items, jaccard_threshold=jaccard_threshold
        )

    train_counts = _domain_balance_report(train_items)
    val_counts = _domain_balance_report(val_items)

    for d in ALLOWED_DOMAINS:
        if train_counts[d] < min_per_domain or val_counts[d] < min_per_domain:
            raise ValueError(f"Domain '{d}' below minimum per split")

    print("Train size:", train_n, "Val size:", val_n)
    print("Train domain balance:", train_counts)
    print("Val domain balance:", val_counts)
    return train_items, val_items

# Run once after creating the split (strict checks + auto-fix)
train_examples, val_examples = overlap_resolving_training_check(
    train_examples,
    val_examples,
    auto_fix=True,
    min_per_domain=1,
    jaccard_threshold=0.95,
)
train_ds = Dataset.from_list(train_examples)
val_ds = Dataset.from_list(val_examples)

### Overlap-Resolving Training Details

This section makes overlap handling explicit during training: it validates dataset uniqueness, prevents train/val leakage, and logs overlap checks per epoch.
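
To make the near-duplicate check concrete, here is a small worked example of Jaccard similarity on token sets. It is a standalone sketch: `token_set` and `jaccard` re-implement the idea behind the notebook's private helpers, and the three questions are made up for illustration.

```python
import re


def token_set(text: str) -> set:
    """Lowercase, replace punctuation with spaces, and split into a set of tokens."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return set(cleaned.split())


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical questions used only for this illustration.
train_q = "What is the derivative of x^2?"
val_same = "what is the derivative of x^2"
val_other = "What is the integral of x^2?"

print(jaccard(token_set(train_q), token_set(val_same)))   # same tokens after normalization
print(jaccard(token_set(train_q), token_set(val_other)))  # 6 shared tokens out of 8
```

With a threshold of 0.95, the first pair would be flagged as leakage (similarity 1.0), while the second pair (similarity 0.75) would be treated as distinct questions.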

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.gridspec import GridSpec


def _aabb_overlap(b1, b2) -> bool:
    return b1.overlaps(b2)


def _resolve_text_overlap(ax_text):
    fig = ax_text.figure
    fig.canvas.draw()
    renderer = fig.canvas.get_renderer()
    texts = ax_text.texts
    if len(texts) < 2:
        return

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            b1 = texts[i].get_window_extent(renderer)
            b2 = texts[j].get_window_extent(renderer)
            if _aabb_overlap(b1, b2):
                x, y = texts[j].get_position()
                texts[j].set_position((x, y - 0.05))


def plot_domain_example(domain: str):
    fig = plt.figure(figsize=(10, 4), constrained_layout=True)
    gs = GridSpec(1, 2, figure=fig, width_ratios=[1, 1], wspace=0.25)
    ax_graph = fig.add_subplot(gs[0, 0])
    ax_text = fig.add_subplot(gs[0, 1])
    ax_text.axis("off")

    if domain == "math":
        x = np.linspace(-3, 3, 200)
        y = x ** 2 - 1
        ax_graph.plot(x, y)
        ax_graph.set_title("Quadratic Function")
        ax_text.text(0.02, 0.95, "Math Example: y = x^2 - 1\nRoots at x = -1, 1.", va="top", wrap=True)
    elif domain == "physics":
        t = np.linspace(0, 10, 200)
        y = 0.5 * 9.8 * t ** 2
        ax_graph.plot(t, y)
        ax_graph.set_title("Free-Fall Distance")
        ax_text.text(0.02, 0.95, "Physics Example: s = 1/2 g t^2\nAssume g = 9.8 m/s^2.", va="top", wrap=True)
    elif domain == "economics":
        q = np.linspace(0, 100, 200)
        p = 100 - q
        ax_graph.plot(q, p)
        ax_graph.set_title("Demand Curve")
        ax_text.text(0.02, 0.95, "Economics Example: P = 100 - Q\nSlope is -1.", va="top", wrap=True)
    elif domain == "chemistry":
        t = np.linspace(0, 10, 200)
        c = np.exp(-0.3 * t)
        ax_graph.plot(t, c)
        ax_graph.set_title("First-Order Decay")
        ax_text.text(0.02, 0.95, "Chemistry Example: C = C0 e^{-kt}\nHere k = 0.3.", va="top", wrap=True)
    else:
        raise ValueError("Unknown domain")

    _resolve_text_overlap(ax_text)
    plt.show()

# Example: plot_domain_example("math")

In [None]:
import time

from IPython.display import clear_output


def dynamic_domain_graphs(domains=None, pause_s=1.0, cycles=1):
    if domains is None:
        domains = ["math", "physics", "economics", "chemistry"]

    for _ in range(cycles):
        for domain in domains:
            clear_output(wait=True)
            plot_domain_example(domain)
            time.sleep(pause_s)

# Example: dynamic_domain_graphs(pause_s=1.2, cycles=2)

### Domain Layout Examples (Math, Physics, Economics, Chemistry)

These examples show overlap-safe plotting for domain content: a graph panel plus a text panel using GridSpec, with collision checks on annotations.
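
The collision check behind these helpers is an axis-aligned bounding-box test. The sketch below shows that test in plain Python; `Box` and `boxes_overlap` are hypothetical stand-ins for the bounding boxes returned by `get_window_extent`, and the strict inequalities (touching edges do not count as overlap) are an assumption of this sketch that may differ from matplotlib's own behavior.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in display coordinates (hypothetical stand-in for a Bbox)."""
    x0: float
    y0: float
    x1: float
    y1: float


def boxes_overlap(a: Box, b: Box) -> bool:
    """Two rectangles overlap when their extents intersect on both axes."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


# Made-up extents for a title, a legend, and an x-axis label.
title = Box(10, 90, 60, 100)
legend = Box(50, 85, 95, 95)
xlabel = Box(30, 0, 70, 8)

print(boxes_overlap(title, legend))  # extents intersect on both axes
print(boxes_overlap(title, xlabel))  # separated vertically
```

When a pair of boxes overlaps, the plotting helpers can nudge one element (for example, shifting a text label down) and check again.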

In [None]:
from torch.utils.data import DataLoader
from accelerate import Accelerator
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from IPython.display import clear_output

MAX_LEN = 1024


def tokenize_example(ex):
    full_text = build_prompt(ex["question"], ex["answer"])
    prompt_text = build_prompt(ex["question"], None)

    full = TOKENIZER(full_text, truncation=True, max_length=MAX_LEN)
    prompt_ids = TOKENIZER(prompt_text, truncation=True, max_length=MAX_LEN)["input_ids"]

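    # Mask prompt tokens with -100 so the loss covers only the answer.
    # Clamp the mask length: after truncation the prompt can be as long as the full sequence.
    labels = full["input_ids"].copy()
    n_prompt = min(len(prompt_ids), len(labels))
    labels[:n_prompt] = [-100] * n_prompt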
    full["labels"] = labels
    return full

train_tok = train_ds.map(tokenize_example, remove_columns=train_ds.column_names)
val_tok = val_ds.map(tokenize_example, remove_columns=val_ds.column_names)


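def collate_fn(batch):
    # TOKENIZER.pad does not pad "labels", so pad them separately with -100
    # (ignored by the loss) on the same side as the inputs.
    labels = [ex["labels"] for ex in batch]
    features = [{k: v for k, v in ex.items() if k != "labels"} for ex in batch]
    padded = TOKENIZER.pad(features, padding=True, return_tensors="pt")
    width = padded["input_ids"].shape[1]
    rows = []
    for lab in labels:
        fill = [-100] * (width - len(lab))
        rows.append(fill + lab if TOKENIZER.padding_side == "left" else lab + fill)
    padded["labels"] = torch.tensor(rows)
    return padded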

train_loader = DataLoader(train_tok, batch_size=2, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_tok, batch_size=2, shuffle=False, collate_fn=collate_fn)

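optimizer = torch.optim.AdamW(MODEL.parameters(), lr=2e-4)

# Prepare the optimizer together with the model and loaders so Accelerate
# manages device placement and mixed precision consistently.
accelerator = Accelerator()
MODEL, optimizer, train_loader, val_loader = accelerator.prepare(
    MODEL, optimizer, train_loader, val_loader
)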


def _bbox_overlap(b1, b2) -> bool:
    return b1.overlaps(b2)


def _plot_has_overlap(ax) -> bool:
    fig = ax.figure
    fig.canvas.draw()
    renderer = fig.canvas.get_renderer()
    boxes = []

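    # Text artists always exist, so test their content rather than their truthiness.
    if ax.title.get_text():
        boxes.append(ax.title.get_window_extent(renderer))
    if ax.xaxis.label.get_text():
        boxes.append(ax.xaxis.label.get_window_extent(renderer))
    if ax.yaxis.label.get_text():
        boxes.append(ax.yaxis.label.get_window_extent(renderer))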

    legend = ax.get_legend()
    if legend is not None:
        boxes.append(legend.get_window_extent(renderer))

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if _bbox_overlap(boxes[i], boxes[j]):
                return True
    return False


def safe_plot_loss(history):
    clear_output(wait=True)
    fig, ax = plt.subplots(figsize=(6, 4), constrained_layout=True)
    ax.plot(history["train"], label="train")
    ax.plot(history["val"], label="val")
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.set_title("Training Loss", pad=10)
    ax.legend(loc="best")

    if _plot_has_overlap(ax):
        fig.set_constrained_layout(False)
        fig.tight_layout(pad=1.2)
        ax.set_title("Training Loss", pad=14)

    plt.show()


def evaluate():
    MODEL.eval()
    losses = []
    for batch in val_loader:
        with torch.no_grad():
            outputs = MODEL(**batch)
        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch["input_ids"].shape[0])))
    losses = torch.cat(losses)
    return torch.mean(losses).item()


def train_loop(epochs=3, grad_accum_steps=4, early_stop_patience=2):
    history = {"train": [], "val": []}
    best_val = float("inf")
    patience = 0

    for epoch in range(epochs):
        MODEL.train()
        total_loss = 0.0
        step = 0

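        num_batches = len(train_loader)
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}"):
            outputs = MODEL(**batch)
            loss = outputs.loss / grad_accum_steps
            accelerator.backward(loss)

            # Update on every full accumulation window, and on the final batch
            # so leftover gradients do not leak into the next epoch.
            if (step + 1) % grad_accum_steps == 0 or (step + 1) == num_batches:
                optimizer.step()
                optimizer.zero_grad()

            # Track the unscaled loss so train and validation values are comparable.
            total_loss += outputs.loss.item()
            step += 1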

        avg_train = total_loss / max(1, step)
        avg_val = evaluate()

        history["train"].append(avg_train)
        history["val"].append(avg_val)

        safe_plot_loss(history)

        # Guard against train/validation leakage: stop if the two splits share examples
        try:
            _assert_no_overlap(train_examples, val_examples)
        except ValueError as err:
            print("Overlap check failed:", err)
            break

        if avg_val < best_val:
            best_val = avg_val
            patience = 0
        else:
            patience += 1
            if patience >= early_stop_patience:
                print("Early stopping triggered.")
                break

    return history
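
The label masking in `tokenize_example` is what restricts the loss to answer tokens: positions covered by the prompt are set to `-100`, the index that PyTorch's cross-entropy loss ignores by default. A minimal sketch with made-up token IDs (no tokenizer involved; `mask_prompt` is a hypothetical helper), which also clamps the mask length in case truncation makes the prompt as long as the full sequence:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt(full_ids, prompt_len):
    """Copy full_ids as labels and hide the first prompt_len positions from the loss."""
    n = min(prompt_len, len(full_ids))  # never write past the end of the sequence
    return [IGNORE_INDEX] * n + list(full_ids[n:])


# Hypothetical IDs: 4 prompt tokens followed by 3 answer tokens.
full_ids = [2, 910, 44, 7, 1201, 88, 1]
labels = mask_prompt(full_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 1201, 88, 1]
```

Only the last three positions contribute to the loss here, so the model is not trained to reproduce the question text.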

## 5. Fine-tune Model on Domain Data
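Two arguments control the run below: `grad_accum_steps` sets how many micro-batches contribute to one optimizer update, and `early_stop_patience` sets how many consecutive non-improving epochs are tolerated. A small sketch of both, assuming the loader batch size of 2 used above, a hypothetical batch count, and hypothetical validation losses (the helper names are illustrative):

```python
import math


def effective_batch_size(micro_batch, grad_accum_steps):
    """Examples that contribute to a single optimizer update."""
    return micro_batch * grad_accum_steps


def optimizer_steps(num_batches, grad_accum_steps):
    """Updates per epoch, assuming a trailing partial window is also applied."""
    return math.ceil(num_batches / grad_accum_steps)


def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which early stopping fires, or None."""
    best, bad = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return None


print(effective_batch_size(2, 4))  # 8
print(optimizer_steps(50, 4))      # 13
print(stopping_epoch([1.9, 1.6, 1.7, 1.65, 1.5], patience=2))  # 4
```

With `EPOCHS = 3` and a patience of 2, early stopping can only fire at the final epoch, so it matters mainly when the epoch count is raised.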

In [None]:
EPOCHS = 3
GRAD_ACCUM = 4

history = train_loop(epochs=EPOCHS, grad_accum_steps=GRAD_ACCUM, early_stop_patience=2)

## 6. Evaluate Model Performance
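Perplexity is the exponential of the mean per-token cross-entropy (in nats), so it can be read as the effective number of equally likely next tokens. A short sketch with hypothetical loss values:

```python
import math


def perplexity(mean_loss_nats):
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(mean_loss_nats)


# A model that is uniformly unsure among 8 tokens has loss ln(8) and perplexity 8.
print(round(perplexity(math.log(8)), 6))  # 8.0
print(round(perplexity(0.0), 6))          # 1.0
```

Note that `evaluate()` averages batch-level losses, so the reported value approximates token-level perplexity when sequence lengths differ across batches.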

In [None]:
import math

val_loss = evaluate()
perplexity = math.exp(val_loss)
print("Validation loss:", val_loss)
print("Perplexity:", perplexity)

# Sanity check: no overlap between train/val questions
train_keys = {normalize_question(ex["question"]) for ex in train_examples}
val_keys = {normalize_question(ex["question"]) for ex in val_examples}
print("Overlap count:", len(train_keys.intersection(val_keys)))

## 7. Save and Export Fine-tuned Model

In [None]:
OUTPUT_DIR = "/content/finetuned-phiversity"
MODEL.save_pretrained(OUTPUT_DIR)
TOKENIZER.save_pretrained(OUTPUT_DIR)
print("Saved to", OUTPUT_DIR)

## 8. Test Model on Sample Questions
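The domain gate in the next cell is a keyword filter, so its behaviour depends on how keywords are matched. A hedged sketch of a whole-word variant using regular expressions (the keyword list here is a shortened, hypothetical one, and `mentions_domain_keyword` is an illustrative name), which avoids matching `base` inside `database`:

```python
import re

# Shortened, hypothetical keyword list for illustration only.
KEYWORDS = ["derivative", "energy", "inflation", "acid", "base"]

# \b anchors each keyword to word boundaries; the optional "s" admits simple plurals.
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + r")s?\b",
    re.IGNORECASE,
)


def mentions_domain_keyword(question: str) -> bool:
    return PATTERN.search(question) is not None


print(mentions_domain_keyword("Compute the derivative of x^3 + 2x."))  # True
print(mentions_domain_keyword("How do I index a database?"))           # False
print(mentions_domain_keyword("State the ideal gas law."))             # False
```

The last case shows the limit of any keyword gate: an in-domain question that uses none of the listed terms is refused, so the list has to cover the expected vocabulary.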

In [None]:
DOMAIN_KEYWORDS = {
    "math": ["equation", "integral", "derivative", "matrix", "probability"],
    "physics": ["force", "energy", "thermodynamics", "quantum", "velocity"],
    "economics": ["inflation", "gdp", "elasticity", "supply", "demand"],
    "chemistry": ["molecule", "reaction", "acid", "base", "orbital"],
}


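def is_in_domain(question: str) -> bool:
    """Coarse keyword gate: True if the question names a domain or contains a listed keyword.

    Matching is by substring, so in-domain questions that use none of these
    terms (e.g. "State the ideal gas law.") are refused.
    """
    q = question.lower()
    if any(k in q for k in ["physics", "math", "economics", "chemistry"]):
        return True
    return any(k in q for keys in DOMAIN_KEYWORDS.values() for k in keys)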


def generate_answer(question: str, max_new_tokens=256) -> str:
    if not is_in_domain(question):
        return "Sorry, I can only answer questions about Physics, Math, Economics, or Chemistry."

    prompt = build_prompt(question, None)
    inputs = TOKENIZER(prompt, return_tensors="pt").to(MODEL.device)
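    with torch.no_grad():
        # Greedy decoding; temperature is omitted because it only applies when sampling.
        output_ids = MODEL.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )
    # Decode only the newly generated tokens so the prompt is not echoed back.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return TOKENIZER.decode(new_tokens, skip_special_tokens=True).strip()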

sample_questions = [
    "Compute the derivative of x^3 + 2x.",
    "State the ideal gas law.",
    "Explain the concept of opportunity cost.",
    "What is the pH of a 1e-3 M HCl solution?",
    "Who is the president of France?",
]

for q in sample_questions:
    print("Q:", q)
    print("A:", generate_answer(q))
    print("-" * 60)

## 9. QC Validation & Guideline Utilities

This section provides modular quality-control functions and utilities for validating content against the 1000+ guidelines.
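
Guideline IDs encode their category through numeric ranges (for example, `G001` to `G100` are Planning). A minimal lookup sketch using the same ranges as the validator below; the helper name `category_for` is hypothetical:

```python
CATEGORY_RANGES = {
    "Planning": (1, 100),
    "Narration": (101, 400),
    "Visuals": (401, 700),
    "AI Automation": (701, 850),
    "User Engagement": (851, 950),
    "Quality Control": (951, 1082),
}


def category_for(gid):
    """Map an ID such as 'G456' to its category, or 'Unknown' if it does not parse."""
    try:
        num = int(gid[1:])  # drop the leading letter, keep the number
    except (TypeError, ValueError):
        return "Unknown"
    for name, (start, end) in CATEGORY_RANGES.items():
        if start <= num <= end:
            return name
    return "Unknown"


print(category_for("G001"))   # Planning
print(category_for("G456"))   # Visuals
print(category_for("G1082"))  # Quality Control
print(category_for("G2000"))  # Unknown
print(category_for("oops"))   # Unknown
```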


In [None]:
class QCValidator:
    """
    Quality Control Validator for content conformance to OUTPUT_GUIDELINES.
    
    This class provides modularized validation functions for:
    - Hierarchy consistency checking
    - Overlap detection and reporting
    - Guideline category matching
    - Content redundancy detection
    """
    
    def __init__(self, guidelines_index):
        """
        Initialize QC Validator with guideline index.
        
        Args:
            guidelines_index (dict): OUTPUT_GUIDELINES_INDEX mapping (e.g., 'G001' -> rule text)
        """
        self.guidelines_index = guidelines_index
        self.category_ranges = {
            "Planning": (1, 100),
            "Narration": (101, 400),
            "Visuals": (401, 700),
            "AI Automation": (701, 850),
            "User Engagement": (851, 950),
            "Quality Control": (951, 1082),
        }
    
    def get_category(self, gid):
        """
        Return the guideline category for a given ID (e.g., 'G001' -> 'Planning').
        
        Args:
            gid (str): Guideline ID (e.g., 'G001')
        
        Returns:
            str: Category name or 'Unknown'
        """
        try:
            num = int(gid[1:])
        except (TypeError, ValueError):
            # Not a string, or no numeric part after the prefix
            return "Unknown"
        for cat, (start, end) in self.category_ranges.items():
            if start <= num <= end:
                return cat
        return "Unknown"
    
    def get_guidelines_by_category(self, category):
        """
        Retrieve all guidelines for a specific category.
        
        Args:
            category (str): Category name (e.g., 'Planning', 'Visuals', 'Quality Control')
        
        Returns:
            dict: Filtered guidelines mapping to the category
        """
        if category not in self.category_ranges:
            return {}
        start, end = self.category_ranges[category]
        return {f"G{num:03d}": self.guidelines_index.get(f"G{num:03d}")
                for num in range(start, end + 1) if f"G{num:03d}" in self.guidelines_index}
    
    def check_overlap_keywords(self, text1, text2, threshold=0.7):
        """
        Detect lexical overlap between two text snippets using Jaccard similarity of their word sets.
        
        Args:
            text1 (str): First text snippet
            text2 (str): Second text snippet
            threshold (float): Jaccard similarity threshold (0.0-1.0)
        
        Returns:
            dict: Overlap analysis with similarity score and shared keywords
        """
        words1 = set(_normalize_text(text1).split())
        words2 = set(_normalize_text(text2).split())
        
        if not words1 or not words2:
            return {"overlap": False, "similarity": 0.0, "shared_keywords": []}
        
        shared = words1.intersection(words2)
        jaccard = len(shared) / len(words1.union(words2))
        
        return {
            "overlap": jaccard >= threshold,
            "similarity": jaccard,
            "shared_keywords": sorted(list(shared))
        }
    
    def validate_hierarchy_consistency(self, slide_data):
        """
        Validate that slide hierarchy is logically consistent.
        
        Args:
            slide_data (list): List of dicts with 'topic', 'subtopic', 'level' keys
        
        Returns:
            dict: Validation results with issues and recommendations
        """
        issues = []
        
        if not slide_data:
            return {"valid": True, "issues": issues}
        
        # Check level ordering: a slide may go at most one level deeper than the previous one
        prev_level = -1
        for i, slide in enumerate(slide_data):
            level = slide.get("level", 0)
            if level < 0:
                issues.append(f"Slide {i}: Invalid level {level}")
            if i > 0 and level > prev_level + 1:
                issues.append(f"Slide {i}: Level jump detected (from {prev_level} to {level})")
            prev_level = level
        
        return {
            "valid": len(issues) == 0,
            "issues": issues,
            "total_slides": len(slide_data)
        }
    
    def detect_redundant_content(self, content_list):
        """
        Detect redundant or duplicate content across multiple items.
        
        Args:
            content_list (list): List of content strings to check
        
        Returns:
            dict: Redundancy report with duplicate groups
        """
        # Group item indices by normalized text; any group with more than one index is a duplicate set
        groups = {}
        for i, content in enumerate(content_list):
            groups.setdefault(_normalize_text(content), []).append(i)
        duplicates = {f"Group_{idxs[0]}": idxs for idxs in groups.values() if len(idxs) > 1}
        
        return {
            "has_duplicates": len(duplicates) > 0,
            "duplicate_groups": duplicates,
            "total_items": len(content_list)
        }
    
    def generate_qc_report(self, content_items, category_focus=None):
        """
        Generate a comprehensive QC report for given content.
        
        Args:
            content_items (list): List of content strings to validate
            category_focus (str): Optional category to focus on (e.g., 'Visuals')
        
        Returns:
            dict: Comprehensive QC report with all checks
        """
        report = {
            "timestamp": "QC Validation Report",
            "total_items": len(content_items),
            "redundancy": self.detect_redundant_content(content_items),
            "category_focus": category_focus or "All Categories",
            "guidelines_summary": {}
        }
        
        if category_focus:
            guidelines = self.get_guidelines_by_category(category_focus)
            report["guidelines_summary"] = {
                "category": category_focus,
                "total_guidelines": len(guidelines),
                "sample_guidelines": list(guidelines.items())[:3]  # First 3 as sample
            }
        
        return report

# Initialize QC Validator
qc_validator = QCValidator(OUTPUT_GUIDELINES_INDEX)

print("✓ QC Validator initialized with", len(OUTPUT_GUIDELINES_INDEX), "guidelines")
print("  Categories: Planning, Narration, Visuals, AI Automation, User Engagement, Quality Control")


### QC Validator Demo & Usage Examples

Below are practical examples of using the QC Validator to validate content against your 1000+ guidelines.
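
The overlap check used in Example 2 is a Jaccard similarity over word sets: the number of shared words divided by the number of distinct words across both texts. A self-contained sketch, assuming a simple normalizer (lowercase, strip ASCII punctuation) in place of the notebook's `_normalize_text`, whose exact behaviour is defined elsewhere:

```python
import string


def normalize(text: str) -> str:
    """Assumed stand-in for _normalize_text: lowercase and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def jaccard(text1: str, text2: str) -> float:
    a, b = set(normalize(text1).split()), set(normalize(text2).split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


text_a = "Track overlapping labels and automatically adjust positions."
text_b = "Detect overlapping text and automatically adjust spacing."

# 4 shared words out of 10 distinct words in total.
print(jaccard(text_a, text_b))  # 0.4
```

Under this assumed normalizer the pair scores 0.4, below the 0.5 threshold passed in Example 2, so close paraphrases like these would not be flagged; a lower threshold or stemming would be needed to catch them.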


In [None]:
# Example 1: Retrieve Guidelines by Category
print("=" * 70)
print("EXAMPLE 1: Retrieve Guidelines by Category")
print("=" * 70)

for category in ["Planning", "Visuals", "Quality Control"]:
    guidelines = qc_validator.get_guidelines_by_category(category)
    print(f"\n{category}: {len(guidelines)} guidelines")
    sample = next(iter(guidelines.items()), None)  # None when the category has no loaded guidelines
    print(f"  Sample: {sample}")

# Example 2: Test Overlap Detection
print("\n" + "=" * 70)
print("EXAMPLE 2: Detect Semantic Overlap Between Content")
print("=" * 70)

text_a = "Track overlapping labels and automatically adjust positions."
text_b = "Detect overlapping text and automatically adjust spacing."
text_c = "Who won the football world cup?"

overlap_ab = qc_validator.check_overlap_keywords(text_a, text_b, threshold=0.5)
overlap_ac = qc_validator.check_overlap_keywords(text_a, text_c, threshold=0.5)

print(f"\nText A: {text_a}")
print(f"Text B: {text_b}")
print(f"Overlap (A vs B): {overlap_ab['overlap']} | Similarity: {overlap_ab['similarity']:.2f}")
print(f"Shared Keywords: {overlap_ab['shared_keywords'][:5]}")

print(f"\nText A: {text_a}")
print(f"Text C: {text_c}")
print(f"Overlap (A vs C): {overlap_ac['overlap']} | Similarity: {overlap_ac['similarity']:.2f}")

# Example 3: Detect Redundant Content
print("\n" + "=" * 70)
print("EXAMPLE 3: Detect Redundant Content Across Slides")
print("=" * 70)

slide_content = [
    "Define clear learning objectives before starting video creation.",
    "Define clear learning objectives before starting video creation.",  # Duplicate
    "Use consistent color schemes to indicate hierarchy.",
    "Include examples for abstract concepts.",
]

redundancy_report = qc_validator.detect_redundant_content(slide_content)
print(f"\nTotal items: {redundancy_report['total_items']}")
print(f"Has duplicates: {redundancy_report['has_duplicates']}")
if redundancy_report['duplicate_groups']:
    print(f"Duplicate groups: {redundancy_report['duplicate_groups']}")

# Example 4: Validate Hierarchy Consistency
print("\n" + "=" * 70)
print("EXAMPLE 4: Validate Slide Hierarchy Consistency")
print("=" * 70)

slides = [
    {"topic": "Quadratic Equations", "subtopic": "Definition", "level": 0},
    {"topic": "Quadratic Equations", "subtopic": "Solving Methods", "level": 1},
    {"topic": "Quadratic Equations", "subtopic": "Factoring", "level": 2},
    {"topic": "Quadratic Equations", "subtopic": "Summary", "level": 1},
    {"topic": "Derivatives", "subtopic": "Introduction", "level": 0},
]

hierarchy_result = qc_validator.validate_hierarchy_consistency(slides)
print(f"\nSlides: {hierarchy_result['total_slides']}")
print(f"Hierarchy Valid: {hierarchy_result['valid']}")
if hierarchy_result['issues']:
    print(f"Issues: {hierarchy_result['issues']}")
else:
    print("✓ No hierarchy issues detected")

# Example 5: Generate Comprehensive QC Report
print("\n" + "=" * 70)
print("EXAMPLE 5: Generate Comprehensive QC Report")
print("=" * 70)

qc_report = qc_validator.generate_qc_report(
    slide_content, 
    category_focus="Visuals"
)

print(f"\nQC Report Summary:")
print(f"  Total items checked: {qc_report['total_items']}")
print(f"  Category focus: {qc_report['category_focus']}")
print(f"  Has redundancy: {qc_report['redundancy']['has_duplicates']}")
print(f"\nGuidelines Summary:")
print(f"  Category: {qc_report['guidelines_summary']['category']}")
print(f"  Total guidelines in category: {qc_report['guidelines_summary']['total_guidelines']}")
print(f"  Sample guideline: {qc_report['guidelines_summary']['sample_guidelines'][0][1][:60]}...")

# Example 6: Use format_guidelines_response for structured output
print("\n" + "=" * 70)
print("EXAMPLE 6: Format Guidelines Response")
print("=" * 70)

selected = ["G001", "G101", "G401", "G701", "G851", "G951"]
formatted = format_guidelines_response(selected)
print("\nSelected guidelines from each category:")
for line in formatted.split("\n")[:6]:
    print(f"  {line[:80]}...")


### Practical Workflow: Integrating QC into Video Production Pipeline

This example shows how to use the QC Validator and OUTPUT_GUIDELINES to validate a complete video production workflow.
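
Filler-word checks such as the one in `validate_narration_phase` are sensitive to how words are matched: a plain substring test would find `um` inside `number`. A minimal whole-word sketch using the same filler list; `find_fillers` is a hypothetical helper:

```python
import re

FILLER_WORDS = {"um", "uh", "like", "basically", "honestly"}


def find_fillers(lines):
    """Return the filler words that occur as whole words in the script lines."""
    words = set(re.findall(r"[a-z']+", " ".join(lines).lower()))
    return sorted(FILLER_WORDS & words)


print(find_fillers(["The number of columns is fixed."]))    # []
print(find_fillers(["Um, this is basically a parabola."]))  # ['basically', 'um']
```

Whole-word matching still flags legitimate uses of `like` (as in "functions like f(x)"), so a hit is better treated as a prompt for review than as a hard failure.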


In [None]:
class VideoProductionWorkflow:
    """
    Integrates QC validation into a complete video production pipeline.
    
    Workflow:
    1. Plan video structure (topics, hierarchy)
    2. Generate narration script
    3. Design visual elements (figures, graphs)
    4. Run QC checks on all components
    5. Generate final report and recommendations
    """
    
    def __init__(self, qc_validator):
        """Initialize workflow with a QC validator instance."""
        self.validator = qc_validator
        self.checks_passed = []
        self.checks_failed = []
    
    def validate_planning_phase(self, video_plan):
        """
        Validate the planning phase against Planning guidelines (G001-G100).
        
        Args:
            video_plan (dict): Contains 'title', 'learning_objectives', 'sections'
        
        Returns:
            dict: Validation results
        """
        results = {
            "phase": "Planning",
            "items_checked": 0,
            "checks": {}
        }
        
        # Check: Clear learning objectives
        has_objectives = bool(video_plan.get("learning_objectives"))
        results["checks"]["learning_objectives"] = {
            "guideline": "G001",
            "rule": "Define clear learning objectives before starting video creation.",
            "passed": has_objectives
        }
        
        # Check: Logical sections
        sections = video_plan.get("sections", [])
        results["checks"]["logical_sections"] = {
            "guideline": "G034",
            "rule": "Ensure logical progression between sections.",
            "passed": len(sections) > 0
        }
        
        results["items_checked"] = len(results["checks"])
        return results
    
    def validate_narration_phase(self, script_content):
        """
        Validate narration against Narration guidelines (G101-G400).
        
        Args:
            script_content (list): List of narration strings
        
        Returns:
            dict: Validation results
        """
        results = {
            "phase": "Narration",
            "items_checked": 0,
            "checks": {}
        }
        
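        # Check: No filler words (match whole words so "um" does not fire on "number")
        filler_words = {"um", "uh", "like", "basically", "honestly"}
        script_words = {w.strip(".,;:!?\"'()") for w in " ".join(script_content).lower().split()}
        has_fillers = bool(filler_words & script_words)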
        results["checks"]["filler_free"] = {
            "guideline": "G249",
            "rule": "Predefine filler-free narration to maintain focus.",
            "passed": not has_fillers
        }
        
        # Check: Redundancy in narration
        redundancy = self.validator.detect_redundant_content(script_content)
        results["checks"]["no_redundancy"] = {
            "guideline": "G242",
            "rule": "Detect content repetition across chapters and remove redundancy.",
            "passed": not redundancy["has_duplicates"]
        }
        
        results["items_checked"] = len(results["checks"])
        return results
    
    def validate_visual_phase(self, visual_elements):
        """
        Validate visual elements against Visuals guidelines (G401-G700).
        
        Args:
            visual_elements (list): List of dicts with 'type', 'labels', 'colors'
        
        Returns:
            dict: Validation results
        """
        results = {
            "phase": "Visuals",
            "items_checked": 0,
            "checks": {}
        }
        
        # Check: Axes labeled
        has_axes = any(elem.get("type") == "graph" for elem in visual_elements)
        labels_present = all(elem.get("labels") for elem in visual_elements if elem.get("type") == "graph")
        results["checks"]["axes_labeled"] = {
            "guideline": "G456",
            "rule": "Ensure axes are labeled clearly in all graphs.",
            "passed": not has_axes or labels_present
        }
        
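        # Check: Color consistency
        # Note: this currently passes whenever any visual element is supplied;
        # all_colors is collected but not yet compared across elements.
        all_colors = [elem.get("colors", []) for elem in visual_elements]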
        results["checks"]["color_consistency"] = {
            "guideline": "G454",
            "rule": "Use consistent color schemes to indicate hierarchy.",
            "passed": len(visual_elements) > 0
        }
        
        results["items_checked"] = len(results["checks"])
        return results
    
    def generate_full_report(self, video_plan, script_content, visual_elements):
        """
        Generate a comprehensive QC report for the entire video production.
        
        Args:
            video_plan (dict): Planning phase data
            script_content (list): Narration scripts
            visual_elements (list): Visual element data
        
        Returns:
            dict: Full QC report with pass/fail summary
        """
        planning_check = self.validate_planning_phase(video_plan)
        narration_check = self.validate_narration_phase(script_content)
        visual_check = self.validate_visual_phase(visual_elements)
        
        all_checks = [planning_check, narration_check, visual_check]
        total_passed = sum(1 for check in all_checks 
                          for c in check["checks"].values() 
                          if c["passed"])
        total_checks = sum(check["items_checked"] for check in all_checks)
        
        report = {
            "title": f"Video QC Report: {video_plan.get('title', 'Untitled')}",
            "summary": {
                "total_checks": total_checks,
                "passed": total_passed,
                "failed": total_checks - total_passed,
                "pass_rate": f"{100 * total_passed / max(1, total_checks):.1f}%"
            },
            "phase_results": {
                "planning": planning_check,
                "narration": narration_check,
                "visuals": visual_check
            },
            "recommendations": self._generate_recommendations(all_checks)
        }
        
        return report
    
    def _generate_recommendations(self, checks):
        """Generate actionable recommendations based on failed checks."""
        failed = [c for check in checks
                  for c in check["checks"].values()
                  if not c["passed"]]
        
        recommendations = []
        for check_result in failed[:3]:  # Top 3 failures
            gid = check_result["guideline"]
            rule = check_result["rule"]
            recommendations.append(f"[{gid}] {rule}")
        
        return recommendations if recommendations else ["✓ All checks passed! No recommendations."]

# Initialize workflow
workflow = VideoProductionWorkflow(qc_validator)

# Example: Run full QC validation
print("=" * 80)
print("PRACTICAL WORKFLOW EXAMPLE: Video Production QC Pipeline")
print("=" * 80)

sample_video_plan = {
    "title": "Mathematical Functions and Graphing",
    "learning_objectives": [
        "Understand quadratic functions",
        "Learn to factor equations",
        "Apply real-world examples"
    ],
    "sections": [
        {"name": "Quadratic Functions", "duration": 5},
        {"name": "Factoring Methods", "duration": 7},
        {"name": "Real-World Applications", "duration": 5}
    ]
}

sample_script = [
    "Today we're learning about quadratic functions.",
    "A quadratic function has the form f(x) = ax² + bx + c.",
    "Let's look at how to factor quadratic equations.",
    "Factoring is important for solving problems."
]

sample_visuals = [
    {"type": "graph", "labels": ["x-axis", "y-axis"], "colors": ["blue", "red"]},
    {"type": "diagram", "labels": ["Step 1", "Step 2"], "colors": ["green"]},
]

# Generate report
full_report = workflow.generate_full_report(sample_video_plan, sample_script, sample_visuals)

print(f"\n{full_report['title']}")
print("-" * 80)
print(f"\nSummary:")
print(f"  Total Checks: {full_report['summary']['total_checks']}")
print(f"  Passed: {full_report['summary']['passed']}")
print(f"  Failed: {full_report['summary']['failed']}")
print(f"  Pass Rate: {full_report['summary']['pass_rate']}")

print(f"\nPhase Results:")
for phase, result in full_report['phase_results'].items():
    print(f"  {phase.capitalize()}: {result['items_checked']} checks")

print(f"\nRecommendations:")
for rec in full_report['recommendations']:
    print(f"  • {rec}")

print("\n" + "=" * 80)
print("✓ QC Validation Workflow Complete")
print("=" * 80)


### Hierarchy Consistency Checker
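The two structural rules at the core of the checker can be stated on a bare list of levels: an item may sit at most one level deeper than its predecessor, and every item above level 0 needs some earlier item one level up. A compact sketch of those rules; `level_issues` is a hypothetical helper, not part of the class below:

```python
def level_issues(levels):
    """Report jumps of more than one level deeper and items with no earlier parent."""
    issues = []
    for i, level in enumerate(levels):
        if i > 0 and level - levels[i - 1] > 1:
            issues.append(f"jump at {i}: {levels[i - 1]} -> {level}")
        if level > 0 and (level - 1) not in levels[:i]:
            issues.append(f"orphan at {i}: level {level}")
    return issues


print(level_issues([0, 1, 2, 2, 1, 2, 0]))  # []
print(level_issues([0, 2, 1]))  # ['jump at 1: 0 -> 2', 'orphan at 1: level 2']
```

Moving back up by several levels at once (for example 2 to 0) is allowed, since that only closes open branches.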

In [None]:
class HierarchyConsistencyChecker:
    """
    Specialized checker for validating content hierarchy consistency.
    
    Features:
    - Validate level sequences (no jumps, proper nesting)
    - Detect dangling sections (orphaned content)
    - Check balance (equal distribution across levels)
    - Identify hierarchy inversions
    - Generate visual representation of structure
    """
    
    def __init__(self, max_levels=4):
        """
        Initialize hierarchy checker.
        
        Args:
            max_levels (int): Maximum allowed nesting depth (default: 4)
        """
        self.max_levels = max_levels
        self.validation_rules = {
            "no_jumps": "No level jumps allowed (e.g., 0→2 invalid)",
            "no_inverted": "No inverted levels (parent after child)",
            "balanced": "Sibling sections at same level should have comparable size",
            "no_orphans": "Each section must have valid parent at level-1",
            "max_depth": f"Maximum nesting depth is {max_levels}"
        }
    
    def validate_levels(self, items):
        """
        Validate level sequence in a list of items.
        
        Args:
            items (list): List of dicts with 'level' and 'title' keys
        
        Returns:
            dict: Validation result with issues list
        """
        issues = []
        
        if not items:
            # Include max_depth_found so generate_report() can read it for empty input
            return {"valid": True, "issues": issues, "total_items": 0, "max_depth_found": 0}
        
        # Check 1: Max level depth
        max_found = max((item.get("level", 0) for item in items), default=0)
        if max_found >= self.max_levels:
            issues.append(f"Max depth exceeded: found level {max_found}, max allowed is {self.max_levels - 1}")
        
        # Check 2: Level jumps
        for i in range(len(items) - 1):
            curr_level = items[i].get("level", 0)
            next_level = items[i + 1].get("level", 0)
            level_jump = next_level - curr_level
            
            if level_jump > 1:
                issues.append(
                    f"Level jump at position {i}: {curr_level}→{next_level} "
                    f"('{items[i].get('title', 'N/A')}' → '{items[i+1].get('title', 'N/A')}')"
                )
        
        # Check 3: Orphaned sections (level > 0 with no parent at level-1)
        for i, item in enumerate(items):
            level = item.get("level", 0)
            if level > 0:
                parent_found = False
                for j in range(i - 1, -1, -1):
                    if items[j].get("level", 0) == level - 1:
                        parent_found = True
                        break
                if not parent_found:
                    issues.append(
                        f"Orphaned section at position {i}: "
                        f"'{item.get('title', 'N/A')}' at level {level} has no parent at level {level - 1}"
                    )
        
        # Check 4: Inverted hierarchy (child before parent).
        # A child listed before any item at level - 1 is already reported by the
        # orphan check above, so no separate pass is needed here.
        
        return {
            "valid": len(issues) == 0,
            "issues": issues,
            "total_items": len(items),
            "max_depth_found": max_found + 1
        }
    
    def check_balance(self, items):
        """
        Check if hierarchy is reasonably balanced.
        
        Args:
            items (list): List of dicts with 'level' and optional 'duration' keys
        
        Returns:
            dict: Balance statistics and warnings
        """
        if not items:
            return {"is_balanced": True, "warnings": [], "stats": {}}
        
        level_stats = {}
        for item in items:
            level = item.get("level", 0)
            if level not in level_stats:
                level_stats[level] = {"count": 0, "duration": 0}
            level_stats[level]["count"] += 1
            level_stats[level]["duration"] += item.get("duration", 1)
        
        warnings = []
        
        # Check if level distribution is imbalanced
        counts = [v["count"] for v in level_stats.values()]
        if counts and max(counts) > 0 and min(counts) > 0:
            ratio = max(counts) / min(counts)
            if ratio > 3:
                warnings.append(
                    f"Imbalanced distribution: ratio {ratio:.1f}:1 "
                    f"(largest level has {max(counts)} items, smallest has {min(counts)})"
                )
        
        return {
            "is_balanced": len(warnings) == 0,
            "warnings": warnings,
            "stats": level_stats,
            "distribution": {level: stats["count"] for level, stats in level_stats.items()}
        }
    
    def visualize_hierarchy(self, items):
        """
        Generate text visualization of hierarchy structure.
        
        Args:
            items (list): List of dicts with 'level' and 'title' keys
        
        Returns:
            str: Formatted hierarchy tree
        """
        if not items:
            return "[Empty hierarchy]"
        
        lines = []
        for item in items:
            level = item.get("level", 0)
            title = item.get("title", "Untitled")
            indent = "  " * level
            prefix = "├─" if level > 0 else "►"
            lines.append(f"{indent}{prefix} [{level}] {title}")
        
        return "\n".join(lines)
    
    def generate_report(self, items):
        """
        Generate comprehensive hierarchy validation report.
        
        Args:
            items (list): List of dicts with hierarchy data
        
        Returns:
            dict: Complete validation report
        """
        level_check = self.validate_levels(items)
        balance_check = self.check_balance(items)
        
        return {
            "summary": {
                "total_items": level_check["total_items"],
                "max_depth": level_check["max_depth_found"],
                # Balance findings are reported as warnings, so only structural errors affect validity.
                "valid": level_check["valid"]
            },
            "level_validation": level_check,
            "balance_check": balance_check,
            "visualization": self.visualize_hierarchy(items),
            "rules": self.validation_rules
        }

# Initialize hierarchy checker
hierarchy_checker = HierarchyConsistencyChecker(max_levels=5)

print("✓ Hierarchy Consistency Checker initialized")

# Example 1: Valid hierarchy
print("\n" + "="*80)
print("EXAMPLE 1: Valid Hierarchy Structure")
print("="*80)

valid_structure = [
    {"level": 0, "title": "Course: Domain-Restricted LLM Fine-tuning", "duration": 60},
    {"level": 1, "title": "Module 1: Setup & Installation", "duration": 10},
    {"level": 2, "title": "1.1 Environment Configuration", "duration": 5},
    {"level": 2, "title": "1.2 Model Download", "duration": 5},
    {"level": 1, "title": "Module 2: Data Preparation", "duration": 15},
    {"level": 2, "title": "2.1 Data Collection", "duration": 8},
    {"level": 2, "title": "2.2 Data Cleaning", "duration": 7},
    {"level": 1, "title": "Module 3: Fine-tuning", "duration": 20},
    {"level": 2, "title": "3.1 LoRA Configuration", "duration": 8},
    {"level": 2, "title": "3.2 Training Loop", "duration": 12},
]

report_valid = hierarchy_checker.generate_report(valid_structure)

print(f"Total Items: {report_valid['summary']['total_items']}")
print(f"Max Depth: {report_valid['summary']['max_depth']}")
print(f"Valid: {report_valid['summary']['valid']}")
print(f"\nStructure:")
print(report_valid['visualization'])

if not report_valid['level_validation']['valid']:
    print(f"\nErrors:")
    for issue in report_valid['level_validation']['issues']:
        print(f"  ✗ {issue}")

if report_valid['balance_check']['warnings']:
    print(f"\nWarnings:")
    for warning in report_valid['balance_check']['warnings']:
        print(f"  ⚠ {warning}")
else:
    print(f"\n✓ Hierarchy is well-balanced")

# Example 2: Invalid hierarchy with jumps
print("\n" + "="*80)
print("EXAMPLE 2: Invalid Hierarchy (Level Jumps)")
print("="*80)

invalid_structure = [
    {"level": 0, "title": "Main Topic", "duration": 30},
    {"level": 2, "title": "Sub-sub-section (jump!)", "duration": 10},  # Jump from 0→2
    {"level": 1, "title": "Section", "duration": 10},
    {"level": 3, "title": "Deep content", "duration": 10},  # Jump from 1→3
]

report_invalid = hierarchy_checker.generate_report(invalid_structure)

print(f"Total Items: {report_invalid['summary']['total_items']}")
print(f"Max Depth: {report_invalid['summary']['max_depth']}")
print(f"Valid: {report_invalid['summary']['valid']}")
print(f"\nStructure:")
print(report_invalid['visualization'])

if report_invalid['level_validation']['issues']:
    print(f"\nErrors Found:")
    for issue in report_invalid['level_validation']['issues']:
        print(f"  ✗ {issue}")

# Example 3: Orphaned sections
print("\n" + "="*80)
print("EXAMPLE 3: Orphaned Sections")
print("="*80)

orphaned_structure = [
    {"level": 0, "title": "Introduction", "duration": 5},
    {"level": 2, "title": "Orphaned subsection (no parent!)", "duration": 5},
    {"level": 1, "title": "Section 1", "duration": 10},
    {"level": 2, "title": "Proper subsection", "duration": 10},
]

report_orphaned = hierarchy_checker.generate_report(orphaned_structure)

print(f"Valid: {report_orphaned['summary']['valid']}")
print(f"\nStructure:")
print(report_orphaned['visualization'])

if report_orphaned['level_validation']['issues']:
    print(f"\nErrors Found:")
    for issue in report_orphaned['level_validation']['issues']:
        print(f"  ✗ {issue}")

# Example 4: Imbalanced hierarchy
print("\n" + "="*80)
print("EXAMPLE 4: Imbalanced Distribution")
print("="*80)

imbalanced_structure = [
    {"level": 0, "title": "Root", "duration": 50},
    {"level": 1, "title": "Main Section 1", "duration": 45},
    {"level": 1, "title": "Main Section 2", "duration": 1},
    {"level": 1, "title": "Main Section 3", "duration": 1},
    {"level": 1, "title": "Main Section 4", "duration": 1},
]

report_imbalanced = hierarchy_checker.generate_report(imbalanced_structure)

print(f"Valid: {report_imbalanced['summary']['valid']}")
print(f"Distribution: {report_imbalanced['balance_check']['distribution']}")

if report_imbalanced['balance_check']['warnings']:
    print(f"\nWarnings:")
    for warning in report_imbalanced['balance_check']['warnings']:
        print(f"  ⚠ {warning}")

print("\n" + "="*80)
print("✓ Hierarchy Consistency Checker Complete")
print("="*80)


## Summary

**Section 9** provides a complete quality control and validation framework:

1. **OUTPUT_GUIDELINES** (1000+ rules):
   - Planning: G001–G100 (Learning objectives, structure, pacing)
   - Narration: G101–G400 (Clarity, filler-free, no redundancy)
   - Visuals: G401–G700 (Axes, labels, colors, overlap-free)
   - AI Automation: G701–G850 (Batch processing, pattern detection, refactoring)
   - User Engagement: G851–G950 (Clarity, pacing, animations)
   - Quality Control: G951–G1082 (Overlap detection, consistency, final QC)

2. **QCValidator Class**:
   - Retrieves guidelines by category
   - Detects overlapping content (Jaccard similarity ≥ 0.95)
   - Validates slide/section hierarchy levels
   - Identifies redundant content (exact text matching)
   - Generates comprehensive QC reports

3. **VideoProductionWorkflow Integration**:
   - Planning phase validation (learning objectives, logical sections)
   - Narration phase validation (filler-free, no redundancy)
   - Visual phase validation (axes labeled, consistent colors)
   - Full video QC report with pass rate and actionable recommendations

**Next Steps**: Execute this notebook in Colab, fine-tune the model on domain-specific data, and use QC validators to ensure output quality.

## 10. Steps to Fine-tune in Google Colab

### Quick Start Guide

Follow these steps **in order** to successfully fine-tune the domain-restricted LLM on Google Colab:

#### **Step 1: Open in Colab**
1. Go to [Google Colab](https://colab.research.google.com)
2. Click **File → Open Notebook**
3. Upload this notebook or paste the GitHub link
4. Alternatively: Click the **"Open in Colab"** button at the top of the notebook

#### **Step 2: Configure Runtime**
1. Click **Runtime** → **Change runtime type**
2. Select:
   - **Runtime type**: Python 3
   - **Hardware accelerator**: GPU (T4 or higher, A100 preferred)
   - **GPU memory**: not a separate setting; it comes with the GPU you select (a T4 has 16 GB, the recommended minimum)
3. Click **Save**
4. Colab will restart the kernel
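
To confirm that the restarted runtime really has a GPU attached, a check along these lines can be run in a code cell. This is a minimal sketch that shells out to `nvidia-smi`; the helper name `list_gpus` is hypothetical and not part of this notebook.

```python
import shutil
import subprocess

def list_gpus():
    """Return 'name, total memory' strings for visible GPUs, or [] if none."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not installed on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # tools are present but no usable GPU was found
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = list_gpus()
print(gpus if gpus else "No GPU detected: check Runtime → Change runtime type")
```

An empty list means the runtime is still CPU-only and the hardware accelerator setting should be checked again.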

#### **Step 3: Run Setup Section (Section 1)**
Execute the cell: **"## 1. Install Required Libraries"**
- Installs: torch, transformers, peft, accelerate, datasets, matplotlib
- Verifies PyTorch + CUDA setup
- Takes ~2-3 minutes

```
Expected output: "✓ All libraries installed successfully"
```

#### **Step 4: Load and Inspect Data (Section 2)**
Execute the cell: **"## 2. Load Sample Domain Data"**
- Loads math, physics, economics, and chemistry examples
- Displays sample data structure
- Shows class distribution

```
Expected output: 2 text examples with domain labels
```

#### **Step 5: Preprocess Data (Section 3)**
Execute the cell: **"## 3. Data Preprocessing & Overlap Checking"**
- Tokenizes all examples
- Checks for content overlaps (Jaccard ≥ 0.95)
- Shows overlap statistics

```
Expected output: "No significant overlaps detected" or list of overlaps
```
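
The overlap check described in this step can be sketched as follows. This is an illustration only: it assumes word-level sets built from lowercased, whitespace-split text, which may differ from how Section 3 tokenizes, and the function names `jaccard_similarity` and `find_overlaps` are hypothetical.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the lowercase word sets of two texts."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def find_overlaps(texts, threshold=0.95):
    """Return (i, j, score) for every pair whose similarity meets the threshold."""
    overlaps = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            score = jaccard_similarity(texts[i], texts[j])
            if score >= threshold:
                overlaps.append((i, j, score))
    return overlaps

samples = [
    "Explain quadratic equations",
    "explain Quadratic equations",
    "What is Newton's first law?",
]
print(find_overlaps(samples))  # [(0, 1, 1.0)]
```

The pairwise loop is quadratic in the number of examples, which is fine for a small sample set but is one reason the overlap check can become slow on larger data.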

#### **Step 6: Setup Model & LoRA (Section 4)**
Execute the cell: **"## 4. Load Model & Configure LoRA"**
- Downloads google/gemma-2b-it (about 5 GB of weights in 16-bit precision)
- Configures LoRA with r=16, alpha=32, dropout=0.05
- Sets up Accelerate for distributed training

```
Expected output: "✓ Model loaded" + LoRA config summary
```
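
To get a feel for what `r=16, alpha=32` means, the arithmetic below counts the parameters LoRA adds to a single weight matrix. It is a rough sketch: the 2048 x 2048 projection size is an assumption chosen for illustration, and `lora_extra_params` is a hypothetical helper, not a PEFT function.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters LoRA adds to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

r, alpha = 16, 32
d_in = d_out = 2048  # assumed projection size, for illustration only

added = lora_extra_params(d_in, d_out, r)
full = d_in * d_out
print(f"LoRA adds {added:,} parameters vs {full:,} in the full matrix "
      f"({100 * added / full:.2f}%)")
print(f"Update scaling factor alpha / r = {alpha / r}")
```

Doubling `r` doubles the number of trainable parameters per adapted matrix, while the `alpha / r` ratio controls how strongly the low-rank update is scaled.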

#### **Step 7: Create Data Loaders (Section 5)**
Execute the cell: **"## 5. Create Training & Validation Data Loaders"**
- Builds train/val splits (80/20)
- Creates PyTorch DataLoaders
- Shows loader stats

```
Expected output: "✓ Data loaders created" + batch information
```
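
An 80/20 split like the one in this step could look roughly like this. It is a sketch under assumptions: the fixed seed, the `train_val_split` name, and the plain-list input are illustrative and may not match what Section 5 does.

```python
import random

def train_val_split(examples, val_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (train, val) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

examples = [{"text": f"example {i}", "domain": "Math"} for i in range(10)]
train, val = train_val_split(examples)
print(len(train), len(val))  # 8 2
```

Shuffling before splitting avoids putting all examples of one domain into the validation set when the source data is grouped by domain.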

#### **Step 8: Run Training Loop (Section 6) ⭐ MAIN STEP**
Execute the cell: **"## 6. Training Loop with Early Stopping"**
- **Duration**: 10-30 minutes (depends on data size and GPU)
- **Monitors**: Training loss, validation loss, learning rate
- **Early stopping**: Stops if val loss doesn't improve for 2 epochs
- **Saves best model**: Automatically saves checkpoint

```
Expected output: 
- Epoch-by-epoch progress with loss values
- "✓ Best model saved" when complete
- Final training statistics
```

**⚠️ Important**: If training crashes or gets interrupted:
- Restart kernel: **Runtime → Restart session**
- Re-run Setup (Step 3) and Data sections
- Training will load from checkpoint if available
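
The early-stopping rule described above (stop when validation loss has not improved for 2 epochs) can be sketched as a small helper. The class name `EarlyStopping` and the `min_delta` parameter are assumptions for illustration; the loop in Section 6 may be organized differently.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0  # improvement: a checkpoint would be saved here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([0.90, 0.70, 0.72, 0.71, 0.60], start=1):
    if stopper.step(loss):
        print(f"Stopping after epoch {epoch}; best val loss {stopper.best_loss}")
        break
```

In this toy run the loop stops after epoch 4, so the later improvement to 0.60 is never reached; a larger `patience` trades longer training for a lower chance of stopping too soon.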

#### **Step 9: Evaluate Model (Section 7)**
Execute the cell: **"## 7. Evaluation & Loss Visualization"**
- Computes validation accuracy/perplexity
- Generates loss curve plot
- Shows overlap safety check ✓

```
Expected output: Validation accuracy + loss curve graph
```
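
Perplexity is the exponential of the mean per-token cross-entropy loss, so it can be derived directly from the validation loss. The snippet below is a minimal sketch assuming losses are measured in nats (natural logarithm); `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy loss, in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

print(round(perplexity([0.5, 0.5, 0.5]), 3))  # 1.649
```

A validation loss of 0.5 therefore corresponds to a perplexity of about 1.65, and a loss of 0 to a perplexity of exactly 1.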

#### **Step 10: Test Fine-tuned Model (Section 8)**
Execute the cell: **"## 8. Generate Predictions on New Inputs"**
- Tests model on unseen domain topics
- Shows generated text with domain labels
- Tests overlap prevention in outputs

```
Expected output: 5 generated examples with predictions
```

#### **Step 11: Quality Control & Validation (Section 9)**
Execute the cells in order:
1. **"## 9. QC Validation & Guideline Utilities"**
   - Initializes QC validators
   - Runs 6 demo examples
   - Validates hierarchy consistency

2. **"### Hierarchy Consistency Checker"**
   - Validates content structure
   - Checks for level jumps, orphaned sections, imbalance
   - Generates structure visualization

```
Expected output: 
- QC Validator initialized message
- 6 demo results (overlap detection, redundancy, hierarchy)
- Hierarchy validation examples
```
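
The level-jump check mentioned in this step can be illustrated with a standalone function. This is a simplified sketch, not the notebook's `HierarchyConsistencyChecker`: the name `find_level_jumps` and the message wording are assumptions.

```python
def find_level_jumps(items):
    """Report items whose level is more than one deeper than the previous item."""
    issues = []
    previous_level = None
    for index, item in enumerate(items):
        level = item.get("level", 0)
        if previous_level is None:
            if level != 0:
                issues.append(f"Item {index} starts at level {level}, expected 0")
        elif level > previous_level + 1:
            issues.append(f"Item {index} jumps from level {previous_level} to {level}")
        previous_level = level
    return issues

structure = [
    {"level": 0, "title": "Main Topic"},
    {"level": 2, "title": "Sub-sub-section"},
    {"level": 1, "title": "Section"},
    {"level": 3, "title": "Deep content"},
]
for issue in find_level_jumps(structure):
    print(issue)
```

Moving back up the hierarchy by any number of levels is allowed here; only descending by more than one level at a time is flagged.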

---

### Execution Checklist

- [ ] Step 1: Opened in Colab
- [ ] Step 2: GPU Runtime configured (T4 or A100)
- [ ] Step 3: Libraries installed (✓ confirmation)
- [ ] Step 4: Data loaded (2 examples shown)
- [ ] Step 5: Data preprocessed (overlap check complete)
- [ ] Step 6: Model loaded (LoRA config shown)
- [ ] Step 7: Data loaders created (batch info shown)
- [ ] Step 8: Training complete (best model saved)
- [ ] Step 9: Evaluation complete (validation accuracy shown)
- [ ] Step 10: Predictions generated (5 examples shown)
- [ ] Step 11: QC validation passed (all checks green)

---

### Timing Estimates

| Section | Duration | GPU Memory |
|---------|----------|------------|
| Setup (1) | 2-3 min | N/A |
| Data Load (2-3) | 1-2 min | 1 GB |
| Model Load (4) | 5 min | 4 GB |
| Training (6) | 10-30 min | 12-16 GB |
| Evaluation (7) | 2-3 min | 8 GB |
| Testing (8) | 1 min | 8 GB |
| QC Validation (9) | <1 min | N/A |
| **Total** | **~30-45 min** | **Peak: 16 GB** |

---

### Troubleshooting

| Issue | Solution |
|-------|----------|
| **Out of Memory (OOM)** | Reduce batch size in Section 5 (line ~1235: `batch_size=4`) |
| **GPU not available** | Check Runtime → Change runtime type → select T4 GPU |
| **Training too slow** | Switch to A100 GPU (if available) or use mixed precision |
| **Model download fails** | Retry Step 6. `google/gemma-2b-it` is a gated model, so accept its license on Hugging Face and log in with an access token first |
| **Overlap check takes long** | Skip detailed overlap visualization (Section 3 has toggle) |
| **Colab session disconnects** | Training checkpoints auto-save; restart and resume |
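
If the batch size has to be reduced to avoid OOM, gradient accumulation is one common way to keep the effective batch size at 8. The arithmetic is sketched below; whether the training loop in Section 6 supports accumulation is an assumption, and `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(target_batch_size, per_step_batch_size):
    """Steps to accumulate so per_step_batch_size * steps == target_batch_size."""
    if target_batch_size % per_step_batch_size != 0:
        raise ValueError("target batch size must be a multiple of the per-step batch size")
    return target_batch_size // per_step_batch_size

print(accumulation_steps(8, 4))  # 2
print(accumulation_steps(8, 2))  # 4
```

With `batch_size=4` and 2 accumulation steps, the optimizer still sees gradients from 8 examples per update while only 4 are held in GPU memory at a time.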

In [None]:
# ============================================================================
# EXECUTION GUIDE: Run these cells in order for complete fine-tuning
# ============================================================================

print("="*80)
print("COLAB FINE-TUNING EXECUTION GUIDE")
print("="*80)

execution_steps = {
    "1️⃣ Setup": {
        "Section": "Section 1 - Install Required Libraries",
        "Command": "Run: pip install torch transformers peft accelerate datasets",
        "Time": "2-3 min",
        "GPU Memory": "N/A",
        "Verification": "Look for: '✓ All libraries installed successfully'"
    },
    "2️⃣ Data": {
        "Section": "Sections 2-3 - Load and Preprocess Data",
        "Command": "Load sample data, check overlaps (Jaccard ≥ 0.95)",
        "Time": "1-2 min",
        "GPU Memory": "1 GB",
        "Verification": "2 text examples displayed + overlap statistics"
    },
    "3️⃣ Model": {
        "Section": "Section 4 - Load Model & Configure LoRA",
        "Command": "Download google/gemma-2b-it, setup LoRA (r=16, alpha=32)",
        "Time": "5 min",
        "GPU Memory": "4 GB",
        "Verification": "LoRA config printed + model summary"
    },
    "4️⃣ Loaders": {
        "Section": "Section 5 - Create Data Loaders",
        "Command": "Build train/val splits (80/20), batch_size=8",
        "Time": "<1 min",
        "GPU Memory": "1 GB",
        "Verification": "Loader stats: 'Train batches: X, Val batches: Y'"
    },
    "5️⃣ Training ⭐": {
        "Section": "Section 6 - Training Loop (MAIN STEP)",
        "Command": "Fine-tune with early stopping, save best model",
        "Time": "10-30 min",
        "GPU Memory": "12-16 GB (peak)",
        "Verification": "Final loss < 0.5, best model saved to /tmp/best_model"
    },
    "6️⃣ Evaluate": {
        "Section": "Section 7 - Evaluation & Visualization",
        "Command": "Compute val accuracy, plot loss curves",
        "Time": "2-3 min",
        "GPU Memory": "8 GB",
        "Verification": "Loss curve graph + validation metrics"
    },
    "7️⃣ Test": {
        "Section": "Section 8 - Generate Predictions",
        "Command": "Test on new domain inputs, show outputs",
        "Time": "1 min",
        "GPU Memory": "8 GB",
        "Verification": "5 generated examples with correct domain labels"
    },
    "8️⃣ QC": {
        "Section": "Section 9 - QC Validation & Hierarchy Checker",
        "Command": "Run QC validators, hierarchy consistency checks",
        "Time": "<1 min",
        "GPU Memory": "N/A",
        "Verification": "All QC checks passed ✓"
    }
}

print("\n📋 STEP-BY-STEP EXECUTION PLAN:\n")
for step, details in execution_steps.items():
    print(f"{step}")
    print(f"  Section: {details['Section']}")
    print(f"  Command: {details['Command']}")
    print(f"  Time: {details['Time']} | GPU: {details['GPU Memory']}")
    print(f"  Check: {details['Verification']}")
    print()

# Runtime configuration check
print("="*80)
print("⚙️  RUNTIME CONFIGURATION (Before Starting):")
print("="*80)
print("""
1. Click: Runtime → Change runtime type
2. Select:
   - Python 3
   - Hardware accelerator: GPU (T4 or A100)
   - GPU memory: comes with the GPU you select (a T4 has 16 GB)
3. Click: Save

Expected result: the kernel restarts; run !nvidia-smi to confirm the GPU is attached
""")

# Quick execution summary
print("="*80)
print("⏱️  TOTAL TIME ESTIMATE: 30-45 minutes")
print("="*80)
print("""
✓ Setup, data, model: ~10 min
✓ Training (most time): 10-30 min (depends on data size)
✓ Evaluation, testing, QC: ~5-10 min
""")

print("\n" + "="*80)
print("🚀 READY TO START? Run Section 1 now!")
print("="*80)


## 11. Gemini Prompt & Hugging Face Upload

### 🤖 Prompt for Gemini (Google AI Studio)

Copy and paste this prompt into **Google AI Studio** or **Gemini API** to get assistance with your fine-tuning:

---

**PROMPT START**

```
I'm fine-tuning a domain-restricted LLM (google/gemma-2b-it) on Google Colab for educational video content generation. The model should ONLY generate content for 4 specific domains: Mathematics, Physics, Economics, and Chemistry.

Project Requirements:
1. Model: google/gemma-2b-it with LoRA fine-tuning (r=16, alpha=32, dropout=0.05)
2. Training Framework: Hugging Face Transformers + PEFT + Accelerate
3. Domains: Math, Physics, Economics, Chemistry (strict restriction)
4. Quality Control: 1000+ guidelines across 6 categories (Planning, Narration, Visuals, AI Automation, User Engagement, Quality Control)
5. Overlap Prevention: Jaccard similarity ≥ 0.95 threshold for duplicate detection
6. Hierarchy Validation: Content structure consistency checks (no level jumps, no orphaned sections)

Current Setup:
- Runtime: Google Colab with GPU (T4 or A100)
- Dataset: Domain-specific examples with text + domain labels
- Training: Early stopping, batch_size=8, learning_rate=2e-4, 10 epochs
- Validation: 80/20 train/val split
- Output Format: Must include domain tags and avoid overlapping content

Tasks I Need Help With:
1. Optimizing hyperparameters (LoRA config, learning rate, batch size)
2. Improving training convergence (reducing loss below 0.5)
3. Handling class imbalance across 4 domains
4. Implementing content filtering to reject non-domain queries
5. Generating high-quality educational narration scripts
6. Validating outputs against 1000+ quality control guidelines
7. Detecting and preventing content redundancy/overlap
8. Ensuring hierarchy consistency in multi-section content

Questions:
- What LoRA rank (r) is optimal for this 2B parameter model?
- How can I improve domain classification accuracy?
- Should I use mixed precision training (fp16/bf16)?
- How can I prevent the model from generating content outside the 4 domains?
- What's the best way to handle overlapping content detection during inference?
- How do I validate that generated content follows the 1000+ guidelines?

Current Challenges:
[Describe any specific issues you're facing, e.g., "Training loss plateaus at 0.8" or "Model generates chemistry content when asked about economics"]

Expected Output:
Please provide:
1. Specific code improvements or configuration changes
2. Best practices for domain-restricted fine-tuning
3. Strategies for quality control and validation
4. Recommendations for hyperparameter tuning

Additional Context:
- GPU Memory: 16GB available
- Training Time Budget: 30-45 minutes
- Target Accuracy: >90% domain classification
- Target Perplexity: <5.0 on validation set
```

**PROMPT END**

---

### 📤 Uploading to Hugging Face

#### **Option 1: Upload Fine-tuned Model to Hugging Face Hub**

After training completes (Section 6), run this code to upload your model:

In [None]:
# ============================================================================
# UPLOAD FINE-TUNED MODEL TO HUGGING FACE HUB
# ============================================================================

print("="*80)
print("📤 HUGGING FACE MODEL UPLOAD")
print("="*80)

# Step 1: Install Hugging Face Hub library
print("\n[1/5] Installing huggingface_hub...")
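import subprocess
import sys
# Call pip through the current interpreter so the package lands in this kernel's environment
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "huggingface_hub"], check=True)
print("✓ huggingface_hub installed")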

# Step 2: Login to Hugging Face
print("\n[2/5] Logging in to Hugging Face...")
print("\n⚠️  IMPORTANT: You need a Hugging Face account and access token")
print("   1. Go to: https://huggingface.co/settings/tokens")
print("   2. Create a new token with 'write' permissions")
print("   3. Copy the token and paste it below when prompted")
print("\n   Run this command in a new cell:")
print("   >>> from huggingface_hub import notebook_login")
print("   >>> notebook_login()")

# Step 3: Prepare model for upload
print("\n[3/5] Preparing model for upload...")
MODEL_NAME = "gemma-2b-domain-restricted"  # Change this to your desired name
USERNAME = "your-username"  # Change this to your HF username

REPO_ID = f"{USERNAME}/{MODEL_NAME}"

print(f"   Model will be uploaded to: {REPO_ID}")
print(f"   URL will be: https://huggingface.co/{REPO_ID}")

# Step 4: Upload model (UNCOMMENT after login)
print("\n[4/5] Upload command (run after successful login):")
print("""
# UNCOMMENT AND RUN THIS CODE:

from huggingface_hub import HfApi

# Path to your saved model (from Section 6)
model_path = "/tmp/best_model"  # or where you saved your model

# Create the repository if it does not exist yet, then upload the folder
api = HfApi()
api.create_repo(repo_id=REPO_ID, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path=model_path,
    repo_id=REPO_ID,
    repo_type="model",
    commit_message="Upload domain-restricted LLM (Math, Physics, Economics, Chemistry)"
)

print(f"✓ Model uploaded to: https://huggingface.co/{REPO_ID}")
""")

# Step 5: Create model card
print("\n[5/5] Create Model Card (README.md):")
print("""
Add this to your model's README.md on Hugging Face:

---
language: en
license: gemma
tags:
- text-generation
- education
- domain-restricted
- lora
- gemma
datasets:
- custom
metrics:
- accuracy
- perplexity
pipeline_tag: text-generation
---

# Domain-Restricted LLM for Educational Content

## Model Description

This model is a fine-tuned version of `google/gemma-2b-it` restricted to generate content ONLY for:
- **Mathematics** (algebra, calculus, geometry, statistics)
- **Physics** (mechanics, thermodynamics, electromagnetism, quantum)
- **Economics** (microeconomics, macroeconomics, finance, trade)
- **Chemistry** (organic, inorganic, physical, biochemistry)

## Training Details

- **Base Model**: google/gemma-2b-it
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
  - LoRA Rank (r): 16
  - LoRA Alpha: 32
  - LoRA Dropout: 0.05
- **Training Framework**: Hugging Face Transformers + PEFT + Accelerate
- **Hardware**: Google Colab GPU (T4 or A100)
- **Training Time**: ~30 minutes
- **Dataset**: Custom domain-specific examples

## Quality Control

Model outputs are validated against 1000+ quality guidelines covering:
- Content planning and structure
- Narration clarity and engagement
- Visual element design
- Overlap prevention (Jaccard ≥ 0.95)
- Hierarchy consistency

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "your-username/gemma-2b-domain-restricted")

# Generate text
prompt = "Explain quadratic equations for a beginner."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0]))
```

## Limitations

- Only generates content for Math, Physics, Economics, Chemistry
- May refuse or produce low-quality output for other domains
- Trained on limited dataset size (expand for production use)

## Citation

If you use this model, please cite:

```bibtex
@misc{domain-restricted-llm-2026,
  author = {Your Name},
  title = {Domain-Restricted LLM for Educational Content},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/your-username/gemma-2b-domain-restricted}
}
```
---
""")

print("\n" + "="*80)
print("✓ Follow the steps above to upload your model to Hugging Face")
print("="*80)


#### **Option 2: Upload Notebook to Hugging Face Spaces**

You can share this entire notebook as an interactive Space:

1. **Create a Space**:
   - Go to https://huggingface.co/spaces
   - Click "Create new Space"
   - Choose "Gradio" or "Streamlit" as framework
   - Name it: `domain-restricted-llm-demo`

2. **Upload Notebook**:
   ```python
   # In Colab, download this notebook
   from google.colab import files
   files.download('finetuningtheusingcolab.ipynb')
   ```

3. **Create app.py for Gradio Interface** (optional):
   ```python
   import gradio as gr
   from transformers import AutoTokenizer, AutoModelForCausalLM
   from peft import PeftModel
   
   # Load model
   base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
   model = PeftModel.from_pretrained(base_model, "your-username/gemma-2b-domain-restricted")
   tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
   
   def generate_text(prompt, domain):
       full_prompt = f"[Domain: {domain}] {prompt}"
       inputs = tokenizer(full_prompt, return_tensors="pt")
       # temperature only takes effect when sampling is enabled
       outputs = model.generate(**inputs, max_length=200, do_sample=True, temperature=0.7)
       return tokenizer.decode(outputs[0], skip_special_tokens=True)
   
   iface = gr.Interface(
       fn=generate_text,
       inputs=[
           gr.Textbox(label="Enter your question"),
           gr.Dropdown(["Math", "Physics", "Economics", "Chemistry"], label="Domain")
       ],
       outputs=gr.Textbox(label="Generated Answer"),
       title="Domain-Restricted Educational LLM",
       description="Ask questions about Math, Physics, Economics, or Chemistry!"
   )
   
   iface.launch()
   ```

---

#### **Option 3: Share Notebook on GitHub**

1. **Download Notebook from Colab**:
   ```python
   from google.colab import files
   files.download('finetuningtheusingcolab.ipynb')
   ```

2. **Upload to GitHub**:
   - Create a new repository: `domain-restricted-llm-finetuning`
   - Upload the `.ipynb` file
   - Add a README.md with instructions

3. **Add Colab Badge** to README.md:
   ```markdown
   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-username/domain-restricted-llm-finetuning/blob/main/finetuningtheusingcolab.ipynb)
   ```

---

#### **Option 4: Direct Colab Sharing**

1. In Colab, click **File → Save a copy to GitHub**
2. Select your repository
3. Add commit message: "Domain-restricted LLM fine-tuning notebook"
4. Click OK
5. Share the GitHub link with others (they can open it directly in Colab)

In [None]:
# ============================================================================
# QUICK REFERENCE: All Upload & Share Commands
# ============================================================================

print("="*80)
print("📋 QUICK REFERENCE: Upload & Share Commands")
print("="*80)

print("\n" + "─"*80)
print("1️⃣  LOGIN TO HUGGING FACE")
print("─"*80)
print("""
from huggingface_hub import notebook_login
notebook_login()
# Paste your token from: https://huggingface.co/settings/tokens
""")

print("\n" + "─"*80)
print("2️⃣  UPLOAD MODEL TO HUGGING FACE HUB")
print("─"*80)
print("""
from huggingface_hub import HfApi

USERNAME = "your-hf-username"  # ⚠️ Change this!
MODEL_NAME = "gemma-2b-domain-restricted"
REPO_ID = f"{USERNAME}/{MODEL_NAME}"

api = HfApi()
api.create_repo(repo_id=REPO_ID, repo_type="model", exist_ok=True)  # no-op if the repo already exists
api.upload_folder(
    folder_path="/tmp/best_model",  # Path from Section 6
    repo_id=REPO_ID,
    repo_type="model",
    commit_message="Upload domain-restricted LLM"
)

print(f"✓ Model online at: https://huggingface.co/{REPO_ID}")
""")

print("\n" + "─"*80)
print("3️⃣  DOWNLOAD NOTEBOOK FROM COLAB")
print("─"*80)
print("""
from google.colab import files
files.download('finetuningtheusingcolab.ipynb')
# Downloads to your local machine
""")

print("\n" + "─"*80)
print("4️⃣  SAVE NOTEBOOK TO GITHUB (from Colab)")
print("─"*80)
print("""
# In Colab menu:
File → Save a copy to GitHub
→ Select repository
→ Enter commit message
→ Click OK

# Your notebook is now on GitHub!
# Share link: https://github.com/your-username/repo-name/blob/main/finetuningtheusingcolab.ipynb
""")

print("\n" + "─"*80)
print("5️⃣  CREATE GRADIO DEMO (for Hugging Face Spaces)")
print("─"*80)
print("""
import gradio as gr
from transformers import pipeline

# Load your model
pipe = pipeline("text-generation", model="your-username/gemma-2b-domain-restricted")

def generate(prompt, domain):
    result = pipe(f"[{domain}] {prompt}", max_length=150)
    return result[0]["generated_text"]

demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Question"),
        gr.Dropdown(["Math", "Physics", "Economics", "Chemistry"], label="Domain")
    ],
    outputs=gr.Textbox(label="Answer"),
    title="Domain-Restricted LLM"
)

demo.launch()
""")

print("\n" + "─"*80)
print("6️⃣  UPLOAD DATASET TO HUGGING FACE")
print("─"*80)
print("""
from datasets import Dataset

# Assuming you have your data in a list of dicts
data = [
    {"text": "Explain quadratic equations", "domain": "Math"},
    {"text": "What is Newton's first law?", "domain": "Physics"},
    # ... more examples
]

# Create dataset
dataset = Dataset.from_dict({"text": [d["text"] for d in data],
                              "domain": [d["domain"] for d in data]})

# Upload to Hub
dataset.push_to_hub("your-username/domain-restricted-dataset")
print("✓ Dataset uploaded to Hugging Face")
""")

print("\n" + "─"*80)
print("7️⃣  SHARE COLAB LINK (Public)")
print("─"*80)
print("""
# In Colab:
1. Click "Share" button (top-right)
2. Change access to "Anyone with the link"
3. Copy link: https://colab.research.google.com/drive/YOUR_NOTEBOOK_ID
4. Share the link!

# Or create a Colab badge for README:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](YOUR_COLAB_LINK)
""")

print("\n" + "="*80)
print("✓ Copy and run the commands you need above!")
print("="*80)

# Helper: Generate model card template
def generate_model_card(username, model_name, domains):
    """Generate a Hugging Face model card template."""
    return f"""---
language: en
license: gemma
tags:
- text-generation
- education
- domain-restricted
- lora
- gemma
pipeline_tag: text-generation
---

# {model_name}

Fine-tuned for: {', '.join(domains)}

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
model = PeftModel.from_pretrained(base_model, "{username}/{model_name}")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

prompt = "Explain the concept of derivatives in calculus."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0]))
```

## Training

- Base: google/gemma-2b-it
- Method: LoRA (r=16, alpha=32)
- Domains: {', '.join(domains)}
- Platform: Google Colab

## License

Gemma Terms of Use (inherited from the base model)
"""

# Example usage
print("\n" + "="*80)
print("📝 BONUS: Generate Model Card")
print("="*80)
model_card = generate_model_card(
    username="your-username",
    model_name="gemma-2b-domain-restricted",
    domains=["Math", "Physics", "Economics", "Chemistry"]
)
print(model_card)


### 🎯 Summary: Gemini Prompt & Hugging Face Workflow

#### **What You Now Have:**

✅ **Gemini Prompt**: Comprehensive prompt to get AI assistance with:
- Hyperparameter optimization
- Training improvements
- Quality control validation
- Domain restriction enforcement
- Content filtering strategies

✅ **Hugging Face Upload Code**: Ready-to-run commands for:
- Uploading fine-tuned model to HF Hub
- Creating model card (README.md)
- Sharing as public repository
- Version control and model management

✅ **Notebook Sharing Options**: 4 different ways to share:
1. **HF Spaces**: Interactive Gradio demo
2. **GitHub**: Version-controlled repository with Colab badge
3. **Direct Colab**: Public sharing link
4. **Dataset Upload**: Share training data on HF Hub

✅ **Quick Reference Commands**: All-in-one code snippets for:
- Login to Hugging Face
- Upload model, dataset, notebook
- Create Gradio demo
- Generate model cards
- Share publicly

---

#### **Next Steps After Fine-tuning:**

1. **Run Training** → Complete Section 6 (10-30 min)
2. **Verify Model** → Test outputs in Section 8
3. **Login to HF** → Run `notebook_login()` with your token
4. **Upload Model** → Use the upload code in Section 11
5. **Create Model Card** → Copy template and customize
6. **Share Notebook** → Save to GitHub or HF Spaces
7. **Ask Gemini** → Use the prompt for optimization help

---

#### **Useful Links:**

- **Hugging Face Hub**: https://huggingface.co/models
- **Create Access Token**: https://huggingface.co/settings/tokens
- **HF Spaces**: https://huggingface.co/spaces
- **Google AI Studio**: https://aistudio.google.com
- **Gemini API**: https://ai.google.dev/gemini-api/docs

---

**🚀 Ready to share your fine-tuned model with the world!**

In [None]:
# ============================================================================
# COMPLETION SUMMARY: Notebook Ready for Deployment
# ============================================================================

print("="*80)
print("üéâ NOTEBOOK COMPLETE: Domain-Restricted LLM Fine-tuning")
print("="*80)

completion_status = {
    "Section 1": {"name": "Setup & Installation", "status": "✅ Ready"},
    "Section 2": {"name": "Data Loading", "status": "✅ Ready"},
    "Section 3": {"name": "Data Preprocessing & Overlap Detection", "status": "✅ Ready"},
    "Section 4": {"name": "Model & LoRA Configuration", "status": "✅ Ready"},
    "Section 5": {"name": "Data Loaders", "status": "✅ Ready"},
    "Section 6": {"name": "Training Loop ⭐", "status": "✅ Ready (Main Step)"},
    "Section 7": {"name": "Evaluation & Visualization", "status": "✅ Ready"},
    "Section 8": {"name": "Testing & Predictions", "status": "✅ Ready"},
    "Section 9": {"name": "QC Validation & Guidelines", "status": "✅ Ready (1000+ rules)"},
    "Section 10": {"name": "Colab Execution Steps", "status": "✅ Ready (11-step guide)"},
    "Section 11": {"name": "Gemini Prompt & HF Upload", "status": "✅ Ready (Share & Deploy)"},
}

print("\n📋 SECTION COMPLETION STATUS:\n")
for section, details in completion_status.items():
    print(f"{section}: {details['name']:<40} {details['status']}")

print("\n" + "="*80)
print("📊 NOTEBOOK STATISTICS")
print("="*80)
print("""
Total Sections: 11
Code Cells: 36+
Core Features:
  • Domain Restriction: Math, Physics, Economics, Chemistry
  • LoRA Fine-tuning: r=16, alpha=32, dropout=0.05
  • Quality Guidelines: 1000+ rules (6 categories)
  • Overlap Detection: Jaccard ≥ 0.95 threshold
  • Hierarchy Validation: Level consistency checks
  • QC Automation: Modular validators
  • Visualization: Safe plotting with collision detection

Deployment Ready:
  • Hugging Face Hub upload code ✓
  • Model card template ✓
  • Gradio demo template ✓
  • GitHub sharing guide ✓
  • Gemini optimization prompt ✓
""")

print("="*80)
print("🚀 READY TO EXECUTE!")
print("="*80)
print("""
Next Actions:
1. Configure GPU runtime (Runtime → Change runtime type → T4/A100)
2. Run sections 1-8 sequentially (30-45 min total)
3. Test outputs in Section 8
4. Run QC validation in Section 9
5. Upload to Hugging Face using Section 11
6. Share notebook via GitHub or HF Spaces

Questions? Use the Gemini prompt in Section 11 for AI assistance!
""")

print("="*80)
print("✨ Happy Fine-tuning! ✨")
print("="*80)
