# 🚀 ULTRATHINK Perfect Training - Google Colab

## ✨ What's New in This Configuration?

This notebook uses the **PERFECT** training configuration that fixes:
- ✅ **Routing Collapse** (Entropy 0.52 → 0.8-1.2)
- ✅ **Expert Imbalance** (Max Expert 100% → 50-70%)
- ✅ **High Auxiliary Loss** (8.0 → 2.0-4.0)
- ✅ **Slow Convergence** (Better perplexity by step 200)

### 🎯 Key Improvements:
- **MoE Top-K**: 1 → **2** (prevents single expert dominance)
- **Load Balance Weight**: 0.01 → **0.1** (10x stronger)
- **Z-Loss Weight**: 0.001 → **0.0001** (10x weaker)
- **Expert Capacity**: 1.0 → **1.5** (50% headroom before tokens overflow)
- **Effective Batch Size**: 16 → **64** (4x larger)
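
The batch-size and capacity figures above come down to simple arithmetic. The sketch below assumes the usual definitions: effective batch size is the per-step batch multiplied by the accumulation steps, and per-expert capacity follows a GShard/Switch-style rule (capacity factor × routed token slots ÷ number of experts). The function names are illustrative, and treating all 8 experts as a single routing pool is an assumption; the trainer's internal capacity formula may differ.

```python
import math

def effective_batch_size(batch_size: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return batch_size * grad_accum_steps * num_devices

def expert_capacity(tokens_per_batch: int, num_experts: int, top_k: int, capacity_factor: float) -> int:
    """Token slots per expert under an assumed GShard/Switch-style capacity rule."""
    return math.ceil(capacity_factor * tokens_per_batch * top_k / num_experts)

# Values taken from the full training cell: batch_size=2, accumulation=32,
# max_seq_length=256, 4 + 2 + 1 + 1 = 8 experts, top-2 routing, capacity 1.5.
print(effective_batch_size(2, 32))               # 2 * 32 = 64 sequences per update

micro_batch_tokens = 2 * 256                     # tokens in one forward pass
print(expert_capacity(micro_batch_tokens, 8, 2, 1.5))  # 1.5 * 512 * 2 / 8 = 192
```

With a capacity factor of 1.0 the same call yields 128 slots per expert, so 1.5 gives each expert room for 50% more tokens before any are dropped or rerouted.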

---

## 📋 Setup Instructions

1. **Runtime**: Go to `Runtime` → `Change runtime type` → Select `GPU` (T4 recommended)
2. **Upload Project**: Upload the ULTRATHINK project folder or clone from GitHub
3. **Run Cells**: Execute cells in order

---

In [None]:
# Check GPU availability
!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Optional: mount Google Drive (uncomment the lines below if needed)
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/path/to/UltraThinking-LLM-Training

In [None]:
# Clone the repository into /content (update with your repo URL).
# If you uploaded the project folder to /content instead, skip the clone line.
%cd /content
!git clone https://github.com/vediyappanm/UltraThinking-LLM-Training.git

# Enter the project directory (absolute path, so re-running the cell is safe)
%cd /content/UltraThinking-LLM-Training

# Verify we're in the right directory
!ls -la train_ultrathink.py

In [None]:
# Upgrade packaging tools first so the installs below use the newer pip
!pip install -q --upgrade pip setuptools wheel

# Install project dependencies
!pip install -q -r requirements.txt

# Install additional packages for Colab
!pip install -q transformers datasets accelerate

print("✓ Dependency installation finished (check the output above for errors)")

## 🎯 Perfect Training Configuration

### Expected Results:

| Metric | Before | After (Step 50-100) | Meaning |
|--------|--------|---------------------|----------|
| **Entropy** | 0.52 | 0.8-1.2 | More uniform expert selection |
| **Max Expert %** | 100% | 50-70% | No single expert dominates |
| **Aux Loss** | 8.0-8.5 | 2.0-4.0 | Routing regularization working |
| **Perplexity** | 85k → 30k | <5k by step 200 | Faster learning |
| **Loss** | 11.3 → 10.3 | <8.0 by step 200 | Better optimization |
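
The loss and perplexity rows are two views of the same quantity. Assuming the reported loss is the mean token-level cross-entropy in nats (the usual PyTorch convention, not confirmed for this trainer), perplexity is simply `exp(loss)`. A minimal sketch with illustrative function names:

```python
import math

def perplexity_from_loss(loss_nats: float) -> float:
    """Perplexity implied by a mean cross-entropy loss measured in nats."""
    return math.exp(loss_nats)

def loss_from_perplexity(perplexity: float) -> float:
    """Inverse mapping: the loss that corresponds to a given perplexity."""
    return math.log(perplexity)

# Loss values quoted in the table above
for loss in (11.3, 10.3, 8.0):
    print(f"loss {loss:>4.1f} -> perplexity {perplexity_from_loss(loss):,.0f}")

# Loss level that corresponds to the 5k perplexity target
print(f"perplexity 5000 -> loss {loss_from_perplexity(5000):.2f}")
```

Under this assumption, losses of 11.3 and 10.3 correspond to perplexities of roughly 81k and 30k, in line with the "Before" column. A loss below 8.0 implies a perplexity below about 3k, so the loss target is the stricter of the two step-200 goals.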

---

In [None]:
# ============================================================================
# PERFECT TRAINING CONFIGURATION
# ============================================================================
# This configuration fixes routing collapse and achieves optimal performance
# ============================================================================

!python train_ultrathink.py \
  --vocab_size 50257 \
  --hidden_size 512 \
  --num_layers 6 \
  --num_heads 8 \
  --num_kv_heads 4 \
  --intermediate_size 2048 \
  --max_seq_length 256 \
  --activation swiglu \
  --enable_moe \
  --num_knowledge_experts 4 \
  --num_skill_experts 2 \
  --num_meta_experts 1 \
  --num_safety_experts 1 \
  --moe_top_k 2 \
  --expert_capacity 1.5 \
  --load_balance_weight 0.1 \
  --z_loss_weight 0.0001 \
  --importance_weight 0.05 \
  --batch_size 2 \
  --gradient_accumulation_steps 32 \
  --learning_rate 0.0001 \
  --weight_decay 0.1 \
  --adam_beta1 0.9 \
  --adam_beta2 0.999 \
  --warmup_steps 1000 \
  --max_steps 100000 \
  --num_epochs 1 \
  --gradient_clipping 0.5 \
  --dropout 0.15 \
  --attention_dropout 0.15 \
  --gradient_checkpointing \
  --use_amp \
  --amp_warmup_steps 500 \
  --enable_dre \
  --dre_warmup_steps 1000 \
  --dataset c4 \
  --dataset_subset en \
  --tokenizer_name gpt2 \
  --streaming \
  --train_samples 10000 \
  --val_samples 1000 \
  --num_workers 2 \
  --use_mlflow \
  --mlflow_tracking_uri "file:./mlruns" \
  --mlflow_experiment "UltraThinking-LLM-Training" \
  --run_name "ultrathink_colab_perfect_v2" \
  --perf_log_interval 5 \
  --eval_frequency 50 \
  --output_dir "./outputs/ultrathink_colab_perfect"

## 🧪 Quick Test Run (Optional)

Run this quick 100-step test before the full training cell above to verify that everything works.

---

In [None]:
# Quick test run (100 steps, ~2-3 minutes)
!python train_ultrathink.py \
  --vocab_size 50257 \
  --hidden_size 256 \
  --num_layers 2 \
  --num_heads 4 \
  --num_kv_heads 2 \
  --intermediate_size 1024 \
  --max_seq_length 128 \
  --enable_moe \
  --num_knowledge_experts 2 \
  --num_skill_experts 1 \
  --num_meta_experts 1 \
  --num_safety_experts 1 \
  --moe_top_k 2 \
  --expert_capacity 2.0 \
  --load_balance_weight 0.2 \
  --z_loss_weight 0.00001 \
  --batch_size 1 \
  --gradient_accumulation_steps 8 \
  --learning_rate 0.0001 \
  --warmup_steps 50 \
  --max_steps 100 \
  --num_epochs 1 \
  --dataset dummy \
  --train_samples 100 \
  --val_samples 20 \
  --eval_frequency 50 \
  --run_name "ultrathink_quick_test" \
  --output_dir "./outputs/ultrathink_quick_test"

print("\n✓ Quick test completed! Check the metrics above.")
print("If everything looks good, run the full training cell.")

## 📊 Monitoring & Metrics

### What to Watch For:

#### ✅ Good Signs (by step 50):
- Entropy increases from 0.52 → 0.7+
- Max expert drops from 100% → 60-70%
- Auxiliary loss drops from 8.0 → 3-5
- Loss decreases steadily

#### ⚠️ Warning Signs:
- Entropy stuck at 0.52 → Increase load_balance_weight
- Max expert still 100% → Increase expert_capacity
- Aux loss still >7.0 → Decrease z_loss_weight

#### 🛑 Critical Issues:
- NaN/Inf losses → Disable AMP temporarily
- OOM errors → Reduce batch_size and raise gradient_accumulation_steps to keep the effective batch size unchanged
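
The entropy and max-expert figures summarise how routed tokens are spread across experts. The sketch below assumes entropy means the Shannon entropy (natural log) of each expert's share of routed tokens; the trainer may compute it from router probabilities instead, and the token counts here are made up for illustration.

```python
import math
from typing import Sequence, Tuple

def routing_stats(tokens_per_expert: Sequence[int]) -> Tuple[float, float]:
    """Return (entropy in nats, largest expert share) for per-expert token counts."""
    total = sum(tokens_per_expert)
    if total == 0:
        raise ValueError("no tokens were routed")
    shares = [count / total for count in tokens_per_expert]
    # Experts with zero tokens contribute nothing to the entropy.
    entropy = sum(-p * math.log(p) for p in shares if p > 0)
    return entropy, max(shares)

# Hypothetical counts: full collapse, a partially balanced router, a uniform router
for label, counts in [
    ("collapsed", [1000, 0, 0, 0, 0, 0, 0, 0]),
    ("partial", [600, 250, 100, 50, 0, 0, 0, 0]),
    ("uniform", [125] * 8),
]:
    entropy, max_share = routing_stats(counts)
    print(f"{label:>9}: entropy={entropy:.2f}, max expert={max_share:.0%}")
```

Under this definition a fully collapsed router has entropy 0 and a perfectly uniform router over 8 experts reaches ln 8 ≈ 2.08, so the target range above still leaves room for further balancing.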

---

In [None]:
# View recent training logs
!tail -n 50 outputs/ultrathink_colab_perfect/training.log 2>/dev/null || echo "No logs yet"

In [None]:
# Start MLflow UI (optional - runs in background)
# Note: In Colab, you'll need ngrok or a similar tool to expose the port

# Install ngrok for port forwarding
!pip install -q pyngrok

import subprocess
from pyngrok import ngrok

# ngrok may require an account authtoken; if so, uncomment and paste yours:
# ngrok.set_auth_token("YOUR_NGROK_AUTHTOKEN")

# Start the MLflow UI as a background process (Popen returns immediately)
mlflow_proc = subprocess.Popen(
    ["mlflow", "ui", "--backend-store-uri", "./mlruns", "--port", "5000"]
)

# Create an ngrok tunnel to the MLflow port
tunnel = ngrok.connect(5000)
print(f"\n✓ MLflow UI available at: {tunnel.public_url}")
print("Open the link above to view training metrics in real time.")

## 💾 Checkpoints & Model Export

Download trained models and checkpoints to your local machine or Google Drive.

---

In [None]:
# List available checkpoints
!ls -lh outputs/ultrathink_colab_perfect/*.pt 2>/dev/null || echo "No checkpoints yet"
!ls -lh outputs/ultrathink_colab_perfect/final_model/ 2>/dev/null || echo "No final model yet"

In [None]:
# Download model to local machine
from google.colab import files
import shutil
import os

# Create a zip file of the final model
if os.path.exists("outputs/ultrathink_colab_perfect/final_model"):
    shutil.make_archive("ultrathink_final_model", "zip", "outputs/ultrathink_colab_perfect/final_model")
    print("✓ Model archived as ultrathink_final_model.zip")
    
    # Download (this may take a while for large models)
    # files.download("ultrathink_final_model.zip")
    print("\nTo download, uncomment the files.download() line above.")
else:
    print("No final model found yet. Training may still be in progress.")

In [None]:
# Save to Google Drive (if mounted)
# Uncomment and modify path as needed

# from google.colab import drive
# drive.mount('/content/drive')

# import shutil
# shutil.copytree(
#     "outputs/ultrathink_colab_perfect",
#     "/content/drive/MyDrive/ULTRATHINK_Models/ultrathink_colab_perfect",
#     dirs_exist_ok=True
# )
# print("✓ Model saved to Google Drive!")

## 🎮 Quick Inference Test

Test your trained model with sample text generation.

---

In [None]:
# Quick inference test
!python scripts/inference.py \
  --model_path outputs/ultrathink_colab_perfect/final_model \
  --prompt "The future of artificial intelligence is" \
  --max_length 100 \
  --temperature 0.8 \
  --top_p 0.9 2>/dev/null || echo "Inference script not available or model not ready"

## 🔧 Troubleshooting

### Common Issues:

| Issue | Solution |
|-------|----------|
| **OOM Error** | Reduce `--batch_size` to 1 and raise `--gradient_accumulation_steps` to 64 to keep the effective batch size at 64 |
| **NaN Losses** | Remove `--use_amp` or increase `--amp_warmup_steps` |
| **Slow Training** | Reduce `--num_workers` to 0 for streaming datasets |
| **Routing Collapse** | Increase `--load_balance_weight` to 0.15 or 0.2 |
| **High Aux Loss** | Decrease `--z_loss_weight` to 0.00005 |

### Need Help?
- Check the [Training Config Guide](Training%20congig.md)
- Review logs in `outputs/ultrathink_colab_perfect/training.log`
- Open an issue on GitHub

---

## 📚 Additional Resources

- **Documentation**: See `README.md` and `ADVANCED_TRAINING_GUIDE.md`
- **Architecture**: See `ARCHITECTURE_OVERVIEW.md`
- **Training Config**: See `Training congig.md`

---

## 🎉 Success Criteria

Your training is successful when:

**By Step 50:**
- ✓ Entropy > 0.7
- ✓ Max expert < 70%
- ✓ Aux loss < 5.0

**By Step 200:**
- ✓ Loss < 8.0
- ✓ Perplexity < 5,000
- ✓ All experts showing 5-40% usage

**By Step 1000:**
- ✓ Loss < 6.0
- ✓ Perplexity < 1,000
- ✓ Stable, consistent improvement
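
The numeric thresholds above can be checked mechanically once you have the metrics for a milestone step. The sketch below is one way to do that; the metric names (`entropy`, `max_expert`, `aux_loss`, `loss`, `perplexity`) are hypothetical keys, not necessarily what the trainer logs, and the qualitative criteria (per-expert usage range, stable improvement) are left out.

```python
from typing import Dict, List

# Thresholds copied from the success criteria; "<name>_min" means the metric
# must be above the limit, "<name>_max" means it must be below it.
MILESTONES: Dict[int, Dict[str, float]] = {
    50: {"entropy_min": 0.7, "max_expert_max": 0.70, "aux_loss_max": 5.0},
    200: {"loss_max": 8.0, "perplexity_max": 5000.0},
    1000: {"loss_max": 6.0, "perplexity_max": 1000.0},
}

def failed_criteria(step: int, metrics: Dict[str, float]) -> List[str]:
    """List the criteria for milestone `step` that `metrics` does not satisfy."""
    failures = []
    for bound, limit in MILESTONES[step].items():
        name, kind = bound.rsplit("_", 1)  # e.g. "max_expert_max" -> ("max_expert", "max")
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: not reported")
        elif kind == "min" and not value > limit:
            failures.append(f"{name}={value} should be > {limit}")
        elif kind == "max" and not value < limit:
            failures.append(f"{name}={value} should be < {limit}")
    return failures

# Hypothetical readings at step 50: one collapsed run and one healthy run
print(failed_criteria(50, {"entropy": 0.52, "max_expert": 1.0, "aux_loss": 8.2}))
print(failed_criteria(50, {"entropy": 0.9, "max_expert": 0.62, "aux_loss": 3.1}))
```

An empty list means every encoded threshold for that milestone is met; any entries name the metric to investigate using the troubleshooting table.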

---

**Good luck with your training! 🚀**