# UltraThink — Google Colab Training

This notebook trains the UltraThink model with two ready-made configurations:
- Local/Colab GPU (16GB+ VRAM)
- High-End GPU (32GB+ VRAM, e.g., V100/A100)

Loss, perplexity, and tokens/sec are logged to the console. MLflow tracking is optional.

In [None]:
#@title Check GPU
!nvidia-smi || echo 'No NVIDIA GPU detected'
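
To decide which of the two configurations fits the attached GPU, the `nvidia-smi` output can be read programmatically. The following is a minimal sketch, not part of the project: `pick_config`, `query_vram_mib`, and the two thresholds are hypothetical names and values chosen for illustration. The thresholds sit slightly below the nominal 16 GB and 32 GB tiers because `nvidia-smi` usually reports a little less than the marketed size.

```python
import shutil
import subprocess

# Thresholds in MiB. Assumption: set slightly below the nominal 16 GB / 32 GB
# tiers named in this notebook, since reported memory is usually a bit lower.
LOCAL_MIN_MIB = 15_000
HIGH_END_MIN_MIB = 31_000


def pick_config(total_mib: int) -> str:
    """Map total VRAM (MiB) to one of the notebook's two configurations."""
    if total_mib >= HIGH_END_MIN_MIB:
        return "high-end"
    if total_mib >= LOCAL_MIN_MIB:
        return "local"
    return "too-small"


def query_vram_mib() -> list[int]:
    """Return total memory per visible GPU in MiB, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [int(line) for line in result.stdout.splitlines() if line.strip()]


for index, mib in enumerate(query_vram_mib()):
    print(f"GPU {index}: {mib} MiB -> {pick_config(mib)}")
```

If nothing is printed, no NVIDIA GPU was found; switch the Colab runtime type to a GPU before continuing.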

## Project Setup
Upload your project folder (this repo) into Colab's working directory (`/content`).
If it lives in Google Drive, mount Drive and `cd` into the project folder.

In [None]:
#@title (Optional) Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# After mounting, you can: %cd /content/drive/MyDrive/path/to/your/project
# Or if you uploaded directly to /content, cd there instead
%cd /content

In [None]:
#@title Change directory to your project root (edit if needed)
# If you uploaded or cloned the repository under a different folder name, edit the path below.
%cd /content/UltraThinking-LLM-Training
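
A wrong working directory is an easy mistake here, and it only surfaces later when `pip` or the training script cannot find its files. A small check like the one below can catch it early. This is a sketch: `missing_files` is a hypothetical helper, and the two file names are the ones the later cells in this notebook refer to.

```python
from pathlib import Path

# Files the later cells rely on (both names appear in this notebook).
REQUIRED_FILES = ("train_ultrathink.py", "requirements.txt")


def missing_files(root: Path) -> list[str]:
    """Return the required files that are absent from root."""
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_files(Path.cwd())
if missing:
    print(f"Not in the project root? Missing: {', '.join(missing)}")
else:
    print(f"Project root looks right: {Path.cwd()}")
```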

In [None]:
#@title Install dependencies
# Upgrade the packaging tools first so the requirements resolve against current versions
!pip install -q --upgrade pip setuptools wheel
# Then install the project's requirements; pip builds wheels from source where needed
!pip install -q -r requirements.txt
# Optional: Deepspeed is heavy; skip on Colab unless required
# !pip install -q deepspeed

### MLflow (Optional)
This project logs to a local MLflow file store (`file:./mlruns`). To view runs, start the MLflow UI as a background process in Colab and open it through Colab's port proxy.

In [None]:
#@title Train - Local/Colab GPU (16GB+ VRAM)
!python train_ultrathink.py \
  --dataset c4 --dataset_subset en --streaming \
  --tokenizer_name gpt2 --vocab_size 50257 \
  --hidden_size 512 --num_layers 6 --num_heads 8 --num_kv_heads 4 \
  --intermediate_size 2048 --max_seq_length 256 \
  --activation swiglu \
  --dropout 0.1 --attention_dropout 0.1 \
  --enable_moe \
  --num_knowledge_experts 4 --num_skill_experts 2 --num_meta_experts 1 --num_safety_experts 1 \
  --moe_top_k 1 --expert_capacity 1.0 \
  --enable_dre --dre_warmup_steps 500 \
  --amp_warmup_steps 500 \
  --batch_size 1 --gradient_accumulation_steps 16 \
  --learning_rate 3e-4 --weight_decay 0.1 \
  --adam_beta1 0.9 --adam_beta2 0.999 \
  --warmup_steps 500 --num_epochs 1 \
  --gradient_clipping 1.0 \
  --use_amp --gradient_checkpointing \
  --eval_frequency 100 --num_workers 2 \
  --train_samples 2000 --val_samples 200 \
  --perf_log_interval 10 \
  --use_mlflow --run_name ultrathink_debug_loss \
  --output_dir ./outputs/ultrathink_debug_loss
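
It can help to work out how many steps this command actually runs before relying on the warmup setting. The sketch below uses the values from the command above and makes two assumptions that the training script may not share: `--train_samples` caps the samples seen per epoch, and `--warmup_steps` is counted in optimizer steps rather than micro-batches. `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(train_samples: int, batch_size: int, grad_accum: int) -> tuple[int, int]:
    """Return (micro_batches, optimizer_steps) for one pass over the data.

    Assumes a final partial accumulation window still triggers an optimizer step.
    """
    micro_batches = math.ceil(train_samples / batch_size)
    optimizer_steps = math.ceil(micro_batches / grad_accum)
    return micro_batches, optimizer_steps


# Values copied from the local/Colab command above.
micro, optim = steps_per_epoch(train_samples=2000, batch_size=1, grad_accum=16)
warmup_steps = 500

print(f"micro-batches per epoch:   {micro}")
print(f"optimizer steps per epoch: {optim}")
# Assumption: warmup is counted in optimizer steps; the script may count differently.
if warmup_steps > optim:
    print(f"warmup ({warmup_steps}) would not finish within one epoch ({optim} optimizer steps)")
```

Under those assumptions the run ends while the learning rate is still ramping up, which is acceptable for a smoke test but worth knowing when reading the loss curve.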

In [None]:
#@title Train — High-End GPU (32GB+ VRAM: V100/A100)
!python train_ultrathink.py \
  --dataset c4 --dataset_subset en --streaming \
  --tokenizer_name gpt2 --vocab_size 50257 \
  --hidden_size 2048 --num_layers 24 --num_heads 16 --num_kv_heads 8 \
  --intermediate_size 8192 --max_seq_length 2048 \
  --activation swiglu \
  --dropout 0.05 --attention_dropout 0.05 \
  --enable_moe \
  --num_knowledge_experts 32 --num_skill_experts 16 \
  --num_meta_experts 8 --num_safety_experts 4 \
  --moe_top_k 2 --expert_capacity 1.25 \
  --enable_dre --dre_warmup_steps 2000 \
  --enable_constitutional \
  --enable_multimodal \
  --amp_warmup_steps 1000 \
  --batch_size 4 --gradient_accumulation_steps 32 \
  --learning_rate 1e-4 --weight_decay 0.01 \
  --adam_beta1 0.9 --adam_beta2 0.999 \
  --warmup_steps 10000 --num_epochs 3 \
  --gradient_clipping 1.0 \
  --use_amp --gradient_checkpointing --use_flash_attention \
  --eval_frequency 2 \
  --use_mlflow --run_name ultrathink_large_complete \
  --output_dir ./outputs/ultrathink_large_complete
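
The two commands differ considerably in how much data feeds each optimizer step. The sketch below compares them using only the `--batch_size`, `--gradient_accumulation_steps`, and `--max_seq_length` values shown in this notebook. `BatchConfig` is a hypothetical helper, and the token count is an upper bound that assumes a single GPU and sequences filled to the maximum length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchConfig:
    name: str
    batch_size: int
    grad_accum: int
    max_seq_length: int

    @property
    def effective_batch(self) -> int:
        # Sequences contributing to each optimizer step on a single GPU.
        return self.batch_size * self.grad_accum

    @property
    def max_tokens_per_step(self) -> int:
        # Upper bound: assumes every sequence is filled to max_seq_length.
        return self.effective_batch * self.max_seq_length


configs = [
    BatchConfig("local", batch_size=1, grad_accum=16, max_seq_length=256),
    BatchConfig("high-end", batch_size=4, grad_accum=32, max_seq_length=2048),
]
for cfg in configs:
    print(
        f"{cfg.name}: {cfg.effective_batch} sequences, "
        f"up to {cfg.max_tokens_per_step} tokens per optimizer step"
    )
```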

## Monitoring
Console logs will include:
- **[step]** loss, ppl, toks/s every N steps (N is set by `--perf_log_interval`)
- **[train]** epoch avg_loss and avg_ppl
- **[val]** avg_loss and avg_ppl
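
For reading these numbers, the usual relationship between the logged quantities is sketched below. It assumes the loss is a mean per-token cross-entropy in nats, which is the common convention but is not confirmed by this notebook. The function names and sample values are illustrative and are not results from this project.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(mean_loss)


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput over one logging window."""
    return num_tokens / elapsed_seconds


# Illustrative numbers only.
print(f"ppl at loss 3.0: {perplexity(3.0):.2f}")
print(f"toks/s for 4096 tokens in 2 s: {tokens_per_second(4096, 2.0):.0f}")

# Reference point: uniform guessing over the 50257-token GPT-2 vocabulary.
uniform_loss = math.log(50257)
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

A loss well above the uniform-guess value early in training usually points to a setup problem rather than slow learning.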

To run MLflow UI (optional):
```bash
mlflow ui --host 0.0.0.0 --port 5000
```
Use Colab's proxy or `cloudflared` to expose the port if needed.
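
Run directly in a cell, `mlflow ui` blocks until interrupted. One way around that in Colab is to start it as a background process and ask Colab for the proxied URL. The following is a sketch under assumptions: the helper names are hypothetical, the store path is the `file:./mlruns` default mentioned in this notebook, and `google.colab.kernel.proxyPort` is only available inside a Colab runtime.

```python
import subprocess


def mlflow_ui_command(port: int = 5000, store: str = "file:./mlruns") -> list[str]:
    """Build the `mlflow ui` command line for the local tracking store."""
    return [
        "mlflow", "ui",
        "--backend-store-uri", store,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


def start_mlflow_ui(port: int = 5000) -> subprocess.Popen:
    """Start the UI as a background process so the notebook cell returns."""
    return subprocess.Popen(
        mlflow_ui_command(port),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def colab_proxy_url(port: int = 5000) -> str:
    """Ask Colab for the proxied URL of a local port (works only inside Colab)."""
    from google.colab.output import eval_js  # importable only in a Colab runtime
    return eval_js(f"google.colab.kernel.proxyPort({port})")


# Usage inside Colab, from the project root:
# ui_process = start_mlflow_ui()
# print(colab_proxy_url())
# ui_process.terminate()  # stop the UI when done
```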