9.45B Total | 2.7B Active per Token | 8 Experts | 16+ Languages | Drop-Upcycled MoE
HuggingFace · Model Card · Fine-Tuning Guide · Blog · Training Logs
YUA MoE 9.45B is a multilingual Mixture-of-Experts language model built on Google TPU v4-32 using free TRC (TPU Research Cloud) credits. The model was created through Drop-Upcycling (ICLR 2025) — a 1.93B dense transformer was first trained, then expanded to 9.45B MoE with 8 experts via partial weight re-initialization.
It uses 8 experts with top-2 routing, activating only 2.7B parameters per token while maintaining the capacity of a 9.45B model. This means inference cost similar to a ~3B dense model, with quality closer to a ~7B model.
Built by one person, with $0 compute budget, trained on 16+ languages.
graph LR
A["Input<br/>Tokens"] --> B["Embedding<br/>128K × 2048"]
B --> D["× 32 Layers"]
subgraph D["× 32 Transformer Layers"]
direction TB
N1["RMSNorm"] --> ATT["GQA Attention<br/>32 heads, 8 KV<br/>QK-Norm + RoPE + Flash"]
ATT --> R1["+ Residual"]
R1 --> N2["RMSNorm"]
N2 --> RT["Router → Top-2"]
RT --> EX["8 Expert FFN<br/>(SwiGLU 2048→5461)"]
EX --> R2["+ Residual"]
end
D --> E["RMSNorm"]
E --> F["LM Head → 128K"]
style ATT fill:#2196F3,color:#fff
style RT fill:#FF9800,color:#fff
style EX fill:#4CAF50,color:#fff
How MoE routing works: Each token goes through all 32 layers. At each layer, a router selects the top-2 experts out of 8. Only the selected experts compute — the rest stay idle. This means each token uses ~2.7B of the 9.45B total parameters.
Per token: Input → Embed → [RMSNorm → GQA(+RoPE) → RMSNorm → Route → 2/8 Experts → Residual] × 32 → LM Head
Active: 2.7B params (2 experts per layer)
Inactive: 6.75B params (6 idle experts per layer)
Detailed Architecture (from source code)
YuaModel (src/model/yua_model.py)
│
├─ TokenEmbedding (src/model/embeddings.py)
│ └─ nn.Embedding(128000, 2048) + optional RMSNorm
│
├─ × 32 TransformerBlock (src/model/transformer.py)
│ ├─ RMSNorm(2048, eps=1e-6)
│ ├─ CausalSelfAttention (src/model/attention.py)
│ │ ├─ Q: Linear(2048, 2048) — 32 heads × 64 dim
│ │ ├─ K: Linear(2048, 512) — 8 KV heads (GQA)
│ │ ├─ V: Linear(2048, 512) — 8 KV heads (GQA)
│ │ ├─ O: Linear(2048, 2048)
│ │ ├─ QK-Norm: RMSNorm per head
│ │ ├─ RoPE (θ=500,000)
│ │ └─ Backend: auto (Flash > SDPA > manual)
│ ├─ + Residual
│ ├─ RMSNorm(2048, eps=1e-6)
│ ├─ MoE FFN (src/model/moe.py) — via create_ffn()
│ │ ├─ TopKRouter: Linear(2048, 8) → softmax → top-2
│ │ ├─ 8 × Expert SwiGLU FFN:
│ │ │ ├─ gate_proj: Linear(2048, 5461)
│ │ │ ├─ up_proj: Linear(2048, 5461)
│ │ │ └─ down_proj: Linear(5461, 2048)
│ │ ├─ Capacity limit + weight-priority token selection
│ │ └─ Aux loss: load balance + z-loss (fp32)
│ └─ + Residual
│
├─ RMSNorm(2048, eps=1e-6)
└─ LM Head: tied with embedding (2048 → 128000)
Loss = CE loss + MTP loss (optional) + MoE aux loss (layer-averaged)
| Component | Value |
|---|---|
| Type | Decoder-only Transformer + Sparse MoE |
| Total Parameters | 9.45B |
| Active Parameters | 2.7B per token |
| Layers | 32 |
Hidden Dim (d_model) |
2,048 |
| Attention | 32 heads, 8 KV heads (GQA 4:1) |
| Head Dim | 64 |
| Experts | 8 per layer, top-2 routing |
| FFN per Expert | SwiGLU, dim 5,461 |
| Vocabulary | 128,000 (SentencePiece BPE + byte fallback) |
| Position | RoPE (θ=500,000) |
| Norm | RMSNorm (pre-norm) |
| Attention Features | QK-Norm, Flash Attention 2, GQA |
| Context Length | 2,048 (→ 256K planned via YaRN 2-stage) |
| Precision | bfloat16 |
| License | Apache 2.0 |
Dense 7B: Every parameter fires for every token → 7B FLOPs/token
MoE 9.45B: Only 2/8 experts fire → 2.7B FLOPs/token
Result: 3.4× more capacity at same compute cost
🟢 Training in progress — updated live
| Metric | Value |
|---|---|
| Current Step | ~14.8K / 511K (MoE phase) |
| Epoch Progress | 2.9% MoE + 80K dense prior |
| Loss | avg 2.07, trending down |
| Tokens Seen | ~3.9B MoE + 42B dense prior |
| Step Time | 7.89 sec |
| Throughput | 66,400 tokens/sec |
| TFLOP/s | 71.8 per device |
| Uptime | Stable, zero OOM |
| Method | Drop-Upcycling (ICLR 2025) from 1.93B dense |
| ETA (1 epoch) | ~May 1, 2026 |
Loss
12 |*
| *
8 | *
| *
4 | * * *
| * * * * * * *
2 | * * * * * * * *
|____________________________________________
0 5K 10K 15K 20K 25K 30K 35K Step
Live CSV: training_log_moe_full.csv
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "jungwon-ai/YUA-MoE-9.45B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Generate
prompt = "인공지능의 미래는"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))prompts = {
"Korean": "서울의 봄에 대해 설명해줘",
"English": "Explain quantum computing simply",
"Japanese": "東京の観光スポットを教えて",
"Chinese": "什么是人工智能",
"Code": "def fibonacci(n):",
}
for lang, prompt in prompts.items():
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, temperature=0.7)
print(f"[{lang}] {tokenizer.decode(out[0], skip_special_tokens=True)}\n")YUA supports tool-augmented generation with an OpenAI-compatible tool calling API. The model can invoke external tools during generation, receive structured results, and continue reasoning.
See the full Tool Call API Reference for detailed specification.
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "yua-9.45b-moe",
"messages": [
{"role": "user", "content": "What is sqrt(144) + cbrt(27)?"}
],
"tools": [{
"type": "function",
"function": {
"name": "calculate",
"description": "Evaluate math expressions",
"parameters": {
"type": "object",
"properties": {
"expression": {"type": "string"}
},
"required": ["expression"]
}
}
}]
}'16 built-in tools:
| Category | Tools | Approval Required |
|---|---|---|
| Search | web_search, url_fetch, http_request |
No / No / Yes |
| Compute | calculator, calculate, execute, pytest |
No / No / Yes / Yes |
| Files | file_read, file_write, pdf_read, json_parse, grep_search |
No / Yes / No / No / No |
| System | shell, git_ops, datetime, unit_convert |
Yes / Mixed / No / No |
Native format (ChatML):
<tool_call>
{"name": "calculate", "arguments": {"expression": "sqrt(144) + 27**(1/3)"}}
</tool_call>
Structured result:
{"id": "call_0001", "name": "calculate", "status": "success", "output": "15.0", "duration_ms": 0.3}Error handling — parameters are validated against JSON Schema before execution. Errors include type (invalid_params, timeout, execution_error, etc.) and a human-readable message so the model can self-correct.
{"id": "call_0001", "name": "calculate", "status": "error",
"error": {"type": "invalid_params", "message": "missing required parameter: expression"},
"duration_ms": 0.0}pie title Training Data Composition
"Web (FineWeb, mC4, DCLM)" : 42
"Math & Science" : 20
"Code (The Stack, OpenCode)" : 13
"Wikipedia (4 langs)" : 9
"Books & Documents" : 9
"Other (StackExchange, arXiv)" : 7
| Domain | Size | Languages |
|---|---|---|
| Web crawl | 411 GB | ko, en, ja, zh + 12 langs (FineWeb2) |
| Math & Science | 200 GB | en, ja, zh |
| Code | 129 GB | Python, JS, Java, C++ + 28 langs |
| Wikipedia | 88 GB | ko, en, ja, zh |
| Books & Documents | 88 GB | en |
| Other | 71 GB | multilingual |
| Total | 987 GB | 16+ languages |
Languages: Korean, English, Japanese, Chinese, Arabic, Bengali, German, French, Hindi, Indonesian, Italian, Polish, Portuguese, Spanish, Thai, Turkish, Urdu, Vietnamese
| Setting | Value |
|---|---|
| Framework | MaxText (JAX/XLA) |
| Hardware | TPU v4-32 (4 hosts × 4 chips = 16 chips, 512 GB HBM) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-8, wd=0.1) |
| Learning Rate | 1e-4, cosine decay |
| Warmup | 0.5% of total steps |
| Batch Size | 8/device × 16 devices = 128 seqs (262K tok/step) |
| Sequence Length | 2,048 |
| Gradient Clipping | 1.0 |
| MoE Aux Loss | 0.01 |
| Checkpointing | Every 500 steps to GCS |
| Compute Cost | $0 (Google TRC free credits) |
YUA MoE 9.45B supports LoRA, QLoRA, and full SFT fine-tuning. See the Fine-Tuning Guide for complete instructions including:
- LoRA — 20-24GB VRAM (RTX 3090/4090)
- QLoRA — 8-12GB VRAM (RTX 3060/4060)
- Full SFT — TPU or multi-GPU
- MoE-specific tips (router training, expert collapse prevention)
Model weights will be available on HuggingFace after SFT completion. The examples below show planned usage.
from vllm import LLM, SamplingParams
llm = LLM(model="jungwon-ai/YUA-MoE-9.45B", trust_remote_code=True, dtype="bfloat16", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["인공지능의 미래는"], params)
print(outputs[0].outputs[0].text)ollama run yua-moe-9b./llama-cli -m yua-moe-9b-Q4_K_M.gguf \
-p "서울의 봄에 대해 설명해줘" \
-n 256 --temp 0.7 --top-p 0.9⏳ Evaluation in progress after 1 epoch completion (~April 17).
| Benchmark | Metric | YUA MoE 9.45B | OLMoE 7B | Phi-2 2.7B |
|---|---|---|---|---|
| MMLU | 5-shot | TBD | ~26% | ~50% |
| HellaSwag | 10-shot | TBD | — | ~73% |
| ARC-C | 25-shot | TBD | — | ~61% |
| HumanEval | pass@1 | TBD | — | ~48% |
| K-MMLU | 5-shot | TBD | — | — |
graph LR
A["YUA 125M<br/>Dense<br/>✅ Complete"] --> B["YUA 1.3B<br/>Dense<br/>✅ Complete"]
B --> C["YUA MoE 9.45B<br/>8 experts, top-2<br/>🔄 Training"]
C --> D["YUA MoE 26B<br/>32 experts, top-4<br/>+ Shared Expert<br/>📋 Planned"]
D --> E["YUA MoE 70B<br/>128 experts, top-8<br/>📋 Planned"]
style A fill:#4CAF50,color:#fff
style B fill:#4CAF50,color:#fff
style C fill:#FF9800,color:#fff
style D fill:#2196F3,color:#fff
style E fill:#2196F3,color:#fff
- MoE upcycling: Add experts (8→32→128) + shared experts (DeepSeek style)
- Depth stacking: Add layers for deeper models
- No width expansion: Proven to cause symmetry issues (see our research notes)
yua/
├── configs/ # Model & training configs
│ ├── maxtext_yua_7b_moe.yml # MoE training config
│ ├── model_1b.yaml # 1.3B architecture
│ └── model_7b_growth.yaml # 7B architecture
├── scripts/
│ ├── build_sft_data.py # SFT data pipeline (HF streaming)
│ ├── download_datasets_global.py # Data collection (45+ datasets)
│ ├── stream_shard_gcs.py # ArrayRecord conversion
│ ├── convert_yua_to_maxtext.py # PyTorch → MaxText converter
│ └── expand_checkpoint.py # Growth family expansion
├── src/model/ # YUA model architecture (PyTorch + HF)
│ ├── configuration_yua.py # HF YuaConfig
│ ├── modeling_yua.py # HF YuaForCausalLM
│ ├── yua_model.py # Core model (PyTorch)
│ └── moe.py # MoE routing + experts
├── tools/ # Conversion & utilities
│ ├── convert_maxtext_moe_to_hf.py # MaxText → HF converter
│ └── CONVERT_GUIDE.md
├── docs/
│ ├── HANDOVER.md # Full project state & history
│ ├── MODEL_CARD.md # HuggingFace model card
│ └── AI_CHAMPION_DATA.md # Competition data sheet
└── data/
└── tokenizer/ # SentencePiece 128K vocab
Everything is open:
| Artifact | Location |
|---|---|
| Training code | This repo |
| Training config | configs/maxtext_yua_7b_moe.yml |
| Training logs (per-step) | GCS CSV |
| Checkpoints | gs://yua-data-v1/maxtext_moe/ |
| Data shards | gs://yua-data-v1/maxtext_shards_text/ |
| Tokenizer | gs://yua-data-v1/tokenizer/yua_128k_v2.model |
| Raw data | gs://yua-data-v1/raw/ (987 GB) |
Apache License 2.0 — free for commercial and research use.
@misc{yua2026moe,
title = {YUA MoE 9.45B: Open Multilingual Mixture-of-Experts Language Model},
author = {Jungwon Eom},
year = {2026},
url = {https://github.com/yuaone/yua},
note = {9.45B total, 2.7B active, 8 experts top-2,
trained from scratch on 134B+ tokens, 16 languages,
TPU v4-32, $0 compute (Google TRC)}
}This work was made possible by the generous support of the Google TPU Research Cloud (TRC) program, which provided free access to TPU v4-32 compute resources. We are deeply grateful for their commitment to supporting independent AI research.
- Google TPU Research Cloud (TRC) — TPU v4-32 compute (Google Cloud Platform)
- MaxText — JAX training framework
- Hugging Face — Datasets & model hosting
- Open-source LLM community: OLMoE, DeepSeek, Qwen, Mistral