YUA MoE 9.45B

Open Multilingual Mixture-of-Experts Language Model

9.45B Total | 2.7B Active per Token | 8 Experts | 16+ Languages | Drop-Upcycled MoE

HuggingFace · Model Card · Fine-Tuning Guide · Blog · Training Logs

What is YUA?

YUA MoE 9.45B is a multilingual Mixture-of-Experts language model trained on a Google TPU v4-32 using free TRC (TPU Research Cloud) credits. The model was created through Drop-Upcycling (ICLR 2025): a 1.93B dense transformer was trained first, then expanded to a 9.45B MoE with 8 experts via partial weight re-initialization.
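The expansion step can be sketched as follows. This is a minimal illustration of the Drop-Upcycling idea, not the project's conversion script: each expert starts as a copy of the dense FFN weights, then a random subset of its intermediate units is re-initialized from a normal distribution matching the statistics of the original weights. The re-initialization ratio (0.5 here), the toy dimensions, and the function name `drop_upcycle` are assumptions. Only one weight matrix is shown; for a SwiGLU FFN the same chosen units would presumably be re-initialized in `gate_proj`, `up_proj`, and `down_proj` together.

```python
import random
import statistics


def drop_upcycle(dense_w, num_experts=8, ratio=0.5, seed=0):
    """Build expert copies of a dense FFN weight with partial re-initialization.

    dense_w: list of rows, one row per intermediate unit (d_ff x d_model).
    ratio:   fraction of intermediate units re-initialized per expert (assumed value).
    """
    rng = random.Random(seed)
    flat = [v for row in dense_w for v in row]
    mean, std = statistics.fmean(flat), statistics.pstdev(flat)
    d_ff, d_model = len(dense_w), len(dense_w[0])
    n_reinit = int(ratio * d_ff)

    experts = []
    for _ in range(num_experts):
        expert = [row[:] for row in dense_w]          # start from the dense weights
        dropped = rng.sample(range(d_ff), n_reinit)   # a different subset per expert
        for i in dropped:
            # Re-draw the whole unit from N(mean, std) of the original matrix.
            expert[i] = [rng.gauss(mean, std) for _ in range(d_model)]
        experts.append(expert)
    return experts


# Toy example: 8 intermediate units, model width 4.
init = random.Random(1)
dense = [[init.gauss(0.0, 0.02) for _ in range(4)] for _ in range(8)]
experts = drop_upcycle(dense)
kept = sum(row == dense[i] for i, row in enumerate(experts[0]))
print(len(experts), kept)  # 8 experts; half of expert 0's rows are still the dense rows
```

Because each expert draws its own subset, the experts share part of the dense model's knowledge while starting from different points, which is what lets them specialize during MoE training.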

It uses 8 experts with top-2 routing, activating only 2.7B parameters per token while retaining the capacity of a 9.45B model. Per-token inference compute is therefore similar to a ~3B dense model, with the aim of quality closer to a ~7B model (benchmarks are still pending).

Built by one person, with $0 compute budget, trained on 16+ languages.

Architecture

graph LR
    A["Input<br/>Tokens"] --> B["Embedding<br/>128K × 2048"]
    B --> D["× 32 Layers"]
    
    subgraph D["× 32 Transformer Layers"]
        direction TB
        N1["RMSNorm"] --> ATT["GQA Attention<br/>32 heads, 8 KV<br/>QK-Norm + RoPE + Flash"]
        ATT --> R1["+ Residual"]
        R1 --> N2["RMSNorm"]
        N2 --> RT["Router → Top-2"]
        RT --> EX["8 Expert FFN<br/>(SwiGLU 2048→5461)"]
        EX --> R2["+ Residual"]
    end
    
    D --> E["RMSNorm"]
    E --> F["LM Head → 128K"]

    style ATT fill:#2196F3,color:#fff
    style RT fill:#FF9800,color:#fff
    style EX fill:#4CAF50,color:#fff

How MoE routing works: Each token goes through all 32 layers. At each layer, a router selects the top-2 experts out of 8. Only the selected experts compute — the rest stay idle. This means each token uses ~2.7B of the 9.45B total parameters.

Per token: Input → Embed → [RMSNorm → GQA(+RoPE) → RMSNorm → Route → 2/8 Experts → Residual] × 32 → LM Head
Active:    2.7B params (2 experts per layer)
Inactive:  6.75B params (6 idle experts per layer)
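The routing step described above can be sketched in a few lines of plain Python. This is an illustration only, not the code in `src/model/moe.py`: renormalizing the two selected gate weights so they sum to 1 is an assumption, capacity limits are omitted, and the toy experts are placeholders.

```python
import math


def top2_route(logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]          # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top)                 # renormalize over the pair (assumption)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs; the others are never called."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert k scales its input by (k + 1).
experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.5]

chosen = top2_route(logits)
print(chosen)                                   # experts 1 and 4, weights summing to 1
print(moe_forward([1.0, 2.0], logits, experts))
```

Only `experts[1]` and `experts[4]` are evaluated for this token; the remaining six contribute no compute, which is where the gap between total and active parameters comes from.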

Detailed Architecture (from source code)

YuaModel (src/model/yua_model.py)
│
├─ TokenEmbedding (src/model/embeddings.py)
│   └─ nn.Embedding(128000, 2048) + optional RMSNorm
│
├─ × 32 TransformerBlock (src/model/transformer.py)
│   ├─ RMSNorm(2048, eps=1e-6)
│   ├─ CausalSelfAttention (src/model/attention.py)
│   │   ├─ Q: Linear(2048, 2048)  — 32 heads × 64 dim
│   │   ├─ K: Linear(2048, 512)   — 8 KV heads (GQA)
│   │   ├─ V: Linear(2048, 512)   — 8 KV heads (GQA)
│   │   ├─ O: Linear(2048, 2048)
│   │   ├─ QK-Norm: RMSNorm per head
│   │   ├─ RoPE (θ=500,000)
│   │   └─ Backend: auto (Flash > SDPA > manual)
│   ├─ + Residual
│   ├─ RMSNorm(2048, eps=1e-6)
│   ├─ MoE FFN (src/model/moe.py) — via create_ffn()
│   │   ├─ TopKRouter: Linear(2048, 8) → softmax → top-2
│   │   ├─ 8 × Expert SwiGLU FFN:
│   │   │   ├─ gate_proj: Linear(2048, 5461)
│   │   │   ├─ up_proj:   Linear(2048, 5461)
│   │   │   └─ down_proj:  Linear(5461, 2048)
│   │   ├─ Capacity limit + weight-priority token selection
│   │   └─ Aux loss: load balance + z-loss (fp32)
│   └─ + Residual
│
├─ RMSNorm(2048, eps=1e-6)
└─ LM Head: tied with embedding (2048 → 128000)

Loss = CE loss + MTP loss (optional) + MoE aux loss (layer-averaged)
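The two auxiliary terms named in the tree (load balance and z-loss) are commonly defined as in the sketch below. The exact formulas used in `src/model/moe.py` are not shown here, so this assumes the Switch-Transformer-style balance term and the ST-MoE-style router z-loss; the function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def load_balance_loss(batch_logits, top_k=2):
    """Balance term N * sum_i f_i * P_i over one layer's router outputs."""
    n_exp = len(batch_logits[0])
    counts = [0] * n_exp
    prob_sum = [0.0] * n_exp
    for logits in batch_logits:
        probs = softmax(logits)
        for i in sorted(range(n_exp), key=lambda j: probs[j], reverse=True)[:top_k]:
            counts[i] += 1
        prob_sum = [a + p for a, p in zip(prob_sum, probs)]
    n_tok = len(batch_logits)
    f = [c / (n_tok * top_k) for c in counts]   # share of routed slots per expert
    p = [s / n_tok for s in prob_sum]           # mean router probability per expert
    return n_exp * sum(fi * pi for fi, pi in zip(f, p))


def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of the router logits; discourages large logits."""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)


uniform = [[0.0] * 8 for _ in range(4)]                  # router has no preference
skewed = [[5.0, 5.0] + [0.0] * 6 for _ in range(4)]      # router favors experts 0 and 1
print(load_balance_loss(uniform), load_balance_loss(skewed))
print(router_z_loss(uniform), router_z_loss(skewed))
```

The balance term grows when routed traffic and router probability concentrate on the same experts, and the z-loss grows with logit magnitude, so both values are larger for the skewed batch than for the uniform one.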

Specifications

Component	Value
Type	Decoder-only Transformer + Sparse MoE
Total Parameters	9.45B
Active Parameters	2.7B per token
Layers	32
Hidden Dim (`d_model`)	2,048
Attention	32 heads, 8 KV heads (GQA 4:1)
Head Dim	64
Experts	8 per layer, top-2 routing
FFN per Expert	SwiGLU, dim 5,461
Vocabulary	128,000 (SentencePiece BPE + byte fallback)
Position	RoPE (θ=500,000)
Norm	RMSNorm (pre-norm)
Attention Features	QK-Norm, Flash Attention 2, GQA
Context Length	2,048 (→ 256K planned via YaRN 2-stage)
Precision	bfloat16
License	Apache 2.0
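As a back-of-the-envelope check, the headline parameter counts can be recomputed from the dimensions in the table. This is a sketch rather than output from the training code: norm weights are ignored, and whether the LM head is counted separately from the embedding is left as a switch because the total depends on it.

```python
# Dimensions taken from the specification table.
d_model, n_layers, vocab = 2048, 32, 128_000
n_heads, n_kv_heads, head_dim = 32, 8, 64
n_experts, top_k, d_ff = 8, 2, 5461

attn = 2 * d_model * n_heads * head_dim + 2 * d_model * n_kv_heads * head_dim  # Q, O + K, V
expert = 3 * d_model * d_ff        # gate_proj, up_proj, down_proj
router = d_model * n_experts
embed = vocab * d_model


def count_params(experts_used, count_lm_head):
    """Parameters touched when `experts_used` experts run in every layer."""
    per_layer = attn + router + experts_used * expert
    return n_layers * per_layer + embed * (2 if count_lm_head else 1)


print(f"total, LM head counted separately: {count_params(n_experts, True) / 1e9:.2f}B")
print(f"total, embedding counted once:     {count_params(n_experts, False) / 1e9:.2f}B")
print(f"active, embedding counted once:    {count_params(top_k, False) / 1e9:.2f}B")
```

With these dimensions the 9.45B total is reproduced when the embedding and LM head are counted as two matrices, and the active count comes out at roughly 2.75B when the embedding is counted once.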

Why MoE over Dense?

Dense 7B:   Every parameter is used for every token → ~7B active params/token
MoE 9.45B:  Only 2 of 8 experts run per layer → ~2.7B active params/token

Result: ~3.5× the total parameters (9.45B / 2.7B) of a dense model with the same per-token compute

Training Status

🟢 Training in progress — updated live

Metric	Value
Current Step	~14.8K / 511K (MoE phase)
Epoch Progress	2.9% MoE + 80K dense prior
Loss	avg 2.07, trending down
Tokens Seen	~3.9B MoE + 42B dense prior
Step Time	7.89 sec
Throughput	66,400 tokens/sec
TFLOP/s	71.8 per device
Uptime	Stable, zero OOM
Method	Drop-Upcycling (ICLR 2025) from 1.93B dense
ETA (1 epoch)	~May 1, 2026

Loss Curve

Loss
12 |*
   |  *
 8 |    *
   |      *
 4 |        * * *
   |              * * * * * * *
 2 |                            * * * * * * * *
   |____________________________________________
   0    5K   10K   15K   20K   25K   30K   35K   Step

Live CSV: training_log_moe_full.csv

Quick Start

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "jungwon-ai/YUA-MoE-9.45B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate
prompt = "인공지능의 미래는"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Multilingual Examples

prompts = {
    "Korean":   "서울의 봄에 대해 설명해줘",
    "English":  "Explain quantum computing simply",
    "Japanese": "東京の観光スポットを教えて",
    "Chinese":  "什么是人工智能",
    "Code":     "def fibonacci(n):",
}

for lang, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=True is required for temperature to take effect
    out = model.generate(**inputs, max_new_tokens=128, temperature=0.7, do_sample=True)
    print(f"[{lang}] {tokenizer.decode(out[0], skip_special_tokens=True)}\n")

Tool Calling

YUA supports tool-augmented generation with an OpenAI-compatible tool calling API. The model can invoke external tools during generation, receive structured results, and continue reasoning.

See the full Tool Call API Reference for detailed specification.

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yua-9.45b-moe",
    "messages": [
      {"role": "user", "content": "What is sqrt(144) + cbrt(27)?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "calculate",
        "description": "Evaluate math expressions",
        "parameters": {
          "type": "object",
          "properties": {
            "expression": {"type": "string"}
          },
          "required": ["expression"]
        }
      }
    }]
  }'

16 built-in tools:

Category	Tools	Approval Required
Search	`web_search`, `url_fetch`, `http_request`	No / No / Yes
Compute	`calculator`, `calculate`, `execute`, `pytest`	No / No / Yes / Yes
Files	`file_read`, `file_write`, `pdf_read`, `json_parse`, `grep_search`	No / Yes / No / No / No
System	`shell`, `git_ops`, `datetime`, `unit_convert`	Yes / Mixed / No / No
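One way to enforce the approval column on the client side is a simple lookup, sketched below. The dictionary mirrors the table above; treating `git_ops` ("Mixed") as always requiring approval is a conservative assumption, since the table does not say which git operations are exempt, and `needs_approval` is a hypothetical helper name.

```python
# Approval policy per built-in tool, copied from the table above.
APPROVAL = {
    "web_search": "no", "url_fetch": "no", "http_request": "yes",
    "calculator": "no", "calculate": "no", "execute": "yes", "pytest": "yes",
    "file_read": "no", "file_write": "yes", "pdf_read": "no",
    "json_parse": "no", "grep_search": "no",
    "shell": "yes", "git_ops": "mixed", "datetime": "no", "unit_convert": "no",
}


def needs_approval(tool_name):
    """Return True if a call to this tool should wait for user approval."""
    if tool_name not in APPROVAL:
        raise KeyError(f"unknown tool: {tool_name}")
    # "mixed" is treated as approval-required (assumption, see lead-in).
    return APPROVAL[tool_name] != "no"


print(len(APPROVAL), needs_approval("calculate"), needs_approval("shell"))
```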

Native format (ChatML):

<tool_call>
{"name": "calculate", "arguments": {"expression": "sqrt(144) + 27**(1/3)"}}
</tool_call>
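A client that consumes raw model output needs to pull these blocks out of the generated text. The sketch below shows one way to do that with the standard library; the helper name `extract_tool_calls` is hypothetical, and silently skipping malformed JSON is a simplification for illustration.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the parsed JSON payload of every <tool_call> block in the text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1).strip()))
        except json.JSONDecodeError:
            continue  # malformed block: skipped in this sketch
    return calls


sample = """Let me compute that.
<tool_call>
{"name": "calculate", "arguments": {"expression": "sqrt(144) + 27**(1/3)"}}
</tool_call>"""

calls = extract_tool_calls(sample)
print(calls[0]["name"], calls[0]["arguments"]["expression"])
```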

Structured result:

{"id": "call_0001", "name": "calculate", "status": "success", "output": "15.0", "duration_ms": 0.3}

Error handling — parameters are validated against JSON Schema before execution. Errors include type (invalid_params, timeout, execution_error, etc.) and a human-readable message so the model can self-correct.

{"id": "call_0001", "name": "calculate", "status": "error",
 "error": {"type": "invalid_params", "message": "missing required parameter: expression"},
 "duration_ms": 0.0}
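The validation step can be approximated as below. This is a hand-rolled check of `required` keys and primitive types only, not a full JSON Schema validator and not the project's implementation; `validate_call` and `JSON_TYPES` are illustrative names. The returned dictionary follows the error shape shown above.

```python
import time

# Mirrors the "parameters" object of the calculate tool in the request example.
CALCULATE_SCHEMA = {
    "type": "object",
    "properties": {"expression": {"type": "string"}},
    "required": ["expression"],
}
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool,
              "object": dict, "array": list}


def validate_call(call_id, name, arguments, schema):
    """Return an error result dict if the arguments violate the schema, else None."""
    start = time.perf_counter()

    def error(message):
        elapsed_ms = (time.perf_counter() - start) * 1000
        return {"id": call_id, "name": name, "status": "error",
                "error": {"type": "invalid_params", "message": message},
                "duration_ms": round(elapsed_ms, 1)}

    for key in schema.get("required", []):
        if key not in arguments:
            return error(f"missing required parameter: {key}")
    for key, value in arguments.items():
        expected = schema.get("properties", {}).get(key, {}).get("type")
        if expected in JSON_TYPES and not isinstance(value, JSON_TYPES[expected]):
            return error(f"parameter {key} must be of type {expected}")
    return None


print(validate_call("call_0001", "calculate", {}, CALCULATE_SCHEMA))
print(validate_call("call_0002", "calculate", {"expression": "1 + 1"}, CALCULATE_SCHEMA))
```

Returning the error as data rather than raising lets the message be fed straight back to the model as the tool result, so it can retry with corrected arguments.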

Training Details

Data (987GB raw → 134B tokens, 477 ArrayRecord shards)

pie title Training Data Composition
    "Web (FineWeb, mC4, DCLM)" : 42
    "Math & Science" : 20
    "Code (The Stack, OpenCode)" : 13
    "Wikipedia (4 langs)" : 9
    "Books & Documents" : 9
    "Other (StackExchange, arXiv)" : 7

Domain	Size	Languages
Web crawl	411 GB	ko, en, ja, zh + 12 langs (FineWeb2)
Math & Science	200 GB	en, ja, zh
Code	129 GB	Python, JS, Java, C++ + 28 langs
Wikipedia	88 GB	ko, en, ja, zh
Books & Documents	88 GB	en
Other	71 GB	multilingual
Total	987 GB	16+ languages

Languages: Korean, English, Japanese, Chinese, Arabic, Bengali, German, French, Hindi, Indonesian, Italian, Polish, Portuguese, Spanish, Thai, Turkish, Urdu, Vietnamese

Infrastructure

Setting	Value
Framework	MaxText (JAX/XLA)
Hardware	TPU v4-32 (4 hosts × 4 chips = 16 chips, 512 GB HBM)
Optimizer	AdamW (β₁=0.9, β₂=0.95, ε=1e-8, wd=0.1)
Learning Rate	1e-4, cosine decay
Warmup	0.5% of total steps
Batch Size	8/device × 16 devices = 128 seqs (262K tok/step)
Sequence Length	2,048
Gradient Clipping	1.0
MoE Aux Loss Weight	0.01
Checkpointing	Every 500 steps to GCS
Compute Cost	$0 (Google TRC free credits)
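The batch and schedule settings in the table imply the step counts below. The sketch also shows one plausible shape for the learning-rate schedule; the linear warmup form and the floor the cosine decays to (`min_lr`, 0 here) are assumptions, since the table only states the peak rate, the warmup fraction, and "cosine decay". The 134B token count is the figure given in the data section.

```python
import math

seqs_per_step = 8 * 16                    # per-device batch x devices
tokens_per_step = seqs_per_step * 2048    # sequences x sequence length
total_steps = round(134e9 / tokens_per_step)   # one pass over 134B tokens
warmup_steps = int(0.005 * total_steps)        # 0.5% of total steps
peak_lr = 1e-4


def lr_at(step, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed floor)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


print(tokens_per_step, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

The computed step count agrees with the 511K total shown under Training Status.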

Fine-Tuning

YUA MoE 9.45B supports LoRA, QLoRA, and full SFT fine-tuning. See the Fine-Tuning Guide for complete instructions including:

LoRA — 20-24GB VRAM (RTX 3090/4090)
QLoRA — 8-12GB VRAM (RTX 3060/4060)
Full SFT — TPU or multi-GPU
MoE-specific tips (router training, expert collapse prevention)

Inference Options (coming after SFT — May 2026)

Model weights will be available on HuggingFace after SFT completion. The examples below show planned usage.

vLLM (planned)

from vllm import LLM, SamplingParams

llm = LLM(model="jungwon-ai/YUA-MoE-9.45B", trust_remote_code=True, dtype="bfloat16", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["인공지능의 미래는"], params)
print(outputs[0].outputs[0].text)

Ollama (planned)

ollama run yua-moe-9b

llama.cpp / GGUF (planned)

./llama-cli -m yua-moe-9b-Q4_K_M.gguf \
    -p "서울의 봄에 대해 설명해줘" \
    -n 256 --temp 0.7 --top-p 0.9

Benchmarks

⏳ Evaluation will begin once the first epoch completes (~April 17).

Benchmark	Metric	YUA MoE 9.45B	OLMoE 7B	Phi-2 2.7B
MMLU	5-shot	TBD	~26%	~50%
HellaSwag	10-shot	TBD	—	~73%
ARC-C	25-shot	TBD	—	~61%
HumanEval	pass@1	TBD	—	~48%
K-MMLU	5-shot	TBD	—	—

Growth Roadmap

graph LR
    A["YUA 125M<br/>Dense<br/>✅ Complete"] --> B["YUA 1.3B<br/>Dense<br/>✅ Complete"]
    B --> C["YUA MoE 9.45B<br/>8 experts, top-2<br/>🔄 Training"]
    C --> D["YUA MoE 26B<br/>32 experts, top-4<br/>+ Shared Expert<br/>📋 Planned"]
    D --> E["YUA MoE 70B<br/>128 experts, top-8<br/>📋 Planned"]
    
    style A fill:#4CAF50,color:#fff
    style B fill:#4CAF50,color:#fff
    style C fill:#FF9800,color:#fff
    style D fill:#2196F3,color:#fff
    style E fill:#2196F3,color:#fff

Scaling Strategy

MoE upcycling: Add experts (8→32→128) + shared experts (DeepSeek style)
Depth stacking: Add layers for deeper models
No width expansion: Proven to cause symmetry issues (see our research notes)

Project Structure

yua/
├── configs/             # Model & training configs
│   ├── maxtext_yua_7b_moe.yml    # MoE training config
│   ├── model_1b.yaml             # 1.3B architecture
│   └── model_7b_growth.yaml      # 7B architecture
├── scripts/
│   ├── build_sft_data.py          # SFT data pipeline (HF streaming)
│   ├── download_datasets_global.py # Data collection (45+ datasets)
│   ├── stream_shard_gcs.py        # ArrayRecord conversion
│   ├── convert_yua_to_maxtext.py  # PyTorch → MaxText converter
│   └── expand_checkpoint.py       # Growth family expansion
├── src/model/           # YUA model architecture (PyTorch + HF)
│   ├── configuration_yua.py       # HF YuaConfig
│   ├── modeling_yua.py            # HF YuaForCausalLM
│   ├── yua_model.py               # Core model (PyTorch)
│   └── moe.py                     # MoE routing + experts
├── tools/               # Conversion & utilities
│   ├── convert_maxtext_moe_to_hf.py  # MaxText → HF converter
│   └── CONVERT_GUIDE.md
├── docs/
│   ├── HANDOVER.md      # Full project state & history
│   ├── MODEL_CARD.md    # HuggingFace model card
│   └── AI_CHAMPION_DATA.md  # Competition data sheet
└── data/
    └── tokenizer/       # SentencePiece 128K vocab

Reproducibility

Everything is open:

Artifact	Location
Training code	This repo
Training config	`configs/maxtext_yua_7b_moe.yml`
Training logs (per-step)	GCS CSV
Checkpoints	`gs://yua-data-v1/maxtext_moe/`
Data shards	`gs://yua-data-v1/maxtext_shards_text/`
Tokenizer	`gs://yua-data-v1/tokenizer/yua_128k_v2.model`
Raw data	`gs://yua-data-v1/raw/` (987 GB)

License

Apache License 2.0 — free for commercial and research use.

Citation

@misc{yua2026moe,
    title   = {YUA MoE 9.45B: Open Multilingual Mixture-of-Experts Language Model},
    author  = {Jungwon Eom},
    year    = {2026},
    url     = {https://github.com/yuaone/yua},
    note    = {9.45B total, 2.7B active, 8 experts top-2,
               trained from scratch on 134B+ tokens, 16 languages,
               TPU v4-32, $0 compute (Google TRC)}
}

Acknowledgments

This work was made possible by the generous support of the Google TPU Research Cloud (TRC) program, which provided free access to TPU v4-32 compute resources. We are deeply grateful for their commitment to supporting independent AI research.

Google TPU Research Cloud (TRC) — TPU v4-32 compute (Google Cloud Platform)
MaxText — JAX training framework
Hugging Face — Datasets & model hosting
Open-source LLM community: OLMoE, DeepSeek, Qwen, Mistral
