embedding-vocab-trimmer: reduce a multilingual text-embedding model to a single target language, with no training and no GPU, by surgically trimming its token vocabulary.
In a multilingual embedding model, the token-embedding matrix dominates the parameter count
(EmbeddingGemma-300M: 262,144 × 768 ≈ 201M of ~308M parameters, about 65%). For a chosen target language,
trim_vocab.py identifies the tokens that language actually uses, retains the top-$K$ by corpus
frequency, re-indexes the embedding matrix, and rewrites the BPE merge table,
leaving the transformer encoder and the SentenceTransformers pooling/Dense heads bit-for-bit unchanged.
Result on Portuguese (EmbeddingGemma-300M → 157M): the 64k-token trim retains 99.4% of the full model's MTEB(por) score at half the parameters — with zero training.
📦 Example model: tardellirs/embeddinggemma-pt-br · 🛠️ Tool: github.com/tardellirs/embedding-vocab-trimmer
Let
Given a target-language corpus
Let
A contiguous re-indexing bijection
A BPE vocabulary is defined by an ordered list of merge rules
This is the critical step most implementations overlook: retaining a merge whose product
The trimmed embedding matrix
The encoder weights
For every surviving token
For EmbeddingGemma-300M (
Because the encoder is identical across all trim sizes, quality loss arises solely from tokenization changes for out-of-vocabulary tokens. As
Empirically, convergence is fast: at
```text
     multilingual model                       language-trimmed model
┌───────────────────────────┐               ┌────────────────────────┐
│ embed_tokens 262144×768   │  ── trim ──▶  │ embed_tokens 64000×768 │  ← only this shrinks
├───────────────────────────┤               ├────────────────────────┤
│ transformer encoder       │  (unchanged)  │ transformer encoder    │
│ pooling + Dense heads     │  (unchanged)  │ pooling + Dense heads  │
└───────────────────────────┘               └────────────────────────┘
        ~308M params                               ~157M params
```
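The embedding-side change pictured above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in trim_vocab.py: the function name `trim_embedding`, the `always_keep` argument, the choice to order surviving tokens by their old id, and the toy data are all hypothetical, and a nested list stands in for the real `embed_tokens` weight tensor.

```python
from collections import Counter


def trim_embedding(embedding, corpus_token_ids, k, always_keep=()):
    """Keep the k most frequent token ids and re-index them contiguously.

    embedding        -- list of rows, one per old token id (toy stand-in for embed_tokens)
    corpus_token_ids -- iterable of old token ids seen in the target-language corpus
    always_keep      -- ids retained regardless of frequency, e.g. special tokens
                        (an assumption of this sketch)
    Returns (new_embedding, old_to_new).
    """
    counts = Counter(corpus_token_ids)
    kept = set(always_keep)
    # most_common() yields ids by descending corpus frequency.
    for token_id, _ in counts.most_common():
        if len(kept) >= k:
            break
        kept.add(token_id)
    # Ordering survivors by old id is a choice made here for readability.
    ordered = sorted(kept)
    old_to_new = {old: new for new, old in enumerate(ordered)}
    # Rows are copied verbatim; no value is modified.
    new_embedding = [list(embedding[old]) for old in ordered]
    return new_embedding, old_to_new


# Toy example: 6-token vocabulary, 2-dimensional embeddings.
embedding = [[float(i), float(i) + 0.5] for i in range(6)]
corpus = [3, 3, 3, 5, 5, 1, 4]
new_embedding, old_to_new = trim_embedding(embedding, corpus, k=3, always_keep=(0,))

# Every surviving token keeps exactly the vector it had before the trim.
for old, new in old_to_new.items():
    assert new_embedding[new] == embedding[old]
```

Because rows are only selected and copied, any input whose tokens all survive the trim reaches the encoder with the same vectors as in the full model.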
```bash
pip install -r requirements.txt

# trim EmbeddingGemma-300M to a 64k Portuguese vocabulary
python trim_vocab.py \
  --model google/embeddinggemma-300m \
  --corpus-config por \
  --vocab-size 64000 \
  --output ./embeddinggemma-pt-br

# push to the Hub (requires HF_TOKEN)
python trim_vocab.py --model google/embeddinggemma-300m --corpus-config por \
  --vocab-size 64000 --output ./out --push <user>/embeddinggemma-pt-br
```

`--corpus-config` is the language code for the mining corpus (e.g. `por`, `fra`, `deu`, `spa`); the corpus
defaults to lbourdois/fineweb-2-trimming. Pass `--corpus-dataset` to mine from any other Hugging Face text dataset.
Evaluated on MTEB(por) — 22 native PT-BR tasks spanning classification, pair-classification, STS,
clustering, retrieval, and reranking (mean_22). The transformer encoder and Dense heads are identical
at every vocab size; the only variable is the embedding matrix.
| vocab | params | MTEB(por) mean_22 | % of full |
|---|---|---|---|
| 16k | ~119M | 0.5950 | 91.7% |
| 24k | ~125M | 0.6263 | 96.5% |
| 32k | ~131M | 0.6201 | 95.5% |
| 48k | ~144M | 0.6418 | 98.9% |
| 64k | ~157M | 0.6453 | 99.4% |
| 128k | ~207M | 0.6491 | ≈100% |
| full EG-300M | ~308M | 0.6490 | 100% |
Quality rises monotonically from 32k upward; the only non-monotonic step is 24k → 32k, where the score dips
from 0.6263 to 0.6201. At 64k the trim reaches 99.4% of the full model's score
at 51% of the parameters. The 128k model ties the full model within measurement noise.
See results/ and examples/embeddinggemma_pt.md.
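The parameter column can be approximated from the embedding shape alone, since only the embedding matrix changes with vocabulary size. The sketch below is a back-of-the-envelope check, not output from the tool: the helper names are hypothetical, the total of 308M is the rounded figure from the table, and it assumes exactly K rows are kept and that "64k" means 64,000 as in the command above.

```python
FULL_VOCAB = 262_144      # rows of embed_tokens in the full model
HIDDEN_DIM = 768          # embedding width
FULL_PARAMS_M = 308.0     # rounded total parameter count, in millions


def estimated_params_m(k):
    """Estimate total parameters (millions) after trimming to k tokens."""
    full_embedding_m = FULL_VOCAB * HIDDEN_DIM / 1e6
    non_embedding_m = FULL_PARAMS_M - full_embedding_m  # encoder + heads, unchanged
    return non_embedding_m + k * HIDDEN_DIM / 1e6


def retained_pct(score, full_score=0.6490):
    """Score as a percentage of the full model's MTEB(por) mean_22."""
    return 100 * score / full_score


params_64k = estimated_params_m(64_000)
pct_64k = retained_pct(0.6453)
print(f"64k: ~{params_64k:.0f}M params, {pct_64k:.1f}% of full score")
```

Under these assumptions the 64k estimate comes out near 156M, slightly below the ~157M reported in the table; the difference is within what rounded inputs and any tokens retained beyond K could explain.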
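The key metric for assessing trimming potential is the embedding fraction: the share of total parameters held by the token-embedding matrix (vocabulary size × hidden dimension ÷ total parameters).
| Model | Vocab size | Hidden dim | Emb (M) | Total (M) | Emb fraction |
|---|---|---|---|---|---|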
| sentence-transformers/LaBSE | 501,153 | 768 | 384.9 | 471 | 81.7% |
| intfloat/multilingual-e5-base | 250,002 | 768 | 192.0 | 278 | 69.1% |
| paraphrase-multilingual-mpnet-base-v2 | 250,002 | 768 | 192.0 | 278 | 69.1% |
| google/embeddinggemma-300m | 262,144 | 768 | 201.3 | 308 | 65.4% |
| intfloat/multilingual-e5-large | 250,002 | 1024 | 256.0 | 560 | 45.7% |
| BAAI/bge-m3 | 250,002 | 1024 | 256.0 | 568 | 45.1% |
| Qwen/Qwen3-Embedding-0.6B | 151,669 | 1024 | 155.3 | 596 | 26.1% |
| Qwen/Qwen3-Embedding-4B | 151,665 | 2560 | 388.3 | 4,020 | 9.7% |
| intfloat/e5-mistral-7b-instruct | 32,000 | 4096 | 131.1 | 7,111 | 1.8% |
The pattern is consistent: encoder-only or bi-encoder models with a multilingual tokenizer (XLM-RoBERTa = 250k; Google SentencePiece = 262k; LaBSE = 501k) concentrate parameters in the embedding matrix and yield the largest reductions. Decoder-only models (Mistral, Qwen, LLaMA families) have small vocabularies relative to the size of their transformer stack, so trimming saves far less: at most about 26% for Qwen3-Embedding-0.6B and under 10% for the 4B and 7B models.
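The embedding fraction in the table is simple arithmetic and can be recomputed for any candidate model. The snippet below is a small sketch using three rows from the table; the function name `embedding_fraction` is hypothetical, and the totals are the rounded figures shown above.

```python
def embedding_fraction(vocab_size, hidden_dim, total_params):
    """Share of all parameters that sit in the token-embedding matrix."""
    return vocab_size * hidden_dim / total_params


# (vocab size, hidden dim, total parameters) taken from the table above.
models = {
    "sentence-transformers/LaBSE": (501_153, 768, 471e6),
    "google/embeddinggemma-300m": (262_144, 768, 308e6),
    "intfloat/e5-mistral-7b-instruct": (32_000, 4096, 7_111e6),
}

fractions = {name: embedding_fraction(*spec) for name, spec in models.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

The fraction is an upper bound on what trimming can remove, because some rows always remain; a model near 2% has little to gain however aggressively the vocabulary is cut.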
- Compression, not enhancement. Vocabulary trimming removes unused parameters; it does not improve the model. Fine-tuning, layer pruning, and distillation from a larger teacher were all evaluated and each reduced MTEB(por) by 0.02–0.04 points. The base model is at its representational ceiling for the target language; trimming recovers deployment efficiency at almost no quality cost (99.4% of the full score at 64k), but cannot raise quality above that ceiling.
- Tokenizer family. Validated on BPE with `byte_fallback` (Gemma/EmbeddingGemma). The method generalises to other BPE/SentencePiece embedders; merge-filtering logic may require minor adaptation per family.
- Architecture. Targets SentenceTransformers models with a `transformers` encoder and an `embed_tokens` weight matrix. The encoder, pooling, and Dense heads pass through untouched.
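The merge-filtering step mentioned under "Tokenizer family" can be illustrated with a toy example. The rule used here is an assumption rather than a description of trim_vocab.py: a merge is kept only if both of its inputs and the token it produces are all still in the trimmed vocabulary. The function name `filter_merges` and the sample tokens are hypothetical.

```python
def filter_merges(merges, kept_tokens):
    """Drop merge rules that reference or produce a token removed by the trim.

    merges      -- ordered list of (left, right) string pairs
    kept_tokens -- collection of token strings that survive the trim
    The original rule order is preserved, since BPE applies merges by priority.
    """
    kept = set(kept_tokens)
    return [
        (left, right)
        for left, right in merges
        if left in kept and right in kept and (left + right) in kept
    ]


merges = [("a", "b"), ("ab", "c"), ("c", "d")]
kept_tokens = {"a", "b", "c", "d", "ab"}  # "abc" and "cd" were trimmed away
filtered = filter_merges(merges, kept_tokens)
print(filtered)
```

In this example only the first rule survives: the other two would produce `abc` and `cd`, which no longer have rows in the embedding matrix. Other tokenizer families may store merges differently, which is where the per-family adaptation would come in.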
Tool: Apache-2.0 (see LICENSE). The example model is derived from Google's
EmbeddingGemma and is released under the Gemma Terms of Use
(see NOTICE). Trimmed models inherit the license of their base model.
If this work is useful, a link or star is appreciated. Benchmark: MTEB(por) leaderboard.

