Authors: Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, Andrew Yates
Pretrained models are available on the MILCO HuggingFace Collection.
We use uv to manage the environment and dependencies.

```bash
# Create a virtual environment
uv venv milcoenv
source milcoenv/bin/activate

# Install dependencies
uv pip install -r requirements.txt
```

To add a new package later:

```bash
uv pip install <package-name>
```

Both training scripts are designed to run under SLURM and read the following environment variables, set automatically by `srun` (a minimal mapping sketch follows the table):
| Variable | Purpose |
|---|---|
| `SLURM_NNODES` | Number of nodes (`--nnodes`) |
| `SLURM_GPUS_PER_NODE` | GPUs per node (`--nproc_per_node`) |
| `SLURM_PROCID` | Node rank (`--node_rank`) |
| `SLURM_JOBID` | Rendezvous ID (`--rdzv-id`) |
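As a rough illustration, here is a minimal sketch of how a launcher script could forward these variables to `torchrun`. The entry point `train.py` and the rendezvous endpoint are assumptions for illustration, not the repository's actual layout:

```bash
# Minimal sketch (assumed): forward SLURM's variables to torchrun.
# train.py is a placeholder entry point; 25900 is the default port noted below.
MASTER_PORT=${MASTER_PORT:-25900}

torchrun \
  --nnodes="${SLURM_NNODES}" \
  --nproc_per_node="${SLURM_GPUS_PER_NODE}" \
  --node_rank="${SLURM_PROCID}" \
  --rdzv-id="${SLURM_JOBID}" \
  --rdzv-backend=c10d \
  --rdzv-endpoint="${MASTER_ADDR:-localhost}:${MASTER_PORT}" \
  train.py "$@"
```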
A typical submission script wraps the training command in `srun`:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=24:00:00

srun bash scripts/alignment.sh  # or scripts/distillation.sh
```

Adjust `--nodes`, `--gpus-per-node`, `--time`, and your account/partition flags to match your cluster. Set `MASTER_PORT` to a free port if the default (25900) is in use.
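For example, you can export an alternative port before the `srun` line in the submission script (29501 here is an arbitrary choice):

```bash
export MASTER_PORT=29501  # any free port; the scripts default to 25900
srun bash scripts/alignment.sh
```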
Aligns the multilingual encoder to the English LSR space using parallel and multilingual corpora.

```bash
srun bash scripts/alignment.sh
```

Key arguments in `scripts/alignment.sh`:

- `--multilingual_encoder_checkpoint`: multilingual backbone (e.g. `BAAI/bge-m3-unsupervised`).
- `--lsr_encoder_checkpoint`: English LSR teacher providing the target lexical space (e.g. `naver/splade-v3`).
- `--train_datasets`: alignment corpora (mMARCO, WikiMatrix, Europarl, OpenSubtitles, Talks, Tatoeba, JW300, news-commentary).
- `--output_dir`: checkpoint location, consumed by Stage 2.
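For concreteness, here is a sketch of how the training command inside `scripts/alignment.sh` might combine these arguments with the SLURM-derived launcher flags. The entry point `train.py`, the output path, and the lowercase dataset identifiers are placeholders, not the repository's actual names:

```bash
# Sketch only: train.py and the dataset spellings are illustrative placeholders.
torchrun --nnodes="${SLURM_NNODES}" --nproc_per_node="${SLURM_GPUS_PER_NODE}" \
  train.py \
  --multilingual_encoder_checkpoint BAAI/bge-m3-unsupervised \
  --lsr_encoder_checkpoint naver/splade-v3 \
  --train_datasets mmarco wikimatrix europarl opensubtitles talks tatoeba jw300 news-commentary \
  --output_dir outputs/stage1-alignment
```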
Trains the aligned model with hard negatives and teacher-score distillation.

```bash
srun bash scripts/distillation.sh
```

Key arguments in `scripts/distillation.sh`:

- `--pretrained_alignment_checkpoint`: checkpoint produced by Stage 1.
- `--echo`: enables the LexEcho head (source-language view).
- `--train_datasets bge-distillation`: distillation data with teacher scores.
- `--train_group_size`: number of passages per query (1 positive + N−1 negatives).
- `--lambda_q` / `--lambda_d`: FLOPS regularization weights for queries and documents.
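Analogously, a sketch of the Stage 2 command. The entry point, group size, and lambda values are illustrative assumptions, not the repository's defaults:

```bash
# Sketch only: train.py and the numeric values are illustrative placeholders.
torchrun --nnodes="${SLURM_NNODES}" --nproc_per_node="${SLURM_GPUS_PER_NODE}" \
  train.py \
  --pretrained_alignment_checkpoint outputs/stage1-alignment \
  --echo \
  --train_datasets bge-distillation \
  --train_group_size 16 \
  --lambda_q 0.01 \
  --lambda_d 0.008
```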
If you use MILCO in your work, please cite:

```bibtex
@inproceedings{nguyen2026milco,
  title={MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector},
  author={Nguyen, Thong and Lei, Yibin and Ju, Jia-Huei and Yang, Eugene and Yates, Andrew},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```