MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector


Authors: Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, Andrew Yates

Pretrained models are available on the MILCO HuggingFace Collection.

Training

Installation

We use uv to manage the environment and dependencies.

# Create a virtual environment
uv venv milcoenv
source milcoenv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

To add a new package later:

uv pip install <package-name>

Launching on SLURM

Both training scripts are designed to run under SLURM and read the following environment variables set automatically by srun:

Variable              Purpose
-------------------   --------------------------------
SLURM_NNODES          Number of nodes (--nnodes)
SLURM_GPUS_PER_NODE   GPUs per node (--nproc_per_node)
SLURM_PROCID          Node rank (--node_rank)
SLURM_JOBID           Rendezvous ID (--rdzv-id)
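
For orientation, here is a minimal sketch of how these variables typically feed a torchrun launch. It is not the actual content of the scripts: the entry point train.py and the c10d rendezvous backend are assumptions.

# Sketch only: mapping the SLURM variables above onto torchrun flags.
# train.py is a placeholder, not the repo's actual entry point.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node="$SLURM_GPUS_PER_NODE" \
    --node_rank="$SLURM_PROCID" \
    --rdzv-id="$SLURM_JOBID" \
    --rdzv-backend=c10d \
    --rdzv-endpoint="$MASTER_ADDR:${MASTER_PORT:-25900}" \
    train.py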

A typical submission script wraps the training command in srun:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=24:00:00

srun bash scripts/alignment.sh    # or scripts/distillation.sh

Adjust --nodes, --gpus-per-node, --time, and your account/partition flags to match your cluster. Set MASTER_PORT to a free port if the default (25900) is in use.
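
For example, to move the rendezvous off the default port, export a free one before launching (the value below is arbitrary):

export MASTER_PORT=29501
srun bash scripts/alignment.sh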

Stage 1: Sparse Alignment Pretraining

Aligns the multilingual encoder to the English LSR space using parallel and multilingual corpora.

srun bash scripts/alignment.sh

Key arguments in scripts/alignment.sh (an illustrative invocation follows the list):

  • --multilingual_encoder_checkpoint — multilingual backbone (e.g. BAAI/bge-m3-unsupervised).
  • --lsr_encoder_checkpoint — English LSR teacher providing the target lexical space (e.g. naver/splade-v3).
  • --train_datasets — alignment corpora (mMARCO, WikiMatrix, Europarl, OpenSubtitles, Talks, Tatoeba, JW300, news-commentary).
  • --output_dir — checkpoint location, consumed by Stage 2.
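
To make the flags above concrete, the Stage 1 invocation might look roughly like the sketch below. The entry point train_alignment.py, the dataset identifiers, and the output path are illustrative assumptions; defer to scripts/alignment.sh for the exact values.

# Illustrative sketch only — entry point and values are assumptions.
torchrun --nnodes="$SLURM_NNODES" --nproc_per_node="$SLURM_GPUS_PER_NODE" \
    train_alignment.py \
    --multilingual_encoder_checkpoint BAAI/bge-m3-unsupervised \
    --lsr_encoder_checkpoint naver/splade-v3 \
    --train_datasets mmarco wikimatrix europarl opensubtitles \
    --output_dir outputs/stage1_alignment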

Stage 2: Contrastive Distillation

Trains the aligned model with hard negatives and teacher-score distillation.

srun bash scripts/distillation.sh

Key arguments in scripts/distillation.sh (an illustrative invocation follows the list):

  • --pretrained_alignment_checkpoint — checkpoint produced by Stage 1.
  • --echo — enables the LexEcho head (source-language view).
  • --train_datasets bge-distillation — distillation data with teacher scores.
  • --train_group_size — number of passages per query (1 positive + N−1 negatives).
  • --lambda_q / --lambda_d — FLOPS regularization weights for queries and documents.
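
Analogously, a sketch of the Stage 2 invocation. The entry point train_distillation.py, the group size, and the regularization weights are assumptions for illustration; scripts/distillation.sh holds the real values.

# Illustrative sketch only — entry point and values are assumptions.
torchrun --nnodes="$SLURM_NNODES" --nproc_per_node="$SLURM_GPUS_PER_NODE" \
    train_distillation.py \
    --pretrained_alignment_checkpoint outputs/stage1_alignment \
    --echo \
    --train_datasets bge-distillation \
    --train_group_size 16 \
    --lambda_q 0.01 \
    --lambda_d 0.008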

Citation

@inproceedings{nguyen2026milco,
  title={MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector},
  author={Nguyen, Thong and Lei, Yibin and Ju, Jia-Huei and Yang, Eugene and Yates, Andrew},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
