
KaLM-Embedding


✨ Overview

Code for training and evaluation of our KaLM-Embedding models.

For a more comprehensive understanding of the technical details, please refer to our paper KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model.

⚑ Features

  • Training
    • Ranking Consistency Filtering
    • Semi-homogeneous Task Batching (batch sampler sketched below)
    • Matryoshka Representation Learning
  • Evaluation
    • Multi-GPU Asynchronous Computation
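To make the batching strategy above concrete, here is a minimal, hypothetical sketch of a semi-homogeneous batch sampler. It is not the repository's actual implementation: the class name, the `task_to_indices` mapping, and the `homogeneous_ratio` parameter are assumptions made for illustration. The idea is that most of each batch comes from a single task, so in-batch negatives stay distributionally consistent, while a small mixed remainder preserves diversity.

```python
import random
from typing import Dict, Iterator, List

class SemiHomogeneousBatchSampler:
    """Hypothetical sketch of semi-homogeneous task batching.

    Most of each batch is drawn from one randomly chosen task; the
    remainder is drawn from the global pool to keep some diversity.
    """

    def __init__(self, task_to_indices: Dict[str, List[int]],
                 batch_size: int, homogeneous_ratio: float = 0.8):
        self.task_to_indices = task_to_indices
        self.batch_size = batch_size
        self.n_homogeneous = int(batch_size * homogeneous_ratio)
        self.all_indices = [i for ids in task_to_indices.values() for i in ids]

    def __iter__(self) -> Iterator[List[int]]:
        tasks = list(self.task_to_indices)
        while True:
            task = random.choice(tasks)          # anchor task for this batch
            pool = self.task_to_indices[task]
            batch = random.sample(pool, min(self.n_homogeneous, len(pool)))
            while len(batch) < self.batch_size:  # top up from any task
                batch.append(random.choice(self.all_indices))
            yield batch
```

A sampler like this would plug into a PyTorch `DataLoader` via its `batch_sampler` argument; note that, as sketched, it yields batches indefinitely.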

πŸ’» Usage

🌈 Environment:

conda env create -f environment.yaml
conda activate kalm

⛏️ Hard-negative Mining (with Filtering):

bash ./scripts/hn_mine.sh

You can customize the filter_topk parameter to set the threshold for ranking consistency filtering.
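As a rough illustration of what ranking consistency filtering does, here is a hypothetical sketch (the function name, the use of sentence-transformers, and the example checkpoint are assumptions; the repository's script operates on mined hard-negative files instead): a training sample is kept only if the embedding model itself ranks the labeled positive within the top filter_topk of the candidate pool.

```python
from sentence_transformers import SentenceTransformer

# Any off-the-shelf embedding model; this checkpoint is an arbitrary example.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

def keep_sample(query: str, positive: str, candidates: list[str],
                filter_topk: int = 20) -> bool:
    """Keep the sample only if the positive ranks within `filter_topk`."""
    emb = model.encode([query, positive] + candidates,
                       normalize_embeddings=True)
    q, docs = emb[0], emb[1:]                 # docs[0] is the labeled positive
    scores = docs @ q                         # cosine similarity (vectors are normalized)
    better = int((scores > scores[0]).sum())  # candidates scoring above the positive
    return better < filter_topk
```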

πŸ”₯ Training:

bash ./scripts/train.sh
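Training applies Matryoshka Representation Learning (listed under Features): the contrastive objective is computed at several nested embedding dimensions so that truncated embeddings remain usable on their own. Below is a minimal sketch of such a loss with in-batch negatives; the nesting schedule and temperature are illustrative assumptions, not the repository's exact configuration.

```python
import torch
import torch.nn.functional as F

def matryoshka_infonce(q: torch.Tensor, p: torch.Tensor,
                       dims=(64, 128, 256, 512, 896),
                       temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE with in-batch negatives, averaged over nested prefix dims.

    q, p: [batch, dim] query / positive embeddings. `dims` is an assumed
    nesting schedule; the last entry should equal the full embedding dim.
    """
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    loss = q.new_zeros(())
    for d in dims:
        qd = F.normalize(q[:, :d], dim=-1)   # truncate, then re-normalize
        pd = F.normalize(p[:, :d], dim=-1)
        logits = qd @ pd.T / temperature     # in-batch negatives
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(dims)
```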

πŸ“Š Evaluation:

We provide code for evaluating on MTEB with multiple GPUs: each task in the task set is dispatched to a single GPU in a queue-based manner, which substantially speeds up evaluation.

bash ./scripts/eval_mteb.sh
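For reference, here is a stripped-down sketch of the queue-based scheme. It is not the repository's actual script: the task list, model id, and GPU count are placeholders, and the `mteb` API shown matches older releases, where `MTEB(tasks=[...])` accepts task names directly.

```python
import multiprocessing as mp
import os
from queue import Empty

TASKS = ["Banking77Classification", "STS22", "SciFact"]  # example MTEB task names

def worker(gpu_id: int, task_queue) -> None:
    # Pin this process to one GPU before any CUDA library is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    import mteb
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("HIT-TMG/KaLM-embedding-multilingual-mini-v1")
    while True:
        try:
            task = task_queue.get_nowait()
        except Empty:                        # queue drained: this worker is done
            return
        mteb.MTEB(tasks=[task]).run(model, output_folder=f"results/gpu{gpu_id}")

if __name__ == "__main__":
    task_queue = mp.Queue()
    for t in TASKS:
        task_queue.put(t)
    workers = [mp.Process(target=worker, args=(g, task_queue)) for g in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Because workers pull from a shared queue as soon as they finish, fast tasks do not leave GPUs idle waiting for slow ones.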

Below we present a subset of the MTEB results. For a more comprehensive analysis, please refer to our technical report.

| Model Name | Model Size | MTEB(zh) | MTEB(en) | MTEB(fr) | MTEB(pl) | Avg. |
|---|---|---|---|---|---|---|
| multilingual-e5-large | 560M | 58.54 | 60.89 | 55.64 | 60.08 | 58.79 |
| bge-m3 (dense) | 560M | 61.07 | 59.57 | 58.79 | 60.35 | 59.95 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 59.79 | 58.22 | 60.53 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 60.59 | 54.79 | 59.89 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.04 | 58.16 | 62.62 |
| KaLM-embedding-multilingual-mini-instruct-v1.5 | 494M | 64.13 | 64.94 | 63.08 | 57.05 | 62.3 |

πŸ“’ Acknowledgements

Our training code was forked from FlagOpen/FlagEmbedding. We have modified it to suit our needs, but the core functionality and structure derive from their excellent work. Please check out their repository for more details!

πŸ”— Citation

Please cite our paper if you use the models or code from this repository.

@article{hu2025kalm,
  title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model},
  author={Hu, Xinshuo and Shan, Zifei and Zhao, Xinping and Sun, Zetian and Liu, Zhenyu and Li, Dongfang and Ye, Shaolin and Wei, Xinyuan and Chen, Qian and Hu, Baotian and others},
  journal={arXiv preprint arXiv:2501.01028},
  year={2025}
}

πŸ“œ License

This repository is released under the MIT License.