
KaLM-Embedding


✨ Overview

Code for training and evaluation of our KaLM-Embedding models.

For a more comprehensive understanding of the technical details, please refer to our paper KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model.

⚑ Features

  • Training
    • Ranking Consistency Filtering
    • Semi-homogeneous Task Batching (batch sampler sketched below)
    • Matryoshka Representation Learning
  • Evaluation
    • Multi-GPU Asynchronous Computation
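To make the batching strategy above concrete, here is a minimal, hypothetical sketch of a semi-homogeneous batch sampler. It is not the repository's actual implementation: the class name, the `task_to_indices` mapping, and the `homogeneous_ratio` parameter are assumptions made for illustration. The idea is that most of each batch comes from a single task, so in-batch negatives stay distributionally consistent, while a small mixed remainder preserves diversity.

```python
import random
from typing import Dict, Iterator, List

class SemiHomogeneousBatchSampler:
    """Hypothetical sketch of semi-homogeneous task batching.

    Most of each batch is drawn from one randomly chosen task; the
    remainder is drawn from the global pool to keep some diversity.
    """

    def __init__(self, task_to_indices: Dict[str, List[int]],
                 batch_size: int, homogeneous_ratio: float = 0.8):
        self.task_to_indices = task_to_indices
        self.batch_size = batch_size
        self.n_homogeneous = int(batch_size * homogeneous_ratio)
        self.all_indices = [i for ids in task_to_indices.values() for i in ids]

    def __iter__(self) -> Iterator[List[int]]:
        tasks = list(self.task_to_indices)
        while True:
            task = random.choice(tasks)          # anchor task for this batch
            pool = self.task_to_indices[task]
            batch = random.sample(pool, min(self.n_homogeneous, len(pool)))
            while len(batch) < self.batch_size:  # top up from any task
                batch.append(random.choice(self.all_indices))
            yield batch
```

A sampler like this would plug into a PyTorch `DataLoader` via its `batch_sampler` argument; note that, as sketched, it yields batches indefinitely.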

πŸ’» Usage

🌈 Environment:

conda env create -f environment.yaml
conda activate kalm

⛏️ Hard-negative Mining (with Filtering):

bash ./scripts/hn_mine.sh

You can customize the filter_topk parameter to set the threshold for ranking consistency filtering.
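As a rough illustration of what ranking consistency filtering does, here is a hypothetical sketch (the function name, the use of sentence-transformers, and the example checkpoint are assumptions; the repository's script operates on mined hard-negative files instead): a training sample is kept only if the embedding model itself ranks the labeled positive within the top filter_topk of the candidate pool.

```python
from sentence_transformers import SentenceTransformer

# Any off-the-shelf embedding model; this checkpoint is an arbitrary example.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

def keep_sample(query: str, positive: str, candidates: list[str],
                filter_topk: int = 20) -> bool:
    """Keep the sample only if the positive ranks within `filter_topk`."""
    emb = model.encode([query, positive] + candidates,
                       normalize_embeddings=True)
    q, docs = emb[0], emb[1:]                 # docs[0] is the labeled positive
    scores = docs @ q                         # cosine similarity (vectors are normalized)
    better = int((scores > scores[0]).sum())  # candidates scoring above the positive
    return better < filter_topk
```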

πŸ”₯ Training:

bash ./scripts/train.sh
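Training applies Matryoshka Representation Learning (listed under Features): the contrastive objective is computed at several nested embedding dimensions so that truncated embeddings remain usable on their own. Below is a minimal sketch of such a loss with in-batch negatives; the nesting schedule and temperature are illustrative assumptions, not the repository's exact configuration.

```python
import torch
import torch.nn.functional as F

def matryoshka_infonce(q: torch.Tensor, p: torch.Tensor,
                       dims=(64, 128, 256, 512, 896),
                       temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE with in-batch negatives, averaged over nested prefix dims.

    q, p: [batch, dim] query / positive embeddings. `dims` is an assumed
    nesting schedule; the last entry should equal the full embedding dim.
    """
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    loss = q.new_zeros(())
    for d in dims:
        qd = F.normalize(q[:, :d], dim=-1)   # truncate, then re-normalize
        pd = F.normalize(p[:, :d], dim=-1)
        logits = qd @ pd.T / temperature     # in-batch negatives
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(dims)
```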

πŸ“Š Evaluation:

We provide code for evaluating on MTEB with multiple GPUs: each task in the task set is dispatched to a single GPU in a queue-based manner, which substantially speeds up evaluation.

bash ./scripts/eval_mteb.sh
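For reference, here is a stripped-down sketch of the queue-based scheme. It is not the repository's actual script: the task list, model id, and GPU count are placeholders, and the `mteb` API shown matches older releases, where `MTEB(tasks=[...])` accepts task names directly.

```python
import multiprocessing as mp
import os
from queue import Empty

TASKS = ["Banking77Classification", "STS22", "SciFact"]  # example MTEB task names

def worker(gpu_id: int, task_queue) -> None:
    # Pin this process to one GPU before any CUDA library is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    import mteb
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("HIT-TMG/KaLM-embedding-multilingual-mini-v1")
    while True:
        try:
            task = task_queue.get_nowait()
        except Empty:                        # queue drained: this worker is done
            return
        mteb.MTEB(tasks=[task]).run(model, output_folder=f"results/gpu{gpu_id}")

if __name__ == "__main__":
    task_queue = mp.Queue()
    for t in TASKS:
        task_queue.put(t)
    workers = [mp.Process(target=worker, args=(g, task_queue)) for g in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Because workers pull from a shared queue as soon as they finish, fast tasks do not leave GPUs idle waiting for slow ones.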

Below we present a subset of the MTEB results. For a more comprehensive analysis, please refer to our technical report.

| Model Name | Model Size | MTEB(zh) | MTEB(en) | MTEB(fr) | MTEB(pl) | Avg. |
|---|---|---|---|---|---|---|
| multilingual-e5-large | 560M | 58.54 | 60.89 | 55.64 | 60.08 | 58.79 |
| bge-m3 (dense) | 560M | 61.07 | 59.57 | 58.79 | 60.35 | 59.95 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 59.79 | 58.22 | 60.53 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 60.59 | 54.79 | 59.89 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.04 | 58.16 | 62.62 |
| KaLM-embedding-multilingual-mini-instruct-v1.5 | 494M | 64.13 | 64.94 | 63.08 | 57.05 | 62.3 |

πŸ“’ Acknowledgements

Our training code was forked from FlagOpen/FlagEmbedding. We have modified it to suit our needs, but the core functionality and structure derive from their excellent work. Please check out their repository for more details!

πŸ”— Citation

Please cite our paper if you use the models or code from this repository.

@article{hu2025kalm,
  title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model},
  author={Hu, Xinshuo and Shan, Zifei and Zhao, Xinping and Sun, Zetian and Liu, Zhenyu and Li, Dongfang and Ye, Shaolin and Wei, Xinyuan and Chen, Qian and Hu, Baotian and others},
  journal={arXiv preprint arXiv:2501.01028},
  year={2025}
}

πŸ“œ License

This repository is released under the MIT License.