Releases · stanford-futuredata/ColBERT

Overview

Bug Fixes
- Fix the AdamW import error in training.py (#390).
- Update loaders.py to use the r flag for regex (#368).
- Use the correct GitPython in setup.py (#388).

Improvements
- Omit tqdm when pooling embeddings for a single document (#367).
- Do not do the initial retrieval if pids are passed in (#352).

The main fix is #390, which resolves the AdamW import error (transformers.AdamW is deprecated) and allows you to use the latest version of transformers. However, the latest transformers depends on torch>=2.1, while ColBERT currently depends on torch==1.13.1. As shown in the sections below, using the latest versions of torch (2.8.0) and transformers (4.55.0) does not break core functionality, but it does produce different indexing, retrieval and training results than torch==1.13.1 with transformers==4.38.2 (the last version that had AdamW).

ColBERT 0.2.21 and 0.2.22 Comparison

In this section, I compare the index artifacts, retrieval artifacts, and training logs of two colbert-ai installations:

- The previous release on PyPI (0.2.21; August 20, 2024) with torch==1.13.1 and transformers==4.38.2 (the last version with AdamW), referred to as "0.2.21" below.
- The latest release (0.2.22; August 10, 2025) with torch==2.8.0 and transformers==4.55.0 (the latest versions of both), referred to as "0.2.22" below.

My indexing, retrieval, training and comparison scripts can be found in my colbert-maintenance repo.
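Since the two installs differ only in their package versions, it can help to record those versions next to every artifact. The snippet below is a minimal sketch and is not taken from the colbert-maintenance scripts; the helper name `installed_versions` is hypothetical, and it relies only on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("colbert-ai", "torch", "transformers")):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver}")
```

Running this once in each environment gives a label that can be stored alongside the index, the retrieval rankings and the training logs.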

ConditionalQA Index Artifacts

I compared the shapes and values (using torch.allclose) of the .pt files in both indexes (of the UKPLab/DAPR/ConditionalQA document collection). All but one pair of tensors have the same shapes. The exception is an important one: ivf.pid.pt, the inverted file that maps each centroid ID to the IDs of the passages assigned to it. None of the tensors' values pass torch.allclose:

File	Shapes match	Values match
0.codes.pt	Yes	No
0.residuals.pt	Yes	No
1.codes.pt	Yes	No
1.residuals.pt	Yes	No
2.codes.pt	Yes	No
2.residuals.pt	Yes	No
avg_residual.pt	Yes	No
buckets.pt	Yes	No
centroids.pt	Yes	No
ivf.pid.pt	No	No

The changes from 0.2.21 to 0.2.22 result in nearly wholesale changes to the index artifacts.
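A comparison like the one in the table can be sketched as follows. This is an illustrative sketch, not the script from the colbert-maintenance repo: the function names are hypothetical, it assumes each .pt file holds either a tensor or a tuple/list of tensors, it uses the default torch.allclose tolerances, and it casts to float64 before comparing so that half-precision and integer tensors go through the same code path.

```python
from pathlib import Path

import torch


def as_tensors(obj):
    """Flatten a loaded .pt payload (a tensor or nested tuple/list) into a list of tensors."""
    if isinstance(obj, torch.Tensor):
        return [obj]
    if isinstance(obj, (tuple, list)):
        return [t for item in obj for t in as_tensors(item)]
    return []  # ignore non-tensor payloads


def compare_pt_file(path_a, path_b):
    """Return (shapes_match, values_match) for one pair of .pt files."""
    a = as_tensors(torch.load(path_a, map_location="cpu"))
    b = as_tensors(torch.load(path_b, map_location="cpu"))
    shapes_match = len(a) == len(b) and all(x.shape == y.shape for x, y in zip(a, b))
    # torch.allclose is only evaluated when the shapes agree.
    values_match = shapes_match and all(
        torch.allclose(x.to(torch.float64), y.to(torch.float64)) for x, y in zip(a, b)
    )
    return shapes_match, values_match


def compare_indexes(index_a, index_b):
    """Compare every .pt file that exists in both index directories."""
    index_a, index_b = Path(index_a), Path(index_b)
    results = {}
    for file_a in sorted(index_a.glob("*.pt")):
        file_b = index_b / file_a.name
        if file_b.exists():
            results[file_a.name] = compare_pt_file(file_a, file_b)
    return results
```

Calling `compare_indexes` on the two index directories returns one `(shapes_match, values_match)` pair per file name, which corresponds to one row of the table above.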

ConditionalQA Retrieval Metrics

I compared aggregate and query-level metrics as well as raw retrieved passage IDs between the two colbert-ai installations.

PyTorch version	Transformers version	Mean Recall@10	Mean MRR@10
1.13.1	4.38.2	0.1309418985666801	0.1769138405669771
2.8.0	4.55.0	0.12709371722772383	0.17931236455221697

The changes in PyTorch (1.13.1 to 2.8.0) and Transformers (4.38.2 to 4.55.0) result in a decrease in Mean Recall@10 (0.1309 to 0.1271) and an increase in Mean MRR@10 (0.1769 to 0.1793). On average, one retrieved passage per query differs between the two versions.
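For reference, the per-query quantities behind these numbers can be sketched as below. The release notes do not spell out the exact formulas, so this assumes the usual definitions: Recall@k is the fraction of relevant passages found in the top k, and MRR@k is the reciprocal rank of the first relevant passage in the top k (0 if there is none). The function names are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant passage IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)


def mrr_at_k(retrieved, relevant, k=10):
    """Reciprocal rank of the first relevant passage in the top k, or 0.0 if none."""
    relevant = set(relevant)
    for rank, pid in enumerate(retrieved[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def num_different(retrieved_a, retrieved_b, k=10):
    """Number of top-k passages from the first run that are absent from the second."""
    return len(set(retrieved_a[:k]) - set(retrieved_b[:k]))
```

Averaging `recall_at_k` and `mrr_at_k` over all queries gives the mean metrics, and averaging `num_different` gives the number of differing passages per query.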

Here are the query-level differences in metrics, where "Increase" means the metric increased in the 0.2.22 install (relative to 0.2.21).

Recall@10 Difference	Count
Equal	264
Decrease	4
Increase	3

MRR@10 Difference	Count
Equal	256
Decrease	8
Increase	7

In most cases, the query-level metrics are equal for 0.2.21 and 0.2.22. With 0.2.22, Recall@10 decreases for 4 queries and increases for 3, while MRR@10 decreases for 8 queries and increases for 7.
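The Equal/Decrease/Increase counts above can be produced with a tally like the following. This is a sketch under assumptions: the per-query metrics are held in two dictionaries keyed by query ID (a hypothetical layout), and values within a small tolerance are treated as equal.

```python
import math
from collections import Counter


def tally_differences(old, new, tol=1e-12):
    """Count queries whose metric is equal, higher, or lower in `new` relative to `old`."""
    counts = Counter()
    for qid, old_val in old.items():
        new_val = new[qid]
        if math.isclose(old_val, new_val, abs_tol=tol):
            counts["Equal"] += 1
        elif new_val > old_val:
            counts["Increase"] += 1
        else:
            counts["Decrease"] += 1
    return counts


# Toy values for illustration only, not the ConditionalQA results.
old_scores = {"q1": 0.5, "q2": 1.0, "q3": 0.0}
new_scores = {"q1": 0.5, "q2": 0.5, "q3": 0.1}
print(tally_differences(old_scores, new_scores))
```

Here `old` would hold the 0.2.21 metrics and `new` the 0.2.22 metrics, so "Increase" has the same meaning as in the tables.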

Training Dynamics (MS MARCO)

I trained the default bert-base-uncased model for 1000 batches with collection.tsv, queries.train.tsv and triples.train.small.json (a JSON file created from the available TSV).

0.2.22 results in a higher training loss, higher mean positive document score and higher mean negative document score.

Multi-GPU Setting

I also ran indexing, search and training for the two colbert-ai installs on 4 x L4 GPUs. For indexing, the results are similar: all but one index tensor's shapes match, and none of the index tensor values match.

In the case of multi-GPU search, the Mean Recall@10 and Mean MRR@10 both decrease for 0.2.22:

PyTorch version	Transformers version	Mean Recall@10	Mean MRR@10
1.13.1	4.38.2	0.1290968801164956	0.17891261055467697
2.8.0	4.55.0	0.12869273321788374	0.17573361447900193

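At the query level, multi-GPU search with 0.2.22 lowers MRR@10 for 10 queries and raises it for 8, a net of 2 more queries with a lower MRR@10. Recall@10 decreases for 2 queries and increases for 2.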

Recall@10 Difference	Count
Equal	267
Decrease	2
Increase	2

MRR@10 Difference	Count
Equal	253
Decrease	10
Increase	8

For multi-GPU training, 0.2.22 results in a lower training loss, and similar mean positive and negative scores.

Backwards Compatibility

Installing colbert-ai[torch]==0.2.22 with transformers==4.38.2 yields the same index, search and training artifacts as colbert-ai[torch]==0.2.21 with transformers==4.38.2.

In conclusion, using the latest PyTorch and Transformers versions yields different index artifacts, and thus different retrieval results, as well as different training logs, but the core functionality is not broken in 0.2.22. With the deprecated AdamW import fixed, users can now run pip install colbert-ai without error, install the latest transformers and torch versions, and use the core functionality. The differences in indexing, search and training across PyTorch versions will be investigated and documented before colbert-ai's torch dependency is changed to 2.x.
