Releases: stanford-futuredata/ColBERT
ColBERT 0.2.22
Overview
Bug Fixes
- fix: AdamW import error #390 in training.py.
- Update loaders.py to use the `r` flag for regex #368.
- Use correct GitPython #388 in setup.py.
Improvements
- Omit tqdm when pooling embeddings for a single document #367.
- Do not do initial retrieval if pids are passed in #352.
The main fix is #390, which resolves the AdamW import error (transformers.AdamW is deprecated). This allows you to use the latest version of transformers. However, the latest transformers depends on torch>=2.1, while ColBERT currently depends on torch==1.13.1. As shown in the sections below, using the latest versions of torch (2.8.0) and transformers (4.55.0) does not break core functionality, but it does produce different indexing, retrieval and training results than torch==1.13.1 with transformers==4.38.2 (the last version that had AdamW).
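A common way to handle this kind of deprecation is to take AdamW from torch.optim, which has long provided it, rather than from transformers. The sketch below shows a generic fallback import. `first_available` is a hypothetical helper written for illustration, and it is not necessarily how #390 changed training.py.

```python
import importlib


def first_available(attr, module_names):
    """Return (module_name, obj) for the first importable module exposing `attr`."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Module is not installed (or fails to import); try the next one.
            continue
        if hasattr(module, attr):
            return name, getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of {list(module_names)}")


if __name__ == "__main__":
    # Prefer torch.optim.AdamW; fall back to transformers.AdamW on older installs.
    try:
        source, AdamW = first_available("AdamW", ["torch.optim", "transformers"])
        print(f"Using AdamW from {source}")
    except ImportError as err:
        print(f"No AdamW implementation could be imported: {err}")
```

Listing torch.optim first means the same code path works whether or not the installed transformers release still ships AdamW.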
ColBERT 0.2.21 and 0.2.22 Comparison
In this section, I'm going to compare the index artifacts, retrieval artifacts, and training logs between two different colbert-ai installations:
- The previous release on PyPI (0.2.21; August 20, 2024) with `torch==1.13.1` and `transformers==4.38.2` (the last version with `AdamW`), referred to as "0.2.21" below.
- The latest release (0.2.22; August 10, 2025) with `torch==2.8.0` and `transformers==4.55.0` (the latest versions of both), referred to as "0.2.22" below.
My indexing, retrieval, training and comparison scripts can be found in my colbert-maintenance repo.
ConditionalQA Index Artifacts
I compared the shapes and values (torch.allclose) of the .pt files in both indexes (of the UKPLab/DAPR/ConditionalQA document collection). All but one pair of tensors have the same shape. The exception is an important one: ivf.pid.pt, which maps document token embedding IDs to centroid IDs. None of the tensors' values pass torch.allclose:
| File | Shapes match | Values match |
|---|---|---|
| 0.codes.pt | Yes | No |
| 0.residuals.pt | Yes | No |
| 1.codes.pt | Yes | No |
| 1.residuals.pt | Yes | No |
| 2.codes.pt | Yes | No |
| 2.residuals.pt | Yes | No |
| avg_residual.pt | Yes | No |
| buckets.pt | Yes | No |
| centroids.pt | Yes | No |
| ivf.pid.pt | No | No |
The changes from 0.2.21 to 0.2.22 result in nearly wholesale index artifact changes.
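A comparison like the one in the table can be sketched as follows. This is a minimal illustration, not the script from the colbert-maintenance repo: the function names are hypothetical, and it assumes each .pt file holds either a single tensor or a tuple/list of tensors. Shapes are checked first because torch.allclose is only meaningful for tensors of matching shape.

```python
from pathlib import Path


def compare_tensors(a, b, allclose):
    """Return (shapes_match, values_match) for two tensor-like objects.

    Values are only compared when the shapes agree, because an element-wise
    closeness check is not meaningful (and may raise) for mismatched shapes.
    """
    shapes_match = tuple(a.shape) == tuple(b.shape)
    values_match = shapes_match and bool(allclose(a, b))
    return shapes_match, values_match


def as_tensor_list(obj):
    """Treat a saved tuple/list of tensors as a list; wrap a single tensor."""
    return list(obj) if isinstance(obj, (tuple, list)) else [obj]


def compare_index_dirs(dir_a, dir_b):
    """Compare every .pt file that exists in both index directories."""
    import torch  # imported lazily so the helpers above work without torch

    report = {}
    for path_a in sorted(Path(dir_a).glob("*.pt")):
        path_b = Path(dir_b) / path_a.name
        if not path_b.exists():
            continue
        items_a = as_tensor_list(torch.load(path_a, map_location="cpu"))
        items_b = as_tensor_list(torch.load(path_b, map_location="cpu"))
        results = [
            compare_tensors(a, b, torch.allclose)
            for a, b in zip(items_a, items_b)
        ]
        same_len = len(items_a) == len(items_b)
        report[path_a.name] = (
            same_len and all(shape_ok for shape_ok, _ in results),
            same_len and all(value_ok for _, value_ok in results),
        )
    return report
```

Calling `compare_index_dirs` on two index directories (paths are up to you) returns a dict mapping each file name to a `(shapes_match, values_match)` pair, which corresponds to one row of the table.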
ConditionalQA Retrieval Metrics
I compared aggregate and query-level metrics as well as raw retrieved passage IDs between the two colbert-ai installations.
| PyTorch version | Transformers version | Mean Recall@10 | Mean MRR@10 |
|---|---|---|---|
| 1.13.1 | 4.38.2 | 0.1309418985666801 | 0.1769138405669771 |
| 2.8.0 | 4.55.0 | 0.12709371722772383 | 0.17931236455221697 |
The changes in PyTorch (1.13.1 to 2.8.0) and Transformers (4.38.2 to 4.55.0) result in a decrease in Mean Recall@10 of about 0.0038 and an increase in Mean MRR@10 of about 0.0024. On average, 1 retrieved passage per query differs between the two versions.
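For reference, here is a minimal sketch of how Recall@10 and MRR@10 are commonly computed per query and then averaged. The function names are hypothetical and the definitions are the usual ones; the scripts behind the numbers above may differ in details such as how queries without relevant passages are handled.

```python
def recall_at_k(ranked_pids, relevant_pids, k=10):
    """Fraction of relevant passages that appear in the top-k results."""
    relevant = set(relevant_pids)
    if not relevant:
        return 0.0  # assumption: queries with no relevant passages score 0
    hits = relevant.intersection(ranked_pids[:k])
    return len(hits) / len(relevant)


def mrr_at_k(ranked_pids, relevant_pids, k=10):
    """Reciprocal rank of the first relevant passage in the top-k, else 0."""
    relevant = set(relevant_pids)
    for rank, pid in enumerate(ranked_pids[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Arithmetic mean of an iterable; 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Toy data: passage IDs are made up for illustration.
    rankings = {"q1": [5, 3, 9], "q2": [8, 1, 2]}
    qrels = {"q1": [3, 7], "q2": [4]}
    print(mean(recall_at_k(rankings[q], qrels[q]) for q in rankings))
    print(mean(mrr_at_k(rankings[q], qrels[q]) for q in rankings))
```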
Here are the query-level differences in metrics, where "Increase" means the metric increased in the 0.2.22 install (relative to 0.2.21).
| Recall@10 Difference | Count |
|---|---|
| Equal | 264 |
| Decrease | 4 |
| Increase | 3 |
| MRR@10 Difference | Count |
|---|---|
| Equal | 256 |
| Decrease | 8 |
| Increase | 7 |
In most cases the query-level metrics are equal for 0.2.21 and 0.2.22. For 0.2.22, Recall@10 decreases for 4 queries and increases for 3, while MRR@10 decreases for 8 queries and increases for 7.
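The query-level tallies above, and the average number of differing passages per query, can be produced with a comparison along these lines. This is a sketch with hypothetical function names; it assumes per-query metrics and rankings are stored in dicts keyed by query ID, with the same queries in both runs.

```python
import math
from collections import Counter


def classify_changes(old_scores, new_scores, tol=1e-12):
    """Count queries whose metric is equal, lower, or higher in the new run."""
    counts = Counter({"Equal": 0, "Decrease": 0, "Increase": 0})
    for qid, old in old_scores.items():
        new = new_scores[qid]
        if math.isclose(old, new, rel_tol=0.0, abs_tol=tol):
            counts["Equal"] += 1
        elif new < old:
            counts["Decrease"] += 1
        else:
            counts["Increase"] += 1
    return dict(counts)


def mean_differing_passages(old_ranking, new_ranking, k=10):
    """Average number of top-k pids in the old ranking missing from the new one."""
    diffs = [
        len(set(old_ranking[qid][:k]) - set(new_ranking[qid][:k]))
        for qid in old_ranking
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

Here "Increase" means the metric is higher in the second argument, matching the convention used in the tables.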
Training Dynamics (MS MARCO)
I trained the default bert-base-uncased model for 1000 batches with collection.tsv, queries.train.tsv and triples.train.small.json (JSON file created from the available tsv).
0.2.22 results in a higher training loss, higher mean positive document score and higher mean negative document score.
Multi-GPU Setting
I also ran indexing, search and training for the two colbert-ai installs on 4 x L4 GPUs. For indexing, the results are similar: all but one index tensor's shapes match, and none of the index tensor values match.
In the case of multi-GPU search, the Mean Recall@10 and Mean MRR@10 both decrease for 0.2.22:
| PyTorch version | Transformers version | Mean Recall@10 | Mean MRR@10 |
|---|---|---|---|
| 1.13.1 | 4.38.2 | 0.1290968801164956 | 0.17891261055467697 |
| 2.8.0 | 4.55.0 | 0.12869273321788374 | 0.17573361447900193 |
At the query level, multi-GPU search yields 10 MRR@10 decreases and 8 increases for 0.2.22, a net of 2 more decreases. Recall@10 decreases and increases are balanced at 2 each.
| Recall@10 Difference | Count |
|---|---|
| Equal | 267 |
| Decrease | 2 |
| Increase | 2 |
| MRR@10 Difference | Count |
|---|---|
| Equal | 253 |
| Decrease | 10 |
| Increase | 8 |
For multi-GPU training, 0.2.22 results in a lower training loss, and similar mean positive and negative scores.
Backwards Compatibility
Installing colbert-ai[torch]==0.2.22 with transformers==4.38.2 yields the same index, search and training artifacts as colbert-ai[torch]==0.2.21 with transformers==4.38.2.
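When reproducing a comparison like this, it helps to record exactly which package versions are active in each environment. The snippet below is a small sketch using the standard library's importlib.metadata; `installed_versions` is a hypothetical helper, and the default package names are simply the three discussed in these notes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("colbert-ai", "torch", "transformers")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```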
In conclusion, using the latest PyTorch and Transformers versions yields different index artifacts, and thus different retrieval results, as well as different training logs, but the core functionality is not broken in 0.2.22. With the deprecated AdamW import fixed, users can now run pip install colbert-ai without error, install the latest transformers and torch versions, and use the core functionality. The differences between index, search and training results across PyTorch versions will be investigated and documented before colbert-ai's torch dependency is changed to 2.x.