
ColBERT 0.2.22

@vishalbakshi vishalbakshi released this 11 Aug 03:22
· 2 commits to main since this release
501c29d

Overview

The main fix is #390, which resolves the AdamW import error (transformers.AdamW is deprecated and absent from recent transformers releases). This allows you to use the latest version of transformers. However, the latest transformers depends on torch>=2.1, while ColBERT currently depends on torch==1.13.1. As shown in the sections below, using the latest versions of torch (2.8.0) and transformers (4.55.0) does not break core functionality, but it does produce different indexing, retrieval and training results than torch==1.13.1 with transformers==4.38.2 (the last version that had AdamW).
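One way to write optimizer setup that works on both sides of the AdamW removal is to look the class up at runtime. The sketch below is an illustration only and is not the code from #390; the helper name `resolve_adamw` and its `candidates` parameter are hypothetical. It prefers `torch.optim.AdamW` and falls back to `transformers.AdamW` when that is all that is available.

```python
import importlib


def resolve_adamw(candidates=("torch.optim", "transformers")):
    """Return the first AdamW class found in the candidate modules, else None.

    Hypothetical helper for illustration; not part of colbert-ai.
    """
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The package is not installed; try the next candidate.
            continue
        # Recent transformers releases no longer expose AdamW, so use a default.
        optimizer_cls = getattr(module, "AdamW", None)
        if optimizer_cls is not None:
            return optimizer_cls
    return None
```

With torch installed, `resolve_adamw()` returns `torch.optim.AdamW` regardless of the transformers version, which is why listing torch first avoids the import error.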

ColBERT 0.2.21 and 0.2.22 Comparison

In this section, I'm going to compare the index artifacts, retrieval artifacts, and training logs between two different colbert-ai installations:

  • The previous release on PyPI (0.2.21; August 20, 2024) with torch==1.13.1 and transformers==4.38.2 (the last version with AdamW), referred to as "0.2.21" below.
  • The latest release (0.2.22; August 10, 2025) with torch==2.8.0 and transformers==4.55.0 (the latest versions of both), referred to as "0.2.22" below.
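When switching between two environments like these, it helps to confirm which versions are actually installed before running anything. A minimal sketch using only the standard library follows; the function name `installed_versions` is hypothetical, and the package names are the PyPI distribution names mentioned above.

```python
from importlib import metadata


def installed_versions(packages=("colbert-ai", "torch", "transformers")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Printing `installed_versions()` at the top of an indexing or training script records the environment alongside the results.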

My indexing, retrieval, training and comparison scripts can be found in my colbert-maintenance repo.

ConditionalQA Index Artifacts

I compared the shapes and values (torch.allclose) of the .pt files in both indexes (of the UKPLab/DAPR/ConditionalQA document collection). All but one pair of tensors have the same shapes, and the exception is an important one: ivf.pid.pt, the inverted file that links centroid IDs to the passage IDs whose token embeddings are assigned to them. None of the tensors' values pass torch.allclose:

| File | Shapes match | Values match |
|---|---|---|
| 0.codes.pt | Yes | No |
| 0.residuals.pt | Yes | No |
| 1.codes.pt | Yes | No |
| 1.residuals.pt | Yes | No |
| 2.codes.pt | Yes | No |
| 2.residuals.pt | Yes | No |
| avg_residual.pt | Yes | No |
| buckets.pt | Yes | No |
| centroids.pt | Yes | No |
| ivf.pid.pt | No | No |

The changes from 0.2.21 to 0.2.22 alter nearly every aspect of the index artifacts: no tensor's values match, and one tensor's shape differs.
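The comparison above can be sketched roughly as follows. This is not the script from my colbert-maintenance repo; the function names are hypothetical, and it assumes each .pt file holds either a single tensor or a tuple/list of tensors. Values are cast to float64 before torch.allclose so that integer and half-precision tensors are compared the same way.

```python
from pathlib import Path

import torch


def _as_tensors(obj):
    """Normalize a loaded object to a list of tensors (assumed file layout)."""
    if torch.is_tensor(obj):
        return [obj]
    return [t for t in obj if torch.is_tensor(t)]


def compare_index_dirs(dir_a, dir_b):
    """Return (filename, shapes_match, values_match) for .pt files in both dirs."""
    rows = []
    for path_a in sorted(Path(dir_a).glob("*.pt")):
        path_b = Path(dir_b) / path_a.name
        if not path_b.exists():
            continue  # only compare files present in both indexes
        tensors_a = _as_tensors(torch.load(path_a, map_location="cpu"))
        tensors_b = _as_tensors(torch.load(path_b, map_location="cpu"))
        shapes_match = len(tensors_a) == len(tensors_b) and all(
            a.shape == b.shape for a, b in zip(tensors_a, tensors_b)
        )
        # allclose raises on non-broadcastable shapes, so check shapes first.
        values_match = shapes_match and all(
            torch.allclose(a.double(), b.double())
            for a, b in zip(tensors_a, tensors_b)
        )
        rows.append((path_a.name, shapes_match, values_match))
    return rows
```

Calling `compare_index_dirs(index_0221, index_0222)` with two index directories yields one row per shared file, in the same layout as the table.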

ConditionalQA Retrieval Metrics

I compared aggregate and query-level metrics as well as raw retrieved passage IDs between the two colbert-ai installations.

| PyTorch version | Transformers version | Mean Recall@10 | Mean MRR@10 |
|---|---|---|---|
| 1.13.1 | 4.38.2 | 0.1309418985666801 | 0.1769138405669771 |
| 2.8.0 | 4.55.0 | 0.12709371722772383 | 0.17931236455221697 |

The changes in PyTorch (1.13.1 to 2.8.0) and Transformers (4.38.2 to 4.55.0) result in a decrease in Mean Recall@10 (about 0.1309 to 0.1271) and an increase in Mean MRR@10 (about 0.1769 to 0.1793). On average, 1 retrieved passage per query differs between the two versions.
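For reference, here is a minimal sketch of how per-query Recall@10, MRR@10 and the number of differing passages could be computed from ranked passage IDs. These are the standard definitions; the exact implementation in my comparison scripts may differ, and the function names here are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant passage IDs that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mrr_at_k(retrieved, relevant, k=10):
    """Reciprocal rank of the first relevant passage in the top-k, else 0."""
    relevant = set(relevant)
    for rank, pid in enumerate(retrieved[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def num_different(retrieved_a, retrieved_b, k=10):
    """Count passages in the top-k of run A that are missing from run B's top-k."""
    return len(set(retrieved_a[:k]) - set(retrieved_b[:k]))
```

Averaging each function over all queries gives the aggregate figures; MRR@10 can rise while Recall@10 falls because it depends only on the rank of the first relevant passage.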

Here are the query-level differences in metrics, where "Increase" means the metric increased in the 0.2.22 install (relative to 0.2.21).

| Difference | Recall@10 count | MRR@10 count |
|---|---|---|
| Equal | 264 | 256 |
| Decrease | 4 | 8 |
| Increase | 3 | 7 |

In most cases the query-level metrics are equal for 0.2.21 and 0.2.22: of 271 queries, Recall@10 is unchanged for 264 and MRR@10 is unchanged for 256. Recall@10 decreases for 4 queries and MRR@10 decreases for 8 queries with 0.2.22.
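The Equal/Decrease/Increase counts can be tallied from two per-query score dictionaries. The sketch below assumes both runs cover the same query IDs; the function name `tally_changes` and the tolerance `tol` are hypothetical choices for illustration.

```python
from collections import Counter


def tally_changes(old_scores, new_scores, tol=1e-12):
    """Count queries whose metric is equal, lower, or higher in the new run."""
    counts = Counter({"Equal": 0, "Decrease": 0, "Increase": 0})
    for qid, old in old_scores.items():
        new = new_scores[qid]  # assumes every query ID appears in both runs
        if abs(new - old) <= tol:
            counts["Equal"] += 1
        elif new < old:
            counts["Decrease"] += 1
        else:
            counts["Increase"] += 1
    return dict(counts)
```

Here `old_scores` would hold the 0.2.21 values and `new_scores` the 0.2.22 values, so "Increase" means the metric went up in the 0.2.22 install.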

Training Dynamics (MS MARCO)

I trained the default bert-base-uncased model for 1000 batches with collection.tsv, queries.train.tsv and triples.train.small.json (JSON file created from the available tsv).

0.2.22 results in a higher training loss, higher mean positive document score and higher mean negative document score.

[Image: training loss and mean positive and negative document scores for 0.2.21 and 0.2.22]

Multi-GPU Setting

I also ran indexing, search and training for the two colbert-ai installs on 4 x L4 GPUs. For indexing, the results are similar to the single-GPU case: all but one index tensor's shapes match, and none of the index tensor values match.

In the case of multi-GPU search, the Mean Recall@10 and Mean MRR@10 both decrease for 0.2.22:

| PyTorch version | Transformers version | Mean Recall@10 | Mean MRR@10 |
|---|---|---|---|
| 1.13.1 | 4.38.2 | 0.1290968801164956 | 0.17891261055467697 |
| 2.8.0 | 4.55.0 | 0.12869273321788374 | 0.17573361447900193 |

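At the query level, multi-GPU search with 0.2.22 lowers MRR@10 for 10 queries and raises it for 8, a net of 2 more decreases than increases. Recall@10 decreases and increases for 2 queries each.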

| Difference | Recall@10 count | MRR@10 count |
|---|---|---|
| Equal | 267 | 253 |
| Decrease | 2 | 10 |
| Increase | 2 | 8 |

For multi-GPU training, 0.2.22 results in a lower training loss, and similar mean positive and negative scores.

[Image: multi-GPU training loss and mean positive and negative document scores for 0.2.21 and 0.2.22]

Backwards Compatibility

Installing colbert-ai[torch]==0.2.22 with transformers==4.38.2 yields the same index, search and training artifacts as colbert-ai[torch]==0.2.21 with transformers==4.38.2.


In conclusion, while using the latest PyTorch and Transformers versions yields different index artifacts (and thus different retrieval results) as well as different training logs, the core functionality is not broken in 0.2.22. With the deprecated AdamW import fixed, users can now run pip install colbert-ai without error, install the latest transformers and torch versions, and use the core functionality. The differences in indexing, search and training across PyTorch versions will be investigated and documented before colbert-ai's torch dependency is changed to 2.x.