# ColBERTv2: Indexing & Search Notebook

We start by importing the relevant classes. As we'll see below, `Indexer` and `Searcher` are the key actors here. 

In [3]:
import os
import sys
sys.path.insert(0, '../')

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection
from colbert import Indexer, Searcher

The workflow here assumes an IR dataset: a set of queries and a corresponding collection of passages.

The classes `Queries` and `Collection` provide a convenient interface for working with such datasets.
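
As a rough illustration of what these classes wrap, the sketch below parses tab-separated `id<TAB>text` lines into a dictionary. The two-column layout is an assumption about the LoTTE files, and `load_tsv_pairs` is a hypothetical helper, not part of the ColBERT API; in practice the real `Queries` and `Collection` classes are the better choice.

```python
import io


def load_tsv_pairs(lines):
    """Parse 'id<TAB>text' lines into a dict mapping int id -> text.

    Hypothetical helper for illustration; assumes one record per line.
    """
    records = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        record_id, text = line.split("\t", 1)  # split on the first tab only
        records[int(record_id)] = text
    return records


# Tiny in-memory stand-in for a file such as questions.search.tsv.
sample = io.StringIO(
    "0\twhat kind of coffee do you put in a coffee maker?\n"
    "1\twhat are white spots on raspberries?\n"
)
toy_queries = load_tsv_pairs(sample)
print(len(toy_queries), toy_queries[1])
```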

We will use the *dev set* of the **LoTTE benchmark** we recently introduced in the ColBERTv2 paper. The dev and test sets contain several domain-specific corpora, and we'll use the smallest dev set corpus, namely `lifestyle:dev`.

In [2]:
!mkdir -p downloads/

# ColBERTv2 checkpoint trained on MS MARCO Passage Ranking (388MB compressed)
!wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz -P downloads/
!tar -xvzf downloads/colbertv2.0.tar.gz -C downloads/

# The LoTTE dev and test sets (3.4GB compressed)
!wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz -P downloads/
!tar -xvzf downloads/lotte.tar.gz -C downloads/

--2022-09-30 09:57:15--  https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405924985 (387M) [application/octet-stream]
Saving to: ‘downloads/colbertv2.0.tar.gz’


2022-09-30 09:58:28 (5.32 MB/s) - ‘downloads/colbertv2.0.tar.gz’ saved [405924985/405924985]

colbertv2.0/
colbertv2.0/artifact.metadata
colbertv2.0/vocab.txt
colbertv2.0/tokenizer.json
colbertv2.0/special_tokens_map.json
colbertv2.0/tokenizer_config.json
colbertv2.0/config.json
colbertv2.0/pytorch_model.bin
--2022-09-30 09:58:36--  https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:4

In [4]:
dataroot = 'downloads/lotte'
dataset = 'lifestyle'
datasplit = 'dev'

queries = os.path.join(dataroot, dataset, datasplit, 'questions.search.tsv')
collection = os.path.join(dataroot, dataset, datasplit, 'collection.tsv')

queries = Queries(path=queries)
collection = Collection(path=collection)

f'Loaded {len(queries)} queries and {len(collection):,} passages'

[Sep 30, 10:58:21] #> Loading the queries from downloads/lotte/lifestyle/dev/questions.search.tsv ...
[Sep 30, 10:58:21] #> Got 417 queries. All QIDs are unique.

[Sep 30, 10:58:21] #> Loading collection...
0M 


'Loaded 417 queries and 268,893 passages'

This loaded 417 queries and 269k passages. Let's inspect one query and one passage.

In [5]:
print(queries[72])
print()
print(collection[123456])
print()

what kind of coffee do you put in a coffee maker?

Just call or e-mail them and ask for a better price. More often than not you'll get a discount. Works on appliances, materials, parts and everything else. Exception: big chain stores. My better half does this for most of the things we buy for the house. Maybe her voice plays a role here -- hard to say :)



## Indexing

For efficient search, we can pre-compute the ColBERT representation of each passage and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

(With four Titan V GPUs, indexing should take about 13 minutes. The output is fairly long/ugly at the moment!)

In [6]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300   # truncate passages at 300 tokens

checkpoint = 'downloads/colbertv2.0'
index_name = f'{dataset}.{datasplit}.{nbits}bits'
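
To get a feel for what `nbits` buys, here is a back-of-the-envelope size estimate. It assumes each token embedding is stored as a 4-byte centroid ID plus `dim * nbits` bits of residual, which agrees with the 20 or 36 bytes per vector reported for ColBERTv2 at 1 or 2 bits. The function names are hypothetical, and index metadata (IVF, centroids, document lengths) is ignored.

```python
def bytes_per_embedding(dim=128, nbits=2, centroid_id_bytes=4):
    """Approximate compressed size of one token embedding, in bytes."""
    residual_bits = dim * nbits
    return centroid_id_bytes + residual_bits // 8


def estimate_index_bytes(num_embeddings, dim=128, nbits=2):
    """Rough size of the compressed embeddings alone (no IVF or metadata)."""
    return num_embeddings * bytes_per_embedding(dim=dim, nbits=nbits)


def estimate_fp16_bytes(num_embeddings, dim=128):
    """Size of the same embeddings stored uncompressed at 16 bits per dimension."""
    return num_embeddings * dim * 2


# Embedding count taken from the `len(emb2pid)` line in the indexing log.
num_embeddings = 40_756_586
compressed = estimate_index_bytes(num_embeddings)
uncompressed = estimate_fp16_bytes(num_embeddings)
print(f"{compressed / 1e9:.2f} GB compressed vs {uncompressed / 1e9:.2f} GB at fp16")
```

The estimate only covers the embeddings themselves, so the index directory on disk will be somewhat larger.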

In [7]:
with Run().context(RunConfig(nranks=4, experiment='notebook')):  # nranks specifies the number of GPUs to use.
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection, overwrite=True)



[Sep 30, 10:58:32] #> Note: Output directory /future/u/udingank/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits already exists


#> Starting...
#> Starting...
#> Starting...
#> Starting...
nranks = 4 	 num_gpus = 4 	 device=3
[Sep 30, 10:58:53] [3] 		 #> Encoding 16879 passages..
nranks = 4 	 num_gpus = 4 	 device=0
{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "nbits": 2,
    "kmeans_niters": 20,
    "resume": false,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 1e-5,
    "maxsteps": 400000,
    "save_every": null,
    "warmup": 20000,
    "warmup_bert": null,
    "relu": false,
    "nway": 64,
    "use_ib_negatives": true,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 300,
    "mask_punctuation": true,
    "checkpoint": "

0it [00:00, ?it/s]

[Sep 30, 11:06:19] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Sep 30, 11:06:19] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Sep 30, 11:06:19] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Sep 30, 11:06:19] [0] 		 #> Encoding 25000 passages..
[Sep 30, 11:06:20] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Sep 30, 11:06:20] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Sep 30, 11:06:20] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Sep 30, 11:06:20] [3] 		 #> Encoding 25000 passages..
[Sep 30, 11:06:21] [1] 		 #> Encoding 25000 passages..
[Sep 30, 11:06:21] [2] 		 #> Encoding 25000 passages..
[Sep 30, 11:07:19] [0] 		 #> Saving chunk 0: 	 25,000 

1it [01:02, 62.83s/it]

[Sep 30, 11:07:22] [0] 		 #> Encoding 25000 passages..
[Sep 30, 11:07:25] [3] 		 #> Encoding 25000 passages..
[Sep 30, 11:07:25] [2] 		 #> Encoding 25000 passages..
[Sep 30, 11:07:25] [1] 		 #> Encoding 25000 passages..
[Sep 30, 11:08:22] [0] 		 #> Saving chunk 4: 	 25,000 passages and 3,953,755 embeddings. From #100,000 onward.


2it [02:06, 63.24s/it]

[Sep 30, 11:08:26] [0] 		 #> Encoding 25000 passages..
[Sep 30, 11:08:27] [2] 		 #> Encoding 18893 passages..
[Sep 30, 11:08:28] [1] 		 #> Encoding 25000 passages..
[Sep 30, 11:09:25] [0] 		 #> Saving chunk 8: 	 25,000 passages and 3,698,831 embeddings. From #200,000 onward.


3it [03:08, 62.89s/it]

[Sep 30, 11:09:34] [0] 		 #> Checking all files were saved...
[Sep 30, 11:09:34] [0] 		 Found all files!
[Sep 30, 11:09:34] [0] 		 #> Building IVF...
[Sep 30, 11:09:34] [0] 		 #> Loading codes...
[Sep 30, 11:09:34] [0] 		 Sorting codes...



100%|██████████| 11/11 [00:00<00:00, 65.57it/s]

[Sep 30, 11:09:38] [0] 		 Getting unique codes...
[Sep 30, 11:09:38] #> Optimizing IVF to store map from centroids to list of pids..
[Sep 30, 11:09:38] #> Building the emb2pid mapping..
[Sep 30, 11:09:39] len(emb2pid) = 40756586



100%|██████████| 65536/65536 [00:03<00:00, 16630.99it/s]

[Sep 30, 11:09:44] #> Saved optimized IVF to /future/u/udingank/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits/ivf.pid.pt
[Sep 30, 11:09:44] [0] 		 #> Saving the indexing metadata to /future/u/udingank/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits/metadata.json ..





#> Joined...
#> Joined...
#> Joined...
#> Joined...
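
The log above is consistent with the collection being cut into chunks of 25,000 passages that are dealt out to the four ranks in round-robin order: rank 0 saves chunks 0, 4, and 8, and rank 2 encodes the final 18,893 passages. The sketch below reproduces that arithmetic. It is an inference from the log rather than a description of the `Indexer` internals, and `plan_chunks` is a hypothetical name.

```python
def plan_chunks(num_passages, chunk_size, nranks):
    """Return a list of (chunk_idx, rank, offset, length) tuples.

    Assumes fixed-size chunks assigned round-robin: chunk i goes to rank i % nranks.
    """
    plan = []
    for chunk_idx, offset in enumerate(range(0, num_passages, chunk_size)):
        length = min(chunk_size, num_passages - offset)
        plan.append((chunk_idx, chunk_idx % nranks, offset, length))
    return plan


plan = plan_chunks(num_passages=268_893, chunk_size=25_000, nranks=4)
for chunk_idx, rank, offset, length in plan:
    print(f"chunk {chunk_idx:2d} -> rank {rank}: {length:,} passages from #{offset:,}")
```

With these inputs the plan has 11 chunks, which lines up with the `11/11` progress bars in the output.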


In [8]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/future/u/udingank/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits'

## Search

Having built the index, we can now prepare a `Searcher` and search for individual query strings.

We can use the `queries` set we loaded earlier, or you can supply your own questions. Feel free to get creative! But keep in mind that this set of ~269k lifestyle passages can only answer a small, focused set of questions!

In [9]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name)


# If you want to customize the search latency--quality tradeoff, you can also supply a
# config=ColBERTConfig(ncells=.., centroid_score_threshold=.., ndocs=..) argument.
# The default settings for k <= 10 (1, 0.5, 256) give the fastest search,
# but you can get a more exhaustive search by setting larger values of k or by
# manually specifying more conservative ColBERTConfig settings (e.g., (4, 0.4, 4096)).

[Sep 30, 11:14:29] #> Loading collection...
0M 
[Sep 30, 11:14:34] #> Loading codec...
[Sep 30, 11:14:34] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Sep 30, 11:14:34] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Sep 30, 11:14:35] #> Loading IVF...
[Sep 30, 11:14:35] #> Loading doclens...


100%|███████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 612.62it/s]

[Sep 30, 11:14:35] #> Loading codes and residuals...



100%|████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:01<00:00, 10.58it/s]


In [10]:
query = queries[37]   # or supply your own query

print(f"#> {query}")

# Find the top-3 passages for this query
results = searcher.search(query, k=3)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

#> what are white spots on raspberries?

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . what are white spots on raspberries?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  2024,  2317,  7516,  2006, 20710,  2361, 20968,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])

	 [1] 		 26.0 		 You've got a heat problem, this is UV damage or excessive heat during the ripening phase and referred to as White Drupelet syndrome (white spot). It's quite common on Raspberries during the final crops of the year as summer heat increases. Also occurs in blackberries. Last year, we had issues in the US Pacific Northwest with it, only a handful at end of season 
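
The loop above relies on `searcher.search` returning three parallel sequences (passage IDs, ranks, scores), which `zip(*results)` transposes into one tuple per passage. Here is a minimal sketch of that unpacking with made-up values, so it runs without an index:

```python
# Hypothetical stand-in for the return value of searcher.search(query, k=3);
# the IDs and scores below are invented for illustration.
results = ([5012, 88, 1304], [1, 2, 3], [26.0, 24.5, 23.1])

# zip(*results) turns the three parallel sequences into per-passage triples.
rows = list(zip(*results))
for passage_id, passage_rank, passage_score in rows:
    print(f"[{passage_rank}] pid={passage_id} score={passage_score:.1f}")
```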

## Batch Search

In many applications, you have a large batch of queries and you need to maximize the overall throughput. For that, you can use the `searcher.search_all(queries, k)` method, which returns a `Ranking` object that organizes the results across all queries.

(Batching provides many opportunities for higher-throughput search, though we have not implemented most of those optimizations for compressed indexes yet.)

In [11]:
rankings = searcher.search_all(queries, k=5).todict()

100%|█████████████████████████████████████████████████████████████████████████████████████████| 417/417 [00:03<00:00, 136.40it/s]


In [12]:
rankings[30]  # For query 30, a list of (passage_id, rank, score) for the top-k passages

[(24367, 1, 16.1875),
 (25089, 2, 16.015625),
 (35359, 3, 16.015625),
 (131623, 4, 15.9765625),
 (3789, 5, 15.9375)]
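
Since `.todict()` yields a plain mapping from query ID to a list of `(passage_id, rank, score)` tuples, ordinary Python is enough to post-process it. The sketch below flattens such a mapping into tab-separated rows. The `rankings_to_tsv` helper and its four-column layout are assumptions for illustration, and the sample data is the output shown above.

```python
def rankings_to_tsv(rankings):
    """Flatten {qid: [(pid, rank, score), ...]} into tab-separated lines.

    Each line holds qid, pid, rank, and score, in that order.
    """
    lines = []
    for qid in sorted(rankings):
        for pid, rank, score in rankings[qid]:
            lines.append(f"{qid}\t{pid}\t{rank}\t{score}")
    return lines


# Sample mapping copied from the rankings[30] output above.
sample_rankings = {
    30: [
        (24367, 1, 16.1875),
        (25089, 2, 16.015625),
        (35359, 3, 16.015625),
        (131623, 4, 15.9765625),
        (3789, 5, 15.9375),
    ]
}

tsv_lines = rankings_to_tsv(sample_rankings)
print("\n".join(tsv_lines))
```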