<a href="https://colab.research.google.com/github/wsanjay/Interesting_notebooks_collection/blob/main/Jina_ColBERT_Vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Jina-ColBERT](https://huggingface.co/jinaai/jina-colbert-v1-en) Vector Creation

## What is ColBERT?
Unlike single-vector dense embeddings, which compress the entire document into one array of floats, ColBERT creates a vector for each token.
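To make the difference in shape concrete, here is a minimal sketch using random numbers in place of real embeddings. The sizes (12 tokens, 128 dimensions) are illustrative choices that happen to match the shapes printed later in this notebook; nothing here calls the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; random values stand in for real embeddings.
num_tokens, dim = 12, 128

# Single-vector embedding: the whole document becomes one array of floats.
dense_doc = rng.normal(size=(dim,))

# ColBERT-style embedding: one vector per token.
multi_doc = rng.normal(size=(num_tokens, dim))

print(dense_doc.shape)  # (128,)
print(multi_doc.shape)  # (12, 128)
```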

In practice, this can be computationally expensive. Here, we'll show the ColBERT similarity score with queries padded to 32 tokens and documents limited to 180 tokens, independent of the original text length (the short passages used below end up with 12 token positions each).
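A rough back-of-the-envelope sketch of where the cost comes from, using the token counts above and the per-token dimension of 128 reported later in this notebook. The float32 storage size is an assumption, and the helper name `late_interaction_cost` is made up for this example.

```python
QUERY_TOKENS = 32      # from the text above
DOC_TOKENS = 180       # from the text above
DIM = 128              # per-token dimension reported later in this notebook
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def late_interaction_cost(query_tokens, doc_tokens, dim):
    """Return (token-pair similarities, multiply-adds) for one query-document pair."""
    pairs = query_tokens * doc_tokens
    return pairs, pairs * dim


pairs, macs = late_interaction_cost(QUERY_TOKENS, DOC_TOKENS, DIM)
doc_bytes = DOC_TOKENS * DIM * BYTES_PER_FLOAT32

print(pairs)      # 5760 token-pair similarities per query-document pair
print(macs)       # 737280 multiply-adds per query-document pair
print(doc_bytes)  # 92160 bytes to store one fully padded document
```

By comparison, a single-vector model needs one dot product per query-document pair.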

## What is this model about?
The new [jina-colbert-v1-en](https://huggingface.co/jinaai/jina-colbert-v1-en) model builds on prior work such as [jinaai/jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en), which enables much longer context windows of 8192 tokens, ALiBi training, and other improvements.

## Credits
This notebook was created by [Qdrant](https://qdrant.tech)'s AI Engineering team, based on a [tweet](https://twitter.com/lateinteraction/status/1758023391838380052) by ColBERT creator [@lateinteraction](https://twitter.com/lateinteraction).

In [None]:
!pip install git+https://github.com/stanford-futuredata/ColBERT -qq


In [None]:
# Checkpoint is defined in https://github.com/stanford-futuredata/ColBERT/blob/main/colbert/modeling/checkpoint.py

from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig
import pandas
import numpy
import torch

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'




## Load the model
Note that we pass the Hugging Face model path as is, along with a `ColBERTConfig` object.

In [None]:
ckpt = Checkpoint("jinaai/jina-colbert-v1-en", colbert_config=ColBERTConfig(root="experiments"))

[Feb 15, 14:28:31] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




## Separate Document and Query Encoders
ColBERTv2.0 encodes queries and documents separately; the `Checkpoint` class exposes this as `queryFromText` and `docFromText`.

Here, queries are padded to 32 tokens, and each token is embedded in d=128 dimensions. This means each query can be described by a 32×128 tensor.
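As a rough illustration of what padding to a fixed length means, here is a minimal sketch in plain Python. The helper `pad_query`, the filler id `FILLER_ID`, and the token ids are all hypothetical; the real tokenizer's choice of filler token is not shown in this notebook.

```python
QUERY_MAXLEN = 32
FILLER_ID = 0  # hypothetical filler token id, for illustration only


def pad_query(token_ids, maxlen=QUERY_MAXLEN, filler_id=FILLER_ID):
    """Truncate or right-pad a list of token ids to exactly `maxlen` entries."""
    clipped = token_ids[:maxlen]
    return clipped + [filler_id] * (maxlen - len(clipped))


short_query = [101, 2054, 2515, 102]  # made-up ids, not real tokenizer output
padded = pad_query(short_query)

print(len(padded))  # 32
print(padded[:6])   # [101, 2054, 2515, 102, 0, 0]
```

Each of the 32 positions is then embedded into 128 dimensions, which gives the 32×128 shape per query.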

In [None]:
Q = ckpt.queryFromText(["What does ColBERT do?", "This is a search query?"], bsize=16)

In [None]:
Q.shape

torch.Size([2, 32, 128])

For the purpose of illustration, we'll keep only one query:

In [None]:
Q = Q[0]
Q = Q.unsqueeze(0)
Q.shape

torch.Size([1, 32, 128])

In [None]:
all_passages = ["This is a demo notebook", "This mentions BERT", "Qdrant is considering adding ColBERT indexing","Jina-ColBERT is amazing", "ColBERT is a multi-vector representation per document"]
D = ckpt.docFromText(all_passages, bsize=32)[0]

In [None]:
D.shape

torch.Size([5, 12, 128])

### Let's begin scoring

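ColBERT is known for its [late interaction](https://twitter.com/lateinteraction). Here, we'll use MaxSim to compute the query-document similarity scores: for each query token, take the maximum similarity over all document tokens, then sum these maxima over the query tokens.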
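The MaxSim idea can be sketched in a few lines of NumPy. This is a toy illustration with hand-picked 2-dimensional vectors, not the implementation behind the library's `colbert_score`; the names `maxsim_score`, `Q_toy`, and `D_toy` are made up for this example.

```python
import numpy as np


def maxsim_score(Q, D):
    """Q: (query_tokens, dim), D: (doc_tokens, dim). Returns one score."""
    sim = Q @ D.T                  # (query_tokens, doc_tokens) similarity matrix
    return sim.max(axis=1).sum()   # best document token per query token, then sum


Q_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
D_toy = np.array([[1.0, 0.0],
                  [0.5, 0.5],
                  [0.0, -1.0]])

# Query token 1 best matches document token 1 (similarity 1.0);
# query token 2 best matches document token 2 (similarity 0.5).
print(maxsim_score(Q_toy, D_toy))  # 1.5
```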



In [None]:
from colbert.modeling.colbert import colbert_score

To keep the computation simple, we'll pass a document mask of all ones (note the use of `numpy.ones`), which marks every token position in every document as valid.

In [None]:
D_mask = numpy.ones(D.shape[:2], dtype=int)
D_mask = torch.tensor(D_mask)
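To see what an all-ones mask implies, here is a toy NumPy sketch of how a padding mask can enter a MaxSim computation. It is an assumption-laden illustration, not the code of `colbert_score`: the function `masked_maxsim` is hypothetical, and the padding position is modelled as a zero vector.

```python
import numpy as np


def masked_maxsim(Q, D, D_mask):
    """Q: (q, dim), D: (d, dim), D_mask: (d,) with 1 = valid token, 0 = padding."""
    sim = Q @ D.T
    # Masked-out positions can never win the max.
    sim = np.where(D_mask.astype(bool)[None, :], sim, -np.inf)
    return sim.max(axis=1).sum()


Q_toy = np.array([[1.0, 0.0]])
D_toy = np.array([[-0.5, 0.0],   # a real token, negatively similar to the query
                  [0.0, 0.0]])   # stand-in for a padding position (zero vector)

all_ones = np.ones(2, dtype=int)
padding_aware = np.array([1, 0])

print(masked_maxsim(Q_toy, D_toy, all_ones))       # 0.0: the padding position wins the max
print(masked_maxsim(Q_toy, D_toy, padding_aware))  # -0.5: only the real token counts
```

In this toy case the two masks differ only because every real token has negative similarity to the query; whenever some real token has a positive similarity, a zero-vector padding position cannot win the max.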

In [None]:
print(colbert_score(Q, D, D_mask))

tensor([ 4.5121,  4.5288, 16.5575, 17.4917, 21.3111])


## Observations

Notice that the 3 documents that mention ColBERT have the highest similarity scores, despite being of very different lengths:


```python
[
    "ColBERT is a multi-vector representation per document",  # 21.31
    "Jina-ColBERT is amazing",  # 17.49
    "Qdrant is considering adding ColBERT indexing",  # 16.56
]
```

At the same time, the sentence mentioning 'BERT' falls behind (4.53) -- barely doing better than the one which does not mention it (4.51).
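The ranking can be reproduced from the printed scores alone. The sketch below copies the passages and the score tensor from the cells above into plain NumPy, so it runs without loading the model.

```python
import numpy as np

passages = [
    "This is a demo notebook",
    "This mentions BERT",
    "Qdrant is considering adding ColBERT indexing",
    "Jina-ColBERT is amazing",
    "ColBERT is a multi-vector representation per document",
]
# Scores copied from the colbert_score output printed above.
scores = np.array([4.5121, 4.5288, 16.5575, 17.4917, 21.3111])

# Sort passage indices from highest to lowest score.
order = np.argsort(-scores)

for rank, idx in enumerate(order, start=1):
    print(f"{rank}. {scores[idx]:.2f}  {passages[idx]}")
```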

## Questions

Ping us:
- Questions about the model: [@JinaAI_](https://twitter.com/@JinaAI_)
- Questions about this notebook: [@NirantK](https://twitter.com/NirantK)