
[Bug/Model Request]: LateInteractionTextEmbedding("colbert-ir/colbertv2.0") creates embeddings of different sizes for a large set of documents #273

Closed
shima-khoshraftar opened this issue Jun 11, 2024 · 6 comments

Comments

@shima-khoshraftar

What happened?

A bug happened!

I am using embedding_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0") in fastembed==0.3.0,
following the code in https://qdrant.github.io/fastembed/examples/ColBERT_with_FastEmbed/#colbert-in-fastembed.

I created embeddings for some documents. However, when running on a large collection of documents, I got an error on this part of the code:

sorted_indices = compute_relevance_scores(
    np.array(query_embeddings[0]), np.array(document_embeddings), k=3
)

complaining that it cannot create an np.array from document_embeddings. Looking into it, I realized that the embeddings in document_embeddings have different sizes. For instance, for 442 documents, the first ~260 have embeddings of shape (182, 128), while the remaining ones have shape (164, 128). I am wondering if you can help me with that. Thanks.

What Python version are you on? e.g. python --version

Python 3.12

Version

0.2.7 (Latest)

What os are you seeing the problem on?

MacOS

Relevant stack traces and/or logs

Traceback (most recent call last):
    np.array(query_embeddings[0]), np.array(document_embeddings), k=3
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (442,) + inhomogeneous part.
@joein
Member

joein commented Jun 12, 2024

Hello @shima-khoshraftar

It's not really a bug.

ColBERT is different from the usual BERT-like models: instead of emitting a single [CLS] embedding, it emits an embedding for each token in the document.

If you have the sentence "I have an apple" and the tokenizer splits it into ["I", "have", "an", "apple"], the output shape will be (4, 128).

If you have the sentence "I have an apple and an orange" and the tokenizer splits it into ["I", "have", "an", "apple", "and", "an", "orange"], the output shape will be (7, 128).

Fastembed pads the sequences to the maximum number of tokens in a batch, but that maximum can differ across batches.

ColBERT is meant to be used with vector databases like Qdrant, so you would not compute scores on your own but leave it to those tools. (Qdrant will support ColBERT as of the next release.)
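To make the shape behavior concrete, here is a small sketch (the token counts above are illustrative; ColBERT also inserts special marker tokens, so the real shapes will be slightly larger, and I'm assuming embed's batch_size argument behaves as in fastembed 0.3.x):

from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")
docs = ["I have an apple", "I have an apple and an orange"]

# With batch_size=1 each document forms its own batch, so no cross-document
# padding occurs and the first dimension tracks each document's token count.
for emb in model.embed(docs, batch_size=1):
    print(emb.shape)  # (n_tokens, 128), different for each document

With the default batch size, both documents would land in the same batch and come out padded to the same length; the ragged shapes in the report appear only across batches.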

@shima-khoshraftar
Copy link
Author

shima-khoshraftar commented Jun 12, 2024

Thanks for your reply. Just to use this in the meantime, I was using the compute_relevance_scores function defined in the Late Interaction Text Embedding Models example: https://qdrant.github.io/fastembed/examples/ColBERT_with_FastEmbed/

Exactly: if fastembed padded the sequences to the max number of tokens across the dataset (rather than per batch), this issue would not happen. Do you think it would be too inefficient to compute the max token count across the dataset rather than per batch? If not, perhaps this could be added to fastembed and used until the next Qdrant release. Thanks.
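As a stopgap on the caller's side, the ragged list can also be padded to a common length before stacking. A minimal sketch, assuming the embeddings arrive as a list of (tokens_i, 128) arrays; pad_embeddings is a hypothetical helper, not part of fastembed:

import numpy as np

def pad_embeddings(embeddings):
    # Hypothetical helper: zero-pad each (tokens_i, dim) array along the
    # token axis up to the longest document, then stack into one 3-D array.
    max_len = max(e.shape[0] for e in embeddings)
    return np.stack(
        [np.pad(e, ((0, max_len - e.shape[0]), (0, 0))) for e in embeddings]
    )

document_embeddings = pad_embeddings(list(embedding_model.embed(documents)))

The zero rows are harmless for the MaxSim-style scoring in the example as long as each document has at least one real token with non-negative similarity, since the max over document tokens then ignores the padding.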

@joein
Copy link
Member

joein commented Jun 12, 2024

No, if the dataset is large (tens of millions of records), that would mean reading it all once to compute the token counts, then either storing those tokens or dropping them and tokenizing again, which is not really feasible.

@shima-khoshraftar
Author

Right. Thanks for the reply.

@joein
Member

joein commented Jun 13, 2024

If you really want all of the embeddings to have the same shape, you can modify the padding so that sequences are padded to a pre-defined length. E.g. if you set the length to 100, the embeddings for each document will have shape (100, 128).

This is not exposed to users; however, you can still do it (I haven't thoroughly tested it):

from fastembed import LateInteractionTextEmbedding

colbert = LateInteractionTextEmbedding('colbert-ir/colbertv2.0')

# Read the tokenizer's current padding config, pin a fixed length,
# and re-enable padding with the updated settings.
padding = colbert.model.tokenizer.padding
padding['length'] = 100
colbert.model.tokenizer.enable_padding(**padding)
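With that in place, the per-document embeddings share a shape and stack cleanly. One caveat (my assumption from how the tokenizers library behaves, where padding and truncation are configured separately): enable_padding pads shorter sequences but does not truncate longer ones, so a document that tokenizes to more than 100 tokens would still come out longer.

import numpy as np

# documents is assumed to be the same list of texts embedded earlier
document_embeddings = np.array(list(colbert.embed(documents)))  # (num_docs, 100, 128)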

@shima-khoshraftar
Author

Great, thanks. I will try it.

@qdrant qdrant locked and limited conversation to collaborators Jun 14, 2024
@joein joein converted this issue into discussion #276 Jun 14, 2024
