
Ability to index (hnsw) multiple vector points per document #15854

Closed
jobergum opened this issue Dec 21, 2020 · 14 comments

@jobergum
Member

jobergum commented Dec 21, 2020

With the growing interest in ANN with Vespa, there has also been increased demand for the ability to index multiple vectors per document. If any of the vectors assigned to a document is among the top-k closest neighbors, the document should be retrieved and exposed to the first-phase ranking function. This is similar to the behavior of position type fields (array) in geo-search.

Current support

field vector type tensor<float>(x[128]) {
  indexing: attribute | index
}

Future support:

field vector type tensor<float>(s{}, x[128]) {
  indexing: attribute | index
}

Ranking features like closeness should return the value for the closest vector, i.e. the vector that caused the document to be retrieved and exposed to the first-phase ranking function, where users can do more advanced computations over all the vectors if they want to.

Query syntax is unchanged, but we might consider allowing passing N tensors with the query. This would implement batch queries, e.g. where there are multiple query intents or token-based query embeddings as with ColBERT. Ideally, these could then share the filter execution, so that the filters are only evaluated once and the resulting filter is shared across the batch of NN queries.

schema doc {
  field vector type tensor<float>(s{}, x[128]) {
    indexing: attribute | index
  }
}

rank-profile default {
  inputs {
    query(q) tensor<float>(x[128])
    query(q_batch) tensor<float>(s{}, x[128])
  }

  first-phase {
    expression: closeness(field, vector)
  }
}
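
For illustration, retrieval against such a schema is intended to use the same query API as today. A minimal sketch of what a request could look like, assuming the existing nearestNeighbor operator and query-feature passing (the targetHits value is illustrative, and the annotation syntax varies slightly between Vespa versions):

yql=select * from sources * where ([{"targetHits": 100}]nearestNeighbor(vector, q))
ranking.profile=default
ranking.features.query(q)={ {x:0}:0.1, {x:1}:0.2, ... }

The document would be retrieved if any of its vectors along the mapped s dimension is among the approximate nearest neighbors of query(q).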

This would simplify expressing end-to-end retrieval using ColBERT, and would also help handle longer documents, which are otherwise limited by the input length of transformer architectures.

@okhat

okhat commented Dec 21, 2020

Excited to see this making its way to Vespa! (First author of the ColBERT paper.)

Happy to help with ideas/etc. about doing this. FYI our implementation is here: https://github.com/stanford-futuredata/ColBERT/tree/v0.2

@jobergum
Member Author

jobergum commented Dec 22, 2020

Thanks for reaching out @okhat, and excellent work on the paper and on making the training source code available. I've used the main branch for now, and I'm able to train and reproduce your re-ranking metrics using cosine similarity. We reduced the dimensionality to 32 to save space and memory bandwidth during evaluation (Vespa is currently CPU only).

Well done; it's not every paper with released code where it's this easy to reproduce the reported results!

Metrics on top1000.dev file (as provided by MS Marco):

MRR@10 = 0.343441408559603
Recall@50 = 0.7450095510983765
Recall@200 = 0.7996657115568291
Recall@1000 = 0.8140401146131805
[Dec 21, 15:41:39] #> checkpoint['batch'] = 400000 

Given the rather bad Recall@1000 in the original top1000, the MRR level is very impressive.

We are able to represent your MaxSim operator using Vespa tensor functions (*1) and use it as a re-ranker. Adding support for indexing multiple document term representations for the same passage/document will make it easier to also use ColBERT for retrieval, though I think it can also be distilled into a single dense tensor representation with OK recall@k.

We use the following document schema for the MS Marco Passage ranking experiments (so far). The end result will be published as a sample application, and probably as an entry to the MS Marco Passage ranking leaderboard, where we will use multiple features in the re-ranking stage.

document passage {

    field text type string {
      indexing: summary | index
      index: enable-bm25
    }

    field id type long {
      indexing: summary | attribute
    }

    field dt type tensor<float>(dt{}, x[32]) {
      indexing: summary | attribute
    }

    field doc_t5_query type weightedset<string> {
      indexing: summary | index
      index: enable-bm25
    }

}
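
For reference, feeding the mixed dt tensor above uses Vespa's document JSON tensor format. A minimal sketch of a feed document, assuming the blocks short form for a tensor with one mapped and one indexed dimension (the document id and values are made up, and the generic cells form works as well):

{
  "put": "id:msmarco:passage::0",
  "fields": {
    "id": 0,
    "text": "example passage text",
    "dt": {
      "blocks": {
        "0": [0.11, -0.03, ...],
        "1": [0.07, 0.21, ...]
      }
    }
  }
}

Each mapped label ("0", "1", ...) corresponds to one term embedding of 32 values.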

Ranking profile:

rank-profile colbert {
    first-phase {
      expression: ...
    }
    second-phase {
      rerank-count: 1000
      expression {
        sum(
          reduce(
            sum(
              query(qt) * attribute(dt), x
            ),
            max, dt
          ),
          qt
        )
      }
    }
}
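
For readers less familiar with ColBERT: the second-phase expression above is intended to compute the MaxSim late-interaction score from the paper. The inner sum over x is the dot product between a query term embedding and a document term embedding, the reduce takes the max over document terms dt, and the outer sum aggregates over query terms qt:

score(q, d) = sum_qt ( max_dt ( sum_x ( query(qt) * attribute(dt) ) ) )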

Here, query(qt) is the run-time tensor holding the query term embeddings, defined as

<field name="ranking.features.query(qt)" type="tensor(qt{},x[32])"/>

The query tensor is computed by the exported BERT query model (ONNX format for serving). From a latency perspective the encoder is the overall bottleneck, as the rest can be scaled easily while the single, non-batched pass through BERT cannot, so it might be that we try a smaller BERT version (e.g. bert-medium) together with quantization.

It's a very nice use case for us, as it allows us to demonstrate our mixed tensor support with advanced tensor functions and representing BERT models in Vespa.

(*1) https://docs.vespa.ai/documentation/reference/ranking-expressions.html#tensor-functions

@okhat

okhat commented Dec 22, 2020

Very interesting---thanks @jobergum!

I see that you're using d=32. Another way to go about this is to use a larger dimension (say 128) but quantize the representations. This may perform better in terms of accuracy.

It's also interesting that you think (on CPU) the main bottleneck would be the query encoder. I think it could be pretty fast (tens of milliseconds), even with a modest few-core CPU.

There are some ONNX results in milliseconds here from this article. For bsize=1 and maxlen=32, they're doing inference in 59.84 ms on a commodity CPU.

@jobergum
Member Author

@okhat Yes, agreed on dimensionality, but currently our tensor cell values are limited to float or double; we plan to add more tensor cell types with lower resolution (e.g. int8). Generally it seems that the number of parameters/dimensions matters more than weight resolution, so 32 is a reasonable tradeoff given our current tensor type limitations.

> There are some ONNX results in milliseconds here from this article. For bsize=1 and maxlen=32, they're doing inference in 59.84 ms on a commodity CPU.

Yes, we will fully explore this (we also did a similar accuracy-versus-speed analysis for the DPR model: https://blog.vespa.ai/from-research-to-production-scaling-a-state-of-the-art-machine-learning-system/).

Below 100 ms overall, including ranking and query encoding on a single node of modern hardware, is our goal, and that is fully reachable without much accuracy loss. I'll update this ticket when we have more on this, but it's surely a very exciting model you have proposed, and it fits perfectly with Vespa tensor functions and storage.

@johans1 johans1 added this to the soon milestone Jan 4, 2021
@jobergum
Member Author

jobergum commented Jan 5, 2021

With a bert-medium-uncased model (https://huggingface.co/google/bert_uncased_L-8_H-512_A-8), batch size 300, dim 32, and trained for 120K iterations, the accuracy is pretty good (0.354 MRR@10 on MS Marco Dev), using just WAND(1000) BM25 as the retriever over the text field and re-ranking the top 1K using ColBERT:

MS Marco Passage Ranking Dev:

  • MRR@10 0.3540
  • Recall@10 0.6141
  • Recall@50 0.7781
  • Recall@100 0.8151
  • Recall@1000 0.8562

TREC DL 2019 Eval (43 queries) with deep graded judgements

  • NDCG@10 0.7039
  • MAP 0.4494

Sample response (screenshot omitted).

@okhat

okhat commented Jan 5, 2021

That's really awesome! The results with medium BERT and a small dimension are better than I expected. Also the latency is pretty great. Is this on CPU?

I'm wondering how things would look with end-to-end retrieval, especially with the deeper relevance judgments!

@jobergum
Member Author

jobergum commented Jan 5, 2021

Yes, this runs on a single node in our private Vespa cloud with 2 x Xeon Gold 2.60 GHz and 512 GB memory, pretty standard stuff. We'll explore the performance further, including the number of threads per search, and see if we can optimize it more. We can also cut down the latency of the query encoder by using a quantized version (int8), which brings down the cost of encoding the query, but with an accuracy loss (from 0.35ish to 0.34ish).

My conclusion is that it's perfectly practical to serve this model in production using Vespa, at a fraction of the cost (and latency) of a full interaction-based model with almost the same accuracy.

@jobergum
Member Author

Not directly related to this ticket, but an update on the ColBERT realization on Vespa: #16236 adds an optimization to the tensor expression which implements the ColBERT MaxSim operator, and we now reach 2.2K QPS end to end with low latency on a single node.

End-to-end latency includes:

  • HTTPS end to end, until the result is fully consumed by the HTTP benchmarking client
  • BERT query tokenization to BERT token_ids
  • Encoding of the query through the BERT query encoder (ONNX and ONNX-RT)
  • Rewrite of the d0[1],d1[32],d2[32] tensor to the sparse qt tensor (qt{},x[32]) used in the ranking expression
  • Sparse text retrieval using WAND over 9M passages and re-ranking using the ColBERT MaxSim operator

Results:

Vespa version   average latency (ms)   99p latency (ms)   QPS
7.347.28        92.89                  119.74             686
7.349.32        28.21                  43.90              2267

This is also using only 1 thread per search, so for use cases which are not that throughput sensitive, one can reduce latency further at the cost of throughput.
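
For reference, that tradeoff can be tuned per rank profile; a minimal sketch, assuming the num-threads-per-search rank-profile setting (the value 4 is just an example):

rank-profile colbert {
    num-threads-per-search: 4
    ...
}

More threads per search lowers latency for a single query, but reduces peak throughput per node.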

We also did a submission to the MS Marco Passage ranking leaderboard with it (screenshot omitted).

So thanks again @okhat for sharing the code to reproduce the training routine and for the excellent paper. We will write more about this on our blog very soon.

@okhat

okhat commented Jan 28, 2021

Thank you @jobergum and thanks to the team! This is actually much faster than I thought.

I'm wondering how fast you guys could make the end-to-end retrieval version of ColBERT too, if you gave that one a shot! ;-) Especially with iterated training (gathering negatives on the training set based on the trained ColBERT model), this should have substantially higher accuracy (~37.5% MRR@10 on dev).

@jobergum
Member Author

jobergum commented Feb 1, 2021

So by iterated you mean creating a new triplet set over the train queries, using the positives and a sample of the top negatives from the initial ColBERT model trained on the original BM25 triplet file?

We will publish the weights of this model on Huggingface models, and also the pre-computed passage tensors, so people can reproduce. We will leave it to those interested to improve and iterate on it; after all, we are not scientists, we just want to demonstrate how one can efficiently represent the excellent ColBERT model and get close to state-of-the-art ranking with great serving performance on CPU architectures. In this context we are rather happy with an MRR score of 0.35ish on dev and eval, as it's 100% better than plain BM25, and similar on TREC DL 2019.

Being able to represent ColBERT on Vespa has IMHO many benefits, especially CPU friendliness, and Vespa also allows the final inference to be a combination of multiple ranking features. Later this year we will also add support for other tensor cell types so we can reduce the memory footprint.

@geirst
Member

geirst commented Sep 1, 2021

Discussed this with @jobergum. We concluded this is less important. Setting milestone to later.

@geirst geirst modified the milestones: soon, later Sep 1, 2021
@jobergum
Member Author

jobergum commented Sep 1, 2021

In the context of the ColBERT multi-representation model this is less important, as there are alternatives using a single dense representation for the retriever and ColBERT as the re-ranking step.

Now, with paged on-disk tensors plus binarization and unpacking, the memory and storage footprint allows storing 100M passages with a fixed length of 256 tokens on 16-64-600 nodes.
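
A sketch of what such a setup could look like in a schema, assuming int8 cells for the binarized term embeddings and the paged attribute option (the dimension size is illustrative):

field dt type tensor<int8>(dt{}, x[16]) {
    indexing: summary | attribute
    attribute: paged
}

With paged attributes, the tensor data is allowed to be swapped out to disk instead of being pinned in memory, trading some access latency for a much smaller memory footprint.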

@bratseth bratseth modified the milestones: later, soon Nov 8, 2022
@baldersheim
Member

@geirst I think this has been completed now.

@geirst
Member

geirst commented Sep 14, 2023

Yes, this is part of Vespa 8.144.19. See blog post for details: https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/
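
For completeness, a sketch of how multi-vector HNSW indexing is declared in a schema on Vespa 8.144+, per that blog post (field name, dimension sizes, distance metric and HNSW parameters are illustrative):

field embeddings type tensor<float>(p{}, x[384]) {
    indexing: attribute | index
    attribute {
        distance-metric: angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 100
        }
    }
}

A nearestNeighbor query over such a field retrieves the document if any of its vectors in the mapped p dimension is among the nearest neighbors, and the closeness rank feature reflects the closest one, as requested in this issue.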
