
Ability to index (hnsw) multiple vector points per document #15854

Closed
jobergum opened this issue Dec 21, 2020 · 14 comments

@jobergum
Member

jobergum commented Dec 21, 2020

With the growing interest in ANN with Vespa, there has also been increased demand for the ability to index multiple vectors per document. If any of the vectors assigned to a document is among the top-k closest neighbors, the document should be retrieved and exposed to the first-phase ranking function. This is similar to the behavior of position type fields (array) in geo-search.

Current support

field vector type tensor<float>(x[128]) {
  indexing: attribute | index
}

Future support:

field vector type tensor<float>(s{}, x[128]) {
  indexing: attribute | index
}

Ranking features like closeness should return the value for the closest vector, i.e. the vector that caused the document to be retrieved and exposed to the first-phase ranking function, where users can do more advanced computations over all the vectors if they want to.

Query syntax is unchanged, but we might consider allowing passing N tensors with the query. This would implement batch queries, e.g. where there are multiple query intents or token-based query embeddings as with ColBERT. Ideally, these could then share the filter execution, so that the filters are only evaluated once and the resulting filter is shared across the batch of NN queries.

schema doc {
  field vector type tensor<float>(s{}, x[128]) {
    indexing: attribute | index
  }
}

rank-profile default {
  inputs {
    query(q) tensor<float>(x[128])
    query(q_batch) tensor<float>(s{}, x[128])
  }

  first-phase {
    expression: closeness(field, vector)
  }
}
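
For illustration, retrieval against such a schema is intended to use the same query API as today. A minimal sketch of what a request could look like, assuming the existing nearestNeighbor operator and query-feature passing (the targetHits value is illustrative, and the annotation syntax varies slightly between Vespa versions):

yql=select * from sources * where ([{"targetHits": 100}]nearestNeighbor(vector, q))
ranking.profile=default
ranking.features.query(q)={ {x:0}:0.1, {x:1}:0.2, ... }

The document would be retrieved if any of its vectors along the mapped s dimension is among the approximate nearest neighbors of query(q).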

This would simplify expressing end-to-end retrieval using ColBERT, and would also help handle longer documents, which are otherwise limited by the input length of transformer architectures.

@okhat

okhat commented Dec 21, 2020

Excited to see this making its way to Vespa! (First author of the ColBERT paper.)

Happy to help with ideas/etc. about doing this. FYI our implementation is here: https://github.com/stanford-futuredata/ColBERT/tree/v0.2

@jobergum
Member Author

jobergum commented Dec 22, 2020

Thanks for reaching out @okhat, and excellent work on the paper and on making the training source code available. I've used the main branch for now, and I'm able to train and reproduce your re-ranking metrics using cosine similarity. We reduced the dimensionality to 32 to save space and memory bandwidth during evaluation (Vespa is currently CPU only).

Well done; it's not every paper with released code where it's this easy to reproduce the reported results!

Metrics on top1000.dev file (as provided by MS Marco):

MRR@10 = 0.343441408559603
Recall@50 = 0.7450095510983765
Recall@200 = 0.7996657115568291
Recall@1000 = 0.8140401146131805
[Dec 21, 15:41:39] #> checkpoint['batch'] = 400000 

Given the rather bad Recall@1000 in the original top1000, the MRR level is very impressive.

We are able to represent your MaxSim operator using Vespa tensor functions (*1) and use it as a re-ranker. Adding support for indexing multiple document term representations for the same passage/document will make it easier to also use ColBERT for retrieval, though I think it can also be distilled into a single dense tensor representation with OK recall@k.

We use the following document schema for the MS Marco Passage ranking experiments (so far). The end result will be published as a sample application, and probably as an entry to the MS Marco Passage ranking leaderboard, where we will use multiple features in the re-ranking stage.

document passage {

    field text type string {
      indexing: summary | index
      index: enable-bm25
    }

    field id type long {
      indexing: summary | attribute
    }

    field dt type tensor<float>(dt{}, x[32]) {
      indexing: summary | attribute
    }

    field doc_t5_query type weightedset<string> {
      indexing: summary | index
      index: enable-bm25
    }

}
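
For reference, feeding the mixed dt tensor above uses Vespa's document JSON tensor format. A minimal sketch of a feed document, assuming the blocks short form for a tensor with one mapped and one indexed dimension (the document id and values are made up, and the generic cells form works as well):

{
  "put": "id:msmarco:passage::0",
  "fields": {
    "id": 0,
    "text": "example passage text",
    "dt": {
      "blocks": {
        "0": [0.11, -0.03, ...],
        "1": [0.07, 0.21, ...]
      }
    }
  }
}

Each mapped label ("0", "1", ...) corresponds to one term embedding of 32 values.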

Ranking profile:

rank-profile colbert {
    first-phase {
      expression: ...
    }
    second-phase {
      rerank-count: 1000
      expression {
        sum(
          reduce(
            sum(
              query(qt) * attribute(dt), x
            ),
            max, dt
          ),
          qt
        )
      }
    }
}
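
For readers less familiar with ColBERT: the second-phase expression above is intended to compute the MaxSim late-interaction score from the paper. The inner sum over x is the dot product between a query term embedding and a document term embedding, the reduce takes the max over document terms dt, and the outer sum aggregates over query terms qt:

score(q, d) = sum_qt ( max_dt ( sum_x ( query(qt) * attribute(dt) ) ) )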

Here, query(qt) is the run-time tensor holding the query term embeddings, defined as

<field name="ranking.features.query(qt)" type="tensor(qt{},x[32])"/>

The query tensor is computed by the exported BERT query model (ONNX format for serving). From a latency perspective the encoder is the overall bottleneck, as the rest can be scaled easily while the single, non-batched pass through BERT cannot, so it might be that we try a smaller BERT version (e.g. bert-medium) together with quantization.

It's a very nice use case for us, as it allows us to demonstrate our mixed tensor support with advanced tensor functions and representing BERT models in Vespa.

(*1) https://docs.vespa.ai/documentation/reference/ranking-expressions.html#tensor-functions

@okhat

okhat commented Dec 22, 2020

Very interesting---thanks @jobergum!

I see that you're using d=32. Another way to go about this is to use a larger dimension (say 128) but quantize the representations. This may perform better in terms of accuracy.

It's also interesting that you think (on CPU) the main bottleneck would be the query encoder. I think it could be pretty fast (tens of milliseconds), even with a modest few-core CPU.

There are some ONNX results in milliseconds here from this article. For bsize=1 and maxlen=32, they're doing inference in 59.84 ms on a commodity CPU.

@jobergum
Member Author

@okhat Yes, agreed on dimensionality, but currently our tensor cell values are limited to float or double; we plan to add more tensor cell types with lower resolution (e.g. int8). Generally it seems that the number of parameters/dimensions matters more than weight resolution, so 32 is a reasonable tradeoff given our current tensor type limitations.

> There are some ONNX results in milliseconds here from this article. For bsize=1 and maxlen=32, they're doing inference in 59.84 ms on a commodity CPU.

Yes, we will fully explore this (we also did a similar accuracy-versus-speed analysis for the DPR model: https://blog.vespa.ai/from-research-to-production-scaling-a-state-of-the-art-machine-learning-system/).

Below 100 ms overall, including ranking and query encoding on a single node of modern hardware, is our goal, and that is fully reachable without much accuracy loss. I'll update this ticket when we have more on this, but it's surely a very exciting model you have proposed, and it fits perfectly with Vespa tensor functions and storage.

@johans1 johans1 added this to the soon milestone Jan 4, 2021
@jobergum
Member Author

jobergum commented Jan 5, 2021

With a bert-medium-uncased model (https://huggingface.co/google/bert_uncased_L-8_H-512_A-8), batch size 300, dim 32, and trained for 120K iterations, the accuracy is pretty good (0.354 MRR@10 on MS Marco Dev), using just WAND(1000) BM25 as the retriever over the text field and re-ranking the top 1K using ColBERT:

MS Marco Passage Ranking Dev:

  • MRR@10 0.3540
  • Recall@10 0.6141
  • Recall@50 0.7781
  • Recall@100 0.8151
  • Recall@1000 0.8562

TREC DL 2019 Eval (43 queries) with deep graded judgements

  • NDCG@10 0.7039
  • MAP 0.4494

Sample response (screenshot omitted).

@okhat

okhat commented Jan 5, 2021

That's really awesome! The results with medium BERT and a small dimension are better than I expected. Also the latency is pretty great. Is this on CPU?

I'm wondering how things would look with end-to-end retrieval, especially with the deeper relevance judgments!

@jobergum
Member Author

jobergum commented Jan 5, 2021

Yes, this runs on a single node in our private Vespa cloud with 2 x Xeon Gold 2.60 GHz and 512 GB memory, pretty standard stuff. We'll explore the performance further, including the number of threads per search, and see if we can optimize it more. We can also cut down the latency of the query encoder by using a quantized version (int8), which brings down the cost of encoding the query, but with an accuracy loss (from 0.35ish to 0.34ish).

My conclusion is that it's perfectly practical to serve this model in production using Vespa, at a fraction of the cost (and latency) of a full interaction-based model with almost the same accuracy.

@jobergum
Member Author

Not directly related to this ticket, but an update on the ColBERT realization on Vespa: #16236 adds an optimization to the tensor expression which implements the ColBERT MaxSim operator, and we now reach 2.2K QPS end to end with low latency on a single node.

End-to-end latency includes:

  • HTTPS end to end, until the result is fully consumed by the HTTP benchmarking client
  • BERT query tokenization to BERT token_ids
  • Encoding of the query through the BERT query encoder (ONNX and ONNX-RT)
  • Rewrite of the d0[1],d1[32],d2[32] tensor to the sparse qt tensor (qt{},x[32]) used in the ranking expression
  • Sparse text retrieval using WAND over 9M passages and re-ranking using the ColBERT MaxSim operator

Results:

Vespa version   average latency (ms)   99p latency (ms)   QPS
7.347.28        92.89                  119.74             686
7.349.32        28.21                  43.90              2267

This is also using only 1 thread per search, so for use cases which are not that throughput sensitive, one can reduce latency further at the cost of throughput.
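
For reference, that tradeoff can be tuned per rank profile; a minimal sketch, assuming the num-threads-per-search rank-profile setting (the value 4 is just an example):

rank-profile colbert {
    num-threads-per-search: 4
    ...
}

More threads per search lowers latency for a single query, but reduces peak throughput per node.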

We also did a submission to the MS Marco Passage ranking leaderboard with it (screenshot omitted).

So thanks again @okhat for sharing the code to reproduce the training routine and for the excellent paper. We will write more about this on our blog very soon.

@okhat

okhat commented Jan 28, 2021

Thank you @jobergum and thanks to the team! This is actually much faster than I thought.

I'm wondering how fast you guys could make the end-to-end retrieval version of ColBERT too, if you gave that one a shot! ;-) Especially with iterated training (gathering negatives on the training set based on the trained ColBERT model), this should have substantially higher accuracy (~37.5% MRR@10 on dev).

@jobergum
Member Author

jobergum commented Feb 1, 2021

So by iterated you mean creating a new triplet set over the train queries, using the positives and a sample of the top negatives from the initial ColBERT model trained on the original BM25 triplet file?

We will publish the weights of this model on Huggingface models, and also the pre-computed passage tensors, so people can reproduce. We will leave it to those interested to improve and iterate on it; after all, we are not scientists, we just want to demonstrate how one can efficiently represent the excellent ColBERT model and get close to state-of-the-art ranking with great serving performance on CPU architectures. In this context we are rather happy with an MRR score of 0.35ish on dev and eval, as it's 100% better than plain BM25, and similar on TREC DL 2019.

Being able to represent ColBERT on Vespa has IMHO many benefits, especially CPU friendliness, and Vespa also allows the final inference to be a combination of multiple ranking features. Later this year we will also add support for other tensor cell types so we can reduce the memory footprint.

@geirst
Member

geirst commented Sep 1, 2021

Discussed this with @jobergum. We concluded this is less important. Setting milestone to later.

@geirst geirst modified the milestones: soon, later Sep 1, 2021
@jobergum
Member Author

jobergum commented Sep 1, 2021

In the context of the ColBERT multi-representation model this is less important, as there are alternatives using a single dense representation for the retriever and ColBERT as the re-ranking step.

Now, with paged on-disk tensors plus binarization and unpacking, the memory and storage footprint allows storing 100M passages with a fixed length of 256 tokens on 16-64-600 nodes.
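
A sketch of what such a setup could look like in a schema, assuming int8 cells for the binarized term embeddings and the paged attribute option (the dimension size is illustrative):

field dt type tensor<int8>(dt{}, x[16]) {
    indexing: summary | attribute
    attribute: paged
}

With paged attributes, the tensor data is allowed to be swapped out to disk instead of being pinned in memory, trading some access latency for a much smaller memory footprint.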

@bratseth bratseth modified the milestones: later, soon Nov 8, 2022
@baldersheim
Member

@geirst I think this has been completed now.

@geirst
Member

geirst commented Sep 14, 2023

Yes, this is part of Vespa 8.144.19. See blog post for details: https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/
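
For completeness, a sketch of how multi-vector HNSW indexing is declared in a schema on Vespa 8.144+, per that blog post (field name, dimension sizes, distance metric and HNSW parameters are illustrative):

field embeddings type tensor<float>(p{}, x[384]) {
    indexing: attribute | index
    attribute {
        distance-metric: angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 100
        }
    }
}

A nearestNeighbor query over such a field retrieves the document if any of its vectors in the mapped p dimension is among the nearest neighbors, and the closeness rank feature reflects the closest one, as requested in this issue.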
