Clarification about pre-filtering performance #113

Closed
mertalev opened this issue Nov 1, 2023 · 10 comments

@mertalev

mertalev commented Nov 1, 2023

Hi! I'm looking at both pgvecto.rs and pgvector, and I think pre-filtering support is a killer feature that makes this project appealing.

But the filtering section of the README seems to recommend against using it, suggesting post-filtering instead for better optimization.

Could you clarify what optimization issues there are with pre-filtering? Relatedly, I think it would be helpful to add a performance comparison between pre-filtering and post-filtering.

@VoVAllen
Member

VoVAllen commented Nov 2, 2023

Thanks for your interest! The actual performance of filtering depends on how tight your condition is:

  • If your filter is loose, say 90% of the data satisfies it, performing an ANN search first and then applying the filter to the results yields the fastest outcome with good quality. This is post-filtering.
  • If your filter is very tight, say only 100 rows satisfy it, the optimal approach is to apply the filter first to obtain those 100 rows and then compute the distances directly, without using the vector index at all. This is brute force.
  • If your filter is moderately tight, say 20% of the data satisfies it, post-filtering may run into trouble because the ANN search might not return enough results that pass the filter. The optimal approach here is pre-filtering: as the algorithm traverses the HNSW graph to discover new points, it checks the filter at the same time until there are enough candidates, so every result returned by the vector index already satisfies the filter.
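
For context, the kind of statement all three strategies apply to is an ordinary filtered nearest-neighbor query; here is a minimal sketch, with a made-up table, column, and filter:

```sql
-- Hypothetical schema: items(id, category_id, embedding vector(3)).
-- The WHERE clause is the filter; ORDER BY ... LIMIT is the ANN search.
SELECT id
FROM items
WHERE category_id = 42
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
LIMIT 10;
```

The three strategies differ only in whether the filter is applied after, before, or during the vector search.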

Currently, pgvector's HNSW index does post-filtering. When the filter is not loose enough, recall drops rapidly because the index cannot provide enough results; users need to manually increase the ef_search parameter to make the index return more candidates for filtering in order to get reasonable results.
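
For reference, that tuning in pgvector is a session setting; a rough sketch (the value is made up) looks like:

```sql
-- pgvector workaround: ask the HNSW index for more candidates so that
-- enough of them survive the filter (hnsw.ef_search defaults to 40).
SET hnsw.ef_search = 200;
-- ...then re-run the same filtered ORDER BY ... LIMIT query as above.
```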

We have benchmarked on the laion dataset with a filter; it is the only real-world dataset we could find that has a filtering test. Basically, you don't need to manually tune any parameters to get reasonably good results with filtering. Under the same configuration, pgvector only achieves about 50% precision, while pgvecto.rs achieves 95% precision with higher QPS.

@VoVAllen
Member

VoVAllen commented Nov 2, 2023

Currently we don't have time to carefully benchmark each mode, so we let users select one on their own based on their data. Ideally we could select the best plan for users automatically in the future.

To select a different mode:

  • Pre-filtering (default mode): SET vectors.enable_vector_index=on; SET vectors.enable_prefilter=on;
  • Post-filtering: SET vectors.enable_vector_index=on; SET vectors.enable_prefilter=off;
  • Brute force: SET vectors.enable_vector_index=off;
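
As a usage sketch (the table and filter below are hypothetical; the settings can also be scoped with standard SET LOCAL inside a transaction):

```sql
-- Hypothetical example: force brute force for a very selective filter,
-- then restore the default pre-filtering mode.
SET vectors.enable_vector_index=off;   -- brute force: exact distances, no index
SELECT id
FROM items
WHERE user_id = 123                    -- matches only a handful of rows
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
LIMIT 10;

SET vectors.enable_vector_index=on;    -- back to the vector index
SET vectors.enable_prefilter=on;       -- pre-filtering (the default)
```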

@VoVAllen
Member

VoVAllen commented Nov 2, 2023

We are also implementing a new method that builds a bitmap from the filter condition on an indexed column, so the bitmap can be pushed down into the vector search to improve performance. Please stay tuned!

@mertalev
Author

mertalev commented Nov 3, 2023

Thank you for the detailed explanation (and of course for working on this great project)! I have a better sense of the pros/cons now.

I'm planning on testing pgvecto.rs soon and will let you know how it goes. Besides performance, there are some things in particular I want to make sure get handled, like changing the dimension size and not creating the index when the table is empty (I read that the index crashes in this case).

@VoVAllen
Member

VoVAllen commented Nov 3, 2023

@mertalev We're looking forward to your feedback! Yes, currently creating an index on an empty table doesn't work well, and the dimension size cannot be changed after column creation. Another point worth attention is the capacity parameter: pgvecto.rs currently needs a predefined capacity as the maximum number of vectors. If you want to change it, you can do so by recreating the index, as in #101.
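
A minimal sketch of that recreate-the-index step, assuming pgvecto.rs 0.1.x syntax (the index name, operator class, and options string here are assumptions, not taken from the thread; check the docs for your version):

```sql
-- Hypothetical: rebuild the index with a larger capacity.
-- Operator class and options syntax are assumptions and may differ by version.
DROP INDEX IF EXISTS items_embedding_idx;
CREATE INDEX items_embedding_idx ON items
    USING vectors (embedding l2_ops)
    WITH (options = $$capacity = 1048576$$);
```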

@mertalev
Author

I've been testing pgvecto.rs over the last few days and it's very nice! With the latest release, it now handles empty tables so that makes migration easier.

From comparison with pgvector:

  • Brute-force performance is about 5% higher
  • Basic HNSW queries are about 5% slower
  • Indexing speed is night and day in favor of pgvecto.rs
  • Filtering works very well
    • With pgvector, we have to think about partitioning an existing table, what to do if we need to have a second filter later, etc.
    • pgvecto.rs gives an easy and scalable way to filter without any DBMS complexity

Ease-of-use improvements:

  • Automatically increasing capacity as the index grows
  • Not requiring vectors_load to use the index
  • More documentation
    • Besides capacity (which I made an issue for), I'm also not sure what to expect from quantization. It seems to be about as fast, takes close to the same amount of time to build, and uses the same amount of memory. This is with 200k vectors, so does it only make a difference at a larger scale?

@usamoi
Collaborator

usamoi commented Nov 12, 2023

vectors_load is no longer needed, and the capacity increases automatically in the latest release.

@usamoi
Collaborator

usamoi commented Nov 12, 2023

Quantization uses less memory. By default it uses 1/4 of the memory compared to before.
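
As a rough back-of-the-envelope illustration (the 512 dimensions below are an assumption, not something stated in the thread): 200k float32 vectors at 512 dimensions is about 200,000 × 512 × 4 bytes ≈ 400 MB of raw vector data, so a 1/4 ratio brings that to roughly 100 MB and x16 to roughly 25 MB. At 200k vectors the savings can be masked by the graph structure and other fixed overhead, but they grow linearly with the number of vectors.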

@mertalev
Author

I was using the latest tagged image, but after setting it to pg15-v0.1.6-amd64 you're right on both points. With x16 quantization, latency is more than halved and it uses much less memory.

@VoVAllen
Member

Thanks for your detailed feedback; it's super valuable to us! As usamoi explained, we fixed the capacity problem this week, so users don't need to take care of it any more. We'll also work on preparing the arm64 image and supporting more pg versions.

For quantization, you may also want to check the precision if needed. This can be done by running the same queries in brute-force mode and comparing the results.
