
Add selectivity estimate? #30

Open
msdemlei opened this issue Feb 10, 2021 · 35 comments

@msdemlei

msdemlei commented Feb 10, 2021

I'm still regularly struggling with abysmal query plans when q3c functions are in queries; disabling seqscans helps a bit, of course, but that is certainly suboptimal and has unwelcome side effects in general.

Now the support functions Laurenz Albe reports on at https://www.cybertec-postgresql.com/en/optimizer-support-functions/ would seem to be helpful here; I would guess that giving q3c_join a selectivity of roughly pi*match_radius^2 / (4 pi) (so: circle area over sphere area) would already greatly improve the query plans (even acknowledging that star density varies by orders of magnitude over the sky).

Do you have plans to add support functions? Do you see major obstacles to at least a very rough implementation that would assume evenly distributed objects?

@segasai
Owner

segasai commented Feb 10, 2021

Yes, I've been watching this topic for a long while, and this is the reason I've implemented the selectivity function for parts of q3c_join (check here

q3c/q3c.c

Line 136 in 0254998

Datum pgq3c_seljoin(PG_FUNCTION_ARGS)
). The problem is that the q3c_ functions are SQL-inlined, and the underlying clauses that are actually executed are q3c_ang2ipix() < XX and q3c_ang2ipix() > YY. So the only way to have selectivity for this is to make a custom type and operator, which as far as I am aware would not allow the bitmap index scan.
So basically I don't know of a way of making it work, though I would really like to. If you can demonstrate an approach that would work, I'll gladly help/accept it. But to my knowledge, without going into new types/operators/operator classes it is impossible, and even with that I'm not sure.

@segasai
Owner

segasai commented Feb 10, 2021

Looking at the https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=74dfe58a5927b22c744b29534e67bfdd203ac028 patch, it seems it only supports conditions like
select * from tab where myfunction()
If there was a way to support
select * from tab where myfunction() < X or
select * from tab where myfunction() between X and Y

That would do the job, because writing those selectivities would be easy.

@msdemlei
Author

msdemlei commented Feb 11, 2021 via email

@segasai
Owner

segasai commented Feb 11, 2021

(a) we can provide selectivity estimates to q3c_join(ra1, dec1, ra2,
dec2, radius) but then lose the ability to have index scans,
defeating the whole purpose, or
(b) we keep letting postgres inline things and then lose the ability
to provide selectivity estimates?

That's my understanding, but I've also been looking at the git commits of those support functions, and maybe there is something there that provides support for func() < value queries or provides a way to specify an index strategy given a function call, but I am not sure. There isn't a lot of documentation there. I was previously thinking of writing an email to pgsql-hackers asking about it, but haven't had the time (if @laurenz can help, that'd be great).
Realistically, we need PG to be able to execute

( func(a,b) between X and Y)  or 
( func(a,b) between X and Y)  or 
( func(a,b) between X and Y)  

queries with user-provided selectivity for the between operator.
Alternatively, I'm okay with introducing a user-defined operator instead of "between", something like
func(a,b) @@ (X, Y)
if the usage of the operator essentially leads to the same query as above.

@laurenz

laurenz commented Feb 11, 2021

Hm. Can't you inline a function such that the inlined expression contains another function that has a support function?
I didn't look at this special case, though.

@segasai
Owner

segasai commented Feb 11, 2021

Yes, that's the idea. But I don't know if the support function for a function F can provide selectivity for queries like this
where (F(a,b) > X and F(a,b) < Y) or (F(a,b) > Z and F(a,b) < K)
because our current functions are inlined like this:

SELECT (((q3c_ang2ipix($3,$4)>=(q3c_nearby_it($1,$2,$5,0))) AND (q3c_ang2ipix($3,$4)<=(q3c_nearby_it($1,$2,$5,1))))

@laurenz

laurenz commented Feb 16, 2021

After looking at what q3c_join does, it seems indeed to be a perfect candidate for a support function.

You would do away with the inlined SQL function and use a support function in its place, see PostGIS' postgis_index_supportfn.

The support function would make the optimizer run an index scan instead of, or as a filter before, executing the function.
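
For reference, a planner support function is a C function that receives a support-request node from the planner. A minimal, hypothetical sketch that only answers SupportRequestSelectivity could look like the following; the function name, the fixed fallback selectivity, and the omitted argument handling are illustrative, not an actual q3c implementation:

#include "postgres.h"
#include "fmgr.h"
#include "nodes/supportnodes.h"

PG_FUNCTION_INFO_V1(q3c_join_support);

Datum
q3c_join_support(PG_FUNCTION_ARGS)
{
    /* The planner hands us a support-request node; only
     * SupportRequestSelectivity is handled here, everything else is ignored. */
    Node *rawreq = (Node *) PG_GETARG_POINTER(0);
    Node *ret = NULL;

    if (IsA(rawreq, SupportRequestSelectivity))
    {
        SupportRequestSelectivity *req = (SupportRequestSelectivity *) rawreq;

        /* A real implementation would pull the radius out of req->args
         * (when it is a Const) and compute pi*r^2 / 41252.96 deg^2. */
        req->selectivity = 1e-6;
        ret = (Node *) req;
    }

    PG_RETURN_POINTER(ret);
}

The support function would then be attached to the main function with something like CREATE FUNCTION q3c_join(...) ... SUPPORT q3c_join_support.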

@segasai
Owner

segasai commented Feb 16, 2021

Thank you for the pointer @laurenz , I'll take a look there.

@gshennessy

I've never understood why in q3c--2.0.0.sql the routine q3c_join tests for a range of values against q3c_nearby_it with four cases but q3c_radial_query tests for 200 cases. Is that number related to the number of polygons that can be tested for, or is that a coincidence? Is this documented anywhere?

@segasai
Owner

segasai commented Jan 11, 2022

The reason is that the radial query is mostly designed for large circle queries, so making a more precise approximation to a circle by going deeper in the quad-tree can save significant I/O. For q3c_join, the main use case was assumed to be a pretty small cross-match radius, so it is less crucial to have a good quad-tree approximation to a circle, as the matching rows will likely be sitting in the same disk block anyway.

That's the original reasoning (I think). Whether going deep in q3c_radial_query is that beneficial in practice is not 100% obvious (but a query near the galactic plane in, say, Gaia with q3c_radial_query using just 4 squares would for sure run significantly longer).

@gshennessy

Might we be overthinking this? I'm also struggling with the issue of plans not using the q3c index when I (as a human) think they should. Rather than dealing with anything complicated, perhaps simply adding a SupportRequestSelectivity call to allow the planner to make better row estimates would help.

I don't know if people can see the plan at https://explain.dalibo.com/plan/RMS
I try to match a 526-star Landolt table with the 747-million-star AllWISE catalog.
The bitmap index scan is estimated at 43 billion rows, ending up overestimated by 8 million.
The 43 billion row estimate is 526 * 747 million / 9, which comes from the default selectivity postgres assumes (1/3 for each of the two inequality conditions, hence the factor of 1/9).
If we have q3c_seljoin set a selectivity of theta squared divided by 41252, I would think we would be fine.

Thoughts?

@segasai
Owner

segasai commented May 18, 2022

That's exactly what seljoin does

q3c/q3c.c

Line 168 in 1290e00

ratio = 3.14 * rad * rad / 41252.; /* pi*r^2/whole_sky_area */

But that selectivity is attached to a special operator (see https://github.com/segasai/q3c/blob/master/scripts/q3c--2.0.0.sql),
not to the q3c_ang2ipix() >= / <= conditions. If someone can help me with how to attach the SupportRequestSelectivity stuff to q3c_join, that'd be appreciated. It would probably take me a few days to figure this out, and I don't have that time now.
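
For readers following along: a selectivity estimator is attached to an operator through its RESTRICT clause. Roughly like the sketch below; the exact function names and C entry points in q3c--2.0.0.sql may differ, so treat this as an illustration only.

-- Sketch only: the boolean operator function and its C entry point are assumed names.
CREATE FUNCTION q3c_sindist_op(double precision, q3c_type)
    RETURNS boolean
    AS 'q3c', 'pgq3c_sindist_op'   -- assumed C entry point
    LANGUAGE C IMMUTABLE STRICT;

CREATE OPERATOR ==<<>>== (
    LEFTARG   = double precision,
    RIGHTARG  = q3c_type,
    PROCEDURE = q3c_sindist_op,
    RESTRICT  = q3c_seljoin        -- the planner calls this to estimate selectivity
);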

I also recently looked at a way of using int8range types to replace the q3c_ang2ipix() > / <= conditions
https://www.mail-archive.com/pgsql-general@lists.postgresql.org/msg30615.html
with the hope of better estimates, but it looks like PG is still missing some functionality there for it to be useful.

@gshennessy

I started a thread at https://dba.stackexchange.com/questions/312526/in-postgresql-13-how-do-i-ensure-a-supportrequestselectivity-is-made where I show a first attempt at a selectivity operator, but my function isn't called with SupportRequestSelectivity, only SupportRequestSimplify. I've also posted the same information to pgsql-hackers. I'll keep you posted in case there is progress on this issue.

@segasai
Owner

segasai commented May 26, 2022

Great!
There was also a long thread on that topic here https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg29836.html that I've spent some time reading.
In the meantime I have hacked up a quick version that would allow me to return multiranges:
https://gist.github.com/segasai/12488c591ff7dd495ec78b42aef80631
This way we'd have something like q3c_ang2ipix(ra,dec) OPERATOR FUNC(), where FUNC() is the function returning the multirange corresponding to the cone search or cross-match.
What I had in mind is to then try to transform that using SupportRequestIndexCondition, as described here: https://www.mail-archive.com/pgsql-general@lists.postgresql.org/msg30620.html
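
In other words, the target query shape would be roughly the following, where q3c_cone_ranges() is a hypothetical name for the multirange-returning function:

SELECT *
FROM mytable
WHERE q3c_ang2ipix(ra, dec) <@ q3c_cone_ranges(266.4, -28.9, 0.1);

-- which a support function would ideally rewrite (via SupportRequestIndexCondition)
-- into index-friendly conditions of the form
--   q3c_ang2ipix(ra, dec) BETWEEN lo_1 AND hi_1
--   OR q3c_ang2ipix(ra, dec) BETWEEN lo_2 AND hi_2 ...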

@gshennessy

Actually, I'm not sure the issue is the row estimation, but the cost estimation. From an email to pgsql-hackers today:


On Thu, May 26, 2022 at 3:10 PM Tom Lane tgl@sss.pgh.pa.us wrote:

Can you do anything useful with attaching selectivity estimates
to the functions it references, instead?
I may have been going down a bad path before. The function I'm
working to improve has five arguments, the last being "degrees", which
is the match radius. Obviously a larger match radius should cause more
matches.

For a small value of a match radius (0.005 degrees):

q3c_test=# explain (analyze, buffers) select * from test as a, test1 as b where q3c_join(a.ra,a.dec,b.ra,b.dec,.005);
                                QUERY PLAN
 Nested Loop  (cost=92.28..22787968818.00 rows=5 width=32) (actual time=7.799..10758.566 rows=31 loops=1)
   Buffers: shared hit=8005684
   ->  Seq Scan on test a  (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.008..215.570 rows=1000000 loops=1)
         Buffers: shared hit=5406
   ->  Bitmap Heap Scan on test1 b  (cost=92.28..22785.45 rows=250 width=16) (actual time=0.009..0.009 rows=0 loops=1000000)

(note: I deleted some of the output, since I think I'm keeping the important bits)

So, the cost of the query is calculated as 2e10; it expected five rows, found 31, and with a hot cache read 8 million units of disk space (I'd have to check the fine manual to remind myself of the units of that).

When I do the same sort of query with a much larger match radius (5 deg) I get:
q3c_test=# explain (analyze, buffers) select * from test as a, test1 as b where q3c_join(a.ra,a.dec,b.ra,b.dec,5);
                                QUERY PLAN
 Nested Loop  (cost=92.28..22787968818.00 rows=4766288 width=32) (actual time=0.086..254995.691 rows=38051626 loops=1)
   Buffers: shared hit=104977026
   ->  Seq Scan on test a  (cost=0.00..15406.00 rows=1000000 width=16) (actual time=0.008..261.425 rows=1000000 loops=1)
         Buffers: shared hit=5406
   ->  Bitmap Heap Scan on test1 b  (cost=92.28..22785.45 rows=250 width=16) (actual time=0.053..0.247 rows=38 loops=1000000)

The "total cost" is the same identical 2e10, this time the number of
rows expectd
is 4.7 million, the number of rows delivered is 38 million (so the
calculation is off
by a factor of 8, I'm not sure that is important), but the io is now 104
million units.
So while we are doing a lot more IO, and dealing with a lot more rows, the
calculated cost is identical. That seems strange me me. Is that a normal
thing?
Is it possible that the cost calculation isn't including the selectivity
calculation?

@segasai
Owner

segasai commented May 27, 2022

The problem here is that q3c_join has two parts: one is the set of q3c_ang2ipix() > / < conditions, and the other is the distance filtering.
The selectivity is currently implemented as part of the distance cut here:

AND ($5::double precision ==<<>>== ($1,$2,$3,$4)::q3c_type)

so PG still expects the same selectivity, and therefore the same cost, for the part of the query involving the q3c_ang2ipix() > / < conditions, which leads to similar costs for different radii. I think at some point I bumped up the CPU cost of q3c_sindist() to make sequential scans less likely, but I don't think that works correctly.

@gshennessy

This may be a crazy thing for me to try, but I'm going to see if adding selectivity to q3c_nearby_it helps or not.

@segasai
Owner

segasai commented May 27, 2022

I doubt it will work. In my understanding, selectivity can only be attached to functions or operators returning booleans.

@gshennessy

It's a rainy day, and I'm bored.

@segasai
Owner

segasai commented May 27, 2022

I still believe the best approach is to use a multirange type and a special operator (q3c_ang2ipix() <@ multirange) whose 'Simplify' support request expands it into a bunch of ranges, or maybe even just a set of single ranges.
I.e. we just need to make sure that the condition q3c_ang2ipix() <@ int8range can be converted into an indexable condition, and then we can set up the right selectivity for it.

@gshennessy

You know far more about what is going on than I do; I'm just treating this as a learning experience, and I don't mind trying crazy ideas since it gives me a chance to learn more about both postgresql and q3c.

@segasai
Owner

segasai commented May 27, 2022

Sure, no problem. It'd be great if you could find a solution.

@gshennessy

gshennessy commented Aug 10, 2022

I think it might be possible if we were to use a GIST index rather than a b-tree one. A GIST index allows @>, so I think we can use the same ang2ipix(ra,dec) index, and then come up with a multirange function that does the same as q3c_nearby_it.

I've not learned the buttons to push to deal with merge requests. I can spend some time on this. if i tried to do a solution
using multiranges, it would only work (as far as i can tell) for postgres 14 or higher. would you be ok with having a 2.1 version
of q3c that requires pg 14, while leaving pg12 and pg13 to use version 2.0 for q3c? its also possible that i spend some time
and effort trying to get all the multirange stuff done, and we end up no faster than before. I think that the default selectivities
for GIST stuff is more like 0.01 rather than the 0.33333333 default postgresql selectivity, so that may help avoid the dreaded "why is my q3c_join against gaia dr3 using a sequential scan?" issue. i would probably create new funcitons such as
q3c_join_multirange and q3c_radial_query_multirange to allow some A/B testing.

Thoughts?

@segasai
Owner

segasai commented Aug 11, 2022

If you can somehow make the @> work, that'd be great.

Again, in https://www.mail-archive.com/pgsql-general@lists.postgresql.org/msg30620.html I tried this and learned that ang2ipix(ra,dec) <@ multirange won't use the index, unfortunately. So the only way I know to make it work with a multirange is if the gist index is created on multirange(ang2ipix(ra,dec), ang2ipix(ra,dec)+1).
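
A rough sketch of that workaround (table, index, and function names are illustrative; q3c_cone_ranges() is hypothetical, and whether the final operator form actually uses the index is exactly the open question here):

-- Index a degenerate range around each object's ipix value with GiST
-- (the built-in range_ops opclass for range types).
CREATE INDEX mytable_ipix_range_idx ON mytable
    USING gist (int8range(q3c_ang2ipix(ra, dec), q3c_ang2ipix(ra, dec) + 1));

-- A cone search would then be phrased as range-in-multirange containment:
SELECT *
FROM mytable
WHERE int8range(q3c_ang2ipix(ra, dec), q3c_ang2ipix(ra, dec) + 1)
      <@ q3c_cone_ranges(266.4, -28.9, 0.1);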

Also, I'm perfectly happy if improvements are only available in the latest PG versions. My biggest production DB with Q3C currently uses 13, and I usually try to migrate to a new version one year after the release, so 14 would be fine from my point of view.

@gshennessy

gshennessy commented Aug 11, 2022

I'm able to get ang2ipix(ra,dec) <@ multirange to use the index if I make the ang2ipix part a multirange (which means I rediscovered the same thing you did), which sort of sucks, but upon reading a bit more I think I can get multirange @> bigint to work; I have more hacking to do.
The SQL code looks much cleaner, but using a GIST index isn't appreciably faster/better than the b-tree index, since I've still not got the selectivity portion to work. I have hopes.

I'm still using 12 in my DB, but I'm about to move to RH 8 and PG14, at least that's what I'm telling my folks. Some people scream whenever you want to upgrade anything. :)

@segasai
Owner

segasai commented Aug 11, 2022

And how do you plan to make multirange @> bigint work, since at least this doesn't use the index (PG14):

postgres=# create temp table xtmp (a bigint, b bigint);
CREATE TABLE
postgres=# insert INTO xtmp  select
(random()*10000000000)::bigint,(random()*10000000000)::bigint  from
generate_series(0,1000000);
INSERT 0 1000001
postgres=# create index  ON  xtmp using gist (a);
CREATE INDEX
postgres=# analyze xtmp ;
ANALYZE
postgres=# set enable_seqscan to off;
SET
postgres=# explain select * from xtmp where int8range(4,10)@>a;
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Seq Scan on xtmp  (cost=10000000000.00..10000017906.01 rows=5000 width=16)
   Filter: ('[4,10)'::int8range @> a)
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(5 rows)

and Tom Lane says such functionality doesn't exist.
Transforming that into range queries through a support function may work, but probably will not solve the selectivity issue?

@gshennessy

I had been planning on using "anymultirange @> anyelement → boolean", which GIST is listed as being able to index, but it doesn't work. Grumble. It works if I do multirange @> multirange. It works if I do multirange @> range. It SHOULD work if I do multirange @> int8, but it doesn't. Grr.

@gshennessy

So, I did a small amount of coding trying to deal with multiranges. I had intended to create a pull request against my clone of q3c, not against the main repository, but I'm not sure that causes any problems. Obviously it isn't ready for prime time, so you can reject it.

So far I'm mostly underwhelmed with the multirange functionality. While it does make the new SQL code (e.g. q3c_radial_query_gist) way shorter, I've found:

  1. As you noticed first, you still can't index against a bigint, only a range or multirange, which means the indexes
     are bigger than they should be.
  2. While there were times the multirange portions ran as fast as the "old" q3c functions, it was never faster, so why
     do the work if the result isn't faster?
  3. There were times when switching the table order in the "old" code left the query running at the same speed, while
     with my new code there were times when switching the order made a major difference in run time.
  4. My version of radial_query_it_gist using multiranges seems to get VERY slow with large numbers of regions:
     doing 1 million calls of q3c_multirange_nearby_it took about 4 seconds (that builds 4 ranges),
     but 1 million calls of q3c_multirange_radial_query_it took 90 seconds.
     If I did a second version that didn't create a range where there was no data (so I got about 70 ranges rather
     than 100), the time did drop to about 60 seconds, but currently I'm not seeing the bang for the buck that I had
     hoped for.

There still isn't a solution to not being able to calculate selectivity, but I think the selectivities are better
for the GIST indices than the btree ones, for what that's worth.

I attempted to attach my timing results:
test_for_sergey.txt

@segasai
Owner

segasai commented Aug 26, 2022

Thanks for giving it a try, @gshennessy, and for the description!
I am not quite sure what to take from this; maybe it's worth benchmarking multiranges separately to see if they're just too slow for this, or maybe int <@ multirange will be faster once somebody implements index support for it.

@gshennessy

I took a new look at this issue, and I believe I have a selectivity function working. It does seem to make a difference in the row estimation, but not in the cost. My understanding is that the postgresql query optimizer chooses the lowest-cost plan.
What I did was create a new function q3c_sindist_boolean that takes an extra argument of the radius,
so that instead of doing

AND q3c_sindist($1,$2,$3,$4)<POW(SIN(RADIANS($5)/2),2)

I do

AND q3c_sindist_bool($1,$2,$3,$4,$5)

where I calculate the trig inside the q3c_sindist_bool function, and the function returns boolean rather than DOUBLE PRECISION.
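
A rough sketch of what the declaration of such a boolean wrapper could look like (the C entry point and the support-function name here are assumptions, not the actual patch):

CREATE FUNCTION q3c_sindist_bool(double precision, double precision,
                                 double precision, double precision,
                                 double precision)
    RETURNS boolean
    AS 'q3c', 'pgq3c_sindist_bool'       -- assumed C entry point
    LANGUAGE C IMMUTABLE STRICT
    SUPPORT q3c_sindist_bool_support;    -- would answer SupportRequestSelectivity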

I can do a merge request to show my code whenever you want; I'm currently trying to figure out what the tall poles in the cost function are.

Small progress is better than none, I guess.

@segasai
Owner

segasai commented Oct 17, 2023

Thanks. If you do a PR, or post a patch in some form, I can take a look when I have time.

I remember I tried in the past to artificially increase the sindist cost in order to minimize the number of times it's called and avoid seq scans, but I forgot why I stopped doing that. I can see how making a q3c_sindist_bool function can help.

@gshennessy

I made a pull request. To first order, changing the selectivity changes the number of estimated rows, but doesn't change the cost function very much at all.

@gshennessy

gshennessy commented Oct 18, 2023

I should give a shout out to Laurenz Albe since he pointed me in the right direction.

@laurenz

laurenz commented Oct 18, 2023

How do you expect me to react to this shout?

@gshennessy

How do you expect me to react to this shout?

No expectations. I simply wanted to acknowledge you were helpful in giving advice to me on how to solve the problem. Thank you.
