-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable aggregation and grouping pushdown in the engine #581
Conversation
gruuya
commented
Dec 15, 2021
•
edited
edited
- Bump Multicorn to new version which supports optional pushdown of aggregation/grouping, as per Implement aggregation and grouping pushdown Multicorn#1
- Bump Elasticsearch fork to version which utilizes this mechanism (Multicorn aggregation/grouping pushdown support postgres-elasticsearch-fdw#1)
- Add a couple of helper libs/configs to the engine debug image
- Add a full end-to-end setup for aggregation pushdown, including Elasticsearch server image seeded with fixture data to act as a foreign data source in integration tests
All existing tests pass with the new Multicorn code which is great news. Let's hold off on merging this until we have test coverage of the new groupby features though. |
chown -R elasticsearch:elasticsearch /data && \ | ||
echo 'path.data: /data' >> config/elasticsearch.yml && \ | ||
echo 'discovery.type: "single-node"' >> config/elasticsearch.yml && \ | ||
echo "xpack.security.enabled: false" >> config/elasticsearch.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add some of these here too so that ES doesn't refuse to start with high disk usage
cluster.routing.allocation.disk.watermark.flood_stage: "99%"
cluster.routing.allocation.disk.watermark.high: "99%"
for row in result: | ||
assert row == (gender, age) | ||
|
||
age += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion is kind of difficult to mentally unroll. I recently started using pytest-snapshot
(https://pypi.org/project/pytest-snapshot/, usage example: https://github.com/splitgraph/splitgraph/blob/master/test/splitgraph/cloud/project/test_merging.py#L21) that can store expected outputs as files and then easily regenerate them, so that you can essentially approve a change to a test through version control and see what changed. I think it's worth changing some of these (especially the bigger query outputs where you're asserting the whole output) to use pytest-snapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, checking out pytest-snapshot
# DISTINCT queries are not going to be pushed down | ||
result = get_engine().run_sql("EXPLAIN SELECT COUNT(DISTINCT city) FROM es.account") | ||
assert len(result) > 2 | ||
assert _extract_query_from_explain(result) == _bare_sequential_scan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea for a test: some kind of a JOIN here, e.g. a JOIN between two aggregation results on es.account
(I think it won't be pushed down, but useful to test anyway)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea: a nested expression, e.g. SELECT avg(age * balance) FROM es.account GROUP BY state
to make sure this doesn't crash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet another: a window query that I think might be implemented as a sort/groupby, e.g. the balance of the oldest person in each state (haven't tested that the syntax is correct):
SELECT
age, balance, state
RANK () OVER (
PARTITION BY state
ORDER BY age DESC
) rank_number
FROM
es.accounts
WHERE rank_number = 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good ideas!
I'll add them to the test suites.
Extend the Dockerfile for PG debug image so that it optionally compiles Postgres with Valgrind support, and changes the container entrypoint so that PG gets run under Valgrind upon starting (`use_valgrind=1`).