Add partition_update_enabled option #223

Open
wants to merge 1 commit into
base: main

drnextgis
Collaborator

@drnextgis drnextgis commented Oct 17, 2023

We ingest images into the catalog daily and have noticed that the check_partition call made during loading significantly slows down ingestion. Since we know the temporal distribution of our data in advance, we can pre-create the necessary partitions with

SELECT check_partition(
    'collectionxxx',
    tstzrange('2023-10-01', '2023-11-01', '[)'),
    tstzrange('2023-10-01', '2023-11-01', '[)')
);
...

instead of checking them during every ingestion. This pull request adds the partition_update_enabled option, which can be used to disable partition checking and so speed up ingestion. If a required partition does not exist, the loader raises an exception with an appropriate message.
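For anyone wanting to pre-create several months at once, here is a minimal sketch that only generates the SQL text, mirroring the check_partition call above. It assumes monthly partitions and that passing the same range twice, as in the example, is appropriate for your data. The helper names `month_starts` and `precreate_statements` are hypothetical and are not part of pgstac or pypgstac.

```python
from datetime import date


def month_starts(first: date, months: int) -> list[date]:
    """Return the first day of `months + 1` consecutive months, starting at `first`'s month."""
    starts = []
    year, month = first.year, first.month
    for _ in range(months + 1):
        starts.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return starts


def precreate_statements(collection: str, first: date, months: int) -> list[str]:
    """Build one check_partition statement per month (hypothetical helper).

    The collection name is interpolated directly into the SQL text, so this
    is only suitable for trusted input. Use bound parameters when executing
    against a real database.
    """
    bounds = month_starts(first, months)
    statements = []
    # Pair each month start with the next one to form a half-open range.
    for lower, upper in zip(bounds, bounds[1:]):
        rng = f"tstzrange('{lower.isoformat()}', '{upper.isoformat()}', '[)')"
        statements.append(f"SELECT check_partition('{collection}', {rng}, {rng});")
    return statements


if __name__ == "__main__":
    for stmt in precreate_statements("collectionxxx", date(2023, 10, 1), 3):
        print(stmt)
```

The generated statements could then be run once, for example with psql, before the daily ingestion starts.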

@drnextgis drnextgis marked this pull request as draft October 20, 2023 17:03
@drnextgis drnextgis marked this pull request as ready for review October 20, 2023 21:06
@bitner
Collaborator

bitner commented Nov 6, 2023

Hey @drnextgis, just want to give a heads up that I have seen this. I was out of the office for a while. I'll be looking at this and a few other things in the check_partitions script this week.

@bitner
Collaborator

bitner commented Nov 7, 2023

@drnextgis In #226 I've added an option to the pypgstac loader and a new command that I think should ameliorate the issue you are having. They still let you go back afterwards and update all the constraints, indexes, and statistics that the check_partitions and update_partition_stats calls keep up to date.

The option is "--usequeue" and the command is pypgstac runqueue.

My thinking is that during a large data loading session you can use the following workflow, regardless of whether you are running the query queue with a cron job.

pypgstac load items src/pgstac/tests/testdata/items1.ndjson --debug --usequeue
pypgstac load items src/pgstac/tests/testdata/items2.ndjson --debug --usequeue
pypgstac load items src/pgstac/tests/testdata/items3.ndjson --debug --usequeue
pypgstac runqueue --debug
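The same workflow can be scripted for an arbitrary list of files. This is a minimal sketch, assuming pypgstac is on the PATH and the database connection is already configured in the environment. The flags are copied from the commands above, and the helpers `queue_load_commands` and `run_all` are hypothetical names used for illustration only.

```python
import subprocess


def queue_load_commands(item_files: list[str], debug: bool = True) -> list[list[str]]:
    """Build argv lists: one queued load per file, then a single runqueue."""
    extra = ["--debug"] if debug else []
    commands = [
        ["pypgstac", "load", "items", path, *extra, "--usequeue"]
        for path in item_files
    ]
    # Flush the queued maintenance queries once, after all loads are done.
    commands.append(["pypgstac", "runqueue", *extra])
    return commands


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Building the argument lists separately from running them makes it easy to print or log the commands before anything touches the database.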

This PR should be my last round of work before kicking off a 0.8.2 release, so I would be curious to hear your review.

@drnextgis
Collaborator Author

@bitner I finally got around to testing your changes. The ability to enforce the query queue for loading is quite useful! However, it only partially resolves my initial concern: only the SELECT update_partition_stats query is queued, while the heavy SELECT check_partition still runs. I believe the original PR still holds value.
