Improve database interaction to mark event publications as completed #251
Comments
Any reason you keep these publications around for so long? The completed ones could just be removed after a reasonable amount of time (which might evaluate to “right away”). In #35, we already have a ticket that intends to explore strategies to purge completed event publications. If you have a preferred way we could do that, would you mind sharing your thoughts there, too? I can imagine we expose a retention time and a purging cron expression that defines when to run the purges.
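A configuration surface for that idea might look like the fragment below. These property names are invented for illustration only; they are not an existing Spring Modulith API:

```properties
# Hypothetical settings, not actual Spring Modulith properties:
# keep completed publications for 30 days...
example.events.completion.retention=30d
# ...and run the purge every night at 3 a.m.
example.events.completion.purge-cron=0 0 3 * * *
```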
Thank you @odrotbohm. I think the main reason all historical events are being preserved is that it's the default behavior. Beyond that, we believe having some look-back period on events and the serialized_event data could be helpful for debugging and for providing insight into why something happened in our application. If we missed some configurable options around purging, it would be nice to know. In general, this problem reminds me of the SQS deletion policy: https://docs.awspring.io/spring-cloud-aws/docs/current/apidocs/io/awspring/cloud/messaging/listener/SqsMessageDeletionPolicy.html. Users can configure events to be deleted always, on success, etc. In this case, I could see implementing something similar: never delete, or delete on success. Or allow an application to define a custom post-processing handler. Some folks might use the custom handler to delete records from the event publication table and write them to a separate historical table. That would keep event publication queries fast while still providing access to historical events if desired.
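To make the SqsMessageDeletionPolicy analogy concrete, here is a minimal sketch of what such a retention setting could look like. All names here are hypothetical and purely illustrative; nothing like this exists in the library:

```java
// Hypothetical sketch of an SqsMessageDeletionPolicy-style retention setting
// for event publications. Names are invented for illustration.
public class DeletionPolicySketch {

    enum PublicationRetention {
        NEVER_DELETE,       // current default: rows are kept forever
        DELETE_ON_SUCCESS,  // remove the row once the listener succeeded
        ARCHIVE_ON_SUCCESS  // move the row to a historical table, then delete
    }

    // Describes what would happen to a completed publication under each policy.
    static String handleCompleted(PublicationRetention policy) {
        return switch (policy) {
            case NEVER_DELETE -> "keep row, set completion_date";
            case DELETE_ON_SUCCESS -> "delete row from jpa_event_publication";
            case ARCHIVE_ON_SUCCESS -> "move row to a historical table, then delete";
        };
    }

    public static void main(String[] args) {
        System.out.println(handleCompleted(PublicationRetention.DELETE_ON_SUCCESS));
    }
}
```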
Worth noting: even with a post-event-consumption handler that deletes or archives the event, the hot jpa_event_publication table is vulnerable to table bloat, at least in Postgres, which itself will degrade performance.
Also, assuming …
We currently do not expose the publication identifier on …. I'll open up a separate ticket to explore that idea because, while it might actually help with the performance issues, as an index on the primary key could be used, it still doesn't resolve the problem of too many events polluting the publication table.
Turns out things are a little different than assumed. We unfortunately cannot use a primary key lookup for the publication, as the code that marks the publication as complete only has access to the event and the listener id. I've looked into options for carrying the publication identifier forward via an ….

That said, playing with this unveiled that we're rather inefficiently double-querying the publications and materializing them completely just to set the completion date. I've changed the ….

About to polish the code and submit the changes.
Changed the EventPublicationRepository interface to allow marking an event as completed without having to materialize it in the first place. This allows us to get rid of CompletableEventPublication. EventPublication now exposes its identifier to make sure the stores can actually store the same id. Introduced EventPublicationRegistry.deleteCompletedPublicationsOlderThan(Duration) to purge completed event publications before a given point in time.
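The purge semantics described above can be sketched in isolation. This is an in-memory illustration of the intended behavior, not the library's implementation: completed publications older than the cutoff are removed, while incomplete ones are always kept. The class and method names below are invented for the sketch:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// In-memory sketch of deleteCompletedPublicationsOlderThan(Duration) semantics.
// Not the actual implementation; names are illustrative.
public class PurgeSketch {

    // completionDate == null means the publication is still incomplete.
    record Publication(UUID id, Instant completionDate) {
        boolean isCompleted() { return completionDate != null; }
    }

    // Returns the publications that survive a purge with the given retention.
    static List<Publication> deleteCompletedOlderThan(
            List<Publication> all, Duration retention, Instant now) {
        Instant cutoff = now.minus(retention);
        List<Publication> kept = new ArrayList<>();
        for (Publication p : all) {
            boolean purge = p.isCompleted() && p.completionDate().isBefore(cutoff);
            if (!purge) {
                kept.add(p); // incomplete or recently completed rows are kept
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-01-10T00:00:00Z");
        List<Publication> pubs = List.of(
            new Publication(UUID.randomUUID(), now.minus(Duration.ofDays(5))),  // old, completed -> purged
            new Publication(UUID.randomUUID(), now.minus(Duration.ofHours(1))), // recent, completed -> kept
            new Publication(UUID.randomUUID(), null));                          // incomplete -> kept
        System.out.println(deleteCompletedOlderThan(pubs, Duration.ofDays(1), now).size()); // prints 2
    }
}
```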
This has been pushed against this ticket. Snapshots should be available in a minute. Please give them a try!
In an application that contains nearly 1 million rows in the jpa_event_publication table, certain queries are beginning to consume lots of CPU.
Query being run:
None of the fields in the where condition are indexed. This query requires full table scans.
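The query itself did not survive in this thread; based on the columns discussed here, the completion lookup presumably has roughly this shape (a reconstruction for illustration, not the verbatim statement):

```sql
-- Reconstructed shape only; the real statement may differ.
SELECT *
  FROM jpa_event_publication
 WHERE serialized_event = :event
   AND listener_id = :listenerId
   AND completion_date IS NULL;
```

With no index on any of these columns, every execution has to scan the entire table.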
I imagine the listener_id column has low cardinality in most applications and is a poor choice for an index. What about serialized_event? Would that be a high-cardinality field in most applications? Or perhaps a partial index on completion_date IS NULL (but that only works for databases that support partial indexes).
In any case, how can this query run effectively on large datasets, so that applications can scale and deal with millions of Modulith events?
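The partial-index idea mentioned above could look like this on PostgreSQL. This is a sketch under two assumptions: that the hot path only ever touches rows with completion_date IS NULL, and that serialized_event is too large to index directly (Postgres limits index row size), so a hash of it is indexed instead:

```sql
-- Partial index: only incomplete rows are indexed, so the index stays small
-- even when the table holds millions of completed publications.
-- md5(serialized_event) is used because indexing the full serialized payload
-- may exceed the index row-size limit.
CREATE INDEX idx_event_publication_incomplete
    ON jpa_event_publication (listener_id, md5(serialized_event))
 WHERE completion_date IS NULL;
```

Note that for the planner to use the expression index, the query would also have to compare md5(serialized_event) rather than the raw column.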