Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve database interaction to mark event publications as completed #251

Closed
bones418 opened this issue Jul 31, 2023 · 7 comments
Closed
Assignees
Labels
in: event publication registry Event publication registry type: improvement Minor improvements
Milestone

Comments

@bones418
Copy link

bones418 commented Jul 31, 2023

In an application that contains nearly 1 million rows in the jpa_event_publication table, certain queries are beginning to consume lots of CPU.

Query being run:

select j1_0.id,
       j1_0.completion_date,
       j1_0.event_type,
       j1_0.listener_id,
       j1_0.publication_date,
       j1_0.serialized_event
from jpa_event_publication j1_0
where j1_0.serialized_event = $1
  and j1_0.listener_id = $2
  and j1_0.completion_date is null

None of the fields in the where condition are indexed. This query requires full table scans.

I imagine the listener_id column has low cardinality in most applications and is a poor choice for an index. What about serialized_event? In most applications would that be a high cardinality field? Or perhaps a partial index on completion_date null (but that only works for databases that support partial indexes)

In any case, how can this query run effectively on large datasets so that applications can scale and deal with millions of modulith events.

@odrotbohm
Copy link
Member

Any reason you keep these publications around for so long? The ones completed could just be removed after a reasonable amount of time (which might evaluate to “right away”). In #35, we already have a ticket that intends to explore strategies to purge completed event publications. If you have a preferred way we could do that, would you mind sharing your thoughts there, too? I can imagine we expose a retention time and a purging cron expression that defines when to run the purges.

@odrotbohm odrotbohm added the meta: waiting for feedback Waiting for feedback of the original reporter label Aug 1, 2023
@odrotbohm odrotbohm self-assigned this Aug 1, 2023
@odrotbohm odrotbohm changed the title Expensive queries on jpa_event_publication table Expensive queries on event publications table Aug 1, 2023
@bones418
Copy link
Author

bones418 commented Aug 3, 2023

Thank you @odrotbohm. I think the main reason that all historical events are being preserved is because that's the default behavior. Besides that, we believe having some look back period on events and the serialized_event data could be helpful in debugging and providing insights into why something happened in our application. If we missed some configurable options around purging it would be nice to know.

In general, this problem makes me think of SQS deletion policy https://docs.awspring.io/spring-cloud-aws/docs/current/apidocs/io/awspring/cloud/messaging/listener/SqsMessageDeletionPolicy.html. Users can configure events to be deleted always, or on success, etc.

In this case, I could see implementing something similar to never delete, or delete on success. Or allow an application to define a custom post-processing handler. Some folks may use the custom handler to delete records from the event publication table and write them to another historical table. That would keep event publication queries fast, but still provide access to historical events if desired.

@masch712
Copy link

masch712 commented Aug 3, 2023

Worth noting: Even with a post-event-consumption handler that deletes or archives the event, the hot jpa_event_publication table is vulnerable to table bloat, at least in postgres, which itself will degrade performance.
One way around this is partitioning the table on publication_date, and dropping old partitions after some period of time. Ideally, we'd archive the data in those partitions to some "cold" archive table before dropping them.
Looks like JPA supports table partitioning on @Entity classes. Not sure if you want to pollute that class with something that might only be relevant for Postgres, though.

@masch712
Copy link

masch712 commented Aug 3, 2023

Also, assuming JpaEventPublicationRepository.findIncompletePublicationsByEventAndTargetIdentifier(…) is the source of the query here, I wonder why JpaEventPublicationRepository.update(...) needs to call findIncompletePublicationsByEventAndTargetIdentifier(…) to retrieve the event, and can't it just call findById(…) to do a primary key lookup?

@odrotbohm
Copy link
Member

We currently do not expose the publication identifier on EventPublication, which, IIUC, we could do actually, even without exposing the identifier type. I guess we could switch to that with the positive side effect of us not relying on assuming non-equality on the individual events. As you identifier correctly, currently, we could still end up with multiple results for that query if the serialized event matches multiple publications.

I'll open up a separate ticket to explore that idea because, while it might actually help performance issues as an index on the primary could be used, it still doesn't resolve the problem of too many events polluting the publication table.

@odrotbohm
Copy link
Member

odrotbohm commented Aug 7, 2023

Turns out things are a little different than assumed. We unfortunately cannot use a primary key lookup for the publication, as the code that's marking the publication as complete only has access to the event and the listener id. I've looked into options into carrying the publication identifier forward via an ApplicationEvent wrapper, but the completion interceptor lives close to the actual method invocation in which only see the user-defined event payload. I'm not saying we're not going to find a solution eventually, but it seems like a bigger challenge so that I'd like to avoid tackling it so short before our 1.0 RC/GA.

That said, playing with this unveiled that we're rather inefficiently double-querying the publications and materialize them completely just to set the completion date. I've changed the EventPublicationRepository API to use a different invocation model that allows repository implementations to trigger the update in a single query. It's now essentially calling the database-specific equivalent of set completion date where event = ? and listener_id = ? directly, which should make the completion interaction a bit more efficient as we don't have to issue a select in the first place. It also allowed us to get rid of the CompletableEventPublication interface completely. I've also added a method EventPublicationRegistry.deleteCompletedPublicationsOlderThan(Duration) to enable user code to implement the fine-tuned purging described above.

About to polish the code and submit the changes.

odrotbohm added a commit that referenced this issue Aug 7, 2023
Changed the EventPublicationRepository interface to allow marking an event as completed without having to materialize it in the first place. This allows us to get rid of CompletableEventPublication. EventPublication not exposes its identifier to make sure the stores can actually store the same id.

Introduced EventPublicationRegistry.deleteCompletedPublicationsOlderThan(Duration) to purge completed event publications before a given point in time.
@odrotbohm
Copy link
Member

This has been pushed against this ticket. Snapshots should be available in a minute. Please give them a try!

@odrotbohm odrotbohm changed the title Expensive queries on event publications table Improve database interaction to mark event publications as completed Aug 7, 2023
@odrotbohm odrotbohm added this to the 1.0 RC1 milestone Aug 7, 2023
@odrotbohm odrotbohm added in: event publication registry Event publication registry type: improvement Minor improvements and removed meta: waiting for feedback Waiting for feedback of the original reporter labels Aug 7, 2023
odrotbohm added a commit that referenced this issue Aug 9, 2023
Changed the EventPublicationRepository interface to allow marking an event as completed without having to materialize it in the first place. This allows us to get rid of CompletableEventPublication. EventPublication not exposes its identifier to make sure the stores can actually store the same id.

Introduced EventPublicationRegistry.deleteCompletedPublicationsOlderThan(Duration) to purge completed event publications before a given point in time.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: event publication registry Event publication registry type: improvement Minor improvements
Projects
None yet
Development

No branches or pull requests

3 participants