Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize database upserts. #30

Closed
Tracked by #33
NunoSempere opened this issue Mar 29, 2022 · 2 comments
Closed
Tracked by #33

Optimize database upserts. #30

NunoSempere opened this issue Mar 29, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@NunoSempere
Copy link
Collaborator

Right now upserts are done really inefficiently: the whole latest.platform_name database is deleted, and forecasts are inserted one by one. This is inefficient and could be much improved.

It's not clear whether it's worth it to optimize on postgres directly, rather than waiting until the graphql code is set up, and in fact I'm leaning towards the later.

@NunoSempere NunoSempere added the enhancement New feature or request label Mar 29, 2022
@berekuk
Copy link
Collaborator

berekuk commented Mar 29, 2022

Yes! I've already noticed this and working on it currently (simultaneously with some refactorings for platforms code).

  1. The low hanging fruit is to replace DROPs with DELETEs.
  2. The next step is to do actual upserts and avoid unnecessary DELETE/INSERT steps.
    • But: do we want to delete stale forecasts if new fetch doesn't return the ones that were previously in the DB? (probably; for example, a platform can return invalid results, e.g. publish a bad draft and then delete it). This affects whether we want to check if the new set of forecasts covers all the existing rows.
  3. Then we could optimize inserts to be bulk. I predict most of the speedups to be from this step, right now the performance is probably limited by the ping distance between the DB and backend server.

GraphQL is a separate and higher layer of abstraction than DB-related code, I don't see how it's relevant. Maybe you mean Prisma?..

In that case, I agree that it doesn't make sense to write much SQL by hand now when it's all going to be replaced. But I might do (1) and (3) from the list anyway, it doesn't seem difficult.


PS: I'd also like to figure out some kind of partial updates API for platforms; that should make updates for some platforms much more efficient. E.g., if a platform supports sorting by "latest" then we could just fetch the new forecasts and not everything. But that's a different and a larger task.

Note that this kinda conflicts with point (2): if the platform code provides a stream of updates then it's hard to detect whether another forecast in the DB which wasn't updated for a while is dead and should be deleted.

@NunoSempere
Copy link
Collaborator Author

do we want to delete stale forecasts if new fetch doesn't return the ones that were previously in the DB?

So the way I was doing this is to delete the old ones if the new fetch brings up a non-empty array. I'm ok with other heuristics, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants