Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync versions: create new versions in bulk #7382

Merged
merged 2 commits into from Dec 10, 2020
Merged

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Aug 12, 2020

Currently we create version per version,
this is calling the db n times (n the number of new versions).

Bulk creation will produce only one call.
To avoid having in memory all versions,
we create them in batch as needed.

https://docs.djangoproject.com/en/3.0/ref/models/querysets/#bulk-create

Currently we create version per version,
this is calling the db n times (n the number of new versions).

Bulk creation will produce only one call.
To avoid having in memory all versions,
we create them in batch as needed.

https://docs.djangoproject.com/en/3.0/ref/models/querysets/#bulk-create
@stsewd
Copy link
Member Author

stsewd commented Aug 12, 2020

Tested this locally with the stress test repo, previously creating 30K new versions would take like >10 minutes, now it takes just a few.

@stsewd stsewd requested a review from a team August 12, 2020 19:19
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good improvement. I just left some small questions about the implementation.

while True:
batch = list(itertools.islice(objs, batch_size))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually try to not use while True code in production. Can't we split objs in advance and iterate over them with a for loop instead? That's will be safer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitting in advance will evaluate all objects at once making it slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... We can split versions in advance instead, and create multiple objs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python doesn't have something like take n, you'll be implementing the same with a while loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the documented way of doing this in Django: https://docs.djangoproject.com/en/2.2/ref/models/querysets/#bulk-create

Returns the slug of all added versions.
"""
added = set()
batch_size = 150
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did you get this number? Is 150 a recommended value?

Having 30k versions as you mentioned, is it better to create 150 of them on each iteration or it would be better to create 1k, 5k or all at once?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1K is a big number, is slow to evaluate 1k versions on each iteration.

@stale
Copy link

stale bot commented Oct 10, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Oct 10, 2020
@stale stale bot closed this Oct 17, 2020
@stsewd
Copy link
Member Author

stsewd commented Oct 19, 2020

This is stil valid

@stsewd stsewd reopened this Oct 19, 2020
@stale stale bot removed the Status: stale Issue will be considered inactive soon label Oct 19, 2020
@stsewd stsewd requested a review from a team October 19, 2020 14:49
@stale
Copy link

stale bot commented Dec 5, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: stale Issue will be considered inactive soon label Dec 5, 2020
@ericholscher ericholscher added the Accepted Accepted issue on our roadmap label Dec 7, 2020
@stale stale bot removed the Status: stale Issue will be considered inactive soon label Dec 7, 2020
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I don't love the while True, but most other ways of doing it have problems with iteration, or are the same basic logic (eg. while foo , and then set the foo = False where we're currently doing the break).

I do worry about the many caveats that are listed in the Django docs. This won't call save on the model or the post-save signals. I feel like we might be setting ourselves up for some future weird bug here. I'm not 100% sure the trade-off here is worth it, since we aren't creating versions that often. Perhaps we need some kind of warning on the Version model for this.

@stsewd
Copy link
Member Author

stsewd commented Dec 10, 2020

I think this is worth it, even for small projects, we will save a lot of db calls. I checked, and we only use the post_save signal for versions to purge the cache of inactive versions, this is for already created versions that the user mark as inactive.

This help us to not kill our db when someone imports a big project, and the user can see all its versions right away.

@stsewd stsewd merged commit 8c2f334 into master Dec 10, 2020
3 checks passed
@stsewd stsewd deleted the sync-versions-bulk-creation branch December 10, 2020 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants