New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sync versions: create new versions in bulk #7382
Conversation
Currently we create version per version, this is calling the db n times (n the number of new versions). Bulk creation will produce only one call. To avoid having in memory all versions, we create them in batch as needed. https://docs.djangoproject.com/en/3.0/ref/models/querysets/#bulk-create
Tested this locally with the stress test repo, previously creating 30K new versions would take like >10 minutes, now it takes just a few. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good improvement. I just left some small questions about the implementation.
while True: | ||
batch = list(itertools.islice(objs, batch_size)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually try to not use while True
code in production. Can't we split objs
in advance and iterate over them with a for
loop instead? That's will be safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splitting in advance will evaluate all objects at once making it slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... We can split versions
in advance instead, and create multiple objs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python doesn't have something like take n
, you'll be implementing the same with a while loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the documented way of doing this in Django: https://docs.djangoproject.com/en/2.2/ref/models/querysets/#bulk-create
Returns the slug of all added versions. | ||
""" | ||
added = set() | ||
batch_size = 150 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did you get this number? Is 150 a recommended value?
Having 30k versions as you mentioned, is it better to create 150 of them on each iteration or it would be better to create 1k, 5k or all at once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1K is a big number, is slow to evaluate 1k versions on each iteration.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This is stil valid |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I don't love the while True
, but most other ways of doing it have problems with iteration, or are the same basic logic (eg. while foo
, and then set the foo = False
where we're currently doing the break).
I do worry about the many caveats that are listed in the Django docs. This won't call save
on the model or the post-save signals. I feel like we might be setting ourselves up for some future weird bug here. I'm not 100% sure the trade-off here is worth it, since we aren't creating versions that often. Perhaps we need some kind of warning on the Version model for this.
I think this is worth it, even for small projects, we will save a lot of db calls. I checked, and we only use the post_save signal for versions to purge the cache of inactive versions, this is for already created versions that the user mark as inactive. This help us to not kill our db when someone imports a big project, and the user can see all its versions right away. |
Currently we create version per version,
this is calling the db n times (n the number of new versions).
Bulk creation will produce only one call.
To avoid having in memory all versions,
we create them in batch as needed.
https://docs.djangoproject.com/en/3.0/ref/models/querysets/#bulk-create