-
Notifications
You must be signed in to change notification settings - Fork 963
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reduce memory consumption of uploads and reindex task #3470
Conversation
warehouse/packaging/models.py
Outdated
Release.created)) | ||
.order_by(Release._pypi_ordering.desc()) | ||
.all(), | ||
key=lambda r: parse_version(r.version), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to be redundant, sorting by Release._pypi_ordering
should be the same as sorting by parse_version(r.version)
.
warehouse/packaging/models.py
Outdated
return sorted( | ||
orm.object_session(self).query(Release) | ||
.filter(Release.project == self) | ||
.options(orm.load_only( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having load_only
only a "generic" name like all_releases
or latest_release
makes me nervous because I feel like it's very easy to trigger N+1 query problems if this gets used in another context without realizing it, since the name doesn't imply anything about the fact it's limiting the data available.
What information do we actually use out of these? I don't remember what the templates do, but it looks like the search indexing really only cares about version numbers, is it the same for the templates? If it is maybe something like:
@property
def all_versions(self):
return (orm.object_session(self)
.query(Release.version)
.filter(Release.project == self)
.order_by(Release._pypi_ordering.desc())
.all())
Would be enough to do the trick? We'd probably want a list()
or something in there so the return type is just ["1.0", "2.0", 3.0"]
in whatever order we expect this in.
Another thing to think about is making these just functions that aren't attached to the Project
object, I'm not sure if that ends up being clearer or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forklift/legacy.py also needs to pull in all_releases to re enumerate _pypi_ordering
. I guess it'd be OK to do what's proposed here and inline the query for that.
warehouse/packaging/models.py
Outdated
def latest_release(self): | ||
return first( | ||
sorted( | ||
self.all_releases, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this is going to cause things that already use Project.all_releases
to re-execute the all_releases
query instead of re-using the already fetched values like it did previously. If we're going to end up executing a second query, we might as well handle all of the sorting in SQL. I also ask a similar question, is the only thing we really care about getting the latest version? then maybe this would be better off specified as:
@property
def latest_version(self):
return (orm.object_session(self)
.query(Release.version)
.filter(Release.project = self)
.order_by(
Release.is_prerelease.nullslast(),
Release._pypi_ordering.desc())
.first())
warehouse/packaging/search.py
Outdated
default=release.project.releases[0], | ||
).version | ||
obj["version"] = [r.version for r in release.project.all_releases] | ||
obj["latest_version"] = release.project.latest_release.version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget why I did this, but is the latest version not already the one being passed into this method as release
? If it is then this is redundant (and with these changes, causing an additional query to be executed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously it skipped prereleases, so I kept that logic in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that raises a bit of a point, should we use the other data from a pre-release here? Is it confusing to have say the summary and readme from a pre-release being shown with the latest non pre-release version?
I think maybe the version selection logic for indexing should use the same nullslast, order by logic that I posted above that will give the first non-prerelease, unless there is only pre-releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works for me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed.
5d83acd
to
9c6a748
Compare
9c6a748
to
00bda0a
Compare
I think the biggest drawback to this implementation is the N+1 query situation introduced by calling Unfortunately I wasn't able to find a sane way to get stable memory consumption doing it all in one JOIN. |
Appears this may still not be working against the full database :/ |
be8456d
to
be38131
Compare
Resolves two memory consumption issues for projects with long/large release histories:
Uploads:
https://github.com/pypa/warehouse/blob/ce894db254d2f9097da784768f5e8b55252c169b/warehouse/forklift/legacy.py#L966-L970
Search indexing:
https://github.com/pypa/warehouse/blob/ce894db254d2f9097da784768f5e8b55252c169b/warehouse/packaging/search.py#L63-L79