Fix a memory leak on the Media Pipeline (Files and Images) #3813
Before this patch You can see lots of
After this patch With 50 concurrent requests, we're holding only 62
I'm not sure how much effort it would take to test this, but if it is not too much, I think it would be nice.
scrapy/pipelines/media.py
Outdated
@@ -139,6 +139,11 @@ def _cache_result_and_execute_waiters(self, result, fp, info):
result.cleanFailure()
result.frames = []
result.stack = None

# See twisted.internet.defer.returnValue docstring
Since there are no tests, at least a comment mentioning it fixes a memory leak
The commit message has the same description as this pull request.
I guess we're good checking the commit message for more details.
_DefGen_Return is a private, very internal detail of the inlineCallbacks implementation. I think it deserves a long comment explaining what it is trying to prevent.
Codecov Report
@@ Coverage Diff @@
## master #3813 +/- ##
==========================================
- Coverage 85.42% 82.67% -2.76%
==========================================
Files 169 169
Lines 9635 9637 +2
Branches 1433 1434 +1
==========================================
- Hits 8231 7967 -264
- Misses 1156 1413 +257
- Partials 248 257 +9
Codecov Report
@@ Coverage Diff @@
## master #3813 +/- ##
==========================================
+ Coverage 85.43% 85.46% +0.03%
==========================================
Files 169 169
Lines 9637 9664 +27
Branches 1434 1440 +6
==========================================
+ Hits 8233 8259 +26
- Misses 1156 1157 +1
Partials 248 248
@ejulio, I've tried to write a simple test case just to cover the regression. The ideal scenario would be to count references to the Request and Response objects after executing the cache function, but that turned out to be difficult. I tried to do that but gave up after dealing with some magic properties from PyPy (example
We're storing exceptions captured by Twisted in the media pipeline cache, but we're also using the defer.returnValue method in our own methods decorated with @defer.inlineCallbacks. The defer.returnValue method passes returned values forward by raising a defer._DefGen_Return exception, which extends the BaseException class and is captured by Twisted. As a result, the latest exception stored in the Failure object may also reference an HtmlResponse object through its __context__ attribute. Since the Response object also keeps a reference to the Request that originated it, you can imagine how much RAM we're wasting here. This can easily lead to a memory leak when running spiders with the Media Pipeline enabled and a set of requests that tends to raise a significant number of exceptions. Example triggers:
- media requests with 404 status responses
- userland exceptions coming from custom middlewares
- etc.
Ideally, we would implement a test that tracks _DefGen_Return references held in memory under the spider info object. Since that would be somewhat complicated, I've decided to introduce this simple regression test case instead.
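The retention mechanism described in this commit message can be illustrated without Twisted or Scrapy. The following is a simplified sketch, not the actual code path: `FakeDefGenReturn`, `FakeResponse`, `fake_return_value` and `download_and_fail` are hypothetical stand-ins for `defer._DefGen_Return`, a Scrapy Response, `defer.returnValue` and a media callback. It relies only on Python 3 implicit exception chaining (PEP 3134).

```python
class FakeDefGenReturn(BaseException):
    """Hypothetical stand-in for twisted.internet.defer._DefGen_Return."""

    def __init__(self, value):
        self.value = value


class FakeResponse:
    """Hypothetical stand-in for a Response that remembers its Request."""

    def __init__(self, request):
        self.request = request


def fake_return_value(value):
    # Same idea as defer.returnValue: hand a value back by raising.
    raise FakeDefGenReturn(value)


def download_and_fail(response):
    try:
        fake_return_value(response)
    except FakeDefGenReturn:
        # An exception raised while the "return" exception is being handled
        # gets that exception attached as __context__ (implicit chaining).
        raise ValueError("media download failed")


response = FakeResponse(request="GET http://example.com/image.png")
try:
    download_and_fail(response)
except ValueError as exc:
    # What a long-lived cache would end up storing.
    cached_error = exc

# The cached error still reaches the response (and its request) through
# the chained "return" exception.
leaked_response = cached_error.__context__.value
```

As long as `cached_error` stays in a cache, `leaked_response` and the request it points to cannot be garbage collected, which is the kind of growth the patch is meant to stop.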
c84dc6e
to
c95cf3e
Compare
add docstring and comments between the code lines
@dangra, thanks for your feedback. I've just updated the source code, renaming the test case and adding a proper docstring to it. I've also added comments between the source code lines and improved the comment on the cache function. Let me know if you have any additional suggestions.
I am good with the changes and the comments, thanks.
However, the code is not Python 2.7 compatible.
@dangra, thank you for the feedback regarding Python 2.7 tests. We should be good to go now.
# Exception Chaining (https://www.python.org/dev/peps/pep-3134/).
context = getattr(result.value, '__context__', None)
if isinstance(context, _DefGen_Return):
    setattr(result.value, '__context__', None)
Curious: why did you use setattr() instead of result.value.__context__ = None?
No particular reason. I just copied the line above and replaced get with set. Would you like me to change it?
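For readers following the getattr/setattr exchange, here is a minimal sketch of the same clean-up written with a plain attribute assignment. It is not the code merged in this pull request: `FakeDefGenReturn`, `FakePayload`, `drop_return_context` and `make_error` are hypothetical names. The `getattr` default is kept because Python 2.7 exceptions have no `__context__` attribute, and the weak reference is only there to show that the payload becomes collectable once the context is dropped.

```python
import gc
import weakref


class FakeDefGenReturn(BaseException):
    """Hypothetical stand-in for twisted.internet.defer._DefGen_Return."""

    def __init__(self, value):
        self.value = value


class FakePayload:
    """Hypothetical stand-in for a Response object."""


def drop_return_context(exc, return_type=FakeDefGenReturn):
    # getattr with a default also works on exceptions lacking __context__.
    context = getattr(exc, "__context__", None)
    if isinstance(context, return_type):
        # Plain assignment; equivalent to setattr(exc, '__context__', None).
        exc.__context__ = None
    return exc


def make_error():
    payload = FakePayload()
    ref = weakref.ref(payload)
    try:
        try:
            raise FakeDefGenReturn(payload)
        except FakeDefGenReturn:
            raise ValueError("download-error")
    except ValueError as exc:
        # Drop the traceback too, so that only __context__ retains the payload.
        return exc.with_traceback(None), ref


error, payload_ref = make_error()
still_alive_before = payload_ref() is not None

drop_return_context(error)
gc.collect()  # for interpreters without immediate reference counting
still_alive_after = payload_ref() is not None
```

Contexts of any other type are left untouched, so ordinary exception chaining remains available for debugging.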
Great work!
We're storing exceptions captured by Twisted in the media pipeline
cache, but we're also using the defer.returnValue method in our
own methods decorated with @defer.inlineCallbacks.
The defer.returnValue method passes returned values forward by
raising a defer._DefGen_Return exception, which extends the
BaseException class and is captured by Twisted.
As a result, the latest exception stored in the Failure object may
also reference an HtmlResponse object through its __context__
attribute. Since the Response object also keeps a reference to the
Request that originated it, you can imagine how much RAM we're
wasting here.
This can easily lead to a memory leak when running spiders with
the Media Pipeline enabled and a set of requests that tends to
raise a significant number of exceptions.
Example triggers: