
Fix a memory leak on the Media Pipeline (Files and Images) #3813

Merged (8 commits) on Jun 24, 2019

Conversation

@victor-torres (Contributor) commented Jun 5, 2019

We're storing exceptions captured by Twisted in the media pipeline
cache, but we're also using the defer.returnValue method in our
own methods decorated with @defer.inlineCallbacks.

The defer.returnValue method passes returned values forward by
raising a defer._DefGen_Return exception, which in turn extends
BaseException and is caught by Twisted.

As a result, the latest exception stored in the Failure object may
also hold an HtmlResponse object through its __context__ attribute.
Since the Response object also keeps a reference to the Request
object that originated it, you can imagine how much RAM we're
wasting here.

This can easily lead to a memory leak when running spiders with
the Media Pipeline enabled against a set of requests that tends to
raise a significant number of exceptions.

Example triggers:

  • media requests with 404 status responses
  • userland exceptions coming from custom middlewares
  • etc.
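The returnValue mechanism described above can be sketched without Twisted. Everything below (`_DefGen_Return`, `returnValue`, `run_inline_callbacks`) is a simplified, illustrative stand-in for the private Twisted machinery, not the real implementation:

```python
# Simplified sketch of how defer.returnValue() communicates a result:
# it raises an exception, which the inlineCallbacks driver catches.
# All names here are illustrative stand-ins for Twisted internals.

class _DefGen_Return(BaseException):  # stand-in for Twisted's private class
    def __init__(self, value):
        self.value = value

def returnValue(val):
    # The "return value" travels as an exception payload.
    raise _DefGen_Return(val)

def run_inline_callbacks(gen):
    """Tiny synchronous stand-in for the @defer.inlineCallbacks driver."""
    try:
        for _ in gen:  # drive the generator to completion
            pass
    except _DefGen_Return as e:
        return e.value  # unwrap the payload

def fetch():
    yield "pretend this is a Deferred"
    returnValue("response body")

print(run_inline_callbacks(fetch()))  # prints: response body
```

Because the value travels inside a BaseException, anything that exception object references can be pinned by Python's implicit exception chaining if another error is raised while it is being handled.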

@victor-torres (Author) commented Jun 5, 2019

Before this patch

You can see lots of HtmlResponse and Request objects being held for no reason. They stay in memory until the spider finishes or the process is stopped for a memusage_exceeded reason. The oldest objects of those types keep aging as the spider runs; here we're already holding one thousand Request objects after just 5 minutes of running.

>>> prefs()            
                                   
Live References                                           
                                                          
HtmlResponse                      371   oldest: 328s ago
MySpider                            1   oldest: 333s ago
Request                          1000   oldest: 329s ago
Selector                           54   oldest: 65s ago 
TextResponse                      579   oldest: 304s ago
MyItem                             12   oldest: 14s ago

After this patch

With 50 concurrent requests, we're holding only 62 Request objects and 12 HtmlResponse objects in memory. Also, the oldest ones were created within the last minute, even though the spider has been running for 30 minutes.

>>> prefs()
                                                                                                                                                                         
Live References                                                                                                                                                                     
                                                                                                                                                                                    
HtmlResponse                       12   oldest: 22s ago                                                                                                                             
MySpider                            1   oldest: 1822s ago                                                                                                                           
Request                            62   oldest: 62s ago                                                                                                                             
Selector                           12   oldest: 22s ago                                                                                                                             
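For reference, the prefs() output above comes from Scrapy's trackref debugging utility. Its core idea is to record each tracked instance in a per-class weak dictionary, so live objects can be counted and aged without keeping them alive. A simplified stand-in of that idea (not the real scrapy.utils.trackref code; `FakeResponse` is illustrative):

```python
import gc
import weakref
from collections import defaultdict
from time import time

# Per-class weak dictionary: instance -> creation time.
# Weak keys mean tracking never prevents garbage collection.
live_refs = defaultdict(weakref.WeakKeyDictionary)

class object_ref:
    """Record each instance's creation time in a per-class weak dict."""
    def __new__(cls, *args, **kwargs):
        obj = super().__new__(cls)
        live_refs[cls][obj] = time()
        return obj

class FakeResponse(object_ref):  # illustrative tracked class
    pass

responses = [FakeResponse() for _ in range(3)]
assert len(live_refs[FakeResponse]) == 3  # three live objects tracked

del responses  # once unreferenced, the weak dict drops them
gc.collect()
assert len(live_refs[FakeResponse]) == 0
```

A prefs()-style report is then just iterating live_refs and printing, per class, the object count and the age of the oldest recorded timestamp.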

@ejulio (Contributor) left a comment


I'm not sure about the effort needed to test this, but if it's not too much, I think a test would be nice.

@@ -139,6 +139,11 @@ def _cache_result_and_execute_waiters(self, result, fp, info):
result.cleanFailure()
result.frames = []
result.stack = None

# See twisted.internet.defer.returnValue docstring
Contributor commented on the diff:

Since there are no tests, at least add a comment mentioning that this fixes a memory leak.

victor-torres (Author) replied:

The commit message has the same description as this pull request. I guess checking the commit message for more details is good enough.

Member commented:

_DefGen_Return is private and a very internal detail of the inlineCallbacks implementation. I think it deserves a long comment explaining what this code tries to prevent.
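The behavior the patch guards against can be demonstrated in plain Python. The classes below are illustrative stand-ins (the real code deals with twisted.python.failure.Failure and the private twisted.internet.defer._DefGen_Return):

```python
class _DefGen_Return(BaseException):  # stand-in for the private Twisted class
    pass

class FakeFailure:
    """Stand-in for twisted.python.failure.Failure."""
    def __init__(self, value):
        self.value = value  # the wrapped exception, like Failure.value

# Build an exception whose __context__ chains a _DefGen_Return, as
# happens when returnValue() has just fired and another error is raised
# while it is being handled (PEP 3134 implicit exception chaining).
try:
    try:
        raise _DefGen_Return()
    except _DefGen_Return:
        raise ValueError("media download failed")
except ValueError as exc:
    result = FakeFailure(exc)

assert isinstance(result.value.__context__, _DefGen_Return)

# The patch's check: break the chain so the cached Failure no longer
# pins whatever the _DefGen_Return references (Response, Request, ...).
context = getattr(result.value, '__context__', None)
if isinstance(context, _DefGen_Return):
    setattr(result.value, '__context__', None)

assert result.value.__context__ is None
```

Clearing __context__ drops the reference chain from the cached Failure to the _DefGen_Return, and from there to the returned Response and its originating Request.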

@codecov bot commented Jun 6, 2019

Codecov Report

Merging #3813 into master will decrease coverage by 2.75%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3813      +/-   ##
==========================================
- Coverage   85.42%   82.67%   -2.76%     
==========================================
  Files         169      169              
  Lines        9635     9637       +2     
  Branches     1433     1434       +1     
==========================================
- Hits         8231     7967     -264     
- Misses       1156     1413     +257     
- Partials      248      257       +9
Impacted Files Coverage Δ
scrapy/pipelines/media.py 97.27% <100%> (+0.05%) ⬆️
scrapy/linkextractors/sgml.py 0% <0%> (-96.81%) ⬇️
scrapy/linkextractors/regex.py 0% <0%> (-95.66%) ⬇️
scrapy/linkextractors/htmlparser.py 0% <0%> (-92.07%) ⬇️
scrapy/core/downloader/handlers/s3.py 62.9% <0%> (-32.26%) ⬇️
scrapy/extensions/statsmailer.py 0% <0%> (-30.44%) ⬇️
scrapy/utils/boto.py 46.66% <0%> (-26.67%) ⬇️
scrapy/_monkeypatches.py 42.85% <0%> (-14.29%) ⬇️
scrapy/link.py 86.36% <0%> (-13.64%) ⬇️
scrapy/core/downloader/tls.py 77.5% <0%> (-12.5%) ⬇️
... and 12 more

@codecov bot commented Jun 6, 2019

Codecov Report

Merging #3813 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3813      +/-   ##
==========================================
+ Coverage   85.43%   85.46%   +0.03%     
==========================================
  Files         169      169              
  Lines        9637     9664      +27     
  Branches     1434     1440       +6     
==========================================
+ Hits         8233     8259      +26     
- Misses       1156     1157       +1     
  Partials      248      248
Impacted Files Coverage Δ
scrapy/pipelines/media.py 97.29% <100%> (+0.07%) ⬆️
scrapy/commands/check.py 25.35% <0%> (-0.37%) ⬇️
scrapy/utils/misc.py 96.29% <0%> (+0.58%) ⬆️
scrapy/utils/reqser.py 94.11% <0%> (+1.8%) ⬆️

@victor-torres (Author) commented:
@ejulio, I've written a simple test case just to cover a regression of this issue.

The ideal scenario would be to count references to the Request and Response objects after executing the cache function, but that would involve some difficulty. I tried that but ended up giving up after dealing with some magic properties from PyPy (for example, _rawexcinfo) and Twisted's version of unittest.

The ideal here would be to implement a test able to track
_DefGen_Return references in memory under the spider info object.

Since that would be a little bit complicated, I've decided to
introduce this simple regression test case.

add docstring and comments between the code lines
@victor-torres (Author) commented:

@dangra, thanks for your feedback. I've just updated the source code, renaming the test case and adding a proper docstring to it. I've included additional comments between the source code lines and improved the comment on the cache function as well. Let me know if you have any additional suggestions.

@dangra dangra changed the title Fix a memory leak on the Media Pipeline (Files and Images) [MRG+1] Fix a memory leak on the Media Pipeline (Files and Images) Jun 17, 2019
@dangra (Member) left a comment


I am good with the changes and the comments. Thanks!

@dangra (Member) commented Jun 17, 2019

The code is not Python 2.7 compatible, though.

@dangra dangra changed the title [MRG+1] Fix a memory leak on the Media Pipeline (Files and Images) Fix a memory leak on the Media Pipeline (Files and Images) Jun 17, 2019
@victor-torres (Author) commented:

@dangra, thank you for the feedback regarding Python 2.7 tests. We should be good to go now.

# Exception Chaining (https://www.python.org/dev/peps/pep-3134/).
context = getattr(result.value, '__context__', None)
if isinstance(context, _DefGen_Return):
    setattr(result.value, '__context__', None)
Member commented:

Curious: why did you use setattr() instead of result.value.__context__ = None?

victor-torres (Author) replied:

No particular reason. I just copied the line above and replaced get with set. Would you like me to change it?

@Gallaecio Gallaecio merged commit f4f2b16 into scrapy:master Jun 24, 2019
@Gallaecio (Member) commented:

Great work!

@victor-torres victor-torres deleted the media-pipeline-leak branch June 24, 2019 13:41
@kmike kmike added this to the v1.7 milestone Jun 25, 2019