S3FilesStore can use a lot of memory #482
Comments
I use […]
Also there is a limitless cache: […]
The limitless cache is true, but it never accounted for a lot of memory unless you store responses in it. Looks like this change may be related to the memory issues: 0a8bf2c
Change the links to https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py and https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py since the ones linked are deprecated.
Are folks aware this has been reported as a security issue that affects most (if not all) versions of Scrapy? https://nvd.nist.gov/vuln/detail/CVE-2017-14158 & https://github.com/pypa/advisory-database/blob/8b7a4d62a95e8f605e5dfb4e0b4f299e6403dc12/vulns/scrapy/PYSEC-2017-83.yaml I'm currently trying to determine whether the security angle has been fixed, and if not, what is required to fix it, since we've got this flagged in one of our applications. I'm happy to help with this as much as I can, but I don't work in Python a lot, so I suspect I'll be of limited use; still, I'm happy to be a part of discussions and even dive into the code if someone more familiar with the codebase could spare some time to support me.
This issue is also flagged as PYSEC-2017-83 by pip-audit. I would recommend putting a hard, configurable limit on the total size of downloads initiated by a single top-level request to prevent the kind of DoS attacks mentioned. Keep in mind that scrapers like Scrapy are often detested by the sites being scraped (despite being perfectly legal), so deliberate DoS countermeasures are well within the realm of possibility.
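As a purely hypothetical sketch of that suggestion, a downloader middleware could keep a running byte count per request chain. The setting name, the meta key, and the assumption that the spider copies the meta key into its follow-up requests are all invented for illustration; none of this exists in Scrapy today.

```python
from scrapy.exceptions import IgnoreRequest


class DownloadBudgetMiddleware:
    """Hypothetical downloader middleware: cap the total bytes downloaded by a
    chain of requests that share a budget counter in request.meta."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes

    @classmethod
    def from_crawler(cls, crawler):
        # "DOWNLOAD_BUDGET_MAX_BYTES" is an invented setting name, not a real Scrapy setting.
        return cls(crawler.settings.getint("DOWNLOAD_BUDGET_MAX_BYTES", 100 * 1024 * 1024))

    def process_response(self, request, response, spider):
        # The spider is assumed to copy "download_budget_used" into the meta of
        # any follow-up requests it schedules; Scrapy does not propagate it itself.
        used = request.meta.get("download_budget_used", 0) + len(response.body)
        request.meta["download_budget_used"] = used
        if used > self.max_bytes:
            raise IgnoreRequest(f"download budget exceeded ({used} bytes)")
        return response
```

Raising IgnoreRequest only stops further processing of a response that has already been downloaded, so a real fix would need to sit deeper (e.g. before scheduling), but the sketch shows where such a budget could live.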
@fazalmajid that sounds like a reasonable solution - is that something you'd be comfortable implementing? As I've said, I'm not really a Python developer, so I can't really take on trying to implement a solution myself.
No, as I don’t use Scrapy myself and have no idea about the code base. Plus, since it’s a project sponsored by a commercial organization, they would certainly have opinions about whether to implement this and how, and I am not about to code “on spec”.
Hi, is there a plan to fix this? I have seen this vulnerability for a while, but I don't see a clear solution for it. Thanks!
Hey. It seems this CVE is popping up everywhere and can cause some warnings for the users, so we should do something about it. Let's investigate possible solutions. https://nvd.nist.gov/vuln/detail/CVE-2017-14158 says that the issue is an instance of https://cwe.mitre.org/data/definitions/400.html (CWE-400, "Uncontrolled Resource Consumption"). Description of CWE-400: […]
Scrapy controls the number and size of the resources in the following way: […]
Reducing the amount of RAM, e.g. storing more data to disk, is not a solution for the CVE. By doing so we're shifting the resource from "memory usage" to "disk usage" without addressing the security issue. It seems that having an option similar to SCRAPER_SLOT_MAX_ACTIVE_SIZE, but which works on the Downloader level or on the Engine level, should solve the problem. I.e. limit (in a soft way - if we're over the limit, stop accepting new work) the total byte size of responses being processed by all Scrapy components. What do you think?
As for the vulnerability, if the files/images pipelines or other components which add requests to the Downloader directly are not used, then it seems CWE-400 doesn't apply. There are the DOWNLOAD_MAXSIZE and CONCURRENT_REQUESTS limits, and they look adequate according to the CWE-400 description.
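For reference, both of those limits are ordinary project settings. A minimal illustration of tightening them in settings.py (the numbers are arbitrary examples, not recommendations):

```python
# settings.py - illustrative values only, not recommended defaults
DOWNLOAD_MAXSIZE = 32 * 1024 * 1024   # abort any single response larger than 32 MiB
CONCURRENT_REQUESTS = 8               # at most 8 requests/responses in flight at once
```

Together they roughly bound downloader memory at "maximum response size times concurrency", which is the argument being made above.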
@kmike I think that sounds like a good idea.
I've submitted an update to the advisory on GitHub to reflect that this has not been addressed in 2.8.
I indeed got a "Scrapy denial of service vulnerability" CVE-2017-14158. But there is no fixed release, right?
Hi! Is there any ongoing effort to fix this vulnerability? As this has been open for a long time, I wonder if the issue is just not that critical from a security standpoint as listed on NVD, or if it is the fix that is complex. @kmike Regarding a solution, I see that scrapy/scrapy/pipelines/media.py (line 196 in 96033ce) […] And the Engine does have a Slot class that is tracking the requests... Would it be possible to have ENGINE_SLOT_MAX_ACTIVE_SIZE similarly to SCRAPER_SLOT_MAX_ACTIVE_SIZE? The thing I'm not sure about is: if I check for the limit in the _download method (line 326 in 96033ce) […]
Thanks!
Hey! I don't think anyone works on it actively now. You're right; maybe we're careless, but it indeed doesn't look that critical. I haven't seen anyone reporting any practical problem related to this issue. Note that the issue is not that Scrapy can use a lot of memory; the issue is that there is no perfect way to set a limit on the amount of memory Scrapy can use. Tbh, the main motivation to fix it is to make the security notifications go away - which is probably why there are security notifications in the first place, to motivate maintainers, so it works as intended :) That said, even if that's not super-critical, it's still an important issue, and a nice improvement to Scrapy behavior.
Yes, I think that's the idea.
The limit could be "soft" - you don't postpone processing of that request, you postpone processing of subsequent requests.
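A rough sketch of that soft-limit behaviour in plain Python (the class, its method names and the 5 MiB default are made up for illustration; this is not actual Scrapy engine code, it only mirrors the idea behind SCRAPER_SLOT_MAX_ACTIVE_SIZE):

```python
class ActiveSizeLimiter:
    """Sketch of a soft limit on the total byte size of responses in flight."""

    def __init__(self, max_active_size=5 * 1024 * 1024):
        self.max_active_size = max_active_size
        self.active_size = 0

    def response_received(self, response):
        # Always account for a response we already have: a soft limit never
        # drops or postpones work that is already in flight.
        self.active_size += max(len(response.body), 1024)

    def response_processed(self, response):
        self.active_size -= max(len(response.body), 1024)

    def needs_backout(self):
        # While over the limit, the engine would simply stop pulling *new*
        # requests from the scheduler until enough responses have been processed.
        return self.active_size > self.max_active_size
```

The engine (or downloader) would consult needs_backout() before pulling the next request from the scheduler, which gives exactly the "postpone subsequent requests" behaviour without rejecting the response that pushed the total over the limit.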
@kmike To implement such a "SIZE" check, I would need to keep track of the active size when a response is returned in the […]
Hi,
@nramirezuy and I were debugging a memory issue with one of the spiders some time ago, and it seems to have been caused by ImagesPipeline + S3FilesStore. I haven't confirmed that this was the cause of the memory issue; this ticket is based solely on reading the source code.
FilesPipeline reads the whole file into memory and then defers the uploading to a thread (via S3FilesStore.persist_file, passing the file contents as bytes). So there could be many files loaded into memory at the same time, and as soon as files are downloaded faster than they are uploaded to S3, memory usage will grow. This is not unlikely IMHO, because S3 is not super-fast. For ImagesPipeline it is worse, because it uploads not only the image itself but also the generated thumbnails. I think S3FilesStore should persist files to a temporary location before uploading them to S3 (at least optionally). This would allow streaming files without storing them in memory.
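To make the "persist to a temporary location, then stream" idea concrete, here is a minimal sketch assuming boto3 (the helper below is not part of Scrapy, and the real S3FilesStore code differs):

```python
import tempfile

import boto3


def persist_via_tempfile(chunks, bucket, key):
    """Hypothetical helper: spill downloaded bytes to a temporary file, then let
    boto3 stream the file to S3 so the whole payload never has to sit in RAM."""
    s3 = boto3.client("s3")
    with tempfile.TemporaryFile() as tmp:
        for chunk in chunks:          # e.g. an iterator of byte chunks
            tmp.write(chunk)
        tmp.seek(0)
        # upload_fileobj reads the file-like object in parts (multipart upload
        # under the hood), so memory usage stays bounded for large files.
        s3.upload_fileobj(tmp, bucket, key)
```

This only removes the memory pressure completely if the download side can also hand the data over incrementally rather than as a single bytes object, but it shows how the upload side can stay within bounded memory.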