
S3FilesStore can use a lot of memory #482

Open
kmike opened this issue Dec 5, 2013 · 18 comments

@kmike
Member

kmike commented Dec 5, 2013

Hi,

@nramirezuy and I were debugging a memory issue with one of the spiders some time ago, and it seems to have been caused by ImagesPipeline + S3FilesStore. I haven't confirmed that this was the cause of the memory issue; this ticket is based solely on reading the source code.

FilesPipeline reads the whole file into memory and then defers the upload to a thread (via S3FilesStore.persist_file, passing the file contents as bytes). So many files could be loaded into memory at the same time, and as soon as files are downloaded faster than they are uploaded to S3, memory usage will grow. This is not unlikely IMHO, because S3 is not super-fast. For ImagesPipeline it is worse, because it uploads not only the image itself but also the generated thumbnails.

I think S3FilesStore should persist files to a temporary location before uploading them to S3 (at least optionally). This would allow streaming files without storing them in memory.
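A minimal sketch of that idea, not Scrapy's actual implementation: spill the body to a temporary file and let the S3 client stream it from disk, so the payload doesn't have to sit in memory while the upload waits its turn. boto3, the helper name and the chunked-body interface below are assumptions for illustration only.

```python
import tempfile

import boto3  # assumption: any S3 client that can upload from a file object would do

s3 = boto3.client("s3")

def persist_file_streaming(body_chunks, bucket, key, spool_max=1024 * 1024):
    """Illustrative only: write the body to a spooled temporary file
    (RAM up to spool_max, disk beyond that), then upload from the file
    object instead of holding the whole payload as bytes."""
    with tempfile.SpooledTemporaryFile(max_size=spool_max) as tmp:
        for chunk in body_chunks:  # e.g. chunks of the downloaded response body
            tmp.write(chunk)
        tmp.seek(0)
        # upload_fileobj streams the file in parts, so peak memory stays
        # near the multipart chunk size rather than the full file size.
        s3.upload_fileobj(tmp, bucket, key)
```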

@rmax
Contributor

rmax commented Dec 5, 2013

I use CONCURRENT_ITEMS = 1 in these cases. I haven't verified how much it improves memory usage, though.

@kmike
Member Author

kmike commented Dec 5, 2013

CONCURRENT_ITEMS = 1 doesn't limit the number of concurrent uploads, because (again, based on reading the source code) S3FilesStore.persist_file defers the upload to a thread and doesn't wait for the result. CONCURRENT_ITEMS = 1 could help only because it slows down the scraping and thus allows S3 uploads to finish faster than new data is scheduled. I think download_delay could have a similar effect.
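For context, the workaround being discussed is plain crawl throttling via settings; as noted above, it only slows down how quickly new files enter the pipeline and does not cap concurrent S3 uploads (the values here are arbitrary examples):

```python
# settings.py -- workaround, not a fix: throttles how fast new files
# enter the pipeline, but does not limit concurrent S3 uploads
CONCURRENT_ITEMS = 1    # process one item at a time per response
DOWNLOAD_DELAY = 0.5    # extra pause between consecutive requests to the same site
```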

@nramirezuy
Contributor

@dangra
Copy link
Member

dangra commented Dec 16, 2013

The limitless cache point is true, but it never accounted for a lot of memory unless you store responses in it.

Looks like this change may be related to the memory issues: 0a8bf2c

@Parth-Vader
Member

@G-Rath

G-Rath commented Mar 11, 2022

Are folks aware this has been reported as a security issue that affects most (if not all) versions of Scrapy? https://nvd.nist.gov/vuln/detail/CVE-2017-14158 & https://github.com/pypa/advisory-database/blob/8b7a4d62a95e8f605e5dfb4e0b4f299e6403dc12/vulns/scrapy/PYSEC-2017-83.yaml

I'm currently trying to determine if the security angle has been fixed, and if not, what is required for that, since we've got this flagged in one of our applications. I'm happy to help with this as much as I can, but I don't work in Python a lot, so I suspect I'll be of limited use; still, I'm happy to be a part of discussions and even dive into code if someone more familiar with the codebase could spare some time to support me 🙂

@fazalmajid

This issue is also flagged as PYSEC-2017-83 by pip-audit.

I would recommend putting a hard, configurable limit on the total size of downloads initiated by a single top-level request to prevent the kind of DoS attacks mentioned. Keep in mind that scrapers like Scrapy are often detested by the sites being scraped (despite being perfectly legal), so deliberate DoS countermeasures are well within the realm of possibility.

@G-Rath

G-Rath commented May 27, 2022

@fazalmajid that sounds like a reasonable solution - is that something you'd be comfortable implementing? As I've said, I'm not really a Python developer, so I can't really take on implementing a solution myself 😞

@fazalmajid

No, as I don’t use Scrapy myself and have no idea about the code base. Plus since it’s a project sponsored by a commercial organization, they would certainly have opinions about whether to implement this and how, and I am not about to code “on spec”.

@genismoreno

Hi, is there a plan to fix this? I have seen this vulnerability for a while, but I don't see a clear solution for it. Thanks!

@kmike
Member Author

kmike commented Nov 8, 2022

Hey. It seems this CVE is popping up everywhere, and can cause some warnings for the users, so we should do something about it.

Let's investigate possible solutions. https://nvd.nist.gov/vuln/detail/CVE-2017-14158 says that the issue is an instance of https://cwe.mitre.org/data/definitions/400.html. Description of CWE-400:

The software does not properly control the allocation and maintenance of a limited resource, thereby enabling an actor to influence the amount of resources consumed, eventually leading to the exhaustion of available resources.

Limited resources include memory, file system storage, database connection pool entries, and CPU. If an attacker can trigger the allocation of these limited resources, but the number or size of the resources is not controlled, then the attacker could cause a denial of service that consumes all available resources.

Scrapy controls the number and size of the resources in the following way:

  1. There is the DOWNLOAD_MAXSIZE option (https://docs.scrapy.org/en/latest/topics/settings.html#download-maxsize), which prevents downloading large files.
  2. There are various CONCURRENT_REQUESTS options, which allow limiting the number of parallel downloads (each download then causes an upload to S3).
  3. There is SCRAPER_SLOT_MAX_ACTIVE_SIZE, which is a soft limit for the total size of all responses being processed by the scraper ("While the sum of the sizes of all responses being processed is above this value, Scrapy does not process new requests."). I'm not sure though why it is applied at the Scraper level rather than the Downloader level. So it seems this option has no effect if a request is added to the downloader without going through the scraper, and it seems requests initiated in MediaPipeline don't go through the scraper.
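For illustration, the three existing knobs from the list above as they would appear in a project's settings (the values are arbitrary examples, not recommendations):

```python
# settings.py -- existing limits mentioned above (example values only)
DOWNLOAD_MAXSIZE = 512 * 1024 * 1024      # 1. refuse responses larger than 512 MiB
CONCURRENT_REQUESTS = 8                   # 2. cap the number of parallel downloads
SCRAPER_SLOT_MAX_ACTIVE_SIZE = 5_000_000  # 3. soft limit on bytes being processed by the scraper
```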

Reducing the amount of RAM used, e.g. by storing more data on disk, is not a solution for the CVE. By doing so we're shifting the resource from "memory usage" to "disk usage" without addressing the security issue.

It seems that having an option similar to SCRAPER_SLOT_MAX_ACTIVE_SIZE, but one that works at the Downloader or Engine level, should solve the problem. I.e. limit, in a soft way (if we're over the limit, stop accepting new work), the total byte size of responses being processed by all Scrapy components. What do you think?
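A rough sketch of how such a soft limit could behave; the setting name ENGINE_MAX_ACTIVE_SIZE and the tracker class below are hypothetical, not existing Scrapy APIs:

```python
class ActiveSizeTracker:
    """Illustrative soft limit: count the bytes of responses still being
    processed anywhere (scraper, media pipelines, ...) and tell the engine
    whether it should stop pulling new requests for now."""

    def __init__(self, max_active_size=10_000_000):  # hypothetical ENGINE_MAX_ACTIVE_SIZE
        self.max_active_size = max_active_size
        self.active_size = 0

    def add(self, nbytes):
        self.active_size += nbytes

    def remove(self, nbytes):
        self.active_size -= nbytes

    def needs_backoff(self):
        # "Soft" limit: in-flight work keeps going; only the scheduling
        # of new requests pauses while we're over the threshold.
        return self.active_size >= self.max_active_size
```

Whatever loop feeds the downloader would consult needs_backoff() before scheduling more work, which matches the soft behaviour SCRAPER_SLOT_MAX_ACTIVE_SIZE already has at the scraper level.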

@kmike
Member Author

kmike commented Nov 8, 2022

As for the vulnerability: if the files/images pipelines or other components which send requests to the Downloader directly are not used, then it seems CWE-400 doesn't apply. There are the DOWNLOAD_MAXSIZE and CONCURRENT_REQUESTS limits, and they look adequate according to the CWE-400 description.

@Gallaecio Gallaecio added this to the Scrapy 2.8 milestone Dec 1, 2022
@G-Rath

G-Rath commented Jan 12, 2023

@kmike I think that sounds like a good idea

@wRAR wRAR modified the milestones: Scrapy 2.8, Scrapy 2.9 Feb 2, 2023
@G-Rath

G-Rath commented Feb 3, 2023

I've submitted an update to the advisory on GitHub to reflect that this has not been addressed in 2.8

@melroy89

melroy89 commented Feb 16, 2023

I indeed got a "Scrapy denial of service vulnerability" report for CVE-2017-14158. But there is no fixed release, right?

@ret2libc

ret2libc commented Apr 7, 2023

Hi! Is there any ongoing effort to fix this vulnerability? As this has been open for a long time, I wonder whether the issue is just not as critical from a security standpoint as listed on NVD, or whether the fix is complex.

@kmike Regarding a solution, I see that the MediaPipeline uses the Engine to download the object:

dfd = self.crawler.engine.download(request)

And the Engine does have a Slot class that tracks the requests... Would it be possible to have ENGINE_SLOT_MAX_ACTIVE_SIZE, similar to SCRAPER_SLOT_MAX_ACTIVE_SIZE? The thing I'm not sure about is: if I check for the limit in the _download method,

def _download(self, request: Request, spider: Optional[Spider]) -> Deferred:

how can I postpone the processing of that request?

Thanks!

@kmike
Member Author

kmike commented Apr 7, 2023

Hi! Is there any on-going effort to fix this vulnerability? As this has been opened for a long time, I wonder if the issue is just not that critical from a security standpoint as listed on NVD or if it is the fix that is complex

Hey! I don't think anyone is working on it actively now. You're right; maybe we're careless, but it indeed doesn't look that critical. I haven't seen anyone reporting any practical problem related to this issue. Note that the issue is not that Scrapy can use a lot of memory; the issue is that there is no perfect way to set a limit on the amount of memory Scrapy can use. Tbh, the main motivation to fix it is to make the security notifications go away - which is probably why there are security notifications in the first place, to motivate maintainers, so it works as intended :)

That said, even if that's not super-critical, it's still an important issue, and a nice improvement to Scrapy behavior.

Would it be possible to have ENGINE_SLOT_MAX_ACTIVE_SIZE similarly to SCRAPER_SLOT_MAX_ACTIVE_SIZE?

Yes, I think that's the idea.

how can I postpone the processing of that request?

The limit could be "soft" - you don't postpone processing of that request, you postpone processing of subsequent requests.

@ret2libc

@kmike To implement such a "SIZE" check, I would need to keep track of the active size when a response is returned in the ExecutionEngine, and I would also need to remove that size when a response is not needed anymore (in other words, when the file has finally been sent to the S3 storage and can be discarded from memory). However, I'm not sure what the best way is to detect that moment. I could explicitly call an Engine.Slot method after calling file_downloaded in the FilesPipeline, but I'm not sure that's the correct way. Any tip?
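One way the bookkeeping could look, purely as a sketch (the tracker and the glue function are hypothetical, not Scrapy APIs): charge the counter when the response arrives, and release it from a callback chained onto the deferred returned by the file store, so the decrement happens only once the bytes can really be dropped.

```python
def track_media_download(engine_download, persist_file, tracker, request):
    """Hypothetical glue: engine_download(request) returns a Deferred firing
    with a Response; persist_file(response) returns a Deferred firing when
    the (threaded) S3 upload has finished; tracker does the size accounting."""
    dfd = engine_download(request)

    def on_response(response):
        size = len(response.body)
        tracker.add(size)                     # response body is now held in memory
        upload_dfd = persist_file(response)

        def release(result):
            tracker.remove(size)              # safe to drop the bytes now
            return result

        upload_dfd.addBoth(release)           # release on success and on failure
        return upload_dfd

    return dfd.addCallback(on_response)
```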

@kmike kmike modified the milestones: Scrapy 2.9, Scrapy 2.10 Apr 27, 2023