Security enhancement when following a "redirect" #457
Comments
The explanation makes sense, but what about pages containing links to file:// URLs?
I think that by default file:// should not be scraped nor indexed, but since access to local files might be useful when writing tests and contracts, a new setting key could be introduced to Scrapy's core. That key/value setting would define whether scraping of file:// is allowed or not. And regarding the initial point, I believe that redirects should never follow file://.
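A minimal sketch of what such an opt-in could look like, assuming a custom downloader middleware and a made-up setting name (ALLOWED_REQUEST_SCHEMES); this is illustrative only, not an API Scrapy ships:

```python
# Illustrative only: a downloader middleware gated by a hypothetical
# ALLOWED_REQUEST_SCHEMES setting; requests whose scheme is not allowed
# (e.g. file://) are dropped before they reach a download handler.
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class SchemeFilterMiddleware:
    def __init__(self, allowed_schemes):
        self.allowed_schemes = set(allowed_schemes)

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting name; defaults to http/https only.
        return cls(crawler.settings.getlist(
            "ALLOWED_REQUEST_SCHEMES", ["http", "https"]))

    def process_request(self, request, spider):
        if urlparse(request.url).scheme not in self.allowed_schemes:
            raise IgnoreRequest(
                "Blocked request with disallowed scheme: %s" % request.url)
```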
Is there any situation where a remote file:// link is legitimate?
@barraponto I don't believe file:// is ever of any use, except when testing local files and there's no HTTP server running. I also believe that those use cases are so rare, and have such easy natural alternatives, that having file:// enabled by default is a security risk.
That's what I think as well. But I'd like to hear what @pablohoffman or @dangra have to say about it.
@mvsantos: I understand the issue, but I must say Scrapy defaults are set to ease development and are tuned in production on a per-case basis. As a general rule it's recommended to run spiders in constrained environments, limited in memory, CPU and filesystem access, and these environments make access to local files a derisory issue. Disabling file:// ...
@dangra I understand and appreciate your point about Scrapy's approach of easing development, and that Scrapy's maintainers want to make it as easy as possible for newcomers.
Just gave the whole issue a re-read. I think the initial proposal of disallowing http redirections to non-http URLs makes sense, even if ...
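A rough sketch of that proposal, written as a RedirectMiddleware subclass against a modern Scrapy module layout; the class name and the scheme whitelist are my own assumptions, and this is not the patch that eventually landed:

```python
# Sketch only: refuse to follow 3xx redirects whose Location resolves to a
# non-http(s) URL, handing the raw redirect response back to the spider.
from urllib.parse import urljoin, urlparse

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware


class HttpOnlyRedirectMiddleware(RedirectMiddleware):
    ALLOWED_SCHEMES = {"http", "https"}

    def process_response(self, request, response, spider):
        location = response.headers.get("Location")
        if location:
            target = urljoin(request.url, location.decode("latin-1"))
            if urlparse(target).scheme not in self.ALLOWED_SCHEMES:
                # e.g. Location: file:///etc/passwd is not followed.
                return response
        return super().process_response(request, response, spider)
```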
@dangra Do you mean they shouldn't work from HTML pages because of ...?
That is the current behavior when ... I admit this is not the strong security-by-default you are expecting, but it minimizes this issue for newbies again. @kmike, @redapple, @nramirezuy, what do you think?
Following ... Also, I don't quite get the scope of this issue: how will an attacker get the contents of the local file?
Ah, I've read the CVE and it is clearer now, please disregard my question. What about adding ...?
Protocol checking should be done outside the Request constructor, because an unsupported link shouldn't cause an exception in the response handler.
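One way to read that suggestion, sketched here with an invented spider middleware name: leave the Request constructor free of scheme validation and instead filter unsupported schemes out of the spider output, so a stray file:// href never causes an error inside a callback.

```python
# Sketch only: a spider middleware that silently drops outgoing requests
# with unsupported schemes, assuming Request() itself does no scheme
# validation, so bad links never reach the scheduler or the downloader.
from urllib.parse import urlparse


class DropUnsupportedSchemesMiddleware:
    ALLOWED = {"http", "https"}

    def process_spider_output(self, response, result, spider):
        for request_or_item in result:
            url = getattr(request_or_item, "url", None)
            if url is not None and urlparse(url).scheme not in self.ALLOWED:
                spider.logger.debug("Dropping unsupported scheme: %s", url)
                continue
            yield request_or_item
```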
@kmike Any service that scrapes content and outputs near-raw data could be targeted by an attacker. Examples include, but are not limited to: ...
Any of these services could be powered by Scrapy, and an attacker could exploit them by creating traps such as the one described in this issue. Just to clarify, the initial point I raised: ...
I wouldn't mind having a ...
Additionally, could ...?
@redapple IMO this problem asks for a setting of its own, since there are security implications, and it therefore goes beyond the ...
@mvsantos my comment above suggested a name, but it could indeed be more explicit.
@redapple I personally don't like the idea of having security settings open by default, like you suggest in the second topic. But since Scrapy's approach is clearly to make it as easy as possible to get the first "HelloWorld" spider working, your 3 topics should work fine. Also, @dangra's idea of adding a "Recommendation for running spiders in production" page to the docs is great for newcomers like myself, to make sure we deploy Scrapy in a sane and secure fashion.
@nramirezuy: OffsiteMW is not a great "protection", but to put my words to work, it won't let a file:// request through:

diff --git a/scrapy/tests/test_spidermiddleware_offsite.py b/scrapy/tests/test_spidermiddleware_offsite.py
index f7523f7..2e58827 100644
--- a/scrapy/tests/test_spidermiddleware_offsite.py
+++ b/scrapy/tests/test_spidermiddleware_offsite.py
@@ -23,7 +23,8 @@ class TestOffsiteMiddleware(TestCase):
Request('http://sub.scrapy.org/1'),
Request('http://offsite.tld/letmepass', dont_filter=True)]
offsite_reqs = [Request('http://scrapy2.org'),
- Request('http://offsite.tld/')]
+ Request('http://offsite.tld/'),
+ Request('file:///etc/passwd')]
reqs = onsite_reqs + offsite_reqs
out = list(self.mw.process_spider_output(res, reqs, self.spider))

test output: ...
@nramirezuy: OffsiteMW doesn't apply to redirects, right, but I was mentioning it in the context of links extracted from HTML pages. The idea with redirects is to disallow anything other than http(s).
Hi folks, any updates regarding this issue? EDIT: here's what might happen when following redirects (from http:// to file://) is the default behavior: ...
I think that ... I don't like solutions with some global switch (be it a setting or a middleware), for several reasons: ...
The #457 (comment) solution doesn't have these problems. There is a gotcha with redirects where ...
I'm ok with @kmike's ...
What's the core maintainers' take on this? CC @dangra, @nramirezuy, @pablohoffman, @kmike
@redapple I don't know if removing the proxy is a good idea; also, the cookies may not change. Did you check how the browser handles it?
@nramirezuy, if the redirected URL of a proxied request changes scheme and the proxy doesn't change (e.g. an HTTPS proxy keeps being used for an http:// URL), that feels weird and wrong, and python-requests doesn't do that.
I believe this is not a bug, but could fit in as a security enhancement.
TL;DR: When following HTTP redirects, Scrapy should only follow http/https requests. Quite possibly similar to CVE-2009-0037: http://curl.haxx.se/docs/adv_20090303.html
Scenario:
a. A spider that simply stores raw or minimally sanitised data.
b. By default, Scrapy follows HTTP redirects.
c. The values of both DOWNLOAD_HANDLERS and DOWNLOAD_HANDLERS_BASE are set to their defaults.
And let's say a malicious server issues a redirect to a local file, such as the one sketched below.
And so, my spider will be "leaking" my own "/etc/passwd" file because Scrapy followed the 301 redirect from "http:// ... /trap.html" to "file:///etc/passwd".
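As an illustration of such a trap (the server framework, port and paths below are assumptions for the example, not taken from the report), the redirecting endpoint can be tiny:

```python
# Hypothetical trap server: answers any GET with a 301 redirect whose
# Location points at a local file on the *crawler's* machine. A crawler
# that follows redirects to file:// would then read, and possibly
# publish, its own /etc/passwd.
from http.server import BaseHTTPRequestHandler, HTTPServer


class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", "file:///etc/passwd")
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), TrapHandler).serve_forever()
```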
Shouldn't Scrapy enforce some sort of protocol restriction, so that only http/https URLs are followed on redirect?
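Independently of any redirect fix, per-project exposure can also be reduced by disabling the file:// download handler; a sketch, assuming a Scrapy version where a handler is disabled by mapping its scheme to None in the project settings:

```python
# settings.py (sketch): disable the file:// download handler entirely,
# so even a followed redirect to file:///etc/passwd cannot be downloaded.
DOWNLOAD_HANDLERS = {
    "file": None,
}
```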