
Project name sanitization #117

Closed
csalazar opened this issue Nov 11, 2015 · 7 comments

@csalazar

As described here (points 1, 2, 3 and 5), there are some security issues related to the project name. I've thought of a fix: sanitizing the project name value using the same logic as with the version variable:

re.sub(r'[^a-zA-Z0-9_-]', '_', version)

It would have to be applied in every method working with the project name in FilesystemEggStorage, for consistency between adding and then getting projects. As a side effect, two projects with uncommon characters, like project! and project?, will share the same project name project_. Does anyone see a better solution?
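
For illustration, a minimal sketch of that idea with a shared helper (the _eggpath function here is hypothetical, only to show that putting and getting eggs must build paths the same way):

import re

def sanitize_name(name):
    # Replace anything outside [a-zA-Z0-9_-] with an underscore,
    # mirroring the existing handling of the version value.
    return re.sub(r'[^a-zA-Z0-9_-]', '_', name)

def _eggpath(basedir, project, version):
    # Hypothetical path builder: every FilesystemEggStorage method that
    # touches the project name would need to go through the same helper.
    return '%s/%s/%s.egg' % (basedir, sanitize_name(project), sanitize_name(version))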

@Digenis
Member

Digenis commented Nov 12, 2015

Percentage encoding them should also prevent name collisions.
This should also take care of the XSS issues to a point.
Still, if someone succeeds in storing an unescaped egg,
the site module should ignore it and maybe warn about it.
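
To sketch the idea of a reversible %-encoding (shown here with Python 3's urllib.parse; this is only an illustration, not a proposed patch):

from urllib.parse import quote, unquote

def encode_project(name):
    # safe='' also escapes "/", and the mapping is reversible,
    # so distinct names cannot collide the way a lossy substitution can.
    return quote(name, safe='')

def decode_project(encoded):
    return unquote(encoded)

encode_project('project!')   # 'project%21'
encode_project('project?')   # 'project%3F'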

One thing to note, however, before any development on this bug:
project name != project package
I can rename my default project package to something shorter
to keep its imports short
or the other way around, rename the project name to something longer
to make its name more descriptive than the package.
I do it all the time by modifying scrapy.cfg.

Last, I don't think these are security vulnerabilities.
Scrapyd is overly permissive and non-isolating by design.
However, I agree that these are all bugs and need to be fixed,
as I also believe it needs a total redesign.
The fixes will probably break backwards compatibility,
but mostly in scenarios where the bugs were used as features.

I remember "scrapyd2" being referenced by one of the maintainers
but I think they will not release such a package,
because they work on the cloud platform.
They gave maintenance to @jayzeng,
which pretty much means that the ball is in the community's court.

@csalazar
Author

Hi @Digenis, I don't consider percentage encoding the right approach. When we insert the project name into the root page, we have to display it as it is (<img onerror=alert(1); src=x>) while taking care of special characters; that's HTML encoding, and percentage (URL) encoding doesn't solve anything in that context.

I would prefer an intermediate representation like base64, which doesn't generate characters with meaning at the filesystem level (slashes, dots, etc.), so we can handle the project name safely, and when we have to display it we choose the right encoding according to the context.
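
A sketch of that separation, assuming the URL-safe base64 variant (so the sketch itself doesn't emit "/") and escaping only at display time:

import base64
from xml.sax.saxutils import escape as escape_html

def to_storage_name(project):
    # Filesystem-safe intermediate form; urlsafe_b64encode uses "-" and "_"
    # instead of "+" and "/".
    return base64.urlsafe_b64encode(project.encode('utf-8')).decode('ascii')

def from_storage_name(stored):
    return base64.urlsafe_b64decode(stored.encode('ascii')).decode('utf-8')

original = '<img onerror=alert(1); src=x>'
stored = to_storage_name(original)                 # safe as a directory name
display = escape_html(from_storage_name(stored))   # safe in HTML output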

I think they are vulnerabilities: the scenario for my audit was someone using scrapyd and running the spiders in containers, so the issues related to Python code execution are restricted to the container level and scrapyd could be used "safely".

@Digenis
Member

Digenis commented Dec 3, 2015

The base64 alphabet has "/"
and also makes the project names unreadable.
I can only think of one pitfall in the %-encoding:
accidentally interpolating an environment variable in a Windows batch script.

The example with the project name containing HTML
should rather be solved in the website template;
it shouldn't use unescaped text in the first place.

Edit:
(attached patch)

diff --git a/scrapyd/website.py b/scrapyd/website.py
index fde1468..4bfa5c5 100644
--- a/scrapyd/website.py
+++ b/scrapyd/website.py
@@ -4,2 +4,4 @@ import socket

+from xml.sax.saxutils import escape as escape_html
+
 from twisted.web import resource, static
@@ -67,3 +69,4 @@ class Home(resource.Resource):
         vars = {
-            'projects': ', '.join(self.root.scheduler.list_projects()),
+            'projects': ', '.join(
+                map(escape_html, self.root.scheduler.list_projects())),
         }
@@ -122,5 +125,5 @@ class Jobs(resource.Resource):
                 s += "<tr>"
-                s += "<td>%s</td>" % project
-                s += "<td>%s</td>" % str(m['name'])
-                s += "<td>%s</td>" % str(m['_job'])
+                s += "<td>%s</td>" % escape_html(project)
+                s += "<td>%s</td>" % escape_html(str(m['name']))
+                s += "<td>%s</td>" % escape_html(str(m['_job']))
                 s += "</tr>"
@@ -130,3 +133,3 @@ class Jobs(resource.Resource):
             for a in ['project', 'spider', 'job', 'pid']:
-                s += "<td>%s</td>" % getattr(p, a)
+                s += "<td>%s</td>" % escape_html(getattr(p, a))
             s += "<td>%s</td>" % (datetime.now() - p.start_time)
@@ -140,3 +143,3 @@ class Jobs(resource.Resource):
             for a in ['project', 'spider', 'job']:
-                s += "<td>%s</td>" % getattr(p, a)
+                s += "<td>%s</td>" % escape_html(getattr(p, a))
             s += "<td></td>"

@Digenis
Member

Digenis commented Dec 5, 2015

I forgot to consider case insensitive filesystems.
Just %-encoding won't do on them.
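
For example, with the quote-based encoding sketched earlier:

from urllib.parse import quote

# ASCII letters are left untouched by %-encoding, so two project names that
# differ only in case map to the same path on a case-insensitive filesystem.
quote('MyProject', safe='')   # 'MyProject'
quote('myproject', safe='')   # 'myproject'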

@Digenis
Member

Digenis commented Jan 19, 2017

Since we are getting close to 1.2
we can consider defining a set of forbidden characters
and adding warnings for project names that include them.
This is forward compatible with either forbidding those characters completely
or using some escaping scheme (e.g. %-escapes).
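
A sketch of such a warning; the exact character set here is only an assumption for illustration:

import re
import warnings

FORBIDDEN_CHARS = re.compile(r'[^A-Za-z0-9._-]')

def check_project_name(name):
    # Warn now, forbid (or escape) in a later release.
    if FORBIDDEN_CHARS.search(name):
        warnings.warn(
            "project name %r contains characters that may be forbidden "
            "in a future release" % name,
            DeprecationWarning,
        )
    return name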

@Digenis modified the milestones: 1.4.0, 1.3.0 on Apr 13, 2021
@Digenis
Member

Digenis commented Apr 13, 2021

Moving to 1.3.0, to at least add deprecation warnings for project names with illegal characters.

@jpmckinney
Contributor

jpmckinney commented Jul 18, 2024

I went through the blog post at #518. Note: anyone with access to Scrapyd's API can run arbitrary code on the server, via addversion.json and schedule.json. In any case, the issues are concentrated in:

  • environ.py (which determines where log files and item feeds are written)
  • eggstorage.py (which writes and deletes eggs)
  • website.py (XSS), easily fixed by escaping the user-provided content

For the first two, the concern is escaping the configured base directory (eggs_dir, logs_dir, items_dir) via the project, spider, version or job API parameters.

For that, we can use the typical solution for a directory traversal attack, which is to resolve paths and then check for a common prefix. If it fails, we raise an error, which Scrapyd renders as 200 OK and a JSON message.
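
A minimal sketch of that check (the names here are illustrative, not Scrapyd's actual API; the real fix is in the commits referenced below):

import os

class DirectoryTraversalError(Exception):
    pass

def resolve_under(basedir, *parts):
    # Resolve the candidate path and require it to stay under the base dir.
    base = os.path.realpath(basedir)
    candidate = os.path.realpath(os.path.join(base, *parts))
    if candidate != base and not candidate.startswith(base + os.sep):
        raise DirectoryTraversalError(os.path.join(*parts))
    return candidate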

If a user presently has a project, spider, etc. with two consecutive dots ".." between path separators in its name, they are already encountering bugs, so this change makes no difference to them.

With this approach, a parameter with "/" or "\" can still descend the directory tree, but that's not a security issue. I suspect this would cause bugs, but I don't think any users actually put path separators in parameters, so it's not a priority to protect users from these bugs. I don't think any other characters would cause bugs.

So, we don't need to sanitize the project, spider, etc. parameters in the API (which would be backwards incompatible), since the eggstorage and environ will error before anything insecure happens.

There is the consideration that users might have configured their own filesystem-based eggstorage, and so adding these checks to the webservice can add a layer of protection, in case they didn't consider these issues. However, as I started with, Scrapyd is already a way to run arbitrary code, so this protection for user code is a bit paranoid.

I fixed these issues in the commits mentioned in #518, so closing.
