# Selector mis-parses html when elements have very large bodies #110
This was actually reported as an issue a while ago at scrapy/scrapy#3077.
---

lxml/lxml#261 was just merged, adding the required functionality to lxml.
---

Apologies, I didn't check the Scrapy issues, since I knew that if parsel was subject to this behavior then Scrapy was implicitly affected by it. Thanks so very much for tracking the upstream fix. Given that this bug has likely existed for a while, I doubt it's high urgency. In addition, I just confirmed that one can build a working fix:

```diff
--- a/parsel/selector.py~	2018-03-08 23:13:12.000000000 -0800
+++ b/parsel/selector.py	2018-03-08 23:14:41.000000000 -0800
@@ -11,13 +11,20 @@
 from .csstranslator import HTMLTranslator, GenericTranslator
 
 
+class HugeHTMLParser(html.HTMLParser):
+    def __init__(self, *args, **kwargs):
+        kwargs.setdefault('huge_tree', True)
+        super(HugeHTMLParser, self).__init__(*args, **kwargs)
+
+
 class SafeXMLParser(etree.XMLParser):
     def __init__(self, *args, **kwargs):
         kwargs.setdefault('resolve_entities', False)
         super(SafeXMLParser, self).__init__(*args, **kwargs)
+
 
 _ctgroup = {
-    'html': {'_parser': html.HTMLParser,
+    'html': {'_parser': HugeHTMLParser,
              '_csstranslator': HTMLTranslator(),
              '_tostring_method': 'html'},
     'xml': {'_parser': SafeXMLParser,
```

*edited to include a possible patch*
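The subclass trick in the patch above (flipping a keyword default via `setdefault` while still letting callers override it) can be illustrated without lxml; `BaseParser` and `HugeParser` below are stand-ins, not parsel or lxml classes:

```python
class BaseParser(object):
    """Stand-in for lxml's html.HTMLParser (illustrative only)."""
    def __init__(self, huge_tree=False):
        self.huge_tree = huge_tree


class HugeParser(BaseParser):
    """Same pattern as HugeHTMLParser in the patch above:
    default huge_tree to True, but keep the keyword overridable."""
    def __init__(self, *args, **kwargs):
        kwargs.setdefault('huge_tree', True)
        super(HugeParser, self).__init__(*args, **kwargs)
```

With this, `HugeParser()` enables the option while `HugeParser(huge_tree=False)` still turns it off, so the subclass changes only the default, not the API.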
---

Actually, since the parser is instantiated inside `create_root_node`, an even simpler patch is possible:

```diff
--- a/parsel/selector.py
+++ b/parsel/selector.py
@@ -39,7 +39,7 @@ def create_root_node(text, parser_cls, base_url=None):
     """Create root node for text using given parser class.
     """
     body = text.strip().encode('utf8') or b'<html/>'
-    parser = parser_cls(recover=True, encoding='utf8')
+    parser = parser_cls(recover=True, encoding='utf8', huge_tree=True)
     root = etree.fromstring(body, parser=parser, base_url=base_url)
     if root is None:
         root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
```

I thought maybe adding this as an optional feature (e.g. a keyword argument) might be needed for performance reasons, but I've done some simple testing and can see no performance difference, so just having it enabled all the time should be fine (`lxml.html`, `timeit`, and a large `data` string were set up earlier in the session):

```python
>>> vanilla_parser = lxml.html.HTMLParser()
>>> huge_parser = lxml.html.HTMLParser(huge_tree=True)
>>> def vanilla():
...     lxml.html.fromstring(data, parser=vanilla_parser)
...
>>> def huge():
...     lxml.html.fromstring(data, parser=huge_parser)
...
>>> timeit.repeat(vanilla, number=1000)
[9.251072943676263, 9.233368926215917, 9.403614949900657]
>>> timeit.repeat(huge, number=1000)
[9.30129877710715, 9.280261006206274, 9.6553337960504]
>>> timeit.repeat(vanilla, number=1000)
[9.392337845172733, 9.374456495046616, 9.257086860947311]
>>> timeit.repeat(huge, number=1000)
[9.3193543930538, 9.24144608201459, 9.242193738929927]
```
---

+1 to using huge_tree=True by default, though we need to add a check: parsel must keep working with old lxml versions that don't support this parameter. It may also be good to expose this option to parsel users somehow; some users who do broad crawls may want to limit tree size as a security/robustness measure.
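One version-agnostic way to do that check is to probe the parser class instead of comparing version numbers. This is only a sketch: `supports_huge_tree` is a made-up helper, and the stand-in classes below mimic old/new lxml parser signatures so the example runs without lxml installed (lxml parsers reject unknown keywords with `TypeError`):

```python
def supports_huge_tree(parser_cls):
    """Probe whether parser_cls accepts a huge_tree keyword.

    A TypeError from the constructor means the keyword is unknown,
    i.e. the installed parser class predates huge_tree support.
    """
    try:
        parser_cls(huge_tree=True)
    except TypeError:
        return False
    return True


# Stand-ins mimicking old and new lxml parser signatures
# (illustrative only, not lxml code).
class OldParser(object):
    def __init__(self, recover=True, encoding='utf8'):
        pass


class NewParser(object):
    def __init__(self, recover=True, encoding='utf8', huge_tree=False):
        pass
```

The probe instantiates one throwaway parser, which is cheap; the result could be computed once at import time and used to decide whether to pass `huge_tree=True`.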
---

I dislike version checking and would prefer to just require a new lxml with a new parsel, but I'm not the one making the decision. How would lower versions be handled? I think for having the option to disable it, having a keyword arg for
---

I did a pull request for this issue, but the question @stranac raises still stands. I think the main problems are these scenarios:

How would you suggest handling these cases? Right now, I implemented it so that both scenarios fail and raise a
---

Fixed by #116.
---

This was discovered by a Reddit user, concerning an Amazon page with an absurdly long `<script>` tag, but I was able to boil the bad outcome down into a reproducible test case.

**What is expected:** `Selector(html).css('h1')` should produce all `h1` elements within the document.

**What actually happens:** `Selector(html).css('h1')` produces only the `h1` elements before the element containing a very large body. Neither `xml.etree` nor `html5lib` suffer from this defect.
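A document exhibiting the defect can be constructed directly. This is a sketch: `build_huge_html` is a made-up helper, and the ~10 MB figure is libxml2's default text-node limit (`XML_MAX_TEXT_LENGTH`), an assumption not stated in this thread:

```python
def build_huge_html(payload_size=11_000_000):
    """Build HTML whose <script> body exceeds roughly 10 MB,
    enough to trip libxml2's default text-node limit."""
    filler = 'x' * payload_size
    return (
        '<html><body>'
        '<h1>before</h1>'
        '<script>var blob = "' + filler + '";</script>'
        '<h1>after</h1>'
        '</body></html>'
    )
```

Against an unpatched parsel, `Selector(build_huge_html()).css('h1')` would be expected to return only the `before` heading, matching the behavior described above; with `huge_tree=True` enabled, both headings are returned.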