Selector mis-parses html when elements have very large bodies #110

Closed · mdaniel opened this issue Mar 3, 2018 · 8 comments

@mdaniel commented Mar 3, 2018

This was discovered by a Reddit user on an Amazon page with an absurdly long <script> tag, but I was able to boil the bad outcome down to a reproducible test case.

What is expected

Selector(html).css('h1') should produce all h1 elements within the document

What actually happens

Selector(html).css('h1') produces only the h1 elements that appear before the element containing a very large body. Neither xml.etree nor html5lib suffers from this defect.


pip install html5lib==1.0.1
pip install parsel==1.4.0

import html5lib
import parsel
import time

try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree

bad_len = 21683148
bad = 'a' * bad_len
bad_html = '''
<html>
    <body>
      <h1>pre-div h1</h1>
      <div>
        <h1>pre-script h1</h1>
        <p>var bogus = "{}"</p>
        <h1>hello I am eclipsed</h1>
      </div>
      <h1>post-div h1</h1>
    </body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{}] = {}'.format(idx, h1))

print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for h1 in doc.findall('.//h1'):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for h1 in doc.findall('.//h1'):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

Running this produces the output:

Selector.time=0.3661611080169678
h1[0] = <h1>pre-div h1</h1>
h1[1] = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[<Element 'h1' at 0x103029bd8>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x103029c78>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x103029d18>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x103029d68>].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[<Element 'h1' at 0x107395098>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x1073951d8>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x107395318>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x1073953b8>].txt = <<post-div h1>>

@stranac commented Mar 8, 2018

This was actually reported a while ago at scrapy/scrapy#3077.
I just came across it yesterday and started working on fixing it at the source (lxml).

@stranac commented Mar 8, 2018

lxml/lxml#261 was just merged, adding the required functionality to lxml.
I guess now we wait for a release to fix this bug?

@mdaniel (Author) commented Mar 9, 2018

Apologies, I didn't check the Scrapy issues, since I knew that if parsel was subject to this behavior then Scrapy was implicitly affected by it.

Thanks so much for tracking the upstream fix. Given that this bug has likely existed for a while, I doubt it's high urgency.

In addition, I just confirmed that one can build lxml at commit 3fac8341442be711f967ae321d8189df86f81b14, but to close this issue the parser_cls kwargs will need to be updated to take advantage of the new lxml kwarg; here is one such way:

--- a/parsel/selector.py~	2018-03-08 23:13:12.000000000 -0800
+++ b/parsel/selector.py	2018-03-08 23:14:41.000000000 -0800
@@ -11,13 +11,20 @@
 from .csstranslator import HTMLTranslator, GenericTranslator


+class HugeHTMLParser(html.HTMLParser):
+    def __init__(self, *args, **kwargs):
+        kwargs.setdefault('huge_tree', True)
+        super(HugeHTMLParser, self).__init__(*args, **kwargs)
+
+
 class SafeXMLParser(etree.XMLParser):
     def __init__(self, *args, **kwargs):
         kwargs.setdefault('resolve_entities', False)
         super(SafeXMLParser, self).__init__(*args, **kwargs)

+
 _ctgroup = {
-    'html': {'_parser': html.HTMLParser,
+    'html': {'_parser': HugeHTMLParser,
              '_csstranslator': HTMLTranslator(),
              '_tostring_method': 'html'},
     'xml': {'_parser': SafeXMLParser,

(Edited to include a possible patch.)
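
As a quick sanity check against a patched build (assuming the patch above and the bad_html from the repro at the top of this issue), all four h1 elements should then be found:

sel = parsel.Selector(text=bad_html)
assert len(sel.xpath('//h1')) == 4  # matches ElementTree and html5lib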

@stranac commented Mar 9, 2018

Actually, since XMLParser has the same behavior and already supports the argument, I think patching create_root_node might be preferable:

--- a/parsel/selector.py
+++ b/parsel/selector.py
@@ -39,7 +39,7 @@ def create_root_node(text, parser_cls, base_url=None):
     """Create root node for text using given parser class.
     """
     body = text.strip().encode('utf8') or b'<html/>'
-    parser = parser_cls(recover=True, encoding='utf8')
+    parser = parser_cls(recover=True, encoding='utf8', huge_tree=True)
     root = etree.fromstring(body, parser=parser, base_url=base_url)
     if root is None:
         root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)

I thought this might need to be an optional feature (e.g. a keyword argument) for performance reasons, but I've done some simple testing and can see no performance difference, so having it enabled all the time should be fine:

>>> vanilla_parser = lxml.html.HTMLParser()
>>> huge_parser = lxml.html.HTMLParser(huge_tree=True)
>>> def vanilla():
...     lxml.html.fromstring(data, parser=vanilla_parser)
...
>>> def huge():
...     lxml.html.fromstring(data, parser=huge_parser)
...
>>> timeit.repeat(vanilla, number=1000)
[9.251072943676263, 9.233368926215917, 9.403614949900657]
>>> timeit.repeat(huge, number=1000)
[9.30129877710715, 9.280261006206274, 9.6553337960504]
>>> timeit.repeat(vanilla, number=1000)
[9.392337845172733, 9.374456495046616, 9.257086860947311]
>>> timeit.repeat(huge, number=1000)
[9.3193543930538, 9.24144608201459, 9.242193738929927]

@ghost added the upstream issue label on Mar 9, 2018

@kmike (Member) commented Mar 12, 2018

+1 to using huge_tree=True by default, though we need to add a check: parsel must keep working with older lxml versions that don't support this parameter.

It may also be good to expose this option to parsel users somehow; some users who do broad crawls may want to limit tree size as a security/robustness measure.
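
A minimal sketch of one way such a check could work (illustrative only; the helper name is hypothetical, and probing the parser is one alternative to comparing version numbers):

from lxml import html

def _lxml_html_supports_huge_tree():
    # lxml parsers reject unknown keyword arguments with a TypeError,
    # so constructing a throwaway parser doubles as a feature probe.
    try:
        html.HTMLParser(huge_tree=True)
    except TypeError:
        return False
    return True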

@stranac commented Mar 13, 2018

I dislike version checking and would prefer to just require a new lxml with a new parsel, but I'm not the one making the decision.

How would lower versions be handled?
Do we return a selector that behaves differently depending on lxml version? Do we issue a warning?

As for having the option to disable it, I think a keyword arg on Selector.__init__ makes the most sense.
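
For illustration, here is a minimal sketch of that keyword-argument approach (hypothetical; not the patch that was eventually merged), building on the create_root_node shown above:

def create_root_node(text, parser_cls, base_url=None, huge_tree=True):
    """Create root node for text using given parser class."""
    body = text.strip().encode('utf8') or b'<html/>'
    # Forward the flag to the lxml parser; the permissive behavior
    # stays on by default.
    parser = parser_cls(recover=True, encoding='utf8', huge_tree=huge_tree)
    root = etree.fromstring(body, parser=parser, base_url=base_url)
    if root is None:
        root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
    return root

Selector.__init__ would accept the same flag and pass it through, so a broad-crawl user could opt out per selector, e.g. Selector(text=page_html, huge_tree=False).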

@Langdi (Contributor) commented Jul 10, 2018

I opened a pull request for this issue, but the question @stranac raises still stands. I think the main problems are these two scenarios:

  1. lxml supports huge_tree but it is disabled via the argument (see source code)
  2. lxml doesn't support huge_tree but it is enabled (either passed explicitly or on by default).

How would you suggest handling these cases? Right now, I have implemented it so that both scenarios fail and raise a ValueError.
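
To make the second scenario concrete, here is a hedged sketch of the fail-fast check (the names are hypothetical; LXML_SUPPORTS_HUGE_TREE stands in for whatever flag the pull request actually computes):

def _check_huge_tree(huge_tree):
    # The caller asked for huge_tree, but the installed lxml cannot
    # honor it; silently ignoring the flag would reintroduce the
    # mis-parsing this issue describes, so fail loudly instead.
    if huge_tree and not LXML_SUPPORTS_HUGE_TREE:
        raise ValueError(
            'huge_tree=True requested, but the installed lxml '
            'does not support the huge_tree option')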

@wRAR (Member) commented Oct 28, 2023

Fixed by #116.

@wRAR closed this as completed on Oct 28, 2023