Selector mis-parses html when elements have very large bodies #110

Closed · mdaniel opened this issue Mar 3, 2018 · 8 comments

@mdaniel commented Mar 3, 2018

This was discovered by a Reddit user on an Amazon page with an absurdly long <script> tag, but I was able to boil the bad outcome down to a reproducible test case.

What is expected

Selector(html).css('h1') should produce all h1 elements within the document

What actually happens

Selector(html).css('h1') produces only the h1 elements that appear before the element containing a very large body. Neither xml.etree nor html5lib suffers from this defect.


pip install html5lib==1.0.1
pip install parsel==1.4.0

import html5lib
import parsel
import time

try:
    from xml.etree import cElementTree as ElementTree
except ImportError:
    from xml.etree import ElementTree

bad_len = 21683148
bad = 'a' * bad_len
bad_html = '''
<html>
    <body>
      <h1>pre-div h1</h1>
      <div>
        <h1>pre-script h1</h1>
        <p>var bogus = "{}"</p>
        <h1>hello I am eclipsed</h1>
      </div>
      <h1>post-div h1</h1>
    </body>
</html>
'''.format(bad)
t0 = time.time()
sel = parsel.Selector(text=bad_html)
t1 = time.time()
print('Selector.time={}'.format(t1 - t0))
for idx, h1 in enumerate(sel.xpath('//h1').extract()):
    print('h1[{}] = {}'.format(idx, h1))

print('ElementTree')
t0 = time.time()
doc = ElementTree.fromstring(bad_html)
t1 = time.time()
print('ElementTree.time={}'.format(t1 - t0))
for h1 in doc.findall('.//h1'):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

print('html5lib')
t0 = time.time()
#: :type: xml.etree.ElementTree.Element
doc = html5lib.parse(bad_html, namespaceHTMLElements=False)
t1 = time.time()
print('html5lib.time={}'.format(t1 - t0))
for h1 in doc.findall('.//h1'):
    print('h1[{}].txt = <<{}>>'.format(h1, h1.text))

Running this produces the output:

Selector.time=0.3661611080169678
h1[0] = <h1>pre-div h1</h1>
h1[1] = <h1>pre-script h1</h1>
ElementTree
ElementTree.time=0.1052100658416748
h1[<Element 'h1' at 0x103029bd8>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x103029c78>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x103029d18>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x103029d68>].txt = <<post-div h1>>
html5lib
html5lib.time=2.255831003189087
h1[<Element 'h1' at 0x107395098>].txt = <<pre-div h1>>
h1[<Element 'h1' at 0x1073951d8>].txt = <<pre-script h1>>
h1[<Element 'h1' at 0x107395318>].txt = <<hello I am eclipsed>>
h1[<Element 'h1' at 0x1073953b8>].txt = <<post-div h1>>

@stranac commented Mar 8, 2018

This was actually reported a while ago at scrapy/scrapy#3077.
I just came across it yesterday and started working on fixing it at the source (lxml).

@stranac commented Mar 8, 2018

lxml/lxml#261 was just merged, adding the required functionality to lxml.
I guess now we wait for a release to fix this bug?

@mdaniel (Author) commented Mar 9, 2018

Apologies, I didn't check the Scrapy issues, since I knew that if parsel was subject to this behavior then Scrapy was implicitly affected by it.

Thanks so much for tracking the upstream fix. Given that this bug has likely existed for a while, I doubt it's high urgency.

In addition, I just confirmed that one can build lxml at commit 3fac8341442be711f967ae321d8189df86f81b14, but to close this issue the parser_cls kwargs will need to be updated to take advantage of the new lxml kwarg; here is one such way:

--- a/parsel/selector.py~	2018-03-08 23:13:12.000000000 -0800
+++ b/parsel/selector.py	2018-03-08 23:14:41.000000000 -0800
@@ -11,13 +11,20 @@
 from .csstranslator import HTMLTranslator, GenericTranslator


+class HugeHTMLParser(html.HTMLParser):
+    def __init__(self, *args, **kwargs):
+        kwargs.setdefault('huge_tree', True)
+        super(HugeHTMLParser, self).__init__(*args, **kwargs)
+
+
 class SafeXMLParser(etree.XMLParser):
     def __init__(self, *args, **kwargs):
         kwargs.setdefault('resolve_entities', False)
         super(SafeXMLParser, self).__init__(*args, **kwargs)

+
 _ctgroup = {
-    'html': {'_parser': html.HTMLParser,
+    'html': {'_parser': HugeHTMLParser,
              '_csstranslator': HTMLTranslator(),
              '_tostring_method': 'html'},
     'xml': {'_parser': SafeXMLParser,

(Edited to include a possible patch.)
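
As a quick sanity check against a patched build (assuming the patch above and the bad_html from the repro at the top of this issue), all four h1 elements should then be found:

sel = parsel.Selector(text=bad_html)
assert len(sel.xpath('//h1')) == 4  # matches ElementTree and html5lib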

@stranac commented Mar 9, 2018

Actually, since XMLParser has the same behavior and already supports the argument, I think patching create_root_node might be preferable:

--- a/parsel/selector.py
+++ b/parsel/selector.py
@@ -39,7 +39,7 @@ def create_root_node(text, parser_cls, base_url=None):
     """Create root node for text using given parser class.
     """
     body = text.strip().encode('utf8') or b'<html/>'
-    parser = parser_cls(recover=True, encoding='utf8')
+    parser = parser_cls(recover=True, encoding='utf8', huge_tree=True)
     root = etree.fromstring(body, parser=parser, base_url=base_url)
     if root is None:
         root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)

I thought this might need to be an optional feature (e.g. a keyword argument) for performance reasons, but I've done some simple testing and can see no performance difference, so having it enabled all the time should be fine:

>>> vanilla_parser = lxml.html.HTMLParser()
>>> huge_parser = lxml.html.HTMLParser(huge_tree=True)
>>> def vanilla():
...     lxml.html.fromstring(data, parser=vanilla_parser)
...
>>> def huge():
...     lxml.html.fromstring(data, parser=huge_parser)
...
>>> timeit.repeat(vanilla, number=1000)
[9.251072943676263, 9.233368926215917, 9.403614949900657]
>>> timeit.repeat(huge, number=1000)
[9.30129877710715, 9.280261006206274, 9.6553337960504]
>>> timeit.repeat(vanilla, number=1000)
[9.392337845172733, 9.374456495046616, 9.257086860947311]
>>> timeit.repeat(huge, number=1000)
[9.3193543930538, 9.24144608201459, 9.242193738929927]

@ghost added the upstream issue label on Mar 9, 2018

@kmike (Member) commented Mar 12, 2018

+1 to using huge_tree=True by default, though we need to add a check: parsel must keep working with older lxml versions that don't support this parameter.

It may also be good to expose this option to parsel users somehow; some users who do broad crawls may want to limit tree size as a security/robustness measure.
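
A minimal sketch of one way such a check could work (illustrative only; the helper name is hypothetical, and probing the parser is one alternative to comparing version numbers):

from lxml import html

def _lxml_html_supports_huge_tree():
    # lxml parsers reject unknown keyword arguments with a TypeError,
    # so constructing a throwaway parser doubles as a feature probe.
    try:
        html.HTMLParser(huge_tree=True)
    except TypeError:
        return False
    return True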

@stranac commented Mar 13, 2018

I dislike version checking and would prefer to just require a new lxml with a new parsel, but I'm not the one making the decision.

How would lower versions be handled?
Do we return a selector that behaves differently depending on lxml version? Do we issue a warning?

As for having the option to disable it, I think a keyword arg on Selector.__init__ makes the most sense.
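
For illustration, here is a minimal sketch of that keyword-argument approach (hypothetical; not the patch that was eventually merged), building on the create_root_node shown above:

def create_root_node(text, parser_cls, base_url=None, huge_tree=True):
    """Create root node for text using given parser class."""
    body = text.strip().encode('utf8') or b'<html/>'
    # Forward the flag to the lxml parser; the permissive behavior
    # stays on by default.
    parser = parser_cls(recover=True, encoding='utf8', huge_tree=huge_tree)
    root = etree.fromstring(body, parser=parser, base_url=base_url)
    if root is None:
        root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
    return root

Selector.__init__ would accept the same flag and pass it through, so a broad-crawl user could opt out per selector, e.g. Selector(text=page_html, huge_tree=False).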

@Langdi (Contributor) commented Jul 10, 2018

I opened a pull request for this issue, but the question @stranac raises still stands. I think the main problems are these two scenarios:

  1. lxml supports huge_tree but it is disabled via the argument (see source code)
  2. lxml doesn't support huge_tree but it is enabled (either passed explicitly or on by default).

How would you suggest handling these cases? Right now, I have implemented it so that both scenarios fail and raise a ValueError.
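
To make the second scenario concrete, here is a hedged sketch of the fail-fast check (the names are hypothetical; LXML_SUPPORTS_HUGE_TREE stands in for whatever flag the pull request actually computes):

def _check_huge_tree(huge_tree):
    # The caller asked for huge_tree, but the installed lxml cannot
    # honor it; silently ignoring the flag would reintroduce the
    # mis-parsing this issue describes, so fail loudly instead.
    if huge_tree and not LXML_SUPPORTS_HUGE_TREE:
        raise ValueError(
            'huge_tree=True requested, but the installed lxml '
            'does not support the huge_tree option')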

@wRAR (Member) commented Oct 28, 2023

Fixed by #116.

@wRAR closed this as completed on Oct 28, 2023