[MRG+1] Expose lxml.html.HTMLParser as an optional parser. #41

rmax · 2016-06-23T17:59:24Z

This option is useful because it gives access to the helper methods of
the lxml.html.HtmlElement element class. For example, to make all links
absolute:

>>> import parsel
>>> sel = parsel.Selector(u'<a href="foo"></a>',
...                       base_url="http://example.com/", type='html_html')
>>> sel.root.make_links_absolute()
>>> sel.xpath('//a/@href').extract()
[u'http://example.com/foo']

The type name html_html comes from the parser location lxml.html.HTMLParser.


codecov-io · 2016-06-23T18:04:48Z

Current coverage is 100%

No coverage report found for master at fde9087.

Powered by Codecov. Last updated by fde9087...382bbb6

eliasdorneles · 2016-06-23T18:14:16Z

It seems these parsers are almost the same: html.HTMLParser is a subclass of etree.HTMLParser configured to return different elements (elements from the lxml.html module, I figure): http://lxml.de/api/lxml.html-pysrc.html#HTMLParser.__init__

What's stopping us from making lxml.html.HTMLParser the default? Is it slower or something?

eliasdorneles · 2016-06-23T18:16:30Z

right, already discussing in #40, sorry for the noise!

rmax · 2016-06-23T18:18:27Z

Yes, the only difference is the element class.

In [19]: type(sel_html.root)
lxml.html.HtmlElement

In [20]: type(sel.root)
lxml.etree._Element

It's not clear whether there could be a performance degradation. My unscientific timeit benchmark gave the same results. Also, backwards compatibility may be a good reason not to change it right away, as we don't know whether somebody depends on the default element class.
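
The element class difference can also be reproduced with lxml alone, without going through parsel. The following is a minimal sketch, assuming only that lxml is installed; the names MARKUP, plain_root and html_root are illustrative and not part of this pull request.

```python
from lxml import etree, html

MARKUP = b'<a href="foo"></a>'

# lxml.html.HTMLParser subclasses etree.HTMLParser; it differs in the
# element class lookup it installs.
print(issubclass(html.HTMLParser, etree.HTMLParser))

# Parse the same markup with both parsers.
plain_root = etree.fromstring(MARKUP, parser=etree.HTMLParser())
html_root = etree.fromstring(MARKUP, parser=html.HTMLParser())

print(type(plain_root))
print(type(html_root))

# Helper methods such as make_links_absolute are defined on HtmlElement,
# so only the second root exposes them.
print(hasattr(plain_root, "make_links_absolute"))
print(hasattr(html_root, "make_links_absolute"))
```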

eliasdorneles · 2016-06-23T18:22:08Z

Right, agreed.
Okay, looks good to me!

redapple · 2016-07-06T12:13:37Z

html_html looks weird to me as a type= argument value.
Also, how are we going to document this? "html" and "xml" are easy for users to understand. "html_html" is harder: well, it's for HTML, sure, but it only adds internal methods (that are not exposed, so you don't need to care).
As I see it, it'll be there "for those who know", who know that self.root gives access to the underlying lxml parsed doc, which is handy indeed (and even used in scrapy linkextractors). How about exposing self.root explicitly (with docs) under another name with "lxml" in it?

In the end, I would lean towards what @eliasdorneles said in #41 (comment), and would be interested in a performance comparison; if it's really close, make lxml.html.HTMLParser the default parser.

Another option is to add a parser= argument "for those who know what they're doing", so that we could even support html5lib and beautifulsoup parsers for cases where lxml chokes.
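
To make the parser= idea concrete, here is a rough sketch written against lxml directly. It is not parsel's API: build_root and its parser_cls parameter are hypothetical names standing in for the proposed argument, and html5lib or beautifulsoup support is not covered.

```python
from lxml import etree, html


def build_root(text, parser_cls=etree.HTMLParser):
    """Parse text with the given lxml parser class and return the root element.

    Hypothetical helper: parser_cls plays the role of the proposed
    parser= argument and defaults to the plain etree parser.
    """
    parser = parser_cls(recover=True, encoding="utf8")
    return etree.fromstring(text.encode("utf8"), parser=parser)


# Opting in to lxml.html.HTMLParser yields HtmlElement nodes, so the
# helper methods become available on the root.
root = build_root('<a href="foo"></a>', parser_cls=html.HTMLParser)
root.make_links_absolute("http://example.com/")
links = root.xpath("//a/@href")
print(links)
```

Callers who pass nothing keep the existing element class, which is the backwards-compatibility concern raised earlier in this thread.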

redapple · 2016-07-25T11:45:45Z

@rolando, @eliasdorneles, any comment on my previous comment? :)

eliasdorneles · 2016-07-25T12:35:52Z

I like the idea of introducing the parser= argument.

It's more flexible, and it also allows us to postpone changing the default -- we might want some more experience using it in production first.

About the performance comparison, I think that once we have numbers it would be nice to have an acceptance test that makes the speed requirements a bit more concrete. :)
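
One possible shape for such a measurement is sketched below with timeit. Everything in it is an assumption made for illustration: the BODY document, the repeat counts and the MAX_SLOWDOWN threshold are arbitrary, and no real benchmark numbers are implied.

```python
import timeit

from lxml import etree, html

# Synthetic document; a real acceptance test would use representative pages.
BODY = b"<html><body>" + b'<p><a href="foo">link</a></p>' * 1000 + b"</body></html>"


def parse_with(parser_cls):
    """Parse BODY once with a fresh instance of the given parser class."""
    return etree.fromstring(BODY, parser=parser_cls(recover=True))


def best_time(parser_cls, number=20, repeat=5):
    """Return the fastest of `repeat` runs, each parsing BODY `number` times."""
    timings = timeit.repeat(lambda: parse_with(parser_cls), number=number, repeat=repeat)
    return min(timings)


baseline = best_time(etree.HTMLParser)
candidate = best_time(html.HTMLParser)
slowdown = candidate / baseline

# MAX_SLOWDOWN is an arbitrary illustrative budget, not a project requirement.
MAX_SLOWDOWN = 1.5
print(f"slowdown factor: {slowdown:.2f} (budget: {MAX_SLOWDOWN})")
print("within budget" if slowdown <= MAX_SLOWDOWN else "over budget")
```

Taking the minimum over several repeats reduces noise from other processes, but timings still vary by machine, so any threshold would need to be chosen generously.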

redapple · 2016-11-14T16:29:37Z

Closed in favor of #63

rmax mentioned this pull request Jun 23, 2016

Selector.root is not an instance of lxml.html.HtmlElement even if parser is html #40

Closed

eliasdorneles changed the title ~~Expose lxml.html.HTMLParser as an optional parser.~~ [MRG+1] Expose lxml.html.HTMLParser as an optional parser. Jun 23, 2016

rmax mentioned this pull request Jun 23, 2016

Added command line interface. #42

Closed

eliasdorneles mentioned this pull request Nov 14, 2016

[MRG+1] Change default parser to html.HTMLParser #63

Merged

redapple closed this Nov 14, 2016

barrio mentioned this pull request Apr 30, 2024

Parsel import causes crash #294

Closed
