Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Expose lxml.html.HTMLParser as an optional parser. #41

Closed
wants to merge 1 commit into from

Conversation

rmax
Copy link

@rmax rmax commented Jun 23, 2016

The usefulness of this option is to be able to use helper methods from
the lxml.html.HTMLElement element class. For example, to make all links
absolute:

>>> import parsel
>>> sel = parsel.Selector(u'<a href="foo"></a>',
base_url="http://example.com/", type='html_html')
>>> sel.root.make_links_absolute()
>>> sel.xpath('//a/@href').extract()
[u'http://example.com/foo']

The type name html_html comes from the parser location lxml.html.HTMLParser.

The usefulness of this option is to be able to use helper methods from
the lxml.html.HTMLElement element class. For example, to make all links
absolute:

>>> import parsel
>>> sel = parsel.Selector(u'<a href="foo"></a>',
base_url="http://example.com/", type='html_html')
>>> sel.root.make_links_absolute()
>>> sel.xpath('//a/@href').extract()
[u'http://example.com/foo']

The type name ``html_html` comes from the parser location
``lxml.html.HTMLParser``.
@codecov-io
Copy link

Current coverage is 100%

No coverage report found for master at fde9087.

Powered by Codecov. Last updated by fde9087...382bbb6

@eliasdorneles
Copy link
Member

It seems these parsers are almost the same, this html.HTMLParser is a subclass of etree.HTMLParser configured to return different elements (elements from lxml.html module, I figure): http://lxml.de/api/lxml.html-pysrc.html#HTMLParser.__init__

What's stopping us from making lxml.html.HTMLParser the default? Is it slower or something?

@eliasdorneles
Copy link
Member

right, already discussing in #40, sorry for the noise!

@rmax
Copy link
Author

rmax commented Jun 23, 2016

Yes, the only difference is the elements class.

In [19]: type(sel_html.root)
lxml.html.HtmlElement

In [20]: type(sel.root)
lxml.etree._Element

It's not clear whether there could be a performance degradation. My unscientific timeit benchmark gave same results. Also backwards compatibility may be a good reason to not change it right away as we don't know whether somebody is depending on the default element class.

@eliasdorneles eliasdorneles changed the title Expose lxml.html.HTMLParser as an optional parser. [MRG+1] Expose lxml.html.HTMLParser as an optional parser. Jun 23, 2016
@eliasdorneles
Copy link
Member

Right, agreed.
Okay, looks good to me!

@redapple
Copy link
Contributor

redapple commented Jul 6, 2016

html_html looks weird to me as a type= argument value.
Also, how are we going to document this? "html", "xml" are easy to understand for users. "html_html" is harder: well, it's for HTML sure, but it only adds internal methods (that are not exposed so you don't need to care.
As I see it, it'll be there "for those who know", who know that self.root gives access to the underlying lxml parsed doc, which is handy indeed (and even used in scrapy linkextractors). How about exposing self.root explicitly (with docs) under another name with "lxml" in it?

In the end, I would lean towards what @eliasdorneles said in #41 (comment) , and would be interested in performance comparison, and if it's really close, make lxml.html.HTMLParser the default parser.

Another option is to add a parser= argument "for those who know what they're doing", so that we could even support html5lib and beautifulsoup parsers for cases where lxml chokes.

@redapple
Copy link
Contributor

@rolando , @eliasdorneles , any comment on my previous comment? :)

@eliasdorneles
Copy link
Member

I like the idea of introducing the parser= argument.

It's more flexible and it also allows us to postpone changing the default -- we could want to have some more experience using it in production first.

About performance comparison, I think when we do have numbers it would be nice to have an acceptance test making the speed requirements a bit more concrete. :)

@redapple
Copy link
Contributor

Closed in favor of #63

@redapple redapple closed this Nov 14, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants