[MRG+1] Expose lxml.html.HTMLParser as an optional parser. #41
This option makes it possible to use the helper methods of the lxml.html.HtmlElement element class. For example, to make all links absolute:

>>> import parsel
>>> sel = parsel.Selector(u'<a href="foo"></a>', base_url="http://example.com/", type='html_html')
>>> sel.root.make_links_absolute()
>>> sel.xpath('//a/@href').extract()
[u'http://example.com/foo']

The type name ``html_html`` comes from the parser location ``lxml.html.HTMLParser``.
Current coverage is 100%
It seems these parsers are almost the same: html.HTMLParser is a subclass of etree.HTMLParser configured to return different elements (elements from the lxml.html module, I figure): http://lxml.de/api/lxml.html-pysrc.html#HTMLParser.__init__ What's stopping us from making
Right, this is already being discussed in #40, sorry for the noise!
Yes, the only difference is the elements class.
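To make the point above concrete, here is a minimal sketch that uses lxml directly, without parsel. It only illustrates the relationship between the two parsers as described in this thread; the markup and base URL are made-up sample values, and this is not the code from the PR.

```python
import lxml.etree
import lxml.html

# Sample values chosen for illustration only.
MARKUP = '<html><body><a href="foo">link</a></body></html>'
BASE_URL = "http://example.com/"

# lxml.html.HTMLParser is a subclass of lxml.etree.HTMLParser.
is_subclass = issubclass(lxml.html.HTMLParser, lxml.etree.HTMLParser)

# Parse the same markup with each parser.
plain_root = lxml.etree.fromstring(MARKUP, parser=lxml.etree.HTMLParser())
html_root = lxml.etree.fromstring(MARKUP, parser=lxml.html.HTMLParser())

# Only elements produced by the lxml.html parser carry the helper methods.
plain_has_helper = hasattr(plain_root, "make_links_absolute")
html_has_helper = hasattr(html_root, "make_links_absolute")
is_html_element = isinstance(html_root, lxml.html.HtmlElement)

# Rewrite relative links in place, passing the base URL explicitly.
html_root.make_links_absolute(BASE_URL)
hrefs = [str(href) for href in html_root.xpath("//a/@href")]
print(hrefs)
```

The base URL is passed to make_links_absolute() explicitly here so the sketch does not depend on how the document's base URL was set at parse time.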
It's not clear whether there could be a performance degradation. My unscientific
Right, agreed.
In the end, I would lean towards what @eliasdorneles said in #41 (comment), and would be interested in a performance comparison, and if it's really close, make Another option is to add a
@rolando, @eliasdorneles, any comments on my previous comment? :)
I like the idea of introducing the It's more flexible and it also allows us to postpone changing the default: we might want to get some more experience using it in production first. About the performance comparison, I think that once we have numbers it would be nice to have an acceptance test that makes the speed requirements a bit more concrete. :)
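One possible shape for such an acceptance test, sketched with the standard library only. The function names, the 1.1 ratio budget, and the stand-in workloads are all hypothetical; no numbers or thresholds were agreed on in this thread.

```python
import timeit


def slowdown_ratio(baseline, candidate, number=200, repeat=5):
    """Return best-of-`repeat` candidate time divided by baseline time."""
    base = min(timeit.repeat(baseline, number=number, repeat=repeat))
    cand = min(timeit.repeat(candidate, number=number, repeat=repeat))
    return cand / base


def within_budget(baseline, candidate, max_ratio=1.1, **kwargs):
    """True if the candidate is at most `max_ratio` times slower."""
    return slowdown_ratio(baseline, candidate, **kwargs) <= max_ratio


# Stand-in workloads; a real test would parse a fixed HTML corpus
# with each parser instead (assumption, not part of the PR).
def baseline_work():
    return sum(range(1000))


def candidate_work():
    return sum(range(1000))


ratio = slowdown_ratio(baseline_work, candidate_work)
print("candidate/baseline ratio: %.2f" % ratio)
```

In a real test, the two callables would wrap parsing with lxml.etree.HTMLParser and lxml.html.HTMLParser respectively. Taking the minimum over several repeats reduces noise, but timing-based assertions can still be flaky on shared CI machines, so the budget should be generous.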
Closed in favor of #63