-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Add parser_cls argument, changes default html parser to html.HTMLParser (closes #40) #54
Conversation
Current coverage is 100% (diff: 100%)@@ master #54 diff @@
===================================
Files 4 4
Lines 196 196
Methods 0 0
Messages 0 0
Branches 34 34
===================================
Hits 196 196
Misses 0 0
Partials 0 0
|
@@ -139,9 +139,9 @@ class Selector(object): | |||
selectorlist_cls = SelectorList | |||
|
|||
def __init__(self, text=None, type=None, namespaces=None, root=None, | |||
base_url=None, _expr=None): | |||
base_url=None, _expr=None, parser_cls=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there other use cases for this argument other than passing etree.HtmlParser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using lxml.html.html5parser.HTMLParser
but unfortunately, encoding
param makes it choke
>>> from lxml.html.html5parser import HTMLParser
>>> s = Selector(text=u'<html><body><p>test</p></body></html>', parser_cls=HTMLParser)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/paul/src/parsel/parsel/selector.py", line 151, in __init__
root = self._get_root(text, base_url)
File "/home/paul/src/parsel/parsel/selector.py", line 162, in _get_root
return create_root_node(text, self._parser, base_url=base_url)
File "/home/paul/src/parsel/parsel/selector.py", line 42, in create_root_node
parser = parser_cls(recover=True, encoding='utf8')
File "/home/paul/.virtualenvs/parseldev/lib/python3.4/site-packages/lxml/html/html5parser.py", line 32, in __init__
_HTMLParser.__init__(self, strict=strict, tree=TreeBuilder, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the idea was to support other custom subclasses of etree.HTMLParser (in the case of html5parser.HTMLParser, one could build a subclass of that supporting the encoding argument).
Does this make sense or do you prefer to drop it?
I'm not attached to it, so I'm fine with either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, I still don't like that create_root_node assumes a very specific HTMLParser class; it doesn't work with stdlib HTMLParser, it doesn't work with html5lib HTMLParser, and there is no parser class in lxml.html.soupparser.
This changes the default HTML parser to html.HTMLParser, and also introduces a parameter in Selector to specify another parser class if desired. The parser parameter will enable users that care a big deal about performance to use a custom parser if desired. This will affect Scrapy because it just uses the default here, but doesn't seem to have a perceived impact on performance, as per @kmike benchmark shared here: https://gist.github.com/kmike/af647777cef39c3d01071905d176c006
13eb040
to
0509ca7
Compare
@eliasdorneles , the docstring for |
Let's split the change into 1) the move to |
I suppose the customizability doesn't make that much sense after all, as @kmike pointed out -- it could be confusing as it's tied to a very specific parser API. I've sent a new PR for just changing the default parser, as it looks like it will solve all the problems and any performance degradation seems negligible. |
Makes sense @eliasdorneles ! |
I am using scarpy 1.4 and latest parsel but there is no parsel_cls option for initializing Selector.
Which outputs following:
But I expected:
How should I change Selector parser to get correct output? |
@bijanpiri parsel haven't implemented support for html5 parsers yet. You can use them directly without parsel. I think you can also try creating Selector, passing a properly parsed tree in a |
This changes the default HTML parser to html.HTMLParser, and also
introduces a parameter in Selector to specify another parser class
if desired.
The parser parameter will enable users that care a big deal about
performance to use a custom parser if desired.
This will affect Scrapy because it just uses the default here, but
doesn't seem to have a perceived impact on performance, as per @kmike
benchmark shared here:
https://gist.github.com/kmike/af647777cef39c3d01071905d176c006