Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Add parser_cls argument, changes default html parser to html.HTMLParser (closes #40) #54

Closed

Conversation

eliasdorneles
Copy link
Member

This changes the default HTML parser to html.HTMLParser, and also
introduces a parameter in Selector to specify another parser class
if desired.

The parser parameter will enable users that care a big deal about
performance to use a custom parser if desired.

This will affect Scrapy because it just uses the default here, but
doesn't seem to have a perceived impact on performance, as per @kmike
benchmark shared here:

https://gist.github.com/kmike/af647777cef39c3d01071905d176c006

@codecov-io
Copy link

codecov-io commented Aug 10, 2016

Current coverage is 100% (diff: 100%)

Merging #54 into master will not change coverage

@@           master   #54   diff @@
===================================
  Files           4     4          
  Lines         196   196          
  Methods         0     0          
  Messages        0     0          
  Branches       34    34          
===================================
  Hits          196   196          
  Misses          0     0          
  Partials        0     0          

Powered by Codecov. Last update 4434921...0509ca7

@@ -139,9 +139,9 @@ class Selector(object):
selectorlist_cls = SelectorList

def __init__(self, text=None, type=None, namespaces=None, root=None,
base_url=None, _expr=None):
base_url=None, _expr=None, parser_cls=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other use cases for this argument other than passing etree.HtmlParser?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using lxml.html.html5parser.HTMLParser but unfortunately, encoding param makes it choke

>>> from lxml.html.html5parser import HTMLParser
>>> s = Selector(text=u'<html><body><p>test</p></body></html>', parser_cls=HTMLParser)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/paul/src/parsel/parsel/selector.py", line 151, in __init__
    root = self._get_root(text, base_url)
  File "/home/paul/src/parsel/parsel/selector.py", line 162, in _get_root
    return create_root_node(text, self._parser, base_url=base_url)
  File "/home/paul/src/parsel/parsel/selector.py", line 42, in create_root_node
    parser = parser_cls(recover=True, encoding='utf8')
  File "/home/paul/.virtualenvs/parseldev/lib/python3.4/site-packages/lxml/html/html5parser.py", line 32, in __init__
    _HTMLParser.__init__(self, strict=strict, tree=TreeBuilder, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the idea was to support other custom subclasses of etree.HTMLParser (in the case of html5parser.HTMLParser, one could build a subclass of that supporting the encoding argument).
Does this make sense or do you prefer to drop it?
I'm not attached to it, so I'm fine with either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I still don't like that create_root_node assumes a very specific HTMLParser class; it doesn't work with stdlib HTMLParser, it doesn't work with html5lib HTMLParser, and there is no parser class in lxml.html.soupparser.

This changes the default HTML parser to html.HTMLParser, and also
introduces a parameter in Selector to specify another parser class
if desired.

The parser parameter will enable users that care a big deal about
performance to use a custom parser if desired.

This will affect Scrapy because it just uses the default here, but
doesn't seem to have a perceived impact on performance, as per @kmike
benchmark shared here:

https://gist.github.com/kmike/af647777cef39c3d01071905d176c006
@eliasdorneles eliasdorneles force-pushed the introduce-parser-arg-and-changes-default branch from 13eb040 to 0509ca7 Compare September 19, 2016 18:45
@eliasdorneles
Copy link
Member Author

@kmike @redapple I think we can merge this, what do you say?

@redapple redapple changed the title Add parser_cls argument, changes default html parser to html.HTMLParser (closes #40) [MRG+1] Add parser_cls argument, changes default html parser to html.HTMLParser (closes #40) Sep 19, 2016
@redapple
Copy link
Contributor

redapple commented Sep 19, 2016

@eliasdorneles , the docstring for Selector also needs update I believe.

@redapple
Copy link
Contributor

Let's split the change into 1) the move to lxml.html.HTMLParser and 2) the customizability of the parser class?

@eliasdorneles
Copy link
Member Author

I suppose the customizability doesn't make that much sense after all, as @kmike pointed out -- it could be confusing as it's tied to a very specific parser API.

I've sent a new PR for just changing the default parser, as it looks like it will solve all the problems and any performance degradation seems negligible.

@eliasdorneles
Copy link
Member Author

To be explicit, I'm in favor of closing this PR and merging #63 instead

@redapple @kmike 👆

@redapple
Copy link
Contributor

Makes sense @eliasdorneles !
I'm closing this one then.

@redapple redapple closed this Nov 14, 2016
@eliasdorneles eliasdorneles deleted the introduce-parser-arg-and-changes-default branch November 22, 2016 13:48
@bijanpiri
Copy link

I am using scarpy 1.4 and latest parsel but there is no parsel_cls option for initializing Selector.
By the way my problem is the bug to parse following html examples:
example 1:
<div> x<2</div>
example 2:
<div> x<y</div>
I used following code to check above example for parsing:

body=u'<div> x<2 </div>'
Selector(text=body).xpath('//div').extract()
body=u'<div> x<y </div>'
Selector(text=body).xpath('//div').extract()

Which outputs following:

[u'<div> x</div>']
[u'<div> x<y></y></div>']

But I expected:

[u'<div> x<1</div>']
[u'<div> x<y</div>']

How should I change Selector parser to get correct output?

@kmike
Copy link
Member

kmike commented Aug 21, 2017

@bijanpiri parsel haven't implemented support for html5 parsers yet. You can use them directly without parsel. I think you can also try creating Selector, passing a properly parsed tree in a root argument.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants