[MRG+1] Add parser_cls argument, changes default html parser to html.HTMLParser (closes #40) #54

eliasdorneles · 2016-08-10T13:13:35Z

This changes the default HTML parser to lxml.html.HTMLParser, and also
introduces a parser_cls parameter in Selector to specify another parser
class if desired.

The parser_cls parameter lets users who care a great deal about
performance plug in a custom parser.

This will affect Scrapy, because it just uses the default here, but it
doesn't seem to have a noticeable impact on performance, as per @kmike's
benchmark shared here:

https://gist.github.com/kmike/af647777cef39c3d01071905d176c006

codecov-io · 2016-08-10T13:23:18Z

Current coverage is 100% (diff: 100%)

Merging #54 into master will not change coverage

@@           master   #54   diff @@
===================================
  Files           4     4          
  Lines         196   196          
  Methods         0     0          
  Messages        0     0          
  Branches       34    34          
===================================
  Hits          196   196          
  Misses          0     0          
  Partials        0     0

Powered by Codecov. Last update 4434921...0509ca7

kmike · 2016-08-11T06:21:45Z

parsel/selector.py

@@ -139,9 +139,9 @@ class Selector(object):
    selectorlist_cls = SelectorList

    def __init__(self, text=None, type=None, namespaces=None, root=None,
-                 base_url=None, _expr=None):
+                 base_url=None, _expr=None, parser_cls=None):


Are there use cases for this argument other than passing etree.HTMLParser?

I tried using lxml.html.html5parser.HTMLParser, but unfortunately the encoding param makes it choke:

>>> from lxml.html.html5parser import HTMLParser
>>> s = Selector(text=u'<html><body><p>test</p></body></html>', parser_cls=HTMLParser)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/paul/src/parsel/parsel/selector.py", line 151, in __init__
    root = self._get_root(text, base_url)
  File "/home/paul/src/parsel/parsel/selector.py", line 162, in _get_root
    return create_root_node(text, self._parser, base_url=base_url)
  File "/home/paul/src/parsel/parsel/selector.py", line 42, in create_root_node
    parser = parser_cls(recover=True, encoding='utf8')
  File "/home/paul/.virtualenvs/parseldev/lib/python3.4/site-packages/lxml/html/html5parser.py", line 32, in __init__
    _HTMLParser.__init__(self, strict=strict, tree=TreeBuilder, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

Well, the idea was to support other custom subclasses of etree.HTMLParser (in the case of html5parser.HTMLParser, one could build a subclass of it that supports the encoding argument).
Does this make sense, or do you prefer to drop it?
I'm not attached to it, so I'm fine either way.

hm, I still don't like that create_root_node assumes a very specific HTMLParser class; it doesn't work with stdlib HTMLParser, it doesn't work with html5lib HTMLParser, and there is no parser class in lxml.html.soupparser.
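
The objection above can be reproduced without lxml. Per the traceback, create_root_node builds the parser as parser_cls(recover=True, encoding='utf8'), so any class whose constructor does not take those keywords fails. The following is a minimal sketch using only the standard library; accepts_kwargs and LxmlStyleParser are hypothetical names made up for illustration and are not part of parsel or lxml.

```python
import inspect
from html.parser import HTMLParser as StdlibHTMLParser

# Keyword arguments that create_root_node passes, according to the traceback.
LXML_STYLE_KWARGS = {"recover": True, "encoding": "utf8"}


class LxmlStyleParser(object):
    """Hypothetical stand-in for a parser with an etree.HTMLParser-like constructor."""

    def __init__(self, recover=False, encoding=None):
        self.recover = recover
        self.encoding = encoding


def accepts_kwargs(parser_cls, kwargs):
    """Return True if parser_cls's constructor can bind every keyword in kwargs."""
    try:
        inspect.signature(parser_cls).bind(**kwargs)
    except TypeError:
        return False
    return True


# The stdlib parser only knows its own keyword-only options, so the call fails
# in the same way html5parser.HTMLParser did in the traceback.
stdlib_error = None
try:
    StdlibHTMLParser(**LXML_STYLE_KWARGS)
except TypeError as exc:
    stdlib_error = str(exc)

print(accepts_kwargs(StdlibHTMLParser, LXML_STYLE_KWARGS))
print(accepts_kwargs(LxmlStyleParser, LXML_STYLE_KWARGS))
print(stdlib_error)
```

A check like accepts_kwargs only confirms that the constructor signature matches. It says nothing about whether the resulting object can actually be handed to lxml as a parser, which is the deeper coupling being pointed out here.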


eliasdorneles · 2016-09-19T18:51:15Z

@kmike @redapple I think we can merge this, what do you say?

redapple · 2016-09-19T19:09:52Z

@eliasdorneles, the docstring for Selector also needs an update, I believe.

redapple · 2016-09-19T21:15:59Z

Let's split the change into 1) the move to lxml.html.HTMLParser and 2) the customizability of the parser class?

eliasdorneles · 2016-11-14T14:58:43Z

I suppose the customizability doesn't make that much sense after all, as @kmike pointed out -- it could be confusing as it's tied to a very specific parser API.

I've sent a new PR for just changing the default parser, as it looks like it will solve all the problems and any performance degradation seems negligible.

eliasdorneles · 2016-11-14T14:59:34Z

To be explicit, I'm in favor of closing this PR and merging #63 instead

@redapple @kmike 👆

redapple · 2016-11-14T16:02:40Z

Makes sense @eliasdorneles !
I'm closing this one then.

bijanpiri · 2017-08-21T07:16:19Z

I am using Scrapy 1.4 and the latest parsel, but there is no parser_cls option for initializing Selector.
By the way, my problem is a bug when parsing the following HTML examples:
example 1:
<div> x<2</div>
example 2:
<div> x<y</div>
I used the following code to check how the above examples are parsed:

body=u'<div> x<2 </div>'
Selector(text=body).xpath('//div').extract()
body=u'<div> x<y </div>'
Selector(text=body).xpath('//div').extract()

Which outputs the following:

[u'<div> x</div>']
[u'<div> x<y></y></div>']

But I expected:

[u'<div> x<2</div>']
[u'<div> x<y</div>']

How should I change the Selector parser to get the correct output?

kmike · 2017-08-21T07:29:00Z

@bijanpiri parsel hasn't implemented support for HTML5 parsers yet. You can use them directly without parsel. I think you can also try creating a Selector and passing a properly parsed tree in the root argument.

eliasdorneles mentioned this pull request Aug 10, 2016

Selector.root is not an instance of lxml.html.HtmlElement even if parser is html #40

Closed

kmike reviewed Aug 11, 2016

redapple mentioned this pull request Aug 29, 2016

Bad HTML parsing #57

Open

This was referenced Sep 14, 2016

Illegal character (<,>,&) in HTML cause xpath extracted value to be empty #286

Open

adding option to use html5lib instead of default htmlparser scrapy/scrapy#1043

Closed

eliasdorneles added 2 commits September 19, 2016 15:39

add test for default behavior

0509ca7

eliasdorneles force-pushed the introduce-parser-arg-and-changes-default branch from 13eb040 to 0509ca7 Compare September 19, 2016 18:45

redapple changed the title ~~Add parser_cls argument, changes default html parser to html.HTMLParser (closes #40)~~ [MRG+1] Add parser_cls argument, changes default html parser to html.HTMLParser (closes #40) Sep 19, 2016

eliasdorneles mentioned this pull request Nov 14, 2016

[MRG+1] Change default parser to html.HTMLParser #63

Merged

redapple closed this Nov 14, 2016

eliasdorneles deleted the introduce-parser-arg-and-changes-default branch November 22, 2016 13:48

joaquingx mentioned this pull request Jan 11, 2019

Add HTML5Parser option #133

Closed

barrio mentioned this pull request Apr 30, 2024

Parsel import causes crash #294

Closed
