Bad HTML parsing #57

luc-phan · 2016-08-29T09:52:11Z

Given the same HTML code, here is what different parsers see:

=== HTML ===
<li>
 one
 <div>
</li>
<li>
 two
</li>
=== parsel (lxml) (marginal interpretation) ===
<html><body><li>
 one
 <div>

<li>
 two
</li></div></li></body></html>
=== html.parser ===
<li>
 one
 <div>
 </div>
</li>
<li>
 two
</li>
=== lxml (same problem as parsel of course) ===
<html>
 <body>
  <li>
   one
   <div>
    <li>
     two
    </li>
   </div>
  </li>
 </body>
</html>
=== html5lib (Parses pages the same way a web browser does) ===
<html>
 <head>
 </head>
 <body>
  <li>
   one
   <div>
   </div>
  </li>
  <li>
   two
  </li>
 </body>
</html>

It is very annoying to scrape a page when the parser builds a different tree than a web browser would. It would be a good addition to provide a way to use a parser other than lxml.

#!/usr/bin/env python

from parsel import Selector
from bs4 import BeautifulSoup

print('=== HTML ===')
html = '''<li>
 one
 <div>
</li>
<li>
 two
</li>'''
print(html)

print('=== parsel (lxml) (marginal interpretation) ===')
sel = Selector(text=html)
print(sel.extract())

print('=== html.parser ===')
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

print('=== lxml (same problem as parsel of course) ===')
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())

print('=== html5lib (Parses pages the same way a web browser does) ===')
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())

redapple · 2016-08-29T09:57:01Z

See related #54 which adds a parser_cls attribute to customize the parser.
Note that scrapy/parsel favors speed (lxml) over browser-parsing compliance: html5lib is still much slower than lxml (as far as I know; I haven't checked recently).
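
The speed difference can be checked locally with a rough timing sketch like the one below. The sample document, the repeat count, and the function names are arbitrary assumptions made for illustration; the numbers depend on the machine and on the installed lxml and html5lib versions, so no result is claimed here.

```python
import timeit

import html5lib
import lxml.html

# Arbitrary, well-formed sample document (an assumption for illustration only).
DOC = '<html><body><ul>' + '<li>item</li>' * 1000 + '</ul></body></html>'


def parse_with_lxml():
    # lxml's own HTML parser, the one parsel uses by default.
    return lxml.html.document_fromstring(DOC)


def parse_with_html5lib():
    # html5lib's browser-style parser, building an lxml tree for a fair comparison.
    return html5lib.parse(DOC, treebuilder='lxml', namespaceHTMLElements=False)


lxml_seconds = timeit.timeit(parse_with_lxml, number=20)
html5lib_seconds = timeit.timeit(parse_with_html5lib, number=20)

print('lxml:     %.4f s for 20 parses' % lxml_seconds)
print('html5lib: %.4f s for 20 parses' % html5lib_seconds)
```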

redapple mentioned this issue Sep 1, 2016

Bad HTML parser scrapy/scrapy#2205

Closed

joaquingx mentioned this issue Jan 11, 2019

Add HTML5Parser option #133

Closed

Gallaecio added the enhancement label May 9, 2019

Gallaecio added the discuss label Sep 24, 2019

barrio mentioned this issue Apr 30, 2024

Parsel import causes crash #294

Closed