Bad HTML parsing #57

Open
luc-phan opened this issue Aug 29, 2016 · 1 comment

Comments

@luc-phan

Given the same HTML, here is what different parsers produce:

=== HTML ===
<li>
 one
 <div>
</li>
<li>
 two
</li>
=== parsel (lxml) (marginal interpretation) ===
<html><body><li>
 one
 <div>

<li>
 two
</li></div></li></body></html>
=== html.parser ===
<li>
 one
 <div>
 </div>
</li>
<li>
 two
</li>
=== lxml (same problem as parsel of course) ===
<html>
 <body>
  <li>
   one
   <div>
    <li>
     two
    </li>
   </div>
  </li>
 </body>
</html>
=== html5lib (Parses pages the same way a web browser does) ===
<html>
 <head>
 </head>
 <body>
  <li>
   one
   <div>
   </div>
  </li>
  <li>
   two
  </li>
 </body>
</html>

It is very annoying to scrape a page when the parser builds a different tree than a web browser would. It would be a good addition to provide a way to use a parser other than lxml.

#!/usr/bin/env python

from parsel import Selector
from bs4 import BeautifulSoup

print('=== HTML ===')
html = '''<li>
 one
 <div>
</li>
<li>
 two
</li>'''
print(html)

print('=== parsel (lxml) (marginal interpretation) ===')
sel = Selector(text=html)
print(sel.extract())

print('=== html.parser ===')
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

print('=== lxml (same problem as parsel of course) ===')
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())

print('=== html5lib (Parses pages the same way a web browser does) ===')
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())
@redapple
Contributor

See related #54 which adds a parser_cls attribute to customize the parser.
Note that scrapy/parsel favors speed (lxml) over browser-parsing compliance: html5lib is still much slower than lxml (as far as I know; I haven't checked recently).
