Add support for html5lib #2708

Closed
phaer opened this issue Apr 5, 2014 · 1 comment

@phaer (Contributor) commented Apr 5, 2014

Hi,

I stumbled upon problems with malformed HTML when trying to implement an extractor for cba.fro.at, and at least #2474 seems to suggest that I am not the only one.

So I would like to propose optional support for html5lib (or BeautifulSoup if that's preferred). It's a bit more liberal than the current parser and could easily be added. We could either replace the current implementation of InfoExtractor._download_xml(), if html5lib's ETree proves to be 100% compatible when run against the test suite, or add an InfoExtractor._download_faulty_xml() method if not.
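To illustrate the kind of failure meant here, a minimal sketch using only the standard library's strict XML parser. The sample markup and the `parse_strict` helper are made up for illustration and are not part of youtube-dl; `ET.ParseError` requires Python 2.7 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: typical "tag soup" with an unclosed <p> and <br>.
MALFORMED = "<html><body><p>Unclosed paragraph<br></body></html>"


def parse_strict(markup):
    """Return the root element, or None if the markup is not well-formed XML."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        # A strict XML parser rejects the whole document on the first error.
        return None


root = parse_strict(MALFORMED)
print("parsed" if root is not None else "rejected by the strict XML parser")
```

A more liberal HTML parser would instead try to recover a tree from such input, which is the behaviour being proposed.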

What do you think? I could implement that, but I wanted to hear some opinions before I get to work.

@phaer changed the title from "Add (optional) support for html5lib" to "Add support for html5lib" on Apr 5, 2014
@phihag (Contributor) commented Apr 6, 2014

_download_xml is intended to only download valid XML. Most of the time, that's actually not XHTML. Therefore, if we were to include an HTML parser, its function would be called _download_html.

However, youtube-dl has made it a policy to not import anything not available in the (2.6) stdlib, except if it is necessary for a specific feature (and even then, it's of course only imported when the feature is requested). The reason for this is that not only can package installation be hard to do, complicated, and occasionally insecure, but that it is also a significant hassle on Windows, old RHEL releases and even IRIX. I believe this to be one of the reasons for the (relative) success of the project.
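A rough sketch of what "only imported when the feature is requested" can look like in practice. The names `load_optional` and `parse_html` are hypothetical and not youtube-dl functions; the only third-party call assumed is `html5lib.parse`, and it is reached only if html5lib is actually installed.

```python
import importlib


def load_optional(module_name):
    """Import module_name on demand; return None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def parse_html(markup):
    """Hypothetical feature entry point that needs a non-stdlib package."""
    # The import happens here, not at module load time, so users who never
    # call this function never need the extra dependency.
    html5lib = load_optional("html5lib")
    if html5lib is None:
        raise RuntimeError("html5lib is required for this feature but is not installed")
    return html5lib.parse(markup)
```

With this pattern the rest of the program still starts and runs on a bare Python installation.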

Therefore, I do not think that html5lib is a good fit for youtube-dl. Luckily, I have yet to see a webpage that does not either serve contents as (correct) XML or is easily parseable with _html_search_regex. Sites that do impose order tend to use (or at least offer) JSON or XML, and sites where it's all a mess are a nice fit for regular expressions anyways.
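As an illustration of the regex approach on messy markup, a minimal stdlib-only sketch. `search_regex` is a simplified, hypothetical stand-in and not the actual `_html_search_regex` implementation; the sample page and patterns are invented, and `html.unescape` assumes Python 3.4 or later.

```python
import re
from html import unescape


def search_regex(pattern, markup, name, default=None):
    """Return the first capture group of pattern in markup, HTML-unescaped."""
    match = re.search(pattern, markup, re.DOTALL)
    if match is None:
        if default is not None:
            return default
        raise ValueError("Unable to extract %s" % name)
    return unescape(match.group(1)).strip()


# Invented tag soup: unquoted attribute, unclosed <h1>, <p> and <div>.
PAGE = ('<div class=player><h1>Some &amp; Title<p>unclosed'
        '<source src="http://example.com/a.mp3">')

title = search_regex(r'<h1>([^<]+)', PAGE, 'title')
media_url = search_regex(r'<source\s+src="([^"]+)"', PAGE, 'media URL')
print(title, media_url)
```

The patterns only look at the small fragment they need, so the broken structure around it does not matter.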

Additionally, html5lib seems to be dangerously close to being unmaintained.

However, if you prefer writing extractors with html5lib, feel free to do so and we'll translate the pull request into regexes - ugly as they are, they're simple and available in the stdlib. Of course I hope that at some time in the future, Python does include a (usable) HTML parser in the stdlib, and that a couple of decades later we'll be able to use that.

@phihag phihag closed this Apr 6, 2014