Add support for html5lib #2708
Comments
However, youtube-dl has made it a policy to not import anything not available in the (2.6) stdlib, except if it is necessary for a specific feature (and even then, it's of course only imported when the feature is requested). The reason for this is that not only can package installation be hard to do, complicated, and occasionally insecure, but that it is also a significant hassle on Windows, old RHEL releases and even IRIX. I believe this to be one of the reasons for the (relative) success of the project.

Therefore, I do not think that html5lib is a good fit for youtube-dl. Luckily, I have yet to see a webpage that does not either serve contents as (correct) XML or is easily parseable with

Additionally, html5lib seems to be dangerously close to being unmaintained.

However, if you prefer writing extractors with html5lib, feel free to do so and we'll translate the pull request into regexes - ugly as they are, they're simple and available in the stdlib.

Of course I hope that at some time in the future, Python does include a (usable) HTML parser in the stdlib, and that a couple of decades later we'll be able to use that.
Hi,
I stumbled upon problems with malformed HTML when trying to implement an extractor for cba.fro.at, and at least #2474 seems to suggest that I am not the only one.
So I would like to propose optional support for html5lib (or BeautifulSoup if that's preferred). It's a bit more liberal than the current parser and could easily be added. We could either replace the current implementation of InfoExtractor._download_xml() if html5lib's ETree proves to be 100% compatible using the test suite, or we could add an InfoExtractor._download_faulty_xml() method if not.

What do you think? I could implement that, but I wanted to hear some opinions before I get to work.
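
To make the proposal concrete, here is a minimal sketch of what such a fallback might look like. It is not youtube-dl code: parse_faulty_xml is a hypothetical standalone helper, and the sketch assumes html5lib is an optional third-party package that is imported only when the strict stdlib parser has already failed. It also uses ET.ParseError, which is not available on Python 2.6.

```python
import xml.etree.ElementTree as ET


def parse_faulty_xml(markup):
    """Hypothetical helper: strict stdlib parse first, html5lib as a fallback."""
    try:
        # Well-formed documents never touch the optional dependency.
        return ET.fromstring(markup)
    except ET.ParseError as strict_error:
        try:
            # Assumption: html5lib is optional, so import it lazily here.
            import html5lib
        except ImportError:
            # Without html5lib, report the original parse failure.
            raise strict_error
        # Build a plain ElementTree element without XHTML namespaces.
        return html5lib.parse(
            markup, treebuilder="etree", namespaceHTMLElements=False)


root = parse_faulty_xml("<videos><video id='1'>clip</video></videos>")
print(root.find("video").get("id"))
```

With this shape, users who never hit malformed markup would not need html5lib installed at all. Note that the fallback returns an HTML document tree rooted at an html element, so an extractor would have to search it differently from a tree produced by the strict parser.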