YoutubeDL overrides HTMLParser.locatestarttagend with a regex that doesn't always work. #4081
Comments
Here is the default regex which works correctly:
Thanks a lot for the report! It will be fixed in the next version. If you find any other place where we override something from the stdlib, please report it and we'll try to remove it.
This is perhaps not a typical use case, but still an issue.
I've been experimenting with embedding youtube-dl in an existing Python application and it's mostly working great. However, I noticed an issue related to HTML parsing.
My application also parses HTML, and I noticed that it was getting incorrect results after importing youtube-dl. It turns out the issue is with this regex:
https://github.com/rg3/youtube-dl/blob/ecc0c5ee01f0e5bdd6af0c32cb5b4adcb2a2f78c/youtube_dl/utils.py#L155
This overrides the regex used by every HTML parser in the process, not just the ones youtube-dl creates. Perhaps it should only be set for old versions of Python? (I'm not sure how old.) I am using Python 2.7.6 and have not had problems with the default regex.
(It looks like the latest Python 3.4 uses the same regex as well.)
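For anyone who wants to check whether an import has replaced a stdlib regex like this, here is a rough sketch. It snapshots the compiled patterns on a module before and after the import and lists the ones that differ. The helper names `snapshot_patterns` and `changed_patterns` are made up for illustration, and the demo uses the Python 3 module name `html.parser` (on Python 2 the module is `HTMLParser`).

```python
import re

# Type of a compiled regex, obtained in a way that works across Python 3 versions.
_PATTERN_TYPE = type(re.compile(""))


def snapshot_patterns(module):
    """Return {attribute name: pattern string} for each compiled regex on module."""
    return {
        name: value.pattern
        for name, value in vars(module).items()
        if isinstance(value, _PATTERN_TYPE)
    }


def changed_patterns(before, after):
    """Return the sorted names whose pattern was added or replaced."""
    return sorted(
        name for name, pattern in after.items() if before.get(name) != pattern
    )


if __name__ == "__main__":
    import html.parser  # Python 3 name; assumption, the report used Python 2.7.6

    before = snapshot_patterns(html.parser)
    # Import the suspect library here, for example: import youtube_dl
    after = snapshot_patterns(html.parser)
    print(changed_patterns(before, after))
```

If nothing replaced a pattern between the two snapshots, the printed list is empty. This only detects module-level regex attributes, which is where the override in this report lives.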
The following is an example where this custom regex doesn't work:
The output is as follows:
As you can see, the <img> tag is no longer parsed correctly due to the regex that youtube-dl sets.
I can work around this for now, but it would be nice to have this fixed, as it can affect parsing in other cases and should be a simple fix (such as using the regex from Python 2.7.6 and setting it only if the default is different).
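As a possible shape for the workaround, here is a minimal sketch of a context manager that swaps a module attribute only for the duration of a block and then puts the original back. `temporary_attr` is a hypothetical helper, not something youtube-dl or the stdlib provides, and because it mutates shared module state it is not safe if other threads parse HTML at the same time.

```python
import contextlib

# Sentinel so we can tell "attribute did not exist" apart from "attribute was None".
_MISSING = object()


@contextlib.contextmanager
def temporary_attr(target, name, value):
    """Set target.<name> to value inside the with-block, then restore it."""
    original = getattr(target, name, _MISSING)
    setattr(target, name, value)
    try:
        yield
    finally:
        if original is _MISSING:
            # The attribute was not there before, so remove it again.
            delattr(target, name)
        else:
            setattr(target, name, original)


# Hypothetical usage on Python 2, assuming `saved_default` was captured from
# HTMLParser.locatestarttagend before youtube-dl was imported:
#
#     with temporary_attr(HTMLParser, "locatestarttagend", saved_default):
#         my_parser.feed(page_source)
```

The same helper could be used the other way around, applying the custom regex only while youtube-dl parses a page, which would keep the override from leaking into the rest of the application.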