YoutubeDL overrides HTMLParser.locatestarttagend with a regex that doesn't always work. #4081

Closed
ikreymer opened this issue Nov 1, 2014 · 2 comments

@ikreymer ikreymer commented Nov 1, 2014

This is perhaps not a typical use case, but still an issue.

I've been experimenting with embedding youtube-dl in an existing Python application, and it's mostly working great. However, I noticed an issue related to HTML parsing.
My application also parses HTML, and it started getting incorrect results after importing youtube_dl. It turns out the issue is with this regex:

https://github.com/rg3/youtube-dl/blob/ecc0c5ee01f0e5bdd6af0c32cb5b4adcb2a2f78c/youtube_dl/utils.py#L155

This overrides the regex used by every HTMLParser instance in the process. Perhaps it should only be set for old versions of Python? (Not sure how old.) I am using Python 2.7.6 and have not had problems with the default regex.
(It looks like the latest Python 3.4 uses the same regex as well.)

The following is an example where this custom regex doesn't work:

from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Start tag:", tag
        for attr in attrs:
            print "     attr:", attr
    def handle_endtag(self, tag):
        print "End tag  :", tag
    def handle_data(self, data):
        print "Data     :", data
    def handle_comment(self, data):
        print "Comment  :", data
    def handle_entityref(self, name):
        c = unichr(name2codepoint[name])
        print "Named ent:", c
    def handle_charref(self, name):
        if name.startswith('x'):
            c = unichr(int(name[1:], 16))
        else:
            c = unichr(int(name))
        print "Num ent  :", c
    def handle_decl(self, data):
        print "Decl     :", data

print('Before YoutubeDL import')
print('')

parser = MyHTMLParser()
parser.feed('<a href="foo" ><img src="bar" / ></a>')
parser.close()

from youtube_dl import YoutubeDL

print('')
print('After YoutubeDL import')
print('')

parser = MyHTMLParser()
parser.feed('<a href="foo" ><img src="bar" / ></a>')
parser.close()

The output is as follows:

Before YoutubeDL import

Start tag: a
     attr: ('href', 'foo')
Start tag: img
     attr: ('src', 'bar')
End tag  : a

After YoutubeDL import

Start tag: a
     attr: ('href', 'foo')
Data     : <img src="bar" / >
End tag  : a

As you can see, the <img> tag is no longer parsed correctly due to the regex that youtube-dl sets.
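The before/after comparison above can also be automated rather than read off by eye. The following is a minimal sketch, not part of the original report: it uses Python 3's `html.parser` (the report used the Python 2 `HTMLParser` module), and the names `EventRecorder` and `parse_events` are hypothetical. It records parser callbacks as a list so that two runs can be compared with `==`.

```python
from html.parser import HTMLParser

SAMPLE = '<a href="foo" ><img src="bar" / ></a>'


class EventRecorder(HTMLParser):
    """Collect parser callbacks as tuples instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Parse markup with a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# Take a baseline before importing any third-party library.
baseline = parse_events(SAMPLE)

# After the import under suspicion, the same call should give the same list:
#     import some_library   # hypothetical placeholder
#     assert parse_events(SAMPLE) == baseline
```

If the second comparison fails, something imported in between has changed how the parser tokenizes tags.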

I can work around this for now, but it would be nice to have this fixed, as it can affect parsing in other cases and should be a simple fix (such as using the regex from Python 2.7.6 and setting it only if it differs).
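The report does not say what the workaround is. One possibility, sketched here under that assumption, is to snapshot the module attribute before the import and put it back afterwards. The helper `preserve_attr` is hypothetical, and the sketch uses a stand-in module object so that it touches nothing in the real standard library.

```python
import contextlib
import types


@contextlib.contextmanager
def preserve_attr(module, name):
    """Snapshot module.<name> and restore it when the block exits."""
    saved = getattr(module, name)
    try:
        yield
    finally:
        setattr(module, name, saved)


# Stand-in for a stdlib module with a module-level pattern.
stdlib_like = types.ModuleType("stdlib_like")
stdlib_like.pattern = "default"


def import_with_side_effect():
    """Stand-in for an import that overrides the pattern as a side effect."""
    stdlib_like.pattern = "overridden"


with preserve_attr(stdlib_like, "pattern"):
    import_with_side_effect()

restored = stdlib_like.pattern  # back to the value saved before the block
```

In the setting of this report, the module would be `HTMLParser`, the attribute `locatestarttagend`, and the body of the `with` block the `from youtube_dl import YoutubeDL` line. Note the trade-off: any code in the imported library that relies on its own override will see the default again once the attribute is restored.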

@ikreymer ikreymer commented Nov 1, 2014

Here is the default regex, which works correctly:
https://hg.python.org/cpython/file/2.7/Lib/HTMLParser.py#l37

@jaimeMF jaimeMF closed this in 4f195f5 Nov 2, 2014

@jaimeMF jaimeMF commented Nov 2, 2014

Thanks a lot for the report! It will be fixed in the next version.
It will only be overridden in Python 2.6 (we need it for extracting the YouTube descriptions).

If you find any other place where we override something from the stdlib, please report it and we'll try to remove it.
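A version-gated override of the kind described above might look roughly like the sketch below. This is not the code from commit 4f195f5, which is not reproduced in this thread. The helper `override_if_old`, its `last_affected` parameter, and both regexes are illustrative assumptions, and a stand-in module is used instead of the real parser module.

```python
import re
import sys
import types


def override_if_old(module, attr, replacement, last_affected=(2, 6)):
    """Set module.<attr> only on interpreters at or below last_affected.

    Returns True when the override was applied, False otherwise.
    """
    if sys.version_info[:2] <= last_affected:
        setattr(module, attr, replacement)
        return True
    return False


# Stand-in for the parser module; the patterns are placeholders, not the
# regexes used by the standard library or by youtube-dl.
fake_parser_module = types.ModuleType("fake_parser_module")
fake_parser_module.locatestarttagend = re.compile(r"<[a-zA-Z][^>]*")
original = fake_parser_module.locatestarttagend

applied = override_if_old(
    fake_parser_module, "locatestarttagend", re.compile(r"<\w+")
)
```

On any interpreter newer than 2.6, `applied` is `False` and the module attribute is left untouched, so other code in the same process keeps the default parsing behaviour.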
