HTMLParser incorrectly decodes string as HTML entities #11798

Closed
einstein95 opened this issue Jan 21, 2017 · 3 comments

einstein95 (Contributor) commented Jan 21, 2017

  • I've verified and I assure that I'm running youtube-dl 2017.01.18

Before submitting an issue make sure you have:

  • At least skimmed through README and most notably FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones

Description of your issue, suggested solution and other information

Because youtube-dl relies on HTMLParser, any site such as PornFlip (see #11795) whose mpd manifest URL contains the string &sectime has it incorrectly decoded to §ime. I don't know why HTMLParser does this, since the HTML encoding of § is &sect; (note the semicolon, as with other HTML-encoded symbols). Because of this, I had to extract the MP4 links rather than simply calling _parse_html5_media_entries.
I wanted to let the developers know in case this or a similar problem affects any current or future developer.

dstftw (Collaborator) commented Jan 21, 2017

HTML5 parsers decode certain legacy named entities, &sect among them, even without a trailing semicolon. So this behavior is expected.
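
For anyone who wants to reproduce this, here is a minimal sketch using Python 3's standard `html.unescape`. It is an illustration only, not necessarily the exact code path youtube-dl takes. The URL is hypothetical; only the `&sectime` parameter comes from the report above.

```python
import html

# Hypothetical manifest URL; only the "&sectime" parameter is taken from the report.
raw_url = "http://example.com/manifest.mpd?id=1&sectime=30"

# "&sect" is a legacy named entity that is recognised without a semicolon,
# so the longest matching prefix of "&sectime" is replaced by the section sign.
decoded = html.unescape(raw_url)
print(decoded)

# If the page had escaped the ampersand as "&amp;", decoding would restore
# the intended query string instead.
escaped_url = "http://example.com/manifest.mpd?id=1&amp;sectime=30"
restored = html.unescape(escaped_url)
print(restored)
```

The first result contains `§ime=30` in place of `&sectime=30`, while the second matches the original URL.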

einstein95 (Contributor, Author) commented Jan 21, 2017

Albeit very stupid

yan12125 (Collaborator) commented Jan 21, 2017

I guess it's PornFlip that provides broken HTML. fix_xml_ampersands can be used in such cases. If you believe there's a bug in HTMLParser, report it to http://bugs.python.org/ instead.
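
As a rough sketch of that kind of workaround: escape every bare ampersand before the text is entity-decoded. The helper name `escape_bare_ampersands` and its regular expression are assumptions made for illustration; they are not copied from youtube-dl's `fix_xml_ampersands` and may differ from it in detail.

```python
import html
import re

# Matches "&" unless it already starts one of the five XML entities
# or a numeric character reference (assumed pattern, for illustration).
_BARE_AMP = re.compile(
    r'&(?!amp;|lt;|gt;|apos;|quot;|#x[0-9a-fA-F]{1,6};|#[0-9]{1,7};)'
)


def escape_bare_ampersands(text):
    """Hypothetical helper: rewrite bare '&' as '&amp;'."""
    return _BARE_AMP.sub('&amp;', text)


# Hypothetical query string mixing a bare "&" with an already escaped "&amp;".
markup_url = "a=1&sectime=30&amp;b=2"

fixed = escape_bare_ampersands(markup_url)
roundtrip = html.unescape(fixed)
print(fixed)
print(roundtrip)
```

With the bare ampersand escaped first, decoding yields `&sectime=30` intact rather than `§ime=30`. The trade-off is that any semicolon-less entity the page actually intended is left undecoded.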

yan12125 closed this Jan 21, 2017