utils.clean_html produces different output on Python 2.x and 3.x #12903
Comments
That's because in Python 3 unicode patterns are used by default. You can force Unicode on Python 2 or ASCII matching on Python 3.
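A minimal sketch of the difference described above, written for Python 3 (where `re.ASCII` exists). The variable names are illustrative only and not part of youtube-dl; on Python 2 one would instead pass `re.UNICODE` with a unicode pattern and string to get the Python 3 behaviour.

```python
import re

NBSP = u"\xa0"  # non-breaking space

# Python 3 default for str patterns: Unicode matching, so \s matches NBSP.
default_match = re.match(r"\s", NBSP) is not None

# re.ASCII restricts \s to [ \t\n\r\f\v], like the Python 2 default.
ascii_match = re.match(r"\s", NBSP, flags=re.ASCII) is not None

# re.UNICODE is redundant for str patterns on Python 3; on Python 2 it is
# the flag that makes \s match NBSP for unicode patterns and strings.
unicode_match = re.match(r"\s", NBSP, flags=re.UNICODE) is not None

print(default_match, ascii_match, unicode_match)
```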
Now here comes the next question: which one should be used? The HTML5 standard [3] is closer to the ASCII way, so
References:
The White_Space characters are those that have the Unicode property "White_Space" in the Unicode and inside that file:
so I guess
In https://www.w3.org/TR/html5/syntax.html#start-tags, a start tag in HTML uses "space characters" rather than "White_Space characters". Anyway, many websites don't obey HTML standards, so it's OK as long as the result is consistent on Python 2 and 3.
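One way to get a result that is consistent on both Python versions is to avoid `\s` altogether and spell out the HTML5 "space characters" (space, tab, LF, FF, CR) as an explicit character class. The sketch below is only an illustration: `HTML_SPACE`, `BR_RE`, and `replace_br` are hypothetical names, and the `<br>` pattern is an assumption, not the actual `clean_html` code.

```python
import re

# HTML5 "space characters": U+0020, U+0009, U+000A, U+000C, U+000D.
HTML_SPACE = r"[ \t\n\f\r]"

# Hypothetical <br> pattern built from the explicit class instead of \s,
# so it behaves the same regardless of the Unicode/ASCII default.
BR_RE = re.compile(r"%(s)s*<%(s)s*br%(s)s*/?%(s)s*>%(s)s*" % {"s": HTML_SPACE})


def replace_br(html):
    """Replace <br> tags and surrounding HTML space characters with a newline."""
    return BR_RE.sub("\n", html)


sample = u"a\xa0<br/>\xa0b"

# Explicit class: the non-breaking spaces are left untouched.
explicit_result = replace_br(sample)

# \s with Python 3 defaults: the non-breaking spaces are swallowed.
unicode_result = re.sub(r"\s*<\s*br\s*/?\s*>\s*", "\n", sample)

print(repr(explicit_result), repr(unicode_result))
```

Whether the non-breaking spaces should be kept or stripped is exactly the open question in this thread; the explicit class only guarantees that the answer does not depend on the interpreter version.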
Also:
Please follow the guide below
Put an x into all the boxes [ ] relevant to your issue (like that [x])
Make sure you are using the latest version: run `youtube-dl --version` and ensure your version is 2017.04.26. If it's not, read this FAQ entry and update. Issues with outdated version will be rejected.
Before submitting an issue make sure you have:
What is the purpose of your issue?
In Python 3.x, `\s` matches the non-breaking space (`\xa0`); in Python 2.x it doesn't. This causes different output and test failures, for example `test_ArchiveOrg_1`:
Python 3.6:
Python 2.7:
Original from `clean_html`:
Solution 1:
may break some other tests
Solution 2a:
or 2b:
I'm not really sure which one of them is correct, because both of them work.
Which solution is better?