utils.clean_html produces different output on Python 2.x and 3.x #12903

Closed
Tithen-Firion opened this issue Apr 28, 2017 · 4 comments
@Tithen-Firion (Contributor) commented Apr 28, 2017

I've verified and I assure that I'm running youtube-dl 2017.04.26

In Python 3.x \s matches a non-breaking space (\xa0); in Python 2.x it doesn't. This produces different output and makes tests fail, for example test_ArchiveOrg_1:

Python 3.6:

F:\GitHub\youtube-dl>python test\test_download.py TestDownload.test_ArchiveOrg_1
[archive.org] Cops1922: Downloading webpage
[archive.org] Cops1922: Downloading JSON metadata
[info] Writing video description metadata as JSON to: test_ArchiveOrg_1_Cops1922.info.json
[debug] Invoking downloader on 'https://archive.org/download/Cops1922/Cops-v2.mp4'
[download] Destination: test_ArchiveOrg_1_Cops1922.mp4
[download] 100% of 10.00KiB in 00:00
.
----------------------------------------------------------------------
Ran 1 test in 4.415s

OK

Python 2.7:

F:\GitHub\youtube-dl>python2 test\test_download.py TestDownload.test_ArchiveOrg_1
[archive.org] Cops1922: Downloading webpage
[archive.org] Cops1922: Downloading JSON metadata
[info] Writing video description metadata as JSON to: test_ArchiveOrg_1_Cops1922.info.json
[debug] Invoking downloader on u'https://archive.org/download/Cops1922/Cops-v2.mp4'
[download] Destination: test_ArchiveOrg_1_Cops1922.mp4
[download] 100% of 10.00KiB in 00:00
F
======================================================================
FAIL: test_ArchiveOrg_1 (__main__.TestDownload):
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test\test_download.py", line 210, in test_template
    expect_info_dict(self, tc_res_dict, tc.get('info_dict', {}))
  File "F:\GitHub\youtube-dl\test\helper.py", line 177, in expect_info_dict
    expect_dict(self, got_dict, expected_dict)
  File "F:\GitHub\youtube-dl\test\helper.py", line 173, in expect_dict
    expect_value(self, got, expected, info_field)
  File "F:\GitHub\youtube-dl\test\helper.py", line 167, in expect_value
    'Invalid value for field %s, expected %r, got %r' % (field, expected, got))
AssertionError: Invalid value for field description, expected u'md5:89e7c77bf5d965dd5c0372cfb49470f6', got u'md5:186aee7106d70bfa6dff7ea443855c7c'

----------------------------------------------------------------------
Ran 1 test in 4.190s

FAILED (failures=1)

Original from clean_html:

    html = re.sub(r'\s*<\s*br\s*/?\s*>\s*', '\n', html)
    html = re.sub(r'<\s*/\s*p\s*>\s*<\s*p[^>]*>', '\n', html)

Solution 1:

    html = html.replace('\xa0', ' ')
    # no change below
    html = re.sub(r'\s*<\s*br\s*/?\s*>\s*', '\n', html)
    html = re.sub(r'<\s*/\s*p\s*>\s*<\s*p[^>]*>', '\n', html)

This may break some other tests.

Solution 2a:

    html = re.sub(r'[\s\xa0]*<\s*br\s*/?\s*>[\s\xa0]*', '\n', html)
    html = re.sub(r'<\s*/\s*p\s*>[\s\xa0]*<\s*p[^>]*>', '\n', html)

Solution 2b:

    html = re.sub('[\\s\xa0]*<\\s*br\\s*/?\\s*>[\\s\xa0]*', '\n', html)
    html = re.sub('<\\s*/\\s*p\\s*>[\\s\xa0]*<\\s*p[^>]*>', '\n', html)

I'm not sure which of the two is correct, because both of them work.

Which solution is better?
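
For readers comparing 2a and 2b: a minimal Python 3 sketch of why both behave the same. In the raw string the re module itself decodes the \xa0 escape, while in the plain string Python decodes it before re ever sees the pattern. The variable names are only illustrative, not anything from youtube-dl.

```python
import re

NBSP = u'\xa0'

# 2a style: raw string, so the re module receives backslash-x-a-0 and decodes it.
raw_pattern = r'[\s\xa0]'
# 2b style: plain string, so Python turns \xa0 into one character up front.
plain_pattern = u'[\\s\xa0]'

# The two pattern strings differ, but they describe the same character class.
same_text = raw_pattern == plain_pattern
raw_hit = re.match(raw_pattern, NBSP) is not None
plain_hit = re.match(plain_pattern, NBSP) is not None

print(same_text, raw_hit, plain_hit)
```

The raw-string form is usually easier to read, since the backslashes do not need doubling.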

@yan12125 (Collaborator) commented Apr 28, 2017

In Python 3.x \s matches non-breaking space (\xa0), in Python 2.x it doesn't.

That's because str patterns in Python 3 use Unicode matching by default. You can force Unicode matching on Python 2 with (?u), or ASCII matching on Python 3 with (?a).

$ python3 -c 'import re; print(re.match("(?a)\s", "\xa0"))'
None
$ python2 -c 'import re; print(re.match("(?u)\s", "\xa0"))'
<_sre.SRE_Match object at 0x7f325b186578>
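
The same check can also be written with a flag constant instead of an inline flag. This is a small sketch for Python 3 only (re.ASCII does not exist on Python 2); the variable names are illustrative.

```python
import re

NBSP = u'\xa0'

# Default str patterns on Python 3 are Unicode-aware, so \s matches NBSP.
unicode_match = re.match(r'\s', NBSP)

# re.ASCII is the flag form of (?a); \s is then limited to [ \t\n\r\f\v].
ascii_match = re.match(r'\s', NBSP, re.ASCII)

print(unicode_match is not None)
print(ascii_match is not None)
```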

Now here comes the next question: which one should be used? The HTML5 standard [3] is closer to the ASCII way, so (?a) is more reasonable.

References:
[1] Unicode spaces: https://github.com/python/cpython/blob/master/Objects/unicodetype_db.h#L5741
[2] ASCII spaces: https://github.com/python/cpython/blob/master/Python/pyctype.c#L5
[3] https://www.w3.org/TR/html5/infrastructure.html#space-character

@Tithen-Firion (Contributor, Author) commented Apr 28, 2017

\s is supposed to match whitespace characters, not just spaces. And as [3] says:

The White_Space characters are those that have the Unicode property "White_Space" in the Unicode PropList.txt data file.

and inside that file:

00A0          ; White_Space # Zs       NO-BREAK SPACE

so I guess (?u)?

@yan12125 (Collaborator) commented Apr 28, 2017

In https://www.w3.org/TR/html5/syntax.html#start-tags, a start tag in HTML uses "space characters" rather than "White_Space characters". Anyway, many websites don't obey HTML standards, so it's OK as long as the result is consistent on Python 2 and 3.
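
One way to get the "space characters" behaviour identically on both Python versions, without relying on (?a), is to spell out the class explicitly. This is only a sketch: SP, BR_RE and replace_breaks are hypothetical names, not part of youtube-dl, and only the <br> substitution is shown.

```python
import re

# HTML5 "space characters": space, tab, LF, FF, CR. NBSP is deliberately absent.
SP = r'[ \t\n\f\r]*'

# Same shape as the <br> pattern in clean_html, with \s* swapped for SP.
BR_RE = re.compile(SP + r'<' + SP + r'br' + SP + r'/?' + SP + r'>' + SP)


def replace_breaks(html):
    """Replace <br> tags and surrounding HTML space characters with a newline."""
    return BR_RE.sub(u'\n', html)


print(repr(replace_breaks(u'a \t<br />\n b')))
print(repr(replace_breaks(u'a\xa0<br>\xa0b')))
```

With this class a non-breaking space next to a tag is left in the output, which is the opposite of what the (?u) approach does.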

@Tithen-Firion (Contributor, Author) commented Apr 28, 2017

Also: (?u) can be used in Python 3, but (?a) can't be used in Python 2.
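
A sketch of what the (?u) route could look like, applied to the two lines quoted from clean_html. The wrapper function normalize_breaks is hypothetical and exists only to make the example self-contained; it assumes the input is a text (unicode) string.

```python
import re


def normalize_breaks(html):
    """Apply the two clean_html substitutions with Unicode-aware \\s."""
    # (?u) makes \s match Unicode whitespace, including \xa0, on Python 2;
    # on Python 3 it is already the default for str patterns.
    html = re.sub(r'(?u)\s*<\s*br\s*/?\s*>\s*', u'\n', html)
    html = re.sub(r'(?u)<\s*/\s*p\s*>\s*<\s*p[^>]*>', u'\n', html)
    return html


print(repr(normalize_breaks(u'a\xa0<br>\xa0b')))
print(repr(normalize_breaks(u'<p>x</p>\xa0<p>y</p>')))
```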
