Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reader for Unicode code points over 0xFFFF #351

Merged
merged 1 commit into from Dec 8, 2019
Merged

Fix reader for Unicode code points over 0xFFFF #351

merged 1 commit into from Dec 8, 2019

Conversation

@anishathalye
Copy link
Contributor

@anishathalye anishathalye commented Nov 20, 2019

Fixes #350


This patch fixes the handling of inputs with Unicode code points over 0xFFFF when running on a Python 2 that does not have UCS-4 support (which certain distributions still ship, e.g. macOS).

When Python is compiled without UCS-4 support, it uses UCS-2. In this situation, non-BMP Unicode characters, which have code points over 0xFFFF, are represented as surrogate pairs. For example, if we take u'\U0001f3d4', it will be represented as the surrogate pair u'\ud83c\udfd4'. This can be seen by running, for example:

[i for i in u'\U0001f3d4']

In PyYAML, the reader uses a function check_printable to validate inputs, making sure that they only contain printable characters. Prior to this patch, on UCS-2 builds, it incorrectly identified surrogate pairs as non-printable.

It would be fairly natural to write a regular expression that captures strings that contain only printable characters, as opposed to non-printable characters (as identified by the old code, so not excluding surrogate pairs):

PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')

Adding support for surrogate pairs to this would be straightforward, adding the option of having a surrogate high followed by a surrogate low ([\uD800-\uDBFF][\uDC00-\uDFFF]):

PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')

Then, this regex could be used as follows:

def check_printable(self, data):
    if not self.PRINTABLE.match(data):
        raise ReaderError(...)

However, matching printable strings, rather than searching for non-printable characters as the code currently does, would have the disadvantage of not identifying the culprit character (we wouldn't get the position and the actual non-printable character from a lack of a regex match).

Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate pairs. We do this by removing surrogate pairs from the existing character set and adding the following options for illegal uses of surrogate code points:

  • Surrogate low that doesn't follow a surrogate high (either a surrogate low at the start of a string, or a surrogate low that follows a character that's not a surrogate high):

    (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]

  • Surrogate high that isn't followed by a surrogate low (either a surrogate high at the end of a string, or a surrogate high that is followed by a character that's not a surrogate low):

    [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)

The behavior of this modified regex should match the one that is used when Python is built with UCS-4 support.

This patch fixes the handling of inputs with Unicode code points over
0xFFFF when running on a Python 2 that does not have UCS-4 support
(which certain distributions still ship, e.g. macOS).

When Python is compiled without UCS-4 support, it uses UCS-2. In this
situation, non-BMP Unicode characters, which have code points over
0xFFFF, are represented as surrogate pairs. For example, if we take
u'\U0001f3d4', it will be represented as the surrogate pair
u'\ud83c\udfd4'. This can be seen by running, for example:

    [i for i in u'\U0001f3d4']

In PyYAML, the reader uses a function `check_printable` to validate
inputs, making sure that they only contain printable characters. Prior
to this patch, on UCS-2 builds, it incorrectly identified surrogate
pairs as non-printable.

It would be fairly natural to write a regular expression that captures
strings that contain only *printable* characters, as opposed to
*non-printable* characters (as identified by the old code, so not
excluding surrogate pairs):

    PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')

Adding support for surrogate pairs to this would be straightforward,
adding the option of having a surrogate high followed by a surrogate low
(`[\uD800-\uDBFF][\uDC00-\uDFFF]`):

    PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')

Then, this regex could be used as follows:

    def check_printable(self, data):
        if not self.PRINTABLE.match(data):
            raise ReaderError(...)

However, matching printable strings, rather than searching for
non-printable characters as the code currently does, would have the
disadvantage of not identifying the culprit character (we wouldn't get
the position and the actual non-printable character from a lack of a
regex match).

Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate
pairs. We do this by removing surrogate pairs from the existing
character set and adding the following options for illegal uses of
surrogate code points:

- Surrogate low that doesn't follow a surrogate high (either a surrogate
  low at the start of a string, or a surrogate low that follows a
  character that's not a surrogate high):

    (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]

- Surrogate high that isn't followed by a surrogate low (either a
  surrogate high at the end of a string, or a surrogate high that is
  followed by a character that's not a surrogate low):

    [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)

The behavior of this modified regex should match the one that is used
when Python is built with UCS-4 support.
@perlpunk
Copy link
Member

@perlpunk perlpunk commented Nov 20, 2019

Thanks!
I think I can start with enabling the tests we explicitly disabled for this and see if Appveyor tests suceed.
But I will be quite busy until the rest of the week.

@anishathalye
Copy link
Contributor Author

@anishathalye anishathalye commented Nov 20, 2019

Oh yeah, I was going to do that and see what happened, but I forgot. Enabling those tests as part of this PR would make sense.

The tests can be enabled as follows:

diff --git i/tests/lib/test_appliance.py w/tests/lib/test_appliance.py
index 5ec4575..7822979 100644
--- i/tests/lib/test_appliance.py
+++ w/tests/lib/test_appliance.py
@@ -27,8 +27,6 @@ def find_test_filenames(directory):
             base, ext = os.path.splitext(filename)
             if base.endswith('-py3'):
                 continue
-            if not has_ucs4 and base.find('-ucs4-') > -1:
-                continue
             filenames.setdefault(base, []).append(ext)
     filenames = filenames.items()
     filenames.sort()

(it might make sense to rename those *-ucs4-* files too.)

For comparison, here's what happens with the UCS-4 tests enabled for a non-UCS-4 Python 2 build, prior to this patch:

running test
running build_py
running build_ext
building '_yaml' extension
clang -fno-strict-aliasing -fno-common -dynamic -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c ext/_yaml.c -o build/temp.macosx-10.14-x86_64-2.7/ext/_yaml.o
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................E.................................................................................................................................................................EE.......EE...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................EE.......
===========================================================================
test_representer_types(tests/data/emitting-unacceptable-unicode-character-bug-ucs4-py2.code): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_representer.py", line 15, in test_representer_types
    native2 = yaml.load(output, Loader=test_constructor.MyLoader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 81, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xdd00: special characters are not allowed
  in "<string>", position 0
---------------------------------------------------------------------------
tests/data/emitting-unacceptable-unicode-character-bug-ucs4-py2.code:
u"\udd00"
===========================================================================
test_unicode_input(tests/data/emoticons-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_input_output.py", line 20, in test_unicode_input
    output = yaml.full_load(_unicode_open(StringIO.StringIO(data.encode('utf-8')), 'utf-8'))
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 142, in full_load
    return load(stream, FullLoader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 87, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<file>", position 0
---------------------------------------------------------------------------
tests/data/emoticons-ucs4-.unicode:
😀😁😂😃😄😅😆😇
😈😉😊😋😌😍😎😏
😐😑😒😓😔😕😖😗
😘😙😚😛😜😝😞😟
😠😡😢😣😤😥😦😧
😨😩😪😫😬😭😮😯
😰😱😲😳😴😵😶😷
😸😹😺😻😼😽😾😿
🙀🙁🙂🙃🙄🙅🙆🙇
🙈🙉🙊🙋🙌🙍🙎🙏
===========================================================================
test_unicode_input(tests/data/emoticons2-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_input_output.py", line 20, in test_unicode_input
    output = yaml.full_load(_unicode_open(StringIO.StringIO(data.encode('utf-8')), 'utf-8'))
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 142, in full_load
    return load(stream, FullLoader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 87, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<file>", position 0
---------------------------------------------------------------------------
tests/data/emoticons2-ucs4-.unicode:
😀
===========================================================================
test_unicode_transfer(tests/data/emoticons-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_input_output.py", line 123, in test_unicode_transfer
    output1 = yaml.emit(yaml.parse(input), allow_unicode=True)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 209, in emit
    for event in events:
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 73, in parse
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 76, in __init__
    self.check_printable(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<unicode string>", position 0
---------------------------------------------------------------------------
tests/data/emoticons-ucs4-.unicode:
😀😁😂😃😄😅😆😇
😈😉😊😋😌😍😎😏
😐😑😒😓😔😕😖😗
😘😙😚😛😜😝😞😟
😠😡😢😣😤😥😦😧
😨😩😪😫😬😭😮😯
😰😱😲😳😴😵😶😷
😸😹😺😻😼😽😾😿
🙀🙁🙂🙃🙄🙅🙆🙇
🙈🙉🙊🙋🙌🙍🙎🙏
===========================================================================
test_unicode_transfer(tests/data/emoticons2-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_input_output.py", line 123, in test_unicode_transfer
    output1 = yaml.emit(yaml.parse(input), allow_unicode=True)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 209, in emit
    for event in events:
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 73, in parse
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 76, in __init__
    self.check_printable(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<unicode string>", position 0
---------------------------------------------------------------------------
tests/data/emoticons2-ucs4-.unicode:
😀
===========================================================================
test_unicode_input_ext(tests/data/emoticons-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_yaml_ext.py", line 240, in wrapper
    function(*args, **kwds)
  File "tests/lib/test_input_output.py", line 20, in test_unicode_input
    output = yaml.full_load(_unicode_open(StringIO.StringIO(data.encode('utf-8')), 'utf-8'))
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 142, in full_load
    return load(stream, FullLoader)
  File "tests/lib/test_yaml_ext.py", line 30, in new_load
    return old_load(stream, Loader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 87, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<file>", position 0
---------------------------------------------------------------------------
tests/data/emoticons-ucs4-.unicode:
😀😁😂😃😄😅😆😇
😈😉😊😋😌😍😎😏
😐😑😒😓😔😕😖😗
😘😙😚😛😜😝😞😟
😠😡😢😣😤😥😦😧
😨😩😪😫😬😭😮😯
😰😱😲😳😴😵😶😷
😸😹😺😻😼😽😾😿
🙀🙁🙂🙃🙄🙅🙆🙇
🙈🙉🙊🙋🙌🙍🙎🙏
===========================================================================
test_unicode_input_ext(tests/data/emoticons2-ucs4-.unicode): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_yaml_ext.py", line 240, in wrapper
    function(*args, **kwds)
  File "tests/lib/test_input_output.py", line 20, in test_unicode_input
    output = yaml.full_load(_unicode_open(StringIO.StringIO(data.encode('utf-8')), 'utf-8'))
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 142, in full_load
    return load(stream, FullLoader)
  File "tests/lib/test_yaml_ext.py", line 30, in new_load
    return old_load(stream, Loader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 24, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 87, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xd83d: special characters are not allowed
  in "<file>", position 0
---------------------------------------------------------------------------
tests/data/emoticons2-ucs4-.unicode:
😀
===========================================================================
TESTS: 2590
ERRORS: 7

Here's what happens with this patch:

running test
running build_py
copying lib/yaml/reader.py -> build/lib.macosx-10.14-x86_64-2.7/yaml
running build_ext
building '_yaml' extension
clang -fno-strict-aliasing -fno-common -dynamic -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c ext/_yaml.c -o build/temp.macosx-10.14-x86_64-2.7/ext/_yaml.o
.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................E................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
===========================================================================
test_representer_types(tests/data/emitting-unacceptable-unicode-character-bug-ucs4-py2.code): ERROR
Traceback (most recent call last):
  File "tests/lib/test_appliance.py", line 67, in execute
    function(verbose=verbose, *filenames)
  File "tests/lib/test_representer.py", line 15, in test_representer_types
    native2 = yaml.load(output, Loader=test_constructor.MyLoader)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 81, in __init__
    self.determine_encoding()
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 137, in determine_encoding
    self.update(1)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 174, in update
    self.check_printable(data)
  File "build/lib.macosx-10.14-x86_64-2.7/yaml/reader.py", line 149, in check_printable
    'unicode', "special characters are not allowed")
ReaderError: unacceptable character #xdd00: special characters are not allowed
  in "<string>", position 0
---------------------------------------------------------------------------
tests/data/emitting-unacceptable-unicode-character-bug-ucs4-py2.code:
u"\udd00"
===========================================================================
TESTS: 2590
ERRORS: 1

The single test that's still failing involves encoding/decoding \udd00. I'm not sure what the expected behavior is here. Given that this is an invalid surrogate, shouldn't the serializer refuse to serialize it (or does the spec say it's undefined behavior, or something)?

@perlpunk
Copy link
Member

@perlpunk perlpunk commented Nov 25, 2019

We decided to get out a 5.2 as soon as possible and not add more PRs to it, but I hope we can do 5.3 soon after that and include this PR!

So far I haven't found out what the emitting-unacceptable-unicode-character-bug-ucs4-py2 test is doing. The commit log references a ticket but for the old ticket system...

@anishathalye
Copy link
Contributor Author

@anishathalye anishathalye commented Nov 26, 2019

Sounds good! Hopefully this can be included in 5.2.1, as it's more of a bugfix than a new feature?

Looks like that file was originally introduced in this commit: https://bitbucket.org/xi/pyyaml/commits/6c7c12bb8fe64828e400a5a6c0fb5dd9911392c3

It references this issue: https://bitbucket.org/xi/pyyaml/issues/11/

@perlpunk
Copy link
Member

@perlpunk perlpunk commented Nov 26, 2019

It references this issue: https://bitbucket.org/xi/pyyaml/issues/11/

Thanks for going through the commits, I hadn't saved the commit id after I searched for this :)

Actually it references this one, as bitbucket was introduced later:
https://web.archive.org/web/20100613112554/http://pyyaml.org/ticket/11

@anishathalye
Copy link
Contributor Author

@anishathalye anishathalye commented Nov 27, 2019

Ah that makes sense! The referenced issue on Bitbucket didn't really seem to match the commit

@perlpunk perlpunk added this to Backlog in 5.3 Release Dec 2, 2019
@perlpunk perlpunk changed the base branch from master to release/5.3 Dec 8, 2019
@perlpunk perlpunk merged commit 474dfc4 into yaml:release/5.3 Dec 8, 2019
2 checks passed
@perlpunk
Copy link
Member

@perlpunk perlpunk commented Dec 8, 2019

Thanks! merged to release/5.3
I will rename the tests (except the one we're not sure about)

@perlpunk perlpunk moved this from Backlog to Merged to release/5.3 in 5.3 Release Dec 8, 2019
perlpunk added a commit that referenced this issue Dec 8, 2019
perlpunk added a commit that referenced this issue Dec 8, 2019
They were disabled in d6cbff6

After #351 the tests are working again
perlpunk added a commit that referenced this issue Dec 20, 2019
This patch fixes the handling of inputs with Unicode code points over
0xFFFF when running on a Python 2 that does not have UCS-4 support
(which certain distributions still ship, e.g. macOS).

When Python is compiled without UCS-4 support, it uses UCS-2. In this
situation, non-BMP Unicode characters, which have code points over
0xFFFF, are represented as surrogate pairs. For example, if we take
u'\U0001f3d4', it will be represented as the surrogate pair
u'\ud83c\udfd4'. This can be seen by running, for example:

    [i for i in u'\U0001f3d4']

In PyYAML, the reader uses a function `check_printable` to validate
inputs, making sure that they only contain printable characters. Prior
to this patch, on UCS-2 builds, it incorrectly identified surrogate
pairs as non-printable.

It would be fairly natural to write a regular expression that captures
strings that contain only *printable* characters, as opposed to
*non-printable* characters (as identified by the old code, so not
excluding surrogate pairs):

    PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')

Adding support for surrogate pairs to this would be straightforward,
adding the option of having a surrogate high followed by a surrogate low
(`[\uD800-\uDBFF][\uDC00-\uDFFF]`):

    PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')

Then, this regex could be used as follows:

    def check_printable(self, data):
        if not self.PRINTABLE.match(data):
            raise ReaderError(...)

However, matching printable strings, rather than searching for
non-printable characters as the code currently does, would have the
disadvantage of not identifying the culprit character (we wouldn't get
the position and the actual non-printable character from a lack of a
regex match).

Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate
pairs. We do this by removing surrogate pairs from the existing
character set and adding the following options for illegal uses of
surrogate code points:

- Surrogate low that doesn't follow a surrogate high (either a surrogate
  low at the start of a string, or a surrogate low that follows a
  character that's not a surrogate high):

    (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]

- Surrogate high that isn't followed by a surrogate low (either a
  surrogate high at the end of a string, or a surrogate high that is
  followed by a character that's not a surrogate low):

    [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)

The behavior of this modified regex should match the one that is used
when Python is built with UCS-4 support.
perlpunk added a commit that referenced this issue Dec 20, 2019
They were disabled in d6cbff6

After #351 the tests are working again
@perlpunk
Copy link
Member

@perlpunk perlpunk commented Jan 6, 2020

asherf added a commit to asherf/pants that referenced this issue Apr 28, 2020
https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7

5.3.1 (2020-03-18)

* yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor

5.3 (2020-01-06)

* yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None`
* yaml/pyyaml#270 -- fix typos and stylistic nit
* yaml/pyyaml#309 -- Fix up small typo
* yaml/pyyaml#161 -- Fix handling of __slots__
* yaml/pyyaml#358 -- Allow calling add_multi_constructor with None
* yaml/pyyaml#285 -- Add use of safe_load() function in README
* yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF
* yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff
* yaml/pyyaml#359 -- Use full_load in yaml-highlight example
* yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython
* yaml/pyyaml#329 -- Fix for Python 3.10
* yaml/pyyaml#310 -- increase size of index, line, and column fields
* yaml/pyyaml#260 -- remove some unused imports
* yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such
* yaml/pyyaml#363 -- Add tests for timezone

5.2 (2019-12-02)
------------------

* Repair incompatibilities introduced with 5.1. The default Loader was changed,
  but several methods like add_constructor still used the old default
  yaml/pyyaml#279 -- A more flexible fix for custom tag constructors
  yaml/pyyaml#287 -- Change default loader for yaml.add_constructor
  yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver
* Make FullLoader safer by removing python/object/apply from the default FullLoader
  yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor
* Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff
  yaml/pyyaml#276 -- Fix logic for quoting special characters
* Other PRs:
  yaml/pyyaml#280 -- Update CHANGES for 5.1
asherf added a commit to asherf/pants that referenced this issue Apr 29, 2020
https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7

5.3.1 (2020-03-18)

* yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor

5.3 (2020-01-06)

* yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None`
* yaml/pyyaml#270 -- fix typos and stylistic nit
* yaml/pyyaml#309 -- Fix up small typo
* yaml/pyyaml#161 -- Fix handling of __slots__
* yaml/pyyaml#358 -- Allow calling add_multi_constructor with None
* yaml/pyyaml#285 -- Add use of safe_load() function in README
* yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF
* yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff
* yaml/pyyaml#359 -- Use full_load in yaml-highlight example
* yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython
* yaml/pyyaml#329 -- Fix for Python 3.10
* yaml/pyyaml#310 -- increase size of index, line, and column fields
* yaml/pyyaml#260 -- remove some unused imports
* yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such
* yaml/pyyaml#363 -- Add tests for timezone

5.2 (2019-12-02)
------------------

* Repair incompatibilities introduced with 5.1. The default Loader was changed,
  but several methods like add_constructor still used the old default
  yaml/pyyaml#279 -- A more flexible fix for custom tag constructors
  yaml/pyyaml#287 -- Change default loader for yaml.add_constructor
  yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver
* Make FullLoader safer by removing python/object/apply from the default FullLoader
  yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor
* Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff
  yaml/pyyaml#276 -- Fix logic for quoting special characters
* Other PRs:
  yaml/pyyaml#280 -- Update CHANGES for 5.1
Eric-Arellano pushed a commit to pantsbuild/pants that referenced this issue May 1, 2020
https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7

5.3.1 (2020-03-18)

* yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor

5.3 (2020-01-06)

* yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None`
* yaml/pyyaml#270 -- fix typos and stylistic nit
* yaml/pyyaml#309 -- Fix up small typo
* yaml/pyyaml#161 -- Fix handling of __slots__
* yaml/pyyaml#358 -- Allow calling add_multi_constructor with None
* yaml/pyyaml#285 -- Add use of safe_load() function in README
* yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF
* yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff
* yaml/pyyaml#359 -- Use full_load in yaml-highlight example
* yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython
* yaml/pyyaml#329 -- Fix for Python 3.10
* yaml/pyyaml#310 -- increase size of index, line, and column fields
* yaml/pyyaml#260 -- remove some unused imports
* yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such
* yaml/pyyaml#363 -- Add tests for timezone

5.2 (2019-12-02)
------------------

* Repair incompatibilities introduced with 5.1. The default Loader was changed,
  but several methods like add_constructor still used the old default
  yaml/pyyaml#279 -- A more flexible fix for custom tag constructors
  yaml/pyyaml#287 -- Change default loader for yaml.add_constructor
  yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver
* Make FullLoader safer by removing python/object/apply from the default FullLoader
  yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor
* Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff
  yaml/pyyaml#276 -- Fix logic for quoting special characters
* Other PRs:
  yaml/pyyaml#280 -- Update CHANGES for 5.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
5.3 Release
Merged to release/5.3
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants