Fix reader for Unicode code points over 0xFFFF #351
This patch fixes the handling of inputs with Unicode code points over 0xFFFF when running on a Python 2 that does not have UCS-4 support (which certain distributions still ship, e.g. macOS).

When Python is compiled without UCS-4 support, it uses UCS-2. In this situation, non-BMP Unicode characters, which have code points over 0xFFFF, are represented as surrogate pairs. For example, if we take u'\U0001f3d4', it will be represented as the surrogate pair u'\ud83c\udfd4'. This can be seen by running, for example:

    [i for i in u'\U0001f3d4']

In PyYAML, the reader uses a function `check_printable` to validate inputs, making sure that they only contain printable characters. Prior to this patch, on UCS-2 builds, it incorrectly identified surrogate pairs as non-printable.

It would be fairly natural to write a regular expression that captures strings that contain only *printable* characters, as opposed to *non-printable* characters (as identified by the old code, so not excluding surrogate pairs):

    PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')

Adding support for surrogate pairs to this would be straightforward, adding the option of having a surrogate high followed by a surrogate low (`[\uD800-\uDBFF][\uDC00-\uDFFF]`):

    PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')

Then, this regex could be used as follows:

    def check_printable(self, data):
        if not self.PRINTABLE.match(data):
            raise ReaderError(...)

However, matching printable strings, rather than searching for non-printable characters as the code currently does, would have the disadvantage of not identifying the culprit character (we wouldn't get the position and the actual non-printable character from a lack of a regex match).

Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate pairs. We do this by removing surrogate pairs from the existing character set and adding the following options for illegal uses of surrogate code points:

- Surrogate low that doesn't follow a surrogate high (either a surrogate low at the start of a string, or a surrogate low that follows a character that's not a surrogate high):

      (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]

- Surrogate high that isn't followed by a surrogate low (either a surrogate high at the end of a string, or a surrogate high that is followed by a character that's not a surrogate low):

      [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)

The behavior of this modified regex should match the one that is used when Python is built with UCS-4 support.
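The two illegal-surrogate alternatives described above can be tried out in isolation. The following is only a minimal sketch, not the code from this patch: the names `to_surrogate_pair`, `ILLEGAL_SURROGATE` and `first_illegal_surrogate` are hypothetical, and the pattern deliberately leaves out the other non-printable ranges. It also runs on Python 3, where a string may hold lone surrogates if they are written out explicitly.

```python
import re


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Hypothetical name: only the two illegal-surrogate alternatives from the
# description, without the rest of the non-printable character set.
ILLEGAL_SURROGATE = re.compile(
    u'(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]'   # low not preceded by a high
    u'|[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)'  # high not followed by a low
)


def first_illegal_surrogate(data):
    """Return (position, character) of the first misused surrogate, or None."""
    match = ILLEGAL_SURROGATE.search(data)
    if match is None:
        return None
    first = match.group()[0]
    if u'\uD800' <= first <= u'\uDBFF':
        # Second alternative: the match starts at the stray high surrogate.
        return match.start(), first
    # First alternative: the match may include the preceding character,
    # so the stray low surrogate is the last character of the match.
    return match.end() - 1, match.group()[-1]


if __name__ == '__main__':
    print([hex(unit) for unit in to_surrogate_pair(0x1F3D4)])
    print(first_illegal_surrogate(u'a\ud83c\udfd4b'))  # legal pair
    print(first_illegal_surrogate(u'a\udfd4'))         # stray low surrogate
```

One thing the sketch makes visible: when the low-surrogate alternative matches after a non-surrogate character, the match begins one character before the offending surrogate, so any position reported from `match.start()` alone would point at the preceding character.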
Thanks!
Oh yeah, I was going to do that and see what happened, but I forgot. Enabling those tests as part of this PR would make sense. The tests can be enabled as follows:

    diff --git i/tests/lib/test_appliance.py w/tests/lib/test_appliance.py
    index 5ec4575..7822979 100644
    --- i/tests/lib/test_appliance.py
    +++ w/tests/lib/test_appliance.py
    @@ -27,8 +27,6 @@ def find_test_filenames(directory):
                 base, ext = os.path.splitext(filename)
                 if base.endswith('-py3'):
                     continue
    -            if not has_ucs4 and base.find('-ucs4-') > -1:
    -                continue
                 filenames.setdefault(base, []).append(ext)
         filenames = filenames.items()
         filenames.sort()

(it might make sense to rename those

For comparison, here's what happens with the UCS-4 tests enabled for a non-UCS-4 Python 2 build, prior to this patch:
Here's what happens with this patch:
The single test that's still failing involves encoding/decoding

We decided to get out a 5.2 as soon as possible and not add more PRs to it, but I hope we can do 5.3 soon after that and include this PR! So far I haven't found out what the

Sounds good! Hopefully this can be included in 5.2.1, as it's more of a bugfix than a new feature? Looks like that file was originally introduced in this commit: https://bitbucket.org/xi/pyyaml/commits/6c7c12bb8fe64828e400a5a6c0fb5dd9911392c3 It references this issue: https://bitbucket.org/xi/pyyaml/issues/11/

Thanks for going through the commits, I hadn't saved the commit id after I searched for this :) Actually it references this one, as bitbucket was introduced later:

Ah that makes sense! The referenced issue on Bitbucket didn't really seem to match the commit

Thanks! merged to release/5.3
After #351 the tests are working again
released https://pypi.org/project/PyYAML/5.3/
https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7

5.3.1 (2020-03-18)

* yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor

5.3 (2020-01-06)

* yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None`
* yaml/pyyaml#270 -- fix typos and stylistic nit
* yaml/pyyaml#309 -- Fix up small typo
* yaml/pyyaml#161 -- Fix handling of `__slots__`
* yaml/pyyaml#358 -- Allow calling add_multi_constructor with None
* yaml/pyyaml#285 -- Add use of safe_load() function in README
* yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF
* yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff
* yaml/pyyaml#359 -- Use full_load in yaml-highlight example
* yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython
* yaml/pyyaml#329 -- Fix for Python 3.10
* yaml/pyyaml#310 -- increase size of index, line, and column fields
* yaml/pyyaml#260 -- remove some unused imports
* yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such
* yaml/pyyaml#363 -- Add tests for timezone

5.2 (2019-12-02)
------------------

* Repair incompatibilities introduced with 5.1. The default Loader was changed, but several methods like add_constructor still used the old default
  yaml/pyyaml#279 -- A more flexible fix for custom tag constructors
  yaml/pyyaml#287 -- Change default loader for yaml.add_constructor
  yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver
* Make FullLoader safer by removing python/object/apply from the default FullLoader
  yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor
* Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff
  yaml/pyyaml#276 -- Fix logic for quoting special characters
* Other PRs:
  yaml/pyyaml#280 -- Update CHANGES for 5.1
- Update from 3.13 to 6.0
- Update of rootfile
- Changelog

6.0 (2021-10-13)

* yaml/pyyaml#327 -- Change README format to Markdown
* yaml/pyyaml#483 -- Add a test for YAML 1.1 types
* yaml/pyyaml#497 -- fix float resolver to ignore `.` and `._`
* yaml/pyyaml#550 -- drop Python 2.7
* yaml/pyyaml#553 -- Fix spelling of “hexadecimal”
* yaml/pyyaml#556 -- fix representation of Enum subclasses
* yaml/pyyaml#557 -- fix libyaml extension compiler warnings
* yaml/pyyaml#560 -- fix ResourceWarning on leaked file descriptors
* yaml/pyyaml#561 -- always require `Loader` arg to `yaml.load()`
* yaml/pyyaml#564 -- remove remaining direct distutils usage

5.4.1 (2021-01-20)

* yaml/pyyaml#480 -- Fix stub compat with older pyyaml versions that may unwittingly load it

5.4 (2021-01-19)

* yaml/pyyaml#407 -- Build modernization, remove distutils, fix metadata, build wheels, CI to GHA
* yaml/pyyaml#472 -- Fix for CVE-2020-14343, moves arbitrary python tags to UnsafeLoader
* yaml/pyyaml#441 -- Fix memory leak in implicit resolver setup
* yaml/pyyaml#392 -- Fix py2 copy support for timezone objects
* yaml/pyyaml#378 -- Fix compatibility with Jython

5.3.1 (2020-03-18)

* yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor

5.3 (2020-01-06)

* yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None`
* yaml/pyyaml#270 -- Fix typos and stylistic nit
* yaml/pyyaml#309 -- Fix up small typo
* yaml/pyyaml#161 -- Fix handling of `__slots__`
* yaml/pyyaml#358 -- Allow calling add_multi_constructor with None
* yaml/pyyaml#285 -- Add use of safe_load() function in README
* yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF
* yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff
* yaml/pyyaml#359 -- Use full_load in yaml-highlight example
* yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython
* yaml/pyyaml#329 -- Fix for Python 3.10
* yaml/pyyaml#310 -- Increase size of index, line, and column fields
* yaml/pyyaml#260 -- Remove some unused imports
* yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such
* yaml/pyyaml#363 -- Add tests for timezone

5.2 (2019-12-02)

* Repair incompatibilities introduced with 5.1. The default Loader was changed, but several methods like add_constructor still used the old default
  yaml/pyyaml#279 -- A more flexible fix for custom tag constructors
  yaml/pyyaml#287 -- Change default loader for yaml.add_constructor
  yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver
* Make FullLoader safer by removing python/object/apply from the default FullLoader
  yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor
* Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff
  yaml/pyyaml#276 -- Fix logic for quoting special characters
* Other PRs:
  yaml/pyyaml#280 -- Update CHANGES for 5.1

5.1.2 (2019-07-30)

* Re-release of 5.1 with regenerated Cython sources to build properly for Python 3.8b2+

5.1.1 (2019-06-05)

* Re-release of 5.1 with regenerated Cython sources to build properly for Python 3.8b1

5.1 (2019-03-13)

* yaml/pyyaml#35 -- Some modernization of the test running
* yaml/pyyaml#42 -- Install tox in a virtualenv
* yaml/pyyaml#45 -- Allow colon in a plain scalar in a flow context
* yaml/pyyaml#48 -- Fix typos
* yaml/pyyaml#55 -- Improve RepresenterError creation
* yaml/pyyaml#59 -- Resolves #57, update readme issues link
* yaml/pyyaml#60 -- Document and test Python 3.6 support
* yaml/pyyaml#61 -- Use Travis CI built in pip cache support
* yaml/pyyaml#62 -- Remove tox workaround for Travis CI
* yaml/pyyaml#63 -- Adding support to Unicode characters over codepoint 0xffff
* yaml/pyyaml#75 -- add 3.12 changelog
* yaml/pyyaml#76 -- Fallback to Pure Python if Compilation fails
* yaml/pyyaml#84 -- Drop unsupported Python 3.3
* yaml/pyyaml#102 -- Include license file in the generated wheel package
* yaml/pyyaml#105 -- Removed Python 2.6 & 3.3 support
* yaml/pyyaml#111 -- Remove commented out Psyco code
* yaml/pyyaml#129 -- Remove call to `ord` in lib3 emitter code
* yaml/pyyaml#149 -- Test on Python 3.7-dev
* yaml/pyyaml#158 -- Support escaped slash in double quotes "\/"
* yaml/pyyaml#175 -- Updated link to pypi in release announcement
* yaml/pyyaml#181 -- Import Hashable from collections.abc
* yaml/pyyaml#194 -- Reverting yaml/pyyaml#74
* yaml/pyyaml#195 -- Build libyaml on travis
* yaml/pyyaml#196 -- Force cython when building sdist
* yaml/pyyaml#254 -- Allow to turn off sorting keys in Dumper (2)
* yaml/pyyaml#256 -- Make default_flow_style=False
* yaml/pyyaml#257 -- Deprecate yaml.load and add FullLoader and UnsafeLoader classes
* yaml/pyyaml#261 -- Skip certain unicode tests when maxunicode not > 0xffff
* yaml/pyyaml#263 -- Windows Appveyor build

Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
Reviewed-by: Peter Müller <peter.mueller@ipfire.org>
Fixes #350