Fix logic for quoting special characters #276

Merged (2 commits) on Mar 31, 2019


@perlpunk perlpunk commented Mar 17, 2019

Fixes #275

Currently some strings with special characters are dumped as plain strings, on systems where sys.maxunicode <= 0xffff (typically Python 2.7 on Win/Mac) and when using allow_unicode=True. That doesn't roundtrip.

>>> import yaml
>>> string = "\tpart1\tpart2"
>>> print(yaml.dump(string, allow_unicode=True))

        part1   part2
...
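
The failure described above is a round-trip failure: loading the dumped text does not give back the original string. A minimal sketch of such a check follows. The helper names `roundtrips`, `lossy_dump`, and `identity_load` are hypothetical and not part of PyYAML; `lossy_dump` is only a stand-in that imitates losing leading whitespace, and the PyYAML part runs only if the library is installed.

```python
def roundtrips(dump, load, value):
    """Return True if load(dump(value)) reproduces value exactly."""
    return load(dump(value)) == value


def lossy_dump(text):
    # Hypothetical stand-in for the buggy behaviour: a string emitted
    # "plain" does not keep its leading whitespace (for example a tab).
    return text.strip()


def identity_load(text):
    # Hypothetical stand-in loader that returns the text unchanged.
    return text


if __name__ == "__main__":
    sample = "\tpart1\tpart2"
    print(roundtrips(lossy_dump, identity_load, sample))

    try:
        import yaml  # PyYAML is optional for this sketch
    except ImportError:
        yaml = None

    if yaml is not None:
        def dump_unicode(value):
            # Same option as in the report: allow_unicode=True
            return yaml.safe_dump(value, allow_unicode=True)

        print(roundtrips(dump_unicode, yaml.safe_load, sample))
```

With a PyYAML version that contains this fix, the second check would be expected to pass; on an affected build it is the kind of check that exposes the problem.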

@perlpunk perlpunk commented Mar 20, 2019

While this fixes issue #275, I'm still not sure whether the logic is now correct for systems where sys.maxunicode <= 0xffff.
I will add an example later.


@perlpunk perlpunk commented Mar 21, 2019

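# coding=utf-8
# Python 2 script: byte-string literals are decoded to unicode objects.
import yaml

string1 = "part1\tpart2".decode('utf8')   # tab: must be quoted
string2 = "ü".decode('utf8')              # printable non-ASCII (BMP)
string3 = "part1\rpart2"                  # left as a byte string on purpose
string4 = "😀".decode('utf8')             # code point above 0xFFFF
data = [string1, string2, string3, string4]
print(data)
# Dump with allow_unicode=True, the setting under which the bug shows up.
res = yaml.safe_dump(data,
                     default_flow_style=False, explicit_start=True, canonical=False,
                     allow_unicode=True, encoding='utf-8', width=float("inf"))
res = res.decode('utf8')
print(res)
# Load the dumped text again to check that the data roundtrips.
newdata = yaml.safe_load(res)
print(newdata)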

Tested on a system with sys.maxunicode > 0xffff.

Output with my fix no. 1:

[u'part1\tpart2', u'\xfc', 'part1\rpart2', u'\U0001f600']
---
- "part1\tpart2"
- ü
- "part1\rpart2"
- "\U0001F600"

['part1\tpart2', u'\xfc', 'part1\rpart2', u'\U0001f600']

Output if I leave out has_ucs4 from the condition:

[u'part1\tpart2', u'\xfc', 'part1\rpart2', u'\U0001f600']
---
- "part1\tpart2"
- ü
- "part1\rpart2"
- 😀

['part1\tpart2', u'\xfc', 'part1\rpart2', u'\U0001f600']

I think we don't need has_ucs4 here, but I'm not a Unicode expert, especially not in Python. If sys.maxunicode <= 0xffff, then (u'\U00010000' <= ch < u'\U0010ffff') can never be true, so I'm not sure what has_ucs4 was supposed to do in this if condition.

I also tried it out on a Mac with python 2.7.

on systems with `sys.maxunicode <= 0xffff` the comparison
(u'\U00010000' <= ch < u'\U0010ffff') can't be true anyway I think
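
The reasoning above can be illustrated with a small sketch. It assumes that has_ucs4 is defined as `sys.maxunicode > 0xFFFF` (implied by the thread, not quoted in it) and that a narrow build stores a character above U+FFFF as two UTF-16 code units. The helper `utf16_code_units` is hypothetical and only simulates what such a build would see when iterating over a string.

```python
import struct
import sys


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# Assumed definition of the flag discussed in the thread.
has_ucs4 = sys.maxunicode > 0xFFFF

# U+1F600 is outside the Basic Multilingual Plane, so it becomes a
# surrogate pair when expressed as 16-bit code units.
units = utf16_code_units("\U0001F600")

# Every individual unit fits in 16 bits, so a per-unit range test for
# 0x10000..0x10FFFF cannot succeed on such a build.
in_astral_range = [0x10000 <= unit <= 0x10FFFF for unit in units]

if __name__ == "__main__":
    print(has_ucs4)
    print([hex(unit) for unit in units])
    print(in_astral_range)
```

Since each unit is at most 0xFFFF, the upper-range comparison is always false on a narrow build, which matches the observation that the flag is not needed to guard it.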

@perlpunk perlpunk commented Mar 21, 2019

Pushed my second fix


@perlpunk perlpunk commented Mar 22, 2019

@peterkmurphy Do you remember what the purpose of has_ucs4 was in your original PR?

or ((not has_ucs4) or (u'\U00010000' <= ch < u'\U0010ffff'))) and ch != u'\uFEFF':


@peterkmurphy peterkmurphy commented Mar 22, 2019


@perlpunk perlpunk commented Mar 23, 2019

Thanks @peterkmurphy
The purpose of the PR is clear; my question was rather what has_ucs4 was supposed to do in the if statement.
The problem is that on systems where has_ucs4 is false (sys.maxunicode <= 0xffff), `not has_ucs4` is true, so essentially the whole condition becomes true, even for characters like tabs. As a result, a string like "\tstring" was dumped as a plain string, which didn't roundtrip. See #275.
I fixed it by taking has_ucs4 out, and I can't think of a case where it would be needed.
Here's the current statement from emitter.py:

    if (ch == u'\x85' or u'\xA0' <= ch <= u'\uD7FF'
            or u'\uE000' <= ch <= u'\uFFFD'
            or ((not has_ucs4) or (u'\U00010000' <= ch < u'\U0010ffff'))) and ch != u'\uFEFF':
        unicode_characters = True
        if not self.allow_unicode:
            special_characters = True
    else:
        special_characters = True
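
To make the effect of the `(not has_ucs4) or ...` term concrete, here is a minimal sketch that isolates the quoted condition as a function, next to a variant with has_ucs4 removed. Both function names are hypothetical, the flag is passed in as a parameter so that either build type can be simulated, and the second function is an assumption about what taking has_ucs4 out leaves behind; the merged diff itself is not shown in this thread.

```python
def allowed_with_has_ucs4(ch, has_ucs4):
    """The condition as quoted from emitter.py, with the flag as a parameter."""
    return (ch == '\x85' or '\xA0' <= ch <= '\uD7FF'
            or '\uE000' <= ch <= '\uFFFD'
            or ((not has_ucs4) or ('\U00010000' <= ch < '\U0010ffff'))) and ch != '\uFEFF'


def allowed_without_has_ucs4(ch):
    """Assumed shape of the condition once has_ucs4 is taken out."""
    return (ch == '\x85' or '\xA0' <= ch <= '\uD7FF'
            or '\uE000' <= ch <= '\uFFFD'
            or '\U00010000' <= ch < '\U0010ffff') and ch != '\uFEFF'


if __name__ == "__main__":
    for ch in ['\t', '\r', '\u00fc', '\U0001F600']:
        print(repr(ch),
              allowed_with_has_ucs4(ch, has_ucs4=False),
              allowed_with_has_ucs4(ch, has_ucs4=True),
              allowed_without_has_ucs4(ch))
```

When has_ucs4 is false, the first function accepts a tab as an ordinary unicode character, so the else branch that sets special_characters is never reached; without the flag, the tab is rejected regardless of build.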


@peterkmurphy peterkmurphy commented Mar 23, 2019


@perlpunk perlpunk commented Mar 26, 2019

ok thanks @peterkmurphy

@perlpunk perlpunk changed the base branch from master to release/5.2 Mar 31, 2019
@perlpunk perlpunk merged commit 60ca52d into release/5.2 Mar 31, 2019
4 checks passed
@perlpunk perlpunk moved this from PRs and Notes to Consider to PyYAML release/5.2 branch in 5.2 Release Mar 31, 2019
perlpunk added a commit that referenced this issue Nov 18, 2019
* Fix logic for quoting special characters

* Remove has_ucs4 from condition

on systems with `sys.maxunicode <= 0xffff` the comparison
(u'\U00010000' <= ch < u'\U0010ffff') can't be true anyway I think
@perlpunk perlpunk deleted the perlpunk/fix-unicode branch Dec 2, 2019
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Dec 15, 2019
5.2:
* Repair incompatibilities introduced with 5.1. The default Loader was changed,
  but several methods like add_constructor still used the old default
  yaml/pyyaml#279 -- A more flexible fix for custom tag constructors
  yaml/pyyaml#287 -- Change default loader for yaml.add_constructor
  yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver
* Make FullLoader safer by removing python/object/apply from the default FullLoader
  yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor
* Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff
  yaml/pyyaml#276 -- Fix logic for quoting special characters
* Other PRs:
  yaml/pyyaml#280 -- Update CHANGES for 5.1
asherf added a commit to asherf/pants that referenced this issue Apr 28, 2020
https://github.com/yaml/pyyaml/blob/d0d660d035905d9c49fc0f8dafb579d2cc68c0c8/CHANGES#L7

5.3.1 (2020-03-18)

* yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor

5.3 (2020-01-06)

* yaml/pyyaml#290 -- Use `is` instead of equality for comparing with `None`
* yaml/pyyaml#270 -- fix typos and stylistic nit
* yaml/pyyaml#309 -- Fix up small typo
* yaml/pyyaml#161 -- Fix handling of __slots__
* yaml/pyyaml#358 -- Allow calling add_multi_constructor with None
* yaml/pyyaml#285 -- Add use of safe_load() function in README
* yaml/pyyaml#351 -- Fix reader for Unicode code points over 0xFFFF
* yaml/pyyaml#360 -- Enable certain unicode tests when maxunicode not > 0xffff
* yaml/pyyaml#359 -- Use full_load in yaml-highlight example
* yaml/pyyaml#244 -- Document that PyYAML is implemented with Cython
* yaml/pyyaml#329 -- Fix for Python 3.10
* yaml/pyyaml#310 -- increase size of index, line, and column fields
* yaml/pyyaml#260 -- remove some unused imports
* yaml/pyyaml#163 -- Create timezone-aware datetimes when parsed as such
* yaml/pyyaml#363 -- Add tests for timezone

5.2 (2019-12-02)
------------------

* Repair incompatibilities introduced with 5.1. The default Loader was changed,
  but several methods like add_constructor still used the old default
  yaml/pyyaml#279 -- A more flexible fix for custom tag constructors
  yaml/pyyaml#287 -- Change default loader for yaml.add_constructor
  yaml/pyyaml#305 -- Change default loader for add_implicit_resolver, add_path_resolver
* Make FullLoader safer by removing python/object/apply from the default FullLoader
  yaml/pyyaml#347 -- Move constructor for object/apply to UnsafeConstructor
* Fix bug introduced in 5.1 where quoting went wrong on systems with sys.maxunicode <= 0xffff
  yaml/pyyaml#276 -- Fix logic for quoting special characters
* Other PRs:
  yaml/pyyaml#280 -- Update CHANGES for 5.1
asherf added a commit to asherf/pants that referenced this issue Apr 29, 2020
Eric-Arellano pushed a commit to pantsbuild/pants that referenced this issue May 1, 2020