add_implicit_resolver does not match quoted strings #457

Open
att14 opened this issue Nov 13, 2020 · 11 comments
@att14

att14 commented Nov 13, 2020

import re
import uuid
import yaml

regex = re.compile(r'^UUID\((.+)\)$')
yaml.add_implicit_resolver('!uuid', regex)

def convert_uuid(loader, node):
    value = loader.construct_scalar(node)
    str_value = regex.match(value).group(1)
    return uuid.UUID(str_value)

yaml.add_constructor('!uuid', convert_uuid)

print(yaml.load('''
    config:
        abc: UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)
        def: abc
'''))

# {'config': {'abc': UUID('6a02171e-6482-11e9-ab43-f2189845f1cc'), 'def': 'abc'}}

print(yaml.load('''
    config:
        abc: 'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)'
        def: abc
'''))

# {'config': {'abc': 'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)', 'def': 'abc'}}

print(yaml.load('''
    config:
        abc: !uuid 'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)'
        def: abc
'''))

# {'config': {'abc': UUID('6a02171e-6482-11e9-ab43-f2189845f1cc'), 'def': 'abc'}}

Maybe this is something I don't understand about the YAML spec, but if you explicitly tag the value it works. I initially thought it was #294, but I am using PyYAML 5.3.1 with Python 3.8.6.

@att14
Author

att14 commented May 22, 2021

I guess this is described in the YAML spec:

Application specific tag resolution rules should be restricted to resolving the “?” non-specific tag, most commonly to resolving plain scalars. These may be matched against a set of regular expressions to provide automatic resolution of integers, floats, timestamps, and similar types. An application may also match the content of mapping nodes against sets of expected keys to automatically resolve points, complex numbers, and similar types. Resolved sequence node types such as the “ordered mapping” are also possible.

Since quoted strings are non-plain scalars, the spec suggests you do not do matching. This is also part of the PyYAML documentation that states that add_implicit_resolver "adds an implicit tag resolver for plain scalars." However, I would assume that 99% of people who use this library have never looked at the YAML spec and won't know what a "plain" scalar is. At least that was my confusion.

The YAML spec does go on to say:

That said, tag resolution is specific to the application. YAML processors should therefore provide a mechanism allowing the application to override and expand these default tag resolution rules.

It seems like we should be allowed to have a resolver that will tag explicit nodes.

@jimisola

jimisola commented Feb 5, 2022

@att14 Did you find a solution?

@ingydotnet
Member

When a yaml loader loads a yaml stream there are these phases under the hood:
(text)->scan->(tokens)->parse->(events)->compose->(node graph)->construct->(native)

BTW yaml.scan, yaml.parse and yaml.compose are all part of the PyYAML API just like yaml.load.

Try python3 -c 'import yaml; from pprint import pp; pp(list(yaml.scan("""{config: {abc: !uuid "UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)", def: abc}}""")))'.

Every node created by the composer is assigned a tag; either the explicit one or an implicit one.
The tag is used by the constructor to look up the function that produces the final native result.

Untagged non-plain (quoted) scalars are assigned !!str. Untagged plain scalars (unquoted) must have a resolver to assign them a tag. The default resolver also assigns !!str.

The end result in pyyaml is that non-plain scalars are not available for implicit tag resolution since you effectively assigned them a tag by quoting them.
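
To make the difference visible at the node level, here is a minimal sketch that uses yaml.compose to read the tag the composer assigns to a plain, a quoted, and a !-tagged scalar. DemoLoader and scalar_tag are names made up for this illustration; the resolver mirrors the one from the original report.

```python
import re

import yaml


class DemoLoader(yaml.SafeLoader):
    """Hypothetical loader so the resolver does not leak into the global loaders."""


# Same pattern as in the original report; 'U' limits it to scalars starting with U.
DemoLoader.add_implicit_resolver('!uuid', re.compile(r'^UUID\((.+)\)$'), ['U'])


def scalar_tag(text):
    """Compose a single-scalar document and return the tag on its root node."""
    return yaml.compose(text, Loader=DemoLoader).tag


plain_tag = scalar_tag("UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)")
quoted_tag = scalar_tag("'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)'")
bang_tag = scalar_tag("! 'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)'")

print(plain_tag)   # !uuid
print(quoted_tag)  # tag:yaml.org,2002:str
print(bang_tag)    # !uuid
```

The quoted scalar never reaches the regular expression: it is tagged as a string before any implicit resolver is consulted.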

PyYAML lets you do this:

print(yaml.unsafe_load('''
    config:
        abc: ! 'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)'
        def: abc
'''))

# {'config': {'abc': UUID('6a02171e-6482-11e9-ab43-f2189845f1cc'), 'def': 'abc'}}

but I don't think that complies with the 1.2 spec. This is PyYAML's way of saying treat this quoted thing as unquoted and allow implicit tag resolution. I am in favor of adding this to the spec in a future version of yaml.

In your case there was no need to quote UUID(6a02171e-6482-11e9-ab43-f2189845f1cc) but consider:

user: ! '@ingydotnet'

Since plain scalars may not start with @, how else could you implicitly tag this to maybe create a githubUser object?

The spec allows implicit resolution not just by pattern matching but also by path position. You should be able to use a loader so that:

foo:
  uuid: ...

would always resolve the ... regardless of how it was quoted.

Unfortunately most current YAML implementations don't offer that kind of fine grained control yet.
But we're working on ways to make it both possible and simple.
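
For what it's worth, PyYAML does ship an add_path_resolver method, which its documentation describes as experimental. The sketch below assumes it behaves as documented and uses it to tag whatever untagged scalar sits at foo -> uuid, quoted or not. PathLoader and construct_uuid are hypothetical names, and the value is assumed to be a bare UUID string rather than the UUID(...) form used earlier in the thread.

```python
import uuid

import yaml


class PathLoader(yaml.SafeLoader):
    """Hypothetical loader used only for this illustration."""


def construct_uuid(loader, node):
    # Assumes the scalar holds a bare UUID string.
    return uuid.UUID(loader.construct_scalar(node))


PathLoader.add_constructor('!uuid', construct_uuid)
# Experimental PyYAML API: tag the scalar found under the keys foo -> uuid.
PathLoader.add_path_resolver('!uuid', ['foo', 'uuid'], str)

doc = yaml.load("""
foo:
  uuid: '6a02171e-6482-11e9-ab43-f2189845f1cc'
  name: '6a02171e-6482-11e9-ab43-f2189845f1cc'
""", Loader=PathLoader)

print(type(doc['foo']['uuid']).__name__)  # UUID
print(type(doc['foo']['name']).__name__)  # str
```

Regular-expression resolvers still run first for plain scalars, so a plain value that looks like an integer or a timestamp would keep that type even at this path.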

@Thom1729
Contributor

Thom1729 commented Feb 5, 2022

Just to expand on this:

When a node is parsed, if it doesn't have an explicit tag, it's assigned one of the two “non-specific tags”. Non-plain scalars (i.e. quoted scalars or block scalars) are assigned the ! non-specific tag, and all other nodes (i.e. plain scalars, sequences, and mappings) are assigned the ? non-specific tag. (It's also possible to explicitly give a node the ! non-specific tag, but not ?.)

Tag resolution is the process of taking each node with a non-specific tag and assigning it a specific tag. According to the spec:

YAML processors should resolve nodes having the “!” non-specific tag as “tag:yaml.org,2002:seq”, “tag:yaml.org,2002:map” or “tag:yaml.org,2002:str” depending on their kind. This tag resolution convention allows the author of a YAML character stream to effectively “disable” the tag resolution process. By explicitly specifying a “!” non-specific tag property, the node would then be resolved to a “vanilla” sequence, mapping or string, according to its kind.

Application specific tag resolution rules should be restricted to resolving the “?” non-specific tag…

These are both “should” requirements, so it's valid for an application to resolve the ! non-specific tag in a nonstandard way, and it's valid for a YAML implementation to allow applications to do so. However, it's discouraged because authors might deliberately specify ! intending to avoid application-specific tag resolution.
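
If an application decides to take that discouraged route anyway, one possible sketch is a loader subclass that overrides resolve, the method PyYAML's composer calls for every node without a specific tag. QuotedUUIDLoader, UUID_RE and construct_uuid are hypothetical names; this is not an official PyYAML recipe, and it deliberately checks only one pattern so that other quoted scalars such as '123' stay strings.

```python
import re
import uuid

import yaml

UUID_RE = re.compile(r'^UUID\((.+)\)$')


class QuotedUUIDLoader(yaml.SafeLoader):
    """Hypothetical loader that also resolves quoted UUID(...) scalars."""

    def resolve(self, kind, value, implicit):
        # Called only for untagged or '!'-tagged nodes, so an explicit
        # !!str still wins. Quoted and plain scalars are treated alike here.
        if kind is yaml.ScalarNode and UUID_RE.match(value):
            return '!uuid'
        return super().resolve(kind, value, implicit)


def construct_uuid(loader, node):
    return uuid.UUID(UUID_RE.match(loader.construct_scalar(node)).group(1))


QuotedUUIDLoader.add_constructor('!uuid', construct_uuid)

doc = yaml.load("""
config:
    abc: 'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)'
    raw: !!str 'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)'
    num: '123'
""", Loader=QuotedUUIDLoader)

print(type(doc['config']['abc']).__name__)  # UUID
print(type(doc['config']['raw']).__name__)  # str
```

The trade-off is the one described above: an author who writes ! or quotes to opt out of resolution no longer can, short of using an explicit !!str.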

@jimisola

jimisola commented Feb 5, 2022

Thanks for the clarification. I was reading the specification before and have a related question regarding verbatim tags:

Verbatim Tags
A tag may be written verbatim by surrounding it with the “<” and “>” characters. In this case, the YAML processor must deliver the verbatim tag as-is to the application. In particular, verbatim tags are not subject to tag resolution. A verbatim tag must either begin with a “!” (a local tag) or be a valid URI (a global tag).

If a verbatim tag must be delivered as-is to the application, without tag resolution, why can't a node have multiple tags? Tags in this case don't resolve to a datatype per se.

E.g.

description: !force-inherit !tag2 some string

@Thom1729
Contributor

Thom1729 commented Feb 5, 2022

It's a fundamental part of the YAML representation model that a node has exactly one tag. This one tag is used to determine the canonical form of scalar content and is used by implementations to choose what native types to construct. For example, the node !!int 0xFF has canonical form 255, and a Python implementation will probably construct it as a native Python int, whereas !!str 0xFF has canonical form 0xFF and will probably become a Python str.
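
A quick way to check this in PyYAML is to load the same scalar content under different tags. This is only a sketch using yaml.safe_load; the variable names are arbitrary.

```python
import yaml

as_int = yaml.safe_load('!!int 0xFF')   # explicit integer tag
as_str = yaml.safe_load('!!str 0xFF')   # explicit string tag
untagged = yaml.safe_load('0xFF')       # plain scalar, resolved implicitly

print(repr(as_int))    # 255
print(repr(as_str))    # '0xFF'
print(repr(untagged))  # 255
```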

So I'm not sure how to interpret the example !force-inherit !tag2 some string. Are the semantics of the node those of !tag2 or of !force-inherit? Which tag should an implementation consider when determining the node's canonical form, or when deciding what native type to construct?

Are you by chance using !force-inherit as a sort of annotation? If so, this is not really what tags were designed for and there are several practical obstacles you might have to deal with. I know this because I've used tags as annotations in a project of my own and run into those obstacles. This is something that we're looking into, and it's possible that in a future version we may provide an alternative syntax for annotating nodes that does not conflict with the established syntax and semantics of tags.

Would you mind explaining your use case in a bit more detail?

@jimisola

jimisola commented Feb 5, 2022

Of course I can. Like you guessed I thought of using tags as a sort of annotation. I saw that the @ character has been reserved in the YAML specification.

I just got involved in the project and we are looking at a version 3 of a YAML config file for GitLabForm [https://github.com/gdubicki/gitlabform/blob/main/config.yml] to accommodate a bunch of user requests.
That said, I found out that the project uses ruamel.yaml and not pyyaml, so this is not related to pyyaml per se, but rather to the YAML specification. Perhaps we should continue the discussion there?
Users want to be able to force inheritance, ignore inheritance, etc. on different levels, down to a single key-value pair. I'm assuming that there might be more user requests like this.

Use-case:

# normal case
someKey: someValue

# want to avoid
someKey:
  value: someValue
  do_not_inherit: true
  
# would prefer
someKey: @do_not_inherit someValue

@ingydotnet
Member

As @Thom1729 said

Tag resolution is the process of taking each node with a non-specific tag and assigning it a specific tag.

Tags themselves are annotation strings that can be used by the composition/construction process to configure that process.
In reality (in PyYAML), tags are keys used to look up functions: the functions that transform representation nodes into native structures.

For example: the !!bool tag is a key to look up a python function to create a python bool value.

This is the PyYAML code that implicitly tags plain scalar values matching certain patterns:
https://github.com/yaml/pyyaml/blob/master/lib/yaml/resolver.py#L170-L175

to the tag tag:yaml.org,2002:python/bool which is configured here:
https://github.com/yaml/pyyaml/blob/master/lib/yaml/constructor.py#L665-L667

To tell the constructor to transform the canonical form of the scalar using this function:

def construct_yaml_bool(self, node):

to make the python boolean native value.

The process of turning a node into a canonical form is not well abstracted or exposed by pyyaml.
In this boolean example, it simply calls .lower() on the value and uses that as a local dict key.
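
The lookup can be observed directly. The sketch below reads the class-level yaml_constructors table on SafeLoader, which is where add_constructor stores its entries, and then loads a few spellings of the YAML 1.1 booleans; treat the direct table access as an illustration rather than a supported API.

```python
import yaml

BOOL_TAG = 'tag:yaml.org,2002:bool'

# The tag assigned by the implicit resolver to a plain "yes".
resolved_tag = yaml.compose('yes', Loader=yaml.SafeLoader).tag

# The function registered under that tag.
constructor = yaml.SafeLoader.yaml_constructors[BOOL_TAG]

# Case differences are normalized; the quoted "yes" stays a string.
values = yaml.safe_load('[yes, No, ON, off, TRUE, "yes"]')

print(resolved_tag)          # tag:yaml.org,2002:bool
print(constructor.__name__)  # construct_yaml_bool
print(values)                # [True, False, True, False, True, 'yes']
```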


It is easy to apply multiple functions (tags) to a node's construction, by wrapping calls in sequences.
Using @Thom1729 's example, you can:

- !force-inherit [ !tag2 some string ]

it also makes the order of operations visually clear.
And you can obviously do this right now in pyyaml; although we are working on ways to make it even cleaner.
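
As a rough sketch of how that wrapping could look today: each tag gets its own constructor, and the outer one receives the already constructed inner value. WrapLoader and both constructor functions are hypothetical, and what each tag does (upper-casing, adding a flag) is invented purely for the demonstration.

```python
import yaml


class WrapLoader(yaml.SafeLoader):
    """Hypothetical loader used only for this illustration."""


def construct_tag2(loader, node):
    # Invented behaviour: upper-case the scalar.
    return loader.construct_scalar(node).upper()


def construct_force_inherit(loader, node):
    # deep=True makes sure the inner !tag2 node is fully constructed first.
    (inner,) = loader.construct_sequence(node, deep=True)
    return {'force_inherit': True, 'value': inner}


WrapLoader.add_constructor('!tag2', construct_tag2)
WrapLoader.add_constructor('!force-inherit', construct_force_inherit)

result = yaml.load('- !force-inherit [ !tag2 some string ]', Loader=WrapLoader)
print(result)  # [{'force_inherit': True, 'value': 'SOME STRING'}]
```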


On the subject of tag resolution and canonical form, this is usually seen in the context of scalars but you can also do this with collections.

For instance you shouldn't have to tag !color for:

pink: !color { r: 255, g: 192, b: 203 }
blue: { r: 0, g: 0, b: 255 }

and the normalization should be equally powerful. This should be able to produce the same:

pink: r=255 g=192 b=203
blue: [0, 0, 255]

If a schema dictates that a given node in the graph is required to have the tag !color it should be able to express the normalization from various serialization forms as shown above.
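
PyYAML has no schema-driven tagging of this kind, but the normalization half can be approximated with an explicit tag. The following sketch registers one hypothetical !color constructor that accepts a mapping, a sequence, or an r=... g=... b=... scalar and always returns an (r, g, b) tuple; ColorLoader, construct_color and the tuple representation are assumptions for the example.

```python
import yaml


class ColorLoader(yaml.SafeLoader):
    """Hypothetical loader used only for this illustration."""


def construct_color(loader, node):
    """Normalize three serialization forms to an (r, g, b) tuple."""
    if isinstance(node, yaml.MappingNode):
        channels = loader.construct_mapping(node)
        return (channels['r'], channels['g'], channels['b'])
    if isinstance(node, yaml.SequenceNode):
        return tuple(loader.construct_sequence(node))
    # Scalar form: "r=255 g=192 b=203"
    pairs = dict(part.split('=') for part in loader.construct_scalar(node).split())
    return tuple(int(pairs[name]) for name in 'rgb')


ColorLoader.add_constructor('!color', construct_color)

colors = yaml.load("""
pink: !color { r: 255, g: 192, b: 203 }
rose: !color r=255 g=192 b=203
blue: !color [0, 0, 255]
""", Loader=ColorLoader)

print(colors['pink'] == colors['rose'])  # True
print(colors['blue'])                    # (0, 0, 255)
```

Unlike the schema idea described above, every node still has to carry the !color tag here.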

A yaml schema is the entity that defines the tags available to a load operation and how the implicit resolution and canonicalization should work.

A schema should be able to be written as a yaml data file and exposed to a YAML API like pyyaml's via something like:

native = yaml.load(yaml_input, schema_file='drawing-schema.yaml')

We're not there yet, but we're working on it.

@ingydotnet
Member

# would prefer
someKey: @do_not_inherit someValue

There's no reason you couldn't do this now using:

someKey: !do_not_inherit someValue
# or:
someKey: !do_not_inherit [ !tag taggedValue ]

where the function you register to !do_not_inherit modifies the constructed value in some way desired by your framework.
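
A minimal sketch of such a function, assuming the framework wants a marker object around the value: DoNotInherit, ConfigLoader and construct_do_not_inherit are hypothetical names, and the retries key is made up to show the sequence form.

```python
from dataclasses import dataclass
from typing import Any

import yaml


@dataclass
class DoNotInherit:
    """Hypothetical marker a framework could check when merging configs."""
    value: Any


class ConfigLoader(yaml.SafeLoader):
    """Hypothetical loader used only for this illustration."""


def construct_do_not_inherit(loader, node):
    if isinstance(node, yaml.SequenceNode):
        # Wrapped form: the single inner node keeps its own tag and type.
        (inner,) = loader.construct_sequence(node, deep=True)
    else:
        # Direct form: the explicit tag replaces implicit typing,
        # so the value arrives as a plain string.
        inner = loader.construct_scalar(node)
    return DoNotInherit(inner)


ConfigLoader.add_constructor('!do_not_inherit', construct_do_not_inherit)

config = yaml.load("""
someKey: !do_not_inherit someValue
retries: !do_not_inherit [ 3 ]
""", Loader=ConfigLoader)

print(config['someKey'])  # DoNotInherit(value='someValue')
print(config['retries'])  # DoNotInherit(value=3)
```

The wrapped form is the one to reach for when the inner value should still be resolved as an int, bool, or another tagged type.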

Ruamel is a fork of PyYAML. I just took a quick look and all the resolver and constructor functions I mentioned in the previous comment are available in ruamel too.

wmfgerrit pushed a commit to wikimedia/homer that referenced this issue Apr 11, 2023
* Quoted string in pyyaml are automatically tagged as strings and would
  bypass the conversion to ipaddress objects.
* Ensure that the special syntax using ! or !TAGNAME works and allows to
  parse as ipaddresses also quoted strings.
* See also: yaml/pyyaml#457


Change-Id: I17ccda008f7e1250d293d014c6a9043d0c347984
@sbansla

sbansla commented Dec 11, 2023

Any workaround for this issue ?

@att14
Author

att14 commented Dec 11, 2023

IIRC it's not a workaround because it's technically the correct implementation. But the solution is in #457 (comment). Use a ! in front of the scalar.

config:
    abc: ! 'UUID(6a02171e-6482-11e9-ab43-f2189845f1cc)'
    def: abc
