Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow accessing numeric fields in remap without quoting #6780

Closed
jszwedko opened this issue Mar 16, 2021 · 12 comments · Fixed by #7045
Closed

Allow accessing numeric fields in remap without quoting #6780

jszwedko opened this issue Mar 16, 2021 · 12 comments · Fixed by #7045
Assignees
Labels
domain: vrl Anything related to the Vector Remap Language transform: remap Anything `remap` transform related type: enhancement A value-adding code change that enhances its existing functionality.

Comments

@jszwedko
Copy link
Member

jszwedko commented Mar 16, 2021

It seems like you cannot access fields in an object that start with a number. This is common, for example, with parse_regex which returns numeric capture groups.

For example, VRL REPL:

$ .message = s'C:\Users\firstname.lastname\Desktop\project\case\case_name\Account\file-events.csv'
$ matches = parse_regex!(.message, r'\\case\\(?P<case>[^\\]+)\\Account')
$ matches.0

error[E203]: syntax error
  ┌─ :1:9
  │
1 │ matches.0
  │         ^
  │         │
  │         unexpected syntax token: "IntegerLiteral"
  │         expected one of: "(", "identifier", "path field", "string literal"
  │
  = see language documentation at https://vrl.dev

$ matches."0"
"\case\case_name\Account"

I might be missing something, but I don't see any reason to require bare keys to not start with a number.

@jszwedko jszwedko added type: enhancement A value-adding code change that enhances its existing functionality. domain: vrl Anything related to the Vector Remap Language transform: remap Anything `remap` transform related labels Mar 16, 2021
@jszwedko jszwedko changed the title Allow accessing numeric fields in remap Allow accessing numeric fields in remap without quoting Mar 16, 2021
@JeanMertz
Copy link
Contributor

I might be missing something, but I don't see any reason to require bare keys to not start with a number.

The reason for this limitation (and others) is simple: start with a conservative/limited syntax scope, and expand as needed.

I don't have any real preference in this case, although it might be confusing for some that .foo.0 behaves differently from .foo[0]. At least .foo."0" makes it more clear that this has nothing to do with indexing.

Also, if we do this, we lose the capability to use numeric fields for other use-cases in the future. I don't have one in mind, though. Rust uses it for tuple access, and there might be reasons to add tuples (e.g. (foo, bar) = foobar(), but I highly doubt it at this point, so it's only a minor loss in syntax availability (for, potentially an increase in usability, which is obv. why you raised the issue).

@jszwedko
Copy link
Member Author

The reason for this limitation (and others) is simple: start with a conservative/limited syntax scope, and expand as needed.

True, I think that is a sensible strategy.

Also, if we do this, we lose the capability to use numeric fields for other use-cases in the future. I don't have one in mind, though. Rust uses it for tuple access, and there might be reasons to add tuples (e.g. (foo, bar) = foobar()

🤔 we could support it for both cases: if it is a tuple, it indexes into that, if it is a map, it treats it as a field name.

@ktff
Copy link
Contributor

ktff commented Mar 30, 2021

There are ambiguities for cases:

  • . 0 . key - the 0 . will be parsed as a float so we will have . float key which we can still cover by hacky rules, but the sub case of this is more problematic:
    • . 0 . 0key - this will be parsed as . float key which is not valid.
    • One solution is to use , for floats, but that would be a breaking change for VRL
  • . 0key - there is ambiguity between this case and . 0 . key case since the . is optional.
    • One solution is to require ., but that would be a breaking change for VRL

Considering those cases, we can only support pure integer, so no fields that start with a number, and only as the last field in path. But that will be confusing and of limited use, so, without an acceptable solution for those cases, I'm for closing this issue.

cc. @JeanMertz @jszwedko

@jszwedko
Copy link
Member Author

Thanks for writing this up @ktff . I'm not sure I completely follow. For your two cases:

  • .0.key: It seems like we'd be able to unambiguously parse that as a field index given the leading .. We don't allow floats with no integer (that is, we allow 0.1 but not simply .1 to represent 1/10)
  • .0key or 0key: In both these cases, it seems like we could parse that as an identifier again because we don't allow floats to be defined as simply without an integer component

I might be missing something here though.

@ktff
Copy link
Contributor

ktff commented Mar 31, 2021

@jszwedko

we allow 0.1 but not simply .1

That's true but that isn't the problem, the 0. is, which is interpreted as a float. In .0.key the parser will parse this as Dot Float String, now we could say ok and accept float as a field in a path and turn it into an integer at that step, but the case .0.0key will also be interpreted as Dot Float String and here is the ambiguity since both the Float and the String have the same values in both cases. Something as disallowing 0. as Float would help but that wouldn't be the end, there are other corner cases.

.0key or 0key

For this one we will parse it as an integer, so Dot Integer String, while the case .0.key will be interpreted as Dot Integer Dot String. Now those two examples are ambigues since Dot is optional for constructing path hence the two cases can be slimmed down to Integer String hence the ambiguity. Take for example case .0a0a, by just adding a rule to allow Integer as a field in a path this will be interpreted the same as .0.a.0.a.


There are ways around this but all that I see are hacky and like a hydra, you take care of one case just for several new corner cases to pop up. Especially case .0.0key which I think can't be covered with context less parser.

@ktff
Copy link
Contributor

ktff commented Mar 31, 2021

Now, there are ways to solve this but it will require breaking changes to the language and I'm note sure that it's worth it. Most notably changing floats from 0.1 to 0,1.

@jszwedko
Copy link
Member Author

🤔 I'm not super familiar with VRL's tokenizer and parser but it seems to handle path segments specifically:

https://github.com/timberio/vector/blob/fa27239c677959ab84267b99bf35d32a687431d1/lib/vrl/parser/src/parser.lalrpop#L375-L392

cc/ @JeanMertz I'm curious if you have thoughts here since you wrote most of the parser.

I'm happy to defer to people closer to it.

At the least, we should document the limitation on https://vector.dev/docs/reference/vrl/expressions/#path_segments.

@ktff
Copy link
Contributor

ktff commented Apr 1, 2021

That path segment is partly an issue and it's manageable, but the tokenizer is responsible for parsing the 0.0 into Float (I used 'parser' interchangeably with tokenizer in previous comments so that's maybe from where the confusion comes from). The tokenizer doesn't have context so it doesn't know that it's in a path so it happily interprets .0.0 as Dot Float. That's the crux of the issue, there is no way to distinguish a regular float from two integers in a path.

@ktff
Copy link
Contributor

ktff commented Apr 7, 2021

@JeanMertz could you give us your thought's on this. Most notably this

That's the crux of the issue, there is no way to distinguish a regular float from two integers in a path.

@JeanMertz
Copy link
Contributor

It seems to me that, to support this, we'd have to do one of three things:

  1. Change the lexer so that it lexes individual path segments.
  2. Remove floats from the lexer, and instead parse floats in the parser (e.g. Integer Dot Integer, etc).
  3. Update the path parser logic to accept floats in paths, and then separate a float into two integer-based path segments.

I'd say (1) is not ideal, as we want to keep as much contextual information out of the lexer. Doing (2) is a more favourable option to the first. Finally, (3) doesn't require us to change the lexer, but it'll still require a bit of fiddling with the path parsing rules to make that work as expected.

I'm not sure which of (2) or (3) I'd prefer. It's a toss-up for me, really. Doing (2) seems to be the least hacky, although (3) allows us to keep the lexer as-is, at the cost of making the path parsing rules more complex.

@ktff
Copy link
Contributor

ktff commented Apr 7, 2021

@JeanMertz thanks.

(3) isn't doable because we can't distinguish 0. from 0.0

So let's go with (2).

@JeanMertz
Copy link
Contributor

JeanMertz commented Apr 7, 2021

(3) isn't doable because we can't distinguish 0. from 0.0

If 0. is a valid float, I consider that a bug (something I should have tested before shipping 0.12). I'm okay with doing a breaking change to disallow 0. as a float. I know Rust and other languages do support this, but I don't consider it important enough for a language focused on observability data mapping, and rather require people to write 0.0 if they need a float literal (note that VRL allows integer arithmetic operations on floats, so 5.0 / 2 == 2.5, which further reduces the need for a short-hand 2. float notation).

So let's go with (2).

This does still seem like the most sensible approach, I agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: vrl Anything related to the Vector Remap Language transform: remap Anything `remap` transform related type: enhancement A value-adding code change that enhances its existing functionality.
Projects
None yet
3 participants