Skip to content

Draft: #2298/change yaml parser to a maintained one #2392

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

Mivr
Copy link

@Mivr Mivr commented Jun 7, 2025

No description provided.

@Mivr Mivr changed the title #2298/change yaml parser to a maintained one WIP: #2298/change yaml parser to a maintained one Jun 7, 2025
@Mivr Mivr changed the title WIP: #2298/change yaml parser to a maintained one Draft: #2298/change yaml parser to a maintained one Jun 7, 2025
@Mivr Mivr marked this pull request as draft June 7, 2025 09:08
Copy link

@ccoVeille ccoVeille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor remark, more from an end user than a real code review

cmd/root.go Outdated
switch yamlParser {
case "goccy", "":
yqlib.ConfiguredYamlPreferences.UseGoccyParser = true
case "v3":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call it "legacy-v3", or simply "legacy"

It should be obvious this option is no longer maintained

Copy link
Author

@Mivr Mivr Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good note, thanks will update

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 36 to 44
// Try to parse date-only format
if _, err := time.Parse("2006-01-02", value); err == nil {
return true
}

// Try RFC3339 without timezone
if _, err := time.Parse("2006-01-02T15:04:05", value); err == nil {
return true
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are constants in time package for this.

You can use usestdlibvars linter to identify them easily

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprisingly usestdlibvars did not show any errors, maybe my setup has a problem...

Anyway fixed (next commit)

Comment on lines 96 to 102
// Skip empty lines, comments, directives, document separators
if trimmed == "" || strings.HasPrefix(trimmed, "#") ||
strings.HasPrefix(trimmed, "%") || strings.HasPrefix(trimmed, "---") ||
strings.HasPrefix(trimmed, "...") {
return true
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a switch with one case (or multiple) it would be more readable

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 109 to 113
if strings.HasSuffix(trimmed, "|") || strings.HasSuffix(trimmed, ">") ||
strings.HasSuffix(trimmed, "|-") || strings.HasSuffix(trimmed, ">-") ||
strings.HasSuffix(trimmed, "|+") || strings.HasSuffix(trimmed, ">+") {
return true
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 128 to 133
strings.HasPrefix(trimmedValue, "\"") ||
strings.HasPrefix(trimmedValue, "'") ||
strings.HasPrefix(trimmedValue, "{") ||
strings.HasPrefix(trimmedValue, "[") ||
strings.HasPrefix(trimmedValue, "&") ||
strings.HasPrefix(trimmedValue, "*") {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a switch on trimmedValue[0] would be more readable

Comment on lines 163 to 172
// Skip if value is empty, quoted, or already complex
if trimmedValue == "" ||
strings.HasPrefix(trimmedValue, "\"") ||
strings.HasPrefix(trimmedValue, "'") ||
strings.HasPrefix(trimmedValue, "{") ||
strings.HasPrefix(trimmedValue, "[") ||
strings.HasPrefix(trimmedValue, "&") ||
strings.HasPrefix(trimmedValue, "*") {
return line
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 65 to 66
var commentLineRegEx = regexp.MustCompile(`^\s*#`)
var yamlDirectiveLineRegEx = regexp.MustCompile(`^\s*%YA`) // e.g., %YAML

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this out, otherwise the regexp is computed each time the function is called

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 19 to 27
// ISO8601 date/time patterns that should be automatically tagged as timestamps
var dateTimePatterns = []*regexp.Regexp{
// RFC3339 / ISO8601 with timezone: 2006-01-02T15:04:05Z or 2006-01-02T15:04:05+07:00
regexp.MustCompile(`^\s*([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?(?:Z|[+-][0-9]{2}:[0-9]{2}))\s*$`),
// Date only: 2006-01-02
regexp.MustCompile(`^\s*([0-9]{4}-[0-9]{2}-[0-9]{2})\s*$`),
// RFC3339 without timezone: 2006-01-02T15:04:05
regexp.MustCompile(`^\s*([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?)\s*$`),
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this aside the function that use it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@Mivr
Copy link
Author

Mivr commented Jun 11, 2025

@ccoVeille Thank you very much for the review. I will be addressing the comments tomorrow.

And for the parser change, Goccy handles comments differently and this will break workflows for some users. I managed to equalize the behavior around timestamps and dates, but the comments will brake stuff. Also some arrays and anchors are handled differently, which will change the print out but the structure will be kept. So logical flows should not be impacted in contrast to the comments difference where if someone depends on some comment being attached to a specific node, his flow will break with the Goccy parser.

I will make a bit longer explanation of all the differences that I found and how they are addressed in this PR. Of course I am very opened for suggestions and comments. I will be working on this this and next week.

Mivr and others added 2 commits June 12, 2025 00:16
Co-authored-by: ccoVeille <3875889+ccoVeille@users.noreply.github.com>
Co-authored-by: ccoVeille <3875889+ccoVeille@users.noreply.github.com>
@mikefarah
Copy link
Owner

You're racing ahead here, I'd slow this down a little - the go-yaml parser now has the support of the official Yaml org - I've been chatting with them and they will be soon addressing the issues raised. I'd prefer to keep go-yaml and maintain comments, anchors, array and formatting etc.

@Mivr
Copy link
Author

Mivr commented Jun 16, 2025

@mikefarah

Thesis

You do make a good point: As I get it: Why to work on this if we can just wait for https://github.com/yaml/go-yaml and then just replace a dependency (theoretically).

I can state an argument for having a goccy parser as an alternative behind a flag:

  • go-yaml has been unmaintained for years, so the actual backlog of issues that need addressing will be probably more than even the one in the now archived repo (~300 open issues). Assuming the YAML foundation continues working on this project with more or less the same speed as the last few months (~10 PRs merged for 2 months) they will have to work for years before they catch up on all the issues goccy has already solved.
    • In short my first point is that go-yaml even maintained by a very serious company is still years behind other implementations. So the time for getting a reliable parser is not the time it takes to release the first version of go-yaml, but the time it takes to fix a 3+ year legacy codebase with >300 issues opened.
  • goccy can easily work behind a flag with a warning for known and possibly unknown incompatibilities, anyway the people that cannot use go-yaml and need goccy don't have much choice. They are either not working with yq because of go-yaml or they use goccy with some differences to standard yq.
  • goccy discrepancies are not so many nor very intrusive (more details below).

go-yaml vs goccy parsing differences (not mitigated already in this PR)

The differences between go-yaml and goccy displayed visually:

1. Comments association on array elements (BREAKING change if someone depends on comments inside the yaml file)

Can be mitigated with a pre processor that moves the comments so the user gets the same result in the end.

Input:

name:
  # under-name-comment
  - first-array-child

In this example go-yaml associates the comment with the first element of the array, while goccy associates the comment with the array itself

2. Flow maps will be expanded:

For input:

{a: 1, b: 2}

go-yaml produces the same as output.

goccy produces:

a: 1
b: 2

No workflows will be broken as the data structure is the same, formatting is the only change here.

It can be handled in some complicated pre + post processing but fairly I don't see this as a braking change, no one should depend on the formatting of the output yaml.

3. Merge anchor

Given:

foo: &foo
  a: foo_a
  thing: foo_thing
  c: foo_c

foobar:
  c: foobar_c
  <<: *foo
  thing: foobar_thing

go-yaml output for .foobar:

c: foobar_c
!!merge <<: *foo
thing: foobar_thing

Goccy output for .foobar:

<<:
    a: foo_a
    c: foo_c
    thing: foo_thing
c: foobar_c
thing: foobar_thing

For me goccy seems to be the cleaner more informative output, again if you really want to this can be handled in post processor.

4. Duplicate keys

go-yaml allows for duplicate keys
goccy does not

I think this is a feature rather than a bug. Again pre-processing can be used to level the behavior if this is what you believe is best.

5. Goccy is stricter on some invalid YAML cases

I think this is a feature.

Conclusion

I do think goccy should be supported behind a flag for the use cases that have been waiting for years for a fix in go-yaml and the same will have to wait for years more even with the support of the YAML foundation.

Not implementing goccy as a parser or any other parser than go-yaml forces anyone that want to use yq to fix go-yaml and wait for their changes to be merged, which can take months or years even if they have the capacity to fix go-yaml.

More or less a parallel parser can save hundreds to thousands of work hours for the users of yq.

@Mivr
Copy link
Author

Mivr commented Jun 16, 2025

P.S. if you see value in having goccy as a parser and you want some or all of the differences to be handled with pre and/or post processors please do write so and I will add them to this PR.

@LarsStegman
Copy link

4. Duplicate keys

go-yaml allows for duplicate keys goccy does not

There is a parser.Option for goccy to allow duplicate keys, parser.AllowDuplicateMapKey().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants