Incapable of working with large (very large?) yaml files #1215

Open
MattMills opened this issue May 14, 2022 · 3 comments
@MattMills

Describe the bug
I have a 202 MB YAML file generated by llvm-pdbutil's pdb2yaml functionality. It is a YAML export of an executable's debug database. I'm trying to run simple queries against this file to find and understand its contents, and it seems that yq is not capable of doing so.

time yq eval '.DbiStream.Modules.Module' [snip].yaml > /dev/null
^C
real    3m49.209s
user    4m9.729s
sys     0m21.666s

This command did not complete. It had used about 10 GB of memory when I interrupted it with Ctrl-C to avoid running out of memory.

Note that any how to questions should be posted in the discussion board and not raised as an issue.

Version of yq: 4.16.2 from ubuntu PPA
Operating system: Ubuntu 20.04.4 LTS
Installed via: ppa

Input Yaml
Concise yaml document(s) (as simple as possible to show the bug, please keep it to 10 lines or less)
Err... I think it might be more than 10 lines.

Command
The command you ran:

yq eval '.DbiStream.Modules.Module' [snip].yaml > /dev/null

Actual behavior

Lots of memory and CPU use, no output (even without /dev/null). I'm guessing it loads the entire file and all structures into a parsed memory structure before doing anything.

Expected behavior

I'm not sure whether large files are simply considered unsupported by yq, but ideally it would not use 10 GB of RAM to parse a 200 MB file.

Additional context

@mikefarah
Owner

That's a lot of memory :(

You're right in that it reads the entire document into memory before doing anything - that's how the underlying yaml parsers work.
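
To give a rough feel for why a fully parsed tree can be many times larger than the text it came from, here is a minimal sketch. It uses Python's standard `json` module purely as a stand-in (an assumption for illustration; it is not yq's Go YAML parser, and the ratio will not match yq's), and the record fields are made up.

```python
import json
import tracemalloc


def parsed_size_ratio(n_records: int = 50_000) -> float:
    """Return peak bytes allocated while parsing divided by bytes of input text."""
    # Synthetic input: many small mappings. The field names are invented.
    text = json.dumps(
        [{"name": f"mod{i}", "offset": i, "size": i * 2} for i in range(n_records)]
    )
    tracemalloc.start()
    tree = json.loads(text)  # the whole tree is built before anything can be queried
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert len(tree) == n_records
    return peak / len(text.encode("utf-8"))


if __name__ == "__main__":
    print(f"parsed tree is about {parsed_size_ratio():.1f}x the size of its text")
```

Every scalar and mapping becomes its own heap object with per-node overhead, which is the general effect at play here, whatever the exact figures are for a given parser.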

The only thing I can think of that would help immediately is turning off colors with the -M flag (as that puts the document through another parser). I'd be curious as to what the stats look like with that turned off.
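
For collecting comparable stats with and without -M, a small helper along these lines might be useful. It is only a sketch: it relies on Python's Unix-only `resource` module, and `peak_rss_of` is a hypothetical helper name, not part of yq.

```python
import resource
import subprocess
import sys


def peak_rss_of(cmd: list[str]) -> tuple[int, int]:
    """Run cmd with stdout discarded; return (exit code, peak RSS of children)."""
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL)
    # ru_maxrss is the largest resident set size among child processes that
    # have been waited for so far: kilobytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak


if __name__ == "__main__" and len(sys.argv) > 1:
    code, peak = peak_rss_of(sys.argv[1:])
    print(f"exit code {code}, peak RSS {peak} (KiB on Linux, bytes on macOS)")
```

Saved as, say, `peak.py` (a placeholder name), it would be run once as `python3 peak.py yq eval '.DbiStream.Modules.Module' input.yaml` and once with `-M` added, where `input.yaml` stands in for the real file.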

@MattMills
Author

On that machine it seems to be about the same with -M. On my [Windows] desktop, which has substantially more RAM and runs the latest version, it appears to get to 16-17 GB and then I get a golang panic:

panic: internal error: attempted to parse unknown event (please report): none [recovered]
        panic: internal error: attempted to parse unknown event (please report): none

goroutine 1 [running]:
gopkg.in/yaml%2ev3.handleErr(0xc00060d830)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/yaml.go:294 +0x6d
panic({0xdb7620, 0xc51e376130})
        /opt/hostedtoolcache/go/1.18.1/x64/src/runtime/panic.go:838 +0x207
gopkg.in/yaml%2ev3.(*parser).parse(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:163 +0x194
gopkg.in/yaml%2ev3.(*parser).parseChild(0xc000100800?, 0xc51e3783c0)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:194 +0x25
gopkg.in/yaml%2ev3.(*parser).mapping(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:277 +0x12c
gopkg.in/yaml%2ev3.(*parser).parse(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:152 +0xff
gopkg.in/yaml%2ev3.(*parser).parseChild(...)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:194
gopkg.in/yaml%2ev3.(*parser).sequence(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:259 +0x125
gopkg.in/yaml%2ev3.(*parser).parse(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:154 +0xe7
gopkg.in/yaml%2ev3.(*parser).parseChild(0xc000100800?, 0xc51c623ea0)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:194 +0x25
gopkg.in/yaml%2ev3.(*parser).mapping(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:285 +0x1c8
gopkg.in/yaml%2ev3.(*parser).parse(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:152 +0xff
gopkg.in/yaml%2ev3.(*parser).parseChild(0xc000100800?, 0xc000203400)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:194 +0x25
gopkg.in/yaml%2ev3.(*parser).mapping(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:285 +0x1c8
gopkg.in/yaml%2ev3.(*parser).parse(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:152 +0xff
gopkg.in/yaml%2ev3.(*parser).parseChild(...)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:194
gopkg.in/yaml%2ev3.(*parser).document(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:203 +0x7d
gopkg.in/yaml%2ev3.(*parser).parse(0xc000100800)
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/decode.go:156 +0xab
gopkg.in/yaml%2ev3.(*Decoder).Decode(0xc00040ee90, {0xdf4340?, 0xc0002032c0})
        /home/runner/go/pkg/mod/gopkg.in/yaml.v3@v3.0.0-20210107192922-496545a6307b/yaml.go:123 +0x12d
github.com/mikefarah/yq/v4/pkg/yqlib.(*yamlDecoder).Decode(0xc00040ee90?, 0xeabfe0?)
        /home/runner/work/yq/yq/pkg/yqlib/decoder_yaml.go:22 +0x25
github.com/mikefarah/yq/v4/pkg/yqlib.(*streamEvaluator).Evaluate(0xc00060dcf0, {0xc000124050, 0x43}, {0xeabfe0?, 0xc000418960?}, 0x1d7f7eb8e80?, {0xeadaf0, 0xc0004188a0}, {0xc0003fc5b8, 0x11}, ...)
        /home/runner/work/yq/yq/pkg/yqlib/stream_evaluator.go:98 +0xc3
github.com/mikefarah/yq/v4/pkg/yqlib.(*streamEvaluator).EvaluateFiles(0xc000144008?, {0xc00011a080, 0x19}, {0xc0002040a0, 0x1, 0x7?}, {0xeadaf0, 0xc0004188a0}, 0x20?, {0xead060, ...})
        /home/runner/work/yq/yq/pkg/yqlib/stream_evaluator.go:73 +0x30e
github.com/mikefarah/yq/v4/cmd.evaluateSequence(0xc000147b80?, {0xc000204090, 0x2, 0x3})
        /home/runner/work/yq/yq/cmd/evalute_sequence_command.go:132 +0x8a5
github.com/spf13/cobra.(*Command).execute(0xc000147b80, {0xc000204060, 0x3, 0x3})
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:856 +0x67c
github.com/spf13/cobra.(*Command).ExecuteC(0xc000147900)
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
        /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.4.0/command.go:902
main.main()
        /home/runner/work/yq/yq/yq.go:22 +0x1f1

@vgrebenschikov

vgrebenschikov commented Dec 5, 2024

Support for streaming processing would be great.
Recently I tried to process a huge CSV file (~10 GB) with yq and it literally ate all my memory, whereas with jq I can process it line by line (though I have to parse the CSV format with jq instructions).

Something like this:

#!/usr/bin/jq -Rn -f 

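# Build an object from a row array, converting numeric-looking values to numbers.
def objectify(headers):
  def tonumberq: tonumber? // .;
  . as $in
  | reduce range(0; headers|length) as $i ({}; .[headers[$i]] = ($in[$i] | tonumberq) );

# Strip line endings, surrounding spaces, and one leading and one trailing double quote.
def trim:
  sub("\n";"") | sub("\r";"") | sub("^ +";"") | sub(" +$";"") | sub("^\"";"") | sub("\"$";"");

# Naive split: commas inside quoted fields are not handled.
def csv2table:
  split(",") | map(trim);

# The first input line is the header row; each later non-empty line becomes one object.
def csv2json:
  first(inputs) | csv2table as $headers |
  inputs | select(length > 0) | csv2table | objectify($headers);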


csv2json
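
For comparison, the same row-at-a-time idea can be sketched with Python's standard `csv` module, which also copes with commas inside quoted fields. This is an illustration under assumptions, not something yq provides: `to_number`, `csv_rows`, and `csv_to_json_lines` are hypothetical names, and Python's number parsing rules differ slightly from jq's `tonumber`.

```python
import csv
import json
from typing import Iterable, Iterator


def to_number(value: str):
    """Return value as int or float if Python parses it as one, else unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def csv_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per CSV row; only the current row is held in memory."""
    reader = csv.reader(lines, skipinitialspace=True)
    header_row = next(reader, None)
    if header_row is None:  # empty input: nothing to yield
        return
    headers = [h.strip() for h in header_row]
    for row in reader:
        if not row:  # skip blank lines, like select(length > 0) in the jq script
            continue
        yield {h: to_number(v.strip()) for h, v in zip(headers, row)}


def csv_to_json_lines(lines: Iterable[str], out) -> int:
    """Write each row as one JSON object per line to out; return the row count."""
    count = 0
    for record in csv_rows(lines):
        out.write(json.dumps(record) + "\n")
        count += 1
    return count
```

Calling `csv_to_json_lines(sys.stdin, sys.stdout)` after `import sys` would stream a file piped on standard input, so memory use stays roughly constant regardless of file size.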
