Meaningful error message required rather than core dump #109
Y:\XXXX\Converted videos, film\Consignment AV>sf -z -csv "Consignment AV.zip"
goroutine 1 [chan receive, 3 minutes]:
main.identifyRdr(0x9c5960, 0xc0424386c8, 0xc04239e580, 0xc042034fc0, 0x809918)
main.readFile(0xc04239e580, 0xc042034fc0, 0x809918)
goroutine 4 [chan receive, 3 minutes]:
goroutine 287 [chan receive, 3 minutes]:
goroutine 321 [semacquire, 3 minutes]:
goroutine 299 [select, 3 minutes]:
Y:\XXXX\Converted videos, film\Consignment AV>
Hey Terry - I've reproduced this locally with a synthetic file (a cool tool on Windows for generating files of arbitrary size full of random bytes: RDFC).
My computer heroically survived unzipping a 2GB file and a 5GB file, but finally choked on a 15GB file, giving a runtime error:
runtime: out of memory: cannot allocate 17179869184-byte block (17242652672 in use)
goroutine 11 [running]:
goroutine 1 [chan receive, 3 minutes]:
goroutine 9 [chan receive, 3 minutes]:
goroutine 12 [chan receive, 3 minutes]:
goroutine 14 [semacquire, 3 minutes]:
goroutine 18 [select, 3 minutes]:
It may be a little hard to guard against this error, given that computers all differ in their RAM capacity, so I can't just set an arbitrary limit of, say, 3GB of zipped content. But I'll see what can be done.
Hmmm, I tried it on the same zip that broke for you, TJ, but I didn't get your error. I got one that looks a lot like my synthetic file error:
runtime: out of memory: cannot allocate 17179869184-byte block (17244651520 in use)
goroutine 277 [running]:
goroutine 1 [chan receive, 9 minutes]:
goroutine 33 [chan receive, 9 minutes]:
goroutine 278 [chan receive, 9 minutes]:
goroutine 280 [semacquire, 9 minutes]:
goroutine 327 [select, 9 minutes]:
TJ: I did some research into this. Detecting and preventing out of memory errors is evidently a hard problem! But the next release of golang (1.10) has something promising: they are working on "APIs for memory and CPU resource control". This will hopefully allow me to detect available memory before attempting to allocate a big slice.
So likely any fix to this won't land before golang 1.10 which is due early 2018.
In the meantime, if you are using the "-z" flag, be aware that if your compressed file contains really big files, you can hit these out of memory errors. A temporary workaround is to unzip before scanning with siegfried.
A possible alternative approach is to back up stream contents to a temp file on disk. That way I won't need to reserve such a large chunk of memory. It is a little less tidy and may mean a significant slowdown in some scenarios, but it will at least avoid things blowing up like this.
... but it took 41 mins :(
Behaviour now is: if sf is reading from a stream (which it does for contents of compressed files and when something is piped to
Picking the right ARBITRARY_LIMIT is a challenge: it really depends on how much RAM different users have to spend. Also consider that you can have streams within streams within streams (e.g. a zip file that contains another zip file that itself contains a zip file), so I might need multiples of the ARBITRARY_LIMIT. With the promised Golang 1.10 features for assessing available memory, I may be able to make this smarter in future.
Currently ARBITRARY_LIMIT is set to ~65MB. I'm open to suggestions for changes to this setting. It could also be made configurable with a flag (e.g. -zlimit) if anyone would use that. E.g. if you have a lot of warc.gz files that are 1GB in size (a common size I think for web harvesting), you'd probably want a 1GB ARBITRARY_LIMIT so you could unload these into RAM.
I think an adjustable limit would be a good idea, given the wide variety in specs for user machines. Perhaps add a short description to the help page to help users estimate their optimal ARBITRARY_LIMIT. Regarding the default limit size, it would be interesting to see how much faster it would be to process the same "Consignment AV.zip" test file if the ARBITRARY_LIMIT is set to 10 times the size (~650MB).