Issue with reading large compressed file (.zip and .gz) #116
My use case is trying to avoid loading some data into Apache Spark and instead use the fantastic vroom package (thanks!) to read a large compressed file with occurrence data, in order to then check whether there are any duplicates in a column called "occurrenceID".
The dataset I am using is available for download here: http://www.gbif.se/ipt/archive.do?r=artdata
It contains an occurrence.txt file which is approx 70 GB when unpacked.
In my attempts to read the data I use compressed files: one is the original file from the download link above, a .zip file (a Darwin Core Archive) with that occurrence.txt file inside, and the other is a .gz of that same occurrence.txt file.
The .zip file is 5.7 GB and the .gz file is 5.2 GB on disk.
To test this approach I start R with tidyverse and install the vroom package:
# /data is mounted and holds the .zip and .gz files
docker run --rm -it -v archive-docker_data_archive:/data:ro rocker/tidyverse:3.6.0 bash
I then issue these commands after starting R:
install.packages("vroom") library(vroom) # attempt 1 df <- vroom("/data/artportalen/artdata.zip", col_select = "occurrenceID")
This gives this message:
So to work around the .zip limitation, I instead attempt to use a .gz variant of the same occurrence.txt data.
# attempt 2
df <- vroom("/data/artportalen/artdata-occ.gz", col_select = "occurrenceID")
# ... progress is reported, indexing over 100 GB for around 10 minutes, then ...
This causes a segfault:
Both attempts fail.
The attempt with the .zip file complains about the column name being faulty, but I think the name is not misspelled; it is perhaps a consequence of an interrupted read. The error message leads me to this SO post, which I don't know how to work around:
The attempt with the .gz file appears to index more than 100 GB before it crashes, which is strange since the file is 5.2 GB compressed and should be less than 70 GB when unpacked.
Therefore reporting this as a possible bug.
Did you try without compressing the file, e.g. using the 70 GB file directly?
Another thing to try is reading all the columns as characters, since you are only concerned with one column anyway:
df <- vroom("/data/artportalen/artdata-occ.gz", col_types = list(.default = "c"), col_select = "occurrenceID")
Often these types of problems are best solved with command line tools, e.g. you can count the number of unique values of a single column directly:
time unzip -p dwca-artdata-v92.144.zip occurrence.txt | cut -f 11 | awk '!a[$0]++' | wc -l
1210632
unzip -p dwca-artdata-v92.144.zip occurrence.txt  234.51s user 8.95s system 99% cpu 4:04.48 total
On my laptop this only took ~ 4 minutes to run and used limited memory and disk space.
That said, this file seems to have uncovered a number of different issues in vroom; I am looking into it.
Thanks so much for those suggestions and also for
And if a more lightweight approach with
For comparison, I tried checking any(duplicated(occurrenceID)) on a vector of 70M occurrenceIDs made of random 41-character identifiers: in-memory R uses 600-700 MB of memory and takes a few seconds, and the radix sort that the sort() function does is also fast for in-memory data, at about 2 minutes. The R code I used for that test is here: https://github.com/gbif/ipt/files/3182719/check_occurrenceid.pdf
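For reference, a minimal sketch of the kind of test described above (an assumed reconstruction, not the exact script from the linked PDF; stringi::stri_rand_strings() is just one way to generate the random identifiers):

library(stringi)

n   <- 70e6                               # number of identifiers in the test
ids <- stri_rand_strings(n, 41)           # random 41-character identifiers

system.time(any(duplicated(ids)))         # a few seconds, several hundred MB
system.time(sort(ids, method = "radix"))  # roughly two minutes in the test described above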
I tried the other variants you suggested with
The docker process for the latest call is at 0% CPU and uses 12% of available memory on the box.
vroom still has to uncompress the file; it just does it in a temp directory, so it does not save any space compared to uncompressing it yourself.
vroom stores the indexes of all the delimiters in the file in memory, so it is still going to use a lot of memory for a file this big even if you only select one column.
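As a rough illustration of why the index alone is large, here is some back-of-the-envelope arithmetic (the field count and bytes-per-offset below are assumptions for the sake of the example, not measured values):

rows   <- 70e6   # roughly the number of occurrence records mentioned above
fields <- 40     # hypothetical number of delimited fields per row
bytes  <- 8      # assumed size of one stored delimiter offset
rows * fields * bytes / 1024^3
#> on the order of 20 GiB of index data, even if only one column is selected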
One thing you could do if you are only interested in a handful of columns is use cut in a pipe():
df <- vroom::vroom(pipe("cut -f 11 occurrence.txt")) length(unique(df$collectionID)) #>  1210631
That works without a great deal of memory and disk space as it only has to index the single column.
You could also combine this approach with streaming the compressed file into cut, like I showed above, which would avoid uncompressing the full file to disk, e.g.
df <- vroom::vroom(pipe("unzip -p dwca-artdata-v92.144.zip occurrence.txt | cut -f 11))
Also this file uncovered an overflow issue with exceptionally big files, now fixed by 190cf8d.
There also seems to be a possible overflow issue with the progress bar for these extremely large files, which I am looking into now; in the meantime you may want to run with the progress bar disabled.
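For example, something along these lines should turn the bar off for the earlier attempt (assuming the same file path as above; progress is the vroom() argument that controls it):

# Read only the occurrenceID column, all types as character, no progress bar
df <- vroom("/data/artportalen/artdata-occ.gz",
            col_types = list(.default = "c"),
            col_select = "occurrenceID",
            progress = FALSE)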
Thanks for the suggestions and advice; for me the pipe sounds like a good workflow. For this file I should be able to use the 0-based indexes within the meta.xml contained inside the Darwin Core Archive zip for enumerating the columns for the cut command.
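A rough sketch of what I have in mind for that, assuming the usual Darwin Core Archive meta.xml layout (untested against this particular archive):

library(xml2)

# Read meta.xml straight out of the archive and drop the default namespace
meta <- read_xml(unz("dwca-artdata-v92.144.zip", "meta.xml"))
xml_ns_strip(meta)

# Each <field> in the core has a 0-based "index" and a "term" URI
fields <- xml_find_all(meta, "//core/field")
terms  <- xml_attr(fields, "term")
idx0   <- as.integer(xml_attr(fields, "index"))

# cut -f is 1-based, so shift the occurrenceID index by one
cut_field <- idx0[grepl("occurrenceID$", terms)] + 1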
Regarding the quote character within a field: for this occurrence.txt file I think the occurrenceRemarks field is pretty nasty and may contain quotes and other characters that should be escaped, but I haven't validated that. Somehow the Darwin Core Archive parser at gbif.org can deal with it, although that particular column probably needs more cleaning.
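If the embedded quotes do turn out to be unbalanced, one option (a sketch only, not verified against this file) is to read the tab-delimited file with quoting disabled via vroom's quote argument, so stray double quotes inside fields such as occurrenceRemarks are treated as plain data:

# Treat '"' as ordinary text rather than a quoting character
df <- vroom::vroom("occurrence.txt",
                   delim = "\t",
                   quote = "",
                   col_select = "occurrenceID")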