
Issue with reading large compressed file (.zip and .gz) #116

Closed · mskyttner opened this issue May 23, 2019 · 10 comments

@mskyttner commented May 23, 2019

My use case is to avoid loading some data into Apache Spark and instead use the fantastic vroom package (thanks!) to read a large compressed file of occurrence data, in order to check whether there are any duplicates in a column called "occurrenceID".

The dataset I am using is available for download here: http://www.gbif.se/ipt/archive.do?r=artdata

It contains an occurrence.txt file which is approx 70 GB when unpacked.

In my attempt to read the data I use compressed files: one is the original file from the download link above, a .zip file (Darwin Core Archive) with that occurrence.txt file inside, and the other is a .gz of the same occurrence.txt file.

The .zip file is 5.7 GB and the .gz file is 5.2 GB on disk.

To test this approach I start R with tidyverse and install the vroom package:

# /data is mounted and holds the .zip and .gz files
docker run --rm -it -v archive-docker_data_archive:/data:ro rocker/tidyverse:3.6.0 bash

I then issue these commands after starting R:

install.packages("vroom")

library(vroom)

# attempt 1
df <- vroom("/data/artportalen/artdata.zip", col_select = "occurrenceID")

This gives this message:

Multiple files in zip: reading 'occurrence.txt'
Error in vroom_(file, delim = delim, col_names = col_names, col_types = col_types,  : 
  Evaluation error: Unknown column `occurrenceID` 
Call `rlang::last_error()` to see a backtrace.
In addition: Warning message:
In (function (con, what, n = 1L, size = NA_integer_, signed = TRUE,  :
  possible truncation of >= 4GB file

So to work around the .zip limit I instead attempt to use a .gz variant of the same occurrence.txt data.

# attempt #2
df <- vroom("/data/artportalen/artdata-occ.gz", col_select = "occurrenceID")

# .... progress is reported, indexing over 100 GB for around 10 minutes, then ...

This causes a segfault:

 *** caught segfault ***
address 0x7fbd515ae292, cause 'memory not mapped'

Traceback:
 1: vroom_(file, delim = delim, col_names = col_names, col_types = col_types,     id = id, skip = skip, col_select = col_select, na = na, quote = quote,     trim_ws = trim_ws, escape_double = escape_double, escape_backslash = escape_backslash,     comment = comment, locale = locale, guess_max = guess_max,     n_max = n_max, altrep_opts = vroom_altrep_opts(altrep_opts),     num_threads = num_threads, progress = progress)
 2: vroom("/data/artportalen/artdata-occ.gz", col_select = "occurrenceID")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 

Both attempts fail.

The attempt with the .zip file complains about the column name, but I don't think it is misspelled; it is perhaps a consequence of a truncated read. The warning message leads me to this SO post, which I don't know how to work around:

https://stackoverflow.com/questions/42740206/r-possible-truncation-of-4gb-file

The attempt with the .gz file appears to read more than 100 GB before it crashes, which is strange since the file is 5.2 GB compressed and should be about 70 GB when unpacked.

Therefore I am reporting this as a possible bug.

@jimhester (Member) commented May 24, 2019

Did you try without compressing the file, e.g. using the 70 GB file directly?

Another thing to do is to read all the columns as characters, since you are only concerned with one column anyway:

df <- vroom("/data/artportalen/artdata-occ.gz", col_types = list(.default = "c"), col_select = "occurrenceID")

@jimhester (Member) commented May 24, 2019

The discrepancy in size (70 vs 100) is likely because vroom is reporting the size in gigabytes (1000-based units), while the file system is reporting it in gibibytes (1024-based units).

@jimhester (Member) commented May 24, 2019

Often these types of problems are best solved with command-line tools, e.g. you can count the number of unique values of collectorId with

time unzip -p dwca-artdata-v92.144.zip occurrence.txt| cut -f 11 | awk '!a[$0]++' | wc -l
1210632
unzip -p dwca-artdata-v92.144.zip occurrence.txt  234.51s user 8.95s system 99% cpu 4:04.48 total

The awk '!a[$0]++' part looks strange. It is shorthand that uses a hash to print only lines that haven't been seen before; it is a faster alternative to sort | uniq.

On my laptop this only took ~ 4 minutes to run and used limited memory and disk space.

That said, this file seems to have uncovered a number of different issues in vroom; I am looking into them.

@mskyttner (Author) commented May 24, 2019

Thanks so much for those suggestions, and for vroom and the xml2 package too.

Using sparklyr I can do some of the validation checks I need for this dataset, validating presence and uniqueness of identifiers, in decent time, but loading the file, converting it to parquet, and then persisting changes back to disk are steps that take a little more time than patience allows for. Some more details here: gbif/ipt#1450

If a more lightweight approach with vroom were possible I'd prefer that, especially if I could work with compressed files to save some space, even though I guess reading the file will be a lot slower.

For comparison, I tried checking any(duplicated(occurrenceID)) on an in-memory vector of 70M random 41-character identifiers: it uses 600-700 MB of memory and takes a few seconds, and the radix sort that sort() does is also fast for in-memory data at about 2 minutes. The R code I used for that test is here: https://github.com/gbif/ipt/files/3182719/check_occurrenceid.pdf
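
For reference, a rough, scaled-down sketch of that in-memory test (the identifier length matches, but n is reduced here so the example runs quickly; the original test used 70M IDs):

# Scaled-down sketch of the in-memory check described above.
n <- 1e5  # the original test used 70e6
ids <- vapply(
  seq_len(n),
  function(i) paste(sample(c(letters, LETTERS, 0:9), 41, replace = TRUE), collapse = ""),
  character(1)
)
system.time(any(duplicated(ids)))       # hash-based duplicate check
system.time(sort(ids, method = "radix")) # radix sort of the same vector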

I tried the other variants you suggested with vroom but got stuck again; perhaps the cloud server ran out of memory or something else happened. The process is still going and hasn't crashed, but doesn't seem to be doing anything:

vroom("/data/artportalen/artdata-occ.gz", col_types = list(.default = "c"), col_select = "occurrenceID")
indexed 8.03GB in 49s, 162.47MB/s^C   
                                        
df <- vroom("/data/artportalen/occurrence.txt", col_types = list(.default = "c"), col_select = "occurrenceID")
indexing occurrence.txt [-----------------------------------] 7.38TB/s, eta:  0s

The docker process for the latest call is at 0% CPU and uses 12% of available memory on the box.

@jimhester (Member) commented May 24, 2019

vroom still has to uncompress the file, it just does it in a temp directory, so it does not save any space compared to uncompressing it yourself.

vroom stores the indexes of all the delimiters in the file in memory, so it is still going to use a lot of memory for a file this big even if you only select one column.
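
As a rough back-of-envelope illustration (the row and column counts below are assumed round figures, not the actual dimensions of this dataset):

# Assumed figures for illustration only: if the index keeps roughly one
# 8-byte offset per field, a file of this shape needs on the order of
rows <- 100e6
cols <- 40
bytes_per_offset <- 8
rows * cols * bytes_per_offset / 1e9
# about 32 GB for the index alone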

One thing you could do, if you are only interested in a handful of columns, is to use cut and pipe the result into vroom, e.g. something like

df <- vroom::vroom(pipe("cut -f 11 occurrence.txt"))
length(unique(df$collectionID))
#> [1] 1210631

That works without a great deal of memory and disk space as it only has to index the single column.

You could also stream the compressed file into cut, as I showed above, which would avoid uncompressing the full file to disk, e.g.

df <- vroom::vroom(pipe("unzip -p dwca-artdata-v92.144.zip occurrence.txt | cut -f 11"))

@jimhester (Member) commented May 24, 2019

Also this file uncovered an overflow issue with exceptionally big files, now fixed by 190cf8d.

There also seems to be a possible overflow issue with the progress bar for these extremely large files, which I am looking into now; in the meantime you may want to run with progress = FALSE. Edit: fixed by e359b27

@jimhester (Member) commented May 24, 2019

Oh, also the heuristic used to guess the delimiter doesn't seem to work here, but specifying it explicitly does, e.g. delim = "\t".

And there is a quote character within a field, so turning off quoting is needed as well, with quote = ''.
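
Putting the suggestions in this thread together, a call along these lines should work for the uncompressed file (untested sketch, path as in the earlier attempts):

library(vroom)

# Sketch combining the advice above: explicit tab delimiter, quoting disabled,
# progress bar off, and every column read as character.
df <- vroom(
  "/data/artportalen/occurrence.txt",
  delim = "\t",
  quote = "",
  col_types = list(.default = "c"),
  col_select = "occurrenceID",
  progress = FALSE
)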

jimhester added a commit that referenced this issue May 24, 2019

@jimhester (Member) commented May 24, 2019

The zip size limit is a limitation in base R, so there isn't much to be done there really.

@mskyttner (Author) commented May 24, 2019

Thanks for the suggestions and advice; for me the pipe sounds like a good workflow. For this file I should be able to use the 0-based indexes in the meta.xml contained inside the Darwin Core Archive zip to enumerate the columns for the cut -f command, and then work through individual columns of interest one at a time, such as the occurrenceID column. That column-by-column approach should offer a workflow that both fits in R memory and doesn't generate huge temp files on disk.
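
A hypothetical sketch of that lookup, assuming meta.xml has been extracted from the archive and follows the usual Darwin Core Archive layout (one field element per column, with a 0-based index attribute and a term URI ending in the column name):

library(xml2)

# Hypothetical: find the 0-based index of occurrenceID in meta.xml and build
# the matching 1-based cut -f argument for the pipe into vroom.
meta <- read_xml("meta.xml")
xml_ns_strip(meta)                  # drop the default namespace for simple XPath
fields <- xml_find_all(meta, ".//field")
terms <- xml_attr(fields, "term")
idx0 <- as.integer(xml_attr(fields[grepl("occurrenceID$", terms)], "index"))[1]
cmd <- sprintf("unzip -p /data/artportalen/artdata.zip occurrence.txt | cut -f %d", idx0 + 1)
df <- vroom::vroom(pipe(cmd), delim = "\t", quote = "")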

Regarding the quote character within a field: for this occurrence.txt file I think the occurrenceRemarks field is pretty nasty and may have quotes and other characters that should be escaped, but I haven't validated that. Somehow the Darwin Core Archive parser at gbif.org can deal with it, although that particular column probably needs more cleaning.

@jimhester (Member) commented May 24, 2019

Generally, tab-separated file formats just ignore quotes entirely; quoting is on by default in vroom because it is optimized for CSV, so it needs to be turned off manually.
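
A tiny made-up illustration of the point: a stray quote inside a tab-separated field parses fine once quoting is disabled.

# Made-up two-row example with an unbalanced quote in one field.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tremark", "1\tabout 5\" long", "2\tok"), tmp)
vroom::vroom(tmp, delim = "\t", quote = "")  # parses as 2 rows x 2 columns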
