IO error reading multiple Feather files in R #177

Closed
lmullen opened this issue Jun 3, 2016 · 8 comments

Comments

@lmullen lmullen commented Jun 3, 2016

I have 18,500 Feather files (all with the same columns and column types) which I want to read in. So I do this:

library(feather)
library(purrr)
library(dplyr)

paths <- Sys.glob("/media/lmullen/data/chronicling-america/out/*.feather")

read_df <- failwith(NA, function(x) {
  message(x)
  read_feather(x)
})

raw_l <- paths %>% map(read_df)
names(raw_l) <- paths

When I run that code, some of the Feather files (the number varies from about 150 to 1,500) fail to load with this error:

/media/lmullen/data/chronicling-america/out/sn85042907-1919.feather
Error : IO error: Unable to open file

There is nothing actually wrong with those Feather files, though. I can call read_feather(path_to_problem_file) and get back a data frame as expected. I can also collect the paths that failed to load, then run map(paths_that_didnt_load, read_feather) and load all of them fine.
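
For readers following along, the retry step described here might look something like the sketch below. It assumes failed reads were stored as a bare NA, as the failwith(NA, ...) wrapper above does. The helper names is_failed and retry_failed and the reader argument are made up for illustration; they are not part of feather, purrr, or dplyr.

```r
# Hypothetical helpers for illustration; not part of feather or purrr.

# The wrapper stores a bare NA for a failed read; successful reads
# are data frames.
is_failed <- function(x) {
  !is.data.frame(x) && length(x) == 1 && is.na(x)
}

# Re-read only the entries that failed the first time.
# `results` is a list named by path. `reader` defaults to
# feather::read_feather, but any function(path) works.
retry_failed <- function(results, reader = feather::read_feather) {
  failed <- names(results)[vapply(results, is_failed, logical(1))]
  for (p in failed) {
    # Keep NA as the failure marker if the second attempt also fails.
    results[[p]] <- tryCatch(reader(p), error = function(e) NA)
  }
  results
}
```

With the list from the original snippet, this would be called as raw_l <- retry_failed(raw_l). Retrying only works around the symptom; it does not explain why the first pass failed.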

My only suspicion is that Feather is too fast: it reads the files so quickly that the disk can't get to the next file in time. FWIW, the files that fail to load in the batch tend to come in sequence. The files are stored on a RAID 10 array, so it's not as fast as an SSD, but it's fast.

When I put a Sys.sleep(1) call in between loading each file, that cuts down on the number of errors.

I can't think of a good way to provide a reproducible example, but happy to do so if you can give instructions.

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_0.2.2   purrr_0.2.1   feather_0.0.1 dplyr_0.4.3  

loaded via a namespace (and not attached):
 [1] pryr_0.1.2       lazyeval_0.1.10  magrittr_1.5     R6_2.1.2         assertthat_0.1   parallel_3.3.0   DBI_0.4-1       
 [8] tools_3.3.0      tibble_1.0       Rcpp_0.12.5      stringi_1.1.1    codetools_0.2-14 stringr_1.0.0   
Collaborator

@hadley hadley commented Jun 3, 2016

Can you try calling gc() every (say) thousand runs?

maybe_gc <- local({
  i <- 1
  function(every = 1000) {
    if (i %% every == 0) {
      message("GC")
      gc()
    }
    i <<- i + 1
  }
})

Author

@lmullen lmullen commented Jun 3, 2016

I added a call to maybe_gc() in my function read_df() from above, and it loaded all the Feather files without an error.
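
The thread does not show the edited function, so the following is only a guess at what the combined version could look like. The factory make_maybe_gc and the reader argument are hypothetical names, and tryCatch() stands in for failwith(NA, ...). The idea, as far as the discussion suggests, is that a periodic gc() gives R a chance to run finalizers on Feather objects that are no longer referenced.

```r
# Counter-based GC trigger, same idea as the maybe_gc() snippet above.
# make_maybe_gc() is a hypothetical factory added for illustration.
make_maybe_gc <- function(every = 1000) {
  i <- 0
  function() {
    i <<- i + 1
    if (i %% every == 0) {
      gc()
      return(invisible(TRUE))   # a collection was triggered on this call
    }
    invisible(FALSE)
  }
}

maybe_gc <- make_maybe_gc(every = 1000)

# Hypothetical reconstruction of read_df(); the exact edit is not shown
# in the thread. `reader` defaults to feather::read_feather.
read_df <- function(path, reader = feather::read_feather) {
  maybe_gc()  # every 1000th call, force a garbage collection
  # Return NA instead of stopping when a single file fails to load.
  tryCatch(reader(path), error = function(e) NA)
}
```

It would be used exactly like the original wrapper, for example raw_l <- lapply(paths, read_df).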

Collaborator

@hadley hadley commented Jun 3, 2016

@krlmlr any thoughts on how we could solve this problem? This is the downside of the lazy approach - if you are reading too many files, I'm assuming the OS runs out of file locks.

Contributor

@krlmlr krlmlr commented Jun 3, 2016

I guess we need close.feather() and call it from read_feather(). That's the one downside of the feather class -- PR coming soon.
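
As a generic illustration of the idea behind an explicit close (and not a description of feather's internals or of the eventual PR), here is how deterministic cleanup is commonly written for a base R connection. The function name read_all_lines is made up for this sketch.

```r
# Generic illustration with a base R file connection; NOT feather's
# implementation. read_all_lines() is a hypothetical helper.
read_all_lines <- function(path) {
  con <- file(path, open = "r")
  # Close the handle as soon as the function exits, even on error,
  # instead of leaving it open until the garbage collector runs.
  on.exit(close(con), add = TRUE)
  readLines(con)
}
```

Because the handle is released on every call, a loop over thousands of files never holds more than one of them open at a time, which is the property the batch read above appears to need.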

Contributor

@krlmlr krlmlr commented Jun 7, 2016

@lmullen: This should be fixed now, could you please confirm?

Author

@lmullen lmullen commented Jun 7, 2016

After reinstalling Feather from master, I re-ran my original code above. The first time I got one feather file that failed to load. Every other time I've run it all the files loaded correctly. I can't reproduce that first error, so I think it is fixed.

Thanks!

@jayelm jayelm commented Jun 3, 2018

Is it possible this same problem could happen in Python, with pandas' read_feather? I'm encountering a similar error in similar circumstances (loading tens of thousands of feather files).

Owner

@wesm wesm commented Jun 3, 2018

Could you report the issue either on pandas's issue tracker or the JIRA for Apache Arrow? Thanks
