Reading Feather files from a network file is much slower than RDS #342

Closed
maxmoro opened this issue Jun 30, 2018 · 5 comments

@maxmoro commented Jun 30, 2018

Reading a very large Feather file over the network is much slower than reading an RDS file. Is this a bug, or am I doing something wrong?
In this reproducible example, I wrote 524 thousand rows to a local folder and to a network folder. In the local folder Feather is very quick, but on the network folder read_feather took 139 s while readRDS took 1.11 s!

Can you help me?

library(feather)

data=mtcars;for (i in 1:14) {data=rbind(data,data)}
nrow(data) #524 K rows
#> [1] 524288

##### LOCAL ####
r='c:/temp/dataTest.rds'
f='c:/temp/dataTest.feather'

## Saving RDS on Local
system.time(saveRDS(data,r))
#>    user  system elapsed 
#>       1       0       1
## Reading RDS on Local
system.time(readRDS(r))
#>    user  system elapsed 
#>    0.44    0.02    0.45

## Saving Feather on Local
system.time(feather::write_feather(data,f))
#>    user  system elapsed 
#>    0.02    0.01    0.03
## Reading Feather on Local
system.time(feather::read_feather(f))
#>    user  system elapsed 
#>    0.00    0.03    0.05

file.remove(r,f)
#> [1] TRUE TRUE

##### NETWORK ####
r='//server/folder/dataTest.rds'
f='//server/folder/dataTest.feather'
## Saving RDS on Network
system.time(saveRDS(data,r))
#>    user  system elapsed 
#>    1.08    0.05    1.33
## Reading RDS on Network
system.time(readRDS(r))
#>    user  system elapsed 
#>    0.42    0.00    1.11

##  Saving Feather on Network
system.time(feather::write_feather(data,f))
#>    user  system elapsed 
#>    0.01    0.05    0.60
## Reading Feather on Network
system.time(feather::read_feather(f))
#>    user  system elapsed 
#>    0.02    0.20  139.51

file.remove(r,f)
#> [1] TRUE TRUE
Created on 2018-06-29 by the reprex package (v0.2.0).

@wesm (Owner) commented Jun 30, 2018

That's a very odd quirk. We memory-map the files by default, and evidently that performs very poorly on your network.

@maxmoro (Author) commented Jun 30, 2018

I noticed that the slowdown appears to be exponential. With 256k rows it is still acceptable, but as the number of rows increases it becomes very slow. 2M rows may take hours.
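
One way to check how read time scales with row count is to time reads at successive doublings. The sketch below is only an illustration: `time_reads` is a hypothetical helper, and it defaults to `saveRDS`/`readRDS` in a temporary directory so that it runs with base R alone. The timings it prints depend entirely on the machine and file system.

```r
# Hypothetical helper: time one read per data size, doubling the rows each step.
time_reads <- function(dir, doublings, writer = saveRDS, reader = readRDS) {
  rows <- integer(0)
  elapsed <- numeric(0)
  for (k in doublings) {
    # Repeat mtcars 2^k times: 32 * 2^k rows.
    data <- mtcars[rep(seq_len(nrow(mtcars)), times = 2^k), ]
    path <- tempfile(tmpdir = dir)
    writer(data, path)
    elapsed <- c(elapsed, system.time(reader(path))[["elapsed"]])
    rows <- c(rows, nrow(data))
    unlink(path)
  }
  data.frame(rows = rows, elapsed = elapsed)
}

timings <- time_reads(tempdir(), doublings = 8:11)
print(timings)
```

If `elapsed` roughly doubles from one row to the next, the cost is linear in the number of rows; a consistently larger ratio points to something worse. To reproduce the case in this issue, one would pass the network folder as `dir` and `feather::write_feather` / `feather::read_feather` as `writer` and `reader`.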

@wesm (Owner) commented Jul 5, 2018

Unfortunately, the memory mapping flag is not exposed right now in the R API. I don't have a timeline for a fix, but a PR would be welcome. In the meantime, I suggest you copy files locally before reading them on this particular network; I'm sorry for the inconvenience.
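
The copy-first workaround could be wrapped in a small helper along these lines. This is a minimal sketch, not part of feather: `read_via_local_copy` is a hypothetical name, and the demo uses `saveRDS`/`readRDS` on a temporary file only so that it runs without a network share.

```r
# Hypothetical helper: copy a (possibly remote) file to local temp storage,
# read the local copy, then delete it.
read_via_local_copy <- function(path, reader) {
  local_path <- tempfile(fileext = paste0(".", tools::file_ext(path)))
  on.exit(unlink(local_path), add = TRUE)
  ok <- file.copy(path, local_path, overwrite = TRUE)
  if (!ok) stop("could not copy ", path, " to ", local_path)
  reader(local_path)
}

# Demo with a local RDS file standing in for the network file.
src <- tempfile(fileext = ".rds")
saveRDS(mtcars, src)
result <- read_via_local_copy(src, readRDS)
```

For the case in this issue the call would presumably look like `read_via_local_copy('//server/folder/dataTest.feather', feather::read_feather)`. That assumes the reader has finished with the file by the time it returns, since the local copy is removed on exit.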

@hadley (Collaborator) commented Jan 7, 2019

Also, .rds is compressed by default and .feather is not. I suspect that, plus a slow file system, is the root cause.
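
The effect of compression on file size can be checked with base R alone, since `saveRDS` accepts `compress = FALSE`. A small sketch, using a smaller version of the data from the original report (10 doublings instead of 14, to keep it quick):

```r
data <- mtcars
for (i in 1:10) data <- rbind(data, data)  # 32 * 2^10 = 32768 rows

compressed <- tempfile(fileext = ".rds")
uncompressed <- tempfile(fileext = ".rds")
saveRDS(data, compressed)                      # gzip compression by default
saveRDS(data, uncompressed, compress = FALSE)  # same content, no compression

sizes <- file.size(c(compressed, uncompressed))
names(sizes) <- c("compressed", "uncompressed")
print(sizes)
```

The uncompressed file is the one comparable to a Feather V1 file in terms of bytes that have to cross the network. This only compares sizes; it does not by itself explain a 139 s read.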

@wesm (Owner) commented Apr 10, 2020

Feather V2 (coming in arrow 0.17.0) has lz4 and zstd compression support, so it should be a lot faster to read over a network.
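
With the arrow R package installed, writing a compressed Feather V2 file might look like the sketch below. It assumes a recent arrow release in which `write_feather()` takes a `compression` argument and `read_feather()` takes an `mmap` argument; check the documentation for the installed version. The whole block is skipped if arrow is missing or was built without lz4.

```r
sizes <- NULL  # stays NULL if arrow or the lz4 codec is unavailable

if (requireNamespace("arrow", quietly = TRUE) &&
    arrow::codec_is_available("lz4")) {
  data <- mtcars
  for (i in 1:10) data <- rbind(data, data)  # 32768 rows

  plain <- tempfile(fileext = ".feather")
  packed <- tempfile(fileext = ".feather")
  arrow::write_feather(data, plain, compression = "uncompressed")
  arrow::write_feather(data, packed, compression = "lz4")

  sizes <- c(uncompressed = file.size(plain), lz4 = file.size(packed))
  print(sizes)

  # Assumption: mmap = FALSE turns off memory mapping for this read.
  roundtrip <- arrow::read_feather(packed, mmap = FALSE)
  print(dim(roundtrip))
}
```

Whether this actually helps on the network share from the original report is untested here; the sketch only shows that the compressed file is smaller and reads back.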

@wesm wesm closed this as completed Apr 10, 2020