Feather, CSV, or Rdata
You have a large tabular dataset created on a server, and you need to transfer it to your local machine for analysis. Do you use feather, CSV, or Rdata? Based on some quick and dirty n=1 benchmarks (below) and each format's features, I'd summarize it the following way:
If you need to analyze it across Python and R, use feather. It's much faster and smaller than CSV. Rdata won't work with Python.
If you have list columns, use Rdata.
Rdata takes longer to write, but it compresses down well. Use it if your network connection is slow.
Feather is smaller than CSV, and reads and writes a lot faster. If your pipeline is tabular data, feather is a good approach.
Feather and Rdata store the column types. This is a big benefit (and the original reason I tried feather), as I have a dataset that's 19 columns wide, and read_csv() incorrectly guessed the column types. Specifying 19 column types manually would be quite a pain. Feather solves this issue.
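To make the column-type point concrete, here is a minimal sketch on a made-up two-column data frame (not my real data). It uses base R's write.csv()/read.csv() rather than readr's read_csv(), and save()/load() for Rdata, so it runs without extra packages; feather is left out for the same reason. The column names and values are purely illustrative.

```r
# Toy data frame (hypothetical): zero-padded IDs should stay character,
# and the date column should stay a Date.
df <- data.frame(
  id  = c("00123", "04567"),
  day = as.Date(c("2020-01-01", "2020-01-02")),
  stringsAsFactors = FALSE
)

csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".Rdata")

# CSV is plain text, so the reader has to guess each column's type.
write.csv(df, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Rdata stores the R object itself, column types included.
save(df, file = rdata_path)
env <- new.env()
load(rdata_path, envir = env)
from_rdata <- env$df

# Compare the first class of every column after each round trip.
csv_types   <- sapply(from_csv, function(col) class(col)[1])
rdata_types <- sapply(from_rdata, function(col) class(col)[1])
print(csv_types)
print(rdata_types)
```

After the CSV round trip the IDs come back as integers (leading zeros gone) and the dates as plain strings, while the Rdata round trip returns the original character and Date columns.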
The dataset in R is 7,384,600 x 19, ~0.9 GB. These are all n=1 benchmarks, because I don't have time to fuss with careful benchmarks. It's unlikely the ranking would change, and the speed factors are reasonable estimates.
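If you do want something a bit sturdier than n=1, a rough sketch of repeating a timing in base R could look like the following. The helper name time_runs and the toy data frame are my own inventions for illustration, and the toy data is far smaller than the real dataset, so the absolute numbers mean little.

```r
# Hypothetical helper: call f() n times and return the elapsed
# seconds of each run as a numeric vector.
time_runs <- function(f, n = 5) {
  vapply(seq_len(n), function(i) system.time(f())[["elapsed"]], numeric(1))
}

# Small made-up data frame, only there to have something to write.
toy  <- data.frame(x = runif(1e4), y = sample(letters, 1e4, replace = TRUE))
path <- tempfile(fileext = ".csv")

# Time five CSV writes and look at the spread instead of a single number.
elapsed <- time_runs(function() write.csv(toy, path, row.names = FALSE))
print(summary(elapsed))
```

Swapping the function passed to time_runs() for a feather or Rdata write would give comparable numbers for the other formats.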
$ ls -lagh data.*
-rw-rw-r--. 1 vinceb 1.2G Apr 13 21:32 data.csv
-rw-r--r--. 1 vinceb 931M Apr 13 21:32 data.feather
-rw-rw-r--. 1 vinceb  98M Apr 13 21:39 data.Rdata
CSV is ~12 times larger than Rdata, and feather is ~9.5 times larger than Rdata. Feather is 77% the size of uncompressed CSV.
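The size ratios can be rechecked from the listing with a few lines of R. This is only a sketch from the rounded, human-readable sizes that ls printed (I'm assuming 1.2G means 1.2 * 1024 MiB), so the results can differ slightly from ratios based on exact byte counts.

```r
# Sizes as printed by `ls -h`, in MiB; 1.2G is taken as 1.2 * 1024 (assumption).
size_mib <- c(csv = 1.2 * 1024, feather = 931, rdata = 98)

# How many times larger each file is than the Rdata file.
vs_rdata <- round(size_mib / size_mib[["rdata"]], 1)
print(vs_rdata)

# Feather as a fraction of the CSV size.
feather_vs_csv <- size_mib[["feather"]] / size_mib[["csv"]]
print(round(feather_vs_csv, 2))
```

With these rounded inputs, CSV comes out around 12.5 times the Rdata size, feather 9.5 times, and feather about 0.76 of CSV, which is in line with the figures quoted above given the rounding in the listing.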
So Rdata transfers fastest across a network.
> system.time(d <- read_csv('data.csv'))
   user  system elapsed
 36.833   5.432  52.843
> system.time(d <- read_feather('data.feather'))
   user  system elapsed
  0.732   1.025   3.362
> system.time(d <- load('data.Rdata'))
   user  system elapsed
  2.451   0.183   2.643
Using elapsed time, CSV takes ~20 times longer to read than Rdata (yikes!), and feather takes ~1.3 times longer than Rdata. Another win for Rdata.
> system.time(save(a, file='data.Rdata'))
   user  system elapsed
 14.687   0.074  14.761
> system.time(write_csv(a, 'data.csv'))
   user  system elapsed
 23.542   1.029  24.570
> system.time(write_feather(a, 'data.feather'))
   user  system elapsed
  0.596   1.663   5.154
Feather is the fastest to write: using elapsed time, Rdata takes ~2.9 times longer, and CSV ~4.8 times longer.
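For anyone who wants to recheck the speed factors, they can be recomputed from the elapsed times printed above. This is just arithmetic on those single runs, with each format compared against the fastest one in its group.

```r
# Elapsed seconds copied from the system.time() output above.
read_elapsed  <- c(csv = 52.843, feather = 3.362, rdata = 2.643)
write_elapsed <- c(csv = 24.570, feather = 5.154, rdata = 14.761)

# Ratio of each format to the fastest format for that operation.
read_ratio  <- round(read_elapsed / min(read_elapsed), 1)
write_ratio <- round(write_elapsed / min(write_elapsed), 1)

print(read_ratio)
print(write_ratio)
```

Reading, CSV and feather come out at about 20 and 1.3 times the Rdata time. Writing, Rdata and CSV come out at about 2.9 and 4.8 times the feather time.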