
Feather, CSV, or Rdata

Vince Buffalo edited this page Apr 14, 2018 · 5 revisions

You have a large tabular dataset created on a server, and you need to transfer it to your local machine for analysis. Do you use feather, CSV, or Rdata? Here is how I'd summarize some quick and dirty n=1 benchmarks (below) and each format's features:

  • If you need to analyze it across Python and R, use feather. It's much faster and smaller than CSV. Rdata won't work with Python.

  • If you have list columns, use Rdata.

  • Rdata takes longer to write than feather, but it compresses well. Use it if your network connection is slow.

  • Feather is smaller than CSV, and reads and writes a lot faster. If your pipeline is tabular data, feather is a good approach.

  • Feather and Rdata store the column types. This is a big benefit (and the original reason I tried feather), as I have a dataset that's 19 columns wide, and readr's read_csv() incorrectly guessed the column types. Specifying the types of 19 columns by hand would be quite a pain. Feather solves this issue.
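To make the column-type point concrete, here is a minimal sketch in base R using a small made-up data frame (not the dataset benchmarked on this page). It shows a Date and a factor column coming back as plain character after a CSV round trip, while save()/load() returns the object unchanged, list column included. The feather step only runs if the feather package happens to be installed.

```r
# Toy data, invented for illustration only.
d <- data.frame(
  day = as.Date(c("2018-04-13", "2018-04-14", "2018-04-15")),
  grp = factor(c("a", "b", "a"))
)
first_class <- function(df) vapply(df, function(x) class(x)[1], character(1))

# CSV round trip: the file holds only text, so types are re-guessed on read.
csv_path <- tempfile(fileext = ".csv")
write.csv(d, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
csv_types <- first_class(from_csv)   # both columns come back as "character"

# Feather round trip (optional): column types are stored in the file.
if (requireNamespace("feather", quietly = TRUE)) {
  feather_path <- tempfile(fileext = ".feather")
  feather::write_feather(d, feather_path)
  print(first_class(feather::read_feather(feather_path)))
}

# Rdata round trip: add a list column, then save and load into a fresh environment.
d$tags <- list(c("x", "y"), "z", character(0))
rdata_path <- tempfile(fileext = ".Rdata")
save(d, file = rdata_path)
env <- new.env()
load(rdata_path, envir = env)
rdata_same <- identical(env$d, d)    # TRUE: types and list column intact
```

The list column is added only after the CSV and feather steps because write.csv() cannot write one.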

Benchmarks

The dataset in R is 7,384,600 x 19, ~0.9 GB. These are all n=1 benchmarks, because I don't have time to fuss with careful benchmarks. The ranking is unlikely to change, and the speed factors are reasonable estimates.
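If you do want more than n=1, a rough sketch of repeating a timing in base R might look like the following. The helper name time_repeated and the toy data are made up for illustration; nothing like this was used for the numbers on this page.

```r
# Hypothetical helper: run a function `times` times and summarise elapsed seconds.
time_repeated <- function(f, times = 5) {
  elapsed <- vapply(seq_len(times), function(i) {
    system.time(f())[["elapsed"]]
  }, numeric(1))
  c(min = min(elapsed), median = median(elapsed), max = max(elapsed))
}

# Toy CSV, far smaller than the dataset benchmarked here.
path <- tempfile(fileext = ".csv")
small <- data.frame(x = runif(1e4), y = sample(letters, 1e4, replace = TRUE))
write.csv(small, path, row.names = FALSE)

res <- time_repeated(function() read.csv(path), times = 5)
res
```

Reporting the median alongside the minimum and maximum gives a feel for how noisy a single run is, for example because of disk caching on the first read.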

Data Size

$ ls -lagh data.*
-rw-rw-r--. 1 vinceb 1.2G Apr 13 21:32 data.csv
-rw-r--r--. 1 vinceb 931M Apr 13 21:32 data.feather
-rw-rw-r--. 1 vinceb  98M Apr 13 21:39 data.Rdata

CSV is ~12 times larger than Rdata, and feather is ~9.5 times larger than Rdata. Feather is ~77% the size of the uncompressed CSV.

So, Rdata transfers the fastest across a network.
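Much of the Rdata size advantage comes from compression, since save() gzip-compresses by default. As a rough illustration on made-up, highly repetitive data (so the ratios will not match the ones above), the same comparison can be sketched with file.size():

```r
# Highly repetitive toy data, so it compresses unusually well; real data will vary.
toy <- data.frame(x = rep(1:10, 1e4), y = rep(c("a", "b"), 5e4),
                  stringsAsFactors = FALSE)

csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".Rdata")
raw_path   <- tempfile(fileext = ".Rdata")

write.csv(toy, csv_path, row.names = FALSE)
save(toy, file = rdata_path)                   # compress = TRUE (gzip) is the default
save(toy, file = raw_path, compress = FALSE)   # same data, no compression

sizes <- file.size(c(csv_path, rdata_path, raw_path))
names(sizes) <- c("csv", "rdata", "rdata_raw")

# Size of each file relative to the compressed Rdata file.
round(sizes / sizes[["rdata"]], 1)
```

Comparing the compressed and uncompressed Rdata files shows how much of the saving is due to gzip rather than the format itself.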

Reading Data

> system.time(d <- read_csv('data.csv'))
   user  system elapsed
 36.833   5.432  52.843
> system.time(d <- read_feather('data.feather'))
   user  system elapsed
  0.732   1.025   3.362
> system.time(d <- load('data.Rdata'))
   user  system elapsed
  2.451   0.183   2.643

Using elapsed time, CSV takes ~20 times longer than Rdata (yikes!), and feather takes ~1.3 times longer. Another win for Rdata.
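One detail about the load() call timed above: load() returns the names of the objects it restores, not the data itself, so after d <- load(...) the variable d holds a character vector and the data frame reappears under whatever name it was saved with. This does not affect the timing. A minimal sketch with a toy object (invented for illustration):

```r
# Toy object, invented for illustration.
x <- data.frame(a = 1:3)
path <- tempfile(fileext = ".Rdata")
save(x, file = path)
rm(x)

# load() restores `x` into the calling environment and returns its name.
loaded <- load(path)
loaded            # "x"
nrow(x)           # 3

# To bind the data to a name of your choosing, load into a fresh environment.
env <- new.env()
load(path, envir = env)
d <- env[[loaded]]
```

Loading into a separate environment also avoids silently overwriting an existing object that happens to share the saved name.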

Writing Data

> system.time(save(a, file='data.Rdata'))
   user  system elapsed
 14.687   0.074  14.761
> system.time(write_csv(a, 'data.csv'))
   user  system elapsed
 23.542   1.029  24.570
> system.time(write_feather(a, 'data.feather'))
   user  system elapsed
  0.596   1.663   5.154

Feather is the fastest here: Rdata takes ~2.9 times longer, and CSV ~4.8 times longer.
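The speed factors in the reading and writing sections are just each elapsed time divided by the fastest one. As a quick check, using the elapsed values copied from the transcripts above:

```r
# Elapsed seconds copied from the system.time() output on this page.
read_elapsed  <- c(csv = 52.843, feather = 3.362, rdata = 2.643)
write_elapsed <- c(rdata = 14.761, csv = 24.570, feather = 5.154)

# Express each time as a multiple of the fastest format in that test.
read_factor  <- round(read_elapsed / min(read_elapsed), 1)
write_factor <- round(write_elapsed / min(write_elapsed), 1)

read_factor    # csv 20.0, feather 1.3, rdata 1.0
write_factor   # rdata 2.9, csv 4.8, feather 1.0
```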
