CSV is much slower than MRI #985

aubergene opened this Issue Jun 8, 2011 · 24 comments

** Edited to require FasterCSV on 1.8, as I forgot the standard CSV library in 1.8 is very slow

begin
  require "faster_csv"
  CSV = FasterCSV # assumed alias: faster_csv defines FasterCSV, not CSV
rescue LoadError
  require "csv"
end

CSV.foreach('thirty-megs.csv') do |row|
end

ruby 1.8.7 (2011-02-18 patchlevel 334) [i686-darwin10.7.0]

real 0m20.300s
user 0m20.118s
sys 0m0.117s

ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.0]

real 0m12.281s
user 0m11.952s
sys 0m0.159s

rubinius 2.0.0dev (1.8.7 23af0f7 yyyy-mm-dd JI) [x86_64-apple-darwin10.7.0]

real 1m39.303s
user 1m39.536s
sys 0m0.713s


Do you have a link to this "thirty-megs.csv" file?


Yes, sorry, it is 1 million lines like this


I think this may mostly be a factor of the 1.9 library for csv being so much faster than the 1.8 version, which is the version that ships in Rubinius currently.

I ran your benchmark with JRuby, since we have both modes in place today. The 1.9 mode csv.rb is quite a bit faster than 1.8's version:

~/projects/jruby ➔ time jruby --1.9 -X+C csv_bench.rb 
real    0m20.552s
user    0m27.023s
sys 0m1.227s

~/projects/jruby ➔ time ruby1.9 csv_bench.rb 
real    0m16.079s
user    0m14.569s
sys 0m0.193s

~/projects/jruby ➔ time jruby -X+C csv_bench.rb 
real    0m53.274s
user    0m53.407s
sys 0m1.242s

So the comparison of Rubinius in 1.8 mode to Ruby 1.9 is probably not valid here. I attempted to try Rubinius with the 1.9 csv.rb, but it uses 1.9 syntax and I don't have a 1.9-enabled Rubinius build handy.


Oh, yes, I forgot it's 1.8. I re-ran using fastercsv-1.5.4, which I understand is very similar to the CSV library in 1.9. It's quicker than before, but still slow.

ruby 1.8.7 with fastercsv-1.5.4:

real    0m19.323s
user    0m19.154s
sys     0m0.114s

rubinius with fastercsv-1.5.4:

real    1m37.845s
user    1m38.044s
sys     0m0.685s

Ok...I'll shut up then. At least the comparison is a bit more valid now. Thanks for re-running!


I made a patch improving performance when indexing String objects. More info in pull request #1027.

31MB file times:

  • ruby 1.9.2-p180:
    ruby test.rb 8,96s user 0,06s system 99% cpu 9,028 total

  • ruby 1.8.7-p334:
    ruby test.rb 197,54s user 0,52s system 99% cpu 3:18,26 total

  • rubinius 1.2.4dev (1.8.7 5cb7a7d yyyy-mm-dd JI):
    ruby test.rb 168,26s user 0,60s system 100% cpu 2:48,36 total

  • rubinius-patched:
    ruby test.rb 116,09s user 0,40s system 100% cpu 1:55,85 total

Rubinius member

83341ab and e1b6562 improve this benchmark's performance by about 50% on my machine; please confirm. Thanks.


Forgive my ignorance; I looked around for how to apply this patch to Rubinius but wasn't sure what to do. I'm happy to follow instructions.

Also, I updated the test so it will use FasterCSV on 1.8 and CSV on 1.9, as CSV is very slow in 1.8 anyway.

Rubinius member

If you're using rvm:

rvm install rbx-head

If not, get the latest head with:

git clone https://github.com/rubinius/rubinius.git
cd rubinius

then build it and run it with ./bin/rbx path_to_benchmark.rb.

# rubinius 1.2.4dev (1.8.7 7ae451a1 yyyy-mm-dd JI) [x86_64-apple-darwin10.7.0]
real    1m39.720s
user    1m39.645s
sys 0m0.506s
Rubinius member

@aubergene, you're using a build from June 8 (7ae451a). Please follow the documentation to build the latest master and try that. You should see a ~50% speedup (about 40s). Still slower than MRI, but we're getting there.


Thanks @hosiawak. I ran it twice and got results about 30% faster, which is still good.

rubinius 1.2.4dev (1.8.7 83b015818124294a769d9ffb39736d9c52fb00b9 yyyy-mm-dd JI) [x86_64-apple-darwin10.7.0]
real    1m10.265s
user    1m11.112s
sys 0m0.617s

@aubergene Is this still an issue?


@frodsan I just ran the tests again on a different machine; results still look similar to the previous runs, with RBX taking approximately twice the time.

ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.3.0]
real    0m13.149s
user    0m12.780s
sys 0m0.087s
rubinius 2.0.0rc1 (1.8.7 75596a44 2012-11-02 JI) [x86_64-apple-darwin11.4.2]
real    0m37.987s
user    0m38.263s
sys 0m0.269s

Looks like this is still an issue.

rubinius 2.0.0.rc1 (1.9.3 96db2d8e yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
33,09s user 0,17s system 106% cpu 31,336 total

rubinius 2.0.0.rc1 (2.0.0 96db2d8e yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
32,87s user 0,16s system 105% cpu 31,283 total

ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
10,42s user 0,04s system 99% cpu 10,462 total

ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin12.2.0]
10,09s user 0,04s system 99% cpu 10,147 total

I did some simple profiling: https://gist.github.com/MSNexploder/5398947
Around 40% of the time is spent in Rubinius::Splitter.split
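For reference, a minimal micro-benchmark sketch that isolates the split step on a synthetic CSV-like row (the row contents and field count are made up, not taken from the actual test file):

```ruby
require 'benchmark'

# Synthetic comma-separated row with 20 fields (made-up data).
row = (["field"] * 20).join(",")

iterations = 100_000
elapsed = Benchmark.realtime do
  iterations.times { row.split(",") }
end

puts "String#split: #{(iterations / elapsed).round} calls/sec"
```

Running this under both MRI and Rubinius should show whether the gap the profile attributes to Rubinius::Splitter.split reproduces outside the CSV library.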

Rubinius member

01593ae and 04ca45e should have improved this by making Rubinius::Splitter.split considerably faster in the specific cases that the CSV library is using.

Running a test similar to @aubergene's shows a distinct improvement:

rubinius 2.0.0.rc1 (1.9.3 04ca45e0 yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
13.40s user 0.13s system 109% cpu 12.353 total

ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
8.69s user 0.12s system 99% cpu 8.819 total

@MSNexploder could you try running your benchmarks against master and report back?


Way better 👏

rubinius 2.0.0.rc1 (1.9.3 04ca45e0 yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
17,11s user 0,11s system 108% cpu 15,818 total

rubinius 2.0.0.rc1 (2.0.0 04ca45e0 yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
17,27s user 0,11s system 108% cpu 15,978 total

ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
10,25s user 0,04s system 99% cpu 10,293 total

ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin12.2.0]
10,21s user 0,04s system 99% cpu 10,265 total

I also updated my profiling gist: https://gist.github.com/MSNexploder/5398947
Looks like IO performance is the next bottleneck, similar to #2166.
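As a rough way to separate IO cost from parsing cost, one could time plain line iteration over a file of comparable size. This sketch generates its own synthetic file rather than using the test data above:

```ruby
require 'benchmark'
require 'tempfile'

# Build a synthetic file (~6MB of made-up comma-separated rows).
file = Tempfile.new('io_bench')
50_000.times { file.puts((["field"] * 20).join(",")) }
file.flush

# Time raw line iteration only; no CSV parsing involved.
lines = 0
elapsed = Benchmark.realtime do
  File.foreach(file.path) { |line| lines += 1 }
end

puts "Read #{lines} lines in #{elapsed.round(3)}s"
file.close!
```

Comparing this number against the CSV.foreach timings gives a lower bound on what any parser-side improvement can achieve.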


Wow, what an improvement from two years ago. Great work!

Rubinius member

Performance update:

Consider the following script:

require 'csv'
require 'benchmark'

timings = []
amount  = 5

amount.times do
  timing = Benchmark.measure { CSV.foreach('/tmp/rubinius_985.csv') { |row| } }
  timings << timing.real
end

min = timings.min.round(3)
max = timings.max.round(3)
avg = (timings.inject(:+) / timings.length).round(3)

puts "Min: #{min} sec"
puts "Max: #{max} sec"
puts "Avg: #{avg} sec"

And the following 15MB input file: http://downloads.yorickpeterse.com/files/rubinius_985.csv (MD5 is 05fc37e8026a09cea0519377dcfdcf58 for the paranoid).

Running the above script using time ruby /tmp/test.rb yields the following results on MRI 2.0 p247:

Min: 3.12 sec
Max: 3.144 sec
Avg: 3.13 sec

real    0m15.718s
user    0m15.680s
sys     0m0.043s

And on Rbx 2.1:

Min: 5.734 sec
Max: 6.808 sec
Avg: 6.08 sec

real    0m33.564s
user    0m33.123s
sys     0m0.307s

So it's no longer 8 times as slow, only a "mere" 2 times.

Rubinius member

@YorickPeterse your CSV is 404'ing so I uploaded a new one here.

MRI 2.1.2p95

$> time ruby csv.rb sample.csv
Min: 2.55 sec
Max: 2.585 sec
Avg: 2.57 sec
ruby csv.rb sample.csv  12.84s user 0.05s system 99% cpu 12.903 total

rubinius 2.2.6.n135 (2.1.0 06d6f25c 2014-05-15 JI) [x86_64-linux-gnu]

$> time ./bin/rbx csv.rb sample.csv
Min: 3.791 sec
Max: 4.758 sec
Avg: 4.007 sec
./bin/rbx csv.rb sample.csv  22.36s user 0.24s system 110% cpu 20.539 total

1.6x slower nowadays.

What's our goal for this issue? Match or beat MRI?


@jc00ke I've always appreciated JRuby's performance stance: if it's slower than MRI in any way (other than startup time), it's a bug.

There was a time when this didn't make sense for Rubinius, because just about everything was far slower (as clearly seen above). Perhaps it's time for Rubinius to adopt this policy?

Rubinius member

@tmornini That would work for me. A policy of sorts would be helpful.


I'm seeing CSV in MRI as 6x faster than in Rubinius.

This is my test script:

# cat read_csv.rb

require 'csv'

t_start = Time.now.to_f

count = 0
# 10,000 row file, ~4.5MB
# available at http://codefork.com/schoenberg_small.csv
CSV.foreach("schoenberg_small.csv") do |row|
  count += 1
end

elapsed = Time.now.to_f - t_start
rate = count / elapsed

puts "Finished in #{elapsed}s, at rate of #{rate} rows/sec"

Here's the output:

# ruby -v
ruby 2.1.5p273 (2014-11-13) [x86_64-linux-gnu]

# ruby read_csv.rb
Finished in 1.3033735752105713s, at rate of 7572.656211328679 rows/sec

# rbx -v
rubinius 2.4.1 (2.1.0 b332f133 2014-12-04 3.5.0 JI) [x86_64-linux-gnu]

# rbx read_csv.rb
Finished in 7.770043134689331s, at rate of 1270.2632184801419 rows/sec

Here's the profiler output: https://gist.github.com/codeforkjeff/c9fef79b5b091a911018

String#count is taking 36% of the total time, which seems high?
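As a sanity check on that number, here is a tiny micro-benchmark sketch of String#count in isolation (the row data and the counted character set are assumptions for illustration, not taken from schoenberg_small.csv):

```ruby
require 'benchmark'

# Made-up row; counting quote and newline characters, the kind of
# per-line scan a CSV parser typically performs.
line = (["field"] * 20).join(",")

iterations = 100_000
elapsed = Benchmark.realtime do
  iterations.times { line.count("\"\n") }
end

puts "String#count: #{(iterations / elapsed).round} calls/sec"
```

If the per-call rate on Rubinius is far below MRI's, String#count itself (rather than how the CSV library uses it) is the place to optimize.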
