Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

CSV is much slower than MRI #985

Open
aubergene opened this Issue · 24 comments
@aubergene

** edited to require fastercsv in 1.8 as I forgot the standard CSV library in 1.8 is very slow

begin
  require "faster_csv"
  FasterCSV.build_csv_interface
rescue LoadError
  require "csv"
end

CSV.foreach('thirty-megs.csv') do |row|
end

ruby 1.8.7 (2011-02-18 patchlevel 334) [i686-darwin10.7.0]

real 0m20.300s
user 0m20.118s
sys 0m0.117s

ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.0]

real 0m12.281s
user 0m11.952s
sys 0m0.159s

rubinius 2.0.0dev (1.8.7 23af0f7 yyyy-mm-dd JI) [x86_64-apple-darwin10.7.0]

real 1m39.303s
user 1m39.536s
sys 0m0.713s

@headius

Do you have a link to this "thirty-megs.csv" file?

@aubergene

Yes, sorry, it is 1 millions lines like this

Source,Target,Type,Weight
193690,208136,Undirected,0.2081897050
193690,206471,Undirected,0.2890272079
193690,209291,Undirected,0.2465690519
@headius

I think this may mostly be a factor of the 1.9 library for csv being so much faster than the 1.8 version, which is the version that ships in Rubinius currently.

I ran your benchmark with JRuby, since we have both modes in place today. The 1.9 mode csv.rb is quite a bit faster than 1.8's version:

~/projects/jruby ➔ time jruby --1.9 -X+C csv_bench.rb 
real    0m20.552s
user    0m27.023s
sys 0m1.227s

~/projects/jruby ➔ time ruby1.9 csv_bench.rb 
real    0m16.079s
user    0m14.569s
sys 0m0.193s

~/projects/jruby ➔ time jruby -X+C csv_bench.rb 
real    0m53.274s
user    0m53.407s
sys 0m1.242s

So the comparison of Rubinius in 1.8 mode to Ruby 1.9 is probably not valid here. I attempted to try Rubinius with the 1.9 csv.rb, but it uses 1.9 syntax and I don't have a 1.9-enabled Rubinius build handy.

@aubergene

Oh, yes I forgot it's 1.8 I re-ran using fastercsv-1.5.4 which I understand is very similar to CSV in 1.9, it's quicker than before, but still slow.

ruby-1.8.7-p334
real    0m19.323s
user    0m19.154s
sys     0m0.114s

rbx-2.0.0pre
real    1m37.845s
user    1m38.044s
sys     0m0.685s
@headius

Ok...I'll shut up then. At least the comparison is a bit more valid now. Thanks for re-running!

@pgrzegorczyk

I made a patch improving performance when indexing String object. More info in pull request #1027.

31MB file times:

  • ruby 1.9.2-p180:
    ruby test.rb 8,96s user 0,06s system 99% cpu 9,028 total

  • ruby 1.8.7-p334:
    ruby test.rb 197,54s user 0,52s system 99% cpu 3:18,26 total

  • rubinius 1.2.4dev (1.8.7 5cb7a7d yyyy-mm-dd JI):
    ruby test.rb 168,26s user 0,60s system 100% cpu 2:48,36 total

  • rubinius-patched:
    ruby test.rb 116,09s user 0,40s system 100% cpu 1:55,85 total

@hosiawak
Collaborator

83341ab and e1b6562 improve this benchmark's perf by about 50% on my machine, please confirm. Thanks.

@aubergene

Forgive my ignorance, I looked around for how to apply this patch to Rubinius but wasn't sure what I should do or how to do it? I'm happy to follow instructions.

Also I updated the test so it will include FasterCsv for 1.8 and CSV in 1.9 as CSV is very slow in 1.8 anyway.

@hosiawak
Collaborator

If you're using rvm:

rvm install rbx-head

If not get the latest head with:

git clone https://github.com/rubinius/rubinius.git
cd rubinius
rake

to build it, then ./bin/rbx path_to_benchmark.rb to run it.

@aubergene
# rubinius 1.2.4dev (1.8.7 7ae451a1 yyyy-mm-dd JI) [x86_64-apple-darwin10.7.0]
real    1m39.720s
user    1m39.645s
sys 0m0.506s
@hosiawak
Collaborator

aubergene, you're using a build from June 8 7ae451a . Please follow the documentation to build the latest master and try that. You should see 50% speedup (about 40s). still slower than mri but we're getting there.

@aubergene

Thanks hosiawak, I ran it twice and got these results, I got results about 30% faster which is still good.

rubinius 1.2.4dev (1.8.7 83b015818124294a769d9ffb39736d9c52fb00b9 yyyy-mm-dd JI) [x86_64-apple-darwin10.7.0]
real    1m10.265s
user    1m11.112s
sys 0m0.617s
@frodsan

@aubergene Is this still an issue?

@aubergene

@frodsan just ran the tests again, different machine, but seems like it's still similar to the previous runs, RBX takes approximate twice the time.

ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.3.0]
real    0m13.149s
user    0m12.780s
sys 0m0.087s
rubinius 2.0.0rc1 (1.8.7 75596a44 2012-11-02 JI) [x86_64-apple-darwin11.4.2]
real    0m37.987s
user    0m38.263s
sys 0m0.269s
@MSNexploder

Looks like this is still an issue.

rubinius 2.0.0.rc1 (1.9.3 96db2d8e yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
33,09s user 0,17s system 106% cpu 31,336 total

rubinius 2.0.0.rc1 (2.0.0 96db2d8e yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
32,87s user 0,16s system 105% cpu 31,283 total

ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
10,42s user 0,04s system 99% cpu 10,462 total

ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin12.2.0]
10,09s user 0,04s system 99% cpu 10,147 total

I did some simple profiling: https://gist.github.com/MSNexploder/5398947
Around 40% of the time is spent in Rubinius::Splitter.split

@leocassarani
Collaborator

01593ae and 04ca45e should have improved this by making Rubinius::Splitter.split considerably faster in the specific cases that the CSV library is using.

Running a test similar to @aubergene's shows a distinct improvement:

rubinius 2.0.0.rc1 (1.9.3 04ca45e0 yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
13.40s user 0.13s system 109% cpu 12.353 total

ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
8.69s user 0.12s system 99% cpu 8.819 total

@MSNexploder could you try running your benchmarks against master and report back?

@MSNexploder

Way better :clap:

rubinius 2.0.0.rc1 (1.9.3 04ca45e0 yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
17,11s user 0,11s system 108% cpu 15,818 total

rubinius 2.0.0.rc1 (2.0.0 04ca45e0 yyyy-mm-dd JI) [x86_64-apple-darwin12.3.0]
17,27s user 0,11s system 108% cpu 15,978 total

ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
10,25s user 0,04s system 99% cpu 10,293 total

ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin12.2.0]
10,21s user 0,04s system 99% cpu 10,265 total

I also updated my profiling gist: https://gist.github.com/MSNexploder/5398947
Looks like IO performance is the next bottleneck - Similar to #2166

@ghost

Wow, what an improvement from two years ago. Great work!

@YorickPeterse

Performance update:

Consider the following script:

require 'csv'
require 'benchmark'

timings = []
amount  = 5

amount.times do
  timing = Benchmark.measure { CSV.foreach('/tmp/rubinius_985.csv') { |row| } }
  timings << timing.real
end

min = timings.min.round(3)
max = timings.max.round(3)
avg = (timings.inject(:+) / timings.length).round(3)

puts "Min: #{min} sec"
puts "Max: #{max} sec"
puts "Avg: #{avg} sec"

And the following 15MB input file: http://downloads.yorickpeterse.com/files/rubinius_985.csv (MD5 is 05fc37e8026a09cea0519377dcfdcf58 for the paranoid).

Running the above script using time ruby /tmp/test.rb yields the following results on MRI 2.0 p247:

Min: 3.12 sec
Max: 3.144 sec
Avg: 3.13 sec

real    0m15.718s
user    0m15.680s
sys     0m0.043s

And on Rbx 2.1:

Min: 5.734 sec
Max: 6.808 sec
Avg: 6.08 sec

real    0m33.564s
user    0m33.123s
sys     0m0.307s

So it's no longer 8 times as slow, only a "mere" 2 times.

@jc00ke
Owner

@YorickPeterse your CSV is 404'ing so I uploaded a new one here.

MRI 2.1.2p95

$> time ruby csv.rb sample.csv
Min: 2.55 sec
Max: 2.585 sec
Avg: 2.57 sec
ruby csv.rb sample.csv  12.84s user 0.05s system 99% cpu 12.903 total

rubinius 2.2.6.n135 (2.1.0 06d6f25c 2014-05-15 JI) [x86_64-linux-gnu]

$> time ./bin/rbx csv.rb sample.csv
Min: 3.791 sec
Max: 4.758 sec
Avg: 4.007 sec
./bin/rbx csv.rb sample.csv  22.36s user 0.24s system 110% cpu 20.539 total

1.6x slower nowadays.

What's our goal for this issue? Match or beat MRI?

@tmornini
Collaborator

@jc00ke I've always appreciated JRuby's performance stance: if it's slower than MRI in any way (other than startup time), it's a bug.

There was a time when this didn't make sense for Rubinius, because just about everything was far slower (as clearly seen above). Perhaps it's time for Rubinius to adopt this policy?

@jc00ke
Owner

@tmornini That would work for me. A policy of sorts would be helpful.

@codeforkjeff

I'm seeing csv in MRI as 6x's faster than Rubinius.

This is my test script:

# cat read_csv.rb

require 'csv'

t_start = Time.now.to_f

count = 0
# 10,000 row file, ~4.5MB
# available at http://codefork.com/schoenberg_small.csv
CSV.foreach("schoenberg_small.csv") do |row|
  count += 1
end

elapsed = Time.now.to_f - t_start
rate = count / elapsed

puts "Finished in #{elapsed}s, at rate of #{rate} rows/sec"

Here's the output:

# ruby -v
ruby 2.1.5p273 (2014-11-13) [x86_64-linux-gnu]

# ruby read_csv.rb
Finished in 1.3033735752105713s, at rate of 7572.656211328679 rows/sec

# rbx -v
rubinius 2.4.1 (2.1.0 b332f133 2014-12-04 3.5.0 JI) [x86_64-linux-gnu]

# rbx read_csv.rb
Finished in 7.770043134689331s, at rate of 1270.2632184801419 rows/sec

Here's the profiler output: https://gist.github.com/codeforkjeff/c9fef79b5b091a911018

String#count is taking 36% of the total time, which seems high?

@chuckremes
Collaborator
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.