#readlines significantly slower than MRI #1481

Open
jmstacey opened this Issue Jan 3, 2012 · 12 comments

Comments

Projects
None yet
5 participants
Contributor

jmstacey commented Jan 3, 2012

Calling #readline in Rubinius is significantly slower than MRI. For example, MRI #readlines requires just a tenth of a second or so, but Rubinius #readlines requires over 5 seconds on a file with 500,000 words separated by line break.

Steps to reproduce

  • Prepare a sample source file to read. For example, run the following with Ruby18 or 19. You can't use Rubinius because of a different issue with map which I'll file another bug for. The following will create a source file with 500,000 words consisting of 8 random characters.
SOURCE = 'source.txt'

# File.open(SOURCE, 'w') do |f|
#   500000.times do
#     f.puts (0...8).map{65.+(rand(25)).chr}.join # http://stackoverflow.com/questions/88311/how-best-to-generate-a-random-string-in-ruby
#   end
# end
  • Run an example of the troubled code:
SOURCE = 'source.txt'
file = File.new(SOURCE)
words = file.readlines

p words.size

Timed Benchmarks

Ruby 1.9.2p312

Jon-Staceys-MacBook-Pro:Desktop jon$ time ruby readline_speed.rb 
500000

real    0m0.183s
user    0m0.143s
sys 0m0.034s

Rubinius 2.0.testing @ 2750944

Jon-Staceys-MacBook-Pro:Desktop jon$ time rubinius_bin/bin/rbx ./readline_speed.rb 
500000

real    0m5.965s
user    0m6.245s
sys 0m0.188s

Profiles

Ruby 1.9.2p312 with ruby-prof

Jon-Staceys-MacBook-Pro:Desktop jon$ ruby-prof ./readline_speed.rb 
500000
Thread ID: 70308421914320
Total: 0.141768

 %self     total     self     wait    child    calls  name
 99.69      0.14     0.14     0.00     0.00        1  IO#readlines
  0.19      0.14     0.00     0.00     0.14        1  Kernel#load
  0.05      0.00     0.00     0.00     0.00        1  Kernel#p
  0.03      0.14     0.00     0.00     0.14        2  Global#[No method]
  0.01      0.00     0.00     0.00     0.00        1  File#initialize
  0.01      0.00     0.00     0.00     0.00        1  Kernel#inspect
  0.01      0.00     0.00     0.00     0.00        4  Hash#default
  0.01      0.00     0.00     0.00     0.00        2  IO#set_encoding
  0.00      0.00     0.00     0.00     0.00        1  <Class::IO>#new
  0.00      0.00     0.00     0.00     0.00        1  Fixnum#to_s
  0.00      0.00     0.00     0.00     0.00        1  <Class::IO>#allocate

Rubinius 2.0.testing @ 2750944

Jon-Staceys-MacBook-Pro:Desktop jon$ time rubinius_bin/bin/rbx ./readline_speed.rb 
500000
Thread 1: total running time: 19.015297034s

  %   cumulative   self                self     total
 time   seconds   seconds      calls  ms/call  ms/call  name
------------------------------------------------------------
  16.76    6.59      3.19    1015628     0.00     0.01  Rubinius::Channel#as_lock
  13.06   13.57      2.48     500001     0.00     0.03  IO#read_to_separator
   8.75   17.91      1.66     500001     0.00     0.04  IO#gets
   8.54    1.99      1.62    1000002     0.00     0.00  Rubinius::GlobalVariables#[]=
   6.03    1.15      1.15        115     9.97     9.97  GC.collect_young
   4.89    1.19      0.93     507814     0.00     0.00  IO::InternalBuffer::fill_from<63> {}
   4.79    1.13      0.91     507814     0.00     0.00  IO::InternalBuffer::shift<152> {}
   4.43    5.17      0.84     507814     0.00     0.01  IO::InternalBuffer#fill_from
   3.34   19.01      0.64          1   635.60 19014.62  IO#readlines
   3.26    0.72      0.62     500001     0.00     0.00  IO#ensure_open_and_readable
   2.91    0.60      0.55    1015628     0.00     0.00  Rubinius::Channel#receive
   2.79    0.58      0.53    1015628     0.00     0.00  Rubinius::Channel#<<
   2.76    3.59      0.52     507814     0.00     0.01  IO::InternalBuffer#shift
   2.66    0.51      0.51          7    72.29    72.29  GC.collect_mature
   2.26    0.89      0.43     500005     0.00     0.00  Kernel#StringValue
   2.05    0.48      0.39     507814     0.00     0.00  IO::InternalBuffer#find
   1.96    0.46      0.37     500006     0.00     0.00  Rubinius::Type.coerce_to
   1.65    0.44      0.31     500000     0.00     0.00  Array#<<
   0.74    0.16      0.14     507814     0.00     0.00  String.from_chararray
   0.69    0.18      0.13    1000002     0.00     0.00  Rubinius::LookupTable#[]=
   0.68    0.18      0.13    1000002     0.00     0.00  String#empty?
   0.61    0.16      0.12    1015628     0.00     0.00  IO::InternalBuffer#size
   0.61    0.16      0.12    1007816     0.00     0.00  IO::InternalBuffer#exhausted?
   0.55    0.15      0.10    1000006     0.00     0.00  Rubinius::LookupTable#key?
   0.53    0.12      0.10     500000     0.00     0.00  Array#set_index
   0.43    0.13      0.08    1000007     0.00     0.00  Rubinius::LookupTable#[]
   0.37    0.09      0.07     507815     0.00     0.00  IO::InternalBuffer#empty_to
   0.33    0.09      0.06     507814     0.00     0.00  Rubinius::CharArray#locate
   0.31    0.08      0.06     500006     0.00     0.00  Rubinius::Type.object_kind_of?
   0.31    0.08      0.06     507815     0.00     0.00  IO::InternalBuffer#empty?
   0.22    0.04      0.04       8791     0.00     0.00  IO::InternalBuffer#fill
   0.21    0.06      0.04     500004     0.00     0.00  Fixnum#&
   0.19    0.06      0.04     500011     0.00     0.00  IO#ensure_open
   0.18    0.06      0.03     500000     0.00     0.00  Kernel#taint
   0.11    0.03      0.02       7813     0.00     0.00  String#<<
   0.02    0.00      0.00       8791     0.00     0.00  IO::InternalBuffer#reset!
   0.01    0.00      0.00       7813     0.00     0.00  String#append
   0.01    0.00      0.00       7813     0.00     0.00  String#modify!
   0.00   19.01      0.00          1     0.38 19015.00  Object::__script__<16> {}

39 methods called a total of 20,181,855 times


real    0m20.618s
user    0m20.862s
sys 0m0.301s

I tried the 19 syntax, but it was even slower:

Jon-Staceys-MacBook-Pro:Desktop jon$ rubinius_bin/bin/rbx -X19 ./readline_speed.rb 
500000
Thread 1: total running time: 28.908094775000002s

  %   cumulative   self                self     total
 time   seconds   seconds      calls  ms/call  ms/call  name
------------------------------------------------------------
  11.80    6.95      3.41    1015628     0.00     0.01  Rubinius::Channel#as_lock
  11.64   19.59      3.37     500001     0.01     0.04  IO::EachReader#read_to_separator
   9.01   27.05      2.60     500001     0.01     0.05  IO#each
   5.90    2.08      1.71    1000001     0.00     0.00  Rubinius::GlobalVariables#[]=
   5.82    1.68      1.68        208     8.09     8.09  GC.collect_young
   5.39    3.08      1.56     500002     0.00     0.01  Rubinius::CharArray#dup
   4.08    5.00      1.18     507815     0.00     0.01  String#<<
   3.55   20.72      1.03     500001     0.00     0.04  IO::EachReader#each
   3.50    1.30      1.01     507814     0.00     0.00  IO::InternalBuffer::fill_from<63> {}
   3.37    1.21      0.97     507814     0.00     0.00  IO::InternalBuffer::shift<152> {}
   2.50    0.72      0.72          9    80.43    80.43  GC.collect_mature
   2.40   27.74      0.69     500001     0.00     0.06  IO#gets
   2.36    0.79      0.68     500001     0.00     0.00  IO#ensure_open_and_readable
   2.32   28.91      0.67          1   670.57 28907.34  IO#readlines
   2.16    0.78      0.62     500002     0.00     0.00  Rubinius::CharArray.new
   2.14    4.90      0.62     507814     0.00     0.01  IO::InternalBuffer#fill_from
   2.03    0.69      0.59     500002     0.00     0.00  Kernel#initialize_copy
   1.83    0.58      0.53    1015628     0.00     0.00  Rubinius::Channel#receive
   1.82    3.19      0.53     507814     0.00     0.01  IO::InternalBuffer#shift
   1.81    3.60      0.52     507815     0.00     0.01  String#modify!
   1.71    1.43      0.49     500000     0.00     0.00  IO::gets<391> {}
   1.69    0.54      0.49    1015628     0.00     0.00  Rubinius::Channel#<<
   1.31    0.46      0.38     507813     0.00     0.00  IO::InternalBuffer#find
   1.17    0.46      0.34     500000     0.00     0.00  Array#<<
   0.95    0.30      0.27     500001     0.00     0.00  IO::EachReader#initialize
   0.66    2.54      0.19     500001     0.00     0.01  Class#allocate
   0.54    0.18      0.16     507814     0.00     0.00  String.from_chararray
   0.48    0.19      0.14    1015628     0.00     0.00  IO::InternalBuffer#size
   0.44    0.15      0.13     500000     0.00     0.00  IO#increment_lineno
   0.42    0.17      0.12    1007815     0.00     0.00  IO::InternalBuffer#exhausted?
   0.42    0.17      0.12    1000001     0.00     0.00  Rubinius::LookupTable#[]=
   0.42    0.14      0.12     507815     0.00     0.00  String#append
   0.39    0.16      0.11    1000005     0.00     0.00  Rubinius::LookupTable#key?
   0.34    0.12      0.10     500000     0.00     0.00  Array#set_index
   0.31    0.14      0.09    1000006     0.00     0.00  Rubinius::LookupTable#[]
   0.29    0.11      0.08     507815     0.00     0.00  IO::InternalBuffer#empty_to
   0.28    0.13      0.08    1000004     0.00     0.00  Kernel#__class__
   0.27    0.12      0.08    1000002     0.00     0.00  Proc.__from_block__
   0.27    0.10      0.08     500002     0.00     0.00  Rubinius::CharArray.allocate_sized
   0.25    0.10      0.07     500002     0.00     0.00  String#empty?
   0.22    0.09      0.06     500001     0.00     0.00  Module#===
   0.21    0.08      0.06     507813     0.00     0.00  Rubinius::CharArray#locate
   0.21    0.08      0.06     507814     0.00     0.00  IO::InternalBuffer#empty?
   0.20    0.08      0.06     507822     0.00     0.00  Kernel#kind_of?
   0.17    0.07      0.05     500002     0.00     0.00  BasicObject#initialize

9 methods omitted

54 methods called a total of 29,697,594 times

Code used for Rubinius test:

require 'profiler'
SOURCE = 'source.txt'

profiler = Rubinius::Profiler::Instrumenter.new

profiler.profile do
  file = File.new(SOURCE)
  words = file.readlines

  p words.size
end

Conclusion

It appears that IO#read_to_separator and/or IO#gets need some performance tuning in 18, and 19 needs some work in other areas.

Contributor

jgaskins commented May 17, 2012

I've been putting my eyes on this. My profiling report looks a little different than yours due to the changes in the months that have passed since this ticket was filed. :-)

➜  readlines  ~/Code/rubinius/bin/rbx -Xprofile readlines.rb
500000
Thread 1: total running time: 15.588074327000001s

  %   cumulative   self                self     total
 time   seconds   seconds      calls  ms/call  ms/call  name
------------------------------------------------------------
  13.39    3.91      2.09    1015639     0.00     0.00  Rubinius.synchronize
  10.99    5.25      1.71    1000003     0.00     0.01  Rubinius::GlobalVariables#[]=
   9.43    7.88      1.47     500001     0.00     0.02  IO#read_to_separator
   7.37    1.28      1.15    1000007     0.00     0.00  Array#[]
   6.79   14.62      1.06     500001     0.00     0.03  IO#gets
   5.57    0.87      0.87        164     5.30     5.30  GC.collect_young
   4.02    1.05      0.63     500003     0.00     0.00  Rubinius::Type.coerce_to_symbol
   3.97    0.78      0.62     507815     0.00     0.00  IO::InternalBuffer::fill_from<64> {}
   3.75    0.73      0.58     507815     0.00     0.00  IO::InternalBuffer::shift<153> {}
   3.52    2.07      0.55     500002     0.00     0.00  Kernel::Kernel<112> {}
   3.03    3.19      0.47     507815     0.00     0.01  IO::InternalBuffer#fill_from
   2.72    1.51      0.42     500002     0.00     0.00  Thread#[]=
   2.50    0.44      0.39     500002     0.00     0.00  IO#ensure_open_and_readable
   2.45   15.32      0.38          1   381.53 15320.82  IO#readlines
   2.33    2.01      0.36     507815     0.00     0.00  IO::InternalBuffer#shift
   2.05    0.32      0.32          7    45.68    45.68  GC.collect_mature
   1.59    0.29      0.25     507814     0.00     0.00  IO::InternalBuffer#find
   1.46    0.29      0.23     500044     0.00     0.00  Array#<<
   1.38    0.26      0.22         21    10.27    12.53  Array#each
   1.28    0.23      0.20     500003     0.00     0.00  Rubinius::Type.coerce_to
   1.16    0.20      0.18    1015639     0.00     0.00  Rubinius.lock
   0.82    0.15      0.13    1015640     0.00     0.00  Rubinius.unlock
   0.73    0.12      0.11     500017     0.00     0.00  Kernel#respond_to?
   0.68    0.12      0.11     507816     0.00     0.00  String.from_bytearray
   0.62    0.13      0.10    1500025     0.00     0.00  Rubinius::LookupTable#[]
   0.61    0.12      0.10    1000015     0.00     0.00  Rubinius::Type.object_kind_of?
   0.53    0.09      0.08     500039     0.00     0.00  Kernel#StringValue
   0.48    0.09      0.07    1007818     0.00     0.00  IO::InternalBuffer#exhausted?
   0.47    0.09      0.07    1000009     0.00     0.00  String#empty?
   0.42    0.09      0.07    1015632     0.00     0.00  IO::InternalBuffer#size
   0.39    0.09      0.06    1500092     0.00     0.00  Rubinius::Tuple#at
   0.34    0.06      0.05     507819     0.00     0.00  IO::InternalBuffer#empty_to
   0.33    0.07      0.05    1000012     0.00     0.00  Rubinius::LookupTable#key?
   0.32    0.06      0.05     500044     0.00     0.00  Array#set_index
   0.28    0.05      0.04     500001     0.00     0.00  Rubinius::LookupTable#[]=
   0.25    0.05      0.04     500002     0.00     0.00  Thread#locals_store
   0.22    0.04      0.03     507818     0.00     0.00  IO::InternalBuffer#empty?
   0.22    0.04      0.03     507814     0.00     0.00  Rubinius::ByteArray#locate
   0.22    0.05      0.03    1000013     0.00     0.00  Fixnum#>=
   0.21    0.05      0.03    1000039     0.00     0.00  Fixnum.===
   0.20    0.03      0.03       8792     0.00     0.00  IO::InternalBuffer#fill
   0.19    0.04      0.03     500000     0.00     0.00  Enumerable::count<125> {}
   0.15    0.03      0.02     500000     0.00     0.00  Kernel#taint
   0.15    0.03      0.02     500043     0.00     0.00  IO#ensure_open
   0.13    0.03      0.02     500041     0.00     0.00  Fixnum#&

22 methods omitted

67 methods called a total of 28,682,468 times

Just setting $_ each time gets is called is taking up almost 11% of its runtime. I tried implementing it with io.read.split, which bypasses setting the global variable and reduces the method's runtime by a factor of 5 (IO#read completes in ~360ms on my machine), but IO#readlines is supposed to include the separators and String#split doesn't even offer a way to do that.

I'm not sure if there are rules about adding public interface to core-Ruby methods and classes so I don't want to go adding another arg to String#split or another method to String itself. Also, readlines is implemented exactly as it is in MRI, so I think using read.split in any form might be treating the symptoms rather than the cause.

Optimizing the setting of global variables by tinkering with LookupTable would help, but it would be even better if we could get rid of the call entirely since $_ is being restored to its original value after eof anyway.

Contributor

headius commented Aug 11, 2012

Retesting this on recent rbx...

I put the reading logic in a loop and benchmarked only that for 5 iterations, to see if rbx improved. It does, but not a great deal:

system ~/projects/jruby $ ../rubinius/bin/rbx file_reader.rb 
  5.651650   0.081317   5.732967 (  5.333050)
  5.018842   0.039015   5.057857 (  5.001714)
  4.525175   0.025034   4.550209 (  4.551207)
  4.410717   0.023844   4.434561 (  4.435064)
  4.370136   0.022201   4.392337 (  4.365882)

system ~/projects/jruby $ ruby-1.9.3 file_reader.rb 
  0.070000   0.010000   0.080000 (  0.086364)
  0.080000   0.010000   0.090000 (  0.104612)
  0.060000   0.000000   0.060000 (  0.073291)
  0.070000   0.010000   0.080000 (  0.087227)
  0.060000   0.000000   0.060000 (  0.066388)

system ~/projects/jruby $ jruby file_reader.rb 
  1.080000   0.050000   1.130000 (  0.487000)
  0.380000   0.030000   0.410000 (  0.225000)
  0.710000   0.020000   0.730000 (  0.315000)
  1.720000   0.130000   1.850000 (  0.682000)
  0.170000   0.000000   0.170000 (  0.164000)

@jmstacey jmstacey added a commit to jmstacey/rubinius that referenced this issue Oct 15, 2012

@jmstacey jmstacey Don't set $_ in IO#readlines to slightly improve performance. This is…
… a partial treatment for issue #1481.
b9786b4

@jmstacey jmstacey added a commit to jmstacey/rubinius that referenced this issue Oct 15, 2012

@jmstacey jmstacey Don't set $. global variable in IO::EachReader methods. Partial treat…
…ment for #1481.
0980d63
Contributor

jmstacey commented Oct 15, 2012

jgaskins, is that what you had in mind? This slightly improves performance on my test by roughly 1 second in 1.9 mode and by about 1.5 seconds in 1.8 mode. I didn't see any surrounding code that would be affected by not having $_ contain the last line, but I'm not completely familiar with the codebase. At least, running mspec on /spec/ruby/stringio/ didn't break anything...

Global assignment may be part of the problem. We seem to be setting globals in IO::EachReader#read_to_separator [and the other IO::EachReader methods]. I'm not exactly sure why we're setting globals all over the place, can someone please explain what I'm missing?

Even if the two commits above are good to go and should be submitted as a pull request, we still have a significant performance deficit compared to the other Ruby VMs.

With the two commits above applied, I can wring out an extra 2 second speedup [from around 5.5 to 3.5 seconds on my system] by removing the Rubinius.synchronize blocks from InternalBufferr#fill_from and InternalBufferr#shift. It looks like evanphx put these in earlier this year a9116a0 to solve a GC problem.
I didn't think that creating a memory barrier should be such an expensive operation... I'll leave that for those more wise than I.

Here's my latest Profiler graph: https://gist.github.com/3890814

Edit: Ignore those first two referenced commits....

Owner

YorickPeterse commented Oct 28, 2013

Using the above source file and the following benchmark:

require 'benchmark'

SOURCE  = 'source.txt'
timings = []
amount  = 5

amount.times do
  timing = Benchmark.measure { File.new(SOURCE).readlines }
  timings << timing.real
end

min = timings.min.round(3)
max = timings.max.round(3)
avg = (timings.inject(:+) / timings.length).round(3)

puts "Min: #{min} sec"
puts "Max: #{max} sec"
puts "Avg: #{avg} sec"

I get the following timings on MRI 2.0 p247 (command is time ruby test.rb):

Min: 0.08 sec
Max: 0.111 sec
Avg: 0.088 sec

real    0m0.504s
user    0m0.473s
sys     0m0.030s

And on Rbx 2.1:

Min: 2.43 sec
Max: 3.362 sec
Avg: 2.772 sec

real    0m14.227s
user    0m15.433s
sys     0m0.190s
Owner

YorickPeterse commented Mar 19, 2014

Currently investigating this issue on the latest Rubinius master (non debug build as suggested by @evanphx). I'm using the benchmark https://github.com/rubinius/rubinius-benchmark/blob/master/core/file/bench_each_byte.rb which I run as following:

./bin/benchmark core/file/bench_each_byte.rb

Rbx version:

rubinius 2.2.6.n78 (2.1.0 1244f4c5 2014-03-19 JI) [x86_64-linux-gnu]

This gives the following results:

=== rbx ===
File#each_byte with a small file
                          130.2 (±3.1%) i/s -        660 in   5.074767s (cycle=12)
File#each_byte with a medium file
                           14.6 (±0.0%) i/s -         73 in   5.018761s (cycle=1)
File#each_byte with a large file
                            1.5 (±0.0%) i/s -          8 in   5.408362s (cycle=1)

When running it with the profiler (./bin/benchmark -p) the profiling results are as following:

Thread 1: total running time: 43.147538699s

  %   cumulative   self                self     total
 time   seconds   seconds      calls  ms/call  ms/call  name
------------------------------------------------------------
  15.76   19.25      6.81    7507988     0.00     0.00  Rubinius.synchronize
  11.64    8.12      5.03    3753953     0.00     0.00  IO::InternalBuffer::fill_from<74> {}
  10.72   23.12      4.63    3753952     0.00     0.01  IO#eof?
  10.41   12.88      4.50    3753897     0.00     0.00  IO::InternalBuffer#getbyte
   9.36   43.07      4.05         55    73.55   783.18  IO#each_byte
   6.71    3.51      2.90    3753953     0.00     0.00  IO#ensure_open_and_readable
   6.07    3.12      2.62    3753897     0.00     0.00  IO::InternalBuffer::getbyte<201> {}
   6.07   15.69      2.62    3753897     0.00     0.00  IO#getbyte
   5.69   14.77      2.46    3753953     0.00     0.00  IO::InternalBuffer#fill_from
   4.94    2.68      2.13    7507853     0.00     0.00  IO::InternalBuffer#size
   2.49    1.36      1.08    3754065     0.00     0.00  IO::InternalBuffer#empty?
   1.39    0.77      0.60    7508032     0.00     0.00  Rubinius.lock
   1.37    0.77      0.59    7508035     0.00     0.00  Rubinius.unlock
   1.31    0.82      0.57   11268462     0.00     0.00  Fixnum#==
   1.10    0.57      0.48    3754010     0.00     0.00  IO::InternalBuffer#empty_to
   0.87    0.55      0.38    7508771     0.00     0.00  Fixnum#-
   0.83    0.53      0.36    7507958     0.00     0.00  IO#ensure_open
   0.67    0.38      0.29    3754009     0.00     0.00  IO::InternalBuffer#exhausted?
   0.57    0.33      0.24    3753908     0.00     0.00  Rubinius::ByteArray#[]
   0.43    0.25      0.19    2400734     0.00     0.00  Object::__script__<26> {}
   0.41    0.26      0.18    3754467     0.00     0.00  Fixnum#&
   0.39    0.29      0.17    3764457     0.00     0.00  Fixnum#+
   0.24    0.10      0.10       2526     0.04     0.04  GC.collect_young
   0.13    0.07      0.05     720966     0.00     0.00  Object::__script__<21> {}
   0.11    0.06      0.05     632197     0.00     0.00  Object::__script__<16> {}
   0.03    0.01      0.01        204     0.06     0.06  Rubinius.add_method
   0.03   43.11      0.01        481     0.02    89.63  Array#each
   0.02    0.01      0.01         15     0.58     0.58  Rubinius::CodeLoader#load_compiled_file
   0.02    0.01      0.01         14     0.47     0.70  Rubinius::CodeLoader#resolve_require_path
   0.01    0.01      0.01        325     0.02     0.02  Array#include?
   0.01    0.01      0.01        132     0.05     0.08  Rubinius.add_defn_method
   0.01    0.02      0.01         14     0.45     1.53  Rubinius::CodeLoader#require
   0.01    0.01      0.01          1     5.03    11.21  Object::__script__<215> {}
   0.01    0.01      0.00         25     0.17     0.24  Rubinius::CodeLoader.feature_provided?
   0.01    0.01      0.00         15     0.25     0.79  Rubinius::CodeLoader#load_file
   0.01    0.01      0.00         14     0.26     0.72  Rubinius::CodeLoader::require<102> {}
   0.01   43.09      0.00          3     1.20 14364.24  Benchmark::ips<153> {}
   0.01    0.00      0.00        178     0.02     0.02  Rubinius.privatized_method?
   0.01    0.01      0.00         57     0.06     0.09  Marshal::State#serialize
   0.01    0.00      0.00        164     0.02     0.02  Rational.reduce
   0.01    0.04      0.00         14     0.18     2.94  Rubinius::CodeLoader.require
   0.01    0.00      0.00        534     0.00     0.00  IO::InternalBuffer#fill
   0.00    0.00      0.00         97     0.02     0.02  String.allocate
   0.00    0.00      0.00          1     1.95     1.95  Object::__script__<302> {}
   0.00    0.00      0.00          2     0.95     1.73  Rubinius::Sprinter::Builder#parse

77 methods omitted

122 methods called a total of 108,889,627 times

Oprofile Results

I also decided to run the benchmark using oprofile. In this case I ran the benchmark as following:

operf rbx ./core/file/bench_each_byte.rb

Reporting the results was done using:

opreport -l -m all -g -t1

The results of this are the following:

Using /home/yorickpeterse/Private/Projects/ruby/rubinius-benchmark/oprofile_data/samples/ for samples directory.
warning: /no-vmlinux could not be found.
warning: [vdso] (tgid:10155 range:0x7fffad76f000-0x7fffad770fff) could not be found.
CPU: Intel Sandy Bridge microarchitecture, speed 3.1e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        linenr info                 app name                 symbol name
37221     7.6441  root.cpp:52                 rbx                      rubinius::Root::set(rubinius::Object*, rubinius::Roots*)
32936     6.7641  oop.cpp:337                 rbx                      rubinius::ObjectHeader::lock(rubinius::State*, rubinius::GCTokenImpl&, rubinius::CallFrame*, unsigned long, bool)
24821     5.0975  oop.cpp:571                 rbx                      rubinius::ObjectHeader::unlock(rubinius::State*, rubinius::GCTokenImpl&, rubinius::CallFrame*)
21359     4.3865  vm.cpp:181                  rbx                      rubinius::VM::new_object_typed_dirty(rubinius::Class*, unsigned long, rubinius::object_type)
18868     3.8749  stack_variables.cpp:12      rbx                      rubinius::StackVariables::create_heap_alias(rubinius::State*, rubinius::CallFrame*, bool)
11281     2.3168  jit_util.cpp:925            rbx                      rbx_return_to_here
10248     2.1046  block_environment.cpp:429   rbx                      rubinius::BlockEnvironment::under_call_frame(rubinius::State*, rubinius::GCTokenImpl&, rubinius::CompiledCode*, rubinius::MachineCode*, rubinius::CallFrame*)
9991      2.0519  call_frame.cpp:189          rbx                      rubinius::CallFrame::scope_still_valid(rubinius::VariableScope*)
9855      2.0239  vm_thread_state.cpp:19      rbx                      rubinius::VMThreadState::state_as_object(rubinius::State*)
9513      1.9537  block_environment.hpp:50    rbx                      void rubinius::BlockEnvironment::top_scope<rubinius::State*>(rubinius::State*, rubinius::VariableScope*)
9504      1.9518  block_environment.cpp:67    rbx                      rubinius::BlockEnvironment::invoke(rubinius::State*, rubinius::CallFrame*, rubinius::BlockEnvironment*, rubinius::Arguments&, rubinius::BlockInvocation&)
9455      1.9418  variable_scope.hpp:54       rbx                      void rubinius::VariableScope::self<rubinius::State*>(rubinius::State*, rubinius::Object*)
8979      1.8440  jit_util.cpp:684            rbx                      rbx_yield_stack
8955      1.8391  tuple.hpp:40                rbx                      rubinius::Tuple::put(rubinius::State*, long, rubinius::Object*)
8803      1.8079  vm_thread_state.cpp:70      rbx                      rubinius::VMThreadState::raise_return(rubinius::Object*, rubinius::VariableScope*)
8799      1.8071  jit_primitives.cpp:2850     rbx                      jit_stub_io_ensure_open
8262      1.6968  block_environment.hpp:51    rbx                      void rubinius::BlockEnvironment::compiled_code<rubinius::State*>(rubinius::State*, rubinius::CompiledCode*)
7733      1.5881  jit_util.cpp:838            rbx                      rbx_push_local_depth
7250      1.4889  jit_util.cpp:248            rbx                      rbx_create_block
7077      1.4534  system.cpp:1306             rbx                      rubinius::System::vm_object_lock(rubinius::State*, rubinius::GCTokenImpl&, rubinius::Object*, rubinius::CallFrame*)
6927      1.4226  vm.cpp:235                  rbx                      rubinius::VM::new_young_tuple_dirty(unsigned long)
6907      1.4185  (no location information)   no-vmlinux               /no-vmlinux
6847      1.4062  variable_scope.cpp:195      rbx                      rubinius::VariableScope::flush_to_heap(rubinius::State*)
6770      1.3904  jit_primitives.cpp:8175     rbx                      jit_stub_vm_object_unlock
6585      1.3524  block_environment.hpp:49    rbx                      void rubinius::BlockEnvironment::scope<rubinius::State*>(rubinius::State*, rubinius::VariableScope*)
6484      1.3316  jit_util.cpp:1160           rbx                      rbx_raise_return
6337      1.3014  variable_scope.cpp:153      rbx                      rubinius::VariableScope::get_local(rubinius::State*, int)
6198      1.2729  jit_primitives.cpp:8036     rbx                      jit_stub_vm_object_lock
5842      1.1998  block_environment.cpp:373   rbx                      rubinius::BlockEnvironment::call(rubinius::State*, rubinius::CallFrame*, rubinius::Arguments&, int)
5615      1.1532  vm_thread_state.cpp:31      rbx                      rubinius::VMThreadState::set_state(rubinius::State*, rubinius::ThreadState*)
5524      1.1345  block_environment.hpp:52    rbx                      void rubinius::BlockEnvironment::constant_scope<rubinius::State*>(rubinius::State*, rubinius::ConstantScope*)
5310      1.0905  instructions.cpp:82         rbx                      rubinius::MachineCode::interpreter(rubinius::State*, rubinius::MachineCode*, rubinius::InterpreterCallFrame*)
5270      1.0823  variable_scope.hpp:56       rbx                      void rubinius::VariableScope::last_match<rubinius::State*>(rubinius::State*, rubinius::Object*)
5233      1.0747  thread_state.hpp:26         rbx                      void rubinius::ThreadState::destination_scope<rubinius::State*>(rubinius::State*, rubinius::VariableScope*)
4984      1.0236  global_cache.cpp:177        rbx                      rubinius::GlobalCache::prune_young()

Random Changes

If I comment out https://github.com/rubinius/rubinius/blob/master/kernel/common/io.rb#L74 (and the corresponding end keyword) and re-run the benchmark the results are as following:

=== rbx ===
File#each_byte with a small file
                          415.5 (±5.3%) i/s -       2072 in   5.003953s (cycle=37)
File#each_byte with a medium file
                           47.9 (±0.0%) i/s -        240 in   5.013134s (cycle=4)
File#each_byte with a large file
                            4.8 (±0.0%) i/s -         25 in   5.196865s (cycle=1)

It's worth stressing that this was a change done out of curiosity. I'm fairly confident the synchronization process is there for a reason. Without this race conditions could pop up if the object is modified while being used (if I'm not mistaken). The massive increase of performance however is very interesting.

YorickPeterse self-assigned this Mar 19, 2014

Owner

YorickPeterse commented Mar 20, 2014

After a chat with various others it was suggested to remove the locking and instead implement a CAS based system to ensure thread-safe IO operations. @evanphx made a pseudo-code example for this which can be found at https://gist.github.com/evanphx/9674625.

The motivation for this is that IO instances should be thread-safe (or at least as much as possible), this is also why the current locking mechanisms are put in place. Removing those would yield a pretty big performance benefit (about 10 times in the File#each_byte benchmark). The cost however would be a total lack of thread-safety.

Similar to other core classes (e.g. Array) the statement could be made that IO is not thread-safe. However, from what I gather the general consensus seems to be that IO should be thread-safe.

@chuckremes You'll probably want to watch this thread as I'll be using it as a brain/numbers dump :)

Owner

YorickPeterse commented Mar 20, 2014

As an example, this is output of Rbx without any IO locking going on (compared to MRI 2.1 and JRuby):

$ ./bin/benchmark -t `which rbx` -t /home/yorickpeterse/.rubies/ruby-2.1.1/bin/ruby -t /home/yorickpeterse/.rubies/jruby-1.7.10/bin/jruby ./core/file/bench_each_byte.rb                     
=== /home/yorickpeterse/.rubies/rbx-git/bin/rbx ===
File#each_byte with a small file
                          657.1 (±10.5%) i/s -       3276 in   5.068450s (cycle=63)
File#each_byte with a medium file
                           76.6 (±0.0%) i/s -        385 in   5.026636s (cycle=7)
File#each_byte with a large file
                            7.5 (±0.0%) i/s -         38 in   5.059536s (cycle=1)
=== /home/yorickpeterse/.rubies/ruby-2.1.1/bin/ruby ===
File#each_byte with a small file
                         1540.6 (±1.7%) i/s -       7803 in   5.066599s (cycle=153)
File#each_byte with a medium file
                          171.8 (±1.7%) i/s -        867 in   5.046385s (cycle=17)
File#each_byte with a large file
                           17.2 (±0.0%) i/s -         87 in   5.049136s (cycle=1)
=== /home/yorickpeterse/.rubies/jruby-1.7.10/bin/jruby ===
File#each_byte with a small file
                          836.2 (±2.9%) i/s -       4216 in   5.046000s (cycle=62)
File#each_byte with a medium file
                           96.9 (±2.1%) i/s -        486 in   5.019000s (cycle=9)
File#each_byte with a large file
                            9.6 (±0.0%) i/s -         49 in   5.086000s (cycle=1)

Comparing ./core/file/bench_each_byte.rb:File#each_byte with a small file:
   /home/yorickpeterse/.rubies/ruby-2.1.1/bin/ruby:       1540 i/s
/home/yorickpeterse/.rubies/jruby-1.7.10/bin/jruby:        836 i/s - 1.84x slower
       /home/yorickpeterse/.rubies/rbx-git/bin/rbx:        657 i/s - 2.34x slower
Comparing ./core/file/bench_each_byte.rb:File#each_byte with a medium file:
   /home/yorickpeterse/.rubies/ruby-2.1.1/bin/ruby:        171 i/s
/home/yorickpeterse/.rubies/jruby-1.7.10/bin/jruby:         96 i/s - 1.77x slower
       /home/yorickpeterse/.rubies/rbx-git/bin/rbx:         76 i/s - 2.24x slower
Comparing ./core/file/bench_each_byte.rb:File#each_byte with a large file:
   /home/yorickpeterse/.rubies/ruby-2.1.1/bin/ruby:         17 i/s
/home/yorickpeterse/.rubies/jruby-1.7.10/bin/jruby:          9 i/s - 1.79x slower
       /home/yorickpeterse/.rubies/rbx-git/bin/rbx:          7 i/s - 2.30x slower

The patch for this is as following:

diff --git a/kernel/common/io.rb b/kernel/common/io.rb
index bce5086..717792f 100644
--- a/kernel/common/io.rb
+++ b/kernel/common/io.rb
@@ -70,7 +70,7 @@ class IO
     #
     # Returns the number of bytes in the buffer.
     def fill_from(io, skip = nil)
-      Rubinius.synchronize(self) do
+      #Rubinius.synchronize(self) do
         empty_to io
         discard skip if skip

@@ -88,7 +88,7 @@ class IO
         end

         return size
-      end
+      #end
     end

     def empty_to(io)
@@ -147,19 +147,19 @@ class IO
     end

     def unseek!(io)
-      Rubinius.synchronize(self) do
+      #Rubinius.synchronize(self) do
         # Unseek the still buffered amount
         return unless write_synced?
         io.prim_seek @start - @used, IO::SEEK_CUR unless empty?
         reset!
-      end
+      #end
     end

     ##
     # Returns +count+ bytes from the +start+ of the buffer as a new String.
     # If +count+ is +nil+, returns all available bytes in the buffer.
     def shift(count=nil)
-      Rubinius.synchronize(self) do
+      #Rubinius.synchronize(self) do
         total = size
         total = count if count and count < total

@@ -167,7 +167,7 @@ class IO
         @start += total

         str
-      end
+      #end
     end

     PEEK_AHEAD_LIMIT = 16
@@ -197,18 +197,18 @@ class IO
     def getbyte(io)
       return if size == 0 and fill_from(io) == 0

-      Rubinius.synchronize(self) do
+      #Rubinius.synchronize(self) do
         byte = @storage[@start]
         @start += 1
         byte
-      end
+      #end
     end

     # TODO: fix this when IO buffering is re-written.
     def getchar(io)
       return if size == 0 and fill_from(io) == 0

-      Rubinius.synchronize(self) do
+      #Rubinius.synchronize(self) do
         char = ""
         while size > 0
           char.force_encoding Encoding::ASCII_8BIT
@@ -220,7 +220,7 @@ class IO
             return IO.read_encode io, char
           end
         end
-      end
+      #end
     end

     ##
Member

chuckremes commented Mar 21, 2014

I’m going to try to spike something out on my fork tomorrow & this weekend. I’ll post to this thread when I have something to share.

On Mar 20, 2014, at 5:45 PM, Yorick Peterse notifications@github.com wrote:

After a chat with various others it was suggested to remove the locking and instead implement a CAS based system to ensure thread-safe IO operations. @evanphx made a pseudo-code example for this which can be found at https://gist.github.com/evanphx/9674625.

The motivation for this is that IO instances should be thread-safe (or at least as much as possible), this is also why the current locking mechanisms are put in place. Removing those would yield a pretty big performance benefit (about 10 times in the File#each_byte benchmark). The cost however would be a total lack of thread-safety.

Similar to other core classes (e.g. Array) the statement could be made that IO is not thread-safe. However, from what I gather the general consensus seems to be that IO should be thread-safe.

@chuckremes You'll probably want to watch this thread as I'll be using it as a brain/numbers dump :)


Reply to this email directly or view it on GitHub.

Contributor

headius commented Mar 21, 2014

FYI, we were forced to make IO thread-safe when it became apparent that it's very common to use a given IO object across many threads, and because many of the underlying IO operations make guarantees regarding threading and atomicity of operations, which we need to emulate in most cases. I don't think Rubinius can choose to make IO thread-unsafe without the very long tail of convincing everyone to stop using IO objects across threads.

Of course, at the very minimum you have many threads using stdio.

I will also point out that JRuby synchronizes IO operations and guarantees IOs will be thread-safe as well, but that synchronization has never shown up as the bottleneck for us on IO stuff.

Contributor

headius commented Mar 21, 2014

I should also mention that the Hotspot JVM has a whole series of optimizations designed to reduce or eliminate the cost of locking. For example, locks that are only ever acquired by one thread at a time may get elided or replaced with a less expensive mechanism. In other words, I can't say that we wouldn't have a synchronization bottleneck if we were on a VM that isn't as agressive.

Member

chuckremes commented Mar 21, 2014

@headius Thanks for the info. I’ll probably take a gander at the JRuby source to see how y’all are doing this. I’m already knee-deep in the half C++ / half Ruby monstrosity that is Rubinius IO, so I may as well add Java to the mix. :)

On Mar 21, 2014, at 5:07 AM, Charles Oliver Nutter notifications@github.com wrote:

I should also mention that the Hotspot JVM has a whole series of optimizations designed to reduce or eliminate the cost of locking. For example, locks that are only ever acquired by one thread at a time may get elided or replaced with a less expensive mechanism. In other words, I can't say that we wouldn't have a synchronization bottleneck if we were on a VM that isn't as agressive.


Reply to this email directly or view it on GitHub.

Member

chuckremes commented Mar 27, 2014

I've gotten far enough with a "rewrite" using FFI that I can draw some conclusions here. One, there appears to be a penalty for calling out to C++ that doesn't always exist for FFI-wrapped functions. I know FFI has a large amount of C++ behind it... I have no explanation why it doesn't suffer similarly.

Anyway, I opened another issue #2994 to track this. I included the results of my profiling both the original implementation and my FFI implementation. The FFI implementation should properly reproduce the IO#each_byte and IO#readlines methods. I've only implemented enough to make sure those two work. All of the supporting code was straight-up copy/pasted from rubinius' IO classes.

This has been a real learning experience for me. Fun stuff. What is clear is that there is plenty of room to improve Rubinius IO performance.

YorickPeterse removed their assignment May 5, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment