
Profiling-based improvement to import#values_sql_for_columns_and_attributes - 80% speedup for large imports #45

Closed
wants to merge 1 commit

5 participants

Sai, Peter Hicks, Ben Woosley, Zach Dennis, James Le Cuirot
Sai

See https://plus.google.com/u/0/103112149634414554669/posts/EGghnZ1icVY for details.

Three issues corrected:

  1. Calling ActiveRecord::Base#connection within the loop. This may well be an ActiveRecord WTF, but connection actually isn't memoized; every time you call it, it goes back to the connection pool and fetches things over again. Yikes. So I memoized it local to the function.

  2. Calling sequence_name needlessly, for non-id columns. This is a result of the `&&` operands being in the wrong order. Operands are evaluated left-to-right; this means you should put the tests that are cheapest to fail on the left, and the expensive ones (or the bits whose real return values you want preserved) on the right. So I just swapped the order.
    There's a deeper WTF here, in that sequence_name should be memoized within ActiveRecord, but something was breaking that. Not sure what, and I haven't fixed that, so it'll still be slow if you are using a primary key column. Serves you right for trying to insert explicit primary key values instead of leaving that up to the database at insertion time. :-P

  3. Double-escaping. connection.quote already does appropriate type conversion and escaping; calling connection.type_cast first provides no better security, but slows things down significantly by converting every value twice. So I just got rid of that.

These three changes resulted in an 80% performance improvement for the import call.
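Fixes 1 and 2 can be illustrated in plain Ruby, independent of ActiveRecord. This is a minimal sketch; `fetch_connection` and `costly_check` are hypothetical stand-ins for `ActiveRecord::Base#connection` and the `sequence_name` test, not the real activerecord-import code:

```ruby
$lookups = 0
$costly_checks = 0

def fetch_connection   # stand-in for ActiveRecord::Base#connection
  $lookups += 1
  :conn
end

def costly_check       # stand-in for the expensive sequence_name test
  $costly_checks += 1
  true
end

vals = [1, nil, 3, nil, 5]

conn = fetch_connection  # fix 1: fetched once, not once per value
out = vals.map do |v|
  # fix 2: cheap nil? test first, so costly_check runs only for the two nils
  if v.nil? && costly_check
    "#{conn}:next_seq"
  else
    "#{conn}:#{v}"
  end
end

puts $lookups        # 1 (would be 5 without hoisting into a local)
puts $costly_checks  # 2 (would be 5 with the old operand ordering)
```

The same two patterns, a local memo and cheap-test-first `&&` ordering, are what the diff at the bottom of this page applies inside `values_sql_for_columns_and_attributes`.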

Peter Hicks

Excellent commit - using this, an import that previously took around 2 hours now takes about 40 minutes!

Sai

@poggs Damn dood, wtf are you importing? I'm doing 10k records in ~3s. I'm having difficulty imagining what could take that long that would even fit in memory.

Is it the import itself that's taking that long, or some other stuff?

Peter Hicks

Railway timetable data. It's this function here - https://github.com/poggs/tsdbexplorer/blob/master/lib/tsdbexplorer/cif.rb#L63 - which reads ~500Mb of data, chops it up, pushes it in to a hash called 'pending', and then every 1000 records, imports it.

I haven't really looked at speeding that up, or where it's hideously inefficient :)
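The every-1000-records pattern Peter describes can be sketched in plain Ruby. This is illustrative only; the `flush` lambda stands in for the point where the real code would call `Model.import(pending)`:

```ruby
BATCH_SIZE = 1000
pending = []   # accumulates parsed records
flushed = 0    # counts records handed off to the (simulated) import

flush = lambda do
  flushed += pending.size  # Model.import(pending) would go here
  pending.clear
end

# Simulate parsing 2500 records out of the source file
2500.times do |i|
  pending << { id: i }
  flush.call if pending.size >= BATCH_SIZE
end
flush.call unless pending.empty?  # don't drop the final partial batch

puts flushed  # 2500
```

Batching like this bounds memory use while still amortizing per-statement overhead across many rows, which is exactly where the per-row `connection` and `sequence_name` costs in this PR were being multiplied.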

Ben Woosley

FYI I added this change to my fork, published to rubygems as Empact-activerecord-import.

Sai

@Empact Would be nice if I got contributor credit. ;-)

Zach Dennis
Owner

@saizai, thank you for your pull request. I believe the first two issues have been resolved as a part of #71. I will look into the third issue.

James Le Cuirot

@zdennis The third issue was also dealt with by #71 so you can close this.

Zach Dennis
Owner

Thanks @chewi

Zach Dennis zdennis closed this
Commits on Feb 10, 2012
  1. saizai authored: Profiling-based improvements to import#values_sql_for_columns_and_attributes - 80% speedup for large imports
Showing with 7 additions and 3 deletions.
  1. +7 −3 lib/activerecord-import/import.rb
@@ -294,13 +294,17 @@ def import_without_validations_or_callbacks( column_names, array_of_attributes,
# Returns the SQL VALUES for an INSERT statement given the passed in +columns+
# and +array_of_attributes+.
def values_sql_for_columns_and_attributes(columns, array_of_attributes) # :nodoc:
+ # connection gets called a *lot* in this high intensity loop.
+ # Reuse the same one w/in the loop, otherwise it would keep being re-retrieved (= lots of time for large imports)
+ connection_memo = connection
array_of_attributes.map do |arr|
my_values = arr.each_with_index.map do |val,j|
column = columns[j]
- if !sequence_name.blank? && column.name == primary_key && val.nil?
- connection.next_value_for_sequence(sequence_name)
+ # be sure to query sequence_name *last*, only if cheaper tests fail, because it's costly
+ if val.nil? && column.name == primary_key && !sequence_name.blank?
+ connection_memo.next_value_for_sequence(sequence_name)
else
- connection.quote(column.type_cast(val), column)
+ connection_memo.quote(val, column) # no need for column.type_cast(val) - quote already does type casting
end
end
"(#{my_values.join(',')})"