Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Browse files

update docs, refactor out ColumnDefinition (for pg)... bump minor

  • Loading branch information...
commit be6174eb384b4db8282c7e744cacb1498bd7dd99 1 parent cb84279
@seamusabshere authored
View
22 CHANGELOG
@@ -1,3 +1,25 @@
+0.5.0 / 2012-09-21
+
+* Breaking changes (well, not really)
+
+ * "document" (as in the second argument to #row) has been renamed to "setter"!
+
+* Bug fixes
+
+ * If you say upsert({:name => 'Jerry', :color => 'red'}), make sure that it only affects rows really meeting those conditions
+ * Always sort selector and setter keys - i.e., column names - before doing anything with them
+ * Support PostgreSQL 9.1+
+ * Support MRI 1.8
+
+* Enhancements
+
+ * Slightly faster benchmarks for SQlite3 and MySQL
+ * Slightly slower on PostgreSQL (probably because the merge function requires more arguments), but more accurate
+ * Slightly clearer code structure
+ * Use bind parameters instead of quoting for PostgreSQL and SQLite3.
+ * Provide Upsert.clear_database_functions(connection) (currently only for PostgreSQL)
+ * Don't subclass String for Upset::Binary... hopefully save some strcpy()s?
+
0.4.0 / 2012-09-04
* Bug fixes
View
50 README.md
@@ -26,7 +26,7 @@ Rows are buffered in memory until it's efficient to send them to the database.
upsert.row({:name => 'Pierre'}, :breed => 'tabby')
end
-Tested to be much about 85% faster on PostgreSQL and 50% faster on MySQL than comparable methods (see the tests).
+Tested to be much about 60% faster on PostgreSQL and 60–90% faster on MySQL and SQLite3 than comparable methods (see the tests, which fail if they are not faster).
## Gotchas
@@ -34,11 +34,9 @@ Tested to be much about 85% faster on PostgreSQL and 50% faster on MySQL than co
Make sure you're upserting against either primary key columns or columns with UNIQUE indexes or both.
-### Columns are set based on the first row you pass
+### For MySQL, columns are set based on the first row you pass
-What if you pass the same columns in a different order?
-
-Currently, the first row you pass in determines the columns that will be used. That's useful for mass importing of many rows with the same columns, but is surprising if you're trying to use a single `Upsert` object to add arbitrary data. For example, this won't work:
+Currently, on MySQL, the first row you pass in determines the columns that will be used for all future upserts using the same Upsert object. That's useful for mass importing of many rows with the same columns, but is surprising if you're trying to use a single `Upsert` object to add arbitrary data. For example, this won't work:
Upsert.batch(Pet.connection, Pet.table_name) do |upsert|
upsert.row({:name => 'Jerry'}, :breed => 'beagle')
@@ -54,8 +52,11 @@ You would need to use a new `Upsert` object. On the other hand, this is totally
Pull requests for any of these would be greatly appreciated:
+1. More correctness tests! What is the dictionary definition of "upsert," anyway?
+1. Sanity check my three benchmarks (four if you include activerecord-import on MySQL). Do they accurately represent optimized alternatives?
1. Provide `require 'upsert/debug'` that will make sure you are selecting on columns that have unique indexes
-1. Make `Upsert` instances accept arbitrary columns, which is what people probably expect. (this should work on PG already)
+1. Make `Upsert` instances accept arbitrary columns, which is what people probably expect. (this should work on PostgreSQL and SQLite3 already)
+1. JRuby support
## Real-world usage
@@ -74,16 +75,16 @@ Originally written to speed up the [`data_miner`](https://github.com/seamusabshe
Using the [mysql2](https://rubygems.org/gems/mysql2) driver.
- Upsert.new Mysql2::Connection.new([...]), :pets
+ upsert = Upsert.new(Mysql2::Connection.new(:username => 'root', :password => 'password', :database => 'upsert_test'), :pets)
#### Speed
-From the tests:
+From the tests (updated 9/21/12):
- Upsert was 77% faster than find + new/set/save
- Upsert was 58% faster than create + rescue/find/update
- Upsert was 80% faster than find_or_create + update_attributes
- Upsert was 39% faster than faking upserts with activerecord-import
+ Upsert was 88% faster than find + new/set/save
+ Upsert was 90% faster than create + rescue/find/update
+ Upsert was 90% faster than find_or_create + update_attributes
+ Upsert was 60% faster than faking upserts with activerecord-import
#### SQL MERGE trick
@@ -93,21 +94,25 @@ From the tests:
INSERT INTO table (a,b,c) VALUES (1,2,3), (4,5,6)
ON DUPLICATE KEY UPDATE a=VALUES(a),b=VALUES(b),c=VALUES(c);
+If `a` only appeared in the selector, then we avoid updating it in case of a duplicate key:
+
+ ON DUPLICATE KEY UPDATE a=a,b=VALUES(b),c=VALUES(c);
+
Since this is an upsert helper library, not a general-use ON DUPLICATE KEY UPDATE wrapper, you **can't** do things like `c=c+1`.
### PostgreSQL
Using the [pg](https://rubygems.org/gems/pg) driver.
- Upsert.new PG.connect([...]), :pets
+ upsert = Upsert.new(PG.connect(:dbname => 'upsert_test'), :pets)
#### Speed
-From the tests:
+From the tests (updated 9/21/12):
- Upsert was 73% faster than find + new/set/save
- Upsert was 84% faster than find_or_create + update_attributes
- Upsert was 87% faster than create + rescue/find/update
+ Upsert was 65% faster than find + new/set/save
+ Upsert was 79% faster than find_or_create + update_attributes
+ Upsert was 76% faster than create + rescue/find/update
# (can't compare to activerecord-import because you can't fake it on pg)
#### SQL MERGE trick
@@ -139,6 +144,8 @@ From the tests:
SELECT merge_db(1, 'david');
SELECT merge_db(1, 'dennis');
+I slightly modified it so that it only retries once - don't want infinite loops.
+
The decision was made **not** to use the following because it's not straight from the manual:
# http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql
@@ -167,11 +174,16 @@ This was also rejected because there's something we can use in the manual:
Using the [sqlite3](https://rubygems.org/gems/sqlite3) driver.
- Upsert.new SQLite3::Database.open([...]), :pets
+ upsert = Upsert.new(SQLite3::Database.open(':memory:'), :pets)
#### Speed
-FIXME tests are segfaulting. Pull request would be lovely.
+From the tests (updated 9/21/12):
+
+ Upsert was 77% faster than find + new/set/save
+ Upsert was 80% faster than find_or_create + update_attributes
+ Upsert was 85% faster than create + rescue/find/update
+ # (can't compare to activerecord-import because you can't fake it on sqlite3)
#### SQL MERGE trick
View
21 lib/upsert.rb
@@ -35,6 +35,16 @@ def logger
end
end
+ # @param [Mysql2::Client,Sqlite3::Database,PG::Connection,#raw_connection] connection A supported database connection.
+ #
+ # Clear any database functions that may have been created.
+ #
+ # Currently only applies to PostgreSQL.
+ def clear_database_functions(connection)
+ dummy = new(connection, :dummy)
+ dummy.buffer.clear_database_functions
+ end
+
# @param [String] v A string containing binary data that should be inserted/escaped as such.
#
# @return [Upsert::Binary]
@@ -42,9 +52,14 @@ def binary(v)
Binary.new v
end
- # @yield [Upsert] An +Upsert+ object in batch mode. You can call #row on it multiple times and it will try to optimize on speed.
+ # Guarantee that the most efficient way of buffering rows is used.
#
- # @note Buffered in memory until it's efficient to send to the server a packet.
+ # Currently mostly helps for MySQL, but you should use it whenever possible in case future buffering-based optimizations become possible.
+ #
+ # @param [Mysql2::Client,Sqlite3::Database,PG::Connection,#raw_connection] connection A supported database connection.
+ # @param [String,Symbol] table_name The name of the table into which you will be upserting.
+ #
+ # @yield [Upsert] An +Upsert+ object in batch mode. You can call #row on it multiple times and it will try to optimize on speed.
#
# @return [nil]
#
@@ -108,6 +123,8 @@ def initialize(connection, table_name)
# Upsert a row given a selector and a setter.
#
+ # The selector values are used as setters if it's a new row. So if your selector is `name=Jerry` and your setter is `age=4`, and there is no Jerry yet, then a new row will be created with name Jerry and age 4.
+ #
# @see http://api.mongodb.org/ruby/1.6.4/Mongo/Collection.html#update-instance_method Loosely based on the upsert functionality of the mongo-ruby-driver #update method
#
# @param [Hash] selector Key-value pairs that will be used to find or create a row.
View
15 lib/upsert/buffer/mysql2_client.rb
@@ -4,9 +4,9 @@ class Buffer
class Mysql2_Client < Buffer
def ready
return if rows.empty?
- c = parent.connection
+ connection = parent.connection
if not async?
- c.execute sql
+ connection.execute sql
rows.clear
return
end
@@ -14,7 +14,7 @@ def ready
new_row = rows.pop
d = new_row.values_sql_bytesize + 3 # ),(
if @cumulative_sql_bytesize + d > max_sql_bytesize
- c.execute sql
+ connection.execute sql
rows.clear
@cumulative_sql_bytesize = static_sql_bytesize + d
else
@@ -35,7 +35,8 @@ def original_setter
def insert_part
@insert_part ||= begin
connection = parent.connection
- %{INSERT INTO #{parent.quoted_table_name} (#{setter.map { |k| connection.quote_ident(k) }.join(',')}) VALUES }
+ column_names = setter.map { |k| connection.quote_ident(k) }
+ %{INSERT INTO #{parent.quoted_table_name} (#{column_names.join(',')}) VALUES }
end
end
@@ -43,12 +44,12 @@ def update_part
@update_part ||= begin
connection = parent.connection
updaters = setter.map do |k|
- qk = connection.quote_ident(k)
+ quoted_name = connection.quote_ident(k)
if original_setter.include?(k)
- "#{qk}=VALUES(#{qk})"
+ "#{quoted_name}=VALUES(#{quoted_name})"
else
# NOOP
- "#{qk}=#{qk}"
+ "#{quoted_name}=#{quoted_name}"
end
end.join(',')
%{ ON DUPLICATE KEY UPDATE #{updaters}}
View
37 lib/upsert/buffer/pg_connection.rb
@@ -1,3 +1,4 @@
+require 'upsert/buffer/pg_connection/column_definition'
require 'upsert/buffer/pg_connection/merge_function'
class Upsert
@@ -7,43 +8,11 @@ class PG_Connection < Buffer
def ready
return if rows.empty?
row = rows.shift
- MergeFunction.execute(self, row)
+ MergeFunction.execute self, row
end
def clear_database_functions
- connection = parent.connection
- # http://stackoverflow.com/questions/7622908/postgresql-drop-function-without-knowing-the-number-type-of-parameters
- connection.execute <<-EOS
-CREATE OR REPLACE FUNCTION pg_temp.upsert_delfunc(text)
- RETURNS void AS
-$BODY$
-DECLARE
- _sql text;
-BEGIN
-
-FOR _sql IN
- SELECT 'DROP FUNCTION ' || quote_ident(n.nspname)
- || '.' || quote_ident(p.proname)
- || '(' || pg_catalog.pg_get_function_identity_arguments(p.oid) || ');'
- FROM pg_catalog.pg_proc p
- LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
- WHERE p.proname = $1
- AND pg_catalog.pg_function_is_visible(p.oid) -- you may or may not want this
-LOOP
- EXECUTE _sql;
-END LOOP;
-
-END;
-$BODY$
- LANGUAGE plpgsql;
-EOS
- res = connection.execute(%{SELECT proname FROM pg_proc WHERE proname LIKE 'upsert_%'})
- res.each do |row|
- k = row['proname']
- next if k == 'upsert_delfunc'
- Upsert.logger.info %{[upsert] Dropping function #{k.inspect}}
- connection.execute %{SELECT pg_temp.upsert_delfunc('#{k}')}
- end
+ MergeFunction.clear self
end
end
end
View
59 lib/upsert/buffer/pg_connection/column_definition.rb
@@ -0,0 +1,59 @@
+class Upsert
+ class Buffer
+ class PG_Connection < Buffer
+ # @private
+ class ColumnDefinition
+ class << self
+ # activerecord-3.2.5/lib/active_record/connection_adapters/postgresql_adapter.rb#column_definitions
+ def all(buffer, table_name)
+ connection = buffer.parent.connection
+ res = connection.execute <<-EOS
+SELECT a.attname AS name, format_type(a.atttypid, a.atttypmod) AS sql_type, d.adsrc AS default
+FROM pg_attribute a LEFT JOIN pg_attrdef d
+ ON a.attrelid = d.adrelid AND a.attnum = d.adnum
+WHERE a.attrelid = '#{connection.quote_ident(table_name)}'::regclass
+AND a.attnum > 0 AND NOT a.attisdropped
+EOS
+ res.map do |row|
+ new connection, row['name'], row['sql_type'], row['default']
+ end.sort_by do |cd|
+ cd.name
+ end
+ end
+ end
+
+ attr_reader :name
+ attr_reader :sql_type
+ attr_reader :default
+ attr_reader :quoted_name
+ attr_reader :quoted_selector_name
+ attr_reader :quoted_setter_name
+
+ def initialize(connection, name, sql_type, default)
+ @name = name
+ @sql_type = sql_type
+ @default = default
+ @quoted_name = connection.quote_ident name
+ @quoted_selector_name = connection.quote_ident "#{name}_selector"
+ @quoted_setter_name = connection.quote_ident "#{name}_setter"
+ end
+
+ def to_selector_arg
+ "#{quoted_selector_name} #{sql_type}"
+ end
+
+ def to_setter_arg
+ "#{quoted_setter_name} #{sql_type}"
+ end
+
+ def to_setter
+ "#{quoted_name} = #{quoted_setter_name}"
+ end
+
+ def to_selector
+ "#{quoted_name} = #{quoted_selector_name}"
+ end
+ end
+ end
+ end
+end
View
111 lib/upsert/buffer/pg_connection/merge_function.rb
@@ -1,9 +1,9 @@
require 'digest/md5'
class Upsert
- # @private
class Buffer
class PG_Connection < Buffer
+ # @private
class MergeFunction
class << self
def execute(buffer, row)
@@ -12,7 +12,6 @@ def execute(buffer, row)
end
def unique_name(table_name, selector, setter)
- # $stderr.puts "AAA #{table_name}/#{selector}/#{setter}"
parts = [
'upsert',
table_name,
@@ -21,15 +20,53 @@ def unique_name(table_name, selector, setter)
'SET',
setter.join('_A_')
].join('_')
+ # maybe i should md5 instead
crc32 = Zlib.crc32(parts).to_s
[ parts.first(MAX_NAME_LENGTH-11), crc32 ].join
end
def lookup(buffer, row)
@lookup ||= {}
- s = row.selector.keys
- c = row.setter.keys
- @lookup[unique_name(buffer.parent.table_name, s, c)] ||= new(buffer, s, c)
+ selector = row.selector.keys
+ setter = row.setter.keys
+ key = [buffer.parent.table_name, selector, setter]
+ @lookup[key] ||= new(buffer, selector, setter)
+ end
+
+ def clear(buffer)
+ connection = buffer.parent.connection
+ # http://stackoverflow.com/questions/7622908/postgresql-drop-function-without-knowing-the-number-type-of-parameters
+ connection.execute <<-EOS
+CREATE OR REPLACE FUNCTION pg_temp.upsert_delfunc(text)
+ RETURNS void AS
+$BODY$
+DECLARE
+ _sql text;
+BEGIN
+
+FOR _sql IN
+ SELECT 'DROP FUNCTION ' || quote_ident(n.nspname)
+ || '.' || quote_ident(p.proname)
+ || '(' || pg_catalog.pg_get_function_identity_arguments(p.oid) || ');'
+ FROM pg_catalog.pg_proc p
+ LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
+ WHERE p.proname = $1
+ AND pg_catalog.pg_function_is_visible(p.oid) -- you may or may not want this
+LOOP
+ EXECUTE _sql;
+END LOOP;
+
+END;
+$BODY$
+ LANGUAGE plpgsql;
+EOS
+ res = connection.execute(%{SELECT proname FROM pg_proc WHERE proname LIKE 'upsert_%'})
+ res.each do |row|
+ k = row['proname']
+ next if k == 'upsert_delfunc'
+ Upsert.logger.info %{[upsert] Dropping function #{k.inspect}}
+ connection.execute %{SELECT pg_temp.upsert_delfunc('#{k}')}
+ end
end
end
@@ -47,7 +84,7 @@ def initialize(buffer, selector, setter)
end
def name
- @name ||= MergeFunction.unique_name buffer.parent.table_name, selector, setter
+ @name ||= MergeFunction.unique_name table_name, selector, setter
end
def execute(row)
@@ -87,63 +124,19 @@ def connection
buffer.parent.connection
end
- def quoted_table_name
- buffer.parent.quoted_table_name
+ def table_name
+ buffer.parent.table_name
end
- # [upsert] SELECT upsert_pets_SEL_name_SET_birthday_A_good_A_home_addr2097355686($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
-
- class ColumnDefinition
- attr_reader :name, :sql_type, :default, :quoted_name, :quoted_selector_name, :quoted_setter_name
- def initialize(connection, name, sql_type, default)
- @name = name
- @sql_type = sql_type
- @default = default
- @quoted_name = connection.quote_ident name
- @quoted_selector_name = connection.quote_ident "#{name}_selector"
- @quoted_setter_name = connection.quote_ident "#{name}_setter"
- end
-
- def to_selector_arg
- [ quoted_selector_name, sql_type ].join ' '
- # [ quoted_selector_name, sql_type, 'DEFAULT', (default || 'NULL') ].join ' '
- end
-
- def to_setter_arg
- [ quoted_setter_name, sql_type ].join ' '
- # [ quoted_setter_name, sql_type, 'DEFAULT', (default || 'NULL') ].join ' '
- end
-
- def to_setter
- "#{quoted_name} = #{quoted_setter_name}"
- end
-
- def to_selector
- "#{quoted_name} = #{quoted_selector_name}"
- end
- end
-
- # activerecord-3.2.5/lib/active_record/connection_adapters/postgresql_adapter.rb#column_definitions
- def get_column_definitions
- res = connection.execute <<-EOS
-SELECT a.attname AS name, format_type(a.atttypid, a.atttypmod) AS sql_type, d.adsrc AS default
-FROM pg_attribute a LEFT JOIN pg_attrdef d
- ON a.attrelid = d.adrelid AND a.attnum = d.adnum
-WHERE a.attrelid = '#{quoted_table_name}'::regclass
- AND a.attnum > 0 AND NOT a.attisdropped
-EOS
- res.map do |row|
- ColumnDefinition.new connection, row['name'], row['sql_type'], row['default']
- end.sort_by do |cd|
- cd.name
- end
+ def quoted_table_name
+ buffer.parent.quoted_table_name
end
# the "canonical example" from http://www.postgresql.org/docs/9.1/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE
# differentiate between selector and setter
def create!
- Upsert.logger.info "[upsert] Creating or replacing database function #{name.inspect} on table #{buffer.parent.table_name.inspect} for selector #{selector.map(&:inspect).join(', ')} and setter #{setter.map(&:inspect).join(', ')}"
- column_definitions = get_column_definitions
+ Upsert.logger.info "[upsert] Creating or replacing database function #{name.inspect} on table #{table_name.inspect} for selector #{selector.map(&:inspect).join(', ')} and setter #{setter.map(&:inspect).join(', ')}"
+ column_definitions = ColumnDefinition.all buffer, table_name
selector_column_definitions = column_definitions.select { |cd| selector.include?(cd.name) }
setter_column_definitions = column_definitions.select { |cd| setter.include?(cd.name) }
connection.execute <<-EOS
@@ -166,8 +159,9 @@ def create!
INSERT INTO #{quoted_table_name}(#{setter_column_definitions.map(&:quoted_name).join(', ')}) VALUES (#{setter_column_definitions.map(&:quoted_setter_name).join(', ')});
RETURN;
EXCEPTION WHEN unique_violation THEN
- IF (first_try) THEN
- first_try = 0;
+ -- seamusabshere 9/20/12 only retry once
+ IF (first_try = 1) THEN
+ first_try := 0;
ELSE
RETURN;
END IF;
@@ -179,7 +173,6 @@ def create!
LANGUAGE plpgsql;
EOS
end
-
end
end
end
View
5 lib/upsert/cell/mysql2_client.rb
@@ -2,7 +2,10 @@ class Upsert
class Cell
# @private
class Mysql2_Client < Cell
- attr_reader :name, :value, :quoted_value
+ attr_reader :name
+ attr_reader :value
+ attr_reader :quoted_value
+
def initialize(connection, name, value)
@name = name
@value = value
View
23 lib/upsert/cell/pg_connection.rb
@@ -2,22 +2,25 @@ class Upsert
class Cell
# @private
class PG_Connection < Cell
- attr_reader :name, :value, :quoted_name
+ attr_reader :name
+ attr_reader :value
+ attr_reader :quoted_name
+
def initialize(connection, name, value)
@name = name
@value = value
@quoted_name = connection.quote_ident name
end
+
def bind_value
- @bind_value ||= begin
- case value
- when Upsert::Binary
- { :value => value.value, :format => 1 }
- when Time, DateTime
- [value.strftime(ISO8601_DATETIME), sprintf(USEC_SPRINTF, value.usec)].join('.')
- else
- value
- end
+ return @bind_value if defined?(@bind_value)
+ @bind_value = case value
+ when Upsert::Binary
+ { :value => value.value, :format => 1 }
+ when Time, DateTime
+ [value.strftime(ISO8601_DATETIME), sprintf(USEC_SPRINTF, value.usec)].join('.')
+ else
+ value
end
end
end
View
6 lib/upsert/cell/sqlite3_database.rb
@@ -2,12 +2,16 @@ class Upsert
class Cell
# @private
class SQLite3_Database < Cell
- attr_reader :name, :value, :quoted_name
+ attr_reader :name
+ attr_reader :value
+ attr_reader :quoted_name
+
def initialize(connection, name, value)
@name = name
@value = value
@quoted_name = connection.quote_ident name
end
+
def bind_value
return @bind_value if defined?(@bind_value)
@bind_value = case value
View
8 lib/upsert/row.rb
@@ -19,17 +19,18 @@ class Row
attr_reader :selector
attr_reader :setter
+
def initialize(parent, raw_selector, raw_setter)
- conn = parent.connection
+ connection = parent.connection
cell_class = parent.cell_class
@selector = raw_selector.inject({}) do |memo, (k, v)|
- memo[k.to_s] = cell_class.new(conn, k, v)
+ memo[k.to_s] = cell_class.new(connection, k, v)
memo
end
@setter = raw_setter.inject({}) do |memo, (k, v)|
- memo[k.to_s] = cell_class.new(conn, k, v)
+ memo[k.to_s] = cell_class.new(connection, k, v)
memo
end
@@ -37,6 +38,7 @@ def initialize(parent, raw_selector, raw_setter)
setter[missing] = selector[missing]
end
+ # there is probably a more clever way to incrementally sort these hashes
@selector = sort_hash selector
@setter = sort_hash setter
end
View
2  lib/upsert/version.rb
@@ -1,3 +1,3 @@
class Upsert
- VERSION = "0.4.0"
+ VERSION = "0.5.0"
end
View
4 spec/database_functions_spec.rb
@@ -10,13 +10,13 @@
Upsert.logger = Logger.new io, Logger::INFO
# clear
- Upsert.new(PGconn.new(:dbname => 'upsert_test'), :pets).buffer.clear_database_functions
+ Upsert.clear_database_functions(PGconn.new(:dbname => 'upsert_test'))
# create
Upsert.new(PGconn.new(:dbname => 'upsert_test'), :pets).row :name => 'hello'
# clear
- Upsert.new(PGconn.new(:dbname => 'upsert_test'), :pets).buffer.clear_database_functions
+ Upsert.clear_database_functions(PGconn.new(:dbname => 'upsert_test'))
# create (#2)
Upsert.new(PGconn.new(:dbname => 'upsert_test'), :pets).row :name => 'hello'
Please sign in to comment.
Something went wrong with that request. Please try again.