
Bulk Insert: Reuse indexes for unique_by #35546

Merged
kaspth merged 2 commits into master from bulk-inserts-with-index on Mar 20, 2019

Conversation

@kaspth (Member) commented Mar 9, 2019

I found :unique_by with :columns and :where inside it tough to
grasp. The documentation only mentioned indexes and partial indexes.
So why duplicate a model's indexes in an insert_all/upsert_all call
when we can just look it up?

This has the added benefit of raising if no index is found, such that
people can't insert thousands of records without relying on an index
of some form.
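
For illustration, a rough before/after of the call site this change targets (the Book model, index name, and partial-index condition are borrowed from the review examples below and are hypothetical, not part of the patch):

  # Before: the caller re-declares the index inline via nested :columns/:where options.
  Book.upsert_all books, unique_by: { columns: %i[ isbn ], where: "published_on IS NOT NULL" }

  # After: point at the model's existing unique index and let Active Record look it up.
  Book.upsert_all books, unique_by: :index_books_on_isbn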

cc @boblail

@kaspth kaspth added this to the 6.0.0 milestone Mar 9, 2019

@kaspth kaspth self-assigned this Mar 9, 2019

@rails-bot rails-bot bot added the activerecord label Mar 9, 2019

@kaspth (Member, Author) commented Mar 9, 2019

Another benefit is that if an index is changed, users won't have to hunt down insert_all/upsert_all calls in their app 😊

Review thread on the new index lookup:

  def primary_keys
    Array.wrap(model.primary_key)
  end

  def find_index_for(unique_by_index)
    if index = connection.indexes(model.table_name).index_by(&:name)[unique_by_index.to_s]

@kaspth (Author, Member) commented Mar 9, 2019

@kamipo is this okay to do? Or will it execute a query for the schema?

@kaspth (Author, Member) commented Mar 9, 2019

Another thing we could do is match on the columns, à la find { |index| index.columns.sort == unique_by_index }, to support unique_by_index: :isbn and unique_by_index: %i( author_id name ).
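
A minimal sketch of that column-based match (helper name and normalization are illustrative, not the PR's code; note the sort was dropped later, see the Mar 12 comment below):

  def find_index_by_columns(unique_by_index)
    wanted = Array(unique_by_index).map(&:to_s).sort
    connection.indexes(model.table_name).find { |index| index.columns.sort == wanted }
  end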

@kamipo (comment minimized; body not shown)

@kaspth (Author, Member) commented Mar 9, 2019

Great, I gotta try that!

sql << " WHERE #{where}" if where
index = insert_all.unique_by_index

sql = +"(#{quote_columns(index&.columns || insert_all.primary_keys).join(',')})"

@vishaltelangre (Contributor) commented Mar 9, 2019

I think we need to set primary keys as conflict target only if update_duplicates? is true.

@boblail (Contributor) commented Mar 9, 2019

Good catch, @vishaltelangre. Yeah, that's true. It's OK (desirable, even) to leave the conflict target clause off of ON CONFLICT DO NOTHING. Implicitly using primary keys here would cause duplicates to be skipped (by default) only if they were duplicate by primary key.

@kaspth (Author, Member) commented Mar 9, 2019

Yes, I had a version of this that looked like the below, but I cut update_duplicates? here because the connection's build_insert_sql is already guarded with update_duplicates?. However, I see now that would still break skip_duplicates?, right. I think this would be clearer if it was split into conflict_target_for_skip and conflict_target_for_update. I'll experiment, thanks.

def conflict_target
  if index = insert_all.unique_by_index
    sql = +"(#{quote_columns(index.columns).join(',')})"
    sql << " WHERE #{index.where}" if index.where
    sql
  elsif update_duplicates?
    "(#{quote_columns(insert_all.primary_keys).join(',')})"
  end
end
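
A minimal sketch of the split mentioned above, assuming a hypothetical conflict_target_from_index helper that renders the "(columns) WHERE ..." fragment for an explicitly given index (names and bodies are illustrative, not the merged code):

  def conflict_target_for_skip
    # ON CONFLICT DO NOTHING may omit the conflict target, so only use an explicit index.
    conflict_target_from_index
  end

  def conflict_target_for_update
    # ON CONFLICT DO UPDATE needs a target; fall back to the primary key(s) when no index is given.
    conflict_target_from_index || "(#{quote_columns(insert_all.primary_keys).join(',')})"
  end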
@boblail (Contributor) left a review comment

👍 Cool! I think supporting identifying the conflict target by name is a solid improvement!

(Also, I didn't know about connection.indexes(model.table_name) — TIL!)

I've got two questions about the API:

unique_by_index is kind of a wordy parameter name. Do we need to add _index? On one hand, it helps signal to users what value is expected. On the other hand, Rails-generated index names start with "index_", so the phrase unique_by_index: "index_books_on_isbn" repeats the word "index", while the phrase unique_by: "index_books_on_isbn" reads just fine.

Support both?

Does it make sense to support both ways of specifying an index? We could reuse remove_index's API here to increase predictability:

Examples:

Book.insert_all books, unique_by: { name: "index_books_on_isbn" }
Book.insert_all books, unique_by: { column: :isbn }
Book.insert_all books, unique_by: { column: %i[author title] }

If we wanted to support both, but departed a little from remove_index's API, we could forgo the nested hash by treating a string/symbol value as an index name and an array value as an array of columns:

Examples:

Book.insert_all books, unique_by: "index_books_on_isbn"
Book.insert_all books, unique_by: %i[isbn]
Book.insert_all books, unique_by: %i[author title]

(Note: in this scenario, we're dropping support for :where — and removing one way of specifying a partial index — but we've added the ability to specify a partial index by name. The array-of-columns syntax just gives you a less-wordy way of describing the most commonplace indexes.)
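
A minimal sketch of the name-or-columns dispatch described above (the method name is illustrative and this is not the code that ended up in the PR):

  def find_unique_index(unique_by)
    indexes = connection.indexes(model.table_name)

    case unique_by
    when String, Symbol # treat as an index name, e.g. "index_books_on_isbn"
      indexes.find { |index| index.name == unique_by.to_s }
    when Array          # treat as an ordered list of columns, e.g. %i[author title]
      indexes.find { |index| index.columns == unique_by.map(&:to_s) }
    end
  end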

Lastly, should we loop @dhh in on the conversation? He weighed in pretty heavily on the API of the original PR.

sql << " WHERE #{where}" if where
index = insert_all.unique_by_index

sql = +"(#{quote_columns(index&.columns || insert_all.primary_keys).join(',')})"

This comment has been minimized.

Copy link
@boblail

boblail Mar 9, 2019

Contributor

Good catch, @vishaltelangre. Yeah, that's true. It's OK (desirable, even) to leave the conflict target clause off of ON CONFLICT DO NOTHING. Implicitly using primary keys here would cause duplicates to be skipped (by default) only if they were duplicate by primary key.

sql << " WHERE #{where}" if where
index = insert_all.unique_by_index

sql = +"(#{quote_columns(index&.columns || insert_all.primary_keys).join(',')})"

@boblail (Contributor) commented Mar 9, 2019

We can also provide the name of an index as the conflict target! So:

Example:

INSERT ... ON CONFLICT index_books_on_isbn DO UPDATE ...

If we only support identifying indexes by name, I'm pretty sure this method boils down to

def conflict_target
  insert_all.unique_by_index
end

@kaspth (Author, Member) commented Mar 9, 2019

(Screenshot attached: 2019-03-09 at 20 48 51)

@kaspth (Author, Member) commented Mar 9, 2019

Hm, I cannot seem to get that to work. And I can't find documentation for SQLite3, PostgreSQL, or MySQL that says they support it. How does it work?

@boblail (Contributor) commented Mar 10, 2019

🤔 I might've read https://www.postgresql.org/docs/9.5/sql-insert.html#SQL-ON-CONFLICT wrong. But does it work like this? (with ON CONSTRAINT)

INSERT ... ON CONFLICT ON CONSTRAINT index_books_on_isbn DO UPDATE ...

Then again, maybe that's not the best plan:

Tip: It is often preferable to use unique index inference rather than naming a constraint directly using ON CONFLICT ON CONSTRAINT constraint_name. Inference will continue to work correctly when the underlying index is replaced by another more or less equivalent index in an overlapping way, for example when using CREATE UNIQUE INDEX ... CONCURRENTLY before dropping the index being replaced.

@kaspth (Author, Member) commented Mar 10, 2019

I tried that but it threw an "Unknown constraint error". Looks like we'll keep what we've got.

Ah nice tidbit! So that's why they prefer to infer the index instead of naming it.

@boblail (Contributor) commented Mar 9, 2019

BTW, thanks for looping me in again, @kaspth! 😄

kaspth added a commit that referenced this pull request Mar 9, 2019

Schema Cache: cache table indexes
Useful to not query for indexes when an application uses schema cache.

Ref #35546
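
For context, a hedged illustration of what the referenced commit enables, assuming the Rails 6 SchemaCache#indexes API and a hypothetical books table:

  # With table indexes cached, the unique_by index lookup needn't hit the database:
  Book.connection.schema_cache.indexes("books").map(&:name)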
@kaspth (Member, Author) commented Mar 9, 2019

True, the parameter name is too wordy. I've been thinking we just go with unique_index: :isbn with the column shorthand. That alludes to the fact that insert_all defaults to checking duplicates on the primary key and that you're merely changing that default.

Though it is tough to say which aspect should be stressed more. Is it the duplicate check, or is it the fact that the check is done with an index? I do like the latter because it hints at the performance aspect, e.g. if you're inserting 100,000 rows into an already large table, you'd want to check the index performance first.

kaspth added a commit that referenced this pull request Mar 10, 2019

Schema Cache: cache table indexes
Useful to not query for indexes when an application uses schema cache.

Ref #35546

kaspth added a commit that referenced this pull request Mar 12, 2019

Schema Cache: cache table indexes
Useful to not query for indexes when an application uses schema cache.

Ref #35546


@kaspth kaspth force-pushed the bulk-inserts-with-index branch from 400751d to d3a678a Mar 12, 2019

@kaspth (Member, Author) commented Mar 12, 2019

I'm pretty happy with this. I'm still open to other names and I'm going to do a final documentation pass tomorrow.

I've also killed the sort I had on the columns when finding indexes. That order means something, so I don't think an [ :author_id, :name ] index should be found with [ :name, :author_id ].

Resolved (outdated) review threads: activerecord/lib/active_record/persistence.rb, activerecord/test/cases/insert_all_test.rb

kaspth added some commits Mar 9, 2019

Bulk Insert: Reuse indexes for unique_by
I found `:unique_by` with `:columns` and `:where` inside it tough to
grasp. The documentation only mentioned indexes and partial indexes.
So why duplicate a model's indexes in an insert_all/upsert_all call
when we can just look it up?

This has the added benefit of raising if no index is found, such that
people can't insert thousands of records without relying on an index
of some form.

@kaspth kaspth force-pushed the bulk-inserts-with-index branch from d3a678a to 60c29e1 Mar 20, 2019

@kaspth (Member, Author) commented Mar 20, 2019

Went back to our first love: unique_by! Also did a documentation pass while I was digging through the options documentation.

@boblail I was honestly very surprised that insert_all/upsert_all matches every unique index by default. I thought it was just the primary key index that was used, and that unique_by would then simply shift the check to a different index. It also makes it seem somewhat strange that passing a unique_by can make insert_all/upsert_all throw an ActiveRecord::RecordNotUnique error; you'd think that would be reserved for insert_all!. So we should probably also look into the raising aspects of these two variants.

@kaspth kaspth merged commit 6b51307 into master Mar 20, 2019

1 check passed: codeclimate (All good!)

@kaspth kaspth deleted the bulk-inserts-with-index branch Mar 20, 2019

@simi (Contributor) commented Mar 20, 2019

👍 for keeping unique_by
