Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf: Improve performance of where when using an array of values #39009

Closed

Conversation

eileencodes
Copy link
Member

@eileencodes eileencodes commented Apr 21, 2020

A coworker at GitHub found a few months back that if we used
santitize_sql over where when we knew the values going into where
it was a lot faster than where.

This PR adds a new Arel node type called HomogenousIn that will be
used when Rails knows the values are all homogenous and can therefore
pick a faster codepath. This new codepath skips some of the required
processing by where to make wheres with homogenous arrays faster
without requiring the application author to know when to use which query
type.

Using our benchmark code:

ids = (1..1000).each.map do |n|
  Post.create!.id
end

Benchmark.ips do |x|
  x.report("where with ids") do
    Post.where(id: ids).to_a
  end

  x.report("where with sanitize") do
    Post.where(ActiveRecord::Base.sanitize_sql(["id IN (?)", ids])).to_a
  end

  x.compare!
end

Before this PR comparing where with a list of IDs to sanitize sql:

Warming up --------------------------------------
      where with ids    11.000  i/100ms
 where with sanitize    17.000  i/100ms

Calculating -------------------------------------
      where with ids    115.733  (± 4.3%) i/s -    583.000  in   5.045828s
 where with sanitize    174.231  (± 4.0%) i/s -    884.000  in   5.081495s

Comparison:
 where with sanitize:      174.2 i/s
      where with ids:      115.7 i/s - 1.51x  slower

After this PR comparing where with a list of IDs to sanitize sql:

Warming up --------------------------------------
      where with ids    16.000  i/100ms
 where with sanitize    19.000  i/100ms

Calculating -------------------------------------
      where with ids    158.293  (± 6.3%) i/s -    800.000  in   5.072208s
 where with sanitize    169.141  (± 3.5%) i/s -    855.000  in   5.060878s

Comparison:
 where with sanitize:      169.1 i/s
      where with ids:      158.3 i/s - same-ish: difference falls within error

Notes:

  1. sanitize2 is a horrific name, obviously. We aren't sure what to
    name it though. Thoughts?
  2. HomogenousIn may not be the best name for our new node.

Co-authored-by: Aaron Patterson aaron.patterson@gmail.com

cc/ @tenderlove

A coworker at GitHub found a few months back that if we used
`santitize_sql` over `where` when we knew the values going into `where`
it was a lot faster than `where`.

This PR adds a new Arel node type called `HomogenousIn` that will be
used when Rails knows the values are all homogenous and can therefore
pick a faster codepath. This new codepath skips some of the required
processing by `where` to make `wheres` with homogenous arrays faster
without requiring the application author to know when to use which query
type.

Using our benchmark code:

```ruby
ids = (1..1000).each.map do |n|
  Post.create!.id
end

Benchmark.ips do |x|
  x.report("where with ids") do
    Post.where(id: ids).to_a
  end

  x.report("where with sanitize") do
    Post.where(ActiveRecord::Base.sanitize_sql(["id IN (?)", ids])).to_a
  end

  x.compare!
end
```

Before this PR comparing where with a list of IDs to santitize sql:

```
Warming up --------------------------------------
      where with ids    11.000  i/100ms
 where with sanitize    17.000  i/100ms

Calculating -------------------------------------
      where with ids    115.733  (± 4.3%) i/s -    583.000  in   5.045828s
 where with sanitize    174.231  (± 4.0%) i/s -    884.000  in   5.081495s

Comparison:
 where with sanitize:      174.2 i/s
      where with ids:      115.7 i/s - 1.51x  slower
```

After this PR comparing where with a list of IDs to santitize sql:

```
Warming up --------------------------------------
      where with ids    16.000  i/100ms
 where with sanitize    19.000  i/100ms

Calculating -------------------------------------
      where with ids    158.293  (± 6.3%) i/s -    800.000  in   5.072208s
 where with sanitize    169.141  (± 3.5%) i/s -    855.000  in   5.060878s

Comparison:
 where with sanitize:      169.1 i/s
      where with ids:      158.3 i/s - same-ish: difference falls within error
```

Notes:

1) `santitize2` is a horrific name, obviously. We aren't sure what to
name it though. Thoughts?
2) `HomogenousIn` may not be the best name for our new node.

Co-authored-by: Aaron Patterson <aaron.patterson@gmail.com>
@tenderlove
Copy link
Member

I'm not sure that HomogenousIn is the best name because the collection that it wraps is not homogenous in the traditional sense. Not all types in the collection are the same, rather they are all safe to pass to connection quote, and they are all terminal nodes (IOW, the HomoegenousIn node cannot point to a subselect, where the regular In node can)

def cast(value)
value = \
value = if value <=> 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a comment here about why we're doing this. This is a huge "hack" to check to see whether or not value is numeric. Non-numbers will return nil for <=>, and it avoid us need to do an is_a? check. We found using the spaceship operator was faster than checking types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why not use value.is_a?(Numeric). <=> is faster alternative of value.is_a?(Numeric)?

@tenderlove
Copy link
Member

tenderlove commented Apr 21, 2020

The issue with the current serialize method is that it does both value validation and serialization (but only sometimes). We would like to split that functionality in to two methods for this patch.

Here is an example of what I mean. The active model integer type's serialize method (which can be found here) looks like this:

      def serialize(value)
        return if value.is_a?(::String) && non_numeric_string?(value)
        ensure_in_range(super)
      end

ensure_in_range will validate that the value is valid. If it isn't, it raises an exception. If no exception is raised, the call to super will actually serialize it and return the serialized form.

Another behavior that Active Record has is that we will omit values that are outside of range supported by the database. Here is an example:

class Post < ActiveRecord::Base
end

params = {}
params[:id] = 9999999999999999999999999999999

$-d = true
Post.where(id: [params[:id], 1234]).to_a

The SQL executed by the above code is:

SELECT `posts`.* FROM `posts` WHERE `posts`.`id` IN (1234)

Active Record strips out the very large number because the database can't handle it.

An unfortunate consequence of "raising + serializing" combined as one unit, and stripping out-of-range values is that we have to raise and catch exceptions in order to support this. If you run the above example, you'll actually see that an ActiveModel::RangeError is raised in the course of running the program on master (this branch does not):

Exception `ActiveModel::RangeError' at /Users/aaron/git/rails/activemodel/lib/active_model/type/integer.rb:40 - 9999999999999999999999999999999 is out of range for ActiveModel::Type::Integer with limit 8 bytes
D, [2020-04-21T14:38:37.905285 #78972] DEBUG -- :   Post Load (0.5ms)  SELECT `posts`.* FROM `posts` WHERE `posts`.`id` IN (1234)

It would be nice if we could check that a value is valid without having to serialize it, only to get an exception.

Unfortunately the serialize method in Active Model is public, so we don't want to change the behavior. Is there a different name we could use?

Second, this "raise if not valid" behavior is not consistent among all types. For example, the Enum type will not raise an exception on invalid values.

I think serializable? is a good method that should be in Active Model. But I don't know what to do about serialize vs serialize2.

values.map! do |v|
predicate_builder.build_bind_attribute(attribute.name, v)
if nils.empty? && ranges.empty?
predicate_builder.connection.in_clause_length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
predicate_builder.connection.in_clause_length

I think we can remove this.

@tenderlove
Copy link
Member

I'm not sure that HomogenousIn is the best name because the collection that it wraps is not homogenous in the traditional sense. Not all types in the collection are the same, rather they are all safe to pass to connection quote, and they are all terminal nodes (IOW, the HomoegenousIn node cannot point to a subselect, where the regular In node can)

I was thinking about this a little more and I want to take back my statement. We know the list is homogenous (or must be homogenous) in the "traditional sense". All the values must have the same type. This line proves it. They all must be the same type so we can use the same type caster.

We should probably push this logic down in to the HomogenousIn node.

@@ -8,14 +8,21 @@ def serialize(value)
cast(value)
end

def serialize2(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why serialize2 exists? All definitions of it in this PR seems to have the same code as serialize.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used here https://github.com/rails/rails/pull/39009/files#diff-3bcf02e80bc52706c3df424d2a8c4db2R31 and it's a terrible name but we're not sure what to call it. The other serialize methods do both serialize and validation, this one just does serialization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if all the implementations of serialize2 on this PR is always the same as the implementation as serialize I don't see the different and serialize2 is also doing serialization and validation, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the same. Check the serialize method on Integer here. It's doing both serialization and validation. This one is only doing serialization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the role of serialize2 to be a fast, unchecked serialize (i.e. "garbage in, garbage out")? If so, would unchecked_serialize be an appropriate name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that was the context I was missing. In this diff all serialize2 methods have the same implementation of the serialize in the same files, but I was missing that the subclasses were overriding the method.

If we want to serialize2 to be part of the public API I'd go with @jonathanhefner's suggestion, but with the long term goal of renaming it to serialize and renaming the current serialize to something as verify_and_serialize.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelfranca I like @jonathanhefner's suggestion too. @eileencodes wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me!

@@ -11,6 +11,10 @@ def initialize(precision: nil, limit: nil, scale: nil)
@limit = limit
end

def serializable?(_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be part of the public API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should. I would like it if active model types can tell us whether or not a value can be serialized for that particular type. We should add some docs, but I didn't want to go that far unless everyone was OK with adding this to the public API

kamipo added a commit to kamipo/rails that referenced this pull request Apr 29, 2020
This is a smaller alternative of performance improvement, without
refactoring type casting mechanism rails#39009.

This is relatively a smaller change (but about 40% faster than before),
so I think this could be easier reviewed without discuss about
refactoring type casting mechanism.

This just makes `attribute.in(values)` less allocation from an array of
casted nodes to one casted array node.

```ruby
ids = (1..1000).each.map do |n|
  Post.create!.id
end

Benchmark.ips do |x|
  x.report("where with ids") do
    Post.where(id: ids).to_a
  end

  x.report("where with sanitize") do
    Post.where(ActiveRecord::Base.sanitize_sql(["id IN (?)", ids])).to_a
  end

  x.compare!
end
```

Before:

```
Warming up --------------------------------------
      where with ids     7.000  i/100ms
 where with sanitize    13.000  i/100ms

Calculating -------------------------------------
      where with ids     70.661  (± 5.7%) i/s -    357.000  in   5.072771s
 where with sanitize    130.993  (± 7.6%) i/s -    663.000  in   5.096085s

Comparison:
 where with sanitize:      131.0 i/s
      where with ids:       70.7 i/s - 1.85x  slower
```

After:

```
Warming up --------------------------------------
      where with ids    10.000  i/100ms
 where with sanitize    13.000  i/100ms

Calculating -------------------------------------
      where with ids     98.174  (± 7.1%) i/s -    490.000  in   5.012851s
 where with sanitize    132.289  (± 8.3%) i/s -    663.000  in   5.052728s

Comparison:
 where with sanitize:      132.3 i/s
      where with ids:       98.2 i/s - 1.35x  slower
```
@eileencodes
Copy link
Member Author

We fixed the method names, made some improvements, and manually merged this in 70ddb8a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants