Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent schema_cache leaking connections between threads #38577

Closed

Conversation

davidtaylorhq
Copy link

At Discourse, we have been seeing some unexpected behaviour in our Sidekiq worker processes, which run jobs in a multi-threaded environment. We were seeing multiple threads accessing the same database connections concurrently.

The root cause appears to be 49b6b21, which changed schema_cache to be per-pool rather than per-connection. However, the schema_cache still maintains an internal reference to a single connection. This reference was replaced each time the schema_cache was requested, but the change would affect all threads. Under some circumstances, this can result in multiple threads trying to use the same connection.

This commit switches back to a single SchemaCache instance per connection, but also introduces a new SchemaCacheData class. A single instance of SchemaCacheData is shared between all SchemaCache instances for a pool.

cc @SamSaffron

49b6b21 changed schema_cache to be per-connection-pool rather than per-connection. However, the schema_cache still maintains an internal reference to a single connection. This reference was replaced each time the schema_cache was requested, but the change would affect all threads. Under some circumstances, this can result in multiple threads trying to use the same connection.

This commit switches back to a single SchemaCache instance per connection, but also introduces a new SchemaCacheData class. A single instance of SchemaCacheData is shared between all SchemaCache instances for a pool.
@eileencodes eileencodes self-assigned this Feb 26, 2020
@eileencodes eileencodes added this to the 6.1.0 milestone Feb 26, 2020
Copy link
Member

@eileencodes eileencodes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @davidtaylorhq thanks so much for the PR.

Is there a way to do this without making big changes to the SchemaCache class? My concern is that we can't backport it because with the changes it's more of a "feature + bugfix" than just a bug fix. The other possibility is we can merge this as-is (after I do some more testing with our app) and then backport a portion of it to fix the leak you're experiencing.

This change also needs a CHANGELOG.

end

def set_schema_cache(cache)
self.schema_cache = cache
# The pool's cache should not have a connection attached
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this comment?

@davidtaylorhq
Copy link
Author

Thanks for reviewing @eileencodes - I removed the unnecessary comment and added to the CHANGELOG (I pushed the commits to the branch, but they haven't appeared here yet... probably due to the earlier GitHub issues)

Is there a way to do this without making big changes to the SchemaCache class? My concern is that we can't backport it because with the changes it's more of a "feature + bugfix" than just a bug fix.

I've tried to make the changes as minimal as possible. There shouldn't be any new feature introduced here - sharing the cache data across the whole pool was implemented back in 49b6b21. This change just fixes that feature to make it thread-safe.

The diff is reasonably large, but the changes are mostly converting instance variables into data. references, so that the data is separated from the connection logic. The existing public methods on SchemaCache remain the same.

@eileencodes
Copy link
Member

The diff is reasonably large, but the changes are mostly converting instance variables into data references

Yes but this class is public and you've change the interface and how to interact with the schema cache, so anyone who is expecting to access the methods.

@davidtaylorhq
Copy link
Author

davidtaylorhq commented Feb 27, 2020

you've change the interface and how to interact with the schema cache

I've tried to avoid doing anything which would be backwards-incompatible. All existing interfaces remain the same, so the only change is the addition of SchemaCache#data, SchemaCache#data= and a new optional parameter to the initializer. If these three new interfaces are unused, behaviour should be backwards-compatible.

I would love to find a solution which has absolutely zero interface change, but haven't been able to come up with one.

@SamSaffron
Copy link
Contributor

@eileencodes / @davidtaylorhq / @rafaelfranca what we could do is make this PR 2 commits, first commit improves the existing state so it is not as bad, that change is ultra safe to merge.

Basically this monkey patch:

https://github.com/discourse/discourse/blob/master/lib/freedom_patches/schema_cache_concurrency.rb

Something like

      def connection=(connection)
        @connection = connection
        Thread.current["schema_cached_connection"] = connection
      end

      def connection
        Thread.current["schema_cached_connection"] || @connection
      end

This is clearly not ideal, but will work for 99.99% of the cases, you need to do some serious thread shuffling to break it.

Then in 6.1 we can introduce the more comprehensive patch, how does that sound?

Copy link
Member

@jeremy jeremy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could pass the pool to the SchemaCache instead of a connection? We'd continue to accept a (deprecated) connection for backward compatibility. When SchemaCache needs to use the connection, it'd request one from the pool: … = @pool.with_connection { |c| c.primary_key(table_name) }

@davidtaylorhq
Copy link
Author

Hi @jeremy

Perhaps we could pass the pool to the SchemaCache instead of a connection?

This is how I approached the problem to start with, and came up with this solution. Conceptually I prefer it, but after discussing with @SamSaffron we had some concerns:

  • It changes the public interface more than this PR

  • When the cache is cold, it means two connections are required per thread (one 'normal' connection, and one checked out specifically for SchemaCache). If the pool is low on connections, this could cause a thread to block and slow down the application.

That said, happy to develop this approach some more if those two problems can be mitigated

@rails-bot
Copy link

rails-bot bot commented May 28, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Thank you for your contributions.

@rails-bot rails-bot bot added the stale label May 28, 2020
@eileencodes eileencodes added pinned and removed stale labels May 28, 2020
@rafaelfranca rafaelfranca removed this from the 6.1.0 milestone Nov 24, 2020
Base automatically changed from master to main January 14, 2021 17:01
@byroot
Copy link
Member

byroot commented Jul 28, 2021

I've been meaning to do a similar refactor for a while. IMO it should even go further as if you have say read replicas, the read replica pool and the write pool should ideally share their schema cache (or at least most of the data).

Yes but this class is public and you've change the interface and how to interact with the schema cache, so anyone who is expecting to access the methods.

@eileencodes could you point out the changes you think are not acceptable? Maybe I'm missing something but I don't really see the interface change here. Or maybe the PR was updated after your review?

@eileencodes
Copy link
Member

@eileencodes could you point out the changes you think are not acceptable? Maybe I'm missing something but I don't really see the interface change here. Or maybe the PR was updated after your review?

The SchemaCache class isn't marked as nodoc which means most of it is public. By moving the access to the data into a new class it breaks the existing public API. For example the attr_reader for :version or accessing data_sources from the cache is now not available without going through data.. While this class isn't documented it's not private either so not only is this a breaking change, it's not backportable because of those breaking changes.

@byroot
Copy link
Member

byroot commented Jul 29, 2021

I see, then maybe the solution is to have all these changes as a new class, and leave the original one untouched.

davidtaylorhq added a commit to discourse/discourse that referenced this pull request Feb 1, 2022
This adapter ensures that MiniSql locks the ActiveRecord mutex before using the raw PG connection. This ensures that multiple threads will not attempt to use the same connection simultaneously.

This commit also removes the schema_cache_concurrency freedom-patch, which is no longer required now that cross-thread connection use is controlled by the mutex.

For the original root cause of this issue, see rails/rails#38577
davidtaylorhq added a commit to discourse/discourse that referenced this pull request Feb 2, 2022
This adapter ensures that MiniSql locks the ActiveRecord mutex before using the raw PG connection. This ensures that multiple threads will not attempt to use the same connection simultaneously.

This commit also removes the schema_cache_concurrency freedom-patch, which is no longer required now that cross-thread connection use is controlled by the mutex.

For the original root cause of this issue, see rails/rails#38577
davidtaylorhq added a commit to discourse/discourse that referenced this pull request Feb 3, 2022
This adapter ensures that MiniSql locks the ActiveRecord mutex before using the raw PG connection. This ensures that multiple threads will not attempt to use the same connection simultaneously.

This commit also removes the schema_cache_concurrency freedom-patch, which is no longer required now that cross-thread connection use is controlled by the mutex.

For the original root cause of this issue, see rails/rails#38577
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants