New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

makara connection pool management is not thread-safe #151

Open
aks opened this Issue Mar 14, 2017 · 17 comments

Comments

Projects
None yet
10 participants
@aks

aks commented Mar 14, 2017

The connection pool management is not thread-safe. So, when running makara in thread-intensive code, such as sidekiq, subtle errors creep in.

The fix is not entirely straightforward: AR 4 and 5 has reorganized the connection pool management to be thread-safe, with lots of mutex usage. Makara has none of that, and it's connection pool management is not thread-safe.

In fact, makara uses an array of connections for each pool, which is traversed each time a connection decision is being made. Connections are added up to the maximum connection, but there are no protections against simultaneous access and changes across threads.

In addition, in the master and replica (slave) connection pools, there are connections with different states in each array: blacklisted or not.

We could protect the connection pool array with a semaphore, but traversing an array behind a semaphore is a Bad Idea because it blocks all other threads from traversing that same array. If the semaphore supported read-locking in addition to write-locking then multi-thread traversals would work in parallel, except for adding or removing connections.

However, IMHO, it would be better to have multiple arrays of connection pools, with each kind of connection being queued into a separate pool (array) of connections. So, blacklisted connections would be maintained in a blacklisted connection pool, and the remaining set of available connections would be in the available connection pool. Then, each pool could be managed as a thread-safe queue object.

Makara should be updated to make best use of the latest AR connection pool code, becoming thread-safe in the process.

@bleonard

This comment has been minimized.

Show comment
Hide comment
@bleonard

bleonard Mar 14, 2017

Member

Thanks for taking the time to write this up @aks
It reads like a good summary of the issue at hand.

I don't suppose you want to give it a shot? 😄

Member

bleonard commented Mar 14, 2017

Thanks for taking the time to write this up @aks
It reads like a good summary of the issue at hand.

I don't suppose you want to give it a shot? 😄

@aks

This comment has been minimized.

Show comment
Hide comment
@aks

aks Mar 15, 2017

Well, maybe.

I've already forked the makara repo, and have added a hijack on the connection method.

However, that change was insufficient. There are still AR timeouts occurring in our sidekiq processes, which are thread-full, and the stacktraces on those timeouts do not include makara code anywhere.

Whether or not the timeouts and lack-of-thread-awareness in makara are related is still TBD.

So, there appear to be two issues:

  1. makara has incomplete coverage of current AR methods (it doesn't hijack all the necessary methods)
  2. makara does not manage its connection correctly in a thread-full environment

I've reviewed the ActiveRecord 4.2.6 and 5 code connection adapters and connection handling, to get an idea of how hard it would be to update/rewrite makara to become an alternative connection handler for AR.

In the current AR 4.2.6+ or 5 code, each model has a connection handler, which is supposed to determine the appropriate connection using a connection spec. Once a connection is selected, it is cached.

In order to support dynamic query balancing, each model would have to use a dynamic connection handler, which would choose a connection on each query, instead of using a static, cached connection. This appears to require a change to AR itself.

Would you happen to be aware anyone seriously looking into these AR issues? Are you or anyone on your team doing so?

The alternative to using an AR "insert" like makara is a proxy service, like pg_bouncer or pg_pool which could also be reconfigured to perform dynamic load-balancing.

Since we are already using pg_bouncer, it's conceptually simpler and comparatively easier (in theory) to consider updating it to perform load-balancing dynamically, based on the SQL statement and its context.

As you might guess, right now I'm looking for the Path of Least Effort, much as the cat in Heinlein's book was always looking for the "Door Into Summer". I'm really hopeful that you might have a good answer to help us find that door.

aks commented Mar 15, 2017

Well, maybe.

I've already forked the makara repo, and have added a hijack on the connection method.

However, that change was insufficient. There are still AR timeouts occurring in our sidekiq processes, which are thread-full, and the stacktraces on those timeouts do not include makara code anywhere.

Whether or not the timeouts and lack-of-thread-awareness in makara are related is still TBD.

So, there appear to be two issues:

  1. makara has incomplete coverage of current AR methods (it doesn't hijack all the necessary methods)
  2. makara does not manage its connection correctly in a thread-full environment

I've reviewed the ActiveRecord 4.2.6 and 5 code connection adapters and connection handling, to get an idea of how hard it would be to update/rewrite makara to become an alternative connection handler for AR.

In the current AR 4.2.6+ or 5 code, each model has a connection handler, which is supposed to determine the appropriate connection using a connection spec. Once a connection is selected, it is cached.

In order to support dynamic query balancing, each model would have to use a dynamic connection handler, which would choose a connection on each query, instead of using a static, cached connection. This appears to require a change to AR itself.

Would you happen to be aware anyone seriously looking into these AR issues? Are you or anyone on your team doing so?

The alternative to using an AR "insert" like makara is a proxy service, like pg_bouncer or pg_pool which could also be reconfigured to perform dynamic load-balancing.

Since we are already using pg_bouncer, it's conceptually simpler and comparatively easier (in theory) to consider updating it to perform load-balancing dynamically, based on the SQL statement and its context.

As you might guess, right now I'm looking for the Path of Least Effort, much as the cat in Heinlein's book was always looking for the "Door Into Summer". I'm really hopeful that you might have a good answer to help us find that door.

@bf4

This comment has been minimized.

Show comment
Hide comment
@bf4

bf4 Apr 14, 2017

Contributor

I kind of like how switch_point does it https://github.com/eagletmt/switch_point/blob/master/lib/switch_point/proxy.rb#L22-L33 by making an active record subclass just for getting access to a connection_pool, similar to https://github.com/customink/secondbase/blob/master/lib/second_base/base.rb

or https://github.com/instructure/shackles/blob/master/lib/shackles/connection_handler.rb kind of extends the connection handler

Contributor

bf4 commented Apr 14, 2017

I kind of like how switch_point does it https://github.com/eagletmt/switch_point/blob/master/lib/switch_point/proxy.rb#L22-L33 by making an active record subclass just for getting access to a connection_pool, similar to https://github.com/customink/secondbase/blob/master/lib/second_base/base.rb

or https://github.com/instructure/shackles/blob/master/lib/shackles/connection_handler.rb kind of extends the connection handler

@aks

This comment has been minimized.

Show comment
Hide comment
@aks

aks Apr 14, 2017

The fresh_connection gem (https://github.com/tsukasaoishi/fresh_connection) has an elegant way to insert itself into AR, without having to "hijack" lots of methods.

aks commented Apr 14, 2017

The fresh_connection gem (https://github.com/tsukasaoishi/fresh_connection) has an elegant way to insert itself into AR, without having to "hijack" lots of methods.

@jwg2s

This comment has been minimized.

Show comment
Hide comment
@jwg2s

jwg2s Jun 8, 2017

@aks how have you guys mitigated this? We're evaluating makara with https://github.com/ankane/distribute_reads + sidekiq, and in our staging environment seeing that no Postgres data is being collected. My hunch is that this is related to makara's lack of thread safety.

This seems like a deal breaker for anyone using makara, right?

EDIT: I think this is just a reporting issue, what used to show up as Postgres is now showing up as ActiveRecord in NewRelic. Still interested to hear your thoughts on whether makara is really safe to use with sidekiq in a production setting.

jwg2s commented Jun 8, 2017

@aks how have you guys mitigated this? We're evaluating makara with https://github.com/ankane/distribute_reads + sidekiq, and in our staging environment seeing that no Postgres data is being collected. My hunch is that this is related to makara's lack of thread safety.

This seems like a deal breaker for anyone using makara, right?

EDIT: I think this is just a reporting issue, what used to show up as Postgres is now showing up as ActiveRecord in NewRelic. Still interested to hear your thoughts on whether makara is really safe to use with sidekiq in a production setting.

@swordfish444

This comment has been minimized.

Show comment
Hide comment
@swordfish444

swordfish444 Sep 11, 2017

@aks This prevented us from adopting Makara. Is there any movement on this? It would be great to see this fixed!

swordfish444 commented Sep 11, 2017

@aks This prevented us from adopting Makara. Is there any movement on this? It would be great to see this fixed!

@bleonard

This comment has been minimized.

Show comment
Hide comment
@bleonard

bleonard Sep 12, 2017

Member

@jeremy with your knowledge of the AR internals, any thoughts on this?
Have you done anything to fix this issue in your environment?

Member

bleonard commented Sep 12, 2017

@jeremy with your knowledge of the AR internals, any thoughts on this?
Have you done anything to fix this issue in your environment?

@rajagopals

This comment has been minimized.

Show comment
Hide comment
@rajagopals

rajagopals Aug 2, 2018

@aks @jwg2s @bleonard We are evaluating using makara for read-write splitting and came across this thread. Do you have updates/suggestions/mitigation options for this issue?

I don't see other well-maintained gems for real-write splitting in production, anyone has other suggestions that worked well in production environment?

FYI We use AWS Aurora MySQL as our backing store and our application is hosted on Heroku.

rajagopals commented Aug 2, 2018

@aks @jwg2s @bleonard We are evaluating using makara for read-write splitting and came across this thread. Do you have updates/suggestions/mitigation options for this issue?

I don't see other well-maintained gems for real-write splitting in production, anyone has other suggestions that worked well in production environment?

FYI We use AWS Aurora MySQL as our backing store and our application is hosted on Heroku.

@aks

This comment has been minimized.

Show comment
Hide comment
@aks

aks Aug 2, 2018

aks commented Aug 2, 2018

@dzunk

This comment has been minimized.

Show comment
Hide comment
@dzunk

dzunk Aug 2, 2018

@rajagopals recently went through the same thing, needed to migrate off of pgpool/pgbouncer and the thread-safety issue ultimately prevented us from adopting Makara.

We wound up using active_record_slave on top of an Aurora Postgres cluster. No issues yet in a multi-threaded production environment, handling ~40k queries per minute. YMMV

dzunk commented Aug 2, 2018

@rajagopals recently went through the same thing, needed to migrate off of pgpool/pgbouncer and the thread-safety issue ultimately prevented us from adopting Makara.

We wound up using active_record_slave on top of an Aurora Postgres cluster. No issues yet in a multi-threaded production environment, handling ~40k queries per minute. YMMV

@ankane

This comment has been minimized.

Show comment
Hide comment
@ankane

ankane Sep 20, 2018

Contributor

Hey @aks, can you explain more about how the issue manifests itself, and how to reproduce it?

Contributor

ankane commented Sep 20, 2018

Hey @aks, can you explain more about how the issue manifests itself, and how to reproduce it?

@patrykk21

This comment has been minimized.

Show comment
Hide comment
@patrykk21

patrykk21 Oct 10, 2018

Any updates here?

patrykk21 commented Oct 10, 2018

Any updates here?

@jeremy

This comment has been minimized.

Show comment
Hide comment
@jeremy

jeremy Oct 10, 2018

Contributor

@patrykk21 Are you investigating it? Please do share your findings!

Contributor

jeremy commented Oct 10, 2018

@patrykk21 Are you investigating it? Please do share your findings!

@patrykk21

This comment has been minimized.

Show comment
Hide comment
@patrykk21

patrykk21 Oct 10, 2018

@jeremy I tried replicating the issue with both gems on our project locally without success. I'm generating a new clean Rails 5.1 app to have more control over the environment and trying to break it.

Do you have any insight or any specific tests that would break it?

patrykk21 commented Oct 10, 2018

@jeremy I tried replicating the issue with both gems on our project locally without success. I'm generating a new clean Rails 5.1 app to have more control over the environment and trying to break it.

Do you have any insight or any specific tests that would break it?

@jeremy

This comment has been minimized.

Show comment
Hide comment
@jeremy

jeremy Oct 12, 2018

Contributor

@patrykk21 How'd you try to replicate it?

Contributor

jeremy commented Oct 12, 2018

@patrykk21 How'd you try to replicate it?

@patrykk21

This comment has been minimized.

Show comment
Hide comment
@patrykk21

patrykk21 Oct 12, 2018

@jeremy Well, using our app with multiple simultaneous requests through sidekiq and some manual tests through threads, expecting to have some timeout errors but it actually never occurred.

patrykk21 commented Oct 12, 2018

@jeremy Well, using our app with multiple simultaneous requests through sidekiq and some manual tests through threads, expecting to have some timeout errors but it actually never occurred.

@jeremy

This comment has been minimized.

Show comment
Hide comment
@jeremy

jeremy Oct 12, 2018

Contributor

@patrykk21 Hmm, that would be a matter of getting lucky to see an issue. We'd need to examine the code to see how to get concurrent threads into a state that causes specific problems. Considering you're using Sidekiq and there are other libraries that do suggest they're thread-safe (possibly by leaning more heavily on Active Record's connection handlers), I'd go with an alternative for now.

Contributor

jeremy commented Oct 12, 2018

@patrykk21 Hmm, that would be a matter of getting lucky to see an issue. We'd need to examine the code to see how to get concurrent threads into a state that causes specific problems. Considering you're using Sidekiq and there are other libraries that do suggest they're thread-safe (possibly by leaning more heavily on Active Record's connection handlers), I'd go with an alternative for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment