New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kill long running queries before creating/dropping triggers #1

Merged
merged 14 commits into from Mar 14, 2018

Conversation

2 participants
@jackfan108

jackfan108 commented Mar 7, 2018

Notes

  • kills and logs long running queries on the origin table before creating/dropping triggers
  • stops and raises an error when creating/dropping triggers take more than 3 seconds (USE LOCK_WAIT_TIMEOUT)
  • add env var to control the query-killing mechanism (need to pro-actively add in env var)
  • [fill me in] unnecessary query killing when trigger finishes real fast (addressed in Potential Future Direction)

to-do

  • add specs
  • manual testing
  • confirm with @goodgravy @timminkov that we are targeting the correct tables
  • James will be thinking about the race condition
  • hold off threading approach until James reaches conclusion
  • implement logic to address race condition
  • move sleep to beginning of the block
  • do setting/unsetting SESSION LOCK_WAIT_TIMEOUT once
  • do unsetting in a ensure block
  • use minitest not rspec
  • set rails_teespring to point at this version of lhm teepsring/lhm
  • SET ENV VAR ON PROD
  • test migration on campaigns to ensure that this PR is working properly (https://github.com/teespring/rails-teespring/pull/17172)

Potential Future Direction

  • make query killing less aggressive (e.g. only killing 60+ sec queries)
  • have (a set amount of) retries for trigger creation (at some point we might get lucky?)
  • use Threadswait to clean up threads
  • in the case that triggers get created successfully, the new thread would still get executed, meaning we will be unnecessarily killing "long running" queries (unnecessarily being the queries that will be killed shouldn't be killed since they aren't blocking triggers from getting created.

@jackfan108 jackfan108 force-pushed the migration-mageddon-kill-long-queries-before-doing-trigger-stuffs branch from 0b0b96c to cf6724b Mar 8, 2018

@jackfan108 jackfan108 force-pushed the migration-mageddon-kill-long-queries-before-doing-trigger-stuffs branch from cf6724b to f2e2d62 Mar 8, 2018

@connection.execute(tagged(stmt))
end
rescue Timeout::Error
error("statement: \"#{stmt}\" took longer than #{sec} seconds to run... ABORT!")

This comment has been minimized.

@goodgravy

goodgravy Mar 8, 2018

Member

Will this kill the process in the DB?

This comment has been minimized.

@jackfan108

jackfan108 Mar 9, 2018

@goodgravy the new implementation kills the query and unblocks subsequent queries. I have a manual integration test that I can show you.

def kill_long_running_queries_on_origin_table!
return unless ENV['LHM_KILL_LONG_RUNNING_QUERIES'] == 'true'
3.times do
long_running_queries(@orogin.name).each do |id, query, duration|

This comment has been minimized.

@goodgravy

goodgravy Mar 8, 2018

Member

Typo: orogin

result = @connection.execute <<-SQL.strip_heredoc
SELECT ID, INFO, TIME FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE command <> 'Sleep'
AND INFO LIKE '%FROM `#{table_name}`%'

This comment has been minimized.

@goodgravy

goodgravy Mar 8, 2018

Member

What about queries like SELECT * FROM users INNER JOIN campaigns ON ...?

3.times do
long_running_queries(@orogin.name).each do |id, query, duration|
Lhm.logger.info "Action on table #{table_name} detected; killing #{duration}-second query: #{query}."
@connection.execute("KILL #{id}")

This comment has been minimized.

@goodgravy

goodgravy Mar 8, 2018

Member

Just a side-note that only the owning user can kill processes. 99% of the time that's fine because the queries will be run by the same user as the LHM migration, but how do we handle long queries run by e.g. the read-only user?

Logging the result of the KILL might be enough here. We can always manually kill things in the rare instances the scripted KILL doesn't work.

This comment has been minimized.

@jackfan108

jackfan108 Mar 9, 2018

I do not have a good answer for this. If like vangogh-2015-10 has a long running query but lhm is using user heroku-2015-10, what do we do?

  1. I don't know if that is a real concern. i.e. do we actually have long running queries from non heroku-2015-10 ?
  2. I don't think master user can even kill other users' queries. I tried live on snapshotDB to kill other queries and got an error.
entangle.each do |stmt|
@connection.execute(tagged(stmt))
execute_with_timeout(stmt, MAX_RUNNING_SECONDS)

This comment has been minimized.

@goodgravy

goodgravy Mar 8, 2018

Member

There is a race condition here, where a long-running query could start in between killing the ones that were already there, and starting the trigger creation / deletion.

My first thought for how to fix that is: start the trigger creation in a thread, then kill long-running queries on origin table after a few seconds.

end

def execute_with_timeout(stmt, sec)
Timeout.timeout(sec) do

This comment has been minimized.

@goodgravy

goodgravy Mar 8, 2018

Member

I am strongly opposed to using Timeout here! We could be causing all kinds of carnage inside of LHM, the MySQL driver, the DB…

See http://www.mikeperham.com/2015/05/08/timeout-rubys-most-dangerous-api/

Can we use a session-local MySQL timeout variable instead? That will ensure resources are cleaned up.

# the goal of this thread is to wait until long running queries that started between
# #kill_long_running_queries_on_origin_table! and trigger creation
# to pass the threshold time and then kill them
sleep(LONG_QUERY_TIME_THRESHOLD + 2)

This comment has been minimized.

@jackfan108

jackfan108 Mar 13, 2018

2 --> TIME_TO_WAIT_FOR_QUERY_TO_START (some clarifying name)


module Lhm
class Entangler
include Command
include SqlHelper

TABLES_WITH_LONG_QUERIES = %w(designs campaigns campaign_roots tags orders).freeze
LONG_QUERY_TIME_THRESHOLD = 5
SESSION_WAIT_LOCK_TIMEOUT = LONG_QUERY_TIME_THRESHOLD * 4

This comment has been minimized.

@jackfan108

jackfan108 Mar 13, 2018

LONG_QUERY_TIME_THRESHOLD = 10
INITIALIZATION_DELAY = 2
TRIGGER_MAXIMUM_DURATION = 2
SESSION_WAIT_LOCK_TIMEOUT = LONG_QUERY_TIME_THRESHOLD + INITIALIZATION_DELAY + TRIGGER_MAXIMUM_DURATION

@jackfan108 jackfan108 force-pushed the migration-mageddon-kill-long-queries-before-doing-trigger-stuffs branch from 8be8a25 to 18893fd Mar 14, 2018


puts "cleaning up rogue long queries..."
ActiveRecord::Base.connection.reconnect!
ids = ActiveRecord::Base.connection.execute("select id from information_schema.processlist where info like '\%sleep(1000)%' and time > 1").to_a.flatten

This comment has been minimized.

@jackfan108

jackfan108 Mar 14, 2018

Put select sleep(1000) from origin` in a variable and target the query specifically.

@jackfan108 jackfan108 merged commit e302b5b into master Mar 14, 2018

@jackfan108 jackfan108 deleted the migration-mageddon-kill-long-queries-before-doing-trigger-stuffs branch Mar 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment