Fixes #23017 - Rex resilience #298

Merged
merged 24 commits into theforeman:master from adamruzicka:rex-resilience on May 2, 2018

Conversation

@adamruzicka
Contributor

adamruzicka commented Nov 15, 2017

@adamruzicka

Contributor

adamruzicka commented Dec 13, 2017

Test failure seems unrelated.

@iNecas

Member

iNecas commented Jan 16, 2018

[test]

end
def delegated_actions
  action.sub_plans('state' => %w[running])

@iNecas

iNecas Feb 7, 2018

Member

This can be a potentially expensive and memory-hungry method, loading all the action data for a job with 10000s of hosts. I'm thinking of an alternative approach, where we would keep track of the association between the local action and the remote one in a separate table, i.e. track the relation between the local step and the remote task. We could also use this relation to group the requests to the proxy when creating the remote actions.

In short, instead of calling out to the proxy for every host, we would just mark that the remote action is ready to be triggered at the remote host, and a periodic procedure (which could run as part of the parent task) would collect the info from all actions that are ready to be triggered remotely and trigger them in bulk.

I think we could do this in two phases, leaving the bulk triggering for a separate PR, but we could make the work in this PR ready for moving in that direction.
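To illustrate the shape of it, a minimal sketch (the `RemoteTask` model name, the `triggered` scope, the batch size and the `trigger_on_proxy` helper are placeholders here, not a final API; the actual table is discussed further down in this PR):

```ruby
# Sketch only: each local step records a row instead of calling the proxy itself.
class RemoteTask < ApplicationRecord
  self.table_name = 'foreman_tasks_remote_tasks'

  # rows waiting to be sent to their proxy
  scope :triggered, -> { where(:state => 'triggered') }
end

# A periodic step on the parent task collects everything that is ready
# and fires one request per proxy (and per batch) instead of one per host.
def trigger_remote_tasks_in_bulk
  RemoteTask.triggered.group_by(&:proxy_url).each do |proxy_url, tasks|
    tasks.each_slice(100) do |batch|
      trigger_on_proxy(proxy_url, batch) # placeholder for the actual proxy call
    end
  end
end
```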

module Middleware
  class WatchDelegatedProxySubTasks < ::Dynflow::Middleware
    class CheckOnProxyActions; end
    POLL_INTERVAL = 30

@iNecas

iNecas Feb 7, 2018

Member

This might be quite an aggressive interval, especially at large scale. Once every 5 minutes could be quite good, perhaps exposing this through settings, just in case.
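For example, something along these lines (the setting name below is made up, just to show the shape):

```ruby
class WatchDelegatedProxySubTasks < ::Dynflow::Middleware
  # Hypothetical: default to once per 5 minutes and let the deployment
  # override it via Settings instead of hard-coding the constant.
  def poll_interval
    Setting['foreman_tasks_proxy_poll_interval'] || 300
  end
end
```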

@theforeman-bot

Member

theforeman-bot commented Feb 8, 2018

@adamruzicka, the Redmine ticket used is for a different project than the one associated with this GitHub repository. Please either:

If changing the ticket number used, remember to update the PR title and the commit message (using git commit --amend).


This message was auto-generated by Foreman's prprocessor

@@ -102,6 +111,7 @@ def on_resume
# @override to put custom logic on event handling
def on_data(data)
  output[:proxy_output] = data
  remote_task.destroy!

@iNecas

iNecas Feb 9, 2018

Member

remote_task.destroy! is repeated in many places. Perhaps we should have one method to perform the post-run logic, something like def finish. I can imagine calling wipe_secrets! could also go there.
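Roughly along these lines (just a sketch):

```ruby
# Single place for the post-run cleanup instead of repeating it at every call site
def finish
  wipe_secrets!
  remote_task.destroy!
end

# @override to put custom logic on event handling
def on_data(data)
  output[:proxy_output] = data
  finish
end
```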

@@ -111,6 +121,7 @@ def wipe_secrets!
end
def on_proxy_action_missing
  remote_task.destroy!

@iNecas

iNecas Feb 9, 2018

Member

There is quite high coupling between the error! method and remote_task.destroy!. We could probably benefit from defining something like def fail_action(message), where we would call finish_action before raising the error, and we could use that from all the places where this pattern repeats.
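i.e. something like this (a sketch; the error message text is just an example):

```ruby
# Wraps the repeated "clean up, then error!" pattern in one place,
# building on the finish/finish_action idea from the previous comment.
def fail_action(message)
  finish
  error!(message)
end

def on_proxy_action_missing
  fail_action(_('Action was not found on the smart proxy'))
end
```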

class AddRemoteTasks < ActiveRecord::Migration[5.0]
  def change
    create_table :foreman_tasks_remote_tasks do |t|
      t.string :remote_task_id

@iNecas

iNecas Feb 9, 2018

Member

Really a nitpick: I would suggest this ordering of columns: [:execution_plan_id, :step_id, :state, :proxy_url, :remote_task_id], just going by which values will be known from the beginning vs. which will potentially be added later.
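i.e. something like this (the column types and null constraints below are my guesses based on the snippets in this diff):

```ruby
class AddRemoteTasks < ActiveRecord::Migration[5.0]
  def change
    create_table :foreman_tasks_remote_tasks do |t|
      # values known when the row is created first ...
      t.string  :execution_plan_id, :null => false
      t.integer :step_id, :null => false
      t.string  :state
      t.string  :proxy_url, :null => false
      # ... values filled in later last
      t.string  :remote_task_id
    end
  end
end
```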

@@ -2,13 +2,13 @@ module Actions
module Middleware
  class WatchDelegatedProxySubTasks < ::Dynflow::Middleware
    class CheckOnProxyActions; end
-   POLL_INTERVAL = 30
+   POLL_INTERVAL = 10

@iNecas

iNecas Feb 9, 2018

Member

Was it intentional to shorten the interval? I would actually expect to prolong it a bit; perhaps even something like once per 5 minutes could be fine.

@iNecas

Some requests inline. Haven't tested it yet, but it looks like a step in the right direction.

end
def check_triggered
  tasks = remote_tasks.triggered.group_by(&:proxy_url)

@iNecas

iNecas Feb 9, 2018

Member

I wonder how this works with 10000s of tasks? Shouldn't we use some iteration through batches to spread the tasks over smaller amounts? Maybe it's fine, but from previous experience, it's good to keep this option in mind.
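For example, iterating in batches rather than loading everything at once (a sketch; the batch size and the poll_proxy helper are placeholders):

```ruby
def check_triggered
  # Process the triggered tasks in slices so we never hold 10000s of
  # records in memory at once.
  remote_tasks.triggered.in_batches(of: 1000) do |batch|
    batch.group_by(&:proxy_url).each do |proxy_url, tasks|
      poll_proxy(proxy_url, tasks) # placeholder for the per-proxy check
    end
  end
end
```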

@adamruzicka adamruzicka changed the title from Fixes #21668 - Rex resilience to [WIP] Fixes #21668 - Rex resilience Feb 9, 2018

@adamruzicka adamruzicka changed the title from [WIP] Fixes #23017 - Rex resilience to Fixes #23017 - Rex resilience Apr 23, 2018

@iNecas

Member

iNecas commented Apr 30, 2018

      t.string :proxy_url, :null => false
      t.string :remote_task_id
      t.index :execution_plan_id

@tbrisker

tbrisker May 2, 2018

Member

This index isn't needed; all DBs know how to use the leading columns of a composite index, so the second index will also allow querying by just :execution_plan_id.
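In other words, assuming the other index in this migration is a composite one on [:execution_plan_id, :step_id], something like this is enough on its own:

```ruby
create_table :foreman_tasks_remote_tasks do |t|
  # ... columns as above ...
  # Queries filtering only on :execution_plan_id can already use the
  # leading column of this composite index, so a separate
  # t.index :execution_plan_id adds nothing.
  t.index [:execution_plan_id, :step_id]
end
```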

@adamruzicka

Contributor

adamruzicka commented May 2, 2018

I'd say the test failures are unrelated; I've seen those happen randomly on various DB/Ruby version combinations.

@iNecas

Member

iNecas commented May 2, 2018

Thanks @adamruzicka and @tbrisker

@iNecas iNecas merged commit f8f27c1 into theforeman:master May 2, 2018

1 of 2 checks passed

default: Job result: FAILURE
prprocessor: Commit message style is correct

@adamruzicka adamruzicka deleted the adamruzicka:rex-resilience branch May 2, 2018
