
Fixes #21162 - Handles any error thrown while connecting via ssh #4880

Closed · wants to merge 5 commits into from

Conversation

@juliovalcarcel (Author)

Removes unnecessary rescue/debug messages and handles all exceptions the same way until the timeout occurs. This ensures that we keep attempting to connect until the overall timeout is reached.
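In outline, the behaviour this aims for looks roughly like the sketch below; the timeout values and the ssh/logger objects are illustrative assumptions, not the PR's exact diff:

    require 'timeout'

    # Keep retrying any failure until the outer deadline expires; only the
    # deadline itself stops the loop.
    Timeout.timeout(360) do                    # overall connection budget
      begin
        Timeout.timeout(8) { ssh.run('pwd') }  # one bounded attempt
      rescue => e
        logger.debug "#{e.class} while connecting, retrying"
        sleep(2)
        retry
      end
    end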

@theforeman-bot (Member)

Do not merge! This patch has not been tested yet.

Can an existing organization member please verify this patch?

2 similar comments

    end
    end

Extra blank line detected.

    rescue => e
      logger.debug "Error occured while connecting \"#{e.inspect}\", retrying"
      sleep(2)
      retry

Trailing whitespace detected.

    begin
      Timeout.timeout(8) do
        ssh.run('pwd')
        begin

Redundant begin block detected.
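For context, RuboCop's Style/RedundantBegin cop fires when an explicit begin wraps a body that is already an implicit begin/end. A minimal illustration with made-up names, unrelated to the exact diff:

    # Flagged: the begin/end adds nothing around a full method body.
    def run_check(ssh)
      begin
        ssh.run('pwd')
      rescue IOError
        retry
      end
    end

    # Equivalent without the redundant begin:
    def run_check(ssh)
      ssh.run('pwd')
    rescue IOError
      retry
    end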

@theforeman-bot (Member)

Issues: #21162

@dLobatog (Member) left a comment

[test]
ok to test

@dLobatog (Member) left a comment

@juliovalcarcel Thanks for the patch. I've read 21162, but I don't understand how this fixes it; could you elaborate a bit on how that's happening?

    end
    end
    rescue => e

Member: If it reached the timeout, it should only show rescue Timeout::Error.

    retry
    rescue Net::SSH::AuthenticationFailed
      logger.debug "Auth failed for #{username} at #{address}, retrying"
    rescue => e

Member: This will hide other exceptions which may be caused by Foreman itself. The list of exceptions we had here was there on purpose, so that other exceptions don't retry but fail and the backtrace is shown.

@theforeman-bot (Member)

There were the following issues with the commit message:

  • length of the first commit message line for cae0447 exceeds 65 characters
  • commit message for cae0447 is not wrapped at 72nd column

If you don't have a ticket number, please create an issue in Redmine.

More guidelines are available in Coding Standards or on the Foreman wiki.


This message was auto-generated by Foreman's prprocessor

@juliovalcarcel (Author)

@dLobatog When creating a new host via the EC2 compute resource, it was failing to SSH with an "SSH Error: IOError stream closed", which then caused provisioning to fail entirely, when in reality the host it created could be SSHed to from Foreman with no issue. After more digging I concluded that Foreman attempts to SSH too quickly, before the instance is created and provisioned and the default centos user is added with Foreman's public key. Looking at /var/log/secure in 21162, you can see that the first attempt to connect is Foreman, then the centos user is added, and then (for a reason I am unsure of) the ssh daemon restarts. More digging into net-ssh showed that it cannot handle the ssh server being restarted and will throw an IOError (which was what was causing my issue).

This fixes my issue by handling not only the IOError inside the inner rescue => e, but any other issue that could occur. As a user, if the initiate_connection method fails for any reason, I always want it to retry until the timeout. A timeout signifies to me that for some reason Foreman cannot ssh to the host within 5 minutes, which means I need to fix something. Even if it failed for any other reason, I would always want it to retry inside the timeout limit.

I just updated the inner log message to info level for easy debugging, since the exception that is thrown clearly indicates the issue at play if it is one of the ones we were catching before, and I added a debug log which prints the stacktrace when debug is on, without throwing an error and failing, so in the 1% case where it is not one of the identified exceptions we can still diagnose it. I also changed the outer rescue to only handle Timeout::Error.
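Pieced together from that description, the intended flow is roughly the sketch below; the timeout values and the ssh/logger calls are assumptions for illustration, not the PR's exact diff:

    require 'timeout'

    def initiate_connection!
      Timeout.timeout(360) do
        begin
          Timeout.timeout(8) { ssh.run('pwd') }
        rescue => e
          # Info level names the exception class, so known causes (IOError from
          # a restarting sshd, auth failures, inner timeouts) are identifiable
          # without enabling debug logging.
          logger.info "Error while connecting \"#{e.inspect}\", retrying"
          # Debug level carries the stacktrace for the rare unidentified case.
          logger.debug "Full stacktrace:\n#{e.backtrace.join("\n")}"
          sleep(2)
          retry
        end
      end
    rescue Timeout::Error
      # Only expiry of the overall budget lands here (on modern Rubies the
      # outer timeout unwinds past the inner rescue); give up and re-raise.
      logger.warn 'Could not connect via SSH before the timeout'
      raise
    end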

Includes stacktrace of nested failures for debugging purposes
@houndci-bot left a comment

Some files could not be reviewed due to errors:

.rubocop.yml: Style/FileName has the wrong namespace - should be Naming
.rubocop.yml: Style/PredicateName has the wrong namespace - should be Naming
.rubocop.yml: Style/AccessorMethodName has the wrong namespace - should be Naming
.rubocop.yml: Style/MethodName has the wrong namespace - should be Naming
.rubocop.yml: Style/VariableNumber has the wrong namespace - should be Naming
Error: The `Style/OpMethod` cop has been renamed and moved to `Naming/BinaryOperatorParameterName`.
(obsolete configuration found in .rubocop.yml, please update it)

    sleep(2)
    retry
    rescue Timeout::Error

Member: Do we want to log also the Timeout::Error?

Author: By doing rescue => e and then logging the exception class in the message, we will know whether it was a timeout or not.
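Concretely, that looks something like the snippet below (illustrative, not the PR's exact message):

    begin
      ssh.run('pwd')
    rescue => e
      # e.class makes a timeout show up as Timeout::Error, an auth failure as
      # Net::SSH::AuthenticationFailed, and so on, all in one code path.
      logger.info "Error while connecting (#{e.class}): #{e.message}, retrying"
      sleep(2)
      retry
    end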

Member: Logging the message is fine, but I would discourage logging the backtrace for this. The problem I see is: we have a set of known exceptions, and we don't need a backtrace from those (including this one). Then we have another set of exceptions that we don't expect. I'm not sure we should retry on those, but more importantly, we should log the unknown ones at error level. Therefore, I would be fine with listing the known ones in one rescue block, pretty much as you do now (and including the IOError), and having the rescue => e there with more verbose logging at error level; I think I could live with a retry there as well. Other thoughts, anyone?
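A sketch of that split (the exception list is abridged and illustrative, not the PR's final code):

    require 'net/ssh'

    begin
      ssh.run('pwd')
    rescue Errno::ECONNREFUSED, Net::SSH::AuthenticationFailed, IOError => e
      # Known and expected while the host boots: terse debug log, no backtrace.
      logger.debug "#{e.class} while connecting, retrying"
      sleep(2)
      retry
    rescue => e
      # Unexpected: log the message and backtrace at error level, then retry.
      logger.error "Unexpected error while connecting: #{e.message}\n#{e.backtrace.join("\n")}"
      sleep(2)
      retry
    end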

Member: I think retry is acceptable, but we need to limit how many times we retry. +1 to the rest.

Member: The number of retries is limited by the outer timeout block.

Member: Never ever eat backtraces; I do not agree with @iNecas here and I would like to see it. If you think that could be too verbose, just make sure it is logged only once (on the first try). In the end, it's debug level (turned off by default).

Member: If the log contains the trace of the last exception, I'm ok with that, especially if it's an expected exception from which we recover by retrying. But I don't insist :-)

Member: I think a debug trace for known exceptions is ok; what I'm advocating for, though, is logging unknown exceptions AND their backtrace at error log level.

Member: If any exception is not rethrown, it must be logged. Previously we rescued particular exceptions and it was ok to swallow them; not now, when we can miss a generic exception and have no stacktrace in the logs.

Author: As a user, what I care about is that I can SSH into my box; in this case I don't care what error occurs, I just want it to retry for 5 minutes until it gets in. If it does time out, it's easy enough to enable debug logging and try again. Theoretically (in this case for ssh), if the issue happens the first time and I turn on debug logs to get the stack trace, I will hit the same issue again. And in most cases the issue will not be on Foreman's side; it will be something else with the server it is SSHing into.

@lzap (Member) commented Nov 13, 2017

Frankly, after reading the issue I think the proper solution would be to add handling for the IO "closed stream" exception and not go with a generic rescue.


@juliovalcarcel (Author)

I reverted all but the minimum changes I think should be added. It rescues IOError, and for any unknown exception it will still catch it and retry if it hasn't timed out yet, while still logging the error.

@lzap (Member) left a comment

We don't want to have exceptions swallowed, except the last warning about giving up.

    rescue => e
      Foreman::Logging.exception("SSH error", e)
      logger.info "An error occured while connecting before timeout occured \"#{e.inspect}\", retrying"
      logger.debug "Full stacktrace of exception: \n #{e.backtrace.join("\n ")}"

Member: Please use Foreman::Logging.exception("An error occurred while connecting before the timeout occurred", e) instead; this is a helper which formats exceptions properly in logs.
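Applied to the hunk above, the suggestion collapses the three log calls into the helper. A sketch, assuming the two-argument form shown in the comment:

    begin
      ssh.run('pwd')
    rescue => e
      # One helper call logs the message, the exception, and its backtrace in
      # Foreman's standard format, replacing the separate info + debug lines.
      Foreman::Logging.exception("An error occurred while connecting before the timeout occurred", e)
      sleep(2)
      retry
    end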

    @@ -99,10 +99,19 @@ def initiate_connection!
        retry
      rescue Timeout::Error
        retry
      rescue IOError
        logger.debug "net-ssh threw an IOError, retrying"

Member: Please do not swallow exceptions; name the exception and properly log it:

    Foreman::Logging.exception("IO error occurred", exception)

    end
    end
    rescue Timeout::Error
      Foreman::Logging.exception("Error connecting over SSH, reached max timeout of 360 seconds")

Member: And here it should be a regular logger.warn call instead of Foreman::Logging.exception.
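That is, something like this sketch (attempt_connection is a hypothetical stand-in for the real connection loop):

    require 'timeout'

    begin
      Timeout.timeout(360) { attempt_connection }
    rescue Timeout::Error
      # Expected give-up condition: a plain warning is enough; the exception
      # helper is meant for formatting an exception object, which this call
      # site was not passing.
      logger.warn 'Error connecting over SSH, reached max timeout of 360 seconds'
    end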

@lzap (Member) commented Jun 26, 2018

Ping? Interested in getting this into the codebase?

@theforeman-bot (Member)
Thank you for your contribution, @juliovalcarcel! This PR has been inactive for 6 months, closing for now.
Feel free to reopen when you return to it. This is an automated process.
