
Fixes #21162 - Handles any error thrown while connecting via ssh #4880

Closed · wants to merge 5 commits into from

Conversation

@juliovalcarcel (Author)

Removes unnecessary rescue/debug messages and handles all exceptions the same way until the timeout occurs. This ensures that we keep attempting to connect until the overall timeout is reached.
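In outline, the behaviour this aims for looks roughly like the sketch below; the timeout values and the ssh/logger objects are illustrative assumptions, not the PR's exact diff:

    require 'timeout'

    # Keep retrying any failure until the outer deadline expires; only the
    # deadline itself stops the loop.
    Timeout.timeout(360) do                    # overall connection budget
      begin
        Timeout.timeout(8) { ssh.run('pwd') }  # one bounded attempt
      rescue => e
        logger.debug "#{e.class} while connecting, retrying"
        sleep(2)
        retry
      end
    end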

@theforeman-bot (Member)

Do not merge! This patch has not been tested yet.

Can an existing organization member please verify this patch?

2 similar comments

    end
    end

Extra blank line detected.

    rescue => e
      logger.debug "Error occured while connecting \"#{e.inspect}\", retrying"
      sleep(2)
      retry

Trailing whitespace detected.

    begin
      Timeout.timeout(8) do
        ssh.run('pwd')
        begin

Redundant begin block detected.
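For context, RuboCop's Style/RedundantBegin cop fires when an explicit begin wraps a body that is already an implicit begin/end. A minimal illustration with made-up names, unrelated to the exact diff:

    # Flagged: the begin/end adds nothing around a full method body.
    def run_check(ssh)
      begin
        ssh.run('pwd')
      rescue IOError
        retry
      end
    end

    # Equivalent without the redundant begin:
    def run_check(ssh)
      ssh.run('pwd')
    rescue IOError
      retry
    end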

@theforeman-bot (Member)

Issues: #21162

@dLobatog (Member) left a comment

[test]
ok to test

@dLobatog (Member) left a comment

@juliovalcarcel Thanks for the patch. I've read 21162, but I don't understand how this fixes it; could you elaborate a bit on how that's happening?

    end
    end
    rescue => e

Member: If it reached the timeout, it should only show rescue Timeout::Error.

    retry
    rescue Net::SSH::AuthenticationFailed
      logger.debug "Auth failed for #{username} at #{address}, retrying"
    rescue => e

Member: This will hide other exceptions which may be caused by Foreman itself. The list of exceptions we had here was there on purpose, so that other exceptions don't retry but fail and the backtrace is shown.

@theforeman-bot (Member)

There were the following issues with the commit message:

  • length of the first commit message line for cae0447 exceeds 65 characters
  • commit message for cae0447 is not wrapped at 72nd column

If you don't have a ticket number, please create an issue in Redmine.

More guidelines are available in Coding Standards or on the Foreman wiki.


This message was auto-generated by Foreman's prprocessor

@juliovalcarcel (Author)

@dLobatog When creating a new host via the EC2 compute resource, it was failing to SSH with an "SSH Error: IOError stream closed", which then caused provisioning to fail entirely, when in reality the host it created could be SSHed to from Foreman with no issue. After more digging I concluded that Foreman attempts to SSH too quickly, before the instance is created and provisioned and the default centos user is added with Foreman's public key. Looking at /var/log/secure in 21162, you can see that the first attempt to connect is Foreman, then the centos user is added, and then (for a reason I am unsure of) the ssh daemon restarts. More digging into net-ssh showed that it cannot handle the ssh server being restarted and will throw an IOError (which was what was causing my issue).

This fixes my issue by handling not only the IOError inside the inner rescue => e, but any other issue that could occur. As a user, if the initiate_connection method fails for any reason, I always want it to retry until the timeout. A timeout signifies to me that for some reason Foreman cannot ssh to the host within 5 minutes, which means I need to fix something. Even if it failed for any other reason, I would always want it to retry inside the timeout limit.

I just updated the inner log message to info level for easy debugging, since the exception that is thrown clearly indicates the issue at play if it is one of the ones we were catching before, and I added a debug log which prints the stacktrace when debug is on, without throwing an error and failing, so in the 1% case where it is not one of the identified exceptions we can still diagnose it. I also changed the outer rescue to only handle Timeout::Error.
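Pieced together from that description, the intended flow is roughly the sketch below; the timeout values and the ssh/logger calls are assumptions for illustration, not the PR's exact diff:

    require 'timeout'

    def initiate_connection!
      Timeout.timeout(360) do
        begin
          Timeout.timeout(8) { ssh.run('pwd') }
        rescue => e
          # Info level names the exception class, so known causes (IOError from
          # a restarting sshd, auth failures, inner timeouts) are identifiable
          # without enabling debug logging.
          logger.info "Error while connecting \"#{e.inspect}\", retrying"
          # Debug level carries the stacktrace for the rare unidentified case.
          logger.debug "Full stacktrace:\n#{e.backtrace.join("\n")}"
          sleep(2)
          retry
        end
      end
    rescue Timeout::Error
      # Only expiry of the overall budget lands here (on modern Rubies the
      # outer timeout unwinds past the inner rescue); give up and re-raise.
      logger.warn 'Could not connect via SSH before the timeout'
      raise
    end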

Includes stacktrace of nested failures for debugging purposes
@houndci-bot left a comment

Some files could not be reviewed due to errors:

.rubocop.yml: Style/FileName has the wrong namespace - should be Naming
.rubocop.yml: Style/PredicateName has the wrong namespace - should be Naming
.rubocop.yml: Style/AccessorMethodName has the wrong namespace - should be Naming
.rubocop.yml: Style/MethodName has the wrong namespace - should be Naming
.rubocop.yml: Style/VariableNumber has the wrong namespace - should be Naming
Error: The `Style/OpMethod` cop has been renamed and moved to `Naming/BinaryOperatorParameterName`.
(obsolete configuration found in .rubocop.yml, please update it)

    sleep(2)
    retry
    rescue Timeout::Error

Member: Do we want to log also the Timeout::Error?

Author: By doing rescue => e and then logging the exception class in the message, we will know whether it was a timeout or not.
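Concretely, that looks something like the snippet below (illustrative, not the PR's exact message):

    begin
      ssh.run('pwd')
    rescue => e
      # e.class makes a timeout show up as Timeout::Error, an auth failure as
      # Net::SSH::AuthenticationFailed, and so on, all in one code path.
      logger.info "Error while connecting (#{e.class}): #{e.message}, retrying"
      sleep(2)
      retry
    end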

Member: Logging the message is fine, but I would discourage logging the backtrace for this. The problem I see is: we have a set of known exceptions, and we don't need a backtrace from those (including this one). Then we have another set of exceptions that we don't expect. I'm not sure we should retry on those, but more importantly, we should log the unknown ones at error level. Therefore, I would be fine with listing the known ones in one rescue block, pretty much as you do now (and including the IOError), and having the rescue => e there with more verbose logging at error level; I think I could live with a retry there as well. Other thoughts, anyone?
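A sketch of that split (the exception list is abridged and illustrative, not the PR's final code):

    require 'net/ssh'

    begin
      ssh.run('pwd')
    rescue Errno::ECONNREFUSED, Net::SSH::AuthenticationFailed, IOError => e
      # Known and expected while the host boots: terse debug log, no backtrace.
      logger.debug "#{e.class} while connecting, retrying"
      sleep(2)
      retry
    rescue => e
      # Unexpected: log the message and backtrace at error level, then retry.
      logger.error "Unexpected error while connecting: #{e.message}\n#{e.backtrace.join("\n")}"
      sleep(2)
      retry
    end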

Member: I think retry is acceptable, but we need to limit how many times we retry. +1 to the rest.

Member: The number of retries is limited by the outer timeout block.

Member: Never ever eat backtraces; I do not agree with @iNecas here and I would like to see it. If you think that could be too verbose, just make sure it is logged only once (on the first try). In the end, it's debug level (turned off by default).

Member: If the log contains the trace of the last exception, I'm ok with that, especially if it's an expected exception from which we recover by retrying. But I don't insist :-)

Member: I think a debug trace for known exceptions is ok; what I'm advocating for, though, is logging unknown exceptions AND their backtrace at error log level.

Member: If any exception is not rethrown, it must be logged. Previously we rescued particular exceptions and it was ok to swallow them; not now, when we can miss a generic exception and have no stacktrace in the logs.

Author: As a user, what I care about is that I can SSH into my box; in this case I don't care what error occurs, I just want it to retry for 5 minutes until it gets in. If it does time out, it's easy enough to enable debug logging and try again. Theoretically (in this case for ssh), if the issue happens the first time and I turn on debug logs to get the stack trace, I will hit the same issue again. And in most cases the issue will not be on Foreman's side; it will be something else with the server it is SSHing into.

@lzap (Member) commented Nov 13, 2017

Frankly, after reading the issue I think the proper solution would be to add handling for the IO "closed stream" exception and not go with a generic rescue.


@juliovalcarcel (Author)

I reverted all but the minimum changes I think should be added. It rescues IOError, and for any unknown exception it will still catch it and retry if it hasn't timed out yet, while still logging the error.

@lzap (Member) left a comment

We don't want to have exceptions swallowed, except the last warning about giving up.

    rescue => e
      Foreman::Logging.exception("SSH error", e)
      logger.info "An error occured while connecting before timeout occured \"#{e.inspect}\", retrying"
      logger.debug "Full stacktrace of exception: \n #{e.backtrace.join("\n ")}"

Member: Please use Foreman::Logging.exception("An error occurred while connecting before the timeout occurred", e) instead; this is a helper which formats exceptions properly in logs.
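Applied to the hunk above, the suggestion collapses the three log calls into the helper. A sketch, assuming the two-argument form shown in the comment:

    begin
      ssh.run('pwd')
    rescue => e
      # One helper call logs the message, the exception, and its backtrace in
      # Foreman's standard format, replacing the separate info + debug lines.
      Foreman::Logging.exception("An error occurred while connecting before the timeout occurred", e)
      sleep(2)
      retry
    end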

    @@ -99,10 +99,19 @@ def initiate_connection!
        retry
      rescue Timeout::Error
        retry
      rescue IOError
        logger.debug "net-ssh threw an IOError, retrying"

Member: Please do not swallow exceptions; name the exception and properly log it:

    Foreman::Logging.exception("IO error occurred", exception)

    end
    end
    rescue Timeout::Error
      Foreman::Logging.exception("Error connecting over SSH, reached max timeout of 360 seconds")

Member: And here it should be a regular logger.warn call instead of Foreman::Logging.exception.
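That is, something like this sketch (attempt_connection is a hypothetical stand-in for the real connection loop):

    require 'timeout'

    begin
      Timeout.timeout(360) { attempt_connection }
    rescue Timeout::Error
      # Expected give-up condition: a plain warning is enough; the exception
      # helper is meant for formatting an exception object, which this call
      # site was not passing.
      logger.warn 'Error connecting over SSH, reached max timeout of 360 seconds'
    end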

@lzap (Member) commented Jun 26, 2018

Ping? Interested in getting this into the codebase?

@theforeman-bot (Member)
Thank you for your contribution, @juliovalcarcel! This PR has been inactive for 6 months, closing for now.
Feel free to reopen when you return to it. This is an automated process.
