report auth/connection errors as a metric instead of status #43

miguelgrinberg · 2014-07-30T21:44:36Z

No description provided.

sigmavirus24 · 2014-07-30T21:55:57Z

swift_api_local_check.py

+            swift = client.Connection(preauthurl=endpoint,
+                                      preauthtoken=auth_token)
+            if swift is None:
+                status_msg = 'Unable to obtain valid swift client'


We're currently using (the equivalent of) elapsed = -1 here.

git-harry · 2014-07-31T08:02:23Z

swift_api_local_check.py

-                                                endpoint_type='publicURL')
-        auth_token = keystone.auth_ref['token']['id']
+            status_msg = 'Unable to obtain valid keystone client'
+            api_ok = False


I disagree with this concept. I think the swift_api_local_status metric should only reflect the status of the swift api. The change would mean the metric should be called swift_keystone_api_local_status.

The way I have viewed this is that a metric is something we are trying to measure with the script. The status line describes the success when trying to gather the defined metrics. Therefore an auth failure when auth isn't the metric is a script failure, and so it should be reflected in the status returned and not represented by a metric.

miguelgrinberg · 2014-07-31

I don't really have a strong feeling either way. I could argue that the authentication dance is part of accessing an API, so when that breaks the API is not accessible, hence the "API is up" metric should be set to False like I do here. You can go to the other side and say that keystone is a dependency of this script, and if that is broken then this script cannot run.

Either way, if keystone is dead, red lights will pop up everywhere and a support person will be alerted (I hope). We just need to decide on one of the two approaches and be consistent. As I said above, I don't have a strong preference; you guys know the rest of the system better, so I'd like to know what you think.

sigmavirus24 · 2014-07-31

So here's what I understand from the plugin docs (which I've had to read a few times before this sank in totally):

The plugin should only return status OK ${msg} (or status success ${msg}) if it was able to successfully collect the metrics it aims to collect. To me this says that if we cannot collect the metrics, we should be printing status err ${msg} instead and returning a non-zero exit code.

What this means for this code is that we should probably stick to using status_err.
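A minimal sketch of that convention, assuming only the line formats quoted above. The helper bodies below are a guess at the behavior being described, not the repository's own helpers, and `format_status` is a hypothetical name:

```python
import sys


def format_status(state, message=""):
    """Build the single status line the plugin prints (format assumed from the thread)."""
    return "status {} {}".format(state, message).rstrip()


def status_ok(message=""):
    """The metrics were collected successfully."""
    print(format_status("OK", message))


def status_err(message):
    """The metrics could not be collected: print the error and exit non-zero."""
    print(format_status("err", message))
    sys.exit(1)
```

Under this reading, a failure to build the keystone client would end the run with something like `status_err('Unable to obtain valid keystone client')` and no metrics would follow.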

miguelgrinberg · 2014-07-31

Sure, but the metric in this case is "is the API accessible?". The status should almost always be OK; the only case where it should be an error is when the script itself has a failure that prevents it from running (e.g. an import error).

The way I see it, if the script is able to run, it should return status OK. The metric should say whether the API could be reached or not. What we do for external issues, such as keystone or other services being down, is TBD; that's a gray area.

sigmavirus24 · 2014-07-31

So we cannot check the actual status of the API without having an instance of Keystone's client. That to me is semantically the same as an import error.

"but the metric in this case is 'is the API accessible?'"

Which we can't really answer meaningfully without authentication. We can ping something regardless, get a 401 and say "Yep looks up" when it could be broken.

Reporting that the check was successful while saying that the API was not available seems like a bit of a lie in this case. We don't know whether the API is up, and triggering an alarm about Swift (or any other service) also being down may be less helpful to support than these plugins are meant to be. If instead we report an error because we're not able to get a keystone client, then support will be able to look into that specifically if it remains a problem for very long.

miguelgrinberg · 2014-08-01

I'm okay with that definition; that's the "gray area" that I mentioned above.

To summarize:

import errors result in error status

inability to access a dependent service (such as keystone) also results in error status

inability to obtain a client or to connect to the API endpoint returns status OK, and the problem is reported in the metrics as the API being down.

unexpected exceptions triggered during the run cause an error status, with the text of the exception in the message.

response times are reported as int32 in milliseconds. In the event of an error status, the response time metric is reported as -1.

Does this cover all the cases for local API tests?
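The cases in that summary can be sketched as a single mapping from outcome to output lines. This is an illustration under assumptions, not the plugin's code: the `Outcome` names, the `report` function, the `uint32` type for the status metric, and the metric name `swift_api_local_response_time` are all hypothetical (only `swift_api_local_status` and the `int32` millisecond response time come from the thread). The sketch also reads the "-1" rule as applying when the API is reported down.

```python
from enum import Enum


class Outcome(Enum):
    API_UP = "api up"
    API_DOWN = "no client or endpoint unreachable"
    DEPENDENCY_DOWN = "dependent service (e.g. keystone) unavailable"
    IMPORT_ERROR = "import error"
    UNEXPECTED = "unexpected exception"


# Outcomes where the check itself could not run, so no metrics are emitted.
ERROR_OUTCOMES = {Outcome.DEPENDENCY_DOWN, Outcome.IMPORT_ERROR, Outcome.UNEXPECTED}


def report(outcome, elapsed_ms=None, detail=""):
    """Return the lines one run would print for the given outcome."""
    if outcome in ERROR_OUTCOMES:
        return ["status err {}".format(detail or outcome.value)]
    api_up = outcome is Outcome.API_UP
    # Response time is only meaningful when the API answered; otherwise -1.
    response_ms = int(elapsed_ms) if api_up and elapsed_ms is not None else -1
    return [
        "status OK",
        "metric swift_api_local_status uint32 {}".format(int(api_up)),
        "metric swift_api_local_response_time int32 {}".format(response_ms),
    ]
```

For example, `report(Outcome.DEPENDENCY_DOWN, detail="keystone unavailable")` yields only a `status err` line, while `report(Outcome.API_DOWN)` yields `status OK` followed by a status metric of 0.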

mancdaz · 2014-08-01T10:08:17Z

@miguelgrinberg I'd agree with that summary, apart from one thing...

If we decide the API IS DOWN, and we send that metric as a 0, then we should not send any other metrics, including response time. Rather than sending it as -1, just don't send it.

So the pattern would be

is api up?
    send api_up = 1
    send all other metrics
is api down?
    send api_up = 0
some other error?
    status_err

Most of the clients we've been working with so far seem to encapsulate all the 'api is down' type errors into a single parent exception (something like ClientException or HTTPClientException). We've been raising that back to the caller script to be handled there (and setting the metric to 0), and catching everything else in the get_x_client method and setting status_err. Check out the neutron/nova/glance stuff for example patterns.
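A rough sketch of that pattern, folded into one function for brevity (the thread describes splitting it between the caller script and a get_x_client method). `ClientException` here is a local stand-in for whatever parent exception the real client library raises, and `check`, `api_up`, and `response_time` are illustrative names rather than the ones used in the neutron/nova/glance plugins:

```python
import time


class ClientException(Exception):
    """Stand-in for a client library's 'API is down' parent exception."""


def check(call_api, clock=time.monotonic):
    """Run one check and return the lines to print, following the pattern above."""
    start = clock()
    try:
        call_api()
    except ClientException:
        # API is down: report the metric as 0 and send nothing else.
        return ["status OK", "metric api_up uint32 0"]
    except Exception as exc:
        # Anything else means the check itself failed.
        return ["status err {}".format(exc)]
    elapsed_ms = int((clock() - start) * 1000)
    return [
        "status OK",
        "metric api_up uint32 1",
        "metric response_time int32 {}".format(elapsed_ms),
    ]
```

The order of the `except` clauses matters: the narrow "API is down" exception has to be caught before the catch-all, otherwise every outage would surface as a status error.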

swift_api_local now reports client errors as a metric instead of status

report auth/connection errors as a metric instead of status

6271a3a

sigmavirus24 reviewed Jul 30, 2014

use elapsed time = -1 for failures

ef90878

git-harry reviewed Jul 31, 2014

keystone errors now cause an error status

2d7f12e

miguelgrinberg added a commit that referenced this pull request Aug 1, 2014

Merge pull request #43 from miguelgrinberg/swift_local_api_changes

9dec786

swift_api_local now reports client errors as a metric instead of status

miguelgrinberg merged commit 9dec786 into rcbops:master Aug 1, 2014

miguelgrinberg commented Jul 30, 2014

sigmavirus24 Jul 30, 2014

git-harry Jul 31, 2014

miguelgrinberg Jul 31, 2014

sigmavirus24 Jul 31, 2014

miguelgrinberg Jul 31, 2014

sigmavirus24 Jul 31, 2014

miguelgrinberg Aug 1, 2014

mancdaz commented Aug 1, 2014
