Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

report auth/connection errors as a metric instead of status #43

Merged
merged 3 commits into from
Aug 1, 2014
Merged

report auth/connection errors as a metric instead of status #43

merged 3 commits into from
Aug 1, 2014

Conversation

miguelgrinberg
Copy link
Contributor

No description provided.

swift = client.Connection(preauthurl=endpoint,
preauthtoken=auth_token)
if swift is None:
status_msg = 'Unable to obtain valid swift client'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're currently using (the equivalent of) elapsed = -1 here.

endpoint_type='publicURL')
auth_token = keystone.auth_ref['token']['id']
status_msg = 'Unable to obtain valid keystone client'
api_ok = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with this concept. I think the swift_api_local_status metric should only reflect the status of the swift api. The change would mean the metric should be called swift_keystone_api_local_status.

The way I have viewed this is that a metric is something we are trying to measure with the script. The status line describes the success when trying to gather the defined metrics. Therefore an auth failure when auth isn't the metric is a script failure and so should be reflected in the status returned and not a something represented by a metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have a strong feeling either way. I could argue that the authentication dance is part of accessing an API, so when that breaks the API is not accessible, hence the "API is up" metric should be set to False like I do here. You can go to the other side and say that keystone is a dependency of this script, and if that is broken then this script cannot run.

Either way, if keystone is dead red lights will pop up everywhere and a support person will be alerted (I hope). We just need to decide on one of the two approaches and be consistent. As I said above, I don't have a strong preference, you guys know the rest of the system better so I'd like to know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here's what I understand from the plugin docs (which I've had to read a few times before this sunk in totally):

The plugin should only return status OK ${msg} (or status success ${msg}) if it was able to successfully collect the metrics it aims to collect. To me this says that if we cannot collect the metrics, we should be printing status err ${msg} instead and returning a non-zero exit code.

What this means for this code is that we should probably stick to using status_err.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but the metric in this case is is the API accessible?. The status should almost always be OK, the only case it should be an error is when the script itself has a failure that prevents it from running (i.e. import error, etc).

The way I see it if the script is able to run it should return status OK. The metric should say if the API could be reached or not. TBD what we do for external issues, such as keystone or other services down, that's a gray area.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we cannot check the actual status of the API without having an instance of Keystone's client. That to me is semantically the same as an import error.

but the metric in this case is is the API accessible?

Which we can't really answer meaningfully without authentication. We can ping something regardless, get a 401 and say "Yep looks up" when it could be broken.

Reporting that the check was successful, and saying that the API was not available seems like a bit of a lie in this case. We don't know whether the API is up and triggering an alarm about Swift (or any other service) also being down may be less helpful to support than is the intention behind these plugins. If instead we report an error because we're not able to get a keystone client, then support will be able to look into that specifically if it becomes a problem for very long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with that definition, that's the "gray area" that I mentioned above.

To summarize:

  • import errors result in error status
  • inability to access a dependent service (such as keystone) also result in error status
  • not able to obtain a client or not able to connect to the API endpoint returns status OK, and the problem is reported in the metrics as the API being down.
  • unexpected exceptions triggered during the run cause an error status, with the text of the exception in the message.
  • response times are reported as int32 in milliseconds. In the event of an error status, the response time metric is reported as -1.

Does this cover all the cases for local API tests?

@mancdaz
Copy link
Contributor

mancdaz commented Aug 1, 2014

@miguelgrinberg I'd agree with that summary, apart from one thing...

If we decide the API IS DOWN, and we send that metric as a 0, then we should not send any other metrics, including response time. Rather than sending it as -1, just don't send it.

So the pattern would be

is api up?
    send api_up = 1
    send all other metrics
is api down?
    send api_up = 0
some other error?
    status_err

Most of the clients we've been working with so far seem to encapsulate all the 'api is down' type errors into a single parent exception (something like ClientException or HTTPClientException). We've been raising that back to the caller script to be handled there (and setting the metric to 0), and catching everything else in the get_x_client method and setting status_err. Checkout the neutron/nova/glance stuff for example patterns.

miguelgrinberg added a commit that referenced this pull request Aug 1, 2014
swift_api_local now reports client errors as a metric instead of status
@miguelgrinberg miguelgrinberg merged commit 9dec786 into rcbops:master Aug 1, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants