
Allowed ingestors to be disabled and improved resolving of their listener addresses #72

Merged
itzg merged 9 commits into master from itzg/allow-optional-ingestors on Jan 20, 2020

Conversation

@itzg (Contributor) commented Jan 16, 2020

Resolves https://jira.rax.io/browse/SALUS-748 and prepares for a new line protocol ingestor to be added for https://jira.rax.io/browse/SALUS-594.

What

We found that on some systems where logstash was running, the lumberjack binding would fail since the equivalent logstash plugin was already bound to the standard port. That made me realize we should allow a few variations of ingest config:

  • disabling an ingestor entirely by setting a blank bind address
  • binding to "all interfaces" by specifying 0.0.0.0 as the ingestor bind address, yet converting to 127.0.0.1 when the agent runner is looking up what to use for agent configs
  • binding to any available port by allowing a 0 or absent port value

All of this also prepares for me to add a line protocol ingest type to be used in conjunction with racker/salus-packages-agent#1.
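As a rough sketch of those rules (the resolveBindAddress helper is hypothetical, not the actual ingestor code in this PR), a configured bind address could be interpreted like this:

package main

import (
	"fmt"
	"net"
)

// resolveBindAddress is a hypothetical illustration of the rules above: a
// blank address disables the ingestor, a missing or zero port means "any
// available port", and 0.0.0.0 binds to all interfaces.
func resolveBindAddress(configured string) (bindAddr string, disabled bool) {
	if configured == "" {
		// Blank bind address: the ingestor is disabled entirely.
		return "", true
	}
	host, port, err := net.SplitHostPort(configured)
	if err != nil {
		// No port in the configured value: let the kernel pick one.
		host, port = configured, "0"
	}
	return net.JoinHostPort(host, port), false
}

func main() {
	fmt.Println(resolveBindAddress(""))             // disabled
	fmt.Println(resolveBindAddress("0.0.0.0:5044")) // all interfaces, fixed port
	fmt.Println(resolveBindAddress("127.0.0.1"))    // loopback, any available port
}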

How

Most of the code changes are cleanup to ensure the ingestor Bind and Start methods consistently do just what they were intended to do. With that, the agent lookup of the registered listener address can be handled consistently.
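For example, when the port is 0 (or absent) only the bound listener knows the real port, so the registered address has to come from the listener rather than from the config. A self-contained sketch of that idea (not the code from this PR itself; only RegisterListenerAddress is a real function from this change):

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Port 0 asks the kernel for any available port; only the resulting
	// listener knows which one was assigned.
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer listener.Close()

	// In the Envoy, this resolved address (e.g. 127.0.0.1:49321) is what
	// would be passed to RegisterListenerAddress for agent runners to look up.
	fmt.Println("listening on", listener.Addr().String())
}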

How to test

Existing unit tests

@GeorgeJahad (Contributor):

binding to "all interfaces" by specifying 0.0.0.0 as the ingestor bind address, yet converting to 127.0.0.1 when the agent runner is looking up what to use for agent configs

@itzg I'm confused by that comment. The code makes it look like 127.0.0.1 is always used no matter what host is specified. Is that correct? If so, why allow a host at all? How is 0.0.0.0 special?

@itzg (Contributor Author) commented Jan 17, 2020

binding to "all interfaces" by specifying 0.0.0.0 as the ingestor bind address, yet converting to 127.0.0.1 when the agent runner is looking up what to use for agent configs

@itzg I'm confused by that comment. The code makes it look like 127.0.0.1 is always used no matter what host is specified. Is that correct? If so, why allow a host at all? How is 0.0.0.0 special?

Yeah, this was an awkward scenario to explain since there's the ingestor and then the agent perspective. I'll try to further explain with a hypothetical scenario:

Let's say a user wants to send application metrics from another server (or logs in the case of filebeat/lumberjack) over to Envoy because they're already in the right format and the user wants to leverage our Envoy->Ambassador->Kafka infrastructure, but with custom metrics. (This is a use case that RBA will probably exercise for their apps.) In that case, the ingestor would need to be told to bind to 0.0.0.0.

At the same time, the Envoy is launching its managed instance of telegraf and it is always on the same host. Therefore 127.0.0.1 is the consistent address to tell the agent to contact the Envoy for ingest, because if the user configured 0.0.0.0 that address can't be given to telegraf as-is to dial in order to reach Envoy.

Unfortunately there's still an edge case that undermines the "consistent address" statement. If the user configures the ingestor to bind to a public IP address of the host, then 127.0.0.1 can't be dialed by telegraf for that ingestor's port. An extra configuration field per ingestor could be used to solve this; however, I'd prefer to keep it simple and automatic until that use case arises.


func RegisterListenerAddress(name string, address string) {
// The given address is relative to the binding, so it might contain the special "all interfaces"
// address of 0.0.0.0. As such, we need to pick apart and rebuild an address suitable for
Contributor:

I would probably add a comment here, something like:

This will allow a server without an envoy to send its metrics to a different server's envoy/ingestor.

Contributor Author:

Good suggestion. Will do.

GeorgeJahad previously approved these changes Jan 17, 2020
@GeorgeJahad (Contributor) left a comment:

I added a suggested comment, but it looks good to me.

@GeorgeJahad (Contributor):

Actually, I do have one other question about the shared ingestor scenario. Since they are coming through an ingestor whose envoy has a different resource id, is it a problem that we don't have the correct resource id associated with those metrics?

@itzg (Contributor Author) commented Jan 17, 2020

Actually, I do have one other question about the shared ingestor scenario. Since they are coming through an ingestor whose envoy has a different resource id, is it a problem that we don't have the correct resource id associated with those metrics?

Good question. I was going to say the other system could pre-populate resourceId, but currently the Ambassador removes the resourceId tag here
https://github.com/racker/salus-telemetry-ambassador/blob/f452f04c7f48643528038dbb21e9e487b8045e51/src/main/java/com/rackspace/salus/telemetry/ambassador/services/MetricRouter.java#L88-L90
but would fail to re-resolve the resourceId and its labels here
https://github.com/racker/salus-telemetry-ambassador/blob/f452f04c7f48643528038dbb21e9e487b8045e51/src/main/java/com/rackspace/salus/telemetry/ambassador/services/MetricRouter.java#L105

A small tweak in the latter could fall back to retaining the originally provided resourceId rather than an empty labels map.

@jjbuchan (Contributor):

Actually, I do have one other question about the shared ingestor scenario. Since they are coming through an ingestor whose envoy has a different resource id, is it a problem that we don't have the correct resource id associated with those metrics?

This is a similar concern I have (prior to reading the PR) and is why I was going to say that requiring them to install an envoy (with a proxy if required) might be a better solution... but maybe a "small tweak" does solve that.

@itzg (Contributor Author) commented Jan 17, 2020

I spoke too soon. As I was experimenting with the code change I realized further down it promotes the resourceId label that was extracted from the metric's tags to the device field of the external metric:

https://github.com/racker/salus-telemetry-ambassador/blob/f978a1039f3a098b5e8a1ae3569df0473ce079a9/src/main/java/com/rackspace/salus/telemetry/ambassador/services/MetricRouter.java#L119

The warning log and metric are misleading, but the code all works as-is
https://github.com/racker/salus-telemetry-ambassador/blob/f978a1039f3a098b5e8a1ae3569df0473ce079a9/src/main/java/com/rackspace/salus/telemetry/ambassador/services/MetricRouter.java#L107-L111

@jjbuchan (Contributor) left a comment:

I'm still not sure about allowing "binding to 'all interfaces' by specifying 0.0.0.0 as the ingestor bind address, yet converting to 127.0.0.1 when the agent runner is looking up what to use for agent configs", but the rest looks good.

If this is blocking you I'm ok with merging and discussing this later; otherwise it might be worth us chatting about it more.

@@ -86,7 +82,7 @@ func (fbr *FilebeatRunner) PostInstall() error {
return nil
}

-func (fbr *FilebeatRunner) EnsureRunningState(ctx context.Context, applyConfigs bool) {
+func (fbr *FilebeatRunner) EnsureRunningState(ctx context.Context, _ bool) {
Contributor:

Is this the normal way to deprecate a field vs. updating everything that calls it?

Contributor Author:

The telegraf runner actually makes use of that parameter

func (tr *TelegrafRunner) EnsureRunningState(ctx context.Context, applyConfigs bool) {

and yes, that's the standard way IntelliJ recommends fixing the warning
(screenshot of the IntelliJ quick-fix suggestion omitted)

@@ -241,6 +232,6 @@ func (fbr *FilebeatRunner) exePath() string {
return filepath.Join(currentVerLink, binSubpath, "filebeat")
}

-func (fbr *FilebeatRunner) ProcessTestMonitor(correlationId string, content string, timeout time.Duration) (*telemetry_edge.TestMonitorResults, error) {
+func (fbr *FilebeatRunner) ProcessTestMonitor(string, string, time.Duration) (*telemetry_edge.TestMonitorResults, error) {
Contributor:

Is this just a standardization thing since the fields weren't being used?

Contributor Author:

Yes, I could have also done ProcessTestMonitor(_ string, _ string, _ time.Duration) but Go allows the special case of all unnamed parameters. I can change to underscores for consistency. I just wanted to fix warnings while I was in this file.
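For reference, a minimal standalone illustration (hypothetical functions, not from this repo) of the two equivalent ways Go lets you discard unused parameters:

package main

import (
	"fmt"
	"time"
)

// All parameters unnamed: allowed only when no parameter in the list is named.
func ignoredUnnamed(string, string, time.Duration) error { return nil }

// All parameters discarded with underscores; within one parameter list Go
// requires names to be either all present or all absent, and "_" counts as
// a name.
func ignoredUnderscore(_ string, _ string, _ time.Duration) error { return nil }

func main() {
	fmt.Println(ignoredUnnamed("a", "b", time.Second))
	fmt.Println(ignoredUnderscore("a", "b", time.Second))
}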

GeorgeJahad previously approved these changes Jan 17, 2020
@itzg (Contributor Author) commented Jan 20, 2020

In f7146c9 I realized I could both clarify the binding-to-dial-address conversion and solve external IP address binding by only swapping out 0.0.0.0 addresses.

GeorgeJahad previously approved these changes Jan 20, 2020
@jjbuchan (Contributor):

The warning log and metric are misleading

Should that be updated?

@jjbuchan (Contributor):

Looking at that block of code it seems good. What's misleading?

// The given address is relative to the binding, so if an ingestor has been configured to
// listen on all interfaces with 0.0.0.0, such as for accepting telemetry from
// non-envoy-running hosts, then the agent needs to be told the loopback address to dial.
listenerAddresses.Store(name, strings.Replace(address, "0.0.0.0", "127.0.0.1", 1))
Contributor:

Should this be changed to a more specific regex using ^ & $? I'd imagine at worst right now all it's doing is converting an already bad address to an equally bad one, e.g. 10.0.0.0 -> 1127.0.0.1, so maybe it makes no difference, but if it's easy to do a closer match I think I'd prefer that.

Happy to go with whatever you think.

Contributor:

Matching on ^0.0.0.0: could be good?

Contributor Author:

Good point. I'll use net.SplitHostPort and only replace a host that exactly matches "0.0.0.0".
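A minimal sketch of that approach (assumed; the dialableAddress helper is hypothetical and not necessarily the final code in the PR), where only an exact host of 0.0.0.0 is swapped for the loopback address:

package main

import (
	"fmt"
	"net"
)

// dialableAddress converts a bind address into one an agent on the same host
// can dial: only an exact host of 0.0.0.0 is swapped for 127.0.0.1, so
// addresses like 10.0.0.0:5044 are left untouched.
func dialableAddress(bindAddr string) string {
	host, port, err := net.SplitHostPort(bindAddr)
	if err != nil {
		return bindAddr
	}
	if host == "0.0.0.0" {
		return net.JoinHostPort("127.0.0.1", port)
	}
	return bindAddr
}

func main() {
	fmt.Println(dialableAddress("0.0.0.0:5044"))  // 127.0.0.1:5044
	fmt.Println(dialableAddress("10.0.0.0:5044")) // unchanged
}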

@itzg (Contributor Author) commented Jan 20, 2020

Looking at that block of code it seems good. What's misleading?

Nothing :), now that I've realized users could create resources via the API to supply the label processing there in the metric routing.

@itzg itzg merged commit 52751c4 into master Jan 20, 2020
@itzg itzg deleted the itzg/allow-optional-ingestors branch January 20, 2020 20:30