feat(kubernetes_logs source): Use resource_version of 0 to use cache #9974

Merged
merged 3 commits into from Nov 11, 2021

Conversation

spencergilbert
Contributor

Signed-off-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>

Closes #7943

I had to remove the `const` to set it like this; if there's a better alternative, let me know. Setting this to `0` rather than `None` should allow us to use cached values from the apiserver rather than getting the freshest results from etcd directly.
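
A minimal sketch of the idea, not Vector's actual code: on a Kubernetes list request, leaving `resourceVersion` unset asks for the most recent data, while `resourceVersion=0` lets the apiserver answer from its cache. The helper name `pod_list_path` and the hard-coded `/api/v1/pods` path are assumptions made for illustration.

```rust
/// Hypothetical helper (an assumption for illustration, not Vector's code):
/// builds the request path for listing pods, optionally pinning
/// the `resourceVersion` query parameter.
fn pod_list_path(resource_version: Option<&str>) -> String {
    // Assumed endpoint: the cluster-wide pod list of the core v1 API.
    let base = "/api/v1/pods";
    match resource_version {
        // Unset: the apiserver is asked for the most recent data.
        None => base.to_owned(),
        // Set (for example "0"): the apiserver may answer from its cache.
        Some(rv) => format!("{}?resourceVersion={}", base, rv),
    }
}
```

With this sketch, `pod_list_path(None)` corresponds to the behavior before this PR, and `pod_list_path(Some("0"))` to the behavior after it.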

Signed-off-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>
@spencergilbert spencergilbert added the source: kubernetes_logs Anything `kubernetes_logs` source related label Nov 9, 2021
@spencergilbert spencergilbert self-assigned this Nov 9, 2021
@netlify

netlify bot commented Nov 9, 2021

✔️ Deploy Preview for vector-project canceled.

🔨 Explore the source changes: 172ebeb

🔍 Inspect the deploy log: https://app.netlify.com/sites/vector-project/deploys/618c267ac81a7d0007aa2f77

@spencergilbert spencergilbert added ci-condition: k8s e2e all targets Run Kubernetes E2E test suite for all targets (instead of just the essential subset) ci-condition: k8s e2e tests enable Run Kubernetes E2E test suite for this PR labels Nov 9, 2021
Member

@jszwedko jszwedko left a comment


Makes sense. For the kubernetes_logs source we only consume logs for local pods so I imagine that the local node will have any relevant data available. Are you aware of any risks here?

@spencergilbert
Contributor Author

> Makes sense. For the kubernetes_logs source we only consume logs for local pods so I imagine that the local node will have any relevant data available. Are you aware of any risks here?

We actually don't communicate with the "local" node. Both today and after this change, we make calls to the control plane for pod/namespace information. What changes is our requirement around the freshness of the resource: whether we get it from what the kube-apiserver has cached (part of the control plane) or end up querying etcd directly (also sort of part of the control plane, but more as storage).

Communicating with "node local" APIs is what Datadog suggested after a short review, but that would be a pretty full rewrite.
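
To make the freshness trade-off concrete, here is a rough sketch of how the Kubernetes API documentation describes `resourceVersion` on list requests when `resourceVersionMatch` is unset. The `Freshness` enum and the `list_freshness` function are hypothetical names, not part of Vector or any Kubernetes client.

```rust
/// Hypothetical classification of what a list request asks for.
#[derive(Debug, PartialEq)]
enum Freshness {
    /// Most recent data; per the Kubernetes docs this is a consistent
    /// read that goes through to etcd.
    MostRecent,
    /// Any version the apiserver has available, possibly stale
    /// and typically served from its cache.
    Any,
    /// Data at least as new as the given resource version.
    NotOlderThan(String),
}

/// Maps the `resourceVersion` parameter to the freshness it requests.
/// Assumes `resourceVersionMatch` is not set on the request.
fn list_freshness(resource_version: Option<&str>) -> Freshness {
    match resource_version {
        // Unset (or empty) means "most recent".
        None | Some("") => Freshness::MostRecent,
        // "0" means "any", which is what this PR switches to.
        Some("0") => Freshness::Any,
        // Any other value is a lower bound on freshness.
        Some(rv) => Freshness::NotOlderThan(rv.to_owned()),
    }
}
```

Under these assumptions, moving from `None` to `Some("0")` changes the request from `MostRecent` to `Any`, which is where the possibility of stale metadata comes from.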

@jszwedko
Member

jszwedko commented Nov 9, 2021

Ah, gotcha, thanks for clearing that up.

Is the risk then that we might get stale metadata for k8s resources like the pods or namespaces and use that to annotate the logs?

@spencergilbert
Contributor Author

> Ah, gotcha, thanks for clearing that up.
>
> Is the risk then that we might get stale metadata for k8s resources like the pods or namespaces and use that to annotate the logs?

Yeah, it does introduce that possibility, but I'd consider the stability improvements worth it. As is (especially with every Vector instance making calls) we can DDoS etcd pretty easily :|

@jszwedko
Member

jszwedko commented Nov 9, 2021

Yeah, I agree, I just wanted to make sure I understood the trade-offs.

@github-actions

github-actions bot commented Nov 9, 2021

Soak Test Results

Baseline: 99e7549
Comparison: 3279d38
Total Vector CPUs: 4

What follows is a statistical summary of the soak captures between the SHAs given above. Units are bytes/second/CPU, except for 'skewness' and 'kurtosis'. Higher numbers in 'comparison' are generally better. Higher skewness or kurtosis numbers indicate a lack of consistency in behavior, making predictions of fitness in the field challenging.


datadog_agent_remap_blackhole

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 10.56Mi | 10.60Mi | 10.61Mi | 10.61Mi | -0.35 | -0.58 |
| comparison | 10.59Mi | 10.61Mi | 10.62Mi | 10.62Mi | 0.36 | -0.10 |

datadog_agent_remap_datadog_logs

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 19.86Mi | 19.95Mi | 19.97Mi | 19.97Mi | -0.41 | -0.76 |
| comparison | 18.56Mi | 18.60Mi | 18.61Mi | 18.62Mi | -0.30 | -0.66 |

syslog_humio_logs

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 7.50Mi | 7.65Mi | 7.65Mi | 7.65Mi | -0.46 | -1.29 |
| comparison | 7.07Mi | 7.19Mi | 7.20Mi | 7.21Mi | 0.14 | -1.16 |

syslog_log2metric_humio_metrics

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 5.05Mi | 5.07Mi | 5.08Mi | 5.08Mi | 0.07 | -1.24 |
| comparison | 5.15Mi | 5.18Mi | 5.18Mi | 5.18Mi | 0.36 | -0.89 |

syslog_log2metric_splunk_hec_metrics

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 5.13Mi | 5.26Mi | 5.26Mi | 5.26Mi | -1.05 | -0.49 |
| comparison | 5.35Mi | 5.38Mi | 5.38Mi | 5.38Mi | -0.03 | -1.05 |

syslog_loki

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 3.91Mi | 4.11Mi | 4.15Mi | 4.15Mi | 0.18 | -1.13 |
| comparison | 3.86Mi | 4.05Mi | 4.07Mi | 4.07Mi | -0.33 | -0.98 |

syslog_regex_logs2metric_ddmetrics

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 3.88Mi | 3.89Mi | 3.90Mi | 3.90Mi | 0.28 | -0.75 |
| comparison | 3.77Mi | 3.79Mi | 3.79Mi | 3.79Mi | 0.25 | -1.07 |

syslog_splunk_hec_logs

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 7.02Mi | 7.03Mi | 7.03Mi | 7.03Mi | -0.01 | 0.27 |
| comparison | 7.24Mi | 7.26Mi | 7.26Mi | 7.26Mi | -0.01 | -0.85 |

Signed-off-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>
Signed-off-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>
@github-actions

Soak Test Results

Baseline: a9c310c
Comparison: 172ebeb
Total Vector CPUs: 4

What follows is a statistical summary of the soak captures between the SHAs given above. Units are bytes/second/CPU, except for 'skewness' and 'kurtosis'. Higher numbers in 'comparison' are generally better. Higher skewness or kurtosis numbers indicate a lack of consistency in behavior, making predictions of fitness in the field challenging.


datadog_agent_remap_blackhole

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 10.56Mi | 10.60Mi | 10.61Mi | 10.61Mi | -0.35 | -0.58 |
| comparison | 10.04Mi | 10.09Mi | 10.09Mi | 10.10Mi | -0.17 | -1.16 |

datadog_agent_remap_datadog_logs

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 19.94Mi | 19.98Mi | 19.99Mi | 19.99Mi | 0.06 | -0.74 |
| comparison | 20.12Mi | 20.15Mi | 20.17Mi | 20.17Mi | 0.28 | 0.19 |

splunk_hec_route_s3

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 5.71Mi | 6.04Mi | 6.09Mi | 6.11Mi | -0.27 | -0.74 |
| comparison | 5.47Mi | 5.77Mi | 5.83Mi | 5.83Mi | -0.48 | -0.58 |

splunk_transforms_splunk3

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 2.28Mi | 2.72Mi | 2.79Mi | 2.80Mi | -0.04 | -1.13 |
| comparison | 2.61Mi | 2.81Mi | 2.83Mi | 2.86Mi | -0.32 | -0.50 |

syslog_humio_logs

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 7.50Mi | 7.65Mi | 7.65Mi | 7.65Mi | -0.46 | -1.29 |
| comparison | 7.07Mi | 7.19Mi | 7.20Mi | 7.21Mi | 0.14 | -1.16 |

syslog_log2metric_humio_metrics

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 5.02Mi | 5.04Mi | 5.05Mi | 5.05Mi | -0.47 | -0.31 |
| comparison | 5.00Mi | 5.03Mi | 5.04Mi | 5.04Mi | -0.55 | -0.74 |

syslog_log2metric_splunk_hec_metrics

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 5.19Mi | 5.19Mi | 5.19Mi | 5.19Mi | -0.39 | -1.25 |
| comparison | 5.34Mi | 5.38Mi | 5.38Mi | 5.38Mi | -0.44 | -0.54 |

syslog_loki

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 4.13Mi | 4.37Mi | 4.45Mi | 4.46Mi | 0.29 | -0.12 |
| comparison | 3.86Mi | 4.05Mi | 4.07Mi | 4.07Mi | -0.33 | -0.98 |

syslog_regex_logs2metric_ddmetrics

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 3.71Mi | 3.72Mi | 3.72Mi | 3.72Mi | -0.04 | -1.58 |
| comparison | 3.77Mi | 3.79Mi | 3.79Mi | 3.79Mi | 0.25 | -1.07 |

syslog_splunk_hec_logs

| EXPERIMENT | VALUE_min | VALUE_p90 | VALUE_p99 | VALUE_max | VALUE_skewness | VALUE_kurtosis |
| --- | --- | --- | --- | --- | --- | --- |
| baseline | 7.02Mi | 7.03Mi | 7.03Mi | 7.03Mi | -0.01 | 0.27 |
| comparison | 7.24Mi | 7.26Mi | 7.26Mi | 7.26Mi | -0.01 | -0.85 |

@spencergilbert spencergilbert merged commit 6cfb28d into master Nov 11, 2021
@spencergilbert spencergilbert deleted the spencer/use-apiserver-cache branch November 11, 2021 15:24
jdrouet pushed a commit that referenced this pull request Nov 18, 2021
…9974)

* feat(kubernetes_logs source): Use resource_version of 0 to use cache

Signed-off-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>

* Update reflector unit tests, use to_owned over to_string

Signed-off-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>

* make fmt

Signed-off-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>
Labels
ci-condition: k8s e2e all targets Run Kubernetes E2E test suite for all targets (instead of just the essential subset) ci-condition: k8s e2e tests enable Run Kubernetes E2E test suite for this PR source: kubernetes_logs Anything `kubernetes_logs` source related
Development

Successfully merging this pull request may close these issues.

Vector making inefficient api GETs in large K8s clusters
2 participants