Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced crash logs events for: report_crash_loop, pod_oom_kill_enricher, image_pull_backoff_reporter, KubePodCrashLooping #1282

Merged
merged 26 commits into from
Feb 27, 2024

Conversation

ganeshrvel
Copy link
Contributor

  • report_crash_loop
  • pod_oom_kill_enricher
  • image_pull_backoff_reporter
  • KubePodCrashLooping

- report_crash_loop
- pod_oom_kill_enricher
- image_pull_backoff_reporter
- KubePodCrashLooping
@ganeshrvel ganeshrvel changed the title Enhanced crash logs events for: Enhanced crash logs events for: report_crash_loop, pod_oom_kill_enricher, image_pull_backoff_reporter, KubePodCrashLooping Feb 9, 2024
Copy link
Contributor

@arikalon1 arikalon1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work
left some comments

blocks = [TableBlock(
[[k, v] for (k, v) in pending_rows],
["label", "value"],
table_name="",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant
"" is the default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return Enrichment(
enrichment_type=EnrichmentType.crash_info,
blocks=blocks,
title="Pod Unscheduled Information")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change to:
Unscheduled Pod Information

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

blocks: List[BaseBlock] = []
investigator = PendingInvestigator(pod)
all_reasons = investigator.investigate()
def get_pending_pod_enrichment(pod: Pod) -> Optional[Enrichment]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this Optional? We never return None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -163,6 +183,10 @@ class ImagePullBackoffInvestigator:
"err_template": 'Error response from daemon: manifest for .*? not found: manifest unknown: Failed to fetch ".*?"',
"reason": ImagePullBackoffReason.ImageDoesntExist | ImagePullBackoffReason.TagNotFound,
},
{
"err_template": r'Error response from daemon: Get ".*?": net\/http: request canceled while waiting for connection \(Client\.Timeout exceeded while awaiting headers\)',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it less sensitive?
Like anything that includes "Timeout exceeded"
What's the downside of that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

investigator = PendingInvestigator(pod)
all_reasons = investigator.investigate()
def get_pending_pod_enrichment(pod: Pod) -> Optional[Enrichment]:
pending_rows: List[List[str]] = []
message = get_unscheduled_message(pod)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cover all the pending scenarios ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did cover the required fields in the new enrichment sheet. But i have reverted back the MarkdownBlock which had the reason for the pending state

event.add_enrichment(blocks)
if pod_issues_enrichments:
issue_message, issues_enrichments = pod_issues_enrichments
blocks: List[BaseBlock] = [MarkdownBlock(f"{message_string}. {issue_message}")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be added to description, and not as Markdown

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if len(pods_with_issue) < 1:
logging.debug(f"`pods_with_issue` for found for issue: {issue}")
return

expected_pods = get_expected_replicas(event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to change
We have here the event, the type of the issue, the wait message and wait reason (which are currently not in use)

We need a new function, get_explanation
Information about number of available pods, and number of unavailable should be taken from the resource status

For Deployment, Statefulset, DaemonSet, message should be:
X pods are available, Y pods are not ready due to $ISSUE TYPE.
($WAIT_REASON: $WAIT_MESSAGE)

For others (Pod, Jobs ...)
Pod is in not ready due to $ISSUE_TYPE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

)

enrichments = get_crash_report_enrichments(pod)
if enrichments:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is neven None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"is_empty": True,
"remarks": f"Logs unavailable for container: {container_status.name}"
}
logging.error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be changed to info or removed.
Also, no point printing the container log, we know it's empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced it with logging.info

)
metadata = None
if not container_log:
metadata = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for the UI? To present an empty log?
How does this look in Slack and the other Sinks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this for the platform UI, to display an empty log.

Slack and msteams sinks screenshots:

Screenshot 2024-02-21 at 5 35 05 PM

Screenshot 2024-02-21 at 5 49 15 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we don't see the empty file on Slack?

On any event, is there a way to send the empty log only to the UI sink?
Maybe we can create a new type of block, EmptyFileBlock that is handled only in the UI sink? (and sends it as an empty file)

Or other solutions that will have a similar result?

Copy link
Contributor

@arikalon1 arikalon1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice works

Added few more comments

@@ -245,8 +245,12 @@ def deployment_events_enricher(event: DeploymentEvent, params: ExtendedEventEnri
event.add_enrichment([events_table_block], {SlackAnnotations.ATTACHMENT: True},
enrichment_type=EnrichmentType.k8s_events, title="Deployment Events")
else:
pods = list_pods_using_selector(dep.metadata.namespace, dep.spec.selector, "status.phase=Running")
event.add_enrichment([MarkdownBlock(f"*Replicas: Desired ({dep.spec.replicas}) --> Running ({len(pods)})*")])
if dep.kind in ["Deployment", "Statefulset", "DaemonSet"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a DeploymentEvent , kind is always a deployment
In addition, I think this logic is not correct for DaemonSet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

else:
issue_text = issue.name

if resource.kind in ["Deployment", "Statefulset", "DaemonSet"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a small fix here:
For DaemonSet the available and desired status fields are different.
I think you want:
desired = desiredNumberScheduled
available = numberAvailable
When you run kubectl get DaemonSet -o yaml some-ds you see in status:

status:
  currentNumberScheduled: 4
  desiredNumberScheduled: 4
  numberAvailable: 4
  numberMisscheduled: 0
  numberReady: 4
  observedGeneration: 1
  updatedNumberScheduled: 4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


event.extend_description(message_text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always add it to description, and not only if we found an enrichment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

)
metadata = None
if not container_log:
metadata = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we don't see the empty file on Slack?

On any event, is there a way to send the empty log only to the UI sink?
Maybe we can create a new type of block, EmptyFileBlock that is handled only in the UI sink? (and sends it as an empty file)

Or other solutions that will have a similar result?

title="Pod Logs"
)
metadata = None
if not log_data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here regarding the empty logs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. Added new EmptyFileBlock

@@ -130,6 +130,35 @@ def truncate_content(self, max_file_size_bytes: int) -> bytes:
return "\n".join(truncated_lines).encode("utf-8")


class EmptyFileBlock(FileBlock):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should not extend a FileBlock
The point of not making it a FileBlock, was to not send it to all existing sinks.

But if this extends a FileBlock, it will be handled as a FileBlock by all sinks, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@arikalon1 arikalon1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work

@arikalon1 arikalon1 merged commit ca11c28 into master Feb 27, 2024
16 checks passed
@arikalon1 arikalon1 deleted the feature/enhance-crashlog-events branch February 27, 2024 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants