Quinn-With-Two-Ns (Contributor) commented Nov 28, 2025

If the workflow task failure is due to gRPC message too large, set GrpcMessageTooLarge as failure_reason on temporal_workflow_task_execution_failed.

closes #1065


Note

Adds a specific failure_reason label, GrpcMessageTooLarge, to the workflow task failure metric, which is recorded when WFT completion fails because of an oversized gRPC message.

  • Metrics/Telemetry:
    • Add FailureReason::GrpcMessageTooLarge and support emitting failure_reason via metrics::failure_reason(...).
  • Worker/Workflow:
    • Store metrics: MetricsContext in Workflows and, on complete_workflow_task error flagged MESSAGE_TOO_LARGE_KEY, fail the WFT with GrpcMessageTooLarge cause and record wf_task_failed with failure_reason="GrpcMessageTooLarge".
  • Tests:
    • Update oversize_grpc_message integration test to enable Prometheus and assert the temporal_workflow_task_execution_failed metric includes failure_reason="GrpcMessageTooLarge".

Written by Cursor Bugbot for commit cdf0c95.
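
The flow described in the note can be sketched roughly as follows. This is a minimal, self-contained illustration and not the SDK's actual code: `CompletionError`, its `too_large` field (standing in for the MESSAGE_TOO_LARGE_KEY flag), the `Metrics` struct, and `handle_completion` are all hypothetical names. Only the label string "GrpcMessageTooLarge" comes from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the SDK's failure reasons. Only the
/// `GrpcMessageTooLarge` label string is taken from this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailureReason {
    GrpcMessageTooLarge,
    Other,
}

impl FailureReason {
    fn label(self) -> &'static str {
        match self {
            FailureReason::GrpcMessageTooLarge => "GrpcMessageTooLarge",
            // Placeholder for this sketch, not a real SDK label.
            FailureReason::Other => "other",
        }
    }
}

/// Hypothetical error type: `too_large` plays the role of the
/// MESSAGE_TOO_LARGE_KEY flag mentioned in the note.
#[derive(Debug)]
struct CompletionError {
    too_large: bool,
}

/// Toy metrics sink (assumption): counts failed workflow tasks per
/// failure_reason label.
#[derive(Default)]
struct Metrics {
    wf_task_failed: HashMap<&'static str, u64>,
}

impl Metrics {
    fn record_wf_task_failed(&mut self, reason: FailureReason) {
        *self.wf_task_failed.entry(reason.label()).or_insert(0) += 1;
    }
}

/// Classify a completion result and, on failure, record the metric with
/// the matching failure_reason. Returns `None` when completion succeeded.
fn handle_completion(
    result: Result<(), CompletionError>,
    metrics: &mut Metrics,
) -> Option<FailureReason> {
    let err = result.err()?;
    let reason = if err.too_large {
        FailureReason::GrpcMessageTooLarge
    } else {
        FailureReason::Other
    };
    metrics.record_wf_task_failed(reason);
    Some(reason)
}
```

In this sketch the metrics sink is passed in explicitly, which loosely mirrors the note's point about storing a MetricsContext where the completion error is handled, so the failure can be counted at the point where its cause is known.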

Quinn-With-Two-Ns requested a review from a team as a code owner November 28, 2025 02:32
FailureReason::Timeout => "timeout".to_owned(),
FailureReason::NexusOperation(op) => format!("operation_{op}"),
FailureReason::NexusHandlerError(op) => format!("handler_error_{op}"),
FailureReason::GrpcMessageTooLarge => "GrpcMessageTooLarge".to_owned(),
Member

I guess this can't be snake case because it was already like this in the other SDKs? Do we already have released versions with it like that? Otherwise it'd be good to make them all snake case since that's more standard

Contributor Author

Yeah, and the other two for workflows are NonDeterminismError and WorkflowError, and those are not snake case
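
To make the naming question concrete, here is a small sketch of the label mapping with the mixed casing discussed in this thread. The arms for Timeout, NexusOperation, NexusHandlerError, and GrpcMessageTooLarge follow the diff lines quoted above. The variant names `Nondeterminism` and `Workflow` and the function name `failure_reason_label` are assumptions made for illustration; only their label strings are taken from the reply.

```rust
/// Illustrative subset of failure reasons. The variant names for the two
/// workflow reasons are hypothetical; their label strings come from the
/// thread above.
enum FailureReason {
    Nondeterminism,
    Workflow,
    Timeout,
    NexusOperation(String),
    NexusHandlerError(String),
    GrpcMessageTooLarge,
}

/// Maps a failure reason to the string used as the failure_reason label.
fn failure_reason_label(reason: &FailureReason) -> String {
    match reason {
        // PascalCase labels, kept as they are for the workflow reasons.
        FailureReason::Nondeterminism => "NonDeterminismError".to_owned(),
        FailureReason::Workflow => "WorkflowError".to_owned(),
        // snake_case labels used by the other reasons.
        FailureReason::Timeout => "timeout".to_owned(),
        FailureReason::NexusOperation(op) => format!("operation_{op}"),
        FailureReason::NexusHandlerError(op) => format!("handler_error_{op}"),
        // The new label follows the existing workflow-task convention.
        FailureReason::GrpcMessageTooLarge => "GrpcMessageTooLarge".to_owned(),
    }
}
```

Keeping the new label in the same style as the other workflow task labels means a dashboard query that filters on failure_reason does not have to handle two conventions within the same metric.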


#[tokio::test]
async fn oversize_grpc_message() {
    use crate::common::{ANY_PORT, NAMESPACE, prom_metrics};
Member

These first two imports appear unused? I also generally try to avoid these scoped imports

Contributor Author

Yeah I removed them
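
For context on what the updated test checks, the kind of assertion it makes against the Prometheus output can be sketched as below. `has_labeled_sample` is a hypothetical helper written for this illustration, not a function from the repository, and it matches labels with a plain substring check, so it is deliberately naive.

```rust
/// Returns true if `exposition` (Prometheus text format) contains a sample
/// line for `metric` that carries the label pair `key="value"`.
/// Hypothetical helper for illustration only.
fn has_labeled_sample(exposition: &str, metric: &str, key: &str, value: &str) -> bool {
    let label = format!("{key}=\"{value}\"");
    exposition.lines().any(|line| {
        let line = line.trim();
        // Skip `# HELP` and `# TYPE` comment lines.
        !line.starts_with('#')
            && line.starts_with(metric)
            // Require the label set to follow the name directly, so a
            // query for `foo` does not match a `foo_total` sample.
            && line[metric.len()..].starts_with('{')
            // Naive substring match on the label pair.
            && line.contains(&label)
    })
}
```

A test could scrape the metrics endpoint into a string and then call `has_labeled_sample(&body, "temporal_workflow_task_execution_failed", "failure_reason", "GrpcMessageTooLarge")` inside an `assert!`.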

Quinn-With-Two-Ns merged commit 87afc29 into temporalio:master Dec 2, 2025
20 checks passed

Development

Successfully merging this pull request may close these issues.

Set GrpcMessageTooLarge as failure_reason for workflow task failed metric

2 participants