Context deadline exceeded when using the context passed to the activity #1424

Open
mrkaspa opened this issue Mar 19, 2024 · 9 comments

mrkaspa commented Mar 19, 2024

Expected Behavior

I should not receive the error context deadline exceeded when doing DB operations with the context passed in the Activity parameter.

Actual Behavior

I have code like this in my app

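// CleanupParam is an assumed type with a JobID field; the original snippet
// referenced param.JobID without showing where param came from.
func CleanupActivity(ctx context.Context, param CleanupParam) error {
	// this is a TemporalLogger (pkg/logger/temporal.go)
	log := activity.GetLogger(ctx)
	log.Info("starting cleanup", "jobID", param.JobID)

	return cleanup.Cleanup(ctx, param.JobID)
}

func Cleanup(ctx context.Context, jobID string) error {
	db := database.GetDB()

	// Fetch the job from the database.
	j := &job.Job{ID: jobID}
	err := db.NewSelect().Model(j).WherePK().Scan(ctx)
	if err != nil {
		return wferrors.NewTaskError(errors.WithStack(err), wferrors.ErrCodeDatabase, "failed to fetch job")
	}

	// ... (remainder of Cleanup not shown in the report)
}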

Inside the cleanup.Cleanup function I do database operations with the Bun library, which uses the context that is passed in. When it tries to run a query, I get this error:

DatabaseError, failed to fetch job, context deadline exceeded

So the database query is failing with context deadline exceeded.

Steps to Reproduce the Problem

  1. Use the context passed in the Activity as argument for database access
  2. Deploy on production

Specifications

  • Version: go.temporal.io/sdk v1.25.1
  • Platform: Linux
@Quinn-With-Two-Ns
Contributor

The context passed into an activity has documentation on when users should expect it to be cancelled:

# Context Cancellation

I would suspect in your case the activity is timing out before the database operation is complete.

@mrkaspa
Author

mrkaspa commented Mar 20, 2024

Yes, I thought that too, but my deadline is 1 hour per activity. Once it starts failing, for example on job 1, the subsequent jobs keep failing for the same reason. How can a new job fail for this reason if I have a deadline of 1 hour per activity?

btw, this is how I have the workflow settings

activityoptions := workflow.ActivityOptions{
	// Set Activity Timeout duration
	// ScheduleToCloseTimeout: 5 * time.Second,
	StartToCloseTimeout: 60 * time.Minute,
	// ScheduleToStartTimeout: 10 * time.Second,
}
ctx = workflow.WithActivityOptions(ctx, activityoptions)
ctx = workflow.WithRetryPolicy(ctx, temporal.RetryPolicy{
	MaximumAttempts: 10,
})

@Quinn-With-Two-Ns
Contributor

The effective timeout can be shorter depending on the other activity and workflow options. It is also possible that the error is coming from some internal deadline set in your database library and not from the activity context. You can check the deadline of the context using ctx.Deadline() to see when the context would expire.

https://pkg.go.dev/context#Context

@mrkaspa
Author

mrkaspa commented Mar 20, 2024

The problem is that this does not happen every time. In production, we deploy the solution and everything works fine for some executions. Then, after some days, one workflow starts to fail, and all the following ones fail for the same reason.

@Quinn-With-Two-Ns
Contributor

On one of these occurrences, can you share the actual activity schedule event?

@mrkaspa
Author

mrkaspa commented Apr 9, 2024

@Quinn-With-Two-Ns where can I see that?

@mrkaspa
Author

mrkaspa commented Apr 9, 2024

Right now this is failing again:

error
activity error (type: PreprocessingActivity, scheduledEventID: 5, startedEventID: 6, identity: ): activity StartToClose timeout (type: StartToClose): activity StartToClose timeout (type: StartToClose)

Error
last connection error: connection error: desc = "error reading server preface: read tcp 172.17.0.12:39422->52.26.119.98:7233: use of closed network connection"

PanicError
runtime error: index out of range [4096] with length 4096

We are seeing these errors. I wonder if somehow the connection with the Temporal servers is lost.

@Quinn-With-Two-Ns
Contributor

The last errors look like an issue with your application and not the SDK, but the error messages alone are not enough for me to provide any insight, and I cannot tell what is wrong with your application.

To debug any further, I would need a standalone reproduction of the issue showing the SDK cancelling the context outside of the documented cases where users should expect it to be cancelled:

# Context Cancellation

@mrkaspa
Author

mrkaspa commented Apr 9, 2024

We are using Nomad to deploy our containers, and when we restart them the issue is solved. This issue tends to happen every week: all the workflows start to throw timeouts. I think the reason is that the workers lose their connection with the Temporal servers, so the Temporal server cannot execute the activities and they time out.

2 participants