Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gemini may report success even when a write fails #265

Closed
kbr- opened this issue Mar 18, 2021 · 1 comment
Closed

Gemini may report success even when a write fails #265

kbr- opened this issue Mar 18, 2021 · 1 comment

Comments

@kbr-
Copy link

kbr- commented Mar 18, 2021

When attempting a write, Gemini performs a retry loop:

func (cs *cqlStore) mutate(ctx context.Context, builder qb.Builder, ts time.Time, values ...interface{}) (err error) {
	var i int
	for i = 0; i < cs.maxRetriesMutate; i++ {
		err = cs.doMutate(ctx, builder, ts, values...)
		if err == nil {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(cs.maxRetriesMutateSleep):
		}
	}
	if err != nil {
		if w := cs.logger.Check(zap.InfoLevel, "failed to apply mutation"); w != nil {
			w.Write(zap.Int("attempts", i), zap.Error(err))
		}
		return
	}

	cs.ops.WithLabelValues(cs.system, opType(builder)).Inc()
	return
}

We can see that if it fails maxRetriesMutate times, it reports an error. This will cause Gemini to return non-zero error code as well.

However there is one case where Gemini can exit without any errors even if a write fails. This is happens when we enter

		case <-ctx.Done():
			return ctx.Err()

which happens e.g. when we send an interrupt to Gemini. In this case we exit the loop in the middle of it and there is no guarantee that this particular mutate was successful (all previous retries failed, and the retry in which we exit might have failed as well).

Perhaps a mechanism should be introduced where Gemini finishes all in-progress mutate loops when it is interrupted before it exits so we can know if any write has failed (including the last batch of writes). If the user really wants Gemini to stop, the mechanism could wait for a second interrupt, in which case it would forcibly abort all loops like it does now.

@dkropachev
Copy link
Collaborator

This particular code servers mutation operation, not whole cycle.

Additionally, when ctx.Done() returns anything ctx.Err() will return error, most likely context.Canceled, altenrativelly it could be context.DeadlineExceeded, but it is never empty, as result operation that is in the progress are getting aborted and error reported that is getting submitted into result pipeline.

But at the same time, issue reported regarding commanding Gemini to stop is real, we should work it out.
I will close this issue in the favor of #295

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants