Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail fast when systemic error occurs in poll #8749

Merged
merged 3 commits into from
Apr 26, 2021

Conversation

panjf2000
Copy link
Contributor

@panjf2000 panjf2000 commented Apr 7, 2021

Most of the ae.c backends didn't explicitly handle errors, and instead ignored all errors and did an implicit retry.
This is desired for EAGAIN and EINTER, but in case of other systematic errors, we prefer to fail and log the error we got rather than get into a busy loop.

ae_evport.c handled EINTR:

redis/src/ae_evport.c

Lines 288 to 296 in fb66e2e

if (port_getn(state->portfd, event, MAX_EVENT_BATCHSZ, &nevents,
tsp) == -1 && (errno != ETIME || nevents == 0)) {
if (errno == ETIME || errno == EINTR)
return 0;
/* Any other error indicates a bug. */
perror("aeApiPoll: port_get");
abort();
}

Therefore we should also do that in other polls.

@panjf2000
Copy link
Contributor Author

@oranagra @yossigo
Got some time to take a look?

@ShooterIT
Copy link
Collaborator

AFAIK, we don't deal with EINTR in more than one place, I remember we ever discussed that

@panjf2000
Copy link
Contributor Author

panjf2000 commented Apr 12, 2021

AFAIK, we don't deal with EINTR in more than one place, I remember we ever discussed that

I don't quite get this point, would you share more details about it?

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong (and it's also wrong in existing code in ae_evport.c).

it doesn't really handle EINTER.
EINTER is actually already implicitly handled by the caller (aeProcessEvents), it'll do a before/after sleep cycle and go back to sleep without processing any file events.

what this does, is terminate the process on any other error (so the title is misleading).
i suppose it may be a good idea to do that, but not in this way.
the perror is likely to go nowhere (if redis is daemonized).
i suppose that a better way to deal with that is let the caller check for the negative return value and errno, and handle it.
but actually the caller (aeProcessEvents) doesn't have access to serverLog these days either.
considering the implication of the missing error check (if there are any?), i'm not sure this is worth fixing.

@panjf2000 panjf2000 changed the title Handle EINTR error on poll Fail fast when systemic errors occurs in poll Apr 21, 2021
@panjf2000 panjf2000 changed the title Fail fast when systemic errors occurs in poll Fail fast when systemic errors occur in poll Apr 21, 2021
@panjf2000 panjf2000 changed the title Fail fast when systemic errors occur in poll Fail fast when systemic error occurs in poll Apr 21, 2021
@panjf2000
Copy link
Contributor Author

You are right about the title, changed it.

As for this fix, I think Redis ought to fail fast when systemic errors (which are irreparable) occur, notifying users about that, otherwise, it will just fail over and over again without being dealt with properly

@oranagra
Copy link
Member

you are right.. if such error ever happen, the user will see a busy loop, and no log line or anything that can explain what happens.
there may be two cases though:

  1. a temporary one time error, on which we may wanna log but keep running
  2. a permanent resulting in a busy loop, on which we wanna terminate.

i'm not sure there are actually any such errors in reality, and terminating a the server on a temporary error would lead to a regression, which will take time to fix and re-release.
i suppose we can take the risk and put such a fix in unstable to be released with redis 7.0.

but anyway, the current solution of using perror is not enough. redis will terminate and the majority of users won't get to see the error message.

@panjf2000
Copy link
Contributor Author

Maybe set up a number of error retry? we will only terminate the redis server when it exceeds that limitation.

@panjf2000
Copy link
Contributor Author

but anyway, the current solution of using perror is not enough. redis will terminate and the majority of users won't get to see the error message.

then we use serverLog instead?

@oranagra
Copy link
Member

i'm not sure we can easily distinguish between a transient error and a repeated one, without adding too much code complexity which honestly i don't think is worth it.
i guess we can just take the risk if we do that enough time before releasing it.

regarding the error log, yes, we need to propagate the error upwards and print it there to serverLog.

@panjf2000
Copy link
Contributor Author

So we just include "server.h" in ae.c?

@oranagra
Copy link
Member

oranagra commented Apr 21, 2021

i don't think we wanna include server.h in ae.c.
two other options may be:

  1. propagate the error upwards and handle it in the caller (i.e. the caller of aeMain and others, where serverLog is available).
  2. include redisassert.h and call panic() (printing errno)

@yossigo WDYT?

@yossigo
Copy link
Member

yossigo commented Apr 21, 2021

@oranagra I think ae is conceptually a low level library (which is also the reason why we're reluctant to have server.h included), and as such it should return errors rather than panic.

@oranagra
Copy link
Member

@yossigo note that in ziplist.c and few other places we now include redisassert.h and use assert and panic which map to serverAssert and serverPanic

@panjf2000
Copy link
Contributor Author

So actually we've already broken some boundaries in Redis?

@oranagra
Copy link
Member

yes. but including redisassert.h is not like including server.h.. it just exposes panic and assert.
i think both solutions are ok in this case. waiting to hear back from Yossi, in case my last comment changes something.

@panjf2000
Copy link
Contributor Author

Hi @yossigo, any thoughts about the last comment from @oranagra ?

@panjf2000
Copy link
Contributor Author

The redisassert.h can't be included in ae, it will case link errors:
image

So I guess we should go with the solution that returns errno?

@ShooterIT
Copy link
Collaborator

i don't know why we put energy on that, is there a bad case?

@yossigo
Copy link
Member

yossigo commented Apr 25, 2021

I agree both options are ok.

@oranagra
Copy link
Member

@panjf2000 the link errors are probably easily solvable.
part of it is because ae.c is used by redis-cli and redis-benchmark, not just redis-server. but we can solve that by adding a _serverAssert and serverLog symbols in these too (which will map to just printf and assert).
not sure about the other part (debug.c errors), maybe bad order of inclusions.

the current code may also be ok, maybe a bit more lines, and less hermetic, since some callers may ignore the error, and/or need to know exactly which errors skip, like what aeMain does...

@panjf2000
Copy link
Contributor Author

OK, I will try to fix those link errors first.

@panjf2000
Copy link
Contributor Author

BTW, where exactly should I put those two function symbols in redis-cli and redis-benchmark? I don't want to put it in the wrong place that you're not gonna like it.

@oranagra

@oranagra
Copy link
Member

if it were me, i would have probably tried to put them at the very bottom of each C file (below main) with some comment that says they're here because ae.c uses redisassert.h.

if that's more than one or two line per function, and this code gets complex, maybe we wanna have it shared between them, so maybe cli_common.c, or create a redisassert.c

@panjf2000 panjf2000 force-pushed the handle-poll-eintr branch 2 times, most recently from 0e65802 to 061e437 Compare April 25, 2021 11:09
@panjf2000 panjf2000 requested a review from oranagra April 25, 2021 11:09
src/ae_epoll.c Outdated Show resolved Hide resolved
@panjf2000 panjf2000 requested a review from oranagra April 26, 2021 09:36
@oranagra oranagra merged commit a8b6596 into redis:unstable Apr 26, 2021
@panjf2000 panjf2000 deleted the handle-poll-eintr branch April 26, 2021 13:23
madolson pushed a commit to madolson/redis that referenced this pull request May 12, 2021
Most of the ae.c backends didn't explicitly handle errors, and instead
ignored all errors and did an implicit retry.
This is desired for EAGAIN and EINTER, but in case of other systematic
errors, we prefer to fail and log the error we got rather than get into a busy loop.
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Sep 8, 2021
Most of the ae.c backends didn't explicitly handle errors, and instead
ignored all errors and did an implicit retry.
This is desired for EAGAIN and EINTER, but in case of other systematic
errors, we prefer to fail and log the error we got rather than get into a busy loop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants