Fail fast when systemic error occurs in poll #8749

panjf2000 · 2021-04-07T04:29:59Z

Most of the ae.c backends didn't explicitly handle errors, and instead ignored all errors and did an implicit retry.
This is desired for EAGAIN and EINTER, but in case of other systematic errors, we prefer to fail and log the error we got rather than get into a busy loop.

ae_evport.c handled EINTR:

redis/src/ae_evport.c

Lines 288 to 296 in fb66e2e

    
           if (port_getn(state->portfd, event, MAX_EVENT_BATCHSZ, &nevents, 
        
               tsp) == -1 && (errno != ETIME || nevents == 0)) { 
        
               if (errno == ETIME || errno == EINTR) 
        
                   return 0; 
        
               /* Any other error indicates a bug. */ 
        
               perror("aeApiPoll: port_get"); 
        
               abort(); 
        
           }

Therefore we should also do that in other polls.

panjf2000 · 2021-04-08T07:54:09Z

@oranagra @yossigo
Got some time to take a look?

ShooterIT · 2021-04-11T14:41:22Z

AFAIK, we don't deal with EINTR in more than one place, I remember we ever discussed that

panjf2000 · 2021-04-12T14:40:22Z

AFAIK, we don't deal with EINTR in more than one place, I remember we ever discussed that

I don't quite get this point, would you share more details about it?

oranagra

I think this is wrong (and it's also wrong in existing code in ae_evport.c).

it doesn't really handle EINTER.
EINTER is actually already implicitly handled by the caller (aeProcessEvents), it'll do a before/after sleep cycle and go back to sleep without processing any file events.

what this does, is terminate the process on any other error (so the title is misleading).
i suppose it may be a good idea to do that, but not in this way.
the perror is likely to go nowhere (if redis is daemonized).
i suppose that a better way to deal with that is let the caller check for the negative return value and errno, and handle it.
but actually the caller (aeProcessEvents) doesn't have access to serverLog these days either.
considering the implication of the missing error check (if there are any?), i'm not sure this is worth fixing.

panjf2000 · 2021-04-21T06:33:32Z

You are right about the title, changed it.

As for this fix, I think Redis ought to fail fast when systemic errors (which are irreparable) occur, notifying users about that, otherwise, it will just fail over and over again without being dealt with properly

oranagra · 2021-04-21T06:40:51Z

you are right.. if such error ever happen, the user will see a busy loop, and no log line or anything that can explain what happens.
there may be two cases though:

a temporary one time error, on which we may wanna log but keep running
a permanent resulting in a busy loop, on which we wanna terminate.

i'm not sure there are actually any such errors in reality, and terminating a the server on a temporary error would lead to a regression, which will take time to fix and re-release.
i suppose we can take the risk and put such a fix in unstable to be released with redis 7.0.

but anyway, the current solution of using perror is not enough. redis will terminate and the majority of users won't get to see the error message.

panjf2000 · 2021-04-21T06:49:52Z

Maybe set up a number of error retry? we will only terminate the redis server when it exceeds that limitation.

panjf2000 · 2021-04-21T06:51:33Z

but anyway, the current solution of using perror is not enough. redis will terminate and the majority of users won't get to see the error message.

then we use serverLog instead?

oranagra · 2021-04-21T07:00:56Z

i'm not sure we can easily distinguish between a transient error and a repeated one, without adding too much code complexity which honestly i don't think is worth it.
i guess we can just take the risk if we do that enough time before releasing it.

regarding the error log, yes, we need to propagate the error upwards and print it there to serverLog.

panjf2000 · 2021-04-21T08:26:05Z

So we just include "server.h" in ae.c?

oranagra · 2021-04-21T08:46:49Z

i don't think we wanna include server.h in ae.c.
two other options may be:

propagate the error upwards and handle it in the caller (i.e. the caller of aeMain and others, where serverLog is available).
include redisassert.h and call panic() (printing errno)

@yossigo WDYT?

yossigo · 2021-04-21T10:41:36Z

@oranagra I think ae is conceptually a low level library (which is also the reason why we're reluctant to have server.h included), and as such it should return errors rather than panic.

oranagra · 2021-04-21T10:44:47Z

@yossigo note that in ziplist.c and few other places we now include redisassert.h and use assert and panic which map to serverAssert and serverPanic

panjf2000 · 2021-04-21T10:53:28Z

So actually we've already broken some boundaries in Redis?

oranagra · 2021-04-21T12:24:07Z

yes. but including redisassert.h is not like including server.h.. it just exposes panic and assert.
i think both solutions are ok in this case. waiting to hear back from Yossi, in case my last comment changes something.

panjf2000 · 2021-04-22T09:10:15Z

Hi @yossigo, any thoughts about the last comment from @oranagra ?

panjf2000 · 2021-04-23T09:41:08Z

The redisassert.h can't be included in ae, it will case link errors:

So I guess we should go with the solution that returns errno?

ShooterIT · 2021-04-23T11:57:51Z

i don't know why we put energy on that, is there a bad case?

yossigo · 2021-04-25T06:54:26Z

I agree both options are ok.

oranagra · 2021-04-25T09:30:15Z

@panjf2000 the link errors are probably easily solvable.
part of it is because ae.c is used by redis-cli and redis-benchmark, not just redis-server. but we can solve that by adding a _serverAssert and serverLog symbols in these too (which will map to just printf and assert).
not sure about the other part (debug.c errors), maybe bad order of inclusions.

the current code may also be ok, maybe a bit more lines, and less hermetic, since some callers may ignore the error, and/or need to know exactly which errors skip, like what aeMain does...

panjf2000 · 2021-04-25T09:38:34Z

OK, I will try to fix those link errors first.

panjf2000 · 2021-04-25T10:00:47Z

BTW, where exactly should I put those two function symbols in redis-cli and redis-benchmark? I don't want to put it in the wrong place that you're not gonna like it.

@oranagra

oranagra · 2021-04-25T10:14:14Z

if it were me, i would have probably tried to put them at the very bottom of each C file (below main) with some comment that says they're here because ae.c uses redisassert.h.

if that's more than one or two line per function, and this code gets complex, maybe we wanna have it shared between them, so maybe cli_common.c, or create a redisassert.c

src/ae_epoll.c

Most of the ae.c backends didn't explicitly handle errors, and instead ignored all errors and did an implicit retry. This is desired for EAGAIN and EINTER, but in case of other systematic errors, we prefer to fail and log the error we got rather than get into a busy loop.

Handle EINTR error on poll

d285b07

oranagra reviewed Apr 21, 2021

View reviewed changes

panjf2000 changed the title ~~Handle EINTR error on poll~~ Fail fast when systemic errors occurs in poll Apr 21, 2021

panjf2000 changed the title ~~Fail fast when systemic errors occurs in poll~~ Fail fast when systemic errors occur in poll Apr 21, 2021

panjf2000 changed the title ~~Fail fast when systemic errors occur in poll~~ Fail fast when systemic error occurs in poll Apr 21, 2021

panjf2000 force-pushed the handle-poll-eintr branch 2 times, most recently from 0e65802 to 061e437 Compare April 25, 2021 11:09

panjf2000 requested a review from oranagra April 25, 2021 11:09

panjf2000 force-pushed the handle-poll-eintr branch from 061e437 to e7be49e Compare April 25, 2021 11:18

Panic when system errors occur in poll

d0944fa

panjf2000 force-pushed the handle-poll-eintr branch from e7be49e to d0944fa Compare April 26, 2021 01:46

oranagra reviewed Apr 26, 2021

View reviewed changes

src/ae_epoll.c Outdated Show resolved Hide resolved

Print error strings

3841fa4

panjf2000 requested a review from oranagra April 26, 2021 09:36

oranagra approved these changes Apr 26, 2021

View reviewed changes

oranagra merged commit a8b6596 into redis:unstable Apr 26, 2021

panjf2000 deleted the handle-poll-eintr branch April 26, 2021 13:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fail fast when systemic error occurs in poll #8749

Fail fast when systemic error occurs in poll #8749

panjf2000 commented Apr 7, 2021 •

edited

Loading

panjf2000 commented Apr 8, 2021

ShooterIT commented Apr 11, 2021

panjf2000 commented Apr 12, 2021 •

edited

Loading

oranagra left a comment

panjf2000 commented Apr 21, 2021

oranagra commented Apr 21, 2021

panjf2000 commented Apr 21, 2021

panjf2000 commented Apr 21, 2021

oranagra commented Apr 21, 2021

panjf2000 commented Apr 21, 2021

oranagra commented Apr 21, 2021 •

edited

Loading

yossigo commented Apr 21, 2021

oranagra commented Apr 21, 2021

panjf2000 commented Apr 21, 2021

oranagra commented Apr 21, 2021

panjf2000 commented Apr 22, 2021

panjf2000 commented Apr 23, 2021

ShooterIT commented Apr 23, 2021

yossigo commented Apr 25, 2021

oranagra commented Apr 25, 2021

panjf2000 commented Apr 25, 2021

panjf2000 commented Apr 25, 2021

oranagra commented Apr 25, 2021

	if (port_getn(state->portfd, event, MAX_EVENT_BATCHSZ, &nevents,
	tsp) == -1 && (errno != ETIME \|\| nevents == 0)) {
	if (errno == ETIME \|\| errno == EINTR)
	return 0;

	/* Any other error indicates a bug. */
	perror("aeApiPoll: port_get");
	abort();
	}

Fail fast when systemic error occurs in poll #8749

Fail fast when systemic error occurs in poll #8749

Conversation

panjf2000 commented Apr 7, 2021 • edited Loading

panjf2000 commented Apr 8, 2021

ShooterIT commented Apr 11, 2021

panjf2000 commented Apr 12, 2021 • edited Loading

oranagra left a comment

Choose a reason for hiding this comment

panjf2000 commented Apr 21, 2021

oranagra commented Apr 21, 2021

panjf2000 commented Apr 21, 2021

panjf2000 commented Apr 21, 2021

oranagra commented Apr 21, 2021

panjf2000 commented Apr 21, 2021

oranagra commented Apr 21, 2021 • edited Loading

yossigo commented Apr 21, 2021

oranagra commented Apr 21, 2021

panjf2000 commented Apr 21, 2021

oranagra commented Apr 21, 2021

panjf2000 commented Apr 22, 2021

panjf2000 commented Apr 23, 2021

ShooterIT commented Apr 23, 2021

yossigo commented Apr 25, 2021

oranagra commented Apr 25, 2021

panjf2000 commented Apr 25, 2021

panjf2000 commented Apr 25, 2021

oranagra commented Apr 25, 2021

panjf2000 commented Apr 7, 2021 •

edited

Loading

panjf2000 commented Apr 12, 2021 •

edited

Loading

oranagra commented Apr 21, 2021 •

edited

Loading