redis may lose latency records generated by expirecycle #4453
Comments
Excellent analysis, thanks! What about this slightly modified fix?

```diff
diff --git a/src/expire.c b/src/expire.c
index a02fe566..b001ea5e 100644
--- a/src/expire.c
+++ b/src/expire.c
@@ -140,7 +140,7 @@ void activeExpireCycle(int type) {
     if (type == ACTIVE_EXPIRE_CYCLE_FAST)
         timelimit = ACTIVE_EXPIRE_CYCLE_FAST_DURATION; /* in microseconds. */
 
-    for (j = 0; j < dbs_per_call; j++) {
+    for (j = 0; j < dbs_per_call && timelimit_exit == 0; j++) {
         int expired;
         redisDb *db = server.db+(current_db % server.dbnum);
@@ -155,6 +155,7 @@ void activeExpireCycle(int type) {
             unsigned long num, slots;
             long long now, ttl_sum;
             int ttl_samples;
+            iteration++;
 
             /* If there is nothing to expire try next DB ASAP. */
             if ((num = dictSize(db->expires)) == 0) {
@@ -207,18 +208,18 @@ void activeExpireCycle(int type) {
             /* We can't block forever here even if there are many keys to
              * expire. So after a given amount of milliseconds return to the
              * caller waiting for the other active expire cycle. */
-            iteration++;
             if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
                 long long elapsed = ustime()-start;
-
-                latencyAddSampleIfNeeded("expire-cycle",elapsed/1000);
                 if (elapsed > timelimit) timelimit_exit = 1;
+                break;
             }
-            if (timelimit_exit) return;
 
             /* We don't repeat the cycle if there are less than 25% of keys
              * found expired in the current DB. */
         } while (expired > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP/4);
     }
+
+    long long elapsed = ustime()-start;
+    latencyAddSampleIfNeeded("expire-cycle",elapsed/1000);
 }
```

It has the following advantages:
Anyway it can happen that we are unlucky and we expire successive very big keys one after the other till we reach 16 iterations and we finally realize we are doing too much work, but this is kind of a tradeoff... Makes sense?
P.S. @soloestoy please if you agree with the above, could you update your PR? I would like to merge your code to retain attribution of the discovery & fix.
@antirez I agree with the above and have updated my PR, it's a great honor, thank you! ^_^ P.S. we should put the `break` inside the conditional.
My honor and pleasure :-) Yep indeed, break is inside the conditional, sorry for the mistake.
Btw I tested a (fixed) version and in general wrote a regression test for this, and apparently it works as advertised. However I'm not sure whether to commit it, because being time-based it's the kind of test that will randomly fail forever. We have a few more of those and it's terrible :-)
Merged @soloestoy thank you. I remember that when we met in Hangzhou you also showed me different fixes to the object frequency. Are those already PRed? Thanks. |
Here I have an instance which has only dozens of key-value pairs in `db0`, and some of them are very big, like a list containing millions of elements. The big keys had expire times set, so when they were deleted by the background `expire-cycle` the client experienced a lot of latency. I expected to see latency records about `expire-cycle`, but there were none.

I checked the function `activeExpireCycle` and found some bugs. Here they are:

1. From the function we can see that we call `latencyAddSampleIfNeeded` every 16 iterations, but unfortunately there are two `break`s before `iteration++` which can make `iteration` never reach 16; that also makes `timelimit_exit` never become 1. So we should move the `iteration++` before the `break`s.
2. Checking the elapsed time every 16 iterations is a good idea to save CPU time, but if `server.dbnum` is smaller than 16, that can also make `iteration` never reach 16. So we should add a condition `(iteration % server.dbnum) == 0` to protect against it.
3. I think it is not necessary to call `latencyAddSampleIfNeeded` every 16 iterations; calling it only when we reach the `timelimit` is enough, and also once after the `expire-cycle` if we don't reach the `timelimit`.