Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition issues between the main thread and module threads #12817

Merged
merged 41 commits into from Jan 19, 2024
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
ecdf99c
Fix race condition issues between the main thread and module threads
sundb Nov 28, 2023
716ef44
Let RM_Reply*() and RM_UnblockClient() inside GIL
sundb Dec 4, 2023
6e4de39
Add `from_module` paramter to `moduleBlockedClientTimeOut()` to make …
sundb Dec 5, 2023
c09ff05
cleanup
sundb Dec 5, 2023
152e7f2
Make sure don't touch anything before acquire GIL
sundb Dec 6, 2023
e6a08af
cleanup
sundb Dec 6, 2023
49937df
Upwaord the free of bg
sundb Dec 8, 2023
761f5dd
Merge branch 'unstable' into thread-safe-freeclient
sundb Dec 8, 2023
15db3fc
Fix adding fake client to server.clients_pending_write
sundb Dec 12, 2023
14c0aa9
Add c->conn null assertion for updateClientMemoryUsage() and updateCl…
sundb Dec 12, 2023
1bad868
improve comment
sundb Dec 12, 2023
4b48f42
lock the ae poll block and cache modules size
sundb Dec 16, 2023
dfb5665
Added __thread specifier for ProcessingEventsWhileBlocked
sundb Dec 18, 2023
f9c27e9
Optimize el mutex variable names, and add comments
sundb Dec 19, 2023
0f1e787
Add comment for modules_count
sundb Dec 19, 2023
d5ded67
Remove unnecessary line and update comment
sundb Dec 19, 2023
07f8de1
update comment
sundb Dec 19, 2023
0741699
change lock_ae to lock_el
sundb Dec 19, 2023
1ed674f
update comment
sundb Dec 19, 2023
eeda5e6
Add comment about freeing argvs in module thread
sundb Dec 20, 2023
eaa21da
Merge branch 'unstable' into thread-safe-freeclient
sundb Jan 7, 2024
e26eeb9
Revert some codes after #12905
sundb Jan 7, 2024
c660f6e
Remove unnecessary change
sundb Jan 7, 2024
d272368
Call RM_BlockedClientMeasureTimeStart() and RM_BlockedClientMeasureTi…
sundb Jan 9, 2024
70b78ae
Add comment about thread safe for RM_FreeString, RM_RetainString and …
sundb Jan 9, 2024
8bc235f
cleanup
sundb Jan 9, 2024
496d0db
Merge branch 'unstable' into thread-safe-freeclient
sundb Jan 9, 2024
a47cba0
Revert "Call RM_BlockedClientMeasureTimeStart() and RM_BlockedClientM…
sundb Jan 10, 2024
7be865f
Terminate module block thread first in timeout callback
sundb Jan 10, 2024
54524ca
Revert "Terminate module block thread first in timeout callback"
sundb Jan 10, 2024
17efc14
Avoid calling RM_BlockedClientMeasureTimeEnd in the main thread
sundb Jan 11, 2024
54cdf01
Indent
sundb Jan 11, 2024
588eff4
Revert "Indent"
sundb Jan 12, 2024
cd8b6a3
Revert "Avoid calling RM_BlockedClientMeasureTimeEnd in the main thread"
sundb Jan 12, 2024
0bc0796
Using local lock instead of GIL to protect RM_BlockedClientMeasureTim…
sundb Jan 12, 2024
a06d7dd
Add comments
sundb Jan 12, 2024
d9c18af
fix crash
sundb Jan 12, 2024
42de713
Add new method blockClientPrivdataInit
sundb Jan 14, 2024
393af9e
Add api comment for RM_BlockedClientMeasureTimeStart() and RM_Blocked…
sundb Jan 14, 2024
70d5348
Update src/module.c
sundb Jan 19, 2024
1359fad
Update comments about retained string
sundb Jan 19, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/blocked.c
Expand Up @@ -239,7 +239,7 @@ void replyToBlockedClientTimedOut(client *c) {
addReplyLongLong(c,server.fsynced_reploff >= c->bstate.reploffset);
addReplyLongLong(c,replicationCountAOFAcksByOffset(c->bstate.reploffset));
} else if (c->bstate.btype == BLOCKED_MODULE) {
moduleBlockedClientTimedOut(c);
moduleBlockedClientTimedOut(c, 0);
} else {
serverPanic("Unknown btype in replyToBlockedClientTimedOut().");
}
Expand Down
57 changes: 43 additions & 14 deletions src/module.c
Expand Up @@ -304,8 +304,11 @@ static size_t moduleTempClientMinCount = 0; /* Min client count in pool since

/* We need a mutex that is unlocked / relocked in beforeSleep() in order to
* allow thread safe contexts to execute commands at a safe moment. */
static pthread_mutex_t moduleGIL = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
pthread_mutex_t mutex;
redisAtomic pthread_t owner; /* The thread that currently holds the lock */
} ModuleGIL;
ModuleGIL moduleGIL = { PTHREAD_MUTEX_INITIALIZER, (pthread_t)0 };

/* Function pointer type for keyspace event notification subscriptions from modules. */
typedef int (*RedisModuleNotificationFunc) (RedisModuleCtx *ctx, int type, const char *event, RedisModuleString *key);
Expand Down Expand Up @@ -8202,7 +8205,7 @@ int RM_UnblockClient(RedisModuleBlockedClient *bc, void *privdata) {
* argument, but better to be safe than sorry. */
if (bc->timeout_callback == NULL) return REDISMODULE_ERR;
if (bc->unblocked) return REDISMODULE_OK;
if (bc->client) moduleBlockedClientTimedOut(bc->client);
if (bc->client) moduleBlockedClientTimedOut(bc->client, 1);
}
moduleUnblockClientByHandle(bc,privdata);
return REDISMODULE_OK;
Expand Down Expand Up @@ -8301,8 +8304,10 @@ void moduleHandleBlockedClients(void) {
* This needs to be out of the reply callback above given that a
* module might not define any callback and still do blocking ops.
*/
if (c && !clientHasModuleAuthInProgress(c) && !bc->blocked_on_keys) {
updateStatsOnUnblock(c, bc->background_duration, reply_us, server.stat_total_error_replies != prev_error_replies);
if (c && !clientHasModuleAuthInProgress(c)) {
int had_errors = c->deferred_reply_errors ? !!listLength(c->deferred_reply_errors) :
(server.stat_total_error_replies != prev_error_replies);
updateStatsOnUnblock(c, bc->background_duration, reply_us, had_errors);
Comment on lines +8344 to +8347
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following #12817 (comment)
we call updateStatsOnUnblock() here when from RM_UnblockClient().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean that before this PR it was called twice?
maybe we should add a comment in moduleBlockedClientTimedOut, explaining the if statement by referring to this call.

Copy link
Collaborator Author

@sundb sundb Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, moduleHandleBlockedClients() ignored updating the block status when the client was blocked on keys.

origin code:

if (c && !clientHasModuleAuthInProgress(c) && !bc->blocked_on_keys) {
    updateStatsOnUnblock();
}

}

if (c != NULL) {
Expand Down Expand Up @@ -8355,8 +8360,15 @@ int moduleBlockedClientMayTimeout(client *c) {
/* Called when our client timed out. After this function unblockClient()
* is called, and it will invalidate the blocked client. So this function
* does not need to do any cleanup. Eventually the module will call the
* API to unblock the client and the memory will be released. */
void moduleBlockedClientTimedOut(client *c) {
* API to unblock the client and the memory will be released.
*
* If this function is called from a module, we handle the timeout callback
* and the update of the unblock status in a thread-safe manner to avoid race
* conditions with the main thread.
* If this function is called from the main thread, we must handle the unblocking
* of the client synchronously. This ensures that we can reply to the client before
* resetClient() is called. */
void moduleBlockedClientTimedOut(client *c, int from_module) {
RedisModuleBlockedClient *bc = c->bstate.module_blocked_handle;

/* Protect against re-processing: don't serve clients that are already
Expand All @@ -8365,14 +8377,22 @@ void moduleBlockedClientTimedOut(client *c) {
if (bc->unblocked) return;

RedisModuleCtx ctx;
moduleCreateContext(&ctx, bc->module, REDISMODULE_CTX_BLOCKED_TIMEOUT);
int flags = REDISMODULE_CTX_BLOCKED_TIMEOUT;
if (from_module) flags |= REDISMODULE_CTX_THREAD_SAFE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm trying to figure out this change.
in the top comment i see this change (an addition of REDISMODULE_CTX_THREAD_SAFE) is listed together with the fact we don't call updateStatsOnUnblock (which is a different change in this function).

  1. can you please help me understand why it was needed.
  2. what are the other side effects of this change

Copy link
Collaborator Author

@sundb sundb Dec 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because both of them are fixed to ensure that moduleBlockedClientTimedOut() is thread-safe.

The reason is:

When error reply is called in timeout_callback(), ctx is not thread-safe, eventually lead to race conditions in afterErrorReply.

If the ctx is not to be REDISMODULE_CTX_THREAD_SAFE, afterErrorReply() will be triggered if replying an error.
If REDISMODULE_CTX_THREAD_SAFE is used, all replies will be deferred.

redis/src/networking.c

Lines 501 to 508 in 27a8e3b

if (c->flags & CLIENT_MODULE) {
if (!c->deferred_reply_errors) {
c->deferred_reply_errors = listCreate();
listSetFreeMethod(c->deferred_reply_errors, (void (*)(void*))sdsfree);
}
listAddNodeTail(c->deferred_reply_errors, sdsnewlen(s, len));
return;
}

moduleCreateContext(&ctx, bc->module, flags);
ctx.client = bc->client;
ctx.blocked_client = bc;
ctx.blocked_privdata = bc->privdata;
long long prev_error_replies = server.stat_total_error_replies;

long long prev_error_replies;
if (!from_module)
prev_error_replies = server.stat_total_error_replies;

bc->timeout_callback(&ctx,(void**)c->argv,c->argc);
moduleFreeContext(&ctx);
updateStatsOnUnblock(c, bc->background_duration, 0, server.stat_total_error_replies != prev_error_replies);

if (!from_module)
updateStatsOnUnblock(c, bc->background_duration, 0, server.stat_total_error_replies != prev_error_replies);

/* For timeout events, we do not want to call the disconnect callback,
* because the blocked client will be automatically disconnected in
Expand Down Expand Up @@ -8551,17 +8571,26 @@ void RM_ThreadSafeContextUnlock(RedisModuleCtx *ctx) {
}

void moduleAcquireGIL(void) {
pthread_mutex_lock(&moduleGIL);
pthread_mutex_lock(&moduleGIL.mutex);
atomicSet(moduleGIL.owner, pthread_self());
}

int moduleTryAcquireGIL(void) {
return pthread_mutex_trylock(&moduleGIL);
int ret = pthread_mutex_trylock(&moduleGIL.mutex);
if (ret == 0) atomicSet(moduleGIL.owner, pthread_self());
return ret;
}

void moduleReleaseGIL(void) {
pthread_mutex_unlock(&moduleGIL);
atomicSet(moduleGIL.owner, (pthread_t)0);
oranagra marked this conversation as resolved.
Show resolved Hide resolved
pthread_mutex_unlock(&moduleGIL.mutex);
}

int moduleOwnsGIL(void) {
pthread_t owner_thread;
atomicGet(moduleGIL.owner, owner_thread);
return pthread_equal(pthread_self(), owner_thread);
}

/* --------------------------------------------------------------------------
* ## Module Keyspace Notifications API
Expand Down Expand Up @@ -11928,7 +11957,7 @@ void moduleInitModulesSystem(void) {

/* Our thread-safe contexts GIL must start with already locked:
* it is just unlocked when it's safe. */
pthread_mutex_lock(&moduleGIL);
moduleAcquireGIL();
}

void modulesCron(void) {
Expand Down
17 changes: 10 additions & 7 deletions src/networking.c
Expand Up @@ -415,7 +415,8 @@ void _addReplyToBufferOrList(client *c, const char *s, size_t len) {
* after the command's reply (specifically important during multi-exec). the exception is
* the SUBSCRIBE command family, which (currently) have a push message instead of a proper reply.
* The check for executing_client also avoids affecting push messages that are part of eviction. */
if (c == server.current_client && (c->flags & CLIENT_PUSHING) &&
// TODO: should forbid the ReplyWith* module family api from being called outside the lock?
if ((c->flags & CLIENT_PUSHING) && c == server.current_client &&
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start version: 7.2.0
introduced: #12326
reason: RM_Reply* is not thread-safe

WARNING: ThreadSanitizer: data race (pid=1969717)
  Write of size 8 at 0x558d01162be0 by main thread (mutexes: write M86, write M82, write M84, write M64):
    #0 processCommandAndResetClient /data/redis_fork/src/networking.c:2466 (redis-server+0xe1db4)
    #1 processInputBuffer /data/redis_fork/src/networking.c:2575 (redis-server+0xe1db4)
    #2 readQueryFromClient /data/redis_fork/src/networking.c:2715 (redis-server+0xe28da)
    #3 callHandler /data/redis_fork/src/connhelpers.h:79 (redis-server+0x2943f2)
    #4 connSocketEventHandler /data/redis_fork/src/socket.c:298 (redis-server+0x2943f2)
    #5 aeProcessEvents /data/redis_fork/src/ae.c:436 (redis-server+0x97245)
    #6 aeMain /data/redis_fork/src/ae.c:496 (redis-server+0x97245)
    #7 main /data/redis_fork/src/server.c:7212 (redis-server+0x846d5)

  Previous read of size 8 at 0x558d01162be0 by thread T14:
    #0 _addReplyToBufferOrList /data/redis_fork/src/networking.c:418 (redis-server+0xdc0d6)
    #1 addReplyProto /data/redis_fork/src/networking.c:474 (redis-server+0xdd0a7)
    #2 RM_ReplyWithCallReply /data/redis_fork/src/module.c:3424 (redis-server+0x212732)
    #3 bg_call_worker /data/redis_fork/tests/modules/blockedclient.c:145 (blockedclient.so+0x9c0d)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like #12817 (comment), this is not about the use of RM_AddReply, it's about using the argv strings (changing their refcount).
p.s. how did you conclude it's related to the above mentioned PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they're not related.
this is because main thread many modify server.current_client when module thread read it.
however, c->flags & CLIENT_PUSHING is always false for module threads, so this is harmless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sundb this one at the top comment (number 2), says:

Harm Level: Low

but if that's just an access to a variable and then ignoring what we read from it, isn't that "Harm Level: None"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it should be None.

server.executing_client && !cmdHasPushAsReply(server.executing_client->cmd))
{
_addReplyProtoToList(c,server.pending_push_messages,s,len);
Expand Down Expand Up @@ -1434,7 +1435,7 @@ void unlinkClient(client *c) {
listNode *ln;

/* If this is marked as current client unset it. */
if (server.current_client == c) server.current_client = NULL;
if (c->conn && server.current_client == c) server.current_client = NULL;

/* Certain operations must be done only if the client has an active connection.
* If the client was already unlinked or if it's a "fake client" the
Expand Down Expand Up @@ -1478,7 +1479,7 @@ void unlinkClient(client *c) {
}

/* Remove from the list of pending reads if needed. */
serverAssert(io_threads_op == IO_THREADS_OP_IDLE);
serverAssert(!c->conn || io_threads_op == IO_THREADS_OP_IDLE);
if (c->pending_read_list_node != NULL) {
listDelNode(server.clients_pending_read,c->pending_read_list_node);
c->pending_read_list_node = NULL;
Expand Down Expand Up @@ -1631,6 +1632,12 @@ void freeClient(client *c) {
reqresReset(c, 1);
#endif

/* Remove the contribution that this client gave to our
* incrementally computed memory usage. */
if (c->conn)
server.stat_clients_type_memory[c->last_memory_type] -=
c->last_memory_usage;
Comment on lines +1653 to +1657
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start version: 7.0.0
introduced: #8687
reason: touch server.stat_clients_type_memory without GIL

WARNING: ThreadSanitizer: data race (pid=90167)
  Write of size 8 at 0x0001006f41e0 by main thread (mutexes: write M0, write M1, write M2, write M3):
    #0 updateClientMemoryUsage server.c:956 (redis-server:arm64+0x10001f760)
    #1 clientsCron server.c:1116 (redis-server:arm64+0x1000201fc)
    #2 serverCron server.c:1451 (redis-server:arm64+0x100021b10)
    #3 processTimeEvents ae.c:331 (redis-server:arm64+0x1000100b8)
    #4 aeProcessEvents ae.c:466 (redis-server:arm64+0x10000f614)
    #5 aeMain ae.c:496 (redis-server:arm64+0x1000103bc)
    #6 main server.c:7212 (redis-server:arm64+0x10003e76c)

  Previous write of size 8 at 0x0001006f41e0 by thread T7:
    #0 freeClient networking.c:1684 (redis-server:arm64+0x10005f30c)
    #1 moduleFreeContext module.c:834 (redis-server:arm64+0x1001821ac)
    #2 RM_FreeThreadSafeContext module.c:8494 (redis-server:arm64+0x10019dcf8)
    #3 worker <null>:17631300 (blockedclient.so:arm64+0x828)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is resolved by modifying the freeClient code, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and make updateClientMemoryUsage() and clientEvictionAllowed() no longer keep track of non-conn user memory and whether eviction.
However, this has the side effect that server.stat_clients_type_memory[CLIENT_TYPE_NORMAL] will be lower than it was before this fix was made.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't / shouldn't evict them anyway.
if we tracked them, it was wrong to do that.
we can list this as a fix (not about thread race) in the top comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never been able to remember how I reproduced it.
I remember using RM_Call to reproduce it, but I forget which command, not the following client no-evict command.

Config:

maxmemory-clients 1g

Command:

RedisModule_Call(ctx,"client","cc","no-evict","off");

Patch:

int clientEvictionAllowed(client *c) {
    serverAssert(c->conn);
    if (server.maxmemory_clients == 0 || c->flags & CLIENT_NO_EVICT) {
        return 0;
    }
    int type = getClientType(c);
    return (type == CLIENT_TYPE_NORMAL || type == CLIENT_TYPE_PUBSUB);
}

serverAssert(c->conn); will be triggered.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please als have a look at #12817 (comment) and top comment(7).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think using RM_Call to call the CLIENT command is valid.
specifically the ones manipulating the current client, like enabling tracking, and so on.
the user may be wanting to operate on the calling client, not the fake client, but that's not currently supported, and i think we should just disallow or disregard this case.


/* Unlink the client: this will close the socket, remove the I/O
* handlers, and remove references of the client from different
* places where active clients may be referenced. */
Expand Down Expand Up @@ -1679,10 +1686,6 @@ void freeClient(client *c) {
* we lost the connection with the master. */
if (c->flags & CLIENT_MASTER) replicationHandleMasterDisconnection();

/* Remove the contribution that this client gave to our
* incrementally computed memory usage. */
server.stat_clients_type_memory[c->last_memory_type] -=
c->last_memory_usage;
/* Remove client from memory usage buckets */
if (c->mem_usage_bucket) {
c->mem_usage_bucket->mem_usage_sum -= c->last_memory_usage;
Expand Down
28 changes: 15 additions & 13 deletions src/server.c
Expand Up @@ -949,6 +949,7 @@ static inline clientMemUsageBucket *getMemUsageBucket(size_t mem) {
* usage bucket.
*/
void updateClientMemoryUsage(client *c) {
if (!c->conn) return;
size_t mem = getClientMemoryUsage(c, NULL);
int type = getClientType(c);
/* Now that we have the memory used by the client, remove the old
Expand All @@ -961,7 +962,7 @@ void updateClientMemoryUsage(client *c) {
}

int clientEvictionAllowed(client *c) {
if (server.maxmemory_clients == 0 || c->flags & CLIENT_NO_EVICT) {
if (server.maxmemory_clients == 0 || c->flags & CLIENT_NO_EVICT || !c->conn) {
return 0;
}
int type = getClientType(c);
Expand Down Expand Up @@ -1864,19 +1865,20 @@ void afterSleep(struct aeEventLoop *eventLoop) {
/********************* WARNING ********************
* Do NOT add anything above moduleAcquireGIL !!! *
***************************** ********************/
/* Acquire the modules GIL so that their threads won't touch anything. */
if (moduleCount() && !moduleOwnsGIL()) {
oranagra marked this conversation as resolved.
Show resolved Hide resolved
mstime_t latency;
latencyStartMonitor(latency);

moduleAcquireGIL();
moduleFireServerEvent(REDISMODULE_EVENT_EVENTLOOP,
REDISMODULE_SUBEVENT_EVENTLOOP_AFTER_SLEEP,
NULL);
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("module-acquire-GIL",latency);
}

if (!ProcessingEventsWhileBlocked) {
/* Acquire the modules GIL so that their threads won't touch anything. */
if (moduleCount()) {
mstime_t latency;
latencyStartMonitor(latency);

moduleAcquireGIL();
moduleFireServerEvent(REDISMODULE_EVENT_EVENTLOOP,
REDISMODULE_SUBEVENT_EVENTLOOP_AFTER_SLEEP,
NULL);
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("module-acquire-GIL",latency);
}
/* Set the eventloop start time. */
server.el_start = getMonotonicUs();
/* Set the eventloop command count at start. */
Expand Down
3 changes: 2 additions & 1 deletion src/server.h
Expand Up @@ -2517,12 +2517,13 @@ void moduleFreeContext(struct RedisModuleCtx *ctx);
void moduleCallCommandUnblockedHandler(client *c);
void unblockClientFromModule(client *c);
void moduleHandleBlockedClients(void);
void moduleBlockedClientTimedOut(client *c);
void moduleBlockedClientTimedOut(client *c, int from_module);
void modulePipeReadable(aeEventLoop *el, int fd, void *privdata, int mask);
size_t moduleCount(void);
void moduleAcquireGIL(void);
int moduleTryAcquireGIL(void);
void moduleReleaseGIL(void);
int moduleOwnsGIL(void);
void moduleNotifyKeyspaceEvent(int type, const char *event, robj *key, int dbid);
void firePostExecutionUnitJobs(void);
void moduleCallCommandFilters(client *c);
Expand Down
5 changes: 3 additions & 2 deletions tests/modules/blockedclient.c
Expand Up @@ -135,8 +135,6 @@ void *bg_call_worker(void *arg) {
RedisModuleCallReply *rep = RedisModule_Call(ctx, cmd, format, bg->argv + cmd_pos + 1, bg->argc - cmd_pos - 1);
RedisModule_FreeString(NULL, format_redis_str);

// Release GIL
RedisModule_ThreadSafeContextUnlock(ctx);

// Reply to client
if (!rep) {
Expand All @@ -155,6 +153,9 @@ void *bg_call_worker(void *arg) {
RedisModule_Free(bg->argv);
RedisModule_Free(bg);

Copy link
Collaborator Author

@sundb sundb Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start version: 6.2.0
introduced: #8141
reason: release client argv outside of GIL

WARNING: ThreadSanitizer: data race (pid=91603)
  Read of size 4 at 0x0001027f5534 by main thread (mutexes: write M0, write M1, write M2, write M3):
    #0 decrRefCount object.c:393 (redis-server:arm64+0x1000741f0)
    #1 freeClientArgv networking.c:1391 (redis-server:arm64+0x10005f504)
    #2 resetClient networking.c:2061 (redis-server:arm64+0x100061bfc)
    #3 unblockClient blocked.c:212 (redis-server:arm64+0x10016aa80)
    #4 moduleHandleBlockedClients module.c:8316 (redis-server:arm64+0x10019d4e4)
    #5 blockedBeforeSleep blocked.c:758 (redis-server:arm64+0x10016c5c4)
    #6 beforeSleep server.c:1745 (redis-server:arm64+0x100023a8c)
    #7 aeProcessEvents ae.c:379 (redis-server:arm64+0x10000ee58)
    #8 aeMain ae.c:496 (redis-server:arm64+0x100010148)
    #9 main server.c:7213 (redis-server:arm64+0x10003e538)

  Previous write of size 4 at 0x0001027f5534 by thread T7:
    #0 decrRefCount object.c:407 (redis-server:arm64+0x100074358)
    #1 RM_FreeString module.c:2696 (redis-server:arm64+0x100188f20)
    #2 bg_call_worker <null>:49743940 (usercall.so:arm64+0x30ec)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it because of auto-memory?
the stack trace here seems unrelated to the release of the argv array.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I update the stack trace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what do we conclude here?
that if a module retains the argv strings, it must release them before unblocking the client, or alternatively with the GIL locked?
@MeirShpilraien PTAL

Copy link
Member

@oranagra oranagra Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually the order doesn't matter, if they both decr at the same time it can mess up the refcount.
i suppose we must clone the stings before branching?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should mention it on RM_FreeString, RM_RetainString, and RM_HoldString. We should mentioned that those function are not thread safe and should only be called when the GIL is held. We should decide if we want to document the exception about the RM_FreeString in case you know you are the only owner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. for some reason i thought that RM_ReplyWithString will also be an issue, but now i see it doesn't touch the refcount. (at least not in the current implementation)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sundb this one is linked to number 4 at the top, which says:

Harm Level: None
Trigger assertion

first of all, an assertion isn't "None" it terminates the process and can cause data loss. maybe change it to "Low"?
secondly this can also mess up the refcount, or cause double free and other issues that can lead to memory corruption.
am i missing anything?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my intention was that it was only caused by the module's use of API, not by the internal implement, so I chose None.
let's change it to LOW.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh, ok, maybe in addition to the "Harm Level", we can add some explicit "Trigger", to specify that it depends on some rare case, and people can easily rule it out and know they're safe.
or we can leave it as is, the scenario is there, it's just a little bit hard to understand if you're not keen on the details.

// Release GIL
RedisModule_ThreadSafeContextUnlock(ctx);

// Free the Redis module context
RedisModule_FreeThreadSafeContext(ctx);

Expand Down
5 changes: 3 additions & 2 deletions tests/modules/usercall.c
Expand Up @@ -136,8 +136,6 @@ void *bg_call_worker(void *arg) {
RedisModuleCallReply *rep = RedisModule_Call(ctx, cmd, format, bg->argv + 3, bg->argc - 3);
RedisModule_FreeString(NULL, format_redis_str);

// Release GIL
RedisModule_ThreadSafeContextUnlock(ctx);

// Reply to client
if (!rep) {
Expand All @@ -156,6 +154,9 @@ void *bg_call_worker(void *arg) {
RedisModule_Free(bg->argv);
RedisModule_Free(bg);

// Release GIL
RedisModule_ThreadSafeContextUnlock(ctx);

// Free the Redis module context
RedisModule_FreeThreadSafeContext(ctx);

Expand Down