Skip to content

2.27.0.0-b518

@Sumukh-Phalgaonkar Sumukh-Phalgaonkar tagged this 02 Sep 15:16
Summary:
The map `session_virtual_wal_` maintains mapping from session ID  to a pointer to the Virtual WAL instance. Each tserver maintains its own map, and an entry is added to the map whenever a new Virtual WAL is initialised. Now, there are two deleters from this map:

  # `DestroyVirtualWALForCDC()` : This comes into action when a walsender exits cleanly.
  # `DestroyVirtualWALBatchForCDC()` : This comes into action when a walsender crashes. This is called by `CleanupSessions()`, which is a bg thread to remove expired sessions. Here we first identify which of the expired sessions have a corresponding VWAL and then we destroy the Virtual WAL and delete the entry from `session_virtual_wal_`.

In the second deleter's code path, we acquire a read lock on `session_virtual_wal_` when we are filtering the expired sessions to find out the ones with a corresponding Virtual WAL. We then relinquish this read lock and acquire a write lock at the time of performing the destroy Virtual WAL and deletion from the map. There exists a period in between the two steps where we release the lock and the first deletion path mentioned above ( i.e `DestroyVirtualWALForCDC()`) may come in and delete the entry from `session_virtual_wal_`. Now when deleter 2 tries to access this entry for destroying Virtual WAL we get a segmentation fault.

To fix this issue, we are getting rid of the filtering logic. Instead `DestroyVirtualWALBatchForCDC()` will be directly called from CleanupSessions() and in the critical session protected by the lock, we will perform necessary deletion  if the expired session exists in `session_virtual_wal_`.
Jira: DB-18021

Test Plan:
QA tests
Adding a UT to repro this test is not possible right now, since we do not have test infra to run walsender in c++ tests. Whereas the java tests do not have the syncpoint infra to deterministically reproduce the race. Ticket for adding the UT: https://yugabyte.atlassian.net/browse/DB-18034

Reviewers: skumar, asrinivasan

Reviewed By: asrinivasan

Subscribers: yql, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D46229
Assets 2
Loading