
Second Channel For RDB #11678

Open
naglera opened this issue Jan 2, 2023 · 5 comments · May be fixed by #12109

Comments

@naglera

naglera commented Jan 2, 2023

The issue that is being addressed

During bgsave, the primary's child process sends an RDB snapshot to the replica. During this time, any write command the primary processes is kept in the client output buffer (COB) so it can be sent to the replica once bgsave is done.
If the save takes too long, or write commands arrive at a high frequency, the COB may reach its limit, causing the full sync to fail.
By implementing a more efficient way to stream RDB data from primary to replica, this feature aims not only to reduce COB overruns but also to simplify the bgsave process.

Description

The primary will send the RDB and the command stream simultaneously during the full sync. The replica, in turn, will store the command stream in a local buffer and replay it locally once the new snapshot is loaded.
By doing so we gain the following:

  1. Reduced memory load. The replica will be responsible for the online replication buffer. This reduces the memory load on the primary during the save, which is already heavy due to copy-on-write (COW) memory usage.
  2. Faster sync. The primary's main process is a bottleneck. The feature removes from the primary main process the responsibility of being the bridge between the background process and the replica, so even when the primary main process is handling heavy commands, the full sync continues.
  3. Reduced stale data. Currently, when the COB on the primary holds many important changes, the replica starts answering clients' reads as soon as it finishes loading the DB sent from the primary. This leaves clients with stale data that may take long to propagate (transferring a COB of a few gigabytes can take minutes).
    With this feature, command streaming will be much faster on the replica side, making the time between the replica starting to respond to clients and the replica being completely up to date considerably shorter.

Alternatives we've considered

One connection for both data types

During full sync, the primary will use the single connection to its replica to write both the RDB and the command stream. This is done by prefixing each message with a header indicating the message type:

$RDB
RDB data...
$COB
Command Stream data...
...

Pros:

  • Simplicity. Single connection, easy to read and debug.
  • Extensibility. A "chit chat" protocol line is easy to build on top of: the replica's message-header processing will be done in a generic way, so adding more message types in the future will be much easier. Discussed in [WIP] replication: handling PINGs as out-of-band data #8440.

Cons:

  • Reduced parallelism. Compared to the second-channel approach, all writes and key iteration must go through the main process.
  • Message header overhead. Each message on the primary-replica connection carries a few extra bytes for the message type.

Additional information

Preliminary results

We measured the memory impact of the feature with a 3 GB DB and a high write load on the primary: we were able to move 80% of the memory used by the COB to the replica side.

Graph explanation:

  1. In a regular bgsave, the COB starts to grow as soon as the background process is created, whereas in the POC the COB grows only during the load from disk. In practice we will be able to reduce the size much more by yielding frequently during the snapshot load.
  2. Why steps? I used client_recent_max_output_buffer to measure the COB size; it shows the recent max, so it only updates about once per second.
  3. Maximum flat: this is also an artifact of the recent max. There is also a continuous write burst to the primary during this time, so although the replica started reading from the primary's COB, clients keep filling it.

This is the impact on the replica's buffer size. The same memory usage is distributed between the primary and the replica (mostly the replica).
Design

  • Flow


  • High level design

Once we know a full sync is needed:
Setup:

  1. The replica creates a second channel.
  2. Over the second channel, the replica sends a new command
    new-command <master_host> <master_port> --rdb <path_to_replica_disk> +send-end-offset
    asking for a snapshot with the new --rdb sub-command (similar to the rdb command, but diskless on the primary side), so that the primary child process will send the end offset of the new RDB before the RDB itself.

Full Sync:

  1. On the primary side: the bgprocess is created. (The bgprocess uses the second channel.)
    1. The end offset of the not-yet-created RDB is sent first.
    2. The bgprocess starts sending the RDB through the second channel.
  2. On the replica side:
    1. Get the end offset, and send back through the main channel:
      PSYNC <master_repl_id> <rdb_end_offset>+1
    2. Meanwhile, receive the RDB from the second channel, saving it to disk or memory.
  3. The primary answers the PSYNC and from then on sends the online replication data to the newly "connected" replica.
  4. The replica reads the replication data and stores it in a local COB:
void readQueryFromClient(connection *conn) {
    // ... read and parse the command
    if ((c->flags & CLIENT_MASTER) && server.repl_state == REPL_STATE_TRANSFER) {
        /* During the RDB transfer, buffer the master stream locally
         * instead of executing it. */
        SaveCommandsToLocalCob(c->querybuf, offset);
        server.amz.pending_master_stream_size += nread;
        return;
    }
    // ... process the command
}
  5. When the child process has sent all the data, it closes the second connection and terminates.

Recovery

  1. The replica finishes loading the RDB.
  2. The replica streams the local COB into memory.
  3. The replica is back online.
  4. The replica streams the regular COB from the primary into memory.
@soloestoy
Collaborator

soloestoy commented Jan 10, 2023

Thanks for your report. It's a good idea, and I have thought about doing this too, to solve the replication stream problem during the full sync stage. But I didn't finish it, since there are two questions I couldn't figure out:

  1. Indeed we can reduce the replication stream memory load on the master, but it doesn't eliminate the memory; we just transfer it from the master to the replica, so overall the memory is still there.
  2. Now we use the global replication buffer to store the replication stream, which means that no matter how many replicas there are, the master holds just one replication buffer. If we transfer the replication stream to the replicas (we allow multiple replicas to do a full sync), we end up with many copies on the replicas, so the total memory across master and replicas increases.

@oranagra
Member

A few quick comments:

  • In the past, RDB writes had to go through the main process anyway in TLS mode; maybe we can avoid that in this case if the child terminates that connection.
  • I'd like to solve the PING problems with this change too, which is why I preferred multiplexing in the past.
  • With multiplexing, the overhead for the data stream can be low (a small overhead per chunk, not per message/command).
  • Is the benchmark shown above done with diskless or disk-based sync?
  • You mention the sync is faster; I guess that's mainly from the perspective of the master (either way, currently we don't read more of the command stream from the socket while the replica is parsing the RDB file, right?)

@ranshid
Collaborator

ranshid commented Jan 10, 2023

Just to add a small point to the discussion:
One benefit I see in having a second channel is that when the main engine process is also busy processing commands, the RDB streaming is not affected. Traffic on the primary does not have to be mutative (i.e., reflected on the replica) to keep the main process busy, and I do find some value in that. But I agree multiplexing does have other advantages, as said before.

@naglera
Author

naglera commented Jan 10, 2023

@oranagra, to your questions:

  • Since the second channel can be closed by the child process before it terminates, we will not have the TLS issues we had in the past (different processes will not have to share connections).
  • Agreed on both points (PING & low overhead); a second channel will not take us closer to solving the PING problem, compared to the multiplexing option. There's a possibility of opening a dedicated chit-chat channel, but I think that would be overkill.
  • The benchmark above was done with a disk-based full sync.
  • The sync itself is not faster, but the time it takes to load the incremental data accumulated during the sync shrinks, due to the locality of the buffer.
    The replica does read the command stream from the socket during the RDB load. It happens in rdbLoadProgressCallback. Currently we mostly use it to answer clients with (error) LOADING Redis is loading the dataset in memory.

@naglera
Author

naglera commented Jan 10, 2023

To @soloestoy's question:

  1. It is true that we only transfer load from the master to the replica. I still think it is helpful, for two reasons:
    a. During the full sync, the master also suffers from COW memory load, so decreasing the memory load on the master side is more impactful, as it always uses more memory than the replica.
    b. If one instance might go into swap, I would mostly prefer it to be the replica.
  2. In my experience, fully syncing multiple replicas simultaneously is rare. I do agree that there is memory waste in that case. We could make an exception whenever more than one replica is trying to sync from the primary.

@naglera naglera linked a pull request Apr 27, 2023 that will close this issue