Support RDMA as transport layer protocol #9161
Conversation
In production environments, RDMA has become popular and common on networking cards, so this patch supports RDMA as a transport layer protocol to improve performance and reduce cost. Note that this feature is ONLY implemented/tested on Linux.

Several steps of the full job:
1. Support RDMA for the connection between client & server. This is the most important part, and luckily it's implemented in this patch. Add a new config "rdma-port" for the server side to listen on an RDMA port. Both redis-cli and redis-benchmark work fine with a new argument '--rdma'. The "REPLICAOF" command launches an RDMA client if "rdma-replication" is enabled, and it works fine.
2. Support RDMA cluster mode.
3. Implement async read/write for the client side. Because RDMA does NOT support the POLLOUT event, it's a little difficult to implement the async IO mechanism for hiredis.

The test result is quite exciting:
CPU: Intel(R) Xeon(R) Platinum 8260. NIC: Mellanox ConnectX-5.
Config of redis: appendonly no, port 6379, rdma-port 6379, server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma
For TCP:  ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100
====== PING_INLINE ======
TCP:  QPS: 159017  AVG LAT: 0.183
RDMA: QPS: 523944  AVG LAT: 0.054
====== PING_MBULK ======
TCP:  QPS: 162256  AVG LAT: 0.179
RDMA: QPS: 509839  AVG LAT: 0.056
====== SET ======
TCP:  QPS: 154700  AVG LAT: 0.187
RDMA: QPS: 492368  AVG LAT: 0.058
====== GET ======
TCP:  QPS: 159022  AVG LAT: 0.182
RDMA: QPS: 525099  AVG LAT: 0.054
====== LPUSH (needed to benchmark LRANGE) ======
TCP:  QPS: 142537  AVG LAT: 0.207
RDMA: QPS: 395038  AVG LAT: 0.073
====== LRANGE_100 (first 100 elements) ======
TCP:  QPS: 36171   AVG LAT: 0.657
RDMA: QPS: 55266   AVG LAT: 0.412
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
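For reference, the server-side options described above might look like the following redis.conf fragment. This is only a sketch assembled from the option names and values quoted in this PR ("rdma-port", "rdma-replication", the cpulist settings); the exact syntax is hypothetical:

```conf
port 6379               # keep the ordinary TCP listener
rdma-port 6379          # new in this patch: also listen for RDMA connections
rdma-replication yes    # new in this patch: REPLICAOF connects over RDMA
appendonly no
server_cpulist 12
bgsave_cpulist 16
```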
cool
@oranagra @yossigo @soloestoy
@pizhenwei thank you for the significant contribution.
@pizhenwei Thanks for this contribution! After a discussion with @oranagra and the rest of the core team, we have reached a few conclusions. First, RDMA is not (yet?) a commodity technology so in practical terms we can't test / use / experience it in any way. For example, none of the major public clouds offer it today. We also know very little about it, but we can assess that the surface area for officially supporting it is pretty big and involves many additional aspects, such as:
Because of this, we don't think it's possible at this point in time to accept this contribution and have it as an integral part of the Redis core. What we can and want to do is use this opportunity to move forward with ideas we have already discussed in the past around first-party modules and modularizing Redis. One idea we discussed in the past was having the TLS connection capability implemented as a standalone optional module, so users can use Redis with or without it, or even load alternative connection modules with different TLS implementations. If we pursue this, RDMA support could be (mostly) an external module which can be developed and maintained separately.
+1 for this. I think defining a protocol over RDMA is especially challenging since there are many RDMA (ibverbs) design choices that implement the same function. For example, to transmit a bulk of data, one can use the SEND/RECV primitives, similar to UDP; or use the RDMA WRITE primitive to write the data directly into the peer's memory, which bypasses the peer's CPU; or let the peer read from the sender's buffer using the RDMA READ primitive. These design choices may yield different performance characteristics. In addition, the performance of RDMA is sensitive to the platform it runs on (NIC model, CPU model, etc.). A design choice that performs well on one platform may perform badly on another, so finding an ideal design is even more challenging with so many platforms available.
Closing this PR, but keeping the pizhenwei:feature-rdma branch for performance testing purposes and so on. Issues & suggestions are welcome!
In production environments, RDMA has become popular and common on networking cards, so this patch supports RDMA as a transport layer protocol to improve performance and reduce cost. Note that this feature is ONLY implemented/tested on Linux.
Actually, this is the v2 implementation. The v1 uses the low-level IB verbs API directly; for the code and discussions, see PR: redis#9161. Instead of the low-level API, v2 uses rsocket, which is implemented by rdma-core, to simplify the work in Redis.

The test result is quite exciting:
CPU: Intel(R) Xeon(R) Platinum 8260. NIC: Mellanox ConnectX-5.
Config of redis: appendonly no, port 6379, rdma-port 6379, server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma
For TCP:  ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100
====== PING_INLINE ======
TCP:     QPS: 159017  AVG LAT: 0.183
v1 RDMA: QPS: 523944  AVG LAT: 0.054
v2 RDMA: QPS: 492683  AVG LAT: 0.052
====== PING_MBULK ======
TCP:     QPS: 162256  AVG LAT: 0.179
v1 RDMA: QPS: 509839  AVG LAT: 0.056
v2 RDMA: QPS: 532226  AVG LAT: 0.048
====== SET ======
TCP:     QPS: 154700  AVG LAT: 0.187
v1 RDMA: QPS: 492368  AVG LAT: 0.058
v2 RDMA: QPS: 295534  AVG LAT: 0.095
====== GET ======
TCP:     QPS: 159022  AVG LAT: 0.182
v1 RDMA: QPS: 525099  AVG LAT: 0.054
v2 RDMA: QPS: 411488  AVG LAT: 0.065
====== LPUSH (needed to benchmark LRANGE) ======
TCP:     QPS: 142537  AVG LAT: 0.207
v1 RDMA: QPS: 395038  AVG LAT: 0.073
v2 RDMA: QPS: 353232  AVG LAT: 0.079
====== LRANGE_100 (first 100 elements) ======
TCP:     QPS: 36171   AVG LAT: 0.657
v1 RDMA: QPS: 55266   AVG LAT: 0.412
v2 RDMA: QPS: 52228   AVG LAT: 0.468
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
This sentence has not been true for years already. A more accurate sentence would be: "Almost all HPC, AI and hyper-scale clouds use RDMA as a base for their network fabric". Thanks
Fully abstract the connection types, so that a connection driver can register into redis-server dynamically. Theoretically TCP & Unix sockets could also be implemented through this mechanism, but in the current scenario only TLS is supported to run as a shared library. This could also be used for RDMA; for some discussion see: redis#9161 Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Hi @yossigo:
Hi @jue-jiang, Redis over TCP is not formally specified but there's the de-facto implementation that makes several assumptions, like:
I assumed (and may have been wrong) that some of those assumptions are not necessarily true for RDMA, and that some work is required to analyze and define that.
@jue-jiang
For the Redis scenario, when the client side runs 'get key', the client does NOT know the length of the response data in advance, so it reads stream data and parses the Redis protocol incrementally. But RDMA can only send/receive fixed-size buffers.