Support RDMA as transport layer protocol #9161

Closed

pizhenwei
Contributor

RDMA is becoming common in production environments and is widely
available on network cards. Support RDMA as a transport layer protocol
to improve performance and reduce cost.

Note that this feature is ONLY implemented/tested on Linux.

The full job consists of several steps:
1, Support RDMA for the connection between client & server. This is the
   most important part, and luckily it's implemented in this patch.
   Add a new config "rdma-port" for the server side to listen on
   an RDMA port. Both redis-cli and redis-benchmark work fine with a
   new argument '--rdma'. The "REPLICAOF" command launches an RDMA client
   if "rdma-replication" is enabled, and it works fine.

2, Support RDMA cluster mode.

3, Implement async read/write for the client side. Because RDMA does
   NOT support the POLLOUT event, it's a little difficult to implement
   the async IO mechanism for hiredis.

The test results are quite exciting:
CPU: Intel(R) Xeon(R) Platinum 8260.
NIC: Mellanox ConnectX-5.
Config of redis: appendonly no, port 6379, rdma-port 6379,
                 server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma
For TCP: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

====== PING_INLINE ======
 TCP: QPS: 159017   AVG LAT: 0.183
RDMA: QPS: 523944   AVG LAT: 0.054

====== PING_MBULK ======
 TCP: QPS: 162256   AVG LAT: 0.179
RDMA: QPS: 509839   AVG LAT: 0.056

====== SET ======
 TCP: QPS: 154700   AVG LAT: 0.187
RDMA: QPS: 492368   AVG LAT: 0.058

====== GET ======
 TCP: QPS: 159022   AVG LAT: 0.182
RDMA: QPS: 525099   AVG LAT: 0.054

====== LPUSH (needed to benchmark LRANGE) ======
 TCP: QPS: 142537   AVG LAT: 0.207
RDMA: QPS: 395038   AVG LAT: 0.073

====== LRANGE_100 (first 100 elements) ======
 TCP: QPS:  36171   AVG LAT: 0.657
RDMA: QPS:  55266   AVG LAT: 0.412

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
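
For a quick read of the table above, the RDMA/TCP throughput ratio can be computed directly from the posted QPS figures. This is only arithmetic on the numbers in this message, not a new measurement; the names `RESULTS` and `speedups` are made up for this sketch.

```python
# QPS figures copied from the benchmark table above, as (TCP, RDMA) pairs.
RESULTS = {
    "PING_INLINE": (159017, 523944),
    "PING_MBULK": (162256, 509839),
    "SET": (154700, 492368),
    "GET": (159022, 525099),
    "LPUSH": (142537, 395038),
    "LRANGE_100": (36171, 55266),
}


def speedups(results):
    """Return the RDMA/TCP throughput ratio for each benchmark."""
    return {name: rdma / tcp for name, (tcp, rdma) in results.items()}


if __name__ == "__main__":
    for name, ratio in speedups(RESULTS).items():
        print(f"{name:<12} {ratio:.2f}x")
```

With these figures the single-key commands land between roughly 2.8x and 3.3x, while LRANGE_100 is closer to 1.5x.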
@kukey
Contributor

kukey commented Jun 29, 2021

cool

@pizhenwei
Contributor Author

@oranagra @yossigo @soloestoy
Hi, sorry, maybe I should have created an issue to discuss this feature before pushing a new PR.
About a month ago, while debugging iSCSI/iSER, I noticed that storage virtualization performance improved a lot when using RDMA. But I wasn't sure whether Redis could get the same gain (storage IO sizes are always aligned to 4K, while KV sizes are of indefinite length). So I implemented this feature to test the idea, and luckily the test results show almost triple the throughput of TCP.
What next step should I take?

@oranagra
Member

@pizhenwei thank you for the significant contribution.
we are already looking into this PR, please hold on until we publish our feedback.
this is certainly very interesting.
p.s. it's nice to see the TLS / connection abstraction project paying off again 8-)

@yossigo
Member

yossigo commented Jul 8, 2021

@pizhenwei Thanks for this contribution! After a discussion with @oranagra and the rest of the core team, we have reached a few conclusions.

First, RDMA is not (yet?) a commodity technology so in practical terms we can't test / use / experience it in any way. For example, none of the major public clouds offer it today.

We also know very little about it, but we can assess that the surface area for officially supporting it is pretty big and involves many additional aspects, such as:

  • Formally defining the Redis over RDMA protocol (including security, RESP2/RESP3 considerations, consider how replication is done, cluster bus, etc.).
  • Consider how to make that accessible to different clients (can they directly use a lower level RDMA library? create a new standard low-level lib that will serve as the C binding?).
  • Complete the implementation for replication, cluster bus, Sentinel, etc.

Because of this, we don't think it's possible at this point in time to accept this contribution and make it an integral part of the Redis core.

What we can and want to do is use this opportunity to move forward with ideas we have already discussed in the past around first-party modules and modularizing Redis. One idea we discussed in the past was being able to have the TLS connection capability implemented as a standalone optional module, so users can use Redis with or without it, or even load alternative connection modules with different TLS implementations.

If we pursue this, RDMA support could be (mostly) an external module which can be developed and maintained separately.

@FujiZ

FujiZ commented Jul 8, 2021

  • Formally defining the Redis over RDMA protocol (including security, RESP2/RESP3 considerations, consider how replication is done, cluster bus, etc.).

+1 for this. I think defining a protocol over RDMA is especially challenging since RDMA (ibverbs) offers many design choices for implementing the same function. For example, to transmit a bulk of data, one can use the SEND/RECV primitives, similar to UDP; or use the RDMA WRITE primitive to write the data directly into the peer's memory, which bypasses the peer's CPU; or let the peer read from one's own buffer using the RDMA READ primitive. These design choices may yield different performance characteristics.

In addition, the performance of RDMA is sensitive to the platform it runs on (NIC model, CPU model, etc.). The design choice that performs well on one platform may yield bad performance on another one, so finding an ideal design is even more challenging with so many platforms available.

@pizhenwei
Contributor Author

Closing this PR. I will still keep the pizhenwei:feature-rdma branch for performance testing and so on.

Issues & suggestions are welcome!

@pizhenwei pizhenwei closed this Jul 23, 2021
pizhenwei added a commit to pizhenwei/redis that referenced this pull request Jul 23, 2021
RDMA is becoming common in production environments and is widely
available on network cards. Support RDMA as a transport layer protocol
to improve performance and reduce cost.

Note that this feature is ONLY implemented/tested on Linux.

Actually, this is the v2 implementation. The v1 uses the low level IB
verbs API directly; for the code and discussion, see PR:
    redis#9161

Instead of the low level API, the v2 uses rsocket, which is implemented
by rdma-core, to simplify the work in Redis.

The test results are quite exciting:
CPU: Intel(R) Xeon(R) Platinum 8260.
NIC: Mellanox ConnectX-5.
Config of redis: appendonly no, port 6379, rdma-port 6379,
                 server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma
For TCP: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

====== PING_INLINE ======
    TCP: QPS: 159017   AVG LAT: 0.183
v1 RDMA: QPS: 523944   AVG LAT: 0.054
v2 RDMA: QPS: 492683   AVG LAT: 0.052

====== PING_MBULK ======
    TCP: QPS: 162256   AVG LAT: 0.179
v1 RDMA: QPS: 509839   AVG LAT: 0.056
v2 RDMA: QPS: 532226   AVG LAT: 0.048

====== SET ======
    TCP: QPS: 154700   AVG LAT: 0.187
v1 RDMA: QPS: 492368   AVG LAT: 0.058
v2 RDMA: QPS: 295534   AVG LAT: 0.095

====== GET ======
    TCP: QPS: 159022   AVG LAT: 0.182
v1 RDMA: QPS: 525099   AVG LAT: 0.054
v2 RDMA: QPS: 411488   AVG LAT: 0.065

====== LPUSH (needed to benchmark LRANGE) ======
    TCP: QPS: 142537   AVG LAT: 0.207
v1 RDMA: QPS: 395038   AVG LAT: 0.073
v2 RDMA: QPS: 353232   AVG LAT: 0.079

====== LRANGE_100 (first 100 elements) ======
    TCP: QPS:  36171   AVG LAT: 0.657
v1 RDMA: QPS:  55266   AVG LAT: 0.412
v2 RDMA: QPS:  52228   AVG LAT: 0.468

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
@rleon

rleon commented Jul 25, 2021

"First, RDMA is not (yet?) a commodity technology so in practical terms we can't test / use / experience it in any way. For example, none of the major public clouds offer it today."

At least this sentence has not been true for years.
Azure has it from 2015: https://azure.microsoft.com/en-us/blog/azure-linux-rdma-hpc-available/
https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-hpc - RDMA-capable instances
Amazon: https://aws.amazon.com/hpc/efa/
Alicloud: https://www.alibabacloud.com/blog/using-rdma-on-container-service-for-kubernetes_594462

So the more accurate sentence is: "Almost all HPC, AI and hyper-scale clouds use RDMA as a base for their network fabric".

Thanks

pizhenwei added a commit to pizhenwei/redis that referenced this pull request Aug 5, 2021
Fully abstract connection types, so that a connection driver can register into
redis-server dynamically. Theoretically TCP & Unix socket can also be implemented
by this mechanism, but in the current scenario only TLS is supported to run as a
shared library. This should also be used for RDMA; for some discussion, see:
    redis#9161

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
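
The registration idea in this commit message can be illustrated with a small sketch. It is not the Redis code (which is C and built around struct ConnectionType); every name below (`ConnectionType`, `register_connection_type`, `get_connection_type`, `load_tls_driver`) is hypothetical and only models "built-in drivers are always present, optional drivers register themselves when loaded".

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ConnectionType:
    """Hypothetical stand-in for a connection driver's entry points."""
    name: str
    connect: Callable[[str, int], str]


_REGISTRY: Dict[str, ConnectionType] = {}


def register_connection_type(ct: ConnectionType) -> None:
    """Add a driver to the registry; refuse duplicate names."""
    if ct.name in _REGISTRY:
        raise ValueError(f"connection type {ct.name!r} already registered")
    _REGISTRY[ct.name] = ct


def get_connection_type(name: str) -> ConnectionType:
    """Look up a driver by name; fail if it was never loaded."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise LookupError(f"no connection driver loaded for {name!r}") from None


# Built-in driver, registered at startup.
register_connection_type(
    ConnectionType("tcp", lambda host, port: f"tcp://{host}:{port}")
)


def load_tls_driver() -> None:
    """Pretend an optional shared library was loaded and registered itself."""
    register_connection_type(
        ConnectionType("tls", lambda host, port: f"tls://{host}:{port}")
    )
```

In this model, asking for "tls" before `load_tls_driver()` runs raises `LookupError`; an RDMA driver would plug in the same way as the TLS one.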
@jue-jiang

Hi, @yossigo :
I have gone through the Redis Protocol specification. TCP is just used as it is. There is no Redis over TCP protocol. If there is one, please correct me.
From the RDMA side, application developers should just use RDMA the same way they use TCP and obtain better performance. So I think defining a Redis over RDMA protocol is not the first priority. And I agree that a clean way to switch between TCP and RDMA should be done first.

@yossigo
Member

yossigo commented Sep 2, 2021

Hi @jue-jiang,

Redis over TCP is not formally specified but there's the de-facto implementation that makes several assumptions, like:

  • Working on top of a full duplex reliable stream
  • How an endpoint address (equivalent to IP+port) is represented, e.g. when a replica announces its visible address to a master, or in cluster bus messages

I assumed (and may have been wrong) that some of those assumptions are not necessarily true for RDMA and that some work is required to analyze and define that.

@pizhenwei
Contributor Author

pizhenwei commented Sep 3, 2021

@jue-jiang
Redis supports a connection abstraction layer (see struct ConnectionType), and it's designed around stream semantics. Let's look at the basic difference between TCP and RDMA:

  • TCP: stream semantics. You can write/read data of arbitrary length, and the TCP protocol sends/receives it correctly.
  • RDMA: message semantics. Before writing/reading data with the remote side, we must allocate memory and register it as an RDMA memory region with a fixed size.

For the Redis scenario, when the client side runs 'get key', it does NOT know the length of the response data; it reads the stream data and parses the Redis protocol. But RDMA can only send/receive data of a fixed size.
To support Redis over RDMA, the main job is to 'emulate stream semantics by RDMA' so that it is compatible with the Redis connection abstraction layer.
So defining a Redis over RDMA protocol is definitely the first priority.
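
A minimal sketch of what "emulate stream semantics" means, assuming a toy in-memory channel that can only move messages no larger than one fixed-size buffer. No real RDMA verbs are involved and this is not the PR's implementation; `MessageChannel`, `StreamOverMessages`, `read_bulk_string` and `MSG_SIZE` are hypothetical names. The reader parses a RESP bulk string without knowing the reply length in advance.

```python
from collections import deque

MSG_SIZE = 16  # hypothetical size of each pre-registered buffer


class MessageChannel:
    """Toy message transport: every message must fit one fixed-size buffer."""

    def __init__(self, msg_size=MSG_SIZE):
        self.msg_size = msg_size
        self._queue = deque()

    def send_msg(self, payload: bytes) -> None:
        if len(payload) > self.msg_size:
            raise ValueError("payload larger than the registered buffer")
        self._queue.append(bytes(payload))

    def recv_msg(self) -> bytes:
        # An empty result means "nothing queued" in this toy model.
        return self._queue.popleft() if self._queue else b""


class StreamOverMessages:
    """Emulates write()/read() stream semantics on top of MessageChannel."""

    def __init__(self, channel):
        self.channel = channel
        self._pending = bytearray()

    def write(self, data: bytes) -> int:
        # Split arbitrary-length data into buffer-sized messages.
        for off in range(0, len(data), self.channel.msg_size):
            self.channel.send_msg(data[off:off + self.channel.msg_size])
        return len(data)

    def read(self, n: int) -> bytes:
        # Pull messages until n bytes are buffered or the channel is empty.
        while len(self._pending) < n:
            msg = self.channel.recv_msg()
            if not msg:
                break
            self._pending += msg
        out = bytes(self._pending[:n])
        del self._pending[:n]
        return out

    def read_line(self) -> bytes:
        """Read up to and including CRLF, one byte at a time for clarity."""
        line = bytearray()
        while not line.endswith(b"\r\n"):
            byte = self.read(1)
            if not byte:
                raise EOFError("stream ended before CRLF")
            line += byte
        return bytes(line)


def read_bulk_string(stream) -> bytes:
    """Parse one RESP bulk string: '$', length, CRLF, data, CRLF."""
    header = stream.read_line()
    if not header.startswith(b"$"):
        raise ValueError("not a bulk string")
    length = int(header[1:-2])
    data = stream.read(length)
    stream.read(2)  # consume the trailing CRLF
    return data
```

Writing a 40-byte value as a bulk string produces three messages on the channel in this model, and `read_bulk_string` reassembles the value from them. A real design also has to handle flow control and buffer registration, which this sketch leaves out.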
