
Performance observations when running NFs on the local and/or remote NUMA node in a server #285

Balaram6712 opened this issue Apr 29, 2021 · 11 comments


@Balaram6712

Respected sir,

Please find below our observations and a couple of queries on how we can improve the throughput in these cases.

Experimental setup:

We run OpenNetVM on a two-node (NUMA) server with 24 cores (12 cores per node), 96 GB of memory, and one dual-port 10G NIC. We use DPDK Pktgen to generate traffic; it runs on another server with the same configuration. Both servers run Ubuntu 18.04. Both NIC ports are located on Node-0 (which we call the local node); Node-1 is the remote node. The two servers are connected back-to-back via the dual-port 10G NICs.

On top of this setup, we ran the following experiments:

  1. Running a single NF (basic_monitor)
    1. Sending 9 Gbps of traffic from one port
      1. When the NF runs on the local node, we observed full throughput (9 Gbps).
      2. When the NF runs on the remote node, we observed 4.5 Gbps throughput.
        1. In order to improve the throughput, we made the following changes (collected in the sketch after this list) and observed 7 Gbps:
          • Increased the NF's queue ring size here by changing to #define NF_QUEUE_RINGSIZE 65536
          • Increased the per-port RX/TX descriptor ring sizes used by the RX and TX threads by changing to
            #define RTE_MP_RX_DESC_DEFAULT 8192 and #define RTE_MP_TX_DESC_DEFAULT 8192
          • Increased the packet read (burst) size on the NF side here by changing to #define PACKET_READ_SIZE ((uint16_t)1024)
    2. Sending 18 Gbps of traffic from two ports (9 Gbps from each port)
      1. In order to send traffic from both ports to the NF, we modified the manager so that the action and service ID are assigned to each packet directly in this function here
        Screenshot from 2021-04-24 14-45-02
      2. We ran the basic_monitor NF on the local node (Node-0) and observed a throughput of around 12.5 Gbps, whereas running the same NF on the remote node gave 4.5 Gbps.
      3. The following attempts were made to improve the throughput:
        • Using multiple RX threads by increasing the RX thread count in the code here to #define ONVM_NUM_RX_THREADS 2, and increasing the number of cores allocated to the manager to 5 (2 RX, 2 TX, 1 manager).
        • Also increasing the buffer sizes as mentioned above.
        • After these changes, we observed throughputs of 15 Gbps and 6.4 Gbps when the NF runs on the local node and the remote node, respectively.
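
For reference, here are the compile-time changes mentioned above collected in one place (a sketch of the values we used; the macro locations are the ones linked above):

/* Compile-time changes used in the experiments above (locations per the links above) */
#define NF_QUEUE_RINGSIZE 65536            /* NF RX/TX ring size */
#define RTE_MP_RX_DESC_DEFAULT 8192        /* RX descriptor ring size per port */
#define RTE_MP_TX_DESC_DEFAULT 8192        /* TX descriptor ring size per port */
#define PACKET_READ_SIZE ((uint16_t)1024)  /* packet read (burst) size on the NF side */
#define ONVM_NUM_RX_THREADS 2              /* number of manager RX threads */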

Query 1: Could you suggest any other approaches to reach full throughput, i.e., 18 Gbps (for a traffic rate of 18 Gbps), when an NF runs on the local node?

  2. Running two NFs (basic_monitor and simple_forward)
    1. Running both NFs on the local node
      1. In order to steer the traffic from each port to a specific NF, we made the following changes (see the sketch after this list):
        • We modified the manager code in this function here by disabling the flow-table lookup and assigning the action and destination service ID based on the port on which the packet arrives.
          Screenshot from 2021-04-24 18-58-26
        • We sent 9 Gbps of traffic from each port (18 Gbps total) and observed 7 Gbps throughput for each NF (14 Gbps total).
    2. Running basic_monitor on the local node and simple_forward on the remote node
      1. We sent 9 Gbps of traffic from each port (18 Gbps total) and observed 4.5 Gbps throughput for each NF.
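
To make the manager changes in the two screenshots concrete, here is a rough sketch of the idea (not the exact code): inside the manager's RX batch processing, skip the flow-table lookup and fill in each packet's meta directly, choosing the destination service ID from the port the packet arrived on. onvm_get_pkt_meta and ONVM_NF_ACTION_TONF come from onvm_common.h; the service IDs 1 and 2 are placeholders for whatever IDs the two NFs were started with.

/* Sketch: per-port steering inside the manager's RX batch loop.
 * Replaces the flow-table lookup; service IDs 1 and 2 are examples. */
for (uint16_t i = 0; i < rx_count; i++) {
        struct onvm_pkt_meta *meta = onvm_get_pkt_meta(pkts[i]);
        meta->action = ONVM_NF_ACTION_TONF;
        /* packets arriving on port 0 go to service ID 1, port 1 to service ID 2 */
        meta->destination = (pkts[i]->port == 0) ? 1 : 2;
        /* ... then enqueue to the destination NF's ring as the original loop already does ... */
}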

Query 2: When we run both NFs on the local node, each NF achieves 7 Gbps. However, when one NF runs on the local node and the other on the remote node, each NF only achieves 4.5 Gbps. It is surprising that the NF running on the local node also drops to 4.5 Gbps instead of the 7 Gbps it achieved when both NFs were local. Could you please suggest any changes to improve throughput?

Thanks

@Balaram6712
Author

To add to this,
I further checked how the RX threads work in main.c and found that only Thread-1 is collecting the packets received at the manager from both ports (0 and 1) and storing them in the rx_mgr buffer corresponding to Thread-1. Thread-0 shows an rx_count of zero every time it polls ports 0 and 1. This might be why basic_monitor on the local node and simple_forward on the remote node end up with the same throughput.

Can you please give any suggestions on how to solve this problem, and on how I can send packets to the rx_mgr buffer of a thread based on the port->id (0 or 1) on which they are received, so that port-0 traffic goes to rx-buffer-0 and port-1 traffic goes to rx-buffer-1, and the corresponding action is then performed on each packet?

Thanks

@JackKuo-tw
Contributor

Hi @Balaram6712 ,

I'm an ONVM user as well, so the following comments may not be fully correct.

  1. Remote node has poor performance

True, the remote node is not directly attached to the NIC, so NUMA distance matters.

This is because DPDK bypasses the kernel and packets arrive directly in the LLC, which in your case is Node-0's LLC.

So when Node-1 wants to process packets, it must fetch them from Node-0, which hurts performance.

This guess can be verified via PCM. Could you please check how the L3 cache hit rate and memory bandwidth change between the 2-local and 1-local-1-remote cases? (I don't have a NUMA server 😭)

  2. Observed 4.5 Gbps throughput for each NF in local & remote

I think the problem is resource contention; please inspect the PCM metrics.

BTW, are you sure the NFs are not overloaded? (0 rx_drop in all your cases?)

  3. How do you measure rx_count? Do you print it or use some other modification?

I tried setting ONVM_NUM_RX_THREADS to 2 and investigating PCM's IPC and cache metrics; one of my RX threads seems not to be working either... (I have no idea why.)

But I don't think that explains both NFs getting 4.5 Gbps.

  4. How to separate an RX thread for each port?

In my opinion, the easiest way is to add another rx_thread_main function, called rx_thread_main_2.

In rx_thread_main_2, restrict rte_eth_rx_burst to port 1 only, and don't forget to modify the original rx_thread_main to process port 0 only. A rough sketch is below.
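
Something like this untested sketch (the types and the onvm_pkt_process_rx_batch call follow the existing rx_thread_main loop in onvm_mgr/main.c; adjust to whatever that function looks like in your version):

/* Untested sketch of a second RX loop that polls port 1, queue 0 only;
 * the original rx_thread_main would be limited to port 0 in the same way. */
static int
rx_thread_main_2(void *arg) {
        struct queue_mgr *rx_mgr = (struct queue_mgr *)arg;
        struct rte_mbuf *pkts[PACKET_READ_SIZE];
        uint16_t rx_count;

        for (;;) {
                rx_count = rte_eth_rx_burst(ports->id[1], 0, pkts, PACKET_READ_SIZE);
                if (likely(rx_count > 0))
                        onvm_pkt_process_rx_batch(rx_mgr, pkts, rx_count);
        }
        return 0;
}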


Have a nice weekend & experiment! 💪

@JackKuo-tw
Contributor

JackKuo-tw commented May 7, 2021

BTW, I found that rx_thread_main has a race condition on rx_count if ONVM_NUM_RX_THREADS is more than 1.

@twood02 is it a bug?

@twood02
Member

twood02 commented May 7, 2021

@Balaram6712 - Thanks for your questions and sorry we are a little slow to respond right now.

Query 1: could you suggest any other approaches to get full throughput i.e 18 Gbps (for traffic rate 18 Gbps) when a NF runs in a local node?

The changes you have made are good. Make sure that you have FLOW_LOOKUP disabled since that can slow down onvm_pkt_process_rx_batch by a lot. Adjusting the buffer sizes can affect things, but it sounds like you have already tried that.

Query 2: When we run two NFs in a local node, we have observed 7 Gbps throughput for each NF, whereas when we run one NF in local and another NF in remote node, we have observed 4.5 Gbps throughput for each. Here, it is surprising that throughput of NF which is running in local node is also getting 4.5 Gbps instead of 7 Gbps which we observed when both NFs were running in local node. Could you please suggest any changes to improve throughput?

In general we have not looked in detail at NUMA issues for ONVM. Ideally you will want to keep packets fully processed on one socket. To do that you will need 2 RX threads (one per socket) and 2 TX threads (one per socket). I don't recall if our manager core assignment logic is smart enough to try to distribute those across sockets. My guess is that right now, the TX thread is becoming the bottleneck as it is getting packets from both the local and remote nodes to send out. You may want to start a second TX thread and check that it is assigned to handle the NFs on the second NUMA socket.

@JackKuo-tw - yes, you are correct that rx_count has a race condition, since different RX threads could be reading from the same port (but different queues on that port). One solution would be to assign each RX thread all queues on one port instead of one queue on all ports. We thought our current approach would lead to better load balancing across RX threads, but you are correct that it can lead to incorrect stats data.

@Balaram6712
Author

@JackKuo-tw Thanks for the reply,

I will get back to you on the PCM metrics after I measure them.

Yes, the performance drops for the remote node; what I am trying to solve is that moving one NF to the remote node also affects the performance of the NF on the local node.

BTW, are you sure the NFs are not overloaded? (0 rx_drop in all your cases?)

Yes, I observed rx_drop = 0 in all cases.

How do you measure rx_count? Do you print it or use some other modification?

I printed the rx_count value by adding code here.

I observed that only Thread-1 receives packets from ports 0 and 1 and stores them in its corresponding buffer, while Thread-0 shows an rx_count of zero every time.

I tried setting ONVM_NUM_RX_THREADS to 2 and investigating PCM's IPC and cache metrics; one of my RX threads seems not to be working either... (I have no idea why.)

I also observed this by running pqos on the system: only one RX thread seems to be active, which I suspect corresponds to Thread-0 not receiving packets.

How to separate an RX thread for each port?
In my opinion, the easiest way is to add another rx_thread_main function, called rx_thread_main_2.
In rx_thread_main_2, restrict rte_eth_rx_burst to port 1 only, and don't forget to modify the original rx_thread_main to process port 0 only.

I added the new function rx_thread_main_2 and limited its rte_eth_rx_burst to port 1 only, and limited rx_thread_main's rte_eth_rx_burst to port 0 only. But what I observed was that only the second NF (simple_forward) receives packets; the first NF (basic_monitor) does not receive any.

@twood02 Thanks for the reply,

My guess is that right now, the TX thread is becoming the bottleneck as it is getting packets from both the local and remote nodes to send out. You may want to start a second TX thread and check that it is assigned to handle the NFs on the second NUMA socket.

I added an extra TX thread as well, assigned one TX thread to each NF, and observed an improvement in throughput:

  1. On the local node, for traffic of 9 Gbps from each port (18 Gbps total), we observed 7.75 Gbps for each NF (15.5 Gbps total).
  2. For the same traffic rate, when the simple_forward NF is moved from the local to the remote node, we observed 6.5 Gbps for each NF (13 Gbps total).

But the problem still exists: the performance of the local NF is affected when the other NF is moved to the remote node. I suspect the issue is still with the multiple RX threads, where only Thread-1 receives packets from both ports. Can you suggest how to solve this so that both threads can receive packets from both ports and I can steer the packets accordingly?

One solution would be to assign each RX thread all queues on one port instead of one queue on all ports.

Can you suggest how we can do this so that an RX thread collects packets only from the port it is assigned to?

Thanks

@twood02
Member

twood02 commented May 8, 2021

I can respond to other parts later, but quickly:

Can you suggest how we can do this so that an RX thread collects packets only from the port it is assigned to?

This should be an easy change in the rx_thread for loop. Currently it iterates through all ports and then bursts from a queue based on the RX thread ID number. Instead you could have a for loop for all queues (possibly you only have 1 queue, especially if you are using virtual NICs), and then change:

rte_eth_rx_burst(ports->id[i], rx_mgr->id, pkts, PACKET_READ_SIZE);

to

rte_eth_rx_burst(rx_mgr->id, i, pkts, PACKET_READ_SIZE);
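
Putting that together, the modified loop would look roughly like this (a sketch only; nb_rx_queues stands for however many RX queues you configured per port, often just 1):

/* Sketch: each RX thread drains every queue of "its" port (port index = rx_mgr->id)
 * instead of one queue on every port. */
for (uint16_t i = 0; i < nb_rx_queues; i++) {
        rx_count = rte_eth_rx_burst(rx_mgr->id, i, pkts, PACKET_READ_SIZE);
        if (likely(rx_count > 0))
                onvm_pkt_process_rx_batch(rx_mgr, pkts, rx_count);
}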

@Balaram6712
Author

possibly you only have 1 queue, especially if you are using virtual NICs
@twood02, is it possible to have 2 queues so that they can be mapped to the corresponding 2 ports when running OpenNetVM?

After the changes you suggested for assigning an RX thread to a port, I observe full throughput on the local node and the remote-node maximum (6.5 Gbps per NF), but there is a problem when running one NF on the local node and the other on the remote node: the local NF's throughput is affected. For example, when I send 9 Gbps to each of the two NFs, placed on the local and remote nodes respectively, the local NF gets 7 Gbps and the remote NF gets 6.5 Gbps, which is unexpected since the local NF should reach full throughput. I presume this is because only 1 queue is in use. I tried to debug by adding print statements for the port_id and packet counts, and observed that rx_count is zero in all cases for port->id = 0. If there were 2 queues to receive the corresponding packets, the issue could be solved. Can you suggest how I can solve this by adding another queue? A rough sketch of what I have in mind is below.
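
Here is the kind of change I mean, in generic DPDK terms (this is not ONVM's actual init code; port_id and pktmbuf_pool are placeholders):

/* Generic DPDK sketch (not ONVM's init code): configure a port with 2 RX queues
 * so that two RX threads can each poll their own queue on that port. */
struct rte_eth_conf port_conf = {0};
const uint16_t nb_rx_queues = 2, nb_tx_queues = 2, nb_rx_desc = 512;
uint16_t q;

rte_eth_dev_configure(port_id, nb_rx_queues, nb_tx_queues, &port_conf);
for (q = 0; q < nb_rx_queues; q++)
        rte_eth_rx_queue_setup(port_id, q, nb_rx_desc,
                               rte_eth_dev_socket_id(port_id), NULL, pktmbuf_pool);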

I know it has been a long time since you looked into this; any help here would be very much appreciated.
Thank you!

@twood02
Member

twood02 commented Aug 23, 2021

I want to be sure I fully understand your setup now. Is this all accurate?

  1. 2 RX threads, one per port.
  2. Each RX thread is on a different NUMA node.
  3. Incoming traffic rate is 9 Gbps per port (18 Gbps total).
  4. This gives 7 Gbps on the local NF and 9 Gbps on the remote NF.

What exactly do you mean by local/remote here? You should have each NF get packets from its "local" RX thread (i.e., the one on the same NUMA node).

@Balaram6712
Author

@twood02

2 RX threads, one per port.

Yes

Each RX thread is on a different NUMA node

The 2 ports are connected to the local NUMA node.

Incoming traffic rate is 9 Gbps per port (18 Gbps total).

Yes

This gives 7 Gbps on the local NF and 9 Gbps on the remote NF.

The local NF gives 7 Gbps and the remote NF gives 6.5 Gbps.

Our setup has 2 NUMA nodes; the OpenNetVM manager runs on the local node, which is connected to the 2 ports. The NF is placed on either the local or the remote NUMA node.

We run OpenNetVM on a two-node (NUMA) server with 24 cores (12 cores per node), 96 GB of memory, and one dual-port 10G NIC. We use DPDK Pktgen to generate traffic; it runs on another server with the same configuration. Both servers run Ubuntu 18.04. Both NIC ports are located on Node-0 (which we call the local node); Node-1 is the remote node. The two servers are connected back-to-back via the dual-port 10G NICs.

Please see the detailed setup explanation quoted above.

Thank you!!

@twood02
Member

twood02 commented Aug 23, 2021

@Balaram6712 Thanks for the clarification.

When you say:

OpenNetVM manager is run on local node connected to 2 ports

I assume you mean that the main ONVM manager and first RX thread are started on the local node, but that there is also a manager RX thread running on the remote node.

Are your NFs sending the packets back out or dropping them after doing some processing? The bottleneck could be the TX thread(s) if you are having the NFs send out the port.

@Balaram6712
Author

@twood02

I assume you mean that the main ONVM manager and first RX thread are started on the local node, but that there is also a manager RX thread running on the remote node.

No, the ONVM manager and both RX threads run on the local node. The 2 RX threads each collect packets from their corresponding port; the RX thread that serves the NF on the remote node collects packets and sends them to that NF. The overhead of moving packets from the local to the remote node explains the lower throughput when an NF is placed on the remote node, but this should not affect the local NF's throughput.

Are your NFs sending the packets back out or dropping them after doing some processing?

I am sending the packets back out the port after the NF processes them.

The bottleneck could be the TX thread(s) if you are having the NFs send out the port.

There are 2 TX threads, one per NF, which handle the packets and send them out through the port. I don't think this is the bottleneck, since each NF has its own TX thread, yet the NF on the local node still gets lower throughput. If the bottleneck were the TX threads sending packets out the port, I would not expect to observe full throughput from both NFs when both are placed on the local node, which I do.

So I don't think the problem is due to the TX threads or to sending packets out the port. I suspect it is the RX thread and queue assignment that causes the lower throughput for the NF on the local node in this scenario.

Please let me know if I have conveyed my problem clearly.

Thank you!!
