Record some baseline benchmarks for XDP, and describe the hardware Toke and Jesper have available.
Parameters:
- Different NICs
- Touching/reading packet data before drop vs. not
- Single RX-queue performance
- Multi RX-queue performance scaling
Q: Will a packet size test make sense?
Since we are already saturating the PCI bus I don’t think this is needed.
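For context, the theoretical 64B packet rate on a 100G link is well above what we can move over PCIe. A quick sketch of the arithmetic, assuming standard Ethernet framing overhead (7B preamble + 1B SFD + 12B inter-frame gap; the 4B FCS is already inside the 64B minimum frame):

```python
# Theoretical line rate for minimum-size frames on a 100 Gbit/s link.
FRAME = 64      # minimum Ethernet frame size, incl. FCS
OVERHEAD = 20   # preamble (7) + SFD (1) + inter-frame gap (12)
link_bps = 100e9

line_rate_pps = link_bps / ((FRAME + OVERHEAD) * 8)
print(f"{line_rate_pps/1e6:.1f} Mpps")  # 148.8 Mpps
```

So the ~88-99 Mpps ceilings seen below are well under 64B line rate, consistent with a PCIe (not wire) bottleneck.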
Q: Should we compare against ‘iptables -t raw -j DROP’ ?
Yes, this makes sense; and against DPDK.
TODO: Describe in the paper how XDP_TX actually achieves bulking, by delaying the tail/doorbell update (until the driver exits its NAPI poll call).
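A toy model (illustrative only, not driver code) of why delaying the doorbell matters: ringing it once per NAPI poll turns up to 64 expensive MMIO writes into a single one.

```python
# Toy model of TX doorbell batching in a NAPI poll cycle.
def napi_poll(packets, budget=64, delayed_doorbell=True):
    """Return the number of doorbell (MMIO) writes for one poll cycle."""
    doorbell_writes = 0
    queued = 0
    for _ in range(min(len(packets), budget)):
        queued += 1                  # post a TX descriptor (XDP_TX)
        if not delayed_doorbell:
            doorbell_writes += 1     # naive: ring the doorbell per packet
    if delayed_doorbell and queued:
        doorbell_writes += 1         # ring once when exiting the NAPI poll
    return doorbell_writes

pkts = list(range(64))
print(napi_poll(pkts, delayed_doorbell=False))  # 64
print(napi_poll(pkts, delayed_doorbell=True))   # 1
```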
Idea: We could measure the overhead XDP introduces by comparing against iptables-raw drop?
Yes, this makes sense: touch the packet in XDP, then pass it on to iptables-raw.
The redirect needs a separate benchmark document.
file://bench05_xdp_redirect.org
DUT (Device Under Test):
- CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Jesper has more types of NICs available.
(Kernel: 4.17.0-rc6-bpf-next-rm-ndo-flush+ #24 SMP PREEMPT)
trex-console command:
start -f stl/udp_for_benchmarks.py -t packet_len=64 --port 0 -m 100%
-Per port stats table
      ports |             0 |             1
 -----------------------------------------
   opackets |  111670205505 |             0
     obytes | 7146893156880 |             0
   ipackets |           184 |           116
     ibytes |         38924 |          7564
    ierrors |             0 |             0
    oerrors |             0 |             0
      Tx Bw |    50.64 Gbps |      0.00 bps

-Global stats enabled
 Cpu Utilization : 99.4 %  17.0 Gb/core
 Platform_factor : 1.0
 Total-Tx        : 50.64 Gbps
 Total-Rx        : 0.00 bps
 Total-PPS       : 98.92 Mpps
 Total-CPS       : 0.00 cps
 Expected-PPS    : 0.00 pps
 Expected-CPS    : 0.00 cps
 Expected-BPS    : 0.00 bps
 Active-flows    : 0   Clients : 0   Socket-util : 0.0000 %
 Open-flows      : 0   Servers : 0   Socket : 0   Socket/Clients : -nan
 Total_queue_full : 658688774
 drop-rate       : 50.64 Gbps
 current time    : 1922.0 sec
 test duration   : 0.0 sec
Max XDP_DROP dropping speed on a single RX queue.
[jbrouer@broadwell kernel-bpf-samples]$ sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_DROP --sec 3
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      3       25,928,270  0
XDP-RX CPU      total   25,928,270

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    3:3   25,928,274  0
rx_queue_index    3:sum 25,928,274
(Kernel: 4.17.0-rc6-bpf-next-rm-ndo-flush+ #24 SMP PREEMPT)
trex-console command:
start -f stl/udp_1pkt_range_clients.py -t packet_len=64 --port 0 -m 100%
Trex performance 88.20 Mpps
-Per port stats table
      ports |             0 |             1
 -----------------------------------------
   opackets |  130357779874 |             0
     obytes | 8342897911936 |             0
   ipackets |           234 |           143
     ibytes |         49960 |          9292
    ierrors |             0 |             0
    oerrors |             0 |             0
      Tx Bw |    45.16 Gbps |      0.00 bps

-Global stats enabled
 Cpu Utilization : 100.0 %  15.1 Gb/core
 Platform_factor : 1.0
 Total-Tx        : 45.16 Gbps
 Total-Rx        : 1.35 Kbps
 Total-PPS       : 88.20 Mpps
 Total-CPS       : 0.00 cps
 Expected-PPS    : 0.00 pps
 Expected-CPS    : 0.00 cps
 Expected-BPS    : 0.00 bps
 Active-flows    : 0   Clients : 0   Socket-util : 0.0000 %
 Open-flows      : 0   Servers : 0   Socket : 0   Socket/Clients : -nan
 Total_queue_full : 1091860676
 drop-rate       : 45.16 Gbps
 current time    : 2248.9 sec
 test duration   : 0.0 sec
XDP_DROP results: total 75,297,461 pps, approx 12 Mpps per RX queue.
[jbrouer@broadwell kernel-bpf-samples]$ sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_DROP --sec 3
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       12,617,796  0
XDP-RX CPU      1       13,106,530  0
XDP-RX CPU      2       12,499,630  0
XDP-RX CPU      3       12,276,195  0
XDP-RX CPU      4       12,528,915  0
XDP-RX CPU      5       12,268,394  0
XDP-RX CPU      total   75,297,461

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   12,617,796  0
rx_queue_index    0:sum 12,617,796
rx_queue_index    1:1   13,106,511  0
rx_queue_index    1:sum 13,106,511
rx_queue_index    2:2   12,499,589  0
rx_queue_index    2:sum 12,499,589
rx_queue_index    3:3   12,276,230  0
rx_queue_index    3:sum 12,276,230
rx_queue_index    4:4   12,528,917  0
rx_queue_index    4:sum 12,528,917
rx_queue_index    5:5   12,268,394  0
rx_queue_index    5:sum 12,268,394
The issue is that the CPUs have approx 40% idle cycles.
Show adapter(s) (ixgbe1 ixgbe2 mlx5p1 i40e1 i40e2) statistics (ONLY that changed!)
Ethtool(mlx5p1) stat:             0 <= outbound_pci_stalled_wr /sec
Ethtool(mlx5p1) stat:    12,413,351 <= rx0_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,413,330 <= rx0_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,936,295 <= rx1_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,936,306 <= rx1_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,331,680 <= rx2_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,331,690 <= rx2_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,089,538 <= rx3_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,089,538 <= rx3_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,359,246 <= rx4_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,359,234 <= rx4_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,065,542 <= rx5_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,065,542 <= rx5_xdp_drop /sec
Ethtool(mlx5p1) stat:    88,048,303 <= rx_64_bytes_phy /sec
Ethtool(mlx5p1) stat: 5,635,163,922 <= rx_bytes_phy /sec
Ethtool(mlx5p1) stat:    74,194,562 <= rx_cache_reuse /sec
Ethtool(mlx5p1) stat:    13,760,980 <= rx_discards_phy /sec
Ethtool(mlx5p1) stat:        93,624 <= rx_out_of_buffer /sec
Ethtool(mlx5p1) stat:    88,049,446 <= rx_packets_phy /sec
Ethtool(mlx5p1) stat: 5,635,015,858 <= rx_prio0_bytes /sec
Ethtool(mlx5p1) stat:    74,287,687 <= rx_prio0_packets /sec
Ethtool(mlx5p1) stat: 4,457,316,248 <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p1) stat:    74,288,600 <= rx_vport_unicast_packets /sec
Ethtool(mlx5p1) stat:    74,194,573 <= rx_xdp_drop /sec

10:26:26 PM  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %idle
10:26:28 PM  all  0.00   0.00  0.08     0.00  1.26  59.61  39.04
10:26:28 PM    0  0.00   0.00  0.00     0.00  1.52  59.60  38.89
10:26:28 PM    1  0.00   0.00  0.00     0.00  1.50  61.50  37.00
10:26:28 PM    2  0.00   0.00  0.00     0.00  1.02  59.90  39.09
10:26:28 PM    3  0.00   0.00  0.00     0.00  1.51  58.29  40.20
10:26:28 PM    4  0.00   0.00  0.00     0.00  1.02  59.90  39.09
10:26:28 PM    5  0.00   0.00  0.00     0.00  1.02  59.18  39.80

10:26:26 PM  CPU     intr/s
10:26:28 PM  all  224677.50
10:26:28 PM    0  246317.00
10:26:28 PM    1  254161.00
10:26:28 PM    2  244789.50
10:26:28 PM    3  241976.00
10:26:28 PM    4  244768.50
10:26:28 PM    5  240606.00
Looking at the NAPI bulking, it is clear that the NAPI poll sometimes completes with less than 64 packets.
[jbrouer@broadwell prototype-kernel-bpf]$ sudo ./napi_monitor NAPI RX bulking (measurement period: 2.000216) bulk[00] 614 ( 0 pps) bulk[01] 775 ( 387 pps) bulk[02] 1361 ( 1,361 pps) bulk[03] 1353 ( 2,029 pps) bulk[04] 1794 ( 3,588 pps) bulk[05] 1965 ( 4,912 pps) bulk[06] 3681 ( 11,042 pps) bulk[07] 2607 ( 9,124 pps) bulk[08] 5051 ( 20,202 pps) bulk[09] 3222 ( 14,497 pps) bulk[10] 3556 ( 17,778 pps) bulk[11] 3586 ( 19,721 pps) bulk[12] 4118 ( 24,705 pps) bulk[13] 4024 ( 26,153 pps) bulk[14] 8025 ( 56,169 pps) bulk[15] 4744 ( 35,576 pps) bulk[16] 6937 ( 55,490 pps) bulk[17] 5301 ( 45,054 pps) bulk[18] 5841 ( 52,563 pps) bulk[19] 5457 ( 51,836 pps) bulk[20] 9812 ( 98,109 pps) bulk[21] 5502 ( 57,765 pps) bulk[22] 11503 ( 126,519 pps) bulk[23] 5710 ( 65,658 pps) bulk[24] 6488 ( 77,848 pps) bulk[25] 5735 ( 71,680 pps) bulk[26] 6745 ( 87,676 pps) bulk[27] 5805 ( 78,359 pps) bulk[28] 13623 ( 190,701 pps) bulk[29] 6440 ( 93,370 pps) bulk[30] 11199 ( 167,967 pps) bulk[31] 6804 ( 105,451 pps) bulk[32] 7566 ( 121,043 pps) bulk[33] 7002 ( 115,521 pps) bulk[34] 11034 ( 187,558 pps) bulk[35] 7053 ( 123,414 pps) bulk[36] 13220 ( 237,934 pps) bulk[37] 7036 ( 130,152 pps) bulk[38] 7932 ( 150,692 pps) bulk[39] 7220 ( 140,775 pps) bulk[40] 8610 ( 172,181 pps) bulk[41] 7374 ( 151,151 pps) bulk[42] 17750 ( 372,710 pps) bulk[43] 7703 ( 165,597 pps) bulk[44] 15153 ( 333,330 pps) bulk[45] 7931 ( 178,428 pps) bulk[46] 8739 ( 200,975 pps) bulk[47] 7923 ( 186,170 pps) bulk[48] 10461 ( 251,037 pps) bulk[49] 7989 ( 195,709 pps) bulk[50] 13136 ( 328,365 pps) bulk[51] 7983 ( 203,545 pps) bulk[52] 8710 ( 226,436 pps) bulk[53] 7980 ( 211,447 pps) bulk[54] 9153 ( 247,104 pps) bulk[55] 7931 ( 218,079 pps) bulk[56] 18446 ( 516,432 pps) bulk[57] 7919 ( 225,667 pps) bulk[58] 16643 ( 482,595 pps) bulk[59] 7759 ( 228,866 pps) bulk[60] 8778 ( 263,312 pps) bulk[61] 7735 ( 235,892 pps) bulk[62] 9413 ( 291,772 pps) bulk[63] 7707 ( 242,744 pps) bulk[64] 2077468 ( 66,471,811 pps) NAPI-from-idle, 2529350 average 
bulk 59.00 ( 74,768,110 pps) bulk0=600 NAPI-ksoftirqd, 24485 average bulk 58.00 ( 713,623 pps) bulk0=14 System global SOFTIRQ stats: SOFTIRQ_NET_RX/sec enter:1276773/s exit:1276773/s raise:1276770/s SOFTIRQ_NET_TX/sec enter:0/s exit:0/s raise:0/s SOFTIRQ_TIMER/sec enter:3856/s exit:3856/s raise:3795/s
I captured an ethtool stats snapshot showing a small but unusual counter called “outbound_pci_stalled_wr”. The PHY counters show the maximum rate the generator is outputting.
Show adapter(s) (mlx5p1) statistics (ONLY that changed!)
Ethtool(mlx5p1) stat:             4 <= outbound_pci_stalled_wr /sec
Ethtool(mlx5p1) stat:    12,602,448 <= rx0_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,602,431 <= rx0_xdp_drop /sec
Ethtool(mlx5p1) stat:    13,091,898 <= rx1_cache_reuse /sec
Ethtool(mlx5p1) stat:    13,091,898 <= rx1_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,485,274 <= rx2_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,485,388 <= rx2_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,267,209 <= rx3_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,267,201 <= rx3_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,506,807 <= rx4_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,507,044 <= rx4_xdp_drop /sec
Ethtool(mlx5p1) stat:    12,252,285 <= rx5_cache_reuse /sec
Ethtool(mlx5p1) stat:    12,252,221 <= rx5_xdp_drop /sec
Ethtool(mlx5p1) stat:    88,295,187 <= rx_64_bytes_phy /sec
Ethtool(mlx5p1) stat: 5,650,856,237 <= rx_bytes_phy /sec
Ethtool(mlx5p1) stat:    75,214,073 <= rx_cache_reuse /sec
Ethtool(mlx5p1) stat:    13,066,088 <= rx_discards_phy /sec
Ethtool(mlx5p1) stat:        10,106 <= rx_out_of_buffer /sec
Ethtool(mlx5p1) stat:    88,294,650 <= rx_packets_phy /sec
Ethtool(mlx5p1) stat: 5,650,511,306 <= rx_prio0_bytes /sec
Ethtool(mlx5p1) stat:    75,224,002 <= rx_prio0_packets /sec
Ethtool(mlx5p1) stat: 4,513,478,678 <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p1) stat:    75,224,475 <= rx_vport_unicast_packets /sec
Ethtool(mlx5p1) stat:    75,214,086 <= rx_xdp_drop /sec
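The snapshot's counters are roughly self-consistent: packets seen on the PHY should equal packets dropped by XDP plus packets discarded by the NIC. A quick cross-check with the numbers above (the small residual is expected, since the counters are not sampled atomically):

```python
# Cross-check ethtool counters: PHY ingress vs. XDP drops + NIC discards.
rx_packets_phy   = 88_294_650
rx_xdp_drop      = 75_214_086
rx_discards_phy  = 13_066_088
rx_out_of_buffer = 10_106

accounted = rx_xdp_drop + rx_discards_phy + rx_out_of_buffer
unaccounted = rx_packets_phy - accounted
print(unaccounted)  # small residual (<0.1% of PHY packets)
```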
(Kernel: 4.17.0-rc6-bpf-next-rm-ndo-flush+ #24 SMP PREEMPT)
Redirect: ingress mlx5p1 redirect egress i40e1: 30,493,921 pps
$ sudo ./xdp_redirect_map $(</sys/class/net/mlx5p1/ifindex) $(</sys/class/net/i40e1/ifindex)
input: 8 output: 4
map[0] (vports) = 4, map[1] (map) = 5, map[2] (count) = 0
ifindex 4: 40192251 pkt/s
ifindex 4: 30493614 pkt/s
ifindex 4: 30493921 pkt/s
ifindex 4: 30490341 pkt/s
ifindex 4: 30495391 pkt/s
ifindex 4: 30498160 pkt/s
XDP-event       CPU:to  pps         drop-pps  extra-info
XDP_REDIRECT    total   0           0         Error
cpumap-kthread  total   0           0         0
devmap-xmit     0       4,927,675   0         16.00 bulk-average
devmap-xmit     1       4,986,185   0         16.00 bulk-average
devmap-xmit     2       5,044,664   0         16.00 bulk-average
devmap-xmit     3       4,994,976   0         16.00 bulk-average
devmap-xmit     4       4,983,994   0         16.00 bulk-average
devmap-xmit     5       5,014,333   0         16.00 bulk-average
devmap-xmit     total   29,951,825  0         16.00 bulk-average
Forgot this was with rx_cqe_compress=on
ethtool --set-priv-flags mlx5p1 rx_cqe_compress on
Show adapter(s) (mlx5p1) statistics (ONLY that changed!) Ethtool(mlx5p1 ) stat: 5073293 ( 5,073,293) <= rx0_cache_empty /sec Ethtool(mlx5p1 ) stat: 219224 ( 219,224) <= rx0_cqe_compress_blks /sec Ethtool(mlx5p1 ) stat: 1329666 ( 1,329,666) <= rx0_cqe_compress_pkts /sec Ethtool(mlx5p1 ) stat: 5051573 ( 5,051,573) <= rx1_cache_empty /sec Ethtool(mlx5p1 ) stat: 222439 ( 222,439) <= rx1_cqe_compress_blks /sec Ethtool(mlx5p1 ) stat: 1380038 ( 1,380,038) <= rx1_cqe_compress_pkts /sec Ethtool(mlx5p1 ) stat: 5067505 ( 5,067,505) <= rx2_cache_empty /sec Ethtool(mlx5p1 ) stat: 220519 ( 220,519) <= rx2_cqe_compress_blks /sec Ethtool(mlx5p1 ) stat: 1315711 ( 1,315,711) <= rx2_cqe_compress_pkts /sec Ethtool(mlx5p1 ) stat: 5043176 ( 5,043,176) <= rx3_cache_empty /sec Ethtool(mlx5p1 ) stat: 223895 ( 223,895) <= rx3_cqe_compress_blks /sec Ethtool(mlx5p1 ) stat: 1349297 ( 1,349,297) <= rx3_cqe_compress_pkts /sec Ethtool(mlx5p1 ) stat: 5032563 ( 5,032,563) <= rx4_cache_empty /sec Ethtool(mlx5p1 ) stat: 222138 ( 222,138) <= rx4_cqe_compress_blks /sec Ethtool(mlx5p1 ) stat: 1301549 ( 1,301,549) <= rx4_cqe_compress_pkts /sec Ethtool(mlx5p1 ) stat: 5093823 ( 5,093,823) <= rx5_cache_empty /sec Ethtool(mlx5p1 ) stat: 214919 ( 214,919) <= rx5_cqe_compress_blks /sec Ethtool(mlx5p1 ) stat: 1273362 ( 1,273,362) <= rx5_cqe_compress_pkts /sec Ethtool(mlx5p1 ) stat: 88243413 ( 88,243,413) <= rx_64_bytes_phy /sec Ethtool(mlx5p1 ) stat: 5647560737 ( 5,647,560,737) <= rx_bytes_phy /sec Ethtool(mlx5p1 ) stat: 30362207 ( 30,362,207) <= rx_cache_empty /sec Ethtool(mlx5p1 ) stat: 1323158 ( 1,323,158) <= rx_cqe_compress_blks /sec Ethtool(mlx5p1 ) stat: 7949743 ( 7,949,743) <= rx_cqe_compress_pkts /sec Ethtool(mlx5p1 ) stat: 14635008 ( 14,635,008) <= rx_discards_phy /sec Ethtool(mlx5p1 ) stat: 43246222 ( 43,246,222) <= rx_out_of_buffer /sec Ethtool(mlx5p1 ) stat: 88243138 ( 88,243,138) <= rx_packets_phy /sec Ethtool(mlx5p1 ) stat: 5647524379 ( 5,647,524,379) <= rx_prio0_bytes /sec Ethtool(mlx5p1 ) stat: 
73608194 ( 73,608,194) <= rx_prio0_packets /sec Ethtool(mlx5p1 ) stat: 4416504322 ( 4,416,504,322) <= rx_vport_unicast_bytes /sec Ethtool(mlx5p1 ) stat: 73608402 ( 73,608,402) <= rx_vport_unicast_packets /sec
Disabling rx_cqe_compress didn’t change performance, but stats changed:
Show adapter(s) (mlx5p1) statistics (ONLY that changed!) Ethtool(mlx5p1 ) stat: 5133804 ( 5,133,804) <= rx0_cache_empty /sec Ethtool(mlx5p1 ) stat: 5119036 ( 5,119,036) <= rx1_cache_empty /sec Ethtool(mlx5p1 ) stat: 5110855 ( 5,110,855) <= rx2_cache_empty /sec Ethtool(mlx5p1 ) stat: 5168146 ( 5,168,146) <= rx3_cache_empty /sec Ethtool(mlx5p1 ) stat: 5111374 ( 5,111,374) <= rx4_cache_empty /sec Ethtool(mlx5p1 ) stat: 5137363 ( 5,137,363) <= rx5_cache_empty /sec Ethtool(mlx5p1 ) stat: 88164618 ( 88,164,618) <= rx_64_bytes_phy /sec Ethtool(mlx5p1 ) stat: 5642515995 ( 5,642,515,995) <= rx_bytes_phy /sec Ethtool(mlx5p1 ) stat: 30780489 ( 30,780,489) <= rx_cache_empty /sec Ethtool(mlx5p1 ) stat: 13169863 ( 13,169,863) <= rx_discards_phy /sec Ethtool(mlx5p1 ) stat: 44213842 ( 44,213,842) <= rx_out_of_buffer /sec Ethtool(mlx5p1 ) stat: 88164306 ( 88,164,306) <= rx_packets_phy /sec Ethtool(mlx5p1 ) stat: 5642429004 ( 5,642,429,004) <= rx_prio0_bytes /sec Ethtool(mlx5p1 ) stat: 74992720 ( 74,992,720) <= rx_prio0_packets /sec Ethtool(mlx5p1 ) stat: 4499667608 ( 4,499,667,608) <= rx_vport_unicast_bytes /sec Ethtool(mlx5p1 ) stat: 74994463 ( 74,994,463) <= rx_vport_unicast_packets /sec
- State “DONE” from “TODO” [2018-06-09 Sat 20:59]
Maybe it is better to run the traffic generator with 6 flows from the beginning, and just vary the flow rules on RX? That way we wouldn’t get the weird dips at a higher number of cores.
RXQs | XDP_DROP | XDP_REDIRECT | REDIR PREEMPT voluntary |
---|---|---|---|
1 | 25928270 | 7909103 | 8649872 |
2 | 14964733 | 15975491 | |
3 | 17586052 | 19222735 | |
4 | 20167875 | 21535588 | |
5 | 23863927 | 25464083 | |
6 | 75297461 | 28376755 | 29828924 |
Testing with the newer xdp_rxq_info tool that has an option for reading packet data.
sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_DROP --no-sep --read
Generator command:
start -f /home/jbrouer/git/xdp-paper/benchmarks/udp_for_benchmarks02.py -t packet_len=64,stream_count=RXQs --port 0 -m 100mpps
RXQs (XDP_DROP) | no_touch RX=1024 | no_touch RX=512 | read RX=1024 | read RX=512 |
---|---|---|---|---|
1 | 24379188 | 24804275 | 23095606 | 23062789 |
2 | 49805895 | 50232370 | 46552903 | 46537526 |
3 | 73230349 | 74350900 | 64474005 | 68859775 |
4 | 86624323 | 86198361 | 68250791 | 86168278 |
5 | 86830822 | 86973055 | 49905645 | 87341248 |
6 | 86608045 | 87101116 | 56323684 | 87585333 |
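Scaling is essentially linear up to 3 RX queues, then flattens out at the ~87 Mpps ceiling. Derived from the no_touch RX=512 column above:

```python
# Scaling efficiency: how close N RX queues get to N x the single-queue rate.
pps = {1: 24_804_275, 2: 50_232_370, 3: 74_350_900,
       4: 86_198_361, 5: 86_973_055, 6: 87_101_116}
base = pps[1]
eff = {n: 100 * p / (n * base) for n, p in pps.items()}
for n in sorted(eff):
    print(f"RXQs={n}: {eff[n]:.1f}% of linear scaling")
```

The drop from ~100% at 3 queues to ~59% at 6 queues, while total pps stays flat, is the signature of a shared bottleneck (PCIe) rather than per-core exhaustion.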
Update (<2018-06-19 Tue>): Jesper found that the maximum scaled drop rate can be improved by enabling the mlx5 priv-flags rx_cqe_compress=on (and rx_striding_rq=off). This confirms the PCIe bottleneck, as rx_cqe_compress reduces the number of PCIe transactions by compressing the RX descriptors.
One issue is that with rx_cqe_compress=on, the per-core performance is slightly lower, as it requires more CPU cycles to “decompress” the descriptors.
RXQs (XDP_DROP) | no_touch RX=1024 | no_touch RX=512 | read RX=1024 | read RX=512 |
---|---|---|---|---|
1 | 23902641 | 23863471 | 22653821 | |
2 | 45463076 | 45514709 | 44345271 | |
3 | 65800412 | 67796536 | cache-misses | |
4 | 84563313 | 88821307 | starts… | |
5 | 99105872 | 99357978 | 98758694 | |
6 | 108118978 | 108607056 | 105077478 |
Observations: Even with these extremely high numbers we are still seeing idle CPU cycles.
Enable rx_cqe_compress=on cmdline:
ethtool --set-priv-flags mlx5p1 rx_cqe_compress on
$ ethtool --show-priv-flags mlx5p1
Private flags for mlx5p1:
rx_cqe_moder   : on
tx_cqe_moder   : off
rx_cqe_compress: on
rx_striding_rq : off
Issues: With the rx_cqe_compress=on setting, I’m seeing errors in the kernel dmesg, and individual RX-queues stop working. (Stopping and starting the XDP application enables the queues again.)
(dmesg errors)
mlx5_core 0000:03:00.0: mlx5_eq_int:540:(pid 0): CQ error on CQN 0x41d, syndrome 0x1
mlx5_core 0000:03:00.0 mlx5p1: mlx5e_cq_error_event: cqn=0x00041d event=0x04
mlx5_core 0000:03:00.0: mlx5_cmd_check:714:(pid 28036): MODIFY_CQ(0x403) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x2f1396)
The default mlx5 NIC driver RX-ring size of 1024 frames turned out to be a performance problem. When scaling to multiple RX-queues, the NIC driver has more outstanding memory, which the DDIO mechanism tries to place in L3 cache. With RX-ring size 1024 we observed cache-misses (to main memory) on some RX-queues. This can only mean that DDIO somehow didn’t manage to keep all of the outstanding RX frames in L3 cache.
The problem was identified as the RX-ring size, which is adjustable via ethtool:
ethtool -G mlx5p1 rx 512 tx 512
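A rough footprint estimate supports this. The numbers here are assumptions for illustration: 2 KiB per RX buffer is a guess for this mlx5 configuration, the E5-1650 v4 has a 15 MiB L3, and DDIO is typically limited to roughly 10% of the L3 ways.

```python
# Rough estimate of outstanding RX buffer memory vs. the DDIO L3 budget.
BUF_SIZE = 2048            # bytes per RX buffer (assumption)
L3 = 15 * 2**20            # 15 MiB L3 cache on the E5-1650 v4
DDIO_FRACTION = 0.10       # DDIO commonly limited to ~2 of 20 L3 ways
QUEUES = 6

for ring in (1024, 512):
    footprint = QUEUES * ring * BUF_SIZE
    print(f"ring={ring}: {footprint/2**20:.1f} MiB outstanding "
          f"(DDIO budget ~{L3*DDIO_FRACTION/2**20:.1f} MiB)")
```

Halving the ring halves the outstanding footprint, which plausibly keeps the *active* part of each ring inside the DDIO-reachable portion of L3.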
Data for the RXQ=4 imbalance: two CPUs process 21 Mpps each, and these CPUs still have 2.75% idle cycles:
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP options:read
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       21620425    0
XDP-RX CPU      1       21603145    0
XDP-RX CPU      2       12676047    0
XDP-RX CPU      3       12324059    0
XDP-RX CPU      total   68223677

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   21620423    0
rx_queue_index    0:sum 21620423
rx_queue_index    1:1   21603146    0
rx_queue_index    1:sum 21603146
rx_queue_index    2:2   12676050    0
rx_queue_index    2:sum 12676050
rx_queue_index    3:3   12324050    0
rx_queue_index    3:sum 12324050

05:01:06 PM  CPU  %usr  %sys  %iowait  %irq   %soft   %idle
05:01:08 PM  all  0.00  0.17     0.00  0.17   64.89   34.77
05:01:08 PM    0  0.00  0.55     0.00  0.00   96.70    2.75
05:01:08 PM    1  0.00  0.54     0.00  0.54   96.22    2.70
05:01:08 PM    2  0.00  0.00     0.00  0.00  100.00    0.00
05:01:08 PM    3  0.00  0.00     0.00  0.00  100.00    0.00
05:01:08 PM    4  0.00  0.50     0.00  0.00    0.50   99.00
05:01:08 PM    5  0.00  0.51     0.00  0.00    0.00   99.49
In RXQ=4 case, the CPUs are experiencing different levels of cache-misses.
$ sudo ~/perf stat -C3 -e cycles -e instructions -e cache-references -e cache-misses -r 3 sleep 1
 Performance counter stats for 'CPU(s) 3' (3 runs):
     3,804,377,863  cycles            ( +- 0.01% )
     5,865,630,935  instructions      # 1.54 insn per cycle   ( +- 0.03% )
        43,829,681  cache-references  ( +- 0.04% )
         9,360,529  cache-misses      # 21.357 % of all cache refs  ( +- 0.03% )

$ sudo ~/perf stat -C0 -e cycles -e instructions -e cache-references -e cache-misses -r 3 sleep 1
 Performance counter stats for 'CPU(s) 0' (3 runs):
     3,728,030,288  cycles            ( +- 0.01% )
    10,383,860,909  instructions      # 2.79 insn per cycle   ( +- 0.03% )
        85,613,852  cache-references  ( +- 0.11% )
           358,027  cache-misses      # 0.418 % of all cache refs  ( +- 1.94% )
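Recomputing the derived perf metrics for the slow CPU (C3) from the raw counters confirms what perf prints: the low IPC goes together with the high cache-miss ratio.

```python
# Re-derive the perf metrics for CPU 3 from the raw counter values above.
cycles, instructions = 3_804_377_863, 5_865_630_935
cache_refs, cache_misses = 43_829_681, 9_360_529

ipc = instructions / cycles
miss_pct = 100 * cache_misses / cache_refs
print(f"{ipc:.2f} insn per cycle, {miss_pct:.3f}% cache misses")
```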
Data for RXQ=3
Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP options:read
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       22088751    0
XDP-RX CPU      1       21279391    0
XDP-RX CPU      2       21189691    0
XDP-RX CPU      total   64557835

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   22088753    0
rx_queue_index    0:sum 22088753
rx_queue_index    1:1   21279391    0
rx_queue_index    1:sum 21279391
rx_queue_index    2:2   21189695    0
rx_queue_index    2:sum 21189695

05:03:04 PM  CPU  %usr  %sys  %iowait  %irq   %soft   %idle
05:03:06 PM  all  0.00  0.08     0.00  0.00   50.04   49.87
05:03:06 PM    0  0.00  0.00     0.00  0.00  100.00    0.00
05:03:06 PM    1  0.00  0.00     0.00  0.00  100.00    0.00
05:03:06 PM    2  0.00  0.00     0.00  0.00  100.00    0.00
05:03:06 PM    3  0.50  0.50     0.00  0.50    0.00   98.51
05:03:06 PM    4  0.00  0.00     0.00  0.00    0.00  100.00
05:03:06 PM    5  0.00  0.00     0.00  0.00    0.00  100.00
These are from Toke’s test run. Note that REDIRECT throughput drops by 5 Mpps (on a single core) when running xdp_monitor at the same time!
RXQs | XDP_DROP | XDP_REDIRECT |
---|---|---|
1 | 25928270 | 8461375 |
2 | 51349744 | 16241020 |
3 | 76578241 | 18639798 |
4 | 82782450 | 21417122 |
5 | 82294143 | 25373567 |
6 | 80444303 | 29970889 |
d = np.array(data)
plt.plot(d[:,0], d[:,1]/10**6, marker='o', label="XDP_DROP")
plt.plot(d[:,0], d[:,2]/10**6, marker='o', label="XDP_REDIRECT")
plt.xlabel("Number of cores")
plt.ylabel("Mpps")
plt.legend()
plt.show()
For this test, we set the packet generator to a fixed pps and report the CPU usage using mpstat. For a single core, we step the offered load up to the 26 Mpps maximum performance. For DPDK the CPU usage is always 100%, by design.
The interface is running the ‘xdp_rxq_info’ sample program with XDP_DROP as action.
3 samples of 30sec intervals:
mpstat -P ALL 30 3
We report the average %idle and plot the inverse. For XDP, we confirm from the output of the xdp1 program that all offered packets are handled. For iptables-raw, we look at the ethtool stats.
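The arithmetic behind the plot: mpstat reports %idle, and the figure shows CPU usage = 100 - %idle. Using the 1 Mpps row of the table that follows as an example:

```python
# Convert mpstat %idle into the plotted CPU usage.
# Example values: at 1 Mpps offered load, XDP 80.4% idle, Linux 57.4% idle.
xdp_idle, linux_idle = 80.4, 57.4
xdp_usage = round(100 - xdp_idle, 1)
linux_usage = round(100 - linux_idle, 1)
print(xdp_usage, linux_usage)  # 19.6 42.6
```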
Mpps | XDP | Linux |
---|---|---|
0 | 100 | 100 |
0.25 | 93.0 | 79.0 |
0.5 | 88.3 | 69.8 |
1 | 80.4 | 57.4 |
2 | 71.1 | 40.4 |
3 | 64.8 | 29.2 |
5 | 52.7 | 0 |
10 | 37.6 | 0 |
15 | 25.3 | 0 |
20 | 8.2 | 0 |
24.8 | 0 | 0 |
25 | 0 | 0 |
30 | 0 | 0 |
35 | 0 | 0 |
40 | 0 | 0 |
43 | 0 | 0 |
Test invocation for RX test:
sudo ./testpmd -l 0-5 -- -i --nb-cores=1 --forward-mode=rxonly --auto-start --portmask=0x2
To get multiple cores working, we need to enable multiple rxqs and txqs, and also enable UDP RSS:
sudo ./testpmd -l 0-5 -n 4 -- -i --nb-cores=2 --forward-mode=rxonly --auto-start --portmask=0x2 --rxq 2 --txq 2 --rss-udp
Or instead of interactive mode, use stats reporting mode:
sudo ./testpmd -l 0-5 -n 4 -- --nb-cores=5 --forward-mode=rxonly --auto-start --portmask=0x2 --rxq 5 --txq 5 --rss-udp --stats-period=1
Cores | RX PPS | Forward PPS |
---|---|---|
1 | 43527279 | 23914513 |
2 | 70499318 | 35337558 |
3 | 82695730 | 56526568 |
4 | 82937531 | 58197505 |
5 | 80575187 | 62998140 |
From 3 to 5 cores this appears to be bounded by the traffic generator.
Trying faster generator setup:
Using two packet generators: T-rex sending 99 Mpps and kernel pktgen (pktgen_sample05_flow_per_thread.sh) sending approx 45 Mpps. Ethtool stats on the 100G switch show 144,371,992 pps TX towards the DUT:
Show adapter(s) (sw1p5 sw1p9 sw1p13) statistics (ONLY that changed!) Ethtool(sw1p5 ) stat: 46132421 ( 46,132,421) <= a_frames_received_ok /sec Ethtool(sw1p5 ) stat: 98833878 ( 98,833,878) <= a_frames_transmitted_ok /sec Ethtool(sw1p5 ) stat: 667329 ( 667,329) <= a_mac_control_frames_received /sec Ethtool(sw1p5 ) stat: 2952475678 ( 2,952,475,678) <= a_octets_received_ok /sec Ethtool(sw1p5 ) stat: 6325368523 ( 6,325,368,523) <= a_octets_transmitted_ok /sec Ethtool(sw1p5 ) stat: 667329 ( 667,329) <= a_pause_mac_ctrl_frames_received /sec Ethtool(sw1p5 ) stat: 45465150 ( 45,465,150) <= rx_frames_prio_0 /sec Ethtool(sw1p5 ) stat: 2952478555 ( 2,952,478,555) <= rx_octets_prio_0 /sec Ethtool(sw1p5 ) stat: -122 ( -122) <= tc_transmit_queue_tc_0 /sec Ethtool(sw1p5 ) stat: 98834011 ( 98,834,011) <= tx_frames_prio_0 /sec Ethtool(sw1p5 ) stat: 6325376817 ( 6,325,376,817) <= tx_octets_prio_0 /sec Ethtool(sw1p9 ) stat: 144371988 ( 144,371,988) <= a_frames_transmitted_ok /sec Ethtool(sw1p9 ) stat: 9239808140 ( 9,239,808,140) <= a_octets_transmitted_ok /sec Ethtool(sw1p9 ) stat: -276 ( -276) <= tc_transmit_queue_tc_0 /sec Ethtool(sw1p9 ) stat: 144371992 ( 144,371,992) <= tx_frames_prio_0 /sec Ethtool(sw1p9 ) stat: 9239807395 ( 9,239,807,395) <= tx_octets_prio_0 /sec Ethtool(sw1p13 ) stat: 98855049 ( 98,855,049) <= a_frames_received_ok /sec Ethtool(sw1p13 ) stat: 45474257 ( 45,474,257) <= a_frames_transmitted_ok /sec Ethtool(sw1p13 ) stat: 6326723716 ( 6,326,723,716) <= a_octets_received_ok /sec Ethtool(sw1p13 ) stat: 2910352561 ( 2,910,352,561) <= a_octets_transmitted_ok /sec Ethtool(sw1p13 ) stat: 98855091 ( 98,855,091) <= rx_frames_prio_0 /sec Ethtool(sw1p13 ) stat: 6326726369 ( 6,326,726,369) <= rx_octets_prio_0 /sec Ethtool(sw1p13 ) stat: -61 ( -61) <= tc_transmit_queue_tc_0 /sec Ethtool(sw1p13 ) stat: 45474290 ( 45,474,290) <= tx_frames_prio_0 /sec Ethtool(sw1p13 ) stat: 2910354419 ( 2,910,354,419) <= tx_octets_prio_0 /sec
Note: DUT was running kernel 4.16.13-200.fc27.x86_64 during these DPDK tests to please the MLNX_OFED software.
Testpmd DPDK-“drop” command used, varying the CORES variable:
export CORES=3 ; sudo build/app/testpmd -l 0-5 -n 4 -- --nb-cores=$CORES --forward-mode=rxonly --auto-start --portmask=0x1 --rxq $CORES --txq $CORES --rss-udp --stats-period=2
Testpmd DPDK-“forward” command used, varying the CORES variable:
export CORES=1 ; sudo build/app/testpmd -l 0-5 -n 4 -- --nb-cores=$CORES --forward-mode=mac --auto-start --portmask=0x3 --rxq $CORES --txq $CORES --rss-udp --stats-period=2
Used the testpmd forward-mode “mac”; I don’t know whether this is the correct mode(?).
Cores | RX PPS | DPDK Forward PPS | DPDK-rxonly-drop run#2 |
---|---|---|---|
1 | 43617057 | 21893094 | 43503636 |
2 | 75034064 | 26549320 | 74380044 |
3 | 97535287 | 38194032 | 97203856 |
4 | 115806957 | 40641805 | 113876503 |
5 | 100600263 | 51173067 | 115453781 |
run#2 was with our net-next-xdp kernel, and the generators were sending 138,726,434 pps as measured at the switch.
Ethtool stats (which work due to the Mellanox bifurcated driver) for dpdk_test2 with 5 cores:
Show adapter(s) (mlx5p1) statistics (ONLY that changed!) Ethtool(mlx5p1 ) stat: 144075528 ( 144,075,528) <= rx_64_bytes_phy /sec Ethtool(mlx5p1 ) stat: 9220824984 ( 9,220,824,984) <= rx_bytes_phy /sec Ethtool(mlx5p1 ) stat: 42595068 ( 42,595,068) <= rx_discards_phy /sec Ethtool(mlx5p1 ) stat: 144075393 ( 144,075,393) <= rx_packets_phy /sec Ethtool(mlx5p1 ) stat: 9220821849 ( 9,220,821,849) <= rx_prio0_bytes /sec Ethtool(mlx5p1 ) stat: 101480704 ( 101,480,704) <= rx_prio0_packets /sec Ethtool(mlx5p1 ) stat: 6088740630 ( 6,088,740,630) <= rx_vport_unicast_bytes /sec Ethtool(mlx5p1 ) stat: 101479024 ( 101,479,024) <= rx_vport_unicast_packets /sec
dpdk_test2 with 4 cores Show adapter(s) (mlx5p1) statistics (ONLY that changed!) Ethtool(mlx5p1 ) stat: 144211662 ( 144,211,662) <= rx_64_bytes_phy /sec Ethtool(mlx5p1 ) stat: 9229552001 ( 9,229,552,001) <= rx_bytes_phy /sec Ethtool(mlx5p1 ) stat: 28429333 ( 28,429,333) <= rx_discards_phy /sec Ethtool(mlx5p1 ) stat: 144211748 ( 144,211,748) <= rx_packets_phy /sec Ethtool(mlx5p1 ) stat: 9229514359 ( 9,229,514,359) <= rx_prio0_bytes /sec Ethtool(mlx5p1 ) stat: 115783855 ( 115,783,855) <= rx_prio0_packets /sec Ethtool(mlx5p1 ) stat: 6946966145 ( 6,946,966,145) <= rx_vport_unicast_bytes /sec Ethtool(mlx5p1 ) stat: 115782780 ( 115,782,780) <= rx_vport_unicast_packets /sec
d = np.array(data)
plt.plot(d[:,0], d[:,1]/10**6, marker='o', label="rxonly")
#plt.plot(d[:,0], d[:,2]/10**6, marker='o', label="XDP_REDIRECT")
plt.xlabel("Number of cores")
plt.ylabel("Mpps")
plt.legend()
plt.show()
There’s no good way to do any kind of bypass, so we just run this with normal Linux forwarding. Throughput is measured by ethtool on the TX interface, e.g., for one core:
Show adapter(s) (ens3f1) statistics (ONLY that changed!) Ethtool(ens3f1 ) stat: 27183 ( 27,183) <= ch0_arm /sec Ethtool(ens3f1 ) stat: 27183 ( 27,183) <= ch0_events /sec Ethtool(ens3f1 ) stat: 27183 ( 27,183) <= ch0_poll /sec Ethtool(ens3f1 ) stat: 27183 ( 27,183) <= ch_arm /sec Ethtool(ens3f1 ) stat: 27182 ( 27,182) <= ch_events /sec Ethtool(ens3f1 ) stat: 27183 ( 27,183) <= ch_poll /sec Ethtool(ens3f1 ) stat: 104380023 ( 104,380,023) <= tx0_bytes /sec Ethtool(ens3f1 ) stat: 1739714 ( 1,739,714) <= tx0_cqes /sec Ethtool(ens3f1 ) stat: 1739667 ( 1,739,667) <= tx0_csum_none /sec Ethtool(ens3f1 ) stat: 1739667 ( 1,739,667) <= tx0_packets /sec Ethtool(ens3f1 ) stat: 104380319 ( 104,380,319) <= tx_bytes /sec Ethtool(ens3f1 ) stat: 111340650 ( 111,340,650) <= tx_bytes_phy /sec Ethtool(ens3f1 ) stat: 1739714 ( 1,739,714) <= tx_cqes /sec Ethtool(ens3f1 ) stat: 1739672 ( 1,739,672) <= tx_csum_none /sec Ethtool(ens3f1 ) stat: 1739672 ( 1,739,672) <= tx_packets /sec Ethtool(ens3f1 ) stat: 1739707 ( 1,739,707) <= tx_packets_phy /sec Ethtool(ens3f1 ) stat: 111338817 ( 111,338,817) <= tx_prio0_bytes /sec Ethtool(ens3f1 ) stat: 1739669 ( 1,739,669) <= tx_prio0_packets /sec Ethtool(ens3f1 ) stat: 104381267 ( 104,381,267) <= tx_vport_unicast_bytes /sec Ethtool(ens3f1 ) stat: 1739687 ( 1,739,687) <= tx_vport_unicast_packets /sec
Cores | PPS |
---|---|
1 | 1739672 |
2 | 3370584 |
3 | 4976559 |
4 | 6488625 |
5 | 7848970 |
6 | 9285971 |
What is the performance of iptables dropping SKBs in the raw table?
Unloaded all iptables modules, then invoked the iptables -t raw command line so that only the needed iptables kernel modules got loaded.
Cmdline for ‘raw’ table:
iptables -t raw -I PREROUTING -p udp --dport 9:19 --j DROP
Cmdline for ‘filter’ table:
iptables -t filter -I INPUT -p udp --dport 9:19 --j DROP
Cmdline for activating conntrack:
iptables -I INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
Default Fedora 27 firewalld setup, with jump rules and a REJECT target (reject-with icmp-host-prohibited) as the last rule being hit.
Cores | table raw | table filter | conntrack | firewalld |
---|---|---|---|---|
1 | 5051787 | 3319718 | 1819409 | 721284 |
2 | 10226514 | 6707809 | 3274018 | 1403399 |
3 | 15104793 | 9944065 | 4713610 | 2036345 |
4 | 20075858 | 13235307 | 6189412 | 2657446 |
5 | 24995919 | 16442723 | 7497838 | 3380752 |
6 | 29443869 | 19518401 | 8726498 | 4001466 |
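The single-core rows above translate into per-packet costs, which makes the conntrack overhead concrete:

```python
# Per-packet cost difference (single core) derived from the table above.
raw_pps, conntrack_pps = 5_051_787, 1_819_409
cost_ns = (1 / conntrack_pps - 1 / raw_pps) * 1e9
print(f"conntrack adds ~{cost_ns:.0f} ns per packet over table raw")
```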
Same test, but measuring the overhead of XDP_PASS (using xdp_rxq_info):
Cores | table raw + XDP_PASS | table raw |
---|---|---|
1 | 4545188 | 4837269 |
2 | 9069234 | 9661542 |
3 | 13596920 | 14483976 |
4 | 18081421 | 19250649 |
5 | 22520733 | 24070686 |
6 | 27002648 | 28817001 |
Approx 300 kpps overhead, i.e. 13.28 ns per packet (1/4545188 - 1/4837269).
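Reproducing the overhead figures from the single-core row (the throughput delta works out to 292 kpps, close to the quoted ~300 kpps):

```python
# XDP_PASS overhead from the single-core row: throughput delta and ns/packet.
with_pass, without = 4_545_188, 4_837_269
delta_kpps = (without - with_pass) / 1e3
overhead_ns = (1 / with_pass - 1 / without) * 1e9
print(f"{delta_kpps:.0f} kpps, {overhead_ns:.2f} ns per packet")
```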
Modified sample/bpf xdp_rxq_info tool to ‘swapmac’ when doing XDP_TX benchmarks.
Copied table from bench05, to compare XDP_TX to XDP_REDIRECT.
RXQs | XDP_REDIRECT (1024) | RX-size=512 | RX-size=256 | RX-size=128 |
---|---|---|---|---|
1 | 8649872 | 8665664 | 8641197 | 7577448 |
2 | 15975491 | 16629074 | 16742325 | 14268397 |
3 | 19222735 | 25230973 | 25738189 | 22086964 |
4 | 21535588 | 28807445 | 34185239 | 29512278 |
5 | 25464083 | 31306207 | 41461874 | 36652247 |
6 | 29828924 | 33970445 | 46062737 | 43376903 |
Perf tool cmdline to measure cache misses on CPU-0:
perf stat -C0 -e L1-icache-load-misses -e cycles -e instructions -e cache-misses -e cache-references -e branch-misses -e branches -r 3 sleep 1
Below it is clear that XDP_TX is experiencing some cache-miss issue.
RXQs | XDP_REDIRECT RX=512 | XDP_TX RX=1024 | % cache-miss | insn per cycle |
---|---|---|---|---|
1 | 8665664 | 16230538 | 0.001 | 2.76 |
2 | 16629074 | 31226024 | 0.469 | 2.59 |
3 | 25230973 | 32344463 | 11.752 | 1.74 |
4 | 28807445 | 32604398 | 30.577 | 1.37 |
5 | 31306207 | 37494941 | 41.497 | 1.26 |
6 | 33970445 | 43230678 | 41.977 | 1.22 |
Testing changing RX + TX ring size in XDP_TX test:
ethtool -G mlx5p1 rx 512 tx 512
RXQs | XDP_TX RX=512 | % cache-miss | insn per cycle |
---|---|---|---|
1 | 17005709 | 0.001 | 2.86 |
2 | 33926316 | 0.004 | 2.81 |
3 | 45523114 | 1.159 | 2.60 |
4 | 50978970 | 4.868 | 2.22 |
5 | 50619968 | 13.544 | 1.76 |
6 | 52315564 | 22.498 | 1.46 |
ethtool -G mlx5p1 rx 256 tx 256
RXQs | XDP_TX RX=256 | % cache-miss | insn per cycle |
---|---|---|---|
1 | 16900699 | 0.001 | 2.84 |
2 | 34850297 | 0.002 | 2.87 |
3 | 51884051 | 0.002 | 2.90 |
4 | 68953922 | 0.010 | 2.90 |
5 | 70164732 | 0.316 | 2.72 |
6 | 68744044 | 1.340 | 2.48 |
Evaluate this section to get the right figure styles:
%config InlineBackend.figure_format = 'svg'
import matplotlib as mpl  # needed below for mpl.rcParams and mpl.cycler
from matplotlib import pyplot as plt
import numpy as np
import os
BASEDIR=os.getenv("XDP_PAPER_BASEDIR") # or set manually
mpl.rcParams.update({
'axes.axisbelow': True,
'axes.edgecolor': 'white',
'axes.facecolor': '#E6E6E6',
'axes.formatter.useoffset': False,
'axes.grid': True,
'axes.labelcolor': 'black',
'axes.linewidth': 0.0,
'axes.prop_cycle': mpl.cycler('color', ["#1b9e77", "#d95f02", "#7570b3",
"#e7298a", "#66a61e", "#e6ab02",
"#a6761d", "#666666"]),
'figure.edgecolor': 'white',
'figure.facecolor': 'white',
'figure.figsize': (8.0, 5.0),
'figure.frameon': False,
'figure.subplot.bottom': 0.125,
'font.size': 16,
'grid.color': 'white',
'grid.linestyle': '-',
'grid.linewidth': 1,
'image.cmap': 'Greys',
'legend.frameon': False,
'legend.numpoints': 1,
'legend.scatterpoints': 1,
'lines.color': 'black',
'lines.linewidth': 1,
'lines.solid_capstyle': 'round',
'pdf.fonttype': 42,
'savefig.dpi': 100,
'text.color': 'black',
'xtick.color': 'black',
'xtick.direction': 'out',
'xtick.major.size': 0.0,
'xtick.minor.size': 0.0,
'ytick.color': 'black',
'ytick.direction': 'out',
'ytick.major.size': 0.0,
'ytick.minor.size': 0.0})
dpdk = np.array(dpdk_data)
xdp = np.array([i[:3] for i in xdp_data])
linux = np.array(linux_data)
plt.plot(dpdk[:,0], dpdk[:,3]/10**6, marker='o', label="DPDK")
plt.plot(xdp[:,0], xdp[:,2]/10**6, marker='s', label="XDP")
plt.plot(linux[:,0], linux[:,1]/10**6, marker='^', label="Linux (raw)")
plt.plot(linux[:,0], linux[:,3]/10**6, marker='x', label="Linux (conntrack)")
plt.xlabel("Number of cores")
plt.ylabel("Mpps")
plt.legend()
plt.ylim(0,130)
plt.savefig(BASEDIR+"/figures/drop-test.pdf", bbox_inches='tight')
plt.show()
data = np.array(data)
ones = np.ones(len(data[:,1]))*100
plt.plot(data[:,0], ones, marker='o', label="DPDK")
plt.plot(data[:11,0], ones[:11]-data[:11,1], marker='s', label="XDP")
plt.plot(data[:7,0], ones[:7]-data[:7,2], marker='^', label="Linux")
plt.xlabel("Offered load (Mpps)")
plt.ylabel("CPU usage (%)")
plt.legend()
plt.ylim(0,110)
plt.xlim(0,27)
plt.savefig(BASEDIR+"/figures/drop-cpu.pdf", bbox_inches='tight')
plt.show()
dpdk = np.array(dpdk_data)
xdp = np.array(xdp_data)
tx = np.array(tx_data)
plt.plot(dpdk[:,0], dpdk[:,2]/10**6, marker='o', label="DPDK (different NIC)")
plt.plot(tx[:,0], tx[:,1]/10**6, marker='s', label="XDP (same NIC)")
plt.plot(xdp[:,0], xdp[:,3]/10**6, marker='^', label="XDP (different NIC)")
plt.xlabel("Number of cores")
plt.ylabel("Mpps")
plt.legend()
plt.ylim(0,80)
plt.savefig(BASEDIR+"/figures/redirect-test.pdf", bbox_inches='tight')
plt.show()
tx = np.array(tx_data)
redir = np.array(redir_data)
plt.plot(tx[:,0], tx[:,1]/10**6, marker='o', label="TX")
plt.plot(redir[:,0], redir[:,3]/10**6, marker='s', label="REDIRECT")
plt.xlabel("Number of cores")
plt.ylabel("Mpps")
plt.legend()
plt.ylim(0,80)
plt.savefig(BASEDIR+"/figures/tx-test.pdf", bbox_inches='tight')
plt.show()