
Pipeline drops half of the INT reports #104

Open
ccascone opened this issue Oct 6, 2020 · 2 comments
Labels
bug (Something isn't working), DTEL (Dataplane telemetry)

Comments

@ccascone (Member) commented Oct 6, 2020

We observed this issue in the production pod; the cause is still unclear.

This is especially evident when monitoring high-bandwidth (~10 Gbps) TCP flows generated by iperf: DeepInsight shows a rate of dropped reports that is proportional to, and in most cases equal to, the rate of successfully processed reports:
[Screenshot, 2020-10-05: DeepInsight chart of dropped vs. processed INT report rates]

DeepInsight uses the seq_no field in the INT report fixed header to detect dropped reports. In an iperf test, the INT reports delivered to the server have missing seq_nos. From this pcap trace, we see that reports carry the following seq_no values (the resulting drop rate is quantified in the sketch after the dump):

07 88 0b a0
07 88 0b a1
07 88 0b a3 # skipped 1
07 88 0b a5 # skipped 1
07 88 0b a7 # skipped 1
07 88 0b a9 # skipped 1
07 88 0b ab # skipped 1
07 88 0b ad # skipped 1
07 88 0b b1 # skipped 3
07 88 0b b3 # skipped 1
07 88 0b b5 # skipped 1
07 88 0b b7 # skipped 1
07 88 0b b9 # skipped 1
07 88 0b bb # skipped 1
07 88 0b bd # skipped 1
07 88 0b bf # skipped 1
07 88 0b c3 # skipped 2
...
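For reference, here is a minimal sketch (a hypothetical helper, not part of DeepInsight or the pipeline) that computes the drop rate implied by the seq_no values above; with these values it reports roughly half of the reports as missing, consistent with the issue title.

# Count missing seq_no values in the dump above (Python 3).
seq_hex = [
    "07 88 0b a0", "07 88 0b a1", "07 88 0b a3", "07 88 0b a5",
    "07 88 0b a7", "07 88 0b a9", "07 88 0b ab", "07 88 0b ad",
    "07 88 0b b1", "07 88 0b b3", "07 88 0b b5", "07 88 0b b7",
    "07 88 0b b9", "07 88 0b bb", "07 88 0b bd", "07 88 0b bf",
    "07 88 0b c3",
]
seqs = [int(s.replace(" ", ""), 16) for s in seq_hex]
received = len(seqs)
expected = seqs[-1] - seqs[0] + 1  # seq_no range covered by the capture
dropped = expected - received
print(f"received={received} dropped={dropped} drop_rate={dropped / expected:.1%}")
# With the values above: received=17, dropped=19, drop_rate=52.8%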

We don't believe this is an issue with seq_no computation in Tofino, since the problem cannot be reproduced when generating low-bit-rate traffic. Instead, we believe it is connected to how we use mirroring sessions and/or recirculation ports, and to the fact that the port attached to the DI server is a 10G port. The issue did not manifest when running a similar test on the staging server, where the DI port is 40G. However, on 03/30/2020 we observed the same issue on the staging pod, which uses 40G interfaces for the collector.

ccascone changed the title from "Pipeline drops every other INT report" to "Pipeline drops half of the INT reports" on Oct 6, 2020
ccascone added the bug (Something isn't working) label on Oct 6, 2020
@Yi-Tseng (Collaborator) commented Oct 9, 2020

I also found the same issue on leaf2 (the new 32D switch) of the staging server when Charles tested the topology.
Attached: di.pcap.zip

Yi-Tseng added the DTEL (Dataplane telemetry) label on Oct 16, 2020
@ccascone (Member, Author) commented
Confirming that this issue still exists. I observed it happening on staging when using iperf3 in TCP mode with a maximum bandwidth of 10 Mb/s. We should try to reproduce on FLiRT.

Attached Archive.zip:

  • pcap capture of iperf TCP packets at .163 (bidirectional iperf flows between .162 and .163)
  • pcap capture of INT reports at .161 (see the parsing sketch below; ignore the UI glitch that shows a different host)
  • P4RT write request log from Stratum
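Below is a quick sketch for checking the attached INT report pcap for seq_no gaps. It is assumption-laden, not a verified tool: it assumes Telemetry Report v0.5 framing (seq_no in bytes 4-7 of the UDP payload), uses scapy, and both the file name and REPORT_UDP_PORT are placeholders to be replaced with the actual capture and collector port.

# Sketch: list seq_no gaps in a pcap of INT reports (assumptions noted above).
import struct
from scapy.all import rdpcap, UDP

REPORT_UDP_PORT = 32766  # placeholder; replace with the collector's UDP port

prev = None
for pkt in rdpcap("int_reports.pcap"):  # placeholder file name
    if UDP not in pkt or pkt[UDP].dport != REPORT_UDP_PORT:
        continue
    payload = bytes(pkt[UDP].payload)
    if len(payload) < 8:
        continue
    (seq_no,) = struct.unpack_from("!I", payload, 4)  # 2nd 32-bit word of the fixed header
    if prev is not None and seq_no != prev + 1:
        print(f"gap after {prev:#010x}: skipped {seq_no - prev - 1} report(s)")
    prev = seq_no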

[Screenshot taken 2021-03-30 at 7:25 PM]

carmelo@node3:~$ iperf3 -c 10.32.5.162 -b1M -P10 -t100 -i1
Connecting to host 10.32.5.162, port 5201
[  4] local 10.32.5.163 port 39228 connected to 10.32.5.162 port 5201
[  6] local 10.32.5.163 port 39230 connected to 10.32.5.162 port 5201
[  8] local 10.32.5.163 port 39232 connected to 10.32.5.162 port 5201
[ 10] local 10.32.5.163 port 39234 connected to 10.32.5.162 port 5201
[ 12] local 10.32.5.163 port 39236 connected to 10.32.5.162 port 5201
[ 14] local 10.32.5.163 port 39238 connected to 10.32.5.162 port 5201
[ 16] local 10.32.5.163 port 39240 connected to 10.32.5.162 port 5201
[ 18] local 10.32.5.163 port 39242 connected to 10.32.5.162 port 5201
[ 20] local 10.32.5.163 port 39244 connected to 10.32.5.162 port 5201
[ 22] local 10.32.5.163 port 39246 connected to 10.32.5.162 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   331 KBytes  2.71 Mbits/sec    0   28.3 KBytes
[  6]   0.00-1.00   sec   288 KBytes  2.36 Mbits/sec    0   24.0 KBytes
[  8]   0.00-1.00   sec   303 KBytes  2.48 Mbits/sec    0   28.3 KBytes
[ 10]   0.00-1.00   sec   303 KBytes  2.48 Mbits/sec    0   28.3 KBytes
[ 12]   0.00-1.00   sec   300 KBytes  2.46 Mbits/sec    0   28.3 KBytes
[ 14]   0.00-1.00   sec   320 KBytes  2.62 Mbits/sec    0   28.3 KBytes
[ 16]   0.00-1.00   sec   320 KBytes  2.62 Mbits/sec    0   28.3 KBytes
[ 18]   0.00-1.00   sec   303 KBytes  2.48 Mbits/sec    0   28.3 KBytes
[ 20]   0.00-1.00   sec   320 KBytes  2.62 Mbits/sec    0   28.3 KBytes
[ 22]   0.00-1.00   sec   288 KBytes  2.36 Mbits/sec    0   22.6 KBytes
[SUM]   0.00-1.00   sec  3.00 MBytes  25.2 Mbits/sec    0
- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   1.00-2.00   sec   115 KBytes   938 Kbits/sec    0   28.3 KBytes
[  6]   1.00-2.00   sec   124 KBytes  1.02 Mbits/sec    0   24.0 KBytes
[  8]   1.00-2.00   sec   153 KBytes  1.25 Mbits/sec    0   28.3 KBytes
[ 10]   1.00-2.00   sec   153 KBytes  1.25 Mbits/sec    0   28.3 KBytes
[ 12]   1.00-2.00   sec   153 KBytes  1.25 Mbits/sec    0   28.3 KBytes
[ 14]   1.00-2.00   sec   124 KBytes  1.02 Mbits/sec    0   28.3 KBytes
[ 16]   1.00-2.00   sec   124 KBytes  1.02 Mbits/sec    0   28.3 KBytes
[ 18]   1.00-2.00   sec   153 KBytes  1.25 Mbits/sec    0   28.3 KBytes
[ 20]   1.00-2.00   sec   124 KBytes  1.02 Mbits/sec    0   28.3 KBytes
[ 22]   1.00-2.00   sec   124 KBytes  1.02 Mbits/sec    0   22.6 KBytes
[SUM]   1.00-2.00   sec  1.32 MBytes  11.0 Mbits/sec    0
- - - - - - - - - - - - - - - - - - - - - - - - -
...
