Kernel 4.9.x drops USB data on Pi 2B (regression from 4.4.x) #2134

Closed
mutability opened this Issue Jul 27, 2017 · 20 comments

mutability commented Jul 27, 2017

TL;DR: USB bulk data gets intermittently dropped on Pi 2B + kernel 4.9. The
same hardware with kernel 4.4 does not drop data; this is a regression between
raspberrypi-kernel 1.20170405-1 and 1.20170427-1. Pi 0/1/3 are not affected.

Long version:

We use Pis with a "rtl-sdr" RTL2838U-based DVB dongle, in SDR mode, to receive
ADS-B radio signals from aircraft. This is using userspace librtlsdr / libusb,
with the rtl2832 kernel module blacklisted.

In SDR mode the dongle streams baseband samples to the Pi via a USB bulk endpoint
at a rate of 4.8MB/s. Software on the Pi demodulates this data and produces a stream
of decoded ADS-B messages.

Message receive timestamps are calculated by counting samples. So the
timestamp is a free-running clock based on the oscillator on the dongle itself.
It's not anchored to wallclock time but the timestamps are self-consistent and
predictable. The timestamps are then used for multilateration of the
received signals. These timestamps are very sensitive to any sort of dropped
data, as that will cause a jump in the sample count versus other receivers.
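
For illustration, here is a minimal sketch of that sample-counting clock (hypothetical names, not the actual dump1090 code): each USB bulk buffer advances a free-running counter, so any dropped buffer silently shifts every later timestamp.

# Minimal sketch of a sample-counting clock (hypothetical names, not
# the actual dump1090 implementation).
SAMPLE_RATE = 2_400_000     # complex samples/s; 2 bytes each = 4.8MB/s
BYTES_PER_SAMPLE = 2        # 8-bit I + 8-bit Q

class SampleClock:
    def __init__(self):
        self.samples_seen = 0   # free-running count since stream start

    def on_usb_buffer(self, buf):
        self.samples_seen += len(buf) // BYTES_PER_SAMPLE

    def timestamp(self, offset_samples=0):
        # Seconds since stream start, by the dongle's own oscillator.
        return (self.samples_seen + offset_samples) / SAMPLE_RATE

# If the kernel drops 2ms of bulk data, samples_seen falls ~4800 samples
# behind the dongle's real output, and every later timestamp shifts by
# 2ms relative to other receivers - the kind of step described below.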

We've found that users running Pi 2s who upgrade from a 4.4 kernel to a 4.9
kernel have timing problems with multilateration. I've done some measurements
by comparing the timestamps generated on the Pi versus timestamps from a known-
good reference receiver and it looks like the Pi 2 + 4.9 kernel combination is
dropping USB data periodically. It drops around 2-4ms worth of data (= around
8kB-16kB of sample data) every few minutes. This is frequent enough that
multilateration doesn't work most of the time.

Here's a plot of the time offset between the Pi and a standalone reference receiver
for the same messages, in the case where things are working correctly (e.g. on a 4.4
kernel). Both receivers see identical RF data (split off a single antenna). This looks normal;
the slope in the offset is due to a small frequency difference between the receivers
(about 0.7ppm in this case).

[plot: pi-usb-good]

Here's a plot of the same thing from the same Pi 2 hardware, but with a 4.9 kernel:

[plot: pi-usb-bad]

(Note that the Y scale is much larger in the second set of graphs)

The discontinuities in that graph imply that sample data is getting dropped;
the timestamps from the reference receiver advance by a step of around 2-4ms relative
to the Pi's sample-counting clock.

Good combinations (data is not dropped):

  • Pi 1B+ revision 0010, raspberrypi-{kernel,bootloader} 1.20170703-2 (kernel 4.9.35+)
  • Pi 2B revision a01041, raspberrypi-{kernel,bootloader} 1.20170405-1 (kernel 4.4.50-v7+)
  • Pi 3B revision a02082, raspberrypi-{kernel,bootloader} 1.20170703-2 (kernel 4.9.35-v7+)
  • Pi 0W revision 9000c1, raspberrypi-{kernel,bootloader} 1.20170703-2 (kernel 4.9.35+)

Bad combinations:

  • Pi 2B revision a01041, raspberrypi-{kernel,bootloader} 1.20170427-1 (kernel 4.9.24-v7+)
  • Pi 2B revision a01041, raspberrypi-{kernel,bootloader} 1.20170703-2 (kernel 4.9.35-v7+)

Reproducing this is somewhat involved and needs specific hardware. I can loan hardware
in Cambridge & put together detailed instructions, or turn on whatever debug is needed and
retest on my setup - let me know.

lsusb -v output for the device:

Bus 001 Device 004: ID 0bda:2832 Realtek Semiconductor Corp. RTL2832U DVB-T
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  idVendor           0x0bda Realtek Semiconductor Corp.
  idProduct          0x2832 RTL2832U DVB-T
  bcdDevice            1.00
  iManufacturer           1 Realtek
  iProduct                2 RTL2832U
  iSerial                 3 00001000
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          4 USB2.0-Bulk&Iso
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              500mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass    255 Vendor Specific Subclass
      bInterfaceProtocol    255 Vendor Specific Protocol
      iInterface              5 Bulk-In, Interface
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  bNumConfigurations      2
Device Status:     0x0000
  (Bus Powered)

P33M (Contributor) commented Jul 27, 2017

This is an excellent bug report.

We've had rumours of "bad latency" appearing between our shipped 4.4 kernel and shipped 4.9 kernel. For reference, there is a thread on the Kodi forums about DVB-T dongles being more "lossy" with 4.9 (later pinned down to the 4.8->4.9 upstream kernel version bump):
https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/

Your ADS-B / rtl-sdr application may in fact be a much more reliable way to determine what's causing the problem.

How do you determine if there's been a drop or latency-related overflow? In other words, what do I have to do to replicate your samples/latency graphs?

If correlation is required between multiple receivers decoding ADS-B messages, it so happens that Pi Towers is less than a mile away from Cambridge Airport, so there's a good chance we can replicate the issue.

mutability commented Jul 27, 2017

Reproducing it is a bit involved. Essentially I feed the same RF signal to two receivers and, after decoding, pair up ADS-B position messages by their contents (these messages are fairly unique) to get pairs of times that should reflect the same transmission. The absolute times will be wildly different, but the difference between them should be close to constant - that's what I'm plotting in the graphs. If the offset between the receiver times jumps suddenly, that means one receiver dropped some data (there's a sketch of the pairing logic after the hardware list below).

I'm actually using a separate standalone receiver (a Mode-S Beast, which does the demodulation onboard and produces just the decoded messages over serial) as one of the receivers, but you should be able to use a second rtlsdr dongle for that.

You will need:

a) a couple of dongles. These are suitable: https://www.modmypi.com/raspberry-pi/breakout-boards/flightaware/flightaware-pro-stick-usb-sdr-ads-b-receiver. Also, antennas for them; these will be OK: https://www.modmypi.com/electronics/antenna/antennas/3dbi-ads-b-1090mhz-sma-antenna-w-magnetic-base. Or I can loan you the right bits of hardware (I'm in Cambridge)

b) dump1090-fa from http://flightaware.com/adsb/piaware/files/packages/pool/piaware/d/dump1090-fa/dump1090-fa_3.5.0_armhf.deb (or build from http://github.com/flightaware/dump1090) to do the ADS-B decoding

c) the correlator script that pairs up the data. I'll clean that up and put it somewhere and get back to you
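
As a taste of what the correlator does, here's a minimal sketch of the pairing logic (hypothetical names and simplified matching; the real tool is the script linked in the next comment):

# Minimal sketch of the message pairing (hypothetical names; the real
# tool is compare-message-timing.py, linked in the next comment).
# Each log is a list of (receive_time_seconds, message_hex) tuples.
def offsets(log_a, log_b):
    by_payload = {}
    for t, msg in log_a:
        by_payload.setdefault(msg, []).append(t)
    diffs = []
    for t_b, msg in log_b:
        if by_payload.get(msg):
            t_a = by_payload[msg].pop(0)
            diffs.append((t_b, t_a - t_b))   # plot offset against time
    return diffs
# A healthy receiver pair gives a near-constant, slowly drifting offset
# (oscillator ppm difference); dropped USB data shows up as a sudden step.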

mutability commented Jul 27, 2017

Correlator script is available here: https://github.com/mutability/mlat-client/blob/dev/tools/compare-message-timing.py (you will need to build the _modes module from the enclosing repo first)

JamesH65 (Contributor) commented Jul 27, 2017

rodizio1 commented Jul 28, 2017

One possible way to reproduce this might be to use wifibroadcast to inject wifi packets in monitor mode on one Pi, with another Pi receiving them. Drops are easy to see (there are counters and sequence numbers). Jitter introduced somewhere between the wifi chip and the Pi can be checked by comparing the timestamps the wifi hardware writes when receiving the packets (jitter is 1-2us max) with the timestamps Wireshark writes (which come from the kernel drivers or the networking/wireless stack via libpcap, I believe).
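
(For illustration, the drop counting described above is just gap detection on sequence numbers - a minimal sketch, not wifibroadcast's actual code:)

# Minimal sequence-number gap detector (illustrative; not wifibroadcast's
# actual implementation, and the 16-bit modulus is an assumption).
def count_drops(seqs, modulus=2**16):
    dropped, prev = 0, None
    for s in seqs:
        if prev is not None:
            gap = (s - prev) % modulus   # tolerate counter wraparound
            dropped += max(gap - 1, 0)   # consecutive numbers = no loss
        prev = s
    return dropped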

mutability commented Sep 4, 2017

I had a chance to retest this with 4.9.46-v7+ (current rpi-update at the time of writing) and the issue seems to be fixed. I'm bisecting to narrow down the version that fixed it.

popcornmix (Collaborator) commented Sep 4, 2017

I'd be interested to hear the results. I'm not aware of anything between 4.9.35 and 4.9.46 that would have affected this.

mutability commented Sep 4, 2017

Initial results:

Good: 4.9.43-v7+ (Hexxeh/rpi-firmware@bf19fe7)
Bad: 4.9.41-v7+ (Hexxeh/rpi-firmware@b9becbb)

As you say there's no obvious suspect there.

popcornmix (Collaborator) commented Sep 4, 2017

How confident are you in the repeatability of these results?
Is it consistent across multiple upgrades/downgrades of those versions?

mutability commented Sep 4, 2017

The "bad" case is fairly definitive as I observed the failures directly.

The "good" case depends on how patient I am - in this case I ran it for 20 mins without seeing any failures. Usually if it's going to fail, it'll fail within about 10 mins. I'll set up a longer run to verify that .43 is OK

mutability commented Sep 4, 2017

I ran .41 and .43 for an hour or so each and got the same results as earlier: .41 is bad, .43 is good.

[plot: bisected]

popcornmix (Collaborator) commented Sep 4, 2017

Can you leave .43 running overnight to be sure?
I'll get you a .42 kernel to test tomorrow to narrow it down a bit more (or are you comfortable building the kernel?). Once we've done that, I'll have a close look at any upstream commits that may have affected this.

P33M (Contributor) commented Sep 4, 2017

There aren't many commits between .41 and .43:

$ git bisect bad 6da35f43acde8f718b53f6f05fc865bffa709fc5
$ git bisect good db397d9c6e66afdd34ae430172db122632b5f8a7
Bisecting: 56 revisions left to test after this (roughly 6 steps)

Should be relatively easy to narrow this down to a couple of candidates.

mutability commented Sep 4, 2017

Sure, I'll leave .43 running overnight. If you can build the kernel, that would be much easier - I don't have a kernel build environment set up.

mutability commented Sep 5, 2017

43 was fine overnight:

[plot: timing-4.9.43]

mutability commented Sep 5, 2017

I'll see if I can set up a kernel build environment today. Confusingly, the git hash for the 4.9.43 prebuilt kernel - https://github.com/Hexxeh/rpi-firmware/blob/bf19fe7aa723ef8ddafdf8ca47a7205e35749ddc/git_hash - appears to point to a 4.9.44 kernel, but the prebuilt kernel did self-identify as 4.9.43. Never mind - git log was misleading me.

pelwell (Contributor) commented Sep 5, 2017

You beat me to it.

mutability commented Sep 5, 2017

I flattened the 4.9.41 - 4.9.43 history and bisected it.
So far it looks like 9ef8b23 is the commit that fixed things.

git bisect start '--term-new' 'fixed' '--term-old' 'broken'
# fixed: [530540bf72acd467aa0a0cf78a72e36bb88bc3de] config: Enable CONFIG_BRCMDBG temporarily for debugging on pi0
git bisect fixed 530540bf72acd467aa0a0cf78a72e36bb88bc3de
# broken: [4153f509b449f1c1c816cf124c314975c3daa824] Merge remote-tracking branch 'stable/linux-4.9.y' into rpi-4.9.y
git bisect broken 4153f509b449f1c1c816cf124c314975c3daa824
# fixed: [8ae1fcf4f95e6773e539e74e354273ffa662f79b] net/mlx5: Fix command bad flow on command entry allocation failure
git bisect fixed 8ae1fcf4f95e6773e539e74e354273ffa662f79b
# fixed: [b88da5fc69db4ec69094873d525891126df92bb1] iscsi-target: Fix initial login PDU asynchronous socket close OOPs
git bisect fixed b88da5fc69db4ec69094873d525891126df92bb1
# broken: [a5fdaec8276baf5d2a9b8785c5a498922eb2f02d] ALSA: hda - Fix speaker output from VAIO VPCL14M1R
git bisect broken a5fdaec8276baf5d2a9b8785c5a498922eb2f02d
# fixed: [a58af0eba517970e66af42ceb041bc854eaefe45] powerpc/64: Fix __check_irq_replay missing decrementer interrupt
git bisect fixed a58af0eba517970e66af42ceb041bc854eaefe45
# broken: [366aa1a0e2abcfd20f2e26223a0717ea087a9b84] KVM: async_pf: make rcu irq exit if not triggered from idle task
git bisect broken 366aa1a0e2abcfd20f2e26223a0717ea087a9b84
# fixed: [ed57bead53d3e331a0326f0e364a2f855bc89726] timers: Fix overflow in get_next_timer_interrupt
git bisect fixed ed57bead53d3e331a0326f0e364a2f855bc89726
# broken: [aec732e3d766b0b65af25da3a7b2217655899e7c] mm/page_alloc: Remove kernel address exposure in free_reserved_area()
git bisect broken aec732e3d766b0b65af25da3a7b2217655899e7c
# first fixed commit: [ed57bead53d3e331a0326f0e364a2f855bc89726] timers: Fix overflow in get_next_timer_interrupt

(note that the commit hashes above aren't directly useful due to the rebase I did to flatten the history)

popcornmix (Collaborator) commented Sep 6, 2017

Thanks - that's good to know (and that was @P33M's guess at the most likely candidate).

I assume the timer callback not occurring at the expected time causes packets to be dropped.
I might have expected the failure to be more periodic, but I think the quantity is in nanoseconds, so it will overflow a 32-bit number approximately every 4 seconds - and only some percentage of those overflows will actually produce a wrong result.
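
(The arithmetic behind "approximately every 4 seconds": a 32-bit nanosecond quantity wraps at 2^32 ns.)

# A 32-bit nanosecond counter wraps every 2**32 ns:
print(2**32 / 1e9)   # -> 4.294967296 seconds per wraparound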

JamesH65 (Contributor) commented Sep 13, 2017

Closing this issue as questions answered/issue resolved.
