Kernel 4.9.x drops USB data on Pi 2B (regression from 4.4.x) #2134

Closed
mutability opened this issue Jul 27, 2017 · 20 comments

@mutability

TL;DR: USB bulk data gets intermittently dropped on Pi 2B + kernel 4.9. The
same hardware with kernel 4.4 does not drop data; this is a regression between
raspberrypi-kernel 1.20170405-1 and 1.20170427-1. Pi 0/1/3 are not affected.

Long version:

We use Pis with a "rtl-sdr" RTL2838U-based DVB dongle, in SDR mode, to receive
ADS-B radio signals from aircraft. This is using userspace librtlsdr / libusb,
with the rtl2832 kernel module blacklisted.

In SDR mode the dongle streams baseband samples to the Pi via a USB bulk endpoint
at a rate of 4.8MB/s. Software on the Pi demodulates this data and produces a stream
of decoded ADS-B messages.

Message receive timestamps are calculated by counting samples. So the
timestamp is a free-running clock based on the oscillator on the dongle itself.
It's not anchored to wallclock time but the timestamps are self-consistent and
predictable. The timestamps are then used for multilateration of the
received signals. These timestamps are very sensitive to any sort of dropped
data, as that will cause a jump in the sample count versus other receivers.
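
To make that concrete, here's a minimal sketch of the idea (not the actual dump1090 code; the sample rate and names here are illustrative, chosen so that 8-bit I/Q pairs at 2.4Msps match the 4.8MB/s figure above):

```python
# Minimal sketch of sample-count timestamping, not the dump1090-fa
# implementation. Assumes 2.4 Msps of 8-bit I/Q pairs, which matches
# the 4.8MB/s bulk stream mentioned above.
SAMPLE_RATE = 2_400_000   # samples per second (assumed)
BYTES_PER_SAMPLE = 2      # one I byte + one Q byte

class SampleClock:
    """Free-running clock driven by the dongle's own oscillator."""
    def __init__(self):
        self.samples_seen = 0

    def consume(self, transfer_bytes):
        # Called for every USB bulk transfer the application receives.
        self.samples_seen += transfer_bytes // BYTES_PER_SAMPLE

    def timestamp(self, offset_samples=0):
        # Seconds since streaming started, in the dongle's timebase.
        return (self.samples_seen + offset_samples) / SAMPLE_RATE

# If the kernel silently drops a bulk transfer, consume() is never called
# for it, so every later timestamp is early relative to other receivers -
# which shows up as the step discontinuities described below.
```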

We've found that users with Pi 2s who upgrade from a 4.4 kernel to a 4.9
kernel start having timing problems with multilateration. I've done some measurements
by comparing the timestamps generated on the Pi versus timestamps from a known-
good reference receiver and it looks like the Pi 2 + 4.9 kernel combination is
dropping USB data periodically. It drops around 2-4ms worth of data (= around
8kB-16kB of sample data) every few minutes. This is frequent enough that
multilateration doesn't work most of the time.

Here's a plot of the time offset between the Pi and a standalone reference receiver
for the same messages, in the case where things are working correctly (e.g. on a 4.4
kernel). This is from identical RF data (split off a single antenna). This looks normal;
the slope in the offset is due to a small frequency difference between the receivers
(about 0.7ppm in this case).

[plot: pi-usb-good]

Here's a plot of the same thing on a Pi 2 (the same Pi 2 hardware as in the graph above), but with a 4.9 kernel:

[plot: pi-usb-bad]

(Note that the Y scale is much larger in the second set of graphs)

The discontinuities in that graph imply that sample data is getting dropped;
the timestamps from the reference receiver advance by a step of around 2ms - 4ms relative
to the Pi's sample-counting clock.

Good combinations (data is not dropped):

  • Pi 1B+ revision 0010, raspberrypi-{kernel,bootloader} 1.20170703-2 (kernel 4.9.35+)
  • Pi 2B revision a01041, raspberrypi-{kernel,bootloader} 1.20170405-1 (kernel 4.4.50-v7+)
  • Pi 3B revision a02082, raspberrypi-{kernel,bootloader} 1.20170703-2 (kernel 4.9.35-v7+)
  • Pi 0W revision 9000c1, raspberrypi-{kernel,bootloader} 1.20170703-2 (kernel 4.9.35+)

Bad combinations:

  • Pi 2B revision a01041, raspberrypi-{kernel,bootloader} 1.20170427-1 (kernel 4.9.24-v7+)
  • Pi 2B revision a01041, raspberrypi-{kernel,bootloader} 1.20170703-2 (kernel 4.9.35-v7+)

Reproducing this is somewhat involved and needs specific hardware. I can loan hardware
in Cambridge & put together detailed instructions, or turn on whatever debug is needed and
retest on my setup - let me know.

lsusb -v output for the device:

Bus 001 Device 004: ID 0bda:2832 Realtek Semiconductor Corp. RTL2832U DVB-T
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  idVendor           0x0bda Realtek Semiconductor Corp.
  idProduct          0x2832 RTL2832U DVB-T
  bcdDevice            1.00
  iManufacturer           1 Realtek
  iProduct                2 RTL2832U
  iSerial                 3 00001000
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          4 USB2.0-Bulk&Iso
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              500mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass    255 Vendor Specific Subclass
      bInterfaceProtocol    255 Vendor Specific Protocol
      iInterface              5 Bulk-In, Interface
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  bNumConfigurations      2
Device Status:     0x0000
  (Bus Powered)
@P33M (Contributor) commented Jul 27, 2017

This is an excellent bug report.

We've had rumours of "bad latency" appearing between our shipped 4.4 kernel and shipped 4.9 kernel. For reference, there is a thread on the Kodi forums about DVB-T dongles being more "lossy" with 4.9 (later pinned down to the 4.8->4.9 upstream kernel version bump):
https://forum.libreelec.tv/thread/4235-dvb-issue-since-le-switched-to-kernel-4-9-x/

Your ADS-B / rtl-sdr application may in fact be a much more reliable way to determine what's causing the problem.

How do you determine if there's been a drop or latency-related overflow? In other words, what do I have to do to replicate your samples/latency graphs?

If correlation is required between multiple receivers decoding ADS-B messages, it so happens that Pi Towers is less than a mile away from Cambridge Airport, so there's a good chance we can replicate the issue.

@mutability (Author)

Reproducing it is a bit involved. Essentially what I'm doing is feeding the same RF signal to two receivers and, post decoding, pairing up ADS-B position messages by their contents (these messages are fairly unique) to get pairs of times that should reflect the same transmission. The absolute times will be wildly different but the difference between them should be close to constant - that's what I'm plotting in the graphs. If the offset between the receiver times jumps suddenly then that means that one receiver dropped some data.
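
For reference, here's a rough sketch of that pairing idea (this is not the actual correlator tool; the data layout is made up for illustration):

```python
# Rough sketch of the pairing/offset idea, not the real correlator script.
# times_a / times_b map a message's payload (hex string) to its receive
# timestamp on each receiver's own free-running clock (arbitrary epoch).
def pair_offsets(times_a, times_b):
    offsets = []
    for payload, ts_a in times_a.items():
        ts_b = times_b.get(payload)
        if ts_b is not None:                  # same message seen by both
            offsets.append((ts_a, ts_a - ts_b))
    offsets.sort()
    return offsets

# Plotting the offset against time gives the graphs above: a near-constant
# offset with a gentle ppm-level slope when nothing is dropped, and a
# sudden step when one receiver loses sample data.
```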

I'm actually using a separate standalone receiver (a Mode-S Beast, which does the demodulation onboard and produces just the decoded messages over serial) as one of the receivers, but you should be able to use a second rtlsdr dongle for that.

You will need:

a) a couple of dongles. These are suitable: https://www.modmypi.com/raspberry-pi/breakout-boards/flightaware/flightaware-pro-stick-usb-sdr-ads-b-receiver. Also, antennas for them; these will be OK: https://www.modmypi.com/electronics/antenna/antennas/3dbi-ads-b-1090mhz-sma-antenna-w-magnetic-base. Or I can loan you the right bits of hardware (I'm in Cambridge)

b) dump1090-fa from http://flightaware.com/adsb/piaware/files/packages/pool/piaware/d/dump1090-fa/dump1090-fa_3.5.0_armhf.deb (or build from http://github.com/flightaware/dump1090) to do the ADS-B decoding

c) the correlator script that pairs up the data. I'll clean that up and put it somewhere and get back to you

@mutability (Author)

Correlator script is available here: https://github.com/mutability/mlat-client/blob/dev/tools/compare-message-timing.py (you will need to build the _modes module from the enclosing repo first)

@JamesH65 (Contributor) commented Jul 27, 2017 via email

@rodizio1 commented Jul 28, 2017

One possible way to reproduce this might be to use wifibroadcast to inject wifi packets in monitor mode on one Pi and use another Pi to receive them. Drops can be seen easily (there are counters and sequence numbers). Jitter introduced somewhere between the wifi chip and the Pi can be checked by comparing the timestamps that the wifi hardware writes when receiving the packets (jitter is 1-2us max.) with the timestamps wireshark writes (which come from the kernel drivers or networking/wireless stack through libpcap, I believe).
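
The drop detection side of that is just a gap check on per-packet sequence numbers; a generic sketch (nothing here is wifibroadcast's actual code or field names):

```python
# Generic sequence-number gap check, assuming each received packet
# carries a monotonically increasing counter written by the sender.
def count_drops(seq_numbers):
    drops = 0
    prev = None
    for seq in seq_numbers:
        if prev is not None and seq != prev + 1:
            drops += seq - prev - 1   # packets missing in between
        prev = seq
    return drops
```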

@mutability (Author)

I had a chance to retest this with 4.9.46-v7+ (current rpi-update at the time of writing) and the issue seems to be fixed. I'm bisecting to narrow down the version that fixed it.

@popcornmix (Collaborator)

I'd be interested to hear the results. I'm not aware of anything between 4.9.35 and 4.9.46 that would have affected this.

@mutability (Author)

Initial results:

Good: 4.9.43-v7+ (Hexxeh/rpi-firmware@bf19fe7)
Bad: 4.9.41-v7+ (Hexxeh/rpi-firmware@b9becbb)

As you say there's no obvious suspect there.

@popcornmix (Collaborator)

How confident are you of repeatability of results?
Is it consistent with multiple upgrade/downgrades of those versions?

@mutability (Author)

The "bad" case is fairly definitive as I observed the failures directly.

The "good" case depends on how patient I am - in this case I ran it for 20 mins without seeing any failures. Usually if it's going to fail, it'll fail within about 10 mins. I'll set up a longer run to verify that .43 is OK

@mutability (Author)

I ran .41 and .43 for an hour or so each and got the same results as earlier: .41 is bad, .43 is good.


@popcornmix (Collaborator)

Can you leave .43 running overnight to be sure?
I'll get you a .42 kernel to test tomorrow to narrow it down a bit more (or are you comfortable building the kernel?). Once we've done that I'll have a close look at any upstream commits that may have affected this.

@P33M (Contributor) commented Sep 4, 2017

There aren't many commits between 41 and 43:

$ git bisect bad 6da35f43acde8f718b53f6f05fc865bffa709fc5
$ git bisect good db397d9c6e66afdd34ae430172db122632b5f8a7
Bisecting: 56 revisions left to test after this (roughly 6 steps)

Should be relatively easy to narrow this down to a couple of candidates.

@mutability (Author)

Sure, will leave .43 overnight. If you can build the kernel that would be much easier - I don't have a kernel build environment set up.

@mutability (Author)

43 was fine overnight:

[plot: timing-4.9.43]

@mutability (Author) commented Sep 5, 2017

I'll see if I can set up a kernel build environment today. Confusingly, the git hash for the 4.9.43 prebuilt kernel - https://github.com/Hexxeh/rpi-firmware/blob/bf19fe7aa723ef8ddafdf8ca47a7205e35749ddc/git_hash - appears to point to a 4.9.44 kernel, but the prebuilt kernel did self-identify as 4.9.43. Never mind - git log was misleading me.

@pelwell (Contributor) commented Sep 5, 2017

You beat me to it.

@mutability (Author) commented Sep 5, 2017

I flattened the 4.9.41 - 4.9.43 history and bisected it.
So far it looks like 9ef8b23 is the commit that fixed things.

git bisect start '--term-new' 'fixed' '--term-old' 'broken'
# fixed: [530540bf72acd467aa0a0cf78a72e36bb88bc3de] config: Enable CONFIG_BRCMDBG temporarily for debugging on pi0
git bisect fixed 530540bf72acd467aa0a0cf78a72e36bb88bc3de
# broken: [4153f509b449f1c1c816cf124c314975c3daa824] Merge remote-tracking branch 'stable/linux-4.9.y' into rpi-4.9.y
git bisect broken 4153f509b449f1c1c816cf124c314975c3daa824
# fixed: [8ae1fcf4f95e6773e539e74e354273ffa662f79b] net/mlx5: Fix command bad flow on command entry allocation failure
git bisect fixed 8ae1fcf4f95e6773e539e74e354273ffa662f79b
# fixed: [b88da5fc69db4ec69094873d525891126df92bb1] iscsi-target: Fix initial login PDU asynchronous socket close OOPs
git bisect fixed b88da5fc69db4ec69094873d525891126df92bb1
# broken: [a5fdaec8276baf5d2a9b8785c5a498922eb2f02d] ALSA: hda - Fix speaker output from VAIO VPCL14M1R
git bisect broken a5fdaec8276baf5d2a9b8785c5a498922eb2f02d
# fixed: [a58af0eba517970e66af42ceb041bc854eaefe45] powerpc/64: Fix __check_irq_replay missing decrementer interrupt
git bisect fixed a58af0eba517970e66af42ceb041bc854eaefe45
# broken: [366aa1a0e2abcfd20f2e26223a0717ea087a9b84] KVM: async_pf: make rcu irq exit if not triggered from idle task
git bisect broken 366aa1a0e2abcfd20f2e26223a0717ea087a9b84
# fixed: [ed57bead53d3e331a0326f0e364a2f855bc89726] timers: Fix overflow in get_next_timer_interrupt
git bisect fixed ed57bead53d3e331a0326f0e364a2f855bc89726
# broken: [aec732e3d766b0b65af25da3a7b2217655899e7c] mm/page_alloc: Remove kernel address exposure in free_reserved_area()
git bisect broken aec732e3d766b0b65af25da3a7b2217655899e7c
# first fixed commit: [ed57bead53d3e331a0326f0e364a2f855bc89726] timers: Fix overflow in get_next_timer_interrupt

(note that the commit hashes above aren't directly useful due to the rebase I did to flatten the history)

@popcornmix (Collaborator)

Thanks, that is good to know (and that was @P33M's guess at the most likely candidate).

I assume the timer callback not occurring at the expected time causes packets to be dropped.
I might have expected the failure to be more periodic, but I think the quantity is in nanoseconds, so it will overflow a 32-bit number approximately every 4 seconds, and only some percentage of those overflows will produce a wrong result.
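
Quick sanity check on that 4-second figure (just the arithmetic, nothing kernel-specific):

```python
# A 32-bit nanosecond quantity wraps every 2**32 ns.
wrap_seconds = 2**32 / 1e9
print(wrap_seconds)   # ~4.295 seconds
```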

@JamesH65 (Contributor)

Closing this issue as questions answered/issue resolved.
