
Lockup with USBPollerThread #16

Closed
jhoenicke opened this issue Jun 12, 2016 · 17 comments

@jhoenicke

As mentioned in the pull request for the Hantek 6022 library, I have a problem with USBPollerThread and the synchronous API. A synchronous call is not woken up when its event is handled by the poller thread, and only continues after a timeout.
This happens with Ubuntu 16.04, libusb-1.0.20-1, libusb1-1.5.0

Any ideas how to test this further?

Minimal test program (for a hantek 6022):

import usb1
import select

VENDOR_ID = 0x04b4
PRODUCT_ID = 0x6022

with usb1.USBContext() as context:
    usb1.USBPollerThread(context, select.poll()).start()
    device = context.getByVendorIDAndProductID(
            VENDOR_ID, PRODUCT_ID,
            skip_on_error=True,
            skip_on_access_error=True)
    if device is None:
        print("Cannot find scope")
        exit()
    handle = device.open()
    handle.claimInterface(0)
    # Synchronous control transfer: this call only returns after the
    # timeout when USBPollerThread consumes its completion event.
    written = handle.controlWrite(0x40, 0xa0,
                            0xe600, 0x00, b'\x01')
    print(written)
    handle.releaseInterface(0)
    handle.close()
@vpelletier
Owner

I think the first step would be to check exactly where both threads are at when deadlocking. I think there are 3 places which call into libusb which may lock:

  • in controlWrite (most likely deadlock place)
  • in handle.close
  • in context.__exit__

(I don't think open/claim/releaseInterface would lock.)

You may also want to increase libusb verbosity by calling context.setDebug (may require rebuilding libusb with logging enabled) to see what it has to say.

To see which lock causes the issue, you could use gdb (be sure to check the DebuggingWithGdb python wiki page to fetch the needed macros).

I am surprised to discover a new-ish section about a polling thread in the libusb(x) asyncio API documentation which shows a much simpler version of the handling loop without mentioning whether it is compatible with the synchronous API. @jwrdegoede: I think you added this paragraph. Is the described event loop synchronous-API-compatible? Would there be any serious reason to think some libusb internal change has broken the older, more complex recommended way to implement such a loop?

That page does mention that the event thread should be started after opening the first device and stopped before closing it. Just in case it also applies to the older/synchronous-compatible approach (not sure which applies here), could you call start() after device.open()? I pushed 2 commits to ask the thread to exit before device closure; please fetch them and call the new close method on the thread object - and please check that it does not block too, as it will join the thread. For illustration, a reordered version of the test program is sketched just below.
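
Here is a sketch of the reported test program reordered along those lines; the setDebug level and the exact behaviour of the new close() method are assumptions, so check the pushed commits:

import select
import usb1

VENDOR_ID = 0x04b4
PRODUCT_ID = 0x6022

with usb1.USBContext() as context:
    context.setDebug(4)  # assumption: 4 is libusb's most verbose log level
    device = context.getByVendorIDAndProductID(
            VENDOR_ID, PRODUCT_ID,
            skip_on_error=True,
            skip_on_access_error=True)
    if device is None:
        print("Cannot find scope")
        exit()
    handle = device.open()
    # Start the poller only after the first device is open...
    poller = usb1.USBPollerThread(context, select.poll())
    poller.start()
    try:
        handle.claimInterface(0)
        written = handle.controlWrite(0x40, 0xa0, 0xe600, 0x00, b'\x01')
        print(written)
        handle.releaseInterface(0)
    finally:
        # ...and ask it to exit (new close() from the commits above)
        # before the device handle and the context go away.
        poller.close()
        handle.close()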

@jwrdegoede

Hi,

On 13-06-16 02:41, Vincent Pelletier wrote:

I am surprised to discover a new-ish section about a polling thread in the libusb(x) asyncio API documentation (http://libusb.sourceforge.net/api-1.0/group__asyncio.html#eventthread) which shows a much simpler version of the handling loop without mentioning whether it is compatible with the synchronous API. @jwrdegoede: I think you added this paragraph. Is the described event loop synchronous-API-compatible?

Yes, I added that bit to the docs, mostly to document how to cleanly exit such a loop. Note that for 1.0.21 we have a new API pending to simply interrupt event handling, making such a loop even easier.

I'm using such a loop with async functions only, but I see no reason why it cannot work with sync functions. Note that libusb uses sync functions internally when getting descriptors and such, so in essence even I am already using it with sync functions.

Would there be any serious reason to think some libusb internal change has broken the older, more complex recommended way to implement such a loop?

I'm not sure what you mean by the older, more complex way. If you read:

http://libusb.sourceforge.net/api-1.0/group__poll.html

which I did not touch, it already gives you 2 options, with the simple one mostly matching my bit of the docs: "If your application revolves solely around libusb and does not need to handle other event sources, you can have a program structure as follows:"

In essence that is what my suggestion does: by using a dedicated thread, we do not need to care about other event sources, as those will be handled in another thread.

The other option described there, with all the pollfd stuff + timeout handling etc., only applies to integrating libusb into an existing select()/poll()-based event handling mechanism.

Do note that using a separate thread means that your callback functions need to be thread-safe. Also, if you're running on a platform which does not have some sort of timerfd, this implementation will drop your transfer timeout granularity to 60 seconds (you can improve this by using libusb_handle_events_timeout with a shorter timeout).
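
In python-libusb1 terms, a minimal sketch of such a dedicated event thread could look like the following; this is an illustration, not an official helper, and the stop mechanism is simplistic because handleEvents() blocks until there is something to handle:

import threading
import usb1

class USBEventThread(threading.Thread):
    # Dedicated event-handling thread: it blocks in handleEvents(), so the
    # application never has to poll libusb file descriptors itself.
    def __init__(self, context):
        super(USBEventThread, self).__init__()
        self.daemon = True
        self._context = context
        self._exit = threading.Event()

    def run(self):
        while not self._exit.is_set():
            # Returns once libusb processed pending events (transfer
            # completions and the like), then loops.
            self._context.handleEvents()

    def stop(self):
        # Takes effect the next time handleEvents() returns, e.g. when a
        # final event arrives or the pending interrupt API is used.
        self._exit.set()

Per the docs quoted above, such a thread would be started after opening the first device and stopped before closing the last handle.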

As for your deadlock, you may be hitting an issue introduced in 1.0.20, for which I've recently posted some patches to the list. If you can reproduce this reliably(-ish) you could try:

https://github.com/jwrdegoede/libusbx/commits/master

which has a bunch of pending locking fixes. Chances are you are not hitting this though, because the identified race requires some pretty exotic circumstances to trigger.

Or for something closer to v1.0.20:

https://github.com/jwrdegoede/libusbx/commits/v1.0.20-fixes

Regards,

Hans


@jhoenicke
Author

It is hard to debug, but I have some strace and ltrace logs. The main thread hangs in controlWrite, while the event was handled by the event handler thread. After a minute the main thread times out and continues. According to strace, the main thread (1878) hangs in futex(); I think it is pthread_cond_timedwait.

relevant strace (1878 main thread, 1886 USB poller):

1878  ioctl(10, USBDEVFS_SUBMITURB <unfinished ...>
1886  poll([{fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 3, 0 <unfinished ...>
1878  <... ioctl resumed> , 0x2420210)  = 0
1878  futex(0x23dd8e4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1465738340, 488365111}, ffffffff <unfinished ...>
1886  <... poll resumed> )              = 1 ([{fd=10, revents=POLLOUT}])
1886  ioctl(10, USBDEVFS_REAPURBNDELAY, 0x7fe819126020) = 0
1886  ioctl(10, USBDEVFS_REAPURBNDELAY, 0x7fe819126020) = -1 EAGAIN (Resource temporarily unavailable)
1886  poll([{fd=9, events=POLLIN}, {fd=10, events=POLLOUT}, {fd=7, events=POLLIN}], 3, -1 <unfinished ...>
1878  <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)

@vpelletier
Owner

@jwrdegoede

I'm not sure what you mean by the older, more complex way

I mean this beast: http://libusb.sourceforge.net/api-1.0/mtasync.html#threadwait

This is what I implemented in this Python wrapper: the page describes that, with a handler loop in a thread and a synchronous USB call (say, a bulk read), the handler loop could "steal" the completion event and leave the synchronous call waiting for an event which will then never arrive (I didn't check the code, and maybe I misunderstood the doc - or misremember it now, as I did not re-read it entirely). Such an issue does not exist at all if one only uses the async API.
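
For illustration, a rough sketch of that async-only route with python-libusb1; the transfer API names used here (getTransfer, setControl, submit, isSubmitted) are as I recall them, and the request mirrors the control write from the report:

import usb1

def async_control_write(context, handle):
    # context: a usb1.USBContext, handle: an open usb1.USBDeviceHandle.
    result = []

    def on_complete(transfer):
        # Runs from whichever thread is inside handleEvents().
        result.append((transfer.getStatus(), transfer.getActualLength()))

    transfer = handle.getTransfer()
    # Same request as the synchronous controlWrite in the original report.
    transfer.setControl(0x40, 0xa0, 0xe600, 0x00, b'\x01',
                        callback=on_complete)
    transfer.submit()
    while transfer.isSubmitted():
        # The completion is consumed here (or by a dedicated event thread);
        # no thread waits on a condition another thread could steal.
        context.handleEvents()
    return result[0]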

About thread-safety: thankfully (despite the Global Interpreter Lock not being too popular), Python takes care of this for us, so that's not a problem here.

@jhoenicke

The main thread hangs in controlWrite, while the event was handled by the event handler thread.

Good idea to use ltrace/strace.

Could you try building libusb from @jwrdegoede's repository and reproducing with it, as he reported there is a known deadlock regression in 1.0.20? python-libusb1 honours the OS's library lookup order: LD_LIBRARY_PATH on Linux, something similar but with another name on OSX, and I think Windows looks in the current directory before going to C:\Windows\System32 (or whatever it is nowadays).

@jwrdegoede

Hi,

On 13-06-16 15:57, Vincent Pelletier wrote:

@jwrdegoede

I'm not sure what you mean by the older, more complex way

I mean this beast: http://libusb.sourceforge.net/api-1.0/mtasync.html#threadwait

Ah, we need to remove that, that is no longer true. All the libusb sync functions now use libusb_handle_events_completed, which avoids this race.

Patches welcome ...

Regards,

Hans

@jwrdegoede

Hi,

On 13-06-16 16:03, Hans de Goede wrote:


Ah, we need to remove that, that is no longer true. All the libusb sync functions now use libusb_handle_events_completed, which avoids this race.

Never mind, that section actually is correct. But it is talking about taking the event lock yourself; to quote the text just above it:

"You must take the events lock before polling libusb file descriptors, using libusb_lock_events(). You must release the lock as soon as you have aborted your poll()/select() loop, using libusb_unlock_events()."

So you see, this is only relevant if you're doing fd polling yourself. If all you want to do is poll for events from a dedicated thread, and it is OK for this thread to block in libusb_handle_events() when there are no events, then you can just call libusb_handle_events(), which takes care of everything for you.

If you look a couple of lines above the bit of the docs you're referring to, the simple method for event polling is explained there.

TL;DR: The complicated method you are referring to is only necessary if you want to poll libusb fds yourself, because there are other fds which you want to poll at the same time (the typical single-threaded unix program which does event multiplexing via select/poll).

Or to quote the docs:

"Closing remarks: The above may seem a little complicated, but hopefully I have made it clear why such complications are necessary. Also, do not forget that this only applies to applications that take libusb's file descriptors and integrate them into their own polling loops."

Regards,

Hans

@vpelletier
Owner

@jwrdegoede:
Mmh, right, I need to remember more about how I came to write that poller thread/class. There is another class in python-libusb1 implementing polling, so I had two distinct uses in mind. I have to remember which...

So, two potential pieces of good news: python-libusb1 will get simpler (if I can remember exactly why I did it this way and changing it does not break stuff), and the deadlock/timeout reported here may be a known 1.0.20 issue (if confirmed by @jhoenicke).

@eldarkg

eldarkg commented Jan 10, 2017

@vpelletier Hello. I'm facing the same issue.
I use async (USBPollerThread, USBTransferHelper) and sync access in the main thread.
My OS is Ubuntu 16.04, libusb-1.0.20-1, libusb1-1.6.2.
I tried updating libusb to 1.0.21 (from launchpad The Zesty Zapus and from the latest src here and here). Nothing helped.

@eldarkg

eldarkg commented Jan 12, 2017

@vpelletier When I tried another way (a threading.Thread running event_thread_run plus USBTransferHelper, without USBPollerThread), everything is OK (libusb 1.0.21 from LP Zesty, libusb1-1.6.2).

@vpelletier
Owner

Thanks for the update; this is indeed what I considered suggesting: if you do not actually need to poll other descriptors, this approach is much simpler (hence less likely to have bugs).

But I also want to reproduce this issue now that I have a USB device running Linux - which makes testing python-libusb1 much easier (well, which should eventually make it easier, once functionfs and gadget-mode controller drivers fail less).

@vpelletier
Owner

@eldarkg, @jhoenicke: What type of polling object did you pass to USBPollerThread?

If "poll", then I now see the deadlock: the Python poll object exposes {,un}register methods separately from poll, but in reality the set of monitored file descriptors cannot be changed in one thread (here, the thread doing the synchronous USB I/O) while another is already waiting for events (here, USBPollerThread). The issue would be most visible when there is no async USB I/O (which likely means a silent device and no submitted transfers).

@eldarkg

eldarkg commented Jan 15, 2017

@vpelletier Yes, I used the select.poll object.

@eldarkg

eldarkg commented Jan 15, 2017

@vpelletier Why do you want to deprecate USBPollerThread? Maybe the simpler loop should be wrapped into USBPollerThread; it would be a good helper.

@vpelletier
Owner

Why do you want to deprecate USBPollerThread?

Because its job is to poll non-libusb descriptors, only abstracting the libusb descriptor dance. It was not intended to be used where the simpler loop would do.

From there:

  • I could extend USBPollerThread to type-check against epoll to detect problematic uses, which would be very unpythonic.
  • I could add a USBThread implementing the simpler loop. But it is so trivial to implement, and the thread stop condition could vary so much across host applications, that I prefer to let them implement it (code duplication should be minimal - if there is anything I would rather abstract better in python-libusb1 to reduce host applications' boilerplate code, it would be device enumeration).

OTOH, maybe I should extend USBPoller to:

  • implement bits of the more complex loop to at least improve the situation when the poller supports live descriptor set changes
  • find a way to support kqueue too

@eldarkg

eldarkg commented Jan 15, 2017

@vpelletier OK. Thank you for the info.

@vpelletier
Owner

After a bit over 3 years and 8 releases with a deprecation warning, USBPollerThread is nearing its end. Its latest issue is that it causes an assertion failure in libusb_exit on current libusb1 master, which ends in a Python core dump.

libusb1 is now checking the result of its mutex de-allocation calls. When this thread has been able to acquire the event lock, freeing that mutex fails when libusb1 tries to do so, as the mutex is still acquired. Releasing it requires the poller thread to terminate before libusb_exit is called, which requires USBContext to have a way to make the thread exit; until now there was no such relation between instances, and no obvious way to create one without bloating the USBContext API even further. Also, the thread must call USBContext.unlockEvents, which is protected by a mutex preventing use when the internal context pointer is invalid - a mutex which is acquired at the beginning of the USBContext teardown code, teardown code which would itself be waiting for the thread to exit, completing the deadlock.

Most of this could be saved by converting pairs of acquisition/release methods into context managers, so the lock-release codepath does not need to acquire the context-pointer mutex. But that would bloat the API further and still not solve the fundamental lockup issue reported here, which is caused by using any poller other than the Linux-only epoll.
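
By way of illustration, such a pairing could look like this hypothetical wrapper around the lockEvents/unlockEvents pair; it is not part of the python-libusb1 API:

from contextlib import contextmanager

@contextmanager
def locked_events(context):
    # Hypothetical helper pairing lockEvents/unlockEvents; a real
    # implementation inside python-libusb1 could capture the underlying
    # context pointer at acquisition time, so the release path need not
    # re-take the context-pointer mutex on its own.
    context.lockEvents()
    try:
        yield
    finally:
        context.unlockEvents()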

I am hence:

  • removing USBPollerThread
  • deprecating the related USBContext API: tryLockEvents, lockEvents, lockEventWaiters, waitForEvent, unlockEventWaiters, eventHandlingOK, unlockEvents, handleEventsLocked, and eventHandlerActive.

handleEvents and handleEventsTimeout are the remaining supported event handlers. getPollFDList and setPollFDNotifiers stay supported as they do not imply threading (these would also likely benefit from being converted into context managers, but this is another topic).

@vpelletier
Owner

Released as 1.9.

prusnak added a commit to prusnak/nixpkgs that referenced this issue Jul 4, 2021
the hack in checkPhase is no longer needed since the issue has been resolved: vpelletier/python-libusb1#16