
libusb1 slightly slower than pyusb for large data. #21

Closed
diamondman opened this issue Oct 15, 2016 · 16 comments

@diamondman

I am using libusb1 to communicate with a device and am looking for ways to make the transfer faster.
The bulk transfer is 5959114 bytes to USB bulk endpoint 2. Currently, the bulkWrite call of usb1 takes the longest amount of time in my project. I tested sending the same data with pyusb and found pyusb usually faster than libusb1. I hope my statistics and use case can help improve the performance of this great repository. Below are the statistics I came up with, and how to reproduce them.

I hooked kernprof into my project and profiled usb1's bulkWrite function. Here are the results. I noticed the ctypes conversion takes almost as much time as the transfer itself. Maybe there is a way to do this with zero copy in C?

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1378                                               @profile
  1379                                               def bulkWrite(self, endpoint, data, timeout=0):
  1380                                                   """
  1381                                                   Synchronous bulk write.
  1382                                                   endpoint: endpoint to send data to.
  1383                                                   data: data to send.
  1384                                                   timeout: in milliseconds, how long to wait for device acknowledgement.
  1385                                                     Set to 0 to disable.
  1386                                           
  1387                                                   Returns the number of bytes actually sent.
  1388                                                   """
  1389                                                   # pylint: disable=undefined-variable
  1390         4           20      5.0      0.0          endpoint = (endpoint & ~ENDPOINT_DIR_MASK) | ENDPOINT_OUT
  1391                                                   # pylint: enable=undefined-variable
  1392         4      1381478 345369.5     41.5          data = (c_char * len(data))(*data)
  1393         4      1945835 486458.8     58.5          return self._bulkTransfer(endpoint, data, sizeof(data), timeout)

Example Code

Code with usb1

from usb1 import USBContext
import math
c = USBContext()
h = c.openByVendorIDAndProductID(0x03fd, 0x000d)
h.claimInterface(0)
h.setInterfaceAltSetting(0,1)
data = bytearray(b"\xf0\x0f" * int(math.ceil(0xb5db91/4.0)))
h.controlWrite(0x40, 0xb0, 0xb5A6, 0xdb91, b"")
h.bulkWrite(2,data,timeout=5000)

timeit line:

timeit.timeit('h.controlWrite(0x40, 0xb0, 0xb5A6, 0xdb91, b"");h.bulkWrite(2,data,timeout=5000)', setup='from usb1 import USBContext;c = USBContext();h = c.openByVendorIDAndProductID(0x03fd, 0x000d);h.claimInterface(0);h.setInterfaceAltSetting(0,1);import math;data = bytearray(b"\\xf0\\x0f"* (int(math.ceil(0xb5db91/4.0))));',number=20)

89.61355720099527 (Average 4.480677860049764s)

Code with pyusb

import usb.core
import math
dev = usb.core.find(idVendor=0x03fd, idProduct=0x000d)
dev.set_configuration()
cfg = dev.get_active_configuration()
intf = cfg[0,1]
intf.set_altsetting()
ep = intf.endpoints()[2]
data = bytearray(b"\xf0\x0f" * int(math.ceil(0xb5db91/4.0)))
dev.ctrl_transfer(0x40, 0xb0, 0xb5A6, 0xdb91, b"")
ep.write(data, timeout=5000)

Here is the timeit line

import timeit
timeit.timeit('dev.ctrl_transfer(0x40, 0xb0, 0xb5A6, 0xdb91, b"");ep.write(data, timeout=5000)', setup='import usb.core, math;dev = usb.core.find(idVendor=0x03fd, idProduct=0x000d);dev.set_configuration();cfg = dev.get_active_configuration();intf = cfg[0,1];intf.set_altsetting();ep = intf.endpoints()[2];data = bytearray(b"\\xf0\\x0f"*(int(math.ceil(0xb5db91/4.0))))', number=20)

63.37948983995011 (Average 3.1689744919975054s)

I can do a comparison with code in C directly calling libusb 1.0 if the benchmarks are useful.

Here is my code that I ran during the profiling: https://github.com/diamondman/proteusisc/blob/059d94e3625331c2786bd5a592388bd9f4caa893/proteusisc/drivers/xilinxPC1driver.py#L327-L333 (Please forgive the messiness. It is still alpha.)

@vpelletier (Owner)

I confirm: it looks like I picked the slowest way to initialise a char array with ctypes (variant c below), with about 10% of the cost apparently coming from *-expansion (which likely involves a first memory copy).
The b and d variants are not really legal.
The a variant raises when given a bytearray.
Could you apply the pattern used in the h & i variants (same code), passing both a string and a bytearray? I wonder whether the latter would also accelerate the libusb call itself.

$ pprofile --exclude-syspath perftest.py
Command line: ['perftest.py']
Total duration: 9.82347s
File: perftest.py
File duration: 9.80669s (99.83%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         2|  0.000184059|  9.20296e-05|  0.00%|from ctypes import *
(call)|         1|     0.004565|     0.004565|  0.05%|# /usr/lib/python2.7/ctypes/__init__.py:1 <module>
     2|         1|  1.78814e-05|  1.78814e-05|  0.00%|import gc
     3|         1|  8.10623e-06|  8.10623e-06|  0.00%|gc.disable()
     4|         0|            0|            0|  0.00%|
     5|        11|  0.000188828|  1.71661e-05|  0.00%|def foo(cType, *args):
     6|        10|      8.71335|     0.871335| 88.70%|    return cType(*args)
     7|         0|            0|            0|  0.00%|
     8|         1|  6.91414e-06|  6.91414e-06|  0.00%|bin_c_char_p = POINTER(c_char)
     9|         0|            0|            0|  0.00%|
    10|         1|   0.00163412|   0.00163412|  0.02%|data = 'a\x00' * 3 * 1024 * 1024
    11|         1|   0.00143409|   0.00143409|  0.01%|datab = bytearray(data)
    12|        11|  0.000162125|  1.47386e-05|  0.00%|for _ in xrange(10):
    13|        10|  0.000105143|  1.05143e-05|  0.00%|    data_len = len(data)
    14|        10|   0.00019455|   1.9455e-05|  0.00%|    string_64k = c_char * data_len
    15|         0|            0|            0|  0.00%|
    16|        10|   0.00020504|   2.0504e-05|  0.00%|    print 'a'
    17|        10|   0.00079608|   7.9608e-05|  0.01%|    a = create_string_buffer(data, data_len)
(call)|        10|     0.011301|    0.0011301|  0.12%|# /usr/lib/python2.7/ctypes/__init__.py:49 create_string_buffer
    18|         0|            0|            0|  0.00%|
    19|        10|  0.000130177|  1.30177e-05|  0.00%|    print 'b'
    20|        10|  0.000385761|  3.85761e-05|  0.00%|    b = cast(data, c_char_p)
(call)|        10|  0.000339746|  3.39746e-05|  0.00%|# /usr/lib/python2.7/ctypes/__init__.py:498 cast
    21|         0|            0|            0|  0.00%|
    22|        10|  0.000205994|  2.05994e-05|  0.00%|    print 'c'
    23|        10|      1.05584|     0.105584| 10.75%|    c = foo(string_64k, *data)
(call)|        10|      8.71353|     0.871353| 88.70%|# perftest.py:5 foo
    24|         0|            0|            0|  0.00%|
    25|        10|  0.000319004|  3.19004e-05|  0.00%|    print 'd'
    26|        10|  0.000457525|  4.57525e-05|  0.00%|    d = cast(data, bin_c_char_p)
(call)|        10|  0.000419378|  4.19378e-05|  0.00%|# /usr/lib/python2.7/ctypes/__init__.py:498 cast
    27|         0|            0|            0|  0.00%|
    28|        10|  0.000164986|  1.64986e-05|  0.00%|    print 'e'
    29|        10|  0.000295162|  2.95162e-05|  0.00%|    e = string_64k.from_buffer(datab)
    30|         0|            0|            0|  0.00%|
    31|        10|  0.000160933|  1.60933e-05|  0.00%|    print 'f'
    32|        10|   0.00928569|  0.000928569|  0.09%|    f = string_64k.from_buffer_copy(data)
    33|         0|            0|            0|  0.00%|
    34|        10|  0.000159025|  1.59025e-05|  0.00%|    print 'g'
    35|        10|   0.00987339|  0.000987339|  0.10%|    g = string_64k.from_buffer_copy(datab)
    36|         0|            0|            0|  0.00%|
    37|        10|  0.000179529|  1.79529e-05|  0.00%|    print 'h'
    38|        10|  0.000205278|  2.05278e-05|  0.00%|    try:
    39|        10|  0.000179768|  1.79768e-05|  0.00%|        h = string_64k.from_buffer(data)
    40|        10|  0.000183105|  1.83105e-05|  0.00%|    except TypeError:
    41|        10|   0.00976896|  0.000976896|  0.10%|        h = string_64k.from_buffer_copy(data)
    42|         0|            0|            0|  0.00%|
    43|        10|  0.000205755|  2.05755e-05|  0.00%|    print 'i'
    44|        10|  0.000215292|  2.15292e-05|  0.00%|    try:
    45|        10|  0.000190973|  1.90973e-05|  0.00%|        i = string_64k.from_buffer(datab)
    46|         0|            0|            0|  0.00%|    except TypeError:
    47|         0|            0|            0|  0.00%|        i = string_64k.from_buffer_copy(datab)
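The h/i pattern above can be wrapped in a small helper that accepts either an immutable bytes object or a mutable bytearray (the name `to_c_array` is illustrative, not from the library): try the zero-copy `from_buffer` first; ctypes raises TypeError for read-only buffers, so fall back to a single `from_buffer_copy`.

```python
from ctypes import c_char

def to_c_array(data):
    """Return a ctypes char array over `data`, copying only when forced to."""
    array_type = c_char * len(data)
    try:
        # Zero-copy: only works for writable buffers such as bytearray.
        return array_type.from_buffer(data)
    except TypeError:
        # Read-only input (e.g. bytes): one unavoidable copy.
        return array_type.from_buffer_copy(data)

buf = to_c_array(bytearray(b"abc"))  # shares memory with the bytearray
buf2 = to_c_array(b"abc")            # copies once
```

Either return value can be handed to a ctypes foreign function expecting a char pointer; the bytearray path is what makes zero-copy writes possible.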

@vpelletier (Owner)

I pushed a tentative patch doing this (and many more conversions to bytearray) in the "zerocopy" branch, which should improve performance:

  • without changes, to get the best performance for callers passing a buffer size as data (i.e., all read calls)
  • with changes, to get the best performance for callers passing a string as data (i.e., all write calls)

@diamondman (Author)

I will pull the branch and test.
I changed my code to generate a bytes object from C, so I no longer rely on bytearray.
It is an interesting issue that bytearray is easier to work with here. I know the C API has PyBytes_AsString (and the equivalent Python 2 function), which returns the internal buffer of a bytes object. I have not grasped the full code base yet, but maybe functions like _bulkTransfer could take the bytes/bytearray and use the object's buffer directly in the libusb1 function calls.

Anything else I can do to assist?

@diamondman (Author)

Ran it. Execution time dropped to 1.93 seconds. That is a drop of around a second and a half! Great job. The time is 1.93 s whether I send a bytes object or a bytearray, which was surprising because it looked like passing bytes to bulkWrite caused a copy.

I did notice that the data returned from bulkRead() is now a bytearray. This caused some issues in my code because a library I use (https://github.com/ilanschnell/bitarray) only takes bytes/strings. The easy fix is to copy it into a bytes object. This got me thinking about what data type should be returned from a read.

Both bytearray and bytes have benefits, but I cannot quite decide which data type should be returned. I am leaning towards the immutable string/bytes object, since that data will likely be processed, but I may be biased towards my use case. I may just add bytearray support to the library I am using.

@vpelletier (Owner)

With bytes, I have one more type to care about: a bytearray is needed when reading, as libusb will mutate the buffer. So I went with bytearrays everywhere.

With the current code, any value passed is ultimately converted into a bytearray (which may involve a memory copy), then into a ctypes pointer. In any case, the argument should be either some type of binary array (bytes, bytearray, or str with Python 2), or an integer (only when reading, of course; a long with Python 2).

The current code should already contain fewer memory copies thanks to this conversion alone, so you should benefit from it even without changes to caller code.
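The reason reads need the mutable type can be shown with stdlib ctypes alone: `from_buffer` refuses read-only objects, and libusb writes received bytes straight into the buffer through a C pointer. A standalone sketch (simulating libusb's write with `memmove`; not python-libusb1 code):

```python
from ctypes import addressof, c_char, memmove

recv = bytearray(8)  # the kind of buffer a read call would hand out
c_view = (c_char * len(recv)).from_buffer(recv)  # zero-copy alias

# Simulate libusb filling the buffer through the raw C pointer:
memmove(addressof(c_view), b"\x01\x02\x03\x04", 4)

print(recv[:4])  # bytearray(b'\x01\x02\x03\x04') -- mutation visible, no copy
```

A bytes object cannot play this role: `from_buffer` raises TypeError on it because its memory is read-only.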

But pypy does not like it, per the latest Travis CI results; I have to fix this.

@diamondman (Author)

diamondman commented Oct 16, 2016

Since bytearray is used internally because you modify the buffer, that settles it :). As I said above, I am already benefiting from the speed increase.
Thanks for being so responsive on both issues I have opened up. I really appreciate your work.

@vpelletier (Owner)

Ah, indeed, if a caller is incompatible with bytearray, this breaks compatibility. Mmh.

On one hand, I hate breaking backward compatibility. On the other hand, zero-copy is precious: a higher level may make copies where acceptable, but if the lower level makes copies, the caller can't do much about it.

I'll need to think more about it.

About the improved performance, would you know where the next hot spot is? Still somewhere in libusb (be it in the Python wrapper or in libusb itself), or elsewhere? I'm especially curious about the (lack of) improvement in _bulkTransfer's total execution time.

Always happy to get feedback and bug reports on python-libusb1.

About the Debian package request, I just bumped the RFP/ITP, as I did not expect it to take this long. Maybe my module is not as easy to package as I thought.

@diamondman (Author)

I am personally fine with the backwards-compatibility 'break', since it only bites callers that fail to handle bytes and bytearray the same way. I will work around my case by fixing the bitarray library to accept both bytes and bytearray.

On performance, were you hoping the _bulkTransfer time would be lower? A drop to 1.9 s is already a great benefit to me, and I would feel petulant asking for more. (If you are interested, my tool currently takes 4.1 seconds to run, so a little less than half of that is spent in usb1.)

But I am always for faster execution times! When I get up in the morning, I can run your pprofile (it seems better than kernprof) on my tool and profile any methods you would like. I believe the current targets are bulkWrite/bulkRead and _bulkTransfer. Let me know if I should profile anything else, and I will happily provide the report. Perhaps the family of create_XXXXX_buffer functions? I can also try transferring this data to the device in a C program to see the theoretical limit of libusb, and we can work from there.

@vpelletier (Owner)

My interest is that "absolute" performance is hard to put in perspective. If python-libusb1 now takes a negligible amount of time compared to the time actually spent in libusb1, there is little to gain by optimising further.

On a higher level, if you are not doing a one-shot transfer but something recurring, you may want to try the async API: set up a handful of transfer objects and define a function to call on completion, which will process received data (ideally not spending too long, maybe piping it to another process) or push the next data chunk when sending. This way, you can keep the USB bus busy even while the CPU is busy resubmitting transfers. This is what I do in my USB protocol analyser driver, and I can receive 43MB/s with the bus and/or the device being the bottleneck (USB 2.0 device).

@diamondman (Author)

Testing now. Your current patch transfers my data in 1.93 seconds.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1405                                               def bulkWrite(self, endpoint, data, timeout=0):
  1418                                                   # pylint: disable=undefined-variable
  1419         4           13      3.2      0.0          endpoint = (endpoint & ~ENDPOINT_DIR_MASK) | ENDPOINT_OUT
  1420                                                   # pylint: enable=undefined-variable
  1421         4         2745    686.2      0.1          data, _ = create_initialised_buffer(data)
  1422         4      1946897 486724.2     99.9          return self._bulkTransfer(endpoint, data, sizeof(data), timeout)

  1398                                               def _bulkTransfer(self, endpoint, data, length, timeout):
  1399         8           19      2.4      0.0          transferred = c_int()
  1400         8           14      1.8      0.0          mayRaiseUSBError(libusb1.libusb_bulk_transfer(
  1401         8      1946531 243316.4    100.0              self.__handle, endpoint, data, length, byref(transferred), timeout,
  1402                                                   ))
  1403         8           31      3.9      0.0          return transferred.value

Looks like your patch moved nearly all the execution time to the libusb1.0 library.

I did a test in native C, and the same bulk write took 1.58 s to transfer. Most of the code is uninteresting, but here is the part that does the transfer.

  gettimeofday(&start, NULL);
  r = libusb_bulk_transfer(dev_handle, (2 | LIBUSB_ENDPOINT_OUT),
               data, bytecount, &actual, 0);
  gettimeofday(&end, NULL);
  double delta = ((end.tv_sec  - start.tv_sec) * 1000000u +
       end.tv_usec - start.tv_usec) / 1.e6;
  printf("Time is : %f\n",delta);

And this spat out
Time is : 1.582658

So there is a difference of 0.35 seconds.
This seems pretty reasonable, and I do not see any other place to trim it down. I do not know the performance difference between loading the library with ctypes from Python and having an accompanying C file with wrappers for each method that get the buffer to pass directly with PyBytes_AsString, though I assume the latter is tedious and annoying.

@vpelletier (Owner)

Looks like your patch moved nearly all the execution time to the libusb1.0 library.

Great news! Thanks for checking.

I do not know the performance difference between loading the library with ctypes from Python and having an accompanying C file with wrappers for each method that get the buffer to pass directly with PyBytes_AsString, though I assume the latter is tedious and annoying.

I expect python-libusb1 to be close to this now: mutable buffers that ctypes can reference directly without copying, with the pointer then passed to C. So I expect no memory copies left inside it, counting create_initialised_buffer's optional memory copy as external to python-libusb1, since it adapts the caller's data for internal zero-copy.

@diamondman (Author)

I find it surprising that when I pass in bytes, the execution time of create_initialised_buffer is so short. I expected it to have a large copy time, since you now want a bytearray. As a bytearray, the average transfer time is 1.924 s, and with bytes the average is 1.928 s. Does that small a difference even make sense, lol?

This issue looks solved to me. When you get the patch working with pypy, I will close it. Thanks for the quick responses. It has been a pleasure working with you on this.

diamondman added a commit to diamondman/proteusisc that referenced this issue Oct 18, 2016
Added badges to README.
Execution time down to 5.5 seconds.
Supports new patch for libusb1 that addresses vpelletier/python-libusb1#21 which shaves another second and a half off of execution time.
With libusb1 patch, execution time down to 4.1 seconds.
@vpelletier (Owner)

FWIW, the pypy breakage only affects the ancient version used on Travis; tests pass on 5.6.0, for example (I think bytearray support in ctypes is fairly recent there):

$ pypy --version
Python 2.7.12 (5.6.0+dfsg-2, Nov 12 2016, 10:14:18)
[PyPy 5.6.0 with GCC 6.2.0 20161103]
$ pypy -W all -m usb1.testUSB1
.....usb1/__init__.py:2025: DeprecationWarning: Use "with USBContext() as context:" for safer cleanup on interpreter shutdown. See also USBContext.open().
  DeprecationWarning,
........
----------------------------------------------------------------------
Ran 13 tests in 0.020s

OK

(side note: that deprecation warning comes from a test knowingly testing the deprecated code path, as it was accidentally broken at some point)

So I released 1.6 with this change, along with moving code around to eventually reduce the top-level module namespace footprint while allowing the unit tests to ship with the code (this came out of a discussion with the Debian developer working on Debian packaging; the package should be able to enter Debian before the upcoming freeze).

@diamondman (Author)

Thanks for the work. I would love a link to that conversation about including tests in the distribution (if it is public), since I was having a similar discussion with users of a library I inherited.

@vpelletier (Owner)

No link sadly, as the discussion happened in meatspace.

A summary could be:

The initial idea was, at the very least, to run the tests when building the package. As a consequence (not sure whether it was intended), the tests ended up installed in Python's top-level module namespace in an earlier package version. I noticed it, and was shown that many third-party Python modules (at least the ones I have installed on my machine) contain some form of tests (a "tests" folder, test*.py, ...). This can help when users report issues on a specific install, as it makes it possible to tell them "post the output of python -m usb1.testUSB1" or so (which in turn also gives me more occasions and incentive to add tests...).

@diamondman (Author)

Thanks. I will look for further work and see what I can learn for my project. As always, pleasure working with you.
