
Add frombytes to convert bytes-like to DeviceBuffer #253

Merged: 14 commits into rapidsai:branch-0.13 on Jan 27, 2020

Conversation

@jakirkham (Member) commented Jan 25, 2020

Makes it easier to build a DeviceBuffer object from an existing bytes-like object in pure Python code.

@jakirkham jakirkham requested a review from a team as a code owner January 25, 2020 23:59
@jakirkham jakirkham force-pushed the add_devbuf_frombytes branch 27 times, most recently from 7d043ba to 1da0aa3 on January 26, 2020 21:09
@jakirkham jakirkham changed the title Add frombytes to convert bytes to DeviceBuffer Add frombytes to convert bytes-like to DeviceBuffer Jan 26, 2020
Instead of strictly requiring that `frombytes` take a `bytes` object, allow
other `bytes`-like objects as well. This continues to support `bytes`,
but could allow things like `bytearray`s, `memoryview`s, NumPy arrays,
etc. Further, all of these are supported without needing to copy the
underlying data.
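
For illustration, a minimal sketch of how the new method can be called from Python. The inputs shown are just examples of bytes-like objects; NumPy is assumed to be installed.

import numpy as np
import rmm

# Any 1-D, contiguous buffer of unsigned char works without an extra host-side copy.
db1 = rmm.DeviceBuffer.frombytes(b"abc")                          # immutable bytes
db2 = rmm.DeviceBuffer.frombytes(bytearray(b"abc"))               # mutable bytearray
db3 = rmm.DeviceBuffer.frombytes(memoryview(b"abc"))              # memoryview
db4 = rmm.DeviceBuffer.frombytes(np.full(3, 97, dtype=np.uint8))  # NumPy array

print(db1.size)  # 3 (the bytes now reside in device memory)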
python/rmm/_lib/device_buffer.pyx
@@ -57,17 +59,36 @@ cdef class DeviceBuffer:
buf.c_obj = move(ptr)
return buf

@staticmethod
@cython.boundscheck(False)
cdef DeviceBuffer c_frombytes(const unsigned char[::1] b,
Contributor:
Does this handle memoryview objects in general? If so, it would be nice if we could have an overloaded non-const version to handle things like bytearray or other non-read-only objects.

Member Author (@jakirkham):

It handles memoryview objects as long as they are 1-D, contiguous, and of type unsigned char.

The const just means that we won't modify the underlying buffer (much like if we took const char* and were given char*). So any mutable object meeting the constraints above could be passed in as well. This just promises we won't modify the underlying data. IOW bytearray is permissible and we test for this below.

As to other memory types and layouts that don't meet these requirements, users could coerce them into something that does (with a minimal amount of copying to ensure contiguous data). I decided not to do that for the user, as I wanted them to know we are not preserving the type or layout of their data. They could, however, take the resulting DeviceBuffer and layer this information back on top by constructing some other array (like a CuPy or Numba array) and adding this info to that object.
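
A hedged sketch of that workflow: the contiguous coercion and the CuPy wrapping are user code, not part of this PR, and this assumes a CuPy build that consumes __cuda_array_interface__.

import numpy as np
import cupy
import rmm

# A host array that is neither uint8 nor contiguous...
a = np.arange(20, dtype=np.float64)[::2]

# ...is coerced by the user (one small host copy) into a contiguous unsigned-char view.
contig = np.ascontiguousarray(a)
db = rmm.DeviceBuffer.frombytes(contig.view(np.uint8))

# The original dtype/shape can then be layered back on top of the device memory.
d_bytes = cupy.asarray(db)                    # 1-D uint8 view of the DeviceBuffer
d_a = d_bytes.view(a.dtype).reshape(a.shape)  # same data, original type and layout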

Contributor:

Agreed with not implicitly making things contiguous at this level. I'm surprised that a bytearray is allowed for a const unsigned char[::1]. We previously ran into issues with using const vs. non-const when handling NumPy arrays that were marked read-only, which is what prompted me to question an overload.

Member Author (@jakirkham):

Ok cool.

Yeah, I recall some cases where Cython didn't work as well with const memoryviews. Issue ( cython/cython#1772 ) comes to mind (though that's with fused types). I'm not sure whether there were issues with const memoryviews on a specific type. Was that what you were thinking of?

Member Author (@jakirkham):

Note: After discussion offline, we decided to add one more test using a read-only NumPy array, which has been included in commit ( 2234e5a ).
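
For context, a minimal sketch of what such a test might look like (the actual test in 2234e5a may differ; DeviceBuffer.tobytes() is assumed here for the round trip):

import numpy as np
import rmm

def test_frombytes_readonly_numpy_array():
    a = np.arange(10, dtype=np.uint8)
    a.setflags(write=False)             # mark the host array read-only
    db = rmm.DeviceBuffer.frombytes(a)  # the const memoryview accepts it
    assert db.size == a.nbytes
    result = np.frombuffer(db.tobytes(), dtype=np.uint8)
    np.testing.assert_array_equal(result, a)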

@kkraus14 kkraus14 added the 2 - In Progress Currently a work in progress label Jan 27, 2020
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Jan 27, 2020
@kkraus14 kkraus14 merged commit 71ffd0d into rapidsai:branch-0.13 Jan 27, 2020
@jakirkham jakirkham deleted the add_devbuf_frombytes branch January 27, 2020 17:54
@jakirkham (Member Author):

Here's a rough benchmark.

In [1]: import numpy                                                            

In [2]: import numba.cuda                                                       

In [3]: import rmm                                                              

In [4]: rmm.reinitialize(pool_allocator=True, 
   ...:                  initial_pool_size=int(2**30))                          
Out[4]: 0

In [5]: hb = numpy.asarray(memoryview(50_000_000 * b"a"))                       

In [6]: %timeit numba.cuda.to_device(hb)                                        
19.6 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit rmm.DeviceBuffer.frombytes(hb)                                  
4.69 ms ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit numba.cuda.as_cuda_array(rmm.DeviceBuffer.frombytes(hb))        
5.22 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

IOW one gets a ~4x improvement using frombytes. The cost of converting to a Numba array afterwards is also pretty small (if that is needed for some applications).

@jakirkham (Member Author):

I should add that we may be able to speed this up further by using hugepages.

@kkraus14 (Contributor):

~5 ms still seems really high from my perspective; have you confirmed via a Python profile that there's no silly Python overhead?

@jakirkham (Member Author) commented Jan 29, 2020:

That's ~10 GB/s. What would you expect instead?

Edit: This is from host memory to device memory on a DGX-1.

@jakirkham (Member Author):

Reading the DGX-1's specs, they state, "Pascal also supports 16 lanes of PCIe 3.0. In DGX-1, these are used for connecting between the CPUs and GPUs." According to Wikipedia, that should give us a throughput of 15.75 GB/s. So it's possible to get ~1.5x faster, but it doesn't seem like we are way slower. Please feel free to correct me.
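
For reference, the back-of-the-envelope arithmetic behind those figures:

nbytes = 50_000_000                 # size of hb in the benchmark above
seconds = 4.69e-3                   # %timeit result for DeviceBuffer.frombytes
measured = nbytes / seconds / 1e9   # ~10.7 GB/s achieved
pcie3_x16_peak = 15.75              # GB/s, theoretical PCIe 3.0 x16 (per Wikipedia)
print(measured, pcie3_x16_peak / measured)  # ~10.7 GB/s, ~1.5x theoretical headroom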

@kkraus14 (Contributor):

According to Wikipedia, this should give us a throughput of 15.75 GB/s. So it's possible to get 1.5x faster, but it doesn't seem like we are way slower. Please feel free to correct me.

I brainfarted and went off by an order of magnitude 😄. In practice that tends to go more towards ~12 GB/s, so we're pretty close here, which looks good.

@jakirkham (Member Author):

No worries. It was good to work through the numbers. 🙂

Am curious how we might get closer to 12 GB/s (or even more) if possible. Do you have any ideas? 😉

@kkraus14 (Contributor):

No worries. It was good to work through the numbers. 🙂

Am curious how we might get closer to 12 GB/s (or even more) if possible. Do you have any ideas? 😉

I'm guessing that would involve lower-level things like pinned memory, using gdrcopy, etc.
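
As a hedged illustration of the pinned-memory idea (using Numba's pinned allocator; whether frombytes actually takes the faster copy path from pinned memory is not verified here, so treat this only as a sketch):

import numpy as np
import numba.cuda
import rmm

hb = np.frombuffer(50_000_000 * b"a", dtype=np.uint8)

# Page-locked (pinned) host staging buffer; host-to-device copies from pinned
# memory can run closer to the PCIe peak than copies from pageable memory.
pinned = numba.cuda.pinned_array(hb.shape, dtype=hb.dtype)
pinned[:] = hb

db = rmm.DeviceBuffer.frombytes(pinned)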
