# 10 Minutes to cuDF and CuPy

This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as sparse matrix operations).

### Converting a cuDF Series or DataFrame to a CuPy Array

To convert a cuDF `DataFrame` to a CuPy `ndarray`, the most efficient route is the `dlpack` interface, which hands the data to CuPy on the GPU without a round trip through host memory.

In [1]:
import time

import numpy as np
import cupy
import cudf
from numba import cuda

In [2]:
nelem = 10000
df = cudf.DataFrame({'a':range(nelem),
                     'b':range(500, nelem+500),
                     'c':range(1000, nelem+1000)}
                   )

%time arr_cupy = cupy.fromDlpack(df.to_dlpack())
type(arr_cupy)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 609 µs




cupy.core.core.ndarray

The best way to convert a cuDF `Series` to a CuPy `ndarray` is either to pass the underlying Numba `DeviceNDArray` to `cupy.asarray` or to use the `dlpack` interface. We can also pass the `Series` itself to `cupy.asarray`, but as the timings below show, this is far slower (seconds rather than microseconds in this run).

In [3]:
%time cola_cupy = cupy.asarray(df['a'].data.mem)
%time cola_cupy = cupy.fromDlpack(df['a'].to_dlpack())
%time cola_cupy = cupy.asarray(df['a'])
type(cola_cupy)

CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 405 µs
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 570 µs
CPU times: user 2.84 s, sys: 60 ms, total: 2.9 s
Wall time: 2.9 s


cupy.core.core.ndarray

From here, we can proceed with normal CuPy workflows, such as reshaping the array or getting the diagonal.

In [4]:
reshaped_arr = cola_cupy.reshape(50,200)
reshaped_arr

array([[   0,    1,    2, ...,  197,  198,  199],
       [ 200,  201,  202, ...,  397,  398,  399],
       [ 400,  401,  402, ...,  597,  598,  599],
       ...,
       [9400, 9401, 9402, ..., 9597, 9598, 9599],
       [9600, 9601, 9602, ..., 9797, 9798, 9799],
       [9800, 9801, 9802, ..., 9997, 9998, 9999]])

In [5]:
reshaped_arr.diagonal()

array([   0,  201,  402,  603,  804, 1005, 1206, 1407, 1608, 1809, 2010,
       2211, 2412, 2613, 2814, 3015, 3216, 3417, 3618, 3819, 4020, 4221,
       4422, 4623, 4824, 5025, 5226, 5427, 5628, 5829, 6030, 6231, 6432,
       6633, 6834, 7035, 7236, 7437, 7638, 7839, 8040, 8241, 8442, 8643,
       8844, 9045, 9246, 9447, 9648, 9849])
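
As a quick sanity check on the output above: the diagonal of a row-major 50x200 reshape picks the elements at flat positions `i * 200 + i`, so every value is a multiple of 201. The plain-Python sketch below reproduces that arithmetic without a GPU. Nested lists stand in for the CuPy array, and `reshape_rows` is a hypothetical helper name used only for illustration.

```python
def reshape_rows(flat, nrows, ncols):
    """Row-major reshape of a flat list into nrows lists of length ncols."""
    if len(flat) != nrows * ncols:
        raise ValueError("cannot reshape: size mismatch")
    return [flat[r * ncols:(r + 1) * ncols] for r in range(nrows)]


flat = list(range(10000))           # same values as column 'a'
grid = reshape_rows(flat, 50, 200)  # stand-in for cola_cupy.reshape(50, 200)

# Element [i][i] sits at flat position i * ncols + i = 201 * i.
diagonal = [grid[i][i] for i in range(min(50, 200))]
print(diagonal[:5], diagonal[-1])
```

The diagonal has 50 entries because it is limited by the shorter dimension of the reshaped array.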

### Converting a CuPy Array to a cuDF DataFrame or Series

We can also convert a CuPy `ndarray` to a cuDF `DataFrame` or `Series`, using the same `dlpack` interface.

In [6]:
reshaped_df = cudf.from_dlpack(reshaped_arr.toDlpack())
print(reshaped_df.head())



   0  1  2  3  4  5  6 ...  199
0  1  1  1  1  1  1  1 ...    1
1  2  2  2  2  2  2  2 ...    2
2  3  3  3  3  3  3  3 ...    3
3  4  4  4  4  4  4  4 ...    4
4  5  5  5  5  5  5  5 ...    5
[192 more columns]


In [7]:
print(cudf.from_dlpack(reshaped_arr.diagonal().toDlpack()))

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
[40 more rows]
dtype: int64




### Converting a cuDF DataFrame to a CuPy Sparse Matrix

We can also convert a `DataFrame` (or `Series`) to a CuPy sparse matrix. A compressed sparse matrix is defined by three dense arrays (the nonzero values, their indices, and a set of pointer offsets), all of which we can create from an existing cuDF `DataFrame` or `Series`. As this conversion is more involved, we define a helper function for the task.

In [8]:
def cudf_to_cupy_sparse(data, sparseformat='column'):
    """Converts a cuDF object to a CuPy sparse matrix.
    
    Note: Can't currently support DataFrames/Series with nulls
    due to Boolean indexing issues
    """
    if sparseformat not in ('row', 'column'):
        raise NotImplementedError('Please choose between row or column format.')
    
    if isinstance(data, cudf.DataFrame):
        nonzero_cols = [data[x][data[x] != 0] for x in data.columns]
    elif isinstance(data, cudf.Series):
        nonzero_cols = [data[data != 0]]
    else:
        raise TypeError('Please pass a cuDF object.')
    
    # Concatenate the nonzero values and their row indices, column by column.
    non_zero_vals = cudf.concat(nonzero_cols)
    sparse_indices = cudf.concat([x.index for x in nonzero_cols])

    dense_pointer_offsets = [0]
    for i, col in enumerate(nonzero_cols):
        dense_pointer_offsets.append(dense_pointer_offsets[i] + col.shape[0])

    dense_pointer_offsets = cuda.to_device(dense_pointer_offsets)
    
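    # The three arrays are built column by column (CSC layout). Passing them
    # unchanged to csr_matrix treats each DataFrame column as a matrix row,
    # so the 'row' result is the transpose of the 'column' result.
    _matrix_constructor = cupy.sparse.csc_matrix
    if sparseformat == 'row':
        _matrix_constructor = cupy.sparse.csr_matrix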
        
    out_arr = _matrix_constructor(
        (cupy.asarray(non_zero_vals.data.mem),
         cupy.asarray(sparse_indices.gpu_values),
         cupy.asarray(dense_pointer_offsets)
        )
    )

    return out_arr
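
To make the three arrays concrete, here is a minimal CPU-only sketch of the same layout in plain Python. It is not the helper above and does not use cuDF or CuPy: `columns_to_csc` and `compressed_to_dense` are hypothetical names, and Python lists stand in for GPU arrays.

```python
def columns_to_csc(columns):
    """Build CSC arrays (values, row indices, column pointers) from dense columns."""
    values, row_indices, col_pointers = [], [], [0]
    for col in columns:
        for row, val in enumerate(col):
            if val != 0:
                values.append(val)
                row_indices.append(row)
        # Each pointer marks where the next column starts in `values`.
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


def compressed_to_dense(values, indices, pointers, minor_size):
    """Expand compressed arrays; each pointer slice becomes one dense line."""
    lines = []
    for k in range(len(pointers) - 1):
        line = [0] * minor_size
        for pos in range(pointers[k], pointers[k + 1]):
            line[indices[pos]] = values[pos]
        lines.append(line)
    return lines


# Three columns of four rows each; the last column is entirely zero.
columns = [[0, 3, 0, 0], [5, 0, 0, 7], [0, 0, 0, 0]]
values, row_indices, col_pointers = columns_to_csc(columns)
print(values)        # [3, 5, 7]
print(row_indices)   # [1, 0, 3]
print(col_pointers)  # [0, 1, 3, 3]

# Expanding the arrays recovers one line per original column.
lines = compressed_to_dense(values, row_indices, col_pointers, minor_size=4)
print(lines == columns)  # True
```

Read as CSC, each recovered line is a column of a 4x3 matrix. Read as CSR, the same lines are the rows of a 3x4 matrix, so identical arrays handed to a CSR constructor describe the transpose.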

We define a large, sparsely populated `DataFrame` (20 columns of 10 million rows, each with 5,000 nonzero values) to illustrate this conversion to either sparse row or column matrices.

In [23]:
df = cudf.DataFrame()
nelem = 10000000
nonzero = 5000
for i in range(20):
    arr = np.random.normal(5,5, nelem)
    arr[np.random.choice(arr.shape[0], nelem-nonzero, replace=False)] = 0
    df['a'+str(i)] = arr
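
To get a feel for how sparse this data is, the back-of-the-envelope calculation below compares the dense and compressed storage sizes. It is a rough sketch under stated assumptions: values are 8-byte floats (as produced by `np.random.normal`), and indices and pointers are assumed to be 4-byte integers, which may not match the dtypes used on a given setup.

```python
n_rows = 10_000_000      # nelem in the cell above
n_cols = 20
nonzero_per_col = 5_000  # nonzero in the cell above

VALUE_BYTES = 8  # float64 values
INDEX_BYTES = 4  # assumption: int32 indices and pointer offsets

dense_bytes = n_rows * n_cols * VALUE_BYTES

nnz = n_cols * nonzero_per_col
# One value and one index per nonzero, plus n_cols + 1 pointer offsets.
sparse_bytes = nnz * (VALUE_BYTES + INDEX_BYTES) + (n_cols + 1) * INDEX_BYTES

print(f"density: {nnz / (n_rows * n_cols):.4%}")
print(f"dense:   {dense_bytes / 1e6:.1f} MB")
print(f"sparse:  {sparse_bytes / 1e6:.3f} MB")
```

Under these assumptions the dense columns occupy about 1.6 GB, while the three compressed arrays need roughly 1.2 MB.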

In [24]:
%time sparse_df = cudf_to_cupy_sparse(df, sparseformat='row')
%time sparse_df = cudf_to_cupy_sparse(df, sparseformat='column')

CPU times: user 464 ms, sys: 392 ms, total: 856 ms
Wall time: 1.13 s
CPU times: user 396 ms, sys: 424 ms, total: 820 ms
Wall time: 1.1 s


From here, we can continue our workflow with a CuPy sparse matrix. For example, suppose we want the sum of each column in the matrix. Because most of the values are 0 and the sparse matrix stores only the nonzero entries, this operation is significantly faster than the equivalent `df.sum()` (roughly 80-100x faster in this example; 7.11 ms versus 593 ms in the run below).

In [26]:
%time colsums = sparse_df.sum(axis=0)
%time colsums = df.sum()

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 7.11 ms
CPU times: user 328 ms, sys: 268 ms, total: 596 ms
Wall time: 593 ms
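
One way to see where the speedup comes from is that a column sum over compressed columns reads only the stored nonzero values, while a dense sum reads every element. The plain-Python sketch below illustrates the idea on a small example. It is not how CuPy implements `sum`, and `csc_column_sums` is a hypothetical helper name.

```python
def csc_column_sums(values, col_pointers):
    """Sum each column by reading only its slice of stored nonzero values."""
    return [
        sum(values[col_pointers[j]:col_pointers[j + 1]])
        for j in range(len(col_pointers) - 1)
    ]


# Three dense columns of 1,000 rows with two nonzero entries each.
n_rows = 1000
columns = []
for j in range(3):
    col = [0.0] * n_rows
    col[10 * j] = 1.5
    col[500 + j] = 2.0
    columns.append(col)

# Compress: keep only the nonzero values plus per-column pointer offsets.
values, col_pointers = [], [0]
for col in columns:
    values.extend(v for v in col if v != 0)
    col_pointers.append(len(values))

sparse_sums = csc_column_sums(values, col_pointers)
dense_sums = [sum(col) for col in columns]

print(sparse_sums == dense_sums)  # same result either way
print(len(values), "values read instead of", n_rows * len(columns))
```

Both approaches give the same sums, but the compressed version touches 6 values rather than 3,000. The same ratio is far more extreme for the 10-million-row columns above.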
