Import the libraries we need. CuPy comes preinstalled on Google Colab.

In [1]:
import numpy as np
import cupy as cp

Arrays are created with the array() function, just like in NumPy, but the data lives on the GPU.

In [2]:
x_gpu = cp.array([1,2,3])
l2_gpu = cp.linalg.norm(x_gpu)
print(l2_gpu)


3.7416573867739413


In [3]:
x_cpu = np.array([1,2,3])
l2_cpu = np.linalg.norm(x_cpu)
print(l2_cpu)

3.7416573867739413


In [5]:
print(type(x_gpu))
print(type(l2_gpu))
print(type(x_cpu))
print(type(l2_cpu))

<class 'cupy.ndarray'>
<class 'cupy.ndarray'>
<class 'numpy.ndarray'>
<class 'numpy.float64'>
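
The types above show a difference worth noting: cp.linalg.norm() returns a zero-dimensional cupy.ndarray that still lives on the GPU, while NumPy returns a host scalar. The sketch below shows one way to get a plain Python number from either result. The helper name to_python_scalar is made up for illustration, and the fallback to NumPy when CuPy or a CUDA device is missing is an assumption added so the snippet runs anywhere.

```python
import numpy as np

try:
    import cupy as xp  # GPU arrays when CuPy and a CUDA device are available
    xp.cuda.runtime.getDeviceCount()  # raises if no usable device exists
except Exception:
    xp = np  # assumption: fall back to NumPy so the sketch still runs


def to_python_scalar(value):
    """Return a 0-d array or NumPy scalar as a plain Python number."""
    # For a CuPy array, item() copies the single value to the host.
    return value.item()


l2 = xp.linalg.norm(xp.array([1, 2, 3]))
l2_host = to_python_scalar(l2)
print(type(l2_host), l2_host)
```

Converting to a host scalar forces a device-to-host copy, so it is probably best done once, at the point where the value is really needed on the CPU.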


### Multiple GPUs?
No problem: CuPy lets you choose the device with the cp.cuda.Device context manager.

In [7]:
x_on_gpu0 = cp.array([1,2,3,4])

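# On a machine with only one GPU (device 0), selecting device 1 raises
# the error shown below, so these lines are left commented out:
#with cp.cuda.Device(1):
#  x_on_gpu1 = cp.array([1,2,3,4,5])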

CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
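
The "invalid device ordinal" error means the requested GPU index does not exist on this machine. One way to avoid it is to check how many devices are visible before selecting one. This is a minimal sketch: pick_device is a hypothetical helper, and falling back to device 0 is an assumed policy rather than anything CuPy does for you.

```python
try:
    import cupy as cp
    n_devices = cp.cuda.runtime.getDeviceCount()
except Exception:
    cp, n_devices = None, 0  # no CuPy or no usable CUDA device


def pick_device(requested, available):
    """Return `requested` if that ordinal exists, otherwise device 0."""
    return requested if 0 <= requested < available else 0


device_id = pick_device(1, n_devices)

if cp is not None:
    # Arrays created inside the block are allocated on the chosen device.
    with cp.cuda.Device(device_id):
        x = cp.array([1, 2, 3, 4, 5])
    print(f"array placed on GPU {x.device.id} of {n_devices}")
else:
    print("no GPU available; skipping device selection")
```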

## Data transfer to the GPU

cupy.asarray() can be used to move a numpy.ndarray, a list, or any other object that can be passed to numpy.array() to the current device.

In [10]:
x_cpu = np.array([1,2,3])
print(type(x_cpu))

<class 'numpy.ndarray'>


In [11]:
x_gpu = cp.asarray(x_cpu)
print(type(x_gpu))

<class 'cupy.ndarray'>


### Move array from device to the host

For this, the function cupy.asnumpy() or the array method get() can be used.

In [12]:
x_gpu = cp.array([1,2,3])
print(x_gpu)
print(type(x_gpu))



[1 2 3]
<class 'cupy.ndarray'>


In [13]:
x_cpu = cp.asnumpy(x_gpu)
print(x_cpu)
print(type(x_cpu))

[1 2 3]
<class 'numpy.ndarray'>


In [14]:
x_gpu = cp.array([1,2,3])
print(x_gpu)
print(type(x_gpu))

[1 2 3]
<class 'cupy.ndarray'>


In [15]:
x_cpu = x_gpu.get()
print(x_cpu)
print(type(x_cpu))

[1 2 3]
<class 'numpy.ndarray'>


### Operations between CPU and GPU
Remember to have your data on the same device when you perform an operation on several arrays.

In [16]:
x_cpu = np.array([1,2,3])
y_cpu = np.array([4,5,6])
x_cpu + y_cpu

array([5, 7, 9])

In [18]:
x_gpu = cp.asarray(x_cpu)

# Mixing a GPU array with a CPU array raises a TypeError:
# x_gpu + x_cpu

In [19]:
cp.asnumpy(x_gpu) + y_cpu

array([5, 7, 9])

In [20]:
x_gpu + cp.asarray(y_cpu)

array([5, 7, 9])
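
The two working variants above can be wrapped in a small helper that moves the second operand to wherever the first one lives. This is only a sketch: add_on_same_device is a hypothetical name, and the rule "follow the first argument" is an assumption chosen for illustration. The guarded import lets the snippet run without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable device exists
except Exception:
    cp = None  # assumption: without CuPy everything stays on the CPU


def add_on_same_device(a, b):
    """Add two arrays after moving `b` to the device that holds `a`."""
    if cp is not None and isinstance(a, cp.ndarray):
        return a + cp.asarray(b)  # result stays on the GPU
    if cp is not None and isinstance(b, cp.ndarray):
        b = cp.asnumpy(b)         # bring `b` back to the host
    return a + b                  # plain NumPy addition


x_cpu = np.array([1, 2, 3])
y_cpu = np.array([4, 5, 6])
print(add_on_same_device(x_cpu, y_cpu))
```

Every call with mixed inputs hides a transfer, so in performance-sensitive code it is likely better to place the arrays explicitly once and keep them there.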

### CuPy functions

CuPy supports most of the functions that NumPy provides. Here are a few examples.

In [22]:
a = cp.empty(10)
print(a)

[2.0e-323 2.5e-323 3.0e-323 0.0e+000 0.0e+000 0.0e+000 0.0e+000 0.0e+000
 0.0e+000 0.0e+000]


In [24]:
b = cp.ones_like(a)
print(b)

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [25]:
a = cp.zeros((4,4))
print(a)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [26]:
c = cp.reshape(a, a.size)
print(c)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [28]:
d = cp.random.randint(0,10, size=(4,4))
print(d)

[[2 1 8 5]
 [6 9 8 3]
 [3 3 4 7]
 [1 2 6 3]]


In [30]:
e = cp.diag(d,1)
print(e)

[1 8 7]


In [31]:
type(e)

cupy.ndarray

### Speed differences

Is there any difference in speed?

Create a random vector on the CPU:

In [33]:
%%timeit
x_cpu = np.random.randn(1000)

26.4 µs ± 671 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Create a random vector on the GPU:

In [34]:
%%timeit
x_gpu = cp.random.randn(1000)

16.3 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


What about creating the array on the CPU and transferring it to the GPU?

In [35]:
%%timeit
x_cpu = np.random.randn(1000)
x_gpu = cp.asarray(x_cpu)

72.7 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


How long does the copy itself take?

In [36]:
%%timeit
x_gpu = cp.asarray(x_cpu)

29 µs ± 871 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


One also has to consider that copying data back from the GPU takes time.

In [37]:
%%timeit
x_cpu = cp.asnumpy(x_gpu)

16.2 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
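
A caveat about timings like these: CuPy launches GPU kernels asynchronously, so a wall-clock measurement can stop before the GPU has finished its work. The sketch below waits for the device before reading the clock. The helper name timed and the repeat count are illustrative choices, and the NumPy fallback is an assumption so the snippet runs without a GPU.

```python
import time

import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable device exists
except Exception:
    cp = None  # assumption: time NumPy instead when no GPU is present


def timed(func, repeats=100):
    """Average wall-clock seconds per call, waiting for the GPU to finish."""
    func()  # warm-up call, not included in the measurement
    if cp is not None:
        cp.cuda.Stream.null.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    if cp is not None:
        cp.cuda.Stream.null.synchronize()  # wait for queued GPU work
    return (time.perf_counter() - start) / repeats


xp = cp if cp is not None else np
seconds = timed(lambda: xp.random.randn(1000))
print(f"{seconds * 1e6:.1f} µs per call")
```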


So it is best to do as much work as you can on the GPU and copy back only the final result.

Try to avoid copying the data back and forth between the CPU and the GPU.
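
As an illustration of this advice, the sketch below copies the input to the device once, does all intermediate work there, and brings back a single scalar. The function name standardized_norm and the computation itself are made up for this example, and the NumPy fallback is an assumption so the code also runs without a GPU.

```python
import numpy as np

try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable device exists
except Exception:
    xp = np  # assumption: fall back to NumPy when no GPU is present


def standardized_norm(values):
    """Standardize `values` and return the L2 norm as a host float."""
    x = xp.asarray(values, dtype=xp.float64)  # one host-to-device copy
    x = (x - x.mean()) / x.std()              # intermediate work stays on the device
    return float(xp.linalg.norm(x))           # only one scalar comes back


print(standardized_norm([1, 2, 3, 4]))
```

The intermediate arrays never leave the device; only the final float crosses back to the host.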