# PyCUDA installation

In [0]:
!pip install pycuda

Collecting pycuda
  Downloading https://files.pythonhosted.org/packages/5e/3f/5658c38579b41866ba21ee1b5020b8225cec86fe717e4b1c5c972de0a33c/pycuda-2019.1.2.tar.gz (1.6MB)
Collecting pytools>=2011.2
  Downloading https://files.pythonhosted.org/packages/66/c7/88a4f8b6f0f78d0115ec3320861a0cc1f6daa3b67e97c3c2842c33f9c089/pytools-2020.1.tar.gz (60kB)
Collecting appdirs>=1.4.0
  Downloading https://files.pythonhosted.org/packages/56/eb/810e700ed1349edde4cbdc1b2a21e28cdf115f9faf263f6bbf8447c1abf3/appdirs-1.4.3-py2.py3-none-any.whl
Collecting mako
  Downloading https://files.pythonhosted.org/packages/50/78/f6ade1e18aebda570eed33b7c534378d9659351cadce2fcbc7b31be5f615/Mako-1.1.2-py2.py3-none-any.whl (75kB)
Building wheels for collected packages: pycuda, pytools
  Building wheel for pycuda (setup.py) ...



---



# Version #3: using ```gpuarray```


The initial setup code is the same as in Versions #1 and #2.

In [0]:
import numpy as np

# --- PyCUDA initialization
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit

########
# MAIN #
########

start = cuda.Event()
end   = cuda.Event()

N = 100000

h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)

h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)
h_c = np.empty_like(h_a)

This version uses the ```GPUArray``` class from the ```pycuda.gpuarray``` module. ```gpuarray.to_gpu()``` allocates device memory and copies a host array into it, ```d_c = (d_a + d_b)``` computes the elementwise sum on the device, and the ```.get()``` method copies the result back to the host. ```d_c``` needs no explicit declaration: it is allocated automatically when ```d_c = (d_a + d_b)``` executes.


In [0]:
d_a = gpuarray.to_gpu(h_a)
d_b = gpuarray.to_gpu(h_b)

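# --- Warmup execution, so that one-time setup costs are not timed
d_c = (d_a + d_b)

# --- Timed execution, bracketed by CUDA events
start.record()
d_c = (d_a + d_b)
end.record()
end.synchronize()
# time_till() returns milliseconds; convert to seconds
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))

# --- Copy the result back to the host
h_c = d_c.get()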

Processing time = 1.117813s
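
For a rough point of comparison, the same sum can be timed on the host. The sketch below is not part of the original notebook: it uses only NumPy and `time.perf_counter()`, and the names `h_ref` and `cpu_secs` are introduced here for illustration. It mirrors the warmup-then-measure pattern used above, but a single short run like this is noisy, so treat the number as indicative only.

```python
import time

import numpy as np

N = 100000  # same problem size as in the notebook

h_a = np.random.randn(1, N).astype(np.float32)
h_b = np.random.randn(1, N).astype(np.float32)

# Warmup run, mirroring the GPU version
h_ref = h_a + h_b

# Timed run using a host-side wall-clock timer
t0 = time.perf_counter()
h_ref = h_a + h_b
cpu_secs = time.perf_counter() - t0

print("CPU processing time = %fs" % cpu_secs)
```

Note that the GPU timing above covers only the device-side sum; the host-to-device and device-to-host copies are outside the timed region.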


This last part, which checks the device result against the sum computed on the host, is the same as in the previous versions.

In [0]:
if np.array_equal(h_c, h_a + h_b):
  print("Test passed!")
else:
  print("Error!")

cuda.Context.synchronize()

Test passed!
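
The exact comparison with `np.array_equal` works here because a single float32 addition should round identically on host and device. For other operations (for example, ones involving fused multiply-add or transcendental functions) the last bits may differ, and a tolerance-based check is usually safer. The following is a minimal sketch under that assumption; the helper `results_match` and its default tolerances are hypothetical choices, not part of the original notebook, and a host-side sum stands in for `d_c.get()` so that the snippet runs without a GPU.

```python
import numpy as np


def results_match(result, reference, rtol=1e-5, atol=1e-6):
    """Return True if both arrays have the same shape and agree within tolerance."""
    result = np.asarray(result)
    reference = np.asarray(reference)
    if result.shape != reference.shape:
        return False
    return bool(np.allclose(result, reference, rtol=rtol, atol=atol))


h_a = np.random.randn(1, 1000).astype(np.float32)
h_b = np.random.randn(1, 1000).astype(np.float32)

# Stand-in for the array returned by d_c.get()
h_c = h_a + h_b

if results_match(h_c, h_a + h_b):
    print("Test passed!")
else:
    print("Error!")
```

The tolerances should be chosen to suit the data type and the operation being checked; the values above are only a starting point for float32 data.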
