# GPU-Jupyter

This JupyterLab instance is connected to the GPU via CUDA drivers. In this notebook, we test the installation and perform some basic operations on the GPU.

## Test GPU connection

#### The following commands should list the installed CUDA compiler version (`nvcc -V`) and the GPU type together with its NVIDIA driver version (`nvidia-smi`):

In [1]:
!nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0


In [3]:
!nvidia-smi

Fri Jan  8 16:06:27 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce GTX 1080    Off  | 00000000:0D:00.0 Off |                  N/A |
|  0%   51C    P5    30W / 215W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
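The same information can also be read programmatically. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper name `query_gpus` is hypothetical and not part of this notebook. It uses the `--query-gpu` and `--format=csv,noheader` options of `nvidia-smi` and returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def query_gpus():
    """Return one dict per GPU reported by nvidia-smi, or [] if unavailable."""
    # Hypothetical helper: bail out quietly when the tool is not installed.
    if shutil.which("nvidia-smi") is None:
        return []
    fields = ["name", "driver_version", "memory.total"]
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(fields)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    # Each output line holds the requested fields for one GPU, comma-separated.
    for line in result.stdout.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        gpus.append(dict(zip(fields, values)))
    return gpus


gpus = query_gpus()
print(gpus if gpus else "No GPU reported by nvidia-smi")
```

On a machine without an NVIDIA driver the sketch prints the fallback message instead of raising an error.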

#### Now, test if PyTorch can access the GPU via CUDA:

In [4]:
!pip install torch

Collecting torch
  Downloading torch-1.7.1-cp38-cp38-manylinux1_x86_64.whl (776.8 MB)
Installing collected packages: torch
Successfully installed torch-1.7.1


In [5]:
import torch
torch.cuda.is_available()

True

In [6]:
torch.__version__

'1.7.1'

In [7]:
import numpy as np
import torch
a = torch.rand(5, 3)
a

tensor([[0.1129, 0.7164, 0.9336],
        [0.0659, 0.8988, 0.0314],
        [0.6167, 0.2684, 0.0293],
        [0.0196, 0.7503, 0.2228],
        [0.0894, 0.9600, 0.7395]])

#### Now, test if TensorFlow can access the GPU via CUDA:

In [8]:
!pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.4.0-cp38-cp38-manylinux2010_x86_64.whl (394.8 MB)
Collecting gast==0.3.3
  Downloading gast-0.3.3-py2.py3-none-any.whl (9.7 kB)
Collecting absl-py~=0.10
  Downloading absl_py-0.11.0-py3-none-any.whl (127 kB)
Collecting astunparse~=1.6.3
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting flatbuffers~=1.12.0
  Downloading flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
Collecting google-pasta~=0.2
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting grpcio~=1.32.0
  Downloading grpcio-1.32.0-cp38-cp38-manylinux2014_x86_64.whl (3.8 MB)
Collecting h5py~=2.10.0
  Downloading h5py-2.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)

In [9]:
import tensorflow as tf
tf.test.is_built_with_cuda()

True

In [10]:
tf.__version__

'2.4.0'

In [11]:
gpus = tf.config.list_physical_devices('GPU')
print(f'Num GPUs Available: {len(gpus)}')
for gpu in gpus:
    print(f'Name: {gpu.name} Type: {gpu.device_type}')

Num GPUs Available: 1
Name: /physical_device:GPU:0 Type: GPU


In [12]:
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [13]:
tf.config.get_visible_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Performance test

#### Now we want to know how much faster a typical operation runs on the GPU. To find out, we perform the same operation in NumPy, PyTorch, PyTorch with CUDA, TensorFlow, and TensorFlow with CUDA. The test operation is the computation of the hat (projection) matrix H = X (XᵀX)⁻¹ Xᵀ, which maps the observed targets to the predictions in a linear regression.
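To make the test operation concrete, here is a small NumPy sketch on a 50 x 4 matrix instead of the 10000 x 256 matrix used in the benchmarks. The matrix size, the seed, and the names `X`, `y`, and `beta` are illustrative assumptions. It checks that the resulting matrix is symmetric and idempotent, and that applying it to a target vector reproduces the fitted values of a least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
X = rng.random((50, 4))          # small illustrative design matrix
y = rng.random(50)               # illustrative target vector

# Hat (projection) matrix: H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

# H maps observed targets to fitted values: y_hat = H y
y_hat = H @ y

# Fitted values from an ordinary least-squares solve, for comparison
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(H.shape)                        # one row and column per observation
print(np.allclose(H, H.T))            # symmetry check
print(np.allclose(H @ H, H))          # idempotence check
print(np.allclose(y_hat, X @ beta))   # agreement with the least-squares fit
```

Because H has one row and one column per observation, the benchmark below produces a 10000 x 10000 result, which is why the final matrix product dominates the run time.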

### 1) NumPy

In [14]:
x = np.random.rand(10000, 256)

In [15]:
%%timeit
H = x.dot(np.linalg.inv(x.transpose().dot(x))).dot(x.transpose())

319 ms ± 72.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 2) PyTorch

In [16]:
x = torch.rand(10000, 256)

In [17]:
%%timeit
# Calculate the projection matrix of x on the CPU
H = x.mm( (x.t().mm(x)).inverse() ).mm(x.t())

190 ms ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 3) PyTorch on GPU via CUDA

In [18]:
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    x = torch.rand(10000, 256, device=device) # directly create a tensor on GPU
    y = x.to(device)                       # or just use strings ``.to("cuda")``
    print(x[0:5, 0:5])
    print(y.to("cpu", torch.double)[0:5, 0:5])

tensor([[0.7207, 0.8733, 0.5622, 0.7909, 0.0353],
        [0.9294, 0.2660, 0.3631, 0.9747, 0.9972],
        [0.9132, 0.7926, 0.2502, 0.0940, 0.7459],
        [0.1416, 0.6732, 0.5302, 0.8952, 0.3213],
        [0.0222, 0.5688, 0.5545, 0.5368, 0.9695]], device='cuda:0')
tensor([[0.7207, 0.8733, 0.5622, 0.7909, 0.0353],
        [0.9294, 0.2660, 0.3631, 0.9747, 0.9972],
        [0.9132, 0.7926, 0.2502, 0.0940, 0.7459],
        [0.1416, 0.6732, 0.5302, 0.8952, 0.3213],
        [0.0222, 0.5688, 0.5545, 0.5368, 0.9695]], dtype=torch.float64)


In [19]:
%%timeit
# Calculate the projection matrix of x on the GPU
H = x.mm( (x.t().mm(x)).inverse() ).mm(x.t())

10.2 ms ± 6.03 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
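CUDA operations in PyTorch are queued asynchronously, so a timer that stops as soon as the Python call returns may not capture all of the GPU work. The following is a minimal sketch of a timing helper that accepts an optional synchronization callback. The name `time_op` and its parameters are hypothetical, and the demonstration uses a placeholder CPU workload so that it runs without a GPU.

```python
import time


def time_op(fn, sync=None, repeats=10):
    """Return the mean wall-clock seconds per call of fn.

    sync, if given, is called before the clock starts and again before it
    stops, so that asynchronously queued work is included in the measurement.
    """
    fn()  # warm-up call, excluded from the timing
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / repeats


# CPU-only demonstration with a placeholder workload
elapsed = time_op(lambda: sum(i * i for i in range(100_000)))
print(f"{elapsed * 1e3:.2f} ms per call")
```

With the GPU tensor `x` from the cell above, one could call `time_op(lambda: x.mm((x.t().mm(x)).inverse()).mm(x.t()), sync=torch.cuda.synchronize)` and compare the result with the `%%timeit` figure.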


### 4) TensorFlow

In [28]:
with tf.device("/cpu:0"):
    x = tf.random.uniform(shape=(10000, 256), minval=0, maxval=1)
    print(x[0:5, 0:5])

tf.Tensor(
[[0.81397283 0.61530554 0.68669045 0.10914683 0.22292495]
 [0.03711581 0.79037046 0.5876236  0.6046151  0.5394982 ]
 [0.5613158  0.6209928  0.20556688 0.09798014 0.62952113]
 [0.20849335 0.72676206 0.23844254 0.64522576 0.17406547]
 [0.00656402 0.99429834 0.09061527 0.9162439  0.29858232]], shape=(5, 5), dtype=float32)


In [30]:
%%timeit
with tf.device("/cpu:0"):
    op = tf.matmul(tf.matmul(x, tf.linalg.inv(tf.matmul(tf.transpose(x), x))), tf.transpose(x))
    #tf.print(op)

79.5 ms ± 477 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### 5) TensorFlow on GPU via CUDA

In [31]:
with tf.device("/gpu:0"):
    x = tf.random.uniform(shape=(10000, 256), minval=0, maxval=1)
    print(x[0:5, 0:5])

tf.Tensor(
[[0.29063284 0.08116019 0.43883002 0.3707354  0.86560535]
 [0.10458052 0.21598268 0.58840656 0.23835504 0.5701264 ]
 [0.5191399  0.77426255 0.21610951 0.6476507  0.7089114 ]
 [0.8885735  0.5961783  0.5665909  0.36262488 0.60498   ]
 [0.52788055 0.33618903 0.5600989  0.33179379 0.90417695]], shape=(5, 5), dtype=float32)


In [33]:
%%timeit
with tf.device("/gpu:0"):
    op = tf.matmul(tf.matmul(x, tf.linalg.inv(tf.matmul(tf.transpose(x), x))), tf.transpose(x))
    #tf.print(op)

10.2 ms ± 7.59 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
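The mean timings reported above can be turned into speedup factors relative to the NumPy run. The sketch below only copies the `%%timeit` means from this notebook and divides them. The comparison is rough: `np.random.rand` produces float64 data while `torch.rand` and `tf.random.uniform` default to float32, and the GPU timings were taken without explicit synchronization.

```python
# Mean timings in milliseconds, copied from the %%timeit outputs above
timings_ms = {
    "NumPy (CPU)": 319.0,
    "PyTorch (CPU)": 190.0,
    "PyTorch (GPU)": 10.2,
    "TensorFlow (CPU)": 79.5,
    "TensorFlow (GPU)": 10.2,
}

# Speedup of each variant relative to the NumPy baseline
baseline = timings_ms["NumPy (CPU)"]
speedups = {name: baseline / ms for name, ms in timings_ms.items()}

for name, ms in timings_ms.items():
    print(f"{name:<18} {ms:>7.1f} ms  {speedups[name]:>5.1f}x vs NumPy")
```

On this machine, both GPU variants come out roughly 31 times faster than the NumPy baseline, about 19 times faster than PyTorch on the CPU, and about 8 times faster than TensorFlow on the CPU.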
