<a href="https://colab.research.google.com/github/yektaKamane/GPU_Programming_Course/blob/main/HW3/program-outputs-and-profiles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TensorFlow with GPU

This notebook provides an introduction to computing on a [GPU](https://cloud.google.com/gpu) in Colab. In this notebook you will connect to a GPU, and then run some basic TensorFlow operations on both the CPU and a GPU, observing the speedup provided by using the GPU.


## Enabling and testing the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:

In [1]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


## Observe TensorFlow speedup on GPU relative to CPU

This example constructs a typical convolutional neural network layer over a
random image and manually places the resulting ops on either the CPU or the GPU
to compare execution speed.

In [2]:
%tensorflow_version 2.x
import tensorflow as tf
import timeit

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise SystemError('GPU device not found')

def cpu():
  with tf.device('/cpu:0'):
    random_image_cpu = tf.random.normal((100, 100, 100, 3))
    net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
    return tf.math.reduce_sum(net_cpu)

def gpu():
  with tf.device('/device:GPU:0'):
    random_image_gpu = tf.random.normal((100, 100, 100, 3))
    net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
    return tf.math.reduce_sum(net_gpu)
  
# We run each op once to warm up; see: https://stackoverflow.com/a/45067900
cpu()
gpu()

# Run the op several times.
print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
print(cpu_time)
print('GPU (s):')
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
3.694191392999983
GPU (s):
0.044143077000001085
GPU speedup over CPU: 83x
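Because `timeit.timeit(..., number=10)` returns the total time for all ten calls, the per-run time is that total divided by ten. The short sketch below only redoes this arithmetic on the two timings printed above; the variable names are illustrative and not part of the original cell.

```python
# Timings copied from the recorded output above (sum of ten runs each).
cpu_total_s = 3.694191392999983
gpu_total_s = 0.044143077000001085
runs = 10

# timeit reports the total for all runs, so divide to get a per-run figure.
cpu_per_run_s = cpu_total_s / runs
gpu_per_run_s = gpu_total_s / runs

# The ratio is the same whether totals or per-run times are used.
speedup = cpu_total_s / gpu_total_s

print('CPU per run (s): {:.4f}'.format(cpu_per_run_s))
print('GPU per run (s): {:.6f}'.format(gpu_per_run_s))
print('Speedup before truncation: {:.2f}x'.format(speedup))
```

The notebook prints `int(cpu_time/gpu_time)`, which truncates the ratio, so the reported figure drops the fractional part.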


# Getting started by installing CUDA
Completely uninstall any previous CUDA versions; we need to refresh the cloud instance's CUDA installation.
Then install CUDA version 9.2.

In [3]:
!apt-get --purge remove cuda nvidia* libnvidia-*
!dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 dpkg --purge
!apt-get remove cuda-*
!apt autoremove
!apt-get update

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'nvidia-kernel-common-418-server' for glob 'nvidia*'
Note, selecting 'nvidia-325-updates' for glob 'nvidia*'
Note, selecting 'nvidia-346-updates' for glob 'nvidia*'
Note, selecting 'nvidia-driver-binary' for glob 'nvidia*'
Note, selecting 'nvidia-331-dev' for glob 'nvidia*'
Note, selecting 'nvidia-304-updates-dev' for glob 'nvidia*'
Note, selecting 'nvidia-compute-utils-418-server' for glob 'nvidia*'
Note, selecting 'nvidia-384-dev' for glob 'nvidia*'
Note, selecting 'nvidia-libopencl1-346-updates' for glob 'nvidia*'
Note, selecting 'nvidia-fs-prebuilt' for glob 'nvidia*'
Note, selecting 'nvidia-driver-440-server' for glob 'nvidia*'
Note, selecting 'nvidia-340-updates-uvm' for glob 'nvidia*'
Note, selecting 'nvidia-dkms-450-server' for glob 'nvidia*'
Note, selecting 'nvidia-kernel-common' for glob 'nvidia*'
Note, selecting 'nvidia-kernel-source-440-server' for glob 'nvidia*'


In [4]:
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2

--2021-11-02 07:44:15--  https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64
Resolving developer.nvidia.com (developer.nvidia.com)... 152.195.19.142
Connecting to developer.nvidia.com (developer.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://developer.nvidia.com/compute/cuda/9.2/prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 [following]
--2021-11-02 07:44:15--  https://developer.nvidia.com/compute/cuda/9.2/prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64
Reusing existing connection to developer.nvidia.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://developer.download.nvidia.com/compute/cuda/9.2/secure/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb?9JZrAo5AfKegrYDalLtO0RQDH0vTcItAJJ0WSM_OkjYvfL8du0SP0cUDEs_LSda8nmPUZBiT5Y5Ry86Fqi1Ne41khfVf49yr926ocdynWqEHd

# Checking the version

In [5]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88


In [6]:
!pip install git+git://github.com/andreinechaev/nvcc4jupyter.git

Collecting git+git://github.com/andreinechaev/nvcc4jupyter.git
  Cloning git://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-1f8okg98
  Running command git clone -q git://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-1f8okg98
Building wheels for collected packages: NVCCPlugin
  Building wheel for NVCCPlugin (setup.py) ... done
  Created wheel for NVCCPlugin: filename=NVCCPlugin-0.0.2-py3-none-any.whl size=4305 sha256=e74c0887850dc97db38d5296da52b157c6d7ef0b5f6556d4e93dc9a073b97d97
  Stored in directory: /tmp/pip-ephem-wheel-cache-pygjelxj/wheels/c5/2b/c0/87008e795a14bbcdfc7c846a00d06981916331eb980b6c8bdf
Successfully built NVCCPlugin
Installing collected packages: NVCCPlugin
Successfully installed NVCCPlugin-0.0.2


In [7]:
%load_ext nvcc_plugin

created output directory at /content/src
Out bin /content/result.out


# Running CUDA code
A simple Hello World.
<br>Start the code cell with %%cu to tell the notebook that the cell contains CUDA C/C++ code.

In [8]:
%%cu
#include <iostream>

int main()
{
    std::cout << "Hello World\n";
    return 0;
}

Hello World



# HW3 - class
Trying to run the assignment files here in Colab.

In [12]:
! ls
! nvcc --version
! nvcc -o add-vectors add-vectors.cu

add-vectors	cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb  src
add-vectors.cu	sample_data
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88


In [13]:
! ./add-vectors 5

    0.00 +     0.00 =     0.00
    1.00 +   100.00 =   101.00
    2.00 +   200.00 =   202.00
    3.00 +   300.00 =   303.00
    4.00 +   400.00 =   404.00


In [14]:
! ./add-vectors 50

    0.00 +     0.00 =     0.00
    1.00 +   100.00 =   101.00
    2.00 +   200.00 =   202.00
    3.00 +   300.00 =   303.00
    4.00 +   400.00 =   404.00
    5.00 +   500.00 =   505.00
    6.00 +   600.00 =   606.00
    7.00 +   700.00 =   707.00
    8.00 +   800.00 =   808.00
    9.00 +   900.00 =   909.00
   10.00 +  1000.00 =  1010.00
   11.00 +  1100.00 =  1111.00
   12.00 +  1200.00 =  1212.00
   13.00 +  1300.00 =  1313.00
   14.00 +  1400.00 =  1414.00
   15.00 +  1500.00 =  1515.00
   16.00 +  1600.00 =  1616.00
   17.00 +  1700.00 =  1717.00
   18.00 +  1800.00 =  1818.00
   19.00 +  1900.00 =  1919.00
   20.00 +  2000.00 =  2020.00
   21.00 +  2100.00 =  2121.00
   22.00 +  2200.00 =  2222.00
   23.00 +  2300.00 =  2323.00
   24.00 +  2400.00 =  2424.00
   25.00 +  2500.00 =  2525.00
   26.00 +  2600.00 =  2626.00
   27.00 +  2700.00 =  2727.00
   28.00 +  2800.00 =  2828.00
   29.00 +  2900.00 =  2929.00
   30.00 +  3000.00 =  3030.00
   31.00 +  3100.00 =  3131.00
   32.00

In [16]:
! ./add-vectors 10000
! ./add-vectors 10000000

In [17]:
! nvprof ./add-vectors 1000

==17510== NVPROF is profiling process 17510, command: ./add-vectors 1000
==17510== Profiling application: ./add-vectors 1000
==17510== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   47.44%  5.3430us         2  2.6710us  2.4310us  2.9120us  [CUDA memcpy HtoD]
                   26.71%  3.0080us         1  3.0080us  3.0080us  3.0080us  add_vectors(float*, float*, float*, int)
                   25.85%  2.9120us         1  2.9120us  2.9120us  2.9120us  [CUDA memcpy DtoH]
      API calls:   99.50%  208.06ms         3  69.353ms  2.6950us  208.05ms  cudaMalloc
                    0.21%  446.56us         1  446.56us  446.56us  446.56us  cuDeviceTotalMem
                    0.11%  226.01us        96  2.3540us     142ns  127.76us  cuDeviceGetAttribute
                    0.07%  154.88us         1  154.88us  154.88us  154.88us  cudaLaunchKernel
                    0.06%  132.37us         3  44.123us  3.1560us  120.20us  cuda

In [19]:
! nvprof ./add-vectors 100000000

==17540== NVPROF is profiling process 17540, command: ./add-vectors 100000000
==17540== Profiling application: ./add-vectors 100000000
==17540== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   65.14%  278.42ms         1  278.42ms  278.42ms  278.42ms  [CUDA memcpy DtoH]
                   25.94%  110.87ms         2  55.437ms  55.003ms  55.870ms  [CUDA memcpy HtoD]
                    8.92%  38.135ms         1  38.135ms  38.135ms  38.135ms  add_vectors(float*, float*, float*, int)
      API calls:   57.82%  428.48ms         3  142.83ms  55.065ms  317.42ms  cudaMemcpy
                   28.15%  208.61ms         3  69.537ms  702.62us  207.16ms  cudaMalloc
                   13.89%  102.97ms         3  34.322ms  579.87us  51.487ms  cudaFree
                    0.06%  469.26us         1  469.26us  469.26us  469.26us  cuDeviceTotalMem
                    0.04%  315.76us         1  315.76us  315.76us  315.76us  cudaLaunchKe
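The two profiles above can be compared by asking what share of GPU activity time goes to memory transfers rather than to the kernel. The sketch below redoes that arithmetic in Python from the "GPU activities" rows printed by nvprof; the helper name `gpu_time_shares` and the dictionary layout are illustrative assumptions, not anything nvprof provides.

```python
def gpu_time_shares(activities_ms):
    """Return each activity's share of total GPU activity time, in percent."""
    total = sum(activities_ms.values())
    return {name: 100.0 * t / total for name, t in activities_ms.items()}

# "Time" column of the GPU activities rows, converted to milliseconds.
small_run = {'memcpy HtoD': 5.3430e-3, 'add_vectors': 3.0080e-3, 'memcpy DtoH': 2.9120e-3}
large_run = {'memcpy HtoD': 110.87, 'add_vectors': 38.135, 'memcpy DtoH': 278.42}

small_shares = gpu_time_shares(small_run)
large_shares = gpu_time_shares(large_run)

for label, shares in (('n = 1000', small_shares), ('n = 100000000', large_shares)):
    transfer = shares['memcpy HtoD'] + shares['memcpy DtoH']
    print('{}: kernel {:.2f}%, transfers {:.2f}%'.format(
        label, shares['add_vectors'], transfer))
```

In both runs the kernel accounts for the smaller part of the GPU activity time, and its share shrinks for the larger vector, which suggests that this program is dominated by host/device copies rather than by the addition itself.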

In [87]:
%%cu
#include <stdio.h>
#include <stdlib.h>   // atoi, malloc, free, EXIT_FAILURE
#include <cuda.h>

__global__ void add_matrices(
    float *c,      // out - pointer to result matrix c
    float *a,      // in  - pointer to summand matrix a
    float *b,      // in  - pointer to summand matrix b
    int m,         // in  - number of rows
    int n          // in  - number of columns
    )
{
    // Row-major indexing: each thread handles one matrix element
    int rowID = threadIdx.y + blockIdx.y * blockDim.y;  // row index
    int colID = threadIdx.x + blockIdx.x * blockDim.x;  // column index
    int elemID;                                         // flat element index

    // a_ij = a[i * n + j], where a is stored in row-major order
    if (rowID < m && colID < n) {
        elemID = colID + rowID * n;
        c[elemID] = a[elemID] + b[elemID];
    }
}

int main( int argc, char* argv[] ){
    // determine matrix dimensions
    int n = 10;      // default number of columns
    int m = 10;      // default number of rows

    if ( argc > 1 ){
        n = atoi( argv[1] );  // override default length
        if ( n <= 0 ){
            fprintf( stderr, "Matrix length must be positive\n" );
            return EXIT_FAILURE;
        }
        if (argc > 2){
            m = atoi( argv[2] );
            if (m <= 0 ){
               fprintf( stderr, "Matrix length must be positive\n" );
               return EXIT_FAILURE;
            }
        }
    }

    // determine matrix size in bytes
    const size_t matrix_size = (n * m) * sizeof( float );

    // declare pointers to matrices in host memory and allocate memory
    float *a, *b, *c;
    a = (float*) malloc( matrix_size );
    b = (float*) malloc( matrix_size );
    c = (float*) malloc( matrix_size );

    // declare pointers to matrices in device memory and allocate memory
    float *a_d, *b_d, *c_d;
    cudaMalloc( (void**) &a_d, matrix_size );
    cudaMalloc( (void**) &b_d, matrix_size );
    cudaMalloc( (void**) &c_d, matrix_size );

    // initialize matrices and copy them to device
    for ( int i = 0; i < n*m; i++ )
    {
        a[i] =   1.0 * i;
        b[i] = 100.0 * i;        
    }
    cudaMemcpy( a_d, a, matrix_size, cudaMemcpyHostToDevice );
    cudaMemcpy( b_d, b, matrix_size, cudaMemcpyHostToDevice );

    // do calculation on device
    dim3 block_size( 16, 16 );
    dim3 num_blocks( ( n - 1 + block_size.x ) / block_size.x, ( m - 1 + block_size.y ) / block_size.y );
                   
    add_matrices<<< num_blocks, block_size >>>( c_d, a_d, b_d, m, n );

    // retrieve result from device and store on host
    cudaMemcpy( c, c_d, matrix_size, cudaMemcpyDeviceToHost );

    // print results for matrices up to 100 x 100 (each line shows a row of a, b and c)
    if ( n <= 100 && m <= 100)
    {
        for ( int i = 0; i < m; i++ )
        {
            for (int j = 0; j < n; j++)
            {
                printf("%4.0f ", a[i*n + j]);
            }
            printf("  ");
            for (int j = 0; j < n; j++)
            {
                printf("%4.0f ", b[i*n + j]);
            }
            printf("  ");
            for (int j = 0; j < n; j++)
            {
                printf("%4.0f ", c[i*n + j]);
            }
            printf("\n");
            
        }
    }

    // cleanup and quit
    cudaFree( a_d );
    cudaFree( b_d );
    cudaFree( c_d );
    free( a );
    free( b );
    free( c );
  
    return 0;
}


   0    1    2    3    4    5    6    7    8    9      0  100  200  300  400  500  600  700  800  900      0  101  202  303  404  505  606  707  808  909 
  10   11   12   13   14   15   16   17   18   19   1000 1100 1200 1300 1400 1500 1600 1700 1800 1900   1010 1111 1212 1313 1414 1515 1616 1717 1818 1919 
  20   21   22   23   24   25   26   27   28   29   2000 2100 2200 2300 2400 2500 2600 2700 2800 2900   2020 2121 2222 2323 2424 2525 2626 2727 2828 2929 
  30   31   32   33   34   35   36   37   38   39   3000 3100 3200 3300 3400 3500 3600 3700 3800 3900   3030 3131 3232 3333 3434 3535 3636 3737 3838 3939 
  40   41   42   43   44   45   46   47   48   49   4000 4100 4200 4300 4400 4500 4600 4700 4800 4900   4040 4141 4242 4343 4444 4545 4646 4747 4848 4949 
  50   51   52   53   54   55   56   57   58   59   5000 5100 5200 5300 5400 5500 5600 5700 5800 5900   5050 5151 5252 5353 5454 5555 5656 5757 5858 5959 
  60   61   62   63   64   65   66   67   68   69   6000 6100 6200 630
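To make the indexing in `add_matrices` easier to follow, here is a rough CPU-side emulation in Python of the same launch: the grid size is the ceiling of each dimension divided by the block size, and every (block, thread) pair maps to one row and column. It is only a sketch that walks the threads sequentially; `ceil_div` and `add_matrices_emulated` are hypothetical names introduced for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling division, same as (a - 1 + b) / b in the CUDA code."""
    return (a + b - 1) // b

def add_matrices_emulated(a, b, m, n, block=(16, 16)):
    """Visit every thread of the launch in sequence and apply the kernel body."""
    grid_x = ceil_div(n, block[0])   # blocks along the columns
    grid_y = ceil_div(m, block[1])   # blocks along the rows
    c = [0.0] * (m * n)
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(block[1]):
                for tx in range(block[0]):
                    row = ty + by * block[1]
                    col = tx + bx * block[0]
                    # Threads outside the matrix do nothing (bounds check).
                    if row < m and col < n:
                        elem = col + row * n   # row-major flat index
                        c[elem] = a[elem] + b[elem]
    return c, (grid_x, grid_y)

m, n = 10, 10
a = [1.0 * i for i in range(m * n)]
b = [100.0 * i for i in range(m * n)]
c, grid = add_matrices_emulated(a, b, m, n)
print('grid:', grid)
print('first row of c:', c[:n])
```

For the default 10 x 10 case a single 16 x 16 block covers the whole matrix, so most of its 256 threads fail the bounds check and exit without writing anything.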

In [89]:
! nvcc -o matrix_adder matrix_adder.cu

In [92]:
! ./matrix_adder 6 50

   0    1    2    3    4    5      0  100  200  300  400  500      0  101  202  303  404  505 
   6    7    8    9   10   11    600  700  800  900 1000 1100    606  707  808  909 1010 1111 
  12   13   14   15   16   17   1200 1300 1400 1500 1600 1700   1212 1313 1414 1515 1616 1717 
  18   19   20   21   22   23   1800 1900 2000 2100 2200 2300   1818 1919 2020 2121 2222 2323 
  24   25   26   27   28   29   2400 2500 2600 2700 2800 2900   2424 2525 2626 2727 2828 2929 
  30   31   32   33   34   35   3000 3100 3200 3300 3400 3500   3030 3131 3232 3333 3434 3535 
  36   37   38   39   40   41   3600 3700 3800 3900 4000 4100   3636 3737 3838 3939 4040 4141 
  42   43   44   45   46   47   4200 4300 4400 4500 4600 4700   4242 4343 4444 4545 4646 4747 
  48   49   50   51   52   53   4800 4900 5000 5100 5200 5300   4848 4949 5050 5151 5252 5353 
  54   55   56   57   58   59   5400 5500 5600 5700 5800 5900   5454 5555 5656 5757 5858 5959 
  60   61   62   63   64   65   6000 6100 6200 630

In [97]:
! nvprof ./matrix_adder 10000 10000

==20139== NVPROF is profiling process 20139, command: ./matrix_adder 10000 10000
==20139== Profiling application: ./matrix_adder 10000 10000
==20139== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   68.95%  264.26ms         1  264.26ms  264.26ms  264.26ms  [CUDA memcpy DtoH]
                   28.29%  108.42ms         2  54.209ms  54.032ms  54.385ms  [CUDA memcpy HtoD]
                    2.76%  10.591ms         1  10.591ms  10.591ms  10.591ms  add_matrices(float*, float*, float*, int, int)
      API calls:   56.54%  384.62ms         3  128.21ms  54.103ms  275.99ms  cudaMemcpy
                   28.28%  192.36ms         3  64.120ms  850.27us  190.56ms  cudaMalloc
                   15.03%  102.21ms         3  34.071ms  593.89us  50.839ms  cudaFree
                    0.07%  497.81us         1  497.81us  497.81us  497.81us  cuDeviceTotalMem
                    0.05%  317.39us         1  317.39us  317.39us  317.39us  
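Since the program copies `n * m * sizeof(float)` bytes per matrix, the memcpy timings above can be turned into an approximate transfer rate. The sketch below does that arithmetic for the 10000 x 10000 run, assuming 4-byte floats and using the average time per copy from the nvprof table; the function name `bandwidth_gb_per_s` is illustrative.

```python
def bandwidth_gb_per_s(num_bytes, time_ms):
    """Approximate transfer rate in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (time_ms * 1e-3) / 1e9

n, m = 10000, 10000
matrix_bytes = n * m * 4   # assumes sizeof(float) == 4

# Average time per copy, taken from the "Avg" column above.
htod_gb_s = bandwidth_gb_per_s(matrix_bytes, 54.209)
dtoh_gb_s = bandwidth_gb_per_s(matrix_bytes, 264.26)

print('Bytes per matrix: {}'.format(matrix_bytes))
print('HtoD: {:.2f} GB/s'.format(htod_gb_s))
print('DtoH: {:.2f} GB/s'.format(dtoh_gb_s))
```

On these numbers the device-to-host copy is several times slower per byte than the host-to-device copies; the profile alone does not say why.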

In [98]:
! nvprof ./matrix_adder 1000000 1000000

tcmalloc: large alloc 18446744070800031744 bytes == (nil) @  0x7fa62ebe11e7 0x5564548470d0 0x7fa62dc12bf7 0x556454846eca
tcmalloc: large alloc 18446744070800031744 bytes == (nil) @  0x7fa62ebe11e7 0x5564548470e0 0x7fa62dc12bf7 0x556454846eca
tcmalloc: large alloc 18446744070800031744 bytes == (nil) @  0x7fa62ebe11e7 0x5564548470f0 0x7fa62dc12bf7 0x556454846eca
==20154== NVPROF is profiling process 20154, command: ./matrix_adder 1000000 1000000
==20154== Profiling application: ./matrix_adder 1000000 1000000
==20154== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   99.55%  196.06ms         3  65.352ms  1.0660us  196.05ms  cudaMalloc
                    0.25%  492.23us         1  492.23us  492.23us  492.23us  cuDeviceTotalMem
                    0.09%  180.38us        96  1.8780us     138ns  83.544us  cuDeviceGetAttribute
                    0.08%  156.87us         1  156.87us  156.87us  156.8
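The huge allocation size in the tcmalloc messages is consistent with an integer overflow in `(n * m) * sizeof( float )`: `n` and `m` are `int`, so their product is computed in 32 bits before being widened. The Python sketch below reproduces the number under the assumption of two's-complement wraparound (signed overflow is formally undefined behavior in C, so this models what appears to have happened here rather than a guaranteed result); `wrap_int32` and `requested_bytes` are hypothetical helper names.

```python
def wrap_int32(value):
    """Model a 32-bit signed int that wraps around on overflow."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value

def requested_bytes(n, m, sizeof_float=4):
    """Model (n * m) * sizeof(float) with int n, m and a 64-bit size_t result."""
    product = wrap_int32(n * m)                  # int * int, wraps to 32 bits
    return (product * sizeof_float) % (1 << 64)  # converted to unsigned 64-bit

overflowed = requested_bytes(1000000, 1000000)
intended = 1000000 * 1000000 * 4

print('wrapped product:', wrap_int32(1000000 * 1000000))
print('bytes requested:', overflowed)
print('bytes intended: ', intended)
```

The modeled request matches the 18446744070800031744 bytes reported by tcmalloc. Even without the overflow, the intended size is about 4 TB per matrix, far more than the host or the GPU could provide, so the run would still fail to allocate; casting one operand to `size_t` before multiplying would at least make the requested size correct.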

In [99]:
! nvprof ./matrix_adder 100000 500

==20173== NVPROF is profiling process 20173, command: ./matrix_adder 100000 500
==20173== Profiling application: ./matrix_adder 100000 500
==20173== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   65.64%  120.94ms         1  120.94ms  120.94ms  120.94ms  [CUDA memcpy DtoH]
                   31.38%  57.812ms         2  28.906ms  28.301ms  29.511ms  [CUDA memcpy HtoD]
                    2.98%  5.4819ms         1  5.4819ms  5.4819ms  5.4819ms  add_matrices(float*, float*, float*, int, int)
      API calls:   45.86%  201.51ms         3  67.169ms  475.79us  200.54ms  cudaMalloc
                   42.22%  185.52ms         3  61.841ms  28.415ms  127.42ms  cudaMemcpy
                   11.69%  51.387ms         3  17.129ms  431.37us  25.493ms  cudaFree
                    0.10%  453.08us         1  453.08us  453.08us  453.08us  cuDeviceTotalMem
                    0.07%  301.34us         1  301.34us  301.34us  301.34us  cu