<a href="https://colab.research.google.com/github/yektaKamane/GPU_Programming_Course/blob/main/HW3/original_files_profile.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TensorFlow with GPU

This notebook provides an introduction to computing on a [GPU](https://cloud.google.com/gpu) in Colab. In this notebook you will connect to a GPU, and then run some basic TensorFlow operations on both the CPU and a GPU, observing the speedup provided by using the GPU.


## Enabling and testing the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:

In [1]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


## Observe TensorFlow speedup on GPU relative to CPU

This example constructs a typical convolutional neural network layer over a
random image and manually places the resulting ops on either the CPU or the GPU
to compare execution speed.

In [2]:
%tensorflow_version 2.x
import tensorflow as tf
import timeit

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise SystemError('GPU device not found')

def cpu():
  with tf.device('/cpu:0'):
    random_image_cpu = tf.random.normal((100, 100, 100, 3))
    net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
    return tf.math.reduce_sum(net_cpu)

def gpu():
  with tf.device('/device:GPU:0'):
    random_image_gpu = tf.random.normal((100, 100, 100, 3))
    net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
    return tf.math.reduce_sum(net_gpu)
  
# We run each op once to warm up; see: https://stackoverflow.com/a/45067900
cpu()
gpu()

# Run the op several times.
print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
print(cpu_time)
print('GPU (s):')
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
3.694191392999983
GPU (s):
0.044143077000001085
GPU speedup over CPU: 83x


# Getting started by installing CUDA
Completely uninstall any previous CUDA versions. We need to refresh the CUDA installation on the cloud instance.
Then install CUDA version 9.2.

In [3]:
!apt-get --purge remove cuda nvidia* libnvidia-*
!dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 dpkg --purge
!apt-get remove cuda-*
!apt autoremove
!apt-get update

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'nvidia-kernel-common-418-server' for glob 'nvidia*'
Note, selecting 'nvidia-325-updates' for glob 'nvidia*'
Note, selecting 'nvidia-346-updates' for glob 'nvidia*'
Note, selecting 'nvidia-driver-binary' for glob 'nvidia*'
Note, selecting 'nvidia-331-dev' for glob 'nvidia*'
Note, selecting 'nvidia-304-updates-dev' for glob 'nvidia*'
Note, selecting 'nvidia-compute-utils-418-server' for glob 'nvidia*'
Note, selecting 'nvidia-384-dev' for glob 'nvidia*'
Note, selecting 'nvidia-libopencl1-346-updates' for glob 'nvidia*'
Note, selecting 'nvidia-fs-prebuilt' for glob 'nvidia*'
Note, selecting 'nvidia-driver-440-server' for glob 'nvidia*'
Note, selecting 'nvidia-340-updates-uvm' for glob 'nvidia*'
Note, selecting 'nvidia-dkms-450-server' for glob 'nvidia*'
Note, selecting 'nvidia-kernel-common' for glob 'nvidia*'
Note, selecting 'nvidia-kernel-source-440-server' for glob 'nvidia*'


In [4]:
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2

--2021-11-02 07:44:15--  https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64
Resolving developer.nvidia.com (developer.nvidia.com)... 152.195.19.142
Connecting to developer.nvidia.com (developer.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://developer.nvidia.com/compute/cuda/9.2/prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 [following]
--2021-11-02 07:44:15--  https://developer.nvidia.com/compute/cuda/9.2/prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64
Reusing existing connection to developer.nvidia.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://developer.download.nvidia.com/compute/cuda/9.2/secure/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb?9JZrAo5AfKegrYDalLtO0RQDH0vTcItAJJ0WSM_OkjYvfL8du0SP0cUDEs_LSda8nmPUZBiT5Y5Ry86Fqi1Ne41khfVf49yr926ocdynWqEHd

# Checking the version

In [5]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88


In [6]:
!pip install git+git://github.com/andreinechaev/nvcc4jupyter.git

Collecting git+git://github.com/andreinechaev/nvcc4jupyter.git
  Cloning git://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-1f8okg98
  Running command git clone -q git://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-1f8okg98
Building wheels for collected packages: NVCCPlugin
  Building wheel for NVCCPlugin (setup.py) ... done
  Created wheel for NVCCPlugin: filename=NVCCPlugin-0.0.2-py3-none-any.whl size=4305 sha256=e74c0887850dc97db38d5296da52b157c6d7ef0b5f6556d4e93dc9a073b97d97
  Stored in directory: /tmp/pip-ephem-wheel-cache-pygjelxj/wheels/c5/2b/c0/87008e795a14bbcdfc7c846a00d06981916331eb980b6c8bdf
Successfully built NVCCPlugin
Installing collected packages: NVCCPlugin
Successfully installed NVCCPlugin-0.0.2


In [7]:
%load_ext nvcc_plugin

created output directory at /content/src
Out bin /content/result.out


# Running CUDA code
A simple Hello World.
<br>Start the cell with %%cu to tell the notebook that the cell contains CUDA C/C++ code, which is compiled with nvcc instead of being run as Python.

In [8]:
%%cu
#include <iostream>

int main()
{
    std::cout << "Hello World\n";
    return 0;
}

Hello World



Compute the cube of each element of an array, using one GPU thread per element.<br>

**Quick reminder that you need to run all the code above in order to actually get access to the GPU**

In [9]:
%%cu
#include <stdio.h>

__global__ void cube(float * d_out, float * d_in){
  int idx = threadIdx.x;
  float f = d_in[idx];
  d_out[idx] = f * f * f;
}

int main(int argc, char ** argv) {
	const int ARRAY_SIZE = 96;
	const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

	// generate the input array on the host
	float h_in[ARRAY_SIZE];
	for (int i = 0; i < ARRAY_SIZE; i++) {
		h_in[i] = float(i);
	}
	float h_out[ARRAY_SIZE];

	// declare GPU memory pointers
	float * d_in;
	float * d_out;

	// allocate GPU memory
	cudaMalloc((void**) &d_in, ARRAY_BYTES);
	cudaMalloc((void**) &d_out, ARRAY_BYTES);

	// transfer the array to the GPU
	cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

	// launch the kernel
	cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

	// copy back the result array to the CPU
	cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

	// print out the resulting array
	for (int i =0; i < ARRAY_SIZE; i++) {
		printf("%f", h_out[i]);
		printf(((i % 4) != 3) ? "\t" : "\n");
	}

	cudaFree(d_in);
	cudaFree(d_out);

	return 0;
}

0.000000	1.000000	8.000000	27.000000
64.000000	125.000000	216.000000	343.000000
512.000000	729.000000	1000.000000	1331.000000
1728.000000	2197.000000	2744.000000	3375.000000
4096.000000	4913.000000	5832.000000	6859.000000
8000.000000	9261.000000	10648.000000	12167.000000
13824.000000	15625.000000	17576.000000	19683.000000
21952.000000	24389.000000	27000.000000	29791.000000
32768.000000	35937.000000	39304.000000	42875.000000
46656.000000	50653.000000	54872.000000	59319.000000
64000.000000	68921.000000	74088.000000	79507.000000
85184.000000	91125.000000	97336.000000	103823.000000
110592.000000	117649.000000	125000.000000	132651.000000
140608.000000	148877.000000	157464.000000	166375.000000
175616.000000	185193.000000	195112.000000	205379.000000
216000.000000	226981.000000	238328.000000	250047.000000
262144.000000	274625.000000	287496.000000	300763.000000
314432.000000	328509.000000	343000.000000	357911.000000
373248.000000	389017.000000	405224.000000	421875.000000
438976.000000	456533.00

# Homework 1
**Color to Greyscale Conversion**

A common way to represent color images is known as RGBA - the color
is specified by how much Red, Green, and Blue is in it.
The 'A' stands for Alpha and is used for transparency; it will be
ignored in this homework.

Each channel (Red, Green, Blue, and Alpha) is represented by one byte.
Since we are using one byte for each channel, there are 256 different
possible values for each channel. This means we use 4 bytes per pixel.

Greyscale images are represented by a single intensity value per pixel
which is one byte in size.

To convert an image from color to greyscale, one simple method is to
set the intensity to the average of the RGB channels. But we will
use a more sophisticated method that takes into account how the eye
perceives color and weights the channels unequally.

The eye responds most strongly to green followed by red and then blue.
The NTSC (National Television System Committee) recommends the following
formula for color to greyscale conversion:

**I = .299f * R + .587f * G + .114f * B**

Notice the trailing f's on the numbers which indicate that they are 
single precision floating point constants and not double precision
constants.
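
As a quick sanity check of the formula, the weighting can be sketched in plain Python before writing the kernel. The function name `grey_intensity` is made up for this illustration. Python computes in double precision rather than the single precision used on the GPU, so a result could differ from the CUDA code in rare rounding cases.

```python
def grey_intensity(r, g, b):
    """Weighted NTSC intensity for one pixel, truncated to an integer like a C cast."""
    return int(0.299 * r + 0.587 * g + 0.114 * b)


# Pure primaries show how unequally the three channels are weighted.
for name, pixel in [("red", (255, 0, 0)), ("green", (0, 255, 0)), ("blue", (0, 0, 255))]:
    print(name, grey_intensity(*pixel))
```

A fully green pixel comes out roughly twice as bright as a fully red one and about five times as bright as a fully blue one, which matches the ordering described above.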

You should fill in the kernel as well as set the block and grid sizes
so that the entire image is processed.
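
One common way to pick launch dimensions is to fix a block size and round the grid size up so that the blocks cover every pixel, then let each thread skip its work when it falls outside the image. The sketch below models that arithmetic in plain Python rather than CUDA. The 16x16 block size and the helper names are assumptions for illustration, not part of the assignment.

```python
def grid_dims(num_rows, num_cols, block_x=16, block_y=16):
    """Number of blocks along x (columns) and y (rows), rounded up."""
    grid_x = (num_cols + block_x - 1) // block_x
    grid_y = (num_rows + block_y - 1) // block_y
    return grid_x, grid_y


def covered_pixels(num_rows, num_cols, block_x=16, block_y=16):
    """Simulate every thread of the launch and collect the 1D offsets it would write."""
    grid_x, grid_y = grid_dims(num_rows, num_cols, block_x, block_y)
    offsets = []
    for block_idx_y in range(grid_y):
        for block_idx_x in range(grid_x):
            for thread_y in range(block_y):
                for thread_x in range(block_x):
                    col = block_idx_x * block_x + thread_x
                    row = block_idx_y * block_y + thread_y
                    # Threads in the partial blocks at the image edges do nothing.
                    if row < num_rows and col < num_cols:
                        offsets.append(row * num_cols + col)
    return offsets


print(grid_dims(100, 150))
offsets = covered_pixels(100, 150)
# Every pixel offset should appear exactly once.
print(sorted(offsets) == list(range(100 * 150)))
```

Without the bounds check, threads in the last row and column of blocks would index past the end of the image whenever its dimensions are not multiples of the block size.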

In [None]:
%%cu
#include <stdio.h>

void referenceCalculation(const uchar4* const rgbaImage,
                          unsigned char *const greyImage,
                          size_t numRows,
                          size_t numCols)
{
  for (size_t r = 0; r < numRows; ++r) {
    for (size_t c = 0; c < numCols; ++c) {
      uchar4 rgba = rgbaImage[r * numCols + c];
      float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
      greyImage[r * numCols + c] = channelSum;
    }
  }
}

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
  //TODO
  //Fill in the kernel to convert from color to greyscale
  //the mapping from components of a uchar4 to RGBA is:
  // .x -> R ; .y -> G ; .z -> B ; .w -> A
  //
  //The output (greyImage) at each pixel should be the result of
  //applying the formula: output = .299f * R + .587f * G + .114f * B;
  //Note: We will be ignoring the alpha channel for this conversion

  //First create a mapping from the 2D block and grid locations
  //to an absolute 2D location in the image, then use that to
  //calculate a 1D offset
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
  //You must fill in the correct sizes for the blockSize and gridSize
  //currently only one block with one thread is being launched
  const dim3 blockSize(1, 1, 1);  //TODO
  const dim3 gridSize( 1, 1, 1);  //TODO
  rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
  
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}

/tmp/tmpblcbv5mg/70724077-2202-4dcc-b134-f493c6ae5494.cu(45): error: identifier "checkCudaErrors" is undefined

1 error detected in the compilation of "/tmp/tmpxft_00004671_00000000-8_70724077-2202-4dcc-b134-f493c6ae5494.cpp1.ii".

The compilation fails because checkCudaErrors is not defined in this cell. Either define the helper before it is used, or inspect the error code returned by cudaGetLastError() directly.



# HW3 - class
Trying to run the assignment files here in Colab.

In [12]:
! ls
! nvcc --version
! nvcc -o add-vectors add-vectors.cu

add-vectors	cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb  src
add-vectors.cu	sample_data
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88


In [13]:
! ./add-vectors 5

    0.00 +     0.00 =     0.00
    1.00 +   100.00 =   101.00
    2.00 +   200.00 =   202.00
    3.00 +   300.00 =   303.00
    4.00 +   400.00 =   404.00


In [14]:
! ./add-vectors 50

    0.00 +     0.00 =     0.00
    1.00 +   100.00 =   101.00
    2.00 +   200.00 =   202.00
    3.00 +   300.00 =   303.00
    4.00 +   400.00 =   404.00
    5.00 +   500.00 =   505.00
    6.00 +   600.00 =   606.00
    7.00 +   700.00 =   707.00
    8.00 +   800.00 =   808.00
    9.00 +   900.00 =   909.00
   10.00 +  1000.00 =  1010.00
   11.00 +  1100.00 =  1111.00
   12.00 +  1200.00 =  1212.00
   13.00 +  1300.00 =  1313.00
   14.00 +  1400.00 =  1414.00
   15.00 +  1500.00 =  1515.00
   16.00 +  1600.00 =  1616.00
   17.00 +  1700.00 =  1717.00
   18.00 +  1800.00 =  1818.00
   19.00 +  1900.00 =  1919.00
   20.00 +  2000.00 =  2020.00
   21.00 +  2100.00 =  2121.00
   22.00 +  2200.00 =  2222.00
   23.00 +  2300.00 =  2323.00
   24.00 +  2400.00 =  2424.00
   25.00 +  2500.00 =  2525.00
   26.00 +  2600.00 =  2626.00
   27.00 +  2700.00 =  2727.00
   28.00 +  2800.00 =  2828.00
   29.00 +  2900.00 =  2929.00
   30.00 +  3000.00 =  3030.00
   31.00 +  3100.00 =  3131.00
   32.00

In [16]:
! ./add-vectors 10000
! ./add-vectors 10000000

In [17]:
! nvprof ./add-vectors 1000

==17510== NVPROF is profiling process 17510, command: ./add-vectors 1000
==17510== Profiling application: ./add-vectors 1000
==17510== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   47.44%  5.3430us         2  2.6710us  2.4310us  2.9120us  [CUDA memcpy HtoD]
                   26.71%  3.0080us         1  3.0080us  3.0080us  3.0080us  add_vectors(float*, float*, float*, int)
                   25.85%  2.9120us         1  2.9120us  2.9120us  2.9120us  [CUDA memcpy DtoH]
      API calls:   99.50%  208.06ms         3  69.353ms  2.6950us  208.05ms  cudaMalloc
                    0.21%  446.56us         1  446.56us  446.56us  446.56us  cuDeviceTotalMem
                    0.11%  226.01us        96  2.3540us     142ns  127.76us  cuDeviceGetAttribute
                    0.07%  154.88us         1  154.88us  154.88us  154.88us  cudaLaunchKernel
                    0.06%  132.37us         3  44.123us  3.1560us  120.20us  cuda

In [19]:
! nvprof ./add-vectors 100000000

==17540== NVPROF is profiling process 17540, command: ./add-vectors 100000000
==17540== Profiling application: ./add-vectors 100000000
==17540== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   65.14%  278.42ms         1  278.42ms  278.42ms  278.42ms  [CUDA memcpy DtoH]
                   25.94%  110.87ms         2  55.437ms  55.003ms  55.870ms  [CUDA memcpy HtoD]
                    8.92%  38.135ms         1  38.135ms  38.135ms  38.135ms  add_vectors(float*, float*, float*, int)
      API calls:   57.82%  428.48ms         3  142.83ms  55.065ms  317.42ms  cudaMemcpy
                   28.15%  208.61ms         3  69.537ms  702.62us  207.16ms  cudaMalloc
                   13.89%  102.97ms         3  34.322ms  579.87us  51.487ms  cudaFree
                    0.06%  469.26us         1  469.26us  469.26us  469.26us  cuDeviceTotalMem
                    0.04%  315.76us         1  315.76us  315.76us  315.76us  cudaLaunchKe
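
To put the copy times in perspective, they can be converted into an effective transfer rate. The sketch below assumes that the command-line argument is the number of elements and that each element is a 4-byte `float` (the kernel signature takes `float*`). The helper name is made up for this illustration, and the timings are the values reported by nvprof above.

```python
def throughput_gb_per_s(num_bytes, seconds):
    """Effective transfer rate in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / seconds / 1e9


n = 100_000_000
bytes_per_array = n * 4  # assumption: 4-byte floats

# Host-to-device: two input arrays, 55.437 ms each on average.
print(f"HtoD: {throughput_gb_per_s(bytes_per_array, 55.437e-3):.2f} GB/s")
# Device-to-host: one result array, 278.42 ms.
print(f"DtoH: {throughput_gb_per_s(bytes_per_array, 278.42e-3):.2f} GB/s")
```

In this run the two kinds of copy together account for about 91% of the GPU activity time, while the kernel itself takes 8.92%, so data transfer rather than computation dominates at this problem size.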