*   Student: Sina Moghimi
*   University: [Moscow Institute of Physics and Technology (National Research University)](https://eng.mipt.ru)
*   Group Number: М01-108а
*   Area: 03.04.01 Applied Mathematics and Physics
*   Field: Neural Networks & Neural Computers





To make this notebook work, follow the steps below in order.

0- The input files are stored on Google Drive, so we first connect Colab to Google Drive.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


1- Refresh the package lists and remove packages that are no longer needed.

In [None]:
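# Refresh the package lists once (apt update and apt-get update do the same job)
!sudo apt-get update
# !apt-get install pciutils git curl build-essential
# -y answers the confirmation prompt, which a notebook cell cannot do interactively
!sudo apt-get -y autoremove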

from IPython.display import clear_output 
clear_output()

2- This task uses CUDA 10.1, but at the time of writing Colab ships CUDA 11.1 by default, so we first remove the `/usr/local/cuda` link that points to the default version.

In [None]:
%cd /usr/local
!sudo rm cuda

/usr/local


3- If the directory `/usr/local/cuda-10.1` does not exist, uncomment the commands below to install CUDA 10.1 first. Otherwise, skip this step and move on to step 4.

In [None]:
# !sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"
# 'Install CUDA 10.1 for Tesla K80'
# %cd /content/drive/MyDrive/Colab Notebooks/Installation File
# !pwd
# !wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
# !sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
# ## !wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
# !sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
# !sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
# !sudo apt-get update
# !sudo apt-get -y install cuda-10-1
# !sudo apt autoremove

4- Set CUDA 10.1 as the default version.

In [None]:
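# Absolute paths make the link independent of the current directory;
# -f and -n replace an existing link instead of failing on a re-run
!sudo ln -sfn /usr/local/cuda-10.1 /usr/local/cuda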
print('DONE!')

DONE!
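
The link created in step 4 can be verified from Python before moving on. The following is only a minimal sketch: `link_points_to` is a hypothetical helper name, and the demonstration runs in a throwaway directory that mimics `/usr/local` so that it works on any machine.

```python
import os
import tempfile

def link_points_to(link_path, expected_target):
    """True if `link_path` is a symlink that resolves to `expected_target`."""
    return os.path.islink(link_path) and (
        os.path.realpath(link_path) == os.path.realpath(expected_target)
    )

# Demonstration in a temporary directory that mimics /usr/local.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "cuda-10.1"))
    # Same idea as the `ln -s` command in step 4.
    os.symlink("cuda-10.1", os.path.join(root, "cuda"))
    ok = link_points_to(os.path.join(root, "cuda"),
                        os.path.join(root, "cuda-10.1"))
    print(ok)

# In the notebook, the real check would be:
# link_points_to("/usr/local/cuda", "/usr/local/cuda-10.1")
```

If the check returns `False`, the link is missing or points to another CUDA version, and steps 2 to 4 should be repeated.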


5- Check the CUDA and gcc versions.

In [None]:
!nvcc --version
!echo " "
!gcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
 
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
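
The version check can also be automated instead of being read by eye. The sketch below assumes the `nvcc --version` output has the format shown above; `parse_nvcc_release` is a hypothetical helper name, not part of any library.

```python
import re

def parse_nvcc_release(version_text):
    """Return the CUDA release (e.g. '10.1') found in `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", version_text)
    return match.group(1) if match else None

# Sample line copied from the output above. In the notebook the text could be
# captured with, for example:
#   subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
sample = "Cuda compilation tools, release 10.1, V10.1.243"
release = parse_nvcc_release(sample)
print(release)  # prints: 10.1
```

Comparing `release` with `"10.1"` gives an early warning when the link from step 4 points to a different CUDA version.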



6- Prepare NVCC to run within the notebook cells.

In [None]:
%cd /content/
# GitHub no longer serves the unauthenticated git:// protocol, so use https://
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git
%reload_ext nvcc_plugin

clear_output()
print('DONE!')

DONE!


7- To run our code, we need to copy the required files to the `/content/include` directory.

NOTE: All the required files are stored in the directory:\
/content/drive/MyDrive/HWs/CUDA/task1/task1/Files/

NOTE: Replace the directory above with the one where you stored the files on your own Drive.


In [None]:
%cd /content/drive/MyDrive/HWs/CUDA/task1/task1/Files
# -p keeps the cell re-runnable when the directory already exists
!mkdir -p /content/include
!cp input0.raw input1.raw output.raw wb.h /content/include
print('DONE!')

/content/drive/.shortcut-targets-by-id/1IjpD87Nmoz-Jorw5N5Q6zc-eyuUC-HcY/task1/Files
DONE!
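
Before compiling, it can help to confirm that the copy in step 7 actually succeeded. The following is a minimal sketch: `missing_files` and `REQUIRED_FILES` are hypothetical names, and the file list is taken from the `cp` command above.

```python
from pathlib import Path

# File names taken from the cp command in step 7.
REQUIRED_FILES = ("input0.raw", "input1.raw", "output.raw", "wb.h")

def missing_files(directory, names=REQUIRED_FILES):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

missing = missing_files("/content/include")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are in place.")
```

A non-empty result usually means that the Drive path in the `%cd` command does not match the location of the files.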


8- It is useful to handle CUDA errors, so we write a small header file that checks the result of every CUDA call.

In [None]:
%%cuda --name handle_error.cu
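#include <cuda_runtime.h>  // cudaError_t, cudaSuccess, cudaGetErrorString
#include <cstdio>          // fprintf
#include <cstdlib>         // exit

// Wrap every CUDA runtime call, e.g. checkGPUError(cudaMalloc(...));
#define checkGPUError(res) {gpuAssert(res, __FILE__, __LINE__);}
// Print the CUDA error string with file and line, then exit unless abort is false.
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true){
    if (code != cudaSuccess){
        fprintf(stderr, "GPU assert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if(abort) exit(code);
    }
}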

'File written in /content/src/handle_error.cu'

9- Now we write the code. (Re-run the cell below after every change so that the changes are saved to the file.)

In [None]:
%%cuda --name main.cu

#include "/content/include/wb.h"
#include "handle_error.cu"
#include <stdio.h>

__global__ void vecAdd(const float *in1, const float *in2, float *out, int len) {
  //@@ Paste in the vector addition code
  // DONE!
  size_t L1_index{threadIdx.x + blockDim.x*(threadIdx.y + blockDim.y*threadIdx.z)};
  size_t L2_index{blockIdx.x + gridDim.x*(blockIdx.y + gridDim.y*blockIdx.z)};
  size_t block_size{blockDim.x*blockDim.y*blockDim.z};
  size_t index{L1_index + block_size*L2_index};
  // Threads beyond the end of the vectors must not touch memory.
  if (index < static_cast<size_t>(len)) out[index] = in1[index] + in2[index];
}


int main(int argc, char **argv) {

  int inputLength{};
  float *hostInput1{};
  float *hostInput2{};
  float *hostOutput{};
  float *deviceInput1{};
  float *deviceInput2{};
  float *deviceOutput{};

  argc = 4;
  argv = new char*[argc];
  argv[0] = NULL;
  argv[1] = const_cast<char *>("/content/include/input0.raw");
  argv[2] = const_cast<char *>("/content/include/input1.raw");
  argv[3] = const_cast<char *>("/content/include/output.raw");
  wbArg_t args{wbArg_read(argc, argv)};

  wbTime_start(Generic, "Importing data and creating memory on host");
  hostInput1 = (float *)wbImport(wbArg_getInputFile(args, 0), &inputLength);
  hostInput2 = (float *)wbImport(wbArg_getInputFile(args, 1), &inputLength);
  hostOutput = new float[inputLength];
  wbTime_stop(Generic, "Importing data and creating memory on host");

  wbLog(TRACE, "The input length is ", inputLength);

  wbTime_start(GPU, "Allocating GPU memory.");
  //@@ Allocating GPU memory code
  // DONE!
  size_t size_in_byte{inputLength * sizeof(float)};
  checkGPUError(cudaMalloc((void **)&deviceInput1, size_in_byte));
  checkGPUError(cudaMalloc((void **)&deviceInput2, size_in_byte));
  checkGPUError(cudaMalloc((void **)&deviceOutput, size_in_byte));

  wbTime_stop(GPU, "Allocating GPU memory.");

  wbTime_start(GPU, "Copying input memory to the GPU.");
  //@@ Paste your code
  // DONE!
  checkGPUError(cudaMemcpy(deviceInput1, hostInput1, size_in_byte, cudaMemcpyHostToDevice));
  checkGPUError(cudaMemcpy(deviceInput2, hostInput2, size_in_byte, cudaMemcpyHostToDevice));
  // deviceOutput is written by the kernel, so it needs no initial copy.

  wbTime_stop(GPU, "Copying input memory to the GPU.");
  //@@ Initialize grid and block dimensions
  // DONE!
  // A block holds at most 1024 threads, so use a fixed block size and
  // round the number of blocks up so that every element is covered.
  dim3 block_config(256, 1, 1);
  dim3 grid_config((inputLength + block_config.x - 1) / block_config.x, 1, 1);
  
  wbTime_start(Compute, "Performing CUDA computation");
  //@@ run GPU kernel
  // DONE!
  vecAdd<<<grid_config, block_config>>>(deviceInput1, deviceInput2, deviceOutput, inputLength);
  // Report launch errors such as an invalid launch configuration.
  checkGPUError(cudaGetLastError());

  checkGPUError(cudaDeviceSynchronize());
  
  wbTime_stop(Compute, "Performing CUDA computation");

  wbTime_start(Copy, "Copying output memory to the CPU");
  //@@ copy memory from GPU back to host
  // DONE!
  checkGPUError(cudaMemcpy(hostOutput, deviceOutput, size_in_byte, cudaMemcpyDeviceToHost));
  wbTime_stop(Copy, "Copying output memory to the CPU");

  wbTime_start(GPU, "Freeing GPU Memory");
  //@@ code to free memory on GPU
  // DONE!
  checkGPUError(cudaFree(deviceInput1));
  checkGPUError(cudaFree(deviceInput2));
  checkGPUError(cudaFree(deviceOutput));

  wbTime_stop(GPU, "Freeing GPU Memory");

  wbSolution(args, hostOutput, inputLength);

  free(hostInput1);
  free(hostInput2);
  delete[] hostOutput;  // allocated with new[], so release with delete[]
  delete[] argv;

  cudaDeviceReset();

  return 0;
}


'File written in /content/src/main.cu'
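
The index arithmetic in `vecAdd` can be replayed on the CPU to see which elements a launch covers. The sketch below is plain Python, not CUDA. It uses the 1D form that the kernel's 3D formula reduces to when the y and z dimensions are 1, and it shows how a ceiling division picks the number of blocks for a fixed block size. The block size of 256 and the helper names are illustrative assumptions. CUDA limits a block to at most 1024 threads on current GPUs, which is why longer vectors need more than one block.

```python
def blocks_needed(length, block_size):
    """Ceiling division: smallest number of blocks that covers `length` elements."""
    return (length + block_size - 1) // block_size

def global_indices(grid_dim, block_dim):
    """Yield the flattened index of every thread in a 1D grid of 1D blocks.

    Mirrors index = threadIdx.x + blockDim.x * blockIdx.x.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            yield thread_idx + block_dim * block_idx

length = 130        # input length used in this notebook's run
block_size = 256    # assumed block size, chosen for illustration
grid_size = blocks_needed(length, block_size)

indices = list(global_indices(grid_size, block_size))
# Threads whose index is >= length would read and write out of bounds,
# so a kernel launched this way needs an `index < length` guard.
in_range = [i for i in indices if i < length]

print(grid_size, len(indices), len(in_range))  # prints: 1 256 130
```

Every element index from 0 to `length - 1` appears exactly once in `in_range`, so each output element is computed by exactly one thread.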

10- The code is saved in the directory `/content/src/`, so we change to that directory and then compile and run the code.

In [None]:
%cd /content/src/
!nvcc -G main.cu -o run.out && ./run.out
# !sudo rm /content/src/main.cu

/content/src






[Generic] 0.000300032 Importing data and creating memory on host
Trace main::42 The input length is 130
[GPU    ] 0.000185088 Allocating GPU memory.
[GPU    ] 0.000048128 Copying input memory to the GPU.
[Compute] 0.000163840 Performing CUDA computation
[Copy   ] 0.000043008 Copying output memory to the CPU
[GPU    ] 0.000135168 Freeing GPU Memory
Solution is correct.
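
The timing lines above can be summed per category to see where the run spends its time. This is a minimal sketch that assumes the log format shown above (a bracketed tag, a number, and a message); `total_time_by_category` is a hypothetical helper name, and the sample text is copied from the output of this run.

```python
import re
from collections import defaultdict

# Matches lines such as "[GPU    ] 0.000185088 Allocating GPU memory."
LOG_LINE = re.compile(r"^\[(\w+)\s*\]\s+([0-9.]+)\s+(.*)$")

def total_time_by_category(log_text):
    """Sum the reported durations per category tag (Generic, GPU, Compute, Copy)."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            category, duration, _message = match.groups()
            totals[category] += float(duration)
    return dict(totals)

sample_log = """\
[Generic] 0.000300032 Importing data and creating memory on host
Trace main::42 The input length is 130
[GPU    ] 0.000185088 Allocating GPU memory.
[GPU    ] 0.000048128 Copying input memory to the GPU.
[Compute] 0.000163840 Performing CUDA computation
[Copy   ] 0.000043008 Copying output memory to the CPU
[GPU    ] 0.000135168 Freeing GPU Memory
Solution is correct.
"""

totals = total_time_by_category(sample_log)
for category, duration in totals.items():
    print(f"{category:8s} {duration:.9f}")
```

For a vector of only 130 elements, allocation, copying, and freeing together take more time than the computation itself, which is expected for such a small input.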


11- With the command below we can profile the program and check the efficiency of our block and grid configurations.

In [None]:
%cd /content/src/
!nvprof ./run.out

/content/src
==2591== NVPROF is profiling process 2591, command: ./run.out
[Generic] 0.000315136 Importing data and creating memory on host
Trace main::42 The input length is 130
[GPU    ] 0.000212992 Allocating GPU memory.
[GPU    ] 0.000056832 Copying input memory to the GPU.
[Compute] 0.000175104 Performing CUDA computation
[Copy   ] 0.000047872 Copying output memory to the CPU
[GPU    ] 0.000134912 Freeing GPU Memory
Solution is correct.
==2591== Profiling application: ./run.out
==2591== Profiling result:
No events/metrics were profiled.
