---
# **LAB 1 - Intro CUDA**
---

# ▶️ Google Colaboratory (colab)

[Colaboratory](https://research.google.com/colaboratory/faq.html) (or Colab) is a **free research tool** from *Google* for machine learning education and research, built on top of [Jupyter Notebook](https://jupyter.org/). It requires no setup and runs entirely in the **cloud**. In Google Colab you can write, execute, save and share your Jupyter notebooks, and you can access powerful computing resources such as **TPUs** and **GPUs** for free through your browser. All major Python libraries, such as **TensorFlow**, **Scikit-learn**, **PyTorch** and **Pandas**, are pre-installed. All you need is a **Google Account**. Your notebooks are stored in your **Google Drive**, or can be loaded from **GitHub**. Colab notebooks can be shared just as you would share Google Docs or Sheets: simply click the Share button at the top right of any Colab notebook, or follow the Google Drive file sharing instructions.




## Upload/download files


Once you open a **Google Colab notebook**, it creates a **virtual machine** instance on Google Cloud Platform. To **upload** files from your local machine to Colab's virtual storage, use the `Upload` option in the left sidebar. To **download** files from Colab's virtual storage to your local machine, right-click on a file and select `Download`. You can also mount your Google Drive: once you click on **MOUNT DRIVE** in the left sidebar, it inserts a code cell into your notebook that you need to run to mount your Google Drive (it will ask for your authorization). Another way to download files into Colab (without mounting Google Drive) is to use the `!gdown` or `!wget` commands (more details in the [Shell commands](#scrollTo=JrF12-bqPKPm) section).<br><br>


<img src="https://drive.google.com/uc?export=view&id=1CRjolVrVbEboNPLVVw-c_AtsBBcSou1Z" width=800 px><br><br>




## Notebook rules

Some basic notebook rules:


1.   Click inside a code cell and press SHIFT+ENTER (or click the "PLAY" button) to execute it.
2.   Re-executing a cell will reset it (any input will be lost).
3.   Execute cells TOP TO BOTTOM.
4.   Notebooks are saved to your Google Drive.
5.   Mount your Google Drive to have direct access from a notebook to the files stored in the drive (this includes Team Drives).
6.   If you use Colab's virtual storage only, all the uploaded/stored files will be deleted when the runtime is recycled.

## Shell commands

The command `uname` displays information about the system.

* **-a option:** prints all the system information in the following order: kernel name, network node hostname, kernel release, kernel version, machine hardware name, processor type, hardware platform, operating system.

In [None]:
!uname -a && cat /etc/*release

In [None]:
!pwd

In [None]:
!ls -la

# ▶️ VS Code on Colab

In [None]:
#@title Colab-ssh tunnel
#@markdown Execute this cell to open the ssh tunnel. Check [colab-ssh documentation](https://github.com/WassimBenzarti/colab-ssh) for more details.

# Install colab_ssh on google colab
!pip install colab_ssh --upgrade --quiet

from colab_ssh import launch_ssh_cloudflared, init_git_cloudflared
ssh_tunnel_password = "gpu" #@param {type: "string"}
launch_ssh_cloudflared(password=ssh_tunnel_password)

# Optional: if you want to clone a Github or Gitlab repository
repository_url="https://github.com/giulianogrossi/GPUcomputing" #@param {type: "string"}
init_git_cloudflared(repository_url)

Define some paths...

In [None]:
# path setup
!mkdir -p /content/GPUcomputing/lab2
%cd /content/GPUcomputing/lab2
!mkdir -p src


# ▶️ CUDA zone

## How to use accelerated hardware

To change the hardware runtime, navigate to `Runtime -> Change runtime type` and select your preferred hardware accelerator, **GPU** or **TPU**.



## NVIDIA System Management Interface (nvidia-smi) 

The NVIDIA System Management Interface (**`nvidia-smi`**) is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the **management** and **monitoring** of NVIDIA GPU devices. 

This utility allows administrators to query GPU device state and, with the appropriate privileges, to modify GPU device state. It is targeted at the Tesla™, GRID™, Quadro™ and Titan X products, though limited support is also available on other NVIDIA GPUs.

For more details, please refer to the **`nvidia-smi`** documentation ([doc](http://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf))

For information on **Tesla T4** see: 

In [None]:
!ls -la
!pwd

In [None]:
!nvidia-smi

## NVCC Plugin for Jupyter notebook

*Usage*:


*   Load the extension: `%load_ext nvcc_plugin`
*   Mark a cell to be treated as a CUDA cell:
`%%cuda --name example.cu --compile false`

**NOTE**: The cell must contain either code or comments to run successfully. It accepts two arguments. The first, `-n | --name`, is the name of either a CUDA source or a header file; it must have the extension `.cu` or `.h`. The second, `-c | --compile` (default `false`), is a flag that specifies whether the cell is compiled and run right away. It might be useful if you're experimenting with the `main` function.

*  We are now ready to run CUDA C/C++ code right in the notebook. To do so, we need to tell the interpreter explicitly that we want to use the extension, by adding `%%cu` at the beginning of each cell with CUDA code. 




In [None]:
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git

In [None]:
%load_ext nvcc_plugin


To save the `.cu` file and compile it from the command line, *define* the magic cell:
```
%%cuda --name filename.cu 
```
The source file will be saved under the directory `src/`

Otherwise, you can use the standard magic command:
```
%%writefile <path-to-file>/filename.cu 
```

In [None]:
# plugin for cpp syntax highlighting 

!wget -O cpp_plugin.py https://gist.github.com/akshaykhadse/7acc91dd41f52944c6150754e5530c4b/raw/cpp_plugin.py
%load_ext cpp_plugin

Clone the GPUcomputing repository from GitHub...

In [None]:
!git clone https://github.com/giulianogrossi/GPUcomputing.git

# ✅ Hello World!

In [None]:
#@title working directory: **/content/hello**
%mkdir -p hello
%ls -la

In [None]:
%%cuda --name hello.cu
#include <stdio.h>
#include <iostream>

using namespace std;

__global__ void helloFromGPU (void) {
  int tID = threadIdx.x;
  printf("Hello World from GPU (I'm thread %d)!\n", tID);
}

int main(void) {
  // hello from CPU
  cout << "Hello World from CPU!" << endl;
  // select the first GPU (device indices start at 0)
  cudaSetDevice(0);
  // hello from GPU: 1 block of 10 threads
  helloFromGPU <<<1, 10>>>();
  cudaDeviceSynchronize();
  return 0;
}


compile...

In [None]:
%%shell

# `hello` is already a directory (created above), so place the binary inside it
nvcc src/hello.cu -o hello/hello
ls -la hello

and execute...

In [None]:
%%shell

./hello/hello

Edit, compile and run with the magic cell `%%cu` (only one file at a time): executing the cell compiles and runs the code directly...

In [None]:
%%cu 
#include <stdio.h>

__global__ void helloFromGPU (void) {
  int tID = threadIdx.x;
  printf("Hello World from GPU (I'm thread %d)!\n", tID);
}

int main(void) {
    // hello from CPU
    printf("Hello World from CPU!\n");
    // select the first GPU (device indices start at 0)
    cudaSetDevice(0);
    // hello from GPU: 1 block of 10 threads
    helloFromGPU <<<1,10>>>();
    cudaDeviceSynchronize();
    return 0;
}


#  ✅ MQDB: block-diagonal square matrices (*matrici quadrate diagonali a blocchi*)

In [None]:
#@title Source directory: **/content/MQDB**
%mkdir -p MQDB
%ls -la

In [None]:
#@title  File: MQDB/mqdb.h
%%cpp -n MQDB/mqdb.h -s xcode

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>
#include <time.h>

#ifndef MQDB_H
#define MQDB_H

#define randu() ((float)rand() / (float) RAND_MAX)
#define abs(x) ((x) < 0 ? -(x) : (x))   // argument parenthesized so that abs(a - b) expands correctly

typedef unsigned long ulong;
typedef unsigned int uint;

typedef struct MQDB {
	char desc[100];   // description
	int nBlocks;      // num. of blocks
	int *blkSize;     // block dimensions
	float *elem;       // elements in row-major order
	ulong nElems;     // actual number of elements
} mqdb;


// function prototypes
int genRandDims(mqdb*, uint, uint);
void fillBlocks(mqdb*, uint, uint, char, float);
mqdb genRandMat(uint, uint, uint);
mqdb mqdbConst(uint, uint, uint, float);
void mqdbProd(mqdb, mqdb, mqdb);
void matProd(mqdb, mqdb, mqdb);
void checkResult(mqdb, mqdb);
void mqdbDisplay(mqdb);

// wall-clock time in seconds (microsecond resolution)
inline double seconds() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}

#endif
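
The `seconds()` helper in `mqdb.h` reads the wall clock through the POSIX call `gettimeofday`, so a measured interval can be distorted if the system clock is adjusted while the program runs. As an alternative sketch (not part of the lab sources; the name `secondsSteady` is hypothetical), the same kind of timer can be built on the monotonic `std::chrono::steady_clock`:

```cpp
#include <chrono>

// Hypothetical alternative to seconds(): seconds elapsed since the first call,
// measured on a monotonic clock that never goes backwards.
inline double secondsSteady() {
    using clock = std::chrono::steady_clock;
    static const clock::time_point origin = clock::now();  // fixed at first call
    return std::chrono::duration<double>(clock::now() - origin).count();
}
```

It would be used like `seconds()`: store the value before the computation and subtract it from the value read afterwards.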

In [None]:
#@title  File: MQDB/mqdb.cpp
%%cpp -n MQDB/mqdb.cpp -s xcode

#include "mqdb.h"

/**
 * randomly generate the block dimensions
 */
int genRandDims(mqdb *M, uint n, uint k) {

	if (n == 0 || k == 0 || k > n) {
		printf("error: n,k must be positive and n >= k!\n");
		return(-1);
	}
	// random generation of block sizes
	M->blkSize = (int *) malloc(k * sizeof(int));
	int sum = 0;
	int r;
	float mu = 2.0f * (float) n / (float) k;
	for (int i = 0; i < k - 1; i++) {
		// expected value E[block_size] = n/k
		while ((r = round(mu * randu())) > n - sum - k + i + 1);
		if (!r)
			r += 1;
		M->blkSize[i] = r;
		sum += r;
	}
	M->blkSize[k - 1] = n - sum;
	return(0);
}

/**
 * fill the diagonal blocks with either random or constant values
 */
void fillBlocks(mqdb *M, uint n, uint k, char T, float c) {
	// mat size n*n
	M->elem = (float *) calloc(n * n, sizeof(float));
	M->nElems = 0;
	int offset = 0;
	// loop on blocks
	for (int i = 0; i < k; i++) {
		for (int j = 0; j < M->blkSize[i]; j++)
			for (int l = 0; l < M->blkSize[i]; l++)   // l: column inside the block (no longer shadows k)
				if (T == 'C')  	    // const fill mat entries
					M->elem[offset * n + j * n + l + offset] = c;
				else if (T == 'R') 	// random fill mat entries
					M->elem[offset * n + j * n + l + offset] = randu();
		offset += M->blkSize[i];
		M->nElems += M->blkSize[i]*M->blkSize[i];
	}
	// set description
	sprintf(M->desc, "Random mqdb:  mat. size = %u, num. blocks = %u, blk sizes: ",n,k);
}

/**
 * genRandMat:    mqdb  type returned
 *                n     square matrix size
 *                k     number of blocks
 *                seed  seed for random generator (block sizes only)
 */
mqdb genRandMat(unsigned n, unsigned k, unsigned seed) {
	mqdb M;
	srand(seed);
	genRandDims(&M, n, k);
	M.nBlocks = k;

	srand(time(NULL));
	// random fill mat entries 
	fillBlocks(&M, n, k, 'R', 0.0);

	return M;
}

/**
 * mqdbConst:     mqdb  is the type returned
 *                n     is the square matrix size
 *                k     is the number of blocks
 *                seed  is the seed for random generator
 *                c   	is the constant value assigned
 */
mqdb mqdbConst(uint n, uint k, uint seed, float c) {
	mqdb M;
	srand(seed);
	genRandDims(&M, n, k);
	M.nBlocks = k;

	// fill mat entries with a constant
	fillBlocks(&M, n, k, 'C', c);

	return M;
}

/*
 * standard (naive) matrix product on host
 */
void matProd(mqdb A, mqdb B, mqdb C) {
	int n = 0;
	for (uint i = 0; i < A.nBlocks; i++)
		n += A.blkSize[i];

	for (uint r = 0; r < n; r++)
		for (uint c = 0; c < n; c++) {
			double sum = 0;
			for (uint l = 0; l < n; l++){
				double a = A.elem[r * n + l];
				double b = B.elem[l * n + c];
				sum += a*b;
			}
			C.elem[r * n + c] = (float)sum;
		}
}

/*
 * elementwise comparison between two mqdb
 */
void checkResult(mqdb A, mqdb B) {
	double epsilon = 1.0E-8;
	bool match = 1;
	int n = 0;
	for (int i = 0; i < A.nBlocks; i++)
		n += A.blkSize[i];
	for (int i = 0; i < n * n; i++) {
		if (fabs(A.elem[i] - B.elem[i]) > epsilon) {
			match = 0;
			printf("   * Arrays do not match!\n");
			printf("     gpu: %2.2f,  host: %2.2f at index %d\n", A.elem[i],
					B.elem[i], i);
			break;
		}
	}
	if (match)
		printf("   Arrays match\n\n");
}
/*
 * print mqdb
 */
void mqdbDisplay(mqdb M) {
	int n = 0;
	printf("%s", M.desc);
	for (int j = 0; j < M.nBlocks; j++) {
		printf("%d  ", M.blkSize[j]);
		n += M.blkSize[j];
	}
	printf("\n");
	for (int j = 0; j < n * n; j++) {
		if (M.elem[j] == 0)
			printf("------");
		else
			printf("%5.2f ", M.elem[j]);
		if ((j + 1) % n == 0)
			printf("\n");
	}
	printf("\n");
}
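
The index arithmetic in `fillBlocks` can be read as follows: if a block starts at row and column `offset` of an `n x n` row-major matrix, its local entry `(j, l)` is stored at global index `(offset + j) * n + (offset + l)`, which is the same value as `offset * n + j * n + l + offset`. A minimal sketch of that mapping; `blockOffsets` and `blockIndex` are hypothetical helper names that do not exist in the lab sources:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Starting row/column of each diagonal block: prefix sums of the block sizes.
std::vector<std::size_t> blockOffsets(const std::vector<int>& blkSize) {
    std::vector<std::size_t> off(blkSize.size(), 0);
    for (std::size_t i = 1; i < blkSize.size(); i++)
        off[i] = off[i - 1] + static_cast<std::size_t>(blkSize[i - 1]);
    return off;
}

// Global row-major index of local entry (j, l) of a block that starts at
// row/column `offset` in an n x n matrix.
std::size_t blockIndex(std::size_t n, std::size_t offset,
                       std::size_t j, std::size_t l) {
    assert(offset + j < n && offset + l < n);  // entry must lie inside the matrix
    return (offset + j) * n + (offset + l);
}
```

For block sizes 2, 3 and 1 (so `n = 6`) the blocks start at rows 0, 2 and 5, and entry `(1, 2)` of the second block sits at index `3 * 6 + 4`.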


# 🔴 TODO

Develop a function in `C` that computes the optimized product $C = A \cdot B$ (restricted to the blocks on the diagonal only) of two matrices of type MQDB.

**STEPS**

1. set the main parameters: `n`, the matrix size, and `k`, the number of blocks on the diagonal
2. generate random matrices with `genRandMat(uint n, uint k, uint seed)` or constant-valued ones with `mqdbConst(uint n, uint k, uint seed, float c)`. 
Note: the block sizes $k_i$, such that $n = \sum_{i=1}^k k_i$, are generated at random
3. The matrices must have the same dimensions (same side $n$ and same sizes $k_i$ for the $k$ diagonal blocks; use the same seed)
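
A rough operation count (a back-of-the-envelope estimate, not a measured result) suggests why restricting the product to the diagonal blocks pays off. The naive product performs about $n^3$ multiply-adds, whereas the block-wise product only multiplies matching diagonal blocks of sizes $k_i$:

$$
T_{\text{full}} = n^3, \qquad T_{\text{MQDB}} = \sum_{i=1}^{k} k_i^3, \qquad n = \sum_{i=1}^{k} k_i .
$$

In the idealized case of equal blocks, $k_i = n/k$, this gives $T_{\text{MQDB}} = k \, (n/k)^3 = n^3 / k^2$, that is, roughly $k^2$ times less arithmetic. With random block sizes the gain is smaller, because the largest blocks dominate the sum.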

<img src="https://github.com/giulianogrossi/imgs/blob/main/GPU/MQDB.png?raw=true" align="center" width=600px >

In [None]:
%%cpp -n MQDB/prod_mqdb.cpp -s xcode

#include "mqdb.h"

// TODO: implement mqdbProd(mqdb A, mqdb B, mqdb C)
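
As a starting point, the block-restricted product can be sketched on plain arrays, independently of the `mqdb` struct. This is only one possible shape of the computation, under the assumption that both operands share the same block sizes; the function name `blockDiagProd` and its signature are hypothetical, and adapting it to `mqdbProd(mqdb, mqdb, mqdb)` is left to the exercise:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: C = A * B for n x n block-diagonal matrices stored in
// row-major order; blkSize lists the diagonal block sizes (their sum must be n).
// Only entries inside the diagonal blocks of C are written.
void blockDiagProd(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, const std::vector<int>& blkSize,
                   std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    std::size_t offset = 0;  // first row/column of the current block
    for (int size : blkSize) {
        const std::size_t d = static_cast<std::size_t>(size);
        assert(offset + d <= n);
        for (std::size_t r = offset; r < offset + d; r++)
            for (std::size_t c = offset; c < offset + d; c++) {
                double sum = 0.0;  // accumulate in double, as matProd does
                for (std::size_t l = offset; l < offset + d; l++)
                    sum += static_cast<double>(A[r * n + l]) * B[l * n + c];
                C[r * n + c] = static_cast<float>(sum);
            }
        offset += d;
    }
    assert(offset == n);  // block sizes must cover the whole diagonal
}
```

All three loops stay inside the current block, so entries outside the diagonal blocks are never read or written; `C` therefore has to be zero outside the blocks beforehand, which holds for matrices created by `fillBlocks`.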

Main

In [None]:
%%cpp -n  MQDB/main.cpp
#include <sys/time.h>
#include "mqdb.h"

/*
 * main function
 */
int main(void) {
	uint n = 2*1024;      // matrix size
	uint k = 10;          // num of blocks
	mqdb A, B, C, C1;     // mqdb host matrices

	// # fill in #
	A = mqdbConst(n, k, 10, 1);
	B = mqdbConst(n, k, 10, 1);
	C = mqdbConst(n, k, 10, 1);
	C1 = mqdbConst(n, k, 10, 1);

	ulong nBytes = n * n * sizeof(float);
	ulong kBytes = k * sizeof(uint);
	printf("Memory size required = %.1f (MB)\n",(float)nBytes/(1024.0*1024.0));

	printf("CPU mat product...\n");
	double start = seconds();
	matProd(A, B, C);
	double CPUTime = seconds() - start;
	printf("CPU elapsed time: %.5f (sec)\n\n", CPUTime);

	printf("CPU MQDB product...\n");
	start = seconds();
	mqdbProd(A, B, C1);
	CPUTime = seconds() - start;
	printf("CPU elapsed time: %.5f (sec)\n\n", CPUTime);

	// check result
	checkResult(C, C1);
 
	return 0;
}


In [None]:
%%shell
# Compile and run

g++ MQDB/prod_mqdb.cpp MQDB/main.cpp MQDB/mqdb.cpp -o main
./main

## Report

Report the execution times for 

$k = 10$
* n = 1024, time = 
* n = 2048, time = 
* n = 4096, time = 

$k = 20$
* n = 1024, time = 
* n = 2048, time = 
* n = 4096, time = 