<a href="https://colab.research.google.com/github/trefftzc/partition_COLAB_notebooks/blob/main/Using_CUDA_for_the_partition_problem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A small example of coding with CUDA
CUDA is the parallel computing platform and programming model that NVIDIA created to program its GPUs. CUDA programs are written in C/C++ with a few language extensions.

There are several key steps in programming a GPU:

1. Allocating memory in the GPU card
2. Copying data from the host memory to the GPU memory
3. Executing code in the GPU
4. Copying the results back to the host memory
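The four steps above can be sketched in a few lines. This is only a minimal illustration, not the program used later in this notebook: the kernel `doubleEntries` and the helper `doubleOnGpu` are hypothetical names, and error checking is left out for brevity.

```cuda
#include <stdio.h>

// Hypothetical kernel for illustration: each thread doubles one array entry.
__global__ void doubleEntries(int n, int *data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    data[i] = 2 * data[i];
  }
}

// Hypothetical helper that walks through the four steps for an array of n ints.
void doubleOnGpu(int n, int *host) {
  int *d_data;
  // Step 1: allocate memory on the GPU
  cudaMalloc(&d_data, n * sizeof(int));
  // Step 2: copy the data from the host memory to the GPU memory
  cudaMemcpy(d_data, host, n * sizeof(int), cudaMemcpyHostToDevice);
  // Step 3: execute code on the GPU (blocks of 32 threads)
  doubleEntries<<<(n + 31) / 32, 32>>>(n, d_data);
  // Step 4: copy the results back to the host memory
  cudaMemcpy(host, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_data);
}

int main() {
  int values[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  doubleOnGpu(8, values);
  for (int i = 0; i < 8; i++) {
    printf("%d ", values[i]);
  }
  printf("\n");
  return 0;
}
```

The second `cudaMemcpy` waits for the kernel to finish before copying, so no explicit synchronization is needed in this sketch.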

Before continuing, make sure that you have set the runtime environment to a GPU.

Open the Runtime menu at the top and select

 Change runtime type

 Then choose T4 GPU

The CUDA code needs a special compiler from NVIDIA called nvcc.
nvcc is installed by default on COLAB.

In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


If you are interested in the specifications of the GPU, and you have an NVIDIA GPU on your system, the command nvidia-smi displays information about the GPU.

In [None]:
!nvidia-smi

Wed Oct  2 13:48:03 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

For the purpose of this example, we will look at a program that solves the partition problem: deciding whether a multiset of integers can be split into two subsets whose sums are equal.
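The program below tries every candidate partition by brute force, using the bits of an integer to say which subset each element belongs to. As a rough host-only sketch of that encoding (the function name `isBalanced` and the sample multiset {3, 1, 2} are made up for illustration), bit i of `value` set means element i goes into the first subset:

```cuda
#include <stdio.h>

// Returns 1 when the partition encoded by the bits of value
// splits the array into two subsets with equal sums, 0 otherwise.
int isBalanced(unsigned int value, int n, const int *array) {
  int sum0s = 0;
  int sum1s = 0;
  unsigned int mask = 1;
  for (int i = 0; i < n; i++) {
    if ((mask & value) != 0) {
      sum1s = sum1s + array[i];   // bit i is 1: element i is in the first subset
    } else {
      sum0s = sum0s + array[i];   // bit i is 0: element i is in the second subset
    }
    mask = mask * 2;              // move on to the next bit
  }
  return sum0s == sum1s;
}

int main() {
  int array[3] = {3, 1, 2};
  // With n = 3 there are 2^3 = 8 encodings, but each one mirrors its
  // bitwise complement, so only the first 2^(3-1) = 4 need to be checked.
  for (unsigned int value = 0; value < 4; value++) {
    printf("value %u: %s\n", value,
           isBalanced(value, 3, array) ? "balanced" : "not balanced");
  }
  return 0;
}
```

Only value 1 is reported as balanced here: it places 3 in the first subset and 1 and 2 in the second, and both sums are 3.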

In [None]:
%%writefile cudaPartition.cu
/*
 * cudaPartition.cu
 * Solve the Partition problem using CUDA.
 * https://en.wikipedia.org/wiki/Partition_problem
 * This code works for multisets of up to 31 elements (2^n must fit in an unsigned int)
 * The input is expected to be as follows:
 * The first line will contain n, the number of elements in the multiset
 * The remaining n lines will contain the n values, one per line
 */
#include <stdio.h>
#include <stdlib.h>
// The kernel
// This function is executed, in parallel, on the processors on the GPU card
//
__global__
void evaluatePartition(  int n, int *array,int *result) {
  unsigned int value = blockIdx.x*blockDim.x + threadIdx.x;
  int sum0s = 0;
  int sum1s = 0;
  unsigned int mask = 1;
  for(int i = 0;i < n;i++) {
    if ((mask & value) != 0) {
      sum1s = sum1s + array[i];
    }
    else {
      sum0s = sum0s + array[i];
    }
    mask = mask * 2;
  }
  if (sum0s == sum1s)
     result[value] = 1;
  else
     result[value] = 0;
  // printf("%d %d \n",value,result[value]);
}

void printResults(unsigned int value,int n,int *array)
{
  printf("Solution:\n");
  printf("First partition: ") ;
  unsigned int mask = 1;
  int sum = 0;
  for(int i = 0;i < n;i++) {
    if ((mask & value) != 0) {
      printf("%d ",array[i]);
      sum = sum + array[i];
    }
    mask = mask * 2;
  }
  printf(" sum: %d \n",sum);
  printf("Second partition: ") ;
  mask = 1;
  sum = 0;
  for(int i = 0;i < n;i++) {
    if ((mask & value) == 0) {
      printf("%d ",array[i]);
      sum = sum + array[i];
    }
    mask = mask * 2;
  }
  printf(" sum: %d \n",sum);
}


int main() {

  int n;
  int *array;

  scanf("%d",&n);

  printf("The value of n is %d\n",n);
  array = (int *) malloc (n * sizeof(int));
  for(int i = 0;i < n;i++) {
    scanf("%d",&array[i]);
  }
  printf("The read values are: \n");
  for(int i = 0;i < n;i++) {
    printf("%d ",array[i]);
  }
  printf("\n");

  unsigned int nPartitions = 1;
  for(int i = 0;i < n;i++) {
    nPartitions = nPartitions * 2;
  }
  // printf("The number of possible partitions is: %d\n",nPartitions);
  // Only half of all possible partitions need be examined
  // The second half is symmetrical to the first half
  nPartitions = nPartitions / 2;

  int solutionFound = 0;
  int solution = -1;
  // Allocate the variables in the device:
  // The array with the integer values in the device is called d_array
  int *d_array;
  cudaMalloc(&d_array, n*sizeof(int));

  // Copy the variables from the host to the device
  cudaMemcpy(d_array,array,n*sizeof(int), cudaMemcpyHostToDevice);

  // Allocate on the device an array to keep all the results
  int *d_results;
  cudaMalloc(&d_results,nPartitions*sizeof(int));
// Now invoke the kernel
  evaluatePartition<<<(nPartitions+31)/32,32>>>(  n, d_array,d_results) ;
  // The array on the host that will contain the results is called results
  int *results;
  results = (int *) calloc (nPartitions , sizeof(int));
// Copy the results from the GPU card to main memory on the host
  cudaMemcpy(results,d_results,nPartitions*sizeof(int),cudaMemcpyDeviceToHost);
  /*
  for(int i = 0;i < nPartitions;i++) {
	 printf("%d ",results[i]);
  }
  printf("\n");
 */
  for(int i = 0;i < nPartitions;i++) {
	  if (results[i] != 0) {
		  solutionFound = 1;
		  solution = i;
		  break;
	  }
  }

  if (solutionFound == 1) {
    printResults(solution, n, array);
  }
  else {
    printf("No solution was found.");
  }
  return 0;
}

Writing cudaPartition.cu


After writing the source code, we can compile it using the nvcc compiler.

In [None]:
!nvcc cudaPartition.cu -o cudaPartition -O3

    scanf("%d",&n);
    ^


      scanf("%d",&array[i]);
      ^

    scanf("%d",&n);
    ^


      scanf("%d",&array[i]);
      ^

cudaPartition.cu: In function ‘int main()’:
   70 |   scanf("%d",&n);
      |   ~~~^~~~~~~~~~
   75 |     scanf("%d",&array[i]);
      |     ~^~~~~~~~~~~~~~~~~~~


Let's create a test file. This was used previously when testing the OpenMP code.


In [None]:
%%writefile testNoSolution29.txt
29
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
30


Writing testNoSolution29.txt


And now we can run and time the executable code on the GPU:

In [None]:
!time ./cudaPartition < testNoSolution29.txt

The value of n is 29
The read values are: 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 30 
No solution was found.
real	0m2.086s
user	0m0.754s
sys	0m0.936s


The key portions of the code are the following.

The first step is allocating variables in the memory of the GPU. This is done using the function cudaMalloc:
 cudaMalloc(&d_array, n*sizeof(int));

 This function has two parameters:
 - The address of the pointer that will receive the location of the allocated block in GPU memory
 - The amount of memory that is needed, in bytes
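The program in this notebook ignores the value returned by cudaMalloc. The CUDA runtime functions return a status of type `cudaError_t`, and checking it makes failures (for example, asking for more memory than the card has) visible. A minimal sketch, assuming a hypothetical helper named `checkCuda` that is not part of the original code:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: print a message and stop when a CUDA call fails.
void checkCuda(cudaError_t status, const char *what) {
  if (status != cudaSuccess) {
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
    exit(EXIT_FAILURE);
  }
}

int main() {
  int n = 24;
  int *d_array = NULL;
  // cudaMalloc receives the address of the pointer and the size in bytes
  checkCuda(cudaMalloc(&d_array, n * sizeof(int)), "cudaMalloc");
  printf("Allocated %zu bytes on the device\n", n * sizeof(int));
  // Release the device memory once it is no longer needed
  checkCuda(cudaFree(d_array), "cudaFree");
  return 0;
}
```

The same wrapper can be placed around cudaMemcpy, which also returns a `cudaError_t`.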

The next step is to copy the content of the variables that will be used in the computation from the host memory to the GPU memory. This is done using the function cudaMemcpy:
cudaMemcpy(d_array,array,n*sizeof(int), cudaMemcpyHostToDevice);

This function has four parameters:
- The destination of the copy
- The source of the copy
- The amount of memory (in bytes) that will be copied
- A constant that indicates the direction of the transfer. In this case we are copying from the host to the GPU (device).

The next step is to specify the code that will be executed on the GPU processors.

Specifying the code to be executed on the GPU processors is where CUDA is different from regular C code.

There are two main considerations:
- Describing the code to be executed on each core on the GPU. To designate a particular function as code that will be executed on the GPU cores, CUDA uses two additional keywords: `__global__` indicates that the function runs on the GPU and can be called from the host, and `__device__` indicates that the function runs on the GPU and can be called from another function executing on the GPU.

The function will usually operate on entries on an array.
It will be necessary to identify the entry on which this function will operate.
This is usually achieved with a line like this:
```

unsigned int value = blockIdx.x*blockDim.x + threadIdx.x;

```
The block size is chosen by the programmer.

```

__global__
void evaluatePartition(  int n, int *array,int *result) {
  unsigned int value = blockIdx.x*blockDim.x + threadIdx.x;
  int sum0s = 0;
  int sum1s = 0;
  unsigned int mask = 1;
  for(int i = 0;i < n;i++) {
    if ((mask & value) != 0) {
      sum1s = sum1s + array[i];
    }
    else {
      sum0s = sum0s + array[i];
    }
    mask = mask * 2;
  }
  if (sum0s == sum1s)
     result[value] = 1;
  else
     result[value] = 0;
  // printf("%d %d \n",value,result[value]);

}
```
- The second consideration is how to call this function. NVIDIA introduced a new syntactic element into CUDA: triple chevrons, `<<<` and `>>>`. The chevrons indicate that this function is meant to be executed on the GPU. Two values are passed between the chevrons:
```
 evaluatePartition<<<(nPartitions+31)/32,32>>>(  n, d_array,d_results) ;
```
The first parameter is the number of blocks that will be required for the execution of the program as an expression based on the size of the problem. Frequently, the size of the problem is the size of the array that will be operated upon. The second parameter is the block size.
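The expression `(nPartitions+31)/32` rounds the number of blocks up, so the launch may create more threads than there are entries to process. In the partition program the number of entries is 2^(n-1), which is a multiple of 32 whenever n is at least 6, but for other sizes the extra threads must not touch the array. The following sketch shows the usual guard; the kernel `markEntries`, the helper `blocksNeeded`, and the count of 40 are hypothetical and chosen only for illustration.

```cuda
#include <stdio.h>

// Number of blocks needed so that blocks * blockSize >= count (ceiling division).
unsigned int blocksNeeded(unsigned int count, unsigned int blockSize) {
  return (count + blockSize - 1) / blockSize;
}

// Hypothetical kernel: each thread marks its own entry, if it has one.
__global__ void markEntries(unsigned int count, int *result) {
  unsigned int value = blockIdx.x * blockDim.x + threadIdx.x;
  if (value < count) {   // threads past the end of the array do nothing
    result[value] = 1;
  }
}

int main() {
  const unsigned int count = 40;      // deliberately not a multiple of 32
  const unsigned int blockSize = 32;
  unsigned int blocks = blocksNeeded(count, blockSize);
  printf("%u entries need %u blocks of %u threads\n", count, blocks, blockSize);

  int *d_result;
  cudaMalloc(&d_result, count * sizeof(int));
  cudaMemset(d_result, 0, count * sizeof(int));
  markEntries<<<blocks, blockSize>>>(count, d_result);

  int result[count];
  cudaMemcpy(result, d_result, count * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_result);

  int marked = 0;
  for (unsigned int i = 0; i < count; i++) {
    marked = marked + result[i];
  }
  printf("%d entries marked\n", marked);
  return 0;
}
```

Here two blocks (64 threads) are launched for 40 entries, and the guard keeps the last 24 threads from writing outside the array.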

Finally, the results are copied back to the host:

```
cudaMemcpy(results,d_results,nPartitions*sizeof(int),cudaMemcpyDeviceToHost);
```

The only change with respect to the previous usage of cudaMemcpy is the direction. Now one is copying from the GPU (device) to the host.
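The amount of data in this copy depends on n: the results array holds one int for each of the 2^(n-1) partitions that are examined. A small host-only sketch of that arithmetic, assuming 4-byte ints and using a hypothetical helper named `resultBytes`:

```cuda
#include <stdio.h>

// Bytes copied back from the device for a multiset of n elements:
// one int per examined partition, and 2^(n-1) partitions are examined.
size_t resultBytes(int n) {
  size_t nPartitions = (size_t)1 << (n - 1);
  return nPartitions * sizeof(int);
}

int main() {
  for (int n = 24; n <= 29; n++) {
    printf("n = %d: %zu MiB\n", n, resultBytes(n) / (1024 * 1024));
  }
  return 0;
}
```

With 4-byte ints this prints 32 MiB for n = 24 and doubles with each additional element, reaching 1024 MiB for n = 29.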

Let's create some additional test files so that we can observe how the execution times increase as the problem size grows.


In [None]:
%%writefile test24.txt
24
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
23

Writing test24.txt


In [None]:
%%writefile test25.txt
25
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
24

Writing test25.txt


In [None]:
%%writefile test26.txt
26
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
25

Writing test26.txt


In [None]:
%%writefile test27.txt
27
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
26

Writing test27.txt


In [None]:
!time ./cudaPartition < test24.txt
!time ./cudaPartition < test25.txt
!time ./cudaPartition < test26.txt
!time ./cudaPartition < test27.txt

The value of n is 24
The read values are: 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 23 
Solution:
First partition: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  sum: 23 
Second partition: 23  sum: 23 

real	0m0.194s
user	0m0.037s
sys	0m0.143s
The value of n is 25
The read values are: 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 24 
Solution:
First partition: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  sum: 24 
Second partition: 24  sum: 24 

real	0m0.232s
user	0m0.053s
sys	0m0.168s
The value of n is 26
The read values are: 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 25 
Solution:
First partition: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  sum: 25 
Second partition: 25  sum: 25 

real	0m0.325s
user	0m0.101s
sys	0m0.204s
The value of n is 27
The read values are: 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 26 
Solution:
First partition: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  sum: 26 
Second partition: 26  sum: 26 

real	0m0.525s
user	0m

## Profiling the code
NVIDIA provides a profiler that can be used to find where the time is being spent during the execution of the program. The command is called nvprof.

In [None]:
!nvprof ./cudaPartition < test24.txt


The value of n is 24
The read values are: 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 23 
==1698== NVPROF is profiling process 1698, command: ./cudaPartition
Solution:
First partition: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  sum: 23 
Second partition: 23  sum: 23 
==1698== Profiling application: ./cudaPartition
==1698== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   96.55%  24.633ms         1  24.633ms  24.633ms  24.633ms  [CUDA memcpy DtoH]
                    3.45%  879.44us         1  879.44us  879.44us  879.44us  evaluatePartition(int, int*, int*)
                    0.00%  1.2160us         1  1.2160us  1.2160us  1.2160us  [CUDA memcpy HtoD]
      API calls:   76.04%  87.504ms         2  43.752ms  118.86us  87.385ms  cudaMalloc
                   23.64%  27.205ms         2  13.602ms  20.762us  27.184ms  cudaMemcpy
                    0.17%  194.47us         1  194.47us  194.47us  194.47us  cudaLaunchK

In [None]:
!nvprof ./cudaPartition < testNoSolution29.txt

The value of n is 29
The read values are: 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 30 
==1950== NVPROF is profiling process 1950, command: ./cudaPartition
No solution was found.==1950== Profiling application: ./cudaPartition
==1950== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   96.91%  991.46ms         1  991.46ms  991.46ms  991.46ms  [CUDA memcpy DtoH]
                    3.09%  31.642ms         1  31.642ms  31.642ms  31.642ms  evaluatePartition(int, int*, int*)
                    0.00%  1.1840us         1  1.1840us  1.1840us  1.1840us  [CUDA memcpy HtoD]
      API calls:   87.91%  1.02522s         2  512.61ms  25.490us  1.02520s  cudaMemcpy
                   12.05%  140.50ms         2  70.251ms  1.7501ms  138.75ms  cudaMalloc
                    0.02%  241.21us         1  241.21us  241.21us  241.21us  cudaLaunchKernel
                    0.02%  192.37us       114  1.6870us     219ns  73.610us 

On the back of your summary of the readings for this week, fill in the following table with the execution times:

| File name  | Execution Time |
| :--------- | -----------: |
| test24.txt |  |
| test25.txt |  |
| test26.txt |  |
| test27.txt |  |
| testNoSolution29.txt |  |

Look at the information provided by the profiler.
Where is most of the execution time being spent?