<a href="https://colab.research.google.com/github/trefftzc/partition_COLAB_notebooks/blob/main/partition_numba_cuda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!find / -iname 'libdevice'
!find / -iname 'libnvvm.so'

/root/.julia/artifacts/6283886c5dd600750885a23bff8d37f687172633/share/libdevice
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_nvcc/nvvm/libdevice
/usr/local/cuda-12.5/nvvm/libdevice
find: ‘/proc/62/task/62/net’: Invalid argument
find: ‘/proc/62/net’: Invalid argument
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_nvcc/nvvm/lib64/libnvvm.so
/usr/local/cuda-12.5/nvvm/lib64/libnvvm.so
find: ‘/proc/62/task/62/net’: Invalid argument
find: ‘/proc/62/net’: Invalid argument


In [2]:
import os
os.environ['NUMBAPRO_LIBDEVICE'] = "/usr/local/cuda-12.5/nvvm/libdevice"
os.environ['NUMBAPRO_NVVM'] = "/usr/local/cuda-12.5/nvvm/lib64/libnvvm.so"

In [3]:
!uv pip install -q --system numba-cuda==0.15

# Partition with NUMBA/CUDA
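This Python program uses Numba's CUDA support to solve the partition problem: decide whether a list of integers can be split into two groups with equal sums.
It checks all 2^n possible splits of the n input values, one split per GPU thread.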
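
As a point of comparison, the same exhaustive search can be sketched on the CPU in plain Python. The function name `find_partition` and the sample values are illustrative and not part of the notebook's program. Each integer `mask` encodes one split, with bit `i` deciding which group `values[i]` goes to. This sketch returns the first mask it finds, whereas the GPU program below keeps the largest one.

```python
def find_partition(values):
    """Return a bitmask selecting a subset that sums to half the total, or 0 if none exists."""
    n = len(values)
    total = sum(values)
    # Try every nonzero mask; bit i of the mask selects values[i]
    for mask in range(1, 2 ** n):
        subset_sum = sum(v for i, v in enumerate(values) if mask & (1 << i))
        if 2 * subset_sum == total:
            return mask
    return 0

# Illustrative instance (not one of the notebook's test files)
values = [3, 1, 1, 2, 2, 1]
mask = find_partition(values)
first = [v for i, v in enumerate(values) if mask & (1 << i)]
second = [v for i, v in enumerate(values) if not mask & (1 << i)]
print("mask:", mask, "first:", first, "second:", second)
```

The loop runs 2^n - 1 times, which is why the notebook hands each mask to its own GPU thread instead.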

Make sure that COLAB has been set up to use a GPU.
In the main menu select the option:
 Runtime
In the pull-down menu select:
 Change runtime type
Select:
 T4 GPU

GPU cards are dedicated computers with their own memory and their own processors.

When programming a GPU, one needs to
- allocate memory on the GPU card
- copy data from the host's memory to the GPU's memory
- execute the kernel, the code that is applied across all the elements of an array
- copy the results of interest back to the host

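The function parallelFor in the code below carries out these steps.
The two cupy.asarray calls allocate memory on the GPU and copy the host arrays into it.
The third line launches the kernel with nPartitions//256 blocks of 256 threads each, that is, one thread per partition:

 arrayGPU = cupy.asarray(array)
 resultGPU = cupy.asarray(result)
 evaluatePartition[nPartitions//256,256](arrayGPU,resultGPU,n)

Instead of copying the whole result array back, parallelFor reduces it to its maximum on the GPU with max_reduce, and only that single value is returned to the host.
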
The decorator
 @cuda.jit
marks a function as a kernel, so that Numba compiles it for the GPU.
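
The launch configuration `[nPartitions//256, 256]` can be checked with plain arithmetic. The sketch below needs no GPU. The helper names `launch_shape` and `global_index` are made up for illustration, and `global_index` mirrors the standard 1D formula (block index times block size plus thread index) that `cuda.grid(1)` evaluates inside a kernel.

```python
THREADS_PER_BLOCK = 256  # block size used in the notebook's kernel launch

def launch_shape(n):
    """Return (blocks, threads per block) as computed by nPartitions // 256."""
    n_partitions = 2 ** n
    return n_partitions // THREADS_PER_BLOCK, THREADS_PER_BLOCK

def global_index(block_idx, thread_idx, threads_per_block=THREADS_PER_BLOCK):
    """Global thread index in a 1D launch, as returned by cuda.grid(1)."""
    return block_idx * threads_per_block + thread_idx

for n in (24, 27, 30):
    blocks, threads = launch_shape(n)
    # blocks * threads equals 2**n, so every partition gets exactly one thread
    print(n, blocks, blocks * threads == 2 ** n)

# For n < 8 the integer division yields zero blocks
print(launch_shape(7))
```

Because 2^n is a multiple of 256 whenever n >= 8, the division leaves no remainder and no thread falls outside the result array. For smaller n the block count is zero, so the launch as written assumes n >= 8.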



In [4]:
%%writefile partition_numba_cuda.py
#
# Program that solves the partition problem in python
# Parallel version with numba
#
import sys
import numpy as np
#import numba
from numba import cuda
#from numba.cuda.cudadrv.devicearray import DeviceNDArray
import cupy
import time

# A reduction on the GPU
@cuda.reduce
def max_reduce(a, b):
    if a < b:
        return b
    else:
        return a

#
# This is the kernel, the code that each GPU thread executes.
# The thread's global index, value, is read as an n-bit mask:
# if bit i is set, array[i] is added to sum1s, otherwise to sum0s.
#
#def evaluatePartition(array:DeviceNDArray,result:DeviceNDArray,n:np.dtype=np.int64):
@cuda.jit
def evaluatePartition(array,result,n):
  value = cuda.grid(1)
  sum0s = 0
  sum1s = 0
  mask = 1
  for i in range(0,n):
    if ((mask & value) != 0):
      sum1s = sum1s + array[i]
    else:
      sum0s = sum0s + array[i]
    mask = mask * 2
  # Record the mask if both groups have the same sum, 0 otherwise
  if (sum0s == sum1s):
    # print("Evaluate partition ",value," returns ",value)
    result[value] = value
  else:
    # print("Evaluate partition ",value," returns ",0)
    result[value] = 0

def printResults(value, n, array):
  print("Solution:\n")
  print("First partition: ")
  mask = 1
  sum = 0
  for i in range(0,n):
    if ((mask & value) != 0):
      print(array[i],end=" ")
      sum = sum + array[i]
    mask = mask * 2
  print(" sum: ",sum)
  print("Second partition: ")
  mask = 1
  sum = 0
  for i in range(0,n):
    if ((mask & value) == 0):
      print(array[i],end=" ")
      sum = sum + array[i]
    mask = mask * 2
  print(" sum: \n",sum)

def parallelFor(n,array,nPartitions):
  # One result slot per candidate partition
  result = np.zeros(nPartitions,dtype=np.int64)
  # Copy the input and the result buffer to the GPU
  #arrayGPU = cuda.to_device(array)
  #resultGPU = cuda.to_device(result)
  arrayGPU = cupy.asarray(array)
  resultGPU = cupy.asarray(result)
  # Launch one thread per partition: nPartitions//256 blocks of 256 threads.
  # This assumes n >= 8; for smaller n the block count would be 0.
  #evaluatePartition.forall(nPartitions)( arrayGPU,resultGPU, n)
  evaluatePartition[nPartitions//256,256](arrayGPU,resultGPU,n)
  # Perform a reduction on the GPU to find the maximum value;
  # it is nonzero when some partition has equal sums
  solutionFound = max_reduce(resultGPU,init=0)
  # resultGPU.copy_to_host(result)
  # solutionFound = np.max(result)

  return solutionFound

if __name__ == "__main__":

  # Read the problem
  n = int(input())
  values = [int(v) for v in input().split()]
  # Print the instance of the problem
  print("Problem size: ",n)
  print("Problem instance: ",values)
  nPartitions = 2 ** n
  np_array = np.array(values)
  # Call twice so that the compilation time is not included
  solutionFound = parallelFor(n,np_array,nPartitions)
  start = time.time()
  solutionFound = parallelFor(n,np_array,nPartitions)
  end = time.time()
  elapsed = end - start
  if (solutionFound):
    printResults(solutionFound, n, values)
  else:
    print("No solution was found.")
  print("The program took: ",elapsed," seconds.")


Writing partition_numba_cuda.py


Now, let's execute the code with several test cases.

In [5]:
%%writefile instanceNoSolution24.Text
24
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 1000000

Writing instanceNoSolution24.Text


In [6]:
!python partition_numba_cuda.py < instanceNoSolution24.Text

Problem size:  24
Problem instance:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 1000000]
No solution was found.
The program took:  0.06231880187988281  seconds.


In [7]:
%%writefile test27.Text
27
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

Writing test27.Text


In [8]:
!python partition_numba_cuda.py < test27.Text

Problem size:  27
Problem instance:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
Solution:

First partition: 
1 20 21 22 23 24 25 26 27  sum:  189
Second partition: 
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19  sum: 
 189
The program took:  0.5470447540283203  seconds.


In [9]:
%%writefile test28.Text
28
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

Writing test28.Text


In [10]:
!python partition_numba_cuda.py < test28.Text

Problem size:  28
Problem instance:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]
Solution:

First partition: 
7 21 22 23 24 25 26 27 28  sum:  203
Second partition: 
1 2 3 4 5 6 8 9 10 11 12 13 14 15 16 17 18 19 20  sum: 
 203
The program took:  0.9284102916717529  seconds.


In [11]:
%%writefile test29.Text
29
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

Writing test29.Text


In [12]:
!python partition_numba_cuda.py < test29.Text

Problem size:  29
Problem instance:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
No solution was found.
The program took:  2.0164570808410645  seconds.


In [13]:
%%writefile test30.Text
30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Writing test30.Text


In [14]:
!python partition_numba_cuda.py < test30.Text

Problem size:  30
Problem instance:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
No solution was found.
The program took:  3.66446852684021  seconds.
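
The "No solution was found." results for the instances 1..29 and 1..30 agree with a simple parity argument: two groups with equal sums require an even total. The sketch below checks the totals of the four test instances 1..n. The helper name `total_and_parity` is illustrative. An even total is only a necessary condition: instanceNoSolution24 has an even total of 1000276 but still has no solution, because 1000000 outweighs all the other values combined.

```python
def total_and_parity(n):
    """Return the sum 1 + 2 + ... + n and whether that sum is even."""
    total = n * (n + 1) // 2
    return total, total % 2 == 0

for n in (27, 28, 29, 30):
    total, even = total_and_parity(n)
    # An odd total rules out a partition; an even one only allows it
    print(n, total, "even" if even else "odd")
```

For n = 27 and n = 28 the totals are 378 and 406, twice the group sums 189 and 203 reported above.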


In [None]:
!lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 2.00GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  1
    Socket(s):           1
    Stepping:            3
    BogoMIPS:            4000.29
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clf
                         lush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_
                         good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fm
                         a cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hyp
                         ervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd i

In [15]:
!nvidia-smi

Thu Sep  4 15:17:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
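
The 15360 MiB reported for the Tesla T4 limits the problem size, because parallelFor allocates one int64 result per partition. The sketch below is a rough estimate that counts only the result array. Ignoring the input array, the CUDA context, and any temporary storage used by the reduction is an assumption of this sketch, so the real limit may be lower.

```python
BYTES_PER_INT64 = 8
T4_MEMORY_MIB = 15360  # total memory taken from the nvidia-smi output

def result_array_mib(n):
    """Size in MiB of an int64 array with 2**n entries."""
    return (2 ** n) * BYTES_PER_INT64 / 2 ** 20

for n in (27, 28, 29, 30, 31):
    mib = result_array_mib(n)
    print(n, mib, "fits" if mib <= T4_MEMORY_MIB else "does not fit")
```

By this estimate n = 30 needs 8192 MiB for the result array alone, and n = 31 would need 16384 MiB, more than the card provides.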
                                                

In [16]:
%%writefile instanceNoSolution25.Text
25
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1000000

Writing instanceNoSolution25.Text


In [17]:
!python partition_numba_cuda.py < instanceNoSolution25.Text

Problem size:  25
Problem instance:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 1000000]
No solution was found.
The program took:  0.1255781650543213  seconds.
