# Computing $\pi$ with the GPU

The goal of this exercise is to compute an estimate of $\pi$ on the GPU using [Monte Carlo integration](https://en.wikipedia.org/wiki/Monte_Carlo_integration).


This technique consists of generating a number of random 2D points with coordinates ranging from 0 to 1,
and then computing the distance of each of these points to the origin.
 
The fraction of points falling within a distance of 1.0 from the origin should approximate $\pi/4$, with increasing precision as the number of generated points increases.
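
As a short justification (a sketch, assuming the points are uniformly distributed over the unit square; the symbols $N$ for the number of generated points, $N_\text{in}$ for the number of points within distance 1 of the origin, and $\hat{\pi}$ for the estimate are notation introduced here), the probability of landing inside the quarter disk is the ratio of the two areas:

$$
P(x^2 + y^2 \le 1) = \frac{\text{area of the quarter disk}}{\text{area of the unit square}} = \frac{\pi/4}{1} = \frac{\pi}{4},
\qquad
\hat{\pi} = 4\,\frac{N_\text{in}}{N}
$$

Since each point is an independent trial with success probability $p = \pi/4$, the standard error of the estimate shrinks as $1/\sqrt{N}$:

$$
\sigma_{\hat{\pi}} = 4\sqrt{\frac{p(1-p)}{N}} = \sqrt{\frac{\pi(4-\pi)}{N}} \approx \frac{1.64}{\sqrt{N}}
$$

For example, with $N = 2^{24}$ points the expected error is of the order of $4 \times 10^{-4}$.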


![](figures/MCpi.png)


### What CPU and GPU am I using?

Before we start, let's check which processor and GPU we will be using. Performance can vary a lot depending on the model. Google Colab does not allow us to choose the model, but it is free.

In [1]:
!echo "CPU:"
!cat /proc/cpuinfo | grep name
!echo "GPU:"
!nvidia-smi

CPU:
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
GPU:
/bin/bash: line 1: nvidia-smi: command not found


## CPU implementation

We provide a reference CPU implementation in Python, compiled with Numba's `@njit`.

In [2]:
from numpy.random import seed
from numpy.random import random
from numba import jit,njit,prange,cuda, types, float32

@njit()
def estimate_pi_cpu( nb_points ):
    nb_points_in = 0 # number of points in the circle
    
    for i in range(nb_points):
        pt = random(2) # random 2D coordinates between 0 and 1
        
        dist = (pt**2).sum()**0.5  # distance from the origin
        
        nb_points_in += (dist <= 1.0) # increment if distance is <= 1

    return (nb_points_in / nb_points)*4
    

# a single estimate of pi
estimate_pi_cpu( 10**3 )    

3.172

The complexity is linear in the number of points generated.
We therefore use a large number of points here, so that the execution time is long enough to be measured reliably.

In [3]:
# timing the estimate with 2**24 points:
%time pi_estimate = estimate_pi_cpu( 2**24 ) 
print( pi_estimate )

CPU times: user 2.48 s, sys: 15.6 ms, total: 2.49 s
Wall time: 2.55 s
3.141481399536133


## The CUDA implementation

Now it's your turn to implement the CUDA kernel! 

The main difficulties here reside in the implementation of the random number generation as well as the reduction to the final value. 

Use the numba documentation to help you:
 * [random number generation](https://nvidia.github.io/numba-cuda/user/random.html)
 * [reduction](https://nvidia.github.io/numba-cuda/user/intrinsics.html#example) (here we suggest using an "atomic" operation rather than Numba's dedicated `cuda.reduce`)

In [None]:
@cuda.jit
def estimate_pi_gpu( ... ):
    # generate a single point and check its distance
    #  + handle the reduction

# calling the function
size = 2**12

blocksize = # block size = number of threads per block dimension
gridsize = # grid size = number of blocks per grid dimension

# Check!
estimate_pi_gpu[gridsize, blocksize]()

Now, time your function:

In [None]:
# calling the function
size = 2**24

blocksize = # block size = number of threads per block dimension
gridsize = # grid size = number of blocks per grid dimension


%time pi_estimate = estimate_pi_gpu[gridsize, blocksize]()

### Additional task: Foolproof

Adapt the previous code to handle sizes which are **not** a power of 2. **Hint:** you need to change both the kernel and the gridsize.
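
The grid-size part of the hint can be sketched as follows, assuming a 1D launch with one point per thread. The helper name `grid_size_for` and the block size of 128 are hypothetical, chosen for illustration only. The idea is to round the number of blocks *up*, so that there are always at least `size` threads:

```python
def grid_size_for(size, blocksize):
    """Smallest number of blocks such that blocks * blocksize >= size
    (integer ceiling division)."""
    return (size + blocksize - 1) // blocksize

blocksize = 128  # hypothetical choice for this illustration
for size in (2**12, 2**12 + 1):
    blocks = grid_size_for(size, blocksize)
    surplus = blocks * blocksize - size  # threads with no point to process
    print(size, blocks, surplus)
```

Rounding up means the last block may contain surplus threads. This is why the kernel also has to change: each thread should compare its global index against `size` and do nothing if it falls outside the valid range.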


In [None]:
@cuda.jit
def estimate_pi_gpu( ... ):
    # generate a single point and check its distance
    #  + handle the reduction

# calling the function
size = 2**12 + 1

blocksize = # block size = number of threads per block dimension
gridsize = # grid size = number of blocks per grid dimension

# Check!
estimate_pi_gpu[gridsize, blocksize]()