
# Gridding

Here we present several approaches to the gridding problem for a potentially huge dataset.

## Summary

- We achieve a **10-fold** increase...
- scaling
- C
- CUDA

The sections below describe each approach in detail and report benchmark results for the different versions.

## 1. `python/` 

Contains the pure Python treatment of the problem. The approach taken in each version is as follows:
   - `v0_original.py`: The original version of the code.
   - `v1_index_calc_jitted.py`: The calculation of the grid indices is just-in-time (JIT) compiled.
   - `v2_gridding_jitted.py`: The whole gridding function is JIT compiled.
   - `v3_single_timestep_vectorized.py`: The calculation of a single timestep of the grid is fully vectorized using NumPy built-in functions.
   - `v4_single_timestep_vectorized_jitted.py`: In addition to the vectorization, the function that calculates a single timestep of the grid is JIT compiled.
   - `v5_gridding_vectorized.py`: The whole gridding function is fully vectorized using NumPy.
   - `v6_gridding_vectorized_multithreaded.py`: In addition to the vectorized gridding function, this version uses Python's `concurrent` library to parallelize the gridding over `n_workers` threads, each working on a chunk of the dataset.
   - `v7_mpi_timesteps.py`: Using the `mpi4py` library, the computation of the grid is divided over timesteps among multiple processes.
   - `v8_mpi_baselines.py`: Same as above, but the dataset is divided over baseline pairs rather than timesteps among multiple processes.
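
To make the vectorized and chunked strategies in the list above concrete, here is a minimal sketch. It is not the code in `python/`. The function names `grid_chunk` and `grid_threaded`, the nearest-cell index rule, and the parameters `n_grid` and `cell` are all assumptions made for illustration. It uses `np.add.at` for the accumulation and `concurrent.futures.ThreadPoolExecutor` to split the samples over `n_workers` threads, then sums the partial grids.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def grid_chunk(u, v, vis, n_grid, cell):
    """Accumulate one chunk of samples onto an n_grid x n_grid array."""
    # Hypothetical index rule: nearest cell, with (0, 0) at the grid centre.
    iu = np.rint(u / cell).astype(np.int64) + n_grid // 2
    iv = np.rint(v / cell).astype(np.int64) + n_grid // 2
    inside = (iu >= 0) & (iu < n_grid) & (iv >= 0) & (iv < n_grid)
    grid = np.zeros((n_grid, n_grid), dtype=np.complex128)
    # np.add.at accumulates repeated indices correctly; a plain
    # `grid[iv, iu] += vis` would keep only one sample per cell.
    np.add.at(grid, (iv[inside], iu[inside]), vis[inside])
    return grid


def grid_threaded(u, v, vis, n_grid, cell, n_workers=4):
    """Split the samples into n_workers chunks and sum the partial grids."""
    chunks = zip(np.array_split(u, n_workers),
                 np.array_split(v, n_workers),
                 np.array_split(vis, n_workers))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = pool.map(lambda c: grid_chunk(*c, n_grid, cell), chunks)
        return sum(partial, np.zeros((n_grid, n_grid), dtype=np.complex128))


# Synthetic data, only to exercise the two functions.
rng = np.random.default_rng(0)
u = rng.uniform(-8, 8, 1000)
v = rng.uniform(-8, 8, 1000)
vis = rng.normal(size=1000) + 1j * rng.normal(size=1000)

serial = grid_chunk(u, v, vis, n_grid=16, cell=1.0)
threaded = grid_threaded(u, v, vis, n_grid=16, cell=1.0, n_workers=4)
```

Because every chunk writes to its own partial grid, the threads never share an output array. The same split-then-sum pattern applies whether the chunks are taken along timesteps or along baseline pairs.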

### 1.1 Benchmarking

Here we present the benchmarking results for the pure Python implementations. Note that these benchmarks were run on a single node with *dual sockets* and one *AMD EPYC 7H12 64-Core Processor* per socket.

![original_version](python/plots/v0_original.png)

The benchmark above shows that the `gridding` function is the most time-consuming step and the bottleneck, as expected from its 3 nested loops in Python. For this reason we focus on the gridding function and present the results and strategies for it.
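
For readers who have not seen the original code, the kind of triple loop meant here can be sketched as follows. This is a hypothetical reconstruction, not `v0_original.py`: the loop axes (timesteps, baseline pairs, and an assumed third axis of per-channel scale factors), the name `gridding_loops`, and the nearest-cell index rule are all assumptions.

```python
def gridding_loops(uv, vis, scales, n_grid, cell):
    """Accumulate samples onto an n_grid x n_grid grid with three Python loops.

    uv[t][b]     -> (u, v) coordinates for timestep t and baseline pair b
    vis[t][b][c] -> complex sample for timestep t, baseline b, scale index c
    scales[c]    -> factor applied to (u, v) for index c (assumed third axis)
    """
    grid = [[0j] * n_grid for _ in range(n_grid)]
    half = n_grid // 2
    for t in range(len(uv)):                      # loop 1: timesteps
        for b in range(len(uv[t])):               # loop 2: baseline pairs
            u, v = uv[t][b]
            for c, scale in enumerate(scales):    # loop 3: assumed third axis
                iu = round(u * scale / cell) + half
                iv = round(v * scale / cell) + half
                if 0 <= iu < n_grid and 0 <= iv < n_grid:
                    grid[iv][iu] += vis[t][b][c]
    return grid


# Tiny synthetic input: 2 timesteps, 2 baselines, 2 scale factors.
uv = [[(1.0, 0.0), (0.0, 2.0)],
      [(1.0, 0.0), (-1.0, -1.0)]]
vis = [[[1 + 0j, 1 + 0j] for _ in range(2)] for _ in range(2)]
grid = gridding_loops(uv, vis, scales=[1.0, 2.0], n_grid=8, cell=1.0)
```

Every innermost iteration pays the interpreter's overhead for index arithmetic, bounds checks, and a list update, which is why such a loop nest dominates the runtime as the dataset grows.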

Here are the benchmarks for the different versions of the code without any explicit parallelism on our side (i.e. versions 1 to 5, with no multi-threading and no MPI multi-processing):

![v1tov5](python/plots/v1tov5.png)

![v6](python/plots/v6.png)

![v7v8](python/plots/v7v8.png)

## 2. `C_python/`


| Version 1 (OpenMP): Number of Threads | Time (s) |
| ------------------ | --------|
| 1   | 0.316 |
| 2   | 0.172 |
| 4   | 0.094 |
| 8   | 0.053 |
| 16  | 0.035 |
| 32  | 0.034 |
| 64  | 0.037 |
| 128 | 0.048 |

<center><img src="C_python/plots/v1.png" alt="v1"/></center>
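
The table can be turned into speedup and parallel efficiency figures with a few lines of Python. The sketch below copies the timings from the Version 1 table. The helper name `scaling` is ours, and speedup is taken relative to the single-thread run of the same version.

```python
def scaling(threads, times):
    """Return (n, speedup, efficiency) relative to the first (1-thread) entry."""
    t1 = times[0]
    return [(n, t1 / t, t1 / t / n) for n, t in zip(threads, times)]


# Timings copied from the Version 1 (OpenMP) table above.
threads = [1, 2, 4, 8, 16, 32, 64, 128]
times = [0.316, 0.172, 0.094, 0.053, 0.035, 0.034, 0.037, 0.048]

rows = scaling(threads, times)
for n, speedup, efficiency in rows:
    print(f"{n:4d} threads: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")

# Thread count with the highest speedup.
best_n, best_speedup, _ = max(rows, key=lambda row: row[1])
```

With these numbers the speedup peaks at 32 threads, at roughly 9.3 times the single-thread run, and the time increases again at 64 and 128 threads.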


| Version 2 (MPI): Number of MPI Processes | Time (s) |
| ------------------ | --------|
| 1   | 0.340 |
| 2   | 0.198 |
| 4   | 0.181 |
| 8   | 0.170 |
| 16  | 0.198 |
| 32  | 0.306 |
| 64  | 0.386 |
| 128 | 0.674 |

<center><img src="C_python/plots/v2.png" alt="v2"/></center>


| Version 3 (SIMD/OpenMP): Number of Threads | Time (s) |
| ------------------ | --------|
| 1   | 0.162 |
| 2   | 0.088 |
| 4   | 0.045 |
| 8   | 0.030 |
| 16  | 0.026 |
| 32  | 0.030 |
| 64  | 0.034 |
| 128 | 0.045 |

<center><img src="C_python/plots/v3.png" alt="v3"/></center>
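
To compare the three versions side by side, the following sketch copies the timings from the tables above, picks the fastest configuration of each version, and computes the ratio of Version 1 to Version 3 at each thread count. The variable names are ours, and the ratio is only a rough indicator of the SIMD gain, since the tables do not say what else differs between the two versions.

```python
# Thread counts (v1, v3) or MPI process counts (v2), as in the tables above.
counts = [1, 2, 4, 8, 16, 32, 64, 128]

# Timings in seconds, copied from the three tables.
timings = {
    "v1 (OpenMP)": [0.316, 0.172, 0.094, 0.053, 0.035, 0.034, 0.037, 0.048],
    "v2 (MPI)": [0.340, 0.198, 0.181, 0.170, 0.198, 0.306, 0.386, 0.674],
    "v3 (SIMD/OpenMP)": [0.162, 0.088, 0.045, 0.030, 0.026, 0.030, 0.034, 0.045],
}

# Fastest (time, count) pair per version.
best = {name: min(zip(ts, counts)) for name, ts in timings.items()}
for name, (t, n) in best.items():
    print(f"{name:18s} best time {t:.3f} s at {n} threads/processes")

# Ratio of v1 to v3 at equal thread count.
simd_gain = [a / b for a, b in zip(timings["v1 (OpenMP)"],
                                   timings["v3 (SIMD/OpenMP)"])]
for n, g in zip(counts, simd_gain):
    print(f"{n:4d} threads: v1 / v3 = {g:.2f}")
```

In these tables Version 3 is about 1.95 times faster than Version 1 on one thread, and the gap narrows to about 1.07 times at 128 threads. The fastest entries are 0.034 s at 32 threads for Version 1, 0.170 s at 8 processes for Version 2, and 0.026 s at 16 threads for Version 3.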