## Reduction: the sum of the elements of an array

In [2]:
import numpy as np
import sys
import os

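# Modification for exercise 3.2 d): read the array size from the command line
try:
    value = int(sys.argv[1])
except (IndexError, ValueError):
    # No usable argument (e.g. when run inside a notebook): fall back to the default size
    print("womp womp")
    value = 5*10**7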

print("Value used is", value)

if 'SLURM_CPUS_PER_TASK' in os.environ:
    cpus = int(os.environ['SLURM_CPUS_PER_TASK'])
    print("Detected %s CPUs through slurm"%cpus)
else:
    cpus = 4
    print("Running on default number of CPUs (Para este boletin: %s)"%cpus)


def reduc_operation(A):
    """Compute the sum of the elements of Array A."""
    s = 0
    for i in range(A.size):
        s += A[i]
    return s

# Sequential

X = np.random.rand(value)

# To print the first values of the array

#print(X[0:12])

# Using IPython magic commands

tiempo = %timeit -r 2 -o -q reduc_operation(X)

print("Time taken by reduction operation using a function:", tiempo)

print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation(X)}\n")


# Using numpy.sum()

tiempo = %timeit -r 2 -o -q np.sum(X)

print("Time taken by reduction operation using numpy.sum():", tiempo)

print("Now, the result using numpy.sum():", np.sum(X),"\n ")


womp womp
Value used is 50000000
Running on default number of CPUs (Para este boletin: 4)
Time taken by reduction operation using a function: 5.12 s ± 49.6 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25000452.20692294

Time taken by reduction operation using numpy.sum(): 19.2 ms ± 47.4 µs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Now, the result using numpy.sum(): 25000452.20692847 
 


# Section 3.3
## Exercise A

In [6]:
import numpy as np
import multiprocessing as mp

def reduc_operation(A):
    """Compute the sum of the elements of Array A."""
    s = 0
    for i in range(A.size):
        s += A[i]
    return s

def reduc_operation_multiprocessing(A, cpus):
    """Sum the elements of A using a pool of `cpus` worker processes."""
    # a) Split the array into one chunk per worker process
    sub_arrays = np.array_split(A, cpus)

    # b) Create the process pool
    with mp.Pool(cpus) as pool:
        # c) Map: each worker sums its own chunk
        partial_sums = pool.map(reduc_operation, sub_arrays)

    # d) Reduce: combine the partial sums
    total = np.sum(partial_sums)
    return total
    

# Sequential
X = np.random.rand(value)

# To print the first values of the array

#print(X[0:12])

# Using IPython magic commands

tiempo = %timeit -r 2 -o -q reduc_operation(X)

print("Time taken by reduction operation using a function:", tiempo)

print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation(X)}\n")

# Using numpy.sum()

tiempo = %timeit -r 2 -o -q np.sum(X)

print("Time taken by reduction operation using numpy.sum():", tiempo)

print("Now, the result using numpy.sum():", np.sum(X),"\n ")

# Using multiprocessing

#cores = 2

tiempo = %timeit -r 2 -o -q reduc_operation_multiprocessing(X, cpus)

print("Time taken by reduction operation using a multiprocessing:", tiempo, " with ", cpus, " cores")

print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation_multiprocessing(X, cpus)}\n")

# With 4 cores

#cores = 4

#tiempo = %timeit -r 2 -o -q reduc_operation_multiprocessing(X, cores)

#print("Time taken by reduction operation using a multiprocessing:", tiempo, " with ", cores, " cores")

#print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation_multiprocessing(X, cores)}\n")


Time taken by reduction operation using a function: 5.13 s ± 49.2 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24997133.331796255

Time taken by reduction operation using numpy.sum(): 19.4 ms ± 1.57 µs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Now, the result using numpy.sum(): 24997133.3317903 
 
Time taken by reduction operation using a multiprocessing: 1.78 s ± 54.3 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  4  cores
And the result of the sum of numbers in the range [0, value) is: 24997133.33178983
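
The three results above agree only in roughly their first 12 significant digits. A likely reason, which the notebook itself does not state, is that floating-point addition is not associative, so each method's summation order rounds slightly differently. The sketch below illustrates the effect with the standard library only; `sequential_sum`, `chunked_sum` and the chunk count of 4 are hypothetical names and values chosen for illustration, not part of the exercise code.

```python
import math
import random

def sequential_sum(values):
    """Left-to-right accumulation, in the same order as reduc_operation."""
    s = 0.0
    for v in values:
        s += v
    return s

def chunked_sum(values, chunks):
    """Sum each chunk separately, then add the partial sums (map + reduce)."""
    size = math.ceil(len(values) / chunks)
    partial = [sequential_sum(values[i:i + size])
               for i in range(0, len(values), size)]
    return sequential_sum(partial)

random.seed(0)  # fixed seed so the run is reproducible
data = [random.random() for _ in range(100_000)]

seq = sequential_sum(data)
par = chunked_sum(data, 4)
exact = math.fsum(data)  # correctly rounded reference sum

print(f"sequential: {seq!r}")
print(f"chunked   : {par!r}")
print(f"fsum      : {exact!r}")
```

Any difference between the printed values should appear only in the trailing digits, which is the same kind of discrepancy seen between the loop, `numpy.sum` and the multiprocessing version.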



## Exercise B

In [3]:
import numpy as np
from numba import njit, prange, set_num_threads

@njit
def reduc_operation_njit(A):
    """Compute the sum of the elements of Array A."""
    s = 0
    for i in range(A.size):
        s += A[i]
    return s

@njit(parallel = True)
def reduc_operation_njit_paralelismo(A):
    """Compute the sum of the elements of Array A using a parallel loop (prange)."""
    s = 0
    for i in prange(A.size):
        s += A[i]
    return s

X = np.random.rand(value)

# Numba with njit
tiempo = %timeit -r 2 -o -q reduc_operation_njit(X)

print("Time taken by reduction operation using Numba:", tiempo)

print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation_njit(X)}\n")

# Numba with njit and parallelism (all available cores; it only takes milliseconds, so this is not a problem)
set_num_threads(cpus)
tiempo = %timeit -r 2 -o -q reduc_operation_njit_paralelismo(X)

print("Time taken by reduction operation using Numba with ", cpus, " cores:", tiempo)

print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation_njit_paralelismo(X)}\n")

Time taken by reduction operation using Numba: 49.1 ms ± 78.7 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25002330.949474234

Time taken by reduction operation using Numba with  4  cores: 13.6 ms ± 41.6 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25002330.949468683



## Exercise C
The complete output of the *notebook* after submitting the *batch* job is:
```text
#############################################################################
PARA 100000000 NÚMERO DE ELEMENTOS
#############################################################################
#############################################################################
CON 1 HILOS
#############################################################################
Value used is 100000000
Detected 1 CPUs through slurm
Time taken by reduction operation using a function: 17.3 s ± 118 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 50001179.13170367

Time taken by reduction operation using numpy.sum(): 64 ms ± 2.91 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 50001179.13170976 
 
Time taken by reduction operation using a function: 17.1 s ± 1.79 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49996708.03999949

Time taken by reduction operation using numpy.sum(): 64 ms ± 3.22 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 49996708.03998077 
 
Time taken by reduction operation using a multiprocessing: 21.3 s ± 1.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  1  cores
And the result of the sum of numbers in the range [0, value) is: 49996708.03999949

Time taken by reduction operation using Numba: 115 ms ± 55.5 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49994599.060447365

Time taken by reduction operation using Numba with  1  cores: 115 ms ± 33.5 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49994599.060447365

#############################################################################
CON 2 HILOS
#############################################################################
Value used is 100000000
Detected 2 CPUs through slurm
Time taken by reduction operation using a function: 17.4 s ± 206 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 50001868.21112538

Time taken by reduction operation using numpy.sum(): 64 ms ± 11.6 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 50001868.21112483 
 
Time taken by reduction operation using a function: 17.1 s ± 2.97 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49998738.92549363

Time taken by reduction operation using numpy.sum(): 64 ms ± 5 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 49998738.92548931 
 
Time taken by reduction operation using a multiprocessing: 11.2 s ± 2.25 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  2  cores
And the result of the sum of numbers in the range [0, value) is: 49998738.92549746

Time taken by reduction operation using Numba: 115 ms ± 96.9 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49998489.05212019

Time taken by reduction operation using Numba with  2  cores: 57.8 ms ± 80.4 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49998489.05212651

#############################################################################
CON 4 HILOS
#############################################################################
Value used is 100000000
Detected 4 CPUs through slurm
Time taken by reduction operation using a function: 17.3 s ± 79.6 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 50001101.96733471

Time taken by reduction operation using numpy.sum(): 64 ms ± 830 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 50001101.96732033 
 
Time taken by reduction operation using a function: 17.1 s ± 5.26 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49996336.87407814

Time taken by reduction operation using numpy.sum(): 64 ms ± 1.87 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 49996336.874079205 
 
Time taken by reduction operation using a multiprocessing: 6.22 s ± 1.55 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  4  cores
And the result of the sum of numbers in the range [0, value) is: 49996336.87408465

Time taken by reduction operation using Numba: 115 ms ± 31 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 50001513.7392027

Time taken by reduction operation using Numba with  4  cores: 29.6 ms ± 19.2 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 50001513.73920658

#############################################################################
CON 8 HILOS
#############################################################################
Value used is 100000000
Detected 8 CPUs through slurm
Time taken by reduction operation using a function: 17.3 s ± 119 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49998031.14231006

Time taken by reduction operation using numpy.sum(): 64 ms ± 8.6 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 49998031.1423115 
 
Time taken by reduction operation using a function: 17.1 s ± 545 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 49996635.54697312

Time taken by reduction operation using numpy.sum(): 64 ms ± 6.46 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 49996635.54698179 
 
Time taken by reduction operation using a multiprocessing: 3.9 s ± 4.25 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  8  cores
And the result of the sum of numbers in the range [0, value) is: 49996635.54698497

Time taken by reduction operation using Numba: 115 ms ± 87 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 50001084.03318901

Time taken by reduction operation using Numba with  8  cores: 25.3 ms ± 192 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 50001084.03319165

#############################################################################
PARA 1000000000 NÚMERO DE ELEMENTOS
#############################################################################
#############################################################################
CON 1 HILOS
#############################################################################
Value used is 1000000000
Detected 1 CPUs through slurm
Time taken by reduction operation using a function: 2min 52s ± 769 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500004796.85586435

Time taken by reduction operation using numpy.sum(): 640 ms ± 45.8 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Now, the result using numpy.sum(): 500004796.85595435 
 
Time taken by reduction operation using a function: 2min 51s ± 117 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500007606.52035695

Time taken by reduction operation using numpy.sum(): 640 ms ± 12.4 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Now, the result using numpy.sum(): 500007606.5196507 
 
Time taken by reduction operation using a multiprocessing: 3min 28s ± 18.8 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  1  cores
And the result of the sum of numbers in the range [0, value) is: 500007606.52035695

Time taken by reduction operation using Numba: 1.15 s ± 125 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500009173.1866432

Time taken by reduction operation using Numba with  1  cores: 1.15 s ± 71.1 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500009173.1866432

#############################################################################
CON 2 HILOS
#############################################################################
Value used is 1000000000
Detected 2 CPUs through slurm
Time taken by reduction operation using a function: 2min 51s ± 1.44 s per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500009735.6142498

Time taken by reduction operation using numpy.sum(): 640 ms ± 197 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Now, the result using numpy.sum(): 500009735.6149501 
 
Time taken by reduction operation using a function: 2min 48s ± 24.5 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 499994478.5931377

Time taken by reduction operation using numpy.sum(): 640 ms ± 254 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Now, the result using numpy.sum(): 499994478.59335524 
 
Time taken by reduction operation using a multiprocessing: 1min 52s ± 745 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  2  cores
And the result of the sum of numbers in the range [0, value) is: 499994478.5935718

Time taken by reduction operation using Numba: 1.15 s ± 1.14 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500002486.2344382

Time taken by reduction operation using Numba with  2  cores: 600 ms ± 8.38 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500002486.23482466

#############################################################################
CON 4 HILOS
#############################################################################
Value used is 1000000000
Detected 4 CPUs through slurm
Time taken by reduction operation using a function: 2min 53s ± 1.24 s per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500001341.1929432

Time taken by reduction operation using numpy.sum(): 640 ms ± 185 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Now, the result using numpy.sum(): 500001341.19302875 
 
Time taken by reduction operation using a function: 2min 51s ± 14.3 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500005883.16111183

Time taken by reduction operation using numpy.sum(): 640 ms ± 146 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Now, the result using numpy.sum(): 500005883.161257 
 
Time taken by reduction operation using a multiprocessing: 1min 5s ± 138 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  4  cores
And the result of the sum of numbers in the range [0, value) is: 500005883.16126215

Time taken by reduction operation using Numba: 1.15 s ± 430 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 499993774.4315653

Time taken by reduction operation using Numba with  4  cores: 324 ms ± 3.36 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 499993774.4305874

#############################################################################
CON 8 HILOS
#############################################################################
Value used is 1000000000
Detected 8 CPUs through slurm
Time taken by reduction operation using a function: 3min 5s ± 1.61 s per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 499983535.5511725

Time taken by reduction operation using numpy.sum(): 640 ms ± 20.6 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Now, the result using numpy.sum(): 499983535.5514781 
 
Time taken by reduction operation using a function: 2min 51s ± 12.6 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 499988005.1837364

Time taken by reduction operation using numpy.sum(): 640 ms ± 50.3 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Now, the result using numpy.sum(): 499988005.1835812 
 
Time taken by reduction operation using a multiprocessing: 36.9 s ± 163 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)  with  8  cores
And the result of the sum of numbers in the range [0, value) is: 499988005.183612

Time taken by reduction operation using Numba: 1.15 s ± 938 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500004231.8990439

Time taken by reduction operation using Numba with  8  cores: 225 ms ± 24.4 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 500004231.8984499
```

For easier reading and analysis, the results are summarized below:

| Size     | Optimization / function       | Threads | $\approx$ Time (ms)   |
| :---     | :---                          | :---:   | ---:                  |
| $10^8$   | Original reduc_operation      | –       | $17300$               |
|          | numpy.sum                     | –       | $64$                  |
|          | Numba                         | –       | $115$                 |
|          |                               |         |                       |
|          | Multiprocessing               | 1       | $21300$               |
|          |                               | 2       | $11200$               |
|          |                               | 4       | $6220$                |
|          |                               | 8       | $3900$                |
|          |                               |         |                       |
|          | Parallel Numba                | 1       | $115$                 |
|          |                               | 2       | $57.8$                |
|          |                               | 4       | $29.6$                |
|          |                               | 8       | $25.3$                |
|          |                               |         |                       |
| $10^9$   | Original reduc_operation      | –       | $172000$              |
|          | numpy.sum                     | –       | $640$                 |
|          | Numba                         | –       | $1150$                |
|          |                               |         |                       |
|          | Multiprocessing               | 1       | $208000$              |
|          |                               | 2       | $112000$              |
|          |                               | 4       | $65000$               |
|          |                               | 8       | $36900$               |
|          |                               |         |                       |
|          | Parallel Numba                | 1       | $1150$                |
|          |                               | 2       | $600$                 |
|          |                               | 4       | $324$                 |
|          |                               | 8       | $225$                 |

The final speedups for $10^9$ elements, relative to the original `reduc_operation` ($\approx 172000$ ms), are:

| Method                | Threads | $\approx$ Time (ms)   | Speedup     |
| :---                  | :---:   | ---:                  | ---:        |
| numpy.sum             | –       | $640$                 | $269\times$ |
| Numba                 | –       | $1150$                | $150\times$ |
| Multiprocessing       | 8       | $36900$               | $4.7\times$ |
| Parallel Numba        | 8       | $225$                 | $764\times$ |
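
As a check on these speedup figures, the following minimal sketch recomputes them as baseline time divided by optimized time, using the approximate timings from the batch output for $10^9$ elements. The names `BASELINE_MS`, `times_ms` and `speedup` are hypothetical helpers introduced here for illustration.

```python
# Approximate times in ms for 10^9 elements, taken from the batch output.
BASELINE_MS = 172_000  # original reduc_operation (2min 52s)

times_ms = {
    "numpy.sum": 640,
    "Numba": 1_150,
    "Multiprocessing (8)": 36_900,
    "Parallel Numba (8)": 225,
}

def speedup(baseline_ms, time_ms):
    """Speedup = baseline time / optimized time."""
    return baseline_ms / time_ms

speedups = {name: speedup(BASELINE_MS, t) for name, t in times_ms.items()}

for name, s in speedups.items():
    print(f"{name:<22} {s:6.1f}x")
```

Rounded, these ratios are about 269, 150, 4.7 and 764 respectively.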

First, for the original `reduc_operation` implementation, as well as for `numpy.sum` and the version compiled with `@njit` without parallelization, execution times are independent of the number of threads used. Among these three alternatives, `numpy.sum` gives the **best results** in every case. This is because it operates directly on *numpy* arrays through highly optimized internal C routines, minimizing Python overhead. Although *numba* offers a significant speedup over the original implementation, its non-parallel version is **less efficient** than `numpy.sum` for this particular problem, taking roughly 1.8 times as long ($115$ ms versus $64$ ms for $10^8$ elements).

When parallelism is introduced through *multiprocessing*, there is considerable **overhead** when a **single core** is used, to the point that it is slower than the sequential implementation ($21.3$ s versus $17.3$ s for $10^8$ elements). This penalty is mainly due to the cost of creating processes and of copying data between them. As the number of cores increases, the execution time clearly decreases, although the scaling is not linear: with 8 cores the speedup over the sequential version is only about $4.7\times$ for $10^9$ elements.
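
One way to quantify this non-linear scaling is the parallel efficiency, that is, the speedup over the sequential run divided by the number of workers. The sketch below applies that standard definition to the approximate *multiprocessing* timings measured for $10^9$ elements; `SEQUENTIAL_MS`, `mp_times_ms` and `efficiency` are hypothetical names used only for this illustration.

```python
# Approximate times in ms for 10^9 elements, taken from the batch output.
SEQUENTIAL_MS = 172_000  # original reduc_operation
mp_times_ms = {1: 208_000, 2: 112_000, 4: 65_000, 8: 36_900}  # workers -> time

def efficiency(t_seq_ms, t_par_ms, workers):
    """Parallel efficiency = (sequential time / parallel time) / workers."""
    return t_seq_ms / (t_par_ms * workers)

eff = {p: efficiency(SEQUENTIAL_MS, t, p) for p, t in mp_times_ms.items()}

for p, e in eff.items():
    print(f"{p} workers: speedup {SEQUENTIAL_MS / mp_times_ms[p]:4.2f}x, "
          f"efficiency {e:.2f}")
```

An efficiency of 1.0 would mean perfect linear scaling. Here every value stays below 1 and drops as workers are added, which is consistent with the process-creation and data-copy overhead described above.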

Finally, the parallel *numba* version gives the best overall results. With two cores it already outperforms all the other methods. For $10^9$ elements and 8 cores, it is approximately 764 times faster than the original `reduc_operation` implementation, which makes loop-level parallelization with *numba* the most efficient of the strategies tested for this kind of numerical reduction.