## Goal
This notebook demonstrates how to use the **Dask** library alongside NumPy and MKL for efficient array processing. The goal is to illustrate the advantages of parallel processing: Dask splits an array into chunks and operates on them concurrently with multiple threads, an approach that also lets it handle larger-than-memory datasets.

## Required Modules for the Jupyter Notebook
**Modules: mkl, numpy, dask**

Ensure that the software environment is properly set up to load these modules.

In [1]:
import mkl
import numpy as np
import dask.array as da

# Process an array with multiple threads

Multiple threads can process different parts of the same array simultaneously. `dask` provides this feature automatically once the `numpy` functions are replaced with their `dask` equivalents. The key concept is the chunk: each chunk of data is processed separately, potentially by a different thread. For a matrix, for example, we define a 2D block size; each block can then be processed independently, and the per-block results are combined to produce the final answer.
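To make the chunk idea concrete, here is a minimal sketch that does the same thing by hand with a thread pool and plain NumPy. The helper name `blockwise_sum` is hypothetical and chosen for illustration; this is not how `dask` is implemented internally, only the same split, process, and combine pattern on a tiny array.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def blockwise_sum(matrix, block_shape, num_workers=2):
    """Sum a 2D array by summing each block in a thread pool (illustrative only)."""
    block_rows, block_cols = block_shape
    # Slice the matrix into views, one per block; slicing does not copy data.
    blocks = [
        matrix[r:r + block_rows, c:c + block_cols]
        for r in range(0, matrix.shape[0], block_rows)
        for c in range(0, matrix.shape[1], block_cols)
    ]
    # Each block is reduced independently, then the partial sums are combined.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(np.sum, blocks))
    return sum(partial_sums), len(blocks)


small = np.arange(24, dtype=np.float64).reshape(4, 6)
total, n_blocks = blockwise_sum(small, block_shape=(2, 3))
print(total, n_blocks)  # the total should match small.sum()
```

A 4x6 array with 2x3 blocks yields four blocks. `dask` automates this bookkeeping and also handles operations whose per-block results cannot simply be added together.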

In [2]:
# NumPy on some platforms is already multithreaded thanks to Intel MKL.
# For this example we restrict MKL to a single thread so that any
# parallelism observed later comes from dask alone.
mkl.set_num_threads(1)

1

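The value echoed here is the return value of `mkl.set_num_threads`. In the `mkl-service` package this is the thread limit that was in effect before the call, so on its own it does not prove the new setting took effect. `mkl.get_max_threads()` reports the limit currently in force, which should now be 1, meaning the NumPy computations below run single-threaded.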
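One way to check the active thread limit directly is to query it after setting it. The sketch below assumes the `mkl` module comes from the `mkl-service` package, which exposes `get_max_threads()`; the helper name `limit_mkl_threads` is hypothetical, and the code falls back gracefully when `mkl` is not installed.

```python
try:
    import mkl  # provided by the mkl-service package; may be absent
except ImportError:
    mkl = None


def limit_mkl_threads(n):
    """Ask MKL to use n threads and report the limit now in effect.

    Returns None when the mkl module is not installed.
    """
    if mkl is None:
        return None
    mkl.set_num_threads(n)
    return mkl.get_max_threads()


active = limit_mkl_threads(1)
print("MKL thread limit:", active)
```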

Let's create a 2D array with 20,000 rows and 4,000 columns using the NumPy function **numpy.random.rand()**, which fills the array with random numbers drawn uniformly from the interval [0, 1).

In [5]:
A = np.random.rand(20000,4000)

In [7]:
A

array([[0.87994796, 0.1126094 , 0.54094229, ..., 0.87440272, 0.93519058,
        0.11257881],
       [0.87430053, 0.85287911, 0.42298797, ..., 0.03867527, 0.27968843,
        0.02246341],
       [0.27810658, 0.88910386, 0.4791533 , ..., 0.98192612, 0.8885504 ,
        0.69478787],
       ...,
       [0.54834815, 0.20210413, 0.01712134, ..., 0.09009374, 0.24191001,
        0.45682417],
       [0.66792487, 0.13783056, 0.77971882, ..., 0.60420015, 0.61608451,
        0.71115639],
       [0.05417205, 0.63275994, 0.2387579 , ..., 0.42738878, 0.45551214,
        0.55014156]])

`%whos` is an `IPython` magic command that lists the variables defined in the session. For NumPy arrays it also reports the shape, element type, and memory consumption.

In [6]:
%whos

Variable   Type       Data/Info
-------------------------------
A          ndarray    20000x4000: 80000000 elems, type `float64`, 640000000 bytes (610.3515625 Mb)
da         module     <module 'dask.array' from<...>/dask/array/__init__.py'>
mkl        module     <module 'mkl' from '/cm/s<...>ackages/mkl/__init__.py'>
np         module     <module 'numpy' from '/ho<...>kages/numpy/__init__.py'>
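The size reported for `A` can be reproduced with a little arithmetic, which is a handy check when planning chunk sizes. The sketch below recomputes it from the shape and the 8-byte `float64` element size without allocating the array; the variable names are illustrative.

```python
import numpy as np

rows, cols = 20000, 4000
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element

n_elems = rows * cols
n_bytes = n_elems * itemsize
# %whos labels this unit "Mb"; the figure it prints is bytes / 1024**2.
n_mib = n_bytes / 2**20

print(n_elems, n_bytes, n_mib)  # 80000000 640000000 610.3515625
```

For an existing array, `A.nbytes` gives the same byte count directly.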


First, let's perform some operations on the matrix in pure `numpy`, using a single thread:

In [8]:
%time B = A**2 + np.sin(A) * A * np.log(A)

CPU times: user 3.14 s, sys: 158 ms, total: 3.3 s
Wall time: 3.34 s


## Processing with dask

First, create a chunked `dask` array from the `numpy` array:

In [9]:
A_dask = da.from_array(A, chunks=(2000, 1000))

In [10]:
A_dask.numblocks

(10, 4)
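The block grid follows from dividing each array dimension by the chunk size along that axis and rounding up. The sketch below checks this on a lazily defined array of the same shape (`da.zeros` does not allocate the data until it is computed) and also works out the memory held by one chunk; the variable names are illustrative.

```python
import math

import dask.array as da

shape = (20000, 4000)
chunks = (2000, 1000)

# Lazy array: only metadata is created here, not 640 MB of zeros.
x = da.zeros(shape, chunks=chunks)

# Blocks per axis = ceil(dimension / chunk size along that axis).
expected_blocks = tuple(math.ceil(n / c) for n, c in zip(shape, chunks))
bytes_per_chunk = chunks[0] * chunks[1] * x.dtype.itemsize

print(x.numblocks, expected_blocks)    # (10, 4) (10, 4)
print(x.npartitions, bytes_per_chunk)  # 40 16000000
```

With these settings each of the 40 chunks holds 16 MB of `float64` data, which is the unit of work handed to a thread.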

Then replace each `numpy` function with its `dask` equivalent; `dask.array` implements most of the `numpy` functions and operations.

In [11]:
compute_B = (A_dask**2 + da.sin(A_dask) * A_dask * da.log(A_dask))
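Building the expression does not run any arithmetic: `dask` only records a task graph, and the work happens when `compute()` is called. The following sketch illustrates this on a small array so it runs instantly. The array `x` and its chunk size are made up for illustration, and the values are shifted away from 0 so that `log` stays finite.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
# Offset away from 0 so log() stays finite in this small illustration.
x = rng.random((8, 6)) + 0.1
x_dask = da.from_array(x, chunks=(4, 3))

# Building the expression only records a task graph; no arithmetic runs yet.
lazy = x_dask**2 + da.sin(x_dask) * x_dask * da.log(x_dask)
print(type(lazy).__name__, lazy.shape, lazy.numblocks)

# compute() executes the graph and returns an ordinary NumPy array.
result = lazy.compute()
expected = x**2 + np.sin(x) * x * np.log(x)
print(type(result).__name__, np.allclose(result, expected))
```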

In [12]:
%time B_dask = compute_B.compute(num_workers=1)

CPU times: user 3.48 s, sys: 151 ms, total: 3.63 s
Wall time: 3.64 s


In [13]:
%time B_dask = compute_B.compute(num_workers=2)

CPU times: user 3.5 s, sys: 132 ms, total: 3.63 s
Wall time: 1.91 s
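The timings can be summarized as a speedup and a parallel efficiency. The short calculation below uses the wall-clock times printed in this notebook; the numbers will differ on other hardware, so treat them as one sample rather than a benchmark.

```python
# Wall-clock times copied from the timed runs in this notebook (seconds).
wall_numpy = 3.34      # pure numpy, single thread
wall_1_worker = 3.64   # dask, num_workers=1
wall_2_workers = 1.91  # dask, num_workers=2

# Extra cost of dask's chunking and scheduling when no parallelism is used.
overhead = wall_1_worker / wall_numpy - 1

speedup = wall_1_worker / wall_2_workers
efficiency = speedup / 2  # fraction of ideal 2x scaling

print(f"dask overhead with 1 worker: {overhead:.0%}")
print(f"speedup with 2 workers: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

In this run the single-worker `dask` version is about 9% slower than pure `numpy`, while two workers give roughly a 1.9x speedup. The total CPU time stays about the same in both `dask` runs, since the same work is simply spread over two threads.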


In [14]:
#%time B_dask = compute_B.compute(num_workers=12)

In [15]:
#%time B_dask = compute_B.compute(num_workers=num_workers)

In [16]:
assert np.allclose(B, B_dask)

**Author**: Bob Sinkovits 

**Last Updated Date**: October 01, 2024

**Resources**: https://github.com/sinkovit/PythonSeries

## Submit Ticket
If you find anything that needs to be changed or corrected, or if you would like to provide feedback or contribute to the notebook, please submit a ticket by contacting us at:

Email: consult@sdsc.edu

We appreciate your input and will review your suggestions promptly!