## Goal
This notebook demonstrates how to use the **Dask** library alongside NumPy and MKL for efficient array processing. The goal is to illustrate the advantages of parallel processing: Dask splits an array into chunks and runs operations on those chunks concurrently in multiple threads, an approach that also lets it handle larger-than-memory datasets.

## Required Modules for the Jupyter Notebook
**Modules: mkl, dask, numpy**

Ensure that the software environment is properly set up to load these modules.

In [1]:
import mkl
import numpy as np
import dask.array as da

# Process an array with multiple threads

Multiple threads can process different parts of the same array simultaneously. `dask` provides this feature when the `numpy` functions are replaced with their `dask` equivalents. The key concept is the chunk: each chunk of data is processed separately, potentially by a different thread. For a matrix, for example, we define a 2D block size; each of those blocks can then be processed independently and the results combined to get the final answer.
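To make the chunk idea concrete, here is a minimal hand-written sketch that does not use `dask` at all. It assumes a small stand-in matrix (`A_small`), an arbitrary block size, and a hypothetical per-block function (`block_sum_of_squares`); none of these names come from the notebook. It only illustrates the pattern of splitting, processing blocks in threads, and accumulating, not how `dask` is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
A_small = rng.random((8, 6))      # small stand-in for the large matrix
block_rows, block_cols = 4, 3     # hypothetical 2D block size

# Cut the matrix into independent 2D blocks (NumPy views, no copies).
blocks = [
    A_small[i:i + block_rows, j:j + block_cols]
    for i in range(0, A_small.shape[0], block_rows)
    for j in range(0, A_small.shape[1], block_cols)
]


def block_sum_of_squares(block):
    """Work done on one chunk, independent of all the others."""
    return float(np.sum(block ** 2))


# Hand each block to a worker thread, then accumulate the partial results.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_sums = list(pool.map(block_sum_of_squares, blocks))

total = sum(partial_sums)
print(len(blocks), "blocks, total =", total)
```

The accumulated total should match `np.sum(A_small ** 2)` computed in one pass, up to floating-point rounding.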

In [None]:
mkl.set_num_threads(1)

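Note that with the `mkl-service` package, `set_num_threads` is expected to return the previous thread count rather than the new one, so the value displayed by this cell is not necessarily 1. To confirm that MKL is restricted to a single thread, check that `mkl.get_max_threads()` returns 1; your environment is then correctly set up to perform single-threaded computations.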

Let's create a 2D array with 20,000 rows and 4,000 columns using the NumPy function **numpy.random.rand()**, which fills the array with random numbers uniformly distributed over [0, 1).
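As a quick sanity check on the size of this array, the expected memory footprint can be worked out without allocating anything. The sketch below assumes 64-bit floats, which is what `numpy.random.rand()` returns; the variable names are illustrative only.

```python
import numpy as np

rows, cols = 20_000, 4_000
itemsize = np.dtype(np.float64).itemsize   # 8 bytes per element
nbytes = rows * cols * itemsize

print(f"{nbytes:,} bytes = {nbytes / 1e6:.0f} MB")
# prints "640,000,000 bytes = 640 MB"
```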

In [None]:
A = np.random.rand(20000,4000)

In [None]:
A

`%whos` is a magic command provided by `IPython` that lists the defined variables; for NumPy arrays, the listing includes their memory consumption.

In [None]:
%whos

First, let's perform some operations on the matrix in pure `numpy`, using a single thread.

In [None]:
%time B = A**2 + np.sin(A) * A * np.log(A)

## Processing with dask

First, create a chunked `dask` array from the `numpy` array.

In [None]:
A_dask = da.from_array(A, chunks=(2000, 1000))

In [None]:
A_dask.numblocks
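The block count reported above follows directly from the array shape and the chunk shape. The arithmetic can be sketched in plain Python, again assuming 8-byte floats for the per-chunk size estimate:

```python
import math

shape = (20000, 4000)     # array shape used in this notebook
chunks = (2000, 1000)     # chunk shape passed to da.from_array

# Blocks along each axis: array length divided by chunk length, rounded up.
numblocks = tuple(math.ceil(n / c) for n, c in zip(shape, chunks))
total_blocks = math.prod(numblocks)
chunk_mb = chunks[0] * chunks[1] * 8 / 1e6   # size of one float64 chunk in MB

print(numblocks, total_blocks, chunk_mb)
# prints "(10, 4) 40 16.0"
```

With these values the work is divided into 40 independent blocks of about 16 MB each, which gives the scheduler plenty of pieces to spread across threads.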

Then replace each function with the equivalent provided by `dask`, which implements most of the `numpy` functions and operations.

In [None]:
compute_B = (A_dask**2 + da.sin(A_dask) * A_dask * da.log(A_dask))
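Building the expression above performs no arithmetic yet: `dask` only records a task graph, and the work happens when `.compute()` is called. The small sketch below illustrates this on a tiny array so that it runs instantly; the names `x`, `x_dask`, `lazy`, and `result` are illustrative and not part of the notebook.

```python
import numpy as np
import dask.array as da

# Tiny stand-in array with strictly positive values so that log() is finite.
x = np.linspace(0.1, 1.0, 12).reshape(3, 4)
x_dask = da.from_array(x, chunks=(2, 2))

# Same expression as in the notebook; this only builds a lazy dask array.
lazy = x_dask**2 + da.sin(x_dask) * x_dask * da.log(x_dask)
print(type(lazy))

# The chunks are actually processed only here, yielding a NumPy array.
result = lazy.compute()
print(type(result), result.shape)
```

The computed `result` should agree with the same formula evaluated directly in `numpy` on `x`.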

In [None]:
%time B_dask = compute_B.compute(num_workers=1)

In [None]:
%time B_dask = compute_B.compute(num_workers=2)

In [None]:
#%time B_dask = compute_B.compute(num_workers=12)

In [None]:
#%time B_dask = compute_B.compute(num_workers=num_workers)

In [None]:
assert np.allclose(B, B_dask)

**Author**: Bob Sinkovits 

**Last Updated Date**: October 01, 2024

**Resources**: https://github.com/sinkovit/PythonSeries

## Submit Ticket
If you find anything that needs to be changed or edited, or if you would like to provide feedback or contribute to the notebook, please submit a ticket by contacting us at:

Email: consult@sdsc.edu

We appreciate your input and will review your suggestions promptly!