# Running Environment
To run this notebook, allocate more than 50 GB of memory.

# Array Multicore

Instead of running trivially parallel, independent tasks, here we want multiple threads to process different parts of the same array simultaneously. `dask` provides this automatically once the `numpy` functions are replaced with their `dask.array` counterparts. The key concept is the chunk: each chunk of data is processed separately, potentially by a different thread. For a matrix, for example, we define a 2D block size; each block can be processed independently, and the results are then combined into the final answer. See <http://dask.pydata.org/>.
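
To make the chunk idea concrete, here is a minimal sketch that does by hand what the paragraph describes, using only `numpy` and a standard-library thread pool. It is not how `dask` is implemented; the block size, the array shape, and the helper names `f` and `process` are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def f(block):
    # Elementwise expression applied to a single block
    return block**2 + np.sin(block) * block * np.log(block)


rng = np.random.default_rng(0)
A = rng.random((400, 600)) + 0.1  # small offset keeps log() finite

block_rows, block_cols = 200, 300  # hypothetical 2D block size

# One pair of slices per block of the array
slices = [
    (slice(i, i + block_rows), slice(j, j + block_cols))
    for i in range(0, A.shape[0], block_rows)
    for j in range(0, A.shape[1], block_cols)
]

B = np.empty_like(A)


def process(sl):
    # Each task writes to its own disjoint region of B
    B[sl] = f(A[sl])


with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(process, slices))

# The blockwise result matches the whole-array computation
assert np.allclose(B, f(A))
```

Because the expression is elementwise, the blocks need no communication; `dask` additionally handles scheduling and operations where blocks must be combined.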

In [1]:
import numpy as np
import dask.array as da

In [2]:
np.__config__.show()

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None

In [3]:
A = np.random.rand(20000, 40000)
print(f"Memory required for A: {A.nbytes / 1e9} GB")  

Memory required for A: 6.4 GB


In [4]:
# Number of elements (not bytes) in units of 2**30; multiply by A.itemsize (8) to get GiB
A.size / 1024**3

0.7450580596923828
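
The two numbers above measure different things: `A.nbytes` counts bytes, while `A.size` counts elements. The arithmetic can be checked without allocating the array, as in this short sketch (the variable names are illustrative only):

```python
import numpy as np

shape = (20000, 40000)
itemsize = np.dtype(np.float64).itemsize  # bytes per float64 element

n_elements = shape[0] * shape[1]
n_bytes = n_elements * itemsize

print(f"elements: {n_elements}")
print(f"decimal gigabytes (1e9 bytes): {n_bytes / 1e9}")
print(f"binary gibibytes (1024**3 bytes): {n_bytes / 1024**3:.2f}")
```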

In [5]:
%time B = A**2 + np.sin(A) * A * np.log(A)

CPU times: user 30.5 s, sys: 1.6 s, total: 32 s
Wall time: 32.2 s


In [6]:
from numba import njit

In [7]:
def compute_B(A):
    return A**2 + np.sin(A) * A * np.log(A)

In [8]:
# Note: this timing includes Numba's one-time JIT compilation of compute_B
%time B = njit(compute_B, parallel=True)(A)

CPU times: user 28.6 s, sys: 1.35 s, total: 30 s
Wall time: 15.6 s


In [9]:
A_dask = da.from_array(A, chunks=(2000, 2000))

In [10]:
A_dask.numblocks

(10, 20)
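
The block count follows from dividing each axis length by the chunk length and rounding up. The sketch below reproduces it on a lazy array of the same shape, so no large buffer is allocated; `da.zeros` stands in for the real data purely for illustration.

```python
import math

import dask.array as da

shape = (20000, 40000)
chunks = (2000, 2000)

# Lazy array: nothing is allocated until .compute() is called
x = da.zeros(shape, chunks=chunks, dtype="float64")

# Blocks per axis = ceil(axis length / chunk length)
expected_blocks = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
print(x.numblocks, expected_blocks)

# Memory held by one full chunk, in decimal megabytes
chunk_mb = chunks[0] * chunks[1] * x.dtype.itemsize / 1e6
print(f"{x.npartitions} chunks of {chunk_mb} MB each")
```

The chunk size is a trade-off: smaller chunks give the scheduler more tasks to balance across threads, but add per-task overhead.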

In [11]:
%time B_dask = (A_dask**2 + da.sin(A_dask) * A_dask * da.log(A_dask)).compute()

CPU times: user 34 s, sys: 1.41 s, total: 35.4 s
Wall time: 18.8 s
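
The dask expression is lazy until `.compute()` is called, and the scheduler can be chosen at that point. The following sketch, on a deliberately small array, compares the threaded scheduler with the single-threaded one; the array size, chunk size, and `num_workers=4` are assumptions for illustration, not settings used in this notebook.

```python
import dask.array as da
import numpy as np

rng = np.random.default_rng(0)
A_small = rng.random((2000, 2000)) + 0.1  # small offset keeps log() finite
A_small_dask = da.from_array(A_small, chunks=(500, 500))

# Building the expression only constructs a task graph
expr = A_small_dask**2 + da.sin(A_small_dask) * A_small_dask * da.log(A_small_dask)

# Threaded scheduler with an explicit number of worker threads
B_threads = expr.compute(scheduler="threads", num_workers=4)

# Single-threaded scheduler, useful for debugging and as a baseline
B_single = expr.compute(scheduler="synchronous")

assert np.allclose(B_threads, B_single)
```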


In [12]:
import psutil
import os
print(f"Used memory: {psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)} MB")

Used memory: 18614.0390625 MB


In [13]:
assert np.allclose(B, B_dask)