<h1>Dask Arrays</h1>

<p>
A Dask array lets you run array computations on datasets that are too large to fit into memory. It is a collection of smaller NumPy arrays, called chunks, partitioned so that the work can be processed in parallel or distributed across machines. Dask applies NumPy operations to each chunk and then aggregates the partial results.
</p><br>
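As a rough illustration of the chunk-then-aggregate idea in plain NumPy (this is not how Dask is implemented internally; the array `x` and the 2 x 2 block grid are made up for the sketch):

```python
import numpy as np

# Hypothetical small 2-D array standing in for a dataset too large for memory
x = np.arange(24, dtype=np.float64).reshape(4, 6)

# Split it into a 2 x 2 grid of blocks: rows [0:2], [2:4]; columns [0:3], [3:6]
blocks = [[x[r:r + 2, c:c + 3] for c in (0, 3)] for r in (0, 2)]

# Step 1: apply a NumPy reduction to every block independently
partial_max = [[block.max() for block in row] for row in blocks]

# Step 2: aggregate the per-block results into the final answer
overall_max = max(max(row) for row in partial_max)
print(overall_max)
```

Each block is processed on its own, so the per-block step could run in parallel; only the small partial results need to be combined at the end.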
Let's see Dask arrays in action with some examples:

<h3>Create dataset</h3>

In [2]:
%run prep-alt.py -d random

Created random data for array exercise in 42.46s


In [4]:
# Open the dataset with h5py (values are read from disk only when sliced)
import os

import h5py

f = h5py.File(os.path.join('data', 'random.hdf5'), mode='r')
dset = f['/x']

In [6]:
dset

<HDF5 dataset "x": shape (1000000000,), type "<f4">

In [11]:
%%time

# Compute the sum of a billion numbers by reading one million numbers at a time
sums = []
for i in range(0, 1_000_000_000, 1_000_000):
    chunk = dset[i:i + 1_000_000]  # Read one chunk into memory as a NumPy array
    sums.append(chunk.sum())

total = sum(sums)
print(total)

1000015360.0
CPU times: user 528 ms, sys: 2.02 s, total: 2.55 s
Wall time: 19.1 s


In [16]:
# Do the same with a Dask array
import dask.array as da

d = da.from_array(dset, chunks=(1_000_000,))
d


            Array            Chunk
Bytes       3.73 GiB         3.81 MiB
Shape       (1000000000,)    (1000000,)
Dask graph  1000 chunks in 2 graph layers
Data type   float32 numpy.ndarray
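The sizes in this summary follow from the element counts and the 4-byte item size of float32. A quick check of the arithmetic (the variable names are only for this sketch):

```python
n_elements = 1_000_000_000    # shape of the full array
chunk_elements = 1_000_000    # chunk size passed to da.from_array
itemsize = 4                  # bytes per float32 value

array_gib = n_elements * itemsize / 2**30      # bytes -> GiB
chunk_mib = chunk_elements * itemsize / 2**20  # bytes -> MiB
n_chunks = n_elements // chunk_elements

print(f"{array_gib:.2f} GiB, {chunk_mib:.2f} MiB, {n_chunks} chunks")
```

Only one chunk of about 3.81 MiB has to be in memory per task, which is why the full 3.73 GiB array never needs to be loaded at once.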


In [17]:
result = d.sum()  # Lazy: builds a task graph; nothing is read or summed yet
result

            Array    Chunk
Bytes       4 B      4 B
Shape       ()       ()
Dask graph  1 chunks in 8 graph layers
Data type   float32 numpy.ndarray
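This summary describes a recipe rather than a number: calling `sum()` on a Dask array only builds a task graph. A minimal sketch of the same behaviour on a small in-memory array (the array `x` and its chunk size are illustrative and not part of the dataset used here):

```python
import numpy as np
import dask.array as da

# Hypothetical small array split into four chunks of 250 elements
x = da.from_array(np.arange(1_000, dtype=np.float64), chunks=(250,))
print(x.chunks)                   # chunk sizes along each axis

lazy_total = x.sum()              # still a Dask array; nothing is reduced yet
print(type(lazy_total).__name__)

total = lazy_total.compute()      # runs the graph and returns a concrete value
print(total)
```

Deferring the work until `compute()` lets Dask see the whole graph before deciding how to schedule the per-chunk tasks.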


In [18]:
%%time

total = result.compute()  # Execute the task graph and return the concrete value
print(total)

1000015200.0
CPU times: user 1.04 s, sys: 2.32 s, total: 3.36 s
Wall time: 19 s
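The manual loop printed 1000015360.0 while Dask printed 1000015200.0. A likely explanation is float32 rounding: floating-point addition is not associative, so combining partial sums in a different order can change the last few digits. The sketch below uses made-up random data and a hypothetical helper, `chunked_sum`, to compare groupings against a float64 reference:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000_000, dtype=np.float32)  # stand-in data, not the HDF5 file


def chunked_sum(a, size):
    """Sum `a` in pieces of `size` elements, accumulating in float32."""
    partials = [a[i:i + size].sum() for i in range(0, a.size, size)]
    return np.array(partials, dtype=np.float32).sum()


small_chunks = chunked_sum(x, 1_000)
large_chunks = chunked_sum(x, 100_000)
reference = x.sum(dtype=np.float64)  # accumulate in double precision

print(small_chunks, large_chunks, reference)
```

The float32 results may differ slightly from each other and from the reference; if more digits matter, accumulating in float64 is one option.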


In [19]:
%%time
d.mean().compute()

CPU times: user 1 s, sys: 2.15 s, total: 3.15 s
Wall time: 19.2 s


np.float32(1.0000153)

In [20]:
%%time

d.std().compute()

CPU times: user 2.86 s, sys: 6.88 s, total: 9.74 s
Wall time: 19 s


np.float32(0.99999374)
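Statistics such as the mean and standard deviation can also be assembled from small per-chunk summaries. The sketch below shows one simple way to do it with a count, a sum, and a sum of squares per chunk. It is an illustration on made-up data, not a description of Dask's internal algorithm, and this formula can lose precision when the mean is much larger than the spread.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=1.0, scale=1.0, size=100_000)  # hypothetical stand-in data

count = 0
total = 0.0
total_sq = 0.0
for i in range(0, x.size, 10_000):
    chunk = x[i:i + 10_000]
    # Per-chunk summaries: only three numbers leave each chunk
    count += chunk.size
    total += chunk.sum()
    total_sq += np.square(chunk).sum()

mean = total / count
std = np.sqrt(total_sq / count - mean**2)  # population std (ddof=0)
print(mean, std)
```

Because each chunk contributes only three numbers, the combining step stays cheap no matter how large the chunks are.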