# Out of core computing

In this notebook we demonstrate how to use `dask` to run a computation on a 2 GB file that cannot be fully loaded into memory (a Raspberry Pi 3 has 1 GB of RAM).

In [None]:
!ls -lah ../data/random.hdf5

## Create an h5py dataset

An `h5py` dataset references the data on disk without loading it into memory unless explicitly asked to, for example by slicing.

In [None]:
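import os

import h5py

# Open read-only; without an explicit mode, h5py versions before 3.0 open the file read/write.
f = h5py.File(os.path.join('..', 'data', 'random.hdf5'), 'r')
dset = f['/x']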
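
To illustrate that a dataset is only a reference until it is sliced, here is a minimal sketch on a small temporary file. The file name `demo.hdf5` and its contents are made up for the example and are not the notebook's `random.hdf5`.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.hdf5")  # hypothetical demo file

    # Write a small dataset so the example is self-contained.
    with h5py.File(path, "w") as f_out:
        f_out.create_dataset("x", data=np.arange(1000, dtype="float64"))

    with h5py.File(path, "r") as f_in:
        d = f_in["/x"]
        # The dataset object is a handle to data on disk, not a numpy array.
        is_numpy = isinstance(d, np.ndarray)
        # Slicing reads only the requested elements and returns a numpy array.
        first = d[:5]
        first_sum = float(first.sum())

print(is_numpy, type(first).__name__, first_sum)
```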

In [None]:
# Number of elements in the dataset, in millions
dset.shape[0] / 1e6

In [None]:
dset.dtype

## Compute the mean with `dask`

We can create a `dask` array from any object that exposes the same interface as a `numpy` array, in this case an `h5py` dataset. The chunk size defines how large each subsection of the array loaded and processed by `dask` is. Several chunks can be held in memory at the same time so that multiple cores can work in parallel.

In [None]:
import dask.array as da
x = da.from_array(dset, chunks=(int(1e6),))
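
To see how the chunk size splits an array, the sketch below uses a small in-memory `numpy` array as a stand-in for the `h5py` dataset. The names `small` and `x_small` and the sizes are illustrative only.

```python
import numpy as np
import dask.array as da

# Small in-memory stand-in for the on-disk dataset (illustrative data).
small = np.arange(10, dtype="float64")

# Ask for chunks of 4 elements; the last chunk holds the remainder.
x_small = da.from_array(small, chunks=(4,))

print(x_small.chunks)     # chunk sizes along each dimension
print(x_small.numblocks)  # number of chunks along each dimension
```

With 10 elements and a chunk size of 4, the array is split into three chunks of 4, 4 and 2 elements.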

`dask` computations are lazy: they are not evaluated immediately, which lets `dask` combine several operations and optimize the computation as a whole.

In [None]:
result = x[:int(4e7)].mean()
result
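
The same lazy behaviour can be checked on a small in-memory array. This is a sketch with made-up data; `x_demo` and `lazy_mean` are names chosen for the example.

```python
import numpy as np
import dask.array as da

data = np.arange(100, dtype="float64")
x_demo = da.from_array(data, chunks=(25,))

# Slicing and mean() only build a task graph; nothing is computed yet.
lazy_mean = x_demo[:40].mean()
print(type(lazy_mean))

# compute() runs the graph and returns a concrete number.
value = lazy_mean.compute()
print(value)
```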

`num_workers` specifies the number of threads to use. In this case the computation is bound by loading data from disk, so adding threads may not speed it up much.

In [None]:
%time result.compute(num_workers=4)
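
As a rough sketch of how `num_workers` is passed, the example below runs the same lazy result with one and with four threads on a small in-memory array. The data and names are illustrative, and the threaded scheduler is selected explicitly with `scheduler="threads"`.

```python
import numpy as np
import dask.array as da

data = np.ones(1000, dtype="float64")
x_demo = da.from_array(data, chunks=(100,))
total = x_demo.sum()

# Same task graph, different thread counts: only the scheduling changes,
# the numerical result stays the same.
with_one = total.compute(scheduler="threads", num_workers=1)
with_four = total.compute(scheduler="threads", num_workers=4)
print(with_one, with_four)
```

On an array this small any timing difference is noise; the point is only that the result does not depend on the number of threads.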