# 10+ Minutes to Dask

<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/03.001%20-%2010%2B%20minutes%20to%20dask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd

import dask.dataframe as dd
import dask.array as da
import dask.bag as db

# Dask Objects

## Dask DataFrames

In [3]:
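# Build a pandas DataFrame with an hourly DatetimeIndex, then convert it to Dask below
# (pandas 2.2+ deprecates the uppercase "H" alias, so use lowercase "h")
idx = pd.date_range("2023-05-06", periods=1000, freq="h")
pd_df = pd.DataFrame({"a": np.arange(1000), "b": list("abcd" * 250)}, index=idx)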

In [4]:
dask_df = dd.from_pandas(pd_df, npartitions=10)

In [6]:
dask_df

| npartitions=10 | a | b |
|---|---|---|
| 2023-05-06 00:00:00 | int32 | object |
| 2023-05-10 04:00:00 | ... | ... |
| ... | ... | ... |
| 2023-06-12 12:00:00 | ... | ... |
| 2023-06-16 15:00:00 | ... | ... |


In [16]:
dask_df.divisions

(Timestamp('2023-05-06 00:00:00'),
 Timestamp('2023-05-10 04:00:00'),
 Timestamp('2023-05-14 08:00:00'),
 Timestamp('2023-05-18 12:00:00'),
 Timestamp('2023-05-22 16:00:00'),
 Timestamp('2023-05-26 20:00:00'),
 Timestamp('2023-05-31 00:00:00'),
 Timestamp('2023-06-04 04:00:00'),
 Timestamp('2023-06-08 08:00:00'),
 Timestamp('2023-06-12 12:00:00'),
 Timestamp('2023-06-16 15:00:00'))

In [17]:
dask_df.partitions[1]

| npartitions=1 | a | b |
|---|---|---|
| 2023-05-10 04:00:00 | int32 | object |
| 2023-05-14 08:00:00 | ... | ... |


## Dask Arrays

Dask arrays coordinate many NumPy arrays, arranged as chunks within a grid.  
They support a subset of the NumPy API.

In [13]:
np_array = np.arange(100000).reshape(200,500)

In [14]:
dask_array = da.from_array(np_array, chunks = (100,100))

In [15]:
dask_array

| | Array | Chunk |
|---|---|---|
| Bytes | 390.62 kiB | 39.06 kiB |
| Shape | (200, 500) | (100, 100) |
| Dask graph | 10 chunks in 1 graph layer | 10 chunks in 1 graph layer |
| Data type | int32 numpy.ndarray | int32 numpy.ndarray |


In [18]:
dask_array.chunks

((100, 100), (100, 100, 100, 100, 100))
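
The `chunks` tuple holds one tuple of block sizes per axis. The sketch below rebuilds the same array to show how that relates to the shape of the chunk grid, and what happens when the chunk size does not divide the array shape evenly. The `uneven` array is an illustrative addition, not part of the original notebook.

```python
import numpy as np
import dask.array as da

np_array = np.arange(100000).reshape(200, 500)
dask_array = da.from_array(np_array, chunks=(100, 100))

# One tuple of block sizes per axis: 2 blocks of 100 rows, 5 blocks of 100 columns.
print(dask_array.chunks)     # ((100, 100), (100, 100, 100, 100, 100))

# numblocks is the shape of the chunk grid: 2 * 5 = 10 chunks in total.
print(dask_array.numblocks)  # (2, 5)

# Chunk sizes need not divide the shape evenly; the trailing chunks are smaller.
uneven = da.from_array(np_array, chunks=(150, 200))
print(uneven.chunks)         # ((150, 50), (200, 200, 100))
```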

In [20]:
dask_array.blocks[1,3]

| | Array | Chunk |
|---|---|---|
| Bytes | 39.06 kiB | 39.06 kiB |
| Shape | (100, 100) | (100, 100) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | int32 numpy.ndarray | int32 numpy.ndarray |


In [46]:
# let's play with a slightly more interesting example
# x is a matrix of random numbers
x = da.random.random((100, 100), chunks=(10,10))

In [47]:
x

| | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (100, 100) | (10, 10) |
| Dask graph | 100 chunks in 1 graph layer | 100 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [49]:
# operations work just like NumPy; this only builds a task graph, nothing is computed yet
y = x + x.T
y

| | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (100, 100) | (10, 10) |
| Dask graph | 100 chunks in 3 graph layers | 100 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [50]:
z1 = y[::2, 50:].mean(axis=0)
z2 = y[::2, 50:].mean(axis=1)

In [54]:
z1

| | Array | Chunk |
|---|---|---|
| Bytes | 400 B | 80 B |
| Shape | (50,) | (10,) |
| Dask graph | 5 chunks in 7 graph layers | 5 chunks in 7 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [55]:
# to actually compute z1, let's use .compute()
z1.compute()

array([1.0501553 , 1.06197004, 1.01516805, 0.93068943, 0.91996986,
       0.96290249, 0.95436305, 1.02401206, 1.02742763, 0.99243557,
       0.99718952, 1.02277176, 0.96894195, 1.06672099, 0.95699315,
       0.96303598, 0.95443972, 1.00187926, 1.01374906, 1.08192617,
       0.9878612 , 0.96265303, 1.01540312, 1.11345933, 1.10503314,
       1.02269819, 0.94002421, 1.04850822, 1.00349413, 0.98533396,
       1.02348341, 1.02310511, 0.99864816, 1.01686015, 0.98120901,
       1.03313574, 1.01092895, 0.96804934, 0.92283257, 0.95952287,
       1.00533698, 1.06081486, 1.03399228, 1.03617546, 1.04105659,
       1.04961689, 1.0398077 , 0.89031904, 0.93275545, 1.10425938])

In [52]:
z2

| | Array | Chunk |
|---|---|---|
| Bytes | 400 B | 40 B |
| Shape | (50,) | (5,) |
| Dask graph | 10 chunks in 7 graph layers | 10 chunks in 7 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [56]:
z2.compute()

array([1.03194912, 0.98617607, 1.01494417, 1.10864259, 1.09323665,
       0.98258642, 1.02550513, 1.04390162, 1.0713531 , 0.94868255,
       0.9968235 , 1.00088024, 1.01133825, 1.12596304, 1.00038937,
       1.0242949 , 0.90683964, 0.96937712, 1.01647708, 1.00213515,
       0.97681707, 1.02783107, 1.02108425, 0.99047058, 1.00493484,
       1.0758277 , 0.94491701, 0.97677376, 1.03889031, 1.11524223,
       0.96262916, 0.96272934, 0.94027077, 0.97636111, 1.07913921,
       0.99388244, 0.9082804 , 1.05450462, 1.08352098, 1.01350698,
       1.05083639, 0.98493071, 0.96317911, 0.99226219, 0.90567896,
       0.95376945, 1.12273386, 0.92814731, 0.95625753, 0.91621441])
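
Calling `.compute()` on `z1` and `z2` separately runs two independent computations. When several lazy results share intermediate steps (here, the array `y`), `dask.compute` can evaluate them in a single call. A minimal sketch, rebuilding the arrays from above; the names `z1_result` and `z2_result` are illustrative.

```python
import dask
import dask.array as da

x = da.random.random((100, 100), chunks=(10, 10))
y = x + x.T
z1 = y[::2, 50:].mean(axis=0)
z2 = y[::2, 50:].mean(axis=1)

# dask.compute takes several lazy collections and evaluates them together,
# so tasks that both results depend on can be shared.
z1_result, z2_result = dask.compute(z1, z2)

# Both results are plain NumPy arrays.
print(type(z1_result), z1_result.shape)  # shape is (50,)
print(type(z2_result), z2_result.shape)  # shape is (50,)
```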

## Dask Bag

A Bag is an unordered collection of objects that allows repeats. Use it for semi-structured or unstructured data.  
It's fun, but slower than DataFrames and arrays.  
The [examples](https://examples.dask.org/bag.html) page is really interesting.

In [21]:
dask_bag = db.from_sequence([1,2,3,4,5,6,7,8,9,0], npartitions = 2)

In [24]:
dask_bag

dask.bag<from_sequence, npartitions=2>

In [25]:
dask_bag.take(2)

(1, 2)

In [30]:
# Dask is lazy: filter() only builds the graph; take(2) then computes just the first partition
dask_bag.filter(lambda x: x>3).take(2)

(4, 5)

In [33]:
# compute() evaluates every partition and returns all matching elements
dask_bag.filter(lambda x: x>3).compute()

[4, 5, 6, 7, 8, 9]
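
Because `take` looks only at the first partition by default, it can return fewer elements than requested (with a warning) when the matches live in later partitions. The sketch below uses the `npartitions` argument of `take` to search every partition instead. The filter threshold and the name `big` are chosen purely for illustration.

```python
import dask.bag as db

# Two partitions: [1, 2, 3, 4, 5] and [6, 7, 8, 9, 0].
dask_bag = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], npartitions=2)

# Nothing in the first partition is greater than 5.
big = dask_bag.filter(lambda x: x > 5)

# npartitions=-1 tells take() to draw from all partitions, not just the first.
print(big.take(2, npartitions=-1))  # (6, 7)
```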

In [31]:
dask_bag.map(lambda x:x*x).take(5)

(1, 4, 9, 16, 25)

In [32]:
dask_bag.count().compute()

10

In [43]:
# convert to a dask dataframe
# this is a trivial example
dask_df_from_bag = dask_bag.to_dataframe()

In [44]:
dask_df_from_bag

| npartitions=2 | 0 |
|---|---|
| | int64 |
| | ... |
| | ... |


In [45]:
# TODO: define a complex json and convert to dataframe
# step 1: define a 'flatten' function
# step 2: map 'flatten' to the bag
# step 3: convert the flattened bag to dataframe using bag_instance.to_dataframe()

