# 10+ Minutes to Dask


In [1]:
import numpy as np
import pandas as pd

import dask.dataframe as dd
import dask.array as da
import dask.bag as db

# Dask Objects

## Dask DataFrames

In [3]:
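# Build a pandas DataFrame with an hourly DatetimeIndex;
# it is converted to a Dask DataFrame in the next cell.
# The lowercase "h" alias avoids the deprecation warning that "1H" raises in recent pandas.
idx = pd.date_range("2023-05-06", periods = 1000, freq="h")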
pd_df = pd.DataFrame({"a": np.arange(1000), "b": list("abcd"*250)}, index = idx)

In [4]:
dask_df = dd.from_pandas(pd_df, npartitions=10)

In [6]:
dask_df

| npartitions=10 | a | b |
|---|---|---|
| 2023-05-06 00:00:00 | int32 | object |
| 2023-05-10 04:00:00 | ... | ... |
| ... | ... | ... |
| 2023-06-12 12:00:00 | ... | ... |
| 2023-06-16 15:00:00 | ... | ... |


In [16]:
dask_df.divisions

(Timestamp('2023-05-06 00:00:00'),
 Timestamp('2023-05-10 04:00:00'),
 Timestamp('2023-05-14 08:00:00'),
 Timestamp('2023-05-18 12:00:00'),
 Timestamp('2023-05-22 16:00:00'),
 Timestamp('2023-05-26 20:00:00'),
 Timestamp('2023-05-31 00:00:00'),
 Timestamp('2023-06-04 04:00:00'),
 Timestamp('2023-06-08 08:00:00'),
 Timestamp('2023-06-12 12:00:00'),
 Timestamp('2023-06-16 15:00:00'))

In [17]:
dask_df.partitions[1]

| npartitions=1 | a | b |
|---|---|---|
| 2023-05-10 04:00:00 | int32 | object |
| 2023-05-14 08:00:00 | ... | ... |


## Dask Arrays

In [13]:
np_array = np.arange(100000).reshape(200,500)

In [14]:
dask_array = da.from_array(np_array, chunks = (100,100))

In [15]:
dask_array

| | Array | Chunk |
|---|---|---|
| Bytes | 390.62 kiB | 39.06 kiB |
| Shape | (200, 500) | (100, 100) |
| Dask graph | 10 chunks in 1 graph layer | 10 chunks in 1 graph layer |
| Data type | int32 numpy.ndarray | int32 numpy.ndarray |


In [18]:
dask_array.chunks

((100, 100), (100, 100, 100, 100, 100))

In [20]:
dask_array.blocks[1,3]

| | Array | Chunk |
|---|---|---|
| Bytes | 39.06 kiB | 39.06 kiB |
| Shape | (100, 100) | (100, 100) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | int32 numpy.ndarray | int32 numpy.ndarray |
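
To see how a block maps back onto the original NumPy array, here is a small sketch that recreates the array from the cells above and compares block `[1, 3]` against the matching NumPy slice. The variable `block` is introduced here for illustration.

```python
import numpy as np
import dask.array as da

np_array = np.arange(100000).reshape(200, 500)
dask_array = da.from_array(np_array, chunks=(100, 100))

# numblocks counts the chunks along each axis: 200/100 rows, 500/100 columns
print(dask_array.numblocks)  # (2, 5)

# Block [1, 3] covers rows 100-199 and columns 300-399 of the original array
block = dask_array.blocks[1, 3].compute()
print(np.array_equal(block, np_array[100:200, 300:400]))  # True
```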


## Dask Bag

A Bag is an unordered collection of objects that allows repeats. Use it for semi-structured or unstructured data.  
Bags are fun to work with but slower than DataFrames and arrays.  
The [examples](https://examples.dask.org/bag.html) page is really interesting.

In [21]:
dask_bag = db.from_sequence([1,2,3,4,5,6,7,8,9,0], npartitions = 2)

In [24]:
dask_bag

dask.bag<from_sequence, npartitions=2>

In [25]:
dask_bag.take(2)

(1, 2)

In [30]:
# dask is lazy: take(2) computes only what it needs and, by default, reads from the first partition only
dask_bag.filter(lambda x: x>3).take(2)

(4, 5)
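
Because `take` looks only at the first partition by default, it can return fewer items than requested. A short sketch of the difference, using the `npartitions` and `warn` arguments of `Bag.take` (the names `filtered`, `first_only`, and `all_parts` are introduced here for illustration):

```python
import dask.bag as db

dask_bag = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], npartitions=2)
filtered = dask_bag.filter(lambda x: x > 3)

# The first partition holds only two matches, so asking for three
# returns two; warn=False silences the "insufficient elements" warning.
first_only = filtered.take(3, warn=False)

# npartitions=-1 lets take() look through every partition
all_parts = filtered.take(3, npartitions=-1)

print(first_only)  # (4, 5)
print(all_parts)   # (4, 5, 6)
```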

In [33]:
# Here's how we take ALL across all partitions
dask_bag.filter(lambda x: x>3).compute()

[4, 5, 6, 7, 8, 9]

In [31]:
dask_bag.map(lambda x:x*x).take(5)

(1, 4, 9, 16, 25)

In [32]:
dask_bag.count().compute()

10

In [43]:
# convert to a dask dataframe
# this is a trivial example
dask_df_from_bag = dask_bag.to_dataframe()

In [44]:
dask_df_from_bag

| npartitions=2 | 0 |
|---|---|
| | int64 |
| | ... |
| | ... |


In [45]:
# TODO: define a complex json and convert to dataframe
# step 1: define a 'flatten' function
# step 2: map 'flatten' to the bag
# step 3: convert the flattened bag to dataframe using bag_instance.to_dataframe()

