# 10+ Minutes to Dask

<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/03.001%20-%2010%2B%20minutes%20to%20dask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
import dask.array as da
import dask.bag as db

# Dask Objects

## Dask DataFrames

Dask DataFrames coordinate many Pandas DataFrames, partitioned along an index.  
They support a subset of the Pandas API.  


In [3]:
# dask dataframe
# from pandas
idx = pd.date_range("2023-05-06", periods = 1000, freq="1H")

In [4]:
idx

DatetimeIndex(['2023-05-06 00:00:00', '2023-05-06 01:00:00',
               '2023-05-06 02:00:00', '2023-05-06 03:00:00',
               '2023-05-06 04:00:00', '2023-05-06 05:00:00',
               '2023-05-06 06:00:00', '2023-05-06 07:00:00',
               '2023-05-06 08:00:00', '2023-05-06 09:00:00',
               ...
               '2023-06-16 06:00:00', '2023-06-16 07:00:00',
               '2023-06-16 08:00:00', '2023-06-16 09:00:00',
               '2023-06-16 10:00:00', '2023-06-16 11:00:00',
               '2023-06-16 12:00:00', '2023-06-16 13:00:00',
               '2023-06-16 14:00:00', '2023-06-16 15:00:00'],
              dtype='datetime64[ns]', length=1000, freq='H')

In [5]:
pd_df = pd.DataFrame({"a": np.arange(1000), "b": list("abcd"*250)}, index = idx)

In [6]:
pd_df

|  | a | b |
|---|---|---|
| 2023-05-06 00:00:00 | 0 | a |
| 2023-05-06 01:00:00 | 1 | b |
| 2023-05-06 02:00:00 | 2 | c |
| 2023-05-06 03:00:00 | 3 | d |
| 2023-05-06 04:00:00 | 4 | a |
| ... | ... | ... |
| 2023-06-16 11:00:00 | 995 | d |
| 2023-06-16 12:00:00 | 996 | a |
| 2023-06-16 13:00:00 | 997 | b |
| 2023-06-16 14:00:00 | 998 | c |


In [7]:
dask_df = dd.from_pandas(pd_df, npartitions=10)

In [8]:
dask_df

| npartitions=10 | a | b |
|---|---|---|
| 2023-05-06 00:00:00 | int32 | object |
| 2023-05-10 04:00:00 | ... | ... |
| ... | ... | ... |
| 2023-06-12 12:00:00 | ... | ... |
| 2023-06-16 15:00:00 | ... | ... |


In [9]:
dask_df.divisions

(Timestamp('2023-05-06 00:00:00'),
 Timestamp('2023-05-10 04:00:00'),
 Timestamp('2023-05-14 08:00:00'),
 Timestamp('2023-05-18 12:00:00'),
 Timestamp('2023-05-22 16:00:00'),
 Timestamp('2023-05-26 20:00:00'),
 Timestamp('2023-05-31 00:00:00'),
 Timestamp('2023-06-04 04:00:00'),
 Timestamp('2023-06-08 08:00:00'),
 Timestamp('2023-06-12 12:00:00'),
 Timestamp('2023-06-16 15:00:00'))

In [10]:
dask_df.partitions[1]

| npartitions=1 | a | b |
|---|---|---|
| 2023-05-10 04:00:00 | int32 | object |
| 2023-05-14 08:00:00 | ... | ... |


In [11]:
# data types of each of the columns
dask_df.dtypes

a     int32
b    object
dtype: object

We can now use familiar Pandas operations on the Dask DataFrame.

In [12]:
# get a subset based on index (date-time)
dask_df2 = dask_df.loc[idx[0:100]]

In [13]:
dask_df2

| npartitions=1 | a | b |
|---|---|---|
| 2023-05-06 00:00:00 | int32 | object |
| 2023-05-10 03:00:00 | ... | ... |


In [14]:
# perform analysis on the subset
dask_df2_grpby_count = dask_df2.groupby("b").count()

In [15]:
# Dask evaluates lazily
# nothing happens until we call .compute()
dask_df2_grpby_count.compute()

| b | a |
|---|---|
| a | 25 |
| b | 25 |
| c | 25 |
| d | 25 |


## Dask Arrays

Dask arrays coordinate many NumPy arrays, arranged into chunks within a grid.  
Dask arrays support a subset of the NumPy API.
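
As a rough illustration of the chunk grid, the sketch below splits a small 4 x 6 array into 2 x 3 chunks. The array `arr` and its sizes are assumptions picked to keep the numbers easy to follow.

```python
import numpy as np
import dask.array as da

# hypothetical small array: 4 rows x 6 columns, cut into 2 x 3 chunks
arr = da.from_array(np.arange(24).reshape(4, 6), chunks=(2, 3))

# chunk sizes along each axis, and how many blocks that gives per axis
print(arr.chunks)
print(arr.numblocks)

# a single block is itself a (lazy) Dask array;
# computing it yields the underlying NumPy array for that grid cell
block = arr.blocks[1, 0].compute()
print(block)
```

Here `blocks[1, 0]` selects the second row of chunks and the first column of chunks, i.e. rows 2 to 3 and columns 0 to 2 of the original array.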

In [16]:
np_array = np.arange(100000).reshape(200,500)

In [17]:
dask_array = da.from_array(np_array, chunks = (100,100))

In [18]:
dask_array

|  | Array | Chunk |
|---|---|---|
| Bytes | 390.62 kiB | 39.06 kiB |
| Shape | (200, 500) | (100, 100) |
| Dask graph | 10 chunks in 1 graph layer | 10 chunks in 1 graph layer |
| Data type | int32 numpy.ndarray | int32 numpy.ndarray |


In [19]:
dask_array.chunks

((100, 100), (100, 100, 100, 100, 100))

In [20]:
dask_array.blocks[1,3]

|  | Array | Chunk |
|---|---|---|
| Bytes | 39.06 kiB | 39.06 kiB |
| Shape | (100, 100) | (100, 100) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | int32 numpy.ndarray | int32 numpy.ndarray |


In [21]:
# let's play with a slightly more interesting example
# x is a matrix of random numbers
x = da.random.random((100, 100), chunks=(10,10))

In [22]:
x

|  | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (100, 100) | (10, 10) |
| Dask graph | 100 chunks in 1 graph layer | 100 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [23]:
# operations work just like NumPy
y = x + x.T
y

|  | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (100, 100) | (10, 10) |
| Dask graph | 100 chunks in 3 graph layers | 100 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [24]:
z1 = y[::2, 50:].mean(axis=0)
z2 = y[::2, 50:].mean(axis=1)

In [25]:
z1

|  | Array | Chunk |
|---|---|---|
| Bytes | 400 B | 80 B |
| Shape | (50,) | (10,) |
| Dask graph | 5 chunks in 7 graph layers | 5 chunks in 7 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [26]:
# to actually compute z1, let's use .compute()
z1.compute()

array([0.97353801, 1.02844861, 1.06520363, 1.05446419, 0.93803455,
       0.99803611, 1.00387065, 1.01041554, 0.95016994, 0.91659862,
       1.02434146, 1.05890953, 0.9516734 , 0.97019368, 1.07281716,
       1.04878525, 0.92483438, 1.04502919, 0.94282573, 0.94172099,
       1.12537585, 1.03281995, 0.93976849, 1.0011391 , 1.07423183,
       1.00384963, 0.92349168, 0.98070803, 1.04302798, 0.89963968,
       0.97781067, 1.01410889, 0.9562097 , 0.94543564, 1.03873395,
       0.98234484, 1.04494163, 0.88986827, 0.96159682, 1.03090534,
       0.83207303, 1.09516323, 1.03234816, 1.01138979, 1.05420189,
       0.96428564, 1.13044664, 0.91320433, 1.0505532 , 1.0544578 ])

In [27]:
z2

|  | Array | Chunk |
|---|---|---|
| Bytes | 400 B | 40 B |
| Shape | (50,) | (5,) |
| Dask graph | 10 chunks in 7 graph layers | 10 chunks in 7 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [28]:
z2.compute()

array([1.02346259, 0.99639792, 1.02768892, 1.05014243, 0.92389502,
       0.92748688, 0.99779037, 1.00420674, 1.00640365, 1.00734093,
       0.94799347, 1.0443322 , 1.06873102, 0.91485425, 0.90967781,
       0.95693231, 0.9089969 , 1.01414155, 1.07880526, 1.01263477,
       1.07439469, 0.95186629, 1.04382879, 1.0857993 , 0.99439504,
       1.01305053, 1.09130445, 0.9476037 , 0.89805551, 1.03457553,
       0.97532432, 0.89716531, 1.05223143, 1.02509804, 1.01098296,
       1.14220326, 0.9732965 , 1.0644032 , 0.97883769, 1.00693725,
       0.96682821, 1.02720542, 0.93317276, 0.89735661, 0.96001181,
       0.91357726, 1.0545463 , 1.00343285, 1.06527702, 1.01936537])
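
Because the random input changes on every run, the numbers above are hard to check by eye. One way to gain confidence is to run the same slicing and reduction on a small deterministic array and compare against plain NumPy. The 4 x 4 array and the names below are assumptions made for this sketch.

```python
import numpy as np
import dask.array as da

# hypothetical deterministic input instead of random numbers
x_np = np.arange(16, dtype=float).reshape(4, 4)
x_da = da.from_array(x_np, chunks=(2, 2))

# same pattern as above: add the transpose, take every other row
# and the right half of the columns, then average along each axis
y_da = x_da + x_da.T
col_means = y_da[::2, 2:].mean(axis=0)
row_means = y_da[::2, 2:].mean(axis=1)

# the NumPy reference computation
y_np = x_np + x_np.T
print(np.allclose(col_means.compute(), y_np[::2, 2:].mean(axis=0)))
print(np.allclose(row_means.compute(), y_np[::2, 2:].mean(axis=1)))
```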

## Dask Bag

A Bag is an unordered collection of objects that allows repeats. Use it for semi-structured or unstructured data.  
It's fun, but slower than DataFrames and arrays.  
The [examples](https://examples.dask.org/bag.html) page is really interesting.
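
As a small taste of what "semi-structured" means here, the sketch below filters a bag of dictionaries and pulls out one field. The records are invented for this example, and the synchronous scheduler is selected only to keep the sketch lightweight; it is not required.

```python
import dask
import dask.bag as db

# hypothetical records; real data would usually come from files
records = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 17},
    {"name": "Cy", "age": 45},
    {"name": "Di", "age": 12},
]
bag = db.from_sequence(records, npartitions=2)

# filter keeps matching records, pluck extracts one key from each record
adults = bag.filter(lambda r: r["age"] >= 18).pluck("name")

# assumption: run in the current process to avoid starting worker processes
with dask.config.set(scheduler="synchronous"):
    adult_names = adults.compute()

print(sorted(adult_names))
```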

In [29]:
dask_bag = db.from_sequence([1,2,3,4,5,6,7,8,9,0], npartitions = 2)

In [30]:
dask_bag

dask.bag<from_sequence, npartitions=2>

In [31]:
dask_bag.take(2)

(1, 2)

In [32]:
# dask is lazy - by default take() only evaluates the first partition
dask_bag.filter(lambda x: x>3).take(2)

(4, 5)

In [33]:
# Here's how we take ALL across all partitions
dask_bag.filter(lambda x: x>3).compute()

[4, 5, 6, 7, 8, 9]
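
The difference between the two calls above can be sketched with the `npartitions` argument of `take`. The names `big`, `first_only` and `all_parts` are made up here, and the synchronous scheduler is an assumption used to keep the example self-contained.

```python
import dask
import dask.bag as db

bag = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], npartitions=2)
big = bag.filter(lambda x: x > 3)

with dask.config.set(scheduler="synchronous"):
    # by default take() looks only at the first partition, so it may
    # return fewer items than requested (warn=False hides the warning)
    first_only = big.take(3, warn=False)

    # npartitions=-1 tells take() to draw from every partition
    all_parts = big.take(3, npartitions=-1)

print(first_only)
print(all_parts)
```

Use `.compute()` when you want everything, and `take` when a quick peek at a few elements is enough.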

In [34]:
dask_bag.map(lambda x:x*x).take(5)

(1, 4, 9, 16, 25)

In [35]:
dask_bag.count().compute()

10

In [36]:
# convert to a dask dataframe
# this is a trivial example
dask_df_from_bag = dask_bag.to_dataframe()

In [37]:
dask_df_from_bag

| npartitions=2 | 0 |
|---|---|
|  | int64 |
|  | ... |
|  | ... |


### Build a bag from nested JSON and convert it to a DataFrame
* Step 1: define a `flatten` function
* Step 2: map `flatten` over the bag
* Step 3: convert the flattened bag to a DataFrame using `bag_instance.to_dataframe()`

Using example from https://examples.dask.org/bag.html

#### Create Random Data

In [38]:
import json
import os

In [39]:
os.makedirs("./data/dask-bag-example-01", exist_ok = True)

In [40]:
b = dask.datasets.make_people()

In [41]:
b.map(json.dumps).to_textfiles("./data/dask-bag-example-01/*.json")

['d:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/0.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/1.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/2.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/3.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/4.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/5.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/6.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/7.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/8.json',
 'd:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/9.json']

#### Read JSON Data

In [42]:
# for windows
# !more .\data\dask-bag-example-01\0.json
# for linux
# !head -n 2 ./data/dask-bag-example-01/0.json

In [43]:
b = db.read_text('./data/dask-bag-example-01/*.json').map(json.loads)
b

dask.bag<loads, npartitions=10>
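
The write-then-read pattern used above can be tried on a throwaway directory. This is a minimal sketch: the records are invented, a temporary directory stands in for `./data/dask-bag-example-01`, and the synchronous scheduler is an assumption to keep it self-contained.

```python
import json
import os
import tempfile

import dask
import dask.bag as db

# hypothetical records to round-trip through JSON-lines files
records = [{"id": i, "tag": "even" if i % 2 == 0 else "odd"} for i in range(6)]

with tempfile.TemporaryDirectory() as tmp:
    pattern = os.path.join(tmp, "*.json")
    with dask.config.set(scheduler="synchronous"):
        # one file per partition; the * is replaced by the partition number
        paths = (
            db.from_sequence(records, npartitions=2)
            .map(json.dumps)
            .to_textfiles(pattern)
        )
        # read_text yields one string per line; json.loads parses each line
        loaded = db.read_text(pattern).map(json.loads).compute()

print(len(paths), len(loaded))
```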

In [44]:
b.take(2)

({'age': 49,
  'name': ['Carmelo', 'Swanson'],
  'occupation': 'Stationer',
  'telephone': '+1-240-393-4926',
  'address': {'address': '1062 Apollo Center', 'city': 'McMinnville'},
  'credit-card': {'number': '3417 705685 47448', 'expiration-date': '06/16'}},
 {'age': 48,
  'name': ['Bryant', 'Christian'],
  'occupation': 'Booking Clerk',
  'telephone': '+1-975-593-0915',
  'address': {'address': '965 Barneveld Manor', 'city': 'Lake Forest'},
  'credit-card': {'number': '3412 082280 61526', 'expiration-date': '03/24'}})