![Dask Icon](dask_horizontal_black.gif "Dask Icon")
![Pandas Icon](images/pandas_logo.png "Pandas Icon")

# Gotchas from Pandas to Dask

This notebook highlights some key differences when transferring code from `Pandas` to run in a `Dask` environment.  
Most issues have a link to the [Dask documentation](https://docs.dask.org/en/latest/) for additional information.

# Agenda  
1. Intro to `Dask` framework
2. Basic setup
3. 

[from documentation](https://docs.dask.org/en/latest/)

# Dask

Dask is a flexible library for parallel computing in Python.

Dask is composed of two parts:

1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
2. “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

Dask emphasizes the following virtues:

* Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
* Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
* Native: Enables distributed computing in pure Python with access to the PyData stack.
* Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
* Scales up: Runs resiliently on clusters with 1000s of cores
* Scales down: Trivial to set up and run on a laptop in a single process
* Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans



Dask collections and schedulers

![Dask Framework](https://docs.dask.org/en/latest/_images/collections-schedulers.png)

See the [dask.distributed documentation (separate website)](https://distributed.dask.org/en/latest/) for more technical information on Dask’s distributed scheduler.

In [1]:
# since Dask is actively being developed, note the versions this example runs with
import dask
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.2.2
Pandas version: 0.24.2


## Start Dask Client for Dashboard

Starting the Dask Client is optional. In this example we are running on a `LocalCluster`, which also provides a dashboard that is useful for gaining insight into the computation.  
For additional information on [Dask Client see documentation](https://docs.dask.org/en/latest/setup.html?highlight=client#setup)  

The link to the dashboard will become visible when you create a client (as shown below).  
When running in `Jupyter Lab`, an [extension](https://github.com/dask/dask-labextension) can be installed to view the various dashboard widgets. 

In [2]:
from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()
client

Client: Scheduler: tcp://127.0.0.1:43861, Dashboard: http://127.0.0.1:8787/status  
Cluster: Workers: 4, Cores: 8, Memory: 67.44 GB


See [documentation for additional cluster configuration](http://distributed.dask.org/en/latest/local-cluster.html)

* When running code within a script, use a `context manager`;  
see the question on [stack overflow](https://stackoverflow.com/a/53520917/5817977)  
* In order to get the dashboard URL, use the [inner function](https://github.com/dask/distributed/issues/2083#issue-337057906)  



```python   
from ... import ...

if __name__ == '__main__':
    with Client() as client:
        ...
```

# Create 2 DataFrames for comparison: 
1. for Dask 
2. for Pandas  
Dask comes with builtin dataset samples, we will use this sample for our example. 

In [59]:
ddf = dask.datasets.timeseries()
print(type(ddf))
ddf

<class 'dask.dataframe.core.DataFrame'>


Unnamed: 0_level_0,id,name,x,y
npartitions=30,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01,int64,object,float64,float64
2000-01-02,...,...,...,...
...,...,...,...,...
2000-01-30,...,...,...,...
2000-01-31,...,...,...,...


* Remember that `Dask` is **lazy**, thus in order to see the result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute) 
 (or `head()`, which runs `compute()` under the hood).

In [60]:
ddf.head(2)

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926


## Consider using Persist
Since Dask is lazy, it may run the **entire** graph/DAG (again) even if it has already run part of the calculation in a previous cell. Use [persist](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#persist-intelligently) to keep the results in memory:
```python
ddf = client.persist(ddf)
```
This is different from Pandas, which keeps all the data in memory once a variable has been created.  
Additional information can be read in this [stackoverflow issue](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes)   
This concept should also be used when running code within a script (rather than a Jupyter notebook) that contains loops.

In [5]:
ddf = dask.datasets.timeseries()
ddf = client.persist(ddf)
ddf.head(2)

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,941,Ingrid,-0.967193,-0.017845
2000-01-01 00:00:01,1048,Frank,0.366709,-0.068859


## Pandas
In order to create a `Pandas` dataframe we can use the `compute()` method from a `Dask dataframe`

In [63]:
pdf = ddf.compute()  
print(type(pdf))
pdf.head(2)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926


## Creating a `Dask dataframe` from `Pandas`
In order to utilize `Dask` capabilities on an existing `Pandas dataframe` (pdf) we need to convert it into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. 
You must supply either the number of partitions (`npartitions`) or the `chunksize` that will be used to generate the Dask dataframe.

In [65]:
ddf2 = dask.dataframe.from_pandas(pdf, npartitions=10)
ddf2.compute()

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926
2000-01-01 00:00:02,988,George,-0.562471,-0.475591
2000-01-01 00:00:03,967,Dan,-0.419287,0.499385
2000-01-01 00:00:04,983,Ursula,-0.106549,0.702420
2000-01-01 00:00:05,912,Laura,-0.172704,0.131661
2000-01-01 00:00:06,1063,Edith,-0.337477,-0.130608
2000-01-01 00:00:07,996,Quinn,-0.804542,-0.928083
2000-01-01 00:00:08,991,George,-0.760700,0.044566
2000-01-01 00:00:09,1040,Michael,0.093960,0.156991


## Partitions in Dask Dataframes

Notice that when we created a `Dask dataframe` we needed to supply an argument of `npartitions`.  
The number of partitions determines how `Dask` splits the data and therefore how it can parallelize the computation.  
Each partition is a *separate* dataframe. For additional information see [partition documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#partitions)  

An example of this can be seen when examining the `reset_index()` method:

In [66]:
pdf2 = pdf.reset_index()
# Only 1 row
pdf2.iloc[0]

timestamp    2000-01-01 00:00:00
id                          1071
name                      Ursula
x                       0.862761
y                       0.958352
Name: 0, dtype: object

In [67]:
ddf2 = ddf2.reset_index()
# each partition has an index=0
ddf2.loc[0].compute() 
# ddf2.loc[0].visualize()

Unnamed: 0,timestamp,id,name,x,y
0,2000-01-01,1071,Ursula,0.862761,0.958352
0,2000-01-04,1013,Sarah,0.556997,-0.005752
0,2000-01-07,1015,Tim,0.099402,0.190475
0,2000-01-10,1046,Frank,-0.218221,0.276958
0,2000-01-13,991,Zelda,0.009514,0.306028
0,2000-01-16,981,Ursula,-0.618383,0.898405
0,2000-01-19,1025,Oliver,0.024601,-0.790611
0,2000-01-22,993,Frank,-0.606106,0.030933
0,2000-01-25,1072,Victor,-0.363232,-0.818671
0,2000-01-28,935,Wendy,-0.998541,0.959165


Now that we have a `dask` (ddf) and a `pandas` (pdf) dataframe we can start to compare the interactions with them.

# Conceptual shift - from Update to Insert/Delete
Dask does not update in place, thus there are no arguments such as `inplace=True`, which exist in Pandas.  
For more details see [issue#653 on github](https://github.com/dask/dask/issues/653)

### Rename Columns

* Using `inplace=True` is not considered to be *best practice*. 

In [68]:
# Pandas 
print(pdf.columns)
pdf.rename(columns={'id':'ID'}, inplace=True)
# pdf = pdf.rename(columns={'id':'ID'})
pdf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

In [12]:
# Dask - Error
# ddf.rename(columns={'id':'ID'}, inplace=True)
# ddf.columns

>---------------------------------------------------------------------------  
>TypeError                                 Traceback (most recent call last)  
><ipython-input-12-3e70ff3a549e> in <module>  
>      1 # Dask - Error  
>----> 2 ddf.rename(columns={'id':'ID'}, inplace=True)  
>      3 ddf.columns  
>TypeError: rename() got an unexpected keyword argument 'inplace'  

In [69]:
# Dask
print(ddf.columns)
ddf = ddf.rename(columns={'id':'ID'})
ddf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

## Data manipulations  
There are several differences when manipulating data.  

### loc - Pandas

In [70]:
cond = (pdf['x']>0.5) & (pdf['x']<0.8)
pdf[cond].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:21,1022,Ursula,0.557608,-0.94955
2000-01-01 00:00:26,1010,Michael,0.782082,0.40988


In [71]:
pdf.loc[cond, ['y']] = pdf['y']* 100
pdf[cond].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:21,1022,Ursula,0.557608,-94.95499
2000-01-01 00:00:26,1010,Michael,0.782082,40.988035


### Dask - use mask/where

In [72]:
cond_dask = (ddf['x']>0.5) & (ddf['x']<0.8)
# Error
# ddf.loc[cond_dask, ['y']] = ddf['y']* 100

> TypeError                                 Traceback (most recent call last)  
> <ipython-input-16-2bbb2ae570bd> in <module> 
>       2 # Error  
> ----> 3 ddf.loc[cond_dask, ['y']] = ddf['y']* 100  
> TypeError: '_LocIndexer' object does not support item assignment  

In [75]:
ddf['y'] = ddf['y'].mask(cond_dask, ddf['y']* 100)
ddf[cond_dask].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:21,1022,Ursula,0.557608,-949549.902417
2000-01-01 00:00:26,1010,Michael,0.782082,409880.346096


## Meta
One key difference is the introduction of the `meta` argument.  
> `meta` is the prescription of the names/types of the output from the computation  
[see stack overflow answer](https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta)

Since `Dask` builds a DAG for the computation before running it, it needs to know the output names and types of each calculation in advance.  
For additional information see [meta documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#metadata)

In [76]:
pdf['initials'] = pdf['name'].apply(lambda x: x[0]+x[1])
pdf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352,Ur
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926,Ha


In [77]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1])
ddf.head(2)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('name', 'object'))



Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352,Ur
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926,Ha


In [80]:
# Describe the outcome type of the calculation
meta_cal = pd.Series(dtype='object', name='initials')

In [81]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1]
                                    , meta = meta_cal)
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352,Ur
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926,Ha


In [82]:
def func(row):
    if (row['x']> 0):
        return row['x'] * 1000  
    else:
        return row['y'] * -1

In [83]:
# ddf['z'] = ddf.apply(func, args=('coor_x', 'coor_y'), axis=1, meta=('z', 'float'))
ddf['z'] = ddf.apply(func, axis=1, meta=('z', 'float'))
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,z
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352,Ur,862.76114
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926,Ha,863.905028


### Map partitions
* We can supply an ad-hoc function to run on each partition using the [map_partitions](https://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method.   
It is mainly useful for functions that are not implemented in `Dask` or `Pandas`. 
* The function can return a new `dataframe`, which needs to be described in the `meta` argument.  
The function can also take arguments.

In [86]:
import numpy as np
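def func2(df, coor_x, coor_y, drop_cols):
    # distance between consecutive rows, computed within one partition only
    dist = np.sqrt((df[coor_x] - df[coor_x].shift())**2
                   + (df[coor_y] - df[coor_y].shift())**2)
    # assign/drop return new dataframes, so the input partition is not mutated
    return df.assign(dist=dist).drop(drop_cols, axis=1)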

In [89]:
ddf2 = ddf.map_partitions(func2
                          , coor_x='x'
                          , coor_y='y'
                          , drop_cols=['initials', 'z']
                          # meta maps each output column to its dtype
                          , meta={'ID': 'i8'
                                  , 'name': 'object'
                                  , 'x': 'f8'
                                  , 'y': 'f8'
                                  , 'dist': 'f8'})
ddf2.head()

Unnamed: 0_level_0,ID,name,x,y,dist
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352,
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926,1.383278
2000-01-01 00:00:02,988,George,-0.562471,-0.475591,1.427276
2000-01-01 00:00:03,967,Dan,-0.419287,0.499385,0.985434
2000-01-01 00:00:04,983,Ursula,-0.106549,0.70242,0.372865


### Convert index into Time column

In [90]:
# Pandas
pdf = pdf.assign(times=pd.to_datetime(pdf.index).time)
pdf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,times
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352,Ur,00:00:00
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926,Ha,00:00:01


In [91]:
# ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index, format='%Y-%m-%d'). )
ddf = ddf.assign(times= dask.dataframe.to_datetime(ddf.index).dt.time
                , dates = dask.dataframe.to_datetime(ddf.index).dt.date)                 
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,z,times,dates
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2000-01-01 00:00:00,1071,Ursula,0.862761,0.958352,Ur,862.76114,00:00:00,2000-01-01
2000-01-01 00:00:01,1034,Hannah,0.863905,-0.424926,Ha,863.905028,00:00:01,2000-01-01


In [31]:
# Dask or Pandas
ddf = ddf.assign(times=ddf.index.astype('M8[ns]'))
ddf['dates'] = ddf['times'].dt.date
ddf['times'] = ddf['times'].dt.time
ddf.head()

Unnamed: 0_level_0,ID,name,x,y,initials,z,times,dates
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2000-01-01 00:00:00,941,Ingrid,-0.967193,-0.017845,In,0.017845,00:00:00,2000-01-01
2000-01-01 00:00:01,1048,Frank,0.366709,-0.068859,Fr,366.709023,00:00:01,2000-01-01
2000-01-01 00:00:02,1046,Ursula,0.011298,0.318318,Ur,11.297525,00:00:02,2000-01-01
2000-01-01 00:00:03,1009,Bob,-0.407134,0.140341,Bo,-0.140341,00:00:03,2000-01-01
2000-01-01 00:00:04,995,Charlie,-0.791907,0.986364,Ch,-0.986364,00:00:04,2000-01-01


## Drop NA on column

In [92]:
# Pandas
pdf = pdf.assign(colna = None)
print(pdf.head(2))
pdf = pdf.dropna(axis=1, how='all')
print(pdf.head(2))

                       ID    name         x         y initials     times colna
timestamp                                                                     
2000-01-01 00:00:00  1071  Ursula  0.862761  0.958352       Ur  00:00:00  None
2000-01-01 00:00:01  1034  Hannah  0.863905 -0.424926       Ha  00:00:01  None
                       ID    name         x         y initials     times
timestamp                                                               
2000-01-01 00:00:00  1071  Ursula  0.862761  0.958352       Ur  00:00:00
2000-01-01 00:00:01  1034  Hannah  0.863905 -0.424926       Ha  00:00:01


In order for `Dask` to drop a column whose values are all `NA`, check the column explicitly and then drop it:

In [96]:
# Dask
ddf = ddf.assign(colna = None)
print(ddf.head(2))
if ddf.colna.isnull().all().compute():   # computes over the whole column - VERY slow
    ddf = ddf.drop(labels=['colna'],axis=1)
print(ddf.head(2))

                       ID    name         x         y initials           z  \
timestamp                                                                    
2000-01-01 00:00:00  1071  Ursula  0.862761  0.958352       Ur  862.761140   
2000-01-01 00:00:01  1034  Hannah  0.863905 -0.424926       Ha  863.905028   

                        times       dates colna  
timestamp                                        
2000-01-01 00:00:00  00:00:00  2000-01-01  None  
2000-01-01 00:00:01  00:00:01  2000-01-01  None  
                       ID    name         x         y initials           z  \
timestamp                                                                    
2000-01-01 00:00:00  1071  Ursula  0.862761  0.958352       Ur  862.761140   
2000-01-01 00:00:01  1034  Hannah  0.863905 -0.424926       Ha  863.905028   

                        times       dates  
timestamp                                  
2000-01-01 00:00:00  00:00:00  2000-01-01  
2000-01-01 00:00:01  00:00:01  2000-01-01 

##  Reset Index

In [97]:
# Pandas
pdf = pdf.reset_index(drop=True)
pdf.head(2)

Unnamed: 0,ID,name,x,y,initials,times
0,1071,Ursula,0.862761,0.958352,Ur,00:00:00
1,1034,Hannah,0.863905,-0.424926,Ha,00:00:01


In [98]:
# Dask
ddf = ddf.reset_index()
ddf = ddf.drop(labels=['timestamp'], axis=1 )
ddf.head(2)

Unnamed: 0,ID,name,x,y,initials,z,times,dates
0,1071,Ursula,0.862761,0.958352,Ur,862.76114,00:00:00,2000-01-01
1,1034,Hannah,0.863905,-0.424926,Ha,863.905028,00:00:01,2000-01-01


# Read / Save files

When working with `pandas` and `dask`, prefer the [parquet](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#store-data-in-apache-parquet-format) format where possible.  
Whatever the format, with `Dask` the files can be read by multiple workers.  
Most `kwargs` are applicable for reading and writing files [see documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_csv) (including the option for output file naming), e.g.  
`ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=None)`  

However some are not available such as  `nrows`.

## Save files

In [99]:
# Pandas
!mkdir data
pdf.to_csv('data/pdf_single_file.csv')

mkdir: cannot create directory ‘data’: File exists


In [100]:
!dir data  # use ls on linux systems

pd2dd  pdf_single_file.csv


`Dask`  
Notice the `*` in the file name: it is replaced by the partition number, so each partition is written to its own file. 



In [101]:
!mkdir data/pd2dd   #linux

mkdir: cannot create directory ‘data/pd2dd’: File exists


In [102]:
# Dask
ddf.to_csv('data/pd2dd/ddf*.csv', index = False)

['data/pd2dd/ddf00.csv',
 'data/pd2dd/ddf01.csv',
 'data/pd2dd/ddf02.csv',
 'data/pd2dd/ddf03.csv',
 'data/pd2dd/ddf04.csv',
 'data/pd2dd/ddf05.csv',
 'data/pd2dd/ddf06.csv',
 'data/pd2dd/ddf07.csv',
 'data/pd2dd/ddf08.csv',
 'data/pd2dd/ddf09.csv',
 'data/pd2dd/ddf10.csv',
 'data/pd2dd/ddf11.csv',
 'data/pd2dd/ddf12.csv',
 'data/pd2dd/ddf13.csv',
 'data/pd2dd/ddf14.csv',
 'data/pd2dd/ddf15.csv',
 'data/pd2dd/ddf16.csv',
 'data/pd2dd/ddf17.csv',
 'data/pd2dd/ddf18.csv',
 'data/pd2dd/ddf19.csv',
 'data/pd2dd/ddf20.csv',
 'data/pd2dd/ddf21.csv',
 'data/pd2dd/ddf22.csv',
 'data/pd2dd/ddf23.csv',
 'data/pd2dd/ddf24.csv',
 'data/pd2dd/ddf25.csv',
 'data/pd2dd/ddf26.csv',
 'data/pd2dd/ddf27.csv',
 'data/pd2dd/ddf28.csv',
 'data/pd2dd/ddf29.csv']

To find the number of partitions which will determine the number of output files use [dask.dataframe.npartitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.npartitions)  

In [105]:
ddf.npartitions

30

In [106]:
#!dir data\pd2dd\  # windows
!dir data/pd2dd/  # linux

ddf00.csv  ddf05.csv  ddf10.csv  ddf15.csv  ddf20.csv  ddf25.csv
ddf01.csv  ddf06.csv  ddf11.csv  ddf16.csv  ddf21.csv  ddf26.csv
ddf02.csv  ddf07.csv  ddf12.csv  ddf17.csv  ddf22.csv  ddf27.csv
ddf03.csv  ddf08.csv  ddf13.csv  ddf18.csv  ddf23.csv  ddf28.csv
ddf04.csv  ddf09.csv  ddf14.csv  ddf19.csv  ddf24.csv  ddf29.csv


To change the number of output files use [repartition](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) which is an expensive operation.

## Read files

For `pandas` it is possible to iterate and concat the files [see answer from stack overflow](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe).

In [111]:
%%time
# Pandas 
import glob
import os
path = r'data/pd2dd/'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
len(concatenated_df)

CPU times: user 3.52 s, sys: 152 ms, total: 3.67 s
Wall time: 3.52 s


In [112]:
%%time
# Dask
_ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv')
len(_ddf)

CPU times: user 386 ms, sys: 27.5 ms, total: 414 ms
Wall time: 1.35 s


# Group By - custom aggregations
In addition to the [groupby notebook example](https://github.com/dask/dask-examples/blob/master/dataframes/02-groupby.ipynb), 
this is another example of how to avoid the use of `groupby.apply`.  
In this example we group by a column and collect the values of other columns into lists of unique values.

In [113]:
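# prepare pandas dataframe
# NOTE: the index was reset above, so pd.to_datetime(pdf.index) reads the integer
# positions as nanoseconds since the epoch, not as the original timestamps
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)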
pdf['seconds'] = pdf.Time.astype(str).str[-2:]
pdf.head()

Unnamed: 0,ID,name,x,y,initials,times,Time,seconds
0,1071,Ursula,0.862761,0.958352,Ur,00:00:00,00:00:00,0
1,1034,Hannah,0.863905,-0.424926,Ha,00:00:01,00:00:00,0
2,988,George,-0.562471,-0.475591,Ge,00:00:02,00:00:00,0
3,967,Dan,-0.419287,0.499385,Da,00:00:03,00:00:00,0
4,983,Ursula,-0.106549,0.70242,Ur,00:00:04,00:00:00,0


In [114]:
# pandas preparations
def set_list_att(x: pd.Series):
    # collect the unique values of a group into a list
    return list(set(x.values))

In [119]:
%%time
# pandas option 1 using apply
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(set_list_att) 
               for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        

CPU times: user 1.51 s, sys: 9.77 ms, total: 1.52 s
Wall time: 1.47 s


In [120]:
df_edge_att.head(2)

Unnamed: 0_level_0,Weight,ID,seconds
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alice,99991,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[58, 42, 96, 37, 62, 84, 97, 79, 31, 81, 86, 4..."
Bob,99818,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[58, 42, 96, 37, 62, 84, 97, 79, 31, 81, 86, 4..."


In [121]:
%%time
# pandas option 2 using lambda
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(lambda x: list(set(x.to_list()))) 
               for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        

CPU times: user 1.03 s, sys: 8.88 ms, total: 1.04 s
Wall time: 998 ms


In [122]:
df_edge_att.head(2)

Unnamed: 0_level_0,Weight,ID,seconds
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alice,99991,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[58, 42, 96, 37, 62, 84, 97, 79, 31, 81, 86, 4..."
Bob,99818,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[58, 42, 96, 37, 62, 84, 97, 79, 31, 81, 86, 4..."


In some cases using Pandas is more efficient (assuming that all the data fits into RAM).  
In this example Pandas is faster.

In [125]:
# prepare dask dataframe
ddf['seconds'] = ddf.times.astype(str).str[-2:]
ddf = client.persist(ddf)
ddf.head(2)

Unnamed: 0,ID,name,x,y,initials,z,times,dates,seconds
0,1071,Ursula,0.862761,0.958352,Ur,862.76114,00:00:00,2000-01-01,0
1,1034,Hannah,0.863905,-0.424926,Ha,863.905028,00:00:01,2000-01-01,1


In [130]:
%%time
# Dask option1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
    df_edge_att = df_edge_att.join(ser.to_frame(), how='left')
df_edge_att.head(2)   

CPU times: user 4.22 s, sys: 259 ms, total: 4.48 s
Wall time: 11.8 s


In [126]:
df_edge_att.head(2)

Unnamed: 0_level_0,Weight,ID_att,seconds_att
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Victor,99082,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[08, 36, 27, 23, 19, 35, 29, 06, 15, 09, 18, 3..."


Using [dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably better

In [127]:
# Dask
# some preparations
import itertools
custom_agg = dask.dataframe.Aggregation(
    'custom_agg', 
    lambda s: s.apply(set), 
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)

In [131]:
%%time
# Dask option 2 using the custom aggregation
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')
df_edge_att.head()

CPU times: user 547 ms, sys: 26.3 ms, total: 574 ms
Wall time: 1.51 s


## Debugging
Debugging may be more challenging since:
1. When using a client, the code runs in multiple processes, which complicates debugging.
2. Sometimes introducing a faulty command into a graph (such as in a Jupyter notebook) requires clearing the graph and starting the process from the beginning.

## Corrupted DAG

In [132]:
ddf = dask.datasets.timeseries()

In [133]:
# returns an error
def func_dist2(df, coor_x, coor_y):
#     dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  +  (df[coor_y] - df[coor_y].shift())**2 )
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y', meta=('col', 'float'))
ddf.head(2)

TypeError: unsupported operand type(s) for ^: 'float' and 'bool'

* Even if the function is corrected, the DAG stays corrupted: the faulty task was already added to the graph of `ddf` by the assignment above.

In [134]:
# Still results in an error
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  +  (df[coor_y] - df[coor_y].shift())**2 )
#     dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y', meta=('col', 'float'))
ddf.head(2)

TypeError: unsupported operand type(s) for ^: 'float' and 'bool'

The dataframe needs to be recreated to discard the corrupted graph:

In [135]:
ddf = dask.datasets.timeseries()
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  +  (df[coor_y] - df[coor_y].shift())**2 )
#     dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y', meta=('col', 'float'))
ddf.head(2)

Unnamed: 0_level_0,id,name,x,y,col
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,999,George,-0.479133,0.075877,
2000-01-01 00:00:01,1020,Edith,0.35548,0.38387,0.889629


In [141]:
ddf.set_index(['name','id'])

NotImplementedError: Dask dataframe does not yet support multi-indexes.
You tried to index with this index: ['name', 'id']
Indexes must be single columns only.

In [140]:
# ddf.describe().compute()
ddf.groupby(['name','id']).x.agg(['count', 'mean', 'sum']).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,sum
name,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Alice,866,1,-0.797514,-0.797514
Alice,878,2,-0.16207,-0.32414
Alice,879,2,-0.528959,-1.057919
Alice,881,1,-0.002,-0.002
Alice,885,4,-0.605074,-2.420294


# Summary
