![Dask Icon](dask_horizontal_black.gif "Dask Icon")
![Pandas Icon](images/pandas_logo.png "Pandas Icon")

# Gotchas from Pandas to Dask

This notebook highlights some key differences when transferring code from `Pandas` to run in a `Dask` environment.  
Most issues have a link to the [Dask documentation](https://docs.dask.org/en/latest/) for additional information.

# Agenda  
1. Intro to `Dask` Dataframe
2. Basic Comparison
3. 

In [1]:
# Dask is under active development; this example was run with the versions printed below
import dask
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.2.2
Pandas version: 0.24.2


## Start Dask Client for Dashboard

Starting the Dask Client is optional. In this example we run on a `LocalCluster`, which also provides a dashboard that is useful for gaining insight into the computation.  
For additional information on [Dask Client see documentation](https://docs.dask.org/en/latest/setup.html?highlight=client#setup)  

The link to the dashboard will become visible when you create a client (as shown below).  
When running in `Jupyter Lab`, an [extension](https://github.com/dask/dask-labextension) can be installed to view the various dashboard widgets. 

In [2]:
from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()
client

0,1
Client  Scheduler: tcp://127.0.0.1:44852  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 4  Cores: 8  Memory: 67.44 GB


See the [documentation for additional cluster configuration](http://distributed.dask.org/en/latest/local-cluster.html)

* When running code within a script, use the client as a `context manager`;  
see this [Stack Overflow answer](https://stackoverflow.com/a/53520917/5817977)  
* To get the dashboard URL, use the [inner function](https://github.com/dask/distributed/issues/2083#issue-337057906) described in this issue  



```python   
with Client() as client:
    ...
```

# Create 2 DataFrames for comparison: 
1. for Dask 
2. for Pandas  
Dask comes with built-in sample datasets; we will use one of them for our example. 

In [5]:
ddf = dask.datasets.timeseries()
print(type(ddf))
ddf

<class 'dask.dataframe.core.DataFrame'>


Unnamed: 0_level_0,id,name,x,y
npartitions=30,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01,int64,object,float64,float64
2000-01-02,...,...,...,...
...,...,...,...,...
2000-01-30,...,...,...,...
2000-01-31,...,...,...,...


* Remember that `Dask` is **lazy**, so in order to see the result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute) 
 (or `head()`, which runs `compute()` under the hood)

In [6]:
ddf.head(2)

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1007,Michael,0.542257,-0.848081
2000-01-01 00:00:01,1033,Frank,0.799762,0.515092


## Consider using Persist
Since Dask is lazy, it may run the **entire** graph/DAG again even if part of the calculation already ran in a previous cell. Use [persist](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#persist-intelligently) to keep the results in memory:
```python
ddf = client.persist(ddf)
```
This is different from Pandas, which keeps all the data in memory once a variable has been created.  
Additional information can be found in this [Stack Overflow answer](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes)   
The same idea applies when running code in a script (rather than a Jupyter notebook) that contains loops.

In [7]:
ddf = dask.datasets.timeseries()
ddf = client.persist(ddf)
ddf.head(2)

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673


## Pandas
In order to create a `Pandas` dataframe we can call the `compute()` method of a `Dask dataframe`

In [8]:
pdf = ddf.compute()  
print(type(pdf))
pdf.head(2)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673


## Creating a `Dask dataframe` from `Pandas`
In order to utilize `Dask` capabilities on an existing `Pandas dataframe` (pdf) we need to convert the `Pandas dataframe` into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. 
You must supply either the number of partitions (`npartitions`) or the `chunksize` that will be used to generate the Dask dataframe.

In [14]:
ddf2 = dask.dataframe.from_pandas(pdf, npartitions=10)
ddf2

Unnamed: 0_level_0,id,name,x,y
npartitions=10,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,int64,object,float64,float64
2000-01-04 00:00:00,...,...,...,...
...,...,...,...,...
2000-01-28 00:00:00,...,...,...,...
2000-01-30 23:59:59,...,...,...,...


## Partitions in Dask Dataframes

Notice that when we created a `Dask dataframe` we needed to supply an argument of `npartitions`.  
The number of partitions determines how `Dask` parallelizes the computation.  
Each partition is a *separate* dataframe. For additional information see [partition documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#partitions)  

An example of this can be seen when examining the `reset_index()` method:

In [11]:
pdf2 = pdf.reset_index()
# Only 1 row
pdf2.loc[0]

timestamp    2000-01-01 00:00:00
id                           964
name                    Patricia
x                       0.141229
y                     -0.0787963
Name: 0, dtype: object

In [15]:
ddf2 = ddf2.reset_index()
# each partition has an index=0
ddf2.loc[0].compute() 

Unnamed: 0,timestamp,id,name,x,y
0,2000-01-01,964,Patricia,0.141229,-0.078796
0,2000-01-04,973,Ray,-0.31781,0.306046
0,2000-01-07,961,Ursula,0.303154,0.251312
0,2000-01-10,1024,Sarah,0.12531,-0.868237
0,2000-01-13,989,Michael,-0.87675,0.809178
0,2000-01-16,1050,Kevin,-0.218924,0.715219
0,2000-01-19,967,Sarah,-0.583837,0.78923
0,2000-01-22,1012,Victor,0.201521,-0.442275
0,2000-01-25,983,Zelda,-0.765627,-0.729193
0,2000-01-28,1022,Ursula,0.935259,0.706244


Now that we have a `dask` (ddf) and a `pandas` (pdf) dataframe we can start to compare the interactions with them.

# Conceptual shift - from Update to Insert/Delete
Dask does not update data in place, so arguments such as `inplace=True`, which exist in Pandas, are not available.  
For more details see [issue#653 on github](https://github.com/dask/dask/issues/653)

### Rename Columns

* using `inplace=True` is not considered to be *best practice*. 

In [19]:
# Pandas 
print(pdf.columns)
pdf.rename(columns={'id':'ID'}, inplace=True)
# pdf = pdf.rename(columns={'id':'ID'})
pdf.columns

Index(['ID', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

In [20]:
# Dask - Error
ddf.rename(columns={'id':'ID'}, inplace=True)
ddf.columns

TypeError: rename() got an unexpected keyword argument 'inplace'

In [21]:
# Dask
print(ddf.columns)
ddf = ddf.rename(columns={'id':'ID'})
ddf.columns

Index(['ID', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

## Data manipulations  
There are several differences when manipulating data.  

### loc - Pandas

In [22]:
cond = (pdf['x']>0.5) &(pdf['x']<0.8)
pdf[cond].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:03,1025,Bob,0.799201,-0.50344
2000-01-01 00:00:05,996,Ursula,0.592114,0.104375


In [27]:
pdf.loc[cond, ['y']] = pdf['y']* 100
pdf[cond].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:03,1025,Bob,0.799201,-5034.40428
2000-01-01 00:00:05,996,Ursula,0.592114,1043.753529


### Dask - use mask/where

In [25]:
cond_dask = (ddf['x']>0.5) & (ddf['x']<0.8)

# Error
ddf.loc[cond_dask, ['y']] = ddf['y']* 100

TypeError: '_LocIndexer' object does not support item assignment

In [28]:
ddf['y'] = ddf['y'].mask(cond_dask, ddf['y']* 100)
ddf[cond_dask].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:03,1025,Bob,0.799201,-5034.40428
2000-01-01 00:00:05,996,Ursula,0.592114,1043.753529


## Meta
One key difference is the introduction of the `meta` argument.  
> `meta` is the prescription of the names/types of the output from the computation  
[see stack overflow answer](https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta)

Since `Dask` builds a DAG for the computation before running it, it needs to know the output names and types of each calculation in advance.  
For additional information see the [meta documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#metadata)

In [29]:
pdf['initials'] = pdf['name'].apply(lambda x: x[0]+x[1])
pdf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La


In [30]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1])
ddf.head(2)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('name', 'object'))



Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La


In [31]:
# Describe the outcome type of the calculation
# an empty Series that carries only the output name and dtype
meta_cal = pd.Series(dtype=object, name='initials')

In [32]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1], meta = meta_cal)
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La


In [33]:
def func(row):
    if (row['x']> 0):
        return row['x'] * 1000  
    else:
        return row['y'] * -1

In [34]:
# ddf['z'] = ddf.apply(func, args=('coor_x', 'coor_y'), axis=1, meta=('z', 'float'))
ddf['z'] = ddf.apply(func, axis=1, meta=('z', 'float'))
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,z
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa,141.229048
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La,427.486691


* We can supply a function to run on each partition using the [map_partitions](https://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method.  
The function could also include arguments.

In [35]:
def func2(df, col1, col2):
    z = df[col1] * df[col2]
    return z

In [38]:
# meta is a (name, dtype) tuple describing the returned Series
ddf['a'] = ddf.map_partitions(func2, col1='x', col2='y', meta=('a', 'float'))

In [39]:
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,z,a
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa,141.229048,-0.011128
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La,427.486691,0.311925


In [45]:
pdf.replace(to_replace={"P":'Q', 'L':'W'}, regex=True, ).head()

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,964,Qatricia,0.141229,-0.078796,Qa
2000-01-01 00:00:01,1006,Waura,0.427487,0.729673,Wa
2000-01-01 00:00:02,983,Bob,0.465053,-0.61388,Bo
2000-01-01 00:00:03,1025,Bob,0.799201,-5034.40428,Bo
2000-01-01 00:00:04,1061,Wendy,-0.768113,-0.022926,We


In [47]:
ddf.replace(to_replace={"^P":'Q', 'L':'W'}, regex=True, ).head()

Unnamed: 0_level_0,ID,name,x,y,initials,z,a
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2000-01-01 00:00:00,964,Qatricia,0.141229,-0.078796,Qa,141.229048,-0.011128
2000-01-01 00:00:01,1006,Waura,0.427487,0.729673,Wa,427.486691,0.311925
2000-01-01 00:00:02,983,Bob,0.465053,-0.61388,Bo,465.052706,-0.285487
2000-01-01 00:00:03,1025,Bob,0.799201,-5034.40428,Bo,799.200722,-4023.499537
2000-01-01 00:00:04,1061,Wendy,-0.768113,-0.022926,We,0.022926,0.01761


* Finally, we can return a `dataframe`, which needs to be described in the `meta` argument

In [48]:
def func3(df, col1, col2, col3, col4):
    df['col12'] = df[col1] * df[col2]
    df = df.drop([col3, col4], axis=1)
    return df

In [49]:
ddf_z = ddf.map_partitions(func3, 'x', 'y', 'z', 'a',
                           # meta maps each output column name to its dtype
                           meta={'ID': 'i8',
                                 'name': 'object',
                                 'x': 'f8',
                                 'y': 'f8',
                                 'initials': 'object',
                                 'col12': 'f8'})
ddf_z.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,col12
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa,-0.011128
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La,0.311925


### Convert index into Time column

In [50]:
# Pandas
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,Time
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa,00:00:00
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La,00:00:01


In [51]:
# ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index, format='%Y-%m-%d'). )
ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index).dt.time )
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,z,a,Time
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa,141.229048,-0.011128,00:00:00
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La,427.486691,0.311925,00:00:01


In [52]:
# Dask or Pandas
ddf = ddf.assign(Time2=ddf.index)
ddf['Time2'] = ddf['Time2'].dt.time
ddf.head()

Unnamed: 0_level_0,ID,name,x,y,initials,z,a,Time,Time2
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2000-01-01 00:00:00,964,Patricia,0.141229,-0.078796,Pa,141.229048,-0.011128,00:00:00,00:00:00
2000-01-01 00:00:01,1006,Laura,0.427487,0.729673,La,427.486691,0.311925,00:00:01,00:00:01
2000-01-01 00:00:02,983,Bob,0.465053,-0.61388,Bo,465.052706,-0.285487,00:00:02,00:00:02
2000-01-01 00:00:03,1025,Bob,0.799201,-5034.40428,Bo,799.200722,-4023.499537,00:00:03,00:00:03
2000-01-01 00:00:04,1061,Wendy,-0.768113,-0.022926,We,0.022926,0.01761,00:00:04,00:00:04


## Drop NA on column

In [53]:
# Pandas
pdf = pdf.assign(colna = None)
print(pdf.head(2))
pdf.dropna(axis=1, how='all', inplace=True)
print(pdf.head(2))

                       ID      name         x         y initials      Time  \
timestamp                                                                    
2000-01-01 00:00:00   964  Patricia  0.141229 -0.078796       Pa  00:00:00   
2000-01-01 00:00:01  1006     Laura  0.427487  0.729673       La  00:00:01   

                    colna  
timestamp                  
2000-01-01 00:00:00  None  
2000-01-01 00:00:01  None  
                       ID      name         x         y initials      Time
timestamp                                                                 
2000-01-01 00:00:00   964  Patricia  0.141229 -0.078796       Pa  00:00:00
2000-01-01 00:00:01  1006     Laura  0.427487  0.729673       La  00:00:01


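`Dask`'s `dropna` does not accept `axis=1`, so a column that contains only `na` values has to be detected and dropped explicitly (see the commented-out lines in the cell below); `dropna(how='all')` only removes rows in which every value is missing.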

In [61]:
# Dask
ddf = ddf.assign(colna = None)
print(ddf.head(2))

ddf = ddf.dropna(how='all')

# if ddf.colna.isnull().all() == True:   # check if all values in column are Null - VERY slow
#     ddf = ddf.drop(labels=['colna'],axis=1)
print(ddf.head(2))

                       ID      name         x         y initials           z  \
timestamp                                                                      
2000-01-01 00:00:00   964  Patricia  0.141229 -0.078796       Pa  141.229048   
2000-01-01 00:00:01  1006     Laura  0.427487  0.729673       La  427.486691   

                            a      Time     Time2 colna  
timestamp                                                
2000-01-01 00:00:00 -0.011128  00:00:00  00:00:00  None  
2000-01-01 00:00:01  0.311925  00:00:01  00:00:01  None  
                       ID      name         x         y initials           z  \
timestamp                                                                      
2000-01-01 00:00:00   964  Patricia  0.141229 -0.078796       Pa  141.229048   
2000-01-01 00:00:01  1006     Laura  0.427487  0.729673       La  427.486691   

                            a      Time     Time2 colna  
timestamp                                                
2000-01-01

## Reset Index

In [None]:
# Pandas
pdf.reset_index(drop=True, inplace=True)
pdf.head(2)

In [None]:
# Dask
ddf = ddf.reset_index()
ddf['Time'] = ddf['timestamp'].dt.time
ddf = ddf.drop(labels=['timestamp'], axis=1 )
ddf = ddf[['ID', 'name', 'x', 'y', 'Time']]  # keep only the columns used below
ddf.head(2)

# Read / Save files

When working with `pandas` and `dask`, prefer the [parquet](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#store-data-in-apache-parquet-format) format where possible.  
In addition, with `Dask` the files can be read by multiple workers.  
Most `kwargs` are applicable for reading and writing files, [see documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_csv) (including the option for output file naming).  
e.g.  
`ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=None)`  

However, some are not available, such as `nrows`.

## Save files

In [None]:
# Pandas
!mkdir data
pdf.to_csv('data/pdf_single_file.csv')

In [None]:
!dir data  # use ls on linux systems

`Dask`  
Notice the `*` in the file name: each partition is written to its own file and the `*` is replaced by the partition number. 



In [None]:
# Dask
!mkdir data\pd2dd
ddf.to_csv('data/pd2dd/ddf*.csv', index = False)

In [None]:
!dir data\pd2dd\ 

To find the number of partitions which will determine the number of output files use [dask.dataframe.npartitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.npartitions)  

To change the number of output files use [repartition](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) which is an expensive operation.

## Read files

For `pandas` it is possible to iterate and concat the files [see answer from stack overflow](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe).

In [None]:
# Pandas 
import glob
import os

path = r'data/pd2dd/'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
len(concatenated_df)

In [None]:
# Dask
_ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv')
len(_ddf)

# Group By - custom aggregations
In addition to the notebook example that is in the repository,  
this is another example of how to avoid `groupby.apply`.  
In this example we group by one column and collect the values of other columns into lists of unique values.

In [None]:
# prepare pandas dataframe
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf['seconds'] = pdf.Time.astype(str).str[-2:]
pdf.head()

In [None]:
# pandas preparations
def set_list_att(x: pd.Series):
    # collect the unique values of a group into a list
    return list(set(x.values))

In [None]:
%%time
# pandas option 1 using apply
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(set_list_att) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

In [None]:
%%time
# pandas option 2 using lambda
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(lambda x: list(set(x.to_list()))) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

In some cases using Pandas is more efficient (assuming that you can load all the data into RAM).  
In this case Pandas is faster.

In [None]:
# prepare dask dataframe
ddf['seconds'] = ddf.Time.astype(str).str[-2:]
ddf = client.persist(ddf)
ddf.head()

In [None]:
%%time
# Dask option 1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                                      ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Using a [dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably better

In [None]:
# Dask
# some preparations
import itertools
custom_agg = dask.dataframe.Aggregation(
    'custom_agg', 
    lambda s: s.apply(set), 
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)

In [None]:
%%time
# Dask option 2 using the custom aggregation
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

## Debugging
Debugging may be more challenging since:
1. when using a client, multiprocessing complicates matters
2. sometimes introducing a faulty command into a graph (such as in a Jupyter notebook) requires clearing the graph and starting the process from the beginning