![Dask Icon](dask_horizontal_black.gif "Dask Icon")
![Pandas Icon](images/pandas_logo.png "Pandas Icon")

# Gotchas from Pandas to Dask

This notebook highlights some key differences when transferring code from `Pandas` to run in a `Dask` environment.  
Most issues have a link to the [Dask documentation](https://docs.dask.org/en/latest/) for additional information.

# Agenda  
1. Intro to `Dask` framework
2. Basic setup
3. 

[from documentation](https://docs.dask.org/en/latest/)

# Dask

Dask is a flexible library for parallel computing in Python.

Dask is composed of two parts:

1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
2. “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

Dask emphasizes the following virtues:

* Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
* Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
* Native: Enables distributed computing in pure Python with access to the PyData stack.
* Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
* Scales up: Runs resiliently on clusters with 1000s of cores
* Scales down: Trivial to set up and run on a laptop in a single process
* Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans

Dask collections and schedulers

![Dask Framework](https://docs.dask.org/en/latest/_images/collections-schedulers.png)

See the [dask.distributed documentation (separate website)](https://distributed.dask.org/en/latest/) for more technical information on Dask’s distributed scheduler.

In [1]:
# Dask is under active development, so this example records the versions it was run with
import dask
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.2.2
Pandas version: 0.24.2


## Start Dask Client for Dashboard

Starting the Dask Client is optional. In this example we run on a `LocalCluster`, which also provides a dashboard that is useful for gaining insight into the computation.  
For additional information on [Dask Client see documentation](https://docs.dask.org/en/latest/setup.html?highlight=client#setup)  

The link to the dashboard will become visible when you create a client (as shown below).  
When running in `Jupyter Lab`, an [extension](https://github.com/dask/dask-labextension) can be installed to view the various dashboard widgets. 

In [2]:
from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()
client

0,1
Client  Scheduler: tcp://127.0.0.1:36544  Dashboard: http://127.0.0.1:8787/status,Cluster  Workers: 4  Cores: 8  Memory: 67.44 GB


See the [documentation for additional cluster configuration](http://distributed.dask.org/en/latest/local-cluster.html)

* When running code within a script, use the client as a `context manager`;  
see this answer on [Stack Overflow](https://stackoverflow.com/a/53520917/5817977).  
* To get the dashboard URL, use the [internal function](https://github.com/dask/distributed/issues/2083#issue-337057906) discussed in this GitHub issue.  



```python   
with Client() as client:
    ...
```

# Create 2 DataFrames for comparison: 
1. for Dask 
2. for Pandas  
Dask comes with builtin dataset samples, we will use this sample for our example. 

In [3]:
ddf = dask.datasets.timeseries()
print(type(ddf))
ddf

<class 'dask.dataframe.core.DataFrame'>


Unnamed: 0_level_0,id,name,x,y
npartitions=30,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01,int64,object,float64,float64
2000-01-02,...,...,...,...
...,...,...,...,...
2000-01-30,...,...,...,...
2000-01-31,...,...,...,...


* Remember that the `Dask framework` is **lazy**, so in order to see the result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute) 
 (or `head()`, which calls `compute()` under the hood).

In [4]:
ddf.head(2)

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,959,Tim,0.69857,-0.183814
2000-01-01 00:00:01,1018,Tim,-0.907868,-0.341085


## Consider using Persist
Since Dask is lazy, it may run the **entire** graph/DAG again, even if part of the calculation already ran in a previous cell. Use [persist](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#persist-intelligently) to keep the results in memory: 
```python
ddf = client.persist(ddf)
```
This differs from Pandas, which keeps all the data in memory once a variable has been created.  
Additional information can be found in this [Stack Overflow answer](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529), or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes).   
The same concept applies when running code in a script (rather than a Jupyter notebook) that contains loops.

In [5]:
ddf = dask.datasets.timeseries()
ddf = client.persist(ddf)
ddf.head(2)

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017
2000-01-01 00:00:01,946,Ingrid,0.534213,0.350685


## Pandas
To create a `Pandas` dataframe, call the `compute()` method on a `Dask dataframe`.

In [6]:
pdf = ddf.compute()  
print(type(pdf))
pdf.head(2)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017
2000-01-01 00:00:01,946,Ingrid,0.534213,0.350685


## Creating a `Dask dataframe` from `Pandas`
To use `Dask` capabilities on an existing `Pandas dataframe` (pdf), we need to convert it into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. 
You must supply either the number of partitions (`npartitions`) or the `chunksize` used to generate the Dask dataframe.

In [7]:
ddf2 = dask.dataframe.from_pandas(pdf, npartitions=10)
ddf2

Unnamed: 0_level_0,id,name,x,y
npartitions=10,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,int64,object,float64,float64
2000-01-04 00:00:00,...,...,...,...
...,...,...,...,...
2000-01-28 00:00:00,...,...,...,...
2000-01-30 23:59:59,...,...,...,...


## Partitions in Dask Dataframes

Notice that when we created a `Dask dataframe` we needed to supply the `npartitions` argument.  
The number of partitions determines how `Dask` parallelizes the computation.  
Each partition is a *separate* dataframe. For additional information see [partition documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#partitions)  

An example of this can be seen when examining the `reset_index()` method:

In [8]:
pdf2 = pdf.reset_index()
# Only 1 row
pdf2.iloc[0]

timestamp    2000-01-01 00:00:00
id                          1022
name                       Laura
x                       -0.29274
y                        0.84017
Name: 0, dtype: object

In [9]:
ddf2 = ddf2.reset_index()
# each partition has an index=0
ddf2.loc[0].compute() 
# ddf2.loc[0].visualize()

  (slice(0, 0, None), None, 'reset_index-2631a166bf2 ... f72403d1fabac')
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s)


Unnamed: 0,timestamp,id,name,x,y
0,2000-01-01,1022,Laura,-0.29274,0.84017
0,2000-01-04,994,Kevin,-0.025828,-0.032377
0,2000-01-07,1007,Norbert,0.852955,0.148651
0,2000-01-10,1052,Zelda,0.853382,-0.150757
0,2000-01-13,1027,Edith,-0.634955,0.142427
0,2000-01-16,1004,Sarah,-0.906209,0.389702
0,2000-01-19,1033,George,0.795186,0.862872
0,2000-01-22,993,Norbert,0.089105,-0.899723
0,2000-01-25,1004,Yvonne,-0.510665,-0.818292
0,2000-01-28,964,Xavier,-0.507076,0.927755


Now that we have a `dask` (ddf) and a `pandas` (pdf) dataframe, we can start to compare how we interact with them.

In [10]:
ddf.head()

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017
2000-01-01 00:00:01,946,Ingrid,0.534213,0.350685
2000-01-01 00:00:02,969,Laura,-0.330895,-0.334087
2000-01-01 00:00:03,1070,Jerry,-0.282437,0.655936
2000-01-01 00:00:04,1014,Edith,-0.914686,-0.764993


# Conceptual shift - from Update to Insert/Delete
Dask does not update data in place, so arguments such as `inplace=True`, which exist in Pandas, are not available.  
For more details see [issue #653 on GitHub](https://github.com/dask/dask/issues/653)

### Rename Columns

* Using `inplace=True` is not considered to be *best practice*. 

In [11]:
# Pandas 
print(pdf.columns)
pdf.rename(columns={'id':'ID'}, inplace=True)
# pdf = pdf.rename(columns={'id':'ID'})
pdf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

In [12]:
# Dask - Error
# ddf.rename(columns={'id':'ID'}, inplace=True)
# ddf.columns

>---------------------------------------------------------------------------  
>TypeError                                 Traceback (most recent call last)  
><ipython-input-12-3e70ff3a549e> in <module>  
>      1 # Dask - Error  
>----> 2 ddf.rename(columns={'id':'ID'}, inplace=True)  
>      3 ddf.columns  
>  
>TypeError: rename() got an unexpected keyword argument 'inplace'  



In [13]:
# Dask
print(ddf.columns)
ddf = ddf.rename(columns={'id':'ID'})
ddf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

## Data manipulations  
There are several differences when manipulating data.  

### loc - Pandas

In [14]:
cond = (pdf['x']>0.5) &(pdf['x']<0.8)
pdf[cond].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:01,946,Ingrid,0.534213,0.350685
2000-01-01 00:00:05,994,Ursula,0.599345,0.17104


In [15]:
pdf.loc[cond, ['y']] = pdf['y']* 100
pdf[cond].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496
2000-01-01 00:00:05,994,Ursula,0.599345,17.103986


### Dask - use mask/where

In [16]:
cond_dask = (ddf['x']>0.5) & (ddf['x']<0.8)
# Error
# ddf.loc[cond_dask, ['y']] = ddf['y']* 100

>---------------------------------------------------------------------------  
> TypeError                                 Traceback (most recent call last)  
> <ipython-input-16-2bbb2ae570bd> in <module>  
>       2 # Error  
> ----> 3 ddf.loc[cond_dask, ['y']] = ddf['y']* 100  
>   
> TypeError: '_LocIndexer' object does not support item assignment  



In [17]:
ddf['y'] = ddf['y'].mask(cond_dask, ddf['y']* 100)
ddf[cond_dask].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496
2000-01-01 00:00:05,994,Ursula,0.599345,17.103986


In [18]:
# ddf = dask.datasets.timeseries()

## Meta
One key difference is the introduction of the `meta` argument.  
> `meta` is the prescription of the names/types of the output from the computation  
[see stack overflow answer](https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta)

Since `Dask` builds a DAG for the computation, it needs to know the output names and types of each calculation.  
For additional information see [meta documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#metadata)

In [19]:
pdf['initials'] = pdf['name'].apply(lambda x: x[0]+x[1])
pdf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,La
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,In


In [20]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1])
ddf.head(2)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('name', 'object'))



Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,La
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,In


In [21]:
# Describe the output of the calculation: an empty, object-dtype Series named 'initials'
meta_cal = pd.Series(dtype='object', name='initials')

In [22]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1], meta = meta_cal)
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,La
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,In


In [23]:
def func(row):
    if (row['x']> 0):
        return row['x'] * 1000  
    else:
        return row['y'] * -1

In [24]:
# ddf['z'] = ddf.apply(func, args=('coor_x', 'coor_y'), axis=1, meta=('z', 'float'))
ddf['z'] = ddf.apply(func, axis=1, meta=('z', 'float'))
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,z
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,La,-0.84017
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,In,534.212581


### Map partitions
* We can supply an ad-hoc function to run on each partition using the [map_partitions](https://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method.   
This is mainly useful for functions that are not implemented in `Dask` or `Pandas`. 
* The function can return a new `dataframe`, which then needs to be described in the `meta` argument.  
The function can also take arguments.

In [25]:
import numpy as np
def func2(df, coor_x, coor_y, drop_cols):
    df['dist'] =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  +  (df[coor_y] - df[coor_y].shift())**2 )
    df = df.drop(drop_cols, axis=1)
    return df

In [26]:
ddf.head()

Unnamed: 0_level_0,ID,name,x,y,initials,z
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,La,-0.84017
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,In,534.212581
2000-01-01 00:00:02,969,Laura,-0.330895,-0.334087,La,0.334087
2000-01-01 00:00:03,1070,Jerry,-0.282437,0.655936,Je,-0.655936
2000-01-01 00:00:04,1014,Edith,-0.914686,-0.764993,Ed,0.764993


In [27]:
# meta maps each output column to its dtype, so Dask knows the schema of the result up front
ddf = ddf.map_partitions(func2, coor_x='x', coor_y='y', drop_cols=['initials', 'z'],
                         meta={'ID': 'i8',
                               'name': 'object',
                               'x': 'f8',
                               'y': 'f8',
                               'dist': 'f8'})
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,dist
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,34.238314


### Convert index into Time column

In [28]:
# Pandas
pdf = pdf.assign(times=pd.to_datetime(pdf.index).time)
pdf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,times
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,La,00:00:00
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,In,00:00:01


In [29]:
# ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index, format='%Y-%m-%d'). )
ddf = ddf.assign(times= dask.dataframe.to_datetime(ddf.index).dt.time )
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,dist,times
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,,00:00:00
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,34.238314,00:00:01


In [30]:
# Dask or Pandas
ddf = ddf.assign(times=ddf.index.astype('M8[ns]'))
ddf['times'] = ddf['times'].dt.time
ddf.head()

Unnamed: 0_level_0,ID,name,x,y,dist,times
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,,00:00:00
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,34.238314,00:00:01
2000-01-01 00:00:02,969,Laura,-0.330895,-0.334087,35.413151,00:00:02
2000-01-01 00:00:03,1070,Jerry,-0.282437,0.655936,0.991208,00:00:03
2000-01-01 00:00:04,1014,Edith,-0.914686,-0.764993,1.555243,00:00:04


## Drop NA on column

In [31]:
# Pandas
pdf = pdf.assign(colna = None)
print(pdf.head(2))
pdf = pdf.dropna(axis=1, how='all')
print(pdf.head(2))

                       ID    name         x          y initials     times  \
timestamp                                                                   
2000-01-01 00:00:00  1022   Laura -0.292740   0.840170       La  00:00:00   
2000-01-01 00:00:01   946  Ingrid  0.534213  35.068496       In  00:00:01   

                    colna  
timestamp                  
2000-01-01 00:00:00  None  
2000-01-01 00:00:01  None  
                       ID    name         x          y initials     times
timestamp                                                                
2000-01-01 00:00:00  1022   Laura -0.292740   0.840170       La  00:00:00
2000-01-01 00:00:01   946  Ingrid  0.534213  35.068496       In  00:00:01


In order for `Dask` to drop a column in which all values are `na`, check the column explicitly and then drop it:

In [32]:
# Dask
ddf = ddf.assign(colna = None)
print(ddf.head(2))

if ddf.colna.isnull().all().compute():   # scans every value in the column - VERY slow
    ddf = ddf.drop(labels=['colna'],axis=1)
print(ddf.head(2))

                       ID    name         x          y       dist     times  \
timestamp                                                                     
2000-01-01 00:00:00  1022   Laura -0.292740   0.840170        NaN  00:00:00   
2000-01-01 00:00:01   946  Ingrid  0.534213  35.068496  34.238314  00:00:01   

                    colna  
timestamp                  
2000-01-01 00:00:00  None  
2000-01-01 00:00:01  None  
                       ID    name         x          y       dist     times
timestamp                                                                  
2000-01-01 00:00:00  1022   Laura -0.292740   0.840170        NaN  00:00:00
2000-01-01 00:00:01   946  Ingrid  0.534213  35.068496  34.238314  00:00:01


##  Reset Index

In [33]:
# Pandas
pdf = pdf.reset_index(drop=True)
pdf.head(2)

Unnamed: 0,ID,name,x,y,initials,times
0,1022,Laura,-0.29274,0.84017,La,00:00:00
1,946,Ingrid,0.534213,35.068496,In,00:00:01


In [34]:
ddf.head()

Unnamed: 0_level_0,ID,name,x,y,dist,times
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,1022,Laura,-0.29274,0.84017,,00:00:00
2000-01-01 00:00:01,946,Ingrid,0.534213,35.068496,34.238314,00:00:01
2000-01-01 00:00:02,969,Laura,-0.330895,-0.334087,35.413151,00:00:02
2000-01-01 00:00:03,1070,Jerry,-0.282437,0.655936,0.991208,00:00:03
2000-01-01 00:00:04,1014,Edith,-0.914686,-0.764993,1.555243,00:00:04


In [35]:
# Dask
ddf = ddf.reset_index()
# ddf['timestamp'] = ddf['timestamp']
# ddf = ddf.drop(labels=['timestamp'], axis=1 )
# ddf.columns = ['ID','name','x','y','Time', 'initials', 'z', 'dist']
ddf.head(2)

ValueError: The columns in the computed data do not match the columns in the provided metadata

# Read / Save files

When working with `pandas` and `dask`, prefer the [parquet](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#store-data-in-apache-parquet-format) format where possible.  
With `Dask`, the files can also be read by multiple workers in parallel.  
Most `kwargs` are applicable for reading and writing files; [see documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_csv) (including the option for output file naming).  
e.g.  
`ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=None)`  

However, some are not available, such as `nrows`.

## Save files

In [None]:
# Pandas
!mkdir data
pdf.to_csv('data/pdf_single_file.csv')

In [None]:
!dir data  # use ls on linux systems

`Dask`  
Notice the `*` in the file name: `Dask` writes one file per partition and substitutes the partition number for the `*`. 



In [None]:
# Dask
!mkdir data\pd2dd
ddf.to_csv('data/pd2dd/ddf*.csv', index = False)

In [None]:
!dir data\pd2dd\ 

To find the number of partitions, which determines the number of output files, use [dask.dataframe.npartitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.npartitions)  

To change the number of output files use [repartition](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) which is an expensive operation.

## Read files

For `pandas` it is possible to iterate and concat the files [see answer from stack overflow](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe).

In [None]:
# Pandas 
import glob
import os

path = r'data/pd2dd/'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
len(concatenated_df)

In [None]:
# Dask
_ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv')
len(_ddf)

# Group By - custom aggregations
In addition to the [groupby notebook example](https://github.com/dask/dask-examples/blob/master/dataframes/02-groupby.ipynb),  
this is another example of how to avoid using `groupby.apply`.  
In this example we group by one column and collect the values of other columns into lists of unique values.

In [None]:
# prepare pandas dataframe
# the index was reset above, so derive the seconds from the existing `times` column
pdf['seconds'] = pdf['times'].astype(str).str[-2:]
pdf.head()

In [None]:
# pandas preparations
def set_list_att(x: pd.Series):
    # unique values of the group, returned as a list
    return list(set(x.values))

In [None]:
%%timeit
# pandas option 1 using apply
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(set_list_att) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

In [None]:
%%timeit
# pandas option 2 using lambda
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(lambda x: list(set(x.to_list()))) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

In any case, Pandas is sometimes more efficient (assuming that you can load all the data into RAM).  
In this case Pandas is faster.

In [None]:
# prepare dask dataframe
ddf['seconds'] = ddf['times'].astype(str).str[-2:]
ddf = client.persist(ddf)
ddf.head()

In [None]:
%%timeit
# Dask option1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                                      ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Using a [dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably better.
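
A custom aggregation works in two steps: a `chunk` function runs on the groups of each partition, and an `agg` function combines the per-partition results for each group. The sketch below emulates these two steps in plain Pandas to show the idea; the two slices standing in for partitions and the toy data are hypothetical, and this is not how Dask schedules the work internally.

```python
import itertools

import pandas as pd

# Hypothetical toy data, split by hand into two "partitions"
pdf_small = pd.DataFrame({
    "name": ["a", "a", "b", "a", "b", "b"],
    "ID": [1, 1, 2, 3, 2, 4],
})
partitions = [pdf_small.iloc[:3], pdf_small.iloc[3:]]

# Step 1 (chunk): within each partition, reduce every group to a set
def chunk(grouped):
    return grouped.apply(set)

# Step 2 (agg): per group, merge the sets coming from all partitions
def agg(grouped):
    return grouped.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks))))

partials = [chunk(part.groupby("name")["ID"]) for part in partitions]
combined = pd.concat(partials)            # one row per (partition, group)
result = agg(combined.groupby(level=0))   # one row per group

print(result)
```

Because each partition is first reduced to small sets, only those sets have to be combined, rather than all rows of a group.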

In [None]:
# Dask
# some preparations
import itertools
custom_agg = dask.dataframe.Aggregation(
    'custom_agg', 
    lambda s: s.apply(set), 
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)

In [None]:
%%time
# Dask option 2 using a custom aggregation
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

## Debugging
Debugging may be more challenging since:
1. when using a client, multiprocessing makes errors harder to trace
2. sometimes introducing a faulty command into a graph (such as in a Jupyter notebook) requires clearing the graph and starting the process from the beginning

## Corrupted DAG

In [None]:
ddf = dask.datasets.timeseries()

In [None]:
# returns an error
def func_dist2(df, coor_x, coor_y):
#     dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  +  (df[coor_y] - df[coor_y].shift())**2 )
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y', meta=('col', 'float'))
ddf.head(2)

* Even if the function is corrected, the DAG is still corrupted

In [None]:
# Still results in an error
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  +  (df[coor_y] - df[coor_y].shift())**2 )
#     dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y', meta=('col', 'float'))
ddf.head(2)

We need to re-create the dataframe to get a clean graph:

In [None]:
ddf = dask.datasets.timeseries()
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  +  (df[coor_y] - df[coor_y].shift())**2 )
#     dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y', meta=('col', 'float'))
ddf.head(2)