![Dask Icon](dask_horizontal_black.gif "Dask Icon")
![Pandas Icon](images/pandas_logo.png "Pandas Icon")

# Gotchas from Pandas to Dask

This notebook highlights some key differences when transferring code from `Pandas` to run in a `Dask` environment.  
Most issues have a link to the [Dask documentation](https://docs.dask.org/en/latest/) for additional information.

# Agenda  
1. Intro to `Dask` framework
2. Basic setup
3. 

![Dask Icon](dask_horizontal_black.gif "Dask Icon")

Dask is a flexible library for parallel computing in Python.

Dask is composed of two parts:

1. *Dynamic task scheduling* optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
2. *“Big Data” collections* like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

[link to documentation](https://docs.dask.org/en/latest/)

Dask emphasizes the following virtues:

* Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
* Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
* Native: Enables distributed computing in pure Python with access to the PyData stack.
* Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
* Scales up: Runs resiliently on clusters with 1000s of cores
* Scales down: Trivial to set up and run on a laptop in a single process
* Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans

![Dask Framework](images/dask_graph_outline.gif)
See the [dask.distributed documentation (separate website)](https://distributed.dask.org/en/latest/) for more technical information on Dask’s distributed scheduler.

In [56]:
# Dask is under active development; this example was run with the versions below
import dask
import dask.dataframe as dd
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.2.2
Pandas version: 0.24.2


## Start Dask Client for Dashboard
![Dask Dashboard](images/dask_dashboard.png)



Starting the Dask Client is optional. In this example we run on a `LocalCluster`, which also provides a dashboard that is useful for gaining insight into the computation.  
For additional information on the [Dask Client see the documentation](https://docs.dask.org/en/latest/setup.html?highlight=client#setup).  

The link to the dashboard becomes visible when you create a client (as shown below).  
When running in `Jupyter Lab`, an [extension](https://github.com/dask/dask-labextension) can be installed to view the various dashboard widgets. 

In [57]:
from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()
client

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


0,1
Client  Scheduler: tcp://127.0.0.1:39243  Dashboard: http://127.0.0.1:33089/status,Cluster  Workers: 4  Cores: 8  Memory: 67.44 GB


See the [documentation for additional cluster configuration](http://distributed.dask.org/en/latest/local-cluster.html).

* When running code within a script, use a `context manager`;  
see this question on [Stack Overflow](https://stackoverflow.com/a/53520917/5817977).  
* To get the dashboard URL, use the [inner function](https://github.com/dask/distributed/issues/2083#issue-337057906) described in this issue.  



```python   
from ... import ...

if __name__ == '__main__':
    with Client() as client:
        ...
```

# Create 2 DataFrames for comparison: 
* `Dask framework` is **lazy**  ![lazy python](images/Sleeping-snake.jpg)

In [3]:
ddf = dask.datasets.timeseries() #  Dask comes with builtin dataset samples, we will use this sample for our example. 
ddf

Unnamed: 0_level_0,id,name,x,y
npartitions=30,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01,int64,object,float64,float64
2000-01-02,...,...,...,...
...,...,...,...,...
2000-01-30,...,...,...,...
2000-01-31,...,...,...,...


To see the result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute)  
(or `head()`, which runs `compute()` under the hood).

In [62]:
ddf.compute()

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1054,Sarah,-0.235025,0.260256
2000-01-01 00:00:01,998,Laura,0.123769,0.336053
2000-01-01 00:00:02,1033,Charlie,0.409716,-0.516768
2000-01-01 00:00:03,995,George,-0.059434,0.709339
2000-01-01 00:00:04,944,Jerry,0.100006,-0.074578
2000-01-01 00:00:05,1026,Quinn,0.412330,-0.286190
2000-01-01 00:00:06,1050,Charlie,-0.181835,-0.479077
2000-01-01 00:00:07,1009,Zelda,0.330772,-0.145269
2000-01-01 00:00:08,1007,Jerry,0.337340,0.267568
2000-01-01 00:00:09,1062,Frank,-0.389890,0.012417


## Pandas
To create a `Pandas` dataframe from a `Dask` dataframe we can use `compute()`.

In [5]:
pdf = ddf.compute()  
print(type(pdf))
pdf.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579
2000-01-01 00:00:02,976,Wendy,-0.008544,0.540589
2000-01-01 00:00:03,992,Quinn,0.483938,-0.604981
2000-01-01 00:00:04,956,Oliver,-0.274719,-0.943501


## Creating a `Dask dataframe` from `Pandas`

To use `Dask` capabilities on an existing `Pandas dataframe` (pdf) we need to convert the `Pandas dataframe` into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. 
You must supply either the number of partitions (`npartitions`) or the `chunksize` that will be used to generate the dask dataframe.

In [7]:
ddf2 = dd.from_pandas(pdf, npartitions=10)
print(type(ddf2))
ddf2.head() 

<class 'dask.dataframe.core.DataFrame'>


Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579
2000-01-01 00:00:02,976,Wendy,-0.008544,0.540589
2000-01-01 00:00:03,992,Quinn,0.483938,-0.604981
2000-01-01 00:00:04,956,Oliver,-0.274719,-0.943501


## Partitions in Dask Dataframes

Notice that when we created a `Dask dataframe` we needed to supply the `npartitions` argument.  
The number of partitions determines how `Dask` splits the data and therefore how it can parallelize the computation.  
Each partition is a *separate* dataframe. For additional information see the [partition documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#partitions)  



Using the `reset_index()` method we can examine the partitions:  
First let's look at the `Pandas` dataframe

In [8]:
pdf2 = pdf.reset_index()
# Only 1 row
pdf2.iloc[0]

timestamp    2000-01-01 00:00:00
id                           999
name                       Kevin
x                      -0.147825
y                       0.794439
Name: 0, dtype: object

Now let's look at a `Dask` dataframe

In [9]:
ddf2 = ddf2.reset_index() 
ddf2.loc[0].compute()  # each partition has an index=0
# ddf2.loc[0].visualize()

Unnamed: 0,timestamp,id,name,x,y
0,2000-01-01,999,Kevin,-0.147825,0.794439
0,2000-01-04,1026,Charlie,0.689279,0.207926
0,2000-01-07,1011,Charlie,0.460504,0.080125
0,2000-01-10,995,Hannah,-0.51136,-0.283008
0,2000-01-13,1038,Norbert,0.445866,0.932486
0,2000-01-16,989,Laura,-0.529247,0.366283
0,2000-01-19,1015,Ray,-0.169299,0.370307
0,2000-01-22,1016,Norbert,0.581085,-0.963086
0,2000-01-25,986,Ursula,-0.939046,-0.754388
0,2000-01-28,1036,Oliver,-0.841643,0.157431


## dataframe.shape  
Since `Dask` is lazy, the number of rows in `shape` stays lazy until it is computed, for example with `len`.

In [11]:
print(f'Pandas shape: {pdf.shape}')
print('---------------------------')
print(f'Dask lazy shape: {ddf.shape}') 
print(f'Dask computed shape: {len(ddf.index)}')  # expensive

Pandas shape: (2592000, 4)
---------------------------
Dask lazy shape: (Delayed('int-6a1a07fa-c612-4ec0-9747-d6701dad94d0'), 4)
Dask computed shape: 2592000


Now that we have a `dask` (ddf) and a `pandas` (pdf) dataframe we can start to compare the interactions with them.

# Conceptual shift - from Update to Insert/Delete


![inplaceTrue](images/inplace_true.png "inplace_true")

Dask dataframes are not updated in place, so arguments such as `inplace=True`, which exist in Pandas, are not available.  
For more details see [issue#653 on github](https://github.com/dask/dask/issues/653)

### Rename Columns

In [12]:
# Pandas 
print(pdf.columns)
pdf.rename(columns={'id':'ID'}, inplace=True)
# pdf = pdf.rename(columns={'id':'ID'})
pdf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

* Using `inplace=True` is not considered to be *best practice*. 

In [13]:
# Dask - Error
# ddf.rename(columns={'id':'ID'}, inplace=True)
# ddf.columns

'''
---------------------------------------------------------------------------  
TypeError                                 Traceback (most recent call last)  
<ipython-input-12-3e70ff3a549e> in <module>  
      1 # Dask - Error  
----> 2 ddf.rename(columns={'id':'ID'}, inplace=True)  
      3 ddf.columns  
TypeError: rename() got an unexpected keyword argument 'inplace'  
'''

In [14]:
# Dask or Pandas
print(ddf.columns)
ddf = ddf.rename(columns={'id':'ID'})
ddf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

## Data manipulations  
There are several differences when manipulating data.  

### loc - Pandas

In [15]:
cond = (pdf['x']>0.5) & (pdf['x']<0.8)

In [16]:
pdf.loc[cond, ['y']] = pdf['y']* 100
pdf[cond].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:05,1013,Ingrid,0.570766,68.219435
2000-01-01 00:00:07,969,Hannah,0.539017,52.470365


### Dask - use mask/where

In [None]:
# Error
# cond_dask = (ddf['x']>0.5) & (ddf['x']<0.8)
# ddf.loc[cond_dask, ['y']] = ddf['y']* 100

'''
> TypeError                                 Traceback (most recent call last)  
> <ipython-input-16-2bbb2ae570bd> in <module> 
>       2 # Error  
> ----> 3 ddf.loc[cond_dask, ['y']] = ddf['y']* 100  
> TypeError: '_LocIndexer' object does not support item assignment  
'''

In [18]:
cond_dask = (ddf['x']>0.5) & (ddf['x']<0.8)
ddf['y'] = ddf['y'].mask(cond_dask, ddf['y']* 100)
ddf[cond_dask].head(2)

Unnamed: 0_level_0,ID,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:05,1013,Ingrid,0.570766,68.219435
2000-01-01 00:00:07,969,Hannah,0.539017,52.470365


[dask mask documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.mask)

## Meta
One key issue is the introduction of the `meta` argument.  
> `meta` is the prescription of the names/types of the computation output   
[see stack overflow answer](https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta)

![crystal python](images/crystalBallsnake.png "crystal snake")
Since `Dask` creates a DAG for the computation, it needs to know the output of each calculation in advance (see [meta documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#metadata))

In [19]:
pdf['initials'] = pdf['name'].apply(lambda x: x[0]+x[1])
pdf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439,Ke
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579,Sa


In [20]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1])
ddf.head(2)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('name', 'object'))



Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439,Ke
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579,Sa


In [21]:
# Describe the outcome of the calculation: an empty Series with the expected name and dtype
meta_cal = pd.Series(dtype='object', name='initials')

In [22]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1]
                                    , meta = meta_cal)
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439,Ke
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579,Sa


In [23]:
def func(row):
    if (row['x']> 0):
        return row['x'] * 1000  
    else:
        return row['y'] * -1

In [24]:
# ddf['z'] = ddf.apply(func, args=('coor_x', 'coor_y'), axis=1, meta=('z', 'float'))
ddf['z'] = ddf.apply(func, axis=1, meta=('z', 'float'))
ddf.head()

Unnamed: 0_level_0,ID,name,x,y,initials,z
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439,Ke,-0.794439
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579,Sa,173.003302
2000-01-01 00:00:02,976,Wendy,-0.008544,0.540589,We,-0.540589
2000-01-01 00:00:03,992,Quinn,0.483938,-0.604981,Qu,483.937712
2000-01-01 00:00:04,956,Oliver,-0.274719,-0.943501,Ol,0.943501


### Map partitions
* We can supply an ad-hoc function to run on each partition using the [map_partitions](https://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method.   
This is mainly useful for functions that are not implemented in `Dask` or `Pandas`. 
* The function can return a new `dataframe`, which needs to be described in the `meta` argument.  
The function can also take additional arguments.

In [25]:
import numpy as np
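def func2(df, coor_x, coor_y, drop_cols):
    # Euclidean distance between each row and the previous row of the same partition
    dist = np.sqrt((df[coor_x] - df[coor_x].shift())**2
                   + (df[coor_y] - df[coor_y].shift())**2)
    # assign() returns a new dataframe instead of mutating the input partition
    df = df.assign(dist=dist)
    return df.drop(drop_cols, axis=1)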

In [26]:
# meta maps each output column to its dtype, in the order func2 returns them
ddf2 = ddf.map_partitions(func2,
                          coor_x='x',
                          coor_y='y',
                          drop_cols=['initials', 'z'],
                          meta={'ID': 'i8',
                                'name': 'object',
                                'x': 'f8',
                                'y': 'f8',
                                'dist': 'f8'})
ddf2.head()

Unnamed: 0_level_0,ID,name,x,y,dist
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439,
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579,1.265366
2000-01-01 00:00:02,976,Wendy,-0.008544,0.540589,0.987008
2000-01-01 00:00:03,992,Quinn,0.483938,-0.604981,1.246945
2000-01-01 00:00:04,956,Oliver,-0.274719,-0.943501,0.830756


### Convert index into DateTime column

In [27]:
# Only Pandas
pdf = pdf.assign(times=pd.to_datetime(pdf.index).time)
pdf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,times
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439,Ke,00:00:00
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579,Sa,00:00:01


In [28]:
# ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index, format='%Y-%m-%d'). )
ddf = ddf.assign(times= dd.to_datetime(ddf.index).dt.time
                , dates = dd.to_datetime(ddf.index).dt.date)                 
ddf.head(2)

Unnamed: 0_level_0,ID,name,x,y,initials,z,times,dates
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2000-01-01 00:00:00,999,Kevin,-0.147825,0.794439,Ke,-0.794439,00:00:00,2000-01-01
2000-01-01 00:00:01,1027,Sarah,0.173003,-0.429579,Sa,173.003302,00:00:01,2000-01-01


In [27]:
# Dask or Pandas
ddf = ddf.assign(times=ddf.index.astype('M8[ns]'))
ddf['dates'] = ddf['times'].dt.date
ddf['times'] = ddf['times'].dt.time
ddf.head()

Unnamed: 0_level_0,ID,name,x,y,initials,z,times,dates
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2000-01-01 00:00:00,1007,Frank,-0.620432,0.084029,Fr,-0.084029,00:00:00,2000-01-01
2000-01-01 00:00:01,1013,Alice,-0.964277,0.902528,Al,-0.902528,00:00:01,2000-01-01
2000-01-01 00:00:02,999,Ingrid,0.491651,-0.626626,In,491.651205,00:00:02,2000-01-01
2000-01-01 00:00:03,964,Quinn,0.298458,-0.390178,Qu,298.457608,00:00:03,2000-01-01
2000-01-01 00:00:04,961,Quinn,-0.742823,-0.720618,Qu,0.720618,00:00:04,2000-01-01


## Drop NA on column

In [29]:
# Pandas
pdf = pdf.assign(colna = None)
print(pdf.head(2))
pdf = pdf.dropna(axis=1, how='all')
print(pdf.head(2))

                       ID   name         x         y initials     times colna
timestamp                                                                    
2000-01-01 00:00:00   999  Kevin -0.147825  0.794439       Ke  00:00:00  None
2000-01-01 00:00:01  1027  Sarah  0.173003 -0.429579       Sa  00:00:01  None
                       ID   name         x         y initials     times
timestamp                                                              
2000-01-01 00:00:00   999  Kevin -0.147825  0.794439       Ke  00:00:00
2000-01-01 00:00:01  1027  Sarah  0.173003 -0.429579       Sa  00:00:01


To drop a column in which all values are `na` with `Dask`, first check that the column is entirely null and then drop it explicitly:

In [32]:
# Dask
ddf = ddf.assign(colna = None)
# check whether all values in the column are null - expensive, since it triggers a computation
if ddf.colna.isnull().all().compute():
    ddf = ddf.drop(labels=['colna'], axis=1)
print(ddf.head(2))

                       ID   name         x         y initials           z  \
timestamp                                                                   
2000-01-01 00:00:00   999  Kevin -0.147825  0.794439       Ke   -0.794439   
2000-01-01 00:00:01  1027  Sarah  0.173003 -0.429579       Sa  173.003302   

                        times       dates  
timestamp                                  
2000-01-01 00:00:00  00:00:00  2000-01-01  
2000-01-01 00:00:01  00:00:01  2000-01-01  


##  Reset Index

In [33]:
# Pandas
pdf = pdf.reset_index(drop=True)
pdf.head(2)

Unnamed: 0,ID,name,x,y,initials,times
0,999,Kevin,-0.147825,0.794439,Ke,00:00:00
1,1027,Sarah,0.173003,-0.429579,Sa,00:00:01


In [34]:
# Dask
ddf = ddf.reset_index()
ddf = ddf.drop(labels=['timestamp'], axis=1 )
ddf.head(2)

Unnamed: 0,ID,name,x,y,initials,z,times,dates
0,999,Kevin,-0.147825,0.794439,Ke,-0.794439,00:00:00,2000-01-01
1,1027,Sarah,0.173003,-0.429579,Sa,173.003302,00:00:01,2000-01-01


# Read / Save files

When working with `pandas` and `dask`, prefer the [parquet](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#store-data-in-apache-parquet-format) format where possible.  
Even with other formats, `Dask` can read the files with multiple workers.  
Most `kwargs` are applicable for reading and writing files, [see documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_csv) (including the option for output file naming),  
e.g.  
`ddf = dd.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=None)`  

However some are not available such as  `nrows`.

## Save files

In [35]:
# Pandas
!mkdir data
pdf.to_csv('data/pdf_single_file.csv')

mkdir: cannot create directory ‘data’: File exists


In [36]:
!dir data  # use ls on linux systems

pd2dd  pdf_single_file.csv


`Dask`  
Notice the `*` in the file name: `Dask` writes one file per partition and replaces the `*` with the partition number. 



In [38]:
!mkdir data/pd2dd   #linux
#!mkdir data\pd2dd   #windows

mkdir: cannot create directory ‘data/pd2dd’: File exists


In [39]:
# Dask
ddf.to_csv('data/pd2dd/ddf*.csv', index = False)

['data/pd2dd/ddf00.csv',
 'data/pd2dd/ddf01.csv',
 'data/pd2dd/ddf02.csv',
 'data/pd2dd/ddf03.csv',
 'data/pd2dd/ddf04.csv',
 'data/pd2dd/ddf05.csv',
 'data/pd2dd/ddf06.csv',
 'data/pd2dd/ddf07.csv',
 'data/pd2dd/ddf08.csv',
 'data/pd2dd/ddf09.csv',
 'data/pd2dd/ddf10.csv',
 'data/pd2dd/ddf11.csv',
 'data/pd2dd/ddf12.csv',
 'data/pd2dd/ddf13.csv',
 'data/pd2dd/ddf14.csv',
 'data/pd2dd/ddf15.csv',
 'data/pd2dd/ddf16.csv',
 'data/pd2dd/ddf17.csv',
 'data/pd2dd/ddf18.csv',
 'data/pd2dd/ddf19.csv',
 'data/pd2dd/ddf20.csv',
 'data/pd2dd/ddf21.csv',
 'data/pd2dd/ddf22.csv',
 'data/pd2dd/ddf23.csv',
 'data/pd2dd/ddf24.csv',
 'data/pd2dd/ddf25.csv',
 'data/pd2dd/ddf26.csv',
 'data/pd2dd/ddf27.csv',
 'data/pd2dd/ddf28.csv',
 'data/pd2dd/ddf29.csv']

To find the number of partitions which will determine the number of output files use [dask.dataframe.npartitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.npartitions)  

In [40]:
ddf.npartitions

30

In [41]:
# !dir data\pd2dd\  # windows
!dir data/pd2dd/  # linux

ddf00.csv  ddf05.csv  ddf10.csv  ddf15.csv  ddf20.csv  ddf25.csv
ddf01.csv  ddf06.csv  ddf11.csv  ddf16.csv  ddf21.csv  ddf26.csv
ddf02.csv  ddf07.csv  ddf12.csv  ddf17.csv  ddf22.csv  ddf27.csv
ddf03.csv  ddf08.csv  ddf13.csv  ddf18.csv  ddf23.csv  ddf28.csv
ddf04.csv  ddf09.csv  ddf14.csv  ddf19.csv  ddf24.csv  ddf29.csv


To change the number of output files use [repartition](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) which is an expensive operation.

## Read files

For `pandas` it is possible to iterate and concat the files [see answer from stack overflow](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe).

In [44]:
%%time
# Pandas 
import glob
import os
path = r'data/pd2dd/'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
len(concatenated_df)

CPU times: user 3.31 s, sys: 167 ms, total: 3.48 s
Wall time: 3.36 s


In [45]:
%%time
# Dask
_ddf = dd.read_csv('data/pd2dd/ddf*.csv')
len(_ddf)

CPU times: user 223 ms, sys: 15.8 ms, total: 239 ms
Wall time: 1.08 s


## Consider using Persist
Since Dask is lazy, it may run the **entire** graph/DAG (again) even if it already ran part of the calculation in a previous cell. Use [persist](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#persist-intelligently) to keep the results in memory:
```python
ddf = client.persist(ddf)
```
This is different from Pandas, which keeps all the data in memory once a variable has been created.  
Additional information can be found in this [Stack Overflow answer](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes).  
This concept should also be used when running code within a script (rather than a Jupyter notebook) that incorporates loops.

In [40]:
_ddf = dd.read_csv('data/pd2dd/ddf*.csv')
ddf = client.persist(_ddf)
ddf.head(2)

Unnamed: 0,ID,name,x,y,initials,z,times,dates
0,1007,Frank,-0.620432,0.084029,Fr,-0.084029,00:00:00,2000-01-01
1,1013,Alice,-0.964277,0.902528,Al,-0.902528,00:00:01,2000-01-01


# Group By - custom aggregations
In addition to the [groupby notebook example](https://github.com/dask/dask-examples/blob/master/dataframes/02-groupby.ipynb),  
this is another example of how to avoid using `groupby.apply`.  
In this example we group by a column and collect the values of other columns into lists of unique values.

In [46]:
# prepare pandas dataframe
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf['seconds'] = pdf.Time.astype(str).str[-2:]
pdf.head()

Unnamed: 0,ID,name,x,y,initials,times,Time,seconds
0,999,Kevin,-0.147825,0.794439,Ke,00:00:00,00:00:00,0
1,1027,Sarah,0.173003,-0.429579,Sa,00:00:01,00:00:00,0
2,976,Wendy,-0.008544,0.540589,We,00:00:02,00:00:00,0
3,992,Quinn,0.483938,-0.604981,Qu,00:00:03,00:00:00,0
4,956,Oliver,-0.274719,-0.943501,Ol,00:00:04,00:00:00,0


In [53]:
%%time
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(lambda x: list(set(x.to_list())))
               for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
    df_edge_att = df_edge_att.join(ser.to_frame(), how='left')

CPU times: user 997 ms, sys: 0 ns, total: 997 ms
Wall time: 960 ms


In [54]:
df_edge_att.head(2)

Unnamed: 0_level_0,Weight,ID,seconds
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alice,99498,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[31, 64, 58, 67, 98, 55, 11, 02, 99, 25, 86, 7..."
Bob,100102,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[31, 64, 58, 67, 98, 55, 11, 02, 99, 25, 86, 7..."


In some cases using Pandas is more efficient (assuming that you can load all the data into RAM).  
In this case Pandas is faster.

In [61]:
def set_list_att(x: pd.Series):
    # groupby.apply passes each group as a pandas Series; return its unique values as a list
    return list(set(x.values))
ddf['seconds'] = ddf.times.astype(str).str[-2:]
ddf = client.persist(ddf)
ddf.head(2)

Unnamed: 0,ID,name,x,y,initials,z,times,dates,seconds
0,999,Kevin,-0.147825,0.794439,Ke,-0.794439,00:00:00,2000-01-01,0
1,1027,Sarah,0.173003,-0.429579,Sa,173.003302,00:00:01,2000-01-01,1


In [64]:
%%time
# Dask option1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
    df_edge_att = df_edge_att.join(ser.to_frame(), how='left')
df_edge_att.head(2)

CPU times: user 2.47 s, sys: 114 ms, total: 2.58 s
Wall time: 8.6 s


Using a [dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably better

In [67]:
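# Dask
import itertools
custom_agg = dd.Aggregation(
    'custom_agg',
    # chunk: applied to the groups of each partition, collects the unique values of each group
    lambda s: s.apply(set),
    # agg: merges the per-partition sets of each group into one list of unique values
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),)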

In [68]:
%%time
# Dask option 2 using a custom aggregation
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')
df_edge_att.head(2)  

CPU times: user 257 ms, sys: 32 ms, total: 289 ms
Wall time: 1.29 s


In [52]:
df_edge_att.head()

Unnamed: 0_level_0,Weight,ID,seconds
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alice,99904,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[26, 40, 18, 11, 30, 47, 02, 25, 57, 45, 15, 1..."
Bob,100331,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[26, 40, 18, 11, 30, 02, 25, 47, 57, 45, 15, 1..."
Charlie,99948,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[26, 40, 18, 11, 30, 47, 02, 25, 57, 45, 15, 1..."
Dan,99583,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[26, 40, 18, 11, 30, 02, 47, 25, 57, 45, 15, 1..."
Edith,99962,"[1024, 1025, 1026, 1027, 1028, 1029, 1030, 103...","[26, 40, 18, 11, 30, 47, 02, 25, 45, 57, 15, 1..."


## Debugging
Debugging may be more challenging since:
1. when using a client, multiprocessing complicates debugging
2. sometimes introducing a faulty command into a graph (such as in a Jupyter notebook) requires clearing the graph and starting the process from the beginning

## Corrupted DAG

In [69]:
# reset the dataframe
ddf = dask.datasets.timeseries()

In [70]:
# returns an error
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  
                     +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y'
                                , meta=('float'))

In [71]:
ddf.head()

TypeError: unsupported operand type(s) for ^: 'float' and 'bool'

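* Even if the function is corrected, the DAG is still corrupted: `ddf` already contains the faulty task from the previous assignment, and the new column is built on top of it.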

In [73]:
# Still results in an error
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  
                     +  (df[coor_y] - df[coor_y].shift())**2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y'
                                , meta=('float'))
ddf.head(2)

TypeError: unsupported operand type(s) for ^: 'float' and 'bool'

The dataframe needs to be reset:

In [74]:
ddf = dask.datasets.timeseries()
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  +  (df[coor_y] - df[coor_y].shift())**2 )
#     dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y', meta=('float'))
ddf.head(2)

Unnamed: 0_level_0,id,name,x,y,col
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2000-01-01 00:00:00,962,Yvonne,-0.608917,0.296572,
2000-01-01 00:00:01,995,Frank,-0.171466,-0.174535,0.642889


# Summary
1. `Dask` is lazy but efficient (parallel computing)
2. Useful when coming from `Pandas` (instead of `pyspark`) 
3. Distributed environments - from a single laptop to clusters with thousands of cores (including visibility into the computation)
4. But beware of:
  * missing functionalities from the `Pandas` API
  * corrupted DAGs

