
![Dask Icon](images/dask_horizontal_black.gif "Dask Icon")
![Pandas Icon](images/pandas_logo.png "Pandas Icon")

# Gotchas from Pandas to Dask  
<div style="text-align:right"><span style="color:Blue; font-family:Georgia; font-size:2em;">Sephi Berry: js.berry@gmail.com</span></div>

https://github.com/sephib/dask_pyconil2019

Thank you for coming to this session about dask and the
* Good afternoon - my name is Sephi Berry and I work for the Israeli Police
* I want to thank the organizers of the conference for allowing me to come and speak about dask
* This talk is the outcome of my work in the data science team where we analyze data for the intelligence department 
* The notebook is on the conference site as a jupyter notebook and pdf


1. write down first and last 3 sentences
2. repeat Agenda
3. thank the organizers
and team

This notebook highlights some key differences when transferring code from `Pandas` to run in a `Dask` environment.  
Most issues have a link to the [Dask documentation](https://docs.dask.org/en/latest/) for additional information.

# Agenda  
1. Intro to `Dask` framework
2. Basic setup  `Dask Client`
3. Dask.dataframe
4. Data manipulation
5. Read/Write files
6. Advanced `groupby`
7. Debugging


[![dask graph outline](images/dask_graph_outline.jpg)](daskgraph)

<img src="images\dask-dataframes.png" alt="Dask dataframes" width="450"/>


Dask is composed of two parts:

1. *Dynamic task scheduling* optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
2. *“Big Data” collections* like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

[link to documentation](https://docs.dask.org/en/latest/)
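
As a rough illustration of the task-scheduling half, the sketch below builds a tiny graph with `dask.delayed` and runs it only when `compute()` is called. The helper functions `inc` and `add` are made up for this example and are not part of the talk's code.

```python
import dask

@dask.delayed
def inc(x):
    # Hypothetical helper: one node in the task graph
    return x + 1

@dask.delayed
def add(a, b):
    # Hypothetical helper: depends on two upstream tasks
    return a + b

# Nothing runs here; Dask only records the graph of calls
total = add(inc(1), inc(2))

# The scheduler executes the graph when compute() is called
result = total.compute()
print(result)
```

The dataframe collections used in the rest of the notebook generate this kind of graph for you.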

Dask emphasizes the following virtues:

* Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
* Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
* Native: Enables distributed computing in pure Python with access to the PyData stack.
* Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
* Scales up: Runs resiliently on clusters with 1000s of cores
* Scales down: Trivial to set up and run on a laptop in a single process
* Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans


See the [dask.distributed documentation (separate website)](https://distributed.dask.org/en/latest/) for more technical information on Dask’s distributed scheduler.

# [Why Dask?](https://docs.dask.org/en/latest/why.html)
* Scales from a single computer out to clusters
* Familiar `Pandas` API
* Responsive feedback (live dashboard)
* ...

# Agenda  
1. Intro to `Dask` framework  
**2. Basic setup  `Dask Client`**
3. Dask.dataframe
4. Data manipulation
5. Read/Write files
6. Advanced `groupby`
7. Debugging

In [2]:
# Dask is under active development; this notebook was run with the versions below
import dask
import dask.dataframe as dd
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.2.2
Pandas version: 0.24.2


## Dask `Distributed` scheduler  

In [3]:
from dask.distributed import Client 
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()

# When running code within a script use a `context manager`  
```python   
if __name__ == '__main__':
    with Client() as client:
        df = dd.read_csv(...) # do something
```

* see question in [stack overflow](https://stackoverflow.com/a/53520917/5817977)  
* In order to get the dashboard URL use an [inner function](https://github.com/dask/distributed/issues/2083#issue-337057906)  


## Expose Dask Client for Dashboard

In [4]:
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:53760 | Workers: 4 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 4 |
| | Memory: 8.50 GB |


![Dask Dashboard](images\DaskDashboard2.png)


Starting the Dask Client is optional. In this example we are running on a `LocalCluster`, which also provides a dashboard that is useful for gaining insight into the computation.  
For additional information on [Dask Client see documentation](https://docs.dask.org/en/latest/setup.html?highlight=client#setup)  

The link to the dashboard becomes visible when you create a client (as shown above).  
When running in `Jupyter Lab` an [extension](https://github.com/dask/dask-labextension) can be installed to view the various dashboard widgets. 

See [documentation for additional cluster configuration](http://distributed.dask.org/en/latest/local-cluster.html)

# Agenda  
1. Intro to `Dask` framework  
2. Basic setup  `Dask Client`  
**3. Dask.dataframe**
4. Data manipulation
5. Read/Write files
6. Advanced `groupby`
7. Debugging

# Create 2 DataFrames for comparison: 
* `Dask framework` is **lazy**  ![lazy python](images/Sleeping-snake.jpg)

In [5]:
ddf = dask.datasets.timeseries() #  Dask comes with builtin dataset samples, we will use this sample for our example. 
ddf

| npartitions=30 | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 | int32 | object | float64 | float64 |
| 2000-01-02 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-30 | ... | ... | ... | ... |
| 2000-01-31 | ... | ... | ... | ... |


In order to see the result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute) 
(or `head()`, which runs `compute()` under the hood).

In [6]:
ddf.compute()

| timestamp | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1023 | Jerry | -0.047476 | -0.865430 |
| 2000-01-01 00:00:01 | 994 | George | -0.920594 | 0.362047 |
| 2000-01-01 00:00:02 | 960 | Edith | 0.499008 | -0.971223 |
| 2000-01-01 00:00:03 | 1002 | Edith | 0.409062 | 0.556840 |
| 2000-01-01 00:00:04 | 1026 | Quinn | -0.123624 | -0.970968 |
| 2000-01-01 00:00:05 | 958 | Dan | 0.661778 | 0.262635 |
| 2000-01-01 00:00:06 | 1015 | Ursula | 0.729569 | 0.837447 |
| 2000-01-01 00:00:07 | 1022 | Charlie | 0.125321 | 0.209590 |
| 2000-01-01 00:00:08 | 1006 | Oliver | -0.727293 | 0.780616 |
| 2000-01-01 00:00:09 | 1025 | Laura | -0.506821 | -0.640431 |


## Pandas
In order to create a `Pandas` dataframe we can use the `compute()` method.

In [7]:
pdf = ddf.compute()  
print(type(pdf))
pdf.head()

<class 'pandas.core.frame.DataFrame'>


| timestamp | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1023 | Jerry | -0.047476 | -0.86543 |
| 2000-01-01 00:00:01 | 994 | George | -0.920594 | 0.362047 |
| 2000-01-01 00:00:02 | 960 | Edith | 0.499008 | -0.971223 |
| 2000-01-01 00:00:03 | 1002 | Edith | 0.409062 | 0.55684 |
| 2000-01-01 00:00:04 | 1026 | Quinn | -0.123624 | -0.970968 |


## Creating a `Dask dataframe` from `Pandas`

In order to utilize `Dask` capabilities on an existing `Pandas dataframe` (pdf) we need to convert the `Pandas dataframe` into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. 
You must supply either `npartitions` or `chunksize`, which is used to generate the dask dataframe.

In [9]:
ddf2 = dd.from_pandas(pdf, npartitions=10)
print(type(ddf2))
ddf2 

<class 'dask.dataframe.core.DataFrame'>


| npartitions=10 | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | int32 | object | float64 | float64 |
| 2000-01-04 00:00:00 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-28 00:00:00 | ... | ... | ... | ... |
| 2000-01-30 23:59:59 | ... | ... | ... | ... |


## Partitions in Dask Dataframes

Notice that when we created a `Dask dataframe` we needed to supply the `npartitions` argument.  
The number of partitions determines how `Dask` splits the data and parallelizes the computation.  
Each partition is a *separate* dataframe. For additional information see [partition documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#partitions)  



Using the `reset_index()` method we can examine the partitions:  

In [10]:
pdf2 = pdf.reset_index()
pdf2.loc[0] # Only 1 row

timestamp    2000-01-01 00:00:00
id                          1023
name                       Jerry
x                      -0.047476
y                       -0.86543
Name: 0, dtype: object

Now let's look at a `Dask` dataframe

In [12]:
ddf2 = ddf2.reset_index() 
ddf2.loc[0].compute()  # each partition has an index=0

| | index | timestamp | id | name | x | y |
|---|---|---|---|---|---|---|
| 0 | 0 | 2000-01-01 | 1023 | Jerry | -0.047476 | -0.86543 |
| 0 | 0 | 2000-01-04 | 1010 | Jerry | -0.065975 | -0.009539 |
| 0 | 0 | 2000-01-07 | 1026 | Xavier | 0.58606 | 0.536831 |
| 0 | 0 | 2000-01-10 | 965 | Bob | -0.935206 | -0.035664 |
| 0 | 0 | 2000-01-13 | 1054 | Ray | -0.234153 | 0.693814 |
| 0 | 0 | 2000-01-16 | 1065 | Charlie | 0.516503 | -0.406298 |
| 0 | 0 | 2000-01-19 | 1001 | Michael | 0.01451 | 0.156541 |
| 0 | 0 | 2000-01-22 | 994 | Edith | -0.885981 | 0.049202 |
| 0 | 0 | 2000-01-25 | 1022 | Charlie | -0.232499 | 0.912348 |
| 0 | 0 | 2000-01-28 | 1019 | Norbert | -0.526318 | 0.569576 |


## dataframe.shape  
Since `Dask` is lazy, we cannot get the full shape before running `len`.

In [13]:
print(f'Pandas shape: {pdf.shape}')
print('---------------------------')
print(f'Dask lazy shape: {ddf.shape}') 

Pandas shape: (2592000, 4)
---------------------------
Dask lazy shape: (Delayed('int-39bb68ca-480e-4f56-ab9e-43b0289a637b'), 4)


In [14]:
print(f'Dask computed shape: {len(ddf.index):,}')  # expensive

Dask computed shape: 2,592,000


Now that we have a `dask` (ddf) and a `pandas` (pdf) dataframe we can start to compare the interactions with them.

# Agenda  
1. Intro to `Dask` framework  
2. Basic setup  `Dask Client`
3. Dask.dataframe  
**4. Data manipulation**
5. Read/Write files
6. Advanced `groupby`
7. Debugging

# Moving from Update to Insert/Delete

![inplaceTrue](images/inplace_true.png "inplace_true")


Dask does not update dataframes in place, so arguments such as `inplace=True`, which exist in Pandas, are not available.  
For more details see [issue#653 on github](https://github.com/dask/dask/issues/653)

### Rename Columns

In [16]:
# Pandas 
print(pdf.columns)
pdf.rename(columns={'id':'ID'}, inplace=True)
pdf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

In [None]:
# Dask - Error
# ddf.rename(columns={'id':'ID'}, inplace=True)
# ddf.columns

'''
---------------------------------------------------------------------------  
TypeError                                 Traceback (most recent call last)  
<ipython-input-12-3e70ff3a549e> in <module>  
      1 # Dask - Error  
----> 2 ddf.rename(columns={'id':'ID'}, inplace=True)  
      3 ddf.columns  
TypeError: rename() got an unexpected keyword argument 'inplace'  
'''

* Using `inplace=True` is *not* considered to be *best practice*. 

In [15]:
# Dask or Pandas
print(ddf.columns)
ddf = ddf.rename(columns={'id':'ID'})
ddf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

## Data manipulation 

### loc - Pandas

In [17]:
mask_cond = (pdf['x']>0.5) & (pdf['x']<0.8)
pdf.loc[mask_cond, ['y']] = pdf['y']* 100
pdf[mask_cond].head(2)

| timestamp | ID | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:05 | 958 | Dan | 0.661778 | 26.263509 |
| 2000-01-01 00:00:06 | 1015 | Ursula | 0.729569 | 83.744731 |


In [None]:
# Error
# cond_dask = (ddf['x']>0.5) & (ddf['x']<0.8)
# ddf.loc[cond_dask, ['y']] = ddf['y']* 100

'''
> TypeError                                 Traceback (most recent call last)  
> <ipython-input-16-2bbb2ae570bd> in <module> 
>       2 # Error  
> ----> 3 ddf.loc[cond_dask, ['y']] = ddf['y']* 100  
> TypeError: '_LocIndexer' object does not support item assignment  
'''

### Dask - use mask/where

In [22]:
# Pandas
mask_cond_p = (pdf['x']>0.5) & (pdf['x']<0.8)
pdf['y'] = pdf['y'].mask(cond=mask_cond_p, other=pdf['y']* 100)
# Dask
mask_cond_d = (ddf['x']>0.5) & (ddf['x']<0.8)
ddf['y'] = ddf['y'].mask(cond=mask_cond_d, other=ddf['y']* 100)
# Filter each dataframe with its own mask
print(f'Pandas: {pdf[mask_cond_p].head(2)}')
print('---------------------------------------------------')
print(f'Dask: {ddf[mask_cond_d].head(2)}')


Pandas:                        ID    name         x             y
timestamp                                                
2000-01-01 00:00:05   958     Dan  0.661778  2.626351e+11
2000-01-01 00:00:06  1015  Ursula  0.729569  8.374473e+11
---------------------------------------------------
Dask:                        ID    name         x             y
timestamp                                                
2000-01-01 00:00:05   958     Dan  0.661778  2.626351e+09
2000-01-01 00:00:06  1015  Ursula  0.729569  8.374473e+09


[dask mask documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.mask)
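
The equivalence between the `.loc` assignment and `mask`/`where` can be checked on a tiny pandas frame. This is a pandas-only sketch with made-up data (`toy`, `via_loc`, `via_mask` and `via_where` are illustration names); the `mask` and `where` lines are the style that carries over to a Dask dataframe.

```python
import pandas as pd

toy = pd.DataFrame({'x': [0.1, 0.6, 0.7, 0.9], 'y': [1.0, 2.0, 3.0, 4.0]})
cond = (toy['x'] > 0.5) & (toy['x'] < 0.8)

# pandas-only style: in-place assignment through .loc
via_loc = toy.copy()
via_loc.loc[cond, 'y'] = via_loc['y'] * 100

# style that avoids item assignment: build a new column instead
via_mask = toy['y'].mask(cond, toy['y'] * 100)     # replace where cond is True
via_where = toy['y'].where(~cond, toy['y'] * 100)  # keep where ~cond is True

print(via_mask.tolist())
```

`where` is the mirror image of `mask`, so negating the condition gives the same result.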

## Meta argument

> `meta` describes the names and types of the computation output   
[see stack overflow answer](https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta)

![crystal python](images/crystalBallsnake.png "crystal snake")
Since `Dask` creates a DAG for the computation, it needs to know the output of each calculation in advance (see [meta documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#metadata))

In [23]:
pdf['initials'] = pdf['name'].apply(lambda x: x[0]+x[1])
pdf.head(2)

| timestamp | ID | name | x | y | initials |
|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1023 | Jerry | -0.047476 | -0.86543 | Je |
| 2000-01-01 00:00:01 | 994 | George | -0.920594 | 0.362047 | Ge |


In [24]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1])
ddf.head(2)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('name', 'object'))



| timestamp | ID | name | x | y | initials |
|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1023 | Jerry | -0.047476 | -0.86543 | Je |
| 2000-01-01 00:00:01 | 994 | George | -0.920594 | 0.362047 | Ge |


#### Introducing meta argument

In [25]:
# Describe the output of the calculation: an empty Series with the expected name and dtype
meta_cal = pd.Series(dtype='object', name='initials')
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1], meta=meta_cal)
ddf.head(2)

| timestamp | ID | name | x | y | initials |
|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1023 | Jerry | -0.047476 | -0.86543 | Je |
| 2000-01-01 00:00:01 | 994 | George | -0.920594 | 0.362047 | Ge |


In [26]:
def func(row, col1, col2):
    # Row-wise function: scale col1 when it is positive, otherwise negate col2
    if row[col1] > 0:
        return row[col1] * 1000
    else:
        return row[col2] * -1

ddf['z'] = ddf.apply(func, args=('x', 'y'), axis=1, meta=('z', 'float'))
ddf.head(2)

| timestamp | ID | name | x | y | initials | z |
|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1023 | Jerry | -0.047476 | -0.86543 | Je | 0.86543 |
| 2000-01-01 00:00:01 | 994 | George | -0.920594 | 0.362047 | Ge | -0.362047 |


### Map partitions
* We can supply an ad-hoc function to run on each partition using the [map_partitions](https://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method.   
This is mainly useful for functions that are not implemented in `Dask` or `Pandas`. 
* The function can return a new `dataframe`, which needs to be described in the `meta` argument.  
The function can also accept additional arguments.

In [24]:
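import numpy as np

def func2(df, coor_x, coor_y, drop_cols):
    # Distance between each row and the previous row within the same partition
    dist = np.sqrt((df[coor_x] - df[coor_x].shift())**2
                   + (df[coor_y] - df[coor_y].shift())**2)
    # assign() returns a new dataframe instead of mutating the partition in place
    return df.assign(dist=dist).drop(drop_cols, axis=1)

# meta maps each output column to its dtype
ddf2 = ddf.map_partitions(func2,
                          coor_x='x',
                          coor_y='y',
                          drop_cols=['initials', 'z'],
                          meta={'ID': 'i4',      # int32, as shown in the ddf repr
                                'name': 'object',
                                'x': 'f8',
                                'y': 'f8',
                                'dist': 'f8'})
ddf2.head()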

| timestamp | ID | name | x | y | dist |
|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 977 | Sarah | -0.792876 | 0.877726 | NaN |
| 2000-01-01 00:00:01 | 1034 | Ray | -0.122505 | 0.603085 | 0.724449 |
| 2000-01-01 00:00:02 | 1013 | Ursula | 0.217195 | 0.085143 | 0.619402 |
| 2000-01-01 00:00:03 | 1030 | Kevin | 0.542537 | -9976.335657 | 9976.420806 |
| 2000-01-01 00:00:04 | 1046 | Dan | -0.346649 | 0.183367 | 9976.519064 |


### Convert index into DateTime column

In [25]:
# Only Pandas
pdf = pdf.assign(times=pd.to_datetime(pdf.index).time)
pdf.head(2)

| timestamp | ID | name | x | y | initials | times |
|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 977 | Sarah | -0.792876 | 0.877726 | Sa | 00:00:00 |
| 2000-01-01 00:00:01 | 1034 | Ray | -0.122505 | 0.603085 | Ra | 00:00:01 |


In [26]:
# ddf.assign(times= dd.to_datetime(ddf.index).dt.time)
# Dask or Pandas
ddf = ddf.assign(times=ddf.index.astype('M8[ns]'))
ddf['times'] = ddf['times'].dt.time
ddf =client.persist(ddf)
ddf.head(2)

| timestamp | ID | name | x | y | initials | z | times |
|---|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 977 | Sarah | -0.792876 | 0.877726 | Sa | -0.877726 | 00:00:00 |
| 2000-01-01 00:00:01 | 1034 | Ray | -0.122505 | 0.603085 | Ra | -0.603085 | 00:00:01 |


## Drop NA on column

In [27]:
pdf = pdf.drop(labels=['initials'],axis=1)
ddf = ddf.drop(labels=['initials','z'],axis=1) 

In [28]:
pdf = pdf.assign(colna = None)
print(f'pandas: {pdf.head(1)}')
ddf = ddf.assign(colna = None)
print(f'dask: {ddf.head(1)}')

pandas:              ID   name         x         y     times colna
timestamp                                                 
2000-01-01  977  Sarah -0.792876  0.877726  00:00:00  None
dask:              ID   name         x         y     times colna
timestamp                                                 
2000-01-01  977  Sarah -0.792876  0.877726  00:00:00  None


In order for `Dask` to drop a column in which all values are `na`, we need to assist the graph

In [29]:
pdf = pdf.dropna(axis=1, how='all')
pdf.head(1)

| timestamp | ID | name | x | y | times |
|---|---|---|---|---|---|
| 2000-01-01 | 977 | Sarah | -0.792876 | 0.877726 | 00:00:00 |


In [30]:
# Check whether all values in the column are null - expensive, since it triggers a computation
if ddf.colna.isnull().all().compute():
    ddf = ddf.drop(labels=['colna'], axis=1)
print(f'dask: {ddf.head(1)}')

dask:              ID   name         x         y     times
timestamp                                           
2000-01-01  977  Sarah -0.792876  0.877726  00:00:00


##  Reset Index

In [31]:
# Pandas
pdf = pdf.reset_index(drop=True)
pdf.head(1)

| | ID | name | x | y | times |
|---|---|---|---|---|---|
| 0 | 977 | Sarah | -0.792876 | 0.877726 | 00:00:00 |


In [32]:
# Dask
ddf = ddf.reset_index()
ddf = ddf.drop(labels=['timestamp'], axis=1 )
ddf.head(1)

| | ID | name | x | y | times |
|---|---|---|---|---|---|
| 0 | 977 | Sarah | -0.792876 | 0.877726 | 00:00:00 |


# Agenda  
1. Intro to `Dask` framework  
2. Basic setup  `Dask Client`
3. Dask.dataframe
4. Data manipulation  
**5. Read/Write files**
6. Advanced `groupby`
7. Debugging

# Read / Save files

When working with `pandas` and `dask`, prefer the [parquet](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#store-data-in-apache-parquet-format) format where possible.  
Even with other formats, `Dask` can read the files with multiple workers.  
Most `kwargs` are applicable for reading and writing files [see documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_csv) (including the option for output file naming), e.g.  
`ddf = dd.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=None)`  

However, some are not available, such as `nrows`.

## Save files

In [33]:
%%time
# Pandas
from pathlib import Path
output_file = 'pdf_single_file.csv'
output_dir = Path('data/')
output_dir.mkdir(parents=True, exist_ok=True)
pdf.to_csv(output_dir / output_file)

Wall time: 23.8 s


In [34]:
list(Path(output_dir).glob('*.csv'))

[WindowsPath('data/pdf_single_file.csv')]

`Dask`
Notice the '*', which is replaced by the partition number so that multiple files can be written. 



In [35]:
%%time
# Dask
output_dask_dir = Path('data/pd2dd/')
output_dask_dir.mkdir(parents=True, exist_ok=True)  # create the Dask output directory
ddf.to_csv(f'{output_dask_dir}/ddf*.csv', index = False)

Wall time: 17.1 s


To find the number of partitions which will determine the number of output files use [dask.dataframe.npartitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.npartitions)  

In [36]:
ddf.npartitions

30

In [37]:
list(Path(output_dask_dir).glob('*.csv'))

[WindowsPath('data/pd2dd/ddf00.csv'),
 WindowsPath('data/pd2dd/ddf01.csv'),
 WindowsPath('data/pd2dd/ddf02.csv'),
 WindowsPath('data/pd2dd/ddf03.csv'),
 WindowsPath('data/pd2dd/ddf04.csv'),
 WindowsPath('data/pd2dd/ddf05.csv'),
 WindowsPath('data/pd2dd/ddf06.csv'),
 WindowsPath('data/pd2dd/ddf07.csv'),
 WindowsPath('data/pd2dd/ddf08.csv'),
 WindowsPath('data/pd2dd/ddf09.csv'),
 WindowsPath('data/pd2dd/ddf10.csv'),
 WindowsPath('data/pd2dd/ddf11.csv'),
 WindowsPath('data/pd2dd/ddf12.csv'),
 WindowsPath('data/pd2dd/ddf13.csv'),
 WindowsPath('data/pd2dd/ddf14.csv'),
 WindowsPath('data/pd2dd/ddf15.csv'),
 WindowsPath('data/pd2dd/ddf16.csv'),
 WindowsPath('data/pd2dd/ddf17.csv'),
 WindowsPath('data/pd2dd/ddf18.csv'),
 WindowsPath('data/pd2dd/ddf19.csv'),
 WindowsPath('data/pd2dd/ddf20.csv'),
 WindowsPath('data/pd2dd/ddf21.csv'),
 WindowsPath('data/pd2dd/ddf22.csv'),
 WindowsPath('data/pd2dd/ddf23.csv'),
 WindowsPath('data/pd2dd/ddf24.csv'),
 WindowsPath('data/pd2dd/ddf25.csv'),
 ...]

To change the number of output files use [repartition](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) which is an expensive operation.

## Read files

For `pandas` it is possible to iterate and concat the files [see answer from stack overflow](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe).

In [38]:
%%time
# Pandas
dir_path = Path(r'data/pd2dd')
concat_df = pd.concat([pd.read_csv(f) for f in list(dir_path.glob('*.csv'))])
len(concat_df)

Wall time: 5.63 s


In [40]:
%%time
# Dask
_ddf = dd.read_csv('data/pd2dd/ddf*.csv')
# len(_ddf)
_ddf

Wall time: 55.2 ms


## Consider using Persist

![dask graph](images\graphdask2.png "dask graph")

In [41]:
_ddf = dd.read_csv('data/pd2dd/ddf*.csv')
# do some filter
_ddf = client.persist(_ddf)
# do some computations
_ddf.head(2)

| | ID | name | x | y | times |
|---|---|---|---|---|---|
| 0 | 977 | Sarah | -0.792876 | 0.877726 | 00:00:00 |
| 1 | 1034 | Ray | -0.122505 | 0.603085 | 00:00:01 |


Since Dask is lazy, it may run the **entire** graph/DAG again, even if part of the calculation already ran in a previous cell. Use [persist](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#persist-intelligently) to keep the results in memory.  
Additional information can be found in this [stackoverflow issue](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes).   
The same applies when running code in a script (rather than a Jupyter notebook) that contains loops.

# Agenda  
1. Intro to `Dask` framework  
2. Basic setup  `Dask Client`
3. Dask.dataframe
4. Data manipulation
5. Read/Write files  
**6. Advanced `groupby`**
7. Debugging

# Group By - custom aggregations
In addition to the [groupby notebook example](https://github.com/dask/dask-examples/blob/master/dataframes/02-groupby.ipynb), 
this is another example of how to avoid `groupby.apply`.  
In this example we group by `name` and collect the values of other columns into lists of unique values.

In [42]:
# prepare pandas dataframe
pdf = pdf.assign(time=pd.to_datetime(pdf.index).time)
pdf['seconds'] = pdf.time.astype(str).str[-2:]
cols_for_demo =['name', 'ID','seconds']
pdf[cols_for_demo].head()

| | name | ID | seconds |
|---|---|---|---|
| 0 | Sarah | 977 | 0 |
| 1 | Ray | 1034 | 0 |
| 2 | Ursula | 1013 | 0 |
| 3 | Kevin | 1030 | 0 |
| 4 | Dan | 1046 | 0 |


In [43]:
%%time
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply
               (lambda x: list(set(x.to_list()))) 
               for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')      

Wall time: 1.6 s


In [44]:
df_edge_att.head(2)

| name | Weight | ID | seconds |
|---|---|---|---|
| Alice | 99934 | [1024, 1025, 1026, 1027, 1028, 1029, 1030, 103... | [00, 91, 81, 55, 21, 65, 71, 68, 47, 87, 94, 5... |
| Bob | 100001 | [1024, 1025, 1026, 1027, 1028, 1029, 1030, 103... | [00, 91, 81, 55, 21, 65, 71, 68, 47, 87, 94, 5... |


Note that Pandas is sometimes more efficient (assuming that you can load all the data into RAM).  
In this case Pandas is faster.

In [45]:
def set_list_att(x: pd.Series):
    # groupby.apply passes each group as a pandas Series
    return list(set(x.values))
ddf['seconds'] = ddf.times.astype(str).str[-2:]
ddf = client.persist(ddf)
ddf[cols_for_demo].head(2)

| | name | ID | seconds |
|---|---|---|---|
| 0 | Sarah | 977 | 0 |
| 1 | Ray | 1034 | 1 |


In [46]:
%%time
# Dask option1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
    df_edge_att = df_edge_att.join(ser.to_frame(), how='left')
df_edge_att.head(2)

Wall time: 22.7 s


Using [dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably faster

In [47]:
# Dask
import itertools
custom_agg = dd.Aggregation(
    'custom_agg', 
    lambda s: s.apply(set), 
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),)

In [48]:
%%time
# Dask option 2 using a custom aggregation
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')
df_edge_att.head(2)  

Wall time: 3.15 s
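
To see what the two lambdas of the custom aggregation do, here is a pandas-only sketch that imitates the stages by hand: a per-partition "chunk" step followed by an "agg" step that merges the partial results. The data and the two-way split are made up, and real Dask partitions are of course not created this way.

```python
import pandas as pd

data = pd.DataFrame({'name': ['a', 'b', 'a', 'b', 'a'],
                     'ID':   [1, 5, 2, 5, 1]})

# Pretend the data lives in two partitions
partitions = [data.iloc[:3], data.iloc[3:]]

# chunk step: one set of IDs per name, computed inside each partition
chunk_results = [p.groupby('name')['ID'].apply(set) for p in partitions]

# agg step: merge the per-partition sets that share the same name
combined = (pd.concat(chunk_results)
              .groupby(level=0)
              .apply(lambda sets: sorted(set().union(*sets))))

print(combined.to_dict())
```

Only the small per-partition sets are combined in the second step, rather than every row of a group being handed to a single `apply` call.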


In [None]:
df_edge_att.head()

# Agenda  
1. Intro to `Dask` framework  
2. Basic setup  `Dask Client`
3. Dask.dataframe
4. Data manipulation
5. Read/Write files
6. Advanced `groupby`  
**7. Debugging**

## [Debugging](https://docs.dask.org/en/latest/debugging.html)
Debugging may be challenging...
1. Run code without client 
2. Verify integrity of DAG
3. Use Dashboard profiler

## Corrupted DAG

In [49]:
# reset dataframe
ddf = dask.datasets.timeseries()
ddf.head(1)

| timestamp | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 | 1025 | Bob | 0.163692 | -0.982372 |


In [50]:
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())^2  
                     +  (df[coor_y] - df[coor_y].shift())^2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y'
                                , meta=('float'))

In [51]:
# raises an error: ^ is bitwise XOR, not exponentiation (which is **)
ddf.head()

TypeError: unsupported operand type(s) for ^: 'float' and 'bool'
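
The root cause is plain Python rather than Dask, as this short stand-alone sketch shows (variable names are illustrative only):

```python
# In Python, ^ is bitwise XOR; exponentiation is **
xor_result = 5 ^ 2      # binary 101 XOR 010
power_result = 5 ** 2

# XOR is not defined for floats, which is why the cell above fails
try:
    1.5 ^ 2
    float_xor_failed = False
except TypeError:
    float_xor_failed = True

print(xor_result, power_result, float_xor_failed)
```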

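* Even after the function is corrected, the error persists: `ddf` still carries the failing task in its graph, and the new `map_partitions` call is built on top of it.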

In [52]:
# Still results in an error
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  
                     +  (df[coor_y] - df[coor_y].shift())**2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y'
                                , meta=('float'))
ddf.head(2)

TypeError: unsupported operand type(s) for ^: 'float' and 'bool'

The dataframe needs to be re-created so that its graph no longer contains the failing task

In [53]:
ddf = dask.datasets.timeseries()
def func_dist2(df, coor_x, coor_y):
    dist =  np.sqrt ( (df[coor_x] - df[coor_x].shift())**2  
                     +  (df[coor_y] - df[coor_y].shift())**2 )
    return dist
ddf['col'] = ddf.map_partitions(func_dist2, coor_x='x', coor_y='y'
                                , meta=('float'))
ddf.head(2)

| timestamp | id | name | x | y | col |
|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1002 | Kevin | -0.365556 | -0.621138 | NaN |
| 2000-01-01 00:00:01 | 1051 | Yvonne | -0.522927 | 0.774679 | 1.40466 |


# Summary
1. `Dask` is lazy but efficient (parallel computing)
2. Similar to `Pandas` API (especially with comparison to `pyspark`) 
3. Flexible environments - from single laptop to thousands of nodes (`client`) 
4. Bonus - dashboard
5. Beware of:
  * missing functionalities from `Pandas` API
  * corrupted DAGs
<div style="text-align:right"><span style="color:Blue; font-family:Georgia; font-size:2em;">Sephi Berry: js.berry@gmail.com</span></div>  
<div style="text-align:left"><span style="color:Black; font-family:Georgia; font-size:2em;">https://github.com/sephib/dask_pyconil2019</span></div>


In summary - I hope this talk highlighted some of the pitfalls that can occur when moving from Pandas to Dask.  
Remember that Dask is lazy and thus requires some assistance in order to generate the DAG, i.e. adding the *meta* kwarg.  
I hope you noticed the similarity between the Pandas code and the Dask code.  

Once transferred, you enjoy the ease of potential scaling out, in addition to the bonus dashboard for monitoring and profiling of the tasks.

Beware of the documentation, missing functionalities and the complexity of the DAG.

