# Dask

## Introduction
Pandas is a great tool, used for data cleansing, exploratory data analysis, time series analysis, visual analysis, and building features for ML models, to name a few. We <3 pandas! However, we have seen that as soon as you hit scale, things start slowing down. At that point people generally switch to Spark DataFrames. Porting pandas code to Spark DataFrames can be painful, and it may not pay off unless your datasets are very large.

One of our data engineering pieces started off small (hundreds of thousands of data points per day) but quickly became a largish data problem (5M rows per day). That is when we decided to try out Dask. Our chunked pandas dataframe technique used to take 20 hours to perform complex data cleansing on 300k rows, whereas Dask gets through 2.5 million rows in 60 minutes!

## What is Dask?
Dask provides a framework for performing parallel computing for analytics.

Dask is composed of two components:

* Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
* “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
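
As a minimal sketch of the first component, the snippet below builds a tiny task graph with `dask.delayed` and only then asks a scheduler to run it. The functions `inc` and `add` are hypothetical placeholders for real work.

```python
from dask import delayed


@delayed
def inc(x):
    return x + 1


@delayed
def add(a, b):
    return a + b


# Nothing runs yet: these calls only record tasks in a graph.
total = add(inc(1), inc(2))

# compute() hands the graph to a scheduler; the two inc tasks are
# independent of each other, so the scheduler is free to run them in parallel.
result = total.compute()
print(result)
```

The collections in the second component build graphs like this one for you, with one or more tasks per chunk of data.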

## Broad Categories of Dask Components

* **Dask DataFrame** - mimics Pandas
* **Dask Bag** - mimics iterators, Toolz, and PySpark
* **Dask Delayed** - mimics for loops and wraps custom code

## Dask Dataframe

A Dask dataframe is made up of multiple pandas dataframes split along an index. The smaller pandas dataframes may reside in memory, on disk (if they do not fit in memory), or on multiple machines (in the case of a Dask cluster).

![](./img/dask-dataframe.png)

Because the dask.dataframe application programming interface (API) is a subset of the Pandas API, it should be familiar to Pandas users. There are some slight alterations due to the parallel nature of Dask.

## Let's try it out

In [None]:
import dask.dataframe as dd
import pandas as pd
%matplotlib inline

## Connect to SQL Table

Check the Dask version, then create a SQLAlchemy engine.

In [None]:
import dask
dask.__version__

In [None]:
from sqlalchemy import create_engine
import sqlite3
import numpy as np

In [None]:
engine = create_engine('sqlite:///datasets/database.sqlite')
conn = sqlite3.connect('datasets/database.sqlite')
uri = 'sqlite:///datasets/database.sqlite'
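
The cells here assume a SQLite file at `datasets/database.sqlite` containing a `status` table. If you do not have that file, a small stand-in can be built with the standard library. This is only a sketch: the temporary path, the column types, and the sample rows are assumptions, chosen to match the column names used later in this notebook.

```python
import os
import sqlite3
import tempfile

# Hypothetical location for a throwaway database file.
demo_path = os.path.join(tempfile.mkdtemp(), "database.sqlite")

demo_conn = sqlite3.connect(demo_path)
demo_conn.execute(
    """
    CREATE TABLE status (
        station_id INTEGER,
        bikes_available INTEGER,
        docks_available INTEGER,
        time TEXT
    )
    """
)

# Invented sample rows: (station_id, bikes_available, docks_available, time).
rows = [
    (1, 2, 25, "2014-08-30 12:01:20"),
    (1, 3, 24, "2014-08-30 12:02:20"),
    (2, 5, 10, "2014-08-30 12:01:20"),
    (3, 0, 15, "2014-08-30 12:01:20"),
]
demo_conn.executemany("INSERT INTO status VALUES (?, ?, ?, ?)", rows)
demo_conn.commit()

row_count = demo_conn.execute("SELECT COUNT(*) FROM status").fetchone()[0]
demo_conn.close()

# SQLAlchemy-style connection string pointing at the stand-in file.
demo_uri = "sqlite:///" + demo_path
print(row_count, demo_uri)
```

Pointing `engine`, `conn`, and `uri` at `demo_path` should let the remaining cells run, although timings on four rows say nothing about performance.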

In [None]:
pdf = pd.read_sql("select * from status limit 10000000", con=engine)

In [None]:
pdf.shape

In [None]:
pdf.head()

In [None]:
ddf = dd.from_pandas(pdf, npartitions=16)

In [None]:
from dask.multiprocessing import get

In [None]:
%%time
pdf.apply(
    lambda x: x['bikes_available']**2 * x['docks_available']**2,
    axis=1
)

In [None]:
%%time
ddf.apply(
    lambda x: x['bikes_available']**2 * x['docks_available']**2,
    axis=1
).compute()

In [None]:
ddf.apply(
    lambda x: x['bikes_available']**2 * x['docks_available']**2,
    axis=1
).visualize()

## Read SQL directly in Dask

In [None]:
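# np.int was removed in NumPy 1.24, and recent pandas rejects a unit-less
# datetime64 in astype, so explicit dtypes are used here.
meta = {
    # 'id': np.int64,
    'station_id': np.int64,
    'bikes_available': np.int64,
    'docks_available': np.int64,
    'time': 'datetime64[ns]'
}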

In [None]:
data = {
    # 'id': 0,
    'station_id': 1,
    'bikes_available': 2,
    'docks_available': 1,
    'time': '2014-08-30 12:01:20'
}

In [None]:
cols = ['station_id','bikes_available','docks_available','time']

In [None]:
df = pd.DataFrame(columns=cols)

In [None]:
df.dtypes

In [None]:
for c in df.columns:
    df[c] = df[c].astype(meta[c])

In [None]:
df.dtypes

In [None]:
df.head()
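
The cells above build an empty frame and then cast each column. As an alternative sketch, the same empty, typed frame can be created in one step from a column-to-dtype mapping. The name `meta_dtypes` is hypothetical, and the dtypes are written as strings for portability across NumPy and pandas versions.

```python
import pandas as pd

# Hypothetical mapping of column name to dtype, mirroring `meta` above.
meta_dtypes = {
    "station_id": "int64",
    "bikes_available": "int64",
    "docks_available": "int64",
    "time": "datetime64[ns]",
}

# One empty Series per column, each created with its final dtype.
empty = pd.DataFrame(
    {col: pd.Series(dtype=dtype) for col, dtype in meta_dtypes.items()}
)
print(empty.dtypes)
```

An empty frame like this describes the schema without holding any rows, which is the role `_meta` plays on a Dask dataframe.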

In [None]:
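# The connection string is passed positionally: older Dask releases name
# this parameter `uri`, newer ones name it `con`.
ddf = dd.read_sql_table("status", uri, index_col='station_id', npartitions=16)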

In [None]:
ddf.head()

In [None]:
ddf.dtypes

In [None]:
ddf._meta

In [None]:

type(ddf)

## Shuffling - Why index is important

In [None]:
%%time
ddf.groupby(['station_id'])['docks_available'].apply(lambda x: max(x)).compute()

In [None]:
%%time
ddf.groupby(['bikes_available'])['docks_available'].apply(lambda x: max(x)).compute()

In [None]:
ddf.groupby(['station_id'])['docks_available'].apply(lambda x: max(x)).visualize()

In [None]:
ddf.groupby(['bikes_available'])['docks_available'].apply(lambda x: max(x)).visualize()

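Grouping by a column that is not the index (`bikes_available`) needs a lot of shuffling, because rows that share a key are scattered across partitions. Grouping by the index (`station_id`) can mostly be done within each partition. Now imagine deploying the same code on a cluster: there, shuffling means network I/O, which can slow the computation down considerably.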
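
To see why the choice of group key matters, here is a toy sketch in plain Python (no Dask involved). It models two partitions split by `station_id` and reports which group keys have rows in more than one partition, since those are the keys whose rows would have to be moved before a per-group function could run. The data and the helper names are invented for illustration.

```python
from collections import defaultdict

# Invented rows: (station_id, bikes_available, docks_available),
# partitioned by station_id, as an indexed Dask dataframe would be.
partitions = [
    [(1, 3, 12), (1, 5, 10), (2, 3, 7)],
    [(3, 5, 9), (3, 0, 14), (4, 3, 11)],
]


def partitions_per_key(parts, key_pos):
    """Map each group key to the set of partition numbers holding it."""
    seen = defaultdict(set)
    for part_no, rows in enumerate(parts):
        for row in rows:
            seen[row[key_pos]].add(part_no)
    return dict(seen)


def keys_needing_shuffle(parts, key_pos):
    """Return the keys whose rows live in more than one partition."""
    return sorted(
        key
        for key, found_in in partitions_per_key(parts, key_pos).items()
        if len(found_in) > 1
    )


by_station = keys_needing_shuffle(partitions, key_pos=0)
by_bikes = keys_needing_shuffle(partitions, key_pos=1)
print(by_station)  # []
print(by_bikes)    # [3, 5]
```

No `station_id` group spans partitions, while two of the three `bikes_available` groups do. On real data the second case grows with the number of partitions.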