# dask.dataframe

<img src="assets/dask-dataframe.svg" 
     align="right"
     width="20%"
     alt="Dask dataframes are blocked Pandas dataframes">
     
Dask Dataframes coordinate many Pandas dataframes, partitioned along an index.  They support a large subset of the Pandas API.

## Start Dask Client for Dashboard

Starting the Dask Client is optional. It provides a dashboard that is useful for gaining insight into the computation.

The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. Arranging your windows this way can take some effort, but seeing both at the same time is very useful when learning.

In [None]:
# NOTE!!! Don't do this on Colab

from dask.distributed import Client, progress
client = Client(n_workers=2, threads_per_worker=2, memory_limit='1GB')
client

## Create Random Dataframe

In [None]:
# In order for the following mock datasets to work, we may need to install this further dependency.
# Uncomment the needed line ...

# !pip install fsspec
# !conda install fsspec

We create a random timeseries of data with the following attributes:

1.  It stores a record for every second of the first month of 2000 (the default `freq` of `dask.datasets.timeseries` is one second)
2.  Along with a datetime index it has columns for names, ids, and numeric values

This is a small dataset (each day is about 6MB, for a total of 180MB). Increase the number of days or shorten the interval between records to practice with a larger dataset.

We'll dump this mock data into files (one per day).

In [None]:
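import datetime
import os

import dask

# Create the output directory if it does not already exist.
os.makedirs('data', exist_ok=True)

# Random timeseries using the default settings.
ts = dask.datasets.timeseries()

def name(i):
    """Map partition number i to an ISO date string, starting at 2000-01-01."""
    return str(datetime.date(2000, 1, 1) + i * datetime.timedelta(days=1))

# Write one CSV file per partition; the `*` is replaced by name(i).
ts.to_csv('data/*.csv', name_function=name);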
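
The file naming can be checked without Dask. When writing, Dask passes each partition's integer position to `name_function` and substitutes the returned string for the `*` in the path. The sketch below mimics that substitution with the standard library only; `partition_to_date` is a hypothetical helper name used for illustration.

```python
import datetime


def partition_to_date(i, first_day=datetime.date(2000, 1, 1)):
    """Return the ISO date string for partition number i (0-based)."""
    return str(first_day + datetime.timedelta(days=i))


# File names that the first three partitions would map to.
for i in range(3):
    print("data/*.csv".replace("*", partition_to_date(i)))
```

Because ISO dates sort lexicographically in chronological order, the resulting files are listed in partition order.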

In [None]:
# If you want to know more about this function to create fake data

help(dask.datasets.timeseries)

In [None]:
# These will work on Mac or Linux

!ls data/*.csv

!head data/2000-01-01.csv

Now we will load the CSV files. Unlike `pandas.read_csv`, `dd.read_csv` accepts a wildcard `*` in the filename and reads all matching files.

In [None]:
import dask.dataframe as dd

df = dd.read_csv('data/2000-*-*.csv')

# Unlike Pandas, Dask DataFrames are lazy and so no data is printed here.
df

But the column names and dtypes are known.

In [None]:
df.dtypes

In [None]:
df.visualize()

Some operations, such as `head`, trigger computation immediately and display the data.

In [None]:
df.head(3)

## Use Standard Pandas Operations

Most common Pandas operations operate identically on Dask dataframes.

In [None]:
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
df3

Call `.compute()` when you want your result as a Pandas object (here a Series, because `df3` selects the single column `x`).

If you started `Client()` above then you may want to watch the status page during computation.

In [None]:
computed_df = df3.compute()
type(computed_df)

In [None]:
computed_df

In [None]:
df3.visualize()

[On to the next notebook (`dask.bag`)](06-bag.ipynb) ...