<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


# Dask DataFrames

We finished the last section by building a parallel dataframe computation over a directory of CSV files using `dask.delayed`.  In this section we use `dask.dataframe` to build computations for us in the common case of tabular computations.  Dask dataframes look and feel like Pandas dataframes but they run on the same infrastructure that powers `dask.delayed`.


## When to use `dask.dataframe`

Pandas is great for tabular datasets that fit in memory. Dask becomes useful when the dataset you want to analyze is larger than your machine's RAM. 

In [None]:
import os

import dask
import dask.dataframe as dd
import pandas as pd

#pd.options.display.max_rows = 10

df = dd.read_csv(os.path.join('C:/Users/sharvik/data', 'nycflights', '*.csv'), parse_dates={'Date': [0, 1, 2]})

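This reads all of the CSV files with `dask.dataframe`, not pandas. Apart from a small sample used to infer column names and dtypes, no data is loaded yet, so displaying `df` shows only the structure of the dataframe.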

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df = dd.read_csv(os.path.join('C:/Users/sharvik/data', 'nycflights', '*.csv'),
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                       'Cancelled': bool})
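
The explicit `dtype` argument matters because `dd.read_csv` infers column types from a sample at the start of the data rather than from every row. If later rows do not fit the inferred type, an operation that reaches them, such as `df.tail()`, can fail with a dtype error. The sketch below imitates that situation with plain pandas on a made-up CSV string; the data and the `sample`, `inferred` and `full` names are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up CSV: the first rows look numeric, the last row does not.
csv_text = "TailNum,CRSElapsedTime\n1,60\n2,75\n3,90\nN123AA,\n"

# Inferring from only the first three rows suggests two integer columns.
sample = pd.read_csv(io.StringIO(csv_text), nrows=3)
inferred = {name: str(dtype) for name, dtype in sample.dtypes.items()}

# Stating the dtypes up front parses every row consistently:
# TailNum stays text and the empty CRSElapsedTime becomes NaN.
full = pd.read_csv(io.StringIO(csv_text),
                   dtype={"TailNum": str, "CRSElapsedTime": float})
```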

In [None]:
df.tail()

In [None]:
%time df.DepDelay.max().compute()

To do the same thing with pandas, you have to loop over all the files yourself, compute the maximum of each file, and then take the maximum of those per-file maxima.

In [None]:
import pandas as pd
import os

from glob import glob
filenames = sorted(glob(os.path.join('C:/Users/sharvik/data', 'nycflights', '*.csv')))

In [None]:
%%time 

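frames, maxes = [], []
for file in filenames:
    data = pd.read_csv(file, dtype={'TailNum': str,
                                    'CRSElapsedTime': float,
                                    'Cancelled': bool})
    maxes.append(data.DepDelay.max())  # maximum of this file
    frames.append(data)

max_delay = max(maxes)  # maximum of the per-file maxima

# DataFrame.append returns a new frame (and was removed in pandas 2.0),
# so collect the pieces in a list and concatenate them once.
data = pd.concat(frames, ignore_index=True)
non_cancelled = data[~data.Cancelled]
mean_delay = non_cancelled.DepDelay.mean()
std_delay = non_cancelled.DepDelay.std()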

In [None]:
#df.DepDelay.max().visualize()

For example, let's compute the mean and standard deviation of the departure delay for all non-cancelled flights:

In [None]:
df.head()

In [None]:
non_cancelled = df[~df.Cancelled]
mean_delay = non_cancelled.DepDelay.mean()
std_delay = non_cancelled.DepDelay.std()
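
A maximum can be combined across files by taking the maximum of the per-file maxima, but a mean or standard deviation needs more bookkeeping. One way to do it, shown here as a plain-Python sketch and not as a description of Dask's internals, is to reduce each chunk to a count, a sum and a sum of squares and then combine those. The `chunks` data and the helper names are hypothetical.

```python
import math
import statistics

# Hypothetical delays, already split into three chunks (think: three files).
chunks = [[5.0, -2.0, 30.0], [12.0, 0.0], [7.0, 45.0, -5.0, 3.0]]

def summarize(chunk):
    """Reduce one chunk to (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine(summaries):
    """Merge per-chunk summaries into an overall mean and sample std (ddof=1)."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean ** 2) / (n - 1)
    return mean, math.sqrt(variance)

chunked_mean, chunked_std = combine([summarize(c) for c in chunks])

# Reference values computed on the full data at once, for comparison.
flat = [x for chunk in chunks for x in chunk]
reference_mean = statistics.mean(flat)
reference_std = statistics.stdev(flat)
```

Each chunk can be summarized independently, which is what makes this kind of reduction suitable for parallel execution. The sum-of-squares formula is simple but can lose precision on large or badly scaled data.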

In [None]:
non_cancelled

In [None]:
%%time

mean_delay_res = mean_delay.compute()
std_delay_res = std_delay.compute()

In [None]:
%%time

mean_delay_res, std_delay_res = dask.compute(mean_delay, std_delay)
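
The second version is usually faster because `mean_delay` and `std_delay` share the work of reading and filtering the files, and a single `dask.compute` call lets Dask run that shared work once. The sketch below shows the same effect on a tiny `dask.delayed` graph; the `load` function and the `calls` counter are hypothetical stand-ins for the expensive CSV reads.

```python
import dask

calls = []  # records every time the "expensive" step actually runs

@dask.delayed
def load():
    calls.append(1)
    return [1, 2, 3, 4]

data = load()                      # one shared upstream task
total = dask.delayed(sum)(data)    # depends on data
largest = dask.delayed(max)(data)  # also depends on data

# Two separate compute calls each run the shared task again.
total.compute()
largest.compute()
separate_calls = len(calls)

# One dask.compute call merges the graphs and runs the shared task once.
calls.clear()
t, m = dask.compute(total, largest)
shared_calls = len(calls)
```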

### Now let's try this on some bigger files. I got a file from Ravneet, stored at the following path:

In [None]:
'C:/Users/sharvik/data/results/results.csv'

In [None]:
mean_delay_res

In [None]:
std_delay_res