Distributed DataFrames
======================

In this notebook we use distributed dataframes to analyze NYC Taxi data stored as CSV files on S3.

The files are large and sit in a public bucket, so we can list them with anonymous access.

In [None]:
from s3fs import S3FileSystem
s3 = S3FileSystem(anon=True)

s3.ls('dask-data/nyc-taxi/2015')

We would like to load this data with Pandas, but there is too much data here to fit in memory.

In [None]:
import pandas as pd

with s3.open('dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv') as f:
    df = pd.read_csv(f, nrows=5)  # look at just five rows

df

Instead, we connect to the cluster and use dask.dataframe to load the CSV data into ~700 Pandas dataframes spread across our cluster.  What we get back is a Dask.dataframe that coordinates these small Pandas dataframes.
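
The idea of many small Pandas dataframes can be sketched on a single machine with plain Pandas.  This is only an illustration of splitting the work and combining partial results, not how dask.dataframe is implemented; the CSV text and its values are made up for the example.

```python
import io
import pandas as pd

# Made-up CSV standing in for one of the taxi files (hypothetical values)
csv_text = (
    "passenger_count,fare_amount\n"
    "1,5.0\n"
    "2,7.5\n"
    "1,0.0\n"
    "3,12.0\n"
    "2,9.0\n"
)

# Read the file as a sequence of small Pandas dataframes, two rows each
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2))

# Work on each piece independently, then combine the partial results
partial_sums = [chunk['fare_amount'].sum() for chunk in chunks]
total_fare = sum(partial_sums)

print(len(chunks), total_fare)
```

On a cluster the pieces live on different workers and the per-piece work runs in parallel; here everything runs sequentially in one process.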

In [None]:
# Client is the current name for what older dask.distributed releases called Executor
from dask.distributed import Client, progress

e = Client('schedulers:9000', set_as_default=True)
e

In [None]:
import dask.dataframe as dd

df = dd.read_csv('s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv',
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                 storage_options={'anon': True})
df

In [None]:
df = e.persist(df)
progress(df)

### Play

Existing Pandas experience transfers over decently well to Dask.dataframe.  However, there are a few caveats when dealing with distributed systems:

*  Until you call `e.persist` (for large results) or `e.compute` (for small results), all computations are lazy; nothing runs yet.
*  Call `progress` on a dataframe *after* you persist to track the progress of a computation.  You can continue doing work immediately.  All work happens in the background.
*  If you are computing a small result, just add `.compute()` to the end of your result, like `df.passenger_count.sum().compute()`.  This will block and return the result when finished.

### Example

In [None]:
positive_fares = df[df.fare_amount > 0]  # keep only rides with a positive fare
fares = positive_fares[['fare_amount', 'tip_amount', 'payment_type']]  # select the payment columns

fares = e.persist(fares)  # triggers computation
progress(fares)

In [None]:
fares.head()

In [None]:
(fares.tip_amount == 0).sum().compute()

In [None]:
len(fares)

### Exercise