# Dask Tips and Tricks: Rolling Averages in Dask

How can a user take advantage of Dask parallelism while calculating rolling averages?

> "I need to calculate a rolling average of a numerical column, in time series data. In pandas, I can do this with rolling(x).mean() with sorted values, but what do I do in Dask, with distributed data?"


Great question! Time series data often poses unique challenges with distributed data and parallelization, but Dask can do it. Here's what we need to do.

* Sort by index within AND across partitions.
* Know when to compute (convert to a pandas DataFrame) or persist (run the computations and keep the result on the cluster).
* Run calculations, paying attention to windows that cross partition boundaries.

This example walks through these points and demonstrates how it's done. We'll use New York City taxi trip data and compute the 30-day rolling average of the base fare. To really show how this can improve your life, we have chosen data that is too large to hold in memory all at once with pandas.

## Naive solution

First we'll try with some generated data, using the pandas-like API that Dask provides.

In [None]:
import dask

timeseries = dask.datasets.timeseries()
timeseries.rolling('1D').mean().compute()

Woohoo! That worked as expected and now we have the results. Let's try that with some real data on a distributed cluster.
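
A quick aside on what `'1D'` means: a time-based window averages every row whose timestamp falls within the trailing day, however many rows that is, rather than a fixed number of rows. Here is a minimal pandas sketch with made-up values (the timestamps and numbers are illustrative only, not taken from the generated dataset):

```python
import pandas as pd

# Made-up readings at irregular timestamps (illustrative data only).
idx = pd.to_datetime([
    "2019-01-01 00:00",
    "2019-01-01 12:00",
    "2019-01-02 06:00",
    "2019-01-04 00:00",
])
s = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)

# '1D' is a trailing time window: each result averages the rows whose
# timestamp lies in (t - 1 day, t], however many rows that happens to be.
rolled = s.rolling("1D").mean()
print(rolled)
```

The last row stands alone in its window because nothing else happened in the day before it, so its rolling mean is just its own value.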

## Set Up A Cluster

We'll use a cluster of three CPU worker machines, so we can handle some large data.

In [None]:
from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = SaturnCluster(
    scheduler_size='medium',
    worker_size='xlarge',
    n_workers=3,
    nthreads=4,
)
client = Client(cluster)
client

***

## Load Large Dataset 

NYC taxi data is a good use case, because a full year of it is too large to hold in memory with pandas. That really shows us what Dask can do!

In [None]:
import s3fs

s3 = s3fs.S3FileSystem(anon=True)
files_2019 = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv'
s3.glob(files_2019)

In [None]:
%%time
import dask.dataframe as dd

taxi = dd.read_csv(
    files_2019,
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
    assume_missing=True,
)

Great, our Dask DataFrame is defined and ready to go. Because `read_csv` is lazy, the full files have not actually been loaded yet.

### Dataset Size

That's a lot of rows!

In [None]:
taxi.shape[0].compute()

***

## Shape data

Set pickup datetime as our index. This will cause the data to be sorted and repartitioned. This is the most time-intensive part of this job.


In [None]:
%%time
taxi = taxi.set_index("tpep_pickup_datetime")

We can check the boundaries of our partitions by looking at `taxi.divisions`. Note that the partitions aren't evenly distributed across time, but this doesn't matter. Let's look at the first 10 just to get a sense.

In [None]:
taxi.divisions[:10]

Our taxi dataset has some unlikely extreme dates on the outer edges, so I'm going to filter by date just to make sure we have reliable data.

In [None]:
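from dask.distributed import wait

# Keep only pickups inside the expected date range (label slicing on the sorted index).
taxi = taxi.loc["2019-01-01":"2020-01-01"]
# Run the queued work now and keep the result in cluster memory.
taxi = taxi.persist()
_ = wait(taxi)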

***
### A Note On Persist and Compute

Lots of new users of Dask find the `.persist()` and `.compute()` processes confusing. This is understandable! But the answer is not as hard as you might think.

First, remember we have several machines working for us right now. Our Jupyter instance runs on one of them, and our cluster adds three worker machines.

If we use `.compute()`, we are asking Dask to take all the computations and adjustments to the data that we have queued up, run them, and bring the result to the surface here, in Jupyter. That means if it *was* distributed, we want to convert it into a local object here and now. If it's a Dask DataFrame, when we call `.compute()`, we're saying "Run the transformations we've queued, and convert this into a pandas DataFrame immediately." If our data is too big to fit in the memory of the Jupyter machine, this can be a disaster! But if it is small, then we might be fine.

If we use `.persist()`, we are asking Dask to take all the computations and adjustments to the data that we have queued up, and run them, but then the object is going to remain distributed and will live on the cluster, not on the Jupyter instance. So when we do this with a Dask DataFrame, we are telling our cluster "Run the transformations we've queued, and leave this as a distributed Dask DataFrame."

So, if you want to process all the delayed tasks you've applied to a Dask object, either of these methods will do it. The difference is where your object will live at the end.

***


### Begin Calculations

Back to work! The pickup datetime is now our index, and our data is sorted by this index within and across partitions.

> Note: `fare_amount` is the column we're going to work on. The code below rolls over individual trips, so each row receives the average fare of all trips picked up in the 30-day window ending at its own pickup time. An alternative is to average the fare by date first and then take the rolling average of those daily averages. Working with one value per date can make the computations easier to explain, though it may not be what you want in a real business case. We could also change our grain to hour or minute and aggregate that way.
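
For comparison, the "average by date, then roll" variant mentioned in the note can be sketched in pandas. The trips below are made up, and the window is shortened to 2 days only so that the toy data stays readable:

```python
import pandas as pd

# Made-up per-trip fares indexed by pickup time (illustrative data only).
idx = pd.to_datetime([
    "2019-01-01 08:00", "2019-01-01 18:00",
    "2019-01-02 09:00",
    "2019-01-03 07:00", "2019-01-03 21:00",
])
fares = pd.Series([10.0, 20.0, 30.0, 40.0, 60.0], index=idx)

# One value per calendar day: the mean fare of that day's trips.
daily_mean = fares.resample("D").mean()

# Rolling average of the daily means over a trailing 2-day window.
rolled_daily = daily_mean.rolling("2D").mean()
print(rolled_daily)
```

Note that this weights every day equally, whereas rolling over individual trips weights busy days more heavily.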

Let's see what we are working with:

In [None]:
%%time
taxi.fare_amount.tail(10)

Now we can try using the `rolling` method. Note that this call returns almost immediately because it is lazily evaluated: nothing has been calculated yet.

In [None]:
%%time
rolling_fares = taxi.fare_amount.rolling('30D').mean()

We can check the tail of the data to see if the results seem reasonable.

In [None]:
rolling_fares.tail()

## Attach Feature to Original Dataset

Convert the Dask Series to a Dask DataFrame, and then join on the shared index.

In [None]:
rolling_fares_df = rolling_fares.to_frame(name="fare_amount_rolled")
type(rolling_fares_df)

### Finally, time to merge!

The merge itself returns almost immediately because it is lazily evaluated.
It defines our new dataset, covering all dates, with the rolling averages attached.

In [None]:
taxi_new = taxi.join(rolling_fares_df, how='outer')

In [None]:
type(taxi_new)

In [None]:
len(taxi_new)
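
One thing worth checking here: an index join matches every row with every row that shares its index value, so repeated timestamps multiply rows. Taxi pickup times are unlikely to be unique, so it may be worth comparing this length with the original row count. The pandas sketch below uses three made-up trips to show the effect, along with column assignment as one possible alternative:

```python
import pandas as pd

# Two made-up trips share the same pickup second (illustrative data only).
idx = pd.to_datetime([
    "2019-01-01 08:00:00",
    "2019-01-01 08:00:00",
    "2019-01-01 08:00:01",
])
trips = pd.DataFrame({"fare_amount": [10.0, 20.0, 30.0]}, index=idx)
rolled = trips["fare_amount"].rolling("30D").mean().to_frame(name="fare_amount_rolled")

# The index join pairs each 08:00:00 row with both 08:00:00 rows on the
# other side, so those two trips turn into four rows.
joined = trips.join(rolled, how="outer")
print(len(trips), len(joined))

# Assigning the Series as a column keeps one output row per input row,
# because both objects already share the same index in the same order.
assigned = trips.assign(fare_amount_rolled=rolled["fare_amount_rolled"])
print(len(assigned))
```

In Dask, something like `taxi.assign(fare_amount_rolled=rolling_fares)` may serve the same purpose, on the assumption that both collections still share the same partitioning.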

## Conclusion

And with that, our dataset is ready! It contains:
* All our original fields (not all shown here, for ease of reading)
* The 30-day rolling average of fare, wherever at least one fare was found in the preceding 30 days

**Remember**, this has been appended back to the original object, which is too large to hold in memory, so we should not `.compute()` it.

In [None]:
taxi_new[['VendorID', 'tpep_dropoff_datetime','passenger_count','trip_distance','tip_amount',
          'fare_amount', 'total_amount', 'fare_amount_rolled',]].head(10)

In [None]:
taxi_new[['VendorID', 'tpep_dropoff_datetime','passenger_count','trip_distance','tip_amount',
          'fare_amount', 'total_amount', 'fare_amount_rolled',]].tail(20)