# Dask Tips and Tricks: Rolling Averages in Dask

How can a user take advantage of Dask parallelism while calculating rolling averages?

> "I need to calculate a rolling average of a numerical column, in time series data. In pandas, I can do this with rolling(x).mean() with sorted values, but what do I do in Dask, with distributed data?"


Great question! Time series data poses particular challenges for distributed, parallel computation because results depend on row order, but Dask can handle it. Here's what we need to do.

* Sort by index within AND across partitions
* Know when to compute (convert to a pandas DataFrame) or persist (run the computations and keep the result on the cluster)
* Run calculations, with attention to our need to cross partitions correctly

This example walks through each of these points. We'll use New York City taxi trip data and compute the 30-day rolling average of base fare prices. To show where Dask really helps, we have chosen data too large to be held in memory at one time in pandas.

## Naive solution

First we'll try with some generated data, using the pandas-like API that Dask provides.

In [1]:
import dask

timeseries = dask.datasets.timeseries()
timeseries.rolling('1D').mean().compute()

| timestamp | id | x | y |
|---|---|---|---|
| 2000-01-01 00:00:00 | 981.000000 | 0.660481 | 0.137203 |
| 2000-01-01 00:00:01 | 1012.500000 | -0.090472 | 0.433623 |
| 2000-01-01 00:00:02 | 1031.333333 | 0.187964 | -0.023170 |
| 2000-01-01 00:00:03 | 1023.750000 | 0.206128 | 0.005866 |
| 2000-01-01 00:00:04 | 1018.200000 | 0.068513 | -0.174119 |
| ... | ... | ... | ... |
| 2000-01-30 23:59:55 | 999.953507 | 0.002255 | -0.000729 |
| 2000-01-30 23:59:56 | 999.953484 | 0.002248 | -0.000720 |
| 2000-01-30 23:59:57 | 999.953472 | 0.002237 | -0.000723 |
| 2000-01-30 23:59:58 | 999.953310 | 0.002233 | -0.000723 |


Woohoo! That worked as expected and now we have the results. Let's try that with some real data on a distributed cluster.
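Before moving to the cluster, it can help to see what a time-based window computes on a few rows. The sketch below uses plain pandas with made-up timestamps and values (they are illustrative, not taken from the generated dataset). With an offset such as `'1D'`, each row's window covers the preceding day up to and including that row's timestamp, which is why the first rows of the output above average over only the few rows seen so far.

```python
import pandas as pd

# Three illustrative observations on a sorted datetime index.
index = pd.to_datetime([
    "2000-01-01 00:00:00",
    "2000-01-01 12:00:00",
    "2000-01-02 06:00:00",
])
values = pd.Series([1.0, 3.0, 5.0], index=index)

# Each window spans (timestamp - 1 day, timestamp], so early rows
# average over fewer observations than later ones.
rolled = values.rolling("1D").mean()
print(rolled)
```

The last row averages only the second and third values, because the first observation is more than one day older than it.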

## Set Up A Cluster

We'll use a cluster of three CPU worker machines so we can handle some large data.

In [2]:
from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = SaturnCluster(
    scheduler_size='medium',
    worker_size='xlarge',
    n_workers=3,
    nthreads=4,
)
client = Client(cluster)
client

[2020-11-20 21:09:05] INFO - dask-saturn | Cluster is ready
[2020-11-20 21:09:05] INFO - dask-saturn | Registering default plugins
[2020-11-20 21:09:05] INFO - dask-saturn | {'tcp://10.0.18.225:36265': {'status': 'repeat'}, 'tcp://10.0.22.66:46375': {'status': 'repeat'}, 'tcp://10.0.7.129:39773': {'status': 'repeat'}}


Client
* Scheduler: tcp://d-fakej-tips-and-tricks-82863533ad5a4370a84837cd924c432f.main-namespace:8786
* Dashboard: https://d-fakej-tips-and-tricks-82863533ad5a4370a84837cd924c432f.release-staging.saturncloud.org

Cluster
* Workers: 3
* Cores: 12
* Memory: 94.50 GB


***

## Load Large Dataset 

NYC taxi data is a good use case because it is too large to hold in memory as a single pandas DataFrame. That really shows us what Dask can do!

In [3]:
import s3fs

s3 = s3fs.S3FileSystem(anon=True)
files_2019 = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv'
s3.glob(files_2019)

['nyc-tlc/trip data/yellow_tripdata_2019-01.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-02.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-03.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-04.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-05.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-06.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-07.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-08.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-09.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-10.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-11.csv',
 'nyc-tlc/trip data/yellow_tripdata_2019-12.csv']

In [4]:
%%time
import dask.dataframe as dd

taxi = dd.read_csv(
    files_2019,
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
    assume_missing=True,
)

CPU times: user 29.7 ms, sys: 24.7 ms, total: 54.4 ms
Wall time: 107 ms


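Great, we have our Dask DataFrame ready to go. Note how quickly the call returned: nothing has been downloaded in full yet, because Dask only defines the work at this point.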

### Dataset Size

That's a lot of rows!

In [5]:
taxi.shape[0].compute()

84399019

***

## Shape data

Set the pickup datetime as our index. This sorts the data by that column and repartitions it, and it is the most time-intensive part of this job.


In [6]:
%%time
taxi = taxi.set_index("tpep_pickup_datetime")

CPU times: user 286 ms, sys: 32.7 ms, total: 319 ms
Wall time: 46.7 s


We can check the boundaries of our partitions by calling `taxi.divisions`. Note that the partitions aren't evenly distributed across time, but this doesn't matter. Let's look at the first 10 just to get a sense.

In [7]:
taxi.divisions[:10]

(Timestamp('2001-01-01 00:02:08'),
 Timestamp('2009-01-01 00:00:25'),
 Timestamp('2019-01-03 09:42:40'),
 Timestamp('2019-01-06 01:31:20.140165120'),
 Timestamp('2019-01-09 01:46:55.242659840'),
 Timestamp('2019-01-11 17:54:00.710126592'),
 Timestamp('2019-01-14 07:44:36'),
 Timestamp('2019-01-16 21:16:18'),
 Timestamp('2019-01-19 10:44:53'),
 Timestamp('2019-01-22 16:23:55'))

Our taxi dataset has some unlikely extreme dates on the outer edges, so I'm going to filter by date just to make sure we have reliable data.

In [8]:
%%time
from dask.distributed import wait

taxi = taxi["2019-01-01": "2020-01-01"]
taxi = taxi.persist()
_ = wait(taxi)

CPU times: user 584 ms, sys: 30.3 ms, total: 615 ms
Wall time: 1min 14s
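One detail of the date filter is easier to see on a small example: label-based slices on a datetime index include both endpoints, and a date-only string such as `"2020-01-01"` covers that whole day. That is consistent with the 2020-01-01 rows that appear in the `tail` output further down. The sketch uses plain pandas and invented rows; it assumes Dask follows the same slicing semantics here, which the output in this post suggests.

```python
import pandas as pd

index = pd.to_datetime([
    "2018-12-31 23:00:00",  # before the range, dropped
    "2019-06-01 12:00:00",  # inside the range, kept
    "2020-01-01 23:46:00",  # on the end date, kept
    "2020-01-02 00:10:00",  # after the end date, dropped
])
trips = pd.DataFrame({"fare": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Both endpoints are inclusive, and the end label covers all of 2020-01-01.
subset = trips.loc["2019-01-01":"2020-01-01"]
print(subset)
```

To exclude the end date entirely, one option would be to slice up to `"2019-12-31"` instead.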


***
### A Note On Persist and Compute

Lots of new users of Dask find the `.persist()` and `.compute()` methods confusing. This is understandable! But the answer is not as hard as you might think.

First, remember we have several machines working for us right now. Our Jupyter instance is running on one, and our cluster of three workers is running on the others.

If we use `.compute()`, we are asking Dask to run all the computations and adjustments to the data that we have queued up, and bring the result back here, into Jupyter. If the data *was* distributed, it is converted into a local object. For a Dask DataFrame, calling `.compute()` says "Run the transformations we've queued, and convert this into a pandas DataFrame immediately." If our data is too big to be held in local memory, this can be a disaster! But if it is small, then we might be fine.

If we use `.persist()`, we are asking Dask to run all the computations and adjustments to the data that we have queued up, but the resulting object remains distributed and lives on the cluster, not on the Jupyter instance. So when we do this with a Dask DataFrame, we are telling our cluster "Run the transformations we've queued, and leave this as a distributed Dask DataFrame."

So, if you want to process all the delayed tasks you've applied to a Dask object, either of these methods will do it. The difference is where your object will live at the end.

***


### Begin Calculations

Back to work! The pickup datetime is now our index, and our data is sorted by this index within and across partitions.

> Note: `fare_amount` is the column we're going to work on. The code below applies the rolling window to individual trips, so each row gets the mean fare of the trips in the 30 days up to and including its pickup time. This may not be what you want in a real business case. We could instead aggregate first to one value per day, hour, or minute, and take the rolling average of those aggregates.

Let's see what we are working with:

In [9]:
%%time
taxi.fare_amount.tail(10)

CPU times: user 5.57 ms, sys: 153 µs, total: 5.72 ms
Wall time: 14.9 ms


tpep_pickup_datetime
2019-12-31 23:59:49    10.0
2019-12-31 23:59:51    24.0
2019-12-31 23:59:52     9.5
2020-01-01 00:00:06    16.0
2020-01-01 00:00:46    13.0
2020-01-01 00:02:13     4.0
2020-01-01 00:03:25    17.0
2020-01-01 00:03:35    16.5
2020-01-01 03:51:26    13.0
2020-01-01 23:46:06    17.5
Name: fare_amount, dtype: float64

Now we can try using the `rolling` method. Note it is very fast here because it is lazily evaluated (nothing has been calculated yet).

In [10]:
%%time
rolling_fares = taxi.fare_amount.rolling('30D').mean()

CPU times: user 46.1 ms, sys: 3.83 ms, total: 49.9 ms
Wall time: 50.5 ms


We can check the tail of the data to see if the results seem reasonable.

In [11]:
rolling_fares.tail()

tpep_pickup_datetime
2020-01-01 00:02:13    13.581654
2020-01-01 00:03:25    13.581566
2020-01-01 00:03:35    13.581567
2020-01-01 03:51:26    13.573353
2020-01-01 23:46:06    13.577013
Name: fare_amount, dtype: float64

## Attach Feature to Original Dataset

Convert the Dask Series to a Dask DataFrame, and then join on the shared index.

In [12]:
rolling_fares_df = rolling_fares.to_frame(name="fare_amount_rolled")
type(rolling_fares_df)

dask.dataframe.core.DataFrame

### Finally, time to merge!

The join call itself returns almost immediately because it is lazily evaluated; the work happens when we ask for results.
This defines our new dataset, with the original columns plus the rolling average.

In [13]:
taxi_new = taxi.join(rolling_fares_df, how='outer')

In [14]:
type(taxi_new)

dask.dataframe.core.DataFrame

In [15]:
len(taxi_new)

368539618
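The row count above is more than four times the 84,399,019 rows we loaded. A plausible cause (not confirmed by the original run) is that pickup timestamps are not unique. A join on a non-unique index pairs every left row with every right row that shares the timestamp, which would also explain the repeated rows in the `tail` output below. The pandas sketch below, with invented values, shows the effect on three rows. It also shows `assign` as one possible alternative that keeps one output row per input row when both objects share the same index. The `assign` approach is a suggestion, not what this post ran.

```python
import pandas as pd

# Two trips share a pickup timestamp (illustrative values).
index = pd.to_datetime([
    "2019-12-31 23:59:41",
    "2019-12-31 23:59:41",
    "2019-12-31 23:59:43",
])
trips = pd.DataFrame({"fare_amount": [22.5, 20.5, 22.0]}, index=index)
rolled = pd.DataFrame({"fare_amount_rolled": [13.5, 13.6, 13.7]}, index=index)

# Joining on the index matches 2 x 2 rows for the duplicated timestamp.
joined = trips.join(rolled, how="outer")

# Assigning the column keeps the original row count, because the two
# indexes are identical and values are matched by position.
assigned = trips.assign(fare_amount_rolled=rolled["fare_amount_rolled"])

print(len(trips), len(joined), len(assigned))
```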

## Conclusion

And with that, our dataset is ready!
* All our original fields (not all shown here, for ease of reading)
* The 30-day rolling average of fare, if at least one fare was found in the last 30 days

**Remember**, the new column has been joined onto the full dataset, which is too large to hold in memory on one machine, so we should not `.compute()` it.

In [16]:
taxi_new[['VendorID', 'tpep_dropoff_datetime','passenger_count','trip_distance','tip_amount',
          'fare_amount', 'total_amount', 'fare_amount_rolled',]].head(10)

| tpep_pickup_datetime | VendorID | tpep_dropoff_datetime | passenger_count | trip_distance | tip_amount | fare_amount | total_amount | fare_amount_rolled |
|---|---|---|---|---|---|---|---|---|
| 2019-01-01 00:00:00 | 2.0 | 2019-01-01 01:03:12 | 2.0 | 7.37 | 0.0 | 23.5 | 24.8 | 23.5 |
| 2019-01-01 00:00:01 | 2.0 | 2019-01-01 00:05:17 | 6.0 | 1.73 | 0.0 | 7.0 | 8.3 | 15.25 |
| 2019-01-01 00:00:03 | 1.0 | 2019-01-01 00:04:26 | 1.0 | 0.6 | 0.0 | 5.0 | 6.3 | 11.833333 |
| 2019-01-01 00:00:05 | 4.0 | 2019-01-01 00:11:05 | 1.0 | 1.53 | 0.0 | 9.0 | 10.3 | 11.125 |
| 2019-01-01 00:00:06 | 1.0 | 2019-01-01 00:44:05 | 1.0 | 3.2 | 5.45 | 26.0 | 32.75 | 14.1 |
| 2019-01-01 00:00:09 | 2.0 | 2019-01-01 00:18:03 | 2.0 | 6.45 | 4.46 | 21.0 | 26.76 | 15.25 |
| 2019-01-01 00:00:11 | 2.0 | 2019-01-01 00:04:24 | 1.0 | 1.29 | 1.46 | 6.0 | 8.76 | 13.928571 |
| 2019-01-01 00:00:13 | 2.0 | 2019-01-01 00:18:33 | 1.0 | 9.36 | 0.0 | 27.0 | 28.3 | 15.5625 |
| 2019-01-01 00:00:14 | 1.0 | 2019-01-01 00:01:33 | 1.0 | 0.2 | 0.0 | 3.0 | 4.3 | 14.166667 |
| 2019-01-01 00:00:15 | 2.0 | 2019-01-01 00:02:07 | 1.0 | 0.39 | 0.0 | 3.5 | 4.8 | 13.1 |


In [17]:
taxi_new[['VendorID', 'tpep_dropoff_datetime','passenger_count','trip_distance','tip_amount',
          'fare_amount', 'total_amount', 'fare_amount_rolled',]].tail(20)

| tpep_pickup_datetime | VendorID | tpep_dropoff_datetime | passenger_count | trip_distance | tip_amount | fare_amount | total_amount | fare_amount_rolled |
|---|---|---|---|---|---|---|---|---|
| 2019-12-31 23:59:28 | 2.0 | 2020-01-01 00:13:07 | 2.0 | 11.65 | 0.0 | 31.5 | 32.8 | 13.581848 |
| 2019-12-31 23:59:32 | 2.0 | 2020-01-01 00:06:16 | 1.0 | 2.5 | 0.0 | 9.0 | 10.3 | 13.581843 |
| 2019-12-31 23:59:36 | 2.0 | 2020-01-01 00:11:16 | 1.0 | 4.48 | 0.0 | 15.0 | 18.8 | 13.581833 |
| 2019-12-31 23:59:39 | 2.0 | 2020-01-01 00:02:59 | 5.0 | 0.35 | 0.0 | 4.0 | 7.8 | 13.581831 |
| 2019-12-31 23:59:41 | 2.0 | 2020-01-01 00:20:22 | 1.0 | 6.58 | 0.0 | 22.5 | 26.3 | 13.581837 |
| 2019-12-31 23:59:41 | 2.0 | 2020-01-01 00:20:22 | 1.0 | 6.58 | 0.0 | 22.5 | 26.3 | 13.581838 |
| 2019-12-31 23:59:41 | 2.0 | 2020-01-01 00:30:05 | 3.0 | 4.02 | 4.86 | 20.5 | 29.16 | 13.581837 |
| 2019-12-31 23:59:41 | 2.0 | 2020-01-01 00:30:05 | 3.0 | 4.02 | 4.86 | 20.5 | 29.16 | 13.581838 |
| 2019-12-31 23:59:43 | 1.0 | 2020-01-01 00:31:03 | 0.0 | 4.8 | 0.0 | 22.0 | 25.8 | 13.581839 |
| 2019-12-31 23:59:47 | 2.0 | 2020-01-01 00:18:23 | 1.0 | 1.36 | 0.0 | 12.0 | 15.8 | 13.581835 |
