Performance Tuning
==================

In this notebook we consider performance tuning of parallel algorithms, using the NYC taxi data from the last exercise.  We also rely on [Dask's diagnostic dashboard](../../../9002/status) for feedback from the cluster, so now is a good time to connect to it.  We recommend placing the Jupyter notebook and the dashboard's status page side by side.

In [None]:
from dask.distributed import Client, progress, wait

client = Client('schedulers:9000')
client

In [None]:
import dask.dataframe as dd

df = dd.read_csv('gcs://anaconda-public-data/nyc-taxi/csv/2015/yellow_tripdata_2015-*.csv',
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                 storage_options={'token': 'cloud'})
df = df.persist()
df

In [None]:
progress(df)

### Partition size

Performance in distributed systems depends on many factors.  Some of these are familiar from single-machine computing but some are new.  In this section we consider the costs of too many or too few partitions.

In the following lines we split our dask.dataframe into 1000 smaller pandas dataframes.  This is both good and bad:

1. **Good**:  It exposes more parallelism.  If we have more cores we can split the computation more finely.
2. **Bad**:  It adds more overhead.  There is a fixed cost to every task.
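
To make this trade-off concrete, here is a toy cost model in plain Python.  It is only a sketch: the function name, the assumption that tasks run in waves of `ncores`, and the one-millisecond overhead figure are all illustrative assumptions, not measurements from this cluster.

```python
import math

def estimated_seconds(total_work_s, npartitions, ncores, overhead_s=0.001):
    """Toy estimate of wall time for an embarrassingly parallel step.

    Assumes every task does an equal share of the work plus a fixed
    overhead, and that tasks run in waves of `ncores` at a time.
    """
    per_task = total_work_s / npartitions + overhead_s
    waves = math.ceil(npartitions / ncores)
    return waves * per_task

# Hypothetical numbers: 10 seconds of total work on a 10-core cluster.
for n in (1, 10, 1000, 100000):
    print(n, round(estimated_seconds(10.0, n, 10), 3))
```

In this model a single partition leaves nine cores idle, while a very large number of partitions lets the fixed per-task cost dominate the useful work.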

### Exercise

Run the code below and use Dask's diagnostic dashboard to investigate what is taking up time.  Change two parameters in the computation to make the second cell as fast as possible:

-  `npartitions`: The number of partitions for our dataframe
-  `split_every`: The granularity by which we reduce intermediate values in the sum

*Note: we ignore the cost of the first cell, where we repartition.  This is typically done once at data ingestion.*
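
To build intuition for `split_every`, the sketch below counts how many intermediate results remain after each round of a tree reduction.  It is a simplified model written for this exercise (the helper `reduction_rounds` is hypothetical); Dask's actual grouping may differ in detail.

```python
import math

def reduction_rounds(npartitions, split_every):
    """Number of intermediate results left after each reduction round.

    Simplified model: each round combines groups of up to `split_every`
    results into one, until a single result remains.
    """
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    remaining = npartitions
    rounds = []
    while remaining > 1:
        remaining = math.ceil(remaining / split_every)
        rounds.append(remaining)
    return rounds

for split_every in (2, 8, 32, 1000):
    rounds = reduction_rounds(1000, split_every)
    print(split_every, len(rounds), rounds)
```

A small `split_every` produces a deep tree with many small tasks, while a large one funnels many inputs into a single task.  The best value for this cluster is what the exercise asks you to find.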

In [None]:
df2 = df.repartition(npartitions=1000).persist()
wait(df2);

In [None]:
%time df2.passenger_count.sum(split_every=1000).compute()

### Communication 

In this section we look at distributed matrix multiply.  This algorithm can be bound by communication depending on how the array is chunked.

We make a distributed numpy array.

In [None]:
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x = x.persist()
x

Let's multiply x by its transpose.  Watch the diagnostic dashboard: what do you notice?  In particular, track the amount of *red* in the Task Stream plot, which corresponds to communication, and the amount of intermediate data stored per worker, shown in the upper left.
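
As a rough guide to what to expect, the following sketch counts the block products in a blocked `x.dot(x.T)` and the memory the partial results would need.  It assumes chunk sizes divide the array evenly and that every partial product is held at once, which is a pessimistic upper bound; the helper `blocked_matmul_cost` is hypothetical and not part of Dask.

```python
def blocked_matmul_cost(n, chunk_rows, chunk_cols, itemsize=8):
    """Count block products for x.dot(x.T) on an n-by-n float64 array.

    Assumes chunk_rows and chunk_cols divide n evenly.
    """
    row_blocks = n // chunk_rows
    col_blocks = n // chunk_cols
    # The output has one block per pair of row blocks of x.
    output_blocks = row_blocks * row_blocks
    # Each output block is a sum over the contracted (column) blocks.
    block_products = output_blocks * col_blocks
    # Each partial product has shape (chunk_rows, chunk_rows).
    partial_bytes = block_products * chunk_rows * chunk_rows * itemsize
    return {
        "output_blocks": output_blocks,
        "block_products": block_products,
        "partial_bytes": partial_bytes,
    }

cost = blocked_matmul_cost(10000, 1000, 1000)
print(cost)
print("input bytes:", 10000 * 10000 * 8)
```

Every block product needs two input chunks on the same worker, so unless both already live there, at least one of them has to move over the network.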

In [None]:
y = x.dot(x.T).persist()

### Change chunking

Currently our array is stored as a 10x10 grid of 1000x1000 numpy arrays.  We can change the chunk shape using the `.rechunk` method.  The chunk shape that we choose can strongly impact the cost of the matrix multiply algorithm.

We might choose larger or smaller chunks:

    x = x.rechunk((100, 100)).persist()  # more and smaller chunks
    x = x.rechunk((2000, 2000)).persist()  # fewer and larger chunks
    
Or we might choose chunks of a different shape:

    x = x.rechunk((2000, 500)).persist()  # make chunks tall and skinny
    x = x.rechunk((500, 2000)).persist()  # make chunks short and wide
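
Before rechunking, it can help to compare the candidates on paper.  The sketch below reports the number of chunks and the size of each chunk for the shapes listed above, assuming float64 values (8 bytes each); the helper `chunk_summary` is hypothetical and does not touch the cluster.

```python
def chunk_summary(shape, chunks, itemsize=8):
    """Return (number of chunks, megabytes per full chunk) for a 2-D array."""
    rows, cols = shape
    chunk_rows, chunk_cols = chunks
    # Ceiling division so that ragged edge chunks are counted too.
    grid_rows = -(-rows // chunk_rows)
    grid_cols = -(-cols // chunk_cols)
    n_chunks = grid_rows * grid_cols
    mb_per_chunk = chunk_rows * chunk_cols * itemsize / 1e6
    return n_chunks, mb_per_chunk

candidates = [(100, 100), (1000, 1000), (2000, 2000), (2000, 500), (500, 2000)]
for chunks in candidates:
    print(chunks, chunk_summary((10000, 10000), chunks))
```

Note that the tall-and-skinny and short-and-wide options have the same chunk count and chunk size as the original 1000x1000 layout, so any difference in matrix multiply cost comes from shape alone.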

In [None]:
x = x.rechunk((500, 500)).persist()
wait(x);

In [None]:
%%time 
y = x.dot(x.T).persist()
wait(y)