# Dask Accelerated Deep Learning

Stephanie Kirmer  
Senior Data Scientist  
[Saturn Cloud](saturncloud.io)  

www.stephaniekirmer.com  
@data_stephanie

## Brief Introduction to Dask

Dask is an open-source framework that enables parallelization of Python code.

Two key concepts:
* Distributed data objects
* Distributed computation


### Distributed Data

Data is broken up across multiple machines, allowing analysis on data larger than any single machine's memory.

![](img/dask_df.png)

### Distributed Computation

With "lazy" (delayed) evaluation, tasks are organized into a task graph, a directed acyclic graph (DAG), which is then distributed to workers for computation.

![](img/dask_graph.png)

[notes]
The foundation that makes this possible is what's called "lazy" evaluation or delayed evaluation. By creating delayed-evaluation tasks, you can develop task graphs, and distribute these across your compute resources to be run simultaneously. This may be used on single machines as well as clusters.

This example shows an interconnected task graph of several delayed functions.
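A small task graph like this can be sketched with `dask.delayed`. The functions `load`, `total`, and `combine` are hypothetical stand-ins for real work; the point is that calling them only builds the graph, and nothing runs until `compute()`.

```python
from dask import delayed


@delayed
def load(n):
    # Stand-in for reading a chunk of data.
    return list(range(n))


@delayed
def total(values):
    return sum(values)


@delayed
def combine(a, b):
    return a + b


# Nothing executes here: each call only adds a node to the task graph.
left = total(load(4))   # will be 0 + 1 + 2 + 3
right = total(load(3))  # will be 0 + 1 + 2
graph = combine(left, right)

# compute() hands the graph to a scheduler; the two branches are
# independent, so they may run in parallel.
result = graph.compute(scheduler="threads")
print(result)
```

Because `left` and `right` do not depend on each other, the scheduler is free to run them at the same time, on one machine or on separate workers.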


## Dask Clusters

When we implement the Dask framework across multiple machines, the cluster architecture looks something like this. In this structure, the client submits tasks to a scheduler, which distributes them to the worker machines and returns the aggregated results to the client.

![](img/dask-cluster.png)
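The client, scheduler, and worker roles can be tried out locally. This is a sketch that assumes the `distributed` package is installed and uses `LocalCluster` as a stand-in for a real multi-machine cluster; the `square` function and the worker count are illustrative.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    return x * x


# processes=False runs the scheduler and workers as threads in this process,
# which keeps the example light; a real cluster would span several machines.
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,  # skip the diagnostics dashboard for this sketch
) as cluster:
    with Client(cluster) as client:
        # The client sends tasks to the scheduler, which assigns them to workers.
        futures = client.map(square, range(4))
        # Results are aggregated on the cluster, then returned to the client.
        total = client.submit(sum, futures)
        answer = total.result()

print(answer)
```

Pointing `Client` at the address of a remote scheduler instead of a `LocalCluster` would leave the rest of the code unchanged.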

## Applications for Deep Learning

* Process extremely large data using distributed data objects and/or lazy loading
* Train very large or complex models using distributed training


### Distributed Training

* Training a single model across multiple machines simultaneously
* Break training data into subsets; each worker handles a different chunk

[notes] By applying these foundations to deep learning tasks, we can do more computation per unit of time. This includes training a single model on multiple machines simultaneously, which speeds up training.

In this demonstration, I'll apply PyTorch's DistributedDataParallel (DDP) framework to train an image classification model across a cluster. Each worker holds a replica of the model, and the workers synchronize gradients during every backward pass, so each replica benefits from what the others learned on their own chunks of data.
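The two ideas behind data-parallel training, sharding the data and averaging gradients, can be sketched in plain Python. This is an illustration of the concept only, not the PyTorch implementation; `shard_indices` and `average_gradients` are hypothetical helpers, and the strided split is similar in spirit to what `torch.utils.data.DistributedSampler` does.

```python
def shard_indices(num_samples, world_size, rank):
    """Return the sample indices handled by the worker with this rank."""
    # Strided split: worker 0 takes 0, world_size, 2*world_size, ...
    return list(range(rank, num_samples, world_size))


def average_gradients(per_worker_grads):
    """Element-wise mean of the workers' gradient vectors.

    This mimics the result of an all-reduce: afterwards every worker
    applies the same averaged gradient to its model replica.
    """
    n = len(per_worker_grads)
    return [sum(values) / n for values in zip(*per_worker_grads)]


# Ten samples split across three workers; every sample lands on exactly one.
shards = [shard_indices(10, world_size=3, rank=r) for r in range(3)]
print(shards)

# Each worker computes gradients on its own shard (made-up numbers here),
# then all workers agree on the average.
synced = average_gradients([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(synced)
```

Since every replica starts from the same weights and applies the same averaged gradient, the replicas stay identical after each step.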

![](img/step1.png)

![](img/step2.png)

![](img/step3.png)

# Demonstration

Training an image classification model

* Architecture: ResNet-50 (not pretrained)
* Dataset: Stanford Dogs (20,580 images)

### Key Elements

* Lazy, parallelized loading of training images (S3 to DataLoader)
* Distributed training across cluster, one job per worker
* Use GPU machines for computation
* Performance monitoring outside training context

```python
#%run -i run_cluster_pyt.py
```