# Summary of the efforts to parallelize within the pipeline so far

*Written by Xiao Ming*

### 1. Motivation

The existing data pipeline has a large number of operations. We want to test whether the pipeline itself can be broken into individual Dask delayed functions and run with multithreading or multiprocessing to speed it up.

### 2. Understanding the subtasks

We begin by analyzing the dependencies between subtasks and drawing them as a directed acyclic graph (DAG).

Link to graph: https://www.figma.com/file/p3BJHBWEL8PQUXWGPYUxNf/Diagram-Basics-(Community)?type=whiteboard&node-id=0-1&t=oiyNPvAPb0DVlMFK-0

![title](img/dag.png)

### 3. Identify parallelization targets

Note that in the diagram above, clipping_check, score_data_set, and capacity_clustering can all execute as soon as get_daily_flags finishes. If we can define each of those functions as a Dask delayed function, we can then execute them in parallel.
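The fan-out described above can be sketched with `dask.delayed`. This is only a minimal illustration: the function names come from the diagram, but their signatures, bodies, and the sample data are placeholder assumptions, not the real pipeline code.

```python
import dask
from dask import delayed

# Placeholder bodies: the real pipeline steps are assumed to be far more involved.
@delayed
def get_daily_flags(data):
    return [value > 0 for value in data]

@delayed
def clipping_check(data, flags):
    return sum(flags)

@delayed
def score_data_set(data, flags):
    return sum(flags) / len(flags)

@delayed
def capacity_clustering(data, flags):
    return max(data)

data = [0.0, 1.5, 2.0, 0.0, 3.2]  # made-up sample values
flags = get_daily_flags(data)

# All three tasks depend only on `flags`, so the scheduler is free to
# run them concurrently once get_daily_flags has finished.
clipped = clipping_check(data, flags)
score = score_data_set(data, flags)
capacity = capacity_clustering(data, flags)

# A single compute call lets Dask see the shared dependency and run it once.
results = dask.compute(clipped, score, capacity, scheduler="threads")
```

Passing all three outputs to one `dask.compute` call, rather than calling `.compute()` on each, is what allows the shared get_daily_flags task to be evaluated once and its dependents to be scheduled together.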

### 4. Refactor the code for the new parallelized pipeline

Requirements:

1. Entire pipeline refactored into Dask delayed functions
2. Dask delayed functions are not executed until we call compute
3. Data retrieval should also be a Dask delayed function

These requirements, as laid out by the clients, raise a few issues that we need to pay attention to.
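Before turning to those issues, requirements 2 and 3 can be illustrated with a small sketch. The names `load_data`, `preprocess`, and `"hypothetical_source"` are assumptions made up for this example; the `calls` list exists only to show when each step actually runs.

```python
from dask import delayed

calls = []  # records which steps have actually executed

@delayed
def load_data(source):
    # Hypothetical data retrieval step, itself a delayed function.
    calls.append("load_data")
    return [1, 2, 3]

@delayed
def preprocess(values):
    calls.append("preprocess")
    return [v * 2 for v in values]

# Building the pipeline only constructs a task graph; nothing runs yet.
pipeline = preprocess(load_data("hypothetical_source"))
ran_before_compute = list(calls)

# Execution happens here. The synchronous scheduler is used so the
# order of the recorded calls is deterministic.
result = pipeline.compute(scheduler="synchronous")
```

After the last line, `ran_before_compute` is still empty, while `calls` lists both steps in dependency order.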


Considerations/challenges:

1. Because of the way Dask tasks are scheduled, delayed functions should not mutate their inputs. This means that every time we change an object attribute, that attribute has to be passed into the function as an input, copied, mutated, and then returned as an output. https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-mutate-inputs
2. For Dask multithreading/multiprocessing to have a meaningful impact on performance, the pipeline should be broken up into many small pieces. As can be seen from the diagram in part 2, if we only break the pipeline apart into the six predefined phases (preprocessing, cleaning, scoring, etc.), everything is guaranteed to run serially, per the DAG above.
3. Because of the first consideration ("cannot mutate input/have to pass in everything the code changes"), many of our delayed functions will have many inputs, each of them large (they are objects and matrices). According to the Dask best practices, Dask calculates a hash for each concrete input. This means that Dask will need to hash a lot of objects, arrays, and matrices, resulting in a slowdown. https://docs.dask.org/en/latest/delayed-best-practices.html#avoid-repeatedly-putting-large-inputs-into-delayed-calls
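The first and third considerations can be sketched together, following the patterns suggested on the linked best-practices pages. The functions `fill_missing` and `total_energy` and the small array are hypothetical stand-ins for a pipeline step and one of its large matrix attributes.

```python
import numpy as np
import dask
from dask import delayed

@delayed
def fill_missing(matrix):
    # Copy first, then change the copy; the input array is left untouched.
    filled = matrix.copy()
    filled[np.isnan(filled)] = 0.0
    return filled

@delayed
def total_energy(matrix):
    return float(np.nansum(matrix))

# Tiny stand-in for a large matrix attribute of the pipeline object.
raw = np.array([[1.0, np.nan], [3.0, 4.0]])

# Wrap the large input in delayed once and reuse that handle,
# instead of passing the concrete array into every delayed call.
raw_d = delayed(raw)

filled = fill_missing(raw_d)
total = total_energy(raw_d)

filled_result, total_result = dask.compute(filled, total, scheduler="threads")
```

With the threaded scheduler both tasks receive the same array object, which is why `fill_missing` has to copy before writing: an in-place change could otherwise leak into `total_energy`.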