<h1 style="font-size:50px">Dask Introduction</h1>

Dask natively scales Python. It provides advanced parallelism for analytics, enabling performance at scale for the tools you love.

- Built in Python 
- Scales properly from single laptops to 1000-node clusters
- Leverages existing Python APIs as much as possible

**Please Note:**
  - Explicit is better than implicit
  - Simple is better than complex
  - Complex is better than complicated
  - Readability counts
  - If the implementation is hard to explain, it is a bad idea
  - If the implementation is easy to explain, it may be a good idea

# About Dask

Dask was created in 2014 as part of the Blaze project, a DARPA-funded project at Continuum/Anaconda. It has since grown into a multi-institution community project with developers from projects including NumPy, Pandas, Jupyter and Scikit-Learn. Many of the core Dask maintainers are employed to work on the project by companies including Continuum/Anaconda, Prefect, NVIDIA, Capital One, Saturn Cloud and Coiled.

Fundamentally, Dask allows a variety of parallel workflows using existing Python constructs, patterns, or libraries, including dataframes, arrays (scaling out NumPy), bags (an unordered collection construct a bit like `Counter`), and `concurrent.futures`.
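As a rough illustration of two of these collections, the sketch below builds a small Dask array and a Dask bag. The input values are made up for the example, and `scheduler="threads"` is passed to the bag only so that the snippet runs in a plain script without starting worker processes.

```python
import numpy as np
import dask.array as da
import dask.bag as db

# A Dask array follows the NumPy API but is split into chunks.
# Here a 3x4 array is cut into two 3x2 column blocks (illustrative sizes).
x = da.from_array(np.arange(12).reshape(3, 4), chunks=(3, 2))
col_sums = x.sum(axis=0)               # lazy: only builds a task graph
col_sums_result = col_sums.compute()   # runs the graph, returns a NumPy array

# A Dask bag holds an unordered collection of Python objects.
words = db.from_sequence(["a", "b", "a", "c", "a"], npartitions=2)
# frequencies() yields (item, count) pairs, similar in spirit to Counter.
word_counts = dict(words.frequencies().compute(scheduler="threads"))

print(col_sums_result)
print(word_counts)
```

Nothing is computed until `compute()` is called; before that, each object is only a description of the work to be done.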

In addition to working in conjunction with Python ecosystem tools, Dask's low scheduling overhead (roughly a millisecond or less per task) allows it to work well even on single machines, and to scale up smoothly.

Dask supports a variety of use cases for industry and research: https://stories.dask.org/en/latest/

With its recent 2.x releases, and integration with other projects (e.g., RAPIDS for GPU computation), many commercial enterprises are paying attention and jumping into parallel Python with Dask.

__Dask Ecosystem__

In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...
* Dask-ML - parallel machine learning, with a scikit-learn-style API
* Dask-Kubernetes
* Dask-XGBoost
* Dask-YARN
* Dask-image
* Dask-cuDF
* ... 

__What's Not Part of Dask?__

There are lots of functions that integrate with Dask, but are not represented in the core Dask ecosystem, including...
* a SQL engine
* data storage
* data catalog
* visualization
* coarse-grained scheduling / orchestration
* streaming

There are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).

# How Do We Set Up and/or Deploy Dask?

The easiest way to install Dask is with Anaconda: `conda install dask`
    
__Schedulers and Clustering__

Dask has a simple default scheduler called the "single machine scheduler" -- this is the scheduler that is used if you `import dask` and start running code without explicitly using a `Client` object. It can be handy for quick-and-dirty testing, but I would (*warning! opinion!*) suggest that a best practice is to __use the newer "distributed scheduler" even for single-machine workloads__.

The distributed scheduler can work with
* threads (although that is often not a great idea due to the GIL) in one process
* multiple processes on one machine
* multiple processes on multiple machines

The distributed scheduler has additional useful features including data locality awareness and realtime graphical dashboards.
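A minimal sketch of using the distributed scheduler on a single machine might look like the following. It assumes the `distributed` package is installed. The worker counts and array sizes are illustrative, `processes=False` keeps the workers as threads inside the current process for a quick trial, and `dashboard_address=None` switches the dashboard off so that the snippet does not open a port.

```python
from dask.distributed import Client
import dask.array as da

# Start a local cluster and connect to it. Keyword arguments are passed
# on to LocalCluster; drop processes=False to get separate worker processes,
# and drop dashboard_address=None to get the graphical dashboard.
client = Client(
    processes=False,
    n_workers=1,
    threads_per_worker=2,
    dashboard_address=None,
)

# Once a Client exists, compute() calls use the distributed scheduler.
x = da.ones((1000, 1000), chunks=(250, 250))
total = float(x.sum().compute())
print(total)

client.close()
```

For a cluster on several machines, the usual pattern is to pass the scheduler's address to `Client` instead of letting it create a local cluster.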

# Let's See Some Code

Before we go any further, let's take a look at one particular, common use case for Dask: scaling Pandas dataframes to
- larger datasets (which don't fit in memory) and
- multiple processes (which could be on multiple nodes)

## Speed comparison for small dataset (2MB)

### 1. Pandas

### 2. Dask

## Speed comparison for big dataset (250MB)

### 1. Pandas

### 2. Dask

**There will be ___ partitions.**

**Return the number of rows for each country in Asia**

**Note: "compute" actually runs the work**

Each read method has a writing counterpart that we can use:

- `read_csv` / `to_csv`
- `read_hdf` / `to_hdf`
- `read_json` / `to_json`
- `read_parquet` / `to_parquet`

## Best practice:

HDF5 is a popular choice for Pandas users with high performance needs. We encourage Dask DataFrame users to store and load data using **Parquet** instead. Apache Parquet is a columnar binary format that is easy to split into multiple files (easier for parallel loading) and is generally much simpler to deal with than HDF5.