#### What is Dask?
##### Dask is a flexible parallel computing library for Python, built for datasets and workloads that outgrow a single core or a single machine's memory. It scales computations across threads, processes, or clusters through APIs that closely mirror Pandas, NumPy, and scikit-learn.

In [None]:
# ⚙️ Core Functionalities of Dask

In [2]:
dask_info = [
    {"Area": "Big Dataframes", "Functionality": "Parallelized Pandas", "API": "dask.dataframe"},
    {"Area": "Big Arrays", "Functionality": "Parallelized NumPy", "API": "dask.array"},
    {"Area": "Parallel Functions", "Functionality": "Lazy task scheduling", "API": "dask.delayed, dask.compute"},
    {"Area": "Machine Learning", "Functionality": "Parallel ML with scikit-learn", "API": "dask-ml"},
    {"Area": "Distributed Computing", "Functionality": "Scale to clusters", "API": "dask.distributed"},
    {"Area": "Streaming", "Functionality": "Stream processing", "API": "streamz (separate library)"},
    {"Area": "Graphs", "Functionality": "Task-based DAG execution", "API": "dask.highlevelgraph"},
]

# Optional: print it as a table using tabulate
from tabulate import tabulate
print(tabulate(dask_info, headers="keys"))


Area                   Functionality                  API
---------------------  -----------------------------  --------------------------
Big Dataframes         Parallelized Pandas            dask.dataframe
Big Arrays             Parallelized NumPy             dask.array
Parallel Functions     Lazy task scheduling           dask.delayed, dask.compute
Machine Learning       Parallel ML with scikit-learn  dask-ml
Distributed Computing  Scale to clusters              dask.distributed
Streaming              Stream processing              streamz (separate library)
Graphs                 Task-based DAG execution       dask.highlevelgraph


In [4]:
import pandas as pd

dask_data = {
    "Area": [
        "Big Dataframes", "Big Arrays", "Parallel Functions", "Machine Learning",
        "Distributed Computing", "Streaming", "Graphs"
    ],
    "Functionality": [
        "Parallelized Pandas", "Parallelized NumPy", "Lazy task scheduling",
        "Parallel ML with scikit-learn", "Scale to clusters",
        "Stream processing", "Task-based DAG execution"
    ],
    "API": [
        "dask.dataframe", "dask.array", "dask.delayed, dask.compute",
        "dask-ml", "dask.distributed", "streamz (separate library)", "dask.highlevelgraph"
    ]
}

df = pd.DataFrame(dask_data)
print(df)


                    Area                  Functionality  \
0         Big Dataframes            Parallelized Pandas   
1             Big Arrays             Parallelized NumPy   
2     Parallel Functions           Lazy task scheduling   
3       Machine Learning  Parallel ML with scikit-learn   
4  Distributed Computing              Scale to clusters   
5              Streaming              Stream processing   
6                 Graphs       Task-based DAG execution   

                          API  
0              dask.dataframe  
1                  dask.array  
2  dask.delayed, dask.compute  
3                     dask-ml  
4            dask.distributed  
5  streamz (separate library)  
6         dask.highlevelgraph  


In [8]:
# Quote the extra so the shell does not treat the brackets as a glob pattern
!pip install "dask[complete]"



#### ✅ Common Use Cases
#### 1. Handle CSVs that don't fit in memory

#### 2. Large matrix computations (e.g., climate, imaging, ML)

#### 3. Parallel data processing (ETL)

#### 4. Scalable ML training using Dask-ML

#### 5. Processing millions of rows with a Pandas-like API, in parallel across cores

In [11]:
from tabulate import tabulate

data = [
    ["Memory", "In-memory only", "Out-of-core (disk, memory-efficient)"],
    ["Performance", "Single-threaded", "Multi-threaded/cluster"],
    ["Syntax", "Easy", "Pandas-like (a subset)"],
    ["Scale", "GBs", "100s of GBs to TBs"]
]

headers = ["Feature", "Pandas", "Dask"]

print(tabulate(data, headers=headers, tablefmt="grid"))


+-------------+-----------------+--------------------------------------+
| Feature     | Pandas          | Dask                                 |
+=============+=================+======================================+
| Memory      | In-memory only  | Out-of-core (disk, memory-efficient) |
+-------------+-----------------+--------------------------------------+
| Performance | Single-threaded | Multi-threaded/cluster               |
+-------------+-----------------+--------------------------------------+
| Syntax      | Easy            | Pandas-like (a subset)               |
+-------------+-----------------+--------------------------------------+
| Scale       | GBs             | 100s of GBs to TBs                   |
+-------------+-----------------+--------------------------------------+


### 1. Dask DataFrame (for Big CSVs)

In [16]:
import dask.dataframe as dd

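# Point Dask at the large CSV; only a small sample is read up front to infer dtypes
df = dd.read_csv('big_sales_data.csv')

# groupby/mean only builds a task graph; .compute() runs it in parallel
result = df.groupby('category')['sales'].mean().compute()
print(result)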


category
Clothing        57.311667
Electronics    794.207778
Name: sales, dtype: float64


### 2. Dask Array (like NumPy)

In [19]:
import dask.array as da

# Create large random array
x = da.random.random((10000, 10000), chunks=(1000, 1000))
mean = x.mean().compute()
print("Mean:", mean)


Mean: 0.49998041051119124
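
To make the role of `chunks` more concrete, here is a small sketch that assumes a 4x4 NumPy array wrapped with `da.from_array`. Each chunk is an ordinary NumPy block, and a reduction is computed per chunk and then combined.

```python
import numpy as np
import dask.array as da

# A small NumPy array wrapped as a Dask array with 2x2 chunks.
data = np.arange(16, dtype="float64").reshape(4, 4)
x = da.from_array(data, chunks=(2, 2))

print(x.numblocks)  # number of chunks along each axis

# Column sums are lazy until .compute() is called.
col_sums = x.sum(axis=0)
result = col_sums.compute()
print(result)

# The chunked result matches plain NumPy.
matches_numpy = bool(np.allclose(result, data.sum(axis=0)))
print(matches_numpy)
```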


### 🏗️ 3. Dask Delayed (Custom Parallel Tasks)

In [22]:
from dask import delayed

@delayed
def add(x, y):
    return x + y

@delayed
def multiply(x, y):
    return x * y

# Create task graph
final = add(multiply(10, 2), multiply(5, 3))
print("Result:", final.compute())


Result: 35
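
The same idea extends to many tasks. The sketch below, with hypothetical `square` and `total` helpers, builds one delayed task per input and feeds the whole list into a final reduction, so independent tasks can run in parallel.

```python
import dask
from dask import delayed

@delayed
def square(n):
    return n * n

@delayed
def total(values):
    return sum(values)

# Build one lazy task per input; nothing runs yet.
squares = [square(n) for n in range(5)]

# Passing a list of delayed objects links them into a single graph.
lazy_total = total(squares)
grand_total = lazy_total.compute()
print("Total:", grand_total)

# dask.compute evaluates several delayed objects in one pass.
first, last = dask.compute(squares[0], squares[-1])
print(first, last)
```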


### ⚡ 4. Dask with Distributed Scheduler

In [25]:
from dask.distributed import Client

client = Client()  # Starts a local cluster of worker processes and connects to it
print(client)

# Dask computations in this session now run on the cluster's workers


<Client: 'tcp://127.0.0.1:52013' processes=4 threads=4, memory=11.89 GiB>
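
Once a client exists, work can also be submitted as futures. This is a minimal sketch under a few assumptions: `increment` is a made-up function, `processes=False` keeps the workers as threads inside the current process, and `dashboard_address=None` skips starting the dashboard so the example stays lightweight.

```python
from dask.distributed import Client

def increment(n):
    return n + 1

# Assumed settings for a lightweight, in-process local cluster.
with Client(processes=False, n_workers=1, threads_per_worker=2,
            dashboard_address=None) as client:
    future = client.submit(increment, 41)      # starts running in the background
    single = future.result()                   # blocks until the result is ready
    futures = client.map(increment, range(3))  # one future per input
    many = client.gather(futures)              # collect all results

print(single, many)
```

The `with` block closes the client and the cluster it created when the work is done.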


### ✅ Pros of Dask
#### ✅ Parallel & Efficient: Multithreading and multiprocessing out-of-the-box
#### ✅ Out-of-Core: Works with datasets larger than RAM
#### ✅ Familiar APIs: Very similar to Pandas, NumPy, scikit-learn
#### ✅ Scalable: Can run on a laptop or 100-node cluster
#### ✅ Lazy Evaluation: Builds computation graphs for optimization
#### ✅ Dashboard: Built-in performance dashboard (when using distributed)

### ❌ Cons of Dask
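#### ❌ Slower on small data (due to overhead of parallelism)
#### ❌ Debugging can be tricky because errors surface only when a lazy graph is computed
#### ❌ Not all Pandas operations are supported (e.g., positional row indexing with .iloc; .pivot_table() works only with restrictions)
#### ❌ Learning Curve for cluster deployment
#### ❌ Scheduling overhead adds latency when individual tasks are very small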
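
One common way to ease the debugging problem is to run the graph on the synchronous scheduler, which executes tasks one at a time in the calling thread. The sketch below assumes a hypothetical `parse` step that fails on bad input; the error appears only at compute time, not when the graph is built.

```python
import dask
from dask import delayed

@delayed
def parse(value):
    # Hypothetical step that fails on non-numeric input.
    return int(value)

# Building the graph succeeds even though one input is bad.
lazy_values = [parse(v) for v in ["1", "2", "oops"]]

caught = None
try:
    # The synchronous scheduler runs tasks in the calling thread,
    # so tracebacks are direct and debuggers such as pdb can step in.
    dask.compute(*lazy_values, scheduler="synchronous")
except ValueError as exc:
    caught = exc

print(type(caught).__name__)
```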