## Threads vs Processes in Python HPC with Dask and the GIL

---

## 1. Threads vs Processes

### Schematic Diagram

```
+-----------------------------------------------------+
|                    Single Process                   |
| +--------+   +--------+   +--------+   +--------+   |
| | Thread |   | Thread |   | Thread |   | Thread |   |
| |   1    |   |   2    |   |   3    |   |   4    |   |
| +--------+   +--------+   +--------+   +--------+   |
+-----------------------------------------------------+

vs.

+---------+   +---------+   +---------+   +---------+
| Process |   | Process |   | Process |   | Process |
|    A    |   |    B    |   |    C    |   |    D    |
| +-----+ |   | +-----+ |   | +-----+ |   | +-----+ |
| |Thr  | |   | |Thr  | |   | |Thr  | |   | |Thr  | |
| | 1   | |   | | 1   | |   | | 1   | |   | | 1   | |
| +-----+ |   | +-----+ |   | +-----+ |   | +-----+ |
+---------+   +---------+   +---------+   +---------+
```

### Threads

- Run inside a single process and share its memory, so they can exchange data without copying or serializing it.
- Limited by the GIL for pure-Python work: only one thread runs Python code at a time.

### Processes

- Run as separate programs with independent memory spaces, each with its own Python interpreter and GIL.
- Can execute CPU-bound Python code in parallel across cores, at the cost of serializing data between processes.
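
The difference in memory sharing can be sketched with the standard library alone. In this minimal illustration (the names `shared`, `worker`, and `payload` are made up for the example), threads append to one shared list, a child interpreter started with `subprocess` shows that a process has its own PID, and a pickle round trip stands in for what happens when data crosses a process boundary: the receiver gets a copy, not the original object.

```python
import os
import pickle
import subprocess
import sys
import threading

shared = []        # one list object, visible to every thread
thread_pids = []   # PID observed from inside each thread


def worker(i):
    # Threads mutate the same object directly; nothing is copied.
    shared.append(i)
    thread_pids.append(os.getpid())


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared))                      # all four writes landed in one list
print(set(thread_pids) == {os.getpid()})   # every thread ran in this process

# A child process is a separate interpreter with its own PID and memory.
child_pid = int(
    subprocess.run(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        capture_output=True, text=True, check=True,
    ).stdout
)
print(child_pid != os.getpid())

# Data sent to another process is serialized (pickled) and rebuilt there,
# so the receiver works on a copy. A round trip shows the effect.
payload = {"values": [1, 2, 3]}
received = pickle.loads(pickle.dumps(payload))
received["values"].append(4)
print(payload["values"], received["values"])
```

The last line shows that changing the unpickled copy leaves the original untouched, which is why results computed in a worker process have to be sent back explicitly.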

---

## 2. What Is the Python GIL?

The Global Interpreter Lock (GIL) is a mutex inside CPython that ensures only one thread executes Python bytecode at any moment. It exists because CPython’s memory management, in particular its reference counting, is not thread-safe on its own.

- **Effect on CPU-bound code**: Threads cannot leverage multiple cores for pure-Python computations.
- **Libraries written in C** (e.g., NumPy) often release the GIL for heavy computations, allowing multi-threaded speedups.
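
A rough way to observe both effects with only the standard library is to time the same work run sequentially and in two threads. This is a sketch, not a benchmark: `count_down` and `digest` are illustrative helpers, the sizes are arbitrary, and actual timings depend on the machine and the Python build. `hashlib` serves as the C-backed example here because its documentation states that it releases the GIL while hashing large buffers.

```python
import hashlib
import threading
import time


def count_down(n):
    # Pure-Python loop: holds the GIL while it runs.
    while n > 0:
        n -= 1
    return n


def digest(buf):
    # C-backed hashing: hashlib releases the GIL for large buffers.
    return hashlib.sha256(buf).hexdigest()


def timed_sequential(func, arg, repeats=2):
    # Run func(arg) `repeats` times, one after another.
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    return time.perf_counter() - start


def timed_threaded(func, arg, repeats=2):
    # Run func(arg) in `repeats` threads at the same time.
    threads = [threading.Thread(target=func, args=(arg,)) for _ in range(repeats)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


n = 2_000_000
buf = b"x" * (32 * 1024 * 1024)  # 32 MiB of dummy data

print("pure Python, sequential:", timed_sequential(count_down, n))
print("pure Python, 2 threads: ", timed_threaded(count_down, n))
print("hashlib, sequential:    ", timed_sequential(digest, buf))
print("hashlib, 2 threads:     ", timed_threaded(digest, buf))
```

On a standard GIL-enabled CPython build with at least two free cores, the two pure-Python timings are typically close to each other, while the threaded `hashlib` run tends to finish faster than the sequential one.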

---

## 3. Dask’s Single-Machine Schedulers

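Dask lets you choose between threads and processes for running tasks on one machine by passing `scheduler=` to `compute` (the `processes=False` / `processes=True` flags belong to `LocalCluster`, covered in Section 6):

- **Threaded scheduler** (`scheduler='threads'`): runs tasks in a thread pool inside a single process. Ideal for I/O-bound or GIL-releasing operations.
- **Process scheduler** (`scheduler='processes'`): runs tasks in a pool of worker processes. Best for CPU-bound pure-Python functions, which need separate interpreters to run in parallel.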
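
As a sketch of how the choice is made in practice, the scheduler can be named per call or set through `dask.config`. The example assumes Dask is installed; `double` is a placeholder task and `num_workers=2` is an arbitrary pool size. It sticks to the threaded and synchronous schedulers so that it runs unchanged as a script. Passing `'processes'` works the same way, though a standalone script may then need an `if __name__ == "__main__":` guard.

```python
import dask
from dask import delayed


def double(x):
    # Placeholder task used only for illustration.
    return 2 * x


tasks = [delayed(double)(i) for i in range(4)]

# 1. Name the scheduler for a single call.
per_call = dask.compute(*tasks, scheduler="threads", num_workers=2)

# 2. Set it for a block of code with the config context manager.
with dask.config.set(scheduler="threads"):
    in_block = dask.compute(*tasks)

# 3. The synchronous scheduler runs everything in the calling thread,
#    which is handy for debugging.
debug = dask.compute(*tasks, scheduler="synchronous")

print(per_call, in_block, debug)
```

All three calls return the same tuple of results; only where and how the tasks run differs.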

---

## 4. Example: CPU-Bound Task

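Compute the 30th Fibonacci number recursively four times under each scheduler and compare the runtimes.

```python
import time

import dask
from dask import delayed


# A pure-Python, CPU-heavy function: it holds the GIL for as long as it runs
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


# Create 4 independent delayed tasks
tasks = [delayed(fib)(30) for _ in range(4)]

# Threaded scheduler: the GIL serializes the four calls
t0 = time.perf_counter()
dask.compute(*tasks, scheduler='threads')
print("Threaded time:", time.perf_counter() - t0)

# Process scheduler: the calls can run in parallel in separate worker processes
# (in a standalone script, put this under `if __name__ == "__main__":`)
t1 = time.perf_counter()
dask.compute(*tasks, scheduler='processes')
print("Process time:", time.perf_counter() - t1)
```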

---

## 5. Example: I/O-Bound Task

Simulate waiting by sleeping, which releases the GIL.

```python
import time

import dask
from dask import delayed


def sleepy(sec):
    # time.sleep releases the GIL, so other threads can run while this one waits
    time.sleep(sec)
    return sec


# Create 4 delayed I/O tasks
tasks_io = [delayed(sleepy)(1) for _ in range(4)]

# Threaded scheduler: the sleeps overlap (about 1 s if at least 4 threads are available)
t0 = time.perf_counter()
dask.compute(*tasks_io, scheduler='threads')
print("Threaded I/O time:", time.perf_counter() - t0)

# Process scheduler: the sleeps also overlap, but process start-up and
# serialization add overhead
t1 = time.perf_counter()
dask.compute(*tasks_io, scheduler='processes')
print("Process I/O time:", time.perf_counter() - t1)
```

---

## 6. Using `LocalCluster` for Control

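Compare threads and processes with Dask’s local clusters, which let you set the number of worker processes and the threads per worker explicitly:

```python
import time

from dask.distributed import Client, LocalCluster

# Reuses fib() from the CPU-bound example above.

# Threaded cluster: one worker process with four threads
cluster_t = LocalCluster(processes=False, n_workers=1, threads_per_worker=4)
client_t = Client(cluster_t)

# CPU-bound test on the threaded cluster.
# pure=False gives each call its own key; otherwise the four identical
# fib(30) calls would be deduplicated into a single task.
# The timer starts before submission so that all of the work is measured.
t0 = time.perf_counter()
futs = client_t.map(fib, [30] * 4, pure=False)
client_t.gather(futs)
print("Distributed Threaded time:", time.perf_counter() - t0)
client_t.close()
cluster_t.close()

# Process cluster: four worker processes, one thread each
cluster_p = LocalCluster(processes=True, n_workers=4, threads_per_worker=1)
client_p = Client(cluster_p)

# CPU-bound test on the process cluster
t1 = time.perf_counter()
futs = client_p.map(fib, [30] * 4, pure=False)
client_p.gather(futs)
print("Distributed Process time:", time.perf_counter() - t1)
client_p.close()
cluster_p.close()
```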

---

## 7. Visualization

Visualize runtime differences between threads and processes with a bar chart:

```python
import time

import dask
import matplotlib.pyplot as plt
from dask import delayed


# CPU-bound function (same as in the earlier example)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


# Prepare tasks
tasks = [delayed(fib)(30) for _ in range(4)]

# Measure runtime under the threaded scheduler
t0 = time.perf_counter()
dask.compute(*tasks, scheduler='threads')
time_threaded = time.perf_counter() - t0

# Measure runtime under the process scheduler
t1 = time.perf_counter()
dask.compute(*tasks, scheduler='processes')
time_process = time.perf_counter() - t1

# Plot results
plt.figure(figsize=(6, 4))
plt.bar(['Threaded', 'Processes'], [time_threaded, time_process])
plt.ylabel('Runtime (seconds)')
plt.title('fib(30) x4: Threaded vs Process Runtime')
plt.tight_layout()
plt.show()
```

---

## 8. Key Takeaways

- **Threads vs Processes**: Threads share memory and a single GIL; each process has its own memory and its own GIL.
- **GIL**: Allows only one thread at a time to execute Python bytecode, so threads give no speedup for pure-Python CPU-bound work.
- **Use threads** for I/O-bound tasks or GIL-releasing libraries (e.g., NumPy).
- **Use processes** for pure-Python CPU-bound code to leverage multiple cores.
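
The last two rules of thumb can be condensed into a small helper. This is a hypothetical function written for illustration, not part of Dask; the strings it returns happen to match the scheduler names accepted by `dask.compute(..., scheduler=...)`.

```python
def suggest_scheduler(cpu_bound, releases_gil=False):
    """Return 'threads' or 'processes' following the takeaways above."""
    # Only pure-Python CPU-bound work is held back by the GIL.
    if cpu_bound and not releases_gil:
        return "processes"
    # I/O-bound work and GIL-releasing libraries run well in threads.
    return "threads"


print(suggest_scheduler(cpu_bound=True))                     # e.g. recursive fib
print(suggest_scheduler(cpu_bound=True, releases_gil=True))  # e.g. NumPy-heavy code
print(suggest_scheduler(cpu_bound=False))                    # e.g. sleeping or I/O
```

Real workloads are often mixed, so treat the result as a starting point and measure both options when in doubt.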
