## Threads vs Processes in Python HPC with Dask and the GIL

---

## 1. Threads vs Processes

### Schematic Diagram

```
+-----------------------------------------------------+
|                    Single Process                   |
| +--------+   +--------+   +--------+   +--------+   |
| | Thread |   | Thread |   | Thread |   | Thread |   |
| |   1    |   |   2    |   |   3    |   |   4    |   |
| +--------+   +--------+   +--------+   +--------+   |
+-----------------------------------------------------+

vs.

+---------+   +---------+   +---------+   +---------+
| Process |   | Process |   | Process |   | Process |
|    A    |   |    B    |   |    C    |   |    D    |
| +-----+ |   | +-----+ |   | +-----+ |   | +-----+ |
| |Thr  | |   | |Thr  | |   | |Thr  | |   | |Thr  | |
| | 1   | |   | | 1   | |   | | 1   | |   | | 1   | |
| +-----+ |   | +-----+ |   | +-----+ |   | +-----+ |
+---------+   +---------+   +---------+   +---------+
```

### Threads

- Run inside a single process and share its memory, so they can exchange data quickly without copying it.
- Limited by the GIL for pure-Python work in CPython: only one thread runs Python bytecode at a time.

### Processes

- Run as separate operating-system processes with independent memory spaces, each with its own Python interpreter and GIL.
- Can execute CPU-bound Python code in parallel across cores, at the cost of serializing data between processes.
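
The difference between shared and copied memory can be sketched with the standard library alone. This is a minimal illustration, not a benchmark: the names `shared` and `append_from_thread` are made up for the example, and the `pickle` round trip stands in for what happens when data crosses a process boundary (no second process is actually started).

```python
import pickle
import threading

shared = []  # one list in one process; every thread sees the same object

def append_from_thread(value):
    # list.append on a shared list is safe to call from several threads
    shared.append(value)

threads = [threading.Thread(target=append_from_thread, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All four threads mutated the same list, no copying involved.
thread_result = sorted(shared)

# Sending data to another process means serializing it: the receiver gets a copy.
payload = pickle.dumps(shared)
received_copy = pickle.loads(payload)  # stand-in for the receiving process
received_copy.append(99)

# Changes to the copy do not reach the original.
copy_is_independent = 99 not in shared
print(thread_result, copy_is_independent)
```

The serialization step is the price processes pay for isolation: the larger the arguments and results, the more time goes into copying rather than computing.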

---

## 2. What Is the Python GIL?

The Global Interpreter Lock (GIL) is a mutex inside CPython that ensures only one thread executes Python bytecode at any moment. It exists because CPython's memory management, which relies on reference counting, is not safe under concurrent modification from multiple threads without this guard.

- **Effect on CPU-bound code**: Threads cannot leverage multiple cores for pure-Python computations.
- **Libraries written in C** (e.g., NumPy) often release the GIL for heavy computations, allowing multi-threaded speedups.
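
A rough way to see the effect on CPU-bound code is to time a pure-Python loop run sequentially and then across threads. The sketch below assumes a standard (GIL-enabled) CPython build; `count_down`, `N`, and `N_THREADS` are illustrative names and values. On such a build the two timings should come out similar, because the threads take turns holding the GIL instead of running side by side.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def count_down(n):
    # Pure-Python loop: the running thread holds the GIL while it executes
    while n > 0:
        n -= 1
    return n

N = 500_000      # iterations per task (illustrative)
N_THREADS = 4    # number of tasks and threads (illustrative)

# Run the tasks one after another in the main thread
start = time.perf_counter()
sequential = [count_down(N) for _ in range(N_THREADS)]
sequential_time = time.perf_counter() - start

# Run the same tasks in a thread pool
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
    threaded = list(pool.map(count_down, [N] * N_THREADS))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```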

---

## 3. Dask’s Single-Machine Schedulers

Dask lets you choose between threads and processes for running tasks on one machine:

- **Threaded scheduler** (`scheduler='threads'`): uses multiple threads in a single process. Ideal for I/O-bound or GIL-releasing operations.
- **Process scheduler** (`scheduler='processes'`): uses multiple processes. Best for CPU-bound pure-Python functions to achieve true parallelism.
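
As a small sketch of how the choice is made in code, the scheduler can be passed per call, set for a block with `dask.config.set`, or given to a single object's `.compute()`. The `square` function and the task list are placeholders for the example; the `'synchronous'` scheduler, which runs everything in the calling thread, is used here only because it is convenient for debugging.

```python
import dask
from dask import delayed

def square(x):
    return x * x

tasks = [delayed(square)(i) for i in range(4)]

# 1. Choose the scheduler for one call
per_call = dask.compute(*tasks, scheduler="threads")

# 2. Choose it for everything inside a block
with dask.config.set(scheduler="synchronous"):
    scoped = dask.compute(*tasks)

# 3. Choose it when computing a single delayed object
total = delayed(sum)(tasks).compute(scheduler="threads")

print(per_call, scoped, total)
```

Swapping `"threads"` for `"processes"` in any of the three forms selects the process scheduler instead.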

---

## 4. Example: CPU-Bound Task

Compute the 30th Fibonacci number recursively, once per CPU core, and compare the runtimes of the two schedulers.

In [None]:
import os
# os.cpu_count() reports logical CPUs and can return None, so fall back to 1
n_tasks = os.cpu_count() or 1
print(f'Number of logical CPUs: {n_tasks}')

In [None]:
# A pure-Python CPU-heavy function
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

import time
import dask
from dask import delayed

# Create a number of delayed tasks equal to the number of CPU cores
tasks = [delayed(fib)(30) for _ in range(n_tasks)]

# Threaded scheduler
t0 = time.time()
dask.compute(*tasks, scheduler='threads', num_workers=n_tasks)
print("Threaded time:", time.time() - t0)

# Process scheduler
t1 = time.time()
dask.compute(*tasks, scheduler='processes', num_workers=n_tasks)
print("Process time:", time.time() - t1)

### Which Should Be Faster and Why?

For this **CPU-bound** task, the **process scheduler should be significantly faster**.

*   **Process Scheduler (`scheduler='processes'`):** Each task runs in a separate process with its own Python interpreter and GIL. This allows Dask to execute the `fib` computations truly in parallel on multiple CPU cores. With one task per core, the total time should be roughly the time of a single task plus the overhead of starting the worker processes and passing results back.
*   **Threaded Scheduler (`scheduler='threads'`):** All tasks run in threads within a single process. Because `fib` is pure Python code, the Global Interpreter Lock (GIL) prevents the threads from running in parallel. They take turns holding the GIL, so the work is effectively serialized and the total time is approximately the sum of the times for all tasks.

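**Note:** If the threaded scheduler turns out faster, the likely causes are a low core count on your machine or high process creation overhead in your environment. A single `fib(30)` call is short, so startup cost can outweigh the gain from parallelism; a larger argument makes the difference clearer.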
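
The reasoning above can be turned into a back-of-the-envelope estimate. This is a deliberately crude model, not a measurement: `estimated_wall_time` is a hypothetical helper, and the per-task time and overhead values below are made-up numbers chosen only to show the shape of the trade-off.

```python
import math

def estimated_wall_time(n_tasks, task_seconds, n_cores, parallel, overhead=0.0):
    """Crude wall-time model for equal-length CPU-bound tasks.

    parallel=False models GIL-bound threads: the work is effectively serialized.
    parallel=True models processes: tasks run in waves of n_cores at a time,
    plus a fixed overhead for starting workers and moving data.
    """
    if parallel:
        waves = math.ceil(n_tasks / n_cores)
        return waves * task_seconds + overhead
    return n_tasks * task_seconds

# Illustrative numbers: 8 tasks of 0.3 s each on 8 cores, 0.5 s process overhead
threads_estimate = estimated_wall_time(8, 0.3, 8, parallel=False)
processes_estimate = estimated_wall_time(8, 0.3, 8, parallel=True, overhead=0.5)

# Same workload on 2 cores: four waves of tasks, so the advantage disappears
few_cores_estimate = estimated_wall_time(8, 0.3, 2, parallel=True, overhead=0.5)

print(threads_estimate, processes_estimate, few_cores_estimate)
```

Under these assumed numbers the process estimate wins comfortably on 8 cores, while on 2 cores it comes out no better than the serialized thread estimate, which matches the caveat in the note.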

---

## 5. Example: I/O-Bound Task

Simulate waiting by sleeping, which releases the GIL.

In [None]:
import time
import dask
from dask import delayed

def sleepy(sec):
    time.sleep(sec)
    return sec

# Create a number of delayed I/O tasks equal to the number of CPU cores
tasks_io = [delayed(sleepy)(1) for _ in range(n_tasks)]

# Threaded scheduler (runs ~1s)
t0 = time.time()
dask.compute(*tasks_io, scheduler='threads', num_workers=n_tasks)
print("Threaded I/O time:", time.time() - t0)

# Process scheduler (extra overhead)
t1 = time.time()
dask.compute(*tasks_io, scheduler='processes', num_workers=n_tasks)
print("Process I/O time:", time.time() - t1)

### Which Should Be Faster and Why?

For this **I/O-bound** task, the **threaded scheduler should be faster**, although both schedulers spend about one second waiting and differ mainly in overhead.

*   **Threaded Scheduler (`scheduler='threads'`):** The `time.sleep()` function releases the GIL, which allows other threads to run. All tasks can "sleep" concurrently, so the total time should be about 1 second (the duration of the longest task) plus a small amount of overhead.
*   **Process Scheduler (`scheduler='processes'`):** Processes can also sleep concurrently, but creating and managing them costs more than creating threads. For I/O-bound tasks, where threads are not blocked by the GIL, the lower overhead of threads makes them the better choice.

**Note:** In some environments, especially containerized or virtualized ones, the cost of creating processes can be very high, leading to a much longer execution time for the process scheduler.
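
The same overlap can be seen without Dask, using a plain thread pool from the standard library. This is a minimal sketch: `wait`, `N_TASKS`, and `SLEEP` are illustrative, and the sleep is shortened so the example runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def wait(sec):
    time.sleep(sec)  # releases the GIL while the thread is blocked
    return sec

N_TASKS = 4   # illustrative
SLEEP = 0.2   # seconds per task, illustrative

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_TASKS) as pool:
    results = list(pool.map(wait, [SLEEP] * N_TASKS))
elapsed = time.perf_counter() - start

# Run back to back, the sleeps would take N_TASKS * SLEEP seconds in total
print(f"{N_TASKS} sleeps of {SLEEP}s took {elapsed:.2f}s")
```

Because the threads wait at the same time, the elapsed time should stay close to one sleep interval rather than growing with the number of tasks.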

---

## 6. Key Takeaways

- **Threads vs Processes**: Threads share memory and a single GIL; each process has its own memory and its own GIL.
- **GIL**: Restricts threads from executing Python bytecode in parallel for CPU-bound tasks.
- **Use threads** for I/O-bound tasks or GIL-releasing libraries (e.g., NumPy).
- **Use processes** for pure-Python CPU-bound code to leverage multiple cores.
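
These rules of thumb can be written down as a tiny decision helper. `choose_scheduler` is a hypothetical function, not part of Dask; it only encodes the guidance above and returns a name that could be passed as the `scheduler=` argument.

```python
def choose_scheduler(cpu_bound, releases_gil):
    """Return a Dask scheduler name following the rules of thumb above.

    cpu_bound:    True if the task spends its time computing rather than waiting
    releases_gil: True if the heavy work happens in code that releases the GIL
                  (for example inside NumPy)
    """
    if cpu_bound and not releases_gil:
        # Pure-Python CPU work: only separate processes run in parallel
        return "processes"
    # I/O-bound work or GIL-releasing libraries: threads avoid copying data
    return "threads"

print(choose_scheduler(cpu_bound=True, releases_gil=False))
print(choose_scheduler(cpu_bound=True, releases_gil=True))
print(choose_scheduler(cpu_bound=False, releases_gil=False))
```

A helper like this is only a starting point; measuring both schedulers on the real workload remains the more reliable guide.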
