# 🔀 Multiprocessing

## 📖 Introduction

This chapter is the second part of the multithreading chapter.

Now that you know what the GIL is and how to use threads, let's look at how Python can run code in parallel.

<img src='files/multithreading_vs_multiprocessing.png' alt='Multithreading vs Multiprocessing diagram' width='600' source="miro.medium.com">

## 📦 The Multiprocessing Module

Python's `multiprocessing` module allows you to create multiple processes, each with its own Python interpreter and memory space. This is particularly useful for CPU-bound tasks that can benefit from parallel execution.
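To make "its own memory space" concrete, here is a minimal sketch (the names `counter` and `increment` are made up for this illustration). The child process increments its own copy of a module-level variable, and the parent's copy should stay unchanged. It is best saved and run as a script, since functions defined in a notebook cell may not be importable by child processes on every platform.

```python
import multiprocessing
import os

counter = 0  # module-level state; each process works on its own copy


def increment():
    """Increment the counter in whichever process calls this function."""
    global counter
    counter += 1
    print(f"child  {os.getpid()}: counter = {counter}")


if __name__ == "__main__":
    child = multiprocessing.Process(target=increment)
    child.start()
    child.join()  # wait for the child process to finish
    # The child changed its own copy only, so the parent still sees the initial value.
    print(f"parent {os.getpid()}: counter = {counter}")
```

With threads, the same experiment would show the change in the main thread too, because threads share the memory of their process.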

### 🎯 `concurrent.futures`

Remember the `concurrent.futures` module, which provides a high-level interface for asynchronously executing callables?

In the last chapter, we used `concurrent.futures.ThreadPoolExecutor` to run tasks concurrently using threads.

Now, we can use `concurrent.futures.ProcessPoolExecutor` to run tasks concurrently using processes.

The syntax is very similar to what we used with threads, so you can easily switch between the two.

Let's take the same code we used in the last chapter and modify it to use `ProcessPoolExecutor` instead of `ThreadPoolExecutor`.

This time, we will also use the function `os.getpid()` to print the process ID of each task, so we can see which process is running which task.

In [None]:
# With ThreadPoolExecutor and os.getpid()

import threading
import concurrent.futures
import time
import os

print(f"Main process {os.getpid()}")

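def task(n):
    # Indent the output by task number so interleaved lines are easier to tell apart.
    # (Building the indent first avoids nested double quotes, which need Python 3.12+.)
    indent = " " * n
    print(f"{indent}Task {n} starting on thread {threading.current_thread().name}, process {os.getpid()}")
    time.sleep(n)
    print(f"{indent}Task {n} completed on thread {threading.current_thread().name}, process {os.getpid()}")
    return n * n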

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(task, [1, 2, 3, 4, 5])

print("Results:", list(results))
print("Main thread finished.")

In [None]:
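# Now let's try with ProcessPoolExecutor.
# Note: with the "spawn" start method (the default on Windows and macOS), worker
# processes must be able to import `task`. A function defined in a notebook cell
# may then fail to run; if so, move `task` into a .py module and import it.

import threading
import concurrent.futures
import time
import os

print(f"Main process {os.getpid()}")

def task(n):
    # Indent the output by task number so interleaved lines are easier to tell apart.
    indent = " " * n
    print(f"{indent}Task {n} starting on thread {threading.current_thread().name}, process {os.getpid()}")
    time.sleep(n)
    print(f"{indent}Task {n} completed on thread {threading.current_thread().name}, process {os.getpid()}")
    return n * n

with concurrent.futures.ProcessPoolExecutor(max_workers=3) as executor:
    results = executor.map(task, [1, 2, 3, 4, 5])

print("Results:", list(results))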


- When using `ThreadPoolExecutor`, the process ID is the same for all tasks, since they all run in the same process.

- When using `ProcessPoolExecutor`, the tasks run in separate worker processes, so their process IDs differ from the main process ID. With `max_workers=3`, the five tasks share at most three worker processes, so some process IDs appear more than once. If the print statements are interleaved or missing, it's because Jupyter Notebook is not designed to display output from multiple processes clearly.
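The difference can also be checked without reading interleaved output. The sketch below (the helper names `report_pid` and `distinct_pids` are hypothetical) collects the process IDs returned by each task and counts the distinct ones. It is meant to be run as a script rather than in a notebook cell.

```python
import concurrent.futures
import os
import time


def report_pid(_):
    """Return the ID of the process that executes this task."""
    time.sleep(0.1)  # keep each worker busy briefly so tasks spread across workers
    return os.getpid()


def distinct_pids(executor_cls, n_tasks=5, max_workers=3):
    """Run n_tasks on the given executor class and return the set of PIDs seen."""
    with executor_cls(max_workers=max_workers) as executor:
        return set(executor.map(report_pid, range(n_tasks)))


if __name__ == "__main__":
    thread_pids = distinct_pids(concurrent.futures.ThreadPoolExecutor)
    process_pids = distinct_pids(concurrent.futures.ProcessPoolExecutor)
    print(f"Main process: {os.getpid()}")
    print(f"Threads:   {len(thread_pids)} distinct PID(s): {sorted(thread_pids)}")
    print(f"Processes: {len(process_pids)} distinct PID(s): {sorted(process_pids)}")
```

With threads, the only PID should be that of the main process. With processes, there should be at most `max_workers` distinct PIDs, none of them equal to the main one.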


## ⚡ Speed Comparison: CPU-Bound Task

Let's compare the performance of three approaches for a CPU-bound task: no executor at all, threads, and processes.



In [None]:
# Without any threading or multiprocessing

import os
import time

def cpu_bound_task(n):
    print(f"Starting task {n} in process {os.getpid()}")
    count = 0
    for i in range(10**7):
        count += i % n
    print(f"Completed task {n} in process {os.getpid()}")
    return count

start = time.time()

for n in range(1, 6):
    cpu_bound_task(n)
end = time.time()
print(f"Total time without any executor: {end - start} seconds")


In [None]:
# With Multithreading (ThreadPoolExecutor)

import concurrent.futures
import os
import time

def cpu_bound_task(n):
    print(f"Starting task {n} in process {os.getpid()}")
    count = 0
    for i in range(10**7):
        count += i % n
    print(f"Completed task {n} in process {os.getpid()}")
    return count

start = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    executor.map(cpu_bound_task, range(1, 6))
    
end = time.time()
print(f"Total time ThreadPoolExecutor: {end - start} seconds")

In [None]:
# With Multiprocessing (ProcessPoolExecutor)

import concurrent.futures
import os
import time

def cpu_bound_task(n):
    print(f"Starting task {n} in process {os.getpid()}")
    count = 0
    for i in range(10**7):
        count += i % n
    print(f"Completed task {n} in process {os.getpid()}")
    return count

start = time.time()

with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
    executor.map(cpu_bound_task, range(1, 6))
    
end = time.time()
print(f"Total time with ProcessPoolExecutor: {end - start} seconds")


### 📊 Conclusion

- For a CPU-bound task, running everything sequentially in a single thread is about as fast as, and sometimes faster than, using threads, because the GIL lets only one thread execute Python bytecode at a time.

- With `ProcessPoolExecutor`, we can see a significant speedup, because each process can run on a separate CPU core, bypassing the GIL limitation.

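### 💪 Exercise

CPython now comes in two builds: the default build with the GIL, and a free-threaded build without it (introduced by PEP 703 as an experimental option in Python 3.13, not to be confused with the earlier experimental Gilectomy project).

Try to reproduce this experiment using the free-threaded build of Python and see how the results change.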

https://medium.com/sdg-group/exploring-pythons-gil-single-multithreading-vs-multiprocessing-and-the-impact-of-gil-removal-ee8b6dd610f4
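Before timing anything, it helps to confirm whether the interpreter you are running actually has the GIL disabled. The following is a small sketch; the helper name `gil_status` is made up here. It assumes Python 3.13 or later for a meaningful answer, and falls back to "GIL enabled" on older versions, where `sys._is_gil_enabled()` does not exist.

```python
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    # Py_GIL_DISABLED is set to 1 on free-threaded builds; it is 0 or missing otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in Python 3.13; older versions always have the GIL.
    gil_enabled = bool(getattr(sys, "_is_gil_enabled", lambda: True)())
    return free_threaded_build, gil_enabled


if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"Python {sys.version.split()[0]}")
    print(f"Free-threaded build: {build}")
    print(f"GIL currently enabled: {enabled}")
```

Both values are worth checking, because a free-threaded build can still run with the GIL re-enabled at runtime.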

## 🚀 Other ways to run code in parallel

Using the `multiprocessing` library is not the only way to run code in parallel in Python. Some libraries provide their own way to do so.

### 🔢 NumPy / SciPy

NumPy is fast because heavy work happens in compiled C/Fortran, not Python. Those low-level libraries (BLAS, LAPACK, MKL, OpenBLAS) often use multiple CPU cores automatically.

In [None]:
import numpy as np
# np.show_config() # Check if NumPy is linked against a multi-threaded BLAS implementation

A = np.random.rand(5000, 5000)
B = np.random.rand(5000, 5000)

# Matrix multiplication
C = A @ B  # Check your CPU usage during this operation!

### ⚡ Numba

Numba is a just-in-time compiler for Python that translates a subset of Python and NumPy code into fast machine code. It can automatically parallelize certain operations using multiple CPU cores.
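As a rough sketch of what this looks like, the function below asks Numba to split a loop across cores with `prange`. The function name `sum_of_squares` is made up for this example, and the `except ImportError` branch is only a convenience so that the snippet still runs, sequentially and without compilation, when Numba is not installed.

```python
try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for this sketch): plain Python, no compilation, no parallelism.
    def njit(*args, **kwargs):
        # Support both the bare @njit form and the @njit(...) form.
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func

    prange = range


@njit(parallel=True)
def sum_of_squares(n):
    """Sum i*i for i in [0, n); with Numba, iterations are distributed across cores."""
    total = 0
    for i in prange(n):  # prange marks the loop as safe to run in parallel
        total += i * i   # Numba treats this accumulation as a parallel reduction
    return total


if __name__ == "__main__":
    print(sum_of_squares(10))  # 0 + 1 + 4 + ... + 81 = 285
```

The first call includes compilation time, so any timing comparison should measure a second call.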

### 🤖 PyTorch / TensorFlow / JAX

These frameworks are parallel by design: their C++ backends release the GIL during heavy computation and can run on both CPUs and GPUs.

### 📊 Dask

Dask is a flexible parallel computing library for analytics. It allows you to scale your computations from a single machine to a cluster of machines. Dask can parallelize NumPy, Pandas, and other operations easily.

### 🐻‍❄️ Polars

Polars is a fast DataFrame library implemented in Rust. It is designed for high performance and can utilize multiple CPU cores for data processing tasks. The only downside is that the API is not exactly the same as Pandas and you may need to adapt your code.

### ⚡ PySpark

PySpark is the Python API for Apache Spark, a distributed computing framework. It allows you to process large datasets in parallel across a cluster of machines, which makes it particularly useful for big data applications. When used on a single machine, it can still use multiple CPU cores, but for smaller datasets it may have more overhead than libraries like Dask or Polars. So, unless you plan to scale to a cluster later, prefer Dask or Polars for single-machine parallelism.

### 🦀 Rust code in Python

If you need extreme performance and parallelism, you can write performance-critical parts of your code in Rust and call them from Python using libraries like `PyO3`. Rust has excellent support for concurrency and parallelism, allowing you to leverage multiple CPU cores effectively.