# Basic Python multiprocessing

In [1]:
from time import sleep, time

In [2]:
def add(xy):
    sleep(0.1) # imagine this is some complicated, slow calculation
    return xy[0] + xy[1]

t0 = time()
print("result:", add((2,3)))
t1 = time()
print(t1-t0, "seconds")

result: 5
0.10054564476013184 seconds


In [3]:
xy_pairs = [(10,1),(10,2),(10,3),(10,4),(10,5),(10,6),(10,7),(10,8),(10,9),(10,10)]

t0 = time()
for xy in xy_pairs:
    print("result:", add(xy))
t1 = time()
print(t1-t0, "seconds")

result: 11
result: 12
result: 13
result: 14
result: 15
result: 16
result: 17
result: 18
result: 19
result: 20
1.0040993690490723 seconds


In [5]:
from multiprocessing import Pool

In [6]:
with Pool(10) as p:
    t0 = time()
    for result in p.map(add, xy_pairs):
        print("result:", result)
    t1 = time()
    print(t1-t0, "seconds")

result: 11
result: 12
result: 13
result: 14
result: 15
result: 16
result: 17
result: 18
result: 19
result: 20
0.10954976081848145 seconds


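The pool gives close to a 10x speedup here: about 1.0 s for the sequential loop versus about 0.11 s with 10 processes. That's because the function spends almost all of its time in `sleep`, which uses essentially no CPU, so many sleeping workers can overlap even with very limited CPU resources.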
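As a rough back-of-the-envelope check on these timings, the sketch below estimates the best-case wall time when equal-length tasks are spread over a pool. The helper `ideal_wall_time` is a hypothetical name introduced here, and the model assumes every worker can make progress at the same time and ignores pool overhead.

```python
from math import ceil

def ideal_wall_time(n_tasks, n_workers, task_seconds):
    """Best-case wall time when identical tasks are spread evenly over workers.

    Hypothetical helper for illustration; ignores startup and communication cost.
    """
    rounds = ceil(n_tasks / n_workers)  # each worker handles at most this many tasks
    return rounds * task_seconds

print(ideal_wall_time(10, 1, 0.1), "seconds (sequential estimate)")
print(ideal_wall_time(10, 10, 0.1), "seconds (10 workers estimate)")
```

The two estimates, 1.0 s and 0.1 s, are close to the measured 1.004 s and 0.110 s, which suggests the pool overhead is small for this sleep-bound workload.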

# Python processing with some compute-intensive functions

In [7]:
def add_compute(xy):
    for i in range(3000000): # loop 3 million times
        pass
    return xy[0] + xy[1]

with Pool(1) as p:
    t0 = time()
    for result in p.map(add_compute, xy_pairs):
        print("result:", result)
    t1 = time()
    print(t1 - t0, "seconds (1 process)")

result: 11
result: 12
result: 13
result: 14
result: 15
result: 16
result: 17
result: 18
result: 19
result: 20
0.8569386005401611 seconds (1 process)


In [8]:
with Pool(10) as p:
    t0 = time()
    for result in p.map(add_compute, xy_pairs):
        print("result:", result)
    t1 = time()
    print(t1 - t0, "seconds (10 processes)")

result: 11
result: 12
result: 13
result: 14
result: 15
result: 16
result: 17
result: 18
result: 19
result: 20
0.9076948165893555 seconds (10 processes)


For compute-intensive tasks, the speedup I can achieve is bounded by the number of CPU cores on the computer, no matter how many processes I actually launch (10 in the example above).

In this case, it is hard to tell from the timings how many CPU cores my EC2 VM has: spreading the ten tasks over 10 processes gave no speedup at all (0.91 s versus 0.86 s with one process). A likely reason is the limited CPU capacity of the t3.large EC2 VM that I am using.

I also ran the same test on my local computer, a 10-core MacBook, and saw good speedups there. See the other notebook for detailed results.
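Rather than inferring the core count from timings, it can be queried directly. The following is a minimal sketch using the standard `os` module; the variable names `n_logical`, `n_usable` and `pool_size` are introduced here for illustration, and `os.sched_getaffinity` is only available on some platforms (such as Linux), hence the fallback.

```python
import os

# Logical CPUs visible to the operating system (may be None if undetermined).
n_logical = os.cpu_count()
print("logical CPUs:", n_logical)

# On Linux, the set of CPUs this process may run on can be smaller than the
# machine total (for example inside a container).
if hasattr(os, "sched_getaffinity"):
    n_usable = len(os.sched_getaffinity(0))
else:
    n_usable = n_logical or 1
print("CPUs usable by this process:", n_usable)

# For CPU-bound work, a pool larger than this is unlikely to help.
pool_size = max(1, n_usable)
```

With this value, `Pool(pool_size)` could be used in place of `Pool(10)` for the compute-intensive example.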

# Python thread-level parallelism

In [9]:
import threading

In [10]:
def cpu_bound_task():
    for i in range(3000000):
        pass

In [11]:
threads = []
t0 = time()
for _ in range(1):
    thread = threading.Thread(target=cpu_bound_task)
    thread.start()
    threads.append(thread) # insert created thread object into the list

for thread in threads:
    thread.join() # parent thread waits for child thread(s) to complete and join

t1 = time()
print(t1 - t0, "seconds (1 single python thread)")

0.1160440444946289 seconds (1 single python thread)


In [14]:
threads = []
t0 = time()
for _ in range(10):
    thread = threading.Thread(target=cpu_bound_task)
    thread.start()
    threads.append(thread) # insert created thread object into the list

for thread in threads:
    thread.join() # parent thread waits for child thread(s) to complete and join

t1 = time()
print(t1 - t0, "seconds (10 python threads)")

0.9581670761108398 seconds (10 python threads)


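Python threads cannot run CPU-bound Python code in parallel because of the GIL (global interpreter lock) in the standard CPython interpreter, which lets only one thread execute Python bytecode at a time. That explains why the run with 10 threads took about 8 times as long as the run with one thread (0.96 s versus 0.12 s): the ten loops effectively ran one after another.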
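The GIL only limits CPU-bound Python code. As a contrast, here is a minimal sketch that assumes `time.sleep` releases the GIL while waiting, as blocking calls generally do. The function `sleep_bound_task` is a name introduced here; the structure otherwise mirrors the threading cells above.

```python
import threading
from time import sleep, time

def sleep_bound_task():
    sleep(0.1)  # blocking call; the GIL is released while the thread sleeps

threads = []
t0 = time()
for _ in range(10):
    thread = threading.Thread(target=sleep_bound_task)
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()  # wait for all sleeping threads to finish

elapsed = time() - t0
print(elapsed, "seconds (10 threads, sleep-bound)")
```

Under that assumption, the ten threads should finish in roughly the time of a single sleep rather than ten of them, which is the same effect the process pool showed for `add`.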

# Global variable in Python threads vs. Python processes

In [15]:
total = 0

def increment(amount):
    global total
    total += amount
    print(f"sub total so far: {total}")

In [16]:
threads = []
for _ in range(8):
    thread = threading.Thread(target=increment, args=(5,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join() 

print("Final result:", total)

sub total so far: 5
sub total so far: 10
sub total so far: 15
sub total so far: 20
sub total so far: 25
sub total so far: 30
sub total so far: 35
sub total so far: 40
Final result: 40




Python threads share one virtual memory address space, so all threads created within the same process see the same copy of the global variable 'total'. That is why the eight increments of 5 accumulate to 40.
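Because the variable is shared, `total += amount` is a read-modify-write that two threads could in principle interleave. The run above produced 40, but as a hedged sketch, a `threading.Lock` makes that outcome independent of thread scheduling. The name `total_lock` is introduced here for illustration.

```python
import threading

total = 0
total_lock = threading.Lock()  # hypothetical name; guards every update of total

def increment(amount):
    global total
    with total_lock:  # only one thread at a time performs the read-modify-write
        total += amount

threads = [threading.Thread(target=increment, args=(5,)) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print("Final result:", total)
```

The lock serializes only the update itself, so the threads still start and finish independently.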


In [17]:
import os

In [18]:
total = 0

def increment(amount):
    pid = os.getpid() # get the process identifier
    global total
    total += amount
    print(f"{pid}: sub total so far: {total}\n")

with Pool(2) as p:
    p.map(increment, [5,5,5,5,5,5,5,5])

4364: sub total so far: 5
4365: sub total so far: 5


4364: sub total so far: 10
4365: sub total so far: 10


4364: sub total so far: 15
4365: sub total so far: 15


4364: sub total so far: 20

4365: sub total so far: 20



In [19]:
total 

0

With multi-process parallelism the behavior is different. Each child process created from the parent gets its own copy of the global variable 'total' in its own virtual memory address space, and works only on that copy. In the output above, the two workers (PIDs 4364 and 4365) each count independently up to 20, and `total` in the parent process is still 0 afterwards.
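If the processes do need a common counter, one option is to place it in shared memory explicitly. The sketch below is one possible approach rather than the notebook's method: it assumes `multiprocessing.Value` with its built-in lock, and it uses `Process` objects instead of `Pool.map` because such shared objects are handed to children when they are created. The names `increment_shared` and `run_workers` are hypothetical.

```python
import multiprocessing as mp

def increment_shared(counter, amount):
    # get_lock() guards the read-modify-write so concurrent updates are not lost
    with counter.get_lock():
        counter.value += amount

def run_workers(n_workers=8, amount=5):
    counter = mp.Value("i", 0)  # a C int in shared memory, initial value 0
    workers = [
        mp.Process(target=increment_shared, args=(counter, amount))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # parent waits for every child to finish
    return counter.value

if __name__ == "__main__":
    print("Final result:", run_workers())
```

Unlike the plain global, the value read by the parent here reflects the updates made in every child process.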