# High-Performance Python

## Objectives

- Differentiate between processes & threads
- Provide examples of when to use multi-processing vs. multi-threading
- Identify common pitfalls

## Terminology

**Concurrency** is when two or more tasks can start, run, and complete in overlapping time. 
    - Does _not_ necessarily mean they’ll ever be running at the same instant. 
    - E.g., multitasking on a single-core machine.

**Parallelism** is when two or more tasks are executed simultaneously.

A **thread** is a sequence of instructions within a process. 
    - It can be thought of as a lightweight process. 
    - Threads share the same memory space.

A **process** is an instance of a program running in a computer which can contain one or more threads. 
    - Each process has its own independent memory space.
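
To make the memory-sharing point concrete, here is a minimal sketch using only threads. The names `shared` and `append_id` are made up for this illustration. Every thread appends to the same list, and the main thread sees all of the changes, because the list lives in the one memory space that all threads of the process share.

```python
import threading

shared = []  # one list, visible to every thread in this process

def append_id(i):
    # list.append is thread-safe in CPython, so no lock is needed for this demo
    shared.append(i)

threads = [threading.Thread(target=append_id, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared))  # [0, 1, 2, 3]
```

A separate process would work on its own copy of `shared`, so appends made there would not show up in the parent.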

## Multi-Processing vs. Multi-Threading

**Multi-threading** (one way of achieving concurrency) splits the work between different threads running inside the same process. 
    - When one thread is blocked, the processor works on the tasks of the next one.
    - Multi-threading works better if you need to exchange data between threads, because threads share memory. 

**Multi-processing** splits work across processes, which can run on different processors or even different machines.
    - Multi-processing works better if the different processes can work heads down without communicating very much.

## Good To Know Before We Get Too Far
- The CPython implementation has a Global Interpreter Lock (GIL), which allows only one thread to execute Python code in the interpreter at a time. 
    - This means that threads cannot be used for parallel execution of Python code. 
    - While parallel CPU computation is not possible, parallel IO operations are possible using threads. 
    - This is because a thread releases the GIL while it waits on a blocking IO operation.

- So, what are threads used for in Python?
    - In GUI applications to keep the UI thread responsive
    - IO tasks (network IO or filesystem IO)

- What should threads _not_ be used for in Python?
    - Threads should not be used for CPU-bound tasks. 
    - Using threads for CPU-bound tasks often results in worse performance than using a single thread, because the threads contend for the GIL and add switching overhead.
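
A rough sketch of the IO case, with `time.sleep` standing in for a blocking network or disk call (an assumption made so the example needs no network). `fake_io` is a hypothetical helper. Because a sleeping thread releases the GIL, the four 0.2-second waits overlap and the whole batch should take roughly 0.2 seconds rather than 0.8.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(i):
    time.sleep(0.2)  # stand-in for a blocking IO call; the GIL is released while waiting
    return i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, range(4)))
elapsed = time.perf_counter() - start

print(results)
print("elapsed: {:.2f}s".format(elapsed))
```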



## Pop Quiz

<details>
<summary>Q: I have to process a very large dataset and run it through a CPU-intensive algorithm. Should I use multi-processing or multi-threading to speed it up?</summary>
A: Multi-processing will produce a result faster. This is because it will be able to split the work across different processors or machines.
</details>

<details>
<summary>Q: I have a web scraping application that spends most of its time waiting for web servers to respond. Should I use multi-processing or multi-threading to speed it up?
</summary>
A: Multi-threading will produce a bigger payoff. While one thread is blocked waiting for a response, the other threads can run, so the waits overlap and the CPU does not sit idle.
</details>

## Analogies

Multi-Threading | Multi-Processing
---|---
Laundromat | Everyone has a washer-dryer
Uber or Carpool | Everyone has a car

## Multi-Threading: How-To

Q: How can I write a multi-threaded program that prints `"hello"` in different threads?

- Import `threading` and `sleep`

In [1]:
import threading
from time import sleep

- Define print function.

In [2]:
def print_with_delay(d, x):
    sleep(d)    # wait d seconds
    print(x)

- Create threads for printing.

In [3]:
t1 = threading.Thread(target = lambda: print_with_delay(1, 'hello with delay 1'))
t2 = threading.Thread(target = lambda: print_with_delay(2, 'hello with delay 2'))
t3 = threading.Thread(target = lambda: print_with_delay(3, 'hello with delay 3'))

- Start the threads.

In [4]:
t1.start()
print('when does this happen?')
t2.start()
t3.start()
print('when does this happen?')

when does this happen?
when does this happen?


- Wait for threads to finish.

In [5]:
t1.join()
t2.join()
t3.join()

hello with delay 1
hello with delay 2
hello with delay 3
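
The same steps can be written more compactly. This sketch is one possible variation, not a replacement for the cells above: it passes arguments with `args` instead of a `lambda`, keeps the threads in a list, and appends each message to a list named `messages` (a name introduced here) so the result can be inspected after `join`. The delays are shortened so it runs quickly.

```python
import threading
from time import sleep

messages = []

def record_with_delay(d, x):
    sleep(d)
    messages.append(x)

delays = [0.05, 0.1, 0.15]
threads = [
    threading.Thread(target=record_with_delay,
                     args=(d, 'hello with delay {}'.format(d)))
    for d in delays
]

for t in threads:
    t.start()
for t in threads:
    t.join()    # block until every thread has finished

print(len(messages))  # 3
```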


## Multi-Processing: How-To

Q: Calculate the word count of strings using multi-processing.

- Import `Pool`

In [6]:
from multiprocessing import Pool

- Define how to count words in a string.

In [7]:
def word_count(string):
    return len(string.split())

- Define counting words sequentially.

In [8]:
def sequential_word_count(strings):
    return sum([word_count(string) for string in strings])

- Define counting words in parallel.

In [9]:
def parallel_word_count(strings):
    # The context manager shuts the worker processes down when the block exits
    with Pool(processes=4) as pool:
        results = pool.map(word_count, strings)
    return sum(results)

- Aside about map:

In [10]:
list(map(lambda x: x**3, range(10)))

[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]

In [11]:
print (tuple(map(lambda x: x**2, range(10))))

(0, 1, 4, 9, 16, 25, 36, 49, 64, 81)


### Head-to-Head Comparison

- Create a `word_count` version that saves its result on the thread object.

In [12]:
def thread_word_count(string):
    self = threading.current_thread()
    self.result = word_count(string)

- Define counting words using `Thread`.

In [13]:
def concurrent_word_count(strings):
    threads = []
    
    for string in strings:
        thread = threading.Thread(
            target = thread_word_count,
            args = (string,))
        threads.append(thread)
        
    for thread in threads:
        thread.start()
        
    for thread in threads:
        thread.join()
        
    # Collect the result stored on each finished thread
    results = [thread.result for thread in threads]
    return sum(results)
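
One thread per string is what makes this version expensive. A possible alternative, sketched here with the standard-library `concurrent.futures.ThreadPoolExecutor`, reuses a small fixed number of threads. `pooled_word_count` and the default of 4 workers are choices made for this illustration; `word_count` is repeated so the block runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(string):
    return len(string.split())

def pooled_word_count(strings, max_workers=4):
    # The executor reuses max_workers threads instead of starting one per string
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return sum(executor.map(word_count, strings))

print(pooled_word_count(['hello world', 'this is another line']))  # 6
```

This removes most of the thread start-up cost, but the counting itself is CPU-bound, so it is unlikely to beat the sequential version.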

Q: Time all 3 versions.

- Create a sample input.

In [14]:
strings = [
    'hello world',
    'this is another line',
    'this is yet another line'] * 100000

- Time each one

In [15]:
%time print (sequential_word_count(strings))

1100000
CPU times: user 171 ms, sys: 4.19 ms, total: 175 ms
Wall time: 174 ms


In [16]:
%time print (concurrent_word_count(strings))

1100000
CPU times: user 17.2 s, sys: 10.3 s, total: 27.5 s
Wall time: 23.1 s


In [17]:
%time print (parallel_word_count(strings))

1100000
CPU times: user 37.9 ms, sys: 17.6 ms, total: 55.4 ms
Wall time: 91.2 ms


### Pop Quiz

<details>
<summary>Q: Between sequential, parallel, and concurrent, which one is the fastest? Which one is the slowest? Why?</summary>
1. Parallel is the fastest. Sequential is second. Concurrent is the slowest.
<br/>
2. Concurrent and parallel have higher overhead than sequential. Here the concurrent version starts one thread per string (300,000 threads), and that overhead is not recovered on such small tasks.
<br/>
3. Use concurrent or parallel execution only when the processing takes longer than the setup overhead.
<br/>
4. Concurrent will rarely win CPU-bound problems, but will often win IO-bound problems. Parallel is just the opposite.
</details>


### Cleaning Up Zombie Python Processes

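Here is how to kill the worker processes that `multiprocessing` leaves running in the background. Note that the command matches every process whose command line contains `ipykernel`, so it also kills the running notebook kernels.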

```sh
ps ux | grep ipykernel | grep -v grep | awk '{print $2}' | xargs kill -9
```

## IO-Bound Problem
Here's an ideal use case for threading.

In [18]:
import threading
from queue import Queue
import requests
import bs4
import time

print_lock = threading.Lock()

def get_url(current_url):
    with print_lock:
        print("\nStarting thread {}".format(threading.current_thread().name))
    res = requests.get(current_url)
    res.raise_for_status()

    current_page = bs4.BeautifulSoup(res.text,"html.parser")
    current_title = current_page.select('title')[0].getText()

    with print_lock:
        print("{}\n".format(threading.current_thread().name))
        print("{}\n".format(current_url))
        print("{}\n".format(current_title))
        print("\nFinished fetching : {}".format(current_url))

def process_queue():
    while True:
        current_url = url_queue.get()
        get_url(current_url)
        url_queue.task_done()

In [19]:
# run the multi-threading
url_queue = Queue()
url_list = ["https://www.google.com"]*5

for i in range(5):
    t = threading.Thread(target=process_queue)
    t.daemon = True
    t.start()

start = time.time()

for current_url in url_list:
    url_queue.put(current_url)
url_queue.join()

print(threading.enumerate())
print("Execution time = {0:.5f}".format(time.time() - start))


Starting thread Thread-300010

Starting thread Thread-300014

Starting thread Thread-300012

Starting thread Thread-300013

Starting thread Thread-300011
Thread-300013

https://www.google.com

Google


Finished fetching : https://www.google.com
Thread-300011

https://www.google.com

Google


Finished fetching : https://www.google.com
Thread-300012

https://www.google.com

Google


Finished fetching : https://www.google.com
Thread-300014

https://www.google.com

Google


Finished fetching : https://www.google.com
Thread-300010

https://www.google.com

Google


Finished fetching : https://www.google.com
[<_MainThread(MainThread, started 140736789971904)>, <Thread(Thread-2, started daemon 123145431674880)>, <Heartbeat(Thread-3, started daemon 123145436930048)>, <HistorySavingThread(IPythonHistorySavingThread, started 123145443258368)>, <ParentPollerUnix(Thread-1, started daemon 123145448513536)>, <Thread(Thread-300007, started daemon 123145453768704)>, <Thread(Thread-300008, started daem
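
The queue-and-worker pattern above can also be shut down explicitly rather than relying on daemon threads. The sketch below is a variation under stated assumptions: it replaces the network call with a hypothetical `fake_fetch` that just sleeps, uses placeholder `example.com` URLs, and uses `None` as a sentinel telling each worker to exit.

```python
import threading
import time
from queue import Queue

def fake_fetch(url):
    time.sleep(0.05)            # stand-in for requests.get(url)
    return "fetched " + url

def worker(tasks, results):
    while True:
        url = tasks.get()
        if url is None:         # sentinel: no more work for this worker
            tasks.task_done()
            break
        results.append(fake_fetch(url))
        tasks.task_done()

tasks, results = Queue(), []
workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(3)]
for w in workers:
    w.start()

for n in range(6):
    tasks.put("https://example.com/page{}".format(n))   # placeholder URLs
for _ in workers:
    tasks.put(None)             # one sentinel per worker

tasks.join()                    # wait until every queued item is marked done
for w in workers:
    w.join()                    # workers have exited, so no daemon flag is needed

print(len(results))  # 6
```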

## CPU-Bound Problem
Here's an ideal use case for multi-processing.

> The sieve of Eratosthenes is a simple, ancient algorithm for finding all prime numbers up to any given limit.

It does so by iteratively marking as composite (i.e., not prime) the multiples of each prime, starting with the first prime number, 2. The multiples of a given prime are generated as a sequence of numbers starting from that prime, with constant difference between them that is equal to that prime. This is the sieve's key distinction from using trial division to sequentially test each candidate number for divisibility by each prime.


To find all the prime numbers less than or equal to a given integer n by Eratosthenes' method:

1. Create a list of consecutive integers from 2 through n: (2, 3, 4, ..., n).
2. Initially, let p equal 2, the smallest prime number.
3. Enumerate the multiples of p by counting to n from 2p in increments of p, and mark them in the list (these will be 2p, 3p, 4p, ...; the p itself should not be marked).
4. Find the first number greater than p in the list that is not marked. If there is no such number, stop. Otherwise, let p now equal this new number (which is the next prime), and repeat from step 3.
5. When the algorithm terminates, the numbers remaining not marked in the list are all the primes less than or equal to n.
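
The steps above can be sketched as a plain function. `sieve_upto` is a name chosen for this illustration; it follows the description literally, marking multiples from 2p and returning the primes up to and including n.

```python
def sieve_upto(n):
    if n < 2:
        return []
    marked = [False] * (n + 1)          # marked[k] is True once k is known composite
    p = 2
    while p is not None:
        for multiple in range(2 * p, n + 1, p):
            marked[multiple] = True     # 2p, 3p, 4p, ...; p itself stays unmarked
        # next unmarked number greater than p, or None if there is none
        p = next((k for k in range(p + 1, n + 1) if not marked[k]), None)
    return [k for k in range(2, n + 1) if not marked[k]]

print(sieve_upto(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```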

In [20]:
[True]* 5

[True, True, True, True, True]

In [21]:
N = 10**6

In [None]:
def primes_sieve(limit):
    """Yield the primes below `limit` (the limit itself is excluded)."""
    a = [True] * limit                          # Initialize the primality list
    a[0] = a[1] = False

    for (i, isprime) in enumerate(a):
        if isprime:
            yield i
            for n in range(i*i, limit, i):     # Mark multiples of i as composite
                a[n] = False

In [None]:
%%timeit
primes_sieve(N)

In [None]:
%%timeit
list(primes_sieve(N))

In [None]:
prime_list = list(primes_sieve(100))
prime_list

Let's compare our sieve generator to a trial-division function that tests one number at a time and uses a return statement.

In [None]:
def is_prime(num):
    if num <= 1:
        return False
    elif num <= 3:
        return True
    elif num%2 == 0 or num%3 == 0:
        return False
    i = 5
    while i*i <= num:
        if num%i == 0 or num%(i+2) == 0:
            return False
        i += 6
    return True

In [None]:
is_prime_list = [is_prime(num) for num in range(100)]
is_prime_list

In [None]:
import pandas as pd

# One boolean per integer in range(100); the row index is the integer itself
prime_df = pd.DataFrame([is_prime(num) for num in range(100)], columns=['is_prime'])
prime_df[prime_df.is_prime]   # keep only the rows flagged as prime

In [None]:
%%timeit
[is_prime(num) for num in range(N)]

In [None]:
%%timeit
primes_sieve(N)

In [None]:
%%timeit
prime_list = list(primes_sieve(N))

In [None]:
print(prime_list[:10])      # the first ten primes
print(is_prime_list[:10])   # booleans: whether 0, 1, 2, ... are prime

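So, the sieve generator is much faster than the trial-division function. Note that `primes_sieve(N)` on its own only creates the generator; `list(primes_sieve(N))` is the call that actually runs the sieve. The speed-up comes mainly from the algorithm rather than from using `yield` instead of `return`. But what if we want to do something that requires all the prime numbers to be on hand (for example, taking the sum of primes)? How can multi-processing help?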

In [None]:
from multiprocessing import Pool
import time

def sum_prime(num):
    
    sum_of_primes = 0

    ix = 2

    while ix <= num:
        if is_prime(ix):
            sum_of_primes += ix
        ix += 1

    return sum_of_primes

def is_prime(num):
    if num <= 1:
        return False
    elif num <= 3:
        return True
    elif num%2 == 0 or num%3 == 0:
        return False
    i = 5
    while i*i <= num:
        if num%i == 0 or num%(i+2) == 0:
            return False
        i += 6
    return True

if __name__ == '__main__':
    start = time.time()
    with Pool(4) as p:
        print(p.map(sum_prime, [1000000, 2000000, 3000000]))
    print("Time taken = {0:.5f}".format(time.time() - start))
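
The script above hands each worker a whole, separate problem. When there is a single large range, one option is to cut it into sub-ranges and add up the partial sums. The sketch below shows only the splitting and combining; `split_range` and `sum_primes_between` are hypothetical helpers, `is_prime` is a simplified stand-in, and the built-in `map` stands in for `Pool.map` so the block runs without starting processes.

```python
def is_prime(num):
    # Simplified trial division, enough for this illustration
    if num < 2:
        return False
    i = 2
    while i * i <= num:
        if num % i == 0:
            return False
        i += 1
    return True

def sum_primes_between(bounds):
    lo, hi = bounds                       # half-open range [lo, hi)
    return sum(n for n in range(lo, hi) if is_prime(n))

def split_range(lo, hi, parts):
    # Cut [lo, hi) into `parts` contiguous half-open pieces of near-equal size
    step, extra = divmod(hi - lo, parts)
    pieces, start = [], lo
    for k in range(parts):
        end = start + step + (1 if k < extra else 0)
        pieces.append((start, end))
        start = end
    return pieces

pieces = split_range(0, 100, 4)
# With a process pool this line would become: sum(p.map(sum_primes_between, pieces))
total = sum(map(sum_primes_between, pieces))
print(pieces)   # [(0, 25), (25, 50), (50, 75), (75, 100)]
print(total)    # 1060
```

Equal-width pieces do not mean equal work, since testing larger numbers takes longer; cutting the range into more pieces than there are workers is one way to even this out.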

## Notes below

In [None]:
def chunk_it(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out


chunk_it(list(range(12)), 4)