# [Multithreading vs multiprocessing vs asyncio in Python](https://www.linkedin.com/pulse/multithreading-vs-multiprocessing-asyncio-code-examples-kaushik-yxgjc/)

<img src="png/python_parallel.png" width=800 height=400>
<img src="png/parallel_methods.png" width=800 height=400>


Multithreading runs several threads inside a single process, multiprocessing spawns separate processes, and asyncio uses an event loop and coroutines for cooperative multitasking.

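> Use multithreading when you need to run I/O bound jobs concurrently in a single process. In standard CPython the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so threads do not speed up CPU bound work. Examples - serving concurrent requests in a web server, fetching many URLs at once etc.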

> Leverage multiprocessing for CPU bound jobs that require truly parallel execution across multiple cores. Examples - multimedia processing, scientific computations etc.

> Asyncio suits network applications such as web servers and database clients, where time spent waiting on I/O limits performance. While one coroutine waits, the event loop runs others, which keeps throughput high.
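
As a rough illustration of the I/O bound case, the sketch below runs eight simulated downloads in a thread pool. `fake_download` is a hypothetical stand-in that only sleeps, and the 0.2 second delay and the worker count are arbitrary assumptions, not measurements.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # assumed latency of one simulated request, in seconds

def fake_download(item: int) -> int:
    """Hypothetical stand-in for a blocking network call."""
    time.sleep(DELAY)  # the GIL is released while a thread sleeps or waits on I/O
    return item * 2

items = list(range(8))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map returns results in input order
    results = list(executor.map(fake_download, items))
elapsed = time.perf_counter() - start

print(results)
print(f"{elapsed:.2f} s with threads, about {DELAY * len(items):.1f} s if run one by one")
```

Because each task spends its time waiting rather than computing, the threads overlap almost completely. The same code with a CPU bound function would not be expected to show this speedup.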


In [3]:
import multiprocessing
import time
print("Number of cpu : ", multiprocessing.cpu_count())


Number of cpu :  10


## Multithreading vs multiprocessing

> Multithreading refers to concurrently executing multiple threads within a single process. 

> Multiprocessing refers to executing multiple processes concurrently.

> The key advantage of multithreading is that threads are cheap: all threads share the same process resources like memory, and context switching between them is lightweight. While one thread waits on I/O another can run, which keeps a CPU core busy.

> However, multithreading also comes with challenges like race conditions, deadlocks etc. when multiple threads try to access shared resources. Careful synchronization is needed to avoid these issues.
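
A minimal sketch of the synchronization mentioned above, assuming four threads that increment one shared counter. The names `counter` and `add_many` are made up for this example. Without the lock, the read-modify-write in `counter += 1` can interleave between threads and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n: int) -> None:
    global counter
    for _ in range(n):
        # counter += 1 is a read-modify-write; the lock makes it atomic
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```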

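> Multiprocessing sidesteps the global interpreter lock (GIL), which lets only one thread per process run Python bytecode at a time, so it can use all cores of a multicore CPU. But processes have higher memory overhead than threads, and interprocess communication, where arguments and results must be serialized, is more complicated than using thread synchronization primitives.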



### Multiprocessing examples




In [6]:
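# `multiprocess` is a third-party fork of the standard `multiprocessing` module
# that serializes with dill, so functions defined in a notebook cell can be sent to workers.
from multiprocess import Pool
import time
import random

def cal_square(x: int = 0) -> int:
    # Simulate a slow task: sleep up to (1000 - x) / 200 seconds, then square x.
    time.sleep(random.random() * (1000 - x) / 200)
    return x**2

start_time = time.time()
p = Pool()  # defaults to one worker per available CPU
res = p.map(cal_square, range(1000))  # blocks until done; results keep the input order
p.close()
p.join()
print(res[:10])
print("--- %s seconds ---" % (time.time() - start_time))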

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
--- 123.82402896881104 seconds ---


In [7]:
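# Asynchronous multiprocessing: results arrive in completion order

def add_to_res(x: int):
    # Callback runs in the parent process each time a task finishes.
    res.append(x)

start_time = time.time()
p = Pool(50)  # the tasks mostly sleep, so more workers than cores still helps
res = []
for i in range(1000):
    p.apply_async(cal_square, args=(i,), callback=add_to_res)
p.close()
p.join()
print(res[:10])
print("--- %s seconds ---" % (time.time() - start_time))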

[1681, 81, 2500, 2025, 484, 225, 1444, 1369, 0, 1225]
--- 25.077701807022095 seconds ---
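
Because the callback appends each result as soon as its task finishes, `res` comes back in completion order, which is why the output above is shuffled. One way to recover input order is to keep the `AsyncResult` handles and call `get()` on them in sequence. The sketch below uses `multiprocessing.pool.ThreadPool`, which offers the same `apply_async` interface, purely so that it runs without spawning processes; `slow_square` and its delays are made up for illustration.

```python
import time
from multiprocessing.pool import ThreadPool

def slow_square(x: int) -> int:
    # Later inputs finish first: the delay shrinks as x grows
    time.sleep((5 - x) * 0.05)
    return x ** 2

completion_order = []

with ThreadPool(5) as pool:
    handles = [
        pool.apply_async(slow_square, args=(i,), callback=completion_order.append)
        for i in range(5)
    ]
    # get() blocks until that particular task is done, so this list follows input order
    ordered = [h.get() for h in handles]

print(ordered)           # input order
print(completion_order)  # whatever order the tasks finished in
```

`Pool.map` gives the same ordering guarantee with less code; keeping handles is mainly useful when tasks are submitted one at a time.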


In [13]:
# Use multiprocessing to crawl URLs
import requests
from bs4 import BeautifulSoup

def get_urls(url):
    # Fetch one page and return the href of every <a> tag (None when a tag has no href).
    reqs = requests.get(url)
    soup = BeautifulSoup(reqs.text, 'html.parser')
    return [link.get('href') for link in soup.find_all('a')]

def get_all_urls(url, max_urls=10000):
    '''
        Serial method to get all urls
    '''
    q = [url]
    p = 0
    visited = {url: True}
    while p < len(q) and len(q) < max_urls:
        x = q[p]
        for y in get_urls(x):
            if y is not None and y.startswith('https://') and y.split('//')[1].startswith('www'):
                if y.split('/')[-1].startswith('?ref'):
                    y = '/'.join(y.split('/')[:-1])
                if not y.split('/')[-1].startswith('#') and y not in visited:
                    q.append(y)
                    visited[y] = True
        p += 1
    return q
    

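def get_all_urls_parallel(url, max_urls=10000, max_worker=10):
    '''
        Use multiprocessing to fetch URLs in parallel, max_worker pages per batch
    '''
    def post_process(result):
        # Callbacks run one at a time in the parent process, so q and visited need no lock.
        for y in result:
            # Skip missing and non-HTTPS hrefs; get_urls would fail on them inside a worker.
            if y is not None and y.startswith('https://') and y not in visited:
                q.append(y)
                visited[y] = True

    q = [url]
    p = 0
    visited = {url: True}
    while p < len(q) and len(q) < max_urls:
        p_next = min(len(q), p + max_worker)
        pool = Pool(max_worker)
        for i in range(p, p_next):
            pool.apply_async(get_urls, args=(q[i],), callback=post_process)
        pool.close()
        pool.join()  # wait for the whole batch before starting the next one
        p = p_next
    return q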

In [11]:
# Serial method test
start_time = time.time()
res = get_all_urls('https://www.geeksforgeeks.org/', max_urls=20000)
print(time.time() - start_time)
print(len(res), res[432])

45.396723985672
20018 https://www.geeksforgeeks.org/angular-cheat-sheet-a-basic-guide-to-angular


In [14]:
# parallel method test
start_time = time.time()
res = get_all_urls_parallel('https://www.geeksforgeeks.org/', max_urls=20000)
print(time.time() - start_time)
print(len(res), res[432])

8.672960996627808
20004 https://www.geeksforgeeks.org/logarithms/?ref=outind


In [18]:
# Understanding the bottleneck: time the request, the HTML parse and the link filtering separately
url = 'https://www.geeksforgeeks.org/'
s1 = time.time()
reqs = requests.get(url)
s2 = time.time()
soup = BeautifulSoup(reqs.text, 'html.parser')
s3 = time.time()
q = []
for link in soup.find_all('a'):
    y = link.get('href')
    if y is not None and y.startswith('https://') and y.split('//')[1].startswith('www') and (not y.split('/')[-1].startswith('#')):
        q.append(y)
s4 = time.time()
print(s2-s1, s3-s2, s4-s3)

0.2906820774078369 0.03322482109069824 0.0005812644958496094


## Async IO

> Asyncio provides a single-threaded, non-blocking concurrency model in Python. It uses cooperative multitasking and an event loop to execute coroutines concurrently.

> Asyncio is best suited for I/O-bound tasks, where most of the run time is spent waiting on network responses, database queries etc. It provides high throughput and minimizes blocking.
> However, asyncio doesn't allow true parallelism on multicore systems: a CPU-bound coroutine holds the event loop until its next `await`, stalling every other coroutine. It also has a steeper learning curve than threads and processes.
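
To make "cooperative" concrete, here is a small sketch in which two coroutines take turns on a single thread. The `worker` coroutine and its labels are hypothetical; `asyncio.sleep(0)` is used only as an explicit point where a coroutine hands control back to the event loop. In a notebook, `await main()` would replace `asyncio.run(main())`.

```python
import asyncio

async def worker(name: str, steps: int, log: list) -> None:
    for i in range(steps):
        log.append(f"{name}-{i}")
        await asyncio.sleep(0)  # yield control so the other coroutine can run

async def main() -> list:
    log = []
    # Both coroutines run on one thread; they only switch at the await points
    await asyncio.gather(worker("a", 2, log), worker("b", 2, log))
    return log

log = asyncio.run(main())
print(log)  # ['a-0', 'b-0', 'a-1', 'b-1']
```

If `worker` never awaited, the first coroutine would run to completion before the second one started, which is the stalling behaviour described above.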

In [19]:
import asyncio
import random

async def cal_square(x: int):
    await asyncio.sleep(random.random())
    return x**2

async def main():
    return await asyncio.gather(*[cal_square(i) for i in range(100)])


import time
s = time.time()
res = await main()  # top-level await works in Jupyter/IPython; in a plain script use asyncio.run(main())
elapsed = time.time() - s
print(f"{elapsed:0.2f} seconds.")
print(res)

1.00 seconds.
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]


In [22]:
# Most of the time spent crawling goes to fetching each page, not to parsing it.
# Hence we prefer asyncio here, which avoids the communication overhead between multiple processes.
import asyncio
import concurrent.futures
import time

import requests
from bs4 import BeautifulSoup

async def get_urls_asyncio(url, max_urls=10000):
    async def get_urls(urls, max_workers=20):
        # requests.get is blocking, so run it in worker threads and await the results.
        loop = asyncio.get_running_loop()
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = [
                loop.run_in_executor(executor, requests.get, u)
                for u in urls
            ]
            # Await inside the `with` block: leaving it first would block the event loop
            # until every request had finished.
            responses = await asyncio.gather(*futures)
        return [BeautifulSoup(r.text, 'html.parser') for r in responses]
                  
    q = [url]
    visited = {url: True}
    p = 0
    while p < len(q) and len(q) < max_urls:
        p_next = len(q)
        soups = await get_urls(q[p:])
        for s in soups:
            for link in s.find_all('a'):
                y = link.get('href')
                if y is not None and y.startswith('https://') and y.split('//')[1].startswith('www') and (not y.split('/')[-1].startswith('#')):
                    if y not in visited:
                        q.append(y)
                        visited[y] = True
        p = p_next
    return q

In [23]:
# test asyncio method
s = time.time()
res = await get_urls_asyncio('https://www.geeksforgeeks.org/')
elapsed = time.time() - s
print(f"{elapsed:0.2f} seconds.")
print(len(res), res[432])

18.94 seconds.
21949 https://www.geeksforgeeks.org/introduction-to-queue-data-structure-and-algorithm-tutorials/?ref=outind
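
The crawler above hands blocking `requests.get` calls to a thread pool and awaits them from the event loop. The sketch below isolates that pattern so it can be run without network access. `fake_fetch`, the `example.com` URLs and the 0.1 second delay are all assumptions made for illustration.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url: str) -> str:
    """Hypothetical blocking fetch; stands in for requests.get(url).text."""
    time.sleep(0.1)  # pretend network latency
    return f"<html>{url}</html>"

async def fetch_all(urls, max_workers=20):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [loop.run_in_executor(executor, fake_fetch, u) for u in urls]
        # gather returns results in the same order as `urls`
        return await asyncio.gather(*futures)

urls = [f"https://example.com/page{i}" for i in range(10)]  # made-up URLs
start = time.perf_counter()
pages = asyncio.run(fetch_all(urls))
elapsed = time.perf_counter() - start

print(len(pages), f"{elapsed:.2f} s")
```

Ten fetches of 0.1 seconds each would take about a second in sequence; with the pool they overlap. The concurrency here still comes from threads, with asyncio coordinating them. A natively asynchronous HTTP client would remove the thread pool, but that is outside what the original code does.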
