# Multi-processing example

We’ll start with code that is clear, simple, and executed top-down. It’s easy to develop and incrementally testable:

In [1]:
import urllib.request
from multiprocessing.pool import ThreadPool as Pool

sites = [
    'https://jupyter-tutorial.readthedocs.io/en/latest/',
    'https://github.com/veit/jupyter-tutorial/',
    'https://cusy.io/en',
]

def sitesize(url):
    ''' Determine the size of a website '''
    with urllib.request.urlopen(url) as u:
        page = u.read()
        return url, len(page)

pool = Pool(10)
for result in pool.imap_unordered(sitesize, sites):
    print(result)

('https://cusy.io/en', 15655)
('https://jupyter-tutorial.readthedocs.io/en/latest/', 12630)
('https://github.com/veit/jupyter-tutorial/', 98527)


> **Note 1:** A good development strategy is to use [map](https://docs.python.org/3/library/functions.html#map) to test your code in a single process and thread before moving to multi-processing.

> **Note 2:** The following rules of thumb help you decide when to use `ThreadPool` and when to use a process `Pool`:
> 
> * `multiprocessing.pool.ThreadPool` should be used for IO-heavy jobs.
> * `multiprocessing.Pool` should be used for CPU-heavy jobs.
> * For jobs that are heavy on both CPU and IO, I usually prefer `multiprocessing.Pool`, as this achieves better process isolation.
> * For Python 3, take a look at the pool implementations of [concurrent.futures.Executor](https://docs.python.org/3/library/concurrent.futures.html?highlight=concurrent%20futures#concurrent.futures.Executor).
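
As a minimal sketch of the last point, `concurrent.futures.ThreadPoolExecutor` can take the place of `ThreadPool`. The function `fake_sitesize` is a hypothetical stand-in for `sitesize` that avoids network access, so the example only illustrates the API and says nothing about real timings:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

sites = [
    'https://jupyter-tutorial.readthedocs.io/en/latest/',
    'https://github.com/veit/jupyter-tutorial/',
    'https://cusy.io/en',
]

def fake_sitesize(url):
    ''' Hypothetical stand-in for sitesize: returns the length of the URL '''
    return url, len(url)

results = {}
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(fake_sitesize, url) for url in sites]
    # as_completed yields futures in completion order, not submission order
    for future in as_completed(futures):
        url, size = future.result()
        results[url] = size

for url, size in results.items():
    print(url, size)
```

For CPU-heavy jobs, `ProcessPoolExecutor` offers the same `Executor` interface.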

In [2]:
import urllib.request
from multiprocessing.pool import ThreadPool as Pool

sites = [
    'https://jupyter-tutorial.readthedocs.io/en/latest/',
    'https://github.com/veit/jupyter-tutorial/',
    'https://cusy.io/en',
]

def sitesize(url):
    ''' Determine the size of a website '''
    with urllib.request.urlopen(url) as u:
        page = u.read()
        return url, len(page)

for result in map(sitesize, sites):
    print(result)

('https://jupyter-tutorial.readthedocs.io/en/latest/', 12630)
('https://github.com/veit/jupyter-tutorial/', 98651)
('https://cusy.io/en', 15655)


## What can be parallelised?

### Amdahl’s law

> The increase in speed is mainly limited by the sequential part of the problem, since its execution time cannot be reduced by parallelisation. In addition, parallelisation creates additional costs, such as for communication and synchronisation of the processes.
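
The limit described above can be made concrete with a small calculation. This is a sketch of the textbook form of Amdahl's law; the function name `amdahl_speedup` and the 90 % parallel fraction are assumptions chosen for illustration, and the formula ignores the communication and synchronisation costs mentioned in the quote:

```python
def amdahl_speedup(parallel_fraction, workers):
    ''' Theoretical speedup according to Amdahl's law '''
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / workers)

# Assumed example: 90 % of the run time can be parallelised
for workers in (1, 2, 10, 100):
    print(workers, round(amdahl_speedup(0.9, workers), 2))
```

However many workers are added, the speedup in this example stays below 10, because the serial 10 % of the run time remains.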

In our example, the following tasks can only be processed serially for each URL:

* UDP DNS request for the URL
* UDP DNS response
* Socket from the OS
* TCP connection
* Sending the HTTP request for the root resource
* Waiting for the TCP response
* Counting the bytes of the page
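
Although the steps for a single URL run one after another, the chains for different URLs can overlap. The following sketch simulates this without network access: `fake_request` is a hypothetical stand-in that sleeps for the duration of one complete request, and the delays are made-up values.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_request(delay):
    ''' Hypothetical stand-in for one request: all steps run one after another '''
    time.sleep(delay)  # simulates DNS lookup, connection, request and response
    return delay

delays = [0.2, 0.2, 0.2]

# One request after the other
start = time.perf_counter()
serial_results = list(map(fake_request, delays))
serial_time = time.perf_counter() - start

# The three requests wait at the same time in separate threads
start = time.perf_counter()
with ThreadPool(3) as pool:
    pooled_results = pool.map(fake_request, delays)
pooled_time = time.perf_counter() - start

print('serial: %.2fs, pooled: %.2fs' % (serial_time, pooled_time))
```

The pooled run should take roughly as long as the slowest single request, while the serial run takes about the sum of all delays.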

In [3]:
import urllib.request
from multiprocessing.pool import ThreadPool as Pool

sites = [
    'https://jupyter-tutorial.readthedocs.io/en/latest/',
    'https://github.com/veit/jupyter-tutorial/',
    'https://cusy.io/en',
]

def sitesize(url):
    ''' Determine the size of a website '''
    with urllib.request.urlopen(url) as u:
        page = u.read()
        return url, len(page)

pool = Pool(10)
for result in pool.imap_unordered(sitesize, sites):
    print(result)

('https://cusy.io/en', 15655)
('https://jupyter-tutorial.readthedocs.io/en/latest/', 12630)
('https://github.com/veit/jupyter-tutorial/', 98526)


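> **Note:** [imap_unordered](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap_unordered) is used to improve responsiveness: results are yielded as soon as they are ready rather than in the order of the input. However, this is only possible because the function returns the argument and result as a tuple, so each result can still be matched to its URL.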
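
The difference between `imap` and `imap_unordered` can be sketched offline. Here `fake_sitesize` is a hypothetical stand-in that sleeps instead of downloading, and the job names and delays are made up for illustration:

```python
import time
from multiprocessing.pool import ThreadPool

def fake_sitesize(job):
    ''' Hypothetical stand-in for sitesize: sleeps instead of downloading '''
    name, delay = job
    time.sleep(delay)
    # Returning the name together with the result keeps the pair identifiable
    return name, delay

jobs = [('slow', 0.3), ('fast', 0.1), ('medium', 0.2)]

with ThreadPool(3) as pool:
    ordered = list(pool.imap(fake_sitesize, jobs))              # input order
    unordered = list(pool.imap_unordered(fake_sitesize, jobs))  # completion order

print(ordered)
print(unordered)
```

`imap` always yields the results in input order, so the first result is only available once the slow job has finished. `imap_unordered` typically yields the fast job first, which is why each result has to carry its own argument.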

## Tips

* Don’t make too many trips back and forth

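   If the iterable yields a very large number of results, this is a good indicator of too many trips, as in the following example, which fetches a page in 1000-byte blocks:

        def sitesize(url, start):
            # Anti-pattern: one task and one HTTP request per 1000-byte block
            req = urllib.request.Request(url)
            req.add_header('Range', 'bytes=%d-%d' % (start, start + 999))
            with urllib.request.urlopen(req) as u:
                block = u.read()
            return url, len(block)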

* Make relevant progress on every trip

   Once a worker has been handed a task, it should make significant progress before reporting back rather than getting bogged down in tiny steps. The following example illustrates intermediate steps that are too small:

        def sitesize(url, results):
            # Anti-pattern: one queue message per line of the page
            with urllib.request.urlopen(url) as u:
                for line in u:
                    results.put((url, len(line)))

* Don’t send or receive too much data

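   The following example unnecessarily increases the amount of data by returning the whole page instead of just its size:

        def sitesize(url):
            # Anti-pattern: the whole page is sent back to the caller
            with urllib.request.urlopen(url) as u:
                page = u.read()
            return url, page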
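
The cost of the last tip matters most with a process `Pool`, where return values are pickled in order to be sent back to the parent process. The following sketch compares the two payload sizes; the page content is a made-up 100,000-byte placeholder, so no network access is needed:

```python
import pickle

url = 'https://cusy.io/en'
page = b'x' * 100_000  # hypothetical page content, not a real download

# Size of the pickled return value when the whole page is returned ...
whole_page = len(pickle.dumps((url, page)))
# ... and when only the size of the page is returned
size_only = len(pickle.dumps((url, len(page))))

print(whole_page, size_only)
```

Returning only the size keeps the payload at a few dozen bytes, regardless of how large the page is.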