# Week 4: Multiprocessing in Python
In this exercise, we explore Python's multiprocessing capabilities to illustrate the concepts of processes, the process lifecycle, concurrent execution and synchronization.

## Working on Windows
Unfortunately, interactive Python doesn't play well with multiprocessing on Windows.  Windows has no fork, so each new process starts a fresh interpreter and re-imports the main module. Functions defined interactively are not in any importable module, so the function names that are the targets for the new processes aren't available.  We are therefore going to run this exercise on Colab, a Google-hosted Jupyter environment, available at https://colab.research.google.com.  Alternatively, this is a good time to learn how to program directly in Python: use an editor such as Notepad++ or Emacs to create Python files, and run them from the command line.
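For the command-line route, here is a minimal sketch of a standalone script. The file name hello_mp.py and the helper names describe and main are assumptions made for illustration. The `if __name__ == "__main__":` guard is what matters: on Windows the child process re-imports the script, and without the guard that import would try to start processes again.

```python
# Save as hello_mp.py (hypothetical name) and run with: python hello_mp.py
import multiprocessing

def describe(name, msg):
    # Build the line a worker prints; kept separate so it is easy to check
    return f"{name} says {msg}"

def worker(msg):
    # Runs in the child process
    print(describe(multiprocessing.current_process().name, msg))

def main():
    processes = [multiprocessing.Process(target=worker, args=("Hello World",))
                 for _ in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

# Only the process that was launched from the command line gets here;
# children that re-import this file skip the block.
if __name__ == "__main__":
    main()
```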

## Simple multiprocessing
The multiprocessing package supports the spawning of separate processes running your Python code. In the following code, we create the given number of processes. Note that these are operating-system-level processes, each with its own thread of control and its own address space. Some state is still available to the children: investigate the operating system "fork" call to see how a child starts with a copy of its parent's memory.
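To make the separate address spaces concrete, the following sketch contrasts an ordinary module-level variable with a `multiprocessing.Value`, which lives in shared memory. The names plain_counter and bump are hypothetical, chosen for this illustration only.

```python
import multiprocessing

plain_counter = 0  # ordinary variable: every process has its own copy

def bump(shared_counter):
    global plain_counter
    plain_counter += 1                 # changes only the current process's copy
    with shared_counter.get_lock():    # the Value carries its own lock
        shared_counter.value += 1      # stored in shared memory, visible to the parent

if __name__ == "__main__":
    shared = multiprocessing.Value("i", 0)   # "i" means a C int, initial value 0
    p = multiprocessing.Process(target=bump, args=(shared,))
    p.start()
    p.join()
    # The child's change to plain_counter never reaches the parent
    print("plain:", plain_counter, "shared:", shared.value)  # plain: 0 shared: 1
```

The parent sees the update to the shared value but not the update to the plain variable, because the child incremented its own copy.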

In [None]:
import multiprocessing

def worker(msg):
    print(multiprocessing.current_process(), " says ",msg)
    
def spawnWorkers(num,msg):
    for w in range(num):
        p = multiprocessing.Process(target=worker,args=(msg,))
        p.start()

In [None]:
spawnWorkers(4,"Hello World")

The processes above are simple processes that execute quickly and then exit. If we have longer-lived processes, it is good practice to have the parent process (the process spawning the children) wait for the children to terminate by calling the join() method on each of them. So in general, we use the following idiom to manage individual processes:

In [None]:
def waitForWorkers(num,msg):
    processes = []
    for w in range(num):
        p = multiprocessing.Process(target=worker,args=(msg,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()

In [None]:
waitForWorkers(4,"hello world")

Let's create a program to investigate how processes are mapped to cores. For this, we're going to create a worker function that executes a lot of instructions (CPU-bound), and we're going to time how long it takes. If each process is mapped to a different core, then each process should take approximately the same amount of time. If not, the processes have to share cores and the execution time of each should increase. Add the timing code as described in the comments below. You are welcome to extend the code to collect results and process them.
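As a hint for the timing step, here is one possible way to measure elapsed time in milliseconds. It is a sketch rather than the required solution: the helpers elapsed_ms and spin are hypothetical, and it uses `time.perf_counter()`, a clock intended for measuring intervals, instead of reading the wall-clock time.

```python
import time

def elapsed_ms(func, *args):
    # Run func(*args) and return how long it took, in milliseconds
    start = time.perf_counter()
    func(*args)
    return (time.perf_counter() - start) * 1000.0

def spin(num_loops):
    # A small CPU-bound loop to have something to time
    total = 0
    for i in range(num_loops):
        total += i
    return total

print(f"spin took {elapsed_ms(spin, 100000):.1f} ms")
```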

In [None]:
import random,time,multiprocessing

def workThatCPU(numLoops):
    id = multiprocessing.current_process()
    # Print that this process is executing at this time
    # YOUR CODE HERE
    # Insert code that captures the current time in milliseconds
    # YOUR CODE HERE
    
    for w in range(numLoops):
        random.random()
    # Print the execution time of the process
    #YOUR CODE HERE

def coreInvestigation(numCores, numLoops):
    processes = []
    for w in range(numCores):
        p = multiprocessing.Process(target=workThatCPU,args=(numLoops,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    

The operating system schedules each of our processes onto a core.  Starting from 1 process, the time taken by each process should remain roughly constant until the number of processes exceeds the number of available cores (often 4 or more on modern desktop machines), after which the processes have to share cores and each one takes longer.
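To choose sensible process counts for the experiment, it helps to ask the machine how many cores it reports. The sketch below uses `os.cpu_count()`, which returns the number of logical cores (or None if it cannot be determined). The helpers available_cores and counts_to_try are hypothetical.

```python
import os

def available_cores():
    # os.cpu_count() may return None, so fall back to 1
    return os.cpu_count() or 1

def counts_to_try(cores):
    # One process, one per core, and twice as many as there are cores
    return sorted({1, cores, 2 * cores})

cores = available_cores()
print("Logical cores reported:", cores)
print("Process counts worth timing:", counts_to_try(cores))
```

Each count could then be passed as the first argument to coreInvestigation to compare the per-process times.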

In [None]:
coreInvestigation(1,100000000)

## A multiprocess Web Crawler
We're going to build a web crawler to answer the question: how many neighbours of neighbours does a given web site have? A web crawler retrieves the HTML corresponding to a URL from a web server, parses the HTML to extract the data that is interesting, and then processes the data. This is a typical application for multiprocessing, since retrieving the HTML for a URL may take an arbitrary amount of time, depending upon the bandwidth of the link to the web server and the load upon the server.
We parse the HTML with the Beautiful Soup package (imported as bs4), and the connection to the web server is managed by the requests package. Our worker is going to collect a location from a queue, retrieve and parse the HTML for references to other URLs, and add the set of discovered locations to the queue. Note that the queue holds locations (host names such as www.sussex.ac.uk), not full URLs, so "http://" needs to be prefixed before a page can be requested.
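The difference between a URL and a location can be sketched with `urllib.parse.urlparse`, whose netloc attribute holds the host part. The helper names to_location and to_url are hypothetical, and the choice of plain "http" as the default scheme is an assumption that follows the text above.

```python
from urllib.parse import urlparse

def to_location(url):
    # "http://www.sussex.ac.uk/about" -> "www.sussex.ac.uk"
    return urlparse(url).netloc

def to_url(location, scheme="http"):
    # "www.sussex.ac.uk" -> "http://www.sussex.ac.uk/"
    return f"{scheme}://{location}/"

loc = to_location("http://www.sussex.ac.uk/about")
print(loc)          # www.sussex.ac.uk
print(to_url(loc))  # http://www.sussex.ac.uk/
```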

In [None]:
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import requests
import multiprocessing

def getLocations(url,completed):  
    page  = requests.get(url)
    data = page.text
    base = urlparse(url)
    completed.add(base.netloc)
    # BS likes to know which parser to use
    soup = BeautifulSoup(data, "html.parser")
    urls = set()
    for link in soup.find_all('a'):
        if(link.get('href')):
            r = urlparse(link.get('href'))
            if r.scheme and "http" in r.scheme and r.netloc not in completed:
                urls.add(r.netloc)
    return urls

def worker(urlqueue,baseloc):
    # create a completed set with the baseloc
    completed = {baseloc}  # note: set((baseloc)) would build a set of the string's characters
    #baseUrl =  'http://' + baseloc + '/'
    results = set()
    while not urlqueue.empty():
        # TODO pull the first location from the queue
        # YOUR CODE HERE
        # TODO Create the url from the location
        # YOUR CODE HERE
        # TODO get the neighbour locations
        # YOUR CODE HERE
        results.update(locations)
        completed.add(startLoc)
    return results
    
def crawler(url):
    manager = multiprocessing.Manager()
    myQueue = manager.Queue()
    startLoc = urlparse(url)
    startSet = getLocations(url,set())
    for u in startSet:
        myQueue.put(u)
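    numWorkers = 5
    # Submit one worker task per pool process so that all of them drain the shared queue
    with multiprocessing.Pool(numWorkers) as pool:
        pending = [pool.apply_async(worker,(myQueue,startLoc.netloc)) for _ in range(numWorkers)]
        # get() blocks until that worker has finished; merge the sets from all workers
        results = set()
        for r in pending:
            results.update(r.get())
    return results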

In [None]:
crawler("http://www.sussex.ac.uk")

### Web crawler problems
We can pass any URL to our web crawler, and it will use the processes in the pool to collect the results.  However, even though our Queue is synchronized, there are still potential problems.  We want each worker to exit when there are no more locations to examine.  However, between checking whether the queue is empty and getting the next location, another process may access the queue, so a worker can find the queue empty even though empty() has just returned False.  We could fix this by making the check and the retrieval a critical section with mutual exclusion, but it's better in this situation to accept that the queue may have emptied by the time we come to look at it, and to handle that case when retrieving. In general, always try to construct solutions that need no synchronization beyond what is already built into the data structures.
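One way to accept that the queue may have emptied is to skip the separate check and attempt the retrieval directly, treating the `queue.Empty` exception as the signal to stop. The sketch below shows the pattern with a hypothetical helper named drain. It is demonstrated on an ordinary `queue.Queue`, on the assumption that a manager queue is used the same way, since it offers the same get_nowait() method.

```python
import queue

def drain(urlqueue):
    # Take items until the queue reports that it is empty.
    # There is no gap between "check" and "get" for another process to exploit.
    items = []
    while True:
        try:
            item = urlqueue.get_nowait()
        except queue.Empty:
            break   # nothing left (or another worker got there first)
        items.append(item)
    return items

q = queue.Queue()
for loc in ["www.sussex.ac.uk", "www.example.org"]:
    q.put(loc)
print(drain(q))   # ['www.sussex.ac.uk', 'www.example.org']
print(drain(q))   # []
```

In the crawler, the body of the loop would process each location instead of appending it to a list.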