In [112]:
# setup
from IPython.display import display, HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML(open('rise.css').read()))

# imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})


# CMPS 2200
# Introduction to Algorithms

## Lecture 2: Parallelism


Today's agenda:

- Motivate why we are using parallel algorithms
- Overview of how to analyze parallel algorithms

## What is parallelism? (aka parallel computing)

> ability to run multiple computations at the same time

## Why study parallel algorithms?

- faster
- lower energy usage
  + performing a computation twice as fast sequentially requires roughly eight times as much energy
  + energy consumption is a cubic function of clock frequency
- better hardware now available
  + multicore processors are the norm
  + GPUs (graphics processing units)
  
E.g., machines with more than **one million** cores are now possible:  
SpiNNaker (Spiking Neural Network Architecture), University of Manchester  
<img src="https://upload.wikimedia.org/wikipedia/commons/9/97/Spinn_1m_pano.jpg" alt="SpiNNaker"/>


<br><br>

**supporting Moore's Law since 2004**

<img src="https://upload.wikimedia.org/wikipedia/commons/8/8b/Moore%27s_Law_Transistor_Count_1971-2018.png"/>

## Example: Summing a list

Summing can easily be parallelised by splitting the input list into two (or $k$) pieces.


```python
def total(mylist):
    result = 0
    for v in mylist:
        result += v
    return result
```

<br>

becomes

<br>

```python
def parallel_total(mylist):

    result1, result2 = run_parallel(
        total(mylist[:len(mylist)//2]),
        total(mylist[len(mylist)//2:])
    )

    # combine results
    return result1 + result2
```
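
The snippet above is pseudocode: `run_parallel` is not a built-in, and as written both calls to `total` would finish before `run_parallel` ever starts. Below is a minimal runnable sketch, assuming a hypothetical `run_parallel` helper built on the standard library's `concurrent.futures.ThreadPoolExecutor` that takes zero-argument callables. In CPython, threads will not actually speed up this CPU-bound loop because of the global interpreter lock, so the sketch shows the structure only, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def total(mylist):
    result = 0
    for v in mylist:
        result += v
    return result

def run_parallel(*tasks):
    """Hypothetical helper: run zero-argument callables concurrently
    and return their results in the order the tasks were given."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]

def parallel_total(mylist):
    mid = len(mylist) // 2
    # wrap each call in a lambda so it runs inside the pool, not before it
    result1, result2 = run_parallel(
        lambda: total(mylist[:mid]),
        lambda: total(mylist[mid:]),
    )
    # combine results
    return result1 + result2

print(parallel_total(list(range(10))))
```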
    
    
<br><br>
    
- How much faster should the parallel version be?
- How much energy is consumed?

The parallel version of `total` is twice as fast and uses the same amount of energy

<br><br>
...almost. This ignores the **overhead** to set up parallel code and communicate/combine results.

$O(\frac{n}{2}) + O(1)$  
where $O(\frac{n}{2})$ is the time to sum the two halves in parallel and $O(1)$ is the time to combine results


<br>
some current state-of-the-art results (running times in seconds):

| application                           | sequential (s) | parallel, 32 cores (s) | speedup |
|---------------------------------------|----------------|------------------------|---------|
| sort $10^7$ strings                   | 2.9            | .095                   | 30x     |
| remove duplicates from $10^7$ strings | .66            | .038                   | 17x     |
| min. spanning tree for $10^7$ edges   | 1.6            | .14                    | 11x     |
| breadth first search for $10^7$ edges | .82            | .046                   | 18x     |

The **speedup** of a parallel algorithm $P$ over a sequential algorithm $S$ is:
$$
speedup(P,S) = \frac{T(S)}{T(P)}
$$
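
As a quick check of the definition, the sketch below recomputes the speedup column from the timings in the table above. The `speedup` function and the `timings` dictionary are illustrative names, not part of the course code. The computed ratios are not whole numbers, so they can differ slightly from the table's rounded speedup column.

```python
def speedup(t_sequential, t_parallel):
    """speedup(P, S) = T(S) / T(P)"""
    return t_sequential / t_parallel

# (sequential time, parallel time on 32 cores), copied from the table
timings = {
    "sort": (2.9, 0.095),
    "remove duplicates": (0.66, 0.038),
    "min. spanning tree": (1.6, 0.14),
    "breadth first search": (0.82, 0.046),
}

for name, (t_seq, t_par) in timings.items():
    print(f"{name}: {speedup(t_seq, t_par):.1f}x")
```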

## Parallel software

So why isn't all software parallel?

**dependency**

> The fundamental challenge of parallel algorithms is that computations must be **independent** to be performed in parallel.  Parallel computations should not depend on each other.

**What should this code output?**

run this a few times and see if output changes...

In [118]:
import threading

total = 0

def count():
    global total
    for _ in range(100000):
        total += 1
    
def race_condition_example():
    global total
    total = 0   # reset so that repeated runs start from zero
    t1 = threading.Thread(target=count)
    t2 = threading.Thread(target=count)
    t1.start()  # launch first thread in parallel
    t2.start()  # launch second thread in parallel
    t1.join()   # wait for first thread to complete
    t2.join()   # wait for second thread to complete
    print(total)
    
race_condition_example()

200000


#### Counting in parallel is hard!

- motivates functional programming (next class)
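
One way to sidestep the race is to make the two computations independent: each thread counts into its own local variable, and the partial counts are combined only after both threads finish. This is a minimal sketch using `concurrent.futures.ThreadPoolExecutor`; the names `count_local` and `independent_count_example` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def count_local(n):
    # no shared state: each call only touches its own local variable
    local_total = 0
    for _ in range(n):
        local_total += 1
    return local_total

def independent_count_example(n_threads=2, n=100000):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(count_local, [n] * n_threads))
    # combine results once all threads are done
    return sum(partials)

print(independent_count_example())
```

Because no thread reads or writes a variable that another thread modifies, the result does not depend on how the threads are interleaved.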

This course will focus on:
- understanding when things can run in parallel and when they cannot
- algorithm, not hardware specifics (though see CMPS 4760: Distributed Systems)
- runtime analysis

## Analyzing parallel algorithms

> **work**: total number of primitive operations performed by an algorithm

- On a sequential machine, work is just the total running time.
- On a parallel machine, work is divided among $P$ processors.

> **perfect speedup**: dividing $W$ work across $P$ processors yields total time $\frac{W}{P}$

> **span**: longest sequence of dependencies in computation
- time to run with an infinite number of processors
- measure of how "parallelized" an algorithm is 
- also called: *critical path length* or *computational depth*

<br><br>
**intuition**:  
**work**: total energy consumed by a computation  
**span**: minimum possible time that the computation requires
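
To make the two definitions concrete, here is a small sketch that computes work and span for a dependency graph. The graph `deps` is a made-up example, and the sketch assumes every task costs one unit of time: work is then the number of tasks, and span is the number of tasks on the longest chain of dependencies.

```python
from functools import lru_cache

# Hypothetical dependency graph: each task maps to the tasks it must wait for.
deps = {
    "a": [],
    "b": [],
    "c": ["a", "b"],
    "d": ["c"],
    "e": ["a"],
}

def work(deps):
    # total number of unit-cost tasks
    return len(deps)

def span(deps):
    @lru_cache(maxsize=None)
    def depth(task):
        # a task can start only after its deepest dependency has finished
        return 1 + max((depth(d) for d in deps[task]), default=0)
    return max((depth(t) for t in deps), default=0)

print(work(deps), span(deps))  # 5 3
```

Here the longest chain is `a` (or `b`), then `c`, then `d`, so even with unlimited processors the computation needs three time steps.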

**What is work and span of parallel `total` algorithm?**


```python
def parallel_total(mylist):
    # pseudocode: split the list into n one-element pieces
    results = run_parallel(
        total(mylist[0:1]),
        total(mylist[1:2]),
        ...
        total(mylist[n-1:n]),
    )

    # combine results
    result = 0
    for v in results:
        result += v
    return result
```


- work: $O(n)$
- span: $O(n)$ (the combine loop is sequential and visits all $n$ results)


**oops** that didn't work...

<br>

can we do better?

**idea**: parallelize the combination step

### divide and conquer

![dag](figures/dag.png)  
[source](https://homes.cs.washington.edu/~djg/teachingMaterials/spac/sophomoricParallelismAndConcurrency.pdf)
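
A minimal sketch of the divide-and-conquer idea follows. The function `dc_total` is a made-up name. It runs sequentially, but it also counts the work and the span the computation would have if the two recursive calls ran in parallel. It assumes each leaf and each addition costs one unit and ignores the cost of slicing the list.

```python
def dc_total(mylist):
    """Return (sum, work, span), assuming the two halves run in parallel."""
    n = len(mylist)
    if n == 0:
        return 0, 1, 1
    if n == 1:
        return mylist[0], 1, 1
    mid = n // 2
    left_sum, left_work, left_span = dc_total(mylist[:mid])
    right_sum, right_work, right_span = dc_total(mylist[mid:])
    return (
        left_sum + right_sum,              # combine step: one addition
        left_work + right_work + 1,        # work adds up across both halves
        max(left_span, right_span) + 1,    # span follows the slower half
    )

print(dc_total([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 15, 4)
```

Under these assumptions the work stays $O(n)$, while the span drops to $O(\log n)$, because each level of the tree adds only one step to the longest chain of dependencies.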
