## Factors against Parallelism

* Startup costs associated with initiating processes
  * Often overwhelm the actual processing time (rendering parallelism useless)
  * Involve thread/process creation, data movement
* Interference: slowdown resulting from multiple processors accessing shared resources
  * Resources: memory, I/O, system bus, sub-processors
  * Software synchronization: locks, latches, mutexes, barriers
  * Hardware synchronization: cache faults, interrupts
* Skew: when breaking a single task into many smaller tasks, not all tasks may be the same size
  * Not all tasks finish at the same time
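
As a rough sketch of how startup and skew limit speedup, consider a deliberately simplified cost model: every task runs on its own processor, so parallel time is a fixed startup cost plus the longest task. The function name `parallel_speedup`, the task times, and the model itself are illustrative assumptions rather than measurements, and interference is left out entirely.

```python
def parallel_speedup(task_times, startup_cost=0.0):
    """Speedup of running each task on its own processor (toy model).

    Assumes (hypothetically) one processor per task and no interference:
    parallel time = startup cost + the longest task.
    """
    serial_time = sum(task_times)
    parallel_time = startup_cost + max(task_times)
    return serial_time / parallel_time


# Balanced tasks, free startup: the ideal case.
print(parallel_speedup([10, 10, 10, 10]))                   # 4.0
# Skew: same total work, but one task is much larger than the rest.
print(parallel_speedup([25, 5, 5, 5]))                      # 1.6
# Startup: balanced tasks, but startup costs more than the work itself.
print(parallel_speedup([10, 10, 10, 10], startup_cost=30))  # 1.0
```

In this model the slowest task sets the finish time, which is why skew hurts even when the total work is unchanged.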
  
### Understanding Factors by Analogy



#### Startup

* Szkieletor (an unfinished high-rise) in Kraków, Poland
  * Too expensive to complete or demolish
* Why is this like a parallel computer?
  * It is a parallel living environment
  * The parallel living throughput is 0 because startup never finished
    
<img src="./images/sz.png" width="300" title="http://en.wikipedia.org/wiki/File:Szkieletor2.jpg" />

#### Interference

* Congested intersections
  * multiple vehicles compete for the same resource (lanes in the roundabout)
  * others await resources
* This is a parallel driving environment
  * unused capacity (16 outbound lanes) because resource competition (the roundabout) prevents its use
  * this system exhibits a _throughput collapse_ in which more usage reduces flow
  
<img src="./images/traffic.png" width="512" title="http://crowdcentric.net/2011/05/can-you-help-us-solve-the-worlds-traffic-congestion/" />


#### Skew

* Completion of parts for assembly
  * throughput: finished planes are stalled awaiting other parts
* The parallelism is implicit in the parallel construction of all parts
  * the entire system is stalled (as pictured) awaiting a nose section

<img src="./images/plane.png" width="512" title="http://www.ainonline.com/?q=aviation-news/dubai-air-show/2011-11-12/" />


### In Computers: Real Things that Degrade Parallelism

* I/O (memory and storage)
  * may be startup (load data before computation)
  * may be interference (awaiting data transfer between parallel tasks)
  * may be skew (await I/O completion of one task)

* Network communication
  * similar to I/O, but always involves communication

* Failures—particularly slow/failed processes (often skew)

The HPC community focuses on communication (among processes) as the major source of slowdown.  This is a traditional (I/O and networking) view.

### Communication

* Parallel computation proceeds in phases
  * Compute (evaluate data that you have locally)
  * Communicate (exchange data among compute tasks)
* Communication performance is governed by:
  * Latency: fixed cost to send a message
    * Round trip time (speed of light and switching costs)
  * Bandwidth: marginal cost per byte of sending a message
    * Link capacity
* Latency dominates small messages and bandwidth dominates large ones
  * Almost always better to increase message size for performance, but difficult to achieve in practice.
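
The latency/bandwidth trade-off can be sketched with the usual linear cost model, time = latency + bytes / bandwidth. The function name `transfer_time` and the numbers below (50 µs latency, 1 GB/s bandwidth) are assumptions chosen for illustration, not measurements of any real network.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to send one message: fixed latency plus per-byte cost."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


LATENCY_S = 50e-6  # assumed: 50 microseconds per message
BANDWIDTH = 1e9    # assumed: 1 GB/s link

# The same 100 kB of data, sent as many small messages or as one large one.
many_small = 1000 * transfer_time(100, LATENCY_S, BANDWIDTH)
one_large = transfer_time(1000 * 100, LATENCY_S, BANDWIDTH)

print(f"1000 messages of 100 B: {many_small * 1e3:.2f} ms")
print(f"1 message of 100 kB:    {one_large * 1e3:.2f} ms")
```

Under these assumed numbers the small messages pay the latency 1000 times, so batching wins by a wide margin. The hard part in practice is restructuring the program so that the data is ready to send in one piece.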

### Overlapping Computation and I/O

I/O (or messaging) and computation that proceed at the same time are said to be _overlapped_.

<img src="./images/overlap.png" width="512" title="Unknown source" />

* _Concept_: When performing a slow operation
  * do the slow operation asynchronously
  * do useful work with processor while waiting
* Overlap is one of the simplest and most important forms of asynchronous execution
  * identify independent tasks and do in parallel
  * reorder I/O to initiate it as early as possible and wait on it as late as possible, computing in the meantime
  
I've built a toy example to demonstrate.

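```python
# toy workloads: one CPU-bound task and one I/O-bound task
def factorial(number):
    # CPU-bound: multiply 2 * 3 * ... * number
    f = 1
    for i in range(2, number+1):
        f *= i
    return f

def io_from_devnull(number):
    # I/O-bound: issue `number` one-byte reads against /dev/null
    with open("/dev/null", "rb") as fh:
        for i in range(number):
            fh.read(1)
    return number
```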

```python
%timeit -n 20 factorial(10000)
%timeit -n 20 io_from_devnull(30000)
```

```
15 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
22.2 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
```


```python
%%timeit -n 20

# synchronous: compute, then do the I/O, one after the other
factorial(10000)
io_from_devnull(30000)
```

```
36.9 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
```


```python
%%timeit -n 20

# overlapped: run the compute and the I/O in separate processes
from multiprocessing import Process
p1 = Process(target=factorial, args=(10000,))
p2 = Process(target=io_from_devnull, args=(30000,))
p1.start()
p2.start()
p1.join()
p2.join()
```

```
28.9 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
```
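
In the timings above, the overlapped run (28.9 ms) beats the serial run (36.9 ms) but stays above the slower task on its own (22.2 ms), plausibly because process startup is paid on every run. The "initiate early, wait late" pattern can also be sketched with a thread pool, which is cheaper to start than a process. Here `slow_io` is a hypothetical stand-in that just sleeps, and the delay and loop size are arbitrary assumptions; absolute timings will vary by machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io(delay_s=0.2):
    """Hypothetical stand-in for a blocking read or message receive."""
    time.sleep(delay_s)  # sleeping releases the GIL, as blocking I/O does
    return b"payload"


def compute(n=1_000_000):
    """CPU-bound work that does not depend on the I/O result."""
    return sum(i * i for i in range(n))


def run_serial():
    start = time.perf_counter()
    data = slow_io()    # wait for the I/O ...
    result = compute()  # ... and only then compute
    return data, result, time.perf_counter() - start


def run_overlapped():
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(slow_io)  # initiate the I/O as early as possible
        result = compute()              # useful work while the I/O is in flight
        data = pending.result()         # wait as late as possible
    return data, result, time.perf_counter() - start


serial = run_serial()
overlapped = run_overlapped()
print(f"serial:     {serial[2] * 1e3:.0f} ms")
print(f"overlapped: {overlapped[2] * 1e3:.0f} ms")
```

The overlapped version should take roughly as long as the slower of the two operations instead of their sum, provided the two are truly independent.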


### Bulk Synchronous Parallel

* BSP is a simplified model for parallel computation that divides programs into "supersteps" that consist of
  * Compute
  * Communicate
  * Synchronize
* MapReduce is a variant (covered later)
  * Compute
  * Communicate
  * Compute
  * Synchronize
* BSP is a formalism that makes it easy to represent parallel programs
  * Separate independent compute from dependent communication
  * Most MPI programs use this pattern.
* BSP allows for no overlap
  * Recall that overlap means compute and communication proceeding at the same time in the same process/node/worker
  * Some nodes may be sending/receiving while others are computing. That's not overlap.
  
<img src="https://upload.wikimedia.org/wikipedia/en/thumb/e/ee/Bsp.wiki.fig1.svg/2560px-Bsp.wiki.fig1.svg.png" width="512"/>
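
A superstep can be sketched with threads standing in for nodes and `threading.Barrier` as the synchronization point. This is a toy illustration, not how an MPI program is written: the function name `bsp_ring_sum`, the ring pattern, and the per-step mailboxes are all assumptions made for the example. Each worker forwards a value to its right-hand neighbor once per superstep until every worker has seen every value.

```python
import threading


def bsp_ring_sum(values):
    """Every worker ends up with sum(values) after len(values) - 1 supersteps."""
    p = len(values)
    # One mailbox row per superstep, so a message is never overwritten
    # before its receiver has read it.
    mailboxes = [[None] * p for _ in range(p - 1)]
    totals = [None] * p
    barrier = threading.Barrier(p)

    def worker(rank):
        total = 0
        incoming = values[rank]  # treat the local value as the first input
        for step in range(p):
            # Compute: purely local work on data this worker already holds.
            total += incoming
            if step == p - 1:
                break
            # Communicate: forward the value to the right-hand neighbor.
            mailboxes[step][(rank + 1) % p] = incoming
            # Synchronize: nobody proceeds until all messages are delivered.
            barrier.wait()
            incoming = mailboxes[step][rank]
        totals[rank] = total

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


print(bsp_ring_sum([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

Note that no worker computes while its own message is in flight: within a worker the three phases are strictly sequential, which is the "no overlap" property described above.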



### Some Conclusions about Factors

* Factors against parallelism are the most important design consideration.
  * This is the non-parallel part in Amdahl’s law
* Typical experience
  * Design a parallel code
  * Test on n=2 or 4 nodes (works great)
  * Deploy on >16 nodes (sucks eggs)
* Measure factors against parallelism
  * Redesign/reimplement
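
Amdahl's law makes the "works at n=4, fails at n>16" experience concrete. The sketch below evaluates the standard formula, speedup = 1 / (s + (1 - s) / n), where s is the non-parallel fraction. The 10% value for s is an assumption picked for illustration, not a figure from any particular code.

```python
def amdahl_speedup(serial_fraction, n):
    """Amdahl's law: speedup on n processors given a non-parallel fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


# Assumed: 10% of the run time cannot be parallelized.
for n in (2, 4, 16, 64):
    s = amdahl_speedup(0.10, n)
    print(f"n={n:3d}  speedup={s:5.2f}  efficiency={s / n:4.0%}")
```

With this assumed fraction, efficiency is about 91% on 2 nodes but only 40% on 16, even though nothing about the code changed between the two runs.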
