
### Motivation for this Class

* Modern computer hardware is parallel at all scales:
  * Instruction-level parallelism -- individual processor cores execute tens to hundreds of instructions at the same time
      * processor pipelines execute single instructions over many cycles
      * vector computing executes the same instruction on a vector of data
  * Multi-core parallelism -- single 'processors' consist of many independent cores
  * Multi-processor (Non-Uniform Memory Access) parallelism -- many chips integrated into the same computer
  * Distributed parallelism -- many computers connected over a network
      * Cloud computing
      * Supercomputing
* Why do we need to use the parallelism? Good utilization leads to:
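    * Energy efficiency -- the operating power of a system is roughly constant (to a first approximation), so higher utilization means more FLOPs per watt.
    * Cost efficiency -- the cost to acquire hardware is fixed, so higher utilization means more FLOPs per \$\$.
    * Scalability -- there are limits to how much hardware can be integrated efficiently, so using it well lets us solve bigger problems. The limits include: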
        * number of cores on a die
        * number of processors on a system bus
        * number of nodes on a network.
    * So, save the earth, make more money, solve the hardest problems.
        * Parallel computing is the technology that unlocked AI and started the AI revolution in 2012, when GPU-trained neural networks won the ImageNet competition.
        
    


### A Comment on the CS Curriculum

The  computer science curriculum often fails to adequately address parallelism. 

* The 'discipline' of CS has been built on serial algorithm and machine models
    * Computational complexity counts serial instructions
    * (Most) Programming languages express one instruction after another
    * Systems and architecture build on the Von Neumann machine.
* These are powerful concepts and what you have learned so far.
* They are inadequate (at best) and deceptive (at worst) for reasoning about parallel hardware.

* Von Neumann Architecture

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Von_Neumann_Architecture.svg/2880px-Von_Neumann_Architecture.svg.png" width="386" title="Von Neumann Architecture" />    


## Modern Processors: A Programmer's Perspective

In 1979, processors looked like Von Neumann's machines. They were designed for serial execution on a single thread.

<img src="https://static.righto.com/images/8086-prefetch8088/die-labeled-w600.jpg" width="386" title="8088 Die Layout" /> 

For many reasons that we will cover later (Moore's Law and Dennard scaling) processors have evolved into parallel execution units. The goal is always to execute more instructions and this is accomplished in three main ways:


### Multicore

Processors consist of multiple independent processing units. Each "core" is a separate processing unit.

<img src="https://www.cpushack.com/wp-content/uploads/2018/03/SB-EPLayout.jpg" width="386" title="Sandy Bridge Die Layout" /> 

The process of placing multiple cores on a single die started in the early 2000s (mainstream x86 dual-core chips arrived in 2005) and has continued ever since. By 2010 (above), 8 cores on a processor was possible. Core counts vary from 4 or 5 (phones) to 96 (servers).


### Pipeline

The execution of a single instruction is decomposed into stages that use different parts of the chip, so several instructions can be in progress at the same time, each in a different stage. This is known as "pipeline parallelism".

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Pipeline%2C_4_stage.svg/1920px-Pipeline%2C_4_stage.svg.png" width="386" title="Generic pipeline (wikipedia)" /> 

The instructions need to be independent so that they can run at the same time. If they are not independent, a pipeline stall (a "bubble") occurs. This happens when an instruction uses as input the output of a prior instruction that has not yet finished.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/67/Pipeline%2C_4_stage_with_bubble.svg/700px-Pipeline%2C_4_stage_with_bubble.svg.png" width="386" title="Generic pipeline (wikipedia)" /> 
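The cost of a stall can be sketched with a toy cycle counter. This is a simplified model, not a description of any real processor: it assumes an in-order pipeline that issues one instruction per cycle, has no forwarding, and makes a dependent instruction wait until its producer has left the last stage. The function name `pipeline_cycles` and the `deps` encoding are hypothetical, chosen only for illustration.

```python
def pipeline_cycles(deps, depth=4):
    """Count cycles for a toy in-order pipeline with no forwarding.

    deps[i] is the index of the earlier instruction whose result
    instruction i reads, or None if instruction i is independent.
    """
    issue = []  # cycle in which each instruction enters the first stage
    for dep in deps:
        # At most one new instruction enters the pipeline per cycle.
        earliest = issue[-1] + 1 if issue else 0
        if dep is not None:
            # Stall until the producer has left the last stage.
            earliest = max(earliest, issue[dep] + depth)
        issue.append(earliest)
    # The last instruction still needs `depth` cycles to drain.
    return issue[-1] + depth if issue else 0


independent = [None, None, None, None]
chained = [None, 0, 1, 2]  # each instruction reads the previous result

print(pipeline_cycles(independent))  # 7
print(pipeline_cycles(chained))      # 16
```

Under these assumptions, four independent instructions finish in 7 cycles on a 4-stage pipeline, while a fully dependent chain takes 16, the same as running with no pipelining at all. Real hardware reduces this penalty with techniques such as forwarding, so treat the numbers as an upper bound on the damage.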

### Vector Processing

Vector processing performs the same instruction simultaneously on every element of a one-dimensional array (vector) of data. A common implementation is _Single Instruction Stream, Multiple Data Stream (SIMD)_ execution on vectors of fixed width.

![Vector Operation](./images/vector_op.JPG "Vector Operation")
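The semantics of a fixed-width vector add can be modeled in a few lines. This is only a sketch of the idea: Python processes the lanes one after another, whereas SIMD hardware would handle all lanes of a chunk in a single instruction. The name `simd_add` and the default width of 4 lanes are assumptions made for the example.

```python
def simd_add(a, b, width=4):
    """Add two equal-length lists in chunks of `width` lanes.

    Each loop iteration models one vector instruction. Returns the
    result and the number of modeled vector instructions.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    instructions = 0
    for start in range(0, len(a), width):
        lanes_a = a[start:start + width]
        lanes_b = b[start:start + width]
        # One instruction, applied to every lane of the chunk.
        out.extend(x + y for x, y in zip(lanes_a, lanes_b))
        instructions += 1
    return out, instructions


a = list(range(8))
b = [10] * 8
result, count = simd_add(a, b)
print(result)  # [10, 11, 12, 13, 14, 15, 16, 17]
print(count)   # 2 modeled vector instructions instead of 8 scalar adds
```

With 8 elements and 4 lanes, the model issues 2 vector instructions where scalar code would issue 8 adds, which is where the speedup from vector width comes from.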

### How many parallel operations?

At any instant, a CPU can have up to

    vector_width x pipeline_depth x core_count
    
operations in progress. This is typically 200-1000.
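As a worked example of the product above, the snippet below plugs in two sets of numbers. The values are assumed for illustration and do not describe a specific product, and `peak_parallel_ops` is a hypothetical helper name.

```python
def peak_parallel_ops(vector_width, pipeline_depth, core_count):
    """Upper bound on operations in progress at one instant."""
    return vector_width * pipeline_depth * core_count


# Assumed, illustrative values (not a specific product):
# 8 vector lanes, a 14-stage pipeline, and 8 cores.
print(peak_parallel_ops(vector_width=8, pipeline_depth=14, core_count=8))  # 896

# A smaller assumed configuration: 4 lanes, 14 stages, 4 cores.
print(peak_parallel_ops(vector_width=4, pipeline_depth=14, core_count=4))  # 224
```

Both results fall inside the 200-1000 range quoted above. The product is a peak: stalls, scalar code, and idle cores all reduce what a program actually achieves.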

Our job, as programmers, is to feed the processor enough independent work to fully utilize this hardware. Again, the payoff is lower power, cost, and carbon. This can be done:

* __implicitly__: by feeding the compiler code patterns that result in pipelines without stalls and vectorized code
  
* __explicitly__
    * multithreaded programs (OpenMP, Cilk) that use multiple cores
    * vector intrinsics: compiler-provided functions that map directly to vector (SIMD) instructions
    
We will consider both approaches.
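The explicit style comes down to splitting a job into pieces that do not depend on each other and combining the results. The sketch below shows that structure with Python's standard `concurrent.futures` module. It illustrates the decomposition only: in CPython the global interpreter lock keeps threads from running Python bytecode in parallel, so do not expect a speedup from this code. The function name `chunked_sum` and the default of 4 workers are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked_sum(data, workers=4):
    """Sum `data` by giving each worker an independent slice."""
    if not data:
        return 0
    size = -(-len(data) // workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No chunk depends on another, so they can be summed in any order.
        partials = list(pool.map(sum, chunks))
    # Combine the partial results serially.
    return sum(partials)


print(chunked_sum(list(range(1000))))  # 499500
```

The same shape (independent partial sums followed by a small serial combine) is what a parallel reduction expresses in a threading framework for a compiled language.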

### Who should take this course?
 
This class aims to provide students with a comprehensive understanding of parallel computing principles, techniques, and best practices. By equipping students with the necessary knowledge and skills, the class seeks to empower them to leverage parallelism effectively and efficiently.

This course is designed for the following audiences:

* Undergraduates in Computer Science: The course offers a quick boost in skills that are highly valuable to employers and can enhance internship prospects. By taking this course, students can acquire practical parallel computing skills that are in demand in various industries.

* Graduate students in Science and Engineering: The course is designed to minimize dependencies on other computer science courses, making it accessible to students from diverse academic backgrounds. The course provides a self-contained treatment of operating systems, computer architecture, and other relevant topics.
 
The course takes an engineering and programming approach, focusing on practical applications rather than delving deeply into the theoretical aspects of parallel computation. This approach makes the course accessible and beneficial to individuals who are interested in understanding how programming languages interact with hardware architecture, particularly the memory system.
   