### Instruction-Level Parallelism 

Instruction-Level Parallelism (ILP) refers to the parallel execution of a sequence of instructions in a program. The amount of parallelism is measured by the number of instructions completed per cycle. ILP is not concurrent execution, because there is only a single-threaded serial program. Rather, it is a set of techniques that help a processor complete as many instructions as possible by executing them simultaneously. These techniques include:

* Instruction pipelining: instructions are completed in stages that can be overlapped when instructions are independent.

* Vector processing: a single instruction operates on multiple adjacent data elements in parallel. This is related to superscalar processing, a more general term that includes the idea of using different hardware units at the same time.

* Out-of-order execution: Instructions may be run in an order different from the one written in the program. This can be done statically at compile time or dynamically by the hardware.

* Speculative execution (branch prediction): Running a program past a control point before its outcome is known. Most often this is done by predicting the outcome of an if-else branch and continuing execution along the predicted path. If the prediction is wrong, the speculative work is discarded.


One should be aware that all of these techniques exist and have a high-level understanding of them. As a parallel programmer, only **vector processing** has a programmatic interface that you will use. That is our next example and exercise.

It is typical to assume that CPUs complete one instruction per cycle. This is *NOT TRUE*. The completion rate is a complex function of architecture and program.
It is measured by:
* Cycles per instruction (CPI): # of clock cycles / # of instructions
* Instructions per cycle (IPC): # of instructions / # of clock cycles

These are dynamic measures taken against a running program. It is more typical to use IPC in parallel computing, because we expect to complete more than one instruction per cycle.
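As a minimal sketch of the two definitions, the following computes CPI and IPC from a pair of counts. The numbers are made up for illustration, not measurements; on a real system they would come from hardware performance counters.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: clock cycles / instructions completed."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Instructions per cycle: instructions completed / clock cycles."""
    return instructions / cycles


# Hypothetical counter readings (illustrative values, not measured).
cycles = 2_000_000
instructions = 5_000_000

print(f"CPI = {cpi(cycles, instructions):.2f}")  # CPI = 0.40
print(f"IPC = {ipc(cycles, instructions):.2f}")  # IPC = 2.50
```

The two measures are reciprocals of each other, so an IPC above 1 always corresponds to a CPI below 1.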

If you would like to go down the well, you can look at the [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) at the [Software optimization resources](https://www.agner.org/optimize/) page.
On many processors, the simplest instructions take about one cycle and complex instructions (such as division) take tens of cycles.
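To see how a few slow instructions can dominate, here is a rough sketch of an average CPI computed from an instruction mix. The instruction names, fractions, and cycle counts are all assumptions chosen for illustration, and the model ignores any overlap between instructions.

```python
# Hypothetical instruction mix: name -> (fraction of instructions, cycles each).
# Both the fractions and the cycle counts are illustrative assumptions.
mix = {
    "add": (0.60, 1),
    "load": (0.30, 4),
    "divide": (0.10, 20),
}


def average_cpi(mix):
    """Weighted average of per-instruction cycle counts, assuming no overlap."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


print(f"average CPI = {average_cpi(mix):.1f}")  # average CPI = 3.8
```

In this made-up mix, divisions are only 10% of the instructions but account for more than half of the cycles.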

#### Instruction Pipelining

Having a high-level knowledge of instruction pipelining is valuable in writing efficient code, be it parallel or not. The illustrated example on the [Wikipedia page](https://en.wikipedia.org/wiki/Instruction_pipelining) and the subsequent pipeline bubble example are sufficient. You should understand the following concepts:
* Instructions consist of multiple stages of execution
* Each stage can operate at the same time
* In the ideal case, independent instructions are issued and complete at a rate of one per clock cycle
* Data dependencies between instructions result in stalls/bubbles that prevent concurrent execution
* Waiting on instructions or data (often from memory) can prevent instructions from being issued
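The concepts above can be sketched with a simple cycle-count model. This is an idealized in-order pipeline, not a model of any real processor: the function names, the 5-stage depth, and the stall count are assumptions for illustration.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Without overlap, each instruction runs all of its stages alone."""
    return num_instructions * num_stages


def pipeline_cycles(num_instructions: int, num_stages: int, stall_cycles: int = 0) -> int:
    """Cycles to finish all instructions in an idealized in-order pipeline.

    The first instruction needs num_stages cycles to pass through the
    pipeline; each later instruction completes one cycle after its
    predecessor. Every stall (bubble) delays completion by one more cycle.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1) + stall_cycles


n, k = 1000, 5  # hypothetical: 1000 instructions, 5 pipeline stages
print(unpipelined_cycles(n, k))                 # 5000
print(pipeline_cycles(n, k))                    # 1004
print(pipeline_cycles(n, k, stall_cycles=250))  # 1254
```

With no stalls the model gives a CPI of about 1.004, close to the ideal of one instruction per cycle. Adding 250 bubbles raises it to about 1.254, which is why dependencies and memory waits matter.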

Managing the pipeline is the domain of the compiler writer or assembly-level programmer. However, an application programmer who is aware of pipelines can write programs that are easier for the compiler to process by minimizing data dependencies, avoiding unnecessary branches, or explicitly prefetching data.
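One common way to reduce data dependencies is to split a single running total into several independent accumulators. The sketch below shows only the shape of the transformation; the function names are hypothetical, and in Python the interpreter overhead hides any pipeline effect. The same restructuring in a compiled language gives the hardware independent additions to overlap.

```python
def sum_single(values):
    """One accumulator: every addition depends on the previous one."""
    total = 0
    for v in values:
        total += v
    return total


def sum_four_accumulators(values):
    """Four accumulators: the four additions in each iteration are independent."""
    t0 = t1 = t2 = t3 = 0
    n = len(values) - len(values) % 4  # largest multiple of 4 that fits
    for i in range(0, n, 4):
        t0 += values[i]
        t1 += values[i + 1]
        t2 += values[i + 2]
        t3 += values[i + 3]
    # Handle the leftover elements when the length is not a multiple of 4.
    for v in values[n:]:
        t0 += v
    return (t0 + t1) + (t2 + t3)


data = list(range(1, 11))
print(sum_single(data), sum_four_accumulators(data))  # 55 55
```

For integers both versions return the same result. For floating-point data the reordered additions can round differently, so the two sums may differ in the last few bits.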

exercise -- cpi