# Running computations in parallel


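Consider a circle of radius one inscribed in a square with side length two.

The ratio between the areas of the circle and the square is

$\lambda = \frac{\pi}{4}$

To estimate $\lambda$:

* randomly sample points inside the square
* count how many fall inside the circle
* divide that count by the number of samples

Multiplying this ratio by 4 gives an estimate of $\pi$. The code below samples only the quadrant $[0,1] \times [0,1]$, which contains a quarter of the circle and therefore yields the same ratio.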
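
Written out as a formula, with $N$ the number of sampled points and $H$ the number that land inside the circle (both symbols are introduced here only for illustration):

$$
\lambda = \frac{\text{area of circle}}{\text{area of square}} = \frac{\pi \cdot 1^2}{2^2} = \frac{\pi}{4},
\qquad
\frac{H}{N} \approx \lambda
\;\Longrightarrow\;
\pi \approx 4 \cdot \frac{H}{N}
$$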

In [1]:
import scala.util.Random

def mcCount(iter: Int): Int = {
  val randomX = new Random
  val randomY = new Random
  var hits = 0
  for (i <- 0 until iter) {
    val x = randomX.nextDouble // in [0, 1)
    val y = randomY.nextDouble // in [0, 1)
    if (x * x + y * y < 1) hits += 1
  }
  hits
}

def monteCarloPiSeq(iter: Int): Double = 4.0 * mcCount(iter) / iter

import scala.util.Random

defined function mcCount
defined function monteCarloPiSeq

In [4]:
monteCarloPiSeq(100000)

res3: Double = 3.1458

Here is the parallel version of computing the same estimate.

```scala
def monteCarloPiPar(iter: Int): Double = {
  val ((pi1, pi2), (pi3, pi4)) = parallel(
    parallel(mcCount(iter / 4), mcCount(iter / 4)),
    parallel(mcCount(iter / 4), mcCount(iter - 3 * (iter / 4))))
  4.0 * (pi1 + pi2 + pi3 + pi4) / iter
}
```

The four calls to `mcCount` are independent of one another, so all of them can run in parallel.

# First Class Tasks

Let's now look at a more flexible construct for expressing parallel computation. Consider

```scala
val (v1, v2) = parallel(e1, e2)
```

We can alternatively write this using the `task` construct:

```scala
val t1 = task(e1)
val t2 = task(e2)
val v1 = t1.join
val v2 = t2.join
```

`t = task(e)` starts computation `e` "in the background":

* `t` is a task, which performs computation of `e`
* current computation proceeds in parallel with `t`
* to obtain the result of `e`, use `t.join`
* `t.join` blocks and waits until the result is computed
* subsequent `t.join` calls quickly return the same result

Here is a minimal interface for tasks:

```scala
def task[A](c: => A): Task[A]

trait Task[A] {
  def join: A
}
```

`task` and `join` establish maps between computations and tasks.

In terms of the value computed, the equation `task(e).join == e` holds.
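
The interface leaves open how `task` is actually implemented. As a minimal sketch, assuming one plain JVM thread per task (a real implementation would more likely hand the computation to a thread pool), it could look as follows. The names `TaskSketch`, `worker`, `result` and `sumInBackground` are illustrative and not part of the interface:

```scala
// Wrapped in an object so the sketch behaves the same in a REPL, a notebook or a compiled file.
object TaskSketch {
  trait Task[A] {
    def join: A
  }

  // Assumption: every task gets its own thread; this stands in for a real scheduler.
  def task[A](c: => A): Task[A] = {
    var result: Option[A] = None          // written by the worker, read after joining it
    val worker = new Thread {
      override def run(): Unit = { result = Some(c) }
    }
    worker.start()                        // `c` now runs in the background
    new Task[A] {
      // Thread.join blocks until the worker has finished; once it has, later calls
      // return immediately with the stored value. If `c` threw an exception,
      // `result.get` fails here (error handling is left out of this sketch).
      def join: A = { worker.join(); result.get }
    }
  }

  // Joins the same task twice: the second join returns the stored value again.
  def sumInBackground(n: Int): (Int, Int) = {
    val t = task { (1 to n).sum }
    (t.join, t.join)
  }
}
```

Calling `TaskSketch.sumInBackground(100)` joins one task twice and returns the same sum both times, which matches the description of `join` given earlier.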

We can omit writing `.join` if we also define an implicit conversion.

```scala
implicit def getJoin[T](x:Task[T]): T = x.join
```

We have seen four-way parallel p-norm:

```scala
val ((part1, part2), (part3, part4)) =
  parallel(parallel(sumSegment(a, p, 0, mid1),
                    sumSegment(a, p, mid1, mid2)),
           parallel(sumSegment(a, p, mid2, mid3),
                    sumSegment(a, p, mid3, a.length)))
power(part1 + part2 + part3 + part4, 1/p)
```

Here is essentially the same computation expressed using task:
```scala
val t1 = task {sumSegment(a, p, 0, mid1)}
val t2 = task {sumSegment(a, p, mid1, mid2)}
val t3 = task {sumSegment(a, p, mid2, mid3)}
val t4 = task {sumSegment(a, p, mid3, a.length)}
power(t1 + t2 + t3 + t4, 1/p)
```

Notice the implicit conversion of `t1`, `t2`, `t3` and `t4` in the last statement.

Suppose you are allowed to use `task`. Implement the `parallel` construct as a method using `task`:

```scala
def parallel[A, B](cA: => A, cB: => B): (A, B) = {
  val tB: Task[B] = task { cB }
  val tA: A = cA
  (tA, tB.join)
}
```

Suppose we want to run the computations `cA` and `cB` in parallel. The `task` construct starts `cB` in the background, while `cA` continues on the thread that called `parallel`. The method then joins the task and returns both results.

-------------------
What is wrong with the following definition?

```scala
def parallelWrong[A, B](cA: => A, cB: => B): (A, B) = {
  val tB: B = (task { cB }).join
  val tA: A = cA
  (tA, tB)
}
```

Although the above definition compiles, it gains nothing from parallelism: `join` is called immediately after the task is created, so `cB` runs to completion before `cA` even starts. The two computations execute one after the other.
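
The difference can be observed directly. The sketch below assumes a hypothetical thread-per-task stand-in for `task` (the real construct may be scheduled differently) and checks whether `cB` has already finished at the moment `cA` starts. The names `ParallelDemo` and `bFinishedBeforeAStarted` are made up for this illustration:

```scala
import java.util.concurrent.atomic.AtomicBoolean

object ParallelDemo {
  trait Task[A] { def join: A }

  // Assumption: a thread-per-task stand-in for the real `task` construct.
  def task[A](c: => A): Task[A] = {
    var result: Option[A] = None
    val worker = new Thread { override def run(): Unit = { result = Some(c) } }
    worker.start()
    new Task[A] { def join: A = { worker.join(); result.get } }
  }

  def parallel[A, B](cA: => A, cB: => B): (A, B) = {
    val tB = task { cB }             // cB starts in the background
    val a = cA                       // cA runs on the current thread in the meantime
    (a, tB.join)
  }

  def parallelWrong[A, B](cA: => A, cB: => B): (A, B) = {
    val b = (task { cB }).join       // waits for cB before cA has even started
    val a = cA
    (a, b)
  }

  // Returns true if cB had already finished when cA started.
  def bFinishedBeforeAStarted(useWrong: Boolean): Boolean = {
    val bDone = new AtomicBoolean(false)
    def cA(): Boolean = bDone.get                              // observed when cA starts
    def cB(): Unit = { Thread.sleep(200); bDone.set(true) }    // a slow computation
    val (seen, _) =
      if (useWrong) parallelWrong(cA(), cB()) else parallel(cA(), cB())
    seen
  }
}
```

With `useWrong = true` the result is always `true`, because `cB` is joined before `cA` begins. With `useWrong = false`, `cA` normally starts while `cB` is still sleeping, so the result is `false`.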

# Benchmarking Parallel Programs



Testing and benchmarking serve different purposes:

* testing – ensures that parts of the program behave as intended
* benchmarking – computes performance metrics for parts of the program

Testing checks correctness; benchmarking, by contrast, quantifies performance.

A performance metric could be

* running time
* memory footprint
* network traffic
* disk usage
* latency 

Often, this value is a random variable: its value varies from run to run.
For example, to measure the running time of a list reversal, we could write:

In [12]:
val xs = List(1,2,3)
val startTime = System.nanoTime
xs.reverse
println((System.nanoTime - startTime))

5697


xs: List[Int] = List(1, 2, 3)
startTime: Long = 37832158629139L
res11_2: List[Int] = List(3, 2, 1)

This is a very naive approach, but it is enough to illustrate the idea.

Typically, testing yields a binary output – a program or its part is either correct or it is not.

Benchmarking usually yields a continuous value, which denotes how well the program performs.

In our benchmarking example, the running time is a continuous value and will be slightly different on each run.

## Why do we benchmark parallel programs?

Performance benefits are the main reason why we are writing parallel
programs in the first place.

Benchmarking parallel programs is even more important than benchmarking sequential programs.

In a sequential program, we usually measure performance only at the bottlenecks of the system. Conversely, a parallel program is typically a bottleneck in the system to begin with.

Performance (specifically, running time) is subject to many factors:
* processor speed
* number of processors
* memory access latency and throughput (affects contention)
* cache behavior (e.g. false sharing, associativity effects)
* runtime behavior (e.g. garbage collection, JIT compilation, thread scheduling)


All else being equal, the faster the processor, the faster the resulting program.

A parallel program distributes some of its workload across different processors, so having several processors available is a necessary precondition for improving performance through parallelism.

Since processors are separated from main memory by a bus, they have to wait while fetching data from memory. Here we distinguish latency, the amount of time a processor must wait between requesting data from main memory and the data arriving, from throughput, the amount of data that can be retrieved from memory per unit of time.
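
As a first-order model (an assumption made here for illustration, ignoring caches and contention), the time to fetch $n$ bytes combines both quantities:

$$
t(n) \approx t_{\text{latency}} + \frac{n}{\text{throughput}}
$$

With hypothetical figures of 100 ns latency and 10 GB/s throughput, fetching 64 bytes takes about $100\,\text{ns} + 6.4\,\text{ns}$, so latency dominates, whereas fetching 1 MB takes about $100\,\text{ns} + 100\,\mu\text{s}$, so throughput dominates.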

Latency and throughput both affect memory contention. To decrease latency, processors have a small high-speed memory close to the cores, called a cache, which mirrors parts of main memory. Caches are divided into several levels. A cache lets the processor read data without going over the bus, which can make memory accesses dramatically faster.

However, caches make performance analysis complicated, and in some cases they can even decrease performance (for example, through false sharing).

Almost all programs run within some runtime environment, such as a virtual machine or an operating system. These components perform garbage collection, JIT compilation, or thread scheduling, each of which influences running time.

In reality, many other factors also drive performance.

Measuring performance is difficult: usually, a performance metric is a
random variable. A sound measurement therefore relies on:
* multiple repetitions
* statistical treatment – computing mean and variance
* eliminating outliers
* ensuring steady state (warm-up)
* preventing anomalies (GC, JIT compilation, aggressive optimizations)

Garbage collection pauses can be made less likely by giving the JVM enough heap memory up front.
JIT compilation can be turned off.
Compilers can apply optimizations that change or even remove the measured code.
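
The first four points can be sketched by hand in a few lines. This is an illustrative harness, not how any particular framework does it. The helper names (`timeRuns`, `mean`, `variance`, `withoutOutliers`) and the two-standard-deviation cutoff are assumptions chosen for the example:

```scala
// Hypothetical helper: runs `body` `warmups` times unmeasured, then `runs` times measured.
// Returns the measured running times in milliseconds.
def timeRuns(warmups: Int, runs: Int)(body: => Any): Seq[Double] = {
  for (_ <- 0 until warmups) body            // warm-up runs, results discarded
  (0 until runs).map { _ =>
    val start = System.nanoTime
    body
    (System.nanoTime - start) / 1e6
  }
}

def mean(xs: Seq[Double]): Double = xs.sum / xs.length

// Population variance of the measurements.
def variance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / xs.length
}

// Keeps only the measurements within `k` standard deviations of the mean.
def withoutOutliers(xs: Seq[Double], k: Double = 2.0): Seq[Double] = {
  val m = mean(xs)
  val sd = math.sqrt(variance(xs))
  xs.filter(x => math.abs(x - m) <= k * sd)
}

// Usage: 20 warm-up runs, 30 measured runs, then statistics on the cleaned data.
val times = timeRuns(warmups = 20, runs = 30) { (0 until 1000000).toArray }
val cleaned = withoutOutliers(times)
val summary = (mean(cleaned), variance(cleaned))
```

The harness does nothing about garbage collection or compiler optimizations, so its numbers should be read as rough indications only.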

To learn more, see _Statistically Rigorous Java Performance Evaluation_ by Georges, Buytaert, and Eeckhout.

# ScalaMeter

ScalaMeter is a benchmarking and performance regression testing
framework for the JVM.
* performance regression testing – comparing performance of the current program run against known previous runs
* benchmarking – measuring performance of the current (part of the) program

We will focus on benchmarking.

First, add ScalaMeter as a dependency.

```scala
libraryDependencies +=
"com.storm-enroute" %% "scalameter-core" % "0.6"
```
Then, import the contents of the ScalaMeter package, and measure:

```scala
import org.scalameter._
val time = measure {
  (0 until 1000000).toArray
}
println(s"Array initialization time: $time ms")
```

When we run the above snippet repeatedly in the Scala REPL, we get running times with a lot of variance. During a run, the JVM may have started dynamic (JIT) compilation; the shorter times in some runs are evidence of this. Garbage collection may also have occurred: as the runs allocate more and more memory, usage eventually reaches a threshold and the garbage collector kicks in. This happens periodically, and a run that includes a collection can take noticeably longer.

The demo showed two very different running times on two consecutive
runs of the program.

When a JVM program starts, it undergoes a period of warmup, after
which it achieves its maximum performance.

* first, the program is interpreted: the bytecode produced by the Scala compiler runs directly in the JVM interpreter
* then, parts of the program are compiled into machine code: the JVM detects the hot (frequently executed) parts and compiles those
* later, the JVM may choose to apply additional dynamic optimizations
* eventually, the program reaches steady state

Usually, we want to measure steady state program performance.

ScalaMeter `Warmer` objects run the benchmarked code until detecting
steady state.

```scala
import org.scalameter._
val time = withWarmer(new Warmer.Default) measure {
(0 until 1000000).toArray
}
```
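
How ScalaMeter detects steady state is not described here. As a rough illustration of what "running until steady state" can mean, the hand-rolled loop below keeps running the code until the most recent timings are close to one another. The criterion, the helper names `isSteady` and `warmUp`, and the default window and tolerance are all assumptions made for this sketch:

```scala
// Hypothetical criterion: the last `window` timings differ by at most
// `tolerance` relative to the smallest of them.
def isSteady(times: Seq[Double], window: Int, tolerance: Double): Boolean =
  times.length >= window && {
    val recent = times.takeRight(window)
    (recent.max - recent.min) <= tolerance * recent.min
  }

// Runs `body` until the criterion holds or `maxRuns` is reached.
// Returns the number of runs that were executed.
def warmUp(maxRuns: Int, window: Int = 5, tolerance: Double = 0.1)(body: => Any): Int = {
  var times = Vector.empty[Double]
  while (times.length < maxRuns && !isSteady(times, window, tolerance)) {
    val start = System.nanoTime
    body
    times = times :+ ((System.nanoTime - start) / 1e6)   // elapsed milliseconds
  }
  times.length
}

// Usage: at most 60 warm-up runs of the array initialization.
val warmupRuns = warmUp(maxRuns = 60) { (0 until 1000000).toArray }
```

A fixed tolerance like this is easily fooled by a single garbage collection pause, which is one reason to prefer a framework over a hand-written loop.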

The ScalaMeter `config` clause allows specifying various parameters, such as the minimum and maximum number of warmup runs.
```scala
val time = config(
Key.exec.minWarmupRuns -> 20,
Key.exec.maxWarmupRuns -> 60,
Key.verbose -> true
) withWarmer(new Warmer.Default) measure {
(0 until 1000000).toArray
}
```

Finally, ScalaMeter can measure more than just the running time.

* Measurer.Default – plain running time
* IgnoringGC – running time without GC pauses
* OutlierElimination – removes statistical outliers
* MemoryFootprint – memory footprint of an object
* GarbageCollectionCycles – total number of GC pauses
* newer ScalaMeter versions can also measure method invocation counts and boxing counts

To measure the memory footprint:
```scala
withMeasurer(new Measurer.MemoryFootprint) withWarmer(new Warmer.Default) measure {
  (0 until 100000).toArray
}
```