# Data Operations and Parallel Mapping

Parallel processing of collections is important
* one of the main applications of parallelism today

We examine the conditions under which this can be done
- properties of collections: ability to split, combine
- properties of operations: associativity, independence

Operations on collections are key to functional programming

map: apply function to each element
- `List(1,3,8).map(x => x*x) == List(1, 9, 64)`

fold: combine elements with a given operation
- `List(1,3,8).fold(100)((s,x) => s + x) == 112`

scan: combine folds of all list prefixes

- `List(1,3,8).scan(100)((s,x) => s + x) == List(100, 101, 104, 112)`


These operations are even more important for parallel than sequential collections: 
they encapsulate more complex algorithms.

You can think of `scan` as applying `fold` to every prefix of the list or, equivalently, as recording the intermediate results of computing the fold of the whole list. Let's apply `scan` to the same input list, with the same initial element and the same summation operation. The result is a sequence one element longer than the original. It starts with 100, the initial element; then comes 101, the initial element plus the first list element; then 104, obtained by adding 3; and finally 112, obtained by adding 8.
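
The two readings of `scan` can be checked against each other. The following is a small sketch; the object name `ScanDemo` and the value names are illustrative and not part of the lecture.

```scala
object ScanDemo {
  val xs = List(1, 3, 8)

  // scan as provided by the standard library
  val viaScan: List[Int] = xs.scan(100)((s, x) => s + x)

  // scan as "fold of every prefix": inits yields the prefixes from longest
  // to shortest, so we reverse to go from the empty prefix to the full list
  val viaPrefixFolds: List[Int] =
    xs.inits.toList.reverse.map(prefix => prefix.fold(100)((s, x) => s + x))
}
```

Both values are `List(100, 101, 104, 112)`. The prefix-based version recomputes every fold from scratch, so it is useful only as a specification, not as an implementation.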


These operations already exist in the sequential case, but they become even more important in the parallel case, because implementing them from scratch is then more difficult, so there is even more value in reusing library implementations.
We have been using lists to specify the intended results of these operations, but lists themselves are not a good basis for parallel collections.
We cannot efficiently split a list in half, since finding its middle requires traversing it, and we cannot efficiently combine two lists, because concatenation takes linear time. For simplicity, we consider two alternatives to lists here.


We use `List` to specify the results of operations.

Lists are not good for parallel implementations because we cannot efficiently
- split them in half (need to search for the middle)
- combine them (concatenation needs linear time)

For now, we use these alternatives
- arrays: imperative (recall array sum)
- trees: can be implemented functionally
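
As a rough illustration of the difference, compare how a list and an array segment are halved. This is only a sketch: `ListVsArray`, `Segment`, and `halves` are hypothetical names introduced for the example.

```scala
object ListVsArray {
  val list = List(1, 3, 8, 5)

  // Splitting a list walks the cells of its first half: linear time.
  val (front, back) = list.splitAt(list.length / 2)
  // Concatenation copies the cells of `front`: linear time again.
  val rejoined: List[Int] = front ::: back

  // An array segment is just a pair of indices into a shared array, so
  // halving it is index arithmetic: constant time, and nothing is copied.
  final case class Segment(left: Int, right: Int) {
    def halves: (Segment, Segment) = {
      val mid = left + (right - left) / 2
      (Segment(left, mid), Segment(mid, right))
    }
  }
  val whole = Segment(0, 4)
}
```

Here `whole.halves` describes the index ranges 0 until 2 and 2 until 4 without touching any array element, which is the representation the array-based functions in this section rely on.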


Subsequent lectures examine Scala’s parallel collection libraries
- they include many more data structures, implemented efficiently

## Map: Meaning and Properties

Map applies a given function to each list element
```scala
List(1,3,8).map(x => x*x) == List(1, 9, 64)
```
`List(a1, a2, …, an).map(f) == List(f(a1), f(a2), …, f(an))`

Properties to keep in mind:
- `list.map(x => x) == list`
- `list.map(f.compose(g)) == list.map(g).map(f)`

Recall that `(f.compose(g))(x) = f(g(x))`.
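
Both properties can be checked on a concrete list. A minimal sketch, where `MapLaws`, `f`, and `g` are example names chosen here:

```scala
object MapLaws {
  val list = List(1, 3, 8)
  val f: Int => Int = x => x * x
  val g: Int => Int = x => x + 1

  // Identity law: mapping the identity function changes nothing.
  val identityHolds: Boolean = list.map(x => x) == list

  // Composition law: one pass with f.compose(g) equals two passes, g then f.
  val fused: List[Int] = list.map(f.compose(g))
  val twoPasses: List[Int] = list.map(g).map(f)
}
```

Both `fused` and `twoPasses` are `List(4, 16, 81)`. Read from right to left, the composition law says that two consecutive maps can be fused into a single traversal.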

Let's write a sequential map function on lists.
```scala
def mapSeq[A, B](lst: List[A], f: A => B): List[B] = lst match {
  case Nil    => Nil
  case h :: t => f(h) :: mapSeq(t, f)
}
```
We would like a version that parallelizes
- computations of $f(h)$ for different elements $h$
- finding the elements themselves (list is not a good choice)


We would now like a parallel version of this map operation. That means performing the computations of `f` on different list elements in parallel, and also parallelizing the traversal of the collection itself, so a list is no longer an ideal choice: even finding the middle element of a list takes linear time, so we could not efficiently parallelize operations on long lists. We therefore start by looking at implementations of map on arrays. Here is one possible signature of such an operation, together with a sequential implementation.


```scala
def mapASegSeq[A, B](inp: Array[A], left: Int, right: Int, f: A => B,
                     out: Array[B]): Unit = {
  // Writes to out(i) for left <= i <= right-1
  var i = left
  while (i < right) {
    out(i) = f(inp(i))
    i = i + 1
  }
}
```

It takes an input array, `inp`, and an output array, `out`, to which the results are written. The indices `left` and `right` indicate which part of the array to process: processing starts at `left` and stops at `right-1`. The function to apply is again passed as the argument `f`. The sequential implementation is a simple while loop: starting from `left`, it writes `f(inp(i))` to `out(i)` for increasing indices `i`, going from `left` up to but not including `right`. The effect of the function is that the content of the `out` array changes between indices `left` and `right-1`.

```scala
val in = Array(2,3,4,5,6)
val out = Array(0,0,0,0,0)
val f = (x:Int) => x*x
mapASegSeq(in,1,3,f,out)
out
```

```
in: Array[Int] = Array(2, 3, 4, 5, 6)
out: Array[Int] = Array(0, 9, 16, 0, 0)
f: Int => Int = ammonite.$sess.cmd1$Helper$$Lambda$2011/0x9abc2028@1f7a0ec
res1_4: Array[Int] = Array(0, 9, 16, 0, 0)
```

How can we do this in parallel?
Let's use the `parallel` construct.

```scala
def mapASegPar[A, B](inp: Array[A], left: Int, right: Int, f: A => B,
                     out: Array[B]): Unit = {
  // Writes to out(i) for left <= i <= right-1
  if (right - left < threshold)
    mapASegSeq(inp, left, right, f, out)
  else {
    val mid = left + (right - left) / 2
    parallel(mapASegPar(inp, left, mid, f, out),
             mapASegPar(inp, mid, right, f, out))
  }
}
```
When the difference between the left and right indices is below some threshold, we invoke the sequential function defined previously. Otherwise, we compute the midpoint and invoke the function recursively, in parallel, on the segment from `left` to `mid` and on the segment from `mid` to `right`. There are two things to pay attention to. The first is that we are now running, in parallel, computations that write their output to an array, so we must make sure that parallel computations write to disjoint parts of memory.

Here the computations write to elements of the `out` array, so we need to track which indices each of them writes to. The specification, written as an informal comment, says that the recursive function, just like its sequential counterpart, writes to `out(i)` only for indices `i` from `left` up to and including `right-1`. Because of this property, the two recursive calls do not interfere: the highest index to which the first argument of `parallel` writes is `mid-1`, and the lowest index to which the second writes is `mid`. Moreover, if we assume that the specification holds for the two recursive calls, then it also holds for the entire function. So, by induction, we can verify the property stated in the comment.
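
To experiment with this code outside the course environment, `parallel` and `threshold` have to come from somewhere. The sketch below supplies stand-ins: the thread-based `parallel` and the value 1000 for `threshold` are assumptions made for this example, not the course's definitions, and `ParMapSketch` and `demo` are hypothetical names.

```scala
object ParMapSketch {
  // Assumed value, chosen only for this example.
  val threshold = 1000

  // Stand-in for `parallel`: runs taskB on a fresh thread while the
  // current thread runs taskA, then waits for both to finish.
  def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
    var resultB: Option[B] = None
    val worker = new Thread {
      override def run(): Unit = { resultB = Some(taskB) }
    }
    worker.start()
    val resultA = taskA
    worker.join() // after join, the worker's write to resultB is visible
    (resultA, resultB.get)
  }

  def mapASegSeq[A, B](inp: Array[A], left: Int, right: Int, f: A => B,
                       out: Array[B]): Unit = {
    var i = left
    while (i < right) { out(i) = f(inp(i)); i += 1 }
  }

  def mapASegPar[A, B](inp: Array[A], left: Int, right: Int, f: A => B,
                       out: Array[B]): Unit =
    if (right - left < threshold) mapASegSeq(inp, left, right, f, out)
    else {
      val mid = left + (right - left) / 2
      // The calls write to out(left..mid-1) and out(mid..right-1): disjoint.
      parallel(mapASegPar(inp, left, mid, f, out),
               mapASegPar(inp, mid, right, f, out))
    }

  // Runs both versions on the same input and returns both outputs.
  def demo(): (List[Int], List[Int]) = {
    val inp = Array.tabulate(10000)(i => i)
    val square: Int => Int = x => x * x
    val outSeq = new Array[Int](inp.length)
    val outPar = new Array[Int](inp.length)
    mapASegSeq(inp, 0, inp.length, square, outSeq)
    mapASegPar(inp, 0, inp.length, square, outPar)
    (outSeq.toList, outPar.toList)
  }
}
```

Because the two halves never write to the same index, `demo()` returns two equal lists regardless of how the threads are scheduled. Creating one thread per call is wasteful; a production implementation would more likely schedule the tasks on a thread pool.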



In terms of performance, the second point is that the threshold needs to be large enough. This is particularly important in an example such as this one, where the only work done per element is applying `f` and writing one array element.

If the function `f` is relatively simple, then computing and writing one array element is several orders of magnitude cheaper than starting a parallel computation. The overhead of parallelization therefore has to be amortized over many such writes, which is why the threshold should itself be several orders of magnitude larger than one. Once we have defined such a parallel map, we can use it with various concrete functions.
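
To get a feeling for the amortization, one can count how many sequential leaf segments the halving scheme produces for a given threshold. The helper below is a sketch introduced for this purpose (`TaskCount` and `leafTasks` are hypothetical names); it mirrors the recursion of `mapASegPar` without doing any mapping.

```scala
object TaskCount {
  // Number of sequential leaf segments produced by the same halving
  // scheme as mapASegPar. Requires threshold >= 2: with threshold 1, a
  // one-element segment would split into an empty segment and itself,
  // and the recursion would never end.
  def leafTasks(left: Int, right: Int, threshold: Int): Int =
    if (right - left < threshold) 1
    else {
      val mid = left + (right - left) / 2
      leafTasks(left, mid, threshold) + leafTasks(mid, right, threshold)
    }
}
```

With the array length 2000000 and threshold 10000 used in the measurements in this section, `TaskCount.leafTasks(0, 2000000, 10000)` is 256, so 255 invocations of `parallel` are spread over two million element writes. With threshold 2, every leaf holds a single element and there is almost one `parallel` invocation per write.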


## Example of using mapASegPar: pointwise exponent
Raise each array element to power $p$:

$\mathrm{Array}(a_1, a_2, \ldots, a_n) \longrightarrow \mathrm{Array}(|a_1|^p, |a_2|^p, \ldots, |a_n|^p)$

We can use previously defined higher-order functions:

```scala
val p: Double = 1.5
def f(x: Int): Double = power(x, p)
mapASegSeq(inp, 0, inp.length, f, out) // sequential
mapASegPar(inp, 0, inp.length, f, out) // parallel
```
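
The snippet above relies on `power`, `inp`, and `out` being defined elsewhere. A self-contained sketch follows, assuming that `power(x, p)` computes $|x|^p$ with `math.pow`; that definition is an assumption made for this example, as are the names `PointwiseExponent` and `run`.

```scala
object PointwiseExponent {
  // Assumption: |x|^p computed with math.pow (hypothetical definition).
  def power(x: Int, p: Double): Double = math.pow(math.abs(x).toDouble, p)

  def mapASegSeq[A, B](inp: Array[A], left: Int, right: Int, f: A => B,
                       out: Array[B]): Unit = {
    var i = left
    while (i < right) { out(i) = f(inp(i)); i += 1 }
  }

  val p: Double = 1.5
  def f(x: Int): Double = power(x, p)

  // Applies f to a small example array and returns the filled output.
  def run(): Array[Double] = {
    val inp = Array(1, -4, 9)
    val out = new Array[Double](inp.length)
    mapASegSeq(inp, 0, inp.length, f, out) // the method f is passed as a function
    out
  }
}
```

For the input `Array(1, -4, 9)` the output is approximately 1, 8, and 27, up to floating-point rounding.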

Questions on performance:

- are there performance gains from parallel execution?
- how does re-using higher-order functions compare with re-implementing from scratch?



### Sequential pointwise exponent written from scratch

```scala
import math.pow
def normsOfSeq(inp: Array[Int], p: Double,
               left: Int, right: Int,
               out: Array[Double]): Unit = {
  var i = left
  while (i < right) {
    out(i) = pow(inp(i), p)
    i = i + 1
  }
}
```

### Parallel pointwise exponent written from scratch

```scala
def normsOfPar(inp: Array[Int], p: Double,
               left: Int, right: Int,
               out: Array[Double]): Unit = {
  if (right - left < threshold) {
    var i = left
    while (i < right) {
      out(i) = power(inp(i), p)
      i = i + 1
    }
  } else {
    val mid = left + (right - left) / 2
    parallel(normsOfPar(inp, p, left, mid, out),
             normsOfPar(inp, p, mid, right, out))
  }
}
```

What do you think is the relative performance of these different versions?
How much performance improvement do you expect from:

- inlining the higher-order function of map
- parallelizing over several cores

Measurement setup:

- inp.length = 2000000
- threshold = 10000
- Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (4-core, 8 HW threads), 16GB RAM

|expression | time(ms)|
|-----------|----------|
|mapASegSeq(inp, 0, inp.length, f, out)| 174.17|
|mapASegPar(inp, 0, inp.length, f, out)| 28.93|
|normsOfSeq(inp, p, 0, inp.length, out)| 166.84|
|normsOfPar(inp, p, 0, inp.length, out)| 28.17|

* Parallelization pays off
* Manually removing higher-order functions does not pay off
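
Both conclusions can be read off the table by taking ratios of the measured times (rounded to two decimals). The first line compares sequential with parallel execution, the second compares the higher-order version with the hand-written one:

$$\frac{174.17}{28.93} \approx 6.02, \qquad \frac{166.84}{28.17} \approx 5.92$$

$$\frac{174.17}{166.84} \approx 1.04, \qquad \frac{28.93}{28.17} \approx 1.03$$

Parallel execution is about six times faster for both variants, while writing the loop by hand gains only about 3 to 4 percent in these measurements.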

## Parallel map on immutable trees
Consider trees where
- leaves store array segments
- non-leaf node stores two subtrees

```scala
sealed abstract class Tree[A] { val size: Int }
case class Leaf[A](a: Array[A]) extends Tree[A] {
  override val size = a.size
}
case class Node[A](l: Tree[A], r: Tree[A]) extends Tree[A] {
  override val size = l.size + r.size
}
```
Assume that our trees are balanced: we can explore branches in parallel.

We consider trees whose leaves store array segments and whose non-leaf nodes contain references to a left and a right subtree. It is also convenient for both leaves and non-leaf nodes to store the total number of elements they contain.
We assume that our trees are approximately balanced. That lets us explore the branches in parallel and actually benefit from the parallelization.

Here is an implementation of parallel map on trees.
```scala
def mapTreePar[A: Manifest, B: Manifest](t: Tree[A], f: A => B): Tree[B] =
  t match {
    case Leaf(a) => {
      val len = a.length
      val b = new Array[B](len)
      var i = 0
      while (i < len) { b(i) = f(a(i)); i = i + 1 }
      Leaf(b)
    }
    case Node(l, r) => {
      val (lb, rb) = parallel(mapTreePar(l, f), mapTreePar(r, f))
      Node(lb, rb)
    }
  }
```
Speedup and performance are similar to those of the array version.
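
The tree-based map can be tried out with the following self-contained sketch. It makes several assumptions: `parallel` is a thread-based stand-in rather than the course's implementation, `fromArray` and `toList` are hypothetical helpers added for the example, and a `ClassTag` bound on `B` replaces the `Manifest` bounds (either one supplies the runtime class that `new Array[B]` needs).

```scala
import scala.reflect.ClassTag

object TreeMapSketch {
  sealed abstract class Tree[A] { val size: Int }
  case class Leaf[A](a: Array[A]) extends Tree[A] {
    override val size = a.length
  }
  case class Node[A](l: Tree[A], r: Tree[A]) extends Tree[A] {
    override val size = l.size + r.size
  }

  // Stand-in for `parallel`, using one extra thread per call.
  def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
    var resultB: Option[B] = None
    val worker = new Thread {
      override def run(): Unit = { resultB = Some(taskB) }
    }
    worker.start()
    val resultA = taskA
    worker.join() // after join, the worker's write to resultB is visible
    (resultA, resultB.get)
  }

  // Hypothetical helper: balanced tree over xs(left until right) with at
  // most leafSize elements per leaf (leafSize >= 1).
  def fromArray[A](xs: Array[A], left: Int, right: Int, leafSize: Int): Tree[A] =
    if (right - left <= leafSize) Leaf(xs.slice(left, right))
    else {
      val mid = left + (right - left) / 2
      Node(fromArray(xs, left, mid, leafSize),
           fromArray(xs, mid, right, leafSize))
    }

  // Hypothetical helper: the elements in left-to-right order.
  def toList[A](t: Tree[A]): List[A] = t match {
    case Leaf(a)    => a.toList
    case Node(l, r) => toList(l) ::: toList(r)
  }

  def mapTreePar[A, B: ClassTag](t: Tree[A], f: A => B): Tree[B] = t match {
    case Leaf(a) =>
      val b = new Array[B](a.length) // fresh array: the input leaf is untouched
      var i = 0
      while (i < a.length) { b(i) = f(a(i)); i += 1 }
      Leaf(b)
    case Node(l, r) =>
      val (lb, rb) = parallel(mapTreePar(l, f), mapTreePar(r, f))
      Node(lb, rb)
  }
}
```

For example, building a tree from `Array(1, 3, 8, 5, 2)` with `leafSize = 2` and mapping `x => x * x` over it yields a new tree whose elements are 1, 9, 64, 25, 4, while the original tree still holds the original values.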

Arrays:

- (+) random access to elements, on shared memory can share array
- (+) good memory locality
- (-) imperative: must ensure parallel tasks write to disjoint parts
- (-) expensive to concatenate

Immutable trees:
* (+) purely functional, produce new trees, keep old ones
* (+) no need to worry about disjointness of writes by parallel tasks
* (+) efficient to combine two trees
* (-) high memory allocation overhead
* (-) bad locality


It is instructive to compare arrays with immutable trees as collections on which we perform operations such as map.

Arrays are appealing because of their simplicity and because we can access their elements in arbitrary order.
When tasks execute on the same shared memory, we only need to pass each task a reference to the array and an indication of which indices it is supposed to process.
While a task processes a particular region of the array, it obtains good memory locality, because array elements are stored contiguously in memory.
On the other hand, because we fill these arrays imperatively, we have to make sure that tasks executing in parallel write to disjoint parts of the array. Otherwise, we obtain unpredictable results that depend on the order in which the writes are performed. That is one disadvantage of arrays. Another is that if we obtain arrays from completely separate computations and later want to put them together, we necessarily have to copy some of their elements.

Now consider the corresponding properties of immutable trees.

With immutable trees, operations such as map produce new trees and keep the old ones. We can therefore continue to use the old versions of the data, which is useful in many applications. Moreover, because an operation such as map produces a new tree, it is easier to ensure that it does not write to the same parts of memory as another operation executing in parallel.
It is also easy to combine two trees: all we need to do is create a new node that has the two trees as its subtrees. Even if we need to maintain balance, this can be done reasonably efficiently.

The negative aspects of immutable trees are high memory allocation overhead and bad locality, because different parts of a tree may, in principle, be stored in completely different parts of memory.
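
The difference in the cost of combining can be made concrete with a small sketch. `CombineCost` and its value names are introduced only for this example, and the tree classes repeat the definitions from this section so that the block stands alone.

```scala
object CombineCost {
  sealed abstract class Tree[A] { val size: Int }
  case class Leaf[A](a: Array[A]) extends Tree[A] {
    override val size = a.length
  }
  case class Node[A](l: Tree[A], r: Tree[A]) extends Tree[A] {
    override val size = l.size + r.size
  }

  val xs = Array(1, 3, 8)
  val ys = Array(5, 2)

  // Arrays: concatenation allocates a new array and copies every element.
  val joinedArray: Array[Int] = xs ++ ys

  // Trees: combining allocates a single node; the operands are shared.
  val left: Tree[Int] = Leaf(xs)
  val right: Tree[Int] = Leaf(ys)
  val joinedTree: Node[Int] = Node(left, right)

  // Reference equality confirms that nothing was copied.
  val sharesOperands: Boolean =
    (joinedTree.l eq left) && (joinedTree.r eq right)
}
```

Combining two trees this way takes constant time, but repeated combination without rebalancing can produce an unbalanced tree, which is why the balancing cost mentioned above matters in practice.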