# 04: Matrix - An Exercise in Parallelism

An early use for Spark has been Machine Learning. Spark's `MLlib` library of algorithms contains classes for vectors and matrices, which are important for many ML algorithms. This exercise uses a simpler representation of matrices to explore another topic: explicit parallelism.

The sample data is generated internally; no input is read. The output is written to the file system as before.

See the corresponding Spark job [Matrix4.scala](https://github.com/deanwampler/spark-scala-tutorial/blob/master/src/main/scala/sparktutorial/Matrix4.scala).

Let's start with a class to represent a Matrix.

In [1]:
/**
 * A special-purpose matrix case class. Each cell is given the value
 * i*n + j for indices (i,j), counting from 0, where n is the number of columns.
 * Note: Must be serializable, which is automatic for case classes.
 */
case class Matrix(m: Int, n: Int) {
  assert(m > 0 && n > 0, "m and n must be > 0")

  private def makeRow(start: Long): Array[Long] =
    Array.iterate(start, n)(i => i+1)

  private val repr: Array[Array[Long]] =
    Array.iterate(makeRow(0), m)(rowi => makeRow(rowi(0) + n))

  /** Return row i, <em>indexed from 0</em>. */
  def apply(i: Int): Array[Long]  = repr(i)

  /** Return the (i,j) element, <em>indexed from 0</em>. */
  def apply(i: Int, j: Int): Long = repr(i)(j)

  private val cellFormat = {
    val maxEntryLength = (m*n - 1).toString.length
    s"%${maxEntryLength}d"
  }

  private def rowString(rowI: Array[Long]) =
    rowI map (cell => cellFormat.format(cell)) mkString ", "

  override def toString = repr map rowString mkString "\n"
}

defined class Matrix
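
To see how the nested `Array.iterate` calls fill the matrix, here is a minimal standalone sketch of the same construction. It assumes a small 3 x 10 example; the top-level names `n`, `rows`, `width`, and `firstCell` are illustrative and not part of `Matrix`.

```scala
val n = 10  // number of columns (illustrative value)

// Array.iterate(start, len)(f) yields start, f(start), f(f(start)), ... with len elements.
def makeRow(start: Long): Array[Long] = Array.iterate(start, n)(_ + 1)

// Each new row starts n past the first element of the previous row.
val rows: Array[Array[Long]] =
  Array.iterate(makeRow(0), 3)(prev => makeRow(prev(0) + n))

// Pad every cell to the width of the largest entry, which is 3*n - 1 here.
val width = (3 * n - 1).toString.length
val firstCell = s"%${width}d".format(rows(0)(0))

println(rows.map(_.mkString(", ")).mkString("\n"))
```

The cell at `(i, j)` ends up holding `i*n + j`, which is why `rows(1)(3)` is 13 in this sketch.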


Some variables:

In [2]:
val nRows = 5
val nCols = 10
val out = "output/matrix4"

nRows = 5
nCols = 10
out = output/matrix4


output/matrix4

Let's create a matrix.

In [3]:
val matrix = Matrix(nRows, nCols)

matrix = 


 0,  1,  2,  3,  4,  5,  6,  7,  8,  9
10, 11, 12, 13, 14, 15, 16, 17, 18, 19
20, 21, 22, 23, 24, 25, 26, 27, 28, 29
30, 31, 32, 33, 34, 35, 36, 37, 38, 39
40, 41, 42, 43, 44, 45, 46, 47, 48, 49

With a Scala data structure like this, we can use `SparkContext.parallelize` to convert it into an `RDD`. In this case, we'll actually create an `RDD` of row indices, `1 to nRows`, rather than of the matrix itself. Then we'll map over that `RDD` and use each index to compute the sum and average of the corresponding row. The `matrix` is captured by the function passed to `map`, which is why it must be serializable. Finally, we'll "collect" the results back to an `Array` in the driver.

In [4]:
val sums_avgs = sc.parallelize(1 to nRows).map { i =>
  // Matrix indices count from 0.
  val sum = matrix(i-1) reduce (_ + _)  // Recall that "_ + _" is the same as "(i1, i2) => i1 + i2".
  (sum, sum/nCols)                      // RDD[(sum, average)]; Long division truncates the average
}.collect                               // ... then convert to an array

sums_avgs = Array((45,4), (145,14), (245,24), (345,34), (445,44))


[(45,4), (145,14), (245,24), (345,34), (445,44)]
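
As a sanity check on the output above, the row sums have a simple closed form, because row `i` (counting from 0) holds the consecutive values `i*nCols` through `i*nCols + nCols - 1`. The following sketch runs locally without Spark and recomputes the expected pairs. The names `rowSum`, `expected`, and `exactAverages` are illustrative and not part of the tutorial's code.

```scala
val nRows = 5
val nCols = 10

// Sum of row i (0-based): nCols copies of i*nCols, plus 0 + 1 + ... + (nCols - 1).
def rowSum(i: Int): Long =
  nCols.toLong * i * nCols + nCols.toLong * (nCols - 1) / 2

// Same (sum, average) pairs as the Spark job, using Long division.
val expected: Array[(Long, Long)] =
  (0 until nRows).map { i =>
    val sum = rowSum(i)
    (sum, sum / nCols)
  }.toArray

// Converting to Double first keeps the fractional part of each average.
val exactAverages: Array[Double] =
  (0 until nRows).map(i => rowSum(i).toDouble / nCols).toArray

println(expected.mkString(", "))
println(exactAverages.mkString(", "))
```

Because `sum` and `nCols` are integral, `sum/nCols` discards the fractional part; converting with `toDouble` before dividing is one way to keep it.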

## Recap

`SparkContext.parallelize` is a convenient way to convert a local data structure into an RDD.

## Exercises

### Exercise 1: Try different values of nRows and nCols

### Exercise 2: Try other statistics, like standard deviation

The code for the standard deviation that you would add is the following:

```scala
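val row = matrix(i-1)
...
val mean = row.sum.toDouble / nCols                  // toDouble avoids truncating Long division
val sumsquares = row.map(x => x*x).reduce(_+_)
// Population standard deviation: sqrt(mean of squares - square of mean)
val stddev = math.sqrt(sumsquares.toDouble/nCols - mean*mean)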
```
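
If you want to test a standard deviation calculation outside Spark first, here is a minimal local sketch. It uses a two-pass variant (compute the mean, then average the squared deviations), which is one common way to reduce rounding error when values are large. The helper `populationStdDev` and the array `row0` are hypothetical and not part of `Matrix4.scala`; `row0` stands in for the first matrix row.

```scala
// Hypothetical helper: population standard deviation of one row.
def populationStdDev(row: Array[Long]): Double = {
  val n = row.length
  val mean = row.sum.toDouble / n
  // Second pass: average of squared deviations from the mean.
  val variance = row.map { x => val d = x - mean; d * d }.sum / n
  math.sqrt(variance)
}

// 0, 1, ..., 9, the same values as the first row of the 5 x 10 matrix.
val row0: Array[Long] = Array.iterate(0L, 10)(_ + 1)
val sd0 = populationStdDev(row0)

println(sd0)
```

This divides by `n` (population form); dividing by `n - 1` instead would give the sample standard deviation.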

Given the synthesized data in the matrix, are the average and standard deviation actually very meaningful here, if this were representative of real data?