# Spark MLlib
*Official documentation [here](https://spark.apache.org/docs/latest/mllib-guide.html).*

In [None]:
import org.apache.spark.SparkContext

classOf[SparkContext].getPackage.getImplementationVersion
val sqlContext = new org.apache.spark.sql.SQLContext(spark)

/*
val hiveContext = new org.apache.spark.sql.hive.HiveContext(spark)

hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'small_data/employer/hashmap.csv' INTO TABLE src")

// Queries are expressed in HiveQL
hiveContext.sql("FROM src SELECT key, value").collect().foreach(println)
*/

In [None]:
def time[R](block: => R): R = {
    val start: Long = System.nanoTime()
    val result = block
    val end: Long = System.nanoTime()
    val duration: Double = (end - start) / 1000000000.0  // what happens if you leave off the decimal point?
    println("Elapsed time: " + duration + "s")
    result
}

## MLlib Data Types
### Vector
A mathematical vector. MLlib supports both dense vectors, where every entry is stored, and sparse vectors, where only the nonzero entries are stored to save space. Vectors can be constructed with the mllib.linalg.Vectors class.

In [None]:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

assert(sv1 == sv2)
assert(dv == sv1)

**Note:** you cannot easily add these vectors!  They are really meant to be used only as inputs to machine-learning algorithms.  Convert them to [Breeze](https://github.com/scalanlp/breeze) vectors for basic linear algebra manipulations.  Like [numpy](http://www.numpy.org/), Breeze uses low-level libraries like [BLAS](http://www.netlib.org/blas/) and [LAPACK](http://www.netlib.org/lapack/) under the hood.  It is not as full-fledged as numpy, and its type safety can be a little annoying.

In [None]:
// NOTE: you cannot easily add these vectors!

val x = Vectors.dense((1 to 10).map(_.toDouble).toArray)
val y = Vectors.dense((1 to 10).map(_.toDouble).toArray)
// x + y (this will fail!)

import breeze.linalg.DenseVector

val bx = new DenseVector(x.toArray)
val by = new DenseVector(y.toArray)
val sum = Vectors.dense((bx + by).toArray)

### LabeledPoint
A labeled data point for supervised learning algorithms such as classification and regression. Includes a feature vector and a label (which is a floating-point value). Located in the mllib.regression package.

In [None]:
import org.apache.spark.mllib.regression.LabeledPoint

val labelPoint1 = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

val labelPoint2 = LabeledPoint(1.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

println(labelPoint1.label)
println(labelPoint1.features)
assert(labelPoint1 == labelPoint2)

### Local matrix
*Integer*-typed row and column indices and double-typed values, stored on a single machine. 

In [None]:
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
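
`Matrices.dense` reads its value array in column-major order, so the array above describes the 3 x 2 matrix with rows (1, 2), (3, 4), (5, 6). A minimal plain-Scala sketch of that indexing rule follows; the helper `entry` is hypothetical and only mirrors the layout, it is not part of MLlib.

```scala
// Column-major layout: entry (i, j) lives at position i + j * numRows.
val numRows = 3
val numCols = 2
val colMajorValues = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)

// Hypothetical helper mirroring the layout; not an MLlib function.
def entry(i: Int, j: Int): Double = colMajorValues(i + j * numRows)

// Rebuild the rows to see the matrix as it is usually written.
val matrixRows = (0 until numRows).map(i => (0 until numCols).map(j => entry(i, j)))
matrixRows.foreach(r => println(r.mkString(" ")))
```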

### Distributed matrix 
*Long*-typed row and column indices and double-typed values, stored in a distributed fashion in one or more RDDs.  They come in four formats:

- **RowMatrix:** Row-oriented distributed matrix without meaningful row indices, e.g. a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge for a RowMatrix so that a single local vector can be reasonably communicated to the driver and can also be stored / operated on using a single node. 

In [None]:
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = spark.parallelize(List(
    Vectors.dense(1.0, 2.0),
    Vectors.dense(2.0, 3.0),
    Vectors.dense(3.0, 4.0)
))
val rowMatrix: RowMatrix = new RowMatrix(rows)

// return size
val m = rowMatrix.numRows()
val n = rowMatrix.numCols()

println(s"rows: $m, cols: $n")

- **IndexedRowMatrix:** Similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.

In [None]:
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, IndexedRow}

val indexedRows = spark.parallelize(List(
    IndexedRow(0L, Vectors.dense(1.0, 2.0)),
    IndexedRow(1L, Vectors.dense(2.0, 3.0)),
    IndexedRow(3L, Vectors.dense(4.0, 5.0))
))

val indexedRowMatrix: IndexedRowMatrix = new IndexedRowMatrix(indexedRows)

// return size
val m = indexedRowMatrix.numRows()
val n = indexedRowMatrix.numCols()

println(s"rows: $m, cols: $n")

- **CoordinateMatrix:** (i.e. a distributed sparse matrix) is (essentially) a list of `(Long, Long, Double)`.

In [None]:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

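// Entries of a 3 x 3 identity matrix as (row, column, value) triples.
// Note: the trailing-dot literal `1.` is not valid in newer Scala versions, so use `1.0`.
val matrixEntries = spark.parallelize(List(
    MatrixEntry(0, 0, 1.0),
    MatrixEntry(1, 1, 1.0),
    MatrixEntry(2, 2, 1.0)
))

val coordinateMatrix: CoordinateMatrix = new CoordinateMatrix(matrixEntries)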

val m = coordinateMatrix.numRows()
val n = coordinateMatrix.numCols()

println(s"rows: $m, cols: $n")

- **BlockMatrix:** A distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of `((Int, Int), Matrix)`: the `(Int, Int)` is the index of the block, and the `Matrix` is the sub-matrix at that index, with size `rowsPerBlock` x `colsPerBlock`.

In [None]:
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

val blockMatrix: BlockMatrix = coordinateMatrix.toBlockMatrix()
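
To make the block indexing concrete, here is a rough plain-Scala sketch of how a global entry `(i, j)` maps to a block index and an offset inside that block. The block dimensions and the helper `locate` are assumptions chosen for illustration; they are not MLlib code.

```scala
// Assumed block dimensions, for illustration only.
val rowsPerBlock = 2
val colsPerBlock = 2

// Hypothetical helper: which block holds global entry (i, j), and where inside it.
def locate(i: Long, j: Long): ((Int, Int), (Int, Int)) = {
  val blockIndex = ((i / rowsPerBlock).toInt, (j / colsPerBlock).toInt)
  val offset = ((i % rowsPerBlock).toInt, (j % colsPerBlock).toInt)
  (blockIndex, offset)
}

// The three diagonal entries of the 3 x 3 identity matrix built above.
val placements = Seq((0L, 0L), (1L, 1L), (2L, 2L)).map { case (i, j) => (i, j) -> locate(i, j) }
placements.foreach(println)
```

With 2 x 2 blocks, the first two diagonal entries share block `(0, 0)` while the third falls into block `(1, 1)`.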

Note: because a distributed matrix is spread across the cluster, converting between matrix formats may require a global shuffle, which is expensive!

### Rating
A rating of a product by a user, used in the `mllib.recommendation` package for product recommendation.

## Machine learning in MLlib

Spark supports a number of machine-learning algorithms.

- Classification and Regression
    - Linear SVM, logistic regression
    - Linear regression (including ridge and lasso)
    - Naive Bayes
    - Decision Trees
    - Random Forests and Gradient-Boosted Trees
- Clustering
    - K-means (and streaming K-means)
    - Gaussian Mixture Models
    - Latent Dirichlet Allocation
- Dimensionality Reduction
    - SVD and PCA
- Lower-level optimization primitives
    - Stochastic Gradient Descent
    - Limited-memory BFGS (L-BFGS)

### Parallelized SGD

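For linear models like SVM, Linear Regression, and Logistic Regression, the objective we're trying to optimize is essentially a sum (or average) of independent error terms, one per data point. That structure is what makes the gradient easy to parallelize.  For example, in linear regression with unit-variance Gaussian noise, the log-likelihood is $-\frac{1}{2}\sum_j \|y_j - X_{j \cdot} \cdot \beta\|^2$ up to an additive constant, so recall that the gradient is

$$\begin{align}
\frac{\partial \log(L(\beta))}{\partial \beta} &= -\frac{\partial}{\partial \beta} \frac{1}{2}\sum_j \|y_j - X_{j \cdot} \cdot \beta\|^2 \\
&= -\frac{1}{2}\sum_j \frac{\partial}{\partial \beta} \|y_j - X_{j \cdot} \cdot \beta\|^2 \\
& = \sum_j (y_j - X_{j \cdot} \cdot \beta) \, X_{j \cdot}^T \\
& \approx \frac{n}{|S|} \sum_{j \in S} (y_j - X_{j \cdot} \cdot \beta) \, X_{j \cdot}^T
\end{align}$$

Here $S$ is a random sample of the $n$ training instances, and the factor $n / |S|$ rescales the sampled sum so that it estimates the full sum.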

The key *mathematical properties* we have used are:

1. the error functions are the sum of error contributions of different training instances
1. linearity of the derivative
1. associativity of addition
1. downsampling giving an unbiased estimator (once rescaled by the sampling fraction)

Since the last sum is over the different training instances and these are stored on different nodes, we can parallelize the computation of the gradient in SGD across multiple nodes.  Of course, we still need to maintain the running weight $\beta$ that has to be present on every node (through a broadcast variable that is updated).  Notice that SVM, Linear Regression, and Logistic Regression all have error functions that are just sums over training instances so SGD can be used for all these algorithms.

Spark's [implementation](http://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd) uses a tunable minibatch fraction parameter (`miniBatchFraction`) to sample a fraction of the training RDD. For each iteration, the current weights are broadcast to the executors, the gradient contribution is calculated for each sampled data point, and the contributions are sent back to be aggregated into the weight update.

Parallelization handles an increasing number of sampled data points m quite well, since there are no interaction terms between data points and each gradient contribution can be computed independently. Controlling how the algorithm iterates to convergence is also important, and can be done with the parameters for the number of iterations and the step size.
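
The map-and-aggregate structure described above can be sketched in plain Scala, with a `Seq` of chunks standing in for RDD partitions. This is only an illustrative sketch, not Spark's implementation: the names `Point`, `pointGradient`, and `add`, the noise-free synthetic data, and the values chosen for the step size and minibatch fraction are all assumptions.

```scala
import scala.util.Random

// One training instance: feature row x and target y.
case class Point(x: Array[Double], y: Double)

// Gradient contribution of a single point to the log-likelihood: (y - x . beta) * x
def pointGradient(p: Point, beta: Array[Double]): Array[Double] = {
  val residual = p.y - p.x.zip(beta).map { case (a, b) => a * b }.sum
  p.x.map(_ * residual)
}

// Element-wise vector addition, used to combine gradients.
def add(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (u, v) => u + v }

val rng = new Random(42)
val trueBeta = Array(2.0, -1.0)

// Noise-free synthetic data, so the recovered weights can be checked.
val points = Seq.fill(1000) {
  val x = Array(rng.nextDouble(), rng.nextDouble())
  Point(x, x.zip(trueBeta).map { case (a, b) => a * b }.sum)
}

// Stand-in for RDD partitions living on different nodes.
val partitions = points.grouped(250).toSeq

val miniBatchFraction = 0.2
val stepSize = 0.5
var beta = Array(0.0, 0.0) // plays the role of the broadcast weights

for (_ <- 1 to 500) {
  // "Map" step: each partition samples its points and sums their gradients locally.
  val partials = partitions.map { part =>
    val sample = part.filter(_ => rng.nextDouble() < miniBatchFraction)
    (sample.map(pointGradient(_, beta)).foldLeft(Array(0.0, 0.0))(add), sample.size)
  }
  // "Reduce" step: addition is associative, so partial sums can be combined in any order.
  val (gradSum, count) = partials.reduce { (l, r) => (add(l._1, r._1), l._2 + r._2) }
  // Step along the averaged minibatch gradient (ascent on the log-likelihood).
  if (count > 0) beta = add(beta, gradSum.map(_ * stepSize / count))
}

println(beta.mkString(", "))
```

Only the per-partition sums and counts travel back to the driver, which is why the cost of aggregation does not grow with the number of data points in each partition.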

In [None]:
import org.apache.spark.mllib.regression.LinearRegressionWithSGD 
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// parameters
val trainingIterations = 10
val trainingFraction = .6

// generate data
val data = spark.parallelize(1 to 10000).map( _ =>
    LabeledPoint(
        math.random,
        Vectors.dense(math.random, math.random, math.random)
    )
)

// split the training and test sets
val (training, test) = {
    val splits = data.randomSplit(
        Array(trainingFraction, 1.0 - trainingFraction),
        seed = 42L
    )
    (splits(0).cache(), splits(1))
}

// train model
val model = LinearRegressionWithSGD.train(training, trainingIterations)

// get r2 score
val r2 = {
    val predictionAndLabels = test.map{
        case LabeledPoint(label, features) => 
            (model.predict(features), label)
    }
    new RegressionMetrics(predictionAndLabels).r2
}

## Spark ML
Spark ML implements the ideas of transformers, estimators, and pipelines by standardizing APIs across machine learning algorithms. This can streamline more complex workflows.

The core functionality includes:
* DataFrames - built off Spark SQL, can be created either directly or from RDDs as seen above
* Transformers - algorithms that accept a DataFrame as input and return a DataFrame as output
* Estimators - algorithms that accept a DataFrame as input and return a Transformer as output
* Pipelines - chaining together Transformers and Estimators
* Parameters - common API for specifying hyperparameters

For example, a learning algorithm can be implemented as an Estimator which trains on a DataFrame of features and returns a Transformer which can output predictions based on a test DataFrame.
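
The estimator/transformer contract can be illustrated with a toy sketch in plain Scala. The traits and the `MeanCenterer` class below are hypothetical stand-ins that only mimic the shape of the `fit` and `transform` methods; they are not Spark's classes, and a `Seq` of maps plays the role of a DataFrame.

```scala
// Toy stand-ins for the two roles; these are NOT Spark's classes.
type Table = Seq[Map[String, Double]] // plays the role of a DataFrame

trait Transformer { def transform(data: Table): Table }
trait Estimator { def fit(data: Table): Transformer }

// An estimator that learns the mean of a column from the training data...
class MeanCenterer(col: String) extends Estimator {
  def fit(data: Table): Transformer = {
    val mean = data.map(_(col)).sum / data.size
    // ...and returns a transformer that appends the centered column.
    new Transformer {
      def transform(d: Table): Table =
        d.map(row => row + (s"${col}_centered" -> (row(col) - mean)))
    }
  }
}

val trainRows: Table = Seq(Map("x" -> 1.0), Map("x" -> 3.0))
val testRows: Table = Seq(Map("x" -> 5.0))

// fit() sees only the training rows; transform() applies what was learned to new rows.
val centerer = new MeanCenterer("x").fit(trainRows)
val centered = centerer.transform(testRows)
println(centered)
```

The point of the split is that anything learned from data (here, the mean) is captured once in `fit` and then reused unchanged on the test rows.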

Full documentation can be found [here](http://spark.apache.org/docs/latest/ml-guide.html).

*Note:* See the PySpark_MLlib [notebook](PySpark_MLlib.ipynb) for examples (requiring Spark 1.5.1)

### A side note on DataFrame performance and tuning

See [here](http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning) for details.

### A digression: Plotting (in-memory) with Spark and Scala

### Adding external dependencies
TL;DR use the `--packages` flag for external dependencies with Maven coordinates and `--jars` flag for local dependency jars for anything *not* in your `build.sbt`.

- **spark-shell**: Pass 3rd party (Maven) dependencies by their Maven coordinates with `--packages`, and local jars with `--jars`:
```bash
$ spark-shell --master {host} --packages "{group}:{artifact}:{version}" --jars oneJar.jar,anotherJar.jar
```

- **spark-submit**: Include 3rd party library jars with `--jars`. Note that packages with Maven coordinates should preferably be declared in your `build.sbt` if you are using `spark-submit`:
```bash
$ spark-submit --master {host} --jars {someJar}.jar,{anotherJar}.jar --class MyApp path/to/MyApp.jar
```


### Example: plot values in spark-shell

**N.B.** To get plots to appear from DigitalOcean on your local machine, you have to turn on X11 forwarding by ssh'ing into your box with `ssh -X`

- To add dependencies in spark-shell with Maven coordinates: 
```bash
$ spark-shell --master local[4] --packages "{group}:{artifact}:{version}"
```


1. Fire up the spark-shell:
```bash
$ spark-shell --master local[*] --packages "org.scalanlp:breeze-viz_2.10:0.11.2"
```

1. Create a plot that appears in a separate window

```scala
scala> val a = Array(0.3, 0.9, 1.4, 34.21, -3.4)
a: Array[Double] = Array(0.3, 0.9, 1.4, 34.21, -3.4)

scala> import breeze.plot._
import breeze.plot._

scala> val f = Figure()
f: breeze.plot.Figure = breeze.plot.Figure@73543048

scala> val p = f.subplot(0)
p: breeze.plot.Plot = breeze.plot.Plot@2e549515

scala> p += plot(a,a,'.')
res4: breeze.plot.Plot = breeze.plot.Plot@2e549515

scala> val g = breeze.stats.distributions.Gaussian(0,1)
...
scala> val p2 = f.subplot(2,1,1)
...
scala> p2 += hist(g.sample(100000),100)
```

### Example: plot a distribution

Check out Cloudera's [`KernelDensity.estimate` implementation](https://github.com/sryza/aas/tree/master/ch09-risk/src/main/scala/com/cloudera/datascience/risk)

```scala
import breeze.plot._
import com.cloudera.datascience.risk._

object AnalysisClass {
    ...
  def plotDistribution(samples: Array[Double]): Figure = {
    val min = samples.min
    val max = samples.max
    // Using toList before toArray avoids a Scala bug
    val domain = Range.Double(min, max, (max - min) / 100).toList.toArray
    val densities = KernelDensity.estimate(samples, domain)
    val f = Figure()
    val p = f.subplot(0)
    p += plot(domain, densities)
    p.ylabel = "Density"
    f
  }
}
```

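In order to run this, include the *breeze* dependencies in your `build.sbt` (including `breeze-viz`, which provides `breeze.plot`), and include the path to the `com.cloudera.datascience` package in the `--jars` flag of spark-submit.

```scala
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2",
  // breeze.plot lives in the separate breeze-viz artifact
  "org.scalanlp" %% "breeze-viz" % "0.11.2"
)
```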

### Exercise: Use SVM to predict colon cancer from gene expressions
You can start getting a feel for the MLlib operations by following the [Spark docs example](https://spark.apache.org/docs/1.3.0/mllib-linear-methods.html#linear-support-vector-machines-svms) on this dataset.

#### About the data format: LibSVM
MLlib conveniently provides a data loading method, `MLUtils.loadLibSVMFile()`, for the LibSVM format for which many other languages (R, Matlab, etc.) also have loading methods.  
A dataset of *n* features will have one row per datum, with the label and values of each feature organized as follows:
>{label} 1:{value} 2:{value} ... n:{value}

Take these two datapoints with six features and labels of -1 and 1 respectively as an example:
>-1.000000  1:2.080750 2:1.099070 3:0.927763 4:1.029080 5:-0.130763 6:1.265460  
1.000000  1:1.109460 2:0.786453 3:0.445560 4:-0.146323 5:-0.996316 6:0.555759 
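
To see what a loader has to do with such a line, here is a minimal plain-Scala sketch of a parser. The function `parseLibSVMLine` is hypothetical and written only for illustration; in practice you would use `MLUtils.loadLibSVMFile()`. LibSVM feature indices are 1-based, so the sketch shifts them to 0-based positions.

```scala
// Hypothetical parser for one LibSVM line; MLlib's own loader is MLUtils.loadLibSVMFile.
def parseLibSVMLine(line: String): (Double, Array[Int], Array[Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val (indices, values) = tokens.tail.map { tok =>
    val Array(idx, value) = tok.split(":")
    // LibSVM indices are 1-based; shift to 0-based for array-style storage.
    (idx.toInt - 1, value.toDouble)
  }.unzip
  (label, indices, values)
}

val (label, featureIndices, featureValues) = parseLibSVMLine(
  "-1.000000  1:2.080750 2:1.099070 3:0.927763 4:1.029080 5:-0.130763 6:1.265460")

println(s"label=$label indices=${featureIndices.mkString(",")} values=${featureValues.mkString(",")}")
```

The resulting index and value arrays have the same shape as the arguments to `Vectors.sparse`, which is why the format pairs naturally with sparse feature vectors.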

#### About the colon-cancer dataset
This dataset was introduced in the 1999 paper [Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.](http://www.pnas.org/content/96/12/6745.short)  

Here's the abstract of the paper:  
> *Oligonucleotide arrays can provide a broad picture of the state of the cell, by monitoring the expression level of thousands of genes at the same time. It is of interest to develop techniques for extracting useful information from the resulting data sets. Here we report the application of a two-way clustering method for analyzing a data set consisting of the expression patterns of different cell types. Gene expression in 40 tumor and 22 normal colon tissue samples was analyzed with an Affymetrix oligonucleotide array complementary to more than 6,500 human genes. An efficient two-way clustering algorithm was applied to both the genes and the tissues, revealing broad coherent patterns that suggest a high degree of organization underlying gene expression in these tissues. Coregulated families of genes clustered together, as demonstrated for the ribosomal proteins. Clustering also separated cancerous from noncancerous tissue and cell lines from in vivo tissues on the basis of subtle distributed patterns of genes even when expression of individual genes varied only slightly between the tissues. Two-way clustering thus may be of use both in classifying genes into functional groups and in classifying tissues based on gene expression.*

There are 2000 features, 62 data points (40 tumor (label=0), 22 normal (label=1)), and 2 classes (labels) for the colon cancer dataset. 

#### Exit Tickets
1. When would you use `org.apache.spark.mllib.linalg.Vector` versus `breeze.linalg.DenseVector`?
1. Why can SVM, Linear Regression, and Logistic Regression be parallelized?  How would you parallelize KMeans?


*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*