### The [Apache Mahout](http://mahout.apache.org/)™ project's goal is to build an environment for quickly creating scalable, performant machine learning applications.

#### Apache Mahout software provides three major features:

- A simple and extensible programming environment and framework for building scalable algorithms
- A wide variety of premade algorithms for Scala + Apache Spark, H2O, Apache Flink
- Samsara, a vector math experimentation environment with R-like syntax which works at scale

#### In other words:

*Apache Mahout provides a unified API for quickly creating machine learning algorithms on a variety of engines.*

#### Getting Started

Apache Mahout is a collection of libraries that enhance Apache Flink, Apache Spark, and others. Currently Zeppelin supports the Flink and Spark engines. A convenience script is provided to set up the necessary imports and configurations to run Mahout on Spark and Flink.

We can use Apache Mahout's R-like Domain Specific Language (DSL) inline with native Flink or Spark code. First, however, we must declare a few imports, which differ between Spark and Flink.

__References:__

- [Mahout-Samsara's In-Core Linear Algebra DSL Reference](http://mahout.apache.org/users/environment/in-core-reference.html)
- [Mahout-Samsara's Distributed Linear Algebra DSL Reference](http://mahout.apache.org/users/environment/out-of-core-reference.html)
- [Getting Started with the Mahout-Samsara Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html)


#### "Installing" the Apache Mahout dependencies and configuring a new Spark and Flink interpreter

The following two paragraphs are convenience paragraphs. You **only need to run them once** to create two new interpreters, `%spark.mahout` and `%flink.mahout`. They are intended for users who don't already have Apache Mahout installed. They assume you started Apache Zeppelin from either the top-level directory or the `bin` directory. You can tell which applies to you by whether you started Zeppelin with `bin/zeppelin-daemon.sh start` (top-level directory) or `./zeppelin-daemon.sh start` (`bin` directory). If you started Zeppelin from somewhere else, you will need to run them from the command line.

They both run a Python script, which may be found at `ZEPPELIN_HOME/scripts/mahout/add_mahout.py`.

In short, this script:
- Downloads Apache Mahout.
- Creates a new Flink interpreter with dependencies.
- Creates a new Spark interpreter with dependencies and modified configuration to use Kryo serialization.
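
As a rough illustration of the last step, switching a Spark interpreter to Kryo serialization comes down to setting a few Spark properties. The sketch below is not the code of `add_mahout.py`. The helper name `with_mahout_kryo`, the sample base properties, and the exact property set are assumptions for illustration, so check the script itself for the values it actually writes.

```python
# Hypothetical helper: overlay Kryo-related settings on a dict of Spark
# interpreter properties. The chosen properties are an assumption, not a
# copy of what add_mahout.py writes.
MAHOUT_KRYO_PROPERTIES = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
}


def with_mahout_kryo(properties):
    """Return a copy of `properties` with the Kryo settings applied."""
    merged = dict(properties)  # copy, so the caller's dict is left untouched
    merged.update(MAHOUT_KRYO_PROPERTIES)
    return merged


# Illustrative base properties for a plain Spark interpreter (made up here).
base = {"master": "local[*]", "spark.app.name": "Zeppelin"}
mahout_spark = with_mahout_kryo(base)

for key in sorted(mahout_spark):
    print("{} = {}".format(key, mahout_spark[key]))
```

Keeping the overlay in a separate dict makes it easy to see exactly which settings distinguish the Mahout-enabled interpreter from the stock one.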

__You only need to run this script once.__ You may need to run it again if `conf/interpreter.json` is deleted.


After the interpreters are created, you will need to 'bind' them by clicking on the little gear in the top right corner, scrolling to the top, and clicking on `mahoutFlink` and `mahoutSpark` so that they are highlighted in blue.

#### Running Mahout code

You will need to import certain libraries, and declare the _Mahout Distributed Context_ when you first start your notebook using the interpreters. 

If using Apache Flink the code you need to run is:
```scala
%flinkMahout

import org.apache.flink.api.scala._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.flinkbindings._
import org.apache.mahout.math._
import scalabindings._
import RLikeOps._


implicit val ctx = new FlinkDistributedContext(benv)
```

If using Apache Spark the code you need to run is:
```scala
%sparkMahout

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
```

__Note: For Apache Mahout on Apache Spark you must be running Spark 1.5.x or 1.6.x. We are working hard on supporting Spark 2.0.__
In the meantime, feel free to play with Mahout on Flink and then simply _copy and paste your Mahout code to Spark once it is supported!_

### A Side by Side Example


### Taking advantage of Zeppelin Resource Pools

One of the major motivations for integrating Apache Mahout with Apache Zeppelin was the many benefits that come from leveraging the resource pools. A resource pool is a block of memory that can be accessed by all interpreters and is useful for sharing small variables between the interpreters.

The Spark interpreter has a simple interface for accessing the resource pools. The Flink interface is less documented but can be reverse engineered from the code (thanks, open source!).


- Collect betas from Spark and Flink, then compare them in Python
- Create a matrix in Flink and Spark, then visualize it with R

```python
%spark.pyspark

import ast

# The betas were stored in the resource pool as strings; parse them back into Python objects
flinkBetaDict = ast.literal_eval(z.get("flinkBeta"))
sparkBetaDict = ast.literal_eval(z.get("sparkBeta"))

print("----------------- differences between betas calculated in Flink and Spark -----------------")
for i in range(0, 4):
    print("beta {}: {}".format(i, flinkBetaDict[i] - sparkBetaDict[i]))
```
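
The comparison above only works if each pooled value is a string that `ast.literal_eval` can parse into something indexable by coefficient number. The standalone sketch below shows that round trip outside Zeppelin. The payload strings, their dict-literal shape, and the helper name `beta_differences` are made up for illustration; the real values come from `z.get`.

```python
import ast


def beta_differences(flink_payload, spark_payload):
    """Parse two string payloads and return the per-coefficient differences."""
    flink_beta = ast.literal_eval(flink_payload)
    spark_beta = ast.literal_eval(spark_payload)
    return {i: flink_beta[i] - spark_beta[i] for i in sorted(flink_beta)}


# Hypothetical payloads, standing in for z.get("flinkBeta") and z.get("sparkBeta").
flink_payload = "{0: 1.5, 1: -0.25, 2: 3.0, 3: 0.5}"
spark_payload = "{0: 1.5, 1: -0.25, 2: 2.5, 3: 0.5}"

diffs = beta_differences(flink_payload, spark_payload)
for i in sorted(diffs):
    print("beta {}: {}".format(i, diffs[i]))
```

`ast.literal_eval` only accepts Python literals, so unlike `eval` it will not execute arbitrary code that happens to end up in the pool.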

## Plotting Mahout with R

The following examples show how we can leverage R to plot our results from Mahout.


```r
%spark.r {"imageWidth": "400px"}

library("ggplot2")

# Fetch the tab-separated samples stored in the resource pool
flinkSinStr <- z.get("flinkSinDrm")
sparkSinStr <- z.get("sparkSinDrm")

flinkData <- read.table(text= flinkSinStr, sep="\t", header=FALSE)
sparkData <- read.table(text= sparkSinStr, sep="\t", header=FALSE)

# Plot the Flink samples in red
plot(flinkData,  col="red")
# Overlay the Spark samples in blue
points(sparkData, col="blue")

# Add a title in a black, bold/italic font
title(main="Sampled Mahout Sin Graph in R", col.main="black", font.main=4)

legend("bottomright", c("Apache Flink", "Apache Spark"), col= c("red", "blue"), pch= c(22, 22))
```
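
The R paragraph above reads each pooled value with `read.table(..., sep="\t")`, so it expects a string with one matrix row per line and tab-separated columns. How the Flink and Spark paragraphs build those strings is not shown here. A minimal Python sketch of the format, using the hypothetical helper names `matrix_to_tsv` and `tsv_to_matrix` and made-up sample data, might look like this:

```python
import math


def matrix_to_tsv(rows):
    """Serialize rows of numbers as one tab-separated line per row."""
    return "\n".join("\t".join(repr(float(v)) for v in row) for row in rows)


def tsv_to_matrix(text):
    """Parse tab-separated text back into a list of rows of floats."""
    return [[float(cell) for cell in line.split("\t")]
            for line in text.splitlines() if line.strip()]


# Hypothetical sample: a few (x, sin(x)) pairs standing in for the sampled matrix.
samples = [(x / 2.0, math.sin(x / 2.0)) for x in range(5)]

payload = matrix_to_tsv(samples)   # the kind of string that would go into the pool
restored = tsv_to_matrix(payload)  # what a consumer recovers from it

print(payload)
```

Passing plain text keeps the value readable from any interpreter, which is why this works across Scala, Python, and R paragraphs alike.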



```r
%spark.r {"imageWidth": "400px"}

library(scatterplot3d)

flinkGaussStr <- z.get("flinkGaussDrm")
flinkData <- read.table(text= flinkGaussStr, sep="\t", header=FALSE)

scatterplot3d(flinkData, color="green")
```



**NOTE:** To install `scatterplot3d` on Ubuntu, use:

```sh
sudo apt-get install r-cran-scatterplot3d
```

