# Plongeur

A *topological data analysis* library.

> Core algorithm written in [Scala](http://www.scala-lang.org/), using Apache [Spark](http://spark.apache.org/).
> 
> Executed in a [Jupyter](http://jupyter.org/) notebook, using the Apache [Toree](https://github.com/apache/incubator-toree) kernel and [declarative widgets](http://jupyter-incubator.github.io/declarativewidgets/docs.html).
>
> Graphs rendered with [Sigma](http://sigmajs.org/)/[Linkurious](https://github.com/Linkurious/linkurious.js), wrapped in a [Polymer](https://www.polymer-project.org/1.0/) component.
> 
> Reactive machinery powered by [Rx](http://reactivex.io/), via [RxScala](https://github.com/ReactiveX/RxScala).


#### Maven dependencies

In [1]:
%AddDeps org.apache.spark spark-mllib_2.10 1.6.2 --repository file:/Users/tmo/.m2/repository
%AddDeps org.scalanlp breeze-natives_2.10 0.12 --repository file:/Users/tmo/.m2/repository
%AddDeps com.github.haifengl smile-core 1.1.0 --transitive --repository file:/Users/tmo/.m2/repository
%AddDeps io.reactivex rxscala_2.10 0.26.1 --transitive --repository file:/Users/tmo/.m2/repository
%AddDeps com.softwaremill.quicklens quicklens_2.10 1.4.4 --repository file:/Users/tmo/.m2/repository
%AddDeps com.chuusai shapeless_2.10 2.3.0 --repository https://oss.sonatype.org/content/repositories/releases/ --repository file:/Users/tmo/.m2/repository
%AddDeps org.tmoerman plongeur-spark_2.10 0.3.33 --repository file:/Users/tmo/.m2/repository

Marking org.apache.spark:spark-mllib_2.10:1.6.2 for download
Preparing to fetch from:
-> file:/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/toree_add_deps6909012094675380908/
-> file:/Users/tmo/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/toree_add_deps6909012094675380908/https/repo1.maven.org/maven2/org/apache/spark/spark-mllib_2.10/1.6.2/spark-mllib_2.10-1.6.2.jar
Marking org.scalanlp:breeze-natives_2.10:0.12 for download
Preparing to fetch from:
-> file:/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/toree_add_deps6909012094675380908/
-> file:/Users/tmo/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /Users/tmo/.m2/repository/org/scalanlp/breeze-natives_2.10/0.12/breeze-natives_2.10-0.12.jar
Marking com.github.haifengl:smile-core:1.1.0 for download
Preparing to fetch from:
-> file:/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/toree_add_deps6909012094675380908/
-> file:/Users/tmo/.m2/repository

In [2]:
%addjar http://localhost:8888/nbextensions/declarativewidgets/declarativewidgets.jar

Starting download from http://localhost:8888/nbextensions/declarativewidgets/declarativewidgets.jar
Finished download of declarativewidgets.jar


#### Import classes

In [3]:
import rx.lang.scala.{Observer, Subscription, Observable}
import rx.lang.scala.subjects.PublishSubject
import rx.lang.scala.subjects._

import shapeless.HNil

import org.tmoerman.plongeur.tda._
import org.tmoerman.plongeur.tda.Model._
import org.tmoerman.plongeur.tda.cluster.Clustering._
import org.tmoerman.plongeur.tda.cluster.Scale._
import org.tmoerman.plongeur.tda.Colour._
import org.tmoerman.plongeur.tda.Brewer._

import org.tmoerman.plongeur.ui.Controls._

import declarativewidgets._
initWidgets

import declarativewidgets.WidgetChannels.channel

In [4]:
import java.util.concurrent.atomic.AtomicReference

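// Holds at most one live Subscription; storing a new one unsubscribes the previous one.
case class SubRef(ref: AtomicReference[Option[Subscription]] = new AtomicReference[Option[Subscription]](None)) extends Serializable {

    // Atomically swap in the new subscription, then release the one it replaced (if any).
    def update(sub: Subscription): Unit = ref.getAndSet(Option(sub)).foreach(old => old.unsubscribe())

    // Option(null) is None, so this releases the current subscription and stores nothing.
    def reset(): Unit = update(null)

}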
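
The swap-and-release behaviour of `SubRef` can be checked in isolation. The sketch below assumes nothing from RxScala: `FakeSubscription` and `FakeSubRef` are hypothetical stand-ins introduced only for this illustration, mirroring the `update`/`reset` logic above.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for rx.lang.scala.Subscription, so the sketch runs without RxScala.
final class FakeSubscription(val name: String) {
  @volatile var unsubscribed = false
  def unsubscribe(): Unit = { unsubscribed = true }
}

// Same swap-and-release idea as SubRef, written against the stand-in type.
final class FakeSubRef {
  private val ref = new AtomicReference[Option[FakeSubscription]](None)

  // Store the new subscription and unsubscribe whichever one it replaced.
  def update(sub: FakeSubscription): Unit =
    ref.getAndSet(Option(sub)).foreach(_.unsubscribe())

  // Option(null) is None, so this only releases the current subscription.
  def reset(): Unit = update(null)
}

val first  = new FakeSubscription("first")
val second = new FakeSubscription("second")
val holder = new FakeSubRef

holder.update(first)                  // nothing to release yet
holder.update(second)                 // releases `first`
val firstReleasedAfterSwap = first.unsubscribed
val secondAliveAfterSwap   = !second.unsubscribed
holder.reset()                        // releases `second`
```

Re-running a notebook cell that calls `update` therefore replaces the previous subscription rather than leaving it active next to the new one.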

#### Import polymer elements

This cell triggers Bower installations of the specified web components.

If the installation fails, check whether Bower has permission to write to the Jupyter `/nbextensions` folder.

In [5]:
%%html
<link rel='import' href='urth_components/paper-slider/paper-slider.html' is='urth-core-import' package='PolymerElements/paper-slider'>
<link rel='import' href='urth_components/paper-button/paper-button.html' is='urth-core-import' package='PolymerElements/paper-button'>
<link rel='import' href='urth_components/paper-dropdown-menu/paper-dropdown-menu.html' is='urth-core-import' package='PolymerElements/paper-dropdown-menu'>
<link rel='import' href='urth_components/paper-listbox/paper-listbox.html' is='urth-core-import' package='PolymerElements/paper-listbox'>
<link rel='import' href='urth_components/paper-item/paper-item.html' is='urth-core-import' package='PolymerElements/paper-item'>
<link rel='import' href='urth_components/plongeur-graph/plongeur-graph.html' is='urth-core-import' package='tmoerman/plongeur-graph'>
<link rel='import' href='urth_components/urth-viz-scatter/urth-viz-scatter.html' is='urth-core-import'>

#### Reactive TDA Machine

Keep the Rx `Subscription` of each stream in its own `SubRef`, so that re-running a cell replaces the old subscription instead of adding another one.

In [6]:
val in$_subRef = SubRef()

Instantiate a `PublishSubject`. This stream of `TDAParams` instances is the input of the `TDAMachine`. The subscription below pushes every emitted `TDAParams` value to the channel `"ch_TDA_1"` under the `"params"` key.

In [7]:
val in$ = PublishSubject[TDAParams]

in$_subRef.update(in$.subscribe(p => channel("ch_TDA_1").set("params", p.toString)))
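
A publish subject only forwards values to observers that are subscribed when the value is emitted. The sketch below illustrates that behaviour without RxScala: `MiniSubject` is a hypothetical, single-threaded stand-in written for this note, and `channelState` is a plain mutable map playing the role of the widget channel.

```scala
import scala.collection.mutable

// Hypothetical minimal stand-in for a publish subject: it forwards each value
// to the observers registered at that moment and keeps no history.
final class MiniSubject[A] {
  private val observers = mutable.LinkedHashMap.empty[Int, A => Unit]
  private var nextId = 0

  // Register an observer; the returned function removes it again.
  def subscribe(onNext: A => Unit): () => Unit = {
    val id = nextId
    nextId += 1
    observers(id) = onNext
    () => { observers.remove(id); () }
  }

  // Push a value to every currently registered observer.
  def onNext(value: A): Unit = observers.values.toList.foreach(_(value))
}

val paramsStream = new MiniSubject[String]
val channelState = mutable.Map.empty[String, String] // stands in for the widget channel

paramsStream.onNext("emitted before subscribing")    // no observer yet, so it is dropped
val cancel = paramsStream.subscribe(p => channelState("params") = p)
paramsStream.onNext("nrBins=25, overlap=0.3")        // stored under "params"
cancel()
paramsStream.onNext("nrBins=30, overlap=0.3")        // observer removed, nothing changes
```

This is why the subscription is created right after the subject in this notebook: parameters emitted before it exists would never reach the channel.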

#### Initialize RDD

This example uses the MNIST data set, read from a CSV file in which the first column is the digit label and the remaining columns are pixel values.

In [8]:
import org.apache.spark.rdd.RDD
import org.apache.commons.lang.StringUtils.trim
import org.apache.spark.mllib.linalg.Vectors

def readMnist(file: String): RDD[DataPoint] =
    sc.
      textFile(file).
      map(s => {
        val columns = s.split(",").map(trim).toList

        columns match {
          case cat :: rawFeatures =>
            val nonZero =
              rawFeatures.
                map(_.toInt).
                zipWithIndex.
                filter{ case (v, idx) => v != 0 }.
                map{ case (v, idx) => (idx, v.toDouble) }

            val sparseFeatures = Vectors.sparse(rawFeatures.size, nonZero)

            (cat, sparseFeatures)
        }}).
      zipWithIndex.
      map {case ((cat, features), idx) => IndexedDataPoint(idx.toInt, features, Some(Map("cat" -> cat)))}
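
To make the parsing step concrete, here is a Spark-free sketch of what `readMnist` does to a single CSV row. `parseRow` is a hypothetical helper written for this note; it returns the label, the vector size, and the non-zero `(index, value)` pairs that would be handed to `Vectors.sparse`.

```scala
// Parse one "label,pixel0,pixel1,..." row into
// (label, number of features, non-zero (index, value) pairs).
def parseRow(row: String): (String, Int, Seq[(Int, Double)]) = {
  val columns = row.split(",").map(_.trim).toList

  columns match {
    case cat :: rawFeatures =>
      // Keep only the non-zero pixels, remembering their column index.
      val nonZero = rawFeatures
        .map(_.toInt)
        .zipWithIndex
        .collect { case (v, idx) if v != 0 => (idx, v.toDouble) }

      (cat, rawFeatures.size, nonZero)

    case Nil =>
      throw new IllegalArgumentException("empty row")
  }
}

// A made-up row with label 7 and five pixel values.
val (label, size, nonZero) = parseRow("7, 0, 0, 255, 0, 12")
```

Only two of the five pixels are stored here. MNIST images are mostly background, which is the reason for choosing a sparse representation.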

In [9]:
val mnist_path = "/Users/tmo/Work/batiskav/projects/plongeur/scala/plongeur-spark/src/test/resources/mnist/"

val mnist_train = mnist_path + "mnist_train.csv"

In [10]:
val mnistRDD = readMnist(mnist_train)

In [11]:
val mnistSample5pctRDD = mnistRDD.sample(withReplacement = false, fraction = 0.05, seed = 0L).cache

In [12]:
mnistSample5pctRDD.count

3036

In [13]:
val ctx = TDAContext(sc, mnistSample5pctRDD)

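Turn a `TDAResult` into a nested `Map` of nodes and edges that can be set on the channel and rendered by the graph component. Initial node positions are random.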

In [63]:
val r = scala.util.Random

def format(result: TDAResult) = Map(
    "nodes" -> result.clusters.map(c =>
      Map(
        "id"     -> c.id.toString,
        "size"   -> c.dataPoints.size,
        "color"  -> c.colours.headOption.getOrElse("#999"),
        "x"      -> r.nextInt(100),
        "y"      -> r.nextInt(100))),
    "edges" -> result.edges.map(e => {
      val (from, to) = e.toArray match {case Array(f, t) => (f, t)}
        
      Map(
        "id"     -> s"$from--$to",        
        "source" -> from.toString,
        "target" -> to.toString)}))
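
The shape of the resulting structure is easier to see on a tiny input. The sketch below is an assumption-laden illustration, not the library's API: `MiniCluster` and `formatMini` are hypothetical stand-ins for the cluster type and the `format` function above, and edges are modelled as two-element sets of cluster ids.

```scala
// Hypothetical, simplified stand-in for the library's cluster type.
case class MiniCluster(id: Int, members: Seq[Int], colours: Seq[String])

def formatMini(clusters: Seq[MiniCluster],
               edges: Seq[Set[Int]]): Map[String, Seq[Map[String, Any]]] = {
  val rnd = new scala.util.Random(42) // fixed seed, only to make the sketch repeatable

  Map(
    "nodes" -> clusters.map(c => Map[String, Any](
      "id"    -> c.id.toString,
      "size"  -> c.members.size,
      "color" -> c.colours.headOption.getOrElse("#999"), // grey fallback when uncoloured
      "x"     -> rnd.nextInt(100),
      "y"     -> rnd.nextInt(100))),
    "edges" -> edges.map { e =>
      // An edge is expected to connect exactly two clusters.
      val (from, to) = e.toSeq.sorted match {
        case Seq(f, t) => (f, t)
        case other     => throw new IllegalArgumentException(s"expected 2 endpoints, got $other")
      }
      Map[String, Any](
        "id"     -> s"$from--$to",
        "source" -> from.toString,
        "target" -> to.toString)
    })
}

val graph = formatMini(
  Seq(MiniCluster(1, Seq(10, 11, 12), Seq("#b35806")), MiniCluster(2, Seq(12, 13), Nil)),
  Seq(Set(1, 2)))
```

Each node carries `id`, `size`, `color`, `x` and `y`, and each edge carries `id`, `source` and `target`, the same keys that `format` produces.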

Run the machine, obtaining an `Observable` of `(TDAParams, TDAResult)` pairs.

In [15]:
val out$_subRef = SubRef()

In [78]:
val out$: Observable[(TDAParams, TDAResult)] = TDAMachine.run(ctx, in$)

In [79]:
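out$_subRef.update(
    out$.subscribe(
        // Publish each new result to the channel so the bound graph component re-renders.
        onNext = (t) => t match {case (p, r) => channel("ch_TDA_1").set("result", format(r))},
        // Interpolate the error; println with two arguments would print a tuple.
        onError = (e) => println(s"Error in TDA machine: $e")))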

#### Reactive inputs

In [18]:
val pipe$_subRef = SubRef()

In [19]:
kernel.magics.html(controlsCSS)

In [87]:
val cat = new AttributeSelector[Boolean]("cat") {  
  val target = "1"
  override def apply(number: Any): Boolean = number.asInstanceOf[String] == target 
}
        
val BASE = 
    TDAParams(
        lens = TDALens(          
          Filter("PCA" :: 0 :: HNil, 25, 0.3),
          Filter("eccentricity" :: "infinity" :: HNil, 25, 0.3)),
        clusteringParams = ClusteringParams(),
        scaleSelection = histogram(10),
        collapseDuplicateClusters = false,
        colouring = Colouring(Brewer.palettes("PuOr").get(9).map(_.reverse), LocalPercentage(9, cat)))
        
val (sub, html) = BASE.makeControls(channel("ch_TDA_1"), in$)

pipe$_subRef.update(sub)

kernel.magics.html(html)

| Setting | Value |
|---|---|
| **Filter** | PCA :: 0 :: HNil |
| Nr of cover bins | [[filter-0-nrBins]] |
| Cover overlap | [[filter-0-overlap]]% |
| **Filter** | eccentricity :: infinity :: HNil |
| Nr of cover bins | [[filter-1-nrBins]] |
| Cover overlap | [[filter-1-overlap]]% |
| **General** | |
| Nr of scale bins | [[scaleBins]] |
| Collapse duplicates | |


In [41]:
%%html
<template is='urth-core-bind' channel='ch_TDA_1'>    
    <plongeur-graph height="1200" data="{{result}}"></plongeur-graph>
</template>

In [22]:
%%html
<template is='urth-core-bind' channel='ch_TDA_1'>  
    <div style='background: #FFB; padding: 10px;'>
        <span style='font-family: "Courier"'>[[params]]</span>
    </div>
</template>