<h1>Wordcount with Rheem <div style="float:right; z-index:1"><img src="rheem.png" width="100px" /></div></h1>

This notebook demonstrates how to run Wordcount, the _Hello world!_ of data processing tools, with Rheem. To run it, you will need the [Jupyter Scala kernel](https://github.com/alexarchambault/jupyter-scala).

In [1]:
val offline = false

offline: Boolean = false

First, we obtain an input dataset: the text of Homer's _Iliad_ from Project Gutenberg.

In [2]:
locally {
    import java.io._
    import scala.io.Source
    
    val file = new File("data/iliad.txt")
    if (!file.exists) {
        file.getParentFile.mkdirs()
        // Download the text and store it locally as UTF-8.
        val source = Source.fromURL("http://www.gutenberg.org/cache/epub/6130/pg6130.txt")
        val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF-8"))
        // Close both resources even if the download fails midway.
        try source.foreach(char => writer.write(char.toInt))
        finally {
            writer.close()
            source.close()
        }
    }
}

Next, we initialize Rheem.

In [3]:
// Load dependencies into the kernel.
import $ivy.`org.slf4j:slf4j-nop:1.7.12`,
    $ivy.`org.qcri.rheem:rheem-api:0.2.1-SNAPSHOT`,
    $ivy.`org.qcri.rheem:rheem-basic:0.2.1-SNAPSHOT`,
    $ivy.`org.qcri.rheem:rheem-java:0.2.1-SNAPSHOT`,
    $ivy.`org.qcri.rheem:rheem-spark:0.2.1-SNAPSHOT`,
    $ivy.`com.github.sekruse::spark-summit-demo:1.0-SNAPSHOT`

// Do the relevant imports.
import org.qcri.rheem.api._
import org.qcri.rheem.core.api._
import org.qcri.rheem.core.optimizer.ProbabilisticDoubleInterval
import org.qcri.rheem.java.Java, org.qcri.rheem.spark.Spark
import com.github.sekruse.spark_summit_demo._

// Set up a Rheem context.
val localDir = new java.io.File(".").getAbsoluteFile
val config = new Configuration(s"file://$localDir/rheem.properties")
val rheemCtx = new RheemContext(config) withPlugin Java.basicPlugin withPlugin Spark.basicPlugin

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/basti/.coursier/cache/v1/https/repo1.maven.org/maven2/org/slf4j/slf4j-nop/1.7.12/slf4j-nop-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/basti/.m2/repository/org/slf4j/slf4j-simple/1.7.13/slf4j-simple-1.7.13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.helpers.NOPLoggerFactory]


localDir: java.io.File = /Users/basti/Work/Repositories/spark-summit-2017/notebooks/.
config: org.qcri.rheem.core.api.Configuration = Configuration[file:///Us

In [4]:
if (offline) {
    requireJs("plotly", "http://localhost:8889/js/plotly-latest.min")
} else {
    requireJs("plotly", "https://cdn.plot.ly/plotly-latest.min")
}

Now we can run the Wordcount.

In [5]:
locally {
    // Define a class to handle word counts neatly.
    case class WC(word: String, count: Int) {
        def +(that: WC) = {
            require(this.word == that.word)
            WC(this.word, this.count + that.count)
        }
        
        override def toString: String = s"${count}x ${word}"
    }
    
    // Set up a new plan.
    val planBuilder = new PlanBuilder(rheemCtx)
        .withJobName("WordCount")
        .withUdfJarsOf(this.getClass)
    
    val wordCounts = planBuilder

        // Read the text file.
        .readTextFile(s"file://$localDir/data/iliad.txt").withName("Load file")

        // Split each line by non-word characters.
        .flatMap(_.split("\\W+")).withName("Split words")

        // Filter empty tokens.
        .filter(_.nonEmpty, selectivity = 0.99).withName("Filter empty words")

        // Attach counter to each word.
        .map(word => WC(word.toLowerCase, 1)).withName("To lower case, add counter")

        // Sum up counters for every word.
        .reduceByKey(_.word, _ + _).withName("Add counters")
        .withCardinalityEstimator((in: Long) => math.round(in * 0.01))
    
        // Mask words with small counts.
        .map(wc => if (wc.count > 1000) wc else WC("(other)", wc.count)).withName("Mask rather small words")
        .reduceByKey(_.word, _ + _).withName("Add counters again")

        // Execute the plan and collect the results.
        .collect()
    
    plotPieChart[WC](
        name = "Words in Homer's Iliad",
        data = wordCounts,
        values = _.count.toDouble,
        labels = _.word,
        showlegend = false
    )
}
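
For intuition, the following sketch reproduces the logic of the plan above using only plain Scala collections instead of Rheem. It is not how Rheem executes the plan; it merely mirrors the same steps (split, filter, lower-case, sum per word, mask small counts, sum again) on an in-memory `Seq[String]`. The names `LocalWordCount`, `reduceByWord`, and the `threshold` parameter are hypothetical and introduced here for illustration only; the notebook hard-codes the threshold as 1000.

```scala
object LocalWordCount {

  // Same shape as the notebook's WC class: a word together with its count.
  final case class WC(word: String, count: Int) {
    def +(that: WC): WC = {
      require(this.word == that.word)
      WC(this.word, this.count + that.count)
    }
  }

  // Hypothetical helper: emulates reduceByKey(_.word, _ + _) on a local collection.
  private def reduceByWord(wcs: Seq[WC]): Seq[WC] =
    wcs.groupBy(_.word).values.map(_.reduce(_ + _)).toSeq

  // `threshold` is an assumed parameter; words counted at most `threshold`
  // times are folded into a single "(other)" entry.
  def wordCounts(lines: Seq[String], threshold: Int): Seq[WC] = {
    val counted = reduceByWord(
      lines
        .flatMap(_.split("\\W+"))              // split each line by non-word characters
        .filter(_.nonEmpty)                    // drop empty tokens
        .map(word => WC(word.toLowerCase, 1))  // attach a counter to each word
    )
    val masked = counted.map(wc => if (wc.count > threshold) wc else WC("(other)", wc.count))
    // Sum the masked counters and order by descending count, then by word.
    reduceByWord(masked).sortBy(wc => (-wc.count, wc.word))
  }
}
```

Calling `LocalWordCount.wordCounts(lines, threshold = 1000)` on the lines of `data/iliad.txt` should yield the same word/count pairs that the Rheem plan collects, assuming the whole file fits comfortably in memory.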