<h1>Wordcount with Rheem <div style="float:right; z-index:1"><img src="rheem.png" width="100px" /></div></h1>

This notebook demonstrates how to run Wordcount, the _Hello world!_ of data processing tools, with Rheem. To run it, you need the [Jupyter Scala kernel](https://github.com/alexarchambault/jupyter-scala).

First, we obtain an input dataset: a plain-text copy of the Iliad, downloaded once and cached as `data/iliad.txt`.

In [None]:
locally {
    import java.io._
    import scala.io.Source
    
    val file = new File("data/iliad.txt")
    if (!file.exists) {
        file.getParentFile.mkdirs()
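        // Decode and encode explicitly as UTF-8 so the copy does not depend on the platform default charset.
        val source = Source.fromURL("http://www.gutenberg.org/cache/epub/6130/pg6130.txt", "UTF-8")
        val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF-8"))
        try source.foreach(char => writer.write(char.toInt))
        finally {
            // Release both resources even if the copy fails midway.
            writer.close()
            source.close()
        }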
    }
}

Next, we initialize Rheem.

In [None]:
// Load dependencies into the kernel.
import $ivy.`org.slf4j:slf4j-nop:1.7.12`,
    $ivy.`org.qcri.rheem:rheem-api:0.2.1-SNAPSHOT`,
    $ivy.`org.qcri.rheem:rheem-basic:0.2.1-SNAPSHOT`,
    $ivy.`org.qcri.rheem:rheem-java:0.2.1-SNAPSHOT`,
    $ivy.`org.qcri.rheem:rheem-spark:0.2.1-SNAPSHOT`

// Do the relevant imports.
import org.qcri.rheem.api._
import org.qcri.rheem.core.api._
import org.qcri.rheem.core.optimizer.ProbabilisticDoubleInterval
import org.qcri.rheem.java.Java, org.qcri.rheem.spark.Spark

// Set up a Rheem context.
val localDir = new java.io.File(".").getAbsoluteFile
val config = new Configuration(s"file://$localDir/rheem.properties")
val rheemCtx = new RheemContext(config) withPlugin Java.basicPlugin withPlugin Spark.basicPlugin

Now we can run the Wordcount.

In [None]:
locally {
    // Define a class to handle word counts neatly.
    case class WC(word: String, count: Int) {
        def +(that: WC) = {
            require(this.word == that.word)
            WC(this.word, this.count + that.count)
        }
        
        override def toString: String = s"${count}x ${word}"
    }
    
    // Set up a new plan.
    val planBuilder = new PlanBuilder(rheemCtx)
        .withJobName("WordCount")
        .withUdfJarsOf(this.getClass)
    
    val wordCounts = planBuilder

        // Read the text file.
        .readTextFile(s"file://$localDir/data/iliad.txt").withName("Load file")

        // Split each line by non-word characters.
        .flatMap(_.split("\\W+")).withName("Split words")

        // Filter out empty tokens; the selectivity hint tells the optimizer that about 99% of tokens pass.
        .filter(_.nonEmpty, selectivity = 0.99).withName("Filter empty words")

        // Attach counter to each word.
        .map(word => WC(word.toLowerCase, 1)).withName("To lower case, add counter")

        // Sum up counters for every word.
        .reduceByKey(_.word, _ + _).withName("Add counters")
        .withCardinalityEstimator((in: Long) => math.round(in * 0.01))
    
        // Keep only words that occur more than 10 times.
        .filter(_.count > 10)

        // Execute the plan and collect the results.
        .collect()
    
    wordCounts.toSeq.sortBy(-_.count).foreach(println)
}
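
For comparison, the same pipeline can be sketched on plain Scala collections, without Rheem. This is only an illustration of what each step computes, not how Rheem executes the plan: `groupBy` stands in for `reduceByKey`, the `wordCount` helper and its `minCount` parameter are names made up for this sketch, and the two sample lines are invented rather than taken from the downloaded file.

```scala
// Plain Scala collections version of the Wordcount pipeline, for comparison only.
case class WC(word: String, count: Int)

// Hypothetical helper: counts words in `lines` and keeps those
// that occur more than `minCount` times.
def wordCount(lines: Seq[String], minCount: Int): Seq[WC] =
  lines
    .flatMap(_.split("\\W+"))            // split each line by non-word characters
    .filter(_.nonEmpty)                  // drop empty tokens
    .map(_.toLowerCase)                  // normalize case
    .groupBy(identity)                   // plays the role of reduceByKey
    .map { case (word, occurrences) => WC(word, occurrences.size) }
    .filter(_.count > minCount)          // drop infrequent words
    .toSeq
    .sortBy(wc => (-wc.count, wc.word))  // most frequent first, ties by word

// Made-up sample input, not part of the Iliad file.
val sample = Seq("Sing, O goddess, the anger", "the anger of Achilles")
val result = wordCount(sample, minCount = 1)
result.foreach(println)
```

On a small input like this, the result is easy to check by hand, which makes the sketch a convenient sanity check before running the Rheem plan on the full text.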