# Prepare Rheem

1. Load relevant modules
2. Import relevant packages

In [1]:
// Get the right repositories for SNAPSHOT versions.
classpath.addRepository("file:///Users/basti/.m2/repository")
classpath.addRepository("https://oss.sonatype.org/content/repositories/snapshots")

val rheemVersion = "0.2.0-SNAPSHOT"

// Load Rheem's core functionality.
classpath.add("org.qcri.rheem" % "rheem-core" % rheemVersion)
classpath.add("org.qcri.rheem" % "rheem-api" % rheemVersion)

// Load Rheem's platform plugins.
classpath.add("org.qcri.rheem" % "rheem-java" % rheemVersion)
classpath.add("org.qcri.rheem" % "rheem-spark" % rheemVersion)
classpath.add("org.qcri.rheem" % "rheem-basic" % rheemVersion)
classpath.add("org.qcri.rheem" % "rheem-graphchi" % rheemVersion)
classpath.add("org.qcri.rheem" % "rheem-sqlite3" % rheemVersion)

// Load the platforms themselves.
classpath.add("org.apache.hadoop" % "hadoop-common" % "2.2.0")
classpath.add("org.apache.hadoop" % "hadoop-hdfs" % "2.2.0")
classpath.add("org.apache.spark" % "spark-core_2.11" % "1.6.1")

// Load the profiling utility used by Rheem.
classpath.add("de.hpi.isg" % "profiledb-store" % "0.1.1")
classpath.add("org.qcri.rheem" % "rheem-profiler" % rheemVersion)

Adding 13 artifact(s)
Adding 4 artifact(s)
Adding 0 artifact(s)
Adding 1 artifact(s)
Adding 0 artifact(s)
Adding 37 artifact(s)
Adding 2 artifact(s)
Adding 30 artifact(s)
Adding 3 artifact(s)
Adding 84 artifact(s)
Adding 0 artifact(s)
Adding 4 artifact(s)


rheemVersion: String = "0.2.0-SNAPSHOT"

In [2]:
// Import relevant packages.
import org.qcri.rheem.api._
import org.qcri.rheem.basic.data._
import org.qcri.rheem.core.api._
import org.qcri.rheem.core.util._
import org.qcri.rheem.core.util.fs._
import org.qcri.rheem.core.function.FunctionDescriptor._
import org.qcri.rheem.core.optimizer.ProbabilisticDoubleInterval
import org.qcri.rheem.core.plugin.Plugin
import org.qcri.rheem.basic.RheemBasics
import org.qcri.rheem.java.Java
import org.qcri.rheem.spark.Spark
import org.qcri.rheem.sqlite3.Sqlite3
import de.hpi.isg.profiledb.store.model._

import scala.collection.JavaConversions._
import scala.collection.mutable

// Create a configuration.
val cwd = "/Users/basti/Work/Notebooks/boss-2016/cost-functions"
val confUrl = s"file://$cwd/rheem.properties"
val conf = new Configuration(confUrl)
val profileDbLocation = s"$cwd/profiledb.json"

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/basti/.coursier/cache/v1/https/repo1.maven.org/maven2/org/slf4j/slf4j-simple/1.7.13/slf4j-simple-1.7.13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/basti/.coursier/cache/v1/https/repo1.maven.org/maven2/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/basti/.m2/repository/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]


import org.qcri.rheem.api._
import org.qcri.rheem.basic.data._
import org.qcri.rheem.core.api._
import org.qcri.rheem.core.util._
import org.qcri.rheem.core.util.fs._
import org.qcri.rheem.core.function.FunctionDescriptor._
import org.qcri.rheem.core.optimizer.ProbabilisticDoubleInterval
import org.qcri.rheem.core.plugin.Plugin
import org.qcri.rheem.basic.RheemBasics
import org.qcri.rheem.java.Java
import org.qcri.rheem.spark.Spark
import org.qcri.rheem.sqlite3.Sqlite3
import de.hpi.isg.profiledb.store.model._
import scala.collection.JavaConversions._
import scala.collection.mutable
cwd: String = "/Users/basti/Work/Notebooks/boss-2016/cost-functions"
confUrl: String = "file:///Users/basti/Work/Notebooks/boss-2016/cost-functions/rh

## Preparation
Now we define an experiment runner that executes a word count plan on input files of different sizes and stores the resulting experiments in a ProfileDB.

In [4]:
class WordCountExperimentRunner(profileDbLocation: String, 
                                plugins: Plugin*) {

    var nextExperimentId = 0
    
    val pluginNames = plugins.map(_.getClass.getSimpleName).sorted.mkString(",")
    
    import org.qcri.rheem.core.profiling.ProfileDBs
    val profileDB = ProfileDBs.createProfileDB
    
    def apply(configuration: Configuration, 
              tags: Seq[String], 
              inputUrls: Seq[String], 
              wordsPerLine: ProbabilisticDoubleInterval = null): Unit = {
        val experiments = inputUrls.map { case inputUrl: String =>
            // Prepare experiment.
            val inputFileSize = FileSystems.getFileSize(inputUrl).getAsLong
            val subject = new Subject("wordcount", "1.0")
                .addConfiguration("inputUrl", inputUrl)
                .addConfiguration("inputSize", inputFileSize)
                .addConfiguration("plugins", this.pluginNames)
            val experiment = new Experiment(f"exp-$nextExperimentId%03d", subject, tags: _*)
            this.nextExperimentId += 1
            
            // Run experiment.
            val wordcounts = wordCount(configuration, inputUrl, wordsPerLine, experiment)
            
            // Handle results.
            println(s"Collected ${wordcounts.size} word counts in $inputUrl.")
            experiment
        }
        
        // Persist experiments.
        this.profileDB.append(new java.io.File(profileDbLocation), experiments: _*)
    }
    
    private def wordCount(configuration: Configuration, 
                          inputUrl: String, 
                          wordsPerLine: ProbabilisticDoubleInterval, 
                          experiment: Experiment) = {
        val rheemContext = new RheemContext(configuration)
        plugins.foreach(rheemContext.withPlugin)
        val planBuilder = new PlanBuilder(rheemContext)

        planBuilder
          // Do some set up.
          .withJobName(s"WordCount ($inputUrl)")
          .withUdfJarsOf(this.getClass)
          .withExperiment(experiment)

          // Read the text file.
          .readTextFile(inputUrl).withName("Load file")

          // Split each line by non-word characters.
          .flatMap(_.split("\\W+"), selectivity = wordsPerLine).withName("Split words")

          // Filter empty tokens.
          .filter(_.nonEmpty, selectivity = 0.99).withName("Filter empty words")

          // Attach counter to each word.
          .map(word => (word.toLowerCase, 1)).withName("To lower case, add counter")

          // Sum up counters for every word.
          .reduceByKey(_._1, (c1, c2) => (c1._1, c1._2 + c2._2)).withName("Add counters")
          .withCardinalityEstimator((in: Long) => math.round(in * 0.01))

          // Execute the plan and collect the results.
          .collect()
    }
}

defined class WordCountExperimentRunner
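
To make the steps of the plan easier to follow, here is a minimal sketch of the same pipeline on plain Scala collections. It does not use Rheem at all; the object name `WordCountSketch` and the sample lines are made up for illustration, and `groupBy` plus a sum stands in for `reduceByKey`.

```scala
object WordCountSketch {
  // Mirrors the plan's steps on an in-memory collection (no Rheem involved).
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\W+").toSeq)      // split each line by non-word characters
      .filter(_.nonEmpty)                  // a line starting with a non-word character yields an empty first token
      .map(word => (word.toLowerCase, 1))  // attach a counter to each lower-cased word
      .groupBy(_._1)                       // stand-in for reduceByKey: group by word ...
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // ... and sum the counters

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input, not taken from the notebook's data files.
    val counts = wordCount(Seq("The cat, the hat", " a cat"))
    counts.toSeq.sortBy(_._1).foreach { case (word, count) => println(s"$word: $count") }
  }
}
```

The second sample line starts with a space, which is why the empty-token filter is needed before counting.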

In [5]:
val inputUrls = Seq("file:///Users/basti/Work/Data/text/odyssey.txt", 
                    "file:///Users/basti/Work/Data/text/odyssey.sf3.txt", 
                    "file:///Users/basti/Work/Data/text/odyssey.sf10.txt",
                    "file:///Users/basti/Work/Data/text/odyssey.sf30.txt", 
                    "file:///Users/basti/Work/Data/text/odyssey.sf100.txt")
val wordCountOnJava = new WordCountExperimentRunner(profileDbLocation, Java.basicPlugin)

inputUrls: Seq[String] = List(
  "file:///Users/basti/Work/Data/text/odyssey.txt",
  "file:///Users/basti/Work/Data/text/odyssey.sf3.txt",
  "file:///Users/basti/Work/Data/text/odyssey.sf10.txt",
  "file:///Users/basti/Work/Data/text/odyssey.sf30.txt",
  "file:///Users/basti/Work/Data/text/odyssey.sf100.txt"
)
wordCountOnJava: WordCountExperimentRunner = cmd3$$user$WordCountExperimentRunner@5504b80e

## Experiments

Now we run the experiments. Their results are visualized in the `wordcount-visualization` notebook.

In [5]:
wordCountOnJava(conf, Seq("attempt-1"), inputUrls)

Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf3.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf10.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf30.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf100.txt.




In [6]:
wordCountOnJava(conf, Seq("attempt-2"), inputUrls, 10)

Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf3.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf10.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf30.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf100.txt.
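
Every run above reports 7848 word counts, regardless of input size, while the plan's reduce step estimates its output as a fixed 1% of its input cardinality. The sketch below contrasts the two. The estimator is copied from the plan, but the input cardinalities are hypothetical (the notebook does not record them), and `CardinalitySketch` is an illustrative name rather than part of Rheem.

```scala
object CardinalitySketch {
  // Same fixed-ratio estimator as in the plan: 1% of the input cardinality.
  def estimateReduceOutput(in: Long): Long = math.round(in * 0.01)

  // Number of word counts reported for every input file in the runs above.
  val observedWordCounts = 7848L

  def main(args: Array[String]): Unit = {
    // Hypothetical input cardinalities of the reduce step, not measured values.
    val hypotheticalInputs = Seq(100000L, 300000L, 1000000L)
    hypotheticalInputs.foreach { in =>
      val estimate = estimateReduceOutput(in)
      println(s"in=$in estimate=$estimate observed=$observedWordCounts")
    }
  }
}
```

A fixed-ratio estimate grows linearly with the input, whereas the observed number of distinct words stays constant across the scaled files, so this estimator may be one source of misestimation on the larger inputs.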




It turns out that our estimates are off. However, we can learn cost functions from the experiments executed so far. The output can be copied directly into a `.properties` file.

In [3]:
new org.qcri.rheem.profiler.log.GeneticOptimizerApp(conf.fork).run()

Loaded 40 execution records with 6 execution operator types and 0 platform overheads.
Optimizing 15 variables on 40 partial executions (e.g., [OperatorExecution[JavaTextFileSource[0->1, id=3a332629]], OperatorExecution[JavaFlatMap[1->1, id=60808022]], OperatorExecution[JavaFilter[1->1, id=5053330b]], OperatorExecution[JavaMap[1->1, id=31a6c219]], OperatorExecution[JavaReduceBy[1->1, id=1287b86a]]]).
Fittest individual of generation 0 (0): 0.0000
Fittest individual of generation 2,000 (2,000): 0.7758
Fittest individual of generation 4,000 (4,000): 0.7761
Final fittest individual of generation 5,000 (5,000): 0.7764

=== Stats for fittest individual (fitness=0.7764)

Training data vs. measured
Actual   0:00:06.205 | Estimated:                                    (0:00:00.130 .. 0:04:37.477, p=1.07%) |   5 operators | [JavaTextFileSource, JavaFlatMapOperator, JavaFilterOperator, JavaMapOperator, JavaReduceByOperator]
Actual   0:00:05.943 | Estimated:                                    (0:00



In [6]:
conf.load(s"file://$cwd/cost-functions.properties")



In [7]:
wordCountOnJava(conf, Seq("attempt-3"), inputUrls, 10)

Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf3.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf10.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf30.txt.
Collected 7848 word counts in file:///Users/basti/Work/Data/text/odyssey.sf100.txt.


