mbench

A benchmarking library for the JVM with a Gnuplot backend.

Main Features

  • Simple setup. The library executes benchmarks in isolation from each other by cloning the current VM programmatically. Multiple benchmarks can then be launched directly from a main class, like any regular application, without the need to set up an additional build or script environment.
  • Customizable results. The default benchmark definition produces, as columns in a table, the inputs passed to a test, the median execution time of the test for every input, and the coefficient of variation obtained over several runs. This definition can be augmented programmatically with application-specific columns (e.g. parallel speedup or throughput) that are computed in real time from previous results, the current input of a test, and/or its configuration parameters.
  • Comparison friendly. To ease comparisons, the same benchmark definition can be reused to evaluate multiple tests in different configurations. The Gnuplot backend saves these results in separate '.dat' files and features an API to generate Gnuplot scripts that aggregate multiple results on a single plot.
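The median and coefficient of variation reported by the default definition are standard statistics. As an illustration only (this is not mbench's actual implementation), they can be computed like this:

```scala
object StatsSketch {
  // Median of the times measured over several runs.
  def median(times: Seq[Double]): Double = {
    val s = times.sorted
    val n = s.length
    if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
  }

  // Coefficient of variation: standard deviation divided by the mean.
  // A small value indicates stable measurements across runs.
  def cv(times: Seq[Double]): Double = {
    val mean = times.sum / times.length
    val variance = times.map(t => (t - mean) * (t - mean)).sum / times.length
    math.sqrt(variance) / mean
  }
}
```

The median is preferred over the mean for the time column because it is robust against outlier runs (e.g. a GC pause), while the coefficient of variation quantifies how trustworthy the measurement is.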

Examples

The examples below are taken from the mbench-benchmarks project. All the benchmarks were run on a 24-core (2200 MHz per core) Opteron server configured with Linux 3.8.6 and the realtime preemption patch, running Scala 2.9.3, with the following JVM flags: -server -XX:+UseNUMA -XX:+UseCondCardMark -Xss1M -XX:MaxPermSize=128m -XX:+UseParallelGC -XX:+DoEscapeAnalysis -Xms1024m -Xmx1024m.

The '.dat' files and the Gnuplot scripts generated by the tool for these examples can be found in the gallery directory.

The latest API documentation is available online here.

Loops

The following benchmark compares the performance of several loops for different numbers of iterations.

```scala
object Loops {

  def testWhile(cycles: Int): Unit = {
    val ary = new Array[Int](16)
    var i = 0
    while (i < cycles) {
      ary(i & 0xF) = i
      i += 1
    }
    // Print a result to prevent the JIT from eliminating the loop as dead code.
    println(ary(0) + "," + ary(1))
  }

  def testFor(cycles: Int): Unit = {
    val ary = new Array[Int](16)
    for (i <- 0 until cycles) {
      ary(i & 0xF) = i
    }
    println(ary(0) + "," + ary(1))
  }

  def testWhileNew(cycles: Int): Unit = {
    case class Box(i: Int)
    val ary = new Array[Box](16)
    var i = 0
    while (i < cycles) {
      ary(i & 0xF) = Box(i)
      i += 1
    }
    println(ary(0) + "," + ary(1))
  }

  import mbench.benchmark._
  import mbench.gnuplot._

  val input = (1 to 3) map (_ * 5000000)
  val cycles = Label[Int]("cycles")
  val throughput = Column.throughput(cycles)

  val benchmark = Benchmark("loops", input, cycles, warmups = 2, runs = 5).add(throughput)

  def main(args: Array[String]) = {

    val tests = Seq(
      Test.input("while", testWhile),
      Test.input("for", testFor),
      Test.input("while-new", testWhileNew)
    )

    val datFiles = tests.map(benchmark(_))

    val plots = Gnuplot(datFiles)
    Gnuplot.save(plots)
  }
}
```

The following plots are generated automatically. If we look at the time plot, we see that, as expected, the time increases with the number of cycles.

loops%time.plt

However, this does not tell us whether the time increases linearly with the number of cycles. This is why generating a second plot, of the throughput against the number of cycles, is interesting. According to the figure below, the number of cycles per second is, as expected for such high iteration counts, independent of the number of cycles.

loops%throughput.plt
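Throughput here is simply the input size divided by the measured time. A self-contained sketch (using a hypothetical cost of 2 ns per iteration, not a measured value) shows why a linear time profile yields a flat throughput curve:

```scala
object ThroughputSketch {
  // Throughput in cycles per second for a given input size and execution time.
  def throughput(cycles: Long, timeSeconds: Double): Double = cycles / timeSeconds

  // Hypothetical linear time profile: a fixed cost of 2 ns per cycle.
  def linearTime(cycles: Long): Double = 2e-9 * cycles
}
```

With a linear time profile the ratio is constant, which is exactly the flat line seen on the throughput plot; a drop in the curve at larger inputs would reveal super-linear costs (e.g. GC pressure).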

Parallel Loops

The benchmark below measures how well executing these loops in parallel scales with the number of threads on our 24-core server.

```scala
object ParaLoops {

  import mbench.benchmark._
  import mbench.gnuplot._

  import java.util.concurrent.{ ExecutorService, Executors }

  /* The benchmark will be configured using an executor as runtime configuration
   * and a number of loop cycles as static configuration.
   */
  def runtimeConfig(executorName: String, mkExecutor: Int => ExecutorService) =
    Config.runtime[Int, ExecutorService](executorName, mkExecutor, _.shutdown())

  val threadPool = runtimeConfig("thread-pool", Executors.newFixedThreadPool)

  val cycles = Config.static(10000000)

  import mbench.Host.Hardware.cores

  val threads = ((1 to 3) ++ (4 to cores by (if (cores <= 8) 2 else 4)) ++ (Seq(2, 4) map (_ + cores))).distinct

  val ilabel = Label[Int]("threads")

  /* This column uses the cycles in the static configuration to compute
   * the throughput in cycles per second.
   */
  val throughput = Column.withConfig[Int, Int, Double]("throughput", "cycles".perSeconds)(
    (threads, cycles, time) => threads * (cycles / time)
  )

  val speedup = throughput.speedupHigherIsBetter

  val benchmark = Benchmark("para-loops", threads, ilabel, warmups = 5, runs = 7)
    .add(throughput).add(speedup)

  def main(args: Array[String]) = {

    /* The tests and the benchmark must use the same setup, which specifies
     * the executor and the number of cycles to execute.
     */

    def mkTest(loop: Int => Unit)(executor: ExecutorService, cycles: Int, threads: Int) = {
      (1 to threads)
        .map(_ => executor.submit(new Runnable { def run() = loop(cycles) }))
        .foreach(_.get())
    }

    val testWhile = Test("while", mkTest(Loops.testWhile))
    val testWhileNew = Test("while-new", mkTest(Loops.testWhileNew))
    val testFor = Test("for", mkTest(Loops.testFor))

    val idealTimes = benchmark.ideal("speedup", 1 /* cycles don't matter */ , threads => if (threads <= cores) 1 else (threads.toDouble / cores))

    val tests = Seq(testWhile, testWhileNew, testFor)

    val dats = tests.map(benchmark(threadPool and cycles, _))

    val settings = Seq(Plot.xtics(1))
    Gnuplot.save(Gnuplot(dats, settings, throughput))
    Gnuplot.save(Gnuplot(dats :+ idealTimes, settings, speedup.label))
  }
}
```

If we look at the throughput plot below, we see again that the while-loop wins over Scala's for-comprehension in Scala 2.9.3 (contributions to migrate the build to Scala 2.10 are welcome), and that the loop that does excessive boxing is again the worst performer of the three.

para-loops%throughput.plt

But how well do they scale with the number of threads on multicore hardware? According to the speedup plot shown below, only the for-comprehension comes close to the ideal parallel speedup, and in all cases it is not worth going beyond 24 threads. The while-loop, although it is the fastest, benefits from a parallel speedup up to at least 20 threads, but its speedup is less than ideal. And the version that does boxing creates so much garbage that the JVM cannot collect it efficiently between 8 and 12 threads (open questions: What explains the different behaviors? Can they be improved by tuning the JVM?).

para-loops%speedup.plt
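The ideal curve declared in the benchmark grows linearly up to the number of cores and stays flat beyond it. As an illustration only (these helper names are not part of mbench's API), the two quantities plotted can be sketched as:

```scala
object SpeedupSketch {
  // Ideal parallel speedup: linear up to the core count, then flat,
  // matching the ideal times declared in the benchmark above.
  def idealSpeedup(threads: Int, cores: Int): Double =
    if (threads <= cores) threads.toDouble else cores.toDouble

  // Measured speedup: throughput at n threads relative to one thread.
  def speedup(throughputN: Double, throughput1: Double): Double =
    throughputN / throughput1
}
```

Comparing each measured curve against this ideal line is what makes the plot readable at a glance: the closer a test hugs the line, the better it scales.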

Maps

The benchmark below illustrates how to benchmark a scenario that performs side-effects that must be executed in a predefined sequence. In this scenario, we measure how fast we can add elements to a hash map and then remove them. At the same time, we compare the performance of immutable and mutable hash maps.

```scala
import scala.collection.{ immutable, mutable }

object Maps {

  def testAdd(map: Map[Int, Int], n: Int): Map[Int, Int] = {
    var i = 0
    var m = map
    while (i < n) {
      m += i -> i
      i += 1
    }
    m
  }

  def testDel(map: Map[Int, Int], n: Int): Map[Int, Int] = {
    var i = 0
    var m = map
    while (i < n) {
      m -= i
      i += 1
    }
    m
  }

  /* Hello static dispatch: reusing the test methods above through a generic
   * collection.Map interface would be extremely slow for mutable maps, because
   * `+=` would then copy the map each time it is invoked instead of updating
   * it in place. This is why we need these mutable variants.
   */
  def testMAdd(map: mutable.Map[Int, Int], n: Int): mutable.Map[Int, Int] = {
    var i = 0
    var m = map
    while (i < n) {
      m += i -> i
      i += 1
    }
    m
  }

  def testMDel(map: mutable.Map[Int, Int], n: Int): mutable.Map[Int, Int] = {
    var i = 0
    var m = map
    while (i < n) {
      m -= i
      i += 1
    }
    m
  }

  import mbench.benchmark._
  import mbench.gnuplot._
  import mbench.benchmark.TestSeq.{ test, ignore }

  val MapTest = TestSeq.static[immutable.Map[Int, Int], Int]("", (map, n) =>
    for {
      m <- test("add", testAdd(map, n))
      m <- test("del", testDel(m, n))
      _ <- ignore(assert(m.isEmpty))
    } yield ()
  )

  val MMapTest = TestSeq.static[mutable.Map[Int, Int], Int]("", (map, n) =>
    for {
      m <- test("add", testMAdd(map, n))
      m <- test("del", testMDel(m, n))
      _ <- ignore(assert(m.isEmpty))
    } yield ()
  )

  val immutableMap = Config.static("immutable-map", Map.empty[Int, Int])
  val mutableMap = Config.static("mutable-map", mutable.Map.empty[Int, Int])
  val openMap = Config.static("open-map", mutable.OpenHashMap.empty[Int, Int])

  val input = Seq(1, 5, 10).map(_ * 50000)
  val elems = Label[Int]("elems")
  val throughput = Column.throughput(elems)

  val benchmark = Benchmark("maps", input, elems, warmups = 2, runs = 5).add(throughput)

  def main(args: Array[String]) = {
    val ires = benchmark.seq(immutableMap, MapTest)
    val mres = benchmark.seq(mutableMap, MMapTest)
    val ores = benchmark.seq(openMap, MMapTest)

    val plots = Gnuplot(ires ++ mres ++ ores, Plot.xticList(input: _*))
    Gnuplot.save(plots)
  }
}
```

The time plot below tells us that, as expected, the time increases with the number of elements and that the open hash map is the fastest of the three for add and remove operations.

maps%time.plt

If we look now at how the throughput scales with the number of elements, we see that the open hash map is not only the fastest of the pack, but also that its add throughput keeps improving as the number of elements increases, even above 300000 additions.

maps%throughput.plt

Installing

MBench and its benchmarks (for Scala 2.9.3 and 2.10.1) are available in the Sonatype OSS Maven repository (which is also mirrored on Maven Central):

group id: com.github.sbocq
artifact id: mbench_2.9.3 or mbench_2.10 (or mbench-benchmarks_*)
version: 0.2.4

Alternatively, you can download the jar files directly from Sonatype.

Building From Sources

Using sbt:

```
> git clone https://github.com/sbocq/mbench.git
> cd mbench
> sbt collect-jar
```

Running the Examples

Prerequisite: your Java environment must point to a JDK (not just a JRE), because mbench forks benchmarks in a server VM.

The examples can be launched in several ways:

  • By importing the sources of mbench and mbench-benchmarks in your favorite IDE with the Scala plugin installed and running the examples as regular main classes.

  • From the command line with the Scala library (2.9.3):

    • Running the parallel loops in Linux:

      $JAVA_HOME/bin/java -XX:+UseNUMA -cp mbench.jar:mbench-benchmarks.jar:scala-library.jar mbench.benchmarks.ParaLoops

    • Running all the examples in Windows:

      "%JAVA_HOME%\bin\java" -XX:+UseNUMA -cp mbench.jar;mbench-benchmarks.jar;scala-library.jar mbench.benchmarks.Benchmarks

  • From a sbt prompt:

      > project mbench-benchmarks
      > run
    

Sample SBT Project File

Here is a minimal build.sbt file that can be used to set up an mbench project with sbt:

```scala
name := "mybenchmarks"

version := "0.1"

scalaVersion := "2.10.1"

resolvers ++= Seq(
    "Sonatype Nexus Releases" at "https://oss.sonatype.org/content/repositories/releases")

libraryDependencies += "com.github.sbocq" %% "mbench" % "0.2.4"

fork := true

javaOptions <++= (fullClasspath in Runtime).map(cp =>
    Seq("-cp", cp.files.mkString(System.getProperty("path.separator")),
        "-Dmbench.log.stdout=true", "-Dmbench.new.date.dir=sbtrun"))
```

The additional options tell mbench to log messages on stdout (the default is stderr) and to always save its reports in a directory called sbtrun (the default is a directory named after the date of the benchmark). mbench supports several configuration properties that can either be passed on the command line, as above, or stored in an mbench.properties file. These properties are listed in the online documentation.
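For instance, the two options above could equally be stored in an mbench.properties file (a sketch assuming standard Java properties syntax; only the two property names shown earlier in this README are used):

```properties
# Log messages on stdout instead of stderr.
mbench.log.stdout=true

# Save reports in a fixed directory instead of a date-named one.
mbench.new.date.dir=sbtrun
```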

Projects Using MBench

  • Molecule: Molecule uses mbench extensively to compare the scalability of different runtime options and different frameworks on multicore hardware. Its overview reports can be found here.