This prototype implements the online adaptive stratified reservoir sampling (OASRS) algorithm using Spark 2.0.2
Building Spark-based StreamApprox is the same as building Apache Spark. See scripts for Spark/Flink building and installation instructions.
This prototype supports a sampling function reservoirStratifiedSample() for Spark RDD by implementing the OASRS algorithm. Users can use this function as a PairRDD function of Spark.
val inputStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
//Take a sample using the OASRS algorithm
val sampleItems = inputStream.map(_._2)
.map(line => (line.split(",")(0), line.split(",")(1).toDouble))
.transform(x => x.reservoirStratifiedSample(sampleSize))
.reduceByKeyAndWindow((a: Double, b: Double) => (a + b), Seconds(10), Seconds(5))
sampleItems.print()
- If you have any question please shoot me an email: do.le_quoc@tu-dresden.de
- Note that we are currently working to adapt our implementation with the new version (kafka-0-10) of Spark-Kafka connector.
Published under GNU General Public License v2.0 (GPLv2), see LICENSE