Data processing backend for ViyaDB based on Spark streaming

Data processing backend (indexer) for ViyaDB based on Spark.

There are two processes defined in this project:

  • Streaming process
  • Batch process

The streaming process reads events in real time, pre-aggregates them, and writes them to deep storage as TSV files that can be loaded into ViyaDB. The batch process builds a historical view of the data: it contains the events from the previous batch plus the events that the streaming process has produced since then.

The flow between the two processes looks like this:

                                       +----------------+
                                       |                |
                                       |  Streaming Job |
                                       |                |
                                       +----------------+
                                               |  writes current events
         +------------------+         +--------+---------+
         | Previous Period  |         | Current Period   |
         | Real-Time Events |--+      | Real-Time Events |
         +------------------+  |      +------------------+
         +------------------+  |      +------------------+
         | Historical       |  |      | Historical       |
         | Events           |  |      | Events           |
         +------------------+  |      +------------------+      ...
            |                  |                   ^
            |                  |                   |                Timeline
            |                  v                   |
            |              +-------------+         |
            |              |             |         |  unions previous period events
            +------------> |  Batch Job  |---------+  with all the historical events
                           |             |            that existed before
                           +-------------+


Real-time Process

The real-time process does the following:

  • Reads data from a source and parses it (for now only Kafka is supported out of the box, but other sources can be added)
  • Aggregates events by the configured time window
  • Generates data loadable by ViyaDB (TSV format)
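
The steps above can be illustrated with a small, framework-free sketch. It is not the project's Spark code: the event fields (`timestamp`, `country`, `clicks`), the 60-second window, and all function names are assumptions made up for the example. It truncates each event's timestamp to the start of its window, sums metrics per window and dimension values, and renders the result as TSV lines.

```python
from collections import defaultdict


def truncate_to_window(timestamp, window_seconds):
    """Round an epoch timestamp down to the start of its time window."""
    return timestamp - timestamp % window_seconds


def pre_aggregate(events, dimensions, metrics, window_seconds):
    """Sum metrics of events that share a window start and dimension values."""
    totals = defaultdict(lambda: [0] * len(metrics))
    for event in events:
        key = (
            truncate_to_window(event["timestamp"], window_seconds),
            *(event[d] for d in dimensions),
        )
        for i, metric in enumerate(metrics):
            totals[key][i] += event[metric]
    return dict(totals)


def to_tsv(aggregated):
    """Render aggregated rows as tab-separated lines, sorted by key."""
    return [
        "\t".join(str(v) for v in (*key, *values))
        for key, values in sorted(aggregated.items())
    ]


# Hypothetical events; the field names are illustrative only.
events = [
    {"timestamp": 1200, "country": "US", "clicks": 1},
    {"timestamp": 1230, "country": "US", "clicks": 2},
    {"timestamp": 1270, "country": "US", "clicks": 5},
    {"timestamp": 1210, "country": "IL", "clicks": 3},
]
lines = to_tsv(pre_aggregate(events, ["country"], ["clicks"], window_seconds=60))
```

Here the two US events that fall into the window starting at 1200 collapse into a single row, which is the point of pre-aggregation: fewer rows reach deep storage than events were read.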

Batch Process

The batch process does the following:

  • Reads events that were generated by the real-time process
  • Optionally cleans irrelevant events out of the dataset
  • Aggregates the dataset
  • Partitions the data into parts of equal size (measured in number of aggregated rows), and writes these partitions back to historical storage
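
As a rough illustration of these steps, the sketch below unions historical rows with rows from the real-time process, applies an optional filter, aggregates by dimension values, and splits the result into chunks whose row counts differ by at most one. It is plain Python rather than the project's Spark job, and the row layout, the `keep` filter, and the function names are assumptions for the example; how the actual job chooses partition boundaries is not described here.

```python
def aggregate(rows, num_dimensions):
    """Sum metric columns of rows that share the same dimension values."""
    totals = {}
    for row in rows:
        key = tuple(row[:num_dimensions])
        values = row[num_dimensions:]
        if key in totals:
            totals[key] = [a + b for a, b in zip(totals[key], values)]
        else:
            totals[key] = list(values)
    return [(*key, *values) for key, values in sorted(totals.items())]


def partition_evenly(rows, num_partitions):
    """Split rows into num_partitions chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + base + (1 if i < extra else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def run_batch(historical, realtime, num_dimensions, num_partitions,
              keep=lambda row: True):
    """Union both inputs, drop rows rejected by keep, aggregate, and partition."""
    rows = [row for row in (*historical, *realtime) if keep(row)]
    return partition_evenly(aggregate(rows, num_dimensions), num_partitions)


# Hypothetical rows: one dimension (country) followed by one metric (clicks).
historical = [("US", 10), ("IL", 4)]
realtime = [("US", 5), ("FR", 1), ("bot", 99)]
partitions = run_batch(historical, realtime, num_dimensions=1, num_partitions=2,
                       keep=lambda row: row[0] != "bot")
```

Aggregating before partitioning matters: partition sizes are balanced on aggregated rows, not on raw events.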


Consul is used for storing configuration and for synchronizing the different components of a ViyaDB cluster. To run either the real-time or the batch process, the following configuration keys must be present in Consul:

  • <consul prefix>/tables/<table name>/config - Table configuration
  • <consul prefix>/indexers/<indexer ID>/config - Indexer configuration
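
To check that both keys exist before starting a job, they can be read through Consul's HTTP KV endpoint (`/v1/kv/<key>`), which returns a JSON list whose entries carry a base64-encoded `Value`. The sketch below assumes Consul's default HTTP port 8500; the table name `events`, the indexer ID `main`, and the helper names are hypothetical.

```python
import base64
import json
import urllib.request


def table_config_key(prefix, table):
    """Key under which the table configuration is stored."""
    return f"{prefix}/tables/{table}/config"


def indexer_config_key(prefix, indexer_id):
    """Key under which the indexer configuration is stored."""
    return f"{prefix}/indexers/{indexer_id}/config"


def kv_url(host, key, port=8500):
    """URL of Consul's KV endpoint for a key (8500 is Consul's default HTTP port)."""
    return f"http://{host}:{port}/v1/kv/{key}"


def decode_kv_response(body):
    """Extract the stored value from a Consul KV response body."""
    entry = json.loads(body)[0]
    return base64.b64decode(entry["Value"]).decode("utf-8")


def fetch_config(host, key, port=8500):
    """Fetch a key from Consul; raises urllib.error.HTTPError (404) if it is absent."""
    with urllib.request.urlopen(kv_url(host, key, port)) as response:
        return decode_kv_response(response.read())


# Hypothetical names, for illustration only.
table_key = table_config_key("viyadb", "events")
indexer_key = indexer_config_key("viyadb", "main")
```

Calling `fetch_config("<consul host>", table_key)` against a running Consul agent returns the stored configuration as text; a 404 error means the key has not been created yet.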


To build the project, run:

mvn package


To run a job, submit it to Spark:

spark-submit --class <jobClass> target/viyadb-spark_2.11-0.0.2-uberjar.jar \
    --consul-host "<consul host>" --consul-prefix "viyadb" \
    --indexer-id "<indexer ID>"

To run the streaming job, use com.github.viyadb.spark.streaming.Job as jobClass. To run the batch job, use com.github.viyadb.spark.batch.Job.

To see all available options, run:

spark-submit --class <jobClass> target/viyadb-spark_2.11-0.0.2-uberjar.jar --help
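
When jobs are launched from a scheduler or script, the command line can be assembled programmatically. The following is a minimal sketch that only uses the job classes, jar path, and flags shown above; the `mode` names (`streaming`, `batch`), the helper name, and the example host and indexer ID are assumptions for illustration.

```python
import shlex

# Job classes and jar path as given in this README.
JOB_CLASSES = {
    "streaming": "com.github.viyadb.spark.streaming.Job",
    "batch": "com.github.viyadb.spark.batch.Job",
}
JAR = "target/viyadb-spark_2.11-0.0.2-uberjar.jar"


def spark_submit_args(mode, consul_host, indexer_id,
                      consul_prefix="viyadb", jar=JAR):
    """Build the spark-submit argument list for the chosen job."""
    return [
        "spark-submit", "--class", JOB_CLASSES[mode], jar,
        "--consul-host", consul_host,
        "--consul-prefix", consul_prefix,
        "--indexer-id", indexer_id,
    ]


# Hypothetical host and indexer ID.
args = spark_submit_args("batch", "consul.example.internal", "main")
command_line = shlex.join(args)  # printable form, requires Python 3.8+
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a machine where `spark-submit` is on the PATH; keeping it as a list avoids shell quoting issues.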