viyadb-spark

Data processing backend (indexer) for ViyaDB based on Spark.

There are two processes defined in this project:

  • Streaming process
  • Batch process

The streaming process reads events in real time, pre-aggregates them, and writes TSV files loadable into ViyaDB to deep storage. The batch process creates a historical view of the data, containing the events from the previous batch plus the events written afterwards by the streaming process.

The overall flow looks like this:

                                       +----------------+
                                       |                |
                                       |  Streaming Job |
                                       |                |
                                       +----------------+
                                               |
                                               |  writes current events
                                               v
         +------------------+         +--------+---------+
         | Previous Period  |         | Current Period   |
         | Real-Time Events |--+      | Real-Time Events |
         +------------------+  |      +------------------+
                               |
         +------------------+  |      +------------------+
         | Historical       |  |      | Historical       |
         | Events           |  |      | Events           |
         +------------------+  |      +------------------+      ...
            |                  |                   ^
 -----------|------------------|-------------------|----------------------------->
            |                  |                   |                Timeline
            |                  v                   |
            |              +-------------+         |
            |              |             |         |  unions previous period events
            +------------> |  Batch Job  |---------+  with all the historical events
                           |             |            that existed before
                           +-------------+

Features

Real-time Process

The real-time process is responsible for:

  • Reading data from a source and parsing it (for now only Kafka support is provided as part of the code, but it can be extended to other sources)
  • Aggregating events by a configured time window
  • Generating data loadable by ViyaDB (TSV format)
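
The steps above can be illustrated with a minimal, Spark-free sketch. It is not the project's implementation: the event fields (`timestamp`, `country`, `clicks`), the function names, and the choice of summing metrics are all assumptions made for illustration. It truncates each event timestamp to the start of its time window, sums the metrics per window and dimension values, and renders the result as TSV.

```python
import csv
import io
from collections import defaultdict


def aggregate_by_window(events, window_seconds, dimensions, metrics):
    """Sum the metrics per (window start, dimension values) key."""
    totals = defaultdict(lambda: [0] * len(metrics))
    for event in events:
        # Truncate the epoch-seconds timestamp to the start of its window.
        ts = event["timestamp"]
        window_start = ts - ts % window_seconds
        key = (window_start,) + tuple(event[d] for d in dimensions)
        for i, metric in enumerate(metrics):
            totals[key][i] += event[metric]
    return [list(key) + sums for key, sums in sorted(totals.items())]


def to_tsv(rows):
    """Render rows as tab-separated text, one row per line."""
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical events; the real schema comes from the table configuration.
events = [
    {"timestamp": 1500000003, "country": "US", "clicks": 1},
    {"timestamp": 1500000042, "country": "US", "clicks": 2},
    {"timestamp": 1500000061, "country": "US", "clicks": 5},
    {"timestamp": 1500000010, "country": "DE", "clicks": 3},
]
rows = aggregate_by_window(events, 60, ["country"], ["clicks"])
tsv = to_tsv(rows)
```

With a 60-second window, the first two US events fall into the same window and collapse into a single row, while the third lands in the next window.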

Batch Process

The batch process does the following:

  • Reads events that were generated by the real-time process
  • Optionally, cleans irrelevant events out of the dataset
  • Aggregates the dataset
  • Partitions the data into parts of equal size (measured in number of aggregated rows), and writes these partitions back to historical storage
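
As a rough sketch of these steps, assuming rows are plain `(dimensions, metric)` pairs and that aggregation means summing, the batch logic could look like the following. The `run_batch` function, its `keep` filter, and the contiguous-chunk partitioning are hypothetical; the actual job runs on Spark and may partition differently.

```python
from collections import defaultdict


def run_batch(historical, realtime, num_partitions, keep=lambda row: True):
    """Union, filter, aggregate, and split rows into near-equal partitions."""
    totals = defaultdict(int)
    # Union the previous historical rows with the real-time rows.
    for dims, value in list(historical) + list(realtime):
        # Optional cleanup step: skip rows rejected by the filter.
        if keep((dims, value)):
            totals[dims] += value
    aggregated = sorted(totals.items())

    # Split into contiguous chunks whose sizes differ by at most one row.
    base, extra = divmod(len(aggregated), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + base + (1 if i < extra else 0)
        partitions.append(aggregated[start:end])
        start = end
    return partitions


# Hypothetical data: dimensions are (country, platform), the metric is a count.
historical = [(("US", "web"), 10), (("DE", "web"), 4)]
realtime = [(("US", "web"), 3), (("FR", "app"), 7), (("bot", "web"), 100)]

partitions = run_batch(
    historical, realtime, num_partitions=2,
    keep=lambda row: row[0][0] != "bot",
)
```

Here the `bot` row is dropped, the two `US` rows are merged, and the three remaining aggregated rows are split into partitions of two rows and one row.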

Prerequisites

Consul is used for storing configuration as well as for synchronizing the different components that a ViyaDB cluster consists of. To run either the real-time or the batch process, the following configuration keys must be present in Consul:

  • <consul prefix>/tables/<table name>/config - Table configuration
  • <consul prefix>/indexers/<indexer ID>/config - Indexer configuration
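
To check that both keys exist before submitting a job, the key paths can be built and read through Consul's HTTP KV endpoint (`/v1/kv/<key>?raw`). The sketch below assumes Consul listens on its default HTTP port 8500 without ACL tokens; the helper names and the example table and indexer names are hypothetical, and the configuration value is returned as raw text without assuming its format.

```python
from urllib.request import urlopen


def config_keys(prefix, table_name, indexer_id):
    """Build the two Consul keys the jobs expect."""
    prefix = prefix.strip("/")
    return {
        "table": f"{prefix}/tables/{table_name}/config",
        "indexer": f"{prefix}/indexers/{indexer_id}/config",
    }


def kv_url(consul_host, key, port=8500):
    """URL that returns the raw value of a key from the Consul KV store."""
    return f"http://{consul_host}:{port}/v1/kv/{key}?raw"


def fetch_config(consul_host, key):
    """Read a key as text; raises urllib.error.HTTPError if it is missing."""
    with urlopen(kv_url(consul_host, key)) as response:
        return response.read().decode("utf-8")


# "events" and "main" are placeholder table and indexer names.
keys = config_keys("viyadb", "events", "main")
indexer_url = kv_url("localhost", keys["indexer"])
```

Calling `fetch_config("localhost", keys["table"])` against a running Consul agent would return the stored table configuration, or fail with an HTTP 404 error if the key has not been created yet.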

Building

mvn package

Running

spark-submit --class <jobClass> target/viyadb-spark_2.11-0.1.0-uberjar.jar \
    --consul-host "<consul host>" --consul-prefix "viyadb" \
    --indexer-id "<indexer ID>"

To run the streaming job, use com.github.viyadb.spark.streaming.Job as the jobClass. To run the batch job, use com.github.viyadb.spark.batch.Job.

To see all available options, run:

spark-submit --class <jobClass> target/viyadb-spark_2.11-0.1.0-uberjar.jar --help
