Skip to content
Streaming Analytics platform, built with Apache Flink and Kafka
Branch: master
Clone or download
Latest commit 4df7cc9 Nov 26, 2018
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
models updated teqnation presentation Apr 25, 2018
presentations added KNVI logo Nov 25, 2018
project small changes Apr 26, 2018
styx-app/src updated main jobs, updated documentation Apr 7, 2018
styx-appRunner/src updated main jobs, updated documentation Apr 7, 2018
styx-commons/src
styx-domain/src First code Feb 6, 2018
styx-flink-kafka/src
styx-frameworks-cassandra/src
styx-frameworks-flink/src
styx-frameworks-kafka/src upgrade to Kafka 0.11 Feb 7, 2018
styx-frameworks-openscoring/src/main/scala/com/styx/frameworks/openscoring First code Feb 6, 2018
styx-interfaces/src/main/scala/com/styx/interfaces/repository
styx-support-datagen/src/main/scala/com/styx/support/datagen First code Feb 6, 2018
styx-support-db/src/cql First code Feb 6, 2018
styx-support-deployment-validator/src/confidence
styx-support-ops First code Feb 6, 2018
.gitignore added gitignore file Feb 7, 2018
.travis.yml
README.md added event time pics Apr 26, 2018
TODO.txt added pictures for presentation Apr 23, 2018
_config.yml Set theme jekyll-theme-slate Feb 6, 2018
assembly.sbt small changes Apr 26, 2018
build.sbt small changes Apr 26, 2018
findbugs-excludefilter.xml First code Feb 6, 2018
findbugs.sbt First code Feb 6, 2018
package.sbt First code Feb 6, 2018
styx.gif
version.sbt First code Feb 6, 2018

README.md

Styx logo

Styx is the STreaming AnalytYX engine that can be used for several use cases, for example:

  • producing actions or insights in real-time data
  • predictive maintenance on systems that produce event data
  • providing recommendations based on click events

Styx is build with the following technology:

  • Apache Flink
  • Apache Kafka
  • Apache Cassandra
  • Openscoring.io

The entire application is written in Scala. For scoring the machine learning models, we load them in PMML format. The Protobuf serialization format is used to store messages on the Kafka bus.

A use case can consist of multiple stages in a streaming data pipeline. A typical use case follows a pattern of three steps:

  1. Complex Event Processing Engine: combining real-time 'raw' events with historical data to produce an interesting 'business event';
  2. Machine Learning Engine: business events are evaluated by a real-time scoring (evaluation) engine, where machine learning models or business rules are being applied on the events combined with 'static' data from a database or external APIs. The result is an 'intermediate event';
  3. Post Processor: the handling of the intermediate events are transformed into a useful action or notification event, for example an XML message that produces and email or a push notification on a mobile app.

The events are all stored on a Kafka message bus, so Styx is essentially a Kafka-in, Kafka-out framework that contains all the business logic. In a brief overview, this leads to the following architecture:

Raw Event => CEP Engine => Business Event => ML Engine => Intermediate Event => Post Processor => Action/Notification Event

Abbreviations

  • CEP = Complex Event Processor
  • ML = Machine Learning
  • PP = Post-Processor

Example use case

The code contains an example use case called shopping. In this use case, payment transactions are processed as raw events. The CEP Engine uses pattern recognition logic to detect if a customer is shopping and potentially running out of money on his or her running bank account. If that is the base (business event), in the ML Engine a model is scored to determine the best action for each individual customer. This action is stored as an intermediate event. Finally, the Post Processor transforms the notification into a real action for the customer, e.g. 'Transfer money from my savings account'.

Performance

Styx has been tested on a 10-node Flink cluster, where 10 million random events were loaded on the Kafka cluster. The example use case (shopping detection and notification) performed as follows:

  • maximum throughput of entire data pipeline: ~450k events per second
  • average throughput of entire data pipeline: ~100k events per second
  • average latency of 95-percentile of events: 42ms

Development environment

To run the Flink job locally:

sbt "project styxAppRunner" run

To run all unit and integration tests, and produce a code coverage report:

sbt clean coverage test

To package the code into a binary 'fat jar':

sbt assembly

Data Generator

To run the data generator:

java -jar customer-profile-0.4-java-1.7.jar --totalNumber=6000000
java -Xms1G -Xmx10G -jar generator-2017-01-02-V2-timestamp.jar

Project Structure

  • styx-app: the main bootstrapped Flink jobs
  • styx-appRunner: a helper standalone app which includes the Flink dependencies, that can be used to run and test locally
  • styx-commons: common independent classes for logging, exception handling, file I/O, configuration, etc.
  • styx-domain: the core domain entities of the use cases
  • styx-interfaces: the technology-independent abstract classes and traits
  • styx-frameworks-flink: the business logic of the streaming data processing engine
  • styx-frameworks-cassandra: the database connections
  • styx-frameworks-kafka: the message bus connections
  • styx-frameworks-openscoring: the model serving and scoring mechanism
  • styx-flink-core - the use case independent classes (most inner dependency, shouldn't depend on kafka/cassandra)
  • styx-flink-kafkaconnect - provides streams that read / write from/to kafka
  • styx-support-db: all database-related CQL scripts
  • styx-support-datagen: a data generator that puts messages on the Kafka bus
  • styx-support-deployment-validator: a tool that validates an use case end-to-end
  • styx-support-ops: scripts to deploy and test the application
You can’t perform that action at this time.