Slides and code from the ACM DEBS 2020 Tutorial titled "The Role of Event-Time Order in Data Streaming Analysis", given by Vincenzo Gulisano (Chalmers University of Technology), Dimitris Palyvos-Giannas (Chalmers University of Technology), Bastian Havers (Chalmers University of Technology & Volvo Cars) and Marina Papatriantafilou (Chalmers University of Technology).
The corresponding paper is available online.
The recording of this tutorial can be found at:
Download the latest version of Apache Flink here and follow the Flink DataStream start-up instructions.
The demonstration dataset used for the disorder examples from DisorderQuery.java
is already provided in the folder input
.
The larger dataset used in FullQuery is available for download here.
Place it in a location of your choice and change the variable PATH_TO_LR_SOURCE_DATA
in Config.java
to the downloaded file. Then, in the same java file, point PATH_TO_REPOSITORY
to the location to which you downloaded the repository.
To your Flink config file found at YOUR_FLINK_HOME/conf/flink-conf.yaml
, append the following lines:
metrics.reporter.grph.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.grph.host: localhost
metrics.reporter.grph.port: 2003
metrics.reporter.grph.protocol: TCP
Then, copy YOUR_FLINK_HOME/opt/flink-metrics-graphite-1.10.0.jar
into YOUR_FLINK_HOME/lib
.
Then, install Graphite, e.g. as shown here. Flink will send metrics via port 2003 to Graphite. Finally, get Grafana from here and connect Graphite to Grafana.
Start a Flink cluster (as described here), and it will show up as a metric in the Graphite connector in Grafana, along with all the Flink jobs that you submit to the cluster.
Build the jar file with Maven:
cd PATH_TO_REPOSITORY
mvn clean package
There are two main classes provided in this repository, DisorderQuery
and FullQuery
.
Both have been tested with Apache Flink version >= 1.9.2.
DisorderQuery.java
runs without parameters, output files are created in the folderoutputs/disorder
.FullQuery.java
expects two parameters. The first is an integer >= 1 and defines the parallelism of the Join. The second parameter is eithertrue
orfalse
. If set tofalse
, the query will execute normally.true
enables sleep timers in the operators and stalls the watermark progress slightly, to make differences in the watermarks more visible and to simulate a higher load. Output files are created in the folderoutputs/full
.
An example:
./YOUR_FLINK_HOME/bin/flink start
./YOUR_FLINK_HOME/bin/flink run -c DEBS2020streamingEventTime.FullQuery PATH_TO_REPOSITORY/target/DEBS2020streamingEventTime-1.0.jar 4 false
will start the Full Query with Join parallelism 4 and no sleep timers.