satellite-kafka-spark-delta-lake-pipeline-example

This is an example/demo ("proof of concept") project. It is a producer-consumer program driven by randomly generated data. The premise is that multiple satellites produce data asynchronously. The data is serialized and published to a message broker, then processed and written to storage by an extract-transform-load (ETL) pipeline.

Demo video: demo.mov

Delta Lake output: see the delta-lake-output figure.

Architecture

  • Sample generation code - Python >3.7 (pip install -r requirements.txt)
    • sample_data.py
  • Serialization - FlatBuffers
  • Producer code - C++17
  • Message broker - Kafka (standalone)
    • Zookeeper (standalone)
  • Consumer code - Scala/Spark and Java
  • Storage - Delta Table
    • Parquet w/ Snappy compression
  • Deployment - Docker and docker-compose
  • Figures - Python >3.7 (pip install -r requirements-notebook.txt)
    • jupyter notebook
    • Open figures.ipynb

Sample data

The sample data is generated by a Python script (./sample_data.py) that creates simulated satellite data.

A "satellite sample" is contained in one csv, so one csv per satellite. The "name" of the satellite is the file name. For example, here is the head of sat0.csv.

temp_c,battery_charge_pct,altitude,sensor1,sensor2,sensor3
17.74,54.09,786.75,0.42,0.52,0.47
27.58,44.02,1251.02,0.42,0.51,0.5
-8.67,46.33,1270.95,0.5,0.45,0.52
29.25,47.63,1818.88,0.47,0.49,0.51

There is no index column in the files because a nanosecond timestamp is generated as the data is pumped through the system.

Fields

  • temp_c
  • battery_charge_pct
  • altitude
  • sensor1
  • sensor2
  • sensor3

Writes data to ./data/samples.
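
The generator itself is the Python script above; purely to illustrate the output format, here is a minimal Scala sketch that writes one satellite CSV with the listed fields. The file name, row count, and value ranges are made up for the example.

import java.io.{File, PrintWriter}
import scala.util.Random

object SampleDataSketch {
  def main(args: Array[String]): Unit = {
    new File("data/samples").mkdirs()
    val out = new PrintWriter("data/samples/sat0.csv")
    out.println("temp_c,battery_charge_pct,altitude,sensor1,sensor2,sensor3")
    for (_ <- 1 to 100) {
      val tempC   = -20 + Random.nextDouble() * 60   // illustrative temperature range
      val battery = 40 + Random.nextDouble() * 20    // illustrative charge range
      val alt     = 500 + Random.nextDouble() * 1500 // illustrative altitude range
      val sensors = Seq.fill(3)(0.4 + Random.nextDouble() * 0.2)
      out.println((Seq(tempC, battery, alt) ++ sensors).map(v => f"$v%.2f").mkString(","))
    }
    out.close()
  }
}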

Data serialization

The serialization format chosen is Google FlatBuffers. FlatBuffers is similar to Google Protocol Buffers ("Protobufs"), with different trade-offs: FlatBuffers payloads tend to be slightly larger (in bytes) than Protobuf payloads, but FlatBuffers offers zero-copy deserialization.
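
To make the zero-copy point concrete, here is a rough Scala sketch using the Java API that flatc generates. The Event class and its field accessors (addTempC, tempC(), etc.) are assumptions about what the generated code for event.fbs might look like, not the project's actual generated names.

import com.google.flatbuffers.FlatBufferBuilder
import java.nio.ByteBuffer

object SerializationSketch {
  def main(args: Array[String]): Unit = {
    // Build an event with the flatc-generated builder API (names assumed).
    val builder = new FlatBufferBuilder(64)
    Event.startEvent(builder)
    Event.addTempC(builder, 17.74f)
    Event.addBatteryChargePct(builder, 54.09f)
    Event.addAltitude(builder, 786.75f)
    Event.addSensor1(builder, 0.42f)
    Event.addSensor2(builder, 0.52f)
    Event.addSensor3(builder, 0.47f)
    builder.finish(Event.endEvent(builder))
    val bytes: Array[Byte] = builder.sizedByteArray() // this is what gets published to Kafka

    // Zero-copy read: accessors read straight out of the byte buffer, no unpacking pass.
    val event = Event.getRootAsEvent(ByteBuffer.wrap(bytes))
    println(s"temp_c=${event.tempC()} altitude=${event.altitude()}")
  }
}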

Producer

This process is written in C++17. The producer reads all of the sample files asynchronously using multiple threads. After each record is read and parsed from a CSV file, it is serialized and then published to a Kafka topic called sat-data. Between events the program sleeps for a random duration within a given range to simulate "live" data.
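
The real producer is written in C++17; the Scala sketch below only mirrors the publish flow described above. The broker address, message key, payload, and sleep range are illustrative; the topic name sat-data comes from the text.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.util.Random

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    val producer = new KafkaProducer[String, Array[Byte]](props)

    val payload = Array[Byte](0) // stand-in for a serialized FlatBuffers event
    // Publish one event, keyed by satellite name, then sleep a random amount
    // of time to simulate "live" data arriving.
    producer.send(new ProducerRecord[String, Array[Byte]]("sat-data", "sat0", payload))
    Thread.sleep(100 + Random.nextInt(900))

    producer.flush()
    producer.close()
  }
}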

Consumer

The consumer process uses Spark Streaming in Scala. It consumes the sat-data Kafka topic, accumulating events in micro-batch fashion: events are collected during a periodic window (5 s). Once a batch's events are collected in an RDD (resilient distributed dataset), the FlatBuffers are deserialized (using the Java FlatBuffers library) and the receive timestamp is taken from Kafka. From this data a DataFrame is created and appended to a Delta table, which stores the structured data as Parquet files with Snappy compression.

Writes data to ./data/delta-tables/satellite-data-raw.
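
A condensed Scala sketch of that micro-batch flow is below. The Kafka and Spark calls follow the spark-streaming-kafka-0-10 API, but the record case class, the decode placeholder, and the broker address are illustrative; the real job deserializes each FlatBuffer with the generated Java code.

import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object ConsumerSketch {
  case class SatRecord(satellite: String, tempC: Double, altitude: Double, kafkaTimestampMs: Long)

  // Placeholder for the FlatBuffers deserialization step.
  def decode(sat: String, bytes: Array[Byte], tsMs: Long): SatRecord =
    SatRecord(sat, 0.0, 0.0, tsMs)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sat-consumer-sketch").getOrCreate()
    import spark.implicits._
    val ssc = new StreamingContext(spark.sparkContext, Seconds(5)) // 5 s micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092", // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[ByteArrayDeserializer],
      "group.id" -> "sat-consumer"
    )

    val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, Array[Byte]](Seq("sat-data"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Deserialize each event, keep Kafka's receive timestamp, then append to the Delta table.
      val df = rdd.map(r => decode(r.key(), r.value(), r.timestamp())).toDF()
      df.write.format("delta").mode("append").save("./data/delta-tables/satellite-data-raw")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}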

Deployment

Docker is used in combination with docker-compose to deploy the project. It is a simple way to deploy that ensures consistent behavior across platforms.

Start: docker-compose up

Stop: docker-compose down

Rebuilding FlatBuffers files (dev-only)

To rebuild the generated FlatBuffers files:

Build the FlatBuffers flatc binary:

./build-flatc.sh

Run the following script to compile the event.fbs FlatBuffers schema for C++ and Java:

./build-event-flatbuffers.sh

Notes on third-party submodules

These git submodules are included for development purposes. They are not necessary for deployment.