This is an example/demo ("proof of concept") project. It is a producer-consumer program with randomly generated data. The premise is that multiple satellites produce data asynchronously. The data is serialized and published to a message broker, then processed and written to storage by an extract-transform-load (ETL) pipeline.
Demo video: `demo.mov`
Delta Lake output:
![delta-lake-output](https://private-user-images.githubusercontent.com/124453543/253350254-3ba75b69-7295-42ca-bb57-df8a388dd18b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjExNTI3OTgsIm5iZiI6MTcyMTE1MjQ5OCwicGF0aCI6Ii8xMjQ0NTM1NDMvMjUzMzUwMjU0LTNiYTc1YjY5LTcyOTUtNDJjYS1iYjU3LWRmOGEzODhkZDE4Yi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxNlQxNzU0NThaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT05ZDc3ODJkYzE0N2VkMjY2MDJjOGRiZmJmMGRkOTA4ODMzODJlODZjNzkzYWE2YTFhM2NiNjUyNWNhNDNmZjhjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.X0Qp9joxIyRZ5Bff6-JTfu0Jc-ju4pzW1J46rgmuwfs)
- Sample generation code - Python >3.7 (`pip install -r requirements.txt`) - `sample_data.py`
- Serialization - FlatBuffers
- Producer code - C++17
- Message broker - Kafka (standalone)
- Zookeeper (standalone)
- Consumer code - Scala/Spark and Java
- Storage - Delta Table
  - Parquet w/ Snappy compression
- Deployment - Docker and docker-compose
- Figures - Python >3.7 (`pip install -r requirements-notebook.txt`)
  - Run `jupyter notebook` and open `figures.ipynb`
The sample data is generated by a Python script (`./sample_data.py`) that creates simulated satellite data.
A satellite's sample is contained in one CSV file, so there is one CSV per satellite, and the file name is the satellite's "name". For example, here is the head of `sat0.csv`:
```
temp_c,battery_charge_pct,altitude,sensor1,sensor2,sensor3
17.74,54.09,786.75,0.42,0.52,0.47
27.58,44.02,1251.02,0.42,0.51,0.5
-8.67,46.33,1270.95,0.5,0.45,0.52
29.25,47.63,1818.88,0.47,0.49,0.51
```
The files contain no index column because a nanosecond timestamp is generated as the data is pumped through the system.
Fields:
- temp_c
- battery_charge_pct
- altitude
- sensor1
- sensor2
- sensor3
Writes data to `./data/samples`.
The serialization format chosen is Google FlatBuffers. FlatBuffers are similar to Google Protocol Buffers ("Protobufs"), with different trade-offs: FlatBuffers messages tend to be slightly larger in bytes than Protobufs, but FlatBuffers offers zero-copy deserialization.
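To make the zero-copy idea concrete, here is a small Python illustration of the concept, not the FlatBuffers API itself: field accessors read directly from the underlying buffer at fixed offsets, so there is no separate deserialization pass that copies data into intermediate objects.

```python
import struct

# Fixed-layout record: three little-endian doubles
# (temp_c, battery_charge_pct, altitude), 24 bytes total.
RECORD = struct.Struct("<ddd")

def pack_record(temp_c: float, battery: float, altitude: float) -> bytes:
    return RECORD.pack(temp_c, battery, altitude)

class RecordView:
    """Reads fields lazily from the buffer; nothing is copied up front,
    which is the same idea FlatBuffers accessors are built on."""
    def __init__(self, buf: bytes):
        self._view = memoryview(buf)

    @property
    def temp_c(self) -> float:
        return struct.unpack_from("<d", self._view, 0)[0]

    @property
    def altitude(self) -> float:
        return struct.unpack_from("<d", self._view, 16)[0]
```

In real FlatBuffers the offsets come from a vtable generated from the schema rather than being hard-coded, but access is likewise performed directly against the received buffer.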
This process is written in C++17. The producer reads all of the files asynchronously (`async`) using multiple threads. After the data is read and parsed from each CSV file, it is serialized and published to a Kafka topic called `sat-data`. Between events the program sleeps for a random duration within a given range to simulate "live" data.
The consumer process utilizes Spark Streaming in Scala. It consumes the `sat-data` Kafka topic, accumulating events in micro-batch fashion: events are collected during a periodic window (`5s`). Once the events are collected in an RDD (resilient distributed dataset), the FlatBuffers are deserialized (Java FlatBuffers) and the received timestamp is collected from Kafka. From this data, a dataframe is created and appended to a Delta Table, storing the structured data as Parquet files with Snappy compression.
Writes data to `./data/delta-tables/satellite-data-raw`.
Docker in combination with docker-compose is used to deploy the project. It is a simple way to deploy and ensures consistent behavior across platforms.
Start:

```shell
docker-compose up
```

Stop:

```shell
docker-compose down
```
To rebuild the FlatBuffers generated files:

Build the FlatBuffers `flatc` binary:

```shell
./build-flatc.sh
```

Run the following script to build the `event.fbs` FlatBuffers schema for C++ and Java:

```shell
./build-event-flatbuffers.sh
```
These git submodules are included for development purposes; they are not necessary for deployment.