SQLFlow enables SQL-based stream processing, powered by DuckDB. It lets you write Kafka stream-processing logic in pure SQL.

SQLFlow supports:
- Kafka streaming: a consumer that performs SQL-based transformations and publishes the output to another Kafka topic
- JSON on the wire
- Stream transformations written in pure SQL, powered by DuckDB
- A performant librdkafka-based Python consumer
SQLFlow is currently not a good fit for:
- Stateful stream processing
- Wire protocols other than JSON
SQLFlow is a Kafka consumer that embeds SQL for stream transformation:
*[Diagram: SQLFlow as a Kafka consumer embedding SQL for stream transformation]*
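Conceptually, each batch of JSON messages is loaded into a table, the pipeline's SQL statement runs against it, and the result rows are serialized back out as JSON. The sketch below illustrates that flow using Python's built-in `sqlite3` as a stand-in for DuckDB, assuming a single `city` field in the messages (the real engine, schema handling, and batching differ):

```python
import json
import sqlite3

def transform_batch(messages: list[str], sql: str) -> list[str]:
    """Load a batch of JSON messages into a table, run SQL, emit JSON rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE batch (city TEXT)")  # assumed single-field schema
    conn.executemany(
        "INSERT INTO batch (city) VALUES (?)",
        [(json.loads(m)["city"],) for m in messages],
    )
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    return [
        json.dumps(dict(zip(cols, row)), separators=(",", ":"))
        for row in cur.fetchall()
    ]

batch = ['{"city":"New York"}', '{"city":"New York"}', '{"city":"Baltimore"}']
out = transform_batch(
    batch, "SELECT city, COUNT(*) AS city_count FROM batch GROUP BY city"
)
print(out)
```

In the real pipeline the input batch arrives from a Kafka topic and the output rows are published to another topic.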
Docker is the easiest way to get started.
- Pull the sql-flow docker image:

```
docker pull turbolytics/sql-flow:latest
```
- Validate the configuration by invoking it on test data:

```
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.yml /tmp/conf/fixtures/simple.json

['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']
```
- Start Kafka locally using Docker:

```
docker-compose -f dev/kafka-single.yml up -d
```
- Publish test messages to Kafka:

```
python3 cmd/publish-test-data.py --num-messages=10000 --topic="topic-local-docker"
```
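Judging by the output shown further below (`San Francisco504`, etc.), the test publisher generates JSON payloads whose `city` field carries a numeric suffix. A hypothetical sketch of that payload generation (the field name and suffix scheme are inferred from the output, not taken from the script itself; actual publishing to Kafka is done by `cmd/publish-test-data.py` via the librdkafka-based client):

```python
import json
import random

def generate_messages(num_messages: int) -> list[str]:
    """Build JSON payloads shaped like the test data: {"city": "San Francisco<N>"}."""
    return [
        json.dumps({"city": f"San Francisco{random.randint(0, 999)}"})
        for _ in range(num_messages)
    ]

msgs = generate_messages(5)
print(msgs)
```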
- Start a Kafka consumer inside the docker-compose container:

```
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-local-docker
```
- Start SQLFlow in Docker:

```
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest run /tmp/conf/config/local.docker.yml
```
- Verify the output in the Kafka consumer:

```
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}
...
```
The `dev invoke` command tests a SQLFlow pipeline configuration against a batch of test data. This enables fast local development feedback before launching a SQLFlow consumer that reads from Kafka.
The heart of SQLFlow is the pipeline configuration file. Each configuration file specifies:
- Kafka configuration
- Pipeline configuration
- Input configuration
- SQL transformation
- Output configuration
*[Diagram: structure of a SQLFlow pipeline configuration file]*
Every instance of SQLFlow needs a pipeline configuration file.
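As a rough illustration only, a pipeline configuration might look like the sketch below. Every key name here is hypothetical (check the files under `dev/config/examples/` for the real schema); the topics and SQL echo the getting-started example above:

```yaml
# Hypothetical sketch; field names are illustrative, not the actual schema.
kafka:
  brokers: ["kafka1:9092"]
  group_id: sqlflow-example
input:
  topic: topic-local-docker
pipeline:
  sql: |
    SELECT city, COUNT(*) AS city_count
    FROM batch
    GROUP BY city
output:
  topic: output-local-docker
```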
Coming soon! Until then, to develop locally:
- Install the Python dependencies:

```
pip install -r requirements.txt
pip install -r requirements.dev.txt
```

On macOS with a Homebrew-installed librdkafka, confluent-kafka may need explicit include and library paths:

```
C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka
```
- Run the tests:

```
pytest tests
```
## Benchmarks

### Methodology
Each test loads one million records into Kafka, then runs the SQLFlow consumer until every message is processed. Each test captures the maximum resident memory (max RSS) during the benchmark and the average message-ingestion throughput.
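On Unix systems, the peak resident set size can be read from the OS after a run. A minimal sketch using Python's standard `resource` module (not necessarily how the benchmark scripts measure it; note the units differ by platform: kilobytes on Linux, bytes on macOS):

```python
import resource

def max_rss() -> int:
    """Return this process's peak resident set size, as reported by getrusage."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Allocate something so the peak is clearly non-zero, then sample it.
_buf = bytearray(10 * 1024 * 1024)  # ~10 MiB
peak = max_rss()
print(peak)
```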
### System

```
Hardware:

  Hardware Overview:
    Model Name: MacBook Pro
    Model Identifier: MacBookPro18,3
    Model Number: Z15G001X2LL/A
    Chip: Apple M1 Pro
    Total Number of Cores: 10 (8 performance and 2 efficiency)
    Memory: 32 GB
    Activation Lock Status: Enabled
```
| Name | Throughput | Max RSS Memory | Peak Memory Usage |
|---|---|---|---|
| Simple Aggregation Memory | 45,000 msgs/sec | 230 MiB | 130 MiB |
| Simple Aggregation Disk | 36,000 msgs/sec | 256 MiB | 102 MiB |
| Enrichment | 13,000 msgs/sec | 368 MiB | 124 MiB |
| CSV Disk Join | 11,500 msgs/sec | 312 MiB | 152 MiB |
| CSV Memory Join | 33,200 msgs/sec | 300 MiB | 107 MiB |
### Simple Aggregation

Performs a simple aggregate. Output is significantly smaller than input.

```
./benchmark/simple-agg-disk.sh
./benchmark/simple-agg-mem.sh
```
### Enrichment

Performs an enrichment. Output is 1:1 records with input, but each output record is enhanced with additional information.

```
./benchmark/enrichment.sh
```
### CSV Join

```
./benchmark/csv-disk-join.sh
./benchmark/csv-mem-join.sh
```
Like SQLFlow? Use SQLFlow? Have a feature request? Please let us know: danny@turbolytics.io