# <del>Kafi Streams - Complex Stream Processing on Python Without the Complexity</del>

# Stream Processing Unchained
### (and a very very early prototype of Kafi Streams)
#### Ralph Debusmann (Migros-Genossenschafts-Bund)

<hr>

# 1 The Stream Processing Crisis


## The Commercial Side

### Confluent + Flink

* Projected Confluent ARR FY 2025 $1B+
* Flink ARR FY 2025 $14M? (https://bigdata.2minutestreaming.com/p/event-streaming-is-topping-out) - that would be only about 1.4% of Confluent's ARR (but with QoQ growth of 70%)

### Lots of startups, none of them really close to the hockey stick

(in alphabetical order)

* Arroyo (acquired by Cloudflare)
* Bytewax
* Decodable
* DeltaStream
* Feldera
* Materialize
* Pathway
* Quix
* Responsive
* RisingWave
* Timeplus

### Some bigger companies, but is their focus really on stream processing?

* Databricks
* MongoDB
* Snowflake
* Ververica

## The Lakehouse Trend

### Multimodal streams

* Aiven/Apache Kafka (Iceberg Topics)
* Confluent (Tableflow)
* Redpanda (Iceberg Topics)

### With that, it seems that processing is also shifting to the lakehouse

* Confluent (Real-time Context Engine for Kafka/Iceberg)
* RisingWave (lots of Iceberg stuff, e.g. RisingWave Iceberg catalog)
* Ververica (Fluss/Paimon)


<hr>

# 2 Why Is Stream Processing Losing Its Momentum?

* aka "Why Hasn't Stream Processing Ever Really Taken Off?"

* aka "Why Do People Still Use Databases For Data Processing?"

## Few Real Real-time Use Cases

* Sure, there is fraud detection and the like in banking and online retail, but beyond that?

* E.g. at Migros, <5% real-time use cases.

## Stream Processing Is Still Hard

* Just talk to someone who implements Kafka Streams/Flink pipelines...

* ...if you can even find someone who is well-versed enough to do it (and whom you can actually pay).

* And then, try to debug these pipelines if they fail.

* In the end, it is just extremely costly and is only worth it for an even smaller subset of real-time use cases.

## For the Most Part, Stream Processing Is Not Even Consistent

* The leading stream processing systems/"classical stream processing" (Kafka Streams/Flink) are *eventually* consistent.

* So if you need stronger consistency guarantees, things get even hairier. You need workarounds. Add even more complexity.

* Strongly consistent stream processing frameworks exist (but have low adoption):
  * Bytewax (Timely Dataflow)
  * Feldera (Database Stream Processor)
  * Materialize, Pathway (Differential Dataflow)

* Let's have a look at Jamie Brandon's consistency benchmark... (https://www.scattered-thoughts.net/writing/internal-consistency-in-streaming-systems/)
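
To make the consistency point concrete, here is a minimal toy sketch in plain Python. It is only loosely inspired by the idea of checking an invariant over money transfers (the grand total of all balances must always be zero); it is not a reproduction of the benchmark linked above, and the data and function names are made up for illustration. It assumes an "eventually consistent" engine emits an output after every partial update, while an "internally consistent" one only emits outputs that reflect a complete input.

```python
from collections import defaultdict

# Hypothetical toy data: (from_account, to_account, amount)
TRANSFERS = [("a", "b", 10), ("b", "c", 5), ("c", "a", 7)]


def eventually_consistent_totals(transfers):
    """Apply the credit and the debit of each transfer as two separate
    updates and emit the grand total after every single update."""
    balance = defaultdict(int)
    totals = []
    for src, dst, amount in transfers:
        balance[dst] += amount  # the credit side arrives first
        totals.append(sum(balance.values()))
        balance[src] -= amount  # the debit side arrives later
        totals.append(sum(balance.values()))
    return totals


def internally_consistent_totals(transfers):
    """Apply both sides of a transfer under one logical timestamp and
    emit the grand total only once per transfer."""
    balance = defaultdict(int)
    totals = []
    for src, dst, amount in transfers:
        balance[dst] += amount
        balance[src] -= amount
        totals.append(sum(balance.values()))
    return totals


eventual = eventually_consistent_totals(TRANSFERS)
consistent = internally_consistent_totals(TRANSFERS)
print(eventual)    # intermediate outputs violate the invariant
print(consistent)  # every output satisfies the invariant
```

In the first variant a downstream consumer can observe totals that never existed in any real state of the input; it converges in the end, but any alert or decision taken on an intermediate value would be wrong.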

## So, Let's Take The Escape Route And Go Back To Batch?

* That is where things seem to be heading now, Confluent included...

* ...see also "The Lakehouse Trend" above.

* But do you also think we should give up?
  * Go back to batch?
  * Go back to "just push all the data into a database/lakehouse and be happy as in the world before stream processing"?

## I Won't. I Still Believe In Stream Processing

* I believe in incremental computation (=only compute deltas, never compute anything twice).

* I believe in a true shift-left architecture (as in - not shifting the technology left (lakehouse), but bringing the operational data from the lakehouse to streaming).

* I believe in simplicity and Occam's Razor (as in - not having to introduce another complexity monster (e.g. Iceberg) to the data streaming platform).
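
As a minimal illustration of the incremental-computation idea ("only compute deltas"), the following plain-Python sketch maintains a mean from inserted and retracted values instead of recomputing it over the full history. The class and function names are hypothetical and not taken from any library.

```python
class IncrementalMean:
    """Keeps a running count and sum, so each delta costs O(size of delta)
    instead of O(size of history)."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def apply(self, delta_values, weight=1):
        """weight=+1 inserts the values, weight=-1 retracts them."""
        for value in delta_values:
            self.count += weight
            self.total += weight * value
        return self.value()

    def value(self):
        return self.total / self.count if self.count else None


def batch_mean(values):
    """The batch way: recompute from the full history every time."""
    return sum(values) / len(values) if values else None


history = []
inc = IncrementalMean()
for batch in ([4, 8], [6], [2, 10]):
    history.extend(batch)
    # Both approaches agree, but only one of them re-reads the history.
    assert inc.apply(batch) == batch_mean(history)

# A retraction is just a delta with a negative weight.
after_retraction = inc.apply([10], weight=-1)
history.remove(10)
print(after_retraction, batch_mean(history))
```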


<hr>

# 3 It Is Time To Unchain Stream Processing

## Let's Begin With What Could Have Gone Wrong

* The assumptions behind "classical" stream processing
  * horizontal scaling vs. single binary
  * local operator state/clock vs. global state/clock
  * ad-hoc operator semantics vs. a mathematically sound theory 

* That is, "classical" stream processing is architected to sacrifice:
  * consistency and "database-like" semantics for extreme scale
  * a sound "database-like" abstraction, ending up with leaky abstractions instead (https://ralphmdebusmann.substack.com/p/why-streaming-still-isnt-mainstream)

* But, the vast majority of use cases do *not* need extreme scale and would also greatly benefit from a more sound abstraction!

* Some form of premature optimization?

* As a result, stream processing semantics remain far detached from database semantics, and are so complicated that stream processing can probably never become mainstream.

* Attempts to fix this: Keep the "classical" stream processing engines as your basis, try to hide the leaky abstractions and try to bolt on a more database-like semantics:
  * ksqlDB
  * Flink SQL
  * ...

* Problematic. In software architecture, you cannot entirely hide leaky abstractions in your core engine, and you cannot entirely bolt on a completely different semantics.

* This is why ksqlDB has failed, and why Flink SQL can almost never work on its own - you still need the DataStream API (which Confluent doesn't even offer yet on Confluent Cloud) to fill the gaps.

## Challenging The Assumptions Behind "Classical" Stream Processing

* There is a diaspora of stream processing engines that do challenge these assumptions:
  * Differential Dataflow (DD; McSherry 2014) - Materialize, Pathway
  * Database Stream Processor (DBSP; Budiu et al. 2022) - Feldera

* Low adoption, yes.

* Few people actually understand the underpinnings - too theoretical/academic.

* But does this mean we should keep ignoring them?

* Who of you has ever used one of them?

* Do you even know what they are about?

## New Assumptions

* New assumptions:
  * single binary vs. horizontal scaling
  * global state/clock vs. local operator state/clock
  * a mathematically sound theory vs. ad-hoc operator semantics

* This is the trade-off you make now:
  * "database-like" semantics instead of extreme scale
  * a sound "database-like" abstraction instead of leaky abstractions

* Much more appealing to non-stream processing nerds?

## DBSP

* Few people on earth have really ever fully understood DD - except its inventor, Frank McSherry ;-)

* DBSP is a simplified (but still very academic) mathematical foundation for database-like stream processing based on ideas from DD (https://arxiv.org/abs/2203.16684).

* Coming from VMware Research (main researchers: Mihai Budiu, Leonid Ryzhyk).

* Vendor: Feldera

* Feldera claims:
  * They can seamlessly migrate existing Spark batch jobs to Feldera without changing the 10K lines of SQL over many dozens of tables and views with hundreds of joins, aggregates, unions, distincts, deeply nested subqueries and more - same semantics for stream processing and batch! (https://www.feldera.com/blog/how-feldera-customers-slash-cloud-spend).
  * Fully incremental: They can save a lot of compute and memory for their customers - from 50+ node Spark cluster down to 1-2 Feldera compute nodes - updating the views in milliseconds.

* Rust implementation: https://github.com/feldera/feldera

* First Python implementation (PyDBSP) by Bruno Cardona/Rucy: https://github.com/brurucy/pydbsp
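
For readers who have never seen the core idea, here is a minimal sketch in plain Python. It does not use the PyDBSP or Feldera APIs; all function and class names are illustrative assumptions. It relies on three ideas from the DBSP paper: collections are Z-sets (elements with integer weights, positive for inserts and negative for deletes), linear operators such as filter can be applied to deltas directly, and a join is bilinear, so its output delta can be computed from three small joins rather than by re-joining everything.

```python
def zadd(a, b):
    """Z-set addition: add weights pointwise, drop zero-weight elements."""
    out = dict(a)
    for elem, weight in b.items():
        out[elem] = out.get(elem, 0) + weight
        if out[elem] == 0:
            del out[elem]
    return out


def zfilter(pred, z):
    """Linear operator: filtering a delta gives the delta of the filter."""
    return {elem: w for elem, w in z.items() if pred(elem)}


def zjoin(a, b):
    """Equi-join of (key, value) Z-sets on the key; weights multiply."""
    out = {}
    for (ka, va), wa in a.items():
        for (kb, vb), wb in b.items():
            if ka == kb:
                out = zadd(out, {(ka, va, vb): wa * wb})
    return out


class IncrementalJoin:
    """Keeps the integrated inputs and emits only output deltas."""

    def __init__(self):
        self.a = {}
        self.b = {}

    def step(self, da, db):
        # delta(a JOIN b) = da JOIN db + a_prev JOIN db + da JOIN b_prev
        delta = zadd(zjoin(da, db), zadd(zjoin(self.a, db), zjoin(da, self.b)))
        self.a = zadd(self.a, da)
        self.b = zadd(self.b, db)
        return delta


join = IncrementalJoin()
d1 = join.step({("u1", "alice"): 1}, {("u1", "order-1"): 1})
d2 = join.step({}, {("u1", "order-2"): 1})
d3 = join.step({("u1", "alice"): -1}, {})  # delete the user again

# Summing up the deltas yields the same result as a full re-join.
view = zadd(zadd(d1, d2), d3)
print(d1, d2, d3, view, sep="\n")
```

The deletion in the last step shows up as negative weights on both joined rows, which is how a single mechanism covers inserts, updates and deletes with the same semantics as the batch query.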

## Unchaining Stream Processing

* In my view, the growth of stream processing has been held back by the following:
  * It all started with a focus on a very small set of real-time use cases of extreme scale.
  * The initial assumptions and architectural decisions underlying the main "classical" stream processing systems optimize for exactly this very small set of use cases.
  * The result is a niche technology that will never attract many people to switch from batch to stream processing.
  * Worse - even the bigger players are now essentially happily retreating back to batch!

* What if we drop these initial assumptions and architectural decisions and build a new foundation? And Feldera shows that you can still optimize that new foundation in a way that *does* allow extreme scale :)

* What if we reboot stream processing with the new assumptions behind DBSP?

* To unchain stream processing?


<hr>

# 4 Towards A Kafka Streams For Python Based On DBSP

## Kafi

* The Swiss Army Knife for Kafka. A Python library for all things Kafka (https://github.com/xdgrulez/kafi).

* Offers:
  * For all kinds of Kafka flavors...
    * Direct Kafka API (based on confluent-kafka)
    * Confluent REST Proxy
    * Emulated Kafka:
      * Local disk
      * S3
      * Azure Blob Storage
  * ...producing, consuming, metadata (watermarks, consumer groups...), Schema Registry (Avro, Protobuf, JSON Schema)
  * simple stateless stream processing (foreach, map, flatmap...)
  * add-ons - e.g. message chunking, copying consumer group offsets, getting topic statistics etc.
  * Pandas support (e.g. read from Kafka, write to an Excel sheet or the other way round).
  * Incredibly useful for scripting anything around Kafka, e.g. topic migrations.

## Kafi Streams

* Kafi extension for complex stateful stream processing.

* Take the DevX from Kafka Streams but take (Py)DBSP as the underlying stream processing engine.

* That is, start from the new assumptions:
  * single binary vs. horizontal scaling
  * global state/clock vs. local operator state/clock
  * a mathematically sound theory vs. ad-hoc operator semantics

* To get:
  * "database-like" semantics instead of extreme scale
  * a sound "database-like" abstraction instead of leaky abstractions

* Here is a very very early prototype...

## Already Implemented

* A small subset of operators (filter, map, join).
* Checkpointing to all Kafi Kafka flavors (including emulated Kafka on S3 and Azure Blob Storage) - so you can even use Kafka directly for checkpointing :)
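
To give a rough idea of what checkpointing operator state could look like, here is a hypothetical sketch in plain Python. It is not the Kafi Streams API and does not call Kafi; the `CountByKey` operator and its methods are made up for illustration. The assumption is simply that operator state plus the last processed input offset can be serialized to bytes, which could then be stored as the value of a message in a checkpoint topic and read back on restart.

```python
import json


class CountByKey:
    """Hypothetical stateful operator: maintains a count per key from
    weighted deltas (+1 for an insert, -1 for a retraction)."""

    def __init__(self, counts=None, offset=-1):
        self.counts = dict(counts or {})
        self.offset = offset  # last input offset folded into the state

    def step(self, offset, key, weight=1):
        self.counts[key] = self.counts.get(key, 0) + weight
        if self.counts[key] == 0:
            del self.counts[key]
        self.offset = offset

    def to_checkpoint(self):
        """Serialize state and offset to bytes (e.g. a message value)."""
        state = {"offset": self.offset, "counts": self.counts}
        return json.dumps(state, sort_keys=True).encode("utf-8")

    @classmethod
    def from_checkpoint(cls, payload):
        """Rebuild the operator from a previously written checkpoint."""
        state = json.loads(payload.decode("utf-8"))
        return cls(counts=state["counts"], offset=state["offset"])


op = CountByKey()
for offset, key in enumerate(["a", "b", "a"]):
    op.step(offset, key)

payload = op.to_checkpoint()  # would be written to the checkpoint topic

# After a restart: restore, then resume from the offset after the checkpoint.
restored = CountByKey.from_checkpoint(payload)
restored.step(restored.offset + 1, "b")
print(restored.counts, restored.offset)
```

Storing the input offset together with the state is the design choice that matters here: on restart, processing resumes exactly after the last input that is already reflected in the restored state.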

## TODOs...

* Implement the remaining operators (starting point: https://github.com/brurucy/pydbsp/blob/master/notebooks/sql.ipynb)
* Only then optimize (no premature optimization!)
  * Garbage collection
  * At a later stage, maybe replace PyDBSP with the Rust implementation of DBSP

