# <del>Kafi Streams - Complex Stream Processing on Python Without the Complexity</del>

# Stream Processing Unchained
### (and a very very early prototype of Kafi Streams)
#### Ralph Debusmann (Migros-Genossenschafts-Bund)

<hr>

# 1 The Stream Processing Crisis

## The Commercial Side

### Confluent + Flink

* Projected Confluent ARR FY 2025 $1B+
* Flink ARR FY 2025 $14M? (https://bigdata.2minutestreaming.com/p/event-streaming-is-topping-out) - that would be only about 1.4% of Confluent's ARR (but with QoQ growth of 70%)

### Lots of startups, none of them really close to the hockey stick

(in alphabetical order)

* Arroyo (acquired by Cloudflare)
* Bytewax
* Decodable
* DeltaStream
* Feldera
* Materialize
* Pathway
* Quix
* Responsive
* RisingWave
* Timeplus

### Bigger companies, but is their focus really on stream processing?

* Databricks
* MongoDB
* Snowflake
* Ververica

## The Lakehouse Trend

### Multimodal streams

* Aiven/Apache Kafka (Iceberg Topics)
* Confluent (Tableflow)
* Redpanda (Iceberg Topics)

### With that, processing seems to be shifting to the lakehouse as well

* Confluent (Real-time Context Engine for Kafka/Iceberg)
* Ververica (Fluss/Paimon)


<hr>

# 2 Why?

## Few Real Real-time Use Cases

* Sure, there is fraud detection et al. in banking and online retail, but beyond that?
* E.g. at Migros, fewer than 5% of use cases are real-time.

## Stream Processing Is Still Hard

* Just talk to someone who implements Kafka Streams/Flink pipelines...
* ...if you can even find someone who is well-versed enough to do it (and whom you can actually pay).
* And then, try to debug these pipelines if they fail.
* In the end, it is just extremely costly and is only worth it for an even smaller subset of real-time use cases.

## For the Most Part, Stream Processing Is Not Even Consistent

* The leading stream processing systems (Kafka Streams/Flink) are *eventually* consistent.
* So if you need stronger consistency guarantees, things get even hairier. You need workarounds. Even more complexity.
* Strongly consistent stream processing frameworks exist (but have low adoption):
  * Bytewax (Timely Dataflow)
  * Feldera (Database Stream Processor)
  * Materialize, Pathway (Differential Dataflow)
* Let's have a look at Jamie Brandon's consistency benchmark... (DEMO - Flink + Materialize/Feldera)
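
The kind of anomaly meant here can be illustrated with a toy simulation in plain Python. It is not the benchmark itself and not how Flink or Kafka Streams work internally; the `transfers` data and the function `eager_totals` are made up for illustration. Every transfer debits one account and credits another, so the total over all accounts should always be 0. An engine that emits a result after every single record exposes totals that never existed in any consistent view of the input.

```python
from collections import defaultdict

# Hypothetical input: each transfer moves `amount` from `src` to `dst`.
transfers = [
    {"id": 1, "src": "a", "dst": "b", "amount": 10},
    {"id": 2, "src": "b", "dst": "c", "amount": 4},
    {"id": 3, "src": "c", "dst": "a", "amount": 7},
]

def eager_totals(transfers):
    """Emit the total balance after every single record (debit or credit)."""
    balances = defaultdict(int)
    emitted = []
    for t in transfers:
        # Debit and credit are handled as two separate records, as they would
        # be after splitting one stream into two independent branches.
        balances[t["src"]] -= t["amount"]
        emitted.append(sum(balances.values()))  # debit seen, credit not yet
        balances[t["dst"]] += t["amount"]
        emitted.append(sum(balances.values()))  # both seen
    return emitted

totals = eager_totals(transfers)
wrong = [x for x in totals if x != 0]
print(totals)  # [-10, 0, -4, 0, -7, 0]
print(f"{len(wrong)} of {len(totals)} emitted totals are inconsistent")
```

The final total is correct (that is the "eventually" part), but half of the emitted results are wrong. A strongly consistent engine would never emit them.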

## So, Let's Take The Escape Route And Go Back To Batch?

* That's where things seem to be heading now, Confluent included...
* ...see "The Lakehouse Trend" above.
* But do you also think we should give up? Go back to batch? Go back to "just push all the data into a database/lakehouse and be happy as in the world before stream processing"?

## I Won't. I Still Believe In Stream Processing

* I believe in incremental computation (=only compute deltas, never compute anything twice).
* I believe in push and not only pull queries.
* I believe in true shift-left architecture (as in - not shifting the technology left (lakehouse) but bringing the data from the lakehouse to streaming).
* I believe in simplicity and Occam's Razor (as in - not having to introduce another complexity monster (e.g. Iceberg) to the data streaming platform).
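
A minimal sketch of the first two beliefs, assuming nothing beyond the Python standard library. The class `IncrementalSum` and its methods are hypothetical names chosen for illustration, not part of any existing framework. It keeps a per-key sum up to date from deltas only, and it supports both a push query (subscribers are called on every change) and a pull query (`get`).

```python
from collections import defaultdict

class IncrementalSum:
    """Keeps a per-key sum up to date from deltas and pushes changes out."""

    def __init__(self):
        self.sums = defaultdict(int)
        self.subscribers = []

    def subscribe(self, callback):
        # Push query: the callback is invoked whenever a result changes.
        self.subscribers.append(callback)

    def apply(self, key, delta):
        # Only the affected key is touched; nothing is recomputed from scratch.
        if delta == 0:
            return
        self.sums[key] += delta
        for callback in self.subscribers:
            callback(key, self.sums[key])

    def get(self, key):
        # Pull query: the caller asks for the current result.
        return self.sums[key]

pushed = []
revenue = IncrementalSum()
revenue.subscribe(lambda key, value: pushed.append((key, value)))

revenue.apply("zurich", 20)
revenue.apply("basel", 5)
revenue.apply("zurich", -3)  # a correction arrives as a negative delta

print(pushed)                 # [('zurich', 20), ('basel', 5), ('zurich', 17)]
print(revenue.get("zurich"))  # 17
```

Each update costs the same no matter how much history has already been processed, which is the whole appeal of working with deltas.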


<hr>

# 3 It Is Time To Unchain Stream Processing

## Is This Even Possible? We Can At Least Have A Go, No?

* Let's go back to Differential Dataflow (DD; McSherry 2014) and Database Stream Processor (DBSP; Budiu et al. 2022).
* Low adoption, yes, but does this mean we should keep ignoring them?
* Who of you has ever used one of them?
* Do you even know what they are about?
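
As a rough idea of what DBSP is about, here is a sketch in plain Python. It is a simplified reading of the theory, not the API of Feldera or any other library, and the helpers `zadd` and `join` as well as the example tables are made up for illustration. Data and changes are both represented as Z-sets (rows with integer weights, where a negative weight means deletion), and a join is maintained incrementally via $\Delta(A \bowtie B) = \Delta A \bowtie B + A \bowtie \Delta B + \Delta A \bowtie \Delta B$.

```python
from collections import Counter

def zadd(*zsets):
    """Add Z-sets (row -> integer weight), dropping rows whose weight is 0."""
    out = Counter()
    for z in zsets:
        for row, w in z.items():
            out[row] += w
    return {row: w for row, w in out.items() if w != 0}

def join(a, b):
    """Join two Z-sets of (key, value) rows on key; weights multiply.
    A nested loop keeps the sketch short; a real engine would use indexes."""
    out = Counter()
    for (ka, va), wa in a.items():
        for (kb, vb), wb in b.items():
            if ka == kb:
                out[(ka, va, vb)] += wa * wb
    return {row: w for row, w in out.items() if w != 0}

# Current contents of two hypothetical tables.
orders = {("alice", "book"): 1, ("bob", "pen"): 1}
cities = {("alice", "Zurich"): 1, ("bob", "Basel"): 1}

# Changes: alice orders a lamp, bob moves from Basel to Bern.
d_orders = {("alice", "lamp"): 1}
d_cities = {("bob", "Basel"): -1, ("bob", "Bern"): 1}

# Incremental join: every term involves at least one (small) delta.
d_result = zadd(
    join(d_orders, cities),
    join(orders, d_cities),
    join(d_orders, d_cities),
)

# Sanity check against full recomputation.
old = join(orders, cities)
new = join(zadd(orders, d_orders), zadd(cities, d_cities))
assert zadd(old, d_result) == new
print(d_result)
```

The output delta contains one insertion for alice's lamp, plus a retraction and an insertion for bob's move. Downstream operators can consume it in exactly the same way.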

## This Is Not About Streaming Databases

* Yes, I've co-written a book about this ;)
* Both theories are based on database theory.
* The first commercial stream processing system based on DD was a streaming database (Materialize).
* But this is not the point. Bytewax, Feldera and Pathway are, for example, stream processors, not streaming databases.
* Materialized Views are *not* a prerequisite for strong consistency guarantees.

## But What Is the Point?

* There are three points/prerequisites for strongly consistent stream processing (as in a database):
  * You need some kind of global virtual "clock". Makes it harder to build a distributed system. But which use cases need to be ultra-scalable? Isn't there also a trend towards less distributed systems (see e.g. DuckDB)?
  * You need more state (at the very least, you have to accept that you cannot forget too quickly).
  * You need a proper theory aka stream processing engine.
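
The first two prerequisites can be sketched together in plain Python. This is a simplified illustration, not the mechanism of any particular engine; the class `ConsistentSum` and the input names are hypothetical. Each input reports how far it has progressed in logical time, the minimum over all inputs acts as the global virtual clock, and results for a timestamp are held in state until the clock has passed it.

```python
from collections import defaultdict

class ConsistentSum:
    """Sums values per logical timestamp and releases a timestamp only once
    every input has promised not to send anything for it anymore."""

    def __init__(self, inputs):
        self.progress = {name: 0 for name in inputs}  # per-input clock
        self.pending = defaultdict(int)               # state held per timestamp
        self.released = []

    def on_record(self, source, time, value):
        if time < self.progress[source]:
            raise ValueError(f"{source} already advanced past time {time}")
        self.pending[time] += value

    def advance(self, source, time):
        # `source` promises that all its future records have timestamp >= time.
        self.progress[source] = max(self.progress[source], time)
        frontier = min(self.progress.values())  # the global virtual clock
        for t in sorted(t for t in self.pending if t < frontier):
            self.released.append((t, self.pending.pop(t)))

op = ConsistentSum(["debits", "credits"])
op.on_record("debits", 1, -10)
op.advance("debits", 2)       # debits are done with time 1 ...
assert op.released == []      # ... but credits are not, so nothing is emitted
op.on_record("credits", 1, 10)
op.on_record("debits", 2, -4)
op.advance("credits", 2)      # now both inputs have passed time 1

print(op.released)       # [(1, 0)]
print(dict(op.pending))  # {2: -4}
```

The price is visible in `pending`: results for time 2 stay buffered until both inputs have moved on. That is the extra state, and the coordination on `frontier` is what makes distribution harder.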


<hr>

# 4 Towards A Kafka Streams For Python Based On DBSP