# Kafka Streams

## What is Kafka Streams?
<https://kafka.apache.org/38/documentation/streams/>

- Kafka Streams is a client library for building mission-critical real-time applications and microservices, where the input and/or output data is stored in Kafka clusters.
- Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology to make these applications highly scalable, elastic, fault-tolerant, distributed, and much more.

## Core Concepts
<https://kafka.apache.org/35/documentation/streams/core-concepts>

**It's a library**

:::: {.columns}

::: {.fragment .column width="50%"}
Designed as a **simple and lightweight client library**, which can be easily embedded in any Java application and integrated with any existing packaging, deployment and operational tools that users have for their streaming applications.
::: 

::: {.fragment .column width="50%"}
i.e. a dependency in a Maven `pom.xml`:
```xml
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>3.8.0</version>
</dependency>
```
:::
::::




**Standalone**

:::: {.columns}

::: {.fragment .column width="50%"}
Has **no external dependencies on systems other than Apache Kafka itself** as the internal messaging layer; notably, it uses Kafka's partitioning model to horizontally scale processing while maintaining strong ordering guarantees.
::: 

::: {.fragment .column width="50%"}
![](https://cdn.confluent.io/wp-content/uploads/2016/08/consumer-group-2.png)
<https://www.confluent.io/blog/elastic-scaling-in-kafka-streams/>
:::
::::
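
The link between partitioning and ordering can be sketched in plain Java with no Kafka dependency. This is an illustration only: Kafka's default partitioner hashes the serialized key with murmur2, while `String.hashCode()` stands in for it here, and `PartitionSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: String.hashCode() stands in for Kafka's murmur2-based partitioner.
final class PartitionSketch {

    // Map a record key to one of numPartitions partitions.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Group record keys by the partition they would land in, keeping arrival order.
    static Map<Integer, List<String>> assign(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                .computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>())
                .add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        System.out.println(assign(List.of("alice", "bob", "alice", "carol", "bob"), 3));
    }
}
```

Records with the same key always land in the same partition, so their relative order is preserved, while different partitions can be processed in parallel by different instances.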



**Fault Tolerant**

:::: {.columns}

::: {.fragment .column width="50%"}
- Supports **fault-tolerant local state**, which enables very fast and efficient stateful operations like windowed joins and aggregations.
- State stores on each machine (RocksDB) are combined with Kafka topics that back up and replicate the state, enabling quick recovery in case of instance failure.
- This fault-tolerant design lets Kafka Streams applications handle failures gracefully, so processing continues and state stays consistent across distributed instances.

::: 

::: {.fragment .column width="50%"}
![](https://kafka.apache.org/0102/images/streams-architecture-states.jpg)
:::
::::
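
A minimal sketch of the idea, in plain Java: every write to the local store is also appended to a changelog, and a fresh instance rebuilds its state by replaying that changelog. Here a `List` stands in for the changelog topic and a `HashMap` for RocksDB; `ChangelogStoreSketch` is a hypothetical name, not a Kafka Streams class.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChangelogStoreSketch {
    // Stand-in for the changelog topic: an append-only list of (key, value) updates.
    // In a real deployment this lives in Kafka and survives the instance.
    private final List<Map.Entry<String, Long>> changelog;
    // Stand-in for the local state store (RocksDB in Kafka Streams).
    private final Map<String, Long> local = new HashMap<>();

    // A new instance restores its local state by replaying the changelog from the start.
    ChangelogStoreSketch(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
        for (Map.Entry<String, Long> update : changelog) {
            local.put(update.getKey(), update.getValue());
        }
    }

    // Every write goes to the local store and is appended to the changelog.
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(Map.entry(key, value));
    }

    // Reads are served from local state only.
    Long get(String key) {
        return local.get(key);
    }
}
```

If the first instance fails, a second one constructed over the same changelog sees the same values, which is the recovery path the bullets above describe.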






**Exactly Once Processing**

:::: {.columns}

::: {.fragment .column width="50%"}
Supports **exactly-once processing** semantics to guarantee that each record will be processed once and only once even when there is a failure on either Streams clients or Kafka brokers in the middle of processing.
::: 

::: {.fragment .column width="50%"}

![](https://cdn.confluent.io/wp-content/uploads/kafka-topic.png)

<https://www.confluent.io/blog/enabling-exactly-once-kafka-streams/>
:::
::::
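
Exactly-once processing is opt-in: the default guarantee is at-least-once. A minimal configuration sketch using the raw property keys, so it compiles without the Kafka jars (in a real application the same keys are available as `StreamsConfig.PROCESSING_GUARANTEE_CONFIG` and `StreamsConfig.EXACTLY_ONCE_V2`). The application id is a made-up value, and the bootstrap address assumes the `kafkaServer` container used later in these slides.

```java
import java.util.Properties;

final class ExactlyOnceConfigSketch {

    // Build the properties that would be passed to a Kafka Streams application.
    static Properties streamsProperties() {
        Properties props = new Properties();
        props.put("application.id", "wordcount-eos-demo");    // hypothetical application id
        props.put("bootstrap.servers", "kafkaServer:9092");   // assumed broker address
        props.put("processing.guarantee", "exactly_once_v2"); // default is "at_least_once"
        return props;
    }

    public static void main(String[] args) {
        streamsProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```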




**One Record at a Time**

:::: {.columns}

::: {.fragment .column width="50%"}
Employs **one-record-at-a-time processing** to achieve millisecond processing latency, and supports **event-time based windowing operations** with out-of-order arrival of records.
::: 

::: {.fragment .column width="50%"}
![](https://images.ctfassets.net/gt6dp23g0g38/y3dPJWV6inVi0KIWNk5Ic/5952ed5d7048e099a12ed57df173a39a/late-record-1.png)

<https://developer.confluent.io/learn-kafka/kafka-streams/time-concepts/>

:::
::::
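
Event-time windowing can be sketched as follows: each record is assigned to a window by its own timestamp, not by when it arrives, so a late record still counts in the window it belongs to. This is a plain-Java illustration of tumbling windows only. It does not model the grace period after which Kafka Streams stops accepting late records, and `EventTimeWindowSketch` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

final class EventTimeWindowSketch {

    // Start of the tumbling window that contains the given event timestamp.
    static long windowStart(long eventTimeMs, long windowSizeMs) {
        return eventTimeMs - Math.floorMod(eventTimeMs, windowSizeMs);
    }

    // Count records per window; the array is in arrival order, which may differ from event order.
    static Map<Long, Integer> countPerWindow(long[] eventTimesMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long eventTimeMs : eventTimesMs) {
            counts.merge(windowStart(eventTimeMs, windowSizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The record stamped 4000 arrives after the one stamped 6000 (out of order).
        long[] arrivals = {1000L, 6000L, 4000L};
        System.out.println(countPerWindow(arrivals, 5000L)); // prints {0=2, 5000=1}
    }
}
```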




**Primitives**

Offers necessary stream processing primitives, along with a **high-level Streams DSL** and a **low-level Processor API**.

- Streams DSL: <https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html>
- Processor API: <https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html>



# Use Cases

## The New York Times (2017)

![](https://kafka.apache.org/images/powered-by/NYT.jpg){.lightbox}

The New York Times uses Apache Kafka and the Kafka Streams API to store and distribute published content, in real time, to the various applications and systems that make it available to readers.

<https://open.nytimes.com/publishing-with-apache-kafka-at-the-new-york-times-7f0e3b7d2077>

## Real Time Analytics

![](https://dzone.com/storage/temp/12275703-kafka-use-case.png)

Story: <https://dzone.com/articles/real-time-stream-processing-with-apache-kafka-part-1>

Code: <https://github.com/hellosatish/microservice-patterns/tree/master/vehicle-tracker>

# Examples

## Wordcount Demo App
``` bash
# Build Kafka Stream (we will see the code later)
cd kafka-stream
./build.sh

# Start Kafka Server
docker run --rm -p 9092:9092 --network tap --name kafkaServer \
  -e KAFKA_NODE_ID=1 \
  -e KAFKA_PROCESS_ROLES=broker,controller \
  -e KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafkaServer:9092 \
  -e KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER \
  -e KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT \
  -e KAFKA_CONTROLLER_QUORUM_VOTERS=1@localhost:9093 \
  -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
  -e KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 \
  -e KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1 \
  -e KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0 \
apache/kafka:3.8.0

# Create the topic (required)
docker exec -it --workdir /opt/kafka/bin/ kafkaServer ./kafka-topics.sh --create --bootstrap-server kafkaServer:9092  --topic streams-plaintext-input
docker exec -it --workdir /opt/kafka/bin/ kafkaServer ./kafka-topics.sh --create --bootstrap-server kafkaServer:9092  --topic streams-wordcount-output

# Run the WordCount Kafka Streams application
docker run -it --rm --network tap  tap:kafkastream java -cp /app/app.jar tap.WordCount

# Start a producer 
docker exec --workdir /opt/kafka/bin/ -it kafkaServer ./kafka-console-producer.sh --topic streams-plaintext-input --bootstrap-server localhost:9092

# In another tab open a consumer
docker exec --workdir /opt/kafka/bin/ -it kafkaServer ./kafka-console-consumer.sh --topic streams-wordcount-output --from-beginning --bootstrap-server localhost:9092 --formatter kafka.tools.DefaultMessageFormatter --property print.key=true --property print.value=true --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

# In the producer, type "All streams lead to Kafka" followed by "Hello kafka streams"
```

## Behind the scenes


:::: {.columns}

::: {.fragment .column width="50%"}
![](https://kafka.apache.org/34/images/streams-table-updates-01.png){.lightbox}
::: 

::: {.fragment .column width="50%"}
![](https://kafka.apache.org/34/images/streams-table-updates-02.png){.lightbox}
:::
::::
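
The table updates in the figures can be reproduced with a small plain-Java simulation: each input line is split into words, the count table is updated per word, and every update is emitted downstream. It assumes the demo lowercases lines and splits on non-word characters, as the standard WordCount example does. `WordCountSketch` is a hypothetical name, not the `tap.WordCount` class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class WordCountSketch {

    // Returns the update stream: one "word=count" entry per word occurrence, in order.
    static List<String> updates(List<String> lines) {
        Map<String, Long> table = new HashMap<>(); // the continuously updated count table
        List<String> out = new ArrayList<>();      // the changelog of that table
        for (String line : lines) {
            for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // split() can yield an empty first token
                }
                long count = table.merge(word, 1L, Long::sum);
                out.add(word + "=" + count);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(updates(List.of("All streams lead to Kafka", "Hello kafka streams")));
        // prints [all=1, streams=1, lead=1, to=1, kafka=1, hello=1, kafka=2, streams=2]
    }
}
```

In the real application, record caching may merge consecutive updates for the same key, so the console consumer can show fewer intermediate counts than this sketch emits.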


# Concepts

## Stream Processing Topology


:::: {.columns}

::: {.fragment .column width="50%"}
- A stream is the most important abstraction provided by Kafka Streams: it represents an unbounded, continuously updating data set. A stream is an ordered, replayable, and fault-tolerant sequence of immutable data records, where a data record is defined as a key-value pair.
- A stream processing application is any program that makes use of the Kafka Streams library. It defines its computational logic through one or more processor topologies, where a processor topology is a graph of stream processors (nodes) that are connected by streams (edges).
- A stream processor is a node in the processor topology; it represents a processing step that transforms data in streams: it receives one input record at a time from its upstream processors in the topology, applies its operation to it, and may subsequently produce one or more output records to its downstream processors.
::: 

::: {.fragment .column width="50%"}
![](https://kafka.apache.org/34/images/streams-architecture-topology.jpg)
:::
::::
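
As a rough mental model, a topology can be sketched in plain Java as a chain of processors, each taking one record and emitting zero or more records downstream. Real topologies are graphs with branches and merges, and the Kafka Streams Processor API looks different; `TopologySketch` and its `Processor` interface are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class TopologySketch {

    // A processor receives one record and emits zero or more records downstream.
    interface Processor extends Function<String, List<String>> {}

    // Push each source record through the processors in order, one record at a time.
    static List<String> run(List<String> source, List<Processor> processors) {
        List<String> sink = new ArrayList<>();
        for (String record : source) {
            List<String> inFlight = List.of(record);
            for (Processor processor : processors) {
                List<String> next = new ArrayList<>();
                for (String r : inFlight) {
                    next.addAll(processor.apply(r));
                }
                inFlight = next;
            }
            sink.addAll(inFlight);
        }
        return sink;
    }

    public static void main(String[] args) {
        Processor split = line -> Arrays.asList(line.split(" "));
        Processor dropShort = w -> w.length() > 2 ? List.of(w) : List.<String>of();
        Processor upper = w -> List.of(w.toUpperCase());
        System.out.println(run(List.of("all streams lead to kafka"), List.of(split, dropShort, upper)));
        // prints [ALL, STREAMS, LEAD, KAFKA]
    }
}
```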



## Duality of Streams and Tables

In practice, stream processing use cases need both streams and databases.

An example use case that is very common in practice is an e-commerce application that enriches an incoming stream of customer transactions with the latest customer information from a database table.

:::: {.columns}

::: {.fragment .column width="50%"}
**Stream as Table**: A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table. A stream is thus a table in disguise, and it can be easily turned into a "real" table by replaying the changelog from beginning to end to reconstruct the table. Similarly, in a more general analogy, aggregating data records in a stream - such as computing the total number of pageviews by user from a stream of pageview events - will return a table (here with the key and the value being the user and its corresponding pageview count, respectively).
::: 

::: {.fragment .column width="50%"}
**Table as Stream**: A table can be considered a snapshot, at a point in time, of the latest value for each key in a stream (a stream's data records are key-value pairs). A table is thus a stream in disguise, and it can be easily turned into a "real" stream by iterating over each key-value entry in the table.
:::
::::
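
Both directions of the duality can be sketched in a few lines of plain Java, using the pageview example above: aggregating a stream of pageview events yields a table of counts per user, and iterating over that table yields a stream of key-value records again. `DualitySketch` is a hypothetical name; the real DSL expresses this with `KStream` and `KTable`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DualitySketch {

    // Stream -> table: aggregate pageview events (one user name per event) into counts per user.
    static Map<String, Long> toTable(List<String> pageviewsByUser) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (String user : pageviewsByUser) {
            table.merge(user, 1L, Long::sum);
        }
        return table;
    }

    // Table -> stream: emit one immutable key-value record per table entry.
    static List<Map.Entry<String, Long>> toStream(Map<String, Long> table) {
        List<Map.Entry<String, Long>> stream = new ArrayList<>();
        for (Map.Entry<String, Long> row : table.entrySet()) {
            stream.add(Map.entry(row.getKey(), row.getValue()));
        }
        return stream;
    }

    public static void main(String[] args) {
        Map<String, Long> table = toTable(List.of("alice", "bob", "alice"));
        System.out.println(table);           // prints {alice=2, bob=1}
        System.out.println(toStream(table)); // prints [alice=2, bob=1]
    }
}
```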


# Streams and Tables: A Primer
<https://www.confluent.io/blog/kafka-streams-tables-part-1-event-streaming/>

## Event Records and Streams
:::: {.columns}

::: {.fragment .column width="50%"}
**An event records the fact that “something happened” in the world**

- Event key: “Alice”
- Event value: “Has arrived in Rome”
- Event timestamp: “Dec. 3, 2019 at 9:06 a.m.”

::: 

::: {.fragment .column width="50%"}
**An event stream records the history of what has happened in the world as a sequence of events**

- This history is an ordered sequence or chain of events, so we know which event happened before another event to infer causality.

- A stream thus represents both the past and the present: as we go from today to tomorrow—or from one millisecond to the next—new events are constantly being appended to the history.
:::
::::



## Example


:::: {.columns}

::: {.fragment .column width="50%"}

_The sequence of moves in a chess match_

White moved the e2 pawn to e4, then Black moved the e7 pawn to e5

![](https://66.media.tumblr.com/tumblr_m8ok25dsch1r8gmlso1_500.gifv)
::: 

::: {.fragment .column width="50%"}


**A table represents the state of the world** at a particular point in time, typically “now.”

![](https://cdn.confluent.io/wp-content/uploads/streams-vs-tables-1.png)
:::
::::


## Stream vs Table

| Stream | Table |
| ------ | ----- |
|A stream provides immutable data. It supports only inserting (appending) new events, whereas existing events cannot be changed. Streams are persistent, durable, and fault tolerant. Events in a stream can be keyed, and you can have many events for one key, like “all of Bob’s payments.” If you squint a bit, you could consider a stream to be like a table in a relational database (RDBMS) that has no unique key constraint and that is append only.| A table provides mutable data. New events—rows—can be inserted, and existing rows can be updated and deleted. Here, an event’s key aka row key identifies which row is being mutated. Like streams, tables are persistent, durable, and fault tolerant. Today, a table behaves much like an RDBMS materialized view because it is being changed automatically as soon as any of its input streams or tables change, rather than letting you directly run insert, update, or delete operations against it.|

## Repetita iuvant (Repetition Helps)
:::: {.columns}

::: {.fragment .column width="50%"}
|                                           | Stream |  Table    |
|-------------------------------------------|--------|-----------|
| First event with key bob arrives          | Insert | Insert    |
| Another event with key bob arrives        | Insert | Update    |
| Event with key bob and value null arrives | Insert | Delete    |
| Event with key null arrives               | Insert | _ignored_ |

::: 

::: {.fragment .column width="50%"}
![](https://cdn.confluent.io/wp-content/uploads/event-stream-1.gif)
:::
::::
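
The Table column above can be written as a small plain-Java function: a record with a new key inserts, an existing key updates, a null value deletes, and a null key is ignored. A stream would append the record in all four cases. `UpsertSketch` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

final class UpsertSketch {

    // Apply one record to the table and report which operation it caused.
    static String apply(Map<String, String> table, String key, String value) {
        if (key == null) {
            return "ignored";          // a table cannot address a row without a key
        }
        if (value == null) {
            table.remove(key);         // null value acts as a delete marker (tombstone)
            return "Delete";
        }
        return table.put(key, value) == null ? "Insert" : "Update";
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        System.out.println(apply(table, "bob", "paid 10")); // prints Insert
        System.out.println(apply(table, "bob", "paid 25")); // prints Update
        System.out.println(apply(table, "bob", null));      // prints Delete
        System.out.println(apply(table, null, "orphan"));   // prints ignored
    }
}
```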




## Writing an App
<https://docs.confluent.io/platform/current/streams/developer-guide/running-app.html>

```bash
mvn archetype:generate \
    -DarchetypeGroupId=org.apache.kafka \
    -DarchetypeArtifactId=streams-quickstart-java \
    -DarchetypeVersion=3.8.0 \
    -DgroupId=streams.examples \
    -DartifactId=kafka-streams.examples \
    -Dversion=0.1 \
    -Dpackage=tap
```