# Kafka

### Pub-Sub Messaging pattern

- Senders (Publishers) do not send messages directly to Receivers (Subscribers).
- Publishers publish data to a broker without knowing who the subscribers are.
- Subscribers ask the broker for a specific kind of data without needing to know who the publishers are.
- The Pub-Sub pattern reduces the number of point-to-point connections needed between Publishers and Subscribers.
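
The decoupling described above can be illustrated with a toy in-memory broker. This is only a sketch of the pattern, not of Kafka itself: the `Broker` class, its methods and the topic names are hypothetical, and it pushes messages to handlers, whereas Kafka consumers pull.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy in-memory broker (hypothetical): routes messages by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        # The subscriber only names a topic; it never learns who publishes.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        # The publisher only names a topic; it never learns who subscribes.
        for handler in self._subscribers[topic]:
            handler(message)


broker = Broker()
received: List[str] = []
broker.subscribe("orders", received.append)
broker.publish("orders", "order-1 created")
broker.publish("payments", "payment-7 settled")  # no subscriber: dropped in this toy
```

With N publishers and M subscribers, each side holds one connection to the broker instead of up to N x M direct connections.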

### What is Kafka?

__Kafka__ - A publish-subscribe messaging system (also described as a distributed event log). 

- Records are immutable; new records are appended to the end of the log.
- Messages are persisted on disk for a configurable duration (the retention policy).
- Acts like a hybrid between a messaging system and a database.
- _Producers_ publish messages to _topics_ and _Consumers_ consume those messages.
- Aims to provide a reliable and high throughput platform for handling real-time data streams and building data pipelines.
- Can be used to build modern and scalable ETL, CDC (Change Data Capture) or Big Data Ingest.

### Kafka Architecture

- Components:
    - Cluster
    - Producers (applications producing data)
    - Consumers (applications consuming data)
- _Broker_: A single Kafka server within a cluster
    - Responsible for
        - Receiving messages from producers
        - Assigning offsets
        - Committing messages to disk
        - Responding to consumers' fetch requests and serving messages
    - A Kafka cluster generally consists of at least three brokers to provide enough redundancy.
- _Topics_ provide a way to categorize data
    - Further broken down into a number of _Partitions_
    - Each partition is a separate commit log
    - Message order is guaranteed only within a partition, not across the whole topic
    - Makes scaling easy: partitions can be spread across multiple servers for high throughput
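
As a rough illustration of "each partition is a separate commit log", the sketch below models one partition as an append-only list where the offset is just the record's position. The `PartitionLog` class and its `append`/`fetch` methods are hypothetical names for this example, not Kafka's API.

```python
from typing import List, Tuple


class PartitionLog:
    """Toy append-only log standing in for a single partition (hypothetical)."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, value: bytes) -> int:
        # The broker assigns the offset: the position of the record in the log.
        self._records.append(value)
        return len(self._records) - 1

    def fetch(self, offset: int, max_records: int = 10) -> List[Tuple[int, bytes]]:
        # Serve a consumer's fetch request starting at the given offset.
        chunk = self._records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
first = log.append(b"a")
second = log.append(b"b")
batch = log.fetch(offset=1)
```

Because offsets only ever grow within one log, ordering holds per partition; nothing here orders records across two different `PartitionLog` instances.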

### Messages

- Single unit of data
- Can have an optional key that controls which partition a message is written to; by default, messages with the same key go to the same partition of the topic
- Sending messages one at a time is slow, so messages can be written in batches (a throughput vs. latency trade-off). Batches can be compressed
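
A minimal sketch of key-based partitioning, assuming a simple "hash the key, take it modulo the partition count" scheme. The function name `choose_partition` is hypothetical, and `zlib.crc32` is a stand-in chosen because it is in the standard library; Kafka's Java client uses a murmur2 hash for keyed records.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition index (crc32 is a stand-in hash, see note above)."""
    return zlib.crc32(key) % num_partitions


# The same key always maps to the same partition, so its messages stay ordered.
p1 = choose_partition(b"user-42", 6)
p2 = choose_partition(b"user-42", 6)
```

One consequence of this scheme is that changing the number of partitions changes the mapping, so existing keys may land in a different partition afterwards.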

### Consumers and Consumer Groups

- A consumer keeps track of its position in the stream of data by remembering the offset of the last message it consumed
    - It can stop and restart without losing its position
- Consumers always belong to a specific consumer group
- Consumers within a group work together to consume a topic
    - Each partition is consumed by exactly one consumer within the group
    - This way, consumers can scale horizontally to consume a topic with a large number of messages
    - If a consumer fails, its partitions are rebalanced among the remaining members of the group
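
The partition ownership and rebalancing described above can be sketched with a simple round-robin assignment. The `assign` function and the consumer names are hypothetical, and Kafka's actual assignment strategies are configurable and more involved; this only shows the invariant that each partition has exactly one owner in the group.

```python
from typing import Dict, List


def assign(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin sketch: every partition gets exactly one owner in the group."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Three consumers share four partitions.
before = assign([0, 1, 2, 3], ["c1", "c2", "c3"])

# "c2" fails: rerunning the assignment over the survivors is the rebalance.
after = assign([0, 1, 2, 3], ["c1", "c3"])
```

Note that a group with more consumers than partitions leaves the extra consumers idle in this scheme, since a partition is never split between two members.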

### Brokers

- Designed to operate as part of a cluster
- One broker within the cluster acts as the controller
    - Responsible for administrative operations like assigning partitions to brokers, monitoring for broker failures, etc.
- Each partition is owned by a single broker within the cluster, called the leader of the partition
    - A partition may be assigned to multiple brokers (replication)
    - Provides redundancy of messages so that another broker can take over leadership in case of failure
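
Leadership failover can be sketched as "pick the first replica that is still alive". This is a deliberately simplified model: the `elect_leader` function and the broker ids are hypothetical, and real Kafka restricts the choice to replicas that are in sync with the old leader.

```python
from typing import List, Set


def elect_leader(replicas: List[int], alive: Set[int]) -> int:
    """Return the first live broker in the partition's replica list (simplified)."""
    for broker_id in replicas:
        if broker_id in alive:
            return broker_id
    raise RuntimeError("no live replica for this partition")


replicas = [1, 2, 3]  # hypothetical broker ids holding copies of one partition
leader = elect_leader(replicas, alive={1, 2, 3})
leader_after_failure = elect_leader(replicas, alive={2, 3})  # broker 1 is down
```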

### Retention

- Provides durable storage of messages (until they expire)
- Retains messages for a set duration of time or until a storage threshold is reached
- The oldest messages expire and are deleted
- Individual topics can configure their own retention policy
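
A small sketch of the two retention limits, assuming a log held as a deque of `(timestamp, value)` pairs. The `enforce_retention` function and its parameters are hypothetical; in Kafka the corresponding topic-level settings are `retention.ms` and `retention.bytes`, and deletion happens per log segment rather than per message.

```python
from collections import deque
from typing import Deque, Tuple


def enforce_retention(
    log: Deque[Tuple[float, bytes]],
    now: float,
    max_age_s: float,
    max_bytes: int,
) -> None:
    """Drop the oldest records until both the age and the size limit hold."""
    # Time-based retention: the oldest record sits at the left end.
    while log and now - log[0][0] > max_age_s:
        log.popleft()
    # Size-based retention: keep dropping the oldest until under the threshold.
    while sum(len(value) for _, value in log) > max_bytes:
        log.popleft()


log = deque([(0.0, b"aaaa"), (50.0, b"bbbb"), (90.0, b"cccc")])
enforce_retention(log, now=100.0, max_age_s=60.0, max_bytes=4)
```

In this run the first record is removed for being older than 60 seconds and the second for exceeding the 4-byte size limit, so only the newest record remains.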

### Reliability

- In terms of guarantees
    - Order of messages within the same topic and partition
    - Messages are considered "committed" once they have been written to all in-sync replicas of the partition
    - Committed messages won't be lost as long as at least one replica remains alive and the retention policy holds
    - Consumers can read only committed messages
    - At-least-once message delivery semantics by default, which does not prevent duplicate messages
        - No exactly-once semantics by default (can be achieved using external systems; Kafka 0.11 and later also offer idempotent producers and transactions)
- Trade-offs
    - Reliability and consistency vs availability, high throughput and low latency
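
To see why at-least-once delivery can produce duplicates, consider a consumer that processes a record and only then commits its offset. The sketch below simulates a crash between those two steps; the `Consumer` class and the `crash_before_commit_at` parameter are hypothetical and exist only for this illustration.

```python
from typing import List, Optional


class Consumer:
    """Toy consumer that processes first and commits the offset afterwards."""

    def __init__(self, records: List[str]) -> None:
        self.records = records
        self.committed_offset = 0
        self.processed: List[str] = []

    def run(self, crash_before_commit_at: Optional[int] = None) -> None:
        # Always resume from the last committed offset.
        offset = self.committed_offset
        while offset < len(self.records):
            self.processed.append(self.records[offset])  # process the record
            if offset == crash_before_commit_at:
                return  # simulated crash: processed, but offset never committed
            offset += 1
            self.committed_offset = offset  # commit only after processing


consumer = Consumer(["a", "b", "c"])
consumer.run(crash_before_commit_at=1)  # crashes after processing "b"
consumer.run()                          # restart: "b" is processed again
```

No record is lost, but "b" is handled twice, which is why consumers in an at-least-once setup are usually written to be idempotent or to deduplicate.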

### Pros and Cons of Kafka

| Pros | Cons |
| --- | --- |
| Tackles integration complexity | Requires a fair amount of time to understand |
| Great for ETL/CDC | Not the best solution for systems that need very low latency |
| Big data retention |  |
| High throughput |  |
| Disk-based retention |  |
| Multiple producers/consumers |  |
| Highly scalable |  |
| Fault tolerant |  |
| Fairly low latency |  |
| Highly configurable |  |
| Provides backpressure |  |

### Kafka vs JMS (Java Message Service)

| Kafka | JMS |
| --- | --- |
| Consumers pull messages from the brokers | Messages pushed to the consumers directly (Hard to achieve back pressure) |
| Retention |  |
| Easy to replay messages |  |
| Guarantees ordering of messages within partitions |  |
| Easy to build scalable and fault tolerant systems |  |