# Stream Processing and Analytics

## Big Data Systems

### Reliable, Scalable and Maintainable Applications

#### Data Intensive Applications
- deal with huge amounts of complex, fast-moving data
- built with several building blocks (databases, caches, indexes, batch and stream processing systems, etc.)
- 3 main concerns: 
    1. reliability : fault tolerance, fault recovery, monitoring, alerting, etc.
        - recover from hardware/software faults, human errors, etc.
    2. scalability : horizontal scaling, load balancing, etc.
    3. maintainability : operability, simplicity, evolvability, etc.

#### Example web analytics pipeline
- designing an application to track user visits on a website (schema : user_id, page_id, n_visits, etc.)
- portal becomes popular, data volume increases, database writes become a bottleneck
    - use intermediate queue to buffer writes (queue will hold messages, database will consume messages)
- more traffic, more data, more writes, more database load
    - use multiple databases (shard data by user_id, page_id, etc.)
- as we move down, we need to handle more and more complexity - shards, queues, more complicated application logic, etc.
    - need to handle failures, need to handle load balancing, etc. 
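
The two fixes above (buffering writes in a queue, then sharding by `user_id`) can be sketched in a few lines. This is a minimal in-memory sketch, not a production design: the shard count, the `record_visit` / `drain_queue` names, and the use of plain dicts in place of real databases are all assumptions made for illustration.

```python
import hashlib
from collections import deque

N_SHARDS = 4  # hypothetical number of database shards


def shard_for(user_id: str, n_shards: int = N_SHARDS) -> int:
    """Map a user_id to a shard with a stable hash (Python's hash() is salted per run)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_shards


# One dict per shard stands in for a real database.
shards = [dict() for _ in range(N_SHARDS)]
write_queue = deque()  # buffers page-visit events ahead of the databases


def record_visit(user_id: str, page_id: str) -> None:
    """Web tier: enqueue the event instead of writing to the database directly."""
    write_queue.append((user_id, page_id))


def drain_queue() -> int:
    """Worker: apply buffered events to the owning shard; returns events applied."""
    applied = 0
    while write_queue:
        user_id, page_id = write_queue.popleft()
        shard = shards[shard_for(user_id)]
        key = (user_id, page_id)
        shard[key] = shard.get(key, 0) + 1  # n_visits counter
        applied += 1
    return applied
```

Even in this toy form the application now has to know about the queue, the shard mapping and the worker, which is the growth in complexity the notes describe.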

#### Big Data Systems
- handle huge amounts of data, fast moving data, complex data
- systems are designed to be distributed from the start, so application developers don't need to handle common issues like sharding and replication themselves
- scalability achieved by horizontal scaling - just add new machines, devs need only focus on application logic
- 3Vs of Big Data : Volume, Velocity, Variety
- examples : 
    1. data sources : web logs, social media, sensors, etc.
    2. data acquisition : Kafka, Flume, Spark Streaming, etc.
    3. storage : HDFS, Cassandra, HBase, etc.
    4. BI analysis : Spark, Hive, Pig, etc.
    5. visualization : Tableau, Zeppelin, PowerBI, etc.
- properties: fault tolerance, low latency, scalability, extensibility, maintainability, debuggability, etc.

#### Data Model of Big Data Systems
- fact based model : data is modeled as facts/events
    - graph schema captures relationships between entities in the form of nodes, edges and properties
    - nodes represent entities, edges represent relationships, properties represent attributes
    <img alt="picture 0" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/57bd7d48fad8d4e07f588599101a241bcc50e73ee1bafb1c4c97e5800aa95e1f.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">
- fact table : stores facts (events) in the form of rows
- other columns in the fact table are foreign key references to dimension tables
    <img alt="picture 1" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/e4e2363558893501a303520b0841a66bb4b755422ae4cbf17f27196ab334f389.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">
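
As a rough illustration of the fact table idea, the sketch below stores immutable page-view facts whose columns are foreign keys into two small dimension tables, and derives a view by joining and counting. The table contents and the `PageViewFact` / `views_per_url` names are invented for the example.

```python
from dataclasses import dataclass

# Dimension tables: descriptive attributes keyed by an id (contents are made up).
users = {1: {"name": "alice", "country": "IN"}, 2: {"name": "bob", "country": "US"}}
pages = {10: {"url": "/home"}, 11: {"url": "/pricing"}}


@dataclass(frozen=True)
class PageViewFact:
    """One immutable fact: a user viewed a page at a given time."""
    user_id: int    # foreign key into users
    page_id: int    # foreign key into pages
    timestamp: int  # epoch seconds


facts = [
    PageViewFact(1, 10, 1_700_000_000),
    PageViewFact(1, 11, 1_700_000_060),
    PageViewFact(2, 10, 1_700_000_120),
]


def views_per_url(fact_rows):
    """Derive a view by joining facts to the page dimension and counting."""
    counts = {}
    for fact in fact_rows:
        url = pages[fact.page_id]["url"]
        counts[url] = counts.get(url, 0) + 1
    return counts
```

Because facts are never updated in place, any derived view such as `views_per_url(facts)` can be recomputed from scratch at any time.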

#### Big data architecture style

<img alt="picture 2" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/5e126d8f3906955d4fa2864535921f0b36e80537bc9f6c72908e9dd31a8a7683.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

- big data solutions typically involve one or more of the following types of workload:
    - batch processing of big data sources at rest
    - real-time processing of big data in motion
    - interactive exploration of big data
    - predictive analytics and machine learning
- components are usually: data sources, data storage, batch processing, real-time message ingestion, stream processing, analytical data store, analysis and reporting, orchestration, etc
- benefits : choice in technology, performance through parallelism, scalability, interoperability with existing solutions
- challenges : complexity, lack of standards, lack of skills, etc

## Real Time Systems

### Real time systems, stream processing

| Feature                      | Real-Time Data               | Near Real-Time Data             | Streaming Data               |
| ---------------------------- | ---------------------------- | ------------------------------- | ---------------------------- |
| **Latency Measurement**       | Micro to milliseconds        | Extended to seconds             | Constant, always accessible  |
| **Use Case**                  | Immediate decision-making    | Operational intelligence, event-driven systems | Continuous acquisition and transmission of data |
| **Storage**                   | Short-term, may use in-memory storage | Short to mid-term, may be stored for a relatively longer | Continuous flow, no funneling to long-term storage |
| **Dependency on Response Time** | Critical                   | Important but with flexibility | No specific response time     |
| **Examples**                  | Financial trading systems, IoT applications, real-time analytics | Monitoring and alerting systems | Astronomical observations, climate observation systems, earth-sensing satellites |
| **Processing**                | Immediate processing upon creation or acquisition | Processing with acceptable latency for operational insights | Continuous tracking and analysis of data as it flows |

### Stream processing vs batch processing

| Feature                      | Batch Processing             | Stream Processing              |
| ---------------------------- | ---------------------------- | ------------------------------- |
| **Data Processing Model**    | Process data in fixed-size batches | Process data in real-time, small chunks or records |
| **Latency**                  | Typically higher, minutes to hours | Very low, milliseconds to seconds |
| **Use Case**                 | Suitable for scenarios where data is not time-sensitive, e.g., daily reports | Ideal for time-sensitive applications, e.g., real-time analytics, monitoring |
| **Processing Approach**      | Data processed in isolated batches | Continuous and incremental processing of data |
| **Storage**                  | Requires storage of large batches before processing | Minimal storage, as data is processed as it arrives |
| **Scalability**              | Scales well for large volumes of data but may have higher infrastructure costs | Scales horizontally with ease, cost-effective for high-velocity data |
| **Examples**                 | End-of-day financial reports, data warehousing | Real-time analytics, fraud detection, IoT data processing |
| **Complexity**               | Typically less complex as it deals with fixed batches | More complex due to real-time nature and handling of streaming data |
| **Analyses**                 | Complex analyses, e.g., machine learning, can be performed | Simple analyses, e.g., aggregations, and rolling metrics, can be performed |

Why is stream processing important?
- data is generated continuously, in huge volumes, at high velocity
    - to do batch processing, we need to wait for data to accumulate, stop and restart processing, etc.
    - in stream processing, data is processed as it arrives, so we can get real time insights gracefully and naturally
    - you can detect patterns, inspect results, look at data from multiple streams, etc.
- stream processing is a natural fit for time series and detecting patterns over time
- may work with less capable hardware, as data is processed in small chunks, also enables approximate query processing via load shedding
- stream processing is a natural fit for event driven architectures

### Applications of stream processing
- algorithmic trading, stock market analysis, fraud detection, smart patient monitoring, monitoring of IoT devices, production line monitoring, supply chain optimization, intrusion detection and surveillance, smart grids, traffic monitoring, sports analytics, contextual promotions and advertising, computer system and network monitoring, predictive maintenance, geospatial data processing
- CEP (complex event processing) : processing of multiple events to infer higher level events, e.g., detecting a fraud transaction by combining multiple events like login, purchase, etc.
- Stream analytics : processing of data in motion, e.g., aggregations, filtering, etc.
    - usually windows are used to process data in batches, e.g., tumbling window, sliding window, etc.
- Materialized views : pre-computed views of data, e.g., aggregations, etc.
    - can be used to speed up queries, e.g., in OLAP systems
    - can be used to speed up stream processing, e.g., in stream analytics
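
The tumbling and sliding windows mentioned above can be sketched as follows. This is a simplified, single-process illustration that assumes integer timestamps and in-order `(timestamp, key)` events; real engines also deal with late data and window eviction, which are left out here.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping windows.

    events: iterable of (timestamp, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


def sliding_window_counts(events, window_size, slide):
    """Count events per key in overlapping windows that start every `slide` units."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # The event belongs to every window [start, start + window_size) containing it.
        start = (timestamp // slide) * slide
        while start > timestamp - window_size and start >= 0:
            counts[(start, key)] += 1
            start -= slide
    return dict(counts)
```

With a tumbling window each event is counted exactly once; with a sliding window of size 10 and slide 5, an event can be counted in up to two windows.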

### Streaming data sources
- operational monitoring : monitoring of systems, e.g., CPU, memory, network, etc.
- web analytics : tracking of user activity on websites, e.g., page views, clicks, etc.
- online advertising : call made by modern ad exchanges to bid on ad slots to multiple advertisers in real time
- social media : tracking of user activity on social media, e.g., tweets, likes, etc.
- IoT : tracking of sensor data, e.g., temperature, humidity, etc.

## Generalized Streaming Data Architecture

### Streaming data architecture
- Often we want to deploy our models to make predictions on data 'as it arrives'
- operating on streaming data presents a number of challenges
    - rebuilding your model to reflect the changing world
    - deploying your model that can run quickly and efficiently
    - scalability and fault tolerance (systems stuff)
- They are layered systems that rely on several loosely coupled systems
    - Helps in achieving high availability, managing the system, maintaining the cost under control
    - Subsystems / components can each run on their own physical server or be co-hosted on one or more shared servers
    - Not all components need to be present in every system
- Components: 
    1. collection : takes responsibility of collecting the data from the source
        - mostly HTTP over TCP/IP; payloads now commonly use formats like Avro, Parquet, JSON
        - happens at the edge, usually application specific, new servers integrated directly with the streaming system
    2. data flow : required intermediate layer that takes responsibility of accepting messages / events from collection layer and providing those messages / events to processing layer
        - usually a message queue, e.g., Kafka, RabbitMQ, etc.
    3. processing : takes responsibility of processing the messages / events and generating the output
        - usually a stream processing framework, e.g., Spark Streaming, Flink, etc.
        - relies on distributed processing of data; the framework does most of the heavy lifting: data partitioning, job scheduling, job management
    4. storage : takes responsibility of storing the output of the processing layer
        - usually a NoSQL database, e.g., Cassandra, MongoDB, etc.
    5. delivery : takes responsibility of delivering the output of the processing layer to the end user
        - usually a web / mobile interface or a BI tool, e.g., Tableau, PowerBI, etc.
        - can also be a raw file like SVG, PDF, etc. 
        - monitoring / alerting use cases, feeding data to downstream applications, etc.
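
To make the five components concrete, here is a toy end-to-end pipeline. It is only a sketch under heavy simplification: an in-memory deque stands in for the message queue, a dict stands in for the storage layer, and all function and class names are hypothetical.

```python
import json
from collections import deque


def collect(raw_lines):
    """Collection: parse raw JSON lines arriving from the edge into event dicts."""
    for line in raw_lines:
        yield json.loads(line)


class DataFlow:
    """Data flow: an in-memory queue standing in for a broker such as Kafka."""

    def __init__(self):
        self._queue = deque()

    def publish(self, event):
        self._queue.append(event)

    def consume(self):
        while self._queue:
            yield self._queue.popleft()


def process(events):
    """Processing: aggregate page views per page."""
    totals = {}
    for event in events:
        totals[event["page"]] = totals.get(event["page"], 0) + 1
    return totals


def run_pipeline(raw_lines):
    flow = DataFlow()
    for event in collect(raw_lines):
        flow.publish(event)
    storage = process(flow.consume())  # Storage: a dict standing in for a NoSQL store
    return sorted(storage.items())     # Delivery: rows ready for a dashboard or report
```

The point of the layering is that each stage only talks to its neighbour, so any one of them could be swapped for a real system without touching the others.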

### Lambda architecture

<img alt="picture 5" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/72d2beeaeeaa04308b7ccafd493a6e03cdd75c47dcf1c66727826dcceef24a05.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

<img alt="picture 6" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/1223d99cb90bcff11fa0fab78b1e50080efaf50792497cff9fcf61a3f5aa7c5c.png" width="500" style="display: block; margin-left: auto; margin-right: auto; padding-top: 20px">


- proposed by Nathan Marz (2011), combines batch and stream processing, a generic, scalable and fault tolerant data processing architecture
- should be linearly scalable, scale out by adding more machines rather than up
- critical feature : uses two separate data processing systems to handle different types of data processing workloads
    - batch processing system : processes data in large batches and stores the results in a centralized data store, e.g., data warehouse, distributed file system, etc.
    - stream processing system : processes data in real time as it arrives and stores the results in a distributed data store, e.g., message queue, NoSQL database, etc.
- 4 layers: 
    1. data ingestion layer : collects and stores raw data from various sources, e.g., log files, sensors, message queues, APIs, etc.
        - data is typically ingested in real time and fed to the batch layer and speed layer simultaneously
    2. batch layer : responsible for processing historical data in large batches and storing the results in a centralized data store, e.g., data warehouse, distributed file system, etc.
        - typically uses a batch processing framework, e.g., Hadoop, Spark, etc.
        - designed to handle large volumes of data and provide a complete view of all data
    3. speed layer : responsible for processing real time data as it arrives and storing the results in a distributed data store, e.g., message queue, NoSQL database, etc.
        - typically uses a stream processing framework, e.g., Flink, Storm, etc.
        - designed to handle high volume data streams and provide up to date views of the data
    4. serving layer : responsible for serving query results to users in real time
        - typically implemented as a layer on top of the batch and stream processing layers
        - accessed through a query layer, which allows users to query the data using a query language, e.g., SQL, HiveQL, etc.
        - designed to provide fast and reliable access to query results, regardless of whether the data is being accessed from the batch or stream processing layers
        - typically uses a distributed data store, e.g., NoSQL database, distributed cache, etc.
- advantages : 
    - scalability : designed to handle large volumes of data and scale horizontally to meet the needs of the business
    - fault tolerance : designed to be fault tolerant, with multiple layers and systems working together to ensure that data is processed and stored reliably
    - flexibility : designed to handle a wide range of data processing workloads, from historical batch processing to streaming architecture
- disadvantages :
    - complexity : inherently complex, as it uses multiple layers and systems to process and store data
    - errors and data discrepancies : with doubled implementations of different workflows, you may run into a problem of different results from batch and stream processing engines
    - architecture lock-in : may be super hard to reorganize or migrate existing data stored in the Lambda architecture
- use cases :
    - handling large volumes of data and providing low-latency query results, e.g., dashboards and reporting
    - batch processing tasks, e.g., data cleansing, transformation, and aggregation
    - stream processing tasks, e.g., event processing, machine learning models, anomaly detection, and fraud detection
    - building data lakes, e.g., centralized repositories that store structured and unstructured data at rest
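
A minimal sketch of how the serving layer can combine the two views, assuming events are `(timestamp, page)` pairs and that the batch layer has processed everything before a known cutoff. The function names and the cutoff mechanism are illustrative assumptions, not part of any specific framework.

```python
from collections import Counter


def batch_view(master_dataset, up_to):
    """Batch layer: recompute counts from all events with timestamp < up_to."""
    return Counter(page for ts, page in master_dataset if ts < up_to)


def speed_view(recent_events, since):
    """Speed layer: incrementally count only events the batch run has not covered."""
    view = Counter()
    for ts, page in recent_events:
        if ts >= since:
            view[page] += 1
    return view


def serve(page, batch, speed):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch[page] + speed[page]
```

Note that the counting logic exists twice, once per layer; this duplication is where the "different results from batch and stream engines" problem listed under disadvantages comes from.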

### Kappa architecture

<img alt="picture 3" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/0908ffe3ce19857b78b9684c8eb50acd9882451d681dfcbc346e517b2c6e14d5.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

<img alt="picture 7" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/62326926883a15f1de431103d4df58f52eba4580a527aff23020f4096edbec30.png" width="500" style="display: block; margin-left: auto; margin-right: auto; padding-top: 20px">

- proposed by Jay Kreps (2014), an evolution of Lambda architecture, uses a single data processing system to handle both batch processing and stream processing workloads
- treats everything as streams, allows it to provide a more streamlined and simplified data processing pipeline while still providing fast and reliable access to query results
- there is only the speed layer (stream layer)
- 2 components : 
    1. ingestion component : same as Lambda architecture
    2. processing component : responsible for processing the data as it arrives and storing the results in a distributed data store, e.g., message queue, NoSQL database, etc.
        - typically implemented using a stream processing framework, e.g., Flink, Storm, etc.
        - designed to handle high volume data streams and provide fast and reliable access to query results
        - there is no separate serving layer, instead, the stream processing layer is responsible for serving query results to users in real time
- advantages :
    - simplicity and streamlined pipeline : uses a single data processing system to handle both batch processing and stream processing workloads, which makes it simpler to set up and maintain compared to Lambda architecture
    - enables high-throughput big data processing of historical data : although it may feel that it is not designed for this set of problems, Kappa architecture can support these use cases with grace, enabling reprocessing directly from our stream processing job
    - ease of migrations and reorganizations : as there is only stream processing pipeline, you can perform migrations and reorganizations with new data streams created from the canonical data store
    - tiered storage : tiered storage is a method of storing data in different storage tiers, based on the access patterns and performance requirements of the data
        - in Kappa architecture, tiered storage is not a core concept, however, it is possible to use tiered storage in conjunction with Kappa architecture, as a way to optimize storage costs and performance
        - for example, businesses may choose to store historical data in a lower-cost fault tolerant distributed storage tier, such as object storage, while storing real-time data in a more performant storage tier, such as a distributed cache or a NoSQL database
        - combining tiered storage with Kappa architecture gives a cost-efficient and elastic data processing approach without the need for a traditional data lake
- disadvantages :
    - complexity : although Kappa architecture is simpler than Lambda architecture, it can still be complex to set up and maintain, especially for businesses that are not familiar with stream processing frameworks
    - costly infrastructure with scalability issues : storing big data in an event streaming platform can be costly; to reduce cost, you can offload data to object storage from your cloud provider (e.g., AWS S3 or Google Cloud Storage). Another common approach is building a 'streaming data lake', with Apache Kafka as the streaming layer and object storage for long-term data retention
- use cases:
    - real-time data processing, e.g., continuous data pipelines, real-time data processing, machine learning models and real-time data analytics, IoT systems, etc.
    - building data lakes, e.g., centralized repositories that store structured and unstructured data at rest
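
The reprocessing idea behind Kappa can be sketched as replaying a retained log through a new version of the same streaming job. The `EventLog` class and both job functions below are hypothetical stand-ins; a real deployment would replay from a retained Kafka topic or similar.

```python
class EventLog:
    """Append-only log that retains events so they can be replayed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset=0):
        return iter(self._events[offset:])


def count_by_page(events):
    """Version 1 of the job: count views per page."""
    counts = {}
    for event in events:
        counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts


def count_by_page_excluding_bots(events):
    """Version 2: changed logic, applied to the same log by replaying from offset 0."""
    return count_by_page(e for e in events if not e.get("bot", False))
```

There is no separate batch code path: historical results are rebuilt by running the updated job over `log.read_from(0)` and switching readers to the new output once it has caught up.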




## Service Configuration and Coordination in Distributed Systems

### Distributed systems
- collection of independent computers that appears to its users as a single coherent system
- computers coordinate their actions by passing messages over a network (nodes together form a cluster)
- nodes need to share metadata and state information to coordinate their actions (e.g., leader election, location of data, etc.)
    - this is difficult, leads to incorrect results, inconsistent state, etc.
    - needs a system-wide service that correctly and reliably implements distributed configuration and coordination
- challenges : 
    - unreliable network : network partitions, message loss, etc.
        - latency issues, bandwidth changes, lost connections, etc.
        - split brain problem : loss of connectivity between nodes, nodes may form multiple clusters
            - some amount of state is inaccessible to some nodes, nodes may diverge in their views of the system
            - we should either disallow writes to the state until the network partition is resolved, or allow one partition to remain functional while degrading the capabilities of the other partition
    - unreliable nodes : node failures, node restarts, etc.
    - clock synchronization : clocks on different nodes may be out of sync, leading to clock drift and misordered events
    - consistency in application state : nodes may have different views of the application state (use Paxos, multi-Paxos, Raft algorithms to solve this)
        - these are difficult to implement

#### Data delivery semantics
Data Delivery Semantics, governing data transfer, includes three types:
1. At Most Once (AMO): The data is delivered at most once. It may not be delivered at all.
    - eg: sending a notification; no guarantee of receipt, and missed notifications are not that important
2. At Least Once (ALO): The data is delivered at least once. It may be delivered multiple times.
    - eg: email delivery; continuous sending until acknowledged, allowing duplicates
3. Exactly Once (EO): The data is delivered exactly once. Deduplication may be required.
    - eg: financial transactions; ensures no duplicates but involves more complex mechanisms
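
One common way to get exactly-once *effects* is to combine at-least-once delivery with a receiver that deduplicates by message id. The sketch below simulates lost acknowledgements deterministically; the class, the `ack_lost_on` parameter and the message ids are assumptions made up for the example.

```python
class IdempotentReceiver:
    """Remembers processed message ids so redelivered messages are ignored."""

    def __init__(self):
        self.seen_ids = set()
        self.balance = 0

    def handle(self, message_id, amount):
        """Apply a message once; return False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.balance += amount
        return True


def send_at_least_once(receiver, message_id, amount, ack_lost_on):
    """Resend until an ack arrives; acks for attempt numbers in ack_lost_on are dropped.

    Returns the number of delivery attempts that were needed.
    """
    attempt = 0
    while True:
        attempt += 1
        receiver.handle(message_id, amount)  # the message itself always arrives
        if attempt not in ack_lost_on:
            return attempt
```

If the first two acks are lost, the sender delivers the same message three times, but the receiver's balance changes only once. Dropping the retry loop gives at-most-once; dropping the `seen_ids` check gives plain at-least-once with duplicates.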

### Apache Kafka [[Vid1]](https://youtu.be/B5j3uNBH8X4), [[Vid2]](https://youtu.be/jY02MB-sz8I)
- [Kafka docs](https://kafka.apache.org/documentation/#gettingStarted)
- distributed streaming platform, used for building real time data pipelines and streaming applications
- Kafka is a distributed system, it needs to coordinate its nodes to ensure that they are all in agreement
- uses ZooKeeper to manage its cluster, store metadata, and perform leader election (newer Kafka releases can instead run in KRaft mode, which removes the ZooKeeper dependency)
    - Cluster management : ZooKeeper is used to manage the Kafka cluster, including configuration, topic metadata, broker metadata, etc.
    - Failure detection and recovery, leader election
    - Store ACLs for authorization

<img alt="picture 8" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/8c6d01dd21e6edaf3e614f69553b607fdcf751a25326b086d27dd704527c5bd9.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

- Architecture:
    1. Producer : responsible for publishing data to Kafka cluster
        - Can be written in any language, native: Java, C/C++, Python, Go, .NET, etc
    2. Consumer : responsible for subscribing to topics and processing the data published to them
        - new inflowing messages are automatically retrieved
        - the consumer offset, which tracks the last message read, is stored in a special Kafka topic (`__consumer_offsets`)
    3. Cluster : collection of nodes that together form a Kafka cluster
        - nodes are called brokers, each broker is identified by a unique integer ID
- Producers and consumers are completely decoupled:
    - slow consumers/producers don't affect each other, add more consumers/producers to scale, failures of consumers/producers don't affect system
- Topics: streams of related messages in Kafka, is a logical representation, categorizes messages into groups
    - developers can define any number of topics
    - topics are partitioned, each partition is an ordered, immutable sequence of messages
        - partitions are distributed across brokers, each partition is replicated across multiple brokers for fault tolerance
    - producers <-> topics are N:N relation, same with consumers <-> topics

<img alt="picture 9" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/8b5680ad6c9e3e97d745b8928d567ee1f36a3817c22b02207b94a700faf96dae.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

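- Kafka data record : headers, key, value, timestamp
    - headers : optional key-value metadata attached by the producer, e.g., tracing or routing information (the topic, partition and offset identify the record rather than being stored as headers)
    - key : optional, used for partitioning, if key is null, messages are round-robin distributed to partitions
    - value : actual data, can be any format, e.g., JSON, Avro, etc.
    - timestamp : creation time set by the producer, or log-append time set by the broker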
- Broker replication:
    - each partition has one broker acting as leader and multiple brokers acting as followers
    - leader handles all read and write requests for the partition, followers replicate the leader
    - if leader fails, one of the followers is elected as the new leader
    - if a follower fails or falls too far behind, it is removed from the ISR (in-sync replica set); it rejoins the ISR once it has caught up with the leader again
- Load Balancing and Semantic Partitioning
    - Producers use a partitioning strategy (defined by producer) to determine which partition to publish messages to
        - default strategy : hash(key) % n_partitions
            - messages with the same key are always published to the same partition
        - no key : round-robin
        - custom partitioner is also allowed
- Estimating number of partitions:
    - $N = \max(\frac{T_t}{T_p}, \frac{T_t}{T_c})$
        - $T_t$ : throughput of the system (processing speed of the slowest component)
        - $T_p$ : throughput of producer writing onto single partition
        - $T_c$ : throughput of a consumer reading from single partition
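
The partitioning strategy and the sizing formula above can be sketched together. Note the assumptions: `zlib.crc32` is used here only as a convenient stable hash (Kafka's Java client uses murmur2 on the serialized key), the round-robin counter is passed in explicitly, and the throughput numbers in the usage note are invented.

```python
import math
import zlib


def choose_partition(key, n_partitions, counter=0):
    """Keyed records: hash(key) % n_partitions. Null key: round-robin on a counter.

    crc32 is a stand-in hash for illustration, not what Kafka itself uses.
    """
    if key is None:
        return counter % n_partitions
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def estimate_partitions(target, per_partition_producer, per_partition_consumer):
    """N = max(T_t / T_p, T_t / T_c), rounded up to a whole number of partitions."""
    ratio = max(target / per_partition_producer, target / per_partition_consumer)
    return math.ceil(ratio)
```

For example, with a target of 100 MB/s, 10 MB/s per partition on the producer side and 20 MB/s on the consumer side, `estimate_partitions(100, 10, 20)` gives 10: the slower side (here the producer) determines the count.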

Apache Kafka Commands:
1. start kafka server : `bin/kafka-server-start.sh config/server.properties`
2. create a topic : `bin/kafka-topics.sh --create --topic <topic-name> --bootstrap-server <bootstrap-server> --partitions <num-partitions> --replication-factor <replication-factor>`
3. list topics : `bin/kafka-topics.sh --list --bootstrap-server <bootstrap-server>`
4. produce messages : `bin/kafka-console-producer.sh --topic <topic-name> --bootstrap-server <bootstrap-server>`
5. consume messages : `bin/kafka-console-consumer.sh --topic <topic-name> --bootstrap-server <bootstrap-server> --from-beginning`
6. describe consumer groups : `bin/kafka-consumer-groups.sh --describe --group <group-id> --bootstrap-server <bootstrap-server>`
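7. check topic offsets : `bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic <topic-name> --bootstrap-server <bootstrap-server>` (this tool reports per-partition offsets of a topic and takes no group option; consumer group offsets are shown by command 6)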

### Apache Zookeeper [[Vid]](https://youtu.be/gZj16chk0Ss)
- [ZooKeeper overview](https://zookeeper.apache.org/doc/r3.9.1/zookeeperOver.html)
- distributed coordination service, used for building distributed systems, used with Hadoop, HBase, Kafka, etc
- services:
    - naming : provides a hierarchical namespace for nodes in cluster
    - configuration management : stores and manages configuration information for systems
    - cluster management : manages the membership of nodes in a cluster
    - leader election : elects a leader among distributed nodes
    - locking and synchronization : provides primitives for synchronization and coordination between distributed processes
- benefits : synchronization, serialization (application runs consistently), atomicity, reliability, etc.
- architecture:
    - ensemble : a group of ZooKeeper servers that collectively manage the service (3 minimum)
        - one leader is elected from the ensemble, followers replicate the leader
    - quorum : majority of servers must agree on an operation for it to be committed
    - client : connects to any server in the ensemble, sends heartbeat to server, if server doesn't respond, client connects to another server
    - znode : the basic data structure in ZooKeeper, analogous to a file system node
        - znodes are organized in a hierarchical tree-like structure
        - each znode contains data and metadata (version number, ACL, etc.)
        - znodes can be ephemeral (deleted when client disconnects) or persistent (deleted when explicitly deleted) or sequential (name of znode is appended with a monotonically increasing counter)
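
The znode types can be illustrated with a toy in-memory tree. This is not the ZooKeeper client API: the `ZNodeTree` class, its method names and the session ids are invented for the sketch. The one detail taken from ZooKeeper itself is that sequential znodes get a zero-padded 10-digit counter appended to their name.

```python
class ZNodeTree:
    """Toy in-memory znode store; a teaching sketch, not a ZooKeeper client."""

    def __init__(self):
        self.nodes = {"/": {"data": b"", "version": 0, "owner": None}}
        self._seq = {}  # parent path -> next sequence number

    def create(self, path, data=b"", ephemeral_owner=None, sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if sequential:
            n = self._seq.get(parent, 0)
            self._seq[parent] = n + 1
            path = f"{path}{n:010d}"  # monotonically increasing counter appended
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = {"data": data, "version": 0, "owner": ephemeral_owner}
        return path

    def set(self, path, data):
        node = self.nodes[path]
        node["data"] = data
        node["version"] += 1  # metadata: version is bumped on every update

    def close_session(self, session_id):
        """Delete every ephemeral znode owned by the disconnected session."""
        for path in [p for p, n in self.nodes.items() if n["owner"] == session_id]:
            del self.nodes[path]
```

Ephemeral sequential znodes are the basis of the usual leader election recipe: every candidate creates one under a common parent, and the candidate holding the lowest sequence number acts as leader until its session ends.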

<img alt="picture 10" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/2574be9ee8aa76e507988eb8754c5d9ec658d83ada1c13e2050ff13d69f3821e.png" width="400" style="display: block; margin-left: auto; margin-right: auto;">

- operations : create, read, update, delete, watch
    - watches : allows clients to be notified of changes to a znode
- ACLs : access control lists, used to control access to znodes
    - each znode has an ACL that specifies the permissions for that znode
    - permissions : read, write, create, delete, admin
    - ACLs are stored in ZooKeeper and managed by ZooKeeper itself
- deployment:
    - ensemble size : typically odd number (3, 5, 7) for fault tolerance
    - network configuration : low-latency, reliable network required between ZooKeeper servers
    - client deployment : clients connect to any server in the ensemble
- split brain problem : loss of connectivity between nodes, nodes may form multiple clusters
    - some amount of state is inaccessible to some nodes, nodes may diverge in their views of the system
    - we should either disallow writes to the state until the network partition is resolved, or allow one partition to remain functional while degrading the capabilities of the other partition
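
The quorum rule explains both the odd ensemble sizes and how split brain is avoided: only a side of a network partition that can still reach a majority may commit writes. A small sketch of the arithmetic, with hypothetical helper names:

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def can_accept_writes(reachable_servers, ensemble_size):
    """A partition may commit writes only if it holds a quorum."""
    return reachable_servers >= quorum_size(ensemble_size)
```

An ensemble of 4 tolerates the same single failure as an ensemble of 3, which is why even sizes add cost without adding fault tolerance. In a 5-node ensemble split 3/2, only the 3-node side keeps accepting writes, so the two sides cannot diverge.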

Apache ZooKeeper Commands:
1. start zookeeper : `bin/zookeeper-server-start.sh config/zookeeper.properties`
2. connect to zookeeper cli : `bin/zookeeper-shell.sh <zookeeper-host>:<zookeeper-port>`
3. create a znode : `create /<path> <data>`
4. list znode contents : `ls /<path>`
5. get znode data : `get /<path>`
6. set znode data : `set /<path> <new-data>`
7. delete znode : `delete /<path>`

### Pub-sub messaging workflow
- producers send messages to a topic at regular intervals
- Kafka broker stores all messages in the partitions configured for that particular topic
    - ensures the messages are equally shared between partitions
- consumer subscribes to a specific topic; Kafka provides the current offset of the topic to the consumer and also saves the offset (in the ZooKeeper ensemble in older Kafka versions, in an internal Kafka topic in current ones)
- consumer will request Kafka at regular intervals for new messages
- once Kafka has received messages from producers, it returns them to the consumer in response to these requests
- consumer will receive the message and process it
- once the messages are processed, consumer will send an acknowledgement to the Kafka broker
- once Kafka receives an acknowledgement, it changes the offset to the new value and updates it in the offset store
    - since offsets are stored durably, the consumer can read the next message correctly even during server outages
- this above flow will repeat until consumer stops the request
- consumer has option to rewind/skip to desired offset of a topic at any time and read all subsequent messages
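
The poll, process, acknowledge loop above can be sketched with a single in-memory partition. This is not the Kafka client API; `PartitionLog` and its methods are hypothetical, and the committed offsets live in a plain dict instead of a durable store.

```python
class PartitionLog:
    """One partition: an append-only list where a message's index is its offset."""

    def __init__(self):
        self.messages = []
        self.committed = {}  # consumer group id -> next offset to read

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1

    def poll(self, group_id, max_records=10):
        """Consumer pulls records starting at its group's committed offset."""
        start = self.committed.get(group_id, 0)
        return start, self.messages[start:start + max_records]

    def commit(self, group_id, next_offset):
        """Acknowledge processing by storing the next offset to read."""
        self.committed[group_id] = next_offset

    def seek(self, group_id, offset):
        """Rewind or skip to an arbitrary offset."""
        self.committed[group_id] = offset
```

A consumer calls `poll`, processes the batch, then calls `commit(group_id, start + len(batch))`. If it crashes before committing, the next `poll` returns the same records again, which is at-least-once behaviour.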

### Message queues vs Pub-sub systems
- message queues : messages are stored in a queue, each message is delivered to exactly one consumer
    - consumers can be grouped into consumer groups, each message is delivered to one consumer in each consumer group
    - consumers can acknowledge messages, once a message is acknowledged, it is deleted from the queue
    - if a consumer fails to acknowledge a message before a timeout, the message is redelivered to another consumer
    - examples : RabbitMQ, ActiveMQ, etc
- pub-sub systems : messages are published to a topic, each message is delivered to all consumers subscribed to that topic
    - in a classic pub-sub system, a message is deleted only once it has been consumed by all subscribers to the category
    - examples : Kafka, Google Cloud Pub/Sub, etc; these use a retention policy that keeps messages for a specified amount of time, even after they have been consumed by all subscribers
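
The difference in delivery semantics can be shown side by side. Both classes are simplified, hypothetical models: `requeue_unacked` stands in for the acknowledgement timeout, and the topic keeps a separate inbox per subscriber instead of a shared retained log.

```python
from collections import deque


class WorkQueue:
    """Queue semantics: each message goes to one consumer and is deleted on ack."""

    def __init__(self):
        self._ready = deque()
        self._unacked = {}
        self._next_id = 0

    def publish(self, body):
        self._ready.append((self._next_id, body))
        self._next_id += 1

    def receive(self):
        msg_id, body = self._ready.popleft()
        self._unacked[msg_id] = body
        return msg_id, body

    def ack(self, msg_id):
        del self._unacked[msg_id]  # acknowledged messages are gone for good

    def requeue_unacked(self):
        """Simulates the ack timeout: unacknowledged messages become deliverable again."""
        for msg_id, body in sorted(self._unacked.items()):
            self._ready.append((msg_id, body))
        self._unacked.clear()

    def size(self):
        return len(self._ready) + len(self._unacked)


class Topic:
    """Pub-sub semantics: every subscriber receives its own copy of each message."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        self._inboxes[name] = []

    def publish(self, body):
        for inbox in self._inboxes.values():
            inbox.append(body)

    def inbox(self, name):
        return list(self._inboxes[name])
```

With the queue, adding consumers splits the work; with the topic, adding subscribers multiplies the deliveries.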

### Streaming Architectures

Random streaming architectures I found

[Link1](https://www.simform.com/blog/stream-processing/)
<img alt="picture 11" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/132f86c07b23b9b8e108558f9e27bf6159a9c86ca033cfa15341b247cb2ebe62.png" width="700" style="display: block; margin-left: auto; margin-right: auto;">

[Link2](https://docs.aws.amazon.com/whitepapers/latest/build-modern-data-streaming-analytics-architectures/what-is-a-modern-streaming-data-architecture.html)
<img alt="picture 12" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/4318bba6aca69a8f6dce99565fb3d697ee78c24378ddc9c8ff3d8b1dba700aa4.png" width="700" style="display: block; margin-left: auto; margin-right: auto;">

[Link3](https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/data/stream-processing-stream-analytics)
<img alt="picture 13" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/cd7def52aa38a303f47ef8c8f15a484d61f92de42f90e3e4b8a83f7ce2049750.png" width="700" style="display: block; margin-left: auto; margin-right: auto;">
