# Stream Processing and Analytics

## Big Data Systems

### Reliable, Scalable and Maintainable Applications

#### Data Intensive Applications
- deal with huge amounts of complex, fast-moving data
- built with several building blocks (databases, caches, indexes, batch and stream processing systems, etc.)
- 3 main concerns: 
    1. reliability : fault tolerance, fault recovery, monitoring, alerting, etc.
        - recover from hardware/software faults, human errors, etc.
    2. scalability : horizontal scaling, load balancing, etc.
    3. maintainability : operability, simplicity, evolvability, etc.

#### Example web analytics pipeline
- designing an application to track user visits on a website (schema : user_id, page_id, n_visits, etc.)
- portal becomes popular, data volume increases, database writes become a bottleneck
    - use intermediate queue to buffer writes (queue will hold messages, database will consume messages)
- more traffic, more data, more writes, more database load
    - use multiple databases (shard data by user_id, page_id, etc.)
- as we move down, we need to handle more and more complexity - shards, queues, more complicated application logic, etc.
    - need to handle failures, need to handle load balancing, etc. 
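
The queue-and-shard steps above can be sketched in a few lines. This is only an in-memory illustration: the shard count, the `shard_for` helper and the `deque` standing in for a message queue are assumptions made for the example, not part of any particular system.

```python
import hashlib
from collections import defaultdict, deque

N_SHARDS = 4  # hypothetical number of database shards


def shard_for(user_id: str, n_shards: int = N_SHARDS) -> int:
    """Map a user_id to a shard using a stable hash (built-in hash() is salted per process)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_shards


# One dict per shard stands in for one database; keys are (user_id, page_id).
shards = [defaultdict(int) for _ in range(N_SHARDS)]
write_queue = deque()  # stands in for the intermediate message queue


def track_visit(user_id: str, page_id: str) -> None:
    """Web tier: enqueue the event instead of writing to the database directly."""
    write_queue.append((user_id, page_id))


def drain_queue() -> int:
    """Worker: consume buffered events and apply each one to the shard that owns it."""
    applied = 0
    while write_queue:
        user_id, page_id = write_queue.popleft()
        shards[shard_for(user_id)][(user_id, page_id)] += 1
        applied += 1
    return applied


def n_visits(user_id: str, page_id: str) -> int:
    """Read path: look up the counter on the shard that owns this user."""
    return shards[shard_for(user_id)].get((user_id, page_id), 0)
```

Even this toy version shows where the complexity comes from: the application now has to know about the queue, the shard mapping, and what happens to buffered writes if a worker fails.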

#### Big Data Systems
- handle huge amounts of data, fast moving data, complex data
- systems are designed to be distributed from the start, so application developers don't need to handle common issues like sharding and replication themselves
- scalability achieved by horizontal scaling - just add new machines, devs need only focus on application logic
- 3Vs of Big Data : Volume, Velocity, Variety
- examples : 
    1. data sources : web logs, social media, sensors, etc.
    2. data acquisition : Kafka, Flume, Spark Streaming, etc.
    3. storage : HDFS, Cassandra, HBase, etc.
    4. BI analysis : Spark, Hive, Pig, etc.
    5. visualization : Tableau, Zeppelin, PowerBI, etc.
- properties: fault tolerance, low latency, scalability, extensibility, maintainability, debuggability, etc.

#### Data Model of Big Data Systems
- fact based model : data is modeled as facts/events
    - graph schema captures relationships between entities in the form of nodes, edges and properties
    - nodes represent entities, edges represent relationships, properties represent attributes
    <img alt="picture 0" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/57bd7d48fad8d4e07f588599101a241bcc50e73ee1bafb1c4c97e5800aa95e1f.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">
- fact table : stores facts (events) in the form of rows
- other columns in the fact table are foreign key references to dimension tables
    <img alt="picture 1" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/e4e2363558893501a303520b0841a66bb4b755422ae4cbf17f27196ab334f389.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

#### Big data architecture style

<img alt="picture 2" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/5e126d8f3906955d4fa2864535921f0b36e80537bc9f6c72908e9dd31a8a7683.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

- big data solutions typically involve one or more of the following types of workload:
    - batch processing of big data sources at rest
    - real-time processing of big data in motion
    - interactive exploration of big data
    - predictive analytics and machine learning
- components are usually: data sources, data storage, batch processing, real-time message ingestion, stream processing, analytical data store, analysis and reporting, orchestration, etc
- benefits : choice in technology, performance through parallelism, scalability, interoperability with existing solutions
- challenges : complexity, lack of standards, lack of skills, etc

## Real Time Systems

### Real time systems, stream processing

| Feature                      | Real-Time Data               | Near Real-Time Data             | Streaming Data               |
| ---------------------------- | ---------------------------- | ------------------------------- | ---------------------------- |
| **Latency Measurement**       | Micro to milliseconds        | Extended to seconds             | Constant, always accessible  |
| **Use Case**                  | Immediate decision-making    | Operational intelligence, event-driven systems | Continuous acquisition and transmission of data |
| **Storage**                   | Short-term, may use in-memory storage | Short to mid-term, may be stored for a relatively longer | Continuous flow, no funneling to long-term storage |
| **Dependency on Response Time** | Critical                   | Important but with flexibility | No specific response time     |
| **Examples**                  | Financial trading systems, IoT applications, real-time analytics | Monitoring and alerting systems | Astronomical observations, climate observation systems, earth-sensing satellites |
| **Processing**                | Immediate processing upon creation or acquisition | Processing with acceptable latency for operational insights | Continuous tracking and analysis of data as it flows |

### Stream processing vs batch processing

| Feature                      | Batch Processing             | Stream Processing              |
| ---------------------------- | ---------------------------- | ------------------------------- |
| **Data Processing Model**    | Process data in fixed-size batches | Process data in real-time, small chunks or records |
| **Latency**                  | Typically higher, minutes to hours | Very low, milliseconds to seconds |
| **Use Case**                 | Suitable for scenarios where data is not time-sensitive, e.g., daily reports | Ideal for time-sensitive applications, e.g., real-time analytics, monitoring |
| **Processing Approach**      | Data processed in isolated batches | Continuous and incremental processing of data |
| **Storage**                  | Requires storage of large batches before processing | Minimal storage, as data is processed as it arrives |
| **Scalability**              | Scales well for large volumes of data but may have higher infrastructure costs | Scales horizontally with ease, cost-effective for high-velocity data |
| **Examples**                 | End-of-day financial reports, data warehousing | Real-time analytics, fraud detection, IoT data processing |
| **Complexity**               | Typically less complex as it deals with fixed batches | More complex due to real-time nature and handling of streaming data |
| **Analyses**                 | Complex analyses, e.g., machine learning, can be performed | Simple analyses, e.g., aggregations, and rolling metrics, can be performed |

Why is stream processing important?
- data is generated continuously, in huge volumes, at high velocity
    - to do batch processing, we need to wait for data to accumulate, stop and restart processing, etc.
    - in stream processing, data is processed as it arrives, so we can get real time insights gracefully and naturally
    - you can detect patterns, inspect results, look at data from multiple streams, etc.
- stream processing is a natural fit for time series and detecting patterns over time
- may work with less capable hardware, as data is processed in small chunks, also enables approximate query processing via load shedding
- stream processing is a natural fit for event driven architectures

### Applications of stream processing
- algorithmic trading, stock market analysis, fraud detection, smart patient monitoring, monitoring of IoT devices, production line monitoring, supply chain optimization, intrusion detection and surveillance, smart grids, traffic monitoring, sports analytics, contextual promotions and advertising, computer system and network monitoring, predictive maintenance, geospatial data processing
- CEP (complex event processing) : processing of multiple events to infer higher level events, e.g., detecting a fraud transaction by combining multiple events like login, purchase, etc.
- Stream analytics : processing of data in motion, e.g., aggregations, filtering, etc.
    - usually windows are used to process data in batches, e.g., tumbling window, sliding window, etc.
- Materialized views : pre-computed views of data, e.g., aggregations, etc.
    - can be used to speed up queries, e.g., in OLAP systems
    - can be used to speed up stream processing, e.g., in stream analytics
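
As a rough illustration of the windowing idea above, the sketch below counts events per tumbling and per sliding window. It assumes integer timestamps and events given as `(timestamp, value)` pairs; the function names are made up for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, size):
    """Count events per non-overlapping window [k*size, (k+1)*size)."""
    counts = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // size) * size
        counts[window_start] += 1
    return dict(counts)


def sliding_window_counts(events, size, slide):
    """Count events per overlapping window [start, start+size); starts are multiples of slide."""
    counts = defaultdict(int)
    for ts, _value in events:
        # Begin at the latest window start that is <= ts, then walk backwards
        # over every earlier window that still contains ts.
        start = (ts // slide) * slide
        while start > ts - size and start >= 0:
            counts[start] += 1
            start -= slide
    return dict(counts)
```

With a tumbling window each event lands in exactly one window; with a sliding window of size 10 and slide 5, each event is counted in up to two windows.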

### Streaming data sources
- operational monitoring : monitoring of systems, e.g., CPU, memory, network, etc.
- web analytics : tracking of user activity on websites, e.g., page views, clicks, etc.
- online advertising : call made by modern ad exchanges to bid on ad slots to multiple advertisers in real time
- social media : tracking of user activity on social media, e.g., tweets, likes, etc.
- IoT : tracking of sensor data, e.g., temperature, humidity, etc.

## Generalized Streaming Data Architecture

### Streaming data architecture
- Often we want to deploy our models to make predictions on data 'as it arrives'
- operating on streaming data presents a number of challenges
    - rebuilding your model to reflect the changing world
    - deploying your model that can run quickly and efficiently
    - scalability and fault tolerance (systems stuff)
- They are layered systems that rely on several loosely coupled systems
    - Helps in achieving high availability, managing the system, maintaining the cost under control
    - All subsystems / components can each reside on their own physical server or be co-hosted on one or more servers
    - Not all components need to be present in every system
- Components: 
    1. collection : takes responsibility of collecting the data from the source
        - mostly HTTP over TCP/IP, with data in formats like Avro, Parquet, JSON
        - happens at the edge, usually application specific, new servers are integrated directly with the streaming system
    2. data flow : required intermediate layer that takes responsibility of accepting messages / events from collection layer and providing those messages / events to processing layer
        - usually a message queue, e.g., Kafka, RabbitMQ, etc.
    3. processing : takes responsibility of processing the messages / events and generating the output
        - usually a stream processing framework, e.g., Spark Streaming, Flink, etc.
        - relies on distributed processing of data, framework does the most of the heavy lifting of data partitioning, job scheduling, job managing
    4. storage : takes responsibility of storing the output of the processing layer
        - usually a NoSQL database, e.g., Cassandra, MongoDB, etc.
    5. delivery : takes responsibility of delivering the output of the processing layer to the end user
        - usually a web / mobile interface or a BI tool, e.g., Tableau, PowerBI, etc.
        - can also be a raw file like SVG, PDF, etc. 
        - monitoring / alerting use cases, feeding data to downstream applications, etc.

### Lambda architecture

<img alt="picture 5" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/72d2beeaeeaa04308b7ccafd493a6e03cdd75c47dcf1c66727826dcceef24a05.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

<img alt="picture 6" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/1223d99cb90bcff11fa0fab78b1e50080efaf50792497cff9fcf61a3f5aa7c5c.png" width="500" style="display: block; margin-left: auto; margin-right: auto; padding-top: 20px">


- proposed by Nathan Marz (2011), combines batch and stream processing, a generic, scalable and fault tolerant data processing architecture
- should be linearly scalable, scale out by adding more machines rather than up
- critical feature : uses two separate data processing systems to handle different types of data processing workloads
    - batch processing system : processes data in large batches and stores the results in a centralized data store, e.g., data warehouse, distributed file system, etc.
    - stream processing system : processes data in real time as it arrives and stores the results in a distributed data store, e.g., message queue, NoSQL database, etc.
- 4 layers: 
    1. data ingestion layer : collects and stores raw data from various sources, e.g., log files, sensors, message queues, APIs, etc.
        - data is typically ingested in real time and fed to the batch layer and speed layer simultaneously
    2. batch layer : responsible for processing historical data in large batches and storing the results in a centralized data store, e.g., data warehouse, distributed file system, etc.
        - typically uses a batch processing framework, e.g., Hadoop, Spark, etc.
        - designed to handle large volumes of data and provide a complete view of all data
    3. speed layer : responsible for processing real time data as it arrives and storing the results in a distributed data store, e.g., message queue, NoSQL database, etc.
        - typically uses a stream processing framework, e.g., Flink, Storm, etc.
        - designed to handle high volume data streams and provide up to date views of the data
    4. serving layer : responsible for serving query results to users in real time
        - typically implemented as a layer on top of the batch and stream processing layers
        - accessed through a query layer, which allows users to query the data using a query language, e.g., SQL, HiveQL, etc.
        - designed to provide fast and reliable access to query results, regardless of whether the data is being accessed from the batch or stream processing layers
        - typically uses a distributed data store, e.g., NoSQL database, distributed cache, etc.
- advantages : 
    - scalability : designed to handle large volumes of data and scale horizontally to meet the needs of the business
    - fault tolerance : designed to be fault tolerant, with multiple layers and systems working together to ensure that data is processed and stored reliably
    - flexibility : designed to handle a wide range of data processing workloads, from historical batch processing to streaming architecture
- disadvantages :
    - complexity : inherently complex, since it uses multiple layers and systems to process and store data
    - errors and data discrepancies : with doubled implementations of different workflows, you may run into a problem of different results from batch and stream processing engines
    - architecture lock-in : may be super hard to reorganize or migrate existing data stored in the Lambda architecture
- use cases :
    - handling large volumes of data and providing low-latency query results, e.g., dashboards and reporting
    - batch processing tasks, e.g., data cleansing, transformation, and aggregation
    - stream processing tasks, e.g., event processing, machine learning models, anomaly detection, and fraud detection
    - building data lakes, e.g., centralized repositories that store structured and unstructured data at rest
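
The interaction between the layers can be sketched with in-memory stand-ins. This is a deliberately simplified model under stated assumptions: a Python list plays the raw data store, `Counter` objects play the batch and real-time views, and the speed-layer view is simply cleared after each batch run (a real system has to handle events that arrive while the batch job is running).

```python
from collections import Counter

master_dataset = []        # immutable, append-only raw events (input to the batch layer)
batch_view = Counter()     # precomputed by the batch layer
realtime_view = Counter()  # incrementally updated by the speed layer


def ingest(page_id):
    """Ingestion layer: feed every event to both the batch and the speed layer."""
    master_dataset.append(page_id)
    realtime_view[page_id] += 1


def run_batch_job():
    """Batch layer: recompute the view from all raw data, then reset the speed layer."""
    global batch_view
    batch_view = Counter(master_dataset)
    realtime_view.clear()  # everything seen so far is now covered by batch_view


def query(page_id):
    """Serving layer: merge the batch view with the real-time view."""
    return batch_view[page_id] + realtime_view[page_id]
```

The counting logic appears twice here (once incrementally, once as a full recomputation), which is the source of the "doubled implementation" disadvantage listed above.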

### Kappa architecture

<img alt="picture 3" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/0908ffe3ce19857b78b9684c8eb50acd9882451d681dfcbc346e517b2c6e14d5.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

<img alt="picture 7" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/62326926883a15f1de431103d4df58f52eba4580a527aff23020f4096edbec30.png" width="500" style="display: block; margin-left: auto; margin-right: auto; padding-top: 20px">

- proposed by Jay Kreps (2014), an evolution of Lambda architecture, uses a single data processing system to handle both batch processing and stream processing workloads
- treats everything as streams, allows it to provide a more streamlined and simplified data processing pipeline while still providing fast and reliable access to query results
- there is only the speed layer (stream layer)
- 2 components : 
    1. ingestion component : same as Lambda architecture
    2. processing component : responsible for processing the data as it arrives and storing the results in a distributed data store, e.g., message queue, NoSQL database, etc.
        - typically implemented using a stream processing framework, e.g., Flink, Storm, etc.
        - designed to handle high volume data streams and provide fast and reliable access to query results
        - there is no separate serving layer; instead, the stream processing layer is responsible for serving query results to users in real time
- advantages :
    - simplicity and streamlined pipeline : uses a single data processing system to handle both batch processing and stream processing workloads, which makes it simpler to set up and maintain compared to Lambda architecture
    - enables high-throughput big data processing of historical data : although it may seem that Kappa architecture is not designed for this set of problems, it can support these use cases gracefully, enabling reprocessing directly from the stream processing job
    - ease of migrations and reorganizations : as there is only stream processing pipeline, you can perform migrations and reorganizations with new data streams created from the canonical data store
    - tiered storage : tiered storage is a method of storing data in different storage tiers, based on the access patterns and performance requirements of the data
        - in Kappa architecture, tiered storage is not a core concept, however, it is possible to use tiered storage in conjunction with Kappa architecture, as a way to optimize storage costs and performance
        - for example, businesses may choose to store historical data in a lower-cost fault tolerant distributed storage tier, such as object storage, while storing real-time data in a more performant storage tier, such as a distributed cache or a NoSQL database
        - with tiered storage, Kappa architecture becomes a cost-efficient and elastic data processing approach without the need for a traditional data lake
- disadvantages :
    - complexity : although Kappa architecture is simpler than Lambda architecture, it can still be complex to set up and maintain, especially for businesses that are not familiar with stream processing frameworks
    - costly infrastructure with scalability issues : storing big data in an event streaming platform can be costly; to make it more cost-efficient, you may want to use your cloud provider's object storage as a data lake (like AWS S3 or Google Cloud Storage). Another common approach for big data architecture is building a 'streaming data lake', with Apache Kafka as the streaming layer and object storage for long-term data storage
- use cases:
    - real-time data processing, e.g., continuous data pipelines, real-time data processing, machine learning models and real-time data analytics, IoT systems, etc.
    - building data lakes, e.g., centralized repositories that store structured and unstructured data at rest
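
The reprocessing idea behind Kappa can be illustrated with a replayable log. In this minimal sketch a Python list stands in for a retained event log, and `count_v1` / `count_v2` are hypothetical versions of the same stream job; changing the logic means replaying the log through the new version instead of maintaining a separate batch pipeline.

```python
event_log = []  # append-only log; stands in for a retained stream of events


def append(event):
    """Ingestion: every event is appended to the log and never modified."""
    event_log.append(event)


def build_view(process, from_offset=0):
    """Run one stream job over the log from a given offset and return the view it builds."""
    view = {}
    for event in event_log[from_offset:]:
        process(view, event)
    return view


def count_v1(view, event):
    """Version 1 of the job: count visits per page."""
    view[event["page"]] = view.get(event["page"], 0) + 1


def count_v2(view, event):
    """Version 2: same count, but ignore bot traffic (a hypothetical rule change)."""
    if not event.get("bot", False):
        view[event["page"]] = view.get(event["page"], 0) + 1
```

Replaying from offset 0 with `count_v2` yields a corrected view over all historical data, using the same code path that handles live events.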




## Service Configuration and Coordination in Distributed Systems

### Distributed systems
- collection of independent computers that appears to its users as a single coherent system
- computers coordinate their actions by passing messages over a network (nodes together form a cluster)
- nodes need to share metadata and state information to coordinate their actions (e.g., leader election, location of data, etc.)
    - this is difficult, leads to incorrect results, inconsistent state, etc.
    - needs a system-wide service that correctly and reliably implements distributed configuration and coordination
- challenges : 
    - unreliable network : network partitions, message loss, etc.
        - latency issues, bandwidth changes, lost connections, etc.
        - split brain problem : loss of connectivity between nodes, nodes may form multiple clusters
            - some amount of state is inaccessible to some nodes, nodes may diverge in their views of the system
            - either disallow writes to the state until the network partition is resolved, or allow one partition to remain functional while degrading the capabilities of the other partition
    - unreliable nodes : node failures, node restarts, etc.
    - clock synchronization : clocks on different nodes may be out of sync, leading to clock drift and misordering of events
    - consistency in application state : nodes may have different views of the application state (use Paxos, multi-Paxos, Raft algorithms to solve this)
        - these are difficult to implement

#### Data delivery semantics
Data Delivery Semantics, governing data transfer, includes three types:
1. At Most Once (AMO): The data is delivered at most once. It may not be delivered at all.
    - eg: sending a notification; no guarantee of receipt, and missed notifications are not that important
2. At Least Once (ALO): The data is delivered at least once. It may be delivered multiple times.
    - eg: email delivery; continuous sending until acknowledged, allowing duplicates
3. Exactly Once (EO): The data is delivered exactly once. Deduplication may be required.
    - eg: financial transactions; ensures no duplicates but involves more complex mechanisms

### Apache Kafka [[Vid1]](https://youtu.be/B5j3uNBH8X4), [[Vid2]](https://youtu.be/jY02MB-sz8I)
- [Kafka docs](https://kafka.apache.org/documentation/#gettingStarted)
- distributed streaming platform, used for building real time data pipelines and streaming applications
- Kafka is a distributed system, it needs to coordinate its nodes to ensure that they are all in agreement
- traditionally uses ZooKeeper to manage its cluster, store metadata, and perform leader election (newer Kafka versions can instead run in KRaft mode, without ZooKeeper)
    - Cluster management : ZooKeeper is used to manage the Kafka cluster, including configuration, topic metadata, broker metadata, etc.
    - Failure detection and recovery, leader election
    - Store ACLs for authorization

<img alt="picture 8" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/8c6d01dd21e6edaf3e614f69553b607fdcf751a25326b086d27dd704527c5bd9.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

- Architecture:
    1. Producer : responsible for publishing data to Kafka cluster
        - Can be written in any language, native: Java, C/C++, Python, Go, .NET, etc
    2. Consumer : responsible for subscribing to topics and processing the data published to them
        - new inflowing messages are automatically retrieved
        - the consumer offset, which keeps track of the last message read, is stored in a special Kafka topic (`__consumer_offsets`)
    3. Cluster : collection of nodes that together form a Kafka cluster
        - nodes are called brokers, each broker is identified by a unique integer ID
- Producers and consumers are completely decoupled:
    - slow consumers/producers don't affect each other, add more consumers/producers to scale, failures of consumers/producers don't affect system
- Topics: streams of related messages in Kafka, is a logical representation, categorizes messages into groups
    - developers can define any number of topics
    - topics are partitioned, each partition is an ordered, immutable sequence of messages
        - partitions are distributed across brokers, each partition is replicated across multiple brokers for fault tolerance
    - producers <-> topics are N:N relation, same with consumers <-> topics

<img alt="picture 9" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/8b5680ad6c9e3e97d745b8928d567ee1f36a3817c22b02207b94a700faf96dae.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

- Kafka data record : header, key, value,  timestamp
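    - header : optional key-value metadata attached to the record (the topic, partition and offset locate the record in the log, but are not part of the headers)
    - key : optional, used for partitioning; if the key is null, messages are spread across partitions (round-robin in older clients, batch-by-batch "sticky" assignment in newer ones)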
    - value : actual data, can be any format, e.g., JSON, Avro, etc.
    - timestamp
- Broker replication:
    - each partition has one broker acting as leader and multiple brokers acting as followers
    - leader handles all read and write requests for the partition, followers replicate the leader
    - if leader fails, one of the followers is elected as the new leader
    - if a follower fails or falls too far behind, it is removed from the ISR (in-sync replica set); it can rejoin once it has caught up with the leader
- Load Balancing and Semantic Partitioning
    - Producers use a partitioning strategy (defined by producer) to determine which partition to publish messages to
        - default strategy : hash(key) % n_partitions
            - messages with the same key are always published to the same partition
        - no key : round-robin
        - custom partitioner is also allowed
- Estimating number of partitions:
    - $N = \max(\frac{T_t}{T_p}, \frac{T_t}{T_c})$
        - $T_t$ : target throughput of the system (the overall rate the topic must sustain)
        - $T_p$ : throughput of producer writing onto single partition
        - $T_c$ : throughput of a consumer reading from single partition
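
The partition estimate and the key-based partitioning rule above can be tried out with a small sketch. Two assumptions are worth flagging: $T_t$ is read as the target throughput (all three values in the same unit, e.g. MB/s), and `zlib.crc32` is used as a stable stand-in hash, whereas Kafka's own default partitioner uses a different hash function.

```python
import math
import zlib


def estimate_partitions(target, per_partition_producer, per_partition_consumer):
    """N = max(T_t / T_p, T_t / T_c), rounded up to a whole number of partitions."""
    return math.ceil(max(target / per_partition_producer,
                         target / per_partition_consumer))


def choose_partition(key: bytes, n_partitions: int) -> int:
    """hash(key) % n_partitions, using CRC32 as an illustrative stable hash."""
    return zlib.crc32(key) % n_partitions
```

For example, a target of 100 MB/s with 20 MB/s per partition on the producer side and 10 MB/s on the consumer side gives `max(5, 10)`, so 10 partitions. Because the partition is a pure function of the key, records with the same key always land in the same partition, which preserves their relative order.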

Apache Kafka Commands:
1. start kafka server : `bin/kafka-server-start.sh config/server.properties`
2. create a topic : `bin/kafka-topics.sh --create --topic <topic-name> --bootstrap-server <bootstrap-server> --partitions <num-partitions> --replication-factor <replication-factor>`
3. list topics : `bin/kafka-topics.sh --list --bootstrap-server <bootstrap-server>`
4. produce messages : `bin/kafka-console-producer.sh --topic <topic-name> --bootstrap-server <bootstrap-server>`
5. consume messages : `bin/kafka-console-consumer.sh --topic <topic-name> --bootstrap-server <bootstrap-server> --from-beginning`
6. describe consumer groups : `bin/kafka-consumer-groups.sh --describe --group <group-id> --bootstrap-server <bootstrap-server>`
7. check topic offsets : `bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic <topic-name> --bootstrap-server <bootstrap-server>` (per-group committed offsets are shown by the `kafka-consumer-groups.sh --describe` command above)

Apache Kafka APIs:
1. Producer API : used to publish messages to Kafka topics
2. Consumer API : used to subscribe to Kafka topics and process messages
3. Streams API : used to process streams of data and produce output streams, effectively transforming input streams into output streams
    - highly scalable and fault-tolerant, can be used to build real-time applications, stateful and stateless processing, etc.
    - event time processing, windowing, joins, aggregations, etc.
    - does not take much configuration, can be run on a single machine or a cluster
    - employs one record at a time processing, to achieve low latency and high throughput
4. Connect API : used to connect Kafka topics to external systems, e.g., databases, message queues, etc.

Stream processing topologies:
- A stream is an unbounded, continuously updating data set, ordered, replayable, fault-tolerant sequence of immutable data records
- A stream processing application defines its computational logic through one or more processor topologies, a graph of stream processors connected by streams
- A stream processor is a node in the processor topology, represents a processing step to transform data in streams
    - receives one input record at a time from its upstream processors
    - applies its operation to it
    - may subsequently produce one or more output records to its downstream processors
- Kafka stream processors:
    - Source Processor : does not have any upstream processors, produces an input stream to its topology from one or multiple Kafka topics
    - Sink Processor : does not have downstream processors, sends any received records from its upstream processors to a specified Kafka topic
- Processing in Kafka Streams:
    - High-level Kafka Streams DSL : provides ready to use methods with functional style
        - has already implemented methods ready to use
        - composed of two main abstractions: KStream and KTable or GlobalKTable
            - KStream : abstraction of record stream, provides many functional ways to manipulate stream data
                - map, mapValues, flatMap, flatMapValues, filter, are some of the methods
            - KTable or GlobalKTable : abstraction of a changelog stream, every data record is considered an Insert or Update
                - existing row with the same key will be overwritten
    - Low-level Processor API : provides flexibility to implement processing logic according to need
        - extends AbstractProcessor, overrides process method, called once for every key-value pair
        - provides client to access stream data, perform business logic, send result as downstream data
        - trade-off is lines of code needed for specific scenarios
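
To make the topology idea concrete, here is a plain-Python sketch (not the Kafka Streams API, which is a Java library). Records are `(key, value)` tuples, each processor takes one record and returns a list of output records, and the helper names merely mirror the DSL methods mentioned above.

```python
def run_topology(source_records, processors):
    """Push records one at a time through a linear chain of processors into a sink list."""
    sink = []
    for record in source_records:      # source processor: reads from an input topic
        batch = [record]
        for processor in processors:   # each stream processor may emit 0..n records
            batch = [out for rec in batch for out in processor(rec)]
        sink.extend(batch)             # sink processor: writes to an output topic
    return sink


def filter_(predicate):
    """Keep a record only if the predicate holds."""
    return lambda rec: [rec] if predicate(rec) else []


def map_values(fn):
    """Transform the value, keep the key."""
    return lambda rec: [(rec[0], fn(rec[1]))]


def flat_map_values(fn):
    """Turn one value into zero or more values under the same key."""
    return lambda rec: [(rec[0], v) for v in fn(rec[1])]


def to_table(records):
    """KTable-style view: each record is an upsert, so later values overwrite earlier ones."""
    table = {}
    for key, value in records:
        table[key] = value
    return table
```

A chain such as `[filter_(...), flat_map_values(str.split), map_values(str.upper)]` behaves like a KStream pipeline, while `to_table` shows the changelog interpretation: the stream keeps every record, the table keeps only the latest value per key.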

### Apache ZooKeeper [[Vid]](https://youtu.be/gZj16chk0Ss)
- [ZooKeeper overview](https://zookeeper.apache.org/doc/r3.9.1/zookeeperOver.html)
- distributed coordination service, used for building distributed systems, used with Hadoop, HBase, Kafka, etc
- services:
    - naming : provides a hierarchical namespace for nodes in cluster
    - configuration management : stores and manages configuration information for systems
    - cluster management : manages the membership of nodes in a cluster
    - leader election : elects a leader among distributed nodes
    - locking and synchronization : provides primitives for synchronization and coordination between distributed processes
- benefits : synchronization, serialization (application runs consistently), atomicity, reliability, etc.
- architecture:
    - ensemble : a group of ZooKeeper servers that collectively manage the service (3 minimum)
        - one leader is elected among the ensemble, followers replicate the leader
    - quorum : majority of servers must agree on an operation for it to be committed
    - client : connects to any server in the ensemble, sends heartbeat to server, if server doesn't respond, client connects to another server
    - znode : the basic data structure in ZooKeeper, analogous to a file system node
        - znodes are organized in a hierarchical tree-like structure
        - each znode contains data and metadata (version number, ACL, etc.)
        - znodes can be ephemeral (deleted when the client session that created them ends) or persistent (kept until explicitly deleted); either kind can also be sequential (a monotonically increasing counter is appended to the znode name)

<img alt="picture 10" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/2574be9ee8aa76e507988eb8754c5d9ec658d83ada1c13e2050ff13d69f3821e.png" width="400" style="display: block; margin-left: auto; margin-right: auto;">

- operations : create, read, update, delete, watch
    - watches : allows clients to be notified of changes to a znode
- ACLs : access control lists, used to control access to znodes
    - each znode has an ACL that specifies the permissions for that znode
    - permissions : read, write, create, delete, admin
    - ACLs are stored in ZooKeeper and managed by ZooKeeper itself
- deployment:
    - ensemble size : typically odd number (3, 5, 7) for fault tolerance
    - network configuration : low-latency, reliable network required between ZooKeeper servers
    - client deployment : clients connect to any server in the ensemble
- split brain problem : loss of connectivity between nodes, nodes may form multiple clusters
    - some amount of state is inaccessible to some nodes, nodes may diverge in their views of the system
    - either disallow writes to the state until the network partition is resolved, or allow one partition to remain functional while degrading the capabilities of the other partition; with a majority quorum, at most one partition can commit writes
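
The quorum rule explains both the preference for odd ensemble sizes and the way split brain is avoided. A small arithmetic sketch (function names are illustrative):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def can_commit(partition_size: int, ensemble_size: int) -> bool:
    """A network partition may commit writes only if it contains a majority."""
    return partition_size >= quorum_size(ensemble_size)
```

An ensemble of 4 tolerates the same single failure as an ensemble of 3, so the extra server buys nothing. And because two disjoint partitions cannot both hold a majority, at most one side of a network split can keep accepting writes.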

Apache ZooKeeper Commands:
1. start zookeeper : `bin/zookeeper-server-start.sh config/zookeeper.properties`
2. connect to zookeeper cli : `bin/zookeeper-shell.sh <zookeeper-host>:<zookeeper-port>`
3. create a znode : `create /<path> <data>`
4. list znode contents : `ls /<path>`
5. get znode data : `get /<path>`
6. set znode data : `set /<path> <new-data>`
7. delete znode : `delete /<path>`
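
The shell commands above operate on a hierarchical tree of znodes. The toy in-memory model below mimics their behaviour to show how persistent, ephemeral and sequential znodes differ. It is not a ZooKeeper client: `ZNodeTree` is a made-up class, and it uses one global sequence counter, whereas real ZooKeeper keeps a counter per parent znode.

```python
class ZNodeTree:
    """Toy model of a znode hierarchy: path -> data, plus ephemeral ownership."""

    def __init__(self):
        self.nodes = {"/": b""}
        self.ephemeral_owner = {}  # path -> session id that created it
        self.seq = 0               # simplification: one global sequence counter

    def create(self, path, data=b"", ephemeral_owner=None, sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent does not exist: {parent}")
        if sequential:
            path = f"{path}{self.seq:010d}"  # append a zero-padded counter
            self.seq += 1
        if path in self.nodes:
            raise KeyError(f"node already exists: {path}")
        self.nodes[path] = data
        if ephemeral_owner is not None:
            self.ephemeral_owner[path] = ephemeral_owner
        return path

    def get(self, path):
        return self.nodes[path]

    def set(self, path, data):
        if path not in self.nodes:
            raise KeyError(path)
        self.nodes[path] = data

    def ls(self, path):
        """Names of the direct children of path."""
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):])

    def delete(self, path):
        if self.ls(path):
            raise ValueError(f"node has children: {path}")
        del self.nodes[path]
        self.ephemeral_owner.pop(path, None)

    def close_session(self, owner):
        """When a session ends, all ephemeral znodes it created disappear."""
        for path in [p for p, o in self.ephemeral_owner.items() if o == owner]:
            del self.nodes[path]
            del self.ephemeral_owner[path]
```

Ephemeral sequential znodes under a common parent are the usual building block for locks and leader election: each client creates one, and the client holding the lowest sequence number is the current owner.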

### Pub-sub messaging workflow
- producers send messages to a topic at regular intervals
- Kafka broker stores all messages in the partitions configured for that particular topic
    - ensures the messages are equally shared between partitions
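- consumer subscribes to a specific topic; Kafka provides the current offset of the topic to the consumer and also saves the offset in the Zookeeper ensemble (this is how older Kafka versions work; newer versions store committed offsets in the internal `__consumer_offsets` topic instead)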
- consumer will request Kafka at regular intervals for new messages
- once Kafka receives messages from producers, it returns them to the consumers in response to these requests (consumers pull; the broker does not push)
- consumer will receive the message and process it
- once the messages are processed, consumer will send an acknowledgement to the Kafka broker
- once Kafka receives an acknowledgement, it changes offset to new value and updates it in Zookeeper
    - since offsets are maintained in Zookeeper, consumer can read the next message correctly even during server outages
- the above flow repeats until the consumer stops requesting
- consumer has option to rewind/skip to desired offset of a topic at any time and read all subsequent messages
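
The offset bookkeeping in this workflow can be sketched with a toy in-memory model. This is only an illustration: `ToyTopic` and its methods are hypothetical names, not part of any Kafka client API, and round-robin partitioning is an assumption standing in for "messages are equally shared between partitions".

```python
class ToyTopic:
    """Toy in-memory topic: partitions are append-only lists, offsets are list indexes."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._produced = 0
        # (consumer group, partition) -> next offset to read
        self.committed = {}

    def produce(self, message):
        # Round-robin assignment spreads messages evenly over partitions.
        partition = self._produced % len(self.partitions)
        self.partitions[partition].append(message)
        self._produced += 1
        return partition, len(self.partitions[partition]) - 1

    def poll(self, group, partition, max_records=10):
        # The consumer pulls: it asks for records starting at its committed offset.
        start = self.committed.get((group, partition), 0)
        log = self.partitions[partition]
        end = min(len(log), start + max_records)
        return [(offset, log[offset]) for offset in range(start, end)]

    def commit(self, group, partition, next_offset):
        # Acknowledgement: remember where this group should resume.
        self.committed[(group, partition)] = next_offset

    def seek(self, group, partition, offset):
        # Rewind or skip: messages are retained, so any offset can be re-read.
        self.committed[(group, partition)] = offset


topic = ToyTopic(num_partitions=2)
for i in range(4):
    topic.produce(f"m{i}")

records = topic.poll("analytics", partition=0)
topic.commit("analytics", 0, records[-1][0] + 1)
```

Because the committed offset lives outside the consumer, a restarted consumer calling `poll` resumes where the group left off, and `seek` lets it re-read older messages.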

### Message queues vs Pub-sub systems
- message queues : messages are stored in a queue, each message is delivered to exactly one consumer
    - consumers can be grouped into consumer groups, each message is delivered to one consumer in each consumer group
    - consumers can acknowledge messages, once a message is acknowledged, it is deleted from the queue
    - if a consumer fails to acknowledge a message before a timeout, the message is redelivered to another consumer
    - examples : RabbitMQ, ActiveMQ, etc
- pub-sub systems : messages are published to a topic, each message is delivered to all consumers subscribed to that topic
    - in a classic pub-sub system, a message is deleted only once it has been consumed by all subscribers to the topic
    - examples : Kafka, Google Cloud Pub/Sub, etc.
        - these have a retention policy that keeps messages for a specified amount of time, even after they have been consumed by all subscribers

### Streaming Architectures

Random streaming architectures I found

[Link1](https://www.simform.com/blog/stream-processing/)
<img alt="picture 11" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/132f86c07b23b9b8e108558f9e27bf6159a9c86ca033cfa15341b247cb2ebe62.png" width="700" style="display: block; margin-left: auto; margin-right: auto;">

[Link2](https://docs.aws.amazon.com/whitepapers/latest/build-modern-data-streaming-analytics-architectures/what-is-a-modern-streaming-data-architecture.html)
<img alt="picture 12" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/4318bba6aca69a8f6dce99565fb3d697ee78c24378ddc9c8ff3d8b1dba700aa4.png" width="700" style="display: block; margin-left: auto; margin-right: auto;">

[Link3](https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/data/stream-processing-stream-analytics)
<img alt="picture 13" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/cd7def52aa38a303f47ef8c8f15a484d61f92de42f90e3e4b8a83f7ce2049750.png" width="700" style="display: block; margin-left: auto; margin-right: auto;">


## Stream Processing Framework Features

State Management
- where to maintain state in stream processing systems?
    - in-memory : for example, sum(sales) for last one hour
        - data will be flushed out once one hour window expires
        - potential of losing all the data if the stream processor node fails
    - persistent storage : for example, when data from two different streams must be joined and the streams produce data at different rates

Fault Tolerance
- points of failure in a stream processing data flow
    - incoming stream data, network carrying data, stream processor, connection to output sink, output destination, streaming manager, application driver
- in case of data loss, different approaches can be used
    - replication and coordination : replicate the state of computation on multiple nodes, in case of failures, streaming manager interacts with replicas
        - the number of simultaneous failures to tolerate can be predefined : a k-fault-tolerant system survives k simultaneous failures
    - state machine approach : streaming job is replicated on multiple nodes, replicas are coordinated by sending same input in same order to all
        - allows for quick failover, resulting in little disruption
    - rollback recovery approach : stream processor periodically packages the state of computation into checkpoint, copies checkpoint to different nodes
        - in case of failure, stream manager fetches the computation from last saved checkpoint and reschedules it
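
The rollback recovery approach can be sketched as follows. This is a minimal single-process illustration, assuming the input can be replayed from a saved offset; `CheckpointedSum` and `run` are hypothetical names, and a real system would copy the checkpoint to other nodes rather than keep it in memory.

```python
import copy


class CheckpointedSum:
    """Running sum whose state is periodically packaged into a checkpoint."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.state = {"offset": 0, "total": 0}
        self.checkpoint = copy.deepcopy(self.state)

    def process(self, value):
        self.state["total"] += value
        self.state["offset"] += 1
        if self.state["offset"] % self.checkpoint_every == 0:
            # Snapshot the state of the computation.
            self.checkpoint = copy.deepcopy(self.state)

    def recover(self):
        # Roll back to the last checkpoint and report where to resume reading.
        self.state = copy.deepcopy(self.checkpoint)
        return self.state["offset"]


def run(events, checkpoint_every, fail_at=None):
    """Process events, simulating one crash just before index `fail_at`."""
    operator = CheckpointedSum(checkpoint_every)
    i, failed = 0, False
    while i < len(events):
        if fail_at is not None and i == fail_at and not failed:
            failed = True
            i = operator.recover()  # replay from the checkpointed offset
            continue
        operator.process(events[i])
        i += 1
    return operator.state["total"]
```

Events processed after the last checkpoint are lost in the crash and then replayed, so `run([1, 2, 3, 4, 5], 2, fail_at=3)` still arrives at the same total as a run without failure.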
    
### Timing Concepts
- Event Time vs Stream Time
    - event time is time at which the event occurs
    - stream time is time at which event enters the streaming system
    - stream time will lag a bit due to several reasons
    - for example, if we are monitoring the traffic on the road and a person jumps the signal, then
        - event time is the time at which the person jumped the signal
        - stream time is the time at which the event information reaches the streaming platform for processing (for example, issuing the fine and sending an alert to the person's mobile)
- Time Skew
    - stream time is often lagging behind the event time
    - the difference between these two times is "time skew"
    - variance can be significant sometimes due to various issues
    - applications need to consider whether they depend on event time or stream time
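
A small sketch of why the choice matters, using made-up events with timestamps in seconds (the field names `event_time` and `stream_time` are assumptions for this example): the same three events fall into different per-minute counts depending on which clock is used.

```python
from collections import Counter


def count_per_minute(events, time_field):
    """Count events per minute bucket, using the chosen timestamp field."""
    return dict(Counter(event[time_field] // 60 for event in events))


# Hypothetical events; times are seconds since the start of observation.
events = [
    {"id": "a", "event_time": 50, "stream_time": 55},
    {"id": "b", "event_time": 58, "stream_time": 63},
    {"id": "c", "event_time": 70, "stream_time": 72},
]

# Time skew per event: how far stream time lags behind event time.
skews = [e["stream_time"] - e["event_time"] for e in events]

by_event_time = count_per_minute(events, "event_time")
by_stream_time = count_per_minute(events, "stream_time")
```

Event `b` occurred in minute 0 but arrived in minute 1, so the two counts disagree even though the skew is only a few seconds.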

### Windowing
- Windowing is a technique used in stream processing to divide the stream of data into finite segments
- policies:
    - trigger policy : rules for determining when the code should be executed
    - eviction policy : rules for determining when the data should be evicted from the window
- types:
    - sliding window : window slides over the data stream
        - window length specifies the eviction policy (time duration for which data is available for processing, eg. if 2 seconds, data older than 2 seconds will be evicted)
        - sliding interval specifies the trigger policy (time duration after which code will be triggered eg. if 1 second, code will be executed every second)
    - tumbling window : window is fixed and data is processed in chunks
        - trigger and eviction policies are executed when window is full
        - no time constraints for window to be filled
        - types:
            - count based tumbling : trigger and eviction policy set to 2 events, when 2 events are accumulated in window, trigger will be fired, code will be executed, window will be drained
            - temporal based tumbling : based on time, if the eviction and trigger policy set to 2 seconds, when 2 seconds are over, trigger will be fired, code will be executed, window will be drained
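
The window types above can be sketched in plain Python. This is an illustrative sketch over an in-memory list of `(timestamp, value)` pairs sorted by timestamp, not the API of any streaming framework; the function names are made up for this example, and the sliding window assumes timestamps start at 0.

```python
from collections import deque


def sliding_window_sums(events, window_length, slide_interval):
    """Fire every `slide_interval`; keep only events newer than `window_length`."""
    results = []
    if not events:
        return results
    window = deque()
    last_ts = events[-1][0]
    next_idx = 0
    trigger = slide_interval
    while True:
        # Admit every event that arrived up to the trigger time.
        while next_idx < len(events) and events[next_idx][0] <= trigger:
            window.append(events[next_idx])
            next_idx += 1
        # Eviction policy: drop events older than the window length.
        while window and window[0][0] <= trigger - window_length:
            window.popleft()
        # Trigger policy: run the computation (here, a sum).
        results.append((trigger, sum(value for _, value in window)))
        if trigger >= last_ts:
            break
        trigger += slide_interval
    return results


def count_tumbling_windows(values, size):
    """Fire and drain the window each time it holds `size` events."""
    window, fired = [], []
    for value in values:
        window.append(value)
        if len(window) == size:
            fired.append(list(window))
            window.clear()
    return fired


def time_tumbling_windows(events, length):
    """Group events into back-to-back windows of `length` time units (empty windows are skipped)."""
    buckets = {}
    for ts, value in events:
        buckets.setdefault(int(ts // length), []).append(value)
    return [buckets[key] for key in sorted(buckets)]


events = [(0.5, 1), (1.2, 2), (1.8, 3), (2.5, 4), (3.1, 5)]
```

With a window length of 2 and a sliding interval of 1, consecutive sliding windows overlap, so an event can contribute to more than one result; tumbling windows never overlap.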

### Stream Joins
- stream-stream join : event streams from two or more sources are joined in real time
    - sources can be same or different
    - for example, location coordinates of users' mobile devices are received in real time and can be joined with real-time traffic updates around those locations to suggest an alternative route in case of congestion
    - at least one stream must be bounded (have a window attached) to perform join
        ```sql
        Select x, y 
        From S1 as s1 join S2 as s2 
        on s1.id=s2.id 
        insert into JoinedStream
        ```
    - one window join : one stream is bounded by a window, events retained in the window are matched against events coming in the second stream
        - only events in the second stream will trigger an output
    - two window join : a two-window query will retain events coming from both the streams in the window
        - a new event arriving on either stream can trigger a match and an output
- stream-table join : streaming event and data stored in persistent storage like database is joined together
    - usually information from the database is fetched to add more value to the streaming data
    - for example, a user's location near a mall is tracked in real time along with the user's identity and sourced as a stream; the user's profile is fetched from a database to look up their interests, and based on those interests and the outlets in the mall, discount coupons can be sent to the user's device
    - database lookups can be slow when there are many user records, so a local copy of the user data can be loaded into the stream processor
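
A one window join can be sketched like this. It is a minimal in-memory illustration, assuming events carry a timestamp in seconds and a join key; `OneWindowJoin` and its method names are hypothetical.

```python
from collections import deque


class OneWindowJoin:
    """Stream S1 is bounded by a time window; events on S2 probe that window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.window = deque()  # (timestamp, key, payload) from S1, oldest first

    def _evict(self, now):
        # Drop S1 events that have fallen out of the window.
        while self.window and self.window[0][0] <= now - self.window_seconds:
            self.window.popleft()

    def on_s1(self, timestamp, key, payload):
        # S1 events are only retained; they never produce output themselves.
        self._evict(timestamp)
        self.window.append((timestamp, key, payload))

    def on_s2(self, timestamp, key, payload):
        # Only S2 events trigger output: match against the retained S1 events.
        self._evict(timestamp)
        return [(p1, payload) for _, k, p1 in self.window if k == key]


join = OneWindowJoin(window_seconds=10)
join.on_s1(0, "user-1", "location A")
join.on_s1(5, "user-2", "location B")
matched = join.on_s2(6, "user-1", "congestion ahead")
expired = join.on_s2(12, "user-1", "road clear")
```

A two window join would keep a second deque for S2 and probe the opposite window on every arrival, so that events on either stream can produce output.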

### Approaches to Stream Processing
Two approaches:
1. Microbatching (Bulk Synchronous Processing)
    - Bulk Synchronous Processing (BSP) combines two elements: a split distribution of asynchronous work, and a synchronous barrier coming in at fixed intervals
    - split distribution of asynchronous work
        - each executor receives its chunk(s) of work and works separately until the barrier (the second element) comes in
        - a particular resource is tasked with keeping track of the progress of the computation
        - between these scheduled steps, all executors on the cluster are doing the same thing
    - synchronous barrier, coming in at fixed intervals
        - the frequency at which further rounds of processing are scheduled is dictated by a time period
        - implemented at small, fixed intervals that better approximate the real time motion of data processing
        - function-passing style – asynchronously pass safe functions to data
        - functions are passed around the scheduling process that describe processing to be done on data
        - data is already on various executors, delivered directly to resources
2. One-Record-at-a-Time Processing
    - works based on pipelining
    - analyses the whole computation as described by user-specified functions and deploys it as a pipeline using the resources of the cluster
    - flow the data through various resources, following prescribed pipeline
    - each step of computation is materialized at some place in cluster at any given point
    - Apache Flink, Storm and IBM Streams follow this style
    - latency
        - for a microbatching system : batch interval + processing time
        - for a one-record-at-a-time system : only processing time, as it reacts as soon as it meets the event of interest

Bringing Microbatch and One-record-at-time Together
- the marriage of microbatching and one-record-at-a-time processing is implemented in systems like Flink and Naiad
- Spark Structured Streaming
    - backed up by microbatching but does not reveal the batch interval at API level
    - allows for processing that is independent of fixed batch interval
    - execution model mixes microbatching with a dynamic batch interval
    - trigger the execution of next batch as early as possible
    - new batch should be started as soon as the previous one has been processed

Trade-Offs
- despite higher latency, microbatching offers significant advantages
    - able to adapt at the synchronization barrier boundaries
        - might represent the task of recovering from failure
        - give opportunity to add or remove executor nodes, possibility to grow or shrink resources depending upon cluster load
    - have easier time providing strong consistency
        - batch determinations – beginning and end of batch of data – are deterministic and recorded
        - any kind of computation can be redone and produce same results the second time
    - perform efficient optimizations
        - having the data available as a set can suggest better ways to compute on it
        - allows an efficient way of specifying programs for both batch processing and streaming data
        - even if only for mere instants, the data looks like data at rest

### Streaming in Apache Spark
- 2 APIs : Spark Streaming and Structured Streaming
- Spark Streaming
    - first streaming engine based on distributed capabilities of Spark
    - based on simple but powerful premise
        - apply Spark’s distributed computing capabilities to stream processing by transforming a continuous stream of data into discrete data collections
        - micro batching
    - uses same functional programming paradigm as Spark core but introduces a new abstraction
    - discretized streams – exposes a programming model to operate on data in stream
    - Spark RDD abstraction permits creation of programs that treat distributed data as a collection
        - allows applying data processing logic in form of transformation of distributed dataset
    - main task is to take data from stream, package it down into small batches and provide them to Spark for further processing
    - DStream abstraction
        - Spark streaming relies on the much more fundamental Spark abstraction of RDD
        - introduces a new concept : The Discretized Stream or DStream
        - DStream represents a stream in terms of discrete blocks of data that in turn are represented as RDDs over time
        - primarily an execution model that when combined with functional programming model, provides a complete framework to develop and execute streaming applications
        - DStreams as a Programming model
            - provides functional programming APIs consistent with RDD APIs and augmented with stream specific functions to deal with
                - aggregations
                - time based operations
                - stateful computations
            - in Spark Streaming, we
                - consume a stream by creating a DStream from one of the native implementations or the available connectors
                    - from SocketInputStream or a Kafka / Twitter / Kinesis connector
                - implement application logic using functions provided by the DStream API
                    - such as counting the elements or grouping the elements
                - use transformations, which are lazy : no execution happens until an output operation is called
                - execute output operations to yield out the results
            - DStream programming model consists of functional composition of transformations over the stream payload, materialized by one or more output operations and recurrently executed by Spark engine
        - DStreams as an Execution model
            - the programming model describes how data is transformed from its original form to the intended result as a series of lazy evaluations
            - Spark streaming engine is responsible for taking that chain of transformations and turning it into an actual execution plan
                - happens by receiving data from input stream(s), collecting that into batches and feeding to spark in timely manner
            - the measure of time to wait for data is the batch interval
                - central unit of time
                - short amount of time, ranging from 200 ms to 1 min, depending on the app's latency requirement
            - at each batch interval, the data corresponding to the previous interval is sent to Spark for processing while new data is received
                - process is repeated as long as Streaming job is active and healthy
            - DStream model dictates that a continuous stream of data is discretized into micro batches using a regular time interval
- Structured Streaming
    - stream processor built on top of Spark SQL abstractions
    - extends Dataset and DataFrame APIs with streaming capabilities
    - adopts schema oriented transformation model – structured part in name
    - inherits optimizations implemented in Spark SQL
    - introduced as an experimental API in Spark 2.0 (2016) and declared production-ready with the Spark 2.2 release (2017)
    - still evolving fast with each new version of Spark
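
The DStream idea of lazy transformations over micro-batches, materialized by an output operation, can be mimicked in plain Python. `ToyDStream` below is a hypothetical teaching class, not Spark's API, and it assumes the stream has already been discretized into a list of batches.

```python
class ToyDStream:
    """A stream as a sequence of micro-batches plus a chain of pending transformations."""

    def __init__(self, batches, transforms=()):
        self._batches = batches
        self._transforms = tuple(transforms)

    def map(self, fn):
        # Lazy: only records the transformation, nothing runs yet.
        step = lambda batch: [fn(x) for x in batch]
        return ToyDStream(self._batches, self._transforms + (step,))

    def filter(self, predicate):
        step = lambda batch: [x for x in batch if predicate(x)]
        return ToyDStream(self._batches, self._transforms + (step,))

    def foreach_batch(self, sink):
        # Output operation: the recorded chain is executed once per micro-batch.
        for batch in self._batches:
            for step in self._transforms:
                batch = step(batch)
            sink(batch)


calls = []


def times_ten(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * 10


stream = ToyDStream([[1, 2, 3], [4, 5]]).map(times_ten).filter(lambda x: x > 10)
calls_before_output = len(calls)  # still 0: nothing has executed

results = []
stream.foreach_batch(results.append)
```

Each inner list plays the role of the data received during one batch interval; the same chain of transformations is applied to every batch in turn.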

### Sketching Algorithms
- Sketching algorithms are used to summarize data streams in a small amount of space 
- Sketching algorithms are used to estimate various statistics of the data stream, such as count, sum, average, variance, etc.
- Examples of sketching algorithms include Count-Min Sketch, HyperLogLog, and Bloom Filter

#### Bloom Filter
- Bloom filter is a probabilistic data structure used to test whether an element is a member of a set ([tutorial](https://llimllib.github.io/bloomfilter-tutorial/))
- Designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It answers either "possibly in set" or "definitely not in set"; it can never say "definitely in set".
- Can't remove elements
- hash functions used in a Bloom filter should be independent and uniformly distributed
    - like murmur, xxhash, fnv, HashMix, etc.
- In a Bloom filter with k hashes, m bits in the filter, and n elements that have been inserted:
    - The probability of a false positive is $$p = \left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^k \approx (1 - e^{-kn/m})^k$$
- So, to choose the size of a bloom filter, we:
    - Choose a ballpark value for $n$
    - Choose a value for $m$
    - Calculate the optimal value of $k$ : $k = \frac{m}{n} \ln 2$
    - Calculate the error rate for our chosen values of $m$ and $k$, and if it's too high, increase $m$ and recalculate $k$.
- Both insertion and lookup are $O(k)$
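
The sizing procedure and the filter itself can be sketched as below. Two assumptions are made for brevity: the optimal number of hashes is taken as $k = (m/n) \ln 2$ rounded to an integer, and the $k$ positions are derived from a single SHA-256 digest by double hashing instead of $k$ truly independent hash functions. One byte per bit is used to keep the code short.

```python
import hashlib
import math


def optimal_k(m, n):
    """Number of hash functions that minimises false positives for m bits, n items."""
    return max(1, round((m / n) * math.log(2)))


def false_positive_rate(m, n, k):
    """Approximate false positive probability (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


class BloomFilter:
    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity rather than compactness

    def _positions(self, item):
        # Double hashing: position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item):
        # True means "possibly in set"; False means "definitely not in set".
        return all(self.bits[position] for position in self._positions(item))


m, n = 1000, 100
k = optimal_k(m, n)
bloom = BloomFilter(m, k)
for i in range(n):
    bloom.add(f"user-{i}")
```

Every inserted item is always reported as present, since its bits are never cleared; only items that were never inserted can be misreported, at roughly the rate given by `false_positive_rate(m, n, k)`.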

#### HyperLogLog
- HyperLogLog is a probabilistic data structure used to estimate the cardinality of a set, i.e. the number of distinct elements in a multiset
- It is memory-efficient : it keeps only a small fixed-size array of registers instead of the elements themselves
- It uses a hash function to map each element to one register of that array
- It uses a technique called "log-log" counting : the longest run of leading zeros seen in the hashed values indicates how many distinct elements have been hashed
- Algorithm:
    - Identify a unique ID for the data value
    - Pass the ID through a hash function, it will result into a hashed value
    - Hashed value is converted into a binary representation
    - Need to determine the place which needs to be updated and with what value.
        - Take the 6 least significant bits of the binary number and convert them to a decimal value. That gives the position (one of $2^6 = 64$ registers) where the value needs to be updated
        - Count the number of leading zeros in the binary number, add one to it, and store that number in the position identified earlier if it is larger than the value already stored there
    - Repeat the process for all the data values
    - To estimate the cardinality, use the harmonic mean of $2^{M_j}$ over the values $M_j$ stored in the array, scaled by the number of registers and a bias-correction constant
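
A compact sketch of the steps above, under a few assumptions: the hash is the first 64 bits of SHA-256, leading zeros are counted in the 58 bits that remain after the 6 index bits are removed, the constant 0.709 is the standard bias correction for 64 registers, and the small-range correction (linear counting) from the original HyperLogLog paper is applied when many registers are still empty.

```python
import hashlib
import math


class HyperLogLog:
    INDEX_BITS = 6   # 2^6 = 64 registers
    HASH_BITS = 64

    def __init__(self):
        self.m = 1 << self.INDEX_BITS
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:8], "big")
        # The 6 least significant bits select the register.
        index = hashed & (self.m - 1)
        # Leading zeros of the remaining bits, plus one.
        rest = hashed >> self.INDEX_BITS
        rest_bits = self.HASH_BITS - self.INDEX_BITS
        rank = rest_bits - rest.bit_length() + 1
        # Keep the largest rank ever seen for this register.
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.709  # bias correction for m = 64
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * self.m and empty:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / empty)
        return raw


hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")
first_estimate = hll.estimate()

# Re-adding the same values cannot change any register.
for i in range(10_000):
    hll.add(f"user-{i}")
```

With 64 registers the expected relative error is about $1.04/\sqrt{64} \approx 13\%$, so the estimate is only roughly close to the true count of 10,000; more registers trade memory for accuracy.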

#### Count-Min Sketch
- is a probabilistic data structure used to estimate the frequency of elements in a data stream
- uses several hash functions to map each element to counters in a fixed-size table
- create a table:
    - number of rows = number of different values given by each hash function
    - number of columns = number of hash functions (each column holds the counters for one hash function)
- to update:
    - loop through each element in the stream, hash it with every hash function, and increment the counter at the hashed index in that hash function's column
- to query, hash the element the same way and use the minimum of its counters as the estimated frequency; collisions can only inflate a counter, so the estimate never undercounts
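
A minimal sketch of the structure follows. Note that it stores the table the other way round from the description above (one row per hash function, `width` counters per row), which is only a layout choice. Salting SHA-256 with the row number is an assumption used here to obtain several hash functions.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width  # counters per hash function
        self.depth = depth  # number of hash functions
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only ever add to a counter, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


sketch = CountMinSketch(width=64, depth=4)
for word in ["a", "b", "a", "c", "a", "b"]:
    sketch.add(word)
```

`sketch.estimate("a")` is at least 3, the true count; it is exactly 3 unless "a" collides with another element in every one of the four rows.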

#### Reservoir Sampling
- Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number
- Algorithm:
    - Initialize an array of size k to store the sample
    - Fill the array with the first k elements from the stream
    - For each subsequent element in the stream, generate a random integer j with 0 <= j < i, where i is the number of elements seen so far (including the current one)
    - If j is less than k:
        - Replace a random element of the sample with the new element (replacing the element at index j works, since j is then uniform over the k slots)
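
The steps can be sketched as follows, using the common variant in which the random index itself picks the slot to overwrite. The optional `rng` parameter is an addition for reproducibility and is not part of the algorithm.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for seen, item in enumerate(stream, start=1):
        if seen <= k:
            # Fill the reservoir with the first k elements.
            sample.append(item)
        else:
            # The new item is kept with probability k / seen.
            j = rng.randrange(seen)  # uniform integer in [0, seen)
            if j < k:
                sample[j] = item  # evict the element in a uniformly random slot
    return sample


picked = reservoir_sample(range(1000), 10, random.Random(42))
```

The stream is read once and only `k` items are held in memory at any time; if the stream has fewer than `k` items, all of them are returned.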

#### Hot list detection
- Hot list detection is a technique used to identify the most frequently occurring items in a data stream
- Algorithm:
    - input $S$ : a sequence of examples
    - for each example $x$ in $S$:
        - if $x$ is in the hot list:
            - increment the count of $x$
        - else 
            - if there is an element $y$ in the hot list with count 0:
                - replace $y$ with $x$
                - increment the count of $x$
            - else
                - decrement the count of each element in the hot list
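
The pseudocode translates almost line by line into Python. The sketch below assumes a fixed hot-list `capacity` and treats a not-yet-used slot like a slot with count 0; both are assumptions, since the pseudocode does not say how the list is initialised.

```python
def hot_list(stream, capacity):
    """Track candidate frequent items using at most `capacity` counters."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < capacity:
            # Free slot available: start tracking x.
            counts[x] = 1
        else:
            zero = next((y for y, c in counts.items() if c == 0), None)
            if zero is not None:
                # Replace an element whose count has dropped to 0.
                del counts[zero]
                counts[x] = 1
            else:
                # No room: every tracked element pays for the newcomer.
                for y in counts:
                    counts[y] -= 1
    return counts


hot = hot_list(list("abacadab"), capacity=2)
```

The stored counts are lower bounds, not exact frequencies: here "a" appears 4 times in the stream but ends with a count of 2, because it was decremented twice when untracked elements arrived.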

#### Decaying Window
- Decaying window is a technique used to give more weight to recent data in a data stream
- Here, the weight of the data decreases exponentially as it gets older
- Idea:
    - Attach weights to elements in sliding window
    - Recent elements receive higher weights, while older elements receive decaying weights
    - For a new element:
        - First reduce the weight of all the existing elements by a constant factor
        - Assign the new element a specific weight
    - The aggregate sum of the exponentially decaying weights can be calculated as follows
    - Let the stream be $x_1, x_2, x_3, \ldots, x_t$, with $x_t$ the most recent element:
        - $S_t = \sum_{i=0}^{t-1} \lambda^{i} \cdot x_{t-i}$
        - where $0 < \lambda < 1$ is the decay factor
        - equivalently, $S_t = \lambda \cdot S_{t-1} + x_t$
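
The two steps for a new element (scale everything already seen by a constant factor, then add the new element) mean the decayed sum can be maintained in constant space. The sketch below assumes the newest element gets weight 1 and uses `decay` for $\lambda$; it also includes a direct evaluation of the weighted sum to check the incremental version against.

```python
def decayed_sums(stream, decay):
    """Running exponentially decayed sum, updated in O(1) per element."""
    total = 0.0
    history = []
    for x in stream:
        # Reduce the weight of everything seen so far, then add the new element.
        total = decay * total + x
        history.append(total)
    return history


def decayed_sum_direct(stream, decay):
    """Same quantity computed from the definition: newest element has weight decay^0."""
    t = len(stream)
    return sum(decay ** i * stream[t - 1 - i] for i in range(t))


running = decayed_sums([1, 2, 3], decay=0.5)
```

For the stream 1, 2, 3 with a decay of 0.5, the final value is $3 + 0.5 \cdot 2 + 0.25 \cdot 1 = 4.25$, and the oldest element contributes the least.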