```{contents}
```


## Batch Ingestion vs. Streaming Ingestion (Tutor Explanation)

When building machine learning or generative AI systems, especially Retrieval-Augmented Generation (RAG) pipelines, we need to **move, process, and prepare data** so that models and vector stores stay up to date. The two main approaches are **Batch Ingestion** and **Streaming Ingestion**. Which one fits depends on whether the system needs **periodic updates** or **real-time responsiveness**.

---

### Batch Ingestion (Offline / Periodic Processing)

Batch ingestion means that data is **collected and processed in large chunks** at scheduled intervals, such as **hourly, daily, or weekly**.

#### **How it Works**

* Data accumulates over time.
* At a scheduled time, a job processes this data in bulk.
* The output (features, embeddings, predictions) is **stored** for later use.

#### **Learning and Model Behavior**

* Models in batch mode are usually **trained on the full dataset at once**.
* When new data arrives, the model is typically **retrained from scratch** or retrained on a large historical dataset.
* This makes batch systems **simple but slow to adapt** to new information.
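
The retrain-from-scratch pattern can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production training loop: the "model" is a one-parameter least-squares fit `y ≈ w * x`, and the names `fit_from_scratch`, `history`, and `new_data` (along with the numbers) are hypothetical.

```python
def fit_from_scratch(samples):
    """Fit y ≈ w * x by least squares over the FULL dataset (closed form)."""
    sum_xy = sum(x * y for x, y in samples)
    sum_xx = sum(x * x for x, _ in samples)
    return sum_xy / sum_xx if sum_xx else 0.0

# Historical data the previous model was trained on (made-up numbers).
history = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_old = fit_from_scratch(history)

# New data arrives; a batch system refits on history + new data together.
new_data = [(4.0, 12.0)]
w_new = fit_from_scratch(history + new_data)

print(f"before retrain: w = {w_old:.3f}")
print(f"after retrain:  w = {w_new:.3f}")
```

Every retrain touches the whole dataset again, which is why the cost grows with the amount of history and why the model only changes when the next scheduled run completes.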

#### **Batch Predictions (Offline Prediction)**

* Predictions are computed **ahead of time**, not when the user requests them.
* These predictions are stored (e.g., in a DB or cache) and **reused** later.
* This improves efficiency and **reduces compute costs**.
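
A rough sketch of the precompute-then-look-up pattern is shown here. The `score` function is a hypothetical stand-in for an expensive model call, and a plain dictionary plays the role of the DB or cache. All names are assumptions made for the example.

```python
def score(user_id):
    """Hypothetical stand-in for an expensive model call."""
    return (sum(ord(c) for c in user_id) % 100) / 100

def run_batch_predictions(user_ids):
    """Offline job: score every known user once and store the results."""
    return {user_id: score(user_id) for user_id in user_ids}

def serve(user_id, store, default=0.0):
    """Request path: a cheap lookup, with no model call."""
    return store.get(user_id, default)

# Filled by the scheduled job (e.g. nightly), read many times afterwards.
prediction_store = run_batch_predictions(["alice", "bob"])

print(serve("alice", prediction_store))
print(serve("carol", prediction_store))  # unseen user falls back to the default
```

The trade-off is visible in the last line: a user who appears after the batch run has no stored prediction until the next run.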

#### **Common Use Cases**

| Example                             | Why Batch Works                                   |
| ----------------------------------- | ------------------------------------------------- |
| Nightly recommender system updates  | User preferences don’t need updating every second |
| Analytics dashboards and BI reports | Data is summarized at regular intervals           |
| Knowledge base re-indexing for RAG  | Documents are not changing constantly             |

---

### Streaming Ingestion (Real-Time Processing)

Streaming ingestion means data is processed **immediately** as it arrives. The system reacts to events **in real time**.

#### **How it Works**

* Data flows continuously through systems like **Kafka** or **Kinesis**.
* The pipeline processes each record and updates downstream systems within milliseconds to seconds of its arrival.
* Results are available with **very low latency**.

#### **Learning and Model Behavior**

* Supports **online or incremental learning**:

  * The model updates itself using one example or a small mini-batch of new data at a time, without retraining on the full dataset.
* Allows the system to **adapt quickly** to new patterns or sudden changes.
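
Incremental learning can be sketched with a single gradient step per mini-batch. This is an illustrative toy, assuming a one-parameter model `y ≈ w * x` trained with plain SGD on mean squared error. The function name `sgd_update`, the learning rate, and the data are all assumptions chosen for the example.

```python
def sgd_update(w, mini_batch, lr=0.05):
    """One incremental step for the model y ≈ w * x on a small mini-batch."""
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in mini_batch) / len(mini_batch)
    return w - lr * grad

w = 0.0  # start from an untrained model

# Mini-batches arrive over time; the underlying pattern here is y = 2x.
stream = [[(1.0, 2.0), (2.0, 4.0)]] * 50
for mini_batch in stream:
    w = sgd_update(w, mini_batch)
w_after_first = w
print(f"after y = 2x data: w = {w_after_first:.3f}")

# The pattern suddenly shifts to y = 3x; the same update rule adapts.
shifted = [[(1.0, 3.0), (2.0, 6.0)]] * 50
for mini_batch in shifted:
    w = sgd_update(w, mini_batch)
print(f"after y = 3x data: w = {w:.3f}")
```

No historical data is revisited: each update only needs the current weight and the newest mini-batch, which is what lets the model follow the shift without a full retrain.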

#### **Online Predictions (Real-Time Inference)**

* Predictions are generated **when the user requests them**.
* The response must be **fast**: typically tens to a few hundred milliseconds end to end, depending on the application.
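
Unlike a precomputed lookup, an online prediction runs the model inside the request path, so model latency becomes part of the response time. The sketch below measures that latency. `predict` is a hypothetical stand-in for a real model call, and `handle_request` is a name introduced only for this example.

```python
import time

def predict(features):
    """Hypothetical stand-in for a model's forward pass."""
    return sum(features) / len(features)

def handle_request(features):
    """Compute the prediction at request time and report how long it took."""
    start = time.perf_counter()
    prediction = predict(features)
    latency_ms = (time.perf_counter() - start) * 1000
    return prediction, latency_ms

prediction, latency_ms = handle_request([0.2, 0.4, 0.9])
print(f"prediction={prediction:.2f} latency={latency_ms:.3f} ms")
```

In a real service the measured time would also include feature lookup and network overhead, so the model itself usually gets only part of the latency budget.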

#### **Common Use Cases**

| Example                                | Why Streaming is Needed          |
| -------------------------------------- | -------------------------------- |
| Fraud detection (e.g., PayPal, Stripe) | Decisions must be made instantly |
| Real-time chatbots and LLM assistants  | Users expect immediate replies   |
| TikTok-style content recommendation    | User preferences change quickly  |

---

### Comparison Summary

| Aspect            | Batch Ingestion                    | Streaming Ingestion                 |
| ----------------- | ---------------------------------- | ----------------------------------- |
| Processing Timing | Scheduled (periodic)               | Continuous (real-time)              |
| Data Freshness    | As old as the last completed run   | Near real-time (seconds or less)    |
| Model Updating    | Re-training in large chunks        | Small, incremental updates          |
| Prediction Mode   | Precomputed and stored             | Generated on demand                 |
| System Complexity | Simpler to implement               | More complex and resource-intensive |
| Ideal Use Cases   | Offline analytics, nightly updates | Real-time apps, dynamic behavior    |
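
The data freshness row can be made concrete with a little arithmetic. The sketch below estimates worst-case staleness for a batch pipeline under one simplifying assumption: each run only sees data that arrived before the run started. The function name and the example durations are illustrative, not measurements.

```python
from datetime import timedelta

def worst_case_staleness(schedule_interval, job_runtime):
    """Longest time between a record arriving and it appearing in the output.

    Assumes each run only sees data that arrived before the run started, so a
    record landing just after a run begins waits one full interval for the
    next run, plus that run's duration.
    """
    return schedule_interval + job_runtime

nightly = worst_case_staleness(timedelta(hours=24), timedelta(minutes=45))
hourly = worst_case_staleness(timedelta(hours=1), timedelta(minutes=5))

print("nightly batch:", nightly)
print("hourly batch: ", hourly)
```

Shortening the schedule interval narrows the gap, but each run still pays the job's fixed startup and processing cost, which is one reason teams eventually move to streaming.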

---

### How This Relates to RAG (Retrieval-Augmented Generation)

| RAG Component        | Batch Pipeline                       | Streaming Pipeline                     |
| -------------------- | ------------------------------------ | -------------------------------------- |
| Embedding Documents  | Periodically re-embed entire dataset | Embed new data instantly as it appears |
| Vector Store Updates | Scheduled refreshes                  | Continuous updates                     |
| Use Case Fit         | Corporate knowledge bases            | Live customer feedback or chat logs    |
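
The two columns of the table can be sketched with a toy in-memory index. Everything here is an assumption for illustration: `embed` is a deterministic stand-in for a real embedding model, and a dictionary stands in for a vector store.

```python
import hashlib

def embed(text, dim=4):
    """Hypothetical stand-in for an embedding model: deterministic toy vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]

def batch_reindex(documents):
    """Batch: rebuild the whole index from the full document set."""
    return {doc_id: embed(text) for doc_id, text in documents.items()}

def stream_upsert(index, doc_id, text):
    """Streaming: embed one new or changed document and update it in place."""
    index[doc_id] = embed(text)

documents = {"doc1": "refund policy", "doc2": "shipping times"}
index = batch_reindex(documents)  # scheduled refresh

stream_upsert(index, "doc3", "new return form")   # a new document arrives
stream_upsert(index, "doc1", "refund policy v2")  # an existing document changes

print(sorted(index))
```

The batch path re-embeds every document on each run, while the streaming path does work proportional to what changed. With a paid embedding API that difference shows up directly in cost.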

---

**Key Insight**

Many companies **start with batch pipelines because they are easier**.
As the system grows and the need for freshness increases, they **add or transition to streaming**.


### **Demonstration**

Below is a **simple, practical demonstration** of both **Batch** and **Streaming** ingestion using **plain Python** so you can see the difference clearly.
(No extra infrastructure required to understand the concept.)

---

#### Batch Ingestion (Periodic Bulk Processing)

Assume we have a folder where logs accumulate throughout the day.
At **midnight**, we run a batch job to process all logs at once.

##### **Example Folder Structure**

```
logs/
   log_01.csv
   log_02.csv
   log_03.csv
```

##### **Batch Ingestion Demo (Python)**

```python
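import glob
import os

import pandas as pd

def batch_ingest_logs():
    files = sorted(glob.glob("logs/*.csv"))  # all CSV files, in a stable order
    if not files:
        # pd.concat raises an error on an empty list, so exit early
        print("No log files found. Nothing to ingest.")
        return

    # Read every file that accumulated since the last run
    batch_data = [pd.read_csv(file) for file in files]

    # Combine into one table; ignore_index avoids duplicate row labels
    combined = pd.concat(batch_data, ignore_index=True)

    os.makedirs("processed", exist_ok=True)  # to_csv does not create folders
    combined.to_csv("processed/batch_output.csv", index=False)
    print("Batch ingestion finished. Combined data saved.")

# This is typically scheduled using Airflow / Cron
batch_ingest_logs()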
```

**What happened:**

* We waited until enough data accumulated.
* Processed everything together.
* Stored results for later use.

**This is Batch.**

---

#### Streaming Ingestion (Real-Time Processing)

Here, data arrives **continuously**, so we process each record **immediately**.

We’ll simulate streaming by reading one event at a time.

##### Streaming Data Example

```python
incoming_stream = [
  {"user": "A", "action": "login"},
  {"user": "B", "action": "purchase"},
  {"user": "C", "action": "logout"}
]
```

##### Streaming Ingestion Demo (Python)

```python
import time

def process_event(event):
    print(f"Processing event: {event}")

def streaming_ingest(events):
    for event in events:
        process_event(event)
        time.sleep(1)  # simulate real-time arrival

incoming_stream = [
  {"user": "A", "action": "login"},
  {"user": "B", "action": "purchase"},
  {"user": "C", "action": "logout"}
]

streaming_ingest(incoming_stream)
```

**What happens:**

* Each event is processed as soon as it arrives.
* No waiting for a full batch.
* Output is immediate.

**This is Streaming.**

---

**Key Difference Shown Clearly**

| Aspect                 | Batch Example                            | Streaming Example                               |
| ---------------------- | ---------------------------------------- | ----------------------------------------------- |
| When Data is Processed | After accumulation (e.g., nightly job)   | Immediately as events occur                     |
| Code Pattern           | Read many files → process → write output | Read one event → process → repeat               |
| Latency                | Higher                                   | Very low                                        |
| Use Case               | ETL, offline reporting                   | Chatbots, fraud detection, real-time dashboards |
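
The two code patterns in the table can also be compared on the same task. The sketch below counts actions both ways, using hypothetical helper names `batch_count` and `streaming_count`. The streaming version keeps running state and yields an updated result after every event, while the batch version produces one result at the end.

```python
from collections import Counter

events = [
    {"user": "A", "action": "login"},
    {"user": "B", "action": "purchase"},
    {"user": "C", "action": "login"},
]

def batch_count(all_events):
    """Batch: wait for the full set, then aggregate once."""
    return Counter(event["action"] for event in all_events)

def streaming_count(event_iter):
    """Streaming: update running state per event and emit it each time."""
    running = Counter()
    for event in event_iter:
        running[event["action"]] += 1
        yield dict(running)  # a result is available after every event

snapshots = list(streaming_count(iter(events)))
for snapshot in snapshots:
    print("running:", snapshot)

print("batch:  ", dict(batch_count(events)))
```

Both approaches end at the same totals. The difference is that the streaming version has a usable answer after the first event, at the price of keeping state alive between events.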

---

**How This Maps to RAG (Vector Database Ingestion)**

| Step              | Batch RAG                              | Streaming RAG                                |
| ----------------- | -------------------------------------- | -------------------------------------------- |
| Embedding Docs    | Run nightly to embed all new documents | Embed each document as soon as it is created |
| Vector DB updates | Periodic bulk indexing                 | Continuous incremental updates               |
| Usage             | Internal knowledge bases               | Live chat, customer support feeds            |



### **Batch Pipeline (Airflow + Spark)**

```
Data Source (CSV / DB / S3)
        ↓
Airflow Scheduler (runs daily)
        ↓
Spark Batch Job (ETL / Embeddings / Aggregations)
        ↓
Data Lake / Warehouse / Vector DB
```

### **Streaming Pipeline (Kafka + Spark Structured Streaming)**

```
Producers (apps / microservices / sensors)
        ↓
Kafka Topic (real-time event buffer)
        ↓
Spark Structured Streaming Job (transform & write)
        ↓
Target System (DB / Dashboard / Vector Store)
```

---

#### Batch Ingestion Demo using Airflow + Spark

##### **Airflow DAG (scheduled daily batch job)**

Save as: `dags/batch_ingest_spark.py`

```python
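from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "start_date": datetime(2025, 1, 1),
    "retries": 1,  # retry a failed task once
}

with DAG(
    dag_id="batch_ingestion_spark",
    # Airflow 2.4+ prefers `schedule="@daily"`; `schedule_interval` is the older name
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,  # do not backfill runs between start_date and today
):
    # Single task: submit the Spark batch job at /opt/jobs/batch_etl.py
    run_spark_job = BashOperator(
        task_id="run_spark_batch_job",
        bash_command="spark-submit /opt/jobs/batch_etl.py",
    )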
```

##### **Spark Batch ETL Job**

Save as: `/opt/jobs/batch_etl.py`

```python
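from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchETL").getOrCreate()

# Read every CSV file that has accumulated in the raw folder
df = spark.read.csv("/data/raw/*.csv", header=True, inferSchema=True)

# Drop rows that contain any null value
df_clean = df.dropna()

# Replace the previous run's output with the fresh result
df_clean.write.mode("overwrite").parquet("/data/processed/batch_output/")

spark.stop()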
```

**What happens:**

* Airflow **triggers the Spark job every day**
* Spark reads **all accumulated files**
* Cleans and writes output once per schedule
  → **This is Batch**

---

#### Streaming Ingestion Demo using Kafka + Spark Structured Streaming

##### Kafka Setup

Create Kafka topic:

```bash
kafka-topics.sh --create --topic user_events --bootstrap-server localhost:9092
```

##### Streaming Producer (simulating real-time events)

```python
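import json
import time

from kafka import KafkaProducer  # provided by the kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each event dict to UTF-8 encoded JSON
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

events = [
    {"user": "A", "action": "login"},
    {"user": "B", "action": "purchase"},
    {"user": "C", "action": "logout"}
]

for event in events:
    producer.send("user_events", event)  # asynchronous: queued for delivery
    print("Sent:", event)
    time.sleep(1)  # simulate real-time arrival

# send() only queues messages; flush so none are lost when the script exits
producer.flush()
producer.close()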

##### Spark Streaming Consumer

```python
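from pyspark.sql import SparkSession

# Requires the Spark Kafka connector (spark-sql-kafka-0-10) on the classpath,
# for example supplied through `spark-submit --packages`
spark = SparkSession.builder.appName("StreamingETL").getOrCreate()

# Subscribe to the topic as an unbounded streaming source
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user_events")
    .load())

# Kafka delivers the payload as bytes; cast it to a JSON string
json_df = df.selectExpr("CAST(value AS STRING)")

# Print newly arrived rows to the console
query = (json_df
    .writeStream
    .format("console")
    .outputMode("append")
    .start())

query.awaitTermination()  # keep the job running until it is stopped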
```

**What happens:**

* Events are **pushed to Kafka one by one**
* Spark Structured Streaming **reads and processes them within moments of arrival**
  → **This is Streaming**
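
By default, Spark Structured Streaming groups incoming records into small micro-batches rather than handling them strictly one at a time. The idea can be sketched in plain Python. This is a simplified illustration, not how Spark implements it: the helper `micro_batches` is hypothetical and flushes on size only, whereas real engines typically flush on a time trigger.

```python
def micro_batches(events, max_size):
    """Group an event stream into small batches of at most max_size events."""
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) == max_size:
            yield buffer
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield buffer

# A generator stands in for an unbounded source such as a Kafka topic.
stream = ({"user": f"U{i}", "action": "click"} for i in range(5))

batches = list(micro_batches(stream, max_size=2))
for number, batch in enumerate(batches):
    print(f"micro-batch {number}: {len(batch)} event(s)")
```

Micro-batching sits between the two extremes: smaller batches lower latency, larger ones amortize per-batch overhead.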

---

**Comparison in This Demo**

| Feature      | Batch (Airflow + Spark) | Streaming (Kafka + Spark Structured Streaming) |
| ------------ | ----------------------- | ---------------------------------------------- |
| Data Arrival | Accumulated             | Continuous                                     |
| Trigger      | Airflow schedule        | Event-driven                                   |
| Spark Mode   | Batch job               | Structured Streaming                           |
| Output       | Periodic data refresh   | Real-time updates                              |

---

**Optional Upgrade: RAG Integration**

| Step           | Batch RAG                                             | Streaming RAG                             |
| -------------- | ----------------------------------------------------- | ----------------------------------------- |
| Embedding Docs | Trigger `spark-submit generate_embeddings.py` nightly | Embed documents as soon as they hit Kafka |
| Vector DB      | Bulk upsert into FAISS / Pinecone                     | Incremental upsert                        |
