Build a small, end-to-end demo: logs → Kafka → PySpark clean → embeddings → Chroma → RAG with Ollama. Minimal, local, reproducible.

# 0) Folder layout

```
genai-incidents/
├─ docker-compose.yml
├─ producer/producer.py
├─ spark/stream_clean_from_kafka.py
├─ ingest/embed_to_chroma.py
├─ rag/query_rag.py
└─ airflow/dags/reindex_daily.py
```

# 1) Infra: Kafka (Zookeeper mode, simplest)

**docker-compose.yml**

```yaml
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
    ports: ["2181:2181"]

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```

Run:

```
docker compose up -d
```
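
Before starting the producer, it can help to confirm that the broker port is reachable. A minimal sketch, assuming the `localhost:9092` mapping from the compose file above; `wait_for_port` is a hypothetical helper, and an open port only shows that the listener is up, not that the broker has finished starting:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError if nothing is listening yet
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Call `wait_for_port("localhost", 9092)` after `docker compose up -d`; if it returns `False`, check `docker compose ps` and the Kafka container logs.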

# 2) Stream in sample syslog lines → Kafka

**producer/producer.py**

```python
from kafka import KafkaProducer
import json, random, time, datetime

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8")
)
services = ["nginx","auth","payment","orders","redis","postgres"]
levels = ["INFO","WARN","ERROR"]

def line():
    svc = random.choice(services)
    lvl = random.choices(levels, weights=[0.7,0.2,0.1])[0]
    msg = {
        "INFO":  "service healthy",
        "WARN":  "slow response detected",
        "ERROR": "connection timeout to dependency"
    }[lvl]
    return {
        # utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "service": svc,
        "level": lvl,
        "message": msg
    }

while True:
    rec = line()
    producer.send("logs", value=rec)
    print("→", rec)
    time.sleep(0.5)
```

Install and run:

```
pip install kafka-python
python producer/producer.py
```

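Kafka's default `auto.create.topics.enable=true` creates the `logs` topic on the first write. If that setting is disabled on your broker, create the topic explicitly before starting the producer: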

```
docker exec -it $(docker ps -qf name=kafka) \
  kafka-topics --create --topic logs --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1
```

# 3) PySpark Structured Streaming: clean + land to parquet

**spark/stream_clean_from_kafka.py**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, lower, regexp_replace, trim
from pyspark.sql.types import StructType, StructField, StringType

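# The Kafka source is not bundled with PySpark. The connector version (3.5.1, Scala 2.12)
# must match the installed PySpark version (pin with `pip install pyspark==3.5.1`).
spark = (SparkSession.builder
    .appName("clean-logs")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1")
    .getOrCreate())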
spark.sparkContext.setLogLevel("WARN")

schema = StructType([
    StructField("ts", StringType()),
    StructField("service", StringType()),
    StructField("level", StringType()),
    StructField("message", StringType()),
])

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers","localhost:9092")
    .option("subscribe","logs")
    .option("startingOffsets","latest")
    .load())

parsed = raw.select(from_json(col("value").cast("string"), schema).alias("j")).select("j.*")

clean = (parsed
    .withColumn("service", lower(trim(col("service"))))
    .withColumn("level", lower(trim(col("level"))))
    # Lowercase first: the character class only keeps a-z, so uppercase letters would otherwise be blanked out
    .withColumn("message", regexp_replace(lower(col("message")), r"[^a-z0-9 _\-:]", " "))
    .filter("message is not null and length(message) > 3"))

# Write cleaned stream to small parquet files for downstream embedding
query = (clean.writeStream
    .format("parquet")
    .option("path","landing/clean_parquet")
    .option("checkpointLocation","landing/_chk_clean")
    .outputMode("append")
    .start())

query.awaitTermination()
```

Run:

```
pip install pyspark==3.5.1
python spark/stream_clean_from_kafka.py
```

This continuously writes cleaned records into `landing/clean_parquet/`.
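
To see what the cleaning step does to a single record without starting Spark, the rules can be mirrored in plain Python. This is only an approximation for illustration: `clean_record` is a hypothetical helper, it assumes the intent is to lowercase the message before blanking out disallowed characters, and Python's `strip()` removes more whitespace kinds than Spark's `trim`:

```python
import re
from typing import Optional

# Same character class as the Spark job: anything outside a-z, 0-9, space, _, -, : becomes a space
_DISALLOWED = re.compile(r"[^a-z0-9 _\-:]")

def _norm(value: Optional[str]) -> Optional[str]:
    """Trim and lowercase, passing None through like Spark does with nulls."""
    return None if value is None else value.strip().lower()

def clean_record(rec: dict) -> Optional[dict]:
    """Apply the cleaning rules to one record; return None if the filter would drop it."""
    message = rec.get("message")
    if message is None:
        return None
    message = _DISALLOWED.sub(" ", message.lower())
    if len(message) <= 3:
        return None
    return {
        "ts": rec.get("ts"),
        "service": _norm(rec.get("service")),
        "level": _norm(rec.get("level")),
        "message": message,
    }
```

For example, a record with `service=" Nginx "` and `message="Connection timeout!"` comes out with `service="nginx"` and the `!` replaced by a space, while a message of three characters or fewer is dropped.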

# 4) Embed new records and upsert into Chroma

**ingest/embed_to_chroma.py**

```python
import time, uuid, os
from sentence_transformers import SentenceTransformer
import chromadb
from pyspark.sql import SparkSession

# Reuse Spark just to read Parquet easily (local mode)
spark = SparkSession.builder.appName("read-clean").getOrCreate()

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Persist to disk: chromadb.Client() is in-memory, so a separate query process
# would not see this collection. The query script must open the same path.
client = chromadb.PersistentClient(path="chroma_db")
col = client.get_or_create_collection("incidents")

# Simple file cursor to avoid reprocessing
STATE = ".embed_state"
last_count = int(open(STATE).read().strip()) if os.path.exists(STATE) else 0

def count_rows(path):
    df = spark.read.parquet(path)
    return df.count(), df

while True:
    if not os.path.exists("landing/clean_parquet"):
        time.sleep(2); continue

    total, df = count_rows("landing/clean_parquet")
    if total <= last_count:
        time.sleep(2); continue

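    # Take the newest rows by timestamp; limit() alone returns arbitrary rows, not the new ones.
    # Still approximate (assumes ts only increases), but good enough for a demo.
    new_rows = df.orderBy("ts").tail(total - last_count)
    rows = [r.asDict() for r in new_rows]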

    texts = [f"[{r['ts']}] {r['service']} {r['level']} :: {r['message']}" for r in rows]
    embs  = model.encode(texts, show_progress_bar=False).tolist()
    ids   = [str(uuid.uuid4()) for _ in texts]

    col.add(ids=ids, embeddings=embs, metadatas=rows, documents=texts)
    last_count = total
    with open(STATE,"w") as f: f.write(str(last_count))
    print(f"Indexed {len(texts)} new records. Total seen: {last_count}")
    time.sleep(2)
```

Install and run:

```
pip install sentence-transformers chromadb
python ingest/embed_to_chroma.py
```
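
The row-count cursor above is fragile because Parquet reads carry no ordering guarantee. One alternative is to remember which part files have already been embedded. The sketch below is an assumption-laden illustration, not part of the script above: the state file name `.embed_files_state` and all three helpers are hypothetical, and it treats every `part-*.parquet` file in the landing directory as committed, whereas Spark's `_spark_metadata` log is the authoritative list:

```python
import json
import os
from glob import glob

STATE_FILE = ".embed_files_state"  # hypothetical name for this sketch

def load_seen(state_file: str = STATE_FILE) -> set:
    """Return the set of Parquet paths that were already indexed."""
    if not os.path.exists(state_file):
        return set()
    with open(state_file) as f:
        return set(json.load(f))

def save_seen(seen: set, state_file: str = STATE_FILE) -> None:
    """Write the cursor atomically so a crash cannot leave a half-written file."""
    tmp = state_file + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(seen), f)
    os.replace(tmp, state_file)

def new_parquet_files(landing_dir: str, seen: set) -> list:
    """List part files in landing_dir that are not in the cursor yet, in sorted order."""
    files = sorted(glob(os.path.join(landing_dir, "part-*.parquet")))
    return [p for p in files if p not in seen]
```

In the polling loop, the new paths could be passed to `spark.read.parquet(*paths)`, embedded, and then added to the cursor with `save_seen` only after the Chroma write succeeds.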

# 5) Query via RAG with Ollama or HF

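First run an LLM locally. Example with **Ollama**. `ollama pull` talks to the running server, so start `ollama serve` first and pull the model from a second terminal (skip `ollama serve` if the Ollama desktop app is already running):

```
ollama serve
ollama pull llama3
```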

**rag/query_rag.py**

```python
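import chromadb
from sentence_transformers import SentenceTransformer
import requests

# Use an on-disk store at the same path as the ingest script; chromadb.Client() is
# in-memory, so a collection created in one process is invisible to another.
db = chromadb.PersistentClient(path="chroma_db").get_collection("incidents")
embed = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def ask_ollama(prompt):
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model":"llama3","prompt":prompt,"stream":False},
                      timeout=120)
    r.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    print(r.json()["response"])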

def ask(question, k=5):
    qv = embed.encode(question).tolist()
    res = db.query(query_embeddings=[qv], n_results=k)
    ctx = "\n".join(res["documents"][0])
    prompt = f"""You are an incident analyst. Use only the context.

Context:
{ctx}

Question: {question}
Answer with a concise root cause or hypothesis and a fix."""
    ask_ollama(prompt)

if __name__ == "__main__":
    ask("why are we seeing payment timeouts and how to fix?")
```

Run:

```
pip install requests sentence-transformers chromadb
python rag/query_rag.py
```
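
A top-k query always returns k documents, even when none of them are close to the question. A small sketch of filtering by distance before building the prompt, assuming the default Chroma query result shape with `documents` and `distances` (one inner list per query). `select_context` and `build_prompt` are hypothetical helpers, and a sensible `max_distance` depends on the collection's distance metric and has to be found by experiment:

```python
def select_context(res: dict, max_distance: float) -> list:
    """Keep only retrieved documents whose distance is within max_distance."""
    docs = res["documents"][0]
    dists = res["distances"][0]
    return [doc for doc, dist in zip(docs, dists) if dist <= max_distance]

def build_prompt(question: str, context_docs: list) -> str:
    """Assemble the analyst prompt; say so explicitly when nothing relevant was retrieved."""
    ctx = "\n".join(context_docs) if context_docs else "(no relevant log lines found)"
    return (
        "You are an incident analyst. Use only the context.\n\n"
        f"Context:\n{ctx}\n\n"
        f"Question: {question}\n"
        "Answer with a concise root cause or hypothesis and a fix."
    )
```

Inside `ask`, this would replace the direct `"\n".join(res["documents"][0])`, so that weak matches do not steer the model toward an unrelated root cause.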

# 6) Airflow: nightly maintenance (batch reindex or vacuum)

Minimal DAG that rebuilds the Chroma collection from parquet once per night.

**airflow/dags/reindex_daily.py**

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {"owner":"you","retries":0}
with DAG(
    "reindex_chroma",
    default_args=default_args,
    schedule="0 2 * * *",  # 02:00 daily (Airflow >= 2.4; older versions use schedule_interval)
    start_date=datetime(2025,1,1),
    catchup=False,
) as dag:
    rebuild = BashOperator(
        task_id="rebuild",
        bash_command="python /opt/flows/rebuild_index.py"
    )
```

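`/opt/flows/rebuild_index.py` is not part of the folder layout above; you write it yourself. It should read all the Parquet files, deduplicate the records, and recreate the Chroma collection. For the demo, you can instead add a "full rebuild" flag to `ingest/embed_to_chroma.py` and have the DAG call that.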
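
One possible shape for such a rebuild script is sketched below. Everything in it is an assumption for illustration: the `chroma_db` path, the batch size, the choice of a SHA-1 over all four fields as a stable ID, and the function names. Heavy imports sit inside `rebuild` so that the deduplication helpers can be exercised on their own:

```python
import hashlib

FIELDS = ("ts", "service", "level", "message")

def record_id(rec: dict) -> str:
    """Stable ID derived from the record content, so identical records collapse."""
    key = "\x1f".join(str(rec.get(f)) for f in FIELDS)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

def dedupe(records: list) -> dict:
    """Map stable ID -> record, keeping the first occurrence of each."""
    unique = {}
    for rec in records:
        unique.setdefault(record_id(rec), rec)
    return unique

def rebuild(parquet_path="landing/clean_parquet", chroma_path="chroma_db", batch=500):
    # Imported here so the helpers above do not require Spark or Chroma
    import chromadb
    from pyspark.sql import SparkSession
    from sentence_transformers import SentenceTransformer

    spark = SparkSession.builder.appName("rebuild-index").getOrCreate()
    rows = [r.asDict() for r in spark.read.parquet(parquet_path).collect()]
    unique = dedupe(rows)

    client = chromadb.PersistentClient(path=chroma_path)
    try:
        client.delete_collection("incidents")
    except Exception:
        pass  # the collection did not exist yet
    collection = client.get_or_create_collection("incidents")

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    ids = list(unique)
    for i in range(0, len(ids), batch):
        chunk_ids = ids[i:i + batch]
        recs = [unique[j] for j in chunk_ids]
        texts = [f"[{r['ts']}] {r['service']} {r['level']} :: {r['message']}" for r in recs]
        embs = model.encode(texts, show_progress_bar=False).tolist()
        collection.add(ids=chunk_ids, embeddings=embs, metadatas=recs, documents=texts)
    print(f"Rebuilt index with {len(ids)} unique records")
```

Calling `rebuild()` at the bottom of the file would make it runnable from the `BashOperator` command. Content-derived IDs mean a rerun over the same Parquet data produces the same collection, unlike the random UUIDs used during streaming ingest.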

# How this proves each technology

* **Kafka**: streaming ingestion of events.
* **PySpark**: structured streaming clean + land to parquet.
* **Embeddings + Vector DB (Chroma)**: semantic index.
* **RAG**: retrieve top-k, prompt LLM with context.
* **Ollama / HF**: local LLM inference.
* **Airflow**: scheduled maintenance or rebuild.

# Quick run order

1. `docker compose up -d`
2. Terminal A: `python spark/stream_clean_from_kafka.py`
3. Terminal B: `python producer/producer.py`
4. Terminal C: `python ingest/embed_to_chroma.py`
5. Terminal D: start the LLM server → `ollama serve`
6. Terminal E: `ollama pull llama3`, then `python rag/query_rag.py`

You now have a working, local, real-time Incident Intelligence Assistant demonstrating Kafka + PySpark + Airflow + embeddings + Chroma + RAG + Ollama.
