<a href="https://colab.research.google.com/github/smduarte/ps2024/blob/main/lab4/ps2024_lab4_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Processamento de Streams 2024
## Lab 4.2 - Flume + Kafka Streaming 

Flume + Kafka + Spark Structured Streaming
---
### Colab Setup

In [None]:
#@title Install Flume
%%bash

rm -rf *flume*
wget -q -O - https://dlcdn.apache.org/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz | tar xfz -
mkdir -p conf

In [None]:
#@title Define Flume agent topology
%%writefile conf/flume.conf

# Name the components on this agent
agent.sinks = k1
agent.sources = r1
agent.channels = c1

# Describe/configure the source
agent.sources.r1.type = seq
agent.sources.r1.channels = c1

# Use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100

agent.sinks.k1.channel = c1
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.kafka.topic = flume
agent.sinks.k1.kafka.bootstrap.servers = localhost:9092
agent.sinks.k1.kafka.flumeBatchSize = 20
agent.sinks.k1.kafka.producer.acks = 1
agent.sinks.k1.kafka.producer.linger.ms = 1
agent.sinks.k1.kafka.producer.compression.type = snappy



In [None]:
#@title Install PySpark
!pip install pyspark findspark --quiet
import findspark
findspark.init()
findspark.find()

In [None]:
#@title Install & Launch Kafka
%%bash
KAFKA_VERSION=3.4.0
KAFKA=kafka_2.13-$KAFKA_VERSION
wget -q -O /tmp/$KAFKA.tgz https://dlcdn.apache.org/kafka/$KAFKA_VERSION/$KAFKA.tgz
tar xfz /tmp/$KAFKA.tgz
wget -q -O $KAFKA/config/server1.properties https://github.com/smduarte/ps2024/raw/main/colab/server1.properties

UUID=`$KAFKA/bin/kafka-storage.sh random-uuid`
$KAFKA/bin/kafka-storage.sh format -t $UUID -c $KAFKA/config/server1.properties
$KAFKA/bin/kafka-server-start.sh -daemon $KAFKA/config/server1.properties

In [None]:
#@title Start Flume agent
%%bash
FLUME=apache-flume-1.11.0-bin

nohup $FLUME/bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=ALL,console -n agent 2>/dev/null > /dev/null &

In [None]:
#@title Process stream with Spark streaming
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

def dumpBatchDF(df, epoch_id):
    # Print up to 20 rows of the micro-batch without truncating column values.
    df.show(20, False)

# The spark-sql-kafka connector version (3.3.2 here) should match the installed PySpark version.
spark = SparkSession \
    .builder \
    .appName('Kafka Spark Structured Streaming Example') \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2') \
    .getOrCreate()

lines = spark \
  .readStream \
  .format('kafka') \
  .option('kafka.bootstrap.servers', 'localhost:9092') \
  .option('subscribe', 'flume') \
  .option('startingOffsets', 'earliest') \
  .load() \
  .selectExpr('CAST(value AS STRING)')

query = lines \
    .writeStream \
    .outputMode('append') \
    .foreachBatch(dumpBatchDF) \
    .start()

# Let the query run for at most 600 seconds, then shut everything down.
query.awaitTermination(600)
query.stop()
spark.stop()

### Exercise 1

Use [Flume](https://flume.apache.org/index.html) to ingest the weblog dataset into Kafka, according to the following pipeline.

 `weblog tcp:7777` -> `Flume` -> `Kafka` -> `Spark Streaming`

1. Run the `weblog` stream from Lab 2 using `client.py` instead of `server.py`.

2. Check the Flume [user guide](https://flume.apache.org/FlumeUserGuide.html) to find the documentation on the available sources and how to use them. 

3. Define the new Flume agent topology. Use the `netcat` source, which listens on a TCP port and turns each received line of text into an event, to consume lines from the weblog stream. The sink should be `kafka`.

4. Consume the stream data using Spark Streaming.
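
As a starting point for step 3, the following is a minimal sketch, not a reference solution. It renders a Flume properties file with a `netcat` source on port 7777 and the same Kafka sink settings used in the setup. The function name `weblog_agent_conf`, the bind address `0.0.0.0`, and the component names are assumptions made for illustration; the topic defaults to `flume` so the Spark cell above could be reused unchanged.

```python
# Sketch of a Flume agent definition for Exercise 1 (netcat source -> Kafka sink).
# Assumptions (not from the lab): the function name, the bind address, and the
# component names r1/c1/k1 are illustrative choices.

def weblog_agent_conf(port=7777, topic="flume", bootstrap="localhost:9092"):
    """Return the text of a Flume properties file with a single netcat source."""
    lines = [
        "agent.sources = r1",
        "agent.channels = c1",
        "agent.sinks = k1",
        "",
        "# netcat source: listens on a TCP port, one event per received line",
        "agent.sources.r1.type = netcat",
        "agent.sources.r1.bind = 0.0.0.0",
        f"agent.sources.r1.port = {port}",
        "agent.sources.r1.channels = c1",
        "",
        "# in-memory channel, same sizing as in the setup cell",
        "agent.channels.c1.type = memory",
        "agent.channels.c1.capacity = 1000",
        "agent.channels.c1.transactionCapacity = 100",
        "",
        "# Kafka sink, same broker as in the setup cell",
        "agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink",
        "agent.sinks.k1.channel = c1",
        f"agent.sinks.k1.kafka.topic = {topic}",
        f"agent.sinks.k1.kafka.bootstrap.servers = {bootstrap}",
    ]
    return "\n".join(lines) + "\n"


conf_text = weblog_agent_conf()
print(conf_text)
```

The printed text could be saved under `conf/` (for example with a `%%writefile` cell, as in the setup) and passed to `flume-ng` with `-f`; the weblog `client.py` would then connect to port 7777.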

### Exercise 2

Create a new Flume agent, based on the solution of the previous exercise, to aggregate events from two weblog servers, one serving the `IPv4` dataset and the other the `IPv6` dataset.
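
One possible shape for such an agent is fan-in: several sources writing to the same channel, which Flume supports. The sketch below generates a properties file with one `netcat` source per dataset. The function name `fan_in_agent_conf`, the source names `ipv4`/`ipv6`, the ports 7777 and 7778, and the bind address are all assumptions for illustration, not part of the lab material.

```python
# Sketch: one Flume agent with several netcat sources feeding a single channel.
# Assumptions (not from the lab): function name, source names, ports, and
# bind address are illustrative.

def fan_in_agent_conf(ports, topic="flume", bootstrap="localhost:9092"):
    """Build a Flume properties file with one netcat source per (name, port) pair."""
    out = [
        "agent.sources = " + " ".join(ports),  # e.g. "ipv4 ipv6"
        "agent.channels = c1",
        "agent.sinks = k1",
        "",
    ]
    for name, port in ports.items():
        out += [
            f"agent.sources.{name}.type = netcat",
            f"agent.sources.{name}.bind = 0.0.0.0",
            f"agent.sources.{name}.port = {port}",
            # Every source writes to the same channel; this is the fan-in.
            f"agent.sources.{name}.channels = c1",
            "",
        ]
    out += [
        "agent.channels.c1.type = memory",
        "agent.channels.c1.capacity = 1000",
        "agent.channels.c1.transactionCapacity = 100",
        "",
        "agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink",
        "agent.sinks.k1.channel = c1",
        f"agent.sinks.k1.kafka.topic = {topic}",
        f"agent.sinks.k1.kafka.bootstrap.servers = {bootstrap}",
    ]
    return "\n".join(out) + "\n"


conf_text = fan_in_agent_conf({"ipv4": 7777, "ipv6": 7778})
print(conf_text)
```

With this layout both datasets end up interleaved in one Kafka topic, so the Spark query from the setup would see events from both sources in the same stream.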