<a href="https://colab.research.google.com/github/smduarte/ps2021/blob/master/proj/colab/ps2021_spark247%2Bkafka_setup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Processamento de Streams 2021 
### Google Colab Project Development Environment 

This notebook sets up Spark and the Kafka taxi data source for the first project.

Compared to the Docker setup, this environment lets you keep the full taxi dataset remotely in Google Drive.

Additionally, Google Colab instances offer 12+ GB of RAM and plenty of disk space.

The main drawback is that the execution environment is not persistent. The notebook itself is saved in Google Drive, but the Spark and Kafka installation must be repeated every time you reopen the notebook. Fortunately, the procedure only takes a couple of minutes and is fully automated.
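
Because the runtime is reset between sessions, one quick way to tell whether the setup has to be repeated is to check for the Spark directory. This is a minimal sketch: it assumes the install location is the same path used for `SPARK_HOME` later in this notebook, and the helper name `needs_setup` is made up for illustration.

```python
import os

# Assumption: same location used for SPARK_HOME further down in this notebook.
SPARK_HOME = "/content/spark-2.4.7-bin-hadoop2.7"

def needs_setup(spark_home=SPARK_HOME):
    """Return True when the Spark directory is missing (runtime was reset)."""
    return not os.path.isdir(spark_home)

if needs_setup():
    print("Spark not found: run the setup cell again.")
else:
    print("Spark found at", SPARK_HOME)
```

The check only looks at the Spark directory; it does not verify that Kafka is installed or running.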

In [1]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Setup Spark and Kafka

The command below downloads and executes a script that sets up Spark and Kafka as a local single-machine cluster.

In [1]:
#@title Install Spark and Kafka
!wget -q -O - https://raw.githubusercontent.com/smduarte/ps2021/master/proj/colab/setup-spark+kafka.sh  | bash 

Downloading Spark...
Unpacking Spark...
Downloading spark-sql-kafka-0-10_2.11-2.4.7.jar...
Downloading spark-streaming-kafka-0-10-assembly_2.11-2.4.7.jar...
Downloading spark-streaming-kafka-0-10_2.11-2.4.7.jar...
Downloading spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar...
Downloading spark-streaming-kafka-0-8_2.11-2.4.7.jar...
Done
Downloading Kafka...
Unpacking Kafka...
Done


### Kafka Source

Run the `Start Kafka Taxi Publisher` cell below to start the Kafka source.

Show the cell's code to customize the following settings:

`SPEEDUP` controls how fast the events are replayed relative to real time. A value of 60 means that 1 second
of real time corresponds to 1 minute's worth of events in the stream.

`TAXI_DATA` should point to the compressed taxi data, either the sample dataset or the full dataset. The
full dataset can be stored in Google Drive and used from there.
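
To get a feel for `SPEEDUP`, the arithmetic can be sketched in a few lines of Python. The function name `replay_seconds` is hypothetical and only illustrates the ratio described above; it is not part of the publisher script.

```python
def replay_seconds(event_span_seconds, speedup):
    """Wall-clock seconds needed to replay `event_span_seconds` of events."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return event_span_seconds / speedup

# With SPEEDUP=60, one minute of events is replayed in one second,
# and a full day of events takes 24 minutes of wall-clock time.
print(replay_seconds(60, 60))               # 1.0 (seconds)
print(replay_seconds(24 * 3600, 60) / 60)   # 24.0 (minutes)
```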

In [2]:
#@title Start Kafka Taxi Publisher
%%bash

SPEEDUP=60
TAXI_DATA=/content/drive/MyDrive/PS2021/proj/sample.csv.gz

wget -q -O - https://raw.githubusercontent.com/smduarte/ps2021/master/proj/colab/start-kafka.sh \
  | bash /dev/stdin $SPEEDUP $TAXI_DATA

60 /content/drive/MyDrive/PS2021/proj/sample.csv.gz
No zookeeper server to stop
No kafka server to stop
Starting Zookeeper...
Starting Kafka...
Waiting for 10 secs until kafka and zookeeper services are up and running
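
After the wait, it can be useful to confirm that something is actually listening on the broker port before starting a streaming job. The sketch below is a plain TCP check, not a Kafka client call; it assumes the broker address `localhost:9092` used by the streaming examples in this notebook, and `port_is_open` is a hypothetical helper name.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumption: the broker listens on localhost:9092, the address used
# by the streaming examples in this notebook.
print("Kafka reachable:", port_is_open("localhost", 9092))
```

An open port only shows that a process accepts connections; it does not guarantee that the `debs` topic already has data.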


### Spark Structured Streaming

The code below shows how to access the Kafka stream in structured mode. Note that with this setup, we need to connect to `localhost`.

In [3]:
#@title Enable Spark SQL Kafka source
!(cp jars/*kafka*10* spark*/jars/ && rm -f spark*/jars/*kafka*8*)

In [4]:
import os
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

import findspark
findspark.init()
findspark.find()

# Kafka connector jars downloaded by the setup script
jars = ",".join(["jars/spark-sql-kafka-0-10_2.11-2.4.7.jar",
                 "jars/spark-streaming-kafka-0-10_2.11-2.4.7.jar",
                 "jars/spark-streaming-kafka-0-10-assembly_2.11-2.4.7.jar"])

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

# Print up to 20 rows of each micro-batch, without truncating columns
def dumpBatchDF(df, epoch_id):
    df.show(20, False)


spark = SparkSession \
    .builder \
    .config("spark.jars", jars) \
    .appName("Kafka Spark Structured Streaming Example") \
    .getOrCreate()

print(spark)

# Subscribe to the "debs" topic and keep only the message value as a string
lines = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "debs") \
  .load() \
  .selectExpr("CAST(value AS STRING)")

# Each value is one CSV line; split it into its fields
split_lines = split(lines['value'], ',')

# Field 0 is the medallion, field 2 is the pickup timestamp
rides = lines.withColumn('medallion', split_lines.getItem(0).cast("string")) \
        .withColumn('pickup_datetime', split_lines.getItem(2).cast("timestamp")) \
        .drop('value')

# Dump each micro-batch, triggering every 5 seconds
query = rides \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime='5 seconds') \
    .foreachBatch(dumpBatchDF) \
    .start()

# Let the query run for 20 seconds, then shut everything down
query.awaitTermination(20)
query.stop()
spark.stop()

<pyspark.sql.session.SparkSession object at 0x7f540a022a10>
+---------+---------------+
|medallion|pickup_datetime|
+---------+---------------+
+---------+---------------+

+--------------------------------+-------------------+
|medallion                       |pickup_datetime    |
+--------------------------------+-------------------+
|019319FCEB17B9B9E45763F9573DE640|2013-01-01 00:00:19|
|02B5097C51EBA10438C43AC0F4BC5867|2013-01-01 00:00:03|
|02C49A409C2DC66B1DBD44A6EF1B75B5|2013-01-01 00:00:22|
|02CDF4B814C18DE3898FDA761EA8A5B4|2013-01-01 00:00:26|
|050E71DA009681B177455E36A1FDB041|2013-01-01 00:00:24|
|067F347AA3A87A1F35E8EDC20083F43A|2013-01-01 00:00:19|
|06F02AE93C005820592CC14E6E6E9B51|2013-01-01 00:00:19|
|079E2EE80E3CFA17EB826FDF4DF66C36|2013-01-01 00:00:22|
|0812169BAD07C1760F17DA2104920BF1|2013-01-01 00:00:17|
|0BC78FA149D8D784815895BD33F17EE1|2013-01-01 00:00:17|
|0C52944B57BE5E0962C6561A45C67A4A|2013-01-01 00:00:16|
|0CE8A32DD9ED7C681471A968C71F31D9|2013-01-01 00:00:17|
|0
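
Each Kafka message value is a single comma-separated line, so the field extraction done with `split` and `getItem` above can be tried out in plain Python first. The sketch below uses a sample record copied from the Spark Streaming output in this notebook; `parse_ride` is a hypothetical helper, not part of the project code.

```python
from datetime import datetime

# Sample record copied from the Spark Streaming output in this notebook.
line = ("003EEA559FA61800874D4F6805C4A084,CD92B72D14A3A88FE135641B51BD9F57,"
        "2013-01-01 00:01:49,2013-01-01 00:03:59,720,3.33,-73.948006,40.804722,"
        "-73.955948,40.768448,CRD,12.5,0.5,0.5,0.0,0.0,13.5")

def parse_ride(line):
    """Extract the same two fields as the Spark code: items 0 and 2."""
    fields = line.split(",")
    medallion = fields[0]
    pickup = datetime.strptime(fields[2], "%Y-%m-%d %H:%M:%S")
    return medallion, pickup

medallion, pickup = parse_ride(line)
print(medallion, pickup)
```

Checking the field positions on one record this way is cheaper than restarting a streaming query to find an off-by-one index.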

### Spark Streaming

The code below shows how to access the Kafka stream in the original Spark Streaming (DStream) mode. Note that with this setup, we need to connect to `localhost`.

In [1]:
#@title Enable Spark Streaming Kafka source
!(cp jars/*kafka*8* spark*/jars/ && rm -f spark*/jars/*kafka*10*)

In [2]:
import os
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

import findspark
findspark.init()
findspark.find()

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

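conf = SparkConf() \
        .set("spark.io.compression.codec", "snappy")

sc = SparkContext(conf=conf, master="local[*]", appName="Kafka Spark Streaming Example")

# Batch interval of 1 second
ssc = StreamingContext(sc, 1)

# Each Kafka message arrives as a (key, value) pair; keep the non-empty values
# and group them into a 5-second window that slides every second
lines = KafkaUtils.createDirectStream(ssc, ["debs"],
            {"metadata.broker.list": "localhost:9092"}) \
        .map(lambda e: e[1]) \
        .filter(lambda line: len(line) > 0) \
        .window(5, 1)

lines.pprint()

# Run for 20 seconds, then shut down.
# ssc.stop() also stops the SparkContext by default; sc.stop() is a harmless safeguard.
ssc.start()
ssc.awaitTermination(20)
ssc.stop()
sc.stop()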

-------------------------------------------
Time: 2021-04-29 09:22:42
-------------------------------------------
003EEA559FA61800874D4F6805C4A084,CD92B72D14A3A88FE135641B51BD9F57,2013-01-01 00:01:49,2013-01-01 00:03:59,720,3.33,-73.948006,40.804722,-73.955948,40.768448,CRD,12.5,0.5,0.5,0.0,0.0,13.5
00FF7906D2968161B83E04AF033CF02E,704561F06342BEC6B24E4B065B02BF9D,2013-01-01 00:01:37,2013-01-01 00:03:59,1440,10.13,-73.919235,40.774883,-73.980835,40.733616,CRD,31.5,0.5,0.5,6.0,0.0,38.5
03055B956C21B1F915DCDB118AA79F21,E0C7A67293DE535DB04F8AFD8BF28F73,2013-01-01 00:01:30,2013-01-01 00:03:59,1860,4.35,-74.002296,40.746929,-73.990334,40.76989,CSH,21.5,0.5,0.5,0.0,0.0,22.5
034BF986CBF876AE48C633CB96653F5F,D6C469E0E2352E6F6A79E909D6BEAC8B,2013-01-01 00:01:48,2013-01-01 00:03:59,780,2.84,-73.958061,40.677341,-73.991951,40.660488,CSH,12.0,0.5,0.5,0.0,0.0,13.0
07A5B8AA9A594BAA47BC512D621F2074,EE2F8D5FD41D6456D8D474FC69B017B4,2013-01-01 00:01:48,2013-01-01 00:03:59,780,2.86,-73.923454,40.743996,

KeyboardInterrupt: ignored
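
The `window(5, 1)` call in the example groups the last 5 seconds of batches and re-evaluates that group every second. The plain Python sketch below imitates this behaviour on a list of 1-second batches. It is only an illustration of the sliding-window idea, not Spark code, and the function name and record labels are made up.

```python
def sliding_windows(batches, window_length, slide):
    """Yield (end_index, window_contents) for each slide step.

    `batches` holds one list of records per batch interval; each window
    covers the last `window_length` batches and advances by `slide`.
    """
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_length)
        merged = [rec for batch in batches[start:end] for rec in batch]
        yield end, merged

# Seven 1-second batches with made-up record labels.
batches = [["a"], ["b"], [], ["c", "d"], ["e"], ["f"], ["g"]]
for end, contents in sliding_windows(batches, window_length=5, slide=1):
    print(f"t={end}s -> {contents}")
```

Because consecutive windows overlap, the same record shows up in several printed windows, which is why ride lines repeat across consecutive `pprint()` outputs.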