# Reddit Consumer - Structured Streaming

This notebook shows:
 * how to receive the Reddit JSON content coming through a socket and save the raw data to local disk (using Spark DStreams).
 * how to use Spark Structured Streaming to load data from disk and process it.
 * how to convert a DStream to a Spark DataFrame.
 * how to make DStreams and Structured Streaming DataFrames available in SQL interfaces.
 * how to join streaming data with static data.

The process:
 * Execute the first cell to save the streaming data to disk. The code keeps running indefinitely until the notebook kernel is terminated. If you change the output format (new fields, etc.), increment the `base_path` variable before rerunning the cell so that each dataset stays consistent.
 * Execute the second cell to process the data saved to disk. The code first runs on all previously saved files and is then triggered each time the first cell adds a new file. It keeps running indefinitely until you rerun that cell, which lets you update the cell code and try it while always reprocessing all previous files. Before each rerun, increment the `output_path` and `checkpt_path` variables.
 * The 3rd cell is provided for testing purposes.
 * The 4th cell stops the streaming context when you are done.
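
Incrementing the version suffix by hand is easy to forget. A minimal sketch of a helper that bumps a trailing `_v<N>` suffix follows; `next_version` is a hypothetical name and is not part of this notebook.

```python
import re


def next_version(path: str) -> str:
    """Return `path` with its trailing _v<N> suffix incremented by one."""
    match = re.search(r"_v(\d+)$", path)
    if match is None:
        raise ValueError(f"no _v<N> suffix in {path!r}")
    bumped = int(match.group(1)) + 1
    # Keep everything before the suffix and append the new version number
    return f"{path[:match.start()]}_v{bumped}"


print(next_version("./data/raw/reddit_v5"))
```

The same helper could be applied to `output_path` and `checkpt_path` before rerunning the second cell.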

This notebook is meant to work in parallel to the content producer (`reddit_producer.ipynb`).
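
The producer notebook is not reproduced here. As a rough sketch of what the consumer expects on the wire, the code below serves newline-delimited JSON on `localhost:9998`, since `socketTextStream` connects as a client and reads one record per line. The function names and the way comments are supplied are assumptions made for illustration; the actual `reddit_producer.ipynb` may work differently.

```python
import json
import socket


def encode_comment(comment: dict) -> bytes:
    """Serialize one comment as a single newline-terminated JSON line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line
    return (json.dumps(comment) + "\n").encode("utf-8")


def serve_comments(comments, host="localhost", port=9998):
    """Accept one client (the Spark receiver) and send it every comment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _ = server.accept()  # blocks until a client connects
        with conn:
            for comment in comments:
                conn.sendall(encode_comment(comment))
```

Calling `serve_comments` with an iterable of dicts that carry the `comment`, `prev_comment`, `post`, `author`, and `created_utc` fields would feed the first cell below.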

In [1]:
# This cell shows
#  * how to receive the Reddit JSON content coming through a socket and save the raw data to local disk (using Spark DStreams).
#  * how to convert a DStream to a Spark DataFrame.
#  * how to make DStreams available in SQL interfaces.

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.sql.types import StructType, StructField, StringType
import json

# Create a SparkSession and StreamingContext
spark_conf = SparkConf().setAppName("reddit")
ss1 = SparkSession.builder.config(conf=spark_conf).getOrCreate()
ssc = StreamingContext(ss1.sparkContext, 5)

# Create a DStream
lines = ssc.socketTextStream("localhost", 9998)
comments = lines.map(lambda json_data: json.loads(json_data))
comments.pprint()

# Define the schema for the DataFrame
schema = StructType([
    StructField("comment",StringType(),True),
    StructField("prev_comment",StringType(),True),
    StructField("post",StringType(),True),
    StructField("author",StringType(),True),
    StructField("created_utc",StringType(),True),
    ])

# ---- UPDATE BELOW IF YOU RERUN THIS CELL WITH A NEW OUTPUT FORMAT -----
base_path = "./data/raw/reddit_v5"

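# Convert each RDD in the DStream to a DataFrame
def process_rdd(batch_time, rdd):
    # batch_time is the datetime of the batch, passed in by foreachRDD
    if not rdd.isEmpty():
        df = ss1.createDataFrame(rdd, schema)
        # registerTempTable is deprecated; the view is replaced on every batch
        df.createOrReplaceTempView("comments")
        df.persist()  # cache the batch: it is used by both write and show below
        output_path = f"{base_path}/{batch_time.strftime('%Y%m%d%H%M%S')}"
        df.write.json(output_path)
        df.show()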

comments.foreachRDD(process_rdd)

# Start the streaming context
ssc.start()  
# No ssc.awaitTermination() here, so the cell does not block and other cells can run in parallel.


23/06/20 17:19:24 WARN Utils: Your hostname, MacBook-Pro-van-Ruben-2.local resolves to a loopback address: 127.0.0.1; using 192.168.0.101 instead (on interface en0)
23/06/20 17:19:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/20 17:19:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/20 17:19:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/06/20 17:19:26 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
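
The first cell passes every incoming line straight to `json.loads`, which raises on a malformed line. A minimal sketch of a more tolerant parser is shown below, assuming that dropping bad lines is acceptable; `parse_comment` is a hypothetical helper and not part of the original cell.

```python
import json


def parse_comment(line):
    """Parse one JSON line; return None instead of raising on malformed input."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    # Keep only JSON objects, since the schema expects named fields
    return record if isinstance(record, dict) else None


print(parse_comment('{"author": "alice", "comment": "hi"}'))
print(parse_comment("not json"))
```

In the DStream pipeline this could replace the lambda, for example `lines.map(parse_comment).filter(lambda c: c is not None)`.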


In [None]:
# This cell shows
#  * how to use Spark Structured Streaming to load data from disk and process it.
#  * how to make Structured Streaming DataFrames available in SQL interfaces.
#  * how to join streaming data with static datasets.

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType, IntegerType, LongType, FloatType
from pyspark.sql import functions as F

# Create (or reuse) the SparkSession; this cell reads from disk, so no socket or StreamingContext is needed
spark_conf = SparkConf().setAppName("reddit")
ss1 = SparkSession.builder.config(conf=spark_conf).getOrCreate()

# ---- UPDATE BELOW ACCORDING TO THE LOCATION IN THE FIRST CELL -----
input_path = "./data/raw/reddit_v5/*/*.json"

schema = StructType([
    StructField("comment",StringType(),True),
    StructField("prev_comment",StringType(),True),
    StructField("post",StringType(),True),
    StructField("author",StringType(),True),
    StructField("created_utc",StringType(),True),
    ])

streaming_df = ss1.readStream \
    .format("json") \
    .schema(schema) \
    .option("path", input_path) \
    .load()

# created_utc arrives as a string of epoch seconds. Cast through double rather than float:
# a 32-bit float cannot represent current epoch seconds exactly.
transformed_df = streaming_df \
    .withColumn('created_utc2', F.col('created_utc').cast("double")) \
    .withColumn('created_utc3', F.col('created_utc2').cast("long")) \
    .withColumn('created_utc_ts', F.from_unixtime(F.col('created_utc3')).cast(TimestampType())) \
    .withWatermark("created_utc_ts", "5 seconds") \
    .groupBy(F.col("author"), F.window(F.col("created_utc_ts"), windowDuration="60 seconds")) \
    .agg({"created_utc_ts":'max', "comment":'count'})


#transformed_df.createOrReplaceTempView("stream")
#transformed_df = ss1.sql("""
#    SELECT *
#    FROM stream as ct
#    """)
# streaming df can be joined with static tables.


# Save output to disk
# ---- UPDATE BELOW EACH TIME YOU RERUN THIS CELL -----
output_path = "./data/processed/reddit_v1"
checkpt_path = "./metadata/processed/reddit_v1"

transformed_df.writeStream \
    .format("json") \
    .option("checkpointLocation", checkpt_path) \
    .option("path", output_path) \
    .outputMode("append") \
    .start()

# For testing, output to console
#query = transformed_df.writeStream \
#    .format("console") \
#    .start()
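
The second cell converts `created_utc` (epoch seconds) to a timestamp and groups it into 60-second windows. The plain-Python sketch below illustrates two points under stated assumptions: how an epoch value maps to a 60-second tumbling window aligned to the epoch, and why a 32-bit float cannot hold a current epoch value exactly. The sample value and both helper names are hypothetical.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a value through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def tumbling_window(epoch_seconds: int, size: int = 60):
    """Return the [start, end) bounds of the window containing epoch_seconds."""
    start = epoch_seconds - epoch_seconds % size
    return start, start + size


created_utc = 1687274364  # hypothetical epoch seconds

print(tumbling_window(created_utc))
# A 32-bit float keeps 24 significant bits, so values near 1.6e9 are rounded
print(as_float32(float(created_utc)) - created_utc)
```

Near this magnitude, adjacent 32-bit floats are 128 seconds apart, which is coarser than the 60-second window.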


In [None]:
# Mix a static table (with historical content)
# and a dynamic table (with content from the current window)

# Loading static table
base_path = "./data/raw/reddit_v5/*/*.json"
historical = ss1.read.json(base_path)
historical.createOrReplaceTempView('historical')
historical.show()

# Join the dynamic table (content from the current batch) with the static table (historical content)
ss1.sql("""
    SELECT * 
    FROM comments as ct
    LEFT JOIN historical as ht on ht.author=ct.author
    """).show()

In [None]:
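ssc.stop()  # also stops the underlying SparkContext by default
ss1.stop()  # `sc` is never defined in this notebook; stop the session instead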