- **Name:** 20.1_structured_streaming
- **Author:** Shamas Imran
- **Description:** Basics of Structured Streaming in Spark
- **Date:** 19-Aug-2025
<!--
REVISION HISTORY
Version          Date        Author           Description
01           19-Aug-2025   Shamas Imran       Created streaming DataFrame from files  
                                              Defined query with writeStream  
                                              Used checkpointing for fault tolerance  
-->

In [0]:
# Root path of your Unity Catalog volume
rootPath = "/Volumes/datapurcatalog/default/datapurvolume/"

# Master folder for streaming project
masterPath = rootPath + "spark-streaming/"

# Delete recursively
# dbutils.fs.rm(masterPath, recurse=True)

# Define subfolders inside master
inputPath = masterPath + "csv_input"
checkpointPath = masterPath + "checkpoints/csv_query"
outputPath = masterPath + "csv_output"

# Create directories
dbutils.fs.mkdirs(masterPath)
dbutils.fs.mkdirs(inputPath)
dbutils.fs.mkdirs(checkpointPath)
dbutils.fs.mkdirs(outputPath)

print("Master folder:", masterPath)
print("Input folder:", inputPath)
print("Checkpoint folder:", checkpointPath)
print("Output folder:", outputPath)


In [0]:
# 0) Paths (as provided in output of previous cell)
masterPath     = "/Volumes/datapurcatalog/default/datapurvolume/spark-streaming/"
inputPath      = "/Volumes/datapurcatalog/default/datapurvolume/spark-streaming/csv_input"
checkpointPath = "/Volumes/datapurcatalog/default/datapurvolume/spark-streaming/checkpoints/csv_query"
outputPath     = "/Volumes/datapurcatalog/default/datapurvolume/spark-streaming/csv_output"

In [0]:
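# 1) Stop any streaming queries left over from earlier runs
# List the queries that are currently active
for stream in spark.streams.active:
    print(stream.id, stream.name, stream.status)

# Stop each active query
for q in spark.streams.active:
    q.stop()

# List again to confirm nothing is still running (prints nothing if all stopped)
for stream in spark.streams.active:
    print(stream.id, stream.name, stream.status)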


In [0]:
# 2) Define schema for the incoming CSVs
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("score", IntegerType(), True),
    StructField("event_time", TimestampType(), True)  # keep for future (watermark/windows)
])

In [0]:
# 3) Create the streaming DataFrame (CSV source)
df_stream = (
    spark.readStream
         .option("header", "true")   # CSV has a header row
         .schema(schema)             # schema is required for streaming
         .csv(inputPath)             # watches this folder for new files
)

In [0]:
# 4) Start a basic CSV file sink (no windows, no watermark)
query = (
    df_stream.writeStream
             .format("csv")
             .option("path", outputPath)
             .option("checkpointLocation", checkpointPath)  # enables recovery/fault tolerance
             .outputMode("append")                          # append since no aggregations
             .trigger(once=True)                            # one batch, then stop; alternative: .trigger(processingTime="1 second")
             .start()
)

query.awaitTermination()

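# trigger() ===>
# Default (no trigger set) → micro-batch mode; each batch starts as soon as the previous one finishes.
# Processing time → micro-batches at a fixed interval, when you want controlled batch frequency.
# Once → process all available data in a single batch, then stop (testing or one-off runs; deprecated since Spark 3.4 in favor of AvailableNow).
# AvailableNow → process all data available at start, possibly in several batches, then stop.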

# outputMode() ======>
# Use append if: new rows keep arriving and rows already written never change (the only mode file sinks support).
# Use update if: you’re aggregating and only want the rows that changed since the last trigger.
# Use complete if: you’re aggregating and need the full result table rewritten on every trigger.
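
The difference between the three output modes may be easier to see on a toy example. The block below is a plain-Python model, not Spark code: `simulate_output_modes` is a hypothetical helper that keeps a running count per name across two micro-batches and returns what a sink would receive after each batch under each mode. It assumes append is used without aggregation (as in the query above), while update and complete are applied to the running count.

```python
from collections import Counter


def simulate_output_modes(batches, mode):
    """Simplified model of what a sink receives per micro-batch (not Spark itself)."""
    counts = Counter()   # running aggregate: rows seen per name
    emitted = []
    for batch in batches:
        counts.update(batch)
        if mode == "append":
            # No aggregation: only the rows that arrived in this batch are emitted.
            emitted.append(list(batch))
        elif mode == "update":
            # Only the aggregate rows touched by this batch are emitted.
            emitted.append({name: counts[name] for name in set(batch)})
        elif mode == "complete":
            # The whole aggregate table is emitted on every batch.
            emitted.append(dict(counts))
        else:
            raise ValueError(f"unknown output mode: {mode}")
    return emitted


# Two micro-batches of names, loosely based on the sample student files.
batches = [["Ali", "Sara"], ["Ali", "Hassan"]]

for mode in ("append", "update", "complete"):
    print(mode, simulate_output_modes(batches, mode))
```

In the second batch, update emits only the counts for Ali and Hassan, whereas complete re-emits Sara's unchanged count as well.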

In [0]:
# 5) (Optional) Stop the stream when you’re done
# for q in spark.streams.active:
#     q.stop()

for stream in spark.streams.active:
    print(stream.id, stream.name, stream.status)


In [0]:
batch01_students.csv
id,name,score,event_time
1,Ali,85,2025-08-18 12:00:05
2,Sara,90,2025-08-18 12:00:08
3,Imran,70,2025-08-18 12:00:10

batch02_students.csv
id,name,score,event_time
4,Ayesha,95,2025-08-18 12:01:02
5,Hassan,88,2025-08-18 12:01:30
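
The sample files above can be generated from Python instead of being uploaded by hand. The following is a minimal sketch using only the standard library; `write_batch` and `HEADER` are hypothetical names, and a temporary folder stands in for `inputPath` so the block runs anywhere. On Databricks, pointing it at the volume path should work the same way, assuming the volume is reachable through ordinary file APIs.

```python
import csv
import tempfile
from pathlib import Path

# Column order matches the schema defined for the stream.
HEADER = ["id", "name", "score", "event_time"]


def write_batch(input_dir, filename, rows):
    """Write one CSV batch (header plus rows) into input_dir and return its path."""
    target = Path(input_dir) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(rows)
    return target


# Assumption: a temp folder replaces inputPath for this demonstration.
demo_input = tempfile.mkdtemp()

batch01 = write_batch(demo_input, "batch01_students.csv", [
    (1, "Ali", 85, "2025-08-18 12:00:05"),
    (2, "Sara", 90, "2025-08-18 12:00:08"),
    (3, "Imran", 70, "2025-08-18 12:00:10"),
])
batch02 = write_batch(demo_input, "batch02_students.csv", [
    (4, "Ayesha", 95, "2025-08-18 12:01:02"),
    (5, "Hassan", 88, "2025-08-18 12:01:30"),
])

print(batch01.read_text())
```

Writing the second file only after the first query run has finished makes it easy to check that the checkpoint prevents `batch01_students.csv` from being processed twice.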