[structured-streaming-scala.html - Databricks](https://docs.databricks.com/_static/notebooks/structured-streaming-scala.html)


In [1]:
// Import SparkSession (and the SQL types) before using them
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
// For implicit conversions (e.g. the $"column" syntax and RDD-to-DataFrame conversions)
import spark.implicits._

spark = org.apache.spark.sql.SparkSession@23a2ef80



The usual first step in processing the data is to query it interactively. Let's define a static DataFrame on the files and give it a table name.



In [4]:
import org.apache.spark.sql.types._

val inputPath = "./events/"

// Since we know the data format already, let's define the schema to speed up processing (no need for Spark to infer schema)
val jsonSchema = new StructType().add("time", TimestampType).add("action", StringType)

val staticInputDF = 
  spark
    .read
    .schema(jsonSchema)
    .json(inputPath)

// display(staticInputDF)
staticInputDF.show()

+-------------------+------+
|               time|action|
+-------------------+------+
|2016-07-26 11:15:32|  Open|
|2016-07-26 11:15:41|  Open|
|2016-07-26 11:15:41|  Open|
|2016-07-26 11:15:42|  Open|
|2016-07-26 11:15:45| Close|
|2016-07-26 11:15:46| Close|
|2016-07-26 11:15:47|  Open|
|2016-07-26 11:15:54| Close|
|2016-07-26 11:15:59|  Open|
|2016-07-26 11:16:04| Close|
|2016-07-26 11:16:04|  Open|
|2016-07-26 11:16:09| Close|
|2016-07-26 11:16:12| Close|
|2016-07-26 11:16:13| Close|
|2016-07-26 11:16:13|  Open|
|2016-07-26 11:16:14|  Open|
|2016-07-26 11:16:15| Close|
|2016-07-26 11:16:24|  Open|
|2016-07-26 11:16:25|  Open|
|2016-07-26 11:16:26| Close|
+-------------------+------+
only showing top 20 rows



inputPath = ./events/
jsonSchema = StructType(StructField(time,TimestampType,true), StructField(action,StringType,true))
staticInputDF = [time: timestamp, action: string]



Now we can compute the number of "open" and "close" actions in one-hour windows. To do this, we group by the action column and by one-hour windows over the time column.



In [6]:
import org.apache.spark.sql.functions._

val staticCountsDF = 
  staticInputDF
    .groupBy($"action", window($"time", "1 hour"))
    .count()   

// Register the DataFrame as table 'static_counts'
staticCountsDF.createOrReplaceTempView("static_counts")

staticCountsDF = [action: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]
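
To make the grouping concrete, here is a minimal plain-Scala sketch (no Spark required) of how a tumbling one-hour window can be assigned to a timestamp and then counted per action. The object TumblingWindowSketch and its members windowFor, countByActionAndWindow and sample are hypothetical names introduced for illustration; they are not Spark APIs. The sketch assumes windows are aligned to the Unix epoch, which is how Spark's window function aligns them when no start offset is given, and it treats the sample timestamps as UTC.

```scala
import java.time.{Duration, Instant}

object TumblingWindowSketch {
  // Hypothetical helper (not a Spark API): returns the [start, end) window of the
  // given width that contains `ts`, with windows aligned to the Unix epoch.
  def windowFor(ts: Instant, width: Duration): (Instant, Instant) = {
    val widthMs = width.toMillis
    val startMs = Math.floorDiv(ts.toEpochMilli, widthMs) * widthMs
    (Instant.ofEpochMilli(startMs), Instant.ofEpochMilli(startMs + widthMs))
  }

  // Plain-Scala analogue of groupBy(action, window(time, width)).count()
  def countByActionAndWindow(
      events: Seq[(Instant, String)],
      width: Duration): Map[(String, (Instant, Instant)), Int] =
    events
      .groupBy { case (ts, action) => (action, windowFor(ts, width)) }
      .map { case (key, rows) => key -> rows.size }

  // A few events taken from the sample output above, assumed to be in UTC.
  val sample: Seq[(Instant, String)] = Seq(
    (Instant.parse("2016-07-26T11:15:32Z"), "Open"),
    (Instant.parse("2016-07-26T11:15:41Z"), "Open"),
    (Instant.parse("2016-07-26T11:15:45Z"), "Close")
  )
}
```

Under these assumptions, an event at 11:15:32 falls into the window that starts at 11:00:00 and ends at 12:00:00, so all three sample events land in the same window, giving two "Open" and one "Close".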



Now we can directly use SQL to query the table. For example, here are the total counts across all the hours.



In [10]:
%%SQL
select action, sum(count) as total_count from static_counts group by action

+------+-----------+
|action|total_count|
+------+-----------+
|  Open|        367|
| Close|         70|
+------+-----------+



How about a timeline of windowed counts?



In [15]:
val df = spark.sql("select action, date_format(window.end, \"MMM-dd HH:mm\") as time, count from static_counts order by time, action")
df.show()

+------+------------+-----+
|action|        time|count|
+------+------------+-----+
| Close|Jul-26 11:00|   11|
|  Open|Jul-26 11:00|  179|
| Close|Jul-26 12:00|   59|
|  Open|Jul-26 12:00|  188|
+------+------------+-----+



df = [action: string, time: string ... 1 more field]
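
As a quick sanity check, the per-window counts in the timeline should add up to the totals returned by the earlier sum(count) query. The plain-Scala sketch below (no Spark required) performs that aggregation on the four rows shown above; the object TimelineTotalsSketch and the function totalsByAction are hypothetical names used only for illustration.

```scala
object TimelineTotalsSketch {
  // Rows copied from the timeline output above: (action, window end, count).
  val timeline: Seq[(String, String, Long)] = Seq(
    ("Close", "Jul-26 11:00", 11L),
    ("Open",  "Jul-26 11:00", 179L),
    ("Close", "Jul-26 12:00", 59L),
    ("Open",  "Jul-26 12:00", 188L)
  )

  // Plain-Scala analogue of:
  //   select action, sum(count) as total_count from static_counts group by action
  def totalsByAction(rows: Seq[(String, String, Long)]): Map[String, Long] =
    rows.groupBy(_._1).map { case (action, rs) => action -> rs.map(_._3).sum }
}
```

Applied to these rows, the totals come out as 70 for "Close" (11 + 59) and 367 for "Open" (179 + 188), matching the SQL result.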



## Stream Processing
Now that we have analyzed the data interactively, let's convert this to a streaming query that continuously updates as data arrives. Since we only have a static set of files, we will emulate a stream by reading one file at a time, in the chronological order in which the files were created. The query we have to write is pretty much the same as the interactive query above.

In [16]:
import org.apache.spark.sql.functions._

// Similar to definition of staticInputDF above, just using `readStream` instead of `read`
val streamingInputDF = 
  spark
    .readStream                       // `readStream` instead of `read` for creating streaming DataFrame
    .schema(jsonSchema)               // Set the schema of the JSON data
    .option("maxFilesPerTrigger", 1)  // Treat a sequence of files as a stream by picking one file at a time
    .json(inputPath)

// Same query as staticCountsDF above, applied to the streaming DataFrame
val streamingCountsDF = 
  streamingInputDF
    .groupBy($"action", window($"time", "1 hour"))
    .count()

// Is this DF actually a streaming DF?
streamingCountsDF.isStreaming

streamingInputDF = [time: timestamp, action: string]
streamingCountsDF = [action: string, window: struct<start: timestamp, end: timestamp> ... 1 more field]


true

As you can see, streamingCountsDF is a streaming DataFrame (streamingCountsDF.isStreaming returned true). You start the streaming computation by defining a sink and starting the query. In our case, we want to interactively query the counts (the same queries as above), so we write the complete set of one-hour counts to an in-memory table (note that in Spark 2.0 this sink is intended for testing purposes only).



In [17]:
spark.conf.set("spark.sql.shuffle.partitions", "1")  // keep the size of shuffles small

val query =
  streamingCountsDF
    .writeStream
    .format("memory")        // memory = store in-memory table (for testing only in Spark 2.0)
    .queryName("counts")     // counts = name of the in-memory table
    .outputMode("complete")  // complete = all the counts should be in the table
    .start()

query = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5baa605a



query is a handle to the streaming query that is running in the background. This query is continuously picking up files and updating the windowed counts.

Note the status of query in the above cell. The progress bar shows that the query is active. Furthermore, if you expand the > counts above, you will find the number of files it has already processed.

Let's wait a bit for a few files to be processed and then query the in-memory counts table.

In [18]:
Thread.sleep(5000) // wait a bit for computation to start


In [21]:
val sqlstr = "select action, date_format(window.end, \"MMM-dd HH:mm\") as time, count from counts order by time, action"
val df = spark.sql(sqlstr)
df.show()

+------+------------+-----+
|action|        time|count|
+------+------------+-----+
| Close|Jul-26 11:00|   11|
|  Open|Jul-26 11:00|  179|
| Close|Jul-26 12:00|   59|
|  Open|Jul-26 12:00|  188|
+------+------------+-----+



sqlstr = select action, date_format(window.end, "MMM-dd HH:mm") as time, count from counts order by time, action
df = [action: string, time: string ... 1 more field]



We see the timeline of windowed counts (similar to the static one earlier) building up. If we run this interactive query repeatedly, each run shows the latest counts, which the streaming query keeps updating in the background.

Also, let's see the total number of "opens" and "closes".





In [22]:
val sqlstr = "select action, sum(count) as total_count from counts group by action order by action"
val df = spark.sql(sqlstr)
df.show()

+------+-----------+
|action|total_count|
+------+-----------+
| Close|         70|
|  Open|        367|
+------+-----------+



sqlstr = select action, sum(count) as total_count from counts group by action order by action
df = [action: string, total_count: bigint]



If you keep running the above query repeatedly, you will always find that the number of "opens" is more than the number of "closes", as expected in a data stream where a "close" always appears after its corresponding "open". This shows that Structured Streaming ensures prefix integrity. Read the blog posts linked below if you want to know more.
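
The property can be illustrated with a toy model in plain Scala (no Spark required). The sketch below is not how Spark implements prefix integrity; it only shows what the guarantee means for this query: if results are always computed over a prefix of the input (the first file, then the first two files, and so on), the running number of "opens" can never fall behind the number of "closes". The object PrefixIntegritySketch, its functions, and the tiny event log are all hypothetical and made up for illustration.

```scala
object PrefixIntegritySketch {
  // Hypothetical event log, split into files in arrival order.
  // Every "Close" appears after its corresponding "Open".
  val files: Seq[Seq[String]] = Seq(
    Seq("Open", "Open", "Close"),
    Seq("Open", "Close"),
    Seq("Open")
  )

  // Running (opens, closes) totals after each file has been fully processed,
  // i.e. the result computed over each prefix of the input.
  def runningTotals(batches: Seq[Seq[String]]): Seq[(Int, Int)] =
    batches
      .scanLeft((0, 0)) { case ((opens, closes), batch) =>
        (opens + batch.count(_ == "Open"), closes + batch.count(_ == "Close"))
      }
      .tail // drop the initial (0, 0) state

  // True if, at every prefix, closes never outnumber opens.
  def opensNeverBehind(totals: Seq[(Int, Int)]): Boolean =
    totals.forall { case (opens, closes) => opens >= closes }
}
```

For the made-up log above, the running totals are (2, 1), (3, 2) and (4, 2), so the check holds after every file.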

Note that there are only a few files, so once all of them have been consumed there will be no further updates to the counts. Rerun the query if you want to interact with the streaming query again.

Finally, you can stop the query running in the background, either by clicking on the 'Cancel' link in the cell of the query, or by executing query.stop(). Either way, when the query is stopped, the status of the corresponding cell above will automatically update to TERMINATED.

In [24]:
query.stop()