### grp

# Spark: The Definitive Guide

## PART 5: Streaming

## dataPaths

In [1]:
activityDataSample = '/Users/grp/sparkTheDefinitiveGuide/data/activity-data-sample/'
checkpointPath = '/Users/grp/sparkTheDefinitiveGuide/checkpoint/'

## _Chapter #20 - Stream Processing Fundamentals_

-  **Stream Processing**:
    -  continuously incorporating new data to compute results
    -  input data is unbounded (no predetermined beginning or end)
    -  Use Cases (alerts, real-time transaction reporting, incremental ETL, online ML)
    -  Challenges (input records might arrive out of order [timestamps], handling machine failures, joining with other data sources)   
    <br>
    -  **Event-Time vs Processing Time**:
        -  ET (process data based on timestamp records within source data)
        -  PT (process data based on when streaming application receives records)   
        <br>
    -  **Continuous Processing vs Micro-Batch Execution**:
        -  CP (read records one by one from input source for low latency benefits)
        -  MBE (read records in small batches from input source for higher throughput benefits but higher latency)
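
The difference between event time and processing time is easiest to see with out-of-order records. The sketch below is plain Python, not Spark, and the three records are made-up values: each pairs the timestamp stamped at the source with the time the application received it, and the same records are counted per minute both ways.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (event time stamped at the source, arrival time at the application).
records = [
    (datetime(2015, 2, 23, 4, 18), datetime(2015, 2, 23, 4, 21)),  # arrived 3 min late
    (datetime(2015, 2, 23, 4, 20), datetime(2015, 2, 23, 4, 20)),  # arrived on time
    (datetime(2015, 2, 23, 4, 19), datetime(2015, 2, 23, 4, 21)),  # arrived 2 min late
]

def per_minute_counts(timestamps):
    """Count how many timestamps fall into each minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)

# Event time: group by when each record was created.
by_event_time = per_minute_counts(event for event, _ in records)
# Processing time: group by when each record reached the application.
by_processing_time = per_minute_counts(arrival for _, arrival in records)

print(by_event_time)
print(by_processing_time)
```

Grouped by event time, each minute holds one record; grouped by processing time, the two late records pile into the 04:21 bucket, which is why the two notions can give different answers for the same data.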

## _Chapter #21 - Structured Streaming Basics_

-  SS engine takes care of running queries incrementally and continuously as new data arrives into the system
-  SS ensures data is processed exactly once and provides fault tolerance through checkpointing and write-ahead logs
-  SS treats a stream as a "table" that is continuously appended to, and periodically checks active streams for new data to update the results
-  SS supports Spark Transformations and Actions (starting a stream)

-  **Structured Streaming Features**:   
<br>
    -  **Input Sources (streams source entry point)**:
         -  Kafka
         -  distributed file system (HDFS, S3)
         -  socket for testing purposes  
         <br>
    -  **Sinks (streams target result destination)**:
         -  Kafka
         -  many file formats
         -  foreach sink for running arbitrary computation on the output records
         -  console sink for testing
         -  memory sink for debugging   
         <br>
    -  **Output Modes (how to write data to sink)**:
         -  Append (only add new records to output sink based on trigger; aggregations are supported only when a watermark lets Spark finalize a row, because appended rows cannot be changed later)
         -  Update (update changed records in place [rows different from previous write are written out to sink])
         -  Complete (rewrite full output to result table [useful for data where all rows are expected to change])   
         <br>
    -  **Triggers**:
        -  how frequently data is output to the sink
        -  ex: trigger duration of 1 min will fire at 12:00, 12:01, 12:02, etc.
        -  Spark will wait until next trigger if trigger time is missed
        -  "Once Trigger" can be used to run streaming job manually to import new data occasionally   
        <br>
    -  **Event-Time Processing**:
        -  processes data based on timestamp column in source DF   
        <br>
    -  **Watermarks**:
        -  feature specifying how late (delays) stream expects to see data in event time
        -  limit how long stream needs to remember old data
        -  used frequently with event time windows (waiting until the watermark for window has passed)
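
To make the three output modes concrete, here is a small plain-Python simulation of what a sink might receive after one trigger. It is a simplified sketch, not Spark's implementation: `rows_to_emit`, `previous`, and `current` are hypothetical names, and the append branch simply treats any key not written before as a new row, which glosses over how Spark waits for a watermark before finalizing aggregated rows.

```python
def rows_to_emit(mode, previous, current):
    """Return the rows a sink would receive for one trigger.

    previous/current map a group key to its aggregate value before and
    after the trigger. This is a simulation, not a Spark API.
    """
    if mode == "complete":
        # Rewrite the whole result table.
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed since the last write.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows never written before; rows already written are never revised.
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")

# Hypothetical running counts per activity before and after a trigger.
previous = {"sit": 10, "stand": 7}
current = {"sit": 12, "stand": 7, "walk": 3}

for mode in ("complete", "update", "append"):
    print(mode, rows_to_emit(mode, previous, current))
```

Complete returns all three rows, update returns only `sit` and `walk`, and append returns only `walk`.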

### Sources and Sinks:
-  _File Source & Sink_   
<br>
__Disclaimer: if new files are manually written directly into a streaming job's input directory, Spark may process them while they are still partially written. New files should therefore be written to an external directory first, then moved into the stream's source directory when the stream is inactive.__  
<br>
-  _Kafka Source & Sink_  
    -  acts as a distributed buffer
    -  stores streams of records in categories called _topics_
    -  each record consists of a key, a value, and a timestamp
    -  reading data is called subscribing to a topic and writing data is called publishing to a topic   
    <br>
-  _Socket Source_
    -  ability to send data to streams via TCP sockets
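
The file source disclaimer above boils down to "write elsewhere, then move". A minimal sketch of that pattern on a local filesystem follows; the directory names, the file name, and the `publish_atomically` helper are all hypothetical, and the single-step move relies on both directories being on the same filesystem.

```python
import os
import tempfile

def publish_atomically(staging_dir, source_dir, filename, payload):
    """Write a file in staging_dir, then move it into source_dir in one step."""
    staged_path = os.path.join(staging_dir, filename)
    with open(staged_path, "w") as handle:
        handle.write(payload)
    final_path = os.path.join(source_dir, filename)
    # os.replace renames in a single step when both paths share a filesystem,
    # so a reader of source_dir never sees a half-written file.
    os.replace(staged_path, final_path)
    return final_path

# Hypothetical layout: a staging directory next to the stream's source directory.
base = tempfile.mkdtemp()
staging_dir = os.path.join(base, "staging")
source_dir = os.path.join(base, "source")
os.makedirs(staging_dir)
os.makedirs(source_dir)

published = publish_atomically(staging_dir, source_dir, "part-0001.json", '{"gt": "stand"}\n')
print(published)
```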

In [2]:
'''
# reading from Kafka

# subscribe to 1 topic
df1 = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("subscribe", "topic1")\
.load()

# subscribe to multiple topics
df1 = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("subscribe", "topic1,topic2")\
.load()

# subscribe to a pattern
df1 = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("subscribePattern", "topic.*")\
.load()

# writing to Kafka

df1.selectExpr("topic", "CAST(key as STRING)", "CAST(value as STRING)")\
.writeStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("checkpointLocation", "...")\
.start()

df1.selectExpr("CAST(key as STRING)", "CAST(value as STRING)")\
.writeStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("checkpointLocation", "...")\
.option("topic", "topic1")\
.start()
'''


In [3]:
'''
# reading from socket

socketDF = spark.readStream.format("socket")\
.option("host", "localhost").option("port", 9999).load()

# open a listening socket in a terminal to send data (shell command, not Python)

nc -lk 9999

# console sink testing

df.writeStream.format("console").start()

# memory sink testing

df.writeStream.format("memory").queryName("test_table").start()

'''


### _Chapter #21 Exercises (Structured Streaming)_

In [4]:
static = spark.read.json(activityDataSample)

In [5]:
dataSchema = static.schema

In [6]:
static.printSchema()

root
 |-- Arrival_Time: long (nullable = true)
 |-- Creation_Time: long (nullable = true)
 |-- Device: string (nullable = true)
 |-- Index: long (nullable = true)
 |-- Model: string (nullable = true)
 |-- User: string (nullable = true)
 |-- gt: string (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)



In [7]:
static.show(3)

+-------------+-------------------+--------+-----+------+----+-----+------------+-------------+-------------+
| Arrival_Time|      Creation_Time|  Device|Index| Model|User|   gt|           x|            y|            z|
+-------------+-------------------+--------+-----+------+----+-----+------------+-------------+-------------+
|1424686734968|1424688581023530396|nexus4_2|    2|nexus4|   g|stand| 6.866455E-4|  0.033355713|  0.030136108|
|1424686735183|1424686733186371836|nexus4_1|   37|nexus4|   g|stand|0.0014038086|-0.0027008057|-0.0124053955|
|1424686735388|1424688581441712769|nexus4_2|   85|nexus4|   g|stand| 6.866455E-4|  0.011993408|-0.0029754639|
+-------------+-------------------+--------+-----+------+----+-----+------------+-------------+-------------+
only showing top 3 rows



In [8]:
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1)\
.json(activityDataSample)

In [9]:
activityCounts = streaming.groupBy("gt").count()

In [10]:
spark.conf.set("spark.sql.shuffle.partitions", 5)

In [11]:
activityQuery = activityCounts.writeStream.queryName("activity_counts")\
.format("memory").outputMode("complete")\
.start()
#activityQuery.awaitTermination()

In [12]:
#activityQuery.stop()

In [13]:
spark.streams.active

[<pyspark.sql.streaming.StreamingQuery at 0x10dfb8550>]

In [14]:
from time import sleep
for x in range(3):
    spark.sql("select * from activity_counts").show(3)
    sleep(30)

+---+-----+
| gt|count|
+---+-----+
+---+-----+

+----------+-----+
|        gt|count|
+----------+-----+
|       sit|49238|
|     stand|45539|
|stairsdown|37459|
+----------+-----+
only showing top 3 rows

+----------+-----+
|        gt|count|
+----------+-----+
|       sit|49238|
|     stand|45539|
|stairsdown|37459|
+----------+-----+
only showing top 3 rows



In [15]:
from pyspark.sql.functions import expr

### _Select & Filter Example_

In [16]:
simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'"))\
.where("stairs")\
.where("gt is not null")\
.select("gt", "model", "arrival_time", "creation_time")\
.writeStream\
.queryName("simple_transform")\
.format("memory")\
.outputMode("append")\
.start()

In [17]:
#simpleTransform.stop()

In [18]:
from time import sleep
for x in range(3):
    spark.sql("select count(*) from simple_transform").show()
    sleep(30)

+--------+
|count(1)|
+--------+
|       0|
+--------+

+--------+
|count(1)|
+--------+
|   79268|
+--------+

+--------+
|count(1)|
+--------+
|   79268|
+--------+



### _Aggregation Example_

In [19]:
deviceModelStats = streaming.cube("gt", "model").avg()\
.drop("avg(Arrival_time)")\
.drop("avg(Creation_time)")\
.drop("avg(Index)")\
.writeStream.queryName("device_counts").format("memory")\
.outputMode("complete")\
.start()

In [20]:
#deviceModelStats.stop()

In [21]:
from time import sleep
sleep(10)

In [22]:
spark.sql("select * from device_counts").show(10)

+-----+------+--------------------+--------------------+--------------------+
|   gt| model|              avg(x)|              avg(y)|              avg(z)|
+-----+------+--------------------+--------------------+--------------------+
|  sit|  null|-5.27535491537025...|3.678924478593777...|-1.48893910179131...|
|stand|  null|-4.10214448224599...|3.879979054370976E-4|1.904237520345197...|
|  sit|nexus4|-5.27535491537025...|3.678924478593777...|-1.48893910179131...|
|stand|nexus4|-4.10214448224599...|3.879979054370976E-4|1.904237520345197...|
| null|  null|-0.00684720545658...|3.304535732167207...|0.006290737756605493|
| null|  null|7.216251708678295E-4|-0.00607285249718343|-0.00815160685941575|
| walk|  null|-0.00357654462748...|0.004378921096641145|0.002121673306851616|
| null|nexus4|-0.00684720545658...|3.304535732167207...|0.006290737756605493|
| null|nexus4|7.216251708678295E-4|-0.00607285249718343|-0.00815160685941575|
| bike|  null|0.022771987227411025| -0.0093323141311089|-0.08289

### _Joins Example_

In [23]:
historicalAgg = static.groupBy("gt", "model").avg()
deviceModelStats = streaming.drop("Arrival_Time", "Creation_Time", "Index")\
.cube("gt", "model").avg()\
.join(historicalAgg, ["gt", "model"])\
.writeStream.queryName("device_join").format("memory")\
.outputMode("complete")\
.start()

In [24]:
#deviceModelStats.stop()

In [25]:
for i in spark.sql("select * from device_counts").take(10): print(i)

Row(gt='sit', model=None, avg(x)=-0.0005275354915370251, avg(y)=0.00036789244785937776, avg(z)=-0.00014889391017913116)
Row(gt='stand', model=None, avg(x)=-0.0004102144482245993, avg(y)=0.0003879979054370976, avg(z)=0.00019042375203451976)
Row(gt='sit', model='nexus4', avg(x)=-0.0005275354915370251, avg(y)=0.00036789244785937776, avg(z)=-0.00014889391017913116)
Row(gt='stand', model='nexus4', avg(x)=-0.0004102144482245993, avg(y)=0.0003879979054370976, avg(z)=0.00019042375203451976)
Row(gt='null', model=None, avg(x)=-0.006847205456583953, avg(y)=0.00033045357321672076, avg(z)=0.006290737756605493)
Row(gt=None, model=None, avg(x)=0.0007216251708678295, avg(y)=-0.00607285249718343, avg(z)=-0.00815160685941575)
Row(gt='walk', model=None, avg(x)=-0.003576544627480005, avg(y)=0.004378921096641145, avg(z)=0.002121673306851616)
Row(gt='null', model='nexus4', avg(x)=-0.006847205456583953, avg(y)=0.00033045357321672076, avg(z)=0.006290737756605493)
Row(gt=None, model='nexus4', avg(x)=0.00072162

### _Trigger Example_

In [26]:
trigger = activityCounts.writeStream.trigger(processingTime = '5 seconds')\
.queryName("t1").format("console").outputMode("complete").start()

In [27]:
triggerOnce = activityCounts.writeStream.trigger(once=True)\
.queryName("t2").format("console").outputMode("complete").start()

In [28]:
trigger.stop()
triggerOnce.stop()

## _Chapter #22 - Event-Time and Stateful Processing_

-  **Event-Time** (analyze information with respect to the time that it was created, NOT the time it was processed):   
<br>
__Summary: the order of the series of events in the processing system does not guarantee an ordering in event time hence it is imperative in certain use cases to operate on event time (actual real timestamp on data) rather than the timestamp when the data arrives and is processed in the system__  
<br>    
-  **Stateful Processing**:   
    - used to capture intermediate information in a "state store"
    - intermediate state is kept in an in-memory state store and made fault tolerant by saving it to the checkpoint directory   
    <br>
-  **Event-Time Watermarks**:
    - amount of time following a given event or set of events after which we do not expect to see any more data
    - used to age-out data in a stream to avoid overwhelming system over a long period of time
    - ex: a 10 min watermark means that any event arriving more than 10 "event-time" min behind the latest event seen so far should be ignored
    - if watermark is not specified Spark will maintain that data in memory forever
    - allows Spark to free pending objects from memory
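
The watermark rule can be sketched in a few lines of plain Python. This is an illustrative simplification with a hypothetical `WatermarkFilter` class: it moves the watermark after every record, whereas a micro-batch engine would typically advance it between batches.

```python
from datetime import datetime, timedelta

class WatermarkFilter:
    """Track the latest event time seen and reject events older than the watermark."""

    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = None

    def accept(self, event_time):
        # Watermark = latest event time observed so far minus the allowed delay.
        if self.max_event_time is not None and event_time < self.max_event_time - self.delay:
            return False  # too late: state for this time range may already be freed
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True

# Hypothetical event times with a 10 minute watermark.
late_filter = WatermarkFilter(timedelta(minutes=10))
results = [
    late_filter.accept(datetime(2015, 2, 23, 12, 0)),   # first event
    late_filter.accept(datetime(2015, 2, 23, 12, 30)),  # advances the watermark to 12:20
    late_filter.accept(datetime(2015, 2, 23, 12, 25)),  # late, but inside the watermark
    late_filter.accept(datetime(2015, 2, 23, 12, 15)),  # behind the watermark: ignored
]
print(results)
```

Because anything older than the watermark is rejected, state for time ranges behind it can be released instead of being held in memory forever.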

### _Chapter #22 Exercises (Structured Streaming)_

### _Event-Time Example_

In [29]:
static = spark.read.json(activityDataSample)
timestampLatency = static\
.selectExpr("*", "cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")\
.orderBy("event_time")
streaming = spark\
.readStream\
.schema(static.schema)\
.option("maxFilesPerTrigger", 10)\
.json(activityDataSample)
streaming.printSchema()

root
 |-- Arrival_Time: long (nullable = true)
 |-- Creation_Time: long (nullable = true)
 |-- Device: string (nullable = true)
 |-- Index: long (nullable = true)
 |-- Model: string (nullable = true)
 |-- User: string (nullable = true)
 |-- gt: string (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)



In [30]:
timestampLatency.select("event_time").show(10, truncate=False)

+--------------------------+
|event_time                |
+--------------------------+
|2015-02-21 18:41:40.120513|
|2015-02-23 04:18:53.176179|
|2015-02-23 04:18:53.181336|
|2015-02-23 04:18:53.186371|
|2015-02-23 04:18:53.382813|
|2015-02-23 04:18:53.579072|
|2015-02-23 04:18:53.584107|
|2015-02-23 04:18:53.589417|
|2015-02-23 04:18:53.785493|
|2015-02-23 04:18:53.986909|
+--------------------------+
only showing top 10 rows
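
The `selectExpr` above treats `Creation_Time` as nanoseconds since the Unix epoch. As a sanity check, the same conversion can be sketched in plain Python using one `Creation_Time` value from the `static.show(3)` output; the helper name `creation_time_to_event_time` is hypothetical, and the result is expressed in UTC.

```python
from datetime import datetime, timedelta, timezone

def creation_time_to_event_time(creation_time_ns):
    """Convert epoch nanoseconds to a UTC datetime (microsecond precision)."""
    # Integer arithmetic avoids the rounding that a float division could introduce.
    seconds, nanos = divmod(creation_time_ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(microseconds=nanos // 1000)

event_time = creation_time_to_event_time(1424686733186371836)
print(event_time.isoformat())
```

This yields 2015-02-23 10:18:53.186371 UTC. The notebook output lists 04:18:53.186371 for the same record, presumably because the Spark session displayed timestamps in a time zone six hours behind UTC.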



### _Window Event-Time Example_

In [31]:
from pyspark.sql.functions import window, col

In [32]:
withEventTime = streaming\
.selectExpr("*", "cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")
withEventTime = withEventTime.groupBy(window(col("event_time"), "10 minutes")).count()\
.writeStream\
.queryName("pyevents_per_window")\
.format("memory")\
.outputMode("complete")\
.start()

In [33]:
#withEventTime.stop()

In [34]:
spark.sql("select * from pyevents_per_window").printSchema()

root
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- count: long (nullable = false)



In [35]:
from time import sleep
sleep(30)

In [36]:
spark.sql("select * from pyevents_per_window").show(10, truncate=False)

+---------------------------------------------+-----+
|window                                       |count|
+---------------------------------------------+-----+
|[2015-02-23 04:40:00.0,2015-02-23 04:50:00.0]|4409 |
|[2015-02-24 05:50:00.0,2015-02-24 06:00:00.0]|7519 |
|[2015-02-24 07:00:00.0,2015-02-24 07:10:00.0]|6650 |
|[2015-02-23 07:20:00.0,2015-02-23 07:30:00.0]|5343 |
|[2015-02-21 18:40:00.0,2015-02-21 18:50:00.0]|1    |
|[2015-02-23 06:30:00.0,2015-02-23 06:40:00.0]|5059 |
|[2015-02-24 05:20:00.0,2015-02-24 05:30:00.0]|5697 |
|[2015-02-23 04:20:00.0,2015-02-23 04:30:00.0]|4966 |
|[2015-02-24 06:20:00.0,2015-02-24 06:30:00.0]|6688 |
|[2015-02-24 08:00:00.0,2015-02-24 08:10:00.0]|7469 |
+---------------------------------------------+-----+
only showing top 10 rows



In [37]:
withEventTime = streaming\
.selectExpr("*", "cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")
withEventTime2 = withEventTime.groupBy(window(col("event_time"), "10 minutes"), "User").count()\
.writeStream\
.queryName("pyevents_per_window2")\
.format("memory")\
.outputMode("complete")\
.start()

In [38]:
#withEventTime2.stop()

In [39]:
from time import sleep
sleep(30)

In [40]:
spark.sql("select * from pyevents_per_window2").show(10, truncate=False)

+---------------------------------------------+----+-----+
|window                                       |User|count|
+---------------------------------------------+----+-----+
|[2015-02-23 04:20:00.0,2015-02-23 04:30:00.0]|g   |4966 |
|[2015-02-24 07:30:00.0,2015-02-24 07:40:00.0]|b   |4087 |
|[2015-02-24 06:20:00.0,2015-02-24 06:30:00.0]|f   |6688 |
|[2015-02-24 09:00:00.0,2015-02-24 09:10:00.0]|e   |4998 |
|[2015-02-24 07:00:00.0,2015-02-24 07:10:00.0]|f   |1663 |
|[2015-02-23 07:40:00.0,2015-02-23 07:50:00.0]|a   |5537 |
|[2015-02-24 08:50:00.0,2015-02-24 09:00:00.0]|e   |6384 |
|[2015-02-23 08:30:00.0,2015-02-23 08:40:00.0]|h   |4723 |
|[2015-02-24 08:10:00.0,2015-02-24 08:20:00.0]|e   |3384 |
|[2015-02-24 07:00:00.0,2015-02-24 07:10:00.0]|d   |4987 |
+---------------------------------------------+----+-----+
only showing top 10 rows
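
A tumbling window assigns each event to exactly one fixed-size bucket. The plain-Python sketch below shows the arithmetic under the assumption that buckets are aligned to the Unix epoch; `tumbling_window` is a hypothetical helper, not a Spark function, and the sample timestamp is the earliest `event_time` shown earlier.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def tumbling_window(event_time, size):
    """Return the [start, end) window of length `size` that contains event_time."""
    # Floor the elapsed time to a whole number of window lengths.
    start = EPOCH + ((event_time - EPOCH) // size) * size
    return start, start + size

start, end = tumbling_window(datetime(2015, 2, 21, 18, 41, 40, 120513), timedelta(minutes=10))
print(start, end)
```

The event lands in the 18:40 to 18:50 bucket, matching the single-count window in the `pyevents_per_window` output.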



### _Sliding Window Event-Time Example_

In [41]:
# 10 min window starting every 5 min
withEventTime = streaming\
.selectExpr("*", "cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")
withEventTime3 = withEventTime.groupBy(window(col("event_time"), "10 minutes", "5 minutes")).count()\
.writeStream\
.queryName("pyevents_per_window3")\
.format("memory")\
.outputMode("complete")\
.start()

In [42]:
#withEventTime3.stop()

In [43]:
from time import sleep
sleep(30)

In [44]:
spark.sql("select * from pyevents_per_window3").show(10, truncate=False)

+---------------------------------------------+-----+
|window                                       |count|
+---------------------------------------------+-----+
|[2015-02-23 08:15:00.0,2015-02-23 08:25:00.0]|5403 |
|[2015-02-24 05:50:00.0,2015-02-24 06:00:00.0]|7519 |
|[2015-02-24 07:00:00.0,2015-02-24 07:10:00.0]|6650 |
|[2015-02-21 18:35:00.0,2015-02-21 18:45:00.0]|1    |
|[2015-02-23 06:30:00.0,2015-02-23 06:40:00.0]|5059 |
|[2015-02-23 04:20:00.0,2015-02-23 04:30:00.0]|4966 |
|[2015-02-23 07:25:00.0,2015-02-23 07:35:00.0]|4603 |
|[2015-02-24 06:30:00.0,2015-02-24 06:40:00.0]|6289 |
|[2015-02-24 07:10:00.0,2015-02-24 07:20:00.0]|5234 |
|[2015-02-24 08:25:00.0,2015-02-24 08:35:00.0]|10126|
+---------------------------------------------+-----+
only showing top 10 rows
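
With a sliding window, one event belongs to several overlapping windows. The sketch below is plain Python with a hypothetical `sliding_windows` helper, again assuming windows are aligned to the Unix epoch; it lists every 10 minute window, starting every 5 minutes, that contains a given event.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def sliding_windows(event_time, size, slide):
    """Return every [start, end) window of length `size`, starting every `slide`, containing event_time."""
    # Latest window start at or before the event.
    start = EPOCH + ((event_time - EPOCH) // slide) * slide
    windows = []
    # Walk backwards while the window still covers the event.
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

windows = sliding_windows(
    datetime(2015, 2, 21, 18, 41, 40, 120513),
    size=timedelta(minutes=10),
    slide=timedelta(minutes=5),
)
for window_start, window_end in windows:
    print(window_start, window_end)
```

The event falls into both the 18:35 to 18:45 and the 18:40 to 18:50 windows, which is why the sliding output contains windows the tumbling output does not.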



### _Event-Time Watermark Example_

In [45]:
# 30 min watermark (SS accepts data up to 30 min behind the latest event time seen before finalizing a window) on a 10 min sliding window starting every 5 min
withEventTime = streaming\
.selectExpr("*", "cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")
watermark = withEventTime\
.withWatermark("event_time", "30 minutes")\
.groupBy(window(col("event_time"), "10 minutes", "5 minutes")).count()\
.writeStream\
.queryName("pyevents_watermark")\
.format("memory")\
.outputMode("complete")\
.start()

In [46]:
#watermark.stop()

In [47]:
from time import sleep
sleep(30)

In [48]:
spark.sql("select * from pyevents_watermark").show(10, truncate=False)

+---------------------------------------------+-----+
|window                                       |count|
+---------------------------------------------+-----+
|[2015-02-23 08:15:00.0,2015-02-23 08:25:00.0]|5403 |
|[2015-02-24 05:50:00.0,2015-02-24 06:00:00.0]|7519 |
|[2015-02-24 07:00:00.0,2015-02-24 07:10:00.0]|6650 |
|[2015-02-21 18:35:00.0,2015-02-21 18:45:00.0]|1    |
|[2015-02-23 06:30:00.0,2015-02-23 06:40:00.0]|5059 |
|[2015-02-23 04:20:00.0,2015-02-23 04:30:00.0]|4966 |
|[2015-02-23 07:25:00.0,2015-02-23 07:35:00.0]|4603 |
|[2015-02-24 06:30:00.0,2015-02-24 06:40:00.0]|6289 |
|[2015-02-24 07:10:00.0,2015-02-24 07:20:00.0]|5234 |
|[2015-02-24 08:25:00.0,2015-02-24 08:35:00.0]|10126|
+---------------------------------------------+-----+
only showing top 10 rows



### _Dropping Duplicates Example_

In [49]:
from pyspark.sql.functions import expr

In [50]:
dropDuplicates = withEventTime\
.withWatermark("event_time", "5 minutes")\
.dropDuplicates(["User", "event_time"])\
.groupBy("User")\
.count()\
.writeStream\
.queryName("pydeduplicated")\
.format("memory")\
.outputMode("complete")\
.start()

In [51]:
#dropDuplicates.stop()

In [52]:
from time import sleep
sleep(30)

In [53]:
spark.sql("select * from pydeduplicated").show(10, truncate=False)

+----+-----+
|User|count|
+----+-----+
|a   |32340|
|b   |36492|
|c   |30860|
|g   |36671|
|h   |30932|
|e   |38412|
|f   |36824|
|d   |32496|
|i   |37020|
+----+-----+
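
Combining a watermark with deduplication keeps the set of remembered keys bounded. A rough plain-Python sketch of the idea follows; `deduplicate` is a hypothetical helper and the records are made up, and it evicts keys after every record rather than per batch, so it illustrates the concept rather than Spark's actual state management.

```python
from datetime import datetime, timedelta

def deduplicate(records, delay):
    """Yield each (user, event_time) once, forgetting keys older than the watermark."""
    seen = set()
    max_event_time = None
    for user, event_time in records:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay
        if event_time < watermark:
            continue  # too late to be tracked
        # Evict keys the watermark has passed so the state stays bounded.
        seen = {key for key in seen if key[1] >= watermark}
        if (user, event_time) not in seen:
            seen.add((user, event_time))
            yield user, event_time

t0 = datetime(2015, 2, 23, 4, 18, 53)
# Hypothetical input with one exact duplicate of ("a", t0).
records = [("a", t0), ("a", t0), ("b", t0), ("a", t0 + timedelta(minutes=1))]
unique = list(deduplicate(records, timedelta(minutes=5)))
print(unique)
```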



## _Chapter #23 - Structured Streaming in Production_

-  **Fault Tolerance & Checkpointing**:
    - ability to recover SS applications via checkpointing and write-ahead logs stored in an HDFS / S3 directory
    - simply restart the SS application (make sure the checkpoint dir exists) to recover its state and resume processing data from where it crashed
    - checkpoint dir stores the stream's processed information   
    <br>
-  **Sizing & Rescaling**:
    - the cluster or application needs to scale up if input rate (data flowing in) > processing rate, i.e. the stream is falling behind and cannot handle the load   
    <br>
-  **Other Production Elements**:
    - Alerting
    - Advanced Monitoring
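
The "input rate > processing rate" check can be sketched against the progress dictionaries a streaming query reports. The helper name `is_falling_behind` and the sample numbers below are hypothetical; the two field names are the ones that appear in the `query.recentProgress` output of this notebook.

```python
def is_falling_behind(progress):
    """Return True when rows arrive faster than the query processes them."""
    input_rate = progress.get("inputRowsPerSecond")
    processed_rate = progress.get("processedRowsPerSecond")
    if input_rate is None or processed_rate is None:
        return False  # this progress report does not carry both rates
    return input_rate > processed_rate

# Hypothetical progress reports shaped like entries of query.recentProgress.
healthy = {"inputRowsPerSecond": 1000.0, "processedRowsPerSecond": 5000.0}
overloaded = {"inputRowsPerSecond": 9000.0, "processedRowsPerSecond": 5000.0}
first_batch = {"processedRowsPerSecond": 5000.0}  # no input rate reported

print(is_falling_behind(healthy), is_falling_behind(overloaded), is_falling_behind(first_batch))
```

A check like this could feed the alerting mentioned above, for example by flagging a query that stays behind for several consecutive progress reports.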

### _Chapter #23 Exercises (Structured Streaming)_

### _Checkpointing Example_

In [54]:
static = spark.read.json(activityDataSample)
streaming = spark\
.readStream\
.schema(static.schema)\
.option("maxFilesPerTrigger", 1)\
.json(activityDataSample)\
.groupBy("gt")\
.count()
query = streaming\
.writeStream\
.outputMode("complete")\
.option("checkpointLocation", checkpointPath)\
.queryName("test_python_stream")\
.format("memory")\
.start()

In [55]:
#query.stop()

In [56]:
from time import sleep
sleep(30)

In [57]:
query.status

{'isDataAvailable': False,
 'isTriggerActive': False,
 'message': 'Waiting for data to arrive'}

In [58]:
query.recentProgress

[{'durationMs': {'addBatch': 175,
   'getBatch': 4,
   'getOffset': 34,
   'queryPlanning': 3,
   'triggerExecution': 248,
   'walCommit': 32},
  'id': '5427238c-b8b2-41c9-b563-f6b99908543e',
  'name': 'test_python_stream',
  'numInputRows': 78012,
  'processedRowsPerSecond': 314564.51612903224,
  'runId': '26faf9f6-350d-4de9-ab4d-784ecb57300d',
  'sink': {'description': 'MemorySink'},
  'sources': [{'description': 'FileStreamSource[file:/Users/grp/sparkTheDefinitiveGuide/data/activity-data-sample]',
    'endOffset': {'logOffset': 0},
    'numInputRows': 78012,
    'processedRowsPerSecond': 314564.51612903224,
    'startOffset': None}],
  'stateOperators': [{'numRowsTotal': 7, 'numRowsUpdated': 7}],
  'timestamp': '2018-09-27T22:49:03.869Z'},
 {'durationMs': {'addBatch': 178,
   'getBatch': 4,
   'getOffset': 34,
   'queryPlanning': 2,
   'triggerExecution': 253,
   'walCommit': 34},
  'id': '5427238c-b8b2-41c9-b563-f6b99908543e',
  'inputRowsPerSecond': 277619.2170818505,
  'name': 't
