# Summarizing Data in a Stream

__Requirement:__ Count the occurrences of the word "HELLOWORLD" (case-sensitive) across an entire text stream by applying the stateful `updateStateByKey()` operation to a DStream.

__Learning Objectives:__

* Summarize data across an entire stream using `updateStateByKey()` on a DStream
* See how the stateful `updateStateByKey()` operation is used in practice

__Types of Stream Transformations:__
* Stateless - Transformations applied to each RDD (batch) independently, such as `map()`, `filter()`, and `reduceByKey()`
* Stateful - Transformations that accumulate results across multiple RDDs over a longer time interval (i.e., the entire stream or a window)
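
To make the difference concrete, here is a minimal plain-Python sketch (no Spark involved) that treats each inner list as one batch of words. The names `batches`, `stateless_count`, and `stateful_counts` are illustrative assumptions, not part of the Spark API.

```python
# Each inner list stands in for the words that arrive in one batch (one RDD).
batches = [
    ["HELLOWORLD", "foo", "HELLOWORLD"],
    ["HELLOWORLD"],
    ["bar", "HELLOWORLD", "HELLOWORLD", "HELLOWORLD"],
]

def stateless_count(batch, target="HELLOWORLD"):
    """Count matches in a single batch, ignoring all earlier batches."""
    return sum(1 for word in batch if word == target)

def stateful_counts(batches, target="HELLOWORLD"):
    """Carry a running total forward from batch to batch."""
    running = 0
    totals = []
    for batch in batches:
        running += stateless_count(batch, target)
        totals.append(running)
    return totals

per_batch = [stateless_count(b) for b in batches]
running = stateful_counts(batches)

print(per_batch)  # [2, 1, 3]
print(running)    # [2, 3, 6]
```

The stateless result forgets everything once a batch is processed, while the stateful result keeps growing, which is the behaviour this notebook needs.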

### Prerequisites

Run Netcat (a small utility found in most Unix-like systems) as a data server: `nc -lk 9999`

For Windows: [download netcat](https://joncraton.org/blog/46/netcat-for-windows) and run: `nc -l -p 9999`
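
If Netcat is not available, a small Python script can play the same role. The following is only a sketch using the standard `socket` module: the helper names `make_server` and `serve_lines` and the sample lines are hypothetical, and it serves a single client (the Spark receiver) before returning.

```python
import socket
import time

def make_server(host="localhost", port=9999):
    """Create a listening TCP socket, similar in spirit to `nc -l`."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv

def serve_lines(srv, lines, delay=1.0):
    """Wait for one client and send it each line, newline-terminated."""
    conn, _addr = srv.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
            time.sleep(delay)  # spread the lines over several batches

# Example usage (blocks until a client connects), left commented out:
# srv = make_server()
# serve_lines(srv, ["HELLOWORLD foo HELLOWORLD", "bar HELLOWORLD"])
# srv.close()
```

Unlike an interactive Netcat session, this sends a fixed list of lines, so it is better suited to repeatable test runs than to typing input by hand.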

### Create a StreamingContext and a DStream

<font color="red">TODO: Configure checkpoint path</font>

In [1]:
from pyspark.streaming import StreamingContext

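# Create a StreamingContext with a batch interval of 5 seconds
ssc = StreamingContext(sc, 5)

# Stateful operations such as updateStateByKey require a checkpoint directory
# ssc.checkpoint("file:///tmp/spark")
ssc.checkpoint("D:\\x-temp\\temp-spark\\06")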

lines = ssc.socketTextStream("localhost", 9999)
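
The checkpoint path above is machine-specific. One possible way to derive a local path that works on both Windows and Unix-like systems is sketched below with the standard library only. The helper `local_checkpoint_uri` and the directory name `spark-checkpoint` are assumptions made for illustration, and the final call is left commented out because it needs the `ssc` created above.

```python
import tempfile
from pathlib import Path

def local_checkpoint_uri(name="spark-checkpoint"):
    """Create a directory under the system temp dir and return it as a file URI."""
    path = Path(tempfile.gettempdir()) / name
    path.mkdir(parents=True, exist_ok=True)
    return path.resolve().as_uri()

checkpoint_uri = local_checkpoint_uri()
print(checkpoint_uri)

# ssc.checkpoint(checkpoint_uri)
```

A temp-directory checkpoint is only reasonable for local experiments; on a cluster the directory would normally live on a shared, fault-tolerant file system.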

### Perform Stateful Operation (e.g. updateStateByKey) on DStream and Print

In [2]:
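# Update function for updateStateByKey: adds the new values seen for a key in
# the current batch to the running total carried over from earlier batches
def countWords(newValues, lastSum):
    if lastSum is None:  # first time this key is seen, so there is no state yet
        lastSum = 0
    return sum(newValues, lastSum)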

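# Count occurrences of the word "HELLOWORLD" across the entire stream (i.e. multiple RDDs)
# Note: an exact, case-sensitive comparison is used so that words that merely
# contain "HELLOWORLD" are not counted under separate keys
counts = lines.flatMap(lambda line: line.split(" "))\
              .filter(lambda word: word == "HELLOWORLD")\
              .map(lambda word: (word, 1))\
              .updateStateByKey(countWords)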
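
To see how the update function is driven from batch to batch, the following plain-Python sketch imitates the per-key bookkeeping without Spark. `simulate_update_state_by_key` is a hypothetical helper written for this illustration, and `count_words` mirrors the `countWords` function above. The sketch assumes two documented behaviours of `updateStateByKey`: the update function is also called for keys that have existing state but no new values, and returning `None` drops the key.

```python
def count_words(new_values, last_sum):
    """Same logic as countWords above: add new values to the previous total."""
    if last_sum is None:
        last_sum = 0
    return sum(new_values, last_sum)

def simulate_update_state_by_key(batches, update_func):
    """Apply update_func per key for each batch and record the state after it."""
    state = {}
    history = []
    for batch in batches:
        # Group this batch's values by key, e.g. {"HELLOWORLD": [1, 1]}
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Visit keys with new data and keys that only have earlier state
        for key in set(state) | set(grouped):
            new_state = update_func(grouped.get(key, []), state.get(key))
            if new_state is None:
                state.pop(key, None)  # returning None removes the key
            else:
                state[key] = new_state
        history.append(dict(state))
    return history

pair = ("HELLOWORLD", 1)
batches = [[], [pair, pair], [pair], [pair, pair, pair], [pair], []]
history = simulate_update_state_by_key(batches, count_words)

for snapshot in history:
    print(snapshot)
# The last snapshot is {'HELLOWORLD': 7}
```

The final empty batch leaves the total unchanged, which is why a stream keeps printing the last count even when no new matching words arrive.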

In [3]:
# Print the running count of "HELLOWORLD" for the entire stream
counts.pprint()

### Start Listening for Streaming Data

Run Netcat on localhost:9999 and enter a few lines of text that include several occurrences of the word "HELLOWORLD".

In [None]:
ssc.start() # Start the computation

ssc.awaitTermination() # Wait for the computation to terminate

-------------------------------------------
Time: 2018-07-18 18:31:20
-------------------------------------------

-------------------------------------------
Time: 2018-07-18 18:31:25
-------------------------------------------
('HELLOWORLD', 2)

-------------------------------------------
Time: 2018-07-18 18:31:30
-------------------------------------------
('HELLOWORLD', 3)

-------------------------------------------
Time: 2018-07-18 18:31:35
-------------------------------------------
('HELLOWORLD', 6)

-------------------------------------------
Time: 2018-07-18 18:31:40
-------------------------------------------
('HELLOWORLD', 7)

-------------------------------------------
Time: 2018-07-18 18:31:45
-------------------------------------------
('HELLOWORLD', 7)

-------------------------------------------
Time: 2018-07-18 18:31:50
-------------------------------------------
('HELLOWORLD', 7)

-------------------------------------------
Time: 2018-07-18 18:31:55
-----------------