# Summarizing Data over a Window - The reduceByKeyAndWindow Transformation

__Requirement__: Count the occurrences of "HELLOWORLD" messages received from a text stream over a sliding window by applying `reduceByKeyAndWindow()` to a DStream.

### Prerequisites

Run Netcat (a small utility found in most Unix-like systems) as a data server by using: `nc -lc 9999`

For Windows: [download Netcat](https://joncraton.org/blog/46/netcat-for-windows) and run: `nc -l -p 9999`
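
Where Netcat is not available, a few lines of Python can play a similar role. The following is a minimal sketch rather than a replacement for Netcat: it assumes a single client (the Spark receiver) and sends a fixed list of lines instead of reading from the keyboard. The function names `open_server` and `send_lines`, the `delay` parameter, and the sample messages are all hypothetical.

```python
import socket
import time

def open_server(host="localhost", port=9999):
    """Create a TCP socket that listens on host:port, similar to `nc -l`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server

def send_lines(server, messages, delay=0.0):
    """Wait for one client, then send each message as a newline-terminated line."""
    conn, _address = server.accept()
    with conn:
        for message in messages:
            conn.sendall((message + "\n").encode("utf-8"))
            time.sleep(delay)  # Optional pause so messages land in different batches

# Example usage (blocks until a client such as the Spark receiver connects):
# server = open_server("localhost", 9999)
# send_lines(server, ["HELLOWORLD", "other HELLOWORLD HELLOWORLD"], delay=1.0)
# server.close()
```

The server should be started before `ssc.start()` is called, so that the socket receiver finds something to connect to.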

### Create a StreamingContext and a DStream

<font color="red">TODO: Configure checkpoint path</font>

In [1]:
from pyspark.streaming import StreamingContext

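# `sc` is the SparkContext already provided by the PySpark notebook or shell
ssc = StreamingContext(sc, 2)  # 2-second batch interval

# Checkpointing must be enabled to use reduceByKeyAndWindow() with an inverse function
# ssc.checkpoint("file:///tmp/spark")
ssc.checkpoint("D:\\x-temp\\temp-spark\\07\\03")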

lines = ssc.socketTextStream("localhost", 9999)
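
The hard-coded Windows path above will not exist on other machines. As one possible way to address the TODO, the sketch below builds a checkpoint location in the system temporary directory using only the standard library. The helper name `make_checkpoint_uri` and the `spark-checkpoint-` prefix are hypothetical; returning a `file://` URI mirrors the commented-out `file:///tmp/spark` form in the cell above.

```python
import tempfile
from pathlib import Path

def make_checkpoint_uri(prefix="spark-checkpoint-"):
    """Create a fresh temporary directory and return it as a file:// URI."""
    directory = Path(tempfile.mkdtemp(prefix=prefix))
    return directory.resolve().as_uri()

checkpoint_uri = make_checkpoint_uri()
print(checkpoint_uri)
```

The returned string could then be passed to `ssc.checkpoint(...)`. A local temporary directory is only suitable for single-machine experiments; on a cluster the checkpoint directory should live on fault-tolerant storage such as HDFS.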

### Perform reduceByKeyAndWindow Transformation on DStream and Print

In [2]:
# Count the total number of "HELLOWORLD" messages over a sliding window
counts = lines.flatMap(lambda line: line.split(" "))\
              .filter(lambda word: "HELLOWORLD" in word)\
              .map(lambda word: (word, 1))\
              .reduceByKeyAndWindow(lambda a, b: a + b,    # Sums the counts of (word, 1) tuples that share the same word in the window
                                    lambda a, b: a - b,    # Inverse function: subtracts the counts of tuples that slide out of the window
                                    12, 4)                 # 12-second window length, 4-second slide interval (both multiples of the 2-second batch interval)

In [3]:
# Print the count of "HELLOWORLD" for each window
counts.pprint()
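
The role of the two functions passed to `reduceByKeyAndWindow()` can be illustrated without Spark. The sketch below is a plain-Python model, not Spark's implementation: it assumes the stream has already been reduced to one count per 4-second slide, and the helper names `window_counts_direct` and `window_counts_incremental` are hypothetical. The per-slide counts are an assumed arrival pattern, chosen because it is consistent with the sample output shown in the next section.

```python
from operator import add, sub

def window_counts_direct(slide_counts, slides_per_window):
    """Recompute every window from scratch by summing the slides it covers."""
    results = []
    for end in range(1, len(slide_counts) + 1):
        start = max(0, end - slides_per_window)
        results.append(sum(slide_counts[start:end]))
    return results

def window_counts_incremental(slide_counts, slides_per_window, func=add, inv_func=sub):
    """Update the previous window: add the entering slide, subtract the leaving one."""
    results = []
    current = 0
    for i, entering in enumerate(slide_counts):
        current = func(current, entering)
        leaving_index = i - slides_per_window
        if leaving_index >= 0:
            current = inv_func(current, slide_counts[leaving_index])
        results.append(current)
    return results

# Assumed number of "HELLOWORLD" words arriving in each 4-second slide (hypothetical data).
slide_counts = [0, 2, 3, 1, 1, 0, 0]
slides_per_window = 12 // 4  # 12-second window / 4-second slide = 3 slides

direct = window_counts_direct(slide_counts, slides_per_window)
incremental = window_counts_incremental(slide_counts, slides_per_window)
print(direct)       # [0, 2, 5, 6, 5, 2, 1]
print(incremental)  # [0, 2, 5, 6, 5, 2, 1]
```

Both approaches give the same totals. The incremental form only touches the slide entering the window and the slide leaving it, which can be cheaper for long windows; it also carries state from one window to the next, which is why a checkpoint directory is needed when an inverse function is supplied.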

### Start Listening for Streaming Data

Run Netcat on localhost:9999 and start entering a few messages that contain "HELLOWORLD".

In [None]:
ssc.start() # Start the computation

ssc.awaitTermination() # Wait for the computation to terminate

-------------------------------------------
Time: 2018-07-20 18:52:14
-------------------------------------------

-------------------------------------------
Time: 2018-07-20 18:52:18
-------------------------------------------
('HELLOWORLD', 2)

-------------------------------------------
Time: 2018-07-20 18:52:22
-------------------------------------------
('HELLOWORLD', 5)

-------------------------------------------
Time: 2018-07-20 18:52:26
-------------------------------------------
('HELLOWORLD', 6)

-------------------------------------------
Time: 2018-07-20 18:52:30
-------------------------------------------
('HELLOWORLD', 5)

-------------------------------------------
Time: 2018-07-20 18:52:34
-------------------------------------------
('HELLOWORLD', 2)

-------------------------------------------
Time: 2018-07-20 18:52:38
-------------------------------------------
('HELLOWORLD', 1)

-------------------------------------------
Time: 2018-07-20 18:52:42
-----------------