# Summarizing Data over a Window - The countByWindow Transformation

__Requirement__: Count the number of messages in each window interval by applying `countByWindow()` to a DStream

__Theory__
* Spark Streaming is a micro-batch-based streaming library
* The streaming data is divided into batches by a time slice called the batch interval
* Every batch is converted into an RDD, and this continuous stream of RDDs is represented as a DStream
* What if we need to know what happened in the last n seconds, every m seconds? For this, Spark Streaming lets us apply transformations over a sliding window of data.
* For example, suppose the batch interval is 10 seconds and we need to know what happened in the last 60 seconds, every 30 seconds. Here, 60 seconds is the window length and 30 seconds is the slide interval. Suppose the first 6 batches, A, B, C, D, E, F, make up the first window. After 30 seconds, a second window is formed containing D, E, F, G, H, I. That means 3 batches (D, E, F) are common to the first and second windows.
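
The batch arithmetic in the example above can be traced in plain Python. This is a minimal sketch, not Spark code: the helper `batches_in_window` is a hypothetical name, and it assumes windows are aligned to batch boundaries with the first window ending at 60 seconds.

```python
import string

BATCH_INTERVAL = 10   # seconds per batch
WINDOW_LENGTH = 60    # seconds covered by each window
SLIDE_INTERVAL = 30   # seconds between consecutive windows

def batches_in_window(window_end):
    """Return the labels of the batches covered by the window ending at window_end (seconds)."""
    last = window_end // BATCH_INTERVAL              # number of batches completed so far
    first = last - WINDOW_LENGTH // BATCH_INTERVAL   # index of the first batch inside the window
    return list(string.ascii_uppercase[max(first, 0):last])

first_window = batches_in_window(60)
second_window = batches_in_window(60 + SLIDE_INTERVAL)
shared = [b for b in first_window if b in second_window]

print(first_window)   # ['A', 'B', 'C', 'D', 'E', 'F']
print(second_window)  # ['D', 'E', 'F', 'G', 'H', 'I']
print(shared)         # ['D', 'E', 'F']
```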

In summary,
* Batch interval = the short duration (e.g. 10 seconds) over which incoming messages are grouped into one RDD
* Window size = the duration of the window (e.g. 60 seconds); all RDDs within a window are treated as a single combined RDD
* Sliding interval = the interval (e.g. 30 seconds) at which the window operation is performed
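
Spark Streaming expects both the window size and the sliding interval to be multiples of the batch interval. The sketch below checks that condition and converts the durations into whole numbers of batches. The function `window_in_batches` is a hypothetical helper written for illustration, not part of the Spark API.

```python
def window_in_batches(batch_interval, window_size, sliding_interval):
    """Express a window specification as whole numbers of batches.

    Raises ValueError if either duration is not a multiple of the batch interval.
    """
    for name, value in (("window size", window_size),
                        ("sliding interval", sliding_interval)):
        if value % batch_interval != 0:
            raise ValueError(
                f"{name} ({value}s) must be a multiple of "
                f"the batch interval ({batch_interval}s)"
            )
    return window_size // batch_interval, sliding_interval // batch_interval

# Values from the summary above: 10-second batches, 60-second window, 30-second slide
batches_per_window, batches_per_slide = window_in_batches(10, 60, 30)
print(batches_per_window, batches_per_slide)  # 6 3
```

With these values, each window combines 6 batches and a new window is evaluated every 3 batches.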

### Prerequisites

Run Netcat (a small utility found in most Unix-like systems) as a data server by using: `nc -lc 9999`

For Windows: [download netcat](https://joncraton.org/blog/46/netcat-for-windows) and run: `nc -l -p 9999`

### Create a StreamingContext and a DStream

<font color="red">TODO: Configure checkpoint path</font>

In [1]:
from pyspark.streaming import StreamingContext

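# Create a StreamingContext with a batch interval of 2 seconds
ssc = StreamingContext(sc, 2)

# Checkpointing must be enabled for window operations such as countByWindow
# ssc.checkpoint("file:///tmp/spark")
ssc.checkpoint("D:\\x-temp\\temp-spark\\07\\01")

# Create a DStream from the text lines received on localhost:9999
lines = ssc.socketTextStream("localhost", 9999)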
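
The checkpoint path in the cell above is specific to one machine. One possible way to address the TODO is to build the path from the system's temporary directory, as sketched below. The directory name `spark-checkpoint` is a hypothetical choice, and whether Spark accepts the resulting URI on every platform is an assumption worth verifying in your own environment.

```python
import tempfile
from pathlib import Path

# Hypothetical directory name; any writable location would do
checkpoint_dir = Path(tempfile.gettempdir()) / "spark-checkpoint"
checkpoint_dir.mkdir(parents=True, exist_ok=True)

# as_uri() turns the absolute path into a file:// URI
checkpoint_uri = checkpoint_dir.resolve().as_uri()
print(checkpoint_uri)

# Assumed usage in the cell above:
# ssc.checkpoint(checkpoint_uri)
```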

### Perform countByWindow Transformation on DStream and Print

In [2]:
# Count messages over a sliding window (window size: 12 seconds, sliding interval: 4 seconds)
counts = lines.countByWindow(12, 4)

In [3]:
# Print the message count for each window
counts.pprint()

### Start Listening for Streaming Data

Run Netcat on localhost:9999 and start entering a few messages

In [None]:
ssc.start() # Start the computation

ssc.awaitTermination() # Wait for the computation to terminate

-------------------------------------------
Time: 2018-07-19 14:17:24
-------------------------------------------

-------------------------------------------
Time: 2018-07-19 14:17:28
-------------------------------------------
3

-------------------------------------------
Time: 2018-07-19 14:17:32
-------------------------------------------
8

-------------------------------------------
Time: 2018-07-19 14:17:36
-------------------------------------------
8

-------------------------------------------
Time: 2018-07-19 14:17:40
-------------------------------------------
5

-------------------------------------------
Time: 2018-07-19 14:17:44
-------------------------------------------

-------------------------------------------
Time: 2018-07-19 14:17:48
-------------------------------------------

-------------------------------------------
Time: 2018-07-19 14:17:52
-------------------------------------------

