# Spark Streaming Simple Example

__Requirement__: Count the occurrences of "HELLOWORLD" (case-sensitive) in each RDD (that is, in every batch interval) of the data streamed from a server

__Learning Objective:__

* Create a DStream using `socketTextStream` and start listening for streaming data

### Prerequisites

Run Netcat (a small utility found in most Unix-like systems) as a data server by using: `nc -lc 9999`

For Windows: [download netcat](https://joncraton.org/blog/46/netcat-for-windows) and run: `nc -l -p 9999`
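Where Netcat is not available, a small Python script can stand in for the data server. The following is only a sketch under stated assumptions: `serve_lines` is a hypothetical helper (not part of Spark or of this notebook), it serves a fixed list of lines to the first client that connects, and then it closes the socket.

```python
import socket
import threading


def serve_lines(lines, host="127.0.0.1", port=0):
    """Send each line, newline-terminated, to the first client that connects.

    Hypothetical helper for illustration. port=0 lets the OS pick a free port;
    pass port=9999 to match the address used in this notebook.
    Returns the bound port and the background thread.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def handle():
        conn, _ = server.accept()  # blocks until a client connects
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return bound_port, thread
```

For example, `serve_lines(["HELLOWORLD foo", "bar HELLOWORLD"], port=9999)` could be called before the streaming computation is started. Unlike an interactive Netcat session, this sketch sends its lines once and then disconnects.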

### Create a StreamingContext

In [1]:
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)

# 10 = the batch interval, in seconds, of every DStream created by this StreamingContext

##### Setup Checkpoint

Checkpointing makes the created streams fault-tolerant by backing up their data at a specified frequency.

<font color="red">TODO: Configure checkpoint path</font>

In [2]:
# ssc.checkpoint("file:///tmp/spark")
ssc.checkpoint("D:\\x-temp\\temp-spark\\05")
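One way to avoid hard-coding a machine-specific checkpoint path is to derive it from the operating system's temporary directory. The sketch below assumes a local, single-machine setup. `checkpoint_uri` and the `spark-checkpoint` directory name are hypothetical choices for illustration, and whether a `file://` URI is accepted as-is may depend on the Spark and Hadoop versions in use.

```python
import tempfile
from pathlib import Path


def checkpoint_uri(subdir="spark-checkpoint"):
    """Return a file:// URI for a checkpoint directory under the OS temp folder.

    Hypothetical helper: the directory name is an arbitrary choice.
    """
    path = Path(tempfile.gettempdir()).resolve() / subdir
    path.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    return path.as_uri()


# Possible usage in place of the hard-coded path above:
# ssc.checkpoint(checkpoint_uri())
```

A temporary directory is only suitable for local experiments; on a cluster, the checkpoint directory would normally live on shared, fault-tolerant storage.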

### Create a DStream to connect to hostname:port (e.g. localhost:9999)

In [3]:
lines = ssc.socketTextStream("localhost", 9999)

The `lines` DStream represents the stream of data that will be received from the data server. Each record in this stream is a line of text.

### Perform Operations on DStream

In [4]:
# Count the "HELLOWORLD" words in every batch interval (each batch interval is one RDD)

counts = (
    lines.flatMap(lambda line: line.split(" "))     # split each line into words
         .filter(lambda word: word == "HELLOWORLD")  # exact, case-sensitive match
         .map(lambda word: (word, 1))                # pair each match with a count of 1
         .reduceByKey(lambda a, b: a + b)            # sum the counts per word
)
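To see what this chain of transformations produces for a single batch, the same steps can be mimicked in plain Python without Spark. This is only an illustrative sketch: `count_helloworld` is a hypothetical function, a list of strings stands in for one RDD, and it assumes exact, case-sensitive matching as stated in the requirement.

```python
from collections import Counter


def count_helloworld(batch_lines, target="HELLOWORLD"):
    """Mimic flatMap -> filter -> map -> reduceByKey for one batch of lines."""
    # flatMap: split every line on single spaces and flatten into one word list
    words = [word for line in batch_lines for word in line.split(" ")]
    # filter: keep only exact, case-sensitive matches
    matches = [word for word in words if word == target]
    # map + reduceByKey: count occurrences per key
    return list(Counter(matches).items())


print(count_helloworld(["HELLOWORLD foo HELLOWORLD", "helloworld bar"]))
# prints [('HELLOWORLD', 2)]
```

A batch without any match yields an empty list, which corresponds to the batches that print nothing below their timestamp in the streaming output.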

In [5]:
# Print the counts of "HELLOWORLD" words generated in every batch interval
counts.pprint()
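`pprint()` shows only a timestamp for batches that contain no match. If an explicit zero is preferred, a callback registered with `foreachRDD` is one possible alternative. The sketch below is an assumption, not part of the original notebook: `format_batch` and `log_batch` are hypothetical names, and the code relies on PySpark passing the batch time and the batch RDD to a two-argument callback.

```python
def format_batch(time, rdd):
    """Build a one-line summary for one batch; reports 0 when the batch is empty."""
    records = rdd.collect()  # list of (word, count) pairs; small for this example
    total = sum(count for _, count in records)
    return f"{time}: {total} occurrence(s) of HELLOWORLD"


def log_batch(time, rdd):
    """Callback intended for DStream.foreachRDD."""
    print(format_batch(time, rdd))


# Possible usage (register before the streaming computation is started):
# counts.foreachRDD(log_batch)
```

Calling `collect()` pulls the whole batch to the driver, which is acceptable here only because each batch holds at most one key.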

### Start Listening for Streaming Data

Run "netcat" on localhost:9999 and start entering a few lines or words, including a few occurrences of the word "HELLOWORLD"

In [None]:
ssc.start() # Start the computation

ssc.awaitTermination() # Wait for the computation to terminate

-------------------------------------------
Time: 2018-07-18 18:29:50
-------------------------------------------
('HELLOWORLD', 1)

-------------------------------------------
Time: 2018-07-18 18:30:00
-------------------------------------------
('HELLOWORLD', 2)

-------------------------------------------
Time: 2018-07-18 18:30:10
-------------------------------------------

-------------------------------------------
Time: 2018-07-18 18:30:20
-------------------------------------------
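Because `ssc.awaitTermination()` blocks until the computation is stopped, the notebook cell above never finishes on its own. A bounded run is one possible alternative for experiments. The sketch below assumes that `ssc` is a PySpark `StreamingContext`; `run_for` is a hypothetical helper name.

```python
def run_for(ssc, seconds):
    """Start the streaming computation, wait up to `seconds`, then stop it.

    Hypothetical helper. Returns the result of awaitTerminationOrTimeout.
    """
    ssc.start()
    try:
        # Returns after `seconds` at the latest, or earlier if the context stops
        return ssc.awaitTerminationOrTimeout(seconds)
    finally:
        # Keep the SparkContext alive so that `sc` can be reused afterwards
        ssc.stop(stopSparkContext=False, stopGraceFully=True)


# Possible usage: process roughly six 10-second batches, then stop
# run_for(ssc, 60)
```

A stopped StreamingContext cannot be restarted, so a new one has to be created from `sc` before the example is run again.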

