# PySpark - Structured Streaming with PySpark

## Introduction

This project explores how Structured Streaming works with PySpark. Machine-generated data from IoT devices, sensors and beacons is now abundant. Gaining insight from this data is increasingly important and demands a quicker response, so streaming analytics can be a significant differentiator and a business advantage.

This project combines batch and real-time processing to develop continuous applications. The data can be analysed with Spark SQL in batch or in real time; machine learning models are trained with MLlib and then scored using Spark Streaming.

Apache Spark has been widely adopted because it unifies disparate data processing paradigms such as machine learning, SQL and streaming. Companies that use it include Netflix, Uber and Pinterest.

The key abstraction in Spark Streaming is a discretised stream (DStream), which represents a stream of data divided into smaller batches. Because DStreams are built on Spark's RDDs, Spark Streaming integrates seamlessly with other Spark components such as MLlib and Spark SQL. This unification is one of the key reasons for its rapid adoption in business: developers can use a single framework for all their processing needs. In short, developers and system administrators can focus more of their energy on developing smarter solutions and applications.

More information:
- https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
- https://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop-to-optimize-customer-experience/
- https://databricks.com/session/spark-and-spark-streaming-at-netflix

## Running Spark Streaming:

Most of this project is run in system terminals. This notebook shows screenshots of each step together with a description of how things work.

## Breakdown of this Notebook

- Develop an understanding of DStreams
- Develop an understanding of Global Aggregations
- Continuous Aggregations with Structured Streaming

## 1 PySpark Machine Configuration:

Here each executor is limited to four processing cores, which is set up by the following code.

In [None]:
%%configure
{
    "executorCores" : 4
}

In [None]:
from pyspark.sql.types import *

## 2 Set Up the Correct Directory:

In [None]:
import os

# Change the Path:
path = '++++your working directory here++++/Datasets/'
os.chdir(path)
folder_pathway = os.getcwd()

# print(folder_pathway)

## 3 What is Spark Streaming?:

To allow for real-time processing, Spark's Structured Streaming is built on DataFrames, which means that streaming, machine learning and SQL workloads can be combined. These are optimised by the Spark SQL engine's Catalyst Optimiser (which also receives regular updates).

To better understand Structured Streaming, it helps to first explore the fundamentals of its predecessor, Spark Streaming.

### The diagram below shows the data flow of a Spark driver, workers, streaming sources and streaming targets (storage) in a Spark Streaming application:

In [None]:
%%local

# Import the libraries required to load and display images:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

folder_pathway = os.getcwd()
image_path = folder_pathway + "/Description Images/"

# plot the image
fig, ax1 = plt.subplots(figsize=(16,10))
image = mpimg.imread(image_path + 'Spark Streaming application Data Flow.png')
plt.imshow(image);

# print('Image source -> ')

### Description of what is happening:

1. The Spark Streaming Context is started (`ssc.start()`), and the driver runs tasks on the executors, also known as Spark workers.
2. Using the code defined in the driver through the Spark Streaming Context, the Receiver on executor 1 receives a data stream from the streaming sources (such as HDFS or Twitter; a custom receiver can also be created). The receiver divides the incoming data stream into blocks and keeps these blocks in memory.
3. As is usual in Spark, these data blocks are replicated to another executor, such as executor 2, for high availability.
4. The block ID information is transmitted to the block manager on the master node (driver). This ensures that each block of data in memory is correctly tracked and accounted for.
5. At every batch interval configured in the Spark Streaming Context (for example, every 1 second), the driver launches Spark tasks to process the blocks. Lastly, the blocks are persisted to target data stores, which can be cloud storage (S3/WASB), relational data stores (such as MySQL or PostgreSQL) or NoSQL stores.

## 4 DStreams:

Discretised Streams (DStreams) are the fundamental streaming building block and are built on top of RDDs. A DStream represents a stream of data that is divided into smaller chunks.

This section covers DStreams and how to perform global aggregations on them with stateful calculations. It then simplifies the streaming application by using Structured Streaming, gaining performance optimisations at the same time.

### The diagram below shows that these data chunks are in micro-batches of milliseconds to seconds:

This example shows how the lines of a DStream are broken down into micro-batches of one second. Each square represents a micro-batch of the events that occurred within an individual 1-second window.

In [None]:
%%local

# plot the image
fig, ax1 = plt.subplots(figsize=(16, 10))
image = mpimg.imread(image_path + 'data chunks in micro batches_1.png')
plt.imshow(image);

print('Image source -> ')

### Description of what is happening:

1. First interval, at the 1st second: there are 5 occurrences of the "blue" event and 3 occurrences of the "green" event.
2. Second interval, at the 2nd second: there is a single occurrence of the "gohawks" event.
3. Fourth interval, at the 4th second: there are 2 occurrences of the "green" event.
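
The grouping described above can be imitated in plain Python, without Spark, to make the idea of a micro-batch concrete. This is only an illustrative sketch: the timestamps and the `micro_batches` helper are hypothetical and chosen to mirror the counts in the diagram; they are not part of the Spark API.

```python
from collections import Counter, defaultdict

# Hypothetical (timestamp in seconds, event) pairs chosen to match the diagram.
events = [
    (0.0, "blue"), (0.1, "blue"), (0.2, "blue"), (0.3, "blue"), (0.4, "blue"),
    (0.5, "green"), (0.6, "green"), (0.7, "green"),
    (1.4, "gohawks"),
    (3.2, "green"), (3.7, "green"),
]


def micro_batches(timed_events, batch_interval=1.0):
    """Group events into consecutive windows of batch_interval seconds and count them."""
    batches = defaultdict(Counter)
    for timestamp, event in timed_events:
        # Events falling in the same window share the same integer index.
        batches[int(timestamp // batch_interval)][event] += 1
    return dict(batches)


counts = micro_batches(events)
for index in sorted(counts):
    print(f"interval {index + 1}: {dict(counts[index])}")
```

The third interval does not appear in the result because no events arrived during it, which matches the gap in the list above.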

## 4.1 Executing in Bash terminal:

The events above will be created and processed in a console application (Bash terminal). This section requires two open terminals.

Terminal 1: transmits the events. \
Terminal 2: receives the events.

## 4.1.1 For Terminal 1 -> Netcat Window

Use Netcat (nc) to send the blue, green and gohawks events. To begin, the following commands are used. They direct the events to port 9999, where the Spark Streaming job will detect them.

In [None]:
%%local

# plot the image
fig, ax1 = plt.subplots(figsize=(16, 10))
image = mpimg.imread(image_path + 'data chunks in micro batches_2.png')
plt.imshow(image);

print('Image source -> ')
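
The exact command appears only in the screenshot. Netcat is commonly started for this purpose with `nc -lk 9999`, which is an assumption here rather than a transcription. Where Netcat is not available, a small Python stand-in can play the same role. The sketch below is not Netcat itself, and the helper names `open_listener` and `send_events` are hypothetical.

```python
import socket


def open_listener(host="localhost", port=9999):
    """Open a listening TCP socket on the given port, similar to `nc -lk 9999`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_events(server, events):
    """Wait for one client (the streaming job) and send each event as its own line."""
    connection, _address = server.accept()
    with connection:
        for event in events:
            connection.sendall((event + "\n").encode("utf-8"))


# Example use (blocks until the streaming job connects):
# with open_listener() as server:
#     send_events(server, ["blue", "green", "gohawks"])
```

Unlike an interactive Netcat session, this sketch sends a fixed list of events and then closes the connection.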

Next, type in the desired events, as shown in the following screenshot.

In [None]:
%%local

# plot the image
fig, ax1 = plt.subplots(figsize=(16, 10))
image = mpimg.imread(image_path + 'data chunks in micro batches_3.png')
plt.imshow(image);

print('Image source -> ')

## 4.1.2 For Terminal 2 -> Spark Streaming Window

Create a PySpark Streaming application that counts the number of words (the events above). The code lives in a `.py` file and can be executed in the terminal as follows:

- Create the file in PyCharm or another IDE and name it `streaming_word_count.py`
- INPUT -> cd /to directory/
- INPUT -> ./bin/spark-submit streaming_word_count.py localhost 9999

In [None]:
%%local

# plot the image
fig, ax1 = plt.subplots(figsize=(16, 10))
image = mpimg.imread(image_path + 'streaming_word_count_pyFile.png')
plt.imshow(image);

print('Image source -> ')
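
Since the script is shown only as a screenshot, a rough text version may help. The sketch below follows the standard network word count pattern for DStreams and is not a transcription of the author's file: the application name, the `split_words` helper and the 1-second batch interval are assumptions. It also assumes a Spark installation that still ships the legacy `pyspark.streaming` module.

```python
import sys


def split_words(line):
    """Split one line of input into whitespace-separated words (events)."""
    return line.split()


def main(host, port):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingWordCount")  # assumed application name
    ssc = StreamingContext(sc, 1)  # assumed 1-second batch interval

    # Read lines from the socket that Terminal 1 is writing to.
    lines = ssc.socketTextStream(host, port)

    # Count each word within every micro-batch.
    counts = (
        lines.flatMap(split_words)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], int(sys.argv[2]))
    else:
        print("Usage: streaming_word_count.py <hostname> <port>")
```

The counts printed here are per micro-batch only; they are not carried over from one interval to the next.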

## 4.1.3 Putting it all together:

