# Spark Streaming

## Overview

- [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
- [PySpark](https://spark.apache.org/docs/latest/api/python/getting_started/index.html)

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. 

Data can be ingested from many sources, such as Kafka, Kinesis, or TCP sockets, and processed using complex algorithms expressed with high-level functions like `map`, `reduce`, `join`, and `window`.

Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

![Spark Streaming Architecture](https://raw.githubusercontent.com/tnurbek/ds702/main/Lab7/streaming-arch.png)



Spark Streaming provides a high-level abstraction called a discretized stream, or `DStream`, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
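To make the "sequence of RDDs" idea concrete, here is a toy model in plain Python. It does not use the Spark API: the `discretize` function, the event list, and the one-second interval are all illustrative assumptions. It only shows how a continuous stream can be cut into fixed-interval micro-batches that are then transformed one batch at a time.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Batch k holds the events with k*batch_interval <= timestamp < (k+1)*batch_interval.
    Toy model only; this is not how Spark is implemented.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    # Emit empty batches too, so there is exactly one list per interval
    return [batches.get(k, []) for k in range(max(batches) + 1)]


# Hypothetical events: (seconds since start, word)
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (3.5, "c")]
micro_batches = discretize(events, batch_interval=1.0)
print(micro_batches)  # [['a', 'b'], ['a'], [], ['c']]

# A per-batch transformation, loosely analogous to applying map/reduce to each RDD
counts_per_batch = [{w: batch.count(w) for w in set(batch)} for batch in micro_batches]
```

Each inner list plays the role of one RDD in the sequence, and the interval with no events still produces an (empty) batch.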

## Setup

In [None]:
!pip install pyspark

## CSV

In [1]:
from pyspark.sql import SparkSession

# Explicit imports avoid shadowing Python built-ins such as sum, min and max
from pyspark.sql.functions import avg, count, desc
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DoubleType
)

In [2]:
spark = SparkSession.builder.appName('Stream').master('local[*]').getOrCreate()

22/03/16 02:57:52 WARN Utils: Your hostname, umar resolves to a loopback address: 127.0.1.1; using 10.127.80.168 instead (on interface wlp2s0)
22/03/16 02:57:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/16 02:57:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Now we will work with Structured Streaming, Spark's DataFrame-based streaming API, rather than the DStream API described above.

In [3]:
schema = StructType(
    [
        StructField('id', IntegerType(), True), 
        StructField('name', StringType(), True), 
        StructField('age', IntegerType(), True), 
        StructField('profession', StringType(), True), 
        StructField('city', StringType(), True), 
        StructField('salary', DoubleType(), True)
    ]
)

We load our data into a streaming DataFrame with `readStream`. We can then confirm that the DataFrame is a streaming one through its `isStreaming` property.

In [4]:
data = (
    spark.readStream.format('csv').schema(schema)
    .option('header', True)
    .option('maxFilesPerTrigger', 1)  # at most one new file per micro-batch
    .load("./Lab7_Extra/stream/")
)

In [5]:
data.isStreaming

True

We now have a streaming DataFrame. Next, we will simulate a stream: instead of receiving data as it arrives, we copy our CSV files one at a time into the path passed to `readStream` above. This is also why we set the `maxFilesPerTrigger` option to 1: each trigger (micro-batch) reads at most one new CSV file.
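When files are dropped into a watched directory, the Structured Streaming guide asks that they be placed there atomically, which on most filesystems means moving a finished file in rather than writing it in place. The sketch below shows one way to do that with the standard library only. The helper name `drop_csv_atomically`, the staging directory, and the sample row are assumptions made for illustration, and the move is only atomic if both directories are on the same filesystem.

```python
import csv
import os
import tempfile


def drop_csv_atomically(rows, header, staging_dir, stream_dir, filename):
    """Write a CSV in a staging directory, then move it into the watched one."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(stream_dir, exist_ok=True)

    staged = os.path.join(staging_dir, filename)
    with open(staged, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    final = os.path.join(stream_dir, filename)
    # Atomic rename, assuming both paths live on the same filesystem
    os.replace(staged, final)
    return final


# Demonstration in a throwaway directory (hypothetical data)
base = tempfile.mkdtemp()
staging = os.path.join(base, "staging")
stream = os.path.join(base, "stream")
header = ["id", "name", "age", "profession", "city", "salary"]
rows = [[100, "Example Person", 30, "teacher", "Example City", 5000]]

path = drop_csv_atomically(rows, header, staging, stream, "file0.csv")
print(os.listdir(stream))  # ['file0.csv']
```

A reader watching `stream` therefore sees either no file or the complete file, never a half-written one.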

In [6]:
data.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- profession: string (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: double (nullable = true)



Next, we apply transformations that compute, for each profession, the number of people and the average salary, sorted by average salary in descending order. The resulting DataFrame is updated with every new file.

In [7]:
average_salaries = data.groupBy('profession').agg(
    (avg('salary').alias('average_salary')), 
    (count('profession').alias('count'))
).sort(desc('average_salary'))

We are now almost ready to stream. One last step remains: we need to specify a `format()`, which selects the streaming sink, and an `outputMode()`, which determines what data is written to that sink.

The most commonly used formats are `console`, `kafka`, `parquet`, and `memory`. We will use `console` so that we can follow the streaming results in the terminal.

We have three options for outputMode():

- `append`: Only new rows will be written to the sink.
- `complete`: All rows will be written to the sink, every time there are updates.
- `update`: Only the updated rows will be written to the sink, every time there are updates.
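The difference between `complete` and `update` is easiest to see on a running aggregation. The following is a toy model in plain Python, not Spark code: `run_batches` and the two tiny batches are hypothetical. It keeps a running `(sum, count)` per key and records what a sink would receive after each micro-batch. `append` is left out on purpose, because for streaming aggregations Spark only allows it together with a watermark; similarly, sorting an aggregated streaming DataFrame, as we do here, is only supported in `complete` mode.

```python
def run_batches(batches, mode):
    """Return what a sink would receive after each micro-batch.

    batches: list of lists of (key, value) rows.
    mode: 'complete' emits the whole result table;
          'update' emits only the keys whose aggregate changed in this batch.
    """
    if mode not in ("complete", "update"):
        raise ValueError("this toy model only covers 'complete' and 'update'")

    state = {}  # key -> (running sum, running count)
    emitted = []
    for batch in batches:
        changed = set()
        for key, value in batch:
            total, count = state.get(key, (0.0, 0))
            state[key] = (total + value, count + 1)
            changed.add(key)
        keys = state if mode == "complete" else changed
        emitted.append({k: state[k][0] / state[k][1] for k in sorted(keys)})
    return emitted


# Hypothetical rows: (profession, salary)
batches = [
    [("doctor", 4000), ("teacher", 3000)],
    [("doctor", 6000)],
]
complete = run_batches(batches, "complete")
update = run_batches(batches, "update")
print(complete[1])  # {'doctor': 5000.0, 'teacher': 3000.0}
print(update[1])    # {'doctor': 5000.0}
```

In the second batch only `doctor` changes, so `update` emits a single row while `complete` re-emits the full table.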

In [8]:
query = average_salaries.writeStream.format('console').outputMode('complete').start()

22/03/16 02:58:23 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-09e9336c-dc64-422f-8c19-056deb712782. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
22/03/16 02:58:23 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
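The query now runs in the background, so it can be handy to wait for pending files to be processed and then look at its state. A minimal sketch, assuming the `processAllAvailable()`, `isActive`, `status`, and `lastProgress` members of PySpark's `StreamingQuery`, and assuming `lastProgress` is dict-like. The helper `flush_and_report` and the `FakeQuery` stand-in are hypothetical; the stand-in exists only so that the snippet runs without a Spark session.

```python
def flush_and_report(query):
    """Process all currently available data, then summarize the query state."""
    query.processAllAvailable()    # blocks; intended for testing and demos
    progress = query.lastProgress  # None if no progress has been reported yet
    return {
        "active": query.isActive,
        "status": query.status["message"],
        "last_batch": None if progress is None else progress["batchId"],
    }


class FakeQuery:
    """Hypothetical stand-in for a StreamingQuery, for illustration only."""
    isActive = True
    status = {"message": "Waiting for data to arrive"}
    lastProgress = {"batchId": 2}

    def processAllAvailable(self):
        pass  # a real query would block here until pending data is processed


report = flush_and_report(FakeQuery())
print(report)
```

With the real `query` object from the cell above, `flush_and_report(query)` would be called after writing new files into the `stream/` folder.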


## Data Generation

In [None]:
!pip install faker

In [9]:
from faker import Faker
import numpy as np
import pandas as pd

fake = Faker() 

In [10]:
customers_data = []
next_id = 100  # avoid shadowing the built-in id()
professions = ['doctor', 'teacher', 'police', 'lawyer', 'musician', 'ml engineer', 'designer', 'software engineer', 'journalist', 'fireman', 'singer']

for _ in range(6):  # six files' worth of data
    stream_data = []
    for _ in range(20):  # 20 customers per file
        stream_data.append([
            next_id,                         # id
            fake.name(),                     # name
            np.random.randint(18, 72),       # age in [18, 71]
            np.random.choice(professions),   # profession
            fake.city(),                     # city
            np.random.randint(1000, 12000),  # salary in [1000, 11999]
        ])
        next_id += 1
    customers_data.append(stream_data)


In [11]:
len(customers_data)

6

In [12]:
customers_data[:2]

[[[100, 'Julie Collins', 24, 'journalist', 'North Betty', 9887],
  [101, 'Stephanie Burton', 33, 'fireman', 'Ashleyfurt', 6431],
  [102, 'Tracie Hardin', 28, 'software engineer', 'North Nancyton', 5785],
  [103, 'Theresa Myers', 49, 'doctor', 'North Melissahaven', 4876],
  [104, 'Connie Petty', 69, 'ml engineer', 'New Alyssa', 6995],
  [105, 'Brian Garza', 26, 'police', 'New Rodney', 5231],
  [106, 'Carrie Blanchard', 34, 'designer', 'North Heatherside', 4686],
  [107, 'Morgan Harris', 35, 'ml engineer', 'Longstad', 5740],
  [108, 'Ronald Bradford', 52, 'teacher', 'Christophershire', 3410],
  [109, 'Kenneth Reed', 41, 'fireman', 'Hoborough', 1191],
  [110, 'Joshua Smith', 43, 'singer', 'Port James', 11051],
  [111, 'Emily Roberts', 39, 'journalist', 'West Claireburgh', 9515],
  [112, 'Paul Sanchez', 55, 'musician', 'New Juliestad', 9177],
  [113, 'Amy Burns', 69, 'software engineer', 'Lake Karenbury', 1275],
  [114, 'Katherine Howard', 45, 'singer', 'Hansonmouth', 8229],
  [115, 'Brook

Let's write the first three files into the `stream/` folder.

In [13]:
# Write only the first three chunks; the rest are streamed later
for i, elem in enumerate(customers_data[:3]):
    df = pd.DataFrame(elem, columns="id name age profession city salary".split(" "))
    filename = f'./Lab7_Extra/stream/file{i}.csv'
    df.to_csv(filename, index=False)

                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-----------------+-----------------+-----+
|       profession|   average_salary|count|
+-----------------+-----------------+-----+
|       journalist|           9701.0|    2|
|           singer|           9640.0|    2|
|         musician|           9177.0|    1|
|      ml engineer|           6367.5|    2|
|software engineer|           5831.6|    5|
|           police|           5231.0|    1|
|           doctor|           4876.0|    1|
|          fireman|4862.333333333333|    3|
|         designer|           4686.0|    1|
|           lawyer|           4098.0|    1|
|          teacher|           3410.0|    1|
+-----------------+-----------------+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+-----------------+-----------------+-----+
|       profession|   average_salary|count|
+-----------------+-----------------+-----+
|         musician|           9177.0|    1|
|       journalist|9107.333333333334|    3|
|           lawyer|8994.333333333334|    3|
|           police|           6884.0|    3|
|           singer|           6079.5|    4|
|      ml engineer|           5722.0|    3|
|software engineer|         5625.125|    8|
|          fireman|           4758.5|    4|
|           doctor|           4529.5|    6|
|         designer|          4418.25|    4|
|          teacher|           3410.0|    1|
+-----------------+-----------------+-----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+-----------------+-----------------+-----+
|       profession|   average_salary|count|
+-----------------+-----------------+-----+
|         musician|           8908.0|    3|
|           lawyer|           8099.6|    5|
|           police|           7868.0|    4|
|       journalist|           7366.6|    5|
|      ml engineer|          6581.75|    4|
|           singer|           6227.4|   10|
|         designer|6159.833333333333|    6|
|software engineer|         5625.125|    8|
|          teacher|           5294.5|    2|
|           doctor|5187.111111111111|    9|
|          fireman|           4758.5|    4|
+-----------------+-----------------+-----+



As the batches above show, the stream processes one file at a time. Let's put another file into the `stream/` folder and see the aggregation result.
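The averages in the batch tables change without Spark re-reading earlier files, because an average can be updated from a running sum and count. The sketch below reproduces the `journalist` row going from Batch 0 (average 9701.0 over 2 people) to Batch 1 (3 people). The helper `merge_average` is hypothetical, and the single new salary is back-computed from the two printed tables rather than read from the data, so treat it as an assumption.

```python
def merge_average(old_avg, old_count, new_values):
    """Fold new values into an existing average using sum and count."""
    total = old_avg * old_count + sum(new_values)
    count = old_count + len(new_values)
    return total / count, count


# Back-computed from the printed tables (assumption, not taken from the CSV files)
new_salary = round(9107.333333333334 * 3 - 9701.0 * 2)

avg_after, count_after = merge_average(9701.0, 2, [new_salary])
print(new_salary)                         # 7920
print(count_after, round(avg_after, 3))   # 3 9107.333
```

The result matches the `journalist` row printed for Batch 1.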

In [14]:
pd.read_csv('./Lab7_Extra/stream/file2.csv').head()

Unnamed: 0,id,name,age,profession,city,salary
0,140,Michele Gibbs,55,teacher,Lake Jenniferfort,7179
1,141,Michele Long,39,police,Port Joy,10820
2,142,Jordan Moore,58,doctor,Stevenchester,8115
3,143,Valerie Davis,50,musician,Lake Derrick,10770
4,144,Jessica Chambers,45,journalist,Robinmouth,2819


In [15]:
df = pd.DataFrame(customers_data[3], columns="id name age profession city salary".split(" "))
filename = './Lab7_Extra/stream/file3.csv'
df.to_csv(filename, index=False)

                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+-----------------+-----------------+-----+
|       profession|   average_salary|count|
+-----------------+-----------------+-----+
|           police|           7868.0|    4|
|         musician|           7636.2|    5|
|           lawyer|7487.142857142857|    7|
|      ml engineer|           7405.0|    5|
|         designer|6735.777777777777|    9|
|          teacher|           6673.2|    5|
|       journalist|           6450.0|    6|
|software engineer|           5636.1|   10|
|          fireman|           5606.2|    5|
|           singer|5535.076923076923|   13|
|           doctor|5358.727272727273|   11|
+-----------------+-----------------+-----+



To stop the streaming query:

In [16]:
query.stop()