# Processamento de Streams 2022
## Lab 3 - Spark Structured Streaming
---
### Colab Setup



In [None]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#@title Install PySpark
!pip install pyspark findspark --quiet
import findspark
findspark.init()
findspark.find()

---
### Weblog Sender
The stream server is a small Python TCP server listening on port 7777 of localhost.

The stream consists of text lines taken from the output log of a web server.
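To illustrate what each text line carries, the sketch below splits one line into named fields in plain Python. It assumes lines such as `2016-12-06T08:59:09.049+0000 120.52.73.97 404 GET /codemove/AP6FI7P9J2HP 58.589`. The field meanings (timestamp, source IP, HTTP status, method, URL, processing time) are assumptions inferred from such lines, and `LogEntry` and `parse_log_line` are hypothetical helpers, not part of logsender.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    """One web log line; field meanings are assumed, not documented."""
    timestamp: datetime
    ip: str
    status: int
    method: str
    url: str
    elapsed: float  # assumed to be the request processing time


def parse_log_line(line: str) -> LogEntry:
    """Split a space-separated log line into typed fields."""
    ts, ip, status, method, url, elapsed = line.split()
    return LogEntry(
        # %f reads the milliseconds, %z reads the +0000 UTC offset
        timestamp=datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z"),
        ip=ip,
        status=int(status),
        method=method,
        url=url,
        elapsed=float(elapsed),
    )
```

For example, `parse_log_line(line).ip` returns the source address of a line. In Spark, the same positional split can be expressed on the `value` column with `split`.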



In [None]:
!wget -q -O - https://github.com/smduarte/ps2022/raw/main/colab/logsender.tgz | tar xfz - 2> /dev/null

!nohup python logsender/server.py logsender/web.log 7777 > /dev/null 2> /dev/null &

The Python code below shows the basics needed to process data from a socket source using PySpark.

The Spark Streaming Python documentation can be found [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, count

spark = SparkSession \
    .builder \
    .appName("StructuredWebLogExample") \
    .getOrCreate()

# Create a DataFrame representing the stream of input lines
# read from the logsender server on port 7777

lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 7777) \
    .load()

# Once per second, print the first 10 rows of each micro-batch (untruncated)
query = lines \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime='1 seconds') \
    .foreachBatch(lambda df, epoch: df.show(10, False)) \
    .start()

# Let the query run for up to 60 seconds, then stop it
query.awaitTermination(60)
query.stop()


+-----------------------------------------------------------------------------------+
|value                                                                              |
+-----------------------------------------------------------------------------------+
|2016-12-06T08:59:09.049+0000 120.52.73.97 404 GET /codemove/AP6FI7P9J2HP 58.589    |
|2016-12-06T08:59:09.050+0000 185.28.193.95 404 GET /codemove/0CYDRWMI9BOW 75.605   |
|2016-12-06T08:59:09.051+0000 125.31.1.27 404 GET /codemove/VORA983ZWBAV 75.614     |
|2016-12-06T08:59:09.052+0000 120.52.73.98 404 GET /codemove/S7YPK8KVCBUZ 0.732     |
|2016-12-06T08:59:09.052+0000 185.28.193.95 404 GET /codemove/68MT6YARSIYO 75.695   |
|2016-12-06T08:59:09.053+0000 185.28.193.95 404 GET /codemove/KR8MW5Y1MRJ3 75.685   |
|2016-12-06T08:59:09.056+0000 202.47.236.252 404 GET /codemove/4CN9JWUGP4BI 59.182  |
|2016-12-06T08:59:09.059+0000 120.52.73.97 404 GET /codemove/LMXHMZ87KGIC 58.560    |
|2016-12-06T08:59:09.060+0000 185.28.193.95 404 GET /c

---
# Exercises

## Exercise 1

During a denial-of-service event, it is important to identify the source IPs that may be attacking the system by issuing a large number of requests.

Write a program to find the source IPs that made more than 50 requests in the last 10 seconds. Dump this information every 5 seconds.


## Exercise 2

#### a)
Write a program to dump the number of requests and the minimum and maximum processing times of the requests made in the last 10 seconds, **for all** source IPs that performed more than 100 requests. Dump this information every 5 seconds.

#### b)

Write a program to dump the number of requests and the minimum and maximum processing times of the requests made in the last 10 seconds, **only if at least one** source IP has performed more than 100 requests. Dump this information every 5 seconds.

## Exercise 3
Write a program to dump the source IPs whose number of requests in the last 30 seconds deviates most from the average. Dump this information every 5 seconds.
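
The exercise does not define "deviate most", so the plain-Python sketch below takes one possible reading: the absolute difference between each IP's request count and the mean count over all IPs in the window. `rank_by_deviation` is a hypothetical helper, and the per-IP counts are assumed to have been computed already, for example by a windowed aggregation.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_by_deviation(counts: Dict[str, int], top: int = 3) -> List[Tuple[str, float]]:
    """Return the `top` IPs whose request count is farthest from the mean."""
    if not counts:
        return []
    average = mean(counts.values())
    # Assumption: deviation is measured as |count - mean|
    deviations = [(ip, abs(n - average)) for ip, n in counts.items()]
    return sorted(deviations, key=lambda pair: pair[1], reverse=True)[:top]
```

With counts of 10, 20 and 90 the mean is 40, so the IP with 90 requests ranks first with a deviation of 50. Other measures, such as the number of standard deviations from the mean, would fit the statement equally well.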

## Exercise 4

Run additional logsender servers for subsets of the logs (IPv4 and IPv6 logs), using the following commands.

```
!nohup python logsender/server.py logsender/webipv4.log 7778 > /dev/null 2> /dev/null &
!nohup python logsender/server.py logsender/webipv6.log 7779 > /dev/null 2> /dev/null &
```

Write a program that combines the two streams and dumps the number of requests made in the last 15 seconds. Dump this information every 5 seconds.

## Exercise 5

Write a program that combines the two streams from the previous exercise and dumps the proportion of IPv4 versus IPv6 requests in the last 20 seconds. Dump this information every 5 seconds.
