<a href="https://colab.research.google.com/github/smduarte/ps2024/blob/main/lab2/ps2024_lab2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stream Processing (Processamento de Streams) 2024
## Lab 2 - (Unstructured) Spark Streaming
---
### Colab Setup



In [None]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#@title Install PySpark
!pip install pyspark findspark --quiet
import findspark
findspark.init()
findspark.find()

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

---
### Weblog Sender
The stream server is a small Python TCP server listening on port 7777 of localhost.

The stream consists of text lines taken from the output log of a web server.
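
Once the server has been started (next cell), it can help to look at a few raw lines before writing any Spark code, since the exercises depend on the fields each log line carries. The helper below is a minimal sketch using only Python's standard `socket` module; the name `peek_lines` and its parameters are hypothetical and not part of the lab material.

```python
import socket


def peek_lines(host="localhost", port=7777, n=5, timeout=5.0):
    """Return up to n text lines read from a TCP text stream.

    Hypothetical helper for inspecting the log format; it is not part of
    the logsender package.
    """
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # makefile() exposes the socket's byte stream as a line-oriented file.
        with conn.makefile("r", encoding="utf-8", errors="replace") as stream:
            for line in stream:
                lines.append(line.rstrip("\r\n"))
                if len(lines) >= n:
                    break
    return lines
```

In the notebook, `print(*peek_lines(), sep="\n")` would print the first five lines sent by the server, assuming it is already running on port 7777.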



In [None]:
!wget -q -O - https://github.com/smduarte/ps2024/raw/main/colab/logsender.tgz | tar xfz - 2> /dev/null

!nohup python logsender/server.py logsender/web.log 7777 > /dev/null 2> /dev/null &

The Python code below shows the basics needed to process data from a socket source using PySpark.

The Spark Streaming Python documentation can be found [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html).

In [None]:

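from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.master('local[*]').appName('WebLogExample').getOrCreate()
sc = spark.sparkContext

# Process the stream in micro-batches of 1 second.
ssc = StreamingContext(sc, 1)
try:
  lines = ssc.socketTextStream("localhost", 7777)

  # Print the first elements of each batch.
  lines.pprint()

  ssc.start()
  # Let the computation run for at most 10 seconds.
  ssc.awaitTermination(10)
finally:
  # Always stop the streaming context, on timeout or on error, so the cell
  # can be rerun; the SparkContext is kept alive for later cells.
  ssc.stop(stopSparkContext=False)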

---
# Exercises

## Exercise 1

During a denial-of-service event, it is important to identify the IP sources that may be attacking the system by issuing a large number of requests.

Write a program to find the IP sources that made more than 50 requests in the last 10 seconds; dump this information every 5 seconds.
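
All the exercises share the same pattern: a window length ("the last 10 seconds") and a slide interval ("every 5 seconds"). In the DStream API these correspond to the `windowDuration` and `slideDuration` arguments of operations such as `window` and `reduceByKeyAndWindow`, and both must be multiples of the batch interval. The sketch below is a plain-Python simulation of those semantics, not Spark code and not a solution; it assumes 1-second batches, and the function name is hypothetical.

```python
from collections import Counter


def sliding_window_counts(batches, window=10, slide=5):
    """Simulate windowed counting over a list of 1-second batches.

    `batches[i]` holds the keys (e.g. source IPs) that arrived during
    second i. Every `slide` seconds, yield the per-key counts over the
    last `window` seconds.
    """
    for end in range(slide, len(batches) + 1, slide):
        # The first windows are shorter because the stream has just started.
        start = max(0, end - window)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        yield end, counts
```

With a window of 10 and a slide of 5, consecutive windows overlap by 5 seconds, so each request is counted in two consecutive outputs.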


## Exercise 2



#### 2a)
Write a program to dump the *number of requests*, *minimum processing time* and *maximum processing time*, for requests made in the last 10 seconds, **for all** source IPs that performed more than 100 requests -- dump this information every 5 seconds.  

#### 2b)
Write a program to dump the *number of requests*, *minimum processing time*, *maximum processing time*, for requests made in the last 10 seconds, **but only if at least one** source IP has performed more than 100 requests -- dump this information every 5 seconds.

## Exercise 3
Write a program to dump the 3 IP sources whose number of requests in the last 30 seconds deviates the most from the average; dump this information every 5 seconds.
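
The ranking step can be checked on a small dictionary before moving it into a streaming job. The sketch below is plain Python and rests on one assumption: "deviation" is read as the absolute difference between an IP's request count and the mean count over all IPs seen in the window. The function name is hypothetical, and other readings (for example, a signed difference) are equally possible.

```python
def top_deviations(counts, k=3):
    """Return the k (ip, count, deviation) triples furthest from the mean.

    `counts` maps each source IP to its number of requests in one window.
    Deviation is assumed to be the absolute distance from the mean count.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    ranked = sorted(counts.items(), key=lambda kv: abs(kv[1] - mean), reverse=True)
    return [(ip, n, abs(n - mean)) for ip, n in ranked[:k]]
```

Note that the mean needs the counts of all IPs in the window, so in a streaming job it has to be computed over the whole windowed batch rather than per record.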

## Exercise 4

Run additional logsender servers for subsets of the logs (IPv4 and IPv6 logs), using the following commands.

```
!nohup python logsender/server.py logsender/webipv4.log 7778 > /dev/null 2> /dev/null &
!nohup python logsender/server.py logsender/webipv6.log 7779 > /dev/null 2> /dev/null &
```

Write a program that combines the two streams, dumping the number of requests made in the last 15 seconds - dump this information every 5 seconds.

## Exercise 5

Write a program that combines the two streams from the previous exercise and dumps the proportion of IPv4 vs IPv6 requests in the last 20 seconds - dump this information every 5 seconds.
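
Once two streams are merged, their records can no longer be told apart unless each one is tagged with its origin first. The plain-Python sketch below illustrates that idea on a list of `(tag, line)` pairs; it assumes "proportion" means each stream's share of the total number of requests, and the function name and the tags are hypothetical.

```python
from collections import Counter


def origin_shares(tagged_records):
    """Return each tag's share of the total number of records.

    `tagged_records` is an iterable of (tag, line) pairs, where the tag
    (e.g. "ipv4" or "ipv6") was attached before the streams were merged.
    """
    counts = Counter(tag for tag, _ in tagged_records)
    total = sum(counts.values())
    if total == 0:
        # An empty window has no meaningful proportion.
        return {}
    return {tag: n / total for tag, n in counts.items()}
```

Handling the empty window explicitly avoids a division by zero when no requests arrive in a 20-second interval.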
