# PS 2021 - Lab 2

The following example and exercises assume that the stream server is running on a machine or Docker container named `logsender`, listening on port 7777, and launched with the following command:

`docker run --network psnet --name logsender smduarte/ps2021-logsender`

The messages propagated by the server have the following format, using a blank space as the separator: 

`date` `IP_source` `return_value` `operation` `URL` `time`

For example:

`2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026`
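As a minimal sketch of how one such message could be split into its six fields, the snippet below parses the example line in plain Python. The names `LogRecord` and `parse_line` are hypothetical helpers introduced here for illustration, and the sketch assumes that no field contains a blank space.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """Hypothetical container for the six fields of one log message."""
    date: str
    ip_source: str
    return_value: int
    operation: str
    url: str
    time: float


def parse_line(line: str) -> LogRecord:
    """Split one space-separated log line into its six fields.

    Assumes exactly six fields; a malformed line raises ValueError.
    """
    date, ip_source, return_value, operation, url, time = line.split(' ')
    return LogRecord(date, ip_source, int(return_value), operation, url, float(time))


record = parse_line(
    "2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026")
print(record.ip_source, record.return_value, record.time)  # 37.139.9.11 404 0.026
```

A function like this could be passed to `map` on a stream of lines, so that later steps refer to fields by name rather than by position.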

## Example

Count the number of requests from each source IP, over a window of 15 seconds, updated every 5 seconds, sorted in decreasing order.

In [None]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Local Spark context with a 5-second batch interval
sc = SparkContext("local[*]", "WebLogExample")
ssc = StreamingContext(sc, 5)

# Read the log lines sent by the stream server
lines = ssc.socketTextStream("logsender", 7777)

# 15-second window, sliding every 5 seconds
lines = lines.window(15, 5)
# Count the requests per source IP (second field) and sort by count, descending
counts = lines.filter(lambda line: len(line) > 0) \
            .map(lambda line: (line.split(' ')[1], 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .transform(
                lambda rdd: rdd.sortBy(lambda x: x[1], ascending=False))
counts.pprint()

# Run for 30 seconds, then stop
ssc.start()
ssc.awaitTermination(30)
ssc.stop()
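
To make the window semantics of the example concrete, here is a small plain-Python simulation that does not use Spark. It assumes a 5-second batch interval, so a 15-second window spans 3 batches and a 5-second slide advances by 1 batch. The function `sliding_windows` and the sample addresses are hypothetical and only illustrate which batches contribute to each output.

```python
from collections import Counter


def sliding_windows(batches, window_len=3, slide=1):
    """Yield the merged per-IP counts for each window over a list of batches.

    window_len and slide are expressed in batches, not seconds.
    Early windows contain fewer batches, because no older data exists yet.
    """
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        merged = Counter()
        for batch in batches[start:end]:
            merged.update(batch)  # adds the counts key by key
        yield merged


# Made-up per-batch request counts, one Counter per 5-second batch
batches = [
    Counter({"10.0.0.1": 2}),
    Counter({"10.0.0.1": 1, "10.0.0.2": 4}),
    Counter({"10.0.0.2": 1}),
    Counter({"10.0.0.3": 5}),
]

windows = list(sliding_windows(batches))
for w in windows:
    print(w.most_common())  # counts in decreasing order
```

The fourth window no longer includes the first batch, which is why the count for `10.0.0.1` drops from 3 to 1 there.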

# Exercises (Lab1)

Do the following exercises:

Every 5 seconds, 
1. Dump the number of requests in the last 10 seconds;
2. Dump the number of requests in the last 10 seconds, only if they total more than 100;
3. Dump the number of requests in the last 10 seconds, if there is an IP address with more than 100 requests;
4. Dump the proportion of IPv4 vs IPv6 requests in the last 20 seconds.
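
For the IPv4 vs IPv6 item, one possible way to classify an address is Python's standard `ipaddress` module. The sketch below is plain Python rather than a full streaming solution; `ip_version` and `proportions` are hypothetical helper names, and the sample addresses are made up.

```python
import ipaddress


def ip_version(address: str) -> int:
    """Return 4 or 6 for a valid IP address string (ValueError otherwise)."""
    return ipaddress.ip_address(address).version


def proportions(addresses):
    """Return the (IPv4, IPv6) fractions of the given addresses."""
    counts = {4: 0, 6: 0}
    for address in addresses:
        counts[ip_version(address)] += 1
    total = counts[4] + counts[6]
    if total == 0:
        return (0.0, 0.0)  # no requests in the window
    return (counts[4] / total, counts[6] / total)


sample = ["37.139.9.11", "2001:db8::1", "10.0.0.1", "192.168.1.1"]
print(proportions(sample))  # (0.75, 0.25)
```

In a streaming job, `ip_version` could serve as the key in a `map` step, leaving the counting to a reduce over the window.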



# Lab 2 Exercises

## Exercise 2a

In a denial-of-service attack, it is important to identify the IP sources that might be attacking the system, i.e., those that are issuing a large number of requests.

Write a program to find the IP sources that have made more than 50 requests in the last 10 seconds. Dump this information every 5 seconds.

## Exercise 2b

Write a program to dump the number of requests, minimum processing time and maximum processing time per request in the last 10 seconds, for each source IP that has performed more than 100 requests. 

Dump this information every 5 seconds. 
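
One way to obtain the count, minimum and maximum in a single pass is to turn each processing time into a `(count, min, max)` triple and merge triples pairwise. The sketch below shows the idea in plain Python with `functools.reduce`. The names `to_stats` and `merge_stats` are hypothetical, and the sample times are made up. Because the merge function is associative and commutative, it should also be usable as the function passed to a per-key reduce.

```python
from functools import reduce


def to_stats(processing_time):
    """Turn one processing time into a (count, min, max) triple."""
    return (1, processing_time, processing_time)


def merge_stats(a, b):
    """Combine two (count, min, max) triples into one."""
    return (a[0] + b[0], min(a[1], b[1]), max(a[2], b[2]))


# Made-up processing times for a single source IP
times = [0.026, 0.110, 0.004]
stats = reduce(merge_stats, map(to_stats, times))
print(stats)  # (3, 0.004, 0.11)
```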

## Exercise 2c
Write a program to dump the IP sources that deviate the most from the average, in terms of the number of requests made in the last 30 seconds. Dump this information every 5 seconds.

Note: modify your solution to use a weighted average, where the most recent values have a higher weight.
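
The exercise leaves the exact definitions open, so the following plain-Python sketch shows only one possible reading: deviation is measured as the absolute distance from the mean request count, and the weighted average uses linearly increasing weights 1, 2, ..., n so that the most recent value weighs most. Both helper names and all sample numbers are hypothetical.

```python
def rank_by_deviation(counts):
    """Sort (ip, count) pairs by absolute distance from the mean, largest first.

    Assumes counts is a non-empty dict mapping IP -> number of requests.
    """
    mean = sum(counts.values()) / len(counts)
    return sorted(counts.items(), key=lambda kv: abs(kv[1] - mean), reverse=True)


def weighted_average(values):
    """Average of a non-empty sequence with weights 1, 2, ..., n.

    The last (most recent) value receives the highest weight.
    """
    weights = range(1, len(values) + 1)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Made-up request counts per IP over the window (mean is 24)
counts = {"10.0.0.1": 10, "10.0.0.2": 12, "10.0.0.3": 50}
ranking = rank_by_deviation(counts)
print(ranking[0])  # ('10.0.0.3', 50)

# Made-up counts for three successive intervals, oldest first
print(weighted_average([6, 12, 30]))  # 20.0, versus a plain average of 16.0
```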

## Exercise 2d

Run additional logsender servers for subsets of the logs (IPv4 and IPv6 logs), using the following commands.

`docker exec -d logsender python ./server.py /data/webipv4.log 7778`

`docker exec -d logsender python ./server.py /data/webipv6.log 7779`

Write a program that joins the two streams and dumps, every 5 seconds, the number of requests made in the last 15 seconds.
