# Spark Streaming example

This is a simple example of how to run Spark Streaming jobs in Jupyter.

To run this example, first **start the stream sender**, which runs in a separate
Docker container, using the following command:

`docker run --network psnet -h logsender smduarte/ps2021-logsender`

The stream server is a small Python TCP server listening
on port 7777. The stream consists of text lines
taken from the output log of a web server.
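
Before wiring the stream into Spark, it can help to look at a few raw lines. The following is a minimal sketch using only the Python standard library; the helper name `peek_stream` and its parameters are illustrative assumptions, not part of the sender or of Spark.

```python
import socket


def peek_stream(host, port, max_lines=5, timeout=5.0):
    """Return up to max_lines text lines read from a TCP line stream.

    Hypothetical helper for inspecting the stream by hand.
    """
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # makefile() wraps the socket so it can be iterated line by line.
        with conn.makefile("r", encoding="utf-8", errors="replace") as stream:
            for line in stream:
                lines.append(line.rstrip("\n"))
                if len(lines) >= max_lines:
                    break
    return lines
```

From a container attached to the `psnet` network, `peek_stream("logsender", 7777)` should return the first few log lines, assuming the sender is already running.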

The Spark Streaming Python documentation is available [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html).

In [None]:
from pyspark import SparkContext

sc = SparkContext("local[2]", "WebLogExample")


In [None]:
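from pyspark.streaming import StreamingContext

# Process the stream in 1-second micro-batches.
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("logsender", 7777)

# Sliding window: the last 15 seconds of data, recomputed every second.
lines = lines.window(15, 1)

# Count non-empty lines by their second space-separated field, most frequent first.
counts = lines.filter(lambda line: len(line) > 0) \
            .map(lambda line: (line.split(' ')[1], 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .transform(lambda rdd: rdd.sortBy(lambda x: x[1], ascending=False))
counts.pprint()

ssc.start()
# Wait at most 120 seconds, then stop the streaming job.
ssc.awaitTermination(120)
# Note: by default stop() also stops the underlying SparkContext.
ssc.stop()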
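
To make the windowing semantics concrete, the same counting logic can be sketched in plain Python, without Spark. This is only an illustration: each inner list stands for one 1-second batch, `windowed_counts` is a hypothetical helper, and the sample lines use a made-up format in which the second space-separated field is an IP address (an assumption about the log, not something the sender guarantees).

```python
from collections import Counter


def windowed_counts(batches, window=15):
    """For each batch, count keys over the last `window` batches.

    Mirrors window(15, 1) with a 1-second batch interval: the window
    slides forward by one batch at a time.
    """
    results = []
    for i in range(len(batches)):
        recent = batches[max(0, i - window + 1): i + 1]
        counts = Counter()
        for batch in recent:
            for line in batch:
                if len(line) > 0:  # same filter as the streaming job
                    counts[line.split(' ')[1]] += 1
        # Sort by count, most frequent first, as sortBy does above.
        results.append(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
    return results


# Hypothetical sample data: three 1-second batches.
batches = [
    ["t 10.0.0.1 /a", "t 10.0.0.2 /b"],
    ["t 10.0.0.1 /c", ""],
    ["t 10.0.0.1 /d"],
]
for result in windowed_counts(batches, window=2):
    print(result)
```

With `window=2`, the third result no longer includes `10.0.0.2`, because the first batch has slid out of the window.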

# Exercises

Do the following exercises:

Every 5 seconds:
1. Dump the number of requests in the last 10 seconds;
2. Dump the number of requests in the last 10 seconds, only if they total more than 100;
3. Dump the number of requests in the last 10 seconds, if there is an IP address with more than 100 requests;
4. Dump the proportion of IPv4 vs IPv6 requests in the last 20 seconds.

