<a href="https://colab.research.google.com/github/vinagoros/dv_streamprocessing/blob/main/lab1/ps2022_lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Processamento de Streams 2022
## Lab 1
---
### Colab Setup Basics

This notebook shows how to run Spark in Google Colab.

The main drawback is the Google Colab execution environment is not persistent. The notebook itself is saved in Google Drive, but Spark and any other software installation needs to be repeated everytime you reopen the notebook. 

Fortunately, it only the procedure only takes a couple of minutes and it is fully automated.



In [1]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

MessageError: ignored

In [2]:
#@title Install PySpark
!pip install pyspark findspark --quiet
import findspark
findspark.init()
findspark.find()

[K     |████████████████████████████████| 281.4 MB 34 kB/s 
[K     |████████████████████████████████| 198 kB 46.7 MB/s 
[?25h  Building wheel for pyspark (setup.py) ... [?25l[?25hdone


'/usr/local/lib/python3.7/dist-packages/pyspark'

---
#Example
### Weblog Sender
The stream server is a small python TCP server, listening
on port 7777 (localhost). 

The stream will consist of a set of text lines, obtained from the output log of a webserver.



In [3]:
!wget -q -O - https://github.com/smduarte/ps2022/raw/main/colab/logsender.tgz | tar xfz - 2> /dev/null

!nohup python logsender/server.py logsender/web.log 7777 > /dev/null 2> /dev/null &

The python code below shows the basics needed to process data from socket source using PySpark.

Spark Streaming python documentation is found [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html)

In [4]:
from pyspark import SparkContext

sc = SparkContext("local[2]", "WebLogExample")
import socket
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("localhost", 7777)

lines.pprint()

ssc.start()
ssc.awaitTermination(10)
ssc.stop()

-------------------------------------------
Time: 2022-04-18 15:13:03
-------------------------------------------

-------------------------------------------
Time: 2022-04-18 15:13:04
-------------------------------------------
2022-04-18T15:13:03.122+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026  
2022-04-18T15:13:03.198+0000 178.22.148.122 404 GET /codemove/PSO83TYKET12 0.088  
2022-04-18T15:13:03.198+0000 178.22.148.122 404 GET /codemove/PSO83TYKET12 0.088  
2022-04-18T15:13:03.199+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.057  
2022-04-18T15:13:03.472+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.015  
2022-04-18T15:13:03.524+0000 185.28.193.95 404 GET /codemove/1U6HCG3V2S9D 0.056  
2022-04-18T15:13:03.524+0000 185.28.193.95 404 GET /codemove/1U6HCG3V2S9D 0.052  
2022-04-18T15:13:03.524+0000 185.28.193.95 404 GET /codemove/1U6HCG3V2S9D 0.055  
2022-04-18T15:13:03.525+0000 185.28.193.95 404 GET /codemove/1U6HCG3V2S9D 0.013  
2022-04-18T15:13:03.714+0000 37.139.9

---
# Exercises

Do the follwing exercises:

**Every 3 seconds**, 
1. Dump the number of requests in the last 10 seconds;
2. Dump the number of requests in the last 10 seconds, only if they total more than 100;
3. Dump the number of requests in the last 10 seconds, if there is an IP address with more than 100 requests;
4. Dump the proportion of IPv4 vs IPv6 requests in the last 20 seconds.



In [5]:
#@title 1. Dump the number of requests in the last 10 seconds;

from pyspark import SparkContext

sc = SparkContext("local[2]", "WebLogExample")
import socket
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
line = ssc.socketTextStream("localhost", 7777)
lines = line.window(10,10)
counted_lines = lines.count()
counted_lines.pprint()

ssc.start()
ssc.awaitTermination(10)
ssc.stop()




-------------------------------------------
Time: 2022-04-17 17:34:09
-------------------------------------------
2022-04-17T17:33:59.741+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026  
2022-04-17T17:33:59.817+0000 178.22.148.122 404 GET /codemove/PSO83TYKET12 0.088  
2022-04-17T17:33:59.818+0000 178.22.148.122 404 GET /codemove/PSO83TYKET12 0.088  
2022-04-17T17:33:59.818+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.057  
2022-04-17T17:34:00.091+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.015  
2022-04-17T17:34:00.143+0000 185.28.193.95 404 GET /codemove/1U6HCG3V2S9D 0.056  
2022-04-17T17:34:00.143+0000 185.28.193.95 404 GET /codemove/1U6HCG3V2S9D 0.052  
2022-04-17T17:34:00.144+0000 185.28.193.95 404 GET /codemove/1U6HCG3V2S9D 0.055  
2022-04-17T17:34:00.144+0000 185.28.193.95 404 GET /codemove/1U6HCG3V2S9D 0.013  
2022-04-17T17:34:00.333+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.039  
...



In [4]:
#@title 2. Dump the number of requests in the last 10 seconds, only if they total more than 100;

from pyspark import SparkContext

sc = SparkContext("local[2]", "WebLogExample")
import socket
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("localhost", 7777)
counted_lines = lines.count().filter(lambda count : count > 100)

counted_lines.pprint()

ssc.start()
ssc.awaitTermination(10)
ssc.stop()

-------------------------------------------
Time: 2022-04-18 15:20:23
-------------------------------------------

-------------------------------------------
Time: 2022-04-18 15:20:24
-------------------------------------------
138

-------------------------------------------
Time: 2022-04-18 15:20:25
-------------------------------------------

-------------------------------------------
Time: 2022-04-18 15:20:26
-------------------------------------------
239

-------------------------------------------
Time: 2022-04-18 15:20:27
-------------------------------------------

-------------------------------------------
Time: 2022-04-18 15:20:28
-------------------------------------------
207

-------------------------------------------
Time: 2022-04-18 15:20:29
-------------------------------------------

-------------------------------------------
Time: 2022-04-18 15:20:30
-------------------------------------------
292

-------------------------------------------
Time: 2022-04-18 15:

In [2]:
#@title 3. Dump the number of requests in the last 10 seconds, if there is an IP address with more than 100 requests;

from pyspark import SparkContext

sc = SparkContext("local[2]", "WebLogExample")
import socket
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
line = ssc.socketTextStream("localhost", 7777)
lines = line.window(10,10)
mapped_lines = lines.map(lambda line: (line.split(" ")[1],1)).reduceByKey(lambda a,b : a+b).filter(lambda key_value_pair: key_value_pair[1]> 100)

mapped_lines.pprint()

ssc.start()
ssc.awaitTermination(10)
ssc.stop()


-------------------------------------------
Time: 2022-04-18 15:39:45
-------------------------------------------
('192.241.151.220', 115)
('97.77.104.22', 124)
('120.52.73.98', 187)
('178.22.148.122', 156)
('120.52.73.97', 255)



In [2]:
#@title 4. Dump the proportion of IPv4 vs IPv6 requests in the last 20 seconds;

from pyspark import SparkContext

sc = SparkContext("local[2]", "WebLogExample")
import socket
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
line = ssc.socketTextStream("localhost", 7777)
lines = line.window(10,10).map(lambda line: line.split(" ")[1])
ipv4_lines = lines.filter(lambda ip: len(ip)<= 15).count()
ipv6_lines = lines.filter(lambda ip : len(ip) > 15).count()
ip_counts = ipv4_lines.union(ipv6_lines)
final = ip_counts.reduce(lambda a,b : a/b if b > 0 else a)

final.pprint()

ssc.start()
ssc.awaitTermination(20)
ssc.stop()

-------------------------------------------
Time: 2022-04-18 16:20:50
-------------------------------------------
30.358974358974358

-------------------------------------------
Time: 2022-04-18 16:21:00
-------------------------------------------
56.690140845070424

