
# Python Spark Dataframes Exercises


In [9]:
#@title Install Pyspark
!pip install --quiet pyspark

In [10]:
#@title Download "Os Maias"
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0

## 1. Sorted Word Frequency

1.1) Create a [Spark DataFrames](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html) program that counts the occurrences of each word in the novel "Os Maias" and sorts the words by frequency, most frequent first.


In [11]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]').appName('words').getOrCreate()
sc = spark.sparkContext

try :
  lines = sc.textFile('os_maias.txt') \
  .filter( lambda line : len(line) > 1 )

  structured_lines = lines.map( lambda line : Row( line = line, listOfWords = line.split(' ') ) )
  
  wordsOfLine = spark.createDataFrame( structured_lines )
  
  x = wordsOfLine.select(explode("listOfWords").alias('words')) \
      .groupBy('words').count() \
      .orderBy('count', ascending=False)

  
  x.show(5)
except Exception as err:
  print(err)
  sc.stop()

+-----+-----+
|words|count|
+-----+-----+
|   de| 8308|
|    a| 6720|
|    o| 6602|
|  que| 4846|
|    e| 4441|
+-----+-----+
only showing top 5 rows
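
To sanity-check the DataFrame result on a small sample, the same count can be reproduced in plain Python. The sketch below is not part of the original solution: the sample lines and the helper name `word_frequencies` are made up for illustration. It mirrors the Spark pipeline above by skipping lines of length 1 or less, splitting on a single space, counting, and sorting by descending frequency.

```python
from collections import Counter

def word_frequencies(lines):
    """Count tokens the way the Spark job does: split each line on a single space."""
    counts = Counter()
    for line in lines:
        if len(line) > 1:                  # same filter as the Spark job
            counts.update(line.split(' '))
    # most_common() returns (token, count) pairs, highest count first
    return counts.most_common()

# Hypothetical sample lines, not taken from the novel
sample = ["a casa de a familia", "", "a casa - de campo"]
for word, count in word_frequencies(sample)[:3]:
    print(word, count)
```

Splitting on `' '` keeps punctuation as tokens and yields empty strings for consecutive spaces, so a token such as `-` is counted as a word. Slicing the result (or calling `most_common(10)`) gives the top-N variant asked for in 1.2.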



1.2) Create a Spark DataFrames program that computes the 10 most frequently used words in the novel "Os Maias".

In [12]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]').appName('words').getOrCreate()
sc = spark.sparkContext

try :
  lines = sc.textFile('os_maias.txt') \
  .filter( lambda line : len(line) > 1 )

  structured_lines = lines.map( lambda line : Row( line = line, listOfWords = line.split(' ') ) )
  
  wordsOfLine = spark.createDataFrame( structured_lines )
  
  x = wordsOfLine.select(explode("listOfWords").alias('words')) \
      .groupBy('words').count() \
      .orderBy('count', ascending=False) \
      .limit(10)

  
  x.show()
except Exception as err:
  print(err)
  sc.stop()

+-----+-----+
|words|count|
+-----+-----+
|   de| 8308|
|    a| 6720|
|    o| 6602|
|  que| 4846|
|    e| 4441|
|    -| 3535|
|   um| 3004|
|  com| 2792|
|   do| 2564|
|   da| 2200|
+-----+-----+



## 2. Weblog Analysis

Consider a set of log files captured during a DDOS (*Distributed Denial of Service*) attack, containing information about the web accesses made to the server during the attack.

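The log files contain one record per line, with the fields shown below. The exercise specifies TAB as the separator, but the downloaded `web.log` appears to use single spaces, which is what the solutions below split on: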

| date | IP_source | status_code | operation | URL | execution time |
|-|-|-|-|-|-|
| timestamp | string | int | string | string | float |
| 2016-12-06T08:58:35.318+0000 | 37.139.9.11 | 404 | GET | /codemove/TTCENCUFMH3C | 0.026 |

In [13]:
#@title Download the dataset
!wget -q -O web.log https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0
!head -1 web.log

2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026  
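
As an illustration of the record layout, the sketch below parses the sample line above into typed fields in plain Python. The names `LogRecord` and `parse_log_line` are hypothetical, and splitting on any whitespace is an assumption chosen so the parser works whether the separator is a TAB or a space.

```python
from datetime import datetime
from typing import NamedTuple

class LogRecord(NamedTuple):
    date: datetime
    ip_source: str
    status_code: int
    operation: str
    url: str
    exec_time: float

def parse_log_line(line: str) -> LogRecord:
    """Split one log line on whitespace and convert each field to its type."""
    date, ip, status, op, url, exec_time = line.split()
    return LogRecord(
        # %f reads the milliseconds, %z reads the +0000 offset
        date=datetime.strptime(date, "%Y-%m-%dT%H:%M:%S.%f%z"),
        ip_source=ip,
        status_code=int(status),
        operation=op,
        url=url,
        exec_time=float(exec_time),
    )

record = parse_log_line(
    "2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026"
)
print(record.status_code, record.operation, record.exec_time)
```

The Spark solutions below keep `date` as a string and only convert the execution time to `float`; parsing the timestamp is optional for the 10-second exercises, since they rely on the string prefix alone.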


2.1. Count the number of unique IP addresses involved in the DDOS attack.


In [14]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
                    .appName('weblog').getOrCreate()
sc = spark.sparkContext
try :
    lines = sc.textFile('web.log')
    # Parse each non-empty line into a Row with one field per log column
    logRows = lines.filter( lambda line : len(line) > 0 ) \
                   .map( lambda line : line.split(' ') ) \
                   .map( lambda l : Row( date = l[0],
                                         ipSource = l[1], retValue = l[2],
                                         op = l[3], url = l[4], time = float(l[5])))
                   
    logRowsDF = spark.createDataFrame( logRows )
    countIps = logRowsDF.select('ipSource').distinct().count()
    
    print(countIps)
    sc.stop()
except Exception as err:
    print(err)
    sc.stop()

167


2.2. For each interval of 10 seconds, provide the following information: [number of requests, average execution time, maximum time, minimum time]

In [15]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
                    .appName('weblog').getOrCreate()
sc = spark.sparkContext
try :
    lines = sc.textFile('web.log')
    # Parse each non-empty line into a Row with one field per log column
    logRows = lines.filter( lambda line : len(line) > 0 ) \
                   .map( lambda line : line.split(' ') ) \
                   .map( lambda l : Row( date = l[0],
                                         ipSource = l[1], retValue = l[2],
                                         op = l[3], url = l[4], time = float(l[5])))
                   
                   
    interval = udf(lambda timestamp: timestamp[0:18], StringType())


    logRowsDF = spark.createDataFrame( logRows )
    intervals = logRowsDF.select(interval('date').alias("interval"), "time")
    x = intervals.groupBy('interval').agg( count('*').alias('count'), avg('time'), min('time'), max('time')) \
        .orderBy('interval')

    x.show(10)
    sc.stop()
except Exception as err:
    print(err)
    sc.stop()

+------------------+-----+------------------+---------+---------+
|          interval|count|         avg(time)|min(time)|max(time)|
+------------------+-----+------------------+---------+---------+
|2016-12-06T08:58:3|  483|7.5934244306418215|    0.013|   46.849|
|2016-12-06T08:58:4| 2611|30.159845653006503|    0.014|   69.654|
|2016-12-06T08:58:5| 5500| 38.52511163636371|    0.017|   80.846|
|2016-12-06T08:59:0| 6914| 38.53438212322824|    0.018|   81.659|
|2016-12-06T08:59:1| 6271| 32.96384978472328|    0.017|   83.993|
|2016-12-06T08:59:2| 5434| 17.29333143172616|    0.051|   77.967|
|2016-12-06T08:59:3| 8015|11.210152214597631|    0.056|   67.441|
|2016-12-06T08:59:4| 7947| 7.761815779539431|    0.914|   65.706|
|2016-12-06T08:59:5| 5983| 3.821664382416849|    0.678|    54.29|
|2016-12-06T09:00:0| 6882| 8.649971519907023|    0.017|   45.314|
+------------------+-----+------------------+---------+---------+
only showing top 10 rows
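
The `interval` UDF above works because the first 18 characters of the timestamp end at the tens digit of the seconds, so every request in the same 10-second span shares that prefix. The plain-Python sketch below shows the same bucketing and the four statistics on a few made-up records; the helper name `ten_second_bucket` and the sample values are assumptions for illustration only.

```python
from collections import defaultdict

def ten_second_bucket(timestamp: str) -> str:
    """Keep 'YYYY-MM-DDTHH:MM:S' (18 chars): everything up to the tens digit of the seconds."""
    return timestamp[0:18]

# Hypothetical (timestamp, execution time) pairs in the log's format
records = [
    ("2016-12-06T08:58:35.318+0000", 0.026),
    ("2016-12-06T08:58:39.900+0000", 0.5),
    ("2016-12-06T08:58:40.001+0000", 2.0),
]

times_by_bucket = defaultdict(list)
for timestamp, exec_time in records:
    times_by_bucket[ten_second_bucket(timestamp)].append(exec_time)

# [number of requests, average, maximum, minimum] per bucket, ordered by bucket
stats = {
    bucket: [len(t), sum(t) / len(t), max(t), min(t)]
    for bucket, t in sorted(times_by_bucket.items())
}
for bucket, values in stats.items():
    print(bucket, values)
```

The prefix trick only works for interval lengths that line up with a decimal digit of the timestamp (10 seconds, 1 minute, 10 minutes, and so on).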



2.3. Create an inverted index that, for each 10-second interval and each URL, lists the unique IPs that accessed it.

In [24]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
                    .appName('weblog').getOrCreate()
sc = spark.sparkContext
try :
    lines = sc.textFile('web.log')
    # Parse each non-empty line into a Row with one field per log column
    logRows = lines.filter( lambda line : len(line) > 0 ) \
                   .map( lambda line : line.split(' ') ) \
                   .map( lambda l : Row( date = l[0],
                                         ipSource = l[1], retValue = l[2],
                                         op = l[3], url = l[4], time = float(l[5])))
                   
                   
    interval = udf(lambda timestamp: timestamp[0:18], StringType())

    # Declare an integer result; without a return type, udf() defaults to StringType
    countIps = udf( lambda l : len(l), IntegerType() )

    logRowsDF = spark.createDataFrame( logRows )
    intervals = logRowsDF.select(interval('date').alias("interval"), 'ipSource', "url")
    
    stats = intervals.groupBy('interval', 'url').agg( collect_set('ipSource').alias('ips')) \
    .orderBy('interval', 'url', ascending=False) 

    #stats = intervals.groupBy('interval', 'url').agg( collect_set('ipSource').alias('ips')) \
    #.select( "*", countIps('ips').alias('#ips')).orderBy('interval', 'url', '#ips', ascending=False)

    stats.show(10)

    sc.stop()
except Exception as err:
    print(err)
    sc.stop()

+------------------+--------------------+--------------------+
|          interval|                 url|                 ips|
+------------------+--------------------+--------------------+
|2016-12-06T10:03:2|/codemove/9JVQI8T...|[106.37.189.69, 1...|
|2016-12-06T10:03:2|/codemove/79SC2H8...|[106.37.189.69, 1...|
|2016-12-06T10:03:1|/codemove/NCZX4FB...|[106.37.189.69, 2...|
|2016-12-06T10:03:1|/codemove/9JVQI8T...|[106.37.189.69, 2...|
|2016-12-06T10:03:1|/codemove/79SC2H8...|[106.37.189.69, 2...|
|2016-12-06T10:03:0|/codemove/NCZX4FB...|[106.37.189.69, 1...|
|2016-12-06T10:03:0|/codemove/9JVQI8T...|[106.37.189.69, 1...|
|2016-12-06T10:03:0|/codemove/79SC2H8...|[106.37.189.69, 1...|
|2016-12-06T10:02:5|/codemove/9JVQI8T...|[106.37.189.69, 1...|
|2016-12-06T10:02:5|/codemove/79SC2H8...|[106.37.189.69, 2...|
+------------------+--------------------+--------------------+
only showing top 10 rows
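
The grouping above amounts to a dictionary keyed by (interval, URL) whose values are sets of IPs. The following plain-Python sketch builds that structure for a handful of made-up accesses; the sample triples are hypothetical and the set plays the role that `collect_set` plays in the Spark version.

```python
from collections import defaultdict

# Hypothetical (timestamp, ip, url) triples in the log's format
accesses = [
    ("2016-12-06T08:58:35.318+0000", "37.139.9.11", "/codemove/A"),
    ("2016-12-06T08:58:36.100+0000", "37.139.9.11", "/codemove/A"),  # repeated IP
    ("2016-12-06T08:58:38.700+0000", "10.0.0.7", "/codemove/A"),
    ("2016-12-06T08:58:41.000+0000", "10.0.0.7", "/codemove/B"),
]

# (interval, url) -> set of IPs; a set keeps each IP once
index = defaultdict(set)
for timestamp, ip, url in accesses:
    index[(timestamp[0:18], url)].add(ip)

for key in sorted(index):
    print(key, sorted(index[key]))
```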



2.4. Create an inverted index that, for each 15-second interval and each URL, lists the unique IPs that accessed it.

In [28]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

from dateutil.parser import parse

origin = parse('2016-12-06T08:58:35.318+0000').timestamp()

spark = SparkSession.builder.master('local[*]') \
                    .appName('weblog').getOrCreate()
sc = spark.sparkContext
try :
    lines = sc.textFile('web.log')
    # Parse each non-empty line into a Row with one field per log column
    logRows = lines.filter( lambda line : len(line) > 0 ) \
                   .map( lambda line : line.split(' ') ) \
                   .map( lambda l : Row( date = l[0],
                                         ipSource = l[1], retValue = l[2],
                                         op = l[3], url = l[4], time = float(l[5])))
                  
    logRowsDF = spark.createDataFrame( logRows )

    # use window() to define the interval
    intervals = logRowsDF.select(col('date'), 'ipSource', "url") \
        .select('*', window("date", "15 seconds").alias('interval'))
    
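    # The built-in size() replaces the countIps UDF, which is only defined in an earlier cell;
    # it returns an integer, so the ordering below is numeric rather than lexicographic
    stats = intervals.groupBy('interval', 'url').agg( collect_set('ipSource').alias('ips')) \
    .select( "*", size('ips').alias('#ips')).orderBy('#ips', ascending=False)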

    stats.show(10)

    sc.stop()
except Exception as err:
    print(err)
    sc.stop()

+--------------------+--------------------+--------------------+----+
|            interval|                 url|                 ips|#ips|
+--------------------+--------------------+--------------------+----+
|{2016-12-06 09:01...|/codemove/JUANR8S...|[95.128.43.164, 1...|   8|
|{2016-12-06 09:02...|/codemove/JUANR8S...|[185.69.154.59, 5...|   5|
|{2016-12-06 09:02...|/codemove/C11AWNK...|[185.69.154.59, 2...|   5|
|{2016-12-06 09:01...|/codemove/JUANR8S...|[163.172.67.180, ...|   4|
|{2016-12-06 10:02...|/codemove/79SC2H8...|[106.37.189.69, 2...|   3|
|{2016-12-06 10:01...|/codemove/2J6KRPS...|[106.37.189.69, 2...|   3|
|{2016-12-06 10:01...|/codemove/2J6KRPS...|[106.37.189.69, 2...|   3|
|{2016-12-06 10:02...|/codemove/79SC2H8...|[106.37.189.69, 2...|   3|
|{2016-12-06 10:02...|/codemove/LHWWNO3...|[106.37.189.69, 2...|   3|
|{2016-12-06 10:01...|/codemove/2J6KRPS...|[106.37.189.69, 2...|   3|
+--------------------+--------------------+--------------------+----+
only showing top 10 rows
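
A 15-second interval does not line up with any prefix of the timestamp string, which is why this version uses `window()` instead of slicing. To illustrate what a tumbling window computes, the plain-Python sketch below floors a timestamp to the enclosing 15-second window. The helper name `tumbling_window` is hypothetical; aligning windows to the Unix epoch matches the default alignment of Spark's `window()` when no `startTime` offset is given.

```python
from datetime import datetime, timedelta, timezone

def tumbling_window(timestamp: str, seconds: int = 15):
    """Return the [start, end) window of the given length that contains the timestamp.

    Windows are aligned to 1970-01-01 00:00:00 UTC.
    """
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f%z")
    epoch_seconds = int(moment.timestamp())          # whole seconds since the epoch
    start_seconds = epoch_seconds - epoch_seconds % seconds
    start = datetime.fromtimestamp(start_seconds, tz=timezone.utc)
    return start, start + timedelta(seconds=seconds)

start, end = tumbling_window("2016-12-06T08:58:35.318+0000")
print(start.isoformat(), end.isoformat())
```

Grouping by the `(start, end)` pair instead of the string prefix turns the 10-second inverted index into the 15-second one.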