# Python Spark SQL Exercises

For this set of exercises, use SQL statements as much as possible.

Check this online resource for help with [SQL queries](https://www.codecademy.com/learn/learn-sql/modules/learn-sql-queries/cheatsheet).

In [2]:
#@title Install Pyspark
!pip install --quiet pyspark

In [3]:
#@title Download "Os Maias"
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0
!wc os_maias.txt

   5877  216896 1292368 os_maias.txt


## 1. Sorted Word Frequency

1.1) Create a [Spark SQL](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html) program that counts the occurrences of each word in the novel "Os Maias" and sorts the words by frequency, most frequent first.


In [6]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]').appName('words').getOrCreate()
sc = spark.sparkContext

try :
  lines = sc.textFile('os_maias.txt') \
  .filter( lambda line : len(line) > 1 ) \
  .map( lambda line : Row( line = line ) )

  linesDF = spark.createDataFrame( lines )
  linesDF.createOrReplaceTempView("OSMAIAS")

  x = spark.sql("SELECT word, count(*) as freq FROM \
                  (SELECT explode(split(line, ' ')) as word FROM OSMAIAS) \
                  GROUP BY word ORDER BY freq DESC")

  x.show(20)
except Exception as err:
  print(err)

+----+----+
|word|freq|
+----+----+
|  de|8308|
|   a|6720|
|   o|6602|
| que|4846|
|   e|4441|
|   -|3535|
|  um|3004|
| com|2792|
|  do|2564|
|  da|2200|
| uma|2154|
|  os|1762|
|para|1733|
|   E|1602|
| não|1586|
|  em|1505|
|  no|1439|
|  se|1427|
|  as|1401|
|  ao|1391|
+----+----+
only showing top 20 rows
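
The query above splits each line on a single space, turns every token into its own row with `explode`, and then groups and counts. As a rough cross-check of that logic, the same steps can be sketched in plain Python. The sample lines below are hypothetical stand-ins for the novel's text, and the alphabetical tie-break is an assumption added only to make the ordering deterministic. As in the Spark output, where `e` and `E` appear as separate rows, the count is case-sensitive.

```python
from collections import Counter

# Hypothetical sample lines standing in for the novel's text.
sample_lines = [
    "a casa de Lisboa",
    "a casa que os Maias",
    "E a casa",
]

def word_frequencies(lines):
    """Count space-separated tokens and return them most frequent first."""
    counts = Counter()
    for line in lines:
        if len(line) > 1:  # same line filter as the notebook cell
            # split(" ") mirrors split(line, ' '); update() plays the role of explode + count
            counts.update(line.split(" "))
    # Descending frequency; ties broken alphabetically (an assumption, for determinism)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

top = word_frequencies(sample_lines)
print(top[:2])
```

Lowercasing each token before counting would merge entries such as `e` and `E`, if that is the desired behavior.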



1.2) Create a Spark DataFrames program that computes the 10 most frequently used words in the novel "Os Maias".

In [7]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]').appName('words').getOrCreate()
sc = spark.sparkContext

try :
  lines = sc.textFile('os_maias.txt') \
  .filter( lambda line : len(line) > 1 ) \
  .map( lambda line : Row( line = line ) )

  linesDF = spark.createDataFrame( lines )
  linesDF.createOrReplaceTempView("OSMAIAS")

  x = spark.sql("SELECT word, count(*) as freq FROM \
                  (SELECT explode(split(line, ' ')) as word FROM OSMAIAS) \
                  GROUP BY word ORDER BY freq DESC LIMIT 10")

  x.show(20)
except Exception as err:
  print(err)

+----+----+
|word|freq|
+----+----+
|  de|8308|
|   a|6720|
|   o|6602|
| que|4846|
|   e|4441|
|   -|3535|
|  um|3004|
| com|2792|
|  do|2564|
|  da|2200|
+----+----+



## 2. Weblog Analysis

Consider a set of log files captured during a DDOS (*Distributed Denial of Service*) attack, containing information about the web requests made to the server during the attack.

The log files contain text lines as shown below, with TAB as the separator:

| date | IP_source | status_code | operation | URL | execution time |
|-|-|-|-|-|-|
| timestamp | string | int | string | string | float |
| 2016-12-06T08:58:35.318+0000 | 37.139.9.11 | 404 | GET | /codemove/TTCENCUFMH3C | 0.026 |

In [8]:
#@title Download the dataset
!wget -q -O web.log https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0
!head -1 web.log

!echo "date ipSource retValue op url time" > weblog_with_header.log
!cat web.log >> weblog_with_header.log
!head -2 weblog_with_header.log

2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026  
date ipSource retValue op url time
2016-12-06T08:58:35.318+0000 37.139.9.11 404 GET /codemove/TTCENCUFMH3C 0.026  


2.1. Count the number of unique IP addresses involved in the DDOS attack.


In [10]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
    .appName('weblog').getOrCreate()
sc = spark.sparkContext
try :
    logRows = spark.read.csv('weblog_with_header.log',
                             sep =' ', header=True, inferSchema=True)

    logRows.createOrReplaceTempView("WEBLOG")

#    x = spark.sql("SELECT count(*) FROM \
#                    (SELECT DISTINCT ipSource FROM WEBLOG)")

    x = spark.sql("SELECT count(DISTINCT ipSource) FROM WEBLOG")

    x.show()
except Exception as err:
    print(err)

+------------------------+
|count(DISTINCT ipSource)|
+------------------------+
|                     167|
+------------------------+



2.2. For each interval of 10 seconds, provide the following information: [number of requests, average execution time, maximum time, minimum time]

In [35]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
    .appName('weblog').getOrCreate()
sc = spark.sparkContext
try :
    logRows = spark.read.csv('weblog_with_header.log',
                             sep =' ', header=True, inferSchema=True)

    logRows.createOrReplaceTempView("WEBLOG")


    x = spark.sql("SELECT from_unixtime((unix_timestamp(date) div 10) * 10) as intervalo, count(*) as requests, \
                      min(time), max(time), mean(time) FROM WEBLOG \
                      GROUP BY intervalo \
                      ORDER BY intervalo")

    x.show()
except Exception as err:
    print(err)

+-------------------+--------+---------+---------+-------------------+
|          intervalo|requests|min(time)|max(time)|         mean(time)|
+-------------------+--------+---------+---------+-------------------+
|2016-12-06 08:58:30|     483|    0.013|   46.849| 7.5934244306418215|
|2016-12-06 08:58:40|    2611|    0.014|   69.654| 30.159845653006503|
|2016-12-06 08:58:50|    5500|    0.017|   80.846|  38.52511163636371|
|2016-12-06 08:59:00|    6914|    0.018|   81.659|  38.53438212322824|
|2016-12-06 08:59:10|    6271|    0.017|   83.993|  32.96384978472328|
|2016-12-06 08:59:20|    5434|    0.051|   77.967|  17.29333143172616|
|2016-12-06 08:59:30|    8015|    0.056|   67.441| 11.210152214597631|
|2016-12-06 08:59:40|    7947|    0.914|   65.706|  7.761815779539431|
|2016-12-06 08:59:50|    5983|    0.678|    54.29|  3.821664382416849|
|2016-12-06 09:00:00|    6882|    0.017|   45.314|  8.649971519907023|
|2016-12-06 09:00:10|    9719|    0.225|   34.406|  7.857372672085602|
|2016-
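
The expression `(unix_timestamp(date) div 10) * 10` floors each timestamp to the start of its 10-second interval: integer division discards the remainder, and multiplying back gives the interval's first second. A minimal plain-Python sketch of that arithmetic and of the per-interval aggregates follows. The function names, the UTC formatting, and all sample rows except the first (which is the log line shown earlier) are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_start(ts, width=10):
    """Floor an ISO-like log timestamp to the start of its width-second interval."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")
    epoch = int(dt.timestamp())       # whole seconds, like unix_timestamp
    start = (epoch // width) * width  # same arithmetic as (x div 10) * 10
    # Formatting in UTC is an assumption; Spark uses the session time zone
    return datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

def interval_stats(rows, width=10):
    """Group (timestamp, time) pairs by interval and compute count/min/max/mean."""
    groups = defaultdict(list)
    for ts, elapsed in rows:
        groups[bucket_start(ts, width)].append(elapsed)
    return {
        start: {
            "requests": len(times),
            "min": min(times),
            "max": max(times),
            "mean": sum(times) / len(times),
        }
        for start, times in sorted(groups.items())
    }

# First row comes from the log sample; the other two are hypothetical.
rows = [
    ("2016-12-06T08:58:35.318+0000", 0.026),
    ("2016-12-06T08:58:39.900+0000", 0.5),
    ("2016-12-06T08:58:40.001+0000", 1.0),
]
stats = interval_stats(rows)
for start, s in stats.items():
    print(start, s)
```

The first two rows fall into the interval starting at `08:58:30` and the third into the one starting at `08:58:40`, which matches how the SQL query assigns rows to `intervalo`.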

2.3. Create an inverted index that, for each 10-second interval, lists the unique IPs that made requests (to each URL).

In [40]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
    .appName('weblog').getOrCreate()
sc = spark.sparkContext
try :
    logRows = spark.read.csv('weblog_with_header.log',
                             sep =' ', header=True, inferSchema=True)
    logRows.createOrReplaceTempView("WEBLOG")


    x = spark.sql("SELECT from_unixtime((unix_timestamp(date) div 10) * 10) as intervalo, collect_set( ipSource) as Ips FROM WEBLOG \
                          GROUP BY intervalo \
                          ORDER BY intervalo DESC")

    x.show(truncate=False)
except Exception as err:
    print(err)

+-------------------+-----------------------------------------------+
|intervalo          |Ips                                            |
+-------------------+-----------------------------------------------+
|2016-12-06 10:03:20|[106.37.189.69, 123.127.217.155]               |
|2016-12-06 10:03:10|[106.37.189.69, 222.35.13.232, 123.127.217.155]|
|2016-12-06 10:03:00|[106.37.189.69, 123.127.217.155]               |
|2016-12-06 10:02:50|[106.37.189.69, 222.35.13.232, 123.127.217.155]|
|2016-12-06 10:02:40|[106.37.189.69, 222.35.13.232, 123.127.217.155]|
|2016-12-06 10:02:30|[106.37.189.69, 222.35.13.232, 123.127.217.155]|
|2016-12-06 10:02:20|[106.37.189.69, 222.35.13.232, 123.127.217.155]|
|2016-12-06 10:02:10|[106.37.189.69, 222.35.13.232, 123.127.217.155]|
|2016-12-06 10:02:00|[106.37.189.69, 123.127.217.155]               |
|2016-12-06 10:01:50|[106.37.189.69, 123.127.217.155]               |
|2016-12-06 10:01:40|[106.37.189.69, 123.127.217.155]               |
|2016-12-06 10:01:30
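
The query above groups by interval only, so each row holds the set of IPs seen in that interval regardless of URL. If the exercise is read as asking for IPs per interval and per URL, the index key becomes the pair (interval, URL). The plain-Python sketch below illustrates that reading under stated assumptions: the function names, the `/other` URL, and the record timestamps after the first are hypothetical, while the IPs are taken from the outputs shown in this notebook. The `width` parameter also covers the 15-second variant.

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_start(ts, width):
    """Floor a log timestamp to the start of its width-second interval (UTC assumed)."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")
    start = (int(dt.timestamp()) // width) * width
    return datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

def inverted_index(records, width=10):
    """Map (interval start, URL) to the sorted list of unique IPs that requested it."""
    index = defaultdict(set)  # a set removes duplicate IPs, like collect_set
    for ts, ip, url in records:
        index[(bucket_start(ts, width), url)].add(ip)
    return {key: sorted(index[key]) for key in sorted(index)}

# Hypothetical records: (timestamp, source IP, URL).
records = [
    ("2016-12-06T08:58:35.318+0000", "37.139.9.11", "/codemove/TTCENCUFMH3C"),
    ("2016-12-06T08:58:37.000+0000", "37.139.9.11", "/codemove/TTCENCUFMH3C"),
    ("2016-12-06T08:58:44.000+0000", "106.37.189.69", "/codemove/TTCENCUFMH3C"),
    ("2016-12-06T08:58:46.000+0000", "222.35.13.232", "/other"),
]

index_10s = inverted_index(records, width=10)
index_15s = inverted_index(records, width=15)
for key, ips in index_10s.items():
    print(key, ips)
```

In Spark SQL, the corresponding change would be to add `url` to both the `SELECT` list and the `GROUP BY` clause. Because 60 is a multiple of 15, the 15-second intervals start at seconds 00, 15, 30, and 45 of each minute.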

2.4. Create an inverted index that, for each 15-second interval, lists the unique IPs that made requests (to each URL).

In [41]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
    .appName('weblog').getOrCreate()
sc = spark.sparkContext
try :
    logRows = spark.read.csv('weblog_with_header.log',
                             sep =' ', header=True, inferSchema=True)
    logRows.createOrReplaceTempView("WEBLOG")

    x = spark.sql("SELECT from_unixtime((unix_timestamp(date) div 15) * 15) as intervalo, collect_set(ipSource) as Ips \
                      FROM WEBLOG GROUP BY intervalo ORDER BY intervalo")

    x.show(truncate = False)
except Exception as err:
    print(err)

