## Getting Spark ready

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [3]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark

## Preprocessing

#### Read log file

Every line of the log file ends up in a single column named 'value'!

Let's define a column for each piece of info.

In [9]:
df = spark.read.text('Log')
df.show(5, truncate=False)

+-----------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                  |
+-----------------------------------------------------------------------------------------------------------------------+
|199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245                                 |
|unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985                      |
|199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085   |
|burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0               |
|199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179|
+-----------------------------------------------------------------------------------------------------------------------+

#### Split columns

The 'value' column is split on every space (' '), and the useful pieces are stored in their own columns. At the end, the 'value' column is dropped.

The schema shows that all columns are strings! We will fix that later.

In [42]:
from pyspark.sql.functions import col, split, substring, length, to_date, to_timestamp

split_col = split(col('value'), ' ')

splitDF = df.withColumn("host",split_col.getItem(0)) \
    .withColumn("datetime",split_col.getItem(3)) \
    .withColumn("requesttype",split_col.getItem(5)) \
    .withColumn("requestURL",split_col.getItem(6)) \
    .withColumn("version",split_col.getItem(7)) \
    .withColumn("response1",split_col.getItem(8)) \
    .withColumn("response2",split_col.getItem(9)) \
    .drop("value")   

splitDF.printSchema()
splitDF.show(5, truncate = False)

root
 |-- host: string (nullable = true)
 |-- datetime: string (nullable = true)
 |-- requesttype: string (nullable = true)
 |-- requestURL: string (nullable = true)
 |-- version: string (nullable = true)
 |-- response1: string (nullable = true)
 |-- response2: string (nullable = true)

+--------------------+---------------------+-----------+-----------------------------------------------+---------+---------+---------+
|host                |datetime             |requesttype|requestURL                                     |version  |response1|response2|
+--------------------+---------------------+-----------+-----------------------------------------------+---------+---------+---------+
|199.72.81.55        |[01/Jul/1995:00:00:01|"GET       |/history/apollo/                               |HTTP/1.0"|200      |6245     |
|unicomp6.unicomp.net|[01/Jul/1995:00:00:06|"GET       |/shuttle/countdown/                            |HTTP/1.0"|200      |3985     |
|199.120.110.21      |[01/Jul/1995:00
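
Splitting on spaces works as long as every line has the same number of space-separated pieces. As a plain-Python sketch (not part of the Spark pipeline above), the same fields can be pulled out with one regular expression that treats the quoted request as a unit. `LOG_PATTERN`, `parse_line`, and the `tricky` line are made up for illustration; only `sample` comes from the log shown earlier.

```python
import re

# Common Log Format: host ident user [timestamp] "request" status size.
# The quoted request is captured as a whole, so spaces inside it cannot shift
# the status and size fields.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>.*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The request is usually "METHOD URL VERSION", but the version may be missing.
    parts = fields.pop("request").split(" ")
    has_version = len(parts) > 1 and parts[-1].startswith("HTTP/")
    fields["method"] = parts[0]
    fields["url"] = " ".join(parts[1:-1] if has_version else parts[1:])
    fields["version"] = parts[-1] if has_version else None
    fields["status"] = int(fields["status"])
    fields["size"] = None if fields["size"] == "-" else int(fields["size"])
    return fields


# First line of the log shown earlier.
sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
# Made-up line with a space inside the URL and "-" as the size.
tricky = 'host.example - - [01/Jul/1995:00:00:12 -0400] "GET /images/my file.gif HTTP/1.0" 404 -'

parsed_sample = parse_line(sample)
parsed_tricky = parse_line(tricky)
print(parsed_sample)
print(parsed_tricky)
```

With a plain split on spaces, the `tricky` line would put `file.gif` where the version belongs and push the status code one column to the right.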

#### Remove extra junk at beginning or end of column values

Even after splitting on spaces, some columns still have stray characters at the start or end of their values, like '[' or '"'.

Let's fix that by keeping only the correct substring of each of those columns.

Now our table looks beautiful :)

In [43]:
splitDF = splitDF.withColumn("datetime", substring("datetime", 2, 20)) \
                 .withColumn("requesttype", substring("requesttype", 2, 4)) \
                 .withColumn("version", substring("version", 1, 8))
splitDF.show(5, truncate = False)

+--------------------+--------------------+-----------+-----------------------------------------------+--------+---------+---------+
|host                |datetime            |requesttype|requestURL                                     |version |response1|response2|
+--------------------+--------------------+-----------+-----------------------------------------------+--------+---------+---------+
|199.72.81.55        |01/Jul/1995:00:00:01|GET        |/history/apollo/                               |HTTP/1.0|200      |6245     |
|unicomp6.unicomp.net|01/Jul/1995:00:00:06|GET        |/shuttle/countdown/                            |HTTP/1.0|200      |3985     |
|199.120.110.21      |01/Jul/1995:00:00:09|GET        |/shuttle/missions/sts-73/mission-sts-73.html   |HTTP/1.0|200      |4085     |
|burger.letters.com  |01/Jul/1995:00:00:11|GET        |/shuttle/countdown/liftoff.html                |HTTP/1.0|304      |0        |
|199.120.110.21      |01/Jul/1995:00:00:11|GET        |/shuttle/missi
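
The positions passed to `substring` are 1-based, which is easy to trip over when coming from Python slices. Here is a small plain-Python stand-in to check the numbers used above. `substring_1_based` is a hypothetical helper, and it only covers positions that fall inside the string; it is not a full reimplementation of Spark's function.

```python
def substring_1_based(text, pos, length):
    """Plain-Python stand-in for substring(str, pos, len) with in-range positions.

    pos >= 1 counts from the start (1-based); pos < 0 counts back from the end.
    """
    if pos == 0:
        raise ValueError("pos == 0 is not covered by this sketch")
    start = pos - 1 if pos > 0 else len(text) + pos
    return text[start:start + length]


# Same arguments as in the cell above, applied to values from the split table.
clean_datetime = substring_1_based("[01/Jul/1995:00:00:01", 2, 20)
clean_method = substring_1_based('"GET', 2, 4)
clean_version = substring_1_based('HTTP/1.0"', 1, 8)

print(clean_datetime, clean_method, clean_version)
```

Asking for 4 characters of the request type leaves room for methods like POST or HEAD; for GET the slice simply stops at the end of the string.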

#### Cast to appropriate types

The datetime column is converted to a date (the time of day is dropped) and the response columns are cast to integers for easier calculations.



In [52]:
splitDF = splitDF.withColumn("datetime", to_date('datetime', 'dd/MMM/yyyy:HH:mm:ss')) \
                .withColumn("response1", col('response1').cast('int')) \
                .withColumn("response2", col('response2').cast('int')) 

splitDF.printSchema()

root
 |-- host: string (nullable = true)
 |-- datetime: date (nullable = true)
 |-- requesttype: string (nullable = true)
 |-- requestURL: string (nullable = true)
 |-- version: string (nullable = true)
 |-- response1: integer (nullable = true)
 |-- response2: integer (nullable = true)
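
To see what these conversions do to a single value, here is a plain-Python analogue (a sketch, not the Spark code itself). It assumes an English or C locale so that `%b` understands `Jul`, and `to_int_or_none` is a hypothetical helper that imitates a cast returning null for non-numeric strings, which is what Spark does with its default (non-ANSI) settings as far as I know.

```python
from datetime import date, datetime

raw_datetime = "01/Jul/1995:00:00:01"

# Python's strptime equivalent of the 'dd/MMM/yyyy:HH:mm:ss' pattern.
parsed = datetime.strptime(raw_datetime, "%d/%b/%Y:%H:%M:%S")

# Like to_date, keep only the calendar day and drop the time of day.
day_only = parsed.date()


def to_int_or_none(text):
    """Return int(text), or None when text is not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


print(day_only)
print(to_int_or_none("6245"), to_int_or_none("-"))
```

Because only the day survives, grouping by the datetime column later on groups by day.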



## Part 1

The hosts are selected, only distinct values are kept, and they are counted.

We have over 12,000 distinct hosts! Nice!

In [54]:
splitDF.select('host').distinct().count()

12133

## Part 2

The first groupBy counts the number of requests each host makes on each day.

We want the daily average of requests for each host, so we average those daily counts per host over the days on which the host appears.

The output shows the busiest hosts and their average daily number of requests.

In [66]:
splitDF.groupBy('host', 'datetime').count() \
       .groupBy('host').avg('count') \
       .orderBy('avg(count)', ascending=False) \
       .show(10, truncate=False)

+--------------------+------------------+
|host                |avg(count)        |
+--------------------+------------------+
|piweba3y.prodigy.com|674.6666666666666 |
|alyssa.prodigy.com  |404.0             |
|134.83.184.18       |362.0             |
|burger.letters.com  |350.0             |
|piweba1y.prodigy.com|323.0             |
|piweba4y.prodigy.com|310.0             |
|disarray.demon.co.uk|301.3333333333333 |
|www-b6.proxy.aol.com|297.6666666666667 |
|mica.saglac.qc.ca   |220.0             |
|www-d4.proxy.aol.com|211.33333333333334|
+--------------------+------------------+
only showing top 10 rows
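
The two-step aggregation is easier to follow on a tiny example. The sketch below redoes it in plain Python on made-up data (the hosts and days are invented); note that a day on which a host made no requests does not count as a zero in its average.

```python
from collections import Counter, defaultdict

# Made-up (host, day) pairs, one entry per request.
requests = [
    ("a.example.com", "1995-07-01"),
    ("a.example.com", "1995-07-01"),
    ("a.example.com", "1995-07-01"),
    ("a.example.com", "1995-07-02"),
    ("b.example.com", "1995-07-02"),
    ("b.example.com", "1995-07-02"),
    ("b.example.com", "1995-07-02"),
    ("b.example.com", "1995-07-02"),
]

# Step 1: requests per (host, day), like groupBy('host', 'datetime').count().
daily_counts = Counter(requests)

# Step 2: average the daily counts per host, like groupBy('host').avg('count').
counts_per_host = defaultdict(list)
for (host, day), n in daily_counts.items():
    counts_per_host[host].append(n)

daily_average = {
    host: sum(counts) / len(counts) for host, counts in counts_per_host.items()
}

# a.example.com was active on two days (3 and 1 requests);
# b.example.com was active on one day only (4 requests).
print(daily_average)
```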



## Part 3

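With a negative start position, substring takes the last four characters of each URL, so we can filter and count the URLs that end in '.gif' (the comparison is case-sensitive).

There are over 81,000 of them!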

In [81]:
splitDF.filter(substring('requestURL', -4,4) == '.gif').count()

81832
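
Comparing the last four characters is the same idea as Python's `endswith`. A quick plain-Python check on a few URLs (two from the log above, two made up and marked as such) shows what the exact comparison counts and what it misses:

```python
urls = [
    "/shuttle/missions/sts-73/sts-73-patch-small.gif",  # from the log
    "/shuttle/countdown/liftoff.html",                   # from the log
    "/images/LOGO.GIF",                                  # made up: upper-case extension
    "/images/anim.gif?size=small",                       # made up: query string after .gif
]

# Exact comparison of the last four characters, as in the Spark filter.
exact_matches = [u for u in urls if u[-4:] == ".gif"]

# A case-insensitive variant would also catch '.GIF'.
any_case_matches = [u for u in urls if u.lower().endswith(".gif")]

print(len(exact_matches), len(any_case_matches))
```

Whether upper-case extensions or query strings actually occur in this log is not something the notebook checks, so the variant is only a possible refinement.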

## Part 4


#### Popular domains

First, any host that does not contain any English letters (a-z and A-Z) is removed, which drops the plain IP addresses.

Then the requests per host are counted. Hosts with a count of 3 or less are discarded, and the rest are sorted by count in ascending order, so the output starts with the hosts that made exactly 4 requests.

In [107]:
# The regex matches hosts with no English letters (a-z, A-Z); ~ negates it to keep the rest
splitDF.filter(~col('host').rlike("^[^a-zA-Z]*$")) \
       .groupBy('host').count() \
       .filter(col('count') > 3) \
       .orderBy('count', ascending=True) \
       .show(20, truncate=False)

+------------------------------+-----+
|host                          |count|
+------------------------------+-----+
|pm2_2.digital.net             |4    |
|ad02-001.compuserve.com       |4    |
|inlnet3.ftech.co.uk           |4    |
|elvira.thegap.com             |4    |
|unix.neont.com                |4    |
|race-server.race.u-tokyo.ac.jp|4    |
|dec5ki.cs.uregina.ca          |4    |
|cs1-08.sun.ptd.net            |4    |
|nb-dyna129.interaccess.com    |4    |
|kenwong.magna.com.au          |4    |
|ppp31.texoma.com              |4    |
|line105.nwm.mindlink.net      |4    |
|leo.racsa.co.cr               |4    |
|ip080.phx.primenet.com        |4    |
|ppp9.sbbs.se                  |4    |
|ix-sj12-17.ix.netcom.com      |4    |
|lcy-ip8.halcyon.com           |4    |
|blv-pm1-ip9.halcyon.com       |4    |
|mac-41-4.cern.ch              |4    |
|dram.cmu.susx.ac.uk           |4    |
+------------------------------+-----+
only showing top 20 rows
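
The regular expression is worth a quick look on its own. `^[^a-zA-Z]*$` matches a string that consists only of non-letters from start to end, so it matches IP addresses and nothing with a letter in it. The plain-Python sketch below tries it on four hosts from the log; `NO_LETTERS` is just a name picked for this example.

```python
import re

# Matches strings made up entirely of non-letters (digits, dots, ...).
NO_LETTERS = re.compile(r"^[^a-zA-Z]*$")

hosts = [
    "199.72.81.55",
    "unicomp6.unicomp.net",
    "199.120.110.21",
    "burger.letters.com",
]

# Keep the hosts that do NOT match, i.e. the ones containing a letter.
domains = [h for h in hosts if not NO_LETTERS.search(h)]

print(domains)
```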



#### Daily top domains

First, like the previous part, only domain names are kept and IP addresses are discarded.

This time, in 'daily_requests', the count is grouped by both datetime and host, because we want to find the top domain of each day.

In 'daily_max', the maximum count of each day is found with a simple groupBy on datetime and a max() aggregation.

However, 'daily_max' does not contain the host names, so we join it with 'daily_requests' on datetime and count to get the name of each day's top domain.

In [123]:
# daily count of requests for each host
daily_requests = splitDF.filter(col('host').rlike("^[^a-zA-Z]*$") == False) \
                        .groupBy('datetime', 'host').count()

# hosts with maximum daily request (no host column, only count)
daily_max = daily_requests.groupBy('datetime').max('count') \
                          .withColumnRenamed("max(count)", "count")

# join the 2 tables on count and datetime match (so we can see which host it belongs to)
# show() returns None, so keep the joined DataFrame first and display it separately
data_joined = daily_max.join(daily_requests, ['datetime', 'count'])
data_joined.show()

+----------+-----+--------------------+
|  datetime|count|                host|
+----------+-----+--------------------+
|1995-07-01|  623|piweba3y.prodigy.com|
|1995-07-02|  960|piweba3y.prodigy.com|
|1995-07-03|  441|piweba3y.prodigy.com|
+----------+-----+--------------------+
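
One thing to keep in mind with the max-then-join approach: if two hosts tie for the maximum on the same day, both come back from the join. The plain-Python sketch below mimics the two steps on made-up counts (hosts and numbers are invented) to show that behaviour.

```python
# Made-up daily request counts: (day, host) -> count.
daily_requests = {
    ("1995-07-01", "a.example.com"): 5,
    ("1995-07-01", "b.example.com"): 3,
    ("1995-07-02", "a.example.com"): 4,
    ("1995-07-02", "b.example.com"): 4,
}

# Step 1: maximum count per day (the host is lost here, as in 'daily_max').
daily_max = {}
for (day, host), n in daily_requests.items():
    daily_max[day] = max(daily_max.get(day, 0), n)

# Step 2: "join" back on (day, count) to recover the host names.
top_hosts = sorted(
    (day, n, host)
    for (day, host), n in daily_requests.items()
    if n == daily_max[day]
)

for row in top_hosts:
    print(row)
```

Here 1995-07-02 yields two rows because both hosts reached the maximum of 4. The three days in the real output each have a single winner, so the tie case does not show up there.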



## Part 5

First, rows with response code 200 are discarded.

Because we want a column-wise output, there should be one column per remaining response code and a single row containing the count of each code.

We group by a literal value so that all our data falls into one group, and pivot on the response code.

In [145]:
from pyspark.sql.functions import lit

# Parentheses (instead of backslashes) allow comments between the chained calls
error_counts = (
    splitDF.filter(col('response1') != 200)
           # Want all data to be in 1 row
           .groupBy(lit('count').alias("HTTP error"))
           # Want columnwise table
           .pivot("response1")
           .count()
)
error_counts.show()

+----------+---+---+---+----+----+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+------+------+-------+-------+
|HTTP error|  0| 68|234| 302| 304|363|403|404|500|651|669|786|1204|1289|1380|1391|1713|1879|2537|3047|3092|3214|3635|3985|4179|4209|4538|5544|5866|7008|7074|7124|7634|8763|10099|11175|11326|11473|11644|11853|11934|12040|12054|12169|12290|12859|13116|14897|16102|17083|17314|18128|19084|19092|20271|25439|25814|30232|30995|34029|34546|35540|40310|40960|42732|44153|45966|45970|46888|49152|52491|59752|63942|64379|64427|64910|64939|65536|67310|77646|78183
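
The group-by-a-literal-then-pivot trick boils down to "count each code and lay the counts out side by side". Here is a plain-Python sketch of that idea on made-up values. The output above also has columns such as 0, 68, and 234 that are not HTTP status codes; a likely cause (an assumption, not verified against the data) is log lines where splitting on spaces shifted the fields, so the sketch also keeps only values in the 100-599 range.

```python
from collections import Counter

# Made-up 'response1' values; None and 6245 stand in for mis-parsed rows.
response1_values = [200, 404, 304, 200, 404, 500, 304, 304, None, 6245]

# Drop 200 and anything that is not a plausible HTTP status code.
non_200_codes = [
    code
    for code in response1_values
    if code is not None and 100 <= code <= 599 and code != 200
]

# One "row" with one column per response code, like the pivoted table.
counts = Counter(non_200_codes)
row = {"HTTP error": "count", **dict(sorted(counts.items()))}

print(row)
```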