## Web Server Logs Analysis

We import what we need and create the Spark session.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("WebServerLogsAnalysis").getOrCreate()

We create the DataFrame logsDf from the compressed log files.

In [0]:
files = "/FileStore/shared_uploads/giovanni.rodriguez@bosonit.com/*.gz"
logsDf = spark.read.text(files)

We define the regex pattern needed to split each log line and extract the fields we are interested in; the "status" and "size" columns are cast to int.

In [0]:
# Raw string, so Python passes the backslashes to the regex engine unchanged
pattern = r"""(\S+) (\S+) (\S+) \[(\d{2}\/[A-Za-z]{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})\] "(GET|POST|HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH) (\S+) (\S+)" (\d{3}) (\S+)"""

parseDf = logsDf.select(regexp_extract(col("value"), pattern, 1).alias("host"),
                            regexp_extract(col("value"), pattern, 4).alias("date"),
                            regexp_extract(col("value"), pattern, 5).alias("method"),
                            regexp_extract(col("value"), pattern, 6).alias("resource"),
                            regexp_extract(col("value"), pattern, 7).alias("protocol"),
                            regexp_extract(col("value"), pattern, 8).cast("Integer").alias("status"),
                            regexp_extract(col("value"), pattern, 9).cast("Integer").alias("size"))
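
To see what each capture group holds, the pattern can be tried on a single line with Python's `re` module, outside Spark. This is only a sketch: the sample lines are illustrative, and `parse_line`, `FIELDS` and `GROUPS` are hypothetical names that are not part of the notebook.

```python
import re

# Same pattern as in the cell above, split over two raw strings for readability
PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[(\d{2}\/[A-Za-z]{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})\] '
    r'"(GET|POST|HEAD|PUT|DELETE|CONNECT|OPTIONS|TRACE|PATCH) (\S+) (\S+)" (\d{3}) (\S+)'
)

# Column names and the capture group each one is read from
FIELDS = ("host", "date", "method", "resource", "protocol", "status", "size")
GROUPS = (1, 4, 5, 6, 7, 8, 9)


def parse_line(line):
    """Return the extracted fields of one log line as a dict of strings."""
    match = PATTERN.search(line)
    if match is None:
        # regexp_extract yields an empty string for every group when nothing matches
        return dict.fromkeys(FIELDS, "")
    return {field: match.group(group) for field, group in zip(FIELDS, GROUPS)}


ok = parse_line('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245')
bad = parse_line("not a log line")
print(ok)
print(bad)
```

Lines that do not match come back with every field empty, which is where the blank `host` and `method` values in the later results come from.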

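We also replace **null** sizes with 0. Note that `na.fill(0)` applies to every integer column, so lines that did not match the pattern end up with `status` 0 as well.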

In [0]:
clearDf = parseDf.na.fill(0)
clearDf.cache()

Out[21]: DataFrame[host: string, date: string, method: string, resource: string, protocol: string, status: int, size: int]
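
The 0 values that appear later in `status` and `size` come from this step. The rough plain-Python sketch below shows the effect on a single value, assuming Spark's default (non-ANSI) cast behaviour, where a string that is not a valid integer becomes null. `cast_then_fill` is a hypothetical helper, not a Spark function.

```python
def cast_then_fill(text, default=0):
    """Mimic cast("Integer") followed by na.fill(0) for one string value."""
    try:
        return int(text)
    except ValueError:
        # Spark would produce null here, which na.fill then replaces
        return default


examples = {value: cast_then_fill(value) for value in ("6245", "-", "")}
print(examples)
```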

We create a month map to convert the month abbreviation to its number, and define a function that converts the date string into a datetime.

In [0]:
import datetime
from pyspark.sql.types import TimestampType

months = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def parseDatetime(date):
    # Lines that did not match the pattern have an empty date
    if not date:
        return None
    return datetime.datetime(
        int(date[7:11]),
        months[date[3:6]],
        int(date[0:2]),
        int(date[12:14]),
        int(date[15:17]),
        int(date[18:20]))

# Declare the return type; without it the UDF returns a string
toTimeStamp = udf(parseDatetime, TimestampType())

logsDf = clearDf.withColumn("date", toTimeStamp("date"))
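
As a cross-check of the slicing offsets used in `parseDatetime`, the same string can be parsed with `datetime.strptime` from the standard library. This is only a sketch: the sample value is illustrative, `%b` assumes the process locale uses English month abbreviations, and unlike the slicing approach it keeps the UTC offset.

```python
from datetime import datetime, timedelta

sample = "01/Jul/1995:00:00:01 -0400"

# Parse the whole string, including the trailing UTC offset
parsed = datetime.strptime(sample, "%d/%b/%Y:%H:%M:%S %z")

# The same positions that parseDatetime slices: year, month, day, hour, minute, second
sliced = (int(sample[7:11]), sample[3:6], int(sample[0:2]),
          int(sample[12:14]), int(sample[15:17]), int(sample[18:20]))

print(parsed.isoformat())
print(sliced)
```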

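We convert the values to timestamps with the built-in `to_timestamp`; this assignment overrides the UDF-based `logsDf` from the previous cell.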

In [0]:
logsDf = clearDf.withColumn("date", to_timestamp("date", "dd/MMM/yyyy:HH:mm:ss"))

We save the DataFrame as Parquet, then read it back and cache it.

In [0]:
logsDf.write.format("parquet").mode("overwrite").save("/FileStore/shared_uploads/giovanni.rodriguez@bosonit.com/nasaParquet/")

In [0]:
parquetFile = "/FileStore/shared_uploads/giovanni.rodriguez@bosonit.com/nasaParquet/*"
parquetDf = spark.read.format("parquet").load(parquetFile)
# Cache the DataFrame that the queries below actually use
parquetDf.cache()

Out[27]: DataFrame[host: string, date: timestamp, method: string, resource: string, protocol: string, status: int, size: int]

Which web protocols are used? Group them.

In [0]:
logsDf.select("protocol").distinct().show()

+-------------+
|     protocol|
+-------------+
|       HTTP/*|
|             |
|    HTTP/V1.0|
|     HTTP/1.0|
|STS-69</a><p>|
|            a|
+-------------+



Which status codes are the most common on the site? Group and sort them to see which one is the most frequent.

In [0]:
(parquetDf.select("status")
          .groupBy("status")
          .agg(count("status").alias("times"))
          .orderBy(desc("times"))
).show()

+------+-------+
|status|  times|
+------+-------+
|   200|3092620|
|   304| 266764|
|   302|  72963|
|   404|  20621|
|     0|   8314|
|   403|    225|
|   500|     65|
|   501|     41|
+------+-------+
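
The group, count and sort-descending pattern used in this query (and in the ones that follow) is the same idea as `collections.Counter.most_common` on a plain Python list. A tiny sketch on made-up values, not on the real log:

```python
from collections import Counter

# Hypothetical status values; 0 stands for a line that failed to parse
statuses = [200, 200, 304, 200, 404, 304, 0]

# Count each distinct value and sort by count, highest first
ranking = Counter(statuses).most_common()
print(ranking)
```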



And which request methods (verbs) are used the most?

In [0]:
(parquetDf.select("method")
          .groupBy("method")
          .agg(count("method").alias("times"))
          .orderBy(desc("times"))
).show()

+------+-------+
|method|  times|
+------+-------+
|   GET|3445162|
|      |   8314|
|  HEAD|   7915|
|  POST|    222|
+------+-------+



Which resource had the largest byte transfer on the website?

In [0]:
(parquetDf.select("host", "resource", "size")
          .orderBy(desc("size"))
).show(1, False)

+-----+---------------------------------------+-------+
|host |resource                               |size   |
+-----+---------------------------------------+-------+
|derec|/shuttle/countdown/video/livevideo.jpeg|6823936|
+-----+---------------------------------------+-------+
only showing top 1 row
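
The query above returns the single largest response. If the question is read as the total bytes served per resource, one would sum `size` per resource instead. A minimal sketch of the difference on made-up rows (the resource names and sizes are hypothetical):

```python
from collections import defaultdict

# One large download and a small image requested eight times
rows = [("/big.mpg", 5000)] + [("/logo.gif", 800)] * 8

# Reading 1: the largest single response
largest_single = max(rows, key=lambda row: row[1])

# Reading 2: the largest total number of bytes per resource
totals = defaultdict(int)
for resource, size in rows:
    totals[resource] += size
largest_total = max(totals.items(), key=lambda item: item[1])

print(largest_single)
print(largest_total)
```

In PySpark the second reading would correspond to grouping by `resource` and aggregating with `sum("size")`.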



We also want to know which resource on our site receives the most traffic, that is, the resource with the most entries in our log.

In [0]:
(parquetDf.select("resource")
          .groupBy("resource")
          .agg(count("resource").alias("times"))
          .orderBy(desc("times"))
).show(1, False)

+--------------------------+------+
|resource                  |times |
+--------------------------+------+
|/images/NASA-logosmall.gif|208353|
+--------------------------+------+
only showing top 1 row



On which days did the site receive the most traffic?

In [0]:
(parquetDf.select(date_trunc("day", "date").alias("date"))
          .groupBy("date")
          .agg(count("date").alias("times"))
          .orderBy(desc("times"))
).show(truncate = False)

+-------------------+------+
|date               |times |
+-------------------+------+
|1995-07-13 00:00:00|133841|
|1995-07-06 00:00:00|100773|
|1995-07-05 00:00:00|94387 |
|1995-07-12 00:00:00|92046 |
|1995-08-31 00:00:00|89720 |
|1995-07-03 00:00:00|89411 |
|1995-07-07 00:00:00|87081 |
|1995-07-14 00:00:00|83909 |
|1995-07-11 00:00:00|80212 |
|1995-08-30 00:00:00|80173 |
|1995-07-17 00:00:00|74818 |
|1995-07-10 00:00:00|72655 |
|1995-07-19 00:00:00|72553 |
|1995-07-04 00:00:00|70338 |
|1995-08-29 00:00:00|67834 |
|1995-07-20 00:00:00|66467 |
|1995-07-01 00:00:00|64523 |
|1995-07-21 00:00:00|64467 |
|1995-07-24 00:00:00|64099 |
|1995-07-18 00:00:00|64050 |
+-------------------+------+
only showing top 20 rows



Which hosts are the most frequent?

In [0]:
(parquetDf.select("host")
          .groupBy("host")
          .agg(count("host").alias("times"))
          .orderBy(desc("times"))
).show(truncate = False)

+--------------------+-----+
|host                |times|
+--------------------+-----+
|piweba3y.prodigy.com|21988|
|piweba4y.prodigy.com|16437|
|piweba1y.prodigy.com|12825|
|edams.ksc.nasa.gov  |11944|
|163.206.89.4        |9697 |
|                    |8314 |
|news.ti.com         |8161 |
|www-d1.proxy.aol.com|8047 |
|alyssa.prodigy.com  |8037 |
|siltb10.orl.mmc.com |7573 |
|www-a2.proxy.aol.com|7516 |
|www-b2.proxy.aol.com|7266 |
|piweba2y.prodigy.com|7246 |
|www-b3.proxy.aol.com|7218 |
|www-d4.proxy.aol.com|7211 |
|www-b5.proxy.aol.com|7080 |
|www-d2.proxy.aol.com|6984 |
|www-b4.proxy.aol.com|6972 |
|www-d3.proxy.aol.com|6895 |
|webgate1.mot.com    |6749 |
+--------------------+-----+
only showing top 20 rows



At which hours does the site receive the most traffic?

In [0]:
(parquetDf.select(hour("date").alias("hour"))
          .groupBy("hour")
          .agg(count("hour").alias("times"))
          .orderBy(desc("times"))
).show(truncate = False)

+----+------+
|hour|times |
+----+------+
|15  |230187|
|12  |226811|
|13  |224959|
|14  |223282|
|16  |216993|
|11  |210609|
|10  |193439|
|9   |178321|
|17  |178080|
|8   |148969|
|18  |145749|
|22  |131100|
|19  |130745|
|21  |129518|
|20  |129410|
|23  |123615|
|0   |109672|
|7   |101206|
|1   |91196 |
|2   |77684 |
+----+------+
only showing top 20 rows



How many 404 errors occurred each day?

In [0]:
(parquetDf.where(col("status") == 404)
          .select(date_trunc("day", "date").alias("date"), "status")
          .groupBy("date", "status")
          .agg(count("date").alias("times"))
          .orderBy(desc("times"))
).show(truncate = False)

+-------------------+------+-----+
|date               |status|times|
+-------------------+------+-----+
|1995-07-19 00:00:00|404   |636  |
|1995-07-06 00:00:00|404   |630  |
|1995-08-30 00:00:00|404   |564  |
|1995-07-07 00:00:00|404   |563  |
|1995-08-31 00:00:00|404   |525  |
|1995-07-13 00:00:00|404   |524  |
|1995-08-07 00:00:00|404   |523  |
|1995-07-05 00:00:00|404   |491  |
|1995-07-03 00:00:00|404   |473  |
|1995-07-11 00:00:00|404   |469  |
|1995-07-18 00:00:00|404   |463  |
|1995-07-12 00:00:00|404   |459  |
|1995-07-25 00:00:00|404   |458  |
|1995-07-20 00:00:00|404   |427  |
|1995-08-24 00:00:00|404   |419  |
|1995-08-25 00:00:00|404   |411  |
|1995-08-29 00:00:00|404   |411  |
|1995-07-14 00:00:00|404   |407  |
|1995-08-28 00:00:00|404   |405  |
|1995-07-17 00:00:00|404   |403  |
+-------------------+------+-----+
only showing top 20 rows
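
The same filter, truncate-to-day and count steps can be sketched in plain Python on a few made-up records (the timestamps and statuses are hypothetical), to make clear what `date_trunc("day", ...)` does to each timestamp:

```python
from collections import Counter
from datetime import datetime

records = [
    (datetime(1995, 7, 19, 8, 15, 0), 404),
    (datetime(1995, 7, 19, 23, 59, 59), 404),
    (datetime(1995, 7, 19, 10, 0, 0), 200),
    (datetime(1995, 7, 6, 12, 30, 0), 404),
]

# Keep only the 404 rows, reset the time part to midnight, then count per day
errors_per_day = Counter(
    ts.replace(hour=0, minute=0, second=0, microsecond=0)
    for ts, status in records
    if status == 404
)

for day, times in errors_per_day.most_common():
    print(day, times)
```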

