![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png)

# Distributed computation
## ESIPE — INFO 3 — Option Logiciel

# Lab 2 : Web Server Log Analysis with Apache Spark

This lab will demonstrate how easy it is to perform web server log analysis with Apache Spark.

 
Server log analysis is an ideal use case for Spark. Logs are a very large, common data source and contain a rich set of information. Spark allows you to store your logs in files on disk cheaply, while still providing a quick and simple way to analyze them. This lab will show you how to use Apache Spark on real-world, text-based production logs and fully harness the power of that data. Log data comes from many sources, such as web, file, and compute servers, application logs, and user-generated content, and can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, detecting fraud, and much more.


## How to complete this lab

 This assignment is broken up into sections with bite-sized examples for demonstrating Spark functionality for log processing. For each problem, you should start by thinking about the algorithm that you will use to *efficiently* process the log in a parallel, distributed manner. This means using the various [RDD](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD) operations along with [`lambda` functions](https://docs.scala-lang.org/overviews/scala-book/anonymous-functions.html#inner-main) that are applied at each worker.

 
This assignment consists of 4 parts:

- Part 1 : Apache Web Server Log file format
- Part 2 : Sample Analyses on the Web Server Log File with Spark Core
- Part 3 : Analyzing Web Server Log File with Spark SQL
- Part 4 : Exploring 404 Response Codes



##  Prerequisites : Spark Context configuration 

In [None]:
import $ivy.`org.apache.spark::spark-sql:3.3.1`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

In [None]:
import org.apache.spark.sql._

val spark = NotebookSparkSession.builder
    .appName("lab2_apache_log_text")
    .master("local[*]")
    .getOrCreate()
val sc = spark.sparkContext

# Part 1 : Apache Web Server Log file format

The web server log used in this lab is in [Common Log Format](https://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format) (CLF), which you may recognize if you are familiar with web servers. The fields are:

_remotehost rfc931 authuser [date] "request" status bytes_

| field         | meaning                                                                |
| ------------- | ---------------------------------------------------------------------- |
| _remotehost_  | Remote hostname (or IP number if DNS hostname is not available).       |
| _rfc931_      | The remote logname of the user. We don't really care about this field. |
| _authuser_    | The username of the remote user, as authenticated by the HTTP server.  |
| _[date]_      | The date and time of the request.                                      |
| _"request"_   | The request, exactly as it came from the browser or client.            |
| _status_      | The HTTP status code the server sent back to the client.               |
| _bytes_       | The number of bytes (`Content-Length`) transferred to the client.      |



The log file entries produced in CLF will look something like this:

`127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839`
 
Each part of this log entry is described below.

* `127.0.0.1`
This is the IP address (or host name, if available) of the client (remote host) which made the request to the server.

 
* `-`
The "hyphen" in the output indicates that the requested piece of information (user identity from remote machine) is not available.

 
* `-`
The "hyphen" in the output indicates that the requested piece of information (user identity from local logon) is not available.

 
* `[01/Aug/1995:00:00:01 -0400]`
The time that the server finished processing the request. The format is:

`[day/month/year:hour:minute:second timezone]`
  * day = 2 digits
  * month = 3 letters
  * year = 4 digits
  * hour = 2 digits
  * minute = 2 digits
  * second = 2 digits
  * zone = (\+ | \-) 4 digits
 
* `"GET /images/launch-logo.gif HTTP/1.0"`
This is the first line of the request string from the client. It consists of three components: the request method (e.g., `GET`, `POST`, etc.), the endpoint (a [Uniform Resource Identifier](http://en.wikipedia.org/wiki/Uniform_resource_identifier)), and the client protocol version.

 
* `200`
This is the status code that the server sends back to the client. This information is very valuable, because it reveals whether the request resulted in a successful response (codes beginning in 2), a redirection (codes beginning in 3), an error caused by the client (codes beginning in 4), or an error in the server (codes beginning in 5). The full list of possible status codes can be found in the HTTP specification ([RFC 2616](https://www.ietf.org/rfc/rfc2616.txt) section 10).

 
* `1839`
The last entry indicates the size of the object returned to the client, not including the response headers. If no content was returned to the client, this value will be "-" (or sometimes 0).

 
Note that log files contain information supplied directly by the client, without escaping. Therefore, it is possible for malicious clients to insert control-characters in the log files, *so care must be taken in dealing with raw logs.*
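
As an illustration of that caution, here is a minimal sketch of one way to make a raw line safe to display before printing it in a notebook or terminal. The helper name `sanitizeForDisplay` and the sample line are hypothetical and not part of the lab; the sketch only replaces control characters with a visible escape and does not validate the line in any other way.

```scala
// Hypothetical helper (not part of the lab): replace every control character
// with a visible escape such as \x1b so the line can be printed safely.
def sanitizeForDisplay(line: String): String =
  line.map { c =>
    if (c.isControl) "\\x%02x".format(c.toInt) else c.toString
  }.mkString

// A made-up request containing an ESC character (code 27) in the endpoint.
val suspicious = "GET /" + 27.toChar + "[2J HTTP/1.0"
val safe = sanitizeForDisplay(suspicious)
```

Lines without control characters are returned unchanged, so the helper can be applied to every line that is displayed.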

 
## NASA-HTTP Web Server Log
For this assignment, we will use a data set from the NASA Kennedy Space Center WWW server in Florida. The full data set is freely available (http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) and contains two months' worth of all HTTP requests. We are using a subset that contains only several days' worth of requests.



## 1. Parsing Each Log Line
Using the CLF as defined above, we create a regular expression pattern to extract the nine fields of a log line. The parsing function returns a pair consisting of a `Log` object and 1. If the log line fails to match the regular expression, the function returns a pair consisting of the log line string and 0. A '-' value in the content size field is cleaned up by substituting it with 0. The function converts the log line's date string into a `java.sql.Timestamp` object using the provided `parse_apache_time` function.
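
The provided `parse_apache_time` function ignores the timezone offset. As a side note, the sketch below shows one possible alternative based on `java.time` that keeps the offset. It is not used in the rest of the lab, and the names `clfFormatter` and `parsedDate` are chosen only for this illustration.

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

// Pattern for the bracketed CLF date, e.g. 01/Aug/1995:00:00:01 -0400.
// Locale.ENGLISH makes the month abbreviation independent of the JVM default locale.
val clfFormatter = DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

// The parsed value keeps the -04:00 offset; toInstant gives the same moment in UTC.
val parsedDate = OffsetDateTime.parse("01/Aug/1995:00:00:01 -0400", clfFormatter)
val asUtc = parsedDate.toInstant
```

Keeping the offset matters only if logs from servers in different timezones are combined, which is not the case in this lab.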


We first define the case class `Log` which will be the structure of our log.

In [None]:
import java.sql.Timestamp
case class Log(
                host          : String,
                client_identd : String,
                user_id       : String,
                date_time     : Timestamp,
                method        : String,
                endpoint      : String,
                protocol      : String,
                response_code : Int,
                content_size  : Int
)

In [None]:
import scala.util.matching.Regex

val month_map = Map("Jan"-> 1, "Feb"-> 2, "Mar"->3, "Apr"->4, "May"->5, "Jun"->6, "Jul"->7,
  "Aug"->8,  "Sep"-> 9, "Oct"->10, "Nov"-> 11, "Dec"-> 12)

def parse_apache_time(s: String) = {
    /* Convert an Apache-format time string into a Timestamp
    Args:
        s (String): date and time in Apache time format, e.g. "01/Aug/1995:00:00:01 -0400"
    Returns:
        Timestamp: java.sql.Timestamp object (the timezone offset is ignored for now)
    */
    val year = s.substring(7,11).toInt
    val month = month_map(s.substring(3,6))
    val day = s.substring(0,2).toInt
    val hour = s.substring(12,14).toInt
    val min = s.substring(15,17).toInt
    val sec = s.substring(18,20).toInt
    Timestamp.valueOf(s"${year}-${month}-${day} ${hour}:${min}:${sec}")
}
                             
//A regular expression pattern to extract fields from the log line
val APACHE_ACCESS_LOG_PATTERN = new Regex("""^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)\s*" (\d{3}) (\S+)""")

def parseApacheLogLine = (logline : String) => {
    /* Parse a line in the Apache Common Log format
    Args:
        logline (String): a line of text in the Apache Common Log format
    Returns:
        tuple: either a Log object containing the parts of the Apache access log and 1,
               or the original invalid log line and 0
    */
    logline match {
        case APACHE_ACCESS_LOG_PATTERN(host, client_identd, user_id, date_time, method, endpoint, protocol, response_code, content_size) => {
            (Log(
                host,
                client_identd,
                user_id,
                parse_apache_time(date_time),
                method,
                endpoint,
                protocol,
                response_code.toInt,
                if (content_size == "-") 0 else content_size.toInt
            ),1)
        }
        case _ => (logline, 0)

    }
}
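
The parser above signals success or failure with the second element of the returned pair. As a design alternative, a parser can return an `Either`, which lets the compiler track both cases without casts. The following is only a sketch under stated assumptions: `MiniLog` is a hypothetical, reduced record and `parseEither` is a hypothetical name, and the regular expression is repeated so that the example is self-contained.

```scala
// Hypothetical, reduced record used only in this sketch (the lab uses Log).
case class MiniLog(host: String, responseCode: Int, contentSize: Int)

// Same regular expression as APACHE_ACCESS_LOG_PATTERN, repeated for self-containment.
val clfPattern = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)\s*" (\d{3}) (\S+)""".r

// Left carries the line that could not be parsed, Right carries the parsed record.
def parseEither(line: String): Either[String, MiniLog] = line match {
  case clfPattern(host, _, _, _, _, _, _, code, size) =>
    Right(MiniLog(host, code.toInt, if (size == "-") 0 else size.toInt))
  case _ => Left(line)
}

val okLine  = parseEither("""127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839""")
val badLine = parseEither("not a log line")
```

Like the lab's parser, this sketch would throw on a content size that is neither `-` nor a number, so real-world use would need an extra check.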

## 2. Configuration and Initial RDD Creation
We are ready to specify the input log file and create an RDD containing the parsed log file data. The log file has already been downloaded for you.

 
To create the primary RDD that we'll use in the rest of this assignment, we first load the text file using [`sc.textFile(logFile)`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@textFile(path:String,minPartitions:Int):org.apache.spark.rdd.RDD[String]) to convert each line of the file into an element in an RDD.

Next, we use [`map(parseApacheLogLine)`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@map[U](f:T=%3EU)(implicitevidence$3:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U]) to apply the parse function to each element (that is, a line from the log file) in the RDD and turn each line into a pair whose first element is an instance of the `Log` case class when parsing succeeds.

Finally, we cache the RDD in memory since we'll use it throughout this notebook.



In [None]:
import org.apache.spark.rdd.RDD
val logFile = "../data/apache.access.log.PROJECT.gz"

def parseLogs() = {
    // Read and parse log file 
    val parsed_logs  = sc
                       .textFile(logFile)
                       .map(parseApacheLogLine)
                       .cache()

    val access_logs : RDD[Log] = parsed_logs
                       .filter(s => s._2 == 1)
                       .map(s => s._1.asInstanceOf[Log])
                       .cache()

    val failed_logs : RDD[String]  = (parsed_logs
                       .filter(s => s._2 == 0)
                       .map(s => s._1.asInstanceOf[String]))
    
    val failed_logs_count = failed_logs.count()
    
    if (failed_logs_count > 0) {
        println(s"Number of invalid log lines: ${failed_logs_count}")
        failed_logs.take(20).foreach(line => println(s"Invalid logline: ${line}"))
    }
        
    println(s"Read ${parsed_logs.count()} lines, successfully parsed ${access_logs.count()} lines, failed to parse ${failed_logs_count} lines")
    (parsed_logs, access_logs, failed_logs)
}
    

val (parsed_logs, access_logs, failed_logs) = parseLogs()

# Part 2 : Sample Analyses on the Web Server Log File with Spark Core
 
Now that we have an RDD containing the log file as a set of `Log` objects, we can perform various analyses.

 
## 1. Content Size Statistics

 
Let's compute some statistics about the sizes of content being returned by the web server. In particular, we'd like to know what are the average, minimum, and maximum content sizes.

 
We can compute the statistics by applying a `map` to the `access_logs` RDD. The `lambda` function we pass to `map` extracts the `content_size` field from each element. The map produces a new RDD containing only the content sizes (one element for each `Log` object in the `access_logs` RDD). To compute the minimum and maximum, we can use the [`min()`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@min()(implicitord:Ordering[T]):T) and [`max()`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@max()(implicitord:Ordering[T]):T) functions on the new RDD. We can compute the average by using the [`reduce`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@reduce(f:(T,T)=%3ET):T) function with a `lambda` function that sums its two inputs, which represent two elements of the new RDD being reduced together. The result of `reduce()` is the total content size in the log, which is then divided by the number of requests, obtained with the [`count()`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@count():Long) function on the new RDD.
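
One pitfall is worth keeping in mind when summing: `content_size` is an `Int`, and the total content size of a log with around a million requests can exceed `Int.MaxValue`. The sketch below uses a local `Seq` with made-up values as a stand-in for the RDD (the `map` and `reduce` calls have the same shape on an RDD) to show the silent overflow and one way to avoid it.

```scala
// Toy values chosen so that their sum exceeds Int.MaxValue (2147483647).
val toySizes: Seq[Int] = Seq(2000000000, 2000000000)

// Summing as Int wraps around silently and yields a negative number.
val overflowedTotal: Int = toySizes.reduce(_ + _)

// Converting to Long before reducing keeps the exact total.
val exactTotal: Long = toySizes.map(_.toLong).reduce(_ + _)
val toyMean: Double = exactTotal.toDouble / toySizes.size
```
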



In [None]:
// TODO: Replace ??? with appropriate code
// Calculate statistics based on the content size
// HINT : each element of access_logs is a Log instance built in parseApacheLogLine.
// You can access its content size using attribute .content_size
val content_sizes = ???
val content_sizes_mean = ???
val content_sizes_min = ???
val content_sizes_max = ???
println(s"Content Size Avg: ${content_sizes_mean.round}, Min: ${content_sizes_min}, Max: ${content_sizes_max}")

Creating the function for the tests

In [None]:
def assertEquals[A](expected : A, answer : A, error : String) = {
    if (expected equals answer) println("1 test passed")
    else println(error) // print the message so that a failing test is visible
}

In [None]:
// TEST Content size statistics
assertEquals((content_sizes_mean.round, content_sizes_min, content_sizes_max), (17532, 0, 3421948), "incorrect expected values")

## 2. Response Code Analysis

Next, let's count the response codes that appear in the logs. As with the content size analysis, we first create a new RDD by using a `lambda` function to extract the `response_code` field from the `access_logs` RDD. The difference here is that we will use a [pair tuple](https://docs.scala-lang.org/tour/tuples.html) instead of just the field itself. Using a pair tuple consisting of the response code and 1 will let us count how many records have a particular response code. Using the new RDD, we call the [`reduceByKey`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions@reduceByKey(func:(V,V)=%3EV):org.apache.spark.rdd.RDD[(K,V)]) function. `reduceByKey` performs a reduce on a per-key basis by applying the `lambda` function pairwise to the values that share the same key. We use the simple `lambda` function of adding the two values. Then, we cache the resulting RDD and create a list by using the [`take`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@take(num:Int):Array[T]) function.

_Note_ : The expected method is similar to the word count approach developed in last week's lab (TP).
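
To make the per-key reduction concrete, here is a small sketch on a local `Seq` of made-up response codes. It does not use Spark: `groupBy` followed by a sum is only a local stand-in for what `reduceByKey(_ + _)` computes on a pair RDD, and the names `toyCodes`, `toyPairs` and `toyCounts` are hypothetical.

```scala
// Made-up response codes standing in for the response_code field.
val toyCodes = Seq(200, 404, 200, 304, 200, 404)

// Step 1: pair each code with 1.
val toyPairs: Seq[(Int, Int)] = toyCodes.map(code => (code, 1))

// Step 2: local stand-in for reduceByKey(_ + _): group by key, then add the values.
val toyCounts: Map[Int, Int] =
  toyPairs.groupBy(_._1).map { case (code, group) => code -> group.map(_._2).reduce(_ + _) }
```
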

In [None]:
// TODO: Replace ??? with appropriate code
// Count the number of log lines per response code.
// HINT : you can access the response code of a log using attribute ".response_code".

val responseCodeToCount = ???
val responseCodeToCountList = ???
println(s"Found ${responseCodeToCountList.length} response codes")
println(s"Response Code Counts: ${responseCodeToCountList.mkString}")

In [None]:
// TEST : Response Code Analysis
assertEquals(responseCodeToCountList.toList.sorted, List((200, 940847), (302, 16244), (304, 79824), (403, 58), (404, 6185), (500, 2), (501, 17)), "Incorrect response code analysis")

## 3. Top 10 transferred bytes hosts

Now, let's answer the following question. Who are the top 10 hosts in terms of transferred bytes (content size) ? 

To perform this task, use RDD transformations `map`, `reduceByKey` and [`takeOrdered`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@takeOrdered(num:Int)(implicitord:Ordering[T]):Array[T]).
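
`takeOrdered` returns the first elements according to an `Ordering`, that is, the smallest ones by default, so getting a "top N" requires an ordering that puts the largest values first. The sketch below builds such an ordering and applies it to a local `Seq` of made-up (host, bytes) pairs. `sorted(...).take(2)` is a local stand-in for `takeOrdered(2)(...)` on an RDD, and all names and values are hypothetical.

```scala
// Made-up (host, transferred bytes) pairs.
val toyBytesPerHost = Seq(("a.example.com", 500L), ("b.example.com", 9000L), ("c.example.com", 1200L))

// Ordering that compares pairs by byte count, largest first.
val byBytesDesc: Ordering[(String, Long)] = Ordering.by[(String, Long), Long](_._2).reverse

// Local stand-in for rdd.takeOrdered(2)(byBytesDesc).
val toyTop2 = toyBytesPerHost.sorted(byBytesDesc).take(2)
```
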

In [None]:
// TODO: Replace ??? with appropriate code
// Assign the top 10 hosts considering transferred bytes to the variable "top_10_hosts".
???

In [None]:
// TEST : Top 10 transferred bytes hosts
assertEquals(top_10_hosts.length, 10, "top_10_hosts had to be length 10")
assertEquals(top_10_hosts.toList, List("news.ti.com", "www-relay.pa-x.dec.com", "piweba5y.prodigy.com", "e659229.boeing.com", "piweba3y.prodigy.com", "www-c2.proxy.aol.com", "163.206.89.4", "www-b3.proxy.aol.com", "webgate1.mot.com", "gatekeeper.cca.rockwell.com"), "Incorrect top 10 hosts")

# Part 3 : Analyzing Web Server Log File with Spark SQL
 
Now it is time to perform advanced analytics on web server log files using Spark SQL.



## 1. Transform Spark RDD to Spark SQL dataframe

To use Spark SQL, you first need to transform the RDD of parsed logs into a Spark SQL dataframe. To perform this task, you can refer to the Spark lectures given in class or to the Spark documentation. Once the dataframe is created, use method `printSchema` to check that the column types are correct.

In [None]:
import spark.implicits._

val logs_df = ???
logs_df.cache()
logs_df.printSchema()

In [None]:
// TEST : Transform Spark RDD to Spark SQL dataframe
assertEquals(logs_df.getClass.toString, "class org.apache.spark.sql.Dataset", "logs_df is not a Spark Dataset.")
assertEquals(logs_df.dtypes.toList.sortBy(_._1) , List(("client_identd", "StringType"), ("content_size", "IntegerType"), ("date_time", "TimestampType"), ("endpoint", "StringType"), ("host", "StringType"), ("method", "StringType"), ("protocol", "StringType"), ("response_code", "IntegerType"), ("user_id", "StringType")), "Dataframe casting is not correct.")

## 2. Top 10 error endpoints

What are the top ten paths that did not return code 200? Create a sorted list containing the paths and the number of times that they were accessed with a non-200 return code, and show the top ten.

Think about the steps that you need to perform to determine which paths did not have a 200 return code, how you will uniquely count those paths and sort the list.

In [None]:
// TODO: Replace ??? with appropriate code
// Assign logs with status different than 200 to variable "not200_df".
// Assign the counts per endpoint dataframe to variable "logs_sum_df". 
// Sort logs_sum_df in descending order and assign the result to variable "sorted_logs_sum_df".
// Collect the top 10 endpoints and assign it to variable "top_ten_err_urls".
// Hint : You will need to use methods .groupBy() and .sort() to achieve this task.
// Note : You are welcome to structure your solution in a different way, so long as
// you ensure the variables used in the next Test section are defined (ie. logs_sum_df, top_ten_err_urls).

import org.apache.spark.sql.functions._

val not200_df = logs_df.???
val logs_sum_df = not200_df.???
val sorted_logs_sum_df = logs_sum_df.???
val top_ten_err_urls = ???

// Display
println("Top Ten failed URLs:")
sorted_logs_sum_df.show(10, false)
println(top_ten_err_urls.mkString(" "))

In [None]:
// TEST Top ten error endpoints
import org.apache.spark.sql._
assertEquals(logs_sum_df.count(), 7689, "incorrect count for endpointSum")
assertEquals(top_ten_err_urls.toList, List(Row(s"/images/NASA-logosmall.gif", 8761), Row(s"/images/KSC-logosmall.gif", 7236), Row(s"/images/MOSAIC-logosmall.gif", 5197), Row(s"/images/USA-logosmall.gif", 5157), Row(s"/images/WORLD-logosmall.gif", 5020), Row(s"/images/ksclogo-medium.gif", 4728), Row(s"/history/apollo/images/apollo-logo1.gif", 2907), Row(s"/images/launch-logo.gif", 2811), Row(s"/", 2199), Row(s"/images/ksclogosmall.gif", 1622)), "incorrect Top Ten failed URLs (topTenErrURLs)")

## 3. Number of Unique Hosts

How many unique hosts are there in the entire log?

 
Think about the steps that you need to perform to count the number of different hosts in the log.



In [None]:
// TODO: Replace ??? with appropriate code
// Note : There are several ways to achieve this task.
val unique_host_count = logs_df.???
println(s"Unique hosts: ${unique_host_count}")

In [None]:
// TEST Number of unique hosts
assertEquals(unique_host_count, 54507, "incorrect unique_host_count")

## 4. Extract the day of the month using a UDF

In the next questions, we will compute the number of unique daily hosts for each day of the month. To do so, we need to create a new column that contains the day of the month (from 01 to 31). Fortunately, the day of the month is contained in the `date_time` column.

_Note_ : There are several ways to achieve this task, including the `dayofmonth` method from spark.sql.functions. In this exercise, we will use a UDF-based solution so that you can practice the UDF concept. It is strongly recommended that you look at the UDF slides provided during lecture 3.

_Note_ : Since the log only covers a single month, you can ignore the month.
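
Before writing the function, it can help to look at the text that `Timestamp.toString` produces, since a string-based solution works on that layout. The sketch below only prints the layout and shows how an integer can be zero-padded to two characters; it is not the expected solution, and the names `sampleTs`, `sampleText` and `paddedDay` are hypothetical.

```scala
import java.sql.Timestamp

val sampleTs = Timestamp.valueOf("1995-08-01 00:00:07")

// Timestamp.toString uses the layout yyyy-mm-dd hh:mm:ss.fffffffff,
// so a fractional part is always present, even when it is zero.
val sampleText: String = sampleTs.toString

// Zero-padding an integer to two characters, e.g. 1 becomes "01".
val paddedDay: String = "%02d".format(1)
```
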

In [None]:
// TODO: Replace ??? with appropriate code
// Complete the function 'date_to_day' so that it returns the day of the month.
// Wrap the function into a UDF object. Assign the result to variable "my_udf".
// Apply the UDF on logs_df using method .withColumn(). The resulting column has to be named "day". Assign the resulting
// dataframe to variable "logs_df_with_day".
import org.apache.spark.sql.functions.udf

val date_to_day = (date : Timestamp) => {
    /*
    Extracts the day of month in variable date.
    date : timestamp with pattern "yyyy-mm-dd hh:mm:ss"
    returns : day of the month (pattern 'dd')
    */
    val str_date = date.toString
    ???
}
    

// testing the function
assertEquals(date_to_day(Timestamp.valueOf("1995-08-01 00:00:07")), "01", "function date_to_day is not correct.")
assertEquals(date_to_day(Timestamp.valueOf("2017-09-17 10:00:07")), "17", "function date_to_day is not correct.")

// Wrap in a UDF and apply on logs_df
val my_udf = ???
val logs_df_with_day = logs_df.???
logs_df_with_day.show(5)

In [None]:
// TEST Extract the day of the month using a UDF
val distinct_days = logs_df_with_day.???
assertEquals(distinct_days.length, 21, "it seems that UDF is misapplied")

## 5. Number of Unique Daily Hosts

For an advanced exercise, let's determine the number of unique hosts in the entire log on a day-by-day basis. This computation will give us counts of the number of unique daily hosts. We'd like a DataFrame sorted by increasing day of the month which includes the day of the month and the associated number of unique hosts for that day. Make sure you cache the resulting DataFrame `daily_hosts_df` so that we can reuse it in the next exercise.

Think about the steps that you need to perform to count the number of different hosts that make requests *each* day.
*Since the log only covers a single month, you can ignore the month.*  You may want to use the `day` column you have computed in the previous task.
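
The key step is to remove duplicate (day, host) pairs before counting. Otherwise a host that sends several requests on the same day is counted several times. The sketch below shows the difference on a local `Seq` of made-up pairs; it uses plain Scala collections as a stand-in for the DataFrame operations, and all names are hypothetical.

```scala
// Made-up (day, host) pairs; host "a" appears twice on day "01".
val toyDayHostPairs = Seq(("01", "a.example.com"), ("01", "a.example.com"), ("01", "b.example.com"), ("03", "a.example.com"))

// Counting requests per day without deduplication over-counts host "a".
val requestsPerDay: Map[String, Int] =
  toyDayHostPairs.groupBy(_._1).map { case (day, rows) => day -> rows.size }

// Deduplicate first, then count per day and sort by day.
val uniqueHostsPerDay: Seq[(String, Int)] =
  toyDayHostPairs.distinct.groupBy(_._1).map { case (day, rows) => day -> rows.size }.toSeq.sortBy(_._1)
```
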

In [None]:
// TODO: Replace ??? with appropriate code
// Select columns "day" and "host" from "logs_df_with_day" dataframe and assign the result to "day_to_host_pair_df".
// Remove duplicate <"day", "host"> pairs using method .distinct(). Assign the result to "day_group_hosts_df".
// Group by day and count the distinct hosts. Assign the result dataframe to "daily_hosts_df".
// Sort per day and assign the result to "daily_hosts_df_sorted".
// Collect the result to driver. Assign the result to variable "daily_hosts_list".

val day_to_host_pair_df = logs_df_with_day.???
val day_group_hosts_df = ???
val daily_hosts_df = ???
val daily_hosts_df_sorted = ???
val daily_hosts_list = ???

println("Unique hosts per day:")
daily_hosts_df_sorted.show(30, false)
daily_hosts_df.cache()

In [None]:
// TEST Number of unique daily hosts (3.5)
assertEquals(daily_hosts_df.count(), 21, "incorrect dailyHosts.count()")
assertEquals(daily_hosts_list.toList, List(Row("01", 2582), Row("03", 3222), Row("04", 4190), Row("05", 2502), Row("06", 2537), Row("07", 4106), Row("08", 4406), Row("09", 4317), Row("10", 4523), Row("11", 4346), Row("12", 2864), Row("13", 2650), Row("14", 4454), Row("15", 4214), Row("16", 4340), Row("17", 4385), Row("18", 4168), Row("19", 2550), Row("20", 2560), Row("21", 4134), Row("22", 4456)), "incorrect dailyHostsList")
assertEquals(daily_hosts_df.storageLevel.useMemory, true, "incorrect dailyHosts.is_cached")

## 6. Mean Transferred Bytes per Status Code Category

For an advanced exercise, let's determine the mean transferred bytes per status code category. Remember that:
- A code beginning with `2` means the request resulted in a successful response.
- A code beginning with `3` means the request resulted in a redirection.
- A code beginning with `4` means the request resulted in a client error.
- A code beginning with `5` means the request resulted in a server error.

For each of these four categories, compute the mean transferred bytes (content size).

_Hint_ : There are many ways to compute the code category column, including the use of a UDF. Choose the method you prefer. Feel free to refer to the Spark documentation and Stack Overflow posts to find the functions or information you need.
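
One simple way to derive the category, among others, is integer division of the response code by 100. The sketch below shows it on a few made-up codes in plain Scala; `codeCategory` is a hypothetical helper name, and how you apply the same idea to a DataFrame column is left to you.

```scala
// Hypothetical helper: the first digit of a three-digit HTTP status code.
def codeCategory(responseCode: Int): Int = responseCode / 100

val toyCategories = Seq(200, 302, 304, 404, 501).map(codeCategory)
```
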

In [None]:
// TODO: Replace ??? with appropriate code
// Add a new column to logs_df containing the code category. Assign the result to 'logs_df_with_code_category'.
// Group the logs per 'code category' computed in first subtask. Assign the result to 'logs_df_groupby_code_category'.
// Compute the content size mean per category and return the result to the driver. Assign the result 
// to 'content_size_per_code_category'.

val logs_df_with_code_category = ???
val logs_df_groupby_code_category = ???
val logs_df_agg_content_size = ???
val content_size_per_code_category = ???

In [None]:
val round_result = content_size_per_code_category.map({case Row(x: Int,y: Double) => (x,(y*10).round/10.0)})
assertEquals(content_size_per_code_category.length, 4, "error : length has to be 4.")
assertEquals(round_result.toList, List((3, 14.4), (5, 10.4), (4, 0.0), (2, 19436.9)), "incorrect mean size content per category.")

# Part 4 : Exploring 404 Response Codes
 
Let's drill down and explore the error 404 response code records. 404 errors are returned when an endpoint is not found by the server (i.e., a missing page or object). During this part, you are free to complete the tasks using `Spark Core` and the `access_logs` RDD or using `Spark SQL` and the `logs_df` dataframe.

_Note_ : Do not forget to cache your RDD / Dataframe in memory in order to reduce computing time.

## 1. Counting 404 Response Codes
 
How many 404 records are there in the logs? Assign the 404 records to variable `badRecords` and their count to variable `badRecords_count`.

In [None]:
// TODO: Replace ??? with appropriate code

val badRecords = ???
val badRecords_count = ???

println(s"Found ${badRecords_count} 404 URLs")

In [None]:
// TEST Counting 404 (4.1)
assertEquals(badRecords_count, 6185, "incorrect badRecords_count")

## 2. Listing The top-15 404 Response Code endpoints

Get the top 15 endpoints that return 404 errors. Assign the result to variable `top_15_404`.

_Note_ : variable `top_15_404` has to be a list only containing top 404 error endpoints.




In [None]:
// TODO: Replace ??? with appropriate code

???

println(s"404 Top 15 URLS: ${top_15_404.mkString(" ")}")

In [None]:
// TEST Listing 404 records (4.2)
assertEquals(top_15_404.length, 15, "top_15_404 length has to be 15.")
assertEquals(top_15_404.toList, List("/pub/winvn/readme.txt", "/pub/winvn/release.txt", "/shuttle/missions/STS-69/mission-STS-69.html", "/images/nasa-logo.gif", "/elv/DELTA/uncons.htm", "/shuttle/missions/sts-68/ksc-upclose.gif", "/history/apollo/sa-1/sa-1-patch-small.gif", "/images/crawlerway-logo.gif", "/://spacelink.msfc.nasa.gov", "/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif", "/history/apollo/a-001/a-001-patch-small.gif", "/images/Nasa-logo.gif", "/shuttle/resources/orbiters/atlantis.gif", "/history/apollo/images/little-joe.jpg", "/images/lf-logo.gif"), "top_15_404 not correct")


## 3. Listing the Top Twenty-five 404 Response Code Hosts

Instead of looking at the endpoints that generated 404 errors, let's look at the hosts that encountered 404 errors. Using the RDD / Dataframe containing only log records with a 404 response code that you cached in part (4.1), print out a list of the top twenty-five hosts that generate the most 404 errors. Assign the result to variable `errHostsTop25` and return your results to the driver. 



In [None]:
// TODO: Replace ??? with appropriate code

???

println(s"Top 25 hosts that generated errors: ${errHostsTop25.mkString(" ")}")

In [None]:
// TEST Top twenty-five 404 response code hosts (4.3)
assertEquals(errHostsTop25.length, 25, "length of errHostsTop25 is not 25")
assertEquals(errHostsTop25.toSet ,Set(("maz3.maz.net", 39), ("piweba3y.prodigy.com", 39), ("gate.barr.com", 38), ("m38-370-9.mit.edu", 37), ("ts8-1.westwood.ts.ucla.edu", 37), ("nexus.mlckew.edu.au", 37), ("204.62.245.32", 33), ("163.206.104.34", 27), ("spica.sci.isas.ac.jp", 27), ("www-d4.proxy.aol.com", 26), ("www-c4.proxy.aol.com", 25), ("203.13.168.24", 25), ("203.13.168.17", 25), ("internet-gw.watson.ibm.com", 24), ("scooter.pa-x.dec.com", 23), ("crl5.crl.com", 23), ("piweba5y.prodigy.com", 23), ("onramp2-9.onr.com", 22), ("slip145-189.ut.nl.ibm.net", 22), ("198.40.25.102.sap2.artic.edu", 21), ("gn2.getnet.com", 20), ("msp1-16.nas.mr.net", 20), ("isou24.vilspa.esa.es", 19), ("dial055.mbnet.mb.ca", 19), ("tigger.nashscene.com", 19)), "incorrect errHostsTop25")


## 4. Listing 404 Response Codes per Day

Let's explore the 404 records temporally. Break down the 404 requests by day, `cache()` the resulting RDD / Dataframe as `errDateSorted`, and get the daily counts sorted by day as a list. Assign the list to variable `errByDate` and return it to the driver.

*Since the log only covers a single month, you can ignore the month in your checks.*



In [None]:
// TODO: Replace ??? with appropriate code

???

println(s"404 Errors by day: ${errByDate.mkString(" ")}")

In [None]:
// TEST 404 response codes per day (4.4)
assertEquals(errByDate.toList, List((1, 243), (3, 303), (4, 346), (5, 234), (6, 372), (7, 532), (8, 381), (9, 279), (10, 314), (11, 263), (12, 195), (13, 216), (14, 287), (15, 326), (16, 258), (17, 269), (18, 255), (19, 207), (20, 312), (21, 305), (22, 288)), "incorrect errByDate")

## 5. Hourly 404 Response Codes

Using the RDD / Dataframe `badRecords` that you cached in part (4.1), create an RDD / Dataframe containing the number of requests that had a 404 return code for each hour of the day (midnight is hour 0), sorted by increasing hour. Cache the resulting RDD / Dataframe as `hourRecordsSorted`. Assign the result to variable `errHourList` and don't forget to return it to the driver.

_Hint_ : For the Spark SQL approach, several methods exist, including a UDF similar to the one in exercise (3.4).
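
For the hour itself, a `java.sql.Timestamp` can be converted to a `java.time.LocalDateTime`, which exposes the hour as an integer from 0 to 23. The sketch below shows this on two made-up timestamps in plain Scala; it is only one possible building block, and the names `afternoonHour` and `midnightHour` are hypothetical.

```scala
import java.sql.Timestamp

// getHour returns the hour of the day in the range 0 to 23.
val afternoonHour: Int = Timestamp.valueOf("1995-08-01 13:05:07").toLocalDateTime.getHour
val midnightHour: Int  = Timestamp.valueOf("1995-08-01 00:00:07").toLocalDateTime.getHour
```
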

In [None]:
// TODO: Replace ??? with appropriate code

???

println(s"Top hours for 404 requests: ${errHourList.mkString(" ")}")

In [None]:
// TEST Hourly 404 response codes (4.5)
assertEquals(errHourList.toList, List((0, 175), (1, 171), (2, 422), (3, 272), (4, 102), (5, 95), (6, 93), (7, 122), (8, 199), (9, 185), (10, 329), (11, 263), (12, 438), (13, 397), (14, 318), (15, 347), (16, 373), (17, 330), (18, 268), (19, 269), (20, 270), (21, 241), (22, 234), (23, 272)), "incorrect errHourListCore")