# Web Server Log Analysis with Apache Spark
 
This lab will demonstrate how easy it is to perform web server log analysis with Apache Spark.

Server log analysis is an ideal use case for Spark. Logs are a very large, common data source and contain a rich set of information. Spark allows you to store your logs in files on disk cheaply, while still providing a quick and simple way to perform data analysis on them. Log data comes from many sources, such as web, file, and compute servers, application logs, and user-generated content, and can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, detecting fraud, and much more.

This lab is broken up into sections with bite-sized examples that demonstrate Spark functionality for log processing. For each problem, you should start by thinking about the algorithm that you will use to *efficiently* process the log in a parallel, distributed manner. This means using the various built-in [pyspark functions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) along with some [user defined functions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.udf).
 
This lab consists of 4 parts:

1. Apache Web Server Log file format
2. Sample Analyses on the Web Server Log File
3. Analyzing Web Server Log File
4. Exploring 404 Response Codes

## (1) Apache Web Server Log file format

The log files that we use for this lab are in the [Apache Common Log Format (CLF)](http://httpd.apache.org/docs/1.3/logs.html#common). The log file entries produced in CLF will look something like this:
`127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839`
 
Each part of this log entry is described below.

* `127.0.0.1` - This is the IP address (or host name, if available) of the client (remote host) which made the request to the server.
 
* `-` - The "hyphen" in the output indicates that the requested piece of information (user identity from remote machine) is not available.
 
* `-` - The "hyphen" in the output indicates that the requested piece of information (user identity from local logon) is not available.

* `[01/Aug/1995:00:00:01 -0400]` - The time that the server finished processing the request. The format is:
`[day/month/year:hour:minute:second timezone]`

  * day = 2 digits
  * month = 3 letters
  * year = 4 digits
  * hour = 2 digits
  * minute = 2 digits
  * second = 2 digits
  * zone = (+|-) 4 digits
 
* `"GET /images/launch-logo.gif HTTP/1.0"` - This is the first line of the request string from the client. It consists of three components:

  * the request method (e.g., `GET`, `POST`, etc.)
  * the endpoint (a [Uniform Resource Identifier](http://en.wikipedia.org/wiki/Uniform_resource_identifier))
  * the client protocol and version

* `200` - This is the status code that the server sends back to the client. This information is very valuable, because it reveals whether the request resulted in a successful response (codes beginning with 2), a redirection (codes beginning with 3), an error caused by the client (codes beginning with 4), or an error in the server (codes beginning with 5). The full list of possible status codes can be found in the HTTP specification ([RFC 2616](https://www.ietf.org/rfc/rfc2616.txt) section 10).
 
* `1839` - The last entry indicates the size of the object returned to the client, not including the response headers. If no content was returned to the client, this value will be "-" (or sometimes 0).
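
As a rough illustration of the last two fields, the sketch below maps a status code to its class using its first digit and converts the content-size field to an integer, treating `-` as 0. The helper names `status_class` and `content_size_to_int` are hypothetical and are not part of the lab code.

```python
# Hypothetical helpers illustrating the status and size fields of a CLF entry.
STATUS_CLASSES = {
    '2': 'success',
    '3': 'redirection',
    '4': 'client error',
    '5': 'server error',
}


def status_class(status_code):
    """Classify an HTTP status code string by its first digit."""
    return STATUS_CLASSES.get(status_code[0], 'other')


def content_size_to_int(size_field):
    """Convert the size field to an int, treating '-' as 0 bytes."""
    return 0 if size_field == '-' else int(size_field)


print(status_class('200'), content_size_to_int('1839'))
print(status_class('404'), content_size_to_int('-'))
```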
 
Note that log files contain information supplied directly by the client, without escaping. Therefore, it is possible for malicious clients to insert control-characters in the log files, *so care must be taken in dealing with raw logs.*
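
One simple precaution when inspecting raw lines is to print them with `repr()`, which shows control characters as escape sequences instead of sending them to the terminal. A minimal sketch, using a made-up log line that contains an escape character and a carriage return:

```python
# A made-up log line with an embedded ESC (\x1b) and carriage return (\r).
raw_line = '127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /\x1b[2J\r HTTP/1.0" 200 1839'

# repr() renders control characters as visible escape sequences.
safe_view = repr(raw_line)
print(safe_view)
```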

For this lab, we will use a data set from the NASA Kennedy Space Center WWW server in Florida. The full data set is freely available (http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html) and contains two months of HTTP requests. We are using a subset that only contains several days' worth of requests.

### (1a) Parse a Log Timestamp

We must first write a function that can parse a timestamp in the Apache log format. For this you should use `datetime.strptime()`. Given a pattern string written according to [the documentation](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior), it builds a `datetime` object from a string.

For example:

In [None]:
from datetime import datetime

pattern = '%d/%m/%y %H:%M'
timestamp = '17/02/17 14:05'

datetime.strptime(timestamp, pattern)
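
Directives can also match month abbreviations (`%b`), four-digit years (`%Y`), and UTC offsets (`%z`). The sketch below uses a layout that is deliberately different from the Apache format. Note that `%b` depends on the current locale, so English month names are assumed here.

```python
from datetime import datetime, timedelta, timezone

# A layout chosen for illustration only; it is not the Apache log layout.
example_pattern = '%Y %b %d %H.%M %z'
example_timestamp = '2017 Feb 17 14.05 +0100'

parsed_example = datetime.strptime(example_timestamp, example_pattern)

# %z produces a timezone-aware datetime carrying the UTC offset.
print(parsed_example)
print(parsed_example.utcoffset() == timedelta(hours=1))
```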

Below we use this to define a function that parses an Apache log format timestamp to a `datetime` object (or `None` in the case of an invalid timestamp). Complete the time pattern to correctly parse Apache timestamps.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

APACHE_TIME_PATTERN = <FILL IN>

def parse_apache_time(string):
    """Parse an Apache log formatted date string.
    
    Parameters
    ----------
    string : str
        The string to parse
        
    Returns
    -------
    datetime or None
    """
    try:
        return datetime.strptime(string, APACHE_TIME_PATTERN)
    except ValueError:
        return None

parsed_time = parse_apache_time('01/Aug/1995:00:00:08 -0400')
print(parsed_time)

In [None]:
from test_helper import Test
from datetime import timedelta, timezone
utc_minus_4 = timezone(timedelta(hours=-4))
Test.assertEquals(parsed_time,
                  datetime(1995, 8, 1, 0, 0, 8, tzinfo=utc_minus_4),
                 'incorrect parsed_time')

### (1b) Extract components from log line

Using the CLF as defined above, we write a regular expression pattern to extract the nine fields of the log line. The regular expression should capture nine groups, one for each of:

* Client host (hostname or IP address)
* Remote user identity
* Local user identity
* Timestamp
* Request method
* Endpoint
* Client protocol and version
* Returned HTTP status code
* Returned content size

The function `parse_apache_log_line` applies the regular expression `APACHE_ACCESS_LOG_PATTERN` using the Python [`re.search`](https://docs.python.org/3/library/re.html#re.search) function. Execute the cell to test the regular expression on a sample row.

In [None]:
import re
from pprint import pprint

# A raw string keeps regex escapes such as \S and \d from being treated as Python string escapes
APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'

APACHE_LOG_FIELD_NAMES = [
    'host', 'client_identd', 'user_id', 'date_time', 'method', 'endpoint',
    'protocol', 'response_code', 'content_size'
]
APACHE_ACCESS_LOG_SAMPLE_LINE = 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0'


def parse_apache_log_line(logline):
    """Parse a line in the Apache Common Log format.
    
    Parameters
    ----------
    logline : str
        A line of text in the Apache Common Log format
        
    Returns
    -------
    dict or None
        The parsed line
    """
    
    match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
    
    if match is None:
        return
    
    fields = dict(zip(APACHE_LOG_FIELD_NAMES, match.groups()))

    return fields


pprint(parse_apache_log_line(APACHE_ACCESS_LOG_SAMPLE_LINE))

In [None]:
Test.assertEquals(parse_apache_log_line(APACHE_ACCESS_LOG_SAMPLE_LINE),
                  {'client_identd': '-', 'content_size': '0', 'date_time': '01/Aug/1995:00:00:08 -0400',
                   'endpoint': '/images/ksclogo-medium.gif', 'host': 'uplherc.upl.com', 'method': 'GET',
                   'protocol': 'HTTP/1.0', 'response_code': '304', 'user_id': '-'},
                  'log line incorrectly parsed')
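
An alternative for pure-Python parsing is to name the groups inside the pattern with `(?P<name>...)` and read them back with `groupdict()`, which avoids keeping a separate list of field names in sync. The sketch below uses a shortened, hypothetical pattern covering only four fields. This named-group syntax is specific to Python's `re` module and may not be accepted by other regex engines, so it is shown only as a local alternative; the lab's pattern is also passed to Spark in (1c), where groups are extracted by number.

```python
import re

# Shortened, hypothetical pattern: host, method, endpoint and status only.
NAMED_PATTERN = r'^(?P<host>\S+) .* "(?P<method>\S+) (?P<endpoint>\S+).*" (?P<response_code>\d{3}) '

sample = 'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 304 0'

match = re.search(NAMED_PATTERN, sample)
named_fields = match.groupdict() if match else None
print(named_fields)

# A line that does not fit the pattern yields None rather than an exception.
print(re.search(NAMED_PATTERN, 'not a log line'))
```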

### (1c) Configuration and Initial DataFrame Creation

We are ready to specify the input log file and load it into a DataFrame.

Let's start by fetching the data:

In [None]:
!rm -rf apache.access.log.*
!wget https://s3-eu-west-1.amazonaws.com/asi-training-data/spark/apache.access.log.PROJECT

To create the primary DataFrame that we'll use in the rest of this lab, we first load the text file using [`spark.read.text()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.text).

Next, we use `.select()` on the DataFrame along with some [pyspark functions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) and [user defined functions](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.udf) to generate a new DataFrame with the fields parsed into the desired formats.

Finally, we cache the DataFrame in memory since we'll use it throughout this notebook.

In [None]:
APACHE_LOG_FILE = "apache.access.log.PROJECT"

import pyspark.sql.functions as func
from pyspark.sql import types


# A helper to generate a `.select()` entry that applies the access log
# regular expression to the 'line' column and extracts a particular group as a new column
extract_field = lambda group: func.regexp_extract('line', APACHE_ACCESS_LOG_PATTERN, group)

# A helper function to apply parse_apache_time to a column
parse_apache_time_udf = func.udf(parse_apache_time, types.TimestampType())

# A helper function to cast a column to integer type
cast_int = lambda col: col.cast('int')

# A helper function to cast a column to integer type, interpreting '-' as 0
safe_cast_int = lambda col: func.when(col == '-', 0).otherwise(col.cast('int'))
    

def parse_logs():
    """Load and parse the logs.
    
    Returns
    -------
    pyspark.sql.DataFrame
        All the logs
    pyspark.sql.DataFrame
        The correctly parsed logs
    pyspark.sql.DataFrame
        The logs that could not be parsed
    """
    
    # Load the dataset and rename the column to 'line'
    logs = (spark.read.text(APACHE_LOG_FILE)
            .select(func.col('value').alias('line')))
    
    # Parse the fields
    parsed_logs = (logs
                   .select(
                       'line',  # Include the original line column
                       extract_field(1).alias('host'),
                       extract_field(2).alias('client_identd'),
                       extract_field(3).alias('user_id'),
                       parse_apache_time_udf(
                           extract_field(4)
                       ).alias('date_time'),
                       extract_field(5).alias('method'),
                       extract_field(6).alias('endpoint'),
                       extract_field(7).alias('protocol'),
                       cast_int(
                           extract_field(8)
                       ).alias('response_code'),
                       safe_cast_int(
                           extract_field(9)
                       ).alias('content_size')
                   )
                   .cache())
    
    # Filter for correctly parsed logs
    access_logs = parsed_logs.filter(func.length('host') > 0)
    failed_logs = parsed_logs.filter(func.length('host') == 0)
    
    return parsed_logs, access_logs, failed_logs


parsed_logs, access_logs, failed_logs = parse_logs()

print('Total logs:          {}'.format(parsed_logs.count()))
print('Successfully parsed: {}'.format(access_logs.count()))
print('Failed to parse:     {}'.format(failed_logs.count()))
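
The filter on `func.length('host')` works because Spark's `regexp_extract` returns an empty string, rather than null, when the pattern does not match. The pure-Python sketch below mimics that behaviour, together with the `'-'` handling of `safe_cast_int`, so the logic can be tried without Spark. The names `extract_or_empty` and `size_or_zero` and the short pattern are hypothetical stand-ins, not Spark APIs.

```python
import re

# Short stand-in pattern (hypothetical): host, status code and content size.
SIMPLE_PATTERN = r'^(\S+) .* (\d{3}) (\S+)$'


def extract_or_empty(line, pattern, group):
    """Return the requested group, or '' when the pattern does not match."""
    match = re.search(pattern, line)
    return match.group(group) if match else ''


def size_or_zero(value):
    """Interpret '-' as 0, otherwise convert the text to an int."""
    return 0 if value == '-' else int(value)


good_line = 'example.host - - [01/Aug/1995:00:00:08 -0400] "GET / HTTP/1.0" 200 -'
bad_line = 'garbled entry'

print(repr(extract_or_empty(good_line, SIMPLE_PATTERN, 1)))
print(repr(extract_or_empty(bad_line, SIMPLE_PATTERN, 1)))
print(size_or_zero(extract_or_empty(good_line, SIMPLE_PATTERN, 3)))
```

A failed line therefore ends up with an empty `host`, which is exactly what the length check detects.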

### (1d) Data Cleaning

Notice that there are a number of log lines that failed to parse. Since we included the original line column in the parsed DataFrames, we can inspect the lines we failed to parse with:

In [None]:
print('First 20 failed log lines:')
for row in failed_logs.head(20):
    print(repr(row['line']))

Examine the sample of invalid lines and compare them to the correctly parsed line. Based on your observations, alter the `APACHE_ACCESS_LOG_PATTERN` regular expression below so that the failed lines will correctly parse, and execute the cell below to rerun `parse_logs()`.
 
If you are not familiar with Python's regular expression [`search` function](https://docs.python.org/3/library/re.html#regular-expression-objects), now would be a good time to read the [documentation](https://developers.google.com/edu/python/regular-expressions). It can also help to use an online tester such as http://pythex.org or http://www.pythonregex.com: copy the regular expression string below (the text between the single quotes) and test it against one of the failed log lines printed above.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# This was originally '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)\s*" (\d{3}) (\S+)'

parsed_logs, access_logs, failed_logs = parse_logs()

print('Total logs:          {}'.format(parsed_logs.count()))
print('Successfully parsed: {}'.format(access_logs.count()))
print('Failed to parse:     {}'.format(failed_logs.count()))

In [None]:
Test.assertEquals(failed_logs.count(), 0, 'incorrect failed_logs.count()')
Test.assertEquals(parsed_logs.count(), 1043177 , 'incorrect parsed_logs.count()')
Test.assertEquals(access_logs.count(), parsed_logs.count(), 'incorrect access_logs.count()')
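
To see the effect of the extra `\s*`, the sketch below runs both patterns against a made-up line with a trailing space before the closing quote of the request. The sample line is an assumption for illustration; the lines that actually fail in the data set may look different.

```python
import re

ORIGINAL_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
ALTERED_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)\s*" (\d{3}) (\S+)'

# Made-up line: note the space between 'HTTP/1.0' and the closing quote.
tricky_line = 'example.host - - [01/Aug/1995:00:00:08 -0400] "GET /index.html HTTP/1.0 " 200 1839'

original_match = re.search(ORIGINAL_PATTERN, tricky_line)
altered_match = re.search(ALTERED_PATTERN, tricky_line)

print(original_match)           # no match with the original pattern
print(altered_match.group(7))   # protocol captured by the altered pattern
```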

## (2) Sample Analyses on the Web Server Log File
 
Now that we have a DataFrame containing the components of the log file as columns, we can perform various analyses.
 
### (2a) Example: Content Size Statistics

Let's compute some statistics about the sizes of content being returned by the web server. In particular, we'd like to know the average, minimum, and maximum content sizes.
 
We can compute the statistics by calling `.select()` with an aggregating [pyspark function](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) such as [`min()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.min), [`max()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.max) and [`avg()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.avg):

In [None]:
min_size = access_logs.select(func.min('content_size')).first()[0]
max_size = access_logs.select(func.max('content_size')).first()[0]
mean_size = access_logs.select(func.avg('content_size')).first()[0]

print('Content size:')
print('  min:  {}'.format(min_size))
print('  max:  {}'.format(max_size))
print('  mean: {}'.format(mean_size))

### (2b) Example: Response Code Analysis

Next, let's look at the response codes that appear in the log. We'd like to count the number of times each response code occurs in the logs.

To do this, first group the logs by response code, then count the size of each group:

In [None]:
response_code_counts = (access_logs
                        .groupBy('response_code')
                        .count())
response_code_counts.toPandas()
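
The group-then-count pattern is the same idea as tallying values in plain Python. As a small local analogy (not Spark code), `collections.Counter` does this for an in-memory list; the response codes below are made up.

```python
from collections import Counter

# Made-up response codes standing in for the 'response_code' column.
sample_codes = [200, 200, 304, 404, 200, 304, 500]

# Counter groups equal values and counts the size of each group.
code_counts = Counter(sample_codes)

for code, count in sorted(code_counts.items()):
    print(code, count)
```

The difference in Spark is that the grouping and counting are distributed across partitions, so the full column never has to fit in one Python process.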

### (2c) Example: Response Code Graphing with `matplotlib`

Now, let's visualize the results from the last example using [`matplotlib`](http://matplotlib.org/).

In [None]:
pandas_df = (response_code_counts
             .orderBy('response_code')
             .toPandas())

import numpy as np
from matplotlib import pyplot
%matplotlib inline

fig, ax = pyplot.subplots(figsize=(12, 6))

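# One bar per response code, however many distinct codes the data contains
positions = np.arange(len(pandas_df))
ax.bar(positions, pandas_df['count'])

ax.set_xticks(positions)
ax.set_xticklabels(pandas_df['response_code'])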

ax.set_xlabel('Response code')
ax.set_ylabel('Hits')

pass

### (2d) Example: Frequent Hosts

Let's look at hosts that have accessed the server multiple times (e.g., more than ten times). As with the response code analysis in (2b), first we group by host, then we count the size of each group. Finally, we apply a `.filter()` to keep only the hosts that made more than ten requests.

In [None]:
host_more_than_10 = (access_logs
                     .groupBy('host')
                     .count()
                     .filter(func.col('count') > 10))
host_more_than_10.show(truncate=False)

In [None]:
Test.assertEquals(host_more_than_10.count(),
                  23656,
                  'incorrect size of host_more_than_10')

### (2e) Example: Visualizing Endpoints

Now, let's visualize the number of hits to endpoints (URIs) in the log. To perform this task, we once again group and count to get the number of times each endpoint occurs in the logs, then order by descending number of hits. This data can then be plotted with matplotlib:

In [None]:
endpoints = (access_logs
             .groupBy('endpoint')
             .count()
             .orderBy('count', ascending=False))

fig, ax = pyplot.subplots(figsize=(12, 6))
ax.plot(endpoints.toPandas()['count'])

ax.set_xlabel('Endpoints')
ax.set_ylabel('Number of Hits')
ax.grid()

pass

### (2f) Example: Top Endpoints

For the final example, we'll look at the top endpoints (URIs) in the log. Since the DataFrame is already ordered by number of hits, we can just print the first ten lines:

In [None]:
print('Top ten endpoints:')
for row in endpoints.head(10):
    print('{} {}'.format(row['endpoint'], row['count']))

In [None]:
Test.assertEquals([row['endpoint'] for row in endpoints.head(10)],
                  ['/images/NASA-logosmall.gif', 
                   '/images/KSC-logosmall.gif',
                   '/images/MOSAIC-logosmall.gif',
                   '/images/USA-logosmall.gif',
                   '/images/WORLD-logosmall.gif',
                   '/images/ksclogo-medium.gif',
                   '/ksc.html',
                   '/history/apollo/images/apollo-logo1.gif',
                   '/images/launch-logo.gif',
                   '/'],
                  'incorrect top_ten_endpoints')

## (3) Analyzing Web Server Log File
 
Now it is your turn to perform analyses on web server log files.

### (3a) Exercise: Top Ten Error Endpoints

What are the top ten endpoints that did not have return code 200? Create a sorted list containing the top ten endpoints and the number of times that they were accessed with a non-200 return code.
 
Think about the steps that you need to perform to determine which endpoints did not have a 200 return code, how you will uniquely count those endpoints, and sort the list.
 
You might want to refer back to the previous Lab (Lab 1 Word Count) for insights.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# HINT: Each of these <FILL IN> below could be completed with a single transformation or action.
# You are welcome to structure your solution in a different way, so long as
# you ensure the variables used in the next Test section are defined (ie. not_200_counts, top_ten_endpoints_not_200).

not_200 = access_logs.filter(func.col('response_code') != 200)

not_200_counts = <FILL IN>

sorted_not_200 = <FILL IN>

top_ten_endpoints_not_200 = [row['endpoint'] for row in sorted_not_200.head(10)]

print('Top ten non-200 status endpoints:')
print('\n'.join(top_ten_endpoints_not_200))

In [None]:
# TEST Top ten error endpoints (3a)
Test.assertEquals(not_200_counts.count(), 7689, 'incorrect count for not_200_counts')
Test.assertEquals(top_ten_endpoints_not_200,
                  ['/images/NASA-logosmall.gif',
                   '/images/KSC-logosmall.gif',
                   '/images/MOSAIC-logosmall.gif',
                   '/images/USA-logosmall.gif',
                   '/images/WORLD-logosmall.gif',
                   '/images/ksclogo-medium.gif',
                   '/history/apollo/images/apollo-logo1.gif',
                   '/images/launch-logo.gif',
                   '/',
                   '/images/ksclogosmall.gif'],
                  'incorrect top_ten_endpoints_not_200')

### (3b) Exercise: Number of Unique Hosts

How many unique hosts are there in the entire log?
 
Think about the steps that you need to perform to count the number of different hosts in the log.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
# HINT: Do you recall the tips from (3a)? Each of these <FILL IN> could be a transformation or action.

num_hosts = <FILL IN>

print('Number of unique hosts: {}'.format(num_hosts))

In [None]:
# TEST Number of unique hosts (3b)
Test.assertEquals(num_hosts, 54507, 'incorrect num_hosts')

### (3c) Exercise: Number of Unique Daily Hosts

For an advanced exercise, let's determine the number of unique hosts in the entire log on a day-by-day basis. This computation will give us counts of the number of unique daily hosts. We'd like a list sorted by increasing day of the month which includes the day of the month and the associated number of unique hosts for that day.
 
Think about the steps that you need to perform to count the number of different hosts that make requests *each* day. You may find the [`dayofmonth` function](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.dayofmonth) useful.

*Since the log only covers a single month, you can ignore the month.*

In [None]:
# TODO: Replace <FILL IN> with appropriate code

grouped_by_day = <FILL IN>

daily_hosts = <FILL IN>

daily_hosts.toPandas()

In [None]:
# TEST Number of unique daily hosts (3c)
Test.assertEquals(daily_hosts.count(), 22, 'incorrect daily_hosts.count()')
Test.assertEquals(list(tuple(row) for _, row in daily_hosts.toPandas().iterrows()),
                  [(1, 2582), (3, 2591), (4, 4262), (5, 2573), (6, 2469), (7, 4067),
                   (8, 4259), (9, 4440), (10, 4432), (11, 4507), (12, 2865), (13, 2667),
                   (14, 4363), (15, 4334), (16, 4253), (17, 4412), (18, 4231), (19, 2620),
                   (20, 2482), (21, 4125), (22, 4416), (23, 696)],
                  'incorrect daily_hosts')

### (3d) Exercise: Visualizing the Number of Unique Daily Hosts

Using the results from the previous exercise, use matplotlib to plot a line graph of the number of unique hosts by day. `.toPandas()` is a useful method for converting a Spark DataFrame to a Pandas DataFrame.

In [None]:
fig, ax = pyplot.subplots(figsize=(12, 6))

<FILL IN>

ax.set_xlabel('Day')
ax.set_ylabel('Hosts')
ax.grid()

pass

### (3e) Exercise: Average Number of Daily Requests per Host

Next, let's determine the average number of requests per host on a day-by-day basis. We'd like a list sorted by increasing day of the month, with the associated average number of requests per host for that day. Make sure you cache the resulting DataFrame `average_daily_requests_per_host` so that we can reuse it in the next exercise.

To compute the average number of requests per host, get the total number of requests across all hosts and divide that by the number of unique hosts.

*Since the log only covers a single month, you can skip checking for the month. The test below compares the averages rounded to two decimal places, so keep the fractional part rather than truncating to an integer.*
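
As a worked example of the arithmetic on made-up data (plain Python, not the Spark solution): if a day has 5 requests from 2 distinct hosts, the average is 5 / 2 = 2.5 requests per host.

```python
from collections import defaultdict

# Made-up (day, host) pairs, one per request.
requests = [
    (1, 'a.example'), (1, 'a.example'), (1, 'b.example'),
    (1, 'b.example'), (1, 'b.example'),
    (2, 'a.example'), (2, 'c.example'), (2, 'd.example'),
]

total_requests = defaultdict(int)   # day -> number of requests
unique_hosts = defaultdict(set)     # day -> set of distinct hosts

for day, host in requests:
    total_requests[day] += 1
    unique_hosts[day].add(host)

# Average = total requests that day / number of distinct hosts that day.
averages = {day: total_requests[day] / len(unique_hosts[day])
            for day in sorted(total_requests)}
print(averages)
```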

In [None]:
# TODO: Replace <FILL IN> with appropriate code

daily_host_requests = <FILL IN>

average_daily_requests_per_host = <FILL IN>

average_daily_requests_per_host.toPandas()

In [None]:
# TEST Average number of daily requests per hosts (3e)
Test.assertEquals(list((row[0], round(row[1], 2)) for _, row in average_daily_requests_per_host.toPandas().iterrows()),
                  [(1.0, 13.17), (3.0, 12.74), (4.0, 14.30), (5.0, 12.59), (6.0, 13.08), (7.0, 13.87),
                   (8.0, 13.58), (9.0, 13.91), (10.0, 13.82), (11.0, 13.84), (12.0, 12.88), (13.0, 13.9),
                   (14.0, 13.59), (15.0, 13.89), (16.0, 13.09), (17.0, 13.42), (18.0, 13.56), (19.0, 12.5),
                   (20.0, 13.03), (21.0, 13.4), (22.0, 12.97), (23.0, 10.89)],
                  'incorrect average_daily_requests_per_host')

## (4) Exploring 404 Response Codes
 
Let's drill down and explore the error 404 response code records. 404 errors are returned when an endpoint is not found by the server (i.e., a missing page or object).

### (4a) Exercise: Counting 404 Response Codes

Create a DataFrame containing only log records with a 404 response code. Make sure you `cache()` the DataFrame `bad_records` as we will use it in the rest of this exercise.
 
How many 404 records are in the log?

In [None]:
# TODO: Replace <FILL IN> with appropriate code

bad_records = <FILL IN>

print('Found {} 404 URLs'.format(bad_records.count()))

### (4b) Exercise: Listing 404 Response Code Records

Using the DataFrame containing only log records with a 404 response code that you cached in part (4a), print out a list of up to 40 **distinct** endpoints that generate 404 errors - *no endpoint should appear more than once in your list.*

In [None]:
# TODO: Replace <FILL IN> with appropriate code

bad_unique_endpoints = <FILL IN>

bad_unique_endpoints.show(40, truncate=False)

### (4c) Exercise: Listing the Top Twenty 404 Response Code Endpoints

Using the DataFrame containing only log records with a 404 response code that you cached in part (4a), print out a list of the top twenty endpoints that generate the most 404 errors.

*Remember, top endpoints should be in sorted order*

In [None]:
bad_endpoints_ranked = <FILL IN>

bad_endpoints_ranked.show(20, truncate=False)

### (4d) Exercise: Listing the Top Twenty-five 404 Response Code Hosts

Instead of looking at the endpoints that generated 404 errors, let's look at the hosts that encountered 404 errors. Using the DataFrame containing only log records with a 404 response code that you cached in part (4a), print out a list of the top twenty-five hosts that generate the most 404 errors.

In [None]:
bad_hosts_ranked = <FILL IN>

bad_hosts_ranked.show(25, truncate=False)

### (4e) Exercise: Listing 404 Response Codes per Day

Let's explore the 404 records temporally. Break down the 404 requests by day and get the daily counts sorted by day as a list.

*Since the log only covers a single month, you can ignore the month in your checks.*

In [None]:
bad_records_per_day = <FILL IN>

bad_records_per_day.show(30)

### (4f) Exercise: Visualizing the 404 Response Codes by Day

Using the results from the previous exercise, use `matplotlib` to plot a "Line" or "Bar" graph of the 404 response codes by day.

In [None]:
fig, ax = pyplot.subplots(figsize=(12, 8))

<FILL IN>

ax.set_xlabel('Day')
ax.set_ylabel('404 Errors')
ax.grid()
pass

### (4g) Exercise: Hourly 404 Response Codes

Using the DataFrame `bad_records` you cached in part (4a), create a DataFrame containing the number of requests that had a 404 return code for each hour of the day, sorted by increasing hour (midnight is hour 0).

In [None]:
bad_records_by_hour = <FILL IN>

bad_records_by_hour.show(24)

### (4h) Exercise: Visualizing the 404 Response Codes by Hour

Using the results from the previous exercise, use `matplotlib` to plot a "Line" or "Bar" graph of the 404 response codes by hour.

In [None]:
fig, ax = pyplot.subplots(figsize=(12, 8))

<FILL IN>

ax.set_xlabel('Hour')
ax.set_ylabel('404 Errors')
ax.grid()
pass