# Exam 1:  NASA weblog analysis

This exam will continue our weblog analysis that we started in the Week 4 homework.  Problem 1 will "just" be about parsing and cleaning the data (this is usually the most time-consuming part of any analysis).

All of the remaining problems rely on you getting Problem 1 correct, so I will walk you through it just like a homework.  All unit tests in Problem 1 are visible so that you can be sure.

Starting in Problem 2, however, the tests will be hidden (after all, it is an Exam).

## Problem 1:  loading, parsing, and cleaning the data

In [None]:
from pyspark import SparkContext
sc = SparkContext('local', 'exam1')

In [None]:
# make sure you copy (or move) the logs from the week4 analysis into the same directory as
# this notebook
logs_rdd = sc.textFile('NASA_access_log_Aug95.gz,NASA_access_log_Jul95.gz')

First, we need to subsample (**without replacement**).  You CAN perform this analysis on the entire log, but it's never a bad idea to subsample first while you develop your algorithms (so that feedback is faster and costs less money).
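To make "without replacement" concrete, here is a minimal pure-Python sketch (no Spark needed) of one way such sampling can work: each record is kept independently with probability `fraction`, so no record can appear twice and the sample size is only *approximately* `fraction` times the input size. The helper name `bernoulli_sample` and the toy data are made up for illustration. This is not Spark's implementation, and it will not pick the same records that Spark picks for the same seed.

```python
import random

def bernoulli_sample(records, fraction, seed):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # private generator: the seed fully determines the result
    return [rec for rec in records if rng.random() < fraction]

toy_records = list(range(1000))  # stand-in for log lines
sample_a = bernoulli_sample(toy_records, fraction=0.1, seed=7)
sample_b = bernoulli_sample(toy_records, fraction=0.1, seed=7)

print(len(sample_a))         # roughly 100, but usually not exactly 100
print(sample_a == sample_b)  # True
```

The second `print` is the reason for fixing a seed: the same seed gives the same sample on every run, which is what lets the assertions in this notebook be checked.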

Later, if necessary, you can come back and remove the subsampling step (to process the entire dataset).

In [None]:
sample_fraction = 0.1
seed = 7  # an arbitrary choice, but keep it: the assertions below assume this seed

# Apply the `.sample()` transformation to `logs_rdd` and store the result in a new RDD called
# `sample_rdd`

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Since we don't want to repeat the sampling step (which is expensive) every time 
# we perform an action, let's cache `sample_rdd`
sample_rdd.persist()

In [None]:
assert sample_rdd.take(2) == \
    ['in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839',
     'uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0']

Borrowing the regex that you created in the Week 4 homework, create a function that parses a log line into its individual fields:

- requesting_host
- user_identity
- user_local_identity
- timestamp
- requested_resource
- return_code
- bytes_transferred

If the line is parsed successfully then the function should return the tuple
```
(True, requesting_host, user_identity, user_local_identity, timestamp, requested_resource, return_code, bytes_transferred)
```

Sometimes the parse fails.  In that case return the tuple
```
(False, original_line)
```
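The "flag first" convention described above can be sketched on a deliberately simple, made-up format (`name:age`), not on the NASA log format. The names `toy_regex` and `parse_toy_line` are hypothetical; your real function will use `logregex` and return more fields.

```python
import re

# Hypothetical toy format "name:age". This is NOT the NASA log format.
toy_regex = re.compile(r'^(\w+):(\d+)$')

def parse_toy_line(line):
    match = toy_regex.match(line)
    if match is None:
        return (False, line)          # parse failed: keep the original line
    return (True,) + match.groups()   # parse succeeded: flag plus captured fields

parsed = [parse_toy_line(line) for line in ['alice:30', 'garbage', 'bob:41']]
good = [rec for rec in parsed if rec[0]]
bad = [rec for rec in parsed if not rec[0]]

print(good)  # [(True, 'alice', '30'), (True, 'bob', '41')]
print(bad)   # [(False, 'garbage')]
```

Because the flag is always the first element, good and bad records can be separated later by looking only at `rec[0]`, even though the two kinds of tuple have different lengths.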


In [None]:
import re

# Store your pattern string in a variable named `logpattern`
# YOUR CODE HERE
raise NotImplementedError()

# Precompile the regex pattern so that we don't have to compile every
# time the function is called (that can get expensive)
logregex = re.compile(logpattern)

def parse_log_line(line):
    # NOTICE that `logregex` is visible from here, so we can use it.
    # BTW, "Capturing" a variable from outside the function is called 
    # a "closure" in programming
    
    # YOUR CODE HERE
    raise NotImplementedError()

Apply your parsing function to `sample_rdd` to get a new RDD where each line has been parsed into the tuple format described above.  Let's name the new RDD `parsed_rdd`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

The first element in the tuple is either `True` or `False`.  This is meant to indicate whether or not the parse was successful (bad data is a fact of life).

Filter `parsed_rdd` to create a new RDD, `bad_rdd`, that only contains the elements that didn't parse successfully:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# since we're going to run a couple of actions on `bad_rdd`, let's persist it
# (so we don't have to recompute the entire DAG for each action)
bad_rdd.persist()

It turns out that there is actually 1 legitimately bad (unparseable) line in our sample.

If you find more than this (i.e. if the assertion in the next cell fails) then you should be suspicious of your regex.  Does it handle all cases correctly?  Note that my regex (week 4 solution) is *almost* right, but it doesn't handle one little detail.

Always do a `bad_rdd.take(5)` and see what some of the "bad" data looks like.  Is it really bad?  Or just formatted slightly differently?  Modify your regex if necessary to handle variations in the formatting that you are seeing.

In [None]:
assert bad_rdd.count() == 1

In [None]:
assert bad_rdd.take(1) == [(False, 'alyssa.p')]

In [None]:
# We are done with the bad data.  Let Spark forget it
bad_rdd.unpersist()

Create a new RDD, `good_rdd`, that filters out the bad data.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Let's persist it since we will perform a couple of actions on it
good_rdd.persist()

In [None]:
assert good_rdd.count() == 345380

In [None]:
assert good_rdd.take(2) == \
    [(True,
      'in24.inetnebr.com',
      '-',
      '-',
      '01/Aug/1995:00:00:01 -0400',
      'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0',
      '200',
      '1839'),
     (True,
      'uplherc.upl.com',
      '-',
      '-',
      '01/Aug/1995:00:00:08 -0400',
      'GET /images/USA-logosmall.gif HTTP/1.0',
      '304',
      '0')]

It is time to reach into the Week 4 homework and pull out the `.strptime()` code that you wrote to parse the timestamp.  Make sure it was working there!

We want to create a function that parses a timestamp string and returns a `datetime` object.  However, this can sometimes fail (e.g. if the string is corrupted).

When this error happens, `.strptime()` will "raise an exception".  An exception is Python's way of saying "I give up.  Get me OUTTA HERE!".

There are many types of exceptions in Python, but `.strptime()` will raise a `ValueError`.

Exceptions "bubble up" the call stack and need to be handled (some people say "caught").  If an exception bubbles to the top level and has not been handled then the program will bomb out.

To handle an exception, use Python's `try` and `except` statements.  Since this is your first time I will write the code for you.  Just put in your `.strptime()` snippet where it says "YOUR CODE HERE" and try to figure out what is going on with the rest of the function.
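If `try` / `except` is new to you, here is the same pattern on an unrelated conversion, as a small sketch. The function name `to_int_or_none` is made up for this example; `int()` raising `ValueError` on a non-numeric string is standard Python behavior.

```python
def to_int_or_none(text):
    try:
        return int(text)   # raises ValueError if `text` is not a valid integer
    except ValueError:
        return None        # the exception is handled here, so it stops bubbling up

print(to_int_or_none('1839'))  # 1839
print(to_int_or_none('oops'))  # None
```

Note that only `ValueError` is caught.  Any other kind of exception would still bubble up to the caller.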

In [None]:
from datetime import datetime, timezone, timedelta

# Write a function that parses a timestamp string and returns a datetime object
# If an exception is raised by `.strptime()` then just return None
def parse_timestamp(timestamp):
    try:
        # YOUR CODE HERE
        raise NotImplementedError()
    except ValueError:
        return None

We will also want to further parse the `requested_resource` field because there is more information that we can extract.

An example `requested_resource` field looks like the following:
```
GET /a/cool/resource.html HTTP/1.0
```
It has the format
```
method resource protocol
```
Write a regex pattern (name the variable `requested_resource_pattern`) that parses into 3 fields.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Again, for efficiency, we will precompile the pattern
requested_resource_regex = re.compile(requested_resource_pattern)

In [None]:
assert requested_resource_regex.match('GET /a/cool/resource.html HTTP/1.0').groups() ==\
    ('GET', '/a/cool/resource.html', 'HTTP/1.0')

Putting everything together will be challenging.  Write a function named `cleanup_logentry` that takes a parsed log entry, i.e. a tuple of the form:
```
(True,
 'in24.inetnebr.com',
 '-',
 '-',
 '01/Aug/1995:00:00:01 -0400',
 'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0',
 '200',
 '1839')
```
and does the following to it:

- removes the two `-` fields because those are almost always empty anyway
- parses `'01/Aug/1995:00:00:01 -0400'` to turn it into a datetime object.
- parses `GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0` (using the regex that you just wrote) to "explode" its 3 fields.
- converts the last two fields (`return_code` and `bytes_transferred`) to integers.  Watch OUT!  `bytes_transferred` can be `-` (which means `0`).

The return value should look like this:
```
(True,   <---- set this to False if any of the parsing fails
 'in24.inetnebr.com',
 datetime(1995, 8, 1, 0, 0, 1, tzinfo=timezone(timedelta(-1, 72000))),
 'GET',
 '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',
 'HTTP/1.0',
 200,
 1839)
```
If the timestamp parsing fails then just leave it as the original timestamp string.

If the requested_resource parsing fails then just leave its first field as the original string and use `None` for the other two fields.

In other words, here is what the return value should look like in case of failure (in this example *both* parsings failed):
```
(False,
 'in24.inetnebr.com',
 '01/Aug/1995:00:00:01 -0400',
 'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0',
 None,
 None,
 200,
 1839)
```
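The `timezone(timedelta(-1, 72000))` in the expected output can look odd.  It is just the UTC offset `-0400` written the way Python normalizes a negative `timedelta`: minus one day plus 72000 seconds is minus four hours.  The short check below uses only the standard library; the variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

odd_looking = timedelta(-1, 72000)   # -1 day + 72000 seconds
readable = timedelta(hours=-4)

print(odd_looking == readable)                      # True
print(timezone(odd_looking) == timezone(readable))  # True

stamp = datetime(1995, 8, 1, 0, 0, 1, tzinfo=timezone(readable))
print(stamp.isoformat())  # 1995-08-01T00:00:01-04:00
```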

In [None]:
def cleanup_logentry(logentry):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
test_tuple1 = (True,
              'in24.inetnebr.com',
              '-',
              '-',
              '01/Aug/1995:00:00:01 -0400',
              'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0',
              '200',
              '1839')
assert cleanup_logentry(test_tuple1) == \
    (True,
     'in24.inetnebr.com',
     datetime(1995, 8, 1, 0, 0, 1, tzinfo=timezone(timedelta(-1, 72000))),
     'GET',
     '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',
     'HTTP/1.0',
     200,
     1839)

In [None]:
test_tuple2 = (True,
              'in24.inetnebr.com',
              '-',
              '-',
              'should_fail_to_parse',
              'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0',
              '200',
              '1839')
assert cleanup_logentry(test_tuple2) == \
    (False,
     'in24.inetnebr.com',
     'should_fail_to_parse',
     'GET',
     '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',
     'HTTP/1.0',
     200,
     1839)

In [None]:
test_tuple3 = (True,
              'in24.inetnebr.com',
              '-',
              '-',
              '01/Aug/1995:00:00:01 -0400',
              'should_fail_to_parse',
              '200',
              '1839')
assert cleanup_logentry(test_tuple3) == \
    (False,
     'in24.inetnebr.com',
     datetime(1995, 8, 1, 0, 0, 1, tzinfo=timezone(timedelta(-1, 72000))),
     'should_fail_to_parse',
     None,
     None,
     200,
     1839)

In [None]:
test_tuple4 = (True,
              'in24.inetnebr.com',
              '-',
              '-',
              '01/Aug/1995:00:00:01 -0400',
              'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0',
              '200',
              '-')
assert cleanup_logentry(test_tuple4) == \
    (True,
     'in24.inetnebr.com',
     datetime(1995, 8, 1, 0, 0, 1, tzinfo=timezone(timedelta(-1, 72000))),
     'GET',
     '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',
     'HTTP/1.0',
     200,
     0)

WHEW!  That was tough.

Apply your shiny new function `cleanup_logentry` to `good_rdd` to produce a new RDD named `cleaned_rdd`:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Since we've done more parsing, we will find MORE "bad" data.  Using a filter, create a new RDD named `bad_rdd` (I reused the variable name from above.  This is not the same RDD) that contains all the bad data:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Let's persist it for a bit because we will perform several actions on it
bad_rdd.persist()

In [None]:
assert bad_rdd.count() == 489

Let's look at some of this "bad" data.  Can we resurrect some of it?

In [None]:
bad_rdd.take(5)

In [None]:
# Let's forget `bad_rdd` because we're going to fix this up and recompute below
bad_rdd.unpersist()

Notice that many of these seem to be failing to parse because the requested_resource field is incomplete.  For example, `'GET /images/NASA-logosmall.gif'` doesn't have an `HTTP/1.0` on the end.

But do we care?  Let's change our regex to allow for the case where this is missing:
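As a hint about the mechanism (not the answer), here is how an optional group behaves on a made-up toy pattern: a word, optionally followed by a space and a number.  The name `toy_optional` is hypothetical.  The thing to notice is that an optional group that does not take part in the match shows up as `None` in `.groups()`.

```python
import re

# Hypothetical toy pattern: a word, optionally followed by a space and a number.
# (?: ... )? is a non-capturing group that may be absent altogether.
toy_optional = re.compile(r'^(\w+)(?: (\d+))?$')

print(toy_optional.match('apollo 11').groups())  # ('apollo', '11')
print(toy_optional.match('apollo').groups())     # ('apollo', None)
```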

In [None]:
# Let's fix the regex to allow for cases where PROTOCOL is missing

# Here's a website that contains a fun example of groups that are optional
# https://howchoo.com/g/ymfhmtrhyjg/python-regexes-match-objects

# YOUR CODE HERE
raise NotImplementedError()

requested_resource_regex = re.compile(requested_resource_pattern)

In [None]:
m = requested_resource_regex.match('GET /images/NASA-logosmall.gif')
assert m.groups() == ('GET', '/images/NASA-logosmall.gif', None)

m = requested_resource_regex.match('GET /images/NASA-logosmall.gif HTTP/1.0')
assert m.groups() == ('GET', '/images/NASA-logosmall.gif', 'HTTP/1.0')

Let's recompute `cleaned_rdd` and `bad_rdd` using your improved regex:

In [None]:
cleaned_rdd = good_rdd.map(cleanup_logentry)
bad_rdd = cleaned_rdd.filter(lambda x: not x[0])

In [None]:
bad_rdd.persist()

In [None]:
assert bad_rdd.count() == 1

It looks like we still have some bad data.  Let's look at it to see if we can refine our parsing further.

(NOTE: You should see some junk output in the cell below, which indicates that this is INDEED bad data)

In [None]:
bad_rdd.take(1)

In [None]:
# Release the memory in Spark
bad_rdd.unpersist()

OK, we've identified all of the truly bad data, and parsed the rest.  Let's filter out the bad data and put the result into an RDD named `good_rdd` (we are reusing the variable name `good_rdd` from above, but this RDD is different): 

In [None]:
good_rdd = cleaned_rdd.filter(lambda x: x[0])

In [None]:
# Going to be using this in quite a few actions below, so persist
good_rdd.persist()

Make sure we're all on the same page:

In [None]:
assert good_rdd.take(2) == \
    [(True,
      'in24.inetnebr.com',
      datetime(1995, 8, 1, 0, 0, 1, tzinfo=timezone(timedelta(-1, 72000))),
      'GET',
      '/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',
      'HTTP/1.0',
      200,
      1839),
     (True,
      'uplherc.upl.com',
      datetime(1995, 8, 1, 0, 0, 8, tzinfo=timezone(timedelta(-1, 72000))),
      'GET',
      '/images/USA-logosmall.gif',
      'HTTP/1.0',
      304,
      0)]

In [None]:
# This variable will be useful below
num_logs = good_rdd.count()

In [None]:
assert num_logs == 345379

For all problems below use `good_rdd` as your starting point.

## Problem 2:  Total bytes transferred in July and August 1995

Figure out the total bytes transferred in July and August 1995.  Note that the unit test is hidden (because this is an exam, after all).

REMEMBER:  you are using a sample, so account for this in your answer (I want an estimate for the TOTAL bytes transferred, not just for the sample).  Make sure that your answer is an integer.
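Here is the scaling idea on made-up numbers, as a sketch: a total measured on a sample is divided by the sampling fraction to estimate the total for the whole dataset, and then rounded to an integer.  Both `toy_fraction` and `toy_sample_total` are illustrative values, not results from the logs.

```python
toy_fraction = 0.1        # fraction of the data that was sampled
toy_sample_total = 1234   # made-up total measured on the sample

# Scale up to the full dataset, then round to get an integer.
estimated_total = int(round(toy_sample_total / toy_fraction))

print(estimated_total)  # 12340
```

The `round()` matters: dividing by a float such as `0.1` gives a float, and floating-point results can land a hair above or below the whole number you expect.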

In [None]:
tot_bytes_transferred = None  # please use THIS variable to store your result

# YOUR CODE HERE
raise NotImplementedError()

This number is huge and unreadable.  If we wanted to report in terms of kilobytes, we would divide by 1024.  If we wanted to report in terms of megabytes we would divide by 1024 AGAIN.  If we wanted to report in terms of gigabytes we would divide by 1024 again.
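The repeated division looks like this on a made-up byte count (`toy_bytes` is an illustrative value, not the answer to this problem):

```python
toy_bytes = 3_500_000_000   # made-up byte count

kilobytes = toy_bytes / 1024
megabytes = kilobytes / 1024
gigabytes = megabytes / 1024

print(round(gigabytes))     # 3
```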

How many gigabytes (GB) were transferred total in July and August 1995 (rounded to the nearest gigabyte)?

Something to think about:  why do we divide by 1024 instead of 1000?

In [None]:
tot_gigabytes_transferred = None

# YOUR CODE HERE
raise NotImplementedError()

## Problem 3:  What fraction of requests are successful?

A return code of 200 means "success" for HTTP requests.  Since these are weblogs, they should all be HTTP requests (we could check that, but let's leave that aside for now).

Estimate the *fraction* of requests that were successful.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Problem 4: Weekends vs weekdays

One common question is: are weekends more popular than weekdays?  This can help IT in their capacity planning.

For this problem let's define the "weekend" to be Friday, Saturday, and Sunday.  All other days are "weekdays".  Don't worry about timezones for this problem (we don't know the timezones of the clients anyway - we only know the timezone of the server at NASA).
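One way to test a date against this definition is `datetime.weekday()`, which returns 0 for Monday through 6 for Sunday.  The sketch below is only an illustration: the names `WEEKEND_DAYS` and `is_weekend` are made up, and you are free to structure your own solution differently.

```python
from datetime import datetime

WEEKEND_DAYS = {4, 5, 6}  # Friday, Saturday, Sunday under this exam's definition

def is_weekend(moment):
    # datetime.weekday() returns 0 for Monday through 6 for Sunday
    return moment.weekday() in WEEKEND_DAYS

print(datetime(1995, 8, 1).weekday())    # 1 (a Tuesday)
print(is_weekend(datetime(1995, 8, 1)))  # False
print(is_weekend(datetime(1995, 8, 4)))  # True (a Friday)
```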

First estimate the *number of requests* during July and August 1995 that came in on the **weekend** (remember to correct for the fact that you are using a sample):

In [None]:
num_weekend_requests = None  # Please store your result in this variable

# YOUR CODE HERE
raise NotImplementedError()

Now estimate the number of requests in July and August 1995 that came in on the weekdays:

In [None]:
num_weekday_requests = None  # Please store your result in this variable

# YOUR CODE HERE
raise NotImplementedError()

Enrichment:  Think about these results.  Is the NASA website more popular during weekdays, or weekends?

## Problem 5: most popular day of the week

Estimate the total number of requests in July and August 1995 *grouped by day of week*.  Remember that you are using a 10% sample (which is why this is only an estimate), so be sure to adjust for that (multiply by 10!).  Return your result in a list named `hits_by_day_of_week`.  It should look something like this:
```
[(0, ??????),
 (1, ??????),
 ...
 (6, ??????)]
```
(where ?????? will be numbers).

Please make sure that your output is *sorted* by day of week (hint: `.sortByKey()`).
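To picture the shape of the result, here is a local, pure-Python analogue of "count per key, scale, then sort by key" on made-up data.  It uses `collections.Counter` rather than Spark, so treat it as a sketch of the idea and not as the RDD code you need to write; `toy_weekdays` and `scale` are illustrative names.

```python
from collections import Counter

toy_weekdays = [1, 1, 4, 0, 6, 4, 1]  # made-up day-of-week values from a 10% sample
scale = 10                            # 1 / 0.1, to correct for the sampling

counts = Counter(toy_weekdays)
scaled_sorted = sorted((day, n * scale) for day, n in counts.items())

print(scaled_sorted)  # [(0, 10), (1, 30), (4, 20), (6, 10)]
```

In this toy data some days never occur, so they are simply missing from the list.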

In [None]:
# YOUR CODE HERE
raise NotImplementedError()