# Week 4 HW:  Analysis of NASA weblogs from 1995

## Tuples

Before we begin:  another built-in Python datatype that you will encounter is the *tuple*.  A tuple looks a LOT like a list, except that it is denoted with parentheses instead of brackets.  Elements are indexed in exactly the same way:

In [26]:
mytuple = ("zero", "one", "two")
mytuple[0]

'zero'

The main differences are that tuples have a FIXED LENGTH (once created, we cannot add or remove elements), and they are IMMUTABLE (once created, the elements cannot be reassigned - with some caveats that I want to ignore right now).
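
Here is a minimal sketch of what immutability means in practice, reusing `mytuple` from above.  Assigning to an element raises a `TypeError`, and the usual workaround is to build a new tuple.  The names `error_message` and `longer` are made up for illustration.

```python
mytuple = ("zero", "one", "two")

# Trying to overwrite an element fails because tuples are immutable
try:
    mytuple[0] = "ZERO"
    error_message = None
except TypeError as err:
    error_message = str(err)

# "Adding" an element really builds a brand-new tuple; the original is unchanged
longer = mytuple + ("three",)
```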

They are especially well-suited for situations where you want to bundle a small (e.g. 3) group of elements together.  In these situations they are typically a bit lighter on memory and faster to create than lists.
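
As a small illustration of bundling, the sketch below packs three related values into one tuple and then unpacks them again in a single statement.  The variable names and values are hypothetical, chosen only to resemble the log fields used later.

```python
# Bundle three related values into one tuple (hypothetical example values)
request = ("maynard.isi.uconn.edu", 200, 891)

# Unpack the tuple into separately named variables in a single statement
host, status, size = request
```

The number of names on the left must match the length of the tuple, otherwise Python raises a `ValueError`.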

Tuples with only a single element need an extra comma "," after the element.  Can you figure out why?

In [28]:
not_a_tuple = (9)
type(not_a_tuple)

int

In [29]:
a_tuple = (9,)
type(a_tuple)

tuple

## Parsing log entries using regular expressions

*Parsing* is the act of extracting structured information from strings.  In this homework we will use *regular expressions* to parse each log entry.

Recall that in week 2 we used regular expressions to clean up our tweets (see Python video tutorials).  There we only did simple substitutions (finding patterns and replacing with `' '`).
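
As a quick refresher, a substitution of that kind might look like the sketch below.  The tweet text and the URL pattern are invented for illustration and are not taken from the week 2 material.

```python
import re

# Hypothetical tweet text, used only for illustration
tweet = "Launch day!!! https://example.com #NASA"

# Replace every URL-like token with a single space
cleaned = re.sub(r"https?://\S+", " ", tweet)
```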

A nice tutorial is here: https://www.machinelearningplus.com/python/python-regex-tutorial-examples/ 

Full documentation here:  https://docs.python.org/3/howto/regex.html

In [33]:
import re

# Here is an example entry from the log
# let's practice extracting the following fields from it (already described):
# requesting_host
# user_identity
# user_local_identity
# timestamp
# requested_resource
# return_code
# bytes_transferred

logentry = 'maynard.isi.uconn.edu - - [28/Jul/1995:13:32:22 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 200 891'

In [34]:
# You need to write a regular expression that extracts the fields into variables
# YOUR CODE GOES HERE
# Use a raw string (r'...') so that backslashes like \S reach the regex engine unchanged
logpattern = r'(\S+)\s+(\S+)\s+(\S+)\s+\[(.+)\]\s+"(.+)"\s+(\d+)\s+(\d+)'
logregex = re.compile(logpattern)
matches = logregex.match(logentry)
requesting_host, user_identity, user_local_identity, timestamp, requested_resource, return_code, bytes_transferred = matches.groups()

In [35]:
assert requesting_host == 'maynard.isi.uconn.edu'
assert user_identity == '-'
assert user_local_identity == '-'
assert timestamp == '28/Jul/1995:13:32:22 -0400'
assert requested_resource == 'GET /images/shuttle-patch-logo.gif HTTP/1.0'
assert return_code == '200'
assert bytes_transferred == '891'
# TODO test what happens when bytes_transferred is a '-'
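
The TODO above hints at a real quirk: in this kind of log a `-` can appear in place of the byte count when nothing was transferred, and a final group of `(\d+)` will not match it.  One possible way to handle it is sketched below, under the assumption that a `-` should be treated as 0 bytes.  The names `logpattern_dash`, `parse_bytes`, and the sample entry are all made up for illustration.

```python
import re

# Same structure as the pattern above, but the last group also accepts a lone '-'
logpattern_dash = r'(\S+)\s+(\S+)\s+(\S+)\s+\[(.+)\]\s+"(.+)"\s+(\d+)\s+(\d+|-)'
logregex_dash = re.compile(logpattern_dash)

def parse_bytes(field):
    """Convert the bytes field to an int, treating '-' as 0 (an assumption)."""
    return 0 if field == '-' else int(field)

# Made-up entry modeled on the example above, with '-' as the byte count
sample = 'maynard.isi.uconn.edu - - [28/Jul/1995:13:32:22 -0400] "GET /images/shuttle-patch-logo.gif HTTP/1.0" 304 -'
fields = logregex_dash.match(sample).groups()
nbytes = parse_bytes(fields[-1])
```

Whether 0 is the right stand-in depends on the analysis; keeping `None` for "unknown" is another reasonable choice.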

## Parsing Timestamps

The timestamp itself has some further structure that we want to extract.  Let's try using another regex to split up the timestamp string.  It is formatted in the following way: `Day/Month/Year:Hour:Minute:Second Timezone`.

Write a regular expression that parses the timestamp string (from the example above):

In [36]:
# Raw string again, so the \d, \w, \s and \S escapes are passed to the regex engine as written
tspattern = r'(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)\s+(\S+)'
tsregex = re.compile(tspattern)
matches = tsregex.match(timestamp)
day, month, year, hour, minute, second, timezone = matches.groups()

In [37]:
assert day == '28'
assert month == 'Jul'
assert year == '1995'
assert hour == '13'
assert minute == '32'
assert second == '22'
assert timezone == '-0400'

It turns out that parsing this timestamp using a regex is just the beginning!  In order to get something useful (i.e. dates and times that you can do ARITHMETIC on) you would have to translate the month string from `'Jul'` to the number `7`.  But what if somebody changes the log format to write out `'July'` instead of `'Jul'`?  Are you going to handle that case as well?  What if somebody changes the log system to spew month strings that are in French?

What if somebody starts spelling out timezones, e.g. `'US Mountain'`?  

Your head should start spinning now.  All of this CAN be done with regex, but with a lot of extra logic on top to check various cases.

Worse, you need to start understanding the intricacies of the calendar if you want to answer questions like:  how many days are there between `December 7, 1941` and `January 1, 2017`?  Start thinking about leap days!  Did you know there are leap seconds as well?

Fortunately, Python has a `datetime` module that is meant to simplify life (note: `datetime` does NOT handle leap seconds it turns out, but ignore that little nasty).  Let it do all of the hard date and time arithmetic for you!  Here is a tutorial to get you started: https://www.guru99.com/date-time-and-datetime-classes-in-python.html
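
To give a feel for what the module buys you, here is a short sketch that answers the "how many days" question from above.  The variable names are made up for illustration.

```python
from datetime import date

start = date(1941, 12, 7)
end = date(2017, 1, 1)

# Subtracting two dates gives a timedelta; leap days are accounted for automatically
elapsed = end - start
days_between = elapsed.days  # 27419
```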

Let's abandon the regex approach to timestamps.  `datetime` comes with a smart way to parse timestamp strings called `strptime`, which takes the string together with a format string describing its layout (it is covered in the tutorial).  Use that instead!

One thing to keep in mind:  all sane systems measure time using UTC (Coordinated Universal Time).  Roughly, this is just Greenwich Mean Time (with some subtleties).  The timezone will come formatted like `-0400` (4 hours behind UTC), or `+0800` (8 hours ahead of UTC).

Your task:  create a `datetime` object that holds the date and time that you extracted above:

In [48]:
from datetime import datetime, timezone, timedelta

# YOUR CODE GOES HERE
dt = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")

In [49]:
assert dt == datetime(1995, 7, 28, 13, 32, 22, tzinfo=timezone(-timedelta(hours=4)))
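
Because the parsed object carries its UTC offset, converting it to UTC is a one-liner.  The sketch below repeats the parsing step so that it runs on its own; `dt_utc` and `same_instant` are names made up for illustration.

```python
from datetime import datetime, timezone

timestamp = '28/Jul/1995:13:32:22 -0400'
dt = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")

# Convert the timezone-aware datetime to UTC: 13:32:22 at -0400 is 17:32:22 UTC
dt_utc = dt.astimezone(timezone.utc)

# Both objects refer to the same instant, so they compare equal
same_instant = (dt == dt_utc)
```

Normalizing every entry to UTC like this makes timestamps from different timezones directly comparable.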

## Using Spark to load weblogs into an RDD

In [1]:
from pyspark import SparkContext

sc = SparkContext('local', 'NASA_weblog_analysis') 

In [2]:
rdd = sc.parallelize([1,2,3,4])
rdd.collect()

[1, 2, 3, 4]