# Get Set Up

## Import Libraries

In [0]:
# pandas provides the DataFrame, a labeled table used throughout this lab
import pandas as pd

# re provides regular expression pattern matching
import re

# datetime provides a datetime object class and conversion utilities
from datetime import datetime

# Web file access
from urllib.request import urlopen

# Google tools
from google.colab import drive

# Math provides additional math functions
import math


## Define Some Functions

In [0]:
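def log_ReadFile(logfile):
  """Read a local log file and return its lines with surrounding whitespace stripped."""
  with open(logfile) as fh:
    return [line.strip() for line in fh]


def log_ReadURL(logfile):
  """Fetch a log file from a URL and return its decoded, stripped lines."""
  # the with-block closes the connection once the body has been read
  with urlopen(logfile) as response:
    return [line.decode().strip() for line in response.readlines()]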


def log_Parser(log_list, regx_obj, col_list):
  """Match each line against regx_obj; return (rows of col_list groups, unmatched lines)."""
  # initialize empty lists for results
  logs_parsed = []
  parse_fails = []
  
  # parse logs using supplied regex and column list
  for line in log_list:
    match = regx_obj.match(line)
    if match:
      logs_parsed.append([match.group(col) for col in col_list]) 
    else:
      parse_fails.append(line)
      
  # return parsed data and list of lines that were not parsed correctly
  return logs_parsed, parse_fails

## Load Data

In [0]:
# Define vars
error_url = "https://raw.githubusercontent.com/flarmy/ds101/master/error.log.2019-03-22"

# Read log files into lists
error_logs = log_ReadURL(error_url)

In [0]:
# display first 5 lines in list
#display(error_logs[:5])

## Parse Data

In [0]:
# define a regex pattern to parse lines into fields
# sample line:
# '[Fri Mar 22 02:13:49 2019] [error] [client 54.36.149.5] File does not exist: /home/sooshie/secrepo.com/self.logs/access.log.2015-09-07.gz'
# raw strings (r'...') keep backslashes such as \[ and \s from being read as string escapes
web_error_pattern = re.compile(r'\[(?P<date>[^\]]+)\]'
                               r'\s+\[(?P<msg_type>[^\]]+)\]'
                               r'\s+\[client\s+(?P<client_ip>[^\]]+)\]'
                               r'\s+(?P<error_type>[^:]+):'
                               r'\s+(?P<error_message>.*)')

# define list of columns to use
error_column_list = ['date', 'msg_type', 'client_ip', 'error_type', 'error_message']


# call parser
error_logs_parsed, error_logs_parsefail = log_Parser(error_logs, web_error_pattern, error_column_list)
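
Before parsing the whole file, it can help to check the pattern against the one sample line quoted in the comment above. The following is a minimal sketch: `pattern`, `sample_line`, and `sample_fields` are names introduced here for illustration, and the pattern is a copy of `web_error_pattern` written with raw strings.

```python
import re

# Copy of the lab's pattern, written with raw strings.
pattern = re.compile(
    r'\[(?P<date>[^\]]+)\]'
    r'\s+\[(?P<msg_type>[^\]]+)\]'
    r'\s+\[client\s+(?P<client_ip>[^\]]+)\]'
    r'\s+(?P<error_type>[^:]+):'
    r'\s+(?P<error_message>.*)'
)

# The sample line from the comment in the parsing cell.
sample_line = (
    '[Fri Mar 22 02:13:49 2019] [error] [client 54.36.149.5] '
    'File does not exist: '
    '/home/sooshie/secrepo.com/self.logs/access.log.2015-09-07.gz'
)

# groupdict() returns every named group as a {name: value} mapping.
sample_fields = pattern.match(sample_line).groupdict()
for name, value in sample_fields.items():
    print(f'{name}: {value}')
```

Each named group becomes one column in the dataframe built later, so a field that looks wrong here will look wrong in every row.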


In [0]:
# test parsing
#error_logs_parsed[:5]

In [0]:
# did any lines fail to parse?
#error_logs_parsefail[:5]
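
Previewing the failed lines shows what they look like but not how many there are. One way to quantify this is sketched below, assuming a hypothetical helper `parse_summary` that is not part of the lab; the two demo lists are made-up stand-ins for `error_logs_parsed` and `error_logs_parsefail`.

```python
def parse_summary(parsed, failed, preview=3):
    """Hypothetical helper: report how many lines the regex matched or missed."""
    total = len(parsed) + len(failed)
    return {
        'total': total,
        'parsed': len(parsed),
        'failed': len(failed),
        # guard against dividing by zero when the log is empty
        'fail_rate': len(failed) / total if total else 0.0,
        'preview': failed[:preview],
    }


# Made-up stand-ins, used only so the sketch runs on its own.
demo_parsed = [['row 1'], ['row 2'], ['row 3']]
demo_failed = ['a line the pattern did not match']

summary = parse_summary(demo_parsed, demo_failed)
print(summary)
```

In the notebook, the call would be `parse_summary(error_logs_parsed, error_logs_parsefail)`. A non-zero failure count suggests the pattern misses some log format and is worth investigating before drawing conclusions from the dataframe.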

In [41]:
# convert to Pandas dataframe and display it
error_logs_df = pd.DataFrame.from_records(error_logs_parsed, columns=error_column_list)
display(error_logs_df.head())

| | date | msg_type | client_ip | error_type | error_message |
|---|---|---|---|---|---|
| 0 | Fri Mar 22 01:58:55 2019 | error | 54.36.148.18 | File does not exist | /home/sooshie/secrepo.com/self.logs/error.log.... |
| 1 | Fri Mar 22 02:04:26 2019 | error | 54.36.148.62 | File does not exist | /home/sooshie/secrepo.com/self.logs/access.log... |
| 2 | Fri Mar 22 02:08:27 2019 | error | 27.255.4.117 | ModSecurity | Access denied with code 418 (phase 1). Pattern... |
| 3 | Fri Mar 22 02:08:28 2019 | error | 27.255.4.117 | ModSecurity | Access denied with code 418 (phase 1). Pattern... |
| 4 | Fri Mar 22 02:13:49 2019 | error | 54.36.149.5 | File does not exist | /home/sooshie/secrepo.com/self.logs/access.log... |
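
The `date` column holds plain strings, so it cannot be sorted or grouped by time until it is converted. The sketch below uses `datetime.strptime` from the `datetime` import at the top of the notebook; the constant name `LOG_DATE_FORMAT` is an assumption made here, and the sample value is copied from the first row of the output.

```python
from datetime import datetime

# Layout of the log timestamps: weekday, month, day, time, year.
# LOG_DATE_FORMAT is a name introduced for this sketch.
LOG_DATE_FORMAT = '%a %b %d %H:%M:%S %Y'

sample_date = 'Fri Mar 22 01:58:55 2019'  # first row of the output
parsed_date = datetime.strptime(sample_date, LOG_DATE_FORMAT)

print(parsed_date.isoformat())  # 2019-03-22T01:58:55
print(parsed_date.hour)         # 1
```

To convert the whole column at once, `pd.to_datetime(error_logs_df['date'], format=LOG_DATE_FORMAT)` accepts the same format string.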


# Instructions

In this lab, you will combine techniques from the previous labs to explore an additional data set: the web error logs from the same server as the access logs we looked at in Lab #1.

* To save time, the logs have already been parsed. They are stored in a Pandas dataframe called "error_logs_df".

* Conduct exploratory data analysis to gain an understanding of these logs, and answer the questions defined below. We recommend inserting text blocks to organize your work, if helpful. Several collapsible "header" sections have been provided to guide your analysis.

**To begin, choose "Runtime -> Run All" from the menu options.**

# Exploratory Analysis: Web Error Logs

To get started, two easy steps are provided for you. First, take a look at a sample of the data. Then run a simple command to generate some basic summary statistics for the entire dataframe (you should still examine individual columns more carefully).

## Examine the Data

In [42]:
error_logs_df.head()

| | date | msg_type | client_ip | error_type | error_message |
|---|---|---|---|---|---|
| 0 | Fri Mar 22 01:58:55 2019 | error | 54.36.148.18 | File does not exist | /home/sooshie/secrepo.com/self.logs/error.log.... |
| 1 | Fri Mar 22 02:04:26 2019 | error | 54.36.148.62 | File does not exist | /home/sooshie/secrepo.com/self.logs/access.log... |
| 2 | Fri Mar 22 02:08:27 2019 | error | 27.255.4.117 | ModSecurity | Access denied with code 418 (phase 1). Pattern... |
| 3 | Fri Mar 22 02:08:28 2019 | error | 27.255.4.117 | ModSecurity | Access denied with code 418 (phase 1). Pattern... |
| 4 | Fri Mar 22 02:13:49 2019 | error | 54.36.149.5 | File does not exist | /home/sooshie/secrepo.com/self.logs/access.log... |


## Summary Statistics

In [44]:
error_logs_df.describe()

| | date | msg_type | client_ip | error_type | error_message |
|---|---|---|---|---|---|
| count | 276 | 276 | 276 | 276 | 276 |
| unique | 208 | 1 | 133 | 2 | 169 |
| top | Fri Mar 22 08:52:49 2019 | error | 222.186.160.61 | File does not exist | /home/sooshie/secrepo.com/Datasets |
| freq | 4 | 276 | 17 | 239 | 17 |
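
For string columns, `describe()` reports only the single most frequent value. One way to examine an individual column more closely is `value_counts()`, sketched below on a small stand-in frame built from the five rows shown by `head()`. The names `sample_df`, `type_counts`, and `ip_counts` are introduced here for illustration.

```python
import pandas as pd

# Stand-in frame: client_ip and error_type from the five head() rows.
sample_df = pd.DataFrame({
    'client_ip': ['54.36.148.18', '54.36.148.62', '27.255.4.117',
                  '27.255.4.117', '54.36.149.5'],
    'error_type': ['File does not exist', 'File does not exist',
                   'ModSecurity', 'ModSecurity', 'File does not exist'],
})

# value_counts() tallies each distinct value, most frequent first.
type_counts = sample_df['error_type'].value_counts()
ip_counts = sample_df['client_ip'].value_counts()

print(type_counts)
print(ip_counts.head(3))
```

In the notebook, the same calls on `error_logs_df` give the full distribution of each column rather than just its top entry.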
