# Recognizing Directory Brute-force from apache2's access.log file

Hello, there. Today we will be taking a look at Directory Brute-force in apache2.

In a directory brute-force, the most common method is to request each entry in a dictionary (wordlist) of common directory names and observe the server's response. If the server returns 404, the directory doesn't exist. Otherwise it exists, though access may be restricted, e.g. the server answers 401 (Unauthorized) or 403 (Forbidden).

As the example log, I used the log analysis file from a competition I previously took part in, the KKST Qualifiers.

In [30]:
import re
import pandas as pd
from datetime import datetime
import pytz

In [31]:
def parse_str(x):
    """
    Returns the string delimited by two characters.

    Example:
        `>>> parse_str('[my string]')`
        `'my string'`
    """
    return x[1:-1]

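def parse_datetime(x):
    '''
    Parses datetime with timezone formatted as:
        `[day/month/year:hour:minute:second zone]`

    Example:
        `>>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')`
        `datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)`

    Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
    timezone will be obtained using the `pytz` library.
    '''
    # Strip the leading '[' and the trailing ' +0000]' before parsing.
    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    # Convert the '+HHMM' / '-HHMM' offset to minutes, applying the sign to
    # both hours and minutes (e.g. '-0530' is -330 minutes, not -270).
    sign = -1 if x[-6] == '-' else 1
    dt_tz = sign * (int(x[-5:-3]) * 60 + int(x[-3:-1]))
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))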

The functions above clean up fields with unusual formats: the timestamp, which is enclosed in square brackets, and strings enclosed in quotes.
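The docstring mentions problems with `%z`. On recent Python 3 versions, `datetime.strptime` does accept offsets such as `+0000`, so the timestamp can probably be parsed in one call without `pytz`. The following is a minimal sketch under that assumption (plus an English/C locale for month abbreviations like `Nov`); `parse_datetime_z` is a hypothetical name and is not used elsewhere in this notebook.

```python
from datetime import datetime, timedelta

def parse_datetime_z(x):
    """Parse '[day/month/year:hour:minute:second zone]' with %z (hypothetical helper)."""
    # The brackets are part of the format string, so no slicing is needed.
    # %z keeps the sign of offsets such as +0000 or -0530.
    return datetime.strptime(x, '[%d/%b/%Y:%H:%M:%S %z]')

utc = parse_datetime_z('[09/Nov/2020:04:07:00 +0000]')
west = parse_datetime_z('[09/Nov/2020:04:07:00 -0530]')

print(utc.isoformat())
# The negative offset covers both hours and minutes.
print(west.utcoffset() == -timedelta(hours=5, minutes=30))
```

If this works in your environment, it could be passed as the `time` converter in place of `parse_datetime`.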

In [32]:
data = pd.read_csv(
    'access.log',
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 3, 4, 5, 6, 7, 8],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': int,
                'size': int,
                'referer': parse_str,
                'user_agent': parse_str})
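
The `sep` pattern is the least obvious part of this call. It splits on whitespace, except whitespace inside double quotes or inside square brackets. Here is a small sketch that applies the same pattern with `re.split` to a single line. The log line is made up for illustration (the user agent in particular is hypothetical), following the layout of the Apache combined log format.

```python
import re

# Same separator as in the read_csv call.
SEP = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'

# Hypothetical log line in combined log format, made up for illustration.
line = ('192.168.77.87 - - [09/Nov/2020:04:07:00 +0000] '
        '"GET / HTTP/1.1" 200 3477 "-" "ExampleAgent/1.0 (test build)"')

fields = re.split(SEP, line)
for i, field in enumerate(fields):
    print(i, field)
```

This should give nine fields. Positions 1 and 2 are the two `-` placeholders, which is why `usecols` skips them and keeps 0 and 3 through 8.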

In [33]:
data.head()

| | ip | time | request | status | size | referer | user_agent |
|---|---|---|---|---|---|---|---|
| 0 | 192.168.77.87 | 2020-11-09 04:07:00+00:00 | GET / HTTP/1.1 | 200 | 3477 | | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... |
| 1 | 192.168.77.87 | 2020-11-09 04:07:00+00:00 | GET /robots.txt HTTP/1.1 | 404 | 492 | | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... |
| 2 | 192.168.77.87 | 2020-11-09 04:07:00+00:00 | GET /icons/ubuntu-logo.png HTTP/1.1 | 200 | 3623 | http://192.168.77.38/ | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... |
| 3 | 192.168.77.87 | 2020-11-09 04:07:00+00:00 | GET /favicon.ico HTTP/1.1 | 404 | 491 | http://192.168.77.38/ | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... |
| 4 | 192.168.77.87 | 2020-11-09 04:59:01+00:00 | GET / HTTP/1.1 | 200 | 2361 | | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... |


Now that the data is parsed, we can check whether the log shows signs of a directory brute-force. We will use `pandas` to count the requests by status code.

In [34]:
total = len(data)
print("Total digested requests:", total)
for status in data['status'].unique():
    # Comparing the whole column at once avoids a slow row-by-row apply.
    count = (data['status'] == status).sum()
    print(f"Requests with {status} status: {count} ({count / total * 100:.4f} %)")

Total digested requests: 19060
Requests with 200 status: 101 (0.5299 %)
Requests with 404 status: 18830 (98.7933 %)
Requests with 304 status: 1 (0.0052 %)
Requests with 500 status: 14 (0.0735 %)
Requests with 403 status: 108 (0.5666 %)
Requests with 301 status: 6 (0.0315 %)


As this example shows, a directory brute-force attack produces a large number of 404 responses, usually together with a sudden spike in server load. Ordinary users rarely get 404 responses this often. If you detect this pattern, check the logs to confirm the directory brute-force. It can occasionally be a false positive, but that is rare unless something on the server itself causes many 404s, e.g. untested code committed to the production server.
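One way to turn this observation into a check is to compute, for each IP, how many requests it made and what share of them returned 404. The sketch below uses a small made-up frame in place of the parsed `data`. The function name `suspicious_ips` and both thresholds are assumptions for illustration, not tuned values.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the parsed `data` frame.
sample = pd.DataFrame({
    'ip': ['10.0.0.5'] * 8 + ['10.0.0.9'] * 4,
    'status': [404, 404, 404, 200, 404, 404, 403, 404,
               200, 200, 404, 200],
})

def suspicious_ips(df, min_requests=5, min_404_share=0.7):
    """Return request count and 404 share for IPs above both (assumed) thresholds."""
    is_404 = df['status'].eq(404)
    stats = pd.DataFrame({
        'requests': df.groupby('ip').size(),
        # The mean of a boolean series is the share of True values.
        'share_404': is_404.groupby(df['ip']).mean(),
    })
    mask = (stats['requests'] >= min_requests) & (stats['share_404'] >= min_404_share)
    return stats[mask]

print(suspicious_ips(sample))
```

The minimum request count is there so that a visitor who hits one or two missing pages is not flagged just because their 404 share is high.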

Directory brute-forcing is usually done in alphabetical order, because the wordlists (dictionaries) are typically sorted. So if there is a sudden spike of 404s, and on inspection they come from the same IP with requests in alphabetical order, you can flag this as an incident.
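The alphabetical-order check can be roughly automated by measuring how many consecutive requests from one IP are in sorted order. This is only a sketch: `sorted_fraction` is a hypothetical helper, the paths are made up, and the comparison is a plain case-sensitive string comparison.

```python
def sorted_fraction(paths):
    """Fraction of consecutive pairs that are in non-decreasing order."""
    pairs = list(zip(paths, paths[1:]))
    if not pairs:
        return 0.0
    return sum(a <= b for a, b in pairs) / len(pairs)

# Hypothetical request paths from a single IP, in arrival order.
scan = ['/admin', '/backup', '/cgi-bin', '/config', '/db', '/login']
browse = ['/', '/robots.txt', '/icons/ubuntu-logo.png', '/favicon.ico', '/']

print(sorted_fraction(scan))    # 1.0
print(sorted_fraction(browse))  # 0.25
```

With the parsed log, the path could be taken from the `request` column, for example with `request.split()[1]`, after sorting one IP's rows by `time`. A value close to 1.0 over many requests would support the brute-force hypothesis, while normal browsing tends to score much lower.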