<img src="advenica-logo.svg" width="150" align="right">

# Intrusion detection Part 2 - WordPress (i.e. Apache) logs

_(Continued from Part 1)_

3. Locating the WordPress Site:
    - A search for the WordPress site leads to the `robots.txt` file, which points to the path `/blogblog/`.
    - Using a fuzzing tool called `ffuf`, some common WordPress URIs are found and, through those, the login page.
4. Exploiting WordPress Credentials:
    - The previously cracked passwords include the one for the WordPress admin account.
    - Finally, with admin rights to the entire WordPress site, there are multiple options for further exploration and exploitation.



# Investigating the incident

The WordPress site runs on an Apache web server, so we continue the investigation by looking for further evidence of the intrusion attempt in its access log.

In [None]:
# Install the packages used in this notebook
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install seaborn
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install logparser3
!{sys.executable} -m pip install drain3
!{sys.executable} -m pip install numpy

In [None]:
# Some pandas settings
import pandas as pd
pd.set_option('display.max_columns', None)          # Show all columns
pd.set_option('display.max_colwidth', 1000)         # Set a large max column width
pd.set_option('display.colheader_justify', 'left')  # Justify column headers to the left
pd.set_option('display.expand_frame_repr', False)   # Prevent line breaks

# Preprocessing

Preprocessing of the log file to make it more palatable for Drain.

- We rearrange each log entry, putting the timestamp first, followed by the client IP (which is probably an interesting feature).
- Apache's timestamp format is awkward to work with, so we transform it into something pandas understands natively.
- Raw log entries are expected to follow Apache's combined log format:
  - `<clientip> <clientid> <clientuser> [<timestamp>] "<method> <request> HTTP/<version>" <statuscode> <size> "<referer>" "<useragent>"`
- After preprocessing, the log contains only single-line entries of the form
  - `<timestamp> <clientip> <clientid> <clientuser> <statuscode> <size> <refererip> <refererpage> "<useragent>" "<request>"`
  - `2024-07-08T11:12:56+0100 192.168.56.1 - - 200 3640 - - "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" "/blogblog/wp-login.php"`
  - *Note*: `clientid` and `clientuser` are almost always `-`, so they can probably be ignored.
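
The timestamp transformation described above can be sketched in isolation. This is a minimal example, assuming a made-up sample value in Apache's default access-log timestamp format and an English/C locale for the month abbreviation; it only illustrates the conversion and is not taken from the actual log.

```python
from datetime import datetime

import pandas as pd

# Assumed sample value in Apache's default access-log timestamp format.
apache_ts = "08/Jul/2024:11:12:56 +0100"

# Parse day/month-name/year:time plus UTC offset, then emit an ISO 8601 string.
parsed = datetime.strptime(apache_ts, "%d/%b/%Y:%H:%M:%S %z")
iso_ts = parsed.strftime("%Y-%m-%dT%H:%M:%S%z")
print(iso_ts)

# pandas parses the ISO string without an explicit format and keeps the offset.
ts = pd.Timestamp(iso_ts)
print(ts, ts.utcoffset())
```

Keeping the UTC offset in the string means the timestamps stay timezone-aware once they are loaded into a DataFrame.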

In [None]:
import re
from datetime import datetime
import gzip
import pandas as pd
# Apache combined log format; literal dots are escaped so that they only match '.'
log_line_re = re.compile(r'^(?P<clientip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (?P<clientid>.*) (?P<clientuser>.*) \[(?P<timestamp>.+)\] \"(?P<method>[A-Za-z]+) (?P<request>.+) HTTP/\d\.\d\" (?P<statuscode>\d{3}) (?P<size>\d+) \"(?P<refererip>https?://\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d*)?(?P<refererpage>.+)\" \"(?P<useragent>.+)\"$')

def convert_datetime(dt):
    # Convert apache's weird date format to something more useful
    apache_format = "%d/%b/%Y:%H:%M:%S %z"
    # Define the output format
    pandas_format = "%Y-%m-%dT%H:%M:%S%z"
    # Parse the datetime string and reformat it
    dt = datetime.strptime(dt, apache_format)
    return dt.strftime(pandas_format)

def parse_raw_log(filename):
    log_entries = []
    num_lines = 0
    failed_lines = 0
    parsed_lines = 0
    with gzip.open(filename, "rt", encoding="utf8") as fp:
        for line in fp:
            num_lines += 1
            # Match each line once and reuse the result
            match = log_line_re.match(line)
            if match is None:
                failed_lines += 1
                print(f"failed to parse: {line!r}")
                continue
            parsed_lines += 1
            # Optional groups (e.g. refererip) are None when absent; store '-' instead
            entry = {k: v if v is not None else '-' for k, v in match.groupdict().items()}
            entry['timestamp'] = convert_datetime(entry['timestamp'])
            log_entries.append(entry)
    print(f"lines: {num_lines}, parsed: {parsed_lines}, failed: {failed_lines}")
    return log_entries

log_entries = parse_raw_log('access.log.gz')

In [None]:
import os
# Create the output directory if it does not exist yet
os.makedirs("preproc_data", exist_ok=True)
with open("preproc_data/preproc-access.log", "wt", encoding="utf8") as fp:
    for entry in log_entries:
        fp.write("{timestamp} {clientip} {clientid} {clientuser} {statuscode} {size} {refererip} {refererpage} \"{useragent}\" \"{request}\"\n".format(**entry))

In [None]:
df = pd.DataFrame(log_entries)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['statuscode'] = pd.to_numeric(df['statuscode'], downcast="unsigned")
df['size'] = pd.to_numeric(df['size'], downcast="unsigned")

In [None]:
import seaborn
import matplotlib.pyplot as plt

# Visualizations
seaborn.countplot(data=df, x='method')
plt.title('Distribution of HTTP Methods')
plt.show()

In [None]:
# Visualizations
# countplot returns a matplotlib Axes; rotate the tick labels without re-setting them
ax = seaborn.countplot(data=df, x='statuscode')
ax.tick_params(axis='x', rotation=90)
plt.title('Distribution of status codes')
plt.show()

In [None]:
df.set_index('timestamp', inplace=True)
df['hour'] = df.index.hour
hourly_patterns = df.groupby(['method', 'hour']).size().unstack().T
hourly_patterns.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Methods by Hour of Day')
plt.show()

# Drain & log parsing

After preprocessing, the logs have the following format:

```log
2024-07-08T11:16:26+0100 192.168.56.3 - - 200 4353 https://192.168.56.13:12380 /blogblog/ "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" "/blogblog/?p=21"
```
With the following fields:

```
<Timestamp> <Clientip> - - <Statuscode> <Size> <Refererip> <Refererpage> "<Useragent>" "<Request>"
```
Here the request is the event that Drain turns into templates, so it is passed to the parser as the `<Content>` field.
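
To illustrate how such a format string can split a line into named fields, here is a minimal sketch. `format_to_regex` is a hypothetical helper written for this example and is not the parser's own implementation; the sample line is adapted from the one above with a shortened user agent.

```python
import re

def format_to_regex(log_format):
    """Compile a '<Field> ...' format string into a regex (hypothetical helper)."""
    # Splitting on a capturing group keeps the placeholders: literals land on
    # even indices, '<Field>' placeholders on odd indices.
    parts = re.split(r"(<[^<>]+>)", log_format)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Literal text: escape it, but let each space match any run of whitespace.
            pattern += r"\s+".join(re.escape(tok) for tok in part.split(" "))
        else:
            # Placeholder: a non-greedy named group, e.g. <Size> -> (?P<Size>.*?)
            pattern += f"(?P<{part[1:-1]}>.*?)"
    return re.compile("^" + pattern + "$")

log_format = '<Timestamp> <Clientip> - - <Statuscode> <Size> <Refererip> <Refererpage> "<Useragent>" "<Content>"'
line = ('2024-07-08T11:16:26+0100 192.168.56.3 - - 200 4353 '
        'https://192.168.56.13:12380 /blogblog/ '
        '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)" "/blogblog/?p=21"')

fields = format_to_regex(log_format).match(line).groupdict()
for name, value in fields.items():
    print(f"{name:12} {value}")
```

Only the `Content` field is clustered into templates; the other fields are carried along as columns of the structured output.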

In [None]:
from logparser.Drain import LogParser

input_dir = 'preproc_data/' # The input directory of log file
output_dir = 'result/'  # The output directory of parsing results
log_file = 'preproc-access.log'
# Log format used to split each line into fields; Drain builds its templates from <Content>
log_format = '<Timestamp> <Clientip> - - <Statuscode> <Size> <Refererip> <Refererpage> "<Useragent>" "<Content>"'
# Regular expression list for optional preprocessing (default: [])
regex = [
    r'([\da-fA-F]{8,})', # HEX numbers
    r'(\d{5,})', # 'large' integers
    r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # IP numbers
]
st = 0.5  # Similarity threshold
depth = 4  # Depth of all leaf nodes

parser = LogParser(log_format, indir=input_dir, outdir=output_dir,  depth=depth, st=st, rex=regex)
parser.parse(log_file)

# Parsed log data

In [None]:
# Load structured data
df = pd.read_csv('result/preproc-access.log_structured.csv')
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
templates = pd.read_csv('result/preproc-access.log_templates.csv')

In [None]:
df.dtypes

In [None]:
df

# Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['EncodedEventId'] = label_encoder.fit_transform(df['EventId'])
df['EncodedClientip'] = label_encoder.fit_transform(df['Clientip'])
df['EncodedUseragent'] = label_encoder.fit_transform(df['Useragent'])
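
Note that a `LabelEncoder` that is refitted for each column only remembers the classes of the last column it saw. If the integer codes need to be mapped back to the original strings later, one encoder per column is needed. The following is a minimal sketch on made-up toy data, not on the actual structured log.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the structured log (values are made up).
toy = pd.DataFrame({
    "Clientip": ["192.168.56.1", "192.168.56.3", "192.168.56.1"],
    "Useragent": ["curl/8.0", "Mozilla/5.0", "Mozilla/5.0"],
})

# One encoder per column, so each mapping can be inverted later.
encoders = {}
for column in ["Clientip", "Useragent"]:
    encoders[column] = LabelEncoder()
    toy["Encoded" + column] = encoders[column].fit_transform(toy[column])

print(toy)

# Recover the original strings from the integer codes.
decoded = encoders["Clientip"].inverse_transform(toy["EncodedClientip"])
print(list(decoded))
```

The codes are assigned in sorted order of the distinct values, so they carry no meaning beyond identity. Tree-based models tolerate this reasonably well, which is one reason they are a convenient fit here.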

# Supervised vs Unsupervised

Since we know which log lines belong to the attack, we can add these labels to the structured data.

The classes are:
  - `normal` for lines that are not part of an attack.
  - `recon` for lines where the attacker fetches `robots.txt` and the index page of the WordPress blog.
  - `fuzzing` for lines that try to locate important WordPress pages.
  - `login` for the line where the attacker fetches the login page.

In [None]:

# The line ranges are meant to be inclusive; range() excludes its stop value, hence the "+ 1"
# Attacker fetching robots.txt and index page of WordPress blog
recon = list(range(6139, 6140 + 1))
# Attacker running fuzzing tool for more recon details
fuzzing = list(range(6141, 6274 + 1))
# Attacker fetching login page
login = [6275]

def mark_anomalies(row):
    if row['LineId'] in recon:
        return 'recon'
    if row['LineId'] in fuzzing:
        return 'fuzzing'
    if row['LineId'] in login:
        return 'login'
    return 'normal'

df['Classification'] = df.apply(mark_anomalies, axis=1)
df['Anomaly'] = df.apply(lambda row: -1 if row['Classification'] != 'normal' else 1, axis=1)
df
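
The same kind of labeling can also be expressed without a row-wise `apply`. The sketch below uses a made-up toy frame with made-up class boundaries; `Series.between` is inclusive on both ends, which avoids the off-by-one pitfalls of `range(start, stop)`.

```python
import numpy as np
import pandas as pd

# Toy frame; the line numbers and class boundaries below are made up.
toy = pd.DataFrame({"LineId": range(1, 11)})

# Series.between is inclusive on both ends, unlike range(start, stop).
conditions = [
    toy["LineId"].between(3, 4),   # recon
    toy["LineId"].between(5, 8),   # fuzzing
    toy["LineId"] == 9,            # login
]
toy["Classification"] = np.select(conditions, ["recon", "fuzzing", "login"], default="normal")

# Same convention as above: 1 = normal, -1 = anomaly.
toy["Anomaly"] = np.where(toy["Classification"] == "normal", 1, -1)

print(toy["Classification"].value_counts())
```

Printing the class counts is a cheap sanity check that every attack line ended up in the intended class.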

# Isolation forest, unsupervised

In [None]:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, classification_report, f1_score, precision_score, recall_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

features = df[['EncodedUseragent', 'EncodedEventId', 'EncodedClientip', 'Statuscode', 'Size']]
# Work on a copy so that adding the prediction column does not trigger a SettingWithCopyWarning
ldf = features.copy()

# Train Isolation Forest model (fixed seed for reproducible results)
model = IsolationForest(contamination=0.01, random_state=42)
ldf['PredictedAnomaly'] = model.fit_predict(features)

# Generate the confusion matrix
cm = confusion_matrix(df['Anomaly'], ldf['PredictedAnomaly'], labels=[1, -1])

# Calculate performance metrics
report = classification_report(df['Anomaly'], ldf['PredictedAnomaly'], labels=[1, -1], target_names=['Normal', 'Anomaly'])
precision = precision_score(df['Anomaly'], ldf['PredictedAnomaly'], pos_label=-1)
recall = recall_score(df['Anomaly'], ldf['PredictedAnomaly'], pos_label=-1)
f1 = f1_score(df['Anomaly'], ldf['PredictedAnomaly'], pos_label=-1)

print("Classification Report:")
print(report)

# Normalize the confusion matrix by row (i.e by the number of samples in each actual class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

# Plot the confusion matrix as a heatmap with counts and percentages
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Predicted Normal', 'Predicted Anomaly'], yticklabels=['Actual Normal', 'Actual Anomaly'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix with Counts')
plt.show()

plt.figure(figsize=(10, 7))
sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='Blues', xticklabels=['Predicted Normal', 'Predicted Anomaly'], yticklabels=['Actual Normal', 'Actual Anomaly'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix with Percentages')
plt.show()
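
The `contamination` parameter used above sets the fraction of samples the model flags as anomalous, so it directly controls how many lines end up as predicted anomalies. A minimal sketch on synthetic data (not the log data) shows the effect; the points and outlier positions are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 synthetic "normal" points around the origin plus two far-away outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.01, random_state=0)
pred = model.fit_predict(X)        # 1 = inlier, -1 = anomaly
scores = model.score_samples(X)    # lower score = more anomalous

print("flagged:", int((pred == -1).sum()), "of", len(X))
print("most anomalous rows:", np.argsort(scores)[:2])
```

If the real share of attack lines differs from the chosen `contamination`, the model will flag too many or too few lines, regardless of how well the scores separate the classes. Inspecting `score_samples` directly can therefore be more informative than the thresholded predictions.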



# Random forest, supervised

Next, we make use of the labels and try out random forest.


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


X = df[['EncodedUseragent', 'EncodedEventId', 'EncodedClientip', 'Statuscode', 'Size']]

y = df['Anomaly']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[1, -1])

# Print the classification report (pass labels explicitly so that they line up with target_names)
report = classification_report(y_test, y_pred, labels=[1, -1], target_names=['Normal', 'Anomaly'])
print("Classification Report:")
print(report)

# Plot the confusion matrix as a heatmap with counts
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Predicted Normal', 'Predicted Anomaly'], yticklabels=['Actual Normal', 'Actual Anomaly'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix with Counts')
plt.show()

# Normalize the confusion matrix by row (i.e., by the number of samples in each actual class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

# Plot the confusion matrix as a heatmap with percentages
plt.figure(figsize=(10, 7))
sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='Blues', xticklabels=['Predicted Normal', 'Predicted Anomaly'], yticklabels=['Actual Normal', 'Actual Anomaly'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix with Percentages')
plt.show()
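
Because attack lines are rare, a plain random split can leave very few of them in the test set. One option is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses synthetic labels with an assumed 2% anomaly rate; it is an illustration, not a rerun of the experiment above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 980 normal (1) and 20 anomalous (-1) rows.
y = np.array([1] * 980 + [-1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y both splits keep roughly the original anomaly rate.
print("train anomaly rate:", (y_train == -1).mean())
print("test anomaly rate:", (y_test == -1).mean())
```

With so few positive samples, the per-class recall in the classification report can swing noticeably between splits, so a stable class ratio makes the results easier to compare.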


# Explanation

The supervised approach is again successful right out of the box. The likely reason, as before, is that the anomalous and non-anomalous log entries share no event templates (`EventId`), which the check below confirms.

In [None]:
# Event templates seen in anomalous and in normal lines
anomalies = set(df[df['Anomaly'] == -1]['EventId'])
normies = set(df[df['Anomaly'] == 1]['EventId'])

# Templates that occur in both classes
mixed = normies & anomalies
print(len(mixed))