## hunt_password_spray_with_fourier
This notebook accepts a table or CSV file of timestamped authentication failures (from any authentication service)
and looks for evidence of password spraying against any of the accounts in the data.

To do this, we apply a fast Fourier transform (FFT) to each account's failure counts and look for failures that
occur on a near-metronomic cadence. Such a cadence is consistent with a password spraying script being run against one
or more accounts. It is then left to the investigator to determine what (if any) response actions are required.
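
As a rough illustration of the idea, the sketch below builds a synthetic series with one failure every 10 seconds and recovers that cadence from its spectrum. The function name `fundamental_period` and the synthetic data are assumptions made for this example; it is not the notebook's analysis function, which appears in a later cell.

```python
import numpy as np


def fundamental_period(counts, sample_spacing_seconds, multiplier=4):
    """Return the period (in seconds) of the lowest-frequency strong signal, or None."""
    spectrum = np.fft.rfft(counts)
    spectrum[0] = 0  # drop the DC component (the overall event rate)
    amplitudes = np.abs(spectrum)  # magnitude of each complex FFT bin
    frequencies = np.fft.rfftfreq(len(counts), d=sample_spacing_seconds)

    # A bin is "strong" if it sits more than `multiplier` stdevs above the mean amplitude
    threshold = amplitudes.mean() + multiplier * amplitudes.std()
    strong_bins = np.flatnonzero(amplitudes > threshold)
    if strong_bins.size == 0:
        return None
    # The lowest strong frequency is the fundamental; its reciprocal is the period
    return 1.0 / frequencies[strong_bins[0]]


# Synthetic data: 600 one-second bins with one failure every 10 seconds
periodic_counts = np.zeros(600)
periodic_counts[::10] = 1
period = fundamental_period(periodic_counts, sample_spacing_seconds=1)

# Synthetic data: a single failure, which has no cadence to find
single_event = np.zeros(600)
single_event[0] = 1
no_period = fundamental_period(single_event, sample_spacing_seconds=1)
```

Here `period` comes out at roughly 10 seconds, while the single-event series yields `None` because its spectrum is flat.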

The script parameters can be tuned to return more results at the cost of more false positives, or fewer results at the cost of more false negatives.

#### Prerequisite: 
Generate a .csv file of authentication failures, containing a column named "timestamp", and another named "username".
Save it to the same path as this notebook.
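
If you just want to check the expected file layout before exporting real data, a small synthetic file can be produced along these lines. The helper name `make_sample_authfailures`, the account name, and the dates are all made up for illustration.

```python
import pandas as pd


def make_sample_authfailures(start="2024-01-01 00:00:00", attempts=5,
                             period_seconds=10, username="alice"):
    """Build a synthetic table of evenly spaced failures for one hypothetical account."""
    timestamps = pd.date_range(start=start, periods=attempts, freq=f"{period_seconds}s")
    return pd.DataFrame({"timestamp": timestamps, "username": username})


sample = make_sample_authfailures()

# The CSV needs a header row with exactly these two column names
csv_text = sample.to_csv(index=False)

# To save it next to the notebook instead:
# sample.to_csv("authfailures.csv", index=False)
```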

In [None]:
# Import stuff

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

import numpy as np
from datetime import datetime, timedelta
import math
import statistics
from tqdm.notebook import tqdm

In [None]:
# Script variables.  Update these depending on your search conditions.

# Display graphs?
interactive = True

# Field to group the data by, for FFT analysis.  Change this based on your source data.
groupbyfield = "username"

# Number of standard deviations above the mean amplitude at which a signal is considered strong
# (set higher for fewer results, lower for fewer false negatives)
multiplier = 4

# Show full non-truncated dataframes
pd.options.display.max_colwidth = None

# Define some sampling periods, at which the data will be downsampled.  To find a signal with a periodicity of X,
# you will need to use a sampling period of at most X/2.
sampleperiods = [
        "1s",
        "3s",
        "5s",
]
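
The X/2 rule in the comment above is the Nyquist limit. As a quick sanity check, the sketch below lists the shortest cadence each configured sampling period can resolve. The helper name `shortest_detectable_period` is an assumption made for this example.

```python
import pandas as pd


def shortest_detectable_period(sampleperiod):
    """Shortest signal period (seconds) resolvable at this sampling period."""
    return 2 * pd.Timedelta(sampleperiod).total_seconds()


limits = {p: shortest_detectable_period(p) for p in ["1s", "3s", "5s"]}
# limits == {"1s": 2.0, "3s": 6.0, "5s": 10.0}
```

So a script that retries every 4 seconds can only be resolved by the "1s" sampling period in the default list.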

In [None]:
# Ingest a CSV file of timestamped authentication failures.  
# Include a header row. Change the filename as appropriate.
try:
    df = pd.read_csv("authfailures.csv", encoding='ISO-8859-1', header=0)
except Exception as error:
    print(error)

In [None]:
# Select just the columns you'd like to work with.

# .copy() avoids a SettingWithCopyWarning when the timestamp column is converted below
df = df[["timestamp", "username"]].copy()
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Set the timestamp as the index, so we can use resample()
df.index = df["timestamp"]

print("Ingested " + str(len(df)) + " lines of data")
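
Setting the timestamp as the index matters because resample() turns irregular event times into evenly spaced counts, which is the form an FFT expects. A minimal sketch with made-up timestamps:

```python
import pandas as pd

# Four hypothetical failures for one account
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 00:00:00",
                "2024-01-01 00:00:02",
                "2024-01-01 00:00:10",
                "2024-01-01 00:00:20",
            ]
        ),
        "username": ["alice"] * 4,
    }
)

# Count events per 5-second bin; empty bins are kept and counted as 0
counts = events.set_index("timestamp")["username"].resample("5s").count()
# counts.tolist() == [2, 0, 1, 0, 1]
```

The zero-filled bins are what keep the spacing between samples constant.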

In [None]:
# Begin by visualizing in a timeseries scatterplot.  This will give at least an overview of the data.
# A dense, evenly spaced horizontal run of points for any given username could be a good initial indicator.

fig = px.scatter(
    df, x="timestamp", y="username", color="username", render_mode="auto"
)
fig.show()

In [None]:
# This cell defines the analysis function.  It will iterate through each user or sourceIP, resample the data
# on our chosen sampling periods, and apply a Fourier transform to try to find any strong periodic signals.
# We'll run it in the next cell.

def get_fft_by_user(x):
    user = x

    strongestsignal = 0  # for now

    for sampleperiod in sampleperiods:
        # Group the original data by username, and then for each user, resample the data at each sampling period.
        # This will return an array of the count of events, per user, per sampling period.
        grouped = (
            df.groupby(groupbyfield)
            .resample(sampleperiod)
            .agg({"timestamp": "count"})
            .rename(columns={"timestamp": "counts"})
        )
        # Get only the current user
        grouped = grouped[grouped.index.get_level_values(0) == user]

        # Get the number of found events, per sample period
        countsperperiod = grouped["counts"]
        totaleventscount = grouped["counts"].sum()

        # Apply a Real-Input Fast Fourier Transform on the counts of items per timeslice.  This gets the
        # signal strength (amplitude) of each present frequency.  This will become the yaxis.
        fft = np.fft.rfft(countsperperiod)
        # Remove the DC component of the signal (first element)
        fft[0] = 0

        # Turn the current sampling period into an int, for use in the next step
        dvalue = int(sampleperiod.rstrip("s"))

        # Use the Real-Input Fast Fourier Transform Frequency rfftfreq() function to get the frequency of each FFT bin.
        # This will become the xaxis.  Arguments to rfftfreq():
        # n = window length, i.e. the number of samples in the resampled series
        # d = sample spacing in seconds, which is defined above (also equals 1/samplerate)
        frequencies = np.fft.rfftfreq(len(countsperperiod), dvalue)

        # Set a threshold for a "strong signal", at X stdevs over the mean amplitude.
        # The FFT output is complex, so measure each bin by its magnitude rather than its real part.
        amplitudes = np.abs(fft)
        stdev = np.std(amplitudes)
        mean = np.mean(amplitudes)
        threshold = mean + multiplier * stdev  # Adjust that multiplier as needed
        # Find the highest amplitude, which will be different per sampling period
        highestsignal = np.max(amplitudes)
        if highestsignal > strongestsignal:
            strongestsignal = highestsignal

        # Render any strong signals in frequency-domain, and then use the Inverse FFT to flip them back to time-domain.
        # To do so: find any amplitudes larger than "X" stdevs over the mean.  Keep the largest.
        for signal in amplitudes:
            if signal > threshold:
                amplitude = signal
                frequency = frequencies[np.where(amplitudes == signal)][0]
                period = 1 / frequency

                if signal == strongestsignal:
                    print(
                        "Found strong signal for user "
                        + user
                        + " with amplitude of "
                        + str(amplitude)
                        + ", frequency of "
                        + str(frequency)
                        + ",  sampling period: "
                        + sampleperiod
                    )

                    if interactive:
                        # Plot the sampling period with the strongest signal in frequency-domain: frequency X amplitude
                        fig = go.Figure()
                        fig.add_trace(
                            go.Scatter(
                                mode="lines",
                                x=frequencies.real,
                                # fft is complex; its absolute value (magnitude) gives the amplitude
                                y=np.abs(fft),
                            )
                        )
                        fig.update_layout(
                            xaxis_title="Frequency (cycle/sec)",
                            yaxis_title="amplitude",
                            title="User "
                            + x
                            + " authentication failure events in Frequency Domain; Sampling Period: "
                            + sampleperiod,
                            showlegend=False,
                        )

                        # Use the Inverse Real-Mode Fast Fourier Transform to flip back to time-domain
                        inversefft = np.fft.irfft(fft, len(countsperperiod))

                        # Plot the time-domain data, which should show periodicity
                        fig2 = go.Figure()
                        fig2.add_trace(
                            go.Scatter(
                                mode="lines",
                                x=grouped.index.get_level_values(1),
                                y=inversefft,
                            )
                        )
                        fig2.update_layout(
                            xaxis_title="Timestamp",
                            yaxis_title="count",
                            title="User "
                            + user
                            + " Periodic Signal in Time Domain at frequency "
                            + str(frequency)
                            + "Hz (period "
                            + str(1 / frequency)
                            + " sec)",
                            showlegend=False,
                        )

                        # Render the graphs
                        fig.show()
                        fig2.show()

                    return pd.DataFrame(
                        {
                            "user": [user],
                            "frequency (Hz)": [frequency],
                            "period (sec)": [period],
                            "starttime": [grouped.index.get_level_values(1)[0]],
                            "endtime": [grouped.index.get_level_values(1)[-1]],
                            "totaleventscount": totaleventscount,
                        }
                    )
    # If no sampling period produced a strong signal, return an empty df
    return pd.DataFrame()
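
To make the rfftfreq() arguments used in the function above more concrete, here is a small sketch with assumed values: 8 samples taken 5 seconds apart.

```python
import numpy as np

# n = 8 samples, d = 5 seconds between samples
frequencies = np.fft.rfftfreq(8, d=5)
# Five bins: 0, 0.025, 0.05, 0.075 and 0.1 Hz

# Skip the 0 Hz (DC) bin before converting frequencies to periods
periods = 1 / frequencies[1:]
# Roughly 40, 20, 13.3 and 10 seconds
```

The last bin, 0.1 Hz, is 1 / (2 * d). That is why a 5-second sampling period cannot resolve any cadence faster than 10 seconds.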

In [None]:
# Start of analysis.  
# We will call the function from the above cell, to take our initial dataframe
# and run an FFT analysis on it, grouped by our chosen groupby field.

potentialperiodicsignals = pd.DataFrame()  # empty, for now
for user in tqdm(df[groupbyfield].unique()):
    try:
        potentialsignal = get_fft_by_user(user)
        potentialperiodicsignals = pd.concat([potentialperiodicsignals, potentialsignal])
    except Exception as error:
        print(error)
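
For data sets with many accounts, one possible variation on the loop above is to collect the per-user frames in a list and concatenate once at the end. This is only a sketch: `collect_results` and `fake_analyze` are hypothetical names, and `fake_analyze` stands in for the real analysis function.

```python
import pandas as pd


def collect_results(keys, analyze):
    """Run `analyze` for each key and combine the non-empty results into one DataFrame."""
    frames = []
    for key in keys:
        try:
            frames.append(analyze(key))
        except Exception as error:
            # Report which key failed, then carry on with the rest
            print(f"{key}: {error}")
    frames = [frame for frame in frames if not frame.empty]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


def fake_analyze(user):
    """Hypothetical stand-in for the analysis function, for demonstration only."""
    if user == "bob":
        raise ValueError("no data")
    if user == "carol":
        return pd.DataFrame()  # no strong signal found
    return pd.DataFrame({"user": [user], "period (sec)": [10.0]})


results = collect_results(["alice", "bob", "carol", "dave"], fake_analyze)
```

A single concat avoids repeatedly copying the accumulated results, and printing the key next to the error makes failures easier to trace.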

In [None]:
# Display any results.  This will show any users/sources that had identified periodic password
# failures, what the period was, when it started/ended, and how many events were identified.
potentialperiodicsignals