## Threat Hunting Report with Logarithms
### Logarithms
The logarithm is used because it compresses a large dataset and helps rare events stand out. "Compress" in this context means that common data points end up grouped close together, while rare ones land far away from them. The default threshold is -3, which is a good starting balance between catching anomalies and drowning in noise, but you will need to tune it for your data.

Raising or lowering the threshold changes which values are reported, which helps you identify anomalies that a single setting would miss.
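
A minimal sketch of the calculation, assuming the score is the base-10 logarithm of a value's share of all entries (as `identify_anomalies` in the code cell computes it). The helper name `log_score` and the counts are illustrative, not taken from a real dataset.

```python
import math

def log_score(count, total_entries):
    """Base-10 logarithm of the share of entries that carry this value."""
    return math.log10(count / total_entries)

total_entries = 10_000  # illustrative dataset size
threshold = -3.0        # default threshold discussed above

for label, count in [("common", 2_000), ("rare", 2)]:
    score = log_score(count, total_entries)
    flagged = score < threshold  # anything below the threshold is reported
    print(f"{label}: count={count} score={score:.3f} flagged={flagged}")
```

A value seen on 2,000 of 10,000 entries scores about -0.70 and stays quiet, while one seen only twice scores about -3.70 and falls below -3.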

### Heterogeneous Dataset
If the dataset is heterogeneous, you will likely need a higher threshold (and therefore less sensitivity), so start testing at around -2 or -2.5. There is more variability in the data, so lower sensitivity makes the anomalies stand out more easily.

This applies when, for example, you analyze processes across different operating system versions or across hosts with different roles (e.g., IIS web servers versus Exchange servers). Mixing such hosts in one analysis is not recommended, however.

### Homogeneous Dataset
If the hosts in the dataset are similar, use a lower threshold; -3 is a good start.

At a threshold of -3, two entries in the sample analysis below are flagged as anomalies. In each row, the first value is the number of hosts the file is on, followed by the logarithmic value and the file path. Since the dataset is homogeneous, the lower threshold shows rare processes running.

### Dataset Size
Larger datasets provide higher confidence in detecting anomalies, so a lower threshold works well (-3, -3.5, or -4).

Smaller datasets provide less confidence, so a higher threshold such as -2 or -1 is necessary.
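
One way to see why size matters, sketched under the assumption that scores are computed as `log10(count / total_entries)`: the rarest possible value appears exactly once, so its score is `log10(1 / N)`. No value can score lower than that, which puts a floor on the thresholds that can ever fire. The function name and the dataset sizes below are illustrative.

```python
import math

def lowest_possible_score(total_entries):
    """Score of a value that appears exactly once in the dataset."""
    return math.log10(1 / total_entries)

candidate_thresholds = (-1.0, -2.0, -3.0, -4.0)

for total_entries in (50, 500, 5_000, 50_000):
    floor = lowest_possible_score(total_entries)
    # A threshold can only flag something if the floor lies below it.
    usable = [t for t in candidate_thresholds if floor < t]
    print(f"N={total_entries:>6}  single-occurrence score={floor:.2f}  "
          f"thresholds that can fire={usable}")
```

Put differently, a threshold of -3 can only flag something once the dataset has more than 1,000 entries, which is one reason small datasets need a higher threshold.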

### Sample Analysis

```
[snipped for brevity]
[200 -1.838629 C:\ProgramData\Ashampoo Winzip AshampooWinZip.exe]
[200 -1.838629 C:\Program Files\WebServer_11.24.0.0_x64__8wekyb3d8bbwe\WebServer.exe]
[200 -1.838629 C:\Program Files\SessionManager_11.24.0.0_x64__8wekyb3d8bbwe\SessionManager.exe]
[200 -1.838629 C:\ProgramData\Dell\Supportassist\DellSupportAssist.exe]
[200 -1.838629 C:\Program Files (x86)\Twitch Twitch.exe]
[200 -1.838629 C:\Program Files (x86)\Microsoft Edge\Application MicrosoftEdge.exe]
[200 -1.838629 C:\Windows\SystemHealth.exe]
[200 -1.838629 C:\Windows\system32\taskmgr.exe]
[199 -1.840806 C:\Program Files (x86)\iPod\ iTunes AppleiTunes.exe]
[193 -1.854101 C:\Windows\taskmgr.exe]
[2 -3.838629 C:\ProgramData\system32\csrss.exe]
[1 -4.139659 C:\Program Files (x86)\iPod\ iTunes AppleTunes.exe]
```

Note that C:\Windows\taskmgr.exe is on 193 hosts but does not show up as an anomaly with a threshold of -3. That is why you need to change the threshold: anomalies can exist outside of any single threshold. While larger datasets give higher confidence in anomalies found with a lower threshold, some anomalies can still be missed. All of these factors are the reason you need to change the threshold during your analysis.

Also, the legitimate taskmgr.exe is located at C:\Windows\system32\taskmgr.exe. If this were a real environment, this would likely be a mass compromise UNLESS a custom program lives in that path or a third-party program has a process with a similar name. CONTEXT! CONTEXT! CONTEXT!
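
To illustrate the effect of moving the threshold, here is a small sketch that replays a few rows copied from the sample output above against several thresholds. The `flagged` helper is hypothetical, and only the rows listed here are considered; the snipped part of the output would add more results at the higher thresholds.

```python
# (count, score, path) rows copied from the sample output above.
rows = [
    (200, -1.838629, r"C:\Windows\system32\taskmgr.exe"),
    (199, -1.840806, r"C:\Program Files (x86)\iPod\ iTunes AppleiTunes.exe"),
    (193, -1.854101, r"C:\Windows\taskmgr.exe"),
    (2, -3.838629, r"C:\ProgramData\system32\csrss.exe"),
    (1, -4.139659, r"C:\Program Files (x86)\iPod\ iTunes AppleTunes.exe"),
]

def flagged(rows, threshold):
    """Return the paths whose score falls below the threshold."""
    return [path for _count, score, path in rows if score < threshold]

for threshold in (-3.0, -1.85, -1.84):
    print(f"threshold {threshold}:")
    for path in flagged(rows, threshold):
        print(f"  {path}")
```

At -3 only the csrss.exe and AppleTunes.exe rows are reported. C:\Windows\taskmgr.exe first appears at -1.85, and at that level every other file that is slightly less common than the baseline appears too, so the extra visibility comes with extra review work.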

In the above output, you can also see the difference between the 'AppleiTunes.exe' path (199 hosts) and the 'AppleTunes.exe' path (1 host).
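
Both observations (the same file name in two directories, and two nearly identical file names) are easy to miss by eye. The sketch below is one possible way to surface them and is not part of the notebook's own analysis: it groups paths by file name with `pathlib.PureWindowsPath` and compares names with `difflib.SequenceMatcher`. The helper names and the 0.9 similarity cutoff are assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import PureWindowsPath

# Paths copied from the sample output above.
paths = [
    r"C:\Windows\system32\taskmgr.exe",
    r"C:\Windows\taskmgr.exe",
    r"C:\Program Files (x86)\iPod\ iTunes AppleiTunes.exe",
    r"C:\Program Files (x86)\iPod\ iTunes AppleTunes.exe",
    r"C:\ProgramData\system32\csrss.exe",
]

def same_name_different_dir(paths):
    """Map each file name to its directories, keeping names seen in more than one."""
    by_name = defaultdict(set)
    for p in paths:
        win = PureWindowsPath(p)
        by_name[win.name.lower()].add(str(win.parent).lower())
    return {name: sorted(dirs) for name, dirs in by_name.items() if len(dirs) > 1}

def lookalike_names(paths, min_ratio=0.9):
    """Return pairs of distinct file names whose similarity ratio meets the cutoff."""
    names = sorted({PureWindowsPath(p).name.lower() for p in paths})
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if SequenceMatcher(None, a, b).ratio() >= min_ratio
    ]

print(same_name_different_dir(paths))
print(lookalike_names(paths))
```

Neither check decides whether a file is malicious. They only produce a shorter list to review with the context of the environment in mind.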

The GitHub repo contains a Go-based version of the logarithmic functions below, plus another tool that creates a baseline and then compares the remaining systems against it: https://github.com/thedunston/goMeeb/

In [None]:
import os
import csv
import math
import threading
import queue
import pandas as pd
import jinja2
import sys
from pathlib import Path
from collections import defaultdict

# Retrieve list of CSV files from the specified directory.
def get_csv_files(directory):
    # Recursively gather all CSV files.
    files = [str(path) for path in Path(directory).rglob('*.csv')]
    if not files:

        raise FileNotFoundError(f"No CSV files found in directory {directory}")
    
    return files

# Determine the number of threads to use based on the number of files.
def determine_num_threads(files):
    
    # Basic method to determine number of threads.
    # This is helpful for dozens of CSV files.
    num_threads = len(files) // 2
    if num_threads > 10:
        num_threads = 10
    elif num_threads < 1:
        num_threads = 1
    return num_threads

# Process each CSV file and return the count of occurrences for the specified header.
def process_file(file, header, data_queue):
    data = defaultdict(int)
    total_entries = 0

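    # Open the CSV file. The csv module expects files opened with newline=''.
    with open(file, 'r', newline='') as f:
        reader = csv.DictReader(f)
        # Check if the header is present in the CSV file.
        # fieldnames is None for an empty file, so guard against that as well.
        if not reader.fieldnames or header not in reader.fieldnames:
            print(f"Header {header} not found in file {file}")
            return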
        
        # Count occurrences of each value under the specified header.
        for row in reader:
            value = row[header]
            data[value] += 1
            total_entries += 1
    
    # Put the processed data into the queue.
    data_queue.put((data, total_entries))

# Aggregate data from all files for the specified header.
def aggregate_data(files, header, num_threads):
    data_queue = queue.Queue()
    threads = []

    # Create and start threads for concurrent processing.
    for i in range(num_threads):
        t = threading.Thread(target=process_files_thread, args=(files[i::num_threads], header, data_queue))
        threads.append(t)
        t.start()

    # Wait for all threads to finish.
    for t in threads:
        t.join()

    aggregated_data = defaultdict(int)
    total_entries = 0

    # Collect data from the queue and aggregate it.
    while not data_queue.empty():
        data, count = data_queue.get()
        for key, value in data.items():
            aggregated_data[key] += value
        total_entries += count

    return aggregated_data, total_entries

# Process files.
def process_files_thread(files, header, data_queue):
    for file in files:
        process_file(file, header, data_queue)

# Identify anomalies based on the specified threshold.
# ChatGPT help with this...math. :)
def identify_anomalies(data, total_entries, threshold, filter_terms=None):
    anomalies = []
    # Normalize the exclusion terms: trim whitespace, lowercase, drop empty terms.
    if filter_terms:
        filter_terms = [term.strip().lower() for term in filter_terms if term.strip()]
    
    for value, count in data.items():
        # Skip any value that contains one of the exclusion terms.
        if filter_terms and any(term in value.lower() for term in filter_terms):
            continue
            
        proportion = float(count) / float(total_entries)
        log_proportion = math.log10(proportion)
        if log_proportion < threshold:
            anomalies.append([count, log_proportion, value])
    anomalies.sort(key=lambda x: x[1])
    return anomalies
    
def print_results_dataframe(results):
    # Create and clean the DataFrame
    df = pd.DataFrame(results, columns=['Count', 'Log Proportion', 'Value'])
    
    # 1. Remove all problematic whitespace
    df['Value'] = df['Value'].str.strip()
    df['Value'] = df['Value'].str.replace(r'\s+', ' ', regex=True)
    
    # 2. Format numbers consistently
    df['Log Proportion'] = df['Log Proportion'].apply(lambda x: f"{float(x):.5f}")
    df['Count'] = df['Count'].astype(str)
    
    # 3. Environment detection and printing
    if 'ipykernel' in sys.modules:  # Jupyter Notebook/Lab
        from IPython.display import HTML, display
        html = df.to_html(
            index=False,
            justify='left',
            border=0,
            classes=['left-aligned-table'],
            na_rep=''
        )
        # CSS injection for perfect left alignment
        display(HTML(f"""
        <style>
        .left-aligned-table td, .left-aligned-table th {{
            text-align: left !important;
            font-family: monospace;
            white-space: pre;
        }}
        </style>
        {html}
        """))
    else:  # Terminal or script
        from tabulate import tabulate
        print(tabulate(
            df,
            headers='keys',
            tablefmt='psql',
            showindex=False,
            stralign='left',
            numalign='left'
        ))

try:

    ################ BEGIN EDITING HERE #########################

    # Directory path where the CSV files are stored.
    # Download the threat hunting dataset: https://github.com/mosse-security/threat-hunting-samples and update the path.
    directory = "dataset-1/w32processes"
    
    # Header (column name) in the CSV files to count values for. This must be set
    # based on the CSV files you use; an empty string will not match any column.
    header = ""
    
    # Comma-separated values to exclude from results
    filter_str = ""
    
    # Threshold for identifying anomalies. This value will need to be adjusted 
    # based on the size of the dataset and the variability in the data.
    # Careful here because rendering the results in the browser can
    # cause it to crash if there are a lot of results.
    threshold = -2.0

    ################ END EDITING HERE #########################
    files = get_csv_files(directory)
    num_threads = determine_num_threads(files)
    aggregated_data, total_entries = aggregate_data(files, header, num_threads)
    
    # Parse filter string and pass to identify_anomalies
    filter_terms = [term.strip() for term in filter_str.split(',')] if filter_str else None
    
    # Identify anomalies, skipping any values that match the filter terms.
    results = identify_anomalies(aggregated_data, total_entries, threshold, filter_terms)
    print_results_dataframe(results)

except Exception as e:
    print(f"An error occurred: {e}")
