# Phase 3 - Feature Engineering
In this phase we make the master timeline table easier to feed to the model. The merged master timeline currently holds millions of rows, so we examine the contents of its columns to find rows that can be dropped, with the goal of reducing the row count significantly.

## 1. Dropping UsnJrnl-related rows
The cleaning and merging done in the previous phases show that the bulk of the master timeline consists of UsnJrnl rows. We therefore inspect the values in their columns and drop rows that are irrelevant, or only loosely related, to timestomping.

### 1.1 Addressing eventinfo 

In [23]:
import pandas as pd
import os
from pathlib import Path
from IPython.display import display

# --- Configuration ---
# File to load (The output from Phase 2.1)
INPUT_FILEPATH = Path('data/processed/phase 2.1 - data merged (all sub-folders)/MASTER_TIMELINE_ALL_CASES.csv')
# Output file for the filtered data
OUTPUT_DIR = 'data/processed/phase 3 - feature engineered'
OUTPUT_FILENAME = 'MASTER_TIMELINE_FILTERED.csv'
OUTPUT_FILEPATH = Path(OUTPUT_DIR) / OUTPUT_FILENAME

print("--- UsnJrnl Event Analysis: Unique Eventinfo Instances ---")

try:
    # Load the comprehensive master timeline (using low_memory=False for large files)
    df_master = pd.read_csv(INPUT_FILEPATH, low_memory=False)
    print(f"Master Timeline loaded successfully. Total Rows: {len(df_master):,}")
    
except FileNotFoundError:
    print(f"FATAL ERROR: Master Timeline not found at {INPUT_FILEPATH}")
    df_master = pd.DataFrame() 

if not df_master.empty:
    
    # 1. Filter down to only UsnJrnl records
    df_usn = df_master[df_master['source'] == 'UsnJrnl'].copy()
    print(f"UsnJrnl Records Isolated: {len(df_usn):,} rows.")
    
    # 2. Extract unique event values from the 'eventinfo' column
    # We drop any potential NaN/NA values as they are not actual events
    unique_events = df_usn['eventinfo'].dropna().unique()
    
    # 3. Sort the list alphabetically for easy review
    unique_events_sorted = sorted(unique_events)
    
    # 4. Print the results
    print("\nTotal Unique UsnJrnl Events Found:")
    print("-" * 40)
    for event in unique_events_sorted:
        print(f"- {event}")
    print("-" * 40)
    print(f"Count of Unique Events: {len(unique_events_sorted)}")

else:
    print("Cannot perform analysis as the Master Timeline DataFrame is empty.")

--- UsnJrnl Event Analysis: Unique Eventinfo Instances ---
Master Timeline loaded successfully. Total Rows: 3,190,725
UsnJrnl Records Isolated: 3,128,446 rows.

Total Unique UsnJrnl Events Found:
----------------------------------------
- Access_Right_Changed
- Access_Right_Changed / File_Closed
- Access_Right_Changed / Transacted_Changed / File_Closed
- Basic_Info_Changed
- Basic_Info_Changed / Access_Right_Changed
- Basic_Info_Changed / Compression_Changed / Data_Overwritten / Content_Indexed_Attr_Changed
- Basic_Info_Changed / Compression_Changed / Data_Overwritten / Content_Indexed_Attr_Changed / File_Closed
- Basic_Info_Changed / Content_Indexed_Attr_Changed
- Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Closed
- Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Closed / File_Deleted
- Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Renamed_New
- Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Renamed_New / File_Closed / File_Deleted
- Basic_In
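The values listed above are combinations of atomic flags joined by ` / `, so the number of unique strings overstates how many distinct event types exist. The following is a minimal sketch of counting the atomic flags instead; the `eventinfo` Series is a toy stand-in for `df_usn['eventinfo']`, not data from the actual timeline.

```python
import pandas as pd

# Toy stand-in for df_usn['eventinfo']; values mimic the format listed above.
eventinfo = pd.Series([
    "Basic_Info_Changed",
    "Basic_Info_Changed / Access_Right_Changed",
    "Access_Right_Changed / File_Closed",
    None,  # missing values are dropped before splitting
])

# Split each compound string on ' / ', flatten to one flag per row, then count.
flag_counts = (
    eventinfo.dropna()
    .str.split(" / ")
    .explode()
    .str.strip()
    .value_counts()
)
print(flag_counts)
```

Counting per flag rather than per combination can make it easier to decide which flags are worth keeping.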

### 1.2 Dropping unrelated rows from eventinfo

The primary goal was to filter the 3.19 million rows down to events directly relevant to timestomping detection (MAC time manipulation) by discarding high-volume, low-value operating system noise from the UsnJrnl.

The strategy was to keep all LogFile records and to keep only those UsnJrnl records whose eventinfo contains at least one of the following six high-value markers:

- Basic_Info_Changed: critical, because it directly affects the stored MAC timestamps.
- File_Created / File_Renamed: essential baseline and structural file changes.
- Data_Overwritten / Data_Added / Data_Truncated: content modification events that should trigger time updates.

Because the match is a substring search, a compound value such as Basic_Info_Changed / Access_Right_Changed is kept when any one of its parts is a marker, and File_Renamed also matches File_Renamed_New.

Noise Exclusion

By focusing on these high-value markers, we excluded UsnJrnl records that contain none of them, including:
- Permissions/Security: events like Access_Right_Changed and SECURITY_CHANGE.
- Internal/System: events related to TxF (Transacted_Changed), reparse points such as volume mount points (Reparse_Point_Changed), and File_Closed events with no accompanying marker.
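The keep rule can be illustrated on a handful of values before running it on the full timeline. This is a minimal sketch on a toy Series; `keep_pattern` and the sample values are illustrative, and `re.escape` is an added precaution that the notebook's own pattern does not use.

```python
import re
import pandas as pd

HIGH_VALUE_MARKERS = [
    "Basic_Info_Changed", "File_Created", "Data_Overwritten",
    "Data_Added", "Data_Truncated", "File_Renamed",
]

# Join the markers with the regex OR operator; re.escape guards against
# a marker ever containing regex metacharacters (an assumption, not needed today).
keep_pattern = "|".join(re.escape(m) for m in HIGH_VALUE_MARKERS)

# Toy eventinfo values in the same format as the UsnJrnl output.
eventinfo = pd.Series([
    "Basic_Info_Changed / File_Closed",    # kept: contains a marker
    "Access_Right_Changed / File_Closed",  # dropped: no marker
    "File_Renamed_New / File_Closed",      # kept: 'File_Renamed' is a substring
    None,                                  # dropped: missing value
])

is_high_value = eventinfo.fillna("").str.contains(keep_pattern, case=False, regex=True)
print(is_high_value.tolist())
```

Substring matching keeps the rule short, at the cost of also retaining any future flag whose name happens to contain one of the markers.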


In [24]:
# --- Configuration ---
# File to load (The output from Phase 2.1, containing all 12 cases)
INPUT_FILEPATH = Path('data/processed/phase 2.1 - data merged (all sub-folders)/MASTER_TIMELINE_ALL_CASES.csv')
# Output file for the filtered data (Phase 3 Directory)
OUTPUT_DIR = 'data/processed/phase 3 - feature engineered'
OUTPUT_FILENAME = 'MASTER_TIMELINE_FILTERED.csv'
OUTPUT_FILEPATH = Path(OUTPUT_DIR) / OUTPUT_FILENAME

print("--- Phase 3.1: Noise Reduction (Filtering UsnJrnl Events) ---")

try:
    # Load the comprehensive master timeline (using low_memory=False for large files)
    df_master = pd.read_csv(INPUT_FILEPATH, low_memory=False)
    
    # Ensure datetime columns are correctly typed after CSV load
    TIME_COLS = ['timestamp_primary', 'creationtime', 'modifiedtime', 'mftmodifiedtime', 'accessedtime']
    for col in TIME_COLS:
        if col in df_master.columns:
            # Use errors='coerce' to turn unparseable dates into NaT
            df_master[col] = pd.to_datetime(df_master[col], errors='coerce', utc=True) 

except FileNotFoundError:
    print(f"FATAL ERROR: Master Timeline not found at {INPUT_FILEPATH}")
    # Initialize an empty DataFrame to prevent script crash
    df_master = pd.DataFrame() 

if not df_master.empty:
    initial_rows = len(df_master)
    print(f"Initial Master Timeline rows: {initial_rows:,}")

    # --- Define HIGH-VALUE USNJRNL EVENTS to KEEP (Forensically Relevant Activity) ---
    # We define a regex pattern that looks for ANY of these critical modification markers.
    HIGH_VALUE_MARKERS = [
        'Basic_Info_Changed', # Most relevant, affects MAC times directly
        'File_Created',       # File Creation event
        'Data_Overwritten',   # File Content changed
        'Data_Added',         # File Content size increased
        'Data_Truncated',     # File Content size decreased
        'File_Renamed'        # File Renaming event
    ]
    
    # Create a single regex pattern using the OR operator (|)
    KEEP_PATTERN = '|'.join(HIGH_VALUE_MARKERS)


    # --- Filtering Logic ---
    
    # 1. Select all LogFile records (we keep them all, as they are essential LSN-based events)
    df_log_records = df_master[df_master['source'] == 'LogFile'].copy()
    print(f"LogFile records kept: {len(df_log_records):,}")
    
    # 2. Select UsnJrnl records
    df_usnjrnl = df_master[df_master['source'] == 'UsnJrnl'].copy()

    # 3. Create a boolean mask: KEEP if the 'eventinfo' contains any of the HIGH_VALUE_MARKERS.
    # We use .fillna('') to safely perform string operations even if eventinfo has NaNs.
    is_high_value = df_usnjrnl['eventinfo'].fillna('').str.contains(KEEP_PATTERN, case=False, na=False, regex=True)

    # Apply the mask to get the filtered high-value UsnJrnl subset
    df_usnjrnl_filtered = df_usnjrnl[is_high_value].copy()
    print(f"UsnJrnl records kept after filtering: {len(df_usnjrnl_filtered):,}")

    # 4. Concatenate the high-value subsets back into the new filtered master
    df_master_filtered = pd.concat([df_log_records, df_usnjrnl_filtered], ignore_index=True)
    
    final_rows = len(df_master_filtered)
    rows_dropped = initial_rows - final_rows
    reduction_percent = (rows_dropped / initial_rows) * 100

    print(f"\n--- Reduction Statistics ---")
    print(f"Total Rows Dropped:   {rows_dropped:,}")
    print(f"Final Filtered Rows:  {final_rows:,}")
    print(f"Dataset Size Reduced by: {reduction_percent:.2f}%")
    
    # 5. --- Export the Filtered Master Timeline ---
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    df_master_filtered.to_csv(
        OUTPUT_FILEPATH, 
        index=False, 
        encoding='utf-8', 
        date_format='%Y-%m-%d %H:%M:%S'
    )
    print(f"\n✅ Filtered Master Timeline saved to: {OUTPUT_FILEPATH}")
    
    print("\nFirst 5 rows of the filtered dataset (confirming Case_ID is first):")
    display(df_master_filtered.head())
    
    # Make the filtered DataFrame available for the next step (Feature Engineering)
    df_master = df_master_filtered

else:
    print("Cannot proceed with filtering as the Master Timeline DataFrame is empty.")


--- Phase 3.1: Noise Reduction (Filtering UsnJrnl Events) ---
Initial Master Timeline rows: 3,190,725
LogFile records kept: 62,279
UsnJrnl records kept after filtering: 2,177,139

--- Reduction Statistics ---
Total Rows Dropped:   951,307
Final Filtered Rows:  2,239,418
Dataset Size Reduced by: 29.81%

✅ Filtered Master Timeline saved to: data/processed/phase 3 - feature engineered/MASTER_TIMELINE_FILTERED.csv

First 5 rows of the filtered dataset (confirming Case_ID is first):


Unnamed: 0,Case_ID,timestamp_primary,fullpath,filedirectoryname,creationtime,modifiedtime,mftmodifiedtime,accessedtime,lsn,event,...,targetvcn,clusterindex,missingfullpathflaglsn,source,usn,eventinfo,fileattribute,filereferencenumber,parentfilereferencenumber,missingfullpathflagusn
0,1,2019-12-07 17:03:44+00:00,\Windows\System32\CatRoot\{F750E6C3-38EE-11D1-...,{F750E6C3-38EE-11D1-85E5-00C04FC295EE},2019-12-07 17:03:44+00:00,2023-12-23 00:18:12+00:00,2023-12-23 00:18:12+00:00,2023-12-23 00:18:12+00:00,8727013000.0,Updating Modified Time,...,0x393,0.0,0.0,LogFile,,,,,,
1,1,2019-12-07 17:14:52+00:00,\ProgramData\Microsoft\Windows\ClipSVC,ClipSVC,2019-12-07 17:14:52+00:00,2023-12-23 00:12:24+00:00,2023-12-23 00:12:24+00:00,2023-12-23 00:12:24+00:00,8728740000.0,Updating Modified Time,...,0x172,6.0,0.0,LogFile,,,,,,
2,1,2019-12-07 17:14:52+00:00,\Windows\AppReadiness,AppReadiness,2019-12-07 17:14:52+00:00,2023-12-23 00:14:32+00:00,2023-12-23 00:14:32+00:00,2023-12-23 00:14:32+00:00,8728732000.0,Updating Modified Time,...,0x19B,4.0,0.0,LogFile,,,,,,
3,1,2019-12-07 22:55:42+00:00,\Windows\WinSxS\amd64_windows-shield-provider_...,SecurityHealthHost.exe,2019-12-07 22:55:42+00:00,2019-12-07 22:58:27+00:00,2022-12-16 16:11:29+00:00,2019-12-07 22:58:27+00:00,8726152000.0,Writing Content of Resident File,...,0x4ED2,2.0,0.0,LogFile,,,,,,
4,1,2019-12-07 22:59:41+00:00,\Program Files\WindowsApps\Microsoft.XboxIdent...,clrcompression.dll,2019-12-07 22:59:41+00:00,2019-12-07 22:59:41+00:00,2023-12-23 00:15:45+00:00,2023-12-23 00:15:45+00:00,8724616000.0,Updating MFTModified Time,...,0x1C72,2.0,0.0,LogFile,,,,,,


The filter kept all 62,279 LogFile rows and 2,177,139 of the 3,128,446 UsnJrnl rows:

| Metric | Value |
| :--- | :--- |
| **Initial Master Timeline Rows** | 3,190,725 |
| **Final Filtered Rows** | 2,239,418 |
| **Total Rows Dropped** | 951,307 |
| **Overall Dataset Reduction** | **29.81%** |
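The figures in the table can be cross-checked against the counts printed by the two cells above. This short sketch only redoes the arithmetic with the numbers from the notebook output; the variable names are illustrative.

```python
# Counts copied from the notebook output above.
initial_rows = 3_190_725
usnjrnl_total = 3_128_446
logfile_kept = 62_279
usnjrnl_kept = 2_177_139

final_rows = logfile_kept + usnjrnl_kept
rows_dropped = initial_rows - final_rows
reduction_percent = rows_dropped / initial_rows * 100

# Every dropped row is a UsnJrnl row, since all LogFile rows are kept.
usnjrnl_dropped = usnjrnl_total - usnjrnl_kept

print(f"Final rows:   {final_rows:,}")
print(f"Rows dropped: {rows_dropped:,}")
print(f"Reduction:    {reduction_percent:.2f}%")
```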

## 2. Feature Engineering - Calculating Time Delta Features
This script calculates the time differences (in seconds) between the four core file timestamps:
Creation (C), Modified (M), MFT Modified (MFTM), and Accessed (A).
These deltas are the primary features used for timestomping anomaly detection.
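As a worked example, the sketch below computes two of the deltas for a single row, using the timestamps of the SecurityHealthHost.exe row from the preview in Section 1.2. It is illustrative only and makes no claim about whether that file was timestomped.

```python
import pandas as pd

# Timestamps copied from the SecurityHealthHost.exe row in the preview output.
row = pd.DataFrame({
    "creationtime": ["2019-12-07 22:55:42"],
    "modifiedtime": ["2019-12-07 22:58:27"],
    "mftmodifiedtime": ["2022-12-16 16:11:29"],
}).apply(pd.to_datetime)

# Modified minus Creation: 2 minutes 45 seconds.
row["Delta_M_vs_C"] = (row["modifiedtime"] - row["creationtime"]).dt.total_seconds()

# MFT Modified minus Modified: roughly three years, expressed in seconds.
row["Delta_MFTM_vs_M"] = (row["mftmodifiedtime"] - row["modifiedtime"]).dt.total_seconds()

print(row[["Delta_M_vs_C", "Delta_MFTM_vs_M"]])
```

A small Modified-to-Creation gap next to a very large MFT-Modified-to-Modified gap is the kind of contrast these features are meant to expose to the model.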

In [25]:
# Phase 3.2: Feature Engineering - Calculating Time Delta Features
# This script calculates the time differences (in seconds) between the four core file timestamps:
# Creation (C), Modified (M), MFT Modified (MFTM), and Accessed (A).
# These deltas are the primary features used for timestomping anomaly detection.

import pandas as pd
from pathlib import Path

# --- Configuration ---
# Input is the filtered master timeline from the previous step (Phase 3.1)
INPUT_DIR = 'data/processed/phase 3 - feature engineered'
INPUT_FILENAME = 'MASTER_TIMELINE_FILTERED.csv'
INPUT_FILEPATH = Path(INPUT_DIR) / INPUT_FILENAME

# Output is the master timeline with all the new features added
OUTPUT_FILENAME = 'MASTER_TIMELINE_FEATURES.csv'
OUTPUT_FILEPATH = Path(INPUT_DIR) / OUTPUT_FILENAME

# Columns to parse as datetime objects
TIME_COLUMNS = ['creationtime', 'modifiedtime', 'mftmodifiedtime', 'accessedtime', 'timestamp_primary']

# --- Functions ---

def load_and_convert_data(filepath, time_cols):
    """Loads the CSV and converts specified columns to datetime objects."""
    print(f"Loading filtered data from: {filepath}")
    
    # Use low_memory=False to handle the large file and mixed types cleanly
    df = pd.read_csv(filepath, dtype={'Case_ID': str}, low_memory=False)

    # Convert all timestamp columns
    for col in time_cols:
        if col in df.columns:
            # Errors='coerce' will turn unparsable dates into NaT (Not a Time)
            df[col] = pd.to_datetime(df[col], errors='coerce')
    
    print(f"Data loaded successfully with {len(df):,} rows.")
    return df

def calculate_time_deltas(df):
    """
    Calculates the time differences (deltas) in seconds between key timestamps.
    The primary goal is to find inconsistencies that indicate timestomping.
    """
    print("Calculating time delta features...")

    # Delta 1: MFT Modified vs. Modified (Crucial for timestomping detection)
    df['Delta_MFTM_vs_M'] = (df['mftmodifiedtime'] - df['modifiedtime']).dt.total_seconds().fillna(0)

    # Delta 2: Modified vs. Creation (Time between the recorded creation and the last modification)
    df['Delta_M_vs_C'] = (df['modifiedtime'] - df['creationtime']).dt.total_seconds().fillna(0)

    # Delta 3: Creation vs. Accessed (Creation minus last access; normally zero or negative)
    df['Delta_C_vs_A'] = (df['creationtime'] - df['accessedtime']).dt.total_seconds().fillna(0)
    
    # Delta 4: Timestamp Primary vs. Modified (How much time passed between the event record and the file's Modified time)
    df['Delta_Event_vs_M'] = (df['timestamp_primary'] - df['modifiedtime']).dt.total_seconds().fillna(0)
    
    # Delta 5: Timestamp Primary vs. MFT Modified
    df['Delta_Event_vs_MFTM'] = (df['timestamp_primary'] - df['mftmodifiedtime']).dt.total_seconds().fillna(0)
    
    # Delta 6: Timestamp Primary vs. Creation
    df['Delta_Event_vs_C'] = (df['timestamp_primary'] - df['creationtime']).dt.total_seconds().fillna(0)


    print("Time delta features calculated successfully.")
    return df

# --- Main Execution ---

def run_feature_engineering():
    """Main function to execute the feature calculation process."""
    print("\n--- Phase 3.2: Calculating Time Delta Features ---")
    
    try:
        # 1. Load Data
        df = load_and_convert_data(INPUT_FILEPATH, TIME_COLUMNS)

        # 2. Calculate Features
        df_features = calculate_time_deltas(df.copy()) # Work on a copy so the loaded DataFrame stays unmodified
        
        # 3. Save Output
        Path(INPUT_DIR).mkdir(parents=True, exist_ok=True)
        df_features.to_csv(OUTPUT_FILEPATH, index=False)
        print(f"\n✅ Feature-rich Master Timeline saved to: {OUTPUT_FILEPATH}")
        print(f"Final columns (including new deltas): {list(df_features.columns)[-6:]}")

    except FileNotFoundError:
        print(f"\n❌ ERROR: Input file not found. Ensure '{INPUT_FILEPATH}' exists.")
    except Exception as e:
        print(f"\n❌ An unexpected error occurred: {e}")


if __name__ == '__main__':
    run_feature_engineering()



--- Phase 3.2: Calculating Time Delta Features ---
Loading filtered data from: data/processed/phase 3 - feature engineered/MASTER_TIMELINE_FILTERED.csv
Data loaded successfully with 2,239,418 rows.
Calculating time delta features...
Time delta features calculated successfully.

✅ Feature-rich Master Timeline saved to: data/processed/phase 3 - feature engineered/MASTER_TIMELINE_FEATURES.csv
Final columns (including new deltas): ['Delta_MFTM_vs_M', 'Delta_M_vs_C', 'Delta_C_vs_A', 'Delta_Event_vs_M', 'Delta_Event_vs_MFTM', 'Delta_Event_vs_C']
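One caveat about the deltas above: `fillna(0)` gives a row with a missing timestamp the same value as a row whose two timestamps are identical. A possible refinement, sketched here on toy data, is to record an indicator column before filling. The column name `Delta_MFTM_vs_M_missing` is hypothetical and not part of the notebook's output.

```python
import pandas as pd

# Toy data: the second row has no modifiedtime.
df = pd.DataFrame({
    "modifiedtime": pd.to_datetime(
        ["2023-01-01 10:00:00", None, "2023-01-01 12:00:00"]
    ),
    "mftmodifiedtime": pd.to_datetime(
        ["2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-01 12:00:05"]
    ),
})

# The raw delta is NaN wherever either timestamp is missing.
delta = (df["mftmodifiedtime"] - df["modifiedtime"]).dt.total_seconds()

# Hypothetical indicator column: True where the delta could not be computed.
df["Delta_MFTM_vs_M_missing"] = delta.isna()

# Same fill as the notebook, but the indicator preserves the distinction.
df["Delta_MFTM_vs_M"] = delta.fillna(0)

print(df[["Delta_MFTM_vs_M", "Delta_MFTM_vs_M_missing"]])
```

With the indicator in place, a model can tell a genuine zero-second difference apart from a row where the timestamps were simply unavailable.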
