# Phase 3 - Feature Engineering 
In this phase we make the master timeline table easier to feed to the model. The merged master timeline currently has millions of rows, so we look more closely at the cell contents of each column with the aim of reducing the row count significantly.

## 1. Dropping UsnJrnl-related rows
After the cleaning and merging done in the previous phases, the bulk of the master timeline consists of UsnJrnl rows (about 98% of all rows in the run shown below). We therefore start by inspecting the values in the UsnJrnl columns and try to drop rows that are irrelevant, or only distantly related, to timestomping.
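
Before filtering, it can help to confirm how the rows are distributed across sources. The following is a minimal sketch, assuming the timeline has a `source` column as used in the cells below; the helper name `source_share` and the toy data are hypothetical and only illustrate the check.

```python
import pandas as pd

def source_share(df: pd.DataFrame, column: str = "source") -> pd.DataFrame:
    """Return row counts and percentage share per value of `column`."""
    counts = df[column].value_counts(dropna=False)
    share = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"rows": counts, "share_pct": share})

# Toy data standing in for the real master timeline (hypothetical values)
demo = pd.DataFrame({"source": ["UsnJrnl", "UsnJrnl", "UsnJrnl", "LogFile"]})
summary = source_share(demo)
print(summary)
```

On the real data the same call would be `source_share(df_master)` once the master timeline has been loaded.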

### 1.1 Addressing eventinfo 

In [None]:
import pandas as pd
import os
from pathlib import Path
from IPython.display import display

# --- Configuration ---
# File to load (The output from Phase 2.1)
INPUT_FILEPATH = Path('data/processed/phase 2.1 - data merged (all sub-folders)/MASTER_TIMELINE_ALL_CASES.csv')
# Output file for the filtered data
OUTPUT_DIR = 'data/processed/phase 3 - feature engineered'
OUTPUT_FILENAME = 'MASTER_TIMELINE_FILTERED.csv'
OUTPUT_FILEPATH = Path(OUTPUT_DIR) / OUTPUT_FILENAME

print("--- UsnJrnl Event Analysis: Unique Eventinfo Instances ---")

try:
    # Load the comprehensive master timeline (using low_memory=False for large files)
    df_master = pd.read_csv(INPUT_FILEPATH, low_memory=False)
    print(f"Master Timeline loaded successfully. Total Rows: {len(df_master):,}")
    
except FileNotFoundError:
    print(f"FATAL ERROR: Master Timeline not found at {INPUT_FILEPATH}")
    df_master = pd.DataFrame() 

if not df_master.empty:
    
    # 1. Filter down to only UsnJrnl records
    df_usn = df_master[df_master['source'] == 'UsnJrnl'].copy()
    print(f"UsnJrnl Records Isolated: {len(df_usn):,} rows.")
    
    # 2. Extract unique event values from the 'eventinfo' column
    # We drop any potential NaN/NA values as they are not actual events
    unique_events = df_usn['eventinfo'].dropna().unique()
    
    # 3. Sort the list alphabetically for easy review
    unique_events_sorted = sorted(unique_events)
    
    # 4. Print the results
    print("\nTotal Unique UsnJrnl Events Found:")
    print("-" * 40)
    for event in unique_events_sorted:
        print(f"- {event}")
    print("-" * 40)
    print(f"Count of Unique Events: {len(unique_events_sorted)}")

else:
    print("Cannot perform analysis as the Master Timeline DataFrame is empty.")

--- UsnJrnl Event Analysis: Unique Eventinfo Instances ---
Master Timeline loaded successfully. Total Rows: 3,190,725
UsnJrnl Records Isolated: 3,128,446 rows.

Total Unique UsnJrnl Events Found:
----------------------------------------
- Access_Right_Changed
- Access_Right_Changed / File_Closed
- Access_Right_Changed / Transacted_Changed / File_Closed
- Basic_Info_Changed
- Basic_Info_Changed / Access_Right_Changed
- Basic_Info_Changed / Compression_Changed / Data_Overwritten / Content_Indexed_Attr_Changed
- Basic_Info_Changed / Compression_Changed / Data_Overwritten / Content_Indexed_Attr_Changed / File_Closed
- Basic_Info_Changed / Content_Indexed_Attr_Changed
- Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Closed
- Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Closed / File_Deleted
- Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Renamed_New
- Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Renamed_New / File_Closed / File_Deleted
- Basic_In
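
The listing above shows that most `eventinfo` values are combinations of atomic flags joined by ` / `. To decide which flags are worth filtering on, it may be easier to count the atomic flags rather than the combined strings. This is a sketch under that assumption about the separator; `count_atomic_flags` is a hypothetical helper, and the sample values are taken from the output above.

```python
import pandas as pd

def count_atomic_flags(eventinfo: pd.Series) -> pd.Series:
    """Split ' / '-joined event strings and count each atomic flag."""
    flags = (
        eventinfo.dropna()
        .astype(str)
        .str.split("/")   # one list of flags per row
        .explode()        # one flag per row
        .str.strip()      # remove the spaces around the separator
    )
    return flags[flags != ""].value_counts()

# Sample values copied from the unique-event listing
sample = pd.Series([
    "Access_Right_Changed",
    "Access_Right_Changed / File_Closed",
    "Basic_Info_Changed / Content_Indexed_Attr_Changed / File_Closed",
    None,
])
flag_counts = count_atomic_flags(sample)
print(flag_counts)
```

On the real data this would be called as `count_atomic_flags(df_usn['eventinfo'])`.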

In [5]:
print("--- Phase 3.1: Noise Reduction (Filtering UsnJrnl Events) ---")

try:
    # Load the comprehensive master timeline
    df_master = pd.read_csv(INPUT_FILEPATH, low_memory=False)
    
    # Ensure datetime columns are correctly typed after CSV load
    TIME_COLS = ['timestamp_primary', 'creationtime', 'modifiedtime', 'mftmodifiedtime', 'accessedtime']
    for col in TIME_COLS:
        if col in df_master.columns:
            df_master[col] = pd.to_datetime(df_master[col], errors='coerce', utc=True) # Added utc=True for standard time handling

except FileNotFoundError:
    print(f"FATAL ERROR: Master Timeline not found at {INPUT_FILEPATH}")
    # Initialize an empty DataFrame to prevent script crash
    df_master = pd.DataFrame() 

if not df_master.empty:
    initial_rows = len(df_master)
    print(f"Initial Master Timeline rows: {initial_rows:,}")

    # --- Define Low-Value UsnJrnl Events to EXCLUDE (Noise) ---
    # These events fire frequently but rarely indicate adversarial action or useful metadata changes.
    # We focus on excluding low-value events *and* any multi-event record that includes READ_DATA.
    
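    # 1. Primary Low-Value events (as a single event)
    # NOTE: isin() is case-sensitive, and eventinfo stores labels such as
    # 'Access_Right_Changed', so these upper-case labels match no rows in this run.
    LOW_VALUE_SINGLE_EVENTS = [
        'FILE_CLOSED', 
        'SECURITY_CHANGE', 
        'ACCESS_RIGHT_CHANGED'
    ]
    
    # 2. Key noisy markers that make any combination low-value
    # NOTE: neither marker occurs in any eventinfo value in this run.
    NOISE_MARKERS = ['READ_DATA', 'SECURITY_CHANGE']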


    # --- Filtering Logic ---
    
    # 1. Select all LogFile records (we keep them all)
    df_log_records = df_master[df_master['source'] == 'LogFile'].copy()
    
    # 2. Select UsnJrnl records that are considered high-value or contextual
    # a. Filter for UsnJrnl records
    df_usnjrnl = df_master[df_master['source'] == 'UsnJrnl'].copy()

    # b. Define the conditions for KEEPING a UsnJrnl record (High Value)
    # We KEEP a record if:
    #   i. It is NOT a single low-value event (like just FILE_CLOSED).
    #  ii. It does NOT contain a high-volume noise marker (like READ_DATA).
    
    # Condition i: Exclude single low-value events
    is_single_low_value = df_usnjrnl['eventinfo'].isin(LOW_VALUE_SINGLE_EVENTS)
    
    # Condition ii: Exclude events that contain noise markers (like READ_DATA)
    # Start from an all-False mask and OR in the matches for each noise marker
    is_noisy_combination = pd.Series(False, index=df_usnjrnl.index)
    # Replace NaN with '' once so the string matching below is safe
    eventinfo_text = df_usnjrnl['eventinfo'].fillna('')
    for marker in NOISE_MARKERS:
        # regex=False treats the marker as a literal, case-insensitive substring
        is_noisy_combination |= eventinfo_text.str.contains(marker, case=False, regex=False)

    # Combine conditions: Keep only records that are NOT (single low-value OR noisy combination)
    df_usnjrnl_filtered = df_usnjrnl[
        (~is_single_low_value) & 
        (~is_noisy_combination)
    ].copy()
    
    # 3. Concatenate the high-value subsets back into the new filtered master
    df_master_filtered = pd.concat([df_log_records, df_usnjrnl_filtered], ignore_index=True)
    
    final_rows = len(df_master_filtered)
    rows_dropped = initial_rows - final_rows
    reduction_percent = (rows_dropped / initial_rows) * 100

    print(f"\n--- Reduction Statistics ---")
    print(f"Total Rows Dropped:   {rows_dropped:,}")
    print(f"Final Filtered Rows:  {final_rows:,}")
    print(f"Dataset Size Reduced by: {reduction_percent:.2f}%")

    # --- Export the Filtered Master Timeline ---
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    df_master_filtered.to_csv(
        OUTPUT_FILEPATH, 
        index=False, 
        encoding='utf-8', 
        date_format='%Y-%m-%d %H:%M:%S'
    )
    print(f"\n✅ Filtered Master Timeline saved to: {OUTPUT_FILEPATH}")
    
    print("\nFirst 5 rows of the filtered dataset:")
    display(df_master_filtered.head())
    
    # Make the filtered DataFrame available for the next step (Feature Engineering)
    df_master = df_master_filtered

else:
    print("Cannot proceed with filtering as the Master Timeline DataFrame is empty.")


--- Phase 3.1: Noise Reduction (Filtering UsnJrnl Events) ---
Initial Master Timeline rows: 3,190,725

--- Reduction Statistics ---
Total Rows Dropped:   0
Final Filtered Rows:  3,190,725
Dataset Size Reduced by: 0.00%

✅ Filtered Master Timeline saved to: data/processed/phase 3 - feature engineered/MASTER_TIMELINE_FILTERED.csv

First 5 rows of the filtered dataset:


Unnamed: 0,Case_ID,timestamp_primary,fullpath,filedirectoryname,creationtime,modifiedtime,mftmodifiedtime,accessedtime,lsn,event,...,targetvcn,clusterindex,missingfullpathflaglsn,source,usn,eventinfo,fileattribute,filereferencenumber,parentfilereferencenumber,missingfullpathflagusn
0,1,2019-12-07 17:03:44+00:00,\Windows\System32\CatRoot\{F750E6C3-38EE-11D1-...,{F750E6C3-38EE-11D1-85E5-00C04FC295EE},2019-12-07 17:03:44+00:00,2023-12-23 00:18:12+00:00,2023-12-23 00:18:12+00:00,2023-12-23 00:18:12+00:00,8727013000.0,Updating Modified Time,...,0x393,0.0,0.0,LogFile,,,,,,
1,1,2019-12-07 17:14:52+00:00,\ProgramData\Microsoft\Windows\ClipSVC,ClipSVC,2019-12-07 17:14:52+00:00,2023-12-23 00:12:24+00:00,2023-12-23 00:12:24+00:00,2023-12-23 00:12:24+00:00,8728740000.0,Updating Modified Time,...,0x172,6.0,0.0,LogFile,,,,,,
2,1,2019-12-07 17:14:52+00:00,\Windows\AppReadiness,AppReadiness,2019-12-07 17:14:52+00:00,2023-12-23 00:14:32+00:00,2023-12-23 00:14:32+00:00,2023-12-23 00:14:32+00:00,8728732000.0,Updating Modified Time,...,0x19B,4.0,0.0,LogFile,,,,,,
3,1,2019-12-07 22:55:42+00:00,\Windows\WinSxS\amd64_windows-shield-provider_...,SecurityHealthHost.exe,2019-12-07 22:55:42+00:00,2019-12-07 22:58:27+00:00,2022-12-16 16:11:29+00:00,2019-12-07 22:58:27+00:00,8726152000.0,Writing Content of Resident File,...,0x4ED2,2.0,0.0,LogFile,,,,,,
4,1,2019-12-07 22:59:41+00:00,\Program Files\WindowsApps\Microsoft.XboxIdent...,clrcompression.dll,2019-12-07 22:59:41+00:00,2019-12-07 22:59:41+00:00,2023-12-23 00:15:45+00:00,2023-12-23 00:15:45+00:00,8724616000.0,Updating MFTModified Time,...,0x1C72,2.0,0.0,LogFile,,,,,,


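Note: the code above is wrong and needs to be adjusted. It dropped 0 rows because the labels it filters on do not match the values stored in `eventinfo`. `LOW_VALUE_SINGLE_EVENTS` uses upper-case labels such as `ACCESS_RIGHT_CHANGED`, while the column stores `Access_Right_Changed`, and `isin` is case-sensitive. The `NOISE_MARKERS` strings (`READ_DATA`, `SECURITY_CHANGE`) did not occur in any `eventinfo` value either.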
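
One possible adjustment is sketched below. It is not the original filter: instead of `isin` plus substring matching, it splits each `eventinfo` value into atomic flags, normalizes case, and drops a UsnJrnl row only when every flag on it is in a low-value set. The set `LOW_VALUE_FLAGS` is an assumption based on the labels listed in the cell above (in the spelling the data actually uses) and should be reviewed before use; `split_flags` and `filter_usn_noise` are hypothetical helper names. Rows from every other source are kept unchanged, and any record carrying `Basic_Info_Changed` survives, which matters because the USN journal reports timestamp changes under the basic-info reason.

```python
import pandas as pd

# ASSUMPTION: flags treated as noise on their own; adjust after reviewing the data
LOW_VALUE_FLAGS = frozenset({"file_closed", "access_right_changed"})

def split_flags(eventinfo) -> frozenset:
    """Turn 'A / B / C' into a lower-cased set of atomic flags."""
    if pd.isna(eventinfo):
        return frozenset()
    parts = (part.strip().lower() for part in str(eventinfo).split("/"))
    return frozenset(part for part in parts if part)

def filter_usn_noise(df: pd.DataFrame, low_value=LOW_VALUE_FLAGS) -> pd.DataFrame:
    """Drop UsnJrnl rows whose flags are all low-value; keep everything else."""
    is_usn = df["source"] == "UsnJrnl"
    flags = df.loc[is_usn, "eventinfo"].map(split_flags)
    # A row is noise only if it has at least one flag and all flags are low-value
    is_noise = flags.map(lambda f: bool(f) and f <= low_value).astype(bool)
    return df.drop(index=is_noise[is_noise].index).reset_index(drop=True)

# Toy data (hypothetical rows using labels from the listing above)
demo = pd.DataFrame({
    "source": ["LogFile", "UsnJrnl", "UsnJrnl", "UsnJrnl", "UsnJrnl"],
    "eventinfo": [
        None,
        "Access_Right_Changed",
        "Access_Right_Changed / File_Closed",
        "Basic_Info_Changed / File_Closed",
        None,
    ],
})
filtered = filter_usn_noise(demo)
print(f"Rows before: {len(demo)}, after: {len(filtered)}")
```

Matching on whole flags rather than substrings avoids accidental hits when one flag name is contained in another, and normalizing case removes the mismatch that caused the 0-row reduction above.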