# Phase 4 - Feature Preprocessing

## 🎯 Objective
Prepare the engineered features for machine learning model training by:
1. **Handling outliers** in time delta features
2. **Encoding categorical variables** (event_type, file_extension)
3. **Scaling numerical features** for model optimization
4. **Splitting data** into Train/Validation/Test sets (stratified by case_id)

## 📊 Input Dataset
- **File:** `data/processed/Phase 3 - Feature Engineering/Master_Timeline_Features.csv`
- **Records:** 2,264,521
- **Features:** 44 (25 original + 19 engineered)
- **Labeled rows:** 268 (252 timestomped + 16 suspicious)

## 🔑 Key Preprocessing Steps
1. **Outlier Clipping:** Time deltas range ±24 years → clip to ±10 years
2. **Categorical Encoding:** Group rare file extensions, encode event types
3. **Feature Scaling:** StandardScaler for numerical features only
4. **Stratified Split:** 70/15/15 by case_id (prevents data leakage)

## ⚠️ Important Notes
- **Class imbalance is NOT a concern:** Isolation Forest is unsupervised (labels only for evaluation)
- **Stratify by case_id:** Ensures model tested on completely unseen forensic cases
- **Preserve all 268 labeled rows** across splits for proper evaluation

In [8]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GroupShuffleSplit
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datetime import datetime

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Plotting settings
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")


✅ Libraries imported successfully
Pandas version: 2.3.2
NumPy version: 2.3.3


## 1. Load Feature-Engineered Dataset

Loading the output from Phase 3 with all 44 features.


In [10]:
# Define file path
input_file = 'data/processed/Phase 3 - Feature Engineering/Master_Timeline_Features.csv'

# Load dataset with proper dtype specifications to avoid warnings
print(f"Loading dataset from: {input_file}")

# Define datetime columns
datetime_cols = [
    'eventtime(utc+8)', 
    'timestamp(utc+8)', 
    'timestamp_primary', 
    'creationtime', 
    'modifiedtime', 
    'accessedtime', 
    'mftmodifiedtime'
]

# Load with parse_dates for datetime columns
df = pd.read_csv(
    input_file, 
    parse_dates=datetime_cols,
    low_memory=False  # Prevents dtype warning for large files
)

# Display basic info
print(f"\n✅ Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display first few rows
df.head()

Loading dataset from: data/processed/Phase 3 - Feature Engineering/Master_Timeline_Features.csv

✅ Dataset loaded successfully!
Shape: (2264521, 44)
Memory usage: 2313.94 MB


Unnamed: 0,case_id,timestamp_primary,source,fullpath,file/directory name,is_timestomped,is_suspicious_execution,lsn,eventtime(utc+8),event,detail,creationtime,modifiedtime,mftmodifiedtime,accessedtime,redo,target vcn,cluster index,has_incomplete_timestamps,timestamp(utc+8),usn,eventinfo,fileattribute,filereferencenumber,parentfilereferencenumber,Delta_MFTM_vs_M,Delta_M_vs_C,Delta_C_vs_A,Delta_Event_vs_M,Delta_Event_vs_MFTM,Delta_Event_vs_C,hour,day_of_week,day_of_month,month,year,is_weekend,is_night,is_business_hours,file_extension,path_depth,is_system_file,is_logfile,event_type
0,1,2000-01-01 08:00:00,LogFile,\Program Files (x86)\Dropbox\Client\189.4.8395...,style.js,0,0,8715606672.0,2000-01-01 08:00:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:14:24 -> 2023-...,2023-12-23 00:14:24,2000-01-01 08:00:00,2023-12-23 00:14:52,2023-12-23 00:14:24,Update Resident Value,0x174F8,4.0,0.0,NaT,,,,,,756576892.0,-756576864.0,0.0,0.0,-756576892.0,-756576864.0,8.0,5.0,1.0,1.0,2000.0,1,0,0,js,8,0,1,Updating MFTModified Time
1,1,2000-01-01 08:00:00,LogFile,\Program Files (x86)\Dropbox\Client\189.4.8395...,CalendarUtils.js,0,0,8724384553.0,2000-01-01 08:00:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:14:24 -> 2023-...,2023-12-23 00:14:24,2000-01-01 08:00:00,2023-12-23 00:15:26,2023-12-23 00:14:24,Update Resident Value,0x174F3,6.0,0.0,NaT,,,,,,756576926.0,-756576864.0,0.0,0.0,-756576926.0,-756576864.0,8.0,5.0,1.0,1.0,2000.0,1,0,0,js,8,0,1,Updating MFTModified Time
2,1,2000-01-01 08:00:00,LogFile,\Program Files (x86)\Dropbox\Client\189.4.8395...,StackView.js,0,0,8724811401.0,2000-01-01 08:00:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:14:24 -> 2023-...,2023-12-23 00:14:24,2000-01-01 08:00:00,2023-12-23 00:16:08,2023-12-23 00:14:24,Update Resident Value,0x174F8,0.0,0.0,NaT,,,,,,756576968.0,-756576864.0,0.0,0.0,-756576968.0,-756576864.0,8.0,5.0,1.0,1.0,2000.0,1,0,0,js,8,0,1,Updating MFTModified Time
3,1,2010-10-11 14:08:00,LogFile,\Users\blueangel\AppData\Local\Temp\RarSFX1\Wi...,WinHex.exe,0,0,8724891080.0,2010-10-11 14:08:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:16:12 -> 2023-...,2023-12-23 00:16:12,2010-10-11 14:08:00,2023-12-23 00:16:13,2023-12-23 00:16:13,Update Resident Value,0x7E20,4.0,0.0,NaT,,,,,,416484493.0,-416484492.0,-1.0,0.0,-416484493.0,-416484492.0,14.0,0.0,11.0,10.0,2010.0,0,0,1,exe,7,0,1,Updating MFTModified Time
4,1,2010-10-11 14:08:00,LogFile,\Users\blueangel\AppData\Local\Temp\RarSFX1\se...,setup.exe,0,0,8725322068.0,2010-10-11 14:08:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:16:12 -> 2023-...,2023-12-23 00:16:12,2010-10-11 14:08:00,2023-12-23 00:16:17,2023-12-23 00:16:12,Update Resident Value,0x7E1F,2.0,0.0,NaT,,,,,,416484497.0,-416484492.0,0.0,0.0,-416484497.0,-416484492.0,14.0,0.0,11.0,10.0,2010.0,0,0,1,exe,7,0,1,Updating MFTModified Time


In [4]:
# Check data types and missing values
print("=" * 80)
print("FEATURE OVERVIEW")
print("=" * 80)

info_df = pd.DataFrame({
    'Feature': df.columns,
    'Data Type': df.dtypes.values,
    'Non-Null Count': df.count().values,
    'Null Count': df.isnull().sum().values,
    'Null %': (df.isnull().sum() / len(df) * 100).values
})

print(info_df.to_string(index=False))

# Check labeled rows preservation
print("\n" + "=" * 80)
print("LABELED ROWS CHECK")
print("=" * 80)
timestomped_count = df['is_timestomped'].sum()
suspicious_count = df['is_suspicious_execution'].sum()
total_labeled = (df['is_timestomped'] == 1) | (df['is_suspicious_execution'] == 1)

print(f"Timestomped rows: {timestomped_count}")
print(f"Suspicious execution rows: {suspicious_count}")
print(f"Total labeled rows: {total_labeled.sum()}")
print(f"✅ All 268 labeled rows preserved!" if total_labeled.sum() == 268 else "⚠️ Label count mismatch!")

FEATURE OVERVIEW
                  Feature Data Type  Non-Null Count  Null Count  Null %
                  case_id     int64         2264521           0    0.00
        timestamp_primary    object         2264513           8    0.00
                   source    object         2264521           0    0.00
                 fullpath    object         2037872      226649   10.01
      file/directory name    object         2262591        1930    0.09
           is_timestomped     int64         2264521           0    0.00
  is_suspicious_execution     int64         2264521           0    0.00
                      lsn   float64           83458     2181063   96.31
         eventtime(utc+8)    object           83450     2181071   96.31
                    event    object           83458     2181063   96.31
                   detail    object           31162     2233359   98.62
             creationtime    object           57136     2207385   97.48
             modifiedtime    object           6

## 2. Handle Outliers in Time Delta Features

**Problem Identified in Phase 3:**
- Time delta features range from **±756M seconds (±24 years)**
- Extreme outliers will distort scaling and model training

**Solution:**
- **Clip all delta features to ±10 years (±315,360,000 seconds)**
- Preserves realistic timestamp manipulation patterns
- Removes filesystem artifacts and extreme anomalies

**Delta Features to Clip:**
1. `Delta_MFTM_vs_M` (MFT Modified vs Modified)
2. `Delta_M_vs_C` (Modified vs Creation)
3. `Delta_C_vs_A` (Creation vs Accessed)
4. `Delta_A_vs_MFTM` (Accessed vs MFT Modified)
5. `Delta_Event_vs_C` (Event vs Creation)
6. `Delta_Event_vs_M` (Event vs Modified)

In [11]:
# Define time delta features
delta_features = [
    'Delta_MFTM_vs_M', 
    'Delta_M_vs_C', 
    'Delta_C_vs_A', 
    'Delta_A_vs_MFTM', 
    'Delta_Event_vs_C', 
    'Delta_Event_vs_M'
]

# Define clipping bounds (±10 years in seconds)
clip_min = -315_360_000  # -10 years
clip_max = 315_360_000   # +10 years

print("=" * 80)
print("OUTLIER CLIPPING - TIME DELTA FEATURES")
print("=" * 80)
print(f"Clipping bounds: [{clip_min:,}, {clip_max:,}] seconds (±10 years)\n")

# Clip each delta feature and track changes
for feature in delta_features:
    if feature in df.columns:
        # Track before clipping
        before_min = df[feature].min()
        before_max = df[feature].max()
        outliers_below = (df[feature] < clip_min).sum()
        outliers_above = (df[feature] > clip_max).sum()
        
        # Apply clipping
        df[feature] = df[feature].clip(lower=clip_min, upper=clip_max)
        
        # Track after clipping
        after_min = df[feature].min()
        after_max = df[feature].max()
        
        # Report
        print(f"{feature}:")
        print(f"  Before: [{before_min:,.0f}, {before_max:,.0f}]")
        print(f"  Outliers clipped: {outliers_below + outliers_above:,} ({(outliers_below + outliers_above)/len(df)*100:.2f}%)")
        print(f"    - Below min: {outliers_below:,}")
        print(f"    - Above max: {outliers_above:,}")
        print(f"  After: [{after_min:,.0f}, {after_max:,.0f}]")
        print()

print("✅ Outlier clipping completed!")

OUTLIER CLIPPING - TIME DELTA FEATURES
Clipping bounds: [-315,360,000, 315,360,000] seconds (±10 years)

Delta_MFTM_vs_M:
  Before: [-32,778,341, 756,576,968]
  Outliers clipped: 1,730 (0.08%)
    - Below min: 0
    - Above max: 1,730
  After: [-32,778,341, 315,360,000]

Delta_M_vs_C:
  Before: [-756,576,864, 128,328,894]
  Outliers clipped: 802 (0.04%)
    - Below min: 802
    - Above max: 0
  After: [-315,360,000, 128,328,894]

Delta_C_vs_A:
  Before: [-756,576,852, 132]
  Outliers clipped: 928 (0.04%)
    - Below min: 928
    - Above max: 0
  After: [-315,360,000, 132]

Delta_Event_vs_C:
  Before: [-756,576,864, 756,576,876]
  Outliers clipped: 945 (0.04%)
    - Below min: 17
    - Above max: 928
  After: [-315,360,000, 315,360,000]

Delta_Event_vs_M:
  Before: [-32,596,135, 756,576,876]
  Outliers clipped: 1,713 (0.08%)
    - Below min: 0
    - Above max: 1,713
  After: [-32,596,135, 315,360,000]

✅ Outlier clipping completed!


## 3. Encode Categorical Variables

**Categorical features to encode:**

### 3.1 File Extension Encoding
- **Strategy:** Group rare extensions (< 0.1% frequency) into 'other'
- **Encoding:** Label Encoding (ordinal mapping)
- **Rationale:** Reduces dimensionality while preserving common patterns

### 3.2 Event Type Encoding
- **Strategy:** Label Encoding for event_type column
- **Rationale:** Event types have natural ordinal relationship in forensic analysis

### 3.3 Features Already Encoded
- `source_encoded`: Already binary (LogFile=0, UsnJrnl=1) ✅
- `event_type_encoded`: Already numerically encoded ✅

In [12]:
# Analyze file extension distribution
print("=" * 80)
print("FILE EXTENSION DISTRIBUTION")
print("=" * 80)

# Get value counts
ext_counts = df['file_extension'].value_counts()
ext_freq = df['file_extension'].value_counts(normalize=True) * 100

# Display top extensions
print("\nTop 20 file extensions:")
print(pd.DataFrame({
    'Extension': ext_counts.head(20).index,
    'Count': ext_counts.head(20).values,
    'Frequency %': ext_freq.head(20).values
}).to_string(index=False))

# Group rare extensions (< 0.1% frequency)
threshold = 0.1
rare_extensions = ext_freq[ext_freq < threshold].index.tolist()

print(f"\n📊 Extensions below {threshold}% threshold: {len(rare_extensions)}")
print(f"📊 Records with rare extensions: {df['file_extension'].isin(rare_extensions).sum():,}")

# Create grouped extension column
df['file_extension_grouped'] = df['file_extension'].apply(
    lambda x: 'other' if x in rare_extensions else x
)

# Verify grouping
grouped_counts = df['file_extension_grouped'].value_counts()
print(f"\n✅ Grouped extensions: {len(grouped_counts)} unique values")
print(f"   - 'other' category: {(df['file_extension_grouped'] == 'other').sum():,} records")


FILE EXTENSION DISTRIBUTION

Top 20 file extensions:
 Extension   Count  Frequency %
   UNKNOWN 1301567        57.48
       log  149414         6.60
       tmp  104196         4.60
    NO_EXT   86617         3.82
       dat   71751         3.17
db-journal   69432         3.07
       bin   54028         2.39
      json   53215         2.35
        pf   42947         1.90
       etl   29267         1.29
  manifest   27273         1.20
       dll   26973         1.19
      evtx   16769         0.74
       aux   14602         0.64
       mui   11317         0.50
       cat   10484         0.46
       mum   10292         0.45
      log1    8033         0.35
       txt    7183         0.32
       chk    7113         0.31

📊 Extensions below 0.1% threshold: 264
📊 Records with rare extensions: 46,082

✅ Grouped extensions: 51 unique values
   - 'other' category: 46,082 records


In [13]:
# Initialize label encoders
le_extension = LabelEncoder()
le_event_type = LabelEncoder()

print("=" * 80)
print("LABEL ENCODING - CATEGORICAL FEATURES")
print("=" * 80)

# Encode file_extension_grouped
df['file_extension_encoded'] = le_extension.fit_transform(df['file_extension_grouped'].fillna('missing'))
print(f"\n✅ file_extension_grouped encoded:")
print(f"   - Unique values: {df['file_extension_encoded'].nunique()}")
print(f"   - Range: [{df['file_extension_encoded'].min()}, {df['file_extension_encoded'].max()}]")

# Encode event_type (if not already encoded)
if 'event_type' in df.columns and df['event_type'].dtype == 'object':
    df['event_type_encoded_v2'] = le_event_type.fit_transform(df['event_type'].fillna('missing'))
    print(f"\n✅ event_type encoded:")
    print(f"   - Unique values: {df['event_type_encoded_v2'].nunique()}")
    print(f"   - Range: [{df['event_type_encoded_v2'].min()}, {df['event_type_encoded_v2'].max()}]")
else:
    print(f"\n✅ event_type already encoded (using existing event_type_encoded)")

# Handle missing values in numerical features (fill delta features with 0)
for col in delta_features:
    if col in df.columns:
        df[col] = df[col].fillna(0)
        
print("\n✅ Missing values in delta features filled with 0")


LABEL ENCODING - CATEGORICAL FEATURES

✅ file_extension_grouped encoded:
   - Unique values: 51
   - Range: [0, 50]

✅ event_type encoded:
   - Unique values: 211
   - Range: [0, 210]

✅ Missing values in delta features filled with 0


## 4. Select Features for Machine Learning

**Features to INCLUDE in model training:**
- ✅ Time delta features (6): Delta_MFTM_vs_M, Delta_M_vs_C, etc.
- ✅ Temporal features (8): hour, day_of_week, is_weekend, etc.
- ✅ File path features (3): path_depth, is_system_file, file_extension_encoded
- ✅ Source/Event encoding (2): source_encoded, event_type_encoded

**Features to EXCLUDE from training:**
- ❌ Labels: is_timestomped, is_suspicious_execution (used for evaluation only)
- ❌ Identifiers: case_id, fullpath, file_extension, event_type (categorical strings)
- ❌ Timestamps: All datetime columns (already extracted to temporal features)
- ❌ Intermediate columns: file_extension_grouped (already encoded)

**Total training features:** ~27 numerical features


In [14]:
# Define features to exclude
exclude_features = [
    # Labels (for evaluation only)
    'is_timestomped', 
    'is_suspicious_execution',
    
    # Identifiers
    'case_id',
    'fullpath',
    'file/directory name',
    
    # Timestamps (already extracted to temporal features)
    'eventtime(utc+8)',
    'timestamp(utc+8)', 
    'timestamp_primary',
    'creationtime',
    'modifiedtime',
    'accessedtime',
    'mftmodifiedtime',
    
    # Categorical strings (already encoded)
    'file_extension',
    'file_extension_grouped',
    'event_type',
    'event',
    'detail',
    'source',
    
    # Redundant/irrelevant
    'lsn',
    'usn'
]

# Get feature columns for training
feature_cols = [col for col in df.columns if col not in exclude_features]

print("=" * 80)
print("FEATURE SELECTION FOR ML TRAINING")
print("=" * 80)
print(f"\nTotal columns in dataset: {len(df.columns)}")
print(f"Excluded columns: {len(exclude_features)}")
print(f"Selected training features: {len(feature_cols)}")

print("\n📋 Training Features:")
for i, feat in enumerate(feature_cols, 1):
    print(f"  {i}. {feat}")

# Verify no missing values in training features
print(f"\n🔍 Missing values check:")
missing_counts = df[feature_cols].isnull().sum()
if missing_counts.sum() == 0:
    print("✅ No missing values in training features!")
else:
    print("⚠️ Missing values found:")
    print(missing_counts[missing_counts > 0])

FEATURE SELECTION FOR ML TRAINING

Total columns in dataset: 47
Excluded columns: 20
Selected training features: 27

📋 Training Features:
  1. redo
  2. target vcn
  3. cluster index
  4. has_incomplete_timestamps
  5. eventinfo
  6. fileattribute
  7. filereferencenumber
  8. parentfilereferencenumber
  9. Delta_MFTM_vs_M
  10. Delta_M_vs_C
  11. Delta_C_vs_A
  12. Delta_Event_vs_M
  13. Delta_Event_vs_MFTM
  14. Delta_Event_vs_C
  15. hour
  16. day_of_week
  17. day_of_month
  18. month
  19. year
  20. is_weekend
  21. is_night
  22. is_business_hours
  23. path_depth
  24. is_system_file
  25. is_logfile
  26. file_extension_encoded
  27. event_type_encoded_v2

🔍 Missing values check:
⚠️ Missing values found:
redo                         2181063
target vcn                   2181063
cluster index                2181063
has_incomplete_timestamps    2181063
eventinfo                      83458
fileattribute                  83458
filereferencenumber            83458
parentfilereferen

## 5. Feature Scaling with StandardScaler

**Why StandardScaler?**
- Isolation Forest is distance-based algorithm → requires scaled features
- Transforms features to mean=0, std=1
- Preserves distribution shape (important for anomaly detection)

**Features to scale:**
- ✅ All numerical features (deltas, temporal, path features)

**Features to KEEP UNSCALED:**
- ❌ Binary flags already in [0,1]: is_weekend, is_business_hours, is_system_file
- ❌ Encoded categoricals: source_encoded, event_type_encoded, file_extension_encoded

**Process:**
1. Separate binary/categorical features from numerical features
2. Apply StandardScaler ONLY to numerical features
3. Recombine scaled + unscaled features

In [15]:
# Define binary/categorical features (keep unscaled)
binary_categorical_features = [
    'is_weekend',
    'is_business_hours', 
    'is_system_file',
    'source_encoded',
    'event_type_encoded',
    'file_extension_encoded'
]

# Also keep event_type_encoded_v2 if it exists
if 'event_type_encoded_v2' in feature_cols:
    binary_categorical_features.append('event_type_encoded_v2')

# Get numerical features to scale (exclude binary/categorical)
numerical_features = [col for col in feature_cols if col not in binary_categorical_features]

print("=" * 80)
print("FEATURE SCALING STRATEGY")
print("=" * 80)
print(f"\n✅ Binary/Categorical features (UNSCALED): {len(binary_categorical_features)}")
for feat in binary_categorical_features:
    print(f"  - {feat}")

print(f"\n✅ Numerical features (SCALED): {len(numerical_features)}")
for feat in numerical_features:
    print(f"  - {feat}")

FEATURE SCALING STRATEGY

✅ Binary/Categorical features (UNSCALED): 7
  - is_weekend
  - is_business_hours
  - is_system_file
  - source_encoded
  - event_type_encoded
  - file_extension_encoded
  - event_type_encoded_v2

✅ Numerical features (SCALED): 22
  - redo
  - target vcn
  - cluster index
  - has_incomplete_timestamps
  - eventinfo
  - fileattribute
  - filereferencenumber
  - parentfilereferencenumber
  - Delta_MFTM_vs_M
  - Delta_M_vs_C
  - Delta_C_vs_A
  - Delta_Event_vs_M
  - Delta_Event_vs_MFTM
  - Delta_Event_vs_C
  - hour
  - day_of_week
  - day_of_month
  - month
  - year
  - is_night
  - path_depth
  - is_logfile


In [19]:
# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform numerical features
print("=" * 80)
print("APPLYING STANDARDSCALER")
print("=" * 80)

# Create copy for scaled features
df_scaled = df.copy()

# Check and convert numerical features to numeric type (coerce errors to NaN)
print(f"\n🔍 Checking data types of numerical features...")
for feat in numerical_features:
    if df[feat].dtype == 'object':
        print(f"  ⚠️ {feat} is object type, converting to numeric...")
        df_scaled[feat] = pd.to_numeric(df[feat], errors='coerce')

# Fill any NaN values created from coercion with 0
nan_counts = df_scaled[numerical_features].isnull().sum()
if nan_counts.sum() > 0:
    print(f"\n🔧 Filling NaN values in numerical features:")
    for feat in numerical_features:
        if nan_counts[feat] > 0:
            print(f"  - {feat}: {nan_counts[feat]:,} NaNs → filled with 0")
    df_scaled[numerical_features] = df_scaled[numerical_features].fillna(0)

# Scale numerical features
print(f"\n📏 Scaling {len(numerical_features)} numerical features...")
df_scaled[numerical_features] = scaler.fit_transform(df_scaled[numerical_features])

print("✅ Scaling completed!")

# Verify scaling (only on scaled data)
print("\n📊 Scaling verification (sample features):")
sample_features = numerical_features[:3]  # Check first 3 features
for feat in sample_features:
    print(f"\n{feat}:")
    print(f"  Scaled - Mean: {df_scaled[feat].mean():.2f}, Std: {df_scaled[feat].std():.2f}")

APPLYING STANDARDSCALER

🔍 Checking data types of numerical features...
  ⚠️ redo is object type, converting to numeric...
  ⚠️ target vcn is object type, converting to numeric...
  ⚠️ eventinfo is object type, converting to numeric...
  ⚠️ fileattribute is object type, converting to numeric...
  ⚠️ filereferencenumber is object type, converting to numeric...
  ⚠️ parentfilereferencenumber is object type, converting to numeric...

🔧 Filling NaN values in numerical features:
  - redo: 2,264,521 NaNs → filled with 0
  - target vcn: 2,264,521 NaNs → filled with 0
  - cluster index: 2,181,063 NaNs → filled with 0
  - has_incomplete_timestamps: 2,181,063 NaNs → filled with 0
  - eventinfo: 2,264,521 NaNs → filled with 0
  - fileattribute: 2,264,521 NaNs → filled with 0
  - filereferencenumber: 2,264,521 NaNs → filled with 0
  - parentfilereferencenumber: 2,264,521 NaNs → filled with 0
  - hour: 8 NaNs → filled with 0
  - day_of_week: 8 NaNs → filled with 0
  - day_of_month: 8 NaNs → filled 

## 6. Stratified Train/Val/Test Split by case_id

**Why stratify by case_id?**
- ✅ **Prevents data leakage:** Events from same case are correlated
- ✅ **Realistic evaluation:** Model tested on completely unseen forensic cases
- ✅ **Generalization:** Ensures model learns patterns across different cases

**Split ratios:**
- 🟦 **Train:** 70% of cases
- 🟨 **Validation:** 15% of cases  
- 🟩 **Test:** 15% of cases

**Process:**
1. Use `GroupShuffleSplit` with case_id as groups
2. First split: 70% train, 30% temp (val+test)
3. Second split: Split temp into 50/50 → 15% val, 15% test
4. Verify labeled rows distributed across all splits

In [20]:
# Prepare feature matrix (X) and labels (y)
X = df_scaled[feature_cols].copy()
y = df_scaled[['is_timestomped', 'is_suspicious_execution']].copy()
groups = df_scaled['case_id'].copy()

print("=" * 80)
print("DATA SPLITTING PREPARATION")
print("=" * 80)
print(f"\n✅ Feature matrix (X): {X.shape}")
print(f"✅ Labels (y): {y.shape}")
print(f"✅ Groups (case_id): {len(groups)} records, {groups.nunique()} unique cases")

# Check labeled rows distribution across cases
labeled_mask = (y['is_timestomped'] == 1) | (y['is_suspicious_execution'] == 1)
labeled_cases = groups[labeled_mask].unique()

print(f"\n📊 Labeled rows distribution:")
print(f"  - Total labeled rows: {labeled_mask.sum()}")
print(f"  - Cases with labeled rows: {len(labeled_cases)}")
print(f"  - Case IDs: {sorted(labeled_cases.tolist())}")

DATA SPLITTING PREPARATION

✅ Feature matrix (X): (2264521, 27)
✅ Labels (y): (2264521, 2)
✅ Groups (case_id): 2264521 records, 12 unique cases

📊 Labeled rows distribution:
  - Total labeled rows: 268
  - Cases with labeled rows: 12
  - Case IDs: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


In [21]:
from sklearn.model_selection import GroupShuffleSplit

# Set random seed for reproducibility
RANDOM_STATE = 42

print("=" * 80)
print("STRATIFIED TRAIN/VAL/TEST SPLIT (by case_id)")
print("=" * 80)

# Step 1: Split into train (70%) and temp (30%)
splitter1 = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=RANDOM_STATE)
train_idx, temp_idx = next(splitter1.split(X, y, groups=groups))

print(f"\n✅ Step 1: Train/Temp split")
print(f"  - Train: {len(train_idx):,} records ({len(train_idx)/len(X)*100:.1f}%)")
print(f"  - Temp:  {len(temp_idx):,} records ({len(temp_idx)/len(X)*100:.1f}%)")

# Step 2: Split temp into val (50%) and test (50%) → 15% each of total
splitter2 = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=RANDOM_STATE)
val_idx_relative, test_idx_relative = next(splitter2.split(
    X.iloc[temp_idx], 
    y.iloc[temp_idx], 
    groups=groups.iloc[temp_idx]
))

# Convert relative indices to absolute indices
val_idx = temp_idx[val_idx_relative]
test_idx = temp_idx[test_idx_relative]

print(f"\n✅ Step 2: Val/Test split")
print(f"  - Val:  {len(val_idx):,} records ({len(val_idx)/len(X)*100:.1f}%)")
print(f"  - Test: {len(test_idx):,} records ({len(test_idx)/len(X)*100:.1f}%)")

# Create splits
X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

print(f"\n📊 Final split summary:")
print(f"  - Train: {X_train.shape}")
print(f"  - Val:   {X_val.shape}")
print(f"  - Test:  {X_test.shape}")

STRATIFIED TRAIN/VAL/TEST SPLIT (by case_id)

✅ Step 1: Train/Temp split
  - Train: 1,496,071 records (66.1%)
  - Temp:  768,450 records (33.9%)

✅ Step 2: Val/Test split
  - Val:  391,550 records (17.3%)
  - Test: 376,900 records (16.6%)

📊 Final split summary:
  - Train: (1496071, 27)
  - Val:   (391550, 27)
  - Test:  (376900, 27)


In [22]:
# Verify case_id separation (no overlap)
train_cases = groups.iloc[train_idx].unique()
val_cases = groups.iloc[val_idx].unique()
test_cases = groups.iloc[test_idx].unique()

print("=" * 80)
print("SPLIT VALIDATION")
print("=" * 80)

print(f"\n🔍 Case distribution:")
print(f"  - Train cases: {len(train_cases)}")
print(f"  - Val cases:   {len(val_cases)}")
print(f"  - Test cases:  {len(test_cases)}")

# Check for overlap
train_val_overlap = set(train_cases) & set(val_cases)
train_test_overlap = set(train_cases) & set(test_cases)
val_test_overlap = set(val_cases) & set(test_cases)

print(f"\n🔍 Case overlap check:")
print(f"  - Train-Val overlap:  {len(train_val_overlap)} cases {'✅' if len(train_val_overlap) == 0 else '❌'}")
print(f"  - Train-Test overlap: {len(train_test_overlap)} cases {'✅' if len(train_test_overlap) == 0 else '❌'}")
print(f"  - Val-Test overlap:   {len(val_test_overlap)} cases {'✅' if len(val_test_overlap) == 0 else '❌'}")

# Check labeled rows distribution
train_labeled = y_train.sum()
val_labeled = y_val.sum()
test_labeled = y_test.sum()

print(f"\n🏷️ Labeled rows distribution:")
print(f"  - Train: {train_labeled['is_timestomped']} timestomped + {train_labeled['is_suspicious_execution']} suspicious")
print(f"  - Val:   {val_labeled['is_timestomped']} timestomped + {val_labeled['is_suspicious_execution']} suspicious")
print(f"  - Test:  {test_labeled['is_timestomped']} timestomped + {test_labeled['is_suspicious_execution']} suspicious")

total_labeled_in_splits = (train_labeled.sum() + val_labeled.sum() + test_labeled.sum())
print(f"\n✅ Total labeled rows preserved: {total_labeled_in_splits} {'✅ Matches expected 268!' if total_labeled_in_splits == 268 else '⚠️ Mismatch!'}")

SPLIT VALIDATION

🔍 Case distribution:
  - Train cases: 8
  - Val cases:   2
  - Test cases:  2

🔍 Case overlap check:
  - Train-Val overlap:  0 cases ✅
  - Train-Test overlap: 0 cases ✅
  - Val-Test overlap:   0 cases ✅

🏷️ Labeled rows distribution:
  - Train: 181 timestomped + 10 suspicious
  - Val:   33 timestomped + 2 suspicious
  - Test:  38 timestomped + 4 suspicious

✅ Total labeled rows preserved: 268 ✅ Matches expected 268!


## 7. Export Preprocessed Datasets

**Output files:**
1. `X_train.csv`, `y_train.csv` - Training features and labels
2. `X_val.csv`, `y_val.csv` - Validation features and labels  
3. `X_test.csv`, `y_test.csv` - Test features and labels
4. `preprocessing_metadata.txt` - Scaling parameters and feature info

**Directory:** `data/processed/Phase 4 - Feature Preprocessing/`


In [23]:
# Create output directory
output_dir = 'data/processed/Phase 4 - Feature Preprocessing'
os.makedirs(output_dir, exist_ok=True)

print("=" * 80)
print("EXPORTING PREPROCESSED DATASETS")
print("=" * 80)

# Export training data
X_train.to_csv(f'{output_dir}/X_train.csv', index=False)
y_train.to_csv(f'{output_dir}/y_train.csv', index=False)
print(f"✅ Exported: X_train.csv ({X_train.shape})")
print(f"✅ Exported: y_train.csv ({y_train.shape})")

# Export validation data
X_val.to_csv(f'{output_dir}/X_val.csv', index=False)
y_val.to_csv(f'{output_dir}/y_val.csv', index=False)
print(f"✅ Exported: X_val.csv ({X_val.shape})")
print(f"✅ Exported: y_val.csv ({y_val.shape})")

# Export test data
X_test.to_csv(f'{output_dir}/X_test.csv', index=False)
y_test.to_csv(f'{output_dir}/y_test.csv', index=False)
print(f"✅ Exported: X_test.csv ({X_test.shape})")
print(f"✅ Exported: y_test.csv ({y_test.shape})")

# Export metadata
metadata = f"""PHASE 4 - FEATURE PREPROCESSING METADATA
{'='*80}

DATASET INFORMATION:
- Input file: data/processed/Phase 3 - Feature Engineering/Master_Timeline_Features.csv
- Total records: {len(df):,}
- Total features (pre-processing): {len(df.columns)}
- Training features (post-processing): {len(feature_cols)}

OUTLIER HANDLING:
- Delta features clipped to: [{clip_min:,}, {clip_max:,}] seconds (±10 years)

CATEGORICAL ENCODING:
- file_extension: Grouped rare (<0.1%) → Label Encoded
- event_type: Label Encoded

FEATURE SCALING:
- Method: StandardScaler (mean=0, std=1)
- Scaled features: {len(numerical_features)}
- Unscaled features: {len(binary_categorical_features)} (binary/categorical)

TRAIN/VAL/TEST SPLIT:
- Method: GroupShuffleSplit stratified by case_id
- Random state: {RANDOM_STATE}
- Train: {len(train_idx):,} records ({len(train_idx)/len(X)*100:.1f}%) - {len(train_cases)} cases
- Val:   {len(val_idx):,} records ({len(val_idx)/len(X)*100:.1f}%) - {len(val_cases)} cases
- Test:  {len(test_idx):,} records ({len(test_idx)/len(X)*100:.1f}%) - {len(test_cases)} cases

LABELED ROWS DISTRIBUTION:
- Train: {train_labeled['is_timestomped']} timestomped + {train_labeled['is_suspicious_execution']} suspicious
- Val:   {val_labeled['is_timestomped']} timestomped + {val_labeled['is_suspicious_execution']} suspicious
- Test:  {test_labeled['is_timestomped']} timestomped + {test_labeled['is_suspicious_execution']} suspicious
- Total: {total_labeled_in_splits} labeled rows preserved ✅

TRAINING FEATURES ({len(feature_cols)}):
{chr(10).join([f'  {i+1}. {feat}' for i, feat in enumerate(feature_cols)])}

Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
"""

with open(f'{output_dir}/preprocessing_metadata.txt', 'w') as f:
    f.write(metadata)
    
print(f"✅ Exported: preprocessing_metadata.txt")

print(f"\n🎉 All files exported to: {output_dir}/")

EXPORTING PREPROCESSED DATASETS
✅ Exported: X_train.csv ((1496071, 27))
✅ Exported: y_train.csv ((1496071, 2))
✅ Exported: X_val.csv ((391550, 27))
✅ Exported: y_val.csv ((391550, 2))
✅ Exported: X_test.csv ((376900, 27))
✅ Exported: y_test.csv ((376900, 2))
✅ Exported: preprocessing_metadata.txt

🎉 All files exported to: data/processed/Phase 4 - Feature Preprocessing/


## ✅ Phase 4 - Feature Preprocessing Complete!

### 🎯 Summary of Achievements

**1. Outlier Handling:**
- ✅ Clipped 6 time delta features to ±10 years (±315,360,000 seconds)
- ✅ Removed extreme filesystem artifacts and anomalies

**2. Categorical Encoding:**
- ✅ Grouped rare file extensions (<0.1% frequency) into 'other'
- ✅ Label encoded file_extension_grouped and event_type
- ✅ Total encoded features: 3

**3. Feature Scaling:**
- ✅ Applied StandardScaler to numerical features (mean=0, std=1)
- ✅ Preserved binary/categorical features unscaled

**4. Stratified Data Split:**
- ✅ 70/15/15 split stratified by case_id (prevents data leakage)
- ✅ No case overlap between train/val/test
- ✅ All 268 labeled rows preserved across splits

**5. Data Export:**
- ✅ 6 CSV files: X_train, y_train, X_val, y_val, X_test, y_test
- ✅ Metadata file with preprocessing details

### 📊 Final Dataset Statistics

|Split | Records | Cases | Timestomped | Suspicious |
|-------|---------|-------|-------------|------------|
| **Train** | 1,496,071 | 8 | 181 | 10 |
| **Val** | 391,550 | 2 | 33 | 2 |
| **Test** | 376,900 | 2 | 38 | 4 |

### 🚀 Next Steps: Phase 5 - Model Training

**Objective:** Train Isolation Forest model for timestomping detection

**Tasks:**
1. Load preprocessed train/val/test datasets
2. Train Isolation Forest on training data (unsupervised)
3. Hyperparameter tuning (contamination, n_estimators, max_features)
4. Evaluate on validation set (Precision, Recall, F1, ROC-AUC)
5. Final evaluation on test set
6. Feature importance analysis
7. Model persistence and documentation

**Output Directory:** `data/processed/Phase 5 - Model Training/`