# Phase 2 - Data Cleaning

**Objective:** Clean, standardize, and prepare the labeled LogFile datasets for merging and analysis.

**Process:**
1. Import necessary libraries
2. Load 12 labeled LogFile datasets from Phase 1
3. Clean and standardize data
4. Drop irrelevant columns and rows
5. Impute missing values
6. Merge all LogFiles into Master dataset
7. Export cleaned Master_LogFile_Cleaned.csv

## 1. Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
import warnings

# Visualization libraries (optional, for data exploration)
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Matplotlib is building the font cache; this may take a moment.


Libraries imported successfully!
Pandas version: 2.3.2
NumPy version: 2.3.3


## 2. Load the 12 Labeled LogFile Datasets

Load all labeled LogFile datasets from Phase 1 output directory.

In [2]:
# --- Configuration ---
INPUT_DIR = Path('data/processed/Phase 1 - Data Labeling')
OUTPUT_DIR = Path('data/processed/Phase 2 - Data Cleaning')

# Ensure output directory exists
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Define the 12 case IDs
CASE_IDS = [f'{i:02d}' for i in range(1, 13)]  # ['01', '02', ..., '12']

print(f"Input Directory: {INPUT_DIR}")
print(f"Output Directory: {OUTPUT_DIR}")
print(f"Case IDs to process: {CASE_IDS}")


Input Directory: data/processed/Phase 1 - Data Labeling
Output Directory: data/processed/Phase 2 - Data Cleaning
Case IDs to process: ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']


In [3]:
# --- Load all 12 LogFile datasets ---
logfile_dataframes = {}

print("Loading labeled LogFile datasets...")
print("-" * 60)

for case_id in CASE_IDS:
    filename = f'{case_id}-PE-LogFile_labeled.csv'
    filepath = INPUT_DIR / filename
    
    try:
        # Load with low_memory=False to handle mixed types
        df = pd.read_csv(filepath, low_memory=False)
        
        # Add Case_ID column for tracking
        df['Case_ID'] = int(case_id)
        
        # Store in dictionary
        logfile_dataframes[case_id] = df
        
        print(f"✓ Case {case_id}: Loaded {len(df):,} records | Columns: {len(df.columns)}")
        
    except FileNotFoundError:
        print(f"✗ Case {case_id}: File not found at {filepath}")
    except Exception as e:
        print(f"✗ Case {case_id}: Error loading file - {e}")

print("-" * 60)
print(f"Total datasets loaded: {len(logfile_dataframes)}/12")

# Calculate total records
total_records = sum(len(df) for df in logfile_dataframes.values())
print(f"Total records across all cases: {total_records:,}")


Loading labeled LogFile datasets...
------------------------------------------------------------
✓ Case 01: Loaded 39,077 records | Columns: 16
✓ Case 02: Loaded 14,783 records | Columns: 16
✓ Case 03: Loaded 24,063 records | Columns: 16
✓ Case 04: Loaded 12,731 records | Columns: 16
✓ Case 05: Loaded 14,242 records | Columns: 16
✓ Case 06: Loaded 14,030 records | Columns: 16
✓ Case 07: Loaded 23,737 records | Columns: 16
✓ Case 08: Loaded 23,379 records | Columns: 16
✓ Case 09: Loaded 25,688 records | Columns: 16
✓ Case 10: Loaded 23,932 records | Columns: 16
✓ Case 11: Loaded 14,083 records | Columns: 16
✓ Case 12: Loaded 14,139 records | Columns: 16
------------------------------------------------------------
Total datasets loaded: 12/12
Total records across all cases: 243,884


## 3. Initial Data Exploration

Examine the structure and content of the loaded datasets.

In [4]:
# --- Inspect first dataset as a sample ---
sample_case = '01'
df_sample = logfile_dataframes[sample_case]

print(f"Sample Dataset: Case {sample_case}")
print("=" * 60)
print(f"\nShape: {df_sample.shape}")
print(f"\nColumn Names:")
print(df_sample.columns.tolist())
print(f"\nData Types:")
print(df_sample.dtypes)

Sample Dataset: Case 01

Shape: (39077, 16)

Column Names:
['lsn', 'eventtime(utc+8)', 'event', 'detail', 'file/directory name', 'full path', 'creationtime', 'modifiedtime', 'mftmodifiedtime', 'accessedtime', 'redo', 'target vcn', 'cluster index', 'is_timestomped', 'is_suspicious_execution', 'Case_ID']

Data Types:
lsn                         int64
eventtime(utc+8)           object
event                      object
detail                     object
file/directory name        object
full path                  object
creationtime               object
modifiedtime               object
mftmodifiedtime            object
accessedtime               object
redo                       object
target vcn                 object
cluster index               int64
is_timestomped              int64
is_suspicious_execution     int64
Case_ID                     int64
dtype: object


# --- Preview first few rows ---
print(f"\nFirst 5 rows of Case {sample_case}:")
df_sample.head()

In [5]:
# --- Check for missing values ---
print("Missing Values Summary:")
print("=" * 60)
missing_summary = df_sample.isnull().sum()
missing_pct = (missing_summary / len(df_sample)) * 100

missing_df = pd.DataFrame({
    'Missing_Count': missing_summary,
    'Missing_Percentage': missing_pct
}).sort_values('Missing_Count', ascending=False)

print(missing_df[missing_df['Missing_Count'] > 0])

Missing Values Summary:
                     Missing_Count  Missing_Percentage
detail                       26678           68.270338
eventtime(utc+8)             22815           58.384728
event                        19577           50.098523
creationtime                 10897           27.885969
modifiedtime                 10897           27.885969
mftmodifiedtime              10897           27.885969
accessedtime                 10897           27.885969
full path                     4315           11.042301
file/directory name            447            1.143895


In [6]:
# --- Check label distribution ---
print("\nLabel Distribution (Case 01):")
print("=" * 60)
print(f"is_timestomped:")
print(df_sample['is_timestomped'].value_counts())
print(f"\nis_suspicious_execution:")
print(df_sample['is_suspicious_execution'].value_counts())


Label Distribution (Case 01):
is_timestomped:
is_timestomped
0    39076
1        1
Name: count, dtype: int64

is_suspicious_execution:
is_suspicious_execution
0    39076
1        1
Name: count, dtype: int64


## 4. Next Steps

**To Do:**
- [ ] Identify and drop irrelevant columns
- [ ] Standardize timestamp columns to datetime format
- [ ] Handle missing values (drop or impute)
- [ ] Remove duplicate records
- [ ] Merge all 12 LogFile datasets
- [ ] Export Master_LogFile_Cleaned.csv


## 5. Data Cleaning - Drop Rows with Null Event and Event Detail

Remove rows where both `event` and `eventdetail` are null, as these records lack essential information for forensic analysis.

In [9]:
# --- Drop rows where BOTH event AND detail are null ---
print("Cleaning: Dropping rows with null event AND detail...")
print("=" * 60)

# Store original counts
original_counts = {}
cleaned_dataframes = {}

for case_id, df in logfile_dataframes.items():
    original_count = len(df)
    original_counts[case_id] = original_count
    
    # Drop rows where BOTH event and detail are null
    df_cleaned = df[~(df['event'].isnull() & df['detail'].isnull())].copy()
    
    dropped_count = original_count - len(df_cleaned)
    dropped_pct = (dropped_count / original_count) * 100 if original_count > 0 else 0
    
    cleaned_dataframes[case_id] = df_cleaned
    
    print(f"Case {case_id}: {original_count:,} → {len(df_cleaned):,} records "
          f"(Dropped: {dropped_count:,} | {dropped_pct:.2f}%)")

print("-" * 60)

# Calculate totals
total_original = sum(original_counts.values())
total_cleaned = sum(len(df) for df in cleaned_dataframes.values())
total_dropped = total_original - total_cleaned
total_dropped_pct = (total_dropped / total_original) * 100

print(f"Total: {total_original:,} → {total_cleaned:,} records")
print(f"Total Dropped: {total_dropped:,} ({total_dropped_pct:.2f}%)")

# Update the main dictionary
logfile_dataframes = cleaned_dataframes

# --- Verify the cleaning ---
print("\nVerification: Check if any rows still have null event AND detail")
print("=" * 60)

for case_id, df in logfile_dataframes.items():
    null_both = df[df['event'].isnull() & df['detail'].isnull()]
    print(f"Case {case_id}: {len(null_both)} rows with both null")

print("\n✓ Cleaning verified!")

Cleaning: Dropping rows with null event AND detail...
Case 01: 39,077 → 19,500 records (Dropped: 19,577 | 50.10%)
Case 02: 14,783 → 9,694 records (Dropped: 5,089 | 34.42%)
Case 03: 24,063 → 12,575 records (Dropped: 11,488 | 47.74%)
Case 04: 12,731 → 7,713 records (Dropped: 5,018 | 39.42%)
Case 05: 14,242 → 7,530 records (Dropped: 6,712 | 47.13%)
Case 06: 14,030 → 7,426 records (Dropped: 6,604 | 47.07%)
Case 07: 23,737 → 12,824 records (Dropped: 10,913 | 45.97%)
Case 08: 23,379 → 12,397 records (Dropped: 10,982 | 46.97%)
Case 09: 25,688 → 13,414 records (Dropped: 12,274 | 47.78%)
Case 10: 23,932 → 12,929 records (Dropped: 11,003 | 45.98%)
Case 11: 14,083 → 7,459 records (Dropped: 6,624 | 47.04%)
Case 12: 14,139 → 7,770 records (Dropped: 6,369 | 45.05%)
------------------------------------------------------------
Total: 243,884 → 131,231 records
Total Dropped: 112,653 (46.19%)

Verification: Check if any rows still have null event AND detail
Case 01: 0 rows with both null
Case 02: 0 rows

## 6. Drop Irrelevant Columns

Remove columns that are not relevant for timestomping detection:
- `targetvcn`: Low-level NTFS virtual cluster number (physical storage location)
- `clusterindex`: Cluster index information (disk fragmentation data)

These columns are useful for data recovery and disk forensics, but do not contribute to timestamp manipulation analysis.


In [10]:
# --- Drop irrelevant columns for timestomping detection ---
print("Dropping irrelevant columns: targetvcn, clusterindex")
print("=" * 60)

columns_to_drop = ['targetvcn', 'clusterindex']

for case_id, df in logfile_dataframes.items():
    # Check which columns exist before dropping
    existing_cols = [col for col in columns_to_drop if col in df.columns]
    
    if existing_cols:
        df.drop(columns=existing_cols, inplace=True)
        print(f"Case {case_id}: Dropped {len(existing_cols)} columns - {existing_cols}")
    else:
        print(f"Case {case_id}: No columns to drop (already removed)")

print("-" * 60)
print("✓ Irrelevant columns dropped successfully!")

# --- Verify column removal ---
print("\nVerification: Check remaining columns")
print("=" * 60)

sample_case = '01'
df_sample = logfile_dataframes[sample_case]

print(f"Remaining columns in Case {sample_case}: {len(df_sample.columns)}")
print(f"\nColumn names:")
print(df_sample.columns.tolist())

# Verify dropped columns are gone
dropped_still_present = [col for col in columns_to_drop if col in df_sample.columns]
if dropped_still_present:
    print(f"\n⚠️ Warning: These columns still exist: {dropped_still_present}")
else:
    print(f"\n✓ Confirmed: targetvcn and clusterindex have been removed")

Dropping irrelevant columns: targetvcn, clusterindex
Case 01: No columns to drop (already removed)
Case 02: No columns to drop (already removed)
Case 03: No columns to drop (already removed)
Case 04: No columns to drop (already removed)
Case 05: No columns to drop (already removed)
Case 06: No columns to drop (already removed)
Case 07: No columns to drop (already removed)
Case 08: No columns to drop (already removed)
Case 09: No columns to drop (already removed)
Case 10: No columns to drop (already removed)
Case 11: No columns to drop (already removed)
Case 12: No columns to drop (already removed)
------------------------------------------------------------
✓ Irrelevant columns dropped successfully!

Verification: Check remaining columns
Remaining columns in Case 01: 16

Column names:
['lsn', 'eventtime(utc+8)', 'event', 'detail', 'file/directory name', 'full path', 'creationtime', 'modifiedtime', 'mftmodifiedtime', 'accessedtime', 'redo', 'target vcn', 'cluster index', 'is_timestomped