# Phase 2.1 - Data Merging (LogFile + UsnJrnl)

**Objective:** Merge the cleaned LogFile and UsnJrnl datasets into a single chronological timeline for forensic analysis.

**Input Files:**
- `data/processed/Phase 2 - Data Cleaning/Master_LogFile_Cleaned.csv` (83,458 records)
- `data/processed/Phase 2 - Data Cleaning/Master_UsnJrnl_Cleaned.csv` (2,181,063 records)

**Output File:**
- `data/processed/Phase 2.1 - Data Merging/Master_Timeline.csv`

**Process:**
1. Load both cleaned master datasets
2. Standardize column names and structure
3. Merge LogFile and UsnJrnl into unified timeline
4. Sort by case ID, then by timestamp
5. Validate merged data
6. Export Master_Timeline.csv

## 1. Import Libraries and Setup

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.2
NumPy version: 2.3.3


## 2. Configuration and Load Cleaned Datasets

In [2]:
# --- Configuration ---
INPUT_DIR = Path('data/processed/Phase 2 - Data Cleaning')
OUTPUT_DIR = Path('data/processed/Phase 2.1 - Data Merging')

# Ensure output directory exists
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# File paths
LOGFILE_PATH = INPUT_DIR / 'Master_LogFile_Cleaned.csv'
USNJRNL_PATH = INPUT_DIR / 'Master_UsnJrnl_Cleaned.csv'
OUTPUT_PATH = OUTPUT_DIR / 'Master_Timeline.csv'

print(f"Input Directory: {INPUT_DIR}")
print(f"Output Directory: {OUTPUT_DIR}")
print(f"\nInput Files:")
print(f"  - LogFile: {LOGFILE_PATH}")
print(f"  - UsnJrnl: {USNJRNL_PATH}")
print(f"\nOutput File:")
print(f"  - Master Timeline: {OUTPUT_PATH}")

Input Directory: data/processed/Phase 2 - Data Cleaning
Output Directory: data/processed/Phase 2.1 - Data Merging

Input Files:
  - LogFile: data/processed/Phase 2 - Data Cleaning/Master_LogFile_Cleaned.csv
  - UsnJrnl: data/processed/Phase 2 - Data Cleaning/Master_UsnJrnl_Cleaned.csv

Output File:
  - Master Timeline: data/processed/Phase 2.1 - Data Merging/Master_Timeline.csv


In [5]:
# --- Load Master LogFile ---
print("\nLoading Master LogFile...")
print("=" * 60)

try:
    df_logfile = pd.read_csv(LOGFILE_PATH, low_memory=False)
    
    # Convert timestamp columns to tz-aware datetimes.
    # Note: utc=True labels timezone-naive values as UTC (+00:00).
    timestamp_cols_logfile = ['eventtime(utc+8)', 'creationtime', 'modifiedtime', 'mftmodifiedtime', 'accessedtime']
    for col in timestamp_cols_logfile:
        if col in df_logfile.columns:
            df_logfile[col] = pd.to_datetime(df_logfile[col], errors='coerce', utc=True)
    
    print(f"✓ Master LogFile loaded successfully")
    print(f"  Records: {len(df_logfile):,}")
    print(f"  Columns: {len(df_logfile.columns)}")
    print(f"  Timestomped: {df_logfile['is_timestomped'].sum()}")
    print(f"  Suspicious: {df_logfile['is_suspicious_execution'].sum()}")
    
except FileNotFoundError:
    print(f"✗ ERROR: Master LogFile not found at {LOGFILE_PATH}")
    df_logfile = pd.DataFrame()
except Exception as e:
    print(f"✗ ERROR loading LogFile: {e}")
    df_logfile = pd.DataFrame()

# --- Load Master UsnJrnl ---
print("\nLoading Master UsnJrnl...")
print("=" * 60)

try:
    df_usnjrnl = pd.read_csv(USNJRNL_PATH, low_memory=False)
    
    # Convert timestamp column to a tz-aware datetime.
    # Note: utc=True labels timezone-naive values as UTC (+00:00).
    if 'timestamp(utc+8)' in df_usnjrnl.columns:
        df_usnjrnl['timestamp(utc+8)'] = pd.to_datetime(df_usnjrnl['timestamp(utc+8)'], errors='coerce', utc=True)
    
    print(f"✓ Master UsnJrnl loaded successfully")
    print(f"  Records: {len(df_usnjrnl):,}")
    print(f"  Columns: {len(df_usnjrnl.columns)}")
    print(f"  Timestomped: {df_usnjrnl['is_timestomped'].sum()}")
    print(f"  Suspicious: {df_usnjrnl['is_suspicious_execution'].sum()}")
    
except FileNotFoundError:
    print(f"✗ ERROR: Master UsnJrnl not found at {USNJRNL_PATH}")
    df_usnjrnl = pd.DataFrame()
except Exception as e:
    print(f"✗ ERROR loading UsnJrnl: {e}")
    df_usnjrnl = pd.DataFrame()


Loading Master LogFile...
✓ Master LogFile loaded successfully
  Records: 83,458
  Columns: 17
  Timestomped: 14
  Suspicious: 8

Loading Master UsnJrnl...
✓ Master UsnJrnl loaded successfully
  Records: 2,181,063
  Columns: 11
  Timestomped: 238
  Suspicious: 8
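
The load cells parse the `(utc+8)` columns with `utc=True`, which labels timezone-naive values as UTC. If the cleaned CSVs store naive local times in UTC+8 (an assumption: the column names suggest it, but this notebook does not confirm it), one way to keep the offset correct is to localize first and convert afterwards. The sketch below uses made-up sample values and a hypothetical `UTC_PLUS_8` constant:

```python
from datetime import timedelta, timezone

import pandas as pd

# Hypothetical fixed offset for the "(utc+8)" columns (assumption, not confirmed by the data)
UTC_PLUS_8 = timezone(timedelta(hours=8))

# Made-up sample: one naive local timestamp and one unparseable value
raw = pd.Series(["2023-12-23 08:14:24", "not a date"])

# Parse without utc=True so the values stay naive, then attach the +08:00 offset
local = pd.to_datetime(raw, errors="coerce").dt.tz_localize(UTC_PLUS_8)

# Convert to UTC for a common reference zone across sources
as_utc = local.dt.tz_convert("UTC")

print(as_utc)
```

Under that assumption, `08:14:24` local time becomes `00:14:24` UTC, an eight-hour shift that `utc=True` on its own does not apply.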


## 3. Examine Column Structures

Before merging, let's examine the column structures of both datasets to understand how to standardize them.

In [6]:
# --- Compare column structures ---
print("\nColumn Structure Comparison:")
print("=" * 60)

print(f"\nLogFile Columns ({len(df_logfile.columns)}):")
print(df_logfile.columns.tolist())

print(f"\nUsnJrnl Columns ({len(df_usnjrnl.columns)}):")
print(df_usnjrnl.columns.tolist())

# Find common columns
common_cols = set(df_logfile.columns).intersection(set(df_usnjrnl.columns))
print(f"\nCommon Columns ({len(common_cols)}):")
print(sorted(list(common_cols)))

# Find unique columns
logfile_only = set(df_logfile.columns) - set(df_usnjrnl.columns)
usnjrnl_only = set(df_usnjrnl.columns) - set(df_logfile.columns)

print(f"\nLogFile-Only Columns ({len(logfile_only)}):")
print(sorted(list(logfile_only)))

print(f"\nUsnJrnl-Only Columns ({len(usnjrnl_only)}):")
print(sorted(list(usnjrnl_only)))


Column Structure Comparison:

LogFile Columns (17):
['case_id', 'lsn', 'eventtime(utc+8)', 'event', 'detail', 'file/directory name', 'full path', 'creationtime', 'modifiedtime', 'mftmodifiedtime', 'accessedtime', 'redo', 'target vcn', 'cluster index', 'is_timestomped', 'is_suspicious_execution', 'has_incomplete_timestamps']

UsnJrnl Columns (11):
['case_id', 'timestamp(utc+8)', 'usn', 'file/directory name', 'fullpath', 'eventinfo', 'fileattribute', 'filereferencenumber', 'parentfilereferencenumber', 'is_timestomped', 'is_suspicious_execution']

Common Columns (4):
['case_id', 'file/directory name', 'is_suspicious_execution', 'is_timestomped']

LogFile-Only Columns (13):
['accessedtime', 'cluster index', 'creationtime', 'detail', 'event', 'eventtime(utc+8)', 'full path', 'has_incomplete_timestamps', 'lsn', 'mftmodifiedtime', 'modifiedtime', 'redo', 'target vcn']

UsnJrnl-Only Columns (7):
['eventinfo', 'fileattribute', 'filereferencenumber', 'fullpath', 'parentfilereferencenumber', 'timestamp(utc+8)', 'usn']

## 4. Standardize Column Names and Structure

**Strategy for Merging:**

1. **Add source identifier** to distinguish LogFile vs UsnJrnl records
2. **Standardize timestamp column** - use unified primary timestamp
3. **Handle dataset-specific columns:**
   - LogFile-specific: `lsn`, `event`, `detail`, `redo`, MAC timestamps
   - UsnJrnl-specific: `usn`, `eventinfo`, `fileattribute`, `filereferencenumber`, `parentfilereferencenumber`
4. **Standardize path columns:**
   - LogFile: `full path`
   - UsnJrnl: `fullpath`
   - Merge into single `fullpath` column

**Approach:** Keep every column from both datasets. Where a column exists in only one source, rows from the other source hold NaN (NaT for datetime columns).
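
The strategy above can also be expressed as one reusable helper, which keeps the per-source differences (timestamp column, renames) in the call rather than in separate `if` blocks. This is only a sketch: the `standardize` function and its parameters are hypothetical and are not part of this notebook.

```python
import pandas as pd


def standardize(df, source, timestamp_col, renames=None):
    """Return a copy with unified column names, a source tag, and a primary timestamp."""
    out = df.rename(columns=renames or {})
    out["source"] = source
    out["timestamp_primary"] = out[timestamp_col]
    return out


# Toy LogFile-like frame (made-up values)
log = pd.DataFrame({
    "eventtime(utc+8)": ["2023-12-23 00:14:24"],
    "full path": ["\\Users\\demo\\a.txt"],
})

log_std = standardize(log, "LogFile", "eventtime(utc+8)", {"full path": "fullpath"})
print(log_std.columns.tolist())
```

Because the helper returns a new frame, the input frame is left untouched, which makes the cell safe to re-run.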


In [9]:
# --- Add source identifier to both datasets ---
print("Adding source identifier columns...")
print("=" * 60)

df_logfile['source'] = 'LogFile'
df_usnjrnl['source'] = 'UsnJrnl'

print(f"✓ Added 'source' column to LogFile: {len(df_logfile):,} records")
print(f"✓ Added 'source' column to UsnJrnl: {len(df_usnjrnl):,} records")

# --- Standardize column names ---
print("\nStandardizing column names...")
print("=" * 60)

# Rename 'full path' to 'fullpath' in LogFile to match UsnJrnl
if 'full path' in df_logfile.columns:
    df_logfile.rename(columns={'full path': 'fullpath'}, inplace=True)
    print("✓ Renamed 'full path' → 'fullpath' in LogFile")

# Create unified primary timestamp column
# LogFile uses 'eventtime(utc+8)', UsnJrnl uses 'timestamp(utc+8)'
# We'll create 'timestamp_primary' for both

if 'eventtime(utc+8)' in df_logfile.columns:
    df_logfile['timestamp_primary'] = df_logfile['eventtime(utc+8)']
    print("✓ Created 'timestamp_primary' from 'eventtime(utc+8)' in LogFile")

if 'timestamp(utc+8)' in df_usnjrnl.columns:
    df_usnjrnl['timestamp_primary'] = df_usnjrnl['timestamp(utc+8)']
    print("✓ Created 'timestamp_primary' from 'timestamp(utc+8)' in UsnJrnl")

print("\n✓ Column standardization completed!")



Adding source identifier columns...
✓ Added 'source' column to LogFile: 83,458 records
✓ Added 'source' column to UsnJrnl: 2,181,063 records

Standardizing column names...
✓ Renamed 'full path' → 'fullpath' in LogFile
✓ Created 'timestamp_primary' from 'eventtime(utc+8)' in LogFile
✓ Created 'timestamp_primary' from 'timestamp(utc+8)' in UsnJrnl

✓ Column standardization completed!


## 5. Merge Datasets into Unified Timeline

Concatenate LogFile and UsnJrnl vertically, preserving all columns from both datasets.

In [10]:
# --- Merge LogFile and UsnJrnl ---
print("\nMerging LogFile and UsnJrnl into unified timeline...")
print("=" * 60)

# Concatenate vertically (rows)
master_timeline = pd.concat([df_logfile, df_usnjrnl], ignore_index=True, sort=False)

print(f"✓ Datasets merged successfully!")
print(f"\nMerge Statistics:")
print(f"  LogFile records:     {len(df_logfile):>12,}")
print(f"  UsnJrnl records:     {len(df_usnjrnl):>12,}")
print(f"  ----------------------------------------")
print(f"  Master Timeline:     {len(master_timeline):>12,}")
print(f"\n  Total columns:       {len(master_timeline.columns)}")


Merging LogFile and UsnJrnl into unified timeline...
✓ Datasets merged successfully!

Merge Statistics:
  LogFile records:           83,458
  UsnJrnl records:        2,181,063
  ----------------------------------------
  Master Timeline:        2,264,521

  Total columns:       25
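
The 25 columns are the union of the two schemas: 19 LogFile columns and 13 UsnJrnl columns after standardization, 7 of which are shared. A side effect of filling the gaps with NaN is that integer columns present in only one source, such as `lsn`, are upcast to float (visible in the sample output as values like `8715607000.0`). The sketch below reproduces this with toy frames; casting to the nullable `Int64` dtype is one possible remedy, not something this notebook does.

```python
import pandas as pd

# Toy frames with one source-specific integer column each (made-up values)
log = pd.DataFrame({"case_id": [1], "fullpath": ["\\a.txt"], "lsn": [100]})
usn = pd.DataFrame({"case_id": [1], "fullpath": ["\\b.txt"], "usn": [200]})

# Vertical concatenation keeps the union of columns and fills gaps with NaN
merged = pd.concat([log, usn], ignore_index=True, sort=False)
print(merged.dtypes)

# Optional: restore integer semantics with pandas' nullable integer dtype
lsn_nullable = merged["lsn"].astype("Int64")
print(lsn_nullable)
```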


In [11]:
# --- Verify merge integrity ---
print("\nVerifying merge integrity...")
print("=" * 60)

# Check source distribution
source_counts = master_timeline['source'].value_counts()
print(f"\nSource Distribution:")
for source, count in source_counts.items():
    pct = (count / len(master_timeline)) * 100
    print(f"  {source:<10} {count:>12,} ({pct:>6.2f}%)")

# Check labeled rows preservation
total_timestomped = master_timeline['is_timestomped'].sum()
total_suspicious = master_timeline['is_suspicious_execution'].sum()

print(f"\nLabeled Rows Preservation:")
print(f"  Timestomped:         {total_timestomped:>12}")
print(f"  Suspicious:          {total_suspicious:>12}")
print(f"  Total Labeled:       {total_timestomped + total_suspicious:>12}")

# Expected: 14 + 238 = 252 timestomped, 8 + 8 = 16 suspicious
expected_timestomped = df_logfile['is_timestomped'].sum() + df_usnjrnl['is_timestomped'].sum()
expected_suspicious = df_logfile['is_suspicious_execution'].sum() + df_usnjrnl['is_suspicious_execution'].sum()

if total_timestomped == expected_timestomped and total_suspicious == expected_suspicious:
    print("\n✓ All labeled rows preserved correctly!")
else:
    print(f"\n⚠️ WARNING: Expected {expected_timestomped} timestomped and {expected_suspicious} suspicious")


Verifying merge integrity...

Source Distribution:
  UsnJrnl       2,181,063 ( 96.31%)
  LogFile          83,458 (  3.69%)

Labeled Rows Preservation:
  Timestomped:                  252
  Suspicious:                    16
  Total Labeled:                268

✓ All labeled rows preserved correctly!
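
Note that "Total Labeled" adds the two flag sums, so a row carrying both flags would be counted twice. Whether any such rows exist in this dataset is not shown here. A minimal sketch on made-up data of how the number of distinct labeled rows could be computed instead:

```python
import pandas as pd

# Made-up flags: the first row carries both labels
df = pd.DataFrame({
    "is_timestomped": [1, 0, 1, 0],
    "is_suspicious_execution": [1, 0, 0, 1],
})

flag_cols = ["is_timestomped", "is_suspicious_execution"]

# Sum of flags: counts a doubly flagged row twice
sum_of_flags = int(df[flag_cols].sum().sum())

# Distinct rows with at least one flag set
labeled_rows = int((df[flag_cols] == 1).any(axis=1).sum())

print(sum_of_flags, labeled_rows)
```

If the two numbers match on the real timeline, no row carries both labels and 268 is also the count of distinct labeled rows.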


## 6. Sort by Case ID and Timestamp

Sort the unified timeline by `case_id`, then chronologically by `timestamp_primary` within each case. Rows with a missing timestamp sort after the dated rows.

In [12]:
# --- Sort master timeline ---
print("\nSorting Master Timeline...")
print("=" * 60)

# Sort by case_id and timestamp_primary
master_timeline = master_timeline.sort_values(
    ['case_id', 'timestamp_primary'], 
    na_position='last'
).reset_index(drop=True)

print("✓ Master Timeline sorted by case_id and timestamp_primary")

# Display temporal coverage
print(f"\nTemporal Coverage:")
earliest = master_timeline['timestamp_primary'].min()
latest = master_timeline['timestamp_primary'].max()

print(f"  Earliest event: {earliest}")
print(f"  Latest event:   {latest}")

if pd.notna(earliest) and pd.notna(latest):
    duration = latest - earliest
    print(f"  Time span:      {duration.days} days (~{duration.days/365:.1f} years)")


Sorting Master Timeline...
✓ Master Timeline sorted by case_id and timestamp_primary

Temporal Coverage:
  Earliest event: 2000-01-01 08:00:00+00:00
  Latest event:   2024-01-01 00:08:36+00:00
  Time span:      8765 days (~24.0 years)
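
Many events share the same second, so the sort keys above leave the relative order of tied rows unspecified. One way to make the order reproducible is to add an explicit tie-breaker key. The sketch below uses `source` for that purpose on made-up data; this is an assumption for illustration, and a sequence number such as `lsn` or `usn` may be a better choice within a single source.

```python
import pandas as pd

# Made-up events: two rows share a timestamp but come from different sources
events = pd.DataFrame({
    "case_id": [1, 1, 1],
    "timestamp_primary": pd.to_datetime(
        ["2023-12-23 00:14:24", "2023-12-23 00:14:24", "2023-12-23 00:14:20"],
        utc=True,
    ),
    "source": ["UsnJrnl", "LogFile", "LogFile"],
})

# Adding 'source' as the last key gives tied timestamps a deterministic order
ordered = events.sort_values(
    ["case_id", "timestamp_primary", "source"],
    na_position="last",
).reset_index(drop=True)

print(ordered)
```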


## 7. Reorder Columns for Better Readability

Place the most important columns first for easier analysis.

In [13]:
# --- Reorder columns logically ---
print("\nReordering columns...")
print("=" * 60)

# Define preferred column order
priority_cols = [
    'case_id',
    'timestamp_primary',
    'source',
    'fullpath',
    'file/directory name',
    'is_timestomped',
    'is_suspicious_execution'
]

# Get remaining columns
remaining_cols = [col for col in master_timeline.columns if col not in priority_cols]

# Combine: priority columns first, then remaining
final_col_order = priority_cols + remaining_cols

# Reorder (only include columns that exist)
final_col_order = [col for col in final_col_order if col in master_timeline.columns]

master_timeline = master_timeline[final_col_order]

print("✓ Columns reordered")
print(f"\nFirst 10 columns:")
print(master_timeline.columns.tolist()[:10])



Reordering columns...
✓ Columns reordered

First 10 columns:
['case_id', 'timestamp_primary', 'source', 'fullpath', 'file/directory name', 'is_timestomped', 'is_suspicious_execution', 'lsn', 'eventtime(utc+8)', 'event']


## 8. Final Validation and Statistics

In [14]:
# --- Final validation ---
print("\nFinal Master Timeline Validation")
print("=" * 60)

print(f"Total records: {len(master_timeline):,}")
print(f"Total columns: {len(master_timeline.columns)}")

# Per-case distribution
print(f"\nRecords per Case:")
case_counts = master_timeline['case_id'].value_counts().sort_index()
for case_id, count in case_counts.items():
    print(f"  Case {case_id:02d}: {count:>10,}")

# Source breakdown per case
print(f"\nSource Breakdown per Case:")
print(f"{'Case':<8} {'LogFile':<12} {'UsnJrnl':<12} {'Total':<12}")
print("-" * 50)

for case_id in sorted(master_timeline['case_id'].unique()):
    case_data = master_timeline[master_timeline['case_id'] == case_id]
    logfile_count = len(case_data[case_data['source'] == 'LogFile'])
    usnjrnl_count = len(case_data[case_data['source'] == 'UsnJrnl'])
    total_count = len(case_data)
    
    print(f"Case {case_id:02d}  {logfile_count:<12,} {usnjrnl_count:<12,} {total_count:<12,}")

# Data completeness check
print(f"\nData Completeness:")
print(f"  timestamp_primary: {master_timeline['timestamp_primary'].notna().sum():,} / {len(master_timeline):,} ({master_timeline['timestamp_primary'].notna().sum()/len(master_timeline)*100:.2f}%)")
print(f"  fullpath: {master_timeline['fullpath'].notna().sum():,} / {len(master_timeline):,} ({master_timeline['fullpath'].notna().sum()/len(master_timeline)*100:.2f}%)")

print("\n" + "=" * 60)
print("✓ Validation completed successfully!")


Final Master Timeline Validation
Total records: 2,264,521
Total columns: 25

Records per Case:
  Case 01:    242,400
  Case 02:    145,435
  Case 03:    144,938
  Case 04:    226,352
  Case 05:    228,078
  Case 06:    227,209
  Case 07:    147,593
  Case 08:    148,019
  Case 09:    149,569
  Case 10:    149,150
  Case 11:    227,331
  Case 12:    228,447

Source Breakdown per Case:
Case     LogFile      UsnJrnl      Total       
--------------------------------------------------
Case 01  13,763       228,637      242,400     
Case 02  6,049        139,386      145,435     
Case 03  7,262        137,676      144,938     
Case 04  4,859        221,493      226,352     
Case 05  4,909        223,169      228,078     
Case 06  4,703        222,506      227,209     
Case 07  7,823        139,770      147,593     
Case 08  7,666        140,353      148,019     
Case 09  8,453        141,116      149,569     
Case 10  8,073        141,077      149,150     
Case 11  4,898        222,433      227,331     
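
The per-case loop filters the full timeline once per case. As an alternative, `pd.crosstab` can build the same breakdown in a single pass. A minimal sketch on made-up data (the `timeline` frame here is a stand-in, not the real `master_timeline`):

```python
import pandas as pd

# Made-up stand-in for master_timeline with only the two relevant columns
timeline = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2],
    "source": ["LogFile", "UsnJrnl", "UsnJrnl", "LogFile", "UsnJrnl"],
})

# Rows: case_id, columns: source, plus row and column totals
breakdown = pd.crosstab(
    timeline["case_id"],
    timeline["source"],
    margins=True,
    margins_name="Total",
)

print(breakdown)
```

The `Total` row and column also give a quick cross-check against the overall record counts.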

## 9. Export Master Timeline to CSV

In [15]:
# --- Export Master Timeline ---
print("\nExporting Master Timeline to CSV...")
print("=" * 60)

# Export with proper datetime formatting
master_timeline.to_csv(
    OUTPUT_PATH,
    index=False,
    date_format='%Y-%m-%d %H:%M:%S'
)

# Get file size (OUTPUT_PATH is a pathlib.Path, so no extra import is needed)
file_size_bytes = OUTPUT_PATH.stat().st_size
file_size_mb = file_size_bytes / (1024 * 1024)

print(f"✓ Master Timeline exported successfully!")
print(f"\nFile Details:")
print(f"  Location: {OUTPUT_PATH}")
print(f"  Size: {file_size_mb:.2f} MB")
print(f"  Records: {len(master_timeline):,}")
print(f"  Columns: {len(master_timeline.columns)}")

# Verify file exists
if OUTPUT_PATH.exists():
    print(f"\n✓ File verified at: {OUTPUT_PATH}")
else:
    print(f"\n✗ Error: File not found at {OUTPUT_PATH}")


Exporting Master Timeline to CSV...
✓ Master Timeline exported successfully!

File Details:
  Location: data/processed/Phase 2.1 - Data Merging/Master_Timeline.csv
  Size: 643.65 MB
  Records: 2,264,521
  Columns: 25

✓ File verified at: data/processed/Phase 2.1 - Data Merging/Master_Timeline.csv
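
The check above confirms that the file exists, not that every row was written. For a file of this size, one option is to count the rows on disk in chunks and compare the result with `len(master_timeline)`. The sketch below demonstrates the idea on a small temporary file; `count_csv_rows` is a hypothetical helper, not part of this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def count_csv_rows(path, chunksize=500_000):
    """Count data rows in a CSV by reading only the first column in chunks."""
    reader = pd.read_csv(path, usecols=[0], chunksize=chunksize)
    return sum(len(chunk) for chunk in reader)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_timeline.csv"

    # Made-up miniature timeline
    demo = pd.DataFrame({
        "case_id": [1, 1, 2],
        "source": ["LogFile", "UsnJrnl", "UsnJrnl"],
    })
    demo.to_csv(demo_path, index=False)

    rows_on_disk = count_csv_rows(demo_path, chunksize=2)

print(rows_on_disk == len(demo))
```

Reading a single column in chunks keeps memory use low, since the full timeline never has to be loaded a second time.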


In [16]:
# --- Display sample of merged timeline ---
print("\nSample of Master_Timeline.csv (First 10 rows):")
print("=" * 60)
master_timeline.head(10)


Sample of Master_Timeline.csv (First 10 rows):


Unnamed: 0,case_id,timestamp_primary,source,fullpath,file/directory name,is_timestomped,is_suspicious_execution,lsn,eventtime(utc+8),event,detail,creationtime,modifiedtime,mftmodifiedtime,accessedtime,redo,target vcn,cluster index,has_incomplete_timestamps,timestamp(utc+8),usn,eventinfo,fileattribute,filereferencenumber,parentfilereferencenumber
0,1,2000-01-01 08:00:00+00:00,LogFile,\Program Files (x86)\Dropbox\Client\189.4.8395...,style.js,0,0,8715607000.0,2000-01-01 08:00:00+00:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:14:24 -> 2023-...,2023-12-23 00:14:24+00:00,2000-01-01 08:00:00+00:00,2023-12-23 00:14:52+00:00,2023-12-23 00:14:24+00:00,Update Resident Value,0x174F8,4.0,0.0,NaT,,,,,
1,1,2000-01-01 08:00:00+00:00,LogFile,\Program Files (x86)\Dropbox\Client\189.4.8395...,CalendarUtils.js,0,0,8724385000.0,2000-01-01 08:00:00+00:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:14:24 -> 2023-...,2023-12-23 00:14:24+00:00,2000-01-01 08:00:00+00:00,2023-12-23 00:15:26+00:00,2023-12-23 00:14:24+00:00,Update Resident Value,0x174F3,6.0,0.0,NaT,,,,,
2,1,2000-01-01 08:00:00+00:00,LogFile,\Program Files (x86)\Dropbox\Client\189.4.8395...,StackView.js,0,0,8724811000.0,2000-01-01 08:00:00+00:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:14:24 -> 2023-...,2023-12-23 00:14:24+00:00,2000-01-01 08:00:00+00:00,2023-12-23 00:16:08+00:00,2023-12-23 00:14:24+00:00,Update Resident Value,0x174F8,0.0,0.0,NaT,,,,,
3,1,2010-10-11 14:08:00+00:00,LogFile,\Users\blueangel\AppData\Local\Temp\RarSFX1\Wi...,WinHex.exe,0,0,8724891000.0,2010-10-11 14:08:00+00:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:16:12 -> 2023-...,2023-12-23 00:16:12+00:00,2010-10-11 14:08:00+00:00,2023-12-23 00:16:13+00:00,2023-12-23 00:16:13+00:00,Update Resident Value,0x7E20,4.0,0.0,NaT,,,,,
4,1,2010-10-11 14:08:00+00:00,LogFile,\Users\blueangel\AppData\Local\Temp\RarSFX1\se...,setup.exe,0,0,8725322000.0,2010-10-11 14:08:00+00:00,Updating MFTModified Time,MFTModifiedTime : 2023-12-23 00:16:12 -> 2023-...,2023-12-23 00:16:12+00:00,2010-10-11 14:08:00+00:00,2023-12-23 00:16:17+00:00,2023-12-23 00:16:12+00:00,Update Resident Value,0x7E1F,2.0,0.0,NaT,,,,,
5,1,2019-12-07 22:58:27+00:00,LogFile,\Windows\WinSxS\amd64_windows-shield-provider_...,SecurityHealthHost.exe,0,0,8726152000.0,2019-12-07 22:58:27+00:00,Writing Content of Resident File,Writing Size : 340,2019-12-07 22:55:42+00:00,2019-12-07 22:58:27+00:00,2022-12-16 16:11:29+00:00,2019-12-07 22:58:27+00:00,Update Resident Value,0x4ED2,2.0,0.0,NaT,,,,,
6,1,2019-12-07 22:59:41+00:00,LogFile,\Program Files\WindowsApps\Microsoft.XboxIdent...,clrcompression.dll,0,0,8724616000.0,2019-12-07 22:59:41+00:00,Updating MFTModified Time,MFTModifiedTime : 2022-12-21 01:16:54 -> 2023-...,2019-12-07 22:59:41+00:00,2019-12-07 22:59:41+00:00,2023-12-23 00:15:45+00:00,2023-12-23 00:15:45+00:00,Update Resident Value,0x1C72,2.0,0.0,NaT,,,,,
7,1,2022-09-08 11:18:48+00:00,LogFile,\Windows\WinSxS\amd64_windows-shield-provider_...,SecurityHealthAgent.dll,0,0,8726153000.0,2022-09-08 11:18:48+00:00,Writing Content of Resident File,Writing Size : 264,2022-09-08 11:13:08+00:00,2022-09-08 11:18:48+00:00,2022-12-16 16:11:27+00:00,2022-09-08 11:18:48+00:00,Update Resident Value,0x4C2B,2.0,0.0,NaT,,,,,
8,1,2022-10-27 19:52:30+00:00,LogFile,\Program Files\Common Files\microsoft shared\C...,AppVClient.man,0,0,8726886000.0,2022-10-27 19:52:30+00:00,Updating MFTModified Time,MFTModifiedTime : 2023-04-19 10:15:17 -> 2023-...,2022-12-21 21:44:51+00:00,2022-10-27 19:52:30+00:00,2023-12-23 00:18:03+00:00,2023-12-23 00:17:48+00:00,Update Resident Value,0x147BA,0.0,0.0,NaT,,,,,
9,1,2022-10-27 19:52:30+00:00,LogFile,\Program Files\Common Files\microsoft shared\C...,AppVClientIsv.man,0,0,8726886000.0,2022-10-27 19:52:30+00:00,Updating MFTModified Time,MFTModifiedTime : 2023-04-19 10:15:17 -> 2023-...,2022-12-21 21:44:51+00:00,2022-10-27 19:52:30+00:00,2023-12-23 00:18:03+00:00,2023-12-23 00:17:48+00:00,Update Resident Value,0x147BA,2.0,0.0,NaT,,,,,


## Phase 2.1 - Data Merging Summary

### ✅ Completed Tasks:

1. **Loaded cleaned datasets:**
   - Master_LogFile_Cleaned.csv: 83,458 records
   - Master_UsnJrnl_Cleaned.csv: 2,181,063 records

2. **Standardized structure:**
   - Added `source` column (LogFile/UsnJrnl identifier)
   - Created unified `timestamp_primary` column
   - Standardized `fullpath` column name

3. **Merged datasets:**
   - Combined LogFile + UsnJrnl vertically
   - Preserved all columns from both sources
   - Total: 2,264,521 records

4. **Sorted timeline:**
   - Ordered by case_id, then by timestamp_primary
   - Chronological event ordering within each case

5. **Validated integrity:**
   - All labeled rows preserved
   - Source distribution verified
   - Data completeness checked

### 📊 Master Timeline Metrics:

**Total Records:** 2,264,521
- LogFile: 83,458 (3.7%)
- UsnJrnl: 2,181,063 (96.3%)

**Labeled Rows:** 
- Timestomped: 252
- Suspicious: 16
- Total: 268

**Temporal Coverage:** 2000-01-01 to 2024-01-01 by `timestamp_primary` (8,765 days, ~24 years)

### 🎯 Ready for Phase 3:

**Output File:** `data/processed/Phase 2.1 - Data Merging/Master_Timeline.csv`

**Next Phase:** Feature Engineering
- Calculate time delta features
- Extract temporal patterns
- Prepare features for ML model training