# 02 - Data Preprocessing & Feature Engineering

**Objective:** Apply minimal preprocessing while retaining as many useful features as possible

**Strategy:**
1. Filter to Single Family Residential only (skipped in this run to test all property types)
2. Remove ONLY true leakage (ListPrice, dates, agent/office names, IDs)
3. Keep geographic features (City, County, MLSArea, School districts, etc.)
4. Handle missing values with a lenient threshold (drop columns with more than 60% missing)
5. One-hot encode categoricals without dropping the first level (drop_first=False); target encode columns with more than 600 unique values
6. Light feature engineering
7. Target outlier removal

**Input:** `filled_data/train_raw.csv`, `filled_data/test_raw.csv`  
**Output:** `X_train.csv`, `X_test.csv`, `y_train.csv`, `y_test.csv` (written to both `filled_data/` and `models/`), plus `models/feature_schema.json`

In [14]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime
import json

print(f"Notebook run: {datetime.now().isoformat()}")

Notebook run: 2025-10-24T09:47:05.242082


In [None]:
# Paths (works on Windows and Linux/Amarel)
from pathlib import Path

# Use current working directory as ROOT
ROOT = Path.cwd()
RAW_DATA_DIR = ROOT / 'filled_data'  # Load from filled_data to get lat/lon columns
DATA_DIR = ROOT / 'data'  # For saving processed data
MODELS_DIR = ROOT / 'models'

train_path = RAW_DATA_DIR / 'train_raw.csv'  # Load from filled_data/
test_path = RAW_DATA_DIR / 'test_raw.csv'

print(f"Loading data from:")
print(f"  Train: {train_path}")
print(f"  Test: {test_path}")

Loading data from:
  Train: c:\Users\lpnhu\Downloads\home-price-prediction\filled_data\train_raw.csv
  Test: c:\Users\lpnhu\Downloads\home-price-prediction\filled_data\test_raw.csv


In [16]:
# Load data (already split by month in notebook 01)
df_train = pd.read_csv(train_path, low_memory=False)
df_test = pd.read_csv(test_path, low_memory=False)

print(f"Training data: {df_train.shape}")
print(f"Test data: {df_test.shape}")
print(f"\nNote: This is the original raw data with {df_train.shape[1]} columns")
print(f"We will maximize feature count through lenient preprocessing")

Training data: (151830, 81)
Test data: (22972, 81)

Note: This is the original raw data with 81 columns
We will maximize feature count through lenient preprocessing


## Step 1: Filter to Single Family Residential

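Restricting the data to one homogeneous property type is intended to improve predictions. In this run the filter is skipped to test whether using all property types performs better.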

In [17]:
# TESTING: Skip SFR filter to see if all property types perform better
print(f"Skipping SFR filter - using all property types")
print(f"Train: {len(df_train):,}, Test: {len(df_test):,}")


Skipping SFR filter - using all property types
Train: 151,830, Test: 22,972
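
For reference, the skipped filter could look roughly like the sketch below. The notebook does not show which column or label identifies Single Family Residential listings, so `SFR_COLUMN` and `SFR_LABEL` are hypothetical placeholders that would need to be checked against the actual data.

```python
import pandas as pd

# Hypothetical column and label: the notebook does not show the real values.
SFR_COLUMN = "PropertySubType"
SFR_LABEL = "SingleFamilyResidence"


def filter_sfr(df: pd.DataFrame, column: str = SFR_COLUMN, label: str = SFR_LABEL) -> pd.DataFrame:
    """Keep rows whose `column` equals `label`, ignoring case and surrounding whitespace."""
    if column not in df.columns:
        # Nothing to filter on, so return the frame unchanged
        return df
    normalized = df[column].astype("string").str.strip().str.lower()
    # Missing values compare as <NA>; treat them as "not SFR"
    mask = (normalized == label.lower()).fillna(False)
    return df[mask].reset_index(drop=True)


toy = pd.DataFrame({
    "PropertySubType": ["SingleFamilyResidence", "Condominium", None, " singlefamilyresidence "],
    "ClosePrice": [500_000, 300_000, 250_000, 450_000],
})
sfr_only = filter_sfr(toy)
print(f"Kept {len(sfr_only)} of {len(toy)} rows")
```

Applying the same function to both `df_train` and `df_test` would keep the two sets consistent.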


## Step 2: Define Target and Remove Leakage Features

In [18]:
# Define ONLY true leakage features - be conservative but thorough!
LEAKAGE_FEATURES = [
    # Price-related (direct leakage)
    'ListPrice', 'OriginalListPrice',
    
    # Date/time and days-on-market fields (listed explicitly; matched by name below)
    'CloseDate', 'DaysOnMarket', 'DOM', 'CDOM',
    'ModificationTimestamp', 'StatusChangeTimestamp', 'OnMarketTimestamp',
    'ContractDate', 'StatusChangeDate', 'PurchaseContractDate',
    'ListingContractDate', 'ContractStatusChangeDate',
    
    # Agent/Office names (high cardinality, not useful)
    'ListAgentEmail', 'ListAgentFirstName', 'ListAgentLastName',
    'BuyerAgentEmail', 'BuyerAgentFirstName', 'BuyerAgentLastName',
    'CoListAgentFirstName', 'CoListAgentLastName',
    'ListOfficeName', 'BuyerOfficeName',
    
    # Unique IDs
    'ListingId', 'ListingKey', 'MLSNumber',
    'Matrix_Unique_ID', 'UniversalPropertyId',
    
    # Address (too unique)
    'UnparsedAddress', 'StreetAddress', 'StreetName', 'StreetNumber',
    
    # Text remarks
    'PublicRemarks', 'PrivateRemarks', 'Directions',
    
    # Source marker
    '_source_file'
]

TARGET = 'ClosePrice'

print(f"Defined {len(LEAKAGE_FEATURES)} leakage features to remove")
print("Keeping: City, County, PostalCode, PropertyType, YN features, AgentAOR, Schools, Flooring")

Defined 37 leakage features to remove
Keeping: City, County, PostalCode, PropertyType, YN features, AgentAOR, Schools, Flooring


In [19]:
def remove_leakage(df, target_col=TARGET):
    """Remove true leakage features"""
    drop_cols = []
    
    for col in df.columns:
        if col == target_col:
            continue
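        # Case-insensitive substring match: a column is dropped if its name
        # contains any listed name (e.g. 'CoBuyerAgentFirstName' matches
        # 'BuyerAgentFirstName'), so short entries like 'DOM' match broadly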
        col_lower = col.lower()
        if any(leak.lower() in col_lower for leak in LEAKAGE_FEATURES):
            drop_cols.append(col)
    
    print(f"Dropping {len(drop_cols)} leakage columns:")
    for col in sorted(drop_cols)[:15]:
        print(f"  - {col}")
    if len(drop_cols) > 15:
        print(f"  ... and {len(drop_cols)-15} more")
    
    return df.drop(columns=drop_cols, errors='ignore')

# Separate target and features
y_train = pd.to_numeric(df_train[TARGET], errors='coerce')
y_test = pd.to_numeric(df_test[TARGET], errors='coerce')

X_train = remove_leakage(df_train.drop(columns=[TARGET], errors='ignore'))
X_test = remove_leakage(df_test.drop(columns=[TARGET], errors='ignore'))

print(f"\nFeature matrix shapes after leakage removal:")
print(f"  X_train: {X_train.shape}")
print(f"  X_test: {X_test.shape}")

Dropping 24 leakage columns:
  - BuyerAgentFirstName
  - BuyerAgentLastName
  - BuyerOfficeName
  - CloseDate
  - CoBuyerAgentFirstName
  - CoListAgentFirstName
  - CoListAgentLastName
  - CoListOfficeName
  - ContractStatusChangeDate
  - DaysOnMarket
  - ListAgentEmail
  - ListAgentFirstName
  - ListAgentLastName
  - ListOfficeName
  - ListPrice
  ... and 9 more
Dropping 24 leakage columns:
  - BuyerAgentFirstName
  - BuyerAgentLastName
  - BuyerOfficeName
  - CloseDate
  - CoBuyerAgentFirstName
  - CoListAgentFirstName
  - CoListAgentLastName
  - CoListOfficeName
  - ContractStatusChangeDate
  - DaysOnMarket
  - ListAgentEmail
  - ListAgentFirstName
  - ListAgentLastName
  - ListOfficeName
  - ListPrice
  ... and 9 more

Feature matrix shapes after leakage removal:
  X_train: (151830, 56)
  X_test: (22972, 56)
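
The output above includes columns such as `CoBuyerAgentFirstName` that are not in `LEAKAGE_FEATURES`, because `remove_leakage` tests for substrings rather than exact names. The sketch below contrasts the two matching rules on a toy column list. `RandomScore` is a made-up column name, and `exact_matches` is an illustrative alternative, not something the notebook uses.

```python
LEAKS = ["ListPrice", "BuyerAgentFirstName", "DOM"]  # small illustrative subset


def substring_matches(columns, leaks):
    """Columns whose lowercased name contains any lowercased leak name."""
    lowered = [leak.lower() for leak in leaks]
    return [c for c in columns if any(leak in c.lower() for leak in lowered)]


def exact_matches(columns, leaks):
    """Columns whose lowercased name equals a lowercased leak name."""
    lowered = {leak.lower() for leak in leaks}
    return [c for c in columns if c.lower() in lowered]


# 'RandomScore' is hypothetical; it contains "dom" and so matches 'DOM'
cols = ["ListPrice", "OriginalListPrice", "CoBuyerAgentFirstName", "RandomScore", "LivingArea"]

print("substring:", substring_matches(cols, LEAKS))
print("exact:    ", exact_matches(cols, LEAKS))
```

Substring matching is convenient for catching prefixed variants, but it is worth reviewing the full list of dropped columns to confirm that no legitimate feature was removed by a short pattern.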

## Step 3: Feature Engineering (Light)

In [20]:
# Create a few useful engineered features
current_year = datetime.now().year

# BuildingAge
if 'YearBuilt' in X_train.columns:
    X_train['BuildingAge'] = current_year - pd.to_numeric(X_train['YearBuilt'], errors='coerce')
    X_test['BuildingAge'] = current_year - pd.to_numeric(X_test['YearBuilt'], errors='coerce')
    X_train['BuildingAge'] = X_train['BuildingAge'].clip(lower=0)
    X_test['BuildingAge'] = X_test['BuildingAge'].clip(lower=0)
    print("Created BuildingAge")

# TotalRooms
if 'BedroomsTotal' in X_train.columns and 'BathroomsTotalInteger' in X_train.columns:
    X_train['TotalRooms'] = (pd.to_numeric(X_train['BedroomsTotal'], errors='coerce').fillna(0) +
                              pd.to_numeric(X_train['BathroomsTotalInteger'], errors='coerce').fillna(0))
    X_test['TotalRooms'] = (pd.to_numeric(X_test['BedroomsTotal'], errors='coerce').fillna(0) +
                             pd.to_numeric(X_test['BathroomsTotalInteger'], errors='coerce').fillna(0))
    print("Created TotalRooms")

# HasGarage
if 'GarageSpaces' in X_train.columns:
    X_train['HasGarage'] = (pd.to_numeric(X_train['GarageSpaces'], errors='coerce').fillna(0) > 0).astype(int)
    X_test['HasGarage'] = (pd.to_numeric(X_test['GarageSpaces'], errors='coerce').fillna(0) > 0).astype(int)
    print("Created HasGarage")

print(f"\nShape after feature engineering: {X_train.shape}")

Created BuildingAge
Created TotalRooms
Created HasGarage

Shape after feature engineering: (151830, 59)
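
Because `current_year` comes from `datetime.now()`, `BuildingAge` shifts by one whenever the notebook is re-run in a later year. A minimal sketch of the same three features as a reusable function with an explicit reference year is shown below. The function name `add_light_features` and the `reference_year` parameter are assumptions introduced for illustration.

```python
import pandas as pd


def add_light_features(df: pd.DataFrame, reference_year: int) -> pd.DataFrame:
    """Return a copy of `df` with BuildingAge, TotalRooms and HasGarage added where possible."""
    out = df.copy()
    if "YearBuilt" in out.columns:
        year_built = pd.to_numeric(out["YearBuilt"], errors="coerce")
        # Negative ages (YearBuilt in the future) are clipped to 0
        out["BuildingAge"] = (reference_year - year_built).clip(lower=0)
    if {"BedroomsTotal", "BathroomsTotalInteger"} <= set(out.columns):
        beds = pd.to_numeric(out["BedroomsTotal"], errors="coerce").fillna(0)
        baths = pd.to_numeric(out["BathroomsTotalInteger"], errors="coerce").fillna(0)
        out["TotalRooms"] = beds + baths
    if "GarageSpaces" in out.columns:
        spaces = pd.to_numeric(out["GarageSpaces"], errors="coerce").fillna(0)
        out["HasGarage"] = (spaces > 0).astype(int)
    return out


toy = pd.DataFrame({
    "YearBuilt": [2000, "bad", 2030],
    "BedroomsTotal": [3, None, 2],
    "BathroomsTotalInteger": [2, 1, None],
    "GarageSpaces": [2, 0, None],
})
features = add_light_features(toy, reference_year=2025)
print(features[["BuildingAge", "TotalRooms", "HasGarage"]])
```

Calling the function once for train and once for test with the same `reference_year` keeps the two sets consistent and makes the result reproducible.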


## Step 4: Handle Missing Values (Lenient Threshold)

In [None]:
# Drop columns with >60% missing (lenient to keep more features)
MISSING_THRESHOLD = 0.60

missing_pct_train = X_train.isnull().mean()
high_missing_cols = missing_pct_train[missing_pct_train > MISSING_THRESHOLD].index.tolist()

print(f"Dropping {len(high_missing_cols)} columns with >{MISSING_THRESHOLD*100}% missing:")
for col in high_missing_cols[:10]:
    print(f"  - {col}: {missing_pct_train[col]*100:.1f}% missing")
if len(high_missing_cols) > 10:
    print(f"  ... and {len(high_missing_cols)-10} more")

X_train = X_train.drop(columns=high_missing_cols)
X_test = X_test.drop(columns=high_missing_cols, errors='ignore')

print(f"\nShape after dropping high-missing columns: {X_train.shape}")

Dropping 19 columns with >60.0% missing:
  - WaterfrontYN: 99.9% missing
  - BasementYN: 98.3% missing
  - FireplacesTotal: 100.0% missing
  - AssociationFeeFrequency: 70.8% missing
  - AboveGradeFinishedArea: 100.0% missing
  - TaxAnnualAmount: 99.6% missing
  - ElementarySchool: 89.1% missing
  - BuilderName: 96.2% missing
  - SubdivisionName: 64.5% missing
  - TaxYear: 99.9% missing
  ... and 9 more

Shape after dropping high-missing columns: (151830, 40)
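
The key point in this step is that the list of columns to drop is decided on the training set only and then applied unchanged to the test set. The toy example below sketches that pattern; the column names and the helper `high_missing_columns` are made up for illustration.

```python
import numpy as np
import pandas as pd


def high_missing_columns(train: pd.DataFrame, threshold: float) -> list:
    """Names of columns whose missing fraction in `train` is strictly above `threshold`."""
    missing_fraction = train.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


toy_train = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "partly_missing": [np.nan, np.nan, 3.0, 4.0, 5.0],        # 40% missing
    "complete": [1.0, 2.0, 3.0, 4.0, 5.0],                    # 0% missing
})
toy_test = toy_train.copy()

to_drop = high_missing_columns(toy_train, threshold=0.60)
toy_train = toy_train.drop(columns=to_drop)
# errors="ignore" tolerates a test frame that lacks one of the columns
toy_test = toy_test.drop(columns=to_drop, errors="ignore")
print(to_drop, list(toy_train.columns))
```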


## Step 5: Encode Categorical Variables (Keep ALL Categories)

In [22]:
# Debug: Check shapes before outlier removal
print(f"\n=== DEBUG: Before outlier removal ===")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train type: {type(y_train)}, shape: {y_train.shape if hasattr(y_train, 'shape') else len(y_train)}")
print(f"y_test type: {type(y_test)}, shape: {y_test.shape if hasattr(y_test, 'shape') else len(y_test)}")
print(f"y_train valid count: {y_train.notna().sum() if hasattr(y_train, 'notna') else 'N/A'}")
print(f"y_test valid count: {y_test.notna().sum() if hasattr(y_test, 'notna') else 'N/A'}")


=== DEBUG: Before outlier removal ===
X_train shape: (151830, 40)
X_test shape: (22972, 40)
y_train type: <class 'pandas.core.series.Series'>, shape: (151830,)
y_test type: <class 'pandas.core.series.Series'>, shape: (22972,)
y_train valid count: 151828
y_test valid count: 22972


In [23]:
# Identify categorical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical columns: {len(categorical_cols)}")

# Use target encoding for high cardinality (>600 unique values)
# This matches Steph's approach and prevents too many sparse one-hot features
HIGH_CARD_THRESHOLD = 600
target_encode_cols = []
onehot_cols = []

for col in categorical_cols:
    n_unique = X_train[col].nunique()
    if n_unique > HIGH_CARD_THRESHOLD:
        target_encode_cols.append(col)
        print(f"  Target encoding: {col} ({n_unique} unique)")
    else:
        onehot_cols.append(col)
        print(f"  One-hot encoding: {col} ({n_unique} unique)")

print(f"\nTarget encoding: {len(target_encode_cols)} columns")
print(f"One-hot encoding: {len(onehot_cols)} columns")

# Target encoding for very high cardinality
global_mean = y_train.mean()
alpha = 10  # smoothing

for col in target_encode_cols:
    # Create mapping from training data
    stats = X_train[[col]].assign(target=y_train.values).groupby(col).agg(
        count=('target', 'size'),
        mean=('target', 'mean')
    )
    stats['smoothed'] = (stats['count'] * stats['mean'] + alpha * global_mean) / (stats['count'] + alpha)
    
    X_train[f'{col}_target'] = X_train[col].map(stats['smoothed']).fillna(global_mean)
    X_test[f'{col}_target'] = X_test[col].map(stats['smoothed']).fillna(global_mean)
    
    # Drop original categorical column
    X_train = X_train.drop(columns=[col])
    X_test = X_test.drop(columns=[col])

# One-hot encode remaining categoricals - KEEP ALL CATEGORIES (drop_first=False)
if onehot_cols:
    print(f"\nOne-hot encoding {len(onehot_cols)} columns...")
    X_train = pd.get_dummies(X_train, columns=onehot_cols, drop_first=False, dummy_na=False)
    X_test = pd.get_dummies(X_test, columns=onehot_cols, drop_first=False, dummy_na=False)
    
    # Align columns
    train_cols = set(X_train.columns)
    test_cols = set(X_test.columns)
    
    for col in train_cols - test_cols:
        X_test[col] = 0
    for col in test_cols - train_cols:
        X_train[col] = 0
    
    X_test = X_test[X_train.columns]
    
    print(f"After encoding: {X_train.shape[1]} features!")

print(f"\nFinal feature count: {X_train.shape[1]}")

Categorical columns: 21
  One-hot encoding: BuyerAgentAOR (53 unique)
  One-hot encoding: ListAgentAOR (53 unique)
  One-hot encoding: Flooring (258 unique)
  One-hot encoding: ViewYN (2 unique)
  One-hot encoding: PoolPrivateYN (2 unique)
  One-hot encoding: PropertyType (8 unique)
  Target encoding: ListAgentFullName (49994 unique)
  Target encoding: BuyerAgentMlsId (65098 unique)
  Target encoding: MLSAreaMajor (1046 unique)
  One-hot encoding: CountyOrParish (59 unique)
  One-hot encoding: MlsStatus (1 unique)
  One-hot encoding: AttachedGarageYN (2 unique)
  One-hot encoding: PropertySubType (34 unique)
  One-hot encoding: BuyerOfficeAOR (60 unique)
  Target encoding: City (1009 unique)
  One-hot encoding: StateOrProvince (13 unique)
  One-hot encoding: FireplaceYN (2 unique)
  One-hot encoding: Levels (17 unique)
  One-hot encoding: NewConstructionYN (2 unique)
  One-hot encoding: HighSchoolDistrict (419 unique)
  Target encoding: PostalCode (2127 unique)

Target encoding: 5 columns
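
The smoothing in the cell above pulls each category's mean toward the global mean, with categories that have few rows pulled hardest: `smoothed = (count * mean + alpha * global_mean) / (count + alpha)`. The sketch below works through that formula on a tiny made-up dataset; the city names, prices, and `alpha=1` are illustrative only (the notebook uses `alpha = 10`).

```python
import pandas as pd


def smoothed_target_map(categories: pd.Series, target: pd.Series, alpha: float):
    """Return (per-category smoothed mean, global mean) computed from training data."""
    global_mean = target.mean()
    stats = (
        pd.DataFrame({"cat": categories.values, "target": target.values})
        .groupby("cat")
        .agg(count=("target", "size"), mean=("target", "mean"))
    )
    smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
    return smoothed, global_mean


train_city = pd.Series(["A", "A", "B"])
train_price = pd.Series([100.0, 200.0, 600.0])  # global mean = 300
test_city = pd.Series(["A", "C"])               # "C" never appears in training

mapping, global_mean = smoothed_target_map(train_city, train_price, alpha=1)
# A: (2 * 150 + 1 * 300) / (2 + 1) = 200
# B: (1 * 600 + 1 * 300) / (1 + 1) = 450
encoded_test = test_city.map(mapping).fillna(global_mean)
print(mapping.to_dict(), encoded_test.tolist())
```

One thing to keep in mind: when the mapping is applied back to the training rows, each row's own target contributes to its encoded value. Computing the encoding out-of-fold is a common way to reduce that effect, though the notebook does not do so.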

## Step 6: Remove Target Outliers

In [24]:
import numpy as np

# Reset indices to ensure alignment
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

print(f"Before outlier removal: X_train={X_train.shape}, y_train={len(y_train)}")

# Remove extreme outliers from training set
y_valid_train = y_train.dropna()
p_low = np.percentile(y_valid_train, 0.5)
p_high = np.percentile(y_valid_train, 99.5)
keep_mask = (y_train >= p_low) & (y_train <= p_high) & y_train.notna()

print(f"Outlier bounds: ${p_low:,.0f} to ${p_high:,.0f}")
print(f"Keeping {keep_mask.sum():,} of {len(y_train):,} training samples")

X_train = X_train[keep_mask].reset_index(drop=True)
y_train = y_train[keep_mask].reset_index(drop=True)

# Remove outliers from test set (bounds are computed from the test targets themselves)
y_valid_test = y_test.dropna()
p_low_test = np.percentile(y_valid_test, 0.5)
p_high_test = np.percentile(y_valid_test, 99.5)
keep_mask_test = (y_test >= p_low_test) & (y_test <= p_high_test) & y_test.notna()

print(f"\nTest outlier bounds: ${p_low_test:,.0f} to ${p_high_test:,.0f}")
print(f"Keeping {keep_mask_test.sum():,} of {len(y_test):,} test samples")

X_test = X_test[keep_mask_test].reset_index(drop=True)
y_test = y_test[keep_mask_test].reset_index(drop=True)

print(f"\nAfter outlier removal:")
print(f"  X_train: {X_train.shape}")
print(f"  X_test: {X_test.shape}")

Before outlier removal: X_train=(151830, 1022), y_train=151830
Outlier bounds: $1,500 to $6,807,595
Keeping 150,311 of 151,830 training samples

Test outlier bounds: $1,500 to $6,750,000
Keeping 22,759 of 22,972 test samples

After outlier removal:
  X_train: (150311, 1022)
  X_test: (22759, 1022)
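
The cell above computes separate percentile bounds for the training and test targets. A variation, sketched below on made-up numbers, derives the bounds from the training targets only and reuses them for the test set, so that no statistic of the test targets influences which test rows are kept. This is an alternative for comparison, not what the notebook does; the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def percentile_bounds(y: pd.Series, lower_pct: float = 0.5, upper_pct: float = 99.5):
    """Lower and upper percentile of the non-missing values of `y`."""
    valid = y.dropna()
    return np.percentile(valid, lower_pct), np.percentile(valid, upper_pct)


def within_bounds(y: pd.Series, low: float, high: float) -> pd.Series:
    """Boolean mask; comparisons with NaN are False, so missing targets are excluded."""
    return (y >= low) & (y <= high)


y_toy_train = pd.Series([1.0, 100.0, 110.0, 120.0, 130.0, np.nan, 10_000.0])
y_toy_test = pd.Series([90.0, 50.0, 6_000.0, np.nan])

# Wide percentiles here only because the toy series is tiny
low, high = percentile_bounds(y_toy_train, lower_pct=10, upper_pct=90)
train_mask = within_bounds(y_toy_train, low, high)
test_mask = within_bounds(y_toy_test, low, high)  # reuse the training bounds

print(f"bounds: {low} to {high}")
print(f"kept {train_mask.sum()} train rows, {test_mask.sum()} test rows")
```

Whether to trim the test set at all depends on how the final metric should be interpreted; trimming it makes reported errors apply only to the retained price range.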



## Step 7: Final Imputation

Fill any remaining missing values in numeric columns with the training-set median, applied to both train and test.

In [25]:
# Impute remaining missing values with median for numeric columns
numeric_cols = X_train.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    median_val = X_train[col].median()
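    # Assign back instead of inplace=True on a column selection, which newer
    # pandas versions warn about and which may not modify the frame
    X_train[col] = X_train[col].fillna(median_val)
    X_test[col] = X_test[col].fillna(median_val)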

print(f"Missing values after imputation:")
print(f"  X_train: {X_train.isnull().sum().sum()}")
print(f"  X_test: {X_test.isnull().sum().sum()}")

Missing values after imputation:
  X_train: 0
  X_test: 0
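
The same imputation can be written without a per-column loop by passing a Series of medians to `DataFrame.fillna`, which fills each column using the value with the matching label. The sketch below assumes toy column names and a hypothetical helper `impute_with_train_median`; the medians are always taken from the training frame.

```python
import numpy as np
import pandas as pd


def impute_with_train_median(train: pd.DataFrame, test: pd.DataFrame):
    """Fill numeric NaNs in both frames with medians computed on `train` only."""
    numeric_cols = train.select_dtypes(include=[np.number]).columns
    medians = train[numeric_cols].median()
    train_out, test_out = train.copy(), test.copy()
    train_out[numeric_cols] = train_out[numeric_cols].fillna(medians)
    test_out[numeric_cols] = test_out[numeric_cols].fillna(medians)
    return train_out, test_out, medians


toy_train = pd.DataFrame({"LivingArea": [1.0, np.nan, 3.0], "Note": ["a", None, "c"]})
toy_test = pd.DataFrame({"LivingArea": [np.nan, 10.0], "Note": [None, "z"]})

train_filled, test_filled, medians = impute_with_train_median(toy_train, toy_test)
print(train_filled["LivingArea"].tolist(), test_filled["LivingArea"].tolist())
```

Non-numeric columns are left untouched, and a numeric column that is entirely missing in the training frame has no median and therefore stays missing.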


## Step 8: Save Preprocessed Data

In [27]:
import json

# Make sure the output folders exist before writing
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Save processed data (filled_data/, the same folder the raw inputs were loaded from)
X_train.to_csv(RAW_DATA_DIR / 'X_train.csv', index=False)
X_test.to_csv(RAW_DATA_DIR / 'X_test.csv', index=False)
y_train.to_csv(RAW_DATA_DIR / 'y_train.csv', index=False)
y_test.to_csv(RAW_DATA_DIR / 'y_test.csv', index=False)

# Also save to models folder for pipeline compatibility
X_train.to_csv(MODELS_DIR / 'X_train.csv', index=False)
X_test.to_csv(MODELS_DIR / 'X_test.csv', index=False)
y_train.to_csv(MODELS_DIR / 'y_train.csv', index=False)
y_test.to_csv(MODELS_DIR / 'y_test.csv', index=False)

# Save feature schema
feature_schema = {
    'feature_columns': list(X_train.columns),
    'n_features': len(X_train.columns),
    'n_train_samples': len(X_train),
    'n_test_samples': len(X_test)
}
with open(ROOT / 'models' / 'feature_schema.json', 'w') as f:
    json.dump(feature_schema, f, indent=2)

print(f"\nPreprocessed data saved!")
print(f"Final feature count: {len(X_train.columns)}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")


Preprocessed data saved!
Final feature count: 1022
Training samples: 150311
Test samples: 22759
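
A downstream notebook can use `feature_schema.json` to confirm that a reloaded feature matrix has the expected columns in the expected order. The sketch below shows one way to do that check; it writes to a temporary directory with toy data, and `check_against_schema` is a hypothetical helper rather than part of the pipeline.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def check_against_schema(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of problems found; an empty list means the frame matches the schema."""
    problems = []
    expected = schema["feature_columns"]
    actual = list(df.columns)
    if set(actual) != set(expected):
        missing = [c for c in expected if c not in actual]
        extra = [c for c in actual if c not in expected]
        problems.append(f"column mismatch: missing={missing}, extra={extra}")
    elif actual != expected:
        problems.append("columns match but are in a different order")
    if schema["n_features"] != len(expected):
        problems.append("n_features does not equal len(feature_columns)")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    X = pd.DataFrame({"LivingArea": [1200, 1500], "HasGarage": [1, 0]})
    X.to_csv(tmp_dir / "X_train.csv", index=False)
    schema = {"feature_columns": list(X.columns), "n_features": X.shape[1]}
    (tmp_dir / "feature_schema.json").write_text(json.dumps(schema, indent=2))

    # Reload both artifacts the way a later notebook would
    reloaded = pd.read_csv(tmp_dir / "X_train.csv")
    loaded_schema = json.loads((tmp_dir / "feature_schema.json").read_text())

problems = check_against_schema(reloaded, loaded_schema)
problems_reordered = check_against_schema(reloaded[["HasGarage", "LivingArea"]], loaded_schema)
print(problems, problems_reordered)
```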
