# Feature Engineering on Unified Profiles

**Purpose**:
  1. Load the unified taxpayer profile data created in Notebook 02.
  2. Select relevant base features and engineer new features, focusing on those
     that capture cross-source interactions potentially indicative of fraud
     (e.g., income vs. property value ratios).
  3. Handle missing values that remain after joining and aggregation (e.g.,
     imputing based on column type or meaning).
  4. Encode categorical features (e.g., Sector) into a numerical format.
  5. Scale numerical features to ensure they are suitable for distance-based
     similarity calculations or embedding model inputs.
  6. Produce a final DataFrame containing only the engineered, numerical features
     ready for embedding in the next notebook.

**Prerequisites**:
  - Successful completion of Notebook 02.
  - Existence of the unified profile file: 'unified_taxpayer_profiles.csv'.

**Outputs**:
  - A Pandas DataFrame containing the final, processed features ready for embedding.
  - This DataFrame saved to a CSV file (e.g., 'engineered_features.csv').
  - The corresponding Taxpayer IDs saved separately (e.g., 'taxpayer_ids.csv').

**Next Step**:
  Notebook 04 will use the engineered features to generate vector embeddings.

## Imports and Configuration

In [1]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from IPython.display import display

# --- Configuration ---
PROCESSED_DATA_DIR = './data/processed' # Directory containing N02 output
OUTPUT_DIR = './data/processed' # Directory to save engineered features

UNIFIED_PROFILE_FILE = os.path.join(PROCESSED_DATA_DIR, 'unified_taxpayer_profiles.csv')
FEATURES_OUTPUT_FILE = os.path.join(OUTPUT_DIR, 'engineered_features.csv')
IDS_OUTPUT_FILE = os.path.join(OUTPUT_DIR, 'taxpayer_ids.csv')

# Create output directory if it doesn't exist (should exist from N02, but check)
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Reference date for duration features. datetime.now() changes on every run, so
# switch to the fixed date below when results must be reproducible.
# CURRENT_DATE = pd.to_datetime('2025-04-22') # Example fixed date
CURRENT_DATE = pd.to_datetime(datetime.now())


print("Notebook 03: Feature Engineering on Unified Profiles")
print("-" * 50)
print(f"Loading unified profiles from: {UNIFIED_PROFILE_FILE}")
print(f"Saving engineered features to: {FEATURES_OUTPUT_FILE}")
print(f"Saving taxpayer IDs to: {IDS_OUTPUT_FILE}")
print("-" * 50)

Notebook 03: Feature Engineering on Unified Profiles
--------------------------------------------------
Loading unified profiles from: ./data/processed/unified_taxpayer_profiles.csv
Saving engineered features to: ./data/processed/engineered_features.csv
Saving taxpayer IDs to: ./data/processed/taxpayer_ids.csv
--------------------------------------------------


## Load Unified Profile Data

In [2]:
try:
    unified_df = pd.read_csv(UNIFIED_PROFILE_FILE)
    # Infer datetime columns if they were saved as strings
    date_cols = ['prop_ownership_earliest', 'prop_ownership_latest']
    for col in date_cols:
        if col in unified_df.columns:
            unified_df[col] = pd.to_datetime(unified_df[col], errors='coerce')
    print(f"Successfully loaded unified profile data: {unified_df.shape}")
except FileNotFoundError:
    print(f"ERROR: Unified profile file not found at {UNIFIED_PROFILE_FILE}.")
    print("Please ensure Notebook 02 was run successfully and saved the file.")
    raise

# Keep Taxpayer ID separate early on
if 'Taxpayer ID' not in unified_df.columns:
    print("ERROR: 'Taxpayer ID' column not found in the unified profile data.")
    raise KeyError("'Taxpayer ID' column missing")

taxpayer_ids = unified_df['Taxpayer ID'].copy()
features_df = unified_df.drop(columns=['Taxpayer ID']).copy()
print("Separated Taxpayer IDs.")

Successfully loaded unified profile data: (4906, 14)
Separated Taxpayer IDs.
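
Because the IDs are split off here and later saved to their own file, the features and IDs are matched only by row position. A quick integrity check on the ID column before the split can catch duplicates or gaps early. The sketch below is illustrative only: `validate_profiles` is a hypothetical helper, and the toy frame stands in for the real profile file.

```python
import pandas as pd

def validate_profiles(df: pd.DataFrame, id_col: str = "Taxpayer ID") -> dict:
    """Return simple integrity counts for a profile frame (hypothetical helper)."""
    if id_col not in df.columns:
        raise KeyError(f"'{id_col}' column missing")
    return {
        "rows": len(df),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "missing_ids": int(df[id_col].isna().sum()),
    }

# Toy data standing in for unified_taxpayer_profiles.csv (values are made up)
toy = pd.DataFrame({
    "Taxpayer ID": ["T1", "T2", "T2", None],
    "Declared Income": [50_000, 72_000, 72_000, 31_000],
})
report = validate_profiles(toy)
print(report)
```

Non-zero `duplicate_ids` or `missing_ids` would suggest revisiting the join in Notebook 02 before engineering features.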


## Feature Creation

In [3]:
print("Engineering new features, especially cross-source interactions.")

# --- Define columns for easier handling ---
# Base numerical columns (potentially needing imputation before use in ratios)
base_numeric_cols = [
    'Declared Income', 'Deductions', 'prop_count', 'prop_value_total',
    'prop_value_avg', 'prop_value_max', 'prop_value_min', 'prop_loc_distinct_count',
    'directorship_count', 'comp_distinct_count'
]
# Ensure only existing columns are included (directorship might be optional)
base_numeric_cols = [col for col in base_numeric_cols if col in features_df.columns]

# Base categorical
base_categorical_cols = ['Sector'] # Add more if exist
base_categorical_cols = [col for col in base_categorical_cols if col in features_df.columns]

# Base date cols
base_date_cols = ['prop_ownership_earliest', 'prop_ownership_latest']
base_date_cols = [col for col in base_date_cols if col in features_df.columns]

Engineering new features, especially cross-source interactions.


### Imputing Base Missing Values

In [4]:
# --- Handle Missing Values in Base Columns FIRST ---
# This is important before creating ratios/derived features

# Impute numerical columns with median
num_imputer = SimpleImputer(strategy='median')
features_df[base_numeric_cols] = num_imputer.fit_transform(features_df[base_numeric_cols])
print(f"Imputed missing values in base numerical columns using median: {base_numeric_cols}")

# Impute categorical columns with 'Unknown' (or most frequent)
# Using fillna for simplicity here, SimpleImputer(strategy='most_frequent' or 'constant') is also good
for col in base_categorical_cols:
    fill_val = 'Unknown'
    features_df[col] = features_df[col].fillna(fill_val)
    print(f"Imputed missing values in base categorical column '{col}' with '{fill_val}'.")

# Impute date columns (e.g., with a placeholder or median date if appropriate).
# Dates may already have been filled with a placeholder in N01/N02; if so, they are
# no longer NaT and the check below is a no-op.
# Any NaT that remains must be filled before calculating durations. Note that a
# 1900-01-01 placeholder makes 'prop_days_since_latest' very large for those rows.
for col in base_date_cols:
    if features_df[col].isnull().any():
        # Example: fill with the median date (requires calculating median first)
        # median_date = features_df[col].median()
        # features_df[col] = features_df[col].fillna(median_date)
        # Or fill with a placeholder
        placeholder_date = pd.to_datetime('1900-01-01')
        features_df[col] = features_df[col].fillna(placeholder_date)
        print(f"Imputed missing (NaT) dates in '{col}' with {placeholder_date.date()}.")


Imputed missing values in base numerical columns using median: ['Declared Income', 'Deductions', 'prop_count', 'prop_value_total', 'prop_value_avg', 'prop_value_max', 'prop_value_min', 'prop_loc_distinct_count', 'directorship_count', 'comp_distinct_count']
Imputed missing values in base categorical column 'Sector' with 'Unknown'.
Imputed missing (NaT) dates in 'prop_ownership_earliest' with 1900-01-01.
Imputed missing (NaT) dates in 'prop_ownership_latest' with 1900-01-01.
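
Median imputation erases the information that a value was missing in the first place, and for fraud screening the missingness itself may carry signal. One option, sketched below with pandas only, is to record a 0/1 flag per column before filling. The `_was_missing` suffix and the toy values are assumptions for illustration; scikit-learn's `SimpleImputer` offers an `add_indicator` option that serves a similar purpose.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are made up)
toy = pd.DataFrame({
    "Declared Income": [40_000.0, np.nan, 90_000.0, 60_000.0],
    "prop_value_total": [np.nan, 250_000.0, np.nan, 410_000.0],
})

# 1 where the original value was missing, 0 otherwise
flags = toy.isna().astype(int).add_suffix("_was_missing")

# Fill each column with its own median (NaNs are skipped when computing it)
filled = toy.fillna(toy.median())

result = pd.concat([filled, flags], axis=1)
print(result)
```

If such flags were added to the notebook, they would be binary features and would pass through the later scaling step like the one-hot columns do.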


### Engineering New Features

In [5]:
# Ratios (handle potential division by zero)
epsilon = 1e-6 # Small number to avoid division by zero

# Income vs Property
features_df['income_per_prop_value_total'] = features_df['Declared Income'] / (features_df['prop_value_total'] + epsilon)
features_df['prop_value_total_per_income'] = features_df['prop_value_total'] / (features_df['Declared Income'] + epsilon)

# Deductions vs Income
features_df['deduction_ratio'] = features_df['Deductions'] / (features_df['Declared Income'] + epsilon)
# Clamp the ratio to [0, 2] so near-zero incomes do not produce extreme values (example bounds)
features_df['deduction_ratio'] = features_df['deduction_ratio'].clip(lower=0, upper=2)

# Property characteristics
features_df['prop_value_avg_per_prop'] = features_df['prop_value_total'] / (features_df['prop_count'] + epsilon)

# Directorships vs Income (if directorship data exists)
if 'directorship_count' in features_df.columns:
    features_df['directorships_per_income'] = features_df['directorship_count'] / (features_df['Declared Income'] + epsilon)
    features_df['income_per_directorship'] = features_df['Declared Income'] / (features_df['directorship_count'] + epsilon)

print("Created ratio features.")

# Date-based features (both ownership date columns are required; NaT should be imputed above)
if {'prop_ownership_earliest', 'prop_ownership_latest'}.issubset(features_df.columns):
    features_df['prop_ownership_span_days'] = (features_df['prop_ownership_latest'] - features_df['prop_ownership_earliest']).dt.days
    # Clamp negative spans to 0 (latest < earliest can occur due to imputation/placeholders)
    features_df.loc[features_df['prop_ownership_span_days'] < 0, 'prop_ownership_span_days'] = 0
    # Calculate time since latest purchase
    features_df['prop_days_since_latest'] = (CURRENT_DATE - features_df['prop_ownership_latest']).dt.days
    features_df.loc[features_df['prop_days_since_latest'] < 0, 'prop_days_since_latest'] = 0 # Handle future dates if any
    print("Created date-based features (ownership span, days since latest).")


Created ratio features.
Created date-based features (ownership span, days since latest).
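
Adding `epsilon` prevents a division error, but when the denominator is exactly zero the ratio becomes roughly the numerator times one million, and only `deduction_ratio` is clamped above. The sketch below contrasts that behaviour with one possible alternative: return NaN where the denominator is not positive and let the later imputation step handle it. `safe_ratio` is a hypothetical helper, not part of the notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Divide num by den; NaN where den is missing or not strictly positive."""
    den_valid = den.where(den > 0)  # zero, negative, and NaN denominators become NaN
    return num / den_valid

toy = pd.DataFrame({
    "Declared Income": [50_000.0, 0.0, 80_000.0],
    "prop_value_total": [250_000.0, 300_000.0, 0.0],
})

epsilon = 1e-6
eps_ratio = toy["prop_value_total"] / (toy["Declared Income"] + epsilon)
nan_ratio = safe_ratio(toy["prop_value_total"], toy["Declared Income"])

# Second row: the epsilon version is on the order of 1e11, the safe version is NaN
print(pd.DataFrame({"epsilon": eps_ratio, "safe": nan_ratio}))
```

Which behaviour is preferable depends on whether a zero declared income alongside substantial property should stand out as an extreme value or be treated as undefined.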


### Imputing Derived Feature Missing Values

In [6]:
# Derived features can contain NaN if a component was missing.
# Filling ratios or durations with 0 would be an alternative; this cell uses the median.
new_cols = features_df.columns.difference(base_numeric_cols + base_categorical_cols + base_date_cols)
new_numeric_cols = features_df[new_cols].select_dtypes(include=np.number).columns

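# Replace any inf values with NaN so they are imputed below. With epsilon in every
# denominator the ratios are normally finite (though possibly very large), so this is a safeguard.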
features_df.replace([np.inf, -np.inf], np.nan, inplace=True)

if not features_df[new_numeric_cols].isnull().values.any():
    print("No NaNs found in newly created numerical features.")
else:
    print(f"NaNs found in new numerical features: {features_df[new_numeric_cols].isnull().sum().sum()}. Imputing with median...")
    derived_imputer = SimpleImputer(strategy='median')
    features_df[new_numeric_cols] = derived_imputer.fit_transform(features_df[new_numeric_cols])
    print("Imputed missing values in derived numerical features using median.")

No NaNs found in newly created numerical features.
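
No NaNs appear here partly because missing dates were replaced with the 1900-01-01 placeholder earlier, which turns a missing purchase date into a very long duration rather than a gap. The sketch below, which assumes a fixed reference date and made-up dates, compares that with leaving the dates as NaT, imputing the resulting duration with the median, and keeping a flag for rows that had no date. It is one possible variant, not what the notebook does.

```python
import pandas as pd

CURRENT_DATE = pd.Timestamp("2025-04-22")  # fixed reference date (assumption)
latest = pd.Series(pd.to_datetime(["2020-03-01", None, "2015-07-15"]))

# Placeholder approach: the missing date becomes 1900-01-01, so the duration is enormous
placeholder = pd.Timestamp("1900-01-01")
with_placeholder = (CURRENT_DATE - latest.fillna(placeholder)).dt.days

# Alternative: keep NaT, obtain a NaN duration, then fill with the median of real values
raw_days = (CURRENT_DATE - latest).dt.days
imputed_days = raw_days.fillna(raw_days.median())

# Flag that preserves the "no purchase date on record" signal
had_date = latest.notna().astype(int)

print(pd.DataFrame({
    "placeholder_days": with_placeholder,
    "imputed_days": imputed_days,
    "had_date": had_date,
}))
```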


## Encode Categorical Features

In [7]:
if base_categorical_cols:
    print(f"One-Hot Encoding categorical columns: {base_categorical_cols}")
    # sparse_output=False returns a dense array (parameter name requires scikit-learn >= 1.2)
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    # Fit and transform
    encoded_data = encoder.fit_transform(features_df[base_categorical_cols])

    # Create a new DataFrame with encoded columns
    encoded_df = pd.DataFrame(encoded_data, index=features_df.index, columns=encoder.get_feature_names_out(base_categorical_cols))

    # Drop original categorical columns and concatenate encoded ones
    features_df = pd.concat([features_df.drop(columns=base_categorical_cols), encoded_df], axis=1)
    print(f"Added {len(encoded_df.columns)} one-hot encoded columns. Dropped original categorical columns.")
    print(f"Shape after encoding: {features_df.shape}")
else:
    print("No categorical columns found to encode.")

One-Hot Encoding categorical columns: ['Sector']
Added 10 one-hot encoded columns. Dropped original categorical columns.
Shape after encoding: (4906, 30)


## Scale Numerical Features

In [8]:
# Identify all numerical columns (including the newly encoded ones).
# select_dtypes(include=np.number) already excludes datetime columns such as the original dates.
numerical_features = features_df.select_dtypes(include=np.number).columns.tolist()
print(f"Scaling {len(numerical_features)} numerical features using StandardScaler.")

# Remove date columns if they weren't dropped earlier and aren't needed as features
cols_to_drop = [col for col in base_date_cols if col in features_df.columns]
if cols_to_drop:
    features_df = features_df.drop(columns=cols_to_drop)
    print(f"Dropped original date columns: {cols_to_drop}")
    numerical_features = [col for col in numerical_features if col not in cols_to_drop]


# Initialize and apply scaler
scaler = StandardScaler()
features_df[numerical_features] = scaler.fit_transform(features_df[numerical_features])

print("Numerical features scaled to have zero mean and unit variance.")
print("Sample scaled data description:")
display(features_df[numerical_features].describe().round(2))

Scaling 28 numerical features using StandardScaler.
Dropped original date columns: ['prop_ownership_earliest', 'prop_ownership_latest']
Numerical features scaled to have zero mean and unit variance.
Sample scaled data description:


,Declared Income,Deductions,prop_count,prop_value_total,prop_value_avg,prop_value_max,prop_value_min,prop_loc_distinct_count,directorship_count,comp_distinct_count,...,Sector_Construction,Sector_Education,Sector_Finance,Sector_Healthcare,Sector_Manufacturing,Sector_Other,Sector_Retail,Sector_Services,Sector_Technology,Sector_Unknown
count,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,...,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0,4906.0
mean,-0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,...,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,-0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-1.16,-1.11,-0.83,-0.58,-1.09,-1.11,-0.78,-0.87,-0.29,-0.29,...,-0.32,-0.27,-0.33,-0.35,-0.28,-0.22,-0.41,-0.45,-0.41,-0.25
25%,-0.71,-0.67,-0.83,-0.58,-0.27,-0.3,-0.27,-0.87,-0.29,-0.29,...,-0.32,-0.27,-0.33,-0.35,-0.28,-0.22,-0.41,-0.45,-0.41,-0.25
50%,-0.31,-0.32,0.06,-0.46,-0.22,-0.25,-0.22,0.16,-0.29,-0.29,...,-0.32,-0.27,-0.33,-0.35,-0.28,-0.22,-0.41,-0.45,-0.41,-0.25
75%,0.38,0.31,0.94,0.19,-0.16,-0.18,-0.17,0.16,-0.29,-0.29,...,-0.32,-0.27,-0.33,-0.35,-0.28,-0.22,-0.41,-0.45,-0.41,-0.25
max,2.99,6.04,5.38,10.87,7.39,5.46,8.2,4.29,8.39,8.41,...,3.11,3.71,3.04,2.86,3.64,4.56,2.45,2.21,2.46,3.93
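
The `Sector_*` columns show the same value for min, 25%, 50%, and 75% because a standardized 0/1 column can take only two values. If a fraction p of rows equals 1, the column mean is p and the population standard deviation (which `StandardScaler` uses) is sqrt(p(1 - p)), so 0 maps to -sqrt(p / (1 - p)) and 1 maps to sqrt((1 - p) / p). The sketch below works backwards from the rounded `Sector_Construction` minimum of -0.32; both function names are hypothetical.

```python
import numpy as np

def binary_zscores(p: float) -> tuple[float, float]:
    """z-scores of the values 0 and 1 in a 0/1 column where a fraction p of rows is 1."""
    std = np.sqrt(p * (1 - p))  # population standard deviation (ddof=0)
    return (0 - p) / std, (1 - p) / std

def share_from_low(z_low: float) -> float:
    """Invert z_low = -sqrt(p / (1 - p)) to recover p."""
    r = z_low ** 2
    return r / (1 + r)

p = share_from_low(-0.32)       # rounded minimum from the table above
low, high = binary_zscores(p)

print(f"share of ones: {p:.3f}, z(0) = {low:.2f}, z(1) = {high:.2f}")
```

With the rounded input this gives a share of roughly 9% and an upper value close to the 3.11 shown in the table; the small difference comes from rounding. The two z-scores always multiply to -1, which is a quick sanity check on any scaled one-hot column.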


## Final Feature Selection & Inspection

In [9]:
# At this point, features_df should contain only the numerical features ready for embedding
# Verify no non-numeric columns remain (except maybe index)
non_numeric_cols = features_df.select_dtypes(exclude=np.number).columns
if len(non_numeric_cols) > 0:
    print(f"Warning: Non-numeric columns still present: {list(non_numeric_cols)}. Dropping them.")
    features_df = features_df.drop(columns=non_numeric_cols)

# Final check for NaNs
if features_df.isnull().values.any():
    print("ERROR: NaNs still present after processing! Check imputation steps.")
    print(features_df.isnull().sum()[features_df.isnull().sum() > 0])
    # Simple final fallback: fill remaining with 0
    # features_df.fillna(0, inplace=True)
    # print("Filled remaining NaNs with 0 as a fallback.")
    raise ValueError("NaNs found in final feature set before saving.")
else:
    print("Final check: No missing values found in the feature set.")

print("\nFinal Engineered Features DataFrame Info:\n")
print(features_df.info())

print("\nFirst 5 rows of final features:")
display(features_df.head())

print(f"\nFinal feature set shape: {features_df.shape} (Rows, Features)")


Final check: No missing values found in the feature set.

Final Engineered Features DataFrame Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4906 entries, 0 to 4905
Data columns (total 28 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Declared Income              4906 non-null   float64
 1   Deductions                   4906 non-null   float64
 2   prop_count                   4906 non-null   float64
 3   prop_value_total             4906 non-null   float64
 4   prop_value_avg               4906 non-null   float64
 5   prop_value_max               4906 non-null   float64
 6   prop_value_min               4906 non-null   float64
 7   prop_loc_distinct_count      4906 non-null   float64
 8   directorship_count           4906 non-null   float64
 9   comp_distinct_count          4906 non-null   float64
 10  income_per_prop_value_total  4906 non-null   float64
 11  prop_value_total_per_income  4906

,Declared Income,Deductions,prop_count,prop_value_total,prop_value_avg,prop_value_max,prop_value_min,prop_loc_distinct_count,directorship_count,comp_distinct_count,...,Sector_Construction,Sector_Education,Sector_Finance,Sector_Healthcare,Sector_Manufacturing,Sector_Other,Sector_Retail,Sector_Services,Sector_Technology,Sector_Unknown
0,2.990028,3.752137,1.828895,0.71609,-0.246165,-0.100092,-0.646172,2.224952,-0.292075,-0.29232,...,-0.322044,-0.269414,-0.328557,-0.349487,-0.27503,-0.219241,-0.408637,-0.452236,2.457384,-0.254757
1,1.242153,1.241963,-0.831218,-0.584093,-0.221618,-0.245723,-0.219293,-0.865831,-0.292075,-0.29232,...,-0.322044,-0.269414,-0.328557,-0.349487,3.635972,-0.219241,-0.408637,-0.452236,-0.406937,-0.254757
2,-0.413438,-0.196272,0.055487,0.166885,0.50372,0.125289,0.904591,0.16443,6.655414,6.669447,...,-0.322044,-0.269414,-0.328557,2.86134,-0.27503,-0.219241,-0.408637,-0.452236,-0.406937,-0.254757
3,-0.562726,-0.156393,-0.831218,-0.584093,-0.221618,-0.245723,-0.219293,-0.865831,-0.292075,-0.29232,...,-0.322044,-0.269414,-0.328557,-0.349487,-0.27503,-0.219241,2.447162,-0.452236,-0.406937,-0.254757
4,-0.661776,-0.885219,0.055487,-0.218229,-0.405619,-0.577894,-0.057969,0.16443,-0.292075,-0.29232,...,-0.322044,-0.269414,-0.328557,-0.349487,-0.27503,-0.219241,-0.408637,-0.452236,2.457384,-0.254757



Final feature set shape: (4906, 28) (Rows, Features)


## Save Processed Features and IDs

In [10]:
try:
    # Save the features (without index)
    features_df.to_csv(FEATURES_OUTPUT_FILE, index=False)
    print(f"Successfully saved engineered features to: {FEATURES_OUTPUT_FILE}")

    # Save the corresponding Taxpayer IDs
    pd.DataFrame({'Taxpayer ID': taxpayer_ids}).to_csv(IDS_OUTPUT_FILE, index=False)
    print(f"Successfully saved Taxpayer IDs to: {IDS_OUTPUT_FILE}")

except Exception as e:
    print(f"ERROR saving processed data files: {e}")
    raise  # Re-raise so a failed save is not followed by the success summary below

print("-" * 50)

Successfully saved engineered features to: ./data/processed/engineered_features.csv
Successfully saved Taxpayer IDs to: ./data/processed/taxpayer_ids.csv
--------------------------------------------------
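
Since the features and the IDs are written to separate files without an index, they can only be re-associated by row order. A short round-trip check, sketched below on toy data in a temporary directory, confirms that both files reload with the same number of rows and that the IDs are unique. The file names mirror the ones above; the frame contents are made up.

```python
import os
import tempfile
import pandas as pd

features = pd.DataFrame({"f1": [0.1, -1.2, 0.7], "f2": [1.0, 0.0, -1.0]})
ids = pd.Series(["T1", "T2", "T3"], name="Taxpayer ID")

with tempfile.TemporaryDirectory() as tmp:
    feat_path = os.path.join(tmp, "engineered_features.csv")
    ids_path = os.path.join(tmp, "taxpayer_ids.csv")
    features.to_csv(feat_path, index=False)
    ids.to_frame().to_csv(ids_path, index=False)

    # Reload both files while the temporary directory still exists
    features_back = pd.read_csv(feat_path)
    ids_back = pd.read_csv(ids_path)

rows_match = len(features_back) == len(ids_back)
ids_unique = ids_back["Taxpayer ID"].is_unique
print(f"rows match: {rows_match}, ids unique: {ids_unique}")
```

Running an equivalent check at the start of Notebook 04 would guard against the two files drifting apart if one is regenerated without the other.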


## Conclusion

In [11]:
print("Notebook 03 finished.")
print("Successfully performed feature engineering on the unified profiles:")
print("  - Loaded unified data.")
print("  - Created new features (ratios, date-based interactions).")
print("  - Handled missing values through imputation.")
print("  - Encoded categorical features using One-Hot Encoding.")
print("  - Scaled all numerical features using StandardScaler.")
print(f"\nFinal feature matrix shape: {features_df.shape}")
print("The processed feature set and corresponding IDs are saved.")
print("\nReady to proceed to Notebook 04: Generating Unified Profile Vector Embeddings.")

Notebook 03 finished.
Successfully performed feature engineering on the unified profiles:
  - Loaded unified data.
  - Created new features (ratios, date-based interactions).
  - Handled missing values through imputation.
  - Encoded categorical features using One-Hot Encoding.
  - Scaled all numerical features using StandardScaler.

Final feature matrix shape: (4906, 28)
The processed feature set and corresponding IDs are saved.

Ready to proceed to Notebook 04: Generating Unified Profile Vector Embeddings.
