# Phase 2: Variable Engineering (Corrected)

**Goal**: Construct the final dataset for predictive modeling.
**Target Variable**: Behavioral Cluster (from Phase 1)
**Explanatory Variables**: 20 variables, five in each of four dimensions: Spatial Characteristics, Internal Park Design, Network and Topology, and Transit Accessibility.


In [1]:
import pandas as pd
import numpy as np

# Load Full Dataset
df = pd.read_csv('../interim/final_dataset_FULL.csv')
print(f"Loaded FULL dataset with {df.shape[1]} columns.")

Loaded FULL dataset with 111 columns.
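
Before selecting features, it may be worth confirming that the identifier and target columns are usable. The sketch below is illustrative only: `check_keys` is a hypothetical helper, the toy frame uses made-up values, and the check assumes the dataset holds one row per park, which the notebook does not state explicitly.

```python
import pandas as pd

def check_keys(df, id_col="park_name", target_col="cluster"):
    """Return a list of human-readable problems found in the key columns."""
    problems = []
    for col in (id_col, target_col):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        # Without both columns the remaining checks cannot run
        return problems
    if df[id_col].duplicated().any():
        problems.append(f"duplicate values in {id_col}")
    if df[target_col].isna().any():
        problems.append(f"missing values in {target_col}")
    return problems

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({"park_name": ["A", "B", "B"], "cluster": [0, 1, None]})
issues = check_keys(toy)
print(issues)
```

On the real data, `check_keys(df)` returning an empty list would suggest that every park appears once and has a cluster label from Phase 1.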


## Final Feature Selection (Target 20)
The variable-to-column mapping was confirmed against `src/full_columns_list.txt`.

In [2]:
selected_cols = ['park_name', 'cluster']

# Verified Variable Mapping
final_vars = [
    # 1. Spatial Characteristics
    'area_official_m2',
    'visitor_density',
    'management_intensity',
    'infrastructure_index',
    'recreation_index',

    # 2. Internal Park Design
    'restrooms_count',
    'playgrounds_count',
    'exercise_equipment_count',
    'facility_density',
    'workers_count',

    # 3. Network and Topology
    'topo_node_degree',
    'topo_local_efficiency',
    'topo_clustering_coefficient',
    'topo_eccentricity',
    'centrality_score',

    # 4. Transit Accessibility
    'subway_station_count',
    'bus_station_count',
    'total_transit_ridership',
    'transit_accessibility_index',
    'distance_to_center'
]

print("Selecting Variables...")
missing_vars = []
for col in final_vars:
    if col not in df.columns:
        missing_vars.append(col)
        print(f"  [ ] MISSING: {col}")
    else:
        print(f"  [x] Found: {col}")

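if missing_vars:
    # Stop here instead of continuing with an incomplete feature set
    raise KeyError(f"{len(missing_vars)} variables are missing from the dataset: {missing_vars}")
else:
    print(f"\nAll {len(final_vars)} variables found.")

    # Save reduced dataset (identifier and target columns plus the 20 features)
    df_final = df[selected_cols + final_vars].copy()
    # Caution: fillna(0) treats every missing value as a true zero
    df_final = df_final.fillna(0)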

    df_final.to_csv('../interim/final_dataset_18parks.csv', index=False)
    print("Saved final dataset to 'interim/final_dataset_18parks.csv'.")

Selecting Variables...
  [x] Found: area_official_m2
  [x] Found: visitor_density
  [x] Found: management_intensity
  [x] Found: infrastructure_index
  [x] Found: recreation_index
  [x] Found: restrooms_count
  [x] Found: playgrounds_count
  [x] Found: exercise_equipment_count
  [x] Found: facility_density
  [x] Found: workers_count
  [x] Found: topo_node_degree
  [x] Found: topo_local_efficiency
  [x] Found: topo_clustering_coefficient
  [x] Found: topo_eccentricity
  [x] Found: centrality_score
  [x] Found: subway_station_count
  [x] Found: bus_station_count
  [x] Found: total_transit_ridership
  [x] Found: transit_accessibility_index
  [x] Found: distance_to_center

All 20 variables found.
Saved final dataset to 'interim/final_dataset_18parks.csv'.
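
The cell above replaces every missing value with 0 before saving. A quick way to see how much that step changes is to count missing values per column first. The following is a minimal sketch rather than part of the original pipeline: `missing_report` is a hypothetical helper and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(df, cols):
    """Count NaN values per column, keeping only columns that have any."""
    counts = df[cols].isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy frame reusing three of the selected column names (values are made up)
toy = pd.DataFrame({
    "restrooms_count": [2, np.nan, 1],
    "distance_to_center": [np.nan, np.nan, 3.5],
    "workers_count": [4, 5, 6],
})
report = missing_report(toy, list(toy.columns))
print(report)
```

Calling `missing_report(df, final_vars)` before `fillna(0)` would show which features are affected. For a count variable a zero may be a reasonable default, whereas for a measure such as `distance_to_center` a zero reads as a real value, so those columns may deserve a different treatment.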
