# Feature Engineering

**Objective:**  
Create derived features from the cleaned EV population dataset to support
comparative analysis and downstream modeling. Feature selection is
guided by insights identified during exploratory data analysis.

**Data Source:**  
Processed dataset generated in `02_cleaning.ipynb`.


In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/cleaned/ev_population_clean.csv")
df.head()

Unnamed: 0,vin_1_10,county,city,state,postal_code,model_year,make,model,electric_vehicle_type,clean_alternative_fuel_vehicle_cafv_eligibility,electric_range,legislative_district,dol_vehicle_id,vehicle_location,electric_utility,2020_census_tract
0,5YJYGDEE8L,Thurston,Tumwater,WA,98501.0,2020,TESLA,MODEL Y,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,291.0,35.0,124633715,POINT (-122.89165 47.03954),PUGET SOUND ENERGY INC,53067010000.0
1,5YJXCAE2XJ,Snohomish,Bothell,WA,98021.0,2018,TESLA,MODEL X,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,238.0,1.0,474826075,POINT (-122.18384 47.8031),PUGET SOUND ENERGY INC,53061050000.0
2,5YJ3E1EBXK,King,Kent,WA,98031.0,2019,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,220.0,47.0,280307233,POINT (-122.17743 47.41185),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),53033030000.0
3,7SAYGDEE4T,King,Issaquah,WA,98027.0,2026,TESLA,MODEL Y,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0.0,41.0,280786565,POINT (-122.03439 47.5301),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),53033020000.0
4,WAUUPBFF9G,King,Seattle,WA,98103.0,2016,AUDI,A3,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,16.0,43.0,198988891,POINT (-122.35436 47.67596),CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA),53033000000.0


In [3]:
df["is_bev"] = df["electric_vehicle_type"].str.contains("BEV", na=False)
df["is_bev"].value_counts(normalize=True)

is_bev
True     0.798703
False    0.201297
Name: proportion, dtype: float64

Battery Electric Vehicles make up approximately 80% of the EV
population in this dataset, so fully electric vehicles far outnumber
Plug-in Hybrid Electric Vehicles.
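
The substring test above works because "BEV" appears in only one of the two labels shown in the preview. A stricter variant, sketched here on toy data, extracts the abbreviation in the trailing parentheses and compares it exactly. The `sample` frame and the `ev_type_code` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame using the two labels visible in the df.head() preview, plus a missing value.
sample = pd.DataFrame({
    "electric_vehicle_type": [
        "Battery Electric Vehicle (BEV)",
        "Plug-in Hybrid Electric Vehicle (PHEV)",
        None,
    ]
})

# Pull the abbreviation out of the trailing parentheses; missing labels stay NaN.
sample["ev_type_code"] = sample["electric_vehicle_type"].str.extract(
    r"\(([A-Z]+)\)\s*$", expand=False
)

# Exact comparison: a missing code compares as False, mirroring na=False above.
sample["is_bev"] = sample["ev_type_code"].eq("BEV")
```

An exact match would keep a future label that merely contains the letters "BEV" from being counted as a battery electric vehicle.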

In [4]:
df["model_year_bin"] = pd.cut(
    df["model_year"],
    # Right-closed intervals: (1990, 2009], (2009, 2014], (2014, 2019], (2019, 2026]
    bins=[1990, 2009, 2014, 2019, 2026],
    labels=["pre_2010", "2010_2014", "2015_2019", "2020_plus"]
)

df["model_year_bin"].value_counts().sort_index()

model_year_bin
pre_2010         32
2010_2014      9240
2015_2019     42846
2020_plus    218144
Name: count, dtype: int64

EV registrations are heavily concentrated in recent model years: about
81% of vehicles (218,144 of 270,262) have a model year of 2020 or later.
Model years before 2015 account for under 4% of the dataset,
highlighting the rapid acceleration of EV adoption in the last decade.
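
By default `pd.cut` builds right-closed intervals and excludes the lowest edge, so a value at or below the first edge, or above the last one, silently becomes `NaN`. The sketch below shows that behavior on a toy series with the same edges and labels, followed by one possible variant with open-ended outer edges. The `years`, `binned`, and `binned_open` names are illustrative and not part of the notebook.

```python
import pandas as pd

labels = ["pre_2010", "2010_2014", "2015_2019", "2020_plus"]
years = pd.Series([1990, 2009, 2010, 2019, 2020, 2026, 2027], name="model_year")

# Same edges as the notebook cell: 1990 and 2027 fall outside every interval.
binned = pd.cut(years, bins=[1990, 2009, 2014, 2019, 2026], labels=labels)

# Open-ended outer edges keep every year inside some bin.
binned_open = pd.cut(
    years,
    bins=[-float("inf"), 2009, 2014, 2019, float("inf")],
    labels=labels,
)

print(binned.isna().sum())       # number of years left unbinned by the fixed edges
print(binned_open.isna().sum())  # number left unbinned with open-ended edges
```

The fixed edges are adequate for the current data, since the `info()` output later in the notebook reports no nulls in `model_year_bin`. The open-ended form only matters if a later data refresh adds model years beyond 2026.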

In [5]:
top_counties = df["county"].value_counts().head(10).index
df["is_top_county"] = df["county"].isin(top_counties)

df["is_top_county"].value_counts(normalize=True)

is_top_county
True     0.911264
False    0.088736
Name: proportion, dtype: float64

EV adoption is highly concentrated geographically, with more than 91% of
registered vehicles located in the top ten counties, indicating that EV
uptake is driven by a small number of high-adoption regions.
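
To see how the concentration figure depends on the cutoff of ten, a cumulative share by county rank can be computed. This is a minimal sketch on toy data: the counts and the `top_k` parameter are made up for illustration and do not reflect the real dataset.

```python
import pandas as pd

# Toy county column; the real notebook uses df["county"].
counties = pd.Series(
    ["King"] * 6 + ["Snohomish"] * 3 + ["Thurston"] * 1, name="county"
)

# Share of vehicles per county, sorted from largest to smallest.
share = counties.value_counts(normalize=True)

# Running total: the k-th entry is the share covered by the top k counties.
cumulative_share = share.cumsum()

# Flag rows that belong to the top_k counties (top_k is a hypothetical parameter).
top_k = 2
is_top_county = counties.isin(share.head(top_k).index)

print(cumulative_share)
```

Inspecting `cumulative_share` shows where the curve flattens, which is one way to check whether ten is a reasonable cutoff.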

In [7]:
# True when a range value is present; a recorded 0.0 still counts as present.
df["has_electric_range"] = df["electric_range"].notna()

range_reporting = (
    df["has_electric_range"]
    .value_counts(normalize=True)
    .rename("proportion")
)

range_reporting


has_electric_range
True     0.999981
False    0.000019
Name: proportion, dtype: float64
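
The `notna()` flag only catches missing values. Row 3 of the preview pairs an `electric_range` of `0.0` with an eligibility label beginning "Eligibility unknown as battery range has not b...", which suggests that some zeros may be placeholders rather than measured ranges. That reading is an assumption. Under it, a three-level status column could be sketched as follows; `range_status` and its level names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values: two reported ranges, one zero, one missing.
sample = pd.DataFrame({"electric_range": [291.0, 0.0, np.nan, 16.0]})

# Conditions are checked in order; the first match wins.
conditions = [
    sample["electric_range"].isna(),  # no value recorded at all
    sample["electric_range"].eq(0),   # zero recorded (possibly a placeholder)
]
choices = ["missing", "zero"]

# Everything else is treated as a reported, positive range.
sample["range_status"] = np.select(conditions, choices, default="reported")
```

Keeping "zero" as its own level defers the decision about whether those rows are genuine zeros or unreported values.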

In [8]:
df[[
    "is_bev",
    "model_year_bin",
    "is_top_county",
    "has_electric_range",
]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270262 entries, 0 to 270261
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype   
---  ------              --------------   -----   
 0   is_bev              270262 non-null  bool    
 1   model_year_bin      270262 non-null  category
 2   is_top_county       270262 non-null  bool    
 3   has_electric_range  270262 non-null  bool    
dtypes: bool(3), category(1)
memory usage: 1.0 MB


In [9]:
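# Drop the stray "Unnamed: 0" index column picked up by read_csv before saving.
df.drop(columns=["Unnamed: 0"], errors="ignore").to_csv("../data/cleaned/ev_population_features.csv", index=False)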
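
CSV stores no dtype information, so the ordered categorical `model_year_bin` comes back as plain text when the file is reloaded. The sketch below shows one way a downstream notebook might restore it, using an in-memory buffer instead of the real file. The `frame`, `reloaded`, and `year_bin_dtype` names are illustrative.

```python
import io

import pandas as pd

labels = ["pre_2010", "2010_2014", "2015_2019", "2020_plus"]

# Toy frame with an ordered categorical, as produced by pd.cut with labels.
frame = pd.DataFrame({
    "model_year_bin": pd.Categorical(
        ["2020_plus", "2015_2019"], categories=labels, ordered=True
    ),
    "is_bev": [True, False],
})

# Round-trip through CSV text in memory.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Restore the ordered categorical so the bins sort chronologically again.
year_bin_dtype = pd.CategoricalDtype(categories=labels, ordered=True)
reloaded["model_year_bin"] = reloaded["model_year_bin"].astype(year_bin_dtype)
```

Boolean columns survive the round trip because `read_csv` parses `True` and `False` back into `bool`.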

## Feature Engineering Summary

Feature engineering translated key exploratory findings into four
structured variables. `is_bev` distinguishes Battery Electric Vehicles
from Plug-in Hybrid Electric Vehicles, `model_year_bin` stabilizes
temporal comparisons through model year grouping, and `is_top_county`
captures the extreme geographic concentration of EV adoption. An
additional reporting indicator, `has_electric_range`, flags rows where
`electric_range` is missing. It does not separate a true zero range from
an unreported one, because rows that record `0.0` still count as
reported. Together, these features reduce noise, prevent misleading
aggregation, and provide a robust foundation for downstream analysis
and interpretation.
