<h1>Research Question</h1>
Can we reliably predict a county's electric vehicle (EV) registrations based on vehicles' model year, manufacturer brand, electric range, and other factors in Washington State?

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import (
    root_mean_squared_error, mean_absolute_error,
    mean_absolute_percentage_error, accuracy_score, precision_score,
    recall_score, f1_score, precision_recall_curve,
)
from sklearn import preprocessing

## Data Description
We are using a dataset of the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through the Washington State Department of Licensing (DOL). It provides key information about vehicle registrations, tax exemptions, and eligibility criteria for clean alternative fuel vehicles. The dataset is updated regularly, and the monthly vehicle count is subject to change due to county assignment processes during registration.

A Battery Electric Vehicle (BEV) is an all-electric vehicle that uses one or more batteries to store the electrical energy that powers the motor and is charged by plugging the vehicle into an electric power source. A Plug-in Hybrid Electric Vehicle (PHEV) is a vehicle that uses one or more batteries to power an electric motor; uses another fuel, such as gasoline or diesel, to power an internal combustion engine or other propulsion source; and is charged by plugging the vehicle into an electric power source.

The dataset consists of several columns representing various attributes of each electric vehicle, including:

- VIN: Vehicle Identification Number
- County: The county where the vehicle is registered
- City: The city where the vehicle is registered
- State: The state (WA for Washington)
- Postal Code: The postal code of the registration
- Model Year: The year the vehicle model was manufactured
- Make: The manufacturer brand of the vehicle
- Model: The model of the vehicle

Each row in the dataset describes a specific electric vehicle that is registered in Washington State.

## Data Cleaning

In [2]:
df = pd.read_csv('Combined_Data.csv', encoding='ISO-8859-1')
print(df.shape)
df.head()

(77114, 17)


Unnamed: 0,VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Electric Utility,income_2023,population_2023
0,5YJYGDEE1L,King,Seattle,WA,98122.0,2020,TESLA,MODEL Y,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,291,0,37.0,125701579,CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA),119926.0,6794340.0
1,5YJSA1E4XK,King,Seattle,WA,98109.0,2019,TESLA,MODEL S,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,270,0,36.0,156773144,CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA),119926.0,6794340.0
2,5YJSA1E27G,King,Issaquah,WA,98027.0,2016,TESLA,MODEL S,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,210,0,5.0,165103011,PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),119926.0,6794340.0
3,3FA6P0SU8H,Thurston,Yelm,WA,98597.0,2017,FORD,FUSION,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,21,0,2.0,122057736,PUGET SOUND ENERGY INC,91522.0,766220.0
4,1N4AZ0CP2D,Yakima,Yakima,WA,98903.0,2013,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,75,0,14.0,150126840,PACIFICORP,65167.0,694445.0


In [3]:
aggregated_df = duckdb.sql("""
    SELECT County, COUNT(*) As "EV Count",
    AVG("Electric Range") AS "Average Electric Range",
    AVG("Model Year") AS "Average Model Year",
    MODE("Make") AS "Popular Brand",
    MODE("Model") AS "Popular Model",
    MODE("Electric Vehicle Type") AS "Popular EV Type",
    AVG("income_2023") AS "Average Income",
    AVG("population_2023") AS "Population",
    FROM df
    GROUP BY County
    ORDER BY County ASC
""").df()
aggregated_df.head()

Unnamed: 0,County,EV Count,Average Electric Range,Average Model Year,Popular Brand,Popular Model,Popular EV Type,Average Income,Population
0,Adams,19,131.842105,2018.105263,TESLA,MODEL 3,Battery Electric Vehicle (BEV),64498.0,54015.0
1,Alameda,2,131.0,2020.0,TESLA,MODEL 3,Battery Electric Vehicle (BEV),,
2,Albemarle,2,211.5,2016.0,TESLA,MODEL 3,Battery Electric Vehicle (BEV),,
3,Alexandria,2,326.0,2020.0,TESLA,MODEL S,Battery Electric Vehicle (BEV),,
4,Allen,2,121.5,2017.5,CHRYSLER,PACIFICA,Plug-in Hybrid Electric Vehicle (PHEV),,


In [4]:
filtered_df = duckdb.sql("""
    SELECT * 
    FROM aggregated_df
    WHERE "Average Income" IS NOT NULL AND "Population" IS NOT NULL
""").df()
filtered_df.head()

Unnamed: 0,County,EV Count,Average Electric Range,Average Model Year,Popular Brand,Popular Model,Popular EV Type,Average Income,Population
0,Adams,19,131.842105,2018.105263,TESLA,MODEL 3,Battery Electric Vehicle (BEV),64498.0,54015.0
1,Asotin,41,89.414634,2018.97561,TOYOTA,WRANGLER,Plug-in Hybrid Electric Vehicle (PHEV),67820.0,53745.0
2,Benton,1120,125.328571,2018.716964,TESLA,MODEL 3,Battery Electric Vehicle (BEV),87992.0,608885.0
3,Chelan,510,142.231373,2018.533333,TESLA,LEAF,Battery Electric Vehicle (BEV),84430.0,210625.0
4,Clallam,573,110.722513,2018.186736,CHEVROLET,LEAF,Battery Electric Vehicle (BEV),68924.0,188135.0


In [5]:
filtered_df

Unnamed: 0,County,EV Count,Average Electric Range,Average Model Year,Popular Brand,Popular Model,Popular EV Type,Average Income,Population
0,Adams,19,131.842105,2018.105263,TESLA,MODEL 3,Battery Electric Vehicle (BEV),64498.0,54015.0
1,Asotin,41,89.414634,2018.97561,TOYOTA,WRANGLER,Plug-in Hybrid Electric Vehicle (PHEV),67820.0,53745.0
2,Benton,1120,125.328571,2018.716964,TESLA,MODEL 3,Battery Electric Vehicle (BEV),87992.0,608885.0
3,Chelan,510,142.231373,2018.533333,TESLA,LEAF,Battery Electric Vehicle (BEV),84430.0,210625.0
4,Clallam,573,110.722513,2018.186736,CHEVROLET,LEAF,Battery Electric Vehicle (BEV),68924.0,188135.0
5,Clark,4906,120.988382,2018.681411,TESLA,MODEL 3,Battery Electric Vehicle (BEV),94198.0,1342045.0
6,Columbia,5,234.2,2016.4,CHEVROLET,BOLT EV,Battery Electric Vehicle (BEV),65040.0,10465.0
7,Cowlitz,435,122.981609,2018.790805,TESLA,MODEL 3,Plug-in Hybrid Electric Vehicle (PHEV),74250.0,288895.0
8,Douglas,170,132.117647,2018.417647,TESLA,LEAF,Battery Electric Vehicle (BEV),86676.0,108610.0
9,Ferry,13,183.846154,2018.692308,TESLA,MODEL 3,Battery Electric Vehicle (BEV),58973.0,15600.0


## Preregistration Statements
### Statement 1
**Hypothesis:** Counties with a higher density of BEVs (Battery Electric Vehicles) relative to other types of EVs have higher total EV registrations.

**Analysis:** We will calculate the proportion of BEVs among all EVs for each county. Then, we will fit a linear regression with each county's BEV proportion as the input variable and its total EV registrations as the output. The goal is to test whether the coefficient on BEV proportion indicates a meaningful link between BEV density and overall EV registrations. We chose linear regression because it can detect a direct linear relationship between BEV popularity and overall adoption. We will test whether $\beta_{\text{BEV}} > 0$. If the coefficient is significantly positive, it indicates that counties with a higher BEV proportion tend to have more total EV registrations. Additionally, if a relationship exists, it could inform further investigation into socioeconomic or geographic factors that may affect EV adoption rates.
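The analysis described above could be sketched roughly as follows. This is a minimal illustration, not the final analysis: the `toy` table holds made-up vehicle rows for four hypothetical counties (only the column names come from the registration data). It also uses `scipy.stats.linregress`, which is not among this notebook's imports, as one possible way to get a p-value for the slope.

```python
import pandas as pd
from scipy import stats

BEV = "Battery Electric Vehicle (BEV)"
PHEV = "Plug-in Hybrid Electric Vehicle (PHEV)"

# Toy vehicle-level rows (made-up values, NOT the real registrations).
# Column names mirror the registration dataset.
toy = pd.DataFrame({
    "County": ["A"] * 4 + ["B"] * 2 + ["C"] * 6 + ["D"] * 3,
    "Electric Vehicle Type": (
        [BEV, BEV, BEV, PHEV]                # county A: 3 of 4 are BEVs
        + [BEV, PHEV]                        # county B: 1 of 2
        + [BEV, BEV, BEV, BEV, BEV, PHEV]    # county C: 5 of 6
        + [BEV, BEV, PHEV]                   # county D: 2 of 3
    ),
})

# One row per county: total EV count and the share of those EVs that are BEVs.
per_county = (
    toy.groupby("County")
    .agg(
        ev_count=("Electric Vehicle Type", "size"),
        bev_share=("Electric Vehicle Type", lambda s: (s == BEV).mean()),
    )
    .reset_index()
)

# Simple linear regression: ev_count ~ bev_share.
result = stats.linregress(per_county["bev_share"], per_county["ev_count"])

# linregress reports a two-sided p-value; convert it to a one-sided p-value
# for the alternative hypothesis "slope > 0".
if result.slope > 0:
    p_one_sided = result.pvalue / 2
else:
    p_one_sided = 1 - result.pvalue / 2

print(per_county)
print(f"slope = {result.slope:.3f}, one-sided p = {p_one_sided:.4f}")
```

With the real data, `toy` would be replaced by the vehicle-level `df` restricted to Washington counties. Keep in mind that a county with very few registrations can have an extreme BEV share, so small counties may have a large influence on the slope.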

### Statement 2
**Hypothesis:** Counties with higher median incomes have a greater number of registered electric vehicles.

**Analysis:** We will combine EV registrations by county with the median income by county from the median income dataset. With each row representing a different county, we will fit a linear regression with median income in 2023 as the input and the number of registered electric vehicles as the output. Because the sign of the coefficient indicates whether the relationship between the two variables is positive or negative, we will test whether $\beta_{\text{income}} > 0$. If the coefficient is significantly positive, it indicates that high-income counties tend to have more EVs. We will also evaluate the R-squared value to understand how much of the variation in EV registrations is explained by median income and other factors included in the model.
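As a rough sketch of this regression, the snippet below fits `LinearRegression` on only the ten counties displayed in the `filtered_df` output earlier in this notebook. It illustrates the mechanics and is not the actual result. It assumes the `Average Income` column (the per-county average of `income_2023`) stands in for the county's median income.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten counties shown in the filtered_df output above (a subset only,
# so the fitted numbers are illustrative rather than final).
counties = pd.DataFrame({
    "County": ["Adams", "Asotin", "Benton", "Chelan", "Clallam",
               "Clark", "Columbia", "Cowlitz", "Douglas", "Ferry"],
    "EV Count": [19, 41, 1120, 510, 573, 4906, 5, 435, 170, 13],
    "Average Income": [64498.0, 67820.0, 87992.0, 84430.0, 68924.0,
                       94198.0, 65040.0, 74250.0, 86676.0, 58973.0],
})

X = counties[["Average Income"]]   # 2-D input: one feature per county
y = counties["EV Count"]           # target: registered EVs per county

model = LinearRegression().fit(X, y)

income_coef = model.coef_[0]       # change in EV count per extra dollar of income
r_squared = model.score(X, y)      # in-sample R-squared

print(f"income coefficient = {income_coef:.4f}")
print(f"R-squared = {r_squared:.3f}")
```

`LinearRegression` reports the coefficient and R-squared but no p-value, so judging whether the coefficient is significantly positive would need an additional step, such as a regression routine that returns standard errors.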