In [1]:
# Import packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

In [8]:
# Import and clean up dataset
df = pd.read_csv('data/electric_vehicles_spec_2025.csv')
df = df.dropna()
df.head()

Unnamed: 0,brand,model,top_speed_kmh,battery_capacity_kWh,battery_type,number_of_cells,torque_nm,efficiency_wh_per_km,range_km,acceleration_0_100_s,...,towing_capacity_kg,cargo_volume_l,seats,drivetrain,segment,length_mm,width_mm,height_mm,car_body_type,source_url
0,Abarth,500e Convertible,155,37.8,Lithium-ion,192.0,235.0,156,225,7.0,...,0.0,185,4,FWD,B - Compact,3673,1683,1518,Hatchback,https://ev-database.org/car/1904/Abarth-500e-C...
1,Abarth,500e Hatchback,155,37.8,Lithium-ion,192.0,235.0,149,225,7.0,...,0.0,185,4,FWD,B - Compact,3673,1683,1518,Hatchback,https://ev-database.org/car/1903/Abarth-500e-H...
2,Abarth,600e Scorpionissima,200,50.8,Lithium-ion,102.0,345.0,158,280,5.9,...,0.0,360,5,FWD,JB - Compact,4187,1779,1557,SUV,https://ev-database.org/car/3057/Abarth-600e-S...
3,Abarth,600e Turismo,200,50.8,Lithium-ion,102.0,345.0,158,280,6.2,...,0.0,360,5,FWD,JB - Compact,4187,1779,1557,SUV,https://ev-database.org/car/3056/Abarth-600e-T...
6,Alfa,Romeo Junior Elettrica 54 kWh,150,50.8,Lithium-ion,102.0,260.0,128,320,9.0,...,0.0,400,5,FWD,JB - Compact,4173,1781,1532,SUV,https://ev-database.org/car/2184/Alfa-Romeo-Ju...


In [7]:
# Check for missing values
df.isnull().sum()

brand                        0
model                        0
top_speed_kmh                0
battery_capacity_kWh         0
battery_type                 0
number_of_cells              0
torque_nm                    0
efficiency_wh_per_km         0
range_km                     0
acceleration_0_100_s         0
fast_charging_power_kw_dc    0
fast_charge_port             0
towing_capacity_kg           0
cargo_volume_l               0
seats                        0
drivetrain                   0
segment                      0
length_mm                    0
width_mm                     0
height_mm                    0
car_body_type                0
source_url                   0
dtype: int64

This notebook uses the electric vehicle specifications dataset from Kaggle: https://www.kaggle.com/datasets/urvishahir/electric-vehicle-specifications-dataset-2025

Description :

This dataset provides a comprehensive collection of specifications and performance metrics for modern electric vehicles (EVs). It is designed to support researchers, analysts, students, and developers working on data science, machine learning, automotive market research, sustainability studies, or electric vehicle adoption analysis.

Each row in the dataset represents a specific electric vehicle model with a rich set of attributes covering:


        Brand and Model: Manufacturer and specific nameplate of the EV.
        Car Body Type: Classification such as hatchback, SUV, sedan, etc.
        Segment: Vehicle segment (e.g., compact, midsize, executive).

        Battery Capacity (kWh): The gross energy capacity of the battery.
        Number of Cells and Battery Type: Technical battery information, where available.
        Efficiency (Wh/km): Energy consumed per kilometre driven.
        Range (km): Estimated driving range on a full charge.

        Fast Charging Power (kW): Maximum supported DC fast-charging power.
        Fast Charge Port Type: Connector standard (e.g., CCS, CHAdeMO).

        Top Speed (km/h): Maximum speed of the vehicle.
        0–100 km/h Acceleration (s): Time to reach 100 km/h from a standstill.
        Torque (Nm): Maximum torque output, where available.

        Towing Capacity (kg): Ability to tow loads, provided where applicable.
        Cargo Volume (L): Luggage space, sometimes approximate or expressed in alternative units.
        Seats: Total seating capacity.
        
        Length, Width, Height (mm): Physical footprint of the vehicle.
        Drivetrain: Powertrain configuration (e.g., AWD, RWD, FWD).
        Source URL: Reference link for each car (used in scraping).

Question 1: Is there a relationship between the technical qualities of the batteries? 
- Unsupervised
- No latent variable
- No clear truth variable
- Perform clustering
- Perform other unsupervised modeling

In [9]:
# Could cluster car brands together based on technical specs: battery size, top speeds, number of cells, etc...
scaler = StandardScaler()
technical_features = df[['top_speed_kmh', 'battery_capacity_kWh', 'number_of_cells', 
                      'torque_nm', 'efficiency_wh_per_km', 'range_km', 'acceleration_0_100_s']].values


technical_features_scaled = scaler.fit_transform(technical_features)

manu = df[['brand']].values # Grab brands to group by

In [27]:
# Look for best cluster parameters based on AIC and/or BIC
def best_cluster(X):
    # Initialize parameters
    cluster_range = range(2, 21)
    aic_history = []
    bic_history = []
    models = []
    
    for num_clusters in cluster_range:
        gmm = GaussianMixture(n_components = num_clusters, n_init = 10)
        gmm.fit(X)
    
        aic = gmm.aic(X)
        bic = gmm.bic(X)
    
        aic_history.append(aic) # Append AIC scores
        bic_history.append(bic) # Append BIC scores
        models.append(gmm) # Append models used
    
        print(f"Number of clusters = {num_clusters} with AIC = {aic:.4f}, BIC = {bic:.4f}")
    
    # Find best number of clusters
    aic_min_index = np.argmin(aic_history)
    bic_min_index = np.argmin(bic_history)
    best_aic_cluster = cluster_range[aic_min_index]
    best_bic_cluster = cluster_range[bic_min_index]

    # AIC and BIC do not always agree, so keep the best model for each criterion
    best_aic_model = models[aic_min_index] # index by position instead of offsetting by the range start
    best_bic_model = models[bic_min_index]
    
    print(f"\nWith a minimum value of {min(aic_history):.4f} AIC, the optimal number of clusters based on AIC is {best_aic_cluster}")
    print(f"With a minimum value of {min(bic_history):.4f} BIC, the optimal number of clusters based on BIC is {best_bic_cluster}")
    return best_aic_model, best_bic_model, best_aic_cluster, best_bic_cluster

# Function to print out each cluster and the brands it contains, for the specified evaluator (AIC or BIC)
def brand_clusters(cluster, best_cluster_value, df, evaluator):
    print(f'\nGiven a total of {best_cluster_value} clusters through {evaluator}')
    for cluster_number in range(best_cluster_value): # Iterate through each cluster
        brand_in_clusters = df[df[cluster] == cluster_number]['brand'].unique()
        print(f"\nCluster {cluster_number + 1} ({len(brand_in_clusters)} brands):")
        print(', '.join(brand_in_clusters)) # https://stackoverflow.com/questions/22399014/print-elements-in-an-array-with-a-delimiter

In [28]:
# Find best cluster, then group and print out the brands
q1_copy = df.copy() # Create a copy of dataframe for cluster labels
best_aic_model, best_bic_model, best_aic_cluster, best_bic_cluster = best_cluster(technical_features_scaled)

# Get cluster labels
aic_cluster_labels = best_aic_model.predict(technical_features_scaled)
bic_cluster_labels = best_bic_model.predict(technical_features_scaled)
q1_copy['AIC Cluster'] = aic_cluster_labels
q1_copy['BIC Cluster'] = bic_cluster_labels

# Print out brands clustered together, based on AIC or BIC
brand_clusters('AIC Cluster', best_aic_cluster, q1_copy, 'AIC')
brand_clusters('BIC Cluster', best_bic_cluster, q1_copy, 'BIC')

Number of clusters = 2 with AIC = 2425.6881, BIC = 2679.5804
Number of clusters = 3 with AIC = 1842.5340, BIC = 2225.1606
Number of clusters = 4 with AIC = 1459.8186, BIC = 1971.1793
Number of clusters = 5 with AIC = 1318.9359, BIC = 1959.0308
Number of clusters = 6 with AIC = 1008.3411, BIC = 1777.1702
Number of clusters = 7 with AIC = 949.0690, BIC = 1846.6322
Number of clusters = 8 with AIC = 583.0173, BIC = 1609.3147
Number of clusters = 9 with AIC = 460.3223, BIC = 1615.3539
Number of clusters = 10 with AIC = -117.4369, BIC = 1166.3289
Number of clusters = 11 with AIC = -99.6355, BIC = 1312.8644
Number of clusters = 12 with AIC = -160.8924, BIC = 1380.3416
Number of clusters = 13 with AIC = -401.4978, BIC = 1268.4704
Number of clusters = 14 with AIC = -519.5843, BIC = 1279.1181
Number of clusters = 15 with AIC = -239.3303, BIC = 1688.1062
Number of clusters = 16 with AIC = -816.2383, BIC = 1239.9324
Number of clusters = 17 with AIC = -535.2231, BIC = 1649.6818
Number of clusters =

### Temporary Analysis
BIC penalizes model complexity more heavily than AIC, so it tends to select simpler models with fewer clusters. BIC is therefore generally preferred when the goal is to avoid overfitting and to summarize the dataset with a compact set of groups.

AIC penalizes complexity less, so it often selects models with more clusters. AIC is the better choice when the goal is prediction or when we want a model that captures more nuance in the data.
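To make the difference in penalties concrete, here is a minimal sketch that recomputes both criteria by hand, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, and compares them with scikit-learn's `aic()` and `bic()`. The data is a synthetic stand-in and `gmm_param_count` is a helper introduced only for this illustration; it assumes full covariance matrices, which is the `GaussianMixture` default. With the seven features used above, each extra component adds 7 + 28 + 1 = 36 parameters, and the BIC penalty per parameter exceeds the AIC penalty whenever ln(n) > 2, i.e. for any dataset with 8 or more rows.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def gmm_param_count(n_components, n_features):
    """Free parameters of a full-covariance Gaussian mixture (illustrative helper)."""
    mean_params = n_components * n_features
    cov_params = n_components * n_features * (n_features + 1) // 2
    weight_params = n_components - 1  # mixture weights sum to 1
    return mean_params + cov_params + weight_params


# Synthetic stand-in data (assumption): two well-separated blobs in 3 dimensions
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (150, 3)), rng.normal(4, 1, (150, 3))])

gmm_demo = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm_demo.fit(X_demo)

n_samples, n_features = X_demo.shape
k = gmm_param_count(2, n_features)

# score() returns the mean per-sample log-likelihood, so scale by n_samples
log_likelihood = gmm_demo.score(X_demo) * n_samples

manual_aic = 2 * k - 2 * log_likelihood
manual_bic = k * np.log(n_samples) - 2 * log_likelihood

print(f"AIC manual = {manual_aic:.4f}, sklearn = {gmm_demo.aic(X_demo):.4f}")
print(f"BIC manual = {manual_bic:.4f}, sklearn = {gmm_demo.bic(X_demo):.4f}")
```

The two criteria share the same likelihood term, so any disagreement between them comes entirely from the penalty on k.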

With our current EM algorithm implementation, we can see this reflected in the number of clusters chosen via AIC or BIC: across multiple runs, AIC consistently chooses more clusters than BIC. Because each brand sells several car models, the same brand appears in more than one cluster. We can still draw information from the groupings (for example, are perceived high-end brands grouped together?).

It's also interesting to see how closely related car brands are clustered. For example, Polestar is Volvo's sister company, or EV subsidiary, yet the two are frequently placed in separate clusters. Is this because Volvo's EV models are considered lower-end than Polestar's?
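Because a brand can be spread across several clusters, one way to compare brands such as Volvo and Polestar is to tabulate how many of each brand's models land in each cluster. The sketch below uses a toy stand-in for `q1_copy`; the rows are hypothetical and not taken from the dataset, and only the column names `brand` and `BIC Cluster` follow the notebook.

```python
import pandas as pd

# Toy stand-in for q1_copy (hypothetical rows, not real cluster assignments)
toy = pd.DataFrame({
    "brand": ["Volvo", "Volvo", "Volvo", "Polestar", "Polestar", "Polestar"],
    "BIC Cluster": [0, 0, 1, 1, 1, 1],
})

# Rows are brands, columns are clusters, values are model counts
counts = pd.crosstab(toy["brand"], toy["BIC Cluster"])

# Share of each brand's models in each cluster (each row sums to 1)
shares = counts.div(counts.sum(axis=1), axis=0)

# Cluster holding the largest share of each brand's models
dominant = shares.idxmax(axis=1)

print(shares.round(2))
print(dominant)
```

On the real data, the same three lines applied to `q1_copy` would show whether two brands merely overlap in some clusters or are concentrated in different ones.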

Question 2: 
- What other factors are closely related to efficiency? What makes an electric car the most efficient?
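A possible first step for this question is to rank the numeric columns by the strength of their correlation with `efficiency_wh_per_km`. The sketch below is an assumption about how that could look, not an analysis of the dataset: the column names follow the notebook, but the values and the relationship between them are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (assumption): names follow the dataset, values do not
demo = pd.DataFrame({
    "battery_capacity_kWh": rng.uniform(30, 110, n),
    "height_mm": rng.uniform(1400, 1900, n),
    "brand": rng.choice(["A", "B"], n),
})
# Made-up relationship, used only so the demo has something to find
demo["efficiency_wh_per_km"] = (
    100
    + 0.05 * (demo["height_mm"] - 1400)
    + 0.3 * demo["battery_capacity_kWh"]
    + rng.normal(0, 5, n)
)

# Correlate numeric columns only, then rank by absolute correlation with efficiency
corr = (
    demo.select_dtypes("number")
    .corr()["efficiency_wh_per_km"]
    .drop("efficiency_wh_per_km")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr)
```

Pearson correlation only captures linear, pairwise relationships, so it is a screening step rather than an answer to what makes a car efficient.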

Question 3:

Predictive Modeling

Build a predictive model: Given variables like battery_capacity_kWh, body type, segment, drivetrain, brand, dimensions — can you predict range_km?
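One way this could be set up is a scikit-learn pipeline that scales the numeric columns, one-hot encodes the categorical ones, and scores a regressor with cross-validation. The sketch below is a baseline under stated assumptions: the data is synthetic, the choice of `LinearRegression` is arbitrary, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data (assumption): names follow the dataset, values do not
demo = pd.DataFrame({
    "battery_capacity_kWh": rng.uniform(30, 110, n),
    "length_mm": rng.uniform(3600, 5200, n),
    "drivetrain": rng.choice(["FWD", "RWD", "AWD"], n),
})
# Made-up target relationship, used only to exercise the pipeline
demo["range_km"] = (
    5.5 * demo["battery_capacity_kWh"]
    - 0.02 * (demo["length_mm"] - 3600)
    - 20 * (demo["drivetrain"] == "AWD")
    + rng.normal(0, 15, n)
)

numeric_cols = ["battery_capacity_kWh", "length_mm"]
categorical_cols = ["drivetrain"]

# Scale numeric features and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# 5-fold cross-validated R^2 on the synthetic data
scores = cross_val_score(
    model, demo[numeric_cols + categorical_cols], demo["range_km"], cv=5, scoring="r2"
)
print(f"Mean R^2 over 5 folds: {scores.mean():.3f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are refit on each training fold, which avoids leaking information from the held-out rows.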

Or: Predict fast_charging_power_kw_dc given other specs (battery size, brand, segment). Which vehicles support high charging power?

Cluster analysis: Are there natural clusters of EVs (e.g., economy commuter, high-performance, luxury long-range) based on specs?

Feature engineering: Create derived metrics like “range per kWh”, “Wh per km per kg volume”, etc.— which features are most predictive of being in “premium” vs “economy” class?
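As a small illustration of the "range per kWh" idea, the sketch below derives two features from three existing columns. The two rows reuse the values printed by `df.head()` above for the Abarth 500e Convertible and the 600e Scorpionissima; the new column names are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the df.head() output above
specs = pd.DataFrame({
    "model": ["500e Convertible", "600e Scorpionissima"],
    "battery_capacity_kWh": [37.8, 50.8],
    "range_km": [225, 280],
    "efficiency_wh_per_km": [156, 158],
})

# Kilometres of rated range per kWh of listed battery capacity
specs["range_per_kwh"] = specs["range_km"] / specs["battery_capacity_kWh"]

# The same quantity implied by the efficiency column (1000 Wh per kWh)
specs["km_per_kwh_from_efficiency"] = 1000 / specs["efficiency_wh_per_km"]

print(specs[["model", "range_per_kwh", "km_per_kwh_from_efficiency"]].round(2))
```

The two derived columns do not match exactly for these rows, which suggests the listed capacity, range, and efficiency are not defined on the same basis; that is worth checking before treating either as a modeling feature.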