# Car Classification Pipeline



This notebook performs two main stages:
1. **Data Cleaning** – Removes duplicates, irrelevant vehicle types, and text inconsistencies.  
2. **Car Classification** – Categorizes cars (SUV, MPV, Sports, Compact, Medium, Large) based on vehicle characteristics such as body type, mass, and length.


In [1]:
from google.colab import files
import numpy as np
import pandas as pd

In [3]:
rdw_cars = pd.read_csv('brand_model_peryear_cleaned_final.csv')
rdw_cars.head()

Unnamed: 0,brand,model,fuel_types_primary,economy_rate,resold_flag,inrichting_std,seats_median,mass_empty_median,length_median,width_median,...,count_2024,count_2025,avg_2023,avg_2024,avg_2025,flag_2023,flag_2024,flag_2025,new_car,image_url
0,FORD,KUGA,"Alcohol, Benzine, Diesel, Elektriciteit",A,1,MPV,5.0,1744.0,463.0,188.0,...,5187,4568,45809.63,47773.96,47943.79,sold,sold,sold,not new,https://www.ford.it/content/dam/guxeu/rhd/cent...
1,FORD,PUMA,"Alcohol, Benzine, Diesel, Elektriciteit",B,1,MPV,5.0,1225.0,421.0,181.0,...,3246,2801,34940.27,36422.75,37647.14,sold,sold,sold,not new,https://www.gpas-cache.ford.com/guid/5420b8eb-...
2,FORD,TRANSIT CUSTOM,"Benzine, Diesel, Elektriciteit",A,1,gesloten opbouw,3.0,1984.0,534.0,203.0,...,10344,2798,57319.72,61603.97,60622.39,sold,sold,sold,not new,"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQA..."
3,FORD,FOCUS,"Alcohol, Benzine, Diesel, Elektriciteit",A,1,stationwagen,5.0,1340.0,469.0,184.0,...,4950,2322,35604.24,35591.46,36124.42,sold,sold,sold,not new,"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQA..."
4,FORD,EXPLORER,"Benzine, Elektriciteit",A,1,MPV,5.0,1990.0,447.0,187.0,...,1459,1451,83982.49,51094.9,48713.71,sold,sold,sold,not new,https://api.openverse.org/v1/images/8c5dc7c2-8...


In [4]:
rdw_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3984 entries, 0 to 3983
Data columns (total 24 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   brand                        3984 non-null   object 
 1   model                        3984 non-null   object 
 2   fuel_types_primary           3984 non-null   object 
 3   economy_rate                 580 non-null    object 
 4   resold_flag                  3984 non-null   int64  
 5   inrichting_std               3984 non-null   object 
 6   seats_median                 3984 non-null   float64
 7   mass_empty_median            3984 non-null   float64
 8   length_median                3179 non-null   float64
 9   width_median                 3350 non-null   float64
 10  wheelbase_median             3984 non-null   float64
 11  pw_ratio_median              3984 non-null   float64
 12  datum_eerste_toelating_year  3984 non-null   int64  
 13  count_2023        

## Final cleaning
Before running the classification, we noticed that some adjustments were still needed. Because the dataset has already been downloaded from the link, we apply the changes here; in the final version this cleaning will be tidied up.

Steps performed:

1. Remove rows where `brand == model`.  
2. Drop duplicate `(brand, model)` combinations.  
3. Remove placeholder or mistyped entries.  
4. Exclude non-passenger vehicle types (e.g., campers, trucks).  

### Dropping the rows where brand == model

In [5]:
_brand = rdw_cars["brand"].astype(str).str.strip().str.casefold()
_model = rdw_cars["model"].astype(str).str.strip().str.casefold()

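# Mask rows where brand == model. Missing values are checked on the original
# columns, because astype(str) can turn NaN into the literal string "nan".
mask_same = rdw_cars["brand"].notna() & rdw_cars["model"].notna() & (_brand == _model)

# Drop the matching rows and report how many were removed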
n_before = len(rdw_cars)
rdw_cars = rdw_cars.loc[~mask_same].copy()
n_after = len(rdw_cars)

print(f"Dropped {n_before - n_after} rows where brand == model (case-insensitive).")

Dropped 3 rows where brand == model (case-insensitive).
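
A small caveat on the missing-value check: depending on the pandas version, `astype(str)` can turn `NaN` into the literal string `"nan"`, after which `notna()` on the converted column no longer detects anything. The sketch below, on made-up rows, checks for missing values on the original columns instead. The helper name `same_brand_model_mask` and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: one real match (ignoring case and whitespace) and one row
# where both fields are missing.
toy = pd.DataFrame({
    "brand": ["Tesla", "FORD", np.nan, "Audi"],
    "model": [" tesla ", "KUGA", np.nan, "A3"],
})


def same_brand_model_mask(df):
    """Hypothetical helper: True where brand equals model and both are present."""
    # Check for missing values before any string conversion.
    both_present = df["brand"].notna() & df["model"].notna()
    brand = df["brand"].astype(str).str.strip().str.casefold()
    model = df["model"].astype(str).str.strip().str.casefold()
    return both_present & (brand == model)


mask = same_brand_model_mask(toy)
cleaned = toy.loc[~mask].copy()
print(f"Dropped {int(mask.sum())} of {len(toy)} toy rows.")
```

Only the first toy row is flagged; the row with two missing values is kept, so it can be handled by a dedicated missing-data step.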


### Dropping final duplicates

In [6]:
print(f"Rows before cleaning: {rdw_cars.shape[0]}")
rdw_unique = rdw_cars.drop_duplicates(subset=["brand", "model"], keep="first").copy()
print(f"Rows after cleaning: {rdw_unique.shape[0]}")

Rows before cleaning: 3981
Rows after cleaning: 3854
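
With `keep="first"`, the surviving row of each `(brand, model)` group depends on row order. Before dropping, it can help to look at the full duplicate groups. The following is a minimal sketch on made-up rows; the values are not taken from the dataset.

```python
import pandas as pd

# Made-up rows: FORD KUGA appears twice with slightly different masses.
toy = pd.DataFrame({
    "brand": ["FORD", "FORD", "AUDI", "FORD"],
    "model": ["KUGA", "KUGA", "Q2", "PUMA"],
    "mass_empty_median": [1744.0, 1750.0, 1280.0, 1225.0],
})

key = ["brand", "model"]

# keep=False marks every member of a duplicate group, not only the later ones.
dupe_groups = toy[toy.duplicated(subset=key, keep=False)].sort_values(key)
print(dupe_groups)

# keep="first" retains the first occurrence in the current row order.
deduped = toy.drop_duplicates(subset=key, keep="first").copy()
print(f"Rows before: {len(toy)}, after: {len(deduped)}")
```

If the duplicates differ in columns that matter later, such as mass or length, sorting or aggregating before `drop_duplicates` makes the choice of surviving row explicit.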


### Removing placeholder and mistyped entries

In [7]:
print(f"before: {rdw_unique.shape[0]}")
rdw_unique = rdw_unique[rdw_unique["model"].astype(str).str.strip() != "-"].copy()
print(f"after: {rdw_unique.shape[0]}")

before: 3854
after: 3852


### Removing non-passenger vehicle types

Rows where `inrichting_std` equals “gesloten opbouw” (closed body/box) or “kampeerwagen” (campervan) are removed, along with all other non-passenger body types, so they don’t skew passenger-car indicators, clustering, and price/forecast analyses.

In [8]:
rdw_unique["inrichting_std"] = rdw_unique["inrichting_std"].astype(str).str.strip().str.casefold()  # normalise case and whitespace

to_exclude = {
    "gesloten opbouw", "kampeerwagen",  # the two types named above
    # broader non-passenger body types
    "open laadvloer", "speciale groep", "neerklapbare zijschotten", "veewagen",
    "voor vervoer voertuigen", "lijkwagen", "huifopbouw", "opleggertrekker",
    "detailhandel/expositiedoel.", "kipper", "gecond. met temperatuurreg.",
    "kraanwagen", "hoogwerker"
}

mask_excl = rdw_unique["inrichting_std"].isin(to_exclude)

n_before = len(rdw_unique)
n_matches = int(mask_excl.sum())
print(f"[BEFORE] Total rows: {n_before}")
print(f"[EXCLUDE] Matching rows (any excluded type): {n_matches}")

if n_matches > 0:
    print("\n[DETAIL] Value counts among rows to be dropped:")
    print(rdw_unique.loc[mask_excl, "inrichting_std"].value_counts())

rdw_unique = rdw_unique.loc[~mask_excl].copy()
n_after = len(rdw_unique)
print(f"\n[AFTER] Total rows: {n_after}")
print(f"Dropped {n_before - n_after} rows.")
print("Remaining unique types (sample):", sorted(rdw_unique["inrichting_std"].dropna().unique())[:12])


[BEFORE] Total rows: 3852
[EXCLUDE] Matching rows (any excluded type): 1051

[DETAIL] Value counts among rows to be dropped:
inrichting_std
kampeerwagen                   753
gesloten opbouw                207
speciale groep                  33
neerklapbare zijschotten        17
open laadvloer                  13
opleggertrekker                  6
voor vervoer voertuigen          5
kipper                           5
gecond. met temperatuurreg.      3
lijkwagen                        3
veewagen                         2
detailhandel/expositiedoel.      1
huifopbouw                       1
kraanwagen                       1
hoogwerker                       1
Name: count, dtype: int64

[AFTER] Total rows: 2801
Dropped 1051 rows.
Remaining unique types (sample): ['cabriolet', 'coupe', 'hatchback', 'mpv', 'niet geregistreerd', 'pick-up truck', 'sedan', 'stationwagen', 'voor rolstoelen toegankelijk voertuig']


### One-hot encoding the `fuel_types_primary` column

Each fuel type becomes its own column:

- 1: the model is produced with that type of fuel,
- 0: otherwise.

In [9]:
rdw_unique["fuel_types_primary"].unique()

array(['Alcohol, Benzine, Diesel, Elektriciteit',
       'Benzine, Elektriciteit', 'Elektriciteit', 'Benzine, Diesel',
       'Benzine, Diesel, Elektriciteit', 'Benzine', 'Benzine, LPG',
       'Alcohol, Benzine, Diesel', 'Alcohol, Benzine, LPG',
       'Alcohol, Benzine', 'Benzine, Elektriciteit, LPG',
       'Benzine, Diesel, LPG', 'Benzine, CNG', 'Diesel',
       'Diesel, Elektriciteit', 'CNG', 'LPG', 'Alcohol',
       'Alcohol, Benzine, Diesel, LPG',
       'Alcohol, Benzine, Diesel, Elektriciteit, LPG',
       'Benzine, CNG, Elektriciteit, LPG', 'Benzine, CNG, Diesel, LPG',
       'Benzine, Diesel, Elektriciteit, LPG', 'Elektriciteit, Waterstof',
       'Benzine, CNG, Diesel', 'Benzine, CNG, Diesel, Elektriciteit, LPG',
       'Benzine, CNG, Elektriciteit',
       'Benzine, CNG, Diesel, Elektriciteit',
       'Alcohol, Benzine, CNG, Diesel, Elektriciteit, LPG',
       'Alcohol, Benzine, CNG, Diesel, Elektriciteit'], dtype=object)
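
The encoding described above can be sketched with `Series.str.get_dummies`, which splits each string on a separator and produces one 0/1 column per distinct token. This is one possible approach under the assumption that the fuel types are always separated by a comma and a space, as in the values listed; the `fuel_` prefix and the toy rows are illustrative choices.

```python
import pandas as pd

# Made-up rows using the same comma-separated format as fuel_types_primary.
toy = pd.DataFrame({
    "model": ["KUGA", "EXPLORER", "Q4 45 e-tron"],
    "fuel_types_primary": [
        "Benzine, Diesel, Elektriciteit",
        "Benzine, Elektriciteit",
        "Elektriciteit",
    ],
})

# One indicator column per fuel type: 1 if the model is offered with it, else 0.
fuel_dummies = toy["fuel_types_primary"].str.get_dummies(sep=", ").add_prefix("fuel_")

# Attach the indicator columns to the original frame.
encoded = toy.join(fuel_dummies)
print(encoded.drop(columns="fuel_types_primary"))
```

Because the separator includes the space, tokens do not need an extra `strip()`; if the real data mixes `","` and `", "`, the strings would need normalising first.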

## Car Classification


### Logic
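- **SUV:** Pick-up truck body, or mass ≥ 1800 kg and length ≥ 4500 mm (both conditions must hold).  
- **MPV:** Passenger vans or ≥ 6 seats.  
- **Sports:** Sports-oriented bodies or power-to-weight ratio ≥ 0.12.  
- **Other cars:** Classified as Compact, Medium, or Large using the 33rd and 66th mass percentiles together with fixed length thresholds.

Rules are evaluated in the order listed, and the first match wins.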

Results are saved in `rdw_cars_classified.csv`.

In [10]:
# Keyword list for the classification
suv_keywords = ['pick-up truck']
mpv_keywords = ['mpv', 'personenbus']
sports_keywords = ['cabriolet', 'coupe', 'sportwagen']

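# Mass thresholds (33rd and 66th percentiles) for the size classes.
# Note: computed on rdw_cars, so the body types removed from rdw_unique
# still influence these values.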
p33_mass = rdw_cars['mass_empty_median'].quantile(0.33)
p66_mass = rdw_cars['mass_empty_median'].quantile(0.66)

print(f"Global 33rd percentile mass: {p33_mass:.1f}")
print(f"Global 66th percentile mass: {p66_mass:.1f}")

Global 33rd percentile mass: 1468.4
Global 66th percentile mass: 2175.0
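
The thresholds above are computed on `rdw_cars`, which at this point still contains the body types that were filtered out of `rdw_unique`. As a rough illustration of why that can matter, the sketch below compares the 66th percentile with and without two heavy campervans. All masses and the two-entry exclusion set are made up for the example.

```python
import pandas as pd

# Made-up masses: four passenger cars and two heavy campervans.
toy = pd.DataFrame({
    "inrichting_std": ["hatchback", "sedan", "stationwagen", "mpv",
                       "kampeerwagen", "kampeerwagen"],
    "mass_empty_median": [1000.0, 1300.0, 1500.0, 1700.0, 3000.0, 3100.0],
})

non_passenger = {"kampeerwagen", "gesloten opbouw"}  # subset, for illustration
passenger = toy[~toy["inrichting_std"].isin(non_passenger)]

# Same quantile, two different populations.
p66_all = toy["mass_empty_median"].quantile(0.66)
p66_passenger = passenger["mass_empty_median"].quantile(0.66)

print(f"66th percentile, all rows:       {p66_all:.1f}")
print(f"66th percentile, passenger only: {p66_passenger:.1f}")
```

In this toy case the campervans raise the upper threshold considerably, which would move passenger cars from Large into Medium.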


In [11]:
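def classify_car(row):
    """Assign a body class from body type, mass, length, seats and power-to-weight ratio."""
    inrichting = str(row['inrichting_std']).lower()
    mass = row['mass_empty_median']
    length = row['length_median']
    # Note: the sample rows show length_median values such as 463.0, which look
    # like centimetres. If so, `length >= 4500` is never true and
    # `length < 4400` is always true for non-missing lengths, and the
    # thresholds would need rescaling (e.g. 450 and 440).
    # Comparisons with NaN always evaluate to False.

    # SUV: pick-up body, or heavy and long
    if inrichting in suv_keywords or (mass >= 1800 and length >= 4500):
        return 'SUV'

    # MPV: passenger-van body, or six or more seats
    if inrichting in mpv_keywords or row.get('seats_median', 0) >= 6:
        return 'MPV'

    # Sports: sports-oriented body, or high power-to-weight ratio
    if inrichting in sports_keywords or row.get('pw_ratio_median', 0) >= 0.12:
        return 'Sports'

    # Size-based (Compact / Medium / Large)
    if mass < p33_mass:
        return 'Medium' if length >= 4500 else 'Compact'
    if mass < p66_mass:
        return 'Medium'
    # Heaviest group: rows with a missing length fall through to 'Large'
    return 'Medium' if length < 4400 else 'Large'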
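
For larger tables, the same rule order could also be expressed in vectorised form with `numpy.select`, which picks the first matching condition per row. The following is a sketch, not the notebook's method: `classify_frame`, its parameters, the constant names, and the toy rows are hypothetical, and the mass thresholds are passed in rather than read from globals. The length thresholds are copied unchanged from the row-wise rules.

```python
import numpy as np
import pandas as pd

SUV_TYPES = ["pick-up truck"]
MPV_TYPES = ["mpv", "personenbus"]
SPORTS_TYPES = ["cabriolet", "coupe", "sportwagen"]


def classify_frame(df, p33_mass, p66_mass):
    """Hypothetical vectorised version of the row-wise rules (first match wins)."""
    body = df["inrichting_std"].astype(str).str.lower()
    mass = df["mass_empty_median"]
    length = df["length_median"]

    # Conditions are listed in priority order; NaN comparisons are False.
    conditions = [
        body.isin(SUV_TYPES) | ((mass >= 1800) & (length >= 4500)),
        body.isin(MPV_TYPES) | (df["seats_median"] >= 6),
        body.isin(SPORTS_TYPES) | (df["pw_ratio_median"] >= 0.12),
        (mass < p33_mass) & (length >= 4500),
        mass < p33_mass,
        mass < p66_mass,
        length < 4400,
    ]
    choices = ["SUV", "MPV", "Sports", "Medium", "Compact", "Medium", "Medium"]
    labels = np.select([c.to_numpy() for c in conditions], choices, default="Large")
    return pd.Series(labels, index=df.index, name="body_class")


# Made-up rows covering each class, including one with a missing length.
toy = pd.DataFrame({
    "inrichting_std": ["pick-up truck", "MPV", "hatchback", "sedan",
                       "stationwagen", "hatchback"],
    "mass_empty_median": [2000.0, 1500.0, 1000.0, 1400.0, 1900.0, 1100.0],
    "length_median": [530.0, 450.0, 380.0, 470.0, np.nan, 400.0],
    "seats_median": [5.0, 7.0, 5.0, 5.0, 5.0, 4.0],
    "pw_ratio_median": [0.06, 0.05, 0.05, 0.07, 0.08, 0.15],
})
toy["body_class"] = classify_frame(toy, p33_mass=1200.0, p66_mass=1600.0)
print(toy[["inrichting_std", "body_class"]])
```

The heavy station wagon with a missing length ends up as Large, because every length comparison on `NaN` is false and the default applies.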

In [12]:
# Applied to rdw_cars (3981 rows), not to the cleaned rdw_unique (2801 rows)
rdw_cars['body_class'] = rdw_cars.apply(classify_car, axis=1)

print(rdw_cars['body_class'].value_counts())
print(rdw_cars.head(10))

body_class
Medium     1761
Compact    1125
Sports      389
MPV         369
Large       181
SUV         156
Name: count, dtype: int64
  brand           model                       fuel_types_primary economy_rate  \
0  FORD            KUGA  Alcohol, Benzine, Diesel, Elektriciteit            A   
1  FORD            PUMA  Alcohol, Benzine, Diesel, Elektriciteit            B   
2  FORD  TRANSIT CUSTOM           Benzine, Diesel, Elektriciteit            A   
3  FORD           FOCUS  Alcohol, Benzine, Diesel, Elektriciteit            A   
4  FORD        EXPLORER                   Benzine, Elektriciteit            A   
5  AUDI    Q4 45 e-tron                            Elektriciteit          NaN   
6  AUDI              Q2                          Benzine, Diesel            C   
7  AUDI    A3 Sportback           Benzine, Diesel, Elektriciteit            B   
8  AUDI    A1 Sportback                          Benzine, Diesel            C   
9  FORD          FIESTA  Alcohol, Benzine, Diesel, Elektr

In [13]:
class_counts = rdw_cars['body_class'].value_counts()
print(class_counts)

body_class
Medium     1761
Compact    1125
Sports      389
MPV         369
Large       181
SUV         156
Name: count, dtype: int64


In [14]:
rdw_cars.to_csv('rdw_cars_classified.csv', index=False)

# Optional: download the file when running in Google Colab
# files.download('rdw_cars_classified.csv')