
# Car classification



In [1]:
from google.colab import files
import pandas as pd

In [2]:
rdw_cars = pd.read_csv('df_cleaned_FINAL.csv')
rdw_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3843 entries, 0 to 3842
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   brand                        3843 non-null   object 
 1   model                        3843 non-null   object 
 2   fuel_types_primary           3843 non-null   object 
 3   resold_flag                  3843 non-null   int64  
 4   inrichting_std               3843 non-null   object 
 5   seats_median                 3843 non-null   float64
 6   mass_empty_median            3843 non-null   float64
 7   length_median                3113 non-null   float64
 8   width_median                 3233 non-null   float64
 9   wheelbase_median             3843 non-null   float64
 10  pw_ratio_median              3843 non-null   float64
 11  datum_eerste_toelating_year  3843 non-null   int64  
 12  count_2023                   3843 non-null   float64
 13  count_2024        

In [3]:
rdw_cars.head()

Unnamed: 0,brand,model,fuel_types_primary,resold_flag,inrichting_std,seats_median,mass_empty_median,length_median,width_median,wheelbase_median,pw_ratio_median,datum_eerste_toelating_year,count_2023,count_2024,count_2025,avg_2023,avg_2024,avg_2025,image_url_2,image_url_3
0,AUDI,Q4 45 e-tron,Elektriciteit,1,stationwagen,5.0,2135.0,459.0,187.0,277.0,1.3e-05,2024,105.0,2863.0,1243.0,67735.88,58893.6,57493.41,https://uploads.audi-mediacenter.com/system/pr...,https://uploads.audi-mediacenter.com/system/pr...
1,AUDI,Q2,"Benzine, Diesel",1,stationwagen,5.0,1280.0,421.0,180.0,259.0,5.8e-05,2023,1891.0,1848.0,940.0,45577.94,46148.48,48594.64,https://uploads.audi-mediacenter.com/system/pr...,https://uploads.audi-mediacenter.com/system/pr...
2,AUDI,A3 Sportback,"Benzine, Diesel, Elektriciteit",1,hatchback,5.0,1255.0,435.0,182.0,262.0,4.4e-05,2023,3012.0,1837.0,931.0,41083.76,42957.07,43886.43,https://uploads.audi-mediacenter.com/system/pr...,https://daisypstrg.blob.core.windows.net/vehic...
3,AUDI,A1 Sportback,"Benzine, Diesel",1,hatchback,5.0,1080.0,404.0,174.0,255.0,5.3e-05,2023,2128.0,1626.0,920.0,33815.98,33853.25,36263.14,https://uploads.audi-mediacenter.com/system/pr...,https://daisypstrg.blob.core.windows.net/vehic...
4,AUDI,Q6 Suv e-tron Performance,Elektriciteit,1,stationwagen,5.0,2175.0,477.0,195.0,290.0,2.2e-05,2025,0.0,329.0,520.0,,82466.87,82587.72,https://media.audi.com/is/image/audi/nemo/mode...,https://uploads.audi-mediacenter.com/system/pr...


All vehicle configuration names (inrichting_std) are converted to lowercase so that comparisons are case-insensitive.

We then remove body types that are not relevant for passenger classification (e.g., cargo trucks, hearses, cranes). A row is dropped only when its inrichting_std value exactly matches an entry in the exclusion list.
This ensures that the classification focuses solely on passenger vehicles.

In [4]:
# column to lowercase
rdw_cars['inrichting_std'] = rdw_cars['inrichting_std'].str.lower()

# exclude non-passenger cars
exclude_keywords = [
    'open laadvloer', 'speciale groep', 'neerklapbare zijschotten', 'veewagen',
    'voor vervoer voertuigen', 'lijkwagen', 'huifopbouw', 'opleggertrekker',
    'detailhandel/expositiedoel.', 'kipper', 'gecond. met temperatuurreg.',
    'kraanwagen', 'hoogwerker'
]
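# .copy() makes the filtered frame independent, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
rdw_cars = rdw_cars[~rdw_cars['inrichting_std'].isin(exclude_keywords)].copy()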

print(f"Remaining models after exclusion: {len(rdw_cars)}")
print("Remaining unique types:", rdw_cars['inrichting_std'].unique())

Remaining models after exclusion: 3793
Remaining unique types: ['stationwagen' 'hatchback' 'sedan' 'cabriolet' 'coupe' 'mpv'
 'niet geregistreerd' 'kampeerwagen' 'gesloten opbouw' 'pick-up truck'
 'voor rolstoelen toegankelijk voertuig']
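
Note that `isin` only drops rows whose value is identical to a list entry. As a small illustration, the sketch below contrasts exact matching with substring matching. The toy values and the `exclude` list are made up for this example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical body-type values, for illustration only
body_types = pd.Series(['kipper', 'kipper met kraan', 'sedan', 'Lijkwagen'])
exclude = ['kipper', 'lijkwagen']

lowered = body_types.str.lower()

# Exact match: only values identical to a list entry are flagged
exact_mask = lowered.isin(exclude)

# Substring match: any value that contains a list entry is flagged.
# regex=False treats each entry as literal text, so '.' or '/' are not patterns.
substring_mask = pd.Series(False, index=lowered.index)
for keyword in exclude:
    substring_mask |= lowered.str.contains(keyword, regex=False)

print(exact_mask.tolist())
print(substring_mask.tolist())
```

Exact matching is the stricter choice: it cannot accidentally remove a passenger body type that merely contains one of the excluded words, but it also misses variants that are spelled slightly differently.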


We define keyword lists for specific car categories (campers, SUVs, MPVs, sports cars).
A vehicle belongs to one of these categories when its inrichting_std value matches an entry in the corresponding list.

We also calculate the 33rd and 66th percentiles of empty vehicle mass over all remaining rows, campers included. These two thresholds divide the dataset into three approximate groups (light, medium, and heavy vehicles), which will later help us classify cars into Compact, Medium, or Large categories.

In [5]:
# Keyword list for the classification
camper_keywords = ['kampeerwagen']
suv_keywords = ['pick-up truck']
mpv_keywords = ['mpv', 'personenbus']
sports_keywords = ['cabriolet', 'coupe', 'sportwagen']

# Computing the groups based on mass
p33_mass = rdw_cars['mass_empty_median'].quantile(0.33)
p66_mass = rdw_cars['mass_empty_median'].quantile(0.66)

print(f"Global 33rd percentile mass: {p33_mass:.1f}")
print(f"Global 66th percentile mass: {p66_mass:.1f}")

Global 33rd percentile mass: 1425.0
Global 66th percentile mass: 2109.3
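
To make the three mass groups concrete, here is a minimal sketch on made-up masses. The Series values and the `mass_band` helper are hypothetical and only illustrate how two percentile cut-offs partition the data when values strictly below a cut-off fall into the lower group.

```python
import pandas as pd

# Hypothetical empty masses in kg, for illustration only
mass = pd.Series([900, 1100, 1300, 1500, 1700, 1900, 2100, 2300, 2500, 2700])

# pandas interpolates linearly between neighbouring values by default
low_cut = mass.quantile(0.33)
high_cut = mass.quantile(0.66)

def mass_band(value):
    """Return 'light', 'medium' or 'heavy' using strictly-below comparisons."""
    if value < low_cut:
        return 'light'
    if value < high_cut:
        return 'medium'
    return 'heavy'

bands = mass.map(mass_band)
print(f"cut-offs: {low_cut:.1f}, {high_cut:.1f}")
print(bands.value_counts().to_dict())
```

Because the cut-offs are the 33rd and 66th percentiles rather than exact thirds, the upper group is slightly larger than the other two.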


This function implements the classification rules. The checks run in a fixed order and the first match wins:

Keyword-based rules: Directly assign a class if inrichting_std matches a camper, SUV, MPV, or sports car keyword.

Heuristic rules: Within the same checks, thresholds on mass and length (SUV), seat count (MPV), or power-to-weight ratio (Sports) can also assign the class.

Size-based classification: If none of the above applies, vehicles are classified into Compact, Medium, or Large categories based on mass and length.

In [6]:
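def classify_car(row):
    """Assign a body class to one row; rules are checked in order, first match wins."""
    inrichting = str(row['inrichting_std']).lower()
    mass = row['mass_empty_median']
    # length may be NaN; every comparison with NaN evaluates to False
    length = row['length_median']

    # The numeric thresholds below are hard-coded; they only work if they
    # are expressed in the same units as the columns they are compared with.

    # --- Camper Identification ---
    if inrichting in camper_keywords:
        return 'Camper'

    # --- SUV Identification (keyword, or heavy and long) ---
    elif inrichting in suv_keywords or (mass >= 1800 and length >= 4500):
        return 'SUV'

    # --- MPV Identification (keyword, or six or more seats) ---
    elif inrichting in mpv_keywords or row.get('seats_median', 0) >= 6:
        return 'MPV'

    # --- Sports Identification (keyword, or high power-to-weight ratio) ---
    elif inrichting in sports_keywords or row.get('pw_ratio_median', 0) >= 0.12:
        return 'Sports'

    # --- Size-based Classification (Compact / Medium / Large) ---
    if mass < p33_mass:
        # Light but long vehicles are promoted to Medium
        if length >= 4500:
            return 'Medium'
        else:
            return 'Compact'
    elif mass < p66_mass:
        return 'Medium'
    else:
        # Heavy but short vehicles are demoted to Medium.
        # A missing length fails this test, so such rows end up as Large.
        if length < 4400:
            return 'Medium'
        else:
            return 'Large'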
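
Because the heuristic thresholds are hard-coded, it can help to check how often each one actually fires before trusting the resulting classes. In the `head()` preview, `length_median` values lie roughly between 400 and 480 and `pw_ratio_median` values are around 1e-5, so the `4500`, `4400` and `0.12` thresholds may be on a different scale than the data. The helper below is a hypothetical sketch (`rule_hit_counts` is not part of the notebook) that counts matching rows per condition. The toy frame reuses a few values from the preview and adds one missing length.

```python
import pandas as pd

def rule_hit_counts(df):
    """Count rows that satisfy each numeric condition used in classify_car.

    Comparisons with NaN evaluate to False, so rows with missing
    values are simply not counted.
    """
    heavy_and_long = (df['mass_empty_median'] >= 1800) & (df['length_median'] >= 4500)
    return {
        'mass>=1800 & length>=4500': int(heavy_and_long.sum()),
        'seats>=6': int((df['seats_median'] >= 6).sum()),
        'pw_ratio>=0.12': int((df['pw_ratio_median'] >= 0.12).sum()),
        'length<4400': int((df['length_median'] < 4400).sum()),
    }

# Toy frame: values borrowed from the head() preview, one length set to NaN
toy = pd.DataFrame({
    'mass_empty_median': [2135.0, 1280.0, 2175.0],
    'length_median': [459.0, 421.0, float('nan')],
    'seats_median': [5.0, 5.0, 5.0],
    'pw_ratio_median': [1.3e-05, 5.8e-05, 2.2e-05],
})

hits = rule_hit_counts(toy)
print(hits)
```

Running `rule_hit_counts(rdw_cars)` on the full frame would show whether a condition matches no rows at all or nearly every row, which is usually a sign of a unit mismatch rather than a property of the cars.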

We apply the classification function to each row of the DataFrame and create a new column body_class.
The distribution of vehicle categories is then displayed, followed by a preview of the classified dataset.

In [7]:
rdw_cars['body_class'] = rdw_cars.apply(classify_car, axis=1)

print(rdw_cars['body_class'].value_counts())
print(rdw_cars.head(10))

body_class
Medium     1112
Compact    1083
Camper      753
Sports      394
MPV         307
SUV         127
Large        17
Name: count, dtype: int64
  brand                      model              fuel_types_primary  \
0  AUDI               Q4 45 e-tron                   Elektriciteit   
1  AUDI                         Q2                 Benzine, Diesel   
2  AUDI               A3 Sportback  Benzine, Diesel, Elektriciteit   
3  AUDI               A1 Sportback                 Benzine, Diesel   
4  AUDI  Q6 Suv e-tron Performance                   Elektriciteit   
5  AUDI              Q6 Suv e-tron                   Elektriciteit   
6  AUDI        Q3 Sportback e-tron          Benzine, Elektriciteit   
7  AUDI                         Q3  Benzine, Diesel, Elektriciteit   
8  AUDI      A3 Sportback Phev 150                   Elektriciteit   
9  AUDI               Q5 50 TFSI E          Benzine, Elektriciteit   

   resold_flag inrichting_std  seats_median  mass_empty_median  length_median  \

In [8]:
# Count cars per body class
class_counts = rdw_cars['body_class'].value_counts()
print(class_counts)

body_class
Medium     1112
Compact    1083
Camper      753
Sports      394
MPV         307
SUV         127
Large        17
Name: count, dtype: int64
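
Raw counts do not show which body types feed each class. One way to audit the result, sketched here on made-up rows, is to cross-tabulate inrichting_std against body_class and to look at relative shares. The toy frame is hypothetical; on the real data the same two calls would be applied to `rdw_cars`.

```python
import pandas as pd

# Hypothetical classified rows, for illustration only
toy = pd.DataFrame({
    'inrichting_std': ['hatchback', 'hatchback', 'kampeerwagen', 'coupe', 'mpv'],
    'body_class': ['Compact', 'Medium', 'Camper', 'Sports', 'MPV'],
})

# Rows: original body type, columns: assigned class, cells: number of models
table = pd.crosstab(toy['inrichting_std'], toy['body_class'])

# Share of each class instead of absolute counts
shares = toy['body_class'].value_counts(normalize=True)

print(table)
print(shares.round(2))
```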


In [9]:
# Save the updated DataFrame to a CSV file
rdw_cars.to_csv('rdw_cars_classified.csv', index=False)

files.download('rdw_cars_classified.csv')
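
`files.download` is only available inside Colab. For running the notebook elsewhere, a guarded version could look like the sketch below. The helper name `save_and_maybe_download` and the demo frame are assumptions made for this example; only `DataFrame.to_csv` and `google.colab.files.download` come from the notebook.

```python
import os
import tempfile

import pandas as pd

def save_and_maybe_download(df, path):
    """Write df to path; trigger a browser download only when running in Colab.

    Returns True if a Colab download was started, False otherwise.
    """
    df.to_csv(path, index=False)
    try:
        from google.colab import files  # importable only inside Colab
    except ImportError:
        return False
    files.download(path)
    return True

# Demo with a throwaway frame and a temporary directory
demo = pd.DataFrame({'brand': ['AUDI', 'BMW'], 'body_class': ['Medium', 'Sports']})
out_path = os.path.join(tempfile.mkdtemp(), 'demo_classified.csv')
started_download = save_and_maybe_download(demo, out_path)
print(out_path, started_download)
```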
