# Composite MatchScore: Car → PC4 Purchase-Likelihood Proxy

We estimate how *compatible* a given **car** is with a **PC4 region** by combining three distances:

1. **Demographics (numeric)** — affordability & lifestyle  
   - Car: `mass_empty_median`, `pw_ratio_median`, `seats_median`, `datum_eerste_toelating_year`  
   - Region: `avg_yearly_income_k`, `avg_household_size`, `urbanization`, `avg_house_value_woz`  
   - Normalize with *one shared scaler* (z-scores from regions) for comparability  
   - Distance: **Manhattan (L1)** by default; **Mahalanobis** optional for decorrelation

2. **Body mix (composition)** — fleet composition fit  
   - Car: one-hot of `body_class` across `[compact, medium, large, suv, mpv, sports]`  
   - Region: `[p_compact, p_medium, p_large, p_suv, p_mpv, p_sports]`  
   - Distance: **Manhattan (L1)** between car one-hot and region proportions

3. **Fuel mix (composition)** — energy preference fit  
   - Car: one-hot of `fuel_types_primary` across `[gasoline, diesel, electric, hybrid]`  
   - Region: `[p_gasoline, p_diesel, p_electric, p_hybrid]`  
   - Distance: **Manhattan (L1)** between car one-hot and region proportions
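
For the two composition blocks, the L1 distance has a simple closed form. This is a short derivation, assuming unit weights and assuming the region proportions are non-negative and sum to 1 (an assumption about `REGIONAL.csv`, not checked here). Write $e_c$ for the car's one-hot vector with its 1 at class $c$, and $p$ for the region's proportion vector:

$
d_{\text{L1}}(e_c, p) = |1 - p_c| + \sum_{j \neq c} |0 - p_j| = (1 - p_c) + (1 - p_c) = 2\,(1 - p_c)
$

Under those assumptions, ranking regions by this distance (smallest first) is the same as ranking them by the share $p_c$ of the car's own class (largest first).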

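**Final combination** (tunable weights). Conceptually, the three distances combine as:

$
d_{\text{total}} = w_{\text{demo}}\, d_{\text{demo}} + w_{\text{body}}\, d_{\text{body}} + w_{\text{fuel}}\, d_{\text{fuel}},\qquad
\text{Score} = \frac{1}{1 + d_{\text{total}}}.
$

The code cells in this notebook implement a fit-based variant of this idea. The demographic distance is converted to `demo_fit = exp(-k * d_demo)`, with `k` chosen so that the median distance maps to 0.5. The body and fuel terms use the region's share of the car's own class directly. The final value is `score = W_DEMO * demo_fit + W_BODY * p_body + W_FUEL * p_fuel`, where higher is better.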

### Why Manhattan (L1)?

$
d_{\text{L1}}(a,b)=\sum_i w_i\,|a_i-b_i|
$

L1 is robust, interpretable, and works well across heterogeneous features (z-scored numerics, one-hots, and probabilities).

### When Mahalanobis?
On **numeric, correlated subsets** (e.g., demographics) you may use:


$
d_{\text{Maha}}(a,b)=\sqrt{(a-b)^\top\Sigma^{-1}(a-b)}
$


to reduce double counting among correlated features. We keep **L1** for body/fuel because those are compositions/one-hots.
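
The difference between the two metrics is easiest to see on toy data. The sketch below uses synthetic, made-up features (not the notebook's data) in which the second feature is almost a copy of the first. It compares a step that follows the correlation with a step that goes against it. The helper names `maha` and `l1` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 1 is nearly a copy of feature 0 (strong correlation).
x0 = rng.normal(size=500)
x1 = x0 + 0.1 * rng.normal(size=500)
B = np.column_stack([x0, x1])

# Inverse covariance estimated from the sample.
Sigma_inv = np.linalg.pinv(np.cov(B.T))

def maha(a, b):
    """Mahalanobis distance between two points under the sample covariance."""
    diff = a - b
    return float(np.sqrt(diff @ Sigma_inv @ diff))

def l1(a, b):
    """Unweighted Manhattan distance."""
    return float(np.abs(a - b).sum())

origin = np.zeros(2)
along = np.array([1.0, 1.0])     # both features move together
against = np.array([1.0, -1.0])  # features move in opposite directions

print("L1   along / against:", l1(origin, along), l1(origin, against))
print("Maha along / against:", maha(origin, along), maha(origin, against))
```

L1 treats both steps as equally far. Mahalanobis rates the step along the correlation as much closer, because two correlated features moving together carry little extra information.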


In [58]:
import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import Dict, List, Tuple, Optional

In [59]:
USE_MAHALANOBIS_DEMO = False

# Block weights (should sum to 1)
W_DEMO = 0.50
W_BODY = 0.30
W_FUEL = 0.20

In [60]:
# Load data (adjust the paths if needed)
REGIONS_PATH = "REGIONAL.csv"
CARS_PATH    = "rdw_cars_classified.csv"

regions = pd.read_csv(REGIONS_PATH)
cars    = pd.read_csv(CARS_PATH)

print("Regions rows:", len(regions), "| Cars rows:", len(cars))


Regions rows: 5701 | Cars rows: 2801


### Feature Lists & One-Hot Helpers

In [61]:
# Demographic (numeric) features: explicit PAIRS (car_feature -> region_feature)
DEMO_MAP = [
    ("mass_empty_median",          "avg_yearly_income_k"),
    ("pw_ratio_median",            "urbanization"),
    ("seats_median",               "avg_household_size"),
    ("datum_eerste_toelating_year","avg_house_value_woz"),
]

# Derived lists (region-side schema for the demo block)
REG_DEMO = [r for (_, r) in DEMO_MAP]
CAR_DEMO = [c for (c, _) in DEMO_MAP]

# Body mix composition (regional)
BODY_CLASSES = ["compact", "medium", "large", "suv", "mpv", "sports"]
REG_BODY = [f"p_{c}" for c in BODY_CLASSES]

# Fuel mix composition (regional)
FUEL_CLASSES = ["gasoline", "diesel", "electric", "hybrid"]
REG_FUEL = [f"p_{f}" for f in FUEL_CLASSES]

In [62]:
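def map_fuel_column(fuel_str: str) -> str:
    """Map a free-text fuel label to the matching region column (p_<fuel>)."""
    f = str(fuel_str).strip().lower()
    # Check "hybrid" first so labels such as "hybrid electric" are not claimed by another branch
    if "hybrid" in f:
        return "p_hybrid"
    if ("benzine" in f) or ("gas" in f) or ("petrol" in f):
        return "p_gasoline"
    if "diesel" in f:
        return "p_diesel"
    # "elektr" covers the Dutch spelling (e.g. "Elektriciteit"); assumed possible in RDW-derived data
    if ("electric" in f) or ("elektr" in f) or ("ev" in f):
        return "p_electric"
    # Fallback: unrecognized labels are treated as gasoline
    return "p_gasoline"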
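
The body and fuel blocks are described as a car one-hot compared against region proportions. A minimal sketch of that comparison follows. The helper `one_hot` and the proportions in `region_mix` are illustrative assumptions and do not come from the notebook or its data.

```python
import numpy as np

BODY_CLASSES = ["compact", "medium", "large", "suv", "mpv", "sports"]

def one_hot(label, classes):
    """Return a 0/1 vector with a single 1 at the position of `label`.

    Unknown labels give an all-zero vector (hypothetical helper, not part of the notebook).
    """
    key = str(label).strip().lower()
    return np.array([1.0 if key == c else 0.0 for c in classes])

car_body = one_hot("Medium", BODY_CLASSES)

# Made-up region proportions in the same class order; they sum to 1.
region_mix = np.array([0.30, 0.40, 0.10, 0.15, 0.03, 0.02])

# Unweighted L1 distance between the one-hot and the proportions.
d_body = np.abs(car_body - region_mix).sum()
print(car_body, d_body)
```

With proportions that sum to 1, the distance depends only on the share of the car's own class, here `2 * (1 - 0.40)`.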

### Pick a Random Car & Show It

_For now the car is sampled at random; this step still needs to be wired to the front end so it receives the actual car._

In [63]:
# Random car. For a reproducible pick, replace the random seed below with a fixed value, e.g. random_state=42
car = cars.sample(1, random_state=np.random.randint(0, 10_000)).reset_index(drop=True)
display(car[["brand","model","body_class","fuel_types_primary",
             "mass_empty_median","pw_ratio_median","seats_median","datum_eerste_toelating_year"]])


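|   | brand | model | body_class | fuel_types_primary | mass_empty_median | pw_ratio_median | seats_median | datum_eerste_toelating_year |
|---|-------|-------|------------|--------------------|-------------------|-----------------|--------------|-----------------------------|
| 0 | BMW | M 135 I xDrive | Medium | Benzine | 1500.0 | 8.8e-05 | 5.0 | 2020 |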


### Clean Regions & Build Car Vectors

In [64]:
# Keep only regions with complete info we need
need_cols = REG_DEMO + REG_BODY + REG_FUEL + ["pc4"]
R = regions.dropna(subset=need_cols).copy()
for c in need_cols:
    R[c] = pd.to_numeric(R[c], errors="coerce")
R = R.dropna(subset=need_cols).reset_index(drop=True)

# Car numeric "demo-analog" vector (as row DataFrame)
car_demo = pd.DataFrame({
    "mass_empty_median": [pd.to_numeric(car["mass_empty_median"].iloc[0], errors="coerce")],
    "pw_ratio_median":   [pd.to_numeric(car["pw_ratio_median"].iloc[0], errors="coerce")],
    "seats_median":      [pd.to_numeric(car["seats_median"].iloc[0], errors="coerce")],
    "datum_eerste_toelating_year": [pd.to_numeric(car["datum_eerste_toelating_year"].iloc[0], errors="coerce")],
})

### Shared Z-Scaling for Demographics

In [65]:
demo_mean = R[REG_DEMO].mean()
demo_std  = R[REG_DEMO].std().replace(0.0, np.nan)

# Regions Z (n,4)
Z_regions_demo = (R[REG_DEMO] - demo_mean) / demo_std

In [66]:
def build_Z_car_demo(car_df: pd.DataFrame) -> pd.DataFrame:
    vals = {}
    for c_col, r_col in DEMO_MAP:
        c_val = pd.to_numeric(car_df[c_col].iloc[0], errors="coerce")
        mu, sd = demo_mean[r_col], demo_std[r_col]
        z = (c_val - mu) / sd
        vals[r_col] = [z]
    return pd.DataFrame(vals)[REG_DEMO]  # enforce column order

Z_car_demo = build_Z_car_demo(car)
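
To make the shared scaler concrete, the toy example below standardizes one region feature and then pushes a car value through the same region mean and standard deviation, as `build_Z_car_demo` does. The region values are made up for illustration.

```python
import pandas as pd

# Made-up region values for one demographic feature.
R_toy = pd.DataFrame({"avg_household_size": [1.8, 2.1, 2.4, 2.7]})

mu = R_toy["avg_household_size"].mean()  # region mean
sd = R_toy["avg_household_size"].std()   # region sample std (ddof=1, the pandas default)

# Regions standardized with their own statistics: mean 0, std 1.
z_regions = (R_toy["avg_household_size"] - mu) / sd

# The car's seat count is standardized with the *region* statistics.
seats = 5.0
z_car = (seats - mu) / sd

# Per-region absolute gap on this feature (one term of the L1 distance).
gap = (z_regions - z_car).abs()
print(z_regions.round(2).tolist(), round(z_car, 2), gap.round(2).tolist())
```

When the car attribute and its region analogue live on different scales, the car's z-score can land well outside the range of the regions. The gap to every region then shares a large common offset, and only the smaller remainder separates one region from another. Checking the magnitude of `Z_car_demo` against `Z_regions_demo` is a cheap sanity test.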


In [68]:
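# Optional: Mahalanobis on the demo block (both already standardized)
def mahalanobis_distance_row(a_row: np.ndarray, B: np.ndarray, lam: float = 1e-6):
    """Mahalanobis distance from a single row `a_row` (d,) to every row of `B` (n, d)."""
    Sigma = np.cov(B.T, bias=False)               # (d, d) covariance estimated from B
    d = Sigma.shape[0]
    VI = np.linalg.pinv(Sigma + lam * np.eye(d))  # ridge term `lam` keeps the inverse stable
    X = B - a_row[None, :]                        # row-wise differences
    quad = np.sum((X @ VI) * X, axis=1)           # x^T VI x for each row
    return np.sqrt(np.maximum(quad, 0.0))         # clip tiny negatives from round-off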

### Distance Functions (L1 + Composition)

In [67]:
# Distances
def weighted_L1_row_vs_matrix(a_row_df: pd.DataFrame, B_df: pd.DataFrame, weights: np.ndarray):
    a = a_row_df.values.reshape(1, -1)
    B = B_df.values
    delta = np.abs(B - a)
    contrib = delta * weights.reshape(1, -1)
    dist = contrib.sum(axis=1)
    return dist, contrib
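
A small usage example shows the two return values. The function body is repeated in compact form so the snippet runs on its own. The column names `f1`, `f2` and all numbers are made up.

```python
import numpy as np
import pandas as pd

def weighted_L1_row_vs_matrix(a_row_df, B_df, weights):
    # Same logic as the notebook cell, repeated so this snippet is self-contained.
    delta = np.abs(B_df.values - a_row_df.values.reshape(1, -1))
    contrib = delta * weights.reshape(1, -1)
    return contrib.sum(axis=1), contrib

cols = ["f1", "f2"]
a = pd.DataFrame([[0.0, 1.0]], columns=cols)               # one query row
B = pd.DataFrame([[0.0, 1.0], [2.0, 0.0]], columns=cols)   # two candidate rows
w = np.array([0.5, 0.5])

dist, contrib = weighted_L1_row_vs_matrix(a, B, w)
print(dist)     # one distance per candidate row
print(contrib)  # per-row, per-feature share of that distance
```

`contrib` is useful for explaining a ranking, since each row of it sums to the corresponding entry of `dist`.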

### Compute Partial Distances & Combine

## Demographic mapping (car → region) and rationale

We map each **car numeric attribute** to a **region demographic analogue**, then standardize the car value using the region feature’s mean/std, so both sides share the **same 4 columns**.

**Mapping used:**
- `mass_empty_median` → `avg_yearly_income_k` (affordability/vehicle price class proxy)
- `pw_ratio_median` → `urbanization` (performance trims more common in dense/central areas; tune if needed)
- `seats_median` → `avg_household_size` (family size vs seats)
- `datum_eerste_toelating_year` → `avg_house_value_woz` (newer models skew to wealthier areas)


In [69]:
# DEMO distance (L1 or Mahalanobis)
if USE_MAHALANOBIS_DEMO:
    d_demo = mahalanobis_distance_row(
        a_row=Z_car_demo.values[0],
        B=Z_regions_demo.values,
        lam=1e-6
    )
else:
    demo_w = np.full(len(REG_DEMO), 1.0 / len(REG_DEMO))
    d_demo, _ = weighted_L1_row_vs_matrix(Z_car_demo, Z_regions_demo, demo_w)

# Convert distance to a [0,1] "fit" via exponential (median -> 0.5)
median_d = float(np.median(d_demo))
k = (np.log(2.0) / median_d) if median_d > 0 else 1.0
demo_fit = np.exp(-k * d_demo)  # higher = better
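
The exponential conversion is equivalent to `fit = 2 ** (-d / median_d)`, so the median distance always maps to exactly 0.5. A quick check on made-up distances:

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0, 8.0])  # made-up distances, median = 2.0
median_d = float(np.median(d))
k = np.log(2.0) / median_d

fit = np.exp(-k * d)            # higher = better
fit_alt = 2.0 ** (-d / median_d)  # same curve written as a power of 2

print(fit.round(4))
```

Because the curve is anchored on the median of the current distances, the resulting fit is relative to this particular car's distance distribution rather than an absolute scale.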

In [71]:
# BODY probability (as-is)
body_class = str(car["body_class"].iloc[0]).strip().lower()
p_col_body = f"p_{body_class}"
if p_col_body not in R.columns:
    # fallback if body is outside expected set
    p_col_body = "p_compact"
p_body = R[p_col_body].values  # higher = better

# FUEL probability (as-is)
p_col_fuel = map_fuel_column(str(car["fuel_types_primary"].iloc[0]))
p_fuel = R[p_col_fuel].values  # higher = better

# Final score (higher = better)
score = (W_DEMO * demo_fit) + (W_BODY * p_body) + (W_FUEL * p_fuel)

ranked = pd.DataFrame({
    "pc4": R["pc4"].astype(int),
    "demo_fit": demo_fit,
    "p_body": p_body,
    "p_fuel": p_fuel,
    "score": score
}).sort_values("score", ascending=False).reset_index(drop=True)

### Inhabitant-Weighted Mean per PC4

In [75]:
ranked_pc4 = (
    ranked.merge(regions[["pc4", "inhabitants_total"]], on="pc4", how="left")
          .groupby("pc4", as_index=False)
          .apply(lambda g: pd.Series({
              "demo_fit": np.average(g["demo_fit"], weights=g["inhabitants_total"]),
              "p_body": np.average(g["p_body"], weights=g["inhabitants_total"]),
              "p_fuel": np.average(g["p_fuel"], weights=g["inhabitants_total"]),
              "score":  np.average(g["score"],   weights=g["inhabitants_total"])
          }))
          .sort_values("score", ascending=False)
          .reset_index(drop=True)
)
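
The same weighted mean can be computed without `groupby(...).apply`, which may emit a deprecation warning in recent pandas versions. The sketch below uses plain column arithmetic on a made-up table. The column `w_score` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows: PC4 1011 appears twice with different weights.
toy = pd.DataFrame({
    "pc4": [1011, 1011, 1012],
    "score": [0.60, 0.80, 0.50],
    "inhabitants_total": [100, 300, 50],
})

# Weighted mean = sum(w * x) / sum(w), built from two grouped sums.
toy["w_score"] = toy["score"] * toy["inhabitants_total"]
sums = toy.groupby("pc4", as_index=False)[["w_score", "inhabitants_total"]].sum()
sums["score"] = sums["w_score"] / sums["inhabitants_total"]

result = (
    sums[["pc4", "score"]]
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)
print(result)
```

The two approaches also differ when a group's weights sum to zero: `np.average` raises a `ZeroDivisionError`, while the division here yields `NaN` for that group.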


In [76]:
print(f"\nWeighted aggregate likelihoods by PC4 for: "
      f"{car.brand.iloc[0]} {car.model.iloc[0]} "
      f"({car.body_class.iloc[0]}, {car.fuel_types_primary.iloc[0]})\n")

for i, row in ranked_pc4.head(10).iterrows():
    print(f"{i+1:02d}. PC4 {int(row['pc4'])} | "
          f"Score={row['score']:.4f}  "
          f"(demo_fit={row['demo_fit']:.3f}, p_body={row['p_body']:.3f}, p_fuel={row['p_fuel']:.3f})")


Weighted aggregate likelihoods by PC4 for: BMW M 135 I xDrive (Medium, Benzine)

01. PC4 1077 | Score=0.6821  (demo_fit=0.518, p_body=0.884, p_fuel=0.788)
02. PC4 3585 | Score=0.6708  (demo_fit=0.503, p_body=0.857, p_fuel=0.811)
03. PC4 1071 | Score=0.6639  (demo_fit=0.514, p_body=0.821, p_fuel=0.804)
04. PC4 1014 | Score=0.6551  (demo_fit=0.506, p_body=0.829, p_fuel=0.767)
05. PC4 2596 | Score=0.6548  (demo_fit=0.508, p_body=0.786, p_fuel=0.825)
06. PC4 1017 | Score=0.6496  (demo_fit=0.507, p_body=0.792, p_fuel=0.793)
07. PC4 3735 | Score=0.6452  (demo_fit=0.522, p_body=0.756, p_fuel=0.788)
08. PC4 7313 | Score=0.6443  (demo_fit=0.506, p_body=0.787, p_fuel=0.775)
09. PC4 2116 | Score=0.6431  (demo_fit=0.509, p_body=0.748, p_fuel=0.820)
10. PC4 9163 | Score=0.6367  (demo_fit=0.496, p_body=0.701, p_fuel=0.892)
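
As a consistency check, the printed scores can be recomputed from their rounded components with the weights set in the configuration cell. The tuples below are copied from rows 01, 02, and 10 of the output above.

```python
W_DEMO, W_BODY, W_FUEL = 0.50, 0.30, 0.20

# (demo_fit, p_body, p_fuel, reported score) from rows 01, 02 and 10
rows = [
    (0.518, 0.884, 0.788, 0.6821),
    (0.503, 0.857, 0.811, 0.6708),
    (0.496, 0.701, 0.892, 0.6367),
]

recomputed = [W_DEMO * d + W_BODY * b + W_FUEL * f for d, b, f, _ in rows]

for (d, b, f, reported), value in zip(rows, recomputed):
    print(f"recomputed={value:.4f}  reported={reported:.4f}")
```

The values agree up to rounding of the three-decimal components. In these ten rows `demo_fit` stays between 0.496 and 0.522, so the ordering is driven mostly by `p_body` and `p_fuel`.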
