
Cloning the GitHub repo

In [None]:
!git clone https://github.com/sof1a03/DSS_groupproject.git

fatal: destination path 'DSS_groupproject' already exists and is not an empty directory.


# Composite Match Score (2-Stage): Car → PC4 Regional Compatibility

This notebook estimates the *regional compatibility* of a given car model with PC4 (four-digit postal code) areas.  
The goal is to find the regions that show the highest likelihood of interest and affordability for the selected car.

The approach combines two stages:

1. **Stage 1 – Interest (Body + Fuel mix):**  
   Measures how well the car's body type and primary fuel type align with regional fleet compositions.

2. **Stage 2 – Affordability Fit:**  
   Measures how well the car’s price and seating capacity align with the region’s income and household structure.

These two components are then combined into a **final composite score**, giving a ranked list of the best-fitting PC4 regions for the selected car.

We use the **Manhattan (L1) distance** throughout because it is robust, interpretable, and suitable for mixed feature types (probabilities, z-scores, one-hots).
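
As a minimal sketch of the distance itself (the feature values below are hypothetical and only illustrate a vector that mixes a share, a z-score, and a one-hot flag), the Manhattan distance is the sum of absolute coordinate differences:

```python
import numpy as np

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

# Hypothetical mixed feature vectors: [share, z-score, one-hot flag]
car_vec = [0.30, 1.2, 1.0]
region_vec = [0.25, 0.4, 0.0]

d_example = manhattan_distance(car_vec, region_vec)
print(round(d_example, 2))  # 0.05 + 0.8 + 1.0
```

Because every coordinate contributes its own absolute difference, it is easy to read off which feature drives a large distance.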


## 1. Setup

We start by importing core libraries and defining constants and weights used in the composite score.

Weights can be tuned later to emphasize certain components — for example, `W_INTEREST_BODY` vs. `W_INTEREST_FUEL` can shift attention between styling preferences and fuel adoption trends.


In [None]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# **TO DO HERE: WHY WE HAVE CHOSEN THESE VALUES**
(We still need to evaluate if this approach works well and if we should fine-tune the parameters)

In [None]:
USE_MAHALANOBIS_DEMO = False

# Weights
W_INTEREST_BODY  = 0.6
W_INTEREST_FUEL  = 0.4
W_FINAL_INTEREST = 0.7
W_FINAL_AFFORD   = 0.3
TOP_QUANTILE     = 0.8  # keep top 20% most interested PC4s

## 2. Feature Definitions and Helper Functions

Each region (PC4) contains aggregated proportions of **body types** and **fuel types**.  
We define helper lists and a simple mapping function to match each car’s attributes to the correct regional columns.


In [None]:
# Body mix composition (regional)
BODY_CLASSES = ["compact", "medium", "large", "suv", "mpv", "sports"]
REG_BODY = [f"p_{c}" for c in BODY_CLASSES]

# Fuel mix composition (regional)
FUEL_CLASSES = ["gasoline", "diesel", "electric", "hybrid"]
REG_FUEL = [f"p_{f}" for f in FUEL_CLASSES]

In [None]:
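def map_fuel_column(fuel_str: str) -> str:
    """Map a (Dutch or English) fuel label to the matching regional share column."""
    f = str(fuel_str).strip().lower()
    if ("benzine" in f) or ("gas" in f) or ("petrol" in f):
        return "p_gasoline"
    if "diesel" in f:
        return "p_diesel"
    # "elektr" covers the Dutch label "Elektrisch", which does not contain "electric"
    if ("electric" in f) or ("elektr" in f) or ("ev" in f):
        return "p_electric"
    if "hybrid" in f:
        return "p_hybrid"
    # Fallback for unrecognized labels
    return "p_gasoline"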

## 3. Load Data

We load the cleaned car dataset (`RDW.csv`) and the regional dataset (`REGIONAL.csv`),
then check basic info to verify data availability and consistency.

Each dataset:
- **Cars:** one row per car model with attributes like body type, price, mass, and power ratio.  
- **Regions:** one row per PC4 area with demographics, income, and fleet composition.


In [None]:
# Load Data -- > fix path if needed
REGIONS_PATH = "/content/DSS_groupproject/Data/Final/REGIONAL.csv"
CARS_PATH    = "/content/DSS_groupproject/Data/Final/RDW.csv"

regions = pd.read_csv(REGIONS_PATH)
cars    = pd.read_csv(CARS_PATH)

print("Regions rows:", len(regions), "| Cars rows:", len(cars))

Regions rows: 5701 | Cars rows: 2767


We pick one car as an example to demonstrate the pipeline (here a fixed model; the commented-out line samples a random car instead).

For the dashboard, replace this with the *user selection*.

In [None]:
# Example car (fixed model); uncomment the next line to sample a random car instead
car = cars[cars["model"]=="Rs 6 Avant Performance"]
#car = cars.sample(1, random_state=np.random.randint(0, 10_000)).reset_index(drop=True)
display(car[["brand","model","body_class","fuel_types_primary","price_z_score"]])

Unnamed: 0,brand,model,body_class,fuel_types_primary,price_z_score
152,AUDI,Rs 6 Avant Performance,Medium,Benzine,1.185878


## 4. Stage 1: Interest (Body + Fuel Composition)

We estimate regional **interest** as a weighted combination of how common the car’s
body and fuel types are in each PC4 area.

$$
\text{Interest Score} = 0.6 \times p_{\text{body}} + 0.4 \times p_{\text{fuel}}
$$
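
The formula can be tried on a toy table first. The sketch below assumes three made-up PC4 rows and a car that is a medium gasoline model; the column names follow the `p_<class>` convention used in this notebook, but all numbers are hypothetical.

```python
import pandas as pd

# Same values as W_INTEREST_BODY / W_INTEREST_FUEL in the setup cell
W_BODY, W_FUEL = 0.6, 0.4

# Hypothetical regional shares for a medium gasoline car
toy_regions = pd.DataFrame({
    "pc4": [1000, 2000, 3000],
    "p_medium": [0.50, 0.20, 0.35],
    "p_gasoline": [0.80, 0.60, 0.90],
})

# Weighted combination of the body share and the fuel share
toy_regions["interest_score"] = (
    W_BODY * toy_regions["p_medium"] + W_FUEL * toy_regions["p_gasoline"]
)

toy_ranking = toy_regions.sort_values("interest_score", ascending=False)
print(toy_ranking)
```

Since both inputs are shares between 0 and 1 and the weights sum to 1, the interest score also stays between 0 and 1.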

**To do:**

- Function to return data suitable for the spider graph (matching score for multiple categories of statistics)
- Function to return additional (unstandardized) data for the results table



In [None]:
body_col = f"p_{car['body_class'].iloc[0].strip().lower()}"
fuel_col = map_fuel_column(car["fuel_types_primary"].iloc[0])

R = regions.dropna(subset=[body_col, fuel_col, "avg_yearly_income_k", "inhabitants_total"]).copy()
R["interest_score"] = (
    (W_INTEREST_BODY * R[body_col]) +
    (W_INTEREST_FUEL * R[fuel_col])
)


**Note for the MatchScore visualization:**  
To visualize all regions as a heatmap, use `R_top = R.copy()`. To keep only the top 20% of regions by interest, uncomment the quantile filter in the following code instead.

In [None]:
# Filter: If we only want to keep top 20% of interest
#q = R["interest_score"].quantile(TOP_QUANTILE)
#R_top = R[R["interest_score"] >= q].copy().reset_index(drop=True)

# Otherwise, to calculate for all PC4 regions
R_top = R.copy()

## 5. Stage 2: Affordability Fit

In this stage, we assess how economically compatible a region is with the selected car model.

We compare:
- Car price (z-score) ↔ Regional average income (z-score)

We use **Manhattan (L1) distance** between this pair and then apply an exponential transformation
so that smaller distances (better matches) produce higher affordability scores.

$$
\text{Affordability Fit} = e^{-k \cdot d}
$$
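
Here $d = |z_{\text{price}} - z_{\text{income}}|$ and, following the code in this section, $k = \ln 2 / \text{median}(d)$, so a region at the median distance receives a fit of 0.5. The sketch below reproduces that calibration on hypothetical z-scores; the helper name `affordability_fit` is introduced only for illustration.

```python
import numpy as np

def affordability_fit(price_z, income_z):
    """Exponential fit in (0, 1], calibrated so the median distance maps to 0.5."""
    income_z = np.asarray(income_z, dtype=float)
    d = np.abs(price_z - income_z)            # distance on the single price/income pair
    median_d = float(np.nanmedian(d))
    k = np.log(2.0) / median_d if median_d > 0 else 1.0
    return np.exp(-k * d)

# Hypothetical regional income z-scores and a car with price z-score 1.2
toy_income_z = [-1.0, 0.0, 1.0, 2.0, 1.2]
toy_fit = affordability_fit(1.2, toy_income_z)
print(np.round(toy_fit, 3))
```

In this toy example the median distance is 0.8, so the region with income z-score 2.0 gets a fit of 0.5 and the exact match gets 1.0.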


In [None]:
# Prepare numeric fields
price_z = pd.to_numeric(car["price_z_score"].iloc[0], errors="coerce")

income_z = pd.to_numeric(R_top["std_avg_yearly_income_k"], errors="coerce")

# Manhattan (L1) distance; with a single feature pair this is the absolute difference
d = np.abs(price_z - income_z)

# Convert to [0,1] fit (median distance → 0.5)
median_d = float(np.nanmedian(d))
k = (np.log(2.0) / median_d) if median_d > 0 else 1.0
R_top["affordability_fit"] = np.exp(-k * d)


## 6. Final Composite Score

We combine the two stages into a final compatibility score per region.

$$
\text{Final Score} = 0.7 \times \text{Interest} + 0.3 \times \text{Affordability Fit}
$$

We then aggregate by PC4 and weight by population size to reflect the importance of larger regions.
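
A minimal sketch of the population-weighted aggregation, assuming a toy table in which one PC4 consists of two sub-areas (all numbers are hypothetical, and the helper `weighted_mean` is introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: PC4 1011 is split into two sub-areas of different size
toy = pd.DataFrame({
    "pc4": [1011, 1011, 1012],
    "final_score": [0.60, 0.80, 0.50],
    "inhabitants_total": [1000, 3000, 2000],
})

def weighted_mean(group, value_col, weight_col="inhabitants_total"):
    """Average of value_col within one group, weighted by weight_col."""
    return float(np.average(group[value_col], weights=group[weight_col]))

# One weighted score per PC4, sorted from best to worst
toy_ranked = (
    pd.Series(
        {pc4: weighted_mean(g, "final_score") for pc4, g in toy.groupby("pc4")},
        name="final_score",
    )
    .rename_axis("pc4")
    .sort_values(ascending=False)
    .reset_index()
)
print(toy_ranked)
```

For PC4 1011 the larger sub-area dominates: (0.60 × 1000 + 0.80 × 3000) / 4000 = 0.75, rather than the unweighted mean of 0.70.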

**Note for visualizing the heatmap:**  
To change the number of groups/colors, adjust `N_BINS` and `group_labels` in the following code.
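
As a sketch of how the grouping responds to a different number of bins, the example below cuts ten hypothetical scores into three quantile groups with `pd.qcut`; the label list must have exactly as many entries as there are bins.

```python
import pandas as pd

# Hypothetical final scores for ten regions
scores = pd.Series([0.34, 0.41, 0.45, 0.47, 0.49, 0.51, 0.53, 0.56, 0.62, 0.76])

# Three groups instead of five: one label per bin
labels_3 = ["Low", "Medium", "High"]
groups = pd.qcut(scores, q=len(labels_3), labels=labels_3)

# Quantile bins hold roughly equal numbers of regions
counts = groups.value_counts()
print(counts)
```

Quantile-based bins give each color a similar number of regions, at the cost of unequal score ranges per group.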

In [None]:
R_top["final_score"] = (
    (W_FINAL_INTEREST * R_top["interest_score"]) +
    (W_FINAL_AFFORD   * R_top["affordability_fit"])
)

# Weighted aggregate by inhabitants
ranked_pc4 = (
    R_top.groupby("pc4", as_index=False)
          .apply(lambda g: pd.Series({
              "interest_score": np.average(g["interest_score"], weights=g["inhabitants_total"]),
              "affordability_fit": np.average(g["affordability_fit"], weights=g["inhabitants_total"]),
              "final_score": np.average(g["final_score"], weights=g["inhabitants_total"])
          }))
          .sort_values("final_score", ascending=False)
          .reset_index(drop=True)
)

print(ranked_pc4[ranked_pc4["pc4"]==2023])
R_top[R_top["pc4"]==2023]


# Creating a visualization for 5 groups for heat-map
N_BINS = 5
group_labels = ["Very Low", "Low", "Medium", "High", "Very High"]
ranked_pc4["score_group_label"] = pd.qcut(
    ranked_pc4["final_score"],
    q=N_BINS,
    labels=group_labels
)
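# Summarize ranges for sanity check
# observed=True keeps only score groups that actually occur (explicit, since the pandas default is changing)
group_summary = (
    ranked_pc4.groupby("score_group_label", observed=True)["final_score"]
    .agg(["count", "min", "max"])
    .reset_index()
)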
print("\nScore group summary:")
display(group_summary)

R_top.columns

    pc4  interest_score  affordability_fit  final_score
0  2023        0.660939           0.991797     0.760197

Score group summary:



Unnamed: 0,score_group_label,count,min,max
0,Very Low,306,0.334209,0.438077
1,Low,306,0.438212,0.472076
2,Medium,305,0.472271,0.505419
3,High,306,0.505491,0.560253
4,Very High,306,0.560547,0.760197


Index(['nbh_code', 'avg_household_size', 'avg_house_value_woz', 'urbanization',
       'std_avg_household_size', 'std_avg_house_value_woz', 'std_urbanization',
       'pc4', 'p_gasoline', 'p_diesel', 'p_electric', 'p_hybrid',
       'avg_yearly_income_k', 'p_car_weight_0_to_850',
       'p_car_weight_851_to_1150', 'p_car_weight_1151_to_1500',
       'p_car_weight_1501_more', 'body_hatchback', 'body_station', 'body_mpv',
       'std_p_gasoline', 'std_p_diesel', 'std_p_electric', 'std_p_hybrid',
       'std_avg_yearly_income_k', 'std_p_car_weight_0_to_850',
       'std_p_car_weight_851_to_1150', 'std_p_car_weight_1151_to_1500',
       'std_p_car_weight_1501_more', 'std_body_hatchback', 'std_body_station',
       'std_body_mpv', 'inhabitants_total', 'p_inhb_15_to_25_year',
       'p_inhb_25_to_45_year', 'p_inhb_45_to_65_year', 'p_inhb_65_year_older',
       'std_p_inhb_15_to_25_year', 'std_p_inhb_25_to_45_year',
       'std_p_inhb_45_to_65_year', 'std_p_inhb_65_year_older', 'p_compact',
 

## 7. Results: Regional Compatibility Ranking

Below we list the top 10 PC4 areas with the highest compatibility for the selected car model.

Each score combines interest (composition fit) and affordability.

In [None]:
print(f"\nRegional compatibility ranking for: "
      f"{car.brand.iloc[0]} {car.model.iloc[0]} "
      f"({car.body_class.iloc[0]}, {car.fuel_types_primary.iloc[0]})\n")

for i, row in ranked_pc4.head(10).iterrows():
    print(f"{i+1:02d}. PC4 {int(row['pc4'])} | "
          f"Final={row['final_score']:.4f}  "
          f"(Interest={row['interest_score']:.3f}, "
          f"Afford={row['affordability_fit']:.3f}, "
          f"Group={row['score_group_label']})")


Regional compatibility ranking for: AUDI Rs 6 Avant Performance (Medium, Benzine)

01. PC4 2023 | Final=0.7602  (Interest=0.661, Afford=0.992, Group=Very High)
02. PC4 7626 | Final=0.7472  (Interest=0.691, Afford=0.878, Group=Very High)
03. PC4 5333 | Final=0.7440  (Interest=0.718, Afford=0.805, Group=Very High)
04. PC4 3585 | Final=0.7437  (Interest=0.839, Afford=0.522, Group=Very High)
05. PC4 1132 | Final=0.7378  (Interest=0.640, Afford=0.965, Group=Very High)
06. PC4 7573 | Final=0.7376  (Interest=0.647, Afford=0.950, Group=Very High)
07. PC4 3583 | Final=0.7341  (Interest=0.676, Afford=0.871, Group=Very High)
08. PC4 8894 | Final=0.7295  (Interest=0.663, Afford=0.885, Group=Very High)
09. PC4 1019 | Final=0.7205  (Interest=0.601, Afford=1.000, Group=Very High)
10. PC4 6704 | Final=0.7173  (Interest=0.628, Afford=0.925, Group=Very High)


# KPIs

To complement the regional compatibility model, we introduce three additional indicators. The first two measure how the selected car model
compares to others in the current market; the third measures how well the car's profile matches the best-scoring region.

These indicators help assess whether a car fits existing consumer preferences or occupies a unique niche in the market.
They build directly on the *Look-Alike recommendation* logic, which ranks new areas or models based on feature similarity.

---

## 1. Popularity score — Relative Fit vs. Market Leader

The **Popularity score** expresses the selected car's sales as a percentage of the sales of the *most popular model* in its class (the model with the highest 2024 sales in the RDW database).

$$
\text{Popularity} = \frac{\text{sales}_{\text{selected}}}{\text{sales}_{\text{leader}}} \times 100
$$


- A score near **100%** means the car sells almost as well as the market leader.
- This helps quantify how “mainstream” or “distinct” a model is within its segment.
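
The ratio can be sketched as a small function. The helper name `popularity_score` and the toy table are assumptions for illustration; the column names `brand`, `model`, `body_class`, and `count_2024` are the ones used elsewhere in this notebook.

```python
import pandas as pd

def popularity_score(cars_df, brand, model, sales_col="count_2024"):
    """Sales of the selected car as a percentage of its class leader's sales."""
    row = cars_df[(cars_df["brand"] == brand) & (cars_df["model"] == model)].iloc[0]
    same_class = cars_df[cars_df["body_class"] == row["body_class"]]
    leader_sales = float(same_class[sales_col].max())
    # Guard against zero or missing leader sales
    if not leader_sales > 0:
        return 0.0
    return float(row[sales_col]) / leader_sales * 100.0

# Hypothetical cars: two compacts and one SUV
toy_cars = pd.DataFrame({
    "brand": ["A", "B", "C"],
    "model": ["One", "Two", "Three"],
    "body_class": ["Compact", "Compact", "SUV"],
    "count_2024": [250, 1000, 400],
})

score_a = popularity_score(toy_cars, "A", "One")
score_b = popularity_score(toy_cars, "B", "Two")
print(score_a, score_b)
```

The class leader always scores 100%, and a car that is alone in its class is its own leader.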

---

## 2. Niche Score — Market Concentration Indicator
The Niche Score quantifies how concentrated or broad a car model’s market position is.
It serves as a proxy for market specialization — indicating whether a vehicle appeals to a wide audience (mass-market) or to a narrow customer group (niche product).

**Rationale**

A car’s niche positioning typically reflects two key aspects:

**Price level:** higher-priced models tend to target smaller, more affluent buyer segments.

**Sales volume:** low sales counts suggest limited reach and more specialized appeal.

To capture this relationship, we combine price (a proxy for exclusivity) and sales volume (a proxy for reach) into a single interpretable index.
This creates a continuous measure in the range 0–100%, where:

**Low scores (0–30%)** indicate mass-market models — affordable cars with wide appeal and high sales.

**Mid-range scores (30–60%)** represent moderately specialized models.

**High scores (60–100%)** identify niche or premium vehicles with limited market reach.
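
A minimal sketch of the index on hypothetical cars, assuming the same construction as the notebook's Niche Score cell: percentile ranks of price and sales, combined with weights of 0.6 (price) and 0.4 (inverse sales).

```python
import pandas as pd

W_PRICE, W_SALES = 0.6, 0.4  # same weights as in the Niche Score calculation cell

# Hypothetical market of four cars
toy_cars = pd.DataFrame({
    "model": ["Budget", "Mainstream", "Premium", "Exotic"],
    "price_z_score": [-1.0, 0.0, 1.0, 2.5],
    "count_2024": [900, 1500, 300, 20],
})

# Percentile ranks in (0, 1]: higher means pricier / better selling
price_pct = toy_cars["price_z_score"].rank(pct=True)
sales_pct = toy_cars["count_2024"].rank(pct=True)

# Higher price and lower sales both push the score up
toy_cars["niche_score"] = 100 * (W_PRICE * price_pct + W_SALES * (1 - sales_pct))
print(toy_cars[["model", "niche_score"]])
```

In this toy market the expensive, low-volume "Exotic" lands in the high band, while "Budget" and "Mainstream" stay at or below 30%.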

---

## 3. Profile Match — Strength of Best Region-Car Match & Data for Radar Graph
The Profile Match quantifies the overall suitability of the selected car for the target region as a single metric from 0% to 100%. It is calculated using cosine similarity, which measures the cosine of the angle between the car's characteristics ($\mathbf{C_{vec}}$) and the region's preference profile ($\mathbf{R_{vec}}$):

$$
\text{Profile Match} = \frac{\mathbf{R_{vec}} \cdot \mathbf{C_{vec}}}{\Vert \mathbf{R_{vec}} \Vert \cdot \Vert \mathbf{C_{vec}} \Vert} \times 100
$$

A score near 100% indicates a strong fit, meaning the car's segment, fuel, weight, price, and seat capacity closely align with the region's demographics and existing market preferences. These five factors are the same ones used to plot the radar graph.

The associated variables are:

**Segment**
- Car segment (binary)
- Region’s estimated preference (0 to 1)

**Propulsion/Fuel**
- Propulsion type (binary)
- Cars in region with same propulsion (0 to 1)

**Weight**
- Car weight (0 to 1)
- Cars in region of same weight group (0 to 1)

**Car & Household Size**
- Number of seats
- Avg. household size (both discretized into five bins)

**Price & Income**
- Average price (0 to 1)
- Average yearly income (0 to 1)
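
The cosine similarity over these five factors can be sketched as follows. The helper name `profile_match` and the two vectors are illustrative assumptions, ordered as segment, fuel, weight, seats, price.

```python
import numpy as np

def profile_match(region_vec, car_vec):
    """Cosine similarity between two profiles, scaled to a percentage."""
    r = np.asarray(region_vec, dtype=float)
    c = np.asarray(car_vec, dtype=float)
    denom = np.linalg.norm(r) * np.linalg.norm(c)
    # A zero vector has no direction, so report no match
    if denom == 0:
        return 0.0
    return float(np.dot(r, c) / denom * 100.0)

# Illustrative profiles: [segment, fuel, weight, seats, price], all in [0, 1]
region_profile = [0.5824, 0.8048, 0.1018, 0.2, 0.2246]
car_profile = [1.0, 1.0, 1.0, 0.6, 0.8822]

match_pct = profile_match(region_profile, car_profile)
print(round(match_pct, 2))
```

Cosine similarity compares the shape of the two profiles rather than their magnitude, so a region whose shares are uniformly half as large as another's would receive the same score.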

---

### Implementation

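We identify the best-selling car within each class (using sales data from `count_2024`) and express the selected car's sales as a percentage of that model's sales.
An earlier distance-based variant, which computed the Manhattan distance between the selected car and this reference model on standardized feature sets, is kept in the archived code at the end of the notebook.
This keeps the KPI interpretable and consistent with other metrics in the dashboard.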


### KPI - Popularity Score Calculation

In [None]:
# Random car example
selected_car = cars.sample(1, random_state=45).reset_index(drop=True)

# Get selected car's details efficiently
car_class = selected_car["body_class"].iloc[0]
selected_car_sales = selected_car["count_2024"].astype(float).iloc[0]
selected_car_model = selected_car["model"].iloc[0]
selected_car_brand = selected_car["brand"].iloc[0]

print(f"Selected car: {selected_car_brand} {selected_car_model} ({car_class})")
print(f"Selected car's sales: {selected_car_sales:,.0f}")

# Filter cars to the same class
same_class_cars = cars[cars["body_class"] == car_class]

# Find the *sales number* of the best-selling car in that class
best_selling_sales = same_class_cars["count_2024"].astype(float).max()

print(f"Best-selling sales in class: {best_selling_sales:,.0f}")

# --- Calculating the Popularity score ---

# Handle division by zero, just in case
if best_selling_sales > 0:
    popularity_score = (selected_car_sales / best_selling_sales) * 100
else:
    popularity_score = 0.0

print(f"\nPopularity score is: {popularity_score:.2f}%")

### KPI - Niche Score Calculation


In [None]:
# Niche Score (price + inverse sales proxy)

#Computing the percentile ranks for price and 2024 sales for all cars
cars["price_pct"] = cars["price_z_score"].rank(pct=True)
cars["sales_pct"] = cars["count_2024"].rank(pct=True)

# Extract the selected car's percentiles from the 'cars' DataFrame
# (match on brand and model, since a model name alone may not be unique)
sel_mask = (cars["brand"] == selected_car["brand"].iloc[0]) & (cars["model"] == selected_car["model"].iloc[0])
car_price_pct = float(cars.loc[sel_mask, "price_pct"].iloc[0])
car_sales_pct = float(cars.loc[sel_mask, "sales_pct"].iloc[0])

# Combining into a single Niche Score (%)
# Higher price + lower sales  → higher niche
W_PRICE = 0.6
W_SALES = 0.4
niche_score = 100 * (W_PRICE * car_price_pct + W_SALES * (1 - car_sales_pct))

# Printing the results
print(f"Niche Score (price + inverse sales): {niche_score:.2f}%")
print(f"Price percentile: {car_price_pct:.3f}")
print(f"Sales percentile: {car_sales_pct:.3f}")

### KPI - Profile Match Calculation

In [None]:
# -----------------------------
# CREATE DICTIONARY FOR CAR VARIABLES
# -----------------------------
match_cols_dict = {
    "Segment": {
        "Compact": "p_compact",
        "Medium": "p_medium",
        "Large": "p_large",
        "SUV": "p_suv",
        "MPV": "p_mpv",
        "Sports": "p_sports"
    },
    "Fuel": {
        "Benzine": "p_gasoline",
        "Diesel": "p_diesel",
        "Elektrisch": "p_electric",
        "Hybride": "p_hybrid"
    },
    "Weight": {
        "0,850": "p_car_weight_0_to_850",
        "851,1150": "p_car_weight_851_to_1150",
        "1151,1500": "p_car_weight_1151_to_1500",
        "1501,10000": "p_car_weight_1501_more"
    }
}

# -----------------------------
# POSTAL CODE (REGION)
# -----------------------------
# Look up the PC4 with the highest final score in the ranking itself;
# ranked_pc4 has its own index, so its idxmax() is not a row label of `regions`.
R_highest_score_code = ranked_pc4.loc[ranked_pc4["final_score"].idxmax(), "pc4"]
R_highest_score_pc4 = regions[regions["pc4"] == R_highest_score_code].mean(numeric_only=True)

# -----------------------------
# CAR STATISTICS
# -----------------------------
C_selected_body = car.body_class.iloc[0]
C_selected_fuel = car.fuel_types_primary.iloc[0]
C_selected_weight = car.mass_empty_median.iloc[0]
C_selected_seats = car.seats_median.iloc[0]
C_selected_price = car.price_z_score.iloc[0]

# Map weight to column
weight_keys = list(match_cols_dict["Weight"].keys())
weight_ranges = [list(map(int, k.split(","))) for k in weight_keys]

for i, (w_min, w_max) in enumerate(weight_ranges):
    if w_min <= C_selected_weight <= w_max:
        R_col_weight = match_cols_dict["Weight"][weight_keys[i]]
        break

# Map body and fuel to columns
R_col_segment = match_cols_dict["Segment"][C_selected_body]
R_col_fuel = match_cols_dict["Fuel"][C_selected_fuel]

# Create final car stats dictionary
C_cor_stats = {
    "segment": 1,
    "fuel": 1,
    "weight": 1,
    "seats": C_selected_seats,
    "price": C_selected_price
}


# -----------------------------
# CREATE FINAL REGIONAL DATASET
# -----------------------------
R_cor_stats = {
    "segment": R_highest_score_pc4[R_col_segment],
    "fuel": R_highest_score_pc4[R_col_fuel],
    "weight": R_highest_score_pc4[R_col_weight],
    "seats": R_highest_score_pc4["avg_household_size"],
    "price": R_highest_score_pc4["avg_yearly_income_k"]
}

# -----------------------------
# BIN SEATS AND HOUSEHOLD
# -----------------------------
# Define the labels
labels = [0.2, 0.4, 0.6, 0.8, 1.0]

# Five equal-width bins over 1-7 (household size) and 1-9 (seats)
house_bins = [1.0, 2.2, 3.4, 4.6, 5.8, 7.0]  # for avg_household_size
seat_bins = [1.0, 2.6, 4.2, 5.8, 7.4, 9.0]   # for seats_median

# Replace numeric values with categories
# For regional stats (household size)
R_cor_stats["seats"] = pd.cut(
    [R_cor_stats["seats"]],  # wrap in list so pd.cut works on single value
    bins=house_bins,
    labels=labels,
    include_lowest=True
)[0]  # extract the single value from the resulting Categorical

# For car stats (number of seats)
C_cor_stats["seats"] = pd.cut(
    [C_cor_stats["seats"]],
    bins=seat_bins,
    labels=labels,
    include_lowest=True
)[0]

# Example: print results
print("Regional categories:", R_cor_stats)
print("Car categories:", C_cor_stats)


# -----------------------------
# NORMALIZE NUMERIC VALUES 0-1
# -----------------------------
# --- Car ---
from scipy.stats import norm
C_cor_stats["price"] = norm.cdf(C_cor_stats["price"])


# --- Region ---
R_min = regions["avg_yearly_income_k"].min()
R_max = regions["avg_yearly_income_k"].max()
R_cor_stats["price"] = (R_cor_stats["price"] - R_min) / (R_max - R_min)



# -----------------------------
# Pretty print results
# -----------------------------
def print_stats(title, stats_dict):
    print(f"--- {title} ---")
    for k, v in stats_dict.items():
        # If it's a Series, convert to scalar
        if isinstance(v, pd.Series):
            v = v.iloc[0]
        print(f"{k:8}: {v:.4f}" if isinstance(v, (int, float, np.float64)) else f"{k:8}: {v}")
    print()

# Print nicely
print_stats("Regional Stats", R_cor_stats)
print_stats("Car Stats", C_cor_stats)


# -----------------------------
# PROFILE FIT - COSINE SIMILARITY
# -----------------------------
features = ["segment", "fuel", "weight", "seats", "price"]
R_vec = np.array([R_cor_stats[f] for f in features])
C_vec = np.array([C_cor_stats[f] for f in features])

# --- Cosine similarity ---
profile_match = np.dot(R_vec, C_vec) / (np.linalg.norm(R_vec) * np.linalg.norm(C_vec)) * 100
print("Profile Match (cosine similarity):", profile_match)


Regional categories: {'segment': np.float64(0.5823789162550271), 'fuel': np.float64(0.80485), 'weight': np.float64(0.101841), 'seats': np.float64(0.2), 'price': np.float64(35.2)}
Car categories: {'segment': 1, 'fuel': 1, 'weight': 1, 'seats': np.float64(0.6), 'price': np.float64(1.185878064775364)}
--- Regional Stats ---
segment : 0.5824
fuel    : 0.8048
weight  : 0.1018
seats   : 0.2000
price   : 0.2246

--- Car Stats ---
segment : 1.0000
fuel    : 1.0000
weight  : 1.0000
seats   : 0.6000
price   : 0.8822

Profile fit (cosine similarity): 85.17922271465615


## Archived/Unused code


In [None]:
'''
# Defining the features for comparison
FEATURES = ["price_z_score", "seats_median", "pw_ratio_median"]
#FEATURE_WEIGHTS = np.ones(len(FEATURES))  # change to weight features, e.g. [0.7,0.2,0.1]
FEATURE_WEIGHTS = [0.7,0.2,0.1]

# --- Prepare numeric arrays ---
def get_feat_vector(df, features):
    return df[features].apply(pd.to_numeric, errors="coerce").iloc[0].values

vec_selected = get_feat_vector(selected_car, FEATURES)
vec_best = get_feat_vector(best_selling_car, FEATURES)
X_all = same_class_cars[FEATURES].apply(pd.to_numeric, errors="coerce").fillna(same_class_cars[FEATURES].median())

# --- Manhattan distance helper ---
def L1(a, b):
    return np.sum(np.abs(a - b))

# --- Match distance (to best-selling car) ---
d_selected_vs_best = L1(vec_selected, vec_best)

# --- Normalize and compute Match Score ---
distances_to_best = np.array([L1(get_feat_vector(cars.iloc[[i]], FEATURES).astype(float), vec_best)
                              for i in same_class_cars.index])
max_d = np.nanmax(distances_to_best)
match_score = (1 - d_selected_vs_best / max_d) * 100

# --- Niche score (normalized 0–100) ---
dist_selected_to_all = np.array([L1(vec_selected, X_all.iloc[i].values) for i in range(len(X_all))])
all_pairwise = np.array([L1(X_all.iloc[j].values, X_all.iloc[k].values)
                         for j in range(len(X_all)) for k in range(j+1, len(X_all))])

mean_d = np.nanmean(dist_selected_to_all)
min_d, max_d = np.nanmin(all_pairwise), np.nanmax(all_pairwise)
niche_score = ((mean_d - min_d) / (max_d - min_d)) * 100

print(f"\n--- {selected_car.brand.iloc[0]} {selected_car.model.iloc[0]} ---")
print(f"Match Score (vs. {best_selling_car.model.iloc[0]}): {match_score:.2f}%")
print(f"Niche Score (market uniqueness): {niche_score:.2f}%")
'''


In [None]:
# -----------------------------
# KPIs: ProfileMatch, Popularity & Niche Score (robust)
# -----------------------------
import numpy as np

# FEATURES: use z-scored price and raw seats and other standardized numeric features
FEATURES = ["price_z_score", "seats_median", "pw_ratio_median"]  # adjust to your available standardized features
FEATURE_WEIGHTS = np.ones(len(FEATURES))  # change to weight features, e.g. [0.7,0.2,0.1]

# Selected car (already defined earlier), and filter to same class
selected_car = selected_car.reset_index(drop=True)
car_class = selected_car["body_class"].iloc[0]
same_class_cars = cars[cars["body_class"] == car_class].copy().reset_index(drop=True)

# Best-selling car in class (count_2024)
best_idx = same_class_cars["count_2024"].astype(float).idxmax()
best_selling_car = same_class_cars.loc[[best_idx]].reset_index(drop=True)

# Prepare numeric feature matrix for the class
X = same_class_cars[FEATURES].apply(pd.to_numeric, errors="coerce")
# simple imputation: fill NaNs with column median (document this choice)
X = X.fillna(X.median())

# Convert to numpy matrix
X_mat = X.values.astype(float)                 # shape (n_class, n_features)
weights = np.asarray(FEATURE_WEIGHTS, dtype=float).reshape(1, -1)  # shape (1, n_features)

# helper: vectorized weighted L1 distances from a single vector to rows of X_mat
def l1_distances_vec(a_vec, B_mat, weights=None):
    if weights is None:
        return np.sum(np.abs(B_mat - a_vec), axis=1)
    return np.sum(np.abs(B_mat - a_vec) * weights.reshape(1, -1), axis=1)

# selected and best vectors (same feature ordering)
sel_vec = selected_car[FEATURES].iloc[0].astype(float).values.reshape(1, -1)[0]
best_vec = best_selling_car[FEATURES].iloc[0].astype(float).values.reshape(1, -1)[0]

# 1) Distances from selected & best to all cars in same class
d_sel_to_all  = l1_distances_vec(sel_vec,  X_mat, weights=weights)
d_best_to_all = l1_distances_vec(best_vec, X_mat, weights=weights)

mean_d_sel  = float(np.mean(d_sel_to_all))
mean_d_best = float(np.mean(d_best_to_all))



# Profile Match: Q-Correlation Coefficient based on six linked variables

Q_dictionary = {

}


# Popularity Score: normalized similarity to class leader
# We convert distance to similarity in 0..100 scale using max observed distance to the best
max_d_to_best = np.max(d_best_to_all) if d_best_to_all.size > 0 else 0.0
if max_d_to_best == 0.0:
    match_score = 100.0  # degenerate: all identical
else:
    # smaller d_sel -> higher match; we cap in [0,100]
    match_score = max(0.0, min(100.0, (1 - (np.sum(np.abs(sel_vec - best_vec) * FEATURE_WEIGHTS) / max_d_to_best)) * 100.0))



# Niche Score: compare mean distance of selected car vs distribution of pairwise distances
n = X_mat.shape[0]
PAIRWISE_SAMPLE_LIMIT = 2000  # if n large sample to avoid O(n^2)
if n <= 800:                  # small n -> compute exact pairwise
    # full upper triangle pairwise distances
    diffs = np.abs(X_mat[:, None, :] - X_mat[None, :, :])  # shape n x n x f
    pairwise = diffs.sum(axis=2)
    iu = np.triu_indices(n, k=1)
    pairwise_vals = pairwise[iu]
else:
    # sample subset of rows for pairwise estimate
    sample_n = min(PAIRWISE_SAMPLE_LIMIT, n)
    idx = np.random.RandomState(42).choice(n, size=sample_n, replace=False)
    S = X_mat[idx]
    diffs = np.abs(S[:, None, :] - S[None, :, :])
    pairwise = diffs.sum(axis=2)
    iu = np.triu_indices(pairwise.shape[0], k=1)
    pairwise_vals = pairwise[iu]

# guard against empty pairwise (n small)
if pairwise_vals.size == 0:
    pair_min = 0.0
    pair_max = 1.0
    pair_mean = mean_d_sel
else:
    pair_min = float(np.min(pairwise_vals))
    pair_max = float(np.max(pairwise_vals))
    pair_mean = float(np.mean(pairwise_vals))

# scale mean_d_sel into 0–100 using observed pairwise range
if pair_max - pair_min == 0:
    niche_score = 0.0
else:
    niche_score = ((mean_d_sel - pair_min) / (pair_max - pair_min)) * 100.0
    niche_score = max(0.0, min(100.0, niche_score))

# Print results
print(f"Selected car: {selected_car.brand.iloc[0]} {selected_car.model.iloc[0]} ({car_class})")
print(f"Best-selling in class: {best_selling_car.brand.iloc[0]} {best_selling_car.model.iloc[0]}")
print(f"Match Score: {match_score:.2f}%")
print(f"Niche Score: {niche_score:.2f}%")
print(f"(mean_d_sel={mean_d_sel:.4f}, mean_d_best={mean_d_best:.4f}, pair_mean={pair_mean:.4f})")
