## Install all required libraries

In [1]:
# %%capture
!pip install -r requirements.txt



## Import all required libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import warnings
import json
import os
from joblib import dump, load
from category_encoders import TargetEncoder

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.feature_extraction import FeatureHasher

from xgboost import XGBRegressor
from catboost import CatBoostRegressor

## Import all settings

In [3]:
RPT_PATH = 'data/DataSample.rpt'

os.makedirs('data', exist_ok=True)
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)

## Basic Data Analysis

1. Initial thoughts. In general, valuing a car requires:
    * Car Brand
    * Car Initial Price
    * Car Mileage (Odometer reading)
    * Age of Car - (Current year - year of manufacture)
    * Car Exterior & Interior Condition
    * Trim Level (BadgeDescription)
    * Accident History (Can get from VIN)
    * Market Demand

2. Basic Exploratory Data Analysis
    * There are 62,192 data points with 130 features (64 float64, 60 object, 6 int64).
    * The 130 features can be split into 10 categories.
    * There are 44 data points where NewPrice is less than Sold_Amount. This is plausible: some cars were sold after only 1-2 years, while others are antiques.

3. Identify features to remove:
    * Features that are useless or redundant for predicting the car price
        * MakeCode - redundant with **Make**
        * FamilyCode - redundant with **Model**
        * SeriesModelYear - redundant with **YearGroup** (with minor discrepancies)
        * Identifier columns that carry no pricing signal - 'SequenceNum', 'DriveCode', 'VIN', 'ModelCode' & 'EngineNum'
        * Columns with exactly 1 unique value - 'ImportFlag', 'NormalChargeMins', 'NormalChargeVoltage' & 'TopSpeedElectricEng'
    * Features that would create a degenerate feedback loop and leak price information into the sale price prediction - 'AvgWholesale', 'AvgRetail', 'GoodWholesale', 'GoodRetail', 'TradeMin', 'TradeMax', 'PrivateMax' (kept for analysis only, not for vehicle price prediction)

4. Find features that
    * Could be added to improve the prediction
        * Car history such as accident history, which can be obtained from the VIN (paid service)
        * Appearance of the car (from images)
    * Require further preprocessing
        * Convert date information to year, month, day, and day of week

5. Try ways to embed categorical columns into vectors: 
    * One-hot Encoder (Baseline)
    * Target Encoder
    * Feature Hashing
    * Embedding using a neural network (not used, because training a neural network for each categorical column is too expensive)

6. ML model building and training (the following models need no explicit null handling or normalization)
    * XGBRegressor
    * CatBoostRegressor
    * HistGradientBoostingRegressor

7. Evaluate through slice-based evaluation on 'Make'
8. Perform analysis and add explainability for the features
9. Retrain everything with n_iter=10 (takes about 1 hour)
10. Tidy the code and write proper documentation
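
The sanity check from item 2 (sales above the original new price) can be reproduced with a boolean mask. This is a minimal sketch on made-up rows: only the column names `NewPrice` and `Sold_Amount` come from the dataset, the values are invented for illustration.

```python
import pandas as pd

# Toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "NewPrice":    [30000.0, 25000.0, 18000.0, None],
    "Sold_Amount": [12000.0, 26500.0, 9000.0, 7000.0],
})

# Rows where the car sold for more than its original new price.
# Comparisons against NaN evaluate to False, so rows with a
# missing NewPrice are silently excluded from the count.
above_new = toy[toy["Sold_Amount"] > toy["NewPrice"]]
n_above = len(above_new)
print(n_above)
```

Because missing `NewPrice` values drop out of the comparison, it may be worth reporting the number of missing values next to the count.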

### Brief feature categorization
#### (1) Manufacturer & Model Information - Features related to brand, model, and series identification.

1. Make: Manufacturer (e.g., Holden, Toyota).
2. Model: Car model name (e.g., Commodore, RAV4).
3. MakeCode: Manufacturer code (e.g., HOLD for Holden) - **REDUNDANT (Make)**.
4. FamilyCode: Internal code for the vehicle family (e.g., COMMODO) **REDUNDANT (Model)**.
5. Series: Generation or variant (e.g., VE, VR, ACA33R).
6. SeriesModelYear: Model year of the series (e.g., MY12, MY88, Series IV, ...).
7. BadgeDescription: Trim level (e.g., Omega, Executive).
8. BadgeSecondaryDescription: Secondary trim details (often empty).
9. OptionCategory: Vehicle type (e.g., PASS for passenger, SUV, VAN, BUS).
10. VFactsClass: Market classification (e.g., Passenger, SUV).
11. VFactsSegment: Size category (e.g., Large, Medium).
12. VFactsPrice: Price range (e.g., <$70K).

#### (2) Vehicle Identification & Classification - Unique identifiers and general classification.
13. YearGroup: Model year (e.g., 2008).
14. MonthGroup: Month of production (e.g., 0 = unknown).
15. SequenceNum: Unique identifier for the car - **USELESS**.
16. Description: Detailed description of a car (e.g., "VE Omega Sedan...").
17. CurrentRelease: Is the model current? (F = False, T=True).
18. ImportFlag: Import status (L = locally made) - **USELESS (exactly 1 unique value)**.
19. LimitedEdition: Is it a limited edition? (F = False, T=True).
20. BodyStyleDescription: Body type (e.g., Sedan, Wagon).
21. BodyConfigDescription: Body configuration (empty) - (0.7-850).
22. WheelBaseConfig: Wheelbase type.
23. Roofline: Roof design.
24. ExtraIdentification: Additional identifiers (empty).
25. DriveDescription: Drivetrain (e.g., Rear Wheel Drive).
26. DriveCode: Drivetrain code (e.g., RWD) - **USELESS**.
27. ModelCode: Internal model code (e.g., ACA33R-ANMXKQ) - **USELESS**.
28. BuildCountryOriginDescription: Manufacturing country (e.g., Australia, Japan, Thailand, ...).
29. VIN: Vehicle Identification Number - **USELESS**.

#### (3) Technical Specifications - Engine, transmission, and mechanical details.
30. GearTypeDescription: Transmission type (e.g., Automatic, Manual, Sports Automatic, ...).
31. GearLocationDescription: Gear lever position (e.g., Floor, Dash,Column, ...).
32. GearNum: Number of gears (1-9).
33. DoorNum: Number of doors (2-5).
34. EngineSize: Engine displacement (cc) - (659-7300).
35. EngineDescription: Engine name (e.g., 3.6i, 13B, 800, ...).
36. Cylinders: Number of cylinders (2-12).
37. FuelTypeDescription: Fuel type (e.g., Petrol, Diesel, LPG only, ...).
38. InductionDescription: Aspiration type (e.g., Aspirated, Supercharged, ...).
39. CamDescription: Valve mechanism (e.g., DOHC with VVT, Pushrod, OHC with VVT, ...).
40. EngineTypeDescription: Engine type (e.g., Piston, Piston - Electric OR Rotary).
41. FuelDeliveryDescription: Fuel injection type (e.g., Multi-Point).
42. MethodOfDeliveryDescription: Fuel delivery method (e.g., Electronic, Electronic Sequantial, ...).
43. ValvesCylinder: Valves per cylinder (2-5).
44. EngineCycleDescription: Engine cycle (e.g., 4 Stroke).
45. EngineConfigurationDescription: Engine layout (e.g., V6).
46. EngineLocation: Engine placement (e.g., Front).
47. EngineNum: Engine serial number - **USELESS**
48. FrontTyreSize: Front tire dimensions (e.g., 225/60 R16).
49. RearTyreSize: Rear tire dimensions.
50. FrontRimDesc: Front rim size (e.g., 16x7.0).
51. RearRimDesc: Rear rim size.

#### (4) Dimensions & Weight - Physical measurements and weights.
52. WheelBase: Distance between axles (mm) - (2-4332).
53. Height: Vehicle height (mm).
54. Length: Vehicle length (mm).
55. Width: Vehicle width (mm).
56. KerbWeight: Weight with fluids (kg).
57. TareMass: Empty weight (kg).
58. PayLoad: Maximum load capacity (kg) - (260-2701).
59. SeatCapacity: Number of seats (2-15).
60. FuelCapacity: Fuel tank size (liters) - (32-180).

#### (5) Performance Metrics - Power, torque, and towing.
61. Power: Engine power (kW).
62. PowerRPMFrom: RPM range start for peak power.
63. PowerRPMTo: RPM range end for peak power.
64. Torque: Engine torque (Nm).
65. TorqueRPMFrom: RPM range start for peak torque.
66. TorqueRPMTo: RPM range end for peak torque.
67. Acceleration: 0-100 km/h time (empty here).
68. TowingBrakes: Towing capacity with brakes (kg).
69. TowingNoBrakes: Towing capacity without brakes (kg).
70. TopSpeedElectricEng: Top speed on electric power - **USELESS - exactly 1 unique value**.

#### (6) Fuel & Emissions - Efficiency and environmental impact.
71. RonRating: Fuel octane rating.
72. FuelUrban: Urban fuel consumption (L/100km).
73. FuelExtraurban: Highway fuel consumption (L/100km).
74. FuelCombined: Combined fuel consumption (L/100km).
75. CO2Combined: Combined CO2 emissions (g/km).
76. CO2Urban: Urban CO2 emissions.
77. CO2ExtraUrban: Highway CO2 emissions.
78. EmissionStandard: Compliance standard (empty here).
79. MaxEthanolBlend: Ethanol compatibility (empty here).
80. GreenhouseRating: Environmental rating (1–10).
81. AirpollutionRating: Air pollution score (1–10).
82. OverallGreenStarRating: Overall eco-rating (1–5).

#### (7) Safety & Compliance - Safety ratings and regulatory data.
83. AncapRating: Safety rating (e.g., 4/5).
84. GrossCombinationMass: Total allowable weight (kg) - (1450-9071).
85. GrossVehicleMass: Vehicle max weight (kg) - (970-5670).
86. IsPPlateApproved: Approved for probationary drivers (T/F).

#### (8) Sales & Pricing - Pricing, sales history, and market data.
87. AverageKM: Average odometer reading.
88. GoodKM: Low odometer threshold.
89. AvgWholesale: Average wholesale price **Cannot be used: creates a degenerate feedback loop**.
90. AvgRetail: Average retail price **Cannot be used: creates a degenerate feedback loop**.
91. GoodWholesale: Wholesale price for low-KM cars **Cannot be used: creates a degenerate feedback loop**.
92. GoodRetail: Retail price for low-KM cars **Cannot be used: creates a degenerate feedback loop**.
93. TradeMin: Minimum trade-in value **Cannot be used: creates a degenerate feedback loop**.
94. TradeMax: Maximum trade-in value **Cannot be used: creates a degenerate feedback loop**.
95. PrivateMax: Maximum private sale price **Cannot be used: creates a degenerate feedback loop**.
96. NewPrice: Original new price **This feature is very important; its NaN values cannot simply be imputed**.
97. Colour: Vehicle color.
98. Branch: Sale location (e.g., Perth).
99. SaleCategory: Sale type (e.g., Auction).
100. Sold_Date: Date sold **Also a very important feature; converted into year, month, day, and day of week**.
101. Compliance_Date: Compliance plate date.
102. Age_Comp_Months: Age in months at compliance.
103. KM: Odometer reading at sale.
104. **Sold_Amount: Sale price.**

#### (9) Maintenance & Warranty - Service and warranty terms.
105. WarrantyCustAssist: Roadside assistance duration.
106. FreeScheduledService: Free services (empty here).
107. WarrantyYears: Warranty duration (years).
108. WarrantyKM: Warranty kilometers.
109. FirstServiceKM: First service odometer.
110. FirstServiceMonths: First service time.
111. RegServiceMonths: Regular service interval.

#### (10) Miscellaneous Attributes - Electric/hybrid specs and charging.
112. AltEngEngineType: Alternate engine type (e.g., electric).
113. AltEngBatteryType: Battery type.
114. AltEngCurrentType: Current type (AC/DC).
115. AltEngAmpHours: Battery capacity (Ah).
116. AltEngVolts: Battery voltage.
117. AltEngChargingMethod: Charging method.
118. AltEngPower: Electric motor power (kW).
119. AltEngPowerFrom: Power RPM range start.
120. AltEngPowerTo: Power RPM range end.
121. AltEngTorque: Electric motor torque (Nm).
122. AltEngTorqueFrom: Torque RPM range start.
123. AltEngTorqueTo: Torque RPM range end.
124. AltEngDrive: Electric drivetrain type.
125. NormalChargeMins: Standard charging time - **USELESS (exactly 1 unique value)**.
126. QuickChargeMins: Fast charging time.
127. NormalChargeVoltage: Standard charging voltage - **USELESS (exactly 1 unique value)**.
128. QuickChargeVoltage: Fast charging voltage.
129. KMRangeElectricEng: Electric range (km).
130. ElectricEngineLocation: Electric motor placement.

## Visualize some datapoints

In [4]:
# Read the file with tab delimiter
df = pd.read_csv(RPT_PATH, delimiter='\t', engine='python')

print("DataFrame shape (rows, columns):", df.shape)
df.head(3).T

DataFrame shape (rows, columns): (62192, 130)


| Feature | 0 | 1 | 2 |
|---|---|---|---|
| Make | Holden | Holden | Toyota |
| Model | Commodore | Commodore | RAV4 |
| MakeCode | HOLD | HOLD | TOYO |
| FamilyCode | COMMODO | COMMODO | RAV4 |
| YearGroup | 2008 | 1993 | 2012 |
| MonthGroup | 0 | 7 | 0 |
| SequenceNum | 0 | 41 | 6 |
| Description | VE Omega Sedan 4dr. Auto 4sp 3.6i | VR Executive Wagon 5dr. Auto 4sp 3.8i | ACA33R MY12 CV Wagon 5dr Man 5sp 4x4 2.4i |
| CurrentRelease | F | F | F |
| ImportFlag | L | L | L |


## Identify features to remove

In [5]:
# Identify features with many missing values at each threshold (kept for now, since a null value may itself be meaningful)
for threshold in range(50, 101, 1):
    threshold = threshold / 100
    high_missing_cols = [col for col in df.columns if df[col].isna().mean() > threshold]
    print(f"Columns with >{threshold*100:.1f}% missing values ({len(high_missing_cols)}):", high_missing_cols)

Columns with >50.0% missing values (36): ['SeriesModelYear', 'BadgeSecondaryDescription', 'BodyConfigDescription', 'WheelBaseConfig', 'Roofline', 'ExtraIdentification', 'GrossCombinationMAss', 'PowerRPMFrom', 'TorqueRPMFrom', 'Acceleration', 'WarrantyCustAssist', 'FreeScheduledService', 'AltEngEngineType', 'AltEngBatteryType', 'AltEngCurrentType', 'AltEngAmpHours', 'AltEngVolts', 'AltEngChargingMethod', 'AltEngPower', 'AltEngPowerFrom', 'AltEngPowerTo', 'AltEngTorque', 'AltEngTorqueFrom', 'AltEngTorqueTo', 'AltEngDrive', 'NormalChargeMins', 'QuickChargeMins', 'NormalChargeVoltage', 'QuickChargeVoltage', 'KMRangeElectricEng', 'ElectricEngineLocation', 'TopSpeedElectricEng', 'CO2Urban', 'CO2ExtraUrban', 'EmissionStandard', 'MaxEthanolBlend']
Columns with >51.0% missing values (36): ['SeriesModelYear', 'BadgeSecondaryDescription', 'BodyConfigDescription', 'WheelBaseConfig', 'Roofline', 'ExtraIdentification', 'GrossCombinationMAss', 'PowerRPMFrom', 'TorqueRPMFrom', 'Acceleration', 'Warrant
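
The missing fraction of a column does not depend on the threshold, so it can be computed once and filtered afterwards instead of rescanning the frame for every threshold. The following is a sketch on a toy frame; the column names are borrowed from the dataset, the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Acceleration": [np.nan, np.nan, np.nan, np.nan],
    "Roofline":     [np.nan, np.nan, np.nan, "High"],
    "KM":           [120000.0, np.nan, 45000.0, 80000.0],
    "Make":         ["Holden", "Toyota", "Ford", "Mazda"],
})

# Fraction of missing values per column, computed once and sorted.
missing_frac = toy.isna().mean().sort_values(ascending=False)

# Filter the precomputed fractions for any threshold of interest.
for threshold in (0.5, 0.9):
    cols = missing_frac[missing_frac > threshold].index.tolist()
    print(f">{threshold:.0%} missing ({len(cols)}): {cols}")
```

Printing `missing_frac` itself gives the full picture in one table, which is usually easier to scan than one line per threshold.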

In [6]:
# Get columns with exactly 1 unique value, and columns where every row has a unique value
single_unique_cols = [col for col in df.columns if df[col].nunique() == 1]
fully_unique_cols = [col for col in df.columns if df[col].nunique() == len(df)]

print("Columns with exactly 1 unique value:", single_unique_cols)
print("Columns with full unique value:", fully_unique_cols)

Columns with exactly 1 unique value: ['ImportFlag', 'NormalChargeMins', 'NormalChargeVoltage', 'TopSpeedElectricEng']
Columns with full unique value: []
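
One caveat worth checking: `nunique()` ignores NaN by default, so a column flagged as having exactly one unique value may still carry information through its missingness (one value versus missing). The sketch below, on invented values, separates truly constant columns from those that are constant only among their non-missing rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ImportFlag":       ["L", "L", "L", "L"],             # truly constant
    "NormalChargeMins": [np.nan, 480.0, np.nan, np.nan],  # one value plus NaN
    "DoorNum":          [4, 5, 4, 2],
})

# Default behaviour: NaN is not counted as a value.
constant_ignoring_nan = [c for c in toy.columns if toy[c].nunique() == 1]

# dropna=False counts NaN as its own value.
constant_including_nan = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

print(constant_ignoring_nan)
print(constant_including_nan)
```

Whether a "one value or missing" column is worth keeping is a judgment call; the notebook drops such columns.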


In [7]:
USELESS_COLS = ['MakeCode', 'FamilyCode', 'DriveCode', 
                'ModelCode', 'SequenceNum', 'VIN', 
                'EngineNum', 'SeriesModelYear']
IRRELEVANT_COLS = ['AvgWholesale', 'AvgRetail', 'GoodWholesale', 
                   'GoodRetail', 'TradeMin', 'TradeMax', 'PrivateMax']

COL_TO_REMOVE = USELESS_COLS + IRRELEVANT_COLS + single_unique_cols + fully_unique_cols

df_dropped = df.drop(columns=COL_TO_REMOVE)
print("There is total of", len(COL_TO_REMOVE), "columns to remove")
print("Now, the shape of the dataframe is", df_dropped.shape) 

There is total of 19 columns to remove
Now, the shape of the dataframe is (62192, 111)


## Exploratory Data Analysis
* Goal: extract potentially useful insights from the features.
* Left as future work due to time constraints.

## Data Preprocessing

### Drop some rows and process datetime data

In [8]:
# Drop rows with missing values in 'Sold_Amount'
print("Originally, the shape of the dataframe is", df_dropped.shape)
df_cleaned = df_dropped.dropna(subset=['Sold_Amount'])
print("After removing rows with missing values in 'Sold_Amount', the shape of the dataframe is", df_cleaned.shape)

Originally, the shape of the dataframe is (62192, 111)
After removing rows with missing values in 'Sold_Amount', the shape of the dataframe is (62188, 111)


In [9]:
# Convert date information into numerical features
df_cleaned = df_cleaned.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
if 'Sold_Date' in df_cleaned.columns:
    # Unparseable dates become NaT (and NaN in the derived features)
    sold_date = pd.to_datetime(df_cleaned['Sold_Date'], errors='coerce')
    # Extract datetime features
    df_cleaned['Sold_Year'] = sold_date.dt.year
    df_cleaned['Sold_Month'] = sold_date.dt.month
    df_cleaned['Sold_Day'] = sold_date.dt.day
    df_cleaned['Sold_DayOfWeek'] = sold_date.dt.dayofweek  # Monday=0 ... Sunday=6
    # Drop original Sold_Date
    df_cleaned = df_cleaned.drop(columns='Sold_Date')
print("After processing date information, the shape of dataframe is", df_cleaned.shape)

After processing date information, the shape of dataframe is (62188, 114)
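
The initial thoughts list vehicle age as a key valuation input. Once the sale year is extracted, an age feature can be derived from `YearGroup`. This is a sketch on invented rows; `Age_At_Sale` is a hypothetical column name that is not part of the dataset, and the models in this notebook do not use it.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sold_Date": ["2015-03-14", "not a date", "2013-11-02"],
    "YearGroup": [2008, 1993, 2012],
})

# errors='coerce' turns unparseable strings into NaT instead of raising.
sold = pd.to_datetime(toy["Sold_Date"], errors="coerce")

toy["Sold_Year"] = sold.dt.year
toy["Sold_DayOfWeek"] = sold.dt.dayofweek                 # Monday=0 ... Sunday=6
toy["Age_At_Sale"] = toy["Sold_Year"] - toy["YearGroup"]  # hypothetical feature

print(toy)
```

Tree-based models can in principle learn this difference from `Sold_Year` and `YearGroup` on their own, but an explicit age column may make the relationship easier to pick up and to explain.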


In [10]:
# Identify categorical (object or category) and numerical columns
cat_cols = df_cleaned.select_dtypes(include=['object', 'category']).columns.tolist()
num_cols = df_cleaned.select_dtypes(include=['number']).columns.tolist()
num_cols.remove('Sold_Amount')

print(f"Categorical columns ({len(cat_cols)}):", cat_cols)
print(f"Numerical columns ({len(num_cols)}):", num_cols)


Categorical columns (50): ['Make', 'Model', 'Description', 'CurrentRelease', 'LimitedEdition', 'Series', 'BadgeDescription', 'BadgeSecondaryDescription', 'BodyStyleDescription', 'BodyConfigDescription', 'WheelBaseConfig', 'Roofline', 'ExtraIdentification', 'DriveDescription', 'GearTypeDescription', 'GearLocationDescription', 'EngineDescription', 'FuelTypeDescription', 'InductionDescription', 'OptionCategory', 'CamDescription', 'EngineTypeDescription', 'FuelDeliveryDescription', 'MethodOfDeliveryDescription', 'BuildCountryOriginDescription', 'EngineCycleDescription', 'EngineConfigurationDescription', 'EngineLocation', 'FrontTyreSize', 'RearTyreSize', 'FrontRimDesc', 'RearRimDesc', 'WarrantyCustAssist', 'FreeScheduledService', 'AltEngEngineType', 'AltEngBatteryType', 'AltEngCurrentType', 'AltEngChargingMethod', 'AltEngDrive', 'ElectricEngineLocation', 'EmissionStandard', 'MaxEthanolBlend', 'VFactsClass', 'VFactsSegment', 'VFactsPrice', 'IsPPlateApproved', 'Colour', 'Branch', 'SaleCategor

In [11]:
# Split the data into training (80%) and testing sets (20%)
X_train, X_test, y_train, y_test = train_test_split(
    df_cleaned.drop('Sold_Amount', axis=1),
    df_cleaned['Sold_Amount'],
    test_size=0.2,
    random_state=0,
)

### One-hot encoding method

In [12]:
# max_categories must be chosen carefully: too low loses information, too high produces too many columns.
# Number of categorical columns (out of 50) whose cardinality exceeds max_categories:
#   max_categories = 5  -> 31/50
#   max_categories = 10 -> 21/50 (seems to be a good balance)
#   max_categories = 15 -> 19/50
#   max_categories = 20 -> 17/50
for max_categories in [5,10,15,20]:
    count         = 0
    for col in cat_cols:
        unique_values = X_train[col].nunique()
        if unique_values > max_categories:
            count+=1
    print(f"Total number of columns with unique values exceeding the maximum allowed: {count}")

Total number of columns with unique values exceeding the maximum allowed: 31
Total number of columns with unique values exceeding the maximum allowed: 21
Total number of columns with unique values exceeding the maximum allowed: 19
Total number of columns with unique values exceeding the maximum allowed: 17


In [13]:
# Function to determine and map top categories in training data
## Get top categories in training data
def get_top_categories(df, cat_cols, max_categories=10, file_path='data/top_categories.json'):
    ### Save top_categories_dict to JSON file
    def save_top_categories(top_categories_dict, file_path='data/top_categories.json'):
        with open(file_path, 'w') as f:
            json.dump(top_categories_dict, f)

    ### Load top_categories_dict from JSON file
    def load_top_categories(file_path='data/top_categories.json'):
        with open(file_path, 'r') as f:
            return json.load(f)
        
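    # NOTE: a cached file takes precedence even if df or max_categories changed; delete it to recompute
    if os.path.exists(file_path):
        top_categories_dict = load_top_categories(file_path)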
    else:
        top_categories_dict = {}
        for col in cat_cols:
            value_counts = df[col].value_counts()
            num_unique_values = len(value_counts)

            num_category_retain = max_categories - 1 if num_unique_values > max_categories else num_unique_values - 1
            top_categories = value_counts.head(num_category_retain).index.tolist()

            top_categories_dict[col] = top_categories
        save_top_categories(top_categories_dict, file_path) # Save the dictionary to JSON file
    return top_categories_dict

## Function to map categories using the provided top categories (modifies df in place)
def apply_top_categories(df, top_categories_dict):
    for col, top_categories in top_categories_dict.items():
        # NaN becomes 'Missing', which the next line folds into 'Other'
        # unless 'Missing' is itself one of the retained categories
        df[col] = df[col].fillna('Missing')
        df[col] = df[col].apply(lambda x: x if x in top_categories else 'Other')

    # Verify the changes
    for col in top_categories_dict.keys():
        if df[col].nunique() > len(top_categories_dict[col]) + 1:
            raise Exception(f"Column '{col}' has too many unique values after applying top categories.")
    return df

# Fit OneHotEncoder on training data and transform both datasets
## Function to get the OneHotEncoder (loads the cached encoder if one exists, otherwise fits and saves it)
def get_one_hot_encoder(X_train=None, cat_cols=None):
    if os.path.exists('data/onehot_encoder.joblib'):
        encoder = load('data/onehot_encoder.joblib')
    elif X_train is None or cat_cols is None:
        raise ValueError("X_train and cat_cols must be provided if data/onehot_encoder.joblib does not exist")
    else:
        encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
        encoder.fit(X_train[cat_cols])
        dump(encoder, 'data/onehot_encoder.joblib')  # Save to file

    return encoder
    
def transform_one_hot_encoder(X_data, cat_cols):
    encoder = get_one_hot_encoder()
    X_data_encoded = pd.DataFrame(
        encoder.transform(X_data[cat_cols]),
        columns=encoder.get_feature_names_out(cat_cols),
        index=X_data.index
    )
    return X_data_encoded


# Apply the same top-category mapping to train and test
max_categories = 10
top_categories_dict = get_top_categories(X_train, cat_cols, max_categories=max_categories)

X_train = apply_top_categories(X_train, top_categories_dict)
X_test = apply_top_categories(X_test, top_categories_dict)

# One-hot encode with consistent columns
# Transform train and test sets
encoder = get_one_hot_encoder(X_train, cat_cols)
X_train_OH_encoded = transform_one_hot_encoder(X_train, cat_cols)
X_test_OH_encoded = transform_one_hot_encoder(X_test, cat_cols)

# Combine with numerical columns
X_train_OH_encoded = pd.concat([X_train[num_cols], X_train_OH_encoded], axis=1)
X_test_OH_encoded  = pd.concat([X_test[num_cols], X_test_OH_encoded], axis=1)

# XGBoost rejects feature names containing '[', ']' or '<', so replace those characters
X_train_OH_encoded.columns = [str(col).replace('[', '_').replace(']', '_').replace('<', '_') for col in X_train_OH_encoded.columns]
X_test_OH_encoded.columns = [str(col).replace('[', '_').replace(']', '_').replace('<', '_') for col in X_test_OH_encoded.columns]

# Now X_train_OH_encoded and X_test_OH_encoded have identical columns
print("Total number of columns in X_train_OH_encoded:", len(X_train_OH_encoded.columns))
print("Total number of columns in X_test_OH_encoded:", len(X_test_OH_encoded.columns))
print("Total number of X_test_OH_encoded.columns == X_train_OH_encoded.columns:", (X_test_OH_encoded.columns == X_train_OH_encoded.columns).sum())

Total number of columns in X_train_OH_encoded: 346
Total number of columns in X_test_OH_encoded: 346
Total number of X_test_OH_encoded.columns == X_train_OH_encoded.columns: 346


### Target encoding method

In [14]:
# Initialize the encoder
target_encoder = TargetEncoder(cols=cat_cols, 
                               handle_missing='return_nan', 
                               handle_unknown='value')

# Fit on training data
target_encoder.fit(X_train[cat_cols], y_train)

# Transform both train and test
X_train_cat_encoded = target_encoder.transform(X_train[cat_cols])
X_test_cat_encoded = target_encoder.transform(X_test[cat_cols])

# Concatenate with numerical features
X_train_T_encoded = pd.concat([X_train_cat_encoded, X_train[num_cols]], axis=1)
X_test_T_encoded = pd.concat([X_test_cat_encoded, X_test[num_cols]], axis=1)

# Now X_train_T_encoded and X_test_T_encoded have identical columns
print("Shape in X_train_T_encoded:", X_train_T_encoded.shape)
print("Shape in X_test_T_encoded:", X_test_T_encoded.shape)
print("Total number of X_test_T_encoded.columns == X_train_T_encoded.columns:", (X_test_T_encoded.columns == X_train_T_encoded.columns).sum())


Shape in X_train_T_encoded: (49750, 113)
Shape in X_test_T_encoded: (12438, 113)
Total number of X_test_T_encoded.columns == X_train_T_encoded.columns: 113
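
For intuition, target encoding replaces each category with a statistic of the target computed on the training set. The sketch below implements a simple additive-smoothing variant by hand. The smoothing weight `m` and the formula are illustrative assumptions; they do not reproduce the exact formula used by `category_encoders.TargetEncoder`.

```python
import pandas as pd

train = pd.DataFrame({
    "Make":        ["Holden", "Holden", "Toyota", "Toyota", "Toyota", "Saab"],
    "Sold_Amount": [10000.0, 14000.0, 20000.0, 22000.0, 18000.0, 5000.0],
})
test = pd.DataFrame({"Make": ["Toyota", "Saab", "Tesla"]})  # "Tesla" is unseen

m = 2.0  # hypothetical smoothing weight
global_mean = train["Sold_Amount"].mean()
stats = train.groupby("Make")["Sold_Amount"].agg(["mean", "count"])

# Shrink each category mean toward the global mean; rare categories shrink more.
encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Statistics come from the training set only; unseen categories
# fall back to the global mean.
test["Make_encoded"] = test["Make"].map(encoding).fillna(global_mean)
print(test)
```

Fitting the statistics on the training split only, as the notebook does, keeps test-set prices out of the encoded features.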


### FeatureHasher

In [15]:
# Initialize the feature hasher
# 1000 hashed columns + 63 numerical columns = the 1063 columns reported below
n_features = 1000
hasher     = FeatureHasher(n_features=n_features, input_type='string')

# Apply the hasher to the categorical columns
X_train_cat_hashed = hasher.transform(X_train[cat_cols].astype(str).values)
X_test_cat_hashed = hasher.transform(X_test[cat_cols].astype(str).values)

# Convert the sparse matrix to a dense format (if needed) and create DataFrames
X_train_cat_hashed_df = pd.DataFrame(X_train_cat_hashed.toarray(), columns=[f'hashed_{i}' for i in range(n_features)])
X_test_cat_hashed_df = pd.DataFrame(X_test_cat_hashed.toarray(), columns=[f'hashed_{i}' for i in range(n_features)])

# Concatenate with numerical features
X_train_FH_encoded = pd.concat([X_train_cat_hashed_df, X_train[num_cols].reset_index(drop=True)], axis=1)
X_test_FH_encoded = pd.concat([X_test_cat_hashed_df, X_test[num_cols].reset_index(drop=True)], axis=1)

# Now X_train_FH_encoded and X_test_FH_encoded have identical columns
print("Shape in X_train_FH_encoded:", X_train_FH_encoded.shape)
print("Shape in X_test_FH_encoded:", X_test_FH_encoded.shape)
print("Total number of X_test_FH_encoded.columns == X_train_FH_encoded.columns:", (X_test_FH_encoded.columns == X_train_FH_encoded.columns).sum())


Shape in X_train_FH_encoded: (49750, 1063)
Shape in X_test_FH_encoded: (12438, 1063)
Total number of X_test_FH_encoded.columns == X_train_FH_encoded.columns: 1063
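
Hashing raw values means that identical strings in different columns (for example the same tyre size in `FrontTyreSize` and `RearTyreSize`) land in the same bucket. One common variation, sketched here on invented rows, prefixes each token with its column name before hashing. It illustrates the idea only and is not the configuration behind the reported results.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

toy = pd.DataFrame({
    "FrontTyreSize": ["225/60 R16", "205/55 R16"],
    "RearTyreSize":  ["225/60 R16", "225/60 R16"],
})

# Build "column=value" tokens so equal values in different columns hash separately.
tokens = [
    [f"{col}={val}" for col, val in row.items()]
    for row in toy.astype(str).to_dict("records")
]

# FeatureHasher is stateless, so transform() can be called without fitting.
hasher = FeatureHasher(n_features=16, input_type="string")
hashed = hasher.transform(tokens).toarray()

print(tokens[0])
print(hashed.shape)
```

The small `n_features=16` is only for readability; a realistic value would be much larger to limit collisions.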


## Regression Model building and training

In [18]:
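# Hyperparameter tuning
# n_iter = number of parameter settings sampled per model by RandomizedSearchCV (1 keeps the run short)
n_iter = 1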
param_distributions = {
    'HistGradientBoosting': {
        'model': HistGradientBoostingRegressor(random_state=0),
        'params': {
            'learning_rate': [0.1, 0.05, 0.01],
            'max_iter': [100, 300, 1000],
            'max_depth': [None, 10, 20],
            'l2_regularization': [0.0, 0.1, 1.0]
        }
    },
    'XGBoost': {
        'model': XGBRegressor(objective='reg:squarederror', random_state=0, n_jobs=-1),
        'params': {
            'max_depth': [3, 6, 10],
            'learning_rate': [0.1, 0.05, 0.01],
            'n_estimators': [100, 500, 1000],
            'subsample': [0.8, 0.9, 1.0]
        }
    },
    'CatBoost': {
        'model': CatBoostRegressor(random_state=0, silent=True),
        'params': {
            'depth': [6, 8, 10],
            'learning_rate': [0.1, 0.05, 0.01],
            'iterations': [100, 500, 1000]
        }
    }
}

# Function to tune and save the model
def tune_and_save_model(name, model, params, X_train, y_train, X_test, save_dir="models"):
    print(f"Tuning {name}...")
    search = RandomizedSearchCV(estimator=model,
                                param_distributions=params,
                                n_iter=n_iter,
                                cv=3,
                                scoring='neg_mean_squared_error',
                                n_jobs=-1,
                                random_state=0)
    search.fit(X_train, y_train)
    
    best_model = search.best_estimator_

    # Save model
    os.makedirs(save_dir, exist_ok=True)
    dump(best_model, os.path.join(save_dir, f"{name}_best_model.pkl"))

    # Predict on test
    y_pred = best_model.predict(X_test)
    return best_model, y_pred


# Compute the evaluation metrics 
def evaluate_model(y_test, y_pred, X_test):
    category_metrics = {}  # Initialize a dictionary to store metrics by category
    
    # Loop through each unique 'make_category' in X_test
    for make_category in X_test['Make'].unique().tolist() + ["Overall"]:
        # Filter the data based on 'Make' category
        if make_category == "Overall":
            y_test_slice = y_test
            y_pred_slice = y_pred
        else:
            masked_idx   = X_test['Make'] == make_category
            y_test_slice = y_test[masked_idx]
            y_pred_slice = y_pred[masked_idx]
        
        # Calculate the evaluation metrics
        r2 = r2_score(y_test_slice, y_pred_slice)
        rmse = np.sqrt(mean_squared_error(y_test_slice, y_pred_slice))
        mae = mean_absolute_error(y_test_slice, y_pred_slice)
        
        # Store metrics for each category (Make)
        category_metrics[make_category] = {
            'R2': r2,
            'RMSE': rmse,
            'MAE': mae
        }
    
    return category_metrics  # Return the metrics for each category
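
The same per-make slicing can also be expressed with a `groupby`. This compact sketch uses invented numbers and computes the MAE per make plus an overall entry.

```python
import pandas as pd

results = pd.DataFrame({
    "Make":   ["Holden", "Holden", "Toyota", "Toyota"],
    "y_true": [10000.0, 12000.0, 20000.0, 30000.0],
    "y_pred": [11000.0, 11500.0, 18000.0, 33000.0],
})
results["abs_err"] = (results["y_true"] - results["y_pred"]).abs()

# Mean absolute error per slice, then the overall value as an extra entry.
mae_by_make = results.groupby("Make")["abs_err"].mean()
mae_by_make["Overall"] = results["abs_err"].mean()

print(mae_by_make)
```

Slices with very few rows give unstable metrics, and R² is undefined for a slice with a single row, so it helps to report the slice size next to each metric.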


In [19]:
# Initialize a dictionary to store metrics for each encoding strategy
final_results = {}

# Loop over encoding strategies
for save_dir, x_train, x_test in [('models/One_hot_encoding', X_train_OH_encoded, X_test_OH_encoded), 
                                  ('models/Target_encoding', X_train_T_encoded, X_test_T_encoded), 
                                  ('models/Feature_hashing', X_train_FH_encoded, X_test_FH_encoded)]:
    print("Running for", save_dir.split('/')[-1].replace('_', ' '))
    encoding_metrics = {}
    
    # Loop over each model configuration
    for name, config in param_distributions.items():
        # Train and get predictions
        best_model, y_pred = tune_and_save_model(
            name, config['model'], config['params'],
            x_train, y_train, x_test,
            save_dir=save_dir
        )
        
        # Get metrics for the current model
        category_metrics = evaluate_model(y_test, y_pred, X_test)
        encoding_metrics[name] = category_metrics

    # Convert to structured DataFrame
    results_list = []
    for model_name, makes in encoding_metrics.items():
        for make, metrics in makes.items():
            results_list.append({
                'Model': model_name,
                'Make': make,
                'R²': metrics['R2'],
                'RMSE': metrics['RMSE'],
                'MAE': metrics['MAE']
            })
    
    results_df = pd.DataFrame(results_list)
    
    # Formatting improvements
    results_df = results_df.round(3)
    results_df = results_df.set_index(['Model', 'Make'])
    
    # Store and display
    final_results[save_dir] = results_df

Tuning HistGradientBoosting...
Tuning XGBoost...
Tuning CatBoost...
Tuning HistGradientBoosting...
Tuning XGBoost...
Tuning CatBoost...
Tuning HistGradientBoosting...
Tuning XGBoost...
Tuning CatBoost...


In [35]:
final_results = pd.concat(final_results.values(), 
                       axis=1)
multi_index = [
    ('One-hot encoding', 'R²'),
    ('One-hot encoding', 'RMSE'),
    ('One-hot encoding', 'MAE'),
    ('Target encoding', 'R²'),
    ('Target encoding', 'RMSE'),
    ('Target encoding', 'MAE'),
    ('Feature hashing', 'R²'),
    ('Feature hashing', 'RMSE'),
    ('Feature hashing', 'MAE')
]
final_results.columns = pd.MultiIndex.from_tuples(multi_index)
styled_df = final_results.style.set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#404040'), 
                                     ('color', 'white'),
                                     ('font-weight', 'bold'), 
                                     ('border', '1px solid white')]}
    ])\
    .format({'R²': '{:.2f}', 'RMSE': '{:.2f}', 'MAE': '{:.2f}'})\
    .background_gradient(subset=multi_index, cmap='BuGn')
display(styled_df)

| Model | Make | One-hot encoding R² | One-hot encoding RMSE | One-hot encoding MAE | Target encoding R² | Target encoding RMSE | Target encoding MAE | Feature hashing R² | Feature hashing RMSE | Feature hashing MAE |
|---|---|---|---|---|---|---|---|---|---|---|
| CatBoost | Ford | 0.551 | 4775.175 | 3721.04 | 0.554 | 4760.924 | 3706.198 | 0.556 | 4750.39 | 3709.412 |
| CatBoost | Holden | 0.496 | 5603.638 | 3759.015 | 0.495 | 5611.079 | 3753.311 | 0.496 | 5603.31 | 3746.898 |
| CatBoost | Hyundai | 0.544 | 4742.079 | 3939.713 | 0.537 | 4779.089 | 3975.859 | 0.535 | 4789.042 | 3978.022 |
| CatBoost | Mazda | 0.584 | 5181.942 | 4024.575 | 0.59 | 5144.594 | 4000.696 | 0.584 | 5185.45 | 4022.795 |
| CatBoost | Mitsubishi | 0.624 | 4845.788 | 3816.027 | 0.629 | 4816.713 | 3783.796 | 0.628 | 4820.998 | 3793.88 |
| CatBoost | Nissan | 0.595 | 5389.994 | 3964.009 | 0.594 | 5397.39 | 3964.317 | 0.599 | 5364.627 | 3932.909 |
| CatBoost | Other | 0.599 | 14686.853 | 8074.391 | 0.596 | 14739.52 | 8058.454 | 0.598 | 14705.296 | 8044.373 |
| CatBoost | Overall | 0.619 | 6713.292 | 4299.301 | 0.62 | 6707.36 | 4279.05 | 0.62 | 6705.535 | 4281.604 |
| CatBoost | Subaru | 0.46 | 5679.555 | 4430.89 | 0.488 | 5532.577 | 4261.854 | 0.467 | 5644.902 | 4391.096 |
| CatBoost | Toyota | 0.633 | 6566.659 | 4421.82 | 0.636 | 6539.461 | 4393.309 | 0.635 | 6550.17 | 4397.913 |


## Inference pipeline that shows how new categorical data is handled

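In [None]:
def load_model_and_predict(model_name, X_input, model_dir="models/One_hot_encoding"):
    # Models are saved per encoding strategy, e.g. models/One_hot_encoding/XGBoost_best_model.pkl
    model_path = os.path.join(model_dir, f"{model_name}_best_model.pkl")
    model = load(model_path)
    return model.predict(X_input)

# New data must be encoded exactly like the training data:
# apply_top_categories -> transform_one_hot_encoder -> concat with num_cols -> sanitize column names.
# A few already-encoded test rows stand in for new data here.
X_new_encoded = X_test_OH_encoded.head(5)

y_pred_hist = load_model_and_predict("HistGradientBoosting", X_new_encoded)
y_pred_cat = load_model_and_predict("CatBoost", X_new_encoded)
y_pred_xgb = load_model_and_predict("XGBoost", X_new_encoded)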
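
When a new record contains a category that was not among the retained top categories at training time, the approach used in this notebook maps it to 'Other' before one-hot encoding, so the encoded columns stay identical. Below is a self-contained sketch on invented values; the helper name `collapse` is hypothetical and only mirrors the idea behind `apply_top_categories`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented training values; keep the 2 most frequent makes.
train_makes = pd.Series(["Holden"] * 4 + ["Toyota"] * 3 + ["Saab"])
top = train_makes.value_counts().head(2).index.tolist()

def collapse(values, top_categories):
    """Map anything outside the retained categories (including None) to 'Other'."""
    return [[v if v in top_categories else "Other"] for v in values]

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(collapse(train_makes, top))

# "Tesla" and the missing value were never seen as retained categories.
new_makes = ["Toyota", "Tesla", None]
encoded_new = enc.transform(collapse(new_makes, top)).toarray()

print(enc.categories_[0])
print(encoded_new)
```

Because every unknown value is folded into 'Other' before it reaches the encoder, the model always receives the same set of columns it was trained on.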