# Decision Trees

Use a Decision Tree model and the _AutoTrader_ May 2023 dataset to predict how long it will take a newly listed car to sell.
This is intended to be an experiment to set a baseline for a Random Forest approach.

## Workflow

1. Simple data cleaning / preprocessing - drop NaNs for this example
2. Build and Train Decision Tree model
3. Evaluate the Model
4. **Feature Importance** - identify which features most influence the splits in the decision tree

In [178]:
import pandas as pd

df = pd.read_parquet("./data/rate_of_sale_may_2023.snappy.parquet")
TARGET_FEATURE = "days_to_sell"
df.head()

| Unnamed: 0 | stock_item_id | last_date_seen | first_date_seen | days_to_sell | first_retailer_asking_price | last_retailer_asking_price | can_home_deliver | reviews_per_100_advertised_stock_last_12_months | segment | seats | ... | odometer_reading_miles | first_registration_date | attention_grabber | manufacturer_approved | price_indicator_rating | adjusted_retail_amount_gbp | predicted_mileage | number_of_images | advert_quality | postcode_area |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 52ae009b671ab58b3d4ff109a9fbdcf8d847de0fa190e1... | 2023-05-05 | 2021-03-25 | 771 | 6995 | 6495.0 | False | 3.9 | Independent | 5.0 | ... | 65000 | 2004-05-07 | \*IMMACULATE\*\*FULL HISTORY\* | False | NOANALYSIS | | | 50 | | AL |
| 1 | 32b1bac6934b1f64ff43cffa9df5aa296ead8143c36f9f... | 2023-05-09 | 2021-05-25 | 714 | 13725 | 14995.0 | False | | Franchise | 5.0 | ... | 16018 | 2019-11-30 | Sports Styling \| Great Economy | True | GOOD | 14848.0 | 26078.0 | 15 | 57.0 | HP |
| 2 | 21703d22d87eaa95c4dc81a60ba2c8cbe3b90ab659292c... | 2023-05-12 | 2021-11-26 | 532 | 15499 | 13999.0 | False | 0.2 | Independent | 5.0 | ... | 31093 | 2018-03-08 | Sat Nav,Leather,Auto,Euro 6 | False | GREAT | 14571.0 | 34732.0 | 22 | 61.0 | SR |
| 3 | 661acafc271373946cea7d30ac7f34257404ab89a1ad33... | 2023-05-16 | 2022-02-17 | 453 | 10995 | 9995.0 | False | 7.9 | Franchise | 5.0 | ... | 79000 | 2015-07-02 | Viewing by APPOINTMENT ONLY | False | FAIR | 9349.0 | 65684.0 | 30 | 61.0 | FY |
| 4 | 638216dc92410d965b416fea5b3cec9ca903368795fdde... | 2023-05-04 | 2022-03-21 | 409 | 46000 | 37500.0 | False | 6.8 | Franchise | 5.0 | ... | 10214 | 2022-03-03 | Reserve Online | True | GOOD | 37055.0 | 11765.0 | 22 | 48.0 | LE |


## Preprocessing
### Handle missing values
* Drop columns which aren't interesting/useful
    * `generation` - removed (for now), since the plate gives a good enough estimate without extra processing
    * `derivative` - same rationale as above
    * `first_registration_date` - use the plate instead; the date is too precise and requires extra processing (for now)
    * `postcode_area` - requires extra processing; add in later after initial tests
* Drop null values, except in these columns, which are imputed instead
    * `reviews_per_100_advertised_stock_last_12_months`: $33,266$ missing => replace with $0.0$
    * `zero_to_sixty_mph_seconds`: $143,152$ missing => replace with $-1$. Probably not missing at random - slower 0-60 times are less likely to be advertised
    * `advert_quality`: $43,932$ missing => replace with $0.0$
    * `colour`: $2,072$ missing => replace with `"Black"`
* Remove EVs from the model (incl. plug-in hybrids)

### Remove outliers
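Numerical outliers are defined as values outside the range $[\hat\mu - 2\hat\sigma, \hat\mu + 2\hat\sigma]$, where $\hat\mu$ and $\hat\sigma$ are the sample mean and standard deviation of each numerical feature. The target, `days_to_sell`, is excluded from this filter.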

In [179]:
columns_to_drop = [
    "stock_item_id",
    "last_date_seen",
    "last_retailer_asking_price",
    "generation",
    "derivative",
    "derivative_id",
    "first_registration_date",
    "attention_grabber",
    "postcode_area",
    "first_date_seen"
]

# drop excluded features
df = df.drop(columns_to_drop, axis=1)

# Replace missing values for specified columns with custom values
df['reviews_per_100_advertised_stock_last_12_months'] = df['reviews_per_100_advertised_stock_last_12_months'].fillna(0.0)
df['zero_to_sixty_mph_seconds'] = df['zero_to_sixty_mph_seconds'].fillna(-1)
df['advert_quality'] = df['advert_quality'].fillna(0.0)
df['colour'] = df['colour'].fillna("Black")

# drop EVs
excluded_fuel_types = ['Electric', 'Petrol Plug-in Hybrid', 'Diesel Plug-in Hybrid']
df = df[~df['fuel_type'].isin(excluded_fuel_types)]
df = df.drop(["battery_range_miles", "battery_usable_capacity_kwh"], axis=1)

In [180]:
# drop numerical outliers
numerical_features = [col for col in df.columns if df[col].dtype in ['int64', 'float64'] and col != 'days_to_sell']

means = df[numerical_features].mean()
stds = df[numerical_features].std()

# Define the upper and lower bounds for each feature
lower_bounds = means - 2 * stds
upper_bounds = means + 2 * stds

# Keep only rows within ±2 standard deviations of the mean for every numerical feature.
# Comparisons against NaN are False, so rows with a missing numerical value are dropped here too.
within_bounds = (df[numerical_features] >= lower_bounds) & (df[numerical_features] <= upper_bounds)
df = df[within_bounds.all(axis=1)]

# drop other NaNs
df = df.dropna()

### Feature Encoding
* Binary values left as-is
* Categories all one-hot encoded
* Numerical features scaled
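The encoding in this section is applied to the whole dataframe before the train/test split, so the scaler sees the test rows. For a decision tree this is harmless in practice, because splits are unaffected by rescaling a feature, but it matters for models that are scale-sensitive. As a possible alternative, the sketch below wraps the same three rules in a scikit-learn `Pipeline` that is fitted on the training split only. The toy dataframe and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the listings table (values are invented, not from the dataset).
toy = pd.DataFrame({
    "segment": ["Independent", "Franchise", "Franchise", "Independent"] * 5,
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel"] * 5,
    "can_home_deliver": [False, True, False, True] * 5,
    "odometer_reading_miles": [65000, 16018, 31093, 79000] * 5,
    "number_of_images": [50, 15, 22, 30] * 5,
    "days_to_sell": [40, 12, 25, 60] * 5,
})

categorical = ["segment", "fuel_type"]
numerical = ["odometer_reading_miles", "number_of_images"]

preprocess = ColumnTransformer(
    transformers=[
        # Categories are one-hot encoded; unseen categories at predict time are ignored.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical),
        # Numerical features are standardised.
        ("numerical", StandardScaler(), numerical),
    ],
    remainder="passthrough",  # binary columns are left as-is
    sparse_threshold=0,       # always return a dense array
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=42)),
])

X = toy.drop(columns="days_to_sell")
y = toy["days_to_sell"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The encoder and scaler are fitted on X_train only, then applied to X_test.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)
```

Because the preprocessing lives inside the pipeline, the same object can be passed to `cross_val_score`, which then refits the encoder and scaler within each fold.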

In [181]:
from sklearn.preprocessing import StandardScaler

binary_features = [
    "can_home_deliver",
    "manufacturer_approved"
]


categorical_features = [
    "segment",
    "make",
    "model",
    "body_type",
    "fuel_type",
    "transmission_type",
    "drivetrain",
    "colour",
    "price_indicator_rating",
]

# Identify the numerical columns before encoding, so the one-hot dummy columns are not scaled
numerical_features = [col for col in df.columns if col not in ('days_to_sell',) + tuple(categorical_features) + tuple(binary_features)]

df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)

scaler = StandardScaler()
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])

### Test-train split

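To try both regression and classification, two splits are performed, each with an 80/20 train/test ratio. Both use the same feature matrix and `random_state`, so they contain the same rows and differ only in the target.

Since the vast majority of listings sell within 100 days, listings that took longer than 100 days are removed from the regression training set. The regression test set is left unfiltered.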

In [182]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('days_to_sell', axis=1)

# regression split
y_regression = df_encoded['days_to_sell']
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X, y_regression, test_size=0.2, random_state=42)

# Combine X_train and y_train back to filter efficiently
train_data = X_train_reg.copy()
train_data['days_to_sell'] = y_train_reg

# Filter
train_data_filtered = train_data[train_data['days_to_sell'] <= 100]

# Split the filtered training data back into X_train and y_train
X_train_reg = train_data_filtered.drop('days_to_sell', axis=1)
y_train_reg = train_data_filtered['days_to_sell']

# classification split
bins = [-1, 7, 14, 21, 28, 60, 90, float('inf')]
labels = ['Within a week', "1-2 weeks", "2-3 weeks", "3-4 weeks", "Within two months", "Within three months", "3+ months"]
df_encoded['days_to_sell_category'] = pd.cut(df_encoded['days_to_sell'], bins=bins, labels=labels)

y_classification = df_encoded['days_to_sell_category']

# Splitting dataset for classification
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(X, y_classification, test_size=0.2, random_state=42)
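To make the bin edges concrete, the short sketch below applies the same `bins` and `labels` to a handful of made-up `days_to_sell` values. `pd.cut` uses right-closed intervals by default, so a value of exactly 7 falls into "Within a week" and 8 into "1-2 weeks".

```python
import pandas as pd

bins = [-1, 7, 14, 21, 28, 60, 90, float("inf")]
labels = ["Within a week", "1-2 weeks", "2-3 weeks", "3-4 weeks",
          "Within two months", "Within three months", "3+ months"]

# Made-up example values chosen to sit on or next to the bin edges.
example_days = pd.Series([0, 7, 8, 28, 29, 90, 91, 771])

# Intervals are (-1, 7], (7, 14], ..., (90, inf), i.e. closed on the right.
example_categories = pd.cut(example_days, bins=bins, labels=labels)

print(pd.DataFrame({"days_to_sell": example_days, "category": example_categories}))
```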



## Training a Decision Tree Regressor

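The regressor is evaluated with k-fold cross-validation. Note that the call below runs on the filtered training set concatenated with the unfiltered test set, so the test rows also take part in the folds.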

In [188]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
import numpy as np

k = 5
model = DecisionTreeRegressor(random_state=42)

# Perform k-fold CV and calculate MSE scores
cv_mse_scores = cross_val_score(model, pd.concat([X_train_reg, X_test_reg]), pd.concat([y_train_reg, y_test_reg]), cv=k, scoring='neg_mean_squared_error')

# Convert scores to positive MSE scores
cv_mse_scores_positive = -cv_mse_scores

# Average the MSE across folds; the RMSE below is the square root of that average, not the mean of the per-fold RMSEs
average_cv_mse = np.mean(cv_mse_scores_positive)
average_cv_rmse = np.sqrt(average_cv_mse)

print(f"Average MSE across {k} folds: {average_cv_mse}")
print(f"Average RMSE across {k} folds: {average_cv_rmse}")

Average MSE across 5 folds: 941.4103542996718
Average RMSE across 5 folds: 30.682411155247753
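For comparison, a minimal sketch of cross-validating on the training split only, so that the held-out test set stays untouched until the final evaluation. It uses synthetic data from `make_regression` rather than the listings data, and all variable names ending in `_demo` are hypothetical. It also shows the two ways of summarising RMSE across folds, which generally give slightly different numbers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features and target.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model_demo = DecisionTreeRegressor(random_state=42)

# Cross-validate on the training split only; the test split is not used here.
fold_mse = -cross_val_score(
    model_demo, X_train_demo, y_train_demo, cv=5, scoring="neg_mean_squared_error"
)

# Two summaries: root of the mean MSE, and mean of the per-fold RMSEs.
rmse_of_mean_mse = np.sqrt(fold_mse.mean())
mean_of_fold_rmse = np.sqrt(fold_mse).mean()

print(f"RMSE of mean MSE:   {rmse_of_mean_mse:.2f}")
print(f"Mean per-fold RMSE: {mean_of_fold_rmse:.2f}")
```

The mean of the per-fold RMSEs can never exceed the root of the mean MSE, so it is worth stating which one is being reported.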


In [184]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Fit the model on the filtered regression training set (listings sold within 100 days)
model.fit(X_train_reg, y_train_reg)

# Predict on the test set
y_pred_reg = model.predict(X_test_reg)

# Calculate performance metrics on the test set
test_rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
test_r2 = r2_score(y_test_reg, y_pred_reg)
test_mae = mean_absolute_error(y_test_reg, y_pred_reg)

print(f"Test RMSE: {test_rmse}")
print(f"Test MAE:  {test_mae}")
print(f"Test R^2 score: {test_r2}")


Test RMSE: 52.77026090824235
Test MAE:  30.901622579765476
Test R^2 score: -0.11437926511723018


## Training a Decision Tree Classifier

In [185]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# train
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train_class, y_train_class)

# test
y_pred_class = classifier.predict(X_test_class)

# Evaluation
accuracy = accuracy_score(y_test_class, y_pred_class)
conf_matrix = confusion_matrix(y_test_class, y_pred_class)
class_report = classification_report(y_test_class, y_pred_class)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Accuracy: 0.216764385055904
Confusion Matrix:
[[ 748  576  241  438  837  362 1123]
 [ 573  445  274  361  597  388 1015]
 [ 204  204  562  187  217  472  693]
 [ 409  352  178  277  389  285  769]
 [ 850  566  215  437 1137  299  926]
 [ 349  329  438  306  313  582 1089]
 [1090 1024  715  847  992 1048 2608]]
Classification Report:
                     precision    recall  f1-score   support

          1-2 weeks       0.18      0.17      0.18      4325
          2-3 weeks       0.13      0.12      0.12      3653
          3+ months       0.21      0.22      0.22      2539
          3-4 weeks       0.10      0.10      0.10      2659
      Within a week       0.25      0.26      0.26      4430
Within three months       0.17      0.17      0.17      3406
  Within two months       0.32      0.31      0.32      8324

           accuracy                           0.22     29336
          macro avg       0.19      0.19      0.19     29336
       weighted avg       0.22      0.22      0.22  
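The rows of the report above are in alphabetical order ("1-2 weeks", "2-3 weeks", "3+ months", ...) because scikit-learn sorts the labels when none are given, and the confusion matrix follows the same order. A minimal sketch, on a few made-up predictions, of passing `labels=` so that both outputs follow the natural order of the time bins:

```python
from sklearn.metrics import classification_report, confusion_matrix

ordered_labels = ["Within a week", "1-2 weeks", "2-3 weeks", "3-4 weeks",
                  "Within two months", "Within three months", "3+ months"]

# Made-up true and predicted categories, for illustration only.
y_true_demo = ["Within a week", "1-2 weeks", "3+ months", "Within a week", "2-3 weeks"]
y_pred_demo = ["Within a week", "3+ months", "3+ months", "1-2 weeks", "2-3 weeks"]

# Rows are true categories and columns are predicted categories, both in ordered_labels order.
matrix = confusion_matrix(y_true_demo, y_pred_demo, labels=ordered_labels)

# zero_division=0 silences warnings for categories with no predictions in this tiny example.
report = classification_report(y_true_demo, y_pred_demo, labels=ordered_labels, zero_division=0)

print(matrix)
print(report)
```

With the bins in order, near misses (for example predicting "1-2 weeks" for a car that sold within a week) sit next to the diagonal, which makes the matrix easier to read.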

## Guesswork

As a sanity check, compare the models against a naive baseline: fit a distribution to `days_to_sell`, then draw a random prediction from it for each car.

In [186]:
from scipy import stats

# Target values from the cleaned (pre-encoding) dataframe
data = df['days_to_sell'].dropna()

# Fit a normal distribution for simplicity. It is unbounded, so samples can be
# negative even though days_to_sell cannot.
distribution = stats.norm
params = distribution.fit(data)

def sample_from_distribution(num_samples=1):
    """Sample from the estimated distribution."""
    return distribution.rvs(*params, size=num_samples)

# Example usage: Sample 10 values from the estimated distribution
sampled_values = sample_from_distribution(10)
print(sampled_values)


[-81.40068603 108.7977558   -3.17206275  76.7386306  146.37549392
  47.62947024 113.78350592  66.94258705  50.79368907 -11.44768527]


In [187]:
from sklearn.metrics import root_mean_squared_error, r2_score, mean_absolute_error

# Draw one random prediction per row of the regression test set
num_samples = len(X_test_reg)
random_predictions = sample_from_distribution(num_samples)

# Score the random predictions against the actual test targets
rmse = root_mean_squared_error(y_test_reg, random_predictions)
r2 = r2_score(y_test_reg, random_predictions)
mae = mean_absolute_error(y_test_reg, random_predictions)
print(f"RMSE of random predictions: {rmse}")
print(f"MAE of random predictions:  {mae}")
print(f"$r^2$ score: {r2}")

RMSE of random predictions: 72.3796043997248
MAE of random predictions:  52.7614716873917
$r^2$ score: -1.0964617834217378
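Two other naive baselines may be worth comparing against, sketched here on made-up, right-skewed data. The first resamples observed training values instead of fitting a normal distribution, so it cannot produce negative days. The second always predicts the training mean via scikit-learn's `DummyRegressor`. The gamma-distributed targets and the `_demo` names are assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

# Made-up, right-skewed "days to sell" values standing in for the real target.
y_train_demo = rng.gamma(shape=2.0, scale=15.0, size=1000).round()
y_test_demo = rng.gamma(shape=2.0, scale=15.0, size=200).round()

# Baseline 1: sample predictions from the observed training values (empirical distribution).
resampled_predictions = rng.choice(y_train_demo, size=len(y_test_demo), replace=True)

# Baseline 2: always predict the training mean. DummyRegressor ignores the features,
# so placeholder arrays of zeros are enough.
dummy = DummyRegressor(strategy="mean")
dummy.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
mean_predictions = dummy.predict(np.zeros((len(y_test_demo), 1)))

print(f"MAE, resampled baseline: {mean_absolute_error(y_test_demo, resampled_predictions):.2f}")
print(f"MAE, mean baseline:      {mean_absolute_error(y_test_demo, mean_predictions):.2f}")
```

The mean baseline is a useful reference for $R^2$: a model that only ever predicts the mean of the data it is scored on gets an $R^2$ of 0, so a negative test $R^2$ means the model does worse than that.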
