# Car Price Prediction Model Analysis

Prepared by *Lau Wen Jun*

## Table of Contents
1. [Background](#1-background)
2. [Methodology](#2-methodology)
3. [Variables](#3-variables)
4. [Analysis](#4-analysis)
    - 4.1. [Import Libraries](#41-import-libraries)
    - 4.2. [Data Preprocessing](#42-data-preprocessing)
    - 4.3. [Feature Engineering](#43-feature-engineering)
    - 4.4. [Feature Importance Analysis](#44-feature-importance-analysis)
    - 4.5. [Model Development & Performance Validation](#45-model-development--performance-validation)
    - 4.6. [Result & Export](#46-result--export)
5. [Discussion and Conclusion](#5-discussion-and-conclusion)
    - 5.1. [Discussion](#51-discussion)
    - 5.2. [Conclusion](#52-conclusion)
6. [Appendix: Function](#6-appendix-function)

## 1. Background

This analysis focuses on developing a machine learning model to predict second-hand car prices using a dataset containing various car attributes and historical sales information. The dataset includes hashed identifiers for car brands and models, along with specific vehicle characteristics, condition indicators, and temporal data. The goal is to create an accurate price prediction model that can assist in valuing used cars based on their features.

## 2. Methodology

The analysis methodology comprises three main phases: data preprocessing, feature engineering, and model development. The preprocessing phase focuses on data quality and standardization, including handling missing values, converting date fields to appropriate formats, and encoding categorical variables using label encoding to make them suitable for machine learning algorithms.

Feature engineering enhances the dataset's predictive power by creating meaningful new features. This includes deriving temporal features from dates (such as days_until_sale, sales_month, and car_age), developing interaction features between related variables (age_milage combining car age and mileage), and constructing aggregate indicators like damage_score to capture the combined effect of accident and flood damage.

The model development phase begins with feature selection based on an importance threshold to identify the most influential predictors. Three machine learning models are then implemented and compared: Random Forest, which trains many decision trees on bootstrapped samples and, for regression, averages their predictions, making it robust against outliers and good at capturing complex patterns; XGBoost, which builds trees sequentially so that each new tree corrects the errors of the previous ones, making it particularly effective for structured data like our car dataset; and scikit-learn's Gradient Boosting, which follows the same sequential idea as XGBoost but with a different implementation and different default settings. The models are evaluated using complementary metrics: Root Mean Square Error (RMSE) to penalize large prediction errors, Mean Absolute Error (MAE) for interpretable error measurements, and R-squared (R²) to assess the model's overall explanatory power.
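As a small worked illustration of the three metrics, the sketch below computes them from first principles for four actual/predicted price pairs taken from the sample output in Section 4.6. The helper name `regression_metrics` is illustrative and not part of the original notebook, which uses the scikit-learn metric functions instead.

```python
import math

# Actual and predicted prices for four cars, copied from the sample output in Section 4.6.
actual = [30100.0, 32100.0, 34100.0, 40100.0]
predicted = [32282.0, 32187.0, 33675.0, 40501.0]


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R2) computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    rmse = math.sqrt(ss_res / n)                   # squares errors, so large misses dominate
    mae = sum(abs(e) for e in errors) / n          # average miss in price units
    r2 = 1 - ss_res / ss_tot                       # share of price variance explained
    return rmse, mae, r2


rmse, mae, r2 = regression_metrics(actual, predicted)
print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}, R2: {r2:.4f}")
```

On these four rows the MAE is 773.75 while the RMSE is roughly 1130, because the single error of -2182 is squared before averaging. This is the sense in which RMSE penalizes large errors more heavily than MAE.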

## 3. Variables

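The "Car Sales Data" dataset contains hashed data on second-hand car prices. Note that the mileage column is spelled `milage` in the source file, so the code in later sections uses that spelling.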

In [10]:
import pandas as pd

# Create variables description table
variables_df = pd.DataFrame({
    'Variable': [
        'car_brand',
        'car_model',
        'car_variant',
        'car_year',
        'car_engine',
        'car_transmission',
        'mileage',
        'accident',
        'flood',
        'color',
        'purchase_date',
        'sales_date',
        'price'
    ],
    'Description': [
        'Hashed value for a car brand (Honda, BMW, Proton, etc.)',
        'Hashed value for a car model (Myvi, Saga, i3, etc.)',
        'Hashed value for a car variant. Car variants can be repeated with different car brands and models. For example, Perodua Myvi & Honda City can both have variant G',
        'Car production year as 4 digits integer',
        'Car engine capacity',
        'Car transmission type (auto/manual)',
        'Car mileage when it was sold',
        'Boolean flag to indicate if a car has been through an accident (0 no accident, 1 accident)',
        'Boolean flag to indicate if a car has been through a flood (0 no flood damage, 1 flood damage)',
        'Color of the car being sold',
        'Date when the car was purchased',
        'Date when the car was sold',
        'Price at which the car was sold'
    ]
})

# Display the table with styled header
display(variables_df.style
       .hide(axis='index')
       .set_properties(**{
           'text-align': 'left',
           'white-space': 'pre-wrap'
       })
       .set_table_styles([
           {'selector': 'th',
            'props': [('text-align', 'center'),
                     ('background-color', '#f2f2f2'),
                     ('font-weight', 'bold')]
           }
       ]))

| Variable | Description |
|---|---|
| car_brand | Hashed value for a car brand (Honda, BMW, Proton, etc.) |
| car_model | Hashed value for a car model (Myvi, Saga, i3, etc.) |
| car_variant | Hashed value for a car variant. Car variants can be repeated with different car brands and models. For example, Perodua Myvi & Honda City can both have variant G |
| car_year | Car production year as 4 digits integer |
| car_engine | Car engine capacity |
| car_transmission | Car transmission type (auto/manual) |
| mileage | Car mileage when it was sold |
| accident | Boolean flag to indicate if a car has been through an accident (0 no accident, 1 accident) |
| flood | Boolean flag to indicate if a car has been through a flood (0 no flood damage, 1 flood damage) |
| color | Color of the car being sold |
| purchase_date | Date when the car was purchased |
| sales_date | Date when the car was sold |
| price | Price at which the car was sold |


## 4. Analysis

### 4.1. Import Libraries

The analysis utilizes essential data manipulation libraries (pandas, numpy) and machine learning libraries from scikit-learn for model development and evaluation. Additional libraries including XGBoost for gradient boosting implementation and visualization libraries (matplotlib, seaborn) are incorporated to enhance model performance and result presentation. The datetime library is included for handling temporal data features in the car sales dataset.

In [14]:
# Import libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

### 4.2. Data Preprocessing

The car price prediction project began with preprocessing a dataset of 95,428 records across 13 columns. The initial data quality assessment found very few missing values: four rows contained at least one NaN and were removed, leaving 95,424 clean records. Preprocessing also covers encoding the categorical variables (car brand, model, variant, transmission, and color) with LabelEncoder and transforming the date fields into numerical features for modeling; in this notebook both steps are carried out in the code cell of Section 4.3.

In [17]:
# Load data and initial processing
df = pd.read_csv('DS Case Study_ Car Sales Data.csv')
print("\nInitial data shape:", df.shape)
print("\nMissing values in initial data:")
print(df.isnull().sum())

# Remove rows with NaN values
df = df.dropna()
print("\nShape after removing NaN values:", df.shape)


Initial data shape: (95428, 13)

Missing values in initial data:
car_brand           2
car_model           4
car_variant         2
car_year            2
car_engine          2
car_transmission    2
milage              2
accident            2
flood               2
color               2
purchase_date       2
sales_date          2
price               2
dtype: int64

Shape after removing NaN values: (95424, 13)
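To make the label encoding step concrete, here is a plain-Python sketch of the mapping that scikit-learn's `LabelEncoder` produces: distinct values are sorted and assigned the codes 0 to n-1. The function `label_encode` and the hashed identifiers are hypothetical, for illustration only; the notebook itself uses `LabelEncoder`.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigning codes in sorted order."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Hypothetical hashed brand identifiers (not taken from the dataset).
brands = ["f3a9", "0b7c", "f3a9", "9d21"]
codes, classes = label_encode(brands)

print(codes)    # [2, 0, 2, 1]
print(classes)  # ['0b7c', '9d21', 'f3a9']
```

The resulting integers carry no meaningful order: a brand with code 2 is not "larger" than one with code 0. Tree-based models can still split on such codes, but arithmetic on them, such as multiplying an encoded column by a numeric one, should be interpreted with care.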


### 4.3. Feature Engineering

Feature engineering played a crucial role in enhancing the model's predictive power. Several types of features were created: temporal features (days_until_sale, sales_month, sales_year, car_age) to capture time-based patterns; interaction features (age_milage) to represent combined effects; technical features (engine_transmission); and risk features (damage_score combining accident and flood indicators). The original datetime columns (purchase_date and sales_date) were dropped after extracting useful numerical features because machine learning models cannot directly process datetime objects, and all relevant temporal information had already been captured in the engineered features. This comprehensive feature engineering approach aimed to capture complex relationships within the data while ensuring all features were in a format suitable for model training.

In [20]:
# Convert dates
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='%d/%m/%y')
df['sales_date'] = pd.to_datetime(df['sales_date'], format='%d/%m/%y')

# Extract temporal features
df['days_until_sale'] = (df['sales_date'] - df['purchase_date']).dt.days
df['sales_month'] = df['sales_date'].dt.month
df['sales_year'] = df['sales_date'].dt.year
df['car_age'] = df['sales_year'] - df['car_year']

# Encode categorical variables
categorical_cols = ['car_brand', 'car_model', 'car_variant', 'car_transmission', 'color']
label_encoders = {}

for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])

# Drop datetime columns
df = df.drop(['purchase_date', 'sales_date'], axis=1)
print("\nAvailable features:", df.columns.tolist())

# Create interaction features
df['age_milage'] = df['car_age'] * df['milage']
df['engine_transmission'] = df['car_engine'] * df['car_transmission']
df['damage_score'] = df['accident'] + df['flood']

print("\nFinal features:", df.columns.tolist())


Available features: ['car_brand', 'car_model', 'car_variant', 'car_year', 'car_engine', 'car_transmission', 'milage', 'accident', 'flood', 'color', 'price', 'days_until_sale', 'sales_month', 'sales_year', 'car_age']

Final features: ['car_brand', 'car_model', 'car_variant', 'car_year', 'car_engine', 'car_transmission', 'milage', 'accident', 'flood', 'color', 'price', 'days_until_sale', 'sales_month', 'sales_year', 'car_age', 'age_milage', 'engine_transmission', 'damage_score']
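The temporal features above can be checked by hand on a single record. The following sketch uses only the standard library and the same `'%d/%m/%y'` format string; the function name `temporal_features` and the example dates are hypothetical and are not drawn from the dataset.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%y"  # day/month/two-digit year, matching the format used in the cell above


def temporal_features(purchase_date, sales_date, car_year):
    """Derive the four temporal features for one record given as date strings."""
    purchase = datetime.strptime(purchase_date, DATE_FORMAT)
    sale = datetime.strptime(sales_date, DATE_FORMAT)
    return {
        "days_until_sale": (sale - purchase).days,  # holding period in days
        "sales_month": sale.month,
        "sales_year": sale.year,
        "car_age": sale.year - car_year,            # age at time of sale, in whole years
    }


# Hypothetical record: bought 15 Jan 2021, sold 3 Mar 2021, produced in 2013.
features = temporal_features("15/01/21", "03/03/21", car_year=2013)
print(features)
```

For this record the holding period is 47 days and the car age is 8 years. One point worth keeping in mind with a two-digit year is that Python's `%y` maps 69-99 to 1969-1999 and 00-68 to 2000-2068, so very old purchase dates could be shifted by a century.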


### 4.4. Feature Importance Analysis

The feature importance analysis, conducted using Random Forest with a 1% importance threshold, identified seven key predictors. The most influential features were car_year (30.97%) and car_engine (30.68%), which together account for over 60% of the total feature importance. Brand-related features were also important, with car_brand (19.35%) and car_model (7.40%) collectively contributing nearly 27%. The engineered feature age_milage (4.57%) proved valuable, supporting the feature engineering approach, and car_age (2.05%) and car_variant (1.97%) were also identified as meaningful predictors. Based on the 1% threshold, the model retained these seven features and excluded the remaining ten: the temporal features (sales_month, sales_year, days_until_sale), raw milage, the condition indicators (accident, flood, damage_score), and auxiliary characteristics (color, car_transmission, engine_transmission). This streamlined feature set keeps the model efficient while preserving the most predictive variables.

In [23]:
# Now proceed with feature importance analysis
features = [col for col in df.columns if col != 'price']
X = df[features]
y = df['price']

# Initial train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initial feature importance analysis
initial_model = RandomForestRegressor(n_estimators=100, random_state=42)
initial_model.fit(X_train_scaled, y_train)

# Analyze and display feature importance
importances = initial_model.feature_importances_
feature_imp = pd.DataFrame({
    'feature': features,
    'importance': importances
})
feature_imp = feature_imp.sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_imp)

# Feature selection based on importance threshold
importance_threshold = 0.01  # 1% importance threshold
selected_features = feature_imp[feature_imp['importance'] > importance_threshold]['feature'].tolist()
print("\nSelected Features (importance > 1%):")
print(selected_features)

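# Create new feature matrices with selected features only.
# Reusing the same test_size and random_state reproduces the row split behind y_train / y_test above.
X_selected = X[selected_features]
X_train_selected, X_test_selected = train_test_split(X_selected, test_size=0.2, random_state=42)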

# Scale selected features
scaler_selected = StandardScaler()
X_train_scaled = scaler_selected.fit_transform(X_train_selected)
X_test_scaled = scaler_selected.transform(X_test_selected)


Feature Importance:
                feature  importance
3              car_year    0.309676
4            car_engine    0.306830
0             car_brand    0.193475
1             car_model    0.073979
14           age_milage    0.045717
13              car_age    0.020505
2           car_variant    0.019672
11          sales_month    0.008741
6                milage    0.008316
10      days_until_sale    0.003847
9                 color    0.003136
12           sales_year    0.001824
15  engine_transmission    0.001697
5      car_transmission    0.001371
7              accident    0.000641
16         damage_score    0.000570
8                 flood    0.000004

Selected Features (importance > 1%):
['car_year', 'car_engine', 'car_brand', 'car_model', 'age_milage', 'car_age', 'car_variant']
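The effect of the 1% threshold can be verified directly from the importance values printed above. The sketch below copies those values into a dictionary, applies the same `> 0.01` filter, and sums the importance that the retained features carry; the variable names are illustrative.

```python
# Importance values copied from the "Feature Importance" output above.
importances = {
    "car_year": 0.309676,
    "car_engine": 0.306830,
    "car_brand": 0.193475,
    "car_model": 0.073979,
    "age_milage": 0.045717,
    "car_age": 0.020505,
    "car_variant": 0.019672,
    "sales_month": 0.008741,
    "milage": 0.008316,
    "days_until_sale": 0.003847,
    "color": 0.003136,
    "sales_year": 0.001824,
    "engine_transmission": 0.001697,
    "car_transmission": 0.001371,
    "accident": 0.000641,
    "damage_score": 0.000570,
    "flood": 0.000004,
}

THRESHOLD = 0.01  # 1% importance threshold, as in the analysis

selected = [name for name, value in importances.items() if value > THRESHOLD]
retained_share = sum(importances[name] for name in selected)

print(selected)
print(f"Share of total importance retained: {retained_share:.4f}")
```

The seven retained features carry about 97% of the total importance, so the ten dropped features together contribute roughly 3%.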


### 4.5. Model Development & Performance Validation

Three different regression models were implemented and compared using the selected features from our feature importance analysis. After comparing the models' performance on the test set (unseen data), the Random Forest model emerged as the best performer with RMSE of 4358.27, MAE of 2426.90, and R-squared value of 0.98. XGBoost showed comparable performance with slightly higher error metrics (RMSE: 4513.91, MAE: 2721.57) and the same R² of 0.98, while Gradient Boosting demonstrated notably lower performance (RMSE: 9728.04, MAE: 5728.41, R²: 0.89).

The selection of these evaluation metrics was deliberate: RMSE to penalize large prediction errors more heavily (making it particularly relevant for price predictions), MAE to provide easily interpretable error measurements in actual price units, and R-squared to indicate the model's overall explanatory power. The strong performance of both Random Forest and XGBoost, particularly their high R² values, suggests that either could be suitable for practical application, though Random Forest's lower error metrics make it the preferred choice for this specific car price prediction task.

In [26]:
# Train models with selected features
models = {
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBRegressor(random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42)
}
results = {}
best_model = None
best_score = float('inf')
best_model_name = None

# Train and evaluate each model with selected features
for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Predictions on test set
    y_pred = model.predict(X_test_scaled)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Store results
    results[name] = {
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2
    }
    
    # Update best model if current is better
    if rmse < best_score:
        best_score = rmse
        best_model = model
        best_model_name = name

# Print test-set metrics for each model trained on the selected features
print("\nModel Results with Selected Features:")
for model_name, metrics in results.items():
    print(f"\n{model_name}:")
    for metric_name, value in metrics.items():
        print(f"{metric_name}: {value:.2f}")

# Print best model information
print(f"\nBest Performing Model: {best_model_name}")

# Generate predictions using best model
X_all_selected = X[selected_features]
X_all_scaled = scaler_selected.transform(X_all_selected)  # Use scaler_selected instead of scaler
predictions = best_model.predict(X_all_scaled)

# Create output DataFrame with predictions
output_df = df.copy()
output_df['predicted_price'] = predictions
output_df['price_difference'] = output_df['price'] - output_df['predicted_price']


Model Results with Selected Features:

RandomForest:
RMSE: 4358.27
MAE: 2426.90
R2: 0.98

XGBoost:
RMSE: 4513.91
MAE: 2721.57
R2: 0.98

GradientBoosting:
RMSE: 9728.04
MAE: 5728.41
R2: 0.89

Best Performing Model: RandomForest
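To put the gaps between the models in relative terms, the sketch below takes the test-set metrics printed above and expresses each model's RMSE as a percentage above the best one. The dictionary simply restates the reported numbers; the variable names are illustrative.

```python
# Test-set metrics copied from the output above.
results = {
    "RandomForest": {"RMSE": 4358.27, "MAE": 2426.90, "R2": 0.98},
    "XGBoost": {"RMSE": 4513.91, "MAE": 2721.57, "R2": 0.98},
    "GradientBoosting": {"RMSE": 9728.04, "MAE": 5728.41, "R2": 0.89},
}

# Same selection rule as the notebook: lowest RMSE wins.
best_name = min(results, key=lambda name: results[name]["RMSE"])
best_rmse = results[best_name]["RMSE"]

# Relative RMSE gap of each model versus the best one, in percent.
gap_pct = {
    name: 100 * (metrics["RMSE"] - best_rmse) / best_rmse
    for name, metrics in results.items()
}

for name, metrics in results.items():
    print(f"{name}: RMSE {metrics['RMSE']:.2f} ({gap_pct[name]:+.1f}% vs {best_name})")
```

XGBoost's RMSE is about 3.6% above Random Forest's, while Gradient Boosting's is more than double. Apart from `random_state` (and `n_estimators=100` for Random Forest), all three models run with library default settings, so part of this gap may reflect untuned defaults rather than an inherent difference between the algorithms.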


### 4.6. Result & Export

The model's predictions were saved in 'car_price_predictions.csv', which contains the dataset's features, the actual price, and two additional columns:

- price: The actual selling price of the car (e.g., 30100.0, 32100.0)

- predicted_price: The model's price prediction (e.g., 32282.0, 32187.0)

- price_difference: The difference between actual and predicted prices (actual - predicted). A negative value indicates over-prediction (e.g., -2182.0), while a positive value indicates under-prediction (e.g., 425.0).

Sample predictions from the dataset show that the model's predictions were generally close to actual prices, with differences ranging from relatively small (-87.0) to more substantial (-4309.0). In the ten rows shown below, the absolute difference between price and predicted_price stays under 4,500. Because predictions are generated for the full dataset, about 80% of the rows were also used for training, so these sample differences are likely smaller than the test-set errors reported in Section 4.5.

In [29]:
# Save and display results
output_df.to_csv('car_price_predictions.csv', index=False, encoding='utf-8-sig')
print("\nPredictions saved to 'car_price_predictions.csv'")

# Display sample of predictions
print("\nSample of Predictions:")
sample_cols = ['car_brand', 'car_model', 'car_year', 'milage', 'price', 'predicted_price', 'price_difference']
print(output_df[sample_cols].head(10))


Predictions saved to 'car_price_predictions.csv'

Sample of Predictions:
   car_brand  car_model  car_year    milage    price  predicted_price  \
0         22         52    2010.0  282508.0  30100.0          32282.0   
1         22         52    2010.0  169475.0  32100.0          32187.0   
2         22         52    2010.0  105276.0  34100.0          33675.0   
3         22         52    2010.0   81123.0  40100.0          40501.0   
4         22         52    2013.0  157239.0  42000.0          46309.0   
5         22         52    2013.0  170215.0  50500.0          50975.0   
6         22         52    2013.0  192647.0  55100.0          54908.0   
7         22         52    2013.0   81127.0  57400.0          58306.0   
8         22         52    2013.0  171512.0  58800.0          55269.0   
9         22         52    2013.0   98656.0  60300.0          60644.0   

   price_difference  
0           -2182.0  
1             -87.0  
2             425.0  
3            -401.0  
4           

## 5. Discussion and Conclusion

### 5.1. Discussion

The car price prediction model reveals patterns in the relationship between actual and predicted prices that are useful to various stakeholders in the used car market. When the model over-predicts (negative price difference), for example predicting 32282.0 for a car actually priced at 30100.0 (difference of -2182.0), the vehicle may be selling below its expected market value. Such cases can point to good buying opportunities or warrant investigation into why the price is lower than the market would suggest.

Conversely, under-prediction (positive price difference), as in the case where a car priced at 58800.0 was predicted at 55269.0 (difference of 3531.0), indicates a vehicle selling above its expected market value, which could signal premium features not captured by the model or potential overpricing.

These prediction patterns, combined with the feature importance findings in which basic car specifications (year, engine, brand) heavily outweigh condition factors (accident, flood damage), suggest a market where fundamental vehicle characteristics primarily drive pricing. The model can therefore support dealers in optimizing pricing strategy, buyers in identifying deals, sellers in setting competitive prices, and market analysts in understanding price anomalies.
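The sign convention discussed above can be sketched as a small helper applied to the two examples from this section. The function `price_signal` and its `tolerance` parameter are hypothetical additions for illustration; using the test-set MAE as a tolerance is an assumption, not something the original analysis does.

```python
def price_signal(actual_price, predicted_price, tolerance=0.0):
    """Classify a sale by the sign of price_difference = actual - predicted.

    `tolerance` is a hypothetical parameter (not in the original analysis):
    differences whose magnitude is at or below it are treated as in line
    with the model.
    """
    difference = actual_price - predicted_price
    if abs(difference) <= tolerance:
        label = "in line with model"
    elif difference < 0:
        label = "over-predicted: sold below the model's estimate"
    else:
        label = "under-predicted: sold above the model's estimate"
    return difference, label


TEST_MAE = 2426.90  # Random Forest test-set MAE reported in Section 4.5

# The two (actual, predicted) examples quoted in the discussion.
examples = [(30100.0, 32282.0), (58800.0, 55269.0)]

for actual, predicted in examples:
    print(price_signal(actual, predicted))                      # raw sign convention
    print(price_signal(actual, predicted, tolerance=TEST_MAE))  # ignoring typical-size errors
```

With a tolerance equal to the test-set MAE, the -2182.0 example falls within the model's typical error, while the 3531.0 example still stands out. A threshold of this kind may help separate genuine pricing anomalies from ordinary prediction noise.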

### 5.2. Conclusion

The study developed a car price prediction model that achieves an R² of 0.98 on the test set, with Random Forest emerging as the best performer (RMSE: 4358.27, MAE: 2426.90). Feature engineering and selection identified seven key features that capture the main price determinants, with car specifications (year, engine, brand) being the most influential. The model's ability to flag over- and under-priced vehicles makes it a useful tool for market participants making pricing decisions. Future work could explore incorporating additional features such as market trends or regional factors to further improve prediction accuracy.

## 6. Appendix: Function

For replication and future reference, the complete implementation is consolidated below into a set of helper functions driven by a single `main` function. Together they cover all steps from data preprocessing through model evaluation and prediction generation, making it easier to reproduce the analysis pipeline with different datasets or parameters.

In [37]:
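# Imports required to run this cell on its own (same libraries as Sections 3 and 4.1)
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load and preprocess the data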
def preprocess_data(df):
    print("Available columns:", df.columns.tolist())
    
    # Convert dates to datetime with correct format 'DD/MM/YY'
    df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='%d/%m/%y')
    df['sales_date'] = pd.to_datetime(df['sales_date'], format='%d/%m/%y')
    
    # Extract numeric features from dates
    df['days_until_sale'] = (df['sales_date'] - df['purchase_date']).dt.days
    df['sales_month'] = df['sales_date'].dt.month
    df['sales_year'] = df['sales_date'].dt.year
    
    # Calculate car age at time of sale
    df['car_age'] = df['sales_year'] - df['car_year']
    
    # Encode categorical variables
    categorical_cols = ['car_brand', 'car_model', 'car_variant', 'car_transmission', 'color']
    label_encoders = {}
    
    for col in categorical_cols:
        label_encoders[col] = LabelEncoder()
        df[col] = label_encoders[col].fit_transform(df[col])
    
    # Drop the original datetime columns as they can't be used in the model
    df = df.drop(['purchase_date', 'sales_date'], axis=1)
    
    return df, label_encoders

# Feature engineering
def create_features(df):
    # Create interaction features
    df['age_milage'] = df['car_age'] * df['milage']
    df['engine_transmission'] = df['car_engine'] * df['car_transmission']
    
    # Create damage score
    df['damage_score'] = df['accident'] + df['flood']
    
    return df


# Feature importance and selection analysis
def analyze_and_select_features(X_train_scaled, y_train, features, threshold=0.01):
    # Initial model for feature importance
    initial_model = RandomForestRegressor(n_estimators=100, random_state=42)
    initial_model.fit(X_train_scaled, y_train)
    
    # Get feature importance
    importances = initial_model.feature_importances_
    feature_imp = pd.DataFrame({
        'feature': features,
        'importance': importances
    })
    feature_imp = feature_imp.sort_values('importance', ascending=False)
    
    # Select features above threshold
    selected_features = feature_imp[feature_imp['importance'] > threshold]['feature'].tolist()
    
    return feature_imp, selected_features

# Model training and evaluation
def train_evaluate_model(X_train, X_test, y_train, y_test):
    models = {
        'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
        'XGBoost': xgb.XGBRegressor(random_state=42),
        'GradientBoosting': GradientBoostingRegressor(random_state=42)
    }
    
    results = {}
    best_model = None
    best_score = float('inf')
    best_model_name = None  # Tracked for reference; only the fitted model is returned
    
    for name, model in models.items():
        # Train model
        model.fit(X_train, y_train)
        
        # Make predictions
        y_pred = model.predict(X_test)
        
        # Calculate metrics
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        
        results[name] = {
            'RMSE': rmse,
            'MAE': mae,
            'R2': r2
        }
        
        # Update best model if current is better
        if rmse < best_score:
            best_score = rmse
            best_model = model
            best_model_name = name
    
    return results, best_model

# Main execution
def main():
    # Load data
    df = pd.read_csv('DS Case Study_ Car Sales Data.csv')
    print("\nInitial data shape:", df.shape)
    print("\nMissing values in initial data:")
    print(df.isnull().sum())
    
    # Remove rows with NaN values
    df = df.dropna()
    print("\nShape after removing NaN values:", df.shape)
    
    # Store original data before preprocessing
    original_data = df.copy()
    
    # Preprocess data
    df_processed, encoders = preprocess_data(df)
    df_processed = create_features(df_processed)
    
    # Print feature names
    print("\nFinal features:", df_processed.columns.tolist())
    
    # Prepare features and target
    features = [col for col in df_processed.columns if col != 'price']
    X = df_processed[features]
    y = df_processed['price']
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Perform feature selection
    print("\nPerforming feature selection...")
    feature_imp, selected_features = analyze_and_select_features(X_train_scaled, y_train, features)
    print("\nFeature Importance Before Selection:")
    print(feature_imp)
    print(f"\nSelected Features (importance > 1%):")
    print(selected_features)
    
    # Create new feature matrices with selected features
    X_train_selected = X_train_scaled[:, [features.index(feat) for feat in selected_features]]
    X_test_selected = X_test_scaled[:, [features.index(feat) for feat in selected_features]]
    
    # Train and evaluate models with selected features
    results, best_model = train_evaluate_model(X_train_selected, X_test_selected, y_train, y_test)
    
    # Print results
    print("\nModel Results (with selected features):")
    for model_name, metrics in results.items():
        print(f"\n{model_name}:")
        for metric_name, value in metrics.items():
            print(f"{metric_name}: {value:.2f}")
   
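    # Print best model information. The name is derived from the results dict,
    # because train_evaluate_model returns only the fitted model, not its name.
    best_model_name = min(results, key=lambda name: results[name]['RMSE'])
    print(f"\nBest Performing Model: {best_model_name}")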

    # Generate predictions for all data using selected features
    X_all_scaled = scaler.transform(X)
    X_all_selected = X_all_scaled[:, [features.index(feat) for feat in selected_features]]
    predictions = best_model.predict(X_all_selected)
    
    # Create output DataFrame
    output_df = original_data.copy()
    output_df['predicted_price'] = predictions
    output_df['price_difference'] = output_df['price'] - output_df['predicted_price']
    
    # Save predictions to CSV
    output_df.to_csv('car_price_predictions.csv', index=False, encoding='utf-8-sig')
    print("\nPredictions saved to 'car_price_predictions.csv'")
    
    # Display sample of predictions
    print("\nSample of Predictions:")
    sample_cols = ['car_brand', 'car_model', 'car_year', 'milage', 'price', 'predicted_price', 'price_difference']
    print(output_df[sample_cols].head(10))
    
    return results, best_model, feature_imp, output_df

# Run analysis
if __name__ == "__main__":
    results, best_model, feature_importance, output_df = main()


Initial data shape: (95428, 13)

Missing values in initial data:
car_brand           2
car_model           4
car_variant         2
car_year            2
car_engine          2
car_transmission    2
milage              2
accident            2
flood               2
color               2
purchase_date       2
sales_date          2
price               2
dtype: int64

Shape after removing NaN values: (95424, 13)
Available columns: ['car_brand', 'car_model', 'car_variant', 'car_year', 'car_engine', 'car_transmission', 'milage', 'accident', 'flood', 'color', 'purchase_date', 'sales_date', 'price']

Final features: ['car_brand', 'car_model', 'car_variant', 'car_year', 'car_engine', 'car_transmission', 'milage', 'accident', 'flood', 'color', 'price', 'days_until_sale', 'sales_month', 'sales_year', 'car_age', 'age_milage', 'engine_transmission', 'damage_score']

Performing feature selection...

Feature Importance Before Selection:
                feature  importance
3              car_year    0.