## Modeling & Interpretations

To predict car prices, I decided to try several regression models and see which one performs best at predicting prices and accounting for the variation in my data. For each of these models, I used an 80-20 train-test split, training the model on 80% of the data and testing it on the remaining 20%.

In [43]:
# Import statements

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import plot_tree
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV

In [2]:
# Load dataset
from google.colab import files
uploaded = files.upload()

cars_url = 'car_price_prediction_.csv'
cars = pd.read_csv(cars_url)

Saving car_price_prediction_.csv to car_price_prediction_.csv


#### Baseline Model

I evaluated the success of each of my models by comparing its performance metrics, such as its mean squared error (MSE), against the mean squared error of a baseline model. For my baseline, I simply predicted the mean car price of my dataset for every car.

In [97]:
# Set up the baseline model using the mean price, then calculate the baseline MSE
y = cars['Price']
mean = y.sum()/2234  # 2234 is a hard-coded row count; this equals y.mean() only if len(y) == 2234
baseline_preds = np.ones(len(y))*mean
baseline_mse = mean_squared_error(y, baseline_preds)
print(baseline_mse)
print(np.sqrt(baseline_mse))


784046722.9322176
28000.83432564497


The mean squared error of the baseline model is around 784,046,723. The root mean squared error (RMSE) puts this in perspective: it shows that the baseline's predictions are typically off by about $28,001.
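One caveat is that the baseline above is computed from every row, while the models below are scored on a held-out 20%. A like-for-like baseline would take the mean from the training rows only and score it on the testing rows. The following is a minimal sketch of that idea, assuming made-up prices (the `prices` array is hypothetical and is not the car dataset):

```python
import numpy as np

# Hypothetical prices standing in for cars['Price']; NOT the real dataset.
rng = np.random.default_rng(0)
prices = rng.uniform(5_000, 100_000, size=1_000)

# 80-20 split by shuffled index (same idea as train_test_split with test_size=0.2).
idx = rng.permutation(len(prices))
cut = int(0.8 * len(prices))
y_train, y_test = prices[idx[:cut]], prices[idx[cut:]]

# The baseline predicts the training mean for every testing row.
baseline_pred = np.full(len(y_test), y_train.mean())

baseline_mse = np.mean((y_test - baseline_pred) ** 2)
baseline_rmse = np.sqrt(baseline_mse)
print("Baseline test MSE:", baseline_mse)
print("Baseline test RMSE:", baseline_rmse)
```

Scoring the baseline on the same testing rows as each model keeps the comparison consistent, since every number is then measured on identical data.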

#### Multiple Regression Model

I chose to build a multiple regression model because I wanted to use several independent variables (Fuel Type, Brand, and Model) to predict the dependent variable, Price, as I believed these predictors may collectively influence the car price. Multiple linear regression allowed me to model the relationship between the price and each of these predictors while also considering their combined effect.

In [136]:
#create X & y, split into training and testing data
X = cars[['Fuel Type', 'Brand', 'Model']]
y = cars['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20)

In [137]:
#encode categorical columns
cat_cols = ['Fuel Type', 'Brand', 'Model']
transformer = make_column_transformer((OneHotEncoder(drop = 'first', sparse_output = False), cat_cols), remainder = 'passthrough')

In [138]:
#create pipeline for multiple regression model
pipe = Pipeline([('encode', transformer), ('model', LinearRegression())])

In [139]:
#fit pipeline
pipe.fit(X_train, y_train)

In [140]:
#find coefficients
lr = pipe.named_steps['model']
coefficients = lr.coef_
names = transformer.get_feature_names_out()
pd.DataFrame(coefficients, names)

Feature,Coefficient
onehotencoder__Fuel Type_Electric,-3633.814
onehotencoder__Fuel Type_Hybrid,-1880.224
onehotencoder__Fuel Type_Petrol,-2530.592
onehotencoder__Brand_BMW,-1.083115e+18
onehotencoder__Brand_Ford,2.707862e+17
onehotencoder__Brand_Honda,3.94552e+17
onehotencoder__Brand_Mercedes,-1.028911e+18
onehotencoder__Brand_Tesla,-1.618029e+18
onehotencoder__Brand_Toyota,2.806695e+17
onehotencoder__Model_5 Series,-1473.65


In [141]:
#find y-int
lr.intercept_

1.0831145637429147e+18

In [142]:
#calculate mse for training data
y_train_preds = pipe.predict(X_train)
mean_squared_error(y_train, y_train_preds)

730377930.9138377

In [143]:
#calculate mse for testing data
y_test_preds = pipe.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_preds)
print("Testing MSE:", test_mse)
print("Testing RMSE:", np.sqrt(test_mse))


Testing MSE: 784095347.8221062
Testing RMSE: 28001.702587916083


In [144]:
#determine feature importance
r = permutation_importance(pipe, X_test, y_test, n_repeats = 10)
pd.DataFrame(r['importances_mean'], index = X_train.columns.tolist())

Feature,Importance Mean
Fuel Type,0.005263629
Brand,1.485454e+27
Model,1.485454e+27


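<strong>Overall, my multiple regression model performed better than my baseline on the training data, but not on the testing data.</strong> The training MSE (about 730.4 million) was below the baseline MSE (about 784.0 million), while the testing MSE (about 784.1 million) was essentially equal to it. I think this may be due to the overlap between the Brand and Model features: if each model belongs to a single brand, the one-hot columns are linearly dependent, which would also explain the unstable Brand coefficients on the order of 1e17 to 1e18.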

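<strong>The inputs that appeared most important in this scenario were Brand and Model.</strong> Fuel Type was the least important for predicting car prices in this model. However, importance values on the order of 1e27 are far outside the usual range for this metric and most likely reflect numerical instability in the fitted coefficients rather than genuine predictive power.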
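Very large regression coefficients are often a symptom of a design matrix whose columns are linearly dependent. The sketch below illustrates this under the assumption that every model belongs to exactly one brand; the toy rows and the helper `one_hot_drop_first` are hypothetical and only mimic `OneHotEncoder(drop='first')`:

```python
import numpy as np

# Hypothetical rows: every model belongs to exactly one brand (an assumption
# about the car data, used here only for illustration).
brands = np.array(["BMW", "BMW", "Ford", "Ford", "Tesla", "Tesla", "BMW", "Ford"])
models = np.array(["3 Series", "5 Series", "Fiesta", "Focus",
                   "Model 3", "Model S", "3 Series", "Focus"])

def one_hot_drop_first(values):
    """One-hot encode, dropping the first category (like drop='first')."""
    cats = np.unique(values)[1:]
    return (values[:, None] == cats[None, :]).astype(float)

X = np.hstack([
    np.ones((len(brands), 1)),      # intercept column
    one_hot_drop_first(brands),     # 2 brand columns
    one_hot_drop_first(models),     # 5 model columns
])

n_cols = X.shape[1]
rank = np.linalg.matrix_rank(X)
print("columns:", n_cols, "rank:", rank)
```

Here each brand column is the sum of that brand's model columns, so the rank is lower than the number of columns and least squares has no unique solution. Common remedies include dropping one of the two features or using a regularized model such as ridge regression.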

#### K-Nearest Neighbors Regression Model

I chose to try a k-nearest neighbors (KNN) regression model next because KNN makes predictions based on the similarity of instances in the feature space. If car prices are influenced by local patterns or clusters of similar cars with comparable features, KNN should be effective at capturing these localized relationships.
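To make the idea concrete, here is a minimal sketch of how a KNN regressor forms a prediction: it averages the targets of the k closest training rows. The one-hot rows, prices, and the helper `knn_predict` are hypothetical and are not taken from the car dataset:

```python
import numpy as np

# Hypothetical one-hot encoded training rows and prices (NOT the car dataset).
X_train = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
], dtype=float)
y_train = np.array([20_000.0, 22_000.0, 35_000.0, 37_000.0, 30_000.0])

def knn_predict(x_new, X, y, k):
    """Predict as the mean target of the k nearest rows (Euclidean distance)."""
    dists = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists, kind="stable")[:k]  # ties keep row order
    return y[nearest].mean()

x_new = np.array([1.0, 0.0, 0.0])
print(knn_predict(x_new, X_train, y_train, k=1))
print(knn_predict(x_new, X_train, y_train, k=3))
print(knn_predict(x_new, X_train, y_train, k=5))
```

With k equal to the number of training rows, the prediction is simply the training mean, which is why a very large k makes a KNN regressor behave like the mean baseline.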

In [145]:
#create X & y, split into training and testing data
X = cars[['Fuel Type', 'Brand', 'Model']]
y = cars['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20)

In [146]:
#encode categorical column & scale data
cat_col = ['Fuel Type', 'Brand', 'Model']
transformer = make_column_transformer((OneHotEncoder(drop = 'first', sparse_output = False), cat_col), remainder = StandardScaler())

In [147]:
#create pipeline for knn regression model
pipe = Pipeline([('encode', transformer), ('model', KNeighborsRegressor())])

In [148]:
#define grid of hyperparameters for number of neighbors
param_grid = {'model__n_neighbors': [5, 10, 15, 20, 25, 30, 50]}

In [149]:
#perform grid-search w/ cross validation
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

In [150]:
#determine best parameter
grid_search.best_params_

{'model__n_neighbors': 50}

In [151]:
#use 50 neighbors in model
knn = grid_search.best_estimator_

In [152]:
#calculate mse for training data
y_train_preds = knn.predict(X_train)
mean_squared_error(y_train, y_train_preds)

725471107.1230369

In [153]:
#calculate mse for testing data
y_test_preds = knn.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_preds)
print("Testing MSE:", test_mse)
print("Testing RMSE:", np.sqrt(test_mse))


Testing MSE: 776192791.93465
Testing RMSE: 27860.236753025805


In [133]:
#determine feature importance
r = permutation_importance(knn, X_test, y_test, n_repeats = 10)
pd.DataFrame(r['importances_mean'], index = X_train.columns.tolist())

Feature,Importance Mean
Fuel Type,0.012499
Brand,0.009993
Model,0.005271


<strong>My KNN model performed better than both my baseline and my multiple regression model.</strong> The training MSE (about 725.5 million) was lower than the testing MSE (about 776.2 million), and the testing MSE was about 1% below the baseline, a modest improvement that was still the best so far. I think this was due to the fact that KNN models can capture non-linear patterns in the data, which might be present within the car prices data. Additionally, using grid search to tune the number of neighbors may have contributed to its better performance. Note that the selected value, 50, was the largest in the grid, so an even larger number of neighbors might perform better still.

<strong>This time, the most important feature was Fuel Type,</strong> followed by Brand. Model was the least important feature.

#### XGBoost Regression Model

For my last model, I chose to build an XGBoost model, which builds an ensemble of decision trees where each tree is a weak learner trained to correct the errors of the previous ones. Trees are added sequentially, and each tree focuses on improving the predictions made by the ensemble so far. I chose this model because, like k-nearest neighbors, decision trees can capture non-linear relationships within the car prices data. XGBoost is also well suited to handling interactions between features, which matters when car prices depend on combinations of features like Fuel Type, Brand, and Model.
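The sequential error-correcting idea can be sketched with plain gradient boosting for squared error, using depth-1 trees (stumps) on binary features. This is a simplified illustration, not XGBoost's actual algorithm, which adds regularization and other refinements. The data and the helpers `fit_stump` and `predict_stump` are hypothetical:

```python
import numpy as np

# Hypothetical binary (one-hot style) features and prices; NOT the car dataset.
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
    [0, 0],
    [1, 1],
], dtype=float)
y = np.array([10.0, 14.0, 20.0, 24.0, 10.0, 24.0])

def fit_stump(X, residuals):
    """Depth-1 tree: pick the binary column whose 0/1 split best fits residuals."""
    best = None
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        if mask.all() or not mask.any():
            continue  # this column does not split the data
        left, right = residuals[~mask].mean(), residuals[mask].mean()
        sse = ((residuals - np.where(mask, right, left)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, j, left, right)
    return best[1:]

def predict_stump(stump, X):
    j, left, right = stump
    return np.where(X[:, j] == 1, right, left)

learning_rate = 0.5
pred = np.full(len(y), y.mean())   # start from the mean, like the baseline
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(20):
    stump = fit_stump(X, y - pred)             # fit the current errors
    pred = pred + learning_rate * predict_stump(stump, X)
    mse_history.append(np.mean((y - pred) ** 2))

print("MSE before boosting:", mse_history[0])
print("MSE after 20 rounds:", mse_history[-1])
```

Each round fits a stump to the residuals of the current ensemble and adds a shrunken copy of it, so the training error never increases from one round to the next. Because a stump splits on a single feature, an ensemble of stumps is additive and cannot represent interactions between features; deeper trees are needed for that.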

In [23]:
!pip install xgboost



In [24]:
from xgboost import XGBRegressor


In [164]:
# Feature and target selection
X = cars[['Fuel Type', 'Brand', 'Model']]
y = cars['Price']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [165]:
# Define categorical columns
cat_col = ['Fuel Type', 'Brand', 'Model']

# Create a ColumnTransformer for preprocessing
transformer = make_column_transformer(
    (OneHotEncoder(drop='first', sparse_output=False), cat_col),  # One-hot encode categorical features
    remainder=StandardScaler()  # Standardize numerical features (if any)
)

In [174]:
# Define the XGBRegressor model
model = XGBRegressor(
    objective='reg:squarederror',  # Objective function for regression
    n_estimators=100,             # Number of boosting rounds
    learning_rate=0.1,            # Learning rate
    max_depth=1,                  # Maximum depth of each tree
    random_state=42
)

# Create a pipeline that includes preprocessing and the model
pipeline = Pipeline(steps=[
    ('preprocessor', transformer),  # Preprocessing step
    ('regressor', model)            # Regression model
])

# Train the pipeline
pipeline.fit(X_train, y_train)

In [175]:

# Predict on training and testing data
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Calculate MSE for training and testing data
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

# Print MSE values
print("Training MSE:", train_mse)
print("Testing MSE:", test_mse)
print("Testing RMSE:", np.sqrt(test_mse))



Training MSE: 732104073.8661993
Testing MSE: 768095557.9847302
Testing RMSE: 27714.536943357547


In [182]:
# Perform permutation importance
r = permutation_importance(pipeline, X_test, y_test, n_repeats=10, random_state=42)

# Use the original feature names since permutation importance works on the unprocessed features
raw_feature_names = X_test.columns.tolist()

# Check if the lengths match
if len(raw_feature_names) != len(r['importances_mean']):
    raise ValueError(f"Mismatch between raw feature names ({len(raw_feature_names)}) and importances ({len(r['importances_mean'])})")

# Create a DataFrame for permutation importances
importance_df = pd.DataFrame({
    'Feature': raw_feature_names,
    'Importance Mean': r['importances_mean'],
}).sort_values(by='Importance Mean', ascending=False)

# Display the result
print("Permutation Importance:")
print(importance_df)

Permutation Importance:
     Feature  Importance Mean
0  Fuel Type         0.001755
1      Brand        -0.000454
2      Model        -0.009864


<strong>Overall, my XGBoost model performed the best of all the models I developed.</strong> It also had a smaller gap between its training MSE (about 732.1 million) and testing MSE (about 768.1 million) than the other models, and its testing MSE was the lowest of all the models, about 2% below the baseline, which means it was the most successful in predicting car prices. One caveat is that this model used a different train-test split (random_state=42) than the other two (random_state=20), so the testing errors are not measured on exactly the same rows.

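<strong>Fuel Type was the most important feature,</strong> though its importance (about 0.002) was small. Brand and Model had slightly negative importances, meaning that shuffling them did not hurt the model's score on the testing data.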
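For context, permutation importance measures how much a model's score drops when one column is shuffled; for scikit-learn regressors the default score is R². The sketch below reimplements the idea under simplifying assumptions: the data are synthetic, and `predict` is a hypothetical stand-in for a fitted model rather than any of the pipelines above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: the target depends on column 0 only; column 1 is noise.
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

def predict(X):
    """Stand-in for a fitted model that has learned y = 3 * column 0."""
    return 3.0 * X[:, 0]

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(X, y, n_repeats=10):
    base = r2(y, predict(X))
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops.append(base - r2(y, predict(X_perm)))
        importances.append(np.mean(drops))
    return np.array(importances)

imp = permutation_importance_sketch(X, y)
print(imp)
```

A feature the model ignores gets an importance of zero. When a model leans only weakly on a feature, the shuffled score can come out slightly higher than the original by chance, which produces small negative values; these are best read as "no measurable contribution" rather than as a harmful feature.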

## Next Steps & Discussion

#### Summary of Findings

In my analysis of car prices, all the models I constructed improved on the baseline predictor except the multiple regression model, whose testing error was essentially the same as the baseline's. The improvements were modest: the best testing MSE was about 2% below the baseline. Ranked by testing performance, the models are: XGBoost Regression, K-Nearest Neighbors Regression, Multiple Linear Regression.
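The size of each improvement can be worked out directly from the MSE values printed earlier in the notebook. This is a small arithmetic sketch using those reported numbers. Keep in mind that the baseline was computed on all rows and that the XGBoost model used a different split, so the figures are indicative rather than strictly comparable:

```python
import math

# MSE values copied from the notebook outputs above.
baseline_mse = 784046722.9322176
test_mse = {
    "Multiple Linear Regression": 784095347.8221062,
    "K-Nearest Neighbors": 776192791.93465,
    "XGBoost": 768095557.9847302,
}

summary = {}
for name, mse in test_mse.items():
    # Positive values mean the model beat the baseline.
    pct_mse_reduction = 100 * (baseline_mse - mse) / baseline_mse
    rmse_reduction = math.sqrt(baseline_mse) - math.sqrt(mse)
    summary[name] = (pct_mse_reduction, rmse_reduction)
    print(f"{name}: MSE {pct_mse_reduction:+.2f}% vs baseline, "
          f"RMSE lower by {rmse_reduction:.0f}")
```

In dollar terms, the best model's RMSE is only a few hundred dollars lower than the baseline's roughly $28,000, which suggests that Fuel Type, Brand, and Model alone explain little of the variation in price.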

Key Findings:

1) Success of XGBoost Regression:
The XGBoost regression model emerged as the most effective, with the lowest testing error of all the models. This suggests it is the best suited of the three for capturing relationships within the car pricing data.

2) Feature Impact:
Feature importance varied across the models. Brand and Model tended to have similar importance to each other within each model, most likely because the two features overlap and relate to price in similar ways.

3) Variable Influence:
In the XGBoost model, the permutation importances of Brand and Model were negative, suggesting that while these features appeared helpful in the other two models, they were not contributing positively to this model's predictions.

In conclusion, the ensemble nature of the XGBoost model, which combines many decision trees, proved advantageous for this data. The changing emphasis on different features shows how different models treat the same features differently. These findings on feature importance and model performance offer a starting point for future analyses and predictive modeling of car prices.

#### Next Steps/Improvements

To enhance the predictive capabilities of the models and gain deeper insights into car prices, I would want to incorporate these additional features into my models:

- Economic Conditions Data:
    - I think it would be really interesting to explore the impact of broader economic conditions, such as interest rates, GDP growth, or regional economic performance, on car prices. Integrating data about the economic environment could help analyze how affordability and market trends influence buyer behavior and, subsequently, vehicle valuations.

- Location Data:
    - I would like to incorporate location-specific data, such as urban vs. rural settings or regional preferences, to better understand how geographical factors affect car prices. For example, certain car models may hold higher value in specific regions due to climate, terrain, or local preferences.

- Luxury Features:
    - It would be valuable to include data on luxury features, such as advanced safety systems, premium interiors, or high-end technology, to assess how these enhancements contribute to a car's price. Analyzing the role of these features might offer insights into the demand for premium offerings.

- Private Seller vs. Dealership:
    - I think it would be insightful to differentiate between cars sold by private sellers and those sold by dealerships. Incorporating this distinction could reveal pricing trends based on seller type and help identify any premium buyers are willing to pay for dealership-sourced vehicles.

By integrating these additional factors into the analysis, I would be able to refine the model further and obtain a more nuanced understanding of the factors that influence car prices. This approach could lead to more accurate predictions and actionable insights for buyers, sellers, and industry stakeholders.