# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [1]:
#The goal is to use vehicle features to predict the price of a vehicle. 

In [2]:
#We plan to build a regression model that predicts the price of a vehicle from its features. 

In [3]:
#To build a performant prediction model, we may need to prepare the data by removing outliers, scaling the data, and handling missing values. 
#We will also need to explore the data to understand the relationship between the features and the price of a vehicle,
#and then decide which variables to include in the final model.  

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [4]:
#To better understand the data, we want to understand the shape of the data, the types of data, and the distribution of the data. 
#We will also want to understand the relationship between the features and the price of a vehicle.  We also want to understand the missing values and outliers in the data.

In [5]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns


data_path = 'data/vehicles.csv'
cars = pd.read_csv(data_path)

cars.info()
cars.head()
cars.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

|  | id | price | year | odometer |
|---|---|---|---|---|
| count | 426880.0 | 426880.0 | 425675.0 | 422480.0 |
| mean | 7311487000.0 | 75199.03 | 2011.235191 | 98043.33 |
| std | 4473170.0 | 12182280.0 | 9.45212 | 213881.5 |
| min | 7207408000.0 | 0.0 | 1900.0 | 0.0 |
| 25% | 7308143000.0 | 5900.0 | 2008.0 | 37704.0 |
| 50% | 7312621000.0 | 13950.0 | 2013.0 | 85548.0 |
| 75% | 7315254000.0 | 26485.75 | 2017.0 | 133542.5 |
| max | 7317101000.0 | 3736929000.0 | 2022.0 | 10000000.0 |


In [6]:
#cars.info()
missing = (cars.isna().mean()*100).sort_values(ascending=False)
print(missing)


size            71.767476
cylinders       41.622470
condition       40.785232
VIN             37.725356
drive           30.586347
paint_color     30.501078
type            21.752717
manufacturer     4.133714
title_status     1.930753
model            1.236179
odometer         1.030735
fuel             0.705819
transmission     0.598763
year             0.282281
id               0.000000
region           0.000000
price            0.000000
state            0.000000
dtype: float64
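
The missing-value percentages above can drive column selection directly, rather than relying on a hand-typed list. The sketch below is a minimal illustration on a toy frame standing in for `cars`; the 20% cutoff (`MISSING_CUTOFF`) is an assumed value, not one stated in this notebook. Applied to the percentages printed above, a 20% cutoff would select `size`, `cylinders`, `condition`, `VIN`, `drive`, `paint_color`, and `type`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `cars`; column names mirror the real data, values are made up.
toy = pd.DataFrame({
    "price": [5000, 12000, 8000, 15000, 9000],
    "size": [np.nan, np.nan, np.nan, "full-size", np.nan],  # 80% missing
    "fuel": ["gas", "gas", np.nan, "diesel", "gas"],         # 20% missing
    "state": ["al", "al", "ga", "fl", "tx"],                 # 0% missing
})

MISSING_CUTOFF = 0.20  # assumed threshold; adjust to the project's needs

# Share of missing values per column, then the columns strictly above the cutoff.
missing_share = toy.isna().mean()
cols_to_drop = missing_share[missing_share > MISSING_CUTOFF].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print("Dropping:", cols_to_drop)
print("Remaining:", toy_reduced.columns.tolist())
```

Keeping the cutoff in one named constant makes the choice easy to revisit if a mostly-empty column such as `condition` turns out to be worth imputing instead of dropping.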


In [7]:
numeric_df = cars.select_dtypes(include=[np.number])
corr = numeric_df.corr()

fig = px.imshow(
    corr,
    text_auto=True,
    title="Correlation of Numerical Features"
)
fig.show()

In [8]:
sample = cars.sample(min(10000, len(cars)), random_state=42)

fig = px.scatter(
    sample, 
    x="odometer", 
    y="price",
    opacity=0.4,
    title="Price vs Mileage - sample",
    labels={"odometer": "Odometer", "price": "Price"}
)
fig.show()

In [9]:
fig = px.histogram(
    cars, 
    x="price", 
    nbins=50, 
    title="Histogram of Car Prices",
    labels={"price": "Price"}
)
fig.update_layout(bargap=0.1)
fig.show()

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [10]:
# First we will remove columns that are missing too much information
cars2 = cars.drop(columns=['size', 'cylinders','condition','VIN','drive','paint_color','type'])
#cars2.isna().sum()

# Next we will remove rows with missing data in key fields
cars3 = cars2.dropna(subset=['price', 'manufacturer', 'model', 'fuel', 'title_status', 'transmission','state','year','odometer'])
cars3.isna().sum()

# Next we will remove outliers
cars4 = cars3.copy()
cars4 = cars4[cars4["price"] >= 1000]
cars4 = cars4[cars4["price"] <= 100000]

cars4 = cars4[cars4["odometer"] > 0]
cars4 = cars4[cars4["odometer"] <= 300000]

cars4 = cars4[cars4["year"] >= 1980]  
cars4 = cars4[cars4["year"] <= 2025] 

#Last we will check that we still have enough records and that there are no missing values
cars3.shape, cars4.shape
cars4.isna().sum()


id              0
region          0
price           0
year            0
manufacturer    0
model           0
fuel            0
odometer        0
title_status    0
transmission    0
state           0
dtype: int64
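
It can help to see how many rows each outlier rule removes before committing to the bounds. The sketch below expresses the same price, odometer, and year limits as named boolean masks on a small made-up frame (the values are hypothetical, not taken from `vehicles.csv`) and reports the failures per rule.

```python
import pandas as pd

# Hypothetical listings; only the bounds below come from the notebook.
toy = pd.DataFrame({
    "price":    [0, 500, 12000, 250000, 18000, 9000, 14000, 7000],
    "odometer": [50000, 80000, 0, 40000, 120000, 450000, 60000, 90000],
    "year":     [2015, 2010, 2018, 2020, 1975, 2012, 2016, 2011],
})

# One boolean mask per rule; True means the row passes that rule.
rules = {
    "price in [1000, 100000]": toy["price"].between(1000, 100000),
    # inclusive="right" excludes 0 and includes 300000 (pandas >= 1.3).
    "odometer in (0, 300000]": toy["odometer"].between(0, 300000, inclusive="right"),
    "year in [1980, 2025]":    toy["year"].between(1980, 2025),
}

for name, mask in rules.items():
    print(f"{name}: {(~mask).sum()} rows fail")

# A row is kept only if it passes every rule.
keep = pd.concat(rules, axis=1).all(axis=1)
toy_clean = toy[keep]
print(len(toy), "rows ->", len(toy_clean), "rows")
```

Reporting per-rule counts shows which bound is doing most of the filtering, which is useful when deciding whether a limit such as the 1,000 price floor is too aggressive.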

In [11]:
fig = px.histogram(
    cars4, 
    x="price", 
    nbins=10, 
    title="Histogram of Car Prices",
    labels={"price": "Price"}
)
fig.update_layout(bargap=0.1)
fig.show()

In [12]:
sample = cars4.sample(min(10000, len(cars4)), random_state=42)

fig = px.scatter(
    sample, 
    x="year", 
    y="price",
    opacity=0.4,
    title="Price vs year - sample",
    labels={"year": "year", "price": "Price"}
)
fig.show()

In [13]:
sample = cars4.sample(min(10000, len(cars4)), random_state=42)

fig = px.scatter(
    sample, 
    x="manufacturer", 
    y="price",
    opacity=0.4,
    title="Price vs mfg - sample",
    labels={"manufacturer": "mfg", "price": "Price"}
)
fig.show()
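
A scatter plot with a categorical x-axis tends to overplot, so a grouped summary can be easier to read. The following is a minimal sketch on made-up data that ranks manufacturers by median price; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical listings used only to illustrate the grouping.
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "ford", "toyota", "toyota", "bmw", "bmw", "bmw"],
    "price": [9000, 15000, 12000, 14000, 18000, 30000, 22000, 26000],
})

# Listing count and median price per manufacturer, highest median first.
summary = (
    toy.groupby("manufacturer")["price"]
       .agg(n="count", median="median")
       .sort_values("median", ascending=False)
)
print(summary)
```

The median is used here rather than the mean because used car prices are right-skewed, and the count column flags manufacturers whose summary rests on very few listings.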

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [14]:
# Create some variables better suited for modeling
cars4["age"] = 2025 - cars4["year"]
cars4["odometer_log"] = np.log(cars4["odometer"])
cars4["price_log"] = np.log(cars4["price"])


In [15]:
num_features = ["age", "odometer_log"]
cat_features = [
    "manufacturer",
    "model",
    "fuel",
    "title_status",
    "transmission",
    "state"
]
target = "price_log"

cars5 = cars4[num_features + cat_features + [target]].copy()

cars5.head()

|  | age | odometer_log | manufacturer | model | fuel | title_status | transmission | state | price_log |
|---|---|---|---|---|---|---|---|---|---|
| 27 | 11.0 | 10.96687 | gmc | sierra 1500 crew cab slt | gas | clean | other | al | 10.421984 |
| 28 | 15.0 | 11.173655 | chevrolet | silverado 1500 | gas | clean | other | al | 10.025263 |
| 29 | 5.0 | 9.86058 | chevrolet | silverado 1500 crew | gas | clean | other | al | 10.586332 |
| 30 | 8.0 | 10.624347 | toyota | tundra double cab sr | gas | clean | other | al | 10.34142 |
| 31 | 12.0 | 11.759786 | ford | f-150 xlt | gas | clean | automatic | al | 9.615805 |


In [20]:
#Load modeling packages, and split data into training and test sets
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold


X = cars5[num_features + cat_features]
y = cars5[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

X_train.shape, X_test.shape

((273820, 8), (68455, 8))

In [17]:
#Prepare numeric and category data for modeling

numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = OneHotEncoder(
    handle_unknown="ignore", sparse_output=True
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features)
    ])

In [18]:
def eval_model(y_true, y_pred, label=""):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{label}  RMSE: {rmse:.4f},  MAE: {mae:.4f}")
    return rmse, mae

linreg = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

linreg.fit(X_train, y_train)

y_pred_train = linreg.predict(X_train)
y_pred_test = linreg.predict(X_test)

eval_model(y_train, y_pred_train, "Linear Regression (Train)")
eval_model(y_test, y_pred_test, "Linear Regression (Test)")

Linear Regression (Train)  RMSE: 0.3415,  MAE: 0.2069
Linear Regression (Test)  RMSE: 0.3743,  MAE: 0.2305


(np.float64(0.37428325672990265), 0.23047379405583493)
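
Because the target is `price_log`, these errors are in log-price units rather than dollars. A log-scale RMSE of r corresponds roughly to a typical multiplicative error of exp(r), so the test RMSE of 0.3743 suggests predictions are typically off by a factor of about 1.45. The sketch below shows the back-transformation on hypothetical values (not notebook output) for reporting a dollar-scale error as well.

```python
import numpy as np

# Hypothetical true and predicted prices, stored on the log scale as in the notebook.
y_true_log = np.log(np.array([5000.0, 12000.0, 20000.0, 35000.0]))
y_pred_log = np.log(np.array([6000.0, 11000.0, 25000.0, 30000.0]))

# Back-transform to dollars before computing a dollar-scale error.
y_true = np.exp(y_true_log)
y_pred = np.exp(y_pred_log)
mae_dollars = np.mean(np.abs(y_true - y_pred))

# The log-scale RMSE translates to an approximate multiplicative error.
rmse_log = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))
typical_factor = np.exp(rmse_log)

print(f"MAE in dollars: {mae_dollars:,.0f}")
print(f"log RMSE: {rmse_log:.3f} -> typical factor x{typical_factor:.2f}")
```

Reporting both scales lets a non-technical audience see the error in dollars while the model is still compared on the scale it was trained on.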

In [21]:
#Now we will build a Ridge Regression model

ridge = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", Ridge())
])

param_grid = {
    "model__alpha": [0.1, 1, 10, 100]
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)

ridge_grid = GridSearchCV(
    ridge,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=cv,
    n_jobs=-1
)

ridge_grid.fit(X_train, y_train)

print("Best Ridge Params:", ridge_grid.best_params_)
print("Best Ridge CV Score:", ridge_grid.best_score_)

ridge_best = ridge_grid.best_estimator_

y_pred_ridge = ridge_best.predict(X_test)
eval_model(y_test, y_pred_ridge, "Ridge Regression (Test)")

Best Ridge Params: {'model__alpha': 0.1}
Best Ridge CV Score: -0.37161959484139273
Ridge Regression (Test)  RMSE: 0.3710,  MAE: 0.2295


(np.float64(0.37095453920926913), 0.22945010507512376)
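
The selected `alpha` of 0.1 is the smallest value in the grid, which may mean the best value lies at or below the grid edge. A common follow-up is to widen the grid and inspect `cv_results_` across all candidates. The sketch below does this on synthetic data from `make_regression`; the data, the bare `Ridge()` estimator without the preprocessing pipeline, and the `np.logspace(-3, 2, 6)` grid are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data standing in for the car features.
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

alphas = np.logspace(-3, 2, 6)  # 0.001, 0.01, ..., 100 (assumed wider grid)
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_toy, y_toy)

# One row per alpha; flip the sign so RMSE reads as a positive error.
cv_table = (
    pd.DataFrame(search.cv_results_)[["param_alpha", "mean_test_score", "std_test_score"]]
      .assign(rmse=lambda d: -d["mean_test_score"])
)
print(cv_table[["param_alpha", "rmse", "std_test_score"]])

# Flag the case where the winner sits on the boundary of the grid.
best_alpha = search.best_params_["alpha"]
at_edge = best_alpha in (alphas.min(), alphas.max())
print("best alpha:", best_alpha, "| at grid edge:", at_edge)
```

If the winner sits on the edge, or the RMSE differences between candidates are small relative to `std_test_score`, the grid search has not yet identified a clearly preferred level of regularization.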



### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [25]:
#Now we will compare the two models and see which one performed better
#We will use the Ridge Regression model as our final model


results = pd.DataFrame({
    "model": ["Linear Regression", "Ridge Regression"],
    "RMSE": [
        np.sqrt(mean_squared_error(y_test, y_pred_test)),
        np.sqrt(mean_squared_error(y_test, y_pred_ridge)),
    ],
    "MAE": [
        mean_absolute_error(y_test, y_pred_test),
        mean_absolute_error(y_test, y_pred_ridge),
    ]
})

results

|  | model | RMSE | MAE |
|---|---|---|---|
| 0 | Linear Regression | 0.374283 | 0.230474 |
| 1 | Ridge Regression | 0.370955 | 0.22945 |


In [28]:
#Moving forward with ridge, we need to evaluate what features matter the most to the best performing model

# Get feature names from the preprocessor
ohe = ridge_best.named_steps["preprocessor"].named_transformers_["cat"]
cat_feature_names = ohe.get_feature_names_out(cat_features)

num_feature_names = num_features
all_feature_names = np.concatenate([num_feature_names, cat_feature_names])

# Get coefficients from the Ridge model
coefs = ridge_best.named_steps["model"].coef_

coef_df = pd.DataFrame({
    "feature": all_feature_names,
    "coef": coefs
})

# Sort by impact
coef_sorted = coef_df.sort_values("coef", ascending=False)
coef_sorted.head(20), coef_sorted.tail(20)

(                           feature      coef
 12348                model_rampage  3.899266
 16226          model_vanagon l bus  3.887443
 11181                    model_nsx  3.347668
 9021          model_grand national  3.339066
 9575                     model_j10  3.298840
 13811   model_sierra classic jimmy  3.224339
 12742     model_riviera camper van  3.119995
 500                    model_230ge  3.113177
 14334        model_skyline gtr r32  3.064070
 13367              model_scrambler  2.997611
 3272   model_bus/vanagon gl camper  2.995873
 14962       model_supra twin turbo  2.994386
 16442      model_westfalia vanagon  2.966494
 14974             model_syncro 4x4  2.939167
 3726                model_capri rs  2.919008
 418              model_1985 blazer  2.912309
 14335  model_skyline gtr r32 bnr32  2.910975
 9927        model_landcruiser fj45  2.908537
 16228      model_vanagon westfalia  2.877759
 16229     model_vanagon/campmobile  2.847284,
                                 

In [None]:
#The top- and bottom-weighted features are all vehicle models, as would be expected.
#If I were to do this again, I would likely run a few more variants without the model feature and group makes into categories such as basic, premium, and luxury.
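
Since the target is the log of price, a one-hot coefficient c corresponds to multiplying the predicted price by roughly exp(c), other inputs held fixed. The top coefficient of about 3.9 would imply a multiplier near 49, which more plausibly reflects a few rare listings than a general market effect. The sketch below uses hypothetical coefficients in the same layout as `coef_df` to convert coefficients to multipliers and to summarize them by source column; the helper `source_column` and all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients laid out like coef_df in the notebook.
toy_coefs = pd.DataFrame({
    "feature": ["age", "odometer_log", "fuel_diesel", "fuel_gas",
                "title_status_salvage", "model_nsx"],
    "coef": [-0.30, -0.15, 0.25, -0.05, -0.60, 3.35],
})

numeric = {"age", "odometer_log"}
prefixes = ["manufacturer", "model", "fuel", "title_status", "transmission", "state"]

def source_column(feature: str) -> str:
    """Map an encoded feature name back to the column it came from."""
    if feature in numeric:
        return feature
    for prefix in prefixes:
        if feature.startswith(prefix + "_"):
            return prefix
    return "unknown"

toy_coefs["source"] = toy_coefs["feature"].map(source_column)
# For one-hot columns this is the multiplier when the category applies;
# for scaled numeric columns it is the multiplier per one standard deviation.
toy_coefs["price_multiplier"] = np.exp(toy_coefs["coef"])

# Largest absolute coefficient contributed by each original column.
by_source = toy_coefs.groupby("source")["coef"].agg(lambda s: s.abs().max())
print(toy_coefs)
print(by_source.sort_values(ascending=False))
```

Summarizing by source column gives a dealership-friendly view of which attributes matter, instead of a list dominated by individual model names.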

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

In [None]:
#Refer to README.txt for more details.