# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

Our goal is to identify the factors that most influence the price of used cars in your inventory. Framed as a data task, this is a supervised regression problem: using a dataset of vehicle listings, we model price as a function of attributes such as age, mileage, condition, and brand. We cleaned the data and engineered two derived features, car age and mileage per year, to better capture depreciation and usage patterns. We standardized numerical features and one-hot encoded categorical features to make them suitable for modeling. By comparing several regression models and regularization settings, we estimate which attributes are most predictive of price, so that pricing and marketing decisions rest on evidence.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [107]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [108]:
#load dataset
vehicles_df = pd.read_csv("/content/sample_data/vehicles.csv")

In [109]:
print(vehicles_df.head())

           id                  region  price  year manufacturer model  \
0  7222695916                prescott   6000   NaN          NaN   NaN   
1  7218891961            fayetteville  11900   NaN          NaN   NaN   
2  7221797935            florida keys  21000   NaN          NaN   NaN   
3  7222270760  worcester / central MA   1500   NaN          NaN   NaN   
4  7210384030              greensboro   4900   NaN          NaN   NaN   

  condition cylinders fuel  odometer title_status transmission  VIN drive  \
0       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
1       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
2       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
3       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   
4       NaN       NaN  NaN       NaN          NaN          NaN  NaN   NaN   

  size type paint_color state  
0  NaN  NaN         NaN    az  
1  NaN  NaN         NaN    ar  
2 

In [110]:
print(vehicles_df.tail())

                id   region  price    year manufacturer  \
426875  7301591192  wyoming  23590  2019.0       nissan   
426876  7301591187  wyoming  30590  2020.0        volvo   
426877  7301591147  wyoming  34990  2020.0     cadillac   
426878  7301591140  wyoming  28990  2018.0        lexus   
426879  7301591129  wyoming  30590  2019.0          bmw   

                           model condition    cylinders    fuel  odometer  \
426875         maxima s sedan 4d      good  6 cylinders     gas   32226.0   
426876  s60 t5 momentum sedan 4d      good          NaN     gas   12029.0   
426877          xt4 sport suv 4d      good          NaN  diesel    4174.0   
426878           es 350 sedan 4d      good  6 cylinders     gas   30112.0   
426879  4 series 430i gran coupe      good          NaN     gas   22716.0   

       title_status transmission                VIN drive size       type  \
426875        clean        other  1N4AA6AV6KC367801   fwd  NaN      sedan   
426876        clean        o

In [111]:
print(vehicles_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [112]:
numerical_cols = vehicles_df.select_dtypes(include=['int64', 'float64']).columns
print(f"numerical columns: {vehicles_df[numerical_cols].describe().round(2)}")
print("\n")
categorical_cols = vehicles_df.select_dtypes(include=['object']).columns
print(f"categorical columns: {vehicles_df.describe(include='O')}")
print("\n")

numerical columns:                  id         price       year     odometer
count  4.268800e+05  4.268800e+05  425675.00    422480.00
mean   7.311487e+09  7.519903e+04    2011.24     98043.33
std    4.473170e+06  1.218228e+07       9.45    213881.50
min    7.207408e+09  0.000000e+00    1900.00         0.00
25%    7.308143e+09  5.900000e+03    2008.00     37704.00
50%    7.312621e+09  1.395000e+04    2013.00     85548.00
75%    7.315254e+09  2.648575e+04    2017.00    133542.50
max    7.317101e+09  3.736929e+09    2022.00  10000000.00


categorical columns:           region manufacturer   model condition    cylinders    fuel  \
count     426880       409234  421603    252776       249202  423867   
unique       404           42   29649         6            8       5   
top     columbus         ford   f-150      good  6 cylinders     gas   
freq        3608        70985    8009    121456        94169  356209   

       title_status transmission                VIN   drive       size  \
c

In [113]:
# Check for missing values
missing_summary = vehicles_df.isnull().sum().sort_values(ascending=False)
missing_percent = (vehicles_df.isnull().mean() * 100).sort_values(ascending=False)

# Check for duplicate rows
duplicate_count = vehicles_df.duplicated().sum()

# Check for invalid numerical values
invalid_price = (vehicles_df['price'] <= 0).sum()
invalid_year = (vehicles_df['year'] < 1900).sum()
invalid_odometer = (vehicles_df['odometer'] < 0).sum()

#Print summary
print("Missing Values (Top 10):")
print(pd.concat([missing_summary.head(10), missing_percent.head(10)], axis=1, keys=['Missing Count', '% of Total']))
print("\nDuplicate Rows:", duplicate_count)
print("\nInvalid Entries:")
print(f"Price <= 0: {invalid_price}")
print(f"Year < 1900: {invalid_year}")
print(f"Odometer < 0: {invalid_odometer}")

Missing Values (Top 10):
              Missing Count  % of Total
size                 306361   71.767476
cylinders            177678   41.622470
condition            174104   40.785232
VIN                  161042   37.725356
drive                130567   30.586347
paint_color          130203   30.501078
type                  92858   21.752717
manufacturer          17646    4.133714
title_status           8242    1.930753
model                  5277    1.236179

Duplicate Rows: 0

Invalid Entries:
Price <= 0: 32895
Year < 1900: 0
Odometer < 0: 0


In [114]:
num_cols = ['price','year','odometer']

filtered_df = vehicles_df.copy()
for col in num_cols:
    low = vehicles_df[col].quantile(0.01)
    high = vehicles_df[col].quantile(0.99)
    filtered_df = filtered_df[(filtered_df[col] >= low) & (filtered_df[col] <= high)]

print("NOTE: Histograms and boxplots use data filtered to the 1st-99th percentile to visualize the main distribution and reduce the effect of extreme outliers.")
print("\n")

# Histograms using the FILTERED data
for col in num_cols:
    plt.figure(figsize=(10,6))
    sns.histplot(filtered_df[col], bins=50, kde=True) # Added KDE for smooth distribution line
    plt.title(f"Distribution of {col.capitalize()} (Filtered: 1st-99th Percentile)")
    plt.xlabel(col.capitalize())
    plt.ylabel("Count")
    plt.tight_layout()
    #plt.show()
    plt.savefig(f"{col}_histogram.png")
    plt.close()

# Boxplots using the FILTERED data
for col in num_cols:
    plt.figure(figsize=(10,4))
    sns.boxplot(x=filtered_df[col])
    plt.title(f"Boxplot of {col.capitalize()} (Filtered: 1st-99th Percentile)")
    plt.tight_layout()
    #plt.show()
    plt.savefig(f"{col}_boxplot.png")
    plt.close()

NOTE: Histograms and boxplots use data filtered to the 1st-99th percentile to visualize the main distribution and reduce the effect of extreme outliers.




In [115]:
#Price by categorical data
categorical_cols = [
    'manufacturer','condition','cylinders','fuel',
    'title_status','transmission','drive','size',
    'type','paint_color','state','region'
]

for col in categorical_cols:
    # Median price barplot
    plt.figure(figsize=(12,6))
    median_vals = vehicles_df.groupby(col)['price'].median().sort_values(ascending=False).head(15)
    sns.barplot(x=median_vals.values, y=median_vals.index)
    plt.title(f"Median Price by {col}")
    plt.xlabel("Median Price ($)")
    plt.tight_layout()
    #plt.show()
    plt.savefig(f"{col}_median_price.png")
    plt.close()

    # Boxplot of price distribution
    plt.figure(figsize=(12,6))
    sns.boxplot(y=col, x="price", data=vehicles_df, order=vehicles_df[col].value_counts().index[:10])
    plt.title(f"Price Distribution by {col}")
    plt.xlim(0, 100000)
    plt.tight_layout()
    #plt.show()
    plt.savefig(f"{col}_price_distribution.png")
    plt.close()

# Categorical Data Distribution
for col in categorical_cols:
    plt.figure(figsize=(12,6))
    top_counts = vehicles_df[col].value_counts().head(10)
    sns.barplot(x=top_counts.values, y=top_counts.index)
    plt.title(f"Top {len(top_counts)} categories in {col}")
    plt.xlabel("Count")
    plt.tight_layout()
    #plt.show()
    plt.savefig(f"{col}_distribution.png")
    plt.close()

## Data Exploration Report
**Numerical Variables**

***Distribution***

Price is right-skewed, with a median of $13,950 and extreme high outliers (up to $3.7B). The interquartile range (IQR) is $20,586.
Year is left-skewed; most vehicles are newer (median 2013), with a long tail of older vehicles reaching back to 1900.
Odometer is right-skewed, with extreme values extending to 10 million miles. The distribution slowly decreases after typical mileage values (~100,000 miles).
Histograms and boxplots are visualized using the 1–99th percentile to reduce extreme outlier effects.
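
The direction of skew can also be checked numerically rather than by eye. The following is a minimal sketch using pandas' `skew()` on synthetic stand-in data; the `demo` frame and its values are invented for illustration and are not the listings data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the real listings): prices with a long upper tail,
# model years with a long tail toward older vehicles.
demo = pd.DataFrame({
    "price": rng.lognormal(mean=9.5, sigma=0.8, size=5000),
    "year": 2022 - rng.exponential(scale=8.0, size=5000).round(),
})

# Sample skewness per column.
skewness = demo.skew()
print(skewness.round(2))

# Positive skewness means a long right tail (a few very large values);
# negative skewness means a long left tail (a few very small values).
```

On the real data the same call would be `vehicles_df[['price', 'year', 'odometer']].skew()`, ideally after filtering extreme outliers, since a handful of very large prices dominates the statistic.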

**Categorical Variables**

***Key Observations***


*   Manufacturer: Ferrari has the highest median price; less common makes can be very low. Displaying all manufacturers shows the full spectrum.
*   Condition: Excellent and Good vehicles have high median prices; Salvage and Parts Only are low. Interestingly, New is lower than Good.
*   Cylinders: More cylinders generally correspond to higher prices; fewest cylinders correspond to lowest prices.
*   Fuel Type: Diesel shows the highest median price; hybrid/gas is lowest; electric is close behind diesel.
*   Title Status: Clean dominates; Parts Only has the lowest prices.
*   Transmission: Manual cars have lower prices; Other transmissions show the highest median.
*   Drive: FWD vehicles are cheapest; 4WD most expensive.
*   Size: Full-size cars are the most expensive; compact the cheapest, though differences are small.
*   Type: Pickup trucks have the highest median prices; minivans are least expensive.
*   Paint Color: White is the most expensive; green is cheapest.
*   State: California has the highest median prices; the lowest-price states do not appear in the bar chart, which shows only the top 15 states by median price.


### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.
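
One way to organize these steps so that scaling and encoding are learned from the training split only is an sklearn `Pipeline` wrapping a `ColumnTransformer`. The sketch below is an illustration under stated assumptions: the `toy` frame is synthetic, its column names merely mirror the dataset, and this is not the code used in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 500

# Synthetic listings (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "odometer": rng.uniform(5_000, 200_000, n),
    "car_age": rng.integers(1, 25, n).astype(float),
    "fuel": rng.choice(["gas", "diesel", "electric"], n),
})
toy["price"] = (40_000 - 900 * toy["car_age"] - 0.05 * toy["odometer"]
                + rng.normal(0, 1_500, n))

X = toy.drop(columns="price")
y = toy["price"]  # target left in dollars, not scaled
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["odometer", "car_age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])

# Fitting the pipeline fits the scaler and encoder on X_train only,
# so no statistics from the test split leak into preprocessing.
model = Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))])
model.fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)
print(f"Test R2: {test_r2:.3f}")
```

Keeping the target out of the scaler also means error metrics come out directly in dollars.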

In [116]:
columns_to_drop = ['id', 'VIN', 'size', 'region', 'model']
vehicles_df = vehicles_df.drop(columns=columns_to_drop)

print("Selected columns for modeling:", vehicles_df.columns.tolist())

Selected columns for modeling: ['price', 'year', 'manufacturer', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'paint_color', 'state']


In [117]:
# Numerical: price <= 0 is invalid, so set those values to NaN temporarily
vehicles_df['price'] = vehicles_df['price'].where(vehicles_df['price'] > 0)

#dropping rows with missing target (price)
vehicles_df = vehicles_df.dropna(subset=['price'])

# Fill missing with 'Unknown' for cat col
categorical_cols = vehicles_df.select_dtypes(include='object').columns.tolist()
vehicles_df[categorical_cols] = vehicles_df[categorical_cols].fillna('Unknown')

# Cap (clip) outliers in quantitative columns at percentile bounds; no rows are removed
vehicles_df['odometer'] = vehicles_df['odometer'].clip(upper=vehicles_df['odometer'].quantile(0.99))
vehicles_df['year'] = vehicles_df['year'].clip(lower=vehicles_df['year'].quantile(0.01),
                                              upper=vehicles_df['year'].quantile(0.99))

# <5% missing values impute with median (robust to outliers)
vehicles_df['year'] = vehicles_df['year'].fillna(vehicles_df['year'].median())
vehicles_df['odometer'] = vehicles_df['odometer'].fillna(vehicles_df['odometer'].median())

# Verify cleaning
print("Missing values after cleaning:\n", vehicles_df.isnull().sum())

Missing values after cleaning:
 price           0
year            0
manufacturer    0
condition       0
cylinders       0
fuel            0
odometer        0
title_status    0
transmission    0
drive           0
type            0
paint_color     0
state           0
dtype: int64


In [118]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA


df = vehicles_df.copy()

# Derived Features
df['car_age'] = 2025 - df['year']  # assuming current year = 2025
df['mileage_per_year'] = df['odometer'] / df['car_age']

# Handle missing values for derived features
df['car_age'] = df['car_age'].fillna(df['car_age'].median())
df['mileage_per_year'] = df['mileage_per_year'].fillna(df['mileage_per_year'].median())

# Cap extreme outliers at 1st and 99th percentiles
num_cols_to_cap = ['price', 'odometer', 'car_age', 'mileage_per_year']
for col in num_cols_to_cap:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    df[col] = np.clip(df[col], lower, upper)

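# Log-transform year. Caution: year is not right-skewed, and over this range log1p(year)
# is almost perfectly collinear with car_age (2025 - year), which makes both coefficients unstable.
df['year_log'] = np.log1p(df['year'])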


# One-hot encoding for categorical columns
categorical_cols = ['manufacturer', 'condition', 'cylinders', 'fuel',
                    'title_status', 'transmission', 'drive', 'type', 'paint_color', 'state']

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_cat = ohe.fit_transform(df[categorical_cols])
encoded_cat_df = pd.DataFrame(encoded_cat, columns=ohe.get_feature_names_out(categorical_cols))

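# Select numeric columns for scaling (note: this includes the target, price,
# so the error metrics reported later are in standardized units, not dollars)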
numeric_cols = ['price', 'odometer', 'car_age', 'mileage_per_year', 'year_log']

# Scale numeric columns
scaler = StandardScaler()
scaled_numeric = scaler.fit_transform(df[numeric_cols])
scaled_numeric_df = pd.DataFrame(scaled_numeric, columns=numeric_cols)

# Combine scaled numeric + one-hot encoded categorical
df_prepared = pd.concat([scaled_numeric_df.reset_index(drop=True),
                         encoded_cat_df.reset_index(drop=True)], axis=1)

print("Prepared data shape:", df_prepared.shape)

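# Apply PCA (exploratory only: pca_data is not used by the models below,
# and df_prepared still contains the scaled price column at this point)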
pca = PCA(n_components=0.95)
pca_data = pca.fit_transform(df_prepared)
print("PCA output shape:", pca_data.shape)


# Print the number of components selected
print(f"Number of components selected: {pca.n_components_}")



Prepared data shape: (393985, 163)
PCA output shape: (393985, 72)
Number of components selected: 72


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
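
As a hedged illustration of what cross-validating a parameter choice can look like, the sketch below scores a Ridge model at a few alpha values with `cross_val_score` and a shuffled 5-fold splitter. The data come from `make_regression`, so `X_demo`, `y_demo`, and the alpha values are assumptions chosen for the example rather than anything taken from the car listings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (illustrative only).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, noise=15.0, random_state=0
)

# Shuffled 5-fold splitter, reused for every candidate so scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_rmse = {}
for alpha in [0.01, 1.0, 100.0]:
    # sklearn maximizes scores, so RMSE is returned negated.
    scores = cross_val_score(
        Ridge(alpha=alpha), X_demo, y_demo,
        cv=cv, scoring="neg_root_mean_squared_error",
    )
    cv_rmse[alpha] = -scores.mean()
    print(f"alpha={alpha:>6}: mean CV RMSE = {cv_rmse[alpha]:.2f}")

best_alpha = min(cv_rmse, key=cv_rmse.get)
print("Best alpha by CV RMSE:", best_alpha)
```

The held-out test split is then used once, at the end, to report the performance of the configuration chosen by cross-validation.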

In [119]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.pipeline import Pipeline


In [120]:
# Features (X) and Target (y)
X = df_prepared.drop(columns=['price'])
y = df_prepared['price']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validation setup
cv = KFold(n_splits=5, shuffle=True, random_state=42)


In [121]:
#Linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Performance:")
print("RMSE:", rmse_lr)
print("MAE:", mae_lr)
print("R2:", r2_lr)

Linear Regression Performance:
RMSE: 0.5994120935623549
MAE: 0.406384472353195
R2: 0.6394449125040521


In [122]:
# Linear Regression (train and test metrics)

model_1 = LinearRegression()
model_1.fit(X_train, y_train)

train_preds = model_1.predict(X_train)
test_preds = model_1.predict(X_test)

model_1_train_mse = mean_squared_error(y_train, train_preds)
model_1_test_mse = mean_squared_error(y_test, test_preds)
model_1_train_rmse = np.sqrt(model_1_train_mse)
model_1_test_rmse = np.sqrt(model_1_test_mse)
model_1_train_mae = mean_absolute_error(y_train, train_preds)
model_1_test_mae = mean_absolute_error(y_test, test_preds)
model_1_train_r2 = r2_score(y_train, train_preds)
model_1_test_r2 = r2_score(y_test, test_preds)

print(f"Model 1 Train MSE: {model_1_train_mse}")
print(f"Model 1 Test MSE: {model_1_test_mse}")
print(f"Model 1 Train RMSE: {model_1_train_rmse}")
print(f"Model 1 Test RMSE: {model_1_test_rmse}")
print(f"Model 1 Train MAE: {model_1_train_mae}")
print(f"Model 1 Test MAE: {model_1_test_mae}")
print(f"Model 1 Train R2: {model_1_train_r2}")
print(f"Model 1 Test R2: {model_1_test_r2}")


Model 1 Train MSE: 0.360383539889133
Model 1 Test MSE: 0.3592948579088053
Model 1 Train RMSE: 0.6003195314906329
Model 1 Test RMSE: 0.5994120935623549
Model 1 Train MAE: 0.40627135377959
Model 1 Test MAE: 0.406384472353195
Model 1 Train R2: 0.6399307131674692
Model 1 Test R2: 0.6394449125040521


In [123]:
# GridSearchCV - Ridge Regression

pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])

param_dict = {'ridge__alpha': [0.001, 0.1, 1.0, 10.0, 100.0, 1000.0]}

grid = GridSearchCV(pipe, param_grid=param_dict, cv=cv)  # use the shuffled 5-fold splitter defined above
grid.fit(X_train, y_train)

train_preds = grid.predict(X_train)
test_preds = grid.predict(X_test)

model_2_train_mse = mean_squared_error(y_train, train_preds)
model_2_test_mse = mean_squared_error(y_test, test_preds)
model_2_train_rmse = np.sqrt(model_2_train_mse)
model_2_test_rmse = np.sqrt(model_2_test_mse)
model_2_train_mae = mean_absolute_error(y_train, train_preds)
model_2_test_mae = mean_absolute_error(y_test, test_preds)
model_2_train_r2 = r2_score(y_train, train_preds)
model_2_test_r2 = r2_score(y_test, test_preds)
model_2_best_alpha = grid.best_params_

print(f"Model 2 Train MSE: {model_2_train_mse}")
print(f"Model 2 Test MSE: {model_2_test_mse}")
print(f"Model 2 Train RMSE: {model_2_train_rmse}")
print(f"Model 2 Test RMSE: {model_2_test_rmse}")
print(f"Model 2 Train MAE: {model_2_train_mae}")
print(f"Model 2 Test MAE: {model_2_test_mae}")
print(f"Model 2 Train R2: {model_2_train_r2}")
print(f"Model 2 Test R2: {model_2_test_r2}")
print(f'Best Alpha: {list(model_2_best_alpha.values())[0]}')




Model 2 Train MSE: 0.3603835641974108
Model 2 Test MSE: 0.3592939844687993
Model 2 Train RMSE: 0.6003195517367487
Model 2 Test RMSE: 0.5994113649813451
Model 2 Train MAE: 0.40627197132443155
Model 2 Test MAE: 0.4063852582925861
Model 2 Train R2: 0.6399306888803882
Model 2 Test R2: 0.6394457890076559
Best Alpha: 0.001


In [124]:
#Lasso Regression

lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)

train_preds = lasso.predict(X_train)
test_preds = lasso.predict(X_test)

model_3_train_mse = mean_squared_error(y_train, train_preds)
model_3_test_mse = mean_squared_error(y_test, test_preds)
model_3_train_rmse = np.sqrt(model_3_train_mse)
model_3_test_rmse = np.sqrt(model_3_test_mse)
model_3_train_mae = mean_absolute_error(y_train, train_preds)
model_3_test_mae = mean_absolute_error(y_test, test_preds)
model_3_train_r2 = r2_score(y_train, train_preds)
model_3_test_r2 = r2_score(y_test, test_preds)

print(f"Model 3 Train MSE: {model_3_train_mse}")
print(f"Model 3 Test MSE: {model_3_test_mse}")
print(f"Model 3 Train RMSE: {model_3_train_rmse}")
print(f"Model 3 Test RMSE: {model_3_test_rmse}")
print(f"Model 3 Train MAE: {model_3_train_mae}")
print(f"Model 3 Test MAE: {model_3_test_mae}")
print(f"Model 3 Train R2: {model_3_train_r2}")
print(f"Model 3 Test R2: {model_3_test_r2}")



Model 3 Train MSE: 0.46613286145218974
Model 3 Test MSE: 0.46491358266072097
Model 3 Train RMSE: 0.6827392338603295
Model 3 Test RMSE: 0.6818457176375906
Model 3 Train MAE: 0.4764556169667026
Model 3 Test MAE: 0.4768033288045889
Model 3 Train R2: 0.534273604604887
Model 3 Test R2: 0.5334557292305107


In [125]:
# Model performance summary
model_summary = {
    'Model': ['Linear Regression', 'Ridge Regression', 'Lasso Regression'],
    'RMSE': [model_1_test_rmse, model_2_test_rmse, model_3_test_rmse],
    'MAE': [model_1_test_mae, model_2_test_mae, model_3_test_mae],
    'R2': [model_1_test_r2, model_2_test_r2, model_3_test_r2]
}

summary_df = pd.DataFrame(model_summary)

summary_df = summary_df.sort_values(by='RMSE')

print("Model Performance Summary for Test Data:")
display(summary_df)

Model Performance Summary for Test Data:


|   | Model | RMSE | MAE | R2 |
|---|---|---|---|---|
| 1 | Ridge Regression | 0.599411 | 0.406385 | 0.639446 |
| 0 | Linear Regression | 0.599412 | 0.406384 | 0.639445 |
| 2 | Lasso Regression | 0.681846 | 0.476803 | 0.533456 |


In [126]:
# Coefficients for Linear Regression (model_1 was fitted above, so no refit is needed)
print("\nLinear Regression Feature Coefficients:")
for name, coef in zip(X_train.columns, model_1.coef_):
    print(f"{name}: {coef:.4f}")


# Coefficients for the best Ridge model found by the grid search above
print("\nRidge Regression Feature Coefficients:")
for name, coef in zip(X_train.columns, grid.best_estimator_.named_steps['ridge'].coef_):
    print(f"{name}: {coef:.4f}")

# Coefficients for Lasso Regression
print("\nLasso Regression Feature Coefficients:")
# alpha is set by hand here (it differs from the alpha=0.01 used for Model 3);
# no cross-validation was run for Lasso
lasso_best = Lasso(alpha=0.001, max_iter=10000)
lasso_best.fit(X_train, y_train)
for name, coef in zip(X_train.columns, lasso_best.coef_):
    print(f"{name}: {coef:.4f}")





Linear Regression Feature Coefficients:
odometer: -0.0775
car_age: -76.1040
mileage_per_year: -0.1397
year_log: -75.7145
manufacturer_Unknown: -0.0147
manufacturer_acura: -0.0216
manufacturer_alfa-romeo: 0.1056
manufacturer_aston-martin: 1.0785
manufacturer_audi: 0.1354
manufacturer_bmw: -0.0557
manufacturer_buick: -0.2043
manufacturer_cadillac: -0.0007
manufacturer_chevrolet: -0.0942
manufacturer_chrysler: -0.2971
manufacturer_datsun: 0.2440
manufacturer_dodge: -0.2951
manufacturer_ferrari: 2.6210
manufacturer_fiat: -0.6526
manufacturer_ford: -0.1132
manufacturer_gmc: 0.0156
manufacturer_harley-davidson: -0.5513
manufacturer_honda: -0.1252
manufacturer_hyundai: -0.3798
manufacturer_infiniti: -0.0871
manufacturer_jaguar: 0.0900
manufacturer_jeep: -0.1332
manufacturer_kia: -0.4361
manufacturer_land rover: 0.2212
manufacturer_lexus: 0.1769
manufacturer_lincoln: -0.0133
manufacturer_mazda: -0.2906
manufacturer_mercedes-benz: 0.0720
manufacturer_mercury: -0.2516
manufacturer_mini: -0.2224

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

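We compared three regression models, Linear Regression, Ridge Regression, and Lasso Regression, to determine which factors most influence used car prices. On the test set, Linear and Ridge Regression performed almost identically (RMSE ≈ 0.599, MAE ≈ 0.406, R2 ≈ 0.639, all measured on the standardized price scale), and the grid search selected the smallest Ridge alpha (0.001), so regularization added little. Lasso Regression with alpha = 0.01 performed worse on every metric (RMSE ≈ 0.682, MAE ≈ 0.477, R2 ≈ 0.533). Train and test scores are close for all three models, so there is little sign of overfitting.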
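
Because `price` was passed through `StandardScaler` together with the features, the reported RMSE and MAE are in units of price standard deviations. Multiplying by the scaler's standard deviation for that column recovers dollars. The sketch below demonstrates the identity on synthetic prices; the `price` array and the noisy `pred_std` predictions are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical prices in dollars (illustrative only).
price = rng.uniform(2_000, 40_000, size=1_000).reshape(-1, 1)

scaler = StandardScaler()
price_std = scaler.fit_transform(price)

# Pretend model output: the standardized price plus some error.
pred_std = price_std + rng.normal(0, 0.6, size=price_std.shape)

# RMSE on the standardized scale, as reported in the notebook.
rmse_std = np.sqrt(np.mean((price_std - pred_std) ** 2))

# Route 1: rescale the error by the column's standard deviation.
rmse_dollars_scaled = rmse_std * scaler.scale_[0]

# Route 2: map predictions back to dollars, then compute RMSE.
pred_dollars = scaler.inverse_transform(pred_std)
rmse_dollars_direct = np.sqrt(np.mean((price - pred_dollars) ** 2))

print(f"Standardized RMSE: {rmse_std:.3f}")
print(f"RMSE in dollars:   {rmse_dollars_scaled:,.0f}")
```

In the notebook, `price` is the first entry of `numeric_cols`, so `scaler.scale_[0]` would be the corresponding factor there as well.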

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

From a business perspective, these results showcase that the features listed below have the most impact on how much a buyer will spend on a used vehicle:


*   Vehicle Age and Mileage: Older vehicles and cars with higher odometer readings consistently sell for less. Buyers prioritize newer, low-mileage cars.

*   Brand and Manufacturer Reputation: Luxury and high-performance brands like Ferrari, Porsche, Tesla, and Aston Martin significantly increase perceived value, while lower-end brands tend to decrease it.

*   Vehicle Condition: Cars in excellent or like-new condition command higher prices, while fair, salvage, or rebuilt vehicles have diminished value.

*   Fuel Type and Engine Size: Diesel vehicles and cars with higher cylinder counts are valued more, whereas some electric and small-engine categories have less influence on price.

*   Vehicle Type: Convertibles, pickups, and SUVs generally have a higher market value than buses, hatchbacks, or wagons.

*   Title Status: Clean titles or lien-free vehicles increase buyer confidence and price potential; salvage or parts-only titles reduce it.

*   Regional Differences: Certain states (e.g., Utah, Montana, Washington) show higher valuation trends, suggesting regional demand patterns matter.
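
Findings like those listed above are typically read off the fitted coefficients by ranking them by absolute size. As a small sketch, the snippet below ranks a handful of the Linear Regression coefficients printed earlier in this notebook; only that subset is used, and the `coefs` and `ranked` names are introduced here for illustration.

```python
import pandas as pd

# A subset of the Linear Regression coefficients printed in the modeling output.
coefs = pd.Series({
    "odometer": -0.0775,
    "mileage_per_year": -0.1397,
    "manufacturer_aston-martin": 1.0785,
    "manufacturer_ferrari": 2.6210,
    "manufacturer_fiat": -0.6526,
    "manufacturer_harley-davidson": -0.5513,
    "manufacturer_kia": -0.4361,
})

# Order by magnitude while keeping the sign, so direction stays visible.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficients on one-hot columns describe a shift relative to the other categories of the same feature, while coefficients on scaled numeric columns describe the effect of a one-standard-deviation change, so magnitudes are only roughly comparable across the two groups.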


Actionable Recommendations for the Dealership:

*   Pricing Strategy: Use the regression coefficients and feature importance insights to develop a dynamic pricing model that considers age, mileage, brand, condition, and type. This ensures market-competitive, value-aligned pricing.

*   Inventory Management: Focus acquisition efforts on newer, low-mileage vehicles from high-demand brands. Refurbish vehicles with slightly lower condition scores to improve sale price. Consider geographic preferences when sourcing inventory.

*   Marketing and Sales: Highlight high-impact attributes — such as low mileage, clean title, fuel efficiency, or luxury brand — in online listings, showroom displays, and promotional materials. Tailor messaging based on vehicle type and regional demand trends.

*   Value-Adding Services: Offer minor maintenance, detailing, or feature upgrades to enhance perceived value. Even small improvements in condition or aesthetics can justify higher prices and attract more buyers.

*   Data-Driven Decision Making: Continuously track sales performance relative to feature importance to refine inventory acquisition, pricing, and marketing strategies over time. This allows the dealership to adapt to evolving consumer preferences.
