# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

In [78]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_transformer
from sklearn.inspection import permutation_importance

import pandas as pd
import plotly.express as px
import seaborn as sns

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

Ideally, every business wants to maximize profit. For used cars, identifying the key features that dictate price lets a dealership buy vehicles it can resell at a profit.
This can be accomplished with a regression model that uses price as the target. The model tells us which features are associated with the price of a car, and the coefficient for each feature indicates how much that feature moves the predicted price.
The model can then be applied to any future vehicles the dealership plans on buying.
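
To make the idea of reading feature impact from coefficients concrete, here is a minimal sketch on synthetic data. The variables `age`, `mileage`, and `price`, and the numbers used to generate them, are invented for illustration and are not taken from the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data: price depends on age and mileage by construction
rng = np.random.default_rng(0)
age = rng.uniform(0, 15, size=200)        # years
mileage = rng.uniform(0, 150, size=200)   # thousands of miles
price = 30000 - 1200 * age - 80 * mileage + rng.normal(0, 500, size=200)

X_demo = np.column_stack([age, mileage])
demo_model = LinearRegression().fit(X_demo, price)

# Each coefficient estimates the change in price for a one-unit change
# in that feature while the other feature is held fixed
coef_age, coef_mileage = demo_model.coef_
print(f"estimated change per year of age: {coef_age:.0f}")
print(f"estimated change per 1,000 miles: {coef_mileage:.0f}")
```

Because the data were generated with known slopes, the fitted coefficients should land close to -1200 and -80. On real listings the coefficients are only estimates of association, not proof of cause.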

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

It's always good to look through the data you have, or the data that can be compiled, before taking the next step forward. Not every data point will be required; that depends on the business need, which can vary. The features I expect to matter most are the car's color, make, model, year, trim, and mileage. The features I expect to be irrelevant, or to have little impact on the price of a car, are the id, region, state, and VIN.
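
A few quick checks can surface quality issues before any modeling. The sketch below uses a tiny made-up frame named `sample` (an assumption for illustration, not the real data) to show the kind of checks I have in mind: the share of missing values per column, the number of distinct values in text columns, and suspicious prices.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the listings data
sample = pd.DataFrame({
    "price": [6000, 11900, 0, 21000, 1500],
    "year": [2012.0, np.nan, 2008.0, 2019.0, np.nan],
    "condition": ["good", None, None, "excellent", None],
    "state": ["az", "ar", "fl", "ma", "nc"],
})

# Share of missing values per column, worst first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Distinct values per text column; high cardinality inflates one-hot encoding
cardinality = sample[["condition", "state"]].nunique()
print(cardinality)

# Listings with a non-positive price deserve a closer look
n_bad_price = int((sample["price"] <= 0).sum())
print(f"rows with price <= 0: {n_bad_price}")
```

The same three checks can be run on the full `cars` frame once it is loaded.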

In [2]:
cars = pd.read_csv('data/vehicles.csv')

In [3]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [4]:
cars.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
0,7222695916,prescott,6000,,,,,,,,,,,,,,,az
1,7218891961,fayetteville,11900,,,,,,,,,,,,,,,ar
2,7221797935,florida keys,21000,,,,,,,,,,,,,,,fl
3,7222270760,worcester / central MA,1500,,,,,,,,,,,,,,,ma
4,7210384030,greensboro,4900,,,,,,,,,,,,,,,nc


### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

Clean data is vital when building a model; without it, the model is hard to build and its predictions are hard to trust. One way to handle missing data is to remove the affected rows, as long as this does not wipe out the bulk of the dataset. If it would, the alternative is to fill in (impute) the missing values. I find imputation less reliable and avoid it where possible.
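
How many rows survive depends heavily on which columns are kept before dropping rows. The sketch below compares two options on a made-up frame named `demo`; the 50% threshold for calling a column sparse is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: "size" is mostly empty, "odometer" has one gap
demo = pd.DataFrame({
    "price": [5000, 7000, 9000, 12000, 15000, 18000],
    "odometer": [120000, np.nan, 90000, 60000, 40000, 20000],
    "size": [np.nan, np.nan, np.nan, "mid-size", np.nan, "full-size"],
})

# Option A: drop every row that has any missing value
kept_all = demo.dropna()

# Option B: first drop columns that are more than half empty,
# then drop the rows that are still incomplete
sparse_cols = demo.columns[demo.isna().mean() > 0.5]
kept_more = demo.drop(columns=sparse_cols).dropna()

print(f"rows kept with dropna on everything: {len(kept_all)} of {len(demo)}")
print(f"rows kept after removing sparse columns first: {len(kept_more)} of {len(demo)}")
```

Option B keeps more rows at the cost of losing a feature, so the choice depends on how useful the sparse column is expected to be.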

In [5]:
cars.isnull().sum()

id                   0
region               0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
state                0
dtype: int64

In [6]:
# Drop features that I believe would have little to no impact on the price of a car
cars_drop = cars.drop(['region', 'model', 'VIN', 'state', 'paint_color'], axis=1)

In [7]:
# Remove all nan 
cars_drop_clean = cars_drop.dropna()

In [8]:
# Set index to id column
cars_final = cars_drop_clean.set_index('id')
cars_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 83047 entries, 7316356412 to 7302301268
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         83047 non-null  int64  
 1   year          83047 non-null  float64
 2   manufacturer  83047 non-null  object 
 3   condition     83047 non-null  object 
 4   cylinders     83047 non-null  object 
 5   fuel          83047 non-null  object 
 6   odometer      83047 non-null  float64
 7   title_status  83047 non-null  object 
 8   transmission  83047 non-null  object 
 9   drive         83047 non-null  object 
 10  size          83047 non-null  object 
 11  type          83047 non-null  object 
dtypes: float64(2), int64(1), object(9)
memory usage: 8.2+ MB


In [9]:
# Create our training and test datasets
X = cars_final.drop(['price'], axis = 1)
y = cars_final['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Functions

In [23]:
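def print_scores(model):
    # For regressors, .score() returns R^2 (coefficient of determination),
    # not a classification accuracy; it is printed under the label below
    score = model.score(X_test, y_test)
    # mean_squared_error expects (y_true, y_pred)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))

    print(f'Accuracy score: {score}')
    print(f'MSE train: {mse_train}')
    print(f'MSE test: {mse_test}')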

In [34]:
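def get_permutation_importance(model):
    # Shuffle each input column of the test set 30 times and record the drop in the model's score
    r = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=0)

    # Map feature name -> array of per-repeat importances
    importances = {}

    # Walk the features from most to least important (by mean importance)
    for i in r.importances_mean.argsort()[::-1]:
        print(f"{X.columns[i]}:{r.importances_mean[i]}")
        importances[X.columns[i]] = r.importances[i]

    # One column per feature, one row per repeat, ready for a box plot
    data = pd.DataFrame.from_dict(importances)
    return data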

In [48]:
def plot_permutation_importance(df, model):
    fig = px.box(df)
    fig.update_layout(
        title=f"Permutation Importance based on Features in {model} Regression Model",
        xaxis_title="Features", 
        yaxis_title="Permutation Importance"
    )
    fig.show()

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

Since our target is continuous, regression is the appropriate family of models. I will try linear, ridge, and lasso regression to see which one fits our dataset best. There are many models and algorithms available, and no single model "fits all", so it is best to test a few models along with their hyperparameters.
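
One way to compare several candidate models on equal footing is to score each with the same cross-validation folds. The sketch below does this on synthetic data; `X_demo`, `y_demo`, the alpha values, and the `candidates` dictionary are assumptions for illustration and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: three informative columns, two pure noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
true_coefs = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=300)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

# Mean R^2 over 5 folds for each candidate
cv_r2 = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}

for name, r2 in cv_r2.items():
    print(f"{name}: mean cross-validated R^2 = {r2:.3f}")
```

Averaging over folds makes the comparison less sensitive to one lucky or unlucky split.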

### LinearRegression

LinearRegression model with:

- OneHotEncoding
- StandardScaling
- permutation_importance

In [13]:
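# One-hot encode the categorical columns; scale the remaining numeric columns (year, odometer)
categorical_cols = ['manufacturer', 'size', 'condition', 'cylinders', 'fuel',
                    'title_status', 'transmission', 'drive', 'type']
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), categorical_cols),
                                      remainder = StandardScaler())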

In [18]:
linreg_pipe = Pipeline([('transformer', transformer),
                        ('linreg', LinearRegression())])

In [27]:
linreg_pipe.fit(X_train, y_train)
print_scores(linreg_pipe)

Accuracy score: 2.659506775870568e-05
MSE train: 20609946096314.027
MSE test: 672550734601033.5


In [46]:
linear_data = get_permutation_importance(linreg_pipe)

type:0.00011120016637394074
cylinders:1.986496610876613e-05
transmission:4.239666621519422e-06
year:2.564554427967695e-07
title_status:9.839450014759165e-08
odometer:-4.5474045812972836e-07
size:-5.140614085926328e-07
fuel:-4.168040360356488e-06
manufacturer:-4.503382120650083e-06
condition:-1.4249542630858277e-05
drive:-2.4275561080248456e-05


In [49]:
plot_permutation_importance(linear_data, "Linear")

### Ridge Regression

Ridge model with:

- OneHotEncoding
- StandardScaling
- GridSearchCV
- permutation_importance

In [28]:
params = {'ridge__alpha': range(1, 100, 5)}

In [29]:
ridge_pipe = Pipeline([('transformer', transformer),
                        ('ridge', Ridge())])

In [41]:
ridge_grid = GridSearchCV(ridge_pipe, param_grid=params)
ridge_grid.fit(X_train, y_train)
print_scores(ridge_grid)

Accuracy score: 2.5008550405769547e-05
MSE train: 20609978491620.055
MSE test: 672551801642822.8


In [42]:
print(f"Best alpha parameter: {ridge_grid.best_params_}")

Best alpha parameter: {'ridge__alpha': 96}


In [44]:
ridge_data = get_permutation_importance(ridge_grid)

type:0.00010796083762141251
cylinders:1.9570829637941914e-05
transmission:4.2038129676476766e-06
year:2.467798444002097e-07
title_status:1.0192350885137221e-07
size:-3.119920242193134e-07
odometer:-4.4845724077591244e-07
fuel:-4.308824555417553e-06
manufacturer:-4.899077903689856e-06
condition:-1.413728710516112e-05
drive:-2.3436728347366506e-05


In [50]:
plot_permutation_importance(ridge_data, "Ridge")

### Lasso Regression

Lasso model with:

- OneHotEncoding
- StandardScaling
- GridSearchCV
- permutation_importance

In [64]:
params = {'lasso__alpha': range(1, 100, 5)}

In [65]:
lasso_pipe = Pipeline([('transformer', transformer),
                        ('lasso', Lasso(tol=1))])

In [66]:
lasso_grid = GridSearchCV(lasso_pipe, param_grid=params)
lasso_grid.fit(X_train, y_train)
print_scores(lasso_grid)

Accuracy score: 2.299898403812506e-05
MSE train: 20610414635872.75
MSE test: 672553153214104.6


In [67]:
print(f"Best alpha parameter: {lasso_grid.best_params_}")

Best alpha parameter: {'lasso__alpha': 96}


In [57]:
lasso_data = get_permutation_importance(lasso_grid)

type:9.332122777360545e-05
cylinders:2.2484832976791638e-05
size:6.0097723416160005e-06
transmission:3.001795152527779e-06
title_status:1.3708149597668752e-07
year:5.2949064836364336e-08
odometer:-4.581074879624521e-07
fuel:-4.577564319774296e-06
manufacturer:-8.598878732128205e-06
condition:-1.4688869879305363e-05
drive:-1.8527843837568005e-05


In [58]:
plot_permutation_importance(lasso_data, "Lasso")

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
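
When judging model quality, it helps to express error in price units and to check how sensitive the metric is to extreme values. The sketch below uses made-up prices and predictions (not values from this dataset) to show how a single implausible listing can dominate the MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up prices and predictions; the last listing has an implausible price
y_true = np.array([5000.0, 8000.0, 12000.0, 15000.0, 20000.0, 3_000_000_000.0])
y_pred = np.array([5500.0, 7500.0, 12500.0, 14000.0, 21000.0, 15000.0])

# MSE over all listings is dominated by the one extreme value
mse_all = mean_squared_error(y_true, y_pred)

# The same metrics with the extreme listing left out
mse_trimmed = mean_squared_error(y_true[:-1], y_pred[:-1])
rmse_trimmed = np.sqrt(mse_trimmed)   # back in price units
r2_trimmed = r2_score(y_true[:-1], y_pred[:-1])

print(f"MSE with the extreme listing:    {mse_all:.3e}")
print(f"MSE without the extreme listing: {mse_trimmed:.3e}")
print(f"RMSE without it (price units):   {rmse_trimmed:.0f}")
print(f"R^2 without it:                  {r2_trimmed:.3f}")
```

Taking the square root of the MSE gives an error in the same units as price, which is easier to sanity-check against what a used car plausibly costs.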

In [59]:
print("Linear Regression model scores:")
print_scores(linreg_pipe)

Linear Regression model scores:
Accuracy score: 2.659506775870568e-05
MSE train: 20609946096314.027
MSE test: 672550734601033.5


In [62]:
print("Ridge Regression model scores:")
print_scores(ridge_grid)
print(f"Best alpha parameter: {ridge_grid.best_params_}")

Ridge Regression model scores:
Accuracy score: 2.5008550405769547e-05
MSE train: 20609978491620.055
MSE test: 672551801642822.8
Best alpha parameter: {'ridge__alpha': 96}


In [68]:
print("Lasso Regression model scores:")
print_scores(lasso_grid)
print(f"Best alpha parameter: {lasso_grid.best_params_}")

Lasso Regression model scores:
Accuracy score: 2.299898403812506e-05
MSE train: 20610414635872.75
MSE test: 672553153214104.6
Best alpha parameter: {'lasso__alpha': 96}


### Findings and interpretation

- For validation, I split the dataset into a training set and a test set. In addition, the Ridge and Lasso models were tuned with GridSearchCV, which runs k-fold cross-validation (5 folds by default) on the training data. The plain linear model was evaluated on the single train/test split only, which is simpler than k-fold cross-validation, but I believe it should suffice here.

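- We looked at three models to see which would fit our dataset best. All three gave roughly the same scores. The score reported is R², and at about 0.00002 it is essentially zero for every model, so none of them explains a meaningful share of the variation in price. By a very small margin, LinearRegression had the highest score and the lowest test MSE. The MSE is also extremely high: a test MSE of about 6.7e14 corresponds to a root-mean-squared error of roughly 26 million dollars. Encoding the many categorical features adds complexity to the model, but errors of this size more plausibly point to extreme values in the `price` column that were not filtered during data preparation; this has not been verified here.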

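- Based on the permutation importances, we could perhaps exclude the last four features (fuel, manufacturer, condition, and drive) and refit the model. Note that every importance is on the order of 1e-4 or smaller, so the ranking itself is weak evidence.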

- For future analysis, we should try to obtain complete data for all features, so that the model has more rows to work with and less cleaning is required.
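
On the cross-validation point above: both grid searches selected an alpha of 96, the largest value in `range(1, 100, 5)`, which suggests the best value may lie outside the searched range. A minimal sketch of a wider, log-spaced search with explicit k-fold settings follows. The synthetic data, the `alpha_grid` bounds, and the fold settings are assumptions for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared features and target
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

# Log-spaced grid spanning several orders of magnitude (0.001 to 1000)
alpha_grid = {"ridge__alpha": np.logspace(-3, 3, 7)}

# Explicit 5-fold split with shuffling, so the folds are reproducible
folds = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid=alpha_grid, cv=folds,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
print(f"best alpha: {best_alpha}")
print(f"cross-validated MSE at best alpha: {-search.best_score_:.3f}")
```

If the selected alpha again sits at an edge of the grid, the grid can be extended in that direction and the search repeated.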

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

# Report

### What drives the price of a car?

From my analysis, these are the top three features that drive the price of a car:

- Vehicle type
- Engine size (number of cylinders)
- Transmission

Below is a plot demonstrating this.

In [69]:
plot_permutation_importance(ridge_data, "Ridge")

### Barplots on the top three features

In [92]:
fig = px.bar(cars_final, x="type", title="Number of Cars for each Type")
fig.update_traces(marker_color='blue',
                  marker_line_color='blue',
                  selector=dict(type="bar"))
fig.show()

In [93]:
fig = px.bar(cars_final, x="cylinders", title="Number of Cars with Engine Size")
fig.update_traces(marker_color='blue',
                  marker_line_color='blue',
                  selector=dict(type="bar"))
fig.show()

In [94]:
fig = px.bar(cars_final, x="transmission", title="Number of Cars with Transmission Type")
fig.update_traces(marker_color='blue',
                  marker_line_color='blue',
                  selector=dict(type="bar"))
fig.show()

### Final thoughts...

Based on the analysis and what the charts show, I believe the dealership should focus on these types of cars in its inventory:

- Vehicles that are a truck, SUV, or sedan
- Vehicles with a 4-, 6-, or 8-cylinder engine
- Vehicles with an automatic transmission

Of course, this analysis could go further by looking specifically at the features mentioned above and checking whether the manufacturer, or even the model, also plays a role.
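
A possible starting point for that follow-up is to summarize price by category rather than counting listings. The sketch below uses a small made-up frame named `listings`; the manufacturers and prices are invented for illustration.

```python
import pandas as pd

# Made-up listings standing in for the cleaned data
listings = pd.DataFrame({
    "manufacturer": ["ford", "ford", "toyota", "toyota", "toyota", "bmw"],
    "type": ["truck", "sedan", "sedan", "SUV", "truck", "sedan"],
    "price": [25000, 9000, 11000, 21000, 27000, 30000],
})

# Median price and number of listings per manufacturer, highest median first
summary = (
    listings.groupby("manufacturer")["price"]
    .agg(median_price="median", n_listings="count")
    .sort_values("median_price", ascending=False)
)
print(summary)
```

The median is used instead of the mean so that a few extreme prices do not distort the comparison, and the listing count shows how much data stands behind each figure.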