# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [1]:
# Business Understanding -

# From a business perspective, we are tasked with identifying key drivers for used car prices. 
# In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition. 
# Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

# Background - 

# We have several car dealers buying and selling used cars. The dealer typically looks at few key parameters 
# like make, model, millage, year, type etc.. and also uses 3rd party application in deciding an optimal price 
# for buying or selling used cars. This is done manually everytime the decision is made. The process is somewhat 
# successfuly but in many ways time consuming and requires skill, experience and man power. On top of this the 
# dealer may go wrong in his estimation due to the number of criterias and their combination plus the 3rd party 
# application is more of a black box and the dealer may not have any idea on how the estimation is done and how 
# accurate the estimations are.

# Business Objectives -

# The main objective of this task is to find out if there is an easier, more accurate, less error prone, 
# [1%-2%] manual effort, profitable way of determining the price of a car that could be either bought or sold. 
# The will be very useful for the car dealers in minimizing human resource and have them focus on other productive 
# areas. The other benefit includes better planning and strategy by the car dealer in improving future business. 
# He/She could pretty much target the right consumers by knowing their choices, finanical standards based on the 
# location and zero-in on those specific cars/trucks.

# Business Success Criteria -

# If the application can provide the car dealer with prices that improves the profitability, marketting, sales with 
# less manual effort and resources will form the basis of the success criteria.

# Access Sitiation -

# Inventory of Resources Requirements, Assumption and constraints - How good or how worthy the data collected so far 
# would be useful for predicting the price of a car given the set of input criteria. Here the assumtion is based on 
# the volume of data, their age, source and content of the data. The assumption as part of the pre-requisit would 
# be to glance over some of the data and make an assessment. The contraints here could be

# 1. Too much of data and too different? Difficult to make an assessment just by glacing over.
# 2. How much of the features are absent?
# 3. How much of these data are unwanted vs wanted?
# 4. Authenticity of the data ie are these data good or are they simply generated by a 3rd party?
# 5. Skill/Knowledge to process the data.

# Risk and Contingencies -

# 1. What if the predictions go horribly wrong? There needs to be some kind of verification following the 
# prediction(s).
# 2. What if we start getting new features? Requires time and effort in incorporating the new features and 
# their verifications/testing.
# 3. Time required to build, test and stabilize the application.

# Cost and benefits -

# 1. Data Cost
# 2. Cost of build the application
# 3. Verification Cost
# 4. Ongoing maintainance and feature enhancement cost
# 5. The biggest benefit is automation, time and gives a deeper understanding on the business towards making profit


# Determine the Data Mining Goals - 

# Identify the features that are required for predicting the target which in our case - price. How are we going to 
# determine the features to be selected. We could use PCA with SVD in this determination. 

# Produce Project Plan -

# Project Plan - here are the different stages and their high level estimation in terms of the time in hours 
# (irrespective of the resource)
#     a. Data Analysis includes the quality of data in each of the features, decide if they are required or not, what type of encoding to use in case of non-numeric feature, determine the ratio of Train and Test Data -> 2 hours
#     b. Data Cleanup -> 2 hours
#     c. Design and Implement the Regression -> 3-4 hours
#     d. Train the model - 1 hour
#     e. Test the model for evaluation - 1 hour
#     f. Review the model and predictions - 1 hour

# Initial Assessment of Tools and Techniques

# 1. The data is already clustered so will not do this activity
# 2. Using GridSearchCV, determine the best alpha value for Ridge Regression
# 3. Using GridSearchCV, determine the degree of the PolynomialFeature for Linear Regression
# 3. Find out the feature selection with Pipeline - [StandardScaler, OneHotEncoder, Ridge] for the below --
# 	a. No SFS with Ridge Regression
# 	b. SFS with LASSO using Ridge Regression
# 	c. RFE with LASSO using Ridge Regression
# 	d. SFS with LASSO using Linear Regression
# 	e. RFE with LASSO using Linear Regression



### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [2]:
# The dataset has a total of 426880 rows and 18 columns

# Numeric Fields - id, price, year, odometer
# Non Numeric Fields - region, manufacturer, model, condition, cylinders, fuel, title_status, transmission, 
#                      VIN, drive, size, type, paint_color, state

# Not Null Features - id, region, price, state are the only features that have not null data. The rest of the 
# remaining 14 features have null data.

# Describe Data

# region - US abrevated states
# price - price of the vehical as double type in USD
# year - Year of manufacture
# condition - specifies the current condition/health of the vehical
# transmission - specifies the type ie auto, manual, other
# type - type of the car ie pickup, sedan, truck etc..

In [3]:
import pandas as pd
vehical_df = pd.read_csv("data/vehicles.csv")

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [4]:
# Fields dropped includes --> id, VIN, drive, size, paint_color, title_status - Does not look relevant


# Fields that could be dropped includes --
# id, region, model, VIN, drive, size, paint_color, title_status - Does not look relevant.

# 1. Drop all null values. 
# 2. The dataset now has 152466 rows
# 4. Number of Unique condition - 6 as shown below
#    ['gas' 'diesel' 'other' 'hybrid' 'electric']. Replaced as shown below --
#     'good' - 1
#     'excellent' - 2
#     'fair' - 3
#     'like new' - 4
#     'new' - 5
#     'salvage' - 6

# 5. Number of Unique cylinders - 8 as shown below 
#    ['8 cylinders' '6 cylinders' '4 cylinders' '5 cylinders' '10 cylinders' '3 cylinders' 'other' '12 cylinders']
#    Remove value 'other'. Replace all other values with the first 2 string as int values.
#    ie for '12 cylinders' replaced with 12, '4 cylinders' replaced as 4

# 6. Number of Unique fuel - 5. Replaced as shown below --
#     'gas' - 1 
#     'diesel' - 2
#     'other' - 3
#     'hybrid' - 4
#     'electric' - 5

# 7. Number of Unqiue transmission - 3. Replaced as shown below --
#     'other' - 1 
#     'automatic' - 2
#     'manual' - 3


# For features list above from [3-8] drive and play an important role in predicting the price and 
# so we will use oneHotEncoder for transforming the aplha values for training/predictions.

# Split the data for train:test in the ratio 75:25


In [5]:
updated_vehical_df = pd.DataFrame(vehical_df, copy=True)
updated_vehical_df = updated_vehical_df.drop(columns=['id', 'model', 'region', 'VIN', 'drive', 'manufacturer',  
                                                      'size', 'paint_color', 'title_status', 'type'])
updated_vehical_df = updated_vehical_df.dropna()

cylinder_series = updated_vehical_df['cylinders']
temp_fuel = []
for x in cylinder_series:
    if x != 'other':
        temp_fuel.append(int(x[:2].strip()))
    else:
        temp_fuel.append(2)

print(updated_vehical_df['fuel'].unique())
updated_vehical_df['cylinders'] = temp_fuel

fuel = updated_vehical_df['fuel']
temp_fuel = []
for x in fuel:
    if x == 'gas':
        temp_fuel.append(1)
    elif x == 'diesel':
        temp_fuel.append(2)
    elif x == 'other':
        temp_fuel.append(3)
    elif x == 'hybrid':
        temp_fuel.append(4)
    else:
        temp_fuel.append(5)
updated_vehical_df['fuel'] = temp_fuel

condition = updated_vehical_df['condition']
temp_condition = []
for x in condition:
    if x == 'good':
        temp_condition.append(1)
    elif x == 'excellent':
        temp_condition.append(2)
    elif x == 'fair':
        temp_condition.append(3)
    elif x == 'like new':
        temp_condition.append(4)
    elif x == 'new':
        temp_condition.append(5)
    else:
        temp_condition.append(6)
updated_vehical_df['condition'] = temp_condition

#print(updated_vehical_df['state'].unique())
transmission = updated_vehical_df['transmission']
temp = []
for x in condition:
    if x == 'other':
        temp.append(1)
    elif x == 'automatic':
        temp.append(2)
    else:
        temp.append(3)
updated_vehical_df['transmission'] = temp

from sklearn.model_selection import train_test_split
y = updated_vehical_df['price']
X = updated_vehical_df.drop(columns=['price'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)
X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(X, y, test_size=.95, random_state=42)

X_train_temp


['gas' 'diesel' 'other' 'hybrid' 'electric']


Unnamed: 0,year,condition,cylinders,fuel,odometer,transmission,state
385986,2016.0,2,6,1,63707.0,3,ut
206823,2011.0,1,6,1,32652.0,3,mi
122,2005.0,2,6,2,180000.0,3,al
359456,2018.0,2,4,1,97441.0,3,tn
388137,2005.0,1,4,1,98000.0,3,vt
...,...,...,...,...,...,...,...
279141,2012.0,2,4,2,92000.0,3,ny
242590,2014.0,1,8,2,72772.0,3,nc
304097,2017.0,2,8,1,110618.0,3,ok
343668,2016.0,1,6,1,29318.0,3,ri


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [6]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# oneHotEncoder for transforming the aplha values

#non_numeric_features = ['manufacturer', 'condition', 'cylinders', 'fuel', 'transmission', 'type', 'state']
#non_numeric_features = ['manufacturer', 'condition', 'cylinders', 'fuel', 'transmission', 'state']
#non_numeric_features = ['condition', 'cylinders', 'fuel', 'transmission', 'state']
non_numeric_features = ['state']
ohe_transformer = make_column_transformer((OneHotEncoder(sparse_output = False, drop='if_binary'), non_numeric_features), 
                                          remainder='passthrough')

# Determine the different hyperparameter and their combination using GridSearchCV.

# Determine the alpha for Ridge

# param_dict = {"poly_features__degree":[1,2,3],
#              "ridge__alpha":[1.0, 10.0, 100.0]}


from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso

pipe = Pipeline([
    ('ohe_transformer', ohe_transformer),
    ('scale', StandardScaler()),
    ('ridge', Ridge())
])



param_dict = {'ridge__alpha':range(7650000000000000000, 7665000000000000000, 500000000000000)}

grid_search_cv = GridSearchCV(
    estimator = pipe,
    param_grid = param_dict
)

grid_search_cv.fit(X_train, y_train)

print(grid_search_cv.best_estimator_)
best_estimator = grid_search_cv.best_estimator_
best_alpha = grid_search_cv.best_params_['ridge__alpha']
print(best_alpha)
pipe

Pipeline(steps=[('ohe_transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(drop='if_binary',
                                                                sparse_output=False),
                                                  ['state'])])),
                ('scale', StandardScaler()),
                ('ridge', Ridge(alpha=7660000000000000000))])
7660000000000000000


In [7]:
# Determine the optimal n_features_to_select using the GridSearchCV

sfs_linear = SequentialFeatureSelector(
    estimator = LinearRegression()
)

linear_pipe = Pipeline([
    ('ohe_transformer', ohe_transformer),
    ('scale', StandardScaler()),
    ('selector', sfs_linear),
    ('linear', LinearRegression())
])

param_dict = {'selector__n_features_to_select':range(2,8,1)}

search = GridSearchCV(
    estimator = linear_pipe,
    param_grid = param_dict
)

search.fit(X_train_temp, y_train_temp)

best_feature_selector = search.best_estimator_
print(best_feature_selector)
selected_features = best_feature_selector['selector'].get_feature_names_out()
print(selected_features)
ohe_transformer_feature_names = best_feature_selector.named_steps['ohe_transformer'].get_feature_names_out()
selected_feature_names = []
for x in selected_features:
    print(x[-2:])
    selected_feature_names.append(ohe_transformer_feature_names[int(x[-2:])])
print(selected_feature_names)
n_features_to_select = len(selected_feature_names)
print(n_features_to_select)
linear_pipe

Pipeline(steps=[('ohe_transformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(drop='if_binary',
                                                                sparse_output=False),
                                                  ['state'])])),
                ('scale', StandardScaler()),
                ('selector',
                 SequentialFeatureSelector(estimator=LinearRegression(),
                                           n_features_to_select=2)),
                ('linear', LinearRegression())])
['x51' 'x55']
51
55
['remainder__year', 'remainder__odometer']
2


In [8]:
# 1. Ridge Regression and capture their coef_

ridge_pipe = Pipeline([
    ('ohe_transformer', ohe_transformer),
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=best_alpha))
])
ridge_pipe.fit(X_train, y_train)

ridge_model = ridge_pipe.named_steps['ridge'].coef_
ridge_coef_df = pd.DataFrame({
    'coef':ridge_model
}, index = ohe_transformer.get_feature_names_out())


ridge_train_mse = mean_squared_error(ridge_pipe.predict(X_train), y_train)
ridge_test_mse = mean_squared_error(ridge_pipe.predict(X_test), y_test)
print('Train MSE using Ridge ==> ' + str(ridge_train_mse))
print('Test MSE using Ridge ==> ' + str(ridge_test_mse))
ridge_pipe

Train MSE using Ridge ==> 103664753624392.53
Test MSE using Ridge ==> 337946635151112.0


In [9]:
# 2. Feature Selection using SFS + Lasso, regression using LinearRegression and find out the mse
from sklearn.linear_model import Lasso

sfs_lasso = SequentialFeatureSelector(
    n_features_to_select = 2,
    scoring = 'neg_mean_squared_error',
    estimator = Lasso(random_state = 42)
)

lasso_pipe = Pipeline([
    ('ohe_transformer', ohe_transformer),
    ('scale', StandardScaler()),
    ('selector', sfs_lasso),
    ('linear', LinearRegression())
])



lasso_pipe.fit(X_train, y_train)

sfs_lasso_train_mse = mean_squared_error(lasso_pipe.predict(X_train), y_train)
sfs_lasso_test_mse = mean_squared_error(lasso_pipe.predict(X_test), y_test)
print('Train MSE using SFS Lasso ==> ' + str(sfs_lasso_train_mse))
print('Test MSE using SFS Lasso ==> ' + str(sfs_lasso_test_mse))

Train MSE using SFS Lasso ==> 103664675591078.12
Test MSE using SFS Lasso ==> 337944185756067.9


In [10]:
# 2. Feature Selection using SFS + Lasso, regression using Lasso and find out the mse
from sklearn.linear_model import Lasso

sfs_lasso = SequentialFeatureSelector(
    n_features_to_select = 2,
    scoring = 'neg_mean_squared_error',
    estimator = Lasso(random_state = 42)
)

lasso_pipe = Pipeline([
    ('ohe_transformer', ohe_transformer),
    ('scale', StandardScaler()),
    ('selector', sfs_lasso),
    ('linear', Lasso(random_state=42))
])



lasso_pipe.fit(X_train, y_train)

sfs_lasso_lasso_train_mse = mean_squared_error(lasso_pipe.predict(X_train), y_train)
sfs_lasso_lasso_test_mse = mean_squared_error(lasso_pipe.predict(X_test), y_test)
print('Train MSE using SFS Lasso ==> ' + str(sfs_lasso_lasso_train_mse))
print('Test MSE using SFS Lasso ==> ' + str(sfs_lasso_lasso_test_mse))

Train MSE using SFS Lasso ==> 103664675591079.14
Test MSE using SFS Lasso ==> 337944186024533.5


In [11]:
# 3. Feature Selection using SFS + LinearRegression, regression using Lasso and find out the mse

sfs_linear = SequentialFeatureSelector(
    n_features_to_select = 2,
    scoring = 'neg_mean_squared_error',
    estimator = LinearRegression()
)

lasso_pipe = Pipeline([
    ('ohe_transformer', ohe_transformer),
    ('scale', StandardScaler()),
    ('selector', sfs_linear),
    ('linear', Lasso(random_state=42))
])

lasso_pipe.fit(X_train, y_train)

sfs_linear_lasso_train_mse = mean_squared_error(lasso_pipe.predict(X_train), y_train)
sfs_linear_lasso_test_mse = mean_squared_error(lasso_pipe.predict(X_test), y_test)
print('Train MSE using SFS Linear ==> ' + str(sfs_linear_lasso_train_mse))
print('Test MSE using SFS Linear ==> ' + str(sfs_linear_lasso_test_mse))

Train MSE using SFS Linear ==> 103664675591079.14
Test MSE using SFS Linear ==> 337944186024533.5


In [12]:
# 4. Feature Selection using SFS + LinearRegression, regression using LinearRegression and find out the mse

sfs_linear = SequentialFeatureSelector(
    n_features_to_select = 2,
    scoring = 'neg_mean_squared_error',
    estimator = LinearRegression()
)

linear_pipe = Pipeline([
    ('ohe_transformer', ohe_transformer),
    ('scale', StandardScaler()),
    ('selector', sfs_linear),
    ('linear', LinearRegression())
])

linear_pipe.fit(X_train, y_train)

sfs_linear_linear_train_mse = mean_squared_error(linear_pipe.predict(X_train), y_train)
sfs_linear_linear_test_mse = mean_squared_error(linear_pipe.predict(X_test), y_test)
print('Train MSE using SFS Linear ==> ' + str(sfs_linear_linear_train_mse))
print('Test MSE using SFS Linear ==> ' + str(sfs_linear_linear_test_mse))


Train MSE using SFS Linear ==> 103664675591078.12
Test MSE using SFS Linear ==> 337944185756067.9


In [13]:
# 5. Feature Selection using RFE + Lasso, regression using LinearRegression. and find out the mse
from sklearn.feature_selection import RFE

rfe_selector = RFE(
    n_features_to_select = 2,
    estimator = Lasso()
)

rfe_pipe = Pipeline([
    ('ohe_transformer', ohe_transformer),
    ('scale', StandardScaler()),
    ('selector', rfe_selector),
    ('linear', LinearRegression())
])


rfe_pipe.fit(X_train, y_train)

sfs_rfe_train_mse = mean_squared_error(rfe_pipe.predict(X_train), y_train)
sfs_rfe_test_mse = mean_squared_error(rfe_pipe.predict(X_test), y_test)
print('Train MSE using RFE Lasso ==> ' + str(sfs_rfe_train_mse))
print('Test MSE using RFE Lasso ==> ' + str(sfs_rfe_test_mse))

Train MSE using RFE Lasso ==> 103630688136142.42
Test MSE using RFE Lasso ==> 337986209345136.9


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [14]:
vehical_result_df = pd.DataFrame({
    'Ridge':[str(ridge_train_mse), str(ridge_test_mse)],
    'SFS Lasso + Linear': [str(sfs_lasso_train_mse), str(sfs_lasso_test_mse)],
    'SFS Lasso + Lasso': [str(sfs_lasso_lasso_train_mse), str(sfs_lasso_lasso_test_mse)],
    'SFS Linear + Lasso': [str(sfs_linear_lasso_train_mse), str(sfs_linear_lasso_test_mse)],
    'SFS Linear + Linear': [str(sfs_linear_linear_train_mse), str(sfs_linear_linear_test_mse)],
    'RFE Lasso': [str(sfs_rfe_train_mse), str(sfs_rfe_test_mse)]
}, index=['Train MSE', 'Test MSE'])

vehical_result_df



Unnamed: 0,Ridge,SFS Lasso + Linear,SFS Lasso + Lasso,SFS Linear + Lasso,SFS Linear + Linear,RFE Lasso
Train MSE,103664753624392.52,103664675591078.12,103664675591079.14,103664675591079.14,103664675591078.12,103630688136142.42
Test MSE,337946635151112.0,337944185756067.9,337944186024533.5,337944186024533.5,337944185756067.9,337986209345136.9


In [None]:
# The Test MSE appear of the above models appear more of less the same. 
# Among them the model using 
# 1. SFS using Lasso and Linear Regression and 
# 2. SFS using Linear and Linear Regression the least

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

In [None]:
# The objective of this task is to identify key drivers for used car prices. From the data that was given, 
# we ran a series of algorithms and trained the system to predict the price of a car given the input data. 
# Following the results of each of the algorithms, we analyised and concluded the below --

# 1. The system ignored fields id, VIN, drive, size, paint_color, title_status, manufacturer as they had a 
# lot of variations of data and were non-numeric.
# 2. The year and odometer when used as the input produced the most optimal price.
# 3. On top of this we could add on the secondary features such as condition, cylinder, fuel, transmission and 
# state if required to predict the price.
# 4. The dealer can use our predition and validate the price thereby providing the system a feedback on the 
# accuracy of the prediction. This will make our algorithm more efficient and productive to the dealers.