# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

The goal of this project is to identify the features (attributes) of a used car that determine its price. The task involves fitting several regression models of price (y) on the features (X) and examining the model coefficients. Root mean squared error (RMSE) will be used to home in on the best model, which is then fit on the entire dataset. The regression results will also indicate how important each feature is in determining the price of a used car.
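As a quick illustration of the metric named above, the sketch below computes RMSE by hand and with `sklearn` on three made-up prices. The numbers are hypothetical and are not taken from the vehicles dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices (illustration only)
y_true = np.array([10000.0, 15000.0, 20000.0])
y_pred = np.array([11000.0, 14000.0, 23000.0])

# RMSE is the square root of the mean of the squared errors
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via sklearn, mirroring how the notebook computes it
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print("manual RMSE  =", np.round(rmse_manual, 2))
print("sklearn RMSE =", np.round(rmse_sklearn, 2))
```

Because RMSE is expressed in the same units as the target (dollars here), it is easy to explain to a dealership audience.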

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

To get to know the dataset, we compute descriptive statistics, separate the categorical features from the numeric ones, and look for missing data. Features with a large share of missing values can be dropped, and rows with nulls in the remaining fields can be removed before proceeding to the analysis stage.
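One way to make the missing-data check concrete is a small per-column summary. The helper below (`missing_summary`, a hypothetical name) and the toy frame are assumptions for illustration; on the real data it would be called on the loaded vehicles DataFrame.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    return summary.sort_values("pct_missing", ascending=False)

# Toy frame standing in for the vehicles data (hypothetical values)
toy = pd.DataFrame({
    "price": [5000, 12000, 0, 30000],
    "size": [None, None, None, "mid-size"],
    "condition": ["good", None, "fair", "excellent"],
})

report = missing_summary(toy)
print(report)
```

Columns at the top of this report are candidates for dropping entirely, while columns with a small share of nulls are better handled by removing the affected rows.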

In [None]:
# import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, FunctionTransformer, PolynomialFeatures, OneHotEncoder

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance

In [None]:
cars_df = pd.read_csv('data/vehicles.csv')
# keep an independent copy of the raw data (plain assignment would only create an alias)
cars_df_original = cars_df.copy()

In [None]:
cars_df.shape

In [None]:
cars_df.head()

In [None]:
cars_df.tail()

In [None]:
cars_df.info()

In [None]:
cars_df.describe()

In [None]:
# null values
cars_df.isnull().sum()

In [None]:
# region, price and state have no missing data, plotting them
cars_df_to_plot = cars_df.loc[cars_df['price'] > 0]
px.box(x=np.log10(cars_df_to_plot['price']),y=cars_df_to_plot['state']).update_layout(
    title="BoxPlot of log10(price) by state", xaxis_title="log10(price)", yaxis_title="state"
)

In [None]:
plt.figure(figsize=(10,4))
cars_df['state'].value_counts().plot(kind='bar')
plt.xlabel('states')
plt.ylabel('Number of records')
plt.title('Number of used-car records by state')
plt.xticks(rotation=90)
plt.show()

In [None]:
# Scatterplot of odometer versus price
cars_df_price_odo = cars_df[(cars_df['price'] <= 1000000) & (cars_df['odometer'] <= 1000000)]

plt.figure(figsize=(10,4))
sns.scatterplot(data=cars_df_price_odo,x='odometer',y='price')
plt.title('ScatterPlot of odometer vs price')
plt.show()

In [None]:
# Plot of average price of car by manufacturer
cars_df_manufacturer_price = cars_df[(cars_df['manufacturer'].notna()) & (cars_df['price'] > 0) & (cars_df['price'] <= 1000000)]

plt.figure(figsize=(10,4))
cars_df_manufacturer_price.groupby('manufacturer')['price'].mean().sort_values().plot(kind='bar')
plt.title('Average Price of used car by Manufacturer')
plt.xlabel('Car Manufacturer')
plt.ylabel('Price ($)')
plt.show()

In [None]:
# Price of used car by condition
cars_df_to_plot = cars_df.loc[(cars_df['price'] > 0) & (cars_df['price'] <= 1000000) & (cars_df['condition'].notna())]
px.box(x=np.log10(cars_df_to_plot['price']),y=cars_df_to_plot['condition']).update_layout(
    title="BoxPlot of log10(price) by condition", xaxis_title="log10(price)", yaxis_title="condition"
)

In [None]:
# Odometer vs Transmission
cars_df_odo_transmission = cars_df[(cars_df['transmission'].notna()) & (cars_df['odometer'] > 0) & (cars_df['odometer'] <= 1000000)]
px.box(x=np.log10(cars_df_odo_transmission['odometer']),y=cars_df_odo_transmission['transmission']).update_layout(
    title="BoxPlot of log10(odometer) by transmission", xaxis_title="log10(odometer)", yaxis_title="transmission"
)

In [None]:
# Price vs Fuel
cars_df_to_plot = cars_df.loc[(cars_df['price'] > 0) & (cars_df['price'] <= 1000000) & (cars_df['fuel'].notna())]
px.box(x=cars_df_to_plot['fuel'],y=np.log10(cars_df_to_plot['price']),color_discrete_sequence=['gray']).update_layout(
    title="BoxPlot of log10(price) by fuel", xaxis_title="fuel", yaxis_title="log10(price)"
)

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
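The exploration above plots `log10(price)`, while the models below are fit on the raw price. As a sketch of how a logarithmic target transformation could be wired into `sklearn`, the example below uses `TransformedTargetRegressor` on synthetic data. This is an alternative that is not used in this notebook, and the variables `age` and `price` are made up.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: price decays roughly exponentially with age (hypothetical)
rng = np.random.default_rng(0)
age = rng.uniform(0, 20, size=200).reshape(-1, 1)
price = 30000 * np.exp(-0.12 * age.ravel()) * rng.lognormal(0, 0.05, size=200)

# Fit the regressor on log1p(price); predictions are mapped back with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(age, price)
preds = log_model.predict(age)

print("R^2 on the original price scale:", np.round(log_model.score(age, price), 4))
```

A side effect of this design is that predictions come back in dollars and stay positive, which a plain linear model on raw prices does not guarantee.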

In [None]:
# drop the id, VIN, model and size columns
cars_df = cars_df.drop(['id','VIN','model','size'],axis=1)

In [None]:
cars_df.isnull().sum()

In [None]:
# drop rows where cylinders, condition, drive, paint_color, type, manufacturer, odometer, transmission are null
cars_df = cars_df.dropna(subset=['cylinders', 'condition', 'drive', 'paint_color',
                                 'type', 'manufacturer', 'odometer', 'transmission'])

In [None]:
cars_df.isnull().sum()

In [None]:
# remove price and odometer records that are zero and also those larger than 1 million (to remove outliers)
cars_df = cars_df[(cars_df['price'] > 0) & (cars_df['price'] <= 1000000)]
cars_df = cars_df[(cars_df['odometer'] > 0) & (cars_df['odometer'] <= 1000000)]

In [None]:
cars_df.shape

In [None]:
# Percentage of records retained for analysis
total_records_in_original_dataset = cars_df_original.shape[0]
print('Percent of records for the analysis =',np.round((cars_df.shape[0]/total_records_in_original_dataset)*100,2),'%')

In [None]:
# corr heatmap - between price, year and odometer
corr = cars_df[['price','year','odometer']].corr()
plt.figure(figsize=(6, 4))
sns.heatmap(corr,annot=True,cmap='coolwarm')
plt.title('Correlation between price, year and odometer')
plt.show()

In [None]:
# populate the X and y values
X = cars_df.drop('price', axis = 1)
y = cars_df['price']

In [None]:
# train_test_split into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.head()

In [None]:
# preprocessor - dummy variables for categorical features and scaling numerical features
num_features = ['year','odometer']
cat_features = ['region','manufacturer','condition','cylinders','fuel',
                'title_status','transmission','drive','type','paint_color','state']

preprocessor = ColumnTransformer(
    transformers=[
        ('num_features', make_pipeline(StandardScaler(), PolynomialFeatures(degree=3, include_bias=False)), num_features),
        ('cat_features', make_pipeline(OneHotEncoder(handle_unknown='ignore')), cat_features)
    ])

In [None]:
preprocessor

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
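As a minimal sketch of cross-validating a pipeline on RMSE, the example below runs 5-fold `cross_val_score` on a small synthetic frame. The data, the column subset and the names `toy_X`, `toy_y` and `toy_pipe` are assumptions for illustration; the same call should work with the notebook's own pipelines and training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the car data (hypothetical values)
rng = np.random.default_rng(42)
n = 300
toy_X = pd.DataFrame({
    "year": rng.integers(2000, 2022, size=n),
    "odometer": rng.uniform(1e4, 2e5, size=n),
    "fuel": rng.choice(["gas", "diesel"], size=n),
})
toy_y = (
    15000
    + 800 * (toy_X["year"] - 2000)
    - 0.05 * toy_X["odometer"]
    + 3000 * (toy_X["fuel"] == "diesel")
    + rng.normal(0, 500, size=n)
)

# Scale numeric columns, one-hot encode the categorical one, then fit OLS
toy_pipe = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["year", "odometer"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
    ])),
    ("linreg", LinearRegression()),
])

# sklearn maximizes scores, so RMSE is exposed as a negative value
scores = cross_val_score(toy_pipe, toy_X, toy_y, cv=5,
                         scoring="neg_root_mean_squared_error")
cv_rmse = -scores

print("RMSE per fold:", np.round(cv_rmse, 2))
print("Mean CV RMSE :", np.round(cv_rmse.mean(), 2))
```

Because the preprocessing lives inside the pipeline, the scaler and encoder are refit on each training fold, so no information from the validation fold leaks into the fit.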

In [None]:
# (a) Linear Regression

In [None]:
# set up the pipeline for modeling with LinearRegression
pipe_lin_reg = Pipeline([
    ('preprocessor', preprocessor), 
    ('linreg', LinearRegression())
])

In [None]:
pipe_lin_reg.fit(X_train, y_train)

In [None]:
train_preds = pipe_lin_reg.predict(X_train)
test_preds = pipe_lin_reg.predict(X_test)
rmse_train = np.sqrt(mean_squared_error(y_train, train_preds))
rmse_test = np.sqrt(mean_squared_error(y_test, test_preds))
print('RMSE train =',np.round(rmse_train,2),'and RMSE test =',np.round(rmse_test,2))

In [None]:
# Coefficients of the model and feature names
coef_df = pd.DataFrame(
    {'coeff': pipe_lin_reg.named_steps['linreg'].coef_},
    index=pipe_lin_reg.named_steps['preprocessor'].get_feature_names_out(),
).sort_values(by='coeff')
with pd.option_context('display.max_rows', None):
    print(coef_df)


In [None]:
pipe_lin_reg.named_steps['linreg'].intercept_

In [None]:
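# r-squared and adjusted r-squared
# (p is the number of fitted coefficients after encoding, not the number of raw columns)
r2 = r2_score(y_test, test_preds)
n_obs, n_coef = len(y_test), len(pipe_lin_reg.named_steps['linreg'].coef_)
adj_r2 = 1 - (1-r2)*(n_obs-1)/(n_obs-n_coef-1)
print("r_squared =",np.round(r2,4),'and adjusted r-squared =',np.round(adj_r2,4))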

In [None]:
# permutation importance of the features
results = permutation_importance(pipe_lin_reg, X_test, y_test)
importances = pd.DataFrame(data=results.importances_mean, index=X_test.columns, columns=['Importance']).sort_values(by='Importance', ascending=False)
print(importances)

In [None]:
# The results from the Linear Regression model show that year of the car and the odometer are the most important features, 
# followed by number of cylinders, fuel, type and manufacturer, in determining the price of a used car. The transmission and 
# paint color affect the price of the used car the least.

In [None]:
# (b) Ridge Regression (with 10-fold cross-validation) ; Note: takes more than a minute to fit the model

In [None]:
# set up the pipeline for modeling with Ridge regression
pipe_ridge_reg = Pipeline([
    ('preprocessor', preprocessor), 
    ('ridge', Ridge())
])

In [None]:
param_ridge_alpha_dict = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0], 'ridge__fit_intercept': [True,False]}

In [None]:
grid_ridge = GridSearchCV(pipe_ridge_reg, param_grid=param_ridge_alpha_dict, cv=10)
grid_ridge.fit(X_train, y_train)

In [None]:
# look parameters up by name rather than relying on dict ordering
print(f"Best Alpha chosen: {grid_ridge.best_params_['ridge__alpha']}")
print(f"Fit_Intercept chosen: {grid_ridge.best_params_['ridge__fit_intercept']}")

In [None]:
train_preds = grid_ridge.best_estimator_.predict(X_train)
test_preds = grid_ridge.best_estimator_.predict(X_test)
rmse_train = np.sqrt(mean_squared_error(y_train, train_preds))
rmse_test = np.sqrt(mean_squared_error(y_test, test_preds))
print('RMSE train =',np.round(rmse_train,2),'and RMSE test =',np.round(rmse_test,2))

In [None]:
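# r-squared and adjusted r-squared
# (p is the number of fitted coefficients after encoding, not the number of raw columns)
r2 = r2_score(y_test, test_preds)
n_obs, n_coef = len(y_test), len(grid_ridge.best_estimator_.named_steps['ridge'].coef_)
adj_r2 = 1 - (1-r2)*(n_obs-1)/(n_obs-n_coef-1)
print("r_squared =",np.round(r2,4),'and adjusted r-squared =',np.round(adj_r2,4))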

In [None]:
# (c) Sequential Feature Selection ; Note: takes more than 3 minutes to fit the model

In [None]:
# set up the pipeline for modeling using LinearRegression with Sequential Feature Selection
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5)

pipe_lin_reg_seq = Pipeline([
    ('preprocessor', preprocessor),
    ('column_selector', selector),    
    ('linreg', LinearRegression())
])

In [None]:
pipe_lin_reg_seq.fit(X_train, y_train)

In [None]:
train_preds = pipe_lin_reg_seq.predict(X_train)
test_preds = pipe_lin_reg_seq.predict(X_test)
rmse_train = np.sqrt(mean_squared_error(y_train, train_preds))
rmse_test = np.sqrt(mean_squared_error(y_test, test_preds))
print('RMSE train =',np.round(rmse_train,2),'and RMSE test =',np.round(rmse_test,2))

In [None]:
pipe_lin_reg_seq.named_steps['linreg'].coef_

In [None]:
pipe_lin_reg_seq.named_steps['column_selector'].get_feature_names_out()

In [None]:
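# map the selector's boolean mask back to the preprocessor's feature names (avoids hard-coded positions)
feature_names = pipe_lin_reg_seq.named_steps['preprocessor'].get_feature_names_out()
selected_mask = pipe_lin_reg_seq.named_steps['column_selector'].get_support()
pd.Series(feature_names)[selected_mask]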

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
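When comparing models on adjusted r-squared, the penalty depends on how many predictors the model actually uses. The helper below (`adjusted_r2`, a hypothetical name) is a small sketch of the standard formula; the r-squared value and feature counts are made-up numbers chosen to show how the penalty grows once one-hot encoding expands a handful of raw columns into hundreds of model features.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Same r-squared and sample size, different counts of predictors (hypothetical)
few = adjusted_r2(0.50, n_samples=1000, n_features=13)
many = adjusted_r2(0.50, n_samples=1000, n_features=500)

print("13 predictors :", round(few, 4))
print("500 predictors:", round(many, 4))
```

With many samples and few predictors the adjustment is negligible, so it matters mainly when the encoded feature count is a sizeable fraction of the number of rows.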

In [None]:
# Therefore, the best model (based on lowest RMSE and highest adjusted r-squared) is the Linear Regression model; refit it on the entire dataset (X, y)

In [None]:
best_model = pipe_lin_reg
best_model.fit(X,y)
y_preds = best_model.predict(X)
rmse = np.sqrt(mean_squared_error(y, y_preds))
print('RMSE using entire dataset =',np.round(rmse,2))

In [None]:
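# r-squared and adjusted r-squared
# (p is the number of fitted coefficients after encoding, not the number of raw columns)
r2 = r2_score(y, y_preds)
n_obs, n_coef = len(y), len(best_model.named_steps['linreg'].coef_)
adj_r2 = 1 - (1-r2)*(n_obs-1)/(n_obs-n_coef-1)
print("r_squared =",np.round(r2,4),'and adjusted r-squared =',np.round(adj_r2,4))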

In [None]:
# Coefficients of the model and feature names (based on sign and magnitude of the coeff, we can infer its effect on price)
coef_df = pd.DataFrame(
    {'coeff': best_model.named_steps['linreg'].coef_},
    index=best_model.named_steps['preprocessor'].get_feature_names_out(),
).sort_values(by='coeff')
with pd.option_context('display.max_rows', None):
    print(coef_df)


In [None]:
# Permutation importance of the features
results = permutation_importance(best_model, X, y)
importances = pd.DataFrame(data=results.importances_mean, index=X.columns, columns=['Importance']).sort_values(by='Importance', ascending=False)
print(importances)

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

A dataset of about 426,000 used cars, with their attributes and prices, was analyzed to determine how important each attribute is in setting a car's sale price.

The regression analysis reveals that the most important feature in determining the price of a used car is its year (the closer the year is to the present, the higher the price). The second is the miles driven, as given by the odometer reading (a higher reading implies a lower price), and the third is the number of engine cylinders (more cylinders imply a higher price). The year is more than twice as important as the odometer reading in affecting the price.

Cars from manufacturers such as Ferrari, Tesla, Aston Martin and Porsche are priced much higher, which is not a surprise since they are luxury brands. Fiat, Dodge, Nissan, Kia and Chrysler are at the other end of the spectrum, with the lowest prices. The condition of the car also affects the price: cars in excellent, new, like-new or good condition command a higher price than those in fair or salvage condition. A clean title status also raises the price a used car commands, whereas a missing title lowers it. In terms of region, Southwest VA, Texarkana TX, Susanville CA, Panama City FL and Heartland Florida command higher prices for used cars than other regions in the country. The states of NC, UT, NE, NV and WA have higher used car prices than other states.

In terms of fuel, diesel cars are priced higher than gas, electric or hybrid cars. Paint color is the least important feature, followed by transmission type; both have a negligible impact on the price of the car.
