# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [590]:
# Data task: treat price as the target variable of a supervised regression problem and identify which vehicle attributes (features) affect it most.

In [592]:
# The results should support recommendations to the client, a used car dealership: which kinds of cars hold more value, which are cheaper, and why each might suit

In [594]:
# a specific customer. The work starts with data cleanup, imputation of missing values, and encoding of categorical features as numbers.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
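One way to start on the quality checks described above is a per-column summary of data types, missing values, and cardinality. The sketch below is illustrative only: `summarize_quality` is a hypothetical helper, and the small `toy` frame stands in for the real `vehicles.csv`.

```python
import numpy as np
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column dtype, share of missing values, and number of unique values."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    # Columns with the most missing data come first
    return summary.sort_values("missing_frac", ascending=False)

# Tiny hypothetical frame standing in for the real dataset
toy = pd.DataFrame({
    "price": [0, 5900, 13950, 26485],
    "year": [2013.0, np.nan, 2017.0, 1900.0],
    "size": [None, None, None, "compact"],
})
quality = summarize_quality(toy)
print(quality)
```

Columns near the top of this summary are candidates for dropping or imputation, and columns with very high `n_unique` may need special handling before encoding.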

In [75]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import TargetEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error


In [27]:
vehicles = pd.read_csv("vehicles.csv")  # read_csv already returns a DataFrame
vehicles.describe()


| | id | price | year | odometer |
|---|---|---|---|---|
| count | 426880.0 | 426880.0 | 425675.0 | 422480.0 |
| mean | 7311487000.0 | 75199.03 | 2011.235191 | 98043.33 |
| std | 4473170.0 | 12182280.0 | 9.45212 | 213881.5 |
| min | 7207408000.0 | 0.0 | 1900.0 | 0.0 |
| 25% | 7308143000.0 | 5900.0 | 2008.0 | 37704.0 |
| 50% | 7312621000.0 | 13950.0 | 2013.0 | 85548.0 |
| 75% | 7315254000.0 | 26485.75 | 2017.0 | 133542.5 |
| max | 7317101000.0 | 3736929000.0 | 2022.0 | 10000000.0 |


In [28]:
vehicles.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [35]:
# model column has 28266 unique names, so we will drop it for now...
vehicles.model.unique()

array([nan, 'sierra 1500 crew cab slt', 'silverado 1500', ...,
       'gand wagoneer', '96 Suburban', 'Paige Glenbrook Touring'],
      dtype=object)

In [39]:
vehicles.tail()

| | region | price | year | manufacturer | condition | cylinders | fuel | odometer | title_status | transmission | drive | type | paint_color | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 426875 | wyoming | 23590 | 2019 | nissan | good | 6 cylinders | gas | 32226 | clean | other | 4wd | sedan | | wy |
| 426876 | wyoming | 30590 | 2020 | volvo | good | | gas | 12029 | clean | other | 4wd | sedan | red | wy |
| 426877 | wyoming | 34990 | 2020 | cadillac | good | | diesel | 4174 | clean | other | | hatchback | white | wy |
| 426878 | wyoming | 28990 | 2018 | lexus | good | 6 cylinders | gas | 30112 | clean | other | 4wd | sedan | silver | wy |
| 426879 | wyoming | 30590 | 2019 | bmw | good | | gas | 22716 | clean | other | rwd | coupe | | wy |


In [41]:
vehicles.manufacturer.unique()

array([nan, 'gmc', 'chevrolet', 'toyota', 'ford', 'jeep', 'nissan', 'ram',
       'mazda', 'cadillac', 'honda', 'dodge', 'lexus', 'jaguar', 'buick',
       'chrysler', 'volvo', 'audi', 'infiniti', 'lincoln', 'alfa-romeo',
       'subaru', 'acura', 'hyundai', 'mercedes-benz', 'bmw', 'mitsubishi',
       'volkswagen', 'porsche', 'kia', 'rover', 'ferrari', 'mini',
       'pontiac', 'fiat', 'tesla', 'saturn', 'mercury', 'harley-davidson',
       'datsun', 'aston-martin', 'land rover', 'morgan'], dtype=object)

In [55]:
vehicles.cylinders.unique()

array([ 6,  8,  4,  3, 10], dtype=int64)

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
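The `describe()` output earlier shows prices ranging from 0 to roughly 3.7 billion, so outlier handling and a logarithmic transformation of the target are worth considering. The sketch below shows one possible approach; the helper name `prepare_price` and the thresholds `low=500` and `high=150_000` are assumptions chosen for illustration, not values derived from the data.

```python
import numpy as np
import pandas as pd

def prepare_price(df: pd.DataFrame, low: float = 500, high: float = 150_000) -> pd.DataFrame:
    """Keep rows with a plausible price and add a log-transformed target."""
    mask = df["price"].between(low, high)   # inclusive on both ends
    out = df.loc[mask].copy()
    out["log_price"] = np.log1p(out["price"])  # log(1 + price), invertible with np.expm1
    return out

# Hypothetical prices, including a zero and an implausibly large listing
toy = pd.DataFrame({"price": [0, 5900, 13950, 26485, 3_000_000_000]})
prepared = prepare_price(toy)
print(prepared)
```

A model trained on `log_price` predicts on the log scale, so predictions need `np.expm1` before they are reported in dollars.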

In [None]:
# Drop columns that do not drive price (id, VIN). Also drop 'model' for now because it has 28266 unique names,
# and 'size' because more than 50% of its values are null and imputing them would skew the model.
vehicles_n = vehicles.drop(columns=['id', 'VIN', 'model', 'size'])
# Drop rows where the target variable (price) is zero
vehicles_n = vehicles_n[vehicles_n['price'] != 0]
# Convert columns to the best available pandas dtypes (e.g. object to string)
vehicles_n = vehicles_n.convert_dtypes()
# Merge 'fwd' into '4wd' so that the drive column has two categories (4wd, rwd).
# Note: 'fwd' is front-wheel drive, not four-wheel drive, so this merge is a simplification rather than an equivalence.
vehicles_n['drive'] = vehicles_n['drive'].replace('fwd', '4wd')

In [None]:
# Fill the null values in categorical columns with the mode value of the column. 
categorical_cols = ['state', 'paint_color', 'type', 'drive','region', 'condition', 'transmission', 'title_status', 'fuel', 'cylinders', 'manufacturer']

for col in categorical_cols:
    vehicles_n[col] = vehicles_n[col].fillna(vehicles_n[col].mode()[0])

In [None]:
# Fill the null values in numerical columns with the mode value of the column, then round to whole numbers.
numerical_cols = ['price', 'odometer', 'year']

for col in numerical_cols:
    vehicles_n[col] = vehicles_n[col].fillna(vehicles_n[col].mode()[0]).round()

In [None]:
# Convert the cylinders column to integers
cylinder_mapping = {
    '3 cylinders': 3,
    '4 cylinders': 4,
    '6 cylinders': 6,
    '8 cylinders': 8,
    '10 cylinders': 10,
}

# Any value not in the mapping falls back to 6, the value used as the mode in this analysis
vehicles_n['cylinders'] = vehicles_n['cylinders'].map(cylinder_mapping).fillna(6).astype('int64')

In [None]:
# Convert the drive column to numbers
drive_mapping = {'4wd': 4, 'rwd': 2}

# Numeric code of the most common drive type, used for any value not in the mapping
default_drive = drive_mapping.get(vehicles_n['drive'].mode()[0], 4)

def convert_drive(drive):
    return drive_mapping.get(drive, default_drive)

vehicles_n['drive'] = vehicles_n['drive'].apply(convert_drive)

In [None]:
# Encode Condition Column, since it is ordinal data
condition_mapping = {'new': 5, 'like new': 4, 'excellent': 3, 'good': 2, 'fair': 1, 'salvage': 0}

# Apply the mapping to the 'condition' column
vehicles_n['condition'] = vehicles_n['condition'].map(condition_mapping)

In [None]:
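# Define the target (y); cast to float so it is treated as a continuous regression target
y = vehicles_n['price'].astype(float)

# Identify categorical columns to be encoded
categorical_encode_cols = ['state', 'paint_color', 'type', 'region', 'transmission', 'title_status', 'fuel', 'manufacturer']

# Create a TargetEncoder; target_type='continuous' keeps integer-valued prices from being inferred as class labels
encoder = TargetEncoder(target_type='continuous')

# Fit and transform the encoder one column at a time.
# Note: this uses the full dataset before any train/test split, so the encoded features
# carry target information from rows that are later used for testing.
for col in categorical_encode_cols:
    vehicles_n[col] = encoder.fit_transform(vehicles_n[[col]], y).ravel()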


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
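As a minimal sketch of what comparing several regression models under cross-validation can look like, the example below scores three linear models with 5-fold cross-validation. The synthetic data, the `alpha` values, and the choice of RMSE as the metric are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: only the first two features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each training fold
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X, y, scoring="neg_root_mean_squared_error", cv=cv)
    cv_rmse[name] = -scores.mean()  # flip the sign back to a positive RMSE

print(cv_rmse)
```

Reusing the same `KFold` object for every candidate keeps the folds identical across models, so the scores are comparable.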

In [85]:
# Use Sequential Feature Selection to pick the features that most affect the price. The five most important features are selected here.
# Scale the data so that every column, including the target, has zero mean and unit variance
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(vehicles_n),columns=scaler.get_feature_names_out())

# Define features (X) and target (y)
X = scaled_data.drop('price', axis=1)
y = scaled_data['price']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the linear regression model
model = LinearRegression()

# Initialize the Sequential Feature Selector
sfs = SequentialFeatureSelector(model, n_features_to_select=5, direction='forward', cv=5)  # Adjust n_features_to_select

# Fit the SFS to the training data
sfs.fit(X_train, y_train)

# Get the selected features
selected_features = list(X_train.columns[sfs.get_support()])
print(f"Selected Features: {selected_features}")

# Train the model on selected features
model.fit(X_train[selected_features], y_train)

Selected Features: ['year', 'condition', 'fuel', 'drive', 'paint_color']


In [184]:
# Run 5-fold cross-validation (cv=5)
# The pipeline generates degree-1 polynomial features (i.e. the original columns), scales them, and fits a Lasso model.
# The grid search tries 12 values of alpha, logarithmically spaced between 1e-5 and 1e5.
training_cols = ['state', 'paint_color', 'type', 'drive','region', 'condition', 'transmission', 'odometer', 'title_status', 'fuel', 'cylinders', 'manufacturer', 'year']
scaled_lasso_model = Pipeline([
    ('saeed_transform', PolynomialFeatures(degree = 1, include_bias = False)),
    ('saeed_scale', StandardScaler()),
    ('saeed_regression', Lasso())
])
parameters_to_try = {'saeed_regression__alpha': 10**np.linspace(-5,5,12)}
model_finder = GridSearchCV(estimator = scaled_lasso_model,
                            param_grid = parameters_to_try,
                            scoring = "neg_mean_squared_error",
                            cv=5)
model_finder.fit(vehicles_n[training_cols], vehicles_n['price'])
best_model = model_finder.best_estimator_



In [185]:
best_model

In [186]:
best_model.named_steps["saeed_regression"].coef_

array([   -0.        ,    -0.        ,     0.        ,     0.        ,
           0.        ,    -0.        ,    -0.        , 36956.95882917,
          -0.        ,    -0.        ,     0.        ,     0.        ,
          -0.        ])
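The coefficient array is easier to read when paired with feature names. With `PolynomialFeatures(degree=1, include_bias=False)` the transformed features are the input columns in their original order, so the array can be labeled with `training_cols`. The sketch below hard-codes the coefficients printed above rather than refitting the model; the names `coef_table` and `nonzero` are illustrative.

```python
import numpy as np
import pandas as pd

training_cols = ['state', 'paint_color', 'type', 'drive', 'region', 'condition',
                 'transmission', 'odometer', 'title_status', 'fuel', 'cylinders',
                 'manufacturer', 'year']

# Coefficients copied from the output above (signed zeros written as 0.0)
coefs = np.array([0.0] * 7 + [36956.95882917] + [0.0] * 5)

coef_table = pd.Series(coefs, index=training_cols)
nonzero = coef_table[coef_table != 0]  # features the Lasso penalty did not zero out
print(nonzero)
```

Only `odometer` keeps a non-zero coefficient, meaning the selected `alpha` was large enough to shrink every other feature to zero.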

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [190]:
# Test the SFS model by calculating the mean squared error on the held-out test set (in standardized units, since the target was scaled)
y_predict_sfs = model.predict(X_test[selected_features])
sfs_error = mean_squared_error(y_test, y_predict_sfs)
print(sfs_error)

1.8243305808260315


In [193]:
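# Evaluate the Lasso model (the best estimator from the grid search) by calculating the mean squared error.
# This is computed on the same data used for fitting and in original price units, so it is not directly comparable to the scaled SFS error above.

In [195]:
y_predict_lasso = best_model.predict(vehicles_n[training_cols])
lasso_error = mean_squared_error(vehicles_n['price'], y_predict_lasso)
print(lasso_error)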

160789386471749.66
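A mean squared error this large is hard to interpret because it is in squared dollars. Taking the square root gives an RMSE in the same units as price. The short sketch below does that conversion for the value printed above; the variable names are illustrative.

```python
import numpy as np

mse = 160789386471749.66   # value printed above, in squared dollars
rmse = np.sqrt(mse)        # back in dollars
print(f"RMSE: {rmse:,.0f}")
```

The result is on the order of 12.7 million dollars, close to the standard deviation of `price` reported by `describe()` earlier (about 12.2 million). That suggests the error is dominated by a few extreme listings and that the model may do little better than predicting the mean price, so revisiting outlier handling in the data preparation phase seems warranted.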


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

In [None]:
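# The following features are recommended to the dealership as the main drivers of price.
# The first five were chosen by sequential feature selection; odometer is the only feature retained by the Lasso model.
# 'year'
# 'condition'
# 'fuel'
# 'drive'
# 'paint_color'
# 'odometer'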
