# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

### Student Answer
## Predicting the price of a car based on its features and attributes. The used car dealership wants to know which features in the historical data drive the price of a used car. We can frame this as a supervised regression problem: identify which data points in the dataset are most beneficial for predicting the price of a used car.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

## Some steps we need to take for data understanding are as follows:
### 1. Collect the data
### 2. Describe the data
### 3. Explore the data
### 4. Verify data quality
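
The describe and verify steps above can be sketched as one small helper. This is a minimal illustration, not part of the original analysis: `summarize_columns` is a hypothetical name, and the toy frame only stands in for the vehicles data.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, percent missing, and distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "pct_missing": frame.isnull().mean() * 100,
        "n_unique": frame.nunique(),
    }).sort_values("pct_missing", ascending=False)

# Toy stand-in for the vehicles data (values are made up for illustration).
toy = pd.DataFrame({
    "price": [4500, 12000, 8000, 15000],
    "condition": ["good", None, "excellent", None],
    "size": [None, None, None, "compact"],
})
summary = summarize_columns(toy)
print(summary)
```

Calling the same helper after each cleaning step gives a consistent view of how much missing data remains.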

## Collection and Loading of Data to Data Frame


In [None]:
import pandas as pd 

df= pd.read_csv('data/vehicles.csv')
df.info()

In [None]:
df.tail(500)

In [None]:
df.head(10)

# Data checks
### Missing values Check

In [None]:
null_counts = df.isnull().sum()
total_rows = len(df)
null_percentages = (null_counts / total_rows) * 100

# Print the result
print(null_percentages)


### Let us explore the Condition column a bit to understand what to do with the null Values 

In [None]:
df['condition'].value_counts(dropna=False)


### We will replace NaN with 'good', as this is the most common value in the dataset, and in the real world a used car can reasonably be considered to be in good condition.

In [None]:
df['condition'] = df['condition'].fillna('good')
df['condition'].isnull().sum()
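
The most-frequent-value fill used above can be compared with one common alternative, which keeps missing values visible as their own category. The sketch below uses a made-up toy column; the `'unknown'` label is an assumption for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"condition": ["good", "good", "excellent", None, None]})

# Option 1: fill with the most frequent value (the approach used above).
most_common = toy["condition"].mode().iloc[0]
filled_mode = toy["condition"].fillna(most_common)

# Option 2 (alternative): keep missingness visible as its own category.
filled_unknown = toy["condition"].fillna("unknown")

print(filled_mode.value_counts())
print(filled_unknown.value_counts())
```

Option 1 inflates the share of the most common value, while option 2 lets a model learn whether a missing entry itself carries information about price.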

### Analyzing the missing values in the cylinders column

In [None]:
df['cylinders'].value_counts(dropna=False)

In [None]:

#df['cylinders'].fillna('6 cylinders',inplace=True )
#df['cylinders'].value_counts(dropna=False)
df['cylinders'].isnull().sum()

### Let us fill the NaN value with most popular value of 6 cylinders. 

In [None]:
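# Assign the result back; an in-place fillna on a selected column is deprecated in recent pandas
df['cylinders'] = df['cylinders'].fillna('6 cylinders')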

In [None]:
df['cylinders'].value_counts(dropna=False)

## Dropping VIN and id, as they are unique identifiers.

In [None]:
df.drop(columns='VIN', inplace=True)

In [None]:
df['id'].duplicated().sum()

In [None]:
df.drop(columns='id', inplace=True)
df.info()

In [None]:
null_counts = df.isnull().sum()
total_rows = len(df)
null_percentages = (null_counts / total_rows) * 100

# Print the result
print(null_percentages)


In [None]:
df.head()

## Exploring the Drive column

In [None]:
df['drive'].value_counts(dropna=False)

## Replace NaN with 4-wheel drive, as that is the most common occurrence.

In [None]:
df['drive'] = df['drive'].fillna('4wd')
df['drive'].value_counts(dropna=False)

In [None]:
df['size'].value_counts(dropna=False)

## Drop this column as 70% are null

In [None]:
df.drop(columns='size', inplace=True)
df.info()
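
Dropping a column because most of its values are missing can be generalized into a threshold rule. This is a minimal sketch on toy data: `drop_sparse_columns` and the 0.5 cutoff are hypothetical choices for illustration, not values taken from the analysis above.

```python
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing."""
    missing_share = frame.isnull().mean()
    to_drop = missing_share[missing_share > max_missing].index.tolist()
    return frame.drop(columns=to_drop), to_drop

# Toy data: 'size' is 75% missing, 'drive' is 25% missing.
toy = pd.DataFrame({
    "size": [None, None, None, "compact"],
    "drive": ["4wd", None, "fwd", "fwd"],
    "price": [4500, 12000, 8000, 15000],
})
cleaned, dropped = drop_sparse_columns(toy, max_missing=0.5)
print(dropped)
```

Returning the list of dropped columns alongside the cleaned frame makes the decision easy to report to the client.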

In [None]:
null_counts = df.isnull().sum()
total_rows = len(df)
null_percentages = (null_counts / total_rows) * 100

# Print the result
print(null_percentages)


In [None]:

df['type'] = df['type'].fillna('sedan')
df['type'].value_counts(dropna=False)

In [None]:
df['paint_color'] = df['paint_color'].fillna('white')
df['paint_color'].value_counts(dropna=False)

In [None]:
null_counts = df.isnull().sum()
total_rows = len(df)
null_percentages = (null_counts / total_rows) * 100

# Print the result
print(null_percentages)


In [None]:
df = df.dropna()


In [None]:

df.info()

In [None]:
df.duplicated().sum()


### Remove Duplicates 

In [None]:
df = df.drop_duplicates()


In [None]:
df.duplicated().sum()

### There are no duplicates and no nulls in the data; it looks clean enough for the next step.

In [None]:
df.dtypes

### Let's check the summary statistics of each numeric column

In [None]:
df.describe()

## Let's analyze the data now that the initial cleaning is applied.

In [None]:
# Only the last expression in a cell is displayed, so print the earlier ones
print(df.head())
print(df.nunique())
df.dtypes


In [None]:
# Importing plotly to visualize the data
import plotly.express as px 


In [None]:
fig = px.histogram(df, x="region")
fig.show()

### There seem to be no outliers in the region column.

In [None]:
fig = px.box(df, x="price")
fig.show()

### There seem to be some outliers here; we need to explore more.

In [None]:
#df['price'].sort_values(ascending=False)
df = df.query('price < 55000')


In [None]:
fig = px.box(df, x="price")
fig.show()

### We have more meaningful data now that we have excluded the outliers in price.
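
The price cutoffs in this notebook are chosen by eye from the box plot. A data-driven alternative is Tukey's rule, which flags values beyond 1.5 times the interquartile range from the quartiles. The sketch below is an assumption-laden illustration on made-up prices: `iqr_bounds` is a hypothetical helper and was not used to produce the results in this notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (low, high) bounds at k * IQR below Q1 and above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices with one extreme listing.
prices = pd.Series([5000, 7000, 9000, 11000, 13000, 15000, 17000, 19000, 3_000_000])
low, high = iqr_bounds(prices)
kept = prices[prices.between(low, high)]
print(low, high, len(kept))
```

Because used car prices are right-skewed, the lower bound from this rule can be negative, so a separate minimum price is still needed to remove placeholder listings.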

In [None]:
fig = px.histogram(df, x="price")
fig.show()

In [None]:
#df['price'].sort_values(ascending=False)
df = df.query('price > 1000')
fig = px.histogram(df, x="price")
fig.show()


In [None]:
px.histogram(df,x='year')

### Let us explore categorical columns

In [None]:
#fig = px.histogram(df, x=['manufacturer'], marginal='rug', barmode='overlay')
#fig.show()

df['manufacturer'].value_counts().plot(kind='bar')


In [None]:
#fig = px.histogram(df, x=['model'], marginal='rug', barmode='overlay')
#fig.show()

df['model'].nunique()



In [None]:
df['model'].value_counts().head(250)

## Given the huge number of models, let's keep the 250 most frequent models and mark everything else as 'Other'.


In [None]:
top_250_models = df['model'].value_counts().index[:250]

# Keep the 250 most frequent models and group all remaining models into 'Other'
df['model'] = df['model'].where(df['model'].isin(top_250_models), 'Other')
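
The grouping step above can be written as a reusable function and checked on a small series. This is a minimal sketch: `keep_top_n` is a hypothetical name and the model names are made up.

```python
import pandas as pd

def keep_top_n(values: pd.Series, n: int, other_label: str = "Other") -> pd.Series:
    """Keep the n most frequent categories and relabel the rest."""
    top = values.value_counts().index[:n]
    return values.where(values.isin(top), other_label)

models = pd.Series(["f-150", "f-150", "f-150", "civic", "civic", "camry", "rare-trim"])
grouped = keep_top_n(models, n=2)
print(grouped.value_counts())
```

The same helper could be applied to other high-cardinality columns, such as region, before deciding whether to drop them.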


In [None]:
df['model'].value_counts()

In [None]:
fig = px.histogram(df, x=['condition'], marginal='rug', barmode='overlay')
fig.show()

#Categorical columns 

# condition        object
# cylinders        object
# fuel             object


In [None]:
df['cylinders'].value_counts()

In [None]:
df.info()

In [None]:
numerical_columns = df.select_dtypes(include=['number']).columns.tolist()

# Select categorical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns.tolist()

print("Numerical columns:", numerical_columns)
print("Categorical columns:", categorical_columns)

In [None]:
df['drive'].value_counts()

In [None]:
df['type'].value_counts()

In [None]:
df['paint_color'].value_counts()

In [None]:
df['transmission'].value_counts()

In [None]:
df['state'].value_counts()

In [None]:
df['transmission'].value_counts()

In [None]:
df['fuel'].value_counts().plot(kind='bar')

## Now that we have looked at all the categorical and numerical features and cleaned them as necessary let us explore the relationship of these features to price which we want to predict. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
for feature in numerical_columns:
    plt.subplot(3, 5, numerical_columns.index(feature) + 1)
    sns.histplot(data=df[feature], bins=50, kde=True)
    plt.title(feature)
plt.tight_layout()
plt.show()

## There seem to be some odometer outliers; we need to take care of them.

In [None]:
df['odometer'].sort_values(ascending=False).plot(kind='box')

# Count (not percentage) of cars with an odometer reading above 400,000
count_over_400000 = len(df[df['odometer'] > 400000])
print(count_over_400000)

### The 655 cars counted above have extreme odometer readings that distort our distribution, so we will remove every car with a reading over 250000.

In [None]:
df =  df[df['odometer'] <= 250000]

In [None]:
plt.figure(figsize=(12, 8))
for feature in numerical_columns:
    plt.subplot(3, 5, numerical_columns.index(feature) + 1)
    sns.histplot(data=df[feature], bins=50, kde=True)
    plt.title(feature)
plt.tight_layout()
plt.show()

In [None]:
df.info()

Let's explore the relation of categorical features to price

In [None]:
categorical_columns

In [None]:
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(20, 9))
axes = axes.ravel()  # Flatten the 2D array of axes

# Loop through each categorical column
for i, column in enumerate(categorical_columns):
    sns.countplot(x=df[column], data=df, palette='bright', ax=axes[i], saturation=0.95)
    for container in axes[i].containers:
        axes[i].bar_label(container, color='black', size=10)
    axes[i].set_title(f'Count Plot of {column.capitalize()}')
    axes[i].set_xlabel(column.capitalize())
    axes[i].set_ylabel('Count')

# Adjust layout and show plots
plt.tight_layout()
plt.show()

### We need to take a closer look at region, manufacturer, and state.

In [None]:
df['region'].value_counts()

### I am going to drop the region column, as I think it is not very well correlated with price.

In [None]:
df = df.drop('region', axis=1)
df.head()

In [None]:
df['model'].value_counts()

In [None]:
categorical_columns.remove('region')

In [None]:
categorical_columns

In [None]:
df.info()

In [None]:
# Categorical Feature vs. Price
plt.figure(figsize=(40, 30))
for feature in categorical_columns:
#    print (categorical_columns.index(feature)+1)
    plt.subplot(3, 4, categorical_columns.index(feature)+1)
    sns.boxplot(data=df, x=feature, y='price')
    plt.title(f'{feature} vs. Price')
plt.tight_layout()
plt.show()

In [None]:
# Numerical Feature vs. Price
plt.figure(figsize=(30, 20))
for feature in numerical_columns:
    plt.subplot(3, 3, numerical_columns.index(feature) + 1)
    sns.scatterplot(data=df, x=feature, y='price')
    plt.title(f'{feature} vs. Price')
plt.tight_layout()
plt.show()

In [None]:
# Correlation Analysis
correlation_matrix = df[numerical_columns].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling. Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.
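
One of the transformations mentioned above, a logarithm of the target, can be sketched with `TransformedTargetRegressor`. This is an illustration on synthetic data (prices that decay with age are an assumption), not a step the notebook itself performs.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(0, 20, size=200).reshape(-1, 1)
# Synthetic prices that decay exponentially with age, with mild noise.
price = 30000 * np.exp(-0.12 * age.ravel()) * rng.lognormal(0, 0.05, size=200)

# Fit on log1p(price); predictions are mapped back to dollars with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
log_model.fit(age, price)
plain_model = LinearRegression().fit(age, price)

print("R^2 with log target:", log_model.score(age, price))
print("R^2 with raw target:", plain_model.score(age, price))
```

A log target also guarantees positive price predictions, which a plain linear model does not. Whether it helps on the vehicles data would need to be checked with cross-validation.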

## Importing libraries for transformation and modelling 

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV



## Splitting the data into test and train sets. 

In [None]:
X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.3,random_state = 42)

X_test.info() 



### Modeling

With your (almost?) final dataset in hand, it is now time to build some models. Here, you should build a number of different regression models with the price as the target. In building your models, you should explore different parameters and be sure to cross-validate your findings.

## Now that we have split the data, let us write a preprocessor that will be used in the pipeline.

In [None]:

categorical_features = ['manufacturer', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'type', 'paint_color']
numerical_features = ['year', 'odometer']
print ( categorical_features, numerical_features)

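# handle_unknown='ignore' encodes categories missing from a training fold as all zeros
# instead of raising an error during cross-validation or on the test set
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_features),
    (StandardScaler(), numerical_features)
)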

In [None]:
linear_pipe = Pipeline([('preprocessor', preprocessor),('model',LinearRegression())])

In [None]:
param_grid = {'model__fit_intercept': [True, False]} 

In [None]:
grid_search = GridSearchCV(linear_pipe, param_grid = param_grid, cv = 5).fit(X_train, y_train)

In [None]:
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best parameters:", best_params)
print("Best cross-validated R-squared score:", best_score)
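
`best_score_` is the mean score of the best parameter setting over the held-out cross-validation folds, not a score on the data the model was fit to. A minimal sketch on synthetic data (`make_regression` is used here only for illustration) reproduces it with `cross_val_score`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(LinearRegression(), {"fit_intercept": [True, False]}, cv=5)
search.fit(X_demo, y_demo)

# Recompute the winning configuration's score by hand with the same 5 folds.
manual_scores = cross_val_score(
    LinearRegression(**search.best_params_), X_demo, y_demo, cv=5
)
print(search.best_score_, manual_scores.mean())
```

Because this score already comes from held-out folds, it is the number to compare across models; the test set is then used once for the final check.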

## Let's explore the test score

In [None]:
best_model =  grid_search.best_estimator_
test_score  = best_model.score(X_test,y_test)
print("Test score:", test_score)
type(best_model)

# intercept = best_model.intercept_
# coefficients = best_model.coef_  # Returns an array of coefficients


model_step = best_model.named_steps['model']  # Assuming 'model' is your model's name in the pipeline 
coefficients = model_step.coef_
intercept = model_step.intercept_

print(coefficients , "  ", intercept)

In [None]:
preprocessor = best_model.named_steps['preprocessor']
transformed_feature_names = preprocessor.get_feature_names_out()
all_feature_names = ['intercept'] + transformed_feature_names.tolist()
coefficients_df = pd.DataFrame({'Feature': all_feature_names, 'Coefficient': [intercept] + coefficients.tolist()})

In [None]:
coefficients_df.shape

In [None]:
from sklearn.inspection import permutation_importance
encoded_matrix = preprocessor.transform(X_train)
# Permutation Importance (Calculates importance with encoded features)
result = permutation_importance(model_step, encoded_matrix.toarray() , y_train, n_repeats=10)
feature_importances = result.importances_mean

# Get the encoded feature names produced by the preprocessor
feature_names = preprocessor.get_feature_names_out()

# Print Results
for name, importance in zip(feature_names, feature_importances):
    print(f"Feature: {name}, Importance: {importance:.4f}")

### Let us plot this data 

In [None]:
# Create DataFrame (named importance_df so the vehicle data in df is not overwritten)
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(20, 80))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Permutation Importance')
plt.gca().invert_yaxis()  # Invert y-axis to display most important on top
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

## From the above we can see that odometer and year are the main contributors to the price, followed by diesel fuel type and 8-cylinder vehicles. This gives us a good idea of what affects the price.
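
The importances above are reported per one-hot-encoded column. Passing the whole pipeline and the raw columns to `permutation_importance` instead yields one value per original feature, which is easier to communicate. The sketch below uses synthetic data in which price depends on odometer and fuel but not on paint color; all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "odometer": rng.uniform(0, 200_000, n),
    "fuel": rng.choice(["gas", "diesel"], n),
    "paint_color": rng.choice(["white", "black", "red"], n),
})
# Synthetic price: driven by odometer and fuel, unrelated to paint_color.
price = (30_000 - 0.1 * toy["odometer"]
         + 3_000 * (toy["fuel"] == "diesel") + rng.normal(0, 500, n))

pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["fuel", "paint_color"]),
        (StandardScaler(), ["odometer"]),
    ),
    LinearRegression(),
)
pipe.fit(toy, price)

# Shuffle each raw column in turn and measure the drop in R^2.
result = permutation_importance(pipe, toy, price, n_repeats=5, random_state=0)
importance = pd.Series(result.importances_mean, index=toy.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Computing the importances on held-out data rather than the training data would give a better picture of which features matter for unseen cars.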

## Since the manufacturer and model columns contribute little, we will use sequential feature selection to limit the number of features.

In [None]:
# Import the sequential feature selector
from sklearn.feature_selection import SequentialFeatureSelector

param_grid = {'selector__n_features_to_select': [10]}

linear_pipe = Pipeline([('preprocessor', preprocessor),
                        ('selector', SequentialFeatureSelector(LinearRegression()) ), 
                        ('model', LinearRegression())])
                                                        
grid_search = GridSearchCV(linear_pipe, param_grid=param_grid, cv=5).fit(X_train, y_train)
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best parameters:", best_params)

print("Best cross-validated R-squared score:", best_score)

In [None]:
# Retrieving the best model
best_model =  grid_search.best_estimator_
model_step = best_model.named_steps['model']
selector = best_model.named_steps['selector']
coefficients = model_step.coef_
intercept = model_step.intercept_

preprocessor = best_model.named_steps['preprocessor']
selector_step = best_model.named_steps['selector']
selected_features_mask = selector_step.get_support()

# Examining the coefficients
feature_names = preprocessor.get_feature_names_out()
selected_feature_names = [name for name, selected in zip(feature_names, selected_features_mask) if selected]
print(selected_feature_names)
print("Coefficients:", coefficients)


## Let us now move on to ridge regression

In [None]:
import numpy as np
from sklearn.linear_model import  Ridge
from sklearn.metrics import mean_squared_error

ridge_param_dict = {'ridge__alpha':np.logspace(0, 10, 50)}

ridge_pipe = Pipeline([('preprocessor', preprocessor),
                                     ('ridge', Ridge())])

ridge_grid = GridSearchCV(estimator = ridge_pipe,
                          param_grid = ridge_param_dict,
                          scoring = "neg_mean_squared_error")

ridge_grid.fit(X_train,y_train)

train_preds = ridge_grid.best_estimator_.predict(X_train)
test_preds = ridge_grid.best_estimator_.predict(X_test)

ridge_train_mse = mean_squared_error(y_train, train_preds)
ridge_test_mse = mean_squared_error(y_test, test_preds)


print(f'Train MSE: {ridge_train_mse}')
print(f'Test MSE: {ridge_test_mse}')
ridge_pipe
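
MSE is expressed in squared dollars, which is hard to interpret. Taking the square root gives the RMSE in the same units as price. A minimal sketch with made-up prices and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true prices and predictions, in dollars.
y_true = np.array([10_000, 15_000, 20_000])
y_pred = np.array([11_000, 14_000, 23_000])

mse = mean_squared_error(y_true, y_pred)  # squared dollars
rmse = np.sqrt(mse)                       # dollars
print(f"MSE: {mse:.0f}  RMSE: {rmse:.0f}")
```

The RMSE is a typical prediction error in dollars, which is easier to present to a dealership than the raw MSE.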


## We will add Lasso Regression to the mix

In [None]:
from sklearn.linear_model import Lasso

param_grid = {'model__alpha':[0.0001, 0.001, 0.01, 0.1, 1.0, 10]} 

lasso_pipe = Pipeline([('preprocessor', preprocessor),
                       ('model', Lasso())])

lasso_grid = GridSearchCV(lasso_pipe, param_grid = param_grid,cv=5, scoring='neg_mean_squared_error')

lasso_grid.fit(X_train,y_train)



### It seems like the Lasso regressor is unable to converge to a solution.
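
Lasso is fit by coordinate descent, which stops after `max_iter` passes and emits a `ConvergenceWarning` if it has not converged by then. Very small `alpha` values often need more iterations. The sketch below, on synthetic data, shows one way to raise `max_iter` and check for the warning; whether this resolves the issue on the vehicles data has not been verified here.

```python
import warnings

import numpy as np
from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# Scale the features and allow more coordinate-descent passes than the default.
lasso_pipe = make_pipeline(StandardScaler(), Lasso(alpha=1.0, max_iter=10_000))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lasso_pipe.fit(X_demo, y_demo)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
lasso = lasso_pipe.named_steps["lasso"]
print("converged:", converged, "iterations used:", lasso.n_iter_)
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

If the warning persists on the real data, dropping the smallest `alpha` values from the grid or loosening `tol` are other options to try.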

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

## After running a few algorithms, it is clear that odometer reading and model year are big drivers of used car prices. Because of the high-dimensional feature space, some of our regression algorithms are unable to converge or complete the grid search.

In [None]:
# Let's add polynomial features to the numeric data to see whether the LinearRegression score, the best so far, improves.
from sklearn.preprocessing import PolynomialFeatures
categorical_features = ['manufacturer', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive',
                        'type', 'paint_color']
numerical_features = ['year', 'odometer']
print(categorical_features, numerical_features)

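# Expand the numeric features to degree-3 polynomial terms, then scale them
# (scaling after expansion keeps terms such as year**3 on a comparable scale)
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_features),
    (Pipeline([('poly', PolynomialFeatures(degree=3, include_bias=False)),
               ('scale', StandardScaler())]), numerical_features)
)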
linear_pipe = Pipeline([('preprocessor', preprocessor), ('model', LinearRegression())])
param_grid = {'model__fit_intercept': [True, False]}
grid_search = GridSearchCV(linear_pipe, param_grid=param_grid, cv=5).fit(X_train, y_train)
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best parameters:", best_params)
print("Best cross-validated R-squared score:", best_score)


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

### We have analyzed the data, and after modeling we see that LinearRegression with most of the features gives the best result in our model comparison.
The takeaway for used car dealers is that the most significant factors affecting the price of a car are model year and odometer reading: the older the model year, the lower the price, and as the odometer reading increases, the price also decreases. Other factors, such as the number of cylinders and front-wheel drive, also affect the price of used cars.