# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

#### Background

The used car market is competitive, and many factors can affect the price of a used car. Dealerships need a clear understanding of these factors so they can optimize their strategies, improve sales performance, and protect profit margins.

#### Objectives

Identify the key factors that influence the price of used cars. As a data task, this is a supervised regression problem: price is the target variable, and vehicle attributes such as manufacturer, model, year, odometer reading, and condition are the features. Knowing which features matter most lets dealerships set a better pricing strategy.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

#### Load the data and take a first look at its structure

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('./data/vehicles.csv')
# Display the first few rows to get a sense of the data structure
df.head()

In [None]:

df.info()
df.describe()
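
Beyond `info()` and `describe()`, a per-column quality summary can make missing values and suspicious prices easier to spot. The block below is a minimal sketch on a made-up toy table, not the actual listings; the helper name `quality_report` and the price bounds are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, share of missing values, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_share": frame.isna().mean(),
        "n_unique": frame.nunique(dropna=True),
    }).sort_values("missing_share", ascending=False)


# Toy stand-in for the listings table (values are made up for illustration).
toy = pd.DataFrame({
    "price": [0, 4500, 12000, 987654321, 8000],
    "odometer": [120000.0, np.nan, 45000.0, 60000.0, np.nan],
    "condition": ["good", None, "excellent", None, None],
})

report = quality_report(toy)
print(report)

# Count listings with implausible prices; the bounds are arbitrary examples.
implausible = ((toy["price"] <= 0) | (toy["price"] > 500_000)).sum()
print(f"Implausible prices: {implausible}")
```

Calling `quality_report(df)` on the real data would show which columns are too sparse to use and which ones need imputation.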

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
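
The cleaning, feature-engineering, and transformation steps mentioned above could look roughly like the following. This is a sketch on made-up rows; the price bounds and the `age` and `log_price` column names are assumptions, and whether a log transform helps should be checked against the real price distribution.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned listings (made-up values).
toy = pd.DataFrame({
    "price": [0, 4500, 12000, 987654321, 8000],
    "year": [2005, 2012, 2018, 2015, 2010],
    "odometer": [180000.0, 120000.0, 45000.0, 60000.0, 150000.0],
})

# Assumed plausibility bounds; tune them after inspecting the real distribution.
PRICE_MIN, PRICE_MAX = 500, 200_000
prepared = toy[toy["price"].between(PRICE_MIN, PRICE_MAX)].copy()

# Engineered feature: vehicle age relative to the newest model year that remains.
prepared["age"] = prepared["year"].max() - prepared["year"]

# Log-transform the right-skewed target; np.expm1 inverts it after prediction.
prepared["log_price"] = np.log1p(prepared["price"])

print(prepared)
```

Filtering before computing `age` keeps a single bad listing from defining the reference year.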

In [None]:
# Drop duplicates
df.drop_duplicates(inplace=True)
print(df.isnull().sum())

In [None]:
df.head()

In [None]:
# Fill missing odometer readings with the median; drop rows missing any key categorical field
df['odometer'] = df['odometer'].fillna(df['odometer'].median())
df = df.dropna(subset=['manufacturer','model', 'cylinders','fuel','state','title_status','transmission','condition'])

In [None]:
print(df.isnull().sum())

In [None]:
# Convert categorical variables to numeric using one-hot encoding
#df = pd.get_dummies(df, columns=['manufacturer','model', 'cylinders','fuel','size','state'], drop_first=True)

In [None]:
df.head()

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Visualize the distribution of car prices
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], bins=30, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()


In [None]:
# Scatter plot to visualize the relationship between mileage (odometer) and price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='odometer', y='price', data=df)
plt.title('Odometer vs Price')
plt.xlabel('Odometer')
plt.ylabel('Price')
plt.show()

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
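
One way to explore parameters with cross-validation is a grid search over a regularized linear model fitted on a log-transformed target. The sketch below uses synthetic data, and the choice of `Ridge`, the alpha grid, and the step names `prep` and `ridge` are assumptions for illustration rather than the approach this notebook settles on.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic listings (made up): price decays with age and depends on fuel type.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(0, 20, n),
    "odometer": rng.uniform(5_000, 200_000, n),
    "fuel": rng.choice(["gas", "diesel", "electric"], n),
})
premium = toy["fuel"].map({"gas": 1.0, "diesel": 1.1, "electric": 1.3})
toy["price"] = 30_000 * np.exp(-0.08 * toy["age"]) * premium * rng.lognormal(0, 0.1, n)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])

# Ridge is fitted on log1p(price); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("ridge", Ridge())]),
    func=np.log1p,
    inverse_func=np.expm1,
)

search = GridSearchCV(
    model,
    param_grid={"regressor__ridge__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(toy[["age", "odometer", "fuel"]], toy["price"])

print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.0f}")
```

Scoring with RMSE rather than MSE reports the error in the same units as price, which is easier to explain to a dealership.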

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# `df` is the cleaned DataFrame produced in the Data Preparation section above

# Prepare the data for modeling
X = df[['manufacturer', 'model', 'cylinders', 'fuel', 'state', 'title_status', 'year', 'odometer', 'transmission', 'condition']]
y = df['price']

# Column groups for preprocessing: impute and scale numeric features, impute and one-hot encode categorical ones
numeric_features = ['odometer', 'year']
categorical_features = ['manufacturer', 'model', 'cylinders', 'fuel', 'state', 'title_status', 'transmission', 'condition']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline for Linear Regression
model_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Fit the Linear Regression model
model_lr.fit(X_train, y_train)

# Make predictions with Linear Regression
y_pred_lr = model_lr.predict(X_test)

# Evaluate the Linear Regression model
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print(f'Linear Regression Mean Squared Error: {mse_lr:.2f}')
print(f'Linear Regression R^2 Score: {r2_lr:.2f}')

# Cross-validation for Linear Regression
cv_scores_lr = cross_val_score(model_lr, X, y, cv=5, scoring='neg_mean_squared_error')
print(f'Cross-Validated MSE (Linear Regression): {-cv_scores_lr.mean():.2f}')

# Create a pipeline for Random Forest
model_rf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])

# Fit the Random Forest model
model_rf.fit(X_train, y_train)

# Make predictions with Random Forest
y_pred_rf = model_rf.predict(X_test)

# Evaluate the Random Forest model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f'Random Forest Mean Squared Error: {mse_rf:.2f}')
print(f'Random Forest R^2 Score: {r2_rf:.2f}')

# Cross-validation for Random Forest
cv_scores_rf = cross_val_score(model_rf, X, y, cv=5, scoring='neg_mean_squared_error')
print(f'Cross-Validated MSE (Random Forest): {-cv_scores_rf.mean():.2f}')

# Hyperparameter tuning for Random Forest
param_grid = {
    'regressor__n_estimators': [100, 200],
    'regressor__max_depth': [None, 10, 20],
    'regressor__min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(model_rf, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print(f'Best parameters for Random Forest: {grid_search.best_params_}')

# Make predictions with the best Random Forest model
best_rf_model = grid_search.best_estimator_
y_pred_best_rf = best_rf_model.predict(X_test)

# Evaluate the best Random Forest model
mse_best_rf = mean_squared_error(y_test, y_pred_best_rf)
r2_best_rf = r2_score(y_test, y_pred_best_rf)

print(f'Best Random Forest Mean Squared Error: {mse_best_rf:.2f}')
print(f'Best Random Forest R^2 Score: {r2_best_rf:.2f}')

# Visualize the distribution of the target variable
sns.histplot(y, bins=30)
plt.title('Distribution of Price')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

# Examine residuals for the best Random Forest model
residuals = y_test - y_pred_best_rf
plt.scatter(y_pred_best_rf, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for Random Forest')
plt.show()

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [None]:
# Evaluate the tuned Random Forest from the Modeling section on the held-out test set
mse = mean_squared_error(y_test, y_pred_best_rf)
r2 = r2_score(y_test, y_pred_best_rf)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R^2 Score: {r2:.2f}')

# Visualize the predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_best_rf, alpha=0.7)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # Line for perfect prediction
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.title('True Values vs Predictions')
plt.grid()
plt.show()
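
Error metrics show how well a model predicts, but not which features drive price. Permutation importance is one way to rank features: shuffle one column at a time and measure how much the test score drops. The sketch below runs on synthetic data in which `age` is built to dominate and `state` is pure noise; all names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic listings (made up): price is driven mainly by age, `state` is noise.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "age": rng.integers(0, 20, n),
    "odometer": rng.uniform(5_000, 200_000, n),
    "state": rng.choice(["ca", "ny", "tx"], n),
})
toy["price"] = 30_000 - 1_200 * toy["age"] - 0.03 * toy["odometer"] + rng.normal(0, 500, n)

features = ["age", "odometer", "state"]
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[features], toy["price"], test_size=0.25, random_state=0
)

toy_model = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["state"])],
        remainder="passthrough",  # numeric columns pass through unchanged
    )),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
toy_model.fit(X_tr, y_tr)

# Shuffle each raw input column on the test set and record the mean drop in R^2.
result = permutation_importance(toy_model, X_te, y_te, n_repeats=5, random_state=0)
importance = pd.Series(result.importances_mean, index=features).sort_values(ascending=False)
print(importance)
```

Because the whole pipeline is passed in, importances are reported per original column rather than per one-hot dummy, which is the form a client can act on.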

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

#### Executive Summary

This report presents the findings from our analysis of used car pricing based on various features, including manufacturer, model, year, odometer reading, fuel type, and more. We developed and evaluated two regression models (Linear Regression and Random Forest) to predict the selling price of used cars. The objective is to provide insights that can help used car dealers fine-tune their inventory and pricing strategies.

#### Introduction

The used car market is competitive, and pricing vehicles accurately is crucial for maximizing profitability. Understanding the factors that influence car pricing can help dealers make informed decisions about their inventory. This report outlines the modeling process, results, and actionable insights derived from the analysis.

#### Model Selection

We evaluated two regression models:

- **Linear Regression:** A baseline model to understand the linear relationships between features and the target variable (price).
- **Random Forest Regressor:** A more complex model that captures non-linear relationships and interactions between features.

#### Recommendations for Used Car Dealers

- **Inventory Optimization:** Focus on acquiring vehicles from manufacturers and models that are predicted to retain higher resale values based on the model findings.
- **Pricing Strategy:** Use the pricing model to set competitive prices based on the vehicle's features, condition, and market trends.
- **Regular Updates:** Continuously update the model with new data to adapt to market changes and maintain accuracy in pricing predictions.
- **Feature Awareness:** Pay attention to features that significantly impact pricing, such as mileage and year, when evaluating potential inventory.