# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
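
One possible first step is a per-column summary of data types, missing values, and distinct values. The sketch below is illustrative only: `summarize` is a hypothetical helper, and `toy` is a made-up frame standing in for the real data.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, share of missing values, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Made-up values, for illustration only
toy = pd.DataFrame({
    "price": [5000, 0, 12000, 12000],
    "condition": ["good", None, "fair", "good"],
    "odometer": [120000.0, np.nan, 80000.0, 80000.0],
})

summary = summarize(toy)
# Columns with the most missing data come first
print(summary.sort_values("missing_frac", ascending=False))
```

Sorting by `missing_frac` makes it easier to decide which columns to impute, which to drop, and which (such as a price of 0) deserve a closer look.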

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling. Here, we want to handle any integrity issues and cleaning, engineer new features, apply any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and generally prepare the data for modeling with `sklearn`.
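
As a rough sketch of what such preparation might look like in `sklearn`, the example below imputes and scales numeric columns, one-hot encodes a categorical column, and log-transforms the target. The column choices, the imputation strategies, and the toy values are all assumptions made for illustration, not a prescribed solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "price": [4500, 12000, 18500, 7300, 25000, 9900],
    "year": [2005, 2014, 2018, np.nan, 2020, 2011],
    "odometer": [180000, 90000, 40000, 150000, np.nan, 120000],
    "fuel": ["gas", "gas", "diesel", "gas", np.nan, "hybrid"],
})

numeric_cols = ["year", "odometer"]
categorical_cols = ["fuel"]

# Numeric columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X = preprocess.fit_transform(toy[numeric_cols + categorical_cols])

# Log-transform the target to reduce right skew; np.expm1 inverts it
y = np.log1p(toy["price"])

print(X.shape)
```

Wrapping the steps in a `ColumnTransformer` means the same imputation and scaling parameters learned on training data can later be applied unchanged to held-out data.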

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
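
A minimal sketch of exploring a parameter with cross-validation is shown below, using ridge regression and `GridSearchCV`. The synthetic data, the choice of `Ridge`, and the `alpha` grid are assumptions for illustration; the real analysis would use the prepared vehicle features and could compare several model families.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared data: log-price falls with age and mileage
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 20, n)
odometer = age * 12000 + rng.normal(0, 10000, n)
X = np.column_stack([age, odometer])
log_price = 10.5 - 0.08 * age - 2e-6 * odometer + rng.normal(0, 0.1, n)

# Scaling lives inside the pipeline so it is re-fit on each training fold
model = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, log_price)

print("best alpha:", search.best_params_["ridge__alpha"])
print("cross-validated RMSE (log scale):", -search.best_score_)
```

The score is negated because `sklearn` scorers follow a "higher is better" convention, so RMSE is reported as a negative number.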

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from it. We should review our business objective and explore how well we can provide meaningful insight into the drivers of used car prices. Your goal now is to distill your findings and determine whether the earlier phases need to be revisited and adjusted, or whether you have information of value to bring back to your client.
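
One way to turn a fitted model into statements about price drivers is permutation importance on held-out data. The sketch below uses synthetic data and a plain `LinearRegression`; the feature names, including the deliberately irrelevant `noise` column, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: age matters most, odometer somewhat, noise not at all
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "age": rng.uniform(0, 20, n),
    "odometer": rng.uniform(0, 250000, n),
    "noise": rng.normal(0, 1, n),
})
y = 10.5 - 0.08 * X["age"] - 2e-6 * X["odometer"] + rng.normal(0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in the held-out score
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = pd.Series(
    result.importances_mean, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Features whose shuffling barely changes the score contribute little to the predictions, which can help separate real drivers from incidental columns when reporting back to the client.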

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client. You should organize your work as a basic report that details your primary findings. Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.


### Data Understanding

In this section, we will explore the dataset to gain insights into its structure and identify any necessary preprocessing steps.
We will look at the columns, check for missing values, and perform some basic exploratory data analysis (EDA).

The dataset contains the following columns:
- id: Unique identifier for each vehicle
- region: The region where the vehicle is located
- price: Price of the vehicle
- year: Year the vehicle was manufactured
- manufacturer: Car manufacturer
- model: Car model
- condition: Condition of the vehicle
- cylinders: Number of engine cylinders
- fuel: Fuel type used by the vehicle
- odometer: Mileage on the vehicle's odometer
- title_status: Vehicle's title status
- transmission: Type of transmission (e.g., automatic, manual)
- VIN: Vehicle Identification Number
- drive: Drive type (e.g., FWD, RWD, 4WD)
- size: Size of the vehicle
- type: Type of the vehicle (e.g., sedan, SUV)
- paint_color: Color of the vehicle
- state: The state where the vehicle is located

We will first look at the missing values and basic statistics for the columns.


In [None]:

# Imports for the exploratory analysis
import matplotlib.pyplot as plt
import seaborn as sns

# NOTE: assumes `vehicles_df` was loaded into a pandas DataFrame in an earlier cell

# Number of missing values in each column
# (printed explicitly, since a notebook cell only displays its last expression)
print(vehicles_df.isnull().sum())

# Summary statistics for the numerical columns
print(vehicles_df.describe())

# Distribution of car prices
plt.figure(figsize=(10, 6))
sns.histplot(vehicles_df['price'], bins=50, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
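
Listing data can contain placeholder or extreme prices that compress a histogram into a single bar. If that turns out to be the case here, one option is to inspect the quantiles and restrict the view to a plausible range. The sketch below uses made-up prices, and the `lower` and `upper` bounds are hypothetical cutoffs that would need to be chosen from the actual data.

```python
import numpy as np
import pandas as pd

# Made-up prices, including placeholder lows and one absurd high
prices = pd.Series(
    [0, 1, 3500, 7200, 9800, 12500, 15000, 21000, 34000, 123456789],
    name="price",
)

# Quantiles show where the bulk of the data sits and how far the tails reach
print(prices.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99]))

# Hypothetical bounds for a "plausible" used car price
lower, upper = 500, 150000
plausible = prices[prices.between(lower, upper)]

# A log scale spreads out the remaining right-skewed values
log_prices = np.log10(plausible)
print(log_prices.round(2).tolist())
```

Plotting `log_prices`, or the filtered `plausible` values, in place of the raw column usually gives a more readable picture of the typical price range.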



### Data Preparation

We will clean the dataset by handling missing values and removing unnecessary columns. 
Key steps include:
- Dropping rows with missing target variable (`price`)
- Handling missing values in other key columns (e.g., `year`, `manufacturer`, `odometer`)
- Encoding categorical variables such as `manufacturer`, `fuel`, `transmission`, and `drive`



In [None]:

import pandas as pd

# Drop rows where the target (`price`) is missing
vehicles_df_cleaned = vehicles_df.dropna(subset=['price'])

# Drop identifier columns that won't be used for modeling (id, VIN)
vehicles_df_cleaned = vehicles_df_cleaned.drop(columns=['id', 'VIN'])

# Fill missing values in `year` and `odometer` with each column's median.
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply to the frame.
for col in ['year', 'odometer']:
    vehicles_df_cleaned[col] = vehicles_df_cleaned[col].fillna(vehicles_df_cleaned[col].median())

# One-hot encode categorical variables (rows with a missing category become all zeros)
vehicles_df_cleaned = pd.get_dummies(vehicles_df_cleaned, columns=['manufacturer', 'fuel', 'transmission', 'drive'], drop_first=True)

# Display the cleaned dataset
vehicles_df_cleaned.head()
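
One detail of `pd.get_dummies` worth checking: by default a missing category is encoded as all zeros, which with `drop_first=True` looks identical to the dropped first category. The toy example below illustrates this and shows one possible workaround, filling missing values with an explicit label first. The `"unknown"` label is an assumption for illustration, not something defined by the dataset.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
toy = pd.DataFrame({"fuel": ["gas", "diesel", np.nan, "gas"]})

# Default behaviour: "diesel" is dropped as the first category, NaN gets all zeros
default = pd.get_dummies(toy, columns=["fuel"], drop_first=True)
print(default)

# Workaround: label missing values explicitly before encoding
explicit = pd.get_dummies(
    toy.fillna({"fuel": "unknown"}), columns=["fuel"], drop_first=True
)
print(explicit)
```

In `default`, the diesel row and the missing row are encoded identically, whereas `explicit` keeps them apart through a separate `fuel_unknown` column.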
