# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a used car dataset from Kaggle. The original dataset contained information on 3 million used cars; the provided subset covers 426K cars to keep processing fast. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

From a data science perspective, identifying the key drivers of used car prices is a **supervised learning task**, specifically a **regression** problem. The goal is to build a model that predicts the used car price, the **dependent variable**, from a set of **input features** (the **independent variables**) such as mileage, manufacturing year, vehicle type, and condition. Beyond predictive accuracy, we want **actionable insights**: by inspecting the fitted model (for example, its coefficients or feature importances), we can estimate how strongly each feature is associated with price and rank the factors that matter most in the used car market.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

```markdown
### Data Understanding Steps

**1. Load Data and Initial Inspection:**

Load the data into a pandas DataFrame. Examine the first and last few rows to get a sense of the data structure and variable types. Check for nulls or missing values, potential outliers, and duplicates, and ask whether the data "feels right": is it a valid dataset to answer our questions? Does it include the critical fields? For our purposes, I expect data about used cars, in sufficient volume, with values that make sense from experience. If the data did not include mileage but had weather information, it would be incomplete at best and useless at worst.

**2. Data Profiling and Validation:**

Generate summary statistics for each column (mean, median, standard deviation, min, max, number of unique values, number of missing values, data types). This will help identify potential outliers and data inconsistencies. Validate that the data types in each column are consistent. Note non-numeric columns that may be relevant; we will have to handle these differently.

**3. Missing Data Handling:**

Check for missing data and determine if it can be fixed via imputation or removal. See what percent of data is missing, if there's a pattern to the missing data, and if it makes sense to include it in our analysis.

**4. Outlier Handling:**

Identify outliers using box plots, scatter plots, or z-scores. Handle outliers by removal or transformation.

**5. Duplicate Handling:**

Identify and handle duplicate rows by removal or merging.

```
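
The profiling, missing-data, and duplicate checks described above can be sketched as follows. This is a minimal illustration, not the analysis actually run in this notebook: `profile_frame` and `count_duplicates` are hypothetical helper names, and the `toy` frame is made-up data that only borrows column names from `vehicles.csv`.

```python
import pandas as pd


def profile_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing percent, unique count."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })
    # Columns with the most missing data come first.
    return summary.sort_values("pct_missing", ascending=False)


def count_duplicates(df: pd.DataFrame, subset=None) -> int:
    """Count rows that repeat an earlier row on the given columns (all columns by default)."""
    return int(df.duplicated(subset=subset).sum())


# Made-up rows for illustration only (not taken from vehicles.csv).
toy = pd.DataFrame({
    "price": [6000, 11900, 6000, 0],
    "year": [2014.0, None, 2014.0, 2010.0],
    "fuel": ["gas", None, "gas", "diesel"],
})

report = profile_frame(toy)
n_dupes = count_duplicates(toy)
print(report)
print("duplicate rows:", n_dupes)
```

On the real data, the same helpers could be called as `profile_frame(df)` and `count_duplicates(df, subset=["VIN"])`, assuming `VIN` is a sensible key for spotting repeated listings.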

In [5]:
import pandas as pd
df = pd.read_csv("data/vehicles.csv")

df.head(5)

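| | id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | VIN | drive | size | type | paint_color | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7222695916 | prescott | 6000 | | | | | | | | | | | | | | | az |
| 1 | 7218891961 | fayetteville | 11900 | | | | | | | | | | | | | | | ar |
| 2 | 7221797935 | florida keys | 21000 | | | | | | | | | | | | | | | fl |
| 3 | 7222270760 | worcester / central MA | 1500 | | | | | | | | | | | | | | | ma |
| 4 | 7210384030 | greensboro | 4900 | | | | | | | | | | | | | | | nc |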


In [6]:
df.iloc[15:31]

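| | id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | VIN | drive | size | type | paint_color | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 7223509794 | bellingham | 13995 | | | | | | | | | | | | | | | wa |
| 16 | 7222753076 | bellingham | 24999 | | | | | | | | | | | | | | | wa |
| 17 | 7222206015 | bellingham | 21850 | | | | | | | | | | | | | | | wa |
| 18 | 7220030122 | bellingham | 26850 | | | | | | | | | | | | | | | wa |
| 19 | 7218423006 | bellingham | 11999 | | | | | | | | | | | | | | | wa |
| 20 | 7216672204 | bellingham | 24999 | | | | | | | | | | | | | | | wa |
| 21 | 7215617048 | bellingham | 21850 | | | | | | | | | | | | | | | wa |
| 22 | 7213839225 | bellingham | 26850 | | | | | | | | | | | | | | | wa |
| 23 | 7208549803 | bellingham | 11999 | | | | | | | | | | | | | | | wa |
| 24 | 7213843538 | skagit / island / SJI | 24999 | | | | | | | | | | | | | | | wa |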


In [10]:
# Summary statistics for the numeric columns
print(df.describe().round(2).astype(str))


                  id         price      year    odometer
count       426880.0      426880.0  425675.0    422480.0
mean   7311486634.22      75199.03   2011.24    98043.33
std       4473170.41   12182282.17      9.45    213881.5
min     7207408119.0           0.0    1900.0         0.0
25%    7308143339.25        5900.0    2008.0     37704.0
50%     7312620821.0       13950.0    2013.0     85548.0
75%     7315253543.5      26485.75    2017.0    133542.5
max     7317101084.0  3736928711.0    2022.0  10000000.0
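
The summary above shows a maximum price of 3,736,928,711, a minimum price of 0, and a maximum odometer reading of 10,000,000, all of which look implausible for used cars. One common way to flag such values is the interquartile range (IQR) rule. The sketch below is an assumption about how that check could be done, not a step taken in this notebook: the helper names and the `k = 1.5` multiplier are illustrative, and `toy_prices` is a made-up series.

```python
import pandas as pd


def fences_from_quartiles(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR outside the quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


def flag_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the IQR fences."""
    lower, upper = fences_from_quartiles(
        float(values.quantile(0.25)), float(values.quantile(0.75)), k
    )
    return (values < lower) | (values > upper)


# Fences implied by the price quartiles printed above (25% = 5900.0, 75% = 26485.75).
price_fences = fences_from_quartiles(5900.0, 26485.75)
print(price_fences)  # (-24978.625, 57364.375)

# Made-up series mixing typical prices with the extreme min and max from the summary.
toy_prices = pd.Series([5900, 13950, 26485, 3736928711, 0])
flags = flag_outliers(toy_prices)
print(flags.tolist())  # [False, False, False, True, False]
```

Note that the lower fence is negative, so the IQR rule alone does not flag a price of 0. A separate domain rule (for example, requiring `price > 0`) would likely be needed for those rows.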


### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
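
As one possible shape for this step, the sketch below combines imputation, scaling, and one-hot encoding in a single `sklearn` `ColumnTransformer`, and log-transforms the target. The column selection, the imputation strategies, and the use of `log1p` are all assumptions made for illustration; the `toy` rows are made up and only reuse column names from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column choices; the real analysis may use more or different columns.
numeric_cols = ["year", "odometer"]
categorical_cols = ["fuel", "condition"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical columns: fill gaps with the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "year": [2014.0, np.nan, 2018.0, 2005.0],
    "odometer": [85000.0, 60000.0, np.nan, 150000.0],
    "fuel": ["gas", "gas", np.nan, "diesel"],
    "condition": ["good", np.nan, "excellent", "fair"],
})

X = preprocess.fit_transform(toy[numeric_cols + categorical_cols])
# log1p compresses the long right tail of prices; np.expm1 undoes it.
y = np.log1p(toy["price"])
print(X.shape)
```

Keeping the preprocessing inside a transformer means it can later be chained with a regressor in one `Pipeline`, so that cross-validation refits the imputers and scaler on each training fold.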

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
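
A minimal sketch of exploring a parameter with cross-validation is shown below, using `Ridge` regression and `GridSearchCV`. The data is synthetic: the features `age` and `miles_k`, the coefficients used to generate `log_price`, and the `alpha` grid are all assumptions chosen for illustration, not values derived from `vehicles.csv`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for a prepared dataset (values are made up).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 20, n)        # vehicle age in years
miles_k = rng.uniform(0, 200, n)   # odometer in thousands of miles
log_price = 10.5 - 0.08 * age - 0.004 * miles_k + rng.normal(0, 0.1, n)
X = np.column_stack([age, miles_k])

# Try several regularization strengths, scoring each by 5-fold cross-validated RMSE.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, log_price)

print("best alpha:", search.best_params_["alpha"])
print("cross-validated RMSE (log scale):", -search.best_score_)
print("coefficients (age, miles_k):", search.best_estimator_.coef_)
```

The same pattern extends to other regressors (for example `Lasso` or a tree-based model) by swapping the estimator and its parameter grid.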

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
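
One way to connect model quality to price drivers is to score a model on held-out data and then rank features by permutation importance. The sketch below does this on synthetic data; the feature names, the generating coefficients, and the deliberately irrelevant `noise` feature are assumptions for illustration only, and the ranking it produces says nothing about the real dataset.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: age matters most, mileage less, and "noise" not at all.
rng = np.random.default_rng(42)
n = 300
age = rng.uniform(0, 20, n)
miles_k = rng.uniform(0, 200, n)
noise = rng.normal(0, 1, n)
log_price = 10.5 - 0.08 * age - 0.004 * miles_k + rng.normal(0, 0.1, n)

feature_names = ["age", "miles_k", "noise"]
X = np.column_stack([age, miles_k, noise])
X_train, X_test, y_train, y_test = train_test_split(
    X, log_price, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Held-out error on the log-price scale.
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

# Permutation importance: how much the score drops when one feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
ranking = [feature_names[i] for i in np.argsort(result.importances_mean)[::-1]]

print("held-out RMSE:", round(rmse, 3))
print("features ranked by importance:", ranking)
```

Permutation importance is model-agnostic, so the same check can be applied to whichever model performs best in cross-validation before translating the ranking into recommendations for the client.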

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.