# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle that contains information on 3 million used cars.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

This is a supervised regression problem: predict the price of a car (the target) from the various properties of that car (the features), and determine which features affect the price the most.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

There appears to be a lot of missing data in the dataset.  The first step is to drop the rows with missing data, so we are left with a clean dataset.  There are also rows with Sales = 0, which need to be dropped.  The id and VIN columns do not contain useful data and should be dropped.
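
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_listings` and the toy data are hypothetical, and "Sales = 0" is assumed to refer to a listing price column named `price`. The sketch also drops the identifier columns before dropping incomplete rows, which is one possible ordering; it keeps rows whose only missing value is the VIN.

```python
import numpy as np
import pandas as pd


def clean_listings(df, id_cols=("id", "VIN"), price_col="price"):
    """Drop identifier columns, rows with missing values, and zero-price rows.

    The column names are assumptions about the dataset, not confirmed by the text.
    """
    # Identifier columns carry no pricing information, so remove them first.
    out = df.drop(columns=[c for c in id_cols if c in df.columns])
    # Drop every row that still has at least one missing value.
    out = out.dropna()
    # Drop listings whose price is recorded as 0.
    out = out[out[price_col] != 0]
    return out.reset_index(drop=True)


# Tiny made-up frame, for illustration only (not real listings).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "VIN": ["a", "b", None, "d"],
    "price": [5000, 0, 7000, 12000],
    "odometer": [90000.0, 120000.0, 60000.0, np.nan],
})

# Share of missing values per column, useful for judging how much data is lost.
missing_share = toy.isna().mean()

cleaned = clean_listings(toy)
```

Checking `missing_share` before calling `dropna` helps show whether a single sparse column is responsible for most of the dropped rows.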

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

Price, year, and odometer need to be standardized.  Cylinders, size, title_status, type, and condition need to be converted into numbers.  Region, manufacturer, model, fuel, drive, transmission, paint_color, and state need to be one-hot encoded.
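
One way to express this preparation with `sklearn` is a `ColumnTransformer`, sketched below. The helper `make_preprocessor`, the lowercase column names, and the toy frame are assumptions made for illustration. Price is the target, so it is left out of the feature transformer and would be standardized separately. Note that `OrdinalEncoder` orders categories alphabetically unless an explicit `categories=` list is supplied, which matters for columns such as condition.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Column groups taken from the text above; exact names and casing are assumed.
numeric_cols = ["year", "odometer"]
ordinal_cols = ["cylinders", "size", "title_status", "type", "condition"]
onehot_cols = ["region", "manufacturer", "model", "fuel", "drive",
               "transmission", "paint_color", "state"]


def make_preprocessor(numeric, ordinal, onehot):
    """Build a transformer that scales, integer-encodes, and one-hot encodes."""
    return ColumnTransformer([
        # Standardize numeric features to zero mean and unit variance.
        ("scale", StandardScaler(), numeric),
        # Map each category to an integer; unseen categories become -1.
        ("ordinal",
         OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         ordinal),
        # One indicator column per category; unseen categories become all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), onehot),
    ])  # columns not listed are dropped (the default remainder="drop")


# Tiny made-up frame using a subset of the columns, for illustration only.
toy = pd.DataFrame({
    "year": [2010, 2015, 2020],
    "odometer": [120000.0, 80000.0, 20000.0],
    "condition": ["fair", "good", "excellent"],
    "fuel": ["gas", "diesel", "gas"],
})
pre = make_preprocessor(["year", "odometer"], ["condition"], ["fuel"])
X = pre.fit_transform(toy)
```

On the full dataset the same function would be called with `numeric_cols`, `ordinal_cols`, and `onehot_cols`; high-cardinality columns such as model produce many indicator columns, so the output may be a sparse matrix.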

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

Try different polynomial degrees to see which degree fits best.  Then drop one variable at a time to see which has the greatest impact on the test MSE.
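
A degree comparison of this kind could look like the sketch below, which scores each polynomial degree by cross-validated MSE. The function `cv_mse_by_degree` and the synthetic data are hypothetical stand-ins; the real features would come from the prepared car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def cv_mse_by_degree(X, y, degrees=(1, 2, 3), cv=5):
    """Return {degree: mean cross-validated MSE} for polynomial regression."""
    scores = {}
    for d in degrees:
        model = make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=d, include_bias=False),
            LinearRegression(),
        )
        # sklearn reports negated MSE so that larger is better; flip the sign.
        neg_mse = cross_val_score(model, X, y, cv=cv,
                                  scoring="neg_mean_squared_error")
        scores[d] = -neg_mse.mean()
    return scores


# Synthetic data with a quadratic relationship, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(200, 2))
y_demo = 1.0 + X_demo[:, 0] ** 2 - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

mse_by_degree = cv_mse_by_degree(X_demo, y_demo)
best_degree = min(mse_by_degree, key=mse_by_degree.get)
```

Higher degrees multiply the number of features quickly, especially after one-hot encoding, so in practice the polynomial expansion may need to be limited to the numeric columns.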

Polynomial degree 2 gave the lowest test MSE.  I then dropped each variable, one at a time, to see which had the greatest effect on the test MSE.  Model, manufacturer, odometer, and year are the most important variables.  The others had only a small effect on the MSE.
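
The drop-one-variable comparison can be sketched as below. This is an illustration under stated assumptions rather than the analysis actually run: the function `drop_one_test_mse`, the feature names, and the synthetic data are hypothetical, and a plain linear regression is swapped in for the degree-2 model to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def drop_one_test_mse(X, y, random_state=0):
    """Return the baseline test MSE and the MSE increase from dropping each column."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=random_state)

    def fit_and_score(cols):
        # Refit on the chosen columns and score on the held-out split.
        model = LinearRegression().fit(X_tr[cols], y_tr)
        return mean_squared_error(y_te, model.predict(X_te[cols]))

    all_cols = list(X.columns)
    baseline = fit_and_score(all_cols)
    increase = {
        col: fit_and_score([c for c in all_cols if c != col]) - baseline
        for col in all_cols
    }
    # Largest increase first: these are the variables the model relies on most.
    return baseline, pd.Series(increase).sort_values(ascending=False)


# Synthetic features with known importance, for illustration only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)),
                      columns=["strong", "weak", "noise"])
y_demo = (3.0 * X_demo["strong"] + 0.5 * X_demo["weak"]
          + rng.normal(scale=0.1, size=300))

baseline_mse, mse_increase = drop_one_test_mse(X_demo, y_demo)
```

A variable whose removal barely changes the test MSE may still matter if it is correlated with another variable that stays in the model, so the ranking is best read as a rough guide.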

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

Model, manufacturer, odometer, and year have the greatest effect on the price of the vehicle.  The other parameters had an effect, but it was smaller.  This is expected, since different make/model combinations are priced very differently.  Odometer and year both relate to the age of the vehicle, which is known to have a large effect on the price.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.