# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

## CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

## Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

### Business Objectives
#### Background
The client is a used car dealership, an organization involved with the purchasing and reselling of used cars from and to customers.

#### Business Objectives
Based on the available dataset, we want to determine which factors play the most significant role in the price that a customer is willing to pay for a given used car. In doing so, we hope to answer the question "What factors should we focus on when acquiring car inventory in order to improve revenue from used car sales?"

#### Business Success Criteria
Taking the provided dataset as the current state of car inventory, business success criteria here can be described as "if we were to focus on increasing inventory based on factors deemed most important to price, we should expect an increase in revenue performance". That is, success of this analysis would be actionable insights that allow the client to make changes to their inventory strategy and ultimately lead to an improvement in used car sales.

### Situation Assessment
#### Inventory of Resources
Resources available for this analysis are quite limited. No personnel are available for feedback, and we have only one static snapshot of the data available for analysis.

Further, only one local computer is available, significantly limiting the amount of computational resources available. 

#### Requirements, Assumptions, and Constraints
With regard to resources, one of the major constraints is time. For this project, we have slightly under one week available, and the actual working hours are only a small fraction of that period because of prior commitments to other engagements. As such, we will need to be very conservative in our choice of features to examine and in the time committed to model training. With this in mind, we should view this initial cycle of analysis as an MVP meant for initial strategy assessment, with further cycles focusing more in-depth on identified key factors and allocating more time and resources as necessary.

With regard to the dataset, we assume that the data provided is recent enough to give an accurate picture of the current state of inventory and its performance. Further, while the data entry process is unknown, we assume that faulty entries are rare enough that the data can be relied on for accurate analysis.

We assume that the client maintains the legal rights to the dataset provided and that we are free to use it for any internally facing purposes (e.g., used or viewed only by the client for strategy reassessment).

#### Risks and Contingencies
|                 **Risk**                |                                                                                                                                                              **Contingency**                                                                                                                                                             |
|:---------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| A significant amount of data is invalid | Given that we are working with a dataset of over 400K cars, the risk of this is low; however, if this proves to be the case, we will need to make a note of the reliability of the produced model and follow up in future cycles with better data. |
| Model chosen is suboptimal              | Given current limitations on available time, resources, and knowledge of regression models, there is a significant risk that the model chosen for evaluation will not be the best possible model available. The best way to handle this is to make note of model performance while making suggestions for future avenues of improvement. |
| Incorrect understanding of dataset      | Given that no feedback is available during the analysis cycle, there is a risk that data in the dataset may be incorrectly interpreted. Contingency for this is to spend sufficient time developing an understanding of the dataset and to go over understanding of data with client during the review phase.                            |

#### Costs and Benefits
While the client's current revenue is unclear at this stage, the dataset provided for analysis has over 400K cars and the original dataset had over 3 million cars, so it is not unreasonable to expect a significant benefit from this initial analysis.

For example, [using the following article as a base of reference](https://www.sapling.com/12129768/much-money-average-used-car-dealership-make-year), if we assume annual car sales of about \\$3,900,000 (a conservative figure given the age of the data referenced in the article), or about \\$325,000/month, a 5% improvement in sales would raise revenue to \\$341,250/month, an increase of \\$16,250/month. This more than justifies the cost of a rough initial analysis.
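The arithmetic behind this estimate can be reproduced with a short sketch. The annual sales figure and the 5% uplift are the assumptions stated in the paragraph above, not measured values for the client.

```python
# Back-of-the-envelope benefit estimate.
# Both inputs are assumptions taken from the text, not client data.
annual_sales = 3_900_000   # assumed annual sales in USD
uplift = 0.05              # assumed 5% improvement in sales

monthly_sales = annual_sales / 12
improved_monthly_sales = monthly_sales * (1 + uplift)
monthly_gain = improved_monthly_sales - monthly_sales

print(f"Baseline monthly sales: ${monthly_sales:,.0f}")
print(f"Improved monthly sales: ${improved_monthly_sales:,.0f}")
print(f"Monthly gain:           ${monthly_gain:,.0f}")
```

Changing `uplift` shows how sensitive the estimated benefit is to the assumed improvement.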

### Data Mining Goals
Deliverables for this project include a notebook with all aspects of the CRISP-DM framework such as data cleaning, model training, and evaluation. Additionally, a README will be included summarizing findings in an easy to read format for non-technical consumers of this analysis.

The data will be made available via a GitHub repository.

#### Data Mining Success Criteria
Key factors relating to used cars have been identified such that next steps relating to inventory strategy can be made.

### Project Plan
The first stage of this project involves an assessment of the dataset available. Given that we have a fixed timeline and no opportunity for follow-up during the current cycle, we will have to make do with whatever data is made available, noting any issues in the initial data reports as points of follow-up for future analysis cycles.

The next step is preparing our data for modeling. This involves cleaning the data, removing and creating features as necessary, and writing a report that summarizes the cleaning steps taken so they can be replicated in the future.

We will then perform an initial pass on model training, which, after evaluation of the trained model, may result in further passes to improve performance. Iterations here will likely be limited given the time constraints of the current cycle.

After modeling and evaluation, a summary of the work performed will be created along with key insights and next step recommendations.

## Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
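As a starting point, a first look at data quality might resemble the sketch below. The column names (`price`, `year`, `odometer`, `condition`) and the toy rows are assumptions made purely for illustration; with the real data, `cars` would instead be loaded from the provided file. The helper `summarize_quality` is a hypothetical name, not part of any library.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, share of missing values, and count of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().mul(100).round(1),
        "n_unique": df.nunique(dropna=True),
    })


# Toy stand-in for the real dataset (values invented for illustration only).
cars = pd.DataFrame({
    "price": [0, 4500, 12000, 12000, 250000],
    "year": [2012, np.nan, 2018, 2018, 1999],
    "odometer": [np.nan, 150000, 40000, 40000, 30000],
    "condition": ["good", None, "excellent", "excellent", "fair"],
})

report = summarize_quality(cars)
n_duplicates = int(cars.duplicated().sum())            # fully repeated rows
n_nonpositive_price = int((cars["price"] <= 0).sum())  # implausible prices

print(report)
print("Duplicate rows:", n_duplicates)
print("Rows with price <= 0:", n_nonpositive_price)
print(cars.describe())  # ranges help spot outliers such as extreme prices
```

Checks like these surface missing values, duplicated listings, and implausible prices before any modeling decisions are made.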

## Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
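One possible shape for this preparation step is sketched below: median imputation and scaling for numeric columns, one-hot encoding for categorical columns, and a log transform of the target. The column names and toy rows are assumptions for illustration, and the specific choices (median imputation, `log1p`) are one reasonable option rather than the required approach.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the cleaned dataset (values invented for illustration).
cars = pd.DataFrame({
    "price": [4500, 12000, 30000, 8000],
    "year": [2008, 2016, np.nan, 2012],
    "odometer": [150000, 60000, 20000, np.nan],
    "condition": ["fair", "good", "excellent", np.nan],
})

numeric_cols = ["year", "odometer"]   # assumed numeric features
categorical_cols = ["condition"]      # assumed categorical feature

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0,
)

X_prepared = preprocess.fit_transform(cars[numeric_cols + categorical_cols])
y = np.log1p(cars["price"])  # log transform tames the right-skewed price

print(X_prepared.shape)
print(y.round(3).tolist())
```

Wrapping the steps in a `ColumnTransformer` means the same preprocessing can later be fitted inside cross-validation, which avoids leaking information from validation folds.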

## Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
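A minimal sketch of cross-validated parameter exploration is shown below, using ridge regression as one candidate model. The data is synthetic, generated from an assumed relationship between log price, car age, and odometer reading, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the coefficients below are assumptions.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 20, n)
odometer = rng.uniform(0, 200_000, n)
log_price = 10.5 - 0.08 * age - 4e-6 * odometer + rng.normal(0, 0.1, n)
X = np.column_stack([age, odometer])

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    model,
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, log_price)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Cross-validated RMSE (log price):", round(-search.best_score_, 3))
```

The same pattern extends to other regressors by swapping the final pipeline step and its parameter grid.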

## Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
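One way to connect a fitted model back to the question of price drivers is permutation importance on held-out data, sketched below. The data is synthetic and includes a deliberately irrelevant `noise` feature; the feature names and coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; "noise" has no effect on the target by design.
rng = np.random.default_rng(42)
n = 400
feature_names = ["age", "odometer", "noise"]
age = rng.uniform(0, 20, n)
odometer = rng.uniform(0, 200_000, n)
noise = rng.normal(0, 1, n)
log_price = 10.5 - 0.08 * age - 4e-6 * odometer + rng.normal(0, 0.1, n)
X = np.column_stack([age, odometer, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, log_price, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Importance = drop in held-out R^2 when one feature's values are shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)

print("Held-out R^2:", round(model.score(X_test, y_test), 3))
for name, score in ranking:
    print(f"{name:>10}: {score:.3f}")
```

A ranking like this is easier to explain to a non-technical client than raw coefficients, although it describes what the model relies on rather than proving what causes price differences.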

## Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.