# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

The objective is to analyze the provided used car dataset and build predictive regression models that identify which features of a car have the greatest impact on its price. Each feature's influence on price is evaluated by fitting several regression models. The end result is a list of the features that raise or lower a car's price.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

1. Load the dataset and inspect its features
    - Some features have an impact on car price and some do not
    - Some features only serve to identify an individual car
2. Identify the features that don't impact car price and drop them from the dataset
    - Features like id, VIN, region, cylinders, and model are good candidates
3. Some features have missing data
    - If we remove every entry with a missing value, we will lose a lot of data, so an educated choice has to be made about what to drop and what to fill
4. The price column has outliers, i.e. values that are either very low or very high. Such data can lead to an inaccurate model, so it has to be handled
    - We need to remove the outliers
5. A choice has to be made as to whether we need state-specific analysis
    - This can produce a more accurate model for each state, but it means repeating the modelling for every state
    - If we choose to build one general model instead, we can remove the state feature from the dataset
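
The checks listed above can be sketched in a few lines of pandas. This is a minimal illustration on a tiny synthetic frame, not the actual notebook code: the rows, the `summarize_missing` helper, and the file path mentioned in the comment are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real data. In the notebook this would be
# loaded with something like pd.read_csv("data/vehicles.csv") (hypothetical path).
cars = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "price": [0, 4500, 12000, 18500, 23000, 3_000_000],
    "year": [2009, 2012, np.nan, 2017, 2019, 2015],
    "manufacturer": ["ford", "toyota", None, "tesla", "bmw", "ford"],
    "condition": ["fair", None, "good", None, "excellent", None],
    "state": ["ca", "ca", "tx", "ny", "tx", "ca"],
})


def summarize_missing(df):
    """Return the missing-value count and share per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "share": df.isna().mean().round(3),
    })
    return summary.sort_values("missing", ascending=False)


missing_report = summarize_missing(cars)
print(missing_report)

# Summary statistics expose implausible prices at both ends of the range.
print(cars["price"].describe())

# Row counts per state help decide between per-state models and one general model.
state_counts = cars["state"].value_counts()
print(state_counts)
```

On the real dataset, the same report shows which columns are too sparse to keep and which can reasonably be filled.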

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

1. Handle price outliers
    - Remove entries where the price is less than 1000
    - Keep only entries whose price is within 3 standard deviations of the mean
2. Handle missing data
    - Some features are critical, so their missing values cannot be filled. We are going to remove such entries from the dataset
    - Some missing values can be filled logically
3. Custom logic to fill missing data
    - For categorical features like title_status, size, type, and drive, missing values are filled by randomly choosing a category with the same probabilities as the observed value distribution
    - For categorical features like fuel and condition, no single rule applies
        - For all Tesla cars, the fuel type is electric
        - Cars from before 2014 are mostly gas/diesel
        - For the condition feature, calculate the mean price for each category and compare it with the car's price to choose an appropriate value
4. Convert categorical values to integers
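
The preparation steps above could look roughly like the following. This is a sketch under stated assumptions, not the notebook's implementation: the data is synthetic, the helper names `remove_price_outliers` and `fill_by_distribution` are hypothetical, and "within 3 standard deviations" is read as distance from the mean price.

```python
import numpy as np
import pandas as pd


def remove_price_outliers(df, min_price=1000, n_std=3):
    """Drop prices below min_price, then keep prices within n_std std of the mean."""
    kept = df[df["price"] >= min_price]
    mean, std = kept["price"].mean(), kept["price"].std()
    return kept[(kept["price"] - mean).abs() <= n_std * std]


def fill_by_distribution(series, rng):
    """Fill NaNs by sampling observed categories with their observed frequencies."""
    freqs = series.dropna().value_counts(normalize=True)
    filled = series.copy()
    mask = filled.isna()
    filled[mask] = rng.choice(freqs.index.to_numpy(), size=mask.sum(), p=freqs.to_numpy())
    return filled


# Synthetic data: 48 plausible prices plus one very low and one very high price.
rng = np.random.default_rng(0)
cars = pd.DataFrame({
    "price": np.concatenate([np.linspace(5000, 30000, 48), [500, 5_000_000]]),
    "manufacturer": ["tesla", "ford", "toyota"] * 16 + ["ford", "bmw"],
    "fuel": [None, "gas", "gas"] * 16 + ["gas", "gas"],
    "drive": ["fwd", "rwd", None, "4wd"] * 12 + ["fwd", None],
})

clean = remove_price_outliers(cars).copy()

# Rule-based fill: a Tesla with unknown fuel is assumed to be electric.
clean.loc[(clean["manufacturer"] == "tesla") & clean["fuel"].isna(), "fuel"] = "electric"

# Distribution-preserving random fill for a feature with no obvious rule.
clean["drive"] = fill_by_distribution(clean["drive"], rng)

# Convert categorical values to integer codes.
for col in ["manufacturer", "fuel", "drive"]:
    clean[col] = clean[col].astype("category").cat.codes

print(clean.head())
```

Sampling from the observed frequencies keeps the category mix of each filled column close to what was actually observed, which filling with the most common value would not.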

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

1. The following regression models are built
    - Linear Regression Model
    - Ridge Model
    - Lasso Model
2. `GridSearchCV` is used to find the best parameters
    - For the Ridge Model, an alpha of 0.01 turns out to be the best parameter
3. `cross_val_score` is used to score the models using R2
4. `permutation_importance` is used to identify feature importance
    - A visualization is built for ease of understanding
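
As a rough illustration of how the pieces listed above fit together in scikit-learn, the sketch below runs the same workflow on synthetic data. The feature names, coefficients, alpha grid, and the use of a `StandardScaler` pipeline are assumptions made for the example, so the scores and the selected alpha will not match the notebook's.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: price depends strongly on the first feature, weakly on the
# second, and not at all on the last two. Names are illustrative only.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 15000 + 5000 * X[:, 0] + 500 * X[:, 1] + rng.normal(scale=200, size=300)
feature_names = ["year", "condition", "noise_a", "noise_b"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Cross-validated R2 for each of the three linear models.
models = {"linear": LinearRegression(), "ridge": Ridge(), "lasso": Lasso()}
cv_r2 = {
    name: cross_val_score(
        make_pipeline(StandardScaler(), model), X_train, y_train, cv=5, scoring="r2"
    ).mean()
    for name, model in models.items()
}
print(cv_r2)

# Grid search over the Ridge regularization strength.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
)
grid.fit(X_train, y_train)
print(grid.best_params_)

# Permutation importance on held-out data, ranked from most to least important.
result = permutation_importance(
    grid.best_estimator_, X_test, y_test, n_repeats=10, random_state=42
)
ranking = sorted(
    zip(feature_names, result.importances_mean), key=lambda pair: pair[1], reverse=True
)
print(ranking)
```

Permutation importance is computed on the held-out split here, so the ranking reflects how much each feature contributes to predictions on data the model has not seen.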



### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

1. To evaluate the accuracy of the models, the Mean Squared Error and the R2 regression score are calculated
2. Because the high-impact features identified are categorical, an additional regression model is built to identify which category values have the greatest impact
    - The Ridge Model is used
    - The dataset is modified to contain only the high-impact features
    - A visualization is built for the results
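
The two evaluation steps above can be sketched as follows. The example assumes a single synthetic `condition` feature with made-up typical prices and uses one-hot encoding so that each category gets its own Ridge coefficient. It illustrates the mechanics only and does not reproduce the notebook's scores.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: each condition level has an assumed typical price plus noise.
rng = np.random.default_rng(7)
levels = {"fair": 4000, "good": 9000, "excellent": 15000}
condition = rng.choice(list(levels), size=300)
price = np.array([levels[c] for c in condition]) + rng.normal(scale=1000, size=300)

# One column per category value, so coefficients can be compared directly.
X = pd.get_dummies(pd.Series(condition, name="condition"), dtype=float)
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=7)

model = Ridge(alpha=0.01).fit(X_train, y_train)
pred = model.predict(X_test)

# Hold-out metrics: lower MSE and higher R2 indicate a better fit.
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE: {mse:.0f}, R2: {r2:.3f}")

# Coefficients sorted from the most price-raising to the most price-lowering value.
coefs = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
print(coefs)
```

A bar chart of `coefs` would give the kind of per-category visualization mentioned in the list.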


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

After the analysis and modelling, the features below were found to have the greatest impact on car price. All of the linear regression models employed identified the same features. These observations agree with real-world expectations.

- Condition (State of the car)
- Year (Age of the car)
- Type (Car segment)
- Manufacturer (Brand matters)
     
Further analysis, modelling only these top features with a Ridge model, produced R2 scores close to zero, i.e. no individual feature has a strong linear relationship with price on its own. This observation suggests that car price is driven by multiple features together, not just one.

Car dealerships should incorporate these findings when stocking their inventory in order to get better prices for their used cars.