## Table of Contents
* [Introduction](#section-one)
    - [Data Sources | Task Definition | Planned Approach](#subsection-zero)
    - [Library Imports](#subsection-one)
    - [Data Imports](#subsection-two)
* [Exploratory Data Analysis](#section-two)
    - [Population Distribution](#subsection-three)
    - [Correlations](#subsection-four)
    - [Boxplots](#subsection-five)
* [Pre-Processing | Feature Engineering](#section-three)
    - [Unifying in a single Df](#subsection-six)
    - [Handling Missing Values](#subsection-seven)
    - [Checking Data Types](#subsection-eight)
    - [Obtaining Age](#subsection-nine)
    - [Encoding Categorical Features](#subsection-ten)
* [Model Implementation](#section-four)
    - [Test-Train Split](#subsection-eleven_)
    - [Decision Tree Regressor](#subsection-twelve)
    - [XGB Regressor](#subsection-twelve_)
* [Conclusions](#section-five)
    - [Optimal Model](#subsection-fourteen)
    - [Feature Importance](#subsection-fifteen)

<a id='section-one'></a>
## Introduction

<a id='subsection-zero'></a>
#### *Data Source & Contents*
Scraped data from used car listings: 100,000 listings, separated into files corresponding to each car manufacturer. I originally collected the data to build a tool that predicts how much my friend should sell his old car for relative to comparable cars on the market. I then extended the data set and built a more general car value regression model.
The cleaned data set contains information on price, transmission, mileage, fuel type, road tax, miles per gallon (mpg), and engine size.

#### *Tasks*
1. Create a regression model that **predicts selling price** based on the individual car characteristics
2. **Identify which features drive car price** and whether their effect on selling price is positive or negative


#### *Approach*
1. Do any necessary cleaning of the datasets so they can be unified
2. Create a single DataFrame that unifies all brand specific datasets
3. EDA. Explore the data to try and identify main price drivers
4. Pre-Processing / Feature Engineering to prepare the data for model implementation
5. Model testing
6. Explore feature importance and feature impact on price to elaborate conclusion

<a id='subsection-one'></a>
#### *Library Imports*

In [None]:
# Programming
import pandas as pd
import numpy as np

# Sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Other 
import warnings
warnings.filterwarnings('ignore')

<a id='subsection-two'></a>
#### *Data Imports*

In [None]:
# Loading individual datasets
audi = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/audi.csv')
bmw = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/bmw.csv')
cclass = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/cclass.csv')
focus = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/focus.csv')
hyundi = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/hyundi.csv')
merc = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/merc.csv')
skoda = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/skoda.csv')
toyota = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/toyota.csv')
vauxhall = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/vauxhall.csv')
vw = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv')

# Adding a 'Brand' column
audi['brand'] = 'Audi'
bmw['brand'] = 'BMW'
cclass['brand'] = 'Mercedes'
focus['brand'] = 'Ford'
hyundi['brand'] = 'Hyundai'
merc['brand'] = 'Mercedes'
skoda['brand'] = 'Skoda'
toyota['brand'] = 'Toyota'
vauxhall['brand'] = 'Vauxhall'
vw['brand'] = 'Volkswagen'

# Creating a single Df from the 'clean' Dfs
# (note: hyundi is loaded and tagged above but is not part of this list)
df_clean_list = [audi, bmw, cclass, focus, merc, skoda, toyota, vauxhall, vw]
df = pd.concat(df_clean_list, sort=False)
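
The load, tag and concatenate steps above all follow one pattern, so they can be driven from a list of (brand, frame) pairs. The sketch below is only an illustration: `tag_and_concat` is a hypothetical helper, and the toy frames (with made-up values) stand in for the CSV files. Pairs are used rather than a dict because two files share the brand 'Mercedes'.

```python
import pandas as pd

def tag_and_concat(frames_by_brand):
    """Add a 'brand' column to each (brand, frame) pair and stack the results."""
    tagged = [frame.assign(brand=brand) for brand, frame in frames_by_brand]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(tagged, sort=False, ignore_index=True)

# Toy stand-ins for two of the per-manufacturer files (values are made up)
toy_audi = pd.DataFrame({'model': ['A1', 'A3'], 'price': [12500, 16500]})
toy_merc = pd.DataFrame({'model': ['C Class'], 'price': [19000]})

combined = tag_and_concat([('Audi', toy_audi), ('Mercedes', toy_merc)])
print(combined)
```

Unlike the cell above, this sketch resets the index while concatenating, which avoids repeated index labels in the combined frame.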

<a id='section-two'></a>
## Exploratory Data Analysis

<a id='subsection-three'></a>
#### *Population Distribution*

In [None]:
# Seaborn Settings
sns.set()
sns.set(rc={'figure.figsize':(30,5)})

# GRAPH 0
    # Price Distribution
sns.histplot(x='price',kde=True, data=df)
plt.title('PRICE DISTRIBUTION')
plt.show()
plt.close()

# GRAPH 1
    # Brand Distribution
plt.subplot(1,2,1)
sns.countplot(x='brand', data=df)
plt.xticks(rotation=90)
plt.title('AMOUNT OF OBSERVATIONS PER BRAND')
    # Launch Year Distribution 
plt.subplot(1,2,2)
sns.histplot(x='year',kde=True, data=df)
plt.title('LAUNCH YEAR')
plt.show()
plt.close()

# GRAPH 2
    # Transmission Distribution
plt.subplot(1,2,1)
sns.countplot(x='transmission', data=df)
plt.title('TRANSMISSION TYPE')
    # Fuel Type Distribution
plt.subplot(1,2,2)
sns.countplot(x='fuelType', data=df)
plt.title('FUEL TYPE')
plt.show()
plt.close()

# GRAPH 3
    # Model Distribution
sns.countplot(x='model', data=df)
plt.xticks(rotation=90)
plt.title('AMOUNT OF OBSERVATIONS PER MODEL')
plt.show()
plt.close()

<a id='subsection-four'></a>
#### *Correlations*

In [None]:
# Seaborn Settings
sns.set()
sns.set(rc={'figure.figsize':(30,5)})

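# GRAPH 0
    # Numerical Features Correlations
    # (restricted to numeric columns; recent pandas versions raise an error
    #  when corr() is called on a frame that still contains string columns)
corr = df.select_dtypes(include='number').corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, vmin=-1, vmax=1,
     cmap="Spectral", annot=True, fmt='.2f')
plt.yticks(rotation=0)
plt.show()
plt.close()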

<a id='subsection-five'></a>
#### *Boxplots*

In [None]:
# GRAPH 0
    # Transmission
plt.subplot(1,3,1)
sns.boxplot(x='transmission', y='price', data=df)
plt.title('TRANSMISSION BOXPLOT')
    # Fuel Type
plt.subplot(1,3,2)
sns.boxplot(x='fuelType', y='price', data=df)
plt.title('FUEL TYPE BOXPLOT')
    # Brand
plt.subplot(1,3,3)
sns.boxplot(x='brand', y='price', data=df)
plt.title('BRAND BOXPLOT')
plt.show()
plt.close()

<a id='section-three'></a>
## Pre-Processing

<a id='subsection-six'></a>
#### *Unifying in a single Df*

In [None]:
# Creating a single Df from the 'clean' Dfs
df_clean_list = [audi, bmw, cclass, focus, merc, skoda, toyota, vauxhall, vw]
df_clean = pd.concat(df_clean_list, sort=False)
del df_clean_list

# Using the concatenated Df from here on
df = df_clean

# Visually checking the Df
df.head()

<a id='subsection-seven'></a>
#### *Handling Missing Values*

In [None]:
# Check for missing values
print('# of Missing Values')
print(df.isnull().sum())

# Visually inspect missing values
msno.matrix(df)
plt.title('MISSING VALUES')

# Dropping all observations with missing values
df = df.dropna()

msno.matrix(df)
plt.title('MISSING VALUES AFTER OBSERVATION DROP')

<a id='subsection-eight'></a>
#### *Checking Data Types*

In [None]:
print(df.dtypes)
df.head()

<a id='subsection-nine'></a>
#### *Obtaining Age*

In [None]:
# Age of each car in years, relative to 2021
df['age'] = 2021 - df['year']
df = df.drop('year', axis=1).copy()
df.head()

<a id='subsection-ten'></a>
#### *Encoding Categorical Features*

In [None]:
# Dropping the 'brand' column (information already included on the 'model' feature)
df = df.drop('brand', axis=1)
# Categorical Feature Encoding
    # transmission ['Manual' 'Automatic' 'Semi-Auto' 'Other']
df_transmission = df[['transmission']]
df_transmission = pd.get_dummies(df_transmission, drop_first=True, prefix='transmission')
    # fuelType ['Petrol' 'Diesel' 'Hybrid' 'Other' 'Electric']
df_fuel = df[['fuelType']]
df_fuel = pd.get_dummies(df_fuel, drop_first=True, prefix='fuel')
    # model print(df_model.model.unique())
df_model = df[['model']]
df_model = pd.get_dummies(df_model, drop_first=True, prefix='model')

# Selecting numerical features
df_numerical = df[['mileage', 'tax', 'mpg', 'engineSize', 'age']]

# Unifying all in a single df
df = pd.concat([df[['price']], df_transmission], axis=1).reset_index(drop=True)
df = pd.concat([df, df_fuel.reset_index(drop=True)], axis=1)
df = pd.concat([df, df_model.reset_index(drop=True)], axis=1)
df = pd.concat([df, df_numerical.reset_index(drop=True)], axis=1)

df.head()
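
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy column (the frame `toy` and its values are illustrative only). The first category in sorted order is dropped and becomes the baseline against which the remaining dummy columns are read.

```python
import pandas as pd

# Toy column with three transmission types (illustrative values)
toy = pd.DataFrame({'transmission': ['Manual', 'Automatic', 'Semi-Auto', 'Manual']})

# One-hot encode; 'Automatic' sorts first, so it is the dropped baseline level
encoded = pd.get_dummies(toy[['transmission']], drop_first=True, prefix='transmission')

print(encoded.columns.tolist())
# ['transmission_Manual', 'transmission_Semi-Auto']
```

A row with both dummy columns equal to zero therefore represents an 'Automatic' car in this toy example.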

<a id='section-four'></a>
## Model Implementation

<a id='subsection-eleven_'></a>
#### *Test-Train Split*

In [None]:
# Fixing seed for reproducibility
seed = 1

# Creating the Features/Label split as numpy arrays
features =  df.drop('price', axis=1).values
target = df[['price']].values

# Creating the test/train split
# (test_size=0.8 holds out 80% of the rows for testing; the models train on the remaining 20%)
training_features, test_features,\
training_target, test_target = \
train_test_split(features, target, test_size=0.8, random_state=seed)
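
`test_size` is the fraction of rows held out for testing, so its value decides how much data the models get to train on. A minimal sketch on toy arrays (the names `X`, `y` and the array sizes are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 toy observations with 2 features each
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Same call pattern as above: 80% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.8, random_state=1)

print(X_train.shape, X_test.shape)
# (10, 2) (40, 2)
```

Passing `test_size=0.2` instead would reverse the proportions and leave 40 rows for training.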

<a id='subsection-twelve'></a>
#### *Decision Tree Regressor*

In [None]:
# DECISION TREE REGRESSOR
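    # Hyperparameters Options
    # (criterion names for scikit-learn >= 1.0; older releases call these 'mse' and 'mae')
dectree_params = {'criterion': ['squared_error', 'absolute_error'],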
                  'splitter': ['best', 'random'],
                  'random_state': [seed]}
    # Model
dectree = DecisionTreeRegressor()
    # Hyperparameter Tuning + CrossValidation
dectree_gridcv = GridSearchCV(estimator=dectree, param_grid=dectree_params,
                              scoring='r2', cv=5)
dectree_gridcv.fit(training_features, training_target)
    # Optimal Model Output
dectree_best_est = dectree_gridcv.best_estimator_
dectree_feat_imp = dectree_best_est.feature_importances_
print('BEST ESTIMATOR')
print(dectree_best_est)
print('\n')
    # Scores
        # Training (the r2 printed here is best_score_, the mean cross-validated r2)
dectree_training_prediction = dectree_best_est.predict(training_features)
dectree_best_score = dectree_gridcv.best_score_
dectree_rmse_training = np.sqrt(mean_squared_error(training_target, dectree_training_prediction))
print('TRAINING DATA')
print('r2: {:.4f}'.format(dectree_best_score))
print('RMSE: {:.0f}'.format(dectree_rmse_training))
print('\n')
        # Test
dectree_test_prediction = dectree_best_est.predict(test_features)
dectree_test_score = dectree_gridcv.score(test_features, test_target)
dectree_rmse_test = np.sqrt(mean_squared_error(test_target, dectree_test_prediction))
print('TEST DATA')
print('r2: {:.4f}'.format(dectree_test_score))
print('RMSE: {:.0f}'.format(dectree_rmse_test))
    # Graph
sns.set(rc={'figure.figsize':(15,40)})
        # Feature Importance
feature_names = df.drop('price', axis=1).columns.to_list()
plt.barh(feature_names, dectree_feat_imp)
plt.title('FEATURE IMPORTANCE')
plt.show()
plt.close()

<a id='subsection-twelve_'></a>
#### *XGB Regressor*

In [None]:
# XGBREGRESSOR
    # Hyperparameters Options
xgb_params = {'objective': ['reg:squarederror'],
              'random_state': [seed]}
    # Model
xgb = XGBRegressor()
    # Hyperparameter Tuning + CrossValidation
xgb_gridcv = GridSearchCV(estimator=xgb, param_grid=xgb_params,
                          scoring='r2', cv=5)
xgb_gridcv.fit(training_features, training_target)
    # Optimal Model Output
xgb_best_est = xgb_gridcv.best_estimator_
xgb_feat_imp = xgb_best_est.feature_importances_
print('BEST ESTIMATOR')
print(xgb_best_est)
print('\n')
    # Scores
        # Training (the r2 printed here is best_score_, the mean cross-validated r2)
xgb_training_prediction = xgb_best_est.predict(training_features)
xgb_best_score = xgb_gridcv.best_score_
xgb_rmse_training = np.sqrt(mean_squared_error(training_target, xgb_training_prediction))
print('TRAINING DATA')
print('r2: {:.4f}'.format(xgb_best_score))
print('RMSE: {:.0f}'.format(xgb_rmse_training))
print('\n')
        # Test
xgb_test_prediction = xgb_best_est.predict(test_features)
xgb_test_score = xgb_gridcv.score(test_features, test_target)
xgb_rmse_test = np.sqrt(mean_squared_error(test_target, xgb_test_prediction))
print('TEST DATA')
print('r2: {:.4f}'.format(xgb_test_score))
print('RMSE: {:.0f}'.format(xgb_rmse_test))
    # Graph
sns.set(rc={'figure.figsize':(15,40)})
        # Feature Importance
plt.barh(feature_names, xgb_feat_imp)
plt.title('FEATURE IMPORTANCE')
plt.show()
plt.close()

<a id='section-five'></a>
## Conclusions

<a id='subsection-fourteen'></a>
#### *Optimal Model*

**XGB Regressor is the optimal model**: as the table below shows, it delivers a **higher R2** and a **lower RMSE** on the test data than the Decision Tree Regressor model.

Model | Training r2 | Training RMSE | Test r2 | Test RMSE
- | - | - | - | -
Decision Tree Regressor | 0.8790 | 182 | 0.8906 | 3503
XGB Regressor | 0.9332 | 1918 | 0.9434 | 2520
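
For reference, the two metrics in the table can be computed by hand. The sketch below uses made-up prices and predictions, not the notebook's data, and the helper names `rmse` and `r2` are illustrative. RMSE is in the same units as price, while R2 is the share of price variance that the predictions explain.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up prices: every prediction is off by exactly 1000
actual = [10000, 12000, 15000, 20000]
predicted = [11000, 11000, 16000, 19000]

print(rmse(actual, predicted))  # 1000.0
print(round(r2(actual, predicted), 4))
```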

<a id='subsection-fifteen'></a>
#### *Feature Importance*

The most relevant features based on the two models applied are:
- **age**, drives price down. The older the car, the cheaper it becomes. This is not only due to the age of the vehicle itself, but also to the fact that age is strongly associated with mileage, which also drives the price down.
- **engineSize**, drives price up. Vehicles with more powerful engines are sold at higher prices, as vehicle performance is expected to be better.
- **transmission_Manual**, drives price down. Manual transmission, the most basic transmission type, has on average lower prices than the other transmission types.
- **mileage**, drives price down. The more mileage a vehicle has, the more it deteriorates, increasing the need for repairs. This drives price down, as more future expenses are expected.
- **mpg**, drives price down. My assumption would be that efficient cars (high mpg) are designed to be used and sold as affordable vehicles (low price), whilst inefficient cars (low mpg) are designed to be high-performing vehicles (high price).
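
Tree-based `feature_importances_` are non-negative and do not themselves say whether a feature pushes price up or down, so the directions listed above need a separate check. One rough way to do that, sketched here on synthetic data, is to look at the sign of each feature's correlation with price. The data-generating formula is an assumption made purely for illustration and is not fitted to the notebook's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 500
age = rng.integers(0, 15, size=n).astype(float)
engine_size = rng.choice([1.0, 1.4, 2.0, 3.0], size=n)

# Synthetic prices, assumed to fall with age and rise with engine size
price = 20000 - 1000 * age + 3000 * engine_size + rng.normal(0, 500, size=n)

X = np.column_stack([age, engine_size])
tree = DecisionTreeRegressor(random_state=1).fit(X, price)

# Importances are non-negative and sum to 1; they carry no sign
importances = tree.feature_importances_

# Sign of the linear correlation with price as a rough indicator of direction
directions = {name: float(np.corrcoef(col, price)[0, 1])
              for name, col in [('age', age), ('engineSize', engine_size)]}

print(dict(zip(['age', 'engineSize'], importances)))
print(directions)
```

A simple correlation ignores interactions between features, so it is only a first check on direction.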