In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import pandas_profiling

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Data Preparation

First we merge the datasets and add a column for the car manufacturer. The dataset ships in several versions and keeps old data around, so we focus only on the most recent version.

In [None]:
data_folder = '/kaggle/input/used-car-dataset-ford-and-mercedes/'
dataset_names = ['bmw', 'merc', 'hyundi', 'ford', 'vauxhall', 'vw', 'audi','skoda', 'toyota']

In [None]:
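frames = []
for dataset_name in dataset_names:
    dataset = pd.read_csv(data_folder + dataset_name + '.csv')
    if dataset_name == 'hyundi':
        # The Hyundai file names the tax column "tax(£)"; align it with the other files
        dataset = dataset.rename(columns={"tax(£)": "tax"})
    dataset['manufacturer'] = dataset_name
    frames.append(dataset)

# Concatenate once instead of growing the DataFrame inside the loop
df = pd.concat(frames, ignore_index=True)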

# Data Exploration

In [None]:
df.info()

In [None]:
df.describe()

The general distribution of the prices is the following:

In [None]:
plt.figure(figsize=(12,7))

plt.title('Price distribution')

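# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df['price'], kde=True)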

We can see there are plenty of outliers. One thing we can explore is whether the models become more accurate if we remove them (or if we perform some data augmentation).
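One way to try the outlier removal mentioned above is an interquartile-range (IQR) filter. This is only a sketch: the helper name `drop_price_outliers`, the multiplier `k=1.5`, and the toy prices are assumptions for illustration, not something taken from the dataset.

```python
import pandas as pd

def drop_price_outliers(frame, column="price", k=1.5):
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]

# Toy prices for illustration only (not taken from the dataset)
toy = pd.DataFrame({"price": [9000, 10000, 11000, 12000, 13000, 150000]})
trimmed = drop_price_outliers(toy)
print(len(toy), "rows before,", len(trimmed), "rows after")
```

If we go this way, the filter should be fitted on the training split only, so that the test set still reflects the prices the model will meet in practice.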

In [None]:
plt.figure(figsize=(14,11))
sns.boxplot(x="manufacturer", y="price", data=df)

We can see that Mercedes, Audi and BMW have outliers with prices far above their brand mean, probably because these brands also sell high-performance models.
Let's take a look at the cars that cost more than £100k.

In [None]:
costly = df[df.price > 100000]
costly.describe()

In [None]:
list(costly['manufacturer'].unique())

We can see that a lot of them are recent models, and that engineSize is also fairly large on average.
As common sense suggests, **we can expect year and engineSize to be good predictors of price**.

In [None]:
cheap = df[df.price < 1000]
cheap.describe()

In [None]:
list(cheap['manufacturer'].unique())

In [None]:
cheap[cheap['manufacturer'] == 'merc']

The only Mercedes is a 2003 model with very high mileage and a small engine size.
**Mileage can also be a good predictor of price** (nothing too surprising here).

Another interesting thing to look at is how much data we have for each manufacturer.

In [None]:
plt.figure(figsize=(7,6))
sns.countplot(x="manufacturer", data=df)

We can clearly see that we have less data for Hyundai, Skoda and Toyota, but it is still sufficient to build predictive machine learning models.
One possibility to explore is using the same number of entries for each manufacturer.
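Using the same number of entries per manufacturer could be done by downsampling every group to the size of the smallest one. The sketch below assumes a hypothetical helper `balance_by_group` and uses toy rows; it relies on `DataFrameGroupBy.sample`, which needs pandas 1.1 or newer.

```python
import pandas as pd

def balance_by_group(frame, group_col="manufacturer", random_state=4242):
    """Downsample every group to the size of the smallest group."""
    n_min = frame[group_col].value_counts().min()
    return (
        frame.groupby(group_col, group_keys=False)
             .sample(n=n_min, random_state=random_state)
             .reset_index(drop=True)
    )

# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "manufacturer": ["ford"] * 5 + ["skoda"] * 2 + ["toyota"] * 3,
    "price": range(10),
})
balanced = balance_by_group(toy)
print(balanced["manufacturer"].value_counts())
```

The trade-off is that downsampling throws away rows from the larger brands, so it is worth comparing model scores with and without it.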

As a last analysis step, I'm going to generate a report with the pandas profiling package and look at the things we missed:

In [None]:
profile = df.profile_report(title="Pandas Profiling Report")

The widget version is easier to access for future reference inside notebooks

In [None]:
profile.to_widgets()

The correlation matrix plot confirms what we were saying about the data: year and engineSize can be good predictors of price.
As expected, there is also a negative correlation between mileage and price.

The positive correlation between tax and price needs further exploration. Since it is not very high, we can be somewhat confident that tax is not calculated from the price (which would make it unusable as a predictor of price). On the other hand, it is not highly correlated with things like engineSize either, which makes me wonder what it actually measures.

Looking more closely at the Kaggle dataset page, I can see that it is "road tax". It is a good idea to learn more about the topic before even thinking about using it as a predictor.

Searching online, I found this article, which explains road tax in the UK well:
https://www.autoexpress.co.uk/car-news/consumer-news/88361/tax-disc-changes-everything-you-need-to-know-about-uk-road-tax





The annual standard rate is £145, and the article contains the following statement:
"**Cars above £40,000 pay £325 annual supplement for five years from the second year of registration.**"

The rate is determined at the car's first registration, so the listing price does not affect this supplement (and should be uncorrelated with it).

Cars **older than 40 years are tax-exempt**.

It is also worth looking at the warnings: the report finds 1475 duplicates (worth exploring to check for scraping errors).

It is also worth checking whether cars with 0 tax really are 40+ years old; otherwise this could be another scraping error. This **can also be useful to generate features for cars that are close to the 40-year mark (their price could increase, since after that they are tax-exempt)**.
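The zero-tax check described above could look like the sketch below. It assumes the `year` and `tax` columns of the merged data; the helper name `zero_tax_age_check`, the `reference_year` of 2020 (the scrape year is not stated in this notebook) and the toy rows are hypothetical.

```python
import pandas as pd

def zero_tax_age_check(frame, reference_year, exempt_age=40):
    """Return the zero-tax rows with their age and whether the exemption age explains them."""
    zero_tax = frame[frame["tax"] == 0].copy()
    zero_tax["age"] = reference_year - zero_tax["year"]
    zero_tax["old_enough"] = zero_tax["age"] >= exempt_age
    return zero_tax

# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({"year": [1975, 2017, 2019], "tax": [0, 0, 145]})
checked = zero_tax_age_check(toy, reference_year=2020)
print(checked)
```

Rows where `old_enough` is false are the ones to inspect: they may be scraping errors, or cars that pay no road tax for a reason other than age.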

The high cardinality of model is not a big concern for us; it mainly means that we shouldn't one-hot encode it in our predictive models.

Lastly, we saw that there are duplicates in the dataset. A good idea is to take a look at them and then drop them.

In [None]:
duplicates = df[df.duplicated(keep=False)] # Just for visualization
duplicates.head()

In [None]:
df.drop_duplicates(ignore_index=True, inplace=True)

In [None]:
df.describe()

We can see that dropping duplicates didn't change much, so our analysis is still valid (if we want to push model accuracy further in the future, we can always come back and explore more).

# Feature selection

In [None]:
categorical_cols = list(df.select_dtypes('object').columns)
categorical_cols

The intuitive idea is to one-hot encode both transmission and fuelType (and maybe even manufacturer, if we decide to build a more specific model given that we have no data for other manufacturers).
But let's first check their cardinality.

In [None]:
threshold = 10 # If less than 10 unique values we suppose it is low cardinality

low_cardinality_cols = [col for col in categorical_cols if df[col].nunique() < threshold]

high_cardinality_cols = list(set(categorical_cols)-set(low_cardinality_cols))

print(low_cardinality_cols)
print(high_cardinality_cols)

We could also check how the model performs without the manufacturer column (and the model column) entirely, so that we can derive a really general model.
This is probably the best way to go, since we want the most general model possible, one that can be extended to other car brands and models.
It also makes it easy to adapt the input without breaking our pipelines when new values appear.

Using several models depending on the input data (and how this could be hidden from the end user) is discussed later.

N.B. We should always keep in mind that the **model performance will degrade over time** as car values keep decreasing, so some strategy is needed to account for this. The simplest one is to scrape new data periodically and retrain a new model. A more advanced strategy that lets us keep old data is to create a deflation model, or to add a column with the period in which the data was scraped and let the model learn the pattern.
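As a rough illustration of the "scrape period" column idea, each scrape could be tagged before old and new data are combined. The helper `tag_scrape_year`, the scrape dates and the toy rows below are all hypothetical; the only thing taken from the dataset is the `year` column.

```python
import pandas as pd

def tag_scrape_year(frame, scraped_on):
    """Return a copy of `frame` with the scrape year and the car's age at that time."""
    tagged = frame.copy()
    tagged["scrape_year"] = pd.Timestamp(scraped_on).year
    tagged["age_at_scrape"] = tagged["scrape_year"] - tagged["year"]
    return tagged

# Two hypothetical scrapes of the same listing, two years apart
old_scrape = pd.DataFrame({"year": [2015], "price": [12000]})
new_scrape = pd.DataFrame({"year": [2015], "price": [10500]})

combined = pd.concat(
    [tag_scrape_year(old_scrape, "2020-06-01"),
     tag_scrape_year(new_scrape, "2022-06-01")],
    ignore_index=True,
)
print(combined)
```

With `age_at_scrape` available, the model can relate price to how old the car was when it was listed, rather than to the registration year alone.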

In [None]:
general_df = df.drop(['model', 'manufacturer'],axis=1)
general_df

The only thing left to do is to one-hot encode the transmission and fuelType columns.

Be careful of the 'dummy variable trap' when one-hot encoding for regression models.
See https://www.algosome.com/articles/dummy-variable-trap-regression.html for more information about this problem.
With pandas get_dummies we can easily avoid it by setting the option drop_first to True.
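To see what the trap looks like, here is a small sketch on a toy column (the category values are illustrative). When every level is kept, the dummy columns of each row sum to 1, so together they duplicate the intercept of a linear regression; `drop_first=True` removes one level and breaks that exact collinearity.

```python
import pandas as pd

# Toy column for illustration only
toy = pd.DataFrame({"transmission": ["Manual", "Automatic", "Semi-Auto", "Manual"]})

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level (alphabetically) dropped

# Every row of the full encoding sums to 1: perfectly collinear with an intercept
print(full.sum(axis=1).tolist())
print(list(full.columns))
print(list(reduced.columns))
```

The dropped level becomes the reference category: the coefficients of the remaining dummies are read as differences from it.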

In [None]:
general_df_encoded = pd.get_dummies(general_df, drop_first=True)
general_df_encoded

In [None]:
y = general_df_encoded['price']
X = general_df_encoded.drop('price', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=4242)

# Baseline model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linreg = LinearRegression().fit(X_train, y_train)

In [None]:
preds = linreg.predict(X_test)

In [None]:
# RMSE; np.sqrt is used because the `squared` argument was removed in recent scikit-learn releases
baseline_rmse = np.sqrt(mean_squared_error(y_test, preds))
baseline_rmse

In [None]:
baseline_mae = mean_absolute_error(y_test, preds)
baseline_mae

The large gap between RMSE and MAE arises because the model tends to mispredict high-priced cars, which are usually outliers, and RMSE penalizes large errors more heavily than MAE. This is something to take into consideration, but for the moment we can skip it.
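A tiny numeric sketch shows why a few large errors pull RMSE away from MAE. The error values below are made up for illustration and are not the notebook's results.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a vector of residuals."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error of a vector of residuals."""
    return float(np.mean(np.abs(errors)))

# Made-up residuals in pounds: all similar vs. one large miss
errors_even = np.array([1000.0, 1000.0, 1000.0, 1000.0])
errors_outlier = np.array([1000.0, 1000.0, 1000.0, 20000.0])

print("even errors:    RMSE", rmse(errors_even), "MAE", mae(errors_even))
print("one large miss: RMSE", rmse(errors_outlier), "MAE", mae(errors_outlier))
```

When all errors have the same size the two metrics coincide; a single large miss raises RMSE much more than MAE, which is the pattern we see in the baseline.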

# Regression model

In [None]:
def score_result(y_test, preds):
    print("----------------")
    # np.sqrt instead of squared=False, which was removed in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    print("RMSE: ", rmse)
    print("MAE: ", mae)
    print("\nImprovement from baseline:")
    print("RMSE Improvement:",  baseline_rmse - rmse)
    print("MAE Improvement:", baseline_mae - mae)
    print("----------------")

The algorithm we will use is Random Forest, as it usually performs very well on regression problems.
Depending on the task, we don't always need a beast of a model: something that works and has good accuracy is enough to test an idea. A more complex architecture can come later (for example, several algorithms of different types combined through some kind of weighted average to get an optimal price estimate).

An alternative is to hand the data to an AutoML library or cloud service, use the resulting model, and think about complex feature engineering and improvements later.
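The weighted-average idea mentioned above could be sketched as follows. The helper `weighted_average_prediction`, the two sets of predictions and the weights are hypothetical; in practice the weights might come from each model's validation score.

```python
import numpy as np

def weighted_average_prediction(predictions, weights):
    """Blend per-model predictions (shape: n_models x n_samples) with per-model weights."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # np.average normalizes by the sum of the weights
    return np.average(predictions, axis=0, weights=weights)

# Hypothetical predictions from two models for two cars
model_a = [10000, 20000]
model_b = [12000, 18000]

blended = weighted_average_prediction([model_a, model_b], weights=[3, 1])
print(blended)
```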

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf_basic = RandomForestRegressor(random_state=4242)
rf_basic.fit(X_train, y_train)

rf_basic_preds = rf_basic.predict(X_test)

In [None]:
score_result(y_test, rf_basic_preds)  # arguments in the order the function expects: (y_test, preds)

# Final considerations

This is just a simple notebook that aims to do basic analysis and model creation for the dataset at hand. In future releases I plan to analyze the data further and create a production architecture that could dramatically increase performance by choosing the right model based on the input data.

The model as it is can already provide a rough price range (for example, we can use prediction - RMSE and prediction + RMSE as the bounds).
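A minimal sketch of that range, assuming a hypothetical helper `price_range` and made-up numbers. The lower bound is clipped at zero because a negative price makes no sense. Note that this is a rough band, not a calibrated prediction interval.

```python
import numpy as np

def price_range(predictions, rmse):
    """Return (lower, upper) bounds as prediction -/+ rmse, with the lower bound clipped at 0."""
    predictions = np.asarray(predictions, dtype=float)
    lower = np.clip(predictions - rmse, 0, None)
    upper = predictions + rmse
    return lower, upper

# Made-up predictions and RMSE for illustration only
lower, upper = price_range([10000, 1500], rmse=2000)
for lo, hi in zip(lower, upper):
    print(f"estimated price between £{lo:,.0f} and £{hi:,.0f}")
```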