A (good) used car is a better (depreciating) asset to own than a new one

Using the Craigslist used car dataset, restricted to the last 20 years' worth of listings, I use the algorithms below to predict the value of used cars from 10 features:

* Linear Regression 
* Decision Trees
* Bagging 
* Random Forest
* Adaptive Boosting 
* Gradient Boosting 
* XGBoost


In [None]:
# Install missingno to visualize missing data
#!pip install missingno

# Install xgboost - an algorithm used in this project
#!pip install xgboost

# Utilities
import os

# Numpy & Pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
#import pandas_profiling as pp

# Models
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor, GradientBoostingRegressor
import xgboost as xgb
from sklearn import metrics
import missingno as msno

# Others (warnings etc)
from warnings import simplefilter
%matplotlib inline


In [None]:
# Declare variables required
DASHES = '-' * 10
TABS = '\t' * 8
pd.set_option('mode.chained_assignment', None)
pd.set_option('display.max_columns', 12)
pd.set_option('display.expand_frame_repr', False)
pd.options.display.float_format = '{:,.2f}'.format

# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# Define a function to show values on bar charts
def show_values_on_bars(axs, space=0.4):
    def _show_on_single_plot(ax):        
        for p in ax.patches:
            _x = p.get_x() + p.get_width() / 2
            _y = p.get_y() + p.get_height()
            value = '{:.0f}'.format(p.get_height())
            ax.text(_x, _y, value, ha="center", va="bottom") 

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)


In [None]:
# Read from input file

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        with open(os.path.join(dirname, filename),  encoding='utf-8') as f:
            # The with-block closes the file automatically
            %time vehicles_df_full = pd.read_csv(f)
        
# Print read info
print(f'Read {len(vehicles_df_full)} lines from the file vehicles.csv\n\n')

vehicles_df_full.info()

In [None]:
# Review the completeness of data

msno.bar(vehicles_df_full.sample(5000))


In [None]:
# Determine and remove the columns to drop based on the above graph
cols_to_drop = ['id','url', 'region', 'region_url', 'VIN', 'image_url', 'description', \
    'county', 'size', 'paint_color', 'drive', 'cylinders', 'state', 'lat','long']
vehicles_df = vehicles_df_full.drop(columns=cols_to_drop)

# Remove the larger data frame from memory
del vehicles_df_full

# Get info of the new data frame
vehicles_df.info()

In [None]:
# Preview the new dataframe
vehicles_df


>*With the previews and description of the variables above, it's immediately apparent that some first-level cleanup is required. Continuing to do so...*

In [None]:
# Initial cleaning up
# Drop NaNs and duplicates
vehicles_df.dropna(inplace=True)
vehicles_df.drop_duplicates(inplace=True)

# Update index and change data type of year to string
vehicles_df.index = range(len(vehicles_df))
vehicles_df.year = vehicles_df.year.astype(int).astype(str)

vehicles_df



# Data Visualization & Cleaning
  
Visualizing the data reveals patterns that are not obvious to the human eye when reviewing raw data

Correlation matrices, histograms, category, scatter & box plots have helped identify relationships

The data was cleaned based on the visualizations:
* Removed NaNs & duplicates
* Kept prices between 2k and 50k
* Kept odometer readings between 100 and 200k, etc.

In the section below, features that would help with better prediction are identified
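
The cleanup rules listed above can be sketched as a single filter. This is only an illustration on made-up rows, not the notebook's actual pipeline: the helper name `apply_basic_filters` and the toy values are assumptions, while the bounds mirror the ones used in the cells that follow (price from 2,000 to 50,000 inclusive, odometer above 100 and up to 200,000).

```python
import pandas as pd

def apply_basic_filters(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop NaNs and duplicates, then keep plausible prices and mileages."""
    out = df.dropna().drop_duplicates()
    price_ok = out["price"].between(2_000, 50_000)  # inclusive on both ends
    odo_ok = (out["odometer"] > 100) & (out["odometer"] <= 200_000)
    return out[price_ok & odo_ok].reset_index(drop=True)

# Toy listings (made-up values) to show which rows survive
toy = pd.DataFrame({
    "price":    [500, 12_000, 12_000, 3_700_000_000, 25_000, 8_000],
    "odometer": [90_000, 60_000, 60_000, 40_000, 10_000_000, None],
})

cleaned = apply_basic_filters(toy)
print(cleaned)
```

Only one toy row survives: the others are removed for a too-low price, being a duplicate, an unrealistic price, an unrealistic odometer reading, or a missing value.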


---






In [None]:
# Describing the dataset to get a basic idea of the non-categorical features
vehicles_df.describe()

In [None]:
# Looking at the target column "price" first
f, ax = plt.subplots(figsize=(12, 8))
ax.set_title('Price Distribution', pad=12)
sns.histplot(vehicles_df, x="price", stat='count', bins=5)
show_values_on_bars(ax)

*It appears that the price ranges between 0 and an unrealistic $3.7B*

*To keep things simple and realistic, making a subset of prices between 2k and 50k*

---



In [None]:
vehicles_prc = vehicles_df[(vehicles_df.price >=2000) & (vehicles_df.price <=50000)]

# Then plot the distribution again
f, ax = plt.subplots(figsize=(12, 8))
ax.set_title('Price Distribution', pad=12)
sns.histplot(vehicles_prc, x="price", stat='count', bins=20)
show_values_on_bars(ax)


In [None]:

# Check for skewness
print(f"Skewness for odometer: {round(vehicles_prc['odometer'].skew(),2)}\n\n")
sns.displot(data=vehicles_prc, x="odometer", aspect=2, height=5, kde=True)


*It's evident that the distribution is highly skewed and that there's some bad data, such as maximum odometer readings of 10 million miles.*

*Let's work on cleaning up some of that data*

*Doing some research, I found that Americans drive an average of 14,300 miles per year, according to the [Federal Highway Administration](https://www.thezebra.com/resources/driving/average-miles-driven-per-year/).*

*Let's look at the entries for odometer = 0 and odometer > 200k.*



In [None]:
print(vehicles_prc[(vehicles_prc.odometer == 0)].describe())
print('\n')
print(vehicles_prc[(vehicles_prc.odometer > 200000)].describe())

 *Based on the stats above, it's fair to assume that odometer readings between 100 (CPO) and 200k (20 years old) make a good dataset to continue with.*

In [None]:
# Filtering the dataset and verifying again
vehicles_odo = vehicles_prc[(vehicles_prc.odometer >100) & (vehicles_prc.odometer <=200000)]

print(pd.DataFrame(vehicles_odo.odometer).describe())

print(f"\n\nSkewness for odometer: {round(vehicles_odo['odometer'].skew(),2)}\n\n")
sns.displot(data=vehicles_odo, x="odometer", aspect=2, height=5, kde=True)


*And with that, the skewness comes down from 41.63 to just 0.37. Although the distribution is still positively skewed, it's worth exploring what log and square-root transforms can do.*

---



In [None]:

# Log
odo_log = np.log(vehicles_odo['odometer'])
print(f"Skewness for Log of Odometer Readings: {round(odo_log.skew(),2)}\n")
sns.displot(data=odo_log, aspect=2, height=5, kde=True, legend=True)


# Square Root
odo_sqrt = np.sqrt(vehicles_odo['odometer'])
print(f"Skewness for Square Root of Odometer Readings: {round(odo_sqrt.skew(),2)}\n\n")
sns.displot(data=odo_sqrt, aspect=2, height=5, kde=True, legend=True)



*That didn't help, so proceeding without log or sqrt. The next step is to see how the age of cars and the odometer readings are related to the price of cars.*


---

In [None]:
f, ax = plt.subplots(figsize=(20, 10))
ax.set_title('Price vs Year', pad=12)
fig = sns.boxplot(x=vehicles_odo.year.astype(int), y='price', data=vehicles_odo)
plt.xticks(rotation=90);

*It appears that there is some inconsistency in the first 2/3rds of the dataset.*

*Price seems to rise consistently from 2000 onwards until about 2021, and there seems to be some bad data for 2022 as well.*

*Filtering the dataset between 2000 and 2020 for further analysis*

---



In [None]:
year_list = list(range(2000, 2021))

vehicles_year = vehicles_odo[vehicles_odo.year.astype(int).isin(year_list)]

# Plot again to visualize distribution
f, ax = plt.subplots(figsize=(12, 8))
ax.set_title('Price vs Year', pad=12)
fig = sns.boxplot(x=vehicles_year.year.astype(int), y='price', data=vehicles_year)
plt.xticks(rotation=90);

*With this 20-year set of used cars, the next step is to see how the three features come together and reflect real-world characteristics.*
*Checking how price varies with mean odometer readings over the age of the car posted.*

---


In [None]:
# Calculate age of the posted car using "posting date"
# Convert posting date to datetime (normalized to UTC, then made timezone-naive)
vehicles_year.posting_date = pd.to_datetime(vehicles_year.posting_date, utc=True).dt.tz_convert(None)

# Add a new field for age of cars
vehicles_year['age'] = vehicles_year.posting_date.dt.year.astype(int) - vehicles_year.year.astype(int)

# Get a preview of the changes
vehicles_year.head()

In [None]:
# Get mean of odometer readings by age
# Select the numeric columns before averaging so non-numeric columns are not aggregated
grp_df = vehicles_year.groupby(by='age')[['price','odometer']].mean().astype(int).reset_index()

# Visualize how odometer average readings vary with price over age of cars
# Set axes and points 
x = grp_df.odometer
y = grp_df.price
points = grp_df.age
s = [30*n for n in range(len(y))]

f, ax = plt.subplots(figsize=(12, 8))
# Plot for each year
plt.title(f"Mean of Odometer vs Price over age of cars")
plt.xlabel("Odometer Readings (mean)")
plt.ylabel("Price (mean) ($)")

# Draw the points once, then label each point with the age of the car
plt.scatter(x, y, s=s)
for i, age in enumerate(points):
    plt.annotate(age, (x[i], y[i]), size=14, va="bottom", ha="center")

plt.show()

*It's evident from the visualization above that cars that have been driven less are more expensive than older cars that have been driven more. There seems to be a good chunk of cars under 10k that have been driven 120k or more and are 12 years or older, which is an interesting insight.*

---



In [None]:
sns.catplot(x='condition', y ='price', hue='title_status', data=vehicles_year,
             kind="bar", aspect=2, height=5)

*Since we want to look only at used cars, ignoring new cars for the moment.*

*It also looks like some listings are for parts only, which might affect the price.*

*Removing both of these categories.*

---

In [None]:
vehicles_used = vehicles_year[vehicles_year.condition != 'new']

vehicles_used = vehicles_used[vehicles_used.title_status != 'parts only']

sns.catplot(x='condition', y ='price', hue='title_status', data=vehicles_used, 
            kind="bar", aspect=2, height=5)

#del vehicles_year

*Next, understanding how the price of cars is affected by the fuel and transmission features...*

In [None]:
# Categorical plot between fuel and price for each type of transmission
sns.catplot(x='type', y ='price', hue='fuel', col='transmission', data=vehicles_used, kind="bar", 
            aspect=3, height=4,  palette="rocket", col_wrap=1)


*From the above visualization, it's noted that the "other" values for fuel type and transmission account for a considerable volume of data.*

*These add little value and might affect the overall accuracy, hence removing them.*

---


In [None]:
# Remove "other" types of fuel
vehicles_used = vehicles_used[(vehicles_used.fuel != 'other')]

# Remove "other" type of transmissions
vehicles_used = vehicles_used[(vehicles_used.transmission != 'other')]

# Plot again to visualize
sns.catplot(x='type', y ='price', hue='fuel', col='transmission', data=vehicles_used, kind="bar", 
            aspect=3, height=4,  palette="rocket", col_wrap=1)


*Next, we see how price is related to different kinds of manufacturers and the models they produce.*

---

In [None]:
# Visualize the relationship of average price by manufacturer
# Select the price column before averaging so non-numeric columns are not aggregated
grp_man_df = vehicles_used.groupby(by='manufacturer')['price'].mean().reset_index()

x = grp_man_df.manufacturer
y = grp_man_df.price
# Horizontal reference line at the median of the per-manufacturer mean prices
y_median = [np.median(y)]*len(grp_man_df)


f, ax = plt.subplots(figsize=(12, 8))
ax.scatter(x, y, s=y/100)
ax.plot(x, y_median, label='Median price', linestyle='--')

plt.title(f"Mean Prices by Manufacturer")
plt.ylabel("Price (mean)")
plt.xlabel("Manufacturer")
plt.xticks(rotation=90)
plt.legend()
plt.show()


*It's evident that luxury brands have a higher price, but except for a couple of outliers, the median price lies near most points.*

*Finally, we explore the "model" feature which I imagine has the highest cardinality amongst all the features we've seen so far..*

---


In [None]:
# Add a field for row numbers
vehicles_used['row_num'] = np.arange(len(vehicles_used))

# Get counts of models
model_df = vehicles_used.groupby(by="model").count()['row_num'].reset_index()
model_df.columns=(['model','count'])

# Get only the 10 most frequent models
lar10_df = model_df.nlargest(10, columns='count')
lar10_df.index = range(len(lar10_df))

# Get the combined count of all other models and append it to the end of the data frame
other_val_sum = model_df.loc[~model_df['model'].isin(lar10_df.model), 'count'].sum()
lar10_df.loc[10] = ['Other Models',other_val_sum]

# Plot what the counts of models look like
f, ax = plt.subplots(figsize=(12, 8))
ax.set_title('Count Distribution of Car Models', pad=12)
sns.barplot(x="model", y="count",  palette="icefire",  data=lar10_df)
show_values_on_bars(ax)

*As noted above, the "model" field has very high cardinality and would have to be encoded with one of the encoders described [in this (slightly older) article](https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159).*

 *I do want to note here that I want to first experiment without dropping this high-cardinality feature. If the accuracy turns out to be too low, it's worth exploring the models without this feature.*
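
One encoder often suggested for high-cardinality columns is frequency encoding, which replaces each category with its share of rows. The sketch below is only an illustration on made-up data and is not what this notebook does (label encoding is used further down); the column name `model_freq` and the toy values are assumptions.

```python
import pandas as pd

# Toy listings; model names and prices are made up for illustration
toy = pd.DataFrame({
    "model": ["f-150", "civic", "f-150", "camry", "f-150", "civic"],
    "price": [30_000, 12_000, 28_000, 15_000, 32_000, 11_000],
})

# Frequency encoding: map each category to the fraction of rows it occupies
freq = toy["model"].value_counts(normalize=True)
toy["model_freq"] = toy["model"].map(freq)

print(toy)
```

A known limitation is that two models with the same frequency become indistinguishable. In a real pipeline the frequencies would be computed on the training split only and then applied to the test split.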
 
 ---

# Encoding Categorical Data

*Since almost all features in this dataset are categorical, they have to be encoded. I use label encoding.*
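
As a quick illustration of what label encoding does, here is a minimal sketch on a made-up list of fuel types (the sample values are assumptions, not rows from the dataset). scikit-learn's `LabelEncoder` assigns integer codes following the sorted order of the categories.

```python
from sklearn.preprocessing import LabelEncoder

fuels = ["gas", "diesel", "hybrid", "gas", "electric"]  # made-up sample

le = LabelEncoder()
codes = le.fit_transform(fuels)

print(list(le.classes_))                   # categories in sorted order
print(list(codes))                         # one integer code per row
print(list(le.inverse_transform(codes)))   # round-trip back to the labels
```

The codes carry an arbitrary (alphabetical) order. Tree-based models can usually work around that, while a linear model will read the codes as magnitudes, which is worth keeping in mind when comparing results across algorithms.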

---








In [None]:
# Get current information of the dataset
vehicles_used.info()

# Drop columns populated during clean-up or not required
vehicles_used.drop(columns=['posting_date','row_num'], inplace=True)

In [None]:
# Make a copy of the data frame for encoding
vehicles_used_enc = vehicles_used.copy()
vehicles_used_enc.info()


In [None]:
# Get all categorical (non-numeric) fields, including "model"
cat_features = vehicles_used_enc.select_dtypes(exclude=np.number).columns.to_list()
print(f'Categorical features: {cat_features}\n\n')

# Encode using LabelEncoder
for c in cat_features:
    le = LabelEncoder()
    # Fit on the column's string values and replace them with integer codes
    vehicles_used_enc[c] = le.fit_transform(vehicles_used_enc[c].astype(str))

# Encode "model" using OneHotEncoder
# model_arr = vehicles_used_enc.model.values.reshape(-1,1)
# oh = OneHotEncoder()
# model_encoded = oh.fit_transform(model_arr)
# vehicles_used_enc.model = oh.transform(model_arr)


vehicles_used_enc

In [None]:
# Get the correlation matrix for the encoded data frame
f, ax = plt.subplots(figsize=(12, 10))
ax.set_title('Encoded Correlation Heatmap for Used Vehicles Dataset', pad=12)
sns.heatmap(vehicles_used_enc.corr(), vmin=-1, vmax=1, annot=True, cmap='Spectral')


# Preparing Data and Modeling
## *Data Prep*


In [None]:
# X will be all features except price
feature_cols = vehicles_used_enc.columns.values.tolist()
feature_cols.remove('price')
#feature_cols.remove('model')
X = vehicles_used_enc[feature_cols]

# Y will be the target col = price
Y = vehicles_used_enc['price']

print(f"X (features):\n\n{X}")

print(f"\nY (target):\n\n{Y}")

In [None]:
#Splitting the dataset into training and testing sets for modeling later

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)



## *Modeling*

*Since the target field is non-categorical, I use regression rather than classification.*
*I start with Linear Regression and explore the variants of Decision Tree based models as below:*

Decision Trees --> Bagging --> Random Forest --> Boosting --> Gradient Boosting --> XGBoost

*References*
1. [Titanic Data Science Solutions (Models)](https://www.kaggle.com/startupsci/titanic-data-science-solutions?scriptVersionId=10431564&cellId=77)
2. [XGBoost Algorithm](https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d)

---

*To start with, define a function for regression metrics*

1. R² measures how much of the variability in the dependent variable can be explained by the model.
2. While R² is a relative measure of how well the model fits the dependent variable, Mean Squared Error (MSE) is an absolute measure of the goodness of fit.
3. Mean Absolute Error (MAE) is similar to MSE; however, unlike MSE, MAE averages the ABSOLUTE values of the errors rather than their squares.

*[Regression Metrics Reference](https://towardsdatascience.com/regression-an-explanation-of-regression-metrics-and-what-can-go-wrong-a39a9793d914)*
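
To make the metrics above concrete, here is a small worked example computed by hand with NumPy. The prices and predictions are made up, and the single-feature assumption (`p = 1`) for adjusted R² is only for illustration.

```python
import numpy as np

y_true = np.array([10_000.0, 15_000.0, 20_000.0, 25_000.0])  # made-up prices
y_pred = np.array([12_000.0, 14_000.0, 19_000.0, 27_000.0])  # made-up predictions

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # Mean Squared Error
rmse = np.sqrt(mse)               # Root Mean Squared Error, in the target's units
mae = np.mean(np.abs(errors))     # Mean Absolute Error

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

# Adjusted R² penalizes extra predictors: n samples, p features (p = 1 assumed)
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.0f}  RMSE={rmse:.2f}  MAE={mae:.0f}  R²={r2:.2f}  Adj. R²={adj_r2:.2f}")
```

Because the errors are squared, the two 2,000-dollar misses dominate MSE, while MAE weighs every dollar of error equally.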

In [None]:
# Define a function for output statistics
def reg_metrics(pred_model, x_train, x_test, y_train, y_test):
    """ Function takes in training and testing sets, prediction model, 
    and outputs the below metrics:
    1. R² or Coefficient of Determination.
    2. Adjusted R²
    3. Mean Squared Error(MSE)
    4. Root-Mean-Squared-Error(RMSE).
    5. Mean-Absolute-Error(MAE).
    """
    # Get predicted values on x_test
    y_pred = pred_model.predict(x_test)

    #1 & 2 Coefficient of Determination (R² & Adjusted R²)
    print("\n\t--- Coefficient of Determination (R² & Adjusted R²) ---")
    r2 = metrics.r2_score(y_pred=y_pred, y_true=y_test)
    # R² above is computed on the test set, so adjust with the test set's shape
    n, p = x_test.shape
    adj_r2 = 1 - (1-r2)*(n-1)/(n-p-1)

    print(f"R²\t\t: {round(r2, 2)}")
    print(f"Adjusted R²\t: {round(adj_r2, 2)}")
    

    #3 & 4. MSE and RMSE
    print("\n\t--- Mean Squared Error (MSE & RMSE) ---")
   
    mse = metrics.mean_squared_error(y_pred=y_pred, y_true=y_test)
    # Take the root directly; the `squared` argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mse)

    print(f"MSE\t: {round(mse, 2)}")
    print(f"RMSE\t: {round(rmse, 2)}")


    #5. MAE
    print("\n\t--- Mean Absolute Error (MAE) ---")
    mae = metrics.mean_absolute_error(y_pred=y_pred, y_true=y_test)
    print(f"MAE\t: {round(mae, 2)}")
    
    # Return "accuracy": for regressors, score() is R², reported here as a percentage
    train_acc = round(pred_model.score(x_train, y_train)*100, 2)
    test_acc = round(pred_model.score(x_test, y_test)*100, 2)

    return (train_acc, test_acc)


In [None]:
# Define a dataframe to summarize accuracies for later
algo_list = ['Linear Regression','Decision Trees', 'Bagging', 'Random Forest', 'Adaptive Boosting', 'Gradient Boosting', 'XGBoost']
acc_cols = ['Training Accuracy', 'Testing Accuracy']

acc_df = pd.DataFrame(columns=acc_cols, index=algo_list)

acc_df.index.name='Algorithm'


In [None]:
# Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(x_train, y_train)
print("\t------- Linear Regression -------")
linreg_acc = reg_metrics(linear_reg, x_train, x_test, y_train, y_test)

acc_df.loc['Linear Regression'] = linreg_acc

In [None]:
# Decision Tree 
# A graphical representation of possible solutions to a decision based on certain conditions

dtree_reg = DecisionTreeRegressor()
dtree_reg.fit(x_train, y_train)

print("\t------- Decision Tree Regression -------")
dtree_acc = reg_metrics(dtree_reg, x_train, x_test, y_train, y_test)

acc_df.loc['Decision Trees'] = dtree_acc

In [None]:
# Bagging Regression
# Meta-algorithm combining predictions from multiple decision trees,
#  each fit on a bootstrap sample, by averaging them (voting applies to classification)

bag_reg = BaggingRegressor()
bag_reg.fit(x_train, y_train)

print("\t------- Bagging Regression -------")
bag_acc = reg_metrics(bag_reg, x_train, x_test, y_train, y_test)

acc_df.loc['Bagging'] = bag_acc
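
To show the idea behind bagging for regression, here is a hand-rolled sketch: each tree is fit on a bootstrap sample and the per-tree predictions are averaged. It runs on synthetic data (an assumption, not the vehicles dataset) and is not how `BaggingRegressor` is implemented internally; the number of trees and the variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic data (assumption): a noisy linear relation
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=200)

n_trees = 10
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# For regression, bagging averages the per-tree predictions
X_new = np.array([[2.0], [5.0], [8.0]])
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 3)
bagged = per_tree.mean(axis=0)

print(bagged)
```

Averaging smooths out the noise that each fully grown tree memorizes from its own bootstrap sample, which is why the bagged model tends to generalize better than a single tree.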



In [None]:
# Random Forest Regression
# Bagging-based algorithm where only a subset of features are selected at
# random to build a forest or collection of decision trees

rf_reg = RandomForestRegressor()
rf_reg.fit(x_train, y_train)

print("\t------- Random Forest Regression -------")
rf_acc = reg_metrics(rf_reg, x_train, x_test, y_train, y_test)

acc_df.loc['Random Forest'] = rf_acc

In [None]:
# Adaboost Regression
# Models are built sequentially by minimizing the errors from previous models while
# increasing (or boosting) the influence of high-performing models
ab_reg = AdaBoostRegressor()
ab_reg.fit(x_train, y_train)


print("\t------- Adaboost Regression -------")
ab_acc = reg_metrics(ab_reg, x_train, x_test, y_train, y_test)

acc_df.loc['Adaptive Boosting'] = ab_acc


In [None]:
# Gradient Boosting Regression
# Gradient Boosting employs gradient descent algorithm to minimize errors in sequential models

gb_reg = GradientBoostingRegressor()
gb_reg.fit(x_train, y_train)


print("\t------- Gradient Boosting Regression -------")
gb_acc = reg_metrics(gb_reg, x_train, x_test, y_train, y_test)

acc_df.loc['Gradient Boosting'] = gb_acc
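
The sequential idea behind gradient boosting can be sketched by hand for squared error: start from the mean, then repeatedly fit a small tree to the current residuals (the negative gradient of the squared-error loss) and add a damped version of its prediction. The data is synthetic, and the learning rate, depth, and number of rounds are illustrative assumptions rather than the defaults of `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumption): a noisy non-linear relation
X = rng.uniform(0, 10, size=(300, 1))
y = 10 * np.sin(X[:, 0]) + rng.normal(0, 1.0, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target
pred = np.full_like(y, y.mean())
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(n_rounds):
    residuals = y - pred  # negative gradient of 0.5 * squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)  # small step toward the residuals
    mse_history.append(np.mean((y - pred) ** 2))

print(f"Training MSE: {mse_history[0]:.2f} -> {mse_history[-1]:.2f}")
```

Each round corrects a little of what the previous rounds got wrong, so the training error shrinks step by step. The small learning rate trades more rounds for a lower risk of overfitting.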


In [None]:
# XGBoost
# Optimized Gradient Boosting algorithm through parallel processing, tree-pruning,
# handling missing values and regularization to avoid overfitting/bias

xgb_reg = xgb.XGBRegressor() 
xgb_reg.fit(x_train, y_train)

print("\t------- XGBoost Regression -------")
xgb_acc = reg_metrics(xgb_reg, x_train, x_test, y_train, y_test)

acc_df.loc['XGBoost'] = xgb_acc

**Great Accuracy!**

*Now moving on to summarizing and concluding this project.*
*From the above, it's evident that Random Forest gives the best accuracy (R² score) of ~90%.*

# *Summary & Conclusion*

In [None]:
# All accuracies from the above algorithms
acc_df.astype(str) + '%'

In [None]:
# Rearrange the data frame
acc_df.columns=['Training','Testing']
acc_plot_df = acc_df.reset_index().melt(id_vars=['Algorithm'])
acc_plot_df.columns=['Algorithm','Dataset','Accuracy']

# Plot the final accuracies
f, ax = plt.subplots(figsize=(12, 8))
sns.barplot(x="Algorithm", y="Accuracy", hue="Dataset", data=acc_plot_df)
ax.set_title("Accuracies by Algorithm", pad=12)
show_values_on_bars(ax)



>**From the above, it's evident that each algorithm has its own way of working, but they all accomplish a common goal.**


>**Which one to choose and the best one to use would have to be determined per requirements at hand.**