<a href="https://colab.research.google.com/github/yanezdavid/Selling-Used-Mercedez/blob/main/Selling_Used_Mercedez.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Question

**How much should I sell my used Mercedes for?**

What is a fair price to sell my Mercedes for on online marketplaces? Using specifications from Mercedes previously sold on online marketplaces, I develop several machine learning models to predict an optimal selling price for my car. The data was obtained from [Kaggle](https://www.kaggle.com/mysarahmadbhat/mercedes-used-car-listing).



In [91]:
from google.colab import drive #link drive
drive.mount('/content/drive')

import pandas as pd #import libraries
import numpy as np

df = pd.read_csv('/content/drive/MyDrive/Data Science/Projects/merc 2.csv') #load dataframe
df.head() #preview data

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
0,SLK,2005,5200,Automatic,63000,Petrol,325,32.1,1.8
1,S Class,2017,34948,Automatic,27000,Hybrid,20,61.4,2.1
2,SL CLASS,2016,49948,Automatic,6200,Petrol,555,28.0,5.5
3,G Class,2016,61948,Automatic,16000,Petrol,325,30.4,4.0
4,G Class,2016,73948,Automatic,4000,Petrol,325,30.1,4.0
...,...,...,...,...,...,...,...,...,...
95,A Class,2019,18141,Automatic,15689,Diesel,145,65.7,1.5
96,GLS Class,2016,37385,Automatic,21169,Diesel,305,37.2,3.0
97,CLS Class,2017,26166,Automatic,9165,Diesel,150,51.4,3.0
98,B Class,2019,24616,Automatic,3710,Diesel,145,55.4,2.0


# Data Dictionary

First, some information about the data.
* model: Mercedes model.
* year: registration year.
* price: price in Euros.
* transmission: type of gearbox.
* mileage: distance driven.
* fuelType: engine fuel.
* tax: road tax.
* mpg: miles per gallon.
* engineSize: size in litres.


# Clean Data

Before the data is ready for machine learning, it is necessary to check for errors, missing values, and other irregularities.

In [65]:
df.info() #check for incorrect datatypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13119 entries, 0 to 13118
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         13119 non-null  object 
 1   year          13119 non-null  int64  
 2   price         13119 non-null  int64  
 3   transmission  13119 non-null  object 
 4   mileage       13119 non-null  int64  
 5   fuelType      13119 non-null  object 
 6   tax           13119 non-null  int64  
 7   mpg           13119 non-null  float64
 8   engineSize    13119 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 922.6+ KB


In [66]:
df.shape #check the number of rows and columns in the dataframe

(13119, 9)

In [67]:
df.isnull().sum() #check for missing data in all columns

model           0
year            0
price           0
transmission    0
mileage         0
fuelType        0
tax             0
mpg             0
engineSize      0
dtype: int64

In [68]:
df.duplicated().sum() #check for duplicated data

259

In [69]:
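percent_duplicated = (df.duplicated().sum() / len(df)) * 100 #compute the share of duplicated rows from the data rather than hard-coding the counts
print('Percentage of duplicated data:', percent_duplicated) #get the % of duplicated data in the dataset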

Percentage of duplicated data: 1.9742358411464287


Approximately 2% of the rows are exact duplicates. With over 13,000 listings and many cars of the same model, some listings are expected to share identical specifications and prices, so these rows are treated as legitimate listings and kept.
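Before deciding to keep duplicated rows, it can help to look at them side by side. The following is a minimal sketch on a made-up toy frame (the `listings` values are assumptions, not rows from the dataset); it uses `duplicated(keep=False)` so that every member of a duplicated group is shown, not only the repeats.

```python
import pandas as pd

# Toy listings with made-up values; rows 0 and 2 are exact duplicates.
listings = pd.DataFrame({
    "model": ["C Class", "A Class", "C Class", "E Class"],
    "year": [2017, 2019, 2017, 2016],
    "price": [20000, 18000, 20000, 22000],
})

# duplicated() flags rows that repeat an earlier row.
n_duplicates = int(listings.duplicated().sum())
percent_duplicated = n_duplicates / len(listings) * 100

# keep=False flags every row that belongs to a duplicated group.
duplicate_groups = listings[listings.duplicated(keep=False)]

print(n_duplicates, percent_duplicated)
print(duplicate_groups)
```

If the grouped rows look like genuinely separate cars, keeping them is reasonable; if they look like the same listing scraped twice, `drop_duplicates()` would be the alternative.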

In [70]:
df['model'].value_counts() #ensure that the categorical data is correctly grouped

 C Class      3747
 A Class      2561
 E Class      1953
 GLC Class     960
 GLA Class     847
 B Class       591
 CL Class      511
 GLE Class     461
 SL CLASS      260
 CLS Class     237
 V Class       207
 S Class       197
 GL Class      121
 SLK            95
 CLA Class      86
 X-CLASS        82
 M Class        79
 GLS Class      74
 GLB Class      19
 G Class        15
 CLK             7
 CLC Class       3
 R Class         2
220              1
230              1
180              1
200              1
Name: model, dtype: int64
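The counts above show most model names with a leading space (for example ` C Class`) alongside a few purely numeric labels. The notebook leaves these labels as they are. As a sketch only, assuming one wanted to normalize the whitespace before encoding, `str.strip()` would do it; the `models` series below is a made-up stand-in for `df['model']`.

```python
import pandas as pd

# Toy labels mimicking the leading spaces seen in the value counts.
models = pd.Series([" C Class", " A Class", "C Class", " SL CLASS", "220"])

# Remove leading and trailing whitespace so identical names group together.
cleaned = models.str.strip()
counts = cleaned.value_counts()

print(counts)
```

Without the strip, ` C Class` and `C Class` would be counted, and later one-hot encoded, as two different models.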

In [71]:
df['transmission'].value_counts() #ensure that the categorical data is correctly grouped

Semi-Auto    6848
Automatic    4825
Manual       1444
Other           2
Name: transmission, dtype: int64

In [72]:
df['fuelType'].value_counts() #ensure that the categorical data is correctly grouped

Diesel    9187
Petrol    3752
Hybrid     173
Other        7
Name: fuelType, dtype: int64

For an American audience, Euros are not a convenient unit for the price of Mercedes cars, so prices are converted to USD.

In [73]:
df['price'] = df['price'] * 1.17 #convert Euros to USD
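The conversion is a single multiplication by a fixed rate. The sketch below works through it for the first two prices in the data preview; the helper name `to_usd` is hypothetical, and the 1.17 rate is the one used in the cell above, not a current exchange rate.

```python
EUR_TO_USD = 1.17  # fixed rate taken from the notebook cell; real rates vary over time

def to_usd(price_eur: float, rate: float = EUR_TO_USD) -> float:
    """Convert a price with one fixed exchange rate, rounded to cents."""
    return round(price_eur * rate, 2)

# First two prices from the data preview (5200 and 34948).
print(to_usd(5200))
print(to_usd(34948))
```

These agree with the prices 6084.0 and 40889.16 that appear for the same two cars once the data has been converted. Because the notebook cell overwrites `price` in place, running it twice would apply the rate twice; writing to a separate column is one way to avoid that.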

# Preprocessing for Machine Learning

With the data cleaned, the next step is preprocessing: encoding categorical columns as numeric values and, where a model requires it, standardizing features.

In [74]:
df = pd.get_dummies(df, columns=['fuelType',
                                 'transmission',
                                 'model']) #encode categorical data for machine learning
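To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame (the `toy` values are assumptions chosen for illustration). Each category becomes its own indicator column, named `<column>_<value>`, and the original column is dropped.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "fuelType": ["Petrol", "Diesel", "Hybrid"],
    "mileage": [63000, 27000, 6200],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["fuelType"])

print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one indicator set per encoded column, which is the pattern visible in the wide `fuelType_*`, `transmission_*`, and `model_*` columns of the encoded data.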

In [75]:
from sklearn.model_selection import train_test_split #import train test split


x = df.drop(columns=['price']) #create feature matrix without the target vector
y = df['price'] #create target vector
x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    random_state=42,
                                                    test_size=.3) #use train test splits to divide data into testing and training sets for machine learning

In [87]:
df.head()

Unnamed: 0,year,price,mileage,tax,mpg,engineSize,fuelType_Diesel,fuelType_Hybrid,fuelType_Other,fuelType_Petrol,transmission_Automatic,transmission_Manual,transmission_Other,transmission_Semi-Auto,model_ A Class,model_ B Class,model_ C Class,model_ CL Class,model_ CLA Class,model_ CLC Class,model_ CLK,model_ CLS Class,model_ E Class,model_ G Class,model_ GL Class,model_ GLA Class,model_ GLB Class,model_ GLC Class,model_ GLE Class,model_ GLS Class,model_ M Class,model_ R Class,model_ S Class,model_ SL CLASS,model_ SLK,model_ V Class,model_ X-CLASS,model_180,model_200,model_220,model_230
0,2005,6084.0,63000,325,32.1,1.8,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,2017,40889.16,27000,20,61.4,2.1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,2016,58439.16,6200,555,28.0,5.5,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,2016,72479.16,16000,325,30.4,4.0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,2016,86519.16,4000,325,30.1,4.0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Machine Learning Models

I fit several strong regression models and compare their test-set performance: XGBoost, Random Forest, and a voting ensemble of the two.

In [76]:
from xgboost import XGBRegressor #import XGBoost
from math import sqrt #import library for necessary calculations
from sklearn.metrics import mean_squared_error #import mean squared error for evaluating machine learning performance

XGB = XGBRegressor() 
XGB.fit(x_train, y_train) #create and fit default XGBoost model with training data
XGB_predicted = XGB.predict(x_test) #generate predictions using testing data

print('RMSE: ', str(sqrt(mean_squared_error(y_test, XGB_predicted)))) #evaluate error using root mean squared error
print('----------------------------------')
print('Testing Score: ', str(XGB.score(x_test, y_test))) #score the model on the test set (R^2 for regressors)

RMSE:  3927.8965492592115
----------------------------------
Testing Score:  0.9274522575620282
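The two numbers reported for each model are the root mean squared error, in the same units as the price, and the coefficient of determination R², which is what `score` returns for scikit-learn style regressors. The sketch below computes both by hand on made-up values (`y_true` and `y_pred` are assumptions, not notebook results) to show what they measure.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """R^2: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up prices and predictions for illustration only.
y_true = [10000.0, 20000.0, 30000.0, 40000.0]
y_pred = [12000.0, 18000.0, 33000.0, 39000.0]

print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))
```

An RMSE near 3,900 therefore means predictions are typically off by a few thousand dollars, while an R² of about 0.93 means the model explains most of the variation in price.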


In [77]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor() 
rf.fit(x_train, y_train) #create and fit default Random Forest model with training data
rf_predicted = rf.predict(x_test) #generate predictions with testing data

print('RMSE: ' + str(sqrt(mean_squared_error(y_test, rf_predicted)))) #evaluate error using root mean squared error
print('---------------------------------')
print('Testing Score: ' + str(rf.score(x_test, y_test))) #score the model on the test set (R^2 for regressors)

RMSE: 3285.9749767049348
---------------------------------
Testing Score: 0.9492270585706621


In [86]:
from sklearn.ensemble import VotingRegressor #import ensemble model

ensemble = VotingRegressor(estimators = [('Random Forest', rf),
                                          ('XGBoost', XGB)]) #create ensemble model using Random Forest and XGBoost
ensemble.fit(x_train, y_train) #fit with training data
ensemble_predicted = ensemble.predict(x_test) #generate predictions using testing data

print('RMSE: ' + str(sqrt(mean_squared_error(y_test, ensemble_predicted)))) #evaluate error using root mean squared error
print('---------------------------------')
print('Testing Score: ' + str(ensemble.score(x_test, y_test))) #score the model on the test set (R^2 for regressors)

RMSE: 3326.3066330383094
---------------------------------
Testing Score: 0.9479730479356447
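With its default settings, `VotingRegressor` predicts the unweighted average of its member models' predictions. The sketch below reproduces that averaging on made-up numbers (`rf_like` and `xgb_like` are assumptions standing in for the two models' outputs); the helper `average_predictions` is hypothetical.

```python
def average_predictions(*prediction_lists):
    """Element-wise mean across several models' predictions."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]

# Made-up predictions for three cars from two models.
rf_like = [20000.0, 35000.0, 50000.0]
xgb_like = [22000.0, 33000.0, 54000.0]

blended = average_predictions(rf_like, xgb_like)
print(blended)
```

This is consistent with the ensemble's RMSE landing between the Random Forest and XGBoost results: an equal-weight average pulls the stronger model toward the weaker one. The `weights` parameter of `VotingRegressor` could be used to favor the Random Forest, though whether that helps here has not been tested.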


# Generate Prediction

In [94]:
x.head()

Unnamed: 0,year,mileage,tax,mpg,engineSize,fuelType_Diesel,fuelType_Hybrid,fuelType_Other,fuelType_Petrol,transmission_Automatic,transmission_Manual,transmission_Other,transmission_Semi-Auto,model_ A Class,model_ B Class,model_ C Class,model_ CL Class,model_ CLA Class,model_ CLC Class,model_ CLK,model_ CLS Class,model_ E Class,model_ G Class,model_ GL Class,model_ GLA Class,model_ GLB Class,model_ GLC Class,model_ GLE Class,model_ GLS Class,model_ M Class,model_ R Class,model_ S Class,model_ SL CLASS,model_ SLK,model_ V Class,model_ X-CLASS,model_180,model_200,model_220,model_230
0,2005,63000,325,32.1,1.8,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,2017,27000,20,61.4,2.1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,2016,6200,555,28.0,5.5,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,2016,16000,325,30.4,4.0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,2016,4000,325,30.1,4.0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
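To get a price for one car, its specifications have to be arranged in the same column layout as `x`. The following is a sketch under stated assumptions: `feature_columns` is a shortened stand-in for `x.columns`, and the car in `my_car` is entirely hypothetical. It encodes the single row with `pd.get_dummies` and then uses `reindex` to add the missing indicator columns as zeros.

```python
import pandas as pd

# Shortened stand-in for x.columns; the real feature matrix has many more columns.
feature_columns = ["year", "mileage", "tax", "mpg", "engineSize",
                   "fuelType_Diesel", "fuelType_Petrol",
                   "transmission_Automatic", "transmission_Manual",
                   "model_ C Class", "model_ E Class"]

# Hypothetical car; none of these values come from the notebook.
my_car = pd.DataFrame([{
    "model": " C Class", "year": 2017, "transmission": "Automatic",
    "mileage": 30000, "fuelType": "Diesel", "tax": 145,
    "mpg": 60.0, "engineSize": 2.0,
}])

# Encode the same categorical columns as in preprocessing.
my_car_encoded = pd.get_dummies(my_car, columns=["fuelType", "transmission", "model"])

# Match the training layout: same columns, same order, absent indicators set to 0.
my_car_aligned = my_car_encoded.reindex(columns=feature_columns, fill_value=0)

print(my_car_aligned.iloc[0].to_dict())
```

In the notebook itself, `feature_columns` would be replaced by `x.columns`, and the aligned row could then be passed to a fitted model, for example `rf.predict(my_car_aligned)`. Note that the model label keeps its leading space so that it matches the `model_ C Class` column name.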
