# CAR PRICE ESTIMATOR 

The following Jupyter notebook contains a program that uses classification to estimate the price of used cars from several features stored in CSV files. It also documents the thought process and the reasons for the methods used.

***

# IMPORTS

Importing all the necessary libraries and modules.

In [422]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error

Reading the CSV data and defining the parameters. Splitting the data is not required because the training and testing files are separate. In the future, the validity of the model should also be tested on a single data set by allocating part of it, for example 80%, to training and the rest to testing. X is conventionally used for the independent variables and y for the target.

In [423]:
#Importing training and testing data sets
df_train = pd.read_csv('data_train.csv')
df_test = pd.read_csv('data_test.csv') 

#".astype(int)" is used on target values to avoid continuous variable error
#Defining training X and y
X_train = df_train.drop(columns=['price_usd']) 
y_train = df_train['price_usd'].astype(int)                

#defining testing X and y
X_test = df_test.drop(columns=['price_usd'])   
y_test = df_test['price_usd'].astype(int)                               
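
The 80% hold-out mentioned above could be sketched with scikit-learn's `train_test_split`. This is only an illustration: the `toy` frame, its `odometer_value` column, and `random_state=0` are made-up stand-ins, not part of the original notebook or its data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for data_train.csv (illustration only)
toy = pd.DataFrame({
    "odometer_value": [120000, 80000, 150000, 30000, 200000,
                       95000, 60000, 110000, 175000, 45000],
    "price_usd": [4000, 7500, 3000, 12000, 1500,
                  6000, 9000, 5000, 2500, 10500],
})

X = toy.drop(columns=["price_usd"])
y = toy["price_usd"].astype(int)

# Keep 80% of the rows for training and hold out the remaining 20%
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, train_size=0.8, random_state=0
)

print(len(X_fit), len(X_val))
```

The held-out rows play the same role as `data_test.csv` here, but come from the same file as the training rows.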

***

# CLEAN UP

Since the DecisionTreeClassifier will be used for this predictive model, strings within the data must be converted to numerical values. This can be done by applying the LabelEncoder transformer to each column of string values.

In [424]:
#Transforming the training and testing string variables to numerical values

let = LabelEncoder()

#Assigning and transforming the linked variables by using ".fit_transform"

#Training Data

X_train['manufacturer_name_n']=let.fit_transform(X_train['manufacturer_name'])
X_train['transmission_n']=let.fit_transform(X_train['transmission'])
X_train['color_n']=let.fit_transform(X_train['color'])
X_train['engine_fuel_n']=let.fit_transform(X_train['engine_fuel'])
X_train['engine_type_n']=let.fit_transform(X_train['engine_type'])
X_train['body_type_n']=let.fit_transform(X_train['body_type'])
X_train['has_warranty_n']=let.fit_transform(X_train['has_warranty'])
X_train['ownership_n']=let.fit_transform(X_train['ownership'])
X_train['type_of_drive_n']=let.fit_transform(X_train['type_of_drive'])
X_train['is_exchangeable_n']=let.fit_transform(X_train['is_exchangeable'])

#Testing Data

X_test['manufacturer_name_n']=let.fit_transform(X_test['manufacturer_name'])
X_test['transmission_n']=let.fit_transform(X_test['transmission'])
X_test['color_n']=let.fit_transform(X_test['color'])
X_test['engine_fuel_n']=let.fit_transform(X_test['engine_fuel'])
X_test['engine_type_n']=let.fit_transform(X_test['engine_type'])
X_test['body_type_n']=let.fit_transform(X_test['body_type'])
X_test['has_warranty_n']=let.fit_transform(X_test['has_warranty'])
X_test['ownership_n']=let.fit_transform(X_test['ownership'])
X_test['type_of_drive_n']=let.fit_transform(X_test['type_of_drive'])
X_test['is_exchangeable_n']=let.fit_transform(X_test['is_exchangeable'])
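
`LabelEncoder` numbers the categories it sees in sorted order, so an encoder re-fitted on the test file reproduces the training codes only when both files contain exactly the same categories in that column. A minimal sketch of fitting once and reusing the mapping is shown below; the `toy_train` and `toy_test` frames and their values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values for a single categorical column (illustration only)
toy_train = pd.DataFrame({"color": ["black", "red", "white", "red"]})
toy_test = pd.DataFrame({"color": ["red", "white"]})

# Fit once on the training column, then reuse the same mapping on the test column
encoder = LabelEncoder()
toy_train["color_n"] = encoder.fit_transform(toy_train["color"])
toy_test["color_n"] = encoder.transform(toy_test["color"])

# For comparison: re-fitting on the test column alone assigns different codes
refit_codes = LabelEncoder().fit_transform(toy_test["color"])

print(toy_test["color_n"].tolist())
print(refit_codes.tolist())
```

Note that `.transform` raises an error for a category that was absent during fitting, so a real data set may need a strategy for unseen values.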

To avoid any errors with the classifier, the string values must now be dropped from the data set.

In [425]:
#Dropping string values from the training data set

X_train_n = X_train.drop(['manufacturer_name', 'transmission', 'color',
                          'engine_fuel', 'engine_type', 'body_type',
                         'has_warranty', 'ownership', 'type_of_drive',
                         'is_exchangeable'], axis='columns')

#Dropping string values from the testing data set

X_test_nt = X_test.drop(['manufacturer_name', 'transmission', 'color',
                          'engine_fuel', 'engine_type', 'body_type',
                         'has_warranty', 'ownership', 'type_of_drive',
                         'is_exchangeable'], axis='columns')

#After checking for NaN values in the data, 15 occurrences were found and must be dealt with
#Using ".fillna" to assign a value to NaN instances

X_train_n2 = X_train_n.fillna(value = 0)
X_test_nt2 = X_test_nt.fillna(value = 0)

#For the sake of trial and error, 0 was initially assigned to the NaN values
#Other methods of dealing with NaN values include using a different classifier with NaN tolerance
#or simply removing any row that includes a NaN cell
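
The NaN check and the alternatives to filling with 0 might look like the sketch below. The `toy` frame and its column names are hypothetical, and the median fill is one possible choice rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values (illustration only)
toy = pd.DataFrame({
    "engine_capacity": [1.6, np.nan, 2.0, 1.4],
    "year_produced": [2005, 2010, np.nan, 2012],
})

# Count missing values per column
nan_counts = toy.isna().sum()

# Option 1: drop every row that contains at least one NaN
dropped = toy.dropna()

# Option 2: replace NaN with the column median instead of 0
filled = toy.fillna(toy.median())

print(nan_counts.to_dict())
print(len(dropped))
print(filled["engine_capacity"].tolist())
```

Dropping rows loses data, while a median fill keeps every row without inserting a value such as 0 that may lie far outside the column's real range.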

***

# TRAIN AND TEST THE MODEL

Creating the model using the DecisionTreeClassifier, and training it by using ".fit(X, y)".

In [426]:
model = DecisionTreeClassifier()   #creating the model
model.fit(X_train_n2, y_train)      #training the model
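
`DecisionTreeClassifier` permutes the candidate features at each split, so when several splits are equally good, two fits on the same data can produce different trees unless `random_state` is fixed. A small sketch on made-up arrays follows; the seed value `0` and the toy data are arbitrary assumptions, not taken from this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up features and integer price labels (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(40, 3))
y_toy = rng.integers(1000, 1010, size=40)

# Two models built with the same seed make identical tie-breaking choices
model_a = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
model_b = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Compare the split features and thresholds of the two fitted trees
same_structure = (
    np.array_equal(model_a.tree_.feature, model_b.tree_.feature)
    and np.array_equal(model_a.tree_.threshold, model_b.tree_.threshold)
)
print(same_structure)
```

Fixing the seed makes a reported error reproducible from run to run, which helps when comparing changes to the preprocessing.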


The final steps consist of predicting the target values and calculating the mean absolute error in USD.

In [427]:
#predicting the y values for test data

predictions = model.predict(X_test_nt2) 

In [428]:
#Calculating the mean absolute error between the test and predicted data

mean_absolute_error(y_test, predictions)

653.2338888888889
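
For reference, the mean absolute error is the average of the absolute differences between actual and predicted prices. The sketch below checks a hand calculation against `mean_absolute_error` on made-up prices; the arrays are illustrative and not taken from the data set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices in USD (illustration only)
actual = np.array([5000, 7000, 3000, 9000])
predicted = np.array([5500, 6000, 3000, 9500])

# MAE is the mean of the absolute differences
manual_mae = np.mean(np.abs(actual - predicted))

print(manual_mae)
print(mean_absolute_error(actual, predicted))
```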

# CONCLUSION

After several iterations, the MAE ranges between approximately 650 and 670 USD. Considering that the average price of a car in the data set is 6611.33 USD, the MAE is a deviation of around 10%. This can be improved in several ways in the future, such as testing different classifiers or training with more data.
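
The 10% figure follows from dividing the MAE by the average price. A short check using the two numbers reported in this notebook (the variable names are chosen here for illustration):

```python
# Figures reported in this notebook
mae_usd = 653.2338888888889
mean_price_usd = 6611.33

# MAE expressed as a share of the average price
relative_mae = mae_usd / mean_price_usd

print(f"{relative_mae:.1%}")
```

This gives roughly 9.9%, consistent with the rounded 10% quoted above.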