*1. INTRODUCTION*

We will work with the vehicle dataset from CarDekho, which contains information about used cars listed on www.cardekho.com.

We will predict the selling price of a used car based on various factors such as:

        Kms_Driven     -> kilometers driven

        Year           -> year of purchase

        Present_Price  -> present price of a new car

        Fuel_Type      -> type of fuel used (petrol, diesel, CNG)

        Transmission   -> automatic or manual gear transmission

        Owner          -> number of previous owners

        Seller_Type    -> dealer or individual

In [None]:
import pandas as pd

import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns
 
%matplotlib inline 
%config InlineBackend.figure_format = 'retina' 
import warnings  
warnings.filterwarnings('ignore') 
import os 
print(os.listdir("../input")) 

In [None]:
cars=pd.read_csv("../input/car data.csv")
cars.sample()

Let us look at how many entries the dataset contains for each vehicle model.

In [None]:
cars['Car_Name'].value_counts()

*2. Data handling*

Check for null values. Since our data has none, no imputation or row removal is needed.

In [None]:
cars.isnull().any()
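
A per-column count of missing values can be more informative than a yes/no flag. The sketch below uses a small made-up frame (not the CarDekho data) to show the difference between `isnull().any()` and `isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    "Kms_Driven": [27000, np.nan, 6900],
    "Fuel_Type": ["Petrol", "Diesel", None],
    "Owner": [0, 0, 1],
})

has_null = toy.isnull().any()     # True/False per column
null_counts = toy.isnull().sum()  # number of missing entries per column

print(has_null)
print(null_counts)
```

If any count were non-zero for `cars`, the affected rows would have to be dropped or imputed before modelling.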

Convert the categorical variables (Fuel_Type, Transmission, Seller_Type) from text labels into numeric form so that they are machine-readable.

Machine learning algorithms generally expect numeric inputs, so each label must be mapped to a number before a model can be fitted.

We use a label encoder to map the values of Fuel_Type, Transmission and Seller_Type to integer codes: 0 and 1 for the two-valued columns, and 0, 1 and 2 for Fuel_Type.

In [None]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes to the sorted unique labels of a column.
# The codes go into new columns; the original text columns are left unchanged.
lb = LabelEncoder()
cars["trans_code"] = lb.fit_transform(cars["Transmission"])
cars["seller_code"] = lb.fit_transform(cars["Seller_Type"])
cars["fuel_code"] = lb.fit_transform(cars["Fuel_Type"])
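
To make the encoding concrete, here is a minimal sketch on a hand-written list of fuel labels (toy values, not read from the dataset). It shows which integer each label receives; the variable names `fuel`, `encoder` and `mapping` are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration; the real column is cars["Fuel_Type"].
fuel = ["Petrol", "Diesel", "CNG", "Petrol"]

encoder = LabelEncoder()
codes = encoder.fit_transform(fuel)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'CNG': 0, 'Diesel': 1, 'Petrol': 2}
print(codes.tolist())  # [2, 1, 0, 2]
```

The codes follow alphabetical order, so the numeric ordering carries no meaning of its own. A linear model will still treat it as an ordering, which is worth keeping in mind when reading the coefficient of `fuel_code`.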

We combine the two variables Year and Kms_Driven into a new feature, Rating, which measures the average number of kilometers driven per day.

Rating is the ratio of the total kilometers driven to the approximate number of days since the vehicle was purchased, taking 2019 as the reference year.

In [None]:
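year = cars["Year"]
mile = cars["Kms_Driven"]
# Approximate age in days, taking 2019 as the reference year.
# Note: a car with Year == 2019 would make the denominator zero.
rate = mile / ((2019 - year) * 365)
cars["Rating"] = rate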
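
As a worked example of the Rating formula, the sketch below applies it to two invented cars. The helper `daily_usage` and its `reference_year` parameter are hypothetical names introduced here for illustration.

```python
import pandas as pd

def daily_usage(kms_driven, year, reference_year=2019):
    """Average kilometers per day since purchase (365-day years)."""
    age_days = (reference_year - year) * 365
    return kms_driven / age_days

# Toy rows for illustration only.
toy = pd.DataFrame({"Year": [2014, 2017], "Kms_Driven": [27000, 6900]})
toy["Rating"] = daily_usage(toy["Kms_Driven"], toy["Year"])

# 27000 / (5 * 365) is about 14.8 km per day; 6900 / (2 * 365) is about 9.5 km per day
print(toy)
```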

*3. Understanding the data and Data Visualization*

Plot the relationship between Selling_Price and the numerical variables.

The following variables can play an important role in this problem:

    1. Kms_Driven 
    2. Year
    3. Present_Price

In [None]:
var = 'Kms_Driven'
data = pd.concat([cars["Selling_Price"], cars[var]], axis=1)
data.plot.scatter(x=var, y='Selling_Price');

    Kms_Driven 

As expected, this graph shows a negative relationship between kilometers driven and the selling price.
The more kilometers a car has been driven, the lower its selling price, because higher mileage means more wear and tear on its parts, which reduces its value.

In [None]:
var = 'Year'
data = pd.concat([cars["Selling_Price"], cars[var]], axis=1)
data.plot.scatter(x=var, y='Selling_Price',);

    Year 

Older cars tend to have a shorter remaining life and hence sell at a lower price than newer cars maintained in good condition.

In [None]:
var = 'Present_Price'
data = pd.concat([cars["Selling_Price"], cars[var]], axis=1)
data.plot.scatter(x=var, y='Selling_Price',);

    Present_Price 

We can observe a roughly linear relationship. There are no points above the line Selling_Price = Present_Price because the selling price of a used vehicle is always lower than the present price of a new one.
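
The visual impressions from the scatter plots can also be summarised with correlation coefficients. This is a minimal sketch on an invented frame that reuses the notebook's column names; the actual coefficients would have to be computed on `cars`.

```python
import pandas as pd

# Toy values for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "Selling_Price": [8.0, 6.5, 5.0, 3.0],
    "Kms_Driven": [10000, 30000, 50000, 70000],
    "Present_Price": [10.0, 9.0, 7.0, 5.5],
})

# Pearson correlation of every other column with the target
corr_with_price = toy.corr()["Selling_Price"].drop("Selling_Price")
print(corr_with_price)
```

A negative coefficient matches a downward-sloping scatter plot and a positive one matches an upward-sloping plot.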

*4. Training and testing a model*

We split the data into a training set and a test set.
We first fit each model on the training set.
Then we predict the dependent variable for the test set.
Finally, we compare the predictions with the actual values and obtain a score for the performance of each algorithm.
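
The split, fit, predict and score steps can be sketched on synthetic data as follows. The data, the coefficients and the fixed `random_state` are assumptions made for this illustration. `LinearRegression` is used here only as a stand-in model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# 1. Split: 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit on the training rows only.
model = LinearRegression().fit(X_train, y_train)

# 3. Predict the held-out rows.
y_pred = model.predict(X_test)

# 4. Score: R squared of the predictions against the held-out targets.
test_r2 = model.score(X_test, y_test)
print("R squared on the test set =", test_r2)
```

Fixing `random_state` makes the split reproducible, so repeated runs score the same rows.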

    Linear regressor  

Build a linear regression model by ordinary least squares and test it on the given data set.
We check the significance of each feature by testing whether its p-value is below the significance level of 0.05.

Fit the model for different combinations of independent variables to see which yields the best results.  
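
To show where the p-values in a regression summary come from, here is a sketch that computes them by hand with NumPy and SciPy on synthetic data. The helper `ols_p_values` is a hypothetical name, and the data are generated so that only the first feature influences the target.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Return OLS coefficients and two-sided p-values (intercept first)."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend the intercept column
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                    # least-squares estimate
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - p)    # unbiased noise variance
    std_err = np.sqrt(np.diag(sigma2 * xtx_inv))
    t_stats = beta / std_err
    p_values = 2.0 * stats.t.sf(np.abs(t_stats), df=n - p)
    return beta, p_values

# Synthetic data: the second feature has no effect on y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=200)

beta, p_values = ols_p_values(X, y)
significant = p_values < 0.05   # True where the 0.05 threshold is met
print(beta)
print(p_values)
print(significant)
```

A feature whose p-value stays above 0.05 is a candidate for removal, which is the reasoning behind trying different combinations of independent variables.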

In [None]:
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

In [None]:
X=cars[["Present_Price","fuel_code","seller_code","trans_code","Year","Kms_Driven","Owner"]]
y=cars["Selling_Price"]
# Add the intercept column before splitting so that the model is fitted with it
X = sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model = sm.OLS(y_train.astype(float), X_train.astype(float)).fit()
predictions = model.predict(X_test)
model.summary()

In [None]:
X=cars[["Present_Price","fuel_code","seller_code","trans_code","Rating","Owner"]]
# Add the intercept column before splitting so that the model is fitted with it
X = sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model = sm.OLS(y_train.astype(float), X_train.astype(float)).fit()
predictions = model.predict(X_test)
model.summary()

We can see that multicollinearity exists between two of our variables, Year and Kms_Driven.
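
One common way to quantify multicollinearity is the variance inflation factor (VIF): each feature is regressed on the others, and VIF = 1 / (1 - R squared). The sketch below is a NumPy-only illustration on synthetic columns. The helper name is hypothetical, and the usual rule of thumb that values above roughly 5 to 10 signal a problem is a convention, not a strict threshold.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X, obtained by regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])  # with intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic columns: b is nearly a copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)  # large for a and b, close to 1 for c
```

Applied to the notebook's features, a high VIF for both Year and Kms_Driven would support the observation above.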

In [None]:
X=cars[["Present_Price","fuel_code","seller_code","trans_code","Year","Kms_Driven","Owner"]]
# Add the intercept column before splitting so that the model is fitted with it
X = sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model1 = sm.OLS(y_train.astype(float), X_train.astype(float)).fit()
r1=model1.rsquared
print('R squared value=', r1)

predictions = model1.predict(X_test)
predictions=pd.DataFrame(predictions)
predictions=predictions.reset_index()
test_index=y_test.reset_index()["Selling_Price"]
ax=test_index.plot(label="originals",figsize=(12,6),linewidth=2,color="r")
ax=predictions[0].plot(label = "predictions",figsize=(12,6),linewidth=2,color="g")
plt.legend(loc='upper right')
plt.title("Linear Regressor")
plt.xlabel("index")
plt.ylabel("values")
plt.show()

Now we remove the variables one at a time to see which combination gives the best result.
If we replace the Year and Kms_Driven variables with the Rating feature that we created, the following results are obtained:

In [None]:
X=cars[["Present_Price","fuel_code","trans_code","seller_code","Rating","Owner"]]
# Add the intercept column before splitting so that the model is fitted with it
X = sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model2 = sm.OLS(y_train.astype(float), X_train.astype(float)).fit()
r2=model2.rsquared
print('R squared value=', r2)

Removing Present_Price, we see that the performance of the model drops drastically:

In [None]:
X=cars[["fuel_code","seller_code","trans_code","Year","Kms_Driven","Owner"]]
# Add the intercept column before splitting so that the model is fitted with it
X = sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model3 = sm.OLS(y_train.astype(float), X_train.astype(float)).fit()
r3=model3.rsquared
print('R squared value=', r3)

Similarly, removing the categorical variables one at a time, we observe a slightly higher R-squared value when we drop the Seller_Type feature:

In [None]:
X=cars[["Present_Price","fuel_code","trans_code","Year","Kms_Driven","Owner"]]
# Add the intercept column before splitting so that the model is fitted with it
X = sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model4 = sm.OLS(y_train.astype(float), X_train.astype(float)).fit()
r4=model4.rsquared
print('R squared value=', r4)

Removing Fuel type:

In [None]:
X=cars[["Present_Price","seller_code","trans_code","Year","Kms_Driven","Owner"]]
# Add the intercept column before splitting so that the model is fitted with it
X = sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model5 = sm.OLS(y_train.astype(float), X_train.astype(float)).fit()
r5=model5.rsquared
print('R squared value=', r5)

Removing Transmission:

In [None]:
X=cars[["Present_Price","fuel_code","seller_code","Year","Kms_Driven","Owner"]]
# Add the intercept column before splitting so that the model is fitted with it
X = sm.add_constant(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model6 = sm.OLS(y_train.astype(float), X_train.astype(float)).fit()
r6=model6.rsquared
print('R squared value=', r6)

In [None]:
##Plot results of linear regression by considering different features

data=[r1,r2,r3,r4,r5,r6]

plt.subplots(figsize = (15,8))
ax = plt.plot(data)
plt.axis([0, 5, 0.6, 1])
plt.ylabel('Rsquared value',size=25)
plt.xlabel('Trial',size=25)
labels = (['default', 'w rating', 'w/o present \n price','w/o seller','w/o fuel','w/o \n transmission'])
val = [0,1,2,3,4,5]  
plt.xticks(val, labels);
plt.show()
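
Each trial above draws its own random split, so part of the difference between two R squared values may come from the split rather than from the features. The sketch below shows one way to score every feature subset on the same held-out rows. It uses synthetic data and scikit-learn's `LinearRegression` as a stand-in for the OLS model, and `compare_feature_sets` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def compare_feature_sets(X, y, feature_sets, seed=0):
    """Score each column subset on one shared train/test split."""
    rows = np.arange(len(y))
    train_rows, test_rows = train_test_split(
        rows, test_size=0.2, random_state=seed
    )
    scores = {}
    for name, cols in feature_sets.items():
        model = LinearRegression().fit(X[np.ix_(train_rows, cols)], y[train_rows])
        scores[name] = model.score(X[np.ix_(test_rows, cols)], y[test_rows])
    return scores

# Synthetic data: column 0 is a strong predictor, column 1 a weak one,
# column 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=300)

scores = compare_feature_sets(
    X, y, {"all": [0, 1, 2], "without column 0": [1, 2]}
)
print(scores)
```

Dropping the strong predictor lowers the score sharply, which mirrors the effect of removing Present_Price.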

    AdaBoost regressor  

Adaptive boosting (AdaBoost) is a boosting technique that combines multiple weak learners into a single strong learner; for regression, the weak learners are regressors rather than classifiers.

Each new predictor is trained on a training set in which the difficult examples are increasingly represented. This is achieved through either weighting or resampling.
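
The reweighting idea can be sketched in a few lines. The function below performs one weight update in the style of AdaBoost.R2 with a linear loss, the variant that scikit-learn's `AdaBoostRegressor` documents as its algorithm. The helper name and the toy numbers are assumptions for illustration, and the sketch leaves out the stopping rule used when the average loss reaches 0.5.

```python
import numpy as np

def adaboost_r2_reweight(weights, y_true, y_pred):
    """One AdaBoost.R2-style weight update with the linear loss.

    Assumes at least one non-zero error and an average loss below 0.5.
    """
    abs_err = np.abs(y_true - y_pred)
    loss = abs_err / abs_err.max()           # per-sample loss in [0, 1]
    avg_loss = np.sum(weights * loss)        # weighted average loss
    beta = avg_loss / (1.0 - avg_loss)       # below 1 when avg_loss < 0.5
    updated = weights * beta ** (1.0 - loss) # well-fitted samples shrink most
    return updated / updated.sum()           # renormalise to sum to 1

# Toy example: the last sample is predicted badly.
y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 2.1, 2.9, 6.0])
weights = np.full(4, 0.25)

new_weights = adaboost_r2_reweight(weights, y_true, y_pred)
print(new_weights)
```

After the update, the badly predicted sample carries the largest weight, so the next weak learner concentrates on it.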

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression

In [None]:
regr = AdaBoostRegressor()
X=cars[["Present_Price","fuel_code","trans_code","Year","Kms_Driven","Owner"]]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
model=regr.fit(X_train, y_train)
# Score on the held-out test set only; scoring on all of X would include the training rows
x3=regr.score(X_test,y_test)
print('R squared value=', x3)
predictions = model.predict(X_test)
predictions=pd.DataFrame(predictions)

predictions=predictions.reset_index()
test_index=y_test.reset_index()["Selling_Price"]
ax=test_index.plot(label="originals",figsize=(12,6),linewidth=2,color="r")
ax=predictions[0].plot(label = "predictions",figsize=(12,6),linewidth=2,color="g")
plt.legend(loc='upper right')
plt.title("ADABOOST Regressor")
plt.xlabel("index")
plt.ylabel("values")
plt.show()

    Decision tree regressor 

A decision tree regressor predicts a numeric target by recursively splitting the feature space with simple threshold rules; each leaf predicts the mean target value of the training samples that fall into it.

The decision tree algorithm falls under the category of supervised learning algorithms. It works for both continuous and categorical output variables.
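
A tree of depth one makes the mechanism easy to see. In this toy sketch (invented data, one feature) the tree is limited to a single split, and each side predicts the mean target of its training samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: two well-separated groups along a single feature.
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])

# max_depth=1 allows exactly one split, giving two leaves.
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Each prediction is the mean target of the leaf the sample falls into.
preds = stump.predict([[2], [11]])
print(preds)  # [2. 8.]
```

A deeper tree repeats the same step inside each leaf, which lets it follow non-linear relationships but also makes it easier to overfit.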

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

In [None]:
X=cars[["Present_Price","fuel_code","trans_code","Year","Kms_Driven","Owner"]]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.1)
dtr.fit(X_train,y_train)
predicts=dtr.predict(X_test)
prediction=pd.DataFrame(predicts)
R_2=r2_score(y_test,prediction)

# Printing results
print(dtr,"\n")
print("R squared value=",R_2,"\n")

# Plot for prediction vs originals
test_index=y_test.reset_index()["Selling_Price"]
ax=test_index.plot(label="originals",figsize=(12,6),linewidth=2,color="r")
ax=prediction[0].plot(label = "predictions",figsize=(12,6),linewidth=2,color="g")
plt.legend(loc='upper right')
plt.title("Decision Tree Regressor")
plt.xlabel("index")
plt.ylabel("values")
plt.show()

    Decision tree regression with Adaboost 

In [None]:
regr_2 = AdaBoostRegressor(DecisionTreeRegressor())

model=regr_2.fit(X_train, y_train)

y_2 = regr_2.predict(X_test)

x4=regr_2.score(X_test,y_test)
print('R squared value=', +x4)

predictions=pd.DataFrame(y_2)
predictions=predictions.reset_index()
test_index=y_test.reset_index()["Selling_Price"]
ax=test_index.plot(label="originals",figsize=(12,6),linewidth=2,color="r")
ax=predictions[0].plot(label = "predictions",figsize=(12,6),linewidth=2,color="g")
plt.legend(loc='upper right')
plt.title("Decision tree with ADABOOST Regressor")
plt.xlabel("index")
plt.ylabel("values")
plt.show()

*5. Conclusions*

Even though we obtained a fairly high R-squared value for the linear regression algorithm, our results were improved by the AdaBoost technique.

We tested the model for different combinations of independent variables to figure out which yields the best results.

The decision tree model turned out to be the best-suited algorithm for predicting the selling price of used cars.

In [None]:
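from tabulate import tabulate
# x3 comes from AdaBoostRegressor with its default base estimator (a decision tree), not a linear model
rows = [['Linear regression', r1], ['AdaBoost regressor', x3], ['Decision tree regressor', R_2], ['Decision tree with AdaBoost', x4]]
print(tabulate(rows, headers=['Model used', 'Rsquared value']))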
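
Because each score above comes from its own random split, small differences between models may reflect the split rather than the model. One possible refinement, sketched here on synthetic data, is to score all four models on the same cross-validation folds with `cross_val_score`. The data, the seeds and the number of folds are assumptions for this illustration, and the resulting numbers say nothing about the CarDekho results.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with one linear and one non-linear effect.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.3, size=300)

models = {
    "Linear regression": LinearRegression(),
    "AdaBoost regressor": AdaBoostRegressor(random_state=0),
    "Decision tree regressor": DecisionTreeRegressor(random_state=0),
    "Decision tree with AdaBoost": AdaBoostRegressor(
        DecisionTreeRegressor(random_state=0), random_state=0
    ),
}

# The same five folds are reused for every model.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = {
    name: cross_val_score(model, X, y, cv=folds, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in cv_r2.items():
    print(name, round(score, 3))
```

Averaging over folds gives every model the same test rows, so the ranking depends less on one lucky or unlucky split.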