

# ***Used car price prediction using Machine Learning.***

# **AUTHOR : VARIGONDA SAI NIRMAL VIGNU**


# Dataset link: [True Value Cars](https://www.kaggle.com/focusedmonk/true-value-cars-dataset)

# **CONTENT**


# 1. Context
# 2. Problem Statement
# 3. Data Description
# 4. Importing Libraries
# 5. Loading Train data
# 6. Getting information about Data
# 7. Correlation
# 8. Handling missing values
# 9. Handling Outliers
# 10. Exploratory Data Analysis
# 11. Loading and Handling Test Data
# 12. Transformation for feature variables
# 13. Training our Models

*  Linear Regression
*  Lasso Regression
*  Ridge Regression
*  Random Forest Regression
*  XGBoost Regression

# 14. Model Evaluation
# 15. Comparing Model Performances
# 16. Conclusion




# 1.  Context

**What determines the price of used cars?**

The value of a car drops right from the moment it is bought and the depreciation continues with each passing year.

In fact, in the first year itself, the value of a car decreases by 20 percent of its initial value.

The make and model of a car, total kilometers driven, overall condition of the vehicle and various other factors further affect the car’s resale value.

# 2.Problem Statement

The prices of new cars are fixed by the manufacturer, with some additional costs
imposed by the government in the form of taxes. Customers buying a new car can therefore be
confident that the money they invest is well spent. However, because new cars have become more
expensive and many customers lack the funds to buy them, used car sales are increasing
globally (Pal, Arora and Palakurthy, 2018). There is a need for a used car price
prediction system that effectively determines the worth of a car from a variety of features.
Although some websites offer this service, their prediction methods may not be the
best, and different models and systems may differ in their power to predict a used car's
actual market value. Knowing that market value is important when both buying and
selling.


# 3. Data Description

This dataset contains data on more than 7,000 True Value cars across all major tier 1 and tier 2 cities in India, each ready for a new owner. The information includes the car manufacturer, model, fuel type and year of manufacture, to mention a few.

Content:

**id**: Unique ID for every car

**car_name**: Name of a car

**yr_mfr**: Car manufactured year

**fuel_type**: Type of fuel car runs on

**kms_run**: Number of kilometers run

**body_type**: Car body type. Ex: Sedan, hatchback etc.

**transmission**: Type of transmission. Ex: Manual, Automatic

**variant**: Car variant

**make**: Car manufacturing company

**model**: Car model name

**is_hot**: Is it a top selling car? Indicates the demand for a car.

**car_availability**: Car availability status

**total_owners**: How many owners have already owned it?

**car_rating**: How good is the car to buy?

**fitness_certificate**: Does the car have fitness certificate?

**source**: Method of selling a car

**registered_city**: City where the car is registered

**registered_state**: State where the car is registered

**rto**: Regional Transport Office where the car is registered

**city**: City where the car is being sold

**times_viewed**: Number of times people have shown interest for the car

**assured_buy**: Broker assured car

**broker_quote**: Price quoted for previous owner (in INR)

**original_price**: Original price of a car (in INR)

**emi_starts_from**: Opting for EMI? Monthly EMI for the car starts from! (in INR)

**booking_down_pymnt**: Decided to buy? Please pay the down payment (in INR)

**ad_created_on**: Listed date for selling a car

**reserved**: Car reserved status

**warranty_avail**: Warranty availability status

**sale_price**: Selling price of a car (in INR)

# 4. Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 5. Loading Train data

In [None]:
df=pd.read_csv('../input/true-value-cars-dataset/train.csv')


In [None]:
df.head()

# 6. Getting information about Data

In [None]:
df.shape #to know rows and columns

In [None]:
df.columns #column names

**value count**

In [None]:
df['car_name'].value_counts()

In [None]:
df['city'].value_counts()

In [None]:
df['sale_price'].value_counts()

In [None]:
df.info() #info about each column how many nullvalues and data type of each column

In [None]:
df.nunique(axis=0) #no of unique values in each column

In [None]:
df.duplicated().sum() #no duplicate values

**No duplicate records**

In [None]:
df.isnull().sum() #checking for null values

**The original_price column contains many null values, so we will decide after the correlation analysis whether to remove it.**
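
To make that decision more concrete, it can help to look at the share of missing values per column next to the raw counts. The following is a minimal sketch on a toy frame; the `toy` data and the `missing_report` helper are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and null fractions, worst column first."""
    report = pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "frac_missing": frame.isnull().mean(),
    })
    return report.sort_values("frac_missing", ascending=False)

# Toy data (hypothetical values): half of original_price is missing
toy = pd.DataFrame({
    "original_price": [500000.0, np.nan, np.nan, 420000.0],
    "kms_run": [12000, 45000, 30000, 8000],
})

report = missing_report(toy)
print(report)
```

On the real data the same helper could be called as `missing_report(df)` to see which columns lose the largest share of rows.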

# 7. Correlation

In [None]:
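corr=df.corr(numeric_only=True) #to find correlation between numeric columns only
corr

In [None]:
corr = df.corr(numeric_only=True) #recent pandas versions raise on text columns without numeric_only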
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.figure(figsize=(13,7))
a = sns.heatmap(corr, annot=True, fmt='.2f')
rotx = a.set_xticklabels(a.get_xticklabels(), rotation=90)
roty = a.set_yticklabels(a.get_yticklabels(), rotation=30)

**We can observe above that sale_price, emi_starts_from, booking_down_pymnt, original_price and broker_quote are highly correlated with one another. Since sale_price is our target variable, emi_starts_from, booking_down_pymnt, original_price and broker_quote carry nearly the same information about it.**

**So we can keep any one of them and drop the remaining columns.**

**Here I keep booking_down_pymnt and remove emi_starts_from and broker_quote. I also remove original_price because, as observed above, it has 2824 null values.**

**The id column is removed as well because it does not affect our target column.**
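
The column selection above was read off the heatmap. As a rough sketch of how the same check could be done programmatically, the snippet below lists numeric columns whose absolute Pearson correlation with the target exceeds a threshold. The toy data, the 0.15 multiplier, the 0.9 threshold and the `correlated_with_target` helper are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(100_000, 900_000, size=200)

# Toy frame (hypothetical): one column is an exact multiple of the price,
# the other is unrelated noise
toy = pd.DataFrame({
    "sale_price": price,
    "booking_down_pymnt": price * 0.15,
    "kms_run": rng.uniform(5_000, 150_000, size=200),
})

def correlated_with_target(frame, target, threshold=0.9):
    """List numeric columns whose |Pearson r| with `target` exceeds `threshold`."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

redundant = correlated_with_target(toy, "sale_price")
print(redundant)
```

Columns returned by such a helper are candidates for dropping, since they add little information beyond what one retained column already provides.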

In [None]:
def remove(df):
  df1=df.drop(['id','emi_starts_from','original_price','broker_quote'],axis=1)
  return df1
df1=remove(df)

In [None]:
df1

In [None]:
df1.head()

In [None]:
df1.isnull().sum()

In [None]:
df1.shape

In [None]:
df1.info()

In [None]:
sns.heatmap(df1.isnull(),yticklabels=False,cbar=False)

# 8.Handling missing values

**In our data, body_type, transmission, source, car_availability, car_rating, ad_created_on, fitness_certificate, registered_city and registered_state contain null values. Our next step is to handle this missing data.**

In [None]:
df1.dtypes

In [None]:
for i in df.columns:
  print(i)
  print(df[i].unique())
  print("_____________________________________________________________________")
  #printing unique values of each column

**Here I replace missing values in object-type columns with the column mode and in numeric columns with the column mean.**

In [None]:
category_columns=df1.select_dtypes(include=['object']).columns.tolist()
integer_columns=df1.select_dtypes(include=['int64','float64']).columns.tolist()

for column in df1:
    if df1[column].isnull().any():
        if(column in category_columns):
            df1[column]=df1[column].fillna(df1[column].mode()[0])
        else:
            df1[column]=df1[column].fillna(df1[column].mean())  # call mean(); passing the bare method would not fill with a number

In [None]:
df1.head()

In [None]:
df1.isnull().sum()

In [None]:
sns.heatmap(df1.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')

In [None]:
df1.describe() #gives statistical description about our numerical data

In [None]:
df1.describe(include='object') #description about categorical data

In [None]:
#finding correlation again
corr = df1.corr(numeric_only=True)
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.figure(figsize=(13,7))
a = sns.heatmap(corr, annot=True, fmt='.2f')
rotx = a.set_xticklabels(a.get_xticklabels(), rotation=90)
roty = a.set_yticklabels(a.get_yticklabels(), rotation=30)

# 9.Handling Outliers

In [None]:
for i in integer_columns:
  plt.figure()
  sns.boxplot(x=df1[i])

**We can observe from the box plots above that we have to handle outliers in the kms_run, sale_price and times_viewed columns.**

**As sale_price and booking_down_pymnt are highly correlated, handling outliers in one column will also be reflected in the other.**

**Keeping only rows with kms_run below the maximum value, which drops the row(s) holding the largest reading**

In [None]:
max_km=df1['kms_run'].max()
max_km

In [None]:
df1=df1[df1['kms_run']<max_km]
df1.shape

**From boxplot observations we can take sale_price < 2500000 only**

In [None]:
df1=df1[df1['sale_price']<2500000]
df1.shape

**From boxplot observations we can take times_viewed < 20000 only**

In [None]:
df1=df1[df1['times_viewed']<20000]
df1.shape

In [None]:
df1=df1[df1['yr_mfr']>2005]
df1.shape

In [None]:
#  df1=df1[df1['broker_quote']<2500000]
# df1.shape

In [None]:
# def outlinefree(dataCol):
#     sorted(dataCol)
#         # getting percentile 25 and 75 that will help us for getting IQR (interquartile range)
#     Q1,Q3 = np.percentile(dataCol,[25,75])
#         # getting IQR (interquartile range)
#     IQR = Q3-Q1
#         # getting Lower range error
#     LowerRange = Q1-(1.5 * IQR)
#         # getting upper range error
#     UpperRange = Q3+(1.5 * IQR)
#         # return Lower range and upper range.
#     return LowerRange,UpperRange

In [None]:
# lwyr_mfr,upyr_mfr = outlinefree(df1['yr_mfr'])
# lwkms_run,upkms_run = outlinefree(df1['kms_run'])
# lwsale_price,upsale_price = outlinefree(df1['sale_price'])
# lwtimes_viewed,uptimes_viewed = outlinefree(df1['times_viewed'])
# lwttl_own,upttl_own = outlinefree(df1['total_owners'])
# lwbdwnpy,updwnpy = outlinefree(df1['booking_down_pymnt'])

In [None]:
# df1['yr_mfr'].replace(list(df1[df1['yr_mfr'] < lwyr_mfr].yr_mfr) ,lwyr_mfr,inplace=True)
# df1['kms_run'].replace(list(df1[df1['kms_run'] > upkms_run].kms_run) ,upkms_run,inplace=True)
# df1['sale_price'].replace(list(df1[df1['sale_price'] > upsale_price].sale_price) ,upsale_price,inplace=True)
# df1['times_viewed'].replace(list(df1[df1['times_viewed'] > uptimes_viewed].times_viewed) ,uptimes_viewed,inplace=True)
# #df1['total_owners'].replace(list(df1[df1['total_owners'] > upttl_own].total_owners) ,upttl_own,inplace=True)
# df1['booking_down_pymnt'].replace(list(df1[df1['booking_down_pymnt'] > updwnpy].booking_down_pymnt) ,updwnpy,inplace=True)
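
The commented-out cells above outline an interquartile-range (IQR) approach that the notebook does not actually run. A compact sketch of the same idea is shown here, using `Series.quantile` and `Series.clip` on a toy series; the `iqr_bounds` helper and the sample values are illustrative assumptions.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy readings (hypothetical) with one extreme value
kms = pd.Series([10, 12, 14, 16, 18, 20, 500])

lower, upper = iqr_bounds(kms)
# Cap values outside the fences instead of dropping the rows
clipped = kms.clip(lower=lower, upper=upper)
print(lower, upper)
print(clipped.tolist())
```

Clipping keeps every row and only caps extreme values, whereas the threshold filters used in this section remove the rows entirely.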

In [None]:
for i in integer_columns:
  plt.figure()
  sns.boxplot(x=df1[i])

In [None]:
for col1 in integer_columns:
  sns.FacetGrid(df1,height=5).map(sns.histplot,col1,kde=True).add_legend() #histplot replaces the deprecated distplot

**We can observe that there is some skewness in our data.**
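
Skewness can also be quantified rather than judged by eye. The sketch below uses `Series.skew()` on toy prices and shows how a `log1p` transform reduces right skew. The notebook itself does not apply a log transform, so this is only an illustration under that assumption.

```python
import numpy as np
import pandas as pd

# Toy prices (hypothetical): mostly small values plus one very large one
prices = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], dtype=float) * 100_000

raw_skew = prices.skew()             # positive value indicates a long right tail
log_skew = np.log1p(prices).skew()   # skewness after a log(1 + x) transform

print(f"raw skew: {raw_skew:.2f}")
print(f"log skew: {log_skew:.2f}")
```

A value near zero indicates a roughly symmetric distribution; the same call could be applied to each numeric column of `df1`.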

In [None]:
for i in integer_columns:
  plt.figure()
  sns.displot(df1[i])

# 10.Exploratory Data Analysis

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'#283747','axes.grid': True,'xtick.labelsize':16})
sns.lineplot(x='kms_run',y='sale_price',data=df1)

**From above graph we can say that there is not much relationship between kms_run and sale_price**

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'#283747','axes.grid': True,'xtick.labelsize':16})
sns.lineplot(x='times_viewed',y='sale_price',data=df1)



**From above graph we can say that there is not much relationship between times_viewed and sale_price**

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'white','axes.grid': True,'xtick.labelsize':16})
sns.lineplot(x='yr_mfr',y='sale_price',data=df1,hue='transmission')

**From the above figure we can observe the variation in prices of cars of the two transmission categories in relation to their manufacturing year.**

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'white','axes.grid': False,'xtick.labelsize':16})
sns.lineplot(x='yr_mfr',y='sale_price',data=df,hue='body_type')

**From the above figure we can observe the variation in prices of cars of different body types in relation to their manufacturing year.**

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'#283747','axes.grid': True,'xtick.labelsize':16})
sns.lineplot(x='total_owners',y='sale_price',data=df1)

**From the above figure we can observe how car prices vary with total_owners. Cars with more previous owners generally sell for less.**

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'#283747','axes.grid': True,'xtick.labelsize':16})
sns.countplot(x=df['body_type']) #pass the column by keyword, as recent seaborn versions require

**We can observe that 'hatchback' is the most common body type.**

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'#283747','axes.grid': True,'xtick.labelsize':16})
sns.barplot(x=df['body_type'],y=df['sale_price'])

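**We can observe that the mean sale price is highest for luxury SUVs, at around 1.05 on the plotted axis scale. The error bar (seaborn's confidence interval for the mean, not an interquartile range) spans roughly 1.0 to 1.2.**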

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'#283747','axes.grid': True,'xtick.labelsize':16})
sns.barplot(x=df['transmission'],y=df['sale_price'])

**We can observe that the mean sale price is higher for automatic cars, at about 700000. The error bar (seaborn's confidence interval for the mean) spans roughly 670000 to 750000.**

In [None]:
plt.figure(figsize=(12,8))
sns.set(rc={'axes.facecolor':'#283747','axes.grid': True,'xtick.labelsize':16})
sns.barplot(x=df['city'],y=df['sale_price'])
plt.xticks(rotation=45)

**We can observe that the mean sale price is highest for cars from Chennai, at around 490000. The error bar (seaborn's confidence interval for the mean) spans roughly 470000 to 520000.**
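
The bar heights in these plots are group means, and the error bars drawn by seaborn's `barplot` are by default confidence intervals for the mean. If the quartiles of each group are of interest, they can be computed directly. Here is a minimal sketch on toy data; the values and the `summary` name are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical prices in INR)
toy = pd.DataFrame({
    "transmission": ["manual"] * 4 + ["automatic"] * 4,
    "sale_price": [300000, 350000, 400000, 450000,
                   600000, 700000, 800000, 900000],
})

# Mean plus first and third quartile of sale_price for each group
summary = toy.groupby("transmission")["sale_price"].agg(
    mean="mean",
    q1=lambda s: s.quantile(0.25),
    q3=lambda s: s.quantile(0.75),
)
print(summary)
```

The same aggregation could be run on `df` grouped by `body_type` or `city` to check the values read from the plots.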

In [None]:
#sns.pairplot(df1,kind='kde')

# 11.Loading and Handling Test Data

In [None]:
df_test=pd.read_csv('../input/true-value-cars-dataset/test.csv')

In [None]:
df_test.shape

In [None]:
df_test.head()

In [None]:
df_test.isnull().sum()

**We removed some columns in train data so we have to remove them in test data also**

In [None]:
df_test1=remove(df_test)

In [None]:
df_test1.shape

In [None]:
df_test1.head()

**Handling null values in test data**

In [None]:
category_columns=df_test1.select_dtypes(include=['object']).columns.tolist()
integer_columns=df_test1.select_dtypes(include=['int64','float64']).columns.tolist()

for column in df_test1:
    if df_test1[column].isnull().any():
        if(column in category_columns):
            df_test1[column]=df_test1[column].fillna(df_test1[column].mode()[0])
        else:
            df_test1[column]=df_test1[column].fillna(df_test1[column].mean())  # call mean() so a number is used

In [None]:
df_test1.isnull().sum()
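
The cell above fills test-set gaps with statistics computed from the test set itself. A common alternative, sketched here, is to compute the fill values once on the training data and reuse them for the test data so that both sets are treated identically. The `fit_fill_values` helper and the toy frames are assumptions for illustration and are not what the notebook does.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column: mean for numeric, mode otherwise."""
    fills = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fills[col] = train[col].mean()
        else:
            fills[col] = train[col].mode()[0]
    return fills

# Toy frames (hypothetical values)
train = pd.DataFrame({
    "body_type": ["hatchback", "hatchback", "sedan"],
    "kms_run": [10000.0, 20000.0, 30000.0],
})
test = pd.DataFrame({
    "body_type": [None, "suv"],
    "kms_run": [np.nan, 5000.0],
})

fills = fit_fill_values(train)
test_filled = test.fillna(value=fills)  # test gaps filled with train statistics
print(test_filled)
```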

# 12.Transformation for feature variables

**For Training Data**

**Standard Scaler for numerical data**

In [None]:
X_train=df1.drop('sale_price',axis=1)
Y_train=df1['sale_price'].values

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# get numeric data
num_d = X_train.select_dtypes(exclude=['object'])

# update the cols with their normalized values
X_train[num_d.columns] = sc.fit_transform(num_d)



In [None]:
X_train.head()

In [None]:
X_train.nunique()

In [None]:
# Import label encoder
from sklearn import preprocessing
  
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
  
# Encode labels in categorical_column 
for i in category_columns:
  X_train[i]= label_encoder.fit_transform(X_train[i])
  


In [None]:
X_train.head()

In [None]:
X_train1=X_train.values

In [None]:
X_train1

**For Testing Data**

In [None]:
X_test=df_test1.drop('sale_price',axis=1)
Y_test=df_test1['sale_price'].values

In [None]:
# reuse the scaler already fitted on the training data (sc) so that
# train and test features are on the same scale

# get numeric data
num_d = X_test.select_dtypes(exclude=['object'])

# update the cols with their normalized values (transform only, no refit)
X_test[num_d.columns] = sc.transform(num_d)



In [None]:
# Import label encoder
from sklearn import preprocessing
  
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
  
# Encode labels in categorical_column 
for i in category_columns:
  X_test[i]= label_encoder.fit_transform(X_test[i])
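
Refitting an encoder on the test data can assign a different integer to the same category than it received in the training data. One way to keep the mapping consistent, sketched below with pandas only, is to fix the category list from the training column and reuse it for the test column. The toy series are hypothetical, and mapping unseen categories to -1 is an assumption of this sketch rather than the notebook's behaviour.

```python
import pandas as pd

# Toy columns (hypothetical); "electric" never appears in the training data
train_fuel = pd.Series(["petrol", "diesel", "petrol"])
test_fuel = pd.Series(["diesel", "electric"])

# Fix the category order from the training data only
categories = sorted(train_fuel.unique())

# Same category list for both sets, so each label maps to the same code;
# labels outside the list receive the code -1
train_codes = pd.Categorical(train_fuel, categories=categories).codes
test_codes = pd.Categorical(test_fuel, categories=categories).codes

print(train_codes.tolist())
print(test_codes.tolist())
```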
  


In [None]:
X_test.head()

In [None]:
X_test1=X_test.values
X_test1

In [None]:
Y_test

In [None]:
Y_train

# **13.Training our Models**

# A) Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(X_train1,Y_train)

In [None]:
y_pred=lr.predict(X_test1)

In [None]:
tsc1=lr.score(X_train1,Y_train) #train score, named as in the other models
tsc1

In [None]:
sc1=lr.score(X_test1,Y_test) #test score
sc1

# B) Lasso Regression

In [None]:
from sklearn.linear_model import Lasso
lasso_reg = Lasso()
lasso_reg.fit(X_train1,Y_train)

In [None]:
y_pred2=lasso_reg.predict(X_test1)

In [None]:
tsc2=lasso_reg.score(X_train1,Y_train)
tsc2

In [None]:
sc2=lasso_reg.score(X_test1,Y_test)
sc2

# C) Ridge Regression

In [None]:
from sklearn.linear_model import Ridge
ridge_reg=Ridge()
ridge_reg.fit(X_train1,Y_train)

In [None]:
y_pred3=ridge_reg.predict(X_test1)

In [None]:
tsc3=ridge_reg.score(X_train1,Y_train)
tsc3

In [None]:
sc3=ridge_reg.score(X_test1,Y_test)
sc3

# D) Random Forest Regression

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 20, random_state = 0)
regressor.fit(X_train1, Y_train)

In [None]:
y_pred4=regressor.predict(X_test1)

In [None]:
tsc4=regressor.score(X_train1,Y_train)
tsc4

In [None]:
sc4=regressor.score(X_test1,Y_test)
sc4

# E) XGBOOST Regressor

In [None]:
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train1, Y_train)

In [None]:
y_pred5=xgb.predict(X_test1)

In [None]:
tsc5=xgb.score(X_train1,Y_train) #score the XGBoost model, not the random forest
tsc5

In [None]:
sc5=xgb.score(X_test1,Y_test)
sc5

# 14.Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error,mean_absolute_error
def metric(y_test,y_predict):
    mae=mean_absolute_error(y_test,y_predict) #mean_absolute_error
    mse=mean_squared_error(y_test,y_predict) #mean_squared_error
    rmse=np.sqrt(mse) #root mean squared error; avoids the deprecated squared=False argument
    return [mae,mse,rmse]
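
To make the three metrics concrete, here is a small worked example computed with NumPy alone. The arrays are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_hat            # [-10, 10, -30, 0]
mae = np.mean(np.abs(errors))      # (10 + 10 + 30 + 0) / 4
mse = np.mean(errors ** 2)         # (100 + 100 + 900 + 0) / 4
rmse = np.sqrt(mse)                # square root of the MSE

print(mae, mse, rmse)
```

Because the errors are squared before averaging, MSE and RMSE weight the single large miss (30) more heavily than MAE does.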

In [None]:
linearregressoin=metric(Y_test,y_pred)
linearregressoin.append(sc1)
linearregressoin.append(tsc1)
linearregressoin

In [None]:
lassoregression=metric(Y_test,y_pred2)
lassoregression.append(sc2)
lassoregression.append(tsc2)
lassoregression

In [None]:
Ridgeregression=metric(Y_test,y_pred3)
Ridgeregression.append(sc3)
Ridgeregression.append(tsc3)
Ridgeregression

In [None]:
randomforestregression=metric(Y_test,y_pred4) #separate name so the imported class is not overwritten
randomforestregression.append(sc4)
randomforestregression.append(tsc4)
randomforestregression

In [None]:
xgbregression=metric(Y_test,y_pred5) #separate name so the imported class is not overwritten
xgbregression.append(sc5)
xgbregression.append(tsc5)
xgbregression

In [None]:
algorithms=['Linear Regression','Lasso Regression','Ridge Regression','Random Forest Regression','XGBoost Regression']
#metric() returns [mae,mse,rmse], so the column labels follow that order
eval=pd.DataFrame([linearregressoin,lassoregression,Ridgeregression,randomforestregression,xgbregression],columns=['Mean Absolute Error','Mean Squared Error','Root Mean SquareError','Test Score','Train Score'],index=algorithms)
eval

In [None]:
score=eval['Test Score'].tolist() #test score of each model, in the order of algorithms
score

# 15.Comparing Model Performances

In [None]:

plt.figure(figsize=(15,7))
plt.scatter(algorithms,score,linewidth=2,s=50,marker='s',edgecolors='green')

plt.xlabel("Regression Models") 
plt.ylabel("Scores") 
plt.title("Algorithm Comparison")
plt.show()
df=pd.DataFrame(score,index=algorithms,columns=['score'])
df

In [None]:
algo=['lr','lasso','ridge','rfr','xgbr']
ind = np.arange(len(score))  # the x locations for the groups
width = 0.35  # the width of the bars

fig,ax = plt.subplots()


rects1 = ax.bar(ind - width/2, eval['Mean Squared Error'], width, 
                label='mse')
rects2 = ax.bar(ind + width/2, eval['Root Mean SquareError'], width, 
                label='rmse')

ax.set_ylabel('Scores')
ax.set_title('Algorithm performance')
ax.set_xticks(ind)
ax.set_xticklabels(algo)
ax.legend()


def autolabel(rects, xpos='center'):
   

    ha = {'center': 'center', 'right': 'left', 'left': 'right'}
    offset = {'center': 0, 'right': 1, 'left': -1}

    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(offset[xpos]*3, 3),  # use 3 points offset
                    textcoords="offset points",  # in both directions
                    ha=ha[xpos], va='bottom')


# autolabel(rects1, "left")
# autolabel(rects2, "right")


fig.tight_layout()

plt.show()

df=pd.DataFrame([eval['Mean Squared Error'],eval['Root Mean SquareError']],columns=algorithms,index=['mse','rmse'])
df


# Observation from comparing model performances

 
*   These results suggest that Linear Regression is worthy of further study on this problem.
*   Among all the methods, Linear Regression gives the best results: the run reported a score of 0.9999999999969837 for it, and the second graph shows that it has the lowest MSE and RMSE, which means the least error.
*   So we can use **Linear Regression** for best results.






# **16.Conclusion**

The value of a car drops right from the moment it is bought and the depreciation continues with each passing year.

In fact, in the first year itself, the value of a car decreases by 20 percent of its initial value.

The make and model of a car, total kilometers driven, overall condition of the vehicle and various other factors further affect the car’s resale value.

We observed above that sale_price, emi_starts_from, booking_down_pymnt, original_price and broker_quote are highly correlated, and sale_price is our target variable. I therefore kept booking_down_pymnt and removed the other correlated columns. I also removed original_price because it has 2824 null values.

Five techniques were used in this study.

1.   Linear Regression
2.   Lasso Regression
3.   Ridge Regression
4.   Random Forest Regression
5.   XGBoost Regression

Below are the results observed for all five models.

**Linear Regression**

*   Mean Absolute Error = 26448.012383
*   Mean Squared Error = 1.771641e+09
*   Root Mean Square Error = 42090.863694
*   Test Score = 0.981738
*   Train Score = 0.999997

**Lasso Regression**

*   Mean Absolute Error = 25923.515013
*   Mean Squared Error = 1.789480e+09
*   Root Mean Square Error = 42302.240195
*   Test Score = 0.981554
*   Train Score = 0.999993

**Ridge Regression**

*   Mean Absolute Error = 26446.516419
*   Mean Squared Error = 1.775618e+09
*   Root Mean Square Error = 42138.083184
*   Test Score = 0.981697
*   Train Score = 0.9999999456

**Random Forest Regression**

*   Mean Absolute Error = 28538.731150
*   Mean Squared Error = 5.407858e+09
*   Root Mean Square Error = 73538.138936
*   Test Score = 0.944255
*   Train Score = 0.999924

**XGBoost Regression**

*   Mean Absolute Error = 28202.355617
*   Mean Squared Error = 4.607117e+09
*   Root Mean Square Error = 67875.749496
*   Test Score = 0.944255 (same value as Random Forest)
*   Train Score = 0.999924 (same value as Random Forest)
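
As a quick sanity check on the figures listed above, the root mean square error of each model should equal the square root of its mean squared error. For every model, the reported figure of order 1e9 is the square of the reported RMSE, which identifies it as the mean squared error. The sketch below copies those reported values and confirms they agree to within rounding; the `reported` dictionary name is illustrative.

```python
import math

# (mean squared error, root mean square error) as reported above
reported = {
    "Linear Regression": (1.771641e+09, 42090.863694),
    "Lasso Regression": (1.789480e+09, 42302.240195),
    "Ridge Regression": (1.775618e+09, 42138.083184),
    "Random Forest Regression": (5.407858e+09, 73538.138936),
    "XGBoost Regression": (4.607117e+09, 67875.749496),
}

# Gap between sqrt(MSE) and the reported RMSE for each model
gaps = {name: abs(math.sqrt(mse) - rmse) for name, (mse, rmse) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: |sqrt(MSE) - RMSE| = {gap:.4f}")

# Model with the smallest reported RMSE
best = min(reported, key=lambda name: reported[name][1])
print(best)
```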

These results suggest that Linear Regression is worthy of further study on this problem.

Among all the methods, Linear Regression gives the best results: the run reported a score of 0.9999999999969837 for it, and the second graph shows that it has the lowest MSE and RMSE, which means the least error.

So we can choose **Linear Regression** as our final model for the best predictions.





