# Car Price Prediction Project (Flip Robo Technologies)

With the impact of COVID-19, the car market has changed considerably. Some cars are now in high demand and therefore costlier, while others are in low demand and therefore cheaper. One of our clients works with small traders who sell used cars. Because of these market changes, the client is facing problems with their previous car price valuation machine learning models, so they are looking for new models built from new data. We have to build a car price valuation model. This project has two phases:

### Data Collection Phase

You have to scrape data for at least 5,000 used cars. You can scrape more if you wish; in general, the more data, the better the model.

In this phase you need to scrape used car data from websites (Olx, cardekho, Cars24, etc.) and fetch it for different locations. There is no limit on the number of columns; it is up to you and your creativity. Typical columns are brand, model, variant, manufacturing year, driven kilometers, fuel, number of owners, location, and finally the target variable, the price of the car. This list is only a hint about the important variables in a used car model. You can add or remove columns, depending on the website from which you are fetching the data.
Try to include all types of cars in your data, for example SUVs, sedans, coupes, minivans, and hatchbacks.
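
The column list above can be pinned down as a fixed record layout before scraping starts, so that every site's rows end up in the same CSV shape. The following is a minimal sketch using only the standard library; the class name `UsedCarRecord`, the helper `records_to_csv`, the field names, and the sample values are all hypothetical placeholders, not part of the original project.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class UsedCarRecord:
    """One scraped listing; fields mirror the columns suggested in the text."""
    brand: str
    model: str
    variant: str
    manufacturing_year: int
    driven_km: int
    fuel: str
    owners: int
    location: str
    price: int  # target variable


def records_to_csv(records):
    """Serialize records to CSV text with a fixed header order."""
    buffer = io.StringIO()
    header = [f.name for f in fields(UsedCarRecord)]
    writer = csv.DictWriter(buffer, fieldnames=header)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder values for illustration only, not a real listing.
sample = [UsedCarRecord("BrandA", "ModelX", "Base", 2018, 25000, "Petrol", 1, "CityA", 500000)]
csv_text = records_to_csv(sample)
```

Fixing the header order up front makes it easier to concatenate files scraped from different websites later.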

#### Note: The data you are collecting is important to us. Please do not share it on any public platform.

### Model Building Phase

After collecting the data, you need to build a machine learning model. Complete all data pre-processing steps before model building, then try different models with different hyperparameters and select the best one.
Follow the complete data science life cycle, including all of the following steps:

1. Data Cleaning
2. Exploratory Data Analysis
3. Data Pre-processing
4. Model Building
5. Model Evaluation
6. Selecting the best model
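
The steps above can be sketched end to end with a scikit-learn `Pipeline`, which keeps pre-processing and the model together so both are fitted on the training split only. This is an illustrative sketch, not the notebook's method: the toy frame, its column names, and the choice of `OneHotEncoder` plus `RandomForestRegressor` are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame for illustration only; not the project data.
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", "Diesel"] * 10,
    "transmission": ["Manual", "Manual", "Automatic", "Automatic"] * 10,
    "year": [2012, 2014, 2016, 2018] * 10,
    "price": [300000, 400000, 500000, 600000] * 10,
})
features = toy.drop(columns=["price"])
target = toy["price"]

# Pre-processing: one-hot encode the categorical columns, pass the rest through.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["fuel", "transmission"])],
    remainder="passthrough",
)

# Model building: the pipeline applies the pre-processing before the regressor.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# Model evaluation: hold out 25% of the rows and score R2 on them.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)
pipeline.fit(x_train, y_train)
test_r2 = pipeline.score(x_test, y_test)
```

Model selection would then repeat the last step for several candidate regressors and compare their held-out scores.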

In [1]:
# Let's import the necessary libraries

import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Let's import the dataset

data = pd.read_csv("Used_Car_Data.csv")
data

Unnamed: 0,Model Year,Regestered Year,Fuel Type,Engine Type,RTO,Transmission,Insurance Type,Price
0,2018,Dec-18,Petrol,1197 cc,DL9C,Manual,Not Available,611700
1,2014,Apr-15,Diesel,1248 cc,DL8C,Manual,Not Available,543000
2,2015,May-15,Diesel,1498 cc,HR51,Manual,Third Party insurance,556000
3,2018,Nov-18,Petrol,1197 cc,HR26,Manual,Third Party insurance,502500
4,2012,Feb-12,Diesel,1396 cc,DL2C,Manual,Third Party insurance,295000
...,...,...,...,...,...,...,...,...
4995,2018,Apr-19,Diesel,1248 cc,UP14,Automatic,Third Party insurance,711500
4996,2018,Jul-18,Diesel,1248 cc,6 2022,HR51,Manual,680500
4997,2013,Aug-13,Diesel,1498 cc,DL10,Manual,Not Available,405000
4998,2017,Apr-17,Diesel,1498 cc,4 2022,DL10,Manual,684500
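
In the preview above, rows 4996 and 4998 appear to have values shifted by one column (an RTO code sits under Transmission, and "Manual" sits under Insurance Type). A minimal sketch of how such rows could be flagged is shown below; it rebuilds only the three affected columns from the preview, and the set of valid transmission labels is an assumption.

```python
import pandas as pd

# Values copied from the preview; the last two rows show the shifted columns.
preview = pd.DataFrame(
    {
        "RTO": ["DL9C", "UP14", "6 2022", "4 2022"],
        "Transmission": ["Manual", "Automatic", "HR51", "DL10"],
        "Insurance Type": ["Not Available", "Third Party insurance", "Manual", "Manual"],
    },
    index=[0, 4995, 4996, 4998],
)

# Assumption: these are the only valid transmission labels.
valid_transmissions = {"Manual", "Automatic"}

# Rows whose Transmission value is not a known label are likely misaligned.
shifted = preview[~preview["Transmission"].isin(valid_transmissions)]
shifted_index = shifted.index.tolist()
```

On the full dataset, the same filter applied to `data` would show how many rows need to be repaired or dropped before encoding.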


In [3]:
# Shape of the dataset

data.shape

(5000, 8)

In [4]:
# Quick information about the dataset

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Model Year       5000 non-null   int64 
 1   Regestered Year  5000 non-null   object
 2   Fuel Type        5000 non-null   object
 3   Engine Type      5000 non-null   object
 4   RTO              5000 non-null   object
 5   Transmission     5000 non-null   object
 6   Insurance Type   5000 non-null   object
 7   Price            5000 non-null   object
dtypes: int64(1), object(7)
memory usage: 312.6+ KB


Our target column is Price, but it is stored as an object (string) column. So, let's convert it to numerical form.

In [5]:
# Let's convert the target column from object to numeric

data["Price"] = pd.to_numeric(data["Price"].str.replace(",",""))
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Model Year       5000 non-null   int64 
 1   Regestered Year  5000 non-null   object
 2   Fuel Type        5000 non-null   object
 3   Engine Type      5000 non-null   object
 4   RTO              5000 non-null   object
 5   Transmission     5000 non-null   object
 6   Insurance Type   5000 non-null   object
 7   Price            5000 non-null   int64 
dtypes: int64(2), object(6)
memory usage: 312.6+ KB
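
The conversion above raises an error if any price string is not a number once the commas are removed. A slightly more defensive variant is sketched below, assuming it is acceptable to turn unparseable values into `NaN` and inspect them afterwards; the sample strings are made up.

```python
import pandas as pd

# Made-up price strings; the last one is deliberately unparseable.
raw_price = pd.Series(["6,11,700", "543000", "2,95,000", "Call for price"])

# Remove thousands separators literally (no regex) and trim whitespace.
cleaned = raw_price.str.replace(",", "", regex=False).str.strip()

# errors="coerce" converts anything that is not a number into NaN.
price = pd.to_numeric(cleaned, errors="coerce")
n_failed = int(price.isna().sum())
```

Checking `n_failed` before continuing shows whether any rows were silently lost in the conversion.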


In [6]:
# Let's check the value counts

#for column in data:
#    print(data[column].value_counts())
#    print()

In [7]:
# Let's check the null values

data.isnull().sum()

Model Year         0
Regestered Year    0
Fuel Type          0
Engine Type        0
RTO                0
Transmission       0
Insurance Type     0
Price              0
dtype: int64

No column has null values.
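
Note that `isnull()` only counts real missing values. The Insurance Type column contains the string "Not Available", which is not counted. A small sketch of how such a placeholder could be converted to a true missing value follows; treating "Not Available" as missing is an assumption about the data, not something the notebook does.

```python
import pandas as pd

insurance = pd.Series(
    ["Not Available", "Third Party insurance", "Not Available", "Third Party insurance"],
    name="Insurance Type",
)

# The placeholder string is an ordinary value, so nothing is reported as null.
nulls_before = int(insurance.isnull().sum())

# mask() replaces entries where the condition is True with NaN.
insurance_clean = insurance.mask(insurance == "Not Available")
nulls_after = int(insurance_clean.isnull().sum())
```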

In [8]:
# Let's check the 0 value counts of each column

for column in data:
    print(column+ " = "+str(data[data[column]==0].shape[0]))

Model Year = 0
Regestered Year = 0
Fuel Type = 0
Engine Type = 0
RTO = 0
Transmission = 0
Insurance Type = 0
Price = 0


In [9]:
# Let's convert the categorical columns to numerical form

print("Shape of the dataset before converting : ",data.shape)
data = pd.get_dummies(data, drop_first=True)
print("Shape of the dataset after converting  : ",data.shape)

Shape of the dataset before converting :  (5000, 8)
Shape of the dataset after converting  :  (5000, 57)
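
`get_dummies` treats every object column as categorical, including Engine Type ("1197 cc") and Regestered Year ("Dec-18"), which carry numeric information. One alternative, sketched here as an assumption rather than the notebook's approach, is to parse those two columns into numbers before encoding. The sample values are taken from the preview shown earlier.

```python
import pandas as pd

sample = pd.DataFrame({
    "Regestered Year": ["Dec-18", "Apr-15", "May-15"],
    "Engine Type": ["1197 cc", "1248 cc", "1498 cc"],
})

# Pull the digits in front of "cc" and convert them to a number.
engine_cc = pd.to_numeric(
    sample["Engine Type"].str.extract(r"(\d+)\s*cc", expand=False),
    errors="coerce",
)

# Parse "Mon-YY" strings into dates, then keep the year.
registered = pd.to_datetime(sample["Regestered Year"], format="%b-%y", errors="coerce")
registered_year = registered.dt.year
```

This keeps the ordering of engine sizes and years available to the model and reduces the number of dummy columns.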


In [10]:
# Let's separate the input and output variables

x = data.drop(columns=["Price"])
y = data["Price"]

In [11]:
# Let's import the necessary libraries for model building

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score
from time import time

In [12]:
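# Let's find the best R2 score over a range of random states

def bestmodel(model):
    # Fit on 49 different 75/25 splits (random_state 51..99) and keep the best test R2
    max_score = float("-inf")  # R2 can be negative, so don't start from 0
    max_state = None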
    start = time()
    for i in range(51,100):
        x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25, random_state=i)
        model.fit(x_train, y_train)
        y_pre = model.predict(x_test)
        score = r2_score(y_test, y_pre)
        if score > max_score:
            max_score = score
            max_state = i
    print("Best Random State is      : ",max_state)
    print("Best R2_Score is          : ",max_score)
    print("Cross Validation Score is : ",cross_val_score(model, x, y, cv=5, scoring="r2").mean())
    end = time()
    print("\nTime taken by model for prediction is {:.4f} seconds: ".format(end-start))
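
Reporting the best score over many random states tends to be optimistic, because the winning split is chosen after seeing its test score. A common alternative is to report the mean and spread over repeated cross-validation. The sketch below uses synthetic data from `make_regression` and a `LinearRegression` model purely for illustration; it is not the notebook's procedure.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data, used only to make the example runnable.
features, target = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5 folds repeated 3 times gives 15 R2 scores from different splits.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LinearRegression(), features, target, cv=cv, scoring="r2")

mean_r2, std_r2 = scores.mean(), scores.std()
```

The standard deviation gives a sense of how much the score depends on the particular split, which a single best value hides.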

In [13]:
# Linear Regression

from sklearn.linear_model import LinearRegression

LR = LinearRegression()
bestmodel(LR)

Best Random State is      :  51
Best R2_Score is          :  1.0
Cross Validation Score is :  1.0

Time taken by model for prediction is 1.2788 seconds: 


In [14]:
# Decision Tree Regressor

from sklearn.tree import DecisionTreeRegressor

DTR = DecisionTreeRegressor()
bestmodel(DTR)

Best Random State is      :  51
Best R2_Score is          :  1.0
Cross Validation Score is :  1.0

Time taken by model for prediction is 1.1726 seconds: 


In [15]:
# K-Neighbors Regressor

from sklearn.neighbors import KNeighborsRegressor

KNR = KNeighborsRegressor()
bestmodel(KNR)

Best Random State is      :  51
Best R2_Score is          :  1.0
Cross Validation Score is :  1.0

Time taken by model for prediction is 10.9720 seconds: 


In [16]:
# Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor

GBR = GradientBoostingRegressor()
bestmodel(GBR)

Best Random State is      :  52
Best R2_Score is          :  0.9993373734399859
Cross Validation Score is :  0.9992342347544876

Time taken by model for prediction is 30.4654 seconds: 


In [17]:
# Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor()
bestmodel(RFR)

Best Random State is      :  51
Best R2_Score is          :  1.0
Cross Validation Score is :  1.0

Time taken by model for prediction is 44.3730 seconds: 


All models give essentially perfect scores (an R2 of 1.0, or about 0.999 for Gradient Boosting). So, let's choose the Random Forest Regressor as our final model.
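
A perfect R2 from every model is unusual and is often a sign that identical rows appear in both the training and test splits. Whether that is the case here is not established by the notebook; the sketch below only shows how the share of duplicated rows could be checked, using a made-up frame.

```python
import pandas as pd

# Made-up rows for illustration; rows 0, 2, and 4 are identical.
toy = pd.DataFrame({
    "Fuel Type": ["Petrol", "Diesel", "Petrol", "Diesel", "Petrol"],
    "Transmission": ["Manual", "Manual", "Manual", "Automatic", "Manual"],
    "Price": [611700, 543000, 611700, 711500, 611700],
})

# duplicated() marks every repeat of an earlier row as True.
duplicate_share = toy.duplicated().mean()

# Keeping one copy of each row prevents the same listing from landing in both splits.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```

Running the same two lines on `data` before encoding would show whether duplicates are a plausible explanation for the scores.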

In [18]:
# Saving the final model (RFR was last fitted inside bestmodel, on the random_state=99 split)

import joblib

joblib.dump(RFR, "Final Model.pkl")

['Final Model.pkl']
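
A saved model is only useful if it can be loaded back and gives the same predictions. The round trip can be sketched as follows, using a small model trained on random data and a temporary directory so that nothing is written next to the notebook; the data and model settings are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random features and a simple linear target, for illustration only.
rng = np.random.default_rng(0)
features = rng.random((50, 3))
target = features @ np.array([1.0, 2.0, 3.0])

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(features, target)

# Save the fitted model, then load it back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Final Model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(features), restored.predict(features))
```

When the real model is loaded for new listings, the input must have the same dummy columns, in the same order, as the `x` used for training.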