# Description

This dataset contains information about used cars listed on www.cardekho.com.
It can be used for tasks such as price prediction, for example to illustrate linear regression in machine learning.
The columns in the dataset are as follows:

    Car_Name
    Year
    Selling_Price
    Present_Price
    Kms_Driven
    Fuel_Type
    Seller_Type
    Transmission
    Owner

For a used-motorcycle dataset, see https://www.kaggle.com/nehalbirla/motorcycle-dataset

# Steps Involved:


- [Reading and Understanding the Dataset](#1)
- [Data Preprocessing](#2)
- [Exploratory Data Analysis (EDA)](#3)
    - [Univariate Analysis](#3_a)
    - [Bivariate/Multi-Variate Analysis](#3_b)
- [Data Preparation](#4)
    - [Creating dummies for categorical features](#4_a)
    - [Performing Train-Test split](#4_b)
- [Model Creation/Evaluation](#5)
- [Conclusion](#6)

<a id='1'></a>
# 1. Reading and Understanding the Dataset

In [None]:
# Importing prerequisites
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

%matplotlib inline
pd.set_option("display.max_rows", None,"display.max_columns", None)
warnings.simplefilter(action='ignore')
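# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'; use whichever is available
plt.style.use('seaborn' if 'seaborn' in plt.style.available else 'seaborn-v0_8')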

In [None]:
# Reading car data.csv
df_main = pd.read_csv('../input/vehicle-dataset-from-cardekho/car data.csv')

In [None]:
df_main.head()

In [None]:
df_main.shape

In [None]:
df_main.info()

In [None]:
# Checking numerical stats
df_main.describe()

In [None]:
# Checking for missing values
df_main.isna().sum()

<a id='2'></a>
# 2. Data Preprocessing

Deriving the age of each car from the Year column, using 2020 as the reference year

In [None]:
df_main['Age'] = 2020 - df_main['Year']
df_main.drop('Year',axis=1,inplace = True)
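
The same transformation can be sketched on a toy frame. The values below are made up for illustration; only the `Year` column name and the 2020 reference year come from the cell above.

```python
import pandas as pd

# Hypothetical registration years, not taken from the dataset
toy = pd.DataFrame({"Year": [2014, 2017, 2011]})
REFERENCE_YEAR = 2020  # same reference year as in the notebook cell

# Age is the number of years between the reference year and the registration year
toy["Age"] = REFERENCE_YEAR - toy["Year"]
toy = toy.drop(columns="Year")  # Year is redundant once Age exists

print(toy["Age"].tolist())  # [6, 3, 9]
```

Keeping the reference year in a named constant makes it easier to update if the analysis is rerun in a later year.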

Renaming columns for better clarity

In [None]:
df_main.rename(columns = {'Selling_Price':'Selling_Price(lacs)','Present_Price':'Present_Price(lacs)',
                          'Owner':'Past_Owners'},inplace = True)

<a id='3'></a>
# 3. Exploratory Data Analysis (EDA)

<a id="3_a"></a>
## a) Univariate Analysis

In [None]:
df_main.columns

##### Plotting Categorical Columns

In [None]:
cat_cols = ['Fuel_Type','Seller_Type','Transmission','Past_Owners']
i=0
while i < 4:
    fig = plt.figure(figsize=[10,4])
    #ax1 = fig.add_subplot(121)
    #ax2 = fig.add_subplot(122)
    
    #ax1.title.set_text(cat_cols[i])
    plt.subplot(1,2,1)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1
    
    #ax2.title.set_text(cat_cols[i])
    plt.subplot(1,2,2)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1
    
    plt.show()

##### Plotting numerical columns

In [None]:
num_cols = ['Selling_Price(lacs)','Present_Price(lacs)','Kms_Driven','Age']
i=0
while i < 4:
    fig = plt.figure(figsize=[13,3])
    #ax1 = fig.add_subplot(121)
    #ax2 = fig.add_subplot(122)
    
    #ax1.title.set_text(num_cols[i])
    plt.subplot(1,2,1)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1
    
    #ax2.title.set_text(num_cols[i])
    plt.subplot(1,2,2)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1
    
    plt.show()

**Checking outliers in Present_Price, Selling_Price and Kms_Driven**

In [None]:
df_main[df_main['Present_Price(lacs)'] > df_main['Present_Price(lacs)'].quantile(0.99)]

In [None]:
df_main[df_main['Selling_Price(lacs)'] > df_main['Selling_Price(lacs)'].quantile(0.99)]

In [None]:
df_main[df_main['Kms_Driven'] > df_main['Kms_Driven'].quantile(0.99)]
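
As a rough illustration of what these filters do, the sketch below applies the same 99th-percentile rule to a toy column. The mileage values are hypothetical and chosen so that exactly one reading is extreme.

```python
import pandas as pd

# Hypothetical mileage values: 99 ordinary readings plus one extreme value
kms = [1000 * k for k in range(1, 100)] + [500000]
toy = pd.DataFrame({"Kms_Driven": kms})

# Rows strictly above the 99th percentile are flagged for manual inspection
threshold = toy["Kms_Driven"].quantile(0.99)
flagged = toy[toy["Kms_Driven"] > threshold]

print("threshold:", threshold)
print(flagged)
```

The filter only lists candidate rows; whether to drop, cap, or keep them is a separate decision.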

<a id="3_b"></a>
## b) Bivariate/Multi-Variate Analysis

In [None]:
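# Restrict to numeric columns: recent pandas versions raise an error if text columns are passed to corr()
sns.heatmap(df_main.select_dtypes('number').corr(), annot=True, cmap="RdBu")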
plt.show()

In [None]:
df_main.select_dtypes('number').corr()['Selling_Price(lacs)']

<b>Inferences:</b>
- Present price and resale price are highly correlated, as observed in EDA.
- Age of the vehicle shows a negative correlation with selling price.
- Past_Owners and Kms_Driven show very little correlation with selling price.
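
The sketch below shows how such correlations can be read off a frame that mixes numeric and text columns. The data is synthetic and generated here for illustration only: the coefficients are assumptions and not estimates from the cardekho data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
present = rng.uniform(1, 20, n)   # synthetic showroom price
age = rng.integers(1, 15, n)      # synthetic age in years

toy = pd.DataFrame({
    "Present_Price": present,
    "Age": age,
    "Fuel_Type": rng.choice(["Petrol", "Diesel"], n),  # text column, excluded below
    # Assumed relationship: price rises with present price and falls with age
    "Selling_Price": 0.6 * present - 0.5 * age + rng.normal(0, 0.5, n),
})

# Keep numeric columns only, then rank features by correlation with the target
corr = toy.select_dtypes("number").corr()["Selling_Price"].sort_values(ascending=False)
print(corr)
```

Pearson correlation only captures linear association, so a weak value does not rule out a nonlinear relationship.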

Checking average selling price of vehicle based on its Seller type and Fuel type

In [None]:
df_main.pivot_table(values='Selling_Price(lacs)', index = 'Seller_Type', columns= 'Fuel_Type')

<b>Inferences:</b> Diesel vehicles fetch a higher price than petrol and CNG vehicles for both seller types.

Checking average selling price of vehicle based on its Seller type and Transmission

In [None]:
df_main.pivot_table(values='Selling_Price(lacs)', index = 'Seller_Type', columns= 'Transmission')

<b>Inferences:</b> Automatic vehicles fetch higher resale price compared to manual ones.
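
`pivot_table` aggregates with the mean by default, which is why the tables above can be read as average selling prices. A minimal sketch on made-up numbers, with the equivalent `groupby` form for comparison:

```python
import pandas as pd

# Hypothetical listings; prices are invented for illustration
toy = pd.DataFrame({
    "Seller_Type": ["Dealer", "Dealer", "Dealer", "Individual", "Individual"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual", "Automatic"],
    "Selling_Price": [4.0, 6.0, 9.0, 2.0, 5.0],
})

# Default aggfunc is the mean of each (index, column) group
table = toy.pivot_table(values="Selling_Price", index="Seller_Type", columns="Transmission")

# Same result written as an explicit groupby + unstack
grouped = toy.groupby(["Seller_Type", "Transmission"])["Selling_Price"].mean().unstack()

print(table)
```

Group means can hide small group sizes, so it may be worth also checking the counts (for example with `aggfunc="count"`) before drawing conclusions.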

<a id="4"></a>
# 4. Data Preparation

<a id="4_a"></a>
## a) Creating dummies for categorical features

Dropping Car_Name column

In [None]:
df_main.drop(labels='Car_Name',axis= 1, inplace = True)

Converting categorical columns into numeric indicator columns using one-hot encoding.

In [None]:
df_main.head()

In [None]:
df_main = pd.get_dummies(data = df_main,drop_first=True) 
# drop_first=True drops one level per feature to avoid the dummy variable trap (perfect multicollinearity)

In [None]:
df_main.head()
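
To see what `drop_first=True` changes, the sketch below encodes a small hypothetical `Fuel_Type` column both ways. Without dropping a level, the indicator columns always sum to 1 in every row, which makes them linearly dependent with the intercept.

```python
import pandas as pd

# Hypothetical values for one categorical feature
toy = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"]})

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level (alphabetically) is dropped

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding every row sums to exactly 1: the source of the dummy variable trap
print(full.sum(axis=1).tolist())
```

The dropped level becomes the baseline: a row with zeros in all remaining indicator columns belongs to it.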

<a id="4_b"></a>
## b) Performing Train-Test Split

In [None]:
# Separating target variable and its features
y = df_main['Selling_Price(lacs)']
X = df_main.drop('Selling_Price(lacs)',axis=1)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("x train: ",X_train.shape)
print("x test: ",X_test.shape)
print("y train: ",y_train.shape)
print("y test: ",y_test.shape)
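
A small sketch of how `test_size` and `random_state` behave, using a made-up 10-row dataset rather than the car data. With `test_size=0.2`, two rows go to the test set, and fixing `random_state` makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature/target pair with 10 rows
X_toy = pd.DataFrame({"feature": np.arange(10)})
y_toy = pd.Series(np.arange(10) * 2.0, name="target")

first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)

# Same seed, same rows: the split is reproducible
print(X_te.index.equals(second[1].index))
```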

<a id="5"></a>
# 5. Model Creation/Evaluation

## a) Applying regression models
- Linear Regression (OLS)
- Ridge Regression
- Lasso Regression
- Random Forest Regression
- Gradient Boosting regression

In [None]:
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

In [None]:
CV = []
R2_train = []
R2_test = []

def car_pred_model(model,model_name):
    # Training model
    model.fit(X_train,y_train)
            
    # R2 score of train set
    y_pred_train = model.predict(X_train)
    R2_train_model = r2_score(y_train,y_pred_train)
    R2_train.append(round(R2_train_model,2))
    
    # R2 score of test set
    y_pred_test = model.predict(X_test)
    R2_test_model = r2_score(y_test,y_pred_test)
    R2_test.append(round(R2_test_model,2))
    
    # R2 mean of train set using Cross validation
    cross_val = cross_val_score(model ,X_train ,y_train ,cv=5)
    cv_mean = cross_val.mean()
    CV.append(round(cv_mean,2))
    
    # Printing results
    print("Train R2-score :",round(R2_train_model,2))
    print("Test R2-score :",round(R2_test_model,2))
    print("Train CV scores :",cross_val)
    print("Train CV mean :",round(cv_mean,2))
    
    # Plotting Graphs 
    # Residual Plot of train data
    fig, ax = plt.subplots(1,2,figsize = (10,4))
    ax[0].set_title('Residual Plot of Train samples')
    # kdeplot replaces the deprecated distplot(..., hist=False)
    sns.kdeplot(y_train - y_pred_train, ax=ax[0])
    ax[0].set_xlabel('y_train - y_pred_train')
    
    # y_test vs y_pred_test scatter plot
    ax[1].set_title('y_test vs y_pred_test')
    ax[1].scatter(x = y_test, y = y_pred_test)
    ax[1].set_xlabel('y_test')
    ax[1].set_ylabel('y_pred_test')
    
    plt.show()
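
The cross-validation step inside the helper can be illustrated in isolation. The sketch below uses synthetic data (an assumption, not the car dataset) and shows that `cross_val_score` returns one score per fold and that, for a regressor, the default score is R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with a known linear signal and a little noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression()

# The model is cloned and refit on each of the 5 folds; one score per fold is returned
scores = cross_val_score(model, X_syn, y_syn, cv=5)

# For regressors the default scorer is R^2, so this call gives the same numbers
scores_r2 = cross_val_score(model, X_syn, y_syn, cv=5, scoring="r2")

print(scores.round(3), round(scores.mean(), 3))
```

Because each fold is scored on data the model did not see during that fit, the mean is usually a steadier estimate than a single train/test split.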

### 1) Standard Linear Regression or Ordinary Least Squares

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
car_pred_model(lr,"Linear_regressor.pkl")

### 2) Ridge

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Creating Ridge model object
rg = Ridge()
# range of alpha 
alpha = np.logspace(-3,3,num=14)

# Creating RandomizedSearchCV to find the best estimator of hyperparameter
rg_rs = RandomizedSearchCV(estimator = rg, param_distributions = dict(alpha=alpha))

In [None]:
car_pred_model(rg_rs,"ridge.pkl")
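
One detail worth noting: `RandomizedSearchCV` evaluates `n_iter=10` candidates by default, so with 14 alpha values only a random subset of 10 is tried. The sketch below, on synthetic data that is assumed for illustration, compares the default with a search that covers every value.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the training set
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(80, 4))
y_syn = X_syn @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.normal(scale=0.2, size=80)

alphas = np.logspace(-3, 3, num=14).tolist()  # same range as above

# Default n_iter=10: only 10 of the 14 alphas are sampled
default_search = RandomizedSearchCV(Ridge(), {"alpha": alphas}, random_state=0)
default_search.fit(X_syn, y_syn)

# n_iter equal to the number of candidates: every alpha is evaluated
full_search = RandomizedSearchCV(Ridge(), {"alpha": alphas}, n_iter=len(alphas), random_state=0)
full_search.fit(X_syn, y_syn)

print(len(default_search.cv_results_["params"]), len(full_search.cv_results_["params"]))
print(full_search.best_params_)
```

For a one-dimensional grid this small, `GridSearchCV` would be an equally reasonable choice; setting `random_state` keeps the sampled subset reproducible.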

### 3) Lasso

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

ls = Lasso()
alpha = np.logspace(-3,3,num=14) # range for alpha

ls_rs = RandomizedSearchCV(estimator = ls, param_distributions = dict(alpha=alpha))

In [None]:
car_pred_model(ls_rs,"lasso.pkl")

### 4) Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor()

# Number of trees in Random forest
n_estimators=list(range(500,1000,100))
# Maximum number of levels in a tree
max_depth=list(range(4,9,4))
# Minimum number of samples required to split an internal node
min_samples_split=list(range(4,9,2))
# Minimum number of samples required to be at a leaf node.
min_samples_leaf=[1,2,5,7]
# Number of features to be considered at each split
# None means all features ('auto' had the same meaning but was removed in scikit-learn 1.3)
max_features=[None,'sqrt']

# Hyperparameters dict
param_grid = {"n_estimators":n_estimators,
              "max_depth":max_depth,
              "min_samples_split":min_samples_split,
              "min_samples_leaf":min_samples_leaf,
              "max_features":max_features}

rf_rs = RandomizedSearchCV(estimator = rf, param_distributions = param_grid)

In [None]:
car_pred_model(rf_rs,'random_forest.pkl')

We can check the best selection of hyperparameters for our model using the command below.

In [None]:
print(rf_rs.best_estimator_)
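
Besides `best_estimator_`, a fitted search also exposes `best_params_` and `best_score_`, which are often easier to read. A minimal sketch on synthetic data with a deliberately tiny, hypothetical grid so that it runs quickly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data; the grid below is much smaller than the one used in the notebook
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 3))
y_syn = X_syn[:, 0] * 3.0 + rng.normal(scale=0.1, size=60)

small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4], "min_samples_leaf": [1, 2]}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0), small_grid,
                            n_iter=3, cv=3, random_state=0)
search.fit(X_syn, y_syn)

print(search.best_params_)           # only the tuned hyperparameters
print(round(search.best_score_, 3))  # mean cross-validated R^2 of the best candidate

# With the default refit=True, predict() delegates to the refitted best estimator
preds = search.predict(X_syn[:5])
```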

### 5) Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

gb = GradientBoostingRegressor()

# Learning rate: shrinks the contribution of each tree
learning_rate = [0.001, 0.01, 0.1, 0.2]
# Number of trees in Gradient boosting
n_estimators=list(range(500,1000,100))
# Maximum number of levels in a tree
max_depth=list(range(4,9,4))
# Minimum number of samples required to split an internal node
min_samples_split=list(range(4,9,2))
# Minimum number of samples required to be at a leaf node.
min_samples_leaf=[1,2,5,7]
# Number of features to be considered at each split
# None means all features ('auto' had the same meaning but was removed in scikit-learn 1.3)
max_features=[None,'sqrt']

# Hyperparameters dict
param_grid = {"learning_rate":learning_rate,
              "n_estimators":n_estimators,
              "max_depth":max_depth,
              "min_samples_split":min_samples_split,
              "min_samples_leaf":min_samples_leaf,
              "max_features":max_features}

gb_rs = RandomizedSearchCV(estimator = gb, param_distributions = param_grid)

In [None]:
car_pred_model(gb_rs,"gradient_boosting.pkl")

In [None]:
Technique = ["LinearRegression","Ridge","Lasso","RandomForestRegressor","GradientBoostingRegressor"]
results=pd.DataFrame({'Model': Technique,'R Squared(Train)': R2_train,'R Squared(Test)': R2_test,'CV score mean(Train)': CV})
display(results)
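
One way to read a table like the one above is to look at the gap between train and test R²: a large gap suggests overfitting. The sketch below computes that gap for a linear model and a random forest on synthetic data. The data and the resulting numbers are illustrative assumptions and say nothing about the car dataset itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy linear problem
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 5))
y_syn = X_syn @ np.array([1.5, -2.0, 0.0, 0.5, 1.0]) + rng.normal(scale=1.0, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

models = [("linear", LinearRegression()),
          ("forest", RandomForestRegressor(n_estimators=100, random_state=0))]

gaps = {}
for name, model in models:
    model.fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, model.predict(X_tr))
    test_r2 = r2_score(y_te, model.predict(X_te))
    gaps[name] = train_r2 - test_r2  # larger gap = more overfitting
    print(f"{name}: train={train_r2:.3f} test={test_r2:.3f} gap={gaps[name]:.3f}")
```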

<a id="6"></a>
# 6. Conclusion

- Present price and resale price are highly correlated, as observed in EDA.
- Age of the vehicle shows a negative correlation with selling price.
- Past_Owners and Kms_Driven show very little correlation with selling price.
- Automatic vehicles fetch a higher resale price than manual ones.
- Ensemble techniques such as Random Forest and Gradient Boosting produce better results than the linear models; however, they are more prone to overfitting.

#### Thanks for reading!


- In addition, you can check my GitHub repository, where I have built an end-to-end ML project using the above dataset and deployed it to the cloud.

 Link: https://github.com/rppradhan08/Car_Price_Prediction

- Please do share your feedback.