<h2 style='height:50px;text-align:center;font-size:30px;background-color:gray;border:20px;color:white'>Car Price Prediction On Cardekho</h2>

### This dataset contains information about used cars listed on www.cardekho.com
The data can serve many purposes. Here it is used for price prediction, as an example of linear regression in Machine Learning.

## Import Relevant Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## Reading and Understanding the Data

In [None]:
car= pd.read_csv("../input/vehicle-dataset-from-cardekho/car data.csv")

In [None]:
# Let's see what our dataset looks like

car.head()

In [None]:
# Let's see how many rows and columns we have in the dataset.

car.shape

In [None]:
# Let's see some summary

car.describe()

In [None]:
car.info()

### The datatypes of the columns are appropriate and need no conversion!

In [None]:
# To check if there are any outliers

car.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

### The values increase gradually across the percentiles, which suggests there are no extreme outliers.
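
The percentile table is a visual check. A common numeric companion is the 1.5 x IQR rule, which counts values far outside the middle 50% of each column. Below is a minimal sketch of that rule, not part of the original analysis. The frame `demo` and the helper `iqr_outlier_counts` are hypothetical, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the numeric columns of `car`; values are illustrative only.
demo = pd.DataFrame({
    "Selling_Price": [3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 35.0],
    "Kms_Driven": [27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429],
})

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

On the real data, the same helper could be called as `iqr_outlier_counts(car)`. A non-zero count is only a prompt to look closer, not proof that a row is wrong.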

In [None]:
# To check if there are any missing values in the dataset

import missingno as mn
mn.matrix(car)

### No missing values in the dataset
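
The matrix plot can be backed up with an exact count per column. This is a small sketch on a hypothetical frame `demo` that deliberately contains gaps so the output is non-trivial. On the real data the call would be `car.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two deliberate gaps; not taken from the dataset.
demo = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", None, "ciaz"],
    "Selling_Price": [3.35, 4.75, 7.25, np.nan],
    "Year": [2014, 2013, 2017, 2011],
})

# Number of missing entries in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True if any column has at least one missing entry.
has_missing = bool(missing_per_column.any())
```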



In [None]:
# It's important to know how many years old the car is.

car['car_age']= 2020-car['Year']

In [None]:
# It's time to drop the Year column after the needed info is derived.

car.drop('Year',axis=1,inplace=True)

In [None]:
car.head()

## Visualization with Target variable

In [None]:
sns.pairplot(car)

In [None]:
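# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(car.select_dtypes(include='number').corr(),annot=True,cmap='summer')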

In [None]:
car.columns

## 1) Seller Type

In [None]:
sns.barplot(x='Seller_Type',y='Selling_Price',data=car,palette='twilight')

### Cars sold by Dealers tend to have a higher Selling Price than cars sold by Individuals.

## 2) Transmission

In [None]:
sns.barplot(x='Transmission',y='Selling_Price',data=car,palette='spring')

### Selling Price is higher for cars with Automatic transmission.

## 3) Fuel Type

In [None]:
sns.barplot(x='Fuel_Type',y='Selling_Price',data=car,palette='summer')

### Selling Price of Diesel cars is higher than that of Petrol and CNG cars.

## 4) Present Price

In [None]:
sns.regplot(x='Selling_Price',y='Present_Price',data=car)

### Selling Price tends to increase as the Present Price of the car increases.

## 5) Kms Driven

In [None]:
sns.regplot(x='Selling_Price',y='Kms_Driven',data=car)

### The fewer the Kms driven, the higher the Selling Price tends to be.

## 6) Owner

In [None]:
sns.barplot(x='Owner',y='Selling_Price',data=car,palette='ocean')

### Selling Price is higher for cars that have had fewer previous owners.

## 7) Car Age

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='car_age',y='Selling_Price',data=car)

### Selling Price is high for cars around 2 years old and decreases gradually with age, up to cars 17 years old.

## Dealing With Categorical Variables

In [None]:
car.columns

In [None]:
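# dtype=int gives 0/1 columns; recent pandas versions otherwise return booleans,
# which can end up as an object array when passed to statsmodels later on
fuel = pd.get_dummies(car['Fuel_Type'],dtype=int)
transmission = pd.get_dummies(car['Transmission'],drop_first=True,dtype=int)
seller= pd.get_dummies(car['Seller_Type'],drop_first=True,dtype=int)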

In [None]:
fuel.drop('CNG',axis=1,inplace=True)
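
Dropping one level per categorical variable makes that level the baseline and avoids perfectly collinear dummy columns. The manual drop of `CNG` above and `drop_first=True` should lead to the same encoding here, since `CNG` is the first level in sorted order. A minimal sketch on a hypothetical four-row frame `demo`:

```python
import pandas as pd

# Hypothetical sample of the Fuel_Type column; not read from the dataset.
demo = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"]})

# Manual route: create every level, then drop the baseline by name.
manual = pd.get_dummies(demo["Fuel_Type"], dtype=int).drop("CNG", axis=1)

# drop_first=True drops the first level in sorted order, which is "CNG" here.
automatic = pd.get_dummies(demo["Fuel_Type"], drop_first=True, dtype=int)

print(manual)
same_encoding = manual.equals(automatic)
```

The manual route is more explicit about which level is the baseline, which helps when reading the coefficients later.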

In [None]:
car= pd.concat([car,fuel,transmission,seller],axis=1)

In [None]:
car.head()

In [None]:
car.drop(['Fuel_Type','Seller_Type','Transmission'],axis=1,inplace=True)

In [None]:
#The column car name doesn't seem to add much value to our analysis and hence dropping the column

car= car.drop('Car_Name',axis=1)

In [None]:
car.head()

## Splitting the Data into Test and Train

In [None]:
from sklearn.model_selection import train_test_split

np.random.seed(0)
df_train, df_test = train_test_split(car, test_size = 0.3, random_state = 100)

In [None]:
num_vars=['Selling_Price','Present_Price','Kms_Driven']

In [None]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()

In [None]:
df_train[num_vars]= scaler.fit_transform(df_train[num_vars])
df_test[num_vars]= scaler.transform(df_test[num_vars])
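
The scaler is fitted on the training rows only and then reused on the test rows, so no test-set statistics leak into training. `StandardScaler` computes `(x - mean) / std` using the population standard deviation. The sketch below checks that by hand on small made-up arrays (`train` and `test` are hypothetical) and shows that `inverse_transform` recovers the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data for illustration.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

# Fit on train only, then apply the same mean and std to test.
scaler = StandardScaler().fit(train)
test_scaled = scaler.transform(test)

# Reproduce the transformation by hand.
mean = train.mean(axis=0)
std = train.std(axis=0)  # NumPy default ddof=0, i.e. population std
manual = (test - mean) / std

matches = np.allclose(test_scaled, manual)

# Map scaled values back to the original units.
restored = scaler.inverse_transform(test_scaled)
```

Because `Selling_Price` is part of `num_vars`, the target and therefore the model's predictions are in standardized units rather than in the dataset's price units.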

## Dividing dataset into Features(X) and Target(y)

In [None]:
y_train = df_train.pop('Selling_Price')
X_train = df_train

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
lm= LinearRegression()
lm.fit(X_train, y_train)

# n_features_to_select must be passed by keyword in current scikit-learn
rfe= RFE(lm,n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col
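
To see what `support_` and `ranking_` mean, here is a small sketch of RFE on synthetic data where only two of four features actually drive the target. The data and coefficients are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: columns 0 and 2 matter, columns 1 and 3 are irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.01 * rng.normal(size=200)

# Repeatedly fit the model and discard the feature with the smallest coefficient.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 for kept features, larger numbers were dropped earlier
```

With a linear model, RFE ranks features by coefficient size, so it works best when the features are on comparable scales.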

In [None]:
X_train_rfe = X_train[col]

In [None]:
import statsmodels.api as sm
X_train_rfe= sm.add_constant(X_train_rfe)

In [None]:
model = sm.OLS(y_train,X_train_rfe).fit()
model.summary()

### Dropping the "Petrol" variable, which has a p-value > 0.05 and is therefore insignificant.

In [None]:
X_train1= X_train_rfe.drop('Petrol',axis=1)

In [None]:
X_train2= sm.add_constant(X_train1)
model1= sm.OLS(y_train,X_train2).fit()
model1.summary()

### Dropping the "Owner" variable, which has a p-value > 0.05 and is therefore insignificant.

In [None]:
X_train3= X_train2.drop('Owner',axis=1)

In [None]:
X_train4= sm.add_constant(X_train3)
model2= sm.OLS(y_train,X_train4).fit()
model2.summary()

### Dropping the "Kms_Driven" variable, which has a p-value > 0.05 and is therefore insignificant.

In [None]:
X_train5= X_train4.drop('Kms_Driven',axis=1)

In [None]:
X_train6= sm.add_constant(X_train5)
model3= sm.OLS(y_train,X_train6).fit()
model3.summary()

In [None]:
X_train_new= X_train6.drop('const',axis=1)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### All VIF values are below 5, so multicollinearity is not a concern in our model.
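
For intuition, the VIF of a feature is `1 / (1 - R²)`, where `R²` comes from regressing that feature on all the other features. Here is a NumPy-only sketch on synthetic data in which one column is nearly a copy of another. The helper `vif` is hypothetical and, as an assumption of this sketch, always includes an intercept in the auxiliary regression.

```python
import numpy as np

# Synthetic features: x3 is almost identical to x1, x2 is independent.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

Note that `variance_inflation_factor` in statsmodels does not add an intercept on its own, so its values depend on whether the matrix passed in contains a constant column.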

## Residual Analysis of the train data

In [None]:
y_train_pred = model3.predict(X_train6)

In [None]:
fig = plt.figure()
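# distplot is deprecated in recent seaborn; histplot with kde=True draws the same kind of plot
sns.histplot((y_train - y_train_pred), bins = 20, kde = True)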
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)  
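
The histogram is meant to show error terms that are centred on zero and roughly symmetric. Two numbers can back up the picture: the mean of the residuals and their skewness. This is a minimal sketch on a synthetic one-feature regression, with all data made up for illustration.

```python
import numpy as np

# Synthetic linear data with normally distributed noise.
rng = np.random.default_rng(7)
n = 400
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

# With an intercept in the model, the residual mean is zero up to rounding.
residual_mean = residuals.mean()

# Skewness: third central moment over the 1.5 power of the second; near 0 if symmetric.
centered = residuals - residual_mean
skewness = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5
print(residual_mean, skewness)
```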

## Making Predictions

In [None]:
#Dividing the test set into features and target.

y_test = df_test.pop('Selling_Price')
X_test = df_test

In [None]:
# Predicting the values by extracting the columns that our final model had

X_test_pred= X_test[X_train_new.columns]

X_test_pred= sm.add_constant(X_test_pred)

In [None]:
y_pred= model3.predict(X_test_pred)

In [None]:
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16) 

In [None]:
df = pd.DataFrame({'Actual':y_test,"Predicted":y_pred})
df.head()

In [None]:
from sklearn.metrics import r2_score
R2 = r2_score(y_test,y_pred)
R2
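
`r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. Below is a worked sketch with made-up numbers (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

# Residual sum of squares: 0.25 + 0.25 + 0 + 0 = 0.5
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares around the mean (6.0): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
print(r2)  # 1 - 0.5/20 = 0.975
```

R² does not change when the actual and predicted values are rescaled by the same linear transformation, so the score computed here on the standardized `Selling_Price` equals the score in original price units.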

## Conclusions:
    
* Present Price plays an important role in predicting Selling Price: as one increases, so does the other.
* Car age has a negative effect: the older the car, the lower the Selling Price.
* Selling Price of cars with Fuel Type Diesel is higher than that of Petrol and CNG cars.
* Manual cars are priced lower, whereas Automatic cars are priced higher.
* Cars sold by Individuals tend to fetch a lower Selling Price than cars sold by Dealers.

## Thank you!