# Car Price Prediction using Machine Learning

#### Table of contents
1. Importing libraries
2. Reading dataset
3. Data Preprocessing 
      * Checking missing values
      * Checking for Unique values
4. Exploratory Data Analysis
      * Visualization
      * Looking for extreme high data entry
      * Understanding relationship between features
5. Model Building 
      * Creating dummy variables
      * Feature Importance
      * Linear Regression
      * Support Vector Regressor
      * Random Forest Regressor

**Importing the libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Reading the dataset**

In [None]:
data=pd.read_csv("../input/vehicle-dataset-from-cardekho/car data.csv")

**Having a look at the dataset**

In [None]:
data.head()

In [None]:
data.shape

**Checking for missing/null values**

In [None]:
data.isnull()

In [None]:
data.isnull().sum()

From the above output cell we can see that there are no missing/null values in our dataset.

In [None]:
data.describe()

In [None]:
data.info()

**Checking for unique values in different columns**

In [None]:
print("Fuel type:-", data['Fuel_Type'].unique())
print("Seller:-", data['Seller_Type'].unique())
print("Transmission:-", data['Transmission'].unique())
print("Owner:-", data['Owner'].unique())
print("Year:-",data['Year'].unique())

**Importing libraries for visualization**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.figure(figsize=[12,8])
sns.countplot(x='Year',data=data)

##### **The above countplot shows the number of cars for each year**

**Adding a new column to our dataset called Vehicle_age**

In [None]:
data['Vehicle_age']=2021-data['Year']

**Dropping the Year column from our dataset**

In [None]:
data.drop(['Year'],axis=1,inplace=True)

In [None]:
data.head()

### **Exploratory Data Analysis**

In [None]:
sns.pairplot(data)

**Plotting countplots for various features**

In [None]:
plt.figure(figsize=[12,7])
plt.subplot(2,2,1)
sns.countplot(x='Fuel_Type',data=data)
plt.subplot(2,2,2)
sns.countplot(x='Owner',data=data)

plt.subplot(2,2,3)
sns.countplot(x='Transmission',data=data)
plt.subplot(2,2,4)
sns.countplot(x='Seller_Type',data=data)
plt.show()

From the above count plots we can read off information such as:<br>
i) How many Petrol, Diesel and CNG cars there are.<br>
ii) The number of cars for each count of previous owners.<br>
iii) How many Manual and Automatic cars there are.<br>
iv) The number of cars for each seller type.

**Plotting boxplots for different features**

In [None]:
plt.figure(figsize=[12,7])
plt.subplot(2,2,1)
sns.boxplot(x='Selling_Price',data=data)
plt.subplot(2,2,2)
sns.boxplot(x='Present_Price',data=data)

plt.subplot(2,2,3)
sns.boxplot(x='Kms_Driven',data=data)
plt.subplot(2,2,4)
sns.boxplot(x='Vehicle_age',data=data)
plt.show()

The above graphs show Q1, the median, Q3, the minimum and maximum values, and the outliers for each feature.
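As a rough numeric counterpart to the boxplots, the quartiles and the conventional 1.5 × IQR whisker limits can be computed directly. This is a minimal sketch on a small made-up price series (the values are illustrative, not rows from the dataset), and `iqr_bounds` is a hypothetical helper name.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values only (in lakhs); not taken from the real dataset
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 35.0])

lower, upper = iqr_bounds(prices)

# Points outside the fences are the ones a boxplot would draw as outliers
outliers = prices[(prices < lower) | (prices > upper)]
print(lower, upper)
print(outliers)
```

In the notebook the same helper could be applied to a column such as `data['Selling_Price']`.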

**Having a look at extreme data entry points**

In [None]:
# Cars with  selling price more than 20 lakhs 
data[data['Selling_Price']>20]

In [None]:
# Cars that are driven for more than 1 Lakh Kms.
data[data['Kms_Driven']>100000]

In [None]:
# Cars older than 15 years of age.
data[data['Vehicle_age']>15]

**Checking relationship between features**

In [None]:
plt.figure(figsize=[12,8])
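# Correlate numeric columns only; text columns such as Car_Name cannot be correlated
sns.heatmap(data.select_dtypes(include='number').corr(),annot=True,cmap='PuBuGn')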

**Visualizing selling price relationship with other features**

In [None]:
plt.figure(figsize=[9,6])
plt.subplot(1,2,1)
sns.barplot(x='Fuel_Type',y='Selling_Price',data=data)
plt.subplot(1,2,2)
sns.stripplot(x='Fuel_Type',y='Selling_Price',data=data)


**On average, Diesel cars in this dataset have a higher selling price than Petrol and CNG cars.**
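The comparison read off the bar plot can also be checked numerically, since seaborn's `barplot` shows the mean of each group by default. The sketch below uses a tiny made-up frame (the rows are illustrative, not from the dataset); in the notebook the same `groupby` could be run on `data`.

```python
import pandas as pd

# Illustrative rows only; the real notebook would use `data` loaded above
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Petrol", "Diesel", "Diesel", "CNG"],
    "Selling_Price": [3.0, 5.0, 8.0, 12.0, 3.5],
})

# Count, mean and median selling price per fuel type, highest mean first
summary = (
    toy.groupby("Fuel_Type")["Selling_Price"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

Reporting the count next to the mean helps show when a group, such as CNG here, has too few cars for its average to be reliable.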

In [None]:
plt.figure(figsize=[9,6])
plt.subplot(1,2,1)
sns.barplot(x='Seller_Type',y='Selling_Price',data=data)
plt.subplot(1,2,2)
sns.stripplot(x='Seller_Type',y='Selling_Price',data=data)

**From the graphs we can infer that cars sold by dealers fetch a higher selling price, on average, than cars sold by individuals.**

In [None]:
plt.figure(figsize=[9,6])
plt.subplot(1,2,1)
sns.barplot(x='Transmission',y='Selling_Price',data=data)
plt.subplot(1,2,2)
sns.stripplot(x='Transmission',y='Selling_Price',data=data)

**From the above graphs, we can conclude that Automatic cars have a higher average selling price than Manual cars in this dataset.**

In [None]:
plt.figure(figsize=[9,6])
plt.subplot(1,2,1)
sns.barplot(x='Owner',y='Selling_Price',data=data)
plt.subplot(1,2,2)
sns.stripplot(x='Owner',y='Selling_Price',data=data)

In [None]:
plt.figure(figsize=[12,6])
sns.scatterplot(x='Vehicle_age',y='Selling_Price',data=data)

In [None]:
plt.figure(figsize=[12,6])
sns.scatterplot(x='Kms_Driven',y='Selling_Price',data=data)

### **Model Building**

**Creating dummy variables for categorical features**

In [None]:
data.drop(['Car_Name'],inplace=True,axis=1)
data=pd.get_dummies(data,drop_first=True)

In [None]:
data.head()
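To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame (the rows are illustrative only). For each categorical column, pandas drops the first category in sorted order and keeps one indicator column for each remaining category, which avoids perfectly collinear dummy columns in the linear model.

```python
import pandas as pd

# Illustrative rows only; column names mirror the dataset's
toy = pd.DataFrame({
    "Selling_Price": [3.35, 4.75, 7.25],
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Transmission": ["Manual", "Manual", "Automatic"],
})

# CNG and Automatic become the implicit baseline categories
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A row with all `Fuel_Type_*` indicators equal to zero therefore represents the dropped baseline (CNG in this toy frame).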

In [None]:
x=data.iloc[:,1:]
y=data.iloc[:,0].values

In [None]:
x.head()

In [None]:
print(y)

In [None]:
#Feature Importance
from sklearn.ensemble import ExtraTreesRegressor
etr=ExtraTreesRegressor()
etr.fit(x,y)

In [None]:
print(etr.feature_importances_)

In [None]:
#plotting important feature
imp=pd.Series(etr.feature_importances_,index=x.columns)
imp.nlargest(5).plot(kind='barh')
plt.show()

**Train Test Split**

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=2)

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train,y_train)

In [None]:
# predicting value using linear regression
y_pred=lr.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))


In [None]:
plt.figure(figsize=[10,6])
plt.plot(y_pred,label='Predicted')
plt.plot(y_test,label="Actual_test")
plt.legend()
plt.title("Linear Regression Model")

In [None]:
from sklearn.metrics import r2_score
lr_r2=r2_score(y_test, y_pred)
print(lr_r2)
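For reference, the R² value reported by `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces it by hand on a few made-up numbers (not predictions from the model above).

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible on test data.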

### Support Vector Regressor


In [None]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x=StandardScaler()
sc_y=StandardScaler()
x_train_scaled=sc_x.fit_transform(x_train)
y_train=y_train.reshape(len(y_train),1)
y_train_scaled=sc_y.fit_transform(y_train)

In [None]:
print(x_train_scaled)

In [None]:
print(y_train_scaled)

In [None]:
from sklearn.svm import SVR
svr=SVR(kernel='rbf')
# SVR expects a 1-D target, so flatten the scaled column vector
svr.fit(x_train_scaled,y_train_scaled.ravel())

In [None]:
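#predicting values
# svr.predict returns a 1-D array; inverse_transform expects 2-D, so reshape and flatten back
y_pred=sc_y.inverse_transform(svr.predict(sc_x.transform(x_test)).reshape(-1,1)).ravel()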
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
plt.figure(figsize=[10,6])
plt.plot(y_pred,label='Predicted')
plt.plot(y_test,label="Actual_test")
plt.legend()
plt.title("Support Vector Regressor Model")

In [None]:
svr_r2=r2_score(y_test, y_pred)
print(svr_r2)
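The manual scaling and inverse scaling above can also be expressed with scikit-learn's `Pipeline` and `TransformedTargetRegressor`, which keep the feature and target scalers attached to the model. This is only an alternative sketch, not the approach used in this notebook, and it runs on synthetic data standing in for `x_train` and `y_train`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data standing in for the car features and prices
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - X[:, 1] + 5.0

# Features are scaled inside the pipeline; the target is scaled by `transformer`
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
)
model.fit(X[:50], y[:50])

# Predictions come back on the original target scale
preds = model.predict(X[50:])
print(preds.shape)
```

Because the scalers are fitted inside `fit`, they only ever see the training rows, which makes it harder to leak test data into the scaling step by accident.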

### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr=RandomForestRegressor()


In [None]:
from sklearn.model_selection import RandomizedSearchCV
n_estimators=[100,200,300,400,500,600,700,800,900,1000,1100,1200]
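# Newer scikit-learn releases no longer accept 'auto'; 1.0 (all features) is the equivalent for regressors
max_features=[1.0,'sqrt']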
max_depth=[5,10,15,20,25,30]
min_samples_split=[2,5,10,15,100]
min_samples_leaf=[1,2,5,10,12]

In [None]:
# creating random grid
random_grid={'n_estimators': n_estimators,'max_features': max_features,
            'max_depth': max_depth,'min_samples_split': min_samples_split,
            'min_samples_leaf': min_samples_leaf}
print(random_grid)

In [None]:
rfr_random = RandomizedSearchCV(estimator = rfr, param_distributions = random_grid, n_iter = 10, cv = 6, verbose=2, random_state=14, n_jobs = 1)


In [None]:
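# Fit on the training split only, so the test rows stay unseen for the evaluation below
rfr_random.fit(x_train,y_train.ravel())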

In [None]:
rfr_random.best_params_

In [None]:
rfr_random.best_score_

In [None]:
y_pred=rfr_random.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
plt.figure(figsize=[10,6])
plt.plot(y_pred,label='Predicted')
plt.plot(y_test,label="Actual_test")
plt.legend()
plt.title("Random Forest Regressor Model")

In [None]:
rfr_r2=r2_score(y_test, y_pred)
print(rfr_r2)

In [None]:
# comparing models R^2 
model=["Linear Regression","SupportVectorRegressor","RandomForestRegressor"]
values=[lr_r2,svr_r2,rfr_r2]
table=pd.DataFrame({"Models":model,"R squared":values})
display(table)

**This is my first Kaggle Kernel. I tried my best to bring out the insights in the data and present them in an easy way. Thank you for going through my notebook. I am open to new suggestions, so please do comment.**
