# Car Price Prediction


In this notebook we build a model to predict car prices, walking through the life cycle of a data science project along the way. The data comes from Kaggle.

Here's the link to the dataset: https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho

So, let's begin. The notebook covers these steps:

* Import Libraries
* Load Train Data
* Basic EDA
* Data Cleaning
* Handling Categorical Columns
* Split The Data
* Feature Importance
  * Feature Importance using ExtraTreesRegressor
* Model Building
  * RandomForestRegressor
* Save The Model

# Import Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load Train Data

In [None]:
df = pd.read_csv('../input/vehicle-dataset-from-cardekho/car data.csv')
df.head()

# Basic EDA

In [None]:
#check the shape of the data
df.shape

In [None]:
df.info()

As `df.info()` shows, some columns hold categorical (non-numeric) data, which we need to convert to numbers before modelling.

In [None]:
df.isnull().sum()

In [None]:
df.describe()

# Data Cleaning

In [None]:
df.columns

Drop the car name column, since it is unlikely to help the model's accuracy.

In [None]:
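# .copy() gives an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
data = df[['Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
       'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner']].copy()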
data.head()

Create a column for the current year and subtract the `Year` column from it to get the age of the car.

In [None]:
data['Current_Year'] = 2021
data['Car_Age'] = data.Current_Year - data.Year
data.head()
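
The age calculation above can also be sketched as a small helper that takes the reference year as a parameter instead of hard-coding it. The function name `add_car_age` and the toy values are assumptions made for illustration; they are not part of the dataset or the original notebook.

```python
import pandas as pd

def add_car_age(frame, reference_year, year_col="Year"):
    """Return a copy of `frame` with a Car_Age column (reference_year - year)."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    out["Car_Age"] = reference_year - out[year_col]
    return out

# Hypothetical toy data, only to show the calculation
toy = pd.DataFrame({"Year": [2014, 2017, 2011]})
aged = add_car_age(toy, reference_year=2021)
print(aged)
```

Passing the reference year explicitly keeps the result reproducible if the notebook is rerun in a later year.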

Now that we have the age of the car, we can drop the `Year` and `Current_Year` columns.

In [None]:
data.drop(['Current_Year', 'Year'], axis = 1, inplace = True)
data.head()

# Handling Categorical Columns

These are the categorical columns

In [None]:
print(f'Seller_Type: {df.Seller_Type.unique()} , \nTransmission: {df.Transmission.unique()} , \nFuel_Type: {df.Fuel_Type.unique()} , \nOwner: {df.Owner.unique()}')

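Convert the categorical values to numerical ones with one-hot encoding. `drop_first = True` drops the first level of each categorical column, because it is implied by the remaining dummy columns.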

In [None]:
data = pd.get_dummies(data, drop_first = True)
data.head()
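
To see what `drop_first = True` does, here is a minimal sketch on a hypothetical three-row frame (the values are made up for illustration). Each categorical column loses its first level in sorted order, so `CNG` and `Automatic` become the implicit baselines.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset
toy = pd.DataFrame({
    "Present_Price": [5.59, 9.54, 9.85],
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Transmission": ["Manual", "Manual", "Automatic"],
})

# Numeric columns pass through; each categorical column becomes k-1 dummies
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```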

Let's find the correlation

In [None]:
data.corr()

Let's plot the data for better understanding

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sns.pairplot(data)

In [None]:
# Compute the correlation matrix once and plot it as an annotated heatmap
corrmat = data.corr()
plt.figure(figsize = (20,20))
sns.heatmap(corrmat, annot = True)
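
When the heatmap gets crowded, it can help to look only at how each feature correlates with the target. The sketch below ranks features by absolute correlation with `Selling_Price` on a hypothetical toy frame; the numbers are illustrative, not results from the real data.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "Selling_Price": [3.35, 4.75, 7.25, 2.85, 4.60],
    "Present_Price": [5.59, 9.54, 9.85, 4.15, 6.87],
    "Car_Age": [7, 8, 4, 10, 7],
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["Selling_Price"].drop("Selling_Price")

# Order by strength of the relationship, keeping the sign in the output
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```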

# Split The Data

In [None]:
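# Selling_Price is the first column, so it becomes the target y;
# every remaining column is a feature
x = data.iloc[:, 1:]
y = data.iloc[:, 0]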

In [None]:
x

In [None]:
y

# Feature Importance

In [None]:
# Feature importance using ExtraTreesRegressor
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(x, y)

In [None]:
print(model.feature_importances_)

Plot the five most important features.

In [None]:
feat_imp = pd.Series(model.feature_importances_, index = x.columns)
feat_imp.nlargest(5).plot(kind = 'barh')
plt.show()
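
As a sanity check on how to read `feature_importances_`, here is a small sketch on synthetic data where only one feature drives the target. The column names `signal`, `noise_a` and `noise_b` are hypothetical. The importances are normalised to sum to 1, and the informative feature should receive most of the weight.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
# Synthetic features: only "signal" influences the target
features = pd.DataFrame({
    "signal": rng.rand(200),
    "noise_a": rng.rand(200),
    "noise_b": rng.rand(200),
})
target = 5.0 * features["signal"] + 0.01 * rng.randn(200)

model_demo = ExtraTreesRegressor(n_estimators=50, random_state=0)
model_demo.fit(features, target)

importances = pd.Series(model_demo.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False))
```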

# Model Building

In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

x_train, x_test,y_train, y_test = train_test_split(x, y, test_size = 0.2)
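
The split above is random, so it changes on every run. A minimal sketch of a reproducible split, assuming you want the same rows in the test set each time, is to pass `random_state` (the value 42 is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for x and y
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# Two calls with the same random_state produce identical splits
split_a = train_test_split(features, target, test_size=0.2, random_state=42)
split_b = train_test_split(features, target, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(split_a[0].shape, split_a[1].shape, same)
```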

In [None]:
x_train.shape

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
#Randomized Search CV

# Number of trees in random forest
n_estimators = [int(n) for n in np.linspace(start = 100, stop = 1200, num = 12)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for regressors it meant all features, i.e. 1.0)
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(d) for d in np.linspace(5, 30, num = 6)]
# max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)

In [None]:
# create model
rf = RandomForestRegressor()

In [None]:
# build the randomized search over the parameter grid
rf_random = RandomizedSearchCV(estimator = rf,
                               param_distributions = random_grid,
                               scoring = 'neg_mean_squared_error',
                               n_iter = 10, cv = 5, verbose = 2,
                               random_state = 42, n_jobs = 1)

In [None]:
#fit the model
rf_random.fit(x_train, y_train)

In [None]:
# get the best parameters
rf_random.best_params_

In [None]:
# check the best cross-validated score (negative MSE, so closer to 0 is better)
rf_random.best_score_
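
Because the search uses `scoring = 'neg_mean_squared_error'`, `best_score_` is a negative number. A small sketch of converting it to an RMSE in the units of the target follows; the helper name `rmse_from_neg_mse` and the input value are illustrative assumptions, not output from this notebook.

```python
import math

def rmse_from_neg_mse(neg_mse):
    """Convert a negated mean squared error into a root mean squared error."""
    # math.sqrt raises ValueError if neg_mse is positive, which would
    # indicate the score was not a negated MSE
    return math.sqrt(-neg_mse)

example_rmse = rmse_from_neg_mse(-4.0)  # illustrative value only
print(example_rmse)
```

In the notebook this would be called as `rmse_from_neg_mse(rf_random.best_score_)`.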

In [None]:
# get the predictions on test data
predictions = rf_random.predict(x_test)

In [None]:
predictions

In [None]:
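# plot the distribution of the residuals (should be centred near 0)
# sns.distplot is deprecated; histplot with kde=True is its replacement
sns.histplot(y_test - predictions, kde = True)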

In [None]:
plt.scatter(y_test, predictions)
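
The plots give a visual impression of the fit; numeric metrics make it easier to compare models. Below is a minimal sketch of a helper that reports MAE, RMSE and R². The function name `regression_report` and the sample values are assumptions for illustration, not results from this model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return common regression metrics as a dictionary."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
    }

# Illustrative values only
report = regression_report([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
print(report)
```

In the notebook this would be called as `regression_report(y_test, predictions)`.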

# Save The Model

In [None]:
import pickle

# the with-block closes the file even if pickling fails
with open('rfr_car_pred.pkl', 'wb') as file:
    pickle.dump(rf_random, file)
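
To use a saved model later, it has to be loaded back with `pickle.load`. The sketch below shows the round trip on a small synthetic model written to a temporary file; the file name `toy_model.pkl` and the data are hypothetical. Only unpickle files you trust, and note that a pickle may not load under a different scikit-learn version.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the tuned random forest
rng = np.random.RandomState(0)
X = rng.rand(30, 3)
y = X @ np.array([1.0, 2.0, 3.0])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)
with open(path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model should predict exactly like the original
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```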