# Car Price Prediction

### Objective 
<p>To build a suitable machine learning model for car price prediction on the dataset below.</p>

<h1>Table of contents</h1>

<div class="alert alert-info alert-info" style="margin-top: 20px">

1. [Importing Libraries and Dataset](#1)<br>
2. [Exploratory Data Analysis](#2)<br>
3. [Feature Engineering](#3)<br>
4. [Data Visualization](#4)<br>
5. [Model Building](#5)<br>
6. [Model Evaluation](#6)<br>

</div>

<hr>

<h2>Importing Libraries and Dataset</h2><a id="1"></a>

In [None]:
# importing required libraries
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Load the Car dataset
This dataset contains information about used cars listed on the CarDekho <a href='https://www.cardekho.com'><u>website</u></a>.
The data can serve many purposes, such as price prediction to exemplify the use of linear regression in machine learning.
The columns in the dataset are as follows:

| Column name    | Description                                          |
| -------------- | ---------------------------------------------------- |
| Car_Name       | Name of the car sold                                 |
| company        | Company that makes the car                           |
| Year           | Year in which the car was bought                     |
| Selling_Price  | Price at which the car was sold                      |
| Present_Price  | Price of the same car model in the current year      |
| Kms_Driven     | Kilometers the car was driven before it was sold     |
| Fuel_Type      | Type of fuel the car uses                            |
| Seller_Type    | Type of seller                                       |
| Transmission   | Gear transmission of the car (Automatic/Manual)      |
| Owner          | Number of previous owners                            |

In [None]:
# Loading Dataset
df = pd.read_csv('../input/car-dekho-data/car data.csv')

df.head()

<h2>Exploratory Data Analysis</h2><a id="2"></a>

In [None]:
print('The size of Dataframe is: ', df.shape)
print('\n')
df.info()

- 'Selling_Price' is our Target variable.

In [None]:
# Count the missing values in each column and express them as a percentage of rows
def missing_data(data):
    """
    Take a dataframe and return, for each column, the number of
    missing values and the percentage of rows they represent.
    """
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum() / len(data) * 100).sort_values(ascending=False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
missing_data(data= df)

From the dataframe above, my observations on missing data are:
<UL>
   <li>There are no missing values in our dataset.
   <li>Therefore, no cleaning of missing data is needed.
</UL>
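
Since the car data has no gaps, the summary above is all zeros. To show what the same kind of summary reports when values are missing, here is a minimal sketch on a made-up frame. The `toy` data and the name `missing_summary` are assumptions for illustration only and are not part of the car dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(data):
    """Return per-column missing-value counts and percentages, largest first."""
    total = data.isnull().sum().sort_values(ascending=False)
    # The mean of a boolean mask is the fraction of missing rows in each column
    percent = (data.isnull().mean() * 100).sort_values(ascending=False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

# Hypothetical toy data with a few deliberately missing entries
toy = pd.DataFrame({
    'price': [3.5, np.nan, 7.2, 4.1],
    'kms': [27000, 43000, np.nan, np.nan],
    'fuel': ['Petrol', 'Diesel', 'Petrol', 'CNG'],
})

summary = missing_summary(toy)
print(summary)
```

In this toy frame, `kms` has 2 missing values out of 4 rows (50%), `price` has 1 (25%), and `fuel` has none.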

In [None]:
print("'Fuel_Type' variable has {} unique categories: {}\n".format(df['Fuel_Type'].nunique(), df['Fuel_Type'].unique()))
print("'Seller_Type' variable has {} unique categories: {}\n".format(df['Seller_Type'].nunique(),
                                                                      df['Seller_Type'].unique()))
print("'Transmission' variable has {} unique categories: {}\n".format(df['Transmission'].nunique(),
                                                                       df['Transmission'].unique()))
print("'Owner' variable has {} unique categories: {}".format(df['Owner'].nunique(), df['Owner'].unique()))

In [None]:
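# Recode Owner value 3 as 2; assign the result back, since an inplace replace
# on a selected column may not update df in recent pandas versions
df['Owner'] = df['Owner'].replace(to_replace=3, value=2)
print("'Owner' variable has {} unique categories: {}".format(df['Owner'].nunique(), df['Owner'].unique()))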

In [None]:
df.describe()

In [None]:
df.describe(include= 'object')

<h2>Feature Engineering</h2><a id="3"></a>

Here, I'll derive a new feature from the 'Year' column.

In [None]:
# Let's see all column names
df.columns

The new feature, 'No_of_Years', records how many years old the car is.

In [None]:
# Let's create a new variable 'Current_Year' to serve as the reference year
df['Current_Year'] = 2020

# To calculate how old the car is, I create a new feature 'No_of_Years'
df['No_of_Years'] = df['Current_Year'] - df['Year']

df.head()
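
As a side note, the age can also be computed without a helper column, because pandas broadcasts a scalar across a Series. The sketch below uses made-up years and a hypothetical constant `REFERENCE_YEAR` set to the same 2020 used above.

```python
import pandas as pd

REFERENCE_YEAR = 2020  # hypothetical constant; same reference year as in the cell above

# Made-up purchase years, not taken from the car dataset
toy = pd.DataFrame({'Year': [2014, 2017, 2011]})

# The scalar is broadcast over the whole column
toy['No_of_Years'] = REFERENCE_YEAR - toy['Year']
print(toy)
```

This gives ages of 6, 3 and 9 years for the three toy rows.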

#### Remove features

In [None]:
# Work on a copy and drop the features that are not needed for modelling
final_df = df.drop(columns=['Car_Name', 'company', 'Year', 'Current_Year'])

final_df.head()

<h2>Data Visualization</h2><a id="4"></a>

In [None]:
sns.pairplot(data= final_df, hue= 'Fuel_Type', diag_kind= 'kde')

In [None]:
# Let's see the distributions of the four numerical variables in our data
fig = plt.figure(figsize=(20,20)) # create figure

sns.set(font_scale= 1)
sns.set_style('darkgrid')

ax0 = fig.add_subplot(2, 2, 1) # add subplot 1 (2 rows, 2 columns, first plot)
ax1 = fig.add_subplot(2, 2, 2) # add subplot 2 (2 rows, 2 columns, second plot)
ax2 = fig.add_subplot(2, 2, 3) # add subplot 3 (2 rows, 2 columns, third plot)
ax3 = fig.add_subplot(2, 2, 4) # add subplot 4 (2 rows, 2 columns, fourth plot)

# Note: histplot with kde=True is used here because distplot is deprecated in seaborn

# Subplot 1: Distribution of 'Selling_Price' feature
sns.histplot(data= final_df, x= 'Selling_Price', bins= 25, kde= True, stat= 'density', ax= ax0)
ax0.set_title('Distribution of Selling Price', fontsize=16)
ax0.set(xlabel= 'Selling Price', ylabel= 'Density')

# Subplot 2: Distribution of 'Present_Price' feature
sns.histplot(data= final_df, x= 'Present_Price', bins= 25, kde= True, stat= 'density', ax= ax1)
ax1.set_title('Distribution of Present Price', fontsize=16)
ax1.set(xlabel= 'Present Price', ylabel= 'Density')

# Subplot 3: Distribution of 'Kms_Driven' feature
sns.histplot(data= final_df, x= 'Kms_Driven', bins= 25, kde= True, stat= 'density', ax= ax2)
ax2.set_title('Distribution of Kilometers Driven', fontsize=16)
ax2.set(xlabel= 'Kilometers Driven', ylabel= 'Density')

# Subplot 4: Distribution of 'No_of_Years' feature
sns.histplot(data= final_df, x= 'No_of_Years', bins= 15, kde= True, stat= 'density', ax= ax3)
ax3.set_title('Distribution of Number of Years', fontsize=16)
ax3.set(xlabel= 'Number of Years', ylabel= 'Density')

plt.show()
#fig.savefig("Distributionplot.png")

In [None]:
print("'No_of_Years' variable has {} unique values: {}".format(final_df['No_of_Years'].nunique(),
                                                               final_df['No_of_Years'].unique()))

In [None]:
# Let's see the value counts of the categorical features
fig = plt.figure(figsize=(16,16)) # create figure

sns.set(font_scale= 1)
sns.set_style('darkgrid')

ax0 = fig.add_subplot(2, 2, 1) # add subplot 1 (2 rows, 2 columns, first plot)
ax1 = fig.add_subplot(2, 2, 2) # add subplot 2 (2 rows, 2 columns, second plot)
ax2 = fig.add_subplot(2, 2, 3) # add subplot 3 (2 rows, 2 columns, third plot)
ax3 = fig.add_subplot(2, 2, 4) # add subplot 4 (2 rows, 2 columns, fourth plot)

# Subplot 1: Countplot of 'Fuel_Type' feature
sns.countplot(data = final_df, x = 'Fuel_Type', ax= ax0) # add to subplot 1
ax0.set_title('Fuel_Type Value Counts', fontsize=16)
ax0.set(xlabel= 'Fuel_Type', ylabel= 'Count')

# Subplot 2: Countplot of 'Seller_Type' feature
sns.countplot(data = final_df, x = 'Seller_Type', ax= ax1) # add to subplot 2
ax1.set_title('Seller_Type Value Counts', fontsize=16)
ax1.set(xlabel= 'Seller_Type', ylabel= 'Count')

# Subplot 3: Countplot of 'Transmission' feature
sns.countplot(data = final_df, x = 'Transmission', ax= ax2) # add to subplot 3
ax2.set_title('Transmission Value Counts', fontsize=16)
ax2.set(xlabel= 'Transmission', ylabel= 'Count')

# Subplot 4: Countplot of 'Owner' feature
sns.countplot(data = final_df, x = 'Owner', ax= ax3) # add to subplot 4
ax3.set_title('Owner Value Counts', fontsize=16)
ax3.set(xlabel= 'Owner', ylabel= 'Count')

plt.show()
#fig.savefig("Distributionplot.png")

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(data= final_df, x= 'No_of_Years')
plt.xlabel('Number of Years', fontsize=14)
plt.ylabel('Counts', fontsize=14)
plt.title('Number of Years Value Counts', fontsize=18)

### Convert categorical variables into numerical
Here, I use one-hot encoding via `pd.get_dummies` to convert the categorical variables to numerical ones.

In [None]:
final_df = pd.get_dummies(final_df, drop_first=True)
final_df.head()
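
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up rows (the `toy` frame is an assumption for illustration, not the car data). With three fuel categories, only two indicator columns are kept. The first category in sorted order is dropped and is represented by zeros in both remaining columns, which avoids a redundant, perfectly collinear column.

```python
import pandas as pd

# Hypothetical rows with three fuel categories
toy = pd.DataFrame({
    'Fuel_Type': ['Petrol', 'Diesel', 'CNG', 'Petrol'],
    'Kms_Driven': [27000, 43000, 6900, 5200],
})

# dtype='int64' yields 0/1 integers directly instead of booleans
encoded = pd.get_dummies(toy, drop_first=True, dtype='int64')

print(encoded.columns.tolist())
# ['Kms_Driven', 'Fuel_Type_Diesel', 'Fuel_Type_Petrol']
print(encoded)
```

The 'CNG' row has 0 in both `Fuel_Type_Diesel` and `Fuel_Type_Petrol`, so no information is lost by dropping its column.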

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(data = final_df.corr().round(2), annot= True, cmap= 'plasma', vmin= -1 , vmax= 1, linecolor='white', linewidths=2)

In [None]:
# Let's check data types of variables
final_df.dtypes

In [None]:
# Convert the dummy columns to integers (get_dummies may return booleans)
final_df['Fuel_Type_Diesel'] = final_df['Fuel_Type_Diesel'].astype('int64')
final_df['Fuel_Type_Petrol'] = final_df['Fuel_Type_Petrol'].astype('int64')
final_df['Seller_Type_Individual'] = final_df['Seller_Type_Individual'].astype('int64')
final_df['Transmission_Manual'] = final_df['Transmission_Manual'].astype('int64')

In [None]:
X = final_df.iloc[:, 1:]            # Feature matrix (independent variables)
y = final_df.iloc[:, 0]             # Target variable (dependent variable)

In [None]:
# Fit a tree ensemble to check which features are important
from sklearn.ensemble import ExtraTreesRegressor

model = ExtraTreesRegressor()
model.fit(X,y)

In [None]:
print(model.feature_importances_)

In [None]:
#plot graph of feature importances for better visualization

imp_feature = pd.Series(model.feature_importances_, index = X.columns)
imp_feature.nlargest(7).plot(kind = 'barh', color='red')
plt.title('Important Features', fontsize=16)
plt.show()

In this project, however, I'll use all features for prediction.
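
As an aside on how to read `feature_importances_`: the values are normalized to sum to 1, so each one is a relative share rather than an absolute effect size. The sketch below illustrates this on synthetic data, where only the first feature drives the target by construction. The arrays `X_toy` and `y_toy` and the model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: the target depends on feature 0 only, plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=200)

toy_model = ExtraTreesRegressor(n_estimators=50, random_state=0)
toy_model.fit(X_toy, y_toy)

importances = toy_model.feature_importances_
print(importances)        # one share per feature
print(importances.sum())  # the shares add up to 1
```

Here the first feature should receive by far the largest share, since it is the only one related to the target.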

In [None]:
from sklearn.model_selection import train_test_split

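# Fix random_state so the split, and therefore the reported scores, are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)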

<h2>Model Building</h2><a id="5"></a>

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()

In [None]:
## Hyperparameters 
# number of trees
n_estimators = [int(x) for x in np.linspace(start=100, stop=1200, num=12)]

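# number of features considered at each split
# ('auto' is no longer accepted by recent scikit-learn; for regressors it meant all features, i.e. 1.0)
max_features = [1.0, 'sqrt']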

# max number of levels in tree
max_depth = [int(x) for x in np.linspace(start= 5, stop= 30, num= 6)]

# min. number of samples required to split a node
min_samples_split = [2,5,10,15,100]

# min. number of samples required at each leaf node
min_samples_leaf = [1,2,5,10]

In [None]:
# Create the random grid
random_grid= {'n_estimators': n_estimators, 
              'max_features' : max_features,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}
print(random_grid)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
regressor_random = RandomizedSearchCV(estimator=regressor, param_distributions=random_grid,
                                      scoring='neg_mean_squared_error', n_iter=10, cv=5,
                                      verbose=2, random_state=42, n_jobs=1)
regressor_random.fit(X_train, y_train)
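
After a randomized search is fitted, the chosen settings and the cross-validated score can be read from `best_params_` and `best_score_`. Because the scoring is `'neg_mean_squared_error'`, the score is negative and needs a sign flip to be read as an MSE. The sketch below shows this on synthetic data with a deliberately small grid. `X_toy`, `y_toy`, `toy_grid` and `toy_search` are hypothetical names, and the numbers they produce say nothing about the car model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the real training data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + 0.1 * rng.normal(size=120)

# A small grid keeps the sketch fast
toy_grid = {
    'n_estimators': [20, 40],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 2, 5],
}

toy_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=toy_grid,
    scoring='neg_mean_squared_error',
    n_iter=4, cv=3, random_state=42,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)   # the sampled combination with the best CV score
print(-toy_search.best_score_)   # flip the sign to get the mean CV MSE
```

The same two attributes are available on `regressor_random` once its `fit` call has finished.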

In [None]:
y_predictions = regressor_random.predict(X_test)
y_predictions

<h2>Model Evaluation</h2><a id="6"></a>

### Visualizing Predictions on Test Data
*Now that the model is trained, I evaluate its performance by predicting the test values and visualizing the results.*

In [None]:
plt.figure(figsize=(7,5))
plt.scatter(x= y_test, y= y_predictions)
plt.xlabel('Y Test (True values)')
plt.ylabel('Predicted Values')
plt.title('True value Vs Predicted values of Selling Price', fontsize=14)

### Residuals

*Next, I explore the residuals to check that the fit is reasonable, i.e. that they are centered on zero and approximately normally distributed.*

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(y_test - y_predictions, kde=True, stat='density')
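
Alongside the plot, a couple of summary numbers can back up the visual check. This is a minimal sketch on made-up values (`y_true_toy` and `y_pred_toy` are assumptions, not model output): a mean residual near zero suggests no systematic over- or under-prediction, and a skewness near zero suggests a roughly symmetric distribution.

```python
import numpy as np

# Hypothetical true and predicted prices
y_true_toy = np.array([3.0, 5.5, 7.25, 1.2, 9.0])
y_pred_toy = np.array([2.8, 5.9, 7.00, 1.5, 8.6])

residuals = y_true_toy - y_pred_toy

mean_resid = residuals.mean()   # close to 0 means no systematic bias
std_resid = residuals.std()     # typical size of an error

# Sample skewness: the third standardized moment; about 0 for symmetric residuals
skewness = np.mean(((residuals - mean_resid) / std_resid) ** 3)

print('Mean residual:', mean_resid)
print('Std of residuals:', std_resid)
print('Skewness:', skewness)
```

With only five points this is purely illustrative. On the real test set the same three lines can be applied to `y_test - y_predictions`.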

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score

print('Mean Absolute Error: ', mean_absolute_error(y_test, y_predictions))
print('Mean Squared Error: ', mean_squared_error(y_test, y_predictions))
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(y_test, y_predictions)))
print('\nExplained Variance Score: ', explained_variance_score(y_true= y_test, y_pred= y_predictions))
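
To make these metrics less of a black box, the sketch below computes them by hand with NumPy on four made-up values (`y_true_toy` and `y_pred_toy` are assumptions for illustration). It also shows how the explained variance score differs from R²: explained variance ignores a constant bias in the errors, while R² penalizes it.

```python
import numpy as np

# Hypothetical true and predicted values
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true_toy - y_pred_toy            # [0.5, 0.0, -1.0, -0.5]

mae = np.mean(np.abs(errors))               # average absolute error
mse = np.mean(errors ** 2)                  # average squared error
rmse = np.sqrt(mse)                         # back in the units of the target

# Explained variance: 1 - Var(errors) / Var(y_true)
explained_variance = 1 - np.var(errors) / np.var(y_true_toy)

# R^2: 1 - sum of squared errors / total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true_toy - y_true_toy.mean()) ** 2)

print('MAE:', mae)                                  # 0.5
print('MSE:', mse)                                  # 0.375
print('RMSE:', rmse)
print('Explained variance:', explained_variance)    # 0.9375
print('R^2:', r2)                                   # 0.925
```

The two scores differ here because the errors have a nonzero mean (-0.25). When the mean error is exactly zero, they coincide.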