# **Car Price Prediction**

## **Overview:**
1. Introduction to the problem statement.
2. Importing required libraries and dataset.
3. Feature preprocessing and feature engineering.
4. Model building.
5. Model evaluation and prediction.

## **1. Introduction to Problem Statement**
This dataset contains information about used cars listed on www.cardekho.com

Task Details:<br>
Find the best equation for predicting a car's selling price using any of the available datasets.

Expected Submission:<br>
Equation for the task.

## **2. Importing Required Libraries and Dataset**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/vehicle-dataset-from-cardekho/car data.csv')
df.head()

In [None]:
#Print the shape of the dataset.
df.shape

In [None]:
# Getting the unique values of the categorical features
print(df['Fuel_Type'].unique())
print(df['Seller_Type'].unique())
print(df['Transmission'].unique())
print(df['Owner'].unique())




In [None]:
df.info()

In [None]:
df.describe()

* Checking for missing values

In [None]:
# Check for missing value
df.isnull().sum()

## **3. Feature Preprocessing and Engineering**

In [None]:
# Age of the car in years, measured against 2020 (the reference year used in this notebook)
df['#years'] = 2020 - df['Year']

In [None]:
# Now drop the year column
df.drop(columns= ['Year'], inplace = True)

In [None]:
# Drop the Car_Name column
df.drop(columns=['Car_Name'], inplace=True)

In [None]:
df.columns

In [None]:
df.head()

### **One Hot Encoding**

In [None]:
df = pd.get_dummies(df, drop_first=True)
df.head()
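
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical and only mimic the `Fuel_Type` column; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values only mimic the Fuel_Type column.
toy = pd.DataFrame({
    "Selling_Price": [3.35, 4.75, 7.25, 2.85],
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
})

# drop_first=True drops the first category in sorted order ("CNG" here),
# so a row whose dummy columns are all 0 represents that category.
encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Dropping one level per categorical feature avoids a redundant column, since the dropped category is implied when all remaining dummies are zero.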

In [None]:
df.corr()

In [None]:
plt.figure(figsize = (8,8))
sns.heatmap(df.corr(), annot = True, cmap = 'RdYlGn' )

### **Observation:**
* 'Present_Price', 'Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'Seller_Type_Individual' and 'Transmission_Manual' are the most important features for predicting the selling price, because they show the strongest correlation with it in the heatmap.
* 'Fuel_Type_Diesel' and 'Fuel_Type_Petrol' are highly correlated with each other, so one of them could be dropped. Both are kept here.
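
The heatmap can be hard to read when there are many columns. As a rough sketch, the correlations with the target can also be ranked numerically. The frame `toy` below is hypothetical and stands in for the encoded `df`; its values are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric frame standing in for the encoded `df`.
toy = pd.DataFrame({
    "Selling_Price": [3.35, 4.75, 7.25, 2.85, 4.60, 9.25],
    "Present_Price": [5.59, 9.54, 9.85, 4.15, 6.87, 9.83],
    "Kms_Driven": [27000, 43000, 6900, 5200, 42450, 2071],
    "Fuel_Type_Diesel": [0, 1, 0, 0, 1, 1],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = toy.corr()["Selling_Price"].drop("Selling_Price")

# Order by absolute value so strong negative correlations are not overlooked.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Correlation only captures linear association, so a ranking like this is a first screen rather than a definitive measure of feature importance.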

In [None]:
X = df.iloc[:,1:]
Y = df.iloc[:,0]

In [None]:
X.head()

In [None]:
Y.head()

In [None]:
# Feature Importances
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,Y)

In [None]:
print(model.feature_importances_)

In [None]:
feat_importances = pd.Series(model.feature_importances_, index = X.columns)
feat_importances.nlargest(5).plot(kind = 'barh')
plt.show()

## **4. Model Building**
* Random Forest model with RandomizedSearchCV

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)

In [None]:
# Hyperparameters
# Randomized search CV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]

# Number of features to consider for the best split
# (1.0 means all features; the former 'auto' option was removed in scikit-learn 1.3)
max_features = [1.0, 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(start = 5, stop = 30, num = 6)]

#The minimum number of samples required to split an internal node:
min_samples_split = [2,5,10,15,100]

#The minimum number of samples required to be at a leaf node.
min_samples_leaf = [1,2,5,10]

# Create the random grid.
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split':min_samples_split,
               'min_samples_leaf':min_samples_leaf}
print(random_grid)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
rf = RandomForestRegressor()


In [None]:
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, scoring = 'neg_mean_squared_error',
                               n_iter = 10, cv = 5, verbose = 1, random_state = 42, n_jobs = 1)


In [None]:
rf_random.fit(X_train,y_train)

In [None]:
# Best parameters found by the randomized search
rf_random.best_params_

## **5. Model Evaluation and Prediction**

In [None]:
y_pred = rf_random.predict(X_test)
y_pred

In [None]:
from sklearn.metrics import mean_squared_error
# Take the square root explicitly; the squared=False argument is deprecated since scikit-learn 1.4
rmse_value = np.sqrt(mean_squared_error(y_test, y_pred))
rmse_value

In [None]:
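# distplot is deprecated in seaborn; histplot with kde=True gives a comparable view of the residuals
sns.histplot(y_test - y_pred, kde=True)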


### **Observation:**
The residuals (true value minus predicted value) in the plot above are roughly normally distributed and centered near zero. This suggests that most prediction errors on the test set are small and that the model does not systematically over- or under-predict, which is a good sign that it generalizes to unseen data.
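
The visual impression can be backed up with two summary numbers. The following is a minimal sketch; the arrays `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`, with made-up values.

```python
import numpy as np

# Hypothetical true and predicted prices; in the notebook these would be
# y_test and y_pred.
y_true = np.array([3.35, 4.75, 7.25, 2.85, 4.60, 9.25])
y_hat = np.array([3.10, 5.00, 7.00, 3.05, 4.40, 9.60])

residuals = y_true - y_hat
residual_mean = residuals.mean()       # near 0 suggests little systematic bias
residual_std = residuals.std(ddof=1)   # sample standard deviation of the errors

print(f"mean={residual_mean:.3f}, std={residual_std:.3f}")
```

A mean close to zero relative to the standard deviation is consistent with unbiased errors, although it does not by itself show that the residuals are normally distributed.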

In [None]:
plt.scatter(y_test,y_pred)

### **Observation:**
The scatterplot above shows a strong relationship between the true values and the predicted values.
The predictions are therefore close to the true selling prices.
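
The strength of that relationship can also be expressed as a number. Below is a rough sketch that computes the coefficient of determination (R²) and the Pearson correlation with NumPy only. The arrays `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`.

```python
import numpy as np

# Hypothetical true and predicted prices; in the notebook these would be
# y_test and y_pred.
y_true = np.array([3.35, 4.75, 7.25, 2.85, 4.60, 9.25])
y_hat = np.array([3.10, 5.00, 7.00, 3.05, 4.40, 9.60])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Pearson correlation between true and predicted values
pearson_r = np.corrcoef(y_true, y_hat)[0, 1]

print(f"R^2={r2:.3f}, r={pearson_r:.3f}")
```

R² penalizes both scatter and systematic offset, while the Pearson correlation only measures how linearly the two series move together, so reporting both gives a fuller picture than the plot alone.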

In [None]:
#import pickle

# open a file where you want to store the data.
#file = open('/content/gdrive/My Drive/Colab Notebooks/Projects/Vehicle Car Prediction/random_forest_model.pkl', 'wb')

# dump information to that file
#pickle.dump(rf_random, file)
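
For reference, a minimal sketch of a full save-and-reload round trip with `pickle` is shown here. It assumes a small model trained on made-up data and writes to a temporary directory; `X_demo`, `y_demo`, and the file location are hypothetical and would be replaced by the fitted `rf_random` and a path of your choice.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data; in the notebook this would be X_train, y_train.
rng = np.random.default_rng(0)
X_demo = rng.random((40, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5])

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory; the location is illustrative only.
model_path = Path(tempfile.mkdtemp()) / "random_forest_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload the model and confirm it reproduces the original predictions.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Using a `with` block closes the file automatically. Pickled models should only be loaded from trusted sources and with a compatible scikit-learn version.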