# Used Car Data Price Prediction

Builds more than ten popular ML regression models to predict the price of used cars in the Indian market.

Let's design the flow of computation:
1. Supervised, unsupervised, or reinforcement learning? Ans. Supervised, because the data is labeled.
2. Classification, regression, or something else? Ans. Univariate multiple regression, because we predict a single variable (selling price) from multiple features.
3. Batch or online learning? Ans. Plain batch learning, as the data is provided once and there is no continuous inflow.
4. Performance measure? Ans. Root Mean Square Error (RMSE), a typical performance measure for regression problems. Mean Absolute Error (MAE) will also be considered.
5. Hypothesis: The selling price of a car will be higher if it is less driven, from a premium brand, sold by its first owner, from a recent year, and has an automatic transmission.
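
The two performance measures named in step 4 can be sketched in a few lines of NumPy. This is only an illustration: the helper names `rmse` and `mae` and the toy prices are assumptions, not part of the dataset or of the notebook's own code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; squaring weights large errors more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error; every error counts in proportion to its size."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical prices in rupees, for illustration only
actual = [100000, 250000, 400000]
predicted = [110000, 240000, 430000]

print("RMSE:", rmse(actual, predicted))
print("MAE: ", mae(actual, predicted))
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two hints at a few large errors.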

<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Import libraries & dataset](#1)
1. [EDA](#2)
1. [Preparing for modeling](#3)
1. [ML models](#4)
    -  [Linear Regression](#4.1)
    -  [Ridge Regression](#4.2)
    -  [K Neighbors Regressor](#4.3)
    -  [SVR](#4.4)
    -  [Stochastic Gradient Descent](#4.5)
    -  [Decision Tree Regressor](#4.6)
    -  [Random Forest](#4.7)
    -  [Gradient Boosting](#4.8)
    -  [XG Boost](#4.9)
    -  [ExtraTreesRegressor](#4.10)
    -  [VotingRegressor](#4.11)

## 1. Import libraries & dataset <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
car_details = pd.read_csv('../input/vehicle-dataset-from-cardekho/CAR DETAILS FROM CAR DEKHO.csv')
car_details.head()

So, we have the details of each car along with its second-hand selling price and year.

In [None]:
# Show the non-null count and dtype of each column
car_details.info()

## 2. EDA <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
# A good library to get basic EDA done
import seaborn as sns
print(car_details.describe())

# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(car_details.selling_price, kde=True, color='blue')

In [None]:
# Pass the column by keyword so the call works across seaborn versions
sns.boxplot(x=car_details.iloc[:, 1])

In [None]:
sns.pairplot(car_details)

This gives a sense of the correlation among the integer-valued variables: the price tends to be highest for cars that are less driven and from recent years.
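
The visual impression from the pair plot can also be checked numerically. The sketch below uses a small made-up table (the values are hypothetical and not taken from the CarDekho file) and `DataFrame.corr()`, which computes Pearson correlation by default.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "year": [2010, 2012, 2014, 2016, 2018],
    "km_driven": [90000, 70000, 60000, 30000, 10000],
    "selling_price": [150000, 220000, 300000, 450000, 600000],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr()

# Correlation of every column with the selling price, strongest first
print(corr["selling_price"].sort_values(ascending=False))
```

On the real data the same call on the numeric columns would show whether `year` correlates positively and `km_driven` negatively with the price, as the plots suggest.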

In [None]:
# car_details["is_duplicate"]= car_details.duplicated()
# car_details["is_duplicate"].value_counts()
#car_details.drop(car_details[car_details['is_duplicate'] == True].index, inplace = True)
#By default, for each set of duplicated values, the first occurrence is set on False and all others on True.
#value_counts is a Series method rather than a DataFrame method
#car_details[(car_details.name == 'Hyundai Verna SX') & (car_details.year == 2007)]

There are duplicate rows, so let's drop them.

Python's `or` and `and` keywords require a single truth value. pandas treats the truth value of a Series as ambiguous, so use the element-wise operators `|` (or) and `&` (and) when combining conditions, as in the commented-out filter above.
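
A minimal sketch of the difference, on a made-up three-row frame (the rows are hypothetical; only the `name` and `year` column names follow the dataset):

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "name": ["Hyundai Verna SX", "Hyundai Verna SX", "Maruti Alto LX"],
    "year": [2007, 2012, 2007],
})

# Each comparison yields a boolean Series. The parentheses are required
# because & and | bind more tightly than == in Python.
both = toy[(toy["name"] == "Hyundai Verna SX") & (toy["year"] == 2007)]
either = toy[(toy["name"] == "Hyundai Verna SX") | (toy["year"] == 2007)]

# The keyword `and` asks for a single truth value of a whole Series,
# which pandas refuses to guess, so it raises ValueError.
try:
    toy[(toy["name"] == "Hyundai Verna SX") and (toy["year"] == 2007)]
except ValueError as err:
    print("ValueError:", err)

print(len(both), len(either))
```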

In [None]:
car_details.drop_duplicates(inplace = True)
car_details1 = car_details.reset_index()
len(car_details)
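
How `duplicated()` and `drop_duplicates()` treat repeated rows can be seen on a tiny made-up frame (the values are hypothetical). Note that this sketch uses `reset_index(drop=True)`, which discards the old index, whereas the cell above keeps it as an `index` column.

```python
import pandas as pd

# Hypothetical rows: the first, second and fourth are identical
toy = pd.DataFrame({
    "name": ["A", "A", "B", "A"],
    "year": [2010, 2010, 2012, 2010],
})

# By default the first occurrence is marked False and later repeats True
flags = toy.duplicated()
print(flags.value_counts())

# Keep one copy of each row and renumber the index from 0
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```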

In [None]:
car_details1.describe()

Categorical Plots

In [None]:
categorical_variables = car_details1.select_dtypes(exclude=["number"]).columns

In [None]:
# Categorical plots in the shape of Violin
for i in categorical_variables:
    if(len(car_details1[i].unique())<10 and len(car_details1[i].unique())>0):
        sns.catplot(x="selling_price", y=i,kind="violin", split=True, data=car_details1)

In [None]:
#Pair plot for integer variables
sns.pairplot(car_details1.iloc[:,2:5])

In [None]:
car_details1['name'].value_counts()

Many individual car models have very little data, so I will group cars by company and model name instead.

In [None]:
# Keep the company and parent model name; ignore the exact variant
# (vectorized: split on whitespace, keep the first two words, join them again)
car_details1['name_model'] = car_details1['name'].str.split().str[:2].str.join(' ')

In [None]:
car_details1.name_model.value_counts()

In [None]:
#Drop models whose count is less than 5
counts = car_details1['name_model'].value_counts()

car_details1 = car_details1[~car_details1['name_model'].isin(counts[counts < 5].index)]

In [None]:
categorical_variables
# We will not one-hot encode the name variable, as it would result in 188 more columns

In [None]:
#one-hot encoding, adding dummy-variables
car_details2 = pd.get_dummies(car_details1, columns=categorical_variables[1:])

In [None]:
car_details2.info()

In [None]:
#Now, drop the columns not required for further analysis
car_details2.drop(columns = ['index','name'], inplace = True)
car_details2.head()

In [None]:
# Label encoding of model names
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(list(car_details2.name_model))
car_details2.name_model = le.transform(list(car_details2.name_model))
car_details2.head()
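
The two encodings used above behave differently, which a toy frame makes visible. The values below are hypothetical; only the column names `fuel` and `name_model` follow the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "Diesel"],
    "name_model": ["Maruti Swift", "Hyundai i20", "Maruti Swift", "Tata Indica"],
})

# One-hot encoding: one new column per category of `fuel`
one_hot = pd.get_dummies(toy, columns=["fuel"])
print(list(one_hot.columns))

# Label encoding: one integer per category, assigned in sorted order
codes = LabelEncoder().fit_transform(toy["name_model"])
print(codes)
```

One-hot encoding grows the column count with the number of categories, which is why it is avoided for the model names. Label encoding keeps a single column, at the cost of an arbitrary integer order that linear and distance-based models may read as meaningful.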

In [None]:
car_details2.info()
#Now all the columns are in numeric format

In [None]:
#Let's figure out the heat map among the features
plt.figure(figsize=(24,24)) # For an appropriate size

sns.heatmap(car_details2.corr(),annot=True,cmap='summer') # annot=True shows the value in each box

In [None]:
#sns.pairplot(car_details2)

In [None]:
#For models from Sklearn
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train = pd.DataFrame(scaler.fit_transform(car_details2), columns = car_details2.columns)
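
The cell above scales the whole table, target included, before any train/test split, so the test rows contribute to the scaler's mean and variance. An alternative, sketched here on synthetic data and not the approach this notebook follows, is to put the scaler in a pipeline so it is fitted on the training rows only. All data and variable names below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit() learns the scaling statistics from X_tr only; score() reuses
# those statistics to transform X_te before predicting.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)
print(f"test-set R^2: {r2:.3f}")
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, in which case the scaler is refitted inside every fold.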

## 3. Preparing for modeling <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In [None]:
target = train.selling_price
features = train.drop(columns = ['selling_price'])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, stratify = features.name_model)

## 4. ML models <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

### 4.1 Linear Regression <a class="anchor" id="4.1"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Linear Regression
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
linreg.score(X_test,y_test)

In [None]:
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(LinearRegression(), X_train, y_train, cv=10))

### 4.2 Ridge Regression <a class="anchor" id="4.2"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': np.logspace(-3, 3, 13)}

In [None]:
# The `iid` argument was removed from GridSearchCV in scikit-learn 0.24
grid = GridSearchCV(Ridge(), param_grid, cv=10, return_train_score=True)
grid.fit(X_train, y_train)

In [None]:
grid.score(X_test, y_test)

In [None]:
np.mean(cross_val_score(Ridge(), X_train, y_train, cv=10))

### 4.3 K Neighbors Regressor <a class="anchor" id="4.3"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
neighbors = range(1, 30, 2)

training_scores = []
test_scores = []
for n_neighbors in neighbors:
    knn = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_train, y_train)
    training_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

In [None]:
plt.figure()
plt.plot(neighbors, training_scores, label="training scores")
plt.plot(neighbors, test_scores, label="test scores")
plt.ylabel("R^2 score")  # score() of a regressor returns R^2, not accuracy
plt.xlabel("n_neighbors")
plt.legend()

In [None]:
knn = KNeighborsRegressor(n_neighbors=7)
score = cross_val_score(knn, X_train, y_train, cv=10)
# Report the mean over the 10 folds, as in the earlier cross-validation cells
print(f"mean cross-validation score: {np.mean(score):.3f}")

knn.fit(X_train, y_train)
print(f"test-set score: {knn.score(X_test, y_test):.3f}")

### 4.4 SVR <a class="anchor" id="4.4"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train, y_train)
print(f"test-set score: {svr.score(X_test, y_test):.3f}")

In [None]:
svr1 = SVR(kernel='poly')
svr1.fit(X_train, y_train)
print(f"test-set score: {svr1.score(X_test, y_test):.3f}")

### 4.5 Stochastic Gradient Descent <a class="anchor" id="4.5"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor()
sgd.fit(X_train, y_train)
print(f"test-set score: {sgd.score(X_test, y_test):.3f}")

### 4.6 Decision Tree Regressor <a class="anchor" id="4.6"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
print(f"test-set score: {dtr.score(X_test, y_test):.3f}")

### 4.7 Random Forest <a class="anchor" id="4.7"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
print(f"test-set score: {rfr.score(X_test, y_test):.3f}")

### 4.8 Gradient Boosting <a class="anchor" id="4.8"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
print(f"test-set score: {gbr.score(X_test, y_test):.3f}")

### 4.9 XG Boost <a class="anchor" id="4.9"></a>

[Back to Table of Contents](#0.1)

In [None]:
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train,y_train)
print(f"test-set score: {xgb.score(X_test, y_test):.3f}")

### 4.10 Extra Trees Regressor <a class="anchor" id="4.10"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor()
etr.fit(X_train, y_train)
print(f"test-set score: {etr.score(X_test, y_test):.3f}")

### 4.11 Voting Regressor <a class="anchor" id="4.11"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.ensemble import VotingRegressor
vr = VotingRegressor(estimators=[('rfr', rfr), ('gbr', gbr), ('xgb', xgb)])
vr.fit(X_train,y_train)
print(f"test-set score: {vr.score(X_test, y_test):.3f}")

In [None]:
from sklearn.metrics import mean_squared_error

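# Compare the ensemble models by mean squared error on the test set
# (the target was standardized, so the errors are in standardized units)
for model in (rfr, gbr, xgb, vr):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(model.__class__.__name__, mean_squared_error(y_test, y_pred))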
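
Because `selling_price` was standardized together with the features, the errors above are in standardized units, not rupees. A rough sketch of converting an RMSE back to the original scale follows. The prices and prediction offsets are hypothetical, and the scaler here is fitted on the price column alone. With the notebook's `scaler`, the entry of `scaler.scale_` at the position of `selling_price` would play the same role.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical prices in rupees, for illustration only
prices = np.array([[100000.0], [250000.0], [400000.0], [650000.0]])
scaler = StandardScaler().fit(prices)
scaled = scaler.transform(prices)

# Pretend predictions in standardized units (hypothetical offsets)
pred_scaled = scaled + np.array([[0.1], [-0.1], [0.05], [-0.05]])

# RMSE in standardized units, then rescaled by the column's standard deviation
rmse_scaled = np.sqrt(np.mean((scaled - pred_scaled) ** 2))
rmse_rupees = rmse_scaled * scaler.scale_[0]

# Cross-check: undo the scaling first, then compute the RMSE directly
pred_rupees = scaler.inverse_transform(pred_scaled)
rmse_check = np.sqrt(np.mean((prices - pred_rupees) ** 2))

print(f"RMSE (standardized): {rmse_scaled:.4f}")
print(f"RMSE (rupees):       {rmse_rupees:.2f}")
```

The shortcut works because standardization only shifts and rescales the target, so differences between predictions and true values are multiplied by the same constant.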