## Used Car Price Prediction

### 1) Problem statement.
    This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.
    The goal is to predict the price of a car from its input features.
    The predictions can be used to suggest a price to new sellers based on market conditions.
    
### 2) Data Collection.
    The dataset was collected by scraping the cardekho.com website.
    The data consists of 13 columns and 15411 rows.


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

In [5]:
df = pd.read_csv(r"cardekho.csv", index_col=[0])

In [6]:
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


### Data Cleaning

#### Handling Missing values

1. Handling Missing values
2. Handling Duplicates
3. Check data type
4. Understand the dataset
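The four checks listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: the `sample` frame is a hypothetical stand-in built from a few rows of the `df.head()` output, with one row deliberately repeated so the duplicate check has something to find.

```python
import pandas as pd

# Hypothetical five-row sample: values copied from the df.head() output,
# with the first row repeated on purpose as the last row.
sample = pd.DataFrame(
    {
        "model": ["Alto", "Grand", "i20", "Alto", "Alto"],
        "vehicle_age": [9, 5, 11, 9, 9],
        "km_driven": [120000, 20000, 60000, 37000, 120000],
        "fuel_type": ["Petrol", "Petrol", "Petrol", "Petrol", "Petrol"],
        "selling_price": [120000, 550000, 215000, 226000, 120000],
    }
)

# 1. Missing values: count nulls per column.
missing_counts = sample.isnull().sum()

# 2. Duplicates: count fully repeated rows, then drop them.
duplicate_count = int(sample.duplicated().sum())
deduplicated = sample.drop_duplicates()

# 3. Data types: confirm numeric columns were not read as strings.
column_dtypes = sample.dtypes

# 4. Understand the dataset: summary statistics for numeric columns.
summary = deduplicated.describe()

print(missing_counts, duplicate_count, column_dtypes, summary, sep="\n\n")
```

On the real data the same calls would be applied to `df` directly; whether duplicates should be dropped depends on whether repeated listings are genuine.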

In [7]:
## Check Null values
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [8]:
# Remove unnecessary columns
df.drop(['car_name', 'brand'], axis=1, inplace=True)

In [9]:
df.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [10]:
## Getting all different types of features
num_features=[feature for feature in df.columns if df[feature].dtype != 'O']
print('Number of numerical features: ', len(num_features))
cat_features=[feature for feature in df.columns if df[feature].dtype == 'O']
print('Number of categorical features: ', len(cat_features))
discrete_features=[feature for feature in num_features if len(df[feature].unique()) <= 25]
print('Number of discrete features: ', len(discrete_features))
continuous_features=[feature for feature in num_features if feature not in discrete_features]
print('Number of continuous features: ', len(continuous_features))

Number of numerical features:  7
Number of categorical features:  4
Number of discrete features:  2
Number of continuous features:  5


In [11]:
#Independent and dependent features
from sklearn.model_selection import train_test_split
X=df.drop(['selling_price'], axis=1)
y=df['selling_price']

In [12]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


## Feature Encoding and Scaling

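One-hot encoding is used for the categorical columns that have few unique values and no natural order. The `model` column has many unique values, so it is label-encoded instead.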


In [13]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
X['model']=le.fit_transform(X['model'])

In [14]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


In [15]:
# Column transformer with two transformers: one-hot encoding and standard scaling
num_features = X.select_dtypes(exclude="object").columns
onehot_columns = ['seller_type', 'fuel_type', 'transmission_type']

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop = "first")

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, onehot_columns),
        ("StandardScaler", numeric_transformer, num_features)
    ], remainder='passthrough'
)

In [16]:
X = preprocessor.fit_transform(X)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
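
In the cells above, the preprocessor is fitted on all rows before the split, so the scaler's mean and variance also reflect the test rows. One common alternative, sketched here under assumptions, is to wrap the preprocessing and the model in a `Pipeline` and fit it on the training split only. The `toy` data is synthetic (randomly generated, with a made-up linear price formula), and `handle_unknown="ignore"` replaces `drop="first"` so that a category missing from the training split does not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the cardekho dataset).
rng = np.random.default_rng(0)
n_rows = 40
toy = pd.DataFrame(
    {
        "vehicle_age": rng.integers(1, 15, size=n_rows),
        "km_driven": rng.integers(5_000, 150_000, size=n_rows),
        "fuel_type": rng.choice(["Petrol", "Diesel"], size=n_rows),
    }
)
# Made-up linear price, used only so the example has a target to fit.
toy_price = 900_000 - 40_000 * toy["vehicle_age"] - 2 * toy["km_driven"]

numeric_columns = ["vehicle_age", "km_driven"]
toy_preprocessor = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
        ("scale", StandardScaler(), numeric_columns),
    ]
)
pipeline = Pipeline(
    [("preprocess", toy_preprocessor), ("model", LinearRegression())]
)

# Split first, then fit: the scaler only ever sees the training rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_price, test_size=0.2, random_state=10
)
pipeline.fit(X_tr, y_tr)
test_r2 = pipeline.score(X_te, y_te)

fitted_scaler = pipeline.named_steps["preprocess"].named_transformers_["scale"]
print("Scaler means (training rows only):", fitted_scaler.mean_)
print("Test R2 on the toy data:", round(test_r2, 4))
```

The same pipeline object can later be passed to a search such as `RandomizedSearchCV`, which then refits the preprocessing inside each cross-validation fold.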

## Model Training and Model Selection


In [18]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [19]:
# Function to evaluate a model
def evaluate_model(true, predicted):
    """Return MAE, RMSE and R2 score for the given true and predicted values."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it twice
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
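
As a quick sanity check of what these three metrics measure, the sketch below computes them on a tiny hand-made example, both with scikit-learn and directly from their definitions. The arrays `y_true` and `y_pred` are invented for illustration and have nothing to do with the car data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values: every prediction is off by exactly 10.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# Library versions, as used in evaluate_model.
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# The same quantities from their definitions.
residuals = y_true - y_pred
mae_manual = np.mean(np.abs(residuals))       # mean of |error|
rmse_manual = np.sqrt(np.mean(residuals**2))  # sqrt of mean squared error
ss_res = np.sum(residuals**2)                 # residual sum of squares = 400
ss_tot = np.sum((y_true - y_true.mean())**2)  # total sum of squares = 50000
r2_manual = 1 - ss_res / ss_tot               # 1 - 400/50000 = 0.992

print(mae, rmse, r2)
```

MAE and RMSE are in the units of the target (rupees in this notebook), while R2 is unitless. RMSE penalises large errors more heavily, which is why it sits well above MAE in the results reported here.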

In [20]:
# Model training

models={
    "Linear Regression":LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Adaboost": AdaBoostRegressor()
    
}
for name, model in models.items():
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on train and test data
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)
    
    print('Model performance for training set')
    print("- Root Mean Squared error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))
          
    print('------------------------------------')
     
    print('Model performance for test set')
    print("- Root Mean Squared error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    
    print("\n")
    

Linear Regression
Model performance for training set
- Root Mean Squared error: 478801.8476
- Mean Absolute Error: 260031.9512
- R2 Score: 0.6763
------------------------------------
Model performance for test set
- Root Mean Squared error: 753414.8310
- Mean Absolute Error: 280915.9804
- R2 Score: 0.5123


Decision Tree
Model performance for training set
- Root Mean Squared error: 19593.3765
- Mean Absolute Error: 4814.6279
- R2 Score: 0.9995
------------------------------------
Model performance for test set
- Root Mean Squared error: 401655.2177
- Mean Absolute Error: 131480.6087
- R2 Score: 0.8614


Random Forest
Model performance for training set
- Root Mean Squared error: 88318.0744
- Mean Absolute Error: 38406.7241
- R2 Score: 0.9890
------------------------------------
Model performance for test set
- Root Mean Squared error: 529494.0135
- Mean Absolute Error: 109963.7582
- R2 Score: 0.7591


Lasso
Model performance for training set
- Root Mean Squared error: 478801.8527
- Mean

In [21]:
#Parameters for Hyperparameter tuning
knn_params = {"n_neighbors": [2, 3, 10, 20, 40, 50]}
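# Note: max_features="auto" was removed in scikit-learn 1.3; for regressors it is equivalent to 1.0
rf_params = {"max_depth": [5, 8, 15, None, 10],
             "max_features": [5, 7, "auto", 8],
             "min_samples_split": [2, 8, 15, 20],
             "n_estimators": [100, 200, 500, 1000]}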
ada_params = {
    "n_estimators": [50, 60, 70, 80],
    "loss": ["linear", "square", "exponential"]
}

In [22]:
#Models List for hyperparameter tuning
randomcv_models = [('KNN', KNeighborsRegressor(), knn_params),
                    ("RF", RandomForestRegressor(), rf_params),
                   ("AD", AdaBoostRegressor(), ada_params)
                  ]

In [23]:
# Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV

model_param = {}
for name, model, params in randomcv_models:
    random = RandomizedSearchCV(estimator=model,
                                param_distributions=params,
                                n_iter=100,
                                cv=3,
                                verbose=2,
                                n_jobs=-1)
    random.fit(X_train, y_train)
    model_param[name] = random.best_params_
    
for model_name in model_param:
    print(f"------------- Best Params for {model_name} -----------")
    print(model_param[model_name])

Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END ......................................n_neighbors=2; total time=   0.1s
[CV] END .....................................n_neighbors=20; total time=   0.3s
[CV] END max_depth=15, max_features=auto, min_samples_split=15, n_estimators=1000; total time=  17.8s
[CV] END max_depth=8, max_features=auto, min_samples_split=15, n_estimators=500; total time=   6.6s
[CV] END max_depth=10, max_features=auto, min_samples_split=8, n_estimators=500; total time=   7.6s
[CV] END max_depth=10, max_features=8, min_samples_split=8, n_estimators=100; total time=   1.0s
[CV] END max_depth=10, max_features=8, min_samples_split=8, n_estimators=100; total time=   1.1s
[CV] END max_depth=10, max_features=8, min_samples_split=8, n_estimators=200; total time=   2.0s
[CV] END max_depth=15, max_features=auto, min_samples_split=2, n_estimators=1000; total time=  26.8s
[CV] END max_depth=10

[CV] END ......................................n_neighbors=3; total time=   0.1s
[CV] END .....................................n_neighbors=20; total time=   0.3s
[CV] END .....................................n_neighbors=50; total time=   0.3s
[CV] END max_depth=15, max_features=auto, min_samples_split=15, n_estimators=1000; total time=  17.9s
[CV] END max_depth=8, max_features=auto, min_samples_split=15, n_estimators=500; total time=   6.6s
[CV] END max_depth=10, max_features=auto, min_samples_split=8, n_estimators=500; total time=   7.5s
[CV] END max_depth=15, max_features=5, min_samples_split=20, n_estimators=500; total time=   4.3s
[CV] END max_depth=10, max_features=auto, min_samples_split=15, n_estimators=500; total time=   7.6s
[CV] END max_depth=5, max_features=auto, min_samples_split=20, n_estimators=1000; total time=  11.4s
[CV] END max_depth=None, max_features=7, min_samples_split=8, n_estimators=500; total time=   7.8s
[CV] END max_depth=10, max_features=8, min_samples_split

[CV] END ......................................n_neighbors=2; total time=   0.1s
[CV] END .....................................n_neighbors=20; total time=   0.4s
[CV] END max_depth=None, max_features=auto, min_samples_split=15, n_estimators=200; total time=   3.7s
[CV] END max_depth=10, max_features=8, min_samples_split=20, n_estimators=200; total time=   1.9s
[CV] END max_depth=5, max_features=auto, min_samples_split=2, n_estimators=200; total time=   1.8s
[CV] END max_depth=8, max_features=8, min_samples_split=20, n_estimators=1000; total time=   8.0s
[CV] END max_depth=8, max_features=8, min_samples_split=15, n_estimators=1000; total time=   8.2s
[CV] END max_depth=15, max_features=8, min_samples_split=8, n_estimators=500; total time=   6.3s
[CV] END max_depth=None, max_features=auto, min_samples_split=8, n_estimators=500; total time=  10.8s
[CV] END max_depth=5, max_features=auto, min_samples_split=20, n_estimators=1000; total time=  10.5s
[CV] END max_depth=10, max_features=auto, 
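
Rather than copying the best parameters by hand into a new estimator, the fitted search object can be used directly, since `RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`). The sketch below shows this on synthetic data. `X_toy`, `y_toy` and the reduced `n_iter` are assumptions chosen so the example runs in a moment; they are not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (NOT the cardekho dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=120)

search = RandomizedSearchCV(
    estimator=KNeighborsRegressor(),
    param_distributions={"n_neighbors": [2, 3, 10, 20, 40, 50]},
    n_iter=4,        # sample 4 of the 6 candidates
    cv=3,
    random_state=0,  # makes the sampled candidates reproducible
)
search.fit(X_toy, y_toy)

best_params = search.best_params_     # dict of the winning settings
tuned_model = search.best_estimator_  # already refit on all of X_toy, y_toy
preds = tuned_model.predict(X_toy[:5])

print(best_params, round(search.best_score_, 4))
```

When the grid holds fewer combinations than `n_iter`, as with the six KNN candidates in the log above, the search simply evaluates every combination.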

In [24]:
# Retraining the models with the best parameters

models={
    # Note: max_features='auto' was removed in scikit-learn 1.3; for regressors it is equivalent to 1.0
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100,min_samples_split=2,max_features='auto',max_depth=None,n_jobs=-1),
    "K-Neighbors Regressor": KNeighborsRegressor(n_neighbors=10, n_jobs=-1),
    "Adaboost Regressor": AdaBoostRegressor(n_estimators = 60, loss = 'linear')
    
}
for name, model in models.items():
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on train and test data
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)
    
    print('Model performance for training set')
    print("- Root Mean Squared error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))
          
    print('------------------------------------')
     
    print('Model performance for test set')
    print("- Root Mean Squared error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    
    print("\n")
    

Random Forest Regressor
Model performance for training set
- Root Mean Squared error: 92642.4408
- Mean Absolute Error: 38392.5422
- R2 Score: 0.9879
------------------------------------
Model performance for test set
- Root Mean Squared error: 524471.3731
- Mean Absolute Error: 109447.9588
- R2 Score: 0.7636


K-Neighbors Regressor
Model performance for training set
- Root Mean Squared error: 273708.3661
- Mean Absolute Error: 101227.7062
- R2 Score: 0.8942
------------------------------------
Model performance for test set
- Root Mean Squared error: 603546.5492
- Mean Absolute Error: 124324.9108
- R2 Score: 0.6870


Adaboost Regressor
Model performance for training set
- Root Mean Squared error: 491812.6059
- Mean Absolute Error: 412955.3291
- R2 Score: 0.6585
------------------------------------
Model performance for test set
- Root Mean Squared error: 733460.3869
- Mean Absolute Error: 419973.8915
- R2 Score: 0.5378


