## ABOUT DATASET
### The used car market in India is dynamic. Prices can fluctuate widely with the make and model of the car, its mileage, its condition, and current market conditions, so it can be difficult for sellers to price their cars accurately. This dataset contains information about used cars and can be used for tasks such as used car price prediction with different machine learning techniques.

### Data description (information about the features)
#### car_name: Full name of the car, including the brand and the specific model name.
#### brand: Brand name of the car.
#### model: Exact model name of the car within its brand.
#### seller_type: Type of seller offering the used car.
#### fuel_type: Fuel used by the car that was put up for sale.
#### transmission_type: Transmission of the car that was put up for sale. Manual: the driver shifts the gears. Automatic: the car shifts the gears for the driver.
#### vehicle_age: Number of years since the car was bought.
#### mileage: Number of kilometers the car runs per litre of fuel.
#### engine: Engine capacity in cc (cubic centimeters).
#### max_power: Maximum power the engine produces, in BHP.
#### seats: Total number of seats in the car.
#### selling_price: Sale price listed on the website.

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [51]:
df=pd.read_csv('cardekho_dataset.csv.zip')

In [52]:
df.head()

,Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [53]:
df.columns,len(df.columns)

(Index(['Unnamed: 0', 'car_name', 'brand', 'model', 'vehicle_age', 'km_driven',
        'seller_type', 'fuel_type', 'transmission_type', 'mileage', 'engine',
        'max_power', 'seats', 'selling_price'],
       dtype='object'),
 14)

In [54]:
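# 'Unnamed: 0' is a leftover index column from the CSV export.
df.drop(columns=['Unnamed: 0'], inplace=True)
# car_name and brand are dropped; only 'model' is kept to identify the car.
df.drop(columns=['car_name', 'brand'], inplace=True)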

In [55]:
len(df.columns)

11

In [56]:
df.head()

,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [57]:
df.isnull().sum()  # there are no null values in the dataset

model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15411 entries, 0 to 15410
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   model              15411 non-null  object 
 1   vehicle_age        15411 non-null  int64  
 2   km_driven          15411 non-null  int64  
 3   seller_type        15411 non-null  object 
 4   fuel_type          15411 non-null  object 
 5   transmission_type  15411 non-null  object 
 6   mileage            15411 non-null  float64
 7   engine             15411 non-null  int64  
 8   max_power          15411 non-null  float64
 9   seats              15411 non-null  int64  
 10  selling_price      15411 non-null  int64  
dtypes: float64(2), int64(5), object(4)
memory usage: 1.3+ MB


In [59]:
df.describe()

,vehicle_age,km_driven,mileage,engine,max_power,seats,selling_price
count,15411.0,15411.0,15411.0,15411.0,15411.0,15411.0,15411.0
mean,6.036338,55616.48,19.701151,1486.057751,100.588254,5.325482,774971.1
std,3.013291,51618.55,4.171265,521.106696,42.972979,0.807628,894128.4
min,0.0,100.0,4.0,793.0,38.4,0.0,40000.0
25%,4.0,30000.0,17.0,1197.0,74.0,5.0,385000.0
50%,6.0,50000.0,19.67,1248.0,88.5,5.0,556000.0
75%,8.0,70000.0,22.7,1582.0,117.3,5.0,825000.0
max,29.0,3800000.0,33.54,6592.0,626.0,9.0,39500000.0


In [60]:
len(df['model'].unique())

120

In [61]:
## numerical features in the dataset
num_feature=[feature for feature in df.columns if df[feature].dtype!='O']
print("Number of numerical features are",len(num_feature))

## categorical features
cat_feature = [feature for feature in df.columns if df[feature].dtype=='O']
print("Number of categorical features are",len(cat_feature))

## discrete features (numerical features with at most 25 unique values)
dis_feature = [feature for feature in num_feature if len(df[feature].unique())<=25]
print("Number of discrete features are",len(dis_feature))

## continuous features
con_feature = [feature for feature in num_feature if feature not in dis_feature]
print("Number of continuous features are",len(con_feature))


Number of numerical features are 7
Number of categorical features are 4
Number of discrete features are 2
Number of continuous features are 5
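
The split above relies on two rules: a column is numerical when its dtype is not `object`, and a numerical column counts as discrete when it has at most 25 unique values. A minimal sketch of the same idea on a toy frame follows; the frame, its values, and the `threshold` of 3 are illustrative assumptions, and `select_dtypes(include="number")` is swapped in for the dtype comparison.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook.
toy = pd.DataFrame({
    "seats": [5, 5, 7, 4],
    "mileage": [19.7, 18.9, 17.0, 20.92],
    "fuel_type": ["Petrol", "Petrol", "Diesel", "CNG"],
})

threshold = 3  # hypothetical cut-off; the notebook uses 25 on the full data

numeric = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# A numeric column with few distinct values is treated as discrete.
discrete = [c for c in numeric if toy[c].nunique() <= threshold]
continuous = [c for c in numeric if c not in discrete]

print(numeric, categorical, discrete, continuous)
```

The cut-off is a heuristic: a continuous column observed on a small sample can fall below it, so the resulting lists are worth checking by eye.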


In [62]:
len(df['seller_type'].unique()) ,len(df['fuel_type'].unique()),len(df['transmission_type'].unique())

(3, 5, 2)

In [63]:
X=df.iloc[:,:-1].copy()  # copy so that encoding 'model' below does not write into a slice of df
y=df.iloc[:,-1]

In [64]:
## label encoding of the model feature
from sklearn.preprocessing import LabelEncoder # label encoding is used because 'model' has many unique values (120); one-hot encoding would add too many columns
le=LabelEncoder()
X['model']=le.fit_transform(X['model'])
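
`LabelEncoder` is documented by scikit-learn as an encoder for target labels. For a feature column, `OrdinalEncoder` produces the same kind of integer codes and can also map categories it has never seen to a sentinel value. The sketch below is an alternative, not what the notebook does, and runs on made-up model names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the 'model' column (values are illustrative only).
train = pd.DataFrame({"model": ["Alto", "i20", "Ecosport", "Alto"]})
test = pd.DataFrame({"model": ["i20", "Nexon"]})  # 'Nexon' never appears in train

# Unknown categories at transform time are mapped to -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["model"]])
test_codes = encoder.transform(test[["model"]])

print(encoder.categories_[0])  # categories learned from the training rows
print(train_codes.ravel())
print(test_codes.ravel())
```

Fitting the encoder on the training rows only also keeps test-set categories out of the learned mapping.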

In [65]:
## feature engineering (preprocessing)
num_features=X.select_dtypes(exclude="object").columns
cat_features=['seller_type','fuel_type','transmission_type']
len(num_features),len(cat_features)

(7, 3)

In [66]:
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer
numTransformer=StandardScaler()
catTransformer=OneHotEncoder(drop='first')

preprocessor=ColumnTransformer(
    [("StandardScaler",numTransformer,num_features),
    ("OneHotEncoder",catTransformer,cat_features)],
    remainder='passthrough'
)
## remainder='drop' (the default): drops the columns that are not listed in the transformers.
## remainder='passthrough': leaves the unlisted columns unchanged and passes them through without any transformation.
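
The effect of `remainder` is easiest to see on a column that no transformer claims. The sketch below uses a toy frame with made-up values in which `seats` is deliberately left out of both transformer lists; the helper name `build` is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; 'seats' is not assigned to any transformer.
toy = pd.DataFrame({
    "engine": [796, 1197, 1498],
    "fuel_type": ["Petrol", "Diesel", "Petrol"],
    "seats": [5, 5, 7],
})

def build(remainder):
    """Build the same transformer with a different 'remainder' policy."""
    return ColumnTransformer(
        [("scale", StandardScaler(), ["engine"]),
         ("onehot", OneHotEncoder(drop="first"), ["fuel_type"])],
        remainder=remainder,
    )

dropped = build("drop").fit_transform(toy)
passed = build("passthrough").fit_transform(toy)

# One scaled column plus one one-hot column; 'seats' only survives with passthrough.
print(dropped.shape)
print(passed.shape)
```

In the notebook every column of `X` is covered by either `num_features` or `cat_features`, so `remainder='passthrough'` has nothing left to pass through there.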

In [67]:
preprocessor

In [68]:
## train test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=42)

In [69]:
X_train=preprocessor.fit_transform(X_train)
X_test=preprocessor.transform(X_test)
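
Fitting the preprocessor on the training split and only transforming the test split keeps test statistics out of the scaler. The same guarantee can be extended to cross-validation by wrapping the preprocessor and the estimator in a `Pipeline`, so that each fold refits the preprocessing. The following is a minimal sketch on synthetic data; the data, the linear target, and the choice of `Ridge` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names mirror the notebook, values are made up.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "vehicle_age": rng.integers(0, 15, n),
    "engine": rng.integers(800, 2500, n),
    "fuel_type": rng.choice(["Petrol", "Diesel", "CNG"], n),
})
target = (900_000 - 40_000 * demo["vehicle_age"] + 200 * demo["engine"]
          + rng.normal(0, 10_000, n))

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["vehicle_age", "engine"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
    ])),
    ("model", Ridge()),
])

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.25, random_state=42)
pipe.fit(X_tr, y_tr)                             # scaler and encoder are fit on X_tr only
test_r2 = pipe.score(X_te, y_te)                 # R^2 on the held-out rows
cv_r2 = cross_val_score(pipe, X_tr, y_tr, cv=3)  # preprocessing is refit inside each fold
print(round(test_r2, 3), cv_r2.round(3))
```

A pipeline like this can also be passed to `RandomizedSearchCV` directly, with parameter names prefixed by the step name (for example `model__alpha`).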

In [70]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

In [71]:
models={
    "Linear Regression":LinearRegression(),
    "Ridge":Ridge(),
    "Lasso":Lasso(),
    "k-Neighbors-Regressor": KNeighborsRegressor(),
    "Decision Tree":DecisionTreeRegressor(),
    "Random Forest":RandomForestRegressor(),
    "Adaboost Regressor":AdaBoostRegressor(),
    "Gradient Regressor":GradientBoostingRegressor(),
    "XGBoost Regressor":XGBRegressor(),
}
for name, model in models.items():
    model.fit(X_train,y_train)

    ## predict on the training and test sets
    y_train_pred=model.predict(X_train)
    y_test_pred=model.predict(X_test)

    # Training set performance
    model_train_mse = mean_squared_error(y_train, y_train_pred)
    model_train_mae = mean_absolute_error(y_train, y_train_pred)
    model_train_rmse = np.sqrt(model_train_mse)
    model_train_r2_score = r2_score(y_train, y_train_pred)

    # Test set performance
    model_test_mse = mean_squared_error(y_test, y_test_pred)
    model_test_mae = mean_absolute_error(y_test, y_test_pred)
    model_test_rmse = np.sqrt(model_test_mse)
    model_test_r2_score = r2_score(y_test, y_test_pred)

    print(name)

    print('Model performance for Training set')
    print("- mean_squared_error: {:.4f}".format(model_train_mse))
    print('- mean_absolute_error: {:.4f}'.format(model_train_mae))
    print('- root_mean_squared_error:{:.4f}'.format(model_train_rmse))
    print('- r2_square:{:.4f}'.format(model_train_r2_score))

    print('----------------------------------')

    print('Model performance for Testing set')
    print("- mean_squared_error:{:.4f}".format(model_test_mse))
    print('- mean_absolute_error:{:.4f}'.format(model_test_mae))
    print('- root_mean_squared_error:{:.4f}'.format(model_test_rmse))
    print('- r2_square:{:.4f}'.format(model_test_r2_score))

    print('='*35)
    print('\n')

Linear Regression
Model performance for Training set
- mean_squared_error: 304874315292.8461
- mean_absolute_error: 266675.1076
- root_mean_squared_error:552154.2495
- r2_square:0.6220
----------------------------------
Model performance for Testing set
- mean_squared_error:270286925822.7529
- mean_absolute_error:284283.4460
- root_mean_squared_error:519891.2635
- r2_square:0.6525


Ridge
Model performance for Training set
- mean_squared_error: 304875008338.2687
- mean_absolute_error: 266635.7643
- root_mean_squared_error:552154.8771
- r2_square:0.6220
----------------------------------
Model performance for Testing set
- mean_squared_error:270276506373.7186
- mean_absolute_error:284241.5752
- root_mean_squared_error:519881.2426
- r2_square:0.6525


Lasso
Model performance for Training set
- mean_squared_error: 304874327627.6475
- mean_absolute_error: 266674.0551
- root_mean_squared_error:552154.2607
- r2_square:0.6220
----------------------------------
Model performance for Testing set
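
The four metrics printed for each model are related: RMSE is the square root of MSE and is expressed in the same unit as `selling_price`, while R² compares the squared error with the variance of the target. A small helper can make this explicit. The function name `regression_report` and the three sample values are hypothetical and chosen so the results can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MSE, MAE, RMSE and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": mse,
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),   # same unit as the target
        "r2": r2_score(y_true, y_pred),
    }

# Tiny hand-checkable example (illustrative values, not from the dataset).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])
report = regression_report(y_true, y_pred)
print(report)
```

Here the squared errors are 100, 100 and 0, so MSE is 200/3, and the total sum of squares around the mean of 200 is 20000, which gives R² = 1 - 200/20000 = 0.99.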

In [72]:
k_params=dict(n_neighbors=[2,5,7,10,15,20,50])

rf_params = {"max_depth": [5, 8, 15, None, 10],
             "max_features": [5, 7, 1.0, 8],  # 1.0 (all features) replaces "auto", which newer scikit-learn releases no longer accept
             "min_samples_split": [2, 8, 15, 20],
             "n_estimators": [100, 200, 500, 1000]}

gb_params=dict(loss=['squared_error', 'absolute_error', 'huber', 'quantile'],learning_rate=[0.1,0.01,0.001],
                 n_estimators=[100,500,1000],max_depth=[3,5,8,10])

xb_params=dict(n_estimators=[100,500,1000],max_depth=[3,5,8,10],learning_rate=[0.1,0.01])

In [73]:
## Models with hyperparameter tuning
randomcv_models=[
    # ("k-Neighbors-Regressor", KNeighborsRegressor(),k_params),
    # ("Random Forest",RandomForestRegressor(),rf_params),
    # ("Gradient Regressor",GradientBoostingRegressor(),gb_params),
    ("XGB Regressor",XGBRegressor(),xb_params),
]

In [74]:
xb_params

{'n_estimators': [100, 500, 1000],
 'max_depth': [3, 5, 8, 10],
 'learning_rate': [0.1, 0.01]}

In [75]:
randomcv_models

[('XGB Regressor',
  XGBRegressor(base_score=None, booster=None, callbacks=None,
               colsample_bylevel=None, colsample_bynode=None,
               colsample_bytree=None, device=None, early_stopping_rounds=None,
               enable_categorical=False, eval_metric=None, feature_types=None,
               gamma=None, grow_policy=None, importance_type=None,
               interaction_constraints=None, learning_rate=None, max_bin=None,
               max_cat_threshold=None, max_cat_to_onehot=None,
               max_delta_step=None, max_depth=None, max_leaves=None,
               min_child_weight=None, missing=nan, monotone_constraints=None,
               multi_strategy=None, n_estimators=None, n_jobs=None,
               num_parallel_tree=None, random_state=None, ...),
  {'n_estimators': [100, 500, 1000],
   'max_depth': [3, 5, 8, 10],
   'learning_rate': [0.1, 0.01]})]

In [76]:
from sklearn.model_selection import RandomizedSearchCV
model_param = {}
for name, model, params in randomcv_models:
    # n_iter=100 exceeds the 24 combinations in xb_params, so every combination is tried
    random_search = RandomizedSearchCV(estimator=model,
                                       param_distributions=params,
                                       n_iter=100,
                                       cv=3,
                                       verbose=2,
                                       n_jobs=-1)
    random_search.fit(X_train, y_train)
    model_param[name] = random_search.best_params_

for model_name in model_param:
    print(f"---------------- Best Params for {model_name} -------------------")
    print(model_param[model_name])

Fitting 3 folds for each of 24 candidates, totalling 72 fits
---------------- Best Params for XGB Regressor -------------------
{'n_estimators': 1000, 'max_depth': 3, 'learning_rate': 0.1}
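
The log reports 24 candidates even though `n_iter=100` was requested. When every entry in `param_distributions` is a list, `RandomizedSearchCV` samples without replacement, so it cannot draw more candidates than the grid contains. The count can be checked with `ParameterGrid`; the sketch below repeats the `xb_params` dictionary so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as xb_params in the notebook.
xb_params = dict(n_estimators=[100, 500, 1000],
                 max_depth=[3, 5, 8, 10],
                 learning_rate=[0.1, 0.01])

n_candidates = len(ParameterGrid(xb_params))  # 3 * 4 * 2 combinations
n_iter = min(100, n_candidates)               # what the search can actually sample
cv_folds = 3
print(n_candidates, n_iter, n_iter * cv_folds)
```

With 3 folds this gives 72 fits, matching the log line. For a space this small the search is effectively an exhaustive grid search.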


In [79]:
models={
    ## "k-Neighbors-Regressor": KNeighborsRegressor(n_neighbors=5,n_jobs=-1),
    "Random Forest":RandomForestRegressor(n_estimators=100,min_samples_split=2,max_features=5,max_depth=15,n_jobs=-1),
    "Gradient Regressor":GradientBoostingRegressor(n_estimators=1000,max_depth=3,loss='squared_error',learning_rate=0.1),
    "XGBoost Regressor":XGBRegressor(n_estimators=1000,max_depth=3,learning_rate=0.1),
}
for name, model in models.items():
    model.fit(X_train,y_train)

    ## predict on the training and test sets
    y_train_pred=model.predict(X_train)
    y_test_pred=model.predict(X_test)

    # Training set performance
    model_train_mse = mean_squared_error(y_train, y_train_pred)
    model_train_mae = mean_absolute_error(y_train, y_train_pred)
    model_train_rmse = np.sqrt(model_train_mse)
    model_train_r2_score = r2_score(y_train, y_train_pred)

    # Test set performance
    model_test_mse = mean_squared_error(y_test, y_test_pred)
    model_test_mae = mean_absolute_error(y_test, y_test_pred)
    model_test_rmse = np.sqrt(model_test_mse)
    model_test_r2_score = r2_score(y_test, y_test_pred)

    print(name)

    print('Model performance for Training set')
    print("- mean_squared_error: {:.4f}".format(model_train_mse))
    print('- mean_absolute_error: {:.4f}'.format(model_train_mae))
    print('- root_mean_squared_error:{:.4f}'.format(model_train_rmse))
    print('- r2_square:{:.4f}'.format(model_train_r2_score))

    print('----------------------------------')

    print('Model performance for Testing set')
    print("- mean_squared_error:{:.4f}".format(model_test_mse))
    print('- mean_absolute_error:{:.4f}'.format(model_test_mae))
    print('- root_mean_squared_error:{:.4f}'.format(model_test_rmse))
    print('- r2_square:{:.4f}'.format(model_test_r2_score))

    print('='*35)
    print('\n')

Random Forest
Model performance for Training set
- mean_squared_error: 17453385589.8120
- mean_absolute_error: 56279.1946
- root_mean_squared_error:132111.2622
- r2_square:0.9784
----------------------------------
Model performance for Testing set
- mean_squared_error:52090120523.1871
- mean_absolute_error:100510.5603
- root_mean_squared_error:228232.6018
- r2_square:0.9330


Gradient Regressor
Model performance for Training set
- mean_squared_error: 12518319545.3278
- mean_absolute_error: 74339.4045
- root_mean_squared_error:111885.2964
- r2_square:0.9845
----------------------------------
Model performance for Testing set
- mean_squared_error:47679849311.5800
- mean_absolute_error:99374.0116
- root_mean_squared_error:218357.1600
- r2_square:0.9387


XGBoost Regressor
Model performance for Training set
- mean_squared_error: 14299869964.5940
- mean_absolute_error: 77715.5457
- root_mean_squared_error:119582.0637
- r2_square:0.9823
----------------------------------
Model performance fo

In [None]:
model_param