1. Problem Statement
- This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.
- Users can predict the price of a car based on its input features.
- The predictions can be used to suggest a price to new sellers based on market conditions.

2. Data Collection
- The dataset was collected by scraping the cardekho website.
- The data consists of 13 columns and 15411 rows.

## Purpose of the Dataset

This dataset, collected from cardekho.com, comprises information on used cars sold in India. The primary purpose of this dataset and the predictive modeling task is:

*   **Car Price Prediction**: To predict the selling price of a used car based on various input features such as car name, brand, model, vehicle age, kilometers driven, seller type, fuel type, transmission type, mileage, engine size, maximum power, and number of seats.
*   **Market Price Suggestion**: The prediction results can be used to give new sellers a data-driven price suggestion for their used cars, reflecting current market conditions and car attributes. This helps sellers price their vehicles competitively and realistically.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
df = pd.read_csv('cardekho_1csv.csv', index_col=0)
df.head()

### Feature Engineering

#### Data Cleaning
- Handling Missing Values
- Handling Duplicates
- Check data type
- Understand the dataset
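
The checklist above can be sketched as one small helper. This is only an illustration, not the notebook's own code: `summarize_quality` is a hypothetical name, and the three-row frame is made-up toy data (including the `km_driven` column name) rather than rows from the cardekho dataset.

```python
import pandas as pd

def summarize_quality(frame: pd.DataFrame) -> dict:
    """Return missing-value counts, duplicate-row count, and dtypes for a frame."""
    return {
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }

# Toy data (made up for illustration); the last row repeats the first.
toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "Petrol"],
    "km_driven": [50000, None, 50000],
})

report = summarize_quality(toy)
print(report["missing_per_column"])  # one missing value in km_driven
print(report["duplicate_rows"])      # one duplicated row
print(report["dtypes"])
```

Calling the same helper on `df` would cover the missing-value, duplicate, and data-type checks in a single cell.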

In [None]:
# Check for null (NaN) values in each column
df.isnull().sum()

In [None]:
df.head(2)

In [None]:
## Remove unnecessary columns
df.drop(['car_name', 'brand'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df['model'].unique()

In [None]:
# Getting all different types of features

num_features = [feature for feature in df.columns if df[feature].dtype != 'O']
print('Number of numerical features: ', len(num_features))

cat_features = [feature for feature in df.columns if df[feature].dtype == 'O']
print('Number of categorical features: ', len(cat_features))

discrete_features = [feature for feature in num_features if len(df[feature].unique()) < 25]
print('Number of discrete features: ', len(discrete_features))

continuous_features = [feature for feature in num_features if feature not in discrete_features]
print('Number of continuous features: ', len(continuous_features))

In [None]:
## Independent and dependent features
X = df.drop('selling_price', axis=1)
y = df['selling_price']

In [None]:
X.head()

In [None]:
y.head()

##### **Feature Encoding and Scaling**
**One-hot encoding for nominal columns with few unique values**
- One-hot encoding converts each categorical variable into binary indicator columns, a numeric form that ML algorithms can use directly.
- The `model` column, which has more unique values, is label-encoded instead.

In [None]:
df['model'].unique()

In [None]:
df['model'].value_counts()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['model'] = le.fit_transform(X['model'])
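
To make the effect of label encoding concrete, here is a minimal sketch on made-up values (these are not taken from the dataset). It shows that the integer codes follow alphabetical order of the labels, and that a label not seen during fitting cannot be transformed.

```python
from sklearn.preprocessing import LabelEncoder

toy_models = ["Swift", "Alto", "City", "Alto"]  # made-up sample values

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_models)

# Classes are sorted alphabetically, so the codes carry no price-related order.
print(list(encoder.classes_))
print(codes.tolist())

# A label that was not seen during fit raises ValueError at transform time.
try:
    encoder.transform(["Nano"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

Because the codes are arbitrary integers, tree-based models usually cope with them better than distance-based models such as KNN or SVR, which treat the numeric gaps as meaningful.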

In [None]:
X.head()

In [None]:
len(df['seller_type'].unique()), len(df['fuel_type'].unique()), len(df['transmission_type'].unique())

In [None]:
# Create ColumnTransformer with two transformers: one-hot encoding and standard scaling

num_features = X.select_dtypes(exclude='object').columns
onehot_columns = ['seller_type', 'fuel_type', 'transmission_type']

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ('OneHotEncoder', oh_transformer, onehot_columns),
        ('StandardScaler', numeric_transformer, num_features)
    ], remainder='passthrough'
)

In [None]:
X = preprocessor.fit_transform(X)

In [None]:
X

In [None]:
pd.DataFrame(X).shape

In [None]:
pd.DataFrame(X).head()

In [None]:
# Separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
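
In the cells above the preprocessor is fitted on all rows before the split, so the scaler's mean and variance also reflect the test rows. One possible alternative, shown here as a sketch on synthetic data rather than as the notebook's method, is to split first and fit the transformer on the training portion only. The toy frame and the names `prep`, `X_tr`, and `X_te` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic toy frame; column names echo the notebook's, values are made up.
toy = pd.DataFrame({
    "fuel_type": np.tile(["Petrol", "Diesel"], 25),
    "vehicle_age": np.arange(50) % 14 + 1,
    "selling_price": np.arange(50) * 10_000 + 100_000,
})
X_toy = toy.drop("selling_price", axis=1)
y_toy = toy["selling_price"]

# Split before any fitting so test rows never influence the learned statistics.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

prep = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(drop="first"), ["fuel_type"]),
        ("scale", StandardScaler(), ["vehicle_age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_tr_t = prep.fit_transform(X_tr)  # statistics learned from training rows only
X_te_t = prep.transform(X_te)      # test rows are transformed, never fitted

print(X_tr_t.shape, X_te_t.shape)
```

With this ordering, the test-set metrics estimate performance on rows the preprocessing has never seen.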

In [None]:
X_train

In [None]:
X_test

### Model Training and Model Selection

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
## Create a function to evaluate the model
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, predicted)
    return mae, rmse, r2
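
To show what the three metrics returned by `evaluate_model` measure, the following sketch recomputes them by hand with NumPy on three made-up values. The numbers are purely illustrative.

```python
import numpy as np

true = np.array([3.0, 5.0, 7.0])       # made-up targets
predicted = np.array([2.0, 5.0, 9.0])  # made-up predictions

errors = true - predicted                    # [1, 0, -2]
mae = np.mean(np.abs(errors))                # (1 + 0 + 2) / 3
rmse = np.sqrt(np.mean(errors ** 2))         # sqrt((1 + 0 + 4) / 3)
ss_res = np.sum(errors ** 2)                 # residual sum of squares
ss_tot = np.sum((true - true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                     # share of variance explained

print(mae, rmse, r2)
```

MAE and RMSE are in the same unit as the selling price, while R2 is unitless, so the error metrics are easier to read as "typical miss in rupees".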

In [None]:
## Beginning of model training

models = {
    'XGBRegressor': XGBRegressor(),
}

for name, model in models.items():
    model.fit(X_train, y_train) # Train the model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate the model
    mae_train, rmse_train, r2_train = evaluate_model(y_train, y_train_pred)
    mae_test, rmse_test, r2_test = evaluate_model(y_test, y_test_pred)

    print(f'Model: {name}')

    print("Model Performance on Training Set")
    print("- Root Mean Squared Error: {:.4f}".format(rmse_train))
    print("- Mean Absolute Error: {:.4f}".format(mae_train))
    print("- R2 Score: {:.4f}".format(r2_train))

    print("-----------------------------------")

    print("Model Performance on Testing Set")
    print("- Root Mean Squared Error: {:.4f}".format(rmse_test))
    print("- Mean Absolute Error: {:.4f}".format(mae_test))
    print("- R2 Score: {:.4f}".format(r2_test))

    print('='*35)
    print('\n')

In [None]:
# Initialize a few parameters for hyperparameter tuning
xb_params = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
    'n_estimators': [100, 200, 300],
    'colsample_bytree': [0.3, 0.4, 0.5, 0.7, 1]
}
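
For a sense of scale, the sketch below counts how many combinations this grid contains and how many cross-validation fits a randomized search performs. The values `n_iter=100` and `cv=3` are the ones passed to `RandomizedSearchCV` in the tuning cell of this notebook; the variable name `grid` is only used for this example.

```python
from math import prod

# Same value lists as the parameter grid in the notebook.
grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "n_estimators": [100, 200, 300],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7, 1],
}

n_combinations = prod(len(values) for values in grid.values())  # 5 * 8 * 3 * 5
n_iter, cv = 100, 3
cv_fits = n_iter * cv  # one fit per sampled combination per fold

print(n_combinations)  # 600
print(cv_fits)         # 300
```

The search therefore samples one sixth of the grid, and with the default `refit=True` it trains one additional model on the full training set at the end.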

In [None]:
# Models list for hyperparameter tuning
randomcv_models = [
    ('XGBRegressor', XGBRegressor(), xb_params)
]

In [None]:
# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

model_params = {}

for name, model, params in randomcv_models:
    # Sample 100 parameter combinations, each scored with 3-fold cross-validation
    random_search = RandomizedSearchCV(estimator=model, param_distributions=params, n_iter=100, cv=3, verbose=2, n_jobs=-1)
    random_search.fit(X_train, y_train)
    model_params[name] = random_search.best_params_

for model_name in model_params:
    print(f"---------------------Best Parameters for {model_name}---------------------")
    print(model_params[model_name])
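
The best parameters found by a search can be passed straight back into the estimator instead of being copied by hand. The sketch below illustrates this on synthetic data with `KNeighborsRegressor`; the data, the small parameter list, and the variable names are assumptions for the example, and the same pattern would apply to the dictionary stored in `model_params`.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

search = RandomizedSearchCV(
    estimator=KNeighborsRegressor(),
    param_distributions={"n_neighbors": [2, 3, 5, 7]},
    n_iter=3,
    cv=3,
    random_state=0,  # fixes which combinations are sampled
)
search.fit(X_toy, y_toy)

# Unpack the best parameters rather than retyping them.
best_params = search.best_params_
tuned = KNeighborsRegressor(**best_params).fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already the same model.
same = np.allclose(tuned.predict(X_toy), search.best_estimator_.predict(X_toy))
print(best_params, same)
```

Setting `random_state` on the search also makes the selected parameters reproducible from run to run.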

In [None]:
## Retrain the model with the best parameters
## (hard-coded below; update them if the search above returns different values)

models = {
    'XGBRegressor': XGBRegressor(learning_rate=0.2, max_depth=5, n_estimators=300, colsample_bytree=0.7),
}

for name, model in models.items():
    model.fit(X_train, y_train) # Train the model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate the model
    mae_train, rmse_train, r2_train = evaluate_model(y_train, y_train_pred)
    mae_test, rmse_test, r2_test = evaluate_model(y_test, y_test_pred)

    print(f'Model: {name}')

    print("Model Performance on Training Set")
    print("- Root Mean Squared Error: {:.4f}".format(rmse_train))
    print("- Mean Absolute Error: {:.4f}".format(mae_train))
    print("- R2 Score: {:.4f}".format(r2_train))

    print("-----------------------------------")

    print("Model Performance on Testing Set")
    print("- Root Mean Squared Error: {:.4f}".format(rmse_test))
    print("- Mean Absolute Error: {:.4f}".format(mae_test))
    print("- R2 Score: {:.4f}".format(r2_test))

    print('='*35)
    print('\n')

In [None]:
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

models = {
    'SVR': SVR(),
    'KNeighborsRegressor': KNeighborsRegressor()
}

print("Models dictionary created successfully.")

In [None]:
model_results = {}

for model_name, model in models.items():
    model.fit(X_train, y_train) # Train the model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate the model
    mae_train, rmse_train, r2_train = evaluate_model(y_train, y_train_pred)
    mae_test, rmse_test, r2_test = evaluate_model(y_test, y_test_pred)

    print(f'Model: {model_name}')

    print("Model Performance on Training Set")
    print(f"- Root Mean Squared Error: {rmse_train:.4f}")
    print(f"- Mean Absolute Error: {mae_train:.4f}")
    print(f"- R2 Score: {r2_train:.4f}")

    print("-----------------------------------")

    print("Model Performance on Testing Set")
    print(f"- Root Mean Squared Error: {rmse_test:.4f}")
    print(f"- Mean Absolute Error: {mae_test:.4f}")
    print(f"- R2 Score: {r2_test:.4f}")

    print('='*35)
    print('\n')

    model_results[model_name] = {
        'train_rmse': rmse_train,
        'train_mae': mae_train,
        'train_r2': r2_train,
        'test_rmse': rmse_test,
        'test_mae': mae_test,
        'test_r2': r2_test
    }

In [None]:
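# Note: model_results only holds the models from the previous cell (SVR and KNeighborsRegressor);
# the XGBRegressor runs above are not part of this comparison.
print("\nSummary of Model Performance:\n")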
best_r2 = -float('inf')
best_model = ''

for model_name, metrics in model_results.items():
    print(f"Model: {model_name}")
    print(f"  Test RMSE: {metrics['test_rmse']:.4f}")
    print(f"  Test MAE: {metrics['test_mae']:.4f}")
    print(f"  Test R2 Score: {metrics['test_r2']:.4f}")
    print("-----------------------------------")

    if metrics['test_r2'] > best_r2:
        best_r2 = metrics['test_r2']
        best_model = model_name

print(f"\nThe best performing model based on Test R2 Score is: {best_model} with R2 Score: {best_r2:.4f}")
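
A results dictionary shaped like `model_results` can also be turned into a sortable table. The sketch below uses placeholder numbers and placeholder model names, which are not results from this notebook.

```python
import pandas as pd

# Placeholder metrics for illustration only; not results from this notebook.
toy_results = {
    "model_a": {"test_rmse": 2.0, "test_mae": 1.5, "test_r2": 0.80},
    "model_b": {"test_rmse": 1.0, "test_mae": 0.7, "test_r2": 0.95},
}

# Outer keys become columns, so transpose to get one row per model.
summary = pd.DataFrame(toy_results).T.sort_values("test_r2", ascending=False)
best_name = summary.index[0]

print(summary)
print(best_name)
```

A table like this makes it easy to add further models to the comparison without changing the selection logic.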
