## **PROBLEM STATEMENT**

* The used car market in India is dynamic. Prices fluctuate with many factors, including the make and model of the car, its mileage, its condition and current market conditions. As a result, sellers find it difficult to price their cars accurately.

* This dataset contains information about used cars: age, kilometres driven, seller type, fuel type, transmission, mileage, engine size, maximum power, seats and selling price. The selling price is the target that the models in this notebook predict.


## **DATA COLLECTION**
The dataset consists of 13 columns and 15,411 rows.
The data was collected by scraping the CarDekho website.

In [89]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [90]:
# Load the dataset
df = pd.read_csv('/content/cardekho_dataset.csv', index_col=[0])

In [110]:
# Display the first 20 rows of the dataframe
df.head(20)

index,car_name,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000
5,Maruti Wagon R,Wagon R,8,35000,Individual,Petrol,Manual,18.9,998,67.1,5,350000
6,Hyundai i10,i10,8,40000,Dealer,Petrol,Manual,20.36,1197,78.9,5,315000
7,Maruti Wagon R,Wagon R,3,17512,Dealer,Petrol,Manual,20.51,998,67.04,5,410000
8,Hyundai Venue,Venue,2,20000,Individual,Petrol,Automatic,18.15,998,118.35,5,1050000
12,Maruti Swift,Swift,4,28321,Dealer,Petrol,Manual,16.6,1197,85.0,5,511000


In [92]:
# Get information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15411 entries, 0 to 19543
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   car_name           15411 non-null  object 
 1   brand              15411 non-null  object 
 2   model              15411 non-null  object 
 3   vehicle_age        15411 non-null  int64  
 4   km_driven          15411 non-null  int64  
 5   seller_type        15411 non-null  object 
 6   fuel_type          15411 non-null  object 
 7   transmission_type  15411 non-null  object 
 8   mileage            15411 non-null  float64
 9   engine             15411 non-null  int64  
 10  max_power          15411 non-null  float64
 11  seats              15411 non-null  int64  
 12  selling_price      15411 non-null  int64  
dtypes: float64(2), int64(5), object(6)
memory usage: 1.6+ MB


In [93]:
# Check for missing values
df.isnull().sum()

column,missing_values
car_name,0
brand,0
model,0
vehicle_age,0
km_driven,0
seller_type,0
fuel_type,0
transmission_type,0
mileage,0
engine,0


In [94]:
# Drop the 'brand' column (the brand is already part of car_name)
df.drop('brand', axis=1, inplace=True)

In [95]:
# Get the value counts of the 'model' column
df['model'].value_counts()

model,count
i20,906
Swift Dzire,890
Swift,781
Alto,778
City,757
...,...
Altroz,1
C,1
Ghost,1
Quattroporte,1


In [96]:
# Identify numerical and categorical features
num_feature = [feature for feature in df.columns if df[feature].dtype != 'O']
print('no of numerical feature :', len(num_feature))
cat_feature = [feature for feature in df.columns if df[feature].dtype == 'O']
print('no of categorical feature :', len(cat_feature))
# Numerical features with at most 25 distinct values are treated as discrete
discrete_feature = [feature for feature in num_feature if len(df[feature].unique()) <= 25]
print('no of discrete feature :', len(discrete_feature))
continuous_feature = [feature for feature in num_feature if feature not in discrete_feature]
print('no of continuous feature :', len(continuous_feature))

no of numerical feature : 7
no of categorical feature : 5
no of discrete feature : 2
no of continuous feature : 5
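
The same split can be tried on a small frame to see which column lands in which group. This is a minimal sketch on made-up rows: the `toy` frame and the `max_distinct` threshold of 2 are assumptions chosen for illustration (the notebook uses 25), and `pd.api.types.is_numeric_dtype` is used here as a variant of the `dtype != 'O'` comparison.

```python
import pandas as pd

# Hypothetical miniature frame; values are illustrative only
toy = pd.DataFrame({
    'car_name': ['Maruti Alto', 'Hyundai i20', 'Ford Ecosport', 'Maruti Swift'],
    'seats': [5, 5, 5, 7],
    'km_driven': [120000, 60000, 30000, 28321],
})

max_distinct = 2  # toy threshold; the notebook uses 25

# Numerical vs. categorical columns
numeric_cols = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]
categorical_cols = [c for c in toy.columns if c not in numeric_cols]

# Numerical columns with few distinct values count as discrete
discrete_cols = [c for c in numeric_cols if toy[c].nunique() <= max_distinct]
continuous_cols = [c for c in numeric_cols if c not in discrete_cols]

print(numeric_cols, categorical_cols, discrete_cols, continuous_cols)
```

Here `seats` has two distinct values and is classified as discrete, while `km_driven` has four and is classified as continuous.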


In [97]:
# Separate features (x) and target (y)
x = df.drop(['selling_price'] , axis =1)
y = df['selling_price']

In [98]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
# train_test_split returns the feature splits first, then the target splits
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=242)
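
The order of the four values returned by `train_test_split` is easy to get wrong, so a quick check on toy arrays can help. This is a sketch under the assumption of a 20-row array; `features` and `target` are hypothetical names, built so that each feature row can be matched to its target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row i of features is [2*i, 2*i + 1]; target i is simply i
features = np.arange(40).reshape(20, 2)
target = np.arange(20)

# Return order: train features, test features, train target, test target
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=242
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With 20 rows and `test_size=0.25`, 15 rows go to training and 5 to testing, and each feature row stays aligned with its target.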

## **FEATURE ENCODING**

In [99]:
# Get value counts for 'fuel_type'
df['fuel_type'].value_counts()

fuel_type,count
Petrol,7643
Diesel,7419
CNG,301
LPG,44
Electric,4
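
Some fuel types are very rare (Electric has 4 rows, LPG has 44). If an encoder were fitted on a training split only, a rare category could be absent from that split and appear only at prediction time. The sketch below shows one way scikit-learn can handle this, using `handle_unknown='ignore'`; the tiny `train_fuel` and `test_fuel` arrays are made up, and this option is not what the notebook itself uses.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical splits: 'Electric' never appears in the training rows
train_fuel = np.array([['Petrol'], ['Diesel'], ['CNG'], ['Petrol']])
test_fuel = np.array([['Diesel'], ['Electric']])

# Unknown categories are encoded as an all-zero row instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_fuel)

encoded_test = encoder.transform(test_fuel).toarray()
print(list(encoder.categories_[0]))
print(encoded_test)
```

The learned categories are sorted (`CNG`, `Diesel`, `Petrol`), so the `Diesel` row has a 1 in the second position and the unseen `Electric` row is all zeros.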


In [100]:
# Label encode the 'model' column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
x['model'] = le.fit_transform(x['model'])

In [101]:
# Display the first few rows of the feature dataframe after label encoding
x.head()

index,car_name,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Maruti Alto,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Hyundai Grand,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,Hyundai i20,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Maruti Alto,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ford Ecosport,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5
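
`LabelEncoder` assigns integer codes in sorted order of the labels, which is why `Alto` receives a smaller code than `Grand`, and `i20` (lowercase sorts after uppercase) receives a larger one. A minimal sketch on three model names taken from the rows above; `toy_encoder` is a name introduced for this example.

```python
from sklearn.preprocessing import LabelEncoder

models = ['i20', 'Alto', 'Grand', 'Alto']

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(models)

# classes_ holds the sorted labels; a label's code is its position in that list
print(list(toy_encoder.classes_))
print(list(codes))

# The mapping can be reversed to recover the original names
print(list(toy_encoder.inverse_transform(codes)))
```

The codes carry no meaning beyond this ordering. Tree-based models can usually cope with that, but linear and distance-based models will treat the codes as magnitudes.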


In [102]:
# Define preprocessing steps using ColumnTransformer
# Identify categorical and numerical features for preprocessing

# Import necessary preprocessing classes
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

num_features = x.select_dtypes(exclude="object").columns
onehot_column = ['seller_type', 'fuel_type', 'transmission_type']
label_encoder_column = ['model']

# One-hot encoder for categorical features; drop='first' omits one dummy column per feature
onehot = OneHotEncoder(drop='first')

# Create a ColumnTransformer that one-hot encodes the listed columns
# and passes every other column through unchanged
preprocessor = ColumnTransformer(
    [
        ("OneHotEncode", onehot, onehot_column)
    ],
    remainder='passthrough'
)

In [103]:
# Apply preprocessing to the features
x = preprocessor.fit_transform(x)

In [104]:
# Import necessary libraries for modeling and evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [105]:
# Define a function to evaluate model performance
def evaluate_model(model, X_train, y_train, X_test, y_test):
    """
    Trains a model and evaluates its performance on training and testing data.

    Args:
        model: The regression model to train and evaluate.
        X_train: Training features.
        y_train: Training target.
        X_test: Testing features.
        y_test: Testing target.

    Returns:
        A dictionary containing the evaluation scores for both training and testing sets.
    """
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculate evaluation metrics for training data
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_mae = mean_absolute_error(y_train, y_train_pred)
    train_rmse = np.sqrt(train_mse)
    train_r2 = r2_score(y_train, y_train_pred)

    # Calculate evaluation metrics for testing data
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_rmse = np.sqrt(test_mse)
    test_r2 = r2_score(y_test, y_test_pred)

    return {
        'Train MSE': train_mse,
        'Train MAE': train_mae,
        'Train RMSE': train_rmse,
        'Train R2': train_r2,
        'Test MSE': test_mse,
        'Test MAE': test_mae,
        'Test RMSE': test_rmse,
        'Test R2': test_r2
    }
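
The four metrics returned by this function can be verified by hand on a small example. The sketch below uses three made-up prices and predictions (not taken from the dataset) so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values: errors are -10, +10 and -30
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mae = mean_absolute_error(y_true, y_pred)   # (10 + 10 + 30) / 3
mse = mean_squared_error(y_true, y_pred)    # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)               # 1 - 1100 / 20000

print(mae, mse, rmse, r2)
```

RMSE penalises the single large error more heavily than MAE does, which is why it is the larger of the two here.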

In [108]:
# Train and evaluate the regression models
# Separate features (x) and target (y) from the original dataframe
x = df.drop(['selling_price'] , axis =1)
y = df['selling_price']

# Label encode the 'model' column first
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
x['model'] = le.fit_transform(x['model'])

# Define columns for OneHotEncoding
onehot_column = ['seller_type' , 'fuel_type' , 'transmission_type']

# Create the initial ColumnTransformer for OneHotEncoding and passing through other columns
# Explicitly drop the 'car_name' column
preprocessor_initial = ColumnTransformer(
    [
        ("OneHotEncode" , OneHotEncoder(drop ='first'), onehot_column),
        ("drop_car_name", "drop", ['car_name']) # Drop the 'car_name' column
    ],
    remainder='passthrough' # Pass through remaining columns (including the label encoded 'model' and numerical features)
)

# Apply the initial preprocessor to the dataframe
x_processed_df = preprocessor_initial.fit_transform(x)

# Get the names of the one-hot encoded output columns
onehot_feature_names = preprocessor_initial.transformers_[0][1].get_feature_names_out(onehot_column)

# Columns handled by remainder='passthrough', in their original order:
# everything except the one-hot encoded columns and the dropped 'car_name'
remainder_columns_initial = [col for col in x.columns if col not in onehot_column and col != 'car_name']
numerical_remainder_columns_initial = [col for col in remainder_columns_initial if x[col].dtype != 'O']

# x_processed_df is an array laid out as [one-hot features] + [remainder columns].
# The position of each numerical column is therefore the number of one-hot features
# plus the column's position within the remainder.
indices_for_standardscaler = [len(onehot_feature_names) + remainder_columns_initial.index(col) for col in numerical_remainder_columns_initial]


preprocessor_final = ColumnTransformer(
    [
        ("StandardScaler", StandardScaler(), indices_for_standardscaler)
    ],
    remainder='passthrough' # Pass through the one-hot encoded and non-numerical passed-through columns
)

# Apply the final preprocessor (StandardScaler)
x_scaled = preprocessor_final.fit_transform(x_processed_df)


# Split the processed data into training and testing sets
x_train_scaled, x_test_scaled, y_train, y_test = train_test_split(x_scaled, y, test_size=0.25, random_state=242)

# Dictionary to store results
model_performance = {}

# 1. Random Forest Regressor
rf_model = RandomForestRegressor(random_state=42)
model_performance['Random Forest'] = evaluate_model(rf_model, x_train_scaled, y_train, x_test_scaled, y_test)

# 2. Ridge Regression
ridge_model = Ridge()
model_performance['Ridge'] = evaluate_model(ridge_model, x_train_scaled, y_train, x_test_scaled, y_test)

# 3. Lasso Regression
lasso_model = Lasso()
model_performance['Lasso'] = evaluate_model(lasso_model, x_train_scaled, y_train, x_test_scaled, y_test)

# 4. K-Nearest Neighbors Regressor
knn_model = KNeighborsRegressor()
model_performance['KNN'] = evaluate_model(knn_model, x_train_scaled, y_train, x_test_scaled, y_test)

# 5. Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42)
model_performance['Decision Tree'] = evaluate_model(dt_model, x_train_scaled, y_train, x_test_scaled, y_test)

print(model_performance)

{'Random Forest': {'Train MSE': 19128420273.285393, 'Train MAE': 40371.471057667026, 'Train RMSE': np.float64(138305.53233072563), 'Train R2': 0.977935995212133, 'Test MSE': 46193835384.686005, 'Test MAE': 97312.78462867842, 'Test RMSE': np.float64(214927.51193061814), 'Test R2': 0.9225176602296553}, 'Ridge': {'Train MSE': 331701199277.09235, 'Train MAE': 278890.30777501763, 'Train RMSE': np.float64(575935.0651567348), 'Train R2': 0.6173935565807196, 'Test MSE': 187936005110.58694, 'Test MAE': 259723.39575242164, 'Test RMSE': np.float64(433515.8648891491), 'Test R2': 0.68476916277259}, 'Lasso': {'Train MSE': 331699761186.15924, 'Train MAE': 278928.6163488934, 'Train RMSE': np.float64(575933.816671811), 'Train R2': 0.6173952153713975, 'Test MSE': 187956688816.80237, 'Test MAE': 259795.965946244, 'Test RMSE': np.float64(433539.71999898966), 'Test R2': 0.6847344693564805}, 'KNN': {'Train MSE': 121018802651.13991, 'Train MAE': 93354.92948607025, 'Train RMSE': np.float64(347877.5684794004),
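
The cell above fits the encoders and the scaler on all rows before splitting, so the test rows contribute to the scaling statistics. A common alternative, which is not what this notebook does, is to wrap the preprocessing and the model in a `Pipeline` and fit it on the training rows only. The sketch below uses synthetic data: the column names are borrowed from the dataset, but the values and the price formula are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (assumption, not the CarDekho rows)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'vehicle_age': rng.integers(1, 15, n),
    'km_driven': rng.integers(5000, 150000, n),
    'fuel_type': rng.choice(['Petrol', 'Diesel'], n),
})
price = 900000 - 40000 * toy['vehicle_age'] - 2 * toy['km_driven'] + rng.normal(0, 10000, n)

numeric_cols = ['vehicle_age', 'km_driven']
pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['fuel_type']),
        ('scale', StandardScaler(), numeric_cols),
    ])),
    ('model', Ridge()),
])

# Split first, then fit: the scaler only ever sees the training rows
X_train, X_test, y_train, y_test = train_test_split(toy, price, test_size=0.25, random_state=242)
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)

scaler_means = pipe.named_steps['prep'].named_transformers_['scale'].mean_
print(test_r2, scaler_means)
```

The fitted scaler's means equal the training-set means, which confirms that no test rows were used during preprocessing.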

In [109]:
# Display model performance in a DataFrame
model_performance_df = pd.DataFrame(model_performance)
model_performance_df_transposed = model_performance_df.T
display(model_performance_df_transposed)

model,Train MSE,Train MAE,Train RMSE,Train R2,Test MSE,Test MAE,Test RMSE,Test R2
Random Forest,19128420000.0,40371.471058,138305.532331,0.977936,46193840000.0,97312.784629,214927.511931,0.922518
Ridge,331701200000.0,278890.307775,575935.065157,0.617394,187936000000.0,259723.395752,433515.864889,0.684769
Lasso,331699800000.0,278928.616349,575933.816672,0.617395,187956700000.0,259795.965946,433539.719999,0.684734
KNN,121018800000.0,93354.929486,347877.568479,0.860409,48356070000.0,105390.397093,219900.139665,0.918891
Decision Tree,401580800.0,4874.430409,20039.480053,0.999537,218995700000.0,122504.619777,467969.770249,0.632672
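
One way to read this table is to compare each model's train and test R2: a large positive gap suggests overfitting. The sketch below copies the R2 values by hand from the table above; the `r2_gap` column name is introduced here for illustration.

```python
import pandas as pd

# R2 values copied from the results table
r2_table = pd.DataFrame(
    {
        'Train R2': [0.977936, 0.617394, 0.617395, 0.860409, 0.999537],
        'Test R2': [0.922518, 0.684769, 0.684734, 0.918891, 0.632672],
    },
    index=['Random Forest', 'Ridge', 'Lasso', 'KNN', 'Decision Tree'],
)

# Positive gap: the model scores better on data it was trained on
r2_table['r2_gap'] = r2_table['Train R2'] - r2_table['Test R2']

best_on_test = r2_table['Test R2'].idxmax()
largest_gap = r2_table['r2_gap'].idxmax()

print(r2_table.sort_values('Test R2', ascending=False))
print(best_on_test, largest_gap)
```

On these numbers, Random Forest has the highest test R2, while the Decision Tree shows the largest train-test gap (about 0.37), consistent with an unpruned tree memorising the training data.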
