# **Regression Empirical Study: Car Price Prediction**

**Group Number:** 97  
**Members:**  
Roy Rui #300176548  
Jiayi Ma #300263220
 

# **Dataset I: Car Details v4.csv**
**Source**: [CarDekho - Kaggle Dataset](https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/data)  
**Shape**: **20 Columns, 2059 Rows**  

## **Description**
This dataset contains **detailed information about used cars**, including specifications such as engine displacement, maximum power, maximum torque, drivetrain, and seating capacity. It is suited to price prediction with regression models and can also be used for feature engineering, classification, and exploratory data analysis.

## **Columns and Features**  

| **Feature**           | **Description**  | **Data Type**   |
|----------------------|-----------------|-----------------|
| `Make`              | Manufacturer of the car (e.g., Honda, Toyota) | Categorical |
| `Model`             | Specific model name of the car | Categorical |
| `Price`             | Selling price of the car (Target Variable) | Numerical |
| `Year`              | Year of manufacturing | Numerical |
| `Kilometer`         | Distance driven by the car (km) | Numerical |
| `Fuel Type`         | Type of fuel used (Petrol/Diesel/CNG) | Categorical |
| `Transmission`      | Type of transmission (Manual/Automatic) | Categorical |
| `Location`          | City where the car is listed for sale | Categorical |
| `Color`            | Exterior color of the car | Categorical |
| `Owner`            | Number of previous owners | Categorical |
| `Seller Type`      | Type of seller (Individual/Dealer/Corporate) | Categorical |
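| `Engine`           | Engine displacement in cc | Text with unit (e.g., `1198 cc`) |
| `Max Power`        | Maximum power output (bhp) and the rpm at which it is reached | Text with units (e.g., `87 bhp @ 6000 rpm`) |
| `Max Torque`       | Maximum torque output (Nm) and the rpm at which it is reached | Text with units (e.g., `109 Nm @ 4500 rpm`) |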
| `Drivetrain`       | Drivetrain type (FWD/RWD/AWD) | Categorical |
| `Length`           | Length of the vehicle (mm) | Numerical |
| `Width`            | Width of the vehicle (mm) | Numerical |
| `Height`           | Height of the vehicle (mm) | Numerical |
| `Seating Capacity` | Number of seats available | Numerical |
| `Fuel Tank Capacity` | Fuel tank size (liters) | Numerical |

This dataset is **suitable for building predictive models** and understanding **factors influencing used car prices**.
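The data types listed in the table can be compared with what pandas actually infers on load. The following is a minimal sketch, assuming a tiny hand-built frame (`sample`) with a few columns and two rows copied from the dataset preview; it is an illustration, not part of the original notebook. Columns that carry units in their values, such as `Engine` and `Max Power`, come back as text rather than numbers.

```python
import pandas as pd

# Two sample rows copied from the dataset preview; only a subset of columns.
# `sample` is a hypothetical stand-in for the full DataFrame.
sample = pd.DataFrame(
    {
        "Make": ["Honda", "Maruti Suzuki"],
        "Price": [505000, 450000],
        "Engine": ["1198 cc", "1248 cc"],
        "Max Power": ["87 bhp @ 6000 rpm", "74 bhp @ 4000 rpm"],
        "Length": [3990.0, 3995.0],
    }
)

# Split the columns by inferred dtype: numeric versus everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Text:", text_cols)
```

Any column that lands in the text group but is meant to be numeric needs its unit stripped before modelling.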



## **General Imports**

In [55]:
import pandas as pd
import numpy as np
import re

# Load dataset
df = pd.read_csv("dataset1/car details v4.csv")

---

## **Data Cleaning**



In [56]:
# Display basic information
print("Dataset Overview:\n")
print(df.info())

# Check missing values
print("\nMissing Values:\n", df.isnull().sum())

# Show first few rows
df.head()

Dataset Overview:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2059 entries, 0 to 2058
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Make                2059 non-null   object 
 1   Model               2059 non-null   object 
 2   Price               2059 non-null   int64  
 3   Year                2059 non-null   int64  
 4   Kilometer           2059 non-null   int64  
 5   Fuel Type           2059 non-null   object 
 6   Transmission        2059 non-null   object 
 7   Location            2059 non-null   object 
 8   Color               2059 non-null   object 
 9   Owner               2059 non-null   object 
 10  Seller Type         2059 non-null   object 
 11  Engine              1979 non-null   object 
 12  Max Power           1979 non-null   object 
 13  Max Torque          1979 non-null   object 
 14  Drivetrain          1923 non-null   object 
 15  Length              1995 non-null   

|  | Make | Model | Price | Year | Kilometer | Fuel Type | Transmission | Location | Color | Owner | Seller Type | Engine | Max Power | Max Torque | Drivetrain | Length | Width | Height | Seating Capacity | Fuel Tank Capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honda | Amaze 1.2 VX i-VTEC | 505000 | 2017 | 87150 | Petrol | Manual | Pune | Grey | First | Corporate | 1198 cc | 87 bhp @ 6000 rpm | 109 Nm @ 4500 rpm | FWD | 3990.0 | 1680.0 | 1505.0 | 5.0 | 35.0 |
| 1 | Maruti Suzuki | Swift DZire VDI | 450000 | 2014 | 75000 | Diesel | Manual | Ludhiana | White | Second | Individual | 1248 cc | 74 bhp @ 4000 rpm | 190 Nm @ 2000 rpm | FWD | 3995.0 | 1695.0 | 1555.0 | 5.0 | 42.0 |
| 2 | Hyundai | i10 Magna 1.2 Kappa2 | 220000 | 2011 | 67000 | Petrol | Manual | Lucknow | Maroon | First | Individual | 1197 cc | 79 bhp @ 6000 rpm | 112.7619 Nm @ 4000 rpm | FWD | 3585.0 | 1595.0 | 1550.0 | 5.0 | 35.0 |
| 3 | Toyota | Glanza G | 799000 | 2019 | 37500 | Petrol | Manual | Mangalore | Red | First | Individual | 1197 cc | 82 bhp @ 6000 rpm | 113 Nm @ 4200 rpm | FWD | 3995.0 | 1745.0 | 1510.0 | 5.0 | 37.0 |
| 4 | Toyota | Innova 2.4 VX 7 STR [2016-2020] | 1950000 | 2018 | 69000 | Diesel | Manual | Mumbai | Grey | First | Individual | 2393 cc | 148 bhp @ 3400 rpm | 343 Nm @ 1400 rpm | RWD | 4735.0 | 1830.0 | 1795.0 | 7.0 | 55.0 |
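The preview shows that `Max Power` and `Max Torque` hold strings of the form `87 bhp @ 6000 rpm`. The sketch below shows one possible way to split such a string into a numeric value and an rpm figure. The helper `split_spec` and the pattern `SPEC_PATTERN` are hypothetical names introduced here, and the pattern only covers the format visible in the preview; other formats in the full dataset would return `(None, None)`.

```python
import re

# Hypothetical pattern for "<value> <unit> @ <rpm> rpm", e.g. "87 bhp @ 6000 rpm".
# Assumption: only the format seen in the five preview rows is handled.
SPEC_PATTERN = re.compile(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*\w+\s*@\s*(?P<rpm>\d+(?:\.\d+)?)\s*rpm",
    re.IGNORECASE,
)


def split_spec(text):
    """Return (value, rpm) as floats, or (None, None) if the text does not match."""
    if not isinstance(text, str):
        return (None, None)
    match = SPEC_PATTERN.match(text)
    if match is None:
        return (None, None)
    return (float(match.group("value")), float(match.group("rpm")))


# Values copied from the preview, plus a missing entry.
for raw in ["87 bhp @ 6000 rpm", "112.7619 Nm @ 4000 rpm", None]:
    print(raw, "->", split_spec(raw))
```

Applied with `df["Max Power"].apply(split_spec)`, this would yield one tuple per row, which could then be expanded into two numeric columns.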


In [57]:
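def extract_numerical(value):
    """Return the first number found in a string as a float.

    Strings without a number give None; non-string values (e.g., NaN) are returned unchanged.
    """
    if isinstance(value, str):
        numbers = re.findall(r"[-+]?\d*\.\d+|\d+", value)
        return float(numbers[0]) if numbers else None
    return value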

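# Strip the "cc" unit from Engine so the column becomes numeric
df["Engine"] = df["Engine"].apply(extract_numerical)

# Fill missing numeric values with the median of the same make.
# A make with no observed value keeps NaN, because its median is also NaN.
numerical_cols = ["Engine", "Length", "Width", "Height", "Seating Capacity", "Fuel Tank Capacity"]
for col in numerical_cols:
    df[col] = df.groupby("Make")[col].transform(lambda x: x.fillna(x.median()))

# Fill missing text values with the most frequent value of the same make,
# falling back to "Unknown" when the make has no observed value.
categorical_cols = ["Drivetrain", "Max Power", "Max Torque"]
for col in categorical_cols:
    df[col] = df.groupby("Make")[col].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Unknown"))

# Count the missing values that remain in each column
print(df.isnull().sum())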


Make                  0
Model                 0
Price                 0
Year                  0
Kilometer             0
Fuel Type             0
Transmission          0
Location              0
Color                 0
Owner                 0
Seller Type           0
Engine                0
Max Power             0
Max Torque            0
Drivetrain            0
Length                0
Width                 0
Height                0
Seating Capacity      0
Fuel Tank Capacity    1
dtype: int64


  return np.nanmean(a, axis, out=out, keepdims=keepdims)
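One `Fuel Tank Capacity` value is still missing after the per-make fill. This is consistent with a make whose values are all missing, since the median of an all-NaN group is itself NaN; the source does not confirm that this is the cause, so treat it as an assumption. The sketch below reproduces the effect on toy data (not taken from the dataset) and shows a two-step fill that falls back to the overall median.

```python
import numpy as np
import pandas as pd

# Toy data, not from the dataset: make "B" has no observed value at all.
toy = pd.DataFrame(
    {
        "Make": ["A", "A", "A", "B"],
        "Fuel Tank Capacity": [35.0, np.nan, 45.0, np.nan],
    }
)
col = "Fuel Tank Capacity"

# Step 1: fill with the median of the same make. Make "B" stays NaN.
group_median = toy.groupby("Make")[col].transform("median")
step1 = toy[col].fillna(group_median)

# Step 2: fill whatever is left with the overall median of the column.
step2 = step1.fillna(toy[col].median())

print("Missing after step 1:", int(step1.isna().sum()))
print("Missing after step 2:", int(step2.isna().sum()))
```

Keeping the two steps separate preserves the make-specific estimate wherever one exists and uses the global value only as a last resort.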


In [59]:
# One value is still missing after the per-make fill; use the overall median for it
df["Fuel Tank Capacity"] = df["Fuel Tank Capacity"].fillna(df["Fuel Tank Capacity"].median())
print("\nMissing Values After Fix:\n", df.isnull().sum())


Missing Values After Fix:
 Make                  0
Model                 0
Price                 0
Year                  0
Kilometer             0
Fuel Type             0
Transmission          0
Location              0
Color                 0
Owner                 0
Seller Type           0
Engine                0
Max Power             0
Max Torque            0
Drivetrain            0
Length                0
Width                 0
Height                0
Seating Capacity      0
Fuel Tank Capacity    0
dtype: int64
