# 🚗 Day 9 - Car Price Prediction

---

## 1. Introduction


---

## 2. Project Objectives


---

## 3. Dataset Overview


---

## 4. Methodology and Approach


---

## 5. Tools and Libraries Used


---

## 6. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import filedialpy as fp
import joblib
import warnings

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

# Configurations
warnings.filterwarnings("ignore")

print('All libraries loaded!')

All libraries loaded!


---

## 7. Data Loading & Initial Exploration

Before model development, it is essential to **load and examine the dataset** to understand its structure and contents.  
This step includes importing the data, inspecting column types, identifying missing values, and reviewing basic statistics.  
Early exploration helps ensure the data is accurate, consistent, and suitable for further preprocessing and modeling.


#### Load the Dataset

In [3]:
df = pd.read_csv(fp.openFile())
print('Data Loaded!')
df.head()

Data Loaded!


| | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner |
|---|---|---|---|---|---|---|---|---|
| 0 | Maruti 800 AC | 2007 | 60000 | 70000 | Petrol | Individual | Manual | First Owner |
| 1 | Maruti Wagon R LXI Minor | 2007 | 135000 | 50000 | Petrol | Individual | Manual | First Owner |
| 2 | Hyundai Verna 1.6 SX | 2012 | 600000 | 100000 | Diesel | Individual | Manual | First Owner |
| 3 | Datsun RediGO T Option | 2017 | 250000 | 46000 | Petrol | Individual | Manual | First Owner |
| 4 | Honda Amaze VX i-DTEC | 2014 | 450000 | 141000 | Diesel | Individual | Manual | Second Owner |


#### Dataset Shape

In [5]:
df.shape

(4340, 8)

#### Dataset Information

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4340 entries, 0 to 4339
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           4340 non-null   object
 1   year           4340 non-null   int64 
 2   selling_price  4340 non-null   int64 
 3   km_driven      4340 non-null   int64 
 4   fuel           4340 non-null   object
 5   seller_type    4340 non-null   object
 6   transmission   4340 non-null   object
 7   owner          4340 non-null   object
dtypes: int64(3), object(5)
memory usage: 271.4+ KB


#### Statistical Summary

In [9]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| year | 4340.0 | 2013.090783 | 4.215344 | 1992.0 | 2011.0 | 2014.0 | 2016.0 | 2020.0 |
| selling_price | 4340.0 | 504127.311751 | 578548.736139 | 20000.0 | 208749.75 | 350000.0 | 600000.0 | 8900000.0 |
| km_driven | 4340.0 | 66215.777419 | 46644.102194 | 1.0 | 35000.0 | 60000.0 | 90000.0 | 806599.0 |


#### Missing Values

In [11]:
print(df.isnull().sum().sum(), "missing values found.")

0 missing values found.


#### Unique Values in Each Column

In [13]:
print("Unique values:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()}")

Unique values:
name: 1491
year: 27
selling_price: 445
km_driven: 770
fuel: 5
seller_type: 3
transmission: 2
owner: 5
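
The separate checks above (types, missing values, unique counts) can also be collected into one table. The following is a minimal sketch on a small made-up frame; `toy` and `summary` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", None],
    "km_driven": [70000, 50000, 100000, 46000],
})

# One row per column: dtype, number of missing values, number of distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isnull().sum(),
    "n_unique": toy.nunique(),  # nunique() ignores missing values by default
})
print(summary)
```

Running the same three expressions on `df` would give the counts listed above in a single view.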


---

## 8. Feature Engineering: Creating Meaningful Predictors

In this step, we enhance the dataset by creating new features that provide better insights to the model.  
Feature engineering helps the model capture real-world relationships between a car's attributes and its selling price.

#### Features Created:
1. **Car_Age:**  
   - Calculated as the difference between the current year (2025) and the car's manufacturing year.  
   - Represents how old the car is, which directly affects its resale value.

2. **Price_per_km:**  
   - Derived as `selling_price / (km_driven + 1)`  
   - Indicates the depreciation trend or value retention per kilometer driven.  
   - The `+1` prevents division-by-zero errors.

#### Columns Dropped:
- **name:** The car's model name. With 1,491 distinct values it is too specific to encode usefully and is not numerically meaningful.  
- **year:** Already used to calculate `Car_Age`, so no longer needed.

In [15]:
df2 = df.copy()  # untouched copy of the raw data (not used later in this notebook)
current_year = 2025
df['Car_Age'] = current_year - df['year']
df['Price_per_km'] = df['selling_price'] / (df['km_driven'] + 1)  # avoid div by 0

# Drop unneeded or leak-prone columns
df.drop(['name', 'year'], axis=1, inplace=True)

df.head()

| | selling_price | km_driven | fuel | seller_type | transmission | owner | Car_Age | Price_per_km |
|---|---|---|---|---|---|---|---|---|
| 0 | 60000 | 70000 | Petrol | Individual | Manual | First Owner | 18 | 0.857131 |
| 1 | 135000 | 50000 | Petrol | Individual | Manual | First Owner | 18 | 2.699946 |
| 2 | 600000 | 100000 | Diesel | Individual | Manual | First Owner | 13 | 5.99994 |
| 3 | 250000 | 46000 | Petrol | Individual | Manual | First Owner | 8 | 5.434664 |
| 4 | 450000 | 141000 | Diesel | Individual | Manual | Second Owner | 11 | 3.191467 |


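These transformations make the dataset more informative and remove the redundant `year` column. One caveat: `Price_per_km` is computed from `selling_price`, the prediction target, so using it as a predictor leaks the target into the features. The evaluation scores reported later should be read with that in mind, and the feature cannot be computed for a car whose selling price is not yet known.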
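
A quick way to check whether an engineered feature depends on the target is to try rebuilding the target from it. The sketch below does this for `Price_per_km` on three rows taken from the preview above; the name `reconstructed` is illustrative.

```python
import numpy as np
import pandas as pd

# Three rows copied from the data preview above
toy = pd.DataFrame({
    "selling_price": [60000, 135000, 600000],
    "km_driven": [70000, 50000, 100000],
})

# Same formula as in the notebook
toy["Price_per_km"] = toy["selling_price"] / (toy["km_driven"] + 1)

# Multiplying back by (km_driven + 1) recovers the target exactly,
# which means the two features together encode selling_price
reconstructed = toy["Price_per_km"] * (toy["km_driven"] + 1)
print(np.allclose(reconstructed, toy["selling_price"]))
```

Since `km_driven` is also a model input, a flexible model can approximate this product. Dropping `Price_per_km` from the predictors would be one way to obtain a leakage-free baseline for comparison.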



---

## 9. Define Feature Types: Numeric and Categorical Columns

Now that we've engineered new features, we'll categorize our dataset's columns into **numerical** and **categorical** types.  
This helps the preprocessing pipeline apply the correct transformations (scaling for numeric data, encoding for categorical data).


In [19]:
# Cell 4) Define columns
numeric_cols = ['km_driven', 'Car_Age', 'Price_per_km']
categorical_cols = ['fuel', 'seller_type', 'transmission', 'owner']

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

Numeric columns: ['km_driven', 'Car_Age', 'Price_per_km']
Categorical columns: ['fuel', 'seller_type', 'transmission', 'owner']


This separation ensures each data type is preprocessed optimally before model training.


---

## 10. Build the Preprocessing Pipeline

Before training, we must ensure the data is clean, standardized, and machine-readable.  
We'll build a **scikit-learn ColumnTransformer pipeline** to handle preprocessing in an automated, consistent way.

#### Components:
1. **Numeric Transformer:**  
   - Imputes missing values using the median (robust to outliers).  
   - Standardizes data using `StandardScaler()` so each numeric feature has zero mean and unit variance.

2. **Categorical Transformer:**  
   - Fills missing categories with the most frequent value.  
   - Encodes categorical variables using `OneHotEncoder()`.

#### Why Use a Pipeline?
- Prevents leakage from the test set, because imputation and scaling statistics are learned from the training split only  
- Ensures consistent preprocessing for both training and testing  
- Makes deployment simpler and reproducible

In [60]:
# Cell 5) Define transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])

print("Preprocessing pipelines for numeric and categorical data created successfully.")

Preprocessing pipelines for numeric and categorical data created successfully.
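
To see what this preprocessor produces, the following sketch rebuilds the same structure on a small made-up frame and inspects the result. The names `toy_train`, `toy_pre`, `new_car`, and `to_dense` are illustrative, and only the numeric column contains a missing value here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with one missing numeric value
toy_train = pd.DataFrame({
    "km_driven": [70000.0, 50000.0, np.nan, 46000.0],
    "Car_Age": [18, 18, 13, 8],
    "fuel": ["Petrol", "Diesel", "Petrol", "Petrol"],
})

toy_pre = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["km_driven", "Car_Age"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["fuel"]),
])

def to_dense(matrix):
    """Return a dense array whether the transformer gave sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

# Fit on the toy training data: 2 scaled numeric columns + 2 one-hot columns
train_matrix = to_dense(toy_pre.fit_transform(toy_train))
print(train_matrix.shape)

# A fuel type never seen during fitting is encoded as all zeros
new_car = pd.DataFrame({"km_driven": [60000.0], "Car_Age": [10], "fuel": ["CNG"]})
new_matrix = to_dense(toy_pre.transform(new_car))
print(new_matrix)
```

The all-zero encoding for the unseen category is what `handle_unknown='ignore'` buys: prediction does not fail on a new fuel type, although the model has no information about it.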


---

## 11. Create the Full Machine Learning Pipeline

Next, we integrate the preprocessing steps with a regression model inside a single unified pipeline.  
This ensures that all data transformations are applied automatically before training or prediction.

#### Model Used:
- **Random Forest Regressor:** an ensemble-based algorithm that captures nonlinear patterns and feature interactions effectively.

#### Benefits of the Full Pipeline:
- Streamlined training and evaluation  
- No manual reprocessing of test data  
- Easy to save, reload, and deploy

In [62]:
# Cell 6) Full Pipeline
model = RandomForestRegressor(n_estimators=200, random_state=42)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

print("Full Machine Learning pipeline (Preprocessor + Model) assembled.")

Full Machine Learning pipeline (Preprocessor + Model) assembled.


---

## 12. Splitting the Dataset and Training the Model

To evaluate how well our model generalizes, we divide the dataset into:
- **Training Set (80%)** → used to train the model  
- **Testing Set (20%)** → used to evaluate performance on unseen data

We'll also apply a **log transformation** (`np.log1p`) to the target variable (`selling_price`) to stabilize variance and reduce the impact of extreme price values.


In [25]:
# Cell 7) Split data
X = df.drop(columns=['selling_price'])
y = df['selling_price']

# Log-transform target for stability
y_log = np.log1p(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.2, random_state=42)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (3472, 7)
Test shape: (868, 7)
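
The effect of the log transform can be illustrated with the price quantiles from the statistical summary earlier. This is a small sketch using only NumPy; the variable names are illustrative.

```python
import numpy as np

# min, 25%, 50%, 75%, max of selling_price from the statistical summary
prices = np.array([20000.0, 208749.75, 350000.0, 600000.0, 8900000.0])

# log1p(x) = log(1 + x); expm1 is its exact inverse
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

# On the raw scale the maximum is hundreds of times the minimum;
# on the log scale the whole range spans only a few units
raw_ratio = prices.max() / prices.min()
log_range = log_prices.max() - log_prices.min()

print("Raw max/min ratio:", raw_ratio)
print("Log-scale range:", round(log_range, 3))
print("Round trip recovers prices:", np.allclose(recovered, prices))
```

Because `np.expm1` undoes `np.log1p`, predictions made on the log scale can be mapped back to prices without loss.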


#### 12.1 Train the Model

We now fit the complete pipeline on the training data. This step runs all preprocessing operations automatically (imputation, encoding, scaling), followed by model training on the transformed features.

After fitting, the pipeline can directly be used for predictions or evaluation without any manual preprocessing.

In [27]:
# Cell 8) Fit pipeline
print("Training model...")
pipeline.fit(X_train, y_train)
print("Model trained successfully!")

Training model...
Model trained successfully!
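
As an alternative to transforming the target by hand, scikit-learn's `TransformedTargetRegressor` can apply `np.log1p` before fitting and `np.expm1` after predicting. The sketch below uses synthetic data to show that it matches the manual approach; it is not what this notebook does, and `X_toy`, `y_toy`, `wrapped`, and `manual` are illustrative names.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic, skewed "prices" (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(60, 2))
y_toy = 50000 * np.exp(0.3 * X_toy[:, 0]) + 1000 * X_toy[:, 1]

# Wrapper: fits the forest on log1p(y) and returns expm1(prediction)
wrapped = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=20, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
wrapped.fit(X_toy, y_toy)
wrapped_prices = wrapped.predict(X_toy[:5])

# Manual equivalent, as done in this notebook
manual = RandomForestRegressor(n_estimators=20, random_state=42)
manual.fit(X_toy, np.log1p(y_toy))
manual_prices = np.expm1(manual.predict(X_toy[:5]))

print(np.allclose(wrapped_prices, manual_prices))
```

The wrapper keeps the inverse transform attached to the model, so anyone who reloads it later gets prices directly instead of log-prices.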


---

## 13. Model Evaluation: Assessing Performance

After training the model, it's crucial to evaluate how well it performs on unseen data.  
We'll use the **Root Mean Squared Error (RMSE)** and **R² Score** as performance metrics:

- **RMSE (Root Mean Squared Error):**  
  The square root of the mean squared difference between predicted and actual prices, expressed in the same units as the price. Large errors are penalized more heavily (lower is better).

- **R² Score (Coefficient of Determination):**  
  Indicates how much variance in the target variable is explained by the model (closer to 1 is better).

The target variable (`selling_price`) was log-transformed during training to reduce skewness,  
so we now reverse the log transformation (`np.expm1`) before calculating these metrics.


In [49]:
# Cell 9) Evaluate model
y_pred_log = pipeline.predict(X_test)
y_pred = np.expm1(y_pred_log)  # reverse log transform
y_true = np.expm1(y_test)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"Test RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.3f}")

Test RMSE: 221840.41
R² Score: 0.839
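
For reference, both metrics can be computed directly from their definitions. The sketch below does so with NumPy on four made-up price pairs; the numbers are illustrative and unrelated to the results above.

```python
import numpy as np

# Made-up actual and predicted prices
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 320.0, 380.0])

# RMSE: square root of the mean squared error
errors = y_true_toy - y_pred_toy
rmse_toy = np.sqrt(np.mean(errors ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_toy:.3f}")
print(f"R^2: {r2_toy:.3f}")
```

These are the same definitions that `mean_squared_error` (followed by a square root) and `r2_score` implement.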


---

## 14. Feature Importance: Understanding Key Predictors

To interpret the model, we examine **feature importances** from the trained Random Forest.  
Feature importance indicates how much each input contributes to predicting the car's selling price.

#### Steps:
1. Retrieve the trained model and preprocessing steps from the pipeline.  
2. Get transformed feature names after OneHotEncoding.  
3. Compute feature importances using the model's internal scoring.  
4. Sort and display the top 10 most influential features.

This helps us understand which car characteristics, such as age, kilometers driven, fuel type, or transmission,  
play the largest role in determining resale value. Because `Price_per_km` is derived from `selling_price`, a high importance for it reflects that relationship rather than an independent car characteristic.


In [31]:
# Cell 10) Feature importances
model = pipeline.named_steps['model']
pre = pipeline.named_steps['preprocessor']

# Get feature names after encoding
feature_names = (
    numeric_cols +
    list(pre.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_cols))
)

importances = model.feature_importances_
imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
imp_df = imp_df.sort_values('Importance', ascending=False)
imp_df.head(10)

| | Feature | Importance |
|---|---|---|
| 2 | Price_per_km | 0.731264 |
| 0 | km_driven | 0.147336 |
| 4 | fuel_Diesel | 0.045199 |
| 12 | transmission_Manual | 0.028428 |
| 11 | transmission_Automatic | 0.025266 |
| 1 | Car_Age | 0.011661 |
| 7 | fuel_Petrol | 0.007836 |
| 9 | seller_type_Individual | 0.000884 |
| 8 | seller_type_Dealer | 0.000833 |
| 15 | owner_Second Owner | 0.000465 |
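
One-hot encoding splits each categorical column across several rows of this table, which makes it harder to compare original columns. A possible follow-up, sketched here, is to sum the importances back to their source column. It uses only the ten rows displayed above (the remaining encoded columns are not shown, so the totals are partial), and `source_column` is a hypothetical helper.

```python
import pandas as pd

categorical_cols = ["fuel", "seller_type", "transmission", "owner"]

# The ten rows displayed above
top10 = pd.DataFrame({
    "Feature": [
        "Price_per_km", "km_driven", "fuel_Diesel", "transmission_Manual",
        "transmission_Automatic", "Car_Age", "fuel_Petrol",
        "seller_type_Individual", "seller_type_Dealer", "owner_Second Owner",
    ],
    "Importance": [
        0.731264, 0.147336, 0.045199, 0.028428, 0.025266,
        0.011661, 0.007836, 0.000884, 0.000833, 0.000465,
    ],
})

def source_column(feature_name):
    """Hypothetical helper: map an encoded feature name to its original column."""
    for col in categorical_cols:
        if feature_name.startswith(col + "_"):
            return col
    return feature_name  # numeric features keep their own name

# Sum importances per original column and sort from most to least important
grouped = (
    top10.assign(Source=top10["Feature"].map(source_column))
         .groupby("Source")["Importance"]
         .sum()
         .sort_values(ascending=False)
)
print(grouped)
```

Applied to the full `imp_df` instead of the ten rows, the same grouping would give one importance total per original column.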


---

## 15. Save the Trained Pipeline

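Once the model is trained and evaluated, we save the complete pipeline for future use.  
This serialized file includes both the fitted preprocessing steps (imputation, scaling, encoding) and the trained model,  
so it can be reloaded later for predictions without refitting anything. Two things stay outside the saved pipeline: the engineered columns (`Car_Age`, `Price_per_km`) must be added to new data before calling `predict`, and predictions come back on the log scale, so they need `np.expm1` to become prices.

Saving the pipeline makes the workflow **reproducible**, **deployable**, and **easy to reuse**.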

In [47]:
# Cell 11) Save pipeline
joblib.dump(pipeline, "car_price_pipeline.joblib")
print("Saved as car_price_pipeline.joblib")

Saved as car_price_pipeline.joblib
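
The save-and-reload cycle can be sketched end to end on a tiny stand-in pipeline. Everything here is illustrative: the toy data, the names `toy_pipeline` and `reloaded`, and the temporary file path. The real pipeline would be loaded with `joblib.load("car_price_pipeline.joblib")` instead.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in for the trained pipeline (toy data, log-transformed target)
toy_X = pd.DataFrame({
    "km_driven": [70000, 50000, 100000, 46000, 141000],
    "fuel": ["Petrol", "Petrol", "Diesel", "Petrol", "Diesel"],
})
toy_y_log = np.log1p(np.array([60000, 135000, 600000, 250000, 450000]))

toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"])],
        remainder="passthrough",  # keep km_driven as-is
    )),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
])
toy_pipeline.fit(toy_X, toy_y_log)

# Save to a temporary location, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_pipeline, path)
    reloaded = joblib.load(path)

# Predict for a new car; expm1 converts the log-scale output back to a price
new_car = pd.DataFrame({"km_driven": [60000], "fuel": ["Diesel"]})
price_before = np.expm1(toy_pipeline.predict(new_car))
price_after = np.expm1(reloaded.predict(new_car))
print(np.allclose(price_before, price_after))
```

A joblib file is a pickle underneath, so it should only be loaded from trusted sources, ideally with the same scikit-learn version that produced it.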

