# Task-5 Decision Tree Implementation

## Question 3

a) Show the usage of your decision tree for the [automotive efficiency](https://archive.ics.uci.edu/ml/datasets/auto+mpg) problem.
    
b) Compare the performance of your model with the decision tree module from scikit-learn.

> You should be editing `auto-efficiency.py` for the code containing the above experiments.

### Importing required libraries

In [1]:
import sys
import os

# Add the path to the directory containing tree.py
sys.path.append(os.path.abspath("../"))

import numpy as np
import pandas as pd
from tree.base import DecisionTree
from metrics import *
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)

### Extracting the auto-mpg data 

In [19]:

# Load the Auto MPG dataset

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
data = pd.read_csv(url, sep=r"\s+", header=None,
                   names=["mpg", "cylinders", "displacement", "horsepower", "weight",
                          "acceleration", "model year", "origin", "car name"])

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


In [20]:
data.drop('car name', axis=1, inplace=True)

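# Missing horsepower values are marked '?' in the raw file. They are replaced
# with 0 here, so the NaN check below finds nothing to drop.
data['horsepower'] = data['horsepower'].replace('?', 0)
data['horsepower'] = pd.to_numeric(data['horsepower'])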

data.replace('?', np.nan, inplace=True)
print("Missing values in dataset:", data.isnull().sum().sum())
if data.isnull().values.any():
    data.dropna(inplace=True)


print("Duplicate rows in dataset:", data.duplicated().sum())
if data.duplicated().any():
    data.drop_duplicates(inplace=True)

print("Dataset shape after cleaning:", data.shape)

Missing values in dataset: 0
Duplicate rows in dataset: 0
Dataset shape after cleaning: (398, 8)
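
Replacing `?` with 0 keeps all 398 rows but gives the affected cars a horsepower of zero. One alternative, sketched here under assumptions, is to have `read_csv` parse `?` as missing via `na_values` and then drop those rows. The three rows in `sample` are hand-written in the file's layout purely for illustration (hypothetical values and car names); they stand in for the real download.

```python
import io
import pandas as pd

columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model year", "origin", "car name"]

# Illustrative rows only (hypothetical values), laid out like auto-mpg.data.
sample = "\n".join([
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "car a"',
    '25.0 4 98.0 ? 2046. 19.0 71 1 "car b"',
    '24.0 4 113.0 95.0 2372. 15.0 70 3 "car c"',
])

# na_values="?" turns the placeholder into NaN while parsing,
# so horsepower is read as a float column directly.
raw = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None,
                  names=columns, na_values="?")

# Drop rows with unknown horsepower instead of imputing 0.
clean = raw.dropna(subset=["horsepower"]).drop(columns="car name")
print(clean.shape)
```

Whether dropping or imputing is preferable depends on the experiment. Dropping loses a few rows but avoids teaching the tree that some cars have zero horsepower.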




In [21]:
# separate X and y
X = data.drop('mpg', axis=1)
y = data['mpg']
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (398, 7)
y shape: (398,)


### Splitting the data into training and testing sets

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("X_train size: ", X_train.shape)
print("y_train size: ", y_train.shape)
print("X_test size: ", X_test.shape)
print("y_test size: ", y_test.shape)


X_train size:  (278, 7)
y_train size:  (278,)
X_test size:  (120, 7)
y_test size:  (120,)


### Q3 a) Our custom decision tree implementation

In [23]:
our_dt = DecisionTree(criterion="information_gain", max_depth=5)
our_dt.fit(X_train, y_train)

y_train_pred = our_dt.predict(X_train)
y_test_pred = our_dt.predict(X_test)

train_rmse_my = rmse(y_train_pred, y_train)
train_mae_my = mae(y_train_pred, y_train)

print("Train Metrics (Custom Decision Tree):")
print(f"Root Mean Squared Error: {train_rmse_my:.4f}")
print(f"Mean Absolute Error: {train_mae_my:.4f}")

test_rmse_my = rmse(y_test_pred, y_test)
test_mae_my = mae(y_test_pred, y_test)

print("\nTest Metrics (Custom):")
print(f"Root Mean Squared Error: {test_rmse_my:.4f}")
print(f"Mean Absolute Error: {test_mae_my:.4f}")

Train Metrics (Custom Decision Tree):
Root Mean Squared Error: 3.5642
Mean Absolute Error: 2.3862

Test Metrics (Custom):
Root Mean Squared Error: 3.6783
Mean Absolute Error: 2.9078
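
The `rmse` and `mae` helpers come from the project's `metrics` module, which is not shown in this notebook. Here is a minimal sketch of what such helpers typically compute, assuming a positional comparison of equal-length inputs. The function names mirror the calls above, but the bodies are an assumption, not the module's actual code.

```python
import numpy as np

def rmse(y_hat, y):
    """Root mean squared error between predictions and targets (positional)."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    assert y_hat.shape == y.shape, "inputs must have the same length"
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

def mae(y_hat, y):
    """Mean absolute error between predictions and targets (positional)."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    assert y_hat.shape == y.shape, "inputs must have the same length"
    return float(np.mean(np.abs(y_hat - y)))

# Errors are 0, 0 and -2, so RMSE = sqrt(4/3) and MAE = 2/3.
print(rmse([1, 2, 3], [1, 2, 5]), mae([1, 2, 3], [1, 2, 5]))
```

Converting to NumPy arrays first sidesteps pandas index alignment, which matters when predictions and targets carry different indices.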


### Q3 b) Using scikit-learn's Decision Tree Regressor

In [24]:
sklearn_model = DecisionTreeRegressor(max_depth=5)
sklearn_model.fit(X_train, y_train)

y_train_pred = sklearn_model.predict(X_train)
y_test_pred = sklearn_model.predict(X_test)

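# Give the predictions the same index as the targets so that the metrics
# compare matching rows even if they rely on pandas index alignment.
train_rmse = rmse(pd.Series(y_train_pred, index=y_train.index), y_train)
train_mae = mae(pd.Series(y_train_pred, index=y_train.index), y_train)

print("Training Metrics (DecisionTreeRegressor):")
print(f"Root Mean Squared Error: {train_rmse:.5f}")
print(f"Mean Absolute Error: {train_mae:.5f}")

test_rmse = rmse(pd.Series(y_test_pred, index=y_test.index), y_test)
test_mae = mae(pd.Series(y_test_pred, index=y_test.index), y_test)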

print("\nTest Metrics (DecisionTreeRegressor):")
print(f"Root Mean Squared Error: {test_rmse:.5f}")
print(f"Mean Absolute Error: {test_mae:.5f}")

Training Metrics (DecisionTreeRegressor):
Root Mean Squared Error: 2.09584
Mean Absolute Error: 1.54258

Test Metrics (DecisionTreeRegressor):
Root Mean Squared Error: 3.41111
Mean Absolute Error: 2.36398
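
A single 70/30 split of a few hundred rows gives a noisy error estimate. One possible way to get a steadier number is k-fold cross-validation. The sketch below assumes synthetic data (`X_demo`, `y_demo` are made up for illustration and are not the Auto MPG features), so its printed values say nothing about the results above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to keep the example self-contained.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=200)

# Five shuffled folds; each fold serves as the held-out set once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeRegressor(max_depth=5, random_state=42),
    X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error",
)

# scikit-learn reports negated RMSE so that larger is better; flip the sign.
fold_rmse = -scores
print(f"RMSE per fold: {np.round(fold_rmse, 3)}")
print(f"Mean RMSE: {fold_rmse.mean():.3f} (std {fold_rmse.std():.3f})")
```

Swapping in the real `X` and `y` would give a cross-validated RMSE for the scikit-learn tree. Running the custom tree through the same folds would need a manual loop, since it may not follow the scikit-learn estimator interface.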


### Performance Comparison

In [25]:
print("\nPerformance Comparison:\n")
print(f"Our Decision Tree - Train RMSE: {train_rmse_my:.5f}")
print(f"Our Decision Tree - Test RMSE: {test_rmse_my:.5f}")
print(f"Scikit-Learn Decision Tree - Train RMSE: {train_rmse:.5f}")
print(f"Scikit-Learn Decision Tree - Test RMSE: {test_rmse:.5f}")


Performance Comparison:

Our Decision Tree - Train RMSE: 3.56416
Our Decision Tree - Test RMSE: 3.67832
Scikit-Learn Decision Tree - Train RMSE: 2.09584
Scikit-Learn Decision Tree - Test RMSE: 3.41111


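On this split, scikit-learn's `DecisionTreeRegressor` has lower error than our custom implementation at the same `max_depth=5`.

- **Training RMSE**: Custom = 3.5642, Scikit-Learn = 2.0958  
- **Test RMSE**: Custom = 3.6783, Scikit-Learn = 3.4111  

The gap is large on the training set (about 1.47 mpg) but small on the test set (about 0.27 mpg). Both trees are limited to depth 5, and `DecisionTreeRegressor` applies no pruning by default, so the training gap most likely comes from how splits are chosen: scikit-learn evaluates every candidate threshold for the largest squared-error reduction, while our tree appears to pick less effective splits.  
Our model's train and test errors are close (3.56 vs 3.68), which is consistent with underfitting. The scikit-learn tree fits the training data much more tightly (2.10 vs 3.41) yet generalizes only slightly better.  
Two parts of our pipeline are worth checking: the split search for real-valued targets, and the missing `horsepower` entries, which are replaced with 0 during cleaning. These numbers come from a single 70/30 split, so the small test-set difference should be read with caution.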
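
One way to check the split logic of a regression tree is to compare it against a brute-force search for the threshold that minimizes the total squared error of the two children. Squared-error reduction is the default criterion of `DecisionTreeRegressor`. The sketch below is a reference check under assumptions, not the code in `tree.base`; `best_threshold` is a hypothetical helper name, and it handles a single numeric feature only.

```python
import numpy as np

def best_threshold(x, y):
    """Brute-force the split on one numeric feature that minimizes
    the summed squared error of the left and right children."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")
    x, y = x[order], y[order]

    best_t, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # equal values cannot be separated by a threshold
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Use the midpoint between neighbouring values as the threshold.
            best_t, best_sse = (x[i - 1] + x[i]) / 2.0, sse
    return best_t, best_sse

# Two clean clusters: the best split falls midway between 3 and 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
threshold, sse = best_threshold(x, y)
print(threshold, sse)
```

If the custom tree's root split on a given feature differs from this reference on the same data, the split search is a likely source of the higher training error.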
