# **Classification Empirical Study with Decision Trees**

**Group Number:** 97  
**Members:**  
Roy Rui #300176548  
Jiayi Ma #300263220
 

# **Dataset II: Zoo Animal Classification** 
This notebook classifies the animals in the Zoo dataset with a Decision Tree and studies how outlier removal and feature engineering affect its accuracy.  
- **Goal**: Use Decision Trees to classify animals into 7 categories (Mammal, Bird, Reptile, etc.).  
- **Data Source**: Zoo Animal Classification - Kaggle / UCI Machine Learning Repository (Richard Forsyth, 1990).  
- **Main Steps**: Data Cleaning, EDA and Outlier detection, Decision Trees, Feature Engineering, Empirical study, Result analysis.

---

# **Dataset Description**
**Authors & Collaborators**: Richard Forsyth  
**Source**: [Zoo Animal Classification](https://www.kaggle.com/datasets/uciml/zoo-animal-classification?select=zoo.csv)  
**Shape**: **18 Columns, 101 Rows**  
**Purpose**: Classify animals into seven categories based on their biological attributes, utilizing machine learning classification techniques.



The Zoo dataset includes **101 animals** described by **16 traits**: 15 boolean attributes and one numeric feature (`legs`). Each animal is assigned a label from **1 to 7**, corresponding to the category it belongs to. Using these attributes, we aim to build a **Decision Tree** classifier that accurately predicts each animal’s category.  

`zoo.csv` contains the core features and numeric labels.  
`class.csv` provides a mapping from numeric labels to textual class descriptions, along with lists of animal names.  

| Column       | Type    | Description                                              |
|--------------|---------|----------------------------------------------------------|
| `hair`       | Boolean | Whether the animal has hair (0 or 1)                    |
| `feathers`   | Boolean | Whether the animal has feathers (0 or 1)                |
| `eggs`       | Boolean | Whether the animal lays eggs (0 or 1)                   |
| `milk`       | Boolean | Whether the animal produces milk (0 or 1)               |
| `airborne`   | Boolean | Whether the animal can fly (0 or 1)                     |
| `aquatic`    | Boolean | Whether the animal is aquatic (0 or 1)                  |
| `predator`   | Boolean | Whether the animal is a predator (0 or 1)               |
| `toothed`    | Boolean | Whether the animal has teeth (0 or 1)                   |
| `backbone`   | Boolean | Whether the animal has a backbone (0 or 1)              |
| `breathes`   | Boolean | Whether the animal breathes air (0 or 1)                |
| `venomous`   | Boolean | Whether the animal is venomous (0 or 1)                 |
| `fins`       | Boolean | Whether the animal has fins (0 or 1)                    |
| `legs`       | Numeric | Number of legs (possible values: 0,2,4,5,6,8)            |
| `tail`       | Boolean | Whether the animal has a tail (0 or 1)                  |
| `domestic`   | Boolean | Whether the animal is domesticated (0 or 1)             |
| `catsize`    | Boolean | Whether the animal is roughly cat-sized (0 or 1)        |
| `class_type` | Numeric | Class label (1–7)                                        |


We use the mapping in `class.csv` to attach readable class names to the numeric labels, which makes the classification results easier to interpret.

---

### **Import Necessary Libraries and Read Data**

Here, we import the primary Python libraries needed for data analysis and machine learning:
- **pandas**, **numpy**: for data manipulation and array operations
- **sklearn.model_selection**: for data splitting and cross-validation
- **sklearn.tree**: for the decision tree model
- **sklearn.metrics**: for evaluation metrics (accuracy, f1_score, classification_report, confusion_matrix)
- **sklearn.neighbors**: LocalOutlierFactor for outlier detection


In [14]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sklearn.neighbors import LocalOutlierFactor

# Load zoo.csv and preview its shape and first few rows
df_zoo = pd.read_csv("zoo.csv")
print("zoo.csv shape:", df_zoo.shape)
print(df_zoo.head())

# Load class.csv for mapping numeric labels to descriptive class names and preview it
df_class = pd.read_csv("class.csv")
print("\nclass.csv preview:")
print(df_class.head())

# Create a mapping from numeric class labels to text labels
class_map = dict(zip(df_class["Class_Number"], df_class["Class_Type"]))

# Add the textual class labels to df_zoo
df_zoo["class_name"] = df_zoo["class_type"].map(class_map)

# Display a merged view with animal_name, numeric and textual class labels
print("\nMerged view:")
print(df_zoo[["animal_name", "class_type", "class_name"]].head())

zoo.csv shape: (101, 18)
  animal_name  hair  feathers  eggs  milk  airborne  aquatic  predator  \
0    aardvark     1         0     0     1         0        0         1   
1    antelope     1         0     0     1         0        0         0   
2        bass     0         0     1     0         0        1         1   
3        bear     1         0     0     1         0        0         1   
4        boar     1         0     0     1         0        0         1   

   toothed  backbone  breathes  venomous  fins  legs  tail  domestic  catsize  \
0        1         1         1         0     0     4     0         0        1   
1        1         1         1         0     0     4     1         0        1   
2        1         1         0         0     1     0     1         0        0   
3        1         1         1         0     0     4     0         0        1   
4        1         1         1         0     0     4     1         0        1   

   class_type  
0           1  
1          

---

## **Data Cleaning**
In this section, we handle any data cleaning steps in `zoo.csv`:
- **Checked for missing values** using the `isnull().sum()` method.
- **Confirmed no missing values existed**, so no imputation was required.
- **Dropped irrelevant column (`animal_name`)** as it does not contribute to classification.

We inspected `zoo.csv` for missing values and found none, so no imputation was necessary. We also removed the `animal_name` column, because it identifies individual animals rather than describing a trait and should not influence the classification.  



In [15]:
# Check for missing values in each column of the dataframe
missing_counts = df_zoo.isnull().sum()
print("\nNumber of missing values in each column:")
print(missing_counts)

# Drop 'animal_name' as it's not useful for classification
df_zoo.drop(columns=["animal_name"], inplace=True)


Number of missing values in each column:
animal_name    0
hair           0
feathers       0
eggs           0
milk           0
airborne       0
aquatic        0
predator       0
toothed        0
backbone       0
breathes       0
venomous       0
fins           0
legs           0
tail           0
domestic       0
catsize        0
class_type     0
class_name     0
dtype: int64


---

## **Numerical feature encoding**

 
Although decision trees can directly handle numerical features, **binning** can sometimes improve interpretability or performance by segmenting a continuous (or discrete) feature into categories. In this example, we apply a simple binning strategy to the `legs` column:

- **0:** Animals with 2 or fewer legs  
- **1:** Animals with more than 2 but fewer than 6 legs  
- **2:** Animals with 6 or more legs  

This creates a new feature `legs_binned`, which may be used alongside or instead of the original `legs` column during classification.
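
As a quick sanity check, the binning rule above can be applied to the six leg counts listed in the dataset description. The sketch below also shows `pd.cut` as an equivalent alternative; the bin edges passed to `pd.cut` are our own choice for illustration and rely on leg counts being integers.

```python
import pandas as pd

# The six leg counts listed in the dataset description
legs = pd.Series([0, 2, 4, 5, 6, 8], name="legs")

def bin_legs(x):
    """Rule from this section: <=2 -> 0, between 2 and 6 -> 1, >=6 -> 2."""
    if x <= 2:
        return 0
    elif x < 6:
        return 1
    return 2

manual = legs.apply(bin_legs)

# Right-closed intervals (-inf, 2], (2, 5], (5, inf) match the rule for integer counts
binned = pd.cut(
    legs,
    bins=[float("-inf"), 2, 5, float("inf")],
    labels=[0, 1, 2],
).astype(int)

print(pd.DataFrame({"legs": legs, "manual": manual, "pd_cut": binned}))
```

Both columns should agree for every leg count, which suggests either approach could be used to build `legs_binned`.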



In [16]:
# Bin the numeric 'legs' feature into three ordinal categories
def bin_legs(x):
    if x <= 2:
        return 0
    elif 2 < x < 6:
        return 1
    else:
        return 2

# Apply binning to 'legs' and create 'legs_binned'
df_zoo["legs_binned"] = df_zoo["legs"].apply(bin_legs)

---

## **Outlier Detection**

- **Applied Local Outlier Factor (LOF)** after converting boolean columns into numerical format (0/1).
- **Identified outliers** based on a contamination parameter set at 5%.
- **Removed the identified outliers** to produce a cleaned dataset (`df_no_outlier`) for further analysis.

To check dataset quality, we used the LOF method to detect and remove outliers. LOF flagged 5 of the 101 animals as outliers, and removing them produced a cleaned dataset (`df_no_outlier`). This lets us later compare the impact of outlier removal on model accuracy.  
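
To illustrate what LOF returns, here is a minimal sketch on hypothetical toy data (not the Zoo data): one dense cluster plus a few distant points. It uses the same `n_neighbors` and `contamination` values as this notebook; the data, the variable names, and the choice to print `negative_outlier_factor_` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical toy data: 95 points in one cluster plus 5 points far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(95, 2))
far_points = rng.uniform(low=8.0, high=12.0, size=(5, 2))
X = np.vstack([inliers, far_points])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)             # -1 = flagged as outlier, 1 = inlier
scores = lof.negative_outlier_factor_   # lower values = more anomalous

flagged = np.where(labels == -1)[0]
print("Number flagged:", flagged.size)
print("Flagged row indices:", flagged)
print("Their LOF scores:", scores[flagged].round(2))
```

The `contamination` parameter sets the score threshold so that roughly that fraction of rows is flagged, which is why about 5% of the rows are marked regardless of how anomalous they really are.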



In [17]:
# Select the feature columns by excluding the target columns (class_type, class_name)
numeric_cols = df_zoo.drop(columns=["class_type", "class_name"]).columns
df_zoo_numeric = df_zoo[numeric_cols].copy()

# Convert boolean columns to integers
for c in df_zoo_numeric.columns:
    if df_zoo_numeric[c].dtype == bool:
        df_zoo_numeric[c] = df_zoo_numeric[c].astype(int)

# Run LOF to detect outliers with 5% contamination
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
outlier_labels = lof.fit_predict(df_zoo_numeric)
df_zoo["outlier"] = outlier_labels

# Display outlier counts
print("\nLOF results:")
print(df_zoo["outlier"].value_counts())

# Remove outliers to create a cleaned dataset
df_no_outlier = df_zoo[df_zoo["outlier"] != -1].copy()



LOF results:
outlier
 1    96
-1     5
Name: count, dtype: int64


---

## **Predictive analysis: Decision Trees**

- We use the `DecisionTreeClassifier` from scikit-learn.  
- As a baseline setting, we keep the default parameters (Gini splitting criterion, no `max_depth` limit, `min_samples_split=2`) and fix `random_state=42` for reproducibility.

In this section, we define a helper function `prepare_features()` that **keeps only the feature columns** by dropping the target and bookkeeping columns (`class_type`, `class_name`, `outlier`). This helps streamline our process of training and comparing different decision tree configurations.

We will later instantiate a `DecisionTreeClassifier` to train it on these prepared features, and evaluate it as per the assignment steps. 
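
For reference, the baseline settings can be inspected directly with `get_params()`. This is a small sketch; the variable name `baseline` is ours, and the selection of parameters printed is just the subset discussed in this section.

```python
from sklearn.tree import DecisionTreeClassifier

# Same constructor call as used later in this notebook
baseline = DecisionTreeClassifier(random_state=42)
params = baseline.get_params()

# Print the parameters most relevant to tree growth
for name in ("criterion", "max_depth", "min_samples_split", "min_samples_leaf"):
    print(f"{name}: {params[name]}")
```

With no depth limit and `min_samples_split=2`, the tree is free to grow until its leaves are pure, which is worth keeping in mind when reading accuracy figures on a dataset this small.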


In [18]:
def prepare_features(dataframe):
    # Exclude class_type, class_name, and outlier from features
    return dataframe.drop(columns=["class_type", "class_name", "outlier"], errors="ignore")


---

## **Feature Engineering**

- **Created two new features** to fulfill assignment criteria:
  - **`milk_hair`**: indicating animals that produce milk and have hair.
  - **`milk_feathers`**: indicating animals that produce milk and have feathers (used to examine unusual feature combinations).

In this section, we introduced two new engineered features (`milk_hair` and `milk_feathers`) designed to capture possible interactions between biological attributes. `milk_hair` combines two traits typical of mammals, while `milk_feathers` is a deliberately unusual combination that we expect to be 0 for almost every animal. Whether these features actually help the decision tree is measured in the empirical study.
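
The sketch below shows how these interaction features behave on a hypothetical four-row frame (the rows and index names are invented for illustration). It also checks whether an engineered column is constant, since a constant column gives a decision tree nothing to split on.

```python
import pandas as pd

# Hypothetical toy rows, not taken from zoo.csv
toy = pd.DataFrame(
    {"milk": [1, 1, 0, 0], "hair": [1, 0, 0, 1], "feathers": [0, 0, 1, 0]},
    index=["mammal_like", "hairless_milk", "bird_like", "hairy_no_milk"],
)

# Bitwise AND of two 0/1 columns is 1 only when both traits are present
toy["milk_hair"] = (toy["milk"] & toy["hair"]).astype(int)
toy["milk_feathers"] = (toy["milk"] & toy["feathers"]).astype(int)
print(toy)

# A column with a single unique value cannot contribute to any split
constant_cols = [c for c in ["milk_hair", "milk_feathers"] if toy[c].nunique() == 1]
print("Constant engineered features:", constant_cols)
```

Running the same `nunique()` check on the real `df_zoo` would show whether `milk_feathers` carries any information in this dataset.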


In [19]:
# 1st feature: milk_hair
df_zoo["milk_hair"] = (df_zoo["milk"] & df_zoo["hair"]).astype(int)
df_no_outlier["milk_hair"] = (df_no_outlier["milk"] & df_no_outlier["hair"]).astype(int)

# 2nd feature: milk_feathers
df_zoo["milk_feathers"] = (df_zoo["milk"] & df_zoo["feathers"]).astype(int)
df_no_outlier["milk_feathers"] = (df_no_outlier["milk"] & df_no_outlier["feathers"]).astype(int)


---

## **Empirical Study**

- **Split the dataset** into train-validation (80%) and test set (20%).
- **Evaluated three configurations** through 4-fold cross-validation:
  - **Baseline**: Original features (plus `legs_binned`, which was added earlier), no outlier removal, no engineered interaction features.
  - **Feature Engineering (FeatEng)**: Included two new engineered features, no outlier removal.
  - **NoOutlier + FeatEng**: Combined outlier removal and engineered features.
- **Compared configurations using accuracy metrics**.

The dataset was split 80:20, and the three configurations were compared using 4-fold cross-validation with mean accuracy as the metric. Note that, as written, the code in this section runs cross-validation on each full dataset (all 101 rows, or 96 after outlier removal) rather than on the training-validation subset only, so the cross-validation scores are not fully independent of the held-out test rows.
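
A minimal sketch of the stricter protocol, in which cross-validation sees only the training-validation rows, is given here. It uses scikit-learn's built-in Iris data as a stand-in because `zoo.csv` is not bundled with the library; the variable names are ours.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): Iris instead of the Zoo dataset
X, y = load_iris(return_X_y=True)

# Hold out 20% as a test set that cross-validation never touches
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 4-fold cross-validation restricted to the training-validation portion
kf = KFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    X_trainval,
    y_trainval,
    cv=kf,
    scoring="accuracy",
)
print("Fold accuracies:", scores.round(4))
print("Mean accuracy:", round(scores.mean(), 4))
```

In this notebook, the equivalent change would be to pass `X_trainVal` and `y_trainVal` (or their outlier-removed counterparts) to `cross_val_accuracy()`; doing so would change the reported scores.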


In [20]:
# Split full dataset into train-validation and test sets (80:20)
X_all = prepare_features(df_zoo)
y_all = df_zoo["class_type"]

X_trainVal, X_test, y_trainVal, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Split outlier-removed dataset into train-validation and test sets
X_all_no_outlier = prepare_features(df_no_outlier)
y_all_no_outlier = df_no_outlier["class_type"]

X_trainVal_no_out, X_test_no_out, y_trainVal_no_out, y_test_no_out = train_test_split(
    X_all_no_outlier, y_all_no_outlier, test_size=0.2, random_state=42, stratify=y_all_no_outlier
)

# Create baseline datasets by dropping engineered features
df_baseline = df_zoo.drop(columns=["milk_hair", "milk_feathers"], errors="ignore").copy()
df_baseline_no_outlier = df_no_outlier.drop(columns=["milk_hair", "milk_feathers"], errors="ignore").copy()

# Define a function for 4-fold cross-validation to compute average accuracy
def cross_val_accuracy(X, y, n_splits=4):
    model = DecisionTreeClassifier(random_state=42)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    return scores.mean()

# Evaluate Baseline: no new features, no outlier removal
X_bsl = prepare_features(df_baseline)
y_bsl = df_baseline["class_type"]
acc_baseline = cross_val_accuracy(X_bsl, y_bsl, 4)

# Evaluate with Feature Engineering (new features, no outlier removal)
X_feat = prepare_features(df_zoo)
y_feat = df_zoo["class_type"]
acc_feat = cross_val_accuracy(X_feat, y_feat, 4)

# Evaluate with Outlier Removal + Feature Engineering (new features, outliers removed)
X_noOut_feat = prepare_features(df_no_outlier)
y_noOut_feat = df_no_outlier["class_type"]
acc_noOut_feat = cross_val_accuracy(X_noOut_feat, y_noOut_feat, 4)

print("\n========= Cross-validation results (4-Fold) =========")
print(f"Baseline (no new features, no outlier removal): {acc_baseline:.4f}")
print(f"FeatEng (with new features, no outlier removal): {acc_feat:.4f}")
print(f"NoOutlier+FeatEng (with outlier removal and new features): {acc_noOut_feat:.4f}")



Baseline (no new features, no outlier removal): 0.9308
FeatEng (with new features, no outlier removal): 0.9308
NoOutlier+FeatEng (with outlier removal and new features): 0.9062


## **Result Analysis**

- **Selected the NoOutlier + FeatEng configuration** for the final evaluation, even though it had the lowest cross-validation accuracy (0.9062 versus 0.9308 for the other two).
- **Trained final model** on the full training-validation set using the chosen configuration.
- **Evaluated performance** on the previously unseen test set, reporting metrics such as Accuracy, F1-score, classification report, and confusion matrix.
- **Discussed and compared final results** with cross-validation performance to assess consistency and generalizability.

After cross-validation, the NoOutlier + FeatEng configuration was carried forward for final assessment. This model was trained on the training-validation portion of the outlier-removed dataset and evaluated on the corresponding test set of 20 animals. The resulting metrics, along with a comparison to cross-validation outcomes, provide insight into the stability and predictive strength of the model. Note that class 5 does not appear in this test set, so the report covers only six of the seven classes.
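
Because a class can be missing from a small test set, it can help to pass the full label list to the metric functions so that every class gets a row. The sketch below uses hypothetical labels with the same per-class counts as the test set in this notebook (class 5 absent) and assumes perfect predictions purely for illustration.

```python
from sklearn.metrics import classification_report, confusion_matrix

all_labels = [1, 2, 3, 4, 5, 6, 7]

# Hypothetical labels mirroring the test-set support counts (no class 5)
y_true = [1] * 8 + [2] * 4 + [3] * 1 + [4] * 3 + [6] * 2 + [7] * 2
y_pred = list(y_true)  # assume perfect predictions for this illustration

# labels= forces a 7x7 matrix, so the empty class-5 row and column are visible
cm = confusion_matrix(y_true, y_pred, labels=all_labels)
print(cm)

# zero_division=0 avoids warnings for the class with no samples
print(classification_report(y_true, y_pred, labels=all_labels, zero_division=0))
```

The all-zero row for class 5 makes it explicit that the test set says nothing about how well that class is predicted.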


In [21]:
# Train and predict
dtc_best = DecisionTreeClassifier(random_state=42)
dtc_best.fit(X_trainVal_no_out, y_trainVal_no_out)
y_pred_test = dtc_best.predict(X_test_no_out)

# Compute metrics
acc_test = accuracy_score(y_test_no_out, y_pred_test)
f1_test = f1_score(y_test_no_out, y_pred_test, average='macro')

print("\n========= Final Test Set Results =========")
print(f"Test Accuracy: {acc_test:.4f}")
print(f"Test F1 (macro): {f1_test:.4f}")

# Show detailed report
print("\nClassification Report:")
print(classification_report(y_test_no_out, y_pred_test))
print("Confusion Matrix:")
print(confusion_matrix(y_test_no_out, y_pred_test))


Test Accuracy: 1.0000
Test F1 (macro): 1.0000

Classification Report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         8
           2       1.00      1.00      1.00         4
           3       1.00      1.00      1.00         1
           4       1.00      1.00      1.00         3
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         2

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20

Confusion Matrix:
[[8 0 0 0 0 0]
 [0 4 0 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 3 0 0]
 [0 0 0 0 2 0]
 [0 0 0 0 0 2]]


---

# **Conclusion**

In this classification task, we implemented data cleaning, outlier detection (LOF), and two new features (`milk_hair` and `milk_feathers`). We compared three configurations (Baseline, Feature Engineering, and Outlier Removal + Feature Engineering) using 4-fold cross-validation. The engineered features left cross-validation accuracy unchanged (0.9308), and removing outliers reduced it slightly (0.9062). The final model achieved perfect accuracy (100%) on the test set, but that set contains only 20 animals, was drawn after outlier removal, and includes no examples of class 5, so the result should be read as encouraging rather than conclusive.  

Future improvements may include experimenting with ensemble methods or optimizing decision tree parameters for potential performance gains.
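
As one possible starting point for the parameter tuning mentioned above, a grid search over a few tree parameters could look like the sketch below. The grid values are arbitrary choices for illustration, and Iris is again used as a stand-in dataset because `zoo.csv` is not bundled with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): Iris instead of the Zoo dataset
X, y = load_iris(return_X_y=True)

# Illustrative grid; these values are not tuned for the Zoo data
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 4],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=4,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 4))
```

On the Zoo data, the search would ideally be run on the training-validation rows only, with the test set reserved for a single final evaluation.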


---

# **References**
1. **Zoo Animal Classification**: [Kaggle - Zoo Animal Classification](https://www.kaggle.com/datasets/uciml/zoo-animal-classification?select=zoo.csv)  
2. **Assignment 3 PDF**: *CSI4142 Data Science, CSI4142-Assignment3-Description*  
3. **PredictiveAnalysis-DecisionTrees**: *CSI4142 Data Science, Winter2025-CSI4142-Week7-PredictiveAnalysis-DecisionTrees*   
4. **Python Outlier Detection with Local Outlier Factor (LOF)**: [YouTube Video](https://www.youtube.com/watch?v=_LEaSHhcNGw)
5. **Decision Tree Classification in Python (from scratch!)**: [YouTube Video](https://www.youtube.com/watch?v=sgQAhG5Q7iY)

---

## **Acknowledgments**
- **ChatGPT**: Formatting markdown texts, paraphrasing, grammar checks, and code debugging.  