# Experiment 3
Comparison of Decision Tree and Random Forest Classifiers

## Aim
To implement and compare Decision Tree and Random Forest classifiers on the Wine dataset and evaluate their performance based on classification metrics.

## Objectives
1. To understand the Wine dataset and perform basic exploratory data analysis (EDA).
2. To preprocess and split the data into training and testing sets.
3. To implement Decision Tree and Random Forest classifiers for predicting wine categories.
4. To analyze and compare model performance using accuracy, confusion matrices, and classification reports.
5. To gain insights into the effectiveness of ensemble learning compared to individual decision trees.

## Course Outcomes
1. Conduct exploratory data analysis on structured datasets using Python.
2. Implement and train Decision Tree and Random Forest classifiers.
3. Understand the differences between individual decision trees and ensemble methods.
4. Evaluate classifier performance using metrics such as accuracy, precision, recall, and confusion matrices.
5. Develop a deeper understanding of classification algorithms and their real-world applications.

## Theory
- Wine Dataset: The Wine dataset bundled with `scikit-learn` is a well-known benchmark for classification tasks. It consists of 178 samples belonging to three wine classes (59, 71, and 48 samples for classes 0, 1, and 2 respectively). Each sample has 13 numeric features describing chemical properties of the wine, including alcohol content, flavonoids (spelled `flavanoids` in the dataset), and color intensity.
- Decision Tree Classifier: A Decision Tree classifier is a supervised learning algorithm that uses a tree-like structure for decision-making. The model splits the dataset based on feature values, recursively forming branches that ultimately lead to class predictions. It can suffer from overfitting, particularly with deep trees.
- Random Forest Classifier: Random Forest is an ensemble learning method that trains many Decision Trees, each on a bootstrap sample of the training data with a random subset of features considered at each split, and combines their predictions. Averaging over many decorrelated trees reduces the overfitting seen in single decision trees and usually improves classification accuracy.
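
To make the idea of "splitting on feature values" concrete, the sketch below scores candidate splits with Gini impurity, which is the default splitting criterion of `DecisionTreeClassifier` in `scikit-learn`. The helper names `gini_impurity` and `weighted_split_impurity` and the toy labels are illustrative assumptions, not part of the experiment.

```python
from collections import Counter


def gini_impurity(labels):
    """Return the Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_split_impurity(left, right):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Toy labels (illustrative only, not taken from the Wine dataset).
parent = [0, 0, 0, 1, 1, 1]
pure_split = weighted_split_impurity([0, 0, 0], [1, 1, 1])   # classes fully separated
mixed_split = weighted_split_impurity([0, 1, 0], [1, 0, 1])  # classes still mixed

print("Parent impurity:", gini_impurity(parent))
print("Pure split impurity:", pure_split)
print("Mixed split impurity:", mixed_split)
```

A tree-growing algorithm of this kind prefers the split with the lowest weighted impurity, so here the pure split would be chosen over the mixed one.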

Machine Learning Steps:
1. Data Exploration: Understanding the dataset structure and distributions.
2. Data Preprocessing: Cleaning and preparing the dataset.
3. Model Training: Applying classification algorithms to learn patterns.
4. Model Evaluation: Assessing classifier performance using accuracy and confusion matrices.

## Procedure

1. Load the Dataset
    - Load the Wine dataset using `load_wine()`.
    - Convert it into a pandas DataFrame.
    - Separate features (`X`) and target labels (`y`).
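
The loading step can be sketched as follows. This variant assumes a `scikit-learn` version that supports `as_frame=True`, which returns pandas objects directly; the notebook for this experiment builds the DataFrame by hand instead, so treat this as an alternative rather than the method used.

```python
from sklearn.datasets import load_wine

# as_frame=True returns the features as a DataFrame and the target as a Series.
wine = load_wine(as_frame=True)

X = wine.data    # 13 feature columns
y = wine.target  # integer class labels 0, 1, 2

print(X.shape)
print(y.name)
```

With this approach the labels stay integers, whereas stacking features and labels into one NumPy array first converts them to floats.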

2. Perform Exploratory Data Analysis
    - Display the first few rows using `head()`.
    - Retrieve dataset information using `info()` and `describe()`.
    - Check the class distribution using `value_counts()`.
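
A minimal sketch of the class-distribution check, assuming the dataset is loaded with `as_frame=True`. The variable names `class_counts` and `class_share` are illustrative.

```python
from sklearn.datasets import load_wine

frame = load_wine(as_frame=True).frame

# Absolute number of samples per class, ordered by class label.
class_counts = frame["target"].value_counts().sort_index()

# Relative share of each class, which makes imbalance easier to judge.
class_share = frame["target"].value_counts(normalize=True).sort_index().round(3)

print(class_counts)
print(class_share)
```

If matplotlib is installed, `class_counts.plot(kind="bar")` would turn the same counts into a bar chart.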

3. Data Preprocessing
    - Check for missing values or outliers (if necessary).
    - In case of missing values, apply techniques such as mean imputation.
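
The Wine dataset has no missing values, so the sketch below uses a small hypothetical frame (`toy`, with made-up numbers) to show how the check and mean imputation could look with `SimpleImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap per column (not taken from the Wine dataset).
toy = pd.DataFrame({"alcohol": [14.2, np.nan, 13.2], "ash": [2.4, 2.1, np.nan]})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

In a full workflow the imputer would normally be fitted on the training split only and then applied to the test split, so that test data does not influence the imputed values.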

4. Split the Data: Divide the dataset into training and testing sets using `train_test_split()`, with an 80-20 split and a fixed `random_state` so that the split is reproducible.
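
The split can be sketched as below. The `stratify=y` argument is an optional addition assumed here to keep class proportions similar in both sets; the notebook for this experiment does not use it, so results from this variant may differ from the reported ones.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)

# 80-20 split; random_state fixes the shuffle, stratify preserves class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts().sort_index())
```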

5. Train the Decision Tree Classifier
    - Initialize `DecisionTreeClassifier()`.
    - Train the classifier using `fit()` on training data.

6. Train the Random Forest Classifier
    - Initialize `RandomForestClassifier()` with `n_estimators=100`.
    - Train the classifier using `fit()` on training data.
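
As a rough sketch of the two training steps, the snippet below fits both models and inspects their fitted structure. The variable names `dt` and `rf` are illustrative, and the split mirrors the 80-20 split described above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A single tree grown without depth limit, and a forest of 100 such trees.
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Tree depth:", dt.get_depth())
print("Tree leaves:", dt.get_n_leaves())
print("Trees in forest:", len(rf.estimators_))
```

Limiting `max_depth` or raising `min_samples_leaf` on the single tree is one common way to reduce overfitting, at the cost of a less flexible model.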

7. Model Predictions: Use both models to predict class labels on test data.

8. Evaluate the Models
    - Compute accuracy scores for both models.
    - Generate confusion matrices and classification reports.
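
To show how the reported metrics relate to each other, the sketch below recomputes accuracy, recall, and precision from the Decision Tree confusion matrix printed in the notebook output (rows are true classes, columns are predicted classes). The variable names are illustrative.

```python
import numpy as np

# Decision Tree confusion matrix from the notebook output.
cm = np.array([[13, 1, 0],
               [0, 14, 0],
               [1, 0, 7]])

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal over row sums (true samples of each class).
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: diagonal over column sums (predictions of each class).
precision = np.diag(cm) / cm.sum(axis=0)

print("Accuracy:", round(accuracy, 4))
print("Recall:", recall.round(2))
print("Precision:", precision.round(2))
```

The accuracy works out to 34 correct predictions out of 36, matching the value printed by `accuracy_score()`.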

9. Compare Model Performances: Analyze results to determine which classifier performs better.
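
A single 80-20 split leaves only 36 test samples, so the comparison can shift with the random seed. One possible way to get a steadier comparison, sketched here as an optional extension that is not part of the original procedure, is 5-fold cross-validation with `cross_val_score`. The dictionary names `models` and `cv_scores` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# For classifiers, cv=5 uses stratified folds, so each fold keeps the class mix.
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```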

## Results

- Exploratory Data Analysis
    - The dataset consists of 178 samples with 13 features.
    - The classes are moderately balanced: 59, 71, and 48 samples for classes 0, 1, and 2.
    - `info()` reports 178 non-null values for every feature column, so no imputation was needed.

- Model Training and Evaluation
    - Decision Tree Classifier
        - Accuracy: Approximately **94.4%** (34 of 36 test samples correct).
        - Classification Report: Precision ranged from 0.93 to 1.00 and recall from 0.88 to 1.00 across the three classes.
        - Confusion Matrix: Two test samples were misclassified (one class 0 sample predicted as class 1, and one class 2 sample predicted as class 0), which may reflect overfitting of the unpruned tree.

    - Random Forest Classifier
        - Accuracy: **100%** (36 of 36 test samples correct).
        - Classification Report: Precision, recall, and F1-score of 1.00 for all three classes.
        - Confusion Matrix: No misclassifications.

- Model Comparison
    - The **Random Forest classifier outperformed the Decision Tree** on this test split (100% versus about 94.4% accuracy).
    - This result is consistent with the ensemble reducing variance and generalizing better than a single tree.
    - The gap corresponds to two test samples out of 36, so it should be read as indicative rather than conclusive.
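
One way to probe the overfitting claim is to compare training and test accuracy for each model: a large gap between the two suggests the model has memorized the training data. The sketch below assumes the same split settings as the notebook (`test_size=0.2`, `random_state=42`); the `results` dictionary is an illustrative name.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# score() returns mean accuracy; store (train, test) accuracy per model.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={results[name][0]:.3f}, test={results[name][1]:.3f}")
```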

## Conclusion
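The Random Forest classifier performed better than the Decision Tree classifier on the held-out test set, classifying every test sample correctly while the single tree made a small number of errors. This is consistent with the ensemble method mitigating overfitting and improving generalization. Because the test set contains only 36 samples, the size of the difference should be interpreted with caution. The experiment also shows how accuracy, classification reports, and confusion matrices together give a fuller picture of classifier behavior than accuracy alone.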

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_wine # Using the wine dataset

# Load the wine dataset
wine = load_wine()
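# Combine features and target into one DataFrame.
# np.c_ stacks them column-wise, so the integer labels become floats (0.0, 1.0, 2.0).
data = pd.DataFrame(data=np.c_[wine['data'], wine['target']],
                    columns=wine['feature_names'] + ['target'])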

print(data.head())

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  target  
0          

In [2]:
data.info()  # info() prints its report directly and returns None, so print() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  targe

In [3]:
print(data.describe())

          alcohol  malic_acid         ash  alcalinity_of_ash   magnesium  \
count  178.000000  178.000000  178.000000         178.000000  178.000000   
mean    13.000618    2.336348    2.366517          19.494944   99.741573   
std      0.811827    1.117146    0.274344           3.339564   14.282484   
min     11.030000    0.740000    1.360000          10.600000   70.000000   
25%     12.362500    1.602500    2.210000          17.200000   88.000000   
50%     13.050000    1.865000    2.360000          19.500000   98.000000   
75%     13.677500    3.082500    2.557500          21.500000  107.000000   
max     14.830000    5.800000    3.230000          30.000000  162.000000   

       total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  \
count     178.000000  178.000000            178.000000       178.000000   
mean        2.295112    2.029270              0.361854         1.590899   
std         0.625851    0.998859              0.124453         0.572359   
min         0.9

In [4]:
print(data['target'].value_counts()) # Class distribution

# Data preprocessing (if needed) - Check for missing values, outliers, etc.
# In this case, the wine dataset is generally clean, but might need this for others.
# Example: Handling missing values (if any)
# data.fillna(data.mean(), inplace=True)

# Split data into features (X) and target (y)
X = data.drop('target', axis=1)
y = data['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42) # You can tune hyperparameters here
dt_classifier.fit(X_train, y_train)
dt_predictions = dt_classifier.predict(X_test)

# Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42, n_estimators=100) # n_estimators is the number of trees
rf_classifier.fit(X_train, y_train)
rf_predictions = rf_classifier.predict(X_test)


# Evaluate the models
print("\nDecision Tree Classifier:")
print(f"Accuracy: {accuracy_score(y_test, dt_predictions)}")
print(classification_report(y_test, dt_predictions))
print(confusion_matrix(y_test, dt_predictions))


print("\nRandom Forest Classifier:")
print(f"Accuracy: {accuracy_score(y_test, rf_predictions)}")
print(classification_report(y_test, rf_predictions))
print(confusion_matrix(y_test, rf_predictions))

# Compare the models on test accuracy, computing each score once
dt_accuracy = accuracy_score(y_test, dt_predictions)
rf_accuracy = accuracy_score(y_test, rf_predictions)

print("\nModel Comparison:")
if dt_accuracy > rf_accuracy:
    print("Decision Tree performed better.")
elif rf_accuracy > dt_accuracy:
    print("Random Forest performed better.")
else:
    print("Both models performed equally.")


target
1.0    71
0.0    59
2.0    48
Name: count, dtype: int64

Decision Tree Classifier:
Accuracy: 0.9444444444444444
              precision    recall  f1-score   support

         0.0       0.93      0.93      0.93        14
         1.0       0.93      1.00      0.97        14
         2.0       1.00      0.88      0.93         8

    accuracy                           0.94        36
   macro avg       0.95      0.93      0.94        36
weighted avg       0.95      0.94      0.94        36

[[13  1  0]
 [ 0 14  0]
 [ 1  0  7]]

Random Forest Classifier:
Accuracy: 1.0
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        14
         1.0       1.00      1.00      1.00        14
         2.0       1.00      1.00      1.00         8

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36

[[14  0  0]
 [ 0 14  0]
 [ 0  0  8]]

Mod