# ID3 Decision Tree Analysis

This notebook applies the ID3 algorithm to a six-row dataset described by three attributes (color, shape, and size) and then considers how adding a new attribute could change the resulting tree.


## Dataset

The dataset consists of the following attributes:

- **Color**
- **Shape**
- **Size**
- **Class** (target attribute)

| Color | Shape  | Size  | Class |
|-------|--------|-------|-------|
| Red   | Square | Big   | +     |
| Blue  | Square | Big   | +     |
| Red   | Round  | Small | -     |
| Green | Square | Small | -     |
| Red   | Round  | Big   | +     |
| Green | Round  | Big   | -     |


## Step 1: Initial Entropy Calculation

The initial entropy of the dataset is calculated as follows:


In [7]:
import numpy as np

# Define the distribution of classes
num_positive = 3
num_negative = 3
total_count = num_positive + num_negative

# Binary entropy (in bits) of a set with the given class counts
def calculate_entropy(positive_count, negative_count):
    # A pure or empty set has zero entropy; returning early also avoids log2(0)
    if positive_count == 0 or negative_count == 0:
        return 0.0
    total = positive_count + negative_count
    prob_positive = positive_count / total
    prob_negative = negative_count / total
    return -(prob_positive * np.log2(prob_positive) + prob_negative * np.log2(prob_negative))

# Calculate initial entropy based on class distribution
initial_entropy_value = calculate_entropy(num_positive, num_negative)
initial_entropy_value

1.0
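
For reference, the quantity that `calculate_entropy` evaluates can be written out by hand. The symbols $H(S)$, $p_+$ and $p_-$ are notation introduced here (the entropy of the set and the fractions of positive and negative rows); they do not appear in the code.

$$
H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
$$

With three positive and three negative rows, $p_+ = p_- = 3/6$, so

$$
H(S) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1
$$

which agrees with the value printed above.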

## Step 2: Information Gain Calculation

We will calculate the entropy and information gain for each attribute: color, shape, and size.



In [6]:
# Calculating entropy for different attributes

# Function to compute average entropy for an attribute
def compute_average_entropy(attribute_values, target_values):
    total_count = len(attribute_values)
    average_entropy = 0.0

    for val in set(attribute_values):
        # Class labels of the rows where the attribute takes this value
        subset = [t for a, t in zip(attribute_values, target_values) if a == val]
        pos_count = subset.count('+')
        neg_count = subset.count('-')
        # Add this subset's entropy, weighted by its share of the rows
        average_entropy += (pos_count + neg_count) / total_count * calculate_entropy(pos_count, neg_count)

    return average_entropy

# Data setup (rows in the same order as the table above)
colors = ['red', 'blue', 'red', 'green', 'red', 'green']
shapes = ['square', 'square', 'round', 'square', 'round', 'round']
sizes = ['big', 'big', 'small', 'small', 'big', 'big']
classes = ['+', '+', '-', '-', '+', '-']

# Initial entropy of the overall class distribution
# (calculate_entropy, num_positive and num_negative are defined in Step 1)
initial_entropy_value = calculate_entropy(num_positive, num_negative)

# Calculate information gain for each attribute
# For color
color_entropy_value = compute_average_entropy(colors, classes)
color_information_gain = initial_entropy_value - color_entropy_value

# For shape
shape_entropy_value = compute_average_entropy(shapes, classes)
shape_information_gain = initial_entropy_value - shape_entropy_value

# For size
size_entropy_value = compute_average_entropy(sizes, classes)
size_information_gain = initial_entropy_value - size_entropy_value

(color_information_gain, shape_information_gain, size_information_gain)

(0.5408520829727552, 0.08170416594551044, 0.4591479170272448)
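
The three numbers above can be checked by hand. As a sketch, write the information gain of an attribute $A$ as the initial entropy minus the weighted entropy of the subsets $S_v$ it produces (the notation $\mathrm{Gain}$, $S_v$ and $H$ is introduced here for illustration):

$$
\mathrm{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)
$$

For **Color**, the Red subset has two + rows and one - row, while Blue and Green are pure:

$$
\mathrm{Gain}(S, \text{Color}) = 1 - \tfrac{3}{6}\left(-\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 1 - \tfrac{3}{6}(0.918) \approx 0.541
$$

For **Shape**, both Square and Round contain a 2-to-1 class mix, so each has entropy of about 0.918:

$$
\mathrm{Gain}(S, \text{Shape}) \approx 1 - \left(\tfrac{3}{6}(0.918) + \tfrac{3}{6}(0.918)\right) \approx 0.082
$$

For **Size**, the Big subset has three + rows and one - row, and Small is pure:

$$
\mathrm{Gain}(S, \text{Size}) = 1 - \tfrac{4}{6}\left(-\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 1 - \tfrac{4}{6}(0.811) \approx 0.459
$$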

## Step 3: Best Attribute Selection

The attribute with the highest information gain is chosen as the first splitting attribute. Here **Color** has the highest gain (about 0.541), ahead of **Size** (about 0.459) and **Shape** (about 0.082), so it becomes the root of the tree.
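
The selection step can also be done programmatically rather than by reading the tuple. The following is a minimal, self-contained sketch; the helper names `entropy_of` and `information_gain` and the `attributes` dictionary are illustrative choices made here, not part of the notebook's earlier cells.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    # Entropy in bits of a list of class labels
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    # Initial entropy minus the weighted entropy of each value's subset
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / len(labels) * entropy_of(subset)
    return entropy_of(labels) - remainder

classes = ['+', '+', '-', '-', '+', '-']
attributes = {
    'color': ['red', 'blue', 'red', 'green', 'red', 'green'],
    'shape': ['square', 'square', 'round', 'square', 'round', 'round'],
    'size': ['big', 'big', 'small', 'small', 'big', 'big'],
}

gains = {name: information_gain(vals, classes) for name, vals in attributes.items()}
best_attribute = max(gains, key=gains.get)  # attribute with the largest gain
print(best_attribute)
```

Keeping the gains in a dictionary keyed by attribute name means the same `max` call still works if more attributes are added later.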


## Decision Tree

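Splitting on **Color** leaves the Blue and Green branches pure, but the Red branch still contains two + rows and one - row, so ID3 splits it again. Among the Red rows, **Size** separates the classes perfectly, which gives the following tree:

- **Color**
  - **Red**: split on **Size**
    - **Big**: Class +
    - **Small**: Class -
  - **Blue**: Class +
  - **Green**: Class -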
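
The tree can be checked against the data with a short recursive sketch of ID3. This is a simplified illustration under stated assumptions, not a full implementation: the names `ROWS`, `entropy_of`, `gain` and `id3` are chosen here, ties between attributes are broken by list order, and a node with no attributes left falls back to the majority class.

```python
from collections import Counter
from math import log2

# The six rows from the dataset table
ROWS = [
    {'color': 'red',   'shape': 'square', 'size': 'big',   'class': '+'},
    {'color': 'blue',  'shape': 'square', 'size': 'big',   'class': '+'},
    {'color': 'red',   'shape': 'round',  'size': 'small', 'class': '-'},
    {'color': 'green', 'shape': 'square', 'size': 'small', 'class': '-'},
    {'color': 'red',   'shape': 'round',  'size': 'big',   'class': '+'},
    {'color': 'green', 'shape': 'round',  'size': 'big',   'class': '-'},
]

def entropy_of(rows):
    counts = Counter(r['class'] for r in rows)
    return -sum(n / len(rows) * log2(n / len(rows)) for n in counts.values())

def gain(rows, attr):
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy_of(subset)
    return entropy_of(rows) - remainder

def id3(rows, attrs):
    labels = [r['class'] for r in rows]
    if len(set(labels)) == 1:      # pure node: return its class as a leaf
        return labels[0]
    if not attrs:                  # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest)
                   for v in sorted({r[best] for r in rows})}}

tree = id3(ROWS, ['color', 'shape', 'size'])
print(tree)
```

On these six rows the red subset is not pure (it includes the Red, Round, Small row with class -), so the sketch splits that branch a second time on size.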


## Step 4: Impact of Adding a New Attribute

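If we add a new attribute, such as **Pattern of Shirt**, with values "checked," "striped," and "solid," ID3 recomputes the information gain of all four attributes at every node. The tree changes only where the new attribute's gain exceeds that of the attribute currently chosen; for example, it would replace **Color** at the root only if its gain were higher than Color's gain of about 0.541.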

### Possible Changes:
- New splits based on the new attribute.
- Potential revision of existing nodes if the new attribute offers better classification.
- Increased complexity or improved accuracy of the model.
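
To make the first two possibilities concrete, the sketch below appends a fourth column and recomputes the gains. The source gives no pattern values for the six rows, so the `pattern` list is entirely hypothetical: it was invented here and deliberately chosen to separate the classes perfectly. The helper names are also illustrative.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / len(labels) * entropy_of(subset)
    return entropy_of(labels) - remainder

classes = ['+', '+', '-', '-', '+', '-']
attributes = {
    'color': ['red', 'blue', 'red', 'green', 'red', 'green'],
    'shape': ['square', 'square', 'round', 'square', 'round', 'round'],
    'size': ['big', 'big', 'small', 'small', 'big', 'big'],
    # HYPOTHETICAL values, invented for illustration only (not from the dataset)
    'pattern': ['checked', 'solid', 'striped', 'striped', 'checked', 'striped'],
}

gains = {name: information_gain(vals, classes) for name, vals in attributes.items()}
new_root = max(gains, key=gains.get)
print(gains)
print(new_root)
```

With this invented assignment the new column's gain exceeds Color's, so it would take over the root. With a less informative assignment its gain could fall below Color's, and the root would stay as it is.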

### Consequences of Missing the New Attribute
If a data scientist omits this attribute and it turns out to be predictive, the consequences could include:
- Inaccurate predictions.
- Financial losses from production decisions based on those predictions.
- Surprising findings if the attribute is discovered later, which could force earlier strategic decisions to be revisited.
