### Entropy
**Entropy** is a measure of the **uncertainty (impurity)** in data.

In simple terms, entropy tells us **how difficult it is to predict an outcome**.

For example, when tossing a **fair coin**, heads and tails are equally likely, so we cannot tell what will happen next. This situation has **high entropy** (the maximum for two outcomes, 1 bit). But if a coin always gives **heads**, the result is certain and the entropy is **zero**.

In machine learning, if a dataset contains **equal numbers of each class**, the data is maximally mixed, so entropy is high. If the dataset contains **only one class**, there is no uncertainty and entropy is zero.

**Conclusion:**
Entropy measures **how uncertain or mixed the data is**.
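
To make the coin example concrete: for class proportions `p_i`, Shannon entropy in bits is `H = -sum(p_i * log2(p_i))`. The sketch below evaluates it for the two coins. The helper name `entropy_from_probs` is an assumption made for this illustration and is not part of the notebook's later code.

```python
import math

def entropy_from_probs(probs):
    """Shannon entropy (in bits) of a list of class probabilities."""
    # Terms with p == 0 are skipped: the limit of p * log2(p) as p -> 0 is 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_from_probs([0.5, 0.5])     # two equally likely outcomes
always_heads = entropy_from_probs([1.0, 0.0])  # the outcome is certain

print("Fair coin:", fair_coin)
print("Always heads:", always_heads)
```

The fair coin reaches the two-outcome maximum of 1 bit, and the certain coin gives 0.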


### Information Gain

**Information Gain** is **how much uncertainty is removed by splitting the data**.

In simple terms, Information Gain tells us **how useful a split is** at separating the classes.

For example, before a split, a dataset may be very mixed (high entropy). After splitting on a good feature, each group becomes purer, so uncertainty decreases. This reduction in uncertainty, measured as the parent's entropy minus the size-weighted average entropy of the child groups, is the **Information Gain**.

In decision trees, the split with the **highest Information Gain** is chosen at each node because it gives the **best split**.

**Conclusion:**
Information Gain is the **reduction in entropy after splitting the data**, showing how much confusion is removed.
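
As a rough illustration, Information Gain can be written as `IG = H(parent) - sum((n_k / n) * H(child_k))`, where `n_k` is the size of child group `k`. The sketch below compares a split that separates the classes perfectly with one that does not help at all. The labels are made up for this example, and the helper signatures are assumptions that differ from the notebook's own functions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Illustrative labels (made up for this sketch).
parent = ["Yes", "Yes", "No", "No"]
perfect = [["Yes", "Yes"], ["No", "No"]]  # each group is pure
useless = [["Yes", "No"], ["Yes", "No"]]  # each group is as mixed as the parent

print("Perfect split gain:", information_gain(parent, perfect))
print("Useless split gain:", information_gain(parent, useless))
```

The perfect split removes all of the parent's 1 bit of entropy, while the useless split removes none.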


In [2]:
# 1 Create a small toy dataset manually (10–12 rows) with:
#• 2–3 features
#• binary class label

import csv

# Lists to store features and labels
X = []
y = []

# Open the CSV file
with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)  # skip header

    # Take first 11 valid rows (day, outlook, Play)
    count = 0
    for row in reader:
        if count >= 11:
            break

        # row format: day, outlook, play
        day = int(row[0])
        outlook = row[1]
        play = row[-1]

        X.append([day, outlook])
        y.append(play)

        count += 1

# Print the manually loaded dataset
print("Features (Day, Outlook):")
for i in range(len(X)):
    print(X[i], "-> Label:", y[i])

Features (Day, Outlook):
[1, 'Sunny'] -> Label: No
[2, 'Sunny'] -> Label: No
[3, 'Overcast'] -> Label: Yes
[4, 'Rain'] -> Label: Yes
[5, 'Rain'] -> Label: Yes
[6, 'Rain'] -> Label: No
[7, 'Overcast'] -> Label: Yes
[8, 'Sunny'] -> Label: No
[9, 'Sunny'] -> Label: Yes
[10, 'Rain'] -> Label: Yes
[11, 'Sunny'] -> Label: Yes
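
The task comment asks for a dataset created by hand, while the cell above reads a CSV from a local path that other machines will not have. One possible alternative, sketched here, is to write out the 11 rows printed above directly in code. The names `rows`, `X_toy` and `y_toy` are assumptions for this illustration.

```python
# The 11 rows printed above, written out by hand: (day, outlook, play).
rows = [
    (1, "Sunny", "No"),
    (2, "Sunny", "No"),
    (3, "Overcast", "Yes"),
    (4, "Rain", "Yes"),
    (5, "Rain", "Yes"),
    (6, "Rain", "No"),
    (7, "Overcast", "Yes"),
    (8, "Sunny", "No"),
    (9, "Sunny", "Yes"),
    (10, "Rain", "Yes"),
    (11, "Sunny", "Yes"),
]

X_toy = [[day, outlook] for day, outlook, _ in rows]  # features
y_toy = [play for _, _, play in rows]                 # binary labels

print(len(X_toy), "rows;", y_toy.count("Yes"), "Yes /", y_toy.count("No"), "No")
```

With the data defined inline, the later cells could run without the CSV file.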


In [3]:
# 2 Write a function entropy(y) that:
#• counts class proportions
#• returns entropy value

import math

def entropy(y):
    total = len(y)
    counts = {}

    # count each class
    for label in y:
        if label in counts:
            counts[label] += 1
        else:
            counts[label] = 1

    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)

    return ent



print("Entropy:", round(entropy(y), 3))


Entropy: 0.946


In [4]:
# 3 Test entropy() for:
#• perfectly balanced labels
#• fully pure labels

import math

def entropy(y):
    total = len(y)
    counts = {}

    for label in y:
        if label in counts:
            counts[label] += 1
        else:
            counts[label] = 1

    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)

    return ent


# Test 1: Perfectly balanced labels
y_balanced = ['Yes', 'No', 'Yes', 'No']
print("Balanced labels entropy:", entropy(y_balanced))

# Test 2: Fully pure labels
y_pure = ['Yes', 'Yes', 'Yes', 'Yes']
print("Pure labels entropy:", entropy(y_pure))


Balanced labels entropy: 1.0
Pure labels entropy: 0.0
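
The two tests above cover the binary case, where entropy peaks at 1.0. A further check, sketched below, is that with more than two classes the maximum is `log2(k)` for `k` balanced classes. The compact `entropy` here is a stand-in written for this sketch, and the labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

four_classes = ["A", "B", "C", "D"] * 2  # four balanced classes, two rows each

print("Four balanced classes:", entropy(four_classes))
print("log2(4):", math.log2(4))
```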


In [5]:
# 4 Write a function split_dataset(X, y, feature, value) that splits data into:
#• left group (feature <= value)
#• right group (feature > value)
import csv

# Lists to store features and labels
X = []
y = []

# Open the CSV file
with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)  # skip header

    # Take first 11 valid rows (day, outlook, Play)
    count = 0
    for row in reader:
        if count >= 11:
            break

        # row format: day, outlook, play
        day = int(row[0])
        outlook = row[1]
        play = row[-1]

        X.append([day, outlook])
        y.append(play)

        count += 1

def split_dataset(X, y, feature_index, value):
    left_X, left_y = [], []
    right_X, right_y = [], []

    for i in range(len(X)):
        if X[i][feature_index] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])

    return (left_X, left_y), (right_X, right_y)

# Example split on Day <= 5
(left_X, left_y), (right_X, right_y) = split_dataset(X, y, feature_index=0, value=5)

# Print both groups
print("Left Group (Day <= 5):")
for i in range(len(left_X)):
    print(left_X[i], "->", left_y[i])

print("\nRight Group (Day > 5):")
for i in range(len(right_X)):
    print(right_X[i], "->", right_y[i])


Left Group (Day <= 5):
[1, 'Sunny'] -> No
[2, 'Sunny'] -> No
[3, 'Overcast'] -> Yes
[4, 'Rain'] -> Yes
[5, 'Rain'] -> Yes

Right Group (Day > 5):
[6, 'Rain'] -> No
[7, 'Overcast'] -> Yes
[8, 'Sunny'] -> No
[9, 'Sunny'] -> Yes
[10, 'Rain'] -> Yes
[11, 'Sunny'] -> Yes


In [6]:
# 5 Write a function information_gain(X, y, feature, value) using:
#• parent entropy
#• weighted child entropies

import math
import csv

# Lists to store features and labels
X = []
y = []

# Open the CSV file
with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)  # skip header

    # Take first 11 valid rows (day, outlook, Play)
    count = 0
    for row in reader:
        if count >= 11:
            break

        # row format: day, outlook, play
        day = int(row[0])
        outlook = row[1]
        play = row[-1]

        X.append([day, outlook])
        y.append(play)

        count += 1


# Step 2: Entropy function
def entropy(y):
    total = len(y)
    counts = {}
    for label in y:
        if label in counts:
            counts[label] += 1
        else:
            counts[label] = 1

    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)
    return ent


# Step 3: Split function (numeric: <= value)
def split_dataset(X, y, feature_index, value):
    left_X, left_y = [], []
    right_X, right_y = [], []

    for i in range(len(X)):
        if X[i][feature_index] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])
    return (left_X, left_y), (right_X, right_y)


# Step 4: Information Gain function
def information_gain(X, y, feature_index, value):
    parent_entropy = entropy(y)
    (left_X, left_y), (right_X, right_y) = split_dataset(X, y, feature_index, value)

    n = len(y)
    weighted_child_entropy = (len(left_y)/n)*entropy(left_y) + (len(right_y)/n)*entropy(right_y)

    ig = parent_entropy - weighted_child_entropy
    return ig


# Step 5: Example - split on Day <= 5
ig_day = information_gain(X, y, feature_index=0, value=5)
print("Information Gain for split Day <= 5:", round(ig_day, 3))


Information Gain for split Day <= 5: 0.003


In [7]:
# 6 Loop over all features & possible split values and:
#• compute information gain
#• print best split
import csv
import math

# Load Day, Outlook, Play
X = []
y = []

with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)

    for row in reader:
        day = int(row[0])   
        outlook = row[1]
        play = row[-1]

        X.append([day, outlook])
        y.append(play)


# Entropy function
def entropy(y):
    total = len(y)
    counts = {}

    for label in y:
        counts[label] = counts.get(label, 0) + 1

    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)

    return ent


# Split dataset (numeric)
def split_dataset(X, y, feature_index, value):
    left_X, left_y = [], []
    right_X, right_y = [], []

    for i in range(len(X)):
        if X[i][feature_index] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])

    return left_y, right_y


# Information Gain
def information_gain(X, y, feature_index, value):
    parent_entropy = entropy(y)
    left_y, right_y = split_dataset(X, y, feature_index, value)

    n = len(y)
    weighted_entropy = (len(left_y)/n)*entropy(left_y) + (len(right_y)/n)*entropy(right_y)

    return parent_entropy - weighted_entropy



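# Loop over candidate split values of the numeric feature Day (index 0).
# Outlook (index 1) is categorical, so the <= comparison is not applied to it here.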


best_ig = -1
best_value = None

values = sorted(set(row[0] for row in X))  # Day values

for value in values:
    ig = information_gain(X, y, feature_index=0, value=value)
    print(f"Feature Day <= {value} | Information Gain = {round(ig, 3)}")

    if ig > best_ig:
        best_ig = ig
        best_value = value


print("\nBest Split:")
print("Feature: Day")
print("Split Condition: <=", best_value)
print("Best Information Gain:", round(best_ig, 3))


Feature Day <= 1 | Information Gain = 0.113
Feature Day <= 2 | Information Gain = 0.245
Feature Day <= 3 | Information Gain = 0.079
Feature Day <= 4 | Information Gain = 0.025
Feature Day <= 5 | Information Gain = 0.003
Feature Day <= 6 | Information Gain = 0.048
Feature Day <= 7 | Information Gain = 0.016
Feature Day <= 8 | Information Gain = 0.09
Feature Day <= 9 | Information Gain = 0.045
Feature Day <= 10 | Information Gain = 0.015
Feature Day <= 11 | Information Gain = 0.0
Feature Day <= 12 | Information Gain = 0.01
Feature Day <= 13 | Information Gain = 0.113
Feature Day <= 14 | Information Gain = 0.0

Best Split:
Feature: Day
Split Condition: <= 2
Best Information Gain: 0.245
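
In the output above, the last candidate, `Day <= 14`, sends every row to the left group, so its gain is 0.0. A common alternative, sketched here, is to use the midpoints between consecutive distinct values as thresholds, which never leaves a group empty. The helper name `candidate_thresholds` is an assumption for this illustration.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values, in ascending order."""
    unique = sorted(set(values))
    # zip pairs each value with its successor: (v0, v1), (v1, v2), ...
    return [(a + b) / 2 for a, b in zip(unique, unique[1:])]

days = [1, 2, 3, 4, 5]
print(candidate_thresholds(days))  # [1.5, 2.5, 3.5, 4.5]
```

A feature with a single distinct value yields no thresholds, which signals that it cannot be split.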


In [8]:
# 7 Create a function best_split(X, y) that returns:
#• best feature
#• best split value
#• best info gain

import csv
import math

# Load Day, Outlook, Play
X = []
y = []

with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)

    for row in reader:
        day = int(row[0])   
        outlook = row[1]
        play = row[-1]

        X.append([day, outlook])
        y.append(play)


# Entropy function
def entropy(y):
    total = len(y)
    counts = {}

    for label in y:
        counts[label] = counts.get(label, 0) + 1

    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)

    return ent


# Split dataset (numeric)
def split_dataset(X, y, feature_index, value):
    left_X, left_y = [], []
    right_X, right_y = [], []

    for i in range(len(X)):
        if X[i][feature_index] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])

    return left_y, right_y


# Information Gain
def information_gain(X, y, feature_index, value):
    parent_entropy = entropy(y)
    left_y, right_y = split_dataset(X, y, feature_index, value)

    n = len(y)
    weighted_entropy = (len(left_y)/n)*entropy(left_y) + (len(right_y)/n)*entropy(right_y)

    return parent_entropy - weighted_entropy
    
def best_split(X, y):
    best_ig = -1
    best_feature = None
    best_value = None

    num_features = len(X[0])

    for feature_index in range(num_features):

        # only numeric feature: Day (index 0)
        if feature_index == 0:
            values = sorted(set(row[feature_index] for row in X))

            for value in values:
                ig = information_gain(X, y, feature_index, value)

                if ig > best_ig:
                    best_ig = ig
                    best_feature = feature_index
                    best_value = value

    return best_feature, best_value, best_ig


feature, value, ig = best_split(X, y)

print("Best Feature Index:", feature)
print("Best Split Value:", value)
print("Best Information Gain:", round(ig, 3))


Best Feature Index: 0
Best Split Value: 2
Best Information Gain: 0.245


In [9]:
# 8 Build a recursive function build_tree(X, y, depth) that:
#• finds best split
#• creates left + right child nodes

import csv
import math

# Load Day, Outlook, Play
X = []
y = []

with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)

    for row in reader:
        day = int(row[0])   
        outlook = row[1]
        play = row[-1]

        X.append([day, outlook])
        y.append(play)


# Entropy function
def entropy(y):
    total = len(y)
    counts = {}

    for label in y:
        counts[label] = counts.get(label, 0) + 1

    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)

    return ent


# Split dataset (numeric)
def split_dataset(X, y, feature_index, value):
    left_X, left_y = [], []
    right_X, right_y = [], []

    for i in range(len(X)):
        if X[i][feature_index] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])

    return left_y, right_y


# Information Gain
def information_gain(X, y, feature_index, value):
    parent_entropy = entropy(y)
    left_y, right_y = split_dataset(X, y, feature_index, value)

    n = len(y)
    weighted_entropy = (len(left_y)/n)*entropy(left_y) + (len(right_y)/n)*entropy(right_y)

    return parent_entropy - weighted_entropy
    
def best_split(X, y):
    best_ig = -1
    best_feature = None
    best_value = None

    num_features = len(X[0])

    for feature_index in range(num_features):

        # only numeric feature: Day (index 0)
        if feature_index == 0:
            values = sorted(set(row[feature_index] for row in X))

            for value in values:
                ig = information_gain(X, y, feature_index, value)

                if ig > best_ig:
                    best_ig = ig
                    best_feature = feature_index
                    best_value = value

    return best_feature, best_value, best_ig


feature, value, ig = best_split(X, y)

def build_tree(X, y, depth, max_depth=2):
    # Pure node: every label is the same
    if y.count(y[0]) == len(y):
        return {"label": y[0]}  # pure node

    if depth == max_depth:
        # majority class
        label = max(set(y), key=y.count)
        return {"label": label}

    # Find best split (Day only)
    best_feature, best_value, best_ig = best_split(X, y)

    if best_ig == 0 or best_feature is None:
        label = max(set(y), key=y.count)
        return {"label": label}

    # Split data
    left_X, left_y = [], []
    right_X, right_y = [], []

    for i in range(len(X)):
        if X[i][best_feature] <= best_value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])

    # Build children recursively
    left_child = build_tree(left_X, left_y, depth + 1, max_depth)
    right_child = build_tree(right_X, right_y, depth + 1, max_depth)

    return {
        "feature": best_feature,
        "value": best_value,
        "left": left_child,
        "right": right_child
    }
tree = build_tree(X, y, depth=0)
print(tree)


{'feature': 0, 'value': 2, 'left': {'label': 'No'}, 'right': {'feature': 0, 'value': 13, 'left': {'label': 'Yes'}, 'right': {'label': 'No'}}}


In [10]:
# 9 Add stopping conditions:
#• pure node
#• max depth reached
#• minimum samples per node

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):

    # Pure node
    if y.count(y[0]) == len(y):
        return {"label": y[0]}

    # Max depth reached
    if depth == max_depth:
        label = max(set(y), key=y.count)
        return {"label": label}

    # Minimum samples
    if len(y) <= min_samples:
        label = max(set(y), key=y.count)
        return {"label": label}

    # Find best split
    feature, value, ig = best_split(X, y)

    # If no useful split
    if ig == 0:
        label = max(set(y), key=y.count)
        return {"label": label}

    # Split dataset
    left_y, right_y = split_dataset(X, y, feature, value)
    left_X = [X[i] for i in range(len(X)) if X[i][feature] <= value]
    right_X = [X[i] for i in range(len(X)) if X[i][feature] > value]

    # Recursive calls
    return {
        "feature": feature,
        "value": value,
        "left": build_tree(left_X, left_y, depth+1, max_depth, min_samples),
        "right": build_tree(right_X, right_y, depth+1, max_depth, min_samples)
    }

  
tree = build_tree(X, y, depth=0)
print(tree)

{'feature': 0, 'value': 2, 'left': {'label': 'No'}, 'right': {'feature': 0, 'value': 13, 'left': {'feature': 0, 'value': 8, 'left': {'label': 'Yes'}, 'right': {'label': 'Yes'}}, 'right': {'label': 'No'}}}


In [11]:
# 10 Write a function predict(tree, x_test) to classify a single input sample.

def predict(tree, x_test):

    # If leaf node
    if "label" in tree:
        return tree["label"]

    # Get split info
    feature = tree["feature"]
    value = tree["value"]

    # Go left or right
    if x_test[feature] <= value:
        return predict(tree["left"], x_test)
    else:
        return predict(tree["right"], x_test)
        
x_test = [6, "Sunny"]   
result = predict(tree, x_test)
print("Prediction:", result)


Prediction: Yes


In [12]:
# 11 Extend prediction for multiple samples.
def predict_multiple(tree, X_test):
    predictions = []

    for x in X_test:
        result = predict(tree, x)
        predictions.append(result)

    return predictions

X_test = [
    [1, "Sunny"],
    [5, "Rain"],
    [10, "Sunny"],
    [14, "Overcast"]
]

results = predict_multiple(tree, X_test)

for i in range(len(X_test)):
    print(X_test[i], "->", results[i])


[1, 'Sunny'] -> No
[5, 'Rain'] -> Yes
[10, 'Sunny'] -> Yes
[14, 'Overcast'] -> No


In [13]:
# 12 Add support for categorical features (Yes/No/High/Low).
def predict(tree, x_test):

    # Leaf node
    if "label" in tree:
        return tree["label"]

    feature = tree["feature"]
    value = tree["value"]
    ftype = tree.get("type", "num")  # nodes without a "type" key are treated as numeric

    # Numeric feature
    if ftype == "num":
        if x_test[feature] <= value:
            return predict(tree["left"], x_test)
        else:
            return predict(tree["right"], x_test)

    # Categorical feature
    else:
        if x_test[feature] == value:
            return predict(tree["left"], x_test)
        else:
            return predict(tree["right"], x_test)


# Hand-built tree for illustration: the root tests Outlook == "Sunny"
tree = {
    "feature": 1,
    "value": "Sunny",
    "type": "cat",
    "left": {"label": "No"},
    "right": {"label": "Yes"}
}

x_test = [5, "Sunny"]
print(predict(tree, x_test))


No
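
The cell above handles categorical features at prediction time using a hand-built tree. To let tree building choose such a split as well, an equality-based split is one option. The sketch below computes the information gain of `Outlook == value` for the first 11 rows printed earlier in this notebook. The function names `split_categorical` and `information_gain_categorical` are assumptions for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_categorical(X, y, feature_index, value):
    """Left group: feature == value; right group: every other row."""
    left_y = [label for row, label in zip(X, y) if row[feature_index] == value]
    right_y = [label for row, label in zip(X, y) if row[feature_index] != value]
    return left_y, right_y

def information_gain_categorical(X, y, feature_index, value):
    """Parent entropy minus the size-weighted entropy of the two groups."""
    left_y, right_y = split_categorical(X, y, feature_index, value)
    n = len(y)
    weighted = len(left_y) / n * entropy(left_y) + len(right_y) / n * entropy(right_y)
    return entropy(y) - weighted

# First 11 rows of the dataset, as printed earlier in this notebook.
X_demo = [[1, "Sunny"], [2, "Sunny"], [3, "Overcast"], [4, "Rain"],
          [5, "Rain"], [6, "Rain"], [7, "Overcast"], [8, "Sunny"],
          [9, "Sunny"], [10, "Rain"], [11, "Sunny"]]
y_demo = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"]

for value in sorted(set(row[1] for row in X_demo)):
    gain = information_gain_categorical(X_demo, y_demo, 1, value)
    print(f"Outlook == {value!r} | Information Gain = {round(gain, 3)}")
```

A node produced this way would store `"type": "cat"`, matching the branch that `predict` already checks.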


In [14]:
# 13 Modify the tree to store class probability instead of only label.

import csv
import math

X = []
y = []

with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)

    for row in reader:
        day = int(row[0])          # numeric
        outlook = row[1]           # categorical (ignored for split)
        play = row[-1]             # Yes / No

        X.append([day, outlook])
        y.append(play)
        
# Entropy function
def entropy(y):
    total = len(y)
    counts = {}

    # count each class
    for label in y:
        if label in counts:
            counts[label] += 1
        else:
            counts[label] = 1

    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)

    return ent

# Class probability function
def class_probabilities(y):
    total = len(y)
    probs = {}

    for label in y:
        probs[label] = probs.get(label, 0) + 1

    for label in probs:
        probs[label] = probs[label] / total

    return probs

# Split dataset
def split_dataset(X, y, feature, value):
    left_X, left_y = [], []
    right_X, right_y = [], []

    for i in range(len(X)):
        if X[i][feature] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])

    return left_X, left_y, right_X, right_y

# Information Gain
def information_gain(X, y, feature, value):
    parent_entropy = entropy(y)

    left_X, left_y, right_X, right_y = split_dataset(X, y, feature, value)

    n = len(y)
    weighted_entropy = (
        (len(left_y)/n) * entropy(left_y) +
        (len(right_y)/n) * entropy(right_y)
    )

    return parent_entropy - weighted_entropy

# Best split function
def best_split(X, y):
    best_ig = -1
    best_feature = None
    best_value = None

    values = sorted(set(row[0] for row in X))  # Day only

    for value in values:
        ig = information_gain(X, y, 0, value)

        if ig > best_ig:
            best_ig = ig
            best_feature = 0
            best_value = value

    return best_feature, best_value, best_ig

# Build Tree (stores probabilities instead of label)
def build_tree(X, y, depth=0, max_depth=3, min_samples=2):

    # Pure node
    if y.count(y[0]) == len(y):
        return {"prob": class_probabilities(y)}

    # Max depth reached
    if depth == max_depth:
        return {"prob": class_probabilities(y)}

    # Minimum samples
    if len(y) <= min_samples:
        return {"prob": class_probabilities(y)}

    feature, value, ig = best_split(X, y)

    # No useful split
    if ig == 0:
        return {"prob": class_probabilities(y)}

    left_X, left_y, right_X, right_y = split_dataset(X, y, feature, value)

    return {
        "feature": feature,
        "value": value,
        "type": "num",
        "left": build_tree(left_X, left_y, depth+1, max_depth, min_samples),
        "right": build_tree(right_X, right_y, depth+1, max_depth, min_samples)
    }

# Predict function (returns probabilities)
def predict(tree, x_test):

    # Leaf node
    if "prob" in tree:
        return tree["prob"]

    feature = tree["feature"]
    value = tree["value"]

    if x_test[feature] <= value:
        return predict(tree["left"], x_test)
    else:
        return predict(tree["right"], x_test)

# Final label from probability
def predict_label(tree, x_test):
    probs = predict(tree, x_test)
    return max(probs, key=probs.get)

tree = build_tree(X, y)

x_test = [5, "Sunny"]
print(tree)
print("Probabilities:", predict(tree, x_test))
print("Final Prediction:", predict_label(tree, x_test))


{'feature': 0, 'value': 2, 'type': 'num', 'left': {'prob': {'No': 1.0}}, 'right': {'feature': 0, 'value': 13, 'type': 'num', 'left': {'feature': 0, 'value': 8, 'type': 'num', 'left': {'prob': {'Yes': 0.6666666666666666, 'No': 0.3333333333333333}}, 'right': {'prob': {'Yes': 1.0}}}, 'right': {'prob': {'No': 1.0}}}}
Probabilities: {'Yes': 0.6666666666666666, 'No': 0.3333333333333333}
Final Prediction: Yes


In [15]:
# 14 Add a function print_tree(tree) to display the structure clearly.
def print_tree(tree, depth=0):
    space = "  " * depth  # indentation

    if "prob" in tree:  # Leaf node
        print(space + str(tree["prob"]))
        return

    # Internal node
    print(space + f"Feature {tree['feature']} <= {tree['value']} ?")
    print(space + "Left ->")
    print_tree(tree["left"], depth + 1)
    print(space + "Right ->")
    print_tree(tree["right"], depth + 1)


print_tree(tree)



Feature 0 <= 2 ?
Left ->
  {'No': 1.0}
Right ->
  Feature 0 <= 13 ?
  Left ->
    Feature 0 <= 8 ?
    Left ->
      {'Yes': 0.6666666666666666, 'No': 0.3333333333333333}
    Right ->
      {'Yes': 1.0}
  Right ->
    {'No': 1.0}


In [16]:
# 15 Implement pruning rule:
#• stop splitting if gain < threshold
def build_tree(X, y, depth=0, max_depth=3, min_samples=2, gain_threshold=0.01):
    # 1. Pure node
    if y.count(y[0]) == len(y):
        return {"prob": class_probabilities(y)}

    # 2. Max depth reached
    if depth == max_depth:
        return {"prob": class_probabilities(y)}

    # 3. Minimum samples
    if len(y) <= min_samples:
        return {"prob": class_probabilities(y)}

    # 4. Find best split
    feature, value, ig = best_split(X, y)

    # 5. Pruning rule (pre-pruning / early stopping): stop if the best gain is below the threshold
    if ig < gain_threshold:
        return {"prob": class_probabilities(y)}

    # 6. Split dataset
    left_X, left_y, right_X, right_y = split_dataset(X, y, feature, value)

    # 7. Recursively build left and right subtrees
    return {
        "feature": feature,
        "value": value,
        "type": "num",
        "left": build_tree(left_X, left_y, depth+1, max_depth, min_samples, gain_threshold),
        "right": build_tree(right_X, right_y, depth+1, max_depth, min_samples, gain_threshold)
    }
tree = build_tree(X, y, max_depth=3, gain_threshold=0.05)
print_tree(tree)


Feature 0 <= 2 ?
Left ->
  {'No': 1.0}
Right ->
  Feature 0 <= 13 ?
  Left ->
    Feature 0 <= 8 ?
    Left ->
      {'Yes': 0.6666666666666666, 'No': 0.3333333333333333}
    Right ->
      {'Yes': 1.0}
  Right ->
    {'No': 1.0}


In [17]:
# 16 Compare performance:
#• train/test split
#• compute accuracy manually
#• compare with and without pruning
import csv
import random
import math

X = []
y = []

with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        day = int(row[0])  # numeric
        outlook = row[1]   # categorical (ignored for now)
        play = row[-1]
        X.append([day, outlook])
        y.append(play)


# Step 1: Train/Test Split

def train_test_split(X, y, test_size=0.3):
    combined = list(zip(X, y))
    random.shuffle(combined)  # unseeded: the split and the accuracies change on every run
    split_idx = int(len(X)*(1-test_size))
    X_train, y_train = zip(*combined[:split_idx])
    X_test, y_test = zip(*combined[split_idx:])
    return list(X_train), list(y_train), list(X_test), list(y_test)

X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3)


# Step 2: Helper Functions

# Entropy
def entropy(y):
    total = len(y)
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    ent = 0
    for label in counts:
        p = counts[label]/total
        ent -= p*math.log2(p)
    return ent

# Class probabilities
def class_probabilities(y):
    total = len(y)
    counts = {}
    for label in y:
        counts[label] = counts.get(label,0)+1
    for k in counts:
        counts[k] /= total
    return counts

# Split dataset
def split_dataset(X, y, feature, value):
    left_X, left_y, right_X, right_y = [], [], [], []
    for i in range(len(X)):
        if X[i][feature] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])
    return left_X, left_y, right_X, right_y

# Information gain
def information_gain(X, y, feature, value):
    parent_entropy = entropy(y)
    _, left_y, _, right_y = split_dataset(X, y, feature, value)  # only the label lists are needed
    n = len(y)
    weighted_entropy = (len(left_y)/n)*entropy(left_y) + (len(right_y)/n)*entropy(right_y)
    return parent_entropy - weighted_entropy

# Best split
def best_split(X, y):
    best_ig = -1
    best_feature, best_value = None, None
    for feature_index in range(len(X[0])):
        if feature_index != 0:  # only Day (index 0) is numeric; skip Outlook
            continue
        values = sorted(set(row[feature_index] for row in X))
        for value in values:
            ig = information_gain(X, y, feature_index, value)
            if ig > best_ig:
                best_ig = ig
                best_feature = feature_index
                best_value = value
    return best_feature, best_value, best_ig


# Step 3: Build Tree

def build_tree(X, y, depth=0, max_depth=3, min_samples=2, gain_threshold=0.01):
    # Pure node
    if y.count(y[0]) == len(y):
        return {"prob": class_probabilities(y)}
    # Max depth
    if depth == max_depth:
        return {"prob": class_probabilities(y)}
    # Min samples
    if len(y) <= min_samples:
        return {"prob": class_probabilities(y)}
    
    # Best split
    feature, value, ig = best_split(X, y)
    # Pruning
    if ig < gain_threshold:
        return {"prob": class_probabilities(y)}
    
    left_X, left_y, right_X, right_y = split_dataset(X, y, feature, value)
    return {
        "feature": feature,
        "value": value,
        "type": "num",
        "left": build_tree(left_X, left_y, depth+1, max_depth, min_samples, gain_threshold),
        "right": build_tree(right_X, right_y, depth+1, max_depth, min_samples, gain_threshold)
    }


# Step 4: Prediction

def predict_label(tree, x):
    if "prob" in tree:
        return max(tree["prob"], key=tree["prob"].get)
    if x[tree["feature"]] <= tree["value"]:
        return predict_label(tree["left"], x)
    else:
        return predict_label(tree["right"], x)

def predict(tree, X):
    return [predict_label(tree, x) for x in X]


# Step 5: Accuracy

def accuracy(y_true, y_pred):
    correct = 0
    for i in range(len(y_true)):
        if y_true[i] == y_pred[i]:
            correct += 1
    return correct / len(y_true)


# Step 6: Train Trees

# Tree without pruning
tree_no_prune = build_tree(X_train, y_train, gain_threshold=0.0)
y_pred_no_prune = predict(tree_no_prune, X_test)
acc_no_prune = accuracy(y_test, y_pred_no_prune)

# Tree with pruning
tree_prune = build_tree(X_train, y_train, gain_threshold=0.05)
y_pred_prune = predict(tree_prune, X_test)
acc_prune = accuracy(y_test, y_pred_prune)


# Step 7: Print Results

print("Accuracy without pruning:", round(acc_no_prune,3))
print("Accuracy with pruning:", round(acc_prune,3))

# Optional: print trees
def print_tree(tree, depth=0):
    space = "  " * depth
    if "prob" in tree:
        print(space + str(tree["prob"]))
        return
    print(space + f"Feature {tree['feature']} <= {tree['value']} ?")
    print(space + "Left ->")
    print_tree(tree["left"], depth + 1)
    print(space + "Right ->")
    print_tree(tree["right"], depth + 1)

print("\nTree without pruning:")
print_tree(tree_no_prune)

print("\nTree with pruning:")
print_tree(tree_prune)


Accuracy without pruning: 0.2
Accuracy with pruning: 0.2

Tree without pruning:
Feature 0 <= 2 ?
Left ->
  {'No': 1.0}
Right ->
  Feature 0 <= 11 ?
  Left ->
    {'Yes': 1.0}
  Right ->
    {'No': 1.0}

Tree with pruning:
Feature 0 <= 2 ?
Left ->
  {'No': 1.0}
Right ->
  Feature 0 <= 11 ?
  Left ->
    {'Yes': 1.0}
  Right ->
    {'No': 1.0}
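
The accuracies above depend on an unseeded shuffle, so they change between runs. Assuming the file holds the 14 rows seen in the earlier output, `test_size=0.3` leaves only 5 test rows, so each prediction moves accuracy by 0.2. One way to make the comparison repeatable is a seeded split, sketched below. The name `train_test_split_seeded`, the `seed` parameter and the stand-in data are assumptions for this illustration.

```python
import random

def train_test_split_seeded(X, y, test_size=0.3, seed=0):
    """Shuffle with a private, seeded generator so the split is repeatable."""
    rng = random.Random(seed)  # does not touch the global random state
    indices = list(range(len(X)))
    rng.shuffle(indices)
    split_idx = int(len(X) * (1 - test_size))
    train_idx, test_idx = indices[:split_idx], indices[split_idx:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

# Stand-in data for the sketch: 14 rows with made-up labels.
X_demo = [[day] for day in range(1, 15)]
y_demo = ["Yes" if row[0] % 2 else "No" for row in X_demo]

first = train_test_split_seeded(X_demo, y_demo, seed=42)
second = train_test_split_seeded(X_demo, y_demo, seed=42)
print("Same split both times:", first == second)
print("Train/test sizes:", len(first[0]), len(first[2]))
```

Repeating the comparison over several seeds and averaging the accuracies would give a steadier picture than a single split on so few rows.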


In [18]:
# 17 Add max_leaf_nodes stopping condition.

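# Recursive tree building with max_leaf_nodes
def build_tree(X, y, depth=0, max_depth=3, min_samples=2, gain_threshold=0.01, max_leaf_nodes=None, leaf_counter=None):
    # A list default ([0]) would be shared between calls, so create the counter here
    if leaf_counter is None:
        leaf_counter = [0]

    # Pure node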
    if y.count(y[0]) == len(y):
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Max depth
    if depth == max_depth:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Min samples
    if len(y) <= min_samples:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
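    # Max leaf nodes reached. This is a soft cap: nodes still waiting in the
    # recursion also become leaves, so the final count can exceed max_leaf_nodes.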
    if max_leaf_nodes is not None and leaf_counter[0] >= max_leaf_nodes:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Best split
    feature, value, ig = best_split(X, y)
    # Pruning based on info gain
    if ig < gain_threshold:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Split dataset
    left_X, left_y, right_X, right_y = split_dataset(X, y, feature, value)
    
    # Recursively build left and right subtrees
    return {
        "feature": feature,
        "value": value,
        "type": "num",
        "left": build_tree(left_X, left_y, depth+1, max_depth, min_samples, gain_threshold, max_leaf_nodes, leaf_counter),
        "right": build_tree(right_X, right_y, depth+1, max_depth, min_samples, gain_threshold, max_leaf_nodes, leaf_counter)
    }
    
# Reset leaf counter
leaf_counter = [0]

tree_limited_leaves = build_tree(
    X_train, y_train,
    max_depth=5,
    gain_threshold=0.01,
    max_leaf_nodes=3,  # Limit to 3 leaf nodes
    leaf_counter=leaf_counter
)

print_tree(tree_limited_leaves)


Feature 0 <= 2 ?
Left ->
  {'No': 1.0}
Right ->
  Feature 0 <= 11 ?
  Left ->
    {'Yes': 1.0}
  Right ->
    {'No': 1.0}
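
To check whether the `max_leaf_nodes` limit was respected, the leaves of a built tree can be counted. The following is a minimal sketch: `count_leaves` is a hypothetical helper (it is not defined elsewhere in this notebook) and it assumes the same nested-dict layout, where a leaf is any node with a `"prob"` key. The example tree is a hand-written copy of the printed output above.

```python
def count_leaves(tree):
    """Count leaf nodes in a nested-dict tree (leaves carry a "prob" key)."""
    if "prob" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


# Hand-written copy of the tree printed above
example_tree = {
    "feature": 0, "value": 2, "type": "num",
    "left": {"prob": {"No": 1.0}},
    "right": {
        "feature": 0, "value": 11, "type": "num",
        "left": {"prob": {"Yes": 1.0}},
        "right": {"prob": {"No": 1.0}},
    },
}

print("Leaves:", count_leaves(example_tree))  # Leaves: 3
```

Here the count equals the limit of 3, so the stopping condition was never the deciding factor for this small dataset.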


In [19]:
# 18 Add max_depth stopping condition.

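# Recursive tree building with a max_depth stopping condition (depth >= max_depth)
def build_tree(X, y, depth=0, max_depth=3, min_samples=2, gain_threshold=0.01, max_leaf_nodes=None, leaf_counter=None):
    # Fresh counter per top-level call (a mutable default would persist across calls)
    if leaf_counter is None:
        leaf_counter = [0]

    # Pure node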
    if y.count(y[0]) == len(y):
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Max depth
    if depth >= max_depth:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Min samples
    if len(y) <= min_samples:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Max leaf nodes reached
    if max_leaf_nodes is not None and leaf_counter[0] >= max_leaf_nodes:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Best split
    feature, value, ig = best_split(X, y)
    # Pruning based on info gain
    if ig < gain_threshold:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Split dataset
    left_X, left_y, right_X, right_y = split_dataset(X, y, feature, value)
    
    # Recursively build left and right subtrees
    return {
        "feature": feature,
        "value": value,
        "type": "num",
        "left": build_tree(left_X, left_y, depth+1, max_depth, min_samples, gain_threshold, max_leaf_nodes, leaf_counter),
        "right": build_tree(right_X, right_y, depth+1, max_depth, min_samples, gain_threshold, max_leaf_nodes, leaf_counter)
    }
    

leaf_counter = [0]
tree = build_tree(X_train, y_train, max_depth=2, gain_threshold=0.01, max_leaf_nodes=3, leaf_counter=leaf_counter)

print_tree(tree)


Feature 0 <= 2 ?
Left ->
  {'No': 1.0}
Right ->
  Feature 0 <= 11 ?
  Left ->
    {'Yes': 1.0}
  Right ->
    {'No': 1.0}
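
A quick way to confirm that `max_depth` is honoured is to measure the depth of the resulting tree. This is a sketch under the same assumptions as the rest of the notebook (nested dicts, leaves marked by `"prob"`); `tree_depth` is a hypothetical helper name, and the example tree is copied by hand from the output above.

```python
def tree_depth(tree):
    """Number of splits on the longest root-to-leaf path (a lone leaf has depth 0)."""
    if "prob" in tree:
        return 0
    return 1 + max(tree_depth(tree["left"]), tree_depth(tree["right"]))


# Hand-written copy of the tree printed above (built with max_depth=2)
example_tree = {
    "feature": 0, "value": 2, "type": "num",
    "left": {"prob": {"No": 1.0}},
    "right": {
        "feature": 0, "value": 11, "type": "num",
        "left": {"prob": {"Yes": 1.0}},
        "right": {"prob": {"No": 1.0}},
    },
}

print("Depth:", tree_depth(example_tree))  # Depth: 2
```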


In [20]:
# 19 Compute Gini Index and compare with Entropy splitting.
import csv

# Lists to store features and labels
X = []
y = []

# Open the CSV file (newline="" is what the csv module recommends)
with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # skip header

    # Take the first 11 rows (day, outlook, play)
    count = 0
    for row in reader:
        if count >= 11:
            break

        # row format: day, outlook, play (the label is read from the last column)
        day = int(row[0])
        outlook = row[1]
        play = row[-1]

        X.append([day, outlook])
        y.append(play)

        count += 1

def gini_index(y):
    total = len(y)
    counts = {}

    # count each class
    for label in y:
        if label in counts:
            counts[label] += 1
        else:
            counts[label] = 1

    gini = 1
    for label in counts:
        p = counts[label] / total
        gini -= p**2  # square of probability

    return gini

def gini_split(X, y, feature, value):
    left_X, left_y, right_X, right_y = split_dataset(X, y, feature, value)
    n = len(y)
    weighted_gini = (len(left_y)/n)*gini_index(left_y) + (len(right_y)/n)*gini_index(right_y)
    return weighted_gini
    
def best_split_gini(X, y):
    best_gini = float('inf')
    best_feature, best_value = None, None
    for feature_index in range(len(X[0])):
        # Only feature 0 (day) is numeric, so skip the categorical column
        if feature_index != 0:
            continue
        values = sorted(set(row[feature_index] for row in X))
        for value in values:
            gini = gini_split(X, y, feature_index, value)
            if gini < best_gini:
                best_gini = gini
                best_feature = feature_index
                best_value = value
    return best_feature, best_value, best_gini

def best_split_entropy(X, y):
    best_ig = -1
    best_feature, best_value = None, None
    for feature_index in range(len(X[0])):
        # Only feature 0 (day) is numeric, so skip the categorical column
        if feature_index != 0:
            continue
        values = sorted(set(row[feature_index] for row in X))
        for value in values:
            ig = information_gain(X, y, feature_index, value)
            if ig > best_ig:
                best_ig = ig
                best_feature = feature_index
                best_value = value
    return best_feature, best_value, best_ig



# Best split by Entropy
f_e, v_e, ig = best_split_entropy(X, y)
print("Best split by Entropy: Feature", f_e, "<=", v_e, "| Info Gain:", round(ig,3))

# Best split by Gini
f_g, v_g, g = best_split_gini(X, y)
print("Best split by Gini: Feature", f_g, "<=", v_g, "| Gini:", round(g,3))


Best split by Entropy: Feature 0 <= 2 | Info Gain: 0.32
Best split by Gini: Feature 0 <= 2 | Gini: 0.283
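
The two criteria agree on the split here, which is common because both measure the same thing (how mixed the labels are) on slightly different scales. The sketch below compares them on a few label lists. `entropy_of` and `gini_of` are hypothetical stand-alone helpers written for this illustration, so the block runs without the earlier cells.

```python
import math
from collections import Counter


def entropy_of(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    ent = 0.0
    for count in Counter(labels).values():
        p = count / n
        ent -= p * math.log2(p)
    return ent


def gini_of(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


examples = [
    ("pure", ["Yes"] * 4),
    ("50/50", ["Yes", "No"] * 2),
    ("9 Yes / 5 No", ["Yes"] * 9 + ["No"] * 5),
]
for name, labels in examples:
    print(name, "| entropy:", round(entropy_of(labels), 3), "| gini:", round(gini_of(labels), 3))
```

For two classes, entropy ranges from 0 to 1 and Gini from 0 to 0.5. Both are 0 for a pure node and reach their maximum at a 50/50 mix.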


In [22]:
# 20 Allow user to choose between Gini and Entropy.
import math


# Entropy

def entropy(y):
    total = len(y)
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)
    return ent


# Gini Index

def gini_index(y):
    total = len(y)
    counts = {}
    for label in y:
        counts[label] = counts.get(label,0)+1
    gini = 1
    for label in counts:
        p = counts[label]/total
        gini -= p**2
    return gini


# Class probabilities

def class_probabilities(y):
    total = len(y)
    counts = {}
    for label in y:
        counts[label] = counts.get(label,0)+1
    for k in counts:
        counts[k] /= total
    return counts


# Split dataset

def split_dataset(X, y, feature, value):
    left_X, left_y, right_X, right_y = [], [], [], []
    for i in range(len(X)):
        if X[i][feature] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])
    return left_X, left_y, right_X, right_y


# Best split (user chooses criterion)

def best_split(X, y, criterion="entropy"):
    if criterion not in ("entropy", "gini"):
        raise ValueError(f"Unknown criterion: {criterion!r} (use 'entropy' or 'gini')")

    # Entropy maximizes information gain; Gini minimizes weighted impurity
    best_score = -1 if criterion == "entropy" else float("inf")
    best_feature, best_value = None, None
    n = len(y)
    parent_entropy = entropy(y) if criterion == "entropy" else None

    for feature_index in range(len(X[0])):
        # Only feature 0 (day) is numeric, so skip the other columns
        if feature_index != 0:
            continue
        values = sorted(set(row[feature_index] for row in X))
        for value in values:
            left_X, left_y, right_X, right_y = split_dataset(X, y, feature_index, value)
            if criterion == "entropy":
                # Information Gain = parent entropy - weighted child entropy
                weighted_entropy = (len(left_y)/n)*entropy(left_y) + (len(right_y)/n)*entropy(right_y)
                score = parent_entropy - weighted_entropy
                if score > best_score:
                    best_score = score
                    best_feature = feature_index
                    best_value = value
            else:
                # Weighted Gini of the two children (lower is better)
                weighted_gini = (len(left_y)/n)*gini_index(left_y) + (len(right_y)/n)*gini_index(right_y)
                score = weighted_gini
                if score < best_score:
                    best_score = score
                    best_feature = feature_index
                    best_value = value
    return best_feature, best_value, best_score


# Build tree

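def build_tree(X, y, depth=0, max_depth=3, min_samples=2, gain_threshold=0.01,
               max_leaf_nodes=None, leaf_counter=None, criterion="entropy"):
    # Fresh counter per top-level call (avoids a shared mutable default)
    if leaf_counter is None:
        leaf_counter = [0]
    # Pure node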
    if y.count(y[0]) == len(y):
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    # Max depth
    if depth >= max_depth:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    # Min samples
    if len(y) <= min_samples:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    # Max leaf nodes
    if max_leaf_nodes is not None and leaf_counter[0] >= max_leaf_nodes:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
    # Best split
    feature, value, score = best_split(X, y, criterion)
    
    # Pruning: stop if gain too small (only for entropy)
    if criterion=="entropy" and score < gain_threshold:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}
    
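    # Split dataset
    left_X, left_y, right_X, right_y = split_dataset(X, y, feature, value)

    # Guard: a split that leaves one side empty cannot improve the node and
    # would crash the recursive call (possible with Gini, which has no gain threshold)
    if not left_y or not right_y:
        leaf_counter[0] += 1
        return {"prob": class_probabilities(y)}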
    
    return {
        "feature": feature,
        "value": value,
        "type": "num",
        "left": build_tree(left_X, left_y, depth+1, max_depth, min_samples, gain_threshold, max_leaf_nodes, leaf_counter, criterion),
        "right": build_tree(right_X, right_y, depth+1, max_depth, min_samples, gain_threshold, max_leaf_nodes, leaf_counter, criterion)
    }


# Predict

def predict_label(tree, x):
    if "prob" in tree:
        return max(tree["prob"], key=tree["prob"].get)
    if x[tree["feature"]] <= tree["value"]:
        return predict_label(tree["left"], x)
    else:
        return predict_label(tree["right"], x)

def predict(tree, X):
    return [predict_label(tree, x) for x in X]


# Print tree

def print_tree(tree, depth=0):
    space = "  " * depth
    if "prob" in tree:
        print(space + str(tree["prob"]))
        return
    print(space + f"Feature {tree['feature']} <= {tree['value']} ?")
    print(space + "Left ->")
    print_tree(tree["left"], depth + 1)
    print(space + "Right ->")
    print_tree(tree["right"], depth + 1)

# Build tree using Entropy
leaf_counter = [0]
tree_entropy = build_tree(X, y, max_depth=3, criterion="entropy", leaf_counter=leaf_counter)
print("Tree using Entropy:")
print_tree(tree_entropy)

# Build tree using Gini
leaf_counter = [0]
tree_gini = build_tree(X, y, max_depth=3, criterion="gini", leaf_counter=leaf_counter)
print("\nTree using Gini:")
print_tree(tree_gini)

Tree using Entropy:
Feature 0 <= 2 ?
Left ->
  {'No': 1.0}
Right ->
  Feature 0 <= 5 ?
  Left ->
    {'Yes': 1.0}
  Right ->
    Feature 0 <= 8 ?
    Left ->
      {'No': 0.6666666666666666, 'Yes': 0.3333333333333333}
    Right ->
      {'Yes': 1.0}

Tree using Gini:
Feature 0 <= 2 ?
Left ->
  {'No': 1.0}
Right ->
  Feature 0 <= 5 ?
  Left ->
    {'Yes': 1.0}
  Right ->
    Feature 0 <= 8 ?
    Left ->
      {'No': 0.6666666666666666, 'Yes': 0.3333333333333333}
    Right ->
      {'Yes': 1.0}
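
In the cell above the `gain_threshold` check is applied only when `criterion="entropy"`, because the Gini branch returns a weighted impurity rather than a gain. One possible way to put both criteria on the same footing is to compare the impurity decrease (parent Gini minus weighted child Gini). The sketch below illustrates that idea with two hypothetical helpers, `gini_of` and `gini_gain`. It is not how the notebook's `best_split` currently works.

```python
from collections import Counter


def gini_of(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(parent_y, left_y, right_y):
    """Impurity decrease: parent Gini minus the size-weighted Gini of the children."""
    n = len(parent_y)
    weighted = (len(left_y) / n) * gini_of(left_y) + (len(right_y) / n) * gini_of(right_y)
    return gini_of(parent_y) - weighted


parent = ["No", "No", "Yes", "Yes"]
print(gini_gain(parent, ["No", "No"], ["Yes", "Yes"]))  # 0.5 (perfect split)
print(gini_gain(parent, ["No", "Yes"], ["No", "Yes"]))  # 0.0 (nothing gained)
```

With a gain defined this way, a check such as `gain < gain_threshold` could be applied to either criterion, although a threshold tuned for entropy would likely need rescaling because Gini values are smaller.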


In [25]:
# 21 Handle missing values when splitting.
def split_dataset(X, y, feature, value):
    left_X, left_y = [], []
    right_X, right_y = [], []
    missing_X, missing_y = [], []

    for i in range(len(X)):
        if X[i][feature] is None or X[i][feature] == "":
            missing_X.append(X[i])
            missing_y.append(y[i])
        elif X[i][feature] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])

    # Send missing values to the larger side (ties go to the left)
    if len(left_y) >= len(right_y):
        left_X.extend(missing_X)
        left_y.extend(missing_y)
    else:
        right_X.extend(missing_X)
        right_y.extend(missing_y)

    return left_X, left_y, right_X, right_y
    
X = [
    [1],
    [2],
    [None],   # missing value
    [4],
    [6]
]

y = ["No", "No", "Yes", "Yes", "Yes"]


left_X, left_y, right_X, right_y = split_dataset(X, y, feature=0, value=3)

print("Left X :", left_X)
print("Left y :", left_y)
print("Right X:", right_X)
print("Right y:", right_y)


Left X : [[1], [2], [None]]
Left y : ['No', 'No', 'Yes']
Right X: [[4], [6]]
Right y: ['Yes', 'Yes']
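
The split above handles missing values during training, but the prediction functions in this notebook would still fail on `None <= value`. A minimal sketch of one way to handle that at prediction time follows. It assumes each internal node stores a hypothetical `"missing_to"` key naming the branch that received the missing rows during training. That key is an assumption made for this example and is not produced by `build_tree`.

```python
def predict_with_missing(tree, x):
    """Predict one sample; missing feature values follow the node's "missing_to" branch."""
    while "prob" not in tree:
        v = x[tree["feature"]]
        if v is None or v == "":
            # Hypothetical key; falls back to the left branch if absent
            branch = tree.get("missing_to", "left")
        else:
            branch = "left" if v <= tree["value"] else "right"
        tree = tree[branch]
    return max(tree["prob"], key=tree["prob"].get)


# Hand-written tree matching the split shown above (feature 0 <= 3)
example_tree = {
    "feature": 0, "value": 3, "missing_to": "left",
    "left": {"prob": {"No": 2 / 3, "Yes": 1 / 3}},
    "right": {"prob": {"Yes": 1.0}},
}

print(predict_with_missing(example_tree, [None]))  # No
print(predict_with_missing(example_tree, [4]))     # Yes
```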


In [27]:
# 22 Implement random feature selection (Decision Tree Lite → Random Forest concept intro).
import csv
import math
import random


X = []
y = []

with open("C:/Users/shres/OneDrive/Documents/play_tennis.csv", "r") as f:
    reader = csv.reader(f)
    header = next(reader)

    for row in reader:
        day = int(row[0])        # numeric
        outlook = row[1]         # categorical
        play = row[-1]           # Yes / No

        X.append([day, outlook])
        y.append(play)



# Entropy

def entropy(y):
    total = len(y)
    counts = {}

    for label in y:
        counts[label] = counts.get(label, 0) + 1

    ent = 0
    for label in counts:
        p = counts[label] / total
        ent -= p * math.log2(p)

    return ent



# Gini Index

def gini(y):
    total = len(y)
    counts = {}

    for label in y:
        counts[label] = counts.get(label, 0) + 1

    g = 1
    for label in counts:
        p = counts[label] / total
        g -= p * p

    return g



# Split Dataset
# (label-only version: it replaces the earlier split_dataset that returned four lists)

def split_dataset(X, y, feature, value):
    left_y, right_y = [], []

    for i in range(len(X)):
        if X[i][feature] <= value:
            left_y.append(y[i])
        else:
            right_y.append(y[i])

    return left_y, right_y



# Information Gain
# (impurity decrease: entropy-based gain, or Gini gain for any other criterion)

def information_gain(X, y, feature, value, criterion="entropy"):
    if criterion == "entropy":
        parent = entropy(y)
        measure = entropy
    else:
        parent = gini(y)
        measure = gini

    left_y, right_y = split_dataset(X, y, feature, value)

    n = len(y)
    weighted = (len(left_y)/n)*measure(left_y) + (len(right_y)/n)*measure(right_y)

    return parent - weighted



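# Random Feature Selection
# Pick k = floor(sqrt(num_features)) features at random (at least 1),
# the usual Random Forest rule of thumb for classification.

def random_features(num_features):
    k = max(1, int(math.sqrt(num_features)))
    return random.sample(range(num_features), k)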



# Best Split (Random Feature Selection)

def best_split(X, y, criterion="entropy"):
    best_ig = -1
    best_feature = None
    best_value = None

    num_features = len(X[0])
    features_to_try = random_features(num_features)

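    for feature in features_to_try:
        # Note: for the categorical feature (outlook) the <= test compares
        # strings alphabetically, which is only a rough stand-in for a real
        # categorical split.
        values = sorted(set(row[feature] for row in X))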

        for value in values:
            ig = information_gain(X, y, feature, value, criterion)

            if ig > best_ig:
                best_ig = ig
                best_feature = feature
                best_value = value

    return best_feature, best_value, best_ig



# Class Probability

def class_probability(y):
    probs = {}
    total = len(y)

    for label in y:
        probs[label] = probs.get(label, 0) + 1

    for label in probs:
        probs[label] /= total

    return probs



# Build Tree (Recursive)

def build_tree(X, y, depth=0, max_depth=3, min_samples=2, criterion="entropy"):
    # stopping conditions
    if len(set(y)) == 1:
        return {"prob": class_probability(y)}

    if depth >= max_depth or len(y) < min_samples:
        return {"prob": class_probability(y)}

    feature, value, ig = best_split(X, y, criterion)

    if ig <= 0:
        return {"prob": class_probability(y)}

    left_X, left_y = [], []
    right_X, right_y = [], []

    for i in range(len(X)):
        if X[i][feature] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])

    return {
        "feature": feature,
        "value": value,
        "left": build_tree(left_X, left_y, depth+1, max_depth, min_samples, criterion),
        "right": build_tree(right_X, right_y, depth+1, max_depth, min_samples, criterion)
    }



# Predict One Sample

def predict(tree, x):
    if "prob" in tree:
        return max(tree["prob"], key=tree["prob"].get)

    if x[tree["feature"]] <= tree["value"]:
        return predict(tree["left"], x)
    else:
        return predict(tree["right"], x)



# Predict Multiple Samples

def predict_all(tree, X_test):
    return [predict(tree, x) for x in X_test]



# Print Tree

def print_tree(tree, depth=0):
    space = "  " * depth

    if "prob" in tree:
        print(space + "Leaf:", tree["prob"])
        return

    print(space + f"Feature {tree['feature']} <= {tree['value']} ?")
    print(space + "Left:")
    print_tree(tree["left"], depth+1)
    print(space + "Right:")
    print_tree(tree["right"], depth+1)


# Build & Test

tree = build_tree(X, y, criterion="entropy")

print("\nDECISION TREE STRUCTURE:\n")
print_tree(tree)

print("\nPREDICTIONS:")
preds = predict_all(tree, X)
for i in range(len(preds)):
    print(X[i], "=>", preds[i])



DECISION TREE STRUCTURE:

Feature 0 <= 2 ?
Left:
  Leaf: {'No': 1.0}
Right:
  Feature 1 <= Overcast ?
  Left:
    Leaf: {'Yes': 1.0}
  Right:
    Feature 1 <= Rain ?
    Left:
      Leaf: {'Yes': 0.6, 'No': 0.4}
    Right:
      Leaf: {'No': 0.3333333333333333, 'Yes': 0.6666666666666666}

PREDICTIONS:
[1, 'Sunny'] => No
[2, 'Sunny'] => No
[3, 'Overcast'] => Yes
[4, 'Rain'] => Yes
[5, 'Rain'] => Yes
[6, 'Rain'] => Yes
[7, 'Overcast'] => Yes
[8, 'Sunny'] => Yes
[9, 'Sunny'] => Yes
[10, 'Rain'] => Yes
[11, 'Sunny'] => Yes
[12, 'Overcast'] => Yes
[13, 'Overcast'] => Yes
[14, 'Rain'] => Yes
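
Random feature selection is one half of the Random Forest idea. The other half is training several trees on bootstrap samples (rows drawn with replacement) and letting them vote. The sketch below shows only those two pieces. To stay self-contained it uses three hand-written stand-in trees instead of calling `build_tree`, and the helper names (`predict_one`, `bootstrap_sample`, `forest_predict`) are hypothetical.

```python
import random
from collections import Counter


def predict_one(tree, x):
    """Walk a nested-dict tree down to a leaf and return the most probable class."""
    while "prob" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["value"] else tree["right"]
    return max(tree["prob"], key=tree["prob"].get)


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement (some rows repeat, some are left out)."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]


def forest_predict(trees, x):
    """Majority vote over the predictions of all trees."""
    votes = Counter(predict_one(t, x) for t in trees)
    return votes.most_common(1)[0][0]


# Three stand-in trees (in a real forest each would come from a bootstrap sample)
trees = [
    {"feature": 0, "value": 2, "left": {"prob": {"No": 1.0}}, "right": {"prob": {"Yes": 1.0}}},
    {"prob": {"Yes": 1.0}},
    {"feature": 0, "value": 5, "left": {"prob": {"No": 1.0}}, "right": {"prob": {"Yes": 1.0}}},
]

print(forest_predict(trees, [1, "Sunny"]))  # No  (votes: No, Yes, No)
print(forest_predict(trees, [4, "Rain"]))   # Yes (votes: Yes, Yes, No)

Xb, yb = bootstrap_sample([[1], [2], [3], [4]], ["No", "No", "Yes", "Yes"], random.Random(0))
print(len(Xb), len(yb))  # 4 4
```

Using an odd number of trees avoids ties for a binary label. With an even number, `most_common` simply returns the class that was counted first.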


In [28]:
# 23 Evaluate model stability by training multiple times.
# ------------------------------------
# Accuracy function
# ------------------------------------
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# ------------------------------------
# Model Stability Evaluation
# ------------------------------------
def evaluate_stability(X, y, runs=5):
    accuracies = []

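    for i in range(runs):
        # Each run picks random features, so the tree (and its accuracy) can change.
        # Accuracy is measured on the training data itself, not on a held-out set.
        tree = build_tree(X, y, criterion="entropy")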
        predictions = predict_all(tree, X)
        acc = accuracy(y, predictions)
        accuracies.append(acc)

        print(f"Run {i+1} Accuracy: {round(acc, 3)}")

    avg_acc = sum(accuracies) / len(accuracies)

    print("\nAverage Accuracy:", round(avg_acc, 3))
    print("All Accuracies:", accuracies)



# Run Stability Test

evaluate_stability(X, y, runs=5)


Run 1 Accuracy: 0.786
Run 2 Accuracy: 0.857
Run 3 Accuracy: 0.714
Run 4 Accuracy: 0.929
Run 5 Accuracy: 0.857

Average Accuracy: 0.829
All Accuracies: [0.7857142857142857, 0.8571428571428571, 0.7142857142857143, 0.9285714285714286, 0.8571428571428571]
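
The average alone does not say how much the runs disagree. A simple way to quantify stability is to report the spread as well. The sketch below reuses the five accuracies printed above and relies only on Python's standard `statistics` module. Treating the sample standard deviation as the stability measure is a choice made for this example.

```python
import statistics

# Accuracies copied from the run above
accuracies = [0.7857142857142857, 0.8571428571428571, 0.7142857142857143,
              0.9285714285714286, 0.8571428571428571]

mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)        # sample standard deviation
spread = max(accuracies) - min(accuracies)    # best run minus worst run

print("Mean accuracy:", round(mean_acc, 3))   # 0.829
print("Std deviation:", round(std_acc, 3))    # 0.081
print("Range:", round(spread, 3))             # 0.214
```

A standard deviation of about 0.08 on 14 rows means the accuracy moves by roughly one sample per run, which is what random feature selection on such a small dataset would be expected to produce.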


In [29]:
# 24 Create visualization of decision path for one sample.
def show_decision_path(tree, x, depth=0):
    space = "  " * depth

    # If leaf node
    if "prob" in tree:
        print(space + "Reached leaf")
        print(space + "Class probabilities:", tree["prob"])
        prediction = max(tree["prob"], key=tree["prob"].get)
        print(space + "Final Prediction:", prediction)
        return

    feature = tree["feature"]
    value = tree["value"]

    print(space + f"Check: Feature {feature} <= {value} ?")

    if x[feature] <= value:
        print(space + f"Yes → go LEFT (value = {x[feature]})")
        show_decision_path(tree["left"], x, depth + 1)
    else:
        print(space + f"No → go RIGHT (value = {x[feature]})")
        show_decision_path(tree["right"], x, depth + 1)
        
sample = X[0]   # take first sample
print("Sample:", sample)
print("\nDecision Path:\n")
show_decision_path(tree, sample)


Sample: [1, 'Sunny']

Decision Path:

Check: Feature 0 <= 2 ?
Yes → go LEFT (value = 1)
  Reached leaf
  Class probabilities: {'No': 1.0}
  Final Prediction: No
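
Printing the path is handy for reading, but returning it as data makes it easier to test or to plot later. The following sketch does that. `decision_path` is a hypothetical helper, and the example tree is hand-written in the same nested-dict layout rather than taken from `build_tree`.

```python
def decision_path(tree, x):
    """Return ([(feature, threshold, direction), ...], predicted_label) for one sample."""
    steps = []
    while "prob" not in tree:
        feature, value = tree["feature"], tree["value"]
        direction = "left" if x[feature] <= value else "right"
        steps.append((feature, value, direction))
        tree = tree[direction]
    prediction = max(tree["prob"], key=tree["prob"].get)
    return steps, prediction


# Hand-written example tree
example_tree = {
    "feature": 0, "value": 2,
    "left": {"prob": {"No": 1.0}},
    "right": {
        "feature": 0, "value": 5,
        "left": {"prob": {"Yes": 1.0}},
        "right": {"prob": {"No": 0.4, "Yes": 0.6}},
    },
}

print(decision_path(example_tree, [1, "Sunny"]))  # ([(0, 2, 'left')], 'No')
print(decision_path(example_tree, [4, "Rain"]))   # ([(0, 2, 'right'), (0, 5, 'left')], 'Yes')
```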


In [30]:
# 25 Save the tree structure to JSON and reload it for prediction.

import json


# Save tree to JSON

def save_tree(tree, filename="tree.json"):
    with open(filename, "w") as f:
        json.dump(tree, f)
    print(f"Tree saved to {filename}")



# Load tree from JSON

def load_tree(filename="tree.json"):
    with open(filename, "r") as f:
        tree = json.load(f)
    print(f"Tree loaded from {filename}")
    return tree
    
# Build the tree (random feature selection means it can differ from run to run)
tree = build_tree(X, y, criterion="entropy")

# Save it
save_tree(tree, "play_tennis_tree.json")

# Load it later
loaded_tree = load_tree("play_tennis_tree.json")

# Predict using the loaded tree
sample = X[0]
prediction = predict(loaded_tree, sample)
print("\nSample:", sample)
print("Prediction from loaded tree:", prediction)


Tree saved to play_tennis_tree.json
Tree loaded from play_tennis_tree.json

Sample: [1, 'Sunny']
Prediction from loaded tree: No
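
Saving works here because the tree only contains dicts, strings, and numbers. A quick round-trip check confirms that nothing is lost. The sketch below uses `json.dumps` and `json.loads` in memory (no file needed) on a hand-written tree, and also shows one caveat worth knowing: JSON object keys are always strings, so integer class labels used as dict keys would come back as strings.

```python
import json

# Hand-written tree in the same nested-dict layout
tree = {
    "feature": 0, "value": 2,
    "left": {"prob": {"No": 1.0}},
    "right": {"prob": {"Yes": 0.75, "No": 0.25}},
}

restored = json.loads(json.dumps(tree))
print(restored == tree)  # True

# Caveat: non-string keys are converted to strings by JSON
int_keyed = {1: 0.5, 0: 0.5}
print(json.loads(json.dumps(int_keyed)))  # {'1': 0.5, '0': 0.5}
```

With `"Yes"` / `"No"` labels, as in this notebook, the loaded tree predicts exactly like the original.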
