# Q1: Information Gain for Splitting on CreditScore at 650

In this notebook, we calculate the information gain obtained by splitting the training dataset on the feature `CreditScore` at the threshold of 650. The dataset contains 8 records with two risk classes: **Low** and **High** (4 records each).

In [None]:
import numpy as np
import pandas as pd

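def entropy(labels):
    """Compute the Shannon entropy (in bits) of a sequence of class labels."""
    # Count how often each class occurs; the class values themselves are not needed
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    # Adding 0.0 normalizes the -0.0 that a pure (single-class) node would otherwise return
    return -np.sum(probabilities * np.log2(probabilities)) + 0.0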

# Create the training dataset
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'Education': [16, 14, None, 12, 18, 14, 16, 12],
    'RiskLevel': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High']
}

df = pd.DataFrame(data)

# Calculate the entropy of the parent node
parent_entropy = entropy(df['RiskLevel'])
print("Parent Entropy:", parent_entropy)  

# Split the dataset based on CreditScore at 650
group_A = df[df['CreditScore'] >= 650]  # Group A: CreditScore >= 650
group_B = df[df['CreditScore'] < 650]   # Group B: CreditScore < 650

entropy_A = entropy(group_A['RiskLevel'])
entropy_B = entropy(group_B['RiskLevel'])

print("Entropy of Group A (CreditScore >= 650):", entropy_A)
print("Entropy of Group B (CreditScore < 650):", entropy_B)

# Calculate the weighted entropy after the split
n = len(df)
weighted_entropy = (len(group_A) / n) * entropy_A + (len(group_B) / n) * entropy_B
print("Weighted Entropy after split:", weighted_entropy)

# Information Gain calculation
information_gain = parent_entropy - weighted_entropy
print("Information Gain:", information_gain)

### Explanation

1. **Parent Entropy:**
   
   Since the dataset is perfectly balanced with 4 **Low** and 4 **High** risk records, the entropy is calculated as:
   \[
   E(\text{parent}) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1 \; \text{bit}
   \]

2. **Split Details:**
   
   - **Group A (CreditScore ≥ 650):** Contains 5 records (4 Low, 1 High).
   - **Group B (CreditScore < 650):** Contains 3 records (all High).

3. **Entropy After the Split:**
   
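   The entropy of each group is computed separately and then weighted by the proportion of records the group contains. Group B is pure, so its entropy is zero:
   \[
   E(A) = -\left(\frac{4}{5}\log_2\frac{4}{5} + \frac{1}{5}\log_2\frac{1}{5}\right) \approx 0.7219 \; \text{bits}, \qquad E(B) = 0
   \]
   \[
   E_{\text{split}} = \frac{5}{8}E(A) + \frac{3}{8}E(B) \approx 0.4512 \; \text{bits}
   \]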

4. **Information Gain:**
   
   The information gain is the reduction in entropy after the split:
   \[
   \text{Gain} = E(\text{parent}) - E_{\text{split}}
   \]

   With a weighted entropy of about 0.4512 bits after the split, the gain is 1 - 0.4512 ≈ 0.5488 bits. The split therefore removes roughly 55% of the uncertainty about the risk classification.
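
The four steps above can be folded into one reusable function. The following is a minimal sketch, not part of the original notebook: the helper name `information_gain` and its parameters are hypothetical, and it assumes a binary split of the form `feature >= threshold`, as used for `CreditScore` here. Only the two columns needed for the calculation are reproduced.

```python
import numpy as np
import pandas as pd


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Adding 0.0 turns the -0.0 of a pure node into 0.0
    return float(-np.sum(p * np.log2(p)) + 0.0)


def information_gain(df, feature, threshold, target):
    """Information gain of the binary split `feature >= threshold`.

    Hypothetical helper for illustration; not defined in the notebook above.
    """
    mask = df[feature] >= threshold
    parent = entropy(df[target])
    weighted = 0.0
    for part in (df[mask], df[~mask]):
        if len(part) > 0:  # an empty side contributes nothing to the weighted sum
            weighted += len(part) / len(df) * entropy(part[target])
    return parent - weighted


# Same CreditScore and RiskLevel values as the training dataset above
df = pd.DataFrame({
    "CreditScore": [720, 650, 750, 600, 780, 630, 710, 640],
    "RiskLevel": ["Low", "High", "Low", "High", "Low", "High", "Low", "High"],
})

gain = information_gain(df, "CreditScore", 650, "RiskLevel")
print(f"Information gain at 650: {gain:.4f}")  # prints 0.5488
```

Calling the same helper with a different threshold would make it possible to compare candidate split points on this feature.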