# Q2: Variance Reduction for Splitting on Age = 35

This notebook computes the variance reduction for a regression decision tree that predicts `CreditScore` when the training data is split at the threshold `Age <= 35`. It then discusses how variance reduction differs from information gain in classification trees.
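For reference, the quantity computed below can be written as follows. This is a sketch of the standard definition, assuming population variance (dividing by the group size, not by one less), which is what the notebook's `variance` helper does. The symbols $S$, $S_A$, $S_B$, $y_i$, and $\bar{y}_S$ are notation introduced here: $S$ is the parent node, $S_A$ and $S_B$ are the two child groups, $y_i$ is a `CreditScore` value, and $\bar{y}_S$ is the mean over $S$.

$$
\mathrm{Var}(S) = \frac{1}{|S|} \sum_{i \in S} \left(y_i - \bar{y}_S\right)^2,
\qquad
\Delta = \mathrm{Var}(S) - \left( \frac{|S_A|}{|S|}\,\mathrm{Var}(S_A) + \frac{|S_B|}{|S|}\,\mathrm{Var}(S_B) \right)
$$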

In [None]:
import numpy as np
import pandas as pd

def variance(values):
    """Compute the population variance (divide by n, same as np.var with ddof=0) of an array-like of values."""
    mean = np.mean(values)
    return np.mean((values - mean) ** 2)

# Create the training dataset
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'Education': [16, 14, None, 12, 18, 14, 16, 12],
    'RiskLevel': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High']
}

df = pd.DataFrame(data)

# Calculate the parent variance for CreditScore
parent_mean = np.mean(df['CreditScore'])
parent_variance = variance(df['CreditScore'])
print('Parent Mean:', parent_mean)
print('Parent Variance:', parent_variance)

# Split the dataset at the threshold Age <= 35 (Age == 35 falls in Group A)
group_A = df[df['Age'] <= 35]  # Group A: Age <= 35
group_B = df[df['Age'] > 35]   # Group B: Age > 35

# Calculate variance for each group
variance_A = variance(group_A['CreditScore'])
variance_B = variance(group_B['CreditScore'])

print('Group A (Age <= 35) Variance:', variance_A)
print('Group B (Age > 35) Variance:', variance_B)

# Calculate the weighted variance after the split
n = len(df)
weighted_variance = (len(group_A)/n) * variance_A + (len(group_B)/n) * variance_B
print('Weighted Variance after split:', weighted_variance)

# Compute the variance reduction
variance_reduction = parent_variance - weighted_variance
print('Variance Reduction:', variance_reduction)

# Discussion:
print("\nVariance reduction minimizes the mean squared error for continuous targets, whereas information gain in classification trees measures the reduction in impurity (entropy) for categorical outcomes.")

### Explanation

1. **Parent Node:** The variance of `CreditScore` over the full dataset is computed first. The mean is 685 and the population variance (dividing by n = 8) is 3575.

2. **Splitting on Age = 35:** The dataset is divided into two groups:
   - **Group A (Age ≤ 35):** Five records with CreditScores [720, 650, 600, 630, 640], mean 648, and variance exactly 1576.
   - **Group B (Age > 35):** Three records with CreditScores [750, 780, 710], mean of approximately 746.67, and variance of approximately 822.22.

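3. **Weighted Variance and Reduction:** Each group's variance is weighted by its share of the records: (5/8) × 1576 + (3/8) × 822.22 ≈ 985 + 308.33 = 1293.33. Subtracting this from the parent variance gives a variance reduction of 3575 − 1293.33 ≈ 2281.67, about 63.8% of the parent variance.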

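4. **Difference from Information Gain:** Both criteria have the same form: parent impurity minus the size-weighted impurity of the children. They differ in the impurity measure. Variance reduction is used for continuous targets (regression), where the impurity is the variance, that is, the mean squared error of predicting each node's mean. Information gain is used for categorical targets (classification), where the impurity is the entropy of the class distribution.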
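
To make the contrast in point 4 concrete, the sketch below applies the same `Age <= 35` split to both targets in the dataset: `CreditScore` (continuous) and `RiskLevel` (categorical). The helper names `population_variance`, `entropy`, and `impurity_reduction` are illustrative and not part of the notebook above; entropy is assumed to be measured in bits (base-2 logarithm).

```python
import numpy as np

# Same training rows as the notebook, restricted to the columns used here.
age = np.array([35, 28, 45, 31, 52, 29, 42, 33])
credit_score = np.array([720, 650, 750, 600, 780, 630, 710, 640])
risk_level = np.array(['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High'])

def population_variance(values):
    """Impurity for a continuous target: mean squared deviation from the mean."""
    return float(np.mean((values - np.mean(values)) ** 2))

def entropy(labels):
    """Impurity for a categorical target: Shannon entropy in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def impurity_reduction(target, mask, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left, right = target[mask], target[~mask]
    weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / len(target)
    return impurity(target) - weighted

mask = age <= 35  # Group A where True, Group B where False

variance_reduction = impurity_reduction(credit_score, mask, population_variance)
information_gain = impurity_reduction(risk_level, mask, entropy)

print(f"Variance reduction: {variance_reduction:.2f}")
print(f"Information gain: {information_gain:.4f} bits")
```

Only the impurity function changes between the two calls. For this split, Group B contains only `Low` records, so its entropy is zero and the remaining impurity comes entirely from Group A (one `Low`, four `High`).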