
# Question 1: Calculating Information Gain for a Decision Tree Classification Problem

You are given a dataset of numerical features and a target variable indicating whether a tennis game will be played. The dataset has the following attributes: Temperature, Humidity, and WindSpeed. The target variable is PlayTennis. The dataset is as follows:

| Temperature | Humidity | WindSpeed | PlayTennis |
|-------------|----------|-----------|------------|
| 85          | 85       | 5         | No         |
| 80          | 90       | 10        | No         |
| 78          | 95       | 5         | Yes        |
| 72          | 80       | 5         | Yes        |
| 69          | 70       | 10        | Yes        |
| 75          | 80       | 15        | No         |

1. **Calculate the entropy of the target variable PlayTennis.**

2. **Choose a threshold for splitting the attribute "Temperature" and calculate the information gain for this split. Use the threshold \( T = 75 \) (i.e., split the data into subsets where Temperature <= 75 and Temperature > 75).**

3. **Determine which attribute would be the best to split on first based on the highest information gain among the attributes Temperature \( T = 75 \), Humidity \( H = 85 \), and WindSpeed \( W = 10 \).**

### Solution Outline

1. **Entropy of PlayTennis:**

   Entropy is calculated using the formula:
   $
   H(S) = - \sum_{i=1}^{c} p_i \log_2(p_i)
   $
   where \(c\) is the number of classes and \(p_i\) is the probability of class \(i\).

2. **Information Gain for Temperature:**

   - Split the dataset based on the threshold \( T = 75 \).
   - Calculate the entropy of each subset.
   - Calculate the weighted average entropy of the subsets.
   - Information gain is calculated using the formula:
   $
   IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)
   $
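   where \(S_v\) is the subset of \(S\) for which attribute \(A\) has value \(v\). For a numerical attribute split at a threshold, the two "values" are the two sides of the threshold (here, Temperature ≤ 75 and Temperature > 75).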

3. **Calculating Information Gain for all attributes and determining the best attribute to split:**

   Repeat the above process for Humidity and WindSpeed using the thresholds given in the question (\( H = 85 \) and \( W = 10 \)). Compare the information gains and select the attribute with the highest information gain.


# Solution

### 1. Calculate the entropy of the target variable PlayTennis

The entropy \( H(S) \) of the target variable \(PlayTennis\) can be calculated using the formula:
$ H(S) = - \sum_{i=1}^{c} p_i \log_2(p_i) $

where \(c\) is the number of classes and \(p_i\) is the probability of class \(i\).

In our dataset:
- Number of "Yes" (PlayTennis = Yes) = 3
- Number of "No" (PlayTennis = No) = 3
- Total samples = 6

So, the probabilities are:

$ p(Yes) = \frac{3}{6} = 0.5 $

$ p(No) = \frac{3}{6} = 0.5 $

The entropy is:

$ H(S) = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) $

$ H(S) = - (0.5 \times -1 + 0.5 \times -1) $

$ H(S) = - (-0.5 - 0.5) $

$ H(S) = 1 $
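
The calculation above can be checked with a few lines of Python. This is a minimal sketch: the helper name `entropy` and the list `play_tennis` are illustrative names chosen here, with the labels copied from the table in row order.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    # Counter gives the number of samples per class; k / n is p_i.
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


# PlayTennis column from the table, samples 1 to 6.
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

print(entropy(play_tennis))  # 1.0
```

With three samples in each class, the set is as mixed as a two-class set can be, which is why the entropy reaches its maximum of 1 bit.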

### 2. Choose a threshold for splitting the attribute "Temperature" and calculate the information gain for this split. Use the threshold \( T = 75 \)

To calculate the information gain, we need to split the dataset based on the threshold \( T = 75 \) and calculate the entropy of the resulting subsets.

**Split based on Temperature \( T = 75 \):**
- Temperature ≤ 75: Samples 4, 5, 6
- Temperature > 75: Samples 1, 2, 3

Let's calculate the entropy for each subset.

**Subset where Temperature ≤ 75:**
- Samples: 4, 5, 6
- PlayTennis: Yes, Yes, No

Entropy:

$ p(Yes) = \frac{2}{3} = 0.6667 $

$ p(No) = \frac{1}{3} = 0.3333 $

$ H(S1) = - (0.6667 \log_2 0.6667 + 0.3333 \log_2 0.3333) $

$ H(S1) = - (0.6667 \times -0.585 + 0.3333 \times -1.585) $

$ H(S1) = - (-0.390 - 0.528) $

$ H(S1) = 0.918 $

**Subset where Temperature > 75:**
- Samples: 1, 2, 3
- PlayTennis: No, No, Yes

Entropy:

$ p(Yes) = \frac{1}{3} = 0.3333 $

$ p(No) = \frac{2}{3} = 0.6667 $

$ H(S2) = - (0.3333 \log_2 0.3333 + 0.6667 \log_2 0.6667) $

$ H(S2) = - (0.3333 \times -1.585 + 0.6667 \times -0.585) $

$ H(S2) = - (-0.528 - 0.390) $

$ H(S2) = 0.918 $

**Weighted Average Entropy:**

$ H(S_T) = \frac{3}{6} H(S1) + \frac{3}{6} H(S2) $

$ H(S_T) = 0.5 \times 0.918 + 0.5 \times 0.918 $

$ H(S_T) = 0.918 $

**Information Gain:**

$ IG(S, Temperature) = H(S) - H(S_T) $

$ IG(S, Temperature) = 1 - 0.918 $

$ IG(S, Temperature) = 0.082 $
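
The split-and-weight procedure above can be sketched in Python as follows. The function names `entropy` and `information_gain` are illustrative, and the sketch assumes a binary split where values equal to the threshold go to the `<=` side, as stated in the question.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Gain from splitting into `value <= threshold` and `value > threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Weighted average entropy of the non-empty subsets.
    weighted = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - weighted


# Columns copied from the table, samples 1 to 6.
temperature = [85, 80, 78, 72, 69, 75]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

print(round(information_gain(temperature, play_tennis, 75), 3))  # 0.082
```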

### 3. Determine which attribute would be the best to split on first based on the highest information gain among the attributes Temperature, Humidity, and WindSpeed.

Let's calculate the information gain for Humidity and WindSpeed using the thresholds given in the question.

**Information Gain for Humidity:**

Choose a threshold for splitting Humidity. Use \( H = 85 \).

**Split based on Humidity \( H = 85 \):**
- Humidity ≤ 85: Samples 1, 4, 5, 6
- Humidity > 85: Samples 2, 3

**Subset where Humidity ≤ 85:**
- Samples: 1, 4, 5, 6
- PlayTennis: No, Yes, Yes, No

Entropy:

$ p(Yes) = \frac{2}{4} = 0.5 $

$ p(No) = \frac{2}{4} = 0.5 $

$ H(S1) = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) $

$ H(S1) = - (0.5 \times -1 + 0.5 \times -1) $

$ H(S1) = 1 $

**Subset where Humidity > 85:**
- Samples: 2, 3
- PlayTennis: No, Yes

Entropy:

$ p(Yes) = \frac{1}{2} = 0.5 $

$ p(No) = \frac{1}{2} = 0.5 $

$ H(S2) = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) $

$ H(S2) = - (0.5 \times -1 + 0.5 \times -1) $

$ H(S2) = 1 $

**Weighted Average Entropy:**

$ H(S_H) = \frac{4}{6} H(S1) + \frac{2}{6} H(S2) $

$ H(S_H) = \frac{4}{6} \times 1 + \frac{2}{6} \times 1 $

$ H(S_H) = 1 $

**Information Gain:**

$ IG(S, Humidity) = H(S) - H(S_H) $

$ IG(S, Humidity) = 1 - 1 $

$ IG(S, Humidity) = 0 $

**Information Gain for WindSpeed:**

Choose a threshold for splitting WindSpeed. Use \( W = 10 \).

**Split based on WindSpeed \( W = 10 \):**
- WindSpeed ≤ 10: Samples 1, 2, 3, 4, 5
- WindSpeed > 10: Sample 6

Sample 2 has WindSpeed = 10, so it satisfies WindSpeed ≤ 10 and belongs to the first subset.

**Subset where WindSpeed ≤ 10:**
- Samples: 1, 2, 3, 4, 5
- PlayTennis: No, No, Yes, Yes, Yes

Entropy:

$ p(Yes) = \frac{3}{5} = 0.6 $

$ p(No) = \frac{2}{5} = 0.4 $

$ H(S1) = - (0.6 \log_2 0.6 + 0.4 \log_2 0.4) $

$ H(S1) = - (0.6 \times -0.737 + 0.4 \times -1.322) $

$ H(S1) = - (-0.442 - 0.529) $

$ H(S1) = 0.971 $

**Subset where WindSpeed > 10:**
- Samples: 6
- PlayTennis: No

Entropy:

$ H(S2) = - (1 \log_2 1) $

$ H(S2) = 0 $

**Weighted Average Entropy:**

$ H(S_W) = \frac{5}{6} H(S1) + \frac{1}{6} H(S2) $

$ H(S_W) = \frac{5}{6} \times 0.971 + \frac{1}{6} \times 0 $

$ H(S_W) = 0.809 $

**Information Gain:**

$ IG(S, WindSpeed) = H(S) - H(S_W) $

$ IG(S, WindSpeed) = 1 - 0.809 $

$ IG(S, WindSpeed) = 0.191 $

### Best Attribute to Split

Comparing the information gains:
- \( IG(S, Temperature) = 0.082 \)
- \( IG(S, Humidity) = 0 \)
- \( IG(S, WindSpeed) = 0.191 \)

The best attribute to split on first is **WindSpeed**, with the highest information gain of 0.191.
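
The comparison above can be reproduced with a short Python sketch. The names `entropy`, `information_gain`, `columns`, and `gains` are illustrative choices, not part of the problem statement. The split uses `<=`, so a row whose value equals the threshold (for example, WindSpeed = 10) lands in the "≤" subset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Gain from splitting into `value <= threshold` and `value > threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - weighted


play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Attribute name -> (column from the table, threshold from the question).
columns = {
    "Temperature": ([85, 80, 78, 72, 69, 75], 75),
    "Humidity": ([85, 90, 95, 80, 70, 80], 85),
    "WindSpeed": ([5, 10, 5, 5, 10, 15], 10),
}

gains = {
    name: information_gain(values, play_tennis, threshold)
    for name, (values, threshold) in columns.items()
}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Best attribute:", best)  # Best attribute: WindSpeed
```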

# Question 2: Calculating Information Gain for a Decision Tree Regression Problem

You are given a dataset of numerical features and a target variable giving the house price. The dataset has the following attributes: Size (in square feet), Number of Bedrooms, and Age (in years). The target variable is Price (in $1000s). The dataset is as follows:

| Size (sqft) | Bedrooms | Age (years) | Price ($1000s) |
|-------------|----------|-------------|----------------|
| 2100        | 3        | 30          | 400            |
| 1600        | 2        | 20          | 330            |
| 2400        | 4        | 15          | 369            |
| 1416        | 3        | 40          | 232            |
| 3000        | 4        | 8           | 540            |
| 1985        | 3        | 30          | 260            |

1. **Calculate the variance of the target variable Price.**

2. **Choose a threshold for splitting the attribute "Size" and calculate the information gain for this split. Use the threshold \( S = 2000 \) (i.e., split the data into subsets where Size ≤ 2000 and Size > 2000).**

---

### Solution Outline

1. **Variance of Price:**

   Variance is calculated using the formula:
   $
   \text{Var}(S) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2
   $
   where \(N\) is the number of samples, \(x_i\) is the value of the target variable for the \(i\)-th sample, and \(\bar{x}\) is the mean of the target variable.

2. **Information Gain for Size:**

   - Split the dataset based on the threshold \( S = 2000 \).
   - Calculate the variance of each subset.
   - Calculate the weighted average variance of the subsets.
   - Information gain (for a regression target, the reduction in variance) is calculated using the formula:
   $
   IG(S, A) = \text{Var}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \text{Var}(S_v)
   $
   where \(S_v\) is the subset of \(S\) for which attribute \(A\) has value \(v\).


# Solution

### 1. Calculate the variance of the target variable Price

The variance of the target variable can be calculated using the formula:

$ \text{Var}(S) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 $

where \(N\) is the number of samples, \(x_i\) is the value of the target variable for the \(i\)-th sample, and \(\bar{x}\) is the mean of the target variable.
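
As a cross-check for the hand calculation, the formula can be sketched in Python. The helper name `variance` is illustrative. Note that the formula divides by \(N\) (population variance), which matches `statistics.pvariance` in the standard library, not the sample variance `statistics.variance`.

```python
from statistics import pvariance


def variance(values):
    """Population variance: mean squared deviation from the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


# Price column from the table (in $1000s), samples 1 to 6.
prices = [400, 330, 369, 232, 540, 260]

print(round(sum(prices) / len(prices), 2))  # mean of Price
print(round(variance(prices), 2))           # Var(S), dividing by N
print(round(pvariance(prices), 2))          # same value from the standard library
```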

First, calculate the mean of the Price:

$ \bar{x} = \frac{400 + 330 + 369 + 232 + 540 + 260}{6} $

$ \bar{x} = \frac{2131}{6} $

$ \bar{x} \approx 355.17 $

Next, calculate the variance:

$ \text{Var}(S) = \frac{1}{6} \left[ (400 - 355.17)^2 + (330 - 355.17)^2 + (369 - 355.17)^2 + (232 - 355.17)^2 + (540 - 355.17)^2 + (260 - 355.17)^2 \right] $

Evaluating each squared deviation with the unrounded mean \( \frac{2131}{6} \) to avoid accumulating rounding error:

$ \text{Var}(S) = \frac{1}{6} \left[ 2010.03 + 633.36 + 191.36 + 15170.03 + 34163.36 + 9056.69 \right] $

$ \text{Var}(S) = \frac{1}{6} \times 61224.83 $

$ \text{Var}(S) \approx 10204.14 $

### 2. Information Gain for Size

To calculate the information gain, we need to split the dataset based on the threshold \( S = 2000 \) and calculate the variance of the resulting subsets.

**Split based on Size \( S = 2000 \):**
- Size ≤ 2000: Samples 2, 4, 6
- Size > 2000: Samples 1, 3, 5

**Subset where Size ≤ 2000:**
- Prices: 330, 232, 260

Mean:

$ \bar{x}_1 = \frac{330 + 232 + 260}{3} = \frac{822}{3} = 274 $

Variance:

$ \text{Var}(S_1) = \frac{1}{3} \left[ (330 - 274)^2 + (232 - 274)^2 + (260 - 274)^2 \right] $

$ \text{Var}(S_1) = \frac{1}{3} \left[ 3136 + 1764 + 196 \right] $

$ \text{Var}(S_1) = \frac{5096}{3} $

$ \text{Var}(S_1) \approx 1698.67 $

**Subset where Size > 2000:**
- Prices: 400, 369, 540

Mean:

$ \bar{x}_2 = \frac{400 + 369 + 540}{3} = \frac{1309}{3} \approx 436.33 $

Variance:

$ \text{Var}(S_2) = \frac{1}{3} \left[ (400 - 436.33)^2 + (369 - 436.33)^2 + (540 - 436.33)^2 \right] $

$ \text{Var}(S_2) = \frac{1}{3} \left[ 1320.11 + 4533.78 + 10746.78 \right] $

$ \text{Var}(S_2) = \frac{16600.67}{3} $

$ \text{Var}(S_2) \approx 5533.56 $

**Weighted Average Variance:**

$ \text{Var}(S_{\text{Size}}) = \frac{3}{6} \times \text{Var}(S_1) + \frac{3}{6} \times \text{Var}(S_2) $

$ \text{Var}(S_{\text{Size}}) = 0.5 \times 1698.67 + 0.5 \times 5533.56 $

$ \text{Var}(S_{\text{Size}}) \approx 3616.11 $

**Information Gain:**

$ IG(S, \text{Size}) = \text{Var}(S) - \text{Var}(S_{\text{Size}}) $

$ IG(S, \text{Size}) = 10204.14 - 3616.11 $

$ IG(S, \text{Size}) = 6588.03 $
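
The split evaluation for Size can also be sketched in Python. The names `variance` and `variance_reduction` are illustrative. The sketch assumes population variance (dividing by \(N\)) and sends values equal to the threshold to the `<=` side, as in the question.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def variance_reduction(values, targets, threshold):
    """Variance of the parent minus the weighted variance of the two subsets."""
    left = [y for x, y in zip(values, targets) if x <= threshold]
    right = [y for x, y in zip(values, targets) if x > threshold]
    n = len(targets)
    # Weighted average variance of the non-empty subsets.
    weighted = sum(len(part) / n * variance(part) for part in (left, right) if part)
    return variance(targets) - weighted


# Columns copied from the table, samples 1 to 6.
size = [2100, 1600, 2400, 1416, 3000, 1985]
prices = [400, 330, 369, 232, 540, 260]

print(round(variance_reduction(size, prices, 2000), 2))
```

Because the code keeps full floating-point precision throughout, its result may differ in the last decimal place from a hand calculation that rounds intermediate values.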

