# Data Mining Assignment 3: Solutions for Chapter 8 and Chapter 9

This Jupyter Notebook provides analytical solutions and Python code for questions from Chapter 8 (Tree-Based Methods) and Chapter 9 (Support Vector Machines) of the provided document. All code uses `fetch_california_housing` in place of `load_boston`, which was deprecated and then removed from recent `scikit-learn` releases.

## Chapter 8: Tree-Based Methods

### Question 7: Random Forests on California Housing Data

**Analytical Solution:**

We need to plot test error (MSE) for random forests on a housing dataset, varying `max_features` and `n_estimators`, similar to Figure 8.10. The original problem used the Boston dataset (13 features), but due to deprecation, we use `fetch_california_housing` (8 features). We test `max_features = [1, 3, 5, 8]` and `n_estimators = [10, 50, 100, 200, 500]`.

Random forests average predictions from multiple trees, reducing variance. We expect:
- MSE decreases as `n_estimators` increases, stabilizing around 100–200 trees.
- Intermediate `max_features` (e.g., 3, 5) often performs best, balancing randomness and predictive power.
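- Low `max_features` (e.g., 1) may underfit, because each split sees too few candidate features. The maximum (8, i.e., all features) is equivalent to bagging: the trees become more correlated, so averaging reduces variance less and test MSE may be slightly higher.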
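
The variance-reduction argument behind these expectations can be made explicit with a standard identity. This is a sketch under a simplifying assumption: the $B$ tree predictions $T_1, \dots, T_B$ are identically distributed with variance $\sigma^2$ and positive pairwise correlation $\rho$ (these symbols are introduced here for illustration and do not appear in the original problem).

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Increasing `n_estimators` (larger $B$) shrinks only the second term, which is why the error curve flattens. Lowering `max_features` tends to reduce $\rho$, but it also weakens each individual tree, which is consistent with an intermediate value often working best.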

**Code**: The following code trains random forests and plots test MSE using the California housing dataset.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California housing dataset
california = fetch_california_housing()
X = california.data
y = california.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Parameters to test
n_estimators_list = [10, 50, 100, 200, 500]
max_features_list = [1, 3, 5, 8]
results = {}

# Train models and compute test MSE
for max_features in max_features_list:
    mse_list = []
    for n_estimators in n_estimators_list:
        rf = RandomForestRegressor(
            n_estimators=n_estimators,
            max_features=max_features,
            random_state=42
        )
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        mse_list.append(mse)
    results[max_features] = mse_list

# Plot results
plt.figure(figsize=(10, 6))
for max_features in max_features_list:
    plt.plot(n_estimators_list, results[max_features], marker='o', label=f'max_features={max_features}')
plt.xlabel('Number of Trees (n_estimators)')
plt.ylabel('Test MSE')
plt.title('Test Error vs. Number of Trees for Different max_features (California Housing)')
plt.legend()
plt.grid(True)
plt.show()

**Expected Results**:

- **Trend**: MSE decreases with more trees, stabilizing at 100–200 trees.
- **max_features**:
  - `max_features = 3, 5`: Typically yield lower MSE, balancing randomness and feature inclusion.
  - `max_features = 1`: Higher MSE due to underfitting from excessive randomness.
  - `max_features = 8`: Slightly higher MSE than 3 or 5, as trees become more correlated.
- **Example MSE**: At `n_estimators = 500`, MSE may be ~0.2–0.3, with `max_features = 3` or `5` near the lowest.
- The plot resembles Figure 8.10, with intermediate `max_features` typically optimal.
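
The grid in the code cell refits every forest from scratch. One possible shortcut, sketched below, is scikit-learn's `warm_start=True` option, which keeps the trees already grown and only fits the additional ones when `n_estimators` is raised. The synthetic `make_regression` data, the small tree counts, and the names `rf_ws` and `mse_by_count` are assumptions chosen so the sketch runs quickly; they stand in for the California housing data and the parameter lists used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the California housing set)
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

tree_counts = [10, 25, 50]  # must be increasing when warm_start is used
rf_ws = RandomForestRegressor(max_features=3, warm_start=True, random_state=0)

mse_by_count = {}
for n_trees in tree_counts:
    rf_ws.set_params(n_estimators=n_trees)
    rf_ws.fit(Xd_train, yd_train)  # existing trees are kept; only new ones are fitted
    mse_by_count[n_trees] = mean_squared_error(yd_test, rf_ws.predict(Xd_test))

for n_trees, mse in mse_by_count.items():
    print(f"n_estimators={n_trees}: test MSE = {mse:.2f}")
```

The same loop could be wrapped in an outer loop over `max_features`, creating a fresh `warm_start` forest for each value.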

## Chapter 9: Support Vector Machines

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Create grid
x1 = np.linspace(-5, 5, 100)
x2 = np.linspace(-5, 5, 100)
X1, X2 = np.meshgrid(x1, x2)

# Hyperplane 1: 1 + 3X_1 - X_2 = 0
Z1 = 1 + 3*X1 - X2
# Hyperplane 2: -2 + X_1 + 2X_2 = 0
Z2 = -2 + X1 + 2*X2

plt.figure(figsize=(8, 8))

# Plot hyperplane 1
plt.contour(X1, X2, Z1, levels=[0], colors='blue', linestyles='solid')
plt.contourf(X1, X2, Z1, levels=[0, np.max(Z1)], colors='lightblue', alpha=0.3)  # X_2 < 3X_1 + 1
plt.contourf(X1, X2, Z1, levels=[np.min(Z1), 0], colors='lightgray', alpha=0.3)  # X_2 > 3X_1 + 1

# Plot hyperplane 2
plt.contour(X1, X2, Z2, levels=[0], colors='red', linestyles='dashed')
plt.contourf(X1, X2, Z2, levels=[0, np.max(Z2)], colors='lightcoral', alpha=0.3)  # X_2 > -X_1/2 + 1
plt.contourf(X1, X2, Z2, levels=[np.min(Z2), 0], colors='lightyellow', alpha=0.3)  # X_2 < -X_1/2 + 1

plt.xlabel('X_1')
plt.ylabel('X_2')
plt.title('Hyperplanes and Regions')
plt.grid(True)
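# Contour sets are not reliably picked up by plt.legend, so label empty proxy lines instead
plt.plot([], [], color='blue', linestyle='solid', label='1 + 3X_1 - X_2 = 0')
plt.plot([], [], color='red', linestyle='dashed', label='-2 + X_1 + 2X_2 = 0')
plt.legend()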
plt.show()
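
The shaded regions in the plot correspond to the sign of each hyperplane expression. As a minimal sketch, the helper below evaluates $\beta_0 + \beta_1 X_1 + \beta_2 X_2$ at a point and reports which side it falls on. The function name `hyperplane_value` and the sample points are hypothetical and chosen only for illustration.

```python
import numpy as np

def hyperplane_value(beta0, beta, point):
    """Return beta0 + beta . point; the sign gives the side of the hyperplane."""
    return beta0 + float(np.dot(beta, point))

# Hypothetical sample points, not taken from the original problem
sample_points = [(0.0, 0.0), (2.0, 3.0), (-2.0, 4.0)]

for pt in sample_points:
    f1 = hyperplane_value(1.0, [3.0, -1.0], pt)   # 1 + 3X_1 - X_2
    f2 = hyperplane_value(-2.0, [1.0, 2.0], pt)   # -2 + X_1 + 2X_2
    print(pt, "| 1 + 3X_1 - X_2 > 0:", f1 > 0, "| -2 + X_1 + 2X_2 > 0:", f2 > 0)
```

A value of exactly zero means the point lies on the hyperplane itself; for example, (0, 1) satisfies both equations, so it is where the two lines cross.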

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Create grid
x1 = np.linspace(-5, 5, 100)
x2 = np.linspace(-5, 10, 150)  # extend upward so the point (3, 8) lies inside the shaded grid
X1, X2 = np.meshgrid(x1, x2)

# Circle boundary: (1 + X_1)^2 + (2 - X_2)^2 = 4 (center (-1, 2), radius 2)
Z = (1 + X1)**2 + (2 - X2)**2

plt.figure(figsize=(8, 8))

# Plot circle
plt.contour(X1, X2, Z, levels=[4], colors='black', linestyles='solid')
plt.contourf(X1, X2, Z, levels=[4, np.max(Z)], colors='lightblue', alpha=0.3)  # Outside
plt.contourf(X1, X2, Z, levels=[np.min(Z), 4], colors='lightcoral', alpha=0.3)  # Inside

# Plot points, colored by the side of the boundary they fall on
points = [(0, 0), (-1, 1), (2, 2), (3, 8)]
for p1, p2 in points:  # distinct names so the grid arrays x1, x2 are not overwritten
    value = (1 + p1)**2 + (2 - p2)**2
    color = 'blue' if value > 4 else 'red'  # blue: outside the circle, red: inside
    plt.scatter(p1, p2, c=color, s=100, edgecolors='black')
    plt.text(p1 + 0.2, p2, f'({p1}, {p2})', fontsize=10)

plt.xlabel('X_1')
plt.ylabel('X_2')
plt.title('Non-Linear Decision Boundary and Classified Points')
plt.grid(True)
plt.show()
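
The colors in the plot can also be checked numerically. The sketch below evaluates $(1 + X_1)^2 + (2 - X_2)^2$ at each of the four points and applies the same rule as the plotting code (blue if the value exceeds 4, red otherwise). The helper name `circle_side` is hypothetical.

```python
def circle_side(x1_val, x2_val, threshold=4):
    """Return the boundary expression's value and the resulting class label."""
    value = (1 + x1_val) ** 2 + (2 - x2_val) ** 2
    label = 'blue' if value > threshold else 'red'
    return value, label

for pt in [(0, 0), (-1, 1), (2, 2), (3, 8)]:
    value, label = circle_side(*pt)
    print(f"{pt}: value = {value}, class = {label}")
```

The values are 5, 1, 9, and 52, so only (-1, 1) falls inside the circle and is classified red; the other three points are blue.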