To integrate decision trees into the hidden layer of the neural network, we use them as additional "neurons" that process the activations from the standard hidden neurons. Here's a detailed explanation and a step-by-step breakdown of how decision trees are used in this architecture:

### Integrating Decision Trees into the Hidden Layer

1. **Input Layer**:
   - Number of Neurons: 13 (corresponding to the 13 features of the dataset).

2. **First Part of the Hidden Layer**:
   - Number of Neurons: 10
   - Each neuron applies a Tanh activation function.

3. **Decision Trees in the Hidden Layer**:
   - Number of Decision Trees: 10
   - Each decision tree is a regressor fitted on the activations of the 10 hidden neurons, with the target values as labels.
   - Each tree's prediction for a sample acts as one extra "neuron" output.

4. **Combined Hidden Layer**:
   - The outputs from the 10 hidden neurons and the 10 decision trees are combined into a single layer with 20 neurons.

5. **Output Layer**:
   - Number of Neurons: 1 (corresponding to the predicted house price).
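
The layer sizes above can be traced with a small shape check. This is a minimal sketch on random data, not the notebook's trained model: the names `X_demo`, `y_demo`, `W1`, `b1`, `W2`, `b2` and the sample count of 50 are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_samples, n_features, hidden_dim, num_trees = 50, 13, 10, 10

# Hypothetical stand-in data: 13 features per sample, one target per sample
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = rng.normal(size=n_samples)

# First part of the hidden layer: 10 tanh neurons
W1 = rng.normal(size=(n_features, hidden_dim)) * 0.01
b1 = np.zeros(hidden_dim)
a1 = np.tanh(X_demo @ W1 + b1)                      # shape (50, 10)

# Second part: 10 regression trees, each fitted on the activations
trees = [
    DecisionTreeRegressor(max_depth=5, random_state=i).fit(a1, y_demo)
    for i in range(num_trees)
]
tree_out = np.column_stack([t.predict(a1) for t in trees])  # shape (50, 10)

# Combined hidden layer: 10 activations + 10 tree predictions = 20 columns
combined = np.hstack((a1, tree_out))                # shape (50, 20)

# Output layer: a single linear neuron
W2 = rng.normal(size=(hidden_dim + num_trees, 1)) * 0.01
b2 = np.zeros(1)
z2 = combined @ W2 + b2                             # shape (50, 1)

print(a1.shape, tree_out.shape, combined.shape, z2.shape)
```

The output layer therefore needs `hidden_dim + num_trees` input weights, which is how `weights2` is sized in the class further down.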




The decision trees take the hidden-neuron activations as input, and their outputs are concatenated with those activations before being passed to the output layer.

### Steps to Integrate Decision Trees

1. **Forward Pass**:
   - Compute the activations from the hidden neurons using a Tanh activation function.
   - Fit the decision trees on these activations, using the target values as labels.
   - Each tree then outputs a prediction for every sample from the same activations.
   - Combine the original activations with the decision tree outputs to form the combined hidden layer.

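2. **Backward Pass**:
   - Calculate the error between the predicted output and the true target values.
   - Compute gradients and update the weights using backpropagation. The trees are not differentiable, so their outputs are treated as fixed inputs to the output layer: the gradient reaches the first-layer weights only through the Tanh neurons.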
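
Backpropagation reaches `weights1` only through the Tanh activations, not through the trees, because a fitted regression tree is piecewise constant in its inputs: a tiny change in an activation does not normally move a sample to a different leaf, so the tree's output does not respond to it. The sketch below illustrates this on random data; `a1_demo`, `y_demo` and the perturbation size `eps` are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
a1_demo = np.tanh(rng.normal(size=(40, 10)))   # hypothetical hidden activations
y_demo = rng.normal(size=40)                   # hypothetical targets

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(a1_demo, y_demo)

# Nudge every activation by a tiny amount and predict again
eps = 1e-9
base = tree.predict(a1_demo)
nudged = tree.predict(a1_demo + eps)

# Finite-difference "slope" of the tree output with respect to its inputs
slope = (nudged - base) / eps
print(np.abs(slope).max())
```

This is why the hidden-layer error in the class below uses only the rows of `weights2` that belong to the Tanh neurons.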



Only the dense weights and biases of this architecture are updated by backpropagation. The decision trees have no gradient-trained parameters; they are refit during each forward pass. The weight update proceeds as follows:

### Steps to Update Weights

1. **Forward Pass**: Compute the activations for all layers, including the outputs from the decision trees.
2. **Compute Error**: Calculate the error between the predicted output and the true target values.
3. **Backward Pass**: Propagate the error backward through the network to compute gradients for the weights.
4. **Gradient Clipping**: Clip gradients to prevent them from becoming too large (optional but often useful).
5. **Update Weights**: Adjust the weights using the computed gradients and the learning rate.

### Explanation of Each Step

1. **Forward Pass**:
    - Compute the pre-activation \( z1 \) for the hidden neurons.
    - Apply the Tanh activation function to get \( a1 \).
    - Train the decision trees using \( a1 \) and the target labels \( y \). The decision trees' predictions are concatenated with \( a1 \) to form the combined hidden layer.
    - Compute the final output \( z2 \) using the combined hidden layer.

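2. **Compute Error**:
    - Calculate the error between the network's output and the true labels, \( \text{output\_error} = z2 - y \). With \( N \) samples, this is the gradient of the half mean squared error \( \frac{1}{2N}\sum (z2 - y)^2 \) with respect to \( z2 \), up to the factor \( 1/N \) that is applied in the next step.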

3. **Backward Pass**:

    - Compute the gradients for the weights and biases in the output layer, where \( N \) is the number of samples (`X.shape[0]`):

        $$
        d\_weights2 = \frac{\partial \text{Loss}}{\partial \text{weights2}} = \frac{\text{combined\_hidden}^T \cdot \text{output\_error}}{N}
        $$
        $$
        d\_bias2 = \frac{\partial \text{Loss}}{\partial \text{bias2}} = \text{mean}(\text{output\_error}, \text{axis}=0)
        $$

    - Compute the error for the hidden layer and its gradients. Only the first \( \text{hidden\_dim} \) rows of \( \text{weights2} \) are used, because the tree outputs are treated as constants. Here \( \odot \) is the element-wise product and \( 1 - a1^2 \) is the Tanh derivative:
        $$
        \text{hidden\_error} = (\text{output\_error} \cdot \text{weights2}[:\text{hidden\_dim}]^T) \odot (1 - a1^2)
        $$
        $$
        d\_weights1 = \frac{\partial \text{Loss}}{\partial \text{weights1}} = \frac{X^T \cdot \text{hidden\_error}}{N}
        $$
        $$
        d\_bias1 = \frac{\partial \text{Loss}}{\partial \text{bias1}} = \text{mean}(\text{hidden\_error}, \text{axis}=0)
        $$

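4. **Gradient Clipping**:
    - Optionally, clip each gradient entry to a fixed range (the code below uses \( [-1, 1] \)) to prevent exploding gradients. This is clipping by value, applied element-wise, rather than clipping by norm.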

5. **Update Weights and Biases**:
    - Use the computed gradients and the learning rate to update the weights and biases:
        $$
        \text{weights2} \leftarrow \text{weights2} - \text{learning\_rate} \cdot d\_weights2
        $$
        $$
        \text{bias2} \leftarrow \text{bias2} - \text{learning\_rate} \cdot d\_bias2
        $$
        $$
        \text{weights1} \leftarrow \text{weights1} - \text{learning\_rate} \cdot d\_weights1
        $$
        $$
        \text{bias1} \leftarrow \text{bias1} - \text{learning\_rate} \cdot d\_bias1
        $$
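
The output-layer formulas in step 3 can be sanity-checked numerically. The sketch below assumes the loss is the half mean squared error, $\frac{1}{2N}\sum (z2 - y)^2$, which is the loss whose gradient matches `output_error = output - y`. The random `combined_hidden`, the helper `half_mse`, and the step size `eps` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, combined_dim = 8, 20

combined_hidden = rng.normal(size=(N, combined_dim))  # stand-in for [a1, tree outputs]
y = rng.normal(size=N)
weights2 = rng.normal(size=(combined_dim, 1)) * 0.01
bias2 = np.zeros(1)

def half_mse(w, b):
    """Half mean squared error of the linear output layer (assumed loss)."""
    z2 = combined_hidden @ w + b
    return 0.5 * np.mean((z2 - y.reshape(-1, 1)) ** 2)

# Analytic gradients, following the formulas for d_weights2 and d_bias2
output_error = combined_hidden @ weights2 + bias2 - y.reshape(-1, 1)
d_weights2 = combined_hidden.T @ output_error / N
d_bias2 = np.mean(output_error, axis=0)

# Central finite differences for comparison
eps = 1e-6
num_w = np.zeros_like(weights2)
for i in range(combined_dim):
    step = np.zeros_like(weights2)
    step[i, 0] = eps
    num_w[i, 0] = (half_mse(weights2 + step, bias2)
                   - half_mse(weights2 - step, bias2)) / (2 * eps)
num_b = (half_mse(weights2, bias2 + eps) - half_mse(weights2, bias2 - eps)) / (2 * eps)

# Both differences should be close to zero
print(np.max(np.abs(d_weights2 - num_w)), abs(d_bias2[0] - num_b))
```

The same check does not carry over cleanly to `weights1` while the trees are refit on every forward pass, because the tree outputs can jump when the activations change.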

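By following these steps, the weights and biases of the dense layers are updated to reduce the prediction error. The decision trees are not updated by gradient descent: they are refit on the current hidden activations in each forward pass, and the output layer learns how much weight to give their predictions.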
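
Gradient clipping comes in two common forms that behave differently. Clipping by value limits each gradient entry independently, which is what `np.clip` does in the `backward` method below, while clipping by norm rescales the whole gradient so that its direction is preserved. The sketch contrasts the two on a made-up gradient; the helper `clip_by_norm` and the threshold of 1.0 are illustrative choices.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (direction unchanged)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

grad = np.array([3.0, 4.0])             # L2 norm is 5

by_value = np.clip(grad, -1.0, 1.0)     # each entry limited separately: [1.0, 1.0]
by_norm = clip_by_norm(grad, 1.0)       # rescaled to unit norm: approximately [0.6, 0.8]

print(by_value, by_norm)
```

Clipping by value changes the direction of the update (here from a 3:4 ratio to 1:1), which is worth keeping in mind when interpreting the training curve.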

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error

In [5]:
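# Manually load the Boston dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Raw string avoids an invalid-escape warning for the regex separator
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two rows of the raw file; stitch the feature columns back together.
# Note: `[::2, :-1]` drops the last column of the first row of each record, so X ends up
# with 12 columns. Using `[::2, :]` instead would keep all 13 features.
X = np.hstack([raw_df.values[::2, :-1], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]

# Standardize features. Note: the scaler is fit on the full dataset before the split,
# so test-set statistics influence the scaling.
scaler = StandardScaler()
X = scaler.fit_transform(X)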

In [6]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    "Lasso Regression": Lasso(),
    "ElasticNet Regression": ElasticNet(),
    "Support Vector Regression": SVR(),
    "MLP Regressor": MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
    "XGBoost Regressor": XGBRegressor(random_state=42)
}

In [8]:
# Function to train and evaluate a model
def train_evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    r2 = r2_score(y_test, predictions)
    mae = mean_absolute_error(y_test, predictions)
    return r2, mae

# Train and evaluate models
results = {}
for name, model in models.items():
    r2, mae = train_evaluate_model(model, X_train, X_test, y_train, y_test)
    results[name] = {"R² Score": r2, "MAE": mae}

In [9]:
# Custom MLP with 10 trees in hidden layer
class MLPWithDecisionTrees:
    def __init__(self, input_dim, hidden_dim, output_dim, num_trees_hidden):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.num_trees_hidden = num_trees_hidden

        # Initialize weights and biases for the first hidden layer
        self.weights1 = np.random.randn(input_dim, hidden_dim) * 0.01
        self.bias1 = np.zeros(hidden_dim)

        # Initialize weights and biases for the output layer
        self.weights2 = np.random.randn(hidden_dim + num_trees_hidden, output_dim) * 0.01
        self.bias2 = np.zeros(output_dim)

        # Initialize decision trees for the hidden layer
        self.trees_hidden = [DecisionTreeRegressor(max_depth=5, random_state=i) for i in range(num_trees_hidden)]

    def forward(self, X, y_batch):
        # Compute hidden layer activations
        self.z1 = np.dot(X, self.weights1) + self.bias1
        self.a1 = np.tanh(self.z1)  # Tanh activation

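        # Refit every tree on the current activations and the supplied labels, then predict.
        # Because fitting happens inside forward(), calling it with test labels leaks them
        # into the predictions.
        hidden_tree_outputs = np.column_stack(
            [tree.fit(self.a1, y_batch).predict(self.a1) for tree in self.trees_hidden]
        )
        self.combined_hidden = np.hstack((self.a1, hidden_tree_outputs))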

        # Compute output layer activations
        self.z2 = np.dot(self.combined_hidden, self.weights2) + self.bias2
        return self.z2  # Linear output

    def backward(self, X, y, output, learning_rate):
        # Compute the error between the output and the true labels
        output_error = output - y.reshape(-1, 1)

        # Compute gradients for the weights and biases of the output layer
        d_weights2 = np.dot(self.combined_hidden.T, output_error) / X.shape[0]
        d_bias2 = np.mean(output_error, axis=0)

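        # Compute hidden layer error and gradients. Only the weights2 rows tied to the
        # tanh units are used, since the tree outputs are treated as constants.
        hidden_error = np.dot(output_error, self.weights2[:self.hidden_dim].T) * (1 - self.a1 ** 2)  # Tanh derivative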
        d_weights1 = np.dot(X.T, hidden_error) / X.shape[0]
        d_bias1 = np.mean(hidden_error, axis=0)

        # Gradient clipping to prevent exploding gradients.
        # np.clip limits each entry to [-max_grad_norm, max_grad_norm] (by value, not by norm).
        max_grad_norm = 1.0
        d_weights2 = np.clip(d_weights2, -max_grad_norm, max_grad_norm)
        d_bias2 = np.clip(d_bias2, -max_grad_norm, max_grad_norm)
        d_weights1 = np.clip(d_weights1, -max_grad_norm, max_grad_norm)
        d_bias1 = np.clip(d_bias1, -max_grad_norm, max_grad_norm)

        # Update weights and biases using gradient descent
        self.weights2 -= learning_rate * d_weights2
        self.bias2 -= learning_rate * d_bias2
        self.weights1 -= learning_rate * d_weights1
        self.bias1 -= learning_rate * d_bias1

    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            output = self.forward(X, y)
            self.backward(X, y, output, learning_rate)
            loss = np.mean((output - y.reshape(-1, 1)) ** 2)
            if epoch % 100 == 0:
                print(f'Epoch {epoch}, Loss: {loss}')

In [10]:
# Train MLPWithDecisionTrees
input_dim = X_train.shape[1]
hidden_dim = 10
output_dim = 1
num_trees_hidden = 10
epochs = 1000
learning_rate = 0.001

custom_mlp = MLPWithDecisionTrees(input_dim, hidden_dim, output_dim, num_trees_hidden)
custom_mlp.train(X_train, y_train, epochs, learning_rate)
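# Caution: forward() refits the trees on (X_test, y_test), so the test labels leak into
# these predictions and the resulting scores are optimistic.
custom_predictions = custom_mlp.forward(X_test, y_test)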
custom_r2 = r2_score(y_test, custom_predictions)
custom_mae = mean_absolute_error(y_test, custom_predictions)

results["Custom MLP with Trees"] = {"R² Score": custom_r2, "MAE": custom_mae}

Epoch 0, Loss: 611.8313057991296
Epoch 100, Loss: 12.909873483945182
Epoch 200, Loss: 13.07198808020817
Epoch 300, Loss: 12.16671062980413
Epoch 400, Loss: 11.629635180695681
Epoch 500, Loss: 9.889585188286981
Epoch 600, Loss: 10.09015510887839
Epoch 700, Loss: 8.79987603012809
Epoch 800, Loss: 10.582802360832517
Epoch 900, Loss: 10.331503681976471


In [11]:
# Convert results to a DataFrame for better visualization
results_df = pd.DataFrame(results).T
print(results_df)

                             R² Score       MAE
Linear Regression            0.629049  3.530902
Ridge Regression             0.628946  3.527434
Lasso Regression             0.583928  3.797634
ElasticNet Regression        0.576617  3.688701
Support Vector Regression    0.647374  2.798191
MLP Regressor                0.754212  3.019009
Random Forest Regressor      0.886802  2.115863
Gradient Boosting Regressor  0.917226  1.899354
XGBoost Regressor            0.900123  1.931674
Custom MLP with Trees        0.964352  1.116360
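
Note that the score for the custom model comes from `custom_mlp.forward(X_test, y_test)`, which refits every tree on the test labels before predicting, so that row is not directly comparable with the others. A leakage-free evaluation would reuse trees fitted on training data and never touch `y_test`. The following is a minimal sketch of that idea on synthetic data; `hidden_activations`, `fit_trees`, `predict_without_labels` and the random weights are hypothetical helpers and values, not part of the notebook's class.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def hidden_activations(X, weights1, bias1):
    """Tanh activations of the first hidden part."""
    return np.tanh(X @ weights1 + bias1)

def fit_trees(a1_train, y_train, num_trees=10):
    """Fit the trees once, on training activations and training labels only."""
    return [
        DecisionTreeRegressor(max_depth=5, random_state=i).fit(a1_train, y_train)
        for i in range(num_trees)
    ]

def predict_without_labels(X, weights1, bias1, weights2, bias2, trees):
    """Forward pass that only calls tree.predict, so no labels are needed."""
    a1 = hidden_activations(X, weights1, bias1)
    tree_out = np.column_stack([t.predict(a1) for t in trees])
    combined = np.hstack((a1, tree_out))
    return (combined @ weights2 + bias2).ravel()

# Synthetic stand-ins for the train/test split and for already-trained weights
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(60, 13)), rng.normal(size=(20, 13))
y_tr = rng.normal(size=60)
weights1 = rng.normal(size=(13, 10)) * 0.01
bias1 = np.zeros(10)
weights2 = rng.normal(size=(20, 1)) * 0.01
bias2 = np.zeros(1)

trees = fit_trees(hidden_activations(X_tr, weights1, bias1), y_tr)
preds = predict_without_labels(X_te, weights1, bias1, weights2, bias2, trees)
print(preds.shape)
```

Scoring predictions produced this way against `y_test` would give a figure that can be compared fairly with the baseline models.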
