# ExtraTrees classifier

## Lecture 7

### GRA 4160
### Predictive modelling with machine learning

#### Lecturer: Vegard H. Larsen

## Building ExtraTrees (Extremely Randomized Trees)

**Selecting a Subset of Features:** At each node in the tree, both Random Forest and Extra Trees algorithms start by selecting a random subset of the features (or predictors).
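As a rough illustration of this step, the sketch below draws a candidate feature subset for a single node. It assumes the square-root rule (`max_features='sqrt'`) and a dataset with six features; the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 6                          # assumed number of predictors
k = max(1, int(np.sqrt(n_features)))    # subset size under the square-root rule

# Candidate features for ONE node, drawn without replacement;
# a new subset is drawn again at every node of every tree
candidates = rng.choice(n_features, size=k, replace=False)
print(k, candidates)
```

Only the features in `candidates` are considered when this node is split.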

**Determining the Split Point:**

- *Random Forest:* Once the subset of features is selected, the Random Forest algorithm will search for the best possible split point among these features. This involves finding the value that best separates the data according to the target variable, often using a criterion like Gini impurity or entropy in classification tasks. This process is somewhat similar to what a standard decision tree does, but limited to a subset of features.

- *Extra Trees:* In contrast, the Extra Trees algorithm introduces more randomness. After selecting a subset of features, it does not search for the optimal threshold on each feature. Instead, it draws one random split point per candidate feature, typically uniformly between the feature's minimum and maximum value in the node, and then chooses the best of these random candidates according to the same criterion (e.g. Gini impurity). The chosen split is therefore the best of a handful of random candidates, not the best possible split on the data.
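The Extra Trees split rule can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the helper names (`gini`, `weighted_gini`, `extra_trees_split`) and the toy data are assumptions made for this example.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of non-negative integer labels."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Size-weighted Gini impurity of the two children of the split x < threshold."""
    left = x < threshold
    n_left, n = left.sum(), len(y)
    return (n_left * gini(y[left]) + (n - n_left) * gini(y[~left])) / n

def extra_trees_split(X, y, feature_subset, rng):
    """Draw one random threshold per candidate feature and keep the best-scoring one."""
    best = None
    for j in feature_subset:
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            continue  # constant feature in this node, cannot split on it
        threshold = rng.uniform(lo, hi)  # random threshold, no search
        score = weighted_gini(X[:, j], y, threshold)
        if best is None or score < best[2]:
            best = (j, threshold, score)
    return best

rng = np.random.default_rng(0)
# Toy data (made up): feature 0 separates the classes, feature 1 is pure noise
X_toy = np.column_stack([
    np.r_[rng.normal(0, 1, 50), rng.normal(4, 1, 50)],
    rng.normal(0, 1, 100),
])
y_toy = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]

feature, threshold, score = extra_trees_split(X_toy, y_toy, feature_subset=[0, 1], rng=rng)
print(feature, threshold, score)
```

A Random Forest node would instead evaluate every possible threshold on each candidate feature and keep the overall best, which is more expensive and tends to produce more similar trees.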

**Impact of Random Splits:**
This increased randomness in choosing splits can lead to more diversified trees within the ensemble, as it reduces the likelihood of creating similar trees even if they are based on the same training data.

As a result, the individual trees in an Extra Trees ensemble can have higher bias compared to those in a Random Forest, but when combined, the ensemble as a whole often has lower variance. This is because the random splits lead to less correlated trees, which is beneficial in an ensemble method.
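A standard result makes the role of correlation explicit. Assume, as a simplification, that the $B$ tree predictions $T_1(x), \dots, T_B(x)$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration). The variance of their average is then

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

The second term shrinks as more trees are added, but the first does not. Lowering $\rho$, which is what the random split points aim for, is therefore the main way to reduce the variance of a large ensemble.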

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data into a pandas dataframe
df = pd.read_csv("../data/titanic/train.csv")

# Preprocess the data: drop rows with a missing value in any column,
# then encode Sex as 1 for male and 0 otherwise
df = df.dropna()
df['Sex'] = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)

# Select the features and the target
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = df['Survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [10]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create an Extra Trees classifier object
etc = ExtraTreesClassifier(n_estimators=100, max_depth=3, random_state=1)

# Train the Extra Trees classifier on the training data
etc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = etc.predict(X_test)

# Evaluate the performance of the Extra Trees classifier

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.7391304347826086
Precision: 0.7878787878787878
Recall: 0.8387096774193549


In [3]:
print(f'Survival rate in test set: {y_test.sum()/len(y_test):.2f}')

Survival rate in test set: 0.67
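The survival rate is worth checking because it sets a baseline: a classifier that always predicts the majority class would already reach an accuracy equal to that rate. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`. The labels are illustrative only (46 samples with 31 positives, a positive rate of about 0.67), and the baseline is fitted and scored on the same labels for brevity; in practice it would be fitted on the training set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels only: 31 positives and 15 negatives (positive rate about 0.67)
y_demo = np.array([1] * 31 + [0] * 15)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # 0.67
print(f"Baseline recall: {baseline_recall:.2f}")      # 1.00
```

A model is only useful for prediction if it beats this accuracy, and a very high recall on its own can simply mean that the model predicts the positive class almost everywhere.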


## Can you build an Extra Trees classifier using only the DecisionTreeClassifier class?

In [13]:
import numpy as np
from scipy.stats import mode
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class SimpleRandomSplitTree(BaseEstimator, ClassifierMixin):
    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    def fit(self, X, y):
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)
        # Check that y has acceptable targets
        check_classification_targets(y)

        self.classes_, y = np.unique(y, return_inverse=True)
        self.n_classes_ = len(self.classes_)
        self.tree_ = self._grow_tree(X, y, depth=0)
        return self

    def _grow_tree(self, X, y, depth):
        # Stopping criteria: if all targets are the same or if maximum depth is reached
        if len(set(y)) == 1 or (self.max_depth is not None and depth >= self.max_depth):
            return np.argmax(np.bincount(y))

        n_samples, n_features = X.shape
    
        # Attempt to split until valid split is found or decide it's a leaf node
        for _ in range(n_features):
            feature_idx = np.random.randint(0, n_features)
            unique_values = np.unique(X[:, feature_idx])

            # If there's less than 2 unique values, can't split on this feature
            if unique_values.size < 2:
                continue

            split_value = np.random.uniform(X[:, feature_idx].min(), X[:, feature_idx].max())

            left_idx = X[:, feature_idx] < split_value
            right_idx = ~left_idx

            # Check if the split actually divides the dataset
            if np.any(left_idx) and np.any(right_idx):
                left_child = self._grow_tree(X[left_idx], y[left_idx], depth + 1)
                right_child = self._grow_tree(X[right_idx], y[right_idx], depth + 1)
                return (feature_idx, split_value, left_child, right_child)

        # If no valid split found, return the most common target as leaf node
        return np.argmax(np.bincount(y))

    def predict(self, X):
        # Input validation
        X = check_array(X)
        check_is_fitted(self)

        predictions = [self._predict_one(x, self.tree_) for x in X]
        return self.classes_[np.array(predictions)]

    def _predict_one(self, x, node):
        # If we have a leaf node
        if not isinstance(node, tuple):
            return node

        # Decide whether to follow left or right child
        feature_idx, split_value, left_child, right_child = node
        if x[feature_idx] < split_value:
            return self._predict_one(x, left_child)
        else:
            return self._predict_one(x, right_child)

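# Ensemble of random-split trees combined by majority vote.
# Simplification: each tree receives one random feature subset for the whole tree,
# whereas Extra Trees draws a new candidate subset at every node.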
class SimpleExtraTreesClassifier:
    def __init__(self, n_estimators=100, max_depth=None, max_features='sqrt'):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.trees = []

    def fit(self, X, y):
        # Start from an empty ensemble so that calling fit twice does not accumulate trees
        self.trees = []
        n_features = X.shape[1]
        for _ in range(self.n_estimators):
            tree = SimpleRandomSplitTree(max_depth=self.max_depth)

            # Randomly select features (always at least one)
            if self.max_features == 'sqrt':
                size = max(1, int(np.sqrt(n_features)))
            elif self.max_features == 'log2':
                size = max(1, int(np.log2(n_features)))
            else:
                size = n_features

            features_idx = np.random.choice(range(n_features), size=size, replace=False)
            X_subset = X.iloc[:, features_idx]

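            # Train the tree and remember which columns it was trained on
            tree.fit(X_subset, y)
            self.trees.append((tree, features_idx))

        # Return self, following the scikit-learn convention for fit
        return self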

    def predict(self, X):
        predictions = np.zeros((self.n_estimators, X.shape[0]), dtype=np.int64)
        for i, (tree, features_idx) in enumerate(self.trees):
            predictions[i] = tree.predict(X.iloc[:, features_idx])

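        # Majority voting across trees. ravel() handles both the (1, n) and the (n,)
        # result shapes that scipy.stats.mode returns, depending on the SciPy version
        final_predictions = np.asarray(mode(predictions, axis=0).mode).ravel()
        return final_predictions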

In [14]:
# Example Usage
clf = SimpleExtraTreesClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)


In [15]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.6739130434782609
Precision: 0.6818181818181818
Recall: 0.967741935483871
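The question in the heading can also be answered more literally. `DecisionTreeClassifier` accepts `splitter="random"`, which draws random thresholds for the candidate features at each node and keeps the best of them. The sketch below builds an ensemble on top of that. The class name `DTExtraTrees` is hypothetical, the data is synthetic, and the whole thing is a minimal illustration under these assumptions, not a replacement for `ExtraTreesClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

class DTExtraTrees:
    """Hypothetical Extra-Trees-style ensemble built only from DecisionTreeClassifier."""

    def __init__(self, n_estimators=100, max_depth=None, max_features="sqrt", random_state=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        self.trees_ = []
        for _ in range(self.n_estimators):
            tree = DecisionTreeClassifier(
                splitter="random",                # random thresholds instead of a search
                max_features=self.max_features,   # new feature subset at every node
                max_depth=self.max_depth,
                random_state=rng.randint(np.iinfo(np.int32).max),
            )
            # Every tree sees the whole training set (no bootstrap sampling)
            self.trees_.append(tree.fit(X, y))
        self.classes_ = self.trees_[0].classes_
        return self

    def predict_proba(self, X):
        # Average the class probabilities over all trees
        return np.mean([tree.predict_proba(X) for tree in self.trees_], axis=0)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Synthetic data, used here only so the example is self-contained
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

model = DTExtraTrees(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_hat = model.predict(X_te)
acc = accuracy_score(y_te, y_hat)
print("Accuracy on synthetic data:", acc)
```

Unlike the hand-written trees above, this version samples features at every node and picks the best random candidate, which is closer to the algorithm described at the start of the lecture.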
