How to get the leaf node for every record in a data frame, for every tree in a gradient boosting classifier; specifically, how to implement the methods in the referenced paper #7

Open
WMJi opened this issue Dec 4, 2017 · 1 comment


WMJi commented Dec 4, 2017

Just reading this great paper and trying to implement this:

... We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We use 1-of-K coding of this type of features. For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in second subtree, the overall input to the linear classifier will be the binary vector [0, 1, 0, 1, 0], where the first 3 entries correspond to the leaves of the first subtree and last 2 to those of the second subtree ...
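To make the encoding concrete, here is a minimal sketch that reproduces the paper's toy example in numpy (the tree sizes and landed leaves are taken straight from the quote; all variable names are illustrative):

import numpy as np

# Toy example from the quote: 2 subtrees with 3 and 2 leaves respectively.
leaves_per_tree = [3, 2]
# The instance lands in leaf 2 of the first subtree and leaf 1 of the second
# (1-based leaf indices, as in the paper's example).
landed_leaf = [2, 1]

# Build the 1-of-K (one-hot) input vector for the linear classifier.
encoded = np.zeros(sum(leaves_per_tree), dtype=int)
offset = 0
for n_leaves, leaf in zip(leaves_per_tree, landed_leaf):
    encoded[offset + leaf - 1] = 1
    offset += n_leaves

print(encoded)  # [0 1 0 1 0]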

Anyone know how I can predict a bunch of rows and, for each of those rows, get the selected leaf for each tree in the ensemble? For this use case I don't care what the node represents, just its index. I had a look at the source and could not quickly see anything obvious. I can see that I need to iterate over the trees and do something like this:

for sample in X_test:
    for tree in gbc.estimators_:
        leaf = tree.leaf_index(sample) # This is the function I need but don't think exists.
        ...

The following function goes beyond identifying the selected leaf in a single decision tree and implements the application from the referenced paper: as in the paper, I use the GBC for feature engineering.

import numpy as np

def makeTreeBins(gbc, X):
    '''
    Takes a fitted GradientBoostingClassifier (gbc) and a data frame (X).
    Returns a numpy array of shape (n_rows(X), n_estimators), where each row
    holds the terminal-node ids that record X[i] falls into across all
    estimators in the GBC.  Note: each tree produces at most 2^max_depth
    terminal nodes.  A per-estimator prefix is added to each terminal-node id
    so the ids are unique across estimators and can be used as feature ids
    in other classifiers.
    '''
    for i, dt_i in enumerate(gbc.estimators_):
        # Prefix must be an integer; this assumes node ids stay below 100,
        # so deeper trees would need a larger multiplier to avoid collisions.
        prefix = (i + 2) * 100
        # dt_i[0] picks the single tree per stage (binary classification);
        # tree_.apply returns the terminal-node id for each sample.
        nds = prefix + dt_i[0].tree_.apply(np.array(X).astype(np.float32))
        if i == 0:
            nd_mat = nds.reshape(len(nds), 1)
        else:
            nd_mat = np.hstack((nd_mat, nds.reshape(len(nds), 1)))
    return nd_mat
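For the paper's end-to-end use, the leaf ids can then be 1-of-K encoded and fed to a linear classifier. A rough sketch of that pipeline (X_train, y_train, X_test are assumed to exist, and the hyperparameters are arbitrary; OneHotEncoder treats each column of leaf ids as one categorical feature):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

gbc = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gbc.fit(X_train, y_train)

# Leaf ids per estimator, as produced by makeTreeBins above.
train_leaves = makeTreeBins(gbc, X_train)
test_leaves = makeTreeBins(gbc, X_test)

# 1-of-K encode the leaf ids and train the downstream linear classifier.
enc = OneHotEncoder(handle_unknown='ignore')
lr = LogisticRegression()
lr.fit(enc.fit_transform(train_leaves), y_train)
preds = lr.predict_proba(enc.transform(test_leaves))[:, 1]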

Ref: He et al., "Practical Lessons from Predicting Clicks on Ads at Facebook", ADKDD 2014.


WMJi commented Dec 4, 2017

Now, a simple gbc.apply(X) would work!
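For anyone landing here later: scikit-learn's GradientBoostingClassifier.apply returns the leaf index of every sample in every estimator, so the whole extraction collapses to a couple of lines (a sketch; the fitted gbc and X_test are assumed from the snippets above):

# apply() returns an array of shape (n_samples, n_estimators, n_classes);
# for binary classification n_classes is 1, so drop that axis.
leaves = gbc.apply(X_test)[:, :, 0]

# Each column is a categorical feature: the leaf a sample lands in for
# that estimator.  One-hot encoding these columns reproduces the paper's
# input to the linear classifier.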
