How to get the leaf node for every record in a data frame, for every tree in a gradient boosting classifier; specifically, how to implement the methods in the referenced paper #7

Open
WMJi opened this issue Dec 4, 2017 · 1 comment


WMJi commented Dec 4, 2017

Just reading this great paper and trying to implement this:

... We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We use 1-of-K coding of this type of features. For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in second subtree, the overall input to the linear classifier will be the binary vector [0, 1, 0, 1, 0], where the first 3 entries correspond to the leaves of the first subtree and last 2 to those of the second subtree ...
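To make the encoding concrete, here is a minimal sketch that reproduces the paper's toy example in numpy (the tree sizes and landed leaves are taken straight from the quote; all variable names are illustrative):

import numpy as np

# Toy example from the quote: 2 subtrees with 3 and 2 leaves respectively.
leaves_per_tree = [3, 2]
# The instance lands in leaf 2 of the first subtree and leaf 1 of the second
# (1-based leaf indices, as in the paper's example).
landed_leaf = [2, 1]

# Build the 1-of-K (one-hot) input vector for the linear classifier.
encoded = np.zeros(sum(leaves_per_tree), dtype=int)
offset = 0
for n_leaves, leaf in zip(leaves_per_tree, landed_leaf):
    encoded[offset + leaf - 1] = 1
    offset += n_leaves

print(encoded)  # [0 1 0 1 0]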

Anyone know how I can predict a bunch of rows and, for each of those rows, get the selected leaf for each tree in the ensemble? For this use case I don't care what the node represents, just its index. I had a look at the source and could not quickly see anything obvious. I can see that I need to iterate over the trees and do something like this:

for sample in X_test:
    for tree in gbc.estimators_:
        leaf = tree.leaf_index(sample) # This is the function I need but don't think exists.
        ...

The following function goes beyond identifying the selected leaf in a single decision tree and implements the application from the referenced paper: as in the paper, I use the GBC for feature engineering.

import numpy as np

def makeTreeBins(gbc, X):
    '''
    Takes a fitted GradientBoostingClassifier (gbc) and a data frame (X).
    Returns a numpy array of shape (n_rows(X), n_estimators), where each row
    holds the terminal-node ids that record X[i] falls into across all
    estimators in the GBC.  Note: each tree produces at most 2^max_depth
    terminal nodes.  A per-estimator prefix is added to each terminal-node id
    so the ids are unique across estimators and can be used as feature ids
    in other classifiers.
    '''
    for i, dt_i in enumerate(gbc.estimators_):
        # Prefix must be an integer; this assumes node ids stay below 100,
        # so deeper trees would need a larger multiplier to avoid collisions.
        prefix = (i + 2) * 100
        # dt_i[0] picks the single tree per stage (binary classification);
        # tree_.apply returns the terminal-node id for each sample.
        nds = prefix + dt_i[0].tree_.apply(np.array(X).astype(np.float32))
        if i == 0:
            nd_mat = nds.reshape(len(nds), 1)
        else:
            nd_mat = np.hstack((nd_mat, nds.reshape(len(nds), 1)))
    return nd_mat
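For the paper's end-to-end use, the leaf ids can then be 1-of-K encoded and fed to a linear classifier. A rough sketch of that pipeline (X_train, y_train, X_test are assumed to exist, and the hyperparameters are arbitrary; OneHotEncoder treats each column of leaf ids as one categorical feature):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

gbc = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gbc.fit(X_train, y_train)

# Leaf ids per estimator, as produced by makeTreeBins above.
train_leaves = makeTreeBins(gbc, X_train)
test_leaves = makeTreeBins(gbc, X_test)

# 1-of-K encode the leaf ids and train the downstream linear classifier.
enc = OneHotEncoder(handle_unknown='ignore')
lr = LogisticRegression()
lr.fit(enc.fit_transform(train_leaves), y_train)
preds = lr.predict_proba(enc.transform(test_leaves))[:, 1]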

Ref: He et al., "Practical Lessons from Predicting Clicks on Ads at Facebook", ADKDD 2014.


WMJi commented Dec 4, 2017

Now, a simple gbc.apply(X) would work!
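For anyone landing here later: scikit-learn's GradientBoostingClassifier.apply returns the leaf index of every sample in every estimator, so the whole extraction collapses to a couple of lines (a sketch; the fitted gbc and X_test are assumed from the snippets above):

# apply() returns an array of shape (n_samples, n_estimators, n_classes);
# for binary classification n_classes is 1, so drop that axis.
leaves = gbc.apply(X_test)[:, :, 0]

# Each column is a categorical feature: the leaf a sample lands in for
# that estimator.  One-hot encoding these columns reproduces the paper's
# input to the linear classifier.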
