
Track which DataFrame Column corresponds to which Array Column(s) after Transform #13

Closed
sveitser opened this issue Aug 13, 2014 · 5 comments


@sveitser

I think it would be useful for feature selection if it were possible to keep track of which DataFrame columns were mapped to which array columns during the transformation, so that one could use, for instance, the feature_importances_ of ensemble methods in sklearn.

Is there a straightforward way to do this right now? I looked into it a bit but didn't find a common way to get the necessary information during fitting of the sklearn transforms. The best way I can currently think of is to do the inspection separately for each sklearn transform, i.e. use self.feature_names_ for DictVectorizer, self.classes_ for LabelBinarizer, etc.

I'm thinking there must be a better way to do this.
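For illustration, the per-transformer inspection mentioned above looks roughly like this (the sample data is made up):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import LabelBinarizer

    # DictVectorizer records its output column names in feature_names_
    dv = DictVectorizer(sparse=False)
    dv.fit([{'height': 1.7, 'color': 'red'}, {'height': 1.8, 'color': 'green'}])
    print(dv.feature_names_)  # ['color=green', 'color=red', 'height']

    # LabelBinarizer records the classes (one output column each) in classes_
    lb = LabelBinarizer()
    lb.fit(['red', 'green', 'blue'])
    print(list(lb.classes_))  # ['blue', 'green', 'red']

Each transformer uses a different attribute name, which is exactly the problem: there is no single attribute to query generically.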

@paulgb
Collaborator

paulgb commented Aug 13, 2014

Last time I looked at sklearn (a few minor versions ago) there was no common way for transformations to indicate which columns corresponded to which names. See #7 for some discussion.

Separate implementations for each sklearn transformer would work, but I'm not keen on having a bunch of special cases. That said, I'd accept a patch for that as long as it didn't break anything else.

@dukebody
Collaborator

dukebody commented Aug 9, 2015

I think it is possible to track at least which columns of the final matrix correspond to each variable. Since the results of transforming each set of columns are hstacked at the end, we could record the range of columns that resulted from the transformation of each variable in a "feature_indices_" attribute on the mapper after transformation.

The meaning could be exactly the same as in OneHotEncoder:

    feature_indices_ : array of shape (n_features,)
        Indices to feature ranges. Feature i in the original data is mapped to
        features from feature_indices_[i] to feature_indices_[i+1] (and then
        potentially masked by active_features_ afterwards).
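A minimal sketch of how such a feature_indices_ attribute would let us slice the transformed matrix back into per-variable blocks (the numbers here are hypothetical, not from a real mapper):

    import numpy as np

    # Hypothetical: three original variables whose transformations produced
    # 1, 3 and 2 output columns respectively.
    n_cols_per_feature = [1, 3, 2]

    # feature_indices_ in the OneHotEncoder sense: range boundaries, so
    # feature i occupies columns feature_indices_[i]:feature_indices_[i+1].
    feature_indices_ = np.concatenate([[0], np.cumsum(n_cols_per_feature)])
    print(feature_indices_)  # [0 1 4 6]

    X = np.arange(24).reshape(4, 6)  # stand-in for the transformed matrix

    # Columns that came from the second variable (the 3-column one):
    block = X[:, feature_indices_[1]:feature_indices_[2]]
    print(block.shape)  # (4, 3)
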

@sveitser could you implement that?

@Motorrat

Motorrat commented Sep 7, 2016

I know there is work being done on this issue, so I just wanted to vote for it. Here is an example of boilerplate that could be used for many classification jobs driven from a database table. It would be very handy to be able to trace back which features (columns) made it into the final set.

    # X and y are created from a DataFrame
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import precision_score, recall_score, f1_score
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, LabelBinarizer
    from sklearn_pandas import DataFrameMapper

    labels = list(X.select_dtypes(include=['object']).columns)
    booleans = list(X.select_dtypes(include=['bool']).columns)
    integers = list(X.select_dtypes(include=['int64']).columns)
    floats = list(X.select_dtypes(include=['float64']).columns)

    mapping = [(b, None) for b in booleans]
    mapping.extend((f, StandardScaler()) for f in floats)
    mapping.extend((i, StandardScaler()) for i in integers)  # or LabelBinarizer
    mapping.extend((l, LabelBinarizer()) for l in labels)

    mapper = DataFrameMapper(mapping)

    # the below is illustrative
    clf_pipeline = Pipeline([
        ('map', mapper),
        ('feature_selection', SelectKBest(f_classif, k=5)),
        ('clf', SGDClassifier(class_weight='balanced')),
    ])
    y_pred = cross_val_predict(clf_pipeline, X, y, cv=8, n_jobs=-1)
    print("Precision: %1.2f, Recall: %1.2f, F1: %1.2f" %
          (precision_score(y, y_pred), recall_score(y, y_pred), f1_score(y, y_pred)))

Via clf_pipeline.named_steps['feature_selection'].get_support() we can see what SelectKBest selected, but as I understand it, there is no way to trace those columns back to the original X through the DataFrameMapper.
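If the mapper exposed its output column names in order, the boolean mask from get_support() could be mapped back by position; a sketch with made-up names and mask:

    import numpy as np

    # Hypothetical ordered output-column names of the mapper (one per column
    # of the transformed matrix, with binarized labels already expanded).
    feature_names = ['age', 'income', 'color_red', 'color_green', 'color_blue']

    # Stand-in for SelectKBest(...).get_support(): a boolean mask over the
    # transformed columns (values invented for illustration).
    support = np.array([True, False, True, False, True])

    selected = [name for name, keep in zip(feature_names, support) if keep]
    print(selected)  # ['age', 'color_red', 'color_blue']
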

@Motorrat

Motorrat commented Sep 15, 2016

This is my workaround:

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    def binarize_label_columns(X):
        # Keep numerical columns and append binarized labels with new column
        # names. We do this outside the pipeline because LabelBinarizer
        # doesn't keep the new column names (classes) for the column lineage.
        numericals = list(X.select_dtypes(include=['int64', 'float64']).columns)
        labels = list(X.select_dtypes(include=['object']).columns)
        N = X[numericals]
        for l in labels:
            lb = LabelBinarizer()
            L = pd.DataFrame(lb.fit_transform(X[l]))
            # for a binary label LabelBinarizer returns a single column,
            # even though classes_ has two entries
            if len(lb.classes_) == 2:
                L.columns = [l + "_" + lb.classes_[1]]
            else:
                L.columns = [l + "_" + v for v in lb.classes_]
            N = pd.concat([N, L], axis=1)
        return N
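Note that pandas alone can produce the same `column_value` lineage via get_dummies, which may be a simpler workaround in some cases (sample data made up):

    import pandas as pd

    X = pd.DataFrame({
        'height': [1.7, 1.8, 1.6],
        'color': ['red', 'green', 'red'],
    })

    # get_dummies names each dummy "<column>_<value>", preserving lineage
    N = pd.get_dummies(X, columns=['color'])
    print(list(N.columns))  # ['height', 'color_green', 'color_red']
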

@dukebody
Collaborator

dukebody commented Apr 9, 2017

Good news! This functionality was addressed in 2fc6286.

@dukebody dukebody closed this as completed Apr 9, 2017