
Track which DataFrame Column corresponds to which Array Column(s) after Transform #13

Closed
sveitser opened this issue Aug 13, 2014 · 5 comments


@sveitser

I think it would be useful for feature selection if it were possible to keep track of which DataFrame columns were mapped to which array columns during the transformation, so that one could use, for instance, the feature_importances_ of ensemble methods in sklearn.

Is there a straightforward way to do this right now? I looked into it a bit but didn't find a common way to get the necessary information during fitting of the sklearn transforms. The best way I can currently think of is to do the inspection separately for each sklearn transform, i.e. use self.feature_names_ for DictVectorizer, self.classes_ for LabelBinarizer, etc.

I'm thinking there must be a better way to do this.
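For illustration, the per-transformer inspection mentioned above looks roughly like this (the sample data is made up):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import LabelBinarizer

    # DictVectorizer records its output column names in feature_names_
    dv = DictVectorizer(sparse=False)
    dv.fit([{'height': 1.7, 'color': 'red'}, {'height': 1.8, 'color': 'green'}])
    print(dv.feature_names_)  # ['color=green', 'color=red', 'height']

    # LabelBinarizer records the classes (one output column each) in classes_
    lb = LabelBinarizer()
    lb.fit(['red', 'green', 'blue'])
    print(list(lb.classes_))  # ['blue', 'green', 'red']

Each transformer uses a different attribute name, which is exactly the problem: there is no single attribute to query generically.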

@paulgb
Collaborator

paulgb commented Aug 13, 2014

Last time I looked at sklearn (a few minor versions ago) there was no common way for transformations to indicate which columns corresponded to which names. See #7 for some discussion.

Separate implementations for each sklearn transformer would work, but I'm not keen on having a bunch of special cases. That said, I'd accept a patch for that as long as it didn't break anything else.

@dukebody
Collaborator

dukebody commented Aug 9, 2015

I think it is possible to track at least which columns of the final matrix correspond to each variable. Since the results of transforming each set of columns are hstacked at the end, we could record the range of columns that resulted from the transformation of each variable in a "feature_indices_" attribute on the mapper after transformation.

The meaning could be exactly the same as in OneHotEncoder:

    feature_indices_ : array of shape (n_features,)
        Indices to feature ranges. Feature i in the original data is mapped to
        features from feature_indices_[i] to feature_indices_[i+1] (and then
        potentially masked by active_features_ afterwards).
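A minimal sketch of how such a feature_indices_ attribute would let us slice the transformed matrix back into per-variable blocks (the numbers here are hypothetical, not from a real mapper):

    import numpy as np

    # Hypothetical: three original variables whose transformations produced
    # 1, 3 and 2 output columns respectively.
    n_cols_per_feature = [1, 3, 2]

    # feature_indices_ in the OneHotEncoder sense: range boundaries, so
    # feature i occupies columns feature_indices_[i]:feature_indices_[i+1].
    feature_indices_ = np.concatenate([[0], np.cumsum(n_cols_per_feature)])
    print(feature_indices_)  # [0 1 4 6]

    X = np.arange(24).reshape(4, 6)  # stand-in for the transformed matrix

    # Columns that came from the second variable (the 3-column one):
    block = X[:, feature_indices_[1]:feature_indices_[2]]
    print(block.shape)  # (4, 3)
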

@sveitser could you implement that?

@Motorrat

Motorrat commented Sep 7, 2016

I know there is work being done on this issue, so I just wanted to vote for it. Here is an example of boilerplate that could be used for many classification jobs driven from a database table. It would be very handy to be able to trace back which features (columns) made it into the final set.

    # X and y are created from a DataFrame
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import precision_score, recall_score, f1_score
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, LabelBinarizer
    from sklearn_pandas import DataFrameMapper

    labels = list(X.select_dtypes(include=['object']).columns)
    booleans = list(X.select_dtypes(include=['bool']).columns)
    integers = list(X.select_dtypes(include=['int64']).columns)
    floats = list(X.select_dtypes(include=['float64']).columns)

    mapping = [(b, None) for b in booleans]
    mapping.extend((f, StandardScaler()) for f in floats)
    mapping.extend((i, StandardScaler()) for i in integers)  # or LabelBinarizer
    mapping.extend((l, LabelBinarizer()) for l in labels)

    mapper = DataFrameMapper(mapping)

    # the below is illustrative
    clf_pipeline = Pipeline([
        ('map', mapper),
        ('feature_selection', SelectKBest(f_classif, k=5)),
        ('clf', SGDClassifier(class_weight='balanced')),
    ])
    y_pred = cross_val_predict(clf_pipeline, X, y, cv=8, n_jobs=-1)
    print("Precision: %1.2f, Recall: %1.2f, F1: %1.2f" %
          (precision_score(y, y_pred), recall_score(y, y_pred), f1_score(y, y_pred)))

Via clf_pipeline.named_steps['feature_selection'].get_support() we can see what SelectKBest selected, but as I understand it, there is no way to trace those columns back to the original X through the DataFrameMapper.
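If the mapper exposed its output column names in order, the boolean mask from get_support() could be mapped back by position; a sketch with made-up names and mask:

    import numpy as np

    # Hypothetical ordered output-column names of the mapper (one per column
    # of the transformed matrix, with binarized labels already expanded).
    feature_names = ['age', 'income', 'color_red', 'color_green', 'color_blue']

    # Stand-in for SelectKBest(...).get_support(): a boolean mask over the
    # transformed columns (values invented for illustration).
    support = np.array([True, False, True, False, True])

    selected = [name for name, keep in zip(feature_names, support) if keep]
    print(selected)  # ['age', 'color_red', 'color_blue']
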

@Motorrat

Motorrat commented Sep 15, 2016

This is my workaround:

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    def binarize_label_columns(X):
        # Keep numerical columns and append binarized labels with new column
        # names. We do this outside the pipeline because LabelBinarizer
        # doesn't keep the new column names (classes) for the column lineage.
        numericals = list(X.select_dtypes(include=['int64', 'float64']).columns)
        labels = list(X.select_dtypes(include=['object']).columns)
        N = X[numericals]
        for l in labels:
            lb = LabelBinarizer()
            L = pd.DataFrame(lb.fit_transform(X[l]))
            # for a binary label LabelBinarizer returns a single column,
            # even though classes_ has two entries
            if len(lb.classes_) == 2:
                L.columns = [l + "_" + lb.classes_[1]]
            else:
                L.columns = [l + "_" + v for v in lb.classes_]
            N = pd.concat([N, L], axis=1)
        return N
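Note that pandas alone can produce the same `column_value` lineage via get_dummies, which may be a simpler workaround in some cases (sample data made up):

    import pandas as pd

    X = pd.DataFrame({
        'height': [1.7, 1.8, 1.6],
        'color': ['red', 'green', 'red'],
    })

    # get_dummies names each dummy "<column>_<value>", preserving lineage
    N = pd.get_dummies(X, columns=['color'])
    print(list(N.columns))  # ['height', 'color_green', 'color_red']
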

@dukebody
Collaborator

dukebody commented Apr 9, 2017

Good news! This functionality was addressed in 2fc6286.

@dukebody dukebody closed this as completed Apr 9, 2017