Name columns #70

jph00 · 2017-01-10T18:36:16Z

It would be nice to have a way to specify the names of the columns created by a transform, so that later on you could pass mapper.names (or similar) to any function that expects a list of column names (e.g. variable importance), or use it in charts where you want to label the columns with their names.

This could default to the name of the pandas column that created it (if there's only one input and one output), to the input column names joined with '_' if there are multiple inputs, and to the name suffixed with '_1', '_2', etc. if there are multiple outputs.
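The default naming proposed above can be sketched in a few lines. This is only an illustration of the convention, not the library's implementation; the helper name `default_names` and its parameters are hypothetical.

```python
def default_names(input_columns, n_outputs):
    """Hypothetical helper: derive output column names from input column names.

    - one input, one output: the input column name
    - several inputs: the input names joined with '_'
    - several outputs: the base name suffixed with '_1', '_2', ...
    """
    if isinstance(input_columns, str):
        input_columns = [input_columns]
    base = "_".join(input_columns)
    if n_outputs == 1:
        return [base]
    # Suffixes start at 1, following the numbering proposed in this issue.
    return [f"{base}_{i}" for i in range(1, n_outputs + 1)]


print(default_names("age", 1))            # ['age']
print(default_names(["lat", "lon"], 1))   # ['lat_lon']
print(default_names("pet", 3))            # ['pet_1', 'pet_2', 'pet_3']
```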

dukebody · 2017-01-14T18:54:52Z

I like this. We can make the DataFrameMapper optionally return a pandas dataframe with the naming conventions you mention. This way you could easily get the column names, and also perform slicing and dicing on the output easily.

Do you believe this is something you could implement? I can help review the PR.

dukebody · 2017-01-14T18:59:27Z

You can find some code I wrote to keep track of how many output columns a feature was expanded to here: https://github.com/paulgb/sklearn-pandas/pull/56/files

There is also some discussion about that in #54.

jph00 · 2017-01-15T00:49:54Z

OK here's a PR for the functionality. Let me know if you want me to make any changes, or feel free to edit it however you like.

dukebody · 2017-01-16T18:48:24Z

@jph00 Brilliant! Good implementation and tests. :) I just rebased and added some documentation at #73. Will merge as soon as you give your OK.

jph00 · 2017-01-17T00:08:48Z

Great! Go right ahead - and thanks.

dukebody · 2017-01-17T08:14:39Z

A pleasure. :) Closing, as I understand this feature is covered by the fact that pandas dataframes have named columns.

arnau126 · 2017-01-17T11:06:57Z

I have a custom Transformer, where classes_ is the unique values of the fitting data. My transformer doesn't alter the number of columns of the data to transform (unlike LabelBinarizer which from 1 column generates many, for example).

In this case I think that the following lines will break the transform:

if hasattr(t, 'classes_') and (len(t.classes_)>2):
    return [c + '_' + o for o in t.classes_]

Because in my case the number of classes is not equal to the number of columns.

Might I be using classes_ incorrectly? Or is it a use case that should be handled?

dukebody · 2017-01-17T12:03:09Z

@arnau126 The code to generate the column names is just inferring them because some sklearn transformers use this internal attribute as the unique class names, and generate len(classes_) columns.

I believe the best solution here is to check that the number of columns of the transformer output is equal to len(classes_) and, if not, don't use the inferred naming.
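A minimal sketch of that check, assuming a hypothetical helper `infer_names` that receives the base column name, the fitted transformer, and the number of columns in its output. The positional fallback naming is an assumption made for illustration, not necessarily what the library does.

```python
from types import SimpleNamespace


def infer_names(name, transformer, n_columns):
    """Hypothetical helper: use classes_ for naming only when it matches the output width."""
    classes = getattr(transformer, "classes_", None)
    if classes is not None and len(classes) > 2 and len(classes) == n_columns:
        # One output column per class, e.g. LabelBinarizer with more than 2 classes.
        return [f"{name}_{c}" for c in classes]
    if n_columns == 1:
        return [name]
    # Assumed fallback: positional suffixes when classes_ cannot be trusted.
    return [f"{name}_{i}" for i in range(1, n_columns + 1)]


# Stand-ins for fitted transformers; only the classes_ attribute matters here.
binarizer_like = SimpleNamespace(classes_=["cat", "dog", "fish"])
custom = SimpleNamespace(classes_=["a", "b", "c", "d"])

print(infer_names("pet", binarizer_like, 3))  # ['pet_cat', 'pet_dog', 'pet_fish']
print(infer_names("grade", custom, 1))        # ['grade']
```

The second call mirrors the custom transformer described above: it exposes classes_ but still outputs a single column, so the inferred per-class names are skipped.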

@arnau126 do you believe you can create a PR for this? Tests are run using tox.

arnau126 · 2017-01-20T10:55:16Z

@dukebody Yes, it is the solution I was thinking of.
PR: #74.

dcbb · 2017-01-31T08:55:42Z

I came across this discussion looking for column names to interpret feature importance of a classifier.

Am I correct to assume that using the dataframe output is currently the only way to get the column names?

For what I want to do I'm actually quite happy with matrix output, but I still need the column names. (I'm using a bunch of LabelBinarizers – this is the reason why feature importance is hard to make sense of without naming).
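To illustrate the use case: once a list of output column names exists, pairing it with a classifier's feature importances is straightforward. The sketch below uses a hypothetical helper `rank_features` and made-up values; it does not depend on any particular library.

```python
def rank_features(names, importances):
    """Pair each output column name with its importance, highest first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Illustrative values only, not results from a real model.
names = ["pet_cat", "pet_dog", "pet_fish", "age"]
importances = [0.10, 0.25, 0.05, 0.60]

for name, score in rank_features(names, importances):
    print(f"{name}: {score:.2f}")
```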

dukebody · 2017-01-31T10:19:47Z

We can certainly find a way to generate and output the names without outputting a dataframe, yes. Do you think this is something you can contribute?

dcbb · 2017-01-31T20:45:53Z

Yes, I started looking at the naming function anyway. I'll give it a try, but it may take a while due to my workload.

dukebody · 2017-02-01T08:16:06Z

No hurry, let me know if I can help with pointers on how to do the testing or anything else.

dukebody · 2017-02-01T08:19:11Z

Let's continue talking about this in #78.

dukebody mentioned this issue Jan 14, 2017

Feature Request: PandasFeatureUnion #69

Closed

jph00 mentioned this issue Jan 15, 2017

add df_out to return a data frame #72

Closed

dukebody mentioned this issue Jan 16, 2017

Add a df_out option to return a dataframe #73

Merged

dukebody closed this as completed Jan 17, 2017

dukebody reopened this Jan 17, 2017

dukebody mentioned this issue Feb 1, 2017

Name columns without outputting a dataframe #78

Closed

dukebody closed this as completed Feb 1, 2017
