Name columns #70

Closed
jph00 opened this issue Jan 10, 2017 · 14 comments

Comments

@jph00

jph00 commented Jan 10, 2017

It would be nice to have a way to specify the names of the columns created by a transform, so that later on you could pass mapper.names (or similar) to any function that expects a list of column names (e.g. variable importance), or use them in any chart where you want to label the columns by name.

This could default to the name of the pandas column that created it (if there's only one input and one output), or the input column names joined with '_' if there are multiple inputs, with '_1', '_2' etc. appended to the name if there are multiple outputs.
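
A minimal sketch of the default naming rule proposed above. The function name `default_column_names` and its signature are hypothetical and are not part of sklearn-pandas; the code only illustrates the convention.

```python
def default_column_names(input_columns, n_outputs):
    """Hypothetical helper illustrating the proposed default naming.

    input_columns: a column name, or a list of column names, fed to a transform.
    n_outputs: number of columns the transform produces.
    """
    if isinstance(input_columns, str):
        input_columns = [input_columns]
    # Multiple inputs are joined with '_'; a single input keeps its own name.
    base = "_".join(input_columns)
    if n_outputs == 1:
        return [base]
    # Multiple outputs get a numeric suffix: '_1', '_2', ...
    return [f"{base}_{i}" for i in range(1, n_outputs + 1)]


print(default_column_names("age", 1))           # ['age']
print(default_column_names(["lat", "lon"], 1))  # ['lat_lon']
print(default_column_names("pet", 3))           # ['pet_1', 'pet_2', 'pet_3']
```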

@dukebody
Collaborator

I like this. We can make the DataFrameMapper optionally return a pandas dataframe with the naming conventions you mention. This way you could easily get the column names, and also perform slicing and dicing on the output easily.

Do you think this is something you could implement? I can help by reviewing the PR.

@dukebody
Collaborator

You can find some code I wrote to keep track of how many output columns a feature gets expanded to here: https://github.com/paulgb/sklearn-pandas/pull/56/files

There is also some discussion of this in #54.

@jph00
Author

jph00 commented Jan 15, 2017

OK here's a PR for the functionality. Let me know if you want me to make any changes, or feel free to edit it however you like.

@dukebody
Collaborator

@jph00 Brilliant! Good implementation and tests. :) I just rebased and added some documentation at #73. Will merge as soon as you give your OK.

@jph00
Author

jph00 commented Jan 17, 2017

Great! Go right ahead - and thanks.

@dukebody
Collaborator

A pleasure. :) Closing, as I understand this feature is now covered by the fact that pandas dataframes have named columns.

@arnau126
Collaborator

arnau126 commented Jan 17, 2017

I have a custom transformer whose classes_ attribute holds the unique values of the fitting data. My transformer doesn't alter the number of columns of the data it transforms (unlike LabelBinarizer, for example, which generates many columns from one).

In this case I think that the following lines will break the transform:

if hasattr(t, 'classes_') and (len(t.classes_)>2):
    return [c + '_' + o for o in t.classes_]

This is because in my case the number of classes is not equal to the number of columns.

Might I be using classes_ incorrectly? Or is this a use case that should be supported?

@dukebody
Collaborator

dukebody commented Jan 17, 2017

@arnau126 The code that generates the column names only infers them: some sklearn transformers store the unique class names in this internal attribute and generate len(classes_) columns.

I believe the best solution here is to check that the number of columns in the transformer output is equal to len(classes_) and, if it is not, skip the inferred naming.

@arnau126 do you think you could create a PR for this? Tests are run using tox.
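
The check could look roughly like the sketch below. `infer_names`, its arguments, and the two stand-in transformer classes are hypothetical, and the fallback to numeric suffixes is an assumption; only the `classes_` condition comes from the lines quoted earlier in the thread.

```python
def infer_names(name, transformer, n_output_cols):
    """Hypothetical sketch: name the columns produced from column `name`."""
    if n_output_cols == 1:
        return [name]
    classes = getattr(transformer, "classes_", None)
    # Use class-based names only when there is exactly one output column
    # per class; otherwise classes_ says nothing about the output layout.
    if classes is not None and len(classes) > 2 and len(classes) == n_output_cols:
        return [name + "_" + str(c) for c in classes]
    # Assumed fallback: plain positional suffixes.
    return [name + "_" + str(i) for i in range(1, n_output_cols + 1)]


class FakeBinarizer:
    """Stand-in for a transformer that emits one column per class."""
    classes_ = ["cat", "dog", "fish"]


class FakeSameWidth:
    """Stand-in for a transformer that sets classes_ but keeps one column."""
    classes_ = ["a", "b", "c", "d"]


print(infer_names("pet", FakeBinarizer(), 3))   # ['pet_cat', 'pet_dog', 'pet_fish']
print(infer_names("code", FakeSameWidth(), 1))  # ['code']
```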

dukebody reopened this Jan 17, 2017
@arnau126
Collaborator

arnau126 commented Jan 20, 2017

@dukebody Yes, that is the solution I was thinking of.
PR: #74.

@dcbb

dcbb commented Jan 31, 2017

I came across this discussion looking for column names to interpret feature importance of a classifier.

Am I correct to assume that using the dataframe output is currently the only way to get the column names?

For what I want to do I'm actually quite happy with matrix output, but I still need the column names. (I'm using a bunch of LabelBinarizers – this is the reason why feature importance is hard to make sense of without naming).
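
For illustration, once a list of output column names is available, pairing it with the importances is straightforward. The names and scores below are made-up placeholders, and `rank_features` is a hypothetical helper; in practice the scores would come from something like a fitted classifier's `feature_importances_`.

```python
def rank_features(names, importances):
    """Hypothetical helper: pair names with scores, highest score first."""
    if len(names) != len(importances):
        raise ValueError("expected one name per matrix column")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Placeholder data, not taken from any real model.
names = ["pet_cat", "pet_dog", "pet_fish", "age"]
importances = [0.10, 0.05, 0.25, 0.60]

for name, score in rank_features(names, importances):
    print(f"{name}: {score:.2f}")
```

The length check matters here because a name list that is out of step with the matrix columns would silently mislabel every feature after the mismatch.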

@dukebody
Collaborator

We can certainly find a way to generate and output the names without outputting a dataframe, yes. Do you think this is something you can contribute?

@dcbb

dcbb commented Jan 31, 2017

Yes, I started looking at the naming function anyway. I'll give it a try, but it may take a little while due to my workload.

@dukebody
Collaborator

dukebody commented Feb 1, 2017

No hurry. Let me know if I can help with pointers about how to do the testing or anything else.

@dukebody
Collaborator

dukebody commented Feb 1, 2017

Let's continue talking about this in #78

dukebody closed this as completed Feb 1, 2017