step for dummy variables across multiple columns #716

Closed
topepo opened this issue Jun 2, 2021 · 5 comments
@topepo
Member

topepo commented Jun 2, 2021

In cases where a row of a data set can have multiple categories spread across columns, it would be useful to have a step that finds the unique set of categories and then creates a dummy column for each. Each row can have multiple 1's in the new dummy variables.

@topepo topepo added the feature a feature request or enhancement label Jun 2, 2021
@EmilHvitfeldt
Member

I can take this one! step_multi_dummy()?

NA should be counted as none right?

@topepo
Member Author

topepo commented Jun 2, 2021

I think that step_dummy_multi_label() (for tab-complete) would be a good idea.

> NA should be counted as none right?

Correct.

I'm not sure if there is an easy way to define the prefix for the new indicator variables (from the data).

Just so we are clear, imagine columns for language spoken. The data might look like:

lang_1     lang_2     lang_3
English    Italian    NA
Spanish    NA         French
Armenian   English    French
NA         NA         NA

We would end up with:

Armenian English French Italian Spanish
       0       1      0       1       0
       0       0      1       0       1
       1       1      1       0       0
       0       0      0       0       0
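
The transformation above can be sketched in base R. This is only an illustration of the idea, not the step's implementation: `multi_label_dummies()` is a hypothetical helper name, and the sketch assumes all selected columns are character vectors with `NA` meaning "no category".

```r
# Example data from the comment above (NA means no language recorded).
langs <- data.frame(
  lang_1 = c("English", "Spanish", "Armenian", NA),
  lang_2 = c("Italian", NA, "English", NA),
  lang_3 = c(NA, "French", "French", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of recipes): build one 0/1 indicator
# column per unique label found across all of the supplied columns.
multi_label_dummies <- function(data) {
  values <- as.matrix(data)
  # Unique labels pooled over every column, ignoring missing values.
  labels <- sort(unique(values[!is.na(values)]))
  out <- matrix(
    0L,
    nrow = nrow(values),
    ncol = length(labels),
    dimnames = list(NULL, labels)
  )
  for (label in labels) {
    # A row gets a 1 if any of its columns equals the label; NAs never match.
    out[, label] <- as.integer(rowSums(values == label, na.rm = TRUE) > 0)
  }
  out
}

multi_label_dummies(langs)
```

With the example data this should reproduce the indicator table shown above, including the all-zero last row for the record where every column is `NA`.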

@EmilHvitfeldt
Member

> I think that step_dummy_multi_label() (for tab-complete) would be a good idea.

Perfect.

I think a good default for the prefix would be based on the first column selected.

Should the step have a way to deal with a high number of labels? I was thinking that a threshold argument like the one in step_other() could be useful, defaulting to 0. You could always apply step_other() to each individual column, but it would be nice to do it globally as well.

@EmilHvitfeldt EmilHvitfeldt self-assigned this Jun 2, 2021
@topepo
Member Author

topepo commented Jun 2, 2021

I think that adding a threshold argument for an "other" category would be a great idea.
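
One possible reading of a global threshold, sketched in base R. Everything here is an assumption for illustration: `pool_rare_labels()` is a hypothetical helper, and the rule used (pool any label whose share of all non-missing entries falls below `threshold`) is just one way such an argument could behave, not necessarily what the step does.

```r
# Same example data as earlier in the thread.
langs <- data.frame(
  lang_1 = c("English", "Spanish", "Armenian", NA),
  lang_2 = c("Italian", NA, "English", NA),
  lang_3 = c(NA, "French", "French", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of recipes): replace infrequent labels
# with a single pooled label, counting occurrences across all columns.
pool_rare_labels <- function(data, threshold = 0, other = "other") {
  values <- as.matrix(data)
  # Share of the non-missing entries taken by each label (table() drops NA).
  shares <- table(values) / sum(!is.na(values))
  rare <- names(shares)[shares < threshold]
  # NA entries are never matched, so they stay missing.
  values[values %in% rare] <- other
  as.data.frame(values, stringsAsFactors = FALSE)
}

# With the default threshold of 0 nothing is pooled.
pool_rare_labels(langs)

# English and French each take 2 of the 7 non-missing entries; the other
# labels take 1 of 7 each, so a threshold of 0.2 pools those three.
pool_rare_labels(langs, threshold = 0.2)
```

The pooled data could then be turned into indicator columns in the usual way, giving a single "other" column in place of one column per rare label.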

EmilHvitfeldt added a commit that referenced this issue Jun 8, 2021
@topepo topepo closed this as completed in 659a16c Sep 14, 2021
@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Sep 29, 2021