Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean_names() appends underscore before trailing number #153

Closed
sfirke opened this issue Dec 12, 2017 · 19 comments · Fixed by #177
Closed

clean_names() appends underscore before trailing number #153

sfirke opened this issue Dec 12, 2017 · 19 comments · Fixed by #177

Comments

@sfirke
Copy link
Owner

sfirke commented Dec 12, 2017

My raw data has column name email1 (sigh) and clean_names is converting it to email_1. I'd prefer email1, that underscore seems excessive.

@Tazinho I'm stepping through the code for clean_names and it looks like this enters snakecase::to_any_case as email1. When I run snakecase::to_any_case("email1") I get email_1. What do you think about this returning email_1 vs. email1? It's not a huge deal to me either way but my preference would be not to insert an underscore before a trailing number.

@sfirke
Copy link
Owner Author

sfirke commented Dec 12, 2017

Is this related to Tazinho/snakecase#76?

@Tazinho
Copy link
Contributor

Tazinho commented Dec 12, 2017

yes, it's related and thanks for the feedback. I was wondering when this pops up^^. I personally prefer the underscore, but I would like to support both options.

So, I am setting underscores around everything, which is not a letter, at the beginning of parsing. As mentioned here #154 I could for example add another parsing option for this case, but would like to wait a bit and then make a overhaul of the snakecasae pkg (of course with respect to dependencies). If you have ideas how to handle this situation best within snakecase or you see other chances for improvement, pls let me know.

@sfirke
Copy link
Owner Author

sfirke commented Jan 1, 2018

@Tazinho I thought about this some more and wonder if the simplest option is for me to take the result of to_any_case() and then use regex to turn _1 into 1, so "jan_2009_profit" becomes "jan2009_profit". That would address this issue without any modifications to to_any_case().

I'd be taking the stance in clean_names() that in variable names, numbers are typically descriptive of the word they follow, and so should be attached without an underscore there.

Thus email1 stays email1. I think I like keeping the underscore for text following a number, jan2009profit is visually long even though you can tell the different parts by what's numerals vs. text.

@sfirke
Copy link
Owner Author

sfirke commented Jan 2, 2018

Anyone want to verify that indeed this regex is exactly what's needed here? Removes underscores followed by numbers.

> gsub("_(?=[0-9])", "", "jan_2009_profit", perl = TRUE)
[1] "jan2009_profit"

@Tazinho
Copy link
Contributor

Tazinho commented Jan 2, 2018

I think you need to add that before the underscore is not a number. Otherwise var1_120 becomes var1120.

@sfirke
Copy link
Owner Author

sfirke commented Jan 3, 2018 via email

@Tazinho
Copy link
Contributor

Tazinho commented Jan 3, 2018

Yes, numbers which are not directly next to each other in the input should also be separated in the output, no matter of the case argument:

snakecase::to_any_case("var 1 1 1")
##>[1] "var_1_12_1"

For the cases which have an empty string as separator like the camel cases, it looks like they have the behaviour that you prefer

snakecase::to_any_case("var 1 12 1", case = "upper_camel")
##> [1] "Var1_12_1"

There are artificial ways to get around this via the postprocess argument

snakecase::to_any_case("var 1 12 1", postprocess = "")
##> [1] "var1121"

Or the protect argument (however, this is totally not intended usage; BTW the protect argument might become deprecated soon.)

snakecase::to_any_case("var 1 1 1", protect = "r")
##> [1] "var1_12_1"

@sfirke
Copy link
Owner Author

sfirke commented Jan 20, 2018

oh ho I have an idea that works with to_any_case as it is! ⚡💡 clean_names can scan for incoming patterns where a number and letter are adjacent, e.g., in email2 it would find l2 and store it in a vector. Then after putting the text through to_any_case, this would come back as email_2. At which point the function would look for the stored patterns, now with the underscore inserted - here l_2 and replace with the original l2.

This seems more minimal than what I propose above, which removes all underscores followed by numbers. This will leave alone numbers separated by spaces or other characters in the incoming raw names. But it handles the use case which I've found annoying, where a name like email2 becomes less useful from clean_names() turning it into email_2.

How about that?

@Tazinho
Copy link
Contributor

Tazinho commented Jan 21, 2018

So your suggestion would be to leave numbers just as is?
Like email1 stays email1 and email2___3bla becomes email2_3bla and email_2_4 becomes email_2_4 ?

I could add this via an extra parsing option and I also think this would make sense.

@sfirke
Copy link
Owner Author

sfirke commented Jan 21, 2018 via email

@Tazinho
Copy link
Contributor

Tazinho commented Jan 21, 2018

Will be It's implemented as parsing option 6 in master development version for now after setting the releasetag of this nights CRAN update.
(The parsing_option numbers might change in the near future).
As shown in the example this option works fine for 23, but in case of 42 it might not be intended,
thats why I think I will leave the current default in to_any_case().

> to_any_case("species42value 23month", 
+             case = "snake", parsing_option = 6)
[1] "species42value_23month"

Thanks a lot for the suggestion. I really like this solution!!

@sfirke
Copy link
Owner Author

sfirke commented Jan 22, 2018

For the example you show, what I had in mind for those inputs would be species42_value_23_month, with the trailing underscores like the current case = "snake" behavior. What do you think about that?

Also, timing-wise: I think if this is in the next CRAN version ofsnakecase before I submit janitor to CRAN, then I'll just rely on the behavior in snakecase. Or if I'm ready to submit janitor well before this addition to snakecase is ready for CRAN, then I'll implement the behavior myself. Does that seem right to you?

@Tazinho
Copy link
Contributor

Tazinho commented Jan 22, 2018

Your suggestion might be a good default. However, since there are not standardisations for names and I would like to stay with snakecase more general, I would like to just give the options surround with underscores (parsing_option 1) or leave surroundings in the output the same as in the input (currently parsing_option 6). If I would understand you correct, than you would need the latter one and then add on your side the "_" or whatever is supplied as output separator after digits, which are followed by (for example) letters.

As the next changes shouldn't be relevant to @strengejacke, I can submit to CRAN again, when you want me to.

@strengejacke
Copy link

I must read this in detail later to see whether this affects my defaults when using snakecase or not...

@sfirke
Copy link
Owner Author

sfirke commented Jan 22, 2018

Excellent. Okay, I'll work from the dev version of snakecase then and build up around that "parsing_option" = 6.

@sfirke sfirke mentioned this issue Jan 22, 2018
20 tasks
@Tazinho
Copy link
Contributor

Tazinho commented Jan 28, 2018

@sfirke From v 0.8.3, which is currently on the github devversion-01 branch and will be on master soon parsing_option 5 and 6 will become parsing_option 3 and 4. Old 3 and 4 are deleted since they are more or less nonsense...

@sfirke
Copy link
Owner Author

sfirke commented Jan 29, 2018

Note to self: work from snakecase pkg v. 0.9 as described here: Tazinho/snakecase#106 (comment)

@sfirke
Copy link
Owner Author

sfirke commented Feb 8, 2018

Summarizing what I think is my conclusion on how to approach this.

Background
Sometimes numbers modify the preceding word and a space is annoying: we don't want email2 to become email_2. Sometimes they are their own semantic unit and some space would be appropriate, e.g., june_2009_sales.

Desired behavior
Above I state a desire to convert june_2009_sales to june2009_sales, taking the position that numbers following words should not be separated. With the implementation of to_any_case parsing option 4, I now think that's too heavy-handed.

Instead I think spacing should be retained based on the input, as is the behavior of to_any_case parsing option 4. So:

  • email2 stays email2
  • email 2 becomes email_2 <- not my personal preference, but this is more minimal and takes me out of the business of assuming what the user wants.
  • Similarly, June 2009 Sales becomes june_2009_sales - here, the underscores are desirable.

There's no way to tell if the user wants email2 or june_2009_sales as the result, so go with the more minimal one.

Implementation
Handily, no regexes needed - just parsing option 4 from snakecase.

Other notes to the future
If something is attempted with regex, remember to check case. Don't want to insert _ into a camelCase output.

@sfirke
Copy link
Owner Author

sfirke commented Feb 8, 2018

I like that this matches the original behavior of the "simple" clean_names in v0.3.1 ☮️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants