clean_names() appends underscore before trailing number #153

sfirke · 2017-12-12T03:06:37Z

My raw data has column name email1 (sigh) and clean_names is converting it to email_1. I'd prefer email1, that underscore seems excessive.

@Tazinho I'm stepping through the code for clean_names and it looks like this enters snakecase::to_any_case as email1. When I run snakecase::to_any_case("email1") I get email_1. What do you think about this returning email_1 vs. email1? It's not a huge deal to me either way but my preference would be not to insert an underscore before a trailing number.

The text was updated successfully, but these errors were encountered:

sfirke · 2017-12-12T03:07:14Z

Is this related to Tazinho/snakecase#76?

Tazinho · 2017-12-12T08:07:30Z

yes, it's related and thanks for the feedback. I was wondering when this pops up^^. I personally prefer the underscore, but I would like to support both options.

So, I am setting underscores around everything, which is not a letter, at the beginning of parsing. As mentioned here #154 I could for example add another parsing option for this case, but would like to wait a bit and then make a overhaul of the snakecasae pkg (of course with respect to dependencies). If you have ideas how to handle this situation best within snakecase or you see other chances for improvement, pls let me know.

sfirke · 2018-01-01T20:14:50Z

@Tazinho I thought about this some more and wonder if the simplest option is for me to take the result of to_any_case() and then use regex to turn _1 into 1, so "jan_2009_profit" becomes "jan2009_profit". That would address this issue without any modifications to to_any_case().

I'd be taking the stance in clean_names() that in variable names, numbers are typically descriptive of the word they follow, and so should be attached without an underscore there.

Thus email1 stays email1. I think I like keeping the underscore for text following a number, jan2009profit is visually long even though you can tell the different parts by what's numerals vs. text.

sfirke · 2018-01-02T03:08:23Z

Anyone want to verify that indeed this regex is exactly what's needed here? Removes underscores followed by numbers.

> gsub("_(?=[0-9])", "", "jan_2009_profit", perl = TRUE)
[1] "jan2009_profit"

Tazinho · 2018-01-02T07:35:21Z

I think you need to add that before the underscore is not a number. Otherwise var1_120 becomes var1120.

sfirke · 2018-01-03T02:01:49Z

I was thinking about that and figured there was no output from to_any_case() that could be two numbers separated by an underscore. Like "var1_23" should be impossible to generate. Is there a possible input to to_any_case that will yield such a result?

…

On Jan 2, 2018 2:35 AM, "Tazinho" ***@***.***> wrote: I think you have to add that before the underscore is not a number. Otherwise var1_120 becomes var1120. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#153 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHOBkNHq81MZ5nHIC3ka6-JdZ7oFFazxks5tGdw6gaJpZM4Q-WtT> .

Tazinho · 2018-01-03T10:48:23Z

Yes, numbers which are not directly next to each other in the input should also be separated in the output, no matter of the case argument:

snakecase::to_any_case("var 1 1 1")
##>[1] "var_1_12_1"

For the cases which have an empty string as separator like the camel cases, it looks like they have the behaviour that you prefer

snakecase::to_any_case("var 1 12 1", case = "upper_camel")
##> [1] "Var1_12_1"

There are artificial ways to get around this via the postprocess argument

snakecase::to_any_case("var 1 12 1", postprocess = "")
##> [1] "var1121"

Or the protect argument (however, this is totally not intended usage; BTW the protect argument might become deprecated soon.)

snakecase::to_any_case("var 1 1 1", protect = "r")
##> [1] "var1_12_1"

sfirke · 2018-01-20T20:09:50Z

oh ho I have an idea that works with to_any_case as it is! ⚡💡 clean_names can scan for incoming patterns where a number and letter are adjacent, e.g., in email2 it would find l2 and store it in a vector. Then after putting the text through to_any_case, this would come back as email_2. At which point the function would look for the stored patterns, now with the underscore inserted - here l_2 and replace with the original l2.

This seems more minimal than what I propose above, which removes all underscores followed by numbers. This will leave alone numbers separated by spaces or other characters in the incoming raw names. But it handles the use case which I've found annoying, where a name like email2 becomes less useful from clean_names() turning it into email_2.

How about that?

Tazinho · 2018-01-21T01:00:48Z

So your suggestion would be to leave numbers just as is?
Like email1 stays email1 and email2___3bla becomes email2_3bla and email_2_4 becomes email_2_4 ?

I could add this via an extra parsing option and I also think this would make sense.

sfirke · 2018-01-21T02:39:12Z

Yep that's it. If that behavior could be implemented within the framework of to_any_case, great! Or, if it feels out of place or too niche or tricky for to_any_case, I think I can implement it as described above.

…

On Jan 20, 2018 8:00 PM, "Tazinho" ***@***.***> wrote: So your suggestion would be to leave numbers just as is? Like email1 stays email1 and email2___3bla becomes email2_3bla and email_2_4 becomes email_2_4 ? I could add this via an extra parsing option and I also think this would make sense. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#153 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHOBkEeUnGEvnmOmC2WChFZBGdfUpzqsks5tMoxBgaJpZM4Q-WtT> .

Tazinho · 2018-01-21T10:00:00Z

Will be ~~It's~~ implemented as parsing option 6 in master development version ~~for now~~ after setting the releasetag of this nights CRAN update.
(The parsing_option numbers might change in the near future).
As shown in the example this option works fine for 23, but in case of 42 it might not be intended,
thats why I think I will leave the current default in to_any_case().

> to_any_case("species42value 23month", 
+             case = "snake", parsing_option = 6)
[1] "species42value_23month"

Thanks a lot for the suggestion. I really like this solution!!

sfirke · 2018-01-22T15:40:37Z

For the example you show, what I had in mind for those inputs would be species42_value_23_month, with the trailing underscores like the current case = "snake" behavior. What do you think about that?

Also, timing-wise: I think if this is in the next CRAN version ofsnakecase before I submit janitor to CRAN, then I'll just rely on the behavior in snakecase. Or if I'm ready to submit janitor well before this addition to snakecase is ready for CRAN, then I'll implement the behavior myself. Does that seem right to you?

Tazinho · 2018-01-22T15:54:43Z

Your suggestion might be a good default. However, since there are not standardisations for names and I would like to stay with snakecase more general, I would like to just give the options surround with underscores (parsing_option 1) or leave surroundings in the output the same as in the input (currently parsing_option 6). If I would understand you correct, than you would need the latter one and then add on your side the "_" or whatever is supplied as output separator after digits, which are followed by (for example) letters.

As the next changes shouldn't be relevant to @strengejacke, I can submit to CRAN again, when you want me to.

strengejacke · 2018-01-22T16:00:06Z

I must read this in detail later to see whether this affects my defaults when using snakecase or not...

sfirke · 2018-01-22T17:10:29Z

Excellent. Okay, I'll work from the dev version of snakecase then and build up around that "parsing_option" = 6.

Tazinho · 2018-01-28T23:02:57Z

@sfirke From v 0.8.3, which is currently on the github devversion-01 branch and will be on master soon parsing_option 5 and 6 will become parsing_option 3 and 4. Old 3 and 4 are deleted since they are more or less nonsense...

sfirke · 2018-01-29T17:23:29Z

Note to self: work from snakecase pkg v. 0.9 as described here: Tazinho/snakecase#106 (comment)

sfirke · 2018-02-08T19:40:53Z

Summarizing what I think is my conclusion on how to approach this.

Background
Sometimes numbers modify the preceding word and a space is annoying: we don't want email2 to become email_2. Sometimes they are their own semantic unit and some space would be appropriate, e.g., june_2009_sales.

Desired behavior
Above I state a desire to convert june_2009_sales to june2009_sales, taking the position that numbers following words should not be separated. With the implementation of to_any_case parsing option 4, I now think that's too heavy-handed.

Instead I think spacing should be retained based on the input, as is the behavior of to_any_case parsing option 4. So:

email2 stays email2
email 2 becomes email_2 <- not my personal preference, but this is more minimal and takes me out of the business of assuming what the user wants.
Similarly, June 2009 Sales becomes june_2009_sales - here, the underscores are desirable.

There's no way to tell if the user wants email2 or june_2009_sales as the result, so go with the more minimal one.

Implementation
Handily, no regexes needed - just parsing option 4 from snakecase.

Other notes to the future
If something is attempted with regex, remember to check case. Don't want to insert _ into a camelCase output.

sfirke · 2018-02-08T21:23:44Z

I like that this matches the original behavior of the "simple" clean_names in v0.3.1 ☮️

sfirke mentioned this issue Jan 1, 2018

Does dependency on snakecase::to_any_case risk breakage in the future? #154

Closed

Tazinho mentioned this issue Jan 11, 2018

Currently no way to suppress underscores around digits Tazinho/snakecase#76

Closed

Tazinho mentioned this issue Jan 21, 2018

add an extra parsing option, where numerals positions just stay the same Tazinho/snakecase#127

Closed

sfirke mentioned this issue Jan 22, 2018

Submit version 1.0 to CRAN #140

Closed

20 tasks

sfirke mentioned this issue Feb 8, 2018

clean_names respects separators (or lack) between number and letter #177

Merged

sfirke closed this as completed in #177 Feb 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clean_names() appends underscore before trailing number #153

clean_names() appends underscore before trailing number #153

sfirke commented Dec 12, 2017

sfirke commented Dec 12, 2017

Tazinho commented Dec 12, 2017 •

edited

sfirke commented Jan 1, 2018

sfirke commented Jan 2, 2018

Tazinho commented Jan 2, 2018 •

edited

sfirke commented Jan 3, 2018 via email

Tazinho commented Jan 3, 2018

sfirke commented Jan 20, 2018 •

edited

Tazinho commented Jan 21, 2018

sfirke commented Jan 21, 2018 via email

Tazinho commented Jan 21, 2018 •

edited

sfirke commented Jan 22, 2018

Tazinho commented Jan 22, 2018

strengejacke commented Jan 22, 2018

sfirke commented Jan 22, 2018

Tazinho commented Jan 28, 2018

sfirke commented Jan 29, 2018

sfirke commented Feb 8, 2018 •

edited

sfirke commented Feb 8, 2018

clean_names() appends underscore before trailing number #153

clean_names() appends underscore before trailing number #153

Comments

sfirke commented Dec 12, 2017

sfirke commented Dec 12, 2017

Tazinho commented Dec 12, 2017 • edited

sfirke commented Jan 1, 2018

sfirke commented Jan 2, 2018

Tazinho commented Jan 2, 2018 • edited

sfirke commented Jan 3, 2018 via email

Tazinho commented Jan 3, 2018

sfirke commented Jan 20, 2018 • edited

Tazinho commented Jan 21, 2018

sfirke commented Jan 21, 2018 via email

Tazinho commented Jan 21, 2018 • edited

sfirke commented Jan 22, 2018

Tazinho commented Jan 22, 2018

strengejacke commented Jan 22, 2018

sfirke commented Jan 22, 2018

Tazinho commented Jan 28, 2018

sfirke commented Jan 29, 2018

sfirke commented Feb 8, 2018 • edited

sfirke commented Feb 8, 2018

Tazinho commented Dec 12, 2017 •

edited

Tazinho commented Jan 2, 2018 •

edited

sfirke commented Jan 20, 2018 •

edited

Tazinho commented Jan 21, 2018 •

edited

sfirke commented Feb 8, 2018 •

edited