Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean_spelling: inacurate warning #47

Closed
thibautjombart opened this issue Feb 27, 2019 · 6 comments · Fixed by #48
Closed

clean_spelling: inacurate warning #47

thibautjombart opened this issue Feb 27, 2019 · 6 comments · Fixed by #48
Assignees
Labels
enhancement New feature or request high priority this feature should be completed and tested as soon as possible

Comments

@thibautjombart
Copy link
Contributor

Here's the example:

## cleaning rules
toto <- structure(list(bad_spelling = c("f", "h", "m", "mm", "ms", ".missing", 
".default"), good_spelling = c("female", "male", "male", "male", 
"unknown", "unknown", "unknown_check_dictionary"), variable = c("sexe", 
"sexe", "sexe", "sexe", "sexe", "sexe", "sexe")), row.names = c(NA, 
-7L), class = "data.frame")

## 1) spurious warning here
> clean_spelling(c("male", "female", "f"), toto)
[1] "male"   "female" "female"
Warning message:
Unknown levels in `f`: unknown_check_dictionary 

## 2) here no warning but see notes below
> clean_spelling(c("male", "female", "as", "f"), toto)
[1] "male"                     "female"                  
[3] "unknown_check_dictionary" "female"                  
> 

Notes:

  • case 1) should have no warning
  • case 2) should optionally have a warning, saying "Values 'as' is not defined in the dictionary; using default replacement"

As a complement to this, it would be great to have a functionality that would add to an existing dictionary all terms not defined, whenever a .default field is defined. Probably belongs to a different issues though - I'll let you repost

@thibautjombart thibautjombart added enhancement New feature or request high priority this feature should be completed and tested as soon as possible labels Feb 27, 2019
@zkamvar
Copy link
Member

zkamvar commented Feb 27, 2019

* case 1) should have no warning

gotcha... simple fix :)

* case 2) should optionally have a warning, saying "Values 'as' is not defined in the dictionary; using default replacement"

This has the potential to produce a LOT of warnings for a data set.

As a complement to this, it would be great to have a functionality that would add to an existing dictionary all terms not defined, whenever a .default field is defined. Probably belongs to a different issues though - I'll let you repost

zkamvar added a commit that referenced this issue Feb 27, 2019
@thibautjombart
Copy link
Contributor Author

If it is not associated to the .global variable, should be okay? Or maybe if we want to be on the safe side,
add another wildcard called .default.warn which will use default and trigger a warning? :)

Or: add another oprtional column to the dictionary, which will trigger a pre-defined behaviour if that replacement is used; could be message, warning, error?

@zkamvar
Copy link
Member

zkamvar commented Feb 27, 2019

Currently, I have an option called warn_default that will print the warning. I will have clean_variable_spelling() catch all of these and issue a group warning later.

If it is not associated to the .global variable, should be okay?

That's a good point. We should make it impossible to use .default with .global since it could potentially destroy a lot of data.

Or maybe if we want to be on the safe side, add another wildcard called .default.warn which will use default and trigger a warning? :)
Or: add another oprtional column to the dictionary, which will trigger a pre-defined behaviour if that replacement is used; could be message, warning, error?

I'm on the fence about this. Part of me thinks it's a nifty idea and part of me thinks it's overcomplicating the wordlist.

zkamvar added a commit that referenced this issue Feb 27, 2019
@thibautjombart
Copy link
Contributor Author

That's cool. About the .default used in conjunction with .global, I would disallow it by default, but add an argument to allow it. Imagine selecting a bunch of columns with identical sets of possible values, that we may want to clean in one go (got this case here with 30+ columns of symptoms for instance).

@zkamvar
Copy link
Member

zkamvar commented Feb 27, 2019

About the .default used in conjunction with .global, I would disallow it by default, but add an argument to allow it. Imagine selecting a bunch of columns with identical sets of possible values, that we may want to clean in one go (got this case here with 30+ columns of symptoms for instance).

This would be solved better with the implementation of #40 since .global goes after all unprotected text columns.

@thibautjombart
Copy link
Contributor Author

Not sure, I still think it may be handy to specify several variables in the column variable, but this is for another time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request high priority this feature should be completed and tested as soon as possible
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants