add fuzzy matching #37

Open
Lingtax opened this issue Mar 18, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@Lingtax
Contributor

Lingtax commented Mar 18, 2021

There's a persistent issue where people provide expansive and idiosyncratic responses (e.g. "I'm sexually female") that can be reasonably classified by a human user, but are difficult to accommodate in the dictionaries method as it stands.

There are a number of suggestions for how we might resolve this (e.g. grep), but these of course have potential issues with unknown future inputs. Emily also likes how the current process gives you a transparent log of how recoding happens, which becomes trickier with fuzzy matching.

This is a summary of the implementation of fuzzy matching that Emily and I are proposing.

Fuzzy matching should:

  1. not be the default,
  2. require deliberate action to implement (i.e. not just fuzzy = TRUE),
  3. require user input to validate matches.

The core function arguments would default to:
gender_recode <- function(gender = gender, dictionary = gendercoder::broad, fill = FALSE, match = "exact")

And implementation would be:

gender_recode(gender_data, dictionary = broad, fill = TRUE, match = "fuzzy")

> gendercoder has exactly matched 99 (99%) of cases
> gendercoder suggests that "I'm sexually female" indicates a gender of: female. Please provide input:

1. Yes, female
2. No, male
3. No, Sex and gender diverse
4. No, other (provide text input)
5. No, replace with NA

 Selection:
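
To make the suggest-then-confirm step above more concrete, here is a minimal sketch in base R. It is not gendercoder code: `suggest_gender()`, `confirm_gender()` and the `toy` dictionary are hypothetical names, the sketch assumes a dictionary is a named character vector (names are raw responses, values are recoded categories), and it swaps in plain edit distance via `utils::adist()` as one possible fuzzy matcher. The menu is reduced to two options to keep the example short.

```r
# Hypothetical helper: suggest the category of the dictionary key that is
# closest (by edit distance) to a single free-text response.
suggest_gender <- function(response, dictionary) {
  keys <- names(dictionary)
  # adist() returns a matrix: one row per response, one column per key
  distances <- utils::adist(tolower(trimws(response)), keys)
  best <- which.min(distances[1, ])
  unname(dictionary[best])
}

# Hypothetical helper: ask the user to validate a suggestion.
# utils::menu() only works in an interactive session, so fail loudly
# otherwise instead of silently accepting the fuzzy match.
confirm_gender <- function(response, suggestion) {
  if (!interactive()) {
    stop("Fuzzy matches need interactive validation.")
  }
  choices <- c(paste0("Yes, ", suggestion), "No, replace with NA")
  pick <- utils::menu(
    choices,
    title = sprintf('"%s" may indicate a gender of: %s', response, suggestion)
  )
  if (pick == 1) suggestion else NA_character_
}

# Toy dictionary, assumed for illustration only
toy <- c(female = "female", woman = "female", male = "male", man = "male")

suggest_gender("I'm sexually female", toy)
# "female"
```

Plain edit distance is a crude choice: "male" is only slightly further from this response than "female", which is a good illustration of why the suggestion should never be accepted without validation.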

Keen to get input on alternatives and implementations.

@Lingtax Lingtax added the enhancement New feature or request label Mar 18, 2021
@ekothe
Contributor

ekothe commented Mar 18, 2021

One issue with this is that it would create a pipeline that is not reproducible and can't be run inside an R Markdown document (without author input).

Given that we already allow use of a custom dictionary, could we instead have a function like gender_create_dictionary() that has a similar implementation, except that it uses the user responses to build a custom dictionary? That way people could just apply the custom dictionary when running the code again.

I imagine a pipeline like

  1. User applies gender_recode() with a broad dictionary to the data and 99% are matched
  2. User applies gender_create_dictionary() to the unmatched data to create a new dictionary that recodes previously unmatched responses. Optionally, the code to create this dictionary is provided as a message so it can be easily reused.
  3. User applies gender_recode() to the data with the custom dictionary and the remaining 1% are matched
  4. If the gender recoding needs to be re-run, this could be achieved by using gender_recode(dictionary = c(broad, custom))
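
A rough sketch of how that pipeline could look, using toy stand-ins rather than the package itself. Everything here is an assumption for illustration: `recode_with()`, `dictionary_code()` and `broad_toy` are hypothetical, dictionaries are assumed to be named character vectors, and the `decisions` vector stands in for whatever the user would enter during validation in step 2.

```r
# Hypothetical stand-in for gender_recode(): exact lookup in a named vector.
# Responses with no matching name come back as NA.
recode_with <- function(x, dictionary) {
  unname(dictionary[tolower(trimws(x))])
}

# Hypothetical helper: turn a dictionary into R code that recreates it,
# so the validated decisions can be pasted into a script or Rmd.
dictionary_code <- function(dictionary) {
  paste(deparse(dictionary), collapse = "\n")
}

broad_toy <- c(female = "female", male = "male")  # toy dictionary
responses <- c("Female", "male", "I'm sexually female")

# Step 1: first pass with the broad dictionary
first_pass <- recode_with(responses, broad_toy)

# Step 2: collect unmatched responses and pair them with the user's decisions
unmatched <- unique(responses[is.na(first_pass)])
decisions <- c("female")  # assumed to come from interactive validation
custom <- stats::setNames(decisions, tolower(trimws(unmatched)))
message("custom <- ", dictionary_code(custom))

# Steps 3 and 4: re-run non-interactively with the combined dictionary
recoded <- recode_with(responses, c(broad_toy, custom))
```

Only step 2 needs a person at the keyboard. Once the emitted `custom <- ...` line is saved in the analysis script, the final call is an ordinary exact lookup and can be re-run without prompts.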

Also, the selection options should include a sixth choice: 6. No, replace with inputted value. This would be useful for novel responses like apogender that should be added to the dictionary without requiring the user to retype them.

@Lingtax
Contributor Author

Lingtax commented Mar 18, 2021

Except that also doesn't run in an RMD.

The other hang-up is that this is going to have scalability problems. Taking inputs for 12 fuzzy matches is fine. Taking them for 120 is going to be a PITA.

@ekothe
Contributor

ekothe commented Mar 19, 2021

Why wouldn't it run in an RMD? In that workflow you would use the message text from Step 2 to recreate the custom dictionary programmatically.

@Lingtax
Contributor Author

Lingtax commented Mar 19, 2021

Not in one pass, I mean. Yes, once the dictionary is created, it's created, but there's still an interactive step built into that pipeline.

@ekothe
Contributor

ekothe commented Mar 19, 2021

Yes, I can't see much of a way around that without simply skipping validation of fuzzy matches, which seems dangerous.


2 participants