add fuzzy matching #37

Lingtax · 2021-03-18T00:31:06Z

There's a persistent issue where people provide expansive and idiosyncratic responses (e.g. "I'm sexually female") that can be reasonably classified by a human user, but are difficult to accommodate in the dictionaries method as it stands.

There are a number of suggestions for how we might resolve this (e.g. grep), but these of course have potential issues with unknown future inputs. Emily also likes how the current process gives you a transparent log of how recoding happens, which becomes trickier with fuzzy matching.

This is a summary of the implementation of fuzzy matching proposed by Emily and me.

Fuzzy matching should:

- not be the default,
- require deliberate action to implement (i.e. not just fuzzy = TRUE),
- require user input to validate matches.

The core function arguments would default to:
gender_recode <- function(gender = gender, dictionary = gendercoder::broad, fill = FALSE, match = "exact")

And implementation would be:

gender_recode(gender_data, dictionary = broad, fill = TRUE, match = "fuzzy")

> gendercoder has exactly matched 99 (99%) of cases
> gendercoder suggests that "I'm sexually female" indicates a gender of: female. Please provide input:

1. Yes, female
2. No, male
3. No, Sex and gender diverse
4. No, other (provide text input)
5. No, replace with NA

 Selection:
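
To make the "suggest, then ask" idea concrete, here is a minimal sketch of the suggestion step only, using edit distance from base R's `utils::adist()`. Everything named here is an assumption for illustration: `toy_dict` is a stand-in rather than the real `gendercoder::broad`, and `suggest_fuzzy()` and its `max_dist` argument are hypothetical, not part of the package. It only proposes matches; nothing is recoded until a human confirms.

```r
# Toy dictionary (hypothetical stand-in, NOT the real gendercoder::broad):
# names are cleaned responses, values are the recoded categories.
toy_dict <- c(female = "female", woman = "female", male = "male", man = "male")

# Hypothetical helper: for each response with no exact match, suggest the
# category of the dictionary key closest (by edit distance) to any word in it.
suggest_fuzzy <- function(responses, dictionary, max_dist = 2) {
  cleaned <- tolower(trimws(responses))
  # Only unique unmatched responses need a decision
  unmatched <- unique(cleaned[!cleaned %in% names(dictionary)])
  suggestion <- vapply(unmatched, function(resp) {
    tokens <- strsplit(resp, "[^a-z']+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) == 0) return(NA_character_)
    # Rows are words in the response, columns are dictionary keys
    d <- utils::adist(tokens, names(dictionary))
    if (min(d) > max_dist) return(NA_character_)  # nothing close enough
    best_col <- which(d == min(d), arr.ind = TRUE)[1, "col"]
    unname(dictionary[names(dictionary)[best_col]])
  }, character(1), USE.NAMES = FALSE)
  data.frame(response = unmatched, suggestion = suggestion,
             stringsAsFactors = FALSE)
}

suggest_fuzzy(c("Female", "I'm sexually female", "mann", "apogender"), toy_dict)
```

Short keys such as "man" sit within a couple of edits of many unrelated words, so a distance threshold alone will produce false positives. That is the argument for requiring validation rather than accepting suggestions silently.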

Keen to get input on alternatives and implementations.


ekothe · 2021-03-18T21:42:03Z

One issue with this is that it would create a pipeline that is not reproducible and can't be run inside an R Markdown document (without author input).

Given that we already allow use of a custom dictionary, could we instead have a function like gender_create_dictionary() that has a similar implementation, except that it uses the user responses to build a custom dictionary? That way people could just apply the custom dictionary when running the code again.

I imagine a pipeline like

1. User applies gender_recode() with a broad dictionary to data and 99% are matched
2. User applies gender_create_dictionary() to unmatched data to create a new dictionary that recodes previously unmatched responses. Optionally the code to create this dictionary is provided as a message so it can be easily reused.
3. User applies gender_recode() to data with the custom dictionary and remaining 1% are matched
4. If the gender recoding needs to be re-run this could be achieved by using gender_recode(dictionary = c(broad, custom))
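
A rough sketch of how that pipeline could look, assuming dictionaries are named character vectors (names are cleaned responses, values are recoded categories). `broad_toy` and `recode_with()` are hypothetical stand-ins for `broad` and `gender_recode()`, and the `custom` vector is hard-coded here where gender_create_dictionary() would collect the decisions interactively.

```r
# Hypothetical stand-ins, not the package's own objects
broad_toy <- c(female = "female", male = "male")

recode_with <- function(gender, dictionary) {
  # Unknown responses come back as NA
  unname(dictionary[tolower(trimws(gender))])
}

responses <- c("Female", "male", "I'm sexually female", "apogender")

# Step 1: exact matching with the broad dictionary
first_pass <- recode_with(responses, broad_toy)
unmatched <- unique(tolower(trimws(responses[is.na(first_pass)])))

# Step 2: decisions made once by the user (hard-coded for this sketch)
custom <- c("i'm sexually female" = "female", "apogender" = "apogender")

# Emit code that recreates the custom dictionary, so it can be pasted
# into a script or Rmd and no prompt is needed on later runs
message("custom <- ", paste(deparse(custom), collapse = ""))

# Steps 3 and 4: re-run non-interactively with the combined dictionary
second_pass <- recode_with(responses, c(broad_toy, custom))
```

Once the emitted `custom <- ...` line is saved alongside the analysis, only the last line needs to run again, which keeps the recoding log transparent.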

Also, the selection options should have 6. No, replace with inputted value. This would be useful for novel responses like apogender that should be added to the dictionary without requiring the user to retype.
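
For the prompt itself, something along these lines might work. This is a sketch only: `validate_match()` and its arguments are hypothetical, and the free-text "other" option is left out to keep it short. It relies on `utils::menu()`, which returns 0 when the user exits without choosing. The `choose` argument is there only so the prompt can be swapped out for testing.

```r
# Hypothetical prompt helper; returns the value to store for one response
validate_match <- function(response, suggestion, categories,
                           choose = utils::menu) {
  others <- setdiff(categories, suggestion)
  choices <- c(paste("Yes,", suggestion),
               paste("No,", others),
               "No, replace with NA",
               "No, replace with inputted value")
  # Value recorded for each choice, in the same order as `choices`
  values <- c(suggestion, others, NA_character_, response)
  pick <- choose(
    choices,
    title = sprintf('"%s" may indicate a gender of: %s. Please provide input:',
                    response, suggestion)
  )
  if (pick == 0) return(NA_character_)  # user exited the menu
  values[pick]
}

# Non-interactive illustration: pretend the user picked the last option
validate_match("apogender", "female",
               categories = c("female", "male", "sex and gender diverse"),
               choose = function(choices, title) length(choices))
```

With the last option, a novel response such as apogender is kept verbatim without retyping.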

Lingtax · 2021-03-18T23:15:35Z

Except that also doesn't run in an RMD.

The other hang-up is that this is going to have scalability problems. Taking inputs for 12 fuzzy matches is fine. Taking it for 120 is going to be a PITA.

ekothe · 2021-03-19T00:44:34Z

Why wouldn't it run in an RMD? In that workflow you would use the message text from Step 2 to recreate the custom dictionary programmatically.

Lingtax · 2021-03-19T00:55:22Z

Not in one pass, I mean. Yes, once the dictionary is created, it's created, but there's still an interactive step built into that pipeline.

ekothe · 2021-03-19T01:15:41Z

Yes, I can't see much of a way around that without simply skipping validation of fuzzy matches, which seems dangerous.

Lingtax added the enhancement New feature or request label Mar 18, 2021

ekothe mentioned this issue Oct 12, 2021

should "I am XXX" resolve to "XXX" #34

Closed
