LatinR

This repository contains the material I used at the LatinR Conference 2019 in Santiago de Chile. A lot more material, from the keynote speakers Hadley Wickham, Mine Çetinkaya-Rundel and Erin LeDell and from many other great speakers, can be found here.

Fuzzy merging (also called fuzzy matching) is a way to map text that is written differently (e.g. Richard Vogg, RICHARD VOGG, Richard.vogg) to the same entity.

One approach is to have a list with "dirty" names and compare it to a list of "clean" names by creating a distance matrix with a suitable string distance metric (e.g. the Jaro-Winkler distance).
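The distance-matrix idea can be sketched in a few lines of base R. This is only an illustration under assumptions, not the code from the talk: the example names are made up, and to stay dependency-free it uses Levenshtein distance from `utils::adist` instead of Jaro-Winkler. If the stringdist package is installed, `stringdist::stringdistmatrix(dirty, clean, method = "jw")` would give a Jaro-Winkler matrix of the same shape.

```r
# Hypothetical example data, not taken from the repository.
dirty <- c("RICHARD VOGG", "Richard.vogg", "hadley wickam")
clean <- c("Richard Vogg", "Hadley Wickham")

# Distance matrix: one row per dirty name, one column per clean name.
# Lower-casing first makes the comparison case-insensitive.
# NOTE: adist() computes Levenshtein distance, swapped in here for Jaro-Winkler.
d <- utils::adist(tolower(dirty), tolower(clean))

# Scale by the length of the longer string so that distances are
# comparable between short and long names.
d_norm <- d / outer(nchar(dirty), nchar(clean), pmax)

# For every dirty name, pick the clean name with the smallest distance.
best <- apply(d_norm, 1, which.min)
matches <- data.frame(
  dirty = dirty,
  clean = clean[best],
  distance = d_norm[cbind(seq_along(dirty), best)],
  stringsAsFactors = FALSE
)
print(matches)
```

In practice it is worth inspecting the `distance` column and rejecting matches above some threshold, since every dirty name is assigned a clean name even when none is a good fit.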

Later in the code I also parallelized the method, both to improve its performance and to practice working with the parallel package. On 3 cores it takes around 2 minutes for 500000 "dirty" names and 500 "clean" names.
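One way the parallel step could look is shown below. This is a hedged sketch rather than the repository's implementation: the helper `match_chunk`, the example data and the choice of 2 workers are all assumptions, and it again uses `utils::adist` (Levenshtein) in place of Jaro-Winkler. The idea is to split the dirty names into chunks, let each worker build the distance matrix for its chunk against the full clean list, and concatenate the results.

```r
library(parallel)

# Hypothetical helper: match one chunk of dirty names against all clean names.
match_chunk <- function(chunk, clean) {
  d <- utils::adist(tolower(chunk), tolower(clean))
  clean[apply(d, 1, which.min)]
}

# Hypothetical example data.
dirty <- c("RICHARD VOGG", "Richard.vogg", "hadley wickam", "HADLEY WICKHAM")
clean <- c("Richard Vogg", "Hadley Wickham")

n_cores <- 2  # assumption; the README mentions 3 cores for the real run

# Contiguous chunks, one per worker, so the original order is preserved.
chunks <- split(dirty, cut(seq_along(dirty), n_cores, labels = FALSE))

cl <- makeCluster(n_cores)
result <- unlist(parLapply(cl, chunks, match_chunk, clean = clean),
                 use.names = FALSE)
stopCluster(cl)

print(data.frame(dirty = dirty, clean = result))
```

Chunking by dirty names keeps each worker's distance matrix small (chunk size times the number of clean names), which matters more for memory than for speed when the dirty list has hundreds of thousands of entries.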

Fuzzy deduplication

If you have a list of words and want to remove fuzzy duplicates, I recommend taking a look at the fuzzy duplicates attempt document (still a work in progress). I wrote a function that does the basic job of replacing less frequent spellings with more frequent spellings of similar words.
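To make the idea concrete, here is a minimal sketch of such a replacement step. It is not the function from the document mentioned above: the name `dedupe_fuzzy`, the `max_dist` threshold and the example words are assumptions, and similarity is measured with Levenshtein distance via `utils::adist`.

```r
# Hypothetical sketch: map each word to a more frequent, similar word.
dedupe_fuzzy <- function(words, max_dist = 2) {
  # Unique words, most frequent first.
  counts <- sort(table(words), decreasing = TRUE)
  uniq <- names(counts)

  # Start with every word mapping to itself.
  canonical <- setNames(uniq, uniq)

  for (i in seq_along(uniq)) {
    if (i == 1) next
    # Compare only against words that are at least as frequent.
    d <- utils::adist(uniq[i], uniq[seq_len(i - 1)])
    j <- which.min(d)
    if (d[j] <= max_dist) {
      # Follow the mapping so chains end at the most frequent spelling.
      canonical[uniq[i]] <- canonical[uniq[j]]
    }
  }
  unname(canonical[words])
}

# Hypothetical example data.
words <- c("Santiago", "Santiago", "Santiago", "Santaigo",
           "Valparaiso", "Valparaiso", "Valparaizo")
deduped <- dedupe_fuzzy(words)
print(deduped)
```

A fixed absolute threshold like `max_dist = 2` is too permissive for very short words, so a relative threshold (distance divided by word length) may be a better fit for real data.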
