Skip to content

Latest commit

 

History

History
19 lines (9 loc) · 1.22 KB

README.md

File metadata and controls

19 lines (9 loc) · 1.22 KB

LatinR

In this repository I uploaded the material I used at the LatinR Conference 2019 in Santiago de Chile. A lot more material - from the keynote speakers Hadley Wickham, Mine Çetinkaya-Rundel, Erin LeDell and many other great speakers can be found here.

Fuzzy merging (or also called Fuzzy matching) is a way to map text that is written differently (e.g. Richard Vogg, RICHARD VOGG, Richard.vogg) to the same entity.

One approach is to have a list with "dirty" names and compare it to a list of "clean" names by creating a distance matrix with a suitable string distance metric (e.g. the Jaro-Winkler distance).

In the code I later also tried to parallelize to improve the perfomance of the method and practice to work with the parallel package (it takes around 2 minutes for 500000 "dirty" names and 500 clean names) on 3 cores.

Output:

Fuzzy deduplication

If you have a list of words and want to remove fuzzy duplicates, I recommend to take a look at the fuzzy duplicates attempt document (still work in progress). I wrote a function that does the basic job of replacing less frequent appearances with higher frequent appearances of similar words.