tumarkin edited this page Mar 6, 2018 · 12 revisions

Overview

yente consists of three parts:

  1. A preprocessor that transforms text via phonetic algorithms and/or word (phonetic code) length truncation.
  2. A matcher that finds matches based on word rarity (cosine similarity with an inverse density function) and allows for misspellings.
  3. An output control that provides a certain number of results and/or restricts results based on scores.

Each of these may be customized as described below. `yente --help` lists all available options for reference. See Examples for how these options may be combined.
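To give a feel for matching on word rarity, here is a minimal Python sketch of cosine similarity with rarity weights. It uses the standard inverse document frequency (IDF) weighting as a stand-in. yente's actual inverse density function is not specified on this page and may differ, and the names `idf_weights` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def idf_weights(names):
    """Weight each word by rarity: words found in few names get large weights.

    Assumption: plain IDF (log(N / df) + 1) stands in for yente's
    inverse density function, which may be defined differently.
    """
    n = len(names)
    doc_freq = Counter(tok for name in names for tok in set(name.lower().split()))
    return {tok: math.log(n / df) + 1.0 for tok, df in doc_freq.items()}


def cosine(a, b, weights):
    """Cosine similarity between two names treated as weighted word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    # Words missing from the weight table get a neutral weight of 1.0.
    dot = sum(weights.get(t, 1.0) ** 2 for t in ta & tb)
    norm_a = math.sqrt(sum(weights.get(t, 1.0) ** 2 for t in ta))
    norm_b = math.sqrt(sum(weights.get(t, 1.0) ** 2 for t in tb))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


names = ["acme inc", "zenith inc", "apex inc", "orbit inc"]
weights = idf_weights(names)
# Sharing the rare word "acme" counts for more than sharing the common "inc".
print(cosine("acme inc", "acme ltd", weights))
print(cosine("acme inc", "zenith inc", weights))
```

In this sketch, two names that share only a common word such as "inc" score low, while names that share a rare word score high.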

Preprocessing

A phonetic algorithm may be enabled to find words that sound similar. Currently yente supports SoundEx and Phonix. To enable one, for example SoundEx, add -p SoundEx or --phonetic-algorithm=SoundEx to your command.

One may also limit word lengths by adding --max-token-length=N to the command. This is most frequently done with phonetic algorithms, where it essentially emphasizes the initial sounds in a word when matching. For example, to use SoundEx with a three-character sound representation, run yente with the command options:

yente FROM-FILE TO-FILE -o OUTPUT-FILE --phonetic-algorithm=SoundEx --max-token-length=3
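The effect of combining a phonetic code with truncation can be sketched in Python. This is an illustration of standard American Soundex followed by a length cut, not yente's own code. The function names are hypothetical, and applying the truncation to the phonetic code (rather than to the raw word) is an assumption based on the "sound representation" wording above.

```python
# Standard American Soundex digit groups.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]


def preprocess(word, max_token_length=None):
    """Phonetic code, optionally truncated (assumed order of operations)."""
    code = soundex(word)
    return code if max_token_length is None else code[:max_token_length]


print(preprocess("Martin"), preprocess("Martha"))        # M635 M630
print(preprocess("Martin", 3), preprocess("Martha", 3))  # M63 M63
```

With the full code, "Martin" and "Martha" differ in the last character. Truncating to three characters keeps only the initial sounds, so the two words become identical tokens.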

Matching

Misspellings

Matching can allow for misspellings. This approach analyzes all possible combinations of the constituent words when comparing names. Each word pair is ranked based on similarity. The best word pairs between any two entities are used for matching.

Word-pair similarities are computed using a two-step procedure:

  1. Match fraction: The match fraction represents a raw similarity between words. It is based on the number of operations (additions, substitutions, deletions and transpositions) necessary to transform one word into the other, measured as a Damerau-Levenshtein edit distance. For example, the two words "Burton" and "Bruton" differ by one transposition (the "u" and "r"). Therefore, one operation is necessary to transform one of these six-character words into the other. The match fraction is then (6-1)/6 = 5/6. If the two words have different lengths, the longer word length is used in this calculation.

  2. Scaled similarity: The match fraction is raised to a power to determine a final scaled similarity. The scaled similarity is defined as:

Scaled similarity = Match fraction^F,

where F is a scaling factor. F should be a positive floating point number. As the Match fraction is a number between 0 and 1, the scaling factor F allows the user to specify arbitrary levels of misspelling tolerance. When F is high, misspellings are greatly penalized. When F is small, the algorithm is relatively tolerant of misspellings.

Enable misspellings by adding the command option --misspelling-penalty=F.
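The two-step procedure can be sketched in Python as follows. This is a minimal illustration, not yente's implementation: it uses the optimal string alignment variant of Damerau-Levenshtein distance (adjacent transpositions only), and the function names are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def match_fraction(a, b):
    """(longer length - edit distance) / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return (longest - osa_distance(a, b)) / longest


def scaled_similarity(a, b, penalty):
    """Match fraction raised to the scaling factor F (here: penalty)."""
    return match_fraction(a, b) ** penalty


print(match_fraction("burton", "bruton"))          # 5/6
print(scaled_similarity("burton", "bruton", 1.0))  # F = 1: unchanged
print(scaled_similarity("burton", "bruton", 4.0))  # larger F: harsher penalty
```

Because the match fraction lies between 0 and 1, raising it to a larger F pushes imperfect matches toward 0, while an exact match stays at 1 for any F.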

Sub-group matching

In some cases, it is desirable to search for matches only within a specific subgroup. This is enabled with --subgroup-search. When matching names, only those within the same subgroup will be considered. Subgroup identifiers must match exactly.
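Conceptually, subgroup search restricts the candidate pool before any scoring happens. The Python sketch below illustrates that idea only. The record layout (a subgroup identifier paired with a name) and the function names are assumptions, not yente's file format or API.

```python
from collections import defaultdict


def candidates_by_subgroup(to_records):
    """Index candidate names by their subgroup identifier."""
    groups = defaultdict(list)
    for subgroup, name in to_records:
        groups[subgroup].append(name)
    return groups


def candidates_for(from_record, groups):
    """Return only candidates whose subgroup identifier matches exactly."""
    subgroup, _name = from_record
    return groups.get(subgroup, [])


groups = candidates_by_subgroup([
    ("NY", "Acme Inc"),
    ("CA", "Acme Inc"),
    ("NY", "Apex LLC"),
])
print(candidates_for(("NY", "Acme"), groups))  # ['Acme Inc', 'Apex LLC']
print(candidates_for(("ny", "Acme"), groups))  # [] (identifiers are compared exactly)
```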

Results and Output

One may request more than one result by adding --number-of-results=I or -N I to the command, where I is an integer specifying the number of results to output. By default, yente does not include ties in the output. To include them, add --include-ties or -i to the command. Finally, it is possible to filter out low-quality matches by specifying a minimum match score with --minimum-match-score=MIN. Matches are scored between 0 and 1, with 1 being the highest quality. So, to eliminate names whose best matches score less than 0.5, add --minimum-match-score=0.5 to the command.
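How these three output controls interact can be sketched in Python. This is an illustration under stated assumptions, not yente's code: the function name and parameters are hypothetical, and a "tie" is assumed to mean a candidate whose score equals that of the last result kept.

```python
def select_results(scored, number_of_results=1, include_ties=False,
                   minimum_match_score=0.0):
    """Pick the best-scoring candidates from (name, score) pairs."""
    # Drop low-quality matches, then rank from best to worst.
    kept = sorted((p for p in scored if p[1] >= minimum_match_score),
                  key=lambda p: p[1], reverse=True)
    top = kept[:number_of_results]
    if include_ties and top:
        cutoff = top[-1][1]
        # Assumed tie rule: also keep candidates equal to the last kept score.
        top += [p for p in kept[number_of_results:] if p[1] == cutoff]
    return top


scored = [("Acme Inc", 0.9), ("Acme Incorporated", 0.9), ("Apex LLC", 0.4)]
print(select_results(scored))                     # [('Acme Inc', 0.9)]
print(select_results(scored, include_ties=True))  # both 0.9 matches
print(select_results(scored, number_of_results=3,
                     minimum_match_score=0.5))    # the 0.4 match is filtered out
```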

Multicore support

yente uses parallel processing on multicore computers to reduce processing time. Its processing speed scales in rough proportion to the number of cores.