tumarkin/yenta

yenta

A fast, fuzzy, flexible command-line matchmaker for textual data

Overview

yenta matches names across two data files. It has the following features:

  • Intelligent: Matching is based on the rareness of words, so you do not need to preprocess names to remove common, non-informative words (e.g. and, the, company). Just feed your data into the program and get results.
  • Robust: yenta incorporates features that are commonly needed in name matching. It is insensitive to both word order and case (Shawn Spencer matches SPENCER, SHAWN). yenta removes punctuation by default.
  • Unicode aware: By default, yenta automatically converts accented Unicode characters to their ASCII equivalents.
  • Customizable: Users may optionally allow for misspellings, implement phonetic algorithms, trim the constituent words of a name at a prespecified number of characters, output any number of potential matches (with and without ties), and combine any of the preceding customizations.
  • High performance: yenta is a multi-core program written in Rust, a fast and memory-efficient language.
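
To make the first three features concrete, here is a minimal Python sketch of rarity-weighted, order- and case-insensitive name matching. It is not yenta's implementation: the function names, the log-based weighting formula, and the weighted-overlap score are all assumptions chosen for illustration, and yenta's own scoring may differ.

```python
import math
import string
import unicodedata
from collections import Counter


def normalize(name):
    """Return a name's tokens as a set: ASCII-folded, lower-cased, no punctuation."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace punctuation with spaces so "SPENCER, SHAWN" splits cleanly.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # A set ignores word order by construction.
    return frozenset(ascii_only.lower().translate(table).split())


def token_weights(names):
    """Weight each token by rarity (assumed formula): log(1 + N / names containing it)."""
    doc_freq = Counter(tok for name in names for tok in normalize(name))
    n = len(names)
    return {tok: math.log(1 + n / count) for tok, count in doc_freq.items()}


def score(a, b, weights, default):
    """Weighted overlap: weight of shared tokens divided by weight of all tokens."""
    ta, tb = normalize(a), normalize(b)
    shared = sum(weights.get(t, default) for t in ta & tb)
    total = sum(weights.get(t, default) for t in ta | tb)
    return shared / total if total else 0.0


def best_match(name, candidates):
    """Return the candidate with the highest rarity-weighted score."""
    weights = token_weights(candidates)
    # Tokens never seen among the candidates are treated as maximally rare.
    default = math.log(1 + len(candidates))
    return max(candidates, key=lambda c: score(name, c, weights, default))
```

With the candidates "The Acme Company", "The Zenith Company", and "Shawn Spencer and Company", calling best_match("SPENCER, SHAWN", candidates) selects the third name. Common tokens such as "the" and "company" receive low weights because they occur in several candidates, so they contribute little to any score without being removed beforehand.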

Installation

  • Install Rust
  • Clone this repository
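  • At the command line, change to the root of the cloned repository and run the command cargo install --path . (the trailing dot, which refers to the current directory, is part of the command). Depending on your version of Cargo, the command may instead be cargo install --release.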

Quick Start

Save your data files in CSV format. You will match names from one file to potential matches in a second file. Assume that the first file is called from_names.csv and the second file is called to_names.csv. yenta requires that each of your CSV files has a column called name, in lower case. This column will be used by the fuzzy matcher. You may also have an optional column called id, which, if used, simply serves as a reference identifier that is echoed to the output.
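
Before running a match, it can help to confirm that each input file has the required header. The following Python sketch is a hypothetical helper and not part of yenta: it reads the first row of a CSV file, checks for a lower-case name column, and reports whether the optional id column is present.

```python
import csv
import io


def check_header(csv_file):
    """Verify that the header row has a lower-case 'name' column.

    Returns True if the optional 'id' column is also present, False otherwise.
    Raises ValueError if the file is empty or 'name' is missing.
    """
    header = next(csv.reader(csv_file), None)
    if header is None:
        raise ValueError("file is empty")
    if "name" not in header:
        raise ValueError(f"no 'name' column; found {header}")
    return "id" in header


# In-memory sample standing in for from_names.csv.
sample = io.StringIO("id,name\n1,Shawn Spencer\n2,Burton Guster\n")
print(check_header(sample))  # True: both 'name' and 'id' are present
```

For a file on disk, open it with open("from_names.csv", newline="") and pass the file object to check_header. A header spelled Name or NAME raises ValueError, since the column name must be in lower case.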

On the command line, cd into the directory with your files. To create an output file called matches.csv use the following command:

yenta from_names.csv to_names.csv --output-file=matches.csv

Recipes

Information

See the wiki for information on installation, usage, and best practices. It also includes some examples for matching problems that commonly arise in research.

Contributing

Submit a pull request and I will respond.

If yenta has made your life easier in any way, please send me an email or star this repository. If you would like to see a feature added, let me know through the GitHub forum.

To Do

  • Multiple producer, single-consumer output clustering
  • CLI error standardization
  • Match Modes
    • Exact token
    • Ngram
    • Levenshtein
    • Damerau-Levenshtein
  • Subgroup search
  • Benchmark BTreeMap/BTreeSet vs HashMap/HashSet
  • Evaluate Tokio/Crossbeam
  • NameProcessed::new using token iterator
  • MinTieHeap
