Skip to content

vincentchu/fuzz_ball

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fuzz_ball

FuzzBall is a gem that finds fuzzy matches of a string (the 'needle') within an array of strings (the 'haystack'). It does so via a two-step process: first, it finds candidate strings from the haystack that have high similarity to the needle, then uses a Smith-Waterman algorithm to fuzzily match from these candidates. Strings are returned along with a matching score. Both steps of the search are written in C for greater performance.

Usage

require 'fuzz_ball'
ruby-1.9.2-p290 :001 > words = %Q[Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.].split(/\s+/)

ruby-1.9.2-p290 :002 > fuzz = FuzzBall::Searcher.new( words )
=> <FuzzBall::Searcher n_files=19>

ruby-1.9.2-p290 :003 > fuzz.search("lor")
=> [{:alignment=>[2, 4, 1, 3, 0, 2], :score=>0.6666666666666666, :string=>"dolore"}, {:alignment=>[2, 4, 1, 3, 0, 2], :score=>0.8, :string=>"dolor"}]

ruby-1.9.2-p290 :004 > fuzz.add("a_new_string")
=> true

ruby-1.9.2-p290 :005 > fuzz.search("new_string")
=> [{:alignment=>[9, 11, 8, 10, 7, 9, 6, 8, 5, 7, 4, 6, 3, 5, 2, 4, 1, 3, 0, 2], :score=>1.5, :string=>"a_new_string"}]

Algorithms

FuzzBall uses a two-step process. When a search is executed, the number of haystack strings that must be searched is reduced by first eliminating haystack strings that are not similar to the needle string. Similarity is determined by a high number of shared duples between the needle string and a haystack string (in the appropriate order). This metric is computed by pre-indexing the duples in the haystack strings.

These candidate strings are then fuzzily matched using the Smith-Waterman algorithm for performaing local sequence alignment.

Duple Indexing

When a new instance of FuzzBall::Searcher is instantiated, each haystack string is decomposed into duples. The position of each duple is indexed internally by the C implementation and stored in a lightweight hash data structure, allowing fast (O(1)) lookups of the haystack strings that contain a given duple. The C implementation uses uthash. See DupleIndex.c in ext/duple_index.

Smith-Waterman

The Smith-Waterman algorithm is a dynamic programming algorithm that performs local sequence alighment by comparing segments of all possible lengths. It is commonly used to align protein or nucleotide sequences and is heavily used within the computational biology communinity. However it works just as well on normal strings.

Dependencies

The C extension makes use of uthash, a C hash implementation written by Troy D. Hanson under a modified BSD license.

Author

FuzzBall was written by Vincent Chu (vincentchu [at] gmail.com).

About

Fast string search / match from an array of strings. Written in C for performance.

Resources

Stars

Watchers

Forks

Packages

No packages published