Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Fast string search / match from an array of strings. Written in C for performance.
branch: master

v0.9.2:

 - Fixes bug where duple index was traversed in wrong direction
 - Rejiggering of how string length is penalized
latest commit a385031c51
vincentchu authored
Failed to load latest commit information.
examples/command_t Replacing filelist with linux filelist in example command_t app
ext/fuzz_ball v0.9.2:
lib v0.9.2:
spec v0.9.2:
README.md Adding README.md
fuzz_ball.gemspec v0.9.2:

README.md

fuzz_ball

FuzzBall is a gem that finds fuzzy matches of a string (the 'needle') within an array of strings (the 'haystack'). It does so via a two-step process: first, it finds candidate strings from the haystack that have high similarity to the needle, then uses a Smith-Waterman algorithm to fuzzily match from these candidates. Strings are returned along with a matching score. Both steps of the search are written in C for greater performance.

Usage

require 'fuzz_ball'
ruby-1.9.2-p290 :001 > words = %Q[Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.].split(/\s+/)

ruby-1.9.2-p290 :002 > fuzz = FuzzBall::Searcher.new( words )
=> <FuzzBall::Searcher n_files=19>

ruby-1.9.2-p290 :003 > fuzz.search("lor")
=> [{:alignment=>[2, 4, 1, 3, 0, 2], :score=>0.6666666666666666, :string=>"dolore"}, {:alignment=>[2, 4, 1, 3, 0, 2], :score=>0.8, :string=>"dolor"}]

ruby-1.9.2-p290 :004 > fuzz.add("a_new_string")
=> true

ruby-1.9.2-p290 :005 > fuzz.search("new_string")
=> [{:alignment=>[9, 11, 8, 10, 7, 9, 6, 8, 5, 7, 4, 6, 3, 5, 2, 4, 1, 3, 0, 2], :score=>1.5, :string=>"a_new_string"}]

Algorithms

FuzzBall uses a two-step process. When a search is executed, the number of haystack strings that must be searched is reduced by first eliminating haystack strings that are not similar to the needle string. Similarity is determined by a high number of shared duples between the needle string and a haystack string (in the appropriate order). This metric is computed by pre-indexing the duples in the haystack strings.

These candidate strings are then fuzzily matched using the Smith-Waterman algorithm for performaing local sequence alignment.

Duple Indexing

When a new instance of FuzzBall::Searcher is instantiated, each haystack string is decomposed into duples. The position of each duple is indexed internally by the C implementation and stored in a lightweight hash data structure, allowing fast (O(1)) lookups of the haystack strings that contain a given duple. The C implementation uses uthash. See DupleIndex.c in ext/duple_index.

Smith-Waterman

The Smith-Waterman algorithm is a dynamic programming algorithm that performs local sequence alighment by comparing segments of all possible lengths. It is commonly used to align protein or nucleotide sequences and is heavily used within the computational biology communinity. However it works just as well on normal strings.

Dependencies

The C extension makes use of uthash, a C hash implementation written by Troy D. Hanson under a modified BSD license.

Author

FuzzBall was written by Vincent Chu (vincentchu [at] gmail.com).

Something went wrong with that request. Please try again.