
fuzzy file matching is too fuzzy #43

Closed
nocash opened this issue Jul 16, 2015 · 6 comments

Comments

@nocash

nocash commented Jul 16, 2015

[screenshot: teaching-channel - atom 2015-07-16 13-21-02]

Looks like improvements are currently under development in atom/fuzzaldrin#22

@jeancroy

That's interesting.

Why does fuzzaldrin fail?
It zooms in on the filename, basically doing score = score(fullpath) + score(filename).
It also bails out (returns 0) when it cannot find some characters.
Because it cannot find "model" in the zoomed-in "user.rb", that part gets 0.
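That failure mode can be sketched mechanically. This is a toy illustration, not fuzzaldrin's actual code: `passScore` is a hypothetical stand-in that just counts matched characters, but it reproduces the two behaviors described here, the extra basename pass and the bail-out.

```javascript
// Toy sketch of the described behavior (not fuzzaldrin's real scoring):
// final score = score(fullpath) + score(basename), and each pass bails
// out with 0 as soon as a query character cannot be found in order.
function passScore(subject, query) {
  const s = subject.toLowerCase();
  let pos = -1;
  let score = 0;
  for (const ch of query.toLowerCase()) {
    pos = s.indexOf(ch, pos + 1);
    if (pos === -1) return 0; // bail out: one missing char zeroes the whole pass
    score += 1;               // placeholder per-character score
  }
  return score;
}

function fileScore(fullpath, query) {
  // the basename of a path with no '/' is the path itself
  const basename = fullpath.slice(fullpath.lastIndexOf('/') + 1);
  return passScore(fullpath, query) + passScore(basename, query);
}

// "model" does not exist in the basename "user.rb", so that pass is 0:
fileScore('app/models/user.rb', 'modeluser');                // → 9 (fullpath pass only)
// Here the basename is the whole string, so both passes score:
fileScore('drop_moderator_column_on_users.rb', 'modeluser'); // → 18, the wrong file wins
```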

Will the new version solve it?
The good news is that dynamic programming supports errors. I do the bail-out thing to stay compatible, and because it brings a lot of speed. But I do it on the full string.

The full path clearly scores higher in your "Best Result" mark.

Zoomed in, it's less clear.
The wrong result contains the whole "user" part, as well as "mode", which is requested in the query.
The correct result starts sooner and is shorter. However, size and position don't carry much weight. (This is done to support longer strings where the query is an acronym.)

All in all, I believe it's a matter of preference for file name vs. full path.
I might add a test about this as a reminder that people use the full path too.
Because right now pretty much every test is "prefer basename", except for the exact-match case, which does not cover this.

@aaronjensen
Member

@jeancroy the problem is that the algorithm (and, it sounds like, your new optimizations) does not take word boundaries or match length into account. That's very important for fuzzy matching, in my experience.

See the example here: https://github.com/garybernhardt/selecta#theory-of-operation
and the algorithm that this uses: https://github.com/JazzCore/ctrlp-cmatcher

Basically, the match length of app/models/user.rb, when word boundaries are taken into account, is as low as it can be (modeluser = 9), whereas the match length of the winning option is massive:

```
drop_moderator_column_on_users.rb
.....1234567890123.......4567.... -> 17

app/models/user.rb
....12345..6789... -> 9
```

By taking into account word boundaries and minimum match length, app/models/user is clearly better.
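The two counts above can be reproduced mechanically. Here is a minimal sketch (not selecta's or ctrlp-cmatcher's actual algorithm), with the match positions taken straight from the alignments: a new segment starts only when a matched character sits on a word boundary, and each segment counts its full span, gaps included.

```javascript
// Word-boundary-aware match length for a known set of matched
// character positions (taken from the alignments above).
function isBoundary(str, i) {
  if (i === 0) return true;
  const prev = str[i - 1];
  return prev === '/' || prev === '_' || prev === '-' || prev === '.';
}

function matchLength(str, positions) {
  let total = 0;
  let segStart = positions[0];
  let segEnd = positions[0];
  for (const pos of positions.slice(1)) {
    if (isBoundary(str, pos)) {     // boundary hit: close the current segment
      total += segEnd - segStart + 1;
      segStart = pos;
    }
    segEnd = pos;                   // otherwise the span grows, gaps included
  }
  return total + (segEnd - segStart + 1);
}

// "modeluser" in the losing candidate: "m..l" spans 13 chars, "user" spans 4
matchLength('drop_moderator_column_on_users.rb', [5, 6, 7, 8, 17, 25, 26, 27, 28]); // → 17
// "modeluser" in the winning candidate: "model" spans 5, "user" spans 4
matchLength('app/models/user.rb', [4, 5, 6, 7, 8, 11, 12, 13, 14]);                 // → 9
```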

@aaronjensen
Member

By the way, instead of considering base name vs. full path, you might consider giving preference to shorter paths and to matches closer to the end of the path. I'm not sure how selecta and ctrlp-cmatcher do it, but they do it well.

@jeancroy

@aaronjensen added a test; it passed without modification 👍

@aaronjensen
Member

awesome! hope to see it get merged in soon :) Thanks.

@jeancroy

Since you were kind enough to share ideas, I'll elaborate a bit more.

See this use case?
atom/fuzzy-finder#57 (A)
Somehow ImportanceTableCtrl should be preferred to switch.css when searching for itc.
So that puts a limit on the size-related penalty.

Another one:
atom/fuzzaldrin#17 (B)
diagonals should be preferred to Diagonal when searching for diag.
Here a single-character case match, 'd' vs 'D', should win, even if there's one extra s at the end of the word.

Those expectations more or less put haystack size in the role of a tiebreaker.
Why would we want to penalize large strings? Because their sheer size allows accidental matches. However, when that happens, the matches tend to be isolated characters in the middle of nowhere, and that's what we are going to detect! We can do so by counting the number of consecutive matches (grouped characters) and giving a large bonus for that.
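A minimal sketch of that idea (hypothetical weights, not the actual fuzzaldrin code): walk a greedy match of the query and reward runs of adjacent hits, so isolated accidental characters barely score.

```javascript
// Greedy match with a growing bonus for consecutive (grouped) characters;
// the run counter and the weight of 2 are arbitrary illustration choices.
function groupedScore(subject, query) {
  const s = subject.toLowerCase();
  let pos = -1;
  let prev = -2;
  let run = 0;
  let score = 0;
  for (const ch of query.toLowerCase()) {
    pos = s.indexOf(ch, pos + 1);
    if (pos === -1) return 0;     // bail out on a missing character
    if (pos === prev + 1) {
      run += 1;
      score += run * 2;           // consecutive hit: large, growing bonus
    } else {
      run = 0;
      score += 1;                 // isolated hit: barely scores
    }
    prev = pos;
  }
  return score;
}

// "model" sits as one block in app/models/user.rb but breaks apart
// ("mode" plus a distant "l") in the longer string:
groupedScore('app/models/user.rb', 'model');                 // → 21
groupedScore('drop_moderator_column_on_users.rb', 'model');  // → 14
```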

However, in this case "moderator" is almost "model". Also, my script would have attached the "u" to "user" instead of "column". Finally, both matches land on landmark characters (word boundaries) at "m" and "u". So I cannot guarantee that the large string will be scored like garbage (because it's not garbage), and a tiebreaker is exactly what you need.

Another interesting fact: I find consecutive characters very intuitive, and I use them to resolve otherwise contradictory cases. How do I know that the lowercase "i" of "itc" should prefer the uppercase "I" of "Importance", while the lowercase "d" of "diag" should reject the uppercase "D" of "Diagonal" in favor of "diagonal"? Well, "diag" scores consecutive points in the actual word, while "itc" scores consecutive points in the acronym of the word! Where the consecutive matches are controls the affinity for exact case vs. acronym camelCase. It also ensures that if a large string accidentally matches a landmark that is not part of a sequence requested by the query, we still get a garbage-like score.
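That intuition can be shown with a toy classifier (hypothetical helpers, not the real scorer): check whether the query sits consecutively inside the subject itself, or inside the subject's camelCase acronym.

```javascript
// Collect the acronym: the first character plus every uppercase letter.
function acronymOf(subject) {
  let acr = subject[0];
  for (let i = 1; i < subject.length; i++) {
    if (subject[i] >= 'A' && subject[i] <= 'Z') acr += subject[i];
  }
  return acr;
}

// Where do the consecutive characters live: in the word, or in the acronym?
function affinity(subject, query) {
  const q = query.toLowerCase();
  if (subject.toLowerCase().includes(q)) return 'word';
  if (acronymOf(subject).toLowerCase().includes(q)) return 'acronym';
  return 'none';
}

affinity('diagonals', 'diag');          // → 'word': exact case should win
affinity('ImportanceTableCtrl', 'itc'); // → 'acronym': camelCase match is fine
```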
