
fuzzy file matching is too fuzzy #43

Closed
nocash opened this issue Jul 16, 2015 · 6 comments

Comments

@nocash

nocash commented Jul 16, 2015

[screenshot: teaching-channel - atom 2015-07-16 13-21-02]

Looks like improvements are currently under development in atom/fuzzaldrin#22

@jeancroy

That's interesting.

Why does fuzzaldrin fail?
It zooms in on the filename, basically doing score = score(fullpath) + score(filename).
It also bails out (returns 0) when it cannot find some characters.
Because it cannot find "model" in the zoomed-in "user.rb", that part gets 0.
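That failure mode can be sketched mechanically. This is a toy illustration, not fuzzaldrin's actual code: `passScore` is a hypothetical stand-in that just counts matched characters, but it reproduces the two behaviors described here, the extra basename pass and the bail-out.

```javascript
// Toy sketch of the described behavior (not fuzzaldrin's real scoring):
// final score = score(fullpath) + score(basename), and each pass bails
// out with 0 as soon as a query character cannot be found in order.
function passScore(subject, query) {
  const s = subject.toLowerCase();
  let pos = -1;
  let score = 0;
  for (const ch of query.toLowerCase()) {
    pos = s.indexOf(ch, pos + 1);
    if (pos === -1) return 0; // bail out: one missing char zeroes the whole pass
    score += 1;               // placeholder per-character score
  }
  return score;
}

function fileScore(fullpath, query) {
  // the basename of a path with no '/' is the path itself
  const basename = fullpath.slice(fullpath.lastIndexOf('/') + 1);
  return passScore(fullpath, query) + passScore(basename, query);
}

// "model" does not exist in the basename "user.rb", so that pass is 0:
fileScore('app/models/user.rb', 'modeluser');                // → 9 (fullpath pass only)
// Here the basename is the whole string, so both passes score:
fileScore('drop_moderator_column_on_users.rb', 'modeluser'); // → 18, the wrong file wins
```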

Will the new version solve it?
The good news is that dynamic programming supports errors. I do the bail-out thing to stay compatible, and because it brings a lot of speed. But I do it on the full string.

The full path clearly scores higher in your "Best Result" mark.

Zoomed in, it's less clear.
The wrong result contains the whole "user" part, as well as "mode", which is requested in the query.
The correct result starts sooner and is shorter. However, size and position don't carry much weight. (This is done to support longer strings where the query is an acronym.)

All in all, I believe it's a matter of preference for file name vs. full path.
I might add a test about this as a reminder that people use the full path too.
Because right now pretty much every test is "prefer basename", except for the exact-match case, which does not cover this.

@aaronjensen
Member

@jeancroy the problem is that the algorithm (and, it sounds like, your new optimizations) does not take word boundaries or match length into account. That's very important for fuzzy matching, in my experience.

See the example here: https://github.com/garybernhardt/selecta#theory-of-operation
and the algorithm that this uses: https://github.com/JazzCore/ctrlp-cmatcher

Basically, the match length of app/models/user.rb, when word boundaries are taken into account, is as low as it can be (modeluser = 9), whereas the match length of the winning option is massive:

```
drop_moderator_column_on_users.rb
.....1234567890123.......4567.... -> 17

app/models/user.rb
....12345..6789... -> 9
```

By taking into account word boundaries and minimum match length, app/models/user is clearly better.
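The two counts above can be reproduced mechanically. Here is a minimal sketch (not selecta's or ctrlp-cmatcher's actual algorithm), with the match positions taken straight from the alignments: a new segment starts only when a matched character sits on a word boundary, and each segment counts its full span, gaps included.

```javascript
// Word-boundary-aware match length for a known set of matched
// character positions (taken from the alignments above).
function isBoundary(str, i) {
  if (i === 0) return true;
  const prev = str[i - 1];
  return prev === '/' || prev === '_' || prev === '-' || prev === '.';
}

function matchLength(str, positions) {
  let total = 0;
  let segStart = positions[0];
  let segEnd = positions[0];
  for (const pos of positions.slice(1)) {
    if (isBoundary(str, pos)) {     // boundary hit: close the current segment
      total += segEnd - segStart + 1;
      segStart = pos;
    }
    segEnd = pos;                   // otherwise the span grows, gaps included
  }
  return total + (segEnd - segStart + 1);
}

// "modeluser" in the losing candidate: "m..l" spans 13 chars, "user" spans 4
matchLength('drop_moderator_column_on_users.rb', [5, 6, 7, 8, 17, 25, 26, 27, 28]); // → 17
// "modeluser" in the winning candidate: "model" spans 5, "user" spans 4
matchLength('app/models/user.rb', [4, 5, 6, 7, 8, 11, 12, 13, 14]);                 // → 9
```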

@aaronjensen
Member

By the way, instead of considering base name vs. full path, you might consider giving preference to shorter paths and to matches closer to the end of the path. I'm not sure how selecta and ctrlp-cmatcher do it, but they do it well.

@jeancroy

@aaronjensen added a test; it passed without modification 👍

@aaronjensen
Member

awesome! hope to see it get merged in soon :) Thanks.

@jeancroy

Since you were kind enough to share ideas, I'll elaborate a bit more.

See this use case?
atom/fuzzy-finder#57 (A)
Somehow ImportanceTableCtrl should be preferred to switch.css when searching for itc.
So that puts a limit on the size-related penalty.

Another one:
atom/fuzzaldrin#17 (B)
diagonals should be preferred to Diagonal when searching for diag.
Here a single-character case match, 'd' vs 'D', should win, even if there's one extra s at the end of the word.

Those expectations more or less put haystack size in the role of a tiebreaker.
Why would we want to penalize large strings? Because their sheer size allows accidental matches. However, when that happens, the matches tend to be isolated characters in the middle of nowhere, and that's what we are going to detect! We can do so by counting the number of consecutive matches (grouped characters) and giving a large bonus for that.
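A minimal sketch of that idea (hypothetical weights, not the actual fuzzaldrin code): walk a greedy match of the query and reward runs of adjacent hits, so isolated accidental characters barely score.

```javascript
// Greedy match with a growing bonus for consecutive (grouped) characters;
// the run counter and the weight of 2 are arbitrary illustration choices.
function groupedScore(subject, query) {
  const s = subject.toLowerCase();
  let pos = -1;
  let prev = -2;
  let run = 0;
  let score = 0;
  for (const ch of query.toLowerCase()) {
    pos = s.indexOf(ch, pos + 1);
    if (pos === -1) return 0;     // bail out on a missing character
    if (pos === prev + 1) {
      run += 1;
      score += run * 2;           // consecutive hit: large, growing bonus
    } else {
      run = 0;
      score += 1;                 // isolated hit: barely scores
    }
    prev = pos;
  }
  return score;
}

// "model" sits as one block in app/models/user.rb but breaks apart
// ("mode" plus a distant "l") in the longer string:
groupedScore('app/models/user.rb', 'model');                 // → 21
groupedScore('drop_moderator_column_on_users.rb', 'model');  // → 14
```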

However, in this case "moderator" is almost "model". Also, my script would have attached the "u" to "user" instead of "column". Finally, both matches land on landmark characters (word boundaries) at "m" and "u". So I cannot guarantee that the large string will be scored like garbage (because it's not garbage), and a tiebreaker is exactly what you need.

Another interesting fact: I find consecutive characters very intuitive, and I use them to resolve otherwise contradictory cases. How do I know that the lowercase "i" of "itc" should prefer the uppercase "I" of "Importance", while the lowercase "d" of "diag" should reject the uppercase "D" of "Diagonal" in favor of "diagonal"? Well, "diag" scores consecutive points in the actual word, while "itc" scores consecutive points in the acronym of the word! Where the consecutive matches are controls the affinity for exact case vs. acronym camelCase. It also ensures that if a large string accidentally matches a landmark that is not part of a sequence requested by the query, we still get a garbage-like score.
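That intuition can be shown with a toy classifier (hypothetical helpers, not the real scorer): check whether the query sits consecutively inside the subject itself, or inside the subject's camelCase acronym.

```javascript
// Collect the acronym: the first character plus every uppercase letter.
function acronymOf(subject) {
  let acr = subject[0];
  for (let i = 1; i < subject.length; i++) {
    if (subject[i] >= 'A' && subject[i] <= 'Z') acr += subject[i];
  }
  return acr;
}

// Where do the consecutive characters live: in the word, or in the acronym?
function affinity(subject, query) {
  const q = query.toLowerCase();
  if (subject.toLowerCase().includes(q)) return 'word';
  if (acronymOf(subject).toLowerCase().includes(q)) return 'acronym';
  return 'none';
}

affinity('diagonals', 'diag');          // → 'word': exact case should win
affinity('ImportanceTableCtrl', 'itc'); // → 'acronym': camelCase match is fine
```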
