Add more accurate account search #11537
Merged
When ElasticSearch is available, a more accurate search is implemented:

- Using edge n-gram index for acct and display name
- Using asciifolding and cjk width normalization on display names
- Using Gaussian decay on account activity for additional scoring (recency)
- Using followers/friends ratio for additional scoring (spamminess)
- Using followers number for additional scoring (size)

The exact match precedence only takes effect when the input conforms to the username format and the username part of it is complete, i.e. when the user started typing the domain part.
nightpool approved these changes on Aug 15, 2019
hiyuki2578 pushed a commit to ProjectMyosotis/mastodon that referenced this pull request on Oct 2, 2019:

* Add more accurate account search

  When ElasticSearch is available, a more accurate search is implemented:

  - Using edge n-gram index for acct and display name
  - Using asciifolding and cjk width normalization on display names
  - Using Gaussian decay on account activity for additional scoring (recency)
  - Using followers/friends ratio for additional scoring (spamminess)
  - Using followers number for additional scoring (size)

  The exact match precedence only takes effect when the input conforms to the username format and the username part of it is complete, i.e. when the user started typing the domain part.

* Support single-letter usernames
* Fix tests
* Fix not picking up account updates
* Add weights and normalization for scores, skip zero terms queries
* Use local counts for accounts index, adjust search parameters
* Fix mistakes
* Using updated_at of accounts is inadequate for remote accounts
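When ElasticSearch is available, a more accurate search is implemented:

- Using edge n-gram index for acct and display name
- Using asciifolding and cjk width normalization on display names
- Using Gaussian decay on account activity for additional scoring (recency)
- Using followers/friends ratio for additional scoring (spamminess)
- Using followers number for additional scoring (size)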
Additionally, the previous behaviour is also kept:
The exact match precedence only takes effect when the input conforms to the username format and the username part of it is complete, i.e. when the user started typing the domain part.
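As a rough illustration of the condition above, the check could look like the following sketch. The pattern and the function name `username_is_complete` are hypothetical and simplified (letters, digits and underscores only); the project's real username rules may be broader.

```python
import re

# Hypothetical, simplified pattern: an optional leading "@", a username made of
# letters, digits and underscores, then "@" followed by whatever has been typed
# of the domain so far (possibly nothing yet).
COMPLETE_USERNAME = re.compile(r"^@?(?P<username>[A-Za-z0-9_]+)@(?P<domain>\S*)$")


def username_is_complete(query: str) -> bool:
    """Return True when the query looks like `user@...`, i.e. the domain part was started."""
    return COMPLETE_USERNAME.match(query.strip()) is not None


if __name__ == "__main__":
    for q in ["garg", "@gargron", "gargron@", "@gargron@mast", "Eugen Rochko"]:
        print(f"{q!r}: exact match precedence applies = {username_is_complete(q)}")
```

Under this assumption, a bare prefix such as "garg" is treated as an ordinary fuzzy search, and only input that reaches the "@" separating username and domain is eligible for exact match precedence.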
What does this mean? People who have Unicode characters such as accents and umlauts in their display names can now be found without entering those characters (asciifolding). Likewise, you no longer need to use the exact flavour of CJK characters (full-width/half-width) to get the same results. Accounts that haven't been active in a long time are less likely to appear near the top, as are accounts that are follow-botting (spamminess), while more established accounts (by followers number) are more likely to appear higher (but mind, that's within specific searches). Like before, the biggest factor is whether you're following someone.
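To make the normalization and prefix matching more concrete, here is a small Python approximation. It is not the ElasticSearch implementation: `fold` only imitates asciifolding and width normalization with the standard `unicodedata` module (the real asciifolding filter covers more characters), and the `min_len`/`max_len` defaults of `edge_ngrams` are assumptions.

```python
import unicodedata
from typing import List


def fold(text: str) -> str:
    """Approximate asciifolding plus width normalization on a display name."""
    # NFKD maps full-width/half-width variants to their canonical characters
    # and splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop only the Latin-style combining diacritics (U+0300..U+036F), so that
    # marks that are meaningful in other scripts are left alone.
    stripped = "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")
    # Recompose whatever remains and lowercase it for matching.
    return unicodedata.normalize("NFKC", stripped).lower()


def edge_ngrams(term: str, min_len: int = 1, max_len: int = 15) -> List[str]:
    """Return the prefixes of `term` that an edge n-gram index would store."""
    return [term[:n] for n in range(min_len, min(len(term), max_len) + 1)]


if __name__ == "__main__":
    print(fold("Gärgron"))       # accented input folded to plain letters
    print(fold("Ｇａｒｇｒｏｎ"))  # full-width input folded to half-width
    print(edge_ngrams(fold("Gargron")))
```

Because every prefix of the folded name is indexed, a query such as "garg" matches as soon as it is typed, regardless of accents or character width in the stored display name.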
Evaluations
I decided to run different queries for the same search term to see the differences between results and how they are scored. The queries did not consider follow relationships, because an account that is followed by the searching user will always appear at the top of the results.
Search term "garg"
Expecting to find: "Gargron"
Here is the status quo, search results from PostgreSQL for comparison:
The first result is a dead account, the rest are mostly bots.
Now, let's begin with a simple query without any additional scoring:
The third result is an active account, the rest, not so much. They're also all local accounts.
Same query with follower ratio affecting the score:
There are more interesting results here. Some accounts are bots/inactive/fake, however. The follower ratio isn't very insightful alone because it behaves wildly at low numbers.
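The "behaves wildly at low numbers" problem is easy to reproduce with made-up counts. In the sketch below, `naive_ratio` is a plain followers/following ratio and `bounded_ratio` is one hypothetical way of damping it; neither is claimed to be the formula used in this pull request, and the three accounts are invented.

```python
def naive_ratio(followers: int, following: int) -> float:
    """Plain followers/following ratio, guarded against division by zero."""
    return followers / max(following, 1)


def bounded_ratio(followers: int, following: int) -> float:
    """Hypothetical damped variant that always stays between 0 and 1."""
    return followers / (followers + following + 1)


# Invented accounts: (followers, following)
accounts = {
    "tiny_account": (3, 1),
    "established_account": (5000, 2500),
    "follow_bot": (50, 5000),
}

if __name__ == "__main__":
    for name, (followers, following) in accounts.items():
        print(
            f"{name}: naive={naive_ratio(followers, following):.3f} "
            f"bounded={bounded_ratio(followers, following):.3f}"
        )
```

With the plain ratio, an account with 3 followers and 1 followee outranks one with 5000 followers and 2500 followees, which is why the ratio is more useful for pushing follow-bots down than for ranking on its own.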
Same query, but with followers number affecting the score:
These are pretty interesting results. The first two accounts are real and active. The third is real, but belongs to an instance that's been dead for a year. The fourth is a dead account. The fifth and sixth accounts are real and active.
Same query, but with last activity affecting the score:
These results seem a bit nonsensical. At the very least, they are not all local, and the long-inactive gonext.gg account is not among them.
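For reference, the Gaussian decay used for recency can be written out as a small function. The formula follows the documented ElasticSearch `gauss` decay function; the 30-day `scale` and the `decay` of 0.5 are assumptions for illustration, not the values chosen in this pull request.

```python
import math


def gauss_decay(distance: float, scale: float, decay: float = 0.5, offset: float = 0.0) -> float:
    """Score multiplier for a value `distance` away from the origin.

    The multiplier is 1.0 within `offset` of the origin and drops to `decay`
    once the distance beyond the offset reaches `scale`.
    """
    sigma_squared = -(scale ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2.0 * sigma_squared))


if __name__ == "__main__":
    # Days since the account's last activity (illustrative values).
    for days in (0, 7, 30, 90, 365):
        print(f"{days:>3} days inactive -> multiplier {gauss_decay(days, scale=30):.6f}")
```

The curve falls off quickly after the scale is passed, so recency on its own separates "active this month" from everything else but says little about which of the active accounts is the relevant one.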
Now, same query, but combining all 3 scoring modifiers:
The first 4 results are real and active accounts. The fifth is fake, the sixth is inactive, the rest are dead. Seeing these suggestions, the user could press ENTER to get the desired completion immediately.
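One way to picture how the three modifiers interact is to multiply them onto the text relevance score. Everything in this sketch is an assumption made for illustration: the multiplicative combination, the `recency`, `reputation` and `size` helpers, their parameters, and the three invented candidates. The actual weights and normalization in this pull request may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    text_score: float     # relevance from the text match alone
    followers: int
    following: int
    days_inactive: float  # days since last activity


def recency(days: float, scale: float = 30.0, decay: float = 0.5) -> float:
    """Gaussian decay on inactivity (assumed 30-day scale)."""
    sigma_squared = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(days ** 2) / (2.0 * sigma_squared))


def reputation(followers: int, following: int) -> float:
    """Damped followers/following ratio, between 0 and 1 (hypothetical)."""
    return followers / (followers + following + 1)


def size(followers: int) -> float:
    """Logarithmic size factor; the +2 keeps it positive for zero followers."""
    return math.log10(followers + 2)


def combined_score(c: Candidate) -> float:
    return (
        c.text_score
        * recency(c.days_inactive)
        * reputation(c.followers, c.following)
        * size(c.followers)
    )


candidates = [
    Candidate("dead_popular", 1.0, followers=20000, following=100, days_inactive=400),
    Candidate("follow_bot", 1.0, followers=50, following=5000, days_inactive=1),
    Candidate("active_established", 1.0, followers=5000, following=500, days_inactive=1),
]

if __name__ == "__main__":
    for c in sorted(candidates, key=combined_score, reverse=True):
        print(f"{c.name}: {combined_score(c):.4f}")
```

With these invented numbers, each modifier removes a different kind of bad result: recency suppresses the dead account, the ratio suppresses the follow-bot, and the size factor lets the established account rise to the top.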
Search term "electro"
Expecting to find: "electroCutie@beach.city"
Status quo from PostgreSQL:
Just like in the other example, there is no real logic to the results and what we expect to find isn't there at all.
No additional scoring:
While an "electrocutie" is the second result, it is a different account that was last active in 2017.
Follower ratio affecting the score:
What we expect to find isn't there at all. The first result is an active and popular bot, but most of the other results are inactive or fake.
Followers number affecting the score:
The first two accounts are inactive. The third and fourth are real, and hey, it appears! What we expect to find is the 5th result.
Last activity affecting the score:
All the accounts have indeed posted within the last two months, and one of them is the one we expect to find. However, it is low on the list.
All combined:
The first result is real. The second has never posted. The third is inactive. The fourth is real and active. The fifth is what we were looking to find. Seeing these suggestions, the user would have to type another "c" to get the desired result and press ENTER.