Add more accurate account search #11537

Merged: 8 commits, Aug 15, 2019

@Gargron Gargron commented Aug 11, 2019

When Elasticsearch is available, a more accurate search is implemented:

  • Using edge n-gram index for acct and display name
  • Using asciifolding and cjk width normalization on display names
  • Using Gaussian decay on account activity for additional scoring (recency)
  • Using followers/friends ratio for additional scoring (spamminess)
  • Using followers number for additional scoring (size)

Additionally, the previous behaviour is also kept:

  • Prioritizing people you follow (closeness)

The exact match precedence only takes effect when the input conforms to the username format and the username part of it is complete, i.e. when the user has started typing the domain part.
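A minimal Python sketch of that condition, for illustration only. The pattern and the function name are hypothetical and are not the regex used in this PR; the sketch only shows the idea that a username counts as "complete" once an `@` separator for the domain has been typed.

```python
import re
from typing import Optional

# Hypothetical pattern: optional leading "@", a username, then "@" followed by
# a (possibly empty or partial) domain. This is NOT Mastodon's actual regex.
ACCT_WITH_DOMAIN = re.compile(
    r"^@?(?P<username>[A-Za-z0-9_]+)@(?P<domain>[^\s@]*)$"
)


def complete_username(query: str) -> Optional[str]:
    """Return the username part if the user has started typing a domain."""
    match = ACCT_WITH_DOMAIN.match(query.strip())
    return match.group("username") if match else None


for query in ["garg", "gargron@", "@Gargron@mastodon.soc", "hello world"]:
    print(repr(query), "->", complete_username(query))
```

Only the inputs that already contain the domain separator yield a username, so only those would be eligible for exact match precedence in this sketch.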

In practice, these changes have the following effects. People whose display names contain Unicode characters such as accents and umlauts can be found without typing those characters (asciifolding). Likewise, you no longer need to type the exact flavour of CJK characters (full-width or half-width) to get the same results. Accounts that haven't been active in a long time are less likely to appear near the top (recency), as are accounts that are follow-botting (spamminess). More established accounts, measured by followers number, are more likely to appear higher, though only among the results that match a given search. As before, the biggest factor is whether you're following someone.
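As a rough illustration of the text analysis, here is a minimal Python sketch. It is not the code from this PR and not Elasticsearch's actual filters: it approximates CJK width normalization with Unicode NFKC, approximates asciifolding by stripping accents from Latin letters only, and the function names and the `max_gram` default are made up for the example.

```python
import unicodedata
from typing import List


def normalize_width(text: str) -> str:
    # NFKC maps full-width Latin and half-width katakana to their usual forms.
    # This is a superset of what a CJK width filter does (an approximation).
    return unicodedata.normalize("NFKC", text)


def ascii_fold(text: str) -> str:
    # Strip combining marks only when the base character is ASCII, so that
    # "ü" becomes "u" while kana with voicing marks are left untouched.
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFKD", ch)
        if decomposed[0].isascii():
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
        else:
            out.append(ch)
    return "".join(out)


def normalize_display_name(text: str) -> str:
    return ascii_fold(normalize_width(text)).lower()


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 15) -> List[str]:
    # Prefixes of the token: these are what make search-as-you-type possible.
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


print(normalize_display_name("J\u00fcrgen"))
print(edge_ngrams(normalize_display_name("\uff27\uff41\uff52\uff47")))
```

With prefixes indexed this way, a partial input such as "garg" can match any account whose normalized name starts with it, regardless of accents or character width.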


Evaluations

I decided to run different queries for the same search term to see the differences between results and how they are scored. The queries did not consider follow relationships because a user that is followed by the searching user will always appear at the top of the results.

Search term "garg"

Expecting to find: "Gargron"

Here is the status quo, search results from PostgreSQL for comparison:

| Document score | Completion |
| --- | --- |
| 0.583333 | Gargy |
| 0.583333 | gargosoft@instance.business |
| 0.583333 | Gargo@kinky.business |
| 0.583333 | gargath@canislupus.im |
| 0.583333 | gargantia@peertube.mastodon.host |
| 0.583333 | gargoyle@witchcraft.cafe |
| 0.583333 | gargamel@botsin.space |
| 0.583333 | Gargantia |
| 0.583333 | gargronbot |
| 0.5 | AnvitZero |

The first result is a dead account; the rest are mostly bots.

Now, let's begin with a simple query without any additional scoring:

| Document score | Completion |
| --- | --- |
| 26.223595 | garg |
| 24.804735 | Gargy |
| 23.99578 | gargan |
| 23.249702 | Gargron |
| 22.583332 | gargle |
| 22.343393 | gargrant |
| 22.343393 | Gargabob |
| 21.893581 | Gargoyle |
| 21.849594 | Garg0yle |
| 21.849594 | gargirl_ |

The third result is an active account; the rest, not so much. They're also all local accounts.

Same query with follower ratio affecting the score:

| Document score | Completion | Follower ratio |
| --- | --- | --- |
| 23.213722 | Gargron | 0.9984334183049349 |
| 16.653728 | gargamel@botsin.space | 0.9487179487179487 |
| 16.490458 | gargron@mastodon.cloud | 0.9615384615384616 |
| 16.437838 | gargronbot | 0.8095238095238095 |
| 15.444919 | gargron@social.tchncs.de | 0.9444444444444444 |
| 14.700065 | Gargron@not.phrack.fyi | 0.8571428571428571 |
| 12.664553 | Gargron@workshop.chaurocks.com | 0.7368421052631579 |
| 12.636898 | Gargoyle89@bear.community | 0.7368421052631579 |
| 11.447411 | Gargron@pixelfed.social | 0.7 |
| 11.301861 | Garg0yle | 0.5172413793103449 |

There are more interesting results here. Some accounts are bots/inactive/fake, however. The follower ratio isn't very insightful alone because it behaves wildly at low numbers.
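To illustrate why the ratio behaves wildly at low numbers, here is a small Python sketch. It assumes the ratio is followers / (followers + following), which is only a guess that happens to be consistent with values such as 0.7 in the table above; the counts used below are hypothetical and not taken from real accounts.

```python
def follower_ratio(followers: int, following: int) -> float:
    """Assumed formula: share of followers among all follow relationships."""
    total = followers + following
    return followers / total if total else 0.0


# Hypothetical small account: a single extra follow moves the ratio a lot.
small_before = follower_ratio(7, 3)
small_after = follower_ratio(7, 4)

# Hypothetical large account: the same change is barely visible.
large_before = follower_ratio(3000, 300)
large_after = follower_ratio(3000, 301)

print("small account shift:", small_before - small_after)
print("large account shift:", large_before - large_after)
```

In this sketch a tiny account can reach a high ratio with a handful of followers, which is why the ratio alone says little about how established an account is.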

Same query, but with followers number affecting the score:

| Document score | Completion | Followers |
| --- | --- | --- |
| 125.62445 | Gargron | 253021 |
| 46.84463 | Gargron@tooting.ai | 730 |
| 39.91339 | Gargron@gonext.gg | 208 |
| 37.98905 | Gargy | 32 |
| 37.80327 | Gargantia | 75 |
| 34.1727 | Gargron@icosahedron.website | 91 |
| 31.940203 | gargosoft@instance.business | 64 |
| 27.929485 | gargamel@botsin.space | 37 |
| 27.49409 | gargath@canislupus.im | 46 |
| 26.885624 | Garg0yle | 15 |

These are pretty interesting results. The first two accounts are real and active. The third is real, but belongs to an instance that's been dead for a year. The fourth is a dead account. The fifth and sixth accounts are real and active.

Same query, but with last activity affecting the score:

| Document score | Completion | Last activity |
| --- | --- | --- |
| 23.993477 | gargan | 2019-07-14T09:13:39.612Z |
| 23.250185 | Gargron | 2019-08-13T15:54:40.449Z |
| 20.651749 | gargle | 2019-07-06T12:17:49.275Z |
| 20.038937 | Gargantia | 2019-08-13T14:32:59.966Z |
| 17.35989 | gargan@mstdn.onl | 2019-08-09T16:02:42.902Z |
| 17.35989 | Gargron@icosahedron.website | 2019-08-11T22:36:35.678Z |
| 17.150051 | gargron@mastodon.cloud | 2019-07-28T06:02:13.118Z |
| 16.35344 | gargron@social.tchncs.de | 2019-07-28T05:42:18.582Z |
| 16.35344 | gargaml@mamot.fr | 2019-07-25T05:42:19.984Z |
| 16.35344 | Gargron@tooting.ai | 2019-08-12T18:55:56.475Z |

These results seem a bit nonsensical. At the very least, they are no longer all local, and the long-inactive gonext.gg account is not among them.
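For reference, the shape of a Gaussian decay on last activity can be sketched in Python as below. The formula follows the Gaussian decay function as documented for Elasticsearch; the `scale_days`, `offset_days` and `decay` values are assumptions chosen for illustration, not the parameters used in this PR.

```python
import math
from datetime import datetime, timedelta, timezone


def gauss_decay(
    last_active: datetime,
    now: datetime,
    scale_days: float = 30.0,   # assumed: distance at which the score equals `decay`
    offset_days: float = 0.0,   # assumed: no grace period before decay starts
    decay: float = 0.5,         # assumed: score at `scale_days` from the origin
) -> float:
    """Multiplier in (0, 1]: 1.0 at the origin, `decay` at offset + scale."""
    distance = abs((now - last_active).total_seconds()) / 86400.0
    effective = max(0.0, distance - offset_days)
    sigma_sq = -(scale_days ** 2) / (2.0 * math.log(decay))
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


now = datetime(2019, 8, 13, tzinfo=timezone.utc)
for days in (0, 7, 30, 90, 365):
    print(days, "days ago ->", gauss_decay(now - timedelta(days=days), now))
```

With these assumed parameters, an account that was active today keeps its full text score, while one that has been silent for a year is multiplied by a value very close to zero.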

Now, same query, but combining all 3 scoring modifiers:

| Document score | Completion |
| --- | --- |
| 57.362843 | Gargron |
| 24.647789 | Gargron@tooting.ai |
| 22.533825 | Gargantia |
| 20.086798 | Gargron@icosahedron.website |
| 19.396187 | gargron@mastodon.cloud |
| 17.570133 | gargron@social.tchncs.de |
| 16.815432 | gargath@canislupus.im |
| 15.135833 | Gargy |
| 14.861097 | gargamel@botsin.space |
| 14.809118 | Gargron@gonext.gg |

The first 4 results are real and active accounts. The fifth is fake, the sixth is inactive, the rest are dead. Seeing these suggestions, the user could press ENTER to get the desired completion immediately.
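For readers who want to picture what combining the three modifiers could look like, here is a hedged Python sketch that builds an Elasticsearch `function_score` query body. It is not the query from this PR: the field names (`acct.edge_ngram`, `display_name.edge_ngram`, `followers_count`, `following_count`, `last_status_at`), the script, the `log2p` modifier, the decay parameters and the score modes are all assumptions made for illustration.

```python
import json


def build_account_query(term: str) -> dict:
    """Build an illustrative query body; every field name here is hypothetical."""
    return {
        "query": {
            "function_score": {
                # Base text match on the prefix-indexed fields.
                "query": {
                    "multi_match": {
                        "query": term,
                        "fields": ["acct.edge_ngram", "display_name.edge_ngram"],
                    }
                },
                "functions": [
                    # Spamminess: assumed followers / (followers + following) ratio.
                    {
                        "script_score": {
                            "script": {
                                "source": (
                                    "(doc['followers_count'].value + 0.0) / "
                                    "(doc['followers_count'].value + "
                                    "doc['following_count'].value + 1)"
                                )
                            }
                        }
                    },
                    # Size: dampened follower count.
                    {
                        "field_value_factor": {
                            "field": "followers_count",
                            "modifier": "log2p",
                            "missing": 0,
                        }
                    },
                    # Recency: Gaussian decay on last activity.
                    {
                        "gauss": {
                            "last_status_at": {
                                "origin": "now",
                                "scale": "30d",
                                "offset": "0d",
                                "decay": 0.5,
                            }
                        }
                    },
                ],
                "score_mode": "multiply",
                "boost_mode": "multiply",
            }
        }
    }


print(json.dumps(build_account_query("garg"), indent=2))
```

Multiplying the modifiers is only one option; the commit list further down mentions weights and normalization for scores, so the merged code likely balances the factors differently.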

Search term "electro"

Expecting to find: "electroCutie@beach.city"

Status quo from PostgreSQL:

| Document score | Completion |
| --- | --- |
| 0.583333 | 6klop_electronix@peertube.mastodon.host |
| 0.583333 | electronicbites |
| 0.583333 | electroniclife@zotum.net |
| 0.583333 | electronicfix |
| 0.583333 | electromeca |
| 0.583333 | Electrospaces |
| 0.583333 | electronlover |
| 0.583333 | ElectronKnight |
| 0.583333 | electropure@switter.at |
| 0.583333 | ElectronicOpus@mastodon.land |

Just like in the other example, there is no real logic to the results and what we expect to find isn't there at all.

No additional scoring:

| Document score | Completion |
| --- | --- |
| 20.166695 | electronio |
| 19.980896 | Electrocutie |
| 19.980896 | Electropanda |
| 19.51557 | electround |
| 19.51557 | electrofin |
| 19.32907 | electronicfix |
| 19.147978 | electromeca |
| 18.899029 | electrovert |
| 18.801424 | electrofryed |
| 18.500273 | electrofelix |

While an "Electrocutie" is the second result, it is a different account that was last active in 2017.

Follower ratio affecting the score:

| Document score | Completion | Follower ratio |
| --- | --- | --- |
| 0.99934167 | eff | 0.9993416721527321 |
| 0.93073595 | ElectronicOpus | 0.9307359307359307 |
| 0.9216758 | electronicfam | 0.9216757741347905 |
| 0.8947368 | electronicintifada@newsbots.eu | 0.8947368421052632 |
| 0.8888889 | KinkMuxer@monsterpit.net | 0.8888888888888888 |
| 0.71428573 | Electron@eldritch.cafe | 0.7142857142857143 |
| 0.69430053 | ElectronDance | 0.694300518134715 |
| 0.68 | Electrospaces | 0.68 |
| 0.59375 | electronic@mastodon.cloud | 0.59375 |
| 0.57894737 | Electronicfreak@chaos.social | 0.5789473684210527 |

What we expect to find isn't there at all. The first result is an active and popular bot, but most of the other results are inactive or fake.

Followers number affecting the score:

| Document score | Completion | Followers |
| --- | --- | --- |
| 48.421394 | electronicfam | 506 |
| 46.409096 | ElectronicOpus | 430 |
| 39.119625 | Electron@social.art-software.fr | 231 |
| 37.55837 | ElectronDance | 134 |
| 37.092392 | electroCutie@beach.city | 147 |
| 33.518337 | Electrolux@switter.at | 90 |
| 33.07969 | electrotamitha@octodon.social | 97 |
| 32.782784 | Electronic_Bunny@todon.nl | 93 |
| 25.746859 | eff | 3036 |
| 25.717287 | Electrohead@switter.at | 34 |

The first two accounts are inactive. The third and fourth are real, and hey, it appears! What we expect to find is the 5th result.

Last activity affecting the score:

| Document score | Completion | Last activity |
| --- | --- | --- |
| 20.166739 | electronio | 2019-07-24T22:11:32.799Z |
| 19.147995 | electromeca | 2019-08-13T16:57:55.021Z |
| 18.145172 | ElectronikBrain@noagendasocial.com | 2019-07-28T04:42:35.112Z |
| 17.603788 | ElectronDance | 2019-07-18T11:26:13.697Z |
| 17.068218 | electrotamitha@toot.cat | 2019-08-12T12:12:21.506Z |
| 17.068218 | electronic@mastodon.cloud | 2019-07-29T04:42:14.630Z |
| 17.068218 | electroCutie@beach.city | 2019-08-12T16:16:11.123Z |
| 16.795864 | electronicintifada@newsbots.eu | 2019-08-12T19:35:13.601Z |
| 16.795864 | Electron@pix.diaspodon.fr | 2019-08-07T19:47:17.972Z |
| 16.57605 | electrodnd21@todon.nl | 2019-08-01T15:42:59.433Z |

All of these accounts have indeed posted within the last two months, and one of them is the one we expect to find. However, it is low on the list.

All combined:

| Document score | Completion |
| --- | --- |
| 22.461493 | ElectronDance |
| 21.638235 | electronicfam |
| 20.932909 | ElectronicOpus |
| 20.93084 | Electron@social.art-software.fr |
| 20.209057 | electroCutie@beach.city |
| 19.339832 | Electronic_Bunny@todon.nl |
| 17.767172 | electronicintifada@newsbots.eu |
| 17.099651 | ElectronikBrain@noagendasocial.com |
| 16.590124 | electronic@mastodon.cloud |
| 16.419668 | electromeca |

The first result is real. The second has never posted. The third is inactive. The fourth is real and active. The fifth is what we were looking to find. Seeing these suggestions, the user would have to type another "c" to get the desired result and press ENTER.

@Gargron Gargron force-pushed the feature-better-account-search branch 2 times, most recently from 4b59bb2 to 4e9c3c7 August 11, 2019 15:42
@Gargron Gargron force-pushed the feature-better-account-search branch from 4e9c3c7 to e4986ab August 11, 2019 15:52
@Gargron Gargron force-pushed the feature-better-account-search branch from 7ea4606 to 0c88c25 August 11, 2019 21:04
@Gargron Gargron merged commit 8fdff27 into master Aug 15, 2019
@Gargron Gargron deleted the feature-better-account-search branch August 15, 2019 23:24
hiyuki2578 pushed a commit to ProjectMyosotis/mastodon that referenced this pull request Oct 2, 2019
* Add more accurate account search

* Support single-letter usernames

* Fix tests

* Fix not picking up account updates

* Add weights and normalization for scores, skip zero terms queries

* Use local counts for accounts index, adjust search parameters

* Fix mistakes

* Using updated_at of accounts is inadequate for remote accounts