Add more accurate account search #11537

Gargron · 2019-08-11T14:37:17Z

When ElasticSearch is available, a more accurate search is implemented:

- Using edge n-gram index for acct and display name
- Using asciifolding and cjk width normalization on display names
- Using Gaussian decay on account activity for additional scoring (recency)
- Using followers/friends ratio for additional scoring (spamminess)
- Using followers number for additional scoring (size)
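To illustrate the edge n-gram idea from the list above: an edge n-gram index stores every leading prefix of a name, so a partial query such as "garg" can be looked up as a whole token. This is a minimal Python sketch of the concept, not the Elasticsearch analyzer itself; `edge_ngrams`, `build_index`, and the `min_gram`/`max_gram` values are assumptions made for illustration.

```python
def edge_ngrams(text, min_gram=1, max_gram=15):
    """Return the lowercased leading prefixes of `text` (hypothetical limits)."""
    token = text.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_index(accts):
    """Map every prefix to the set of accounts that start with it."""
    index = {}
    for acct in accts:
        for gram in edge_ngrams(acct):
            index.setdefault(gram, set()).add(acct)
    return index


# Account names taken from the evaluation tables in this PR.
index = build_index(["Gargron", "gargamel@botsin.space", "eff"])

# The query is lowercased and looked up as a single token.
matches = sorted(index.get("garg", set()))
```

Because the prefixes are computed at index time, the lookup itself is a plain term match rather than a wildcard scan.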

Additionally, the previous behaviour is also kept:

- Prioritizing people you follow (closeness)

The exact match precedence only takes effect when the input conforms to the username format and the username part of it is complete, i.e. when the user started typing the domain part.
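The condition described above can be sketched as a pattern check: the input counts as a complete username only once an `@` and a (possibly empty) domain prefix follow the username part. The regular expression and the function name `username_is_complete` are simplified, hypothetical stand-ins and not the pattern used in the actual service.

```python
import re

# Hypothetical, simplified pattern: optional leading "@", a username,
# an "@", then whatever part of the domain has been typed so far.
USERNAME_WITH_DOMAIN = re.compile(r"^@?([A-Za-z0-9_]+)@([A-Za-z0-9.-]*)$")


def username_is_complete(query):
    """True once the user has started typing the domain part."""
    return USERNAME_WITH_DOMAIN.match(query.strip()) is not None
```

Under this sketch, `electro` would not trigger exact match precedence, while `electroCutie@bea` would.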

What does this mean? You can now find people whose display names contain unicode characters such as accents and umlauts without typing those characters (asciifolding). Likewise, you no longer need to use the exact flavour of CJK characters (full-width or half-width) to get the same results. Accounts that haven't been active in a long time are less likely to appear near the top (recency), and so are accounts that are follow-botting (spamminess). More established accounts, by followers number, are more likely to appear higher (but mind, that only applies among the matches for a specific search). Like before, the biggest factor is whether you're following someone.
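The effect of the two normalizations can be approximated with Python's standard `unicodedata` module. This is only a rough sketch of the behaviour, not the Elasticsearch `asciifolding` or `cjk_width` filters: it handles fewer cases (for example, it does not fold "ß" to "ss"), and the function names are made up for illustration.

```python
import unicodedata


def fold_char(ch):
    """Drop accents from Latin letters; leave other characters alone."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = decomposed[1:]
    # Only fold when the base letter is ASCII, so that kana such as
    # "ガ" keep their voicing mark.
    if base.isascii() and all(unicodedata.combining(c) for c in marks):
        return base
    return ch


def fold(text):
    """Normalize width (NFKC), strip Latin accents, and lowercase."""
    width_normalized = unicodedata.normalize("NFKC", text)
    return "".join(fold_char(ch) for ch in width_normalized).lower()
```

With this sketch, `fold("Jürgen")` and `fold("Jurgen")` produce the same string, as do full-width and half-width spellings of the same katakana.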

Evaluations

I decided to run different queries for the same search term to see the differences between results and how they are scored. The queries did not consider follow relationships because a user that is followed by the searching user will universally appear at the top of the results.

Search term "garg"

Expecting to find: "Gargron"

Here is the status quo, search results from PostgreSQL for comparison:

Document score	Completion
0.583333	Gargy
0.583333	gargosoft@instance.business
0.583333	Gargo@kinky.business
0.583333	gargath@canislupus.im
0.583333	gargantia@peertube.mastodon.host
0.583333	gargoyle@witchcraft.cafe
0.583333	gargamel@botsin.space
0.583333	Gargantia
0.583333	gargronbot
0.5	AnvitZero

The first result is a dead account, the rest are mostly bots.

Now, let's begin with a simple query without any additional scoring:

Document score	Completion
26.223595	garg
24.804735	Gargy
23.99578	gargan
23.249702	Gargron
22.583332	gargle
22.343393	gargrant
22.343393	Gargabob
21.893581	Gargoyle
21.849594	Garg0yle
21.849594	gargirl_

The third result is an active account, the rest, not so much. They're also all local accounts.

Same query with follower ratio affecting the score:

Document score	Completion	Follower ratio
23.213722	Gargron	0.9984334183049349
16.653728	gargamel@botsin.space	0.9487179487179487
16.490458	gargron@mastodon.cloud	0.9615384615384616
16.437838	gargronbot	0.8095238095238095
15.444919	gargron@social.tchncs.de	0.9444444444444444
14.700065	Gargron@not.phrack.fyi	0.8571428571428571
12.664553	Gargron@workshop.chaurocks.com	0.7368421052631579
12.636898	Gargoyle89@bear.community	0.7368421052631579
11.447411	Gargron@pixelfed.social	0.7
11.301861	Garg0yle	0.5172413793103449

There are more interesting results here. Some accounts are bots/inactive/fake, however. The follower ratio isn't very insightful alone because it behaves wildly at low numbers.
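To see why the ratio is unstable at low numbers, assume it is computed as followers / (followers + following). The PR does not spell out the formula, so this is an assumption, as is the helper name `follower_ratio`.

```python
def follower_ratio(followers, following):
    """Share of followers among all relationships (assumed formula)."""
    total = followers + following
    return followers / total if total else 0.0


# A single relationship swings the ratio between extremes...
tiny_a = follower_ratio(1, 0)   # 1.0
tiny_b = follower_ratio(1, 1)   # 0.5

# ...while a large account barely moves. 3036 is the followers number
# listed for "eff"; the following count of 2 is hypothetical.
large = follower_ratio(3036, 2)
```

An account with one follower and no follows scores as high as a well-established account, which is why the ratio is more useful alongside other signals.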

Same query, but with followers number affecting the score:

Document score	Completion	Followers
125.62445	Gargron	253021
46.84463	Gargron@tooting.ai	730
39.91339	Gargron@gonext.gg	208
37.98905	Gargy	32
37.80327	Gargantia	75
34.1727	Gargron@icosahedron.website	91
31.940203	gargosoft@instance.business	64
27.929485	gargamel@botsin.space	37
27.49409	gargath@canislupus.im	46
26.885624	Garg0yle	15

These are pretty interesting results. The first two accounts are real and active. The third is real, but belongs to an instance that's been dead for a year. The fourth is a dead account. The fifth and sixth accounts are real and active.
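For a sense of how the followers number can scale a score without letting huge accounts dominate completely, here is a sketch using a logarithmic factor. Multiplying the unmodified text score by log10(followers + 2) approximately reproduces the figures in the tables above, but that is an inference from the numbers, not something the PR states; the function names are hypothetical.

```python
import math


def followers_factor(followers):
    """Logarithmic boost; the +2 keeps the factor positive at zero followers."""
    return math.log10(followers + 2)


def boosted_score(text_score, followers):
    """Scale a text relevance score by the followers factor."""
    return text_score * followers_factor(followers)


# Text scores from the "no additional scoring" table,
# followers numbers from the table above.
gargron = boosted_score(23.249702, 253021)
gargy = boosted_score(24.804735, 32)
```

The logarithm means that going from 32 to 253021 followers multiplies the score by roughly 3.5 rather than by several thousand.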

Same query, but with last activity affecting the score:

Document score	Completion	Last activity
23.993477	gargan	2019-07-14T09:13:39.612Z
23.250185	Gargron	2019-08-13T15:54:40.449Z
20.651749	gargle	2019-07-06T12:17:49.275Z
20.038937	Gargantia	2019-08-13T14:32:59.966Z
17.35989	gargan@mstdn.onl	2019-08-09T16:02:42.902Z
17.35989	Gargron@icosahedron.website	2019-08-11T22:36:35.678Z
17.150051	gargron@mastodon.cloud	2019-07-28T06:02:13.118Z
16.35344	gargron@social.tchncs.de	2019-07-28T05:42:18.582Z
16.35344	gargaml@mamot.fr	2019-07-25T05:42:19.984Z
16.35344	Gargron@tooting.ai	2019-08-12T18:55:56.475Z

These results seem a bit nonsensical. At the very least, they are not all local, and the long-inactive gonext.gg account is not among them.
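The recency factor follows the shape of a Gaussian decay function: it stays at 1.0 for fresh activity and falls off smoothly with age. The sketch below uses the general formula for such a decay (as documented for Elasticsearch's `gauss` function score); the 30-day scale and 0.5 decay in the example are assumptions, not the values used in this PR.

```python
import math


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 within `offset`, `decay` at `offset + scale`."""
    # Choose the variance so that the curve hits `decay` at `scale`.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


# Days since last activity, with a hypothetical 30-day scale.
today = gauss_decay(0, scale=30)
one_month = gauss_decay(30, scale=30)
two_months = gauss_decay(60, scale=30)
```

Since the factor only ever multiplies the text score by a value between 0 and 1, it can demote stale accounts but never promote an account above its text relevance, which may explain why this table looks so similar to the unmodified one.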

Now, same query, but combining all 3 scoring modifiers:

Document score	Completion
57.362843	Gargron
24.647789	Gargron@tooting.ai
22.533825	Gargantia
20.086798	Gargron@icosahedron.website
19.396187	gargron@mastodon.cloud
17.570133	gargron@social.tchncs.de
16.815432	gargath@canislupus.im
15.135833	Gargy
14.861097	gargamel@botsin.space
14.809118	Gargron@gonext.gg

The first 4 results are real and active accounts. The fifth is fake, the sixth is inactive, the rest are dead. Seeing these suggestions, the user could press ENTER to get the desired completion immediately.

Search term "electro"

Expecting to find: "electroCutie@beach.city"

Status quo from PostgreSQL:

Document score	Completion
0.583333	6klop_electronix@peertube.mastodon.host
0.583333	electronicbites
0.583333	electroniclife@zotum.net
0.583333	electronicfix
0.583333	electromeca
0.583333	Electrospaces
0.583333	electronlover
0.583333	ElectronKnight
0.583333	electropure@switter.at
0.583333	ElectronicOpus@mastodon.land

Just like in the other example, there is no real logic to the results and what we expect to find isn't there at all.

No additional scoring:

Document score	Completion
20.166695	electronio
19.980896	Electrocutie
19.980896	Electropanda
19.51557	electround
19.51557	electrofin
19.32907	electronicfix
19.147978	electromeca
18.899029	electrovert
18.801424	electrofryed
18.500273	electrofelix

While an "Electrocutie" is the second result, that's a different account that was last active in 2017.

Follower ratio affecting the score:

Document score	Completion	Follower ratio
0.99934167	eff	0.9993416721527321
0.93073595	ElectronicOpus	0.9307359307359307
0.9216758	electronicfam	0.9216757741347905
0.8947368	electronicintifada@newsbots.eu	0.8947368421052632
0.8888889	KinkMuxer@monsterpit.net	0.8888888888888888
0.71428573	Electron@eldritch.cafe	0.7142857142857143
0.69430053	ElectronDance	0.694300518134715
0.68	Electrospaces	0.68
0.59375	electronic@mastodon.cloud	0.59375
0.57894737	Electronicfreak@chaos.social	0.5789473684210527

What we expect to find isn't there at all. The first result is an active and popular bot, but most of the other results are inactive or fake.

Followers number affecting the score:

Document score	Completion	Followers
48.421394	electronicfam	506
46.409096	ElectronicOpus	430
39.119625	Electron@social.art-software.fr	231
37.55837	ElectronDance	134
37.092392	electroCutie@beach.city	147
33.518337	Electrolux@switter.at	90
33.07969	electrotamitha@octodon.social	97
32.782784	Electronic_Bunny@todon.nl	93
25.746859	eff	3036
25.717287	Electrohead@switter.at	34

The first two accounts are inactive. The third and fourth are real, and hey, it appears! What we expect to find is the 5th result.

Last activity affecting the score:

Document score	Completion	Last activity
20.166739	electronio	2019-07-24T22:11:32.799Z
19.147995	electromeca	2019-08-13T16:57:55.021Z
18.145172	ElectronikBrain@noagendasocial.com	2019-07-28T04:42:35.112Z
17.603788	ElectronDance	2019-07-18T11:26:13.697Z
17.068218	electrotamitha@toot.cat	2019-08-12T12:12:21.506Z
17.068218	electronic@mastodon.cloud	2019-07-29T04:42:14.630Z
17.068218	electroCutie@beach.city	2019-08-12T16:16:11.123Z
16.795864	electronicintifada@newsbots.eu	2019-08-12T19:35:13.601Z
16.795864	Electron@pix.diaspodon.fr	2019-08-07T19:47:17.972Z
16.57605	electrodnd21@todon.nl	2019-08-01T15:42:59.433Z

All of these accounts have indeed posted within the last two months, and one of them is the one we expect to find. However, it is low on the list.

All combined:

Document score	Completion
22.461493	ElectronDance
21.638235	electronicfam
20.932909	ElectronicOpus
20.93084	Electron@social.art-software.fr
20.209057	electroCutie@beach.city
19.339832	Electronic_Bunny@todon.nl
17.767172	electronicintifada@newsbots.eu
17.099651	ElectronikBrain@noagendasocial.com
16.590124	electronic@mastodon.cloud
16.419668	electromeca

The first result is real. The second has never posted. The third is inactive. The fourth is real and active. The fifth is what we were looking to find. Seeing these suggestions, the user would have to type another "c" to get the desired result and press ENTER.


app/services/account_search_service.rb


Gargron force-pushed the feature-better-account-search branch 2 times, most recently from 4b59bb2 to 4e9c3c7 on August 11, 2019 15:42

Gargron force-pushed the feature-better-account-search branch from 4e9c3c7 to e4986ab on August 11, 2019 15:52

Support single-letter usernames

0c88c25

Gargron force-pushed the feature-better-account-search branch from 7ea4606 to 0c88c25 on August 11, 2019 21:04

Gargron added 2 commits August 12, 2019 01:08

Fix tests

a0f3842

Fix not picking up account updates

8fad486

ClearlyClaire reviewed Aug 12, 2019

app/services/account_search_service.rb (three review threads, now outdated and resolved)

Gargron added 4 commits August 12, 2019 21:58

Add weights and normalization for scores, skip zero terms queries

387cafc

Use local counts for accounts index, adjust search parameters

4dc328b

Fix mistakes

734ba46

Using updated_at of accounts is inadequate for remote accounts

0e3bbcf

nightpool approved these changes Aug 15, 2019


Gargron merged commit 8fdff27 into master Aug 15, 2019

Gargron deleted the feature-better-account-search branch August 15, 2019 23:24

Gargron mentioned this pull request Aug 15, 2019

Add more accurate hashtag search #11579

Merged
