Universal Search #9529

Open
AnTi-ArT opened this issue Dec 15, 2018 · 16 comments
Labels
suggestion Feature suggestion

Comments

@AnTi-ArT

Search "Everything"!

Problem: I just signed up to a few instances and noticed that searching for the same keyword yields vastly different results. That makes me wonder how I can use Mastodon and get the most out of it (i.e. not miss valuable content), and also how I am supposed to post something and reach the widest possible audience. It's extremely obscure.
Moreover, it makes choosing the instance overly complicated. Before I signed up I was under the impression that it does not matter that much, because I could still access everything outside the instance through "Federated". As it turns out "federated" = "someone followed by someone from your instance".

The federationBot from mastodon.host addresses exactly this need to see as much content as possible. (It doesn't help with posting to the biggest audience, though.)

Home, Local, Federated, UNIVERSAL

In my opinion (I've been on Mastodon since yesterday... (yes, I'm a tumblr refugee)) it would be very beneficial to have the option to Search "Everything", universally/globally. A "universal timeline", on the other hand, would probably be unusable.
This should, of course, respect blocked lists and similar settings from the user and instance (eg. blocking certain tags/words, blacklisted instances, SFW filters etc).

Benefits

  • easier to discover interesting people from other instances. (Like, how do I even discover someone who isn't followed by anyone from my instance right now??)
  • easier to find content that has no specific "interest" instance right now, or could fit into different instances with overlapping interests. And vice versa less thinking about "now where do I post this?"
  • Right now, the logical step is to just join the most popular instance to have access to the widest audience/content. Which somehow opposes the whole idea of "decentralization".
  • Consistent search results when using Global/Universal search option.
  • Less need for multiple accounts on different instances

Disclaimer

I searched the issues, but couldn't find exactly this feature request. I read about the relays, and about some suggestions like "I want to subscribe to a whole instance timeline", which addresses a similar need.
I am aware that the whole system probably does not support access to literally everything.
Also: I do not intend to replace Federated, because it has value in curating and filtering content.

@danhunsaker
Contributor

This is technically infeasible, and impossible within the protocol. There's literally no way to even know all the instances in the Fediverse, much less be able to search all of them.

One instance learns of the existence of another when a user on it attempts to follow someone on the other. Until then, the instances aren't aware of each other's existence, and this is as much a technical limitation as by design. It's the same thing with any other federated service, such as email.

Even reducing the scope to just instances the current one is already aware of (which won't result in the same search results across instances, because each instance keeps its own list of other instances it knows), searching across them all would require either duplicating every post that server hosts locally (rather than doing this just for posts the current instance's users have indicated they actually care about), or sending out a search request to each of them and waiting for a response before generating a list of results. The former places an exceptional amount of extra burden on the database, while the latter places an exceptional amount of extra burden on remote instances - not to mention opening an attack vector for things such as DDoS.

So we're left with the current scenario, of searching only posts that the current instance is already aware of (those created locally, or cached locally because of user interest in that specific content). There are other approaches being implemented which will make discovering users - and even instances - far easier, but this one just has too many layers of complexity and risk.

Unless somebody else can come up with an implementation where instance A doesn't have to store the entire content of instances B through Z, and B through Z don't have to dedicate arbitrary processing power to instance A.

@trwnh
Member

trwnh commented Dec 16, 2018

@danhunsaker It's not even a protocol thing I'd say. It's a more general search engine problem: you need to be able to discover and index content before you can search for it. There's nothing technical stopping anyone from crawling every single profile periodically to discover new accounts, then indexing all their public posts... aside from you maintaining a large database, that is.

On a protocol level, yes, posts are delivered to your followers' instances + to any instance that requests a post manually via searching for its direct URL. With relays, posts are delivered to every instance that subscribes to that relay. But short of every instance sending their posts to a mega-relay, and then having a mega-database subscribe to that relay, you're not going to be able to discover on a "universal" / "everything" level. Consider how large Google is as a company, given that they try to crawl and index the entire internet.

@Gargron Gargron added the suggestion Feature suggestion label Jan 20, 2019
@abeorch

abeorch commented Jul 25, 2019

I just now posted on #10537 - Rather than federate the content - what about allowing distributed search. Allow instances to receive and respond to search queries based on the relevant security factors and their particular search queries?

@stale

stale bot commented Oct 26, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the status/wontfix This will not be worked on label Oct 26, 2019
@sant0s12

sant0s12 commented Jun 7, 2020

I wish this was a thing by now...

@stale stale bot removed the status/wontfix This will not be worked on label Jun 7, 2020
@csuwildcat

Hey folks, I have kept tabs on this thread and just wanted to drop in to share something:

At Microsoft, we have been working with the open source and standards community on Decentralized Identifiers (DIDs), a standard specification that's in its final paces. We have also been working on an open source, robustly decentralized implementation of DIDs called ION - there are no authorities, trusted nodes, validator lists, etc. involved, not even Microsoft itself is an authority in the protocol. ION is not a blockchain, it's a logical DID protocol that can operate at massive scale, and has one really cool feature you folks might find interesting in relation to this thread: ION has the ability to derive the list of all identifiers in the system. Now, before your imagination runs wild with 'data on a public system', ION is not about identity data, it is only an ID that maps to public keys and endpoints that the ID holder can point to (which presumably can hold encrypted or public data, their choice).

If you used ION IDs for user IDs, there are many benefits: 1) IDs would be owned by the user in a way that is more robustly censorship and interdiction resistant than any other system we have ever had in the world of digital identity, and 2) you would actually have a directory of the IDs, such that you could do a crawl of all IDs that list a Mastodon Service Endpoint and do whatever queries against that server you wanted.

ION just hit v1.0 and will be rolling out shortly, but here's a post we did last year as we moved into beta: https://techcommunity.microsoft.com/t5/identity-standards-blog/ion-booting-up-the-network/ba-p/1441552

Let me know if you are interested, we'd love to talk about what this protocol could do for the Mastodon community!

@CrazyPython

Perhaps this could be implemented with a probabilistic data structure such as a Bloom Filter?
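
To make the Bloom filter idea concrete, here is a minimal sketch of one way it could work: each instance publishes a small filter of the hashtags it hosts, and a searching instance only contacts the hosts whose filter might contain the tag. Nothing like this exists in Mastodon as far as this thread shows; the `BloomFilter` class, `instances_worth_querying`, and the filter sizes are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def instances_worth_querying(tag: str, published_filters: dict) -> list:
    """Return hosts whose (hypothetical) published filter may contain the tag."""
    tag = tag.lower().lstrip("#")
    return [host for host, bf in published_filters.items() if bf.might_contain(tag)]
```

A filter can only say "definitely not here" or "maybe here", so it would cut down the number of remote queries rather than replace them.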

@OndraZizka

OndraZizka commented May 5, 2022

An idea on a global hashtag search. In short, individual nodes could redundantly track hashtags that "belong to them" by a hashing function.

  1. It would require a distributed index of nodes. That assumes that the nodes are in a fully connected graph. If not, then there would be one index per sub-graph.
    The index would contain just the addresses of the nodes and whether they want to participate in hashtag indexing.
    The address would be hashed into a number "hash1" by a known algorithm.

  2. Each node, when a post is created, would hash each hashtag, into "hash2".
    Then the node would pair the hash2 with several hash1 based on a known algorithm, and notify the nodes from the index under "hash1" about a new post with the given hashtag. (I think there's nothing in the protocol for such unsolicited notification, so that would need an amendment.)

  3. When a user looks for a hashtag, again, a hash2 would be computed, matched with the hash1's, and the respective nodes would be "queried". I understand that the protocol is designed to stay strictly pub-sub, so the query would need to be a subscription to a pseudo-user, eg. #sometag@some.node .

The hashing and matching would dynamically change based on the nodes index, to accommodate the new nodes, as is done in the infinitely scaling databases. That is probably the trickiest part, along with having the tags followed by reliable enough nodes.
Also, time-based sharding would need to occur to distribute the load of a hot hashtag. (The whole mechanism is inspired by how DynamoDB works internally.)

Was there any research in that direction? Thanks.
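
The hash1/hash2 pairing in the steps above can be sketched with a consistent-hash ring, which is one common way to do this kind of assignment (not necessarily the exact scheme meant here). This is only an illustration: `HashtagRing`, `stable_hash`, the choice of SHA-256, and the replica count are assumptions, and the notification and pseudo-user subscription steps are left out.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a 64-bit ring position, identical on every node."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashtagRing:
    def __init__(self, node_addresses, replicas: int = 3) -> None:
        # hash1: each participating node's address hashed onto the ring.
        self.ring = sorted((stable_hash(addr), addr) for addr in node_addresses)
        self.points = [h for h, _ in self.ring]
        self.replicas = min(replicas, len(self.ring))

    def nodes_for(self, hashtag: str) -> list:
        # hash2: the hashtag hashed onto the same ring.
        h2 = stable_hash(hashtag.lower().lstrip("#"))
        # Walk clockwise from hash2 and take the next `replicas` nodes.
        start = bisect.bisect_right(self.points, h2)
        return [
            self.ring[(start + i) % len(self.ring)][1] for i in range(self.replicas)
        ]
```

Because the posting node and the searching node compute the same `nodes_for(tag)` from the same index, they agree on which nodes to notify and which to query. Time-based sharding for hot hashtags is not covered by this sketch.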

@ellieayla

Fwiw, when I first started using mastodon this was exactly how I assumed federated search across instances worked;

Even reducing the scope to just instances the current one is already aware of ... searching across them all would require ... sending out a search request to each of them ...

I expected each instance to expose a search API, receive low-priority queries (either hashtag or full-text), do stemming/filtering locally, and respond with a dozen results to the user's local instance. The user's local instance sends n concurrent search queries, one to every instance it knows about, merges (and potentially discards some of) those results, and reveals them to the user as they arrive.

This would come with some avoidance of wasteful network traffic (cache query/response locally to avoid asking instances the same query in rapid succession, stop bothering all instances when the local query connection drops) or compute (deny queries from blocked instances, de/prioritize or opt out of contributing to federated search results entirely, limit to queries originating from followed users).

I haven't thought through the implementation or ramifications of such a thing. I merely believed someone already built it (but that I couldn't find the documentation).

@nemasu

nemasu commented Nov 10, 2022

Fwiw, when I first started using mastodon this was exactly how I assumed federated search across instances worked;

Even reducing the scope to just instances the current one is already aware of ... searching across them all would require ... sending out a search request to each of them ...

Same here. I looked into the documentation, and I think it would be relatively easy to implement a remote hashtag search feature at least. My instance is using relays and knows of many other instances, so we could keep track of these in the DB to use later. Then when a remote hashtag search is made, you can use the timelines/tag API with local set to true in order to avoid duplicates.

There is still the downside of actually knowing other instances, so to make use of this feature on a fresh or inactive instance, you would really need to use relays to build a list.

I'll probably put some time into adding this on my local instance to see if it even makes sense, as I just joined a couple of days ago.
Also, if this is just a bad idea for whatever reason, let me know, but personally I think this would make a huge difference in the usability of the platform.
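
For illustration, the remote hashtag lookup described above could be sketched like this. The path `/api/v1/timelines/tag/:hashtag` and its `local` and `limit` parameters are from the Mastodon API documentation as I understand it; the helper names, the host list, and deduplicating on the status `uri` field are assumptions. No network request is made here.

```python
from urllib.parse import quote, urlencode


def tag_timeline_url(host: str, hashtag: str, limit: int = 20) -> str:
    """Build the URL of a remote instance's hashtag timeline, local posts only."""
    tag = quote(hashtag.lstrip("#"), safe="")
    params = urlencode({"local": "true", "limit": limit})
    return f"https://{host}/api/v1/timelines/tag/{tag}?{params}"


def merge_statuses(responses):
    """Merge per-instance status lists, dropping statuses whose `uri` was seen."""
    seen, merged = set(), []
    for statuses in responses:
        for status in statuses:
            if status["uri"] not in seen:
                seen.add(status["uri"])
                merged.append(status)
    return merged
```

Setting `local=true` asks each remote instance only for posts it hosts itself, which should avoid most duplicates; `merge_statuses` is a safety net for any that still slip through.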

@DerZyklop

@nemasu I'm not familiar with how the relays work, but you still have to wait for the slowest server (bottleneck) before you can show results, right?

@nemasu

nemasu commented Nov 11, 2022

@DerZyklop I don't think so, just stream results to the user as they come in. I'm not yet sure if the front end has support for live updates like that, but it's definitely possible.

@CEbbinghaus

Can we rename this Issue since Universal no longer accurately represents the scope of this work?

I do however feel like search right now is lacking and that a better way to solve it would result in a vastly increased UX. Perhaps initially limiting it to Hashtags using bloom filters + hashes and hopefully expanding it to a better Full Text experience in future.

Realistically, the only two real problems I can see are latency and processing capacity, especially as the number of users grows rapidly. If each search pings 50 servers, then 100 searches across 50 servers (2 searches per server) would result in 5,000 additional requests that will have to be handled (real-life scaling would probably be way worse). This could pose a real problem especially as user and server counts are growing and could result in small scale DDOS attacks especially if lots of large instances are suddenly pinging smaller servers. Opting out only works insofar as other servers respect the opt-out.

@pointlessone
Contributor

This could pose a real problem especially as user and server counts are growing and could result in small scale DDOS attacks

Does the server have to mediate all the searches? This feature can be offloaded to the client side. The server can do the regular search but also add a list of known instances that accept outside search requests so that the client could also send a search request to those instances. If we want to be more thorough about it we can add authentication. It's more involved but probably easier/CPU-lighter than piping large amounts of search results.

@ajturner

I wanted to chime in that I agree it would be a positive user experience for a client to be able to dynamically search across multiple instances and show aggregated results.

This proposal would rely on parallel client (browser) XHR HTTP requests using the existing search API. This reduces load on the instance servers and mitigates any privacy concerns. It would even be possible to use different account credentials for each instance.

In my projects we've called this Distributed Search and often use the long-standing OpenSearch specification to define the search query interface. (not related to the more recent Elastic technology fork)

@strk

strk commented Jan 4, 2024

Have you considered using existing search engines by exposing the public posts to crawlers?
