Scraper should be opt-in #7
Comments
This is a bad idea overall. Nominating yourself to curate and collect data that no one asked for will probably not fly well.
I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems. I've worked with multiple instance admins throughout this process to ensure that user privacy is respected, and have made sure that this system can easily be opted out of by instances who don't want to be crawled. I'm taking all feedback seriously and have made several changes based on that feedback. This tool is meant to help instance admins out, but if it causes damage in the process it'll get taken down.
Regardless of intention, all of this should be opt-in by default. Not the other way around.
Yeah, generally people hate these tools, even if they were made with good/neutral intentions. I also am very wary of tools like this.
+1 on making this opt-in.
PLEASE. Opt-in. Instances have already blocked hachyderm for this.
@tedivm If opting out of this tool does anything meaningful, what happens when the AP proxy software starts opting out? Conversely, why collect so much info that isn't relevant to identifying proxies by counting subdomains?
Using your own service, we can see mastinator.com is blocked a ton for the exact things you are trying to do (scraping, violating consent, not going with the culture).
Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes. I can believe you when you say this was meant to address real issues around blocklist-evasion, but the data it currently exposes will merely replace one threat with another. I suggest deleting existing data, turning it off, and asking Fedi for feedback before starting a project like this again. I do think it's possible to publish very limited aggregate data that doesn't enable targeted harassment, but this isn't how it's done.
You NEED to make this opt-in only. Your opt-out method is also inadequate: not everyone who's running a Masto instance can alter their robots.txt, thanks to limitations from hosting providers.
So then complain to the hosting providers. Why is it everybody else's problem?
Because everyone else is writing scrapers...
I have removed those endpoints from the service.
There is a social contract on the internet that the site owner puts a robots.txt file at the root to dictate what bots do. The internet (and even law in some cases) is built around this contract. This is how the internet has worked for ages. You have a means to not have /any/ bot spider and get information. It takes one line in robots.txt, or a blocking of the user agent, or a flip of a switch to not advertise the information. I believe the last one should be possible on a shared host running Mastodon, and I'd advise it if you're concerned about this information being public. There are people who do not care about robots.txt and will get the information if it's available, and security through hoping-no-bad-people-will-ever-collect-this-information-I-plaster-all-over-the-place is a bad model.
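To make the "one line" concrete, an admin who can edit robots.txt can refuse either a single crawler or all of them. A minimal sketch, assuming a hypothetical user-agent token (this project's actual token isn't stated in the thread):

```
# Refuse one specific crawler (placeholder token, not necessarily this project's)
User-agent: ExampleFediCrawler
Disallow: /

# Or refuse every crawler that honors robots.txt
User-agent: *
Disallow: /
```

A crawler that respects the file uses the most specific User-agent group that matches it and falls back to *, so the second form blocks every well-behaved bot while the first only targets one.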
There's also a social contract on fedi not to build scrapers lol
I don't think so. There are very popular sites like fediverse.observer that spider instances and collect similar information. The sentiment that something like this is bad is not a common one in my experience; I think that it's a vocal minority.
Well then we can disagree, but this thread suggests otherwise.
The fact remains: I'm an instance admin and I CAN'T opt out. This thing NEEDS to be shut down until it is opt-in only.
I'll also add that nodeinfo is designed for tools like this.
Ways of doing so, even under shared hosting, are possible as I explained earlier.
Since the creator's fedi account has been suspended, they might not have seen my reply so I'll copy it here since it's relevant:
Opt in or get out.
I'm referring to what I said about not exposing the blocklist itself. I believe that is an option. On mastodon.social you can also see obscured domain names on the about page.
robots.txt allowing bots is opting in. This project respects robots.txt.
No, that's opt-out, literally the opposite, and the documentation doesn't give any options other than using robots.txt.
It's an implicit opt-in. I'd advise you to read my earlier comment.
On Mon, Jan 09, 2023 at 01:17:09PM -0800, webb wrote:
robots.txt allowing bots is opting in
@w3bb opting-out of opting-out is not the same as opting in, because consent isn't a freaking multiplication problem.
The term for that is "opt out".
This is how the internet has worked for ages. If people had to get manual consent for spidering, search engines would have been impossible. People should be complaining to their hosts, who can't spend fifteen minutes to add a basic option like that, instead of to spiders that have no reasonable means of knowing you can't use robots.txt.
On Mon, Jan 09, 2023 at 01:27:05PM -0800, webb wrote:
This is how the internet has worked for ages. If people had to get manual consent for spidering, search engines would have been impossible.
That's the point. There's a reason why just about every attempt at a Fediverse search engine has been network-filtered, Fediblocked, tarpitted, and fed bad data until it shut down. Tools like search engines which aren't opt-in aren't welcome on Fedi.
--
Seirdy (https://seirdy.one)
This is a false equivalence. Like I mentioned earlier up in the thread (I assume it hasn't come through to you on email), a better comparison would be to something like fediverse.observer. I believe this uses a Mastodon endpoint, but the same information is exposed via nodeinfo, which is explicitly designed for tools like these. fediverse.observer also uses nodeinfo.
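For context on what a NodeInfo lookup actually involves: it is a two-step fetch against public, unauthenticated endpoints. A minimal sketch in Python, not this project's actual code, with mastodon.social used purely as an illustrative target:

```python
# Standard NodeInfo discovery: a well-known document lists schema links,
# and the linked document carries software name/version and aggregate usage.
import json
import urllib.request


def fetch_json(url: str) -> dict:
    """GET a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def fetch_nodeinfo(domain: str) -> dict:
    """Resolve and fetch an instance's NodeInfo document."""
    discovery = fetch_json(f"https://{domain}/.well-known/nodeinfo")
    nodeinfo_url = discovery["links"][0]["href"]  # follow the advertised schema link
    return fetch_json(nodeinfo_url)


if __name__ == "__main__":
    info = fetch_nodeinfo("mastodon.social")  # illustrative instance
    print(info.get("software"))  # e.g. {"name": "mastodon", "version": ...}
    print(info.get("usage"))     # aggregate user and post counts
```

This is the same aggregate data that fediverse.observer and similar crawlers collect; blocklists and per-user data are not part of NodeInfo.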
(For some reason it sent it while I was typing a draft, sorry about that.)
Just to be clear, I'm not an admin on Hachyderm. The people spreading that rumor are wrong. Hachyderm has nothing to do with this project other than me making a post there.
Did you read anything here? That's not what anyone is discussing.
I'm reading everything that comes through, but I wanted to clear that one piece of information up.
Bruh, if it was opt-in, it would be useless because nobody would bother to opt in! Imagine if you had to opt in to bots on the world wide web; search engines like Google wouldn't really work because a lot of sites don't care and don't have a robots.txt! Well, the fediverse is based on the world wide web, so the same concept applies.
The website is down. Thank you all for your feedback. |
The fediverse has a unique culture around scrapers, and I believe you should probably make this opt-in if you don't wanna start getting bashed all over fedi.