This repository has been archived by the owner on Jan 9, 2023. It is now read-only.

Scraper should be opt-in #7

Closed
l3gacyb3ta opened this issue Jan 9, 2023 · 38 comments

Comments

@l3gacyb3ta

The fediverse has a unique culture around scrapers, and I believe you should probably make this opt-in if you don't want to start getting bashed all over fedi.

@nahga

nahga commented Jan 9, 2023

This is a bad idea overall. Nominating yourself to curate and collect data that no one asked you to collect will probably not go over well.

@tedivm
Owner

tedivm commented Jan 9, 2023

I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems. I've worked with multiple instance admins throughout this process to ensure that user privacy is respected, and have made sure that this system can easily be opted out of by instances who don't want to be crawled.

I'm taking all feedback seriously and have made several changes based on that feedback. This tool is meant to help instance admins out, but if it causes damage in the process it'll get taken down.

@tedivm
Owner

tedivm commented Jan 9, 2023

Here's some context on the original abuse issue that led to this project being created.

@nahga

nahga commented Jan 9, 2023

Regardless of intention, all of this should be opt-in by default. Not the other way around.

@l3gacyb3ta
Author

Yeah, generally people hate these tools, even when they're made with good or neutral intentions. I'm also very wary of tools like this.

@mothdotmonster

+1 on making this opt-in.

@Freeplayg

PLEASE. Opt-in.

Instances have already blocked Hachyderm for this.

@VyrCossont

I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems.

@tedivm If opting out of this tool does anything meaningful, what happens when the AP proxy software starts opting out? Conversely, why collect so much info not relevant to identifying proxies by counting subdomains?

@l3gacyb3ta
Author

Using your own service, we can see that mastinator.com is blocked widely for doing exactly what you're trying to do (scraping, violating consent, going against the culture).

@Seirdy

Seirdy commented Jan 9, 2023

Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes.

I can believe you when you say this was meant to address real issues around blocklist-evasion, but the data it currently exposes will merely replace one threat with another. I suggest deleting existing data, turning it off, and asking Fedi for feedback before starting a project like this again. I do think it's possible to publish very limited aggregate data that doesn't enable targeted harassment, but this isn't how it's done.

@bgcarlisle

You NEED to make this opt-in only

Your opt-out method is also inadequate

Not everyone who's running a Masto instance can alter their robots.txt, thanks to limitations from hosting providers

@w3bb

w3bb commented Jan 9, 2023

Not everyone who's running a Masto instance can alter their robots.txt, thanks to limitations from hosting providers

So then complain to the hosting providers. Why is it everybody else's problem?

@l3gacyb3ta
Author

So then complain to the hosting providers. Why is it everybody else's problem?

Because everyone else is writing scrapers...

@tedivm
Owner

tedivm commented Jan 9, 2023

Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes.

I have removed those endpoints from the service.

@w3bb

w3bb commented Jan 9, 2023

Because everyone else is writing scrapers...

There is a social contract on the internet that the site owner puts a robots.txt file at the root to dictate what bots do. The internet (and even the law, in some cases) is built around this contract. This is how the internet has worked for ages. You have a means to keep any bot from spidering the site and collecting information.

It takes one line in robots.txt, a block on the user agent, or the flip of a switch to stop advertising the information. I believe the last option should be possible even on a shared host running Mastodon, and I'd advise it if you're concerned about this information being public. There are people who do not care about robots.txt and will get the information if it's available, and security through hoping that no bad actor will ever collect information you plaster all over the place is a bad model.
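
For reference, a minimal robots.txt along these lines covers both routes; the fedimapper user-agent token below is an assumption on my part, and the project's "Block this bot" documentation is the authoritative source for the exact string:

    # Block just this crawler (user-agent token assumed; check the project docs)
    User-agent: fedimapper
    Disallow: /

    # Or block every well-behaved crawler
    User-agent: *
    Disallow: /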

@l3gacyb3ta
Author

There is a social contract on the internet that the site owner puts a robots.txt file at the root to dictate what bots do.

There's also a social contract on fedi not to build scrapers lol

@w3bb

w3bb commented Jan 9, 2023

There's also a social contract on fedi not to build [spiders] lol

I don't think so. There are very popular sites like fediverse.observer that spider instances and collect similar information. In my experience, the sentiment that something like this is bad is not a common one; I think it's a vocal minority.

@l3gacyb3ta
Author

Well then we can disagree, but this thread suggests otherwise

@bgcarlisle

The fact remains, I'm an instance admin and I CAN'T opt out

This thing NEEDS to be shut down until it is opt-in only

@w3bb

w3bb commented Jan 9, 2023

I'll also add that nodeinfo is designed for tools like this.

NodeInfo is an effort to create a standardized way of exposing metadata about a server running one of the distributed social networks. The two key goals are being able to get better insights into the user base of distributed social networking and the ability to build tools that allow users to choose the best fitting software and server for their needs.
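
To make that concrete, here is a rough Python sketch of how a tool can discover and read a server's NodeInfo document through the well-known endpoint; example.social is a placeholder hostname, not a real instance:

    # Rough sketch: discover and fetch a server's NodeInfo document.
    import json
    import urllib.request

    host = "example.social"  # placeholder hostname

    # Step 1: the well-known discovery document lists the available NodeInfo schemas.
    with urllib.request.urlopen(f"https://{host}/.well-known/nodeinfo") as resp:
        discovery = json.load(resp)

    # Step 2: follow one of the advertised links to the actual NodeInfo document.
    nodeinfo_url = discovery["links"][0]["href"]
    with urllib.request.urlopen(nodeinfo_url) as resp:
        nodeinfo = json.load(resp)

    # NodeInfo exposes software name/version and aggregate usage statistics.
    print(nodeinfo["software"], nodeinfo.get("usage", {}))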

@w3bb

w3bb commented Jan 9, 2023

The fact remains, I'm an instance admin and I CAN'T opt out

This thing NEEDS to be shut down until it is opt-in only

Ways of doing so, even under shared hosting, are possible as I explained earlier.

@Seirdy

Seirdy commented Jan 9, 2023

Since the creator's fedi account has been suspended, they might not have seen my reply so I'll copy it here since it's relevant:

Listen. People do not trust you right now. You potentially hold a ton of data and people feel unsafe merely knowing you have it. Tons of people have archived the data you exposed already and are literally going through it right now.

You need to over-correct fast, and that means shutting this down.

@ThatOneCalculator

Opt in or get out.

@w3bb

w3bb commented Jan 9, 2023

(This is the documentation: https://github.com/tedivm/fedimapper#block-this-bot)

I'm referring to what I said about not exposing the blocklist itself; I believe that is an option. On mastodon.social you can also see obscured domain names on the about page.

@w3bb

w3bb commented Jan 9, 2023

Opt in or get out.

robots.txt allowing bots is opting in. This project respects robots.txt.
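
To illustrate what honoring robots.txt means mechanically, here is a generic Python sketch, not necessarily how fedimapper itself is implemented; the user-agent token and the target path are assumptions:

    # Generic sketch of a crawler honoring robots.txt before it fetches anything.
    from urllib import robotparser

    host = "example.social"      # placeholder instance
    user_agent = "fedimapper"    # assumed user-agent token

    robots = robotparser.RobotFileParser(f"https://{host}/robots.txt")
    robots.read()

    target = f"https://{host}/api/v1/instance/peers"  # example API path
    if robots.can_fetch(user_agent, target):
        print("robots.txt permits crawling", target)
    else:
        print("robots.txt disallows crawling", target, "- skipping this instance")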

@bgcarlisle

No, that's opt-out, literally the opposite, and the documentation doesn't give any options other than using robots.txt

@w3bb

w3bb commented Jan 9, 2023

No, that's opt-out, literally the opposite, and the documentation doesn't give any options other than using robots.txt

It's an implicit opt-in. I'd advise you to read my earlier comment.

@Seirdy

Seirdy commented Jan 9, 2023 via email

@bgcarlisle

It's an implicit opt-in

The term for that is "opt out"

@w3bb

w3bb commented Jan 9, 2023

@w3bb opting-out of opting-out is not the same as opting in, because consent isn't a freaking multiplication problem.

This is how the internet has worked for ages. If people had to get manual consent for spidering, search engines would have been impossible. People should be complaining to their hosts, who can't spend fifteen minutes to add a basic option like that, rather than to spiders that have no reasonable way of knowing you can't use robots.txt.

@Seirdy

Seirdy commented Jan 9, 2023 via email

@w3bb

w3bb commented Jan 9, 2023

That's the point. There's a reason why just about every attempt at a Fediverse search engine has been network-filtered, Fediblocked, tarpitted, and fed bad data until it shut down. Tools like search engines which aren't opt-in aren't welcome on Fedi.

This is a false equivalence. As I mentioned earlier in the thread (I assume it hasn't come through to you over email), a better comparison would be something like fediverse.observer. I believe this project uses a Mastodon endpoint, but the same information is exposed via nodeinfo, which is explicitly designed for tools like these; fediverse.observer also uses nodeinfo.

@w3bb

w3bb commented Jan 9, 2023

(For some reason it sent while I was still typing a draft, sorry about that.)

@tedivm
Owner

tedivm commented Jan 9, 2023

Just to be clear, I'm not an admin on Hachyderm. The people spreading that rumor are wrong. Hachyderm has nothing to do with this project other than me making a post there.

@bgcarlisle

Did you read anything here?

That's not what anyone is discussing

@tedivm
Owner

tedivm commented Jan 9, 2023

I'm reading everything that comes through, but I wanted to clear that one piece of information up.

@ledlamp

ledlamp commented Jan 9, 2023

Bruh, if it was opt-in, it would be useless because nobody would bother to opt in! Imagine if you had to opt in to bots on the world wide web; search engines like Google wouldn't really work, because a lot of sites don't care and don't have a robots.txt! Well, the fediverse is built on the world wide web, so the same concept applies.

@tedivm
Owner

tedivm commented Jan 9, 2023

The website is down. Thank you all for your feedback.

@tedivm tedivm closed this as completed Jan 9, 2023