This repository has been archived by the owner on Jan 9, 2023. It is now read-only.

Scraper should be opt-in #7

Closed
l3gacyb3ta opened this issue Jan 9, 2023 · 38 comments

Comments

@l3gacyb3ta

The fediverse has a unique culture around scrapers, and I believe you should probably make this opt-in if you don't want to start getting bashed all over fedi.

@nahga

nahga commented Jan 9, 2023

This is a bad idea overall. Nominating yourself to curate and collect data that no one asked you to collect will probably not go over well.

@tedivm
Owner

tedivm commented Jan 9, 2023

I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems. I've worked with multiple instance admins throughout this process to ensure that user privacy is respected, and have made sure that this system can easily be opted out of by instances who don't want to be crawled.

I'm taking all feedback seriously and have made several changes based on that feedback. This tool is meant to help instance admins out, but if it causes damage in the process it'll get taken down.

@tedivm
Owner

tedivm commented Jan 9, 2023

Here's some context on the original abuse issue that led to this project being created.

@nahga

nahga commented Jan 9, 2023

Regardless of intention, all of this should be opt-in by default. Not the other way around.

@l3gacyb3ta
Author

Yeah, generally people hate these tools, even when they're made with good or neutral intentions. I'm also very wary of tools like this.

@mothdotmonster

+1 on making this opt-in.

@Freeplayg

PLEASE. Opt-in.

Instances have already blocked Hachyderm for this.

@VyrCossont

I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems.

@tedivm If opting out of this tool does anything meaningful, what happens when the AP proxy software starts opting out? Conversely, why collect so much info not relevant to identifying proxies by counting subdomains?

@l3gacyb3ta
Author

Using your own service, we can see that mastinator.com is blocked widely for doing exactly what you're trying to do (scraping, violating consent, going against the culture).

@Seirdy

Seirdy commented Jan 9, 2023

Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes.

I can believe you when you say this was meant to address real issues around blocklist-evasion, but the data it currently exposes will merely replace one threat with another. I suggest deleting existing data, turning it off, and asking Fedi for feedback before starting a project like this again. I do think it's possible to publish very limited aggregate data that doesn't enable targeted harassment, but this isn't how it's done.

@bgcarlisle

You NEED to make this opt-in only

Your opt-out method is also inadequate

Not everyone who's running a Masto instance can alter their robots.txt, thanks to limitations from hosting providers

@w3bb

w3bb commented Jan 9, 2023

Not everyone who's running a Masto instance can alter their robots.txt, thanks to limitations from hosting providers

So then complain to the hosting providers. Why is it everybody else's problem?

@l3gacyb3ta
Author

So then complain to the hosting providers. Why is it everybody else's problem?

Because everyone else is writing scrapers...

@tedivm
Owner

tedivm commented Jan 9, 2023

Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes.

I have removed those endpoints from the service.

@w3bb

w3bb commented Jan 9, 2023

Because everyone else is writing scrapers...

There is a social contract on the internet that the site owner puts a robots.txt file at the root to dictate what bots do. The internet (and even the law, in some cases) is built around this contract. This is how the internet has worked for ages. You have a means to keep any bot from spidering the site and collecting information.

It takes one line in robots.txt, a block on the user agent, or the flip of a switch to stop advertising the information. I believe the last option should be possible even on a shared host running Mastodon, and I'd advise it if you're concerned about this information being public. There are people who do not care about robots.txt and will get the information if it's available, and security through hoping that no bad actor will ever collect information you plaster all over the place is a bad model.
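
For reference, a minimal robots.txt along these lines covers both routes; the fedimapper user-agent token below is an assumption on my part, and the project's "Block this bot" documentation is the authoritative source for the exact string:

    # Block just this crawler (user-agent token assumed; check the project docs)
    User-agent: fedimapper
    Disallow: /

    # Or block every well-behaved crawler
    User-agent: *
    Disallow: /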

@l3gacyb3ta
Author

There is a social contract on the internet that the site owner puts a robots.txt file at the root to dictate what bots do.

There's also a social contract on fedi not to build scrapers lol

@w3bb

w3bb commented Jan 9, 2023

There's also a social contract on fedi not to build [spiders] lol

I don't think so. There are very popular sites like fediverse.observer that spider instances and collect similar information. In my experience, the sentiment that something like this is bad is not a common one; I think it's a vocal minority.

@l3gacyb3ta
Author

Well then we can disagree, but this thread suggests otherwise

@bgcarlisle

The fact remains, I'm an instance admin and I CAN'T opt out

This thing NEEDS to be shut down until it is opt-in only

@w3bb

w3bb commented Jan 9, 2023

I'll also add that nodeinfo is designed for tools like this.

NodeInfo is an effort to create a standardized way of exposing metadata about a server running one of the distributed social networks. The two key goals are being able to get better insights into the user base of distributed social networking and the ability to build tools that allow users to choose the best fitting software and server for their needs.
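
To make that concrete, here is a rough Python sketch of how a tool can discover and read a server's NodeInfo document through the well-known endpoint; example.social is a placeholder hostname, not a real instance:

    # Rough sketch: discover and fetch a server's NodeInfo document.
    import json
    import urllib.request

    host = "example.social"  # placeholder hostname

    # Step 1: the well-known discovery document lists the available NodeInfo schemas.
    with urllib.request.urlopen(f"https://{host}/.well-known/nodeinfo") as resp:
        discovery = json.load(resp)

    # Step 2: follow one of the advertised links to the actual NodeInfo document.
    nodeinfo_url = discovery["links"][0]["href"]
    with urllib.request.urlopen(nodeinfo_url) as resp:
        nodeinfo = json.load(resp)

    # NodeInfo exposes software name/version and aggregate usage statistics.
    print(nodeinfo["software"], nodeinfo.get("usage", {}))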

@w3bb

w3bb commented Jan 9, 2023

The fact remains, I'm an instance admin and I CAN'T opt out

This thing NEEDS to be shut down until it is opt-in only

Ways of doing so, even under shared hosting, are possible as I explained earlier.

@Seirdy

Seirdy commented Jan 9, 2023

Since the creator's fedi account has been suspended, they might not have seen my reply so I'll copy it here since it's relevant:

Listen. People do not trust you right now. You potentially hold a ton of data and people feel unsafe merely knowing you have it. Tons of people have archived the data you exposed already and are literally going through it right now.

You need to over-correct fast, and that means shutting this down.

@ThatOneCalculator

Opt in or get out.

@w3bb

w3bb commented Jan 9, 2023

(This is the documentation: https://github.com/tedivm/fedimapper#block-this-bot)

I'm referring to what I said about not exposing the blocklist itself; I believe that is an option. On mastodon.social you can also see obscured domain names on the about page.

@w3bb

w3bb commented Jan 9, 2023

Opt in or get out.

robots.txt allowing bots is opting in. This project respects robots.txt.
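
To illustrate what honoring robots.txt means mechanically, here is a generic Python sketch, not necessarily how fedimapper itself is implemented; the user-agent token and the target path are assumptions:

    # Generic sketch of a crawler honoring robots.txt before it fetches anything.
    from urllib import robotparser

    host = "example.social"      # placeholder instance
    user_agent = "fedimapper"    # assumed user-agent token

    robots = robotparser.RobotFileParser(f"https://{host}/robots.txt")
    robots.read()

    target = f"https://{host}/api/v1/instance/peers"  # example API path
    if robots.can_fetch(user_agent, target):
        print("robots.txt permits crawling", target)
    else:
        print("robots.txt disallows crawling", target, "- skipping this instance")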

@bgcarlisle

No, that's opt-out, literally the opposite, and the documentation doesn't give any options other than using robots.txt

@w3bb

w3bb commented Jan 9, 2023

No, that's opt-out, literally the opposite, and the documentation doesn't give any options other than using robots.txt

It's an implicit opt-in. I'd advise you to read my earlier comment.

@Seirdy

Seirdy commented Jan 9, 2023 via email

@bgcarlisle

It's an implicit opt-in

The term for that is "opt out"

@w3bb

w3bb commented Jan 9, 2023

@w3bb opting-out of opting-out is not the same as opting in, because consent isn't a freaking multiplication problem.

This is how the internet has worked for ages. If people had to get manual consent for spidering, search engines would have been impossible. People should be complaining to their hosts, who can't spend fifteen minutes to add a basic option like that, rather than to spiders that have no reasonable way of knowing you can't use robots.txt.

@Seirdy

Seirdy commented Jan 9, 2023 via email

@w3bb

w3bb commented Jan 9, 2023

That's the point. There's a reason why just about every attempt at a Fediverse search engine has been network-filtered, Fediblocked, tarpitted, and fed bad data until it shut down. Tools like search engines which aren't opt-in aren't welcome on Fedi.

This is a false equivalence. As I mentioned earlier in the thread (I assume it hasn't come through to you over email), a better comparison would be something like fediverse.observer. I believe this project uses a Mastodon endpoint, but the same information is exposed via nodeinfo, which is explicitly designed for tools like these; fediverse.observer also uses nodeinfo.

@w3bb

w3bb commented Jan 9, 2023

(For some reason it sent while I was still typing a draft, sorry about that.)

@tedivm
Owner

tedivm commented Jan 9, 2023

Just to be clear, I'm not an admin on Hachyderm. The people spreading that rumor are wrong. Hachyderm has nothing to do with this project other than me making a post there.

@bgcarlisle

Did you read anything here?

That's not what anyone is discussing

@tedivm
Owner

tedivm commented Jan 9, 2023

I'm reading everything that comes through, but I wanted to clear that one piece of information up.

@ledlamp

ledlamp commented Jan 9, 2023

Bruh, if it was opt-in, it would be useless because nobody would bother to opt in! Imagine if you had to opt in to bots on the world wide web; search engines like Google wouldn't really work, because a lot of sites don't care and don't have a robots.txt! Well, the fediverse is built on the world wide web, so the same concept applies.

@tedivm
Owner

tedivm commented Jan 9, 2023

The website is down. Thank you all for your feedback.

@tedivm tedivm closed this as completed Jan 9, 2023