This repository has been archived by the owner on Jan 9, 2023. It is now read-only.

Scrapper should be opt-in #7

Closed
l3gacyb3ta opened this issue Jan 9, 2023 · 38 comments

Comments

@l3gacyb3ta

The fediverse has a unique culture towards scrapers, and I believe that you should probably make this opt-in if you don't wanna start getting bashed all over fedi.

@nahga

nahga commented Jan 9, 2023

This is a bad idea overall. Nominating yourself to curate and collect data that no one asked you to collect will probably not go over well.

@tedivm
Owner

tedivm commented Jan 9, 2023

I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems. I've worked with multiple instance admins throughout this process to ensure that user privacy is respected, and have made sure that this system can easily be opted out of by instances who don't want to be crawled.

I'm taking all feedback seriously and have made several changes based on that feedback. This tool is meant to help instance admins out, but if it causes damage in the process it'll get taken down.

@tedivm
Owner

tedivm commented Jan 9, 2023

Here's some context on the original abuse issue that caused this project to get formed.

@nahga

nahga commented Jan 9, 2023

Regardless of intention, all of this should be opt-in by default. Not the other way around.

@l3gacyb3ta
Author

Yeah, generally people hate these tools, even if they were made with good/neutral intentions. I'm also very wary of tools like this.

@mothdotmonster

+1 on making things be opt-in.

@Freeplayg

PLEASE. Opt-in.

Instances have already blocked hachyderm for this.

@VyrCossont

> I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems.

@tedivm If opting out of this tool does anything meaningful, what happens when the AP proxy software starts opting out? Conversely, why collect so much info not relevant to identifying proxies by counting subdomains?

@l3gacyb3ta
Author

Using your own service, we can see mastinator.com is blocked a ton for the exact things you are trying to do (scraping, violating consent, not going with the culture)

@Seirdy

Seirdy commented Jan 9, 2023

Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes.

I can believe you when you say this was meant to address real issues around blocklist-evasion, but the data it currently exposes will merely replace one threat with another. I suggest deleting existing data, turning it off, and asking Fedi for feedback before starting a project like this again. I do think it's possible to publish very limited aggregate data that doesn't enable targeted harassment, but this isn't how it's done.

@bgcarlisle

You NEED to make this opt-in only

Your opt-out method is also inadequate

Not everyone who's running a Masto instance can alter their robots.txt, thanks to limitations from hosting providers

@l3gacyb3ta
Author

> So then complain to the hosting providers. Why is it everybody else's problem?

Because everyone else is writing scrapers...

@tedivm
Owner

tedivm commented Jan 9, 2023

> Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes.

I have removed those endpoints from the service.

@l3gacyb3ta
Author

> There is a social contract on the internet that the site owner puts a robots.txt file at the root to dictate what bots do.

There's also a social contract on fedi not to build scrapers lol

@l3gacyb3ta
Author

Well then we can disagree, but this thread suggests otherwise

@bgcarlisle

The fact remains, I'm an instance admin and I CAN'T opt out

This thing NEEDS to be shut down until it is opt-in only

@Seirdy

Seirdy commented Jan 9, 2023

Since the creator's fedi account has been suspended, they might not have seen my reply so I'll copy it here since it's relevant:

Listen. People do not trust you right now. You potentially hold a ton of data and people feel unsafe merely knowing you have it. Tons of people have archived the data you exposed already and are literally going through it right now.

You need to over-correct fast, and that means shutting this down.

@ThatOneCalculator

Opt in or get out.

@bgcarlisle

No, that's opt out, literally the opposite, and the documentation doesn't give any other options than using robots.txt

@Seirdy

Seirdy commented Jan 9, 2023 via email

@bgcarlisle

> It's an implicit opt-in

The term for that is "opt out"

@Seirdy

Seirdy commented Jan 9, 2023 via email

@tedivm
Owner

tedivm commented Jan 9, 2023

Just to be clear, I'm not an admin on Hachyderm. The people spreading that rumor are wrong. Hachyderm has nothing to do with this project other than me making a post there.

@bgcarlisle

Did you read anything here?

That's not what anyone is discussing

@tedivm
Owner

tedivm commented Jan 9, 2023

I'm reading everything that comes through, but I wanted to clear that one piece of information up.

@ledlamp

ledlamp commented Jan 9, 2023

Bruh, if it was opt-in, it would be useless 'cause nobody would bother to opt in! Imagine if you had to opt in to bots on the World Wide Web; search engines like Google wouldn't really work, because a lot of sites don't care and don't have a robots.txt! Well, the fediverse is based on the World Wide Web, so the same concept applies.

@tedivm
Owner

tedivm commented Jan 9, 2023

The website is down. Thank you all for your feedback.

tedivm closed this as completed Jan 9, 2023

11 participants