Please make this tool "opt-in" by default #293

Closed
edent opened this issue Apr 23, 2023 · 29 comments

Comments

@edent

edent commented Apr 23, 2023

Some of my sites are being hammered by users of your tool. I don't understand why the onus is on me to add a new header to my sites opting out of this tool.

Please can you change the default behaviour so that it will only work on sites which set the X-Robots-Tag: YesAI?
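
A minimal sketch of what such an opt-in check could look like, assuming the downloader has the response headers available before it saves an image. The function name `has_opt_in` is hypothetical, and `YesAI` is only the directive proposed above, not an established standard or part of the tool.

```python
def has_opt_in(headers, directive="yesai"):
    """Return True only if an X-Robots-Tag header lists the opt-in directive.

    `headers` may be a dict or a list of (name, value) pairs; a list allows
    the same header name to appear more than once.
    """
    pairs = headers.items() if hasattr(headers, "items") else headers
    wanted = directive.lower()
    for name, value in pairs:
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        # Directives are comma-separated; compare them case-insensitively.
        tokens = [token.strip().lower() for token in value.split(",")]
        if wanted in tokens:
            return True
    # No header, or no opt-in directive: treat the site as not opted in.
    return False


print(has_opt_in({"X-Robots-Tag": "YesAI"}))    # site opted in
print(has_opt_in({"X-Robots-Tag": "noindex"}))  # header present, no opt-in
print(has_opt_in({}))                           # no header at all
```

Under this scheme the absence of the header means "do not download", which is the reverse of an opt-out default.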

@rom1504
Owner

rom1504 commented Apr 23, 2023 via email

@rom1504 rom1504 closed this as completed Apr 23, 2023
@edent
Author

edent commented Apr 23, 2023

No, I'm happy for people to visit. But I don't understand why I need to specifically opt out of this bot?

@rom1504
Owner

rom1504 commented Apr 23, 2023

This is not a bot. It's a tool to help people get images from websites

@edent
Author

edent commented Apr 23, 2023

OK, so every time someone builds a tool, I have to specifically opt out of it. Is that what you're saying?

Why not be a good netizen and make it so it only works on sites that have opted in? I'm happy to give you a PR to do that, if you like?

@rom1504
Owner

rom1504 commented Apr 23, 2023

> OK, so every time someone builds a tool, I have to specifically opt out of it. Is that what you're saying?

Yes, for example if you want to ban the tool "Google Chrome" from accessing your website, you can do so.

> Why not be a good netizen and make it so it only works on sites that have opted in? I'm happy to give you a PR to do that, if you like?

That would be unethical, you can read the readme to understand why.

@edent
Author

edent commented Apr 23, 2023

Wait... you think consent is unethical...?

@rom1504
Owner

rom1504 commented Apr 23, 2023

https://github.com/rom1504/img2dataset#ai-use-impact here's the relevant section in case you didn't find it

@rom1504
Owner

rom1504 commented Apr 23, 2023

Letting a small minority (e.g. a few people who publish content) prevent the large majority (most publishers of content) from sharing their images and from benefiting from the latest generation of AI tools would definitely be unethical, yes.

Consent is obviously not unethical. You can give your consent for anything if you wish.
It seems you're trying to decide for millions of other people without asking them for their consent.

@snail-coupe

But you're taking consent away from people. You're assuming consent rather than requiring it, which isn't consent for all the people who know nothing about your tool or how to request that it not pull their images. You cannot find consent unknowingly.

The fact that you then provide an option for users of your tool to then even disregard the choice of people who do explicitly remove consent is an alarming red flag.

Your examples are disingenuous. Someone scraping images from millions of websites without permission is completely different to a user browsing parts of a website in Chrome.

@hardillb

I'm sorry, but your logic is totally flawed: it depends on image owners knowing your tool exists before it indexes their site.

At the point they discover it (by wondering what is hammering their site and reading their logs), and then manage to find your readme and add the relevant header, it's already too late, as your tool has most likely already totally ingested their content without any consent.

Add to that the fact that you document an option to directly ignore the flags they could use to opt out.

Your attitude here is exactly why people are against this sort of thing, it shows a total disregard for content creators.

@rom1504
Owner

rom1504 commented Apr 23, 2023

Did you all already ban search engine bots from your websites? If not, why not?

@rom1504
Owner

rom1504 commented Apr 23, 2023

> Your attitude here is exactly why people are against this sort of thing, it shows a total disregard for content creators.

Please show me a vote showing that at least 10% of content creators would be happy with preventing 100% of content creators, by default, from having their images indexed and processed.

If you do that, you can also widely distribute this information as it will help have a lot more objective discussions on the topic way beyond this specific tool.

@snail-coupe

Please show me a vote showing at least 10% of content creators are happy for their content to be collated and used in bulk without seeking their explicit permission first.

Does your tool examine the location the images are being downloaded from to locate any copyright statements or license requirements?

@andypiper

> To disable this behavior and download all images, you may pass…

So, is it worth me making changes to my site and hosting to add headers to avoid having your tool take my data and add it to these datasets, or not?

I guess I will do it anyway, in the hope that it puts off this behaviour, but you’re basically saying (in these comments and in the README) that anything published online is fair game for any usage whatsoever, which… I don’t think is true.

@kit-repo-depot

kit-repo-depot commented Apr 23, 2023

Will you at least make it a condition of the license that any datasets created with the tool must be public and the sources of the images notified? Maybe through the administrative email from a WHOIS lookup.

> fair game for any usage whatsoever, which… I don’t think is true.

Either ethically or legally. How do you ensure GDPR compliance, for example?

@rom1504
Owner

rom1504 commented Apr 23, 2023

I would be happy if more people would publish open datasets. Most people don't, however (see Midjourney, for example). There's not much anyone can do about it from a technical perspective.

You can choose to help release public datasets, however, for example by releasing a list of URLs from your website and instructions on how to download them. Maybe you can even fine-tune some text-to-image models to make your specific styles more popular.

@dmhowcroft

You should at least highlight in the README that the availability of an image for viewing on the web does not on its own grant any kind of license or copyright for that image to be reproduced elsewhere or repurposed for a dataset, so that users of your tool are clear that they should consult their own legal advisors to ensure their usage is acceptable.

@kit-repo-depot

> There's not much anyone can do about it from a technical perspective.

Actually, as the creator and owner of the copyright for a dataset creation tool, you have quite a bit of legally enforceable sway over how people use your tool: the license. The GPL (and the AGPL after it) has for years put limits on what users can and can’t do with software, and you may add or remove parts of those licenses at any time.

@berlincount

I think honoring the robots.txt (as per https://datatracker.ietf.org/doc/rfc9309/) instead of requiring custom headers would address the issue properly and generically.

Other tools (like wget) do as well.

Please consider this approach over requiring custom work from site owners for your tool.

Thank you!
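
As a rough illustration of the robots.txt approach, here is a sketch using Python's standard-library `urllib.robotparser`. The helper name `allowed_by_robots`, the user-agent string `example-image-downloader`, and the example rules are all assumptions made for illustration; note also that the standard-library parser may not implement every detail of RFC 9309.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of an already-fetched robots.txt file."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules a site owner might publish.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /images/
"""

AGENT = "example-image-downloader"  # hypothetical user-agent string

print(allowed_by_robots(EXAMPLE_ROBOTS, AGENT, "https://example.com/images/cat.jpg"))
print(allowed_by_robots(EXAMPLE_ROBOTS, AGENT, "https://example.com/about.html"))
```

A downloader following this approach would fetch each host's robots.txt once, cache it, and skip any image URL for which the check returns `False`.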

@Square252

Square252 commented Apr 23, 2023

Maybe it's time to sue the repo owner?

The code actively ignores the robots.txt. This is malicious.

He has clearly demonstrated that he has no knowledge about copyright and reacts pretty immature.

/edit
He's French, they got massive copyright laws, so this could get pretty ugly for him fast.

@berlincount

> Maybe it's time to sue the repo owner?

> He has clearly demonstrated that he has no knowledge about copyright and reacts pretty immature.

hey, let's stay civil. we all can always do better.

@Square252

> > Maybe it's time to sue the repo owner?

> > He has clearly demonstrated that he has no knowledge about copyright and reacts pretty immature.

> hey, let's stay civil. we all can always do better.

Multiple people tried, he closed this issue in response. There is nothing to talk about anymore IMHO.

@noncombatant

Honoring robots.txt (and copyright licenses — and Creative Commons tags are machine-readable for this reason) is a tried and true mechanism for handling this kind of concern. Including, yes Romain, for search engines (which honor them).

@phryk

phryk commented Apr 23, 2023

At least 90% of requests to my sites are already from shitty bots.

Every single one of these requests uses up energy, which in turn emits CO₂.

I have to regularly update my robots.txt with the worst offenders of that
particular week to keep this at least somewhat in check and because of
people like you, even this extra effort that I shouldn't have to put in in the
first place just isn't a viable option.

I'm seriously beginning to ponder automatically hitting back with garbage data
at full bandwidth because at some point we'll have to reach the "find out" part
of fuck around and find out. Not good in terms of CO₂ output, but I'm seriously
at a loss on how to handle this bullshit.

Additionally, training data being undocumented garbage is one of the major
problems leading to "AI" being unexplainable and unauditable blackboxes,
making any sort of public governance essentially impossible and this tool
seems to be a prime example of how datasets like that come into existence.

You don't seem to understand nor want to engage with either issue and this
illustrates perfectly why I hate most of the AI scene.

@Coffee2CodeNL

Reprocess all images on your site with Glaze.

@ianturton

> Did you all already ban search engine bots from your websites? If not, why not?

That's what robots.txt is for, we thought of this 20 years ago.

@rom1504
Owner

rom1504 commented Apr 23, 2023

Alright let's stop here then.

If you want to help implement robots.txt support, go to #48; a few technical implementations were suggested there.

Repository owner locked as too heated and limited conversation to collaborators Apr 23, 2023
@rom1504
Owner

rom1504 commented Apr 23, 2023

It is sad that several of you are not understanding the potential of AI and open AI and as a consequence have decided to fight it.

You will have many opportunities in the years to come to benefit from AI. I hope you see that sooner rather than later. As creators you have even more opportunities to benefit from it.

That said, if you do want to prevent your content from having a wider reach in indexes and in models, you can use the headers recommended in the readme. You can also use robots.txt, which is already respected by some sources of data (for example, Common Crawl).

Headers will, however, only affect this specific tool; I have no power over the majority of data collection efforts, which are not open source and do not use this tool.

Have a good day.
