Please make this tool "opt-in" by default #293

edent · 2023-04-23T13:44:30Z

Some of my sites are being hammered by users of your tool. I don't understand why the onus is on me to add a new header to my sites opting out of this tool.

Please can you change the default behaviour so that it will only work on sites which set the X-Robots-Tag: YesAI?
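To make the request concrete, here is a minimal sketch of what an opt-in check could look like. It assumes the `YesAI` directive proposed above (it is not an existing standard), and `is_opted_in` is a hypothetical helper name, not something taken from the tool's code. It also ignores the user-agent-prefixed form of `X-Robots-Tag` for simplicity.

```python
from typing import Mapping

# Hypothetical opt-in directive proposed in this issue; not an existing standard.
OPT_IN_DIRECTIVE = "yesai"


def is_opted_in(headers: Mapping[str, str]) -> bool:
    """Return True only if an X-Robots-Tag header carries the opt-in directive."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        # One header value may hold several comma-separated directives.
        directives = {d.strip().lower() for d in value.split(",")}
        if OPT_IN_DIRECTIVE in directives:
            return True
    # No header, or no opt-in directive: treat the site as not having consented.
    return False
```

With a check like this, a site that sends no header at all is skipped, which is the default being asked for.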

rom1504 · 2023-04-23T14:22:50Z

If you don't wish for people to view images from your website, the best way is to turn it off.

edent · 2023-04-23T14:30:50Z

No, I'm happy for people to visit. But I don't understand why I need to specifically opt out of this bot?

rom1504 · 2023-04-23T14:32:10Z

This is not a bot. It's a tool to help people get images from websites.

edent · 2023-04-23T14:36:09Z

OK, so every time someone builds a tool, I have to specifically opt out of it. Is that what you're saying?

Why not be a good netizen and make it so it only works on sites that have opted in? I'm happy to give you a PR to do that, if you like?

rom1504 · 2023-04-23T14:39:05Z

> OK, so every time someone builds a tool, I have to specifically opt out of it. Is that what you're saying?

Yes, for example if you want to ban the tool "Google Chrome" from accessing your website, you can do so.

> Why not be a good netizen and make it so it only works on sites that have opted in? I'm happy to give you a PR to do that, if you like?

That would be unethical, you can read the readme to understand why.

edent · 2023-04-23T14:40:08Z

Wait... you think consent is unethical...?

rom1504 · 2023-04-23T14:43:32Z

https://github.com/rom1504/img2dataset#ai-use-impact here's the relevant section in case you didn't find it

rom1504 · 2023-04-23T14:51:11Z

Letting a small minority (e.g. a few people who are publishing content) prevent the large majority (most publishers of content) from sharing their images and from benefiting from the latest generation of AI tools would definitely be unethical, yes.

Consent is obviously not unethical. You can give your consent for anything if you wish.
It seems you're trying to decide for millions of other people without asking them for their consent.

snail-coupe · 2023-04-23T15:20:17Z

But you're taking consent away from people. You're assuming consent rather than requiring it, which isn't consent for all the people who know nothing about your tool or how to ask it not to pull their images. You cannot give consent unknowingly.

The fact that you then provide an option for users of your tool to disregard the choice of people who do explicitly withhold consent is an alarming red flag.

Your examples are disingenuous. Someone scraping images from millions of websites without permission is completely different to a user browsing parts of a website in Chrome.

hardillb · 2023-04-23T15:22:39Z

I'm sorry, but your logic is totally flawed: it depends on image owners knowing your tool exists before it indexes their site.

By the time they discover it (by wondering what is hammering their site and reading their logs), and then manage to find your readme and add the relevant header, it's already too late, as your tool has most likely already ingested their content without any consent.

Add to that the fact that you document an option to directly ignore the flags they could use to opt out.

Your attitude here is exactly why people are against this sort of thing; it shows a total disregard for content creators.

rom1504 · 2023-04-23T15:24:26Z

Did you all already ban the search engine bots from your websites? If not, why not?

rom1504 · 2023-04-23T15:32:57Z

> Your attitude here is exactly why people are against this sort of thing, it shows a total disregard for content creators.

Please show me a vote showing that at least 10% of content creators would be happy with preventing 100% of content creators, by default, from having their images indexed and processed.

If you do that, you can also widely distribute this information, as it will help enable far more objective discussions on the topic, well beyond this specific tool.

snail-coupe · 2023-04-23T15:41:15Z

Please show me a vote showing at least 10% of content creators are happy for their content to be collated and used in bulk without seeking their explicit permission first.

Does your tool examine the location the images are being downloaded from to locate any copyright statements or license requirements?

andypiper · 2023-04-23T15:43:42Z

> To disable this behavior and download all images, you may pass…

So, is it worth me making changes to my site and hosting to add headers to avoid having your tool take my data and add it to these datasets, or not?

I guess I will do it anyway, in the hope that it puts off this behaviour, but you’re basically saying (in these comments and in the README) that anything published online is fair game for any usage whatsoever, which… I don’t think is true.

kit-repo-depot · 2023-04-23T16:28:17Z

Will you at least make it a condition of the license that any datasets created with the tool must be public and the sources of the images notified? Maybe through the administrative email from a WHOIS lookup.

> fair game for any usage whatsoever, which… I don’t think is true.

Either ethically or legally. How do you ensure GDPR compliance, for example?

rom1504 · 2023-04-23T16:43:47Z

I would be happy if more people published open datasets. Most people don't, however (for example, see Midjourney). There's not much anyone can do about it from a technical perspective.

You can choose to help release public datasets, however, for example by releasing a list of URLs from your website and instructions on how to download them. Maybe you can even fine-tune some text-to-image models to make your specific styles more popular.

dmhowcroft · 2023-04-23T16:48:04Z

You should at least highlight in the README that the availability of an image for viewing on the web does not on its own grant any kind of license or copyright for that image to be reproduced elsewhere or repurposed for a dataset, so that users of your tool are clear that they should consult their own legal advisors to ensure their usage is acceptable.

kit-repo-depot · 2023-04-23T16:56:30Z

> There's not much anyone can do about it from a technical perspective.

Actually, as the creator and copyright owner of a dataset creation tool, you have quite a bit of legally enforceable sway over how people use your tool: the license. The GPL (and the AGPL after it) has for years put limits on what users can and can’t do with software, and you may add or remove parts of those licenses at any time.

berlincount · 2023-04-23T17:28:46Z

I think honoring the robots.txt (as per https://datatracker.ietf.org/doc/rfc9309/) instead of requiring custom headers would address the issue properly and generically.

Other tools (like wget) do as well.

Please consider this approach over requiring custom work from site owners for your tool.

Thank you!
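As a rough illustration of how small such a check can be, here is a sketch using Python's standard-library `urllib.robotparser`. The helper name `allowed_by_robots`, the `example-bot` user agent, and the sample rules are all made up for illustration; this is not the tool's code, and the standard-library parser may not cover every detail of RFC 9309.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check one URL against the text of a site's robots.txt.

    The robots.txt body is passed in as a string, so the caller decides
    how and when to fetch (and cache) it for each host.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only: block every crawler from /images/.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

if __name__ == "__main__":
    for candidate in (
        "https://example.com/images/photo.png",
        "https://example.com/about",
    ):
        ok = allowed_by_robots(SAMPLE_ROBOTS_TXT, "example-bot", candidate)
        print(candidate, "allowed" if ok else "disallowed")
```

Fetching robots.txt once per host and reusing the parsed result would keep the extra load on site owners to a single request.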

Square252 · 2023-04-23T17:57:23Z

Maybe it's time to sue the repo owner?

The code actively ignores the robots.txt. This is malicious.

He has clearly demonstrated that he has no knowledge about copyright and reacts pretty immaturely.

/edit
He's French, and they've got massive copyright laws, so this could get pretty ugly for him fast.

berlincount · 2023-04-23T18:04:50Z

> Maybe it's time to sue the repo owner?

> He has clearly demonstrated that he has no knowledge about copyright and reacts pretty immature.

Hey, let's stay civil. We all can always do better.

Square252 · 2023-04-23T18:06:41Z

> > Maybe it's time to sue the repo owner?

> > He has clearly demonstrated that he has no knowledge about copyright and reacts pretty immature.

> hey, let's stay civil. we all can always do better.

Multiple people tried, he closed this issue in response. There is nothing to talk about anymore IMHO.

noncombatant · 2023-04-23T18:07:39Z

Honoring robots.txt (and copyright licenses — and Creative Commons tags are machine-readable for this reason) is a tried and true mechanism for handling this kind of concern. Including, yes Romain, for search engines (which honor them).

phryk · 2023-04-23T18:11:52Z

At least 90% of requests to my sites are already from shitty bots.

Every single one of these requests uses up energy, which in turn emits CO₂.

I have to regularly update my robots.txt with the worst offenders of that particular week to keep this at least somewhat in check, and because of people like you, even this extra effort that I shouldn't have to put in in the first place just isn't a viable option.

I'm seriously beginning to ponder automatically hitting back with garbage data at full bandwidth, because at some point we'll have to reach the "find out" part of fuck around and find out. Not good in terms of CO₂ output, but I'm seriously at a loss on how to handle this bullshit.

Additionally, training data being undocumented garbage is one of the major problems leading to "AI" being unexplainable and unauditable black boxes, making any sort of public governance essentially impossible, and this tool seems to be a prime example of how datasets like that come into existence.

You seem neither to understand nor to want to engage with either issue, and this illustrates perfectly why I hate most of the AI scene.

Coffee2CodeNL · 2023-04-23T18:14:34Z

Reprocess all images on your site with Glaze.

ianturton · 2023-04-23T18:19:30Z

> Did you all already ban the search engines bot from your websites? If not, why not ?

That's what robots.txt is for; we thought of this 20 years ago.

rom1504 · 2023-04-23T18:41:09Z

Alright let's stop here then.

If you want to help implement robots.txt support, go to #48; a few technical implementations were suggested there.

rom1504 · 2023-04-23T19:14:25Z

It is sad that several of you are not understanding the potential of AI and open AI and as a consequence have decided to fight it.

You will have many opportunities in the years to come to benefit from AI. I hope you see that sooner rather than later. As creators you have even more opportunities to benefit from it.

That said, if you do want to prevent your content from having a wider reach in indexes and in models, you can use the headers recommended in the readme. You can also use robots.txt, which is already respected by some sources of data (for example, Common Crawl).

Headers will, however, only affect this specific tool. I have no power over the majority of data collection efforts, which are not open source and do not use this tool.

Have a good day.

rom1504 closed this as completed Apr 23, 2023

Repository owner locked as too heated and limited conversation to collaborators Apr 23, 2023
