Please make this tool "opt-in" by default #293
Comments
If you don't wish for people to view images from your website, the best way
is to turn it off.
On Sun, Apr 23, 2023, 15:44 Terence Eden ***@***.***> wrote:
Some of my sites are being hammered by users of your tool. I don't
understand why the onus is on *me* to add a new header to my sites opting
out of this tool.
Please can you change the default behaviour so that it will only work on
sites which set the X-Robots-Tag: YesAI?
|
No, I'm happy for people to visit. But I don't understand why I need to specifically opt out of this bot? |
This is not a bot. It's a tool to help people get images from websites |
OK, so every time someone builds a tool, I have to specifically opt out of it. Is that what you're saying? Why not be a good netizen and make it so it only works on sites that have opted in? I'm happy to give you a PR to do that, if you like? |
Yes, for example if you want to ban the tool "Google Chrome" from accessing your website, you can do so.
That would be unethical, you can read the readme to understand why. |
Wait... you think consent is unethical...? |
https://github.com/rom1504/img2dataset#ai-use-impact here's the relevant section in case you didn't find it |
Letting a small minority (e.g. a few people who publish content) prevent the large majority (most publishers of content) from sharing their images and from benefiting from the latest generation of AI tools would definitely be unethical, yes. Consent is obviously not unethical. You can give your consent for anything if you wish. |
But you're taking consent away from people. You're assuming consent rather than requiring it, which isn't consent for all the people who know nothing about your tool or how to request that it not pull their images. You cannot give consent unknowingly. The fact that you then provide an option for users of your tool to disregard the choice of people who do explicitly remove consent is an alarming red flag. Your examples are disingenuous. Someone scraping images from millions of websites without permission is completely different to a user browsing parts of a website in Chrome. |
I'm sorry, but your logic is totally flawed: it depends on image owners knowing your tool exists before it indexes their site. By the time they discover it (by wondering what is hammering their site and reading their logs), manage to find your readme, and add the relevant header, it's already too late, as your tool has most likely already ingested their content without any consent. Add to that the fact that you document an option to directly ignore the flags they could use to opt out. Your attitude here is exactly why people are against this sort of thing; it shows a total disregard for content creators. |
Did you all already ban search engine bots from your websites? If not, why not? |
Please show me a vote showing that at least 10% of content creators would be happy with preventing 100% of content creators, by default, from having their images indexed and processed. If you do that, you can also widely distribute this information, as it will help have much more objective discussions on the topic, way beyond this specific tool. |
Please show me a vote showing at least 10% of content creators are happy for their content to be collated and used in bulk without seeking their explicit permission first. Does your tool examine the location the images are being downloaded from to locate any copyright statements or license requirements? |
So, is it worth me making changes to my site and hosting to add headers to avoid having your tool take my data and add it to these datasets, or not? I guess I will do it anyway, in the hope that it puts off this behaviour, but you’re basically saying (in these comments and in the README) that anything published online is fair game for any usage whatsoever, which… I don’t think is true. |
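For site owners wondering what "adding headers" involves in practice, here is a minimal sketch, assuming a Python WSGI application. The helper name `add_robots_tag` is hypothetical, and the directive values `noai` and `noimageai` are an assumption about what the tool's README recommends; check the README for the exact values before relying on them.

```python
def add_robots_tag(app, directives="noai, noimageai"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header.

    The default directive values are assumptions, not confirmed by this thread.
    """
    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Append the opt-out header to whatever the app already set.
            headers = list(headers) + [("X-Robots-Tag", directives)]
            return start_response(status, headers, exc_info)
        return app(environ, start_with_header)
    return wrapped


def demo_app(environ, start_response):
    """A stand-in application used only to illustrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


application = add_robots_tag(demo_app)
```

The same effect is usually achievable in the web server's own configuration; the wrapper above only illustrates the idea in code.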
Will you at least make it a condition of the license that any datasets created with the tool must be public and the sources of the images notified? Maybe through the administrative email from a whois lookup.
Either ethically or legally, how do you ensure GDPR compliance, for example? |
I would be happy if more people would publish open datasets. Most people don't, however (for example, see Midjourney). There's not much anyone can do about it from a technical perspective. You can choose to help release public datasets, however, for example by releasing a list of URLs of your website and instructions on how to download them. Maybe you can even fine-tune some text-to-image models to make your specific styles more popular. |
You should at least highlight in the README that the availability of an image for viewing on the web does not on its own grant any kind of license or copyright for that image to be reproduced elsewhere or repurposed for a dataset, so that users of your tool are clear that they should consult their own legal advisors to ensure their usage is acceptable. |
Actually, as the creator and owner of the copyright for a dataset creation tool, you have quite a bit of legally enforceable sway over how people use your tool: the license. The GPL (and the AGPL after it) has for years put limits on what users can and can’t do with software, and you may add or remove parts of those licenses at any time. |
I think honoring the Other tools (like Please consider this approach over requiring custom work from site owners for your tool. Thank you! |
Maybe it's time to sue the repo owner? The code actively ignores the robots.txt. This is malicious. He has clearly demonstrated that he has no knowledge about copyright and reacts pretty immaturely. /edit |
hey, let's stay civil. we all can always do better. |
Multiple people tried, he closed this issue in response. There is nothing to talk about anymore IMHO. |
Honoring robots.txt (and copyright licenses — and Creative Commons tags are machine-readable for this reason) is a tried and true mechanism for handling this kind of concern. Including, yes Romain, for search engines (which honor them). |
At least 90% of requests to my sites are already from shitty bots. Every single one of these requests uses up energy, which in turn emits CO₂. I have to regularly update my I'm seriously beginning to ponder automatically hitting back with garbage data Additionally, training data being undocumented garbage is one of the major You don't seem to understand nor want to engage with either issue and this |
Reprocess all images on your site with Glaze. |
That's what robots.txt is for, we thought of this 20 years ago. |
Alright let's stop here then. If you want to help implementing robots.txt support, go at #48 , a few technical implementations were suggested there. |
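For readers following along, a robots.txt check of the kind discussed here could look roughly like the sketch below. It uses Python's standard `urllib.robotparser`; the helper name `allowed_by_robots`, the user-agent token `img2dataset`, and the example rules are all hypothetical and are not taken from the tool's code or from #48.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: block one named crawler entirely,
# and keep /private/ off limits for everyone else.
EXAMPLE_ROBOTS_TXT = """\
User-agent: img2dataset
Disallow: /

User-agent: *
Disallow: /private/
"""
```

A downloader would fetch and cache each host's robots.txt once, then call a check like this before requesting any image URL on that host.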
It is sad that several of you are not understanding the potential of AI and open AI and as a consequence have decided to fight it. You will have many opportunities in the years to come to benefit from AI. I hope you see that sooner rather than later. As creators you have even more opportunities to benefit from it. That said, if you do want to prevent your content from having a wider reach in indexes and in models, you can use the headers recommended in the readme. You can also use robots.txt, which is already respected by some sources of data (for example Common Crawl). Headers will however only affect this specific tool; I have no power over the majority of data collection efforts, which are not open source and are not using this tool. Have a good day. |
Some of my sites are being hammered by users of your tool. I don't understand why the onus is on me to add a new header to my sites opting out of this tool.
Please can you change the default behaviour so that it will only work on sites which set the X-Robots-Tag: YesAI?
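The opt-in behaviour requested in this issue could be sketched as below. This is only an illustration, assuming the `YesAI` directive proposed above; the function names are hypothetical and do not reflect how the tool actually handles headers. The sketch also ignores the user-agent-prefixed form of X-Robots-Tag (for example `googlebot: noindex`).

```python
def parse_robots_tag(header_values):
    """Collect lowercase directives from one or more X-Robots-Tag header values."""
    directives = set()
    for value in header_values:
        for token in value.split(","):
            token = token.strip().lower()
            if token:
                directives.add(token)
    return directives


def is_opted_in(header_values, opt_in_directive="yesai"):
    """Return True only when the response explicitly carries the opt-in directive."""
    return opt_in_directive in parse_robots_tag(header_values)
```

Under this scheme a response with no X-Robots-Tag header at all is treated as "not opted in", which is the reverse of the opt-out default being debated in the thread.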