Respect noai and noimageai directives when downloading image files #218
Conversation
Could you please check the impact of those options by downloading a subset
of laion2B-en with and without them?
(1M samples should be enough to draw conclusions)
On Fri, Nov 11, 2022, 15:56 Chris Nell ***@***.***> wrote:
Media owners can use the X-Robots-Tag header to communicate usage
directives for the associated media, including instruction that the image
not be used in any indexes (noindex) or included in datasets used for
machine learning purposes (noai).
This PR makes img2dataset respect such directives by not including the
associated media in the generated dataset. It also updates the user-agent
string, introducing an img2dataset user-agent token so that requests made
using the tool are identifiable by media hosts.
Refs:
- https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag
- https://www.deviantart.com/team/journal/A-New-Directive-for-Opting-Out-of-AI-Datasets-934500371
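For readers who don't want to open the diff, the gist of the header check can be sketched roughly as follows. This is an illustration, not the PR's exact code; the token-scoping behavior (a directive may apply globally or only to a named crawler token) follows Google's X-Robots-Tag documentation linked above.

```python
from email.message import Message  # urllib response headers expose this same API


def is_disallowed(headers, user_agent_token="img2dataset",
                  disallowed=("noai", "noimageai")):
    """Return True if any X-Robots-Tag header opts this media out.

    A directive may be global ("X-Robots-Tag: noai") or scoped to one
    crawler token ("X-Robots-Tag: img2dataset: noai").
    """
    for value in headers.get_all("X-Robots-Tag", []):
        token, sep, directives = value.partition(":")
        if not sep:
            # No token prefix: the directive applies to every crawler.
            token, directives = None, value
        else:
            token = token.strip()
        if token is None or token == user_agent_token:
            if any(d.strip().lower() in disallowed
                   for d in directives.split(",")):
                return True
    return False
```

Used on a live response, this would be called as `is_disallowed(r.headers)` inside the `urlopen` context manager, skipping the image when it returns True.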
You can view, comment on, or merge this pull request online at:
#218
Commit Summary
- 70d8c34 Respect noai and noimageai directives
File Changes (4 files <https://github.com/rom1504/img2dataset/pull/218/files>)
- *M* img2dataset/downloader.py (27)
- *M* tests/fixtures.py (11)
- *M* tests/http_server.py (12)
- *M* tests/test_downloader.py (9)
Patch Links:
- https://github.com/rom1504/img2dataset/pull/218.patch
- https://github.com/rom1504/img2dataset/pull/218.diff
img2dataset/downloader.py
Outdated
USER_AGENT_TOKEN = "img2dataset"


def is_disallowed(headers):
This does one additional call, so can you check speed?
I am not convinced this should be done at the downloading step but rather at the crawling step (while collecting the links).
Note that there is no change to the HTTP requests being issued: the headers are already sent (or not) with the file download. Are you concerned about the overhead of the additional Python method call to parse headers? I will include timings and stats for this branch compared to the current main head branch in a bit (running now), but I suspect these overheads will be negligible compared to the cost of the actual HTTP requests.
Re your other point, I agree that the crawling step should check page HTML for relevant robots meta tag directives and omit images accordingly. If there is a repo used for indexing the LAION datasets that you'd suggest adding this functionality to, I'd be happy to work on a PR for that as well.
However, it's possible that the creator of a given dataset did NOT consider such directives when their crawl was done, and it's also possible that the directives associated with a specific media file change after the crawl (the media owner has every right to change them). As such, I don't think the user who actually downloads and uses the indexed media can be absolved of all responsibility to respect directives that apply to their intended usage, especially when those directives are made readily available. Since the HTTP headers are sent with the image files when downloaded anyway, I worry it could be seen as negligent for the img2dataset user to receive these directives but choose to ignore them, especially now that the legality of using third-party images for AI training purposes is being tested.
ok, let's add this under a command line option
I did a couple of tests, run sequentially on an M1 MacBook Pro over a 1 Gb residential fiber connection, downloading the first 1M images from laion-art.

Using the PR's branch (test completed at around 10AM PST):

Using the current main head (completed around 11:30AM PST):

So about 1% fewer images were downloaded, and it took a bit less than 10% longer. However, I don't trust the images/sec timing here because the test was running later in the day, and thereby subject to more internet congestion in general. So I did a second, shorter test to try to control for that: downloading the first 10k images from part 00000 of laion2B-en.

Using the PR branch:

Using the current main head:

Both tests ran around 12PM PST; this time the main branch downloaded fewer images and took slightly longer. So overall I think the practical effect on downloading laion2B is negligible, but if you still have performance concerns then the experiment should probably be redone in a more controlled manner (e.g. in parallel, with more repeated trials, and from separate EC2 hosts).
img2dataset/downloader.py
Outdated
@@ -25,9 +46,13 @@ def download_image(row, timeout):
     request = urllib.request.Request(
         url,
         data=None,
-        headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"},
+        headers={
+            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 (compatible; img2dataset; +https://github.com/rom1504/img2dataset)"
if you want you could add an option for this and leave the current thing as default
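The suggestion above (keep the current browser-like string as the default, and only append the identifying token when the user opts in) could be sketched like this. The option name `user_agent_token` is hypothetical here, chosen just to illustrate the shape:

```python
# Default User-Agent currently used by the downloader (from the diff above).
DEFAULT_UA = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) "
              "Gecko/20100101 Firefox/72.0")


def build_user_agent(user_agent_token=None):
    """Return the default UA unchanged, or append an identifying crawler
    token when one is given, so media hosts can recognize the tool."""
    if not user_agent_token:
        return DEFAULT_UA  # current behavior, unchanged by default
    return (f"{DEFAULT_UA} (compatible; {user_agent_token}; "
            "+https://github.com/rom1504/img2dataset)")
```

With this shape, `build_user_agent()` preserves today's requests exactly, while `build_user_agent("img2dataset")` produces the identifiable string proposed in the PR.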
img2dataset/downloader.py
Outdated
)
with urllib.request.urlopen(request, timeout=timeout) as r:
    if is_disallowed(r.headers):
this could be done under an option, left disabled by default
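A minimal sketch of what exposing the check as an off-by-default command-line option might look like. img2dataset's actual CLI is wired differently; argparse and the flag name `--disallowed-header-directives` are used here only to illustrate the shape being discussed:

```python
import argparse


def make_parser():
    """Build a parser with the header-directive check disabled by default:
    an empty directive list means no downloads are skipped."""
    parser = argparse.ArgumentParser(prog="img2dataset-sketch")
    parser.add_argument(
        "--disallowed-header-directives",
        nargs="*",
        default=[],  # empty list: header check disabled by default
        help="X-Robots-Tag directives that cause a download to be skipped, "
             "e.g. noai noimageai",
    )
    return parser
```

A user who wants the check would then run with `--disallowed-header-directives noai noimageai`, while the default invocation behaves exactly as before.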
@rom1504 I've updated the pull request to put the functionality under command-line options as suggested, and also updated the documentation and examples accordingly. I do think it would be better for these options to be enabled rather than disabled by default, but I understand that might be a debatable position.
I would very much recommend performing this check by default and providing a flag to skip the check if that's really what the user wants to do. This tool is primarily used for fetching images as part of AI/ML training. The presence of the noai directive is an explicit statement from the creator and/or copyright holder that their media should not be used that way. Unfortunately, most users stick with default options. As such, by skipping this check by default, the vast majority of users running this tool will unwittingly be using it against the express wishes of the creator and/or copyright holder. It would be greatly appreciated if you would take these concerns into consideration.
@rom1504 What do you think? I can easily change the defaults, and then revert some of the changes to examples.
@rom1504 This would also go a long way towards addressing one of the concerns raised in the ethics review of the LAION-5B dataset. Respecting these tags by default moves ML data collection closer to a point where the end user does not have to be aware of every dataset being collected before it is collected. Instead, they can rely on their hosting provider's and the dataset collector's support of the tag to exclude their data from the set as appropriate. I think this is in both the ML community's and the image owners' best interest. Peter (Head of AI, DeviantArt)
We talked with @raincoastchris and agreed to
I hope that helps to alleviate some of DeviantArt users' concerns.
After hearing these well-reasoned arguments for making this check the default, you decide instead that "automatically disrespect artists, break TOS and potentially the law" should be the default. Am I understanding this right? It seems a bit careless. What happened to the principle of sensible defaults? I mean, you're still allowing the software to completely ignore these directives. Why can't that be optional instead? I don't understand the argument for reproducibility: since the biggest datasets were catalogued, many images have gone off the web, so it's already impossible to reproduce the original datasets completely.
It does not. Try again on the ethics here.
"We merged this but gutted it so that we can wash our hands of responsibility without actually respecting anyone's wishes or moral rights" cool |
Hey guys, it's "noai", not "idratheryoudidntbutreallywhoamitosaynoai". Leaving in the ability to disregard this extremely clear dataset opt-out notice is questionable enough, both ethically and, as stated several times already, legally; but making that behaviour the default... I'm trying to find a word for it, other than malicious.
#249: opting out is now enabled by default. If using this default, be aware there are ethical issues with slowing down democratization of skills and art to millions of people.