Respect noai and noimageai directives when downloading image files #218
Conversation
Could you please check the impact of those options by downloading a subset
of laion2B-en with and without them?
(1M samples should be enough to draw conclusions)
On Fri, Nov 11, 2022, 15:56 Chris Nell ***@***.***> wrote:
Media owners can use the X-Robots-Tag header to communicate usage
directives for the associated media, including instruction that the image
not be used in any indexes (noindex) or included in datasets used for
machine learning purposes (noai).
This PR makes img2dataset respect such directives by not including the
associated media in the generated dataset. It also updates the user-agent
string, introducing an img2dataset user-agent token so that requests made
using the tool are identifiable by media hosts.
Refs:
- https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag
- https://www.deviantart.com/team/journal/A-New-Directive-for-Opting-Out-of-AI-Datasets-934500371
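For readers who don't want to open the diff, the gist of the header check can be sketched roughly as follows. This is an illustration, not the PR's exact code; the token-scoping behavior (a directive may apply globally or only to a named crawler token) follows Google's X-Robots-Tag documentation linked above.

```python
from email.message import Message  # urllib response headers expose this same API


def is_disallowed(headers, user_agent_token="img2dataset",
                  disallowed=("noai", "noimageai")):
    """Return True if any X-Robots-Tag header opts this media out.

    A directive may be global ("X-Robots-Tag: noai") or scoped to one
    crawler token ("X-Robots-Tag: img2dataset: noai").
    """
    for value in headers.get_all("X-Robots-Tag", []):
        token, sep, directives = value.partition(":")
        if not sep:
            # No token prefix: the directive applies to every crawler.
            token, directives = None, value
        else:
            token = token.strip()
        if token is None or token == user_agent_token:
            if any(d.strip().lower() in disallowed
                   for d in directives.split(",")):
                return True
    return False
```

Used on a live response, this would be called as `is_disallowed(r.headers)` inside the `urlopen` context manager, skipping the image when it returns True.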
You can view, comment on, or merge this pull request online at:
#218
Commit Summary
- 70d8c34 Respect noai and noimageai directives
File Changes (4 files <https://github.com/rom1504/img2dataset/pull/218/files>)
- *M* img2dataset/downloader.py (27)
- *M* tests/fixtures.py (11)
- *M* tests/http_server.py (12)
- *M* tests/test_downloader.py (9)
Patch Links:
- https://github.com/rom1504/img2dataset/pull/218.patch
- https://github.com/rom1504/img2dataset/pull/218.diff
img2dataset/downloader.py
Outdated
USER_AGENT_TOKEN = "img2dataset"


def is_disallowed(headers):
This does one additional call, so can you check speed?
I am not convinced this should be done at the downloading step but rather at the crawling step (while collecting the links).
Note that there is no change to the HTTP requests being issued: the headers are already sent (or not) with the file download. Are you concerned about the overhead of the additional Python method call to parse headers? I will include timings and stats for this branch compared to the current main head branch in a bit (running now), but I suspect these overheads will be negligible compared to the cost of the actual HTTP requests.
Re your other point, I agree that the crawling step should check page HTML for relevant robots meta tag directives and omit images accordingly. If there is a repo used for indexing the LAION datasets that you'd suggest adding this functionality to, I'd be happy to work on a PR for that as well.
However, it's possible that the creator of a given dataset did NOT consider such directives when their crawl was done, and it's also possible that the directives associated with a specific media file change after the crawl (the media owner has every right to change them). As such, I don't think the user who actually downloads and uses the indexed media can be absolved of all responsibility to respect directives that apply to their intended usage, especially when those directives are made readily available. Since the HTTP headers are sent with the image files when downloaded anyway, I worry it could be seen as negligent for the img2dataset user to receive these directives but choose to ignore them, especially now that the legality of using third-party images for AI training purposes is being tested.
ok, let's add this under a command line option
I did a couple of tests, run sequentially on an M1 MacBook Pro over a 1 Gb residential fiber connection, downloading the first 1M images from laion-art.

Using the PR's branch (test completed at around 10AM PST):

Using the current main head (completed around 11:30AM PST):

So about 1% fewer images were downloaded, and it took a bit less than 10% longer. However, I don't trust the images/sec timing here because the test was running later in the day, and thereby subject to more internet congestion in general. So I did a second, shorter test to try to control for that: downloading the first 10k images from part 00000 of laion2B-en.

Using the PR branch:

Using the current main head:

Both tests ran around 12PM PST; this time the main branch downloaded fewer images and took slightly longer. So overall I think the practical effect on downloading laion2B is negligible, but if you still have performance concerns then the experiment should probably be redone in a more controlled manner (e.g. in parallel, with more repeated trials, and from separate EC2 hosts).
img2dataset/downloader.py
Outdated
@@ -25,9 +46,13 @@ def download_image(row, timeout):
     request = urllib.request.Request(
         url,
         data=None,
-        headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"},
+        headers={
+            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 (compatible; img2dataset; +https://github.com/rom1504/img2dataset)"
if you want you could add an option for this and leave the current thing as default
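The suggestion above (keep the current browser-like string as the default, and only append the identifying token when the user opts in) could be sketched like this. The option name `user_agent_token` is hypothetical here, chosen just to illustrate the shape:

```python
# Default User-Agent currently used by the downloader (from the diff above).
DEFAULT_UA = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) "
              "Gecko/20100101 Firefox/72.0")


def build_user_agent(user_agent_token=None):
    """Return the default UA unchanged, or append an identifying crawler
    token when one is given, so media hosts can recognize the tool."""
    if not user_agent_token:
        return DEFAULT_UA  # current behavior, unchanged by default
    return (f"{DEFAULT_UA} (compatible; {user_agent_token}; "
            "+https://github.com/rom1504/img2dataset)")
```

With this shape, `build_user_agent()` preserves today's requests exactly, while `build_user_agent("img2dataset")` produces the identifiable string proposed in the PR.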
img2dataset/downloader.py
Outdated
)
with urllib.request.urlopen(request, timeout=timeout) as r:
    if is_disallowed(r.headers):
this could be done under an option, left disabled by default
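A minimal sketch of what exposing the check as an off-by-default command-line option might look like. img2dataset's actual CLI is wired differently; argparse and the flag name `--disallowed-header-directives` are used here only to illustrate the shape being discussed:

```python
import argparse


def make_parser():
    """Build a parser with the header-directive check disabled by default:
    an empty directive list means no downloads are skipped."""
    parser = argparse.ArgumentParser(prog="img2dataset-sketch")
    parser.add_argument(
        "--disallowed-header-directives",
        nargs="*",
        default=[],  # empty list: header check disabled by default
        help="X-Robots-Tag directives that cause a download to be skipped, "
             "e.g. noai noimageai",
    )
    return parser
```

A user who wants the check would then run with `--disallowed-header-directives noai noimageai`, while the default invocation behaves exactly as before.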
@rom1504 I've updated the pull request to put the functionality under command-line options as suggested, and also updated the documentation and examples accordingly. I do think it would be better for these options to be enabled rather than disabled by default, but I understand that might be a debatable position.
I would very much recommend performing this check by default and providing a flag to skip the check if that's really what the user wants to do. This tool is primarily used for fetching images as part of AI/ML training. The presence of the noai directive is an explicit statement from the creator and/or copyright holder that their media should not be used that way. Unfortunately, most users stick with default options. As such, by skipping this check by default, the vast majority of users running this tool will unwittingly be using it against the express wishes of the creator and/or copyright holder. It would be greatly appreciated if you would take these concerns into consideration.
@rom1504 What do you think? I can easily change the defaults, and then revert some of the changes to examples.
@rom1504 This would also go a long way towards addressing one of the concerns raised in the ethics review of the LAION-5B dataset. Respecting these tags by default moves ML data collection closer to a point where the end user does not have to be aware of every dataset being collected before it is collected. Instead, they can rely on their hosting provider's and the dataset collector's support of the tag to exclude their data from the set as appropriate. I think this is in both the ML community's and the image owners' best interest. Peter (Head of AI, DeviantArt)
We talked with @raincoastchris and agreed to
I hope that helps to alleviate some of DeviantArt users' concerns.
After hearing these well-reasoned arguments for making this check the default, you decide instead that "automatically disrespect artists, break TOS and potentially the law" should be the default. Am I understanding this right? It seems a bit careless. What happened to the principle of sensible defaults? I mean, you're still allowing the software to completely ignore these directives. Why can't that be optional instead? I don't understand the argument for reproducibility: since the biggest datasets were catalogued, many images have gone off the web, so it's already impossible to reproduce the original datasets completely.
It does not. Try again on the ethics here.
"We merged this but gutted it so that we can wash our hands of responsibility without actually respecting anyone's wishes or moral rights" cool |
Hey guys, it's "noai", not "idratheryoudidntbutreallywhoamitosaynoai". Leaving in the ability to disregard this extremely clear dataset opt-out notice is questionable enough, both ethically and, as stated several times already, legally; but making that behaviour the default... I'm trying to find a word for it, other than malicious.
#249: opting out is now enabled by default. If using this default, be aware there are ethical issues with slowing down democratization of skills and art to millions of people.