
Respect noai and noimageai directives when downloading image files #218

Merged

rom1504 merged 4 commits into rom1504:main from x_robots_tag on Nov 24, 2022

Conversation

raincoastchris
Contributor

Media owners can use the X-Robots-Tag header to communicate usage directives for the associated media, including instructions that the image not be used in any indexes (noindex) or included in datasets used for machine learning purposes (noai).

This PR makes img2dataset respect such directives by not including the associated media in the generated dataset. It also updates the user-agent string, introducing an img2dataset user-agent token so that requests made with the tool are identifiable by media hosts.

Refs:
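For illustration, here is a minimal sketch of the kind of header check this PR introduces. The function name, directive set, and parsing details are assumptions made for this example, not necessarily the PR's exact implementation:

import urllib.request

# Directives signalling the media owner opts out (hypothetical set for this sketch).
DISALLOWED_DIRECTIVES = {"noai", "noimageai", "noindex", "noimageindex"}

def opted_out(headers):
    """Return True if any X-Robots-Tag header carries an opt-out directive."""
    for value in headers.get_all("X-Robots-Tag") or []:
        # A value may hold several comma-separated directives, optionally
        # scoped to a user-agent token, e.g. "img2dataset: noai".
        # (For brevity this sketch ignores user-agent-token scoping.)
        directives = {d.strip().lower() for d in value.split(":")[-1].split(",")}
        if directives & DISALLOWED_DIRECTIVES:
            return True
    return False

# Usage against any URL:
with urllib.request.urlopen("https://example.com/") as r:
    if opted_out(r.headers):
        print("skipping: media owner opted out")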

@rom1504
Owner

rom1504 commented Nov 11, 2022 via email

USER_AGENT_TOKEN = "img2dataset"


def is_disallowed(headers):
rom1504 (Owner)

This does one additional call, so can you check the speed?

I am not convinced this should be done at the downloading step rather than at the crawling step (while collecting the links).

raincoastchris (Contributor, Author)

Note there is no change to the HTTP requests being issued -- the headers are already sent (or not) with the file download. Are you concerned about the overhead of the additional Python method call to parse the headers? I will include timings and stats for this branch compared to the current main head in a bit (running now), but I suspect these overheads will be negligible compared to the cost of the actual HTTP requests.

Re your other point, I agree that the crawling step should check page HTML for relevant robots meta tag directives and omit images accordingly. If there is a repo used for indexing the LAION datasets that you'd suggest adding this functionality to, I'd be happy to work on a PR for that as well.

However, it's possible that the creator of a given dataset did NOT consider such directives when their crawl was done, and it's also possible that directives associated with a specific media file change after the crawl (the media owner has every right to do so). As such, I don't think the user who actually downloads and uses the indexed media can be absolved of all responsibility to respect directives that apply to their intended usage, especially if those directives are made readily available. Since the HTTP headers are sent with the image files when downloaded anyway, I worry it could be seen as negligent for the img2dataset user to receive these directives but choose to ignore them -- especially now that the legality of using third-party images for AI training purposes is being tested.

rom1504 (Owner)

OK, let's add this under a command-line option.

@raincoastchris
Contributor Author

raincoastchris commented Nov 11, 2022

I did a couple of tests, run sequentially on an M1 MacBook Pro over a 1 Gb/s residential fiber connection.

Downloading the first 1M images from laion-art.

Using the PR's branch (test completed at around 10AM PST):

time img2dataset --url_list laion-art-1m.parquet --input_format "parquet"\
    --url_col "URL" --caption_col "TEXT" --output_format webdataset\
    --output_folder x-robots-tag --processes_count 16 --thread_count 64 --image_size 384\
    --resize_only_if_bigger=True --resize_mode="keep_ratio" --skip_reencode=True
...
total   - success: 0.899 - failed to download: 0.096 - failed to resize: 0.005 - images per sec: 173 - count: 1000000
img2dataset --url_list laion-art-1m.parquet --input_format "parquet" --url_co  13291.84s user 3165.21s system 284% cpu 1:36:31.75 total

Using the current main head (completed around 11:30am PST):

time img2dataset --url_list laion-art-1m.parquet --input_format "parquet"\
    --url_col "URL" --caption_col "TEXT" --output_format webdataset\
    --output_folder main --processes_count 16 --thread_count 64 --image_size 384\
    --resize_only_if_bigger=True --resize_mode="keep_ratio" --skip_reencode=True
...
total   - success: 0.900 - failed to download: 0.095 - failed to resize: 0.005 - images per sec: 185 - count: 1000000
img2dataset --url_list laion-art-1m.parquet --input_format "parquet" --url_co  13443.00s user 3162.97s system 307% cpu 1:30:03.95 total

So slightly fewer images were downloaded (success rate 0.899 vs 0.900), and the run took a bit less than 10% longer. However, I don't trust the images/sec comparison here because this test ran later in the day and was therefore subject to more internet congestion in general. So I ran a second, shorter test to try to control for that:

Downloading the first 10k images from part 00000 of laion2B-en

Using the PR branch:

time img2dataset --url_list laion2b-en-10k.parquet --input_format "parquet"\
    --url_col "URL" --caption_col "TEXT" --output_format webdataset\
    --output_folder robots10k --processes_count 16 --thread_count 64 --image_size 384\
    --resize_only_if_bigger=True --resize_mode="keep_ratio" --skip_reencode=True
...
total   - success: 0.885 - failed to download: 0.108 - failed to resize: 0.007 - images per sec: 76 - count: 10000
img2dataset --url_list laion2b-en-10k.parquet --input_format "parquet"  "URL"  106.75s user 20.07s system 94% cpu 2:14.59 total

Using the current main head:

time img2dataset --url_list laion2b-en-10k.parquet --input_format "parquet"\
    --url_col "URL" --caption_col "TEXT" --output_format webdataset\
    --output_folder main10k --processes_count 16 --thread_count 64 --image_size 384\
    --resize_only_if_bigger=True --resize_mode="keep_ratio" --skip_reencode=True
...
total   - success: 0.882 - failed to download: 0.110 - failed to resize: 0.007 - images per sec: 63 - count: 10000
img2dataset --url_list laion2b-en-10k.parquet --input_format "parquet"  "URL"  103.43s user 22.17s system 77% cpu 2:42.22 total

Both tests were run around 12 PM PST; this time the main branch downloaded fewer images and took slightly longer.


So overall I think the practical effect on downloading LAION-2B is negligible, but if you still have performance concerns then the experiment should probably be redone in a more controlled manner (e.g., in parallel, with more repeated trials, and from separate EC2 hosts).

@@ -25,9 +46,13 @@ def download_image(row, timeout):
     request = urllib.request.Request(
         url,
         data=None,
-        headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"},
+        headers={
+            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 (compatible; img2dataset; +https://github.com/rom1504/img2dataset)"
rom1504 (Owner)

If you want, you could add an option for this and leave the current string as the default.

+        },
     )
     with urllib.request.urlopen(request, timeout=timeout) as r:
+        if is_disallowed(r.headers):
rom1504 (Owner)

This could be done under an option, disabled by default.
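To illustrate the suggestion, here is a minimal sketch of gating both changes behind opt-in parameters. The parameter names, row shape, and return values are assumptions for this example, not the PR's final API:

import urllib.request

def is_disallowed(headers, disallowed_header_directives):
    # Minimal header check, as sketched near the top of this PR.
    for value in headers.get_all("X-Robots-Tag") or []:
        directives = {d.strip().lower() for d in value.split(":")[-1].split(",")}
        if directives.intersection(disallowed_header_directives):
            return True
    return False

def download_image(row, timeout, user_agent_token=None, disallowed_header_directives=None):
    # Both behaviours stay off unless the caller opts in.
    key, url = row
    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
    if user_agent_token is not None:
        # Opt-in: identify the tool to media hosts via a user-agent token.
        user_agent += f" (compatible; {user_agent_token}; +https://github.com/rom1504/img2dataset)"
    request = urllib.request.Request(url, data=None, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as r:
        if disallowed_header_directives and is_disallowed(r.headers, disallowed_header_directives):
            # Opt-in: skip media whose X-Robots-Tag headers opt out.
            return key, None, "disallowed by X-Robots-Tag directive"
        return key, r.read(), None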

@raincoastchris
Contributor Author

raincoastchris commented Nov 16, 2022

@rom1504 I've updated the pull request to put the functionality behind command-line options as suggested, and have updated the documentation and examples accordingly. I do think it would be better for these options to be enabled rather than disabled by default, but I understand that might be a debatable position.
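For example, an invocation enabling both behaviours might look like the following; the option names and value format here are assumptions based on this PR's description and may differ from the merged flags:

time img2dataset --url_list laion-art-1m.parquet --input_format "parquet"\
    --url_col "URL" --caption_col "TEXT" --output_format webdataset\
    --output_folder x-robots-tag --user_agent_token "img2dataset"\
    --disallowed_header_directives '["noai","noimageai"]'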

@dcoles

dcoles commented Nov 17, 2022

I would very much recommend performing this check by default and providing a flag to skip the check if that's really what the user wants to do.

This tool is primarily used for fetching images as part of AI/ML training. The presence of the noai and noimageai tags indicates that the content owner or hosting provider does not want their material used in such a fashion. In addition, use to the contrary is typically against the website's terms of service (e.g. the DeviantArt ToS) and may violate local laws, including the US Computer Fraud and Abuse Act and the DMCA.

Unfortunately, most users stick with the default options. As such, by skipping this check by default, the vast majority of users running this tool will unwittingly be using it against the express wishes of the creator and/or copyright holder. It would be greatly appreciated if you would take these concerns into consideration.

@raincoastchris
Contributor Author

> I would very much recommend performing this check by default and providing a flag to skip the check if that's really what the user wants to do.

@rom1504 What do you think? I can easily change the defaults, and then revert some of the changes to examples.

@sumpfork

@rom1504 This would also go a long way towards addressing one of the concerns raised in the ethics review of the LAION-5B dataset. Respecting these tags by default moves ML data collection closer to a point where the end user does not have to be aware of every dataset being collected before it is collected. Instead, they can rely on their hosting provider's and the dataset collector's support of the tag to exclude their data from the set as appropriate. I think this is in both the ML community's and the image owners' best interest.

Peter (Head of AI, DeviantArt)

rom1504 merged commit eb672d5 into rom1504:main on Nov 24, 2022
@rom1504
Owner

rom1504 commented Nov 24, 2022

We talked with @raincoastchris and agreed to:

  • Document the options in the readme to inform img2dataset users of their meaning and why they may choose to use them
  • Add an example for LAION-art with these options enabled, to make it easier for people choosing to enable them
  • Leave them off by default, for reproducibility concerns

I hope that helps to alleviate some of the DeviantArt users' concerns.

raincoastchris deleted the x_robots_tag branch on November 24, 2022, 17:57
@Stealcase
Contributor

After hearing these well-reasoned arguments for making this check the default, you decided instead that "automatically disrespect artists, break ToS, and potentially the law" should be the default. Am I understanding this right?

Seems a bit careless.

What happened to the principle of sensible defaults?

I mean, you're still allowing the software to completely ignore these directives. Why can't that be optional instead?

I don't understand the reproducibility argument. Since the biggest datasets were catalogued, many images have gone off the web; it's already impossible to reproduce the original datasets completely.

@PinballsWizard

> I hope that helps to alleviate some of the DeviantArt users' concerns.

It does not. Try again on the ethics here.

@MrLightningBolt

"We merged this but gutted it so that we can wash our hands of responsibility without actually respecting anyone's wishes or moral rights" cool

@estroBiologist

Hey guys, it's "noai", not "idratheryoudidntbutreallywhoamitosaynoai". Leaving in the ability to disregard this extremely clear dataset opt-out notice is questionable enough - both ethically, and as stated several times already, legally - but making that behaviour the default... I'm trying to find a word for it, other than malicious.

The repository owner locked the conversation as too heated and limited it to collaborators on Dec 5, 2022.
@rom1504
Owner

rom1504 commented Dec 19, 2022

#249: opting out is now enabled by default.

If using this default, be aware that there are ethical issues with slowing down the democratization of skills and art for millions of people.
