Proposal: Meta Tag for AI Consent Management #9334

Open
brennancaldwell opened this issue May 25, 2023 · 10 comments
Labels
addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest


@brennancaldwell

brennancaldwell commented May 25, 2023

Introduction

With the rapid growth of artificial intelligence, and especially machine learning models that train on web data, the issue of data usage consent has become more relevant than ever. Currently, there is no standard way for website owners to express their consent or otherwise for AI models to use their data for training or crawling purposes. This proposal seeks to address this issue by introducing a new HTML meta tag called ai-consent.

The Proposed Solution

I propose the introduction of an HTML meta tag named ai-consent. This tag would have a content attribute with the following possible values:

  • all: The website owner consents to the use of their content for both AI model training and live search operations.
  • search-only: The website owner consents to the use of their content for live search operations only, provided the source website is cited by the AI agent. They do not consent to the use of their content for AI model training.
  • none: The website owner does not consent to the use of their content by AI for any purpose.

The tag would appear in the <head> of an HTML document. For example:

<meta name="ai-consent" content="all">
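A crawler could read this tag with an ordinary HTML parser. The following is only a minimal sketch, not part of the proposal: the names `AIConsentParser` and `read_ai_consent` are hypothetical, and falling back to `none` when the tag is missing or carries an unknown value is an assumption, since the proposal does not define that case.

```python
from html.parser import HTMLParser

# The three values listed in the proposal.
ALLOWED_VALUES = {"all", "search-only", "none"}


class AIConsentParser(HTMLParser):
    """Records the first valid <meta name="ai-consent"> value it sees."""

    def __init__(self):
        super().__init__()
        self.consent = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.consent is not None:
            return
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes.
        if (attr.get("name") or "").strip().lower() == "ai-consent":
            value = (attr.get("content") or "").strip().lower()
            if value in ALLOWED_VALUES:
                self.consent = value


def read_ai_consent(html, default="none"):
    """Return the page's ai-consent value, or `default` (an assumption)."""
    parser = AIConsentParser()
    parser.feed(html)
    return parser.consent if parser.consent is not None else default
```

For instance, `read_ai_consent('<head><meta name="ai-consent" content="all"></head>')` yields `"all"`, while a page without the tag yields the assumed default.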

Use Cases and Examples

Below are some examples of how the ai-consent tag could be used:

  1. A news website owner wants their articles to be included in both AI training and search results. They would use:
<meta name="ai-consent" content="all">
  2. A personal blog author does not want their content included in AI model training but is fine with it being used for live search results, provided the blog is cited. They would use:
<meta name="ai-consent" content="search-only">
  3. A privacy-focused website's owner does not want their content used by AI at all. They would use:
<meta name="ai-consent" content="none">
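The three cases above amount to a small lookup from tag value to permitted uses. A sketch of that mapping follows, under stated assumptions: the `Permissions` type and `permissions_for` function are hypothetical names, treating an unrecognized value as `none` is an assumption, and the citation flag is set only for `search-only` because that is the only value for which the proposal mentions citation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permissions:
    training: bool           # may the content be used for model training?
    live_search: bool        # may the content be used in live search?
    citation_required: bool  # must the source website be cited?


# One entry per value described in the proposal.
POLICY = {
    "all": Permissions(training=True, live_search=True, citation_required=False),
    "search-only": Permissions(training=False, live_search=True, citation_required=True),
    "none": Permissions(training=False, live_search=False, citation_required=False),
}


def permissions_for(value):
    """Map an ai-consent value to permissions; unknown values deny everything (assumption)."""
    return POLICY.get(value.strip().lower(), POLICY["none"])
```

A training pipeline would then check `permissions_for(value).training` before ingesting a page, and a search agent would check `live_search` and `citation_required`.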

Considerations

This proposal introduces a method for website owners to manage consent regarding AI data usage and is similar in intent to the noindex meta tag. However, it does not enforce the consent. It would be the responsibility of AI creators and operators to respect and enforce these tags, which might not happen short of robust regulation. Additionally, the proposed tag would need to be included in popular web crawlers' whitelists of meta tags.

Conclusion

The proposed ai-consent meta tag provides a standard method for website owners to express their consent for AI data usage. It would promote transparency and respect for website owners' data preferences, contributing to a more ethical web environment for AI.

@rthrejheytjyrtj545

Why should the author explicitly choose none to indicate that they do not agree? What is meant by the absence of this type of metadata?

Doesn't this proposal duplicate the existing license link type? Interested parties can already create a mechanism like CC REL and provide the appropriate legal background; this is an organizational issue, not a technological one.

@brennancaldwell
Author

These are great points! Thank you for pointing these out. I had considered just proposing all and search-only -- I believe the default assumption should be no consent.

I also agree that this is more a question of organization than technology. The details of implementation aren't important to me so much as agreeing on a standard for establishing consent specifically in the case of model training and search. Perhaps this can indeed be handled using a license link tag.

@rthrejheytjyrtj545

rthrejheytjyrtj545 commented May 25, 2023

By the way, if you keep it as something similar in force to DNT, you can move the proposal to the Microformats Wiki (which will be officially recognized as a specification), or take it to WICG. Also, bikeshedding: something like notraining and nosnipping would sound more “vanilla”.

@brennancaldwell
Author

Thank you!

@rthrejheytjyrtj545

rthrejheytjyrtj545 commented May 25, 2023

No problem. What I suggested to you in the comment above is a move away from metadata in favor of a link type.

You can, of course, write a specification and send it to MetaExtensions, but this is a chore, and the rule “However, a new metadata name should not be created in any of the following cases: If the name is for something expected to have processing requirements in user agents; in that case it ought to be standardized” might apply, given that crawlers are also user agents in some way. So <link href = . rel = training/> might be a good option...

@ramijwar

wow that's awesome

domenic added the addition/proposal and needs implementer interest labels on Jun 1, 2023
@saschanaz
Member

saschanaz commented Jun 23, 2023

FYI, DeviantArt and SketchFab came up with <meta name="robots" content="noai">.
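For comparison, detecting that directive could look roughly like the sketch below. It assumes that noai is read like other robots meta directives, as one token in a comma-separated, case-insensitive content list; the names `RobotsMetaParser` and `has_noai` are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from every <meta name="robots"> tag on the page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").strip().lower() == "robots":
            content = attr.get("content") or ""
            # Directives are treated as a comma-separated list (assumption for noai).
            for token in content.split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def has_noai(html):
    """True if any robots meta tag on the page lists the noai directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noai" in parser.directives
```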

@myakura

myakura commented Jun 27, 2023

I believe that bots can crawl non-HTML resource files, such as source code or images. Isn't it better to define this in (or on top of) the robots.txt protocol?
https://datatracker.ietf.org/doc/html/rfc9309

@rthrejheytjyrtj545

@myakura, no, because there will be countless crawlers in the future, and authors cannot be made responsible for keeping track of them. In addition, no one wants to limit crawling in this case, only the use of the collected content.

@jfhr

jfhr commented Aug 15, 2023

One consideration here is that crawlers would need to download each individual page to find out if it has an ai-consent meta tag. Downloading lots of pages just to find out you can't use them is a waste of money - as long as this is a voluntary standard, companies would be less incentivized to respect it at all.

The robots.txt standard avoids exactly that problem by having a single file for an entire origin. Perhaps a similar file could be introduced for AI consent management, e.g.

All: /documentation
Search-Only: /weblog
None: /personal

This could be hosted under a well-known URI such as /.well-known/ai-consent.txt
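A rough sketch of how a crawler might consume such a file is below. The file format is only the three-line example above, so everything else is an assumption: the function names `parse_ai_consent_txt` and `consent_for_path` are hypothetical, rules are matched by plain path prefix, the longest matching prefix wins (borrowed from how robots.txt picks the most specific rule), and paths with no matching rule fall back to `none`.

```python
VALID_KEYS = {"all", "search-only", "none"}


def parse_ai_consent_txt(text):
    """Parse lines like 'Search-Only: /weblog' into (path_prefix, value) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        key, path = line.split(":", 1)
        key, path = key.strip().lower(), path.strip()
        if key in VALID_KEYS and path.startswith("/"):
            rules.append((path, key))
    return rules


def consent_for_path(rules, url_path, default="none"):
    """Return the value of the longest matching prefix, else `default` (assumption)."""
    best = None
    for prefix, value in rules:
        if url_path.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, value)
    return best[1] if best is not None else default


example = """All: /documentation
Search-Only: /weblog
None: /personal
"""
rules = parse_ai_consent_txt(example)
print(consent_for_path(rules, "/weblog/2023/post"))
```

Because the match is a plain string prefix, `/weblog` would also cover `/weblogger`; a real specification would need to say whether prefixes end at path-segment boundaries.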

7 participants