Proposal: HTML Attribute to state non-consent when scraping for training datasets #10101

Closed
gouldcs opened this issue Jan 28, 2024 · 1 comment
Labels
addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest

gouldcs commented Jan 28, 2024

What problem are you trying to solve?

Since the release of ChatGPT in late 2022, there has been widespread concern over companies scraping data from websites without the publisher's explicit consent. The data is used to build paid products such as ChatGPT, without properly compensating the sources it came from. Governments around the world lag in creating laws to protect content creators from this use of their work, but we can leverage HTML to create a paper trail for when these laws eventually develop. Users of HTML should have access to a standard attribute that they can explicitly set to formally withhold consent to web scraping for the purpose of training any form of Artificial Intelligence.

What solutions exist today?

The existing solutions for this problem are limited:

  1. Users can explicitly state their non-consent in the content of the webapp itself.
    a. This is not ideal, since sentence structure/language may vary. The lack of a standard data shape makes it difficult for parsers to automatically filter out content, and it lacks the granularity of control provided by a standard attribute.
  2. Users can set a custom data attribute.
    a. This is also a poor option for the same reason mentioned above. It’s hard to parse because it’s not standardized. A parser doesn’t know what attribute to look for.

How would you solve it?

I propose the introduction of a new standard HTML attribute that allows users to explicitly state they don't consent to all, or parts, of their webpage being used to train any form of Artificial Intelligence. Here is the new attribute and its states:

  • training="rejected": When the user wants to specify that they do not consent to the use of this element and its contents for training data.
  • training="accepted" or the absence of the attribute (default behavior): When the user does not mind their webapp content being scraped for use in training any form of Artificial Intelligence. An element without the attribute inherits the value set on its nearest ancestor that has one.

Use Cases

The use-cases for this feature are fairly simple and straightforward:

  1. The author of a webapp wants to explicitly state that they don't consent to their content being parsed for the purpose of being included in training data for an LLM, and do so in such a way that it is encoded within the HTML itself.
    a. Requirements:
    1. Must be able to set a standard (training) attribute at the outermost HTML tag
    2. All children of the tag mentioned above must inherit the set value of the attribute
  2. The author of a webapp wants to explicitly state that they don't consent to some (or most) of the content on their website being parsed and included in training data for an LLM, but may select specific elements which they consent to being parsed. This can be done by specifying nonconsent at the root of the webapp, and then specifying consent for specific components. The component, and its children, will inherit this consent.
    a. Requirements:
    1. When a parent element sets this attribute, all children must inherit the set value of the attribute.
    2. If a child element overrides the attribute with its own value, the child element will reference the overridden value, and apply it to its children as well.
  3. The author of a webapp wants to explicitly state that they consent to their content being parsed for the purpose of being included in training data for an LLM (default/assumed behavior).
    a. Requirements:
    1. The default behavior of the new training attribute is that of consent by the author of the webpage (accepted)
    2. The webpage author must explicitly set the training attribute to opt-out.
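The inheritance requirements above can be sketched as a small resolution rule. This is only an illustration, not part of the proposal: the function name effective_training is hypothetical, and treating unrecognized values as if the attribute were absent is an assumption the proposal does not address.

```python
from typing import Optional, Sequence

ACCEPTED = "accepted"
REJECTED = "rejected"


def effective_training(chain: Sequence[Optional[str]]) -> str:
    """Resolve the effective `training` value for one element.

    `chain` holds the declared `training` attribute of every element from
    the document root down to the element itself. None means the attribute
    is absent on that element.
    """
    value = ACCEPTED  # proposed default: consent unless stated otherwise
    for declared in chain:
        # Assumption: values other than the two proposed states are ignored.
        if declared in (ACCEPTED, REJECTED):
            value = declared  # the nearest explicit declaration wins
    return value
```

For example, effective_training(["rejected", "accepted", None]) resolves to "accepted" for a child of an opted-in component inside an opted-out page, which matches use case 2.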

Examples

An example of a user who does not consent to any of the content on their site to be used in datasets for training AI models:

<body training="rejected">
    <h1>
        There is explicit non-consent for the use of this content to train AI models.
    </h1>
</body>

An example of a user who only consents to some of their content on their site to be used in datasets for training AI models:

<body training="rejected">
    <h1>
        There is explicit non-consent for the use of this content to train AI models.
    </h1>
    <p training="accepted">
        There is consent for this content to be used to train AI models
    </p>
</body>

An example of a user who consents to most of the content on their site being used in datasets for training AI models, but has some content which they don't consent to being used:

<body>
    <h1>
        There is consent for this content to be used to train AI models.
    </h1>
    <p training="rejected">
        There is explicit non-consent for the use of this content to train AI models.
    </p>
</body>

The default behavior, where the user consents to all of the content on their website being used in datasets for training AI models:

<body>
    <h1>
        There is default consent for this content to be used to train AI models.
    </h1>
</body>
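
On the scraping side, a dataset builder could apply the same rules while parsing. The following is a minimal sketch using Python's standard html.parser module; the ConsentFilter class and consented_text function are hypothetical names, and the sketch assumes well-formed markup in which every non-void element has an explicit closing tag.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they must not be pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ConsentFilter(HTMLParser):
    """Collects text only from elements whose effective value is "accepted"."""

    def __init__(self):
        super().__init__()
        self.stack = []  # effective training value of each open element
        self.kept = []   # text fragments that may be used

    def _current(self):
        # No open element means the proposed default applies.
        return self.stack[-1] if self.stack else "accepted"

    def handle_starttag(self, tag, attrs):
        declared = dict(attrs).get("training")
        # An explicit value overrides the parent; otherwise inherit it.
        if declared in ("accepted", "rejected"):
            value = declared
        else:
            value = self._current()
        if tag not in VOID_TAGS:
            self.stack.append(value)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self._current() == "accepted":
            self.kept.append(text)


def consented_text(html):
    parser = ConsentFilter()
    parser.feed(html)
    parser.close()
    return parser.kept
```

Fed the second example above, consented_text keeps only the text of the paragraph marked training="accepted". A real crawler would need an HTML5-compliant parser to handle implied end tags, which this stack-based sketch does not.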

Anything else?

Other Considerations

  1. Why should anyone use this attribute if it provides no protection?
    a. For documentation purposes, and to leave a paper trail if/when protections are created. Government policy is lagging, but it's likely only a matter of time before the data sourcing issue for training AI models gets addressed. These training datasets are still growing. Developers should have a way to protect themselves and their web content as early as possible, and entities seeking to create training datasets from web content should have a standard method for filtering out content whose author does not consent to its use in training datasets. This is a legal win for both the content creator and the training model creator.
  2. Why should this attribute default to providing consent?
    a. It may be confusing that this attribute defaults to the author giving consent, rather than following a principle of least privilege and defaulting to non-consent. I think it's a concern of audit reliability. If training data for an LLM were ever put under a microscope, it must be clear, within each HTML document, that the author explicitly stated they do not consent to data collection for the purpose of training LLMs. To avoid confusion, the standard attribute should assume that the author does consent to such data collection. If the attribute defaulted to non-consent, companies parsing the data would be able to argue that the author used an older version of HTML which doesn't include the standard attribute, or that the data was collected before the attribute was introduced. Making the attribute explicit in its intent prevents these claims.
  3. Why create a new attribute for this?
    a. The existing solutions mentioned above are not easily parsed as they lack a standard form. Establishing a standard attribute allows web developers to easily give explicit non-consent to anyone who wants to scrape the website for use in AI training datasets. Furthermore, it leaves entities parsing webapps no excuse for not checking for these attributes first when crawling.

Conclusion

We should not wait for government policy before giving web developers and training dataset creators tools that help protect web content from being used to train AI models without consent.

@gouldcs gouldcs added addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest labels Jan 28, 2024
@gouldcs gouldcs changed the title Proposal: Proposal: HTML Attribute to state non-consent when scraping for training datasets Jan 28, 2024
@keithamus
Contributor

Unfortunately the HTML standard has no jurisdiction or ability to enforce anything onto companies that scrape web documents. Adding such an attribute without any level of enforcement would likely result in no real change, and may harm forward progress (as historical precedent we could look at DNT).

In addition, I think this is the wrong venue. HTML prescribes how a page behaves and how the browser should consume and present it, but it does little around providing structured data. You might find that the schema.org working group would be a better venue for this sort of work. Alternatively leveraging robots.txt may be a tool that works today.
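
As an illustration of the robots.txt route mentioned above, a crawler can check a site's rules with Python's standard urllib.robotparser module. This is a sketch under stated assumptions: ExampleAIBot is a hypothetical user agent, may_fetch is a hypothetical helper, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Unlike the proposed attribute, robots.txt works per path rather than per element, and compliance is equally voluntary.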

Just to clarify, I'm going to close this issue for the reasons above, but that does not constitute an opinion against the idea or the desire behind it. I encourage you to explore this further in other standards venues. Best of luck!

@keithamus keithamus closed this as not planned Jan 29, 2024