Proposal: HTML Attribute to state non-consent when scraping for training datasets #10101
Labels: addition/proposal (New features or enhancements), needs implementer interest (Moving the issue forward requires implementers to express interest)
What problem are you trying to solve?
Since the release of ChatGPT in late 2022, there has been widespread concern over companies scraping data from websites without explicit consent from the publisher. The scraped data is used in paid products such as ChatGPT without properly compensating the sources it was drawn from. Governments around the world lag in creating laws to protect content creators from these LLMs, but we can leverage HTML to create a paper trail for when those laws eventually develop. Users of HTML should have access to a standard attribute, which must be explicitly set, to formally state non-consent to web scraping for the purpose of training any form of Artificial Intelligence.
What solutions exist today?
The existing solutions for this problem are limited:

1. A human-readable notice on the page (for example, in a footer or terms of service) stating that scraping for AI training is not permitted.
   a. This is not ideal, since sentence structure and language may vary. The lack of a standard data shape makes it difficult for parsers to automatically filter out content, and it lacks the granularity of control provided by a standard attribute.
2. A custom, non-standard attribute chosen by the site author.
   a. This is also a poor option, for the same reason mentioned above: it is hard to parse because it is not standardized. A parser does not know which attribute to look for.
How would you solve it?
I propose the introduction of a new standard HTML attribute that allows authors to explicitly state that they do not consent to all, or parts, of their webpage being used to train any form of Artificial Intelligence. Here is the new attribute and its states:

`training="denied"`: The author does not consent to this element's content, or that of its descendants, being scraped for use in training any form of Artificial Intelligence.

`training="accepted"`, or the absence of the attribute (default behavior): The author does not mind having their webapp content scraped for use in training any form of Artificial Intelligence, or an element up the tree already has the `training` attribute set (and therefore this child element inherits its value).

Use Cases
The use-cases for this feature are fairly simple and straightforward:

1. The author does not consent to any of their content being used for AI training.
   a. Requirements:
      - Set the (`training`) attribute at the outermost HTML tag.
2. The author consents to only some of their content being used.
   a. Requirements:
      - Set the (`training`) attribute at the outermost HTML tag to opt out, and set it on the specific child elements that opt back in.
3. The author consents to most of their content being used, with exceptions.
   a. Requirements:
      - The value of the (`training`) attribute at the outermost HTML tag is that of consent by the author of the webpage (`accepted`).
      - Child elements that are excluded set the (`training`) attribute to opt-out.

Examples
An example of a user who does not consent to any of the content on their site to be used in datasets for training AI models:
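A sketch of such a document, assuming `denied` as the opt-out value (the element contents here are illustrative):

```html
<!-- "denied" on the root element opts the entire document out,
     since descendant elements inherit the value. -->
<html training="denied">
  <body>
    <article>
      <p>Nothing on this page may be used to train AI models.</p>
    </article>
  </body>
</html>
```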
An example of a user who only consents to some of their content on their site to be used in datasets for training AI models:
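A sketch of this partial opt-in case, using the same illustrative values: the page opts out at the root, and individual elements opt back in:

```html
<html training="denied">
  <body>
    <article>
      <p>This essay may not be used for training.</p>
    </article>
    <!-- An explicit value on a child overrides the inherited one. -->
    <section training="accepted">
      <p>This openly licensed section may be used for training.</p>
    </section>
  </body>
</html>
```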
An example of a user who consents to most of the content on their site being used in datasets for training AI models, but has some content which they don't consent to being used:
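Conversely, a sketch of a mostly permissive page with a targeted opt-out (values again illustrative):

```html
<html training="accepted">
  <body>
    <p>General content, fine to use for training.</p>
    <div training="denied">
      <p>Personal content that may not be used for training.</p>
    </div>
  </body>
</html>
```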
The default behavior, where the user consents to all of the content on their website being used in datasets for training AI models:
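And the default case: a document with no `training` attribute anywhere, which under this proposal is treated as consent:

```html
<html>
  <body>
    <p>No training attribute is present, so the default (consent) applies.</p>
  </body>
</html>
```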
Anything else?
Other Considerations
1. Why bother, if no law currently requires this? For documentation purposes, and to leave a paper trail if/when protections are created. Government policy is lagging, but it is likely only a matter of time before the data-sourcing issue for training AI models gets addressed, and these training datasets are still growing. Developers should have a way to protect themselves and their web content as early as possible, and entities seeking to create training datasets from web content should have a standard method for filtering out content that does not consent to being used. This is a legal win for both the content creator and the training-model creator.
2. Why does the attribute default to consent? It may be confusing that this attribute defaults to the author giving consent, rather than following a principle of least privilege and defaulting to non-consent. I think it is a matter of audit reliability. If the training data for an LLM were ever put under a microscope, it must be clear, within each HTML document, that the author explicitly stated they do not consent to data collection for the purpose of training LLMs. To ensure no confusion, the standard attribute should assume that the author does consent to such data collection. If the attribute defaulted to non-consent, companies parsing the data would be able to argue that the author used an older version of HTML which does not include the standard attribute, or that the data was collected before the attribute was introduced. Having the attribute be explicit in its intent prevents these claims.
3. Why is a standard attribute better than the existing solutions? The existing solutions mentioned above are not easily parsed, as they lack a standard form. A standard attribute allows web developers to easily give explicit non-consent to anyone who desires to scrape the website for use in AI training datasets. Furthermore, it leaves no excuse for entities parsing webapps not to check for these attributes first when crawling.
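For crawler authors, honoring the attribute could be as simple as tracking inherited consent while parsing. A minimal sketch using Python's standard-library `html.parser` (the `TrainingFilter` class is mine, and the `denied`/`accepted` values follow the semantics assumed above):

```python
from html.parser import HTMLParser

class TrainingFilter(HTMLParser):
    """Collects only text whose nearest ancestor with a `training`
    attribute permits use (absence of the attribute means consent)."""

    def __init__(self):
        super().__init__()
        self.consent_stack = [True]  # default state: consent
        self.allowed_text = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("training")
        if value is None:
            consent = self.consent_stack[-1]  # inherit from parent
        else:
            consent = (value == "accepted")   # explicit value overrides
        self.consent_stack.append(consent)

    def handle_endtag(self, tag):
        if len(self.consent_stack) > 1:  # never pop the default state
            self.consent_stack.pop()

    def handle_data(self, data):
        if self.consent_stack[-1] and data.strip():
            self.allowed_text.append(data.strip())

doc = """<html training="accepted">
<body><p>public</p><div training="denied"><p>private</p></div></body>
</html>"""
f = TrainingFilter()
f.feed(doc)
print(f.allowed_text)  # -> ['public']
```

Only the text under the `accepted` subtree survives; the `denied` subtree is dropped without the crawler needing any site-specific logic.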
Conclusion
We should not wait for government policy before providing web developers and training-dataset creators with tools that can help protect web content from being used, without consent, to train AI models.