Proposal: Meta Tag for AI Generated Content #9479

Open
evayde opened this issue Jul 2, 2023 · 32 comments
Labels
addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest

Comments

@evayde

evayde commented Jul 2, 2023

Introduction

With the rapid growth of artificial intelligence, and especially of machine learning models that train on web data, the following issues arise:

  • these models themselves train on (poorly) generated data over and over again,
  • users don't know whether the content is generated or not,
  • and search engines cannot judge the quality of content.

Currently, there is no standard way for website owners to express that AI models (partly) generated their content. This proposal seeks to address this issue by introducing a new HTML meta tag called ai-generated.

The Proposed Solution

I propose the introduction of an HTML meta tag named ai-generated. This tag would have a content attribute with the following possible values:

  • all: The whole main content was generated by AI
  • partially: The content was co-authored by AI
  • none: none of the content was generated by AI
  • unknown (internal value?): it is unknown whether the content was generated. This value should be assumed in case of an absence of the meta tag

The tag would appear in the <head> of an HTML document. For example:

<meta name="ai-generated" content="partially">
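
As a rough illustration of how a consumer (a crawler or browser extension, say) might read this tag, here is a minimal Python sketch using only the standard library. The function name `read_ai_generated` and the decision to map unrecognised values to `unknown` are assumptions made for this example; the proposal itself only says that `unknown` should be assumed when the tag is absent.

```python
from html.parser import HTMLParser

# Values taken from the proposal above.
ALLOWED_VALUES = {"all", "partially", "none", "unknown"}


class AiGeneratedMetaParser(HTMLParser):
    """Remembers the content of the first <meta name="ai-generated"> tag."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "ai-generated":
            self.value = (attr.get("content") or "").strip().lower()


def read_ai_generated(html: str) -> str:
    """Return the declared value, or "unknown" if absent or unrecognised."""
    parser = AiGeneratedMetaParser()
    parser.feed(html)
    parser.close()
    return parser.value if parser.value in ALLOWED_VALUES else "unknown"


print(read_ai_generated('<head><meta name="ai-generated" content="partially"></head>'))
```

Treating a missing tag and an invalid value the same way keeps the consumer side simple, but a real specification would have to decide that explicitly.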

Use Cases

Below are some examples of when the ai-generated meta tag could be used:

1. Let search engines know the content was (partially) generated by AI

Websites use AI-generated content in different ways. In the future, search engines might be aware that the content was generated by AI (because they generated it themselves), and not providing the meta tag would automatically de-rank those websites.

2. Let users know the content was (partially) generated by AI

When browsers see this meta tag, they could visually indicate that parts of the website were authored by AI, telling the user to treat the information with caution.

3. Let AI know that this content was generated by AI

AI should be aware that the following content was already generated, and thus, the information might be flawed.
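
To make this use case a little more concrete, the following Python sketch shows how a training-data pipeline might act on the declared value. Everything here is hypothetical: the function name, the input shape (pairs of URL and declared value), and the policy of excluding `all` and `partially` are assumptions for illustration, and the whole approach depends on authors declaring the value honestly.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical policy: declared values this crawler will not train on.
EXCLUDED_FOR_TRAINING = {"all", "partially"}


def select_training_pages(pages: Iterable[Tuple[str, Optional[str]]]) -> List[str]:
    """Return the URLs whose declared ai-generated value is not excluded.

    Each item is (url, declared_value). None means the meta tag was absent,
    which the proposal says should be read as "unknown".
    """
    kept = []
    for url, declared in pages:
        value = (declared or "unknown").strip().lower()
        if value not in EXCLUDED_FOR_TRAINING:
            kept.append(url)
    return kept


pages = [
    ("https://example.com/a", "all"),
    ("https://example.com/b", "none"),
    ("https://example.com/c", None),
]
print(select_training_pages(pages))
```

Whether `unknown` pages should be kept or dropped is a policy choice; this sketch keeps them, since most of the existing web carries no such tag.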

Examples

Below are examples of how to use the ai-generated meta tag:

1. The whole (main) content was generated by AI (e.g., the main chunk of text content)
<meta name="ai-generated" content="all">

2. Only parts of the content were generated by AI
<meta name="ai-generated" content="partially">

3. Nothing on this website was generated by AI
<meta name="ai-generated" content="none">

Existing Solutions

We have two existing tags that could solve this problem, but we would have to standardize the use:

1. Meta Generator
<meta name="generator" content="Chat-GPT">

The meta generator tag indicates that the structure of the document has been generated. In my opinion, this is good enough but solves a different problem. It could, however, actually be used to indicate that the structure of a website was generated by AI.

2. Meta Author
This tag is more interesting as it does exactly what was proposed. But its use would have to be standardized in order to be useful:

The content was fully created by AI:
<meta name="author" content="AI">

The content was co-authored by AI:
<meta name="author" content="Me, AI">

The content was not created by AI:
<meta name="author" content="Me">

In my opinion, having a dedicated meta tag for ai-generated is the better solution.

Other considerations

1. Why should an author use the tag?
Authors need incentives to use this tag. First of all, they contribute to the quality of AI-generated content, as AI might not pick up content that had been generated. Second, we have to be able to identify the content that was generated. Adobe already tries this with Firefly, but we also need a mechanism for written content. So, in the future, Search Engines and other relevant players might punish content that was generated and doesn't explicitly state so.

2. Schema Org
We could move the whole issue to Schema Org and call it a day. E.g., by proposing the ai-generated attribute to them, users could indicate whether articles etc. were generated.

3. How to show which parts of content were generated by AI?
This is an unsolved problem. I am not a friend of creating a new attribute or even new tags, but currently, this might be the only way to solve it:

<span ai-generated="true">Foo</span>

Of course, this would indeed be easier if we just used the schema org solution. Or maybe a combination.
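
For what it's worth, a per-element attribute like the one above would be fairly easy for a consumer to process. The Python sketch below separates text inside elements marked `ai-generated="true"` from the rest. It is only an illustration: the class and function names are made up, the attribute is not part of any standard, and the sketch assumes reasonably well-formed markup (every non-void element has a closing tag).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class AiSpanCollector(HTMLParser):
    """Splits text into AI-marked and unmarked parts."""

    def __init__(self):
        super().__init__()
        self.stack = []       # one bool per open element: marked, or inside a marked ancestor
        self.ai_text = []
        self.other_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        marked = dict(attrs).get("ai-generated") == "true"
        inherited = self.stack[-1] if self.stack else False
        self.stack.append(marked or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        inside_ai = bool(self.stack) and self.stack[-1]
        (self.ai_text if inside_ai else self.other_text).append(text)


def split_ai_text(html: str):
    """Return (ai_text_parts, other_text_parts) for the given markup."""
    collector = AiSpanCollector()
    collector.feed(html)
    collector.close()
    return collector.ai_text, collector.other_text


ai, other = split_ai_text('<p>Intro <span ai-generated="true">Foo</span> outro</p>')
print(ai, other)
```

Descendants of a marked element are treated as marked too, which seems like the natural reading but is an assumption rather than something the proposal states.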

Conclusion

The proposed ai-generated meta tag provides a standard method for website owners to express that their content was (partially) generated by AI. It would promote transparency and respect for website users, contributing to a more ethical web environment for AI.

How to declare which parts of the website are generated remains unresolved and open to discussion.

Other

I copied some of the text from this issue which proposed the ai-consent meta tag, as they were very similar. #9334

@domenic domenic added addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest labels Jul 14, 2023
@Pandapip1

It's quite possible that an image or a bit of text is AI generated, but the rest of the document isn't. Wouldn't an attribute be better?

Also, a better option might be to have a more general attribution attribute, which would be a strict superset of this feature.

@evayde
Author

evayde commented Aug 1, 2023

It's quite possible that an image or a bit of text is AI generated, but the rest of the document isn't. Wouldn't an attribute be better?

Read "Other Considerations" number 3.

Also, a better option might be to have a more general attribution attribute, which would be a strict superset of this feature.

Wouldn't that be the Author meta tag? This was mentioned in "Existing Solutions" number 2.

@Pandapip1

Pandapip1 commented Aug 2, 2023

Somehow I completely missed those. Yes, those are roughly what I'm thinking of.

I still think that being able to indicate which AI generated the content is valuable, and that per-element specificity is equally valuable. Worst-case, stick it on the body tag.

@evayde
Author

evayde commented Aug 3, 2023

I still think that being able to indicate which AI generated the content is valuable

Possible with a combination of existing meta tags, imo

<meta name="author" content="Me, ChatGPT">
<meta name="ai-generated" content="partially">

This seems redundant, but it actually isn't. In this case, we set the name of the AI in the author meta content instead of just writing "AI." We cannot possibly keep track of every AI name. So this would explicitly say that AI created parts of the content, and the additional information which AI was used to create the content is added to the author meta information.

@Pandapip1

Pandapip1 commented Aug 8, 2023

Just so you know - for multiple authors, multiple tags should be used IIRC.

<meta name="author" content="Me">
<meta name="author" content="ChatGPT">
<meta name="ai-generated" content="partially">

How about a flag be added to the author meta tag instead, like:

<meta name="author" content="Me">
<meta name="author" content="ChatGPT" ai="1">
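
A consumer could read this flag with very little code. The Python sketch below collects one (name, is_ai) pair per author tag and derives an overall level from the flags. The `ai` attribute is only the suggestion made in this comment, and the mapping from flags to `all` / `partially` / `none` / `unknown` is an assumption added here to show how it could line up with the original proposal.

```python
from html.parser import HTMLParser


class AuthorCollector(HTMLParser):
    """Collects (author name, is_ai) from <meta name="author"> tags."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "author":
            name = (attr.get("content") or "").strip()
            # Hypothetical flag: ai="1" marks the author as an AI.
            self.authors.append((name, attr.get("ai") == "1"))


def summarize_authorship(html: str):
    """Return (authors, level), where level mirrors the ai-generated values."""
    collector = AuthorCollector()
    collector.feed(html)
    collector.close()
    flags = [is_ai for _, is_ai in collector.authors]
    if not flags:
        level = "unknown"
    elif all(flags):
        level = "all"
    elif any(flags):
        level = "partially"
    else:
        level = "none"
    return collector.authors, level


html = '<meta name="author" content="Me"><meta name="author" content="ChatGPT" ai="1">'
print(summarize_authorship(html))
```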

@BLamy

BLamy commented Aug 16, 2023

I think it would be really useful to be able to link embeddings databases to webpages.

For instance

<link rel="embeddings" type="openai/text-embedding-ada-002" src="./public/embeddings.sqlite">

That way a user could semantically chat with a website without having to embed the page for themselves.

Something kind of like https://til.simonwillison.net/llms/openai-embeddings-related-content

@Pandapip1

I think it would be really useful to be able to link embeddings databases to webpages.

Probably worth opening a separate issue for that. It seems out-of-scope for this proposal.

@TheRealRitMan

TheRealRitMan commented Sep 11, 2023

How about if it defined IDs and/or classes so that the area could easily be delineated and parsed? Absence of the "ai-generated" meta tag means the author is asserting there is no ai-generated content, which in turn eases backward compatibility, and the only individuals affected are AI content publishers who need to catch up with the standard.

<meta name="ai-generated" content="id=id1,id2;class=class1,class2..." />
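
A content value in this shape could be parsed along the following lines. This is a sketch only: the function name is hypothetical, and the choice to silently skip unknown keys and empty entries is an assumption, since the comment does not say how malformed values should be handled.

```python
def parse_ai_generated_targets(content: str) -> dict:
    """Parse 'id=id1,id2;class=class1,class2' into {'id': [...], 'class': [...]}."""
    targets = {"id": [], "class": []}
    for group in content.split(";"):
        key, sep, values = group.partition("=")
        key = key.strip().lower()
        if not sep or key not in targets:
            continue  # skip malformed groups and unknown keys
        targets[key].extend(v.strip() for v in values.split(",") if v.strip())
    return targets


print(parse_ai_generated_targets("id=id1,id2;class=class1,class2"))
```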

@Pandapip1

I'd still prefer the ai="1" syntax, but I would be okay with that as a close second.

I'm really not a fan of people boycotting AI-generated work for various reasons, but at the same time I recognize that if I were one of those people, I would definitely want a standard like this. As such, I am generally +0 to this proposal as a whole.

@TheRealRitMan

I absolutely agree with the bit operator as first choice, but we need to be more specific than that. In one of my use cases, I want to have a chatbot on my page, so I don't want the rest of my content to be ignored. I do not see this as boycotting AI content, simply identifying it. If the intention of the content is above board, it should not be an issue.

@Pandapip1

I think what you want is an RDFa extension, then.

@TheRealRitMan

TheRealRitMan commented Sep 13, 2023

Like so? And I must disclose I used AI to help me; since that is what this thread is about, it wouldn't be right for me not to. In my defense, I haven't used markup in probably a decade.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
   <title>AI Generated Content and Author Content Areas</title>
   <!-- Metadata for ChatGPT3.5 content -->
   <meta name="AI-generated" content="ChatGPT3.5" />
   <!-- Metadata for ChatGPT4 content -->
   <meta name="AI-generated" content="ChatGPT4" />
</head>
<body>
   <h1>Human Generated Content</h1>
   <p>by Adam</p>

   <!-- AI-generated content for ChatGPT3.5 -->
   <div id="ChatGPT3.5-content">
      <h2>ChatGPT3.5 Generated Content</h2>
      <p>This section contains content generated by ChatGPT3.5.</p>
      <!-- Add ChatGPT3.5 content here -->
   </div>

   <!-- AI-generated content for ChatGPT4 -->
   <div id="ChatGPT4-content">
      <h2>ChatGPT4 Generated Content</h2>
      <p>This section contains content generated by ChatGPT4.</p>
      <!-- Add ChatGPT4 content here -->
   </div>
</body>
</html>

@Pandapip1

Pandapip1 commented Sep 13, 2023

No, RDFa is a standard that effectively allows you to have scoped meta tags. I'm proposing that AIPerson be an extension of Person with no new properties (but is, of course, interpreted to be an AI).

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Lorem Ipsum Document</title>
</head>
<body>
    <h1>About Lorem Ipsum</h1>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam in vehicula arcu. Sed non urna in libero venenatis aliquam ac a odio.</p>
    
    <!-- RDFa Markup for Lorem Ipsum -->
    <div about="#about-lorem-ipsum" typeof="schema:CreativeWork">
        <h2 property="schema:name">Lorem Ipsum</h2>
        <p property="schema:description">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam in vehicula arcu. Sed non urna in libero venenatis aliquam ac a odio.</p>
        <span property="schema:datePublished">2023-09-13</span>
    </div>

    <h2>More Lorem Ipsum</h2>
    <p>Phasellus nec diam vel ipsum blandit consectetur. Nullam vestibulum, ligula at ullamcorper suscipit, neque metus sodales ex, sit amet facilisis tellus neque a libero.</p>

    <!-- RDFa Markup for More Lorem Ipsum -->
    <div about="#more-lorem-ipsum" typeof="schema:CreativeWork">
        <h3 property="schema:name">More Lorem Ipsum</h3>
        <p property="schema:description">Phasellus nec diam vel ipsum blandit consectetur. Nullam vestibulum, ligula at ullamcorper suscipit, neque metus sodales ex, sit amet facilisis tellus neque a libero.</p>
        <span property="schema:datePublished">2023-09-13</span>
    </div>

    <h2>Author</h2>

    <div about="#author" typeof="AIPerson">
        <h4 property="schema:name">AI Author</h4>
    </div>
</body>
</html>

@bathos

bathos commented Sep 13, 2023

At least with regard to some of the stated use cases, this appears to suffer from the evil bit problem.

@Pandapip1

So does robots.txt, but it's still useful and generally respected nonetheless.

@TheRealRitMan

TheRealRitMan commented Sep 15, 2023

<head>
  <meta name="author" content="human author,AmazonLex,Claude,GoogleBard,Pi,LLaMA2,Copilot,ChatGPT4,etc">
  <meta name="ai-content" value="true">  <!-- SIMPLES -->
</head>

<body>
  
 <!-- NOW THE BROWSER KNOWS THERE IS AI CONTENT IN THIS DOCUMENT AND THE AUTHOR(S) 
     END OF META ROLE `FINISH IN SCRIPT TAG -->

<h1>HUMAN GENERATED CONTENT</H1>

  <script
    type="application/ai"  // BROWSER KNOWS THIS IS LIVE AI CONTENT 
    src="/script.ai"  // IN THIS CASE LOCAL TO THE SERVER
    id="HumanAuthorMadeCustomerServiceBot" // THIS ID IS FOR THE USER'S INFORMATION
  ></script>

<h1>HUMAN GENERATED CONTENT</H1>

  <script
    type="application/ai"  // BROWSER KNOWS THIS IS LIVE AI CONTENT 
    src="https://anthropic.com/script.ai"  // IN THIS CASE OFF-SITE
    async 
    defer
    crossorigin="anonymous"
    integrity="sha256-abc123xyz456" // SECURITY WHICH NEEDS TO BE ADDRESSED AS WELL
    referrerpolicy="strict-origin-when-cross-origin"
    nonce="abc123" // CONTENT SECURITY POLICY WHITELISTS AUTHOR
    id="Claude-content"
  ></script>

<h1>HUMAN GENERATED CONTENT</H1>

  <script
    type="text/ai" // STATIC AI-GENERATED TEXT - BROWSER KNOWS THIS IS AI CONTENT
    src="/script.ai" // LOCAL SCRIPT
    id="GoogleBard-API-content" // EXAMPLE OF A TYPE OF API USE CASE
    // RENDERS WITHIN THE SCRIPT TAGS - HUMAN AUTHOR ADDS MARKUP LIKE <div id="xxx" class="yyy"></div>
  ></script>

<h1>HUMAN GENERATED CONTENT</H1>

  <script
    type="text/ai" // STATIC AI-GENERATED TEXT - BROWSER KNOWS THIS IS AI CONTENT
  ><span id="LLaMA2-content">INLINE CONTENT HERE - SORRY AGAIN GANG</span></script>

<h1>HUMAN GENERATED CONTENT</H1>

   <!--  DOES THIS SOLUTION HOLD WATER? -->

</body>

@bathos

bathos commented Sep 15, 2023

@TheRealRitMan What is that supposed to be demonstrating? What is that media type supposed to represent? What do CSP nonces or subresource integrity have to do with this concept (etc)?

@TheRealRitMan

TheRealRitMan commented Sep 15, 2023

In short, the meta tag lets the browser know there is AI content. Do we need meta info to also identify where every bit of it is amongst hybrid content? I was going with @Pandapip1's suggestion of the bit operator for the meta tag: is it AI, true or false? If true, it's identified in the SCRIPT tag by the content type. Notice I had AI for a live AI because it might be a two-way connection, and ai-generated is static content furnished by script or inline.

You (and the spider) know there is AI content, you know all the authors in multiple-AI situations, you know if it's live or static, you know where it is, and you know which author it was if the human author tells you. You know what kind of AI it is, and you know if it is remote or local. Those are a few examples. I feel like it's a bit cleaner than the RDFa route, but I am a hack, so maybe I have no business trying to define standards! If I'm being dumb, call me an idiot, it's cool :) We never had an issue this serious, and I am honored to be part of the discussion. I feel like AI is no less significant than when the Internet itself came out.

@Pandapip1

A MIME type for AI-generated content is an interesting idea, but it isn't really what MIME types are for. MIME types are for specifying what format your data is in, not for specifying other metadata. So -1 to that idea.

+1 to my RDFa and <meta author="???" ai="1"> solutions. They extend those particular standards in ways they were intended to be extended and effectively solve the problem.

+0.5 to the <meta name="ai-generated" content="id=id1,id2;class=class1,class2..." /> solution. For the latter case, I would prefer if the content was a CSS selector (e.g. <meta name="ai-generated" content="#id1, #id2, .class1, .class2">). It also solves the original problem but is a bit less clean and requires implementers to implement additional parsing beyond the normal XML parsing.

@TheRealRitMan

+1 to your upgrade of my id=id suggestion for adding the # and . identifiers for id's and classes, and removing the assignment operator.

I'm neutral on the RDFa because I found it tricky to understand. I am with you on making things as clean as possible, and the *=1 is the shortest way to indicate anything.

The mime types were meant to indicate the type of AI content because I realized there is more than just static content, and we are going to have AI sessions so that was forward looking and

I'm not clear why your <meta name="ai-generated" content="#id1, #id2, .class1, .class2"> syntax doesn't work. That meta information indicates there is AI-generated content and where it resides. It has the same result as the RDFa solution with much less information, is simple to implement, and addresses the bullet points of the OP:

Users don't know whether the content is generated or not,
and Search Engines cannot decide the quality of content

Either way, the spiders are going to have to make changes to weigh the content for value. +1 to you for all your input.

@TheRealRitMan

TheRealRitMan commented Sep 16, 2023

Copied from @Pandapip1's tweak on my idea.

<meta name="ai-generated" content="#id1, .class1">
</head><body>
<div id="id1">AI GENERATED</div>
<div id="nonai1">NON-AI GENERATED CONTENT</div>
<div class="class1">AI GENERATED</div>
<div id="nonai2">NON-AI GENERATED CONTENT</div>

How does it get any simpler? Here is how: MOST credit to @Pandapip1

<meta author="" ai="1">

This is EVEN simpler. I don't know how much I love having this brand new technology be a child of the author tag EVEN THOUGH it is highly relevant. It does prevent you from adding a human author without any modification.

AND WE CAN'T FORGET @evayde, who offered an extensive view of the options. (IDK if these points are real or what, but I say you get MAD props for such a detailed and clear layout of the problem, and the Proposed Solutions, Use Cases, Other Considerations, Other Other.) Whatever the most points you can get, you deserve.

<meta name="ai-generated" content="partially">

I think this, with my addition of putting the id and classes in:

<meta name="ai-generated" content="id=id1,id2;class=class1,class2..." />

AND THEN when @Pandapip1 added the selectors and the correct way of assigning them:

<meta name="ai-generated" content="#id1, #id2, .class1, .class2">

Is the winner for being clear cut, user friendly and everyone here contributed. Anyone care to second that?

@Pandapip1

Pandapip1 commented Sep 17, 2023

I'm fine with <meta name="ai-generated" content="css selector"> as a standard.
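
For illustration, here is roughly what a consumer of the selector form might do, as a Python sketch with the standard library only. It is deliberately limited: it understands just comma-separated `#id` and `.class` selectors (the kind used in the examples in this thread) and ignores everything else, whereas a real implementation would need a full CSS selector engine. All names in the code are made up for the example.

```python
from html.parser import HTMLParser


def parse_simple_selectors(content: str):
    """Split '#id1, .class1' into a set of ids and a set of class names."""
    ids, classes = set(), set()
    for part in content.split(","):
        selector = part.strip()
        if selector.startswith("#") and len(selector) > 1:
            ids.add(selector[1:])
        elif selector.startswith(".") and len(selector) > 1:
            classes.add(selector[1:])
        # Any other selector syntax is ignored in this sketch.
    return ids, classes


class MarkedElementFinder(HTMLParser):
    """Records (tag, id) for every element matched by an id or class selector."""

    def __init__(self, ids, classes):
        super().__init__()
        self.ids = ids
        self.classes = classes
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        element_classes = set((attr.get("class") or "").split())
        if attr.get("id") in self.ids or element_classes & self.classes:
            self.matches.append((tag, attr.get("id")))


def find_marked_elements(html: str, content: str):
    ids, classes = parse_simple_selectors(content)
    finder = MarkedElementFinder(ids, classes)
    finder.feed(html)
    finder.close()
    return finder.matches


body = '<div id="id1">AI</div><div id="nonai1">human</div><div class="class1">AI</div>'
print(find_marked_elements(body, "#id1, .class1"))
```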

@TheRealRitMan

@Pandapip1 - I will say again that your ai="1" idea is simpler, but it doesn't cover hybrid situations. And since you optimized my idea, let me say you could have eliminated the double quotes since it is an integer! JK

@evayde - this is your thread; what do you think about <meta name="ai-generated" content="css selector"> as a standard?

@evayde
Author

evayde commented Oct 18, 2023

@TheRealRitMan
I think that the thing with CSS selectors could be prone to a lot of errors and false positives. Especially with how CSS is used in the real world. For instance: How do we handle automatically generated class names? Maybe I am missing a use case here.

To me, it is sufficient to be able to tell that parts of the website are generated. The AI could figure it out by itself (e.g. it is able to figure out what's a navigation, what's a sidebar, what's the main content, and so on). It should be a hint and not a definite guide to every generated word.

I also assume that whoever provides such a hint will most likely also use other measures to inform their users about generated content (e.g. by providing a list of sources, which could be Microdata).

On top of that, this meta tag could also act as an enabler to other (possibly proprietary) technologies. Think of WAI-ARIA, Schema.org, or Open Graph. So, there could be a technology or library building on that (which possibly might work with CSS class names), but in my opinion, this is out of scope of the HTML spec.

@Pandapip1

On top of that, this meta tag could also act as an enabler to other (possibly proprietary) technologies. Think of WAI-ARIA, Schema.org, or Open Graph. So, there could be a technology or library building on that (which possibly might work with CSS class names), but in my opinion, this is out of scope of the HTML spec.

What do you think about the other proposal, ai="1" for author tags?

@evayde
Author

evayde commented Oct 18, 2023

@Pandapip1
I assume that you would use it as follows?

Partially by AI
<meta name="author" content="name of somebody" ai="1">

Completely AI generated
<meta name="author" content="" ai="1">

No AI involved
<meta name="author" content="..." ai="0">
<meta name="author" content="...">

That could also work, however, it could be confusing. What do we do in these cases?
<meta name="author" content="ChatGPT" ai="0"> <- should mean no AI
<meta name="author" content="ChatGPT" ai="1"> <- should mean partially generated by AI

So, the solution might be misunderstood and open to human error, while the proposed solution is explicit:
The mere existence of the author tag doesn't mean anything, so devices would have to read the contents of the tag to figure out the meaning. While a special ai-generated tag would explicitly state that there might be something going on with AI (or nothing at all, but it's explicit).

Also, there's another thing about my proposal, what would that mean?

<meta name="ai-generated" content="all">
<meta name="author" content="some person">

It means everything was generated by an AI, but there's a human author. Now, it could mean that it was the person who used AI to generate the content (something that couldn't be expressed with your solution).

I don't want to simply dismiss your idea. As I mentioned earlier, I don't like to pollute HTML with more and more meta tags. And this is what I like about your approach, it reuses the author meta. Despite the shortcomings, it would still be a viable solution in my eyes.

@Pandapip1

No, there would just be one author meta tag per author, as usual. If the author is an AI, the AI flag is set.

@MatthiasWiesmann

For schema.org, there is a proposal to map the IPTC tags values here, which I feel would be relevant:

schemaorg/schemaorg#3392

@danbri

danbri commented Dec 14, 2023

Thanks @MatthiasWiesmann, there's a draft at https://webschemas.org/IPTCDigitalSourceEnumeration now. Although we would do well to add some examples, the link to the IPTC codes is very explicit, so round-tripping between embedded-in-image metadata and published-in-a-referencing-webpage metadata ought to be straightforward in most cases.

@ioaoai

ioaoai commented Dec 21, 2023

Great discussion. I like the proposal.

@Khrommm

Khrommm commented May 11, 2024

I'd still prefer the ai="1" syntax, but I would be okay with that as a close second.

I'm really not a fan of people boycotting AI-generated work for various reasons, but at the same time I recognize that if I were one of those people, I would definitely want a standard like this. As such, I am generally +0 to this proposal as a whole.

AI content is low-effort content created on the basis of valuable content that took hundreds of hours of human work. Often, it is not based on sources and links because, for example, OpenAI does not disclose the sources of the works on which its models were trained.

Even if it is good and substantive, it is still based on human work. Therefore, AI content should be depreciated compared to human content because it always has less value due to the impossibility or difficulty of verifying sources and the risk of hallucinations in the text, which are sometimes difficult to detect.

Imagine a situation where, while looking for confirmation of whether what is in one text is true, you come across 10 other texts generated by AI with the same nonsense, and you become convinced that you are reading the truth even though there are no real sources.

That is why it is so important to catch it and distinguish it from human content, because there is a risk that search results will be flooded with content fully generated by AI, which is often better written in terms of grammar and SEO, which paradoxically translates into a lower search position for content written by humans. Isn't this about all of us drowning in the hallucinations of an algorithm that will soon start learning from variations of its own output?

@Pandapip1

You make a good point. While I don't believe it's impossible to make LLMs cite their sources (and GPT-4 does when browsing the internet), I agree that tagging content as AI-generated to avoid training LLMs on LLM output is probably needed in order to avoid a feedback cycle.

10 participants