Learn to parse IETF datatracker pages #1122

Closed
tidoust opened this issue Nov 10, 2023 · 0 comments · Fixed by #1135

tidoust commented Nov 10, 2023

IETF entries that have not yet been published as RFCs have a canonical URL that looks like https://datatracker.ietf.org/doc/html/....

Such entries need to be explicit about the organization, the group, and more often than not, a "better looking" nightly URL, e.g. one under https://www.ietf.org/archive/. The problem with the nightly URL is that it typically contains the current revision of the draft and thus becomes outdated as soon as a new revision is published.
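To make the staleness problem concrete, here is a small sketch of how such a nightly URL is put together. The helper name `nightly_url` is hypothetical, the URL pattern is taken from the `draft-zern-webp-13.html` example used in this thread, and the two-digit revision format is an assumption.

```python
def nightly_url(draft_name, revision):
    """Build the archive URL for one specific revision of a draft.

    Hypothetical helper: the pattern is inferred from the example URL
    https://www.ietf.org/archive/id/draft-zern-webp-13.html, and the
    revision is assumed to be rendered as two digits.
    """
    return f"https://www.ietf.org/archive/id/{draft_name}-{revision:02d}.html"
```

Because the revision is part of the path, `nightly_url("draft-zern-webp", 13)` and `nightly_url("draft-zern-webp", 14)` differ, so a hardcoded value goes stale with each new revision.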

As we add more of these specs, the code could instead:

  1. know that it's an IETF spec
  2. fetch the datatracker page, whose URL can be derived directly from the canonical URL (e.g. https://datatracker.ietf.org/doc/draft-zern-webp/), to get the latest nightly URL.

Not sure yet how to extract the group itself though.
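
The two steps in the list above could be sketched roughly as follows. This is only an illustration, not the project's code: the function name `datatracker_page_url` is hypothetical, and the only URL mapping assumed is the one quoted in this issue (`/doc/html/<name>/` to `/doc/<name>/`).

```python
import re
from urllib.parse import urlparse

# Path of a canonical draft URL, as quoted in this issue (assumption:
# a single path segment holds the draft name).
_DOC_PATH = re.compile(r"^/doc/html/([^/]+)/?$")


def datatracker_page_url(canonical_url):
    """Return the datatracker page URL for an IETF draft, or None.

    Step 1: recognize the spec as an IETF draft from its canonical URL.
    Step 2: derive the datatracker page URL from that canonical URL.
    """
    parsed = urlparse(canonical_url)
    if parsed.hostname != "datatracker.ietf.org":
        return None  # not an IETF draft URL
    match = _DOC_PATH.match(parsed.path)
    if match is None:
        return None  # unexpected path layout
    return f"https://datatracker.ietf.org/doc/{match.group(1)}/"
```

Fetching and parsing the resulting page is left out on purpose, since the page layout is not described here.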

tidoust added a commit that referenced this issue Nov 10, 2023
Note the need to be more explicit about IETF entries. It would be worth
improving the code now that we start to add more of these specs. This is
tracked in #1122.

This would add:

```json
[
  {
    "url": "https://datatracker.ietf.org/doc/html/draft-zern-webp/",
    "seriesComposition": "full",
    "shortname": "webp",
    "series": {
      "shortname": "webp",
      "currentSpecification": "webp",
      "title": "WebP Image Format",
      "shortTitle": "WebP Image Format",
      "nightlyUrl": "https://www.ietf.org/archive/id/draft-zern-webp-13.html"
    },
    "organization": "IETF",
    "groups": [
      {
        "name": "Network Working Group",
        "url": "https://datatracker.ietf.org/group/app/"
      }
    ],
    "nightly": {
      "url": "https://www.ietf.org/archive/id/draft-zern-webp-13.html",
      "status": "Editor's Draft",
      "alternateUrls": [],
      "filename": "draft-zern-webp-13.html"
    },
    "title": "WebP Image Format",
    "source": "spec",
    "shortTitle": "WebP Image Format",
    "categories": [
      "browser"
    ],
    "standing": "good"
  },
  {
    "url": "https://www.rfc-editor.org/rfc/rfc6386",
    "seriesComposition": "full",
    "shortname": "rfc6386",
    "series": {
      "shortname": "rfc6386",
      "currentSpecification": "rfc6386",
      "title": "VP8 Data Format and Decoding Guide",
      "shortTitle": "VP8 Data Format and Decoding Guide",
      "nightlyUrl": "https://www.rfc-editor.org/rfc/rfc6386"
    },
    "groups": [
      {
        "name": "Independent Submission",
        "url": "https://datatracker.ietf.org/stream/ise/"
      }
    ],
    "organization": "IETF",
    "nightly": {
      "url": "https://www.rfc-editor.org/rfc/rfc6386",
      "status": "Informational",
      "alternateUrls": [],
      "filename": "rfc6386.html"
    },
    "title": "VP8 Data Format and Decoding Guide",
    "source": "specref",
    "shortTitle": "VP8 Data Format and Decoding Guide",
    "categories": [
      "browser"
    ],
    "standing": "good"
  }
]
```
tidoust added a commit that referenced this issue Nov 10, 2023
Note the need to be somewhat explicit about IETF entries. It would be worth
improving the code now that we start adding more of these specs. This is
tracked in #1122.


---------

Co-authored-by: Francois Daoust <fd@tidoust.net>
tidoust added a commit that referenced this issue Nov 21, 2023
The code now recognizes IETF draft documents that have a `datatracker.ietf.org`
URL:
- It associates them with the IETF organization.
- It can compute a useful shortname. That code can in theory return a truncated
shortname because there is no direct way to validate that the Internet Draft
name contains a group ID.
- It extracts the group's ID from the nightly URL. That code could be improved
further to fetch the actual group name; right now it only knows about the
"HTTP" working group.
- It associates IETF documents from the HTTP WG with the right repository.
- It computes the better-looking nightly URL at `www.ietf.org`, or at
`httpwg.org` for HTTP WG documents.
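
As a rough illustration of the group ID extraction mentioned in the list above, the sketch below pulls a group ID out of a nightly URL. It is not the code from this commit: the helper name `group_id_from_nightly_url` is hypothetical, and it assumes that working group drafts are named `draft-ietf-<group>-<topic>-<revision>`.

```python
import re

# Assumed naming convention for working group drafts:
#   draft-ietf-<group>-<topic>-<two-digit revision>.html
_WG_DRAFT = re.compile(r"/draft-ietf-([a-z0-9]+)-[^/]*-\d{2}\.html$")


def group_id_from_nightly_url(nightly_url):
    """Return the group ID embedded in a nightly URL, or None.

    Drafts that do not follow the assumed "draft-ietf-" convention
    (e.g. individual submissions) yield None.
    """
    match = _WG_DRAFT.search(nightly_url)
    return match.group(1) if match else None
```

With this sketch, `https://www.ietf.org/archive/id/draft-zern-webp-13.html` yields `None`, which matches the point made in this commit message that individual drafts should not be tied to a group.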

This makes it possible to simplify IETF data in `specs.json` a bit.

Note that the code still cannot automatically process drafts that have been
submitted by individuals, even when these drafts are targeted at a group.
Such drafts should be associated with the individuals who submitted them and
not with any group. A couple of spec entries, which incorrectly referenced the
Network WG or the HTTP WG, were fixed accordingly in `specs.json`.

This fixes #1122, but note that the code does not need to fetch the datatracker
page for the time being.
tidoust added a commit that referenced this issue Nov 22, 2023
The code now recognizes IETF draft documents that have a `datatracker.ietf.org`
URL and fetches all the information it needs for IETF drafts and RFCs from the
IETF datatracker using the Simplified Documents API:
https://datatracker.ietf.org/api/#simplified-documents

- It associates them with the IETF organization.
- It can compute a useful shortname. That code can in theory return a truncated
shortname because there is no direct way to validate that the Internet Draft
name contains a group ID.
- It extracts the group from the datatracker API.
- It associates IETF documents from the HTTP WG with the right repository.
- It computes the better-looking nightly URL at `www.ietf.org`, or at
`httpwg.org` for HTTP WG documents.
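
For readers unfamiliar with the API, a lookup along the lines described above might look like the sketch below. Treat it as an assumption-laden illustration rather than the code from this commit: the `/doc/<name>/doc.json` endpoint and the `name`, `rev` and `group` response fields are assumptions to check against the API documentation linked above, and all function names are hypothetical.

```python
import json
from urllib.request import urlopen


def doc_json_url(name):
    # Assumed endpoint of the Simplified Documents API.
    return f"https://datatracker.ietf.org/doc/{name}/doc.json"


def summarize(doc):
    """Pick the fields of interest from a parsed API response.

    The field names ("name", "rev", "group") are assumptions.
    """
    group = doc.get("group") or {}
    return {
        "organization": "IETF",
        "name": doc.get("name"),
        "revision": doc.get("rev"),
        "group": group.get("name"),
    }


def fetch_ietf_info(name):
    """Fetch and summarize the datatracker record for one document."""
    with urlopen(doc_json_url(name)) as response:
        return summarize(json.load(response))
```

Keeping `summarize` separate from the network call makes the field handling easy to test against a saved response.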

This makes it possible to simplify IETF data in `specs.json` a bit.

This fixes #1122, but note that the code does not need to fetch the datatracker
page for the time being.

IETF documents may be linked to a group, an area, or be part of what IETF calls
individual submissions. Areas and individual submissions still link to a "group"
page at IETF, so the code just takes that info from datatracker as-is. As a
result, individual submissions are no longer associated with the author who
submitted the document, but that does not seem needed in any case.

The code throws when an IETF document that it knows under a certain name has
been published under a different name, to alert us that the canonical URL needs
to change in browser-specs. Name changes typically happen when a document
transitions to a working group, or when it gets published as an RFC.
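
The name check described above could be sketched as follows. This is a simplified illustration, not the committed code: `IetfNameChanged` and `check_document_name` are hypothetical names, and the `name` field of the API response is an assumption.

```python
class IetfNameChanged(Exception):
    """Raised when the datatracker reports a different document name."""


def check_document_name(known_name, doc):
    """Return the document name if it still matches, else raise.

    `doc` is a parsed API response; its "name" field is an assumption.
    """
    current_name = doc.get("name")
    if current_name != known_name:
        raise IetfNameChanged(
            f'"{known_name}" is now published as "{current_name}": '
            "the canonical URL needs to be updated"
        )
    return current_name
```

Failing loudly here means a renamed document surfaces as an error to fix, instead of the entry silently pointing at an outdated name.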