
clarify GDC population at startup #9

Closed
tomkralidis opened this issue May 9, 2023 · 18 comments · Fixed by #48
Labels
global-discovery-catalogue Global Discovery Catalogue

Comments

@tomkralidis
Collaborator

The WIS2 Guide states in section 8.4.1:

A Global Cache will store a full set of discovery metadata records. This is not an additional metadata catalogue that Data Consumers can search and browse – it provides a complete set of discovery metadata records to support populating a Global Discovery Catalogue instance.

This ensures that a GDC can initialize itself from already-published discovery metadata in the event of a catastrophe/re-deploy.

Options:

  1. have Global Caches store all discovery metadata at a known endpoint for a GDC to bootstrap itself on init
  2. re-define the above and specify that WIS2 Nodes are required to re-publish all discovery metadata on a periodic basis (weekly?)
  3. other options?

cc @golfvert @6a6d74 @efucile

@tomkralidis tomkralidis changed the title from "provide requirements for Global Cache to store WCMP2 at know endpoint" to "clarify GDC population at startup" May 9, 2023
@antje-s

antje-s commented May 11, 2023

As a starting point, option 1 is OK from my point of view; in the longer term it depends on the total amount of metadata, as those records could then not be overwritten after 24 hours like the other cached data.
I do not prefer option 2.
Alternatively, each WIS2 Node could provide its own metadata on its data server permanently, instead of only for the initial downloads after an update/add. But this would have the disadvantage that, for an initial recovery, all individual sources would have to be queried. To do this, the source web service addresses must be listed somewhere. Even if this could be done via service metadata in the GDCs, a query would then first have to be run against another GDC, and the risk of gaps in the total holdings due to WIS2 Nodes that cannot currently be queried would be quite high. On the other hand, it has the advantage that each WIS2 Node is responsible for the completeness and provision of its own metadata.

@tomkralidis tomkralidis added the global-discovery-catalogue Global Discovery Catalogue label May 21, 2023
@kaiwirt
Contributor

kaiwirt commented May 23, 2023

I would prefer option 1. Republishing metadata regularly does not sound good to me.

For the endpoint: This could be added to the GC service metadata. This way GCs could change the URL by updating their own metadata. A GDC could then query GC metadata from another GDC and use this information to populate its own metadata store.

@tomkralidis
Collaborator Author

@kaiwirt sure. GDCs can publish WCMP2 service records of themselves, and provide a link to an archive like:

{
    "rel": "archives",
    "type": "application/zip",
    "title": "Archive of all WCMP2 records",
    "href": "https://example.org/path/to/gdc-wcmp2-latest.zip"
}

@kaiwirt
Contributor

kaiwirt commented May 23, 2023

Option 1 was that GCs have the metadata at a known endpoint. What I meant was that this endpoint does not need to be known in advance; it can be part of the GC's metadata.
Bootstrapping would then work like this (see the sketch after the list):

  1. Ask another GDC for the service metadata of one of the GCs
  2. Get the metadata endpoint URL from the GC metadata
  3. Bootstrap one's own GDC from the URL obtained in step 2
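
A minimal sketch of this flow in Python, assuming a hypothetical GDC API path, link relation and hostnames (the actual GDC API and WCMP2 link conventions may differ):

import io
import json
import zipfile

import requests

def bootstrap_gdc(other_gdc_api: str, gc_centre_id: str) -> list[dict]:
    # 1. ask another GDC for the service metadata record of a Global Cache
    gc_record = requests.get(
        f"{other_gdc_api}/collections/discovery-metadata/items/{gc_centre_id}"  # hypothetical path
    ).json()

    # 2. find the metadata archive endpoint in the GC record's links
    archive_url = next(
        link["href"] for link in gc_record.get("links", [])
        if link.get("rel") == "archives"
    )

    # 3. download the archive and load every WCMP2 record it contains
    archive = zipfile.ZipFile(io.BytesIO(requests.get(archive_url).content))
    return [
        json.loads(archive.read(name))
        for name in archive.namelist()
        if name.endswith(".json")
    ]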

@tomkralidis
Collaborator Author

Yes, we are saying the same thing. The "known URL" is a function of the link with rel=archives in the GDC WCMP2 record.

@tomkralidis
Collaborator Author

Real-world example: we (Canada) had to re-initialize our GDC this week, and found there was no "archive" from which to perform a cold start.

We have initially loaded all known wis2box discovery metadata because we know how wis2box makes WCMP2 available (supported by the GDC reference implementation).

Of course this is not enough, as we need all WCMP2 records for all WIS2 Nodes on a cold start.

Ideally all records should be in the GC for a GDC to pull from.

cc @golfvert @kaiwirt @antje-s

@golfvert
Collaborator

For (almost) static data like metadata records, climate datasets, and so on, I would suggest publishing a notification message (e.g.) once a day. I wouldn't create a special case for metadata, such as a zip file.

@6a6d74
Collaborator

6a6d74 commented Oct 12, 2023

Moving this Issue to "urgent" - we need some discussion on this ahead of the face-to-face meeting in November

@6a6d74
Collaborator

6a6d74 commented Oct 16, 2023

Discussed at ET-W2AT, 16-Oct-2023. Summary of key points and decision below.

Objectives:

  1. Provide a full set of metadata records to "bootstrap" a GDC from scratch (e.g., a metadata archive)
  2. Avoid routine publishing of messages that refer to resources that have not changed (e.g., a metadata record that is only updated once or twice per year)
  3. Avoid forcing the GC to implement special logic for different types of resource (data, metadata, reports)
  4. Ensure that the metadata archive is provided in a fault tolerant way (i.e., available from multiple locations)

Proposal:

Move the requirement to manage a discovery metadata archive from the Global Cache to the Global Discovery Catalogue. The GDC already has to implement logic dealing with inserts, updates and deletions to the catalogue. Publishing a resource which contains the full set of currently valid metadata records seems reasonably straightforward for the GDC. This means that the GC no longer has to implement special application logic just to deal with metadata records.

As with other data and metadata, the GC would subscribe to updates about these metadata archive resources and download a copy for re-publication, caching it for 24 hours.
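
A rough sketch of that GC behaviour in Python, assuming an MQTT subscription via paho-mqtt, a placeholder broker host and topic wildcard, and simplified notification fields (the real WIS2 Notification Message schema carries more fields):

import json

import paho.mqtt.client as mqtt
import requests

def on_message(client, userdata, msg):
    notification = json.loads(msg.payload)
    # follow the canonical link in the notification to fetch the advertised resource
    url = next(
        link["href"] for link in notification.get("links", [])
        if link.get("rel") == "canonical"
    )
    content = requests.get(url).content
    # store the resource, re-publish it from the cache, and expire it after 24 hours (not shown)

client = mqtt.Client()
client.on_message = on_message
client.connect("globalbroker.example.org")
client.subscribe("origin/a/wis2/+/metadata/#")
client.loop_forever()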

This adds the need for the GDC to operate a Message Broker. But @golfvert noted that all Global Services will need to operate a Message Broker to publish WIS2 alert/monitoring messages (i.e., "reports"). This led to a wider discussion about "report" messages in WIS2.

We also noted that a WIS Centre may share its technical capabilities (e.g., a Message Broker) with another WIS Centre. For example, DWD may allow MSC to publish notifications on its broker. This is a bilateral agreement and doesn't need to be covered in the Tech Regs. In this example, MSC would still be accountable for operating a broker; they do so by delegating responsibility to DWD. The Guide needs updating to describe this arrangement, which is likely to happen where larger Centres are supporting smaller ones (e.g., NZ supporting Pacific Island States).

Actions:

Define the details of the metadata archive resource (e.g., zip file?).

Define the details of the notification message to be used - especially where it's published in the topic hierarchy.

Create / update Issue about the [different types of] report messages in WIS2. << @tomkralidis

Update Technical Regulation (Manual on WIS Vol II):

  • [update] 3.7.5.6 A Global Cache shall retain a copy of core data and discovery metadata for a duration compatible with the real-time or near real-time schedule of the data and not less than 24-hours.
  • [delete] 3.7.5.7 A Global Cache shall replace a discovery metadata record if an updated version is available.
  • [delete] 3.7.5.8 A Global Cache shall retain a copy of a discovery metadata record until a notification is received indicating that the record should be removed.
  • [add] 3.7.6.x A Global Discovery Catalogue shall operate a Message Broker.
  • [add] 3.7.6.x Once per day, a Global Discovery Catalogue shall publish the complete set of valid discovery metadata records as a single downloadable resource (i.e., a metadata archive file).
  • [add] 3.7.6.x A Global Discovery Catalogue shall provide open access to the metadata archive resource.
  • [add] 3.7.6.x A Global Discovery Catalogue shall publish notifications via its Message Broker about the availability of a new metadata archive resource.
  • [update] 4.5.4 Based on received notifications, a Global Cache shall download core data and discovery metadata from WIS nodes and other Global Services and store for a duration compatible with the real-time or near real-time schedule of the data and not less than 24-hours.
  • [delete] 4.5.5 Based on received notifications, a Global Cache shall download discovery metadata from WIS nodes or other Global Caches and store until receipt of a notification requesting deletion of that discovery metadata record.
  • [add] 4.6.x A Global Discovery Catalogue shall, once per day, publish the full set of valid discovery metadata records as a single-file, downloadable metadata archive resource and publish a notification message advertising availability of this metadata archive resource.

Update the Guide to reflect this proposal and add a section about bilateral agreements to share technical components.

Amend GDC implementations.

@tomkralidis
Collaborator Author

Create / update Issue about the [different types of] report messages in WIS2. << @tomkralidis

Added in #44

@6a6d74
Collaborator

6a6d74 commented Oct 16, 2023

If we use the report subtree of the topic hierarchy, we need to develop rules about how the GC should behave in terms of caching.

.../data/core/... - conditional; yes unless properties.cache = false
.../metadata/... - yes

Maybe the easiest way to direct the GC behaviour is:

.../report/... - conditional; no unless properties.cache = true?
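
Expressed as a sketch (a hypothetical helper; the actual GC logic and notification schema may differ):

def should_cache(topic: str, properties: dict) -> bool:
    if "/metadata" in topic:
        return True                             # metadata: always cache
    if "/data/core/" in topic:
        return properties.get("cache", True)    # core data: cache unless properties.cache = false
    if "/report/" in topic:
        return properties.get("cache", False)   # reports: cache only if properties.cache = true
    return False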

@tomkralidis
Collaborator Author

Sure, so this means a GDC needs to explicitly specify that a metadata archive is to be cached.

@golfvert
Collaborator

Why wouldn't the metadata zip file be like any other data that we exchange in the normal data tree?
As for the other point of the discussion, monitoring/alerting, I would put this outside the user-visible tree of origin/ and cache/.
This is something only for WIS2 operations and there is no need to have it visible to all.

So in summary:

  1. For the metadata zip from the GDC: just a normal product with the normal notification message
  2. For alerting and monitoring: use monitor/... and a notification message with a different structure and convention

@6a6d74
Collaborator

6a6d74 commented Oct 16, 2023

Treating the metadata archive as …/data/core/… is also OK. And it would be my preference. We just need to decide :)

If we do treat the metadata archive like data then maybe it should have a discovery metadata record too? Discoverable via the GDC.

@6a6d74
Collaborator

6a6d74 commented Oct 23, 2023

@tomkralidis, @golfvert - can we conclude this discussion? I think the only outstanding decision is whether we treat the metadata archive as a "normal" data resource - implying it sits in the .../data/core/... subtree of the topic hierarchy and has a discovery metadata record itself.

@tomkralidis
Collaborator Author

Given the context of this issue (i.e. the purpose of such an archive would primarily be to support a cold start), I suggest (for this phase):

  • post to .../metadata/...
  • no WCMP2 record

If we want to provide metadata as data, along with a WCMP2 record, I think we need to have more discussions around lifecycle:

  • how often to archive
  • what to provide and how far back, i.e. should we take daily snapshots and allow for retrieval of the archive at a given state/date?

In other words, the GDC becomes a pseudo data provider and needs to consider the data management lifecycle.

@efucile
Member

efucile commented Nov 13, 2023

Decision

Each GDC will publish every day a zipfile containing all metadata records and will provide a landing page with a link to the zipfile.
The GDC will publish the zip file on the following metadata topic:

origin/a/wis2/centre-id/metadata

The GC will cache all the metadata (all metadata records are treated as core) published on the /metadata topic, with a persistence of at least 24 hours, to allow the GDC to access them.
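
A minimal sketch of that daily publication from the GDC side, assuming local file paths, an example broker host and archive URL, and simplified notification fields (the actual WIS2 Notification Message schema has more required fields):

import json
import zipfile
from datetime import datetime, timezone
from pathlib import Path

import paho.mqtt.publish as publish

def publish_daily_archive(records_dir: str, out_dir: str, centre_id: str) -> None:
    today = datetime.now(timezone.utc).strftime("%Y%m%d")
    archive_path = Path(out_dir) / f"wcmp2-archive-{today}.zip"

    # bundle every currently valid WCMP2 record into a single zipfile
    with zipfile.ZipFile(archive_path, "w") as archive:
        for record in sorted(Path(records_dir).glob("*.json")):
            archive.write(record, arcname=record.name)

    # advertise the archive on the centre's metadata topic (simplified message)
    message = {
        "properties": {
            "data_id": f"{centre_id}/metadata/{archive_path.name}",
            "pubtime": datetime.now(timezone.utc).isoformat(),
        },
        "links": [{
            "rel": "canonical",
            "type": "application/zip",
            "href": f"https://example.org/archives/{archive_path.name}",
        }],
    }
    publish.single(
        topic=f"origin/a/wis2/{centre_id}/metadata",
        payload=json.dumps(message),
        hostname="broker.example.org",
    )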

@tomkralidis
Collaborator Author

PR in #48. Note that related GC provisions are already in the Guide.

tomkralidis added a commit that referenced this issue Nov 22, 2023
* add GDC zipfile provision (#9)

* update GDC zipfile notes