Prevent caching of large datasets in Global Caches #7

Closed
golfvert opened this issue Mar 8, 2023 · 26 comments
Labels
global-cache Global Cache

Comments

@golfvert
Collaborator

golfvert commented Mar 8, 2023

Centres like ECMWF or Eumetsat will provide very large amounts of core data that nevertheless should not (or may not) be stored in the Global Cache.
In the current approach, there is no mechanism to prevent this kind of data from ending up in the cache.
Right now, all data published using messages in the topics origin/a/wis2/country/centre-id/core/# will be picked up by the Global Cache and copied.
We should define a way to prevent that default behaviour.

updated: 31 May 2023

===DECISIONS

  1. for core data, add properties.cache (true|false, default=true) to the notification message as decided by data producer
  2. if properties.cache: false the Global Cache SHALL:
    • Not download the data made available using this message
    • Publish the Notification in cache/a/wis2/... (as it would if the data were cached) with the properties.links not modified (the link will still point to the data producer endpoint)
@golfvert golfvert changed the title from "Prevent caching of large datasets on Global Cache" to "Prevent caching of large datasets in Global Caches" Mar 8, 2023
@golfvert
Collaborator Author

golfvert commented Mar 8, 2023

Some possible options to consider:

  1. Adding in the notification message a cache: false in the properties
  2. Having this "no cache" as part of the metadata description
  3. Having those datasets published directly in the cache/a/wis2/... topic tree, with the handful of Global Caches using a list of topic hierarchies that are not to be cached. Typically those published by ECMWF, Eumetsat and a few (?) others.
  4. Something else?

@kaiwirt
Contributor

kaiwirt commented Mar 9, 2023

Caches might want to use the length field in the message to decide whether they will be able to handle that data volume. If that field is missing, they could still decide whether to store the data in the cache after the download has finished.
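
A minimal sketch of this size-based check, assuming the message carries a `links` array whose entries may declare a `length` in bytes. The threshold `MAX_CACHEABLE_BYTES` and the function names are hypothetical; the thread does not define any limit.

```python
from typing import Optional

# Hypothetical per-cache limit; no threshold is defined in this thread.
MAX_CACHEABLE_BYTES = 100 * 1024 * 1024


def declared_length(message: dict) -> Optional[int]:
    """Return the largest declared link length in bytes, or None if no link declares one."""
    lengths = [
        link["length"]
        for link in message.get("links", [])
        if isinstance(link.get("length"), int)
    ]
    return max(lengths) if lengths else None


def should_download(message: dict, limit: int = MAX_CACHEABLE_BYTES) -> bool:
    """Decide before downloading; if no length is declared, defer the decision."""
    length = declared_length(message)
    return length is None or length <= limit


def should_store(downloaded_bytes: int, limit: int = MAX_CACHEABLE_BYTES) -> bool:
    """Decide after downloading, using the size actually received."""
    return downloaded_bytes <= limit
```

With this split, a cache would call `should_download` when the notification arrives and `should_store` once the transfer is complete, which covers the case where the length field is missing.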

@efucile
Member

efucile commented Mar 9, 2023

I vote for option 3

@6a6d74
Collaborator

6a6d74 commented Mar 9, 2023

Before looking for the right mechanism, I think it makes sense to agree who decides if things are cached.

Is it the data publisher? In which case putting a flag in the discovery metadata or notification message would work.

Is it the cache operator? In which case "message length" could be a useful criterion.
... Aside: given that all caches would produce their own notification messages, would WIS2 work if caches made different decisions about what they were willing to cache? Example: if only Germany cached some things, consumers would only receive notifications pointing to the German cache?

The data publisher probably has a better idea of whether the data is real-time or near-real-time.
Aside: maybe the update frequency (specified in the discovery metadata) is a good metric for identifying real-time or near-real-time data?

The cache owner is the one impacted by the choice in terms of data down/upload and storage cost.

@golfvert
Collaborator Author

golfvert commented Mar 9, 2023

And, to have the complete picture, we should also consider the impact on users.

@tomkralidis
Collaborator

It also depends on whether limits will be mandated across all global services, or vary between them. In either case:

  • putting cache: bool (default: true) in discovery metadata would require other global services to support a lookup between discovery metadata and data notifications. We recently added an (optional) properties.metadata_id element to the notification message which could facilitate this, but again it is optional
  • putting cache: bool (default: true) in the notification message would allow the data publisher to specify caching at the level of a data granule ("from this collection, cache this file, but not that one")
  • putting cache: bool (default: true) may be valuable for articulating that a data granule should not be cached for a reason other than size
  • using the message properties.content.size value, or a link object's length value, can help provide an indication to Global Services so they can decide accordingly. If limits are Global Service specific, then the Global Service should communicate this constraint somehow (an AsyncAPI definition for GB, or a landing page for GC).
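
To make the per-granule flag concrete, here is a rough sketch of reading it with the default applied. The message below is illustrative only: the identifier, URL and size are made up, and the layout (a `properties` object plus a `links` array) is an assumption for the example rather than a quote from the notification message specification.

```python
import json

# Illustrative message only: identifier, URL and size are invented for this sketch.
EXAMPLE_MESSAGE = json.loads("""
{
  "type": "Feature",
  "properties": {
    "data_id": "example/large-granule-001",
    "cache": false
  },
  "links": [
    {
      "rel": "canonical",
      "href": "https://data.example.org/large-granule-001.grib2",
      "length": 5000000000
    }
  ]
}
""")


def wants_caching(message: dict) -> bool:
    """Return the publisher's cache directive; a missing or null key means True."""
    value = message.get("properties", {}).get("cache")
    return True if value is None else bool(value)
```

Because the default is true, publishers only need to set the flag for the granules they do not want cached.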

@6a6d74
Collaborator

6a6d74 commented Mar 10, 2023

Hi Tom. Useful points.

Thinking out loud ...

  1. I think it makes sense for the data publisher to declare whether something should be cached - via discovery metadata and/or notification message (see below).
  2. "should be cached". I think it's for the Global Cache to make the decision about whether or not it will cache something - perhaps looking at size of an individual 'data granule' or the aggregate size of an entire dataset or all the data from a particular WIS2node. If a Global Cache decides not to cache something that a WIS2node has asked to be cached, it should raise an alert that's captured by the Global Monitor and also propagates back to the original WIS2node so they are aware of the issue. The alert should include the reason why it wasn't cached (e.g., file size, storage quota exceeded ...).
  3. Global Cache instances may have different criteria for refusing to cache something. Which would lead to data being cached only by some Global Caches. I think this is OK - the system would still work (albeit a bit less robustly) as data consumers would still be able to access the data, just from a smaller number of caches. Aside: it would also be useful to continually check consistency between Global Cache instances. Easiest way to do this would be to listen to the data notification messages and record which caches sent them for each data granule.
  4. The Global Discovery Catalogue is required to add actionable links to the discovery metadata record for cached datasets; e.g., additional subscription endpoint(s) in the /.../cache/.../ topic, and (possibly?) global cache locations from where the data can be downloaded. So - how does the Global Discovery Catalogue know when to do this? Easiest to have a cache: bool directive in the discovery metadata record. Otherwise, the Global Discovery Catalogue would have to listen to notification messages to see which datasets were actually being cached - which seems complex :). Easiest to have the cache: bool default to true for core data, so that only the edge case of not caching core data needs to be declared. Aside: global cache download locations may not be needed because it's the URL in the notification message(s) that provides that information, and the global cache isn't intended to be a browsable data access point.
  5. I think it's best to avoid the need for a real-time lookup between the Global Cache and the Global Discovery Catalogue to determine what should be cached - the Global Discovery Catalogue isn't designed to be a highly available component. So - ways to mitigate? (A) On start-up, a Global Cache instance could build a configuration of what to cache by scanning the discovery metadata in the catalogue. (B) Each notification message includes a cache: bool directive, so no lookup is needed. To avoid bloating the message, probably best to assume a default true for core data. The WIS2nodes only need to declare the edge case of cache = false. ... I prefer option (B).
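
The consistency check mentioned in the aside to point 3 could be sketched roughly as below. It assumes a monitor can attribute each notification on the cache topics to a cache identifier; how that identifier is obtained is not covered in this thread, so `cache_id` and the class name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class CacheCoverage:
    """Record which Global Caches have announced each data granule."""

    def __init__(self, expected_caches: Iterable[str]) -> None:
        self.expected: Set[str] = set(expected_caches)
        self.seen: Dict[str, Set[str]] = defaultdict(set)

    def record(self, data_id: str, cache_id: str) -> None:
        """Note that cache_id published a notification for data_id."""
        self.seen[data_id].add(cache_id)

    def missing(self, data_id: str) -> Set[str]:
        """Return the expected caches that have not announced data_id."""
        return self.expected - self.seen.get(data_id, set())
```

A non-empty result from `missing` after some grace period would indicate that the caches have diverged for that granule.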

@golfvert
Collaborator Author

golfvert commented Mar 10, 2023

The cache: bool in the notification message is probably the easiest to manage for WIS2 Nodes and Global Cache operations.
This can also be added to the metadata record so that this information is known upon discovery of the data.

From a user perspective, however, this may imply that they would have to subscribe to origin/a/wis2/... to receive the non-cached data.
Which is not very convenient IMHO.

An alternative to my option 3. above is for the WIS2 Node to:

  • add cache: false in the message
  • post this on origin/a/wis2/... as usual
    and for the Global Cache:
  • not to download/cache the data with cache: false
  • nevertheless republish the message onto cache/a/wis2/... without updating the download link(s).

This way, as a user, I don't really care I subscribe to cache/a/wis2/... and I will receive the correct links in the message.

That way, the WIS2 Node is in control, the Global Cache follows a simple rule, and the user does not need to know about this subtlety.
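
The Global Cache side of this proposal could look roughly like the sketch below. It is a simplification under stated assumptions: `download` and `cache_url_for` are hypothetical callables supplied by the cache, links are read from a `links` array, and every link is treated as downloadable.

```python
import copy
from typing import Callable, Tuple


def to_cache_topic(origin_topic: str) -> str:
    """Map an origin/... topic to the corresponding cache/... topic."""
    prefix = "origin/"
    if not origin_topic.startswith(prefix):
        raise ValueError(f"not an origin topic: {origin_topic}")
    return "cache/" + origin_topic[len(prefix):]


def handle_notification(
    topic: str,
    message: dict,
    download: Callable[[str], None],      # hypothetical: fetches and stores the data
    cache_url_for: Callable[[str], str],  # hypothetical: URL of the cached copy
) -> Tuple[str, dict]:
    """Return the (topic, message) pair the Global Cache should republish."""
    out = copy.deepcopy(message)  # never mutate the incoming message
    cache_flag = out.get("properties", {}).get("cache", True)
    if cache_flag is not False:
        # Default behaviour: download the data and point links at the cache.
        for link in out.get("links", []):
            download(link["href"])
            link["href"] = cache_url_for(link["href"])
    # With cache: false, nothing is downloaded and links stay untouched.
    return to_cache_topic(topic), out
```

In both branches the message is republished on the cache topic, so a subscriber to cache/a/wis2/... sees every core data notification.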

@6a6d74
Collaborator

6a6d74 commented Mar 10, 2023

That's a neat solution. It means that a data consumer wanting core data doesn't have to worry about whether it's on cache or origin topics at the Global Broker. Notifications about core data will always be available via the cache/a/wis2 topics.

We just need to clearly document the (counter intuitive) situation of when a cache/a/wis2 topic message doesn't point to a URL at a Global Cache. From inspecting a notification message it will be obvious - because it must include a "don't cache me" cache: false directive.

It also means that the Global Discovery Catalogue can treat all core data the same - adding an additional actionable link pointing to subscription via the cache topic. So there's no need for the WIS2node to include a special cache me / don't cache me directive in the metadata.

BTW - I'm assuming that the Global Discovery Catalogue adds an actionable link for the associated cache/a/wis2/... topic at every Global Broker? This is the way that data consumers find out where they can subscribe - I'm expecting this to be a list of places, one or more of which is relevant for them.

@golfvert
Collaborator Author

"We just need to clearly document the (counter intuitive) situation..." does it really matter?
As a user, I receive a message, I follow the link(s) and done.
We can mention in the guide that the download links may not be limited to dwd.de or... for core and recommended data; that's all.

@golfvert
Collaborator Author

@kaiwirt, is this potential solution (don't cache but re-publish me) a good option for you as a Global Cache centre?

@6a6d74
Collaborator

6a6d74 commented Mar 10, 2023

"We just need to clearly document the (counter intuitive) situation..." does it really matter? As a user, I receive a message, I follow the link(s) and done. We can mention that in the guide, as the download links may not be limited to dwd.de or... for core and recommended that's all.

@golfvert - by documentation yes, I meant mention it in the guide - I'm sure that some Global Cache implementers might not get the point unless we're explicit with the reason.

@kaiwirt
Contributor

kaiwirt commented Mar 13, 2023

I think it is OK if caches republish messages without actually downloading and storing the data, leaving the message unmodified except for the topic.

Caches could also indicate this in the message. Having a flag like on_local_cache: true/false

@antje-s

antje-s commented Mar 13, 2023

Sounds good, I just have one addition...
From my point of view, it would be important for the automatic download of core data that access to the origin download URLs is open (without login, for all WIS2 Nodes); otherwise the list of access credentials can become very long.

@golfvert
Collaborator Author

As ECMWF will have its WIS2 Node soonish, and its data shouldn't be cached, shall we tentatively endorse:

  1. WIS2 Node can add cache: false in the Notification message. Default value being cache: true (if cache key is missing)
  2. Global Cache honours cache: false by NOT downloading the data and, nevertheless, publishes on the corresponding cache/a/wis2/...

If that is acceptable, then @tomkralidis can amend the WIS2-notification-message repo accordingly.

@kaiwirt
Contributor

kaiwirt commented Mar 21, 2023 via email

@tomkralidis
Collaborator

TT-WISMD 2023-04-12:

  • potentially distributes responsibility?
  • global services can/should decide
  • can have both
  • data producer driven (in WNM)
  • decision by global services
    • can notify data producer that data was not cached
      • traffic/load? Can send single message (report) to global service
    • data producer can also subscribe to global broker to validate their data publication
  • global service should be able to govern if a data granule gets cached or not

Recommendation:

  • can have both, global cache can override
  • global cache can realize as an implementation detail
  • global service SHALL have a WCMP2 record (properties.type="service")
    • rules can be communicated in themes/concepts?
    • constraints

@tomkralidis
Collaborator

tomkralidis commented May 15, 2023

ET-W2AT 2023-05-15:

  • recommended data will NOT be cached by GC
  • for core data, add properties.cache (true|false, default=true) to WNM as decided by data producer
  • message is republished anyway
  • no issue in making "more" core data available (over and above resolution 1)
    • ACTION: Secretariat to verify

@golfvert
Collaborator Author

golfvert commented May 20, 2023

A small complement to the summary above:

  • for core data, add properties.cache (true|false, default=true) to the notification message as decided by data producer
  • if properties.cache: false the Global Cache SHALL:
  1. Not download the data made available using this message
  2. Publish the Notification in cache/a/wis2/... (as it would if the data were cached) with the properties.links not modified (the link will still point to the data producer endpoint)

Item 2. is to make users' lives easier. They will keep subscribing to the topic cache/a/wis2/... only. The "no cache core data" being a technicality, there is no need to expose this to users.
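
From the consumer side, a short sketch of why this stays transparent: the same code extracts download URLs whether the links point at a Global Cache or at the data producer. The function names and the `links` layout are assumptions for illustration.

```python
from typing import List
from urllib.parse import urlparse


def download_urls(message: dict) -> List[str]:
    """Collect the href of every link that has one, in message order."""
    return [link["href"] for link in message.get("links", []) if "href" in link]


def serving_hosts(message: dict) -> List[str]:
    """Return the distinct hostnames the consumer will contact, sorted."""
    hosts = {urlparse(url).hostname for url in download_urls(message)}
    return sorted(h for h in hosts if h)
```

`serving_hosts` is only there for inspection or logging; the download logic itself does not need to know which host it ends up talking to.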

@tomkralidis
Collaborator

Associated PR in wmo-im/wis2-notification-message#46

@kaiwirt
Contributor

kaiwirt commented May 23, 2023

Just for my clarification: do we use the same mechanism for recommended data? That is, the GC receives a message on origin/#, does not download the data, but republishes the (unmodified) message on cache/#.

@golfvert
Collaborator Author

Not necessarily.
Access to recommended data may require authentication, signing a paper, accepting T&C,... almost whatever the data originator wants. Access to core data MUST be easy for users. Access to recommended data MAY require particular action. So, having to subscribe specifically to origin/a/.... for this kind of dataset is, I think, acceptable.

@SimonElliottEUM

Before looking for the right mechanism, I think it makes sense to agree who decides if things are cached.

Is it the data publisher? In which case putting a flag in the discovery metadata or notification message would work.

Is the cache operator? In which case "message length" could be a useful criteria. ... Aside: given that all caches would produce their own notification messages, would WIS2 work if caches made different decisions about what they were willing to cache. Example: if only Germany cached some things, consumers would only receive notifications pointing to the German cache?

The data publisher probably has a better idea of whether the data is real-time or near-real-time. Aside: maybe the update frequency (specified in the discovery metadata) is a good metric for identifying real-time or near-real-time data?

The cache owner is the one impacted by the choice in terms of data down/upload and storage cost.

A producer of core data needs to know in advance whether they will be cached or not. If core data are not cached, then the producer will have to accommodate data access from an unknown number of consumers, as opposed to one download from the Global Cache. The caching of core data at the Global Caches is a key advantage of the WIS2 architecture from the point of view of a producer of large volumes of such data.

@efucile
Member

efucile commented Nov 13, 2023

Decision

An increase in the volume of data has to be announced by the provider in advance to allow the GCs to take required measures.

===DECISIONS

  1. for core data, add properties.cache (true|false, default=true) to the notification message as decided by data producer
  2. if properties.cache: false the Global Cache SHALL:
    • Not download the data made available using this message
    • Publish the Notification in cache/a/wis2/... (as it would if the data were cached) with the properties.links not modified (the link will still point to the data producer endpoint)

@6a6d74
Collaborator

6a6d74 commented Jan 25, 2024

Guide section for Global Cache operators is updated. Need to put appropriate text in section for data publishers (section 2.6.3)

@6a6d74
Collaborator

6a6d74 commented Feb 3, 2024

Done. New section for Data Publishers includes this information: "Considerations when providing Core data in WIS2"

@6a6d74 6a6d74 closed this as completed Feb 3, 2024
Projects
Status: Done
Status: For end of Pilot phase (Dec 2023)
Status: Decision added

8 participants