[RFC] Entity Relations #1964
Comments
We (@dhenneke) like that the relationships are easily available at both sides of the implementation. That makes them quick to find from everywhere, nice idea! The processor system is perfect for that.
We are a bit concerned that having the duplicated data might lead to code that uses the source fields instead of the relationships. However, doesn't removing them from the spec make it difficult to validate the entities? So I guess you would only plan to not expose them at the REST/GraphQL APIs.
Skipping over metadata makes it a bit too easy 😉 However, at first sight, one could support a full object beside a simple string, too. This approach is quite common in k8s resources. It allows relationships to be written in the shorthand form (a simple string), falling back to the more complex one if needed. The difference could already be resolved before passing it to the relationship processor.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: artist-web
  description: The place to be, for great artists
spec:
  type: website
  lifecycle: production
  owner: Group:artist-relations
  implementsApis:
    - ref: API:another-api
      metadata:
        annotations:
          example.com/reason: We implemented it because...
        label:
          example.com/text: I have no idea for metadata for implementApi 😉, but I guess there are many!
    - API:artist-api
This would then be processed to:
relations:
  - type: backstage.io/owner
    entity:
      kind: Group
      namespace: default
      name: artist-relations
    metadata:
      annotations:
        example.com/reason: We implemented it because...
      label:
        example.com/text: I have no idea for metadata for implementApi 😉, but I guess there are many!
  - type: backstage.io/owner
    entity:
      kind: Group
      namespace: default
      name: artist-relations
  - type: backstage.io/implements
    entity:
      kind: API
      namespace: default
      name: artist-api
Another topic might be derived/indirect dependencies. These could be derived by processors to easily evaluate them. For example:
Just to give more examples for relationships:
I guess this is much more deeply thought through than what we suggested, but our goal was just to start a discussion. We will close #1951 and are happy to see the suggestion in this PR come along (and are even interested in pushing it forward by helping with the implementation!). |
Yep, my worry as well. I think source fields would have to be removed in the processor that reads them, since that'd be the source of truth of the mapping logic. We could either structure the processing in a way where validation of the source yaml happens before the relationship processing, or we could allow the fields to be marked for deletion so they can be removed later.
Yeah I know it will likely be needed and is tricky to model, but that's why I want to skip it for now. With the suggested model I think it is fairly straightforward to add metadata, it'll just take time to figure out the schema for it. Your example does fit perfectly with what I had in mind though 😁 In general it is also possible to use any kind of shape of the source fields for the relationships, and they don't necessarily have to map exactly to the structure of the relationship either.
Yep, I think plenty of relationships could be generated by processors without any specific source field. Other examples would be a relationship to replace the |
I'm less concerned about leaving the original fields in place, for what it's worth. Skipping the complexity (in code and in explaining the reasons for them disappearing) greatly outweighs potential confusion as I see it. I think confusion is equally likely to happen regarding not being supposed to write these manually. Could these be part of a larger concept of a "computed status" of the entity? Similar to the status of k8s objects. We need this for several other concepts (refresh loop state is one of them). Then the relations would just be another subkey in the status space. This clearly separates things into "here's the hand written spec for how I want things to be" and "here's the current runtime status of this entity". |
Not sure about that. Given the suggestion in this RFC it would always be an error for a plugin to read the source field, and at that point it might be best to remove them. The idea is to allow customization in how relationships are generated. For example you could flip around the
Yep, I like the idea of having a clearer separation between computed values and the entity definition. It's a bit awkward to have exceptions for some fields that can't be written by hand. That's only if it can be done without too much overhead though imo, since we ofc already have a bunch of exceptions for fields that can't be written by hand. Also I'm wondering if the k8s status isn't more of a runtime thing, as in a request to |
I see the original data as holy, reflecting exactly what I took the care to write. I want the storage part of the catalog as such to be as "dumb" as possible. Reading from the source data is fine but you have to understand the implications. It's a matter of documentation that if you want resolved/computed/real-world things, the status is where you go. I feel that this is not exclusive to these relations either. A perhaps hypothetical example - a manually added annotation may point to the github project that a component is built in. A completely different process could be ingesting bulk data from github and deducing what components are being built in each project. A third process could be... Etc. All of them may eventually lead to a resulting annotation in the state that points to the github project. One more. The source data omits an owner entirely, and a separate build hook analyzes codeowners and contributes that part of the data to all components that were part of the build. There even wouldn't exist an original source to read from at all. What I'm saying is, I like where this proposal is going, and I think it's time that we DEMOTE the value of the original data, to essentially be regarded as a template or target of sorts. We could (so to speak) read |
@freben Yeh, something in the vein of this, given the example input in the RFC?
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: artist-web
  labels:
    backstage.io/type: website
    backstage.io/lifecycle: production
  annotations:
    backstage.io/description: The place to be, for great artists
relations:
  - type: backstage.io/owner
    entity:
      kind: Group
      namespace: default
      name: artist-relations
  - type: backstage.io/implements
    entity:
      kind: API
      namespace: default
      name: artist-api
sourceDefinition:
  apiVersion: backstage.io/v1alpha1
  kind: Component
  metadata:
    name: artist-web
    description: The place to be, for great artists
  spec:
    type: website
    lifecycle: production
    owner: Group:artist-relations
    implementsApis:
      - API:artist-api
|
Maybe. I was thinking more in terms of keeping the exact structure we had, but essentially adding a new readonly |
So pending discussion around how to separate out generated bits from the source definition, should we move ahead with just adding a top-level |
If we are very clear that the location of that field may change, then I'm fine with that. |
@freben yeah that'd be the case. I'd like to validate the other parts of this suggestion without blocking on figuring that piece out. |
+1 for the more granular format to represent relationships in the catalog. At this point I don't worry too much about the shorthand form in the source yaml files, since I suspect that in the long run we'll gather most of the data from other existing configuration (like |
Just an idea after reading this through. An alternative to having to manually write a relationship in the yaml in one "human friendly format" and query for it in a different one could be an "entity yaml editor" in Backstage (and/or an IDE extension) that helps you with directly putting the trickier syntax into the yaml. An editor would help with autocompleting from existing entities and spit out the correct format for any type of relationship, then create a PR to update the yaml in the repo. Obviously this doesn't help if the specific piece of data is dynamic, but I guess no dynamic data can be in the source yaml anyways. But a helper tool could let us avoid having two different formats for basically the same set of data, and the debate of having to remove the write-friendly one upon retrieval would be a non-issue. |
Closing since this is now implemented, and any additional discussions around relations should happen in separate issues/RFCs. |
Entity Relations
This RFC suggests a possible implementation of modeling relations between entities in the software catalog.
Scope
The scope of this RFC is limited to relations between entities in the software catalog. It does not include modelling of relations to things outside of the catalog, for example DNS names or git repo locations. It also aims to encompass all such relations, such that there are no relations between entities that are not modeled using this RFC.
Context
There have been many discussions surrounding the software catalog where it has become clear that having a shared concept of entity relations may provide a significant benefit to the catalog model.
For example this discussion around APIs, this issue around ownership, this RFC for organizational data, or this comment on limiting relations to certain kinds.
Most importantly, by having a shared concept of relations as part of the catalog model, we should significantly improve the power and capabilities of our catalog GraphQL API.
We will also be discussing processing in the catalog as a part of this RFC. This is a feature of the Software Catalog where you can register various processors to process entities as they are ingested. This can be used to modify entities in various ways as they are being added to the catalog.
Use-cases
This RFC is based around the following 4 use-cases of relation modeling:
Ownership
With organizational data modeled in the software catalog, ownership can be expressed as links to entities of the kind User and Group. There may also be different types of ownership, such as "maintainer", "authors", "operator", "tech owner", "product owner" etc.
Current implementation:
API Implementation
This is the relation between API entities and the systems and components that implement those APIs. The link is a simple A implements B, but it may also be bidirectional, with B is implemented by A.
Current implementation:
Organizational Data
This covers the relations mentioned in #1401, and partially implemented in #1838. It includes parent/ancestor + child/descendants relations between groups, as well as members/memberOf relations between users and groups.
Current implementation:
Dependencies
This is the relation between a component and other entities the component depends on to fulfil its function. Dependencies could be of different kinds, such as API, Component, or Resource. And why not User? ¯\_(ツ)_/¯
Dependencies may also have additional metadata, such as whether it's a statically declared dependency, or one discovered at runtime. In the case of runtime dependencies, they may in turn have additional information such as when it was last updated, and the source of the information.
Current implementation:
none
Schema Design Considerations
In this section we discuss a couple of different points to consider as we decide on the schema.
Referencing Other Entities
As a part of defining relations, we will be referencing entities of various kinds. It will most likely be convenient if we have a way to express the Kind/namespace/name triplet of entities with a single string. We won't try to solve that in this RFC however, as it has its own RFC. In this RFC we will assume a syntax similar to <kind>:<name> or <kind>:<namespace>:<name>.
Processed vs Static
A controversial point has been whether the relation model should be tailored for being written by humans or not.
If we aim to make it easy to be hand-written, we would want to avoid things like ancestors and descendants of Groups, and possibly only keep one of parent or children. In general relationships are bidirectional in the A --X-> B, B --X'-> A sense, and we would likely always want to select one of the directions in which each relation can be declared and use that in the model. It would then be up to systems outside the catalog to resolve these links to be able to traverse relations in both directions, for example in the GraphQL layer.
On the other hand, we could make relation resolution part of the catalog itself, using catalog processors to fill in the gaps in the relations graph. For example, an ApiGraphProcessor could populate all API and Component entities with implementsApis and implementedBy entries where they are missing. This would allow the entity definition files to only declare half of the relation, but consumers of the catalog will be guaranteed that both fields are filled in. This also leaves it up to organizations or even individual teams to decide how and where they want to declare their relations.
Processed Canonical Schema
In addition to filling in the gaps, catalog processing could be used to allow all relations to be represented using a more verbose canonical schema.
For example a relation like
Could be processed into a much more verbose, but more complete description like
An approach like this could improve the scalability and customizability of the model, at the cost of introducing more complexity. It would enable organizations to define their own short-hands for relations, while allowing for a well-defined standardized format for consuming the relations.
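To illustrate the kind of processing described here, the following is a minimal sketch of expanding short-hand source fields into a verbose form. It is not the catalog's actual processor API: the function names, the `ComponentSpec` shape, the relation type strings, and the choice of `default` as the fallback namespace are all assumptions made for the example, and the reference syntax is the `<kind>:<name>` / `<kind>:<namespace>:<name>` form assumed earlier in this RFC.

```typescript
// Fully qualified entity identifier used in the verbose relation format.
interface EntityName {
  kind: string;
  namespace: string;
  name: string;
}

interface Relation {
  type: string;
  entity: EntityName;
}

// Assumption: references that omit the namespace resolve to "default".
const DEFAULT_NAMESPACE = 'default';

// Parses "<kind>:<name>" or "<kind>:<namespace>:<name>" into a full triplet.
function parseEntityRef(ref: string): EntityName {
  const parts = ref.split(':');
  if (parts.some(p => p.length === 0)) {
    throw new Error(`Invalid entity reference: "${ref}"`);
  }
  if (parts.length === 2) {
    return { kind: parts[0], namespace: DEFAULT_NAMESPACE, name: parts[1] };
  }
  if (parts.length === 3) {
    return { kind: parts[0], namespace: parts[1], name: parts[2] };
  }
  throw new Error(`Invalid entity reference: "${ref}"`);
}

// Hypothetical shape of the hand-written short-hand fields.
interface ComponentSpec {
  owner?: string;
  implementsApis?: string[];
}

// Expands the short-hand fields into verbose relation objects.
function relationsFromSpec(spec: ComponentSpec): Relation[] {
  const relations: Relation[] = [];
  if (spec.owner !== undefined) {
    relations.push({
      type: 'backstage.io/owner',
      entity: parseEntityRef(spec.owner),
    });
  }
  (spec.implementsApis || []).forEach(ref => {
    relations.push({
      type: 'backstage.io/implements',
      entity: parseEntityRef(ref),
    });
  });
  return relations;
}

// Yields one owner relation to Group default/artist-relations and one
// implements relation to API default/artist-api.
const exampleRelations = relationsFromSpec({
  owner: 'Group:artist-relations',
  implementsApis: ['API:artist-api'],
});
```

An organization wanting a different short-hand would, under this sketch, only need to swap out `relationsFromSpec`, while consumers keep reading the same verbose shape.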
This processing can also be part of validating the relations, for example, it could reject an attempt to set the owner to an entity of the Component kind.
Hand-written Lean Canonical Schema
An alternative to defining a verbose canonical schema and using catalog processors to populate it is to use a much leaner schema that is simpler to write. For example, we could express the previous owner example like this:
This re-uses the pattern of labels and annotations, and would likely also be placed alongside them in the entity metadata. As it is hand-written it would completely replace the existing fields such as spec.owner.
Relations as Entities
A different way to model relations could be to let them be complete standalone entities. For example:
A benefit of this approach is that the relations themselves can use all of the existing modeling tools of the catalog, such as labels, annotations, and extensions.
This could also be combined with catalog processors to generate relation entities based on fields in other entities, similar to the processed canonical schema approach.
Relation Attributes
Some of the use-cases we're exploring as part of this RFC could benefit from allowing additional metadata to be defined as part of relations. For example, component dependencies can benefit from having additional metadata attached that describes whether a dependency is a statically defined one or one discovered at runtime.
Given the runtime/static dependency distinction, it is much nicer to model it as two variants of a single dependency relation type, distinguished by metadata, as opposed to two different relation types. This becomes clearer when looking at it from a catalog consumer's point of view, and querying for component dependencies. With the first option you'll just query for all relations of the type "dependency", while with separate types you need to query for all "static-dependency" and "runtime-dependency" relations and join them together. A workaround would be to duplicate the relations, declaring for example both "runtime-dependency" and "dependency" relations between the same two components, but that doesn't seem like a direction we'd want to go.
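The consumer's side of this trade-off can be sketched roughly as follows. The `attributes` field and its `discovery` key are hypothetical, since this RFC does not define a schema for relation metadata; only the relation type names are taken from the paragraph above.

```typescript
interface EntityName {
  kind: string;
  namespace: string;
  name: string;
}

// Option 1: a single relation type, with the static/runtime distinction
// carried as an attribute. The `attributes` field is hypothetical.
interface AttributedRelation {
  type: string;
  entity: EntityName;
  attributes?: { discovery: 'static' | 'runtime' };
}

// Option 2: the distinction is baked into the relation type itself.
interface PlainRelation {
  type: string;
  entity: EntityName;
}

// With attributes, "all dependencies" is a single filter on the type.
function dependenciesWithAttributes(
  relations: AttributedRelation[],
): EntityName[] {
  return relations.filter(r => r.type === 'dependency').map(r => r.entity);
}

// With separate types, the consumer has to know every variant and join them.
function dependenciesWithSeparateTypes(
  relations: PlainRelation[],
): EntityName[] {
  const staticDeps = relations.filter(r => r.type === 'static-dependency');
  const runtimeDeps = relations.filter(r => r.type === 'runtime-dependency');
  return staticDeps.concat(runtimeDeps).map(r => r.entity);
}
```

In the second variant, every new flavour of dependency means another filter in every consumer, which is the cost the paragraph above is pointing at.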
It can of course be argued that runtime vs static dependency is not an important distinction to make, at which point I must remind you that it's just used as an example and there are likely many more 😁
Suggested Schema
This is the point where this RFC leads to an actual suggested implementation.
Entity relations are defined in a new top-level relations field. Each relation is an object with the fields type and entity. The type is from a set of well-defined values, similar to metadata annotation and label keys. The entity field is an object with the kind:namespace/name entity identifier triplet.
Entity relations cannot be defined by hand; they have to be added through processing in the catalog using other fields. For example, the default catalog implementation ships with a processor that translates the spec.owner field into a backstage.io/owner relation. The entity triplet in the relation description will always be fully populated, whereas the source field may use relative entity references, for example omitting the namespace.
The following is an example of what the documentation for the implements relation could look like as part of the software catalog documentation:
Example
As mentioned the suggested schema would not be produced by hand, so given an entity defined like this:
It would be read back from the catalog like this:
With this being the equivalent yaml for the relations part:
Note that the spec.owner and spec.implementsApis fields are still part of the spec in the response for completeness, although we could possibly remove them, to ensure that they aren't read by plugins.
Reasoning
I'm suggesting a solution that leans heavily into processing, because I think we can't get away with defining relations in the catalog without any processing involved. The reason for that being that any relationship is bi-directional, and maintaining all those edges by hand would be very tedious work. As part of that I'm of course also arguing that we should model relations as part of the catalog, and not leave it to the GraphQL layer to resolve. That's because the catalog is already a great place to do processing and has a built-in caching layer, and duplicating that in the GraphQL service seems like a waste.
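As a rough illustration of why processing takes the tedium out of maintaining both directions, here is a sketch of a step that adds missing reverse edges. It is not the catalog's processor interface: `CatalogEntity`, `fillInverseRelations`, the `INVERSE_TYPES` table and the `backstage.io/implementedBy` type name are all assumptions made for the example.

```typescript
interface EntityName {
  kind: string;
  namespace: string;
  name: string;
}

interface Relation {
  type: string;
  entity: EntityName;
}

// Simplified stand-in for a catalog entity: just its identity and relations.
interface CatalogEntity {
  ref: EntityName;
  relations: Relation[];
}

// Hypothetical mapping from a relation type to its inverse.
const INVERSE_TYPES: { [type: string]: string } = {
  'backstage.io/implements': 'backstage.io/implementedBy',
  'backstage.io/implementedBy': 'backstage.io/implements',
};

// Builds a lookup key in the kind:namespace/name form.
function refKey(n: EntityName): string {
  return n.kind + ':' + n.namespace + '/' + n.name;
}

function hasRelation(
  entity: CatalogEntity,
  type: string,
  target: EntityName,
): boolean {
  return entity.relations.some(
    r => r.type === type && refKey(r.entity) === refKey(target),
  );
}

// Adds the missing reverse edge for every relation whose type has a known
// inverse and whose target is present in the given set of entities.
function fillInverseRelations(entities: CatalogEntity[]): void {
  const byKey: { [key: string]: CatalogEntity } = {};
  entities.forEach(e => {
    byKey[refKey(e.ref)] = e;
  });

  entities.forEach(source => {
    // Iterate over a copy, since relation lists are mutated along the way.
    source.relations.slice().forEach(rel => {
      const inverseType = Object.prototype.hasOwnProperty.call(
        INVERSE_TYPES,
        rel.type,
      )
        ? INVERSE_TYPES[rel.type]
        : undefined;
      const target = byKey[refKey(rel.entity)];
      if (inverseType === undefined || target === undefined) {
        return;
      }
      if (!hasRelation(target, inverseType, source.ref)) {
        target.relations.push({ type: inverseType, entity: source.ref });
      }
    });
  });
}
```

Because existing edges are checked before anything is added, running the step repeatedly leaves the graph unchanged, which matters if processing is re-run on every refresh.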
The reason for going with such a verbose descriptor format is forwards compatibility and ease of use. We could've for example modeled the relations as <relation-type>: [<entity-ref>], i.e:
But using string values would've severely limited our options for extending the model in the future. Note that I am not suggesting that we add any form of additional metadata tied to relations at this point, even though I argued that it will likely be needed earlier 😉
Moving the type to be the key is something I don't have a strong opinion on however, and is more based on the preference of modeling JSON data as arrays of objects. Although it also doesn't save many characters or bytes to move the type to be a key.
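For comparison, the two shapes can be converted into each other as long as a relation carries nothing beyond its type and target. A small sketch, where `groupByType` and `formatEntityName` are hypothetical helper names and the string form follows the kind:namespace/name notation:

```typescript
interface EntityName {
  kind: string;
  namespace: string;
  name: string;
}

interface Relation {
  type: string;
  entity: EntityName;
}

// Renders an entity identifier in the kind:namespace/name form.
function formatEntityName(n: EntityName): string {
  return `${n.kind}:${n.namespace}/${n.name}`;
}

// Converts the array-of-objects shape into the keyed-by-type shape,
// i.e. <relation-type>: [<entity-ref>].
function groupByType(relations: Relation[]): { [type: string]: string[] } {
  const grouped: { [type: string]: string[] } = {};
  relations.forEach(rel => {
    if (!Object.prototype.hasOwnProperty.call(grouped, rel.type)) {
      grouped[rel.type] = [];
    }
    grouped[rel.type].push(formatEntityName(rel.entity));
  });
  return grouped;
}
```

Any extra field on a relation object would have nowhere to go in the keyed shape, since each target is reduced to a string there.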
The suggested solution also doesn't model relations as standalone entities. This is primarily due to the overhead of this approach, especially since we'd likely end up having to name each relation individually. It may make sense to model some larger and more complex relations as separate entities though, and that could be combined with the suggestion in this RFC. It does feel very overkill for relations like owner and dependency though.
I'm also a bit worried that modelling relations as entities may cause confusion as they extend the existing entity model. What is, for example, the role of labels in relation entities, would you query for a set of relations using a label query? ^_> It's also not as clear how to map that model to GraphQL imo.
Other Approaches
This PR popped up while working on this, and takes the approach of modeling relations as standalone entities. Reasoning for not suggesting that approach is just above ^
Another approach is the lean canonical schema, which would be intended to be written by hand, and replace the existing fields. With that, an entity could look something like this:
I'm not suggesting that approach because I don't think that format is actually user friendly enough to be written by hand. That in addition to the concerns around forwards compatibility mentioned above, and that we'd need processing to maintain consistency in the relations anyway.
Baked into the above alternate approach is to put the relations field inside metadata. I'm not suggesting that approach, because I think it fits more nicely with GraphQL to keep it as a top-level field. In general relations also seem like a separate enough thing that they deserve a top-level field.
GraphQL Example
As a bit of extra food for thought, here's a mock of what a GraphQL query could look like: