[RFC] Entity Relations #1964

Closed
Rugvip opened this issue Aug 14, 2020 · 13 comments
Labels: area:catalog, enhancement, help wanted, question, rfc

Comments

@Rugvip
Member

Rugvip commented Aug 14, 2020

Entity Relations

This RFC suggests a possible implementation of modeling relations between entities in the software catalog.

Scope

The scope of this RFC is limited to relations between entities in the software catalog. It does not include modelling of relations to things outside of the catalog, for example DNS names or git repo locations. It also aims to encompass all such relations, such that there are no relations between entities that are not modeled using this RFC.

Context

There have been many discussions surrounding the software catalog where it has become clear that having a shared concept of entity relations may provide a significant benefit to the catalog model.

For example this discussion around APIs, this issue around ownership, this RFC for organizational data, or this comment on limiting relations to certain kinds.

Most importantly, by having a shared concept of relations as part of the catalog model, we should significantly improve the power and capabilities of our catalog GraphQL API.

We will also be discussing processing in the catalog as a part of this RFC. This is a feature of the Software Catalog where you can register various processors to process entities as they are ingested. This can be used to modify entities in various ways as they are being added to the catalog.

Use-cases

This RFC is based around the following 4 use-cases of relation modeling:

Ownership

With organizational data modeled in the software catalog, ownership can be expressed as links to entities of the kind User and Group. There may also be different types of ownership, such as "maintainer", "authors", "operator", "tech owner", "product owner" etc.

Current implementation:

kind: Component
metadata:
  owner: patriko

API Implementation

This is the relation between API entities and the systems and components that implement those APIs. The link is a simple A implements B, but it may also be bidirectional, with B is implemented by A.

Current implementation:

kind: Component
spec:
  implementsApis:
    - petstore
    - streetlights
    - hello-world

Organizational Data

This covers the relations mentioned in #1401, and partially implemented in #1838. It includes parent/ancestor and child/descendant relations between groups, as well as members/memberOf relations between users and groups.

Current implementation:

kind: Group
spec:
  parent: group-a
  ancestors:
    - group-a
    - global-synergies
    - acme-corp
  children:
    - child-a
    - child-b
  descendants:
    - child-a
    - child-b
    - desc-a
    - desc-b

Dependencies

This is the relation between a component and other entities the component depends on to fulfil its function. Dependencies could be of different kinds, such as API, Component, or Resource. And why not User? ¯\_(ツ)_/¯

Dependencies may also have additional metadata, such as whether it's a statically declared dependency, or one discovered at runtime. In the case of runtime dependencies, they may in turn have additional information such as when it was last updated, and the source of the information.

Current implementation:

none

Schema Design Considerations

In this section we discuss a couple of different points to consider as we decide on the schema.

Referencing Other Entities

As a part of defining relations, we will be referencing entities of various kinds. It will most likely be convenient if we have a way to express the Kind/namespace/name triplet of entities with a single string. We won't try to solve that in this RFC however, as it has its own RFC. In this RFC we will assume a syntax similar to <kind>:<name> or <kind>:<namespace>:<name>.
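
To make the assumed syntax concrete, here is a minimal sketch of how such a reference string could be split into a full triplet. The `parseEntityRef` helper, the `EntityName` type, and the `default` fallback namespace are assumptions made for illustration and are not an existing catalog API; the actual reference format is left to its own RFC.

```typescript
// Hypothetical type for the kind/namespace/name triplet.
interface EntityName {
  kind: string;
  namespace: string;
  name: string;
}

// Parses "<kind>:<name>" or "<kind>:<namespace>:<name>" into a full triplet.
// Falling back to a "default" namespace is an assumption of this sketch.
function parseEntityRef(ref: string, defaultNamespace = 'default'): EntityName {
  const [kind, second, third, ...rest] = ref.split(':');
  if (!kind || !second || third === '' || rest.length > 0) {
    throw new Error(`Invalid entity reference: "${ref}"`);
  }
  if (third === undefined) {
    // Two-part form: the namespace was omitted.
    return { kind, namespace: defaultNamespace, name: second };
  }
  // Three-part form: the middle segment is the namespace.
  return { kind, namespace: second, name: third };
}

const shortRef = parseEntityRef('User:patriko');
const fullRef = parseEntityRef('Group:platform:artist-relations');
```

Here `shortRef` resolves to the `default` namespace, while `fullRef` keeps the explicitly given `platform` namespace.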

Processed vs Static

A controversial point has been whether the relation model should be tailored for being written by humans or not.

If we aim to make it easy to be hand-written, we would want to avoid things like ancestors and descendants of Groups, and possibly only keep one of parent or children. In general relationships are bidirectional in the A --X-> B, B --X'-> A sense, and we would likely always want to select one of the directions that each relation can be declared and use that in the model. It would then be up to systems outside the catalog to resolve these links to be able to traverse relations in both directions, for example in the GraphQL layer.

On the other hand, we could make relation resolution part of the catalog itself, using catalog processors to fill in the gaps in the relations graph. For example, an ApiGraphProcessor could populate all API and Component entities with implementsApis and implementedBy entries where they are missing. This would allow the entity definition files to only declare half of the relation, but consumers of the catalog will be guaranteed that both fields are filled in. This also leaves it up to organizations or even individual teams to decide how and where they want to declare their relations.
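
As a rough sketch of the gap-filling idea, the function below takes a set of entities where only one side of each implements relation is declared and returns copies with both sides populated. The `SketchEntity` shape and the `fillApiRelations` name are hypothetical and heavily simplified; this is not the actual processor interface of the catalog.

```typescript
// Hypothetical, simplified entity shape used only for this sketch.
interface SketchEntity {
  kind: string;
  name: string;
  implementsApis?: string[]; // declared on Component entities
  implementedBy?: string[]; // declared on API entities
}

// Returns `list` with `value` appended, unless it is already present.
function addUnique(list: string[], value: string): string[] {
  return list.indexOf(value) === -1 ? [...list, value] : list;
}

// Returns copies of the entities in which both halves of every
// implementsApis/implementedBy pair are filled in.
function fillApiRelations(entities: SketchEntity[]): SketchEntity[] {
  const result = entities.map(e => ({
    ...e,
    implementsApis: [...(e.implementsApis ?? [])],
    implementedBy: [...(e.implementedBy ?? [])],
  }));
  const find = (kind: string, name: string) =>
    result.find(e => e.kind === kind && e.name === name);

  for (const entity of result) {
    if (entity.kind === 'Component') {
      for (const apiName of entity.implementsApis) {
        const api = find('API', apiName);
        if (api) api.implementedBy = addUnique(api.implementedBy, entity.name);
      }
    }
    if (entity.kind === 'API') {
      for (const componentName of entity.implementedBy) {
        const component = find('Component', componentName);
        if (component) {
          component.implementsApis = addUnique(component.implementsApis, entity.name);
        }
      }
    }
  }
  return result;
}

const filled = fillApiRelations([
  { kind: 'Component', name: 'artist-web', implementsApis: ['artist-api'] },
  { kind: 'API', name: 'artist-api' },
  { kind: 'API', name: 'petstore', implementedBy: ['pet-service'] },
  { kind: 'Component', name: 'pet-service' },
]);
```

In this example one relation is declared on the Component side and the other on the API side, and after processing both directions are present on all four entities. References to entities that are not in the input are left untouched.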

Processed Canonical Schema

In addition to filling in the gaps, catalog processing could be used to allow all relations to be represented using a more verbose canonical schema.

For example a relation like

owner: User:patriko

Could be processed into a much more verbose, but more complete description like

relations:
  - type: backstage.io/owner
    kind: User
    namespace: default
    name: patriko

An approach like this could improve the scalability and customizability of the model, at the cost of introducing more complexity. It would enable organizations to define their own short-hands for relations, while allowing for a well-defined standardized format for consuming the relations.

This processing can also be part of validating the relations, for example, it could reject an attempt to set the owner to an entity of the Component kind.
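
A minimal sketch of that combined expansion and validation step could look as follows. The `expandOwner` helper and the `OWNER_KINDS` list are assumptions for illustration; the list is taken from the ownership use-case above (User and Group), and only the two-part `<kind>:<name>` shorthand is handled.

```typescript
// Flat relation shape matching the verbose example above.
interface Relation {
  type: string;
  kind: string;
  namespace: string;
  name: string;
}

// Kinds accepted as owners in this sketch (assumed from the ownership use-case).
const OWNER_KINDS = ['User', 'Group'];

// Expands an "owner: <kind>:<name>" shorthand into the verbose form,
// rejecting references to kinds that are not allowed to own things.
function expandOwner(owner: string): Relation {
  const [kind, name, ...rest] = owner.split(':');
  if (!kind || !name || rest.length > 0) {
    throw new Error(`Invalid owner reference: "${owner}"`);
  }
  if (OWNER_KINDS.indexOf(kind) === -1) {
    throw new Error(`Owner must be one of ${OWNER_KINDS.join(', ')}, got ${kind}`);
  }
  return { type: 'backstage.io/owner', kind, namespace: 'default', name };
}

const ownerRelation = expandOwner('User:patriko');
```

Calling `expandOwner('Component:petstore')` throws, which is the kind of rejection described above.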

Hand-written Lean Canonical Schema

An alternative to defining a verbose canonical schema and using catalog processors to populate it, is to use a much more lean and simpler to write schema. For example, we could express the previous owner example like this:

relations:
  backstage.io/owner: User:patriko

This re-uses the pattern of labels and annotations, and would likely also be placed alongside them in the entity metadata. As it is hand-written it would completely replace the existing fields such as spec.owner.

Relations as Entities

A different way to model relations could be to let them be complete standalone entities. For example:

kind: OwnerRelation
metadata:
  name: petstore-owner
spec:
  target: Component:petstore
  owner: User:patriko

A benefit of this approach is that the relations themselves can use all of the existing modeling tools of the catalog, such as labels, annotations, and extensions.

This could also be combined with catalog processors to generate relation entities based on fields in other entities, similar to the processed canonical schema approach.

Relation Attributes

Some of the use-cases we're exploring as part of this RFC could benefit from allowing additional metadata to be defined as part of relations. For example, component dependencies can benefit from having additional metadata attached that describes whether a dependency is a statically defined one or one discovered at runtime.

Given the runtime/static dependency distinction, it is much nicer to model it as a single dependency relation type with an attribute that tells the two kinds of dependency apart, as opposed to two different types of relations. This becomes clearer when looking at it from a catalog consumer's point of view, and querying for component dependencies. With the first option you'll just query for all relations of the type "dependency", while with separate types you need to query for all "static-dependency" and "runtime-dependency" relations and join them together. A workaround would be to duplicate the relations, declaring for example both "runtime-dependency" and "dependency" relations between the same two components, but that doesn't seem like a direction we'd want to go.
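
To illustrate the consumer's point of view, here is a small sketch where the static/runtime distinction is an attribute on a single "dependency" type. The `attributes.discovery` field is purely hypothetical, since this RFC does not define a schema for relation attributes, and the entity names are made up.

```typescript
// Hypothetical relation shape with an optional attribute bag.
interface AttributedRelation {
  type: string;
  target: string;
  attributes?: { discovery?: 'static' | 'runtime' };
}

const exampleRelations: AttributedRelation[] = [
  { type: 'dependency', target: 'Component:artist-lookup', attributes: { discovery: 'static' } },
  { type: 'dependency', target: 'Resource:artists-db', attributes: { discovery: 'runtime' } },
  { type: 'owner', target: 'Group:artist-relations' },
];

// One query returns every dependency, regardless of how it was discovered.
const allDependencies = exampleRelations.filter(r => r.type === 'dependency');

// Narrowing by attribute is an optional extra step, not a join of two types.
const runtimeDependencies = allDependencies.filter(
  r => r.attributes?.discovery === 'runtime',
);
```

With separate "static-dependency" and "runtime-dependency" types, the first query would instead have to match both types and merge the results.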

It can of course be argued that runtime vs static dependency is not an important distinction to make, at which point I must remind you that it's just used as an example and there are likely many more 😁

Suggested Schema

This is the point where this RFC leads to an actual suggested implementation.

Entity relations are defined in a new top-level relations field. Each relation is an object with the fields type and entity. The type is from a set of well-defined values, similar to metadata annotation and label keys. The entity field is an object with the kind:namespace/name entity identifier triplet.

Entity relations cannot be defined by hand; they have to be added through processing in the catalog using other fields. For example, the default catalog implementation ships with a processor that translates the spec.owner field into a backstage.io/owner relation. The entity triplet in the relation description will always be fully populated, whereas the source field may use relative entity references, for example by omitting the namespace.

The following is an example of what the documentation for the implements relation could look like as part of the software catalog documentation:

## Kind: Component

...

### `spec.implementsApis` [optional]

Links APIs that are implemented by the component, e.g. artist-api. The value should be a list of catalog entity references, where the kind is implied to be API. This field is optional.

This field produces relations of the type `backstage.io/implements` and `backstage.io/implemented-by`.

...

## Relations

An index of all types of relations defined by the core Backstage Software Catalog.

...

### `backstage.io/implements`

A link between a source entity that implements the API described by the target.

Allowed Kinds:

- Source: `System`, `Component`
- Target: `API`

### `backstage.io/implemented-by`

A link between a source entity that describes an API that is implemented by the target.

Allowed Kinds:

- Source: `API`
- Target: `System`, `Component`

Example

As mentioned the suggested schema would not be produced by hand, so given an entity defined like this:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: artist-web
  description: The place to be, for great artists
spec:
  type: website
  lifecycle: production
  owner: Group:artist-relations
  implementsApis:
    - API:artist-api

It would be read back from the catalog like this:

{
  "apiVersion": "backstage.io/v1alpha1",
  "kind": "Component",
  "metadata": {
    "name": "artist-web",
    "description": "The place to be, for great artists"
  },
  "spec": {
    "type": "website",
    "lifecycle": "production",
    "owner": "Group:artist-relations",
    "implementsApis": ["API:artist-api"]
  },
  "relations": [
    {
      "type": "backstage.io/owner",
      "entity": {
        "kind": "Group",
        "namespace": "default",
        "name": "artist-relations"
      }
    },
    {
      "type": "backstage.io/implements",
      "entity": {
        "kind": "API",
        "namespace": "default",
        "name": "artist-api"
      }
    }
  ]
}

With this being the equivalent yaml for the relations part:

relations:
  - type: backstage.io/owner
    entity:
      kind: Group
      namespace: default
      name: artist-relations
  - type: backstage.io/implements
    entity:
      kind: API
      namespace: default
      name: artist-api

Note that the spec.owner and spec.implementsApis fields are still part of the spec in the response for completeness, although we could possibly remove them, to ensure that they aren't read by plugins.
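
As a sketch of the processing step that would produce the relations above from the source fields, consider the following. The `buildRelations` function and its types are hypothetical and not an existing processor; it only accepts the two-part `<kind>:<name>` form used in the example and assumes the `default` namespace when none is given.

```typescript
interface EntityName {
  kind: string;
  namespace: string;
  name: string;
}

interface EntityRelation {
  type: string;
  entity: EntityName;
}

// Only the source fields relevant to this sketch.
interface ComponentSpec {
  owner?: string;
  implementsApis?: string[];
}

// Resolves a relative "<kind>:<name>" reference against a namespace.
function toEntityName(ref: string, namespace: string): EntityName {
  const [kind, name, ...rest] = ref.split(':');
  if (!kind || !name || rest.length > 0) {
    throw new Error(`Unsupported entity reference: "${ref}"`);
  }
  return { kind, namespace, name };
}

// Translates the source fields into fully populated relations.
function buildRelations(spec: ComponentSpec, namespace = 'default'): EntityRelation[] {
  const relations: EntityRelation[] = [];
  if (spec.owner) {
    relations.push({
      type: 'backstage.io/owner',
      entity: toEntityName(spec.owner, namespace),
    });
  }
  for (const api of spec.implementsApis ?? []) {
    relations.push({
      type: 'backstage.io/implements',
      entity: toEntityName(api, namespace),
    });
  }
  return relations;
}

const artistWebRelations = buildRelations({
  owner: 'Group:artist-relations',
  implementsApis: ['API:artist-api'],
});
```

For the artist-web example this yields one backstage.io/owner relation and one backstage.io/implements relation, each with a fully populated triplet.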

Reasoning

I'm suggesting a solution that leans heavily into processing, because I think we can't get away with defining relations in the catalog without any processing involved. The reason for that being that any relationship is bi-directional, and maintaining all those edges by hand would be very tedious work. As part of that I'm of course also arguing that we should model relations as part of the catalog, and not leave it to the GraphQL layer to resolve. That's because the catalog is already a great place to do processing and has a built-in caching layer, and duplicating that in the GraphQL service seems like a waste.

The reason for going with such a verbose descriptor format is forwards compatibility and ease of use. We could've for example modeled the relations as <relation-type>: [<entity-ref>], i.e:

relations:
  backstage.io/owner:
    - Group:default:artist-relations
  backstage.io/implements:
    - API:default:artist-api

But using string values would've severely limited our options for extending the model in the future. Note that I am not suggesting that we add any form of additional metadata tied to relations at this point, even though I argued earlier that it will likely be needed 😉

Moving the type to be the key is something I don't have a strong opinion on however, and is more based on the preference of modeling JSON data as arrays of objects. Although it also doesn't save many characters or bytes to move the type to be a key.
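
For comparison, here is a small sketch that converts the keyed string form into the suggested array-of-objects form. It is only meant to show that the two carry the same information today; the `fromKeyedRelations` name is hypothetical, and the function assumes fully qualified `<kind>:<namespace>:<name>` references as in the keyed example.

```typescript
interface EntityRelation {
  type: string;
  entity: { kind: string; namespace: string; name: string };
}

// Converts { "<relation-type>": ["<kind>:<namespace>:<name>", ...] }
// into a list of relation objects.
function fromKeyedRelations(keyed: Record<string, string[]>): EntityRelation[] {
  const relations: EntityRelation[] = [];
  for (const type of Object.keys(keyed)) {
    for (const ref of keyed[type] ?? []) {
      const [kind, namespace, name, ...rest] = ref.split(':');
      if (!kind || !namespace || !name || rest.length > 0) {
        throw new Error(`Expected <kind>:<namespace>:<name>, got "${ref}"`);
      }
      relations.push({ type, entity: { kind, namespace, name } });
    }
  }
  return relations;
}

const converted = fromKeyedRelations({
  'backstage.io/owner': ['Group:default:artist-relations'],
  'backstage.io/implements': ['API:default:artist-api'],
});
```

The object form leaves room to add sibling fields next to `entity` later, which the plain string values do not.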

The suggested solution also doesn't model relations as standalone entities. This is primarily due to the overhead of this approach, especially since we'd likely end up having to name each relation individually. It may make sense for some larger and more complex relations as separate entities though, and could be combined with with the suggestion in this RFC. It does feel very overkill for relations like owner and dependency though.

I'm also a bit worried that modelling relations as entities may cause confusion as they extend the existing entity model. What are for example the role of labels in relation entities, would you query for a set of relations using a label query? ^_> It's also not as clear how to map that model to GraphQL imo.

Other Approaches

This PR popped up while working on this, and takes the approach of modeling relations as standalone entities. Reasoning for not suggesting that approach is just above ^

Another approach is the lean canonical schema, which would be intended to be written by hand, and replace the existing fields. With that, an entity could look something like this:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: artist-web
  description: The place to be, for great artists
  relations:
    owner:
      - Group:artist-relations
    implements:
      - API:artist-api
spec:
  type: website
  lifecycle: production

I'm not suggesting that approach because I don't think that format is actually user friendly enough to be written by hand. That in addition to the concerns around forwards compatibility mentioned above, and that we'd need processing to maintain consistency in the relations anyway.

Baked into the above alternate approach is to put the relations field inside metadata. I'm not suggesting that approach, because I think it fits more nicely with GraphQL to keep it as a top-level field. In general relations also seem like a separate enough thing that they deserve a top-level field.

GraphQL Example

As a bit of extra food for thought, here's a mock of what a GraphQL query could look like:

{
  entity(name: "backstage") {
    owners: relations(query: { type: "backstage.io/owner" }) {
      entity {
        kind
        metadata {
          name
          namespace
        }
        spec {
          email
        }
      }
    }

    dependencies: relations(query: { type: "backstage.io/dependency" }) {
      entity {
        kind
        metadata {
          name
          namespace
        }
      }
    }
  }
}
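
On the server side, the `relations(query: ...)` field in this mock could be backed by something as simple as a filter over the top-level relations field. The sketch below is plain TypeScript without any GraphQL library, and the `resolveRelations` name and entity shape are assumptions for illustration only.

```typescript
interface EntityRelation {
  type: string;
  entity: { kind: string; namespace: string; name: string };
}

interface CatalogEntity {
  kind: string;
  metadata: { name: string };
  relations?: EntityRelation[];
}

// Returns the relations of an entity, optionally narrowed to one relation type.
function resolveRelations(
  entity: CatalogEntity,
  query: { type?: string },
): EntityRelation[] {
  const all = entity.relations ?? [];
  return query.type === undefined ? all : all.filter(r => r.type === query.type);
}

const backstageEntity: CatalogEntity = {
  kind: 'Component',
  metadata: { name: 'backstage' },
  relations: [
    { type: 'backstage.io/owner', entity: { kind: 'Group', namespace: 'default', name: 'artist-relations' } },
    { type: 'backstage.io/dependency', entity: { kind: 'API', namespace: 'default', name: 'artist-api' } },
  ],
};

// Mirrors the "owners" and "dependencies" aliases in the mock query.
const owners = resolveRelations(backstageEntity, { type: 'backstage.io/owner' });
const dependencies = resolveRelations(backstageEntity, { type: 'backstage.io/dependency' });
```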

Rugvip added the enhancement, help wanted, question, rfc, area:catalog, and gql labels on Aug 14, 2020

@Fox32
Contributor

Fox32 commented Aug 14, 2020

> especially since we'd likely end up having to name each relationship individually

That is also something we run into in our approach... Both naming the entity instances and the entity types 😉

We (@dhenneke) like that the relationships are easily available at both sides of the implementation. That makes it possible to find them quickly from everywhere, nice idea! The processor system is perfect for that.

> Note that the spec.owner and spec.implementsApis fields are still part of the spec in the response for completeness, although we could possibly remove them, to ensure that they aren't read by plugins.

We are a bit concerned that having the duplicated data might lead to code that is using the source fields instead of the relationships. However, doesn't removing them from the spec make it difficult to validate the entities? So I guess you would only plan to not expose them at the REST/GraphQL APIs.

Skipping over metadata makes it a bit too easy 😉 However, at first sight, one could support a full object alongside a simple string, too. This approach is quite common in k8s resources. It makes it easy to write relationships in the shorthand form (a simple string), or fall back to the more complex one if needed. The difference could already be resolved before passing it to the relationship processor.

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: artist-web
  description: The place to be, for great artists
spec:
  type: website
  lifecycle: production
  owner: Group:artist-relations
  implementsApis:
    - ref: API:another-api
      metadata:
        annotations:
          example.com/reason: We implemented it because...
        label:
          example.com/text: I have no idea for metadata for implementApi 😉, but I guess there are many!
    - API:artist-api

This would then be processed to:

relations:
  - type: backstage.io/implements
    entity:
      kind: API
      namespace: default
      name: another-api
    metadata:
      annotations:
        example.com/reason: We implemented it because...
      label:
        example.com/text: I have no idea for metadata for implementApi 😉, but I guess there are many!
  - type: backstage.io/owner
    entity:
      kind: Group
      namespace: default
      name: artist-relations
  - type: backstage.io/implements
    entity:
      kind: API
      namespace: default
      name: artist-api
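
The "resolve the difference before the relationship processor" step could be sketched roughly like this. The `RefInput` and `NormalizedRef` types and the `normalizeRef` helper are hypothetical names; the metadata shape simply mirrors the annotations in the example above.

```typescript
interface RelationMetadata {
  annotations?: Record<string, string>;
  labels?: Record<string, string>;
}

// A reference is either the shorthand string or a full object.
type RefInput = string | { ref: string; metadata?: RelationMetadata };

interface NormalizedRef {
  ref: string;
  metadata?: RelationMetadata;
}

// Brings both input forms into the object form, so that later
// processing only has to deal with a single shape.
function normalizeRef(input: RefInput): NormalizedRef {
  if (typeof input === 'string') {
    return { ref: input };
  }
  return input.metadata
    ? { ref: input.ref, metadata: input.metadata }
    : { ref: input.ref };
}

const normalizedApis = [
  {
    ref: 'API:another-api',
    metadata: { annotations: { 'example.com/reason': 'We implemented it because...' } },
  },
  'API:artist-api',
].map(normalizeRef);
```

After this step every entry has a `ref`, and only the entries written in the long form carry `metadata`.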

Another topic might be derived/indirect dependencies. These could be derived by processors to easily evaluate them. For example:

  • A component, resource or API is part of a domain, because they are part of a system, which itself is part of a domain.
  • A component depends on a component, because one implements an API and the other consumes it. It could be even more complicated, because the latter only consumes an API exposed by a system.

If one wants to handle these, it could be done with the suggested approach.
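
A processor deriving the first kind of indirect relation could, as a rough sketch, look like the following. The `PartOf` shape and the `derivePartOf` name are hypothetical, the entity names are made up, and only a single hop is derived (component to system to domain) rather than a full transitive closure.

```typescript
// "source is part of target", using entity reference strings.
interface PartOf {
  source: string;
  target: string;
}

// Derives indirect part-of relations by chaining two direct ones:
// if A is part of B and B is part of C, then A is part of C.
function derivePartOf(direct: PartOf[]): PartOf[] {
  const derived: PartOf[] = [];
  for (const first of direct) {
    for (const second of direct) {
      if (first.target === second.source) {
        derived.push({ source: first.source, target: second.target });
      }
    }
  }
  return derived;
}

const derivedPartOf = derivePartOf([
  { source: 'Component:artist-web', target: 'System:artist-platform' },
  { source: 'System:artist-platform', target: 'Domain:artists' },
]);
```

Here the component ends up with a derived relation to the domain without that relation being declared anywhere in the source.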

Just to give more examples for relationships:

  • implementsApi
  • consumesApi
  • exposeApi
  • isPartOfDomain/isPartOfSystem (would this be generalized as isPartOf?)

I guess this is much more deeply thought through than what we suggested, but our goal was just to start a discussion. We will close #1951 and are happy to see the suggestion in this PR come along (and are even interested in pushing it by helping with the implementation!).

@Rugvip
Member Author

Rugvip commented Aug 15, 2020

> We are a bit concerned that having the duplicated data might lead to code that is using the source fields instead of the relationships. However, doesn't removing them from the spec make it difficult to validate the entities? So I guess you would only plan to not expose them at the REST/GraphQL APIs.

Yep, my worry as well. I think source fields would have to be removed in the processor that reads them, since that'd be the source of truth of the mapping logic. We could either structure the processing in a way where validation of the source yaml happens before the relationship processing, or we could allow the fields to be marked for deletion so they can be removed later.

> Skipping over metadata makes it a bit too easy 😉 However, at first sight, one could support a full object alongside a simple string, too. This approach is quite common in k8s resources. It makes it easy to write relationships in the shorthand form (a simple string), or fall back to the more complex one if needed. The difference could already be resolved before passing it to the relationship processor.

Yeah I know it will likely be needed and is tricky to model, but that's why I want to skip it for now. With the suggested model I think it is fairly straightforward to add metadata, it'll just take time to figure out the schema for it. Your example does fit perfectly with what I had in mind though 😁

In general it is also possible to use any kind of shape of the source fields for the relationships, and they don't necessarily have to map exactly to the structure of the relationship either.

> Another topic might be derived/indirect dependencies. These could be derived by processors to easily evaluate them.

Yep, I think plenty of relationships could be generated by processors without any specific source field. Other examples would be a relationship to replace the backstage.io/managed-by-location annotation once we make all locations be entities, and runtime dependencies collected by an external system.

@freben
Member

freben commented Aug 15, 2020

I'm less concerned about leaving the original fields in place, for what it's worth. Skipping the complexity (in code and in explaining the reasons for them disappearing) greatly outweighs potential confusion as I see it.

I think confusion is equally likely to happen regarding not being supposed to write these manually. Could these be part of a larger concept of a "computed status" of the entity? Similar to the status of k8s objects. We need this for several other concepts (refresh loop state is one of them). Then the relations would just be another subkey in the status space.

This clearly separates things into "here's the hand written spec for how I want things to be" and "here's the current runtime status of this entity".

@Rugvip
Member Author

Rugvip commented Aug 15, 2020

> I'm less concerned about leaving the original fields in place, for what it's worth. Skipping the complexity (in code and in explaining the reasons for them disappearing) greatly outweighs potential confusion as I see it.

Not sure about that. Given the suggestion in this RFC it would always be an error for a plugin to read the source field, and at that point it might be best to remove them. The idea is to allow customization in how relationships are generated. For example you could flip around the owner field to be defined in terms of a group's owns field instead; a plugin that then tries to read spec.owner would be broken. Bit of a contrived example but you get the idea.

> I think confusion is equally likely to happen regarding not being supposed to write these manually. Could these be part of a larger concept of a "computed status" of the entity? Similar to the status of k8s objects. We need this for several other concepts (refresh loop state is one of them). Then the relations would just be another subkey in the status space.

Yep, I like the idea of having a more clear separation between computed values and the entity definition. It's a bit awkward to have exceptions for some fields that can't be written by hand. That's only if it can be done without too much overhead though imo, since we ofc already have a bunch of exceptions for fields that can't be written by hand.

Also I'm wondering if the k8s status isn't more of a runtime thing, as in a request to /status will actually go and inspect the current state of the entity, which isn't quite what we'd be aiming for.

@freben
Member

freben commented Aug 15, 2020

I see the original data as holy, reflecting exactly what I took the care to write. I want the storage part of the catalog as such to be as "dumb" as possible. Reading from the source data is fine but you have to understand the implications. It's a matter of documentation that if you want resolved/computed/real-world things, the status is where you go.

I feel that this is not exclusive to these relations either. A perhaps hypothetical example - a manually added annotation may point to the github project that a component is built in. A completely different process could be ingesting bulk data from github and deducing what components are being built in each project. A third process could be... Etc. All of them may eventually lead to a resulting annotation in the state that points to the github project.

One more. The source data omits an owner entirely, and a separate build hook analyzes codeowners and contributes that part of the data to all components that were part of the build. There wouldn't even exist an original source to read from at all.

What I'm saying is, I like where this proposal is going, and I think it's time that we DEMOTE the value of the original data, to essentially be regarded as a template or target of sorts. We could (so to speak) read Partial<Entity> out of the yaml.

@Rugvip
Member Author

Rugvip commented Aug 15, 2020

@freben Yeh, something in the vein of this, given the example input in the RFC?

apiVersion: backstage.io/v1alpha1

kind: Component

metadata:
  name: artist-web
  labels:
    backstage.io/type: website
    backstage.io/lifecycle: production
  annotations:
    backstage.io/description: The place to be, for great artists

relations:
  - type: backstage.io/owner
    entity:
      kind: Group
      namespace: default
      name: artist-relations
  - type: backstage.io/implements
    entity:
      kind: API
      namespace: default
      name: artist-api

sourceDefinition:
  apiVersion: backstage.io/v1alpha1
  kind: Component
  metadata:
    name: artist-web
    description: The place to be, for great artists
  spec:
    type: website
    lifecycle: production
    owner: Group:artist-relations
    implementsApis:
      - API:artist-api

@freben
Member

freben commented Aug 15, 2020

Maybe. I was thinking more in terms of keeping the exact structure we had, but essentially adding a new readonly status root key to entities. It would, in turn, have several subkeys with defined semantics.
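
One way to picture that shape is the following type sketch. All names here are hypothetical and nothing in this thread fixes the actual layout; it only shows the hand-written parts staying untouched while computed relations live under a read-only `status` key.

```typescript
interface EntityRelation {
  type: string;
  entity: { kind: string; namespace: string; name: string };
}

// Hand-written part of the entity, kept exactly as authored.
interface SourceEntity {
  apiVersion: string;
  kind: string;
  metadata: { name: string; description?: string };
  spec: Record<string, unknown>;
}

// Computed part, added by the catalog and never written by hand.
interface EntityWithStatus extends SourceEntity {
  readonly status: {
    readonly relations: ReadonlyArray<EntityRelation>;
  };
}

// Attaches computed relations without modifying the source definition.
function withStatus(
  source: SourceEntity,
  relations: EntityRelation[],
): EntityWithStatus {
  return { ...source, status: { relations } };
}

const statusExample = withStatus(
  {
    apiVersion: 'backstage.io/v1alpha1',
    kind: 'Component',
    metadata: { name: 'artist-web' },
    spec: { owner: 'Group:artist-relations' },
  },
  [
    {
      type: 'backstage.io/owner',
      entity: { kind: 'Group', namespace: 'default', name: 'artist-relations' },
    },
  ],
);
```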

@Rugvip
Member Author

Rugvip commented Aug 18, 2020

So pending discussion around how to separate out generated bits from the source definition, should we move ahead with just adding a top-level relations for now, and keep the source fields intact?

@freben
Member

freben commented Aug 18, 2020

If we are very clear that the location of that field may change, then I'm fine with that.

@Rugvip
Member Author

Rugvip commented Aug 18, 2020

@freben yeah that'd be the case. I'd like to validate the other parts of this suggestion without blocking on figuring that piece out.

@mbruggmann
Contributor

mbruggmann commented Aug 19, 2020

+1 for the more granular format to represent relationships in the catalog. At this point I don't worry too much about the shorthand form in the source yaml files, since I suspect that in the long run we'll gather most of the data from other existing configuration (like CODEOWNERS, Dockerfile, Terraform/CloudFormation/KCC, BUILD/pom/gradle, etc). That'd be in line with @freben also, in that I see the source yaml and catalog format further diverging over time. Gotta say that I'm not up to date enough with the processor pieces to comment on that though.

@nikek
Contributor

nikek commented Sep 1, 2020

Just an idea after reading this through. An alternative to having to manually write a relationship in the yaml in one "human friendly format" and query for it in a different one could be an "entity yaml editor" in Backstage (and/or an IDE extension) that helps you with directly putting the trickier syntax into the yaml. An editor would help with autocompleting from existing entities and spit out the correct format for any type of relationship, then create a PR to update the yaml in the repo.

Obviously this doesn't help if the specific piece of data is dynamic, but I guess no dynamic data can be in the source yaml anyways. But a helper tool could let us avoid having two different formats for basically the same set of data, and the debate of having to remove the write-friendly one upon retrieval would be a non-issue.

@Rugvip
Member Author

Rugvip commented Dec 16, 2020

Closing since this is now implemented, and any additional discussions around relations should happen in separate issues/RFCs.


6 participants