Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a way to skip items of a list when resolving their fields raises an error. #4496

Closed
amomchilov opened this issue Jun 1, 2023 · 6 comments

Comments

@amomchilov
Copy link
Contributor

amomchilov commented Jun 1, 2023

Is your feature request related to a problem? Please describe.

If an error is raised while resolving the field of an object in a list, the entire response will error out.

We would like to contribute new API that lets you say "remove this whole object from the list", which will return the other successful items. It would be opt-in on a per-field basis, to preserve backwards compatibility.

This is similar to the existing nil propagation feature built into the interpreter (and it's implemented in much the same way), but for list types instead of nullable types.

Describe the solution you'd like

Everything here is work-in-progress, all API spellings are tentative, but the overall behaviour is tested and working. This prototype adds new skip_nodes_on_raise and on_raise arguments to the field API, like so:

field :my_varied_content,  [VariendContent], 
  skip_nodes_on_raise: true, 
  on_raise: ->(err, current_object, args, context, current_field, nearest_skippable_list_item) { 
    # Log, emit metrics, notify Bugsnag, etc.
    log("Had to skip over #{nearest_skippable_list_item} because it raised #{err}")
    
    # this instructs the interpreter to skip `nearest_skippable_list_item`
    context.skip_from_parent_list 
  } 

PoC Pull request: Shopify#6

I would love to get your feedback!

Advantages:

  1. Doesn't clutter the schema-level rescue_from callback
  2. Since it's specific to the list field with the errored item, it's clear where the error happened

Disadvantages:

  1. If a GraphQL API has lots of these skippable lists, then these callbacks could get pretty repetitive.
    • Users can extract the shared callback into a module and pass the method as the callback on_raise: method(:the_shared_callback)

Describe alternatives you've considered

This feature will introduce new public API to the library that lets library users opt-in and configure this new behaviour.

We've given this a fair bit of thought, so sharing our research might be valuable. I put these into collapsible sections so it doesn't spam everything out.

Extend the existing schema-level rescue_from callback

You can intercept errors with the schema-level rescue_from callback, but your only options today are:

  1. Return a value which will be used as a substitute value for the field.
    • This seems pretty circumstantial. Not many fields have default values that would make sense. E.g. what's a default value for a Person's firstName: String!?
  2. Return context.skip, causing that field to be omitted from that object.
    • It's easy to confuse this with the list item skipping behaviour we want, but it's quite different. When you context.skip, that immediate field will be removed from that object. If the field was not nullable, this would be really weird for the client. There's no propagation behaviour to say "remove this whole object from the list", which is what we want to build.
  3. raise, to propagate the error
    • The schema.execute call will end up raising this error up to the caller, e.g. your Rails controller, which then trigger's Rails' error-handling mechanisms.

There is no way to say "just skip the whole list item". We can add something like context.skip_from_parent_list, but it's still quite clunky:

  • rescue_from already has 5 parameters, and we probably don't want to add more, especially if they're not going to be used most of the time.

    • We need to inform the user that skipping a list item is an option for the particular error. Perhaps it could be part of the context?
    • We need to tell the user which object would be skipped (e.g. so they can report to Bugsnag which kind of item has issues). Perhaps this would also be part of the context?
  • To make skipping work, every schema would need boilerplate like this, even if they have no other error logic:

    rescue_from(StandardError) do |err, obj, args, ctx, field|
      if ctx.can_skip_list_item?
        next ctx.skip_from_parent_list 
      end
      
      raise
    end

    If library users didn't add this, then their errorred items will always raise, and never be skipped, which would be a bit surprising.

Upsides:

  • Reuses and API surface that already exists, and is well-known and discoverable.

Add a new schema-level on_skip callback

(on_skip naming is tentative)

Similar to rescue_from. Since it's specialized to this item-skipping use case, we can give its own set of parameters that has all the data needed for this use case.

Not ideal however, because it would require repeating a lot of rescue_from boilerplate, like https://github.com/rmosolgo/graphql-ruby/blob/master/lib/graphql/execution/errors.rb

Add a new callback to the Field Extension API

Doc: https://rubydoc.info/gems/graphql/2.0.22/GraphQL/Schema/FieldExtension

  • Something like #on_list_item_raise
  • This is probably the best bet, but I'm not as familiar with Field Extensions.
  • This would declutter the arguments to Field#initailize.
  • The interpreter would still need to be tweaked to support the correct propagation behaviour, so this field extension will be coupled and not truly self-contained/modular.

Additional context

Motivation

Shopify has use cases that involve returning lists/connections of highly varied content, sourced from different data sources and teams. As the number of these grows, the failure surface gets bigger, and there's a risk that any one piece of content can bring down the entire request.

To increase resilience, it's imperative that failures in one piece of content can be contained, and don't take down the entire request. If resolving a list item's fields leads to an exception being raised, we'd like to skip that item entirely, but keep the other successful list items. "The show must go on."

Other considerations

This can't be solved in user code

Solving this problem is not as simple as just making users wrap up all their resolvers into a large rescue block; most of the failures don't come from the resolver logic that constructs the list items, but rather the resolution of the child fields from those list items, which can be deeply nested.

This is a process users fundamentally don't control, it's the GraphQL interpreter that's going over the query, deciding which fields need to be selected, and then resolving those fields on our objects. There's no way for "userland" code to rescue this; this requires modifications to the GraphQL interpreter, to wrap each field resolution in a rescue block, and provide an API for configuring how/when list items should be skipped.

Related

@rmosolgo
Copy link
Owner

Hey, thanks again for your investigation on this. Personally, I'm really torn: on one hand, I can understand why this would be useful, but on the other hand, I'm very wary of adding a lot of complexity to the library in order to support a non-spec'd edge case. First...

If an error is raised while resolving the field of an object in a list, the entire response will error out.

There are several other ways to address this:

  • If the field is nullable, then it can be nil and all of the rest of the response will still be returned.
  • If the field is non-nullable, then GraphQL-Ruby will go "up" the response tree, removing any non-null fields until it reaches a nullable one, which it replaces with null. (This is described in the GraphQL spec: https://spec.graphql.org/draft/#sel-EANTNDNAADPMxB97W)
  • Alternatively, resolvers can use "empty values" to avoid returning null in cases like this (for example, and empty string, 0, [], and so on). That is, instead of an error, it can be treated as if the field is empty. This is really application-dependent and may not suit every case, but it's an option.

"The show must go on."

I think the spec's plan for the show going on is to return null in this case. (Or, in GraphQL-Ruby, raise GraphQL::ExecutionError, which adds an error message to the response and returns null.) Am I right that null doesn't work in your case here, because of the expectation of the client? Why not allow null in cases when data can't be fetched?

A couple other questions:

  • What about when these objects are not part of lists? (Does that happen in your schema?) That is, if I understand right, there are objects which are sometimes part of a list, and when they're part of a list, and one of their fields encounters an error, you'd like to remove the object entirely from the list. Does this object exist outside of the list? If so, how do you (and/or how would you) handle errors in child fields in that case?
  • How many steps of "depth" should this behavior handle? Should it only handle List > Object > Field, or should it handle errors from arbitrary depth (eg List > Object > Field > Field2 > Field3 ...)?

Let me know what you think on those questions. I'd love to find something that works for your case but adds minimal complexity to the runtime and type system 🤔

@amomchilov
Copy link
Contributor Author

amomchilov commented Jun 21, 2023

Hey Robert!

Preface: A lot of my responses are phrased in terms of our use case, because that gives a concrete grounding/motivation to these things, but I don't mean for it to sound like it's all about us. 😅

(This is described in the GraphQL spec: spec.graphql.org/draft/#sel-EANTNDNAADPMxB97W)

Interesting, I had no idea that this behaviour was described by the spec!

Isn't that kind of strange? It doesn't seem to have much to do with the interface itself, but rather, a detail of the resolution of values that get surfaced on the interface. I didn't expect the GraphQL spec to care about that area. 🤔

There are several other ways to address this

Of those options, I guess the most applicable one to our needs would be to make a list of nullable items, but that would require a new foos2: [Foo]! field and deprecating our current foos: [Foo!]! field.

This seems like a bit of a sharp edge in GraphQL API design. If you didn't foresee the need to skip items upfront and make your items nullable, then you're painting into a corner until you bite the bullet and making a breaking change. This is particularly odd to me, because skipping items should be a server concern, that the client shouldn't necessarily need to know about, or be an active participant in (they would need logic to skip over nulls they get).

Am I right that null doesn't work in your case here, because of the expectation of the client?

Yep! We'd need a breaking API change, and only new clients would be able to have the improved error handling capability.

Why not allow null in cases when data can't be fetched?

We can, but didn't design for it at the time, so it'd be API-breaking to change now.

I find this to be a rather strange design in the spec. In our cases, we don't have a "genuine" need to return null, but for this error-handling behaviour. Without any item-skipping feature, it seems that the incentives are to:

  1. Always make lists of nullable items, even if you don't need null, just so you can return null to handle errors
    • null has an overloaded meaning now, and you can't distinguish "null because error" from "null because there's genuinely nothing there"
  2. Have the clients know to skip null items, of practically all lists everywhere

Knowing what I know now, I think I'd probably only ever make lists of nullable items going forward, just to give myself the flexibility to be able to skip items on error.

Does this object exist outside of the list?

It doesn't in our case, but I don't think there's an issue there in any case.

Failures outside of a list don't have the same conceptual issue: fields always "fail independently", and return an error/null for a single field. Contrast this with a list, where 1 error out of 100 objects will cause 99 other valid objects to all be discarded. The failure of 1 item is ok, it's the blast radius of 99 others lost that's the problem.

How many steps of "depth" should this behavior handle? Should it only handle List > Object > Field, or should it handle errors from arbitrary depth (eg List > Object > Field > Field2 > Field3 ...)?

Great question! It would need to handle arbitrary depth. In our real-world use case, we have fields around 10 levels deep into a list object that we need to resolve. They should be "protected" by this, in that there's no part of the sub-tree whose errors can bring down the entire list up top.

The included PR already supports handling arbitrary depth, just like the existing nil propagation.

A related question is "How should this handle if an item is in a list, in an item that's in a list?". This PR doesn't do anything in particular to support handling multiple levels of lists, since I don't think there's as much of a concrete motivation for that. It just skips the current object at the first ancestor list that's skip_nodes_on_raise: true.

@rmosolgo
Copy link
Owner

Isn't that kind of strange?

On the other hand, it's part of the GraphQL type system: non-null fields can't be null, and the spec describes what happens when no value can be provided for a non-null field. Makes sense to me 🤷

skipping items should be a server concern

Agreed, and that reminds me of one more possible implementation option, scoping: https://graphql-ruby.org/authorization/scoping.html. It's billed as an authorization feature, but it could be used here, too. You could check the list of items to see if they ought to be rejected, and remove them from the list if so. If you only want to perform the check when certain fields were selected by the client, you could use a lookahead (https://graphql-ruby.org/queries/lookahead.html):

def self.scope_items(items, context)
  lookahead = Execution::Lookahead.new(
    query: context.query,
    # TODO: The field that returns this type will need to add `extras: [:ast_node]` for this to work 
    # Also, this doesn't cover the case that this field is selected by _multiple_ AST nodes. 
    # GraphQL-Ruby would need to support `extras: [:ast_nodes]` (plural) to cover this case, 
    # which would be straightforward, but isn't done yet. 
    ast_nodes: [context[:current_arguments].fetch(:ast_node)],
    field: context[:current_field],
  )
  # Then, check the lookahead for the relevant fields, and if any were selected, 
  # also check the members of `items` to make sure they don't have invalid values for those fields. 
end 

The included PR already supports handling arbitrary depth

I certainly appreciate your proving it's possible to implement this in the heart of the runtime, but from my standpoint, I have a few doubts:

  • The API is (understandably, appropriately) complicated, and I consider this a code smell. (Not that the feature is wrong, just that the solution belongs somewhere where it fits more easily.) It will be hard to document, and...
  • People who find it will think they should use it, when in fact, what they should do is use one the other solutions mentioned in this thread. (I realize they aren't a perfect fit in your use case, but in 9 years -- this is the first!)
  • It will add maintenance and performance overhead for the rest of my life; The more features find their way into the core runtime, the slower it will run and the harder it will be to change in the future.
  • As far as I know, this feature has no precedent in other GraphQL implementations (please correct me if there is one!), which suggests to me that other applications have found other solutions to this problem.

For these reasons, I strongly prefer to find a solution using the current feature set of GraphQL-Ruby.

At the very bottom of the list is a custom implementation where:

  • context: { ... } includes an empty array of error_paths_to_remove: []
  • When an error is encountered where paths should be removed, context[:errors_to_remove] << context[:current_path]
  • Then, before returning a response Hash to the client, the response hash is modified such that those invalid objects are removed from lists they belong to (this could be done in after_query, for example: https://graphql-ruby.org/queries/instrumentation.html

What do you think of scope_items or the manual implementation?

@amomchilov
Copy link
Contributor Author

amomchilov commented Jul 10, 2023

Hey Robert!

I totally understand your reservations from a maintenance/complexity perspective.

I'm open to the possibility that the complexity of solving this problem isn't worth the benefit, and that we should instead bite the bullet to navigate around it (by deprecating our fields of non-nullable list items and migrating to lists of nullable items). However, I'd like to clarify that the interpreter is the only place this can reasonably be solved. (Unless I'm mistaken, which would be great actually, because the solution would be simpler. 😄). I'll elaborate...

When an error is encountered where paths should be removed,

Herein lies the hard part, and where scope_items is insufficient. Today, it's not possible to detect that an error was encountered, without discarding the whole result. scope_items is really great for filtering items by some clear boolean signal, e.g.:

def self.scope_items(items, context)
  items.filter { |item| item.visible_to?(context.user) }
end

In our example, we don't have a neat and tidy boolean predicate like #visible_to?. What we'd need is a predicate like #has_any_selected_fields_that_raise_on_resolution_or_fail_graphql_type_checking_or_fail_for_any_other_reason?. Detecting that this is the case would require us to examine the field selections (which we can get from the lookahead feature, like you suggested), evaluate the values for those fields within a begin/rescue block, and reject the items that raise. This is easy for a single level, but the problem is that the errors can be raised at any arbitrary depth in the result object hierarchy. To detect those, we would need to evaluate the selected fields, and then recurse in to evaluate the selected fields on those objects, and then the selected fields on those, and so on. Recursive descent over an AST of GraphQL selections, visiting nodes to evaluate field selections, type checking results, and making control flow decisions based on the results and ... wait a minute, that seems familiar, ... that's just a re-implementation of the interpreter!

It's a valid open question whether or not this problem is important enough to warrant solving in the interpreter itself. However, unless my reasoning is woefully mistaken, this problem can only be solved in the interpreter. Believe me, it wasn't my first choice to make the interpreter any more complex either, for what that's worth 😅.

Though my proposed implementation isn't quite there yet, ideally, I would envision this item-skipping logic to share the same common implementation as the nil-propagation mechanism that already exists. If you squint a little, they're kind of the same (detecting a value in a disallowed context, and climbing up the result tree to find something that can handle it).

BTW, I've also sent you some further sensitive details about our use case on Slack

@amomchilov
Copy link
Contributor Author

Hey again @rmosolgo, I don't mean to nag, but would you mind sharing your latest thoughts on this?

@rmosolgo
Copy link
Owner

I'm sorry, I don't have plans to implement this and I think adding it to the current runtime would add undue complexity. I realize that there's no other way to implement exactly this solution, but as discussed above, there are other solutions that don't require custom execution code. Sorry to say no, but I hope one of those will be feasible enough!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants