add Authentication to Federation #117

Open
VladimirAlexiev opened this issue Sep 16, 2020 · 22 comments

@VladimirAlexiev
Contributor

Why?

SPARQL 1.1 Federation does not specify any way to authenticate.
This is a significant impediment in enterprise scenarios.

Previous work

Vendor-specific solutions:

Many vendors have additional authentication solutions (eg integrations with LDAP, SSO, etc) but afaik they are not exposed to Federation.

Proposed solution

Sorry, someone smarter than me should propose solutions :-)

Considerations for backward compatibility

None, this will be new functionality.

@rubensworks
Member

One possible solution would be to make it possible to pass one or more service-specific auth headers to the SPARQL endpoint.
Something roughly like:

X-SPARQL-Service-Auth: "http://example.org/endpoint1": <auth-value1>, "http://example.org/endpoint2": <auth-value2>, 

The standard auth headers could be used as auth values.
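
To make the idea concrete, here is a minimal sketch of how a receiving endpoint might read such a header. The header name and its `"url": value, ...` layout are only the rough proposal above, not an existing standard, and `parse_service_auth` is a hypothetical helper name. Auth values that themselves contain a comma followed by a quote (e.g. some Digest values) are not handled.

```python
import re

# Assumed (non-standard) layout, following the rough proposal above:
#   "<endpoint-url>": <auth-value>, "<endpoint-url>": <auth-value>, ...
_ENTRY = re.compile(r'"([^"]+)"\s*:\s*(.*?)\s*(?=,\s*"|,?\s*$)')


def parse_service_auth(header_value):
    """Map each quoted endpoint URL to the auth value that follows it."""
    return {url: value for url, value in _ENTRY.findall(header_value)}


if __name__ == "__main__":
    header = ('"http://example.org/endpoint1": Basic dXNlcjE6cHcx, '
              '"http://example.org/endpoint2": Bearer abc123, ')
    for endpoint, auth in parse_service_auth(header).items():
        print(endpoint, "->", auth)
```

An endpoint evaluating a SERVICE clause would then look up the clause's IRI in the resulting mapping and send the matching value as the outgoing auth header.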


Side-note for related work: The user:password@server:port approach is also supported in Comunica: https://comunica.dev/docs/query/advanced/basic_auth/

@afs
Collaborator

afs commented Sep 23, 2020

Yes, authentication matters.

Is there anything special to SPARQL Federation here though? If it is general matter of HTTP connection handling, then using machinery available from libraries, and the systems that enterprises know how to manage and control, would seem better.

The challenge is to answer why this is not the same as securing any HTTP connection, and to identify what, if anything, SPARQL 1.2 the-standard has to do.

Jena allows SERVICE requests to be modified (e.g. adding user/password to the URL) so the query text is separated from the request time action and the authentication can be by calling user etc. Putting user:password@ seems limiting.

@afs afs mentioned this issue Sep 24, 2020
@JervenBolleman
Collaborator

I believe this is the most difficult problem to actually solve, because in this kind of federation you are passing on trust.
I talk to server A, which talks to server B, so now server B needs to validate that server A actually represents me, at all times avoiding sharing secrets and stuffing them into queries. It's really hard to do, and I worry that a spec that did it safely and correctly will be so difficult to implement that it never happens :(

@ericprud
Member

ericprud commented Sep 25, 2020

@rubensworks wrote:

Side-note for related work: The user:password@server:port approach is also supported in Comunica: https://comunica.dev/docs/query/advanced/basic_auth/

Promiscuous query

Isn't that generally available now (if you don't mind sharing your credentials with the executor of the query and any service they might pass a part of the query to)? If I paste this query into service0.example's SPARQL query interface:

SELECT ... { ...
  SERVICE <http://user1:password1@service1.example/> { ...
    SERVICE <http://user2:password2@service2.example/> {
    } ...
    SERVICE <http://user3:password3@service3.example/> {
    } ...
  } ...
}

, service0 will see credentials for services 1-3 and it's likely that service1 will see credentials for 2-3. If they're using generic HTTP libraries to dereference the SERVICE URLs, this should work in SPARQL 1.1. If that's true, I think we don't get much benefit from having a syntactic extension that moves the credentials out of the URLs.
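
A small sketch of why the credentials are visible to every intermediary: any generic URL library exposes the userinfo part of a SERVICE IRI. This uses only Python's standard `urllib.parse`; the two function names are hypothetical, and IPv6 host literals are not handled.

```python
from urllib.parse import urlsplit


def embedded_credentials(service_iri):
    """Return (host, user, password) as any intermediary could read them."""
    parts = urlsplit(service_iri)
    return parts.hostname, parts.username, parts.password


def strip_credentials(service_iri):
    """Return the IRI without its userinfo part (IPv6 literals not handled)."""
    parts = urlsplit(service_iri)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return parts._replace(netloc=host).geturl()


if __name__ == "__main__":
    iri = "http://user1:password1@service1.example/"
    print(embedded_credentials(iri))
    print(strip_credentials(iri))
```

Whoever parses the query text (service0, and service1 for the nested clauses) can run the equivalent of `embedded_credentials` on every SERVICE IRI it sees.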

I think in practice, when folks want to write such queries in a non-promiscuous way, they end up contacting services 0-3 and setting up lots of tedious custom interfaces and giving up on service 0 and service 1's implementation of SPARQL Federation. This indicates that there's some need to deal with this at the protocol level.

Bag o' credentials

A simple extension, which would move the credentials out of plain sight, would be to move them into a header so that when e.g. service 1 was calling service 2 or 3, it could pull the credentials out of a (hypothetical) WWW-BagOfCredentials header and stick them into an Authorization header. The HTTP machinery would be the same, but the service executor would have to root around in the WWW-BagOfCredentials header, so it would need access to the incoming headers, which limits platforms a bit.

    • moves credentials out of the query
    • shares every credential with other services

You could mitigate the last point a bit by dynamically creating short-duration, limited-use credentials with services 1-3 before initiating the query.

Ultimately, it would be nice to have a token per SERVICE call, perhaps scoped to execute only that subquery. I fiddled around with how an OAuth2 Implicit pattern might enable each service to get a token when it first gets queried. Will add a comment if that coalesces.

@lisp
Contributor

lisp commented Oct 6, 2020

A simple extension, which would move the credentials out of plain sight, would be to move them into a header

while this removes them from the query text, this still entrusts the credentials to the intermediate sites.

@lisp
Contributor

lisp commented Oct 6, 2020

when placing remote requests in connection with service clauses, dydra handles just the first authentication level. the rbac graph which governs access to remote service location resources (see #121) can also include credentials, which are decrypted and incorporated into request headers, as appropriate.

this does not address successive authentication stages, but i would not want the authority to do so.

@ericprud
Member

ericprud commented Oct 7, 2020

@lisp, I'm guessing that if you were service1 in the Promiscuous query example above, you wouldn't want the initial querier to hand you a big bag of credentials for unlimited access to service2 and 3. Would you feel better about any of these approaches?

  1. querier prearranges tokens with service2,3 which give time-boxed permission to execute the respective service clauses but nothing else.

  2. querier prearranges tokens with service2,3 which give one-time permission to execute the respective service clauses but nothing else. This means you have to gather everything in a BINDINGS clause to make only one query, which could be difficult for non-trivial cases.

  3. querier sends you temporary endpoints (or temporary tokens for more persistent endpoints) to get the querier (or some delegate) to execute the query. This moves more bits around the network, but you (service1) never see the querier's credentials for service2 or 3.

3 is arguably easier than 1 because in 3, the querier gets to decide whether the service clause that you call e.g. service1 with is allowed. In 1, service1 would have to make that call, which would require more sophistication to register the token, store the approved service clause, and test equivalence between that and the query you pass to them. Since in 3 the querier does all that themselves, they can make that code more sophisticated when they need to, without any coordination with service1.

(I guess in principle, folks could use 3 today by setting up proxy query endpoints and rewriting the query to use them instead of service1,2,3.)
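
For approach 1, a rough sketch of what a time-boxed token scoped to one service clause could look like. This is an illustration, not a proposal for the spec: the token layout, the function names, and the idea that the service holds an HMAC secret are all assumptions, and comparing clauses by whitespace-normalized text is much cruder than comparing algebra.

```python
import hashlib
import hmac
import time


def clause_digest(clause_text):
    """Hash the clause with whitespace collapsed, so reformatting is tolerated."""
    normalized = " ".join(clause_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def issue_token(secret, clause_text, expires_at):
    """Service side: mint a token for one clause, valid until expires_at (epoch seconds)."""
    payload = f"{clause_digest(clause_text)}.{int(expires_at)}"
    sig = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def check_token(secret, token, clause_text, now=None):
    """Service side: accept only an unexpired token minted for exactly this clause."""
    now = time.time() if now is None else now
    try:
        digest, expiry, sig = token.split(".")
        expiry_int = int(expiry)
    except ValueError:
        return False  # wrong number of fields or non-numeric expiry
    payload = f"{digest}.{expiry}"
    expected = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return (hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))
            and digest == clause_digest(clause_text)
            and now < expiry_int)
```

The querier would obtain such a token from service2 ahead of time and hand it to service1 with the query; service1 can replay it until it expires, but only for that clause.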

@lisp
Contributor

lisp commented Oct 7, 2020 via email

@smart-trust

smart-trust commented Aug 31, 2021

We need an encrypted, mutual-trust handshake. Isn’t this where WebID steps in?
The standard TLS transport handles client certificates. Ideally a 2nd factor such as OAuth would be preferable to embedded u/p in headers or, worse, SPARQL queries! At the very least they should be injected to maintain separation of concerns and least privilege.

@smart-trust

can anyone point me to the relevant bits of codebase?

@VladimirAlexiev
Contributor Author

I'm glad people are gradually contributing bits of knowledge to this difficult problem.

Here's how I understand the difficulties:

  • trust: How to allow server0 to pass credentials to server1, without being able to steal these credentials?
  • per-user credentials: If I have a user account on server0 that gives me some special rights (eg through RBAC) and a "parallel" user on server1 but no user (i.e. global or no credentials) on server2, how to arrange for server0 to pass this along? Given that I as end-user am not able to mess with the setup of server0, the server0 admin should be able to set up this "parallelism" for different users
  • simplicity: Security and cryptography have advanced a lot (eg zero-knowledge proofs, distributed identities), so how to use these advances without complicating the approach too much?

@afs

Jena allows SERVICE requests to be modified
Andy, can you point to the docs about this?

why this is not the same as securing any HTTP connection

Because server0 has to decide what credentials to pass to server1, based on the user, server0 setup, the identity of server1, and perhaps even the query.

@ericprud: WWW-BagOfCredentials is a hypothetical thing, not yet existing, right?

@cto-troven

point me to the relevant bits of codebase?

AFAIK there isn't any commonly accepted codebase, we're still gathering prior art.

@afs
Collaborator

afs commented Aug 31, 2021

Because server0 has to decide what credentials to pass to server1, based on the user, server0 setup, the identity of server1, and perhaps even the query.

(insert general concern of custom security approaches)

Let's set a baseline - what of this isn't OAuth/OpenID/VC/...?

We have things like github-related authorization of apps to do things on behalf of the user. In these services, the user has authorized server0 to use server1 on its behalf - and trust server0 to do so only as appropriate.

A solution focused on ABAC is preferable to RBAC.

can you point to the docs about this?

Current:
https://jena.apache.org/documentation/query/service.html
and the security handling of HTTP connections.

(The implementation is "soon" to be significantly upgraded.)

In outline: server0 has "if contacting endpoint E on behalf of user A, use credentials XYZ on the HTTP connection". This allows two cases:

  • pre-registration at server1 - user trusts server0 with the user's access to server1. An external authn/authz system can reduce that (and is more of an enterprise scenario).
  • only server0 can query server1, not the user directly (not the issue here)
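
The "if contacting endpoint E on behalf of user A, use credentials XYZ" rule can be pictured as a lookup table kept by server0. The sketch below is not Jena's API; the table, the names, and the example values are all hypothetical.

```python
# Hypothetical server0 configuration:
#   (user, endpoint) -> value for the outgoing Authorization header
DELEGATED_CREDENTIALS = {
    ("userA", "https://server1.example/sparql"): "Bearer token-for-userA-at-server1",
}


def headers_for_service(user, endpoint, table=DELEGATED_CREDENTIALS):
    """Return extra HTTP headers for a SERVICE call, or {} if nothing is registered."""
    value = table.get((user, endpoint))
    return {"Authorization": value} if value is not None else {}


if __name__ == "__main__":
    print(headers_for_service("userA", "https://server1.example/sparql"))
    print(headers_for_service("userB", "https://server1.example/sparql"))
```

The query text never carries the credential; the server0 admin, not the end user, decides which pairs are registered.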

@ericprud
Member

@ericprud: WWW-BagOfCredentials is a hypothetical thing, not yet existing, right?

Yep. I tweaked my comment: "out of a (hypothetical) WWW-BagOfCredentials header".

@ericprud
Member

ericprud commented Sep 22, 2021

Apologies for waiting almost a year to reply to @lisp's comprehensive comments above.

On 2020-10-07, at 13:14:34, ericprud notifications@github.com wrote:

@lisp, I'm guessing that if you were service1 in the Promiscuous query example above, you wouldn't want the initial querier to hand you a big bag of credentials for unlimited access to service2 and 3.

yes, we play the role of service1 only.
even if we are service2 or service3 in a larger context, our relation to the respective client is as a service1.
we do not now allow anyone to “hand us a big bag of credentials”.
as we operate publicly accessible network services, “backend” resources are all managed. just as we do not permit arbitrary account creation, we do not permit an existing account to create an arbitrary remote resource.
we control that process.
it’s a political, rather than a technology decision.

And a sensible one.
We should also probably design for cases where the primary query service is completely trusted to re-express queries with the querier's credentials.

Would you feel better about any of these approaches?

• querier prearranges tokens with service2,3 which give time-boxed permission to execute the respective service clauses but nothing else.

as use of these tokens is governed strictly by relationships between the querier and service2,3, i have no opinion about it.
so long as i am responsible for the credentials for the process of authentication with service2 only, and not for action (beyond authentication) to be governed by either of service2,3, it does not matter to me.
even if the querier has decided that part of the protocol with services2,3 involves that i would convey such tokens in the text of a federation request, so long as that text is conformant, why should i feel anything about it?

That makes sense. Your service is running in a pretty conservative mode, but since these arrangements are outside of your control, you don't have to care about them.

• querier prearranges tokens with service2,3 which give one-time permission to execute the respective service clauses but nothing else. This means you have to gather everything in a BINDINGS clause to make only one query, which could be difficult for non-trivial cases.

i do not understand this mechanism - esp "gather everything in a BINDINGS clause to make only one query". please give a more detailed example.
if it means that i need to do more than the mechanics of introducing values clauses to effect sidewards-information-passing, that is, it is not inherent in the algebra, then i would be wary.

Nothing so exotic. I was just noting that if service 1 (who's acting as a sort of aggregator) were given OTPs for services 2 and 3, it would need to be able to formulate the federation as single queries (no getting 2's results and iteratively constructing queries to 3).

• querier sends you temporary endpoints (or temporary tokens for more persistent endpoints) to get the querier (or some delegate) to execute the query. This moves more bits around the network, but you (service1) never see the querier's credentials for service2 or 3.

this is analogous to scenario 1. that is, the service1 federation requests are opaque to service1.
so long as they are conformant, why should i feel anything about them?
if it is to be dynamic, we would have to relax the constraint on creating resources, to permit the querier to create the remote temporary endpoints and supply their respective credentials.

Exploring static and dynamic (or verbatim and derived) queries a bit here:

verbatim queries

Static (verbatim) sounds like it's limited to cases where the querier pre-constructed a query like (an example from 11 years ago):

SELECT ?symbol ?label 
WHERE
{
  SERVICE <service2>
    {
      [] uniprot:gene\#acc "P04637" ;
         uniprot:gene\#val ?symbol .
    }
  SERVICE <service3>
    {
      [] ucsc:association\#gene_product_id [
           ucsc:gene_product\#Symbol ?symbol
         ] ;
         ucsc:association\#term_id [
           ucsc:term\#name ?label
         ] .
    }
}

In this case, the aggregator (service1) can almost parrot the exact query on to services 2 and 3, which means the querier could have pre-arranged with those services to accept specific queries from service1. Of course, whitespace tweaks are a problem, but the bigger problem is that service1 would probably not want to pose such an unconstrained query to service3 (which returns all of the gene symbol/label pairs). Instead it would want to constrain that query to return only the symbol for P04637, e.g.

SELECT ?symbol ?label
WHERE
    {
      [] ucsc:association\#gene_product_id [
           ucsc:gene_product\#Symbol ?symbol
         ] ;
         ucsc:association\#term_id [
           ucsc:term\#name ?label
         ] .
    }
VALUES ?symbol { "TP53" } # value returned from uniprot

That means that the querier would have to tell service3 to accept from service1 any query with an algebra of:

(base <http://example/base/>
  (prefix ((ucsc: <>))
    (project (?symbol ?label)
      (join
        (bgp
          (triple ??0 <association#gene_product_id> ??1)
          (triple ??1 <gene_product#Symbol> ?symbol)
          (triple ??0 <association#term_id> ??2)
          (triple ??2 <term#name> ?label)
        )
        (table (vars ?symbol)
          (row [?symbol X]) *
        )))))

That's not too tough but it does require inventing a bit of language for X and * to templatize the query so that service1 can substitute in rows for the ?symbol values returned from service 2.
Or you can hard-code some way for service3 to chop off the VALUES clause before testing that the query from service1 matches something pre-arranged with the querier.
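
A rough sketch of the second option (chop off the VALUES clause, then compare with what the querier pre-arranged). It works on query text rather than on the algebra, only recognizes a single-variable VALUES block at the very end of the query, and does not strip comments, so treat it as an illustration of the idea only. All names are hypothetical.

```python
import re

# A trailing single-variable VALUES block, e.g.  VALUES ?symbol { "TP53" }
_TRAILING_VALUES = re.compile(r"\s*\bVALUES\s+\?\w+\s*\{[^{}]*\}\s*$", re.IGNORECASE)


def canonical(query_text):
    """Drop a trailing VALUES block and collapse all whitespace runs."""
    without_values = _TRAILING_VALUES.sub("", query_text)
    return " ".join(without_values.split())


def matches_template(incoming_query, approved_query):
    """True if the incoming query is the approved one, modulo whitespace and VALUES."""
    return canonical(incoming_query) == canonical(approved_query)


if __name__ == "__main__":
    approved = "SELECT ?symbol ?label WHERE { ?a ex:sym ?symbol ; ex:name ?label }"
    incoming = approved.replace(" WHERE ", "\nWHERE\n") + '\nVALUES ?symbol { "TP53" }'
    print(matches_template(incoming, approved))
```

This tolerates the whitespace tweaks mentioned above, but any other rewriting by service1 (reordered triples, renamed variables) would need an algebra-level comparison.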

Are we over-engineering yet? But wait, there's more...

derived queries

If the aggregator were helpfully figuring out to go to uniprot and UCSC to execute a query like:

SELECT ?id ?gene_symbol WHERE {
    ?gene uniprot:id ?id ; skos:prefLabel ?gene_symbol
}

, the aggregator would be inventing the whole orchestration, including constructing from whole cloth the queries of services 2 and 3.

One way to do that would be to make it interactive so that the querier submits the above query to service1 and service1 says "for me to execute this, I need permission to execute template A on service2 and template B on service3." Those templates may be many pages of eye-gouging SPARQL algebra, but i'd expect that in most cases, the querier would say "yeah, sure, whatever" and tell services 2 and 3 to permit any queries from service1, maybe with some ulimits.

Now service1 can go ahead and weave a bit of Semantic Web from those services.

3 is arguably easier than 1 because in 3, the querier gets to decide whether the service clause that you call e.g. service1 with is allowed. In 1, service1 would have to make that call, which would require more sophistication to register the token, store the approved service clause, and test equivalence between that and the query you pass to them. Since in 3 the querier does all that themselves, they can make that code more sophisticated when they need to, without any coordination with service1.

(I guess in principle, folks could use 3 today by setting up proxy query endpoints and rewriting the query to use them instead of service1,2,3.)

modulo, that we restrict the use of remote endpoints to those which have been configured in advance.

Right, I guess in that case, the proxy could automatically configure the remote endpoints before issuing the query.

@lisp
Contributor

lisp commented Sep 22, 2021

Static (verbatim) sounds like it's limited to cases where the querier pre-constructed a query like ... [but] service1 would probably not want to pose such an unconstrained query to service3

on the contrary, where we permit an external request

  • the resources required by the remote service to process the request are a matter to be settled between the querier and that service
  • the resources required by service 1 to process the results of the remote request are subject to the same constraints as those which govern a request which is entirely internal to service1, as if the service operation is a subselect.
  • this includes the intention to use sidewards information passing where possible.

the querier would have to tell service3 to accept from service1 any query with an algebra of: ...

was the example not standard sparql?
anything beyond that involves the relationships between the querier and services 2 and 3, which do not concern us.

... the aggregator [could] be inventing the whole orchestration, including constructing from whole cloth the queries of services 2 and 3.

yes, we have done that. that is, we have relied on schema definitions to determine how to deconstruct an aggregate query and delegate aspects to the respective sources and associated the requisite location and authentication information with the schema definitions. in this case, we accepted and managed the authentication credentials.
it involved translating some clauses to rest requests and others to sql.
the remote services accepted the credentials and made resources available to the requests as per agreement with the querier.

One way to do that would be to make it interactive

In the case at hand, everything was declared in relation to the schemas which drove the process to deconstruct the aggregate query. it was not difficult. remote service reliability was much more the issue. no interaction was necessary.

@ericprud
Member

Static (verbatim) sounds like it's limited to cases where the querier pre-constructed a query like ... [but] service1 would probably not want to pose such an unconstrained query to service3

on the contrary, where we permit an external request

  • the resources required by the remote service to process the request are a matter to be settled between the querier and that service
  • the resources required by service 1 to process the results of the remote request are subject to the same constraints as those which govern a request which is entirely internal to service1, as if the service operation is a subselect.
  • this includes the intention to use sidewards information passing where possible.

There are two use cases that I'm juggling:

  1. user has relationships with service{1,2} to enable queries;
  2. user wants to limit what those privileged queries can see.

For use case 1 (alone), the user can pass an access token (password, timeboxed, OTP, etc).

For 2, the user needs to associate that token with some parameterized access. One way to do that would be with verbatim queries, hence my point that if service3's query depends on service2's response, the querier would not be able to predict the query to pre-approve it with service3 (unless the answer is already known, but then, why issue the query). Below, I explore parameterizing the access by tying it to some templated query encompassing the query issued from service1 to service3.

the querier would have to tell service3 to accept from service1 any query with an algebra of: ...

was the example not standard sparql?

Still standard SPARQL 1.1, so far, but with the query structure policed by service3 to match something pre-approved by the querier.

anything beyond that involves the relationships between the querier and services 2 and 3, which do not concern us.

Sure, but even if you're passing a token, we also have to envision ways that the contract between the querier and service3 can be both:

  • flexible enough to allow for precise queries computed from earlier results and
  • fussy enough that the querier knows that service1 is only accessing expected stuff (isn't e.g. doing an SPO query using the querier's account).

... the aggregator [could] be inventing the whole orchestration, including constructing from whole cloth the queries of services 2 and 3.

yes, we have done that. that is, we have relied on schema definitions to determine how to deconstruct an aggregate query and delegate aspects to the respective sources and associated the requisite location and authentication information with the schema definitions. in this case, we accepted and managed the authentication credentials.

Gotcha. I was exploring query templates, but you could also enumerate named graphs or permitted predicate paths or...

it involved translating some clauses to rest requests and others to sql.
the remote services accepted the credentials and made resources available to the requests as per agreement with the querier.

If service1 were mischievous, could it abuse the querier's credentials to ask for data beyond what the querier intended?

One way to do that would be to make it interactive

In the case at hand, everything was declared in relation to the schemas which drove the process to deconstruct the aggregate query. it was not difficult. remote service reliability was much more the issue. no interaction was necessary.

Sorry, are we talking about service1 or service3 parsing that description and policing the query?

@lisp
Contributor

lisp commented Sep 22, 2021

There are two use cases that I'm juggling: 1) user has relationships with service{1,2} to enable queries; 2) user wants to limit what those privileged queries can see.

[ i elide the majority of the text, as i seek to avoid any juggling acts which involve external services.
rather than juggle, pick up one thing, resolve it and put it down. then pick up the other and do the same. ]

while there is adequate literature on access control via query rewriting, service2 would be negligent to rely on service1 to implement authorization constraints by delegating just approved requests.
which means service2 must ignore any assertions "that service1 is only accessing expected stuff".
which means any limitations under the contract between querier and service2 are between them.

the interaction between service1 and service2 is in principle no different than that between service2 and any other client. that service1 uses credentials provided by querier does not change that.

were service1 a reseller rather than an aggregator, that might change.
absent that, service1 deconstructs and delegates as per agreement with the querier, service2 responds in accord with the authorization afforded by it to the querier, service 1 interprets the responses and consolidates the results.
where are the overlaps in authority or capability which would lead to juggling?

@ericprud
Member

There are two use cases that I'm juggling: 1) user has relationships with service{1,2} to enable queries; 2) user wants to limit what those privileged queries can see.

[ i elide the majority of the text, as i seek to avoid any juggling acts which involve external services.
rather than juggle, pick up one thing, resolve it and put it down. then pick up the other and do the same. ]

while there is adequate literature on access control via query rewriting, service2 would be negligent to rely on service1 to implement authorization constraints by delegating just approved requests.
which means service2 must ignore any assertions "that service1 is only accessing expected stuff".
which means any limitations under the contract between querier and service2 are between them.

the interaction between service1 and service2 is in principle no different than that between service2 and any other client. that service1 uses credentials provided by querier does not change that.

Agreed, modulo the point that service1 has to stay inside the lines of the contract between querier and service 2 or 3.
In order to make this more than theoretically useful, it would be good to examine some practical contracts. SPARQL 1.next can probably exploit them by simply passing on a token from the querier so it's less about SPARQL requirements and more about a practical use.

were service1 a reseller rather than an aggregator, that might change.
absent that, service1 deconstructs and delegates as per agreement with the querier, service2 responds in accord with the authorization afforded by it to the querier, service 1 interprets the responses and consolidates the results.
where are the overlaps in authority or capability which would lead to juggling?

By "juggling", i meant i was examining two use cases; not that service1 has to do any juggling.

@lisp
Contributor

lisp commented Sep 23, 2021

By "juggling", i meant i was examining two use cases; not that service1 has to do any juggling.

you wrote,

There are two use cases that I'm juggling:

  • user has relationships with service{1,2} to enable queries;
  • user wants to limit what those privileged queries can see.

the protocols and apis among querier and service1 and service2 must ensure that the two use cases are the same.

SPARQL 1.next can probably exploit them by simply passing on a token from the querier, so it's less about SPARQL requirements and more about a practical use.

yes, a literal "token" is likely not sufficient, but sparql 1.1 already permitted the equivalent of that.

@nvbach91

This is how Amazon Neptune (Blazegraph) handles it now:

https://docs.aws.amazon.com/neptune/latest/userguide/sparql-service.html

Access control for federated queries in Neptune

Neptune uses AWS Identity and Access Management (IAM) for authentication and authorization. Access control for a federated query can involve more than one Neptune DB instance. These instances might have different requirements for access control. In certain circumstances, this can limit your ability to make a federated query.

# send to http://neptune-1:8182/sparql
SELECT * WHERE {
   ?person rdf:type foaf:Person .
   SERVICE <http://neptune-2:8182/sparql> {
       ?person foaf:knows ?friend .
    }
}

Consider the simple example presented in the previous section. Neptune-1 calls Neptune-2 with the same credentials it was called with.

If Neptune-1 requires IAM authentication and authorization, but Neptune-2 does not, all you need is appropriate IAM permissions for Neptune-1 to make the federated query.

If Neptune-1 and Neptune-2 both require IAM authentication and authorization, you need to attach IAM permissions for both databases to make the federated query.

However, in the case where Neptune-1 is not IAM-enabled but Neptune-2 is, you can't make a federated query. The reason is that Neptune-1 can't retrieve your IAM credentials and pass them on to Neptune-2 to authorize the second part of the query.
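
The three cases quoted above reduce to a small truth table. The sketch below only restates those rules and is not a Neptune API; the function name is hypothetical, and the fourth case (neither instance IAM-enabled) is not covered by the quoted text and is assumed here to work.

```python
def federated_query_possible(neptune1_iam, neptune2_iam):
    """Restate the quoted rules: only 'caller without IAM, callee with IAM' fails.

    Assumes the caller holds whatever IAM permissions the IAM-enabled
    instances require. The (False, False) case is an assumption.
    """
    return neptune1_iam or not neptune2_iam


if __name__ == "__main__":
    for n1 in (True, False):
        for n2 in (True, False):
            print(f"Neptune-1 IAM={n1}, Neptune-2 IAM={n2}:",
                  federated_query_possible(n1, n2))
```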

It is not possible to put username:password@ in the SERVICE URI in Blazegraph; instead, they are somehow able to forward the authentication credentials to the Blazegraph instances that execute the SERVICE subqueries. This means that every SERVICE subquery must either require no authentication or accept the same credentials used in the main query.

@ericprud
Member

I wonder how they keep your credentials from leaking out to non-neptune services. I suppose they check the IP against some list of masks for in-house IP addrs.

I don't think we want to encourage a monolithic trust realm. That said, we can learn from the use cases this enables and re-envision them spread across services hosted by diverse institutions, enabled by bearer tokens and authorized templates and all the other mechanisms we dreamed up.

@TallTed
Member

TallTed commented Mar 22, 2023

I suggest reordering the elements in the title of this issue — as the desire is to add Authentication to Federation. The current "Authentication with Federation" implies adding Federation to Authentication.

@VladimirAlexiev VladimirAlexiev changed the title Authentication with Federation add Authentication to Federation Mar 31, 2023
9 participants