
Apache Marmotta Deployment Within HSLynk

Eric Jahn edited this page Jan 14, 2020 · 35 revisions

See also: https://github.com/servinglynk/hslynk-open-source-docs/wiki/Entity-Support-in-HSLynk

Apache Marmotta is our chosen implementation of "Entity Support," a.k.a. W3C Linked Data, within HSLynk.

To support W3C Linked Data (LD), HSLynk will implement a new microservice to communicate with the data model used in LD, the Resource Description Framework (RDF). To accomplish this, HSLynk will make use of the Linked Data Platform (LDP). LDP is built on the principles of LD and describes the use of HTTP for reading and writing resources from servers that expose their resources as LD.

Some of the advantages that LDP brings are:

  • The use of a well-known data model: RDF.
  • URIs are identifiers.
  • Data can be extended and discovered, which means that it is also flexible.
  • There is a variety of serialization formats, such as RDF/XML, Turtle, and JSON-LD.
  • Support for RDF and non-RDF resources.
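As a concrete illustration of one of these serializations, the sketch below builds a small hypothetical entity and emits it as JSON-LD using only Python's standard library. The context, property names, and URIs are illustrative assumptions, not HSLynk's actual vocabulary.

```python
import json

# A hypothetical "park" entity expressed as JSON-LD. The @context maps
# short property names to full URIs; all URIs here are placeholders.
park = {
    "@context": {
        "name": "http://schema.org/name",
        "partOf": {"@id": "http://example.org/ns#partOf", "@type": "@id"},
    },
    "@id": "http://example.org/ldp/projects/park",
    "name": "Riverside Park",
    "partOf": "http://example.org/ldp/projects",
}

doc = json.dumps(park, indent=2)
print(doc)
```

Because JSON-LD is plain JSON with a reserved `@context`, any JSON tooling can consume it, while RDF-aware tooling can expand it into triples.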

LDP Support

LDP uses containers (LDPC) to represent a collection of links to other resources. For example, the following figure illustrates the use of containers and resources (LDPR). First, we have a projects collection, which has the park and encampment entities. Next, an encampment might have temporary shelters, so a resource can also be a container.
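The container/resource hierarchy just described can be sketched as `ldp:contains` triples. The URIs below are hypothetical placeholders for the projects, park, encampment, and shelter resources; only the `ldp:contains` predicate is the real LDP vocabulary term.

```python
# Sketch of the container hierarchy: a projects container holds a park
# and an encampment, and the encampment (an LDPR that is itself a
# container) holds a shelter. All example.org URIs are placeholders.
LDP_CONTAINS = "http://www.w3.org/ns/ldp#contains"

triples = [
    ("http://example.org/ldp/projects", LDP_CONTAINS,
     "http://example.org/ldp/projects/park"),
    ("http://example.org/ldp/projects", LDP_CONTAINS,
     "http://example.org/ldp/projects/encampment"),
    ("http://example.org/ldp/projects/encampment", LDP_CONTAINS,
     "http://example.org/ldp/projects/encampment/shelter-1"),
]

def members(container, graph):
    """Return the resources a container links to via ldp:contains."""
    return [o for s, p, o in graph if s == container and p == LDP_CONTAINS]

print(members("http://example.org/ldp/projects", triples))
```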

See our HSLynk specific usage docs for LDP: https://github.com/hserv/entity-tracking/wiki/How-to-Use-Linked-Data-Platform-APIs-(LDP)-in-Apache-Marmotta

Architecture

Supporting LD in HSLynk will require the implementation of a control module called the Marmotta Access Layer (MAL) to receive client requests, verify authorization with the HSLynk Access Layer (HAL), and return (if authorized by the HAL) the requested resources from the LDP server to the client. Apache Marmotta is the LD server used. The interaction of HSLynk and the new LD components is illustrated in the following figure.

[Figure: Marmotta high-level architecture]

In general, the Marmotta Access Layer's function could be described as follows:

  • MAL receives a client LDPR request with an OAuth2 token. The token was obtained by the client earlier directly from the HAL (using https://app.swaggerhub.com/apis/hslynk/hmis-authorization_service/1.0.0#/default/GET_authorize ).
  • MAL forwards the request to the HAL for a simple "accept" or "reject" determination (using standard HTTP auth status codes). The HAL checks whether the user has the necessary rights to access the resource. HSLynk determines both authentication and authorization for a specific resource.
  • If the HAL sends back an "accept" message, the MAL fulfills the request for the client, fetching the requested LDPR and transmitting it back to the client.
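The three steps above could be sketched as follows. The HAL check and the Marmotta fetch are stubbed out; the function names, token format, and return values are assumptions for illustration only.

```python
# Sketch of the MAL request path: accept the token, ask the HAL for an
# accept/reject determination, and only fetch the LDPR on "accept".

VALID_TOKENS = {"token-abc"}  # stand-in for the HAL's token store

def hal_authorize(token, resource):
    """Stub for the HAL's accept/reject determination."""
    return token in VALID_TOKENS

def fetch_ldpr(resource):
    """Stub for retrieving the requested LDPR from Marmotta."""
    return {"@id": resource}

def mal_handle(token, resource):
    if not hal_authorize(token, resource):
        return 403, None                   # HAL rejected the request
    return 200, fetch_ldpr(resource)       # HAL accepted; fetch and return

status, body = mal_handle("token-abc", "/project/p1/park")
print(status, body)
```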

The actual pipeline can be split into two steps: 1) the user requests access to HSLynk, and 2) the user makes LDP requests to the MAL.

Getting access to HSLynk

  1. An external third-party client (requester) requests access to the LDP functionality from the HSLynk Access Layer.
  2. The HSLynk Access Layer authenticates the user and returns an OAuth2 token to the requester, which is later used to make LDP request calls.
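The token-acquisition step could be sketched by building the authorization URL a client would call. The query parameter names below are generic OAuth2 assumptions; the authoritative parameter list is in the SwaggerHub spec linked above.

```python
from urllib.parse import urlencode

# Build the URL a client would call to obtain an OAuth2 token from the
# HAL. Base URL, client id, and redirect URI are placeholders; the
# parameter names are generic OAuth2 conventions, not confirmed by HSLynk.
def build_authorize_url(base, client_id, redirect_uri):
    params = urlencode({
        "response_type": "token",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
    })
    return f"{base}/authorize?{params}"

url = build_authorize_url("https://hal.example.org", "my-client",
                          "https://app.example.org/cb")
print(url)
```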

LDP request

  1. A client requester makes an LDP request to MAL.
  2. The MAL forwards the request to the HAL for an accept/reject determination. If the token is invalid, an HTTP 401 (Unauthorized) status is generated. The HAL then checks whether the user is allowed to make the request (e.g., GET, POST, PUT, DELETE) based on the access control information it possesses. If the user does not have sufficient rights, an HTTP 403 (Forbidden) status is returned to the MAL, which forwards it to the client requester.
  3. If the HAL determines the user does have sufficient rights to make the call, an "accept" message is returned to the MAL, whose LDP action component then makes the LDP request to the LDP server.
  4. The requester obtains the results of a call.
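The 401/403 split in step 2 can be sketched as a single decision function. The token and access-control stores are in-memory stand-ins; the real HAL's data model is not described here.

```python
# Sketch of the HAL's determination: an invalid token yields 401, a
# valid token without rights for the HTTP method yields 403, otherwise
# the request is accepted. Stores below are illustrative stand-ins.

TOKENS = {"token-abc": "analyst1"}      # token -> user
ACL = {("analyst1", "GET")}             # (user, method) pairs allowed

def hal_check(token, method):
    user = TOKENS.get(token)
    if user is None:
        return 401                      # invalid token: Unauthorized
    if (user, method) not in ACL:
        return 403                      # insufficient rights: Forbidden
    return 200                          # accept

print(hal_check("token-abc", "GET"),
      hal_check("token-abc", "DELETE"),
      hal_check("nope", "GET"))
```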

An example of LDP support

The following example shows the interaction for a user that has generated an OAuth2 token and is trying to get information about a park.

Full steps:

  1. LDP Authentication request from an external 3rd-party web client to HSLynk Gateway/OAuth 2 Server
  2. HSLynk OAuth 2 Server authenticates and sends back to the 3rd party web client an OAuth2 token they can use to make further requests (no HAL in this step; that comes later)
  3. The 3rd-party web client makes an LDP resource request to the Marmotta Access Layer (MAL) with the OAuth2 token it just obtained attached. LDP requests will have a projectId in the URI (e.g., GET /project/{projectId}/park/tree). There will be an LDP-BC for each /project/{projectId}.
  4. The full LDP request with OAuth token gets forwarded to HSLynk Access Layer (HAL) for an authorization "accept/reject" determination which is sent back to the MAL.
  5. If it receives an "accept" from the HAL, the MAL makes the request to the LDP server.
  6. MAL returns the LDP REST response to the external 3rd-party web client.
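The request shape from step 3 could be sketched as below: a resource URI carrying the projectId, with the OAuth2 token in an Authorization header. The base URL, header scheme, and example values are assumptions.

```python
# Build the LDP request from step 3: the projectId is embedded in the
# URI, and the OAuth2 token rides along as a bearer Authorization
# header (the header scheme is an assumption, not confirmed by HSLynk).
def build_ldp_request(base, project_id, path, token):
    return {
        "method": "GET",
        "url": f"{base}/project/{project_id}/park/{path}",
        "headers": {"Authorization": f"Bearer {token}"},
    }

req = build_ldp_request("https://mal.example.org", "42", "tree", "token-abc")
print(req["url"])
```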

Querying the Triplestore directly via a SPARQL endpoint

  • The users of the Marmotta triplestore SPARQL endpoint would be analysts, doing read-only queries at first. Eventually, we could also create triggers based on SPARQL results. They would be restricted only by project group at first.
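A read-only analyst query of the kind mentioned above could look like the sketch below, which builds a SPARQL SELECT request URL without sending it. The endpoint path, graph vocabulary, and output parameter are illustrative assumptions.

```python
from urllib.parse import urlencode

# An example read-only SPARQL query an analyst might run. The predicate
# and the endpoint path are placeholders, not HSLynk's actual schema.
query = """
SELECT ?project ?name
WHERE {
  ?project <http://schema.org/name> ?name .
}
LIMIT 10
"""

endpoint = "https://marmotta.example.org/sparql/select"
request_url = endpoint + "?" + urlencode({"query": query})
print(request_url)
```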

Syncing to the HSLynk Data Warehouse

  • One-way sync job that syncs from the triplestore to HBase.
  • A scheduler updates the HBase data at a certain time interval.
  • For user access to the HBase Hive views, we already have an LDAP server containing credentials that restrict the views to users by project group. This could eventually be modified to restrict by project and subgroups of projects. We already have this functionality with the global projects.
  • Hive views can be programmed to inspect the RDF in HBase, and use the contents to construct/filter the view. It's not as ad hoc or feature-rich as SPARQL queries, but at least it would be together with the rest of our data already in HBase. It would be interesting though, to do composite/federated queries that joined results found in both the Marmotta triplestore (from a SPARQL query) and from HBase (from a SQL query over Hive).
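The one-way sync job described above could be sketched as follows. Both stores are mocked as in-memory dicts; a real job would read from Marmotta via SPARQL and write to HBase, driven by the scheduler.

```python
# Minimal sketch of the one-way triplestore -> HBase sync. The stores
# are in-memory stand-ins keyed by subject; nothing is ever written
# back to the triplestore, matching the one-way design above.

triplestore = {
    "s1": [("s1", "p", "o1")],
    "s2": [("s2", "p", "o2")],
}
hbase = {}

def sync_once(source, sink):
    """Copy every subject's triples from the source into the sink."""
    for subject, triples in source.items():
        sink[subject] = list(triples)  # overwrite: source is authoritative

sync_once(triplestore, hbase)
print(sorted(hbase))
```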

Future Improvements

A main feature the Marmotta Access Layer could bring to HSLynk is that it would not only grant or deny access per HTTP request, but would also be able to filter information at the triple level. In other words, it would return only the resources a user can query, filtering results based on access restrictions.

To accomplish this, the current LDP action component will need to change by adding the following behavior (see the changes below).

  • An authorization request component makes the LDP request to the LDP server and returns only the triples the user has access to. These triples are filtered based on the policy obtained, so the response contains only resources the user is allowed to see.
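The triple-level filtering could be sketched as below. The policy format (a set of permitted predicates per user) and all URIs are assumptions made for illustration.

```python
# Sketch of triple-level filtering: after fetching a resource's triples
# from the LDP server, keep only those the user's policy allows. The
# policy shape (user -> allowed predicates) is a hypothetical format.

POLICY = {"analyst1": {"http://schema.org/name"}}

triples = [
    ("http://example.org/park", "http://schema.org/name", "Riverside Park"),
    ("http://example.org/park", "http://example.org/ns#ssn", "123-45-6789"),
]

def filter_triples(user, graph, policy):
    """Return only the triples whose predicate the user may read."""
    allowed = policy.get(user, set())
    return [t for t in graph if t[1] in allowed]

print(filter_triples("analyst1", triples, POLICY))
```

This way the sensitive triple (here, an SSN-like value) never leaves the MAL, even though both triples describe the same resource.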

[Figure: revised architecture with triple-level filtering]

An example of the complete pipeline when filtering triples

[Figure: complete pipeline with triple filtering]

Considerations
