Handle huge set of Thing Descriptions (pagination, streaming, etc.) #16

Closed
sebastiankb opened this issue Apr 20, 2020 · 23 comments · Fixed by #213

Comments

@sebastiankb
Contributor

sebastiankb commented Apr 20, 2020

Use case
A TD directory manages a huge set of TDs, maybe around 1000-10000. A client queries the TD directory, and about 1000 TDs match.

Problem statement
How are the 1000 TDs returned to the client in a resource-efficient way? Will the response be one huge file that encapsulates all TDs? Will the TDs be split into blocks that are returned one at a time? Or will there be a stream?

Initial brainstorming

  1. TDs may be encoded with, e.g., EXI or CBOR
  2. Rely on a standard that handles similar use cases (e.g., Block-Wise Transfers)
  3. How would we page through a set of TDs?
  4. How do established graph databases deal with such use cases?
@danielpeintner
Contributor

Yet another possibility would be that a discovery call simply returns a list of links to the actual TDs.

@egekorkan
Contributor

egekorkan commented Apr 20, 2020

Edit: Sorry, your 3rd point was added later on and I didn't see it in my email. My comment was talking about this exactly.

Another idea would be to treat it like a search engine, where the most relevant results are placed first, e.g. 10 TDs, which hopefully does not make a huge document. The client then goes to other "pages" and looks again. Depending on its processing capabilities, the client can ask for more TDs in the first request, much as a shopping website can show 100 items per page based on user preference.

@relu91
Member

relu91 commented Apr 21, 2020

IMHO pagination is the best option. We just need to represent a set of TDs; it could be JSON or JSON-LD or more efficient formats as presented in point 1.

Yet another possibility would be that a discovery call simply returns a list of links to the actual TDs.

A more sophisticated approach could be a Level Of Detail method, in which only partial TDs are returned. How 'partial' the TDs are would be specified in the search request. The problem with returning only links is that clients would probably end up fetching every single TD behind them, which just creates networking overhead. Anyway, I still prefer pagination over this method.

How do established graph databases deal with such use cases?

For databases that support SPARQL (i.e. SPARQL endpoints), pagination is handled with a combination of the OFFSET, LIMIT, and ORDER BY query modifiers. Basically, you order by a criterion and then select a portion of the whole matched RDF dataset using a starting pointer (OFFSET) and a length parameter (LIMIT). Since the order will be the same in subsequent queries, you can "navigate" through portions of the result.
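
As a rough illustration of the OFFSET/LIMIT/ORDER BY approach, here is a minimal Python sketch that only builds the query strings for each page. The class IRI and the function names `paged_query` and `page_queries` are assumptions made for this example, not part of any agreed API.

```python
from typing import List

# Assumed class IRI for Things; adjust to whatever vocabulary the directory uses.
TD_THING = "http://www.w3.org/ns/td#Thing"


def paged_query(offset: int, limit: int) -> str:
    """Build a SPARQL query that selects one page of Things in a stable order."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    return (
        "SELECT ?thing WHERE { ?thing a <" + TD_THING + "> }\n"
        "ORDER BY ?thing\n"      # same ordering in every query keeps pages consistent
        f"LIMIT {limit}\n"       # page length
        f"OFFSET {offset}"       # starting pointer into the ordered result
    )


def page_queries(total: int, limit: int) -> List[str]:
    """Return one query per page needed to cover `total` matches."""
    return [paged_query(offset, limit) for offset in range(0, total, limit)]
```

Because every query uses the same ORDER BY, consecutive offsets step through the result without overlap, provided the dataset does not change between requests.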

@mmccool
Contributor

mmccool commented May 4, 2020

  1. Let's assume all queries return a set of results. In our case they could be everything from links to full TDs.
  2. Pagination is probably essential to make sure each transmission is not too big.
  3. Minimum page size can be one item; we should probably assume that a single item is not too big (although in theory a TD could be very, very large... but if the "profile" we are discussing limits the size of TDs, then we could still make it finite). Alternatively we could add this as an option, but that ends up being complicated; since TCP already takes care of this, my personal preference is to let that layer deal with it. But what if we want to run the directory service over CoAP/UDP? I don't think it's unreasonable for a "first cut" to specify only an HTTP/TCP API.

@zolkis

zolkis commented May 5, 2020

Separation of concerns.

A client could pass options to the directory service: whether it wants only URLs, URLs plus an intro, or full TDs, together with parameters of the response (size, format, etc.).

When a Thing Directory has lots of data, it might want to expose a different service/API (with subscription, pagination, etc.) than when it has a relatively simple set of TDs. The two cases scale differently, so this is also in the server's interest.

On the client side, with HTTP we should generally be able to use the Fetch standard, which allows handling Responses via a stream reader, or as a blob, ArrayBuffer, JSON, string, etc. It allows options/URI variables, cross-origin policies, etc. We might just want a convenient wrapper on top (in Scripting).

Also, with WSS we could have a sub-protocol for handling this.

For other (eventually supported) protocols, the Thing Directory implementation should handle flow control, following the specifics of that protocol.

@mmccool
Contributor

mmccool commented Oct 12, 2020

We need to add an Editor's note about this to the FPWD. Farshid will create a PR. This will not close this issue, it will just point it out in the draft.

farshidtz added a commit to farshidtz/w3c-wot-discovery that referenced this issue Oct 13, 2020
@mmccool
Contributor

mmccool commented Nov 9, 2020

pagination was discussed in https://w3c.github.io/w3c-api/
see also #93

@mmccool mmccool changed the title Handle huge set of Thing Descriptions Handle huge set of Thing Descriptions (pagination) Feb 8, 2021
@mmccool mmccool changed the title Handle huge set of Thing Descriptions (pagination) Handle huge set of Thing Descriptions (pagination, streaming, etc.) Feb 8, 2021
@mmccool
Contributor

mmccool commented Feb 15, 2021

Discussion in discovery call Feb 15:

  • Could also put pagination information INTO enriched TD.
    • Then each response is an (enriched) TD.
    • But then can only return one TD at a time (and this increases the number of responses)
    • Note enriched TD increases size, which may exceed profile limits
  • Could also put pagination information into HTTP headers (what about CoAP, etc?)
    • Perhaps we could ALSO put it in headers
  • pagination is also expensive, e.g. do we need to count TDs?
    • perhaps separate query that JUST does counting
    • In some cases we only care about "1" vs "many"
    • is pagination RESTful?
  • Maybe just a "next" link when there are more?
    • link can include an embedded query
    • Does not allow skipping pages
    • Does not allow parallel requests
  • Go back to idea of returning a set of links/ids...
    • avoid issue that response is "not RDF" with a wrapper object
    • may still be a lot of links, so still need to paginate over links
    • follow-up queries should be able to grab multiple APIs at once
  • Worthwhile looking at
  • Maybe have different sub-APIs for different use cases
  • Note:
    • for queries, JSONpath (?), SPARQL, and Xpath all have their own indexing schemes, so pagination interface is already defined (indexing)
    • so this is just for RESTful API

Use cases:

  • Want to get everything, e.g. dashboard display; copying/archive/shadow the database
  • Want to determine if there is 1 or many of some "kind" of Thing (pairing switch and light); but this is query, not RESTful API

Requirements:

  • capture...

@mmccool
Contributor

mmccool commented Feb 15, 2021

Next steps:

  • Comment on this issue... need use cases and requirements
  • Capture and document options (here, or in a linked document) in concrete proposals (@farshidtz to provide at least one PR)
  • Large TDs and profiles, streaming

@wiresio
Member

wiresio commented Feb 16, 2021

Yet another link worth looking at (from a former W3C community group):
https://www.w3.org/community/hydra/wiki/Pagination

@relu91
Member

relu91 commented Feb 18, 2021

For convenience here's the link to Github pagination documentation:
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#pagination

@farshidtz
Member

The W3C API Spec, shared by @ashimura.

We are also collecting some pagination practices at linksmart/thing-directory#6. I really like Github's API but returning the links in headers will not be possible with CoAP. I don't know if we should limit ourselves in the HTTP API design with regards to what is possible with CoAP.

@wiresio
Member

wiresio commented Feb 19, 2021

Proposal for extension of discovery-context and a sample response (please see inline comments):

{
   "@context":{
      "discovery":"https://www.w3.org/2021/wot/discovery#",
      "tdd":"https://www.w3.org/2021/wot/discovery#",
      "dcterms":"http://purl.org/dc/terms/",
      "DirectoryDescription":{
         "@id":"discovery:DirectoryDescription"
      },
      "LinkDescription":{
         "@id":"discovery:LinkDescription"
      },
      "thingGraph":{
         "@id":"discovery:ThingGraph",
         "dcterms:description":"A graph of things, basically a shorthand for a named json-ld @graph, following: https://w3c.github.io/json-ld-syntax/#named-graph-data-indexing",
         "@container":[
            "@graph",
            "@index"
         ]
      },
      "pagination":{
         "@id":"discovery:Pagination",
         "dcterms:description":"A block of pagination information, inspired by: https://www.w3.org/community/hydra/wiki/Pagination#PartialCollection",
         "@type":"@none"
      }
   }
}

{
   "@context":[
      "https://www.w3.org/2019/wot/td/v1",
      "https://w3c.github.io/wot-discovery/context/discovery-context.jsonld"
   ],
   "@id":"urn:my.tdd.response",
   "name":"My TDD response",
   "base":"http://server:port",		// Could we allow inheritance for TDs listed in "thingGraph"?
   "version":{ ... },
   "securityDefinitions":{ ... },	// Could we allow inheritance for TDs listed in "thingGraph"?
   "thingGraph":{
      "thing_000001":{ ... },
      "thing_000002":{ ... },
      "thing_000010":{ ... },
      "thing_000020":{ ... },
      "thing_000100":{ ... }
   },
   "pagination":{
      "size":5,
      "self":"a relative path in here",
      "next":"a relative path and / or query string in here"
   }
}

@farshidtz
Member

I think a TD is not really useful for describing a page of a TD collection. The directory will already have another TD describing the APIs at the top level.

I prefer a simple response containing only what is necessary. For the collection object, array is better than dictionary because of size (no duplicate key/id) and order (sorting by attributes other than key).

With query parameters such as page, per_page, count (and optionally order_by), the response could be:

If not using HTTP headers, everything in body:

{
    "@context": "<discovery or tdd context>",
    "@type": "Collection", // or TDCollection
    "items": [ {TD}, ... ], // or tds
    "page": 1,
    "perPage": 100,
    "total": 350 // if ?count=true
}

If using HTTP headers, body:

[ {TD}, ... ]

Content-Range header:
Content-Range: TDs 0-99/350 if ?count=true
Content-Range: TDs 0-99/*

Optional Link header for self, next links:
Link: </tds?page=2&per_page=100>; rel="next"
Link: </tds?page=1&per_page=100>; rel="self", </tds?page=2&per_page=100>; rel="next"
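
To make the header-based variant concrete, here is a minimal Python sketch of how a server might derive the Content-Range and Link values for one page. The function name `pagination_headers` is hypothetical, and for simplicity the sketch assumes the server knows the total even when the client did not ask for it (a real directory might avoid counting in that case).

```python
from typing import Dict


def pagination_headers(page: int, per_page: int, total: int, count: bool = False) -> Dict[str, str]:
    """Return Content-Range and Link header values for one page of TDs."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    first = (page - 1) * per_page              # zero-based index of the first TD on the page
    if first >= total:
        raise ValueError("page is out of range")
    last = min(first + per_page, total) - 1    # zero-based index of the last TD on the page

    # The total is only disclosed when the client asked for it with ?count=true.
    size = str(total) if count else "*"

    links = [f'</tds?page={page}&per_page={per_page}>; rel="self"']
    if last + 1 < total:                       # more TDs remain, so advertise a next page
        links.append(f'</tds?page={page + 1}&per_page={per_page}>; rel="next"')

    return {
        "Content-Range": f"TDs {first}-{last}/{size}",
        "Link": ", ".join(links),
    }
```

Calling it with page 1, 100 TDs per page, and a total of 350 yields the same header values as in the proposal above; on the last page no "next" link is emitted.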

@wiresio
Member

wiresio commented Feb 22, 2021

I agree on having a simple response (format). However, what is necessary depends on the individual use cases. So, defining some parts as mandatory and other parts as optional might be the solution here. Wrt "base" and "securityDefinitions" inheritance, I think we could leave this out for the moment, as I'm not sure whether this is actually possible in JSON-LD.

Concerning the "container" in which TDs are wrapped: I'd like to have an option to name the respective TDs and to create shortcuts with the help of the TD names, e.g. for describing links between things as a (proprietary / optional) part of my response, without needing to dig into the individual TDs (for the name and / or links section). I'd assume that this doesn't complicate the server-side implementation too much and removes a lot of burden from the client side. Therefore I propose an object instead of an array for it. I would not name this container "items", since in the TD "items" is already "Used to define the characteristics of an array".

Using the HTTP header for transporting pagination or other additional information should not be considered.

@zolkis

zolkis commented Feb 22, 2021

Use cases (clients) of a directory service could be using various protocols (HTTP, CoAP, MQTT, etc).
The server needs to adapt to the clients' capabilities (protocol, parameters, options).
Various IoT protocols will want to control network bandwidth, battery consumption, etc., so clients might want to pass options to the server.

For HTTP, we have streaming support and libraries to handle transparent streaming.
For CoAP, there is streaming that is based on the observe pattern, i.e. a form of pagination.
For MQTT there are flows over TCP.

One of the "best" common mechanisms would be a generic streaming API (the reply is a stream of TDs), which is easily implementable on most protocols given the existing libraries, but if someone needs to implement from scratch, it will be a variation of an observe/pagination/indexing mechanism.

So IMHO the best common mechanism for IoT discovery would be an observe pattern, something like what is spec'd in the Scripting API (there page size is 1 at the API level, but could be more on the wire).

We need to distinguish between chunks of a TD and pages of TDs. Any given response can be either a TD chunk or a page of TDs (but not a mix), for instance when a few huge TDs and a lot of small TDs all match the discovery query.

Segmentation/reassembly could be handled seamlessly by the runtime (or by the Scripting implementation, where there is one), or by the application if it requests so (for instance when buffers are so small that a full TD cannot be processed in situ). That is a quite unlikely scenario, but solvable with a request option.

@mmccool
Contributor

mmccool commented Feb 22, 2021

Could have arrays of objects, where objects have metadata + tds, like:

     [  { "id": <local_id>,
           "td": {<TD>}
         },
         ...
      ]
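
A minimal Python sketch of building such an array on the directory side follows. The function name `wrap_tds` and the fallback id scheme (a prefix plus the list position) are placeholders chosen for illustration; a real directory would need its own stable local ids.

```python
from typing import Any, Dict, List


def wrap_tds(tds: List[Dict[str, Any]], id_prefix: str = "local-") -> List[Dict[str, Any]]:
    """Wrap each TD in an object carrying a local id, preserving the input order."""
    wrapped = []
    for index, td in enumerate(tds):
        # Reuse the TD's own id when it has one; otherwise fall back to a
        # placeholder id derived from the position in the list.
        local_id = td.get("id", f"{id_prefix}{index}")
        wrapped.append({"id": local_id, "td": td})
    return wrapped
```

Since the result is an array, the order chosen by the directory survives serialization, and TDs without an "id" still get a handle the client can refer to.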

@mmccool
Contributor

mmccool commented Feb 22, 2021

Issues/proposals:

  • object vs array for collections of TDs
    • sorting order
    • local ids (dealing with the fact that ids in TDs are optional)
    • could return an array of objects that include both metadata and TDs
  • wrappers
    • TD with "thingGraph" section
      • response is not really describing a Thing
    • special JSON wrapper
      • special data schema for this response; not a TD
      • contradicts choice to use enriched TDs
    • extra information in headers
      • does not map well to other protocols
  • alternatives
    • reusing existing standards
    • WebAPI
    • Fetch

Concerns:

  • Future extension to other protocols; necessary for small devices
    • Rules out header approach
  • Sorting order stability
    • Rules out objects for TD collections; use arrays to be safe on consistent ordering

Other comments:

  • use cases
    • need some examples

@mmccool
Contributor

mmccool commented Feb 22, 2021

Let's look at examples here and follow them: linksmart/thing-directory#6
Notes:

  • arrays typically used for results
  • links often used for "next" (allows embedding of session ID, etc)

@farshidtz
Member

The draft spec for paginated listing has been added with the following response model:

{
    "@context": "<discovery context>",
    "id": "/td?offset=0&limit=10",
    "type": "Collection",
    "items": [ 
        {
            "@context": "https://www.w3.org/2019/wot/td/v1",
            "id": "urn:example:simple-td",
            "title": "Simple TD",
            "security": "basic_sc",
            "securityDefinitions": {
                "basic_sc": {
                    "scheme": "basic"
                }
            }
        }, 
        ... nine more TDs
    ],
    "total": 350,
    "nextLink": "/td?offset=10&limit=10"
}

Streaming support is left as optional and possible via server-driven content negotiation, without any specification.

The listing section of the draft spec: anchored link - API spec is not yet available.
Feedback about this is ongoing at #125
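
For illustration, a client could walk the whole collection by following "nextLink" until it is absent. The Python sketch below takes the fetch function as a parameter so that it stays independent of any HTTP library; the name `iter_tds` and the assumption that the last page simply omits "nextLink" are mine, not taken from the draft.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_tds(fetch: Callable[[str], Page],
             first: str = "/td?offset=0&limit=10") -> Iterator[Dict[str, Any]]:
    """Yield every TD in the collection by following nextLink from page to page."""
    url: Optional[str] = first
    seen = set()
    while url is not None:
        if url in seen:
            # Guard against a server that links back to an earlier page.
            raise RuntimeError(f"pagination loop detected at {url}")
        seen.add(url)
        page = fetch(url)                 # expected to return the decoded JSON body
        yield from page.get("items", [])
        url = page.get("nextLink")        # assumed to be missing on the last page
```

With a real HTTP client, `fetch` would resolve the relative link against the directory's base URL and decode the JSON body. The generator keeps only one page in memory at a time, which is the main point of paginating.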

@AndreaCimminoArriaga
Contributor

Regarding the solution using headers: there is the current Linked Data Platform Paging 1.0, which, as I commented on issue #54, would solve some problems related to nested contexts or namespace collisions.

@farshidtz
Member

farshidtz commented Mar 9, 2021

Regarding the solution using headers: there is the current Linked Data Platform Paging 1.0, which, as I commented on issue #54, would solve some problems related to nested contexts or namespace collisions.

I've taken a deeper look at LDP Paging 1.0. It does not describe pagination of a single TD in JSON form (other listing mechanisms may allow that). Otherwise it is the same as the header-based proposal above, with a few additions. Following that and adding our requirements, the operation can be as follows:

Request /td will serve the default first page (e.g. /td?offset=0&limit=1)

Request /td?offset=0&limit=10

Response body:

[ {TD}, ... nine other TDs ... ]

Response headers:

Link:
Link: </td?offset=10&limit=10>; rel="next" --> if there are more TDs
Link: </td>; rel="canonical"; etag="<state identifier>" --> etag is an identifier which must represent the current state of the TD collection. It could be a version, unix timestamp, or UUID, set as soon as a TD is added/removed[/updated]

Optional links:
Link: <http://www.w3.org/ns/td#Thing>; rel="type", <http://www.w3.org/ns/ldp#Page>; rel="type"
Link: </td?offset=0&limit=10>; rel="self"

Content-Range:
Content-Range: TDs 0-9/350 if ?count=true
Content-Range: TDs 0-9/*
This is not defined in LDP Paging 1.0, but necessary to return the total and also to allow parallel queries. The client may query with limit=0&count=true to only get the total.
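
As a sketch of how a client might use the total for parallel queries, the Python below parses the two Content-Range forms shown above and then lists the page URLs that could be requested concurrently. The function names are hypothetical, and since the header shape of a limit=0 response is not specified here, the parser deliberately handles only the `TDs first-last/total` and `TDs first-last/*` forms.

```python
import re
from typing import List, Optional, Tuple

# Matches "TDs <first>-<last>/<total>" where <total> may be "*" (unknown).
_CONTENT_RANGE = re.compile(r"^TDs (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Return (first, last, total); total is None when the server sent '*'."""
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


def page_urls(total: int, limit: int) -> List[str]:
    """List the URLs of all pages so that they can be fetched in parallel."""
    if limit < 1:
        raise ValueError("limit must be >= 1")
    return [f"/td?offset={offset}&limit={limit}" for offset in range(0, total, limit)]
```

Once the total is known, the offsets are fully determined, so the pages no longer have to be discovered one "next" link at a time.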

farshidtz added a commit to farshidtz/w3c-wot-discovery that referenced this issue Mar 9, 2021
farshidtz added a commit to farshidtz/w3c-wot-discovery that referenced this issue Mar 9, 2021
@farshidtz farshidtz mentioned this issue Apr 13, 2021
3 tasks
@mmccool
Contributor

mmccool commented Jul 26, 2021

While technically solved, the API may still change (there are some ongoing discussions and PRs) so will keep it open for now.

mmccool added a commit that referenced this issue Jul 26, 2021
Add alternative payload format (2).  Minor cleanups should be in followup issues and PRs.  Should close issue #16 also.

9 participants