Handle huge set of Thing Descriptions (pagination, streaming, etc.) #16

sebastiankb · 2020-04-20T12:52:53Z

Use case
A TD Directory manages a huge set of TDs, maybe around 1000-10000 TDs. A client queries the TD directory, and about 1000 TDs match.

Problem statement
How are the 1000 TDs returned to the client in a resource-efficient way? Will this be one huge file in which all TDs are encapsulated? Or will the TDs be fragmented into blocks and returned block by block? Or will there be a stream?

Initial brainstorming

TDs may be encoded compactly, e.g. with EXI/CBOR
Rely on a standard that handles similar use cases (e.g., Block-Wise Transfers)

How would we do something like a paging through a set of TDs?
How do established graph databases deal with such use cases?

danielpeintner · 2020-04-20T15:16:22Z

Yet another possibility would be that a discovery call simply returns a list of links to the actual TDs.

egekorkan · 2020-04-20T17:16:31Z

Edit: Sorry, your 3rd point was added later on and I didn't see it in my email. My comment was talking about this exactly.

Another idea would be to treat it like a search engine, where the most relevant results are placed first, e.g. 10 TDs, which hopefully does not make a huge document. The client then goes to other "pages" and looks again. Based on its processing capabilities, the client can ask for more TDs in the first request, much as a shopping website can display 100 items per page based on user preference.

relu91 · 2020-04-21T14:39:41Z

IMHO pagination is the best option. We just need to represent a set of TDs; it could be JSON or JSON-LD or more efficient formats as presented in point 1.

> Yet another possibility would be that a discovery call simply returns a list of links to the actual TDs.

A more sophisticated approach could be a Level of Detail method, so that only partial TDs are returned. How 'partial' the TD is would be specified inside the search request. The problem with returning only links is that clients would probably end up fetching every single TD behind them, creating extra networking overhead. Anyway, I still prefer pagination over this method.
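To make the Level of Detail idea concrete, here is a minimal sketch. It assumes the search request carries a list of top-level TD members to keep; the function name `partial_td` and the `fields` parameter are hypothetical and not part of any draft.

```python
from typing import Any

def partial_td(td: dict[str, Any], fields: list[str]) -> dict[str, Any]:
    """Return a copy of `td` keeping only the requested top-level members."""
    # Hypothetical helper: members that are not present in the TD are skipped.
    return {key: td[key] for key in fields if key in td}

td = {
    "@context": "https://www.w3.org/2019/wot/td/v1",
    "id": "urn:example:simple-td",
    "title": "Simple TD",
    "security": "basic_sc",
    "securityDefinitions": {"basic_sc": {"scheme": "basic"}},
}

# A client interested only in identification would request "id" and "title".
summary = partial_td(td, ["id", "title"])
print(summary)
```

A partial TD produced this way is no longer a complete TD, so a client would still need a way to fetch the full document when it needs the rest.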

> How do established graph databases deal with such use cases?

For databases that support SPARQL (i.e. SPARQL endpoints), pagination is handled with a combination of the OFFSET, LIMIT and ORDER BY query modifiers. Basically, you order by a criterion and then select a portion of the whole matched RDF dataset using a starting pointer (OFFSET) and a length parameter (LIMIT). Since the order will be the same in subsequent queries, you can "navigate" through portions of the result.
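As a rough illustration of the ORDER BY / OFFSET / LIMIT pattern, the sketch below only builds the query strings for successive pages; it does not talk to an endpoint. The base query, the `td:Thing` class IRI and the helper name `paged_query` are assumptions chosen for the example.

```python
# Assumed base query: select every resource typed as a Thing.
BASE_QUERY = "SELECT ?td WHERE { ?td a <https://www.w3.org/2019/wot/td#Thing> }"

def paged_query(base_query: str, order_by: str, offset: int, limit: int) -> str:
    """Append solution modifiers that select one page of a stable ordering."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    # ORDER BY keeps the ordering identical across requests, so OFFSET/LIMIT
    # windows do not overlap or skip results (as long as the data is unchanged).
    return f"{base_query}\nORDER BY {order_by}\nOFFSET {offset}\nLIMIT {limit}"

page_size = 10
# Queries for the first three pages: offsets 0, 10 and 20.
queries = [paged_query(BASE_QUERY, "?td", page * page_size, page_size) for page in range(3)]
print(queries[2])
```

Note that the windows are only consistent while the dataset does not change between requests.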

mmccool · 2020-05-04T14:56:58Z

Let's assume all queries return a set of results. In our case they could be everything from links to full TDs.
Pagination is probably essential to make sure each transmission is not too big.
Minimum page size can be one item; we should probably assume that a single item is not too big (although in theory a TD could be very, very large... but if the "profile" we are discussing limits the size of TDs, then we could still make it finite). Alternatively we could add this as an option, but it ends up being complicated; since TCP already takes care of this, my personal preference is to let that layer deal with it. But what if we want to run the directory service over CoAP/UDP? I don't think it's unreasonable for a "first cut" to specify only an HTTP/TCP API.

zolkis · 2020-05-05T07:50:40Z

Separation of concerns.

A client could pass options to the directory service: whether it only wants URLs, or URL and intro, or full TDs, together with parameters of the response (size, format etc).

When a Thing Directory has lots of data, it might want to expose a different service/API (with subscription, pagination, etc) than when it has a relatively simple set of TDs. They scale differently, so it's also in the server's interest.

On the client side, generally with HTTP we should be able to use the Fetch standard, which allows handling Responses via stream reader, or blob, or arraybuffer, json, string etc. It allows options/URI variables, cross-origin policies etc. We might just want a convenient wrapper on top (in Scripting).

Also, with WSS we could have a sub-protocol for handling this.

For other (eventually supported) protocols, the Thing Directory implementation should handle flow control, following the specifics of that protocol.

mmccool · 2020-10-12T14:49:56Z

We need to add an Editor's note about this to the FPWD. Farshid will create a PR. This will not close this issue, it will just point it out in the draft.

mmccool · 2020-11-09T15:51:59Z

pagination was discussed in https://w3c.github.io/w3c-api/
see also #93

mmccool · 2021-02-15T16:54:30Z

Example of pagination result (TD-by-TD): https://demo.linksmart.eu/thing-directory/td

Discussion in discovery call Feb 15:

Could also put pagination information INTO enriched TD.
- Then each response is an (enriched) TD.
- But then can only return one TD at a time (and this increases the number of responses)
- Note enriched TD increases size, which may exceed profile limits
Could also put pagination information into HTTP headers (what about CoAP, etc?)
- Perhaps we could ALSO put it in headers
pagination is also expensive, e.g. do we need to count TDs?
- perhaps separate query that JUST does counting
- In some cases we only care about "1" vs "many"
- is pagination RESTful?
Maybe just a "next" link when there are more?
- link can include an embedded query
- Does not allow skipping pages
- Does not allow parallel requests
Go back to idea of returning a set of links/ids...
- avoid issue that response is "not RDF" with a wrapper object
- may still be a lot of links, so still need to paginate over links
- follow-up queries should be able to grab multiple APIs at once
Worthwhile looking at
- github API
- SPARQL
- JSONpath indexing
- W3C API for pagination: https://w3c.github.io/w3c-api/
Maybe have different sub-APIs for different use cases
Note:
- for queries, JSONpath (?), SPARQL, and Xpath all have their own indexing schemes, so pagination interface is already defined (indexing)
- so this is just for RESTful API
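The "next" link option discussed above can be sketched from the client side as follows. This is only an illustration: the page shape (`items`, `next`), the URLs and the in-memory `PAGES` table standing in for HTTP requests are all assumptions.

```python
from typing import Any, Callable, Iterator, Optional

Page = dict[str, Any]

def iterate_items(fetch: Callable[[str], Page], first_url: str) -> Iterator[Any]:
    """Yield every item by following "next" links until none is returned."""
    url: Optional[str] = first_url
    while url is not None:
        page = fetch(url)
        yield from page["items"]
        # The server decides what the link contains (e.g. an embedded query).
        url = page.get("next")

# In-memory stand-in for a directory; a real client would issue requests here.
PAGES: dict[str, Page] = {
    "/td?page=1": {"items": ["td-a", "td-b"], "next": "/td?page=2"},
    "/td?page=2": {"items": ["td-c"]},
}

all_items = list(iterate_items(PAGES.__getitem__, "/td?page=1"))
print(all_items)
```

The loop is strictly sequential, which reflects the two drawbacks noted in the discussion: the client can neither skip pages nor request them in parallel.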

Use cases:

Want to get everything, e.g. dashboard display; copying/archive/shadow the database
Want to determine if there is 1 or many of some "kind" of Thing (pairing switch and light); but this is query, not RESTful API

Requirements:

capture...

mmccool · 2021-02-15T16:59:06Z

Next steps:

Comment on this issue... need use cases and requirements
Capture and document options (here, or in a linked document) in concrete proposals (@farshidtz to provide at least one PR)
Large TDs and profiles, streaming

wiresio · 2021-02-16T14:15:05Z

Yet another link worth looking at (from a former W3C community group):
https://www.w3.org/community/hydra/wiki/Pagination

relu91 · 2021-02-18T10:17:16Z

For convenience here's the link to Github pagination documentation:
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#pagination

farshidtz · 2021-02-18T13:45:06Z

The W3C API Spec, shared by @ashimura.

We are also collecting some pagination practices at linksmart/thing-directory#6. I really like GitHub's API, but returning the links in headers will not be possible with CoAP. I don't know if we should limit ourselves in the HTTP API design with regard to what is possible with CoAP.

wiresio · 2021-02-19T12:22:43Z

Proposal for extension of discovery-context and a sample response (please see inline comments):

{
   "@context":{
      "discovery":"https://www.w3.org/2021/wot/discovery#",
      "tdd":"https://www.w3.org/2021/wot/discovery#",
      "dcterms":"http://purl.org/dc/terms/",
      "DirectoryDescription":{
         "@id":"discovery:DirectoryDescription"
      },
      "LinkDescription":{
         "@id":"discovery:LinkDescription"
      },
      "thingGraph":{
         "@id":"discovery:ThingGraph",
         "dcterms:description":"A graph of things, basically a shorthand for a named json-ld @graph, following: https://w3c.github.io/json-ld-syntax/#named-graph-data-indexing",
         "@container":[
            "@graph",
            "@index"
         ]
      },
      "pagination":{
         "@id":"discovery:Pagination",
         "dcterms:description":"A block of pagination information, inspired by: https://www.w3.org/community/hydra/wiki/Pagination#PartialCollection",
         "@type":"@none"
      }
   }
}

{
   "@context":[
      "https://www.w3.org/2019/wot/td/v1",
      "https://w3c.github.io/wot-discovery/context/discovery-context.jsonld"
   ],
   "@id":"urn:my.tdd.response",
   "name":"My TDD response",
   "base":"http://server:port",		// Could we allow inheritance for TDs listed in "thingGraph"?
   "version":{ ... },
   "securityDefinitions":{ ... },	// Could we allow inheritance for TDs listed in "thingGraph"?
   "thingGraph":{
      "thing_000001":{ ... },
      "thing_000002":{ ... },
      "thing_000010":{ ... },
      "thing_000020":{ ... },
      "thing_000100":{ ... }
   },
   "pagination":{
      "size":5,
      "self":"a relative path in here",
      "next":"a relative path and / or query string in here"
   }
}

farshidtz · 2021-02-20T14:29:38Z

I think a TD is not really useful for describing a page of a TD collection. The directory will already have another TD describing the APIs at the top level.

I prefer a simple response containing only what is necessary. For the collection object, array is better than dictionary because of size (no duplicate key/id) and order (sorting by attributes other than key).

With query parameters such as page, per_page, count (and order_by), the response could be:

If not using HTTP headers, everything in body:

{
    "@context": "<discovery or tdd context>",
    "@type": "Collection", // or TDCollection
    "items": [ {TD}, ... ], // or tds
    "page": 1,
    "perPage": 100,
    "total": 350 // if ?count=true
}

If using HTTP headers, body:

[ {TD}, ... ]

Content-Range header:
Content-Range: TDs 0-99/350 if ?count=true
Content-Range: TDs 0-99/*

Optional Link header for self, next links:
Link: </tds?page=2&per_page=100>; rel="next"
Link: </tds?page=1&per_page=100>; rel="self", </tds?page=2&per_page=100>; rel="next"
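A minimal sketch of how a server might compute the header values shown above, assuming 1-based `page` numbers and a `/tds` path as in the example. The function name and its parameters are hypothetical.

```python
from typing import Optional

def listing_headers(path: str, page: int, per_page: int, item_count: int,
                    has_next: bool, total: Optional[int] = None) -> dict[str, str]:
    """Build Content-Range and Link header values for one page of TDs."""
    if page < 1 or per_page < 1 or item_count < 1:
        raise ValueError("page, per_page and item_count must be >= 1")
    first = (page - 1) * per_page      # zero-based index of the first TD
    last = first + item_count - 1      # zero-based index of the last TD
    # "*" means the total is unknown, e.g. when ?count=true was not requested.
    size = "*" if total is None else str(total)
    links = [f'<{path}?page={page}&per_page={per_page}>; rel="self"']
    if has_next:
        links.append(f'<{path}?page={page + 1}&per_page={per_page}>; rel="next"')
    return {
        "Content-Range": f"TDs {first}-{last}/{size}",
        "Link": ", ".join(links),
    }

headers = listing_headers("/tds", page=1, per_page=100, item_count=100,
                          has_next=True, total=350)
print(headers["Content-Range"])
print(headers["Link"])
```

An empty page is rejected here for simplicity; how a range should be expressed for an empty result is left open in this sketch.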

wiresio · 2021-02-22T08:27:57Z

I agree on having a simple response (format). However, what is necessary depends on the individual use cases. So defining some parts as mandatory and other parts as optional might be the solution here. Wrt "base" and "securityDefinitions" inheritance, I think we could leave this out for the moment, as I'm not sure whether this is actually possible in JSON-LD.

Concerning the "container" in which TDs are wrapped: I'd like to have an option to name the respective TDs and to create shortcuts with the help of the TD names, e.g. for describing links between things as a (proprietary / optional) part of my response, without needing to dig into the individual TDs (for the name and / or links section). I'd assume that this doesn't complicate the server-side implementation too much and removes a lot of burden from the client side. Therefore I propose an object instead of an array for it. I would not name this container "items", since in the TD "items" is already "Used to define the characteristics of an array".

Using the HTTP header for transporting pagination or other additional information should not be considered.

zolkis · 2021-02-22T09:41:39Z

Use cases (clients) of a directory service could be using various protocols (HTTP, CoAP, MQTT, etc).
The server needs to adapt to the clients' capabilities (protocol, parameters, options).
Various IoT protocols will want to control network bandwidth, battery consumption etc. So they might want to pass options to the server.

For HTTP, we have streaming support and libraries to handle transparent streaming.
For CoAP, there is streaming that is based on the observe pattern, i.e. a form of pagination.
For MQTT there are flows over TCP.

One of the "best" common mechanisms would be a generic streaming API (the reply is a stream of TDs), which is easily implementable on most protocols given the existing libraries, but if someone needs to implement from scratch, it will be a variation of an observe/pagination/indexing mechanism.

So IMHO the best common mechanism for IoT discovery would be an observe pattern, something like what is spec'd in the Scripting API (there page size is 1 at the API level, but could be more on the wire).

We need to distinguish between chunks of a TD and pages of TDs; any given response can be either a TD chunk or a page of TDs (but not a mix), for instance when we have a few huge TDs and a lot of small TDs, all of which match the discovery query.

Segmentation/reassembly could be handled seamlessly by the runtime (or the Scripting implementation, where there is Scripting), or by the application if it requests so (for instance when buffers are so small that a full TD cannot be processed in-situ). This is quite an unlikely scenario, but it would be solvable with a request option.

mmccool · 2021-02-22T16:33:50Z

Could have arrays of objects, where objects have metadata + tds, like:

     [  { "id": <local_id>,
           "td": {<TD>}
         },
         ...
      ]
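The array-of-objects shape above could be produced roughly as follows. This sketch assumes the TD's own `id` is reused when present and that a directory-local identifier is generated otherwise (TD ids are optional); the `urn:local:` scheme and the function name `wrap_tds` are made up for illustration.

```python
from typing import Any

def wrap_tds(tds: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Wrap each TD in an object carrying metadata (here only a local id)."""
    wrapped = []
    for index, td in enumerate(tds):
        # Hypothetical fallback id for TDs that do not declare their own.
        local_id = td.get("id") or f"urn:local:{index}"
        wrapped.append({"id": local_id, "td": td})
    return wrapped

tds = [
    {"id": "urn:example:simple-td", "title": "Simple TD"},
    {"title": "TD without an id"},
]
result = wrap_tds(tds)
print([entry["id"] for entry in result])
```

Because the result is an array, the order of the input is preserved, which an object keyed by id would not guarantee.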

mmccool · 2021-02-22T16:40:01Z

Issues/proposals:

object vs array for collections of TDs
* sorting order
* local ids (dealing with fact ids in TDs are optional)
- could return an array of objects that include both metadata and TDs
wrappers
- TD with "thingGraph" section
  - response is not really describing a Thing
- special JSON wrapper
  - special data schema for this response; not a TD
  - contradicts choice to use enriched TDs
- extra information in headers
  - does not map well to other protocols
alternatives
* reusing existing standards
- WebAPI
- Fetch

Concerns:

Future extension to other protocols; necessary for small devices
- Rules out header approach
Sorting order stability
- Rule out objects for TD collections; use arrays to be safe on consistent ordering

Other comments:

use cases
* need some examples

mmccool · 2021-02-22T17:03:10Z

Let's look at examples here and follow them: linksmart/thing-directory#6
Notes:

arrays typically used for results
links often used for "next" (allows embedding of session ID, etc)

farshidtz · 2021-03-02T08:32:40Z

The draft spec for paginated listing has been added with the following response model:

{
    "@context": "<discovery context>",
    "id": "/td?offset=0&limit=10",
    "type": "Collection",
    "items": [ 
        {
            "@context": "https://www.w3.org/2019/wot/td/v1",
            "id": "urn:example:simple-td",
            "title": "Simple TD",
            "security": "basic_sc",
            "securityDefinitions": {
                "basic_sc": {
                    "scheme": "basic"
                }
            }
        }, 
        ... nine more TDs
    ],
    "total": 350,
    "nextLink": "/td?offset=10&limit=10"
}
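The response model above can be sketched on the server side as a slice over an ordered list of TDs. This is an illustration under assumptions, not the draft's normative behavior: the function name `list_tds` is hypothetical, and `nextLink` is simply omitted on the last page.

```python
from typing import Any

def list_tds(tds: list[dict[str, Any]], offset: int, limit: int) -> dict[str, Any]:
    """Build one Collection page from an ordered list of TDs."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    page: dict[str, Any] = {
        "@context": "<discovery context>",  # placeholder, as in the model above
        "id": f"/td?offset={offset}&limit={limit}",
        "type": "Collection",
        "items": tds[offset:offset + limit],
        "total": len(tds),
    }
    next_offset = offset + limit
    if next_offset < len(tds):
        # Only advertise a next page when there are TDs left to return.
        page["nextLink"] = f"/td?offset={next_offset}&limit={limit}"
    return page

# 350 dummy TDs, matching the total used in the example response.
directory = [{"id": f"urn:example:td-{i}"} for i in range(350)]
first_page = list_tds(directory, offset=0, limit=10)
last_page = list_tds(directory, offset=340, limit=10)
print(first_page["nextLink"], "nextLink" in last_page)
```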

Streaming support is left as optional and possible via server-driven content negotiation, without any specification.

The listing section of the draft spec: anchored link - API spec is not yet available.
Feedback about this is ongoing at #125

AndreaCimminoArriaga · 2021-03-08T10:41:13Z

Regarding the solution using headers: the current Linked Data Platform Paging 1.0, as I commented on issue #54, would solve some problems related to nested contexts or namespace collisions.

farshidtz · 2021-03-09T12:34:35Z

> Regarding the solution using headers: the current Linked Data Platform Paging 1.0, as I commented on issue #54, would solve some problems related to nested contexts or namespace collisions.

I've taken a deeper look at LDP Paging 1.0. It does not describe pagination of a single TD in JSON form (other listing mechanisms may allow that). Otherwise it is the same as the header-based proposal above, with a few additions. Following that and adding our requirements, the operation can be as follows:

Request /td will serve the default first page (e.g. /td?offset=0&limit=1)

Request `/td?offset=0&limit=10`

Response body:

[ {TD}, ... nine other TDs ... ]

Response headers:

Link:
Link: </td?offset=10&limit=10>; rel="next" --> if there are more TDs
Link: </td>; rel="canonical"; etag="<state identifier>" --> etag is an identifier which must represent the current state of the TD collection. It could be a version, unix timestamp, or UUID, set as soon as a TD is added/removed[/updated]

Optional links:
Link: <http://www.w3.org/ns/td#Thing>; rel="type", <http://www.w3.org/ns/ldp#Page>; rel="type"
Link: </td?offset=0&limit=10>; rel="self"

Content-Range:
Content-Range: TDs 0-9/350 if ?count=true
Content-Range: TDs 0-9/*
This is not defined in LDP Paging 1.0, but necessary to return the total and also to allow parallel queries. The client may query with limit=0&count=true to only get the total.
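To show why the total enables parallel queries, here is a small client-side sketch. It assumes a header value such as `TDs 0-9/350` and the `/td?offset=...&limit=...` query form used above; both helper names are hypothetical.

```python
import re
from typing import Optional

def parse_content_range(value: str) -> tuple[int, int, Optional[int]]:
    """Parse 'TDs <first>-<last>/<total or *>' into integers (total may be None)."""
    match = re.fullmatch(r"TDs (\d+)-(\d+)/(\d+|\*)", value)
    if match is None:
        raise ValueError(f"unexpected Content-Range value: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)

def plan_requests(total: int, limit: int) -> list[str]:
    """List the page URLs that together cover `total` TDs."""
    return [f"/td?offset={offset}&limit={limit}" for offset in range(0, total, limit)]

first, last, total = parse_content_range("TDs 0-9/350")
# Knowing the total up front, all page URLs can be computed and fetched in parallel.
urls = plan_requests(35, 10)
print(total, urls)
```

When the total is reported as `*`, the client cannot plan ahead and has to fall back to following "next" links one page at a time.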

This relates to w3c#16 (comment)

Related to w3c#16 (comment)

mmccool · 2021-07-26T15:02:06Z

While technically solved, the API may still change (there are some ongoing discussions and PRs) so will keep it open for now.

Add alternative payload format (2). Minor cleanups should be in followup issues and PRs. Should close issue #16 also.

farshidtz mentioned this issue May 7, 2020

Improve the catalog response format linksmart/thing-directory#6

Closed

farshidtz added a commit to farshidtz/w3c-wot-discovery that referenced this issue Oct 13, 2020

Add note for directory's listing operation, related to w3c#16

9f91143

farshidtz mentioned this issue Dec 7, 2020

Need information model for directory #98

Closed

mmccool changed the title ~~Handle huge set of Thing Descriptions~~ Handle huge set of Thing Descriptions (pagination) Feb 8, 2021

mmccool changed the title ~~Handle huge set of Thing Descriptions (pagination)~~ Handle huge set of Thing Descriptions (pagination, streaming, etc.) Feb 8, 2021

farshidtz mentioned this issue Feb 28, 2021

Registration spec: listing #124

Merged

wiresio mentioned this issue Mar 1, 2021

Discussion about PR 124 #125

Closed

farshidtz mentioned this issue Mar 8, 2021

Improvements to paginated TD retrieval spec #130

Closed

3 tasks

relu91 mentioned this issue Mar 9, 2021

Handle ThingDescriptions as streams w3c/wot-scripting-api#309

Open

farshidtz added a commit to farshidtz/w3c-wot-discovery that referenced this issue Mar 9, 2021

Add response headers, query correction

37d21eb

This relates to w3c#16 (comment)

farshidtz added a commit to farshidtz/w3c-wot-discovery that referenced this issue Mar 9, 2021

Change to header-based listing approach

85b8ec3

Related to w3c#16 (comment)

farshidtz mentioned this issue Mar 23, 2021

Listing with chunked transfer #145

Merged

farshidtz mentioned this issue Apr 13, 2021

Listing with pagination #153

Merged

3 tasks

mmccool mentioned this issue Jul 19, 2021

Add alternative payload format (2) #213

Merged

farshidtz added the Propose Closing label Jul 26, 2021

mmccool removed the Propose Closing label Jul 26, 2021

mmccool closed this as completed in #213 Jul 26, 2021

mmccool added a commit that referenced this issue Jul 26, 2021

Merge pull request #213 from wiresio/main

a64f17c

Add alternative payload format (2). Minor cleanups should be in followup issues and PRs. Should close issue #16 also.
