Setting standards for basic querying/filtering #18

Open
SiBell opened this issue Oct 17, 2019 · 38 comments
Labels
help wanted Extra attention is needed needs-documenting Need to document the outcome

Comments

@SiBell
Contributor

SiBell commented Oct 17, 2019

20191017_112039

Key parameters are as follows:

  • Time Window
  • Time of day/week/month/year
  • Spatial bounding box
  • Point and radius
  • Equality filter
  • Thresholds
  • Pagination
  • Response format

Two more we didn't discuss in the meeting, but might be worthy of adding:

  • Limit (i.e. limit=1 to get just the first/last reading)
  • Sort (i.e. ascending or descending, can be used with limit e.g. to get either the first n or last n observations)

Please provide suggestions for how to do each and we'll pick a favourite for each.

@aarepuu aarepuu added the help wanted Extra attention is needed label Oct 18, 2019
@SiBell
Contributor Author

SiBell commented Oct 18, 2019

Ok here's my stab at this. I've done everything but Pagination and Response Format. Let me know what you think.

General rules

  • Use camelCase, because the JSON response also uses camelCase. For example the JSON response for a Platform might have an inDeployment property, and thus to filter by deployment when requesting a list of Platforms you could use inDeployment=weather-stations.
  • A double underscore __ prefixes a modifier, e.g. dateTime__gt.

Modifiers include:

  • gt - greater than
  • lt - less than
  • gte - greater than or equal to
  • lte - less than or equal to

Time window

Used to filter the data temporally.

Keys

  • dateTime__gt
  • dateTime__lt
  • dateTime__gte
  • dateTime__lte

Usage

The dateTime must be in ISO8601 format.

Defaults to UTC unless specified otherwise.

None, 1 or 2 of the keys can be provided.

Required validation

Returns error response if:

  • Dates are not in ISO8601 format.
  • dateTime__gt is after dateTime__lt (and other similar scenarios).

Examples

?dateTime__gte=2019-10-18

?dateTime__gt=2019-10-18T15:03:34.614Z

?dateTime__gt=2019-10-18T15:03:34.614+04

?dateTime__gte=2019-10-18&dateTime__lt=2019-10-24
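
A minimal Python sketch of the validation rules above, for anyone implementing this. The function names `parse_iso8601` and `validate_time_window` are made up for illustration, and rejecting gt together with gte (or lt with lte) is an assumption borrowed from the "None, 1 or 2 of the keys" rule rather than something spelled out. Older versions of `datetime.fromisoformat` are stricter than full ISO 8601, so a real implementation may prefer a dedicated parser.

```python
from datetime import datetime, timezone

TIME_KEYS = ("dateTime__gt", "dateTime__gte", "dateTime__lt", "dateTime__lte")


def parse_iso8601(text: str) -> datetime:
    """Parse an ISO 8601 date or date-time, defaulting to UTC when no offset is given."""
    # Normalise a trailing "Z" so the call also works on Python versions before 3.11.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)  # raises ValueError on malformed input
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def validate_time_window(params: dict) -> dict:
    """Return the parsed time bounds, or raise ValueError if they are invalid."""
    bounds = {k: parse_iso8601(v) for k, v in params.items() if k in TIME_KEYS}
    # Assumption: at most one lower bound and one upper bound per request.
    if "dateTime__gt" in bounds and "dateTime__gte" in bounds:
        raise ValueError("use only one of dateTime__gt and dateTime__gte")
    if "dateTime__lt" in bounds and "dateTime__lte" in bounds:
        raise ValueError("use only one of dateTime__lt and dateTime__lte")
    lower = bounds.get("dateTime__gt", bounds.get("dateTime__gte"))
    upper = bounds.get("dateTime__lt", bounds.get("dateTime__lte"))
    if lower is not None and upper is not None and lower > upper:
        raise ValueError("the lower time bound is after the upper time bound")
    return bounds
```

A caller would turn any `ValueError` into the error response described under "Required validation".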

Filter by component of date/time

Keys

  • minuteOfHour
  • hourOfDay
  • dayOfWeek
  • dayOfMonth
  • dayOfYear
  • monthOfYear
  • year

Usage

Assumes UTC is being used.

hourOfDay uses 24H clock

Several of these can be used together

Required validation

Check values fall within expected range, e.g.

  • minuteOfHour between 0 and 59
  • dayOfWeek should be between 1 and 7
  • dayOfMonth between 1 and 31

Examples

?minuteOfHour=30

?hourOfDay=22

?dayOfWeek=2 (i.e. for Tuesday)

?dayOfMonth=2

?dayOfYear=301

?monthOfYear=11

?year=2019

Spatial window

Keys

  • latitude__gt
  • latitude__lt
  • latitude__gte
  • latitude__lte
  • longitude__gt
  • longitude__lt
  • longitude__gte
  • longitude__lte
  • height__gt
  • height__lt
  • height__gte
  • height__lte

Usage

Latitudes and longitudes are given in WGS84 datum.

Height is in metres above or below the WGS 84 reference ellipsoid (same as GeoJSON).

None, 1 or 2 height keys can be used in a request.

None, 1 or 2 latitude keys can be used in a request.

None or 2 longitude keys can be used in a request.

Required validation

  • Values fall within expected range, e.g latitude between -90 and +90.
  • If there's only 1 longitude key then return an error. This is to avoid 180th meridian issues.
  • Can't use a latitude__lt with latitude__lte, and likewise for longitude, and for gt with gte.

Examples

?latitude__gt=52

?longitude__gt=-8.5&longitude__lte=2

?height__lt=10
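
A rough Python sketch of why both longitude keys are required. It assumes that a lower bound greater than the upper bound means the window wraps across the 180th meridian, which is one possible reading of the rule above and has not been agreed. The function names are hypothetical, and both bounds are treated as inclusive to keep the example short.

```python
from typing import Optional, Tuple


def longitude_window(params: dict) -> Optional[Tuple[float, float]]:
    """Return (west, east) from the longitude keys, or None if no longitude keys are given."""
    lower = [params[k] for k in ("longitude__gt", "longitude__gte") if k in params]
    upper = [params[k] for k in ("longitude__lt", "longitude__lte") if k in params]
    if not lower and not upper:
        return None
    # Exactly one lower and one upper bound: rejects a single key, and gt with gte.
    if len(lower) != 1 or len(upper) != 1:
        raise ValueError("provide one lower and one upper longitude bound, or neither")
    west, east = float(lower[0]), float(upper[0])
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitude must be between -180 and +180")
    return west, east


def longitude_in_window(lon: float, west: float, east: float) -> bool:
    """Check a longitude against a window that may wrap across the 180th meridian."""
    if west <= east:
        return west <= lon <= east
    # Assumption: west > east means the window crosses the 180th meridian.
    return lon >= west or lon <= east
```

With only one key there would be no way to tell which side of the globe the open-ended window should cover, which is the ambiguity the validation rule avoids.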

Point and radius

Keys

  • proximityCentre
  • proximityRadius

Usage

Allows the user to find all resources (e.g. Platforms, observations, etc) within a given distance of a point.

The proximityCentre is the centre given in the form longitude,latitude, height. The height is optional. When height is given the filtered region turns from a circle into a sphere.

Longitude and latitude are in WGS84. Height is in metres.

proximityRadius is the distance from the proximityCentre in metres.

Required validation

Return error response if:

  • Only one of the keys is provided. Both proximityCentre and proximityRadius are required together.
  • The longitude and latitude used in the proximityCentre aren't valid coordinates.
  • The longitude and latitude aren't specified for proximityCentre. At least the longitude and latitude are required, with the height being optional.

Examples

?proximityCentre=-1.9,52.2&proximityRadius=1000

?proximityCentre=-1.9,52.2,10&proximityRadius=1000
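
A minimal Python sketch of how a proximity filter could be evaluated. It uses the haversine formula on a spherical Earth rather than the WGS84 ellipsoid, and combines surface distance with height difference using Pythagoras. Both are simplifying assumptions, not part of the proposal, and all function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def parse_proximity_centre(text: str):
    """Parse 'longitude,latitude' or 'longitude,latitude,height' into a tuple."""
    parts = [float(p) for p in text.split(",")]  # ValueError on non-numeric parts
    if len(parts) not in (2, 3):
        raise ValueError("proximityCentre needs longitude,latitude and an optional height")
    lon, lat = parts[0], parts[1]
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("proximityCentre is not a valid coordinate")
    height = parts[2] if len(parts) == 3 else None
    return lon, lat, height


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in metres between two longitude/latitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_proximity(lon, lat, centre, radius_m, height=None) -> bool:
    """True if the point lies within radius_m of the centre (circle, or sphere with heights)."""
    c_lon, c_lat, c_height = centre
    distance = haversine_m(lon, lat, c_lon, c_lat)
    # Only switch to the spherical test when both heights are known.
    if c_height is not None and height is not None:
        distance = math.hypot(distance, height - c_height)
    return distance <= radius_m
```

In practice most spatial databases provide an equivalent "within distance" query, so this is only meant to pin down the intended semantics.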

Equality filter

Keys

Depends on the resource.

Usage

Certain resources will be filterable by certain attributes.

Required validation

  • Only certain keys will be valid (depending on the resource). e.g. hairColour wouldn't be a valid parameter when querying an endpoint that serves a list of sensors.

Examples

If, for example, you needed to find people whose hair colour is brown then your request might look like this:

https://api.example.com/people?hairColour=brown

More examples:

?inDeployment=weather-stations

?isHostedBy=lamppost-101

?age=18

Thresholds

Keys

Depends on the resource.

Usage

Applies modifiers to keys that are specific to the resource being queried.

Required validation

  • Only certain keys and modifiers are valid for certain resources.

Examples

/people?age__lt=18

/observations?value__gte=30.5&observedProperty=air-temperature

Limit

Keys

  • limit

Usage

For endpoints returning a collection of resources this parameter will limit the number of resources returned.

Can be used in combination with the sortBy and sortOrder keys to get just the "last n" or "first n" resources in the collection.

Required validation

  • Value can't be less than or equal to 0.

Examples

?limit=100

?limit=1&sortOrder=asc&sortBy=age

Sort

Keys

  • sortBy
  • sortOrder

Usage

For endpoints returning a collection of resources this parameter will sort the resources returned.

Sorts both numerical fields and also strings (i.e. alphabetically).

Can be used in combination with the limit key to get just the "last n" or "first n" resources in the collection.

sortOrder defaults to asc if the sortBy key is provided without sortOrder.

Required validation

  • sortOrder values can only be asc or desc.
  • Returns error if sortOrder is provided without sortBy.

Examples

?sortBy=age&sortOrder=desc

?limit=1&sortOrder=asc&sortBy=age
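
To illustrate the "first n" and "last n" idea, here is a small Python sketch that applies sortBy, sortOrder and limit to an in-memory list of resources. The function name `sort_and_limit` and the dict-based resources are assumptions for illustration. A real API would normally push this down to the database.

```python
def sort_and_limit(resources, sort_by=None, sort_order=None, limit=None):
    """Apply the sortBy, sortOrder and limit rules to a list of dict resources."""
    if sort_order is not None and sort_by is None:
        raise ValueError("sortOrder cannot be used without sortBy")
    order = (sort_order or "asc").lower()  # sortOrder defaults to asc
    if order not in ("asc", "desc"):
        raise ValueError("sortOrder must be asc or desc")
    if limit is not None and limit < 1:
        raise ValueError("limit must be at least 1")
    result = list(resources)
    if sort_by is not None:
        # Raises KeyError if a resource lacks the property; a real API would reject the key earlier.
        result.sort(key=lambda r: r[sort_by], reverse=(order == "desc"))
    return result if limit is None else result[:limit]


people = [{"age": 30}, {"age": 18}, {"age": 45}]
youngest = sort_and_limit(people, sort_by="age", sort_order="asc", limit=1)
oldest = sort_and_limit(people, sort_by="age", sort_order="desc", limit=1)
```

Here `youngest` corresponds to ?limit=1&sortOrder=asc&sortBy=age and `oldest` to the same request with sortOrder=desc.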

@aarepuu aarepuu added this to the MVP functionality milestone Oct 21, 2019
@aarepuu
Collaborator

aarepuu commented Oct 21, 2019

Good work Simon. I agree with most of this. There are a couple of things I would add/specify.

Filter by component of date/time

  • dayOfWeek should be one of MO, TU, WE, TH, FR, SA, SU to be explicit on the days, some countries start the week with Sunday.
  • we should also support comma separated list for filters
    • ?dayOfWeek=MO,WE,FR
  • we should also support ranges for filters
    • ?dayOfWeek=MO-FR

Additionally ranges and comma separated list should also apply for other filters that are numeric.
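
A short Python sketch of how the list and range forms of dayOfWeek could be parsed. The function name `parse_day_of_week` is hypothetical. Upper-casing the input and letting a range wrap past Sunday (e.g. FR-MO) are both assumptions that would need agreeing.

```python
DAYS = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]


def parse_day_of_week(value: str) -> list:
    """Expand 'MO,WE,FR' or 'MO-FR' (or a mix of both) into a list of day codes."""
    selected = []
    for part in value.upper().split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            i, j = DAYS.index(start), DAYS.index(end)  # ValueError for unknown codes
            # Assumption: a range whose end precedes its start wraps around the week.
            span = DAYS[i:j + 1] if i <= j else DAYS[i:] + DAYS[:j + 1]
        else:
            DAYS.index(part)  # validates the single code
            span = [part]
        for day in span:
            if day not in selected:
                selected.append(day)
    return selected
```

The same split-on-comma, then split-on-hyphen approach would carry over to numeric filters, although negative numbers would need a different range separator or more careful parsing.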

@aarepuu aarepuu pinned this issue Oct 21, 2019
@aarepuu
Collaborator

aarepuu commented Oct 21, 2019

Here's what I think for pagination.

Pagination

Keys

  • limit
  • offset

Usage

Used for pagination of results.
Both are represented as integers and are not required parameters.
If not specified, they default to limit=10 and offset=0.

Can be used in combination with the sortBy and sortOrder keys.

Required validation

  • limit has to be integer greater than or equal to 1.
  • offset has to be integer greater than or equal to 0.
  • limit must be specified when using offset

Examples

?limit=100
?limit=10&offset=10
?limit=10&offset=10&sortOrder=asc&sortBy=age

Not completely sure if we should make limit compulsory when using offset or just use default limit=10 when not specified?
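
Both options from the open question above can be sketched in a few lines of Python. The function name `parse_pagination` and the `require_limit_with_offset` switch are hypothetical and only there to show the two behaviours side by side.

```python
DEFAULT_LIMIT = 10
DEFAULT_OFFSET = 0


def parse_pagination(params: dict, require_limit_with_offset: bool = True):
    """Return (limit, offset) from query parameters, applying defaults and validation."""
    if require_limit_with_offset and "offset" in params and "limit" not in params:
        raise ValueError("limit must be specified when using offset")
    # int() raises ValueError for non-integer strings such as "1.5" or "abc".
    limit = int(params.get("limit", DEFAULT_LIMIT))
    offset = int(params.get("offset", DEFAULT_OFFSET))
    if limit < 1:
        raise ValueError("limit has to be an integer greater than or equal to 1")
    if offset < 0:
        raise ValueError("offset has to be an integer greater than or equal to 0")
    return limit, offset
```

With `require_limit_with_offset=False`, a request such as ?offset=5 would quietly fall back to limit=10 instead of returning an error.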

@lukeshope
Member

I agree with all of this, and also with Aare's comments on days of the week and ranges. Thanks both.

In the interests of making this more widely applicable, I think it would be worth nailing down what the general case is. Specifically:

  • how are we dereferencing when using filters that correspond to IDs such as your inDeployment example; do we require full IRIs for example, inDeployment=https://birmingham.uo.ac.uk/api/deployment/weather-stations, or do we allow relative paths following RFC3986 as in JSON-LD? Personal preference is allow either, but this does add complexity
  • do we allow filtering on properties in nested objects? Personal preference is that this is optional and implementation dependent
  • if so, what is the general form for the query parameter in that case? See below, it's complicated

As an example:

{
  "@type": "Sensor",
  [...],
  "madeObservation": {
    "@type": "ObservationCollection",
    "member": [{
      "@type": "Observation",
      "hasResult": {
        "@type": "Temperature",
        "value": 12.0
      },
      "resultTime": "2019-10-25T15:14:00Z"
    }]
  }
}

In the above example:

  • would the query parameter be value to apply a threshold, e.g. ?value__gte=10.0?
  • would we only allow the value filter when querying against the IRI of the ObservationCollection?
  • if not, what would the behaviour be if I used a query filter on a Sensor or SensorCollection, which contained ObservationCollections and Observations as nested objects?
  • what would happen if two query parameters with the same name both existed within different nested objects (e.g. a name on a platform, and also a name on a sensor attached to it); which would the name filter apply to? My preference would be that the filter applies to the highest level name only

@lukeshope
Member

lukeshope commented Oct 25, 2019

I think we might also have to consider, for the sake of search functionality...

Wildcard matching

Keys

  • name__contains
  • name__containsAny
  • name__containsAll
  • Depends on the instance

Usage

  • An optional filter to be supported by some implementations
  • Functions as an or or and filter, depending on whether __containsAny or __containsAll is used
  • The contains variant only allows one word to be specified, or a phrase if surrounded by double quotes
  • The containsAny and containsAll variant allows multiple words or phrases to be specified, joined by a plus symbol

Validation

  • Double quotes must be in complete pairs
  • Double quotes may not be used as part of the filter itself
  • Only one word or phrase passed when using contains

Examples

  • http://example.com/platforms?name__contains="Room 2.048"
  • http://example.com/platforms?name__contains=2.048
  • http://example.com/platforms?name__containsAll=Room+2.048
  • http://example.com/platforms?name__containsAny=2.060+2.048
  • http://example.com/platforms?name__containsAny="Room 2.060"+"Room 2.048"

Disclaimer: The above is similar to how it's already been implemented in some Newcastle APIs, happy to look at alternatives
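
A rough Python sketch of how these values might be split and matched. This is not how the Newcastle APIs do it, just one possible reading of the rules above. The function names are hypothetical, matching is case-insensitive by assumption, and the sketch works on the raw value because many query-string decoders turn + into a space. It does not enforce the "one word unless quoted" rule for contains.

```python
def split_terms(raw: str) -> list:
    """Split a raw filter value on '+' signs that are outside double quotes."""
    if raw.count('"') % 2 != 0:
        raise ValueError("double quotes must be in complete pairs")
    terms, current, in_quotes = [], "", False
    for ch in raw:
        if ch == '"':
            in_quotes = not in_quotes  # quotes delimit a phrase and are not kept
        elif ch == "+" and not in_quotes:
            terms.append(current)
            current = ""
        else:
            current += ch
    terms.append(current)
    return [t for t in terms if t]


def matches(text: str, raw: str, mode: str) -> bool:
    """Apply contains, containsAny or containsAll to a single string value."""
    mode = mode.lower()
    if mode not in ("contains", "containsany", "containsall"):
        raise ValueError("unknown modifier")
    terms = split_terms(raw)
    if mode == "contains" and len(terms) != 1:
        raise ValueError("contains accepts exactly one word or phrase")
    haystack = text.lower()  # assumption: case-insensitive matching
    hits = [term.lower() in haystack for term in terms]
    return any(hits) if mode == "containsany" else all(hits)
```

For example, `matches("Room 2.048", "Room+2.048", "containsAll")` is true because both terms appear in the name.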

@nharris172
Member

Will case be considered? can we have icontains

@lukeshope
Member

Will case be considered? can we have icontains

Personally, I don't think case sensitivity is necessary. Would be good to know if people feel strongly the other way.

@SiBell
Contributor Author

SiBell commented Oct 28, 2019

Oh boy, this gets "fun" quickly!

My thoughts:

  • Happy to go case insensitive with the query string parameters. I'm struggling to think of a situation where it would cause us any issues. I'll probably use camelCase in any documentation, but keep the api itself insensitive.
  • Let's allow full OR relative URIs. Surely in most cases our code will be constructing the full URI from the base and relative parts on the fly anyway.
  • I'd say ?value__gte=10.0 in your nested example is fine. If we start doing member.hasResult.value__gte=10.0 things are going to get pretty gnarly for the end user. I personally can't see myself allowing all that many filterable properties on a given endpoint so the risk of collisions is fairly low. I'd agree that the filter should apply to the highest level for Luke's name example.
  • I agree with Aare's points on dayOfWeek and MO,WE,FR and MO-FR. Presumably these are case insensitive too, i.e. MO-FR is the same as mo-fr.
  • Aare's Pagination approach looks good to me. Although I don't have much experience with Pagination.
  • For containsAll and containsAny could you use a comma separated approach, e.g. name__containsAny=2.060,2.048? Feels slightly odd seeing double quotes in a URL. Are these for looking for substrings in a longer string, or for searching for elements in a property that's an array, or both? Either way it's different to Aare's dayOfWeek=MO,TU example which is more about filtering discrete values right?
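
On the full-or-relative URI point, a minimal Python sketch of the resolution step using the standard library's RFC 3986 style resolver. The function name `resolve_reference` and the idea of a per-property base URL ending in a slash are assumptions for illustration. The base below is taken from Luke's example.

```python
from urllib.parse import urljoin

# Assumed base for the inDeployment property; note the trailing slash.
DEPLOYMENT_BASE = "https://birmingham.uo.ac.uk/api/deployment/"


def resolve_reference(value: str, base: str) -> str:
    """Resolve a filter value against a base URL; absolute URLs pass through unchanged."""
    return urljoin(base, value)


relative = resolve_reference("weather-stations", DEPLOYMENT_BASE)
absolute = resolve_reference(
    "https://birmingham.uo.ac.uk/api/deployment/weather-stations", DEPLOYMENT_BASE
)
```

Both calls produce the same full URI, so the filtering code only ever has to compare full URIs.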

@lukeshope lukeshope added the needs-documenting Need to document the outcome label Dec 3, 2019
@lukeshope
Member

Personally I'm happy with all of the above. I think enough time has passed, and we should now consider writing this up into the standards doc, and schematising the query parameters etc.

Any objections?

@geoanorak

geoanorak commented Dec 3, 2019 via email

@SiBell
Contributor Author

SiBell commented Dec 3, 2019

Aye let's get it written up. Let me know if you want me to do any of it. Looks like we can click "edit" on any of the posts above, so should be relatively quick to copy the markdown over into the working document.

@SiBell
Contributor Author

SiBell commented Dec 4, 2019

One more to add to this list before we get this written up, which follows on from my previous issue.

Can we have an exists condition?

E.g.

To get a list of all sensors not yet hosted on a platform:

GET /sensors?isHostedBy__exists=false

Or perhaps all the observations for which the featureOfInterest is defined:

GET /observations?hasFeatureOfInterest__exists=true

Just reads a bit nicer than our previous isDefined suggestion.

@lukeshope
Member

No inherent problem with __exists, but we need to clarify how we would handle null values in that case. A null value would exist, but wouldn't be defined.

@SiBell
Contributor Author

SiBell commented Dec 4, 2019

Good point. Guess this depends on whether we're showing null values to the user or not.

E.g.

{
  "id": "sensor-123",
  "observes": "air-temperature",
  "inDeployment": null
}

vs.

{
  "id": "sensor-123",
  "observes": "air-temperature"
}

I was veering towards the latter, in which case any null values that may exist in the backend database don't "exist" to the end user and therefore __exists on its own would suffice.

However, if there's merit in showing null values then my preference would actually be to stick with just __isDefined and drop __exists.

Interested to hear people's thoughts on this.

@lukeshope
Member

The only place I can see a clear rationale for having null values is in the observation value itself, where for example we might have an 'alarm' timeseries, and null means no alarm.

I admit I can't think of any other places where null would be useful. We can certainly discourage the use of null values in serialisations in favour of omission.

Maybe there isn't a problem in that case, and we should just allow __exists and __isDefined, but neither are mandatory and implementations could be free to implement none, one or both combinations in their filters.

Does that work?

@lukeshope
Member

Actually, I suppose it should be __exists and __defined for consistency?

@SiBell
Contributor Author

SiBell commented Dec 4, 2019

Yep works for me, and yes __defined is better than __isDefined.
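
To pin down the difference between the two, here is a tiny Python sketch using the sensor examples from earlier in the thread. It assumes resources are plain dicts and that a JSON null is represented as `None`. The function names are hypothetical.

```python
def exists(resource: dict, key: str) -> bool:
    """__exists: the property is present, even if its value is null."""
    return key in resource


def defined(resource: dict, key: str) -> bool:
    """__defined: the property is present and its value is not null."""
    return resource.get(key) is not None


with_null = {"id": "sensor-123", "observes": "air-temperature", "inDeployment": None}
omitted = {"id": "sensor-123", "observes": "air-temperature"}
```

For `with_null`, inDeployment exists but is not defined. For `omitted`, it neither exists nor is defined, so the two modifiers only disagree when null values are serialised.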

@EttoreHector
Contributor

EttoreHector commented Dec 4, 2019 via email

@lukeshope
Member

Perhaps, if the alarm were binary. But there might be an enumeration of alarms presented as an array for example:

[
  "https://example.org/alarm/low-temperature",
  "https://example.org/alarm/no-signal"
]

In the above case, either an empty array or null would be appropriate for when there are no alarms.

I'm not suggesting any of this is the right way to do it, just trying to retain flexibility as much as possible. Open to being convinced otherwise :-)

@lukeshope
Member

Aye let's get it written up. Let me know if you want me to do any of it. Looks like we can click "edit" on any of the posts above, so should be relatively quick to copy the markdown over into the working document.

I've made some progress implementing this into a JS library, but not quite ready to share yet.

If you get a chance to look at transforming the above mess into a simple HTML table for the actual document, I'd really appreciate it.

@SiBell
Contributor Author

SiBell commented Jan 7, 2020

Sure, I can wack these in a HTML table. Might have to be 2 tables then a few specific examples:

Table 1: Special keys

e.g.

| key | description | example |
| --- | --- | --- |
| limit | limit the number of records returned | ?limit=1 |
| proximityradius | the distance from the proximitycentre in metres | ?proximityradius=1000 |

Table 2: Modifiers

e.g.

| modifier | description | example |
| --- | --- | --- |
| gt | greater than | ?datetime__gt=2019-01-01 |
| contains | For wildcard matching. Only allows one word to be specified, or a phrase if surrounded by double quotes | ?name__contains=west |

Special Examples

More detailed description of how to query with spatial windows, time windows, pagination, by proximity, etc.

@SiBell
Contributor Author

SiBell commented Jan 9, 2020

I'm in the process of putting all parameters in a table now, just wondering if the time based parameters, i.e. minuteofhour, hourofday, etc should actually be "modifiers".

e.g. change:

monthofyear=10

for:

resultTime__monthofyear=10

Reason:

  1. It's more consistent with how we apply other time-based filters, e.g. resultTime__gte=2019.
  2. If you have a resource, e.g. observations, that has more than one time-based property, e.g. an observation might have a timestamp for the time of measurement (resultTime), and one for when it arrived at the server (arrivalTime), then this approach lets you choose which one to filter by.

The upshot is that the only special parameter keys we're left with are those for pagination, i.e. limit, offset, sortorder, sortby, and those for circular bounding area: proximitycentre and proximityradius. This is no bad thing.

Any objections?

@EttoreHector
Contributor

EttoreHector commented Jan 9, 2020 via email

@lukeshope
Member

No objections from me, this sounds like a really sensible idea, and would provide options for granular filtering on other date-time based data too in its generic form.

I think Ettore's suggestion is right, that would be a valid query for results in October 2017, 2018, 2019 etc., with query constraints being additive.

The other aspect to consider is multiple 'modifiers', which I suggest we allow but don't mandate as a minimum. I'm thinking for example resultTime__monthOfYear__gte=10 for October through December.

Perhaps based on the above we need to clarify the terminology slightly. resultTime is a selector (picking a specific value within the JSON response), monthOfYear is a sub-selector (picking a part of that value), and gte is a modifier? That way the order would always be selector, sub-selector, modifier, and resultTime__gte__monthOfYear would be invalid. Does that make sense?

@EttoreHector
Contributor

EttoreHector commented Jan 10, 2020 via email

@SiBell
Contributor Author

SiBell commented Jan 10, 2020

I agree that Ettore's suggestion is right.

Happy to allow sub-selectors too. I'll add it to the docs. Although I'm struggling to think of a use-case other than time-based selectors, but definitely worth having it as an option.

@SiBell
Contributor Author

SiBell commented Jan 10, 2020

Query string Parameters

Query string parameters allow greater control over the resources returned when making a request.

They come in particularly useful when making GET requests.

For example the following request doesn't have any query string parameters:

GET https://api.urbanobservatory.com/observations

Whereas the following does:

GET https://api.urbanobservatory.com/observations?madeBySensor=thermometer-abc123

The latter lets us filter the observations returned to just those made by the sensor with id: thermometer-abc123.

This simple example has the form:

selector=value

i.e. madeBySensor is the selector, and thermometer-abc123 is the value.

We also accept more complex query string parameters of the form:

selector__modifier=value

The modifier allows you to perform more than just an equality filter.

The following example has no modifier:

GET .../observations?resultTime=2020-01-09T18:05:24.969Z

This would only get observations recorded at that exact millisecond. But what if you wanted all observations since the start of 2020? That's where a modifier (in this case gte) can help. Here's how the request would look:

GET .../observations?resultTime__gte=2020-01-01T00:00:00.000Z

The modifier always follows a double underscore: __. It acts upon the selector listed before the __.

We also have sub-selectors, that focus on a specific component of a selector. They're particularly useful for dealing with timestamps. The format is as follows:

selector__subselector=value

For example:

resultTime__monthOfYear=10

Where resultTime is the selector, and monthOfYear is the sub-selector. In this example it allows you to only retrieve observations with a resultTime within the month of October.

You can also use a subselector in combination with a modifier using the following format:

selector__subselector__modifier=value

e.g.

resultTime__monthOfYear__gte=10

This would retrieve any observations with a resultTime in October, November or December.

N.B. Not every observatory will support all of these formats, and each endpoint may only have a small number of query string parameters it accepts. However, when available, this is the format each observatory will abide by.

N.B. query string selectors, sub-selectors and modifiers and values are case-insensitive unless specifically defined otherwise.
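
A minimal Python sketch of how a parameter key in this format could be broken into its parts, enforcing the selector, sub-selector, modifier order. The function name `parse_key` is hypothetical, and the sets of known sub-selectors and modifiers are taken from the names used in this thread. An implementation supporting only some of them would shrink those sets.

```python
SUB_SELECTORS = {
    "minuteofhour", "hourofday", "dayofweek", "dayofmonth",
    "dayofyear", "monthofyear", "year",
}
MODIFIERS = {
    "gt", "gte", "lt", "lte", "contains", "containsany",
    "containsall", "exists", "defined",
}


def parse_key(key: str):
    """Split 'selector__subselector__modifier' into (selector, sub_selector, modifier)."""
    selector, *rest = key.split("__")
    if not selector:
        raise ValueError(f"missing selector in {key!r}")
    sub_selector = modifier = None
    for part in rest:
        name = part.lower()  # sub-selectors and modifiers are case-insensitive
        if name in SUB_SELECTORS and sub_selector is None and modifier is None:
            sub_selector = name
        elif name in MODIFIERS and modifier is None:
            modifier = name
        else:
            # Unknown name, a repeated part, or a sub-selector placed after a modifier.
            raise ValueError(f"unexpected or misplaced part {part!r} in {key!r}")
    return selector, sub_selector, modifier
```

With this, resultTime__monthOfYear__gte parses cleanly, while resultTime__gte__monthOfYear is rejected because the sub-selector comes after the modifier.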

Special selectors

Typically the selector is a property of the resource being returned, e.g. resultTime or madeBySensor. However, there are some special selectors that provide further functionality. They are listed in the following table.

| key | description | examples |
| --- | --- | --- |
| limit | Limits the number of records returned. Commonly used with the offset, sortorder and sortby parameters. MUST be an integer value ≥ 1. | limit=1 or limit=100&offset=200&sortorder=asc&sortby=resultTime |
| offset | Commonly used for pagination in combination with the limit parameter to skip the first n resources. MUST be an integer value ≥ 0. | limit=10&offset=30 |
| sortorder | Used in combination with sortby to sort the returned resources by the property provided. Use asc for ascending and desc for descending. | sortorder=desc&sortby=resultTime |
| sortby | Used in combination with sortorder to sort the returned resources by the property provided. | sortorder=asc&sortby=madeBySensor |
| proximitycentre | MUST be used in combination with proximityradius. Sets the centre of a circular or spherical (if height is given) bounding area, i.e. only resources within the spatial area are returned. Uses the format: longitude,latitude,height. Height is optional. Longitude and latitude use WGS84. Height is in metres. | proximitycentre=-1.9,52.2&proximityradius=1000 or proximitycentre=-1.9,52.2,200&proximityradius=100 |
| proximityradius | MUST be used in combination with proximitycentre. Sets the distance from the proximitycentre in metres. | proximityradius=1000 |

Sub-selectors

N.B. sub-selectors that deal with times and dates assume that the timezone is UTC.

| sub-selector | description | examples |
| --- | --- | --- |
| minuteofhour | Filters by the minute of the hour. Integer values between 0 and 59. | resultTime__minuteOfHour=30 |
| hourofday | Filters by the hour of day. An integer value between 0 and 23, i.e. a 24 hour clock. | resultTime__hourOfDay=22 |
| dayofweek | Filters by the day of the week. Valid values are: mo, tu, we, th, fr, sa, su. For multiple days use a comma-separated list, e.g. mo,we,fr, or a range, e.g. mo-fr. | resultTime__dayofweek=mo or resultTime__dayofweek=mo,we,fr or resultTime__dayofweek=mo-fr |
| dayofmonth | Filters by the day of the month. Integer values between 1 and 31. | resultTime__dayofmonth=2 |
| dayofyear | Filters by the day of the year. | resultTime__dayofyear=301 |
| monthofyear | Filters by the month of year. Integer values between 1 and 12. | resultTime__monthofyear=11 |
| year | Filters resources to just a single year. | resultTime__year=2019 |
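
A short Python sketch of how each time sub-selector could be evaluated against a timestamp, converting to UTC first as the note above requires. The function name `extract_component` is hypothetical, and treating a timestamp without an offset as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone

DAY_CODES = ["mo", "tu", "we", "th", "fr", "sa", "su"]


def extract_component(moment: datetime, sub_selector: str):
    """Return the part of a timestamp that a sub-selector refers to, evaluated in UTC."""
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    utc = moment.astimezone(timezone.utc)
    extractors = {
        "minuteofhour": lambda t: t.minute,
        "hourofday": lambda t: t.hour,
        "dayofweek": lambda t: DAY_CODES[t.weekday()],  # weekday(): Monday is 0
        "dayofmonth": lambda t: t.day,
        "dayofyear": lambda t: t.timetuple().tm_yday,
        "monthofyear": lambda t: t.month,
        "year": lambda t: t.year,
    }
    try:
        return extractors[sub_selector.lower()](utc)
    except KeyError:
        raise ValueError(f"unknown sub-selector {sub_selector!r}") from None


# 01:00 on 29 October at UTC+4 is still 21:00 on 28 October in UTC.
local_time = datetime(2019, 10, 29, 1, 0, tzinfo=timezone(timedelta(hours=4)))
day_in_utc = extract_component(local_time, "dayOfMonth")
```

The example at the end shows why the UTC rule matters: the same instant falls on a different day of the month depending on the offset used.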

Modifiers

| modifier | description | examples |
| --- | --- | --- |
| none | Format: key=value. When no modifier is present, and assuming the parameter key isn't listed in the special selectors table (e.g. it's not limit, offset, etc), then the key is a property that exists on the resources being requested. Only those resources that have a matching value for this property will be returned. | inDeployment=weather-stations-in-schools or isHostedBy=lamppost-32 |
| gt | Greater than. | resultTime__gt=2019-01-01 |
| gte | Greater than or equal to. | height__gte=10 |
| lt | Less than. | latitude__lt=60 |
| lte | Less than or equal to. | value__lte=20 |
| contains | For wildcard matching. Only allows one word to be specified, or a phrase if surrounded by double quotes. | name__contains=west or name__contains="Room 2.048" |
| containsany | Allows multiple words or phrases to be specified, joined by a + symbol. Matches if any of them is present. | name__containsAny=2.060+2.048 or name__containsAny="Room 2.060"+"Room 2.048" |
| containsall | Allows multiple words or phrases to be specified, joined by a + symbol. Matches only if all of them are present. | name__containsAll=Room+2.048 |
| exists | Used to check if a resource property exists or not. | isHostedBy__exists=false |
| defined | Used to check if a resource property has been defined or not. In most cases it will behave the same as exists. The only time it may differ is if resources can have properties with a value of null, in which case that property would exist, but would not be defined. | value__defined=false |

Specific Examples

Time window

The following gets observations with a resultTime between two dates.

GET .../observations?resultTime__gte=2020-01-01T00:00:00.000Z&resultTime__lte=2020-01-01T12:00:00.000Z

The following only gets observations in the year 2020 on weekdays.

GET .../observations?resultTime__dayofweek=mo-fr&resultTime__year=2020

Spatial Bounding box

The following retrieves any observations within a bounding box (in this case around Birmingham city centre).

GET .../observations?latitude__lte=52.495768&latitude__gte=52.464492&longitude__lte=-1.875352&longitude__gte=-1.928481

And now for only observations above 1 m.

GET .../observations?latitude__lte=52.495768&latitude__gte=52.464492&longitude__lte=-1.875352&longitude__gte=-1.928481&height__gt=1

Proximity

The following retrieves any observations within 1000 m from the centre of Birmingham:

GET .../observations?proximitycentre=-1.895007,52.477096&proximityradius=1000

Pagination

Let's say you want all air-temperature observations from a platform called mobile-sensing-van in the year 2019. There are potentially thousands of observations available, so we want to get them in chunks. Our initial request looks like this.

GET .../observations?observedProperty=air-temperature&platform=mobile-sensing-van&resultTime__year=2019&limit=100&offset=0&sortby=resultTime&sortorder=asc

This returns exactly 100 observations, so there may still be more observations to retrieve. We adjust the offset and make the following request:

GET .../observations?observedProperty=air-temperature&platform=mobile-sensing-van&resultTime__year=2019&limit=100&offset=100&sortby=resultTime&sortorder=asc

In this scenario we'd keep incrementing the offset by 100 until we no longer received 100 observations back.
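
The loop described above can be sketched in Python as follows. To keep the example runnable without a network, the HTTP call is abstracted behind a hypothetical `fetch_page(limit, offset)` callable that returns one page of observations as a list. In a real client it would issue the GET request shown above with the given limit and offset.

```python
from typing import Callable, Iterator, List


def iterate_all(fetch_page: Callable[[int, int], List], limit: int = 100) -> Iterator:
    """Yield every resource by increasing the offset until a short page comes back."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        # Fewer results than the limit means there is nothing left to fetch.
        if len(page) < limit:
            break
        offset += limit


# Stand-in for the API: 250 fake observations served from a list.
fake_observations = list(range(250))
calls = []


def fake_fetch(limit: int, offset: int) -> List:
    calls.append(offset)
    return fake_observations[offset:offset + limit]


everything = list(iterate_all(fake_fetch, limit=100))
```

Note that when the total is an exact multiple of the limit, the client makes one extra request that returns an empty page before stopping. A stable sort order (here sortby=resultTime) matters too, otherwise pages could overlap or skip items.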

@EttoreHector EttoreHector unpinned this issue Jan 24, 2020
@EttoreHector
Contributor

EttoreHector commented Jan 24, 2020

List of Endpoints

The List of endpoints suggested by Simon in his examples are:

Deployments

<base_url>/deployments/<deployment_name>
where the response contains the deployment specified by <deployment_name>

<base_url>/deployments
where the response contains a list of all the deployments

Platforms

<base_url>/deployments/<deployment_name>/platforms
where the response contains the list of all the platforms in the specified deployment <deployment_name>

Observations

<base_url>/observations?startDate=<start_date>&endDate=<end_date>
where the response contains a list of ALL the observations recorded between <start_date> and <end_date>

Comments and proposals

1. Would it make sense to query for all the platforms belonging to any deployment, hence having an endpoint such as

<base_url>/platforms ?

Or is it preferable to always specify the deployment as in:

<base_url>/deployments/<deployment_name>/platforms/<platform_name> ?

(either one or both endpoints would have to be added to the list proposed by Simon).

2. If someone wants to retrieve a single platform, should they use

<base_url>/platforms/<platform_name>

or

<base_url>/deployments/<deployment_name>/platforms/<platform_name> ?

(the second endpoint being possibly redundant if the platform name is kept unique across all the deployments)

3. Does it make sense to ask for ALL observations within a specified time window regardless of the sensor that made it, or at least the ObservedProperty it refers to, as Simon suggested in his example?

Should we consider instead a more specific query when it comes to retrieving observations, like:

<base_url>/sensors/<sensor_name>/observations?startDate=<start_date>&endDate=<end_date>

and / or

<base_url>/observedproperty/<property_name>/observations?startDate=<start_date>&endDate=<end_date> ?

4. What would the endpoint for querying a given sensor look like?

<base_url>/sensors/<sensor_name> (assuming unique sensor names across all platforms / systems / deployment)

or

<base_url>/platform/<platform_name>/sensors/<sensor_name> (assuming unique sensor names only within a platform)

or

<base_url>/deployments/<deployment_name>/sensors/<sensor_name> (assuming unique sensor names within an entire deployment)

or else...?

@SiBell
Contributor Author

SiBell commented Jan 24, 2020

My preference would be that all your suggestions are valid, because some endpoint structures will be better suited to particular clients/frontends than others.

For example, I will be handling much of my authorisation at the deployment level, e.g. only certain users will have admin rights to a particular private deployment. Therefore, when I create a front-end that allows admin users to, for example, remove a sensor from a deployment I will want the deployment ID in the URL. For any URL starting <base_url>/deployments/<deployment_id>/ my API server will verify that the user actually has access rights to this deployment.

Alternatively, the web app that we'll build for the general public to use will be better off using endpoints such as <base_url>/observations or <base_url>/platforms. I just make sure that the observations or platforms returned haven't come from private deployments.

I think we need to pick a small handful of endpoints that we MUST support. The obvious one being <base_url>/observations. Then if we want to support more then we can do so, trying our best to be consistent so that we don't end up with one observatory using /deployments/ and another /deployment/.

@EttoreHector
Contributor

EttoreHector commented Jan 25, 2020 via email

@SiBell
Contributor Author

SiBell commented Jan 25, 2020

This is where pagination should come to the rescue. So that even if they make a request that would match millions of observations, we only return a maximum of 1000 (for example). Most databases should have some limit, offset and sort functionality to help with this.

What we haven't decided on yet is how we tell the user they have hit our maximum limit and presumably provide a URL for them to get the next 1000.
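
To make the open question concrete, here is one possible shape for it, sketched in Python. The offset query parameter, the next key and the MAX_LIMIT constant are assumptions for illustration only; none of them has been agreed in this thread.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_LIMIT = 1000  # hypothetical server-side cap, as in "1000 (for example)"


def paginate(request_url: str, total_matches: int) -> dict:
    """Clamp the requested limit and work out the URL of the next page, if any."""
    parts = urlsplit(request_url)
    params = dict(parse_qsl(parts.query))
    limit = min(int(params.get("limit", MAX_LIMIT)), MAX_LIMIT)
    offset = int(params.get("offset", 0))

    next_url = None
    if offset + limit < total_matches:
        # More results remain: point the client at the following page.
        params.update({"limit": str(limit), "offset": str(offset + limit)})
        next_url = urlunsplit(parts._replace(query=urlencode(params)))

    return {"limit": limit, "offset": offset, "next": next_url}
```

A client asking for limit=5000 would get at most 1000 observations back, plus a next URL to follow for the rest; when next is missing or null they know they have reached the end.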

It would be very easy for me to support endpoints such as <base_url>/sensors/<sensor_name>/observations and <base_url>/observedProperty/<property_name>/observations as well as <base_url>/observations, so I'm more than happy to add these to the MUST list.

It's worth saying that a single sensor could in theory upload thousands of observations every day by itself, e.g. if it sampled every second, therefore we'd almost certainly need pagination on these additional endpoints too.

@lukeshope
Member

lukeshope commented Jan 26, 2020

I worry we're going down the wrong path with a list of endpoints. A REST API shouldn't have a list of endpoints, because it's driven by hypermedia, meaning it doesn't matter what the web addresses are because you follow the links to get there.

There's absolutely nothing wrong with the endpoints you've suggested; they look like a sensible way to implement it. But if I wanted to file my Platforms under https://api.example.com/silly-sausages then I should be able to.

What we do need is some agreement on how we manage Collections (if we call them that... this might for example be a collection of platforms, thus paginated, as Si refers to) that aren't ObservationCollections. Examples of how we do that would be either hydra:Collection or rdf:Bag.

It's also entirely possible that you might have all your platforms in one API (a lamp post API, say) and all your sensors in another (an air quality API, say) and all your historic observations in another (an observation collection API, say) and they would all just link to each other.

We also need an entrypoint that directs clients to these collections as a starting point. In other words, when I hit https://api.example.com it gives me links to a collection of sensors, a collection of platforms, a collection of observations, etc. It wouldn't need to give me all of those necessarily, you might not have a collection of all observations from all sensors (which could be huge, but might be useful), you might only have collections of observations under each sensor.

In theory, this would/could look something like...

GET https://api.example.com/
{
  "@context": {
    "@base": "https://api.example.com/",
    "uo": "https://urbanobservatory.github.io/standards/vocabulary/latest/",
    "title": "http://purl.org/dc/terms/title",
    "collections": {
      "@id": "uo:EntrypointCollections",
      "@container": "@id"
    }
  },
  "collections": {
    "/sensors": {
      "@type": ["@id", "uo:Collection", "uo:SensorCollection"],
      "title": "All sensors available in Newcastle upon Tyne"
    }
  }
}
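
From the client's side, following that entrypoint rather than hard-coding a path could be sketched like this in Python. The find_collection helper is hypothetical, and it reads the JSON as a plain dictionary rather than using a real JSON-LD processor, so treat it as an illustration of the idea only.

```python
from typing import Optional
from urllib.parse import urljoin

# The entrypoint document from the example above, as a client might receive it.
entrypoint = {
    "@context": {"@base": "https://api.example.com/"},
    "collections": {
        "/sensors": {
            "@type": ["@id", "uo:Collection", "uo:SensorCollection"],
            "title": "All sensors available in Newcastle upon Tyne",
        }
    },
}


def find_collection(doc: dict, wanted_type: str) -> Optional[str]:
    """Return the absolute URL of the first collection advertising wanted_type."""
    base = doc.get("@context", {}).get("@base", "")
    for relative_id, collection in doc.get("collections", {}).items():
        if wanted_type in collection.get("@type", []):
            # Resolve the collection's key against @base to get a full URL.
            return urljoin(base, relative_id)
    return None
```

The client only needs to know the type it is looking for (uo:SensorCollection here), so the server is free to file the collection under whatever address it likes.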

Is this discussion best split into a new issue? Not sure we're talking about filtering anymore...

@EttoreHector
Contributor

EttoreHector commented Jan 27, 2020 via email

@SiBell
Contributor Author

SiBell commented Jan 27, 2020

@LukeSSmith I've created a new issue on Collections and Pagination, as I agree it makes sense to start a new thread for this.

Being able to reach all the endpoints by following links makes perfect sense, but surely there's benefit to keeping some consistency between observatories? E.g. so that any scripts or front-ends that use an observatory's API would work just as well with other observatories' without having to change much more than the base URL.

@SiBell
Contributor Author

SiBell commented Feb 3, 2020

At the risk of getting carried away, I have another two modifiers that would be useful:

  • __in, e.g. ?inDeployments__in=weather-stations,aq-sensors&observedProperty__in=air-temperature,water-temperature.
  • __begins, e.g. ?name__begins=michae. So a bit like __contains, but the substring must be at the start. Comes in handy for autocomplete form fields.
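
As a rough sketch of how a server might pick these apart, assuming Python on the backend: the parse_filters function, the "eq" default for keys without a modifier and the LIST_MODIFIERS set are all hypothetical names chosen for illustration.

```python
from typing import List, Tuple, Union
from urllib.parse import parse_qsl

LIST_MODIFIERS = {"in"}  # modifiers whose value is a comma-separated list


def parse_filters(query_string: str) -> List[Tuple[str, str, Union[str, List[str]]]]:
    """Split each query parameter into (field, modifier, value)."""
    filters = []
    for key, value in parse_qsl(query_string):
        # The double underscore separates the field from its modifier.
        field, _, modifier = key.partition("__")
        modifier = modifier or "eq"  # no modifier means a plain equality filter
        parsed = value.split(",") if modifier in LIST_MODIFIERS else value
        filters.append((field, modifier, parsed))
    return filters
```

For example, inDeployments__in=weather-stations,aq-sensors&name__begins=michae would come out as an "in" filter holding a two-item list and a "begins" filter holding the string "michae".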

@SiBell
Contributor Author

SiBell commented Mar 31, 2020

An __includes modifier would come in handy for selecting resources for which the provided item occurs within an array property.

For example an observation might have a flag property {flag: ['persistence', 'upperbound']}.

Then to query for all observations that have been flagged as breaching a climatic upper bound you can use:

/observations?flag__includes=upperbound
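
The intended semantics, sketched in Python against an in-memory list. The sample observations and the includes helper are made up for illustration, and treating a missing flag property as "no match" is an assumption.

```python
observations = [
    {"id": "obs-1", "flag": ["persistence", "upperbound"]},
    {"id": "obs-2", "flag": ["persistence"]},
    {"id": "obs-3"},  # no flag property at all
]


def includes(records, field, item):
    """Keep records whose array property `field` contains `item`."""
    return [record for record in records if item in record.get(field, [])]
```

Here includes(observations, "flag", "upperbound") keeps only obs-1, which is what flag__includes=upperbound is meant to return.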

@SiBell
Contributor Author

SiBell commented Apr 27, 2020

I've found myself using a query parameter called search. E.g.

/platforms?search=lamppost

It behaves a little bit like the __contains except it searches across more than one field. In my case it will typically search both the id and the name for any keyword matches. Mentioning it in case it's something others see themselves using and therefore worthy of adding to the docs.
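
Roughly, in Python terms, the behaviour is something like the sketch below. The sample platforms are invented, and matching case-insensitively is an assumption of the sketch rather than something the docs would have to require.

```python
platforms = [
    {"id": "lamppost-42", "name": "Grey Street north"},
    {"id": "station-7", "name": "Lamppost outside the library"},
    {"id": "buoy-3", "name": "River Tyne buoy"},
]


def search(records, keyword, fields=("id", "name")):
    """Keep records where the keyword appears in any of the given fields."""
    needle = keyword.lower()  # assumption: matching ignores case
    return [
        record
        for record in records
        if any(needle in str(record.get(field, "")).lower() for field in fields)
    ]
```

So search=lamppost picks up both the platform whose id contains the keyword and the one whose name does.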

@SiBell
Contributor Author

SiBell commented Jun 2, 2020

Another addition, as discussed on the technical call today: __not, for when we want to exclude something or perform the opposite of a filter.

For example:

/observations?unit__not=uo:kelvin

This will exclude observations given in the unit Kelvin.

Another example:

/observations?resultTime__not__gte=2020-01-01

This would be the opposite of resultTime__gte. Although this is a bad example as we could just use resultTime__lt.

We'd also want to be able to provide a comma-separated list, e.g.:

/observations?unit__not=uo:kelvin,uo:fahrenheit

Although thinking about it, the right way to do this might be in combination with the __in modifier mentioned above, i.e.

/observations?unit__not__in=uo:kelvin,uo:fahrenheit

Because the __in modifier basically implies that the query parameter value will be an array.
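
A sketch of how __not could chain with another modifier, in Python. The parse_key and matches helpers and the small OPERATORS table are hypothetical, and the gte comparison relies on both sides being ISO 8601 strings in the same format, which is an assumption of this sketch.

```python
def parse_key(key):
    """Split e.g. 'unit__not__in' into ('unit', True, 'in')."""
    field, *modifiers = key.split("__")
    negate = bool(modifiers) and modifiers[0] == "not"
    if negate:
        modifiers = modifiers[1:]
    operator = modifiers[0] if modifiers else "eq"
    return field, negate, operator


OPERATORS = {
    "eq": lambda actual, value: actual == value,
    "in": lambda actual, value: actual in value.split(","),
    # Lexical comparison: only valid for same-format ISO 8601 strings.
    "gte": lambda actual, value: actual >= value,
}


def matches(record, key, value):
    """Evaluate one query parameter against one record, honouring __not."""
    field, negate, operator = parse_key(key)
    result = OPERATORS[operator](record[field], value)
    return not result if negate else result
```

With this approach unit__not is simply the negation of the equality filter, and unit__not__in negates the __in filter, so the comma-separated list only ever appears alongside __in.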
