Reintroducing SPECTQL #140

Closed
pietercolpaert opened this issue Dec 7, 2013 · 12 comments
@pietercolpaert
Member

Problem

The DataTank has a query language specifically designed to query its resources. The language is described in this master's thesis (in Dutch).

What do we want to support

The query language should be used to:

  • filter (e.g., show only data in the vicinity of point X)
  • order (e.g., natural sort on a certain field)
  • select (e.g., only select a couple of fields)
  • limit and offset (already supported in the normal interface)

This should work both for resources that contain a lot of data and for resources that only contain 1 page.

This means we are going to throw away the aggregation functionality. If you want to aggregate data over different resources, using tdt/input is advised.

Suggested solution

An abstract syntax tree has to be available to the entire code base. This syntax tree contains a todo list of where a filter, order by, select and/or limit and offset have to be applied. When the source is, for instance, a SQL interface, it's easy to translate this todo list into extra parameters for the SQL query that's going to be sent to the server. If the source reader is a scraper, none of the items in the todo list can be executed at read time.

Once the resource has been read and there are still todo items in the list, the rest of the todo list (abstract syntax tree) is executed in memory on the resulting PHP object.
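A minimal sketch of what executing the leftover todo items in memory could look like; the operation names and array shape are made up for illustration and don't reflect the actual tdt/core structures:

```php
<?php

// Hypothetical sketch only: apply the leftover "todo list" operations of a
// parsed SPECTQL query in memory, on rows that a source reader already returned.
function applyTodoList(array $todo, array $rows)
{
    foreach ($todo as $op) {
        switch ($op['op']) {
            case 'filter':  // keep rows whose field equals the given value
                $rows = array_values(array_filter($rows, function ($row) use ($op) {
                    return $row[$op['field']] == $op['value'];
                }));
                break;
            case 'order':   // natural sort on one field
                usort($rows, function ($a, $b) use ($op) {
                    return strnatcmp((string) $a[$op['field']], (string) $b[$op['field']]);
                });
                break;
            case 'select':  // keep only the requested fields
                $rows = array_map(function ($row) use ($op) {
                    return array_intersect_key($row, array_flip($op['fields']));
                }, $rows);
                break;
            case 'limit':   // limit and offset
                $rows = array_slice($rows, $op['offset'], $op['limit']);
                break;
        }
    }
    return $rows;
}
```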

We should be able to reuse a lot of code from the TDT 2.0 release.

Documentation and valorisation

This will not be one night's work. We need to document this in the docs, we need a UI (see #22), we need support for GEO features, and so on.

Valorisation should happen with a nice blog post when implemented (April?). I might write a paper about the semantic aspects of SPECTQL for ISWC 2014 (/cc @mielvds @andimou, deadline ~April/June).

(Feature stake holders are: FlatTurtle, ITS Belgium and VDAB)

@ghost ghost assigned coreation Dec 7, 2013
@coreation
Member

I hear the arguments for keeping some basic querying in memory available, but I never see this happen apart from select (dropping fields that you don't want to see) and limit/offset, which is already implemented.

The argument against doing this for filter and order really boils down to one large argument, and it's the same one as for not supporting grouping functions: filtering (and especially geo functionality) is intensive, depending on the filter you use. Let's just assume that you're working with large datasets (>= 10 MB, which isn't even that big or uncommon).

If you want to filter, you'll have to read ALL of the lines (to maintain paging) and apply the filter to every object; geo makes it even harder to calculate. Sorting is less intensive, as it only compares values, but paging will still have to be applied, so again ALL entries have to be read.

Yes, we can totally re-use code (although it's still going to need some refactoring to make it open-source viable: complicated structure, not much commentary), but that's not the point here :).

I'd suggest making a split between field selection and limit/offset (already present), and actual querying, e.g. altering results or the way results are returned.

Now for the final part: I don't think querying was working for hierarchical objects, which is also something "weird" to do imho. I just think any querying-related stuff should be handled by a system designed to filter, not by an in-memory data adapter.

@pietercolpaert
Member Author

Rephrasing your problem statement

I'll try to rephrase your (valid) argument in order to make sure that I understood correctly:

Cases

When reading data from a certain source, there are a couple of cases:

  1. the data that has been read fits in memory and does not have to be paged
  2. the data doesn't fit in memory and has to be paged
    1. The source reader supports ordering, filtering, selecting and limiting (e.g., SQL, SPARQL, MongoDB...)
    2. The source reader doesn't support ordering, filtering, selecting and limiting
      1. The source reader works streamingly (e.g., CSV, XLS(?), SHP, XML...)
      2. The source reader doesn't work streamingly (e.g., a bad implementation of something? Is there an example of this?)

The problem

No problem:

  • 1: ordering, filtering, selecting and limiting can happen in memory
  • 2.i: ordering, filtering, selecting and limiting can be resolved by the external source
  • 2.ii.a: selecting, limiting & offsetting work fine

Problem:

  • 2.ii.a: ordering and filtering cannot happen
  • 2.ii.b: ordering, filtering, selecting and limiting cannot happen. Just showing the resource through the normal interface will hurt the server.

Suggested solution(s)

Case 2.ii.a: the content doesn't fit in memory and the reader works streamingly

We can support limiting and offsetting, and we can support selecting fields. From the moment the todo list (or tree) contains a filter or an order by, we can throw a 500 and say that this is not supported for this source.
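A rough sketch of that check, assuming the parsed query is available as a todo list and the reader can tell us whether it is streaming (neither name is real tdt/core API):

```php
<?php

// Illustrative sketch only: reject filter/order on streaming sources with a 500.
// $todo comes from the query parser, $readerIsStreaming from the source reader;
// both are assumed names, not actual tdt/core structures.
function assertSupported(array $todo, $readerIsStreaming)
{
    foreach ($todo as $op) {
        if ($readerIsStreaming && in_array($op['op'], ['filter', 'order'], true)) {
            throw new Exception(
                'Filtering and ordering are not supported for this source type.',
                500
            );
        }
    }
}
```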

Case 2.ii.b: The content doesn't fit in memory and the code doesn't work streamingly

This should be fixed by tdt/core right now as well. I would suggest throwing a 500 error in this case: the server has been configured to read a file that is simply too big.

Should we open a new issue for this case?

Unresolved issue

How do we know when a source will give something that falls under case 1 and when in case 2?

@coreation
Member

Great analysis, but my point and answer is the following: acceptable execution time.
For example, to read and page a shape file of about 8 MB (and thus streamingly read every record up to the last one so proper paging can be applied), you'll have to wait about 20-30 seconds (iirc; we played around with some shp's we found on some of the European data portals). That's just for reading alone; we're not even talking about applying a processor on it (sorting, filtering).

Furthermore, the question we really (!!!) need to think about before doing all this is what we want our query endpoint to be. If you want to provide 2 ways of querying (non-aggregation vs aggregation), then this will have to be very clear to the user (how would that best be communicated?), which kinda grinds my gears since people will sometimes be able to group and sometimes they won't, but on the same endpoint. Which isn't transparent.

@pietercolpaert
Member Author

How do you define aggregation? Joining over 2 or more different resources? Or do you mean faceting and group by?

You always read the entire file in order to page it? So if a 1GB CSV file is given, you will read the entire file, even if paged?

Do you mean with different endpoints, the SPECTQL one and our regular tdt/identifier?

@coreation
Member

Aggregation: both!

Yes, we read the entire thing, which is different from loading it into memory: we keep things paged (e.g. keep lines 500-1000) and count the rest to let the user know what the last page is. -> Offering a last page isn't something we should make optional, now that I come to think of it, so yes, we read the entire file.
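Roughly, for a CSV source, that streaming-with-paging behaviour could look like this sketch (the function name and page maths are illustrative, not the actual reader code):

```php
<?php

// Rough sketch, not the actual CSV reader: stream the whole file once, keep
// only the rows of the requested page in memory, but keep counting so the
// total number of rows (and thus the last page) can be reported.
function readPage($path, $page, $pageSize)
{
    $handle = fopen($path, 'r');
    $header = fgetcsv($handle);

    $start = ($page - 1) * $pageSize;   // e.g. page 2, size 500 -> rows 500-999
    $rows  = [];
    $total = 0;

    while (($line = fgetcsv($handle)) !== false) {
        if ($total >= $start && count($rows) < $pageSize) {
            $rows[] = array_combine($header, $line);
        }
        $total++;                       // keep counting, even past the requested page
    }
    fclose($handle);

    return [
        'data'      => $rows,
        'total'     => $total,
        'last_page' => (int) ceil($total / $pageSize),
    ];
}
```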

endpoint = 1 query endpoint, or in other words: if I go to the endpoint, I'll be able to use identifier(s) and perform a query on the data they represent.

@pietercolpaert
Member Author

  • I want to drop support for aggregations over different resources. A group by should be possible in O(n) (with n being the number of records in e.g. a shp file), right? So why not support it? (See the sketch after this list.)
  • Good point. Files bigger than a certain number of megabytes need to be split or need to be put in an intermediate store.
  • Still not sure what point you're trying to make with the endpoints. I only want 1 spectql endpoint, which in case 2.ii throws a 500 error since the source does not support this.
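A minimal sketch of such a single-pass group by (the field name in the usage example is invented):

```php
<?php

// Minimal sketch of an O(n) group by: one pass over the records, bucketing
// them by the value of a single field.
function groupBy(array $records, $field)
{
    $groups = [];
    foreach ($records as $record) {          // single pass: O(n)
        $key = $record[$field];
        if (!isset($groups[$key])) {
            $groups[$key] = [];
        }
        $groups[$key][] = $record;
    }
    return $groups;
}

// e.g. groupBy($shpRecords, 'municipality') buckets all shapes per municipality.
```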

@coreation
Member

All valid points until the third one, namely "if the source does not support this": weren't we always the adapter that made the source of the data irrelevant to the end user?

I think we have two options here:

  1. Make SPECTQL the way it was, with selection of fields, filtering with basic regexes and paging: this will work for mediocre files (2 MB-ish to 5 MB-ish); for anything more, experience has taught us that the response takes too long and sometimes makes PHP's memory go nuts.

A big (!) caveat here, though, is yet again the situation where (e.g. on the old appsforghent portal) people are sending 3K requests a day; a peak of requests, each with different filters on a "mediocre" file, might push PHP's memory into overload. So my concern is not only whether querying will work on smaller files; the efficiency of the datatank instance will also depend on the amount of queries. Which is why querying (regexes, geo filtering, ...) should be done in a system designed for it.

  2. Next to the normal in-memory SPECTQL (of which I've explained my concerns), you can have a different one that resides in tdt/input and puts objects in a document store, providing full-text search, faceted search, ... all built-in querying functionality designed to run at a tremendous efficiency level.

Now, if I were a datatank interestee, I'd just pick option 2 and think "why is option 1 even still there"... no?

@pietercolpaert
Member Author

Now, if I were a datatank interestee, I'd just pick option 2 and think "why is option 1 even still there"... no?

From a perspective of relatively static data, I would agree with what you said in 2).

What you described in option 1) is, however, very relevant for data sources that get updated every minute and/or need required parameters. For example, SPECTQL could be used to get the next 5 trains without any delay heading towards "Oostende", from a resource with the train departures in a certain station.
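In memory, that use case boils down to a filter, a sort and a limit over a small, frequently refreshed resource; a hedged sketch, with invented field names ('delay', 'direction', 'departure'):

```php
<?php

// Hedged sketch of the train example: from an in-memory list of departures,
// keep the ones without delay heading towards Oostende, and return the next 5.
// 'departure' is assumed to be a Unix timestamp; all field names are invented.
function nextTrains(array $departures)
{
    $matches = array_values(array_filter($departures, function ($d) {
        return $d['delay'] == 0 && $d['direction'] === 'Oostende';
    }));

    usort($matches, function ($a, $b) {       // soonest departure first
        return $a['departure'] - $b['departure'];
    });

    return array_slice($matches, 0, 5);       // the 5 next trains
}
```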

@coreation
Member

True, so another dimension of this discussion is time-critical vs non-time-critical?

Edit: note that this train resource can also be a huge file, and not necessarily the result of a scraper or something.

@coreation
Member

If nothing is replied (@pietercolpaert), the following will be done:

  • re-integrate spectql
  • provide messages when files are too large for spectql to handle
  • analyse how hierarchical structures can be queried using spectql (this was assumed to work, but hasn't been tested much)

@pietercolpaert
Member Author

Agreed!

@coreation
Member

Basic functionality is back, but it needs some fine-tuning. Those fine-tuning items will be handled in new issues; closing this one. (2f0d7c6)
