Reintroducing SPECTQL #140

Closed
pietercolpaert opened this issue Dec 7, 2013 · 12 comments
@pietercolpaert
Member

Problem

The DataTank has a query language specifically designed to query its resources. The language is described in this master's thesis (in Dutch).

What do we want to support

The query language should be used to:

  • filter (e.g., show only data in the vicinity of point X)
  • order (e.g., natural sort on a certain field)
  • select (e.g., only select a couple of fields)
  • limit and offset (already supported in the normal interface)

This should work both for resources that contain a lot of data and for resources that only contain 1 page.

This means we are going to throw away the aggregation functionality. If you want to aggregate data over different resources, using tdt/input is advised.

Suggested solution

An abstract syntax tree has to be available to the entire code base. This syntax tree contains a todo list of where a filter, order by, select and/or limit and offset have to be applied. When the source is, for instance, a SQL interface, it's easy to translate this todo list into extra parameters for the SQL query that's going to be sent to the server. If the source reader is a scraper, none of the items in the todo list can be executed at read time.

Once the resource has been read and there are still todo items in the list, the rest of the todo list (abstract syntax tree) is executed in memory on the resulting PHP object.
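A minimal sketch of what executing the leftover todo items in memory could look like; the operation names and array shape are made up for illustration and don't reflect the actual tdt/core structures:

```php
<?php

// Hypothetical sketch only: apply the leftover "todo list" operations of a
// parsed SPECTQL query in memory, on rows that a source reader already returned.
function applyTodoList(array $todo, array $rows)
{
    foreach ($todo as $op) {
        switch ($op['op']) {
            case 'filter':  // keep rows whose field equals the given value
                $rows = array_values(array_filter($rows, function ($row) use ($op) {
                    return $row[$op['field']] == $op['value'];
                }));
                break;
            case 'order':   // natural sort on one field
                usort($rows, function ($a, $b) use ($op) {
                    return strnatcmp((string) $a[$op['field']], (string) $b[$op['field']]);
                });
                break;
            case 'select':  // keep only the requested fields
                $rows = array_map(function ($row) use ($op) {
                    return array_intersect_key($row, array_flip($op['fields']));
                }, $rows);
                break;
            case 'limit':   // limit and offset
                $rows = array_slice($rows, $op['offset'], $op['limit']);
                break;
        }
    }
    return $rows;
}
```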

We should be able to reuse a lot of code from the TDT 2.0 release.

Documentation and valorisation

This will not be one night's work. We need to document this in the docs, we need a UI (see #22), we need support for GEO features, and so on.

Valorisation should happen with a nice blog post when implemented (April?). I might write a paper about the semantic aspects of SPECTQL for ISWC 2014 (/cc @mielvds @andimou, deadline ~April/June).

(Feature stake holders are: FlatTurtle, ITS Belgium and VDAB)

@ghost ghost assigned coreation Dec 7, 2013
@coreation
Member

I hear the arguments for keeping some basic querying in memory available, but I never see this happen apart from select (dropping fields that you don't want to see) and limit/offset, which is already implemented.

The argument against doing this for filter and order really boils down to one large argument, and it's the same one as for not supporting grouping functions: filtering (and especially geo functionality) is intensive, depending on the filter you use. Let's just assume that you're working with large datasets (>= 10 MB, which isn't even that big or uncommon).

If you want to filter, you'll have to read ALL of the lines (to maintain paging) and apply the filter to every object; geo makes it even harder to calculate. Sorting is less intensive, as it only compares values, but paging will still have to be applied, so again ALL entries have to be read.

Yes, we can totally re-use code (although it's still going to need some refactoring to make it open-source viable: complicated structure, not much commentary), but that's not the point here :).

I'd suggest making a split between field selection and limit/offset (already present), and actual querying, e.g. altering results or the way results are returned.

Now for the final part: I don't think querying was working for hierarchical objects, which is also something "weird" to do imho. I just think any querying-related stuff should be handled by a system designed to filter, not by an in-memory data adapter.

@pietercolpaert
Member Author

Rephrasing your problem statement

I'll try to rephrase your (valid) argument in order to make sure that I understood correctly:

Cases

When reading data from a certain source, there are a couple of cases:

  1. the data that has been read fits in memory and does not have to be paged
  2. the data doesn't fit in memory and has to be paged
    1. The source reader supports ordering, filtering, selecting and limiting (e.g., SQL, SPARQL, MongoDB...)
    2. The source reader doesn't support ordering, filtering, selecting and limiting
      1. The source reader works streamingly (e.g., CSV, XLS(?), SHP, XML...)
      2. The source reader doesn't work streamingly (e.g., a bad implementation of something? Is there an example of this?)

The problem

No problem:

  • 1: ordering, filtering, selecting and limiting can happen in memory
  • 2.i: ordering, filtering, selecting and limiting can be resolved by the external source
  • 2.ii.a: selecting, limiting & offsetting work fine

Problem:

  • 2.ii.a: ordering and filtering cannot happen
  • 2.ii.b: ordering, filtering, selecting and limiting cannot happen. Just showing the resource through the normal interface will hurt the server.

Suggested solution(s)

Case 2.ii.a: the content doesn't fit in memory and the reader works streamingly

We can support limiting and offsetting, and we can support selecting fields. From the moment the todo list (or tree) contains a filter or an order by, we can throw a 500 and say that this is not supported for this source.
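A rough sketch of that check, assuming the parsed query is available as a todo list and the reader can tell us whether it is streaming (neither name is real tdt/core API):

```php
<?php

// Illustrative sketch only: reject filter/order on streaming sources with a 500.
// $todo comes from the query parser, $readerIsStreaming from the source reader;
// both are assumed names, not actual tdt/core structures.
function assertSupported(array $todo, $readerIsStreaming)
{
    foreach ($todo as $op) {
        if ($readerIsStreaming && in_array($op['op'], ['filter', 'order'], true)) {
            throw new Exception(
                'Filtering and ordering are not supported for this source type.',
                500
            );
        }
    }
}
```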

Case 2.ii.b: The content doesn't fit in memory and the code doesn't work streamingly

This should be fixed by tdt/core right now as well. I would suggest throwing a 500 error in this case: the server has been configured to read a file that is simply too big.

Should we open a new issue for this case?

Unresolved issue

How do we know when a source will give something that falls under case 1 and when in case 2?

@coreation
Member

Great analysis, but my point and answer is the following: acceptable execution time.
For example, to read and page a shape file of about 8 MB (and thus streamingly read every record up to the last one so proper paging can be applied), you'll have to wait about 20-30 seconds (iirc; we played around with some shp's we found on some of the European data portals). That's just for reading alone; we're not even talking about applying a processor on it (sorting, filtering).

Furthermore, the question we really (!!!) need to think about before doing all this is what we want our query endpoint to be. If you want to provide 2 ways of querying (non-aggregation vs aggregation), then this will have to be very clear to the user (how would that best be communicated?), which kinda grinds my gears since people will sometimes be able to group and sometimes they won't, but on the same endpoint. Which isn't transparent.

@pietercolpaert
Member Author

How do you define aggregation? Joining over 2 or more different resources? Or do you mean faceting and group by?

You always read the entire file in order to page it? So if a 1GB CSV file is given, you will read the entire file, even if paged?

Do you mean with different endpoints, the SPECTQL one and our regular tdt/identifier?

@coreation
Member

Aggregation: both!

Yes, we read the entire thing, which is different from loading it into memory: we keep things paged (e.g. keep lines 500-1000) and count the rest to let the user know what the last page is. -> Offering a last page isn't something we should make optional, now that I come to think of it, so yes, we read the entire file.
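Roughly, for a CSV source, that streaming-with-paging behaviour could look like this sketch (the function name and page maths are illustrative, not the actual reader code):

```php
<?php

// Rough sketch, not the actual CSV reader: stream the whole file once, keep
// only the rows of the requested page in memory, but keep counting so the
// total number of rows (and thus the last page) can be reported.
function readPage($path, $page, $pageSize)
{
    $handle = fopen($path, 'r');
    $header = fgetcsv($handle);

    $start = ($page - 1) * $pageSize;   // e.g. page 2, size 500 -> rows 500-999
    $rows  = [];
    $total = 0;

    while (($line = fgetcsv($handle)) !== false) {
        if ($total >= $start && count($rows) < $pageSize) {
            $rows[] = array_combine($header, $line);
        }
        $total++;                       // keep counting, even past the requested page
    }
    fclose($handle);

    return [
        'data'      => $rows,
        'total'     => $total,
        'last_page' => (int) ceil($total / $pageSize),
    ];
}
```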

endpoint = 1 query endpoint, or in other words: if I go to the endpoint, I'll be able to use identifier(s) and perform a query on the data they represent.

@pietercolpaert
Member Author

  • I want to drop support for aggregations over different resources. A group by should be possible in O(n) (with n being the number of records in e.g. a shp file), right? So why not support it? (See the sketch after this list.)
  • Good point. Files bigger than a certain number of megabytes need to be split or need to be put in an intermediate store.
  • Still not sure what point you're trying to make with the endpoints. I only want 1 spectql endpoint, which in case 2.ii throws a 500 error since the source does not support this.
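A minimal sketch of such a single-pass group by (the field name in the usage example is invented):

```php
<?php

// Minimal sketch of an O(n) group by: one pass over the records, bucketing
// them by the value of a single field.
function groupBy(array $records, $field)
{
    $groups = [];
    foreach ($records as $record) {          // single pass: O(n)
        $key = $record[$field];
        if (!isset($groups[$key])) {
            $groups[$key] = [];
        }
        $groups[$key][] = $record;
    }
    return $groups;
}

// e.g. groupBy($shpRecords, 'municipality') buckets all shapes per municipality.
```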

@coreation
Member

All valid points until the third one, namely "if the source does not support this": weren't we always the adapter that made the source of the data irrelevant to the end user?

I think we have two options here:

  1. Make SPECTQL the way it was, with selection of fields, filtering with basic regexes and paging: this will work for mediocre files (2 MB-ish to 5 MB-ish); for anything more, experience has taught us that the response takes too long and sometimes makes PHP's memory go nuts.

A big (!) caveat here, though, is yet again the situation where (e.g. on the old appsforghent portal) people are sending 3K requests a day; a peak of requests, each with different filters on a "mediocre" file, might push PHP's memory into overload. So my concern is not only whether querying will work on smaller files; the efficiency of the datatank instance will also depend on the amount of queries. Which is why querying (regexes, geo filtering, ...) should be done in a system designed for it.

  2. Next to the normal in-memory SPECTQL (of which I've explained my concerns), you can have a different one that resides in tdt/input and puts objects in a document store, providing full-text search, faceted search, ... all built-in querying functionality designed to run at a tremendous efficiency level.

Now, if I were a datatank interestee, I'd just pick option 2 and think "why is option 1 even still there"... no?

@pietercolpaert
Member Author

Now, if I were a datatank interestee, I'd just pick option 2 and think "why is option 1 even still there"... no?

From a perspective of relatively static data, I would agree with what you said in 2).

What you described in option 1) is, however, very relevant for data sources that get updated every minute and/or need required parameters. For example, SPECTQL could be used to get the next 5 trains without any delay heading towards "Oostende", from a resource with the train departures in a certain station.
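In memory, that use case boils down to a filter, a sort and a limit over a small, frequently refreshed resource; a hedged sketch, with invented field names ('delay', 'direction', 'departure'):

```php
<?php

// Hedged sketch of the train example: from an in-memory list of departures,
// keep the ones without delay heading towards Oostende, and return the next 5.
// 'departure' is assumed to be a Unix timestamp; all field names are invented.
function nextTrains(array $departures)
{
    $matches = array_values(array_filter($departures, function ($d) {
        return $d['delay'] == 0 && $d['direction'] === 'Oostende';
    }));

    usort($matches, function ($a, $b) {       // soonest departure first
        return $a['departure'] - $b['departure'];
    });

    return array_slice($matches, 0, 5);       // the 5 next trains
}
```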

@coreation
Member

True, so another dimension of this discussion is time-critical vs non-time-critical?

Edit: note that this train resource can also be a huge file, and not necessarily the result of a scraper or something.

@coreation
Member

If nothing is replied (@pietercolpaert), the following will be done:

  • re-integrate spectql
  • provide messages when files are too large for spectql to handle
  • analyse how hierarchical structures can be queried using spectql (this was assumed to work, but hasn't been tested much)

@pietercolpaert
Member Author

Agreed!

@coreation
Member

Basic functionality is back, but it needs some fine-tuning. Those fine-tuning items will be handled in new issues; closing this one. (2f0d7c6)
