Reintroducing SPECTQL #140
Comments
I hear the arguments for keeping some basic in-memory querying available, but I never see this happen apart from select (dropping fields that you don't want to see) and limit/offset, which is already implemented. The arguments against doing this for filter and order really boil down to one large argument, and it's the same one for not supporting grouping functions: filtering (and especially geo functionality) is intensive, depending on the filter you use. Let's assume that you're using large datasets (>=10 MB, which isn't even that big or uncommon). If you want to filter, you'll have to read ALL of the lines (to maintain paging) and apply the filter to every object; geo makes it even harder to calculate. Sorting is less intensive, as it only compares bytes, but paging will still have to be applied, so again ALL entries have to be read. Yes, we can totally re-use code (although it's gonna need some refactoring still to make things open-source viable: complicated structure, not much commentary), but that's not the point here :). I'd suggest making a split between field selection plus limit/offset (already present) and actual querying, i.e. altering results or the way results are returned. Now for the final part: I don't think querying was working for hierarchical objects, which is also something "weird" to do imho. I just think anything querying-related should be handled by a system designed to filter, not by an in-memory data adapter.
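To make the scan cost concrete, here is a minimal PHP sketch (a hypothetical helper, not actual tdt/core code) of why serving one page of a filtered resource still forces reading the whole file: you can't know where matching row N starts without testing every line before it.

```php
<?php
// Hypothetical sketch, not tdt/core code: even with bounded memory,
// a filtered, paged read has to scan the file from line 1.
function filteredPage($file, $filter, $offset, $limit)
{
    $handle = fopen($file, 'r');
    $page = array();
    $matched = 0;
    while (($line = fgets($handle)) !== false) {
        $row = str_getcsv($line);
        if (!call_user_func($filter, $row)) {
            continue; // every line is read and tested, even off-page ones
        }
        if ($matched >= $offset && count($page) < $limit) {
            $page[] = $row;
        }
        $matched++; // keep counting so the last page can be reported
    }
    fclose($handle);
    return array('data' => $page, 'total_matches' => $matched);
}
```

Sorting is worse still: to order the results you need all matching rows available (in memory or via an external sort) before the first page can even be produced.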
Rephrasing your problem statement

I'll try to rephrase your (valid) argument in order to make sure that I understood it correctly.

Cases

When reading data from a certain source, there are a couple of cases:
The problem

No problem:
Problem:
Suggested solution(s)

Case 2.ii.a: the content doesn't fit in memory and the reader works streamingly

We can support limiting and offsetting, and we can support selecting fields. From the moment the todo list (or tree) contains a filter or an order by, we can throw a 500 and say that this is not supported for this source.

Case 2.ii.b: the content doesn't fit in memory and the reader doesn't work streamingly

This should be fixed by tdt/core right now as well. I would suggest throwing a 500 error in this case too: the server has been configured to read a file that is simply too big. Should we open a new issue for this case?

Unresolved issue

How do we know when a source will give something that falls under case 1 and when under case 2?
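A minimal sketch of the case 2.ii.a behaviour, assuming a hypothetical flat $todoList of clause names (not the real tdt/core structures): a streaming reader can honour select/limit/offset, but has to refuse filter and order by with a 500.

```php
<?php
// Hypothetical helper: refuse todo items a streaming reader cannot execute.
function assertStreamable(array $todoList)
{
    $supported = array('select', 'limit', 'offset');
    foreach ($todoList as $clause) {
        if (!in_array($clause, $supported, true)) {
            // 500: this source cannot execute the clause streamingly
            throw new Exception(
                "Querying with '$clause' is not supported for this source", 500
            );
        }
    }
}

// assertStreamable(array('select', 'limit'));   // fine
// assertStreamable(array('select', 'filter'));  // throws a 500
```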
Great analysis, however my point and answer is the following: acceptable execution time. Furthermore, the question we really (!!!) need to think about before doing all this is what we want our query endpoint to be. If you want to provide two ways of querying (non-aggregation vs aggregation), this will have to be very clear to the user (how would they best be informed?), which kinda grinds my gears, since people will sometimes be able to group and sometimes they won't, on the same endpoint. That isn't transparent.
How do you define aggregation? Joining over two or more different resources? Or do you mean faceting and group by? Do you always read the entire file in order to page it? So if a 1 GB CSV file is given, you will read the entire file, even when paged? And do you mean different endpoints, the SPECTQL one and our regular tdt/identifier?
Aggregation: both! Yes, we read the entire thing, which is different from loading it into memory: we keep things paged (e.g. keep lines 500-1000) and count the rest to let the user know what the last page is. -> Now that I come to think of it, a last page isn't something we should offer as optional, so yes, read the entire file. Endpoint = one query endpoint; in other words, if I go to the endpoint I'll be able to use identifier(s) and perform a query on the data they represent.
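A sketch of that paging behaviour (a hypothetical helper, not the actual reader): the whole file is read, but only the requested window is kept in memory; all other lines are merely counted so the last page can be reported.

```php
<?php
// Hypothetical helper: keep only lines [$start, $end), count everything.
function pageOf($file, $start, $end)
{
    $handle = fopen($file, 'r');
    $window = array();
    $total = 0;
    while (($line = fgets($handle)) !== false) {
        if ($total >= $start && $total < $end) {
            $window[] = $line; // inside the requested page: keep it
        }
        $total++; // outside the page: only counts towards the last page
    }
    fclose($handle);
    return array('page' => $window, 'total_lines' => $total);
}
```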
All valid points up to the third one, namely "if the source does not support this": weren't we always the adapter that made the source of the data irrelevant to the end user? I think we have two options here:
A big (!) caveat here, though, is yet again the situation where (e.g. the old appsforghent portal) people are sending 3K requests a day, and a peak of requests with all different filters on a "mediocre" field might push PHP memory into overload. So my concern really is not only whether the querying will work on smaller files; the efficiency of the datatank instance will also scale with the amount of queries. Which is why querying (regexes, geo filtering, ...) should be done in a system designed for it.
Now, if I were someone interested in the datatank, I'd just pick option 2 and think "why is option 1 even still there"... no?
From the perspective of relatively static data, I would agree with what you said in option 2). What you described in option 1) is, however, very relevant for data sources that get updated every minute and/or take required parameters. For example, SPECTQL could be used to get the next 5 trains without any delay heading towards "Oostende" from a resource with train departures in a certain station.
True, so another dimension of this discussion is time-critical vs non-time-critical? Edit: note that this train resource can also be a huge file, and not necessarily the result of a scraper or something.
If there is no reply (@pietercolpaert), the following will be done: re-integrate SPECTQL.
Agreed!
Basic functionality is back; it needs some fine-tuning. That fine-tuning will be handled in new issues, so I'm closing this one. (2f0d7c6)
Problem
The DataTank has a query language designed especially for querying its resources. The language is described in this master thesis (in Dutch).
What do we want to support
The query language should be used to:
This applies both to resources that contain a lot of data and to resources that only contain one page.
This means we are going to throw away the aggregation functionality. If you want to aggregate data over different resources, using tdt/input is advised.
Suggested solution
An abstract syntax tree has to be available to the entire code base. This syntax tree contains a todo list of where a filter, order by, select and/or limit and offset have to be applied. When the source is, for instance, a SQL interface, it's easy to translate this todo list into extra parameters for the SQL query that's going to be sent to the server. If the source reader is a scraper, none of the items in the todo list can be executed at "read time".
Once the resource has been read, any remaining todo items in the list (the abstract syntax tree) are executed in memory on the resulting PHP object.
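A sketch of that flow, with hypothetical class and method names (SpectqlParser, read(), remainingClauses() and applyInMemory() are illustrative, not the real tdt/core API): the parsed query is a todo list; the reader pushes down what it can, and the leftovers run in memory on the PHP result.

```php
<?php
// Hypothetical names throughout; the shape of the flow is what matters.
$todoList = SpectqlParser::parse($queryString); // the abstract syntax tree

// A SQL reader would translate most clauses into the query it sends and
// tick them off the todo list; a scraper would tick off nothing.
$data = $sourceReader->read($todoList);

// Whatever is still on the todo list is executed in memory on the result.
foreach ($todoList->remainingClauses() as $clause) {
    $data = $clause->applyInMemory($data);
}
```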
We should be able to reuse a lot of code from the TDT 2.0 release.
Documentation and valorisation
This will not be one night's work. We need to document this in the docs, we need a UI (see #22), we need support for GEO features, and so on.
Valorisation should happen with a nice blog post when implemented (april?). I might write a paper about the semantic aspects of SPECTQL for ISWC 2014 (/cc @mielvds @andimou, deadline ~april/june).
(Feature stakeholders are: FlatTurtle, ITS Belgium and VDAB)