Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lookup with missing attributes #685

Closed
brainopia opened this issue Dec 12, 2016 · 5 comments · Fixed by #715
Closed

Lookup with missing attributes #685

brainopia opened this issue Dec 12, 2016 · 5 comments · Fixed by #715
Assignees

Comments

@brainopia
Copy link
Contributor

Hello,

As I've previously mentioned in #611 (comment), currently lookup works only if record-attribute-value or record-attribute-value-node specified. If any of record/attribute/value is missing then lookup will break.

There are multiple ways to fix it.

For example, substitute missing parts with new variables brainopia@0d8209a. But then unnecessary variables will be solved for.

Or introduce hidden variables brainopia@c32bbd1 then only intermediate hidden variables that are needed will be solved for. But there is still overhead from extra round solving wasted on intermediate variables instead of solving directly appropriate variables at once.

A faster approach would look like brainopia@692024d but since it relies on full-scan when missing variable is solved for, it'd need a full-scan optimization like @ibdknox mentioned at #611 (review). If we decide to pursue this approach I'll update full-scan implementation with appropriate index selection. The main disadvantage is that it requires more changes compared with previous approaches.

So which way would you prefer to fix this? I'll prepare the appropriate PR after you decide.

@ibdknox ibdknox self-assigned this Dec 12, 2016
@ibdknox
Copy link
Contributor

ibdknox commented Dec 12, 2016

Alternatively, we could make Scan handle partial lookups. If we adjust Scan's fully resolved check to ignore undefined e,a,v variables like it does for node, we can then change toLookupType to put in _ for cases where we don't need to know it and add those to the switch. e.g. "_*__" is lookup only the attributes. Aside from the value one, we can answer all the questions really easily given the indexes we have. We should probably special case the value only lookup ("__*_") to run through the aveIndex for each a and gather the v's.

@brainopia
Copy link
Contributor Author

Ok, let's compare a couple of approaches.

First, current code for optimized scans (which is pasted here to serve as a baseline):

       case "e***":
          this.setProposal(curIndex.eavIndex.lookup(e), this.a, scopeIx);
          break;
        case "ea**":
          this.setProposal(curIndex.eavIndex.lookup(e,a), this.v, scopeIx);
          break;
        case "eav*":
          this.setProposal(curIndex.eavIndex.lookup(e,a,v), this.node, scopeIx);
          break;
        case "*a**":
          this.setProposal(curIndex.aveIndex.lookup(a), this.v, scopeIx);
          break;
        case "*av*":
          this.setProposal(curIndex.aveIndex.lookup(a,v), this.e, scopeIx);
          break;
        case "***n":
          this.setProposal(curIndex.neavIndex.lookup(node), this.e, scopeIx);
          break;
        case "e**n":
          this.setProposal(curIndex.neavIndex.lookup(node,e), this.a, scopeIx);
          break;
        case "ea*n":
          this.setProposal(curIndex.neavIndex.lookup(node,e,a), this.v, scopeIx);
          break;

Let's compare with using '_' to denote a missing variable. The same optimized scans will look like

        case "e*__":
        case "e**_":
        case "e*_*":
        case "e***":
          this.setProposal(curIndex.eavIndex.lookup(e), this.a, scopeIx);
          break;
        case "ea*_":
        case "ea**":
          this.setProposal(curIndex.eavIndex.lookup(e,a), this.v, scopeIx);
          break;
        case "eav*":
          this.setProposal(curIndex.eavIndex.lookup(e,a,v), this.node, scopeIx);
          break;
        case "_a*_":
        case "_a**":
        case "*a*_":
        case "*a**":
          this.setProposal(curIndex.aveIndex.lookup(a), this.v, scopeIx);
          break;
        case "*av_":
        case "*av*":
          this.setProposal(curIndex.aveIndex.lookup(a,v), this.e, scopeIx);
          break;
        case "*__n":
        case "*_*n":
        case "**_n":
        case "***n":
          this.setProposal(curIndex.neavIndex.lookup(node), this.e, scopeIx);
          break;
        case "e*_n":
        case "e**n":
          this.setProposal(curIndex.neavIndex.lookup(node,e), this.a, scopeIx);
          break;
        case "ea*n":
          this.setProposal(curIndex.neavIndex.lookup(node,e,a), this.v, scopeIx);
          break;

But in addition there will be new cases where intermediate variables are missing (I mean the situation like e_*_ when at least one variable is already solved for (so an appropriate index can be picked), but the next variable that is needed is not the next one in the selected index. E.g. e_*_: just entity is bound and therefore eav-index should be used, but the next needed variable to solve for is value which is after a missing variable for attribute).

I'm separating these cases from optimized ones since we can't use an exact cardinality for proposal. Only next level of index is known in terms of cardinality. E.g. for the specific entity (therefore eav index) cardinality of attribute is known, but the cardinality of value or node is not.

If there were estimates for each level of index (like there is a cardinalityEstimate for the top-level index) then these new cases would look like optimized ones before. But I think a simple full-scan cardinality (that is cardinalityEstimate of a full index) to be used in proposal is sufficient (let me know if you think otherwise).

And if indeed cardinalityEstimate is used for proposals when there is an intermediate missing variable then using full-scan to solve proposal for all needed variables at once seems good as well (but full-scan should be optimized to use appropriate index and depth). There are rare edge-cases when it would not be optimal but it's a different topic for discussion (since **** is also solved at once via full-scan although there also could be other constraints which will be beneficial for one-by-one pass).

And if we will use a full-scan when there are missing intermediate variables then there is no benefit to using _. Since the above optimized cases could be rewritten to support missing variables without _:

        case "e***":
          indexPos = curIndex.eavIndex.lookup(e)
          indexVar = this.a
          break;
        case "ea**":
          indexPos = curIndex.eavIndex.lookup(e,a)
          indexVar = this.v
          break;
        case "eav*":
          indexPos = curIndex.eavIndex.lookup(e,a,v)
          indexVar = this.node
          break;
        case "*a**":
          indexPos = curIndex.aveIndex.lookup(a)
          indexVar = this.v
          break;
        case "*av*":
          indexPos = curIndex.aveIndex.lookup(a,v)
          indexVar = this.e
          break;
        case "***n":
          indexPos = curIndex.neavIndex.lookup(node)
          indexVar = this.e
          break;
        case "e**n":
          indexPos = curIndex.neavIndex.lookup(node,e)
          indexVar = this.a
          break;
        case "ea*n":
          indexPos = curIndex.neavIndex.lookup(node,e,a)
          indexVar = this.v
          break;
      if (indexVar) {
        this.setProposal(indexPos, indexVar, scopeIx);
      } else {
         // proposal for optimized fullscan (which will use appropriate index (indexPos) and appropriate depth)
      }

What do you think?

@ibdknox
Copy link
Contributor

ibdknox commented Dec 15, 2016

Seems like solid reasoning. I think this is a relatively rare case as is, so I'm not sure it makes too much of a difference anyways. I agree that we should try to stick to the simpler approach, which is what it sounds like you're advocating for :)

As an aside, we should make **** just use the ave index since it has the smallest initial set of values (there's likely to be fewer attributes than anything else).

@brainopia
Copy link
Contributor Author

Good idea with ave-index 👍

Regarding the simplest approach, it will depend on whether an optimized full-scan (which will pick the best index based on present variables) will be useful in other contexts.

If yes, then adding support for partial lookups via full-scan will be simple (the complex part is optimized full-scan) and efficient way to go. If no, then usual variables are the simplest approach (even though not efficient).

Judging by your previous comment in another PR it seems optimized full-scan will be useful. But still would like to have your official confirmation on this :-)

@brainopia
Copy link
Contributor Author

Nevermind, I'll send the PR today, and you'll decide then :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants