Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix state col offset #891

Merged
merged 8 commits into from
May 5, 2017
Merged

Fix state col offset #891

merged 8 commits into from
May 5, 2017

Conversation

jingjingwang
Copy link
Contributor

@jingjingwang jingjingwang commented May 4, 2017

This PR fixes two problems: 1. when STATE is used as input for an aggregate with group by, the StateExpression needs to be aware of its column index offset. 2. add a special check for count(*) on an empty table to return 0, but do not return anything if there are more aggregates.

Travis will be fixed by uwescience/raco#558.

Copy link
Contributor

@senderista senderista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few nitpicks

@@ -115,6 +115,7 @@ public PythonUDFEvaluator(
pyWorker = new PythonWorker();
pyWorker.sendCodePickle(fs.getBinary(), columnIdxs.length, outputType, isMultiValued);
buffer = new TupleBuffer(stateSchema);
groups = new IntObjectHashMap<IntArrayList>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this not properly initialized before?

Copy link
Contributor Author

@jingjingwang jingjingwang May 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we missed it. ( @parmitam also found it)

@@ -361,6 +361,5 @@ public final void put(
final int destColumn, final ReadableColumn sourceColumn, final int sourceRow) {
checkPutIndex(destColumn);
TupleUtils.copyValue(sourceColumn, sourceRow, this, destColumn);
columnPut(destColumn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't immediately see the redundant call to columnPut(); where was it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in TupleUtils.copyValue, this function calls putXXX()s which call columnPut()

@@ -46,6 +47,8 @@
protected final int[] gfields;
/** Buffer for restoring results. */
protected TupleBatchBuffer resultBuffer;
/** If we have outputted 0 as the count for count(*) on an empty relation. */
private boolean COUNTALL_ON_EMPTY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a constant, so shouldn't be named like one.

@@ -102,6 +105,16 @@ protected TupleBatch fetchNextReady() throws DbException {
tb = child.nextReady();
}
if (child.eos()) {
if (!COUNTALL_ON_EMPTY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more broadly useful to have a boolean getter on Operator indicating whether an operator had ever returned any results (i.e. nextReady() != null)? If that could be reused elsewhere, it would be better than an ad-hoc flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, I switched to use Operator.numOutputTuples instead.

@@ -102,6 +105,16 @@ protected TupleBatch fetchNextReady() throws DbException {
tb = child.nextReady();
}
if (child.eos()) {
if (!COUNTALL_ON_EMPTY
&& gfields.length == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check might be a little clearer if it were encapsulated in a helper.

@senderista
Copy link
Contributor

This looks good, but to be super-nitpicky (and this is purely a matter of taste), I think the logic would be clearer if the helper only tested whether CountAll was the sole aggregate, so the check for emitting a 0 with no results would look like:

private boolean isCountAllOnlyAggregate() {
  return gfields.length == 0
      && internalAggs.size() == 1
      && internalAggs.get(0) instanceof PrimitiveAggregator
      && ((PrimitiveAggregator) (internalAggs.get(0))).aggOp == AggregationOp.COUNT;
}

if (getNumOutputTuples() == 0 && groupStates.numTuples() == 0 && isCountAllOnlyAggregate()) {
...
}

@parmitam
Copy link
Contributor

parmitam commented May 5, 2017

This has broken stateful agg for pythonUDFs. it should be sending a tuple list, but it is sending a tuple at a time again...

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling c5804f6 on fix_state_col_offset into ** on master**.

@jingjingwang
Copy link
Contributor Author

From the code https://github.com/uwescience/myria/blob/fix_state_col_offset/src/edu/washington/escience/myria/expression/evaluate/PythonUDFEvaluator.java#L181 it seems we still send a list of tuples. Is there a quick ref for the error log?

@parmitam
Copy link
Contributor

parmitam commented May 5, 2017

There is no failure, it just sends one tuple at a time. Also, it just send the tuple not the state -- this is a big deal, as I python can't update the state, if it is not sent everytime. Sending a tuple at a time woudl be fine just slow, but not sending the state is breaking the uda functionality-- it cannot calc an aggregate.
( I am printing what is being sent to the python process so not sure what I can provide to help)

@senderista
Copy link
Contributor

Looks good from my end, I guess we'll have to wait for the Python issue to be resolved...

@parmitam
Copy link
Contributor

parmitam commented May 5, 2017

Python issue resolved :)
There were some changes in semantics that were throwing things off. everything works!
Looks great to me!

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling b5d8681 on fix_state_col_offset into ** on master**.

@senderista senderista merged commit 5b992a0 into master May 5, 2017
@senderista senderista deleted the fix_state_col_offset branch May 5, 2017 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants