Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding blob data type and tests #862

Merged
merged 21 commits into from
Feb 10, 2017
Merged

adding blob data type and tests #862

merged 21 commits into from
Feb 10, 2017

Conversation

parmitam
Copy link
Contributor

No description provided.

@parmitam
Copy link
Contributor Author

Starting to push Python UDF and Blob data type changes to master.
This commit contains support for blob data type.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.06%) to 27.388% when pulling f48bb38 on blob-udf-new-merge into 41d0d45 on master.

@parmitam
Copy link
Contributor Author

This commit is to add python UDF registration support.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.04%) to 27.285% when pulling 6f69cba on blob-udf-new-merge into 41d0d45 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 27.048% when pulling bc206c0 on blob-udf-new-merge into 41d0d45 on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 27.007% when pulling ea9ec0e on blob-udf-new-merge into 41d0d45 on master.

Copy link
Contributor

@jingjingwang jingjingwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For f48bb38, I have several general comments before going into details:

  • I think the new function of getting the size of a tuple batch should be wrapped within a TupleBatch/TupleBatchBuffer/TupleBuffer/Mutable... object, Instead of calling TupleUtils with an explicit schema. The schema is determined when the TupleBatch/TupleBatchBuffer/... object is constructed, so the size should just be an internal property. With this change, some of the tuple batch building functions need to be cleaned up to use TupleBatchBuffer, instead of an explicit list of ColumnBuilders, in order to build a TupleBatch. ColumnBuilders should just be internal data structures of a TupleBatchBuffer most of the time, and their sizes are determined by the TupleBatchBuffer.
  • There are multiple names of this new data type, I have seen Bytes, Blob, and ByteBuffer all being used. It would be more clear to just stick with one. I think either Bytes or Blob is good (personally prefer Blob for its clear definition as "binary large object", while Bytes is about physical representation), but ByteBuffer is how this new data type is implemented internally so should be avoided when possible.
  • Have you tried any test case with more than one row in a BytesColumn? Just in case if there is any implementation detail that happens to rely on this assumption.

It might be more clear if you could just squash new changes with this commit once they are ready, since I believe they are independent from other python function registration commits.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 26.995% when pulling 3265300 on blob-udf-new-merge into 41d0d45 on master.

@parmitam
Copy link
Contributor Author

parmitam commented Jan 4, 2017

I have changes bytebuffer and bytes to blob. Also made batchSize member of TupleBatch, TupleBuffer etc. I have not cleaned up to use tuplebatchbuffer, which is a little bit involved ( issue#865 assigned to me).
I can't squash the two commits as new files were added in PyUDF creation that were changed

@jingjingwang
Copy link
Contributor

There are some small stuffs, such as missing Javadoc (there might be more), typo, using Java-style function names (getBatchSize()), unremoved comments, etc. But more importantly, do you plan to do #865 in this PR? Looks like there are only a few places need to be changed, e.g. JdbcAccessMethod and maybe a few more, and they would make this PR self-contained.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.4%) to 26.938% when pulling ed29c11 on blob-udf-new-merge into 41d0d45 on master.

@parmitam
Copy link
Contributor Author

parmitam commented Jan 6, 2017

Addressing code review comments.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.4%) to 26.929% when pulling 14a0e31 on blob-udf-new-merge into 41d0d45 on master.

Type.BYTES_TYPE,
"output_type",
Type.STRING_TYPE);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment similar to profiling?

@@ -470,6 +472,16 @@ public Response createFunction(final CreateFunctionEncoding encoding) throws DbE
ResponseBuilder response = Response.ok();
return response.entity(functionCreationResponse).build();
}
/**
* @param queryId an optional query ID specifying which datasets to get.
* @return a list of datasets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

/**
* Python function registrar.
*/
protected PythonFunctionRegistrar pyFuncReg;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyFuncReg is a per-worker property and should not be kept here. We can make the method getPythonFunctionRegistrar() below as a wrapper of calling Worker's getPythonFunctionRegistrar(). (and my understanding is that the return type is non-null (or can we make it non-null)? Annotate the return type with @nonnull if that's the case and get rid of all the checkings like if (pyFuncReg != null).)

public PythonFunctionRegistrar getPythonFunctionRegistrar() {
Preconditions.checkNotNull(pyFuncReg);
if (execEnvVars.get(MyriaConstants.EXEC_ENV_VAR_TEST_MODE) != null
&& execEnvVars.get(MyriaConstants.EXEC_ENV_VAR_TEST_MODE).equals("false")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you say a little more about this? Why the test mode has to be false to get a python function registrar (and seems it's never been set to be false in the whole code base)?

@@ -345,4 +411,12 @@ private void writeToStream(
throw new DbException(e);
}
}

@Override
public void sendEos() throws DbException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong but you probably don't need to write anything as PythonWorker.sendEos() closes the connection anyway, unless the python process needs this information to do something before close.

if (input != null && input.hasArray()) {
// LOGGER.info("input array buffer length" + input.array().length);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nit picking but please get rid of unnecessary empty lines. (There are some more)

@@ -352,7 +352,8 @@ private MyriaConstants() {}
public static final int PYTHON_EXCEPTION = -3;
/** python function return is null.*/
public static final int NULL_LENGTH = -5;

/** Send EOS tp python strea, is null.*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"to" and "stream". And if my comment below makes sense we can get rid of this declaration.

throws DbException {

BitSet bs;
switch (gColumnType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These types share many lines, merge them.

}
}

private void setBitset(final ReadableTable table, final int row, final Object[] groupAgg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'setBitSet'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

if (aggregators[agg]
.getClass()
.getName()
.equals(StatefulUserDefinedAggregator.class.getName())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instanceof

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


@Override
public void add(final List<TupleBatch> from, final Object state) throws DbException {
// LOGGER.info("add tuple called");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unnecessary comments, same above & below

String script = compute.append(output).toString();
LOGGER.debug("Compiling UDA {}", script);
LOGGER.info("Compiling UDA {}", script);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think debug is enough since it is not some crucial state transition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@jingjingwang
Copy link
Contributor

jingjingwang commented Jan 8, 2017

I made a pass, some of my comments are showing "outdated" but if you click on them, most of them still make sense. I feel that we may need another pass after these being resolved, in particular I think some newly added interfaces may get improved but I'd like to wait until we have a more clear version. Also, I think testing on having > 1 tuples in a blob column should help us discover many bugs. It's the best if you can add such a test. Even if it's hard, running some informal test should be helpful.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 27.105% when pulling f178f30 on blob-udf-new-merge into 48d79ca on master.

*
* @param value the value of this constant.
*/
public ConstantExpression(final ByteBuffer value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, this function is never used, and second, the return value would be a summary string of the ByteBuffer. It would cause a problem for getJavaString since it can't be parsed as some value, and also for the ConstantExpression constructor since we're expecting a direct value like in a JSON string. I think an UnsupportedOperationException makes more sense if you don't really want to output all the bytes as string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

* @return
* @throws DbException
*/
public abstract void sendEos() throws DbException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this function is never used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,4 @@
/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file is unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -14,6 +14,7 @@
import org.apache.commons.httpclient.URIException;

import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.AnonymousAWSCredentials;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import is never used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

countIdx.putInt(0, flatmapid);
flatmapid = 0;
}
if (getAddCounter()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if {} can be merged with the above one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -453,9 +453,13 @@ private void leftAndRightEqual() throws Exception {
// advance the one with the larger set of equal tuples because this produces fewer join tuples
// not exact but good approximation
final int leftSizeOfGroupOfEqualTuples =
leftRowIndex + TupleBatch.BATCH_SIZE * (leftBatches.size() - 1) - leftBeginIndex;
leftRowIndex
+ TupleUtils.getBatchSize(generateSchema()) * (leftBatches.size() - 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the schema should be the schema of the left batch, not the output of this join. They may have different schemas.

* @return PythonFunctionRegistrar for operator.
* @throws DbException in case of error.
*/
public PythonFunctionRegistrar getPythonFunctionRegistrar() throws DbException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is, pyFuncRegistrar is only used by PythonUDFEvaluator, so you don't need to add this function to Operator, and many aggregator-related methods. for example, AggUtils.allocateAggs(). Just get pyFuncRegistrar from the worker in PythonUDFEvaluator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets talk about this. Not clear to me how aggutils has access to the worker.

@@ -94,7 +99,7 @@ private void setUpdateExpressions(final List<Expression> updaterExpressions) {
}

@Override
protected TupleBatch fetchNextReady() throws DbException, InvocationTargetException {
protected TupleBatch fetchNextReady() throws DbException, InvocationTargetException, IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOException is not thrown from anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

*/
void getResult(AppendableTable dest, int destColumn, Object state) throws DbException;
void getResult(AppendableTable dest, int destColumn, Object state)
throws DbException, IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the IOException is never thrown.

@@ -53,7 +55,7 @@ public Aggregate(@Nullable final Operator child, final AggregatorFactory... aggr
}

@Override
protected TupleBatch fetchNextReady() throws DbException {
protected TupleBatch fetchNextReady() throws DbException, IOException {
Copy link
Contributor

@jingjingwang jingjingwang Feb 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IOException is never thrown if we remove the IOException from Aggregator.getResult(). (See another comment in Aggregator.java)

*/
@Override
protected TupleBatch fetchNextReady() throws DbException {
protected TupleBatch fetchNextReady() throws DbException, IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this IOException.

/** Holds the corresponding TB for each group key in {@link #groupKeys}. */
private transient List<List<TupleBatch>> tbgroupState;
/** Holds the bitset for each group key in {@link #groupKeys}. */
HashMap<Integer, BitSet> bs = new HashMap<Integer, BitSet>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a heads-up that these keys will be replace by rows in primitive columns in my refactor_aggregate PR.

@senderista
Copy link
Contributor

senderista commented Feb 9, 2017 via email

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 27.098% when pulling 28208c4 on blob-udf-new-merge into 48d79ca on master.

remove extra comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 27.15% when pulling e6504da on blob-udf-new-merge into 48d79ca on master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 27.105% when pulling e6504da on blob-udf-new-merge into 48d79ca on master.

wChannel.write(bb);
wChannel.close();
fos.close();
private static String createTempFile(final ByteBuffer bb) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeToTempFile is more clear

fos.close();
private static String createTempFile(final ByteBuffer bb) throws IOException {
Path path = Files.createTempFile("out", null);
File file = path.toFile();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is unnecessary, you can use path.toAbsolutePath() to get the absolute path.

@@ -1,5 +1,8 @@
package edu.washington.escience.myria.expression;

import java.io.IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two imports are not used.

@@ -96,7 +96,9 @@ public void testApplywithCounter() throws DbException {
for (int batchIdx = 0; batchIdx < result.numTuples(); ++batchIdx, ++rowIdx) {
char[] ngramChars = new char[] {(char) rowIdx, (char) (rowIdx + 1), (char) (rowIdx + 2)};
String ngram = new String(ngramChars);
int fltmapid = (int) rowIdx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, and nit-picking: it's fine to just declare rowIdx to be an int

wChannel.close();
fos.close();
private static String createTempFile(final ByteBuffer bb) throws IOException {
Path path = Files.createTempFile("out", null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit-picking: maybe a better suffix

stateEvaluator.evaluate(null, 0, state, null);

/* Set up the updaters. */

pyUpdateEvaluators = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is never used. (And you may need to change some constructors to remove it)

Copy link
Contributor

@jingjingwang jingjingwang Feb 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I take back my word, it is used.

@jingjingwang
Copy link
Contributor

Looks good to me, only a few new comments. Also two issues need to be opened based on our discussion.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 27.132% when pulling 319a6c8 on blob-udf-new-merge into 48d79ca on master.

@jingjingwang jingjingwang merged commit 6c56e3e into master Feb 10, 2017
@parmitam parmitam deleted the blob-udf-new-merge branch February 15, 2017 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants