adding blob data type and tests #862

parmitam · 2016-12-15T18:53:24Z

No description provided.

parmitam · 2016-12-15T18:54:41Z

Starting to push Python UDF and Blob data type changes to master.
This commit contains support for blob data type.

coveralls · 2016-12-15T18:57:24Z

Coverage increased (+0.06%) to 27.388% when pulling f48bb38 on blob-udf-new-merge into 41d0d45 on master.

parmitam · 2016-12-15T23:11:39Z

This commit is to add python UDF registration support.

coveralls · 2016-12-15T23:13:58Z

Coverage decreased (-0.04%) to 27.285% when pulling 6f69cba on blob-udf-new-merge into 41d0d45 on master.

coveralls · 2017-01-03T20:53:16Z

Coverage decreased (-0.3%) to 27.048% when pulling bc206c0 on blob-udf-new-merge into 41d0d45 on master.


coveralls · 2017-01-03T22:43:11Z

Coverage decreased (-0.3%) to 27.007% when pulling ea9ec0e on blob-udf-new-merge into 41d0d45 on master.

jingjingwang

For f48bb38, I have several general comments before going into details:

I think the new function for getting the size of a tuple batch should be wrapped within a TupleBatch/TupleBatchBuffer/TupleBuffer/Mutable... object, instead of calling TupleUtils with an explicit schema. The schema is determined when the TupleBatch/TupleBatchBuffer/... object is constructed, so the size should just be an internal property. With this change, some of the tuple batch building functions need to be cleaned up to use TupleBatchBuffer, instead of an explicit list of ColumnBuilders, to build a TupleBatch. ColumnBuilders should be internal data structures of a TupleBatchBuffer most of the time, and their sizes are determined by the TupleBatchBuffer.
There are multiple names for this new data type: I have seen Bytes, Blob, and ByteBuffer all being used. It would be clearer to stick with one. I think either Bytes or Blob is good (I personally prefer Blob for its clear definition as "binary large object", while Bytes is about physical representation), but ByteBuffer is how this new data type is implemented internally, so it should be avoided when possible.
Have you tried any test case with more than one row in a BytesColumn? I ask in case some implementation detail happens to rely on there being only one row.
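The first point above, that the batch size belongs to the object that already knows its schema, could be sketched roughly as follows. `SketchTupleBuffer`, `ColType`, the per-type byte widths, and the two limits are hypothetical stand-ins chosen for illustration; they are not Myria's actual classes or constants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical column types; the byte widths are illustrative assumptions. */
enum ColType {
  INT(4),
  LONG(8),
  BLOB(1024); // assumed typical blob size, not a real Myria constant

  final int bytesPerValue;

  ColType(final int bytesPerValue) {
    this.bytesPerValue = bytesPerValue;
  }
}

/** Hypothetical buffer that derives its batch size once, from its own schema. */
final class SketchTupleBuffer {
  private static final int MAX_BATCH_BYTES = 1 << 20; // assumed memory budget per batch
  private static final int MAX_ROWS = 10_000; // assumed upper bound on rows per batch

  private final List<ColType> schema;
  private final int batchSize;

  SketchTupleBuffer(final List<ColType> schema) {
    this.schema = Collections.unmodifiableList(new ArrayList<>(schema));
    int rowBytes = 0;
    for (ColType t : this.schema) {
      rowBytes += t.bytesPerValue;
    }
    // Clamp to [1, MAX_ROWS]; callers never pass a schema to a utility class.
    this.batchSize = Math.max(1, Math.min(MAX_ROWS, MAX_BATCH_BYTES / Math.max(1, rowBytes)));
  }

  List<ColType> getSchema() {
    return schema;
  }

  int getBatchSize() {
    return batchSize;
  }
}

final class BatchSizeDemo {
  public static void main(final String[] args) {
    System.out.println(new SketchTupleBuffer(Arrays.asList(ColType.INT, ColType.LONG)).getBatchSize());
    System.out.println(new SketchTupleBuffer(Arrays.asList(ColType.INT, ColType.BLOB)).getBatchSize());
  }
}
```

With this shape, column builders can stay private to the buffer and take their capacity from `getBatchSize()` instead of recomputing it from a schema passed around by hand.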

It might be clearer if you squashed new changes into this commit once they are ready, since I believe they are independent of the other Python function registration commits.


coveralls · 2017-01-04T20:12:11Z

Coverage decreased (-0.3%) to 26.995% when pulling 3265300 on blob-udf-new-merge into 41d0d45 on master.

parmitam · 2017-01-04T20:22:11Z

I have changed bytebuffer and bytes to blob. I also made batchSize a member of TupleBatch, TupleBuffer, etc. I have not cleaned up the code to use TupleBatchBuffer, which is a little bit involved (issue #865, assigned to me).
I can't squash the two commits, because new files added in the PyUDF creation commit were changed.

jingjingwang · 2017-01-05T08:18:14Z

There are some small things to fix, such as missing Javadoc (there might be more), typos, function names that should follow Java style (e.g. getBatchSize()), leftover comments, etc. But more importantly, do you plan to do #865 in this PR? It looks like only a few places need to be changed, e.g. JdbcAccessMethod and maybe a few more, and changing them would make this PR self-contained.

coveralls · 2017-01-06T18:43:55Z

Coverage decreased (-0.4%) to 26.938% when pulling ed29c11 on blob-udf-new-merge into 41d0d45 on master.


parmitam · 2017-01-06T19:10:31Z

Addressing code review comments.

coveralls · 2017-01-06T21:25:38Z

Coverage decreased (-0.4%) to 26.929% when pulling 14a0e31 on blob-udf-new-merge into 41d0d45 on master.

jortiz16 · 2017-01-06T22:22:20Z

src/edu/washington/escience/myria/MyriaConstants.java

+          Type.BYTES_TYPE,
+          "output_type",
+          Type.STRING_TYPE);
+


Add comment similar to profiling?

jortiz16 · 2017-01-06T22:30:45Z

src/edu/washington/escience/myria/api/DatasetResource.java

@@ -470,6 +472,16 @@ public Response createFunction(final CreateFunctionEncoding encoding) throws DbE
    ResponseBuilder response = Response.ok();
    return response.entity(functionCreationResponse).build();
  }
+  /**
+   * @param queryId an optional query ID specifying which datasets to get.
+   * @return a list of datasets.


fix comments

jingjingwang · 2017-01-08T05:27:10Z

src/edu/washington/escience/myria/operator/Operator.java

+  /**
+   * Python function registrar.
+   */
+  protected PythonFunctionRegistrar pyFuncReg;


pyFuncReg is a per-worker property and should not be kept here. We can make the getPythonFunctionRegistrar() method below a wrapper around the Worker's getPythonFunctionRegistrar(). My understanding is that the return value is non-null (or can we make it non-null?). If that's the case, annotate the return type with @Nonnull and get rid of all the checks like if (pyFuncReg != null).
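A minimal sketch of the delegation suggested here, assuming the operator can reach its worker directly. `SketchWorker`, `SketchRegistrar`, and `SketchOperator` are hypothetical stand-ins; how a real Myria Operator obtains its Worker is not shown in this thread.

```java
import java.util.Objects;

/** Stand-in for PythonFunctionRegistrar; exists only so the sketch compiles. */
final class SketchRegistrar {}

/** Stand-in worker that owns the single per-worker registrar. */
final class SketchWorker {
  private final SketchRegistrar registrar = new SketchRegistrar();

  /** Never returns null: the registrar is created together with the worker. */
  SketchRegistrar getPythonFunctionRegistrar() {
    return registrar;
  }
}

/** Stand-in operator: it keeps no registrar field of its own. */
abstract class SketchOperator {
  private final SketchWorker worker;

  SketchOperator(final SketchWorker worker) {
    this.worker = Objects.requireNonNull(worker, "worker");
  }

  /**
   * Thin wrapper over the worker's getter. In real code the return type could carry a
   * non-null annotation such as javax.annotation.Nonnull (assumed to be on the classpath),
   * which makes null checks at call sites unnecessary.
   */
  final SketchRegistrar getPythonFunctionRegistrar() {
    return worker.getPythonFunctionRegistrar();
  }
}
```

Because the only state lives in the worker, every operator on that worker sees the same registrar instance.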

jingjingwang · 2017-01-08T05:31:14Z

src/edu/washington/escience/myria/operator/Operator.java

  public PythonFunctionRegistrar getPythonFunctionRegistrar() {
-    Preconditions.checkNotNull(pyFuncReg);
+    if (execEnvVars.get(MyriaConstants.EXEC_ENV_VAR_TEST_MODE) != null
+        && execEnvVars.get(MyriaConstants.EXEC_ENV_VAR_TEST_MODE).equals("false")) {


Could you say a little more about this? Why does the test mode have to be false to get a Python function registrar? (It also seems that it is never set to false anywhere in the code base.)

jingjingwang · 2017-01-08T05:37:54Z

src/edu/washington/escience/myria/expression/evaluate/PythonUDFEvaluator.java

@@ -345,4 +411,12 @@ private void writeToStream(
      throw new DbException(e);
    }
  }
+
+  @Override
+  public void sendEos() throws DbException {


I might be wrong, but you probably don't need to write anything, as PythonWorker.sendEos() closes the connection anyway, unless the Python process needs this information to do something before closing.

jingjingwang · 2017-01-08T05:38:33Z

src/edu/washington/escience/myria/expression/evaluate/PythonUDFEvaluator.java

          if (input != null && input.hasArray()) {
-            // LOGGER.info("input array buffer length" + input.array().length);
+


This is nitpicking, but please get rid of unnecessary empty lines. (There are some more.)

jingjingwang · 2017-01-08T05:39:46Z

src/edu/washington/escience/myria/MyriaConstants.java

@@ -352,7 +352,8 @@ private MyriaConstants() {}
  public static final int PYTHON_EXCEPTION = -3;
  /** python function return is null.*/
  public static final int NULL_LENGTH = -5;
-
+  /** Send EOS tp python strea, is null.*/


"to" and "stream". And if my comment below makes sense we can get rid of this declaration.

jingjingwang · 2017-01-08T07:14:29Z

src/edu/washington/escience/myria/operator/agg/SingleGroupByAggregate.java

+      throws DbException {
+
+    BitSet bs;
+    switch (gColumnType) {


These types share many lines, merge them.
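One common way to merge cases that share a body is to stack the case labels. The sketch below is only an illustration of that pattern: `KeyType`, `GroupBitSets`, and the shared body are hypothetical and do not reproduce the actual SingleGroupByAggregate code.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical group-column types. */
enum KeyType {
  INT_TYPE,
  LONG_TYPE,
  STRING_TYPE,
  BLOB_TYPE
}

/** Hypothetical per-group bookkeeping: one BitSet of row indices per group key. */
final class GroupBitSets {
  private final Map<Object, BitSet> bitSets = new HashMap<>();

  /** Records that {@code row} belongs to the group identified by {@code key}. */
  void mark(final KeyType type, final Object key, final int row) {
    switch (type) {
      case INT_TYPE:
      case LONG_TYPE:
      case STRING_TYPE:
        // Shared body, written once instead of once per type.
        bitSets.computeIfAbsent(key, k -> new BitSet()).set(row);
        break;
      default:
        // Assumption for this sketch only: other types are not valid group keys.
        throw new IllegalArgumentException("unsupported group key type: " + type);
    }
  }

  /** Returns the rows recorded for {@code key}, or an empty BitSet if there are none. */
  BitSet rowsFor(final Object key) {
    return bitSets.getOrDefault(key, new BitSet());
  }
}
```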

jingjingwang · 2017-01-08T07:14:43Z

src/edu/washington/escience/myria/operator/agg/SingleGroupByAggregate.java

+    }
+  }
+
+  private void setBitset(final ReadableTable table, final int row, final Object[] groupAgg)


'setBitSet'

jingjingwang · 2017-01-08T07:15:24Z

src/edu/washington/escience/myria/operator/agg/SingleGroupByAggregate.java

+      if (aggregators[agg]
+          .getClass()
+          .getName()
+          .equals(StatefulUserDefinedAggregator.class.getName())) {


jingjingwang · 2017-01-08T07:20:23Z

src/edu/washington/escience/myria/operator/agg/StatefulUserDefinedAggregator.java

+
+  @Override
+  public void add(final List<TupleBatch> from, final Object state) throws DbException {
+    // LOGGER.info("add tuple called");


Remove unnecessary comments, same above & below

jingjingwang · 2017-01-08T07:23:02Z

src/edu/washington/escience/myria/operator/agg/UserDefinedAggregatorFactory.java

    String script = compute.append(output).toString();
-    LOGGER.debug("Compiling UDA {}", script);
+    LOGGER.info("Compiling UDA {}", script);


I think debug is enough since it is not some crucial state transition.

jingjingwang · 2017-01-08T07:31:04Z

I made a pass. Some of my comments are showing as "outdated", but if you click on them, most of them still make sense. I feel that we may need another pass after these are resolved; in particular, I think some newly added interfaces could be improved, but I'd like to wait until we have a cleaner version. Also, I think testing with more than one tuple in a blob column should help us discover many bugs. It would be best if you could add such a test. Even if that's hard, running some informal tests should be helpful.

coveralls · 2017-02-02T21:49:31Z

Coverage decreased (-0.3%) to 27.105% when pulling f178f30 on blob-udf-new-merge into 48d79ca on master.

jingjingwang · 2017-02-09T04:42:35Z

src/edu/washington/escience/myria/expression/ConstantExpression.java

+   *
+   * @param value the value of this constant.
+   */
+  public ConstantExpression(final ByteBuffer value) {


First, this function is never used, and second, the return value would be a summary string of the ByteBuffer. It would cause a problem for getJavaString since it can't be parsed as some value, and also for the ConstantExpression constructor since we're expecting a direct value like in a JSON string. I think an UnsupportedOperationException makes more sense if you don't really want to output all the bytes as string.
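A minimal sketch of the suggested behaviour, using a hypothetical `SketchConstant` class in place of ConstantExpression. It rejects blob constants outright instead of storing the result of `ByteBuffer.toString()`, which describes the buffer (position, limit, capacity) rather than its contents.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a constant expression that stores its value as a string. */
final class SketchConstant {
  private final String value;

  SketchConstant(final int value) {
    this.value = Integer.toString(value);
  }

  /**
   * Blob constants are rejected: value.toString() would yield a summary of the buffer,
   * not something that can be parsed back into a value.
   */
  SketchConstant(final ByteBuffer value) {
    throw new UnsupportedOperationException("blob constants are not supported");
  }

  /** Returns the textual form that would be spliced into generated Java code. */
  String getJavaString() {
    return value;
  }
}

final class ConstantDemo {
  public static void main(final String[] args) {
    // Prints a description of the buffer object, not the three bytes it holds.
    System.out.println(ByteBuffer.wrap(new byte[] {1, 2, 3}).toString());
    System.out.println(new SketchConstant(42).getJavaString());
  }
}
```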

jingjingwang · 2017-02-09T06:59:49Z

src/edu/washington/escience/myria/expression/evaluate/Evaluator.java

+   * @return
+   * @throws DbException
+   */
+  public abstract void sendEos() throws DbException;


Seems this function is never used.

jingjingwang · 2017-02-09T07:12:30Z

src/edu/washington/escience/myria/functions/package-info.java

@@ -0,0 +1,4 @@
+/**


I think this file is unnecessary.

jingjingwang · 2017-02-09T07:13:51Z

src/edu/washington/escience/myria/io/AmazonS3Source.java

@@ -14,6 +14,7 @@
 import org.apache.commons.httpclient.URIException;

 import com.amazonaws.ClientConfiguration;
+import com.amazonaws.auth.AnonymousAWSCredentials;


This import is never used.

jingjingwang · 2017-02-09T07:26:20Z

src/edu/washington/escience/myria/operator/Apply.java

+                  countIdx.putInt(0, flatmapid);
+                  flatmapid = 0;
+                }
+                if (getAddCounter()) {


This if {} can be merged with the above one.

jingjingwang · 2017-02-09T07:31:34Z

src/edu/washington/escience/myria/operator/MergeJoin.java

@@ -453,9 +453,13 @@ private void leftAndRightEqual() throws Exception {
    // advance the one with the larger set of equal tuples because this produces fewer join tuples
    // not exact but good approximation
    final int leftSizeOfGroupOfEqualTuples =
-        leftRowIndex + TupleBatch.BATCH_SIZE * (leftBatches.size() - 1) - leftBeginIndex;
+        leftRowIndex
+            + TupleUtils.getBatchSize(generateSchema()) * (leftBatches.size() - 1)


I think the schema should be the schema of the left batch, not the output of this join. They may have different schemas.

jingjingwang · 2017-02-09T07:37:01Z

src/edu/washington/escience/myria/operator/Operator.java

+   * @return PythonFunctionRegistrar for operator.
+   * @throws DbException  in case of error.
+   */
+  public PythonFunctionRegistrar getPythonFunctionRegistrar() throws DbException {


My understanding is that pyFuncRegistrar is only used by PythonUDFEvaluator, so you don't need to add this function to Operator and to many aggregator-related methods, for example AggUtils.allocateAggs(). Just get pyFuncRegistrar from the worker in PythonUDFEvaluator.

Let's talk about this. It is not clear to me how AggUtils has access to the worker.

jingjingwang · 2017-02-09T07:39:44Z

src/edu/washington/escience/myria/operator/StatefulApply.java

@@ -94,7 +99,7 @@ private void setUpdateExpressions(final List<Expression> updaterExpressions) {
  }

  @Override
-  protected TupleBatch fetchNextReady() throws DbException, InvocationTargetException {
+  protected TupleBatch fetchNextReady() throws DbException, InvocationTargetException, IOException {


IOException is not thrown from anywhere.

jingjingwang · 2017-02-09T07:48:27Z

src/edu/washington/escience/myria/operator/agg/Aggregator.java

   */
-  void getResult(AppendableTable dest, int destColumn, Object state) throws DbException;
+  void getResult(AppendableTable dest, int destColumn, Object state)
+      throws DbException, IOException;


I think the IOException is never thrown.

jingjingwang · 2017-02-09T07:49:03Z

src/edu/washington/escience/myria/operator/agg/Aggregate.java

@@ -53,7 +55,7 @@ public Aggregate(@Nullable final Operator child, final AggregatorFactory... aggr
  }

  @Override
-  protected TupleBatch fetchNextReady() throws DbException {
+  protected TupleBatch fetchNextReady() throws DbException, IOException {


The IOException is never thrown if we remove the IOException from Aggregator.getResult(). (See another comment in Aggregator.java)

jingjingwang · 2017-02-09T07:52:12Z

src/edu/washington/escience/myria/operator/agg/MultiGroupByAggregate.java

   */
  @Override
-  protected TupleBatch fetchNextReady() throws DbException {
+  protected TupleBatch fetchNextReady() throws DbException, IOException {


Also this IOException.

jingjingwang · 2017-02-09T08:02:37Z

src/edu/washington/escience/myria/operator/agg/MultiGroupByAggregate.java

+  /** Holds the corresponding TB for each group key in {@link #groupKeys}. */
+  private transient List<List<TupleBatch>> tbgroupState;
+  /** Holds the bitset for each group key in {@link #groupKeys}. */
+  HashMap<Integer, BitSet> bs = new HashMap<Integer, BitSet>();


Just a heads-up that these keys will be replaced by rows in primitive columns in my refactor_aggregate PR.

senderista · 2017-02-09T21:18:57Z

Agree with @BrandonHaynes: using a stdlib API is not taking on a dependency, and using temp files correctly is much less simple than one would think at first glance.

On Thu, Feb 9, 2017 at 12:16 PM, Brandon Haynes commented on this pull request:

src/edu/washington/escience/myria/CsvTupleWriter.java

     for (int i = 0; i < tuples.numTuples(); ++i) {
       for (int j = 0; j < tuples.numColumns(); ++j) {
-        row[j] = tuples.getObject(j, i).toString();
+        Type type = tbsc.getColumnType(j);
+        if (type.equals(Type.BLOB_TYPE)) {
+          // write the file out
+          // add filename to the csv file
+          String filename = UUID.randomUUID().toString();


Strongly disagree; IMO we should rely on the core language API by default, and only roll our own implementation if there is a strong reason to do otherwise (this is uncommon). The code would also be strengthened by using Java-flavored RAII and automatically cleaning up the temp files (deleteOnExit).
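The approach argued for in this exchange (standard-library temp files, try-with-resources as the Java flavor of RAII, and deleteOnExit for cleanup) might look like the sketch below. The class name `BlobTempFiles` and the `blob-`/`.bin` prefix and suffix are assumptions made for illustration; this is not the code that was merged.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class BlobTempFiles {
  private BlobTempFiles() {}

  /**
   * Writes the remaining bytes of {@code blob} to a fresh temp file and returns its absolute
   * path. The caller's buffer position is left untouched.
   */
  static Path writeToTempFile(final ByteBuffer blob) throws IOException {
    // The JDK picks a unique name in the default temp directory.
    Path path = Files.createTempFile("blob-", ".bin");
    // Ask the JVM to delete the file when it shuts down normally.
    path.toFile().deleteOnExit();
    // try-with-resources closes the channel even if a write fails.
    try (FileChannel channel = FileChannel.open(path, StandardOpenOption.WRITE)) {
      ByteBuffer view = blob.duplicate();
      while (view.hasRemaining()) {
        channel.write(view);
      }
    }
    return path.toAbsolutePath();
  }
}
```

Note that deleteOnExit only runs on normal JVM termination and keeps an entry per registered file for the lifetime of the process, so a long-running server may prefer to delete the files explicitly once the CSV has been consumed.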

coveralls · 2017-02-09T22:23:55Z

Coverage decreased (-0.3%) to 27.098% when pulling 28208c4 on blob-udf-new-merge into 48d79ca on master.


coveralls · 2017-02-09T22:34:29Z

Coverage decreased (-0.2%) to 27.15% when pulling e6504da on blob-udf-new-merge into 48d79ca on master.

coveralls · 2017-02-09T22:36:58Z

Coverage decreased (-0.3%) to 27.105% when pulling e6504da on blob-udf-new-merge into 48d79ca on master.

jingjingwang · 2017-02-10T02:42:07Z

src/edu/washington/escience/myria/CsvTupleWriter.java

-    wChannel.write(bb);
-    wChannel.close();
-    fos.close();
+  private static String createTempFile(final ByteBuffer bb) throws IOException {


writeToTempFile is more clear

jingjingwang · 2017-02-10T02:42:50Z

src/edu/washington/escience/myria/CsvTupleWriter.java

-    fos.close();
+  private static String createTempFile(final ByteBuffer bb) throws IOException {
+    Path path = Files.createTempFile("out", null);
+    File file = path.toFile();


This file is unnecessary, you can use path.toAbsolutePath() to get the absolute path.

jingjingwang · 2017-02-10T02:43:32Z

src/edu/washington/escience/myria/expression/DownloadBlobExpression.java

@@ -1,5 +1,8 @@
 package edu.washington.escience.myria.expression;

+import java.io.IOException;


These two imports are not used.

jingjingwang · 2017-02-10T02:58:51Z

test/edu/washington/escience/myria/operator/apply/ApplyNgramTest.java

@@ -96,7 +96,9 @@ public void testApplywithCounter() throws DbException {
        for (int batchIdx = 0; batchIdx < result.numTuples(); ++batchIdx, ++rowIdx) {
          char[] ngramChars = new char[] {(char) rowIdx, (char) (rowIdx + 1), (char) (rowIdx + 2)};
          String ngram = new String(ngramChars);
+          int fltmapid = (int) rowIdx;


typo, and nit-picking: it's fine to just declare rowIdx to be an int

jingjingwang · 2017-02-10T03:02:57Z

src/edu/washington/escience/myria/CsvTupleWriter.java

-    wChannel.close();
-    fos.close();
+  private static String createTempFile(final ByteBuffer bb) throws IOException {
+    Path path = Files.createTempFile("out", null);


nit-picking: maybe a better suffix

jingjingwang · 2017-02-10T03:35:02Z

src/edu/washington/escience/myria/operator/agg/UserDefinedAggregatorFactory.java

      stateEvaluator.evaluate(null, 0, state, null);

      /* Set up the updaters. */
+
+      pyUpdateEvaluators = new ArrayList<>();


It is never used. (And you may need to change some constructors to remove it)

Sorry I take back my word, it is used.

jingjingwang · 2017-02-10T07:12:40Z

Looks good to me, only a few new comments. Also two issues need to be opened based on our discussion.

coveralls · 2017-02-10T17:16:49Z

Coverage decreased (-0.2%) to 27.132% when pulling 319a6c8 on blob-udf-new-merge into 48d79ca on master.

adding blob data type and tests

f48bb38

parmitam assigned parmitam, jortiz16 and jingjingwang Dec 15, 2016

adding python function registration

6f69cba

parmitam requested review from jingjingwang and jortiz16 December 15, 2016 23:16

adding support for python UDF

bc206c0

added support for get_function for registered function, this is for myria-python support

ea9ec0e

jingjingwang reviewed Jan 4, 2017


rename bytes/bytebuffer to blob; make batch size member of tb,tbb and mutable tb

3265300

stateful agg for python UDA

ed29c11

addressing code review comments: removing comments, adding javadoc, and using camelcase for function name.

e37f7f4

Myria python worker

14a0e31

jortiz16 reviewed Jan 6, 2017


jingjingwang reviewed Jan 8, 2017


jingjingwang reviewed Feb 9, 2017


addressing code review comments.

e6504da

remove extra comment

parmitam force-pushed the blob-udf-new-merge branch from bfa4a48 to e6504da Compare February 9, 2017 22:32

jingjingwang reviewed Feb 10, 2017


addressing last of code review comments

319a6c8

jingjingwang merged commit 6c56e3e into master Feb 10, 2017

parmitam deleted the blob-udf-new-merge branch February 15, 2017 17:50
