Fix udfs #99

kanterov · 2017-01-23T23:17:44Z

Previously we didn't use our own encoders for udfs. We used ScalaUDF, which
relies on default encoders that aren't compatible with TypedEncoder. This PR
introduces FramelessUdf, which uses TypedEncoder to support udfs. There are
2 special cases to mention:

1. udf is called within codegen
2. udf is directly evaluated within local projection

TypedEncoder doesn't support runtime evaluation, only codegen. That's
why we handle (2) by generating code from (1), compiling and executing
it.
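How case (2) can reuse the codegen path is sketched below in plain Scala. Everything here is a hypothetical stand-in: `UdfLike`, `compile` and `compileCount` are not Spark or frameless names, and the real compile step (turning generated source into a class) is simulated by wrapping a function. Whether the PR caches the compiled code is not stated in this thread; the `lazy val` is an assumption made only for illustration.

```scala
// Hypothetical stand-ins for illustration; these are not Spark or frameless classes.
object EvalViaCodegenSketch {
  // Counts how many times the simulated "compile" step runs.
  var compileCount = 0

  // Simulates compiling the code generated for case (1) into something callable.
  def compile(f: Int => Int): Int => Int = {
    compileCount += 1
    f
  }

  final class UdfLike(f: Int => Int) {
    // Assumption: compile on first use, then reuse the result.
    private lazy val compiled: Int => Int = compile(f)

    // Case (2): direct evaluation delegates to the compiled form.
    def eval(input: Int): Int = compiled(input)
  }
}
```

With this shape, repeated calls to `eval` on the same instance go through the simulated compile step only once.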

kanterov · 2017-01-23T23:18:29Z

During testing, I discovered a bug with collect. I'm going to see if we can do something about it.

imarios · 2017-01-24T19:11:07Z

@kanterov man this is amazing but so complex. It will take me hours to review :). Did you figure out the issue with collect?

OlivierBlanvillain

I'm totally unfamiliar with Spark's udf and the CodeGen APIs used in this PR, so I cannot do more than a superficial review...

Question: is the code compiled for evaluation cached, or could it be?

OlivierBlanvillain · 2017-01-28T13:28:49Z

dataset/src/test/scala/frameless/functions/UdfTests.scala

    }

-    check(forAll(prop[Int, Int, Int] _))
-    check(forAll(prop[String, Int, Int] _))
-    check(forAll(prop[Option[Int], X2[Double, Long], Int] _))


Why remove these checks? It looks like you don't have any X2 in the new ones.

kanterov · 2017-01-30

According to the implementation details, there shouldn't be a difference between X1 and X2 because they are both encoded using the same type of encoder. I wanted to simplify things a bit.

OlivierBlanvillain · 2017-01-28T13:33:07Z

dataset/src/main/scala/frameless/functions/Udf.scala

      new TypedColumn[T, R](scalaUdf)
    }
 }

+case class FramelessUdf[T, R](


Should this be private[functions]?

kanterov · 2017-01-30

I don't think we should stop people from accessing internals if they want to. What if we document this as internal instead?

OlivierBlanvillain · 2017-01-28T13:44:02Z

dataset/src/main/scala/frameless/functions/Udf.scala

      new TypedColumn[T, R](scalaUdf)
    }
 }

+case class FramelessUdf[T, R](
+  function: AnyRef,


I don't understand where function is used inside of FramelessUdf.

kanterov · 2017-01-30

That's tricky: it is used in the generated code:

    ctx.addMutableState(funcClassName, funcTerm, s"this.$funcTerm = ($funcClassName)((($framelessUdfClassName)references" + s"[$funcExpressionIdx]).function());")
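What that generated initializer does can be imitated in plain Scala: look up an object in a references array by index, cast it, and read its function field. The sketch below is only an illustration under that reading; `UdfHolder`, `loadFunction` and the sample values are hypothetical and do not come from the PR.

```scala
object ReferencesSketch {
  // Hypothetical stand-in for the expression that carries the user function.
  final case class UdfHolder(function: AnyRef)

  // Mirrors the generated initializer: index into the references array,
  // cast to the holder type, then cast its function to the expected type.
  def loadFunction(references: Array[Any], idx: Int): Int => Int =
    references(idx).asInstanceOf[UdfHolder].function.asInstanceOf[Int => Int]

  val twice: Int => Int = _ * 2

  // The holder sits at index 1, next to an unrelated reference.
  val references: Array[Any] = Array("other", UdfHolder(twice))

  val func: Int => Int = loadFunction(references, 1)
}
```

This is why the field looks unused in the Scala source: the only read happens from code that exists as a string until it is compiled.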

imarios · 2017-01-30T02:09:26Z

@kanterov can you run this test with your changes?

  test("x") {
    val f = TypedDataset.create((1,Vector(2,2,3)) :: (1,Vector(2,4)) :: Nil)
    val ooo = f.makeUDF( (v: Vector[Int]) => v.sum )
    f.select( f('_1), ooo(f('_2))).show().run()
  }

imarios · 2017-01-30T02:59:03Z

@kanterov can you take a look at #106? Will this PR fix the issues I mentioned there?

kanterov · 2017-01-31T21:17:53Z

@imarios yeah, it works now

kanterov · 2017-01-31T21:22:22Z

@OlivierBlanvillain I've addressed your comment for marking FramelessUdf as internal and squashed commits

codecov-io · 2017-01-31T21:30:47Z

Codecov Report

Merging #99 into master will increase coverage by 0.61%.

@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
+ Coverage   90.06%   90.67%   +0.61%     
==========================================
  Files          25       25              
  Lines         473      504      +31     
  Branches        7        8       +1     
==========================================
+ Hits          426      457      +31     
  Misses         47       47

Impacted Files	Coverage Δ
dataset/src/main/scala/frameless/TypedColumn.scala	`92.85% <ø> (ø)`	✅
...taset/src/main/scala/frameless/functions/Udf.scala	`100% <100%> (ø)`	✅

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9a5addc...6cf5d07.

imarios · 2017-02-01T02:48:20Z

@kanterov we still run into the case where we call a method that exists on "Vector" but doesn't exist on "Array", right?

kanterov · 2017-02-01T08:09:21Z

@imarios we shouldn't. The current implementation takes the Array and converts it to a Vector before calling the udf.
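A minimal sketch of why that conversion matters, in plain Scala with no Spark involved. The names `sumVector`, `erased` and `raw` are hypothetical, and the erased cast only approximates how a function held as AnyRef gets invoked; it is not the PR's code.

```scala
import scala.util.Try

object ArrayToVectorSketch {
  // A udf written against Vector, like the one in the test above.
  val sumVector: Vector[Int] => Int = v => v.sum

  // Type-erased view of the function, similar to storing it as AnyRef.
  val erased: Any => Any = sumVector.asInstanceOf[Any => Any]

  // Hypothetical raw value as it might arrive from the engine.
  val raw: Array[Int] = Array(2, 2, 3)

  // Passing the array straight through fails at runtime with a ClassCastException.
  val direct: Try[Any] = Try(erased(raw))

  // Converting to Vector first gives the function the type it expects.
  val converted: Any = erased(raw.toVector)
}
```

The direct call fails because the function's erased entry point casts its argument to Vector, so the conversion has to happen before the call rather than inside the udf.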

OlivierBlanvillain · 2017-02-01T09:34:57Z

@kanterov sounds good! I'm not sure I can help further on this one. Unless someone else wants to give it a shot, I would say we can merge :)

imarios · 2017-02-01T18:30:45Z

I think we should merge too. I will try to add a LOT more unit tests to cover different cases. I will open another PR for the new test additions.

kanterov · 2017-02-01T22:06:36Z

Awesome, thanks for the review, merging.

OlivierBlanvillain self-requested a review January 25, 2017 07:58

OlivierBlanvillain reviewed Jan 28, 2017

kanterov force-pushed the fix-udf branch from e47acf5 to 331e034 Compare January 31, 2017 21:16

kanterov mentioned this pull request Jan 31, 2017

Better bounded types for TypedDatasets #106

Closed

kanterov force-pushed the fix-udf branch from 331e034 to a1a16ce Compare January 31, 2017 21:21

kanterov force-pushed the fix-udf branch 2 times, most recently from 1d6a172 to 6cf5d07 Compare January 31, 2017 21:48

kanterov merged commit 031b50e into master Feb 1, 2017

kanterov deleted the fix-udf branch February 1, 2017 22:06
