Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix udfs #99

Merged
merged 1 commit into from
Feb 1, 2017
Merged

Fix udfs #99

merged 1 commit into from
Feb 1, 2017

Conversation

kanterov
Copy link
Contributor

We didn't use our encoders before, instead we used ScalaUDF that uses
default ones that aren't compatible with TypedEncoder. We introduce
FramelessUdf that uses TypedEncoder to support udfs. There are
2 special cases to mention:

  1. udf is called within codegen
  2. udf is directly evaluated within local projection

TypedEncoder doesn't support runtime evaluation, only codegen, that's
why we handle (2) by generating code from (1), compiling and executing
it.

@kanterov
Copy link
Contributor Author

During testing, I've discovered a bug with collect, going to take a look if we can do something about it.

@imarios
Copy link
Contributor

imarios commented Jan 24, 2017

@kanterov man this is amazing but so complex. It will take me hours to review :). Did you figure out the issue with collect?

Copy link
Contributor

@OlivierBlanvillain OlivierBlanvillain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm totally unfamiliar with Spark's udf and the CodeGen APIs used in this PR, I cannot do more than a superficial review...

Question: is (or could?) the code complied for evaluation be cached?

}

check(forAll(prop[Int, Int, Int] _))
check(forAll(prop[String, Int, Int] _))
check(forAll(prop[Option[Int], X2[Double, Long], Int] _))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove these checks? I looks like you don't have any X2 in the new ones

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to implementation details, there shouldn't be a difference between X1 and X2 because they are both encoded using the same type of encoder. I wanted to simplify things a bit.

new TypedColumn[T, R](scalaUdf)
}
}

case class FramelessUdf[T, R](
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be private[functions]?

Copy link
Contributor Author

@kanterov kanterov Jan 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should stop people from accessing internals if they want, what if we document this as internal?

new TypedColumn[T, R](scalaUdf)
}
}

case class FramelessUdf[T, R](
function: AnyRef,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand where function is used inside of FramelessUdf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's tricky, it used in generated code:

    ctx.addMutableState(funcClassName, funcTerm,
       s"this.$funcTerm = ($funcClassName)((($framelessUdfClassName)references" +
         s"[$funcExpressionIdx]).function());")

@imarios
Copy link
Contributor

imarios commented Jan 30, 2017

@kanterov can you run this test with your changes?

  test("x") {
    val f = TypedDataset.create((1,Vector(2,2,3)) :: (1,Vector(2,4)) :: Nil)
    val ooo = f.makeUDF( (v: Vector[Int]) => v.sum )
    f.select( f('_1), ooo(f('_2))).show().run()
  }

@imarios
Copy link
Contributor

imarios commented Jan 30, 2017

@kanterov can you take a look at #106. Will this fix, fix the issues I mentioned there?

@kanterov
Copy link
Contributor Author

@imarios yeah, it works now

@kanterov
Copy link
Contributor Author

@OlivierBlanvillain I've addressed your comment for marking FramelessUdf as internal and squashed commits

@codecov-io
Copy link

codecov-io commented Jan 31, 2017

Codecov Report

Merging #99 into master will increase coverage by 0.61%.

@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
+ Coverage   90.06%   90.67%   +0.61%     
==========================================
  Files          25       25              
  Lines         473      504      +31     
  Branches        7        8       +1     
==========================================
+ Hits          426      457      +31     
  Misses         47       47
Impacted Files Coverage Δ
dataset/src/main/scala/frameless/TypedColumn.scala 92.85% <ø> (ø)
...taset/src/main/scala/frameless/functions/Udf.scala 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9a5addc...6cf5d07. Read the comment docs.

@kanterov kanterov force-pushed the fix-udf branch 2 times, most recently from 1d6a172 to 6cf5d07 Compare January 31, 2017 21:48
We didn't use our encoders before, instead we used `ScalaUDF` that uses
default ones that aren't compatible with `TypedEncoder`. We introduce
`FramelessUdf` that uses `TypedEncoder` to support udfs. There are
2 special cases to mention:

1. udf is called within codegen
2. udf is directly evaluated within local projection

`TypedEncoder` doesn't support runtime evaluation, only codegen, that's
why we handle (2) by generating code from (1), compiling and executing
it.
@imarios
Copy link
Contributor

imarios commented Feb 1, 2017

@kanterov we still run to the case where we call a method that exists in "Vector" but doesn't exists on "Arrays", right?

@kanterov
Copy link
Contributor Author

kanterov commented Feb 1, 2017

@imarios we shouldn't, current implementation would take Array and convert it to Vector before calling udf

@OlivierBlanvillain
Copy link
Contributor

@kanterov sounds good! I'm not sure if I can help further on this one, unless someone else wants to give it a shot I would say we can merge :)

@imarios
Copy link
Contributor

imarios commented Feb 1, 2017

I think we should merge too. I will try to add a LOT more unit tests to cover different cases. I will open another PR for the new test additions.

@kanterov
Copy link
Contributor Author

kanterov commented Feb 1, 2017

Awesome, thanks for review, merging.

@kanterov kanterov merged commit 031b50e into master Feb 1, 2017
@kanterov kanterov deleted the fix-udf branch February 1, 2017 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants