Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary sequence estimator #84

Merged
merged 8 commits into from
Aug 28, 2018
Merged

Binary sequence estimator #84

merged 8 commits into from
Aug 28, 2018

Conversation

mikeloh77
Copy link
Contributor

It would be useful to have an Estimator/Transformer combo where it can take a single feature of type I1 and a sequence of features of type I2. This is to augment the concept of the SequenceEstimator.

@salesforce-cla
Copy link

Thanks for the contribution! Before we can merge this, we need @mikeloh77 to sign the Salesforce.com Contributor License Agreement.

val convertI1 = FeatureTypeSparkConverter[I1]()
val convertI2 = FeatureTypeSparkConverter[I2]()
val func = (r: Row) => {
val arr = new ArrayBuffer[I2](r.length-1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add spaces r.length - 1

@@ -470,6 +470,46 @@ case object FeatureSparkTypes {
}
}

def udf2N[I1 <: FeatureType : TypeTag, I2 <: FeatureType : TypeTag, O <: FeatureType : TypeTag]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs

UserDefinedFunction(func, outputType, inputTypes = None)
}

def transform2N[I1 <: FeatureType : TypeTag, I2 <: FeatureType : TypeTag, O <: FeatureType: TypeTag]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs here as well

@@ -549,6 +549,43 @@ trait OpPipelineStageN[I <: FeatureType, O <: FeatureType] extends OpPipelineSta
)(tto)
}

/**
* Pipeline stage of single Feature of type I1 with multiple Features of type I2 to output 1Feature of type O
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo 1Feature

implicit val tti1: TypeTag[I1],
val tti2: TypeTag[I2],
val tto: TypeTag[O],
val tti1v: TypeTag[I1#Value],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename tti1v -> ttiv1 and tti2v -> ttiv2 (since we have those in other classes as well) + update the docs

val columns = Array(col(in1.name)) ++ seqColumns

val df = dataset.select(columns: _*)
val ds = df.map(r => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be using the converters here instead:

val ds = df.map { r =>
   val rowSeq = r.toSeq
   (convertI1.fromSpark(rowSeq.head), rowSeq.tail.map(seqIConvert.fromSpark(_).value))
}

import scala.reflect.runtime.universe.TypeTag
import scala.util.Try

trait OpTransformer2N[I1 <: FeatureType, I2 <: FeatureType, O <: FeatureType]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs

* @return a fitted model that will perform the transformation specified by the function defined in constructor fit
*/
override def fit(dataset: Dataset[_]): BinarySequenceModel[I1, I2, O] = {
assert(inN.nonEmpty, "Inputs cannot be empty")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps getTransientFeatures.nonEmpty?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a check specifically on the sequence input getTransientFeatures will be nonempty

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then getTransientFeatures.size > 1, so it will include the check of existence of in1 as well

* @return a new dataset containing a column for the transformed feature
*/
override def transform(dataset: Dataset[_]): DataFrame = {
assert(inN.nonEmpty, "Inputs cannot be empty")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps getTransientFeatures.nonEmpty here as well?

Set("-1.0", "three", "four").toMultiPickList,
Set("15.0", "five", "six").toMultiPickList,
Set("1.111", "seven", "").toMultiPickList
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.map(_.toMultiPickList)

Vectors.dense(0.4, 0.5, Double.PositiveInfinity).toOPVector,
Vectors.dense(0.4, 1.0, Double.PositiveInfinity).toOPVector,
Vectors.dense(0.2, 0.5, Double.PositiveInfinity).toOPVector
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.map(_.toOPVector)

@codecov
Copy link

codecov bot commented Aug 24, 2018

Codecov Report

Merging #84 into master will increase coverage by 0.09%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #84      +/-   ##
==========================================
+ Coverage   86.08%   86.17%   +0.09%     
==========================================
  Files         292      294       +2     
  Lines        9490     9550      +60     
  Branches      316      528     +212     
==========================================
+ Hits         8169     8230      +61     
+ Misses       1321     1320       -1
Impacted Files Coverage Δ
...rc/main/scala/com/salesforce/op/stages/HasIn.scala 100% <100%> (ø) ⬆️
...com/salesforce/op/features/FeatureSparkTypes.scala 90.62% <100%> (+0.91%) ⬆️
...stages/base/sequence/BinarySequenceEstimator.scala 100% <100%> (ø)
...la/com/salesforce/op/stages/OpPipelineStages.scala 75.45% <100%> (+2.45%) ⬆️
...ages/base/sequence/BinarySequenceTransformer.scala 100% <100%> (ø)
...om/salesforce/op/utils/spark/OpSparkListener.scala 98.7% <0%> (+1.29%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b16b7cb...0238fe4. Read the comment docs.

dataset.select(col("*"), functionUDF(struct(columns: _*)).as(getOutputFeatureName, meta))
}

private val transformNFn = FeatureSparkTypes.transform2N[I1, I2, O](transformFn)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private lazy val

}

private val transformNFn = FeatureSparkTypes.transform2N[I1, I2, O](transformFn)
override def transformKeyValue: KeyValue => Any = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lazy val

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what should be lazy here. Please clarify.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikeloh77 you dont want to know 😛 but if you are curious, mainly because of SelectedModel class in ModelSelector

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@tovbinm tovbinm changed the title Ml/binary sequence esimator Binary sequence estimator Aug 28, 2018
Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!!

@tovbinm tovbinm merged commit f9348de into master Aug 28, 2018
@tovbinm tovbinm deleted the ml/BinarySequenceEsimator branch August 28, 2018 06:04
ericwayman pushed a commit that referenced this pull request Feb 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants