Apply DateToUnitCircleTransformer logic in raw feature transformations. #130

Merged
merged 18 commits into from
Sep 21, 2018

Conversation

marcovivero
Contributor

Related issues
There is currently no related issue open for this; it is an enhancement.

Describe the proposed solution
Add the option to apply the circular date-time transformation in RawFeatureFilter to any Date, DateTime, DateList, or DateMap feature. This is the same logic used by DateToUnitCircleTransformer:

https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/stages/impl/feature/DateToUnitCircleTransformer.scala

This is currently turned off by default; it is up to users to turn it on in their respective workflows.

Describe alternatives you've considered
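We currently use the standard approach for processing Numeric types, i.e. mapping the values to doubles. This may lead to issues when comparing distributions over time: raw timestamps only increase, so distributions drawn from different time ranges can look divergent even when their cyclical patterns match.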
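
To illustrate the concern, here is a minimal sketch of a circular encoding for hour of day. It is not the DateToUnitCircleTransformer implementation: the object and method names are hypothetical, and the hour is derived with plain UTC arithmetic on epoch milliseconds rather than with a date library.

```scala
// Hypothetical illustration only; not the DateToUnitCircleTransformer code.
object CircularHourSketch {
  private val MillisPerHour = 3600000L
  private val HoursPerDay = 24

  // Hour of day in UTC (0 to 23) from epoch milliseconds.
  // floorDiv/floorMod keep the result in range for negative timestamps too.
  def hourOfDay(epochMillis: Long): Int =
    Math.floorMod(Math.floorDiv(epochMillis, MillisPerHour), HoursPerDay.toLong).toInt

  // Map the hour onto the unit circle as (cos, sin).
  def toUnitCircle(epochMillis: Long): (Double, Double) = {
    val radians = 2 * math.Pi * hourOfDay(epochMillis) / HoursPerDay
    (math.cos(radians), math.sin(radians))
  }

  // Euclidean distance between two points on the circle.
  def distance(a: (Double, Double), b: (Double, Double)): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)
}
```

In this sketch, 23:00 and 00:00 differ by 23 as raw hours but sit close together on the circle, while 12:00 and 00:00 sit on opposite sides.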

Additional context
Add any other context about the changes here.

@salesforce-cla

Thanks for the contribution! Unfortunately we can't verify the commit author(s): marcovivero <m***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

@codecov

codecov bot commented Sep 13, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@45241e6).
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master     #130   +/-   ##
=========================================
  Coverage          ?   37.87%           
=========================================
  Files             ?      298           
  Lines             ?     9692           
  Branches          ?      552           
=========================================
  Hits              ?     3671           
  Misses            ?     6021           
  Partials          ?        0
Impacted Files Coverage Δ
...a/com/salesforce/op/filters/RawFeatureFilter.scala 0% <0%> (ø)
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 31.03% <0%> (ø)
...a/com/salesforce/op/filters/PreparedFeatures.scala 0% <0%> (ø)
...ges/impl/feature/DateToUnitCircleTransformer.scala 0% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 45241e6...104169e.

@@ -30,9 +30,14 @@

package com.salesforce.op.filters


import java.time.{Instant, OffsetDateTime, ZoneOffset}
Collaborator

Please do not use java.time; use Joda-Time instead.

@@ -183,4 +200,38 @@ private[filters] object PreparedFeatures {
* @return array of string tokens
*/
private def tokenize(s: String) = TextTokenizer.Analyzer.analyze(s, Language.Unknown)

private def prepareDateValue(timestamp: Long, timePeriod: Option[TimePeriod]): Double =
Collaborator

@tovbinm tovbinm Sep 13, 2018

Can we please make DateToUnitCircle.convertToRandians reusable instead? This is essentially the same code.

Contributor Author

Done

@@ -521,12 +522,15 @@ class OpWorkflow(val uid: String = UID[OpWorkflow]) extends OpWorkflowCore {
maxCorrelation: Double = 0.95,
correlationType: CorrelationType = CorrelationType.Pearson,
protectedFeatures: Array[OPFeature] = Array.empty,
textBinsFormula: (Summary, Int) => Int = RawFeatureFilter.textBinsFormula
protectedJSFeatures: Array[OPFeature] = Array.empty,
Collaborator

  1. how is protectedJSFeatures different from protectedFeatures?
  2. please add docs for protectedJSFeatures param

Contributor Author

@marcovivero marcovivero Sep 13, 2018

This was already enabled in RawFeatureFilter; it protects features from the JS divergence check. I didn't see it in OpWorkflow.withRawFeatureFilter, so I'm just making sure it's here as well.
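
For context on the check mentioned in this reply, the Jensen-Shannon divergence between two histograms over the same bins can be sketched as below. This is the textbook definition with base-2 logarithms, not the RawFeatureFilter code; the object and method names are hypothetical.

```scala
// Hypothetical illustration of the textbook definition; not the RawFeatureFilter code.
object JsDivergenceSketch {
  // Kullback-Leibler divergence in bits; bins where p is 0 contribute nothing.
  private def kl(p: Seq[Double], q: Seq[Double]): Double =
    p.zip(q).collect {
      case (pi, qi) if pi > 0 => pi * (math.log(pi / qi) / math.log(2))
    }.sum

  // Jensen-Shannon divergence of two non-empty histograms with matching bins.
  def jsDivergence(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.length == b.length, "histograms must have the same number of bins")
    require(a.sum > 0 && b.sum > 0, "histograms must not be empty")
    val p = a.map(_ / a.sum) // normalize counts to probabilities
    val q = b.map(_ / b.sum)
    val m = p.zip(q).map { case (x, y) => (x + y) / 2 } // midpoint distribution
    0.5 * kl(p, m) + 0.5 * kl(q, m)
  }
}
```

With base-2 logarithms the value ranges from 0 for identical distributions to 1 for distributions with no overlap. Per the reply above, a feature listed as JS-protected is exempt from this comparison.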

row: Row,
responses: Array[TransientFeature],
predictors: Array[TransientFeature],
timePeriod: Option[TimePeriod]): PreparedFeatures = {
Collaborator

docs for timePeriod

Contributor Author

done

private def prepareFeature[T <: FeatureType](
name: String,
value: T,
timePeriod: Option[TimePeriod]): Map[FeatureKey, ProcessedSeq] =
Collaborator

docs for timePeriod again

Contributor Author

done

value match {
case v: Text => v.value
.map(s => Map[FeatureKey, ProcessedSeq]((name, None) -> Left(tokenize(s)))).getOrElse(Map.empty)
case v: Date => v.value.map { timestamp =>
Collaborator

I wonder why Text, Date and OPNumeric are handled differently from the other values, which require some value to be present (case ft@SomeValue(_) below)? @leahmcguire @marcovivero

Contributor Author

I believe this is because all of the values matched by case ft@SomeValue(_) already have ft.value as sequences; we'd end up needing to do the same operation to unwrap the Option[Double].
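
The distinction described in this reply can be sketched with simplified stand-in types. DateLike and DateListLike below are hypothetical and are not the TransmogrifAI feature types; the point is only that an Option-valued feature needs an explicit unwrap, while a sequence-valued one can be used directly once it is known to be non-empty.

```scala
// Simplified stand-ins for feature types; not the TransmogrifAI classes.
object OptionVsSeqSketch {
  sealed trait Feature
  final case class DateLike(value: Option[Long]) extends Feature   // single optional value
  final case class DateListLike(value: Seq[Long]) extends Feature  // already a sequence

  def prepare(name: String, feature: Feature): Map[String, Seq[Double]] = feature match {
    // Option-valued: unwrap explicitly and fall back to an empty map when missing.
    case DateLike(v) =>
      v.map(t => Map(name -> Seq(t.toDouble))).getOrElse(Map.empty[String, Seq[Double]])
    // Sequence-valued: the elements can be mapped directly when present.
    case DateListLike(v) if v.nonEmpty => Map(name -> v.map(_.toDouble))
    // Empty sequences produce no entry.
    case _ => Map.empty[String, Seq[Double]]
  }
}
```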

@@ -93,7 +94,8 @@ class RawFeatureFilter[T]
val correlationType: CorrelationType = CorrelationType.Pearson,
val jsDivergenceProtectedFeatures: Set[String] = Set.empty,
val protectedFeatures: Set[String] = Set.empty,
val textBinsFormula: (Summary, Int) => Int = RawFeatureFilter.textBinsFormula
val textBinsFormula: (Summary, Int) => Int = RawFeatureFilter.textBinsFormula,
val timePeriod: Option[TimePeriod] = None
Collaborator

docs for timePeriod

Collaborator

also, please add timePeriod to OpWorkflow.withRawFeatureFilter as well

Contributor Author

done

@@ -157,6 +162,68 @@ class PreparedFeaturesTest extends FlatSpec with TestSparkContext {
testCorrMatrix(allResponseKeys2, CorrelationType.Spearman, expected)
}

it should "correctly transform date features when time period is specified" in {
Collaborator

This test has to be split into separate cases or be a for loop, not a copy/paste.

Contributor Author

done, there's a single function for running each check.

@@ -157,6 +162,66 @@ class PreparedFeaturesTest extends FlatSpec with TestSparkContext {
testCorrMatrix(allResponseKeys2, CorrelationType.Spearman, expected)
}

it should "correctly transform date features when time period DayOfMonth is specified" in {
Collaborator

shorten test names to maybe: transform dates with DayOfMonth time period


def runDateToUnitCircleTest(
period: TimePeriod,
expected1: Double,
Collaborator

@tovbinm tovbinm Sep 13, 2018

you can make expected: Double* or expected: Seq[Double]
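
A minimal sketch of this suggestion, assuming a generic helper rather than the actual runDateToUnitCircleTest: the hypothetical matchesAll below takes the expected values as a varargs parameter, so no numbered expected1, expected2, ... parameters are needed. The function under test is passed in as f purely for illustration.

```scala
// Hypothetical helper; not the test code from this pull request.
object VarargsCheckSketch {
  // True when f applied to each input matches the expected value at the same position.
  def matchesAll(f: Long => Double, inputs: Seq[Long], expected: Double*): Boolean =
    inputs.length == expected.length &&
      inputs.zip(expected).forall { case (in, exp) => math.abs(f(in) - exp) < 1e-9 }
}
```

A call then reads matchesAll(f, inputs, 2.0, 4.0). Declaring the parameter as expected: Seq[Double] instead would make callers pass Seq(2.0, 4.0) explicitly.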

@@ -100,23 +100,31 @@ class DateToUnitCircleTransformer[T <: Date]
}
}

private[op] object DateToUnitCircle {
Collaborator

why remove private[op]?

Contributor Author

Oh whoops, I needed to use this with the REPL at some point.

case SomeValue(v: DenseVector) => Map((name, None) -> Right(v.toArray.toSeq))
case SomeValue(v: SparseVector) => Map((name, None) -> Right(v.indices.map(_.toDouble).toSeq))
case ft@SomeValue(_) => ft match {
case v: Text => Map((name, None) -> Left(v.value.toSeq.flatMap(tokenize)))
Collaborator

thank you! @marcovivero

Collaborator

@tovbinm tovbinm left a comment

lgtm!

@tovbinm
Collaborator

tovbinm commented Sep 17, 2018

I added DO NOT MERGE because we are waiting for @marcovivero to test it.

@tovbinm tovbinm merged commit f9a3718 into salesforce:master Sep 21, 2018