
Option to calculate LOCO for dates/texts by Leaving Out Entire Vector. #418

Merged: 21 commits merged into master from san/loco-aggregate on Nov 5, 2019

Conversation

sanmitra (Contributor)

Related issues
Raw features of type text and date are converted to vectors during feature engineering. To calculate LOCOs for such features, we calculate LOCO for each element in the vector and then average them. So the complexity of calculating LOCO for each date/text feature is O(n*m), where n is the size of the entire feature vector (i.e. the one containing all the features) and m is the size of the individual text/date feature vector.

Describe the proposed solution
An alternative way to calculate LOCO for a date/text feature is to zero out the entire vector of that feature and then calculate LOCO once. This brings the complexity of calculating LOCO for each date/text feature down to O(n). The enum value for the new approach is LeaveOutVector, the old approach is Avg, and the user can choose between them.
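The enum declaration itself is not shown in this excerpt; a minimal sketch of what it might look like, assuming an enumeratum-backed enum (the entryName/withName calls later in the PR point that way):

import enumeratum._

sealed trait VectorAggregationStrategy extends EnumEntry
object VectorAggregationStrategy extends Enum[VectorAggregationStrategy] {
  val values = findValues
  // Existing behavior: compute LOCO per element of the feature's vector, then average: O(n*m)
  case object Avg extends VectorAggregationStrategy
  // New behavior: zero out the feature's entire vector and compute LOCO once: O(n)
  case object LeaveOutVector extends VectorAggregationStrategy
}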

codecov bot commented Oct 11, 2019

Codecov Report

Merging #418 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #418      +/-   ##
==========================================
+ Coverage   86.93%   86.93%   +<.01%     
==========================================
  Files         337      337              
  Lines       11098    11100       +2     
  Branches      362      366       +4     
==========================================
+ Hits         9648     9650       +2     
  Misses       1450     1450
Impacted Files Coverage Δ
...e/op/stages/impl/insights/RecordInsightsLOCO.scala 96.84% <100%> (+0.06%) ⬆️



case VectorAggregationStrategy.LeaveOutVector =>
  val copyFeatureSparse = featureSparse.copy
  aggIndices.map { case (i, oldInd) => copyFeatureSparse.updated(i, oldInd, 0.0) }
Contributor

Maybe foreach instead of map?
Also doesn't copyFeatureSparse need to be a var?

Contributor Author

Sure, I will use foreach.
But I am not sure it should be a var. Reassignment of a val is not allowed, but the object it refers to can still have its internal state modified, so I think val is fine:

val a = Array(10, 20)
a.update(0, 15) // works
a = Array(20, 30) // fails

    val sumLOCOs = locos.reduce((a1, a2) => a1.zip(a2).map { case (l, r) => l + r })
    sumLOCOs.map(_ / indices.length)
  case VectorAggregationStrategy.LeaveOutVector =>
    indices.map { i => featureArray.update(i, 0.0) }
Contributor

foreach also here

  case VectorAggregationStrategy.LeaveOutVector =>
    indices.map { i => featureArray.update(i, 0.0) }
    val newScore = model.transformFn(l.toRealNN, featureArray.toOPVector).score.toSeq
    baseScore.zip(newScore).map { case (b, n) => b - n }
}
Contributor

Don't you need to revert featureArray to the old values?

@sanmitra (Contributor Author) commented Oct 13, 2019

No, because it's a copy there's no need; see the line val featureArray = v.copy.toArray.

val featureIndexSet = featuresSparse.indices.toSet

// Besides non-zero values, we want to check the text/date features as well
val zeroValIndices = (textFeatureIndices ++ dateFeatureIndices)
Contributor

How did you manage to remove the zero val indices logic?

Contributor Author

From the featureSparse vector we know the count of active indices, but to calculate the average LOCO for each date/text field we also need the zero-value indices. We used to compute the total count of indices for a date/text feature inside transformFn, i.e. once per transformation of each individual record/row. There is no need to do this: we can compute the total count of indices per date/text feature just once, at the global level (outside transformFn), using OpVectorColumnHistory. That is what the snippet below does:

private lazy val textFeaturesCount: Map[String, Int] = getFeatureCount(isTextIndex)
private lazy val dateFeaturesCount: Map[String, Int] = getFeatureCount(isDateIndex) 

val oldVal = v(i)
featureArray.update(i, 0.0) // zero out element i
val newScore = model.transformFn(l.toRealNN, featureArray.toOPVector).score.toSeq
featureArray.update(i, oldVal) // restore, so the next element sees the original values
Contributor

Then why do you update back, since featureArray is a copy?

Contributor Author

Here we have to, because featureArray is a single copy created before we calculate LOCO for each element, i.e.

val featureArray = v.copy.toArray
val locos = indices.map { i =>
  val oldVal = featureArray(i)
  featureArray.update(i, 0.0)
  calculateLOCO(featureArray, ....)
  featureArray.update(i, oldVal) // restore before the next iteration
}

If we refactor the above code like below, then we don't need to update back.

val locos = indices.map { i =>
  val featureArray = v.copy.toArray
  featureArray.update(i, 0.0)
  calculateLOCO(featureArray,...)
}

def setTextAggregationStrategy(strategy: VectorAggregationStrategy): this.type =
  set(textAggregationStrategy, strategy.entryName)
def getTextAggregationStrategy: VectorAggregationStrategy =
  VectorAggregationStrategy.withName($(textAggregationStrategy))
Collaborator

Do we really want/need to expose strategies for each feature type? I imagine there are other types of features we may eventually want to control as either average or leave-out-vector, and putting in individual settings for each seems excessive... maybe one parameter to control how ALL vector-treated features are handled?

Contributor Author

Yeah, I agree. One parameter should be enough.

Contributor Author

done
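The merged single-parameter version is not shown in this excerpt; a sketch of what it might look like inside the stage, following the setter/getter pattern of the per-type version above (the name vectorAggregationStrategy and the Avg default are assumptions):

import org.apache.spark.ml.param.Param

final val vectorAggregationStrategy = new Param[String](this, "vectorAggregationStrategy",
  "strategy for aggregating LOCOs over the elements of a text/date feature vector")
// Assumed default: preserve the pre-existing averaging behavior.
setDefault(vectorAggregationStrategy, VectorAggregationStrategy.Avg.entryName)

def setVectorAggregationStrategy(strategy: VectorAggregationStrategy): this.type =
  set(vectorAggregationStrategy, strategy.entryName)
def getVectorAggregationStrategy: VectorAggregationStrategy =
  VectorAggregationStrategy.withName($(vectorAggregationStrategy))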

@sanmitra (Contributor Author)

@tovbinm can you please take a look at this PR?

@tovbinm (Collaborator) left a comment

I don't see where we went from O(n*m) to O(n) complexity. Please comment.

private def getFeaturesSize(predicate: OpVectorColumnHistory => Boolean): Map[String, Int] = histories
  .filter(predicate)
  .groupBy { h => getRawFeatureName(h).get }
  .mapValues(_.length).view.toMap
Collaborator

  1. What's the point of .view here? Either add it before filter or don't add it at all.
  2. Add docs.
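A possible simplification of the snippet above, assuming Scala 2.11/2.12 collections, where mapValues already returns a lazy view and the trailing .toMap is what forces evaluation:

// Equivalent without the redundant .view: mapValues(_.length) is already lazy
// in Scala 2.11/2.12, and .toMap materializes it into a strict Map.
private def getFeaturesSize(predicate: OpVectorColumnHistory => Boolean): Map[String, Int] =
  histories
    .filter(predicate)
    .groupBy(h => getRawFeatureName(h).get)
    .mapValues(_.length)
    .toMap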

@gerashegalov (Contributor) left a comment

LGTM overall; please address the previous comments.

@sanmitra (Contributor Author) commented Nov 4, 2019

@tovbinm @gerashegalov I have addressed your review comments. Thanks

  strategy: VectorAggregationStrategy,
  baseScore: Array[Double],
  featureSize: Int
): Array[Double] = {
@sanmitra (Contributor Author) commented Nov 4, 2019

The cost of computeDiff is O(n). When VectorAggregationStrategy = Avg we call computeDiff m times, and when VectorAggregationStrategy = LeaveOutVector we call computeDiff only once. So the former computation is O(n*m) and the latter is O(n), where n is the size of the entire feature vector (i.e. the one containing all the features) and m is the size of the individual text/date feature vector being aggregated.

@tovbinm ^^
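Stitching the diff excerpts above together, a condensed sketch of the two code paths (computeDiff here stands in for the O(n) rescoring step, i.e. model.transformFn(...).score diffed against baseScore; its signature in this sketch is an assumption):

def aggregateLOCOs(
  strategy: VectorAggregationStrategy,
  indices: Array[Int],              // vector indices belonging to one date/text feature
  featureArray: Array[Double],      // mutable copy of the full feature vector (size n)
  computeDiff: Array[Double] => Array[Double] // O(n): rescore and diff against baseScore
): Array[Double] = strategy match {
  case VectorAggregationStrategy.Avg =>
    // One computeDiff call per element: m calls of O(n) each => O(n*m).
    val locos = indices.map { i =>
      val oldVal = featureArray(i)
      featureArray.update(i, 0.0)
      val diff = computeDiff(featureArray)
      featureArray.update(i, oldVal) // restore before the next element
      diff
    }
    val sumLOCOs = locos.reduce((a1, a2) => a1.zip(a2).map { case (l, r) => l + r })
    sumLOCOs.map(_ / indices.length)
  case VectorAggregationStrategy.LeaveOutVector =>
    // Zero the whole feature vector at once: a single O(n) call.
    indices.foreach(i => featureArray.update(i, 0.0))
    computeDiff(featureArray)
}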

@sanmitra (Contributor Author) commented Nov 5, 2019

@tovbinm can you please merge this PR?

@tovbinm (Collaborator) commented Nov 5, 2019

@sanmitra build timed out. restarted.

@sanmitra merged commit 8ec6234 into master on Nov 5, 2019
@sanmitra deleted the san/loco-aggregate branch on November 5, 2019 at 00:28
@tovbinm (Collaborator) commented Nov 12, 2019

@sanmitra the build is failing on the master branch due to a LOCO test failure. Please have a look. Thanks.

@nicodv mentioned this pull request Jun 11, 2020