Add minimum rows for scoring set in RawFeatureFilter + rewrite tests to use data generators #250

Jauntbox · 2019-03-25T23:15:07Z

Related issues
n/a

Describe the proposed solution
Adds minRowsForScoringSet to RawFeatureFilter so that features are not flagged for removal when the scoring set is extremely small and not representative. The threshold is currently set to be the same as the minimum training set size.
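The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the object name `ScoringSetGuard`, the method `shouldCompareToTraining`, and the inclusive `>=` comparison are all assumptions; only the name `minRowsForScoringSet` and the value 500 come from the diff under review.

```scala
// Hypothetical sketch; names other than minRowsForScoringSet are illustrative.
object ScoringSetGuard {
  // Value taken from the diff in this PR (same as the minimum training size).
  val minRowsForScoringSet: Int = 500

  /** True only when the scoring set has enough rows for a training/scoring
    * comparison to be meaningful. The inclusive bound is an assumption. */
  def shouldCompareToTraining(scoringRowCount: Long): Boolean =
    scoringRowCount >= minRowsForScoringSet
}
```

With a check like this, a scoring set of 20 rows would skip the train/score comparisons entirely, so no feature would be flagged for removal on the basis of those 20 rows.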

Describe alternatives you've considered
n/a

Additional context
This change broke many of the existing tests in RawFeatureFilterTest because they relied on fixed local datasets that fell well below the new threshold on scoring set size. All of the dataframe cleaning tests have been rewritten to use randomly generated data from the testkit generators, so the tests no longer depend on small fixed datasets.
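To illustrate the idea of generated test data, here is a minimal sketch in plain Scala. It deliberately does not use the testkit generators mentioned above; `SyntheticColumn`, `generate`, and `fillRate` are hypothetical stand-ins built on `scala.util.Random`, shown only to make the "same distribution, chosen size and fill rate" idea concrete.

```scala
import scala.util.Random

// Hypothetical stand-in for a testkit-style generator; not the TransmogrifAI API.
object SyntheticColumn {
  /** Generates `n` optional values drawn from a standard normal distribution.
    * Each value is empty with probability `probEmpty`; `seed` makes it repeatable. */
  def generate(n: Int, probEmpty: Double, seed: Long): Seq[Option[Double]] = {
    val rng = new Random(seed)
    Seq.fill(n) {
      if (rng.nextDouble() < probEmpty) None else Some(rng.nextGaussian())
    }
  }

  /** Fraction of non-empty values in the column (0.0 for an empty column). */
  def fillRate(col: Seq[Option[Double]]): Double =
    if (col.isEmpty) 0.0 else col.count(_.isDefined).toDouble / col.size
}
```

Because the row count is a parameter, a test can generate as many rows as the scoring set threshold requires, and a fixed seed keeps the generated data identical from run to run.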

codecov · 2019-03-26T04:47:29Z

Codecov Report

Merging #250 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #250      +/-   ##
==========================================
+ Coverage   86.55%   86.58%   +0.03%     
==========================================
  Files         314      314              
  Lines       10298    10302       +4     
  Branches      342      556     +214     
==========================================
+ Hits         8913     8920       +7     
+ Misses       1385     1382       -3

| Impacted Files | Coverage Δ | |
|---|---|---|
| .../src/main/scala/com/salesforce/op/OpWorkflow.scala | `87.5% <ø> (ø)` | ⬆️ |
| ...a/com/salesforce/op/filters/RawFeatureFilter.scala | `92.77% <100%> (+1.86%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3aa144a...bee777b.

leahmcguire

@Jauntbox let's make the min scoring rows settable by the user with a default of 500, then override it so the old tests are still valid, and add some new tests.

leahmcguire · 2019-03-26T16:04:07Z

core/src/main/scala/com/salesforce/op/filters/RawFeatureFilter.scala

@@ -371,6 +373,10 @@ object RawFeatureFilter {
    bins
  }

+  // If there are not enough rows in the scoring set, we should not perform comparisons between the training and
+  // scoring sets since they will not be reliable. Currently, this is set to the same as the minimum training size.
+  val minRowsForScoringSet = 500


Let's not hard-code this; it should be a parameter that the user can override.

Good point, fixed.
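One way the requested change could look is sketched below. This is an assumption-laden illustration, not the actual `RawFeatureFilter` signature: `MinScoringRowsExample`, `FilterConfig`, and `DefaultMinScoringRows` are hypothetical names; only the default of 500 and the idea of a user-overridable parameter come from the review comments.

```scala
// Hypothetical sketch; RawFeatureFilter's real constructor may differ.
object MinScoringRowsExample {
  // Default suggested in review.
  val DefaultMinScoringRows: Int = 500

  /** Configuration with a user-overridable minimum scoring set size. */
  final case class FilterConfig(minScoringRows: Int = DefaultMinScoringRows) {
    require(minScoringRows >= 0, "minScoringRows must be non-negative")
  }
}
```

A caller that omits the argument gets the default, while a test built on a small fixed dataset could pass `FilterConfig(minScoringRows = 0)` so that its existing expectations still hold.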

leahmcguire · 2019-03-26T16:04:30Z

core/src/test/scala/com/salesforce/op/OpWorkflowTest.scala

    data.schema.fields.map(_.name).toSet shouldEqual
-      Set("key", "height", "survived", "stringMap", "numericMap", "booleanMap")
+      Set("booleanMap", "description", "height", "stringMap", "age", "key", "survived", "numericMap")


Once things are parameterized, let's set them so that the tests remain the same.

I was just about to suggest the same ;)

tovbinm · 2019-03-27T04:32:42Z

core/src/test/scala/com/salesforce/op/filters/RawFeatureFilterTest.scala

+  }
+
+  /**
+   * This test generates three numeric generators with the same underlying distribution, but different fill rates.


great test docs!!

tovbinm

@leahmcguire @Jauntbox can we not test RFF without OpWorkflow?

tovbinm

lgtm! great test docs!!!

leahmcguire

LGTM - nice tests!

tovbinm · 2019-03-28T21:00:32Z

core/src/test/scala/com/salesforce/op/filters/RawFeatureFilterTest.scala

 @RunWith(classOf[JUnitRunner])
 class RawFeatureFilterTest extends FlatSpec with PassengerSparkFixtureTest with FiltersTestData {

+  // loggingLevel(Level.INFO)


please remove this line // loggingLevel(Level.INFO)

tovbinm · 2019-03-28T21:00:41Z

core/src/test/scala/com/salesforce/op/filters/RawFeatureFilterTest.scala

+  val featureUniverse = Set("myF1", "myF2", "myF3")
+  val mapKeyUniverse = Set("f1", "f2", "f3")
+  // Number of rows to use in randomly generated data sets
+  val numRows = 1000


can we drop this to 500?

tovbinm

minor comments, lgtm!

Jauntbox added 7 commits March 15, 2019 12:24

Adds min scoring set size to RFF

9ba57d6

Updated tests with random feature generation, attempting to get the same thing working for maps

3e6bc34

Several new tests with randomly generated features

c05f61e

Cleaned up new RFF tests, and added helper function for repetitive checks

8469d1f

More documentation and readability changes

0999ac9

Small cleanup

8b483bf

Merge branch 'master' of github.com:salesforce/TransmogrifAI into km/rff-limits

787bca6

Jauntbox requested review from leahmcguire and tovbinm as code owners March 25, 2019 23:15

Jauntbox assigned tovbinm and leahmcguire Mar 25, 2019

Jauntbox added 3 commits March 25, 2019 16:50

Fix scalastyle errors

aa1ebb6

Debugging

8e911b9

Fixed OpWorkflowTest and reduced flakiness of LOCO test

c08ac2d

leahmcguire requested changes Mar 26, 2019

Jauntbox added 2 commits March 26, 2019 09:46

Made minScoringRows settable, with a default

bef03dd

Put all the old RFF tests back in and overrode the minScoringRows when necessary

73d9d61

tovbinm reviewed Mar 27, 2019

tovbinm approved these changes Mar 27, 2019

leahmcguire approved these changes Mar 27, 2019

tovbinm reviewed Mar 28, 2019

tovbinm approved these changes Mar 28, 2019

Jauntbox added 3 commits March 28, 2019 14:11

Fix merge conflicts

ebcebdd

Reduced number of rows generated down to 500 (from 1000) to speed things up

94f856a

Fixed scalastyle issues and made a test less likely to be flaky

bee777b

tovbinm merged commit 32ec731 into master Mar 28, 2019

tovbinm deleted the km/rff-limits branch March 28, 2019 22:23

tovbinm mentioned this pull request Apr 10, 2019

Release 0.5.2 #277

Merged
