Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add minimum rows for scoring set in RawFeatureFilter + rewrite tests to use data generators #250

Merged
merged 15 commits into from
Mar 28, 2019

Conversation

Jauntbox
Copy link
Contributor

Related issues
n/a

Describe the proposed solution
Adds minRowsForScoringSet to RawFeatureFilter so that features are not flagged for removal when the scoring set is extremely small and not representative. The threshold is currently set to be the same as the minimum training set size.

Describe alternatives you've considered
n/a

Additional context
This change broke many of the existing tests in RawFeatureFilterTest since they relied on fixed local datasets that fell well below the new threshold on scoring set size. All of the dataframe cleaning tests have been rewritten to use randomly generated data from the testkit generators so the data we perform the tests on is more robust.

@codecov
Copy link

codecov bot commented Mar 26, 2019

Codecov Report

Merging #250 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #250      +/-   ##
==========================================
+ Coverage   86.55%   86.58%   +0.03%     
==========================================
  Files         314      314              
  Lines       10298    10302       +4     
  Branches      342      556     +214     
==========================================
+ Hits         8913     8920       +7     
+ Misses       1385     1382       -3
Impacted Files Coverage Δ
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 87.5% <ø> (ø) ⬆️
...a/com/salesforce/op/filters/RawFeatureFilter.scala 92.77% <100%> (+1.86%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3aa144a...bee777b. Read the comment docs.

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jauntbox lets make the min scoring rows settable by the user with a default of 500 - then override it so the old tests are still valid and add some new tests....

@@ -371,6 +373,10 @@ object RawFeatureFilter {
bins
}

// If there are not enough rows in the scoring set, we should not perform comparisons between the training and
// scoring sets since they will not be reliable. Currently, this is set to the same as the minimum training size.
val minRowsForScoringSet = 500
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets not make this hard coded - it should be a parameter that the user can override

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed.

data.schema.fields.map(_.name).toSet shouldEqual
Set("key", "height", "survived", "stringMap", "numericMap", "booleanMap")
Set("booleanMap", "description", "height", "stringMap", "age", "key", "survived", "numericMap")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once things are parameterized lets set them so that the tests remain the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just about to suggest the same ;)

}

/**
* This test generates three numeric generators with the same underlying distribution, but different fill rates.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great test docs!!

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leahmcguire @Jauntbox can we not test RFF without OpWorkflow?

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! great test docs!!!

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - nice tests!

@RunWith(classOf[JUnitRunner])
class RawFeatureFilterTest extends FlatSpec with PassengerSparkFixtureTest with FiltersTestData {

// loggingLevel(Level.INFO)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this line // loggingLevel(Level.INFO)

val featureUniverse = Set("myF1", "myF2", "myF3")
val mapKeyUniverse = Set("f1", "f2", "f3")
// Number of rows to use in randomly generated data sets
val numRows = 1000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we drop this to 500?

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, lgtm!

@tovbinm tovbinm merged commit 32ec731 into master Mar 28, 2019
@tovbinm tovbinm deleted the km/rff-limits branch March 28, 2019 22:23
@tovbinm tovbinm mentioned this pull request Apr 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants