Add minimum rows for scoring set in RawFeatureFilter + rewrite tests to use data generators #250
…ame thing working for maps
Codecov Report
@@ Coverage Diff @@
## master #250 +/- ##
==========================================
+ Coverage 86.55% 86.58% +0.03%
==========================================
Files 314 314
Lines 10298 10302 +4
Branches 342 556 +214
==========================================
+ Hits 8913 8920 +7
+ Misses 1385 1382 -3
@Jauntbox let's make the minimum number of scoring rows settable by the user with a default of 500, then override it so the old tests are still valid, and add some new tests.
@@ -371,6 +373,10 @@ object RawFeatureFilter {
bins
}

// If there are not enough rows in the scoring set, we should not perform comparisons between the training and
// scoring sets since they will not be reliable. Currently, this is set to the same as the minimum training size.
val minRowsForScoringSet = 500
Let's not hard-code this; it should be a parameter that the user can override.
Good point, fixed.
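A minimal sketch of the parameterization discussed in this thread. It assumes a hypothetical `ScoringSetCheck` helper and parameter name `minScoringRows`; neither is taken from the actual RawFeatureFilter code. The inclusive `>=` comparison is also an assumption. The threshold defaults to 500 and can be overridden, for example by tests that rely on small fixed datasets.

```scala
// Hypothetical sketch, not the RawFeatureFilter API: all names here are illustrative.
final case class ScoringSetCheck(minScoringRows: Int = 500) {
  require(minScoringRows >= 0, "minScoringRows must be non-negative")

  // Training/scoring comparisons are only treated as reliable when the scoring
  // set has at least `minScoringRows` rows (inclusive bound is an assumption).
  def shouldCompare(scoringRowCount: Long): Boolean =
    scoringRowCount >= minScoringRows
}

object ScoringSetCheckExample {
  // Default threshold of 500 rows
  val default: ScoringSetCheck = ScoringSetCheck()

  // Override for tests on small fixed datasets, so existing expectations still hold
  val forSmallFixtures: ScoringSetCheck = ScoringSetCheck(minScoringRows = 0)
}
```

With this shape, existing tests would pass an explicit low threshold, while new tests exercise the default.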
data.schema.fields.map(_.name).toSet shouldEqual
Set("key", "height", "survived", "stringMap", "numericMap", "booleanMap")
Set("booleanMap", "description", "height", "stringMap", "age", "key", "survived", "numericMap")
Once things are parameterized, let's set them so that the tests remain the same.
fixed
I was just about to suggest the same ;)
}

/**
 * This test generates three numeric generators with the same underlying distribution, but different fill rates.
great test docs!!
@leahmcguire @Jauntbox can we not test RFF without OpWorkflow?
lgtm! great test docs!!!
LGTM - nice tests!
@RunWith(classOf[JUnitRunner])
class RawFeatureFilterTest extends FlatSpec with PassengerSparkFixtureTest with FiltersTestData {

// loggingLevel(Level.INFO)
please remove this line // loggingLevel(Level.INFO)
val featureUniverse = Set("myF1", "myF2", "myF3")
val mapKeyUniverse = Set("f1", "f2", "f3")
// Number of rows to use in randomly generated data sets
val numRows = 1000
can we drop this to 500?
minor comments, lgtm!
Related issues
n/a
Describe the proposed solution
Adds minRowsForScoringSet to RawFeatureFilter so that features are not flagged for removal when the scoring set is extremely small and not representative. The threshold is currently set to be the same as the minimum training set size.
Describe alternatives you've considered
n/a
Additional context
This change broke many of the existing tests in RawFeatureFilterTest since they relied on fixed local datasets that fell well below the new threshold on scoring set size. All of the dataframe cleaning tests have been rewritten to use randomly generated data from the testkit generators so the data we perform the tests on is more robust.
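To illustrate the idea behind the rewritten tests, here is a rough sketch in plain Scala of generating columns that share one underlying distribution but differ in fill rate. It deliberately does not use the testkit generators; `FillRateSketch`, `generate`, `fillRate`, and the three empty-probabilities are all assumptions made for this example.

```scala
import scala.util.Random

// Hypothetical stand-in for a testkit-style generator; not the testkit API.
object FillRateSketch {
  // Draws `n` optional values from a standard normal distribution.
  // Each value is replaced by None with probability `probEmpty`.
  def generate(n: Int, probEmpty: Double, seed: Long): Seq[Option[Double]] = {
    val rng = new Random(seed)
    Seq.fill(n) {
      val value = rng.nextGaussian()
      if (rng.nextDouble() < probEmpty) None else Some(value)
    }
  }

  // Fraction of rows that are populated
  def fillRate(xs: Seq[Option[Double]]): Double =
    if (xs.isEmpty) 0.0 else xs.count(_.isDefined).toDouble / xs.size

  def main(args: Array[String]): Unit = {
    val numRows = 1000
    // Same distribution, three different fill rates (values chosen for illustration)
    val probsEmpty = Seq(0.0, 0.5, 1.0)
    probsEmpty.zipWithIndex.foreach { case (p, i) =>
      val data = generate(numRows, p, seed = 42L + i)
      println(f"probEmpty=$p%.1f fillRate=${fillRate(data)}%.3f")
    }
  }
}
```

Fixing the seed keeps the generated data reproducible across test runs, and a row count above the scoring-set threshold keeps the train/score comparisons active.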