
Add minimum rows for scoring set in RawFeatureFilter + rewrite tests to use data generators #250

Merged Mar 28, 2019 (15 commits).
Changes from 10 commits
```diff
@@ -300,9 +300,11 @@ class RawFeatureFilter[T]
     val scoreData = scoringReader.flatMap { s =>
       val sd = s.generateDataFrame(rawFeatures, parameters.switchReaderParams()).persist()
       log.info("Loaded scoring data")
-      if (sd.count() > 0) Some(sd)
+      val scoringDataCount = sd.count()
+      if (scoringDataCount >= RawFeatureFilter.minRowsForScoringSet) Some(sd)
       else {
-        log.warn("Scoring dataset was empty. Only training data checks will be used.")
+        log.warn(s"Scoring dataset has $scoringDataCount rows, which is less than the minimum required of " +
+          s"${RawFeatureFilter.minRowsForScoringSet}. Only training data checks will be used.")
         None
       }
     }
```
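The count-then-guard pattern in the hunk above can be sketched in plain Scala, with an in-memory sequence standing in for the Spark DataFrame. The 500-row minimum and the names mirror the diff, but this is an illustrative sketch, not the actual TransmogrifAI code:

```scala
// Sketch of the new guard: keep the scoring set only when it has at least
// minRowsForScoringSet rows; otherwise fall back to training-data-only checks.
val minRowsForScoringSet = 500

def guardScoringSet[A](scoringData: Option[Seq[A]]): Option[Seq[A]] =
  scoringData.flatMap { sd =>
    val scoringDataCount = sd.size
    if (scoringDataCount >= minRowsForScoringSet) Some(sd)
    else {
      // Too few rows for reliable training-vs-scoring comparisons.
      println(s"Scoring dataset has $scoringDataCount rows, below the minimum of " +
        s"$minRowsForScoringSet. Only training data checks will be used.")
      None
    }
  }
```

Note that in the real code `sd.count()` is called once and the result reused, so the (potentially expensive) Spark action on the persisted DataFrame is not triggered twice.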
```diff
@@ -371,6 +373,10 @@ object RawFeatureFilter {
     bins
   }

+  // If there are not enough rows in the scoring set, we should not perform comparisons between the training and
+  // scoring sets since they will not be reliable. Currently, this is set to the same as the minimum training size.
+  val minRowsForScoringSet = 500
+
 }
```

Collaborator: Let's not make this hardcoded; it should be a parameter that the user can override.

Contributor (author): Good point, fixed.
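The reviewer's suggestion could be realized by giving the constant a user-overridable home, with the current hardcoded value as the default. A hypothetical sketch, with illustrative names that are not TransmogrifAI's actual API:

```scala
// Hypothetical: expose the minimum scoring-set size as a parameter instead of
// a hardcoded constant, defaulting to the value added in this commit.
object RawFeatureFilterDefaults {
  val MinRowsForScoringSet: Int = 500 // same as the minimum training size
}

class ScoringSetGuard(
  val minRowsForScoringSet: Int = RawFeatureFilterDefaults.MinRowsForScoringSet
) {
  // True when the scoring set is large enough for training-vs-scoring comparisons.
  def scoringChecksEnabled(scoringDataCount: Long): Boolean =
    scoringDataCount >= minRowsForScoringSet
}
```

A test could then pin the parameter low (e.g. `new ScoringSetGuard(minRowsForScoringSet = 1)`) so that a small fixture still exercises the scoring-set checks, which is what the review thread on the test below asks for.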

core/src/test/scala/com/salesforce/op/OpWorkflowTest.scala (8 changes: 5 additions & 3 deletions)
```diff
@@ -241,14 +241,16 @@ class OpWorkflowTest extends FlatSpec with PassengerSparkFixtureTest {
     val fv = Seq(age, gender, height, weight, description, boarded, stringMap, numericMap, booleanMap).transmogrify()
     val survivedNum = survived.occurs()
     val pred = BinaryClassificationModelSelector().setInput(survivedNum, fv).getOutput()

     val wf = new OpWorkflow()
       .setResultFeatures(pred)
-      .withRawFeatureFilter(Option(dataReader), Option(simpleReader),
-        maxFillRatioDiff = 1.0) // only height and the female key of maps should meet this criteria
+      .withRawFeatureFilter(Option(dataReader), None, maxFillRatioDiff = 1.0)
     val data = wf.computeDataUpTo(weight)
+
+    // Since there are < 500 rows in the scoring set, only the training set checks are applied here, and the only
+    // removal reasons should be null indicator - label correlations
     data.schema.fields.map(_.name).toSet shouldEqual
-      Set("key", "height", "survived", "stringMap", "numericMap", "booleanMap")
+      Set("booleanMap", "description", "height", "stringMap", "age", "key", "survived", "numericMap")
   }

   it should "return a model that transforms the data correctly" in {
```

Collaborator: Once things are parameterized, let's set them so that the tests remain the same.

Contributor (author): Fixed.

Collaborator: I was just about to suggest the same ;)