
Tweaks to OpBinScoreEvaluator #233

Merged · 24 commits · Apr 3, 2019

Conversation

shaeselix
Contributor

Related issues
Refer to issue(s) addressed in this pull request from Issues page.
N/A

Describe the proposed solution
Adds Lift, a new BinaryClassificationMetrics evaluator, to show how well scores predict positive labels across score groupings / bands, as seen in the first plot of this article: https://www.kdnuggets.com/2016/03/lift-analysis-data-scientist-secret-weapon.html
A new parameter, LiftMetrics, has been added to BinaryClassificationMetrics; it is filled with a Seq[LiftMetricBand] calculated from an RDD of scoreAndLabel in LiftEvaluator.
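
For context, a minimal sketch of the per-band lift computation on plain Scala collections (the PR itself works on an RDD; the band boundaries, case class fields, and helper name below are illustrative assumptions, not the PR's exact API). Plotting each band's positive rate against the overall positive rate yields the lift chart from the linked article:

```scala
// Illustrative sketch: ten equal-width score bands over [0.0, 1.0],
// with per-band counts and positive rate. Names are assumptions.
case class LiftMetricBand(
  band: String,
  lowerBound: Double,
  upperBound: Double,
  totalCount: Long,
  positiveCount: Long,
  positiveRate: Option[Double]
)

def liftBands(scoreAndLabels: Seq[(Double, Double)]): Seq[LiftMetricBand] = {
  // Bands [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]; the last band includes 1.0.
  val bands = (0 until 10).map(i => (i / 10.0, (i + 1) / 10.0))
  bands.map { case (lower, upper) =>
    val inBand = scoreAndLabels.filter { case (score, _) =>
      score >= lower && (score < upper || (upper == 1.0 && score <= 1.0))
    }
    val positives = inBand.count { case (_, label) => label > 0.0 }.toLong
    val rate = if (inBand.isEmpty) None else Some(positives.toDouble / inBand.size)
    LiftMetricBand(f"$lower%.1f-$upper%.1f", lower, upper, inBand.size.toLong, positives, rate)
  }
}
```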

Describe alternatives you've considered
Rather than an evaluator, this could be an estimator. However, since it's a method of evaluation that summarizes a trained dataset rather than estimating new values from data, IMO it belongs in evaluators. MultiClassificationMetrics has ThresholdMetrics, designed for a Confidence Plot, and this PR was modeled on that design.

Additional context

@salesforce-cla

Thanks for the contribution! It looks like @shaeselix is an internal user so signing the CLA is not required. However, we need to confirm this.

@amateurhuman

@shaeselix you should have an email invite to join the Salesforce org, or you can visit https://github.com/salesforce to accept. Once you've accepted, you can refresh the CLA check at https://cla.salesforce.com/status/salesforce/TransmogrifAI/pull/233

@codecov

codecov bot commented Feb 28, 2019

Codecov Report

Merging #233 into master will decrease coverage by 17.95%.
The diff coverage is 89.13%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #233       +/-   ##
===========================================
- Coverage    86.4%   68.44%   -17.96%     
===========================================
  Files         312      313        +1     
  Lines       10187    10231       +44     
  Branches      336      553      +217     
===========================================
- Hits         8802     7003     -1799     
- Misses       1385     3228     +1843
Impacted Files Coverage Δ
...p/evaluators/OpBinaryClassificationEvaluator.scala 78.57% <50%> (-3.93%) ⬇️
...a/com/salesforce/op/evaluators/LiftEvaluator.scala 92.85% <92.85%> (ø)
...ce/op/stages/impl/feature/TextLenTransformer.scala 0% <0%> (-100%) ⬇️
...lesforce/op/stages/impl/feature/LangDetector.scala 0% <0%> (-100%) ⬇️
...alesforce/op/cli/gen/templates/SimpleProject.scala 0% <0%> (-100%) ⬇️
...main/scala/com/salesforce/op/filters/Summary.scala 0% <0%> (-100%) ⬇️
.../scala/com/salesforce/op/cli/gen/ProblemKind.scala 0% <0%> (-100%) ⬇️
...orce/op/stages/impl/feature/RealNNVectorizer.scala 0% <0%> (-100%) ⬇️
...cala/com/salesforce/op/cli/gen/FileInProject.scala 0% <0%> (-100%) ⬇️
...p/stages/impl/feature/OpScalarStandardScaler.scala 0% <0%> (-100%) ⬇️
... and 56 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da1b27f...cc65e50. Read the comment docs.

@codecov

codecov bot commented Feb 28, 2019

Codecov Report

Merging #233 into master will decrease coverage by 7.16%.
The diff coverage is 95.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #233      +/-   ##
==========================================
- Coverage    86.6%   79.43%   -7.17%     
==========================================
  Files         315      315              
  Lines       10341    10345       +4     
  Branches      325      533     +208     
==========================================
- Hits         8956     8218     -738     
- Misses       1385     2127     +742
Impacted Files Coverage Δ
...p/evaluators/OpBinaryClassificationEvaluator.scala 82.5% <ø> (ø) ⬆️
...salesforce/op/evaluators/OpBinScoreEvaluator.scala 96.77% <95.83%> (+0.47%) ⬆️
...ala/com/salesforce/op/testkit/InfiniteStream.scala 0% <0%> (-100%) ⬇️
...alesforce/op/cli/gen/templates/SimpleProject.scala 0% <0%> (-100%) ⬇️
...ala/com/salesforce/op/testkit/RandomIntegral.scala 0% <0%> (-100%) ⬇️
...scala/com/salesforce/op/testkit/RandomStream.scala 0% <0%> (-100%) ⬇️
...n/scala/com/salesforce/op/testkit/RandomData.scala 0% <0%> (-100%) ⬇️
...com/salesforce/op/local/OpWorkflowModelLocal.scala 0% <0%> (-100%) ⬇️
...com/salesforce/op/testkit/ProbabilityOfEmpty.scala 0% <0%> (-100%) ⬇️
...om/salesforce/op/local/OpWorkflowRunnerLocal.scala 0% <0%> (-100%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df42f37...e67646a. Read the comment docs.

@leahmcguire (Collaborator) left a comment:

Hi @shaeselix, thanks for the contribution!! Is this different from the score binning done (and stored) in the BrierScore calculation (https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/evaluators/OpBinScoreEvaluator.scala)? Could the lift metric be added to that evaluator as another output metric?

scoreAndLabels: RDD[(Double, Double)],
getScoreBands: RDD[Double] => Seq[(Double, Double, String)]
): Seq[LiftMetricBand] = {
val bands = getScoreBands(scoreAndLabels.map { case (score, _) => score })
Collaborator:

val scores = scoreAndLabels.map { case (score, _) => score }

}.collect { case (Some(band), label) => (band, label) }
val perBandCounts = aggregateBandedLabels(bandedLabels)
val overallRate = overallLiftRate(perBandCounts)
bands.map({ case (lower, upper, band) =>
Collaborator:

bands.map { case (lower, upper, band) => ... }

* @param scores RDD of scores. unused in this function
* @return sequence of (lowerBound, upperBound, bandString) tuples
*/
private[op] def getDefaultScoreBands(scores: RDD[Double]):
Collaborator:

Redundant line break - this fits on one line: private[op] def getDefaultScoreBands(scores: RDD[Double]): Seq[(Double, Double, String)] =

): Seq[LiftMetricBand] = {
liftMetricBands(
scoreAndLabels,
getDefaultScoreBands
Collaborator:

so why not allow users to customize score bands?
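
Passing the band generator in as a function is what would make that possible; a hedged sketch of a caller-supplied generator (quartile bands here, purely illustrative; the tuple shape follows the liftMetricBands signature quoted above):

```scala
import org.apache.spark.rdd.RDD

// Illustrative custom band generator: four equal-width bands over [0, 1]
// instead of the default ten. The scores argument is available in case a
// caller wants data-driven bands (e.g. quantiles).
def quartileBands(scores: RDD[Double]): Seq[(Double, Double, String)] =
  (0 until 4).map { i =>
    val lower = i / 4.0
    val upper = (i + 1) / 4.0
    (lower, upper, f"$lower%.2f-$upper%.2f")
  }

// Hypothetical usage, assuming the liftMetricBands signature from the diff:
// val bands = liftMetricBands(scoreAndLabels, quartileBands)
```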

* Algorithm for calculating a chart as seen here:
* https://www.kdnuggets.com/2016/03/lift-analysis-data-scientist-secret-weapon.html
*/
object LiftEvaluator {
Collaborator:

The evaluator has to extend OpBinaryClassificationEvaluatorBase, OpMultiClassificationEvaluatorBase or OpRegressionEvaluatorBase

@shaeselix
Contributor Author

@leahmcguire @tovbinm apologies for the delayed reply on this. I was going through the BrierScore code, and while they're calculating very similar things, I don't think they can be combined, because the BrierScore calculations are bounded by MinScore and MaxScore while lift plots should be bounded by 0.0 and 1.0. I was able to refactor it to extend OpBinaryClassificationEvaluatorBase and evaluate to a single value when a threshold for the score decision boundary is provided. Since F1 is also in BinaryClassificationMetrics and uses a threshold, I think it should be okay to use one here as well.

@leahmcguire
Collaborator

@shaeselix - the reason the Brier score goes from min to max and not 0 to 1 is that not every classification model provides a probability as an output. This will fail if you try to run it on an SVM.

If that is really the difference in what you are trying to do then you should parameterize the Brier score so you can optionally set the min and max and throw an error if the data falls outside them.
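
A rough sketch of the parameterization being suggested (the helper and parameter names here are assumptions, not the actual OpBinScoreEvaluator API):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical min/max parameterization for score binning: if explicit bounds
// are supplied, fail fast when any score falls outside them; otherwise fall
// back to the observed min and max, which also covers non-probabilistic
// scores such as SVM margins.
def resolveScoreBounds(
  scores: RDD[Double],
  minScore: Option[Double] = None,
  maxScore: Option[Double] = None
): (Double, Double) = {
  val observedMin = scores.min()
  val observedMax = scores.max()
  val lower = minScore.getOrElse(observedMin)
  val upper = maxScore.getOrElse(observedMax)
  require(observedMin >= lower && observedMax <= upper,
    s"Scores span [$observedMin, $observedMax], outside the configured bounds [$lower, $upper]")
  (lower, upper)
}
```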

@shaeselix shaeselix changed the title Adding LiftEvaluator and tests to op.evaluators for BinaryClassificationMetrics Tweaks to OpBinScoreEvaluator Mar 29, 2019
@shaeselix
Contributor Author

@leahmcguire @tovbinm okay, we figured out that, yes, this could all be expressed in the OpBinScoreEvaluator. Just made a few tweaks and updated tests. Also happy to contribute to the docs for this class, since it looks like they haven't been added yet.

@tovbinm
Collaborator

tovbinm commented Mar 29, 2019

@shaeselix please do!! ;)

@shaeselix
Contributor Author

@leahmcguire @tovbinm any issues here or can we merge?

@leahmcguire (Collaborator) left a comment:

Just change the name and then LGTM

* @param binCenters center of each bin
* @param numberOfDataPoints total number of data points in each bin
* @param sumOfLabels sum of the label in each bin
Collaborator:

or do you mean count? - then I request a name change :-)

Contributor Author:

It's meant to be the count of the "yes" labels, calculated as the sum of the labels. Does that make sense?

Collaborator:

that makes sense for the name, but then see my comment below. Why the sum of yes and not just the count of all? (That, combined with the average, gives you the same info in a more intuitive format.)

Contributor Author:

Isn't numberOfDataPoints the count of all?

Collaborator:

ah - good point :-) Is what you want really the sum of the labels or the count of the positive labels? I don't think it's possible with the current models, but say someone ran binary classification and fed in -1 and 1 instead of 0 and 1 - what would you want to see?

Contributor Author:

Yes, what I want is the count of the positive labels. This could be reconstructed by multiplying averageConversionRate by numberOfDataPoints, but I figured it would be cleaner and less subject to round-off error to have an explicit parameter for it.

I assumed your -1/1 example wasn't possible, but I can refactor the code to make sure it's only counting positive labels rather than just summing. Would numberOfPositiveLabels be an appropriate param name?
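
A quick illustration of the distinction under discussion, using the hypothetical -1/1 labeling (not something the current models produce):

```scala
// With 0/1 labels, summing labels and counting positives agree;
// with hypothetical -1/1 labels they diverge.
val zeroOneLabels = Seq(0.0, 1.0, 1.0, 0.0, 1.0)
val signedLabels  = Seq(-1.0, 1.0, 1.0, -1.0, 1.0)

def sumOfLabels(labels: Seq[Double]): Double = labels.sum
def numberOfPositiveLabels(labels: Seq[Double]): Long = labels.count(_ > 0.0).toLong

// zeroOneLabels: sum = 3.0, positives = 3  -> identical
// signedLabels:  sum = 1.0, positives = 3  -> counting positives is the robust choice
```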

Contributor Author:

Okay, fix is in! Thanks for the review!

binSize = diff / numOfBins,
binCenters = binCenters,
numberOfDataPoints = numberOfDataPoints,
sumOfLabels = sumOfLabels,
Collaborator:

why sum of labels, wouldn't count of labels be more useful?

@tovbinm (Collaborator) left a comment:

LGTM! Please add a few lines of docs on how to use the Brier score as a lift metric. Perhaps we should still include it somewhere here, e.g. Evaluators.BinaryClassification.lift()? https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/evaluators/Evaluators.scala#L40 @leahmcguire wdyt?

@leahmcguire (Collaborator) left a comment:

LGTM!

@leahmcguire leahmcguire merged commit 6bb63ba into salesforce:master Apr 3, 2019
@tovbinm
Collaborator

tovbinm commented Apr 3, 2019

@shaeselix are you planning to address this in a separate PR then?

@shaeselix
Contributor Author

@tovbinm I'm happy to add it in another PR. I would say that BrierScore != Lift, though. My understanding of lift as a singular metric is that it requires a threshold for a decision boundary. That could be a parameter, Evaluators.BinaryClassification.lift(threshold), or it could simply be assumed to be 0.5, as f1(), error(), etc. appear to assume.
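
For reference, the single-number lift at a threshold reduces to the positive rate among predicted positives divided by the overall positive rate; a minimal sketch in plain Scala (liftAtThreshold is an illustrative name, not an existing Evaluators method):

```scala
// Hypothetical lift at decision threshold t:
//   lift(t) = P(label = 1 | score >= t) / P(label = 1)
def liftAtThreshold(scoreAndLabels: Seq[(Double, Double)], threshold: Double = 0.5): Double = {
  val positives = scoreAndLabels.count { case (_, label) => label > 0.0 }
  require(positives > 0, "Lift is undefined when there are no positive labels")
  val overallRate = positives.toDouble / scoreAndLabels.size
  val predicted = scoreAndLabels.filter { case (score, _) => score >= threshold }
  if (predicted.isEmpty) 0.0
  else (predicted.count { case (_, label) => label > 0.0 }.toDouble / predicted.size) / overallRate
}
```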

@tovbinm tovbinm mentioned this pull request Apr 10, 2019
@tovbinm tovbinm mentioned this pull request Jul 11, 2019