
Descale feature contribution for Linear Regression & Logistic Regression #345

Merged
merged 43 commits into from
Jul 25, 2019

Conversation

TuanNguyen27
Collaborator

@TuanNguyen27 TuanNguyen27 commented Jun 25, 2019

Problem context
Spark returns feature contribution on the original scale of the feature, making it hard to compare relative importance between two features of different scales. During training for linear regression, Spark normalizes features and label to zero mean and unit variance, and we want to display feature contribution (which is the model's coefficient) on this normalized scale.

Note: Spark does the same standardization for logistic regression, but it is unclear if the same problem applies to feature contribution for logistic regression.

Describe the proposed solution
Descale feature contributions by multiplying Spark's returned weight by the respective feature's standard deviation and dividing by the label's standard deviation. This change only involves ModelInsights and does not affect scoring.
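The rescaling described above can be sketched as follows. This is an illustrative helper, not the actual ModelInsights code; the function name and the zero-guard on the label standard deviation are assumptions.

```scala
// Hypothetical sketch of the descaling: Spark reports the coefficient on the
// original feature scale, so we rescale it onto the standardized scale that
// Spark used internally during training, making contributions comparable
// across features of different scales.
object DescaleSketch {
  def descaleContribution(coeff: Double, featureStd: Double, labelStd: Double): Double =
    // guard against a constant label (std = 0); this fallback is an assumption
    if (labelStd == 0.0) 0.0 else coeff * featureStd / labelStd
}
```

For example, a coefficient of 2.0 on a feature with standard deviation 3.0 and a label with standard deviation 2.0 becomes a descaled contribution of 3.0.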

Describe alternatives you've considered
N/A. The descaling needs access to the best trained model, label summary stats & feature summary stats.

@codecov

codecov bot commented Jun 25, 2019

Codecov Report

Merging #345 into master will increase coverage by <.01%.
The diff coverage is 82.14%.


@@            Coverage Diff             @@
##           master     #345      +/-   ##
==========================================
+ Coverage   86.79%   86.79%   +<.01%     
==========================================
  Files         336      336              
  Lines       10895    10921      +26     
  Branches      347      570     +223     
==========================================
+ Hits         9456     9479      +23     
- Misses       1439     1442       +3
Impacted Files Coverage Δ
...c/main/scala/com/salesforce/op/ModelInsights.scala 91.54% <82.14%> (-1.14%) ⬇️
...ges/impl/classification/OpLogisticRegression.scala 60.71% <0%> (+3.57%) ⬆️
...op/stages/impl/regression/OpLinearRegression.scala 80.95% <0%> (+4.76%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tovbinm
Collaborator

tovbinm commented Jun 27, 2019

@TuanNguyen27 can you also please explain what problem you are solving in the PR description?

Collaborator

@tovbinm tovbinm left a comment


please add some more tests to cover new scenarios

case Continuous(_, _, _, variance) => math.sqrt(variance)
// for (binary) logistic regression we only need to multiply by feature standard deviation
case Discrete(domain, prob) =>
def computeVariance(domain: Seq[String], prob: Seq[Double]): Double = {
Collaborator


you don't need the inner function, and you can do it faster. Simply do:

case Discrete(domain, prob) =>
   val (weighted, sqweighted) = (domain zip prob).foldLeft((0.0, 0.0)) { case ((weightSum, sqweightSum), (d, p)) =>
         val floatD = d.toDouble
         val weight = floatD * p
         val sqweight = floatD * weight
         (weightSum + weight, sqweightSum + sqweight)
   }
   sqweighted - weighted * weighted // variance = E[X^2] - (E[X])^2
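A quick sanity check of the variance-from-summary idea discussed in this thread (E[X^2] - (E[X])^2); this is a self-contained illustration, not code from the PR:

```scala
object DiscreteVarianceCheck {
  // Variance of a discrete distribution given its domain (as strings, per the
  // Discrete summary type) and the probability of each value.
  def discreteVariance(domain: Seq[String], prob: Seq[Double]): Double = {
    val (weighted, sqweighted) = (domain zip prob).foldLeft((0.0, 0.0)) {
      case ((weightSum, sqweightSum), (d, p)) =>
        val floatD = d.toDouble
        (weightSum + floatD * p, sqweightSum + floatD * floatD * p)
    }
    sqweighted - weighted * weighted
  }
}
```

A fair coin on domain {"0", "1"} with probabilities (0.5, 0.5) should give variance 0.25.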

Collaborator


another question: since domain is Seq[String], how can we be sure that d.toDouble won't throw an error?! Should we handle it gracefully?

Collaborator Author


For regression problems where there are too few unique labels (so they get treated as categoricals), d.toDouble should work fine. I'm not sure how it will behave for classification. For classification I assume that domain = Array("0", "1"), is this correct?

Collaborator


Why did we decide to define domain as Seq[String] in the first place? @Jauntbox @leahmcguire

Collaborator


If you have the raw label rather than the indexed label you will get strings; this was designed to support that.

Collaborator Author


is it safe to do toDouble in this case?

Collaborator


Any value which is not numeric will throw an exception, e.g. "".toDouble
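One possible way to handle this gracefully, as the thread suggests (an illustrative sketch, not the PR's actual handling):

```scala
import scala.util.Try

object SafeParse {
  // Wrap the parse in Try so a non-numeric domain value (e.g. "") yields None
  // instead of throwing a NumberFormatException.
  def safeToDouble(s: String): Option[Double] = Try(s.toDouble).toOption
}
```

With this, safeToDouble("1.5") returns Some(1.5), while safeToDouble("") returns None, letting the caller decide whether to skip the value or fail with a clearer error.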

Collaborator Author

@TuanNguyen27 TuanNguyen27 Jul 9, 2019


but will there ever be a non-numeric string in the label field...? I thought they should be filtered out.

core/src/main/scala/com/salesforce/op/ModelInsights.scala Outdated Show resolved Hide resolved
core/src/main/scala/com/salesforce/op/ModelInsights.scala Outdated Show resolved Hide resolved
Collaborator

@leahmcguire leahmcguire left a comment


Just fix the error message and then LGTM

core/src/main/scala/com/salesforce/op/ModelInsights.scala Outdated Show resolved Hide resolved
@TuanNguyen27 TuanNguyen27 changed the title Descale feature contribution for Linear Regression Descale feature contribution for Linear Regression & Logistic Regression Jul 22, 2019
Contributor

@Jauntbox Jauntbox left a comment


One small comparison thing I'd like you to check, then LGTM.

Contributor

@Jauntbox Jauntbox left a comment


I don't want to hold this up, LGTM

@salesforce-cla

Thanks for the contribution! Before we can merge this, we need @tovbinm to sign the Salesforce.com Contributor License Agreement.

@salesforce-cla

Thanks for the contribution! It looks like @leahmcguire is an internal user so signing the CLA is not required. However, we need to confirm this.

@leahmcguire leahmcguire merged commit 82bb2c1 into master Jul 25, 2019
@leahmcguire leahmcguire deleted the tn/descaleLR branch July 25, 2019 17:39
@gerashegalov gerashegalov mentioned this pull request Sep 8, 2019
gerashegalov added a commit that referenced this pull request Sep 11, 2019
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandardScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:   
- Update tika version [#382](#382)
@salesforce-cla

salesforce-cla bot commented Dec 9, 2020

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
