Descale feature contribution for Linear Regression & Logistic Regression #345

Merged · 43 commits · Jul 25, 2019

Commits
fe8e6df
move PR to a branch on Tmog
TuanNguyen27 Jun 25, 2019
4ad6204
add condition to descale only when it is Linear / Logistic regression…
TuanNguyen27 Jun 25, 2019
1e2bc81
fix modelinsighttest
TuanNguyen27 Jun 25, 2019
5cbe132
only compute descaled contrib if a model is present & fits our criteria
TuanNguyen27 Jun 25, 2019
673dad8
fix test failure with a check for empty list of feature contribution
TuanNguyen27 Jun 25, 2019
4c44263
addressing comments
TuanNguyen27 Jun 27, 2019
d3fa01b
more comment addressing
TuanNguyen27 Jun 27, 2019
7a882f2
test in progress, still broken
TuanNguyen27 Jun 27, 2019
8ba36c2
Merge branch 'master' into tn/descaleLR
tovbinm Jun 27, 2019
e4da5ab
seems to be working
TuanNguyen27 Jun 27, 2019
0767e84
first version of test
TuanNguyen27 Jun 28, 2019
992db1a
Merge branch 'tn/descaleLR' of https://github.com/salesforce/Transmog…
TuanNguyen27 Jun 28, 2019
7eaa209
fix scala style
TuanNguyen27 Jun 28, 2019
2c5d2f7
Merge branch 'master' into tn/descaleLR
tovbinm Jul 2, 2019
99236f4
Merge branch 'master' into tn/descaleLR
TuanNguyen27 Jul 3, 2019
e1bb482
addressing comments
TuanNguyen27 Jul 8, 2019
00fb0d6
Merge branch 'master' into tn/descaleLR
TuanNguyen27 Jul 8, 2019
27a2449
Merge branch 'master' into tn/descaleLR
TuanNguyen27 Jul 9, 2019
cb798db
change log to warn
TuanNguyen27 Jul 9, 2019
3a541ed
Merge branch 'master' into tn/descaleLR
leahmcguire Jul 11, 2019
083448a
Merge branch 'master' into tn/descaleLR
leahmcguire Jul 11, 2019
606b6e1
fix an error in calculating standard deviation for discrete distribution
TuanNguyen27 Jul 11, 2019
1643635
Merge branch 'tn/descaleLR' of https://github.com/salesforce/Transmog…
TuanNguyen27 Jul 11, 2019
6ca53fc
Merge branch 'master' into tn/descaleLR
leahmcguire Jul 11, 2019
4c432e6
correctly pull out standard deviation for each feature
TuanNguyen27 Jul 11, 2019
4f252bd
Merge branch 'master' into tn/descaleLR
TuanNguyen27 Jul 12, 2019
4be8752
Merge branch 'master' into tn/descaleLR
TuanNguyen27 Jul 12, 2019
0e63491
descale entire contribution vector & clearly separate out between lin…
TuanNguyen27 Jul 12, 2019
e6de82b
Merge branch 'tn/descaleLR' of https://github.com/salesforce/Transmog…
TuanNguyen27 Jul 12, 2019
54d28a1
Merge branch 'master' into tn/descaleLR
TuanNguyen27 Jul 14, 2019
4677c96
fix scala idiom
TuanNguyen27 Jul 15, 2019
fa4221f
Merge branch 'tn/descaleLR' of https://github.com/salesforce/Transmog…
TuanNguyen27 Jul 15, 2019
a5901d8
remove logistic regression pattern matching
TuanNguyen27 Jul 15, 2019
90ff504
add citations for future readability
TuanNguyen27 Jul 15, 2019
a7dea4e
refactor & add test for binary logistic regression case
TuanNguyen27 Jul 18, 2019
35bdfe8
Merge branch 'master' into tn/descaleLR
TuanNguyen27 Jul 18, 2019
c6bae48
remove redundant import
TuanNguyen27 Jul 18, 2019
bc60187
fix scala style
TuanNguyen27 Jul 19, 2019
a6839b2
fix scala style again
TuanNguyen27 Jul 19, 2019
23b2443
Update warning message
TuanNguyen27 Jul 22, 2019
36c8420
update failure threshold so test will pass
TuanNguyen27 Jul 22, 2019
c80cc1a
update test to be ratio instead of absolute difference
TuanNguyen27 Jul 24, 2019
51627e9
small update to set tolerance threshold
TuanNguyen27 Jul 25, 2019
78 changes: 74 additions & 4 deletions core/src/main/scala/com/salesforce/op/ModelInsights.scala
@@ -484,10 +484,12 @@ case object ModelInsights {
s" to fill in model insights"
)

val labelSummary = getLabelSummary(label, checkerSummary)

ModelInsights(
label = getLabelSummary(label, checkerSummary),
label = labelSummary,
features = getFeatureInsights(vectorInput, checkerSummary, model, rawFeatures,
blacklistedFeatures, blacklistedMapKeys, rawFeatureFilterResults),
blacklistedFeatures, blacklistedMapKeys, rawFeatureFilterResults, labelSummary),
selectedModelInfo = getModelInfo(model),
trainingParams = trainingParams,
stageInfo = RawFeatureFilterConfig.toStageInfo(rawFeatureFilterResults.rawFeatureFilterConfig)
@@ -537,7 +539,8 @@ case object ModelInsights {
rawFeatures: Array[features.OPFeature],
blacklistedFeatures: Array[features.OPFeature],
blacklistedMapKeys: Map[String, Set[String]],
rawFeatureFilterResults: RawFeatureFilterResults = RawFeatureFilterResults()
rawFeatureFilterResults: RawFeatureFilterResults = RawFeatureFilterResults(),
label: LabelSummary
): Seq[FeatureInsights] = {
val featureInsights = (vectorInfo, summary) match {
case (Some(v), Some(s)) =>
@@ -557,6 +560,42 @@
case _ => None
}
val keptIndex = indexInToIndexKept.get(h.index)
val featureStd = math.sqrt(getIfExists(h.index, s.featuresStatistics.variance).getOrElse(1.0))
val sparkFtrContrib = keptIndex
.map(i => contributions.map(_.applyOrElse(i, (_: Int) => 0.0))).getOrElse(Seq.empty)
val defaultLabelStd = 1.0
val labelStd = label.distribution match {
case Some(Continuous(_, _, _, variance)) =>
if (variance == 0) {
log.warn("The standard deviation of the label is zero, " +
"so the coefficients and intercepts of the model will be zeros, training is not needed.\"")
defaultLabelStd
}
else math.sqrt(variance)
case Some(Discrete(domain, prob)) =>
// mean = sum (x_i * p_i)
val mean = (domain zip prob).foldLeft(0.0) {
case (weightSum, (d, p)) => weightSum + d.toDouble * p
}
// variance = sum (x_i - mu)^2 * p_i
val discreteVariance = (domain zip prob).foldLeft(0.0) {
case (sqweightSum, (d, p)) => sqweightSum + (d.toDouble - mean) * (d.toDouble - mean) * p
}
if (discreteVariance == 0) {
log.warn("The standard deviation of the label is zero, " +
"so the coefficients and intercepts of the model will be zeros, training is not needed.\"")
defaultLabelStd
}
else math.sqrt(discreteVariance)
case Some(_) => {
log.warn("Performing weight descaling on an unsupported distribution")
defaultLabelStd
}
case None => {
log.warn("Label does not exist, please check your data")
defaultLabelStd
}
}
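// Worked example of the discrete branch (illustrative numbers, not part of this change):
// a label taking values 0 and 1 with probabilities 0.75 and 0.25 has
// mean = 0 * 0.75 + 1 * 0.25 = 0.25 and
// variance = (0 - 0.25)^2 * 0.75 + (1 - 0.25)^2 * 0.25 = 0.1875, so labelStd = sqrt(0.1875) ≈ 0.433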

h.parentFeatureOrigins ->
Insights(
@@ -579,7 +618,8 @@
case _ => Map.empty[String, Double]
},
contribution =
keptIndex.map(i => contributions.map(_.applyOrElse(i, (_: Int) => 0.0))).getOrElse(Seq.empty),
descaleLRContrib(model, sparkFtrContrib, featureStd, labelStd).getOrElse(sparkFtrContrib),

min = getIfExists(h.index, s.featuresStatistics.min),
max = getIfExists(h.index, s.featuresStatistics.max),
mean = getIfExists(h.index, s.featuresStatistics.mean),
@@ -647,6 +687,36 @@
}
}

private[op] def descaleLRContrib(
model: Option[Model[_]],
sparkFtrContrib: Seq[Double],
featureStd: Double,
labelStd: Double): Option[Seq[Double]] = {
val stage = model.flatMap {
case m: SparkWrapperParams[_] => m.getSparkMlStage()
case _ => None
}
stage.collect {
case m: LogisticRegressionModel =>
if (m.getStandardization && sparkFtrContrib.nonEmpty) {
// scale entire feature contribution vector
// See https://think-lab.github.io/d/205/
// § 4.5.2 Standardized Interpretations, An Introduction to Categorical Data Analysis, Alan Agresti
sparkFtrContrib.map(_ * featureStd)
}
else sparkFtrContrib
case m: LinearRegressionModel =>
if (m.getStandardization && sparkFtrContrib.nonEmpty) {
// need to also divide by labelStd for linear regression
// See https://u.demog.berkeley.edu/~andrew/teaching/standard_coeff.pdf
// See https://en.wikipedia.org/wiki/Standardized_coefficient
sparkFtrContrib.map(_ * featureStd / labelStd)
}
else sparkFtrContrib
case _ => sparkFtrContrib
}
}
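// Illustrative check of the descaling above (hypothetical numbers, not part of this change):
// logistic regression: a raw coefficient b = 2.0 on a feature with featureStd = 3.0
// descales to b * featureStd = 6.0; linear regression additionally divides by the
// label's standard deviation, so with labelStd = 4.0 it descales to b * featureStd / labelStd = 1.5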

private[op] def getModelContributions
(model: Option[Model[_]], featureVectorSize: Option[Int] = None): Seq[Seq[Double]] = {
val stage = model.flatMap {
113 changes: 110 additions & 3 deletions core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala
@@ -31,7 +31,7 @@
package com.salesforce.op

import com.salesforce.op.features.types._
import com.salesforce.op.features.{Feature, FeatureDistributionType}
import com.salesforce.op.features.{Feature, FeatureDistributionType, FeatureLike}
import com.salesforce.op.filters._
import com.salesforce.op.stages.impl.classification._
import com.salesforce.op.stages.impl.preparators._
@@ -40,12 +40,15 @@ import com.salesforce.op.stages.impl.selector.ModelSelectorNames.EstimatorType
import com.salesforce.op.stages.impl.selector.SelectedModel
import com.salesforce.op.stages.impl.selector.ValidationType._
import com.salesforce.op.stages.impl.tuning.{DataCutter, DataSplitter}
import com.salesforce.op.test.PassengerSparkFixtureTest
import com.salesforce.op.test.{PassengerSparkFixtureTest, TestFeatureBuilder}
import com.salesforce.op.testkit.RandomReal
import com.salesforce.op.utils.spark.{OpVectorColumnMetadata, OpVectorMetadata}
import ml.dmlc.xgboost4j.scala.spark.OpXGBoostQuietLogging
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.junit.runner.RunWith
import com.salesforce.op.features.types.Real
import org.apache.spark.sql.DataFrame
import org.scalatest.FlatSpec
import org.scalatest.junit.JUnitRunner

Expand Down Expand Up @@ -95,6 +98,72 @@ class ModelInsightsTest extends FlatSpec with PassengerSparkFixtureTest with Dou
.setInput(label, features)
.getOutput()

val smallFeatureVariance = 10.0
val mediumFeatureVariance = 1.0
val bigFeatureVariance = 100.0
val smallNorm = RandomReal.normal[Real](0.0, smallFeatureVariance).limit(1000)
val mediumNorm = RandomReal.normal[Real](10, mediumFeatureVariance).limit(1000)
val bigNorm = RandomReal.normal[Real](10000.0, bigFeatureVariance).limit(1000)
val noise = RandomReal.normal[Real](0.0, 100.0).limit(1000)
// make a simple linear combination of the features (with noise), pass through sigmoid function and binarize
// to make labels for logistic reg toy data
def binarize(x: Double): Int = {
val sigmoid = 1.0 / (1.0 + math.exp(-x))
if (sigmoid > 0.5) 1 else 0
}
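// note: sigmoid(x) > 0.5 iff x > 0, so binarize is equivalent to thresholding the raw score at 0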
val logisticRegLabel = (smallNorm, mediumNorm, noise)
.zipped.map(_.toDouble.get * 10 + _.toDouble.get + _.toDouble.get).map(binarize(_)).map(RealNN(_))
// toy label for linear reg is a sum of two scaled Normals, hence we also know its standard deviation
val linearRegLabel = (smallNorm, bigNorm)
.zipped.map(_.toDouble.get * 5000 + _.toDouble.get).map(RealNN(_))
val labelStd = math.sqrt(5000 * 5000 * smallFeatureVariance + bigFeatureVariance)
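// for independent X and Y, Var(a * X + Y) = a^2 * Var(X) + Var(Y), which with a = 5000 gives the closed form above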

def twoFeatureDF(feature1: List[Real], feature2: List[Real], label: List[RealNN]):
(Feature[RealNN], FeatureLike[OPVector], DataFrame) = {
val generatedData = feature1.zip(feature2).zip(label).map {
case ((f1, f2), label) => (f1, f2, label)
}
val (rawDF, raw1, raw2, rawLabel) = TestFeatureBuilder("feature1", "feature2", "label", generatedData)
val labelData = rawLabel.copy(isResponse = true)
val featureVector = raw1
.vectorize(fillValue = 0, fillWithMean = true, trackNulls = false, others = Array(raw2))
val checkedFeatures = labelData.sanityCheck(featureVector, removeBadFeatures = false)
(labelData, checkedFeatures, rawDF)
}

val linRegDF = twoFeatureDF(smallNorm, bigNorm, linearRegLabel)
val logRegDF = twoFeatureDF(smallNorm, mediumNorm, logisticRegLabel)

val unstandardizedLinpred = new OpLinearRegression().setStandardization(false)
.setInput(linRegDF._1, linRegDF._2).getOutput()

val standardizedLinpred = new OpLinearRegression().setStandardization(true)
.setInput(linRegDF._1, linRegDF._2).getOutput()

val unstandardizedLogpred = new OpLogisticRegression().setStandardization(false)
.setInput(logRegDF._1, logRegDF._2).getOutput()

val standardizedLogpred = new OpLogisticRegression().setStandardization(true)
.setInput(logRegDF._1, logRegDF._2).getOutput()

def getFeatureImp(standardizedModel: FeatureLike[Prediction],
unstandardizedModel: FeatureLike[Prediction],
df: DataFrame): Array[Double] = {
lazy val workFlow = new OpWorkflow()
.setResultFeatures(standardizedModel, unstandardizedModel).setInputDataset(df)
lazy val model = workFlow.train()
val unstandardizedFtImp = model.modelInsights(unstandardizedModel)
.features.map(_.derivedFeatures.map(_.contribution))
val standardizedFtImp = model.modelInsights(standardizedModel)
.features.map(_.derivedFeatures.map(_.contribution))
val descaledsmallCoeff = standardizedFtImp.flatten.flatten.head
val originalsmallCoeff = unstandardizedFtImp.flatten.flatten.head
val descaledbigCoeff = standardizedFtImp.flatten.flatten.last
val originalbigCoeff = unstandardizedFtImp.flatten.flatten.last
Array(descaledsmallCoeff, originalsmallCoeff, descaledbigCoeff, originalbigCoeff)
}


val params = new OpParams()

lazy val workflow = new OpWorkflow().setResultFeatures(predLin, pred).setParameters(params).setReader(dataReader)
@@ -508,9 +577,11 @@ class ModelInsightsTest extends FlatSpec with PassengerSparkFixtureTest with Dou
}

it should "correctly extract the FeatureInsights from the sanity checker summary and vector metadata" in {
val labelSum = ModelInsights.getLabelSummary(Option(lbl), Option(summary))

val featureInsights = ModelInsights.getFeatureInsights(
Option(meta), Option(summary), None, Array(f1, f0), Array.empty, Map.empty[String, Set[String]],
RawFeatureFilterResults()
RawFeatureFilterResults(), labelSum
)
featureInsights.size shouldBe 2

@@ -651,4 +722,40 @@ class ModelInsightsTest extends FlatSpec with PassengerSparkFixtureTest with Dou
f.cramersV.isEmpty shouldBe true
}
}

it should "correctly return the descaled coefficient for linear regression, " +
"when standardization is on" in {

// The model trained on unstandardized data recovers the true coefficients (5000 and 1),
// and the standardized-coefficient formula gives the analytical values of their descaled
// counterparts, so the coefficients of the model trained on standardized data should
// lie within a small tolerance of those analytical values.

// getFeatureImp returns Array(descaledsmallCoeff, originalsmallCoeff, descaledbigCoeff, originalbigCoeff)
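// Analytical expectation for linear regression (see descaleLRContrib):
// descaledCoeff ≈ originalCoeff * featureStd / labelStd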
val coeffs = getFeatureImp(standardizedLinpred, unstandardizedLinpred, linRegDF._3)
val descaledsmallCoeff = coeffs(0)
val originalsmallCoeff = coeffs(1)
val descaledbigCoeff = coeffs(2)
val originalbigCoeff = coeffs(3)
// difference between the descaled coefficient and the analytical formula
val absError = math.abs(originalbigCoeff * math.sqrt(smallFeatureVariance) / labelStd - descaledbigCoeff)
val absError2 = math.abs(originalsmallCoeff * math.sqrt(bigFeatureVariance) / labelStd - descaledsmallCoeff)
absError < 0.01 shouldBe true
absError2 < 0.01 shouldBe true
}

it should "correctly return the descaled coefficient for logistic regression, " +
"when standardization is on" in {
val coeffs = getFeatureImp(standardizedLogpred, unstandardizedLogpred, logRegDF._3)
val descaledsmallCoeff = coeffs(0)
val originalsmallCoeff = coeffs(1)
val descaledbigCoeff = coeffs(2)
val originalbigCoeff = coeffs(3)
// difference between the descaled coefficient and the analytical formula
// (for logistic regression, descaledCoeff ≈ originalCoeff * featureStd)
val absError = math.abs(originalbigCoeff * math.sqrt(smallFeatureVariance) - descaledbigCoeff)
val absError2 = math.abs(originalsmallCoeff * math.sqrt(mediumFeatureVariance) - descaledsmallCoeff)
absError < 0.2 shouldBe true
absError2 < 0.01 shouldBe true
}
}