Outputting Raw Feature Filter information: Part 1 #237

Merged: 18 commits, Mar 26, 2019

Contributor

@clin-projects commented Mar 1, 2019

Related issues
I would like to pass the information obtained by RawFeatureFilter (RFF) into ModelInsights. Some of it is already passed through (via FeatureDistribution), but some critical information is not, e.g., the reason a feature was excluded by RFF.

Describe the proposed solution
Create a RawFeatureFilterResults case class that holds the RFF information, carries it through the workflow, is passed to ModelInsights, and can export its contents as JSON.

For further details and history of this PR, please refer to a previous PR on a forked branch: clin-projects#1

Update (March 20, 2019)

  • Replaced all direct references to rawFeatureDistributions (a workflow variable previously passed around through OpWorkflowCore) with RawFeatureFilterResults (which now contains rawFeatureDistributions)
  • Changed model reader / writer to include RawFeatureFilterResults; made backwards compatible

codecov bot commented Mar 1, 2019

Codecov Report

Merging #237 into master will decrease coverage by <.01%.
The diff coverage is 95.83%.

@@            Coverage Diff             @@
##           master     #237      +/-   ##
==========================================
- Coverage   86.56%   86.55%   -0.01%     
==========================================
  Files         314      314              
  Lines       10317    10297      -20     
  Branches      567      553      -14     
==========================================
- Hits         8931     8913      -18     
+ Misses       1386     1384       -2

Impacted Files                                           Coverage Δ
...a/com/salesforce/op/filters/RawFeatureFilter.scala    90.9% <100%> (ø) ⬆️
...com/salesforce/op/local/OpWorkflowModelLocal.scala    100% <100%> (+5.88%) ⬆️
...alesforce/op/filters/RawFeatureFilterResults.scala    100% <100%> (ø) ⬆️
...cala/com/salesforce/op/OpWorkflowModelReader.scala    87.03% <80%> (-2.25%) ⬇️
...e/op/stages/impl/selector/RandomParamBuilder.scala    94.44% <94.44%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6982218...aa12f34.

@clin-projects
Contributor Author

@leahmcguire @tovbinm @Jauntbox Ready for review

  • Edited description to reflect latest update

Contributor

@Jauntbox left a comment

Mostly minor comments, otherwise looks good.


object RawFeatureFilterResults {

  implicit val jsonFormats: Formats = DefaultFormats +

Contributor

Right now, we have a bunch of these living in OpPipelineStageReadWriteShared as an implicit Formats variable. Should we put this one there too, for consistency? Or move the ones in OpPipelineStageReadWriteShared to the classes they belong to?

Collaborator

put it there i think

Contributor Author

@leahmcguire @Jauntbox

OpPipelineStageReadWriteShared is in the features module, whereas the RawFeatureFilter files are all in core. If we decide to move the Formats variable to OpPipelineStageReadWriteShared, I think we may have to do some refactoring to avoid a circular reference, e.g., move RawFeatureFilter and all associated files into the features module.

Please let me know if you know of a way to do this without having to refactor, or how else you'd like to proceed (keep as is, move the Formats variable to their respective classes, etc.)

For reference, this is what is in OpPipelineStageReadWriteShared:

implicit val formats: Formats =
    DefaultFormats ++
      JodaTimeSerializers.all +
      EnumEntrySerializer.json4s[AnyValueTypes](AnyValueTypes) +
      EnumEntrySerializer.json4s[HashAlgorithm](HashAlgorithm) +
      EnumEntrySerializer.json4s[HashSpaceStrategy](HashSpaceStrategy) +
      EnumEntrySerializer.json4s[ScalingType](ScalingType) +
      EnumEntrySerializer.json4s[TimePeriod](TimePeriod) +
      EnumEntrySerializer.json4s[FeatureDistributionType](FeatureDistributionType) +
      new SpecialDoubleSerializer

Collaborator

Good point, but it turns out that we actually have all the serialization formats for model insights in the companion object, so let's move it there for consistency.

https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/ModelInsights.scala#L391

Contributor Author

@leahmcguire Thanks for pointing this example out!

The serialization formats for RawFeatureFilterResults are currently placed in a companion object, so I'm just wondering what you're envisioning beyond what's already implemented.

Couple possibilities below:

  • There are some Formats lines of OpPipelineStageReadWriteShared that correspond to an existing class (e.g., ScalingType). We can move these into companion objects

  • SerializationFormats in RawFeatureFilterResults currently extends from DefaultFormats but the one in ModelInsights does not. We can make RawFeatureFilterResults better resemble ModelInsights by explicitly defining what it needs from DefaultFormats?

Thanks!
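
As an illustration of the first possibility, here is a minimal sketch of a serializer that lives in the companion object of the type it describes. ScalingKind, Holder, and CompanionFormatsSketch are hypothetical stand-ins (the project's ScalingType is registered through EnumEntrySerializer, not shown here), and the block assumes json4s-jackson is on the classpath.

```scala
import org.json4s.{CustomSerializer, DefaultFormats, Formats}
import org.json4s.JsonAST.JString
import org.json4s.jackson.Serialization

// Hypothetical enum-like type standing in for something like ScalingType.
sealed trait ScalingKind
object ScalingKind {
  case object Linear extends ScalingKind
  case object Logarithmic extends ScalingKind

  // The serializer is defined next to the type it describes.
  val serializer = new CustomSerializer[ScalingKind](_ => (
    { case JString("Linear") => Linear
      case JString("Logarithmic") => Logarithmic },
    { case Linear => JString("Linear")
      case Logarithmic => JString("Logarithmic") }
  ))

  // Formats bundle exposed by the companion object.
  implicit val formats: Formats = DefaultFormats + serializer
}

// Hypothetical holder class, used only to exercise the serializer.
final case class Holder(kind: ScalingKind)

object CompanionFormatsSketch {
  import ScalingKind.formats // callers opt in explicitly

  // Writes the value to JSON and reads it back with the same formats.
  def roundTrip(h: Holder): Holder =
    Serialization.read[Holder](Serialization.write(h))
}
```

One design consequence: an implicit Formats placed in the companion of ScalingKind is not picked up automatically by json4s calls, because implicit lookup is keyed on the Formats type, so callers still need the explicit import.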

Collaborator

@leahmcguire @clin-projects Is RawFeatureFilterResults going to be serialized as a part of ModelInsights?

Contributor Author

@clin-projects Mar 26, 2019

Yes it is. RawFeatureFilterResults has four components:

  1. configuration <= will be incorporated in getStageInfo
  2. raw feature distributions <= currently serialized in FeatureInsights
  3. metrics <= will be serialized in FeatureInsights
  4. exclusion reasons <= currently serialized in FeatureInsights

* @param fillRatioDiffMismatch distribution mismatch: fill ratio difference exceeded max allowed
* @param excluded feature excluded after failing one or more tests
*/
case class ExclusionReasons

Contributor

I have a few suggestions for clearer names here:
trainingUnfilledState -> trainingFillRate
trainingNullLabelLeaker -> trainingNullLabelCorrelation
scoringUnfilledState -> scoringFillRate
jsDivergenceMismatch -> jsDivergence
fillRateDiffMismatch -> fillRateDifference
fillRatioDiffMismatch -> fillRatioDifference

Contributor Author

Unfortunately this is a tough situation, because the ExclusionReasons names need to be distinct from those in RawFeatureFilterMetrics (where I've already used many of the names you're suggesting!):

case class RawFeatureFilterMetrics
(
  name: String,
  trainingFillRate: Double,
  trainingNullLabelAbsoluteCorr: Option[Double],
  scoringFillRate: Option[Double],
  jsDivergence: Option[Double],
  fillRateDiff: Option[Double],
  fillRatioDiff: Option[Double]
) extends RawFeatureFilterMetricsLike

I thought about this for a long time. Part of the rationale for the name choices was that these variables refer to outcomes of a comparison test (Boolean), so I don't want to make it seem like they refer to continuous values.

Contributor

OK, I'm not sure I follow why the field names in RawFeatureFilterMetrics need to be different from the ones in ExclusionReasons. It should be clear from the type which case class one is referring to, no?

Your point about them being Booleans for test results makes sense though, so let's keep them as is.
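
To make the distinction in this thread concrete, the following is a rough sketch of how continuous metrics could be turned into Boolean test outcomes. The field names come from the comments above; the Thresholds class, its parameter names and default values, the ExclusionSketch object, and the comparison rules are all assumptions for illustration and are not taken from the PR.

```scala
// Continuous measurements per feature (field names from the thread above).
final case class RawFeatureFilterMetrics(
  name: String,
  trainingFillRate: Double,
  trainingNullLabelAbsoluteCorr: Option[Double],
  scoringFillRate: Option[Double],
  jsDivergence: Option[Double],
  fillRateDiff: Option[Double],
  fillRatioDiff: Option[Double]
)

// Boolean test outcomes per feature; the real class may carry more fields.
final case class ExclusionReasons(
  trainingUnfilledState: Boolean,
  trainingNullLabelLeaker: Boolean,
  scoringUnfilledState: Boolean,
  jsDivergenceMismatch: Boolean,
  fillRateDiffMismatch: Boolean,
  fillRatioDiffMismatch: Boolean,
  excluded: Boolean
)

// Hypothetical thresholds: names and values are assumptions, not project defaults.
final case class Thresholds(
  minFill: Double = 0.001,
  maxCorrelation: Double = 0.95,
  maxJSDivergence: Double = 0.90,
  maxFillDifference: Double = 0.90,
  maxFillRatioDiff: Double = 20.0
)

object ExclusionSketch {
  // Each flag is the outcome of one comparison; a missing metric never fails a test.
  def reasons(m: RawFeatureFilterMetrics, t: Thresholds): ExclusionReasons = {
    val trainingUnfilled = m.trainingFillRate < t.minFill
    val leaker = m.trainingNullLabelAbsoluteCorr.exists(_ > t.maxCorrelation)
    val scoringUnfilled = m.scoringFillRate.exists(_ < t.minFill)
    val jsMismatch = m.jsDivergence.exists(_ > t.maxJSDivergence)
    val rateMismatch = m.fillRateDiff.exists(_ > t.maxFillDifference)
    val ratioMismatch = m.fillRatioDiff.exists(_ > t.maxFillRatioDiff)
    val failed = Seq(trainingUnfilled, leaker, scoringUnfilled,
      jsMismatch, rateMismatch, ratioMismatch)
    ExclusionReasons(trainingUnfilled, leaker, scoringUnfilled,
      jsMismatch, rateMismatch, ratioMismatch, excluded = failed.contains(true))
  }
}
```

Seen this way, a name such as jsDivergenceMismatch records whether a limit was exceeded, while jsDivergence records the measured value, which is the argument for keeping the two sets of names apart.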

Collaborator

@tovbinm left a comment

Looks good (I think), though it's hard to tell now because the PR became quite large ;)

Collaborator

@leahmcguire left a comment

Please add a test that the exclusion reasons in ModelInsights serialize and deserialize correctly (see https://github.com/salesforce/TransmogrifAI/blob/master/core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala#L347), and then LGTM.
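
A round-trip check of the kind requested might look roughly like the sketch below. It is not the test that was added to ModelInsightsTest: it uses plain DefaultFormats and json4s-jackson (assumed to be on the classpath), a cut-down ExclusionReasons built from the field names in this thread, and a hypothetical ExclusionReasonsRoundTrip object.

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.Serialization

// Field names follow the review thread; the real class may carry more fields.
final case class ExclusionReasons(
  trainingUnfilledState: Boolean,
  trainingNullLabelLeaker: Boolean,
  scoringUnfilledState: Boolean,
  jsDivergenceMismatch: Boolean,
  fillRateDiffMismatch: Boolean,
  fillRatioDiffMismatch: Boolean,
  excluded: Boolean
)

object ExclusionReasonsRoundTrip {
  implicit val formats: Formats = DefaultFormats

  // Serialize a list of exclusion reasons to a JSON string.
  def toJson(reasons: Seq[ExclusionReasons]): String =
    Serialization.write(reasons)

  // Parse the JSON string back into case class instances.
  def fromJson(json: String): Seq[ExclusionReasons] =
    Serialization.read[Seq[ExclusionReasons]](json)

  def main(args: Array[String]): Unit = {
    val original = Seq(
      ExclusionReasons(false, false, false, true, false, false, excluded = true),
      ExclusionReasons(false, false, false, false, false, false, excluded = false)
    )
    // The decoded values should equal the originals field for field.
    assert(fromJson(toJson(original)) == original, "round trip should preserve every flag")
  }
}
```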

@clin-projects
Contributor Author

  • Created the serialize / deserialize test in ModelInsights based on Leah's comment
  • Added exception when Reader fails to parse rawFeatureDistribution based on Matthew's comment

@Jauntbox for review! thanks!!!!

4 participants