Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep raw feature distributions calculated in raw feature filter #76

Merged
merged 7 commits into from
Aug 22, 2018

Conversation

leahmcguire
Copy link
Collaborator

Related issues
Keeping feature distribution information calculated in raw feature filter for use in metrics logging and feature engineering

Describe the proposed solution
When raw feature filter is called it now returns the feature distribution information calculated on the training data. These distributions are kept in the workflow and added to the raw features so that they can be accessed later.

Describe alternatives you've considered
Could just write the information out directly from raw feature filter.

@leahmcguire leahmcguire changed the title Lm/keep dist Keep raw feature distributions calculated in raw feature filter Aug 21, 2018
/**
* Keeps the distribution information for features
*/
trait FeatureDistributionBase {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leahmcguire why do we need the trait? if we do, let's name it FeatureDistributionLike.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - It seemed nicer to make the trait than to move all of the classes in raw feature filter here. though I am willing to hear counter arguments :-P

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed the part that it's located in a separate project. Makes sense then.
Let's rename it then to match our convention and also lets replace vals in the trait with defs.

@codecov
Copy link

codecov bot commented Aug 21, 2018

Codecov Report

Merging #76 into master will decrease coverage by 0.01%.
The diff coverage is 80.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #76      +/-   ##
==========================================
- Coverage   85.97%   85.95%   -0.02%     
==========================================
  Files         294      294              
  Lines        9530     9547      +17     
  Branches      320      337      +17     
==========================================
+ Hits         8193     8206      +13     
- Misses       1337     1341       +4
Impacted Files Coverage Δ
...om/salesforce/op/filters/FeatureDistribution.scala 97.87% <ø> (ø) ⬆️
...scala/com/salesforce/op/features/FeatureLike.scala 41.93% <0%> (-0.46%) ⬇️
...ain/scala/com/salesforce/op/features/Feature.scala 33.33% <0%> (-6.67%) ⬇️
...cala/com/salesforce/op/OpWorkflowModelReader.scala 91.3% <100%> (+1.06%) ⬆️
...cala/com/salesforce/op/OpWorkflowModelWriter.scala 100% <100%> (ø) ⬆️
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 87.5% <100%> (+0.26%) ⬆️
.../main/scala/com/salesforce/op/OpWorkflowCore.scala 93.33% <100%> (+0.35%) ⬆️
...a/com/salesforce/op/filters/RawFeatureFilter.scala 93.26% <100%> (ø) ⬆️
...com/salesforce/op/features/FeatureJsonHelper.scala 0% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4a7db6f...8b5cf07. Read the comment docs.

@@ -108,7 +107,7 @@ class OpWorkflow(val uid: String = UID[OpWorkflow]) extends OpWorkflowCore {
* Will set the blacklisted features variable and if list is non-empty it will
* @param features list of features to blacklist
*/
private[op] def setBlacklist(features: Array[OPFeature]): Unit = {
private[op] def setBlacklist(features: Array[OPFeature], distributions: Seq[FeatureDistribution]): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update doc.

val name: String

/**
* key map key associated with distribution (when the feature is a map)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key map key -> map key

val nulls: Long

/**
* binned counts of feature values (hashed for strings, evently spaced bins for numerics)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

evently -> evenly

"with distributions" should {
"make a copy of the feature containing the specified distributions" in {
val distrib = new FeatureDistributionBase {
override val count: Long = 1L
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override is redundant.

@tovbinm
Copy link
Collaborator

tovbinm commented Aug 22, 2018

@leahmcguire I think we should also (de)serialize feature distributions with the model.

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@salesforce-cla
Copy link

salesforce-cla bot commented Dec 6, 2020

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants