Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust bin values for text features in RFF #99

Merged
merged 27 commits into from Sep 4, 2018
Merged

Conversation

sxd929
Copy link
Contributor

@sxd929 sxd929 commented Aug 28, 2018

Related issues
Not enough hashing space for RFF of text may reduce the js distance (weaken the signal)

Describe the proposed solution

  • Add sum and total count to summary
  • Allow user provided method to calculate the bin number based on data summary as parameter to calculate the distribution

Describe alternatives you've considered
Alternative method like Jaccard similarity, word and sentence embedding might be considered

Additional context
N/A

@codecov
Copy link

codecov bot commented Aug 28, 2018

Codecov Report

Merging #99 into master will decrease coverage by <.01%.
The diff coverage is 89.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
- Coverage   86.08%   86.07%   -0.01%     
==========================================
  Files         297      297              
  Lines        9659     9660       +1     
  Branches      336      429      +93     
==========================================
  Hits         8315     8315              
- Misses       1344     1345       +1
Impacted Files Coverage Δ
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 87.5% <ø> (ø) ⬆️
...main/scala/com/salesforce/op/filters/Summary.scala 100% <100%> (ø) ⬆️
...a/com/salesforce/op/filters/RawFeatureFilter.scala 92.23% <75%> (-1.04%) ⬇️
...a/com/salesforce/op/filters/PreparedFeatures.scala 78.37% <75%> (-1.11%) ⬇️
...om/salesforce/op/filters/FeatureDistribution.scala 98% <95%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 83baade...dfc1a40. Read the comment docs.

@tovbinm tovbinm changed the title Xs/adjust bin value for text Adjust bin values for text features in RFF Aug 28, 2018
val maxBins = MaxBins
val numBins = math.min(math.max(bins, sum.max/AvgBinValue), maxBins).toInt

val hasher: HashingTF = new HashingTF(numFeatures = numBins)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have to create hasher every time or perhaps we can create it once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the hashing dimension can be different for different features, the minimum number would be bins
but we can have a shared one with numFeatures = bins, and use that for every case if there is not too many tokens; OR we can create a couple shared hashers with different scales, and choose one based on the scale of token numbers
What do you think? I was assuming creating a hasher for every feature will not be very resource consuming

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let see if we can reuse the hashing function without creating HashingTF everytime.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I think we can. See HashingTF.transform and HashingTF object

case Left(seq) => {
val minBins = bins
val maxBins = MaxBins
val numBins = math.min(math.max(bins, sum.max/AvgBinValue), maxBins).toInt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sum.max / AvgBinValue

Copy link
Collaborator

@tovbinm tovbinm Aug 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of .toInt perhaps specify the explicit rounding instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed! thanks!!

case Left(seq) => {
val minBins = bins
val maxBins = MaxBins
val numBins = math.min(math.max(bins, sum.max / AvgBinValue), maxBins).floor
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we should make this more flexible

instead of defining this allow users to pass in a function for text bins eg:
val textBins: (min: Int, max: Int, sum: Long, count: Long) => Int
to the contructor

and add sum to the Summary info collected

@@ -142,6 +143,13 @@ case class FeatureDistribution
private[op] object FeatureDistribution {

val MaxBins = 100000
val AvgBinValue = 5000
val MaxTokenLowerLimit = 10
val getBins = (sum: Summary, bins: Int) => {
Copy link
Collaborator

@tovbinm tovbinm Aug 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add move this method to Summary class., ie.

private[op] case class Summary(min: Double, max: Double) {
   
   // TODO: docs
   def bins(bins: Int): Int = {
       if (sum.max < MaxTokenLowerLimit) bins
       else math.min(math.max(bins, sum.sum / AvgBinValue), MaxBins).toInt
   }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm, please replace Todo with TODO @sxd929

Shall we merge? @leahmcguire

* @param bins default bins of RFF
* @return
*/
def getBinsText(bins: Int): Int = bins
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user should be able to pass the formula for how to compute the text feature bin size into RawFeatureFilter as a parameter. We should provide a default which for now can be no formula but later can be updated

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Map("A" -> Summary(-1.6, 10.6), "B" -> Summary(0.0, 3.0)),
Map("B" -> Summary(0.0, 0.0)))
val summaries = Array(
Map("A" -> Summary(0.0, 2.0, 100.0, 10), "B" -> Summary(0.0, 5.0, 10.0, 10)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the max change for just the A key here?

Copy link
Contributor Author

@sxd929 sxd929 Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hardcoded, I changed it since a sample of "A" is Seq("male", "female"), number of hashes in the text is already 2, so the max should not be smaller than 2

@@ -107,15 +119,15 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
else d.distribution.length shouldBe 2
}
distribs(0).nulls shouldBe 0
distribs(0).summaryInfo should contain theSameElementsAs Array(0.0, 1.0)
distribs(0).summaryInfo should contain theSameElementsAs Array(0.0, 2.0, 100.0, 10.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, why did the max change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one reads the info from hardcoded 'summaries' above

textBinsFormula: (Summary, Int) => Int
): (Array[Double], Array[Double]) = values match {
case Left(seq) =>
val numBins = textBinsFormula(summary, bins)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would the formula need the bins?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we set is as def textBinsFormula(s: Summary, bins: Int) = bins.
I can image default bins can be returned in some cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would you recommend then? @leahmcguire

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bins is the default number (e.g. 100) there might be cases that we will just want to fall back to the defaults

@@ -514,7 +517,8 @@ class OpWorkflow(val uid: String = UID[OpWorkflow]) extends OpWorkflowCore {
maxJSDivergence: Double = 0.90,
maxCorrelation: Double = 0.95,
correlationType: CorrelationType = CorrelationType.Pearson,
protectedFeatures: Array[OPFeature] = Array.empty
protectedFeatures: Array[OPFeature] = Array.empty,
textBinsFormula: (Summary, Int) => Int = RawFeatureFilter.textBinsFormula
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the int?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the bins.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but that is not what we want the formula to look like long term

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what is your suggestion on the inputs / outputs then?

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please just add to comments what the bins are

@tovbinm tovbinm merged commit 2d7c275 into master Sep 4, 2018
@tovbinm tovbinm deleted the xs/adjustBinValueForText branch September 4, 2018 18:26
@salesforce-cla
Copy link

Thanks for the contribution! It looks like @sxd929 is an internal user so signing the CLA is not required. However, we need to confirm this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants