
Allow TextStats length distribution to be token-based and refactor for testability #464

Merged
merged 28 commits into master from km/token-lens3 on Mar 26, 2020

Conversation

Contributor

@Jauntbox Jauntbox commented Mar 5, 2020

Related issues
n/a

Describe the proposed solution
Tests did not catch that the token length distributions added to TextStats were actually entry length distributions. This PR refactors some of the functions in TextTokenizer, SmartTextVectorizer, and SmartTextMapVectorizer so that they are directly testable. It also adds more robust tests to check desired behavior of the TextStats object.
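
To make the distinction concrete, here is a minimal, hypothetical sketch (not the actual TextStats code) of the difference between the two distributions, using a whitespace split in place of the real tokenizer:

// Hypothetical illustration only: entry lengths vs token lengths for the same input.
def lengthDistribution(entries: Seq[String], tokenize: Boolean): Map[Int, Long] = {
  val units = if (tokenize) entries.flatMap(_.split("\\s+").toSeq) else entries
  units.groupBy(_.length).map { case (len, vs) => len -> vs.size.toLong }
}

lengthDistribution(Seq("hello world", "hi"), tokenize = false) // Map(11 -> 1, 2 -> 1): entry lengths
lengthDistribution(Seq("hello world", "hi"), tokenize = true)  // Map(5 -> 2, 2 -> 1): token lengths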

Describe alternatives you've considered
n/a

Additional context
n/a

@codecov

codecov bot commented Mar 5, 2020

Codecov Report

Merging #464 into master will increase coverage by 0.01% (86.98% → 86.99%).
The diff coverage is 92.30%.


@@           Coverage Diff           @@
##           master     #464   +/-   ##
=======================================
  Coverage   86.98%   86.99%           
=======================================
  Files         345      345           
  Lines       11575    11616   +41     
  Branches      376      376           
=======================================
+ Hits        10069    10105   +36     
- Misses       1506     1511    +5     
Impacted Files Coverage Δ
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala 67.64% <ø> (ø)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 81.94% <ø> (ø)
...a/com/salesforce/op/filters/RawFeatureFilter.scala 92.97% <ø> (ø)
...main/scala/com/salesforce/op/test/TestCommon.scala 40.90% <0.00%> (-9.10%) ⬇️
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.58% <96.15%> (-0.03%) ⬇️
...om/salesforce/op/filters/FeatureDistribution.scala 98.70% <100.00%> (+0.03%) ⬆️
...sification/BinaryClassificationModelSelector.scala 98.24% <100.00%> (ø)
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100.00% <100.00%> (ø)
...esforce/op/stages/impl/feature/TextTokenizer.scala 97.22% <100.00%> (+0.07%) ⬆️
...sforce/op/stages/OpPipelineStageReaderWriter.scala 87.50% <100.00%> (+0.40%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed4abfd...d78868d.

def tokenize(
  text: Text,
def tokenizeString(
  textString: String,
Collaborator
@tovbinm tovbinm commented Mar 6, 2020

This function can now explode with a NullPointerException if textString is null, which could not have happened before.

Contributor Author
@Jauntbox Jauntbox commented Mar 6, 2020

Hmmm, I'm not sure it's possible for the textString argument to be null in practice, though. When this function is used for tokenizing the map entries, a value that was originally null will simply not show up as an entry in the map. When it's used for tokenizing a normal Text entry, we should have already safely converted any nulls or missing elements into an Option[String], right?

The actual tokenize call during vectorization is still tokenize(v.toText) where v is the value in a text map. I'd actually argue that that should be changed to tokenizeString(v) to save time converting it to Text and back again.

I agree it's technically less safe, but I don't think it's necessary to have null checking at this point in the flow. We should make sure the data gets created in a safe way, which I think we already do. Are there some specific edge cases I'm missing?

Collaborator

I think the simplest fix is to add a null check to tokenizeString and return early.
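
For illustration, a minimal sketch of the null guard being suggested (hypothetical; the real tokenizeString in TextTokenizer has a different signature and uses a Lucene-based analyzer rather than a whitespace split):

// Hypothetical sketch only: short-circuit on null before tokenizing.
def tokenizeString(textString: String): Seq[String] = {
  if (textString == null) Seq.empty[String] // return early on null input
  else textString.toLowerCase.split("\\s+").toSeq
}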

Contributor Author
@Jauntbox Jauntbox commented Mar 6, 2020

Now that I think about it more, I'm pretty sure the old tokenize function would also throw an NPE if you fed it a sneaky null value. The SomeValue.unapply function explicitly calls v.isEmpty, which would also fail if v were null.

I put back the old tokenize function as oldTokenize and tried

val sneakyStringOpt: Option[String] = null
val myText = Text(sneakyStringOpt)
val res = TextTokenizer.oldTokenize(myText)

which did indeed throw an NPE.

We have tests all over the place (e.g. our vectorizer tests and FeatureTypeSparkConverterTest) that make sure we can handle null values in dataframes and safely convert them into our types. I'm not aware of any explicit null checks in our functions elsewhere, so it just feels weird to put one here.

@leahmcguire any opinions on this?

Collaborator
@tovbinm tovbinm commented Mar 7, 2020

SomeValue.unapply operates on value, which is an Option[String]. The null check is done during the construction of Text, when the values are extracted from the DataFrame / RDD, so a NullPointerException is indeed unlikely to be thrown.
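
Illustrative snippet of that construction path (using the Text constructor seen earlier in this thread; the import is the package Text lives in):

import com.salesforce.op.features.types._

// Raw values are lifted into Option when Text is constructed, so downstream
// code sees Some(...) or None rather than a raw null.
val present: Text = Text(Option("hello"))
val missing: Text = Text(None)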

Collaborator

The example you provided is not currently possible, and also not a fair one :)

Collaborator

@Jauntbox is this only called from the Option[String] version below? If so, make it private and it's fine.

Collaborator

In fact, please make them both private.

Collaborator
@TuanNguyen27 TuanNguyen27 left a comment

Some questions on length distribution and token filtering.

case Right(doubleSeq) => doubleSeq.map(_.toString)
}
stringVals.foldLeft(TextStats.empty)((acc, el) => acc + SmartTextVectorizer.computeTextStats(
  Option(el), shouldCleanText = false, maxCardinality = RawFeatureFilter.MaxCardinality)
Collaborator

Should this be shouldCleanText = true instead?

Contributor Author

I can change it, but I don't think it matters much here. These values aren't used in SmartTextVectorizer; they're the ones that show up in ModelInsights.

  .foldLeft(Map.empty[Int, Long])(
    (acc, el) => TextStats.additionHelper(acc, Map(el.length -> 1L), maxCardinality)
  )
val (valueCounts, lengthCounts) = text match {
Collaborator

When we reach RawFeatureFilter.MaxCardinality for valueCounts, will lengthCounts also stop accumulating?

Collaborator

nvm, this is taken care of by val newLengthCounts = additionHelper(l.lengthCounts, r.lengthCounts, maxCardinality), pls disregard this comment :D
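
For readers following along, a hypothetical simplification of the capping behavior being referenced (the real TextStats.additionHelper may differ in detail):

// Once either side already exceeds maxCardinality distinct keys, stop
// merging in new counts; otherwise merge the two count maps key by key.
def additionHelper(l: Map[Int, Long], r: Map[Int, Long], maxCardinality: Int): Map[Int, Long] = {
  if (l.size > maxCardinality) l
  else if (r.size > maxCardinality) r
  else r.foldLeft(l) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0L) + v)) }
}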

.getOrElse(Seq(lowerTxt))
.map { sentence =>
  val tokens = analyzer.analyze(sentence, language)
  tokens.filter(_.length >= minTokenLength).toTextList
Collaborator

Why are we only keeping tokens with length >= minTokenLength?

Contributor Author

This was existing behavior. It's a configurable parameter (defaulting to 1), so it's not required.
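
A quick illustration of that filter with made-up tokens:

val tokens = Seq("a", "to", "apple")
tokens.filter(_.length >= 1) // default minTokenLength = 1 keeps everything
tokens.filter(_.length >= 2) // a higher threshold drops single-character tokens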

Collaborator
@TuanNguyen27 TuanNguyen27 left a comment

lgtm

Contributor Author
@Jauntbox

Just a heads up on a few more commits - adding a toggle for tokenization in text lengths. It will cause problems with Chinese/Korean text based on our current tokenizers.

Contributor Author
@Jauntbox Jauntbox commented Mar 19, 2020

Ok - ready for a final look. Sorry for the last-minute refactoring, but I realized we needed this toggle exposed for experiments.

Final refactoring:

  • Removed the tokenizeStringOpt method since we can get by without it (should be more readable too)
  • Moved the methods that create TextStats instances out of the SmartTextVectorizer objects and into the TextStats object, since they make more sense there
  • Added a toggle to SmartTextVectorizer and SmartTextMapVectorizer to enable/disable tokenization when calculating length distributions in TextStats (see the usage sketch after this list)
  • Added tests to check that this toggle does what we want
  • Added logging of derived quantities and the vectorization method used in SmartTextVectorizer and SmartTextMapVectorizer
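
A usage sketch of the new toggle (the setter name comes from this PR's diff; the rest of the configuration, including the placeholder input feature, is assumed):

// Hypothetical usage only: enable token-based length distributions on the
// vectorizer; textFeature is a placeholder Text feature.
val vectorizer = new SmartTextVectorizer()
  .setInput(textFeature)
  .setTokenizeForLengths(true)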

  shouldCleanText = shouldCleanText,
  shouldTokenize = tokenizeForLengths,
  maxCardinality = RawFeatureFilter.MaxCardinality)
)
}

private def countStringValues[T](seq: Seq[T]): Map[String, Long] = {
Collaborator

Not relevant to this PR, but I think countStringValues is no longer used.

Contributor Author

ok, I can remove it then

@@ -169,7 +169,8 @@ private[filters] object PreparedFeatures {
case SomeValue(v: DenseVector) => Map((name, None) -> Right(v.toArray.toSeq))
case SomeValue(v: SparseVector) => Map((name, None) -> Right(v.indices.map(_.toDouble).toSeq))
case ft@SomeValue(_) => ft match {
case v: Text => Map((name, None) -> Left(v.value.toSeq.flatMap(tokenize)))
// case v: Text => Map((name, None) -> Left(v.value.toSeq.flatMap(tokenize)))
Collaborator

We are no longer tokenizing text during data prep?

Contributor Author

Whoops, that was for testing - forgot to take it out.

@TuanNguyen27
Collaborator

lgtm!

@@ -322,6 +328,8 @@ trait RichMapFeature {
.setHashSpaceStrategy(hashSpaceStrategy)
.setHashAlgorithm(hashAlgorithm)
.setBinaryFreq(binaryFreq)
.setTokenizeForLengths(tokenizeForLengths)
Collaborator

Can we make this an enum rather than a boolean? Then we have room to expand in the future.

Collaborator
@leahmcguire leahmcguire left a comment

Let's switch to an enum for the new flag to stem the proliferation of booleans, and then LGTM.
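
Something along these lines, for instance (a sketch of the requested enum; the name and values actually merged may differ):

// Hypothetical sealed-trait enum replacing the tokenizeForLengths boolean,
// leaving room for more length-computation strategies later.
sealed trait TextLengthType
object TextLengthType {
  case object FullEntry extends TextLengthType // length of the whole entry
  case object Tokens extends TextLengthType    // lengths of individual tokens
}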

@Jauntbox Jauntbox changed the title Make TextStats length distribution token-based and refactor for testability Allow TextStats length distribution to be token-based and refactor for testability Mar 26, 2020
@Jauntbox Jauntbox merged commit da52ad9 into master Mar 26, 2020
@Jauntbox Jauntbox deleted the km/token-lens3 branch March 26, 2020 17:16
@nicodv nicodv mentioned this pull request Jun 11, 2020