Incorporate name detection into SmartTextVectorizer #456

MWYang · 2020-01-14T18:53:20Z

Describe the proposed solution
Incorporates the changes in #445 and #457 into SmartTextVectorizer and SmartTextMapVectorizer.

Additional context
Merge #457 before merging this PR. Compare the diff between this PR and that one on my forked repo.

Changes from #455 need to be merged before this PR is ready.

codecov · 2020-01-24T05:14:47Z

Codecov Report

Merging #456 into master will decrease coverage by 12.27%.
The diff coverage is 84.05%.

@@             Coverage Diff             @@
##           master     #456       +/-   ##
===========================================
- Coverage      87%   74.72%   -12.28%     
===========================================
  Files         341      341               
  Lines       11485    11532       +47     
  Branches      378      597      +219     
===========================================
- Hits         9992     8617     -1375     
- Misses       1493     2915     +1422

| Impacted Files | Coverage Δ | |
|---|---|---|
| .../scala/com/salesforce/op/dsl/RichTextFeature.scala | `72.22% <ø> (-9.73%)` | ⬇️ |
| ...main/scala/com/salesforce/op/test/TestCommon.scala | `40.9% <0%> (-9.1%)` | ⬇️ |
| ...e/op/stages/impl/feature/SmartTextVectorizer.scala | `95.79% <100%> (+0.18%)` | ⬆️ |
| ...m/salesforce/op/utils/stages/NameDetectUtils.scala | `86.11% <100%> (-1.94%)` | ⬇️ |
| ...s/impl/feature/OPCollectionHashingVectorizer.scala | `93.87% <66.66%> (-2.68%)` | ⬇️ |
| ...p/stages/impl/feature/SmartTextMapVectorizer.scala | `93.33% <73.91%> (-6.67%)` | ⬇️ |
| ...sforce/op/stages/base/binary/BinaryEstimator.scala | `0% <0%> (-100%)` | ⬇️ |
| ...la/com/salesforce/op/aggregators/Geolocation.scala | `0% <0%> (-100%)` | ⬇️ |
| .../salesforce/op/aggregators/FeatureAggregator.scala | `0% <0%> (-100%)` | ⬇️ |
| ...stages/base/sequence/BinarySequenceEstimator.scala | `0% <0%> (-100%)` | ⬇️ |

... and 98 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8c0f67b...9344e5f.

Jauntbox · 2020-01-29T23:22:00Z

core/src/main/scala/com/salesforce/op/stages/impl/feature/SmartTextVectorizer.scala

+        dataset.map(_.map(computeTextStats(_, shouldCleanText)).toArray).reduce(_ + _),
+        Array.fill[NameDetectStats](inN.length)(NameDetectStats.empty)
+      )
+    } else {


Can you make the formatting of both if branches look the same since they're doing nearly the same thing?
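
One way to address this kind of comment is to move the shared map-and-reduce into a small helper, so that each branch becomes a single call of the same shape. The following is only a sketch under stated assumptions: the other branch is not shown in this thread, so its contents are guessed; `Stats`, `combine`, `reduceColumns`, `compute`, and the `detectNames` flag are hypothetical names, and plain Scala collections stand in for the Spark `Dataset` used in the PR.

```scala
// Hypothetical stand-in for per-column statistics; not the PR's NameDetectStats.
final case class Stats(count: Long, totalLength: Long) {
  def +(that: Stats): Stats = Stats(count + that.count, totalLength + that.totalLength)
}

object Stats {
  val empty: Stats = Stats(0L, 0L)
  def of(text: String): Stats = Stats(1L, text.length.toLong)
}

object BranchSketch {
  // Element-wise sum of two equally sized arrays of stats.
  def combine(a: Array[Stats], b: Array[Stats]): Array[Stats] =
    a.zip(b).map { case (x, y) => x + y }

  // Shared helper: one Stats value per column, summed over all rows.
  def reduceColumns(rows: Seq[Seq[String]], numColumns: Int): Array[Stats] =
    rows
      .map(_.map(Stats.of).toArray)
      .foldLeft(Array.fill(numColumns)(Stats.empty))(combine)

  // Both branches now have the same layout: a pair of arrays, one call per slot.
  def compute(
    rows: Seq[Seq[String]],
    numColumns: Int,
    detectNames: Boolean // assumed flag; the real condition is not shown in the thread
  ): (Array[Stats], Array[Stats]) =
    if (detectNames) {
      (reduceColumns(rows, numColumns), reduceColumns(rows, numColumns))
    } else {
      (reduceColumns(rows, numColumns), Array.fill(numColumns)(Stats.empty))
    }
}
```

Starting the fold from an array of empty values also avoids the exception that `reduce` throws on an empty collection, though whether that matters for the PR's code depends on context not visible here.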

Jauntbox · 2020-01-29T23:31:32Z

core/src/main/scala/com/salesforce/op/utils/stages/NameDetectUtils.scala

+      // In which case create SensitiveFeatureInformation for all features
+      case ((feature: String, key: Option[String]), stats: NameDetectStats)
+        if log.isDebugEnabled || computeTreatAsName(stats) =>
+        val N = stats.dictCheckResult.count.toDouble


more descriptive name, please

MWYang added 30 commits December 5, 2019 15:54

Re-added unary estimator code and started porting logic to Algebird monoids instead of custom accumulators

2acf3fc

Re-added JRC name dictionary and cleaned up names of methods

b55c31e

Fixed bug with AveragedValue computation; Trying to debug current Algebird problem

f443952

Fixed wrong inequality direction in guard checks

557ef39

Added HLL back to monoid accumulator and code now compiles correctly; Still need to fix HLL serialization in Spark issue

b6728ec

Fixed HLL in NameDetectStats not serializing correctly; Now need to fix printing bug for NameDetectStats

ff1b2ef

Fixed NameDetectStats printing

33b77c9

Fixed guard stat calculation computing moments of number of tokens instead of moments of text length; Still need to fix no moments higher than the 1st being calculated

4caec75

Fixed moments calculation and fixed divide by zero error when list of tokens is empty

b60dc4a

Added gender identification code transforming; All previous tests now pass

469111b

Undid SparkUtils changes, which are no longer necessary

e5f169e

Renamed class names to be more consistent + small fixes

b701612

Added honorific detection

8342dae

Implemented RegEx checking for gender

2e0e85a

Implemented mixed gender identification strategies

b079a27

Removed TODOs and extraneous functions in preparation for PR

a1197a7

Updated documentation

19bad0b

Ignore null values in detecting names

3172bde

Added flag for ignoring nulls

0d82eef

Added sir/madam to list of honorifics

7812d12

Merge branch 'master' into my/unary-detect-names

345508f

Fixed typo when adding sir/madam to list of honorifics that caused failing test

a80e382

Fixed failing test due to divide by zero NA on some inputs

f7817d9

Cleaned up redundant import in tests

cf3eff0

Added failing tests for STV

eaaa23a

Made small changes based on PR comments (updated inline comment and import statement)

b928a49

Created metadata case class per PR review; Added tests for metadata; Cleaned up test code

048c084

Added test for name threshold

5d30e79

Updated comment about NameDetectStats.toJson

78c321c

Added tests for new NameStats feature type

a597d21

MWYang marked this pull request as ready for review January 22, 2020 00:24

MWYang requested review from gerashegalov, Jauntbox, leahmcguire, tovbinm and wsuchy as code owners January 22, 2020 00:24

MWYang added 5 commits January 23, 2020 19:56

Removed enum from SensitiveFeatureInformation per PR comments

a8504da

Using case class for GenderDetectionStrategy information

eac0e05

Cleaning up tests per PR comments

26f30fb

Merge branch 'my/sensitive-metadata' into my/stv-detect-names

f66d896

Made fixes for metadata changes

9a3f5af

MWYang closed this Jan 24, 2020

MWYang reopened this Jan 24, 2020

MWYang and others added 3 commits January 24, 2020 13:47

Merge branch 'master' into my/sensitive-metadata

bd7b90d

Merge branch 'my/sensitive-metadata' into my/stv-detect-names

5c37c26

Merge branch 'master' into my/stv-detect-names

9344e5f

MWYang closed this Jan 29, 2020

MWYang reopened this Jan 29, 2020

MWYang added enhancement ready for review and removed DO NOT MERGE labels Jan 29, 2020

Jauntbox reviewed Jan 29, 2020

Jauntbox mentioned this pull request Sep 3, 2020

Incorporate name detection into SmartTextVectorizer #508

Merged

Jauntbox added DO NOT MERGE duplicate and removed ready for review DO NOT MERGE labels Sep 11, 2020

leahmcguire closed this Oct 6, 2020
