Detecting names in text fields #440

MWYang · 2019-11-21T00:42:06Z

SmartTextVectorizer now has an optional flag, detectSensitive, that guesses, using a combination of dictionary lookup and conditional logic, whether any of the input columns contain names (which we don't want in our models because they can introduce bias). For now, it only logs a warning to the console that the input fields may contain names. In the future, the removeSensitive flag will exclude those columns from the output vector, and the gender information extracted from name columns (using government data) will be used to check model fairness.

A unary estimator HumanNameIdentifier is also included as a standalone drop-in for custom workflows.
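
To make the "dictionary lookup plus conditional logic" idea concrete, here is a minimal standalone sketch in plain Scala. It is not the code from this PR: the object name, the tiny dictionary, the first-token heuristic, and the 0.5 threshold are all assumptions chosen for illustration, and it deliberately avoids Spark and TransmogrifAI types. It does mirror one behavior mentioned in the commits, namely that null values are ignored rather than counted against the column.

```scala
// Hypothetical sketch: none of these names come from the PR's actual code.
object NameDetectionSketch {
  // Tiny stand-in for a real name dictionary (assumption for illustration).
  val nameDictionary: Set[String] = Set("alice", "bob", "maria", "wei")

  // Fraction of non-null, non-blank values whose first token is in the dictionary.
  // Missing values (None) are dropped instead of being counted as non-matches.
  def nameFraction(column: Seq[Option[String]]): Double = {
    val present = column.flatten.map(_.trim).filter(_.nonEmpty)
    if (present.isEmpty) 0.0
    else {
      val hits = present.count { value =>
        val firstToken = value.toLowerCase.split("\\s+").head
        nameDictionary.contains(firstToken)
      }
      hits.toDouble / present.size
    }
  }

  // Conditional logic: flag the column when enough values look like names.
  // The default threshold is arbitrary and would need tuning on real data.
  def looksLikeNames(column: Seq[Option[String]], threshold: Double = 0.5): Boolean =
    nameFraction(column) >= threshold
}
```

With this sketch, a column such as `Seq(Some("Alice Smith"), None, Some("Bob Jones"))` is flagged, because both non-null values start with a dictionary entry, while a column of color names is not. A caller could then log a warning for flagged columns, which is the only action the description above says is taken for now.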

Additional context
I completed this work as part of my ongoing Salesforce internship. This PR cleans up and replaces #428.


codecov · 2019-11-21T01:01:00Z

Codecov Report

Merging #440 into master will decrease coverage by 3.76%.
The diff coverage is 18.62%.

@@            Coverage Diff             @@
##           master     #440      +/-   ##
==========================================
- Coverage   86.93%   83.17%   -3.77%     
==========================================
  Files         337      339       +2     
  Lines       11096    11296     +200     
  Branches      362      597     +235     
==========================================
- Hits         9646     9395     -251     
- Misses       1450     1901     +451

Impacted Files	Coverage Δ
...orce/op/utils/stages/NameIdentificationUtils.scala	`0% <0%> (ø)`
...scala/com/salesforce/op/utils/text/TextUtils.scala	`42.85% <0%> (-57.15%)`	⬇️
.../scala/com/salesforce/op/dsl/RichTextFeature.scala	`69.44% <0%> (-13.66%)`	⬇️
.../scala/com/salesforce/op/features/types/Maps.scala	`77.77% <0%> (-15%)`	⬇️
...n/scala/com/salesforce/op/testkit/RandomText.scala	`98.41% <0%> (-1.59%)`	⬇️
...e/op/stages/impl/feature/HumanNameIdentifier.scala	`0% <0%> (ø)`
...com/salesforce/op/features/FeatureSparkTypes.scala	`99.14% <100%> (ø)`	⬆️
...sforce/op/features/types/FeatureTypeDefaults.scala	`96.15% <100%> (+0.03%)`	⬆️
...e/op/stages/impl/feature/SmartTextVectorizer.scala	`58.82% <26.92%> (-40.03%)`	⬇️
...esforce/op/features/types/FeatureTypeFactory.scala	`98.27% <50%> (-0.85%)`	⬇️
... and 41 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9778481...01b7205.


MWYang · 2019-12-05T18:52:11Z

Closing to rework this based on reviewer comments.

MWYang added 3 commits November 20, 2019 15:02

Added util functions for working with RegEx and extracting values from Spark dataframes

ef390ac

Added unary estimator for detecting names in Text features and making a guess as to the gender of the name

182d9df

Updated SmartTextVectorizer to (optionally) also do name detection and log results to console if names were detected

cf175cf

MWYang added enhancement ready for review labels Nov 21, 2019

MWYang requested review from gerashegalov, Jauntbox, leahmcguire, tovbinm and wsuchy as code owners November 21, 2019 00:42

MWYang mentioned this pull request Nov 21, 2019

Detecting names in text fields (deprecated) #428

Closed

Updated name of accumulator for name detection in SmartTextVectorizer to be more informative; Fixed detection counting null values when it should ignore them

0922623

MWYang mentioned this pull request Nov 21, 2019

Remove names from SmartTextVectorizer; Add metadata features for sensitive fields #437

Closed

MWYang and others added 3 commits November 25, 2019 15:39

Removed US cities and states from JRC name data

58e91d4

Removed global countries from JRC name data

6a6e985

Merge branch 'master' into my/detect-names

01b7205

MWYang closed this Dec 5, 2019

MWYang mentioned this pull request Dec 9, 2019

Unary estimator for detecting names and transforming to gender #445

Merged

MWYang deleted the my/detect-names branch January 14, 2020 19:19
