Detecting names in text fields #440

Closed
wants to merge 7 commits into from

MWYang (Contributor) commented Nov 21, 2019

SmartTextVectorizer now has an optional flag, detectSensitive, that guesses whether any of the input columns contain names, using a combination of dictionary lookup and conditional logic. We don't want names in our models because they can introduce bias. For now, detection only logs a warning to the console that the input fields may contain names. In the future, the removeSensitive flag will exclude those columns from the output vector, and the gender information extracted from name columns (using government data) will be used to check for model fairness.
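
As a rough illustration of the dictionary-lookup idea described above, the following is a minimal sketch in plain Scala. It is not the code in this PR: the object name `NameColumnSketch`, the tiny dictionary, the first-token heuristic, and the 0.5 threshold are all assumptions made for the example. It ignores null (`None`) values when computing the fraction, in line with the fix mentioned in a later commit message on this PR.

```scala
// Hypothetical sketch only; names, dictionary, and threshold are assumptions,
// not the implementation in this pull request.
object NameColumnSketch {
  // Tiny stand-in dictionary; the real dictionary contents are not shown in the PR text.
  val nameDictionary: Set[String] = Set("alice", "bob", "maria", "wei", "john")

  // Assumed cut-off: flag a column when at least half of its non-null values look like names.
  val defaultThreshold: Double = 0.5

  /** True when the first whitespace-separated token of the value is in the dictionary. */
  def looksLikeName(value: String): Boolean =
    value.trim.toLowerCase.split("\\s+").headOption.exists(nameDictionary.contains)

  /** Fraction of non-null values that look like names; None values are ignored. */
  def nameFraction(column: Seq[Option[String]]): Double = {
    val present = column.flatten
    if (present.isEmpty) 0.0
    else present.count(looksLikeName).toDouble / present.size
  }

  /** True when the name fraction reaches the (assumed) threshold. */
  def isLikelyNameColumn(
    column: Seq[Option[String]],
    threshold: Double = defaultThreshold
  ): Boolean = nameFraction(column) >= threshold

  def main(args: Array[String]): Unit = {
    val contacts = Seq(Some("Alice Smith"), None, Some("Bob Jones"), Some("Acme Corp"))
    // Mirrors the "log a warning" behaviour described in the PR, using println for simplicity.
    if (isLikelyNameColumn(contacts)) println("Warning: column may contain names")
  }
}
```

A per-column fraction with a threshold is only one plausible way to combine lookup with conditional logic; the actual estimator may weigh tokens or columns differently.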

A unary estimator HumanNameIdentifier is also included as a standalone drop-in for custom workflows.

Additional context
I completed this work as part of my ongoing Salesforce internship. This PR cleans up and replaces #428.

codecov bot commented Nov 21, 2019

Codecov Report

Merging #440 into master will decrease coverage by 3.76%.
The diff coverage is 18.62%.

@@            Coverage Diff             @@
##           master     #440      +/-   ##
==========================================
- Coverage   86.93%   83.17%   -3.77%     
==========================================
  Files         337      339       +2     
  Lines       11096    11296     +200     
  Branches      362      597     +235     
==========================================
- Hits         9646     9395     -251     
- Misses       1450     1901     +451

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...orce/op/utils/stages/NameIdentificationUtils.scala | 0% <0%> (ø) | |
| ...scala/com/salesforce/op/utils/text/TextUtils.scala | 42.85% <0%> (-57.15%) | ⬇️ |
| .../scala/com/salesforce/op/dsl/RichTextFeature.scala | 69.44% <0%> (-13.66%) | ⬇️ |
| .../scala/com/salesforce/op/features/types/Maps.scala | 77.77% <0%> (-15%) | ⬇️ |
| ...n/scala/com/salesforce/op/testkit/RandomText.scala | 98.41% <0%> (-1.59%) | ⬇️ |
| ...e/op/stages/impl/feature/HumanNameIdentifier.scala | 0% <0%> (ø) | |
| ...com/salesforce/op/features/FeatureSparkTypes.scala | 99.14% <100%> (ø) | ⬆️ |
| ...sforce/op/features/types/FeatureTypeDefaults.scala | 96.15% <100%> (+0.03%) | ⬆️ |
| ...e/op/stages/impl/feature/SmartTextVectorizer.scala | 58.82% <26.92%> (-40.03%) | ⬇️ |
| ...esforce/op/features/types/FeatureTypeFactory.scala | 98.27% <50%> (-0.85%) | ⬇️ |
| ... and 41 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9778481...01b7205.

… to be more informative; Fixed detection counting null values when it should ignore them

MWYang (Contributor, Author) commented Dec 5, 2019

Closing to rework this based on the reviewers' comments.
