Detecting names in text fields (deprecated) #428
…h NameList" This reverts commit b42469a.
…ing inference for full names yet
…urns empty map (i.e. null) when not a name
Thanks for the contribution! It looks like @MWYang is an internal user so signing the CLA is not required. However, we need to confirm this.
Codecov Report
@@            Coverage Diff             @@
##           master     #428      +/-   ##
==========================================
- Coverage   86.93%   82.07%   -4.87%
==========================================
  Files         337      340       +3
  Lines       11100    11375     +275
  Branches      366      376      +10
==========================================
- Hits         9650     9336     -314
- Misses       1450     2039     +589
@MWYang thanks for the contribution. I would appreciate some context about the proposed changes in the PR description.
…mogrifAI into my/detect-sensitive
…resholding for name identifying
…ed guard check functionality; Added some typing back to name identification
The PR includes sizable resource files. Can these be downloaded from outside the repo and cached by our build?
core/src/main/scala/com/salesforce/op/dsl/RichTextFeature.scala
@MWYang why do we need a new |
You're right, there's nothing special about Should I revert the changes for creating the new |
I'm closing this because I opened #440, which reduces the PR size by removing a different feature that I didn't mean to commit to this branch and by removing the extraneous |
Thank you @MWYang, I will have a look.
SmartTextVectorizer now has an optional flag detectSensitive that will guess, using a combination of dictionary lookup and conditional logic, whether any of the input columns are names (which we don't want in our models in case of bias). For now, only a warning is logged to the console that there may be such names in the input fields. In the future, the removeSensitive flag will remove those columns from contributing to the output vector. Also in the future, the gender information that is extracted from name columns (using government data) will be used to check for model fairness.

A unary estimator HumanNameIdentifier is also included as a standalone drop-in for custom workflows.

Additional context
I completed this work as part of my ongoing Salesforce internship.
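
For readers unfamiliar with the approach, the "dictionary lookup plus thresholding" idea described for detectSensitive can be sketched roughly as follows. This is a minimal Python illustration, not the PR's Scala implementation: the dictionary contents, the function names (name_token_fraction, looks_like_name_column), and the 0.5 threshold are all assumptions made up for this example.

```python
import re

# Hypothetical, tiny stand-in for a name dictionary; a real one would be
# loaded from a much larger resource file.
NAME_DICTIONARY = frozenset(
    {"alice", "bob", "carol", "david", "maria", "smith", "jones", "garcia"}
)

# Split on anything that is not a letter or apostrophe.
TOKEN_PATTERN = re.compile(r"[A-Za-z']+")


def name_token_fraction(value):
    """Return the fraction of tokens in `value` found in the name dictionary."""
    tokens = [t.lower() for t in TOKEN_PATTERN.findall(value)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in NAME_DICTIONARY)
    return hits / len(tokens)


def looks_like_name_column(values, threshold=0.5):
    """Guess whether a text column holds names.

    The column is flagged when the mean per-row fraction of dictionary
    hits reaches `threshold` (an assumed cutoff, not taken from the PR).
    Empty and missing rows are ignored.
    """
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return False
    mean_fraction = sum(name_token_fraction(v) for v in non_empty) / len(non_empty)
    return mean_fraction >= threshold
```

A caller might run looks_like_name_column over each text column and log a warning for any column that is flagged, which mirrors the warning-only behavior described above. Averaging per-row fractions, rather than requiring every row to match, keeps the guess tolerant of a few unusual entries.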