Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detecting names in text fields (deprecated) #428

Closed
wants to merge 46 commits into from

Conversation

MWYang
Copy link
Contributor

@MWYang MWYang commented Oct 18, 2019

SmartTextVectorizer now has an optional flag detectSensitive that will guess, using a combination of dictionary lookup and conditional logic, whether any of the input columns are names (which we don't want in our models in case of bias). For right now, just a warning is logged to console that there may be such names in the input fields. In the future, the removeSensitive flag will remove those columns from contributing to the output vector. Also in the future, the gender information that is extracted from name columns (using government data) will be used to check for model fairness.

A unary estimator HumanNameIdentifier is also included as a standalone drop-in for custom workflows.

Additional context
I completed this work as part of my ongoing Salesforce internship.

@salesforce-cla
Copy link

Thanks for the contribution! It looks like @MWYang is an internal user so signing the CLA is not required. However, we need to confirm this.

@codecov
Copy link

codecov bot commented Oct 18, 2019

Codecov Report

Merging #428 into master will decrease coverage by 4.86%.
The diff coverage is 78.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #428      +/-   ##
==========================================
- Coverage   86.93%   82.07%   -4.87%     
==========================================
  Files         337      340       +3     
  Lines       11100    11375     +275     
  Branches      366      376      +10     
==========================================
- Hits         9650     9336     -314     
- Misses       1450     2039     +589
Impacted Files Coverage Δ
...scala/com/salesforce/op/utils/text/TextUtils.scala 42.85% <0%> (-57.15%) ⬇️
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 68.49% <0%> (-14.61%) ⬇️
.../scala/com/salesforce/op/features/types/Maps.scala 77.77% <0%> (-15%) ⬇️
.../main/scala/org/apache/spark/util/SparkUtils.scala 0% <0%> (ø) ⬆️
...ala/com/salesforce/op/features/types/package.scala 44.59% <0%> (-13.34%) ⬇️
...n/scala/com/salesforce/op/testkit/RandomText.scala 98.41% <0%> (-1.59%) ⬇️
.../scala/com/salesforce/op/features/types/Text.scala 84% <0%> (-10.37%) ⬇️
...sforce/op/stages/impl/feature/Transmogrifier.scala 96.62% <100%> (-1.4%) ⬇️
...com/salesforce/op/features/FeatureSparkTypes.scala 99.14% <100%> (+0.01%) ⬆️
...sforce/op/features/types/FeatureTypeDefaults.scala 96.19% <100%> (+0.07%) ⬆️
... and 64 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ccc1501...9ecad2f. Read the comment docs.

@tovbinm
Copy link
Collaborator

tovbinm commented Oct 19, 2019

@MWYang thanks for the contribution. I would appreciate to get some context about the proposed changes in the PR description.

@MWYang MWYang changed the title Detecting sensitive features in custom fields Detecting names in text fields Nov 7, 2019
@MWYang MWYang marked this pull request as ready for review November 7, 2019 23:37
Copy link
Contributor

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR includes size-able resource files. Can this be downloaded from outside the repo and cached by our build?

@tovbinm
Copy link
Collaborator

tovbinm commented Nov 14, 2019

@MWYang why do we need a new FeatureType for Name? what's so specific about that Text cannot handle?

@MWYang
Copy link
Contributor Author

MWYang commented Nov 14, 2019

@MWYang why do we need a new FeatureType for Name? what's so specific about that Text cannot handle?

You're right, there's nothing special about Name. At the beginning of my ideation process, I had thought that it would make sense to have a different output type for the fields that get flagged as names, but it's no longer a part of the feature. Now, it seems the preferred way to store information about which fields are names is in the metadata, which is what I'm working towards.

Should I revert the changes for creating the new Name feature type then?

@MWYang MWYang changed the title Detecting names in text fields Detecting names in text fields (deprecated) Nov 21, 2019
@MWYang
Copy link
Contributor Author

MWYang commented Nov 21, 2019

I'm closing because I made #440, which reduces the PR size by removing out a different feature that I didn't mean to commit into this branch and by removing the extraneous Name types, per @gerashegalov's suggestion. Would appreciate a review of the new PR so I can know what to work on!

@tovbinm
Copy link
Collaborator

tovbinm commented Nov 21, 2019

Thank you @MWYang I will have a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants