Unary estimator for detecting names and transforming to gender #445

Merged
merged 64 commits into from
Jan 14, 2020

Conversation

MWYang
Contributor

@MWYang MWYang commented Dec 9, 2019

The PR implements the following algorithm for detecting names in text fields and inferring gender as a standalone unary estimator:

  • First, a majority (or some custom threshold) of entries must be found in a large name dictionary
  • If the threshold is reached, the transformer uses several heuristics (the first or last token; RegEx patterns; presence of honorifics like Mr. or Ms.) to find the likely first name and infer the gender.
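The two steps above can be sketched roughly as follows. This is a minimal Python illustration, not the PR's Scala code: the tiny dictionaries, the function names, the rule that an entry counts as a hit when any of its tokens is in the name dictionary, and the order in which the heuristics are tried (honorific first, then first token, then last token) are all assumptions made for the example.

```python
import re
from typing import Iterable, List, Optional

# Hypothetical toy data; the PR relies on much larger dictionaries.
NAME_DICTIONARY = {"alice", "bob", "maria", "john", "smith", "jones"}
GENDER_BY_FIRST_NAME = {"alice": "Female", "maria": "Female",
                        "bob": "Male", "john": "Male"}
HONORIFICS = {"mr": "Male", "ms": "Female", "mrs": "Female", "miss": "Female"}

TOKEN_PATTERN = re.compile(r"[A-Za-z]+")


def tokenize(text: str) -> List[str]:
    """Split on non-letters and lowercase, so 'Mr.' becomes 'mr'."""
    return [t.lower() for t in TOKEN_PATTERN.findall(text)]


def looks_like_name(text: str) -> bool:
    """Assumption: an entry is a hit if any token is in the name dictionary."""
    return any(t in NAME_DICTIONARY for t in tokenize(text))


def is_name_column(entries: Iterable[str], threshold: float = 0.5) -> bool:
    """Step 1: more than `threshold` of the entries must hit the dictionary."""
    entries = list(entries)
    if not entries:
        return False
    hits = sum(1 for e in entries if looks_like_name(e))
    return hits / len(entries) > threshold


def infer_gender(text: str) -> Optional[str]:
    """Step 2: try an honorific, then the first token, then the last token."""
    tokens = tokenize(text)
    if not tokens:
        return None
    for t in tokens:
        if t in HONORIFICS:
            return HONORIFICS[t]
    for candidate in (tokens[0], tokens[-1]):
        if candidate in GENDER_BY_FIRST_NAME:
            return GENDER_BY_FIRST_NAME[candidate]
    return None


def detect_and_transform(entries: Iterable[str],
                         threshold: float = 0.5) -> Optional[List[Optional[str]]]:
    """Return per-entry genders, or None if the column is not name-like."""
    entries = list(entries)
    if not is_name_column(entries, threshold):
        return None
    return [infer_gender(e) for e in entries]
```

For instance, `detect_and_transform(["Mr. Bob Smith", "Smith, Alice", "Maria Jones", "n/a"])` passes the threshold (three of four entries hit the dictionary) and yields one inferred value per entry, with `None` where no heuristic applies.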

Additional context
I completed this work as part of my ongoing Salesforce internship. This PR cleans up and replaces many previous ones: #440, #437, #428. Per internal review with @leahmcguire, I'm including my name identification and name-to-gender code in two PRs. This first one implements the standalone unary estimator where most of the actual logic is in a utils file, NameDetectUtils.scala. The code also makes use of Algebird monoids in order to run in a single pass over the data. Pending approval on this PR, I will then make a separate PR with the proper (and hopefully fewer) changes made in SmartTextVectorizer.
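To illustrate why monoids allow a single pass: if the per-entry statistics have an identity value and an associative combine operation, each partition can be summarized independently and the summaries merged afterwards. The sketch below is a plain-Python analogue of that idea, not Algebird's API and not the PR's code; `NameStats`, `prepare`, and `aggregate` are hypothetical names, and the real estimator tracks more than these two counters.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Set


@dataclass(frozen=True)
class NameStats:
    """Hypothetical summary: entries seen and how many hit the dictionary."""
    total: int = 0
    dictionary_hits: int = 0

    def plus(self, other: "NameStats") -> "NameStats":
        # Associative combine: field-wise addition.
        return NameStats(self.total + other.total,
                         self.dictionary_hits + other.dictionary_hits)

    def hit_rate(self) -> float:
        return self.dictionary_hits / self.total if self.total else 0.0


# Identity element: combining with ZERO changes nothing.
ZERO = NameStats()


def prepare(entry: str, dictionary: Set[str]) -> NameStats:
    """Map one entry to its one-element summary."""
    hit = any(token in dictionary for token in entry.lower().split())
    return NameStats(1, 1 if hit else 0)


def aggregate(entries: Iterable[str], dictionary: Set[str]) -> NameStats:
    """Fold all entries into one summary in a single pass."""
    return reduce(NameStats.plus, (prepare(e, dictionary) for e in entries), ZERO)
```

Because `plus` is associative and `ZERO` is its identity, aggregating two halves of the data separately and then combining them gives the same result as aggregating everything at once, which is the property a distributed single-pass computation depends on.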

… Still need to fix HLL serialization in Spark issue
…stead of moments of text length; Still need to fix no moments higher than the 1st being calculated
@MWYang MWYang requested a review from tovbinm January 7, 2020 20:17
@tovbinm
Collaborator

tovbinm commented Jan 8, 2020

let's please follow the convention of camelCasing the parameters, e.g. replace guard_maxNumberOfTokens with guardMaxNumberOfTokens @MWYang

…ues and proper camel casing for guard check params)
@MWYang MWYang requested a review from tovbinm January 8, 2020 18:55
@MWYang MWYang closed this Jan 9, 2020
@MWYang MWYang reopened this Jan 9, 2020
@tovbinm
Collaborator

tovbinm commented Jan 10, 2020

BTW, where are the GenderDictionary_USandUK.csv and Names_JRC_Combined.txt coming from? What's the license on these datasets? @MWYang

@MWYang
Contributor Author

MWYang commented Jan 10, 2020

> BTW, where are the GenderDictionary_USandUK.csv and Names_JRC_Combined.txt coming from? What's the license on these datasets? @MWYang

Both come from governmental organizations: the first is US and UK birth registry data, and the second is from the EU science commission. I also documented the pre-processing of those files in a separate repo: https://github.com/MWYang/InternationalNames

Collaborator

@tovbinm tovbinm left a comment

looks good! some minor comments.

@MWYang MWYang closed this Jan 14, 2020
@MWYang MWYang reopened this Jan 14, 2020
@MWYang MWYang merged commit ac83ad7 into salesforce:master Jan 14, 2020
@tovbinm
Collaborator

tovbinm commented Jan 14, 2020

@MWYang kudos on your 1st contribution!!! 🔥

@nicodv nicodv mentioned this pull request Jun 11, 2020
@salesforce-cla

Thanks for the contribution! It looks like @MWYang is an internal user so signing the CLA is not required. However, we need to confirm this.

4 participants