Unary estimator for detecting names and transforming to gender #445
…onoids instead of custom accumulators
… Still need to fix HLL serialization in Spark issue
…ix printing bug for NameDetectStats
…stead of moments of text length; Still need to fix no moments higher than the 1st being calculated
Let's please follow the convention of camelCasing the parameters, e.g. replace
features/src/main/scala/com/salesforce/op/features/types/Maps.scala
…ues and proper camel casing for guard check params)
…emoved BooleanStrings enum
…identifyGender to hopefully reduce complexity
BTW, where are the
features/src/main/scala/com/salesforce/op/features/types/Maps.scala
Both come from governmental organizations: the first from US and UK birth registry data, the second from the EU science commission. I also documented the pre-processing of those files in a separate repo: https://github.com/MWYang/InternationalNames
core/src/test/scala/com/salesforce/op/stages/impl/feature/HumanNameDetectorTest.scala
looks good! some minor comments.
core/src/main/scala/com/salesforce/op/stages/impl/feature/HumanNameDetector.scala
@MWYang kudos on your 1st contribution!!! 🔥
Thanks for the contribution! It looks like @MWYang is an internal user so signing the CLA is not required. However, we need to confirm this.
The PR implements the following algorithm for detecting names in text fields and inferring gender as a standalone unary estimator:
Additional context
I completed this work as part of my ongoing Salesforce internship. This PR cleans up and replaces several previous ones: #440, #437, #428. Per internal review with @leahmcguire, I'm splitting my name identification and name-to-gender code into two PRs. This first one implements the standalone unary estimator, with most of the actual logic living in a utils file, NameDetectUtils.scala. The code also uses Algebird monoids so that it can run in a single pass over the data. Pending approval of this PR, I will make a separate PR with the proper (and hopefully fewer) changes to SmartTextVectorizer.
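For readers unfamiliar with the single-pass idea mentioned above, here is a minimal sketch of why a monoid makes it work: each row is mapped to a small statistics value, and the values are combined with an associative `plus` starting from `zero`, so partial results can be merged in any grouping. Everything in it is an assumption made for illustration: `SimpleMonoid`, `NameStats`, `knownNames`, and `toStats` are hypothetical names. It uses a hand-rolled trait rather than Algebird's own types, and it is not the code in NameDetectUtils.scala.

```scala
// Hypothetical sketch only: a hand-rolled monoid, not Algebird's API
// and not the implementation in this PR.
trait SimpleMonoid[T] {
  def zero: T                 // identity element
  def plus(a: T, b: T): T     // associative combine
}

// Assumed per-column statistics: values seen, and values that contained
// at least one token found in a name dictionary.
final case class NameStats(total: Long, withName: Long)

object NameStats {
  implicit val monoid: SimpleMonoid[NameStats] = new SimpleMonoid[NameStats] {
    val zero: NameStats = NameStats(0L, 0L)
    def plus(a: NameStats, b: NameStats): NameStats =
      NameStats(a.total + b.total, a.withName + b.withName)
  }
}

object SinglePassSketch {
  // Toy dictionary standing in for a real name list (assumption).
  val knownNames: Set[String] = Set("alice", "bob", "maria")

  // Map one text value to its statistics.
  def toStats(text: String): NameStats = {
    val tokens = text.toLowerCase.split("\\s+").filter(_.nonEmpty)
    NameStats(1L, if (tokens.exists(knownNames.contains)) 1L else 0L)
  }

  // One pass over the rows: map each row, then fold with the monoid.
  def aggregate(rows: Seq[String])(implicit m: SimpleMonoid[NameStats]): NameStats =
    rows.foldLeft(m.zero)((acc, row) => m.plus(acc, toStats(row)))

  def main(args: Array[String]): Unit = {
    val rows = Seq("Alice Smith", "order 1234", "Bob", "maria garcia")
    println(aggregate(rows))
  }
}
```

Because `plus` is associative and `zero` is its identity, the same fold could be run per partition and the partial `NameStats` merged afterwards, which is the property a distributed single-pass aggregation relies on.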