Unary estimator for detecting names and transforming to gender #445

Merged
merged 64 commits into from
Jan 14, 2020
Changes from 60 commits
2acf3fc
Re-added unary estimator code and started porting logic to Algebird m…
MWYang Dec 5, 2019
b55c31e
Re-added JRC name dictionary and cleaned up names of methods
MWYang Dec 6, 2019
f443952
Fixed bug with AveragedValue computation; Trying to debug current Alg…
MWYang Dec 6, 2019
557ef39
Fixed wrong inequality direction in guard checks
MWYang Dec 6, 2019
b6728ec
Added HLL back to monoid accumulator and code now compiles correctly;…
MWYang Dec 6, 2019
ff1b2ef
Fixed HLL in NameDetectStats not serializing correctly; Now need to f…
MWYang Dec 6, 2019
33b77c9
Fixed NameDetectStats printing
MWYang Dec 6, 2019
4caec75
Fixed guard stat calculation computing moments of number of tokens in…
MWYang Dec 6, 2019
b60dc4a
Fixed moments calculation and fixed divide by zero error when list of…
MWYang Dec 6, 2019
469111b
Added gender identification code transforming; All previous tests now…
MWYang Dec 7, 2019
e5f169e
Undid SparkUtils changes, which are no longer necessary
MWYang Dec 7, 2019
b701612
Renamed class names to be more consistent + small fixes
MWYang Dec 9, 2019
8342dae
Added honorific detection
MWYang Dec 9, 2019
2e0e85a
Implemented RegEx checking for gender
MWYang Dec 9, 2019
b079a27
Implemented mixed gender identification strategies
MWYang Dec 9, 2019
a1197a7
Removed TODOs and extraneous functions in preparation for PR
MWYang Dec 9, 2019
19bad0b
Updated documentation
MWYang Dec 9, 2019
3172bde
Ignore null values in detecting names
MWYang Dec 9, 2019
0d82eef
Added flag for ignoring nulls
MWYang Dec 9, 2019
7812d12
Added sir/madam to list of honorifics
MWYang Dec 9, 2019
345508f
Merge branch 'master' into my/unary-detect-names
MWYang Dec 9, 2019
a80e382
Fixed typo when adding sir/madam to list of honorifics that caused fa…
MWYang Dec 10, 2019
f7817d9
Fixed failing test due to divide by zero NA on some inputs
MWYang Dec 10, 2019
cf3eff0
Cleaned up redundant import in tests
MWYang Dec 10, 2019
b928a49
Made small changes based on PR comments (updated inline comment and i…
MWYang Dec 10, 2019
048c084
Created metadata case class per PR review; Added tests for metadata; …
MWYang Dec 10, 2019
5d30e79
Added test for name threshold
MWYang Dec 10, 2019
78c321c
Updated comment about NameDetectStats.toJson
MWYang Dec 10, 2019
a597d21
Added tests for new NameStats feature type
MWYang Dec 10, 2019
997a132
Added private declaration to methods in NameDetectFun trait
MWYang Dec 11, 2019
6b8a039
Abstracted out even more name detection logic into NameDetectUtils
MWYang Dec 11, 2019
1250559
Added default dictionaries to NameDetectUtils object (for lazy and pe…
MWYang Dec 11, 2019
2d804ee
Fixed tests sometimes failing because they were not using the same na…
MWYang Dec 11, 2019
5336e71
Updated NameDetectStats.toJson to be less verbose and use custom seri…
MWYang Dec 11, 2019
464fe52
Added shortcut for unary name detector
MWYang Dec 12, 2019
bff4d1b
Delete accidentally committed temporary test file
MWYang Dec 12, 2019
639af3d
Removed type parameter from NameDetectFun because of later conflict w…
MWYang Dec 12, 2019
440068c
Removed Pythonic i.e. not Scala-ic index thing and added separate cas…
MWYang Dec 13, 2019
e6136f9
Removed extraneous case classes for dictionaries, per PR comment
MWYang Dec 13, 2019
283d76e
Small fixes (updated comments, re-ordered things) per PR comments
MWYang Dec 13, 2019
80d81ed
Removed usage of broadcast variables in transformer b/c it does not s…
MWYang Dec 13, 2019
e003eb2
Fixed serialization of GenderDetectStrategy, per PR recommendation to…
MWYang Dec 13, 2019
b51e14b
Fixed missing plus sign in OpPipelineStageReaderWriter causing double…
MWYang Dec 13, 2019
393275c
Synced changes from upstream feature branch for STV changes
MWYang Dec 14, 2019
ad9574c
Tidied up monoid definition for NameDetectStats after figuring out ho…
MWYang Dec 14, 2019
5d7716b
Abstracted out ordering of gender detection strategies into utils file
MWYang Dec 18, 2019
eaad6f8
Merge branch 'master' into my/unary-detect-names
MWYang Dec 20, 2019
2bea311
Merge branch 'my/unary-detect-names' of https://github.com/MWYang/Tra…
MWYang Dec 20, 2019
77f91a2
Small fixes (better Scala code, more safe, better patterns) from Matt…
MWYang Dec 20, 2019
b6b385a
Improved gender detection strategy tests to check that the correct st…
MWYang Dec 20, 2019
8ed02bd
Broke out guard check numbers into their own params
MWYang Dec 21, 2019
799eb58
Added operationName as an argument to HumanNameDetectorModel for easi…
MWYang Jan 6, 2020
ca69ed8
Merge branch 'master' into my/unary-detect-names
MWYang Jan 6, 2020
b82440a
Merge branch 'my/unary-detect-names' of https://github.com/MWYang/Tra…
MWYang Jan 6, 2020
d747c92
Revert to using container Text class for NameDetectUtils per PR comments
MWYang Jan 6, 2020
7d01d70
Added NameStats to FeatureBuilder
MWYang Jan 6, 2020
b4209c6
Added NameStats to a few more places
MWYang Jan 7, 2020
23d7a57
Added NameStats to TestFeatureBuilder and RandomMap
MWYang Jan 7, 2020
b9522de
Merge branch 'master' into my/unary-detect-names
tovbinm Jan 8, 2020
e4e3ddd
Incorporated PR comments (using enumeratum for NameStats map keys/val…
MWYang Jan 8, 2020
8b02dff
Incorporated PR comments (renamed GenderStrings to GenderValues and r…
MWYang Jan 9, 2020
8844ef5
Removed plural names from NameStats enums and factored out method in …
MWYang Jan 9, 2020
0cc10e0
More small fixes from PR comments
MWYang Jan 10, 2020
4fea007
Removed emptiness check
MWYang Jan 14, 2020
105,601 changes: 105,601 additions & 0 deletions core/src/main/resources/GenderDictionary_USandUK.csv

Large diffs are not rendered by default.

568,127 changes: 568,127 additions & 0 deletions core/src/main/resources/Names_JRC_Combined.txt

Large diffs are not rendered by default.

@@ -434,6 +434,15 @@ trait RichTextFeature {
toLowercase: Boolean = TextTokenizer.ToLowercase
): FeatureLike[Binary] =
f.transformWith(new SubstringTransformer[T, T2]().setToLowercase(toLowercase), f2)

/**
* Check whether the feature contains human names and, if so, return related gender information (English only)
*
* @param threshold optional, minimum fraction of rows that must look like names for the feature to be treated as a name (default = 0.50)
* @return NameStats, a custom map that will be empty if no name was found
*/
def identifyIfHumanName(threshold: Double = 0.50): FeatureLike[NameStats] =
new HumanNameDetector[T]().setThreshold(threshold).setInput(f).getOutput()
}

implicit class RichPhoneFeature(val f: FeatureLike[Phone]) {
@@ -0,0 +1,168 @@
/*
* Copyright (c) 2017, Salesforce.com, Inc.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* * Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* * Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* * Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

package com.salesforce.op.stages.impl.feature

import com.salesforce.op._
import com.salesforce.op.features.types.NameStats.GenderStrings
import com.salesforce.op.features.types._
import com.salesforce.op.stages.base.unary.{UnaryEstimator, UnaryModel}
import com.salesforce.op.stages.impl.MetadataLike
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.stages._
import com.twitter.algebird.Operators._
import org.apache.spark.sql._
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

import scala.reflect.runtime.universe.TypeTag

/**
* Unary estimator for identifying whether a single Text column is a name or not. If the column does appear to be a
* name, a custom map will be returned that contains the guessed gender for each entry (gender detection only supported
* for English at the moment). If the column does not appear to be a name, then the output will be an empty map.
* @param uid uid for instance
* @param operationName unique name of the operation this stage performs
* @param tti type tag for input
* @param ttiv type tag for input value
* @tparam T the FeatureType (subtype of Text) to operate over
*/
class HumanNameDetector[T <: Text]
(
uid: String = UID[HumanNameDetector[T]],
operationName: String = "humanNameDetect"
)
(
implicit tti: TypeTag[T],
override val ttiv: TypeTag[T#Value]
) extends UnaryEstimator[T, NameStats](
uid = uid,
operationName = operationName
) with NameDetectFun[T] {

def fitFn(dataset: Dataset[T#Value]): HumanNameDetectorModel[T] = {
require(!dataset.isEmpty, "Input dataset cannot be empty")

implicit val (nameDetectStatsEnc, nameDetectStatsMonoid) = (NameDetectStats.kryo, NameDetectStats.monoid)
val mapFun: T#Value => NameDetectStats = makeMapFunction(dataset.sparkSession)
val aggResults: NameDetectStats = dataset.map(mapFun).reduce(_ + _)
val treatAsName = computeTreatAsName(aggResults)

val newMetadata = HumanNameDetectorMetadata(
treatAsName, aggResults.dictCheckResult.value, aggResults.genderResultsByStrategy
).toMetadata()
val metaDataBuilder = new MetadataBuilder().withMetadata(getMetadata()).withMetadata(newMetadata)
setMetadata(metaDataBuilder.build())

val orderedGenderDetectStrategies =
if (treatAsName) orderGenderStrategies(aggResults) else Seq.empty[GenderDetectStrategy]
new HumanNameDetectorModel[T](uid, operationName, treatAsName, orderedGenderDetectStrategies)
}
}
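
The fit step above maps every row to a statistics object and reduces the results with an associative `+`, then compares the aggregate against the threshold. A minimal sketch of that pattern in plain Scala follows. `SimpleNameStats`, `fromRow`, and the `>=` threshold comparison are assumptions made for illustration; the actual contents of `NameDetectStats` and `computeTreatAsName` are not shown in this diff.

```scala
// Hypothetical, simplified stand-in for NameDetectStats (the real class is not shown in this diff).
final case class SimpleNameStats(rowsSeen: Long, rowsLookingLikeNames: Long) {
  // Associative combine: the property a distributed map-then-reduce relies on.
  def +(that: SimpleNameStats): SimpleNameStats =
    SimpleNameStats(rowsSeen + that.rowsSeen, rowsLookingLikeNames + that.rowsLookingLikeNames)
}

object SimpleNameStats {
  // Identity element, so that empty input still folds to a valid value.
  val zero: SimpleNameStats = SimpleNameStats(0L, 0L)

  // Hypothetical per-row map function: a row counts as a name if any token is in the dictionary.
  def fromRow(row: Option[String], dictionary: Set[String]): SimpleNameStats = {
    val tokens = row.toSeq.flatMap(_.toLowerCase.split("\\s+")).filter(_.nonEmpty)
    SimpleNameStats(1L, if (tokens.exists(dictionary.contains)) 1L else 0L)
  }

  // Assumed decision rule: treat the column as names when the matching fraction reaches the threshold.
  def treatAsName(stats: SimpleNameStats, threshold: Double): Boolean =
    stats.rowsSeen > 0 && stats.rowsLookingLikeNames.toDouble / stats.rowsSeen >= threshold
}

object AggregationSketch {
  val dictionary: Set[String] = Set("alice", "bob", "carol")
  val rows: Seq[Option[String]] = Seq(Some("Alice Smith"), Some("Bob"), Some("invoice 42"), None)

  // Map each row to its statistics, then fold them together with the monoid's zero and plus.
  val aggregated: SimpleNameStats =
    rows.map(SimpleNameStats.fromRow(_, dictionary)).foldLeft(SimpleNameStats.zero)(_ + _)

  val decision: Boolean = SimpleNameStats.treatAsName(aggregated, threshold = 0.50)
}
```

Folding from a zero element, rather than calling `reduce` directly, is one way to tolerate empty input without a separate emptiness check.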

class HumanNameDetectorModel[T <: Text]
(
override val uid: String,
operationName: String,
val treatAsName: Boolean,
val orderedGenderDetectStrategies: Seq[GenderDetectStrategy] = Seq.empty[GenderDetectStrategy]
)(implicit tti: TypeTag[T])
extends UnaryModel[T, NameStats](operationName, uid) with NameDetectFun[T] {

import NameStats.BooleanStrings._
import NameStats.GenderStrings.GenderNA
import NameStats.Keys._
def transformFn: T => NameStats = (input: T) => {
val tokens = preProcess(input)
if (treatAsName) {
require(orderedGenderDetectStrategies.nonEmpty, "There must be a gender extraction strategy if treating as name.")
// TODO: find a way to use a broadcast variable for the gender dictionary within a unary transformer
val genders: Seq[GenderStrings] = orderedGenderDetectStrategies map {
identifyGender(input.value, tokens, _, NameDetectUtils.DefaultGenderDictionary)
}
val gender = genders.find(_ != GenderNA).getOrElse(GenderNA)
val map: Map[String, String] = Map(
IsName.toString -> True.toString,
OriginalValue.toString -> input.value.getOrElse(""),
Gender.toString -> gender.toString
)
NameStats(map)
}
else NameStats(Map.empty[String, String])
}
}
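
The transform step above tries each gender detection strategy in order and keeps the first result that is not `GenderNA`. A minimal sketch of that fallback in plain Scala follows. The `Gender` hierarchy, the two strategies, and the tiny dictionary are hypothetical stand-ins, not the types from this PR. Unlike the code above, which evaluates every strategy eagerly, this variant uses an iterator so that later strategies are skipped once one succeeds.

```scala
// Hypothetical stand-ins for the gender values used in the diff.
sealed trait Gender
case object Male extends Gender
case object Female extends Gender
case object GenderNA extends Gender

object StrategyFallbackSketch {
  // A strategy maps a raw name string to a gender guess (an assumption for this sketch).
  type Strategy = String => Gender

  private val maleHonorifics = Set("mr", "sir")
  private val femaleHonorifics = Set("ms", "mrs", "madam")
  private val firstNames: Map[String, Gender] = Map("alice" -> Female, "bob" -> Male)

  // First token of the name, lower-cased, with a trailing period removed.
  private def firstToken(name: String): Option[String] =
    name.trim.toLowerCase.split("\\s+").headOption.map(_.stripSuffix(".")).filter(_.nonEmpty)

  // Strategy 1: look for a leading honorific.
  val byHonorific: Strategy = name =>
    firstToken(name) match {
      case Some(t) if maleHonorifics.contains(t) => Male
      case Some(t) if femaleHonorifics.contains(t) => Female
      case _ => GenderNA
    }

  // Strategy 2: look the first token up in a first-name dictionary.
  val byFirstName: Strategy = name =>
    firstToken(name).flatMap(firstNames.get).getOrElse(GenderNA)

  // Try strategies in order; the iterator stops evaluating after the first non-NA result.
  def firstKnownGender(name: String, ordered: Seq[Strategy]): Gender =
    ordered.iterator.map(strategy => strategy(name)).find(_ != GenderNA).getOrElse(GenderNA)
}
```

With the strategies ordered as `Seq(byHonorific, byFirstName)`, a value such as "Mr. Pat Lee" is resolved by the honorific alone and the dictionary lookup never runs.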

case class HumanNameDetectorMetadata
(
treatAsName: Boolean,
predictedNameProb: Double,
genderResultsByStrategy: Map[String, GenderStats]
) extends MetadataLike {
import HumanNameDetectorMetadata._

override def toMetadata(): Metadata = {
val metaDataBuilder = new MetadataBuilder()
metaDataBuilder.putBoolean(TreatAsNameKey, treatAsName)
metaDataBuilder.putDouble(PredictedNameProbKey, predictedNameProb)
val genderResultsMetaDataBuilder = new MetadataBuilder()
genderResultsByStrategy foreach { case (strategyString, stats) =>
genderResultsMetaDataBuilder.putDoubleArray(strategyString, Array(stats.numMale, stats.numFemale, stats.numOther))
}
metaDataBuilder.putMetadata(GenderResultsByStrategyKey, genderResultsMetaDataBuilder.build())
metaDataBuilder.build()
}

override def toMetadata(skipUnsupported: Boolean): Metadata = toMetadata()
}

case object HumanNameDetectorMetadata {
val TreatAsNameKey = "treatAsName"
val PredictedNameProbKey = "predictedNameProb"
val GenderResultsByStrategyKey = "genderResultsByStrategy"

def fromMetadata(metadata: Metadata): HumanNameDetectorMetadata = {
val genderResultsMetadata = metadata.getMetadata(GenderResultsByStrategyKey)
val genderResultsByStrategy: Map[String, GenderStats] = {
NameDetectUtils.GenderDetectStrategies.map { strategy: GenderDetectStrategy =>
val strategyString = strategy.toString
val resultsArray = genderResultsMetadata.getDoubleArray(strategyString)
require(resultsArray.length == 3,
"There must be exactly three values for each gender detection strategy: numMale, numFemale, and numOther.")
strategyString -> GenderStats(
numMale = resultsArray(0).toInt, numFemale = resultsArray(1).toInt, numOther = resultsArray(2).toInt
)
}.toMap
}
HumanNameDetectorMetadata(
metadata.getBoolean(TreatAsNameKey),
metadata.getDouble(PredictedNameProbKey),
genderResultsByStrategy
)
}
}
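
The metadata class above stores per-strategy gender counts as a nested `Metadata` object of double arrays and reads them back by strategy name. A minimal sketch of that round trip follows, assuming `spark-sql` is on the classpath. `GenderCounts`, `MetadataRoundTripSketch`, and the choice to skip strategies that are absent from the metadata are assumptions for illustration; the code in the diff reads every known strategy unconditionally.

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Hypothetical stand-in for the GenderStats class referenced in the diff.
final case class GenderCounts(numMale: Int, numFemale: Int, numOther: Int)

object MetadataRoundTripSketch {
  val CountsKey = "genderResultsByStrategy"

  // Write each strategy's counts as a three-element double array inside a nested Metadata.
  def write(countsByStrategy: Map[String, GenderCounts]): Metadata = {
    val nested = new MetadataBuilder()
    countsByStrategy.foreach { case (strategy, c) =>
      nested.putDoubleArray(strategy, Array(c.numMale.toDouble, c.numFemale.toDouble, c.numOther.toDouble))
    }
    new MetadataBuilder().putMetadata(CountsKey, nested.build()).build()
  }

  // Read the counts back; strategies missing from the metadata are skipped (a choice made for this sketch).
  def read(metadata: Metadata, strategies: Seq[String]): Map[String, GenderCounts] = {
    val nested = metadata.getMetadata(CountsKey)
    strategies.filter(nested.contains).map { strategy =>
      val values = nested.getDoubleArray(strategy)
      require(values.length == 3, s"Expected 3 counts for $strategy, found ${values.length}")
      strategy -> GenderCounts(values(0).toInt, values(1).toInt, values(2).toInt)
    }.toMap
  }
}
```

Storing counts as doubles mirrors the diff, which converts with `toInt` on the way back; the conversion is lossless for counts small enough to be represented exactly as a `Double`.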
@@ -231,6 +231,10 @@ private[op] case object Transmogrifier {
val (f, other) = castAs[StreetMap](g) // TODO make Street specific transformer
f.vectorize(topK = TopK, minSupport = MinSupport, cleanText = CleanText, cleanKeys = CleanKeys,
others = other, trackNulls = TrackNulls, maxPctCardinality = MaxPercentCardinality)
case t if t =:= weakTypeOf[NameStats] =>
val (f, other) = castAs[NameStats](g)
f.vectorize(topK = TopK, minSupport = MinSupport, cleanText = CleanText, cleanKeys = CleanKeys,
others = other, trackNulls = TrackNulls, maxPctCardinality = MaxPercentCardinality)
case t if t =:= weakTypeOf[GeolocationMap] =>
val (f, other) = castAs[GeolocationMap](g)
f.vectorize(cleanKeys = CleanKeys, others = other, trackNulls = TrackNulls)