Detecting names in text fields (deprecated) #428

Closed
wants to merge 46 commits into from
Changes from 44 commits
Commits
5f9f99d
First pass at name identification with European Union JRC data
MWYang Oct 7, 2019
180c7f1
Perform dictionary check with one UDF
MWYang Oct 7, 2019
fb5da80
Replaced expensive Spark counts with approximate versions
MWYang Oct 7, 2019
b42469a
Bad attempt at representing output of name identification with NameList
MWYang Oct 7, 2019
f391e07
Revert "Bad attempt at representing output of name identification wit…
MWYang Oct 7, 2019
b2c1663
Changed output type to NameMap
MWYang Oct 8, 2019
ea0ac44
First pass at gender identification with SSA data
MWYang Oct 8, 2019
aeba797
Made gender identification more robust to missing values; not attempt…
MWYang Oct 9, 2019
06388ec
Started on test workflow
MWYang Oct 9, 2019
93195c0
Renamed NameMap to the more accurate descriptor NameStats
MWYang Oct 9, 2019
0fe071a
First pass at postal code detection
MWYang Oct 9, 2019
ca7cc24
Added lat-long extraction with geonames data
MWYang Oct 9, 2019
74f8efc
Added RichText implicit for name identifiers, and Name identifier ret…
MWYang Oct 9, 2019
52f86f6
Added RichText implicit for postal code identifiers
MWYang Oct 9, 2019
685de4e
Handle postal code matches in arbitrary locations of the string
MWYang Oct 9, 2019
45a3eaf
Minor change to output map from postal code identification
MWYang Oct 10, 2019
f6656ad
Added handling of leading zeros in postal codes & refactored a bit
MWYang Oct 15, 2019
d06ef91
Merge branch 'master' into my/detect-sensitive
tovbinm Oct 19, 2019
865414e
Added gender identification for full names
MWYang Oct 21, 2019
0eb9715
Merge branch 'my/detect-sensitive' of https://github.com/MWYang/Trans…
MWYang Oct 21, 2019
e7ca36c
Take the grand mean of dictionary checked words per row instead of th…
MWYang Oct 22, 2019
c7ef496
Started integrating HumanNameIdentifier into SmartTextVectorizer
MWYang Oct 24, 2019
1df72be
Continued migrating name identification into SmartTextVectorizer; Add…
MWYang Oct 26, 2019
f6c317a
Implemented gender checking in SmartTextVectorizer
MWYang Oct 30, 2019
88a5a61
Changed logging and project settings
MWYang Oct 30, 2019
d8a3f33
Added logging back into name identification
MWYang Oct 31, 2019
f7350c0
Moved common Spark functions to utils and fixed bug with empty Row
MWYang Nov 1, 2019
dd811a6
Added type safety to guardCheck, averageCol, and extractDouble functi…
MWYang Nov 1, 2019
8240bf9
Fixed missing pctMale/Female/Other values and added redundant tests
MWYang Nov 1, 2019
3f282bb
REVERT EVENTUALLY: Added logging for a sample of the row to sanity ch…
MWYang Nov 1, 2019
f2b81ae
Cleaned and switched back to using JRC names
MWYang Nov 5, 2019
025a768
First pass at converting code to use only one pass over the dataset w…
MWYang Nov 5, 2019
db98115
Finished using treeAggregate for one pass over the data; using broadc…
MWYang Nov 5, 2019
358f587
Set default timeout for countApprox; use exact count via treeAggregat…
MWYang Nov 5, 2019
e1a09a8
Removed TODOs and added parameter for removing names
MWYang Nov 5, 2019
b7ccb18
Figured out where to put column removal functionality for name columns
MWYang Nov 6, 2019
a914354
Merged SmartTextVectorizerWithBias into SmartTextVectorizer; moved Na…
MWYang Nov 6, 2019
4730b73
Removed unnecessary implicits (replaced with default arguments) and p…
MWYang Nov 6, 2019
580a182
Added documentation to NameIdentificationUtils and updated dictionaries
MWYang Nov 7, 2019
006394c
Finished adding documentation and made small naming changes
MWYang Nov 7, 2019
76b14c8
Reverted property file changes
MWYang Nov 7, 2019
82ca17f
Merge branch 'master' into my/detect-sensitive
MWYang Nov 7, 2019
e8984e2
Hopefully fixed master URL configuration error with SparkSession
MWYang Nov 7, 2019
582a83a
Fixed (hopefully) broadcast variable issue by declaring them when dat…
MWYang Nov 7, 2019
613cf45
Changed SmartTextVectorizer sensitive feature args to an enum from mu…
MWYang Nov 11, 2019
9ecad2f
Merge branch 'master' into my/detect-sensitive
tovbinm Nov 14, 2019
105,601 changes: 105,601 additions & 0 deletions core/src/main/resources/GenderDictionary_USandUK.csv

Large diffs are not rendered by default.

491,975 changes: 491,975 additions & 0 deletions core/src/main/resources/Names_JRC_Combined.txt

Large diffs are not rendered by default.

41,469 changes: 41,469 additions & 0 deletions core/src/main/resources/USPostalCodes.txt

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions core/src/main/scala/com/salesforce/op/dsl/RichTextFeature.scala
@@ -235,6 +235,8 @@ trait RichTextFeature {
hashSpaceStrategy: HashSpaceStrategy = TransmogrifierDefaults.HashSpaceStrategy,
defaultLanguage: Language = TextTokenizer.DefaultLanguage,
hashAlgorithm: HashAlgorithm = TransmogrifierDefaults.HashAlgorithm,
detectSensitive: Boolean = false,
removeSensitive: Boolean = false,
others: Array[FeatureLike[T]] = Array.empty
): FeatureLike[OPVector] = {
// scalastyle:on parameter.number
@@ -258,6 +260,8 @@ trait RichTextFeature {
.setHashSpaceStrategy(hashSpaceStrategy)
.setHashAlgorithm(hashAlgorithm)
.setBinaryFreq(binaryFreq)
.setDetectSensitive(detectSensitive)
.setRemoveSensitive(removeSensitive)
.getOutput()
}

@@ -434,6 +438,24 @@ trait RichTextFeature {
toLowercase: Boolean = TextTokenizer.ToLowercase
): FeatureLike[Binary] =
f.transformWith(new SubstringTransformer[T, T2]().setToLowercase(toLowercase), f2)

/**
* Check whether the feature contains human names and, if so, return related demographic information
*
* @param threshold optional, minimum fraction of rows that must contain names (default = 0.50)
* @return NameStats, a custom map that will be empty if no name was found
*/
def identifyIfHumanName(threshold: Double = 0.50): FeatureLike[NameStats] =
new HumanNameIdentifier[T]().setThreshold(threshold).setInput(f).getOutput()

/**
* Check whether the feature contains postal codes and, if so, return each postal code with its lat/long
*
* @param threshold optional, minimum fraction of rows that must contain valid postal codes (default = 0.90)
* @return PostalCodeMap, will be empty if no postal code was found
*/
def identifyIfPostalCode(threshold: Double = 0.90): FeatureLike[PostalCodeMap] =
new PostalCodeIdentifier[T]().setThreshold(threshold).setInput(f).getOutput()
}

implicit class RichPhoneFeature(val f: FeatureLike[Phone]) {
@@ -0,0 +1,130 @@
/*
* Copyright (c) 2017, Salesforce.com, Inc.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* * Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* * Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* * Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

package com.salesforce.op.stages.impl.feature

import com.salesforce.op._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.base.unary.{UnaryEstimator, UnaryModel}
import com.salesforce.op.utils.stages.NameIdentificationFun
import com.salesforce.op.utils.stages.NameIdentificationUtils.GenderDictionary
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.param.{DoubleParam, IntParam, ParamValidators}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.MetadataBuilder

import scala.reflect.runtime.universe.TypeTag

/**
* Unary estimator for identifying whether a single Text column is a name or not. If the column does appear to be a
* name, a custom map will be returned that contains the guessed gender for each entry. If the column does not appear
* to be a name, then the output will be an empty map.
* @param uid uid for instance
* @param operationName unique name of the operation this stage performs
* @param tti type tag for input
* @param ttiv type tag for input value
* @tparam T the FeatureType (subtype of Text) to operate over
*/
class HumanNameIdentifier[T <: Text]
(
uid: String = UID[HumanNameIdentifier[T]],
operationName: String = "human name identifier"
)
(
implicit tti: TypeTag[T],
override val ttiv: TypeTag[T#Value]
) extends UnaryEstimator[T, NameStats](
uid = uid,
operationName = operationName
) with NameIdentificationFun[T] {

val defaultThreshold = new DoubleParam(
parent = this,
name = "defaultThreshold",
doc = "minimum fraction of entries that must be names before the column is treated as names",
isValid = (value: Double) => {
ParamValidators.gt(0.0)(value) && ParamValidators.lt(1.0)(value)
}
)
setDefault(defaultThreshold, 0.50)
def setThreshold(value: Double): this.type = set(defaultThreshold, value)

val countApproxTimeout = new IntParam(
parent = this,
name = "countApproxTimeout",
doc = "how long to wait (in milliseconds) for result of dataset.rdd.countApprox",
isValid = (value: Int) => { ParamValidators.gt(0)(value) }
)
setDefault(countApproxTimeout, 3 * 60 * 1000)
def setCountApproxTimeout(value: Int): this.type = set(countApproxTimeout, value)

def fitFn(dataset: Dataset[T#Value]): HumanNameIdentifierModel[T] = {
require(dataset.schema.fieldNames.length == 1, "Expected exactly one column in this dataset")

val column = col(dataset.schema.fieldNames.head)
val (predictedNameProb, treatAsName, bestFirstNameIndex) = unaryEstimatorFitFn(
dataset, column, $(defaultThreshold), $(countApproxTimeout)
)

// modified from: https://docs.transmogrif.ai/en/stable/developer-guide/index.html#metadata
val preExistingMetadata = getMetadata()
val metaDataBuilder = new MetadataBuilder().withMetadata(preExistingMetadata)
metaDataBuilder.putBoolean("treatAsName", treatAsName)
// Store the probability as a Double: converting to Long would truncate any value in [0, 1) to 0
metaDataBuilder.putDouble("predictedNameProb", predictedNameProb)
metaDataBuilder.putLong("bestFirstNameIndex", bestFirstNameIndex.getOrElse(-1).toLong)
val updatedMetadata = metaDataBuilder.build()
setMetadata(updatedMetadata)

new HumanNameIdentifierModel[T](uid, treatAsName, indexFirstName = bestFirstNameIndex)
}
}


class HumanNameIdentifierModel[T <: Text]
(
override val uid: String,
val treatAsName: Boolean,
val indexFirstName: Option[Int] = None
)(implicit tti: TypeTag[T])
extends UnaryModel[T, NameStats]("human name identifier", uid) with NameIdentificationFun[T] {

var broadcastGenderDict: Option[Broadcast[GenderDictionary]] = None

override def transform(dataset: Dataset[_]): DataFrame = {
val spark: SparkSession = dataset.sparkSession
this.broadcastGenderDict = Some(spark.sparkContext.broadcast(GenderDictionary()))
super.transform(dataset)
}

def transformFn: T => NameStats = (input: T) => {
transformerFn(treatAsName, indexFirstName, input, this.broadcastGenderDict.get)
}
}
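
The estimator above decides once per column whether to treat it as names, and the model then applies that decision to every row. A minimal sketch of that column-level gate in plain Scala (no Spark), assuming a per-row score equal to the fraction of tokens found in a name dictionary. The object name, the helper names, and the tiny dictionary are hypothetical stand-ins and are not taken from `NameIdentificationFun`, whose actual scoring is not shown in this diff.

```scala
// Hypothetical sketch: names and dictionary contents are illustrative only.
object NameGateSketch {
  // Tiny stand-in for the name dictionary shipped with the PR
  val nameDictionary: Set[String] = Set("alice", "bob", "carol", "smith")

  // Fraction of whitespace-separated tokens in one entry that appear in the dictionary
  def fractionOfNameTokens(entry: String): Double = {
    val tokens = entry.toLowerCase.split("\\s+").filter(_.nonEmpty)
    if (tokens.isEmpty) 0.0
    else tokens.count(nameDictionary.contains).toDouble / tokens.length
  }

  // "Fit" step: average the per-row scores (missing rows score 0.0) and
  // compare the average against the threshold
  def treatAsName(column: Seq[Option[String]], threshold: Double): (Double, Boolean) = {
    // Same open interval as the defaultThreshold param validator above
    require(threshold > 0.0 && threshold < 1.0, "threshold must lie strictly between 0 and 1")
    val perRow = column.map(_.map(fractionOfNameTokens).getOrElse(0.0))
    val predictedNameProb = if (perRow.isEmpty) 0.0 else perRow.sum / perRow.length
    (predictedNameProb, predictedNameProb >= threshold)
  }
}
```

With the default threshold of 0.50, a column in which half of the rows consist entirely of dictionary names sits exactly on the boundary and is treated as a name column, because the comparison in this sketch is inclusive.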
@@ -0,0 +1,152 @@
/*
* Copyright (c) 2017, Salesforce.com, Inc.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* * Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* * Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* * Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

package com.salesforce.op.stages.impl.feature

import com.salesforce.op._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.base.unary.{UnaryEstimator, UnaryModel}
import com.salesforce.op.utils.text.TextUtils.getBestRegexMatch
import org.apache.spark.ml.param.{DoubleParam, ParamValidators}
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._
import org.apache.spark.util.SparkUtils.averageBoolCol

import scala.collection.mutable
import scala.io.Source
import scala.reflect.runtime.universe.TypeTag
import scala.util.Try
import scala.util.matching.Regex

trait PostalCodeHelpers {
lazy val postalCodeDictionary: mutable.Map[String, (Option[Double], Option[Double])] = {
val dictionary = mutable.Map.empty[String, (Option[Double], Option[Double])]
val dictionaryPath = "/USPostalCodes.txt"
val stream = getClass.getResourceAsStream(dictionaryPath)
val buffer = Source.fromInputStream(stream)
// Close the source even if reading or parsing a row fails
try {
// Each row is expected to hold: postal code, latitude, longitude
for {row <- buffer.getLines} {
val cols = row.split(",").map(_.trim)
val code = cols(0)
val lat = Try(cols(1).toDouble).toOption
val lng = Try(cols(2).toDouble).toOption
dictionary += (code -> (lat, lng))
}
} finally buffer.close()
dictionary
}
val patterns: Seq[Regex] = Seq(
".*(\\d{5}).*".r,
".*(\\d{4}).*".r,
".*(\\d{3}).*".r
)

def findBestPostalCodeMatch(s: String): String = {
val result = getBestRegexMatch(patterns, s)
// Pad result with leading zeros if needed; an empty result (no match) is returned unchanged
if (result.nonEmpty && result.length < 5) "0" * (5 - result.length) + result
else result
}
}

class PostalCodeIdentifier[T <: Text]
(
uid: String = UID[PostalCodeIdentifier[_]],
operationName: String = "postal code identifier"
)
(
implicit tti: TypeTag[T],
override val ttiv: TypeTag[T#Value]
) extends UnaryEstimator[T, PostalCodeMap](
uid = uid,
operationName = operationName
) with PostalCodeHelpers {
private val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
// Parameters
val defaultThreshold = new DoubleParam(
parent = this,
name = "defaultThreshold",
doc = "default fraction of successful postal code validations before treating as Postal Code",
isValid = (value: Double) => {
ParamValidators.gt(0.0)(value) && ParamValidators.lt(1.0)(value)
}
)
setDefault(defaultThreshold, 0.90)

def setThreshold(value: Double): this.type = set(defaultThreshold, value)

private def checkIfPostalCode: UserDefinedFunction = udf((s: String) => {
val matched = findBestPostalCodeMatch(s)
matched != "" && (postalCodeDictionary contains matched)
}: Boolean)

def fitFn(dataset: Dataset[Text#Value]): PostalCodeIdentifierModel[T] = {
require(dataset.schema.fieldNames.length == 1, "Expected exactly one column in this dataset")
val column = col(dataset.schema.fieldNames.head)
// Fraction of rows whose best match is a known postal code
val fractionValid = averageBoolCol(
dataset.select(checkIfPostalCode(column).alias(column.toString).as[Boolean]),
column
)
new PostalCodeIdentifierModel[T](uid, treatAsPostalCode = fractionValid >= $(defaultThreshold))
}
}

class PostalCodeIdentifierModel[T <: Text]
(
override val uid: String,
val treatAsPostalCode: Boolean
)(implicit tti: TypeTag[T])
extends UnaryModel[T, PostalCodeMap]("postal code identifier", uid)
with PostalCodeHelpers {
def transformFn: Text => PostalCodeMap = input => {
val rawInput = input.value.getOrElse("")
val postalCode = findBestPostalCodeMatch(rawInput)
if (treatAsPostalCode) {
val (latOption, lngOption) = postalCodeDictionary.getOrElse(postalCode, (None, None))
(latOption, lngOption) match {
case (Some(lat), Some(lng)) =>
PostalCodeMap(Map("postalCode" -> postalCode, "lat" -> lat.toString, "lng" -> lng.toString))
// Keep the same keys as the matched case when coordinates are missing
case _ => PostalCodeMap(Map("postalCode" -> postalCode, "lat" -> "", "lng" -> ""))
}
}
else PostalCodeMap(Map.empty[String, String])
}
}
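
The matching and lookup steps in `PostalCodeHelpers` can be tried out without Spark. The following is a rough sketch under stated assumptions: `bestRegexMatch` is a hypothetical stand-in for `getBestRegexMatch` (whose implementation is not part of this diff) that returns the capture group of the first fully matching pattern or an empty string, and the two dictionary rows are made-up values for illustration, not entries from `USPostalCodes.txt`.

```scala
import scala.util.Try
import scala.util.matching.Regex

// Hypothetical sketch: helper names and sample rows are illustrative only.
object PostalCodeSketch {
  // Same patterns as in the diff: prefer 5 digits, then 4, then 3
  val patterns: Seq[Regex] = Seq(".*(\\d{5}).*".r, ".*(\\d{4}).*".r, ".*(\\d{3}).*".r)

  // Assumed behaviour of getBestRegexMatch: first pattern that matches the
  // whole string wins; "" when nothing matches
  def bestRegexMatch(patterns: Seq[Regex], s: String): String =
    patterns.iterator.map { p =>
      val m = p.pattern.matcher(s)
      if (m.matches()) m.group(1) else ""
    }.find(_.nonEmpty).getOrElse("")

  // Restore leading zeros that were dropped, e.g. "2134" -> "02134";
  // an empty (no match) result is left untouched
  def findBestPostalCodeMatch(s: String): String = {
    val result = bestRegexMatch(patterns, s)
    if (result.nonEmpty && result.length < 5) "0" * (5 - result.length) + result
    else result
  }

  // Parse one "code,lat,lng" row; unparsable coordinates become None
  def parseRow(row: String): (String, (Option[Double], Option[Double])) = {
    val cols = row.split(",").map(_.trim)
    (cols(0), (Try(cols(1).toDouble).toOption, Try(cols(2).toDouble).toOption))
  }

  // Made-up sample rows standing in for the real dictionary file
  val dictionary: Map[String, (Option[Double], Option[Double])] =
    Seq("02134,42.0,-71.0", "94301,37.0,-122.0").map(parseRow).toMap

  // Row-level lookup: an empty map when the code is unknown or lacks coordinates
  def lookup(raw: String): Map[String, String] = {
    val code = findBestPostalCodeMatch(raw)
    dictionary.get(code) match {
      case Some((Some(lat), Some(lng))) =>
        Map("postalCode" -> code, "lat" -> lat.toString, "lng" -> lng.toString)
      case _ => Map.empty
    }
  }
}
```

One design point the sketch makes visible: because each pattern starts with a greedy `.*`, a string with more than five consecutive digits yields the last five of them, so inputs such as phone numbers can produce spurious matches that only the dictionary check filters out.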