Create new strategy: genderStrategyScore #357

Closed
paulalbert1 opened this issue May 31, 2019 · 1 comment

paulalbert1 commented May 31, 2019

Background

There are a number of cases where ReCiter is suggesting articles for someone of the opposite gender. For example...

[Screenshot: Screen Shot 2019-05-31 at 10 01 45 AM]

We can take advantage of the fact that certain names are more often associated with a particular gender, especially in cases where the inferred gender of our target person's name does not match the inferred gender of the candidate article's target author.

Caveat: yes, gender is a social construct, but people named Richard tend to be male (according to the SSA, 99.6% of the time), and people named Susan tend to be female (99.8%). If a person of interest named Susan happens to be male, ReCiter would not entirely fail to suggest a candidate article where the targetAuthor is "Susananne." It would merely slightly downweight that result as a possible match.

Data source for gender

Howarder downloaded names and genders from the Social Security Administration, which covers 1930-2015. He then computed percentages by gender.
name_gender.json.txt

Some sample data:

Michaeel,M,1
Michael,M,0.9950199847246701
Michaela,F,0.9985816477553034
...
Susaa,F,1
Susan,F,0.9977389059509503

Consistent with the other data sets, this data could be loaded into a DynamoDB table named "Gender".
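As a rough sketch of how the sample rows could be read before loading them anywhere, the snippet below parses them into an in-memory lookup. It assumes the attachment holds comma-separated `name,gender,share` rows like the sample (the `.json.txt` file name suggests the real layout may differ), and `load_gender_table` is a hypothetical helper, not part of ReCiter.

```python
import csv
import io

# A few rows copied from the sample data in this issue.
SAMPLE = """\
Michael,M,0.9950199847246701
Michaela,F,0.9985816477553034
Susan,F,0.9977389059509503
"""

def load_gender_table(text):
    """Map each name to (gender, share of people with that name and gender)."""
    table = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        name, gender, share = row
        table[name] = (gender, float(share))
    return table

table = load_gender_table(SAMPLE)
print(table["Susan"][0])  # F
```

Keys are kept exactly as they appear in the file, since the lookup described in this issue is an exact match.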

How this would work

  1. Add to application.properties
strategy.genderStrategyScore.minimumScore=-5
strategy.genderStrategyScore.rangeScore=6
  2. Attempt to infer the gender of our target person.
  • Get firstName and middleName from primaryName and alternateName.
  • Split names on space and dash. Names need to be two characters or more.
  • Attempt to do an exact lookup in the Gender table of all these names:
    • primaryName(s) from firstName
    • alternateName(s) from firstName
    • primaryName(s) from middleName
    • alternateName(s) from middleName
  • If there's no exact match, stop.
  3. Identify the gender and percentage for the target person. For example, for Ben:
Ben,M,0.9943987100059407

3a. Take the average gender score of all available names.
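The identity-side steps above might be sketched as follows. This is only an illustration under stated assumptions: the dictionary stands in for the Gender table, the function names are hypothetical, and each name is converted to a "probability male" value before averaging (the issue does not say whether averaging happens before or after that conversion).

```python
import re

# Stand-in for the Gender table, using rows quoted in this issue.
GENDER_TABLE = {
    "Ben": ("M", 0.9943987100059407),
    "Beth": ("F", 0.9979603107858765),
}

def name_parts(name):
    """Split on spaces and dashes; keep parts of two or more characters."""
    return [p for p in re.split(r"[\s-]+", name or "") if len(p) >= 2]

def male_probability(part, table):
    """Exact lookup; returns the share as P(male), or None if the name is absent."""
    entry = table.get(part)
    if entry is None:
        return None
    gender, share = entry
    return share if gender == "M" else 1.0 - share

def infer_identity_score(names, table=GENDER_TABLE):
    """names: (firstName, middleName) pairs from primaryName and alternateName."""
    scores = []
    for first, middle in names:
        for part in name_parts(first) + name_parts(middle):
            p = male_probability(part, table)
            if p is not None:
                scores.append(p)
    # No exact match for any name part: stop and report no score.
    return sum(scores) / len(scores) if scores else None

print(name_parts("Mary-Beth J"))  # ['Mary', 'Beth']
```

Returning `None` when nothing matches mirrors the "If there's no exact match, stop" rule.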

  4. Attempt to infer the gender of our targetAuthor.
  • Get firstName from article.
  • Split name on space
  • Attempt to do an exact lookup in the Gender table
  • Example: 24795040 (Y. Claire Wang) --> Claire
  5. Identify the gender and percentage for the targetAuthor of the article. For example, for Beth:
Beth,F,0.9979603107858765

5a. Take the average gender score of all available names.
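The article-side steps could look like the sketch below. It is an illustration rather than ReCiter's implementation: the table is a one-row stand-in, the function names are hypothetical, and scores are again expressed as "probability male" before averaging, which is an assumption.

```python
# Stand-in for the Gender table, using the Beth row quoted in this issue.
GENDER_TABLE = {"Beth": ("F", 0.9979603107858765)}

def author_candidates(first_name):
    """Split the article's firstName on whitespace."""
    return (first_name or "").split()

def infer_article_score(first_name, table=GENDER_TABLE):
    """Average P(male) over every name part with an exact table match."""
    scores = []
    for part in author_candidates(first_name):
        entry = table.get(part)
        if entry is not None:
            gender, share = entry
            scores.append(share if gender == "M" else 1.0 - share)
    return sum(scores) / len(scores) if scores else None

# For "Y. Claire Wang" the firstName is "Y. Claire"; only "Claire" can match a row.
print(author_candidates("Y. Claire"))  # ['Y.', 'Claire']
```

An initial such as "Y." never matches a table row, so it drops out of the average without special handling.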

  6. Compute gender score discrepancy between article and identity.
  • Put both scores on one scale, the probability that the name is male. For a female name, subtract the score from 1. For example, articleGender for Beth would be 1 - 0.9979 = 0.0021.
  • For a male name, leave the score as is: 0.994 for Ben.
  • Subtract the absolute difference between the two scores from 1. For example: 1 - |0.994 - 0.0021| = 0.0081. We'll call this scoreDifference.
  • Compute the genderScore: (scoreDifference * strategy.genderStrategyScore.rangeScore) + strategy.genderStrategyScore.minimumScore
    For example, with a rangeScore of 4 and a minimumScore of -3 (not the application.properties values listed above): (0.0081 * 4) + -3 = -2.9676 (truncated to -2.967)
  7. Output the scores.
  • genderScore-Article = 0.0021
  • genderScore-Identity = 0.994
  • genderScore-IdentityArticleDiscrepancy = -2.967
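The arithmetic can be checked with a short sketch. The function name is hypothetical; the range and minimum are passed in as parameters because the worked example uses 4 and -3 while the application.properties entries list 6 and -5.

```python
def gender_score(identity_p_male, article_p_male, range_score, minimum_score):
    """Scale agreement between two P(male) values into the configured score range."""
    score_difference = 1.0 - abs(identity_p_male - article_p_male)
    return score_difference * range_score + minimum_score

# Ben (0.994) vs. Beth (0.0021), with the values used in the worked example.
print(round(gender_score(0.994, 0.0021, 4, -3), 4))  # -2.9676

# Identical scores give the top of the range: range_score + minimum_score.
print(gender_score(0.994, 0.994, 4, -3))  # 1.0
```

With either pair of settings the best possible score is 1 and the worst approaches the minimum, so the strategy mostly penalizes mismatches rather than rewarding matches.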
  8. Handling null cases.
  • There may be cases where a gender cannot be inferred, e.g., A. Rifkind, where only an initial is available.
  • In this case, the output should look like this.
    • genderScore-Article = null
    • genderScore-Identity = 0.02
    • genderScore-IdentityArticleDiscrepancy = null
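One way to express the null handling is sketched below, with Python's `None` standing in for null. The function name and the dictionary shape are assumptions for illustration; only the three output labels come from this issue.

```python
def score_output(identity_p_male, article_p_male, range_score, minimum_score):
    """Build the three output fields; the discrepancy is None if either side is unknown."""
    if identity_p_male is None or article_p_male is None:
        discrepancy = None
    else:
        score_difference = 1.0 - abs(identity_p_male - article_p_male)
        discrepancy = score_difference * range_score + minimum_score
    return {
        "genderScore-Article": article_p_male,
        "genderScore-Identity": identity_p_male,
        "genderScore-IdentityArticleDiscrepancy": discrepancy,
    }

# Article author "A. Rifkind" yields no gender, so no discrepancy is computed.
result = score_output(0.02, None, 4, -3)
print(result["genderScore-IdentityArticleDiscrepancy"])  # None
```

Keeping the known side (here the identity score of 0.02) in the output matches the example above, where only the missing values are null.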
sarbajitdutta self-assigned this Jun 5, 2019
sarbajitdutta added this to To do in ReCiter enhancements via automation Jun 5, 2019
@paulalbert1

Test case for splitting dashes...
[Screenshot: Screen Shot 2019-06-11 at 10 58 41 AM]

sarbajitdutta moved this from To do to Testing in ReCiter enhancements Jun 12, 2019
ReCiter enhancements automation moved this from Testing to Done Jun 12, 2019