# We have used the Gender API to infer gender

Now we want to assess its accuracy using a test set.

## Load the data

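Three datasets, each a CSV file:
- `test_set`: authors known to be women (`top_femec.csv`)
- `auth`: all authors, without gender information (`author.csv`)
- `gender`: Gender API responses, one row per query (`gender.csv`)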

In [1]:
import pandas as pd
import numpy as np

In [2]:
test_set = pd.read_csv('/home/rdora/femec/data/top_femec.csv')
auth = pd.read_csv('/home/rdora/femec/data/author.csv')
gender = pd.read_csv('/home/rdora/femec/data/gender.csv')

In [3]:
print(f"Shape of authors: {auth.shape[0]:,}\nShape of test_set {test_set.shape[0]:,}")

Shape of authors: 72,754
Shape of test_set 1,481


We have 72,754 authors in total, without gender information. We know that 1,481 of them are women; these form the test set.

## Data processing

`gender` contains some duplicated names. Let's look at one of them to understand why.

In [4]:
# Select the 'name' column explicitly instead of relying on column position
repeated_name = gender.loc[gender.duplicated(subset=['name']), 'name'].iloc[0]
gender[gender.name==repeated_name].head()

| Unnamed: 0 | name | q | gender | total_names | probability | country |
|---|---|---|---|---|---|---|
| 56 | muhammad | muhammad irwan padli | male | 63762 | 99 | ID |
| 107 | muhammad | muhammad | male | 63762 | 99 | ID |
| 1758 | muhammad | muhammad farooq | male | 63762 | 99 | ID |
| 1789 | muhammad | muhammad baqir | male | 63762 | 99 | ID |
| 2331 | muhammad | muhammad zammad | male | 63762 | 99 | ID |


The repeated names come from different queries (`q`) that share the same first name, so the API returns the same `name` and the same answer for each of them.
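Before dropping duplicates, it can help to confirm that every repeated name carries a single gender answer. The following is a minimal sketch on made-up rows (the `toy` frame and the `summary` column names are assumptions for illustration, not the real `gender.csv`):

```python
import pandas as pd

# Made-up rows mimicking the structure of `gender`
toy = pd.DataFrame({
    "name": ["muhammad", "muhammad", "ana", "muhammad"],
    "q": ["muhammad irwan padli", "muhammad", "ana", "muhammad farooq"],
    "gender": ["male", "male", "female", "male"],
})

# keep=False marks every row of a duplicated name, not only the repeats
dupes = toy[toy.duplicated(subset=["name"], keep=False)]

# Count distinct queries and distinct gender answers per repeated name
summary = dupes.groupby("name").agg(
    queries=("q", "nunique"),
    genders=("gender", "nunique"),
)
print(summary)
```

If `genders` is 1 for every repeated name, keeping only the first row per name loses no information.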

In [5]:
# Convert all first names to lowercase and drop authors without one
auth['first_name'] = auth['Name-First'].str.lower()
auth = auth.dropna(subset=['first_name'])

In [6]:
# Keep only names with a known gender, and remove duplicate names.
gender = gender.drop_duplicates(subset=['name'])

gender = gender[~gender.gender.isna()]

gender = gender[~gender.name.isna()].drop(['q', 'country'], axis=1)

gender.rename(columns={'name': 'first_name'}, inplace=True)

In [7]:
print(f"Shape of gender: {gender.shape[0]:,}")

Shape of gender: 16,105


In [8]:
auth_api = pd.merge(auth, gender, how='left', on='first_name')

In [9]:
ngender = auth_api.gender.notna().sum()
print(f"Authors with gender: {ngender / auth_api.shape[0]:,.2%}")

Authors with gender: 90.32%


The number of unique names is not that large (16,105). However, 90.3% of all ~72k authors have a name in this list.
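The coverage figure can also be read directly off the merge. This is a small sketch on made-up frames (`auth_toy` and `gender_toy` are hypothetical stand-ins), using the `indicator` and `validate` options of `pd.merge` to flag matched rows and to check that each name appears only once on the gender side:

```python
import pandas as pd

# Hypothetical stand-ins for `auth` and the deduplicated `gender` table
auth_toy = pd.DataFrame({"first_name": ["ana", "ana", "li", "zed"]})
gender_toy = pd.DataFrame({
    "first_name": ["ana", "li"],
    "gender": ["female", "male"],
})

merged = pd.merge(
    auth_toy, gender_toy, how="left", on="first_name",
    validate="many_to_one",  # raises if a name is repeated in gender_toy
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Share of authors whose first name was found in the gender table
coverage = (merged["_merge"] == "both").mean()
print(f"Authors with gender: {coverage:.2%}")
```

With `validate="many_to_one"`, the left join cannot silently multiply author rows if duplicates remain in the name table.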

## Testing the results

In [10]:
# Add gender data
test_set = pd.merge(test_set, auth_api[['Short-Id', 'gender', 'probability']], how='left', on='Short-Id')

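Let's compute precision and recall on the whole test set. Every author in the test set is a woman, so a `male` prediction is counted as a false positive, and any prediction other than `female` (including a missing one) is counted as a false negative.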

In [11]:
true_positives = (test_set.gender_y=='female').sum()
false_positives = (test_set.gender_y=='male').sum()
false_negatives = (test_set.gender_y!='female').sum()

print(f"Precision = {true_positives / (true_positives + false_positives):,.2}")
print(f"Recall = {true_positives / (true_positives + false_negatives):,.2}")

Precision = 0.96
Recall = 0.91
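For reference, the same counts can be wrapped in a small helper. This is only a sketch: `precision_recall` is a hypothetical name that does not appear in the notebook, and it follows the convention used above, where every true label is `female`, a `male` prediction lowers precision, and anything other than `female` (including no prediction) lowers recall.

```python
def precision_recall(predictions):
    """Score predictions for a test set in which every true label is 'female'.

    `predictions` is a list of labels; None means no prediction was returned.
    """
    true_positives = sum(p == "female" for p in predictions)
    wrong_gender = sum(p == "male" for p in predictions)
    not_female = sum(p != "female" for p in predictions)
    precision = true_positives / (true_positives + wrong_gender)
    recall = true_positives / (true_positives + not_female)
    return precision, recall


# Made-up data: 8 correct predictions, 1 wrong, 1 missing
toy = ["female"] * 8 + ["male"] + [None]
precision, recall = precision_recall(toy)
print(f"Precision = {precision:.2f}")
print(f"Recall = {recall:.2f}")
```

On this toy input the helper gives a precision of 8/9 and a recall of 8/10.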


## Precision and recall

Precision and recall are both high overall. Still, it would be even better to gain precision at a small cost in recall, because it is very important that our algorithm performs well for women across our dataset.

In [12]:
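# Keep only predictions with a probability of 90 or higher
test_set_90 = test_set[test_set.probability >= 90]

true_positives = (test_set_90.gender_y=='female').sum()
false_positives = (test_set_90.gender_y=='male').sum()
# NOTE: false negatives are still counted on the unfiltered test_set, so women
# predicted 'female' below the threshold are not included; recall is an upper bound
false_negatives = (test_set.gender_y!='female').sum()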

print(f"Precision = {true_positives / (true_positives + false_positives):,.2}")
print(f"Recall = {true_positives / (true_positives + false_negatives):,.2}")

Precision = 0.99
Recall = 0.9


Using a gender probability of 90 or higher, we get almost the same recall (0.9 instead of 0.91) and get 0.99 precision!
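One way to explore this trade-off is to sweep the probability threshold. The sketch below uses made-up data and a hypothetical helper, `score_at_threshold`. It treats predictions below the threshold as missing and divides by the size of the full test set when computing recall, which is an assumption about how the threshold should be scored rather than the notebook's exact calculation.

```python
import pandas as pd

# Made-up stand-in for test_set: every row is a woman in reality
toy = pd.DataFrame({
    "gender_y": ["female", "female", "female", "male", "female", None],
    "probability": [99, 95, 80, 60, 92, None],
})


def score_at_threshold(df, threshold):
    """Precision and recall when predictions below `threshold` are discarded."""
    kept = df[df["probability"] >= threshold]
    true_positives = (kept["gender_y"] == "female").sum()
    wrong_gender = (kept["gender_y"] == "male").sum()
    precision = true_positives / (true_positives + wrong_gender)
    # Every row is a woman, so recall is measured against the full test set
    recall = true_positives / len(df)
    return precision, recall


for threshold in (50, 90):
    precision, recall = score_at_threshold(toy, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Scoring recall against the full test set makes the cost of a stricter threshold visible, since every discarded prediction counts as a miss.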

## Let's try with another gender guesser

In [13]:
import gender_guesser.detector as gg

In [14]:
guessed_gender = []
names = auth['first_name'].unique()
d = gg.Detector()
for name in names:
    guessed_gender.append(d.get_gender(name.capitalize()))
df_guess = pd.DataFrame({'first_name': names, 'gender': guessed_gender})

In [15]:
df_guess.gender.value_counts(normalize=True)

unknown          0.592417
male             0.223902
female           0.147348
andy             0.015647
mostly_male      0.012479
mostly_female    0.008207
Name: gender, dtype: float64

About 59% of the unique first names come back as `unknown`, so coverage is already limited with this package.

In [16]:
auth_api = pd.merge(auth, df_guess, how='left', on='first_name')
test_2 = pd.merge(test_set, auth_api[['Short-Id', 'gender']], how='left', on='Short-Id')

### Now let's evaluate that on the test set

In [17]:
true_positives = ((test_2.gender=='female') | (test_2.gender=='mostly_female')).sum()
# 'andy' (androgynous) is counted as a wrong prediction here
false_positives = ((test_2.gender=='male') | (test_2.gender=='mostly_male') | (test_2.gender=='andy')).sum()
false_negatives = ((test_2.gender!='female') & (test_2.gender!='mostly_female')).sum()

print(f"Precision = {true_positives / (true_positives + false_positives):,.2}")
print(f"Recall = {true_positives / (true_positives + false_negatives):,.2}")

Precision = 0.96
Recall = 0.86
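The label handling above can also be expressed as an explicit mapping. This is a sketch with a hypothetical `LABEL_MAP`; unlike the cell above, it treats `andy` and `unknown` as "no prediction" instead of counting `andy` as a wrong answer, which is one possible design choice and not the notebook's.

```python
import pandas as pd

# Hypothetical mapping from gender_guesser labels to a binary prediction.
# 'andy' and 'unknown' are deliberately left out, so they map to NaN.
LABEL_MAP = {
    "female": "female",
    "mostly_female": "female",
    "male": "male",
    "mostly_male": "male",
}

# Made-up labels for illustration
labels = pd.Series(["female", "mostly_female", "andy", "unknown", "mostly_male"])

# Series.map with a dict turns unmapped labels into NaN
binary = labels.map(LABEL_MAP)
print(binary.value_counts(dropna=False))
```

Keeping the mapping in one place makes it easy to test how the treatment of `andy` changes the reported precision.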


# Add gender to the author table

In this last step we'll add the gender to the author table to create the new people table.

In [19]:
auth['last_name'] = auth['Name-Last'].str.lower()

In [20]:
people = auth[['Short-Id',
                 'Workplace-Institution',
                 'first_name',
                 'last_name']]

In [21]:
top_femec = pd.read_csv('/home/rdora/femec/data/top_femec.csv')

In [22]:
top_femec['gender'] = 'female'

In [23]:
gender_90 = gender[gender.probability >= 90].drop(['total_names', 'probability'], axis=1)
people = pd.merge(people, gender_90, on='first_name', how='left')
people = people[people.gender.notna()]

In [24]:
# Authors in top_femec are known to be women: override the inferred gender
people.loc[people['Short-Id'].isin(top_femec['Short-Id'].unique()), 'gender'] = 'female'

In [25]:
people.gender.value_counts(normalize=True)

male      0.732679
female    0.267321
Name: gender, dtype: float64

In [26]:
people.to_csv('/home/rdora/femec/data/people.csv', index=False)