# News stories in which gender plays a key role

### Some stories depend on knowing the gender distribution. Here are some examples:

* **Washington Post**: Here’s how Hillary Clinton knows that 61 percent of her donors were women [<a href="https://www.washingtonpost.com/news/the-fix/wp/2015/07/16/heres-how-hillary-clinton-knows-that-61-percent-of-her-donors-were-women">Link</a>]

* **The Atlantic**: When Will the Gender Gap in Science Disappear? [<a href="https://www.theatlantic.com/science/archive/2018/04/when-will-the-gender-gap-in-science-disappear/558413/">Link</a>]

* **Bloomberg News**: Record Numbers of Women Running for Office [<a href="https://www.bloomberg.com/graphics/2018-women-candidates/">Link</a>]

* **The Guardian**: How we analysed 70m comments on the Guardian website [<a href="https://www.theguardian.com/technology/2016/apr/12/how-we-analysed-70m-comments-guardian-website">Link</a>]

For these pieces, gender was **estimated** based on each person's first name.

### An acknowledgment to our Non-Binary community

The Python library, ```Genderize```, is based on the theory that analyzing a first name can help estimate someone’s gender. But that really applies only in a binary world in which a name is either male or female. We **don’t** live in a binary world and this approach risks erasing the identities of our non-binary community members. 

The reality is that one’s name does not determine one’s gender.

As journalists, we may find ourselves working on critical projects where we need to know gender identities. For example:

- How many refugees in a camp are male and how many are female? 
- What is the gender diversity of top executive leadership in an industry? 
- What form of gender equality is there in tenured science professorships or in major prizes?

Asking each individual is the most reliable approach, but it is rarely practical for massive datasets of names. In those cases, ```Genderize``` offers a workable way to estimate gender at scale.

## ```pip install Genderize``` (a library available for many programming languages)

In [1]:
!pip install Genderize



## ```pip install icecream```

In [2]:
pip install icecream

Note: you may need to restart the kernel to use updated packages.


In [3]:
## import necessary libraries
import pandas as pd
from genderize import Genderize as gd
# from google.colab import files  ## to export our files to our computer drive
from icecream import ic 

# ```gd().get(list_name)```

In [4]:
### requires a list
gd().get(["Sandeep", "Francisco", "Sajina", "Andrew", "Eduardo", "Mrinalini"])

[{'count': 8968, 'name': 'Sandeep', 'gender': 'male', 'probability': 0.99},
 {'count': 612988, 'name': 'Francisco', 'gender': 'male', 'probability': 1.0},
 {'count': 18, 'name': 'Sajina', 'gender': 'female', 'probability': 0.94},
 {'count': 766466, 'name': 'Andrew', 'gender': 'male', 'probability': 1.0},
 {'count': 466869, 'name': 'Eduardo', 'gender': 'male', 'probability': 1.0},
 {'count': 1458, 'name': 'Mrinalini', 'gender': 'female', 'probability': 1.0}]

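### From the Genderize site:
The **probability** indicates the certainty of the assigned gender: roughly, the share of the examined records for that name that carry the assigned gender. Sajina's 0.94 above, for example, means about 94 percent of the records examined for that name were female.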

The **count** represents the number of <a href="https://genderize.io/our-data">data rows examined</a> in order to calculate the response.
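
To make these two fields concrete, here is a minimal sketch that reads them from records copied from the output above, with no call to the service. The helper name `describe` and the `min_count` cutoff of 100 are illustrative assumptions, not part of Genderize.

```python
# Records copied from the gd().get() output shown earlier (no network call).
sample = [
    {'count': 18, 'name': 'Sajina', 'gender': 'female', 'probability': 0.94},
    {'count': 766466, 'name': 'Andrew', 'gender': 'male', 'probability': 1.0},
]

def describe(record, min_count=100):
    """Return a one-line reading of a record (hypothetical helper).

    min_count is an arbitrary cutoff chosen for illustration only.
    """
    share = round(record["probability"] * 100)
    note = "small sample" if record["count"] < min_count else "large sample"
    return f'{record["name"]}: {share}% {record["gender"]} ({record["count"]} rows, {note})'

for record in sample:
    print(describe(record))
```

A high probability built on only a handful of rows, as with Sajina, may deserve the same manual check as a low probability.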



In [5]:
## does not work on a bare string: each character is treated as a separate name
gd().get("Sandeep")

[{'count': 17353, 'name': 'S', 'gender': 'male', 'probability': 0.67},
 {'count': 29102, 'name': 'a', 'gender': 'male', 'probability': 0.68},
 {'count': 7792, 'name': 'n', 'gender': 'male', 'probability': 0.61},
 {'count': 13793, 'name': 'd', 'gender': 'male', 'probability': 0.76},
 {'count': 9038, 'name': 'e', 'gender': 'male', 'probability': 0.65},
 {'count': 9038, 'name': 'e', 'gender': 'male', 'probability': 0.65},
 {'count': 9281, 'name': 'p', 'gender': 'male', 'probability': 0.76}]
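
The output above happens because a Python string is itself a sequence of characters, so each letter is looked up as if it were a name. A minimal sketch of a guard that wraps a bare string before it reaches `gd().get()`; the helper name `as_name_list` is hypothetical.

```python
def as_name_list(names):
    """Wrap a single string in a list; return any other iterable as a list."""
    if isinstance(names, str):
        return [names]
    return list(names)

print(list("Sandeep"))                 # what iterating over a bare string yields
print(as_name_list("Sandeep"))         # one name, safely wrapped
print(as_name_list(("Pat", "Susan")))  # other iterables pass through
```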

In [6]:
## But you can make a single name into a list
myname = "Sandeep"
myname_list = [myname]
type(myname_list)

list

In [7]:
myname_list

['Sandeep']

In [8]:
## you can now call the get() method on the list
gd().get(myname_list)

[{'count': 8968, 'name': 'Sandeep', 'gender': 'male', 'probability': 0.99}]

In [9]:
## Genderize works by analyzing first names and estimating their gender probability 
## run this cell
f_names = ['Rarin','Sandeep', 'Burak', 'Sahar', 'Yoshiko','Susan', 'Nabila','Pat', "Lupita", "Fahriye", "Joseph"]
f_names

['Rarin',
 'Sandeep',
 'Burak',
 'Sahar',
 'Yoshiko',
 'Susan',
 'Nabila',
 'Pat',
 'Lupita',
 'Fahriye',
 'Joseph']

In [10]:
## run it on f_names
gd().get(f_names)

[{'count': 14, 'name': 'Rarin', 'gender': 'female', 'probability': 1.0},
 {'count': 8968, 'name': 'Sandeep', 'gender': 'male', 'probability': 0.99},
 {'count': 104562, 'name': 'Burak', 'gender': 'male', 'probability': 0.98},
 {'count': 5532, 'name': 'Sahar', 'gender': 'female', 'probability': 0.95},
 {'count': 2993, 'name': 'Yoshiko', 'gender': 'female', 'probability': 1.0},
 {'count': 572708, 'name': 'Susan', 'gender': 'female', 'probability': 1.0},
 {'count': 19880, 'name': 'Nabila', 'gender': 'female', 'probability': 1.0},
 {'count': 42822, 'name': 'Pat', 'gender': 'male', 'probability': 0.58},
 {'count': 21048, 'name': 'Lupita', 'gender': 'female', 'probability': 1.0},
 {'count': 623, 'name': 'Fahriye', 'gender': 'female', 'probability': 0.97},
 {'count': 635168, 'name': 'Joseph', 'gender': 'male', 'probability': 1.0}]

In [11]:
## We can pull out specific data by specifying the keys using a for loop
for person in gd().get(f_names):
    ic(person.get("probability"))
#     ic(person)


ic| person.get("probability"): 1.0
ic| person.get("probability"): 0.99
ic| person.get("probability"): 0.98
ic| person.get("probability"): 0.95
ic| person.get("probability"): 1.0
ic| person.get("probability"): 1.0
ic| person.get("probability"): 1.0
ic| person.get("probability"): 0.58
ic| person.get("probability"): 1.0
ic| person.get("probability"): 0.97
ic| person.get("probability"): 1.0


In [12]:
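## FUNCTION to get gender data from genderize
def gender_data(name):
    '''
    takes a string name and returns gender data
    (a one-item list holding a dictionary)
    '''
    converted_to_list = [name]  # gd().get() expects a list, not a bare string
    result = gd().get(converted_to_list)  # renamed so it no longer shadows the function name
    return result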

In [13]:
## test it on "Sajina"
example_name = gender_data("Sajina")
example_name

[{'count': 18, 'name': 'Sajina', 'gender': 'female', 'probability': 0.94}]

In [14]:
example_name[0].get("gender")

'female'

In [15]:
## now use the function in our for loop


## Apply to a Pandas dataframe

<a href="https://drive.google.com/file/d/1zjr63n9lfcnJlkCUjX7wQvrCBn-qmfE7/view?usp=sharing">Download this sample</a> of "random" names.

In [16]:
## read csv file into pandas dataframe
## see the head
df = pd.read_csv("gender-names.csv")
df

Unnamed: 0,Name
0,Lupita Nyong’o
1,Rarin Thongma
2,Chang-jae Shin
3,Sandeep Junnarkar
4,Kalsoom Lakhani
5,Hyang-ja Yang
6,John Smock
7,Xiaoming Huang
8,Sahar Hafeez
9,Yoshiko Shinohara


In [17]:
## see the tail


## What's the problem here?

In [18]:
## split on the first space only, so names with more than two parts still fit two columns
df[["first", "last"]] = df["Name"].str.split(n=1, expand=True)
df

Unnamed: 0,Name,first,last
0,Lupita Nyong’o,Lupita,Nyong’o
1,Rarin Thongma,Rarin,Thongma
2,Chang-jae Shin,Chang-jae,Shin
3,Sandeep Junnarkar,Sandeep,Junnarkar
4,Kalsoom Lakhani,Kalsoom,Lakhani
5,Hyang-ja Yang,Hyang-ja,Yang
6,John Smock,John,Smock
7,Xiaoming Huang,Xiaoming,Huang
8,Sahar Hafeez,Sahar,Hafeez
9,Yoshiko Shinohara,Yoshiko,Shinohara
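
Splitting on whitespace is tidy only when every name has exactly two parts. A minimal sketch, in plain Python, of pulling out the first token however many parts a name has. The helper `first_name` is hypothetical, and "Mary Ann Smith" and "Cher" are made-up examples rather than rows from the sample file.

```python
def first_name(full_name):
    """Return the first whitespace-separated token of a name, or "" if it is blank."""
    parts = full_name.split(maxsplit=1)
    return parts[0] if parts else ""

for full_name in ["Lupita Nyong’o", "Chang-jae Shin", "Mary Ann Smith", "Cher"]:
    print(repr(first_name(full_name)))
```

In the notebook this could be applied with `df["Name"].apply(first_name)`. It still reduces "Mary Ann" to "Mary" and assumes the given name comes first, so treat it as a starting point.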


In [19]:
## keep only the columns we need (drops "last")
df = df[["Name", "first"]].copy()
df

Unnamed: 0,Name,first
0,Lupita Nyong’o,Lupita
1,Rarin Thongma,Rarin
2,Chang-jae Shin,Chang-jae
3,Sandeep Junnarkar,Sandeep
4,Kalsoom Lakhani,Kalsoom
5,Hyang-ja Yang,Hyang-ja
6,John Smock,John
7,Xiaoming Huang,Xiaoming
8,Sahar Hafeez,Sahar
9,Yoshiko Shinohara,Yoshiko


In [20]:
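## function to take a string name, convert to list and return gender
## NOTICE it taps our earlier gender_data() function
def gender_estimate(name):
    '''
    takes a string name, returns the estimated gender in upper case
    using the gender_data function ("UNKNOWN" if no gender comes back)
    '''
    gender = gender_data(name)[0].get("gender")
    # guard against a missing gender, which would make .upper() fail
    return gender.upper() if gender else "UNKNOWN"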

In [21]:
## Test on "Sandeep"
gender_estimate("Sandeep")

'MALE'

In [22]:
## apply as a lambda expression on our dataframe
df["gender"] = df["first"].apply(lambda x: gender_estimate(x))
df

Unnamed: 0,Name,first,gender
0,Lupita Nyong’o,Lupita,FEMALE
1,Rarin Thongma,Rarin,FEMALE
2,Chang-jae Shin,Chang-jae,MALE
3,Sandeep Junnarkar,Sandeep,MALE
4,Kalsoom Lakhani,Kalsoom,FEMALE
5,Hyang-ja Yang,Hyang-ja,FEMALE
6,John Smock,John,MALE
7,Xiaoming Huang,Xiaoming,MALE
8,Sahar Hafeez,Sahar,FEMALE
9,Yoshiko Shinohara,Yoshiko,FEMALE


## But we need a sense of the probability

In [23]:
## function to return probability
## NOTICE it taps our earlier gender_data() function
def gender_prob(name):
    '''
    takes a string name, returns gender probability based on using gender_data function
    '''
    return gender_data(name)[0].get("probability")

In [24]:
## test probability on "Sandeep"
gender_prob("Sandeep")

0.99

In [25]:
## create a new column called "probability" in our df
df["probability"] = df["first"].apply(lambda x: gender_prob(x))
df

Unnamed: 0,Name,first,gender,probability
0,Lupita Nyong’o,Lupita,FEMALE,1.0
1,Rarin Thongma,Rarin,FEMALE,1.0
2,Chang-jae Shin,Chang-jae,MALE,1.0
3,Sandeep Junnarkar,Sandeep,MALE,0.99
4,Kalsoom Lakhani,Kalsoom,FEMALE,0.97
5,Hyang-ja Yang,Hyang-ja,FEMALE,1.0
6,John Smock,John,MALE,1.0
7,Xiaoming Huang,Xiaoming,MALE,0.84
8,Sahar Hafeez,Sahar,FEMALE,0.95
9,Yoshiko Shinohara,Yoshiko,FEMALE,1.0
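
Each `apply` above makes one lookup per row, and the gender and probability columns repeat the same lookups. A rough sketch of looking each distinct name up only once and reusing the result. `lookup_once` is a hypothetical helper, and `fake_lookup` is a stand-in for `gender_data` with placeholder values so the example runs offline.

```python
calls = []  # records every simulated lookup so we can count them

def fake_lookup(name):
    """Offline stand-in for gender_data(); the record values are placeholders."""
    calls.append(name)
    return [{'name': name, 'gender': 'female', 'probability': 1.0, 'count': 1}]

def lookup_once(names, lookup):
    """Call lookup() a single time per distinct name and return {name: record}."""
    cache = {}
    for name in names:
        if name not in cache:
            cache[name] = lookup(name)[0]
    return cache

first_names = ["Sahar", "Yoshiko", "Sahar", "Yoshiko", "Lupita"]
records = lookup_once(first_names, fake_lookup)

# Both columns are built from the same cached records.
genders = [records[n]["gender"].upper() for n in first_names]
probabilities = [records[n]["probability"] for n in first_names]
print(len(first_names), "rows,", len(calls), "lookups")
```

With a real dataframe, the two lists could be assigned to the `gender` and `probability` columns directly.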


In [26]:
### FUNCTION to flag low-certainty estimates
## NOTICE it taps our earlier gender_data() function

def gender_certainty(name, min_prob):
    '''
    takes a string name and a minimum probability (min_prob);
    returns "pass" if the gender probability is at least min_prob, otherwise "flag"
    '''
    prob = gender_data(name)[0].get("probability")
    if prob >= min_prob:
        return "pass"
    else:
        return "flag"

In [27]:
## get gender data for the name "Pat"
gender_data("Pat")

[{'count': 42822, 'name': 'Pat', 'gender': 'male', 'probability': 0.58}]

In [28]:
## get probability on name "Pat"
gender_prob("Pat")

0.58

In [29]:
## get gender certainty on name "Pat"
gender_certainty("Pat", 0.57)

'pass'
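
The threshold decides how much manual checking you take on. A small sketch, using the probabilities copied from the dataframe shown earlier, of how the flagged list grows as the threshold rises. The helper `flagged` is hypothetical and makes no call to the service.

```python
# (first name, probability) pairs copied from the dataframe shown earlier.
rows = [
    ("Lupita", 1.0), ("Rarin", 1.0), ("Chang-jae", 1.0), ("Sandeep", 0.99),
    ("Kalsoom", 0.97), ("Hyang-ja", 1.0), ("John", 1.0), ("Xiaoming", 0.84),
    ("Sahar", 0.95), ("Yoshiko", 1.0),
]

def flagged(rows, min_prob):
    """Return the names whose probability falls below min_prob."""
    return [name for name, prob in rows if prob < min_prob]

for threshold in (0.80, 0.95, 0.99):
    print(threshold, flagged(rows, threshold))
```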

In [30]:
## create a column called "Certainty_95" in our df
df["Certainty_95"] = df["first"].apply(lambda x: gender_certainty(x, 0.95))
df

Unnamed: 0,Name,first,gender,probability,Certainty_95
0,Lupita Nyong’o,Lupita,FEMALE,1.0,pass
1,Rarin Thongma,Rarin,FEMALE,1.0,pass
2,Chang-jae Shin,Chang-jae,MALE,1.0,pass
3,Sandeep Junnarkar,Sandeep,MALE,0.99,pass
4,Kalsoom Lakhani,Kalsoom,FEMALE,0.97,pass
5,Hyang-ja Yang,Hyang-ja,FEMALE,1.0,pass
6,John Smock,John,MALE,1.0,pass
7,Xiaoming Huang,Xiaoming,MALE,0.84,flag
8,Sahar Hafeez,Sahar,FEMALE,0.95,pass
9,Yoshiko Shinohara,Yoshiko,FEMALE,1.0,pass


In [32]:
df["Certainty_85"] = df["first"].apply(lambda x: gender_certainty(x, 0.85))
df

Unnamed: 0,Name,first,gender,probability,Certainty_95,Certainty_85
0,Lupita Nyong’o,Lupita,FEMALE,1.0,pass,pass
1,Rarin Thongma,Rarin,FEMALE,1.0,pass,pass
2,Chang-jae Shin,Chang-jae,MALE,1.0,pass,pass
3,Sandeep Junnarkar,Sandeep,MALE,0.99,pass,pass
4,Kalsoom Lakhani,Kalsoom,FEMALE,0.97,pass,pass
5,Hyang-ja Yang,Hyang-ja,FEMALE,1.0,pass,pass
6,John Smock,John,MALE,1.0,pass,pass
7,Xiaoming Huang,Xiaoming,MALE,0.84,flag,flag
8,Sahar Hafeez,Sahar,FEMALE,0.95,pass,pass
9,Yoshiko Shinohara,Yoshiko,FEMALE,1.0,pass,pass


## Slice flagged items for a manual check

In [33]:
## write pandas to create a slice
df.query("Certainty_95 == 'flag'")

Unnamed: 0,Name,first,gender,probability,Certainty_95,Certainty_85
7,Xiaoming Huang,Xiaoming,MALE,0.84,flag,flag
13,Pat Smith,Pat,MALE,0.58,flag,flag
14,Thuy Vo,Thuy,FEMALE,0.92,flag,pass
15,Dana Collins,Dana,FEMALE,0.94,flag,pass
17,Mingzhu Dong,Mingzhu,FEMALE,0.53,flag,flag
19,Trang Phuong,Trang,FEMALE,0.91,flag,pass


In [35]:
## after a manual check, correct the gender for row 13 (Pat Smith) and clear its flag
df.at[13, "gender"] = "FEMALE"
df.at[13, "Certainty_95"] = "pass"
df

Unnamed: 0,Name,first,gender,probability,Certainty_95,Certainty_85
0,Lupita Nyong’o,Lupita,FEMALE,1.0,pass,pass
1,Rarin Thongma,Rarin,FEMALE,1.0,pass,pass
2,Chang-jae Shin,Chang-jae,MALE,1.0,pass,pass
3,Sandeep Junnarkar,Sandeep,MALE,0.99,pass,pass
4,Kalsoom Lakhani,Kalsoom,FEMALE,0.97,pass,pass
5,Hyang-ja Yang,Hyang-ja,FEMALE,1.0,pass,pass
6,John Smock,John,MALE,1.0,pass,pass
7,Xiaoming Huang,Xiaoming,MALE,0.84,flag,flag
8,Sahar Hafeez,Sahar,FEMALE,0.95,pass,pass
9,Yoshiko Shinohara,Yoshiko,FEMALE,1.0,pass,pass
