### An acknowledgment to our Non-Binary community

```Genderize``` is based on the theory that analyzing a first name can help estimate someone’s gender. But that really applies only in a binary world in which a name is either male or female. We **don’t** live in a binary world and this approach risks erasing the identities of our non-binary community members. 

The reality is that one’s name does not determine one’s gender.

As journalists, we may find ourselves working on critical projects where we need to know gender identities. For example:

- How many refugees in a camp are male and how many are female? 
- What is the gender diversity of top executive leadership in an industry? 
- What form of gender equality is there in tenured science professorships or in major prizes?

We could certainly ask each individual, but that is rarely feasible at scale. ```Genderize``` is a practical way to estimate gender across massive datasets of names.

# Gender Estimator

### Some stories are based on knowing gender distribution. Here are some examples:

* **Washington Post**: Here’s how Hillary Clinton knows that 61 percent of her donors were women [<a href="https://www.washingtonpost.com/news/the-fix/wp/2015/07/16/heres-how-hillary-clinton-knows-that-61-percent-of-her-donors-were-women">Link</a>]

* **The Atlantic**: When Will the Gender Gap in Science Disappear? [<a href="https://www.theatlantic.com/science/archive/2018/04/when-will-the-gender-gap-in-science-disappear/558413/">Link</a>]

* **Bloomberg News**: Record Numbers of Women Running for Office [<a href="https://www.bloomberg.com/graphics/2018-women-candidates/">Link</a>]

* **The Guardian**: How we analysed 70m comments on the Guardian website [<a href="https://www.theguardian.com/technology/2016/apr/12/how-we-analysed-70m-comments-guardian-website">Link</a>]

### Gender is **estimated** based on a person's first name.

## ```pip install Genderize``` (a library available for many programming languages)

In [1]:
!pip install Genderize



In [2]:
## import necessary libraries
import pandas as pd
from genderize import Genderize as gd
# from google.colab import files  ## to export our files to our computer drive

# ```gd().get(list_name)```

In [3]:
### requires a list
gd().get(["Sahar"])

[{'name': 'Sahar', 'gender': 'female', 'probability': 0.95, 'count': 5188}]

### From the Genderize site:
The **probability** indicates the certainty of the assigned gender. It is basically the ratio of males to females among the data rows for that name.

The **count** represents the number of <a href="https://genderize.io/our-data">data rows examined</a> in order to calculate the response.



In [4]:
## does not work on a bare string: it is read as a sequence of letters, one lookup per character
gd().get("Sahar")

[{'name': 'S', 'gender': 'male', 'probability': 0.68, 'count': 10151},
 {'name': 'a', 'gender': 'male', 'probability': 0.68, 'count': 19908},
 {'name': 'h', 'gender': 'male', 'probability': 0.76, 'count': 5606},
 {'name': 'a', 'gender': 'male', 'probability': 0.68, 'count': 19908},
 {'name': 'r', 'gender': 'male', 'probability': 0.65, 'count': 7424}]
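The letter-by-letter output above happens because a Python string is itself a sequence of characters. A minimal sketch of a guard against that, using a hypothetical helper named `as_name_list` (it is not part of the Genderize library):

```python
def as_name_list(names):
    """Return names as a list, wrapping a lone string instead of splitting it."""
    if isinstance(names, str):
        return [names]
    return list(names)


print(list("Sahar"))                   # a string iterates letter by letter
print(as_name_list("Sahar"))           # the helper keeps the name whole
print(as_name_list(("Sahar", "Pat")))  # other sequences become plain lists
```

Passing the helper's result to ```gd().get()``` would then behave the same way for a single name or a list of names.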

In [6]:
## But you can make a single name into a list
myname = "Sahar"
myname_list = [myname]
print(type(myname_list))
myname_list

<class 'list'>


['Sahar']

In [7]:
## you can now call the genderize method on the list
gd().get(myname_list)

[{'name': 'Sahar', 'gender': 'female', 'probability': 0.95, 'count': 5188}]

In [8]:
## Genderize works by analyzing first names and estimating their gender probability 
## run this cell
f_names = ['Rarin','Sandeep', 'Sahar', 'Yoshiko','Susan', 'Nabila','Pat', "Lupita"]
f_names

['Rarin', 'Sandeep', 'Sahar', 'Yoshiko', 'Susan', 'Nabila', 'Pat', 'Lupita']

In [9]:
## run it on f_names
gd().get(f_names)

[{'name': 'Rarin', 'gender': 'female', 'probability': 1.0, 'count': 13},
 {'name': 'Sandeep', 'gender': 'male', 'probability': 0.98, 'count': 4494},
 {'name': 'Sahar', 'gender': 'female', 'probability': 0.95, 'count': 5188},
 {'name': 'Yoshiko', 'gender': 'female', 'probability': 0.98, 'count': 463},
 {'name': 'Susan', 'gender': 'female', 'probability': 0.98, 'count': 32184},
 {'name': 'Nabila', 'gender': 'female', 'probability': 0.98, 'count': 4793},
 {'name': 'Pat', 'gender': 'male', 'probability': 0.67, 'count': 26734},
 {'name': 'Lupita', 'gender': 'female', 'probability': 0.98, 'count': 975}]
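A high probability is not the whole story: in the output above, "Rarin" has a probability of 1.0 but is based on only 13 data rows. The sketch below flags results with a small count. It works on responses copied from the output above (no API call), and the cutoff of 100 rows is an assumption chosen only for illustration.

```python
# Sample responses copied from the notebook output above; no API call is made.
results = [
    {'name': 'Rarin', 'gender': 'female', 'probability': 1.0, 'count': 13},
    {'name': 'Sandeep', 'gender': 'male', 'probability': 0.98, 'count': 4494},
    {'name': 'Yoshiko', 'gender': 'female', 'probability': 0.98, 'count': 463},
    {'name': 'Pat', 'gender': 'male', 'probability': 0.67, 'count': 26734},
    {'name': 'Lupita', 'gender': 'female', 'probability': 0.98, 'count': 975},
]

MIN_COUNT = 100  # assumed cutoff, for illustration only

# keep only the results that rest on fewer than MIN_COUNT data rows
thin = [r for r in results if r["count"] < MIN_COUNT]
for r in thin:
    print(f"{r['name']}: probability {r['probability']} from only {r['count']} rows")
```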

In [11]:
## We can pull out specific data by specifying the keys using a for loop
for person in gd().get(f_names):
    print(f"{person.get('name')}: {person.get('gender')}")

Rarin: female
Sandeep: male
Sahar: female
Yoshiko: female
Susan: female
Nabila: female
Pat: male
Lupita: female
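Since the stories listed earlier depend on gender distribution, the same loop can feed a tally. This is a minimal sketch using ```collections.Counter``` on responses copied from the output above, so it runs without calling the API.

```python
from collections import Counter

# Sample responses copied from the notebook output above; no API call is made.
results = [
    {'name': 'Rarin', 'gender': 'female', 'probability': 1.0, 'count': 13},
    {'name': 'Sandeep', 'gender': 'male', 'probability': 0.98, 'count': 4494},
    {'name': 'Sahar', 'gender': 'female', 'probability': 0.95, 'count': 5188},
    {'name': 'Yoshiko', 'gender': 'female', 'probability': 0.98, 'count': 463},
    {'name': 'Susan', 'gender': 'female', 'probability': 0.98, 'count': 32184},
    {'name': 'Nabila', 'gender': 'female', 'probability': 0.98, 'count': 4793},
    {'name': 'Pat', 'gender': 'male', 'probability': 0.67, 'count': 26734},
    {'name': 'Lupita', 'gender': 'female', 'probability': 0.98, 'count': 975},
]

# count how many names fall under each estimated gender
counts = Counter(person.get("gender") for person in results)
total = sum(counts.values())
for gender, n in counts.most_common():
    print(f"{gender}: {n} of {total} ({n / total:.0%})")
```

Keep in mind that the tally inherits every uncertain estimate, such as "Pat" at 0.67.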


In [12]:
## FUNCTION to get gender data from genderize
def gender_data(a_item):
    ## wrap a single name in a list; pass a list of names through as-is
    names = [a_item] if isinstance(a_item, str) else list(a_item)
    return gd().get(names)

In [13]:
## test it on "Sandeep"
gender_data("Sandeep")

[{'name': 'Sandeep', 'gender': 'male', 'probability': 0.98, 'count': 4494}]

In [16]:
## now use the function in our for loop
for person in gender_data(f_names):
    print(f"{person.get('name')}: {person.get('gender')}")

Rarin: female
Sandeep: male
Sahar: female
Yoshiko: female
Susan: female
Nabila: female
Pat: male
Lupita: female


## Apply to a Pandas dataframe

In [None]:
## COLAB ONLY
## upload the CSV file week-12_names.csv
## requires the import at the top: from google.colab import files
files.upload()

In [26]:
## read csv file into pandas dataframe
## see the head
df = pd.read_csv("week-12_names.csv")
df.head(13)

Unnamed: 0,Name
0,Lupita Nyong’o
1,Rarin Thongma
2,Chang-jae Shin
3,Sandeep Junnarkar
4,Kalsoom Lakhani
5,Hyang-ja Yang
6,John Smock
7,Xiaoming Huang
8,Sahar Hafeez
9,Yoshiko Shinohara


In [20]:
## see the tail
df.tail()

Unnamed: 0,Name
13,Pat Smith
14,Dana Collins
15,Toshiro Mifune
16,Mingzhu Dong
17,Adewale Akinnuoye-Agbaje


## What's the problem here?

In [28]:
## Split the first and last name into separate columns
## n=1 splits on the first space only, so every name yields exactly two parts
df[["First", "Last"]] = df["Name"].str.split(n=1, expand=True)
df

Unnamed: 0,Name,First,Last
0,Lupita Nyong’o,Lupita,Nyong’o
1,Rarin Thongma,Rarin,Thongma
2,Chang-jae Shin,Chang-jae,Shin
3,Sandeep Junnarkar,Sandeep,Junnarkar
4,Kalsoom Lakhani,Kalsoom,Lakhani
5,Hyang-ja Yang,Hyang-ja,Yang
6,John Smock,John,Smock
7,Xiaoming Huang,Xiaoming,Huang
8,Sahar Hafeez,Sahar,Hafeez
9,Yoshiko Shinohara,Yoshiko,Shinohara
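Names do not always split cleanly into exactly two words. A minimal pure-Python sketch of splitting on the first run of whitespace only, using a hypothetical helper named `split_name`; the three-word name in the example is invented for illustration and is not in the dataset.

```python
def split_name(full_name):
    """Split on the first run of whitespace and return (first, rest)."""
    parts = full_name.strip().split(None, 1)
    first = parts[0] if parts else ""
    last = parts[1] if len(parts) > 1 else ""
    return first, last


print(split_name("Lupita Nyong’o"))
print(split_name("Chang-jae Shin"))     # the hyphenated first name stays whole
print(split_name("Maria de la Cruz"))   # invented example: extra words go to the last name
print(split_name("Cher"))               # a single word leaves the last name empty
```

This treats only the first word as the first name, so two-word first names would still be cut short and need a manual check.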


In [29]:
## reorder the columns
df = df[["Last", "First"]].copy()
df

Unnamed: 0,Last,First
0,Nyong’o,Lupita
1,Thongma,Rarin
2,Shin,Chang-jae
3,Junnarkar,Sandeep
4,Lakhani,Kalsoom
5,Yang,Hyang-ja
6,Smock,John
7,Huang,Xiaoming
8,Hafeez,Sahar
9,Shinohara,Yoshiko


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Last    18 non-null     object
 1   First   18 non-null     object
dtypes: object(2)
memory usage: 416.0+ bytes


In [None]:
## FUNCTION to get gender data from genderize
def gender_data(a_item):
    ## wrap a single name in a list; pass a list of names through as-is
    names = [a_item] if isinstance(a_item, str) else list(a_item)
    return gd().get(names)

In [31]:
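## function to take a string name, convert to list and return gender
## NOTICE it taps our earlier gender_data() function
def gender_estimate(a_item):
    '''
    function to take a string name, convert to list and return gender
    uses our gender_data() function
    '''
    gender = gender_data(a_item)[0].get("gender")
    ## genderize returns None for a name it does not recognize,
    ## so we label those "UNKNOWN" (our own label) instead of crashing on .upper()
    return gender.upper() if gender else "UNKNOWN"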

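For reference, ```gender_data("Sandeep")``` returns a one-item list, so ```[0]``` picks out the dictionary inside it: ```[{'name': 'Sandeep', 'gender': 'male', 'probability': 0.98, 'count': 4494}]```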

In [32]:
## Test on "Sandeep"
gender_estimate("Sandeep")

'MALE'

In [33]:
## apply as a lambda expression on our dataframe
## the column holds the estimated gender, so we name it "Gender"
df["Gender"] = df["First"].apply(lambda x: gender_estimate(x))
df

Unnamed: 0,Last,First,Gender
0,Nyong’o,Lupita,FEMALE
1,Thongma,Rarin,FEMALE
2,Shin,Chang-jae,MALE
3,Junnarkar,Sandeep,MALE
4,Lakhani,Kalsoom,FEMALE
5,Yang,Hyang-ja,FEMALE
6,Smock,John,MALE
7,Huang,Xiaoming,MALE
8,Hafeez,Sahar,FEMALE
9,Shinohara,Yoshiko,FEMALE
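Applying the function row by row sends one request per row, even when a first name repeats. A minimal sketch of looking up each distinct name once and reusing the answer. The helper `lookup_unique` and the offline stand-in `fake_lookup` are hypothetical names introduced here so the example runs without the API.

```python
# Canned genders copied from the notebook output; a stand-in for the real API.
SAMPLE = {"Sahar": "female", "Sandeep": "male", "Pat": "male"}
calls = []  # records every lookup so we can count them


def fake_lookup(names):
    """Hypothetical offline replacement for gd().get()."""
    calls.append(list(names))
    return [{"name": n, "gender": SAMPLE.get(n)} for n in names]


def lookup_unique(names, lookup):
    """Query each distinct name once and return {name: gender}."""
    unique = list(dict.fromkeys(names))  # de-duplicate, keep first-seen order
    return {r["name"]: r.get("gender") for r in lookup(unique)}


firsts = ["Sahar", "Pat", "Sahar", "Sandeep", "Pat"]
mapping = lookup_unique(firsts, fake_lookup)
genders = [mapping[n] for n in firsts]
print(genders)
print(len(calls), "lookup call for", len(firsts), "rows")
```

In the notebook, something like ```lookup_unique(df["First"], gd().get)``` followed by ```df["First"].map(mapping)``` should give the same column with fewer requests, assuming the API echoes each name back unchanged as it does in the outputs above.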


In [35]:
def add(num):
    return num + num

In [36]:
add(2)

4

## But we need a sense of the probability

In [None]:
## function to return probability
## NOTICE it taps our earlier gender_data() function
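One possible way to fill in this cell, offered as a sketch and not as the notebook's official answer. To run offline, the block defines a stand-in ```gender_data()``` that returns responses copied from the outputs above; in the notebook you would keep the real ```gender_data()``` and add only ```gender_probability()```, a name assumed from the later cell's comment.

```python
# Offline stand-in for the notebook's gender_data(); responses copied from the outputs above.
_SAMPLE = {
    "Sandeep": {'name': 'Sandeep', 'gender': 'male', 'probability': 0.98, 'count': 4494},
    "Pat": {'name': 'Pat', 'gender': 'male', 'probability': 0.67, 'count': 26734},
}


def gender_data(a_item):
    """Stand-in: return a one-item list shaped like the genderize response."""
    return [_SAMPLE[a_item]]


def gender_probability(a_item):
    """Return the probability reported for a single first name."""
    return gender_data(a_item)[0].get("probability")


print(gender_probability("Sandeep"))
print(gender_probability("Pat"))
```

With the real ```gender_data()``` in place, ```df["Probability"] = df["First"].apply(gender_probability)``` would create the column the next cells ask for.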


In [None]:
## test probability on "Sandeep"


In [None]:
## create new column called "Probability" in our df


In [None]:
### FUNCTION to return certainty
## NOTICE it taps our earlier gender_data() AND gender_probability() functions
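A sketch of what a certainty function could look like, again with an offline stand-in for ```gender_data()``` so it runs without the API. The 0.9 threshold and the "OK" / "CHECK" labels are assumptions for illustration; pick values that suit your project.

```python
# Offline stand-in for the notebook's gender_data(); responses copied from the outputs above.
_SAMPLE = {
    "Sandeep": {'name': 'Sandeep', 'gender': 'male', 'probability': 0.98, 'count': 4494},
    "Pat": {'name': 'Pat', 'gender': 'male', 'probability': 0.67, 'count': 26734},
}


def gender_data(a_item):
    """Stand-in: return a one-item list shaped like the genderize response."""
    return [_SAMPLE[a_item]]


def gender_probability(a_item):
    """Return the probability reported for a single first name."""
    return gender_data(a_item)[0].get("probability")


def gender_certainty(a_item, threshold=0.9):
    """Return 'OK' when the probability meets the threshold, else 'CHECK'."""
    probability = gender_probability(a_item)
    if probability is None or probability < threshold:
        return "CHECK"
    return "OK"


print("Sandeep:", gender_certainty("Sandeep"))
print("Pat:", gender_certainty("Pat"))
```

"Pat" comes back as "CHECK" because its probability of 0.67 falls below the assumed threshold, which is the kind of name worth verifying by hand.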



In [None]:
## get gender on name "Pat"


In [None]:
## get probability on name "Pat"


In [None]:
## get gender certainty on name "Pat"


In [None]:
## create a column called "Certainty" in our df


## Slice flagged items for a manual check

In [None]:
## write pandas to create a slice
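A minimal sketch of the slice, assuming the dataframe already has a "Certainty" column with the labels "OK" and "CHECK". Those column values, the 0.9 threshold, and the small stand-in dataframe are assumptions; the names and probabilities are copied from the outputs above.

```python
import pandas as pd

# Small stand-in dataframe; names and probabilities copied from the notebook output.
df = pd.DataFrame({
    "First": ["Sandeep", "Sahar", "Pat"],
    "Gender": ["MALE", "FEMALE", "MALE"],
    "Probability": [0.98, 0.95, 0.67],
})

# label each row using an assumed threshold of 0.9
df["Certainty"] = df["Probability"].apply(lambda p: "OK" if p >= 0.9 else "CHECK")

# boolean mask: keep only the rows flagged for a manual check
flagged = df[df["Certainty"] == "CHECK"].copy()
print(flagged)
```

The ```.copy()``` keeps the slice independent of the original dataframe, so notes added during the manual check do not trigger pandas warnings.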
