# Guessing a user's gender from their username

Some prerequisites and notes:
* pip install gender-guesser
* Download https://github.com/tue-mdse/genderComputer and unpack it in the same folder as this notebook.
* Knowing the user's location improves accuracy (Andrea from Italy vs. Andrea from Serbia).
* If you want to use locations, they need to be in the standard "full name of country" format: use "united kingdom", not "uk". This notebook includes code that converts 2-letter country codes into full country names for the gender computer. A file that helps with that is available here: https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv
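
The notebook does the code-to-name conversion with a pandas merge further down. As a standalone illustration of the same idea, here is a minimal sketch using only the standard library. The helper name `build_country_lookup` and the inline sample are assumptions for the example; the three rows are taken from the `country_data.head()` output shown later, and the real `all.csv` has more columns.

```python
import csv
import io

# Sample rows in the layout of all.csv (only the columns used here).
SAMPLE_CSV = """name,alpha-2,alpha-3
Afghanistan,AF,AFG
Albania,AL,ALB
Algeria,DZ,DZA
"""

def build_country_lookup(csv_file):
    """Map lower-cased alpha-2 codes to full country names (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    return {row["alpha-2"].lower(): row["name"] for row in reader}

country_by_code = build_country_lookup(io.StringIO(SAMPLE_CSV))
print(country_by_code.get("af"))  # full name for a code present in the sample
print(country_by_code.get("uk"))  # None: "uk" is not in the sample table
```

Codes that return `None` here correspond to rows that end up with a missing `country_name` after the left merge in the notebook.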

In [None]:
import pandas as pd
import numpy as np
import csv
from genderComputer import GenderComputer
import gender_guesser.detector as gender

In [4]:
active_users = pd.read_csv('../data/processed/active_users.csv')


## Gender Guesser

In [53]:
genderguesser_lookup = []
d = gender.Detector()

# go through every user in the dataset
for index, row in active_users.iterrows():
    try:
        gender_guess = d.get_gender(row.DisplayName)
        
        # Gender guesser can't handle multiple words. Assume people use first name first (Andrea Sipka)
        # This solution can be greatly improved upon but I am choosing quick and dirty
        if gender_guess == 'unknown':
            
            # split string by space
            for item in row.DisplayName.split(' '):
                
                gender_guess = d.get_gender(item)
                # if we managed to gender the user, break, otherwise continue to the next word
                if gender_guess != 'unknown':
                    break
            
        genderguesser_lookup.append([row.DisplayName, gender_guess])
        
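    except Exception:
        # fall back to an empty label if the name cannot be processed (e.g. a missing DisplayName)
        genderguesser_lookup.append([row.DisplayName, ''])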
        
# Save as dataframe
genderguesser_df = pd.DataFrame(genderguesser_lookup, columns=['DisplayName', 'Gender'])
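
The word-by-word fallback in the loop above can also be written as a small function, which makes it easier to test on its own. This is only a sketch: `guess_from_display_name` is a hypothetical helper, and `toy_get_gender` with its tiny lookup table is a made-up stand-in for `d.get_gender`, not real gender-guesser output.

```python
def guess_from_display_name(display_name, get_gender, unknown="unknown"):
    """Try the whole name first, then each space-separated word in order."""
    guess = get_gender(display_name)
    if guess != unknown:
        return guess
    for word in display_name.split():
        guess = get_gender(word)
        # stop at the first word that can be gendered
        if guess != unknown:
            return guess
    return unknown

# Toy stand-in for gender.Detector().get_gender (hypothetical data).
TOY_NAMES = {"Jon": "male", "Andrea": "mostly_female"}

def toy_get_gender(name):
    return TOY_NAMES.get(name, "unknown")

print(guess_from_display_name("Jon Galloway", toy_get_gender))
print(guess_from_display_name("user3734728", toy_get_gender))
```

In the notebook, `d.get_gender` would be passed as the `get_gender` argument.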

## Get Country data

In [None]:
country_data = pd.read_csv('data/external/country_codes.csv')

In [148]:
country_data = country_data[['name', 'alpha-2', 'alpha-3', 'region', 'sub-region']]

In [150]:
country_data['code'] = country_data['alpha-2'].str.lower()
country_data['code3'] = country_data['alpha-3'].str.lower()

In [152]:
country_data.head()

Unnamed: 0,name,alpha-2,alpha-3,region,sub-region,code,code3
0,Afghanistan,AF,AFG,Asia,Southern Asia,af,afg
1,Åland Islands,AX,ALA,Europe,Northern Europe,ax,ala
2,Albania,AL,ALB,Europe,Southern Europe,al,alb
3,Algeria,DZ,DZA,Africa,Northern Africa,dz,dza
4,American Samoa,AS,ASM,Oceania,Polynesia,as,asm


In [153]:
country_data = country_data[['name', 'region', 'sub-region', 'code', 'code3']]

In [155]:
active_users = pd.merge(active_users, country_data, on='code', how='left')

In [158]:
active_users = active_users.rename(columns={"name": "country_name"})

## Gender computer

In [174]:
gender_lookup = []

gc = GenderComputer()

# go through every user in the dataset
for index, row in active_users.iterrows():
    try:
        gender_lookup.append([row.DisplayName, gc.resolveGender(row.DisplayName, row.country_name)])
    except Exception:
        # fall back to an empty label if the name or country cannot be processed
        gender_lookup.append([row.DisplayName, ''])
        
# Save as dataframe
gender_df = pd.DataFrame(gender_lookup, columns=['DisplayName', 'Gender'])

In [176]:
gender_df.shape

(984275, 2)

In [180]:
gender_df = pd.merge(gender_df, genderguesser_df, on='DisplayName', how='left')

In [182]:
gender_df.shape

(24639517, 3)

In [183]:
gender_df = gender_df.drop_duplicates(subset=['DisplayName'])

In [184]:
gender_df.shape

(792040, 3)
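
The jump from 984275 to 24639517 rows happens because `DisplayName` is not unique in either frame: a name that occurs k times on both sides produces k × k rows in the merge. A minimal sketch of that arithmetic, using a hypothetical helper and made-up names:

```python
from collections import Counter

def left_merge_row_count(left_keys, right_keys):
    """Rows produced by a left join on a single key column (hypothetical helper)."""
    right_counts = Counter(right_keys)
    # Each left row pairs with every matching right row, or is kept once if none match.
    return sum(max(1, right_counts[key]) for key in left_keys)

# Hypothetical display names: "user" appears three times on both sides.
left = ["user", "user", "user", "Jon Galloway"]
right = ["user", "user", "user", "Jon Galloway"]
print(left_merge_row_count(left, right))            # 3*3 + 1 = 10
print(left_merge_row_count(set(left), set(right)))  # 2 once both sides are de-duplicated
```

One way to avoid the large intermediate frame would be to call `drop_duplicates(subset=['DisplayName'])` on both frames before merging; passing `validate='one_to_one'` to `pd.merge` would raise an error if duplicate keys remain.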

## Conservative merge of the two methods

In [186]:
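# Gender_x comes from the gender computer, Gender_y from gender guesser (pandas merge suffixes)
# assign a label only if both methods agree; rows matching no condition get np.select's default of 0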
conditions = [
    (gender_df['Gender_x'] == 'male') & (gender_df['Gender_y'] == 'male'),
    (gender_df['Gender_x'] == 'male') & (gender_df['Gender_y'] == 'mostly_male'),
    (gender_df['Gender_x'] == 'female') & (gender_df['Gender_y'] == 'female'),
    (gender_df['Gender_x'] == 'female') & (gender_df['Gender_y'] == 'mostly_female')]

# create a list of the values we want to assign for each condition
values = ['male', 'probably_male', 'female', 'probably_female']

# create a new column and use np.select to assign values to it using our lists as arguments
gender_df['Gender'] = np.select(conditions, values)
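
The agreement rule above can be read as a lookup on the pair of labels. The sketch below expresses it in plain Python with an explicit `'unknown'` fallback instead of 0; the function name and the fallback label are assumptions for illustration, not what the notebook uses.

```python
# (gender computer label, gender guesser label) -> combined label
AGREEMENT = {
    ("male", "male"): "male",
    ("male", "mostly_male"): "probably_male",
    ("female", "female"): "female",
    ("female", "mostly_female"): "probably_female",
}

def combine_guesses(gender_computer_label, gender_guesser_label, default="unknown"):
    """Return a combined label only when the two methods agree (hypothetical helper)."""
    return AGREEMENT.get((gender_computer_label, gender_guesser_label), default)

print(combine_guesses("male", "male"))
print(combine_guesses("male", "unknown"))
print(combine_guesses("female", "mostly_male"))
```

With `np.select`, the same fallback could be set through its `default` argument, which would replace the 0 values seen in the outputs below with a string label.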

In [187]:
gender_df.sample(30)

Unnamed: 0,DisplayName,Gender_x,Gender_y,Gender
17942806,user3734728,,unknown,0
24384455,giangian,,unknown,0
17194004,tgrrr,,unknown,0
11586379,Patrickdev,,unknown,0
9146703,user949738,,unknown,0
17055591,Keegan Lillo,male,male,male
18999921,gold,male,unknown,0
21004454,Ehsan Ahmadi,male,male,male
22485911,Terixer,,unknown,0
21956917,sp_omer,,unknown,0


In [None]:
active_users = pd.merge(active_users, gender_df[['DisplayName', 'Gender']], on='DisplayName', how='left')

In [207]:
active_users.head()

Unnamed: 0,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,...,country,code,cluster_label_10,cluster_label_40,cluster_label_60,country_name,region,sub-region,code3,Gender
0,59111,2008-07-31T14:22:31Z,Jeff Atwood,2020-05-02T18:23:48Z,http://www.codinghorror.com/blog/,"El Cerrito, CA","<p><a href=""http://www.codinghorror.com/blog/a...",548898,3378,1311,...,United States of America,us,3.0,5.0,1.0,United States of America,Americas,Northern America,usa,male
1,5632,2008-07-31T14:22:31Z,Geoff Dalgas,2020-05-30T06:34:16Z,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the Stack Overflow team. Find...,26613,664,88,...,United States of America,us,3.0,26.0,15.0,United States of America,Americas,Northern America,usa,male
2,15196,2008-07-31T14:22:31Z,Jarrod Dixon,2020-05-29T15:37:16Z,http://jarroddixon.com,"Raleigh, NC, United States","<p><a href=""http://blog.stackoverflow.com/2009...",26423,7756,100,...,United States of America,us,1.0,2.0,30.0,United States of America,Americas,Northern America,usa,male
3,31887,2008-07-31T14:22:31Z,Joel Spolsky,2020-05-30T17:25:45Z,https://joelonsoftware.com/,"New York, NY","<p>In 2000 I co-founded Fog Creek Software, wh...",78047,825,97,...,United States of America,us,1.0,16.0,12.0,United States of America,Americas,Northern America,usa,male
4,48438,2008-07-31T14:22:31Z,Jon Galloway,2020-05-29T23:45:55Z,http://weblogs.asp.net/jgalloway/,"San Diego, CA","<p>Technical Evangelist at Microsoft, speciali...",13046,786,34,...,United States of America,us,3.0,5.0,1.0,United States of America,Americas,Northern America,usa,male


In [208]:
active_users.Gender.value_counts()

0                  656126
male               287140
probably_male       21322
female              17783
probably_female      1904
Name: Gender, dtype: int64

In [211]:
# note: this overwrites the input file loaded at the top of the notebook
active_users.to_csv('../data/processed/active_users.csv')