In [None]:
import numpy as np
import pandas as pd

import re

import plotly.express as px
import plotly.graph_objects as go

import statsmodels.api as sm

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)

import plotly.io as pio

pio.templates.default = "plotly"

# 0. What I want to do in this notebook

Gender bias is one of the most frequently discussed topics in this community. As is well known, the number of female Kagglers is far smaller than that of males, and the average wage of women is lower than that of men.

**What is NOT answered yet by other notebooks** is whether the gender bias is systematic or not. Men dominating this field is not necessarily a problem if it results from people freely choosing their professions. No one can force women to be data scientists, and there are other fields that are dominated by women. Moreover, the wage gap is reasonable if it can be explained by other factors such as education or work experience.

However, if women are paid less than men with the same level of education, work experience, and position, all of us can agree that there is a systematic bias against women. This is my primary interest.

In this notebook, I will explore and try to answer this question: **'Is there a systematic gender bias regarding wages? If so, in which country is that bias more severe?'**  

**Most other notebooks are preoccupied with visualizing simple aggregations, but you cannot conclude that there is a systematic bias by just looking at the mean/median wage difference between women and men. You need to control for other variables that might affect wages.**

My simple but rigorous approach is as follows. Section 1 describes the data cleaning process to extract reliable subsamples from the dataset. Section 2 visualizes the gender difference from many aspects including the median wage. Section 3 explains the result of gender gap estimation. Section 4 is the conclusion including suggestions.

Hope this is insightful and thought-provoking!

# 1. Data Cleaning

In [None]:
#read data
df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
df = df.drop([0])

Data cleaning is a significant process because the quality of the analysis depends on the quality of data. 

Data cleaning includes the following six steps.

**1. Focusing on the currently employed**

As my interest is a wage gap, students and those who are not employed are excluded.

**2. Focusing on Women/Men**

To gauge the wage gap between women and men, I selected the respondents whose gender identity was either woman or man.

**3. Selecting the countries whose number of women's responses is 20 or more.**

Comparing responses across countries (e.g. comparing Chinese women with American men) is misleading. Wages are reported in USD, but general price levels differ greatly across countries. Therefore, I decided to compare within each country, using that country's median wage as the reference. I require at least 20 women's responses per country to make the statistical comparison reasonably reliable.

**4. Excluding implausible wages**

Some wage answers do not seem to be true. Some responses were below the minimum wage of the respondent's country, which is highly unlikely. Perhaps these respondents misunderstood the question or simply did not want to disclose their wages. Other responses were unrealistically high. Some of these may be genuine outliers, but others are probably not truthful. I decided to exclude responses whose wages were either below the country's minimum wage or above five times the country's median wage (calculated after excluding the responses below the minimum wage).

**5. Re-counting women's responses and re-selecting countries**

**6. Excluding countries whose median wage is less than USD 2,000.**

The smallest bin width of wages is USD 500, so you cannot assess wages in the low range (e.g. USD 700, USD 800, and USD 900 all go into the USD 500 - USD 999 bucket even though they are fairly different).
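Step 4 can be sketched on toy data as follows. This is only an illustration under stated assumptions: the countries `A` and `B`, the wages, the minimum wages, and the helper name `filter_plausible_wages` are all made up and are not part of the survey data.

```python
import pandas as pd

# Toy data: all wages, countries, and minimum wages here are made up.
toy = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] * 4,
    "wage": [500, 20_000, 30_000, 40_000, 50_000, 400_000,
             1_000, 2_000, 3_000, 4_000],
})
toy_min_wage = {"A": 15_000, "B": 1_500}  # hypothetical annual minimum wages


def filter_plausible_wages(frame, min_wage, multiple=5):
    """Keep rows with min wage <= wage <= multiple * country median."""
    out = frame.copy()
    out["min_wage"] = out["country"].map(min_wage)
    # Pass 1: drop responses below the country's minimum wage.
    out = out[out["wage"] >= out["min_wage"]].copy()
    # Pass 2: the median is computed only on rows that survived pass 1.
    out["median_wage"] = out.groupby("country")["wage"].transform("median")
    out["adj_wage"] = out["wage"] / out["median_wage"]
    return out[out["adj_wage"] <= multiple]


cleaned = filter_plausible_wages(toy, toy_min_wage)
print(cleaned[["country", "wage", "adj_wage"]])
```

The order of the two passes matters: if the median were computed before dropping the sub-minimum responses, the upper threshold would be pulled down by the implausibly low answers.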

In [None]:
#create dict
len_df = {}
len_df['Initial'] = len(df)

#Exclude students/not employed
ex_role = ['Student','Currently not employed']
df = df[~df.Q5.isin(ex_role)]

len_df['STEP1:Employed'] = len(df)

#clean up gender
gender = ['Man','Woman']
df = df[df.Q2.isin(gender)]

len_df['STEP2:Woman/Man'] = len(df)

country = pd.crosstab(df['Q3'],df['Q2'])
country = country[country['Woman'] >= 20].index
country = country.drop('Other')

df = df[df.Q3.isin(country)]
len_df['STEP3: Select countries'] = len(df)

#clean up country name
df.loc[df['Q3'] == 'United Kingdom of Great Britain and Northern Ireland', 'Q3'] = 'UK'
df.loc[df['Q3'] == 'United States of America', 'Q3'] = 'USA'
df.loc[df['Q3'] == 'Iran, Islamic Republic of...', 'Q3'] = 'Iran'

In [None]:
#minimum wage by wikipedia 
#https://en.wikipedia.org/wiki/List_of_countries_by_minimum_wage
#as of 2021 nov 2

dict_min_wage = {
    'Australia':28734,
    'Brazil':3625,
    'Canada':21477, 
    'China':1945, 
    'Egypt':68*12,#for public sector 
    'France':21786,
    'Germany':22214, 
    'India':709, 
    'Indonesia':2477, 
    'Iran':432,
    'Italy':9999, # no statutory minimum wage; assumed similar to other EU countries
    'Japan':15646,
    'Kenya':755, 
    'Malaysia':2663, 
    'Mexico':2678, 
    'Netherlands':23315, 
    'Nigeria':1108,  
    'Pakistan':1539, 
    'Poland':9408, 
    'Portugal':9594, 
    'Russia':2371,
    'South Korea':18695, 
    'Spain':15187, 
    'Taiwan':9702,
    'Tunisia':1902, 
    'Turkey':6224, 
    'UK':23656, 
    'USA':15080, 
    'Viet Nam':1437, 
    }

df['min_wage'] = df.Q3.map(dict_min_wage)

gd = pd.DataFrame({
    'country':dict_min_wage.keys(),
    'minimum wage':dict_min_wage.values()
})
fig = px.bar(gd , y='minimum wage', x='country')
fig.update_layout(
    title="Minimum Wage by Country",
    xaxis_title="Country",
    yaxis_title="Minimum Wage(USD)",
)
fig.show()

The chart above shows the minimum wages of the selected countries. The data is taken from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_countries_by_minimum_wage.

The minimum wages vary significantly across countries, and I excluded all responses below the minimum wage.

(Since Italy has no legal minimum wage, its de facto minimum wage is assumed to be close to that of other EU countries. For Egypt, I used the minimum wage for the public sector.)
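Comparing a response with the minimum wage requires turning each wage bucket into a number, and this notebook uses the upper bound of the bucket. The sketch below shows the idea in isolation. The helper name `bucket_upper_bound` is hypothetical, and the label formats in the examples are assumptions about what the survey's wage answers look like.

```python
import re

import numpy as np


def bucket_upper_bound(label):
    """Return the largest number in a wage-bucket label, or NaN if none."""
    if not isinstance(label, str):
        return np.nan  # missing answers arrive as NaN, not as strings
    numbers = [int(re.sub(r"\D", "", part))   # strip $, commas, >, spaces
               for part in label.split("-")
               if re.search(r"\d", part)]     # skip parts without digits
    return max(numbers) if numbers else np.nan


# Assumed label formats, for illustration only.
print(bucket_upper_bound("$0-999"))
print(bucket_upper_bound("1,000-1,999"))
print(bucket_upper_bound(">$1,000,000"))
```

Because the upper bound is used, a response is kept whenever the top of its bucket reaches the minimum wage, which makes the filter slightly lenient.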

In [None]:
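def get_max(x):
    """Return the largest number found in a wage-bucket string as an int."""
    try:
        parts = x.split('-')
        # strip every non-digit character before converting to int
        return max(int(re.sub(r'\D', '', str(i))) for i in parts)
    except (AttributeError, ValueError):
        # missing answers (NaN has no .split) or parts without digits
        return np.nan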

#Cleanup compensation
df['wage'] = df.Q25.apply(get_max)

df = df[df['wage'] >= df['min_wage']]

#too high wage
OUTLIER_MULTIPLE = 5
df['median_wage'] = df['Q3'].map(df.groupby('Q3')['wage'].median())
df['adj_wage'] = df['wage'] / df['median_wage']

df = df[df['adj_wage'] <= OUTLIER_MULTIPLE]

len_df['STEP4: Wage cleaned'] = len(df)

country = pd.crosstab(df['Q3'],df['Q2'])
country = country[country['Woman'] >= 20].index

df = df[df.Q3.isin(country)]
len_df['STEP5: Reselect countries'] = len(df)

In [None]:
fig = px.box(df, y="wage", x="Q3", color="Q2",notched=True,
             labels={"Q2": "Gender"})
fig.update_layout(
                  yaxis=dict(range=(0,300000))
)

fig.update_layout(
    title="Distribution of Wages by Country",
    xaxis_title="Country",
    yaxis_title="Wages(USD)",
)

fig.show()

The chart above shows the distribution of the cleaned wages by gender and country. Of these countries, Kenya, Egypt, and Iran are excluded in Step 6 because their median wages are too low to evaluate.

In [None]:
#median wage
gd_median = df.groupby(['Q2','Q3'])['wage'].median().reset_index().pivot_table(index='Q3',columns='Q2',values='wage')

drop_countries = gd_median.index[gd_median.max(axis=1) <= 2000]
gd_median = gd_median[~gd_median.index.isin(drop_countries)]
df = df[~df.Q3.isin(drop_countries)]

len_df['STEP6: Drop lower Wage Countries'] = len(df)

In [None]:
data = dict(
    number=len_df.values(),
    stage=len_df.keys())
fig = px.funnel(data, x='number', y='stage')
fig.update_layout(
    title="Num of Responses by Step",
    yaxis_title="Steps",
)
fig.show()

The funnel chart above shows the number of valid responses after each step of data cleaning. The number starts at 25,973 responses and finally decreases to around 7,000 responses.

In [None]:
a = pd.crosstab(df.Q3,df.Q2)
a['total'] = a.sum(axis=1)
a['ratio'] = a['Woman'] / a['total']
a

In [None]:
gd = pd.crosstab(df.Q3,df.Q2)

px.bar(gd, x=gd.index, y=['Man','Woman'], barmode="group",
        labels={
        "value": "Num of responses",
        "Q3": "Country",
        "variable": "Gender"
                 },
        title = 'Num of Responses by Gender and Country')

The chart above illustrates the number of valid responses by women and men after the data preprocessing. It confirms that the machine learning community is dominated by men. The largest subgroups are **India and USA**. Regarding the proportion gap, **Japan** has the widest gap between women (only 5%) and men (95%). **USA, UK, and India** have the smallest gaps (women: 20%, men: 80%), which are still quite wide.

# 2. Median wage gap and factors that can explain the gap

In [None]:
gd_median['Woman'] /= gd_median['Man']
gd_median['Man'] = 1

gd_median = gd_median.unstack().reset_index().rename(columns = {0:'wage'})

fig = px.line_polar(gd_median, r="wage", theta="Q3", color="Q2",  line_close=True,
            color_discrete_sequence=['blue','red'], 
            labels={
                "Q2": "Gender",
                 })

fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True,
      range=[0, 1.5]
    )),
  showlegend=True
)
fig.update_layout(
    title="Median Wages Comparison (Man = 1.0)",
)


fig.show()

The spider chart above shows the ratio of women's median wage to men's. Men's median wage is fixed at 1, and women's median wage is expressed as a ratio to it. In most countries, with **China and France** as the exceptions, women's median wages are lower than men's.

This indicates the gender gap but you **CANNOT** conclude that there is a gender bias by just looking at this chart.

Let us consider the following example. A man has 10 years of experience in the machine learning field and a woman has just graduated from college. The man earns three times more than the woman. Can you say this is an unfair wage practice? Of course not.

To detect systematic bias, it is necessary to control for other factors that can affect wages. In other words, the key is to find the part of the wage gap that is not explained by other factors.
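A small synthetic example may make this concrete. In the sketch below, wages depend only on experience, and women happen to have less experience. All numbers are made up for illustration and have nothing to do with the survey.

```python
import numpy as np

# Synthetic data: wages depend only on experience, not on gender.
experience = np.array([1, 2, 3, 4, 6, 8, 10, 12], dtype=float)
woman = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)
wage = 1.0 + 0.1 * experience

# Raw comparison: difference of group means, ignoring experience.
raw_gap = wage[woman == 1].mean() - wage[woman == 0].mean()

# OLS with an intercept, the woman dummy, and experience as a control.
X = np.column_stack([np.ones_like(experience), woman, experience])
coef, *_ = np.linalg.lstsq(X, wage, rcond=None)
adjusted_gap = coef[1]

print(f"raw gap: {raw_gap:.3f}, adjusted gap: {adjusted_gap:.3f}")
```

The raw gap is clearly negative, while the coefficient on the woman dummy is essentially zero once experience is included. The reverse can also happen: a gap can stay the same or grow after controls are added, which is why the regression is needed.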

From this survey, I found seven possible factors: **Age, Education Level, Job Title, Coding Experience, Machine Learning Experience, Industry, and Employer Size**. Let's briefly look at the possible factors one by one.

## Age

In [None]:
gd = pd.crosstab(df.Q1,df.Q2)
gd = gd / gd.sum() *100

px.bar(gd, x=gd.index, y=['Man','Woman'], barmode="group",
        labels={
        "value": "(%)",
        "Q1": "Age",
        "variable": "Gender"
                 },
        title = 'Age Distributions by Gender')

Higher age normally leads to higher wages. Female Kagglers tend to be younger than their male counterparts.

## Education Level

In [None]:
gd = pd.crosstab(df.Q4,df.Q2)
gd = gd / gd.sum() *100

ACADEMIC_ORDER = ['No formal education past high school',
                  'Some college/university study without earning a bachelor’s degree',
                  'Bachelor’s degree', 
                  'Master’s degree', 
                  'Doctoral degree',
                  'Professional doctorate',
                   'I prefer not to answer']

gd = gd.reindex(index=ACADEMIC_ORDER)

px.bar(gd, x=gd.index, y=['Man','Woman'], barmode="group",
       labels={
        "value": "(%)",
        "Q4": "Education Level",
        "variable": "Gender"
                 },
        title = 'Education Level Distributions by Gender')

A higher education level generally leads to higher wages. Female Kagglers tend to have higher education levels than their male counterparts.

## Job Title

In [None]:
gd = pd.crosstab(df.Q5,df.Q2)
gd = gd / gd.sum() *100

gd = gd.sort_values('Woman',ascending=False)

px.bar(gd, x=gd.index, y=['Man','Woman'], barmode="group",
       labels={
        "value": "(%)",
        "Q5": "Job title",
        "variable": "Gender"
                 },
        title = 'Job Title Distributions by Gender')

Job title relates to wages. Women are more likely to be 'Data Analyst' and 'Other', and less likely to be 'Data Scientist' and 'Machine Learning Engineer'.

## Coding Experience

In [None]:
gd = pd.crosstab(df.Q6,df.Q2)
gd = gd / gd.sum() *100

CODE_ORDER = ['I have never written code','< 1 years','1-3 years',
              '3-5 years', '5-10 years','10-20 years', '20+ years']

gd = gd.reindex(index=CODE_ORDER)

px.bar(gd, x=gd.index, y=['Man','Woman'], barmode="group",
       labels={
        "value": "(%)",
        "Q6": "Coding Experience",
        "variable": "Gender"
                 },
        title = 'Coding Experience Distributions by Gender')

Longer coding experience generally leads to higher wages. Female Kagglers have shorter coding experience on average.

## Machine learning experience

In [None]:
gd = pd.crosstab(df.Q15,df.Q2)
gd = gd / gd.sum() *100

ML_ORDER = ['I do not use machine learning methods','Under 1 year',
            '1-2 years', '2-3 years', '3-4 years', '4-5 years', 
            '5-10 years','10-20 years', '20 or more years',
        ]

gd = gd.reindex(index=ML_ORDER)

px.bar(gd, x=gd.index, y=['Man','Woman'], barmode="group",
       labels={
        "value": "(%)",
        "Q15": "Machine Learning Experience",
        "variable": "Gender"
                 },
        title = 'ML Experience Distributions by Gender')

Likewise, longer machine learning experience generally leads to higher wages. Female Kagglers have shorter experience on average.

## Industry

In [None]:
gd = pd.crosstab(df.Q20,df.Q2)
gd = gd / gd.sum() *100

gd = gd.sort_values('Woman',ascending = False)

px.bar(gd, x=gd.index, y=['Man','Woman'], barmode="group",
       labels={
        "value": "(%)",
        "Q20": "Industries",
        "variable": "Gender"
                 },
        title = 'Industry Distributions by Gender')

Industry relates to wages. Women are more likely to work in the 'Academic/Education' field and less likely to work in the 'Manufacturing' field.

## Employer Size

In [None]:
gd = pd.crosstab(df.Q21,df.Q2)
gd = gd / gd.sum() *100

gd.index

SIZE_ORDER = ['0-49 employees', '50-249 employees','250-999 employees',
              '1000-9,999 employees','10,000 or more employees'
        ]

gd = gd.reindex(index=SIZE_ORDER)

px.bar(gd, x=gd.index, y=['Man','Woman'], barmode="group",
       labels={
        "value": "(%)",
        "Q21": "Employer Size",
        "variable": "Gender"
                 },
        title = 'Employer Size Distributions by Gender')

Larger employers generally can offer higher wages. Women's employers are slightly smaller in size than men's.

# 3. Statistical Analysis

Let's see whether the gender wage gap can be explained by the other factors that we went through. I used linear regression to control for these factors and estimated the gender gap by country.
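Read from the code in this section, the country-level model can be written roughly as follows. The symbols are notation introduced here only for illustration:

$$
w_i = \alpha + \sum_{c} \beta_c \, F_i \, C_{ic} + \sum_{k} \gamma_k \, x_{ik} + \varepsilon_i
$$

Here $w_i$ is respondent $i$'s wage divided by the median wage of their country (`adj_wage`), $F_i$ is 1 for women and 0 for men, $C_{ic}$ is 1 if respondent $i$ lives in country $c$, and $x_{ik}$ are the dummy variables built from the seven control questions. Each $\beta_c$ is the estimated gap for country $c$, measured in units of that country's median wage. The overall model replaces the first sum with a single term $\beta F_i$.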

In [None]:
TARGET = ['adj_wage']
EXPLANATORY = ['Q2'] #gender
CONTROL = ['Q1','Q4','Q5','Q6','Q15','Q20','Q21']

df['gender_country'] = df['Q2'] + df['Q3']
by_country = pd.get_dummies(df['gender_country'])
by_country = by_country[[col for col in by_country.columns if 'Woman' in col]]

y = df[TARGET]

#woman not by country
# cast dummies to float so statsmodels does not receive bool/object columns
X_total = pd.get_dummies(df[EXPLANATORY + CONTROL], drop_first=True, dtype=float)

model = sm.OLS(y,sm.add_constant(X_total))
results = model.fit()

result_total = pd.concat([results.params,results.bse],axis=1).loc['Q2_Woman']

result_total = pd.DataFrame(result_total).T

result_total.columns = ['coef','se']
result_total.index = ['Overall']

In [None]:
# X_plain = pd.concat([by_country],axis=1)
# model = sm.OLS(y,sm.add_constant(X_plain))
# results = model.fit()


# WOMAN_INDEX = [i for i in results.params.index if 'Woman' in i]
# result_plain = pd.concat([results.params,results.bse],axis=1).loc[WOMAN_INDEX]

# result_plain.columns = ['coef_plain','se_plain']
# result_plain.index = [i.replace('Woman','') for i in result_plain.index]

# cast to float so statsmodels does not receive bool/object columns
X = pd.concat([by_country, pd.get_dummies(df[CONTROL], drop_first=True)], axis=1).astype(float)
model = sm.OLS(y,sm.add_constant(X))
results = model.fit()
#results.summary()

WOMAN_INDEX = [i for i in results.params.index if 'Woman' in i]
result = pd.concat([results.params,results.bse],axis=1).loc[WOMAN_INDEX]

result.columns = ['coef','se']
result.index = [i.replace('Woman','') for i in result.index]

result = pd.concat([result_total,result])
result = result.sort_values('coef')

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(
    name='Estimated Wage Gap by Gender',
    x=result.index, y=result.coef,
    error_y=dict(type='data', array=result.se*2)
))

fig.update_layout(barmode='group')

fig.update_layout(
    title="Kaggle Gender-Wage Gap Index (0.0 means equal)",
    xaxis_title="Country",
    yaxis_title="Gap",
)
fig.show()

The bar and error bar chart above shows the estimated 'gender-wage gap'. I call it the **'Kaggle Gender-Wage Gap Index'**. The bars indicate the gender gap. A negative value means that women earn lower wages than men. Error bars show the estimation error (two standard errors on each side). If an error bar crosses zero, the gap is not statistically significant.
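The rule for reading the error bars can be written as a small check. This is a minimal sketch: the helper name `gap_interval` is hypothetical, and the coefficients and standard errors passed to it are made-up values, not the estimates from this notebook.

```python
def gap_interval(coef, se, width=2.0):
    """Return (low, high, significant) for coef +/- width * se."""
    low, high = coef - width * se, coef + width * se
    significant = low > 0 or high < 0  # the interval excludes zero
    return low, high, significant


# Hypothetical coefficients and standard errors, for illustration only.
print(gap_interval(-0.40, 0.10))  # interval stays below zero
print(gap_interval(-0.10, 0.08))  # interval crosses zero
```

An interval of two standard errors corresponds approximately to a 95% confidence interval, so this check is roughly a two-sided test at the 5% level.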

**Overall** is estimated at roughly -0.21 and is statistically significant. This indicates that a gender wage gap exists at the global level. Based on the result, **women's wages are lower than men's by around 20% of the median wage** even AFTER controlling for other factors such as education, experience, job title, and industry.

**Canada, Brazil, USA, and Germany** have the largest wage gaps and these gaps are statistically significant. Women's wages are lower than men's by 40% of the median wage. 

**Spain, Russia, France, and Japan** have wage gaps as well, but the gaps are NOT statistically significant. So, it cannot be said that the gender gaps are systematic in these countries.

**India** has a negative gap and it is statistically significant. But the extent is around 15%, which is much smaller than the USA.

**China and Nigeria** do not have negative gaps, and their estimates are not statistically significant. This suggests there is no evidence of bias against women in these two countries.


In [None]:
comp = pd.DataFrame(result.coef.drop('Overall'))
comp['median diff'] = comp.index.map((gd_median[gd_median['Q2'] == 'Woman'].set_index('Q3')['wage'] - 1))
comp.columns = ['Kaggle Index','simple median comparison']

comp = pd.DataFrame(comp.unstack()).reset_index()

comp.columns = ['Method','Country','Gap']

fig = px.scatter(comp, y='Gap', x="Country", color="Method", symbol="Method")
fig.update_traces(marker_size=10)
fig.update_layout(
    title="Kaggle Gender-Wage Gap Index vs. Simple Median Comparison",
    xaxis_title="Country",
    yaxis_title="Gap",
)
fig.show()

This scatter plot compares the Kaggle Gender-Wage Gap Index with the simple comparison of median wages between women and men (shown in the first chart in section 2). You can see an interesting difference.

For **Russia, India, and Nigeria**, the simple comparison looks much worse than the Kaggle Index, meaning that much of the wage gap on the surface can be explained by factors other than gender.

On the contrary, for **Canada, USA, and Germany**, the gender gap becomes wider after controlling for other factors. This indicates the true gap is larger than the simple median comparison suggests.

In [None]:
#https://en.wikipedia.org/wiki/Global_Gender_Gap_Report
GGI = {
    'Canada':0.772,
    'Brazil':0.695,
    'USA':0.763, 
    'Germany':0.796, 
    'Spain':0.788, 
    'Russia':0.708, 
    'France':0.784,
    'Japan':0.656, 
    'UK':0.775, 
    'India':0.625, 
    'China':0.682,  
    'Nigeria':0.627, 
}

result['GGI'] = result.index.map(GGI)

fig = px.scatter(x=result['coef'], y=result['GGI'],text = result.index,trendline="ols")

fig.update_traces(textposition="top right")

fig.update_layout(
    title="Kaggle Gender-Wage Gap Index vs WEF's Gender Gap Index",
    xaxis_title="Kaggle Gender-Wage Gap Index",
    yaxis_title="WEF's Gender Gap Index",
)

fig.show()

Finally, I compare the Kaggle Gender-Wage Gap Index with the World Economic Forum's Global Gender Gap Index. The data are taken from this Wikipedia page: https://en.wikipedia.org/wiki/Global_Gender_Gap_Report

WEF's Index  "ranks countries according to calculated gender gap between women and men in four key areas: health, education, economy, and politics to gauge the state of gender equality in a country." Therefore, this index is a comprehensive gender gap index while the Kaggle Index is specialized in the data science field.

What is very intriguing is that the two indexes are **NEGATIVELY** correlated. This means that countries with higher overall gender equality tend to have a larger wage gap in the machine learning community.
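The sign of such a relationship can be checked with a correlation coefficient and a fitted line. The sketch below uses NumPy on made-up values, so the arrays `kaggle_index` and `overall_index` are purely hypothetical and are not the estimates plotted above.

```python
import numpy as np

# Made-up values, not the notebook's estimates.
kaggle_index = np.array([-0.40, -0.30, -0.20, -0.10, 0.00])
overall_index = np.array([0.78, 0.76, 0.70, 0.69, 0.63])

# Pearson correlation between the two indexes.
r = np.corrcoef(kaggle_index, overall_index)[0, 1]

# Straight-line fit; polyfit returns the slope first for deg=1.
slope, intercept = np.polyfit(kaggle_index, overall_index, deg=1)

print(f"r = {r:.2f}, slope = {slope:.2f}")
```

A more negative Kaggle Index means a larger gap against women, so a negative correlation says that higher overall equality goes together with a larger estimated wage gap. With only about a dozen countries, any such correlation is imprecise and should be read with caution.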

# 4. Conclusion

In this notebook, I estimated **the Kaggle Gender-Wage Gap Index** and examined the gender bias in the data science community. The main findings are as follows.

* Women's wages are lower than men's by around 20% of the median wage. This gap cannot be attributed to other factors such as education, job title, and work experience. Gender bias against women regarding wages **REALLY EXISTS**.

* By country, Canada, Brazil, USA, and Germany have the largest gaps, while China and Nigeria show no statistically significant gaps.

* The countries which achieve higher overall gender equality tend to have a larger wage gap in the machine learning sector.

Needless to say, this result is not final and further research should be conducted to confirm it. Methodology-wise, the Kaggle survey was not designed specifically to research this topic.

Moreover, the **core question** remains to be answered. The result is somewhat counterintuitive. In particular, the Kaggle Gender-Wage Gap Index is negatively correlated with WEF's gender gap index. Why is the gender gap in the machine learning field different from the overall gender gap? Before I started this experiment, I supposed that Japan (my country, which historically has performed poorly on gender equality as measured by international institutions) would have a larger bias than the US, but the result indicates the opposite. Digging into this point might help us find ways to tackle the gender wage gap in data science.

My **suggestion for the next survey** is to ask about wages in a more tactical manner. Wage responses are rather noisy. For example, multiple choices instead of raw numbers might be helpful to get more reliable responses. In addition, using the minimum wage of a country as a threshold can help people avoid answering unrealistically low wages.

**Thank you for reading my notebook! I hope that my findings are interesting and thought-provoking. I would love to hear/learn from your opinions/observations about this issue in your country.**