
<img src="https://flolytic.com/wp-content/uploads/2018/12/mind-the-gap-1876790_640.jpg" width="700px">

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.models import Sequential
from keras.layers import Dense

import matplotlib
import os
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
plt.style.use('seaborn-white')

In [None]:
multiple_choice_responses = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')

In [None]:
!pip install pywaffle
from pywaffle import Waffle

<font size="6">Income Inequality Between Men and Women</font>

**The presence of income inequality between men and women has been a topic of discussion in modern society for quite some time. In this notebook, we will explore whether men and women are rewarded differently for their efforts in the data science community and in society as a whole.**

There have been discussions in the previous years about the gender divide in general (as far as I have seen) but in this notebook we will zoom in on the gender pay gap within the dataset.

In most occupations there is a significant gender pay gap: men tend to earn more than women. These inequalities have been narrowing across the world. In particular, over the last couple of decades most high-income countries have seen sizeable reductions in the gender pay gap.

Using the data provided by the 2019 Kaggle ML & DS Survey, we will look into the male and female demographics and try to determine whether a gender pay gap exists and, if so, why.

Also, we will check if a neural network can 'see' the gender discrimination.

As a side note, we will mostly be using bar graphs, as I have an allergy to fancy diagrams.

This discussion is structured as follows (in the order of appearance) :

* [Overview of Men in the dataset](#section-one)

* [Overview of Women in the dataset](#section-two)

* [Analysis of whether a gender pay gap exists](#section-three)

* [Attempt at drawing conclusions](#section-four)

* [A look into the earnings of male and female Data Scientists](#section-five)

* [Can a neural network 'see' the gender divide ?](#section-six)

* [Afterthoughts](#section-seven)


**Before we dig in, keep in mind that the majority of the respondents are males (83.4%). This can and will skew the illustrations so please bear with me.**

In [None]:
male_columns = multiple_choice_responses[multiple_choice_responses['Q2']=='Male'] # DataFrame containing male data

female_columns = multiple_choice_responses[multiple_choice_responses['Q2']=='Female'] # DataFrame containing female data

male_count = len(male_columns)      # number of male respondents (plain integer)

female_count = len(female_columns)  # number of female respondents (plain integer)

male_percent = int((male_count/(male_count+female_count))*100)      # Percentage of males
female_percent = int((female_count/(male_count+female_count))*100)  # Percentage of females


data = { 'Male'  : male_percent, 'Female' : female_percent}

fig = plt.figure(
    FigureClass=Waffle, 
    rows=8,
    values=data, 
    colors=("#232066", "#983D3D"),
    legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    icons=['male','female'],
    icon_size=18,
    icon_legend=True
)

fig.set_tight_layout(False)

***Males and Females in the dataset***

<a id="section-one"></a>
<font size="5">How are the men doing?</font>

**A man is an adult male human.**

Like most other male mammals, a man's genome typically inherits an X chromosome from his mother and a Y chromosome from his father. The male fetus produces larger amounts of androgens and smaller amounts of estrogens than a female fetus. This difference in the relative amounts of these sex steroids is largely responsible for the physiological differences that distinguish men from women.


<img src="https://townsquare.media/site/442/files/2019/10/best-of-schwarzenegger1.jpg" width="500px">

First let us see where most of the men in this dataset are from.

In [None]:
maleCountryCount = male_columns['Q3'].value_counts()[:59].reset_index() # variable for storing number of male respondents from each country


percentageCountry = [] # percentage of males from each country

for i in maleCountryCount.Q3:
  percentageCountry.append( (i / sum(maleCountryCount.Q3.values)) * 100 )

x_p = np.arange(len(maleCountryCount['index'].values)) # coordinates for labels and bars

plt.figure(figsize=(15, 7))
plt.bar(x_p, percentageCountry, align='center',color = 'blue')
plt.xticks(x_p, maleCountryCount['index'].values,rotation = 90)
plt.ylabel('Percentage of Males')
plt.title('Country of residence')

plt.show()

Men from around 60 countries responded to the survey. For our purposes, let us take into account the two major countries. One glance at the diagram above will make it apparent that **most respondents are from India and the United States of America.**

<font size="4.5">Overview of the gender pay gap in India and the United States of America:</font>

* For the year 2013, the gender pay gap in India was estimated to be 24.81%. Further, while analyzing the level of female participation in the economy, a report slots India as one of the bottom 10 countries on its list. Thus, in addition to unequal pay, there is also unequal representation, because while women constitute almost half the Indian population (about 48% of the total), their representation in the work force amounts to only about one-fourth of the total.

* In the US, women's average annual salary has been estimated as 78% to 82% of men's average salary. Beyond overt discrimination, multiple studies explain the gender pay gap in terms of women's higher participation in part-time work and long-term absences from the labour market due to care responsibilities, among other factors. The extent to which discrimination plays a role in explaining gender wage disparities is somewhat difficult to quantify. A 2010 research review by the majority staff of the United States Congress Joint Economic Committee reported that studies have consistently found unexplained pay differences even after controlling for measurable factors that are assumed to influence earnings – suggestive of unknown/unmeasurable contributing factors of which gender discrimination may be one. Other studies have found direct evidence of discrimination – for example, more jobs went to women when the applicant's sex was unknown during the hiring process.

Differences in pay capture differences along many possible dimensions, including worker education, experience and occupation. Let us see how educated the men are in this dataset.

In [None]:
maleQualificationCount = male_columns['Q4'].value_counts().reset_index() # variable for storing qualification of  male respondents 

percentageQualification = [] # percentage of males with each qualification

for i in maleQualificationCount.Q4:
  percentageQualification.append( (i / sum(maleQualificationCount.Q4.values)) * 100 )

x_p = np.arange(len(maleQualificationCount['index'].values))

fig, ax = plt.subplots()

ax.barh(x_p, percentageQualification, align='center')  #used barh for making it look pulchritudinous
ax.set_yticks(x_p) 
ax.set_yticklabels(maleQualificationCount['index'].values)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Percentage Of Males')
ax.set_title('Qualifications of Males ')

plt.show()



Most men have a **Master's degree** which is common for India and the United States of America. There is also a large number of men who have a **Bachelor's degree** and a **Doctoral degree**. These high qualifications should result in receiving a high salary. 

* According to the Bureau of Labor Statistics (BLS), the median wage for workers in the United States in the first quarter of 2019 was 905 dollars per week or 47,060 dollars per year for a 40-hour workweek.

* The average salary in India is around 295 dollars to 300 dollars.


There is a huge difference in yearly income between the two major countries. Salaries can vary significantly based on both occupation and location. What is considered a good salary in one location may not be one somewhere else.

Let us look into the earnings of the men in the dataset.

In [None]:
maleEarningsCount = male_columns['Q10'].value_counts()[:59].reset_index() # number of male respondents in each income bracket
maleEarningsCount = maleEarningsCount.reindex([0,7,18,19,20,10,15,1,9,8,12,3,4,5,6,11,16,17,2,13,14,21,24,23,22]) # reorder the rows so the brackets run from lowest to highest income

percentageEarnings_m = [] # percentage of males in each income bracket

for i in maleEarningsCount.Q10:
  percentageEarnings_m.append( (i / sum(maleEarningsCount.Q10.values)) * 100 )

x_p = np.arange(len(maleEarningsCount['index'].values))


plt.figure(figsize=(15, 5))

plt.bar(x_p, percentageEarnings_m,color = 'blue',width = 0.8)
plt.title('Men In Each Income Bracket')
plt.xlabel('Incomes')  
plt.ylabel('Percentage Of Males')
plt.xticks(x_p,maleEarningsCount['index'].values,rotation=85)
plt.legend(['Male'])
plt.show()  


A large number of men earn less than 1,000 dollars. Only a few (75) earn above 500,000 dollars. In the diagram above, it is evident that many of the men tend to earn in the 30,000 to 70,000 dollar range.

**The BLS (of the USA) reports that men earned a median salary of 52,208 dollars annually**

**The annual median per capita income in India stood at 616 dollars**

In [None]:
#### Drawing a simple line plot ####
#### Writing a function for storing the cumulative frequency values for each income bracket was too daunting for me so ###
#### I used my Casio fx-991 ES PLUS calculator instead ###
#### I double checked the values so rest assured ###
plt.figure(figsize=(15, 7))
plt.plot([1000, 2000, 3000, 4000,5000,7500,10000,15000,20000,25000,30000,40000,50000,60000,70000,80000,90000,100000,125000,150000,200000,250000,300000,500000,1000000],
         [1170,1640,1956,2206,2435,2880,3234,3936,4382,4839,5250,5885,6494,7100,7598,8038,8377,8700,9350,9756,10124,10271,10334,10398,10473])

The cumulative frequency diagram above can be used to get the median salary of the men in our dataset. It is around 30,000 dollars which is more than 20,000 dollars lower than the figure reported by the BLS. However, it is a lot higher than the Indian median salary.
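For anyone who would rather not reach for a calculator, here is a minimal sketch of how the cumulative frequencies could be computed with pandas. The names `BRACKETS`, `cumulative_counts` and `median_bracket` are made up for this illustration, and the toy Series is not survey data; the bracket labels are assumed to match the ones used in column `Q10`.

```python
import pandas as pd

# Income brackets in ascending order (assumed to match the labels in Q10).
BRACKETS = [
    '$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999',
    '5,000-7,499', '7,500-9,999', '10,000-14,999', '15,000-19,999',
    '20,000-24,999', '25,000-29,999', '30,000-39,999', '40,000-49,999',
    '50,000-59,999', '60,000-69,999', '70,000-79,999', '80,000-89,999',
    '90,000-99,999', '100,000-124,999', '125,000-149,999', '150,000-199,999',
    '200,000-249,999', '250,000-299,999', '300,000-500,000', '> $500,000',
]


def cumulative_counts(responses: pd.Series) -> pd.Series:
    """Count respondents per bracket in ascending bracket order, then accumulate."""
    counts = responses.value_counts().reindex(BRACKETS, fill_value=0)
    return counts.cumsum()


def median_bracket(responses: pd.Series) -> str:
    """Return the first bracket whose cumulative count reaches half of the total."""
    cumulative = cumulative_counts(responses)
    half = cumulative.iloc[-1] / 2
    return cumulative[cumulative >= half].index[0]


# Hypothetical toy answers, only to show the mechanics.
toy = pd.Series(['$0-999', '$0-999', '1,000-1,999', '30,000-39,999', '> $500,000'])
print(cumulative_counts(toy).tail(3))
print(median_bracket(toy))
```

With the real data, something like `cumulative_counts(male_columns['Q10'])` should give the values that were typed into the plot by hand, provided the labels in `BRACKETS` match the survey's labels exactly.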

This ends our analysis of the male demographic in the dataset. Now let us see how the female counterparts are doing.

<a id="section-two"></a>
<font size="5">How are the women doing?</font>

**A woman is a female human being.** 

The word woman is usually reserved for an adult; girl is the usual term for a female child or adolescent. The plural women is also sometimes used for female humans, regardless of age, as in phrases such as "women's rights".

Typically, a woman has two X chromosomes and is capable of pregnancy and giving birth from puberty until menopause. Female anatomy, as distinguished from male anatomy, includes the fallopian tubes, ovaries, uterus, vulva, breasts, Skene's glands, and Bartholin's glands. The female pelvis is wider than the male, the hips are generally broader, and women have significantly less facial and other body hair. On average, women are shorter and less muscular than men.

<img src="https://www.biography.com/.image/ar_1:1%2Cc_fill%2Ccs_srgb%2Cg_face%2Cq_auto:good%2Cw_300/MTY2NzA3MDU5MTEzMzM4MTQ4/marilyn_monroe_photo_alfred_eisenstaedt_pix_inc_the_life_picture_collection_getty_images_53376357_cropped.jpg" width="500px">

In [None]:
femaleCountryCount = female_columns['Q3'].value_counts()[:59].reset_index() # variable for storing number of female respondents from each country


percentageCountry = [] # percentage of females from each country

for i in femaleCountryCount.Q3:
  percentageCountry.append( (i / sum(femaleCountryCount.Q3.values)) * 100 )

x_p = np.arange(len(femaleCountryCount['index'].values)) # coordinates for labels and bars

plt.figure(figsize=(15, 7))
plt.bar(x_p, percentageCountry, align='center',color = '#FF5106')
plt.xticks(x_p, femaleCountryCount['index'].values,rotation = 90)
plt.ylabel('Percentage of Females')
plt.title('Country of residence')

plt.show()



The women in the survey are from almost the same countries as the men, with the majority being in India and the United States of America. This is helpful for our analysis because geography is unlikely to affect our comparison.

In [None]:
femaleEarningsCount = female_columns['Q10'].value_counts()[:59].reset_index() # number of female respondents in each income bracket
femaleEarningsCount = femaleEarningsCount.reindex([0,2,11,20,17,7,19,1,8,14,13,4,3,5,10,9,15,18,6,12,16,21,23,22,24 ]) # reorder the rows so the brackets run from lowest to highest income
percentageEarnings_f = [] # percentage of females in each income bracket

for i in femaleEarningsCount.Q10:
  percentageEarnings_f.append( (i / sum(femaleEarningsCount.Q10.values)) * 100 )

x_p = np.arange(len(femaleEarningsCount['index'].values))

plt.figure(figsize=(15, 5))

plt.bar(x_p, percentageEarnings_f,color = '#FF5106',width = 0.8)
plt.title('Women In Each Income Bracket')
plt.xlabel('Incomes')  
plt.ylabel('Percentage Of Females')
plt.xticks(x_p,femaleEarningsCount['index'].values,rotation=85)  # label the bars with the female frame's own brackets
plt.legend(['Female'])
plt.show()  


    

Here, it is evident that the largest group of females earns between 0 and 999 dollars, similar to the males. Also, very few earn near or above 500,000 dollars.

An interesting fact that I would like to note is that 

**only 2 women in the entire dataset earn above 500,000 dollars compared to the 75 men who earn the same**

Let us check the median income of females in the dataset.


In [None]:
### Again some casio magic. ###

plt.plot([1000, 2000, 3000, 4000,5000,7500,10000,15000,20000,25000,30000,40000,50000,60000,70000,80000,90000,100000,125000,150000,200000,250000,300000,500000,1000000],
         [318,436,505,555,611,695,745,863,939,1002,1066,1154,1253,1340,1411,1483,1541,1593,1680,1747,1803,1815,1817,1825,1827],color = '#FF5106')

The median income appears to be between 15,000 and 20,000 dollars. This is significantly lower than the median income of men we have seen above. Further analysis of this is below, but first let us see whether the women are underqualified.
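Reading a median off a line plot is a bit rough, so here is a small sketch that interpolates inside the median bracket instead. It reuses the bracket edges and cumulative counts from the two line plots above. The function name `interpolated_median` is made up for this illustration, and the calculation assumes that respondents are spread evenly within each bracket, which is only an approximation.

```python
# Upper edges of the income brackets and the cumulative counts used in the two line plots.
upper_edges = [1000, 2000, 3000, 4000, 5000, 7500, 10000, 15000, 20000, 25000,
               30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 125000,
               150000, 200000, 250000, 300000, 500000, 1000000]
male_cumulative = [1170, 1640, 1956, 2206, 2435, 2880, 3234, 3936, 4382, 4839,
                   5250, 5885, 6494, 7100, 7598, 8038, 8377, 8700, 9350, 9756,
                   10124, 10271, 10334, 10398, 10473]
female_cumulative = [318, 436, 505, 555, 611, 695, 745, 863, 939, 1002, 1066,
                     1154, 1253, 1340, 1411, 1483, 1541, 1593, 1680, 1747, 1803,
                     1815, 1817, 1825, 1827]


def interpolated_median(edges, cumulative):
    """Grouped-data median: locate the bracket that holds the N/2-th respondent,
    then interpolate linearly inside it (assumes an even spread within the bracket)."""
    half = cumulative[-1] / 2
    for i, running_total in enumerate(cumulative):
        if running_total >= half:
            lower_edge = edges[i - 1] if i > 0 else 0
            below = cumulative[i - 1] if i > 0 else 0
            width = edges[i] - lower_edge
            return lower_edge + (half - below) / (running_total - below) * width
    raise ValueError("cumulative counts must not be empty")


male_median = interpolated_median(upper_edges, male_cumulative)
female_median = interpolated_median(upper_edges, female_cumulative)
print(round(male_median), round(female_median))
```

Under that even-spread assumption the estimates come out at roughly 29,800 dollars for the men and 18,300 dollars for the women, which agrees with what the two plots suggest.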

In [None]:
femaleQualificationCount = female_columns['Q4'].value_counts().reset_index() # variable for storing qualification of  female respondents 

percentageQualification = [] # percentage of females with each qualification

for i in femaleQualificationCount.Q4:
  percentageQualification.append( (i / sum(femaleQualificationCount.Q4.values)) * 100 )

x_p = np.arange(len(femaleQualificationCount['index'].values))

fig, ax = plt.subplots()

ax.barh(x_p, percentageQualification, align='center',color = '#FF5106')  #used barh for making it look pulchritudinous
ax.set_yticks(x_p) 
ax.set_yticklabels(femaleQualificationCount['index'].values)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Percentage Of Females')
ax.set_title('Qualifications of Females ')

plt.show()


Women have almost exactly the same qualifications as the men. Most women have a Master's degree, followed by others who have Bachelor's and Doctoral degrees.

Now let's compare our findings and see if there is in fact a gender gap within our dataset.

<a id="section-three"></a>
<font size="5">What is happening?</font>

In [None]:
df = multiple_choice_responses[['Q2','Q10']].copy()  # copy so the edits below do not touch the original frame

df = df.iloc[1:,]  # dropping first row as it contains column labels

#df['Q10'].value_counts()

df['Q10'] = df.Q10.map({'$0-999' : 500,   # Converting the range to numbers for easier  calculations and graphs
'1,000-1,999': 1500,
'2,000-2,999' : 2500,
'3,000-3,999' : 3500,
'4,000-4,999': 4500,
'5,000-7,499' : 6250,
'7,500-9,999' : 8750,
'10,000-14,999' : 12500,
'15,000-19,999' : 17500,
'20,000-24,999' : 22500,
'25,000-29,999':  27500,
'30,000-39,999': 35000,
'40,000-49,999': 45000,
'50,000-59,999': 55000,
'60,000-69,999': 65000,
'70,000-79,999': 75000,
'80,000-89,999': 85000,
'90,000-99,999': 95000,
'100,000-124,999': 112500,
'125,000-149,999': 137500,
'150,000-199,999': 175000,
'200,000-249,999': 225000,
'250,000-299,999': 275000,
'300,000-500,000': 400000,
'> $500,000': 750000})

df.columns  = ['Gender','Income']  # naming columns

sns.catplot(data = (df.loc[((df.Gender == 'Female') | (df.Gender == 'Male')) & (df.Income.notnull())]), x='Gender', y='Income', kind='violin', split=True)
plt.title('Approximated Pay Distribution By Gender')

**The two distributions drawn above look very similar at first glance. In both genders the largest group earns less than 1,000 dollars and only a few earn close to or above 500,000 dollars. The male distribution falls gradually and slowly evens out at higher incomes, but the female distribution falls sharply and is invisible after the 200,000 dollar range.**

**A higher share of women than men is in the lowest income bracket, and an alarmingly low number of women are in the highest income brackets.**

* **17.4% of women are in the lowest income bracket compared to 9.48% of males.**
* **0.1% of women are in the highest income bracket compared to 0.68% males.**
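Percentages like the two above can be obtained in one step with `pd.crosstab`. The sketch below uses a tiny made-up frame (not survey data) with the same `Gender` and `Income` column names as the cell above; `normalize='index'` makes every gender's row sum to 100%.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Gender/Income columns built above.
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'Income': [500, 500, 35000, 750000, 500, 35000],
})

# normalize='index' turns each row (one per gender) into shares of that gender.
shares = pd.crosstab(toy['Gender'], toy['Income'], normalize='index') * 100

lowest, highest = shares.columns.min(), shares.columns.max()
print(shares[[lowest, highest]].round(1))
```

Comparing shares within each gender, rather than raw counts, matters here because men heavily outnumber women in the dataset.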

To put things into perspective, the income brackets were divided into High, Low and Medium income groups.

In [None]:
df = multiple_choice_responses
salary_mapping = {'$0-999':'low', '1,000-1,999':'low', 
                  '10,000-14,999':'low', '100,000-124,999':'high',
                  '125,000-149,999':'high', '15,000-19,999':'low', 
                  '150,000-199,999':'high', '2,000-2,999':'low',
                  '20,000-24,999':'low', '200,000-249,999':'high', 
                  '25,000-29,999':'medium', '250,000-299,999':'high',    # mapping the ranges to high,low and medium
                  '3,000-3,999':'low','30,000-39,999':'medium',
                  '300,000-500,000':'high', '4,000-4,999':'low',
                  '40,000-49,999':'medium', '5,000-7,499':'low', 
                  '50,000-59,999':'medium', '60,000-69,999':'medium',
                  '7,500-9,999':'low', '70,000-79,999':'medium', 
                  '80,000-89,999':'medium', '90,000-99,999':'medium',
                  '> $500,000':'high'}

# create new column for the income group and convert the old salary
df['income_group'] = df['Q10'].map(salary_mapping)

df = df[df.income_group.notnull()]
df = df[df.Q2 != 'Prefer not to say']
df = df[df.Q2 != 'Prefer to self-describe']

df = df[['Q2','income_group']]

males = df[df.Q2 == 'Male']
females = df[df.Q2 == 'Female']

maleGroupCounts = males.groupby('income_group')['Q2'].count()      # respondents per income group
femaleGroupCounts = females.groupby('income_group')['Q2'].count()

# Look the groups up by label instead of relying on their alphabetical position
malesHighRollers = maleGroupCounts['high']
femalesHighRollers = femaleGroupCounts['high']

malesLowRollers = maleGroupCounts['low']
femalesLowRollers = femaleGroupCounts['low']

malesMediumRollers = maleGroupCounts['medium']
femalesMediumRollers = femaleGroupCounts['medium']

fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal"))

recipe = ['Male','Female']

data =  [malesHighRollers , femalesHighRollers]
ingredients = [x.split()[-1] for x in recipe]


def func(pct, allvals):
    absolute = int(pct/100.*np.sum(allvals))
    return "{:.1f}%".format(pct, absolute)


wedges, texts, autotexts = ax.pie(data, autopct=lambda pct: func(pct, data),
                                  textprops=dict(color="w"))

ax.legend(wedges, ingredients,
          title="Genders",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

plt.setp(autotexts, size=8, weight="bold")

ax.set_title("Males and Females in High Income Bracket")

plt.show()

fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal"))

recipe = ['Male','Female']

data =  [malesLowRollers , femalesLowRollers]
ingredients = [x.split()[-1] for x in recipe]


def func(pct, allvals):
    absolute = int(pct/100.*np.sum(allvals))
    return "{:.1f}%".format(pct, absolute)


wedges, texts, autotexts = ax.pie(data, autopct=lambda pct: func(pct, data),
                                  textprops=dict(color="w"))

ax.legend(wedges, ingredients,
          title="Genders",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

plt.setp(autotexts, size=8, weight="bold")

ax.set_title("Males and Females in Low Income Bracket")

plt.show()

fig, ax = plt.subplots(figsize=(6, 3), subplot_kw=dict(aspect="equal"))

recipe = ['Male','Female']

data =  [malesMediumRollers , femalesMediumRollers]
ingredients = [x.split()[-1] for x in recipe]


def func(pct, allvals):
    absolute = int(pct/100.*np.sum(allvals))
    return "{:.1f}%".format(pct, absolute)


wedges, texts, autotexts = ax.pie(data, autopct=lambda pct: func(pct, data),
                                  textprops=dict(color="w"))

ax.legend(wedges, ingredients,
          title="Genders",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

plt.setp(autotexts, size=8, weight="bold")

ax.set_title("Males and Females in Medium Income Bracket")

plt.show()



**You can see that women make up a larger share of the low income group (17.20%) than of the high income group (11.70%).**

From analysis of the cumulative frequency graphs we found the medians of both income distributions.

**Around 30,000 dollars for males** and **between 15,000 and 20,000 dollars for females**

It is evident that females, on average, earn less than males in the demographic represented by the dataset.

We have seen that women and men have nearly identical qualifications in terms of education. Hence differences in education cannot explain the gender pay gap. Then what is causing this? Some plausible explanations are as follows:

**Job flexibility**

All over the world women tend to do more unpaid care work at home than men – and women tend to be overrepresented in low paying jobs where they have the flexibility required to attend to these additional responsibilities.

**The motherhood penalty**

Closely related to job flexibility and occupational choice is the issue of work interruptions due to motherhood. On this front there is again a great deal of evidence in support of the so-called ‘motherhood penalty’.

**Discrimination and bias**

Independently of the exact origin of the unequal distribution of gender roles, it is clear that our recent and even current practices show that these roles persist with the help of institutional enforcement. Goldin (1988) for instance, examines past prohibitions against the training and employment of married women in the US. She touches on some well-known restrictions, such as those against the training and employment of women as doctors and lawyers, before focusing on the lesser known but even more impactful ‘marriage bars’ which arose in the late 1800s and early 1900s. These work prohibitions are important because they applied to teaching and clerical jobs – occupations that would become the most commonly held among married women after 1950. Around the time the US entered World War II, it is estimated that 87% of all school boards would not hire a married woman and 70% would not retain an unmarried woman who married.



Research into the disparity between the incomes of men and women has revealed that,

> **Women all over the world are underrepresented in high-profile jobs, which tend to be better paid. As it turns out, in many countries women are at the same time overrepresented in low-paying jobs.**

**Let us see whether this is the case in our dataset.**



In [None]:
maleJobCount = male_columns['Q5'].value_counts().reset_index() # variable for storing number of respondents in each job


percentageJob_m = [] # percentage of males in each job

for i in maleJobCount.Q5:
  percentageJob_m.append( (i / sum(maleJobCount.Q5.values)) * 100 )


femaleJobCount = female_columns['Q5'].value_counts().reset_index() # variable for storing number of respondents in each job


percentageJob_f = [] # percentage of females in each job

for i in femaleJobCount.Q5:
  percentageJob_f.append( (i / sum(femaleJobCount.Q5.values)) * 100 )

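job_titles = maleJobCount['index'].values
x_p = np.arange(len(job_titles))

# Align the female percentages to the male job order so each pair of bars refers to the same job title
female_pct_by_job = dict(zip(femaleJobCount['index'].values, percentageJob_f))
percentageJob_f_aligned = [female_pct_by_job.get(job, 0) for job in job_titles]

plt.figure(figsize=(15, 7))
plt.bar(x_p-0.2,percentageJob_m, width=0.4, label='Males', color='blue',alpha = 0.8)
plt.bar(x_p+0.2, percentageJob_f_aligned, width=0.4, label='Females', color = '#FF5106')
#give title
plt.title(' Percentage of Men and Women In Each Job')
plt.xticks(x_p,job_titles,rotation=90)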
plt.xlabel('Job Title')
plt.ylabel('Percentage of Gender')
#plt.plot(ypos, Year) will remove the numbers along the y-axis
#shows legend
plt.legend()
#show to show the graph
plt.show()


There are some interesting facts to consider from the graph above,

* The most prevalent job titles in both genders are **Students** (although technically not a job title) and **Data Scientists** 

* Within the males, there is an extremely high number of **Students**, even outnumbering the **Data Scientists**. There is also a comparatively high number of **Data Analysts**, who outnumber the **Software Engineers**.
 
* Within the females, the **Students** outnumber the **Data Scientists** by a comparatively wider margin. In addition, the number of **Data Analysts** and **Software Engineers** are almost the same.

So, the key observation is :
* Most people in the dataset are **Students**
* Most people are employed as either **Data Scientists, Software Engineers or Data Analysts** (in order of prevalence)

Men outnumber women in all categories, irrespective of whether it is a high-profile or a low paid job. **A greater percentage of women than men are Students, unemployed or Data Analysts. Moreover, a greater percentage of men than women are Data Scientists. Considering that Data Scientist and Software Engineer are the highest paid jobs in our dataset, women are indeed underrepresented, as shown by the graph above.**

In addition, considering the median starting salaries of the prevalent jobs within females:
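When two groups are compared job by job, the percentages have to be lined up on the job title, since each gender's `value_counts()` comes back in its own order. Here is a minimal sketch of one way to do that; the helper `job_shares` and the toy Series are made up for this illustration and are not survey data.

```python
import pandas as pd


def job_shares(male_jobs: pd.Series, female_jobs: pd.Series) -> pd.DataFrame:
    """Percentage of each gender holding each job title, aligned on the job title."""
    shares = pd.concat(
        {
            'Male': male_jobs.value_counts(normalize=True),
            'Female': female_jobs.value_counts(normalize=True),
        },
        axis=1,
    ).fillna(0) * 100  # a job title missing for one gender counts as 0%
    return shares.sort_values('Male', ascending=False)


# Hypothetical toy answers to Q5.
toy_male = pd.Series(['Data Scientist', 'Data Scientist', 'Student', 'Software Engineer'])
toy_female = pd.Series(['Student', 'Student', 'Data Scientist', 'Data Analyst'])

shares = job_shares(toy_male, toy_female)
print(shares)
```

Calling `shares.plot.bar()` on the result would draw one pair of bars per job title, with both bars guaranteed to describe the same job.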

* Data Scientists receive 95,000 dollars
* Software Engineers receive 84,336 dollars
* Data Analysts receive 34,500 dollars

It is unusual to see that the median salary of the females in our dataset is only around 15,000 to 20,000 dollars despite them being employed in high paying positions.

This infographic can provide further context for you:

<img src="http://businessoverbroadway.com/wp-content/uploads/2018/04/median_salary_world_DS_MLE.png" width="1000px">


*Although the job titles are not entirely the same as the ones in our dataset, we can get a general idea of what the salaries look like.*

*Despite all being very lucrative professions, the women in our dataset receive significantly less remuneration than their male counterparts. This we have observed from our preceding analysis.*

<a id="section-four"></a>
<font size="5">In Conclusion,</font>

From our analysis of the data from the survey and additional sources (cited below), there is no doubt that a gender pay gap exists within the demographic represented by the data and in society as a whole. We have looked at whether women are underqualified, and they are not. There are plausible explanations for the existence of the pay gap, which we have looked at, but they cannot be proved using the data provided. Although our analysis is simple, it does show that the gender pay gap exists within the data science community.

To sum it  all up,

**All over the world men tend to earn more than women.**

**There are significantly fewer women working in data science and machine learning than men**

**Having the same job title as men does not mean women receive the same amount of money**


<img src="https://infographic.statista.com/normal/chartoftheday_4467_female_employees_at_tech_companies_n.jpg" width="1000px">

<a id="section-five"></a>
<font size="5">Gender Pay Gap within Data Science ?</font>

Our general discussion of the gender pay gap issue ended in the previous section. 

I think it would be interesting to look at the men and women within the dataset who identified themselves as **Data Scientists**. Specifically, let us see how their earnings differ.

In [None]:
labels = ('Males','Females')
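sizes = [(male_columns['Q5'] == 'Data Scientist').sum(), (female_columns['Q5'] == 'Data Scientist').sum()]  # count Data Scientists by label, not by row position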
explode = (0, 0.1)  # only "explode" the 2nd slice (i.e. 'Females')

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.title("Percentages of Data Scientists Who Are Males And Females", bbox={'facecolor':'0.8', 'pad':5})
plt.show() 

As you can see, the majority of Data Scientists are males (85%) with only 15% being women. **It is safe to say women are underrepresented in the Data Scientist community as a whole.** *(although our dataset does not reflect the entire community)*

According to research conducted by **Harnham** *(a Data & Analytics recruitment firm)*,

* The gender pay gap in data science shrank from 9.4% to 8.4% over the past year.

* Within the US, efforts have been made to get more women into science, technology, engineering, and mathematics (STEM) roles.

* While more women than men are receiving graduate degrees in STEM subjects, the ratio of female to male professionals across data and analytics professions has dropped slightly in the last year.

* Concerns at the executive level remain. The report found that the pay gap widens to 11% at senior levels of data and analytics jobs. 

* While 29% of women occupy entry level positions in data and analytics, only 19% occupy technical leads, and 13% occupy director positions, the research found. 

In addition to the aforementioned points, one point in particular is the same as our findings in the pie chart above :

> Data science and data and technology are the two largest areas in the data and analytics field, and they are also the most disproportionate, with **84% and 83% occupied by men**, respectively. In total, **only 23% of data and analytics roles are held by women.**

**These findings raise concerns over why women are underrepresented within the Data Science community.**

In [None]:
maleDataScientistColumns = male_columns[male_columns['Q5']=='Data Scientist'] # DataFrame containing male data

femaleDataScientistColumns = female_columns[female_columns['Q5']=='Data Scientist'] # DataFrame containing female data

maleEarningsCount = maleDataScientistColumns['Q10'].value_counts()[:59].reset_index() 
maleEarningsCount = maleEarningsCount.reindex([0, 11, 18 , 19, 21, 17 , 16, 3 , 14, 10, 12,  4, 2, 5, 8, 9, 15, 13, 1, 6, 7, 20 , 22, 24, 23])

percentageEarnings_m = [] # percentage of males 

for i in maleEarningsCount.Q10:
  percentageEarnings_m.append( (i / sum(maleEarningsCount.Q10.values)) * 100 )



femaleEarningsCount = femaleDataScientistColumns['Q10'].value_counts().reset_index()  
femaleEarningsCount = femaleEarningsCount.reindex([0, 8, 14 , 18 , 20 , 19 , 17, 7, 13, 16, 10, 4 ,3 ,9 ,6 , 12, 11, 15, 1,5 , 2 , 21, 24,22 , 23])
femaleEarningsCount.at[24,'index']= '250,000-299,999'
femaleEarningsCount.at[24,'Q10']= 0

femaleEarningsCount.at[23,'index']= '> 500,000'
femaleEarningsCount.at[23,'Q10']= 0

percentageEarnings_f = [] # percentage of females 

for i in femaleEarningsCount['Q10']:
  percentageEarnings_f.append( (i / sum(femaleEarningsCount['Q10'].values)) * 100 )

x_p = np.arange(len(femaleEarningsCount['index'].values))

x = x_p
new_x = x_p
plt.figure(figsize=(15, 5))
# the first call is as usual
plt.bar(new_x, percentageEarnings_m,color = 'blue')

# the second one is special to create stacked bar plots
plt.bar(new_x, percentageEarnings_f, bottom=percentageEarnings_m, color= '#FF5106')
plt.title('Percentage of Male and Female Data Scientists In Each Income Bracket')
plt.xlabel('Incomes')  
plt.ylabel('Percentage Of Respondents (%)')  # the bars show percentages, not raw counts
plt.xticks(new_x,maleEarningsCount['index'].values,rotation=85)
plt.legend(['Male','Female'])
plt.show()  


At first glance, the income distributions look very similar in terms of trend. But the graph below tells a different story.

In [None]:

# Here we calculate the average and plot it
incomes = [499.5, 1499.5 ,2499.5 ,3499.5, 4499.5, 6249.5, 8749.5, 12499.5, 17499.5, 22499.5, 27499.5, 34999.5, 44999.5, 54999.5, 64999.5, 74999.5, 84999.5, 94999.5, 112499.5, 137499.5, 174999.5,224999.5, 274999.5,400000,750000]


maleDataScientist_average = sum(np.multiply(maleEarningsCount['Q10'],incomes)) / sum(maleEarningsCount['Q10'])


femaleDataScientist_average = sum(np.multiply(femaleEarningsCount['Q10'],incomes)) / sum(femaleEarningsCount['Q10'])

objects = ('Male','Female')

plt.figure(figsize=(15, 10))
plt.bar(0,maleDataScientist_average , width=0.4, label='Males', color='blue')
plt.bar(1, femaleDataScientist_average, width=0.4, label='Females', color = '#FF5106')
plt.xticks([0,1], objects)
plt.ylabel('Average Earnings')
plt.title('Average Earnings Of Male and Female Data Scientists')

plt.show()


##############
# Et Voila ! #
##############

# Disclaimer : Please take everything with a grain of salt, it is very hard to convey sarcasm in a kaggle kernel.

As you can see, male Data Scientists in the dataset earned **64,000 dollars** on average, while their female counterparts earned only **55,000 dollars**.

**This difference of 9,000 dollars shows that a gender pay gap exists among Data Scientists.** *(at least among those represented in the dataset)*

It is possible that this difference can be explained by other variables which we do not have access to. However, others who have conducted extensive research into the gender pay gap have found that, even after accounting for the variables they could measure, an 'unexplained' difference remains, which suggests that gender plays a part in determining your salary.
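The averages above are count-weighted means of income-bracket midpoints: every respondent is treated as earning the midpoint of their bracket, and the open-ended top bracket needs an assumed value (the notebook uses 750,000). A minimal sketch of that arithmetic, using made-up counts and a shortened list of brackets rather than the survey data:

```python
import numpy as np

# Hypothetical bracket midpoints (USD) and respondent counts, for illustration only
midpoints = np.array([5_000, 15_000, 45_000, 95_000])
male_counts = np.array([10, 20, 50, 20])
female_counts = np.array([20, 30, 40, 10])

def bracket_mean(counts, midpoints):
    """Count-weighted mean income, assuming everyone earns their bracket's midpoint."""
    return float(np.average(midpoints, weights=counts))

male_mean = bracket_mean(male_counts, midpoints)      # 45000.0
female_mean = bracket_mean(female_counts, midpoints)  # 33000.0
gap = male_mean - female_mean                         # 12000.0
print(male_mean, female_mean, gap)
```

Because only the bracket is known, the result is an approximation; it is most sensitive to the value assumed for the top bracket.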

<a id="section-six"></a>
<font size="5">Can a neural network 'see' the gender divide ?</font>

We humans know from the data and our own reasoning that the gender pay gap exists, but can a neural network detect it?

I thought it would be interesting to train an extremely simple neural network to predict a person's income from their gender, country of residence, and highest level of education (columns `Q2`, `Q3`, and `Q4`), and then see whether the incomes predicted for males and females differ.

**An overview of the model**

The model we will be using is a simple and deliberately naive neural network implemented with Keras. It is a sequential model with a handful of dense layers. I did not aim for sophistication, so feel free to tamper with it.
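One thing worth tampering with is the encoding of the inputs. The notebook turns each category into an integer code, which quietly implies an ordering between countries. A possible alternative, not used in this notebook, is one-hot encoding with `pd.get_dummies`. The toy rows below are invented for illustration:

```python
import pandas as pd

# Toy rows standing in for the survey columns Q2 (gender) and Q3 (country)
toy = pd.DataFrame({
    "Q2": ["Male", "Female", "Female", "Male"],
    "Q3": ["India", "Brazil", "India", "Japan"],
})

# Integer codes (the notebook's approach): compact, but Japan > India > Brazil is meaningless
codes = toy["Q3"].astype("category").cat.codes

# One-hot columns: one 0/1 indicator per category, with no implied order
one_hot = pd.get_dummies(toy, columns=["Q2", "Q3"], dtype=float)

print(list(codes))
print(one_hot.columns.tolist())
```

With one-hot inputs, the first `Dense` layer would need an input size of `one_hot.shape[1]` rather than 3.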

*Also, a low-key disclaimer: the model will produce different results every time you run it, because its weights are initialised randomly and the dataset is small. If my evaluation of the results does not match the illustrations word for word, please use your keen intelligence to bridge the gap. These discrepancies will not be severe; I just thought you should know that they might exist. The results are, nevertheless, consistent with the premise established.*
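If you would like the runs to be repeatable, one option is to fix the random seeds before building the model. The helper below is a hypothetical sketch and not part of the original notebook. It assumes a Keras release recent enough to provide `keras.utils.set_random_seed`, and skips that step otherwise:

```python
import random
import numpy as np

def seed_everything(seed=40, with_keras=False):
    """Seed Python and NumPy; optionally seed Keras too (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    if with_keras:
        try:
            import keras
            # Present in recent Keras releases; also seeds the backend generator
            keras.utils.set_random_seed(seed)
        except (ImportError, AttributeError):
            pass  # Keras missing or too old: only Python and NumPy are seeded

# Demonstration with NumPy only: the same seed gives the same random draws
seed_everything(40)
first = np.random.rand(3)
seed_everything(40)
second = np.random.rand(3)
print(np.allclose(first, second))
```

In the notebook this would be called as `seed_everything(40, with_keras=True)` before `Sequential()`. Some GPU operations can still be non-deterministic, so small differences may remain.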

In [None]:
# Here we prepare the data to feed the neural network with

df = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
df = df[['Q2','Q3','Q4','Q10']]

df = df.iloc[1:,]

df['Q10'] = df.Q10.map({'$0-999' : 500,
'1,000-1,999': 1500,
'2,000-2,999' : 2500,
'3,000-3,999' : 3500,
'4,000-4,999': 4500,
'5,000-7,499' : 6250,
'7,500-9,999' : 8750,
'10,000-14,999' : 12500,
'15,000-19,999' : 17500,
'20,000-24,999' : 22500,
'25,000-29,999':  27500,
'30,000-39,999': 35000,
'40,000-49,999': 45000,
'50,000-59,999': 55000,
'60,000-69,999': 65000,
'70,000-79,999': 75000,
'80,000-89,999': 85000,
'90,000-99,999': 95000,
'100,000-124,999': 112500,
'125,000-149,999': 137500,
'150,000-199,999': 175000,
'200,000-249,999': 225000,
'250,000-299,999': 275000,
'300,000-500,000': 400000,
'> $500,000': 750000})

cleanup_nums = {"Q2": {"Male": 1, "Female": 2,"Prefer not to say" : 3,'Prefer to self-describe':4}} # label encoding
df.replace(cleanup_nums, inplace=True) # implementing the gender label encoding

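# filling null values
# Q10 and Q2 are numeric at this point, so fill them with the number 0 (not the string "0");
# note that respondents who did not report an income are therefore treated as earning 0
df = df.fillna({"Q10": 0})
df = df.fillna({"Q2": 0})
df = df.fillna({"Q3": "0"})
df = df.fillna({"Q4": "0"})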

# label encoding other columns
df[["Q3"]] = df[["Q3"]].astype('category')  
df[["Q4"]] = df[["Q4"]].astype('category')

df["Q3"] = df["Q3"].cat.codes
df["Q4"] = df["Q4"].cat.codes


# main input and output x and y

x = df[['Q2','Q3','Q4']]
y = df['Q10']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=40)  #splitting into test and train
#print(X_train.shape); print(X_test.shape)

# model formation
model = Sequential()
model.add(Dense(128, input_dim=3, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(1))
model.compile(optimizer= 'rmsprop',loss = "mse", metrics=['mae'])
model.fit(X_train,y_train, epochs=50, verbose=0)

test_males = X_test[X_test['Q2']==1]  # males in test set
test_females = X_test[X_test['Q2']==2]   #females in test set

#predictions
m = model.predict(test_males)  

f = model.predict(test_females)

In [None]:
average_males = float(np.mean(m))  # m has shape (n, 1); reduce it to a scalar

average_females = float(np.mean(f))

objects = ('Male','Female')

plt.figure(figsize=(15, 10))
plt.bar(0,average_males , width=0.4, label='Males', color='blue')
plt.bar(1,average_females, width=0.4, label='Females', color = '#FF5106')
plt.xticks([0,1], objects)
plt.ylabel('Average Earnings')
plt.title('Average Earnings Predicted By The Model')

plt.show()


From the graph above, it can be seen that the average income predicted for males is 38,000 dollars, while for females it is 26,600 dollars. This chasm between the averages shows that the neural network has picked up the gender pay gap that exists in the dataset. I hope you can feel the gravity of the situation.
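The two averages compare different groups of people, so part of the difference may come from the other inputs rather than from gender itself. One way to probe what the model attributes to gender alone is to predict for the same rows twice, once with every row coded as male and once as female. This is a sketch that is not part of the original analysis; `counterfactual_gap` and `toy_predict` are hypothetical names, and `toy_predict` merely stands in for `model.predict`:

```python
import numpy as np
import pandas as pd

MALE, FEMALE = 1, 2  # same label encoding as the notebook's Q2 column

def counterfactual_gap(predict, X, gender_col="Q2"):
    """Mean change in predicted income when only the gender code is switched."""
    as_male = X.copy()
    as_male[gender_col] = MALE
    as_female = X.copy()
    as_female[gender_col] = FEMALE
    diff = np.ravel(predict(as_male)) - np.ravel(predict(as_female))
    return float(diff.mean())

def toy_predict(X):
    """Hypothetical stand-in for model.predict: pays 1000 less when Q2 == FEMALE."""
    income = 30_000 + 500 * X["Q3"] - 1_000 * (X["Q2"] == FEMALE)
    return income.to_numpy().reshape(-1, 1)

X_toy = pd.DataFrame({"Q2": [1, 2, 2, 1], "Q3": [0, 3, 5, 7], "Q4": [1, 1, 2, 0]})
gap = counterfactual_gap(toy_predict, X_toy)
print(gap)
```

With the trained network, the equivalent call would be `counterfactual_gap(model.predict, X_test)`.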

In [None]:
# this section is dedicated entirely to the scatter plot

men = X_test[X_test['Q2']==1]
women = X_test[X_test['Q2']==2]

men_pred = model.predict(men)
women_pred = model.predict(women)

# Build a tidy frame with one row per prediction: the gender label and the predicted income.
# predict() returns arrays of shape (n, 1), so flatten them before storing them in a column.
tips = pd.DataFrame({
    'Gender': ['Male'] * len(men_pred) + ['Female'] * len(women_pred),
    'Incomes': np.concatenate([men_pred.ravel(), women_pred.ravel()]),
})

plt.style.use('seaborn-whitegrid')

sns.catplot(x='Gender', y='Incomes',data = tips)

The scatter plot above shows the individual predictions for the men and women in the test set. For most males, the predicted income is in the 20,000 to 40,000 dollar range, although some predictions stretch beyond 100,000 dollars. For the women, the predictions are concentrated slightly lower than for the males, and their range is much narrower: all women were predicted to earn less than 60,000 dollars.

In [None]:

# making box plot
sns.boxplot( x=tips["Gender"], y=tips["Incomes"] )

The median of the predicted incomes is higher for males. The lower (25%) and upper (75%) quartiles are also farther apart for males, which shows that the spread of the predictions for females is smaller. You can see this easily by comparing the heights of the boxes in the box and whisker plot above. In addition, there is a significant difference between the male and female upper bounds.

Another point I would like to add is that there are outliers in the predictions for males, which stretch well beyond the upper bound, but none for the females.
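The quantities read off the box plot can also be computed directly. The sketch below uses made-up numbers in the same layout as the `tips` frame and assumes seaborn's default rule of drawing points more than 1.5 times the interquartile range above the upper quartile as outliers:

```python
import pandas as pd

# Made-up predictions in the same layout as the notebook's `tips` frame
toy_tips = pd.DataFrame({
    "Gender": ["Male"] * 5 + ["Female"] * 5,
    "Incomes": [20_000, 30_000, 40_000, 50_000, 120_000,
                18_000, 22_000, 26_000, 30_000, 34_000],
})

# Quartiles per gender: rows are genders, columns are q1 / median / q3
q = toy_tips.groupby("Gender")["Incomes"].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ["q1", "median", "q3"]
q["iqr"] = q["q3"] - q["q1"]  # the height of each box

# Count the points above the upper fence (q3 + 1.5 * IQR) for each gender
fence = toy_tips["Gender"].map(q["q3"] + 1.5 * q["iqr"])
q["n_outliers"] = (toy_tips["Incomes"] > fence).groupby(toy_tips["Gender"]).sum()

print(q)
```

Running the same steps on `tips` gives the numbers behind the boxes, which makes the comparison independent of how the plot is scaled.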

<a id="section-seven"></a>
<font size="4">Afterthoughts,</font>

**Sorry for the less-than-professional coding style and any mistakes; I am self-taught in every aspect of data science and in using Kaggle.**
**I would very much appreciate your comments.**
**Thank you for reading.**

*Data from the following sources have been sprinkled throughout the notebook.*


* Wikipedia  https://en.wikipedia.org/wiki/Man
* Wikipedia  https://en.wikipedia.org/wiki/Gender_pay_gap#India
* Wikipedia  https://en.wikipedia.org/wiki/Gender_pay_gap
* Wikipedia  https://en.wikipedia.org/wiki/Gender_pay_gap#United_States
* Wikipedia  https://en.wikipedia.org/wiki/Woman
* Economic Inequality by Gender   https://ourworldindata.org/economic-inequality-by-gender#differences-in-pay
* Why is there a gender pay gap?  https://ourworldindata.org/what-drives-the-gender-pay-gap
* Unadjusted Gender Pay Gap in average hourly wages https://ourworldindata.org/grapher/gender-gap-in-average-wages-ilo
* Median Starting Salary of Software Engineers https://www.payscale.com/research/US/Job=Software_Engineer/Salary
* Median Starting Salary of Data Scientists https://datasciencedegree.wisconsin.edu/data-science/data-scientist-salary/
* Harnham research article https://www.techrepublic.com/article/the-data-science-gender-pay-gap-is-shrinking-barely/

* Title  https://towardsdatascience.com/please-mind-the-gender-pay-gap-9162f13b4202
* Some inspiration https://www.kaggle.com/fchmiel/who-will-your-analysis-story-be-about

As a side note, for the sake of keeping everything simple, I did not take into account genders other than male and female which may be in the dataset.

Disclaimer: Please take some things with a grain of salt; it is very hard to convey sarcasm in a Kaggle kernel.