# The Diversity Issue in Data Science: Fact, Fiction or Advocacy? 2019 - 2021

Diversity in data science has gained increasing attention in the last few years. This study evaluates the gender bias claims levelled against data science and data scientists: are these claims based on facts, are they fiction, or are they just advocacy meant to stir up emotions on a particular subject? To provide background for this study, we look at some of these claims:


>“As data science professionals advance in their careers, the percentage of women decreases significantly. Among the most advanced individual contributors, 6% of data scientists are female; 10% of executive managers are female.” To make matters worse, they are paid 10,000 dollars less per year, on average, for the same work. ... Fewer than 3 percent of data scientists are women of colour, fewer than 5 percent are Latino, fewer than 4 percent are African-American, and fewer than 0.5 percent are Native American.
[Read more](https://medium.com/stem-and-culture-chronicle/crunching-the-numbers-on-diversity-in-data-science-events-resources-to-foster-inclusion-5dc81d2ab52#:~:text=In%20fact%2C%20of%20all%20the,ranks%20the%20lowest%20in%20diversity.&text=Data%20suggest%20that%20fewer%20than,than%200.5%25%20are%20Native%20American.)


There have been suggestions from several quarters that there is a conscious effort to undermine women in data science:

>From middle school through graduate school, girls and women are said to “leak” out of computer science, math and other fields that typically lead to careers in data science. Women who have succeeded in these fields think otherwise. "The pipeline isn’t leaky so much as it’s toxic, they say, lined with practices that can devalue the presence and successes of women and people of colour." ... Fifty percent of women in STEM reported that they have experienced discrimination on the job.
[Read more](https://msmagazine.com/2021/07/26/data-science-diversity-gender-women-stem/)

>"The biggest concern is: how can a field so susceptible to bias such as data science and AI be driven by a workforce whose demographic is so skewed in favour of men?" ... "In 2016, researchers at the North Carolina State University made international headlines following a study where they revealed that out of 1.4 million pull requests on GitHub, 78.6 percent of pull requests made by women were accepted compared with 74.6 percent of those by men. However, when their gender was identifiable, the acceptance rate dropped to 62.5 percent."
[Read more](https://towardsdatascience.com/the-harsh-reality-about-being-a-woman-in-ai-and-data-science-cc6a61f9cddc)


The fact that Google and Facebook reported in 2018 that women made up 10 percent and 15 percent of their AI workforce respectively has fueled the claims of gender bias in favour of men in every facet of data science.

>In 2020, at the global level (AI and other departments), "women made up 30% of the employees at Microsoft, 32% at Google, 45% at Amazon and 37% at Facebook. In fact, these numbers have only slightly improved, if at all, over the past 6 years." [Read more](http://businessoverbroadway.com/2021/03/08/gender-inequality-persists-in-data-science-and-ai/)

It is becoming increasingly difficult to dismiss these claims with a wave of the hand and to assume that the situation will correct itself. Therefore, we put these claims side by side with evidence from data, using the results of the Kaggle Annual Survey from 2019 to 2021. We also draw on Stack Overflow's Developer Survey for the same period in the course of the study.

---
### Objectives
- [x] To evaluate the plausibility of the diversity issues in data science and determine the veracity of such claims
- [x] To do a comparative analysis using Kaggle Survey data from 2019 to 2021 to evaluate the claims that data science has marginalized women or that the structure of data science makes it difficult for women to cope
- [x] To investigate whether the problem is a data science problem or a social problem by comparing men's and women's responses in both the Kaggle and Stack Overflow surveys for the same period (2019 - 2021)
- [x] To visualize and analyse data for every country represented in the Kaggle Survey from 2019 to 2021, to see the true position of things
- [x] To classify companies according to number of employees to get a picture of the ratio of women to men
- [x] To see whether compensation disparities exist between genders among data scientists
- [x] To evaluate whether expenditure on Machine Learning and Cloud Computing can explain the compensation disparity, if it exists
- [x] To make the analysis very simple for non-data scientists and the coding very intuitive for data science beginners

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import chart_studio.plotly as py
import plotly
import plotly.io as pio
import plotly.express as px
import cufflinks as cf
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')

In [None]:
class color:
    RED = '\033[91m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    BLUE = '\033[94m'
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    B = '\033[97m'
    BOLD = '\033[1m'
    DARKCYAN = '\033[36m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'  

In [None]:
dataset19 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
dataset20 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
dataset21 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
data19 = dataset19.drop([0],axis=0)
data20 = dataset20.drop([0],axis=0)
data21 = dataset21.drop([0],axis=0)
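
The `drop([0])` calls above remove the first data row, which in the Kaggle survey exports holds the question text rather than a respondent's answers. A minimal sketch of the idea on a made-up two-column file (the `toy_csv` contents and the `toy_` names are invented for illustration and are not taken from the survey):

```python
import io
import pandas as pd

# Hypothetical miniature of a survey export: a header row of question codes,
# one row of question text, then the actual responses.
toy_csv = io.StringIO(
    "Q1,Q2\n"
    "What is your age (# years)?,What is your gender?\n"
    "25-29,Man\n"
    "30-34,Woman\n"
)
toy_raw = pd.read_csv(toy_csv)

# Drop the question-text row (index label 0) and renumber the responses.
toy_responses = toy_raw.drop(index=0).reset_index(drop=True)
print(toy_responses)
```

Without this step the question text would be counted as one more category in every column.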

# Section 1: Women Representation in Data Science 2019 - 2021

In [None]:
# Read the necessary data for section 1

bios19 = data19[['Q1','Q2','Q3','Q6','Q10','Q11','Q15']].rename(columns={'Q1':'Age_Group',
                                                       'Q2':'Gender',
                                                       'Q3':'Country',
                                                       'Q6':'Company_Size',
                                                       'Q10':'Annual_Income',
                                                       'Q11':'ML_Cloud_Exp',
                                                        'Q15':'Coding Years'})

bios20 = data20[['Q1','Q2','Q3','Q6','Q20','Q24','Q25']].rename(columns={'Q1':'Age_Group',
                                                        'Q2':'Gender',
                                                        'Q3':'Country',
                                                        'Q20': 'Company_Size',
                                                        'Q24':'Annual_Income',
                                                        'Q25':'ML_Cloud_Exp',
                                                        'Q6':'Coding Years'})

bios21 = data21[['Q1','Q2','Q3','Q6','Q21','Q25','Q26']].rename(columns={'Q1':'Age_Group',
                                                        'Q2':'Gender',
                                                        'Q3':'Country',
                                                        'Q21': 'Company_Size',
                                                        'Q25':'Annual_Income',
                                                        'Q26':'ML_Cloud_Exp',
                                                        'Q6':'Coding Years'})

# Insert years for each year
bios19.insert(0,'Year','2019')
bios20.insert(0,'Year','2020')
bios21.insert(0,'Year','2021')

# combine the data from 2019 to 2021
BG_Data = pd.concat([bios19, bios20, bios21])
BG_Data.insert(5,'Counter',0)

In [None]:
# Data housekeeping: replacements
BG_Data['Gender'] = BG_Data['Gender'].replace('Male','Man')
BG_Data['Gender'] = BG_Data['Gender'].replace('Female','Woman')

BG_Data['Country'] = BG_Data['Country'].replace('United Kingdom of Great Britain and Northern Ireland','United Kingdom')
BG_Data['Country'] = BG_Data['Country'].replace('Iran, Islamic Republic of...','Iran')
BG_Data['Country'] = BG_Data['Country'].replace('Other','Other Countries')
BG_Data['Country'] = BG_Data['Country'].replace('United States of America','USA')
BG_Data['Country'] = BG_Data['Country'].replace('United Arab Emirates','UAE')
BG_Data['Country'] = BG_Data['Country'].replace('I do not wish to disclose my location','Undisclosed')

In [None]:
# Slice for man and woman genders only
ManWoman = BG_Data[(BG_Data['Gender']=='Man') | (BG_Data['Gender']=='Woman')]

In [None]:
# Table showing percentage of women participation in data science 
# This also prepares the ground for report and visuals generation

a = ManWoman.groupby(['Country'])['Counter'].count().reset_index()
b = ManWoman.groupby(['Country','Gender'])['Counter'].count().reset_index()
c = pd.merge(a,b, on='Country').rename(columns={'Counter_x':'Respondents','Counter_y':'Counter'})

d = c[(c['Gender']=='Woman')].copy()  # copy so the new column is not written to a slice of c
d['Women_Participation_Percent'] = round((d['Counter']/d['Respondents']),2)*100
d = d.sort_values('Women_Participation_Percent', ascending=False)

d1 = d.sort_values('Women_Participation_Percent', ascending=False)[0:15]
d2 = d.sort_values('Women_Participation_Percent', ascending=True)[0:15]
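
The percentages in this table can be cross-checked with `pd.crosstab`, which normalizes each country row in a single step. The sketch below runs on an invented frame (the `toy` rows are illustrative and are not survey data):

```python
import pandas as pd

# Invented respondents: country A has 3 men and 1 woman, country B has 1 of each.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B"],
    "Gender":  ["Man", "Man", "Man", "Woman", "Man", "Woman"],
})

# Row-normalised cross-tabulation: the shares in each country row sum to 1.
share = pd.crosstab(toy["Country"], toy["Gender"], normalize="index")

# Scale to percent first and round afterwards, which avoids float artefacts.
women_pct = (share["Woman"] * 100).round(0).sort_values(ascending=False)
print(women_pct)
```

Applied to `ManWoman`, the same two calls should reproduce the `Women_Participation_Percent` column up to rounding.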

In [None]:
# Generating reports and Visuals for each country 

print(color.BOLD+color.BLUE+color.UNDERLINE+
      'Countries Gender Analysis from the Country with the Best Women Representation to the Least 2019 - 2021','\n'+color.END)

# Countries are arranged in descending order of women's participation percentage
numba = 0
country = d['Country'].unique()
for c in country:
    numba +=1
    tot = ManWoman[(ManWoman['Country']==c)]
    print(color.BLUE,color.BOLD,'\n',numba,'Country Name:',c,'|','Total =',len(tot),color.END)
    k = tot.groupby(['Year','Gender'])['Counter'].count().reset_index().sort_values('Counter',ascending=False)
    # Report every survey year, even those with no respondents from this country
    year = ManWoman['Year'].unique()
    for y in year:
        tot2 = tot[(tot['Country']==c) & (tot['Year']==y)]
        print(color.BOLD+color.GREEN+'\n',
              'Year =',y,'|',
              'Total =',len(tot2),color.END)
    
        number1=0
        gender = ManWoman['Gender'].unique()
        for g in gender:
            number1+=1
            k2 = tot2[(tot2['Country']==c) & (tot2['Gender']==g) & (tot2['Year']==y)]
            per = round(len(k2)/len(tot2),2)*100 if len(tot2)!=0 else''
            print(number1,g,'=',len(k2),'|',per,'%')
     
    print(color.BOLD+color.GREEN+'\n Cumulative(2019 to 2021) =',len(tot),color.END)
    gender = ManWoman['Gender'].unique()
    for g in gender:
        tot3 = tot[(tot['Country']==c) & (tot['Gender']==g)]
        average = round(len(tot3)/len(tot),2)*100 if len(tot)!=0 else''
        print(g,'=',len(tot3),'|',average,'%')
    
    
# The Table that yielded the visuals 
    #print('\n',k.reset_index(drop=True),'\n')
# Simple Country Specific Visualization
    plt.title(c, fontsize=14)
    sns.barplot(x ='Gender',y ='Counter', hue='Year',data=k)
    plt.xlabel('Gender',fontsize=12)
    plt.ylabel('Respondents',fontsize=12)
    plt.show()

# Section One: Findings

In [None]:
number=0
for i in [d1,d2]:
    number+=1
    print(color.BOLD+color.GREEN,'Visuals:',number,color.END)
    if number == 1:
        print(color.BLUE,color.BOLD,'Top 15 Countries with the Best Women Participation in Data Science 2019-2021')
    else:
        print(color.BLUE,color.BOLD,'Bottom 15 Countries with the Lowest Women Participation in Data Science 2019-2021')
    fig = go.Figure()
    fig = px.bar(i, y='Women_Participation_Percent', x ='Country',text='Women_Participation_Percent')
    fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
    fig.update_layout(uniformtext_minsize=8)
    fig.update_layout(xaxis_tickangle=-45)
    fig.show('notebook')

### The Curious Case of Tunisia

Tunisia returned the highest percentage of women respondents, 41%. This result is praiseworthy in a world where women's representation in most countries is below 20%. We must therefore take a look at what is happening in Tunisia with respect to women in data science. Can the world learn anything from Tunisia?


>According to Antonius Verheijen (2020) on World Bank Blogs, the problems of gender inequality persist in Tunisia and the road to equality remains long. However, Tunisian women have a literacy rate of 72%, represent 67% of higher education graduates and hold 36% of parliamentary positions. [Read more here.](https://blogs.worldbank.org/arabvoices/status-women-tunisian-society-endangered)

>It is worth noting that 23%, achieved in 2019, is the highest share of women in Congress in United States history. [Read more on country ratings of women in national parliaments here](http://archive.ipu.org/wmn-e/classif.htm)

>Apparently, Tunisian women are not new to data science accolades. The first and third places went to Tunisian women in the Womxn in Big Data South Africa: Female-Headed Households in South Africa Challenge. The competition attracted 452 data scientists worldwide, with 40% women participation. [Read more here](https://zindi.africa/learn/meet-the-winners-of-the-womxn-in-big-data-south-africa-female-headed-households-in-south-africa-challenge)

>UNESCO reported in 2019 that 65 percent of Tunisians with a Bachelor's degree and 69 percent of PhD holders were female ... 55.1 percent of Tunisian researchers are female, the largest proportion in Africa and in the Arab world. [Read more here](https://northafricapost.com/46498-tunisian-women-lead-african-arab-women-in-science.html). These numbers are higher than those of the USA, where 53% of all PhD degrees awarded in 2019 went to women. As a matter of fact, women in US graduate schools outnumber men for master's and doctoral degrees (and for enrollments in 7 out of 11 academic fields), and they have earned more doctoral degrees than men for over a decade. The poor showing of US women in data science is traceable to the fact that men dominate the quantitative fields: engineering (74.9 percent), mathematics and computer sciences (73.2 percent), physical and earth sciences (64.9 percent). The dominance of women is pronounced in fields like Public Administration, Health and Medical Sciences, Education, and Social and Behavioural Sciences. [Read more here](https://www.aei.org/uncategorized/women-earned-majority-of-doctoral-degrees-in-2019-for-11th-straight-year-and-outnumber-men-in-grad-school-141-to-100/). [Check the table here](https://www.aei.org/wp-content/uploads/2020/10/GradDoc2020.png?x91208)

From the foregoing, there appears to be a conscious and sincere effort to bridge the gender gap in Tunisia. The country may not be the best with respect to gender equality; however, its efforts are yielding fruit for the present and the future. To this end, the rest of the world may have something to learn from Tunisia, especially with respect to how it has been able to place women's issues on the front burner in a predominantly Islamic country.

### Is the Problem Specific to Data Science?
**Kaggle vs. StackOverflow**

In [None]:
# stack19 = Stack Overflow's Developers' Survey Dataset for 2019
# stack20 = Stack Overflow's Developers' Survey Dataset for 2020
# stack21 = Stack Overflow's Developers' Survey Dataset for 2021

stack19 = pd.read_csv('../input/stack-overflow-developer-survey-results-2019/survey_results_public.csv')
stack20 = pd.read_csv('../input/stack-overflow-developer-survey-2020/survey_results_public.csv')
stack21 = pd.read_csv('../input/stack-overflow-developer-survey-results-2021/survey_results_public.csv')

In [None]:
# Slice out the Gender column - the only column important to us from the developers' survey
stack_Gender19 = stack19[['Gender']].copy()  # copy so insert() below does not touch the raw data
stack_Gender20 = stack20[['Gender']].copy()
stack_Gender21 = stack21[['Gender']].copy()

# Insert the different years for each year's dataset
stack_Gender19.insert(0,'Year','2019')
stack_Gender20.insert(0,'Year','2020')
stack_Gender21.insert(0,'Year','2021')

# Combine Stack Overflows' dataset for the 3 years together in a single dataset 
stack_Gender = pd.concat([stack_Gender19, stack_Gender20, stack_Gender21])
stack_Gender.insert(0,'Platform','StackOverflow')

In [None]:
# The Kaggle Gender Dataset

kaggle_Gender = BG_Data[['Year','Gender']].copy()
kaggle_Gender.insert(0,'Platform','Kaggle')

In [None]:
# Combine Kaggle and Stack Overflow's datasest together
Kaggle_Stack = pd.concat([kaggle_Gender,stack_Gender])

#Create the Counter column useful for groupby
Kaggle_Stack.insert(3,'Counter',0)

In [None]:
# I am only interested in the Man and Woman gender because that is our scope and because they comprise over 97% of the total
# Slice out the man and woman datasets

ManWoman = Kaggle_Stack[(Kaggle_Stack['Gender']=='Man') | (Kaggle_Stack['Gender']=='Woman')]
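
One thing to keep in mind about this filter: Stack Overflow's gender question allows several selections, and the public CSV appears to store them in one cell separated by semicolons (treat this as an assumption to verify against the downloaded files). An exact match on 'Man' or 'Woman' therefore keeps single-answer rows only. A toy illustration with invented values:

```python
import pandas as pd

# Invented answers, including one multi-select response and one missing value.
toy_gender = pd.Series(
    ["Man", "Woman", "Man;Non-binary, genderqueer, or gender non-conforming", None, "Man"],
    name="Gender",
)

# Exact match, as in the cell above: keeps single-answer rows only.
exact = toy_gender.isin(["Man", "Woman"])

# Rows containing a semicolon are multi-select answers and are left out.
multi = toy_gender.str.contains(";", na=False)

print("kept:", int(exact.sum()), "| multi-select left out:", int(multi.sum()))
```

Counting `multi` on the real data would show how many respondents the exact-match filter excludes.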

In [None]:
# Print a report comparing the ratio of men to women on each platform for each year, then visualize it with catplot
# Note that the analysis here is based on the sliced data of men and women and not on the overall dataset

# code alert!! nested for loops generate the same numbers as the catplot below, but in a report format

platform = ManWoman['Platform'].unique()
for p in platform:
    print(color.BOLD,color.GREEN,'\n',p,'Platform Gender Representation',color.END)
    year = ManWoman['Year'].unique()
    for y in year:
        tot = ManWoman[(ManWoman['Platform']==p) & (ManWoman['Year']==y)]
        print(color.BLUE,color.BOLD,'\n',y, '|','Total =',len(tot),color.END)
        gender = ManWoman['Gender'].unique()
        for g in gender:
            k = ManWoman[(ManWoman['Platform']==p) & (ManWoman['Year']==y) & (ManWoman['Gender']==g)]
            per = round(len(k)/len(tot),2)*100 if len(tot)!= 0 else''
            print(g,'=',len(k), '|',per,'%')
    
# visualization alert!! i love this seaborn catplot format it enables me combine 4 columns together at once
Man_Woman_Groupby = ManWoman.groupby(['Platform','Year','Gender']).count().reset_index().sort_values('Counter',ascending=False)
print(color.BOLD,color.GREEN,'\nMen vs. Women Representations on Kaggle and Stackoverflow Platforms 2019 - 2021',color.END)
sns.catplot(x='Gender',y='Counter',hue='Year',col='Platform',kind='bar',data=Man_Woman_Groupby)
plt.show()

###### Findings
Percentage of women among respondents identifying as Man or Woman on:
1. Kaggle Platform for Data Scientists:
    1. 2019 - 17 percent
    2. 2020 - 20 percent
    3. 2021 - 19 percent
2. Stackoverflow Platform for Software Developers:
    1. 2019 - 8 percent
    2. 2020 - 8 percent
    3. 2021 - 5 percent  

Other findings include:

3. The percentage of women in software development is falling: from 8% in 2019 and 2020 to 5% in 2021
4. For data science, it increased from 17% in 2019 to 20% in 2020 despite the pandemic but dropped to 19% in 2021
5. Both platforms took a hit from the pandemic; numbers in absolute terms dropped in 2020
6. In absolute numbers, more women are being attracted to data science than to software development. Data science has more women respondents than software development in 2020 (3,878 to 3,844) and 2021 (4,890 to 4,120). This is despite software development recording 6,344 women in 2019.

###### Is the Problem Data Science Specific?  
- [x] The report above shows that software development is worse-off compared to data science
- [x] Therefore, the diversity issue is not data science specific
- [x] Can we absolve data science considering its very recent history compared to other disciplines? Find out in the next section

# Section 2: Evaluating Recruitments Gender Gap in Companies
### Do we have gender disparity in large, medium and small companies?

The Objective here is to classify companies with respect to number of employees and look at the percentage of women to men. Conventionally, the sizes of companies are classified with respect to the number of employees thus:

1. 250 employees and above = Large Company
2. 50-249 employees = Medium Company
3. 0-49 employees = Small Company

[Read more](https://en.m.wikipedia.org/wiki/Small_and_medium-sized_enterprises#:~:text=Microentreprises%3A%201%20to%209%20employees,enterprises%3A%20250%20employees%20or%20more)
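
The thresholds above can be sketched as a small helper that reads the lower bound of a size label and assigns a class. `classify_company` is a hypothetical name introduced here for illustration; the labels passed to it are the ones that appear in the Kaggle company-size column used later in this notebook.

```python
import re

def classify_company(size_label):
    """Map a survey size label to Small / Medium / Large by its lower bound."""
    # Take the first number in the label, allowing thousands separators.
    match = re.search(r"\d[\d,]*", size_label)
    if match is None:
        return None
    lower_bound = int(match.group().replace(",", ""))
    if lower_bound >= 250:
        return "Large"
    if lower_bound >= 50:
        return "Medium"
    return "Small"

for label in ["0-49 employees", "50-249 employees", "250-999 employees",
              "1000-9,999 employees", "> 10,000 employees", "10,000 or more employees"]:
    print(label, "->", classify_company(label))
```

An explicit dictionary of labels does the same job and is easier to audit; a helper like this only becomes useful if the survey wording changes between years.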

The literature on the gender gap in data science is replete with evidence of lopsided recruitment patterns in favour of men.
>Data compiled by magazine Wired and software firm Element AI finds that, at major technology companies, no more than 15% of cited artificial intelligence researchers are women; Google AI’s records show that only about 60 of 641 listed machine intelligence specialists are women. Across the three biggest conferences in the sector in 2017, only 12% of contributors were female ...
>Participation in the main online communities for sector professionals reveals even further under-representation of women. For each of Data Science Central, Kaggle, and OpenML, the proportion of female users is 17-18%; for Stack Overflow, it is just 7.9% ... women across the sector tend to be better-qualified; 59% possess a graduate or post-graduate degree, compared with 55% of men, according to analysis of data drawn from almost 20,000 LinkedIn profiles. [Sam Trendall (2021)](https://www.publictechnology.net/articles/features/does-data-science-have-dangerous-gender-gap)


We also have reports of recruitment tools that discriminate against female applicants. A natural underrepresentation due to circumstances beyond the control of the employer may be excusable, but rigging recruitment in favour of a particular gender is hard to believe, and unacceptable if true. Sam Trendall (2021) cites the Alan Turing Institute report, which lists a number of examples of built-in bias against women in data science recruitment:
>...among them a machine learning-powered recruitment tool developed by Amazon that discriminated against female applicants on the basis of their gender, a social media chatbot that swiftly learned the language of racist and misogynistic hate speech, and an algorithm for generating pictures that tended to fill out cropped pictures of men by dressing them in suits, while women were kitted out in bikinis.
>“Marketing algorithms have disproportionally shown scientific job advertisements to men,” the report adds. “The introduction of automated hiring is particularly concerning, as the fewer the number of women employed within the AI sector, the higher the potential for future AI hiring systems to exhibit and reinforce gender bias.”[Sam Trendall(2021)](https://www.publictechnology.net/articles/features/does-data-science-have-dangerous-gender-gap)

In [None]:
# We are interested in the year, gender, country and company-size columns, so slice them out of the main dataset created above

all_coys = BG_Data[['Year','Gender','Country','Company_Size']].copy()
#all_coys

In [None]:
# Classify and replace the companies with respect to number of employees as specified above

all_coys['Company_Classification'] = all_coys['Company_Size'].replace({'1000-9,999 employees':'Large',
                                                                      '> 10,000 employees':'Large',
                                                                      '250-999 employees':'Large',
                                                                      '10,000 or more employees':'Large',
                                                                      '50-249 employees':'Medium',
                                                                      '0-49 employees':'Small',})

# Check unique values to see if classification is correctly done
#all_coys['Company_Classification'].unique()
all_coys.insert(5, 'Counter',0)

In [None]:
# Slice for Man and Woman from the final table
Man_Woman_Size = all_coys[(all_coys['Gender']=='Man') | (all_coys['Gender']=='Woman')]
#Man_Woman_Size

In [None]:
# Group by Gender, Year and company classification to give the final picture of men and women in different company sizes
Man_Woman_Groupby = Man_Woman_Size.groupby(['Year','Gender','Company_Classification'])['Counter'].count().reset_index().sort_values('Counter',ascending=False)
#Man_Woman_Groupby

In [None]:
# Visualize and generated a mini report showing the ratio of men to women in the different company's sizes 
# with respect to different years
# Note that the analysis here is based on the sliced data of men and women for the Company_Classification 
# and not on the overall dataset

# Nested for loops build the report component by component; the catplot after the loop visualizes the same counts
# I need to look at the number of women or men in each company classification per year


print(color.BLUE,color.BOLD,'Employees by Gender,by Company Sizes and by Years',color.END)
# Separate for company sizes
size = Man_Woman_Groupby['Company_Classification'].unique()
for s in size:
    tot = Man_Woman_Size[(Man_Woman_Size['Company_Classification']==s)]
    print(color.BOLD,color.GREEN,'\n',s,'Companies',color.END)
    
# Build the years into it 
    yea = tot['Year'].value_counts().reset_index(name='Count').rename(columns={'index':'Year'})
    year = yea['Year'].unique()
    for y in year:
        tot2 = tot[(tot['Year']==y)]
        print(color.BLUE,color.BOLD,'\n',y, '|','Total =',len(tot2),color.END)

# Account for Gender
        for g in Man_Woman_Size['Gender'].unique():
            k = tot2[(tot2['Gender']==g)]
            per = round(len(k)/len(tot2),2)*100 if len(tot2)!=0 else''
            print(g,'=',len(k), '|',per,'%')
    
# visualize the above table with catplot: disaggregated into 'Company_Classification' by years
print(color.BOLD,'\n Employees by Gender,by Company Sizes and by Years ',color.END)
sns.catplot(x='Gender',y='Counter',hue='Year',col='Company_Classification',kind='bar',data=Man_Woman_Groupby)
plt.show()

In [None]:
# Gender Recruitment Gaps in Countries

# Objective: Identify gender recruitment gap figures in each country with respect to years and company sizes

# To this end, I use nested for loops again, same as above

print(color.BLUE,color.BOLD,'Employees by Gender,by Countries, by Company Sizes and by Years',color.END)
number=0
coun = all_coys['Country'].value_counts().reset_index(name='Count').rename(columns={'index':'Country'})
country = coun['Country'].unique()
for c in country:
    number+=1
    tot = all_coys[(all_coys['Country']==c)]
    tot2 = tot[(tot['Gender']=='Man') | (tot['Gender']=='Woman')]
    #k = tot2.groupby(['Year','Company_Classification'])['Counter'].count().reset_index().sort_values('Counter',ascending=False)
    Groupby = tot2.groupby(['Year','Gender','Company_Classification'])['Counter'].count().reset_index().sort_values('Counter',ascending=False)
    print(color.BOLD+color.GREEN+'\n',
          number,'Country Name:',c,
          '|','Total Respondents(2019-2021) =',
          len(tot),color.END)
    
    
    for s in size:
        print(color.BOLD,color.GREEN,'\n',s,'Companies',color.END)
        yea = tot['Year'].value_counts().reset_index(name='Count').rename(columns={'index':'Year'})
        year = yea['Year'].unique()
        for y in year:
            tot3 = tot2[(tot2['Company_Classification']==s) & (tot2['Year']==y)]
            print(color.BLUE,color.BOLD,'\n',y, '|','Total =',len(tot3),color.END)
            for g in Man_Woman_Size['Gender'].unique():
                k1 = tot3[(tot3['Gender']==g)]
                per = round(len(k1)/len(tot3),2)*100 if len(tot3)!=0 else''
                print(g,'=',len(k1), '|',per,'%')
                
# Recruitment Disparity Calculations
# Slice for the man and woman separately
            man = Groupby[(Groupby['Gender']=='Man') & (Groupby['Year']==y) & (Groupby['Company_Classification']==s)]
            woman = Groupby[(Groupby['Gender']=='Woman') & (Groupby['Year']==y) & (Groupby['Company_Classification']==s)]
            disparity = (man['Counter'].unique() - woman['Counter'].unique())
            disPer = ((disparity)/(len(tot3))).round(2)*100
            print(*disparity,'or',*disPer,'percent more men than women')
            
    print(color.BOLD,'\nEmployees by Gender by Company Sizes and by Years in',c,color.END)            
    sns.catplot(x='Gender',y='Counter',hue='Year',
                col='Company_Classification',
                kind='bar',data=Groupby)
    plt.show()

# Section 2: Findings

In [None]:
# Create the recruitment disparity and disparity Percent table for each country for each company size and for each gender

slice1 = Man_Woman_Size.groupby(['Country','Company_Classification'])['Counter'].count().reset_index()
slice2 = Man_Woman_Size.groupby(['Gender','Country','Company_Classification'])['Counter'].count().reset_index()
slice12 = pd.merge(slice1,slice2, on=['Country','Company_Classification']).rename(columns={'Counter_x':'Counter','Counter_y':'Counter_Gender'})

# slice for man and woman
top10_Woman = slice12[(slice12['Gender']=='Woman')].rename(columns={'Gender': 'Gender_Woman','Counter_Gender':'Counter_Woman'})
top10_Man = slice12[(slice12['Gender']=='Man')].rename(columns={'Gender': 'Gender_Man','Counter_Gender':'Counter_Man'})
top10 = pd.merge(top10_Man,top10_Woman, on=['Country','Company_Classification']).rename(columns={'Counter_x':'Total'}).drop('Counter_y',axis=1)

# At each company level
Small_Coy = top10[(top10['Gender_Man'].notnull()) & (top10['Company_Classification']=='Small')].copy()
Large_Coy = top10[(top10['Gender_Man'].notnull()) & (top10['Company_Classification']=='Large')].copy()
Medium_Coy = top10[(top10['Gender_Man'].notnull()) & (top10['Company_Classification']=='Medium')].copy()

# Disparity Per Company Size
Small_Coy['Recruitment_Disparity'] = (Small_Coy['Counter_Man'] - Small_Coy['Counter_Woman'])
Large_Coy['Recruitment_Disparity'] = (Large_Coy['Counter_Man'] - Large_Coy['Counter_Woman'])
Medium_Coy['Recruitment_Disparity'] = (Medium_Coy['Counter_Man'] - Medium_Coy['Counter_Woman'])

# Disparity Percent Per Company Size
Small_Coy['Recruitment_Disparity_Percent'] = round(Small_Coy['Recruitment_Disparity']/(Small_Coy['Total']),2)*100
Large_Coy['Recruitment_Disparity_Percent'] = round((Large_Coy['Recruitment_Disparity'])/(Large_Coy['Total']),2)*100
Medium_Coy['Recruitment_Disparity_Percent'] = round((Medium_Coy['Recruitment_Disparity'])/(Medium_Coy['Total']),2)*100

# Top 10 Lowest Percentage Disparity for each Company Size
top10_Lowest_Small = Small_Coy.sort_values('Recruitment_Disparity_Percent').reset_index(drop=True).head(10)
top10_Lowest_Large = Large_Coy.sort_values('Recruitment_Disparity_Percent').reset_index(drop=True).head(10)
top10_Lowest_Medium = Medium_Coy.sort_values('Recruitment_Disparity_Percent').reset_index(drop=True).head(10)

# Top 10 Highest Percentage Disparity for each Company Size
top10_Highest_Small = Small_Coy.sort_values('Recruitment_Disparity_Percent',ascending=False).reset_index(drop=True).head(10)
top10_Highest_Large = Large_Coy.sort_values('Recruitment_Disparity_Percent',ascending=False).reset_index(drop=True).head(10)
top10_Highest_Medium = Medium_Coy.sort_values('Recruitment_Disparity_Percent',ascending=False).reset_index(drop=True).head(10)
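
To make the two measures concrete: the disparity is the count of men minus the count of women, and the disparity percent divides that difference by the combined count. A worked sketch with invented counts (`recruitment_disparity` is a hypothetical helper, and the numbers 80 and 20 are illustrative, not survey results):

```python
def recruitment_disparity(men, women):
    """Return (men - women, that difference as a percent of men + women)."""
    total = men + women
    if total == 0:
        return 0, None  # no respondents: the percentage is undefined
    gap = men - women
    return gap, round(100 * gap / total)

gap, gap_pct = recruitment_disparity(men=80, women=20)
print(gap, gap_pct)
```

Because the percent is taken over men plus women, it equals 100 minus twice the women's share, so a 20% share of women shows up as a 60% gap.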


In [None]:
number=0
for i in [top10_Lowest_Small,top10_Lowest_Large,top10_Lowest_Medium]:
    number+=1
    print(color.BOLD+color.BLUE,'Visuals:',number,color.END)
    if number == 1:
        print('10 Countries with the Lowest Recruitment Gap Percent in Small-Sized Companies')
    elif number == 2:
        print('10 Countries with the Lowest Recruitment Gap Percent in Large Companies')
    else:
        print('10 Countries with the Lowest Recruitment Gap Percent in Medium-Sized Companies')
    fig = go.Figure()
    fig = px.bar(i, y='Recruitment_Disparity_Percent', x ='Country',text='Recruitment_Disparity_Percent')
    fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
    fig.update_layout(uniformtext_minsize=8)
    fig.update_layout(xaxis_tickangle=-45)
    fig.show('notebook')

In [None]:
number=0
for i in [top10_Highest_Small,top10_Highest_Large,top10_Highest_Medium]:
    number+=1
    print(color.BOLD+color.BLUE,'Visuals:',number,color.END)
    if number == 1:
        print(color.GREEN,color.BOLD,'10 Countries with the Highest Recruitment Gap Percent in Small-Sized Companies')
    elif number == 2:
        print(color.GREEN,color.BOLD,'10 Countries with the Highest Recruitment Gap Percent in Large Companies')
    else:
        print(color.GREEN,color.BOLD,'10 Countries with the Highest Recruitment Gap Percent in Medium-Sized Companies')
    fig = go.Figure()
    fig = px.bar(i, y='Recruitment_Disparity_Percent', x ='Country',text='Recruitment_Disparity_Percent')
    fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
    fig.update_layout(uniformtext_minsize=8)
    fig.update_layout(xaxis_tickangle=-45)
    fig.show('notebook')

# Section 3A: Gender Income Gap in Data Science
- [x] Is there an income or compensation gap in data science which is gender based?
- [x] Do we have a company-size dimension to gender income disparity?
- [x] In the same vein, is there a machine learning/cloud computing expenditure gap in data science?
- [x] Is this gap in expenditure enough to explain the income gap?
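
The first two questions come down to comparing group means. As a minimal sketch on invented figures (the column names follow the ones used later in this notebook; the `toy` frame and its numbers are made up for illustration and are not survey results):

```python
import pandas as pd

# Invented figures, for illustration only
toy = pd.DataFrame({
    'Gender': ['Man', 'Man', 'Woman', 'Woman', 'Man', 'Woman'],
    'Company_Classification': ['Small', 'Small', 'Small', 'Small', 'Large', 'Large'],
    'Average_Income': [40000.0, 60000.0, 30000.0, 50000.0, 90000.0, 75000.0],
})

# Mean income per company size and gender, with one column per gender
means = (toy.groupby(['Company_Classification', 'Gender'])['Average_Income']
            .mean()
            .unstack('Gender'))

# Positive values mean men report the higher mean income
means['Income_Disparity'] = means['Man'] - means['Woman']
print(means)
```

A positive `Income_Disparity` means men report a higher mean income in that group; a negative value means women do.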


This section becomes important when you consider screaming headlines like "Data Science Hasn't Fixed Its Huge Gender Pay Gap" on venturebeat.com. In that article, Kyle Wiggers (2021) cites O’Reilly’s 2021 Data/AI Salary Survey, which found that:

>...Women’s salaries are significantly lower than men’s, equating to 84% of the average salary for men regardless of education or job title. For example, at the executive level, the average salary for women was 163,000 dollars versus 205,000 dollars for men, — a 20% difference. [Read more](https://www.google.com/amp/s/venturebeat.com/2021/09/14/data-science-hasnt-fixed-its-huge-gender-pay-gap/amp/)
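
The percentage in a report like this depends on which salary is taken as the base. A small worked example using the two executive-level figures quoted above (the function name `pay_gap_percent` is my own, and expressing the gap relative to men's pay is an assumption about how the quoted 20% was obtained):

```python
def pay_gap_percent(income_man, income_woman):
    """Gap expressed as a percentage of the man's income (assumed base)."""
    return (income_man - income_woman) / income_man * 100

# Executive-level averages quoted in the article
gap = pay_gap_percent(205_000, 163_000)
print(round(gap, 1))
```

Taking women's pay as the base instead would give a larger figure for the same two salaries, so the base should always be stated alongside the percentage.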

In another article titled "The data science gender pay gap is shrinking—barely", Macy Bayern cites research from Harnham that provides more evidence on the subject. The report was published in 2019, which falls within the scope of our study.
>The gender pay gap in data science in the US shrunk from 9.4% to 8.4% over the past year, according to research from Harnham. While some improvement is seen at the entry and mid-levels, the executive level has more room for improvement. However, concerns at the executive level remain, the pay gap widens to 11% at senior levels of data and analytics jobs.

>The gender pay gap in Europe is larger than in the US. The smallest is Germany, at 4%, while the largest is Spain, at 25%. Across countries, however, the pay gap follows a similar pattern to the US in higher level positions. For all regions, the pay gap widens when moving from a mid-level to a director level position.

>The UK has seen the most success in decreasing its data and analytics gender pay gap, dropping from 13.3% in 2018 to 7.2% in 2019. Similar to the other regions, the pay gap widens. [Read more](https://www.google.com/amp/s/www.techrepublic.com/google-amp/article/the-data-science-gender-pay-gap-is-shrinking-barely/)

*Long Report and Coding Alert!!! We have to look at the countries individually to verify the veracity of these claims. Kindly loosen your seat belts and walk with me. I hope to make this as interesting as possible* 
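
The survey reports income as text ranges rather than numbers, so each range has to be reduced to a lower limit, an upper limit and a midpoint before any averaging. A minimal sketch of that idea in plain Python (the helper `parse_bracket` and the sample labels are illustrative assumptions, not the notebook's actual cleaning code, which works on whole pandas columns):

```python
import re

def parse_bracket(label):
    """Return (lower, upper, midpoint) for an income range given as text.

    Open-ended ranges such as '> $500,000' have no upper limit,
    so upper and midpoint are returned as None.
    """
    cleaned = re.sub(r'[$,>\s]', '', label)   # drop '$', ',', '>' and spaces
    parts = cleaned.split('-')
    lower = float(parts[0])
    upper = float(parts[1]) if len(parts) > 1 else None
    midpoint = (lower + upper) / 2 if upper is not None else None
    return lower, upper, midpoint

# Hypothetical labels in the style of the survey answers
print(parse_bracket('1,000-1,999'))
print(parse_bracket('> $500,000'))
```

Because open-ended ranges have no midpoint, any average built on the midpoint silently leaves out the highest earners, which is worth keeping in mind when reading the "Middle Limit" figures.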

In [None]:
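# I am interested in the year, gender, country, annual income and ML/cloud expenditure columns, so slice them out of the main dataset created above
# .copy() gives an independent frame, so the column assignments below do not trigger SettingWithCopyWarning

All_Income = BG_Data[['Year','Gender','Country','Annual_Income','ML_Cloud_Exp']].copy()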
All_Income.insert(3,'Company_Classification',all_coys['Company_Classification'])
#All_Income

In [None]:
# I am not interested in the number of respondents here. Previous findings have shown beyond doubt that a gender gap exists
# I want to see whether women are paid less than men for the same job, as claimed
# The income data are given as text ranges, so some conversions are necessary
# To find the average income, split each range into two columns:
# 'Annual_IncomeA' for the lower limit
# 'Annual_IncomeB' for the upper limit
# Both columns are added and the result divided by 2 to give the average
# This gives us 3 limits (lower, middle and upper) which we shall use in the course of the study

All_Income[['Annual_IncomeA', 'Annual_IncomeB']] = All_Income['Annual_Income'].str.split('-',expand=True)

In [None]:
# Perform some cleaning on the dataset to make it workable for the objectives we set out to achieve
# Strip ',', '>' and '$', then cast to float; ranges without an upper limit leave Annual_IncomeB as NaN
# float is used because the abstract np.number type is deprecated as a dtype in NumPy

All_Income['Annual_IncomeA'] = All_Income['Annual_IncomeA'].replace(',','',regex=True)
All_Income['Annual_IncomeA'] = All_Income['Annual_IncomeA'].replace('>','',regex=True)
All_Income['Annual_IncomeA'] = All_Income['Annual_IncomeA'].replace(r'\$','',regex=True).astype(float)
All_Income['Annual_IncomeB'] = All_Income['Annual_IncomeB'].replace(',','',regex=True).astype(float)
All_Income['Average_Income'] = (All_Income['Annual_IncomeA'] + All_Income['Annual_IncomeB'])/2
All_Income = All_Income.sort_values('Average_Income',ascending=False)

In [None]:
# Expenditure on ML and Cloud Computing
# Each bracket is replaced by its upper limit; for the open-ended top bracket ('$100,000 or more ($USD)' and '> $100,000 ($USD)') I assume an upper limit of 200,000
All_Income['ML_Cloud_Exp'] = All_Income['ML_Cloud_Exp'].replace({'$0 ($USD)':0,
                                                 '$100,000 or more ($USD)':200000,
                                                 '$10,000-$99,999':99999,
                                                 '$1000-$9,999':9999,
                                                 '$1-$99':99,
                                                 '$0 (USD)':0,
                                                 '> $100,000 ($USD)':200000,
                                                 '$100-$999':999}).astype(float)

In [None]:
# Slice the final data for men and women only
Man_Woman = All_Income[(All_Income['Gender']=='Man') | (All_Income['Gender']=='Woman')].reset_index(drop=True)
Man_Woman.insert(9,'Count',0)
#Man_Woman

In [None]:
# Create Disparity Table

a = Man_Woman.groupby(['Gender','Country'])['Annual_IncomeA'].mean().reset_index()
b = a[(a['Gender']=='Woman')].rename(columns={'Gender': 'Gender_Woman','Annual_IncomeA':'Income_Woman'})
c = a[(a['Gender']=='Man')].rename(columns={'Gender': 'Gender_Man','Annual_IncomeA':'Income_Man'})

d = pd.merge(c,b, on='Country')
d['Income_Disparity'] = round((d['Income_Man'] - d['Income_Woman']),2)
d = d.sort_values('Income_Disparity').reset_index(drop=True)
#d = d[0:10]
#d

In [None]:
print(color.GREEN+color.BOLD,'Gender Income Disparity Reports 2019 - 2021',color.END)

# Take the 'Year' column first
yea = Man_Woman['Year'].value_counts().reset_index(name='Count').rename(columns={'index':'Year'})
year = yea['Year'].unique()
for y in year:
    print(color.BLUE+color.BOLD,'\nYear =',y,color.END,color.GREEN,color.BOLD,'\n\nMean and Median (Upper Limit) Annual Income',color.END)  
# Next 'Gender'    
    number=0
    for g in Man_Woman['Gender'].unique():
        number+=1
# Total for each year
        tot = Man_Woman[(Man_Woman['Year']== y)]
        k = tot[(tot['Gender']== g) & (tot['Year']== y)]
# mean and median compensations calculations
        Mean_Compensation = round(k['Annual_IncomeB'].mean(),2)
        Median_Compensation = k['Annual_IncomeB'].median()
# print mean and median compensations 
        print(g,':','Mean =',Mean_Compensation, '|','Median =', Median_Compensation)

# Income Disparity Calculations
# Slice for the man and woman separately
    man = tot[(tot['Gender']=='Man') & (tot['Year']==y)]
    woman = tot[(tot['Gender']=='Woman') & (tot['Year']==y)]
    
# disparityA uses Annual_IncomeA(lower limit) 
# disparityB uses Annual_IncomeB(upper limit)
# disparityAv uses Average_Income
    disparityA = round((man['Annual_IncomeA'].mean()) - (woman['Annual_IncomeA'].mean()),2)
    disparityB = round((man['Annual_IncomeB'].mean()) - (woman['Annual_IncomeB'].mean()),2)
    disparityAv = round((man['Average_Income'].mean()) - (woman['Average_Income'].mean()),2)

# Print the disparities
    print(color.GREEN,color.BOLD,'\nDisparities:',color.END)
    print('Middle Limit =',color.BOLD,disparityAv,color.END)
    print('Lower Limit =',color.BOLD,disparityA,color.END)
    print('Upper Limit =',color.BOLD,disparityB,color.END)

plt.title('Aggregate Average Data Science Income Disparity 2019 - 2021')
sns.barplot(x='Year',y='Average_Income',hue='Gender',data=Man_Woman)
plt.show()

In [None]:
# Report and visualization of mean and median compensation by gender and company size for each year
# For mean and median income, the upper limit (Annual_IncomeB) is used

print(color.GREEN+color.BOLD,'Income Disparities in Different Company Sizes 2019 to 2021',color.END)
for y in year:
    print('\n',color.UNDERLINE,'           ',color.END,color.BLUE+color.BOLD,'\n\nYear =',y,color.END)
    for s in size:
        print(color.BOLD,color.BLUE,'\nCompany Size:',s,color.END,color.GREEN,color.BOLD,
              '\n\nMean and Median (Upper Limit) Annual Income',color.END)
        for g in Man_Woman['Gender'].unique():
            tot = Man_Woman[(Man_Woman['Year']== y)]
            k = tot[(tot['Gender']== g) & (tot['Year']== y) & (tot['Company_Classification']== s)]
            Mean_Compensation = round(k['Annual_IncomeB'].mean(),2)
            Median_Compensation = k['Annual_IncomeB'].median()
            print(g,':', 'Mean =',Mean_Compensation, '|','Median =', Median_Compensation)

            
# Income Disparity Calculations
# Slice for the man and woman separately
        man = tot[(tot['Gender']=='Man') & (tot['Year']==y) & (tot['Company_Classification']== s)]
        woman = tot[(tot['Gender']=='Woman') & (tot['Year']==y) & (tot['Company_Classification']== s)]
    
# disparityA uses Annual_IncomeA(lower limit) 
# disparityB uses Annual_IncomeB(upper limit)
# disparityAv uses Average_Income
        disparityA = round((man['Annual_IncomeA'].mean()) - (woman['Annual_IncomeA'].mean()),2)
        disparityB = round((man['Annual_IncomeB'].mean()) - (woman['Annual_IncomeB'].mean()),2)
        disparityAv = round((man['Average_Income'].mean()) - (woman['Average_Income'].mean()),2)
# Print the report
        print(color.GREEN,color.BOLD,'\nIncome Disparities:',s,'Companies',color.END)
        print('Middle Limit =',color.BOLD,disparityAv,color.END)
        print('Lower Limit =',color.BOLD,disparityA,color.END)
        print('Upper Limit =',color.BOLD,disparityB,color.END)
#         print('The man earned more by',color.BOLD+color.PURPLE,disparityAv,'or',disparityA,
#               'or',disparityB,color.END,'in',s,'companies'
#               '(middle,lower and upper limits respectively) in',y)


print(color.BOLD,color.UNDERLINE,'\nAggregate Annual Compensation with Respect to Gender and Company Sizes (2019 - 2021)')
sns.catplot(x='Gender',y='Annual_IncomeB',hue='Year',
            col='Company_Classification',
            kind='bar',data = Man_Woman)
plt.show()

In [None]:
# Report and visualization of mean and median income by gender for each country and year

print(color.GREEN+color.BOLD,'Aggregate Gender Income Disparity Reports by Countries 2019 - 2021',color.END)
number=0
for c in country:
    number+=1
    tot = Man_Woman[(Man_Woman['Country']==c)]
    yea = tot['Year'].value_counts().reset_index(name ='Count').rename(columns={'index':'Year'})
    year = yea['Year'].unique()
    print(color.BOLD+color.GREEN+'\n',number,'Country Name:',c,color.END)

    for y in year:
        print(color.BLUE+color.BOLD,'\nYear =',y,color.END,color.GREEN,color.BOLD,'\n\nMean and Median (Upper Limit) Annual Income',color.END)
        number1=0
        for g in Man_Woman['Gender'].unique():
            number1+=1
            tot2 = tot[(tot['Year']== y)]
            k = tot2[(tot2['Gender']== g)]
            Mean_Compensation = round(k['Annual_IncomeB'].mean(),2)
            Median_Compensation = k['Annual_IncomeB'].median()
            print(g,':','Mean =',Mean_Compensation, '|','Median =', Median_Compensation)

# Income Disparity Calculations
# Slice for the man and woman separately
        man = tot[(tot['Gender']=='Man') & (tot['Year']==y)]
        woman = tot[(tot['Gender']=='Woman') & (tot['Year']==y)]
    
# disparityA uses Annual_IncomeA(lower limit) 
# disparityB uses Annual_IncomeB(upper limit)
# disparityAv uses Average_Income(Middle Limit)
        disparityA = round((man['Annual_IncomeA'].mean()) - (woman['Annual_IncomeA'].mean()),2)
        disparityB = round((man['Annual_IncomeB'].mean()) - (woman['Annual_IncomeB'].mean()),2)
        disparityAv = round((man['Average_Income'].mean()) - (woman['Average_Income'].mean()),2)

# Print the report
        print(color.GREEN,color.BOLD,'\nDisparities:',color.END)
        print('Middle Limit =',color.BOLD,disparityAv,color.END)
        print('Lower Limit =',color.BOLD,disparityA,color.END)
        print('Upper Limit =',color.BOLD,disparityB,color.END)
#     print('The man earned',color.BOLD+color.PURPLE,disparityA,'or',
#           disparityB,'or',
#           disparityAv, color.END,
#           '(depending on which limit you use) more than the',g,'in',y)
    plt.title(c,fontsize=15)
    sns.barplot(x='Gender',y='Annual_IncomeB',hue='Year',data=tot)
    plt.show()

In [None]:
# Report and visualization of mean and median income by country, gender and company size for each year
# For mean and median income, the upper limit (Annual_IncomeB) is used
print(color.GREEN+color.BOLD,'Income Disparities by Countries, by Gender and by Different Company Sizes 2019 to 2021',color.END)

number=0
for c in country:
    number+=1
    tot = All_Income[(All_Income['Country']==c)]
    tot2 = tot[(tot['Gender']=='Man') | (tot['Gender']=='Woman')]
    yea = tot2['Year'].value_counts().reset_index(name ='Count').rename(columns={'index':'Year'})
    year = yea['Year'].unique()
    print(color.BOLD+color.GREEN+'\n',number,'Country Name:',c,color.END)
    
    for y in year:
        print('\n',color.UNDERLINE,'           ',color.END,color.BLUE+color.BOLD,'\n\nYear =',y,color.END)
        for s in size:
            print(color.BOLD,color.BLUE,'\nCompany Size:',s,color.END,color.GREEN,color.BOLD,
                  '\n\nMean and Median (Upper Limit) Annual Income',color.END)
            for g in Man_Woman['Gender'].unique():
                k = tot2[(tot2['Gender']== g) & (tot2['Year']== y) & (tot2['Company_Classification']== s)]
                Mean_Compensation = round(k['Annual_IncomeB'].mean(),2)
                Median_Compensation = k['Annual_IncomeB'].median()
                print(g,':', 'Mean =',Mean_Compensation, '|','Median =', Median_Compensation)

            
# Income Disparity Calculations
# Slice for the man and woman separately
            man = tot2[(tot2['Gender']=='Man') & (tot2['Year']==y) & (tot2['Company_Classification']== s)]
            woman = tot2[(tot2['Gender']=='Woman') & (tot2['Year']==y) & (tot2['Company_Classification']== s)]
    
# disparityA uses Annual_IncomeA(lower limit) 
# disparityB uses Annual_IncomeB(upper limit)
# disparityAv uses Average_Income
            disparityA = round((man['Annual_IncomeA'].mean()) - (woman['Annual_IncomeA'].mean()),2)
            disparityB = round((man['Annual_IncomeB'].mean()) - (woman['Annual_IncomeB'].mean()),2)
            disparityAv = round((man['Average_Income'].mean()) - (woman['Average_Income'].mean()),2)
# Print the report
            print(color.GREEN,color.BOLD,'\nIncome Disparities:',s,'Companies',color.END)
            print('Middle Limit =',color.BOLD,disparityAv,color.END)
            print('Lower Limit =',color.BOLD,disparityA,color.END)
            print('Upper Limit =',color.BOLD,disparityB,color.END)

    print(color.BOLD,'\nAggregate Income Disparity for',c,'(2019 - 2021)')
    sns.barplot(x='Gender',y='Average_Income',hue='Year',data=tot2)
    plt.show()
    
    print(color.BOLD,'\nGender Income Disparities by Company Sizes for',c,'(2019 - 2021)')
    sns.catplot(x='Gender',y='Annual_IncomeB',hue='Year',col='Company_Classification',kind='bar',data = tot2)
    plt.show()

# Section 3A: Findings

In [None]:
a = Man_Woman.groupby(['Gender','Country','Company_Classification'])['Average_Income'].mean().reset_index()
top10_Woman = a[(a['Gender']=='Woman')].rename(columns={'Gender': 'Gender_Woman','Average_Income':'Income_Woman'})
top10_Man = a[(a['Gender']=='Man')].rename(columns={'Gender': 'Gender_Man','Average_Income':'Income_Man'})
top10 = pd.merge(top10_Man,top10_Woman, on=['Country','Company_Classification'])

Small_Coy = top10[(top10['Income_Man'].notnull()) & (top10['Company_Classification']=='Small')]
Large_Coy = top10[(top10['Income_Man'].notnull()) & (top10['Company_Classification']=='Large')]
Medium_Coy = top10[(top10['Income_Man'].notnull()) & (top10['Company_Classification']=='Medium')]

Small_Coy['Income_Disparity_Small'] = round((Small_Coy['Income_Man'] - Small_Coy['Income_Woman']),2)
Large_Coy['Income_Disparity_Large'] = round((Large_Coy['Income_Man'] - Large_Coy['Income_Woman']),2)
Medium_Coy['Income_Disparity_Medium'] = round((Medium_Coy['Income_Man'] - Medium_Coy['Income_Woman']),2)

top10_Lowest_Small = Small_Coy.sort_values('Income_Disparity_Small').reset_index(drop=True).head(13)
top10_Lowest_Large = Large_Coy.sort_values('Income_Disparity_Large').reset_index(drop=True).head(8)
top10_Lowest_Medium = Medium_Coy.sort_values('Income_Disparity_Medium').reset_index(drop=True).head(4)

top10_Highest_Small = Small_Coy.sort_values('Income_Disparity_Small',ascending=False).reset_index(drop=True).head(15)
top10_Highest_Large = Large_Coy.sort_values('Income_Disparity_Large',ascending=False).reset_index(drop=True).head(15)
top10_Highest_Medium = Medium_Coy.sort_values('Income_Disparity_Medium',ascending=False).reset_index(drop=True).head(15)

#top10_Lowest_Small

In [None]:
number=0
for i in [top10_Lowest_Small,top10_Lowest_Large,top10_Lowest_Medium]:
    number+=1
    a = i[['Gender_Man','Country','Income_Man']].rename(columns={'Gender_Man':'Gender','Income_Man':'Average_Annual_Income'})
    b = i[['Gender_Woman','Country','Income_Woman']].rename(columns={'Gender_Woman':'Gender','Income_Woman':'Average_Annual_Income'})
    ab = pd.concat([a,b])
    print(color.BOLD+color.BLUE,'Visuals:',number,color.END)
    if number == 1:
        print('Countries Where the Woman Earns more on the average in Small-Sized Companies')
    elif number == 2:
        print('Countries Where the Woman Earns more on the average in Large Companies')
    else:
        print('Countries Where the Woman Earns more on the average in Medium-Sized Companies')
    fig = go.Figure()
    fig = px.bar(ab, y='Average_Annual_Income', x ='Country',text='Average_Annual_Income',color='Gender')
    fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
    fig.update_layout(uniformtext_minsize=8)
    fig.update_layout(xaxis_tickangle=-45)
    fig.show('notebook')

In [None]:
number=0
for i in [top10_Highest_Small,top10_Highest_Large,top10_Highest_Medium]:
    number+=1
    a = i[['Gender_Man','Country','Income_Man']].rename(columns={'Gender_Man':'Gender','Income_Man':'Average_Annual_Income'})
    b = i[['Gender_Woman','Country','Income_Woman']].rename(columns={'Gender_Woman':'Gender','Income_Woman':'Average_Annual_Income'})
    ab = pd.concat([a,b])
    print(color.BOLD+color.GREEN,'Visuals:',number,color.END)
    if number == 1:
        print(color.BLUE,color.BOLD,'Top 15 Countries with the Highest Average Income Disparity in Small-Sized Companies 2019-2021')
    elif number == 2:
        print(color.BLUE,color.BOLD,'Top 15 Countries with the Highest Average Income Disparity in Large Companies 2019-2021')
    else:
        print(color.BLUE,color.BOLD,'Top 15 Countries with the Highest Average Income Disparity in Medium-Sized Companies 2019-2021')
    fig = go.Figure()
    fig = px.bar(ab, y= 'Average_Annual_Income', x ='Country',text= 'Average_Annual_Income',color='Gender')
    fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
    fig.update_layout(uniformtext_minsize=8)
    #fig.update_layout(autosize=False,width=800,height=800)
    fig.update_layout(xaxis_tickangle=-65)
    fig.show('notebook')

## Section 3B: Expenditure on ML and Cloud Computing Gap in Data Science

In [None]:
# Report and visualization of mean and median ML/cloud expenditure by gender for each year

print(color.GREEN+color.BOLD,'Gender Expenditure on ML and Cloud Computing Disparity Reports 2019 - 2021',color.END)
yea = Man_Woman['Year'].value_counts().reset_index(name='Count').rename(columns={'index':'Year'})
year = yea['Year'].unique()
for y in year:
    print(color.BLUE+color.BOLD,'\nYear =',y,color.END)
    print(color.GREEN,color.BOLD,'\nExpenditure on ML and Cloud',color.END)
    tot = Man_Woman[(Man_Woman['Year']== y)]
    for g in Man_Woman['Gender'].unique():
        k = tot[(tot['Gender']== g) & (tot['Year']== y)]
        Mean_Exp = round(k['ML_Cloud_Exp'].mean(),2)
        Median_Exp = k['ML_Cloud_Exp'].median()
        print(g,':', 'Mean =',Mean_Exp, '|','Median =',Median_Exp)
    
    man = tot[(tot['Gender']=='Man') & (tot['Year']==y)]
    woman = tot[(tot['Gender']=='Woman') & (tot['Year']==y)]
    
    disparityExp = round((man['ML_Cloud_Exp'].mean()) - (woman['ML_Cloud_Exp'].mean()),2)
    print(color.GREEN,color.BOLD,'\nExpenditure Disparity Report:',color.END)
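    # The disparity is always man minus woman, so name the woman explicitly instead of reusing the loop variable g
    print('The man or his team has expended',
          color.BOLD+color.PURPLE,disparityExp,color.END,
          'on ML and Cloud computing more than the woman',
          'or her team in the last 5 years')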
    
sns.barplot(x='Gender',y= 'ML_Cloud_Exp',hue='Year',data=Man_Woman)
plt.show()

In [None]:
print(color.GREEN+color.BOLD,'Gender Expenditure Disparities in Different Company Sizes 2019 to 2021',color.END)
print(color.GREEN,'How much did the man or his team spend on ML and Cloud computing more than the woman or her team in the last 5 years?',color.END)

number=0
for c in country:
    number+=1
    tot = All_Income[(All_Income['Country']==c)]
    tot2 = tot[(tot['Gender']=='Man') | (tot['Gender']=='Woman')]
    print(color.BOLD+color.GREEN+'\n',
          number,'Country Name:',c,
          '|','Total Respondents(2019-2021) =',
          len(tot),color.END)
    
    for y in year:
        print('\n',color.UNDERLINE,'           ',color.END,color.BLUE+color.BOLD,'\n\nYear =',y,color.END)
        # Filter the current country's rows (tot2); filtering Man_Woman here would mix in every country
        tot = tot2[(tot2['Year']== y)]
        for s in size:
            print(color.BLUE,color.BOLD,'\nCompany Size:',s,color.END,color.GREEN,color.BOLD,'\nExpenditure on ML and Cloud',color.END)
            for g in Man_Woman['Gender'].unique():
                k = tot[(tot['Gender']== g) & (tot['Year']== y) & (tot['Company_Classification']== s)]
                Mean_Exp = round(k['ML_Cloud_Exp'].mean(),2)
                Median_Exp = k['ML_Cloud_Exp'].median()
                print(g,':', 'Mean =',Mean_Exp, '|','Median =',Median_Exp)
            
# Expenditure Disparity Calculations
# Slice for the man and woman separately
            man = tot[(tot['Gender']=='Man') & (tot['Year']==y) & (tot['Company_Classification']== s)]
            woman = tot[(tot['Gender']=='Woman') & (tot['Year']==y) & (tot['Company_Classification']== s)]

            disparityExp = round((man['ML_Cloud_Exp'].mean()) - (woman['ML_Cloud_Exp'].mean()),2)
            print(color.GREEN,color.BOLD,'\n5 Years Expenditure Disparity',color.END)
            print('Expenditure Disparity =',color.BOLD+color.PURPLE,disparityExp,color.END)

    
    sns.barplot(x='Gender',y='ML_Cloud_Exp',hue='Year',data=tot2)
    plt.show()
    sns.catplot(x='Gender',y='ML_Cloud_Exp',hue='Year',col='Company_Classification',kind='bar',data = tot2)
    plt.show()

# Section 3B: Findings

In [None]:
# Create Expenditure disparities tables according to companies

a = Man_Woman.groupby(['Gender','Country','Company_Classification'])['ML_Cloud_Exp'].mean().reset_index()
top10_Woman = a[(a['Gender']=='Woman')].rename(columns={'Gender': 'Gender_Woman','ML_Cloud_Exp':'Exp_Woman'})
top10_Man = a[(a['Gender']=='Man')].rename(columns={'Gender': 'Gender_Man','ML_Cloud_Exp':'Exp_Man'})
top10 = pd.merge(top10_Man,top10_Woman, on=['Country','Company_Classification'])

Small_Coy = top10[(top10['Exp_Man'].notnull()) & (top10['Company_Classification']=='Small')]
Large_Coy = top10[(top10['Exp_Man'].notnull()) & (top10['Company_Classification']=='Large')]
Medium_Coy = top10[(top10['Exp_Man'].notnull()) & (top10['Company_Classification']=='Medium')]

Small_Coy['Exp_Disparity_Small'] = round((Small_Coy['Exp_Man'] - Small_Coy['Exp_Woman']),2)
Large_Coy['Exp_Disparity_Large'] = round((Large_Coy['Exp_Man'] - Large_Coy['Exp_Woman']),2)
Medium_Coy['Exp_Disparity_Medium'] = round((Medium_Coy['Exp_Man'] - Medium_Coy['Exp_Woman']),2)

top10_Lowest_Small = Small_Coy.sort_values('Exp_Disparity_Small').reset_index(drop=True).head(15)
top10_Lowest_Large = Large_Coy.sort_values('Exp_Disparity_Large').reset_index(drop=True).head(20)
top10_Lowest_Medium = Medium_Coy.sort_values('Exp_Disparity_Medium').reset_index(drop=True).head(11)

top10_Highest_Small = Small_Coy.sort_values('Exp_Disparity_Small',ascending=False).reset_index(drop=True).head(15)
top10_Highest_Large = Large_Coy.sort_values('Exp_Disparity_Large',ascending=False).reset_index(drop=True).head(15)
top10_Highest_Medium = Medium_Coy.sort_values('Exp_Disparity_Medium',ascending=False).reset_index(drop=True).head(15)

#top10_Highest_Small

In [None]:
# Visualise disparity
number=0
for i in [top10_Highest_Small,top10_Highest_Large,top10_Highest_Medium]:
    number+=1
    a = i[['Gender_Man','Country','Exp_Man']].rename(columns={'Gender_Man':'Gender','Exp_Man':'Average_5years_Spend'})
    b = i[['Gender_Woman','Country','Exp_Woman']].rename(columns={'Gender_Woman':'Gender','Exp_Woman':'Average_5years_Spend'})
    ab = pd.concat([a,b])
    print(color.BOLD+color.GREEN,'Visuals:',number,color.END)
    if number == 1:
        print(color.BLUE,color.BOLD,'Top 15 Countries with the Highest Expenditure Disparity in Small-Sized Companies 2019-2021')
    elif number == 2:
        print(color.BLUE,color.BOLD,'Top 15 Countries with the Highest Average Expenditure Disparity in Large Companies 2019-2021')
    else:
        print(color.BLUE,color.BOLD,'Top 15 Countries with the Highest Average Expenditure Disparity in Medium-Sized Companies 2019-2021')
    fig = go.Figure()
    fig = px.bar(ab, y= 'Average_5years_Spend', x ='Country',text= 'Average_5years_Spend',color='Gender')
    fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
    fig.update_layout(uniformtext_minsize=8)
    fig.update_layout(xaxis_tickangle=-65)
    fig.show('notebook')

In [None]:
number=0
for i in [top10_Lowest_Small,top10_Lowest_Large,top10_Lowest_Medium]:
    number+=1
    a = i[['Gender_Man','Country','Exp_Man']].rename(columns={'Gender_Man':'Gender','Exp_Man':'Average_5years_Spend'})
    b = i[['Gender_Woman','Country','Exp_Woman']].rename(columns={'Gender_Woman':'Gender','Exp_Woman':'Average_5years_Spend'})
    ab = pd.concat([a,b])
    print(color.BOLD+color.BLUE,'Visuals:',number,color.END)
    if number == 1:
        print('Countries Where the Woman Spends more on the average on ML and Cloud Computing in Small-Sized Companies')
    elif number == 2:
        print('Countries Where the Woman Spends more on the average on ML and Cloud Computing in Large Companies')
    else:
        print('Countries Where the Woman Spends more on the average on ML and Cloud Computing in Medium-Sized Companies')
    fig = go.Figure()
    fig = px.bar(ab, y='Average_5years_Spend', x ='Country',text='Average_5years_Spend',color='Gender')
    fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
    fig.update_layout(uniformtext_minsize=8)
    fig.update_layout(xaxis_tickangle=-45)
    fig.show('notebook')

# Section 4: Are We Asking for too Much too Soon?
Data science as it is today is relatively young. Many of its main subject areas, as we know them today, have existed for less than 15 years. Though some of the theories date back further, the advent of modern data science is tied to the advent of programming languages and the development of greater computing power.

[Read more on the history of Data Science](https://www.google.com/amp/s/www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/amp/)

1. Are we asking too much from data science too soon?
2. Is there any indication that data science will bridge the gender gap in future?

**Objectives**
- [x] Divide coding years into 2 broad groups: below 5 years and above 5 years
- [x] Evaluate the respondents for each country under the groups created
- [x] If most of the respondents per country have coding years below 5 years, then we might agree with the claim and conclude that perhaps there is ample time to solve the gender gap problems and that it is too soon for data science to self-adjust
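
The decision rule in the last objective can be sketched in plain Python. This is an illustration under stated assumptions: the bracket labels mirror the ones mapped in the next cells, the helper `below_five_share` is my own name, and the sample answers are invented rather than taken from the survey.

```python
from collections import Counter

# Assumed grouping of coding-experience brackets
GROUPS = {
    '< 1 years': 'Below 5 Years',
    '1-2 years': 'Below 5 Years',
    '3-5 years': 'Below 5 Years',
    '5-10 years': 'Above 5 Years',
    '10-20 years': 'Above 5 Years',
    '20+ years': 'Above 5 Years',
}

def below_five_share(answers):
    """Percentage of grouped answers that fall below 5 years.

    Answers that match no bracket are ignored; returns None if nothing matches.
    """
    counts = Counter(GROUPS[a] for a in answers if a in GROUPS)
    total = counts['Below 5 Years'] + counts['Above 5 Years']
    return 100 * counts['Below 5 Years'] / total if total else None

# Invented answers for one hypothetical country
sample = ['< 1 years', '1-2 years', '3-5 years', '10-20 years', 'unmapped answer']
share = below_five_share(sample)
print(share, 'percent below 5 years; majority below 5:', share >= 50)
```

A country would support the "too soon" argument under this rule when its share is 50 percent or more.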

In [None]:
BG_Data['Coding Years'].unique()

In [None]:
BG_Data['Code_Years_Group'] = BG_Data['Coding Years'].replace({'1-2 years':'Below 5 Years',
                                                 '< 1 years':'Below 5 Years',
                                                 '20+ years':'Above 5 Years',
                                                 '3-5 years':'Below 5 Years',
                                                 '5-10 years':'Above 5 Years',
                                                 '10-20 years':'Above 5 Years',
                                                 '1-3 years':'Below 5 Years'
                                                 })

In [None]:
#code = BG_Data['Coding Years'].value_counts().reset_index(name='Respondents').rename(columns={'index':'Coding Years'})
print(color.BLUE,color.BOLD,'Coding Years Per Countries',color.END)
number=0
for c in country:
    number+=1
    tot = BG_Data[(BG_Data['Country']==c)]
    tot1 = tot.groupby(['Year','Code_Years_Group'])['Counter'].count().reset_index().sort_values('Counter',ascending=False)
    print(color.BOLD+color.GREEN+'\n',
          number,'Country Name:',c,
          '|','Total Respondents(2019-2021) =',
          len(tot),color.END)
    print('\n',color.UNDERLINE,'           ',color.END)
    
    
    yea = tot['Year'].value_counts().reset_index(name='Count').rename(columns={'index':'Year'})
    year = yea['Year'].unique()
    for y in year:
        tot2 = tot[(tot['Year']==y)]
        print(color.BLUE+color.BOLD,'\n\nYear =',y,'|','Total =',len(tot2),color.END)
        
        number2 = 0
        code = tot2['Code_Years_Group'].value_counts().reset_index(name='Respondents').rename(columns={'index':'Code_Years_Group'})
        for i in code['Code_Years_Group'].unique():
            number2+=1
            k = tot[(tot['Code_Years_Group']== i) & (tot['Year']== y)]
            per = round(len(k)/len(tot2)*100)  # round after scaling to avoid values like 56.99999999999999
            print(number2,i, '=', len(k),'|',per,'%')   
    
    plt.title('Aggregate Coding Years 2019 to 2021')
    sns.barplot(y='Code_Years_Group',x='Counter',hue ='Year',data=tot1) 
    plt.show()

# Section 4: Findings

In [None]:
# Respondent counts per country and coding-experience group
a = BG_Data.groupby(['Country','Code_Years_Group'])['Counter'].count().reset_index().sort_values('Counter',ascending=False)
# Split into the two experience groups; rename the group column so each stays identifiable after the merge
below_5 = a[(a['Code_Years_Group']=='Below 5 Years')].rename(columns={'Code_Years_Group': 'Code_Years_Below'})
above_5 = a[(a['Code_Years_Group']=='Above 5 Years')].rename(columns={'Code_Years_Group': 'Code_Years_Above'})
# Inner merge: countries missing either group are dropped
BA = pd.merge(below_5,above_5, on='Country').rename(columns={'Counter_x':'Counter_Below','Counter_y':'Counter_Above'})
BA['Difference'] = BA['Counter_Below'] - BA['Counter_Above']

Total_Respondensts = BG_Data.groupby(['Country'])['Counter'].count().reset_index()

BA_Below = pd.merge(Total_Respondensts,BA, on='Country')
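# Whole-number percent; scaling before rounding avoids float noise such as 56.99999999999999
BA_Below['Percent'] = (BA_Below['Counter_Below'] / BA_Below['Counter'] * 100).round()
# Keep the 27 highest-ranked countries (hard-coded cut-off)
BA_Below = BA_Below.sort_values('Percent',ascending=False)[0:27]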
#BA_Below.head(27)

BA_Above = pd.merge(Total_Respondensts,BA, on='Country')
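# Whole-number percent; scaling before rounding avoids float noise such as 56.99999999999999
BA_Above['Percent'] = (BA_Above['Counter_Above'] / BA_Above['Counter'] * 100).round()
# Keep the 14 highest-ranked countries (hard-coded cut-off)
BA_Above = BA_Above.sort_values('Percent',ascending=False)[0:14]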
#BA_Above.head(14)

In [None]:
number=0
for i in [BA_Below,BA_Above]:
    number+=1
    print(color.BOLD+color.BLUE,'Visuals:',number,color.END)
    if number == 1:
        print(color.GREEN,color.BOLD,'Countries where respondents with below 5 coding years are 50 percent or more of total respondents',color.END)
    else:
        print(color.GREEN,color.BOLD,'Countries where respondents with above 5 coding years are 50 percent or more of total respondents',color.END)
    # px.bar already returns a Figure, so no separate go.Figure() is needed
    fig = px.bar(i, y='Percent', x ='Country',text='Percent')
    fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
    fig.update_layout(uniformtext_minsize=8)
    fig.update_layout(xaxis_tickangle=-45)
    fig.show('notebook')

In most countries, respondents mostly report fewer than 5 years of coding experience. This suggests that data science is still evolving, and we might be asking too much if we compare it with older, more established disciplines.
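
The "most countries" statement can be checked with an explicit 50 percent threshold rather than a fixed number of rows. The following is a minimal sketch under stated assumptions: the `toy` frame, its country labels and its counts are hypothetical and only mimic the `Country` and `Code_Years_Group` columns of `BG_Data`.

```python
import pandas as pd

# Hypothetical miniature of BG_Data; countries and counts are invented.
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4 + ["C"] * 2,
    "Code_Years_Group": [
        "Below 5 Years", "Below 5 Years", "Below 5 Years", "Above 5 Years",  # A: 3 of 4 below
        "Below 5 Years", "Above 5 Years", "Above 5 Years", "Above 5 Years",  # B: 1 of 4 below
        "Below 5 Years", "Above 5 Years",                                    # C: 1 of 2 below
    ],
})

# Percentage of each country's respondents in the "Below 5 Years" group.
below_share = (
    toy["Code_Years_Group"].eq("Below 5 Years")
    .groupby(toy["Country"])
    .mean()
    .mul(100)
)

# Keep countries at or above the 50 percent threshold instead of slicing a fixed row count.
majority_below = below_share[below_share >= 50].sort_values(ascending=False)
print(majority_below)
print(f"{len(majority_below)} of {below_share.size} countries are at or above 50 percent")
```

Filtering on the threshold keeps the list correct if the underlying data changes, whereas a fixed slice has to be recounted by hand.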

# Conclusions - 10 Takeaways
1. Is gender diversity an issue in data science? ----- Yes
2. Is the gender diversity issue specific to data science? ----- No
3. Are there wide recruitment gaps across countries for data science positions in various companies? ----- Yes
4. The highest score for women's participation in data science (2019-2021) is 41%, coming from Tunisia
5. Do we have companies and countries where women earn more than men? ----- Yes
6. However, income disparity remains very high in many advanced economies
7. Is there a disparity in expenditure on ML and cloud computing in data science? ----- Yes
8. However, the expenditure disparity is not large enough to explain the income disparity
9. Can the gender gap be bridged in future, given that data science is still evolving? ----- Yes
10. Are women consciously being discriminated against in data science? ----- We don't have enough data to ascertain that