## Gender Disparity in Data Science

Computing has developed a widening gender gap. Let's explore the disparity and try to identify the reasons behind it.

In the early days of computing, women were well represented in the field. More recently, however, the disparity has grown.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT26XQ7ED9XgqGLtrG_dyu8Km-MPfOtip_VfQ&usqp=CAU" width="300">


Despite ongoing efforts to improve gender diversity in tech, women are still underrepresented, underpaid, and often discriminated against in the tech industry. The gender gap remains wider in STEM occupations than in non-STEM occupations.

<img src="https://www.statnews.com/wp-content/uploads/2019/11/AdobeStock_138936733-768x432.jpg" alt="" width="400">


Let's explore the 2020 Kaggle Survey to see whether women have made any progress in bridging the gap, and try to identify reasons and possible solutions.

We start with the percentage of women and men who participated in the 2020 survey.


In [None]:
import numpy as np
import pandas as pd
import math

#visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%pylab inline
from matplotlib.colors import ListedColormap
from textwrap import wrap

# Disable warnings
import warnings
warnings.filterwarnings('ignore')
# plotly packages
import plotly
import plotly.graph_objects as go
import plotly.express as px
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()

In [None]:
df_2020 = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")
# Row 0 holds the question text, so skip it; .copy() avoids chained-assignment problems below
df20 = df_2020.iloc[1:,[1,2,3,4,5,6,118]].copy()
df20.columns = ['Age','Gender','Country','Education','Title', 'NumYrs','Sal']
# Shorten long country names
df20.loc[df20['Country'] == 'United States of America','Country'] = 'United States'
df20.loc[df20['Country'] == 'United Kingdom of Great Britain and Northern Ireland','Country'] = 'United Kingdom'
df20.loc[df20['Country'] == 'Iran, Islamic Republic of...','Country'] = 'Iran'
# Merge the two lowest education levels into a single category
df20.loc[df20['Education'] == 'Some college/university study without earning a bachelor’s degree','Education'] = 'High School/Some College'
df20.loc[df20['Education'] == 'No formal education past high school', 'Education'] = 'High School/Some College'
#Importing the 2019 Dataset
df_2019 = pd.read_csv('/kaggle/input/cleaned-mcr-kaggle-survey-2019/clean_multiple_choice_responses.csv')
df_2019 = df_2019.rename(columns={'Duration (in seconds)': 'Duration',
        'What is your age (# years)?': 'Age', 
        'What is your gender? - Selected Choice': 'Gender',
        'In which country do you currently reside?':'Country',
        'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?': 'Education',
        'What is the size of the company where you are employed?':'CompanySize',
        'What is your current yearly compensation (approximate $USD)?':'Salary',
        'Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?':'MoneyDS',
                            })
df19 = df_2019[['Duration','Age','Gender','Country','Education','Salary','CompanySize','MoneyDS']].copy()

# Replace ambiguous country names with standard names
# (assign the result back instead of using inplace=True on a column selection)
df19['Country'] = df19['Country'].replace({'United States of America':'United States',
                            'Viet Nam':'Vietnam',
                             "People 's Republic of China":'China',
                             "United Kingdom of Great Britain and Northern Ireland":'United Kingdom',
                             "Hong Kong (S.A.R.)":"Hong Kong"})
#Importing the 2018 Dataset
df_2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
df_2018.columns = df_2018.iloc[0]
df_2018=df_2018.drop([0])

df_2018 = df_2018.rename(columns={'Duration (in seconds)': 'Duration',
        'What is your age (# years)?': 'Age', 
        'What is your gender? - Selected Choice': 'Gender',
        'In which country do you currently reside?':'Country',
        'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?': 'Education',
         #'What is the size of the company where you are employed?':'CompanySize',
         'What is your current yearly compensation (approximate $USD)?':'Salary',
         #'Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?':'MoneyDS',
                            })
df18 = df_2018[['Duration','Age','Gender','Country','Education','Salary']]
# #Importing the 2017 Dataset
df17=pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv',encoding='ISO-8859-1')                         
# Share (%) of the two most common gender answers (Man, Woman) among 2020 respondents
gen_ratio = ((df20['Gender'].value_counts(normalize=True)[:2])*100).round(2)
sns.set(style="whitegrid")
clrs = ['#949994','#a0dea4']
ax = sns.barplot(x=gen_ratio.index, y=gen_ratio.values, palette = clrs)
ax.axhline(0, color="k", clip_on=False) 
ax.set_xlabel("Gender")
ax.set_ylabel("Percentage")
_= ax.set_title("Percentage of Women & Men")

About <span style="background-color: #FFFF00">20% of respondents</span> to the Kaggle survey are women. This roughly reflects the representation of women in tech in general. Now let's see whether their representation has increased over the years (2017-2020).

In [None]:
# df_N = pd.DataFrame(data = [len(df17),len(df18),len(df19), len(df20)],
#                           columns = ['Numresponses'])
df_F = pd.DataFrame(data = [(df17['GenderSelect'] == 'Female').sum(), (df18['Gender'] == 'Female').sum(),
                        (df19['Gender'] == 'Female').sum(), (df20['Gender'] == 'Woman').sum()],columns = ['Females'])
df_M = pd.DataFrame(data = [(df17['GenderSelect'] == 'Male').sum(), (df18['Gender'] == 'Male').sum(),
                             (df19['Gender'] == 'Male').sum(), (df20['Gender'] == 'Man').sum()], columns = ['Males'])
#df_A = pd.concat([df_N, df_F,df_M] , axis=1)
df_A = pd.concat([df_F,df_M] , axis=1)                     
df_A.index = ['2017','2018','2019','2020']
dfAf = df_A.copy()
dfAf = dfAf.reset_index()
dfAf = dfAf.rename(columns={"index":"Year"})
# Women as a percentage of men (a women-to-men ratio, not the share of all respondents)
dfAf["%W"] = ((dfAf["Females"]/dfAf["Males"])*100).round(2)
fig = go.Figure()
fig.add_trace(go.Bar(x=dfAf['Year'],
                y=dfAf['Males'],
                name='Men',
                marker_color='rgb(55, 83, 109)'
                   ))
fig.add_trace(go.Bar(x=dfAf['Year'],
                y=dfAf['Females'],
                name='Women',
                marker_color='indianred'
                ))
# Change the bar mode
fig.update_layout(
    title='Count of Men and Women for 2017, 2018, 2019, 2020',
    xaxis_title="Year",
    yaxis_title="Count",
    barmode='group', plot_bgcolor='rgb(250,250,250)')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black',zeroline=False)
                
fig.update_yaxes(showline=True, linewidth=2, linecolor='black',
                showgrid=True, gridwidth=1, gridcolor='#d6d5d2')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=dfAf['Year'], y=dfAf['%W'],
                    mode='lines+markers',
                         
                    name='lines+markers',
                    line_color = 'firebrick',
                    marker_color='#0834a3'))

# Edit the layout
fig.update_layout(title='Percentage of Women to Men(2017,2018,2019,2020)',
                   xaxis_title='Year',
                   yaxis_title='% of Women',
                   plot_bgcolor='white',
                   width=600,
                   height=300)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black',zeroline=False)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black',
                showgrid=True, gridwidth=1, gridcolor='#d6d5d2')
fig.show()

From the two graphs above, the total number of responders was highest in 2018. In percentage terms, however, we see the <span style="background-color: #aced98">largest increase from 2019 to 2020, nearly 4.6%</span>, which is encouraging. The participation of women appears to be moving in the right direction.
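
The `%W` column in the cell above divides the number of women by the number of men, so it is a women-to-men ratio rather than the share of women among respondents. The sketch below contrasts the two measures on made-up counts. The function names and the numbers are illustrative assumptions, not survey results, and only Man/Woman answers are counted.

```python
def women_to_men_ratio(women, men):
    """Women per 100 men, the measure stored in the `%W` column."""
    return round(100 * women / men, 2)


def women_share(women, men):
    """Women as a percentage of women plus men."""
    return round(100 * women / (women + men), 2)


# Hypothetical (women, men) counts for two consecutive years, not survey data
counts = {"year_1": (20, 100), "year_2": (25, 100)}

for year, (women, men) in counts.items():
    print(year, women_to_men_ratio(women, men), women_share(women, men))

# Change in the share of women, expressed in percentage points
change = round(women_share(25, 100) - women_share(20, 100), 2)
print("change in share (percentage points):", change)
```

Either measure can track the trend, as long as the axis label and the text say which one is plotted.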

In [None]:
df20 = df20.sort_values('Age')
df20["Male_Age_Group"]=df20[df20["Gender"]=="Man"]["Age"]
df20["Female_Age_Group"]=df20[df20["Gender"]=="Woman"]["Age"]
df20[["Male_Age_Group","Female_Age_Group"]].iplot(kind="histogram", bins=11, \
         theme="white", title="Men & Women Age Group Categories",
         xTitle='Age Groups', yTitle='Count')

Most men are in the 25-29 age group (3,128), whereas most women are in the 22-24 age group (886). As age increases (22-24 to 70+), there is a <strong><span style="background-color: #fa4f46">continuous decrease in women responders, from 886 in the 22-24 age group to 29 in 60-69 and 2 in 70+.</span></strong>

From [an article](https://www.cio.com/article/3516012/women-in-tech-statistics-the-hard-truths-of-an-uphill-battle.html) by Sarah K. White, Senior Writer, CIO:


Statistics from the following seven facets of IT work, ranging from higher education to workplace environment, paint a clear picture of the challenges women face in finding equal footing in a career in IT.
* The employment gap
* The degree gap
* The retention gap
* Workplace culture gap
* The founder gap
* The pay gap
* IT leadership

Since we have degree data for the respondents, let's check whether there is a degree gap between male and female data scientists.

In [None]:
dfEW = df20[df20['Gender'] == 'Woman']['Education'].value_counts(sort = True)
dfEM = df20[df20['Gender'] == 'Man']['Education'].value_counts(sort = True)
labsW = dfEW.keys()
dataW = list(dfEW.values)
labsM = dfEM.keys()
dataM = list(dfEM.values)
# Creating explode data 
explode = (0.0, 0.0, 0.1, 0.1, 0.4, 0.1) 
  
# Creating color parameters 
colors = ( "#a8a7a5", "#a3cfc5", "#d9c0a0", 
          "#f7ec74", "#d4bdf0", "#cff2c7" ) 
# Wedge properties 
wp = { 'linewidth' : 1, 'edgecolor' : "black" }

# Creating autopct arguments 
def func(pct, allvalues): 
    # Round rather than truncate so the displayed count matches the percentage
    absolute = int(round(pct / 100.*np.sum(allvalues)))
    return "{:.1f}%\n({:d})".format(pct, absolute)
# Creating plot 
#fig, ax = plt.subplots(figsize =(10, 7)) 
fig, (ax1, ax2) = plt.subplots(1, 2,figsize =(10, 8))
wedges, texts, autotexts = ax1.pie(dfEW,  
                                  autopct = lambda pct: func(pct, dataW), 
                                  explode = explode,  
                                  labels = labsW, 
                                  shadow = True, 
                                  colors = colors, 
                                  startangle = 90, 
                                  wedgeprops = wp, 
                                  textprops = dict(color ="black")) 

wedgesM, textsM, autotextsM = ax2.pie(dfEM,  
                                  autopct = lambda pct: func(pct, dataM), 
                                  explode = explode,  
                                  labels = labsM, 
                                  shadow = True, 
                                  colors = colors, 
                                  startangle = 90, 
                                  wedgeprops = wp, 
                                  textprops = dict(color ="black")) 

#Adding legend 
plt.legend(wedges, labsW, 
          title ="Education", 
          loc ="lower left",
          bbox_to_anchor =(1, 0, 0.5, 1)) 
  
# plt.setp(autotexts, size = 8, weight ="bold") 
fig.suptitle('Education',y=.9)
ax1.set_title("Women",pad=30, fontsize=18, color='red') 
#ax2.use_sticky_edges = False
ax2.set_title("Men", pad=30, fontsize=18, color='blue')
# show plot 
plt.show() 

The share of women holding a Master's or Doctoral degree (relative to women with other degrees) is higher than the corresponding share among men. Conversely, a larger share of men than women fall in the High School/Some College category. This suggests that men without a degree are more likely to be in data science than women without a degree. Interesting!
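
Because there are far more men than women in the survey, the comparison above relies on shares within each gender rather than raw counts. A minimal sketch of that normalization with `pd.crosstab`, using a made-up frame (the rows and the shortened education labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical respondents, not survey data
toy = pd.DataFrame({
    "Gender": ["Woman", "Woman", "Woman", "Woman", "Man", "Man", "Man", "Man"],
    "Education": ["Master", "Master", "Doctoral", "Bachelor",
                  "Master", "Bachelor", "Bachelor", "High School/Some College"],
})

# normalize="index" divides each row by its total, so every gender sums to 100%
share = pd.crosstab(toy["Gender"], toy["Education"], normalize="index") * 100
print(share.round(1))
```

Reading the table row by row gives the same information as the two pie charts, in a form that is easier to compare column by column.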

### Analysis of Female respondents by Country

In [None]:
dfC = df20.copy()
dfC = dfC[dfC["Country"] != "Other"]
# .sum() counts only the rows where Gender is 'Woman'; .count() would count every respondent
dfCF=dfC.groupby('Country')['Gender'].apply(lambda x: (x=='Woman').sum()).reset_index(name='ctFemales')
dfCF = dfCF.sort_values(by='ctFemales', ascending=False)
dfCF = dfCF.iloc[:20,:]
x = dfCF['Country']
values = dfCF['ctFemales']
my_range=range(1,len(dfCF)+1)
# # Vertical version.
plt.figure(figsize=(12,6))
plt.hlines(y=my_range, xmin=0, xmax=dfCF['ctFemales'], color='red')
plt.plot(values, my_range, "D")
plt.yticks(my_range, x)
plt.title("Top 20 Countries with highest Female respondents", loc='center',fontsize=16)
plt.xlabel('Count, fig(1)', fontsize=14)
plt.ylabel('Countries', fontsize=14)
plt.show()

dfC1 = dfC.groupby(['Country','Gender']).size().unstack()[['Man','Woman']]
dfC1['%W'] = ((dfC1['Woman']/dfC1['Man'])*100).round(0)
dfC1 = dfC1.astype(int)
dfCW = dfC1.sort_values('Woman', ascending=False).head(20)
dfCWP = dfC1.sort_values('%W', ascending=False).head(20)
gF = pd.DataFrame({'group':dfCW.index.tolist(), 'value1':dfCW.Woman , 'value2':dfCW.Man })
my_range=range(1,len(gF.index)+1)
sns.set(style="whitegrid")
plt.figure(figsize=(12,8))
plt.hlines(y=my_range, xmin=gF['value1'], xmax=gF['value2'], color='blue', alpha=1)
plt.scatter(gF['value1'], my_range, color='red', alpha=1, label='Females')
plt.scatter(gF['value2'], my_range, color='green', alpha=1 , label='Males', marker='D')
plt.legend()
# Add title and axis names
plt.yticks(my_range, gF['group'])
plt.title("Top 20 Countries with highest Female respondents to Males", loc='center', fontsize=16)
plt.xlabel('Count, fig(2)', fontsize=14)
plt.ylabel('Countries', fontsize=14)
plt.show()
gF1 = pd.DataFrame({'group':dfCWP.index.tolist(), 'value1':dfCWP.Woman , 'value2':dfCWP.Man })
my_range=range(1,len(gF1.index)+1)
sns.set(style="whitegrid")
plt.figure(figsize=(12,8))
plt.hlines(y=my_range, xmin=gF1['value1'], xmax=gF1['value2'], color='blue', alpha=1)
plt.scatter(gF1['value1'], my_range, color='red', alpha=1, label='Females')
plt.scatter(gF1['value2'], my_range, color='green', alpha=1 , label='Males', marker='D')
plt.legend()
# Add title and axis names
plt.yticks(my_range, gF1['group'])
plt.title("Top 20 Countries with %Woman to Man respondents", loc='center',fontsize=16)
plt.xlabel('Count, fig(3)', fontsize=14)
plt.ylabel('Countries', fontsize=14)
plt.show()

From fig(1) above, going purely by the number of female responders, the <span style="background-color: #aced98">highest are India, USA, Brazil, Japan, and Russia.</span> For India and the USA, though, the raw numbers don't mean much on their own, because both countries are highly populated; we have to account for population to see where they really stand.

From fig(2) above, comparing female responders to male, the highest are India, USA, Brazil, UK, and Turkey. The UK and Turkey have proportionately more female respondents.

Fig(3) tells a different story. <span style="background-color: #aced98">In Malaysia, Tunisia, Iran, Ireland, and the Philippines, although the number of responders is small, the percentage of women relative to men is higher.</span> This is interesting: is it related to GDP?
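
The three figures rank countries by different quantities: counts of women, and the percentage of women relative to men. A small sketch on a made-up frame shows how each can be derived from a boolean mask. Note that `.sum()` counts only the rows where the mask is true, whereas `.count()` counts every row in the group. The country names `A` and `B` and all rows are hypothetical.

```python
import pandas as pd

# Hypothetical respondents, not survey data
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B"],
    "Gender": ["Man", "Man", "Man", "Woman", "Man", "Woman", "Woman"],
})

by_country = toy.groupby("Country")["Gender"]
women = by_country.apply(lambda g: (g == "Woman").sum())  # women only
men = by_country.apply(lambda g: (g == "Man").sum())      # men only
everyone = by_country.count()                             # all respondents

summary = pd.DataFrame({"women": women, "men": men, "all": everyone})
summary["women_per_100_men"] = (100 * summary["women"] / summary["men"]).round(0)
print(summary.sort_values("women_per_100_men", ascending=False))
```

A country with few respondents can top the ratio ranking while barely registering in the count ranking, which is the pattern fig(3) points to.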

### Analysis of Preferences based on Gender




In [None]:
dfVLM = df_2020.loc[df_2020["Q2"] == 'Man', [i for i in df_2020.columns if 'Q14' in i]]
q14M_count = pd.Series(dtype='int')
for i in dfVLM.columns:
    q14M_count[dfVLM[i].value_counts().index[0]] = dfVLM[i].count()
dM = q14M_count.to_dict()
dfVLW = df_2020.loc[df_2020["Q2"] == 'Woman', [i for i in df_2020.columns if 'Q14' in i]]
q14W_count = pd.Series(dtype='int')
for i in dfVLW.columns:
    q14W_count[dfVLW[i].value_counts().index[0]] = dfVLW[i].count()
dW = q14W_count.to_dict()
wordcloudw = WordCloud()
wordcloudm = WordCloud()                              
wordcloudw.generate_from_frequencies(frequencies=dW) 
wordcloudm.generate_from_frequencies(frequencies=dM)
fig, (ax1, ax2) = plt.subplots(1, 2,figsize =(14, 10))
ax1.imshow(wordcloudw)
ax1.axis('off')
ax1.set_title("Women")
ax2.imshow(wordcloudm)
ax2.axis('off')
ax2.set_title("Men")
fig.suptitle('Data Visualization Libraries',fontsize=18, color='red') 
fig.tight_layout()
fig.subplots_adjust(top=1.4)
fig.show()

* Scikit-learn is the most popular ML framework for both Men and Women.
* Next, TensorFlow, Keras, PyTorch, and Xgboost are equally popular with both.
* <span style="background-color: #aced98">Caret appears to be preferred over CatBoost by Women.</span>
* <span style="background-color: #adeddf">Fast.ai and Prophet are not to their liking.</span>

* Matplotlib and Seaborn are equally popular with both Men and Women.
* <span style="background-color: #aced98">Ggplot/ggplot2 is preferred by more Women compared to Men.</span>
* <span style="background-color: #adeddf">Plotly/Plotly Express is Men's choice compared to Women.</span>
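
Word clouds built from raw counts are hard to compare across groups of very different sizes, and men outnumber women several times over in this survey. One way to make the bullets above checkable is to convert counts into the share of each gender that selected an option. The sketch below assumes the multi-select layout used in the cell above (one column per option, holding the option name when selected and empty otherwise). The toy rows and the helper name `selection_share` are made up for illustration.

```python
import pandas as pd

# Hypothetical answers, not survey data
toy = pd.DataFrame({
    "Q2": ["Man", "Man", "Man", "Man", "Woman", "Woman"],
    "Q14_Part_1": ["Matplotlib", "Matplotlib", "Matplotlib", None, "Matplotlib", None],
    "Q14_Part_2": [None, "Seaborn", None, None, "Seaborn", "Seaborn"],
})


def selection_share(frame, gender, prefix="Q14"):
    """Percentage of respondents of one gender who selected each option."""
    subset = frame.loc[frame["Q2"] == gender]
    shares = {}
    for col in subset.columns:
        if not col.startswith(prefix):
            continue
        picked = subset[col].dropna()
        if picked.empty:
            continue  # nobody in this group selected the option
        # The first non-empty value in the column is the option's label
        shares[picked.iloc[0]] = round(100 * len(picked) / len(subset), 1)
    return pd.Series(shares, dtype=float)


print(selection_share(toy, "Woman"))
print(selection_share(toy, "Man"))
```

With shares instead of counts, "preferred by more Women compared to Men" becomes a direct comparison of two percentages.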


### Analysis of Women to Men by Title and Education by Countries

In [None]:
dfSal = df20.copy()
dfSal = dfSal.loc[~dfSal['Sal'].isna()]
dfSal['Sal'] = dfSal['Sal'].apply(lambda x: "500,000-700,000" if x == '> $500,000' else x)
dfSal['Sal'] = dfSal['Sal'].apply(lambda x: "0-999" if x == '$0-999' else x)
dfSal['Sal'] = dfSal['Sal'].str.replace(',','')
dfSal['S1'] = dfSal['Sal'].str.split('-').str[0].astype(int)
dfSal['S2'] = dfSal['Sal'].str.split('-').str[1].astype(int)
dfSal['Salary'] = (dfSal['S1']+dfSal['S2'])/2
dfSal['Salary'] = dfSal['Salary'].apply(math.ceil)
dfSal = dfSal.drop(['S1','S2'], axis=1)
#dfSal.loc[dfSal['NumYrs'] == 'I have never written code', "NumYrs"]= "No Code"
dfC = df20.copy()
dfC = dfC[dfC["Country"] != "Other"]
dfC1 = dfC.groupby(['Country','Gender']).size().unstack()[['Man','Woman']]
dfC1['%W'] = ((dfC1['Woman']/dfC1['Man'])*100).round(0)
dfC1 = dfC1.astype(int)
dfCW = dfC1.sort_values('Woman', ascending=False)#.head(20)
dfCWP = dfC1.sort_values('%W', ascending=False)#.head(20)
cwList = list(dfCW.index)
cwpList = list(dfCWP.index)
dfTest = dfSal.loc[dfSal['Country'].isin(cwpList[:17])]
dfT = dfTest.groupby(['Country','Title','Gender']).size().unstack().reset_index()
dfT.columns.name = None
dfT['%W'] = ((dfT['Woman']/dfT['Man'])*100).round()
dfT = dfT[['Country','Title','%W']].fillna(0)
dfT = dfT.set_index('Country').sort_index()
dfT = dfT.groupby([dfT.index, 'Title'])['%W'].first().unstack()
dfT = dfT.fillna(0)
dfA = dfT.T
palette_mw = sns.color_palette("Greys_r", 12) + sns.color_palette("Greens", 12)
plt.figure(figsize=(14,8))
ax = sns.heatmap(dfA, vmax=100, vmin=0, center = 50, cmap = palette_mw, square=True, 
           annot=True, linewidths=2, 
           cbar_kws = {"label":"% of women", "shrink":0.5})
ax.set_xlabel('Countries, fig(1)', fontsize=14)
ax.set_ylabel('Title', fontsize=14)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
_ = ax.set_title("Top 17 countries(%Women to Men) by Title",fontsize=16, color='blue')

In [None]:
dfAA = dfSal.loc[dfSal['Country'].isin(cwList[:12])]
dfAA = dfAA.groupby(['Country','Education','Gender']).size().unstack().reset_index()
dfAA.columns.name = None
dfAA['%W'] = ((dfAA['Woman']/dfAA['Man'])*100).round()
dfAA = dfAA[['Country','Education','%W']]
dfAA = dfAA.set_index('Country').sort_index()
dfAA = dfAA.groupby([dfAA.index, 'Education'])['%W'].first().unstack()
dfAA = dfAA.fillna(0)
dfAA = dfAA.T
def plot_waffle(proportion, axes=None, numrows=4, numcols=4, orient="v", **kwargs):
    # Fall back to the current axes when none is passed in
    if axes is None:
        axes = plt.gca()
    numfields = numrows*numcols 
    filled = int(round(proportion * numfields))

    filledvec = np.concatenate([np.ones(filled),np.zeros(numfields-filled)])
    if orient == "h":
        filledvec = filledvec.reshape(numrows, numcols)
        filledvec = filledvec
    else: 
        filledvec = filledvec.reshape(numcols, numrows)
        filledvec = filledvec.T 
        if orient != "v": 
            print("invalid orientation, defaulting to vertical")
        
    axes.pcolormesh(filledvec, edgecolors='w', **kwargs)
    axes.set_aspect('equal');
    axes.axis('off');
f, a = plt.subplots(nrows = dfAA.shape[0]+1, ncols = dfAA.shape[1]+1, figsize=(13,6))
plt.subplots_adjust(wspace = 0.2, hspace = 0.2)
waffle_params = {"numrows":4, "numcols":4, "orient":"v"}
params = {"linewidth":2, "cmap": ListedColormap(sns.color_palette("YlGnBu_r", 3))}
for row, ax_row in enumerate(a[1:,1:]): 
    for col, ax_col in enumerate(ax_row):
        plot_waffle(dfAA.iloc[row, col]/100, ax_col, **waffle_params, **params)
a[0][0].axis('off')

# print column names
for i, ax in enumerate(a[0,1:]):
    ax.axis('off')
    ax.text(x = 0.5, y = 0, s = dfAA.columns[i], rotation = 90, ha = "center", va="bottom")

# print row labels (df index)
for i, ax in enumerate(a[1:,0]):
    ax.axis('off')
    ax.text(x = -0.25, y = 0.5, s = '\n'.join(wrap(dfAA.index[i], 12)), rotation = 0, va = "center")
    
    # add legend
rect = lambda color: plt.Rectangle((0,0),1,1, color=color)
legend = a[0,0].legend([rect(params['cmap'](2)), rect(params['cmap'](0))], ["women", "men"])  
_= f.suptitle('Proportion between men and women in Education ', fontsize=16)

In [None]:
#Bar plot
data = dfAA.T.reset_index()
data.head()
# Average only the numeric education columns (the 'Country' column holds strings)
data_mean = pd.DataFrame(data.mean(numeric_only=True), columns=['women'])
data_mean['men'] = 100 - data_mean['women']
data_mean.head()
marker_dict = {
 'China':"o",
 'India':"p",
 'Germany':'v',
 'Indonesia':"D",
 'Nigeria':"^",
 'Iran':"<",
 'Russia':">",
 'Canada':"d",
 'United States':"s",
 'Brazil':"P",
 'Turkey':"X",
 'United Kingdom':"*",
}
#pastel2 = sns.color_palette('Pastel2', 2)
seta = sns.color_palette('husl', 2)
# background bar plot
ax = data_mean.plot.bar(stacked=True, cmap = ListedColormap(seta), 
                        width=1, edgecolor="k", 
                        alpha=1, figsize=(7,4))
# scatterplot for each country
for ind in np.arange(data.shape[0]):
    plt.scatter(y=list(data.iloc[ind][1:].values), 
                x=list(np.arange(data.shape[1]-1)+(ind+0.5-round((data.shape[0])/2))*0.06), 
                marker=marker_dict[data['Country'].iloc[ind]], s=60, 
                  edgecolor="k", linewidth=1, 
                 label=data['Country'].iloc[ind], 
                axes = ax, 
               zorder=10)

# make markers transparent
for paths in ax.collections:
    paths.set_facecolor('None')
  
# empty plot to get an empty legend entry
plt.plot(np.nan, np.nan, '-', color='none', label=' ', axes=ax)

# create legend
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles[1:-2]+handles[0:1]+handles[-2::][::-1], 
           labels[1:-2]+labels[0:1]+labels[-2::][::-1], 
           loc='right', bbox_to_anchor=(1.4, 0.5), 
          edgecolor="w")

# adjust axes
plt.ylim(0,100)
plt.ylabel('% women')

plt.xlim(-0.5, 4.5)
_ = plt.title("% Women and mean across countries in Education")


Fig(1)
* Overall participation of Women is far lower: there is more black than green.
* Women DBAs/Data Engineers appear only in India and Taiwan among the top 17 countries.
* More Women work as Product/Project Managers, Research Scientists, and Statisticians.
* More are from Ireland, Malaysia, and Tunisia.

Fig(2)
* In Indonesia, there are more women with a Doctoral degree, as many women as men with a Bachelor's degree, and none with a Professional degree.
* Overall, very few women hold a Professional degree, and very few are from Turkey.

Fig(3)
* As the bar graph shows, the % of Women in each Education category is very low.
* The % of Women with a Doctoral degree is slightly higher.
* Indonesia's mean % of Women is much higher for the Doctoral degree.

### Analysis of Pay Divide


In [None]:
dfTree = dfSal.loc[dfSal['Country'].isin(cwList[:14])]
fig = px.treemap(dfTree, 
                 path=['Country','NumYrs','Gender'], 
                 values='Salary',
                 color='Country',
                 width=800, height=600,
                 title="Salary per Gender per NumYrs per Country")
                 
fig.show()

In [None]:
dfT = dfTest.groupby(['Country','Education','Gender'])["Salary"].mean().round(0).unstack().reset_index()
dfT.columns.name = None
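# Select by name rather than position so the result does not depend on which gender columns are present
dfT = dfT[['Country','Education','Man','Woman']]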
fig = px.scatter(dfT, x="Man", y="Woman", color="Country", hover_data=["Education"])                
fig.add_trace(go.Scatter(x=dfT['Man'], y=(dfT['Man']*.5),
                    mode='lines',
                    name='50% of Men Wages',
                        line=dict(color='black', width=2)))
fig.add_trace(go.Scatter(x=dfT['Man'], y=(dfT['Man']*.3),
                    mode='lines',
                    name='30% of Men Wages',
                        line=dict(color='#cfcbc2', width=2)))
fig.add_trace(go.Scatter(x=dfT['Man'], y=(dfT['Man']*.6),
                    mode='lines',
                    name='60% of Men Wages',
                        line=dict(color='gray', width=2)))
fig.add_trace(go.Scatter(x=dfT['Man'], y=(dfT['Man']*.7),
                    mode='lines',
                    name='70% of Men Wages',
                        line=dict(color='#7a563c', width=2)))
fig.add_trace(go.Scatter(x=dfT['Man'], y=(dfT['Man']),
                    mode='lines',
                    name='Equal Wages',
                        line=dict(color='#f7051d', width=2)))
fig.update_traces(marker=dict(size=7,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.update_layout(
    title='The Gender Wage Gap',
    xaxis_title="Men's Average Salary degree wise",
    yaxis_title="Women's Average Salary degree wise",
    width=742,
    height=600,
    hovermode="x",
    plot_bgcolor='rgb(250,250,250)')
fig.update_xaxes(showline=True, linewidth=2, linecolor='black',zeroline=False)
                
fig.update_yaxes(showline=True, linewidth=2, linecolor='black',
                showgrid=True, gridwidth=1, gridcolor='#d6d5d2', zeroline =False)
fig.show()


Tree map
* Clearly, the US takes the lion's share, way ahead of any other country.
* In the US, Men with 10-20 and 20+ years of experience get a much larger share of the Salary.
* Women are somewhat represented in the US in the 3-5 yrs and 5-10 yrs categories.

Gender Wage Gap Graph
* On average, only about 10% of Women earn more than Men with the same degree.
* Most Women earn less than 50% of what Men earn.
* In Singapore, Portugal, and Taiwan, Women with a Doctoral degree earn more than Men on average.
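
The salary question is answered in brackets such as `$0-999` or `> $500,000`, so averages like those plotted above are computed from bracket midpoints. The sketch below mirrors the approach of the salary-cleaning cell (treating the open-ended top bracket as 500,000-700,000 and rounding midpoints up) and then expresses women's mean pay as a percentage of men's. The helper names and the sample mean salaries are illustrative assumptions, not survey results.

```python
import math


def bracket_midpoint(bracket, open_ended_upper=700_000):
    """Midpoint of a salary bracket such as '$0-999' or '> $500,000'."""
    cleaned = bracket.replace("$", "").replace(",", "").strip()
    if cleaned.startswith(">"):
        # Open-ended top bracket: assume an upper bound, as the notebook does
        low = int(cleaned.lstrip("> "))
        high = open_ended_upper
    else:
        low_text, high_text = cleaned.split("-")
        low, high = int(low_text), int(high_text)
    return math.ceil((low + high) / 2)


def pay_ratio(women_mean, men_mean):
    """Women's mean salary as a percentage of men's."""
    return round(100 * women_mean / men_mean, 1)


for bracket in ["$0-999", "10,000-14,999", "> $500,000"]:
    print(bracket, "->", bracket_midpoint(bracket))

# Hypothetical mean salaries for one country and degree
print(pay_ratio(45_000, 60_000))
```

A ratio below 100 places a point under the "Equal Wages" line in the scatter plot. Because midpoints are approximations, small differences between groups should be read with caution.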

### Underrepresented Countries

In [None]:
df2 = df20.copy()
import plotly.graph_objs as gobj
gkk1 = df2.groupby(['Country','Gender']).size()
gkk1 = pd.DataFrame(gkk1).reset_index()
gkk1 = gkk1.rename(columns={0:'count'})
gkk1 = gkk1.pivot(index='Country', columns='Gender', values='count')
gkk1.columns.name = None
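# Countries with no respondents of a given gender come out of the pivot as NaN; treat them as 0
gkk1['Woman'] = gkk1['Woman'].fillna(0).astype(int)
gkk1['Man'] = gkk1['Man'].fillna(0).astype(int)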
gkk1 =gkk1.sort_values(by = 'Woman', ascending = False)
gkn = gkk1.reset_index()
data1 = dict(type = 'choropleth',
            locations = gkn['Country'],
            locationmode = 'country names',
            colorscale= 'Reds',
            #text= ['IND','NEP','CHI','PAK','BAN','BHU', 'MYN','SLK'],
            z= gkn['Woman'],#[1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0],
            colorbar = {'title':'Country Colours', 'len':200,'lenmode':'pixels' })
#initializing the layout variable
layout = dict(title = 'The Nationality of Female Respondents in 2020',geo = {'scope':'world'})
# Initializing the Figure object by passing data and layout as arguments.
col_map = gobj.Figure(data = [data1],layout = layout)
#plotting the map
iplot(col_map)
# dfW['GDP per capita'] = dfW['GDP per capita'].str.replace('$','').str.replace(',','')
# dfW['GDP per capita'] = dfW['GDP per capita'].astype(int)
# dfW['population'] = dfW['population'].str.replace(',','')
# dfWGDP = dfW[dfW['Country'].isin(cwpList[:20])]
# dfWGDP = dfWGDP[['Country','population','GDP per capita']]

Many countries on the African continent are not represented, nor are a few in Europe (noticeably Norway and Finland), South America, and Central America.

Poverty, lack of education, and social and political imbalances contribute to women not participating proportionately.

<img src="https://stories.plancanada.ca/wp-content/uploads/2017/10/6-reasons-education.jpg" width="400">



### Conclusion

There are many hurdles for girls and women.

We have to focus on bringing social and political stability, especially to countries in Africa.

Educate girls from primary school through college.

Help reduce poverty worldwide and improve the distribution of wealth.

Improve safety.

Going forward, we hope more and more girls enroll in STEM branches.