***An analysis of the 2019 Kaggle ML and DS Survey for Women's Representation in Machine Learning and Data Science***
** Gonna work on it **

![](https://imgur.com/FZS0gSi.png)

In the last few years, a lot of effort has been made to increase women's representation in the tech world. The idea behind these efforts is that attracting and retaining more women in science, technology, engineering, and mathematics (STEM) will ultimately enhance innovation and creativity. It is also important to note that when women are not involved in these fields, experiences, needs, and desires that are unique to women may be overlooked. Our voices will only be heard when we participate in sufficient numbers.
The iconic 2019 photograph of **Katie Bouman**, watching the first image of a black hole as it was being reconstructed, has inspired a lot of women.

![](https://inteng-storage.s3.amazonaws.com/images/sizes/katieboumanfacebookmain_resize_md.jpg)

Similarly, Grace Hopper, Marie Curie, Margaret Hamilton, and many other immensely talented women have long been regarded as idols and icons. They have time and again inspired us to follow our passions and do something incredible. And it's not as if women haven't been making significant contributions to IT for decades. Still, women remain substantially underrepresented in STEM fields.
A lot of companies today are coming forward to change the world of work. This year, even the United Nations celebrated International Women's Day under the official theme **#BalanceforBetter**, advocating a more gender-balanced world.

### So, have we achieved parity? Are things improving?
The 2019 Kaggle ML & DS survey is an excellent place to start. Let's use our analysis skills to investigate whether things are actually improving or whether much is still left to be done. Let the numbers speak for themselves.

# What is the annual Kaggle survey?

Kaggle's annual Machine Learning and Data Science Survey is now in its third year. This year, as in [2017](https://www.kaggle.com/kaggle/kaggle-survey-2017) and [2018](https://www.kaggle.com/kaggle/kaggle-survey-2018/), Kaggle conducted an industry-wide survey that presents a comprehensive view of the state of data science and machine learning. The Survey Challenges from 2017 and 2018 were very informative, and the same is expected of the 2019 survey.

# Objective
The objective of this notebook is to analyze the survey data to answer an important question: is women's participation in STEM really improving, or is it just hype?

# Methodology
This notebook will only focus on the women in ML and DS, thereby analyzing the answers of the female respondents solely. The analysis will be done in two steps:

* Step 1 will comprise the analysis of the 2019 Kaggle DS & ML survey for female respondents.
* Step 2 will compare the results of the 2019 survey with the previous two surveys. This will present a better picture and show whether things are improving or not.

In this notebook, we shall analyze the female participation in the survey under 5 different categories, namely:
![](https://cdn-images-1.medium.com/max/800/1*US4fqDrQqfEMfFm_MSxWkg.png)

* Demographics
* Education 
* Professional Experience 
* Job Role
* Salary

In [None]:
# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Graphics in retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina' 

# Increase the default plot size and set the color scheme
plt.rcParams['figure.figsize'] = 8, 5
plt.rcParams['image.cmap'] = 'viridis'


import plotly.offline as py
import pycountry

py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

from plotly.offline import init_notebook_mode, iplot 
init_notebook_mode(connected=True)

import folium 
from folium import plugins


In [None]:
#Importing the 2019 Dataset
df_2019 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
df_2019.columns = df_2019.iloc[0]
df_2019=df_2019.drop([0])

df_2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
df_2018.columns = df_2018.iloc[0]
df_2018=df_2018.drop([0])

df_2017=pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv',encoding='ISO-8859-1')
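
The 2018 and 2019 files are loaded with the same two-step pattern: the first data row, which holds the full question text, is promoted to the header and then dropped. The sketch below illustrates that pattern on a tiny made-up frame; the helper name `promote_first_row` and the toy values are assumptions for illustration, not part of the survey files.

```python
import pandas as pd

# Toy frame that mimics the layout of the survey CSV (hypothetical values):
# pandas reads the short codes (Q1, Q2) as the header, and the first data row
# holds the full question text.
raw = pd.DataFrame({
    'Q1': ['What is your age (# years)?', '22-24', '30-34'],
    'Q2': ['What is your gender? - Selected Choice', 'Female', 'Male'],
})


def promote_first_row(df):
    """Use the first row as column names and drop it from the data."""
    out = df.copy()
    out.columns = out.iloc[0]
    out = out.drop(index=0).reset_index(drop=True)
    out.columns.name = None  # remove the leftover row label from the header
    return out


survey = promote_first_row(raw)
print(survey)
```

After this step the columns can be addressed by the question text, which is how the rest of the notebook selects them.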

# The State of Affairs Today - 2019

In [None]:
print('Total respondents in 2019:',df_2019.shape[0])

# Harmonise the country names with the earlier surveys before splitting by gender,
# so that the male/female subsets inherit the cleaned names
country_col = 'In which country do you currently reside?'
df_2019[country_col] = df_2019[country_col].replace({'United States of America':'United States','Viet Nam':'Vietnam','China':"People 's Republic of China","United Kingdom of Great Britain and Northern Ireland":'United Kingdom',"Hong Kong (S.A.R.)":"Hong Kong"})

# .copy() gives independent frames that can be modified later without warnings
male_2019 = df_2019[df_2019['What is your gender? - Selected Choice']=='Male'].copy()
female_2019 = df_2019[df_2019['What is your gender? - Selected Choice']=='Female'].copy()
female_2018 = df_2018[df_2018['What is your gender? - Selected Choice']=='Female'].copy()
female_2017 = df_2017[df_2017['GenderSelect']=='Female'].copy()

topn = 10

count_male = male_2019[country_col].value_counts()[:topn]
count_female = female_2019[country_col].value_counts()[:topn]

# Read labels and values straight from the counts, which avoids depending on
# the column names that reset_index() produces
pie_men = go.Pie(labels=count_male.index,values=count_male.values,name="Men",hole=0.5,domain={'x': [0,0.46]})
pie_women = go.Pie(labels=count_female.index,values=count_female.values,name="Women",hole=0.5,domain={'x': [0.52,1]})

layout = dict(title = 'Top-10 countries with respondents', font=dict(size=12), legend=dict(orientation="h"),
              annotations = [dict(x=0.2, y=0.5, text='Men', showarrow=False, font=dict(size=20)),
                             dict(x=0.8, y=0.5, text='Women', showarrow=False, font=dict(size=20)) ])

fig = dict(data=[pie_men, pie_women], layout=layout)
py.iplot(fig)

## Comparison of number of respondents w.r.t the previous surveys
![](https://cdn-images-1.medium.com/max/800/1*lQOxGmQA8cNq73oErNqB-Q.png)



# 1. Demographics
Barack Obama once said that where women are full participants in a country's politics or economy, societies are more likely to succeed. Let's see if the picture has changed from the past years.

## 1.1 The Great Gender Divide


In [None]:
colors = ['#1BA1E2', '#AA00FF', '#F0A30A','#8c564b'] # blue, purple, amber, brown
counts = df_2019['What is your gender? - Selected Choice'].value_counts(sort=True)
labels = counts.index
values = counts.values

pie = go.Pie(labels=labels, values=values, marker=dict(colors=colors))
layout = go.Layout(title='Gender Distribution in 2019')
fig = go.Figure(data=[pie], layout=layout)
py.iplot(fig)

In [None]:
from plotly.subplots import make_subplots

counts1 = df_2019['What is your gender? - Selected Choice'].value_counts(sort=True)
counts2 = df_2018['What is your gender? - Selected Choice'].value_counts(sort=True)
counts3 = df_2017['GenderSelect'].value_counts(sort=True)


# Take the labels from the counts themselves so that every label matches its value
# (the answer options in the 2017 survey differ from those in 2018 and 2019)
# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'},{'type':'domain'}]])
fig.add_trace(go.Pie(labels=counts3.index, values=counts3.values, name="2017"),
              1, 1)
fig.add_trace(go.Pie(labels=counts2.index, values=counts2.values, name="2018"),
              1, 2)
fig.add_trace(go.Pie(labels=counts1.index, values=counts1.values, name="2019"),
              1, 3)
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.5, hoverinfo="label+percent+name")

fig.update_layout(
    title_text="Gender Distribution over the years",font=dict(size=12), legend=dict(orientation="h"),
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='2017', x=0.11, y=0.5, font_size=20, showarrow=False),
                 dict(text='2018', x=0.5, y=0.5, font_size=20, showarrow=False),
                 dict(text='2019', x=0.88, y=0.5, font_size=20, showarrow=False)])
fig.show()


Well, the majority of the respondents were male, a whopping 81% of the total. The charts above show this distribution visually. This is a known issue and a somewhat disturbing one too. With such low participation, how will our ideas and suggestions be incorporated in the industry? How will we be effectively represented?
**Come on, girls! We need more participation.**
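
To put a number on this comparison across years, the share of female respondents can be computed per survey. The following is a minimal sketch on made-up data; the helper `female_share` and the toy frames are assumptions for illustration. With the real data one would pass `df_2019` or `df_2018` with the column `'What is your gender? - Selected Choice'`, and `df_2017` with `'GenderSelect'`.

```python
import pandas as pd


def female_share(df, gender_col):
    """Percentage of answers in gender_col that are 'Female'.

    value_counts(normalize=True) ignores missing answers by default.
    """
    shares = df[gender_col].value_counts(normalize=True) * 100
    return round(float(shares.get('Female', 0.0)), 1)


# Hypothetical toy data, NOT the survey results
toy_2018 = pd.DataFrame({'gender': ['Male'] * 8 + ['Female'] * 2})
toy_2019 = pd.DataFrame({'gender': ['Male'] * 3 + ['Female'] * 1})

summary = pd.Series({'2018': female_share(toy_2018, 'gender'),
                     '2019': female_share(toy_2019, 'gender')})
print(summary)
```

Comparing percentages rather than raw counts matters here because the total number of respondents differs from year to year.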

## 1.2 Daunting obstacles remain in Africa

It is heartening to see that the largest share of female respondents comes from India. Participation from Africa, however, has been very low. Participation in this survey may not reflect the actual situation, but it does suggest that more effort is needed there.

In [None]:
def get_name(code):
    '''
    Translate an ISO alpha-3 code to the name of the country.
    Values that are not valid codes (e.g. full country names) are returned unchanged.
    '''
    try:
        name = pycountry.countries.get(alpha_3=code).name
    except (AttributeError, LookupError):
        name = code
    return name

# Build the table explicitly so that the column names are under our control
country_counts = female_2019['In which country do you currently reside?'].value_counts()
country_number = pd.DataFrame({'number': country_counts.values, 'country': country_counts.index})
country_number['country'] = country_number['country'].apply(get_name)
country_number.head(5)



worldmap = [dict(type = 'choropleth', locations = country_number['country'], locationmode = 'country names',
                 z = country_number['number'], autocolorscale = True, reversescale = False, 
                 marker = dict(line = dict(color = 'rgb(100,100,100)', width = 0.5)), 
                 colorbar = dict(title = 'Number of respondents'))]

layout = dict(title = 'Country of Residence of Female Respondents', geo = dict(showframe = False, showcoastlines = True, 
                                                                projection = dict(type = 'mercator')))

fig = dict(data=worldmap, layout=layout)
py.iplot(fig, validate=False)

## 1.3 Indian & U.S. women lead the way

**Top 20 Countries of Female Respondents**

Above we saw an overview of the countries the female respondents came from. Let's now look at the top 20 countries of female respondents.

In [None]:
def return_percentage(data,question_part,response_count):
    """Calculates the percentage of respondents for each value in a given column"""
    counts = data[question_part].value_counts()
    percentage = (counts*100)/response_count
    # Name the column explicitly so that it can be looked up by the question text
    return percentage.to_frame(name=question_part)


def plot_multiple_choice(data,question,title,x_axis_title,response_count,topn=20):
    # Keep only the topn most frequent answers so that x and y have the same length
    df = return_percentage(data,question,response_count).head(topn)
    
    trace1 = go.Bar(
                    y = df.index,
                    x = df[question],
                    orientation='h',
                    name = "Kaggle Survey 2019",
                    marker = dict(color='#00C9E0',
                                 line=dict(color='black',width=1)),
                    text = df.index)
    traces = [trace1]
    layout = go.Layout(barmode = "group",title=title,width=1000, height=500, 
                       xaxis= dict(title=x_axis_title),
                       yaxis=dict(autorange="reversed"),
                       showlegend=False)
    fig = go.Figure(data = traces, layout = layout)
    iplot(fig)
    
# Percentages are relative to the number of female respondents in 2019
response_count = len(female_2019['In which country do you currently reside?'])
plot_multiple_choice(female_2019,'In which country do you currently reside?','Top 20 countries of female respondents','Percentage of Respondents',response_count)
    

Well, in line with the pattern we have been seeing, Indian and US women are way ahead of their counterparts. A lot of this can be attributed to the booming tech industry in both countries. However, compared to men, the proportion is still low. It is also heartening to see Turkey in the top 10, since I recently read an article on Turkey's tech landscape titled [*Turkey's Booming Tech Industry: Where Are the Women?*](https://pulitzercenter.org/reporting/turkeys-booming-tech-industry-where-are-women)

In [None]:
# Harmonise the country names so that they match across the three surveys
country_col = 'In which country do you currently reside?'
country_map = {'United States of America':'United States','Viet Nam':'Vietnam',
               'China':"People 's Republic of China",
               "United Kingdom of Great Britain and Northern Ireland":'United Kingdom',
               "Hong Kong (S.A.R.)":"Hong Kong"}
# Copy first so that the replacement does not act on a slice of the full survey
female_2018 = female_2018.copy()
female_2019 = female_2019.copy()
female_2018[country_col] = female_2018[country_col].replace(country_map)
female_2019[country_col] = female_2019[country_col].replace(country_map)



female_country_2019 = female_2019['In which country do you currently reside?']
female_country_2018 = female_2018['In which country do you currently reside?']
female_country_2017 = female_2017['Country']
                                                                  
f_2019 = female_country_2019[(female_country_2019 == 'India') | (female_country_2019 == 'United States')].value_counts()
f_2018 = female_country_2018[(female_country_2018 == 'India') | (female_country_2018 == 'United States')].value_counts()
f_2017 = female_country_2017[(female_country_2017 == 'India') | (female_country_2017 == 'United States')].value_counts()                                                                  
                                         
female_country_count = pd.DataFrame(data = [f_2017,f_2018,f_2019],index = ['2017','2018','2019'])    

female_country_count['total'] = [len(female_2017),len(female_2018),len(female_2019)]
female_country_count['US%'] = female_country_count['United States']/female_country_count['total']*100
female_country_count['India%'] = female_country_count['India']/female_country_count['total']*100

female_country_count[['India%','US%']].plot(kind='bar',cmap='tab10')
plt.gcf().set_size_inches(10,8)
plt.title('Pattern of US and Indian Female respondents over the years', fontsize = 15)
plt.xticks(rotation=0,fontsize='10', horizontalalignment='right')
plt.xlabel('Years',fontsize=15)
plt.ylabel('Percentage of Respondents',fontsize=15)
plt.legend(fontsize=15,bbox_to_anchor=(1.04,0.5), loc="center left",labels=['India','US'])




## 1.4 The young brigade dominates
#### Age Distribution of Female Respondents


In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="What is your age (# years)?", data=female_2019,palette ='Blues',order = df_2019['What is your age (# years)?'].value_counts().index.sort_values())
plt.title('Age wise Distribution of Female Respondents',fontsize=15)
plt.xticks( rotation=45, fontweight='bold', fontsize='10', horizontalalignment='right')
plt.yticks( fontsize=10)
plt.xlabel('Age in years',fontsize=15)
plt.ylabel('Count',fontsize=15)


The young brigade is setting a good example. It seems that women in their 20s and early 30s are really gearing up to improve the gender ratio in the industry. What is also heartening is that there are female respondents in the 60-69 and 70+ age groups. These women are definitely an inspiration for others.



## 1.5 Country wise Age distribution


In [None]:
## Grouping the ages
# Work on the female respondents only, in line with the rest of the notebook
age_col = 'What is your age (# years)?'
country_col = 'In which country do you currently reside?'
age_groups = {'18-21':'18-21', '22-24':'22-29', '25-29':'22-29',
              '30-34':'30-39', '35-39':'30-39', '40-44':'40-49', '45-49':'40-49',
              '50-54':'50-59', '55-59':'50-59', '60-69':'60-69', '70+':'70s and above'}
female_age = female_2019.copy()
female_age['Age Group'] = female_age[age_col].map(age_groups)

count_age = female_age.groupby([country_col,'Age Group'])[age_col].count().reset_index()
count_age.columns = ['Country','Age Group','Count']
# Keep the 10 countries with the most female respondents, excluding the 'Other' bucket
count_age = count_age[count_age['Country'].isin(female_age[country_col].value_counts()[:10].index)]
count_age = count_age[count_age['Country']!='Other']
count_age.pivot(index='Country',columns='Age Group',values='Count').plot.bar(stacked=True,width=0.8)
plt.gcf().set_size_inches(16,8)
plt.xticks( rotation=45,fontsize='10', horizontalalignment='right')
plt.yticks( fontsize=10)
plt.title('Country wise Age Distribution of Female Respondents', fontsize = 15)
plt.xlabel('Countries',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.legend(fontsize=15,bbox_to_anchor=(1.04,0.5), loc="center left")
plt.show()

Well, this is an interesting outcome. Most female respondents from **India** are women in their 20s, and this group is larger than in any other country. This suggests that Indian women in this age bracket are serious about making a career in Data Science and are working towards it.
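
Raw counts favour countries with many respondents, so a within-country share is a useful cross-check on a claim like this. The sketch below uses made-up answers and hypothetical column names (`country`, `age_group`); it only illustrates how `pd.crosstab` with `normalize='index'` turns counts into percentages per country.

```python
import pandas as pd

# Hypothetical toy answers, NOT the survey data
toy = pd.DataFrame({
    'country': ['India', 'India', 'India', 'India', 'Japan', 'Japan'],
    'age_group': ['22-29', '22-29', '18-21', '30-39', '22-29', '40-49'],
})

# normalize='index' makes every row (country) sum to 1, so countries with very
# different numbers of respondents become comparable
age_share = pd.crosstab(toy['country'], toy['age_group'], normalize='index') * 100
print(age_share.round(1))
```

Calling `age_share.plot.bar(stacked=True)` on such a table would give a 100% stacked bar chart, where each bar shows the age mix of one country.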


# 2. Education

*When girls are educated, their countries become stronger and more prosperous: Michelle Obama*

In [None]:
from wordcloud import WordCloud
female_title = female_2019['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].dropna()
plt.subplots(figsize=(8,8))
wordcloud = WordCloud(
                          background_color='black',
                          width=512,
                          height=384
                         ).generate(" ".join(female_title))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('graph.png')

plt.show()

In [None]:
edu_col = "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?"
plt.figure(figsize=(10,8))
sns.countplot(x=edu_col, data=female_2019,palette="Set3",
             order = female_2019[edu_col].value_counts().index)
plt.title("Formal Education Level of Female Respondents",fontsize=15)
plt.xticks( rotation=45,fontsize='10', horizontalalignment='right')
plt.yticks( fontsize=10)
plt.xlabel('Education level',fontsize=15)
plt.ylabel('Count',fontsize=15)


## Women's educational qualifications, country wise

In [None]:
df_edu_temp = pd.crosstab(female_2019['In which country do you currently reside?'],
              female_2019['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'])


edu_countries = ['Brazil', 'India', 'Japan', 'Russia', 'United States', 'Canada', 'Germany',
                 "People 's Republic of China", 'United Kingdom']
df_edu = df_edu_temp[df_edu_temp.index.isin(edu_countries)].drop('I prefer not to answer',axis=1)

df_edu.plot(kind='bar',width=1)
plt.gcf().set_size_inches(16,8)
plt.title('Country wise Education Level of Female Respondents', fontsize = 15)
plt.xlabel('Countries',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.xticks( rotation=45,fontsize='15', horizontalalignment='right')
plt.yticks( fontsize=10)
plt.legend(fontsize=15,bbox_to_anchor=(1.04,0.5), loc="center left")
plt.show()

Clearly, a lot of women are opting for a Master's degree after their Bachelor's. There are also PhDs among those who answered the survey. However, there are also some women who have had no formal education past high school. That they took the survey in spite of this is commendable. They should be encouraged to complete their education, either full time or through part-time courses. Also, the number of respondents with a Master's qualification is far higher in India and the US than in other countries.

# 3. Professional Experience
*“We need to understand that if we all work on inclusion together, it’s going to be faster, broader, better, and more thorough than anything we can do on our own.”: Ellen Pao*

In [None]:
plt.figure(figsize=(8,8))
ax=female_2019['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'].value_counts()[:10].plot.barh(width=0.9,color=sns.color_palette('Set3',25))
plt.gca().invert_yaxis()
plt.title('Current Roles 2019')
plt.show()

Again, the most common title among respondents is Student, which could cover both undergraduate and postgraduate levels. This is closely followed by women in the Data Scientist role. Interestingly, there are also women who are not employed but responded to the survey. These women may not be working by choice, or they may be looking for jobs. We could reach out to them to understand whether they are willing to work and assist them with that.

### Women's current roles, country wise

In [None]:

df_roles_temp = pd.crosstab(female_2019['In which country do you currently reside?'],
            female_2019['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'])
# Merge closely related roles into new combined columns, then drop the originals
df_roles_temp['Data/DB Engineer'] = df_roles_temp['DBA/Database Engineer']+df_roles_temp['Data Engineer']
df_roles_temp['Data/Business Analyst'] = df_roles_temp['Data Analyst']+df_roles_temp['Business Analyst']
role_countries = ['Brazil', 'India', 'Japan', 'Russia', 'United States', 'Canada', 'Germany',
                  "People 's Republic of China", 'United Kingdom']
# Filter on this table's own index rather than on the index of df_edu_temp
df_roles = df_roles_temp[df_roles_temp.index.isin(role_countries)].drop(['Other','DBA/Database Engineer','Data Engineer','Data Analyst','Business Analyst'],axis=1)


ax = df_roles.plot(kind='bar',width=1)
plt.gcf().set_size_inches(20,8)
plt.title('Country wise Current Roles of Female Respondents', fontsize = 15)
plt.xlabel('Countries',fontsize=15)
plt.ylabel('Count',fontsize=15)
plt.xticks( rotation=45,fontsize='15', horizontalalignment='right')
plt.yticks( fontsize=10)
plt.legend(fontsize=15,bbox_to_anchor=(1.04,0.5), loc="center left")
plt.show()

Well, while more students responded to the survey in India, more Data Scientists responded in the US.

# 4. How big is your organisation?

In [None]:
plt.figure(figsize=(8,8))
female_2019['What is the size of the company where you are employed?'].value_counts()[:10].plot.barh(width=0.9,color=sns.color_palette('Set3',25))
plt.gca().invert_yaxis()
plt.title('Size of Company')
plt.show()

Surprisingly, a lot of women work in companies with fewer than 50 employees. This may be because there are a lot of students in the population, and many of them might be working for small startups. Well, this is something that can be looked into further.
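
A bar chart sorted by count hides the natural order of the size brackets. As a minimal sketch, the counts can be re-ordered from the smallest to the largest company before plotting. The answer labels in `size_order` are assumed here and should be checked against the actual column; the answers themselves are made up.

```python
import pandas as pd

# Assumed answer labels, smallest to largest; check them against the real column
size_order = ['0-49 employees', '50-249 employees', '250-999 employees',
              '1000-9,999 employees', '> 10,000 employees']

# Hypothetical toy answers, NOT the survey data
answers = pd.Series(['50-249 employees', '0-49 employees', '> 10,000 employees',
                     '0-49 employees', '1000-9,999 employees', '0-49 employees'])

# reindex puts the counts in size order and shows empty brackets as 0
size_counts = answers.value_counts().reindex(size_order, fill_value=0)
print(size_counts)
```

Plotting `size_counts` with `.plot.barh()` would then show the brackets in order, which makes it easier to see how respondents are spread between small and large organisations.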

# TODO

This is a work in progress, and there are many other interesting findings that can be drawn from the dataset. Hence this kernel will be continuously updated.