# 2020 Kaggle ML & DS Survey Results



This year, as in 2017, 2018, and 2019, Kaggle set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for 3.5 weeks in October, and after cleaning the data, it was finalized with 20,036 responses!

I'll be analyzing the dataset made up of the survey participants' answers. My aim is to provide an accurate and easy-to-understand visual overview of the state of the data science community.

First, let's review the survey methodology notes from Kaggle.

* An invitation to participate in the survey was sent to the entire Kaggle community.
* The survey accommodated participants located in 171 different countries and territories.
* Respondents that were flagged by the survey system as “Spam” were excluded.
* Respondents with the most experience were asked the most questions.  For example, students and unemployed persons were not asked questions about their employer.  Likewise, respondents that do not write code were not asked questions about writing code.

This notebook focuses on the answers of participants with the following job titles: **Data Scientist, Research Scientist, Statistician, Machine Learning Engineer, Data Engineer, Data/Business Analyst** (DBA/Database Engineers are grouped with Data Engineers). The answers of students, unemployed participants, and those with any other current job title are therefore excluded from the analysis.

The notebook covers 5 main topics:
* [Overall Age, Location, Gender Distribution](#overview)
* [Gender-diversity by Country](#gencountry)
* [Education Levels](#education)
* [Comparing various Job Titles within Data Science](#jobtitles)
* [Compensation Analysis - Median Salary by Country](#compensation)

The code driving the visualizations has been hidden to save visual space; you can unhide it by clicking "show hidden code" throughout the notebook.

In [None]:
%%capture
# pywaffle for pictograms
!pip install pywaffle

In [None]:
%%capture
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
import matplotlib.pyplot as plt

import seaborn as sns

from pywaffle import Waffle

import plotly.graph_objects as go
import plotly.graph_objs as gobj
import plotly
import plotly.express as px
import plotly.io as pio
from plotly.subplots import make_subplots

pio.templates.default = "ggplot2"

pd.options.display.max_columns=999
pd.options.display.max_rows=50
pd.options.display.max_colwidth = 500

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
maindf = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)

In [None]:
### DATA PREPROCESSING IN THIS CELL

# Group genders under one umbrella to lower cardinality.
maindf.loc[maindf['Q2'].isin(['Prefer to self-describe','Nonbinary']),'Q2'] = 'Other'

# Shorten answers where possible to prevent overlapping tick labels.
maindf.loc[maindf['Q3']=='United Kingdom of Great Britain and Northern Ireland','Q3'] = 'UK'
maindf.loc[maindf['Q3']=='United States of America','Q3'] = 'USA'
maindf.loc[maindf['Q3']=='Iran, Islamic Republic of...','Q3'] = 'Iran'
maindf.loc[maindf['Q3']=='United Arab Emirates','Q3'] = 'UAE'
maindf.loc[maindf['Q3']=='Republic of Korea','Q3'] = 'South Korea'

# Shorten answers where possible to prevent overlapping tick labels. Group certain education levels to lower cardinality.
maindf.loc[maindf['Q4']=='Some college/university study without earning a bachelor’s degree','Q4'] = 'Other'
maindf.loc[maindf['Q4']=='No formal education past high school','Q4'] = 'Other'
maindf.loc[maindf['Q4']=='I prefer not to answer','Q4'] = 'No Answer Provided'
maindf.loc[maindf['Q4']=='Professional degree','Q4'] = 'Other'
maindf.loc[maindf['Q4']=='Doctoral degree','Q4'] = 'Doctoral'
maindf.loc[maindf['Q4']=="Bachelor’s degree",'Q4'] = 'Bachelor’s'
maindf.loc[maindf['Q4']=="Master’s degree",'Q4'] = 'Master’s'

# Shorten answers where possible to prevent overlapping tick labels. Reformatting to wrap long names.
maindf.loc[maindf['Q5']=='Data Analyst','Q5'] = 'Data/Biz.<br>Analyst'
maindf.loc[maindf['Q5']=='Business Analyst','Q5'] = 'Data/Biz.<br>Analyst'
maindf.loc[maindf['Q5']=='Software Engineer','Q5'] = 'Software<br>Engineer'
maindf.loc[maindf['Q5']=='Data Engineer','Q5'] = 'Data<br>Engineer'
maindf.loc[maindf['Q5']=='DBA/Database Engineer','Q5'] = 'Data<br>Engineer'
maindf.loc[maindf['Q5']=='Machine Learning Engineer','Q5'] = 'ML<br>Engineer'
maindf.loc[maindf['Q5']=='Research Scientist','Q5'] = 'Research<br>Scientist'
maindf.loc[maindf['Q5']=='Data Scientist','Q5'] = 'Data<br>Scientist'
maindf.loc[maindf['Q5']=='Product/Project Manager','Q5'] = 'Product/Project<br>Manager'

# Establish a new column to identify employment status. 
maindf['Employment'] = maindf['Q5']
maindf.loc[~maindf['Q5'].isin(['Student','Currently not employed']),['Employment']] = 'Employed'
maindf.loc[maindf['Employment']=='Currently not employed','Employment'] = 'Not employed'

# Shorten answers where possible to prevent overlapping tick labels. Group certain answers to lower cardinality.
maindf.loc[maindf['Q15']=='I do not use machine learning methods','Q15'] = 'I don\'t use ML'
maindf.loc[maindf['Q15']=='Under 1 year','Q15'] = '< 1 years'
maindf.loc[maindf['Q15']=='5-10 years','Q15'] = '5+ years'
maindf.loc[maindf['Q15']=='10-20 years','Q15'] = '5+ years'
maindf.loc[maindf['Q15']=='20 or more years','Q15'] = '5+ years'

# Shorten answers where possible to prevent overlapping tick labels. Group certain answers to lower cardinality.
maindf.loc[maindf['Q6']=='I have never written code','Q6'] = 'No coding<br>experience'
maindf.loc[maindf['Q6']=='10-20 years','Q6'] = '10+ years'
maindf.loc[maindf['Q6']=='20+ years','Q6'] = '10+ years'


# only interested in these titles
select_titles = [
    'Data<br>Scientist',
    'Research<br>Scientist',
    'ML<br>Engineer', 
    'Data<br>Engineer', 
    'Data/Biz.<br>Analyst',
    'Statistician'
]


# Keep the question-text row (row 0) plus the respondents whose title is in select_titles
keep_rows = maindf['Q5'].isin(select_titles)
keep_rows.iloc[0] = True
maindf = maindf[keep_rows]

In [None]:
colorlist = px.colors.qualitative.Vivid
titlecolor="#0D2A63"
mcolor1 = colorlist[6] # yellow
mcolor2 = colorlist[5] # green
mcolor3 = colorlist[7] # turq

tick_font = dict(size=14, family='Soleil')
title_font = dict(size=20, color=titlecolor, family='Poynter Gothic Text')
annot_font = dict(size=14, family='Soleil', color="#0D2A63")

# A General Overview

<div id="overview"></div>

Let's start with the basics and visualize the **overall age, gender and location distribution** of the survey respondents.

In [None]:
# Extract all the answers except missing records
vizdata=maindf.loc[1:, "Q1"].dropna()

# Obtain percentage of each answer
percentages = vizdata.value_counts(normalize=True).sort_index()
# Obtain raw counts of each answer
counts = vizdata.value_counts(normalize=False).sort_index()

# Combine all into one df
vizdata = pd.DataFrame(
    np.hstack([percentages.values.reshape(-1, 1), counts.values.reshape(-1, 1)]),
    index=percentages.index,
    columns=["Percent", "Count"],
).reset_index()

fig = px.bar(vizdata, x="index", y="Count", text="Percent",range_y=[0,2200])

# Create color list so that multiple colors are used for emphasis 
colors = [colorlist[10],]*11
colors[:4] = [colorlist[5]]*4

fig.update_traces(
    hovertemplate="Age: %{x}<br>Count: %{y:,.0f}",
    marker_color=colors,
    marker_opacity=0.8,
    texttemplate="%{text:.0%}",
    textposition='inside'
)

fig.update_layout(
    xaxis=dict(
        showgrid=False,
        title_text="Age Groups", 
        tickfont=tick_font
    ),
    yaxis=dict(
        showticklabels=True,
        title_text="Number of Respondents"),
    hoverlabel=dict(
        bgcolor="white",
        font_size=14,
    ),
    margin=dict(pad=5),
    height=500,
    width=800,
    title=dict(
        text="Age Distribution of Data Science Professionals",
        y=1,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top',
        font=title_font
))
# line that separates first 4 bars and the rest
fig.add_shape(
    type="line",
    y0=0, x0=3.5, 
    y1=1, x1=3.5,
    xref='x',
    yref='paper',
    line=dict(
        color="black",
        width=1,
        dash="dashdot",
    ))

# vertical gray rectangle to highlight first 4 bars
fig.add_vrect(
    x0=-0.5, x1=3.5,
    y0=0,y1=1,
    fillcolor="darkgray", opacity=0.2,
    layer="below", line_width=0,
)

# Annotation to highlight percent of top 4
fig.add_annotation(
    y=2100,
    x=3,
    xref='x',yref='y',
    text='<i>≅63% are below 35 yrs old</i>',
    textangle=0,
    showarrow=True,
    arrowhead=7,
    arrowsize=1,
    arrowwidth=1.5,
    ax=150,
    ay=-10,
    font=annot_font,
)

fig.show()

The majority of the data science professionals who responded to the survey are **below 35 years old**, and 1 in 4 is in their late twenties.

The next graph shows the **10 countries** with the **most DS professionals**.

In [None]:
vizdata = maindf.loc[1:, "Q3"].dropna()

percentages = vizdata.value_counts(normalize=True).sort_values()
counts = vizdata.value_counts(normalize=False).sort_values()

vizdata = pd.DataFrame(
    np.hstack([percentages.values.reshape(-1, 1), counts.values.reshape(-1, 1)]),
    index=percentages.index,
    columns=["Percent", "Count"],
).reset_index()

vizdata = vizdata[vizdata['index']!='Other'].tail(10)

# Obtain list of top 10 countries to be used in the following cell
top10list = vizdata['index']

fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=vizdata["Count"], 
        y=vizdata["index"],
        text=vizdata["Percent"],
        orientation="h",name=''))

colors = [colorlist[10],]*10
colors[-2:] = [colorlist[5]]*2

fig.update_traces(
    hovertemplate="Country: %{y}<br>Count: %{x:,.0f}",
    marker_color=colors,
    marker_opacity=0.8,
    texttemplate="%{text:.0%}",
    textposition='inside'
)

fig.update_layout(
    xaxis=dict(
        showticklabels=True,
        showgrid=True,
        title="Number of Respondents"
    ),
    yaxis=dict(
        showgrid=False,
        tickfont=tick_font,
        title="Country of Residence"
        
    ),    
    hoverlabel=dict(
        bgcolor="white",
        font_size=14,
    ),
    margin=dict(pad=5),
    height=600,
    width=800,
    title=dict(
        text='10 Countries with the Most Data Science Professionals',
        y=0.94,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top',
        font=title_font                  
    )
)

fig.add_shape(
    type="line",
    y0=0.8, x0=0, 
    y1=0.8, x1=1,
    xref='paper',
    yref='paper',
    line=dict(
        color="black",
        width=1,
        dash="dashdot",
    ))

fig.add_vrect(
    x0=0, x1=1800,
    y0=0.8,y1=1,
    xref='paper',
    yref='paper',
    fillcolor="darkgray", opacity=0.2,
    layer="below", line_width=0,
)


fig.add_annotation(
    y=0.85,
    x=0.8,
    xref='paper',yref='paper',
    text='<i>34% reside in India and the US</i>',
    textangle=0,
    showarrow=True,
    arrowhead=7,
    arrowsize=1,
    arrowwidth=1.5,
    ax=10,
    ay=50,
    font=annot_font,
)




fig.show()

Kagglers residing in India and the U.S. alone make up 34% of respondents, while those from the top 10 countries make up 53%.
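
As a rough illustration of how combined shares like these can be derived, the sketch below sums the shares of the `k` most frequent answers using `value_counts(normalize=True)`. The helper name `top_k_share` and the toy country list are assumptions made for this example; they are not part of the survey data or of the notebook's own code.

```python
import pandas as pd

def top_k_share(answers: pd.Series, k: int) -> float:
    """Return the combined share of the k most frequent answers."""
    # value_counts sorts by frequency, most common first
    shares = answers.dropna().value_counts(normalize=True)
    return float(shares.head(k).sum())

# Toy data (hypothetical), not taken from the survey.
toy = pd.Series(["India"] * 4 + ["USA"] * 3 + ["Brazil"] * 2 + ["Japan"])
print(round(top_k_share(toy, 2), 2))  # share of the two largest groups
```

In the toy data the two largest groups hold 7 of the 10 answers, so the printed share is 0.7.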

Let's look at the gender distributions.

In [None]:
vizdata = maindf.loc[1:, ["Q2"]].dropna()

vizdata = pd.DataFrame(vizdata.value_counts(normalize=True)).reset_index()

# Create a dict of labels and values(percents) as Pywaffle uses dict as input
vizdata_dict = dict(zip(vizdata['Q2'],vizdata[0]*100))

# Build legend labels from percentages rounded to one decimal so they stay consistent (PyWaffle's own rounding can make the total exceed 100%)
labels = ['{} {:.1f}%'.format(k, v) for k, v in vizdata_dict.items()]

fig = plt.figure(
    FigureClass=Waffle,
    rows=6,
    title=dict(label='Overall Gender Distribution of DS Professionals',loc='left',size=4.5, color=titlecolor),
    figsize=(3,2),
    values=vizdata_dict,
    labels=labels,
    legend = {
        'loc': 'lower left',
        'bbox_to_anchor': (0, -0.4),
        'ncol': 4,
        'framealpha': 0,
        'fontsize': 3.5
    },
    icons='user', 
    icon_size=7,
    icon_legend=True,
    dpi=250,
    block_arranging_style='new-line',
    interval_ratio_y=0.1,
)

fig.show()

The gender distribution is overwhelmingly imbalanced: about 8 out of 10 respondents are men. I believe that diversity in any form (gender, racial, cultural, etc.) is highly beneficial, as it tends to bring different, unique perspectives and ideas into the picture. I hope that in the coming years we will see a more balanced and rich distribution from a gender-diversity perspective.

<div id="gencountry"></div>

We will now dive a bit deeper and look at the **gender distribution by country**. Since the field is dominated by men, the male to non-male ratio is a good way to gauge the level of gender diversity in each country.

The countries with a lower male/non-male ratio **(more gender-diverse) are at the top**, and the countries with a higher male/non-male ratio **(less gender-diverse) are at the bottom**.
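
To make the ranking idea concrete, here is a minimal sketch on made-up data. It ranks countries by the share of men, which orders them the same way as the male to non-male ratio (the ratio equals share / (1 - share)) while avoiding a division by zero when a country's sample has no non-male respondents. The toy rows and the country labels `A`, `B`, `C` are assumptions for illustration; only the column names `Q2` and `Q3` come from the survey file.

```python
import pandas as pd

# Toy responses (hypothetical), using the survey's column names.
toy = pd.DataFrame({
    "Q2": ["Man", "Man", "Woman", "Man", "Man", "Man", "Other", "Woman"],
    "Q3": ["A", "A", "A", "B", "B", "B", "B", "C"],
})

# Share of men per country: the mean of a True/False flag within each group.
male_share = toy["Q2"].eq("Man").groupby(toy["Q3"]).mean()

# Lower share of men first, i.e. the most gender-diverse samples lead.
ranking = male_share.sort_values()

# Equivalent ratio view; it becomes infinite for an all-male sample.
ratio = male_share / (1 - male_share)

print(ranking)
```

Country `C` (no men in the toy sample) comes first, followed by `A` (two of three) and `B` (three of four).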

In [None]:
vizdata = maindf.loc[1:,["Q2", "Q3"]].dropna()

country_sum = vizdata.groupby(["Q3"]).size()
vizdata = vizdata.groupby(["Q2", "Q3"], as_index=False).size()

vizdata["Country Sum"] = vizdata["Q3"].map(country_sum)
vizdata["Percent"] = vizdata["size"] / vizdata["Country Sum"]

vizdata = vizdata[vizdata['Q3']!='Other']

# Sort countries by share of male respondents, descending; Plotly draws the category array from bottom to top, so the lowest share ends up at the top
sort_cats = (
    vizdata.loc[vizdata["Q2"] == "Man"]
    .sort_values(by=["Percent"])["Q3"]
    .values
)[::-1]

fig = go.Figure(
    px.bar(
        data_frame=vizdata,
        x="Percent",
        y="Q3",
        text='Percent',
        color="Q2",
        orientation="h",
        color_discrete_sequence=[colorlist[5],colorlist[0],colorlist[10],colorlist[9]]
    )
)

fig.update_traces(
    hovertemplate="%{y}<br>%{x:.0%}",
    texttemplate='%{text:.0%}',
    marker_opacity=0.7
)

fig.update_layout(
    margin=dict(pad=5, t=80),
    legend=dict(
        bgcolor="rgba(0,0,0,0)",
        orientation="h",
        title_text="",
        font=dict(size=12,color="#0D2A63"),
        yanchor="bottom",
        y=1,
        xanchor="left",
        x=0,
    ),
    hoverlabel=dict(
        bgcolor="white",
        font_size=14,
    ),
    height=1800,
    width=800,
    xaxis=dict(visible=False),
    yaxis=dict(
        categoryorder= "array", 
        categoryarray= sort_cats,
        tickfont=tick_font, 
        tickmode="linear", 
        title=""),
    title=dict(
        text='Gender-Diversity by Country',
        y=0.99,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top',
        font=title_font                
    )
)

fig.show()

# Education Levels

<div id="education"></div>

Respondents are divided into four groups by education level: **Doctoral** degree holders, **Master's** degree holders, **Bachelor's** degree holders, and those who pursued **other means of education**. 

Other means of education consist of professional degree holders, high school grads, and those who received some college/university study without earning a degree.

In [None]:
vizdata = maindf[maindf['Q4']!='No Answer Provided'].loc[1:, "Q4"].dropna()

order_ = ['Doctoral','Master’s','Bachelor’s','Other']
vizdata = vizdata.value_counts(normalize=True).reindex(order_).reset_index()

vizdata_dict = dict(zip(vizdata['index'],vizdata['Q4']*100))
labels = ['{} {:.1f}%'.format(k, v) for k, v in vizdata_dict.items()]

fig = plt.figure(
    FigureClass=Waffle,
    rows=6,
    title=dict(label='Educational Backgrounds of Data Scientists',loc='left',size=5, color=titlecolor),
    figsize=(3,2),
    values=vizdata_dict,
    labels=labels,
    legend = {
        'loc': 'lower left',
        'bbox_to_anchor': (0, -0.4),
        'ncol': 4,
        'framealpha': 0,
        'fontsize': 3.5
    },
    icons=['university','university','university','laptop-code'],
    icon_size=7, 
    icon_legend=True,
    dpi=250,
    block_arranging_style='new-line',
    interval_ratio_y=0.1,
)

fig.show()

The **largest cluster** of data science professionals holds a **Master's** degree: **47%**. 
Respondents with a **Bachelor's** degree make up **26%** of these professionals, and **19%** hold a **Doctoral** degree. 
**8%** of the respondents got into the field without earning a bachelor's, master's, or doctoral degree.

The demand for skilled data science professionals is high, and the attention the field gets has grown substantially in recent years. Whether a bachelor's, master's, or doctoral degree is necessary to get into data science remains an open discussion. Alternatives to traditional education have been growing rapidly, especially since the beginning of the COVID-19 pandemic. It will be interesting to see how these percentages change in next year's survey results.

# Comparing various Job Titles within Data Science
<div id="jobtitles"></div>


We know that data science attracts professionals with varying backgrounds. In the upcoming sections, we'll compare these job titles to see how **work activities**, **machine learning experience**, and **coding experience** differ from one to another. We will also review which machine learning algorithms these professions use, and at what rate.

Let's start with work activities to see who does what. Respondents could pick more than one of the listed activities for this question, so the answers are not mutually exclusive.
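
Because the answers are not mutually exclusive, the per-title percentages can add up to more than 100%. The sketch below shows the idea on made-up data: each activity is stored in its own column (empty when not selected), turned into a True/False flag, and averaged per title. The titles, activity names, and rows are assumptions for illustration only.

```python
import pandas as pd

# Toy multi-select answers (hypothetical): one column per activity,
# None where the respondent did not tick that activity.
toy = pd.DataFrame({
    "Title": ["Analyst", "Analyst", "Engineer", "Engineer"],
    "Analyze data": ["Analyze data", "Analyze data", "Analyze data", None],
    "Manage infrastructure": [None, "Manage infrastructure",
                              "Manage infrastructure", "Manage infrastructure"],
})

# Turn each activity column into a True/False flag, then average per title.
flags = toy.drop(columns="Title").notna()
share = flags.groupby(toy["Title"]).mean()

print(share)
print(share.sum(axis=1))  # row totals exceed 1 because answers overlap
```

In this toy example every title's shares sum to 1.5, which is expected for a select-all-that-apply question and is not a sign of a calculation error.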

In [None]:
colrange = list(range(110, 118))
vizdata = maindf.iloc[1:, colrange].dropna(how="all")

# trim out the unnecessary part of the answers 
vizdata.columns = maindf.iloc[0, colrange].apply(lambda x: x.split("-")[-1].strip()).values

# reduce the length of the answers to prevent overlapping labels
vizdata.rename(
    columns={
        "Analyze and understand data to influence product or business decisions": 
        "Analyze data for<br>business decisions",
        
        "Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data": 
        "Manage business<br>data infrastructure",
        
        "Build prototypes to explore applying machine learning to new areas": 
        "Prototype ML to<br>apply to new areas",
        
        "Build and/or run a machine learning service that operationally improves my product or workflows": 
        "Manage ML for<br>product/business",
        
        "Experimentation and iteration to improve existing ML models": 
        "Improve existing<br>ML models",
        
        "Do research that advances the state of the art of machine learning": 
        "Research to<br>advance state of<br>the art of ML",
        
        "None of these activities are an important part of my role at work": 
        "None of these",
    },
    inplace=True,
)

# omit "Other" and "None of these" answers as they don't provide any insight for this question
vizdata = vizdata.drop(columns=['Other','None of these']).notna()

# add title column
vizdata["Title"] = maindf.loc[vizdata.index, "Q5"]

titles_size = vizdata.groupby(["Title"]).size()
vizdata = vizdata.groupby("Title").sum()
vizdata = vizdata.stack().reset_index()
vizdata["Title Size"] = vizdata["Title"].map(titles_size)
vizdata["Percent"] = round(vizdata[0] / vizdata["Title Size"], 2)

# marker and text colors to highlight the largest percentage of each group
mcolors = [[colorlist[10]]*vizdata['level_1'].nunique() for _ in range(len(select_titles))]
tcolors = [["rgb(136,136,136)"]*vizdata['level_1'].nunique() for _ in range(len(select_titles))]

fig = make_subplots(cols=len(select_titles), rows=1, vertical_spacing=0.01,shared_yaxes=True)

for n, jobtitle in enumerate(select_titles):
    subset_data = vizdata[vizdata['Title']==jobtitle].reset_index(drop=True)
    indexmax = subset_data['Percent'].idxmax()
    mcolors[n][indexmax] = colorlist[5]
    tcolors[n][indexmax] = "black"
    fig.append_trace(trace=go.Bar(
        name=jobtitle,
        y=subset_data["level_1"],
        x=subset_data["Percent"],
        text=subset_data["Percent"], 
        orientation='h',
        hoverinfo="skip",
        textposition='outside',
        texttemplate="%{text:.0%}",
        textfont_color=tcolors[n],
        marker_color=mcolors[n],
        marker_opacity=0.8),
                     col=n+1,row=1)
    fig.add_annotation(dict(
        x=0.5,y=5.8,
        showarrow=False,
        font=tick_font,
        text=jobtitle,
        xref=f"x{n+1}",
        yref=f"y{n+1}"
    ))


fig.update_xaxes(
    range = [0,2.5],
    visible=False
)

fig.update_yaxes(
    tickfont=tick_font,
    ticklen=0,
    showgrid=False,
    title="",
)


fig.update_layout(
    paper_bgcolor='#EDEDED',
    plot_bgcolor='#EDEDED',
    margin=dict(pad=10),
    showlegend=False,
    height=700,
    width=700,
    title=dict(
        text="Regular Work Activities per Job Title",
        y=0.95,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top', 
        font=title_font),  
    hoverlabel=dict(
        bgcolor="white",
        font_size=14,
    ),
)

for i in range(5):
    fig.add_shape(type="line",
        x0=-0.2, y0=0.5+i, x1=1, y1=0.5+i,
        xref = 'paper',
        line=dict(
            color="black",
            width=0.8,
            dash="dot",
        )
    )

fig.show()

Analyzing data is the most common activity across all job titles. On average, **62%** of the respondents reported that they analyze data on a regular basis. 

Nearly **50% of Research Scientists** work towards advancing the state of the art of Machine Learning.

Data Engineers have two main responsibilities: managing the business data infrastructure and analyzing the data. I assume that the type of analysis Data Engineers conduct differs from that of Business/Data Analysts. Analysts are more interested in **the story and key insights** extracted from the data, while Data Engineers analyze the data from a **format/frequency/quality** perspective. 

Back when I started learning about Data Science, one of the main things I was unclear about was the difference in roles and responsibilities of different job titles. A graph like the one above would have been helpful.

In the upcoming graphs, we'll compare these job titles in terms of **coding and machine learning experience**.

In [None]:
vizdata = maindf.loc[1:, ["Q5","Q6"]].dropna()

titlesize = vizdata.groupby('Q5').size()
title_exp_size = vizdata.groupby(by=['Q5','Q6']).size()

vizdata = vizdata.drop_duplicates(inplace=False).copy()

vizdata['TitleSize'] = vizdata.set_index('Q5').index.map(titlesize)
vizdata['TitleExpSize'] = vizdata.set_index(['Q5','Q6']).index.map(title_exp_size)
vizdata['Percent'] = vizdata['TitleExpSize']/vizdata['TitleSize']

sort_exp=[
    "No coding<br>experience",
    '< 1 years',
    '1-2 years',
    '3-5 years',
    '5-10 years',
    '10+ years'
]

mcolors = [[colorlist[10]]*vizdata['Q6'].nunique() for _ in range(len(select_titles))]
tcolors = [["rgb(136,136,136)"]*vizdata['Q6'].nunique() for _ in range(len(select_titles))]

fig = make_subplots(cols=len(select_titles), rows=1, vertical_spacing=0.01,shared_yaxes=True)

for n, jobtitle in enumerate(select_titles):
    subset_data = vizdata[vizdata['Q5']==jobtitle].reset_index(drop=True)
    indexmax = subset_data['Percent'].idxmax()
    mcolors[n][indexmax] = colorlist[5]
    tcolors[n][indexmax] = "black"
    fig.append_trace(trace=go.Bar(
        name=jobtitle,
        y=subset_data["Q6"],
        x=subset_data["Percent"],
        text=subset_data["Percent"], 
        orientation='h',
        hoverinfo="skip",
        textposition='outside',
        texttemplate="%{text:.0%}",
        textfont_color=tcolors[n],
        marker_color=mcolors[n],
        marker_opacity=0.8),
                     col=n+1,row=1)
    fig.add_annotation(dict(
        x=0.5,y=6,
        showarrow=False,
        font=tick_font,
        text=jobtitle,
        xref=f"x{n+1}",
        yref=f"y{n+1}"
    ))


fig.update_xaxes(
    range = [0,1.5],
    visible=False
)

fig.update_yaxes(
    tickfont=tick_font,
    showgrid=False,
    title="",
    ticklen=0,
    categoryorder= "array",
    categoryarray= sort_exp,
)


fig.update_layout(
    paper_bgcolor='#EDEDED',
    plot_bgcolor='#EDEDED',
    margin=dict(pad=10),
    showlegend=False,
    height=700,
    width=700,
    title=dict(
        text="Coding Experience per Job Title", 
        y=0.95,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top', 
        font=title_font),  
    hoverlabel=dict(
        bgcolor="white",
        font_size=14,
    ),
)


for i in range(5):
    fig.add_shape(type="line",
        x0=-0.15, y0=0.5+i, x1=1, y1=0.5+i,
        xref = 'paper',
        line=dict(
            color="black",
            width=0.8,
            dash="dot",
        )
    )

fig.show()

After a **very** rough calculation, the average years of coding/programming experience are:
* Research Scientists: **6.1**
* Data Engineers: **5.9**
* Data Scientists: **5.3**
* Statisticians: **4.7**
* Machine Learning Engineers: **4.6**
* Business/Data Analysts: **3.3**
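
The notebook does not show how these averages were computed. One common way to get such a rough figure from bracketed answers, sketched here as an assumption rather than as the author's actual method, is to replace each bracket with a representative value (for example its midpoint) and take a weighted mean. The `midpoints` mapping, the helper `rough_mean_years`, and the toy shares are all made up for illustration.

```python
# Assumed representative value (in years) for each Q6 bracket; the value for
# the open-ended "10+ years" bracket is an arbitrary guess.
midpoints = {
    "No coding<br>experience": 0.0,
    "< 1 years": 0.5,
    "1-2 years": 1.5,
    "3-5 years": 4.0,
    "5-10 years": 7.5,
    "10+ years": 15.0,
}

def rough_mean_years(shares: dict) -> float:
    """Weighted mean of bracket midpoints, with weights normalized to sum to 1."""
    total = sum(shares.values())
    weighted = sum(midpoints[bracket] * weight for bracket, weight in shares.items())
    return weighted / total

# Toy distribution (hypothetical), not taken from the survey.
toy_shares = {"< 1 years": 0.2, "1-2 years": 0.3, "3-5 years": 0.5}
print(round(rough_mean_years(toy_shares), 2))
```

For the toy distribution this gives 2.55 years. The result depends directly on the value chosen for each bracket, which is one reason such figures are only rough.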

In [None]:
vizdata = maindf.loc[1:, ["Q5","Q15"]].dropna()

titlesize = vizdata.groupby('Q5').size()
title_exp_size = vizdata.groupby(by=['Q5','Q15']).size()

vizdata = vizdata.drop_duplicates(inplace=False).copy()

vizdata['TitleSize'] = vizdata.set_index('Q5').index.map(titlesize)
vizdata['TitleExpSize'] = vizdata.set_index(['Q5','Q15']).index.map(title_exp_size)
vizdata['Percent'] = vizdata['TitleExpSize']/vizdata['TitleSize']

sort_exp=[
    "I don't use ML",
    '< 1 years',
    '1-2 years',
    '2-3 years',
    '3-4 years',
    '4-5 years',
    '5+ years'
]

mcolors = [[colorlist[10]]*vizdata['Q15'].nunique() for _ in range(len(select_titles))]
tcolors = [["rgb(136,136,136)"]*vizdata['Q15'].nunique() for _ in range(len(select_titles))]

fig = make_subplots(cols=len(select_titles), rows=1, vertical_spacing=0.01,shared_yaxes=True)

for n, jobtitle in enumerate(select_titles):
    subset_data = vizdata[vizdata['Q5']==jobtitle].reset_index(drop=True)
    indexmax = subset_data['Percent'].idxmax()
    mcolors[n][indexmax] = colorlist[5]
    tcolors[n][indexmax] = "black"
    fig.append_trace(trace=go.Bar(
        name=jobtitle,
        y=subset_data["Q15"],
        x=subset_data["Percent"],
        text=subset_data["Percent"], 
        orientation='h',
        hoverinfo="skip",
        textposition='outside',
        texttemplate="%{text:.0%}",
        textfont_color=tcolors[n],
        marker_color=mcolors[n],
        marker_opacity=0.8),
                     col=n+1,row=1)
    fig.add_annotation(dict(
        x=0.5,y=7,
        showarrow=False,
        font=tick_font,
        text=jobtitle,
        xref=f"x{n+1}",
        yref=f"y{n+1}"
    ))


fig.update_xaxes(
    range = [0,1.5],
    visible=False
)

fig.update_yaxes(
    tickfont=tick_font,
    showgrid=False,
    title="",
    ticklen=0,
    categoryorder= "array",
    categoryarray= sort_exp,
)


fig.update_layout(
    paper_bgcolor='#EDEDED',
    plot_bgcolor='#EDEDED',
    margin=dict(pad=10),
    showlegend=False,
    height=700,
    width=700,
    title=dict(
        text="Machine Learning Experience per Job Title", 
        y=0.95,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top', 
        font=title_font),  
    hoverlabel=dict(
        bgcolor="white",
        font_size=14,
    ),
)


for i in range(6):
    fig.add_shape(type="line",
        x0=-0.15, y0=0.5+i, x1=1, y1=0.5+i,
        xref = 'paper',
        line=dict(
            color="black",
            width=0.8,
            dash="dot",
        )
    )

fig.show()

After a similar calculation, the average years of Machine Learning experience are:
* Data Scientists: **2.7**
* Research Scientists: **2.5**
* Machine Learning Engineers: **2.3**
* Statisticians: **1.9**
* Data Engineers: **1.4**
* Business/Data Analysts: **1.2**

Both sets of averages (ML and coding experience) are very rough estimates and are prone to error, so take them with a **large grain of salt**. That being said, it makes intuitive sense to me that Scientists are the most experienced group and that Analysts tend to have the least experience. 
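
One source of this uncertainty is the open-ended top bracket. The sketch below, which uses a made-up distribution and assumed bracket values rather than survey figures, recomputes a bracket-based average while varying only the value assigned to the `5+ years` bracket. The helper name `bracket_mean` is hypothetical.

```python
def bracket_mean(shares, values):
    """Weighted mean of the assumed value of each bracket."""
    return sum(shares[b] * values[b] for b in shares) / sum(shares.values())

# Toy distribution over Q15-style brackets (hypothetical).
toy_shares = {"< 1 years": 0.25, "1-2 years": 0.25, "2-3 years": 0.25, "5+ years": 0.25}

# Try several guesses for the open-ended top bracket.
for top_value in (5.0, 7.5, 15.0):
    values = {"< 1 years": 0.5, "1-2 years": 1.5, "2-3 years": 2.5, "5+ years": top_value}
    print(top_value, bracket_mean(toy_shares, values))
```

With these toy numbers the estimate moves from 2.375 to 4.875 years depending solely on the guess for the top bracket, so small differences between job titles should not be over-interpreted.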

One thing that stood out to me is that, on average, Statisticians show slightly more coding experience than Machine Learning Engineers. This may be due to the margin of error of the calculations, or to the Statisticians in the sample happening to be a highly experienced bunch. A third possible reason is that the Machine Learning Engineer position is relatively new, while Statisticians have been around for quite some time.

Let's see the **top 5 Machine Learning algorithms** utilized regularly by these groups of professionals.

In [None]:
colrange = list(range(82, 95))
vizdata = maindf.iloc[1:, colrange].dropna(how="all")

vizdata.columns = maindf.iloc[0, colrange].apply(lambda x: x.split(" - ")[-1].strip()).values

vizdata.rename(
    columns={
        "Convolutional Neural Networks": 
        "CNNs",
        
        "Bayesian Approaches": 
        "Bayesian<br>Methods",
        
        "Gradient Boosting Machines (xgboost, lightgbm, etc)": 
        "GBMs",
        
        "Decision Trees or Random Forests": 
        "Dec Trees/<br>Rand Forests",
        
        "Linear or Logistic Regression": 
        "Lin/Log<br>Regression",
    },
    inplace=True,
)


vizdata = vizdata.notna()

vizdata["Title"] = maindf.loc[vizdata.index, "Q5"]

titles_size = vizdata.groupby(["Title"]).size()
vizdata = vizdata.groupby("Title").sum()
vizdata = vizdata.stack().reset_index()
vizdata["Title Size"] = vizdata["Title"].map(titles_size)
vizdata["Percent"] = round(vizdata[0] / vizdata["Title Size"], 2)

top5 = vizdata[vizdata['Title'].isin(select_titles)].groupby('level_1')['Percent'].mean().nlargest(5).index
vizdata = vizdata[vizdata['level_1'].isin(top5)]

mcolors = [[colorlist[10]]*vizdata['level_1'].nunique() for _ in range(len(select_titles))]
tcolors = [["rgb(136,136,136)"]*vizdata['level_1'].nunique() for _ in range(len(select_titles))]

fig = make_subplots(cols=len(select_titles), rows=1, vertical_spacing=0.01,shared_yaxes=True)

for n, jobtitle in enumerate(select_titles):
    subset_data = vizdata[vizdata['Title']==jobtitle].reset_index(drop=True)
    indexmax = subset_data['Percent'].idxmax()
    mcolors[n][indexmax] = colorlist[5]
    tcolors[n][indexmax] = "black"
    # add_trace is the current API; append_trace is deprecated in recent Plotly releases
    fig.add_trace(
        go.Bar(
            name=jobtitle,
            y=subset_data["level_1"],
            x=subset_data["Percent"],
            text=subset_data["Percent"],
            orientation='h',
            hoverinfo="skip",
            textposition='outside',
            texttemplate="%{text:.0%}",
            textfont_color=tcolors[n],
            marker_color=mcolors[n],
            marker_opacity=0.8,
        ),
        row=1,
        col=n + 1,
    )
    fig.add_annotation(dict(
        x=0.5,y=4.8,
        showarrow=False,
        font=tick_font,
        text=jobtitle,
        xref=f"x{n+1}",
        yref=f"y{n+1}"
    ))


fig.update_xaxes(
    range = [0,2.5],
    visible=False
)

fig.update_yaxes(
    tickfont=tick_font,
    ticklen=0,
    showgrid=False,
    title="",
)


fig.update_layout(
    paper_bgcolor='#EDEDED',
    plot_bgcolor='#EDEDED',
    margin=dict(pad=10),
    showlegend=False,
    height=650,
    width=650,
    title=dict(
        text="5 Most Regularly Utilized ML Algorithms",
        y=0.95,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top',  
        font=title_font),  
    hoverlabel=dict(
        bgcolor="white",
        font_size=14,
    ),
)

for i in range(4):
    fig.add_shape(type="line",
        x0=-0.2, y0=0.5+i, x1=1, y1=0.5+i,
        xref = 'paper',
        line=dict(
            color="black",
            width=0.8,
            dash="dot",
        )
    )

fig.show()

**Statisticians** mostly employ Regression Algorithms, Decision Trees/Random Forests and Bayesian Methods. <br>
More than half of **Research Scientists** and **Machine Learning Engineers** use Convolutional Neural Networks regularly. <br>
**Data Scientists** utilize GBMs more than any other group of professionals.


Overall, Linear and Logistic Regression are the clear winners in terms of utilization rate. On average, **78%** of the above respondents reported that they use these two algorithms on a regular basis.

Gradient Boosting Machines, the winning algorithms in many Kaggle competitions, are used regularly by only **39%** of participants. This is most likely due to their limited explainability; explainability and simplicity are highly important factors in business, and GBMs tend to be much more complex and difficult to explain than regression algorithms.
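The utilization rates quoted above are shares of respondents per job title who selected a given algorithm. As a minimal sketch of that calculation, assuming rows with no selection at all have already been dropped, the toy example below uses made-up data and a hypothetical helper name (`utilization_by_title`); it is not the notebook's exact code.

```python
import pandas as pd

# Toy stand-in for the multi-select algorithm columns: a cell holds the
# option text when selected and is missing otherwise (all values made up).
toy = pd.DataFrame(
    {
        "Title": ["Data Scientist", "Data Scientist", "Statistician", "Statistician"],
        "Regression": ["Regression", "Regression", "Regression", None],
        "GBMs": ["GBMs", None, None, None],
    }
)


def utilization_by_title(df, title_col="Title"):
    """Share of respondents per title who selected each option."""
    selected = df.drop(columns=title_col).notna()
    # The mean of a boolean column is the fraction of True values.
    return selected.groupby(df[title_col]).mean()


rates = utilization_by_title(toy)
print(rates)
```

Taking the mean of the boolean selection flags gives the same result as dividing the per-title sum by the per-title group size, which is the approach used in the cell above.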

# Compensation Analysis

<div id="compensation"></div>

In the survey, users were asked to pick one of the provided ranges representing their annual salary (e.g. 5,000-10,000; 10,000-15,000...).<br>
To simplify the numerical analysis (as with the years-of-experience calculations above), I assumed a uniform distribution within each range and replaced each salary range with the **average of** its **min** and **max values**.<br>
For example, if the salary range is **5,000 - 10,000**, I used **7,500** as the representative salary. The only exception is the open-ended **>500,000** bracket, which I replaced with its lower bound, **500,000**.
<br>
Keeping this in mind, let's look at the 10 countries with the highest median salaries.
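The midpoint conversion described above can be sketched in isolation as a small function. The name `salary_midpoint` and the `open_ended_value` parameter are hypothetical, and the label formats in the example calls are assumptions about how the survey encodes its ranges.

```python
def salary_midpoint(label, open_ended_value=500_000):
    """Convert a range label such as '5,000-10,000' to its midpoint."""
    cleaned = label.replace(",", "").replace("$", "").strip()
    if ">" in cleaned:
        # Open-ended top bracket: no upper bound exists, so use a fixed value.
        return float(open_ended_value)
    low, high = (int(part) for part in cleaned.split("-"))
    return (low + high) / 2


print(salary_midpoint("5,000-10,000"))
print(salary_midpoint("> $500,000"))
```

Because the top bracket is capped at its lower bound, any statistic that is sensitive to the upper tail (a mean, for instance) would be understated; the median used in this notebook is much less affected.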

In [None]:
vizdata = maindf.loc[1:, ["Q3","Q24"]].dropna()
vizdata = vizdata[vizdata['Q3']!='Other']

vizdata["Q24"] = vizdata["Q24"].map(lambda x: x.replace(",", "")).map(lambda x: x.replace("$", ""))
vizdata["Q24"] = vizdata["Q24"].map(
    lambda x: 500000 if ">" in x
    else 0.5 * (int(x.split("-")[0]) + int(x.split("-")[1]))
)

vizdata = vizdata.groupby(by=["Q3"])["Q24"].agg(['median']).reset_index()

vizdata = vizdata.sort_values(by='median').tail(10)

top10mostpaying = vizdata['Q3'].values

vizdata['median'] = vizdata['median']/1000

fig = px.bar(vizdata, 
             x="median", 
             y="Q3", 
             text="median",
             orientation="h",
            )

colors = [colorlist[10],]*10
colors[-2:] = [colorlist[5]]*2

fig.update_traces(
    hoverinfo="skip",
    hovertemplate=None,
    marker_color=colors,
    marker_opacity=0.8,
    textangle=0,
    texttemplate="$ %{text:,.0f}K",
    textposition='inside'
)

fig.update_layout(
    xaxis=dict(
        visible=True,
        title_text="Annual Median Salary in USD",
        showticklabels=False,
        ticklen=0
    ),
    yaxis=dict(
        title_text="", 
        tickfont=tick_font,
        
    ),    
    hoverlabel=dict(
        bgcolor="white",
        font_size=14,
    ),
    margin=dict(pad=5),
    height=600,
    width=800,
    title=dict(
        text='10 Countries with the Highest Median Salaries',        
        y=1,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top', 
        font=title_font,
))


fig.show()

The **US** and **Switzerland** are the highest-paying countries.

**Israel**, a country known to invest heavily in R&D and technology, follows them closely.<br>

The majority of the top 10 countries are either predominantly **English-speaking** or rank very high in English proficiency [[1]](https://en.wikipedia.org/wiki/EF_English_Proficiency_Index).

Having lived in many countries myself, I know that it's crucial to take living expenses into account when comparing salaries. Making 100k USD in San Francisco, U.S. isn't the same as making 100k USD in Düsseldorf, Germany. For this reason, keeping average household expenditure [[2]](https://en.wikipedia.org/wiki/Household_final_consumption_expenditure) in mind may provide some guidance for salary analysis.

In the next graph, we'll see the top 10 countries with the **highest median salary/expense ratio**. The country with the best ratio will be at the very top.

In [None]:
expensedf = pd.read_csv('../input/householdexpenditure/householdexpenditure.csv')
expensedf['Country'] = expensedf['Country'].apply(lambda x: x.strip())
expensedict = dict(zip(expensedf['Country'],expensedf['HouseholdExpenditurePerCapita']))


vizdata = maindf.loc[1:, ["Q3","Q24"]].dropna()
vizdata["Q24"] = vizdata["Q24"].map(lambda x: x.replace(",", "")).map(lambda x: x.replace("$", ""))
vizdata["Q24"] = vizdata["Q24"].map(
    lambda x: 500000 if ">" in x
    else 0.5 * (int(x.split("-")[0]) + int(x.split("-")[1]))
)
vizdata = vizdata[vizdata['Q3']!='Other']
vizdata = vizdata.groupby(by=["Q3"])["Q24"].agg(['median']).reset_index()

vizdata['HouseHoldExp'] = vizdata['Q3'].map(expensedict)
vizdata['Salary/Expenses'] = vizdata['median']/vizdata['HouseHoldExp']

vizdata['median'] = vizdata['median']/1000
vizdata['HouseHoldExp'] = vizdata['HouseHoldExp']/1000

vizdata = vizdata.sort_values(by='Salary/Expenses').tail(10)


fig = make_subplots(rows=1, 
                    cols=2, 
                    specs=[[{}, {}]], 
                    shared_xaxes=False,
                    shared_yaxes=False, 
                    horizontal_spacing=0.001
                   )

fig.add_trace(go.Scatter(
                    x=vizdata['Salary/Expenses'], 
                    y=vizdata['Q3'],
                    mode='lines+markers+text',
                    text=vizdata['Salary/Expenses'],
                    line_color=colorlist[7],
                    name='Salary / Expense Rate'), row=1, col=1)


fig.add_trace(go.Bar(x=vizdata['median'], 
                     y=vizdata['Q3'],
                     text=vizdata['median'], 
                     orientation='h',
                     marker_color=colorlist[5],
                     marker_opacity=0.8,
                     name='Annual Salary'), row=1, col=2)

fig.add_trace(go.Bar(x=vizdata['HouseHoldExp'], 
                     y=vizdata['Q3'],
                     text=vizdata['HouseHoldExp'], 
                     orientation='h',
                     marker_color="#800000",
                     marker_opacity=0.8,
                     name='Household Expenditure per Capita'), row=1, col=2)




fig.update_traces(
    hoverinfo="skip",
    hovertemplate=None,
    texttemplate="%{text:,.2f}",
    textposition="bottom right",
    row=1,col=1
)
fig.update_traces(
    hoverinfo="skip",
    hovertemplate=None,
    textangle=0,
    texttemplate="$%{text:,.0f}K",
    textposition='inside',
    row=1,col=2
)


fig.update_layout(
    legend=dict(
        bgcolor="rgba(0,0,0,0)",
        orientation="h",
        title_text="",
        yanchor="bottom",
        y=1.015,
        xanchor="left",
        x=0,
    ),
    barmode='overlay',
    xaxis1=dict(
        visible=True,
        showgrid=False,
        range=[2,6],
        ticklen=0,
        showticklabels=False,
        title='Salary/Expense Ratio'
    ),
    xaxis2=dict(
        visible=True,
        showgrid=False,
        ticklen=0,
        showticklabels=False,
        title='Annual Median Salary in USD'
    ),
    yaxis1=dict(
        showline=False,
        linewidth=1,
        showticklabels=True,
        ticklen=0
        
    ), 
    yaxis2=dict(
        title_text="", 
        tickfont=tick_font,
        showline=True,
        linecolor='rgba(102, 102, 102, 0.8)',
        showticklabels=False,
        linewidth=1,
        ticklen=0        
    ),

    margin=dict(pad=10,t=120),
    height=700,
    width=800,
    title=dict(
        text='10 Countries with the Highest Salary/Expense Ratio',
        y=0.95,
        x=0,
        xref="paper",
        xanchor= 'left',
        yanchor= 'top', 
        font=title_font
    )
)


fig.show()

Both the set of top 10 countries and the ranking among them differ from the previous chart.<br> Israel offers the best deal, with a median salary of 95K USD against an average household expenditure per capita of 18K USD. The salary/expense ratio in Israel is **5.22**.

Switzerland follows Israel with a salary/expense ratio of **3.97**, while offering a median salary of 112K USD.

The US, the country that pays the most, is 8th on the list when we consider living expenses alongside salaries. The salary/expense ratio in the U.S. is **2.97**.
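The ratio reported above is simply the median salary divided by household expenditure per capita. The sketch below illustrates the calculation with made-up figures and placeholder country names (none of these numbers come from the survey). It also shows one thing worth checking: a country whose name has no match in the expenditure lookup gets a missing value and silently drops out of the ranking.

```python
import pandas as pd

# Made-up figures for illustration only; not survey results.
salaries = pd.DataFrame(
    {
        "Country": ["A", "B", "C"],
        "median": [90_000.0, 60_000.0, 40_000.0],
    }
)
expenditure = {"A": 30_000.0, "B": 15_000.0}  # "C" is deliberately missing

# Series.map with a dict yields NaN for keys that are not in the dict.
salaries["expense"] = salaries["Country"].map(expenditure)
salaries["ratio"] = salaries["median"] / salaries["expense"]

# Collect countries without an expenditure figure instead of losing them silently.
unmatched = salaries.loc[salaries["expense"].isna(), "Country"].tolist()
ranked = salaries.dropna(subset=["ratio"]).sort_values("ratio", ascending=False)

print(ranked[["Country", "ratio"]])
print("No expenditure data:", unmatched)
```

Listing the unmatched names makes it easier to spot spelling differences between the survey's country labels and the expenditure table before interpreting the ranking.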

There are obviously many other questions that could help us analyze survey results further. However, for the sake of keeping this notebook "digestible", I tried to keep this list relatively short and sweet. I hope that this notebook has been helpful to you, and feel free to let me know below if you have any questions/comments.

# References

[1] EF English Proficiency Index <br>
https://en.wikipedia.org/wiki/EF_English_Proficiency_Index

[2] List of countries by household final consumption expenditure per capita <br>
https://en.wikipedia.org/wiki/List_of_countries_by_household_final_consumption_expenditure_per_capita