## Who 'Excels' on Kaggle??

### Contents
1. Introduction
    * Overview
    * Background
    * Main focus of this notebook's EDA

2. Exploratory Data Analysis
    * Overview
    * Demographics
    * Work life
    * Knowledge of programming languages and frequently used tools
    * Learning

3. Summary of findings

4. Conclusion

### 1. Introduction

#### Overview
According to Q38 of the 2020 DS & ML survey, a large number of Kagglers use Excel as their main tool for data analysis. In fact, Excel ranked second with 32% of responses - miles ahead of all other answer choices apart from Local development environments (IDEs), which came in first with 46% of responses. 
This notebook takes a closer look at these "Excelling" Kagglers by exploring and contrasting their profiles with those of IDE users.  

#### Background

I started my Python journey in November 2020, mostly because I wanted some intellectual stimulation while being on parental leave. Not that looking after a tiny human who spends their time looping through a continuous 'sleep, feed, poop' cycle isn't fun... I just felt like I needed another project to keep me sane. That's when I came across a podcast that talked about Python for data analysis and visualisation and I was intrigued.

I work in finance/commercial analysis so I've been working with Excel my entire working life (I would rather not say publicly how many years that is, so let's just say I'm 'relatively experienced'). I also have a passion for data analysis and visualisation. Hence, one of the first questions I felt like digging into in the 2020 DS & ML survey was Q38 "What is the primary tool that you use at work or school to analyze data?"

When I looked at the responses I was quite surprised to see 'Basic statistical software (Microsoft Excel, Google Sheets, etc.)' ranking in second place, and by a long shot! 
To be honest I was pleased to see this because it made me feel like I'm not alone out there in terms of being an Excel user who also codes or at a minimum wants to code. And it gives me the courage to keep going with Python.

Anyhow, I decided to dig a little deeper to understand who those "Excelling" Kagglers are and how they differ from Local dev environment users. For the purposes of this exercise I will call these two groups "Excellers" and "IDElers".

To practice what I've learnt so far I've had a play with different chart types and visualisation libraries along the way.

Please keep scrolling if you'd like to see what I found out...

#### Main focus of this notebook's EDA 
Understand the profile of "Excellers" on Kaggle and how they differ (or not) from "IDElers". To investigate this question I have looked at the following points:

1. How do Excellers and IDElers differ demographically? (age, country of residence, gender, education)
2. How do they differ in terms of their work? (roles, tasks, salary)
3. How do they differ in terms of experience? (experience, programming languages and frequently used tools)
4. How do they differ in terms of learning new skills? (resources used and recommendations)

##### Last but not least
Please keep in mind I have only been coding with Python for about 8 weeks so please be kind. If my notebook looks inconsistent or messy at times it is partly because I am trying different approaches for learning purposes and partly because I have no clue what I'm doing (lol). Any suggestions on how I can do things better/more efficiently are most welcome!


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
from matplotlib.ticker import FuncFormatter, MaxNLocator
import matplotlib.transforms as transforms
import seaborn as sns
import altair as alt
import plotly.express as px
import plotly.graph_objects as go
sns.set(style='white', context='notebook', palette='deep')
%matplotlib inline

results = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False, skiprows=[1])

In [None]:
# importing country codes from GitHub resource
countries = pd.read_csv('../input/countries/all/all.csv')
countries.rename(columns={'name':'Country','alpha-3':'Country_Code','region':'Region','sub-region':'Sub-Region'}, inplace=True)
countries.drop(columns=['country-code','alpha-2','iso_3166-2','intermediate-region','region-code','sub-region-code','intermediate-region-code'], inplace=True)

In [None]:
# renaming some countries to look up country codes from GitHub resource
def cntry(results):
    if results['Q3'] == 'Russia':
        return 'Russian Federation'
    elif results['Q3'] == 'Iran, Islamic Republic of...':
        return "Iran (Islamic Republic of)"
    elif results['Q3'] == 'South Korea':
        # South Korea is listed as "Korea, Republic of" in the ISO country list
        return "Korea, Republic of"
    elif results['Q3'] == 'Taiwan':
        return "Taiwan, Province of China"
    elif results['Q3'] == 'Republic of Korea':
        return "Korea, Republic of"
    else:
        return results['Q3']

results['Country'] = results.apply(cntry, axis=1)

# look up country codes, region and subregion
results = results.merge(countries, on='Country', how='left')

# rename selected questions
results['Age'] = results['Q1']
results['Gender'] = np.where((results['Q2'] == 'Man') | (results['Q2'] == 'Woman'), results['Q2'], 'Non-Binary')
results['Profession'] = results['Q5']
results['Education'] = results['Q4']
results['Background'] = results['Q6']

# rename responses in Q38
Q38_short ={'Basic statistical software (Microsoft Excel, Google Sheets, etc.)':'Excellers',
           'Advanced statistical software (SPSS, SAS, etc.)':'SSPS_SAS',
           'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)':'BI_software',
           'Local development environments (RStudio, JupyterLab, etc.)':'IDElers',
           'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)':'Cloud_based_tools',
           'Other':'Other'}

results['Tools'] = results['Q38'].map(Q38_short)

# rename columns for multi choice questions and change responses to binary
results = results.rename(columns={'Q7_Part_1':'Q7_Python',
                                  'Q7_Part_2':'Q7_R',
                                  'Q7_Part_3':'Q7_SQL',
                                  'Q7_Part_4':'Q7_C',
                                  'Q7_Part_5':'Q7_C++',
                                  'Q7_Part_6':'Q7_Java',
                                  'Q7_Part_7':'Q7_Javascript',
                                  'Q7_Part_8':'Q7_Julia',
                                  'Q7_Part_9':'Q7_Swift',
                                  'Q7_Part_10':'Q7_Bash',
                                  'Q7_Part_11':'Q7_MATLAB',
                                  'Q7_Part_12':'Q7_None',
                                  'Q7_OTHER':'Q7_Other'})

results['Q7_Python'] = np.where((results['Q7_Python'] == 'Python'),1,0)
results['Q7_R'] = np.where((results['Q7_R'] == 'R'),1,0)
results['Q7_SQL'] = np.where((results['Q7_SQL'] == 'SQL'),1,0)
results['Q7_C'] = np.where((results['Q7_C'] == 'C'),1,0)
results['Q7_C++'] = np.where((results['Q7_C++'] == 'C++'),1,0)
results['Q7_Java'] = np.where((results['Q7_Java'] == 'Java'),1,0)
results['Q7_Javascript'] = np.where((results['Q7_Javascript'] == 'Javascript'),1,0)
results['Q7_Julia'] = np.where((results['Q7_Julia'] == 'Julia'),1,0)
results['Q7_Swift'] = np.where((results['Q7_Swift'] == 'Swift'),1,0)
results['Q7_Bash'] = np.where((results['Q7_Bash'] == 'Bash'),1,0)
results['Q7_MATLAB'] = np.where((results['Q7_MATLAB'] == 'MATLAB'),1,0)
results['Q7_None'] = np.where((results['Q7_None'] == 'None'),1,0)
results['Q7_Other'] = np.where((results['Q7_Other'] == 'Other'),1,0)

results = results.rename(columns={'Q9_Part_1':'Q9_JupyterLab',
                                  'Q9_Part_2':'Q9_RStudio',
                                  'Q9_Part_3':'Q9_VisualStudio',
                                  'Q9_Part_4':'Q9_VSCode',
                                  'Q9_Part_5':'Q9_PyCharm',
                                  'Q9_Part_6':'Q9_Spyder',
                                  'Q9_Part_7':'Q9_Notepad++',
                                  'Q9_Part_8':'Q9_Sublime_Text',
                                  'Q9_Part_9':'Q9_Vim_Emacs',
                                  'Q9_Part_10':'Q9_MATLAB',
                                  'Q9_Part_11':'Q9_None',
                                  'Q9_OTHER':'Q9_Other'})

results['Q9_JupyterLab'] = np.where(pd.isnull(results['Q9_JupyterLab']),0,1)
results['Q9_RStudio'] = np.where(pd.isnull(results['Q9_RStudio']),0,1)
results['Q9_VisualStudio'] = np.where(pd.isnull(results['Q9_VisualStudio']),0,1)
results['Q9_VSCode'] = np.where(pd.isnull(results['Q9_VSCode']),0,1)
results['Q9_PyCharm'] = np.where(pd.isnull(results['Q9_PyCharm']),0,1)
results['Q9_Spyder'] = np.where(pd.isnull(results['Q9_Spyder']),0,1)
results['Q9_Notepad++'] = np.where(pd.isnull(results['Q9_Notepad++']),0,1)
results['Q9_Sublime_Text'] = np.where(pd.isnull(results['Q9_Sublime_Text']),0,1)
results['Q9_Vim_Emacs'] = np.where(pd.isnull(results['Q9_Vim_Emacs']),0,1)
results['Q9_MATLAB'] = np.where(pd.isnull(results['Q9_MATLAB']),0,1)
results['Q9_None'] = np.where(pd.isnull(results['Q9_None']),0,1)
results['Q9_Other'] = np.where(pd.isnull(results['Q9_Other']),0,1)

results = results.rename(columns={'Q14_Part_1':'Q14_Matplotlib',
                        'Q14_Part_2':'Q14_Seaborn',
                        'Q14_Part_3':'Q14_Plotly',
                        'Q14_Part_4':'Q14_Ggplot',
                        'Q14_Part_5':'Q14_Shiny',
                        'Q14_Part_6':'Q14_D3',
                        'Q14_Part_7':'Q14_Altair',
                        'Q14_Part_8':'Q14_Bokeh',
                        'Q14_Part_9':'Q14_Geoplotlib',
                        'Q14_Part_10':'Q14_Folium',
                        'Q14_Part_11':'Q14_None',
                        'Q14_OTHER':'Q14_Other'})

results['Q14_Matplotlib'] = results['Q14_Matplotlib'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Seaborn'] = results['Q14_Seaborn'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Plotly'] = results['Q14_Plotly'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Ggplot'] = results['Q14_Ggplot'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Shiny'] = results['Q14_Shiny'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_D3'] = results['Q14_D3'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Altair'] = results['Q14_Altair'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Bokeh'] = results['Q14_Bokeh'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Geoplotlib'] = results['Q14_Geoplotlib'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Folium'] = results['Q14_Folium'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_None'] = results['Q14_None'].apply(lambda x: 1 if not pd.isnull(x) else 0)
results['Q14_Other'] = results['Q14_Other'].apply(lambda x: 1 if not pd.isnull(x) else 0)

results = results.rename(columns={'Q23_Part_1':'Q23_Analyze_and_understand_data',
                                  'Q23_Part_2':'Q23_Build_run_data_infrastructure',
                                  'Q23_Part_3':'Q23_Build_prototypes_to_explore_ML',
                                  'Q23_Part_4':'Q23_Build_run_ML_service',
                                  'Q23_Part_5':'Q23_Experiment_to_improve_existing_ML',
                                  'Q23_Part_6':'Q23_Research_to_advance_ML',
                                  'Q23_Part_7':'Q23_None_of_the_above',
                                  'Q23_OTHER':'Q23_Other'})

results['Q23_Analyze_and_understand_data'] = np.where(pd.isnull(results['Q23_Analyze_and_understand_data']),0,1)
results['Q23_Build_run_data_infrastructure'] = np.where(pd.isnull(results['Q23_Build_run_data_infrastructure']),0,1)
results['Q23_Build_prototypes_to_explore_ML'] = np.where(pd.isnull(results['Q23_Build_prototypes_to_explore_ML']),0,1)
results['Q23_Build_run_ML_service'] = np.where(pd.isnull(results['Q23_Build_run_ML_service']),0,1)
results['Q23_Experiment_to_improve_existing_ML'] = np.where(pd.isnull(results['Q23_Experiment_to_improve_existing_ML']),0,1)
results['Q23_Research_to_advance_ML'] = np.where(pd.isnull(results['Q23_Research_to_advance_ML']),0,1)
results['Q23_None_of_the_above'] = np.where(pd.isnull(results['Q23_None_of_the_above']),0,1)
results['Q23_Other'] = np.where(pd.isnull(results['Q23_Other']),0,1)

results = results.rename(columns={'Q31_A_Part_1':'Q31_Amazon_QuickSight',
                                  'Q31_A_Part_2':'Q31_Microsoft_Power_BI',
                                  'Q31_A_Part_3':'Q31_Google_Data_Studio',
                                  'Q31_A_Part_4':'Q31_Looker',
                                  'Q31_A_Part_5':'Q31_Tableau',
                                  'Q31_A_Part_6':'Q31_Salesforce',
                                  'Q31_A_Part_7':'Q31_Einstein_Analytics',
                                  'Q31_A_Part_8':'Q31_Qlik',
                                  'Q31_A_Part_9':'Q31_Domo',
                                  'Q31_A_Part_10':'Q31_TIBCO_Spotfire',
                                  'Q31_A_Part_11':'Q31_Alteryx',
                                  'Q31_A_Part_12':'Q31_Sisense',
                                  'Q31_A_Part_13':'Q31_SAP_Analytics_Cloud',
                                  'Q31_A_Part_14':'Q31_None',
                                  'Q31_A_OTHER':'Q31_Other'})

results['Q31_Amazon_QuickSight'] = np.where(pd.isnull(results['Q31_Amazon_QuickSight']),0,1)
results['Q31_Microsoft_Power_BI'] = np.where(pd.isnull(results['Q31_Microsoft_Power_BI']),0,1)
results['Q31_Google_Data_Studio'] = np.where(pd.isnull(results['Q31_Google_Data_Studio']),0,1)
results['Q31_Looker'] = np.where(pd.isnull(results['Q31_Looker']),0,1)
results['Q31_Tableau'] = np.where(pd.isnull(results['Q31_Tableau']),0,1)
results['Q31_Salesforce'] = np.where(pd.isnull(results['Q31_Salesforce']),0,1)
results['Q31_Einstein_Analytics'] = np.where(pd.isnull(results['Q31_Einstein_Analytics']),0,1)
results['Q31_Qlik'] = np.where(pd.isnull(results['Q31_Qlik']),0,1)
results['Q31_Domo'] = np.where(pd.isnull(results['Q31_Domo']),0,1)
results['Q31_TIBCO_Spotfire'] = np.where(pd.isnull(results['Q31_TIBCO_Spotfire']),0,1)
results['Q31_Alteryx'] = np.where(pd.isnull(results['Q31_Alteryx']),0,1)
results['Q31_Sisense'] = np.where(pd.isnull(results['Q31_Sisense']),0,1)
results['Q31_SAP_Analytics_Cloud'] = np.where(pd.isnull(results['Q31_SAP_Analytics_Cloud']),0,1)
results['Q31_None'] = np.where(pd.isnull(results['Q31_None']),0,1)
results['Q31_Other'] = np.where(pd.isnull(results['Q31_Other']),0,1)

results = results.rename(columns={'Q37_Part_1':'Q37_Coursera',
                                  'Q37_Part_2':'Q37_edX',
                                  'Q37_Part_3':'Q37_Kaggle_Learn_Courses',
                                  'Q37_Part_4':'Q37_DataCamp',
                                  'Q37_Part_5':'Q37_Fast.ai',
                                  'Q37_Part_6':'Q37_Udacity',
                                  'Q37_Part_7':'Q37_Udemy',
                                  'Q37_Part_8':'Q37_LinkedIn_Learning',
                                  'Q37_Part_9':'Q37_Cloud_cert_programs',
                                  'Q37_Part_10':'Q37_University',
                                  'Q37_Part_11':'Q37_None',
                                  'Q37_OTHER':'Q37_Other'})

results['Q37_Coursera'] = np.where(pd.isnull(results['Q37_Coursera']),0,1)
results['Q37_edX'] = np.where(pd.isnull(results['Q37_edX']),0,1)
results['Q37_Kaggle_Learn_Courses'] = np.where(pd.isnull(results['Q37_Kaggle_Learn_Courses']),0,1)
results['Q37_DataCamp'] = np.where(pd.isnull(results['Q37_DataCamp']),0,1)
results['Q37_Fast.ai'] = np.where(pd.isnull(results['Q37_Fast.ai']),0,1)
results['Q37_Udacity'] = np.where(pd.isnull(results['Q37_Udacity']),0,1)
results['Q37_Udemy'] = np.where(pd.isnull(results['Q37_Udemy']),0,1)
results['Q37_LinkedIn_Learning'] = np.where(pd.isnull(results['Q37_LinkedIn_Learning']),0,1)
results['Q37_Cloud_cert_programs'] = np.where(pd.isnull(results['Q37_Cloud_cert_programs']),0,1)
results['Q37_University'] = np.where(pd.isnull(results['Q37_University']),0,1)
results['Q37_None'] = np.where(pd.isnull(results['Q37_None']),0,1)
results['Q37_Other'] = np.where(pd.isnull(results['Q37_Other']),0,1)

# change salary ranges to numeric based on upper range
results[['Q24','Upper']] = results['Q24'].str.split('-',n=1,expand=True)
results['Upper'] = results['Upper'].str.replace(',', '').astype(float)
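
The conversions above all follow one pattern: a multiple-choice column holds the option text when it was ticked and is empty otherwise. Below is a minimal sketch of the same idea written as a loop. The `to_binary` helper and the toy column names are assumptions made up for illustration; they are not part of the survey file or of this notebook's pipeline.

```python
import pandas as pd

def to_binary(df, prefixes):
    """Return a copy of df in which every column whose name starts with one of
    the given prefixes is 1 where a response was recorded and 0 where it is missing."""
    out = df.copy()
    for col in out.columns:
        if col.startswith(tuple(prefixes)):
            out[col] = out[col].notna().astype(int)
    return out

# toy data standing in for the survey's multiple-choice columns (hypothetical)
toy = pd.DataFrame({
    'Q7_Python': ['Python', None, 'Python'],
    'Q7_R': [None, 'R', None],
    'Q1': ['25-29', '30-34', '18-21'],
})

encoded = to_binary(toy, prefixes=['Q7_'])
print(encoded)
```

Keeping the trailing underscore in the prefix matters: a prefix such as `'Q3'` would also match columns starting with `'Q31_'`.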

In [None]:
# preview dataframe
results.head()

In [None]:
# fill null values for Tools with 'No response'
results['Tools'] = results['Tools'].fillna('No response')

# filter dataframe on Excel and local IDE users only
results_short = results[results['Tools'].isin(['Excellers', 'IDElers'])]

xls_results = results[results['Tools'] == 'Excellers']

ide_results = results[results['Tools'] == 'IDElers']

In [None]:
# define colour schemes
deep_colors=['#ceecb3','#9cdba5','#6fc9a3','#4CA699','#44829b','#3e528f','#4C3C6C','#362b4d','#271a2c']
deep_colors2=['#ceecb3','#6fc9a3','#4CA699','#44829b','#4C3C6C','#271a2c']
deep_2 = ['#9cdba5','#44829b']
deep_2a = ['#4CA699','#4C3C6C']

### 2. Exploratory Data Analysis

In [None]:
# Show breakdown of responses for question 38 "What is the primary tool that you use at work or school to analyze data?"
results.groupby('Tools')['Tools'].count().to_frame().rename(columns={'Tools':'Responses'}).sort_values(by='Responses', ascending=False)

Out of the 20,036 respondents, 6,746 chose not to answer question 38 "What is the primary tool that you use at work or school to analyze data?" After accounting for missing values, Excellers rank second with 4,223 respondents selecting Excel and similar tools as their main tool for data analysis. This is over 5 times higher than BI tools or SPSS/SAS, which rank third and fourth with 798 and 781 responses, respectively.
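
As a quick sanity check, the figures quoted above can be plugged into a few lines of arithmetic. This is only a sketch using the numbers from the paragraph; the variable names are my own.

```python
# counts quoted in the paragraph above
total_respondents = 20036
no_answer = 6746
excellers = 4223
bi_software = 798
spss_sas = 781

# respondents who actually answered Q38
answered = total_respondents - no_answer

# share of Excellers among those who answered
excel_share = excellers / answered

# how many times larger the Excel group is than the next two groups
ratio_vs_bi = excellers / bi_software
ratio_vs_spss = excellers / spss_sas

print(f"{answered:,} respondents answered Q38")
print(f"Excel share: {excel_share:.1%}")
print(f"Excel vs BI: {ratio_vs_bi:.1f}x, Excel vs SPSS/SAS: {ratio_vs_spss:.1f}x")
```

The share works out to roughly 32% of those who answered, and both ratios come out above 5.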

In [None]:
# Breakdown of Q38 responses (data analysis tools) excl. null values
chart = results.loc[(results['Tools'] != 'No response')].groupby('Tools')['Tools'].count()
pie, ax = plt.subplots(figsize=(10,6), subplot_kw=dict(aspect="equal"))
labels = chart.keys()

ax.set_title("Figure 1: Breakdown of most frequently used data analysis tools", size=14, weight="bold")
plt.pie(x=chart, autopct="%.1f%%", labels=labels, colors=deep_colors, pctdistance=0.8, labeldistance=1.1, textprops={'fontsize': 11, 'color':"black"})
pie.savefig("DAtoolsPieChart.png")

In percentage terms, 32% of survey respondents use Excel (or similar tools) as their main data analysis tool (fig 1). I did not even expect to see any mention of Excel in a data science and machine learning survey, let alone ranking second and miles ahead of tools like SPSS, SAS and BI software.
So, let's take a closer look at the profile of these 32% of respondents.

### Demographics

In this section we will be looking at the overall demographics of our Excellers and IDElers, including their gender, age, country of residence and education. 

#### Gender
As a first step I looked at the differences in data analysis tools used by gender. I have grouped gender into "Man", "Woman" and "Non-binary"; the latter also includes the answer choices "Prefer not to say" and "Prefer to self-describe" due to the low number of responses.
Overall, the patterns across the three groups are similar (fig 2), with Local IDEs used most frequently (40%+), followed by Excel (between 28% and 35%). Women appear to have a slightly smaller uptake of Local IDEs, with a higher share of Excel and BI software. The Non-binary group has the lowest uptake for Excel or BI tools and the highest share of 'Other' data analysis tools. 

In [None]:
# data analysis tools used by gender
df = pd.crosstab(results['Tools'].loc[(results['Tools'] !='No response')],results['Gender'])
df.plot(kind='pie',subplots=True,figsize=(16, 6),legend=False,wedgeprops=dict(width=0.5),autopct="%.1f%%",
        pctdistance=0.8, labeldistance=1.1, colors=deep_colors, radius=0.9,textprops={'fontsize': 9, 'color':"black"})
plt.title("Figure 2: Breakdown of most frequently used data analysis tools by gender", size=14, weight="bold", ha='right')
plt.savefig("Gender_DAtools_Pie.png")

#### Age
Next, I looked at the age profile of Excellers vs IDElers. The age pyramid (fig 3) is fairly balanced for both groups, and it's difficult to tell from the chart which group is older or younger. I therefore looked for other ways to represent the data and found that a heatmap based on cumulative data (fig 3a) was the easiest to read. While the difference is still subtle, the heatmap shows more clearly that Excellers are slightly older, with only 76% under the age of 40 compared to 79% of IDElers.

Note: the age pyramid (i.e. tornado chart) is based on this code found in the matplotlib discussion forum: https://discourse.matplotlib.org/t/tornado-chart/17058/3

In [None]:
# Age pyramid Excellers vs IDElers
Age = pd.crosstab(results['Age'],results['Tools']).apply(lambda c: c/c.sum(), axis=0).round(3)*100
Age['Diff'] = Age['Excellers']-Age['IDElers']
Age.reset_index(inplace=True)

Age = Age.loc[(Age['Age'] != 'No response')]
Age[['Age','Excellers','IDElers']]

Ages = Age['Age']
num_Ages = len(Ages)

xls = Age['Excellers']
ide = Age['IDElers']

pos = np.arange(num_Ages)+0.5

fig = plt.figure(facecolor='white', edgecolor='none', figsize=(10,6))
ax_xls = fig.add_axes([0.05,0.1,0.42,0.8])
ax_ide = fig.add_axes([0.53,0.1,0.40,0.8])
        
ax_xls.xaxis.set_ticks_position('top')
ax_ide.xaxis.set_ticks_position('top')

ax_xls.barh(pos, xls, align='center', color='#4CA699')
ax_xls.set_yticks([])
ax_xls.set_xlim(0,25)
ax_xls.invert_xaxis()
ax_xls.grid(False)

ax_ide.barh(pos, ide, align='center', color='#4C3C6C')
ax_ide.set_yticks([])
ax_ide.set_xlim(0,25)
ax_ide.grid(False)

transform = transforms.blended_transform_factory(fig.transFigure, ax_ide.transData)
for i, label in enumerate(Ages):
    ax_ide.text(0.5, i+0.5, label, ha='center', va='center', fontsize=12, transform = transform)

ax_ide.set_title('IDElers', x=0.16, y=1.025, fontsize=13, weight='bold', pad=20)
ax_xls.set_title('Excellers', x=0.725, y=1.025, fontsize=13, weight='bold', pad=20)

plt.suptitle("Figure 3: Age profile of Excellers and IDElers", size=14, weight="bold", ha='right', y=1.1)

plt.show()
fig.savefig("Age_pyramid.png")

In [None]:
Age_Tools = pd.crosstab(results_short['Tools'], results_short['Q1']).apply(lambda r: r/r.sum(), axis=1).cumsum(axis=1)
plt.figure(figsize=(16,2))
sns.heatmap(Age_Tools, cmap='mako_r', annot=True, fmt='.1%', annot_kws={"size":11})
plt.title('Figure 3a: Cumulative age profile of Excellers and IDElers', weight='bold', pad=20, size=14)
plt.ylabel('')
plt.xlabel('')
plt.yticks(np.arange(2)+0.5,('Excellers','IDElers'), rotation=0, fontsize="10", va="center")
plt.show()
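
To make the cumulative heatmap easier to follow, here is a small sketch of the same calculation on made-up data. It uses the `normalize='index'` option of `pd.crosstab` instead of the `apply` call above, which should give the same row shares; the toy responses are invented for illustration.

```python
import pandas as pd

# invented responses: four per group
toy = pd.DataFrame({
    'Tools': ['Excellers', 'Excellers', 'Excellers', 'Excellers',
              'IDElers', 'IDElers', 'IDElers', 'IDElers'],
    'Age': ['18-21', '22-24', '22-24', '25-29',
            '18-21', '18-21', '22-24', '25-29'],
})

# share of each age band within each tool group (every row sums to 1)
row_shares = pd.crosstab(toy['Tools'], toy['Age'], normalize='index')

# running total across the age bands, from youngest to oldest
cumulative = row_shares.cumsum(axis=1)
print(cumulative)
```

The running total only makes sense if the columns are ordered from youngest to oldest. The crosstab sorts the age labels alphabetically, which happens to match the age order for labels like '18-21' or '70+'.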

#### Country of residence
With regard to regional differences, I found that Middle Eastern countries had a higher Excel uptake (on average 26% of respondents, fig 4). However, there were considerable differences amongst countries within regions. The highest percentage of Excellers was found in the Philippines, with 39% of respondents choosing Excel as their most often used data analysis tool (figs 4a, 4b). In second place was Saudi Arabia (36%), followed by Peru (35%), which had by far the highest percentage of Excellers within the Americas. The lowest uptake rates were seen in the Netherlands (9%) and Germany (11%), followed by Switzerland and Poland (14% each).
It's important to keep in mind that some countries had a relatively small number of respondents overall, so these numbers are based on a small data set and are not necessarily representative.
In addition to a grouped bar chart by region and a geoplot, I added a bar chart with a threshold line to easily identify which countries had more or fewer Excellers than average (credit to this post for the threshold chart: https://stackoverflow.com/questions/28129606/how-to-create-a-matplotlib-bar-chart-with-a-threshold-line).
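
One possible way to guard against the small-sample caveat is to drop countries below a minimum number of respondents before ranking them. The sketch below shows the idea on made-up counts. The `min_n` cut-off of 50 and all numbers are assumptions for illustration only; they are not taken from the survey and this filter is not applied in the charts that follow.

```python
import pandas as pd

# invented respondent counts per (anonymous) country
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Total': [400, 35, 120, 60],
    'Excellers': [80, 14, 42, 9],
})
toy['Pct_xls'] = toy['Excellers'] / toy['Total'] * 100

# hypothetical cut-off: ignore countries with fewer than 50 respondents
min_n = 50
reliable = toy[toy['Total'] >= min_n].sort_values('Pct_xls', ascending=False)
print(reliable)
```

In this toy example country B has the highest Excel share but is excluded because it only has 35 respondents.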

In [None]:
ME_list = ['Egypt','Iran (Islamic Republic of)', 'Iraq','Saudi Arabia','Yemen','Syria','Jordan','United Arab Emirates',
           'Israel','Libya','Lebanon','Oman','Palestine (West Bank and Gaza Strip)','Kuwait','Qatar','Bahrain']

results.loc[results['Country'].isin(ME_list), 'Region'] = 'Middle East'
region = pd.crosstab(index=[results['Region'],results['Country']], columns=results['Tools'],
                     values=results['Tools'], aggfunc='count').fillna(0).apply(lambda r: r/r.sum(), axis=1).round(3)*100
chart = region['Excellers']

ax = chart.sort_index(level=0).unstack().transpose().plot(kind='bar', stacked=True, figsize=(15,6), legend=False, color=deep_colors2)

plt.legend(loc='upper left', bbox_to_anchor=(1.01,1.01))
plt.xlabel('')
threshold = region['Excellers'].mean()

chart2 = chart.reset_index()

thresh_AF = chart2['Excellers'].loc[(chart2['Region']=='Africa')].mean()
thresh_AM = chart2['Excellers'].loc[(chart2['Region']=='Americas')].mean()
thresh_AS = chart2['Excellers'].loc[(chart2['Region']=='Asia')].mean()
thresh_EU = chart2['Excellers'].loc[(chart2['Region']=='Europe')].mean()
thresh_ME = chart2['Excellers'].loc[(chart2['Region']=='Middle East')].mean()
thresh_OC = chart2['Excellers'].loc[(chart2['Region']=='Oceania')].mean()

n = -1
nAF = n+chart2['Excellers'].loc[(chart2['Region']=='Africa')].count()
nAM = nAF+chart2['Excellers'].loc[(chart2['Region']=='Americas')].count()
nAS = nAM+chart2['Excellers'].loc[(chart2['Region']=='Asia')].count()
nEU = nAS+chart2['Excellers'].loc[(chart2['Region']=='Europe')].count()
nME = nEU+chart2['Excellers'].loc[(chart2['Region']=='Middle East')].count()
nOC = nME+chart2['Excellers'].loc[(chart2['Region']=='Oceania')].count()

mAF = int(chart2['Excellers'].loc[(chart2['Region']=='Africa')].mean())
mAM = int(chart2['Excellers'].loc[(chart2['Region']=='Americas')].mean())
mAS = int(chart2['Excellers'].loc[(chart2['Region']=='Asia')].mean())
mEU = int(chart2['Excellers'].loc[(chart2['Region']=='Europe')].mean())
mME = int(chart2['Excellers'].loc[(chart2['Region']=='Middle East')].mean())
mOC = int(chart2['Excellers'].loc[(chart2['Region']=='Oceania')].mean())

ax.plot([n, nAF+0.5], [thresh_AF, thresh_AF], color='#ceecb3', linestyle='--', linewidth=1.5)
ax.plot([nAF+0.5, nAM+0.5], [thresh_AM, thresh_AM], color='#6fc9a3', linestyle='--', linewidth=1.5)
ax.plot([nAM+0.5, nAS+0.5], [thresh_AS, thresh_AS], color='#4CA699', linestyle='--', linewidth=1.5)
ax.plot([nAS+0.5, nEU+0.5], [thresh_EU, thresh_EU], color='#44829b', linestyle='--', linewidth=1.5)
ax.plot([nEU+0.5, nME+0.5], [thresh_ME, thresh_ME], color='#4C3C6C', linestyle='--', linewidth=1.5)
ax.plot([nME+0.5, nOC+0.5], [thresh_OC, thresh_OC], color='#271a2c', linestyle='--', linewidth=1.5)

plt.ylim(0,45)
plt.annotate(f"mean:{mAF}%", xy=((n+nAF)/2, mAF), xytext=((n+nAF)/2,mAF+3), ha='center', size=12,arrowprops=dict(facecolor='#ceecb3'))
plt.annotate(f"mean:{mAM}%", xy=((nAF+nAM)/2, mAM), xytext=((nAF+nAM)/2,mAM+3), ha='center', size=12, arrowprops=dict(facecolor='#6fc9a3'))
plt.annotate(f"mean:{mAS}%", xy=((nAM+nAS)/2, mAS), xytext=((nAM+nAS)/2,mAS+3), ha='center', size=12, arrowprops=dict(facecolor='#4CA699'))
plt.annotate(f"mean:{mEU}%", xy=((nAS+nEU)/2, mEU), xytext=((nAS+nEU)/2,mEU+3), ha='center', size=12, arrowprops=dict(facecolor='#44829b'))
plt.annotate(f"mean:{mME}%", xy=((nEU+nME)/2, mME), xytext=((nEU+nME)/2,mME+3), ha='center', size=12, arrowprops=dict(facecolor='#4C3C6C'))
plt.annotate(f"mean:{mOC}%", xy=((nME+nOC)/2+0.5, mOC), xytext=((nME+nOC)/2+0.5,mOC+3), size=12, arrowprops=dict(facecolor='#271a2c'))

plt.title('Figure 4: Breakdown of Excel uptake rates by Region and Country', size=14, weight='bold',pad=15)

plt.grid(False)
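
The per-region means and bar counts that are computed one region at a time above could also be obtained in a single pass with `groupby`. This is a sketch on a toy stand-in for `chart2`: the numbers and country labels are invented, and only the column names mirror the cell above.

```python
import pandas as pd

# toy stand-in for chart2 (one row per country, invented percentages)
toy_chart2 = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Asia', 'Asia', 'Asia', 'Europe'],
    'Country': ['A1', 'A2', 'B1', 'B2', 'B3', 'C1'],
    'Excellers': [20.0, 30.0, 10.0, 20.0, 30.0, 15.0],
})

# mean uptake and number of bars per region, in alphabetical region order
stats = toy_chart2.groupby('Region')['Excellers'].agg(['mean', 'count'])

# x positions where each region's block of bars ends and starts
stats['right_edge'] = stats['count'].cumsum() - 0.5
stats['left_edge'] = stats['right_edge'] - stats['count']
print(stats)
```

Each row of `stats` then holds what one `ax.plot([left, right], [mean, mean], ...)` call needs. The only difference to the cell above is that the first line would start at -0.5 rather than -1.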

In [None]:
# Excel uptake by country
tools_cntry = pd.crosstab(results_short['Country'],results_short['Tools']).reset_index()
cntry = results.groupby('Country')['Tools'].count().to_frame()
Pct_tools_cntry = cntry.merge(tools_cntry, on='Country', how='left')
Pct_tools_cntry.rename(columns={'Tools':'Total'}, inplace=True)
Pct_tools_cntry['Pct_xls'] = (Pct_tools_cntry['Excellers']/Pct_tools_cntry['Total']).round(3)*100
Pct_tools_cntry['Pct_ide'] = (Pct_tools_cntry['IDElers']/Pct_tools_cntry['Total']).round(3)*100
Pct_tools_cntry['Pct_diff'] = Pct_tools_cntry['Pct_xls']-Pct_tools_cntry['Pct_ide']
Pct_tools_cntry = Pct_tools_cntry.merge(countries, on='Country', how='left')

fig = px.choropleth(data_frame=Pct_tools_cntry,
                    locations='Country_Code',
                    color='Pct_xls',
                    color_continuous_scale=deep_colors,
                    range_color=(0, 40),
                    hover_name='Country', 
                    title='Figure 4a: Percentage of Excellers amongst total respondents per country')
fig.show()

In [None]:
cntry_df = results.groupby('Country')['Age'].count().to_frame()
xls = xls_results.groupby('Country')['Age'].count().to_frame()
top_xls_cnty = cntry_df.merge(xls['Age'], on='Country', how='left')
top_xls_cnty.rename(columns={'Age_x':'Total', 'Age_y':'xls'}, inplace=True)
top_xls_cnty['Pct'] = (top_xls_cnty['xls']/top_xls_cnty['Total']).round(3)*100
top_xls_cnty.reset_index(inplace=True)

cntry_list = top_xls_cnty['Country'].unique()

# threshold code based on https://stackoverflow.com/questions/28129606/how-to-create-a-matplotlib-bar-chart-with-a-threshold-line
threshold = top_xls_cnty['Pct'].mean()
values = top_xls_cnty['Pct']
x = range(len(values))

above_threshold = np.maximum(values - threshold, 0)
below_threshold = np.minimum(values, threshold)

fig, ax = plt.subplots(figsize=(16,6))
p1 = ax.bar(x, below_threshold, 0.5, color='#6fc9a3')
p2 = ax.bar(x, above_threshold, 0.5, color="firebrick", bottom=below_threshold)

labels = list(cntry_list)

def format_fn(tick_val, tick_pos):
    if int(tick_val) in x:
        return labels[int(tick_val)]
    else:
        return ''
ax.xaxis.set_major_formatter(FuncFormatter(format_fn))
ax.xaxis.set_major_locator(MaxNLocator(55))

N = 55
ind = np.arange(N)
plt.xticks(ind, rotation=90)

plt.ylabel("Percentage of Excel users")
plt.ylim(0,45)
plt.grid(False)
plt.xlim(-1,55)
ax.plot([-1, 55], [threshold, threshold], "k--", linewidth=1)

def add_value_labels(ax, spacing=5):

    for rect1,rect2 in zip(p1,p2):
        h1 = rect1.get_height()
        h2 = rect2.get_height()
        plt.text(rect1.get_x()+rect1.get_width()/2., h1+h2, "%d%%" % (h1+h2), ha ='center', va='bottom',
                fontsize=8)
        
add_value_labels(ax)

plt.title('Figure 4b: Breakdown of Excel uptake rates by country of residence', size=14, weight='bold',pad=15)

fig.savefig("threshold-plot.png")
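
The threshold chart works by splitting every bar into the part up to the threshold and the part above it, then stacking the two. A tiny sketch with made-up values shows what the `np.minimum` / `np.maximum` pair does (the three values are invented for illustration):

```python
import numpy as np

values = np.array([10.0, 25.0, 40.0])
threshold = values.mean()  # 25.0 for these values

# part of each bar up to the threshold line
below = np.minimum(values, threshold)

# part of each bar that sticks out above the line (0 if the bar is shorter)
above = np.maximum(values - threshold, 0)

print(below)
print(above)
```

Stacking `above` on top of `below` gives back the original bar heights, so only the colour changes at the threshold line.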

#### Education
With regard to education, Excellers are equally likely to hold a Bachelor's or a Master's degree (39% each, fig 5), whereas IDElers have a higher share of Master's degrees (44%) than Bachelor's degrees (31%). Doctoral degrees are also more common amongst IDElers (16%, compared to 9% of Excellers).

In [None]:
software_edu = pd.crosstab(results['Tools'],results['Education']).apply(lambda r: (r/r.sum())*100, axis=1).round(1).reset_index()
software_edu = software_edu[software_edu.Tools.str.contains("Excellers|IDElers")]

edu = ['Tools','I prefer not to answer','No formal education past high school', "Some college/university study without earning a bachelor’s degree", 
       'Professional degree', "Bachelor’s degree", "Master’s degree",'Doctoral degree']

software_edu = software_edu.reindex(edu, axis="columns")
software_edu.set_index('Tools', inplace=True)

sns.set_style('whitegrid')
# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
ax = software_edu.plot(kind='bar', stacked=False, figsize=(16,6), color=deep_colors, legend=False)
legend = plt.legend(frameon=False, loc='upper left')
plt.xlabel('')
plt.ylabel("Percent of Respondents", size=12)
plt.ylim(0,55)
ax.grid(False)

for p in ax.patches[1:]:
    h = p.get_height()
    x = p.get_x()+p.get_width()/2.
    if h != 0:
        ax.annotate("%g" % p.get_height(), xy=(x,h), xytext=(0,4), rotation=90, size=11, weight='bold',
                   textcoords="offset points", ha="center", va="bottom")
plt.title('Figure 5: Education levels of Excellers and IDElers', size=14, weight='bold',pad=15)
plt.show()

### Work life
In this section we'll be looking at the differences in roles (i.e. profession), salary and work tasks between Excellers and IDElers.

#### Profession
Here, we are trying to understand which professions are most likely to use Excel or IDEs for data analysis. Not surprisingly, 52% of data scientists use local IDEs and only 11% use Excel (fig 6). As expected, statisticians, data engineers, ML engineers and research scientists also show a strong tendency to use local IDEs. Excel, on the other hand, is more likely to be used for data analysis by Product/Project Managers, Business Analysts and 'Other' professions. I personally would be keen to see a further breakdown of these 'Other' professions. I wonder how many of them work in Finance... perhaps that's something to consider for future surveys :-)

In [None]:
prof_tools = pd.crosstab(results['Profession'],results['Tools'], margins=True, margins_name='Total').iloc[:-1].reset_index()
prof_tools['Pct_xls'] = (prof_tools['Excellers']/prof_tools['Total'])*100
prof_tools['Pct_ide'] = (prof_tools['IDElers']/prof_tools['Total'])*100

ax = prof_tools[['Profession','Pct_xls','Pct_ide']].loc[(prof_tools['Profession'] != 'No response')].set_index('Profession').sort_values(by='Pct_xls', ascending=False).plot(kind='bar', figsize=(16,6), color=deep_2a, legend=False)
plt.xlabel("Profession")
plt.ylabel("Percentage of users")
plt.grid(False)
plt.ylim(0,60)
plt.legend(loc='upper left')

def add_value_labels(ax, spacing=5):

    for rect in ax.patches:
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2
 
        space = spacing
        va = 'bottom'

        if y_value < 0:
            space *= -1
            va = 'top'

        label = "{:.1f}".format(y_value)

        ax.annotate(
            label,(x_value, y_value),xytext=(0, space),textcoords="offset points",ha='center',rotation=90,va=va)

add_value_labels(ax)

plt.title('Figure 6: Breakdown of professions by Excellers and IDElers', size=14, weight='bold',pad=15)
plt.savefig("Roles-plot.png")
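
The uptake rates in Figure 6 are row percentages: each profession's count for a tool divided by that profession's total across all tools. As a minimal sketch of the same calculation, `pd.crosstab` can do the division directly via `normalize="index"`. The small data frame below is made up for illustration and is not survey data.

```python
import pandas as pd

# Hypothetical responses (illustrative only, not survey data)
df = pd.DataFrame({
    "Profession": ["Data Scientist"] * 4 + ["Business Analyst"] * 4,
    "Tools": ["IDElers", "IDElers", "IDElers", "Excellers",
              "Excellers", "Excellers", "IDElers", "Other"],
})

# normalize="index" divides every row by its own total,
# so each profession's shares add up to 100 after scaling
uptake = pd.crosstab(df["Profession"], df["Tools"], normalize="index") * 100
print(uptake.round(1))
```

Because the denominator includes every tool category, the Excel and IDE shares of a profession do not have to add up to 100%.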

#### Tasks
Question 23 of the survey asked respondents to "Select any activities that make up an important part of your role at work: (Select all that apply)". As expected, data analysis is a key work activity for both Excellers and IDElers, and 'Analyze and understand data' was the most selected choice for both groups (fig 7). That said, there was a 6%pt difference between the two groups, with IDElers spending relatively more time on other tasks, namely 'Build prototypes to explore applying machine learning to new areas', which ranked second for this group at 19%, and 'Experimentation and iteration to improve existing ML models', which came in third at 15%. Excellers, on the other hand, scored higher on 'None of the above' with 14%, indicating that a larger component of their work has nothing to do with data science and machine learning related tasks.

In [None]:
Q23_list = ['Q23_Analyze_and_understand_data','Q23_Build_run_data_infrastructure','Q23_Build_prototypes_to_explore_ML',
           'Q23_Build_run_ML_service','Q23_Experiment_to_improve_existing_ML','Q23_Research_to_advance_ML',
           'Q23_None_of_the_above','Q23_Other']

task = results_short.groupby("Tools")[Q23_list].sum().apply(lambda r: r/r.sum(), axis=1).round(3)*100

ax = task.plot(kind='bar',stacked=False, color=deep_colors, figsize=(10,6))
plt.grid(False)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
plt.ylabel('Percent of users')
plt.xlabel('')
plt.ylim(0,35)

def add_value_labels(ax, spacing=5):

    for rect in ax.patches:
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2
 
        space = spacing
        va = 'bottom'

        if y_value < 0:
            space *= -1
            va = 'top'

        label = "{:.1f}".format(y_value)

        ax.annotate(
            label,(x_value, y_value),xytext=(0, space),textcoords="offset points",ha='center',rotation=90,va=va)                      

add_value_labels(ax)

plt.title('Figure 7: Breakdown of main works tasks for Excellers and IDElers', size=14, weight='bold',pad=15)
plt.savefig("Tasks-plot.png")
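
A note on reading multi-select percentages such as those in Figure 7: dividing each group's counts by the row sum gives each option's share of all selections made by that group, which is not the same as the share of respondents who ticked the option. The sketch below contrasts the two. It assumes the question columns are 0/1 indicators (which the `.sum()` call suggests, but this section does not show), and the data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical 0/1 indicator columns for a multi-select question (not survey data)
df = pd.DataFrame({
    "Tools":   ["Excellers", "Excellers", "IDElers", "IDElers"],
    "Analyze": [1, 1, 1, 1],
    "Build":   [0, 1, 1, 1],
    "None":    [1, 0, 0, 0],
})

counts = df.groupby("Tools")[["Analyze", "Build", "None"]].sum()

# Share of all selections made by the group: every row sums to 100
share_of_selections = counts.div(counts.sum(axis=1), axis=0) * 100

# Share of respondents in the group who ticked each option: rows may sum to more than 100
share_of_respondents = counts.div(df.groupby("Tools").size(), axis=0) * 100

print(share_of_selections.round(1))
print(share_of_respondents.round(1))
```

In this toy example every respondent selected 'Analyze', yet its share of selections is only 50%, because respondents ticked more than one option.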

#### Salaries

In terms of salaries it appears that IDElers are more likely to have a higher annual income than Excellers with 80% of the latter earning up to 60,000 USD p.a. while 80% of IDElers earn up to 80,000 USD p.a. (fig 8).

In [None]:
sal = pd.crosstab(results['Upper'],results['Tools']).apply(lambda c: c/c.sum(), axis=0).cumsum(axis=0).round(3)*100
# 'Upper' stays as the index so that it is used as the x-axis of the line plot below
sal['Diff'] = sal['Excellers']-sal['IDElers']

ax = sal[['Excellers','IDElers']].plot(kind='line',figsize=(16,8), color=deep_2a, legend=False)
plt.xlabel("Upper Salary Range")
plt.title("Figure 8: Cumulative percentage point difference in salaries between Excellers and IDElers", size=14, weight='bold',pad=15)
plt.ylabel("Percentage Points")
ax.grid(False)
plt.legend()
ax.axhline(y=80, color='darkgray', linewidth=1, linestyle='--')
ax.set_xlim(0,500000)
plt.annotate("$59,999 US", xy=(60000, 80), xytext=(60000,90), size=12, arrowprops=dict(facecolor='#4CA699', lw=1))
plt.annotate("$79,999 US", xy=(80000, 80), xytext=(80000,70), size=12, arrowprops=dict(facecolor='#4C3C6C', lw=1))   
plt.show()
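
The 80% readings in Figure 8 come from a cumulative distribution: the bracket shares are summed in ascending order of salary, and the first bracket at which the running total reaches 80% is reported. A minimal sketch of that arithmetic in plain Python follows. The bracket counts are made up for illustration and are not survey data, and `bracket_at_share` is a hypothetical helper name.

```python
from itertools import accumulate

def bracket_at_share(upper_bounds, counts, share=0.8):
    """Return the first upper bound whose cumulative share reaches `share`."""
    total = sum(counts)
    for bound, running in zip(upper_bounds, accumulate(counts)):
        if running / total >= share:
            return bound
    return upper_bounds[-1]

# Hypothetical respondent counts per salary bracket (illustrative only)
upper_bounds = [20_000, 40_000, 60_000, 80_000, 100_000]
group_a = [50, 20, 15, 10, 5]   # running totals: 50, 70, 85, 95, 100
group_b = [30, 20, 15, 20, 15]  # running totals: 30, 50, 65, 85, 100

print(bracket_at_share(upper_bounds, group_a))  # 60000
print(bracket_at_share(upper_bounds, group_b))  # 80000
```

A group whose curve crosses the 80% line at a lower bracket has a larger share of respondents in the lower salary ranges.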

### Experience and knowledge of tools and programming languages

In this section we will look at overall coding experience, as well as knowledge of different programming languages and frequently used tools, including IDEs, visualisation libraries, and BI tools.

#### Coding experience overall

Not surprisingly, Excellers reported less coding experience than IDElers, with 54% having less than 2 years of experience compared to 34% of IDElers (fig 9). In fact, 31% of Excellers had less than 1 year of coding experience and 10% indicated they had never written any code. 
The most frequent response for IDElers was 3-5 years of experience (28% of respondents), compared to 1-2 years for Excellers (23%). 

In [None]:
software_exp = pd.crosstab(results_short['Tools'],results_short['Background'].dropna()).apply(lambda r: r/r.sum(), axis=1).round(3)*100
exp = ['I have never written code', '< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']

software_exp = software_exp.reindex(exp, axis="columns")

ax = software_exp.transpose().plot(kind='bar', color=deep_2, legend=False, figsize=(16,6), grid=False)
# Vertical bars: the experience categories are on the x-axis, the percentages on the y-axis
plt.xlabel('Coding experience')
plt.ylabel("Percent of respondents")
plt.ylim(0,35)
plt.legend(loc='upper right')

def add_value_labels(ax, spacing=5):

    for rect in ax.patches:
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2
 
        space = spacing
        va = 'bottom'

        if y_value < 0:
            space *= -1
            va = 'top'

        label = "{:.1f}".format(y_value)

        ax.annotate(
            label,(x_value, y_value),xytext=(0, space),textcoords="offset points",ha='center',rotation=90,va=va)                      

add_value_labels(ax)
plt.title('Figure 9: Coding experience of Excellers vs IDElers', size=14, weight='bold',pad=15)
plt.savefig("Experience_plot.png")

#### Frequently used Programming languages and data analysis & visualisation tools

##### Programming languages
Python was the most frequently used programming language for both groups (33% of respondents each), followed by SQL (15% and 16%, respectively, fig 10). R, on the other hand, was more prevalent amongst IDElers, with 12% of respondents reporting regular use compared to 5% of Excellers. Interestingly, C and C++ were slightly more prevalent amongst Excellers, as were Java and Javascript. 

##### IDEs
In terms of IDEs, JupyterLab ranked highest for both groups (25% and 29%, respectively, fig 10), followed by VSCode. In line with the use of programming languages, RStudio was more frequently used by IDElers, while VisualStudio and Notepad++ were more commonly used by Excellers.

##### Visualisation libraries
Matplotlib was by far the most frequently used visualisation library for both groups (41% and 33%, respectively, fig 10), followed by Seaborn. While Plotly ranked third for Excellers, ggplot ranked slightly higher for IDElers, which is again not surprising considering their usage of R. 

##### BI tools
Microsoft Power BI and Tableau ranked highest for both groups; however, Power BI came in first for Excellers (20%, fig 10), while Tableau ranked highest for IDElers (21%). Google Data Studio ranked third for both groups with 6%. All other BI tools appear to be hardly used. It is worth noting that 39% of respondents in both groups chose 'None' as their answer to question 31A, which I found quite surprising. 

In [None]:
# create lists for multi choice questions
Q7_list = ['Q7_Python','Q7_R','Q7_SQL','Q7_C','Q7_C++','Q7_Java','Q7_Javascript','Q7_Julia','Q7_Swift','Q7_Bash','Q7_MATLAB',
          'Q7_None','Q7_Other']
Q9_list = ['Q9_JupyterLab','Q9_RStudio','Q9_VisualStudio','Q9_VSCode','Q9_PyCharm','Q9_Spyder','Q9_Notepad++','Q9_Sublime_Text',
          'Q9_Vim_Emacs','Q9_MATLAB','Q9_None','Q9_Other']
Q14_list = ['Q14_Matplotlib','Q14_Seaborn','Q14_Plotly','Q14_Ggplot','Q14_Shiny','Q14_D3','Q14_Altair','Q14_Bokeh',
            'Q14_Geoplotlib','Q14_Folium','Q14_Other']
Q31_list = ['Q31_Amazon_QuickSight','Q31_Microsoft_Power_BI','Q31_Google_Data_Studio','Q31_Looker','Q31_Tableau',
            'Q31_Salesforce','Q31_Einstein_Analytics','Q31_Qlik','Q31_Domo','Q31_TIBCO_Spotfire','Q31_Alteryx','Q31_Sisense',
            'Q31_SAP_Analytics_Cloud','Q31_None','Q31_Other']

In [None]:
Plang = results_short.groupby('Tools')[Q7_list].sum().apply(lambda r: r/r.sum(), axis=1).round(3)
IDE = results_short.groupby('Tools')[Q9_list].sum().apply(lambda r: r/r.sum(), axis=1).round(3)
vis_tool = results_short.groupby('Tools')[Q14_list].sum().apply(lambda r: r/r.sum(), axis=1).round(3)
BI_tool = results_short.groupby('Tools')[Q31_list].sum().apply(lambda r: r/r.sum(), axis=1).round(3)

fig, (axis1,axis2,axis3,axis4) = plt.subplots(1,4)
fig.suptitle('Figure 10: Programming and Tool usage of Excellers and IDElers', fontsize=14, weight='bold', y=1.8, x=0.8)

ax1 = sns.heatmap(Plang.transpose(), cmap='mako_r', annot=True, fmt='.1%', cbar=False, ax=axis1)
ax2 = sns.heatmap(IDE.transpose(), cmap='mako_r', annot=True, fmt='.1%', cbar=False, ax=axis2)
ax3 = sns.heatmap(vis_tool.transpose(), cmap='mako_r', annot=True, fmt='.1%', cbar=False, ax=axis3)
ax4 = sns.heatmap(BI_tool.transpose(), cmap='mako_r', annot=True, fmt='.1%', cbar=False, ax=axis4)

ax1.set_title('Programming languages')
ax2.set_title('IDE used')
ax3.set_title('Visualisation libraries used')
ax4.set_title('BI tools')

axis1.set_position([0.2,1.0, 0.5, 0.6])
axis2.set_position([1.2,1.0, 0.5, 0.6])
axis3.set_position([0.2,0.0, 0.5, 0.6])
axis4.set_position([1.2,0.0, 0.5, 0.6])

#### Learning
Finally, we look at the learning resources used by Excellers and IDElers, as well as the programming language they recommend aspiring data scientists learn first.

##### Learning resources used
Coursera is the most popular learning resource among both groups (21% and 22%, respectively, fig 11) followed by Kaggle and Udemy. A slightly higher share of IDElers made use of university courses compared to Excellers.

##### Programming language recommended
Python was by far the most recommended programming language to learn, with over 81% of respondents from both groups selecting it as the top choice (fig 12). R came in second, although for Excellers it ranked only 0.1%pt ahead of SQL.

In [None]:
Q37_list = ['Q37_Coursera','Q37_edX','Q37_Kaggle_Learn_Courses','Q37_DataCamp','Q37_Fast.ai','Q37_Udacity','Q37_Udemy',
            'Q37_LinkedIn_Learning','Q37_Cloud_cert_programs','Q37_University','Q37_None','Q37_Other']

learn = results_short.groupby('Tools')[Q37_list].sum().apply(lambda r: r/r.sum(), axis=1).round(3)
plt.figure(figsize=(16,2))
sns.heatmap(learn, cmap='mako_r', annot=True, fmt='.1%',annot_kws={"size":11})
plt.title('Figure 11: Learning resources used by Excellers and IDElers', weight='bold', pad=15, size=14)
plt.ylabel('')
plt.xlabel('')
plt.yticks(np.arange(2)+0.5,('Excellers','IDElers'), rotation=0, fontsize="10", va="center")
plt.show()

In [None]:
recomm = pd.crosstab(results_short['Tools'],results_short['Q8']).apply(lambda r: r/r.sum(), axis=1)
plt.figure(figsize=(16,2))
sns.heatmap(recomm, cmap='mako_r', annot=True, fmt='.1%',annot_kws={"size":11})
plt.title('Figure 12: Number 1 Programming Language recommended by Excellers and IDElers', weight='bold', pad=15, size=14)
plt.ylabel('')
plt.xlabel('')
plt.yticks(np.arange(2)+0.5,('Excellers','IDElers'), rotation=0, fontsize="10", va="center")
plt.show()

### 3. Summary of findings

Based on the 2020 DS & ML survey, the main findings on 'Excelling' Kagglers are:
* Excellers are marginally older than those using local IDEs as their main tool for data analysis;
* the two groups are pretty similar in terms of gender, with women having a marginally stronger tendency to use Excel and BI tools to analyse data;
* in terms of country of residence, Excellers are slightly more likely to reside in Middle Eastern countries, although there was quite a bit of variance between individual countries within regions; the Philippines (39%), Saudi Arabia (35%) and Peru (34%) had the highest proportion of Excel users, while the Netherlands (9%), Germany (11%), Switzerland and Poland (14% each) had the lowest;
* while equally likely to have a Bachelor's or Master's degree (39% each), Excellers had an overall smaller share of Master's and Doctoral degrees compared to IDElers;
* in terms of profession, Excellers were more likely to be Project/Product Managers, Business Analysts, DBA/Database engineers or have 'Other' roles; not surprisingly, IDElers were more likely to be Data Scientists, Statisticians, Machine Learning Engineers, Research Scientists and Data Engineers;
* as expected, data analysis was a key task for both groups; however, IDElers had a higher share of machine learning related tasks, while Excellers had a higher share of tasks that are unrelated to DS/ML, with 14% selecting 'None of the above';
* in terms of salary, IDElers were more likely to have a higher annual income than Excellers, with 80% of the latter earning up to 60,000 USD p.a. while 80% of IDElers earn up to 80,000 USD p.a.;
* overall, Excellers were less experienced coders, with 54% having less than 2 years of coding experience, 31% having less than 1 year of experience and 10% having never written code;
* Python was the most frequently used programming language for both groups, followed by SQL. R, on the other hand, was more prevalent amongst IDElers, while C and C++ were slightly more prevalent amongst Excellers, as were Java and Javascript;
* in terms of IDEs, JupyterLab ranked highest for both groups, followed by VSCode. RStudio was used more frequently by IDElers, while VisualStudio and Notepad++ were more commonly used by Excellers;
* Matplotlib was by far the most frequently used visualisation library for both groups, followed by Seaborn. While Plotly ranked third for Excellers, ggplot ranked slightly higher for IDElers, which is again not surprising considering their usage of R;
* Microsoft Power BI and Tableau ranked highest for both groups; however, Power BI came in first for Excellers, while Tableau ranked highest for IDElers;
* Coursera was the most popular learning resource for both groups, followed by Kaggle and Udemy. A slightly higher share of IDElers made use of university courses compared to Excellers;
* Python was by far the most recommended programming language to learn, with over 81% of respondents from both groups selecting it as the top choice. R and SQL came in second and third for both groups; however, there was only a marginal difference of 0.1%pt between R and SQL for Excellers, compared to a 4.4%pt difference for IDElers.




### 4. Conclusion
The 2020 Kaggle DS & ML survey shows that Excel remains a key tool for data analysis. At the same time, it appears that many Excel users are only just starting out on their programming journey so time will tell whether Excel will remain their tool of choice once they get more proficient in coding.
I will certainly try to use less Excel and more IDEs going forward! I just need to think about how I can get my team on board...
In any case, it seems to me as though the lines between roles, tasks and tools are getting blurrier with tools and programming languages being used by a growing variety of audiences. For now, I would expect Excellers on Kaggle to stick around and maybe even increase in numbers as more individuals and companies/industries turn to DS/ML or simply automation of processes. Perhaps future surveys could add the type of industry respondents work in. This would certainly make it easier for me to convince my team that Python has a place in finance. And as an added bonus it would allow those on the hunt for a job to see where the opportunities are :-)