In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import umap as umap

# hide setting with copy warning to keep things clean
import warnings
warnings.filterwarnings('ignore')

pd.set_option("display.max_columns", 102)


sns.set_style('whitegrid')

# <img src="https://upload.wikimedia.org/wikipedia/commons/b/b7/Flag_of_Europe.svg" width="200" height="200"> <img src="https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg" width="200" height="200">

# EU vs UK 

As a British Data Scientist working in academia I am acutely aware of the potential impact Brexit is having (and expected to have) on the data science and academic landscape. For example, there are already indications of [reduced grant money for UK scientists](https://royalsociety.org/-/media/news/2019/brexit-uk-science-impact.pdf) and signs that Brexit will cause long term damage to the [wider British economy](https://www.piie.com/publications/working-papers/brexit-everyone-loses-britain-loses-most). From a non-economic perspective, UK universities [face the threat](https://academic.oup.com/oxrep/article/33/suppl_1/S155/3066079) of a restriction in hiring academic staff members from EU states and the UK as a whole is expected to see [reduced migration from EU member states](https://journals.sagepub.com/doi/abs/10.1177/002795011623800111?journalCode=nera).

Motivated by the imminent British exit from the European Union, I was interested to see how members of the UK data science and machine learning community compare to those in the wider European community. The most recent Kaggle data science and machine learning survey (now in its third iteration) gives us an opportunity to compare these communities, investigating their differing demographics, compensation, and education levels.

In this notebook I will use the previous three surveys to compare the responses of persons located in the UK to those located in other EU member states. While I focus on the results of the 2019 survey, I do use results from the previous two years where possible. This temporal dependence gives us a chance to investigate the changing nature of the post-referendum UK data science and machine learning community. The results should be taken with a large pinch of salt (please read the limitations below). Because this survey was not designed to investigate EU/UK differences it is unlikely to hold much value; however, I hope it motivates some readers to think about the impact of Brexit (and other global events) on the wider data science community. Within Kaggle (writing this before I have done any significant analysis!), I hope this work supports the inclusive and cross-border culture it maintains.


**Note when reading this notebook**

>- All figures use a consistent colour scheme: <span style="color:red">UK results are represented by red</span> and <span style="color:blue">EU results are represented by blue</span>.
- When I refer to the 'EU' I actually mean the European Union excluding the UK (that is 27 member states). 
- Each section (index on left pane of notebook viewer) starts with a textual summary of the findings. There is then a 'Key points' section with bullet points summarizing the key findings of the analysis presented in the given section. Following this, there are the individual figures associated with each section, each with their own caption. Captions are above the figure and you won't need to read them unless you can't understand the figure.

Enjoy!




In [None]:
UK_COLOUR = '#C8102E'
EU_COLOUR = '#1E448A'
color_dict = {'EU': EU_COLOUR, 'UK':UK_COLOUR}
eu_countries = ['Austria','Belgium','Bulgaria','Croatia','Cyprus','Czech Republic','Denmark','Estonia','Finland','France',
               'Germany','Greece','Hungary','Ireland','Italy','Latvia','Lithuania','Luxembourg','Malta','Netherlands',
               'Poland','Portugal','Romania','Slovakia','Slovenia','Spain','Sweden']

survey_years = np.r_[2019, 2018, 2017]

In [None]:
# some helper functions used throughout the notebook
def map_ages_to_intervals(x):
    """
    Bins a continuous range of ages into discrete intervals.

    Parameters:
    -----------
    x : pd.Series,
        Series of float values of ages
        
    Returns:
    --------
    x_binned, pd.Series,
        The binned ages.
    """

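    # Integer bin edges: each bracket is (lower - 1, upper], so whole-number ages
    # on a bracket boundary (e.g. 22 or 25) are kept rather than lost in the gaps.
    bin_edges = [17, 21, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 100]
    labels = ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44',
              '45-49', '50-54', '55-59', '60-64', '65-69', '70-100']

    # ages outside 18-100 (and missing ages) come back as NaN
    return pd.cut(x, bins=bin_edges, labels=labels).astype(object)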

def turn_off_lines(ax):
    """
    Turns off lines on a matplotlib axes object.
    """
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.yaxis.set_ticks_position('left')
    ax.xaxis.set_ticks_position('bottom')
    ax.grid(False)

def load_and_process(eu_countries, drop_non_eu=True):
    """
    Loads the Kaggle ML & DS survey results from 2017, 2018 and 2019.
    It casts the results to DataFrames with a structure consistent 
    across the years. It only extracts a subset of the total questions
    and, by default, only keeps countries in the EU (including the UK).
    
    Currently, in the interest of time, it is not robust to the addition 
    of future years (i.e., 2020 onwards).
    
    Parameters:
    -----------
    eu_countries: iterable, 
        List-like object containing the names of each country (excluding the UK) in the EU
    drop_non_eu: bool, default True
        If True, drop respondents located outside the EU and the UK.
    
    Yields:
    --------
    processed_df: pd.DataFrame
        The processed survey results for a single year, yielded in the order
        2019, 2018, 2017.
    """
    
    # load the data
    prefix = '/kaggle/input/kaggle-survey'
    df_2019 = pd.read_csv(prefix + '-2019/multiple_choice_responses.csv')
    df_2018 = pd.read_csv(prefix + '-2018/multipleChoiceResponses.csv')
    df_2017 = pd.read_csv(prefix + '-2017/multipleChoiceResponses.csv', encoding="ISO-8859-1")
    
    # align the columns we want to create in our dataframes
    cols_to_generate = ['country','gender','age','education',
                        'employment_status','salary']
    col_names_2019 = ['Q3','Q2','Q1','Q4','Q5','Q10']
    col_names_2018 = ['Q3','Q1','Q2','Q4','Q6','Q9']
    col_names_2017 = ['Country','GenderSelect','Age','FormalEducation',
                      'EmploymentStatus','CompensationAmount']

    for df, cols in zip([df_2019, df_2018, df_2017], 
                        [col_names_2019, col_names_2018, col_names_2017]):
        # instantiate our new df
        processed_df = pd.DataFrame({'body': df[cols[0]].values})
        
        # classify countries as EU member state or UK
        EU_mask = processed_df['body'].isin(eu_countries)
        UK_mask = processed_df['body'].str.contains('United Kingdom', case=False)==True
        processed_df.loc[UK_mask,'body'] = 'UK'
        processed_df.loc[EU_mask,'body'] = 'EU'

        # create the other columns
        for new_col, old_col in zip(cols_to_generate, cols):
            processed_df[new_col] = df[old_col].values
            
        # perform some manual corrections
        
        # map the age column for 2017 to intervals
        if 'Age' in cols:
            processed_df.loc[processed_df['age']<17, 'age'] = np.nan
            processed_df['age'] = map_ages_to_intervals(processed_df['age'])
        
        # group 'out' of bound ages and fill NaNs
        high_ages = ['80+','70-100','70-79']
        processed_df.loc[processed_df['age'].isin(high_ages), 'age'] = '70+'
        sixties = ['60-64','65-69']
        processed_df.loc[processed_df['age'].isin(sixties), 'age'] = '60-69'
        processed_df.loc[processed_df['age']=='n','age'] = np.nan
        
        # process the salary column for the 2019 survey
        if cols[5]=='Q10':

            # remove the row featuring the question text (only needs doing once)
            processed_df = processed_df[~(processed_df.salary.str.contains('What is your current')==True)]
            for i, col in enumerate(['salary_lower', 'salary_higher']):
                # format the salary columns (remove $, commas and > signs);
                # regex=False treats '$' as a literal character, not an end-of-string anchor
                processed_df[col] = (processed_df['salary']
                                    .str.split('-', expand=True)[i]
                                    .str.replace('$', '', regex=False)
                                    .str.replace(',', '', regex=False)
                                    .str.replace('>', '', regex=False)
                                    .astype(float)
                                    )
            
            # add media question responses
            media_questions = df.columns[df.columns.str.contains('Q12')]
            processed_df[media_questions] = df[media_questions]
            # add IDE questions
            IDE_questions = df.columns[df.columns.str.contains('Q16')]
            processed_df[IDE_questions] = df[IDE_questions]
            # add programming questions
            prog_questions = df.columns[df.columns.str.contains('Q18')]
            processed_df[prog_questions] = df[prog_questions]
            processed_df['job'] = df['Q5']
        
        # corrections for the 2018 and 2019 surveys
        if 'Q1' in cols:
            # add the survey response time
            processed_df['completion_time'] = df['Time from Start to Finish (seconds)']
        
            # Add coding experience
            experience_mapping = {'0-1':'0-1','< 1':'0-1','1-2':'1-2','2-3':'3-5',
                                  '3-4':'3-5','4-5':'3-5','3-5':'3-5','5-10':'5-10',
                                  '10-15':'10-20','10-20':'10-20','15-20':'10-20',
                                  '20+':'20-+','20-25':'20-+','25-30':'20-+','30 +':'20-+',
                                  'I have never written code': '0-1'}
            try:
                # 2019 survey, map it to single categories
                processed_df['coding_experience'] = df['Q15']
                processed_df['coding_experience'] = (processed_df['coding_experience']
                                                     .str.replace('years','')
                                                     .str.strip()
                                                     .map(experience_mapping))
            except KeyError:
                # 2018 survey, map it to single categories
                processed_df['coding_experience'] = df['Q8']
                processed_df['coding_experience'] = (processed_df['coding_experience']
                                                     .str.strip()
                                                     .map(experience_mapping))
        
        if drop_non_eu:
            # drop countries which are not EU or UK
            processed_df = processed_df[processed_df['body'].isin(['UK','EU'])]  
            
        yield processed_df
        
def autolabel(rects):
    """
    Attach a text label above a bar in a bar chart displaying its height.
    
    This is taken from the matplotlib example gallery:
    https://matplotlib.org/examples/api/barchart_demo.html
    """
    for rect in rects:
        height = rect.get_height()
        # draw on the axes that owns the bar rather than relying on a global `ax`
        rect.axes.text(rect.get_x() + rect.get_width()/2., 1.01*height,
                       '%d' % int(height),
                       ha='center', va='bottom')
     

In [None]:
# a list of dfs for each survey [2019, 2018, 2017]
yearly_dfs = list(load_and_process(eu_countries))

# 0. Robustness of analysis
With any analysis, it is important to outline the limitations of the work and any assumptions we use.

**Limitations:**

> - This survey was not commissioned to look at the EU/UK relationship. It was not even commissioned with a UK/EU focus. We must question at each step whether the secondary data usage is relevant.
- We are dealing with a relatively small number of yearly responses (~500 from UK, and ~2500 from EU member states). We must therefore be careful to interrogate whether any findings are statistically significant for this number of responses.
- The survey primarily targets a specific demographic (users heavily engaged with the Kaggle platform). Therefore, we must be careful when extrapolating our results beyond this demographic.
- There is an inherent flaw of comparing a number of countries (the EU excluding the UK) with a large economic and social diversity to a single country (the UK) with a more restricted social and economic landscape. In reality it would be more meaningful to compare the UK only to an EU member state reflective of its politics and economy (e.g., France with a similar GDP per capita). However, the UK vs EU is a bit more theatrical  so I have chosen to do this. This is likely the largest flaw in this, so please hold it in your minds when reading through this notebook. See the book 'Factfulness' by H. Rosling for more insights about such questionable comparisons.
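
To give a feel for how much uncertainty these sample sizes carry, here is a minimal sketch of a Wilson score interval for a reported share. The helper `wilson_interval` and the counts passed to it are hypothetical and chosen only to match the rough sample sizes quoted above; they are not survey results.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical counts: the same 15% share observed in ~500 and ~2500 responses
uk_low, uk_high = wilson_interval(successes=75, n=500)
eu_low, eu_high = wilson_interval(successes=375, n=2500)
print(f"UK-sized sample: {uk_low:.3f} to {uk_high:.3f}")
print(f"EU-sized sample: {eu_low:.3f} to {eu_high:.3f}")
```

For the same observed share, the interval for the smaller (UK-sized) sample is noticeably wider, which is why small UK/EU differences in the sections below should be read cautiously.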

**Assumptions:**

>- It is likely some EU countries (only 17 of the 28 member states are represented in the 2019 results) are listed in the 'Other' category and these are excluded from my analysis.

# 1. Kaggle survey participation

In this section we look at how the participation in the Kaggle survey has changed over the last three years. In the left panel of Figure 1.1 we display the net number of responses to the survey for persons located either in the UK or other EU member states. In the right panel of Figure 1.1 we display the response rate normalised by the respective population (as reported by [Wikipedia](https://en.wikipedia.org/wiki/Demographics_of_the_European_Union#Population_shifts)) of the two bodies. Between 2017 and 2018 a large (~50%) increase in the survey participation for both the EU and UK was observed. In 2019 participation levels return to the levels observed in 2017, potentially showing either a reduced engagement with the Kaggle platform in 2019 compared to 2018 for persons in the EU and UK, reduced advertisement of the survey and/or greater entry restrictions. There were more responses per capita for the UK compared to the EU as a whole (Figure 1.1 right), but the difference is reducing each year with more EU participation relative to UK participation. This may be the result of reduced UK participation caused by ['survey/referendum/polling' fatigue](https://www.theguardian.com/uk-news/2019/oct/26/voters-disillusionment-renders-expected-election-tough-to-call) - perhaps the UK public are getting tired of being asked things?

As well as looking at the participation rates, we also looked at the time taken by respondents to complete the survey (recorded in 2018 and 2019 only). Kernel density estimates of the response times are displayed in Figure 1.2. On average, persons spent longer completing the 2018 survey than the 2019 survey (with an average difference of ~400 seconds) and the 2018 survey completion times had a greater variation (i.e., a broader distribution). In 2019 there was no significant difference in completion time between EU and UK respondents. However, UK residents did appear to take slightly longer on average in 2018 (compare the two maxima in Figure 1.2). This is likely a result of UK respondents answering more questions in total (which also explains the two distinct maxima in survey completion time).

**Key points:**
>- UK survey participation is lower in 2019 compared to 2017, whereas EU participation is slightly higher. (See left panel of Figure 1.1).
- Participation per capita is higher for the UK compared to the (mean) participation per capita for EU member states, but the EU is catching up.
- People spent less time completing the survey in 2019 (approximately 400 seconds less on average) compared to the 2018 survey.
- There were no significant differences in completion time for UK / EU respondents in 2019.
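
One way to put the completion-time comparison on firmer footing is a permutation test on the difference in mean completion time. The sketch below is an illustration only: `permutation_pvalue` is a hypothetical helper, and the times are synthetic draws rather than survey data.

```python
import numpy as np

def permutation_pvalue(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means between two samples."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # shuffle the group labels and recompute the difference in means
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            count += 1
    # add one to numerator and denominator so the p-value is never exactly zero
    return (count + 1) / (n_permutations + 1)

# Synthetic completion times in seconds (NOT taken from the survey)
rng = np.random.default_rng(42)
uk_times = rng.gamma(shape=2.0, scale=400.0, size=500)
eu_times = rng.gamma(shape=2.0, scale=400.0, size=2500)

p_same = permutation_pvalue(uk_times, eu_times)             # same distribution
p_shifted = permutation_pvalue(uk_times + 400.0, eu_times)  # UK shifted by 400 s
print(p_same, p_shifted)
```

A permutation test is used here because completion times are strongly skewed, so it avoids assuming normality. Applied to the real data, `a` and `b` would be the filtered `completion_time` values for each body.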

In [None]:
# prepare data for this analysis
eu_respondents = []
uk_respondents = []
for df in yearly_dfs:
    counts = df.groupby('body')['country'].count()
    eu_respondents.append(counts['EU'])
    uk_respondents.append(counts['UK'])
    
eu_respondents = np.array(eu_respondents)
uk_respondents = np.array(uk_respondents)

### Figure 1.1 Survey participation

The below graphic displays the number of responses to the 2017, 2018 and 2019 Kaggle data science and machine learning surveys. The left panel displays the net number of responses, with responses from persons located in the UK and EU represented by solid red bars and solid blue bars respectively. The right hand panel displays the number of respondents normalised by the population of the two bodies (as reported by [Wikipedia](https://en.wikipedia.org/wiki/Demographics_of_the_European_Union#Population_shifts)). We have excluded the UK population from the population of the EU.

In [None]:
fig, (ax, ax2) = plt.subplots(1, 2)
# plot bar chart and offset them along x axis to make them side by side
eu_rects = ax.bar(survey_years-0.2, eu_respondents, 0.4, color=EU_COLOUR, label='EU', alpha=0.8)
uk_rects = ax.bar(survey_years+0.2, uk_respondents, 0.4, color=UK_COLOUR, label='UK', alpha=0.8)


# add labels and sort out tick labels
ax.legend()
ax.set_ylabel('Number of responses')
ax.set_xlabel('Year')
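# place the ticks explicitly on the survey years instead of relying on default tick positions
ax.set_xticks(survey_years)
ax.set_xticklabels(survey_years)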
ax.set_ylim(0,4500)
ax.set_title('Net Participation')

# add counts to the top of the bars
autolabel(eu_rects)
autolabel(uk_rects)


# plot the normalised data on the second axis

# population figures from wikipedia
eu_pop = np.r_[513481691,(513481691+511643456)/2,511643456]
uk_pop = np.r_[66647112,(66647112+65808573)/2,65808573]
eu_pop = eu_pop - uk_pop

eu_rects = ax2.bar(survey_years-0.2, eu_respondents/eu_pop * 100, 0.4, color=EU_COLOUR, label='EU', alpha=0.8)
uk_rects = ax2.bar(survey_years+0.2, uk_respondents/uk_pop * 100, 0.4, color=UK_COLOUR, label='UK', alpha=0.8)


# add labels and sort out tick labels
ax2.legend()
ax2.set_ylabel('Fraction of population responding (%)')
ax2.set_xlabel('Year')
ax2.set_xticks(survey_years)
_ = ax2.set_xticklabels(survey_years)
ax2.set_title('Participation normalised by population')

for ax in [ax,ax2]:
    turn_off_lines(ax)

fig.set_size_inches(13,5)

### Figure 1.2 : Survey completion time

This figure displays a kernel density estimate of the probability distribution of the survey completion time  for the 2019 (left figure) and the 2018 (right figure) Kaggle surveys  for respondents located in the UK (red shaded region) and EU (blue shaded region). Persons with completion times exceeding 7500 seconds were considered as outliers and ignored in this analysis.

In [None]:
fig, axes = plt.subplots(ncols=2, sharey=True)
max_comp_time = 7500 # cut-off for max response time to plot
for i, ax in enumerate(axes):
    for body in ['EU','UK']:
        body_mask = yearly_dfs[i]['body']==body
        # completion times may be read in as strings, so coerce them to numbers
        comp_time = pd.to_numeric(yearly_dfs[i].loc[body_mask,'completion_time'], errors='coerce')
        # `fill` replaces the deprecated `shade` argument (seaborn >= 0.11)
        sns.kdeplot(comp_time[comp_time<max_comp_time], color=color_dict[body], alpha=0.3, label=body, ax=ax, fill=True)
        ax.set_title(survey_years[i])
        ax.set_xlabel('Survey completion time (seconds)')
        turn_off_lines(ax)
# finishing touches
axes[0].set_ylabel('Probability density')
axes[1].legend()
fig.set_size_inches(10,5)

# 2. Age

In the following section we investigate the age of the respondents to the survey. We display histograms (Figure 2.1) of the respondents' reported ages for each year of the Kaggle survey. The survey clearly attracts a relatively young UK (and EU) audience compared to the [nominal distribution](https://www.statista.com/statistics/281174/uk-population-by-age/) of the UK. We also found that a slightly higher proportion of EU respondents report a younger age bracket compared to the UK (note the differences in the bottom panels for persons between 22-39: EU respondents are more likely to fall in all of these categories than UK respondents). UK respondents are more likely to be between the ages of 40 and 49 compared to EU respondents.

There is a distinct worry that, as well as reducing migration to the UK, Brexit could also cause a large number of young UK professionals to migrate from the UK. There is already evidence of this from a surge in [citizenship applications](https://www.theguardian.com/politics/2017/oct/13/brexit-vote-creates-surge-in-eu-citizenship-applications) to other EU countries from British persons following the Brexit referendum in 2016. In the Kaggle survey we observed a drop in the number of UK respondents reporting in the 25-29 age bracket from 2017 to 2019 (compare right and left panels of Figure 2.1), which could reflect post-Brexit referendum migration to the EU of early career professionals. However, we cannot robustly conclude this because the method of collecting age changed between 2017 and 2018 onwards.

**Key points:**
>- The survey attracts a relatively young audience.
- EU respondents have a slightly higher fraction of reported ages under 30 compared to the UK.
- UK respondents have a higher fraction of respondents aged 40-49.


### Figure 2.1 Age distribution
The panels (left to right) display the age distribution for respondents located in the UK (red bars) and EU (blue bars) in 2019, 2018 and 2017. The bottom panel of each year's graphic displays the difference between the fraction of EU and UK respondents reporting in a given age category; the bars in the difference plots are coloured according to which body (UK or EU) has the higher fraction of respondents in the given age bracket. The 2017 survey had a significantly different question set and age grouping, which may have affected the observed results.
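
The difference shown in the bottom panels can be sketched as follows. This is a minimal illustration on toy data, assuming the `body` and `age` column names used in this notebook; `fraction_difference` is a hypothetical helper, not the function used to draw the figure. Aligning by age label means a category that is missing for one body is counted as zero rather than shifting the other values.

```python
import pandas as pd

def fraction_difference(df, group_col="body", category_col="age"):
    """EU fraction minus UK fraction for every category, aligned by label."""
    fractions = (df.groupby(group_col)[category_col]
                   .value_counts(normalize=True)
                   .unstack(fill_value=0.0))
    return fractions.loc["EU"] - fractions.loc["UK"]

# Toy data for illustration only (not survey responses)
toy = pd.DataFrame({
    "body": ["EU"] * 4 + ["UK"] * 4,
    "age": ["18-21", "22-24", "22-24", "25-29",
            "22-24", "25-29", "25-29", "25-29"],
})
diff = fraction_difference(toy)
print(diff)
```

Positive values mean the EU has the higher share in that bracket, and negative values mean the UK does.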

In [None]:
fig, axes = plt.subplots(3,3, gridspec_kw= {'wspace':0.5, 'hspace':0,'height_ratios': [4, 4,2]})

residuals = []
for axs, body in zip(axes, ('EU','UK')):
    for i, ax in enumerate(axs):
        # make the plot
        if body=='UK':
            ax.invert_yaxis()
            #ax.axhline(0.12,color='grey', alpha=0.5)
        grouped_by_age = yearly_dfs[i][yearly_dfs[i]['body']==body].groupby('age')['body'].count() 
        grouped_by_age = grouped_by_age / grouped_by_age.sum() # normalise
        ax.bar(grouped_by_age.index, grouped_by_age.values, color=color_dict[body], label=body, alpha=0.5,lw=2)
        turn_off_lines(ax)
        ax.grid(axis='y')
        if i==2:
            ax.legend()
        # store values for residual plot 
        residuals.append(grouped_by_age.values)
        if body=='EU':
            ax.set_title(survey_years[i])

# plot the difference graph
# rows of `residuals` are ordered EU 2019, 2018, 2017 followed by UK 2019, 2018, 2017
residuals = np.array(residuals)
differences = residuals[:3] - residuals[3:]  # EU minus UK, one row per year

for ax, resid in zip(axes[2], differences):
    # colour each bar by the body with the higher fraction
    colors = [color_dict['EU'] if r > 0 else color_dict['UK'] for r in resid]
    ax.bar(grouped_by_age.index, resid, color=colors, alpha=0.3)
    turn_off_lines(ax)
    plt.setp(ax.get_xticklabels(), rotation=90)
    ax.set_ylim(-0.2,0.2)
    ax.set_xlabel('Age group (years)')
    ax.spines['left'].set_visible(False)
    ax.yaxis.set_visible(False)
    ax.axhline(0, color='grey', alpha=0.3)
axes[0][0].set_ylabel('Fraction of respondents')
axes[0][0].yaxis.set_label_coords(-0.175, 0)

fig.set_size_inches(15,4)

# 3. Gender

This section investigates the self-reported gender of EU/UK respondents to the survey; the results are summarised in Figure 3.1. Overall, we can see there is a large gender imbalance in respondents (approximately 80% of respondents identify as Male for all years); the computer science and technology sectors are known to have [one of the largest gender imbalances](https://www.stemwomen.co.uk/blog/2019/09/women-in-stem-percentages-of-women-in-stem-statistics) and the results of this survey reflect this. However, in 2018 and 2019 there was a slight increase in Female participants compared to 2017. This may be a result of advertising differences between the years attracting more Female participants to the survey, as opposed to an underlying change in the gender distribution of the sector.

Notably, the UK has a higher fraction of Female participation when compared to the EU as a whole (18% to 14%). While small, this is likely significant (the UK attracted over sFemale respondents in 2017 and 2018). I'm aware the UK has, for a long time, been trying to increase Female participation in STEM subjects at school and [success is being seen already](https://www.thetimes.co.uk/article/a-level-results-2019-more-girls-than-boys-study-science-for-the-first-time-f3bgjgrsd). The lower gender imbalance reported here compared to the EU as a whole may reflect the impact of such policies within the UK.

**Key points:**
>- There is a large gender imbalance (~80% identifying as Male) reported in the Kaggle survey.
- The UK has a slightly smaller gender imbalance than the EU as a whole.
- The gender imbalance decreases from 2017 to 2018, but there has been no significant change from 2018 to 2019.
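
Whether a gap such as 18% versus 14% is statistically meaningful can be checked with a two-proportion z-test. The sketch below uses rounded, illustrative counts based on the approximate sample sizes quoted in the limitations (~500 UK and ~2500 EU responses); they are assumptions, not the exact survey counts, and `two_proportion_z_test` is a hypothetical helper.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative counts: 18% of ~500 UK responses vs 14% of ~2500 EU responses
z, p_value = two_proportion_z_test(90, 500, 350, 2500)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these rounded inputs the difference is a little over two standard errors, so the result is sensitive to the exact counts and should be recomputed from the real yearly totals before drawing conclusions.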

### Figure 3.1 Gender distribution

This figure displays the gender distribution for respondents to the Kaggle DS/ML survey undertaken in 2019, 2018 and 2017 (left to right panel). We have grouped several responses (Did not respond, self-identify and non-binary) under one heading 'Other': this is to keep the graphic simple and to minimize the number of labels. UK responses are represented by the outer circle (shades of red) and EU responses are represented by the inner circle (shades of blue). 

In [None]:
UK_MULTI_COLOURS = ['#810a1e','#ef3654','#b00e29','#e01233']
EU_MULTI_COLOURS = ['#10254b','#2c63c9','#193a75','#234e9f']
pi_labels = ['UK: Female','UK: Male','UK: Other','EU: Female','EU: Male','EU: Other']
size = 0.3 # used to determine size of the doughnuts
# kwargs shared across the charts
shared_pi_formats = {'wedgeprops':dict(width=size, edgecolor='w'),
                     'textprops':dict(color='white', fontsize=8),
                     'autopct':'%.1f %%'}

fig, axes = plt.subplots(ncols=3)
for i, ax in enumerate(axes):
    # group all the other genders to 'Other' for simplicity see disclaimer above
    yearly_dfs[i].loc[~yearly_dfs[i]['gender'].isin(['Male','Female']), 'gender'] = 'Other'
    UK_mask = yearly_dfs[i]['body']=='UK'
    EU_mask = yearly_dfs[i]['body']=='EU'
    ax.pie(yearly_dfs[i][UK_mask].groupby('gender')['body'].count(), radius=1,
           colors=UK_MULTI_COLOURS, pctdistance=0.85, **shared_pi_formats)
    ax.pie(yearly_dfs[i][EU_mask].groupby('gender')['body'].count(), radius=1-size,
           colors=EU_MULTI_COLOURS, pctdistance=0.78, **shared_pi_formats)
    ax.set(aspect="equal", title=survey_years[i])

    
axes[2].legend(loc=(0.8,0.8), labels=pi_labels)
fig.set_size_inches(16,6)

# 4. Education

In this section we look at the education levels of respondents, the number of students participating and the coding experience of respondents. We begin with Figure 4.1 which displays the fraction of survey respondents with either a Bachelor's, Master's or Doctoral degree. We see that a large fraction of all respondents have obtained a minimum of a Bachelor's degree (~90%) and, unsurprisingly, both the EU and UK have a high fraction of respondents who have obtained a higher degree (Master's or PhD). The EU, however, does have a noticeably higher fraction of respondents with a higher degree (~75%) compared to the UK (~70%). Despite the greater attainment levels of higher degrees for the EU, UK respondents were more likely to have obtained a Doctoral degree (compare the size of the darkest red bars to the darkest blue bars in Figure 4.1). We cannot be sure why more EU respondents have a higher degree but it may be a result of cheaper fees for a Master's programme in the EU compared to the UK, a different cultural attitude towards the need to attain these degrees and/or a greater number of universities. 

There are no temporal dependences of note in the data, although there is a slight drop in degree attainment levels for UK respondents. This could indicate that the Kaggle survey has been reaching a wider audience in the UK since 2017 (i.e., less specialized people who have not obtained a university degree).

Looking at the engagement of current students (Figure 4.2), we see that Kaggle surveys have a modest response (~15%) from students in the UK and EU. Between 2018 and 2019 there were fractionally fewer student responses from persons located in the EU, but UK student engagement remained (proportionally) the same.

Finally, we discuss the coding experience of respondents (Figure 4.3). Firstly, we remark that the 2018 respondents had less coding experience than the 2019 respondents. The 2018 survey saw higher participation levels (Figure 1.1) and likely reached a broader audience with lower net experience in the data science / machine learning fields, resulting in lower reported coding experience. Secondly, comparing the UK respondents to the EU respondents, the UK respondents had more coding experience on average than the EU respondents. This may reflect the slightly older demographic of UK respondents (Figure 2.1) and the higher fraction with Doctoral degrees (likely resulting in more opportunity to code). 

**Key points:**
>- Both the UK and EU respondents have similar overall degree attainment levels.
- Overall,  EU respondents are more likely to have obtained either a Master's or Doctoral degree compared to the UK.
- However, UK respondents are more likely to hold a PhD than EU respondents.
- Total UK degree attainment of respondents has dropped slightly over three years.
- Approximately 15% of all responses from the EU/UK were from students.
- On average, the UK respondents had slightly more coding experience than their EU counterparts.
- Overall, the 2018 survey (UK/EU) respondents had less coding experience than the 2019 respondents, likely reflecting a broader audience scope in the 2018 survey.

### Figure 4.1 Reported highest education levels

This figure displays the reported highest education levels of the respondents. For each year, the highest degree attainment is stacked such that the bar is cumulative (e.g., for the 2017 bar you can read off that ~90% of respondents had at least a Bachelor's degree). Responses from persons located in the EU are shaded in blue and those from the UK in red. All bars are stacked in the same order (i.e., from bottom to top: Doctoral degrees, Master's degrees and Bachelor's degrees). 
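
The cumulative stacking can be sketched as below. The fractions are hypothetical and serve only to show how the top of each segment is obtained; each segment starts where the ones beneath it end, so the top of the Bachelor's segment gives the share holding any degree.

```python
import numpy as np

# Hypothetical degree fractions for a single bar, ordered bottom to top
labels = ["Doctoral", "Master", "Bachelor"]
fractions = np.array([0.20, 0.50, 0.20])

# the bottom of each segment is the running total of the segments below it
bottoms = np.concatenate([[0.0], np.cumsum(fractions)[:-1]])
tops = bottoms + fractions

for label, bottom, top in zip(labels, bottoms, tops):
    print(f"{label}: drawn from {bottom:.2f} to {top:.2f}")
```

In the plotting cell the same idea appears as the `bottom=` argument passed to `ax.bar`.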

In [None]:
# create a dictionary with education data in
EU_dict = {'Bachelor':[],'Master':[],'Doctoral':[]}
UK_dict = {'Bachelor':[],'Master':[],'Doctoral':[]}
for i in range(0,3):
    uk_mask = yearly_dfs[i]['body']=='UK'
    eu_mask = yearly_dfs[i]['body']=='EU'
    eu_degree_fractions = (yearly_dfs[i][eu_mask].groupby('education')['body'].count() / 
                           yearly_dfs[i][eu_mask].groupby('education')['body'].count().sum())
    uk_degree_fractions = (yearly_dfs[i][uk_mask].groupby('education')['body'].count() / 
                           yearly_dfs[i][uk_mask].groupby('education')['body'].count().sum())
    # extract values for each degree and append to global dicts
    for degree in ['Bachelor','Master','Doctoral']:
        eu_degree_mask = eu_degree_fractions.index.str.contains(degree)
        uk_degree_mask = uk_degree_fractions.index.str.contains(degree)
        EU_dict[degree].append(eu_degree_fractions[eu_degree_mask].values)
        UK_dict[degree].append(uk_degree_fractions[uk_degree_mask].values)
        
# plot the data we created

fig, ax = plt.subplots(1, 1)
ax.bar(survey_years-0.2,  np.ravel(EU_dict['Doctoral']), 0.4, color=EU_MULTI_COLOURS[0],
       alpha=0.8)
ax.bar(survey_years+0.2, np.ravel(UK_dict['Doctoral']), 0.4, color=UK_MULTI_COLOURS[0],
       alpha=0.8)
ax.bar(survey_years-0.2, np.ravel(EU_dict['Master']), 0.4, color=EU_MULTI_COLOURS[1],
       label='EU', alpha=0.8, bottom = np.ravel(EU_dict['Doctoral']))
ax.bar(survey_years+0.2, np.ravel(UK_dict['Master']), 0.4, color=UK_MULTI_COLOURS[1],
       label='UK', alpha=0.8, bottom = np.ravel(UK_dict['Doctoral']))
ax.bar(survey_years-0.2, np.ravel(EU_dict['Bachelor']), 0.4, color=EU_MULTI_COLOURS[3],
       alpha=0.5,  bottom = np.ravel(EU_dict['Doctoral']) +  np.ravel(EU_dict['Master']))
ax.bar(survey_years+0.2, np.ravel(UK_dict['Bachelor']), 0.4, color=UK_MULTI_COLOURS[3], 
       alpha=0.5,  bottom = np.ravel(UK_dict['Doctoral']) + np.ravel(UK_dict['Master']))


# add labels and sort out tick labels
ax.legend(loc=(1.01,0.878))
ax.set_ylabel('Fraction of responses')
ax.set_xlabel('Year')
ax.text(0.09,0.1,'PhD', transform=ax.transAxes, color='white', rotation=0, fontsize=10)
ax.text(0.07,0.47,'Master', transform=ax.transAxes, color='white', rotation=0, fontsize=10)
ax.text(0.057,0.82,'Bachelor', transform=ax.transAxes, color='white', rotation=0, fontsize=10)
_ = ax.set_xticklabels(['','','2017','','2018','','2019'])
ax.set_ylim(0,1)
# turn off all the lines to remove clutter
turn_off_lines(ax)
fig.set_size_inches(7,5)

### Figure 4.2 Student engagement

This figure displays the fraction of respondents who were students in 2018 (left two bars) and 2019 (right two bars), for respondents located in the EU (solid blue bars) and UK (solid red bars).

In [None]:
# only looking at students for now.
jobs_to_use = ['Student','Other']

# map all the other categories to 'other'
yearly_dfs[0].loc[~yearly_dfs[0]['employment_status'].isin(jobs_to_use), 'employment_status'] = 'Other'
yearly_dfs[1].loc[~yearly_dfs[1]['employment_status'].isin(jobs_to_use), 'employment_status'] = 'Other'

EU_students = []
UK_students = []
students = {'UK':UK_students, 'EU':EU_students}
for body in ['UK','EU']:
    student_2019  = yearly_dfs[0][yearly_dfs[0]['body']==body].groupby('employment_status')['body'].count()
    student_2019 = student_2019 / student_2019.sum()
    student_2018 = yearly_dfs[1][yearly_dfs[1]['body']==body].groupby('employment_status')['body'].count()
    student_2018 = student_2018 / student_2018.sum()
    # select the 'Student' fraction by label rather than by position
    students[body].append([student_2019['Student'], student_2018['Student']])
    
fig, ax = plt.subplots(1, 1)
ax.bar(survey_years[:-1]-0.2,  np.ravel(students['EU'])*100, 0.4, color=EU_COLOUR, alpha=0.8, label='EU')
ax.bar(survey_years[:-1]+0.2, np.ravel(students['UK'])*100, 0.4, color=UK_COLOUR, alpha=0.8, label='UK')

ax.legend()
ax.set_ylabel('Fraction of responses from students (%)')
ax.set_xlabel('Year')
_ = ax.set_xticklabels(['','','2018','','','','2019'])
ax.set_ylim(0,25)
turn_off_lines(ax)
fig.set_size_inches(7,5)

### Figure 4.3 Coding experience

In the figure below we display the reported coding experience for respondents in 2019 (left panel) and 2018 (right panel). Colour denotes the location of the respondents, either in the UK or in an EU member state (red and blue respectively). The bars are translucent and placed side by side to allow for easy comparison of bar heights.

In [None]:
fig, axes = plt.subplots(1,2, sharey=True)
offset = {'UK':0.2, 'EU':-0.2}
for i, ax in enumerate(axes):
    yearly_dfs[i]['coding_lower'] =  yearly_dfs[i]['coding_experience'].str.split('-', expand=True)[0].astype(float)
    yearly_dfs[i]['coding_higher'] =  yearly_dfs[i]['coding_experience'].str.split('-', expand=True)[1]
    for body in ['EU','UK']:
        coding_grouping = yearly_dfs[i][yearly_dfs[i]['body']==body].groupby(['coding_lower','coding_higher'])['body'].count()
        coding_grouping = coding_grouping / coding_grouping.sum()
        coding_grouping = coding_grouping.reset_index()
        coding_grouping['bins'] = coding_grouping['coding_lower'].astype(int).astype(str) + '-' + coding_grouping['coding_higher'].astype(str)
        labels = coding_grouping['bins'].values
        ax.bar(np.arange(0, len(labels))-offset[body], coding_grouping['body'], alpha=0.6, color=color_dict[body], label=body, width=0.4)
        ax.set_title(survey_years[i])
        ax.set_xlabel('Coding experience (years)')
        # turn off all the lines to remove clutter
        ax.set_xticklabels(['']+list(labels[:-1])+['20+'])
        turn_off_lines(ax)
        ax.grid(axis='y', alpha=0.4)
    
# tidy up the axes
axes[0].set_ylabel('Fraction of respondents')
axes[1].legend()

fig.set_size_inches(14,5.5)

# 5. Compensation

In this section we look at the reported compensation of respondents in the 2019 Kaggle survey. Firstly, we display the reported compensation values for respondents located in the UK and EU in Figure 5.1. The biggest difference between the distributions is that UK respondents, on average, report higher salaries than EU respondents. This is visible in the normalized difference graphic in the right panel: the higher income bins are mainly shaded red and the lower income bins are mainly shaded blue. The earnings disparity could be accounted for by some of the insights we previously extracted: UK respondents are more likely to have a PhD (Figure 4.1) and are slightly older (Figure 2.1), both of which should lead to higher levels of compensation. Finally, [compared to about half of the EU member states the UK has a higher cost of living](https://www.numbeo.com/cost-of-living/rankings_by_country.jsp?title=2018&region=150) and this could result in higher relative compensation for UK respondents. It's unlikely that the number of student respondents affects the results (see Figure 4.2), as the EU saw fractionally fewer student responses in 2019 than the UK.

Following this, we grouped respondents into three income categories (low, medium and high income) and looked at the gender distribution across these income bands (Figure 5.2). Our graphic reveals a striking gender imbalance for both UK and EU respondents: higher earners are nearly exclusively Male and the Female to Male ratio falls as the income category rises (i.e., lower earners are more likely to be Female than higher earners). This is a stark reminder of the need to address the gender biases and barriers in the data science and machine learning community.

There are concerns that a No deal Brexit will cause a dramatic [hit to the UK economy](https://news.sky.com/story/live-chancellor-philip-hammond-to-reveal-brexit-economic-impact-analysis-11565724), and so a particularly interesting comparison could be made after the Brexit event itself (i.e., after the 2020 Kaggle survey).

**Key points:**
>- A higher fraction of UK respondents fell into higher wage brackets than EU respondents.
- A significantly higher fraction of EU respondents earned less than \$15,000 a year compared to UK respondents.
- There is a large gender imbalance among receivers of high income: they are nearly exclusively male for both the EU and UK.

### Figure 5.1 Reported compensation in 2019
This figure displays the fraction of respondents reporting a salary in a given compensation window (in dollars) for both the UK (red bars) and EU (blue bars). The far right panel (titled 'Normalized differences') displays the normalized difference between the fraction of respondents per income category in the EU and in the UK. When the fraction of respondents is higher for the EU the bars are coloured blue; when UK respondents report a given income category more often the bars are coloured red.
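As a small worked example of the normalized difference, the sketch below uses hypothetical fractions (not survey values) and plain colour names standing in for the notebook's `color_dict`. The quantity is (EU - UK) / (EU + UK), so it lies between -1 and +1.

```python
import pandas as pd

# Hypothetical fractions of respondents per salary bin (not survey values)
uk = pd.Series({"0-999": 0.10, "1,000-1,999": 0.30, "2,000-2,999": 0.60})
eu = pd.Series({"0-999": 0.30, "1,000-1,999": 0.30, "2,000-2,999": 0.40})

# Normalized difference: +1 means only EU respondents fall in the bin, -1 means only UK respondents
asymmetry = (eu - uk) / (eu + uk)

# Positive values (EU more common) are drawn in blue, everything else in red
colours = asymmetry.gt(0).map({True: "blue", False: "red"})
print(pd.DataFrame({"asymmetry": asymmetry, "colour": colours}))
```

Dividing by the sum keeps sparsely populated bins on the same scale as well populated ones, at the cost of making those sparse bins noisier.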

In [None]:
fig, axes = plt.subplots(1,3, sharey=True, gridspec_kw = {'wspace':0, 'hspace':0,
                                                         'width_ratios': [3, 3,1]})
groupings = []
for ax, body in zip(axes,['UK','EU']):
    salary_grouping = yearly_dfs[0][yearly_dfs[0]['body']==body].groupby(['salary_lower','salary_higher'])['body'].count()
    salary_grouping = salary_grouping / salary_grouping.sum()
    salary_grouping = salary_grouping.reset_index()
    salary_grouping['bins'] = salary_grouping['salary_lower'].astype(str) + '-' + salary_grouping['salary_higher'].astype(str)
    groupings.append((salary_grouping['bins'], salary_grouping['body']))
    ax.barh(salary_grouping['bins'], salary_grouping['body'], alpha=0.5, color=color_dict[body], label=body)
    ax.set_title(body, fontsize=14)
    plt.setp(ax.spines.values(), visible=False)
    ax.patch.set_visible(False)
    ax.grid(False)
    ax.set_xlabel('Fraction of respondents')

axes[0].invert_xaxis()
axes[0].set_ylabel('Compensation amount ($)')

#ax.set_title('Compensation amount in 2019')

# calculate the normalised difference, (EU - UK) / (EU + UK), and plot it on the third axis
asymmetry = (groupings[1][1][:-1]-groupings[0][1])/(groupings[1][1][:-1]+groupings[0][1])
# make the colours of the bars
colors = asymmetry.copy()
mask = asymmetry>0
colors[mask] = color_dict['EU']
colors[~mask] = color_dict['UK']
# plot
axes[2].barh(groupings[0][0], asymmetry, alpha=0.3,color=colors)
axes[2].axvline(0, alpha=0.2, color='gray')
axes[2].grid(False)
plt.setp(axes[2].spines.values(), visible=False)
plt.setp(axes[2].get_xticklabels(), visible=False)
axes[2].patch.set_visible(False)
axes[2].set_title('Normalized differences', fontsize=14)
plt.suptitle('Reported income by body', fontsize=16)
fig.set_size_inches(15,12)

### Figure 5.2 Compensation by gender in 2019

The graphic below outlines the gender distribution for persons with a low income (\$0-\$29,999, approximately less than the [UK average income](https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/incomeandwealth/bulletins/householddisposableincomeandinequality/yearending2018)), a medium income (\$30,000-\$99,999) and a high income (\$100,000+) in both the EU (blue segments) and UK (red segments). This is adapted from [my original post](https://www.kaggle.com/fchmiel/who-will-your-analysis-story-be-about#Method-3:-Using-UMAP) outlining how to create this graphic.

Only 26 respondents reported a low income in the UK.
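The three income bands can also be derived numerically rather than by listing every salary string. The sketch below is an alternative to the dictionary mapping used in this notebook, not the method it uses: it assumes a numeric lower bound per salary bin (the values here are hypothetical) and bins it with `pd.cut`.

```python
import pandas as pd

# Hypothetical lower bounds of reported salary bins, in dollars
salary_lower = pd.Series([0, 7500, 25000, 30000, 90000, 100000, 500000])

# Band edges follow the text: low below 30,000, medium from 30,000 up to 100,000, high above
income_group = pd.cut(
    salary_lower,
    bins=[0, 30000, 100000, float("inf")],
    labels=["low", "medium", "high"],
    right=False,  # intervals are [lower, upper), so 30,000 counts as medium
)
print(income_group.value_counts(sort=False))
```

The advantage of binning on the numeric bound is that a new or renamed salary category cannot silently fall outside the mapping.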

In [None]:

# group the compensation level in low, medium or high income.
salary_mapping = {'$0-999':'low', '1,000-1,999':'low', 
                  '10,000-14,999':'low', '100,000-124,999':'high',
                  '125,000-149,999':'high', '15,000-19,999':'low', 
                  '150,000-199,999':'high', '2,000-2,999':'low',
                  '20,000-24,999':'low', '200,000-249,999':'high', 
                  '25,000-29,999':'low', '250,000-299,999':'high',
                  '3,000-3,999':'low','30,000-39,999':'medium',
                  '300,000-500,000':'high', '4,000-4,999':'low',
                  '40,000-49,999':'medium', '5,000-7,499':'low', 
                  '50,000-59,999':'medium', '60,000-69,999':'medium',
                  '7,500-9,999':'low', '70,000-79,999':'medium', 
                  '80,000-89,999':'medium', '90,000-99,999':'medium',
                  '> $500,000':'high'}
# make the salary grouping
yearly_dfs[0]['income_group'] = yearly_dfs[0]['salary'].map(salary_mapping)
    
fig, axes = plt.subplots(ncols=3)
for i, (ax, income) in enumerate(zip(axes, ['low', 'medium','high'])):
    # group all the other genders to 'Other' for simplicity see disclaimer above
    yearly_dfs[0].loc[~yearly_dfs[0]['gender'].isin(['Male','Female']), 'gender'] = 'Other'
    
    # respondent masks
    UK_mask = yearly_dfs[0]['body']=='UK'
    EU_mask = yearly_dfs[0]['body']=='EU'
    income_mask = yearly_dfs[0]['income_group']==income
    
    # Count respondents of each gender in subgroup
    EU_gender_count = yearly_dfs[0][EU_mask & income_mask].groupby('gender')['body'].count()
    UK_gender_count = yearly_dfs[0][UK_mask & income_mask].groupby('gender')['body'].count()

    # plot
    ax.pie(UK_gender_count, radius=1, colors=UK_MULTI_COLOURS, pctdistance=0.85, **shared_pi_formats)
    ax.pie(EU_gender_count, radius=1-size, colors=EU_MULTI_COLOURS, pctdistance=0.78, **shared_pi_formats)
    ax.set(aspect="equal", title=f'{income.capitalize()} income')


axes[1].legend(loc=(2.1,0.8), labels=pi_labels)
fig.set_size_inches(16,6)

# 6. Media Engagement

In the following, we look at the data science media sources that UK and EU respondents said were their favourites in 2019. Figure 6.1 displays the number of times respondents indicated that a given media type was one of their favourite media sources (normalized by the total number of indications). Overall, Blogs, Kaggle and YouTube were used most by respondents to engage with data science media. The media engagement of UK and EU respondents was similar, but UK respondents were more likely to engage with Twitter than EU respondents.

We also investigated which media sources respondents selected *together* as their favourites (Figure 6.2), which could indicate complementary sources. We found that respondents rarely rely on a single source of media but tend to use several; for example, Journal Publications may be complementary to Kaggle (e.g., as a source of novel methods for use in competitions).

**Key points:**
>- The types of platforms used to engage with data science media are similar in the UK and EU.
- UK respondents are more likely to have used Twitter to engage with data science media.
- Respondents tended to indicate several sources as their favourite media sources.

### Figure 6.1 Favourite data science media sources

The figure below displays how many respondents said a given media type was one of their favourite sources of data science media (normalized by the total number of favourites selected). Red bars represent UK responses and blue bars EU responses. The bars are translucent and overlaid to allow a direct comparison of their heights.
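The normalization can be illustrated with a minimal sketch on hypothetical multi-select answers (the `toy` frame is made up). Each selected source holds a value and each unselected source is missing, so counting non-missing entries per body gives the number of selections.

```python
import pandas as pd

# Hypothetical multi-select answers: a value if selected, missing otherwise
toy = pd.DataFrame({
    "body": ["UK", "UK", "EU", "EU"],
    "Kaggle": ["Kaggle", "Kaggle", "Kaggle", None],
    "Twitter": ["Twitter", None, None, None],
    "Blogs": [None, "Blogs", "Blogs", "Blogs"],
})

# count() ignores missing values, so this is the number of selections per source
selections = toy.groupby("body").count()

# Normalize by the total number of selections made within each body
share = selections.div(selections.sum(axis=1), axis=0)
print(share)
```

Because each row of `share` sums to one, the UK and EU can be compared directly even though the two groups differ in size.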

In [None]:
# extract the question on engagement with media sources
# (copy so that the renaming and dropping below do not act on a view of yearly_dfs[0])
df_media = yearly_dfs[0].loc[:, yearly_dfs[0].columns.str.contains('Q12') | yearly_dfs[0].columns.str.contains('body')].copy()

# extract name of websites for each response
col_names = df_media.mode().values[0]
engagement_types = [str(name).split('(')[0].strip() for name in col_names]

# make the column names the name of the media
df_media.columns = ['body'] + engagement_types[1:]

# drop the free field text column
df_media.drop('-1', axis=1, inplace=True)

# count the number of each media type
media_grouped = df_media.groupby('body').count()-1 # -1 for the question row

# make the plot
fig, ax = plt.subplots()
for body in ['EU','UK']:
    data = media_grouped.loc[body,:] / media_grouped.loc[body,:].sum()
    data.sort_values(inplace=True)
    ax.bar(data.index, data.values, alpha=0.35, color=color_dict[body], label=body)
    turn_off_lines(ax)
ax.legend()
plt.setp(ax.get_xticklabels(), rotation=90)    
ax.set_ylabel('Fraction of total engagements')
ax.set_title('2019 Media engagements')
fig.set_size_inches(8,5)

### Figure 6.2 Co-occurrences of media use

This figure highlights the types of media sources respondents indicated were their favourites. The diagonal elements show the number of times respondents said the given media type was one of their favourites (darker colour is more respondents). The off-diagonal elements show pairs of media sources which respondents indicated *were both* among their favourite sources for accessing data science media. For example, reading along the Kaggle row, the colour denotes how often respondents selected both Kaggle and the other media source (denoted by the column) as favourites. This graph not only tells you that Kaggle, Blogs and YouTube are the favourite media sources (look at the diagonal elements), but the off-diagonal elements also tell you that respondents who selected Kaggle as one of their favourite media sources also liked Blogs and YouTube.
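For intuition, a co-occurrence matrix of this kind can be obtained from a 0/1 indicator table with a single matrix product. The sketch below uses hypothetical answers and is a compact alternative to a pairwise cross-tabulation loop, not a description of how this notebook computes the figure.

```python
import pandas as pd

# Hypothetical multi-select answers: 1 if the respondent picked the source, 0 otherwise
picks = pd.DataFrame(
    {"Kaggle": [1, 1, 0, 1], "Blogs": [1, 0, 0, 1], "Twitter": [0, 0, 1, 1]}
)

# Entry (i, j) counts respondents who picked both source i and source j;
# the diagonal therefore holds the total number of picks for each source
co_occurrence = picks.T.dot(picks)
print(co_occurrence)
```

The result is symmetric, which is why the heatmap mirrors itself across the diagonal.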

In [None]:
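body_series = df_media['body'].copy()
# binarise the answers: 1 if a source was selected, 0 otherwise
# (this also overwrites the 'body' column, hence the copy above)
df_media[~df_media.isna()] = 1
df_media[df_media.isna()] = 0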

fig, axes = plt.subplots(1,2)

for ax, body, cmap in zip(axes, ['EU','UK'], ['Blues','Reds']):
    mask = body_series==body
    # create the co-occurrences df
    cross_occ_df = pd.DataFrame(index=df_media.columns[1:], columns=df_media.columns[1:])
    cols = df_media.columns[1:]
    for col_1 in cols:
        for col_2 in cols:
            # calculate the co-occurrences (i.e., when a respondent said they used them both)
            cross_occ = pd.crosstab(df_media.loc[mask,col_1], df_media.loc[mask,col_2])
            try:
                cross_occ_df.loc[col_1,col_2] = cross_occ.loc[1,1]
            except KeyError:
                # no respondent selected both sources
                cross_occ_df.loc[col_1, col_2] = 0
    # plot and format the graph (cast to float because the frame is built with object dtype)
    ax.imshow(cross_occ_df.astype(float), cmap=cmap)
    ax.set_title(body)
    ax.set_xticks(np.arange(len(cols)))
    ax.set_yticks(np.arange(len(cols)))
    ax.set_xticklabels(cols, rotation=90)
    ax.set_yticklabels(cols)
    fig.set_size_inches(10,10)
    turn_off_lines(ax)
    if body=='UK':
        ax.set_yticklabels([])
        ax.spines['left'].set_visible(False)
        ax.set_yticks([])
        
plt.suptitle('Co-occurrences of media use', fontsize=15)
fig.set_size_inches(10,5)

# 7. Software use

In this section we discuss respondents' use of different software and programming languages. In our first figure (Figure 7.1) we look at the programming languages used by five of the top professions responding to the 2019 Kaggle survey. The first thing to note is that Python is clearly the language used most regularly by respondents. Secondly, we can look at specific professions. For example, Data Analysts (2nd from top) responded that they use Python, R and SQL regularly but few of them used any other languages. Comparatively, Software Engineers (2nd from bottom) used a more diverse range of languages. Data Scientists (bottom) used similar languages to Data Analysts, but a larger share of Analysts used SQL. Matlab was not used extensively, with the exception of (UK) Students and Research Scientists. The author knows from personal experience that Matlab is a favourite of UK universities for undergraduate teaching, which may explain this finding. Overall, there were only small differences in programming language use between EU and UK respondents.

In our next figure (Figure 7.2) we investigate respondents' use of different IDEs. Again we see only small differences between UK and EU respondents, but we observe quite large differences between respondents with different jobs. For example, Software Engineers (2nd from bottom) particularly favour Visual Studio and Research Scientists (top row) tend to use the most diverse range of IDEs. Overall, Jupyter, PyCharm and RStudio are the IDEs used by the most respondents.

**Key points:**
>- Python is the language most used by respondents, followed by R and SQL.
- Data Analysts have the most restricted language use.
- Jupyter, RStudio and PyCharm are the most used IDEs.
- Software Engineers use Visual Studio more than other professions do.

### Figure 7.1 Programming language by profession in 2019

In this figure we look at the programming languages respondents use on a regular basis. The size (bigger is more) and colour (darker is more) indicate the number of respondents of a given job type (row) who indicated they use the given programming language (column) on a regular basis. EU respondents (left panel) are represented by blue circles and UK respondents (right panel) are represented by red circles.

In [None]:
cols_for_analysis = ['job'] + list(yearly_dfs[0].columns[yearly_dfs[0].columns.str.contains('Q18')])
job_titles = ['Data Scientist', 'Software Engineer', 'Student', 'Data Analyst', 'Research Scientist']

fig, axes = plt.subplots(1,2, facecolor=(0.99, 0.99, 0.99), sharey=True)

for ax, body, cmap in zip(axes,['EU','UK'], ['Blues','Reds']):
    # extract the data (copy so that renaming the columns does not touch yearly_dfs[0])
    df_lang = yearly_dfs[0][cols_for_analysis].copy()
    # make the column names those of the tick box
    lang_names = yearly_dfs[0][cols_for_analysis].mode().values[0]
    df_lang.columns = ['job'] + list(lang_names[1:])
    # make a mask for the body
    body_mask = yearly_dfs[0].body==body
    
    # count number of languages used by each job type
    grouped_df = df_lang.loc[body_mask,:].groupby('job').count().loc[job_titles, :'None']
    values = grouped_df.values / grouped_df.values.sum(axis=1)[:,np.newaxis]
    # make a grid to plot on
    X, Y = np.meshgrid(np.arange(1,12), np.arange(1,len(job_titles)+1))
    
    # get and clean up the tick labels
    x_tick_labels = [x.split('(')[0].split('/')[0].strip() for x in grouped_df.columns]
    y_tick_labels = list(grouped_df.index)

    ax.scatter(x=np.ravel(X), y=np.ravel(Y), s=np.ravel(values)*2500, c=np.ravel(values), cmap=cmap)

    # tidy up axis, make ticks and lines invisible.
    plt.setp(ax.spines.values(), visible=False)
    ax.patch.set_visible(False)
    ax.grid(False)
    ax.set_yticks(np.arange(1,len(job_titles)+1))
    ax.set_yticklabels(y_tick_labels)
    ax.set_xticks(np.arange(1,12))
    ax.set_xticklabels(x_tick_labels, rotation=90)
    ax.set_title(f'{body} respondents')
plt.suptitle('Programming by Profession', fontsize=14)
    
fig.set_size_inches(15,5)

### Figure 7.2 IDE use by profession in 2019

In this figure we look at the IDEs used by respondents in five of the top professions (rows of figure). The size (bigger is more) and colour (darker is more) indicate the number of respondents of a given job type who indicated they use this IDE. 

The left panel (blue circles) corresponds to EU respondents and the right panel (red circles) corresponds to UK respondents.
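The quantity behind each circle is a row-normalized count: the number of respondents in a profession who ticked a given IDE, divided by the total number of ticks from that profession. A minimal sketch with hypothetical counts (not survey values) shows the normalization and the grid of marker positions.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of respondents per job (rows) and IDE (columns)
counts = pd.DataFrame(
    {"Jupyter": [30, 10], "RStudio": [10, 5], "PyCharm": [10, 5]},
    index=["Data Scientist", "Student"],
)

# Divide each row by its total so that professions of different sizes are comparable
row_fractions = counts.div(counts.sum(axis=1), axis=0)

# One (x, y) grid point per cell; the fraction sets both marker size and colour
X, Y = np.meshgrid(np.arange(1, counts.shape[1] + 1), np.arange(1, counts.shape[0] + 1))
sizes = row_fractions.to_numpy().ravel() * 2500
print(row_fractions)
```

Deriving the grid from `counts.shape` rather than a fixed number of columns keeps the marker positions aligned with the data if the set of IDEs changes.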

In [None]:
cols_for_analysis = ['job'] + list(yearly_dfs[0].columns[yearly_dfs[0].columns.str.contains('Q16')])
job_titles = ['Data Scientist', 'Software Engineer', 'Student', 'Data Analyst', 'Research Scientist']

fig, axes = plt.subplots(1,2, facecolor=(0.99, 0.99, 0.99), sharey=True)

for ax, body, cmap in zip(axes,['EU','UK'], ['Blues','Reds']):
    # extract the data (copy so that renaming the columns does not touch yearly_dfs[0])
    df_IDE = yearly_dfs[0][cols_for_analysis].copy()
    # make the column names those of the tick box
    ide_names = yearly_dfs[0][cols_for_analysis].mode().values[0]
    df_IDE.columns = ['job'] + list(ide_names[1:])
    # make a mask for the body
    body_mask = yearly_dfs[0].body==body
    
    # count number of ide used by each job type
    grouped_df = df_IDE.loc[body_mask,:].groupby('job').count().loc[job_titles, :'None']
    values = grouped_df.values / grouped_df.values.sum(axis=1)[:,np.newaxis]
    # make a grid to plot on
    X, Y = np.meshgrid(np.arange(1,12), np.arange(1,len(job_titles)+1))
    
    # get and clean up the tick labels
    x_tick_labels = [x.split('(')[0].split('/')[0].strip() for x in grouped_df.columns]
    y_tick_labels = list(grouped_df.index)

    ax.scatter(x=np.ravel(X), y=np.ravel(Y), s=np.ravel(values)*2500, c=np.ravel(values), cmap=cmap)

    # tidy up axis, make ticks and lines invisible.
    plt.setp(ax.spines.values(), visible=False)
    ax.patch.set_visible(False)
    ax.grid(False)
    ax.set_yticks(np.arange(1,len(job_titles)+1))
    ax.set_yticklabels(y_tick_labels)
    ax.set_xticks(np.arange(1,12))
    ax.set_xticklabels(x_tick_labels, rotation=90)
    ax.set_title(f'{body} respondents')
    
plt.suptitle('IDE use by Profession', fontsize=14)    
fig.set_size_inches(15,5)

# 8. Conclusions

We performed a high level analysis of the previous three Kaggle ML/DS surveys, focussing on comparing the responses of persons located in the UK to persons located in the other EU member states. We found that the UK and EU largely share similar problems: primarily, a large gender imbalance (Figure 3.1) in the data science and machine learning communities. We found noticeable differences between the reported compensation amounts of EU and UK respondents (Figure 5.1), and found a large gender imbalance between low and high earners both in the UK and EU (Figure 5.2). We discovered subtle differences in the degree attainment levels of respondents (Figure 4.1), with EU respondents more likely to have a higher degree but UK respondents more likely to have a PhD. We investigated the media UK and EU respondents used (Figures 6.1 and 6.2): media platform engagement was similar for both communities, but UK respondents were more likely to use Twitter to engage with data science media.

Overall we found that the UK data science community is very similar to the EU community and there are no obvious signs in the survey of negative effects on the UK data science community arising from the referendum. We found that, due to the different format of the three surveys (2017, 2018, 2019), it was difficult to draw out reliable temporal trends. Going forward, it will be important to keep a similar format to the 2019 survey such that in future years (i.e., the 2020 survey and beyond) robust temporal analysis can be performed which, for example, could look at the changes in the UK data science community post-Brexit (particularly in the event of a No deal Brexit, which would result in the largest changes to the UK and EU economies and culture).

**Five key findings:**

>1. The EU and UK data science communities have a large gender imbalance.
2. UK respondents report a higher average salary than their EU counterparts; there is a significant gender imbalance between low and high earners.
3. A larger fraction of EU respondents obtained higher degrees (Masters or PhD) but more UK respondents have PhDs.
4. Python and Jupyter are the most used programming language and IDE respectively, for both UK and EU respondents.
5. Most survey respondents are under 30: a young demographic responded to the survey.

## 8.1 How could this analysis be improved?
>- Additional years of questioning and increased sample size
- Direct comparison of countries (e.g., UK vs France)
- Comparison of clusters in the data (e.g., cluster on IDE and investigate education levels)

**Please feel free to provide constructive feedback and let me know the figures you find easiest to interpret!**