In [None]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Trends in the roles of Kagglers (2017 - 2020)

In this notebook I aim to tell a story about how the roles of Kagglers have developed over the last four years based on the results of the annual Kaggle surveys. I haven't analysed all questions across the surveys. Instead I've focused on a few areas I'm particularly interested in understanding more about.

I've been away for a while but it's good to be back amongst the Kaggle community. When I saw this competition I had grand plans to explore many aspects of these surveys and put together the mother of all notebooks. However, work, spending time with my wife and dog, running, Christmas and COD Warzone got in the way, so the scope of this notebook has shrunk somewhat. In true PhD thesis style though, I've included a further work section at the end for stuff I might get to at some point.

I think exploring the reasons behind analysis outputs is one of the most interesting aspects of working in the field of Data Science. So since we're in this no-pressure environment (i.e. no one is going to lose out financially or otherwise if I give a wildly incorrect hypothesis) I aim to provide commentary around the outputs where I can. We're all domain experts to some extent in this area so if anyone has any further insights or alternate suggestions for some of the results then I'd love to hear them.

In case you're already getting tired of my rambling, here are some headline insights to take away from this work:
* **Students** have surged from 5.9% of respondents in 2017 to 25.8% of respondents in 2020
* In 2020, 43.4% of students are based in **India** and the number of Indian students has increased almost tenfold since 2017
* **Data Scientist** is consistently the most popular occupation amongst Kagglers but has reduced in popularity by 7.3% in the last year
* Salaries vary by role. They remain largely stable in the US over the past four years, but are more changeable in other countries

## Some motivation for this work - What's my role again?

Data Science is a very broad topic, and a Data Scientist can cover a lot of roles in a company. I work for a small analytics company with a team of 3 - 4 data scientists and we cover many aspects of work in our roles, including data acquisition, product development, project management, and customer support (as well as some actual data science). When I filled in the 2020 survey and got to the current role question, it was hard to answer.

I have a Bachelor's degree in Maths, and eventually managed to complete a part time PhD whilst working full time. In terms of roles, I started as a student, my first professional role was as a research associate, then I tried 'Mathematician' for a while, and also mulled over 'Quantitative Analyst' (I found it hard to say, so dropped it). A few years into my professional career, I went to the Predictive Analytics World conference in San Francisco where Anthony Goldbloom gave the keynote speech about the Heritage Health Prize competition. In another talk I attended, a speaker (unfortunately I can't remember his name) summarised at the end of his presentation by saying - "to do well, learn Python, learn R, change your job title to Data Scientist, and ask for a pay rise". It got a big cheer. I promptly went and found my boss, who had been watching a different talk, and told him I was now a data scientist and I needed a pay rise. He laughed, and agreed.

Data Scientist has stuck since then. On my business card and email signature I'm a data scientist. But maybe I'm actually more of a product/project manager, or maybe I'm a data analyst? Maybe I'm something else? Maybe this is a common thing for Data Scientists. I'm not sure. It would be great if I could use these surveys to help me make things clearer.

## Some notes about compiling the annual survey results

I initially focused on the following areas across the surveys -
* **Role** - I tidied up the various role options across the years, hopefully without changing anyone's actual role too much. I had to look up what some actually were. There were separate questions for unemployment and student status in the 2017 survey, so these have been added to the roles to be consistent with the following years.
* **Country** - some country names were captured differently across the surveys but on the whole this was pretty clean to compile.
* **Salary** - these were a lot of fun to compile. 2019 and 2020 used the same format, but 2018 was captured in different bands, and 2017 captured individual amounts in the respondent's own currency. Thank goodness Kaggle provided a conversion table. I had to create a new set of salary bands that all the results would fit into.
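
As a rough sketch of the salary harmonisation described above (not the code this notebook actually runs), numeric 2017 amounts could be binned with `pd.cut`, and banded answers from later years could be translated with a dictionary. The names `edges`, `labels`, `band_salary` and `band_map` are made up for this illustration, and the values fed in are toy numbers rather than survey data.

```python
import numpy as np
import pandas as pd

# Band edges in USD; each band includes its upper edge, the last one is open-ended.
edges = [-np.inf, 10_000, 20_000, 30_000, 40_000, 50_000, 60_000, 70_000,
         80_000, 90_000, 100_000, 125_000, 150_000, 200_000, 250_000,
         300_000, 500_000, np.inf]
labels = ['$0-10k', '$10-20k', '$20-30k', '$30-40k', '$40-50k', '$50-60k',
          '$60-70k', '$70-80k', '$80-90k', '$90-100k', '$100-125k',
          '$125-150k', '$150-200k', '$200-250k', '$250-300k', '$300-500k',
          '$500k+']

def band_salary(usd):
    """Assign numeric USD salaries to the hybrid bands; missing values become 'Unknown'."""
    banded = pd.cut(usd, bins=edges, labels=labels, right=True)
    return banded.astype(object).where(banded.notna(), 'Unknown')

# Toy numeric amounts (2017-style), not survey data
example = pd.Series([9_500, 10_000, 10_001, 130_000, 750_000, np.nan])
print(band_salary(example).tolist())

# Banded answers (2018-style) can be translated with a lookup table instead.
# Only two of the 2018 bands are shown here; anything unmapped falls back to 'Unknown'.
band_map = {'300-400,000': '$300-500k', '400-500,000': '$300-500k'}
answers_2018 = pd.Series(['300-400,000', '400-500,000',
                          'I do not wish to disclose my approximate yearly compensation'])
banded_2018 = answers_2018.map(band_map).fillna('Unknown')
print(banded_2018.tolist())
```

Keeping the band edges and labels in one place makes it easier to check that every year lands in the same set of groups.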

In [None]:
survey17 = pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv', encoding='latin-1')
conversionrates17 = pd.read_csv('../input/kaggle-survey-2017/conversionRates.csv')
survey18 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
survey19 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
survey20 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')

In [None]:
# Extract info we initially want to use from the survey data
# 2017 survey
responses17 = survey17[['GenderSelect', 'Country', 'EmploymentStatus', 'StudentStatus', 'CurrentJobTitleSelect', 'CompensationAmount', 'CompensationCurrency']].copy() # columns; copy so the edits below don't act on a view of survey17
responses17.insert(0, 'ResponseID', range(1, 1 + len(responses17))) # add a response ID
responses17.loc[(responses17['EmploymentStatus'] == 'Not employed, but looking for work') | (responses17['EmploymentStatus'] == 'Not employed, and not looking for work'), 'CurrentJobTitleSelect'] = 'Not Employed' # add not employed to jobs list
responses17.loc[(responses17['StudentStatus'] == 'Yes'), 'CurrentJobTitleSelect'] = 'Student' # add student to jobs list
responses17.loc[(responses17.CurrentJobTitleSelect.isna()), 'CurrentJobTitleSelect'] = 'Other' # fill in job blanks with other
# these next few lines about currency took a while to make work. I'm sure there's a more efficient way to do it.
responses17sal = pd.merge(responses17, conversionrates17, how = 'inner', left_on = 'CompensationCurrency', right_on = 'originCountry')
responses17sal['CompensationAmount'] = responses17sal['CompensationAmount'].str.replace(',', '') # remove non-numeric characters
responses17sal['CompensationAmount'] = responses17sal['CompensationAmount'].str.replace('-', '')
responses17sal.loc[(responses17sal['CompensationAmount'] == ''), 'CompensationAmount'] = 0 # replace empty values with 0
#responses17['CompensationAmount'] = responses17['CompensationAmount'].fillna(0) # replace nan with 0
responses17sal['CompensationAmount'] = pd.to_numeric(responses17sal['CompensationAmount'])
responses17sal['USD_converted'] = responses17sal['CompensationAmount']*responses17sal['exchangeRate']
responses17 = pd.merge(responses17, responses17sal[['ResponseID', 'USD_converted']], how = 'left', on = 'ResponseID')
responses17 = responses17[['GenderSelect', 'Country', 'CurrentJobTitleSelect', 'USD_converted']]
responses17.columns = ['Gender', 'Country', 'JobTitle', 'Salary']
responses17['SurveyYear'] = 2017

# 2018 survey
responses18 = survey18.loc[1:,['Q1', 'Q3', 'Q6', 'Q9']]
responses18.columns = ['Gender', 'Country', 'JobTitle', 'Salary']
responses18['SurveyYear'] = 2018

# 2019 survey
responses19 = survey19.loc[1:,['Q2', 'Q3', 'Q5', 'Q10']]
responses19.columns = ['Gender', 'Country', 'JobTitle', 'Salary']
responses19['SurveyYear'] = 2019

# 2020 survey
responses20 = survey20.loc[1:,['Q2', 'Q3', 'Q5', 'Q24']]
responses20.columns = ['Gender', 'Country', 'JobTitle', 'Salary']
responses20['SurveyYear'] = 2020

In [None]:
# the salary bands need a fair amount of work to make them all comparable. Hurrah.
# 2017 are individual amounts so will need to be grouped
# 2018 has 19 salary bands
# 2019 and 2020 have 25 salary bands
# no year provides groupings that satisfy the other years so we'll have to create a hybrid salary grouping based on groups across 2018 - 2020

# 2017
responses17['SalaryGrp'] = 'Unknown'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] <= 10000), 'SalaryGrp'] = '$0-10k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 10000) & (responses17['Salary'] <= 20000), 'SalaryGrp'] = '$10-20k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 20000) & (responses17['Salary'] <= 30000), 'SalaryGrp'] = '$20-30k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 30000) & (responses17['Salary'] <= 40000), 'SalaryGrp'] = '$30-40k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 40000) & (responses17['Salary'] <= 50000), 'SalaryGrp'] = '$40-50k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 50000) & (responses17['Salary'] <= 60000), 'SalaryGrp'] = '$50-60k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 60000) & (responses17['Salary'] <= 70000), 'SalaryGrp'] = '$60-70k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 70000) & (responses17['Salary'] <= 80000), 'SalaryGrp'] = '$70-80k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 80000) & (responses17['Salary'] <= 90000), 'SalaryGrp'] = '$80-90k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 90000) & (responses17['Salary'] <= 100000), 'SalaryGrp'] = '$90-100k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 100000) & (responses17['Salary'] <= 125000), 'SalaryGrp'] = '$100-125k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 125000) & (responses17['Salary'] <= 150000), 'SalaryGrp'] = '$125-150k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 150000) & (responses17['Salary'] <= 200000), 'SalaryGrp'] = '$150-200k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 200000) & (responses17['Salary'] <= 250000), 'SalaryGrp'] = '$200-250k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 250000) & (responses17['Salary'] <= 300000), 'SalaryGrp'] = '$250-300k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 300000) & (responses17['Salary'] <= 500000), 'SalaryGrp'] = '$300-500k'
responses17.loc[(responses17['SurveyYear'] == 2017) & (responses17['Salary'] > 500000), 'SalaryGrp'] = '$500k+'

# 2018
responses18['SalaryGrp'] = 'Unknown'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == 'I do not wish to disclose my approximate yearly compensation'), 'SalaryGrp'] = 'Unknown'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '0-10,000'), 'SalaryGrp'] = '$0-10k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '10-20,000'), 'SalaryGrp'] = '$10-20k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '20-30,000'), 'SalaryGrp'] = '$20-30k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '30-40,000'), 'SalaryGrp'] = '$30-40k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '40-50,000'), 'SalaryGrp'] = '$40-50k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '50-60,000'), 'SalaryGrp'] = '$50-60k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '60-70,000'), 'SalaryGrp'] = '$60-70k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '70-80,000'), 'SalaryGrp'] = '$70-80k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '80-90,000'), 'SalaryGrp'] = '$80-90k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '90-100,000'), 'SalaryGrp'] = '$90-100k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '100-125,000'), 'SalaryGrp'] = '$100-125k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '125-150,000'), 'SalaryGrp'] = '$125-150k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '150-200,000'), 'SalaryGrp'] = '$150-200k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '200-250,000'), 'SalaryGrp'] = '$200-250k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '250-300,000'), 'SalaryGrp'] = '$250-300k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '300-400,000'), 'SalaryGrp'] = '$300-500k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '400-500,000'), 'SalaryGrp'] = '$300-500k'
responses18.loc[(responses18['SurveyYear'] == 2018) & (responses18['Salary'] == '500,000+'), 'SalaryGrp'] = '$500k+'

# 2019
responses19['SalaryGrp'] = 'Unknown'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '$0-999'), 'SalaryGrp'] = '$0-10k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '1,000-1,999'), 'SalaryGrp'] = '$0-10k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '2,000-2,999'), 'SalaryGrp'] = '$0-10k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '3,000-3,999'), 'SalaryGrp'] = '$0-10k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '4,000-4,999'), 'SalaryGrp'] = '$0-10k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '5,000-7,499'), 'SalaryGrp'] = '$0-10k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '7,500-9,999'), 'SalaryGrp'] = '$0-10k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '10,000-14,999'), 'SalaryGrp'] = '$10-20k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '15,000-19,999'), 'SalaryGrp'] = '$10-20k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '20,000-24,999'), 'SalaryGrp'] = '$20-30k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '25,000-29,999'), 'SalaryGrp'] = '$20-30k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '30,000-39,999'), 'SalaryGrp'] = '$30-40k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '40,000-49,999'), 'SalaryGrp'] = '$40-50k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '50,000-59,999'), 'SalaryGrp'] = '$50-60k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '60,000-69,999'), 'SalaryGrp'] = '$60-70k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '70,000-79,999'), 'SalaryGrp'] = '$70-80k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '80,000-89,999'), 'SalaryGrp'] = '$80-90k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '90,000-99,999'), 'SalaryGrp'] = '$90-100k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '100,000-124,999'), 'SalaryGrp'] = '$100-125k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '125,000-149,999'), 'SalaryGrp'] = '$125-150k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '150,000-199,999'), 'SalaryGrp'] = '$150-200k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '200,000-249,999'), 'SalaryGrp'] = '$200-250k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '250,000-299,999'), 'SalaryGrp'] = '$250-300k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '300,000-500,000'), 'SalaryGrp'] = '$300-500k'
responses19.loc[(responses19['SurveyYear'] == 2019) & (responses19['Salary'] == '> $500,000'), 'SalaryGrp'] = '$500k+'

# 2020
responses20['SalaryGrp'] = 'Unknown'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '$0-999'), 'SalaryGrp'] = '$0-10k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '1,000-1,999'), 'SalaryGrp'] = '$0-10k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '2,000-2,999'), 'SalaryGrp'] = '$0-10k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '3,000-3,999'), 'SalaryGrp'] = '$0-10k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '4,000-4,999'), 'SalaryGrp'] = '$0-10k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '5,000-7,499'), 'SalaryGrp'] = '$0-10k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '7,500-9,999'), 'SalaryGrp'] = '$0-10k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '10,000-14,999'), 'SalaryGrp'] = '$10-20k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '15,000-19,999'), 'SalaryGrp'] = '$10-20k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '20,000-24,999'), 'SalaryGrp'] = '$20-30k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '25,000-29,999'), 'SalaryGrp'] = '$20-30k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '30,000-39,999'), 'SalaryGrp'] = '$30-40k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '40,000-49,999'), 'SalaryGrp'] = '$40-50k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '50,000-59,999'), 'SalaryGrp'] = '$50-60k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '60,000-69,999'), 'SalaryGrp'] = '$60-70k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '70,000-79,999'), 'SalaryGrp'] = '$70-80k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '80,000-89,999'), 'SalaryGrp'] = '$80-90k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '90,000-99,999'), 'SalaryGrp'] = '$90-100k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '100,000-124,999'), 'SalaryGrp'] = '$100-125k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '125,000-149,999'), 'SalaryGrp'] = '$125-150k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '150,000-199,999'), 'SalaryGrp'] = '$150-200k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '200,000-249,999'), 'SalaryGrp'] = '$200-250k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '250,000-299,999'), 'SalaryGrp'] = '$250-300k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '300,000-500,000'), 'SalaryGrp'] = '$300-500k'
responses20.loc[(responses20['SurveyYear'] == 2020) & (responses20['Salary'] == '> $500,000'), 'SalaryGrp'] = '$500k+'

# append the four sets of responses together
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
responses = pd.concat([responses17, responses18, responses19, responses20])
responses['SurveyYear'] = responses['SurveyYear'].astype(str)

In [None]:
# Tidy up Job Title column
responses['JobTitle'] = responses.JobTitle.replace(['Software Developer/Software Engineer', 'Software Engineer'], 'Software Developer')
responses['JobTitle'] = responses.JobTitle.replace(['Researcher', 'Research Assistant', 'Scientist/Researcher'], 'Research Scientist')
responses['JobTitle'] = responses.JobTitle.replace(['Manager'], 'Product/Project Manager')
responses['JobTitle'] = responses.JobTitle.replace(['Not employed', 'Currently not employed'], 'Not Employed')
responses['JobTitle'] = responses.JobTitle.replace(['Engineer'], 'Data Engineer')
responses['JobTitle'] = responses['JobTitle'].fillna('Other')

In [None]:
# Tidy up Country column
responses['Country'] = responses.Country.replace(["Hong Kong (S.A.R.)"], 'Hong Kong')
responses['Country'] = responses.Country.replace(["Iran, Islamic Republic of..."], 'Iran')
responses['Country'] = responses.Country.replace(["People 's Republic of China", "Republic of China"], 'China')
responses['Country'] = responses.Country.replace(["Republic of Korea"], 'South Korea')
responses['Country'] = responses.Country.replace(["United Kingdom of Great Britain and Northern Ireland", "United Kingdom"], 'UK')
responses['Country'] = responses.Country.replace(["United States", "United States of America"], 'USA')
responses['Country'] = responses.Country.replace(["Viet Nam"], 'Vietnam')
responses['Country'] = responses.Country.replace(["United Arab Emirates"], 'UAE')
responses['Country'] = responses.Country.replace(["Other", "I do not wish to disclose my location"], 'Unknown')
responses['Country'] = responses.Country.fillna('Unknown')

In [None]:
# create a function to speed up repetitive steps
def surveysum(table, colnames):
    """Count responses for each combination of values in the given columns."""
    survsum = table[colnames].value_counts().reset_index()
    # build a new list so the caller's colnames list is not modified
    survsum.columns = list(colnames) + ['Responses']
    return survsum

# Survey Summary
Very quickly, let's check how much data we're working with.

In [None]:
# summarise responses by survey year
survyearsum = surveysum(responses, ['SurveyYear'])

# plot survey year responses
colors = ['#009db6', '#45025c', '#005db6', '#11d16e'] # based on index

fig = go.Figure(go.Bar(x = survyearsum['SurveyYear'], y = survyearsum['Responses'], marker_color = colors))
fig.update_xaxes(type='category')
fig.update_layout(height = 500,
                  width = 600,
                  title = 'Survey Responses by Year',
                  xaxis_title = "Survey Year",
                  yaxis_title = "Responses",
                  xaxis = {'categoryorder':'array', 'categoryarray':['2017', '2018', '2019', '2020']})

fig.show()

2018 was quite a year for the Kaggle survey. 2019 and 2020 responses have been a bit lower but stable since then. In short, there's a good amount of responses.

## Kagglers' Role Titles

Now let's move on to something more specific and interesting. What are everyone's roles?

In [None]:
# summarise job titles by survey year
jobsummary = surveysum(responses, ['JobTitle', 'SurveyYear'])

# join summaries and calculate %
jobsummary = pd.merge(jobsummary, survyearsum, on='SurveyYear')
jobsummary['ResponsePrcnt'] = jobsummary['Responses_x']/jobsummary['Responses_y']

In [None]:
jobsummary17 = jobsummary[jobsummary['SurveyYear'] == '2017']
jobsummary18 = jobsummary[jobsummary['SurveyYear'] == '2018']
jobsummary19 = jobsummary[jobsummary['SurveyYear'] == '2019']
jobsummary20 = jobsummary[jobsummary['SurveyYear'] == '2020']

fig = go.Figure(go.Bar(x=jobsummary17['JobTitle'], y=jobsummary17['ResponsePrcnt'], name='2017', marker_color='#11d16e'))
fig.add_trace(go.Bar(x=jobsummary18['JobTitle'], y=jobsummary18['ResponsePrcnt'], name='2018', marker_color='#009db6'))
fig.add_trace(go.Bar(x=jobsummary19['JobTitle'], y=jobsummary19['ResponsePrcnt'], name='2019', marker_color='#005db6'))
fig.add_trace(go.Bar(x=jobsummary20['JobTitle'], y=jobsummary20['ResponsePrcnt'], name='2020', marker_color='#45025c'))

fig.update_layout(barmode='group', 
                  #height = 700,
                  #width = 1100,
                  xaxis_tickangle = -45,
                  yaxis = dict(tickformat = ".1%"),
                  legend=dict(yanchor = "top", y = 0.98, xanchor = "right", x = 0.99),
                  title = "Survey responses for Role (2017 - 2020)",
                  xaxis_title = "Selected Role",
                  yaxis_title = "Response Percentage",
                  legend_title = "Survey Year",
                  #xaxis = {'categoryorder':'total descending'},)
                  xaxis = {'categoryorder':'array', 'categoryarray':['Student', 'Data Scientist', 'Other', 'Software Developer',
                                                                    'Not Employed', 'Data Analyst', 'Research Scientist',
                                                                    'Machine Learning Engineer', 'Business Analyst', 'Product/Project Manager',
                                                                    'Data Engineer', 'Statistician', 'DBA/Database Engineer',
                                                                    'Consultant', 'Chief Officer', 'Developer Advocate',
                                                                    'Marketing Analyst', 'Salesperson', 'Principal Investigator',
                                                                    'Data Journalist', 'Programmer', 'Computer Scientist',
                                                                    'Predictive Modeler', 'Data Miner', 'Operations Research Practitioner']})
# roles ordered by responses in 2020, then responses in 2019 for any not in 2020, and so on.

fig.show()

There's a lot to digest in this plot. Let me put my discussion cap on.
* Students rising! Student Kagglers dominate the most recent survey by some way, increasing from 20.4% to 25.8% of responses from the previous year. Kaggle is clearly an increasingly valuable resource for students. Remarkably, students made up only 5.9% of responses back in 2017. There are probably a number of factors contributing to the huge increase since then. Speculating on the spot - reasons could include increasing demand for data specialists in industry, general increased availability of data science degrees and courses, and of course, Kaggle continuing to roll out their own free learning courses (which are great, by the way).
* Survey design could have affected how fairly the results can be compared across years. In particular, the 2017 survey was rather different to the following surveys in terms of design and the focus of questions. There were far more questions asking for the respondent's opinion - topics like job title accuracy, job satisfaction, most important job factors, etc. Potentially Kaggle were getting to know who their users were and what they value, whereas the more recent surveys are more black and white - what do you do, what do you use, etc. I imagine Kaggle will use this information to make sure their products are calibrated to the skills and tools of most of their users, which makes sense. Of course, Kaggle may just have had some feedback after 2017 saying the survey was a bit long and hard to answer, and so they simplified it a bit. Specifically though, on the subject of job titles, employment status and student status were individual questions in 2017. 13.7% of people responded as 'Not working' when asked about their employment status, compared to 3.5% in 2018. Perhaps this is where some unnamed students could have been hiding? *Note: after 2017, both these questions were removed and 'Student' and 'Not Employed' became job title options.*
* Data Scientist is consistently the most common professional role, although considerably fewer people selected it as their job title this year. Is being a Data Scientist no longer cool? Or is it increasingly ok and appreciated to be something else instead?
* A lot of people have job titles outside the selection options (or would prefer not to say what their job title is). Kaggle haven't provided any output from the free-form text responses, so we won't go any further into what these other roles are.
* The share of respondents indicating they are Not Employed is on the rise - up from 4.8% to 8.2% in 2020. Again, there could be several factors affecting this. Increases in student Kagglers mean there will presumably be increases in Kagglers transitioning from education to professional careers, with potentially some amount of time in between where they are not employed. Kaggle's increased offering of learning courses and dataset access is probably also increasingly attractive to those currently not working who are looking to boost their skills and job prospects. Of course, another more sobering potential reason for increased unemployment is the year that is 2020 and the devastating global impacts of covid. All the best to anyone impacted by covid in any way, health, career or other, I hope you get back up to speed soon.
* The job selection options change a fair amount between surveys. Some of the more peripheral job selections aren't given every year, with many only given once. The Kaggle guys are potentially tweaking the survey options based on their own internal priorities, or someone different is writing the survey each year ;). The job title options certainly aren't determined purely on previous popularity though, with roles such as Consultant, Programmer and Computer Scientist having far more responses in the years they were options than roles such as Statistician or the very rare Database Engineer. I imagine since roles like Statisticians and DB Engineers are fundamental to the core topic of data science, they remain in the survey regardless of the number of responses, whereas some of the others may be a bit vague. Who knows. Machine Learning Engineers are celebrating though! Dropped after 2017 before being reinstated in 2020 and picking up a very respectable 5.4% of responses. Maybe some digging in the Other category might highlight more about what is going on here.
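
The percentages discussed above are each role's count divided by that year's total responses. As a minimal sketch of the same calculation (an alternative to the merge used in this notebook, not a replacement for it), `pd.crosstab` with `normalize='index'` gives the per-year shares directly. The `toy` table below is invented for illustration and is not survey data.

```python
import pandas as pd

# Toy responses, not survey data
toy = pd.DataFrame({
    'SurveyYear': ['2019', '2019', '2019', '2019',
                   '2020', '2020', '2020', '2020', '2020'],
    'JobTitle': ['Student', 'Data Scientist', 'Data Scientist', 'Other',
                 'Student', 'Student', 'Data Scientist', 'Other', 'Other'],
})

# Each row sums to 1: the share of that year's responses per role
share = pd.crosstab(toy['SurveyYear'], toy['JobTitle'], normalize='index')

# Change in share between the two years, in percentage points
change_pp = (share.loc['2020'] - share.loc['2019']) * 100

print(share.round(3))
print(change_pp.round(1))
```

Working in percentage points keeps a change in share distinct from a relative change in the number of responses.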

My brain's hurting now from thinking, so let's produce some more plots.

First up, where on earth (literally) are those students coming from?

In [None]:
# create table just for student responses
students = responses[responses['JobTitle'] == 'Student']

# summarise student responses by survey year
studentsummary = surveysum(students, ['SurveyYear'])

# summarise students by country and survey year
studentcountrysummary = surveysum(students, ['SurveyYear', 'Country'])

# join summaries and calculate %
studentcountrysummary = pd.merge(studentcountrysummary, studentsummary, on = 'SurveyYear')
studentcountrysummary['StudentResponsePrcnt'] = studentcountrysummary['Responses_x']/studentcountrysummary['Responses_y']

In [None]:
# plot country percentage for student responses
studentcountrysummary17 = studentcountrysummary[studentcountrysummary['SurveyYear'] == '2017']
studentcountrysummary18 = studentcountrysummary[studentcountrysummary['SurveyYear'] == '2018']
studentcountrysummary19 = studentcountrysummary[studentcountrysummary['SurveyYear'] == '2019']
studentcountrysummary20 = studentcountrysummary[studentcountrysummary['SurveyYear'] == '2020']

fig = go.Figure(go.Bar(x = studentcountrysummary17['Country'], y = studentcountrysummary17['StudentResponsePrcnt'], name='2017', marker_color='#11d16e'))
fig.add_trace(go.Bar(x = studentcountrysummary18['Country'], y = studentcountrysummary18['StudentResponsePrcnt'], name='2018', marker_color='#009db6'))
fig.add_trace(go.Bar(x = studentcountrysummary19['Country'], y = studentcountrysummary19['StudentResponsePrcnt'], name='2019', marker_color='#005db6'))
fig.add_trace(go.Bar(x = studentcountrysummary20['Country'], y = studentcountrysummary20['StudentResponsePrcnt'], name='2020', marker_color='#45025c'))

#fig = go.Figure(go.Bar(x = studentcountrysummary17['Country'], y = studentcountrysummary17['Responses_x'], name='2017', marker_color='#11d16e'))
#fig.add_trace(go.Bar(x = studentcountrysummary18['Country'], y = studentcountrysummary18['Responses_x'], name='2018', marker_color='#009db6'))
#fig.add_trace(go.Bar(x = studentcountrysummary19['Country'], y = studentcountrysummary19['Responses_x'], name='2019', marker_color='#005db6'))
#fig.add_trace(go.Bar(x = studentcountrysummary20['Country'], y = studentcountrysummary20['Responses_x'], name='2020', marker_color='#45025c'))

fig.update_layout(barmode='group', 
                  #height = 700,
                  xaxis_tickangle = -45,
                  yaxis = dict(tickformat=".1%"),
                  title = "Student responses by Country and Survey Year (Top 10 Countries by share of student responses)",
                  xaxis_title = "Country",
                  yaxis_title = "Percentage of students in survey year",
                  legend_title = "Survey Year",
                  xaxis = {'categoryorder':'total descending'},)
fig.update_xaxes(range=(-.5, 9.5))

fig.show()

It's pretty clear what's going on: almost half of all Kaggle students are based in India, up from 24.5% of students in 2017 to 43.4% of students in 2020. In terms of raw counts, the number of students in India completing the Kaggle survey has gone up roughly tenfold in four years, from 240 responses in 2017 to 2242 responses in 2020. If these figures are representative of the wider Kaggle population then this is remarkable, and great news for both India and Kaggle. Again, the rollout of free learning resources could be a contributing factor to this.
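As a quick sanity check on the figures quoted above, the growth in Indian student responses and the change in India's share can be worked out directly. This is just arithmetic on the numbers in the text; the variable names are my own.

```python
# Figures quoted in the text above (Indian student responses and share of students)
india_students_2017 = 240
india_students_2020 = 2242
india_share_2017 = 0.245
india_share_2020 = 0.434

# Growth factor in raw responses
growth_factor = india_students_2020 / india_students_2017

# Change in share, expressed in percentage points
share_change_pp = (india_share_2020 - india_share_2017) * 100

print(f"Growth in responses: x{growth_factor:.1f}")
print(f"Change in share: {share_change_pp:+.1f} percentage points")
```

So the "ten times" figure is a slight round-up (the ratio is closer to 9.3), while the share of students has risen by just under 19 percentage points.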

Student responses are much lower everywhere else. Interestingly, responses from the US and China have dropped considerably (the US had 856 student responses in 2018, down to 353 in 2020; China went from 715 to 209 over the same years), while other countries have generally remained stable. The responses from India certainly skew the results: because India's growth inflates the total number of student responses, a country whose proportion of students is stable across years will actually have seen its raw student responses rise in the latest year.
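To put rough numbers on this, here is a small sketch. The US and China counts are the ones quoted above; the "stable share" example uses made-up totals purely to illustrate how a constant proportion can hide a rising count. The `pct_change` helper is my own, not something from the notebook.

```python
def pct_change(old, new):
    """Relative change from old to new, as a fraction."""
    return (new - old) / old

# Student responses quoted in the text (2018 -> 2020)
us_change = pct_change(856, 353)
china_change = pct_change(715, 209)
print(f"US: {us_change:.1%}, China: {china_change:.1%}")

# Hypothetical illustration: a country holding a stable 2% share of students
# while the total number of student responses grows (totals are made up).
stable_share = 0.02
hypothetical_totals = {"earlier year": 1000, "later year": 5000}
stable_counts = {year: stable_share * total
                 for year, total in hypothetical_totals.items()}
for year, count in stable_counts.items():
    print(f"{year}: {count:.0f} responses at a {stable_share:.0%} share")
```

With the quoted figures, US student responses fell by a little under 60% and Chinese student responses by about 70% between 2018 and 2020.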

There is probably a whole notebook to write on the trends of Indian (and American) students, but I'll leave this for now.

# Salaries

Role salaries are another fascinating aspect of the survey. I think the most insight can be gained from reviewing the salaries of roles for individual countries, so I'll pick a few to look at.

In [None]:
salaries = responses[(responses['SalaryGrp'] != 'Unknown') & (responses['JobTitle'] == 'Data Scientist') & (responses['Country'] == 'USA')]

salarysummary = surveysum(salaries, ['SurveyYear', 'SalaryGrp'])

survyearsalsum = surveysum(salaries, ['SurveyYear'])

# join summaries and calculate %
salarysummary = pd.merge(salarysummary, survyearsalsum, on = 'SurveyYear')
salarysummary['ResponsePrcnt'] = salarysummary['Responses_x']/salarysummary['Responses_y']

salarysummary17 = salarysummary[salarysummary['SurveyYear'] == '2017']
salarysummary18 = salarysummary[salarysummary['SurveyYear'] == '2018']
salarysummary19 = salarysummary[salarysummary['SurveyYear'] == '2019']
salarysummary20 = salarysummary[salarysummary['SurveyYear'] == '2020']

#fig = go.Figure(go.Bar(x=salarysummary17['SalaryGrp'], y=salarysummary17['Responses_x'], name='2017', marker_color='#11d16e'))
#fig.add_trace(go.Bar(x=salarysummary18['SalaryGrp'], y=salarysummary18['Responses_x'], name='2018', marker_color='#009db6'))
#fig.add_trace(go.Bar(x=salarysummary19['SalaryGrp'], y=salarysummary19['Responses_x'], name='2019', marker_color='#005db6'))
#fig.add_trace(go.Bar(x=salarysummary20['SalaryGrp'], y=salarysummary20['Responses_x'], name='2020', marker_color='#45025c'))

fig = go.Figure(go.Bar(x=salarysummary17['SalaryGrp'], y=salarysummary17['ResponsePrcnt'], name='2017', marker_color='#11d16e'))
fig.add_trace(go.Bar(x=salarysummary18['SalaryGrp'], y=salarysummary18['ResponsePrcnt'], name='2018', marker_color='#009db6'))
fig.add_trace(go.Bar(x=salarysummary19['SalaryGrp'], y=salarysummary19['ResponsePrcnt'], name='2019', marker_color='#005db6'))
fig.add_trace(go.Bar(x=salarysummary20['SalaryGrp'], y=salarysummary20['ResponsePrcnt'], name='2020', marker_color='#45025c'))

fig.update_layout(barmode='group', 
                  #height = 700,
                  #width = 1100,
                  xaxis_tickangle = -45,
                  yaxis = dict(tickformat = ".1%"),
                  legend=dict(yanchor = "top", y = 0.98, xanchor = "right", x = 0.99),
                  title = "Data Scientist Salaries in the United States",
                  xaxis_title = "Salary",
                  yaxis_title = "Percentage of responses",
                  legend_title = "Survey Year",
                  #xaxis = {'categoryorder':'total descending'},)
                  xaxis = {'categoryorder':'array', 'categoryarray':['$0-10k', '$10-20k', '$20-30k', '$30-40k',
                                                                    '$40-50k', '$50-60k', '$60-70k', '$70-80k',
                                                                    '$80-90k', '$90-100k', '$100-125k', '$125-150k',
                                                                    '$150-200k', '$200-250k', '$250-300k', '$300-500k',
                                                                    '$500k+',]})

fig.show()

There are clear salary bands for Data Scientists in the US, generally between $100k and $200k. There are some smaller trends suggesting that salaries may be increasing. This clear distribution of salaries could be due to the opportunities and competition among the numerous tech companies in places like Silicon Valley. I'm sure US-based Kagglers would know more on this!
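The percentages plotted above come from dividing each salary band's count by the total for that survey year. Below is a minimal sketch of that calculation using a plain `groupby` on a tiny made-up DataFrame, rather than the notebook's own `surveysum` helper and merge. The toy data and the `salary_shares` function name are assumptions for illustration only.

```python
import pandas as pd

def salary_shares(df):
    """Share of responses per salary band within each survey year."""
    counts = (df.groupby(['SurveyYear', 'SalaryGrp'])
                .size()
                .reset_index(name='Responses'))
    # Divide each band's count by the total for its survey year
    year_totals = counts.groupby('SurveyYear')['Responses'].transform('sum')
    counts['ResponsePrcnt'] = counts['Responses'] / year_totals
    return counts

# Made-up responses, purely for illustration (not survey data)
toy = pd.DataFrame({
    'SurveyYear': ['2019', '2019', '2019', '2020', '2020'],
    'SalaryGrp': ['$100-125k', '$100-125k', '$125-150k', '$100-125k', '$125-150k'],
})
shares = salary_shares(toy)
print(shares)
```

Using `transform('sum')` keeps everything in one table, so there are no `Responses_x` / `Responses_y` columns to keep track of. Whether that is tidier than the merge approach is a matter of taste.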

In [None]:
salaries = responses[(responses['SalaryGrp'] != 'Unknown') & (responses['JobTitle'] == 'Data Analyst') & (responses['Country'] == 'USA')]

salarysummary = surveysum(salaries, ['SurveyYear', 'SalaryGrp'])

survyearsalsum = surveysum(salaries, ['SurveyYear'])

# join summaries and calculate %
salarysummary = pd.merge(salarysummary, survyearsalsum, on = 'SurveyYear')
salarysummary['ResponsePrcnt'] = salarysummary['Responses_x']/salarysummary['Responses_y']

salarysummary17 = salarysummary[salarysummary['SurveyYear'] == '2017']
salarysummary18 = salarysummary[salarysummary['SurveyYear'] == '2018']
salarysummary19 = salarysummary[salarysummary['SurveyYear'] == '2019']
salarysummary20 = salarysummary[salarysummary['SurveyYear'] == '2020']

#fig = go.Figure(go.Bar(x=salarysummary17['SalaryGrp'], y=salarysummary17['Responses_x'], name='2017', marker_color='#11d16e'))
#fig.add_trace(go.Bar(x=salarysummary18['SalaryGrp'], y=salarysummary18['Responses_x'], name='2018', marker_color='#009db6'))
#fig.add_trace(go.Bar(x=salarysummary19['SalaryGrp'], y=salarysummary19['Responses_x'], name='2019', marker_color='#005db6'))
#fig.add_trace(go.Bar(x=salarysummary20['SalaryGrp'], y=salarysummary20['Responses_x'], name='2020', marker_color='#45025c'))

fig = go.Figure(go.Bar(x=salarysummary17['SalaryGrp'], y=salarysummary17['ResponsePrcnt'], name='2017', marker_color='#11d16e'))
fig.add_trace(go.Bar(x=salarysummary18['SalaryGrp'], y=salarysummary18['ResponsePrcnt'], name='2018', marker_color='#009db6'))
fig.add_trace(go.Bar(x=salarysummary19['SalaryGrp'], y=salarysummary19['ResponsePrcnt'], name='2019', marker_color='#005db6'))
fig.add_trace(go.Bar(x=salarysummary20['SalaryGrp'], y=salarysummary20['ResponsePrcnt'], name='2020', marker_color='#45025c'))

fig.update_layout(barmode='group', 
                  #height = 700,
                  #width = 1100,
                  xaxis_tickangle = -45,
                  yaxis = dict(tickformat = ".1%"),
                  legend=dict(yanchor = "top", y = 0.98, xanchor = "right", x = 0.99),
                  title = "Data Analyst Salaries in the United States",
                  xaxis_title = "Salary",
                  yaxis_title = "Percentage of responses",
                  legend_title = "Survey Year",
                  #xaxis = {'categoryorder':'total descending'},)
                  xaxis = {'categoryorder':'array', 'categoryarray':['$0-10k', '$10-20k', '$20-30k', '$30-40k',
                                                                    '$40-50k', '$50-60k', '$60-70k', '$70-80k',
                                                                    '$80-90k', '$90-100k', '$100-125k', '$125-150k',
                                                                    '$150-200k', '$200-250k', '$250-300k', '$300-500k',
                                                                    '$500k+',]})

fig.show()

Data Analysts are generally paid less than Data Scientists in the US. However, there are clear signs that Analysts have been paid more in recent years.
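One way to put a number on "paid less" would be to compare the salary band containing the median response for each role. This is not something I've done in the plots above; the sketch below just shows the idea on made-up shares, and the `median_band` function is hypothetical.

```python
# Ordered salary bands, matching the x-axis ordering used in the plots
BANDS = ['$0-10k', '$10-20k', '$20-30k', '$30-40k', '$40-50k', '$50-60k',
         '$60-70k', '$70-80k', '$80-90k', '$90-100k', '$100-125k',
         '$125-150k', '$150-200k', '$200-250k', '$250-300k', '$300-500k',
         '$500k+']

def median_band(shares):
    """Return the band where the cumulative share first reaches 50%.

    `shares` maps a band label to its share of responses (summing to ~1).
    """
    cumulative = 0.0
    for band in BANDS:
        cumulative += shares.get(band, 0.0)
        if cumulative >= 0.5:
            return band
    return None

# Hypothetical shares for two roles (not survey results)
role_a = {'$60-70k': 0.3, '$70-80k': 0.3, '$100-125k': 0.4}
role_b = {'$90-100k': 0.2, '$100-125k': 0.4, '$150-200k': 0.4}

print(median_band(role_a), median_band(role_b))
```

Because the salaries are binned, this only locates the median to within a band, but it would be enough to compare roles or track a role across survey years.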

In [None]:
salaries = responses[(responses['SalaryGrp'] != 'Unknown') & (responses['JobTitle'] == 'Software Developer') & (responses['Country'] == 'USA')]

salarysummary = surveysum(salaries, ['SurveyYear', 'SalaryGrp'])

survyearsalsum = surveysum(salaries, ['SurveyYear'])

# join summaries and calculate %
salarysummary = pd.merge(salarysummary, survyearsalsum, on = 'SurveyYear')
salarysummary['ResponsePrcnt'] = salarysummary['Responses_x']/salarysummary['Responses_y']

salarysummary17 = salarysummary[salarysummary['SurveyYear'] == '2017']
salarysummary18 = salarysummary[salarysummary['SurveyYear'] == '2018']
salarysummary19 = salarysummary[salarysummary['SurveyYear'] == '2019']
salarysummary20 = salarysummary[salarysummary['SurveyYear'] == '2020']

#fig = go.Figure(go.Bar(x=salarysummary17['SalaryGrp'], y=salarysummary17['Responses_x'], name='2017', marker_color='#11d16e'))
#fig.add_trace(go.Bar(x=salarysummary18['SalaryGrp'], y=salarysummary18['Responses_x'], name='2018', marker_color='#009db6'))
#fig.add_trace(go.Bar(x=salarysummary19['SalaryGrp'], y=salarysummary19['Responses_x'], name='2019', marker_color='#005db6'))
#fig.add_trace(go.Bar(x=salarysummary20['SalaryGrp'], y=salarysummary20['Responses_x'], name='2020', marker_color='#45025c'))

fig = go.Figure(go.Bar(x=salarysummary17['SalaryGrp'], y=salarysummary17['ResponsePrcnt'], name='2017', marker_color='#11d16e'))
fig.add_trace(go.Bar(x=salarysummary18['SalaryGrp'], y=salarysummary18['ResponsePrcnt'], name='2018', marker_color='#009db6'))
fig.add_trace(go.Bar(x=salarysummary19['SalaryGrp'], y=salarysummary19['ResponsePrcnt'], name='2019', marker_color='#005db6'))
fig.add_trace(go.Bar(x=salarysummary20['SalaryGrp'], y=salarysummary20['ResponsePrcnt'], name='2020', marker_color='#45025c'))

fig.update_layout(barmode='group', 
                  #height = 700,
                  #width = 1100,
                  xaxis_tickangle = -45,
                  yaxis = dict(tickformat = ".1%"),
                  legend=dict(yanchor = "top", y = 0.98, xanchor = "right", x = 0.99),
                  title = "Software Developer Salaries in the United States",
                  xaxis_title = "Salary",
                  yaxis_title = "Percentage of responses",
                  legend_title = "Survey Year",
                  #xaxis = {'categoryorder':'total descending'},)
                  xaxis = {'categoryorder':'array', 'categoryarray':['$0-10k', '$10-20k', '$20-30k', '$30-40k',
                                                                    '$40-50k', '$50-60k', '$60-70k', '$70-80k',
                                                                    '$80-90k', '$90-100k', '$100-125k', '$125-150k',
                                                                    '$150-200k', '$200-250k', '$250-300k', '$300-500k',
                                                                    '$500k+',]})

fig.show()

Software Developers get paid similar amounts to Data Scientists in the US. Whilst there are small changes, these salaries look largely stable.

In [None]:
salaries = responses[(responses['SalaryGrp'] != 'Unknown') & (responses['JobTitle'] == 'Data Scientist') & (responses['Country'] == 'UK')]

salarysummary = surveysum(salaries, ['SurveyYear', 'SalaryGrp'])

survyearsalsum = surveysum(salaries, ['SurveyYear'])

# join summaries and calculate %
salarysummary = pd.merge(salarysummary, survyearsalsum, on = 'SurveyYear')
salarysummary['ResponsePrcnt'] = salarysummary['Responses_x']/salarysummary['Responses_y']

salarysummary17 = salarysummary[salarysummary['SurveyYear'] == '2017']
salarysummary18 = salarysummary[salarysummary['SurveyYear'] == '2018']
salarysummary19 = salarysummary[salarysummary['SurveyYear'] == '2019']
salarysummary20 = salarysummary[salarysummary['SurveyYear'] == '2020']

#fig = go.Figure(go.Bar(x=salarysummary17['SalaryGrp'], y=salarysummary17['Responses_x'], name='2017', marker_color='#11d16e'))
#fig.add_trace(go.Bar(x=salarysummary18['SalaryGrp'], y=salarysummary18['Responses_x'], name='2018', marker_color='#009db6'))
#fig.add_trace(go.Bar(x=salarysummary19['SalaryGrp'], y=salarysummary19['Responses_x'], name='2019', marker_color='#005db6'))
#fig.add_trace(go.Bar(x=salarysummary20['SalaryGrp'], y=salarysummary20['Responses_x'], name='2020', marker_color='#45025c'))

fig = go.Figure(go.Bar(x=salarysummary17['SalaryGrp'], y=salarysummary17['ResponsePrcnt'], name='2017', marker_color='#11d16e'))
fig.add_trace(go.Bar(x=salarysummary18['SalaryGrp'], y=salarysummary18['ResponsePrcnt'], name='2018', marker_color='#009db6'))
fig.add_trace(go.Bar(x=salarysummary19['SalaryGrp'], y=salarysummary19['ResponsePrcnt'], name='2019', marker_color='#005db6'))
fig.add_trace(go.Bar(x=salarysummary20['SalaryGrp'], y=salarysummary20['ResponsePrcnt'], name='2020', marker_color='#45025c'))

fig.update_layout(barmode='group', 
                  #height = 700,
                  #width = 1100,
                  xaxis_tickangle = -45,
                  yaxis = dict(tickformat = ".1%"),
                  legend=dict(yanchor = "top", y = 0.98, xanchor = "right", x = 0.99),
                  title = "Data Scientist Salaries in the United Kingdom",
                  xaxis_title = "Salary",
                  yaxis_title = "Percentage of responses",
                  legend_title = "Survey Year",
                  #xaxis = {'categoryorder':'total descending'},)
                  xaxis = {'categoryorder':'array', 'categoryarray':['$0-10k', '$10-20k', '$20-30k', '$30-40k',
                                                                    '$40-50k', '$50-60k', '$60-70k', '$70-80k',
                                                                    '$80-90k', '$90-100k', '$100-125k', '$125-150k',
                                                                    '$150-200k', '$200-250k', '$250-300k', '$300-500k',
                                                                    '$500k+',]})

fig.show()

And finally, what do Data Scientist salaries look like in my country? They're certainly different to those in the US. It looks like we generally get paid less over here. I feel like we're a few years behind the US in terms of the establishment of Data Science in business, but tech companies are growing all over the country and Data Science will only continue to grow, as can be seen from the increased proportion of respondents with higher salaries. It also looks like there are two salary distributions in here: a lower salary group around $50k, and a higher salary group around $100k. This might be due to different salary grades in the public and private sectors. Again, this is pure speculation at this point, and I'm sure I could put more time into investigating the underlying reasons.
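A rough way to check the "two salary distributions" impression would be to look for local peaks in the ordered band shares. The sketch below does this on made-up shares (not the UK survey results); `local_peaks` is a hypothetical helper of my own.

```python
def local_peaks(shares):
    """Indices of bands whose share is strictly higher than both neighbours."""
    padded = [0.0] + list(shares) + [0.0]
    return [i - 1 for i in range(1, len(padded) - 1)
            if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]]

# A subset of the ordered salary bands, with hypothetical shares
bands = ['$30-40k', '$40-50k', '$50-60k', '$60-70k',
         '$70-80k', '$80-90k', '$90-100k', '$100-125k']
hypothetical_shares = [0.05, 0.15, 0.22, 0.12, 0.08, 0.10, 0.18, 0.10]

peak_bands = [bands[i] for i in local_peaks(hypothetical_shares)]
print(peak_bands)
```

Two caveats: the bands are not all the same width (the $100-125k band covers more ground than the $10k bands below it), and with small samples a single noisy bar can look like a peak, so this would only be a first pass.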

## Further work when I get time
* Further exploration into the underlying reasons for role and salary trends
* Add some more interesting plots (bar charts are ok, but it would be nice to have more variety)
* Build a classification model to predict what my job title should be. No promises I'll change it though. Data Science salaries are pretty good ;)

If you've got this far then thanks for reading. As I type, Trump supporters have taken over the Capitol in DC, so I hope that gets sorted soon. Stay safe everyone.