# Analyzing Kaggle Survey Data Through the Years
Each year, Kaggle conducts a Machine Learning and Data Science Survey of its users. The survey data available on Kaggle spans 2017 to 2021. This notebook aims to:
- Read the data from each year's survey
- Combine the yearly data into one dataset
- Analyze trends in the survey results

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import os
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

warnings.filterwarnings("ignore")

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load data
df2021 = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
df2020 = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
df2019 = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
df2018 = pd.read_csv('/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv')
df2017 = pd.read_csv('/kaggle/input/kaggle-survey-2017/multipleChoiceResponses.csv', encoding='ISO-8859-1')

In [None]:
# Print dimension of each year
print('2021: ' + str(df2021.shape))
print('2020: ' + str(df2020.shape))
print('2019: ' + str(df2019.shape))
print('2018: ' + str(df2018.shape))
print('2017: ' + str(df2017.shape))

# Basic info of respondents

In this section, we combine the basic info of each year's respondents into one dataset. Basic info includes:
- Age
- Country of origin
- Gender
- Occupation
- Education qualification
- Coding Experience

A note on 'Coding Experience': the wording of the question differs from year to year. In the following analysis, we take the answers to these questions as the 'Coding Experience' value for each year:
- 2017: How long have you been learning data science?
- 2018: How long have you been writing code to analyze data?
- 2019: How long have you been writing code to analyze data (at work or at school)?
- 2020 and 2021: For how many years have you been writing code and/or programming?
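
The per-year choices above can be collected in one lookup table, which keeps the differing question wordings visible in the code. The following is a minimal sketch, assuming the column names this notebook selects for each year (`LearningDataScienceTime`, `Q24`, `Q15`, `Q6`). `EXPERIENCE_COLUMN` and `extract_experience` are hypothetical names introduced for illustration, and the toy frame stands in for a real survey file:

```python
import pandas as pd

# Hypothetical lookup: survey year -> column holding the 'experience' answer.
EXPERIENCE_COLUMN = {
    2017: 'LearningDataScienceTime',
    2018: 'Q24',
    2019: 'Q15',
    2020: 'Q6',
    2021: 'Q6',
}

def extract_experience(df, year, skip_first_row=True):
    """Return the experience answers for one survey year as a Series."""
    col = EXPERIENCE_COLUMN[year]
    # Assumption: for 2018-2021 the first data row is not a respondent
    # (the notebook skips it with .loc[1:]), so it is dropped by default.
    rows = df.iloc[1:] if skip_first_row else df
    return rows[col].rename('experience')

# Made-up stand-in for a 2019-style file (first row = question text).
toy2019 = pd.DataFrame({'Q15': ['How long have you been writing code?',
                                '1-2 years', '< 1 years']})
toy_experience = extract_experience(toy2019, 2019)
print(toy_experience.tolist())
```

For the 2017 file, where the notebook keeps every row, the call would pass `skip_first_row=False`.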

In [None]:
basic = pd.DataFrame(columns = ['year','age','country','gender','occupation', 'education', 'experience'])

In [None]:
# Get 2021 data
basic2021 = df2021.loc[1:,['Q1','Q3','Q2','Q5','Q4','Q6']]
basic2021.columns = ['age','country','gender','occupation', 'education', 'experience']
basic2021['year']=2021
basic2021.head()

In [None]:
# Get 2020 data
basic2020 = df2020.loc[1:,['Q1','Q3','Q2','Q5','Q4','Q6']]
basic2020.columns = ['age','country','gender','occupation', 'education', 'experience']
basic2020['year']=2020
basic2020.head()

In [None]:
# Get 2019 data
basic2019 = df2019.loc[1:,['Q1','Q3','Q2','Q5','Q4','Q15']]
basic2019.columns = ['age','country','gender','occupation', 'education', 'experience']
basic2019['year']=2019
basic2019.head()

In [None]:
# Get 2018 data
basic2018 = df2018.loc[1:,['Q2','Q3','Q1','Q6','Q4','Q24']]
basic2018.columns = ['age','country','gender','occupation', 'education', 'experience']
basic2018['year']=2018
basic2018.head()

In [None]:
# Get 2017 data
basic2017 = df2017.loc[:,['Age','Country','GenderSelect','CurrentJobTitleSelect','FormalEducation','LearningDataScienceTime']]
basic2017.columns = ['age','country','gender','occupation', 'education', 'experience']
basic2017['year']=2017
basic2017.head()

In [None]:
# Extra for 2017: Grouping age variable
basic2017.age = pd.cut(basic2017.age, bins=[-0.1,17,21,24,29,34,39,44,49,54,59,69,100], 
       labels=['18-','18-21','22-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','70+'])
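
To see how `pd.cut` treats values on the bin edges, the same bins and labels can be applied to a few made-up ages. This is only an illustrative check; `toy_ages` and `toy_groups` are names introduced here:

```python
import pandas as pd

bins = [-0.1, 17, 21, 24, 29, 34, 39, 44, 49, 54, 59, 69, 100]
labels = ['18-', '18-21', '22-24', '25-29', '30-34', '35-39',
          '40-44', '45-49', '50-54', '55-59', '60-69', '70+']

# Made-up ages chosen to sit on or next to the bin edges.
toy_ages = pd.Series([0, 17, 18, 21, 22, 69, 70, 100, 101])
toy_groups = pd.cut(toy_ages, bins=bins, labels=labels)

print(pd.DataFrame({'age': toy_ages, 'group': toy_groups}))
```

Intervals are right-inclusive by default, so 17 lands in '18-' (under 18) and 21 in '18-21'. Any age above 100 falls outside the last bin and becomes missing.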

In [None]:
# ignore_index=True gives the combined frame a fresh, unique row index
basic = pd.concat([basic, basic2021, basic2020, basic2019, basic2018, basic2017], ignore_index=True)
basic.shape

Let's take a quick look at the distribution of each variable by year:

In [None]:
pd.pivot_table(basic, index='age',columns='year', values='country', aggfunc='count')

In [None]:
pd.pivot_table(basic, index='gender',columns='year', values='country', aggfunc='count')

In [None]:
pd.pivot_table(basic, index='country',columns='year', values='gender', aggfunc='count')

In [None]:
pd.pivot_table(basic, index='occupation',columns='year', values='country', aggfunc='count')

In [None]:
pd.pivot_table(basic, index='education',columns='year', values='country', aggfunc='count')

In [None]:
pd.pivot_table(basic, index='experience',columns='year', values='country', aggfunc='count')

# Cleaning data and basic visualization

As seen above, the data needs considerable cleanup before it can be analyzed sensibly.

In [None]:
# Clean up gender
basic['gender'] = basic['gender'].replace({'Man':'Male','Woman':'Female','Prefer to self-describe':'Undisclosed'})
# Use .loc rather than chained indexing so the assignment reliably modifies `basic`
basic.loc[~basic['gender'].isin(['Male','Female','Undisclosed']), 'gender'] = 'Others'

gender_table = pd.pivot_table(basic, index='gender',columns='year', values='country', aggfunc='count').fillna(0)
gender_table = gender_table / gender_table.sum(axis=0)
gender_table.style.format("{:.2%}")

The gender split is quite stable across the years, with men making up roughly 80% of respondents.
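
The count-then-normalize steps used for this table recur for every variable in this section. One way to fold them into a single call is `pd.crosstab` with `normalize='columns'`. The following is a sketch on made-up data; `share_by_year` is a hypothetical helper, not part of the notebook. Unlike the pivot-table version, it counts rows regardless of whether `country` is missing, so the shares can differ slightly:

```python
import pandas as pd

def share_by_year(df, column):
    """Share of each category of `column` within each survey year."""
    # Reset the index first so repeated row labels cannot interfere.
    df = df.reset_index(drop=True)
    return pd.crosstab(df[column], df['year'], normalize='columns')

# Made-up respondents, for illustration only.
toy = pd.DataFrame({
    'year':   [2020, 2020, 2020, 2020, 2021, 2021],
    'gender': ['Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
})
toy_shares = share_by_year(toy, 'gender')
print(toy_shares)
```

Each column of the result sums to 1, so it can be styled with `.style.format("{:.2%}")` in the same way as the tables in this notebook.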

In [None]:
# Clean up country
country_dict = {'Hong Kong (S.A.R.)':'Hong Kong', 'Iran, Islamic Republic of...':'Iran','People \'s Republic of China':'China',
               'United Kingdom of Great Britain and Northern Ireland':'United Kingdom','United States of America':'United States',
               'Viet Nam':'Vietnam','Republic of China':'Taiwan','Republic of Korea':'South Korea',
               'I do not wish to disclose my location':'Other'}

basic['country'] = basic['country'].replace(country_dict)

In [None]:
# Plot 5 bar charts
fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(17,5))

for i, year in enumerate(range(2017, 2022)):
    temp = basic[(basic.year == year) & (basic.country != 'Other')]
    # Ten most frequent countries for this year, in descending order
    top10 = temp['country'].value_counts().index[:10]
    sns.countplot(data=temp, y='country', order=top10, ax=ax[i], color='royalblue')
    ax[i].set_title(str(year))
    ax[i].set_xlabel('')
    ax[i].set_ylabel('')

# Adjust spacing once, after all panels are drawn
plt.tight_layout()

The United States and India are the top two countries in every year. I intended to assign a consistent color to each country but have left that for a future version.
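
For the per-country coloring mentioned above, one possible approach is to build a single country-to-color lookup and reuse it in every panel, so that a country keeps its color from year to year. This is a sketch under assumptions: `build_country_colors` is a hypothetical helper, the hex codes are matplotlib's default 'tab10' colors, and the data is made up:

```python
import pandas as pd

# Hex codes of matplotlib's default 'tab10' palette.
TAB10 = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
         '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
FALLBACK = '#cccccc'  # grey for countries beyond the palette size

def build_country_colors(countries, palette=TAB10):
    """Map the most frequent countries (all years pooled) to palette colors."""
    ranked = pd.Series(countries).value_counts().index
    return {country: palette[i] for i, country in enumerate(ranked[:len(palette)])}

# Made-up respondents, for illustration only.
toy = pd.DataFrame({
    'year':    [2020, 2020, 2020, 2021, 2021, 2021],
    'country': ['India', 'India', 'Japan', 'India', 'Brazil', 'Brazil'],
})
country_colors = build_country_colors(toy['country'])

# Colors for one panel, listed in that panel's bar order.
counts_2021 = toy.loc[toy['year'] == 2021, 'country'].value_counts()
bar_colors = [country_colors.get(c, FALLBACK) for c in counts_2021.index]
print(bar_colors)
```

The resulting list could then be passed to a plain matplotlib call such as `ax.barh(counts_2021.index, counts_2021.values, color=bar_colors)` in place of the single-color `countplot`.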

In [None]:
# clean up age
basic['age'] = basic['age'].replace({'70-79':'70+','80+':'70+'})

age_table = pd.pivot_table(basic, index='age',columns='year', values='country', aggfunc='count').fillna(0)
age_table = age_table / age_table.sum(axis=0)
age_table.style.format("{:.2%}").background_gradient(cmap='Blues')

As expected, respondents are concentrated in the younger age groups.

In [None]:
# clean up occupation
job_dict = {'Program/Project Manager':'Product/Project Manager', 'Research Scientist': 'Scientist/Researcher', 'Currently not employed':'Not employed',
           'Product Manager':'Product/Project Manager', 'Research Assistant':'Scientist/Researcher', 'Software Engineer': 'Software Developer/Software Engineer',
            'Programmer':'Software Developer/Software Engineer', 'Machine Learning Engineer':'Data Scientist', 'Statistician':'Scientist/Researcher',
           'Researcher':'Scientist/Researcher','Computer Scientist':'Scientist/Researcher','Predictive Modeler':'Data Scientist',
           'Data Journalist':'Data Analyst','DBA/Database Engineer':'Data Engineer'}

job_list = ['Student','Data Scientist','Software Developer/Software Engineer','Scientist/Researcher','Data Analyst','Not employed',
           'Business Analyst','Data Engineer','Product/Project Manager']
basic['occupation'] = basic['occupation'].replace(job_dict)
basic['occupation'] = basic['occupation'].fillna('Other')
basic.loc[~basic.occupation.isin(job_list),'occupation'] = 'Other'

In [None]:
job_table = pd.pivot_table(basic, index='occupation',columns='year', values='country', aggfunc='count').fillna(0)
job_table = job_table / job_table.sum(axis=0)
job_table.style.format("{:.2%}").background_gradient(cmap='Blues')

Findings:
- The 2017 survey had no 'Student' option, so 'Other' accounts for more than 40% of respondents that year. We can expect that 20-25% of respondents were students in 2017.
- Data Scientists make up roughly 20% of respondents.
- The share of respondents who are not employed jumped in 2020 and stayed at that level in 2021, possibly an effect of the pandemic.

In [None]:
# clean up education

edu_dict = {'Bachelor\'s degree':'Bachelor', 'Bachelor’s degree':'Bachelor', 'Doctoral degree':'Doctor', 
            'I did not complete any formal education past high school': 'High school or below',
           'Master\'s degree':'Master', 'Master’s degree':'Master', 'No formal education past high school':'High school or below',
           'Professional doctorate':'Doctor',
           'Some college/university study without earning a bachelor\'s degree':'College dropout', 
           'Some college/university study without earning a bachelor’s degree':'College dropout'}

basic['education'] = basic['education'].replace(edu_dict)
edu_table = pd.pivot_table(basic, index='education',columns='year', values='country', aggfunc='count').fillna(0)
edu_table = edu_table / edu_table.sum(axis=0)
edu_table.style.format("{:.2%}").background_gradient(cmap='Blues')

In the most recent survey (2021), about 40% of respondents hold a bachelor's degree and about 40% hold a master's degree.

In [None]:
# clean up experience

basic['experience'] = basic.experience.replace({'< 1 years': '< 1 year', '1-2 years': '1-3 years'})
basic.loc[basic.experience.str.startswith('I have never written code', na=False),'experience'] = 'Nil'
basic.loc[(~basic.experience.isin(['Nil','< 1 year','1-3 years','3-5 years','5-10 years'])) & (basic.experience.notna()), 'experience'] = '10+ years'

exp_table = pd.pivot_table(basic, index='experience',columns='year', values='country', aggfunc='count').fillna(0)
exp_table = exp_table / exp_table.sum(axis=0)
exp_table.style.format("{:.2%}").background_gradient(cmap='Blues')

There is no clear trend in respondents' experience. Note that the 2017 data look inconsistent with the other years. Possible reasons are (1) the different wording of the question and (2) the many respondents who did not answer it.
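
The second explanation can be checked directly by computing the share of missing answers per year. The following is a minimal sketch on made-up data; `missing_share_by_year` is a hypothetical helper, and it assumes unanswered questions are stored as missing values:

```python
import pandas as pd

def missing_share_by_year(df, column):
    """Fraction of respondents per year with no answer in `column`."""
    return df.groupby('year')[column].apply(lambda s: s.isna().mean())

# Made-up respondents, for illustration only.
toy = pd.DataFrame({
    'year': [2017, 2017, 2017, 2017, 2021, 2021],
    'experience': ['< 1 year', None, None, None, '1-3 years', '3-5 years'],
})
toy_missing = missing_share_by_year(toy, 'experience')
print(toy_missing)
```

Running the same helper on the combined dataset with `column='experience'` would show whether 2017 stands out in non-response.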

That's it for now!