# Intro
In this notebook we compare the results of the Kaggle surveys of the years [2017](https://www.kaggle.com/kaggle/kaggle-survey-2017), [2018](https://www.kaggle.com/kaggle/kaggle-survey-2018), [2019](https://www.kaggle.com/c/kaggle-survey-2019) and [2020](https://www.kaggle.com/c/kaggle-survey-2020).

![](https://storage.googleapis.com/kaggle-competitions/kaggle/23724/logos/header.png) 

We focus on the multiple choice responses and do not consider 
* the freeformResponses in 2017 and 2018,
* the other text responses in 2019.

Additionally, we include the [Kaggle meta dataset](https://www.kaggle.com/kaggle/meta-kaggle).

The challenge of this notebook is to make the yearly surveys comparable over time, because the questions and answer options change from year to year.

For an analysis focused on 2020, see this [notebook](https://www.kaggle.com/drcapa/2020-kaggle-ml-ds-survey-eda).

<span style="color: royalblue;">Please vote the notebook up if it helps you. Thank you. </span>

# Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Path

In [None]:
path = '/kaggle/input/'
os.listdir(path)

# Load And Prepare Data
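The files of 2018 to 2020 contain the question texts in the first row. For 2017 the questions are stored in a separate schema file, so we add them as the first row ourselves.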

In [None]:
data_2020 = pd.read_csv(path+'kaggle-survey-2020/'+'kaggle_survey_2020_responses.csv',
                        low_memory=False)
data_2019 = pd.read_csv(path+'kaggle-survey-2019/'+'multiple_choice_responses.csv',
                        low_memory=False)
data_2018 = pd.read_csv(path+'kaggle-survey-2018/'+'multipleChoiceResponses.csv',
                        low_memory=False)

questions = pd.read_csv(path+'kaggle-survey-2017/'+'schema.csv',
                        low_memory=False, encoding = "ISO-8859-1")
data_2017 = pd.read_csv(path+'kaggle-survey-2017/'+'multipleChoiceResponses.csv',
                        low_memory=False, encoding = "ISO-8859-1")
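# Index the schema by column name so that questions can be looked up per column.
questions.index=questions['Column']
# Collect the question text of every column that appears in the 2017 responses.
new_row = questions[questions.index.isin(data_2017.columns)]['Question'].to_dict()
new_row = pd.DataFrame(new_row, index=[0])
# Prepend the questions as row 0, matching the layout of the 2018-2020 files.
data_2017 = pd.concat([new_row, data_2017]).reset_index(drop = True)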
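
The effect of prepending the question row can be illustrated on toy data. This is a minimal sketch, not the real survey data: the frames `schema` and `responses` and all their values are invented, and only the column names `Column` and `Question` follow the schema file as used above.

```python
import pandas as pd

# Toy stand-ins for schema.csv and the 2017 responses (hypothetical values).
schema = pd.DataFrame({
    'Column': ['Age', 'Country', 'Unused'],
    'Question': ['What is your age?', 'Where do you live?', 'Not asked'],
})
responses = pd.DataFrame({'Age': [25, 31], 'Country': ['India', 'Brazil']})

# Look up the question text for every response column that the schema knows.
question_map = schema.set_index('Column')['Question']
question_row = pd.DataFrame(
    [{col: question_map[col] for col in responses.columns if col in question_map.index}]
)

# Prepend the question row so that row 0 holds the questions and the answers start at row 1.
combined = pd.concat([question_row, responses], ignore_index=True)
print(combined)
```

After this step, `combined.loc[0, 'Age']` returns the question text and `combined[1:]` contains only answers, which is the access pattern used in the rest of the notebook.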

Load user data of the Kaggle meta dataset:

In [None]:
data_user = pd.read_csv(path+'meta-kaggle/'+'Users.csv')

# Functions
We define some helper functions for visualization.

In [None]:
def plot_bar(data, text='', rotation=False):
    fig = plt.figure(figsize=(10, 5))
    x = data.keys()
    y = data.values
    plt.bar(x, y)
    plt.title(text, loc='center')
    plt.xlabel('Year')
    if rotation:
        plt.xticks(rotation='vertical')
    plt.grid()
    plt.show()
    
def plot_compare_distribution(data, text=''):
    fig = plt.figure(figsize=(10, 5))
    x = data.columns
    for row in range(len(data.index)):
        y = data.iloc[row].values
        plt.plot(x, y, marker='o', fillstyle='none', ls='-',label=data.index[row])
    plt.title(text, loc='center')
    plt.xlabel('Year')
    plt.ylabel('Distribution')
    plt.grid()
    plt.legend(bbox_to_anchor=(1.05, 1))
    plt.show()

# Overview

## Number Of Yearly New Users
The number of new users increases rapidly.

In [None]:
data_user.index=pd.to_datetime(data_user['RegisterDate'])
users = data_user.resample('A').count()['UserName']
users.index = users.index.year
plot_bar(users, text='Number of new users', rotation=False)
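
As a rough illustration of the yearly count, the sketch below groups a few invented registration dates by year. It uses `groupby` on the year instead of `resample`, which gives the same counts for years that have registrations but omits years without any. The user names and dates are made up.

```python
import pandas as pd

# Hypothetical user table with the two columns used above.
toy_users = pd.DataFrame({
    'UserName': ['a', 'b', 'c', 'd'],
    'RegisterDate': ['2018-03-15', '2018-11-02', '2019-01-20', '2020-07-04'],
})

# Extract the registration year and count the users per year.
register_year = pd.to_datetime(toy_users['RegisterDate']).dt.year
new_users_per_year = toy_users.groupby(register_year)['UserName'].count()
print(new_users_per_year.to_dict())
```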

## Number Of Survey Participations

In [None]:
# The first row of every frame holds the question texts, so it is not counted.
s = pd.Series(dtype='float64')
s['2017'] = len(data_2017.index)-1
s['2018'] = len(data_2018.index)-1
s['2019'] = len(data_2019.index)-1
s['2020'] = len(data_2020.index)-1
plot_bar(s, text='Number of participations', rotation=False)

## Number Of Features

In [None]:
s = pd.Series(dtype='float64')
s['2017'] = len(data_2017.columns)
s['2018'] = len(data_2018.columns)
s['2019'] = len(data_2019.columns)
s['2020'] = len(data_2020.columns)
plot_bar(s, text='Number of columns', rotation=False)

# EDA

## Age
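We want to compare the distribution of the age groups. First we have to align the years 2017 and 2018 with the later surveys: in 2017 the age is a number that must be binned, and in 2018 the groups 70-79 and 80+ must be merged into 70+.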

In [None]:
def group_age_2017(s):
    if s == 'Unknown':
        return 'Unknown'
    elif (s >= 18) & (s <= 21):
        return '18-21'
    elif (s >= 22) & (s <= 24):
        return '22-24'
    elif (s >= 25) & (s <= 29):
        return '25-29'
    elif (s >= 30) & (s <= 34):
        return '30-34'
    elif (s >= 35) & (s <= 39):
        return '35-39'
    elif (s >= 40) & (s <= 44):
        return '40-44'
    elif (s >= 45) & (s <= 49):
        return '45-49'
    elif (s >= 50) & (s <= 54):
        return '50-54'
    elif (s >= 55) & (s <= 59):
        return '55-59'
    elif (s >= 60) & (s <= 69):
        return '60-69'
    elif (s >= 70):
        return '70+'

def group_age_2018(s):
    if s == 'Unknown':
        return 'Unknown'
    elif (s=='70-79') | (s=='80+'):
        return '70+'
    else:
        return s
    
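# Assign via .loc so that the original frames are modified, not a temporary slice.
data_2017['Age'] = data_2017['Age'].fillna('Unknown')
data_2017.loc[1:, 'Age'] = data_2017.loc[1:, 'Age'].apply(group_age_2017)
data_2018.loc[1:, 'Q2'] = data_2018.loc[1:, 'Q2'].apply(group_age_2018)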

df_age = pd.DataFrame()
df_age['2017'] = 100*data_2017[1:]['Age'].value_counts().sort_index()/len(data_2017[1:])
df_age['2018'] = 100*data_2018[1:]['Q2'].value_counts().sort_index()/len(data_2018[1:])
df_age['2019'] = 100*data_2019[1:]['Q1'].value_counts().sort_index()/len(data_2019[1:])
df_age['2020'] = 100*data_2020[1:]['Q1'].value_counts().sort_index()/len(data_2020[1:])
df_age.drop(['Unknown'], inplace=True)
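
The numeric ages of 2017 could alternatively be binned with `pandas.cut`. This is only a sketch under the assumption that ages are whole numbers: the sample ages are invented, and the labels follow the groups returned by `group_age_2017`. Ages below 18 and missing ages become NaN here.

```python
import pandas as pd

# Invented sample ages, including values at the bin borders and a missing value.
ages = pd.Series([17, 18, 21, 22, 29, 30, 69, 70, 85, None])

# Left-closed bins: [18, 22), [22, 25), ..., [70, inf).
edges = [18, 22, 25, 30, 35, 40, 45, 50, 55, 60, 70, float('inf')]
labels = ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44',
          '45-49', '50-54', '55-59', '60-69', '70+']

age_group = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_group.value_counts().sort_index())
```

With `right=False` the lower edge of each bin is included, so an age of 22 falls into 22-24 and not into 18-21.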

As the plot shows, the share of the age group 18-21 increases markedly over the years, from 7% to 17%.

In [None]:
plot_compare_distribution(df_age[0:5])

## Gender

In [None]:
def rename_gender_2020(s):
    if s == 'Man':
        return 'Male'
    elif s == 'Woman':
        return 'Female'
    else:
        return s

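# Assign via .loc so that the original frames are modified, not a temporary slice.
data_2017.loc[1:, 'GenderSelect'] = data_2017.loc[1:, 'GenderSelect'].fillna('NoSelection')
data_2020.loc[1:, 'Q2'] = data_2020.loc[1:, 'Q2'].apply(rename_gender_2020)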

s_2017 = 100*data_2017[1:]['GenderSelect'].value_counts()/len(data_2017[1:])
s_2018 = 100*data_2018[1:]['Q1'].value_counts()/len(data_2018[1:])
s_2019 = 100*data_2019[1:]['Q2'].value_counts()/len(data_2019[1:])
s_2020 = 100*data_2020[1:]['Q2'].value_counts()/len(data_2020[1:])
df_gender = pd.concat([s_2017, s_2018, s_2019, s_2020], axis=1)
df_gender.columns=['2017', '2018', '2019', '2020']
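
The percentages above are relative to all respondents of a year, including those who skipped the question. A small sketch with invented answers shows that this matches `value_counts(normalize=True, dropna=False)`, which additionally lists the missing answers as their own row.

```python
import pandas as pd

# Invented answers with one missing value.
answers = pd.Series(['Male', 'Male', 'Female', None, 'Male'])

# Share of all respondents, computed as in the cell above.
share_manual = 100 * answers.value_counts() / len(answers)

# Same shares, with the missing answers kept as a separate category.
share_normalized = 100 * answers.value_counts(normalize=True, dropna=False)

print(share_manual)
print(share_normalized)
```

Because missing answers stay in the denominator, the listed shares of a year do not necessarily sum to 100%.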

As the plot shows, the share of male respondents decreases slightly and the share of female respondents increases slightly. Nevertheless, male respondents still clearly outnumber female respondents.

In [None]:
plot_compare_distribution(df_gender[0:2])

## Country

In [None]:
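# Assign via .loc so that the original frames are modified, not a temporary slice.
data_2017.loc[1:, 'Country'] = data_2017.loc[1:, 'Country'].fillna('Unknown')
data_2018.loc[1:, 'Q3'] = data_2018.loc[1:, 'Q3'].fillna('Unknown')
data_2019.loc[1:, 'Q3'] = data_2019.loc[1:, 'Q3'].fillna('Unknown')
data_2020.loc[1:, 'Q3'] = data_2020.loc[1:, 'Q3'].fillna('Unknown')

# The 2017 survey uses a shorter country name than the later years.
dict_country = {'United States': 'United States of America'}

data_2017.loc[1:, 'Country'] = data_2017.loc[1:, 'Country'].replace(dict_country)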

s_2017 = 100*data_2017[1:]['Country'].value_counts()/len(data_2017[1:])
s_2018 = 100*data_2018[1:]['Q3'].value_counts()/len(data_2018[1:])
s_2019 = 100*data_2019[1:]['Q3'].value_counts()/len(data_2019[1:])
s_2020 = 100*data_2020[1:]['Q3'].value_counts()/len(data_2020[1:])
df_country = pd.concat([s_2017, s_2018, s_2019, s_2020], axis=1)
df_country.columns=['2017', '2018', '2019', '2020']

df_country.sort_values(by=['2020'], ascending=False, inplace=True)
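
`pd.concat(..., axis=1)` aligns the yearly series on their index, so a country that does not appear in one year gets NaN in that column. A toy sketch with made-up percentages and only two years:

```python
import pandas as pd

# Made-up shares per country for two survey years.
toy_2017 = pd.Series({'India': 20.0, 'Brazil': 5.0})
toy_2020 = pd.Series({'India': 25.0, 'Japan': 4.0})

# Align both series on the country name; missing combinations become NaN.
toy_country = pd.concat([toy_2017, toy_2020], axis=1)
toy_country.columns = ['2017', '2020']

# Sort by the latest year; rows with NaN in that column end up last.
toy_country = toy_country.sort_values(by='2020', ascending=False)
print(toy_country)
```

This is why the country names have to be harmonized before concatenating: two spellings of the same country would otherwise end up in two separate rows.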

In [None]:
plot_compare_distribution(df_country[0:6])

# Questions Without A Perfect Match
The questions and answers above are easy to compare. But there are also questions that match only partially or not at all across the years.

## Code Experience
In 2020 the question on coding experience is more general than in the previous years. We print the question text of each year to compare the wording.

In [None]:
print('2017:', data_2017.loc[0, 'Tenure'])
print('2018:', data_2018.loc[0, 'Q24'])
print('2019:', data_2019.loc[0, 'Q15'])
print('2020:', data_2020.loc[0, 'Q6'])

In [None]:
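# Assign via .loc so that the original frames are modified, not a temporary slice.
data_2017.loc[1:, 'Tenure'] = data_2017.loc[1:, 'Tenure'].fillna('NoSelection')
data_2018.loc[1:, 'Q24'] = data_2018.loc[1:, 'Q24'].fillna('NoSelection')
data_2019.loc[1:, 'Q15'] = data_2019.loc[1:, 'Q15'].fillna('NoSelection')
data_2020.loc[1:, 'Q6'] = data_2020.loc[1:, 'Q6'].fillna('NoSelection')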

def adjust_answers_2017(s):
    if s=='Less than a year':
        return '< 1 years'
    elif s=='1 to 2 years':
        return '1-2 years'
    elif s=='3 to 5 years':
        return '3-5 years'
    elif s=='6 to 10 years':
        return '5-10 years'
    elif s=='More than 10 years':
        return '10+'
    elif s=="I don't write code to analyze data":
        return "I don't write code"
    else:
        return s

def adjust_answers_2018_2020(s):
    if s=='NoSelection':
        return 'NoSelection'
    elif s=='< 1 year':
        return '< 1 years'
    elif s in ('10-20 years', '20-30 years', '30-40 years', '40+ years', '20+ years'):
        return '10+'
    elif ((s=="I have never written code but I want to learn") or
          (s=="I have never written code and I do not want to learn") or
          (s=="I have never written code")):
        return "I don't write code"
    else:
        return s

# Assign via .loc so that the original frames are modified, not a temporary slice.
data_2017.loc[1:, 'Tenure'] = data_2017.loc[1:, 'Tenure'].apply(adjust_answers_2017)
data_2018.loc[1:, 'Q24'] = data_2018.loc[1:, 'Q24'].apply(adjust_answers_2018_2020)
data_2019.loc[1:, 'Q15'] = data_2019.loc[1:, 'Q15'].apply(adjust_answers_2018_2020)
data_2020.loc[1:, 'Q6'] = data_2020.loc[1:, 'Q6'].apply(adjust_answers_2018_2020)

s_2017 = 100*data_2017[1:]['Tenure'].value_counts()/len(data_2017[1:])
s_2018 = 100*data_2018[1:]['Q24'].value_counts()/len(data_2018[1:])
s_2019 = 100*data_2019[1:]['Q15'].value_counts()/len(data_2019[1:])
s_2020 = 100*data_2020[1:]['Q6'].value_counts()/len(data_2020[1:])
df_code_exp = pd.concat([s_2017, s_2018, s_2019, s_2020], axis=1)
df_code_exp.columns=['2017', '2018', '2019', '2020']
index_series = ['< 1 years', '1-2 years', '3-5 years', '5-10 years', '10+', "I don't write code", 'NoSelection']

In [None]:
df_code_exp.reindex(index_series).round(2)
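
The chains of `if`/`elif` branches used for the harmonization can also be written as a lookup table. The sketch below is one possible variant, not the method used above: the names `ANSWER_MAP` and `harmonize` are hypothetical, the dictionary covers only a subset of the branches, and the sample answers are taken from those branches.

```python
import pandas as pd

# Hypothetical lookup table from original answer to harmonized category (subset only).
ANSWER_MAP = {
    'Less than a year': '< 1 years',
    '< 1 year': '< 1 years',
    '1 to 2 years': '1-2 years',
    '3 to 5 years': '3-5 years',
    '6 to 10 years': '5-10 years',
    'More than 10 years': '10+',
    '10-20 years': '10+',
    '20+ years': '10+',
    "I don't write code to analyze data": "I don't write code",
    'I have never written code': "I don't write code",
}

def harmonize(answer):
    # Answers without an entry in the table are returned unchanged.
    return ANSWER_MAP.get(answer, answer)

sample = pd.Series(['< 1 year', '6 to 10 years', '20+ years',
                    'I have never written code', '1-2 years'])
harmonized = sample.apply(harmonize)
print(harmonized.value_counts())
```

A single table keeps the mappings of all years in one place, which makes it easier to check that the same category label is used everywhere.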