# Kaggle Survey 2021

This notebook looks at aspects of the survey that the report authored by Kaggle does not cover. For now it tries to answer the following questions:

* Does response time relate to respondent age?
* Does gender reflect differences in education?
* Does gender reflect differences in salary?
* How do programming languages compare with each other?

This is ongoing work; I will investigate further questions and release updates in the next few days.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the responses; skip the second row of the file, which holds the question text rather than answers
dat = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv',
                  header=0, low_memory=False, skiprows=[1])

## Does response time relate to respondent age?

Here we check how response time is distributed across the different age groups. Response time is log-transformed first, in the hope that its distribution becomes closer to normal.

In [None]:
plt.figure(figsize=(18, 6))

# Log-transform completion time (+1 guards against log(0))
dat['log_time'] = np.log(dat['Time from Start to Finish (seconds)'].astype(int) + 1)

# Treat |x - mu| / sigma > 3 as outliers: blank their age group (Q1)
# so that they are excluded from the plot below
norm_log_time = (dat['log_time'] - dat['log_time'].mean()) / dat['log_time'].std()
dat.loc[norm_log_time.abs() > 3, 'Q1'] = np.nan

sns.violinplot(x=pd.Categorical(dat.Q1), y='log_time', data=dat)

After discarding points lying further than three standard deviations from the mean (a range that, for a normal distribution, covers about 99.7% of the observations), we see a clear tendency for older respondents to take longer to fill in the survey. To me this could be because i) young respondents are more agile, and/or ii) older respondents pay more attention to the different questions.
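As a quick check on the three-standard-deviation cutoff, the share of a normal distribution lying within k standard deviations of the mean is erf(k / sqrt(2)). The helper name `normal_coverage` below is my own, and the sketch assumes exact normality, which the log-transformed times only approximate.

```python
import math

def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Coverage of the cutoff used above (k = 3)
print(f"{normal_coverage(3):.4f}")  # prints 0.9973
```

So under normality roughly 0.3% of the observations would be dropped by this filter.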

## Does gender reflect differences in education?

Next we can investigate differences in educational attainment among reported genders.

In [None]:
# Share of each education level (Q4) within each gender (Q2); rows sum to 1
gb = dat.groupby('Q2')['Q4'].value_counts().unstack().apply(lambda x: x / x.sum(), axis=1)

gb.plot(kind='bar', stacked=True, figsize=(18, 6))

Here we do not see many differences. Interestingly, respondents who "prefer not to say" their gender are also the group that most often answers "I prefer not to answer" to Q4. What about salary?
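The row-normalised table behind the plot above can also be sketched with `pd.crosstab` and `normalize='index'`. The toy frame below is made up for illustration and is not survey data; only the column names `Q2` and `Q4` are taken from the notebook.

```python
import pandas as pd

# Made-up toy responses, for illustration only (NOT survey data)
toy = pd.DataFrame({
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Woman', 'Man'],
    'Q4': ["Master's degree", "Bachelor's degree", "Master's degree",
           "Master's degree", "Doctoral degree", "Doctoral degree"],
})

# Approach used in the cell above: count, pivot, then divide each row by its total
manual = (toy.groupby('Q2')['Q4'].value_counts()
             .unstack(fill_value=0)
             .apply(lambda x: x / x.sum(), axis=1))

# pd.crosstab does the counting and the row normalisation in one call
shares = pd.crosstab(toy['Q2'], toy['Q4'], normalize='index')
print(shares)
```

One difference worth noting: without `fill_value=0`, the `unstack` route leaves missing combinations as NaN, whereas `crosstab` reports them as 0.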

## Does gender reflect differences in salary?

In [None]:
dat['Q25'] = pd.Categorical(dat['Q25'])
# Order salary brackets by their lower bound: strip the upper bound, '$', '>' and ','
idx = np.argsort(dat['Q25'].cat.categories.str.replace(r'-[0-9,]+|\$|>|,', '', regex=True).astype(int))
dat['Q25'] = dat['Q25'].cat.reorder_categories(dat['Q25'].cat.categories[idx])

# Share of each gender (Q2) within each salary bracket (Q25); rows sum to 1
gb = dat.groupby('Q25')['Q2'].value_counts().unstack().apply(lambda x: x / x.sum(), axis=1)

gb.plot(kind='bar', stacked=True, figsize=(18, 4), xlabel='Salary (USD)', ylabel='Proportion')

Here we observe the clear majority of men among the participants, but also an apparent gender pay gap. Settling this would require additional analysis and hypothesis testing. Also interestingly, a large proportion of non-binary respondents claim to earn over 1,000,000 USD a year; to me this seems to be largely driven by insincere responses.
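As one possible starting point for such a test, a chi-square test of independence between gender and salary bracket could be sketched as follows. The counts are synthetic and invented purely to show the mechanics; they are not taken from the survey, and this is not an analysis carried out in this notebook.

```python
import numpy as np

# Synthetic counts, invented for illustration only (NOT survey data).
# Rows: two hypothetical gender groups; columns: three hypothetical salary brackets.
observed = np.array([[120, 90, 40],
                     [60, 30, 10]])

# Expected counts if gender and salary bracket were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic and its degrees of freedom
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"chi2 = {chi2:.2f}, dof = {dof}")
```

If SciPy is available, `scipy.stats.chi2_contingency` computes the same statistic together with a p-value. Such a test would only indicate association and would not by itself account for factors such as country or experience.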

## How do programming languages compare with each other?

Now we focus on which programming languages are used together the most, using a co-occurrence matrix.

In [None]:
plt.figure(figsize=(12, 8))

# Q7: one column per programming language, non-null when the respondent selected it
lang = dat.loc[:, dat.columns.str.startswith('Q7_')]
lang_labels = lang.apply(lambda x: x[x.notnull()].unique()[0]).tolist()

# Binary indicator matrix; multiplying its transpose by itself counts respondents per language pair
lang_mat = lang.notnull().astype(int)
co_occ = lang_mat.T.dot(lang_mat)
sns.heatmap(co_occ, fmt='d', annot=True, xticklabels=lang_labels, yticklabels=lang_labels, cbar_kws={'label': 'No. of respondents'})

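Unsurprisingly, Python takes the lead, followed by SQL. The diagonal of the matrix gives the number of respondents who use each language, and the off-diagonal cells give the number who use both languages of a pair.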
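To make the co-occurrence computation above concrete, here is a minimal sketch on a made-up toy table laid out like the `Q7_*` columns (the column names and answers below are illustrative assumptions, not survey data).

```python
import pandas as pd

# Made-up toy answers: each column holds the language name if the
# respondent selected it, and a missing value otherwise
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', None, 'Python'],
    'Q7_Part_2': ['R', None, 'R', None],
    'Q7_Part_3': ['SQL', 'SQL', None, None],
})

labels = ['Python', 'R', 'SQL']
indicator = toy.notnull().astype(int)   # 1 if the respondent picked the language
co_occ = indicator.T.dot(indicator)     # languages x languages counts
co_occ.index = co_occ.columns = labels
print(co_occ)
```

In this toy case three respondents use Python, and two of them also use SQL, so the Python/SQL cell is 2 while the Python diagonal entry is 3.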