In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import pearsonr
import seaborn as sns
%matplotlib inline

# The Stack Overflow Developer Survey from a Human Resource Manager's Point of View

## Idea

The following notebook analyses the Stack Overflow Developer Survey dataset of 2017 from a human resource manager's perspective. We ask four questions related to employee retention and job satisfaction of software developers and try to answer them with different data science methods based on this dataset.

The analysis broadly follows the CRISP-DM (cross-industry standard process for data mining) model. The six major phases of this model are:

* Business Understanding
* Data Understanding
* Data Preparation
* Modeling
* Evaluation
* Deployment

While the first five steps are roughly reflected in this notebook, the results of the analysis are deployed as a blog post on medium.com.

## Business Understanding

Employee retention and job satisfaction are key concerns for a human resource manager, all the more so now that skilled IT staff are hard to find. Data science techniques could help here. We therefore ask and try to answer four questions:

* Is it possible to predict whether a developer is looking for a new job or not?
* If so, what are the most important features for such a prediction?
* Is job satisfaction related to other features recorded in the survey – like salaries?
* Do these aspects change from country to country?

## Data Understanding

Kaggle characterizes the Stack Overflow survey in the following way: "Every year, Stack Overflow conducts a massive survey of people on the site, covering all sorts of information like programming languages, salary, code style and various other information. This year, they amassed more than 64,000 responses fielded from 213 countries."

In [None]:
df_orig = pd.read_csv('/kaggle/input/so-survey-2017/survey_results_public.csv')
df_schema = pd.read_csv('/kaggle/input/so-survey-2017/survey_results_schema.csv')
df_orig.shape

In [None]:
df_orig.head()

The final dataset consists of about 50,000 entries with about 150 features. An additional CSV file lists the exact questions the developers were asked. Apart from a few numerical features such as salary, there are more than a hundred categorical features.

### First question: Is it possible to predict whether a developer is looking for a new job or not?

To answer this question we first create a labeled dataset for the machine learning model:

In [None]:
df = df_orig


# What was the question Stack Overflow asked?

print("Question: " + df_schema[df_schema['Column'] == "JobSeekingStatus"]['Question'].tolist()[0])

# What are the possible answers for JobSeekingStatus?

print("Answers:")
print(df['JobSeekingStatus'].unique())

In [None]:
# Reduce Dataframe to professional, full-time developers

df = df.loc[df['EmploymentStatus'] == 'Employed full-time']
df = df.loc[df['Professional'] == 'Professional developer']
df = df.drop('Respondent', axis=1)

In [None]:
df.shape

In [None]:
# Delete rows without a JobSeekingStatus
df = df.dropna(subset=['JobSeekingStatus'], axis=0)

# Delete columns with only NaNs
df = df.dropna(how='all', axis=1)

df.shape

In [None]:
# Create two categories of developers: those who are not interested in a new job (1) and those who are (0)
X = df.drop('JobSeekingStatus', axis=1)
y = pd.get_dummies(df['JobSeekingStatus'], prefix="JobSearch")
y = y['JobSearch_I am not interested in new job opportunities']
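
The label construction above can be illustrated on a tiny, made-up sample. The following is only a sketch: apart from "I am not interested in new job opportunities", which appears in the notebook, the answer strings and the names `status` and `label` are assumptions for illustration. It shows that comparing against the target answer gives the same 0/1 label as picking one dummy column, and that the mean of the label is the share of the positive class, which is worth knowing before judging a classifier.

```python
import pandas as pd

# Hypothetical answers; only the "not interested" string is taken from the notebook
status = pd.Series([
    "actively looking (placeholder answer)",
    "I am not interested in new job opportunities",
    "open to offers (placeholder answer)",
    "I am not interested in new job opportunities",
])

# 1 = not interested in a new job, 0 = any other answer
label = (status == "I am not interested in new job opportunities").astype(int)

# Share of each class; a strongly skewed split makes scores harder to interpret
print(label.value_counts(normalize=True))
```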

## Data Preparation

The first and most ambitious question is to be answered by training a machine learning classifier, so we need to bring the data into a form a classifier can use. To do so, we first replace NaNs in numerical columns with the column mean and convert categorical columns to dummy variables.

In [None]:
# Fill the NaNs in numerical columns with the mean

num_cols = X.select_dtypes(include=['float','int']).columns

for col in num_cols:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    X[col] = X[col].fillna(X[col].mean())

# Create dummy columns for categorical columns (takes a while...)

cat_cols = X.select_dtypes(include=['object']).columns

# One call encodes all categorical columns; the column name is used as prefix
X = pd.get_dummies(X, columns=cat_cols, drop_first=True)
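
To make the two preparation steps concrete, here is a minimal sketch on a three-row toy frame. The data and the names `toy` and `toy_encoded` are invented for illustration and are not taken from the survey. It shows how the missing salary is replaced by the column mean and how `drop_first=True` removes the first category level, so two countries end up as a single dummy column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    "Salary": [50000.0, np.nan, 70000.0],
    "Country": ["Germany", "India", "Germany"],
})

# Mean imputation for the numerical columns
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# One-hot encoding; drop_first=True drops the first level ("Germany")
toy_encoded = pd.get_dummies(toy, columns=["Country"], drop_first=True)
print(toy_encoded)
```

Dropping one level per column avoids perfectly redundant dummy columns; for a tree-based model this is a convenience rather than a requirement.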

## Modeling

After several tests we decided to use a Random Forest classifier for the prediction. It is fast and can also report the most important features used for the prediction, which is helpful for answering the next question.

In [None]:
# Do the prediction
    
# Step 1: test train sample

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Step 2: create and train a classifier (may take a while, too...)

# A fixed random_state makes the trained forest reproducible between runs
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 3: How well does it predict?

y_pred = clf.predict(X_test)
print("Classification report:")
print(classification_report(y_test, y_pred))

Given the high F1 score in the classification report, we can say that the two types of developer can generally be predicted well.
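
Whether an F1 score counts as high depends on how the classes are distributed. One way to put it into perspective is to compare against a trivial baseline that always predicts the majority class. The sketch below does this on synthetic data generated with `make_classification`; the data, the 80/20 class split and all variable names are assumptions for illustration, not the survey data or the model trained above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the survey data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Macro F1 weights both classes equally; zero_division=0 covers the
    # baseline, which never predicts the minority class
    scores[name] = f1_score(y_te, model.predict(X_te), average="macro", zero_division=0)
    print(f"{name}: macro F1 = {scores[name]:.3f}")
```

If the trained classifier does not clearly beat such a baseline, a good-looking overall score mostly reflects the class imbalance.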

### Next question: what are the most important features for such a prediction?

In [None]:
# What are the 10 most important features?

importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(10):
    # importances are aligned with X_train.columns, so index the columns directly
    print("%d. %s (%f)" % (f + 1, X_train.columns[indices[f]], importances[indices[f]]))

# Plot the feature importances
plt.figure()
plt.title("Feature importances")
plt.bar(range(10), importances[indices[:10]], color="r", yerr=std[indices[:10]], align="center")
plt.xticks(range(10), X_train.columns[indices[:10]], rotation='vertical')
plt.show()

Hence, JobSatisfaction is clearly the most important feature.
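
Because `feature_importances_` is a plain array in the same order as the training columns, a ranking is only as reliable as the mapping from positions to column names. A minimal sketch of a mapping that cannot drift is to wrap the array in a `pandas.Series` indexed by the columns. The toy data below is invented for illustration: the column named `signal` fully determines the label, so it should come out on top.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: only "signal" carries information about the label
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["noise_a", "signal", "noise_b", "noise_c"],
)
y_demo = (X_demo["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name, then sort by importance
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```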

### Next question: Is job satisfaction related to other features recorded in the survey – like salaries?

To find out, we analyze the correlation between job satisfaction and salary.

In [None]:
df = df_orig

In [None]:
# Again, reduce Dataframe to professional, full-time developers

df = df.loc[df['EmploymentStatus'] == 'Employed full-time']
df = df.loc[df['Professional'] == 'Professional developer']
df = df.drop(['Respondent','JobSeekingStatus','ExpectedSalary'], axis=1)

In [None]:
# What are possible values for job satisfaction?

df['JobSatisfaction'].unique()

In [None]:
# Delete rows with no value for job satisfaction or salary

df = df.dropna(subset=['JobSatisfaction','Salary'], axis=0)

In [None]:
# Are both features linearly correlated?

corr, _ = pearsonr(df['Salary'], df['JobSatisfaction'])
print("Pearson's correlation: %.3f" % corr)

In [None]:
## Plot a correlation matrix for a deeper look
## Source: https://seaborn.pydata.org/examples/many_pairwise_correlations.html

# Compute the correlation matrix
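# Restrict to numerical columns; recent pandas versions raise an error on text columns
corr = df.corr(numeric_only=True)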

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

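Since the correlation coefficients are low throughout, we find no meaningful linear relationship between job satisfaction and salary, or any of the other numerical features, at the level of individual developers.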
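
Pearson's coefficient only measures linear association. Since job satisfaction is recorded as a rating and salaries are usually skewed, a rank-based measure such as Spearman's coefficient can serve as a cross-check. The sketch below uses synthetic data; the distributions, the 0 to 10 rating scale and the variable names are assumptions for illustration and say nothing about the survey itself.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# Synthetic, right-skewed salaries (assumption)
salary = rng.lognormal(mean=11, sigma=0.5, size=200)

# Synthetic rating on an assumed 0-10 scale, rising with log salary plus noise
raw = 2 * np.log(salary) - 16 + rng.normal(0, 1.5, size=200)
satisfaction = np.clip(np.round(raw), 0, 10)

r_pearson, _ = pearsonr(salary, satisfaction)
r_spearman, _ = spearmanr(salary, satisfaction)
print("Pearson:  %.3f" % r_pearson)
print("Spearman: %.3f" % r_spearman)
```

If both coefficients are low on the real data, the conclusion of a weak relationship does not hinge on the linearity assumption.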

### Last question: Do these aspects change from country to country?

To answer this question we analyze how job satisfaction (and salary) vary from country to country.

In [None]:
# Calculate average job satisfaction and salary for major countries

df = df.dropna(subset=['JobSatisfaction','Salary'], axis=0)
major_countries = df['Country'].value_counts()[:15].keys()

sal_mean = []
sat_mean = []

for country in major_countries:
    # Select the rows of one country once and reuse them for both averages
    country_rows = df.loc[df['Country'] == country]
    sat_mean.append(country_rows['JobSatisfaction'].mean())
    sal_mean.append(country_rows['Salary'].mean())
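
The per-country averages can also be computed with a single `groupby`. The following is a sketch on invented toy data, with made-up country labels and values, that mirrors the steps above: pick the most frequent countries, then average salary and job satisfaction within each.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "B", "C"],
    "Salary": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "JobSatisfaction": [6.0, 8.0, 5.0, 7.0, 9.0, 4.0],
})

# Keep the two countries with the most respondents
top = toy["Country"].value_counts().head(2).index

# Average both columns per country in one step
means = (
    toy[toy["Country"].isin(top)]
    .groupby("Country")[["Salary", "JobSatisfaction"]]
    .mean()
)
print(means)
```

Note that `groupby` returns the countries in alphabetical order rather than by number of respondents, which does not matter for a scatter plot or a correlation.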

In [None]:
# Compare these values by a scatter plot

plt.title("Job Satisfaction and Salary for 15 Countries")
plt.xlabel("Avg. Salary")
plt.ylabel("Avg. Job Satisfaction")
plt.scatter(sal_mean, sat_mean)
plt.show()

In [None]:
# Are they linearly correlated?

corr, _ = pearsonr(sal_mean, sat_mean)
print("Pearson's correlation: %.3f" % corr)

As the scatter plot and the higher Pearson coefficient suggest, average salary and average job satisfaction are at least moderately correlated at the country level.

## Deployment

Please see the following blog post for a deeper discussion on the topics outlined above:

https://medium.com/@cornel_77788/how-data-science-could-help-a-human-resource-manager-5d6e95c87c95