# Estimating Salary from Data in the Stack Overflow Survey
### Using Support Vector Regression to calculate respondents' salaries based on Survey responses

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import csv
import datetime

%matplotlib inline

We load the data and take a look at it.

In [None]:
df = pd.read_csv('../input/survey_results_public.csv', low_memory=False)
df.info()

In [None]:
df.index = df['Respondent']
del(df['Respondent'])

In [None]:
df.info()

## Numerical v Categorical Data
Being able to quickly distinguish which columns are numerical and which are categorical is very useful, and it's straightforward to do by hand.

In [None]:
numerical = []
text = []
for c in df.columns:
    if df[c].dtype == 'float64':
        numerical.append(c)
    elif df[c].dtype == 'int64':
        numerical.append(c)
    else:
        text.append(c)
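
As an alternative to the loop above, `pandas` provides `DataFrame.select_dtypes`, which can do the same split in two calls. The sketch below uses a toy frame whose column names are illustrative assumptions, not the survey's actual schema.

```python
import pandas as pd

# Toy frame standing in for the survey data (column names are illustrative only).
toy = pd.DataFrame({
    "Salary": [50000.0, 72000.0],
    "YearsCoding": [3, 8],
    "Country": ["Ireland", "Germany"],
})

# "number" matches every numeric dtype, not only float64 and int64.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numerical_cols, text_cols)
```

Note that `include="number"` is slightly broader than the hand-written check, since it would also pick up dtypes such as `int32` or `float32`.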

## Reducing the Number of Columns
We don’t need all 129 columns of data to figure out what’s going on with pay scales. All the multiple choice columns can go, as can the personal ones. There’s no way to say these are the only possible columns to select; the surviving columns are the result of a combination of best guess and trial and error.

In [None]:
shorter_columns = ['ConvertedSalary',
                    'Hobby',
                     'OpenSource',
                     'Country',
                     'Employment',
                     'FormalEducation',
                     'UndergradMajor',
                     'CompanySize',
                     'DevType',
                     'YearsCoding',
                     'YearsCodingProf',
                     'DatabaseWorkedWith',
                     'PlatformWorkedWith',
                     'FrameworkWorkedWith',
                     'OperatingSystem',
                     'Age']

df = df[shorter_columns]
df.info()

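Rows where the salary is `np.nan` are of no use to us. Away they go, along with any zero salaries, since the filter keeps only values greater than zero and a comparison against `np.nan` is always `False`.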

In [None]:
df = df[df.ConvertedSalary > 0]
df.info()
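
To see why a simple `> 0` filter also removes missing salaries, here is a minimal sketch on made-up values (the numbers are assumptions for illustration, not survey data).

```python
import numpy as np
import pandas as pd

# Four hypothetical respondents: one missing salary, one zero salary.
toy = pd.DataFrame({"ConvertedSalary": [55000.0, np.nan, 0.0, 120000.0]})

# NaN > 0 evaluates to False, so the missing row is dropped with the zero row.
kept = toy[toy.ConvertedSalary > 0]
print(len(kept))
```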

## Countries

In [None]:
df.Country.value_counts()

## Reducing the Number of Countries
The countries data is, unsurprisingly, dominated by the USA. Therefore, we’re going to create an `Other` category to mop up the lower tail of the distribution and keep things tidy.

We’ll do this by writing a function, `shorten_categories()`, that takes the series generated by calling `.value_counts()` on a categorical series, and a cut-off point. If a category's count is at or above the cut-off, that category maps to itself in the dictionary that the function returns. Otherwise, the category maps to `Other`.

We then use this function to create a shorter country column, and call `.value_counts()/df.shape[0]` on it to see how it works out proportionally.

In [None]:
def shorten_categories(categories, cutoff):
    categorical_map = {}
    for i in range(len(categories)):
        if categories.values[i] >= cutoff:
            categorical_map[categories.index[i]] = categories.index[i]
        else:
            categorical_map[categories.index[i]] = 'Other'
    return categorical_map

In [None]:
country_map = shorten_categories(df.Country.value_counts(), 400)
df['Country_Shorter'] = df.Country.map(country_map)
df.Country_Shorter.value_counts()/df.shape[0]
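
The same idea can be sketched without an explicit loop. The helper below, `shorten_categories_vectorized`, is a hypothetical name introduced for illustration; it uses `Series.isin` and `Series.where` on toy data.

```python
import pandas as pd

def shorten_categories_vectorized(series, cutoff, other_label="Other"):
    """Replace categories seen fewer than `cutoff` times with `other_label`."""
    counts = series.value_counts()
    keep = counts[counts >= cutoff].index
    # Values not in `keep` (including missing values) become `other_label`.
    return series.where(series.isin(keep), other_label)

toy = pd.Series(["USA", "USA", "USA", "India", "India", "Malta"])
shortened = shorten_categories_vectorized(toy, cutoff=2)
print(shortened.tolist())
```

One difference from the dictionary approach: here missing values are folded into `Other`, whereas `.map()` leaves them as `NaN`.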

## Shortening the Formal Education Categories
Some of the Formal Education categories run a little long for graphing purposes. "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)" will be bigger than the graph itself. Therefore, we'll shorten these categories by mapping a dictionary.

In [None]:
education_dict = {"Bachelor’s degree (BA, BS, B.Eng., etc.)": "Bachelor's",
                    "Some college/university study without earning a degree": "Some college",
                    "Master’s degree (MA, MS, M.Eng., MBA, etc.)": "Masters",
                    "Associate degree": "Associate Degree",
                    "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)": "High School",
                    "Professional degree (JD, MD, etc.)": "Professional",
                    "Other doctoral degree (Ph.D, Ed.D., etc.)": "Doctoral",
                    "nan": "nan",
                    "Primary/elementary school": "Elementary",
                    "I never completed any formal education": "None"}
df['Education'] = df.FormalEducation.map(education_dict)
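
It is worth knowing how `Series.map` treats values that have no matching key in the dictionary. The sketch below uses a made-up series; the `"Unlisted answer"` value is an assumption for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.Series(["Associate degree", np.nan, "Unlisted answer"])

# The string key "nan" does not match a real missing value (float NaN).
mapped = toy.map({"Associate degree": "Associate Degree", "nan": "nan"})
print(mapped.tolist())
```

Both the missing value and the unlisted answer come back as `NaN`, so any category not covered by the dictionary would silently drop out of a grouped plot.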

### Examining the Data
We can now compare individual categories against salary, again to get a better sense of them.

#### Converted Salary v Formal Education

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 7))
df.boxplot('ConvertedSalary', 'Education', ax=ax)
plt.suptitle('Salary v Formal Education')
plt.title('')
plt.ylabel('Salary ($)')
plt.xticks(rotation=90);

#### Salary v Years of Professional Experience

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 7))
df.boxplot('ConvertedSalary', 'YearsCodingProf', ax=ax)
plt.suptitle('Salary (US$) v Years Coding Professionally')
plt.title('')
plt.ylabel('Salary')
plt.xticks(rotation=90);

There are quite a few outliers that crush the main detail of our graph. We'll set a salary cutoff at $250,000 and see what things are like for regular folks, to use one of President Obama's favorite phrases. Regular folks don't generally pull down two mill pa on 0-2 years' experience.

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 7))
df[df.ConvertedSalary <=250000].boxplot('ConvertedSalary', 'YearsCodingProf', ax=ax)
plt.suptitle('Salary (US$) v Years Coding Professionally, outliers removed')
plt.title('')
plt.ylabel('Salary')
plt.xticks(rotation=90);

#### Salary v Country

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 7))
df.boxplot('ConvertedSalary', 'Country_Shorter', ax=ax)
plt.suptitle('Salary (US$) v Country')
plt.title('')
plt.ylabel('Salary')
plt.xticks(rotation=90);

Again, we'll impose a $250k ceiling to get a better idea of the general data.

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 7))
df[df.ConvertedSalary <= 250000].boxplot('ConvertedSalary', 'Country_Shorter', ax=ax)
plt.suptitle('Salary (US$) v Country, Outliers Removed')
plt.title('')
plt.ylabel('Salary')
plt.xticks(rotation=90);

And that's instructive. Being a developer means you've a pretty good chance of good pay in the USA, UK, Israel and Australia. Things aren't so good in Ukraine, India and Russia. There are also some pretty low-paid developer jobs in Sweden.

## Can we Predict Salary?
It's unfortunate that we don't have better language data. The language category allows for a number of different languages and, while it's trivial to separate them, what we can't do is find a reliable way to weight the languages against each other in terms of how much each one matters to a respondent's actual work.

Consider, for instance, CSS. With the greatest of respect, you could argue that CSS isn't a language at all, but if you're a front-end developer, it's pretty important to what you do. But even so, if you can only write CSS you won't get a job. So it would have a small weighting even for someone who uses it every day, and a far smaller one again for someone who just threw it into the list, like emergency snacks into a shopping basket. We have no way of figuring out the weight of the languages relative to each other for a particular respondent, and we are far better off leaving them out than guessing and delivering a misleading result.

Those caveats noted, we'll proceed by loading some modules from `sklearn` and getting our data ready for analysis.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

#### Dropping Categories from the `Country` Field and Capping the Salary
Because this script could take well over an hour to run over the full data set, we have to cut the data. We can cut it at random, by sampling the data. Or we can cut it methodically, with a method to our madness.

I've chosen the methodical method. Over 40% of our data is represented by two countries, USA and Other. Our data will be more interesting if we drop these two countries, and only look at countries other than the big beast and the collection of smaller beasties - again, with all due respect to big and little beasties.

Secondly, we'll also cap the salary at $250,000 to give ourselves some sort of fighting chance of getting this right. The spectacularly high salaries just aren't credible. We'll then create `train` and `test` data frames in the usual way, with `sklearn.model_selection.train_test_split`.

In [None]:
df2 = df.copy()
df2 = df2[(df2.ConvertedSalary <= 250000) & (df2.Country_Shorter != 'Other') & (df2.Country_Shorter != 'United States')]
del(df2['Country_Shorter'])
df2.info()

## Solely Categorical Data
We find an unusual case here in that our feature data is entirely categorical: there are no numerical features. This is a chance occurrence, but that’s OK. If anything, it makes our job easier.

Normally, we create another data frame that one-hot encodes all the categorical data and merge that back with the original. However, in this case, no merging is necessary as the dummy data covers the entirety of the data.

Even though the dummy data doesn’t have to be merged with anything, we’ll change its name to `features` nonetheless. We don’t need the original `features` data frame, and it’s easier to keep up with these operations when the names are consistent.

In [None]:
labels = df2['ConvertedSalary']
features = df2.drop('ConvertedSalary', axis=1)

In [None]:
dummies = pd.get_dummies(features)
features = dummies
features.shape
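
For readers new to one-hot encoding, here is a minimal sketch of what `pd.get_dummies` does to a purely categorical frame. The two columns borrow names from the survey, but the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Hobby": ["Yes", "No", "Yes"],
    "OpenSource": ["No", "No", "Yes"],
})

# Each distinct value in each column becomes its own indicator column.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded.shape)
```

Two columns with two values each become four indicator columns, which is why `features.shape` grows so much wider than the original frame.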

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2, random_state=42)

A sanity check is helpful every now and again - let's make sure we're getting what we ought to be getting.

In [None]:
for i in [train_features, test_features, train_labels, test_labels]:
    print(len(i), type(i))

## Finding the Best Setting for our Support Vector Regressor
The support-vector regression model in `sklearn` has adjustable parameters. Different kernels can be used, the regularization parameter `C` controls how we trade bias against variance, and there are other parameters too. Rather than work these out piece by piece, we can use `GridSearchCV` in `sklearn` to automate the process for us. Once fitted, the `GridSearchCV` object has a `best_estimator_` attribute, and it is this that we shall use as our regressor.

As it happens, some `sklearn` estimators come with their own cross-validated variants, such as `RidgeCV` and `LassoCV` for linear models. There is no equivalent for `SVR`, so we’re going with the tried and the tested `GridSearchCV`.

In [None]:
param_grid = [{'kernel':('linear', 'rbf'), 'C':[1, 10]}]
regressor = SVR()
gridsearch = GridSearchCV(regressor, param_grid, scoring='neg_mean_squared_error')

### Sparse Matrix
Again, our good luck in having entirely one-hot encoded data means the feature matrix is mostly zeros, so we can convert it to a sparse matrix. A sparse matrix stores only the non-zero entries, which can cut memory use and processing time considerably.

In [None]:
from scipy.sparse import csr_matrix

In [None]:
train_regressor_ready_data = csr_matrix(train_features.values)

In [None]:
gridsearch.fit(train_regressor_ready_data, train_labels.values)

In [None]:
regressor = gridsearch.best_estimator_

## Checking for Accuracy
We’ll use RMSE, the root-mean-squared error, to see how accurate our regressor is. We’ll then create a data frame that contains the correct salary data, the predicted data, the country data, the formal education data and the years of professional experience. We’ll group the data by country and by experience, and see how we’re doing in each category with scatter plots, as we map observed salaries against predicted salaries.

In [None]:
train_predictions = regressor.predict(train_regressor_ready_data)

In [None]:
rootMeanSquaredError_train = np.sqrt(mean_squared_error(train_labels, train_predictions))
print("${:,.02f}".format(rootMeanSquaredError_train))
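
For reference, RMSE is simply the square root of the mean of the squared differences between predictions and actual values. A minimal sketch with made-up numbers, using only `numpy` (the `rmse` helper is a hypothetical name):

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Errors are +10, -10 and +30, so the squared errors are 100, 100 and 900.
toy_rmse = rmse([100, 200, 300], [110, 190, 330])
print(round(toy_rmse, 4))
```

Because the errors are squared before averaging, the single error of 30 dominates the result, which is why a handful of badly predicted salaries can inflate the score.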

## Further Testing Our Model
We'll test our model further by running it against the test data. Our aim is for it to have about the same RMSE value, as per this comment on, of all places, Stack Exchange: https://stats.stackexchange.com/a/288809/190839. So we're going to prepare our `test_features` data as we did the `train_features`, predict some values, and measure the root-mean-squared error between the predicted data and the `test_labels` data, hoping to arrive in or around the RMSE value we got with the training data.

In [None]:
test_regressor_ready_data = csr_matrix(test_features.values)
test_predictions = regressor.predict(test_regressor_ready_data)

In [None]:
rootMeanSquaredError_test = np.sqrt(mean_squared_error(test_labels, test_predictions))
print("${:,.02f}".format(rootMeanSquaredError_test))

## Level of Fit

In [None]:
levelOfFit = abs(rootMeanSquaredError_train-rootMeanSquaredError_test)/rootMeanSquaredError_train*100.0
print("There is a {:.02f}% difference between the root mean squared errors of the train set and the test set.".format(levelOfFit))
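
The calculation above can be wrapped in a small helper for reuse. The function name `rmse_gap_percent` and the input values are hypothetical, chosen only to show the arithmetic.

```python
def rmse_gap_percent(rmse_train, rmse_test):
    """Absolute train/test RMSE difference as a percentage of the train RMSE."""
    return abs(rmse_train - rmse_test) / rmse_train * 100.0

# Hypothetical values: a 600-dollar gap on a 20,000-dollar training RMSE.
gap = rmse_gap_percent(20000.0, 20600.0)
print("{:.02f}%".format(gap))
```

A small gap suggests the model is not overfitting the training set, though it says nothing about whether the RMSE itself is low enough to be useful.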

In [None]:
df2.Country.value_counts()

### Creating a Data Frame from Which to Base the Plots

In [None]:
plotting_df = pd.DataFrame(train_labels)
plotting_df['PredictedSalary'] = train_predictions
plotting_df['Country'] = train_features.index.map(df.Country)
plotting_df['Education'] = train_features.index.map(df.Education)
plotting_df['Experience'] = train_features.index.map(df.YearsCodingProf)

In [None]:
plotting_df.head()

In [None]:
len(plotting_df.Country.unique())

In [None]:
byCountry = plotting_df.groupby('Country')

In [None]:
colors = [plt.get_cmap('inferno')(1. * i/255) for i in range(0, 255, 15)]
countries = plotting_df.Country.unique().tolist()

Firstly, we'll create a scatter plot on a single axis.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(14, 14))
i = 0
for a, b in byCountry:
    plt.scatter(b.ConvertedSalary, b.PredictedSalary, color=colors[i], label=a, alpha=0.5)
    i +=1
plt.xlabel('Actual Salary')
plt.ylabel('Predicted Salary')
plt.legend()
plt.title('Predicted v Actual Salary');

And now we'll break the countries out into subplots.

In [None]:
fig, ax = plt.subplots(4, 4, sharex=True, sharey=True, figsize=(12, 12))
counter = 0
for i in range(4):
    for j in range(4):
        temp = byCountry.get_group(countries[counter])
        ax[i][j].scatter(temp['ConvertedSalary'], temp['PredictedSalary'], color = colors[counter])
        ax[i][j].set_title(countries[counter])
        counter += 1
plt.tight_layout()

#### Breaking Down the Stats
We'll write a function, `create_correlations_table()`, that will return a data frame that shows us
1. the category,
2. the sample size,
3. the Pearson's R value,
4. the two-tailed p-value relative to that Pearson's R value (that is to say, the chances of getting so extreme a Pearson's R by dumb luck, rather than correlation), and
5. the root-mean-squared-error value for each category.

The function has a `sample_size_cutoff` parameter, set by default to zero, so that we can skip groups whose samples are too small for a Pearson's R value to be meaningful. Only groups with more rows than the cut-off are included.

In [None]:
import scipy.stats as stats

In [None]:
def create_correlations_table(groupedDf, sample_size_cutoff=0):
    holder = []
    for a, b in groupedDf:
        # Skip groups at or below the cutoff; Pearson's R is unreliable for tiny samples.
        if len(b) <= sample_size_cutoff:
            continue
        # Group keys are tuples for multi-column groupbys, so store them as strings.
        category = a if isinstance(a, str) else str(a)
        r_value, p_value = stats.pearsonr(b['ConvertedSalary'], b['PredictedSalary'])
        RMSE = np.sqrt(mean_squared_error(b.ConvertedSalary, b.PredictedSalary))
        holder.append([category, len(b), r_value, p_value, RMSE])

    # The first column is named generically because the table is reused for other groupings.
    correlations = pd.DataFrame(holder, columns=['Category', 'Sample Size', 'Pearson R', 'Probability', 'RMSE'])
    correlations.sort_values('RMSE', inplace=True)
    return correlations

In [None]:
country_correlations = create_correlations_table(byCountry)
country_correlations

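The country with the best Pearson's R number for the correlation between predicted and actual salary is also the country with the highest RMSE score. This isn't a contradiction. Pearson's R measures how closely the predictions move in step with the actual salaries, while RMSE measures how far off they are, so predictions can track the trend well and still miss by a wide margin.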
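
A small made-up example shows how a high correlation and a high RMSE can coexist. The salary figures below are assumptions for illustration, and `np.corrcoef` stands in for `stats.pearsonr` to keep the sketch dependent on `numpy` alone.

```python
import numpy as np

actual = np.array([50000.0, 60000.0, 70000.0, 80000.0])

# Model A: predicts exactly half of every salary (perfectly in step, badly off).
pred_a = actual * 0.5
# Model B: off by 5,000 in alternating directions (noisier, but much closer).
pred_b = actual + np.array([5000.0, -5000.0, 5000.0, -5000.0])

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

r_a = float(np.corrcoef(actual, pred_a)[0, 1])
r_b = float(np.corrcoef(actual, pred_b)[0, 1])
rmse_a, rmse_b = rmse(actual, pred_a), rmse(actual, pred_b)
print(r_a, rmse_a)
print(r_b, rmse_b)
```

Model A has the higher correlation and also the higher RMSE, mirroring the pattern in the table above.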

### Plotting by Experience
We'll follow the same procedure in examining the effect of experience on salary.

In [None]:
byExperience = plotting_df.groupby('Experience')
years = plotting_df.Experience.dropna().unique().tolist()
colors = [plt.get_cmap('inferno')(1. * i/255) for i in range(0, 255, 21)]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(14, 14))
i = 0
for a, b in byExperience:
    plt.scatter(b.ConvertedSalary, b.PredictedSalary, color=colors[i], label=a, alpha=0.5)
    i +=1
plt.xlabel('Actual Salary')
plt.ylabel('Predicted Salary')
plt.legend()
plt.title('Predicted v Actual Salary');

In [None]:
fig, ax = plt.subplots(4, 3, sharex=True, sharey=True, figsize=(12, 12))
counter = 0
for i in range(4):
    for j in range(3):
        if counter < len(years):
            temp = byExperience.get_group(years[counter])
            ax[i][j].scatter(temp['ConvertedSalary'], temp['PredictedSalary'], color = colors[counter])
            ax[i][j].set_title(years[counter])
            counter += 1
        else:
            pass
plt.tight_layout()

In [None]:
experience_correlations = create_correlations_table(byExperience)
experience_correlations

This is all very interesting. There's a better match here between correlation and root-mean-squared error. The hardest category to predict is thirty or more years, which is hardly surprising, as there are many crossroads over thirty years. The easiest is at the opposite end of the scale, those developers who have just started out.

## By Country and Experience
Having gone this far, it seems worthwhile to combine the two to see which has the strongest impact on salary.

In [None]:
byCountryEx = plotting_df.groupby(['Country', 'Experience'])
country_experience_df = create_correlations_table(byCountryEx, 100)

In [None]:
country_experience_df
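
When grouping on two columns, each group key is a tuple rather than a single value, and many of the combined groups are small, which is why a sample-size cut-off matters here. A minimal sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["India", "India", "Germany", "Germany"],
    "Experience": ["0-2 years", "0-2 years", "0-2 years", "3-5 years"],
    "ConvertedSalary": [9000.0, 11000.0, 48000.0, 60000.0],
})

# Each key is a (Country, Experience) tuple; the value is that group's size.
sizes = {key: len(group) for key, group in toy.groupby(["Country", "Experience"])}
print(sizes)
```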

We get our best RMSE scores for developers with 0-2 years of experience, irrespective of where they're based. Strangely, Switzerland is very erratic for the next experience level up, with the highest RMSE value of the lot.

## Conclusions
All data analysis reports are only as good as the information on which they're built. Surveys are not ideal tools for researching salaries. All surveys suffer from response bias by their nature, and this particular survey is unfortunate in not identifying languages as primary, secondary and tertiary, say, or as combinations - frontend, backend, and so on.

For all that, our model is good, with a 3% difference between the RMSE scores for our training and test sets, and could be even better with a few more little tweaks.