# Stack Overflow Developer Survey 2022

The Stack Overflow Developer Survey is an annual survey run by the Stack Overflow platform that collects information about the developer community. It covers a range of topics, such as programming languages, tools, development practices, and job satisfaction. The results are analyzed and published in a report that offers insight into developer trends and perceptions. The survey is an important source of information for technology professionals and software development companies.

In this project, we analyze the 2022 survey data in order to build and compare regressors that predict software developer salaries. To do so, we use machine learning and statistical techniques.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [5]:
# load raw data
# remember to download the data from
# https://www.kaggle.com/datasets/dheemanthbhat/stack-overflow-annual-developer-survey-2022
# https://insights.stackoverflow.com/survey

raw = pd.read_csv('raw/survey_results_public.csv')

# select columns of interest

target_col = 'ConvertedCompYearly'
features_cols = ['Employment', 'RemoteWork', 'EdLevel', 'YearsCode', 'YearsCodePro', 'Country', 'Age', 'Gender']

# drop rows with missing data

raw = raw[raw[target_col].notnull()]
raw = raw[raw[features_cols].notnull().all(axis=1)]

raw = raw[raw['Age'] != 'Prefer not to say']
# 'Prefer not to say' is kept in 'Gender', 'Trans', 'Sexuality', 'Ethnicity'
# and 'Accessibility', because such an answer is not necessarily a missing value.
#
# Note: this may be a mistake; reducing or compressing the number of features
# could even improve the model's accuracy (not tested at this point).
#
# Update: it was a mistake. The extra features were disrupting the model, so the
# columns ['Trans', 'Sexuality', 'Ethnicity', 'Accessibility'] were dropped.

# Select columns of interest

X_raw = raw.drop(raw.columns.difference(features_cols), axis=1)
y_raw = raw[target_col]

# fix heterogeneous columns

X_raw['YearsCode'] = X_raw['YearsCode'].replace('Less than 1 year', 0)
X_raw['YearsCode'] = X_raw['YearsCode'].replace('More than 50 years', 51)

X_raw['YearsCodePro'] = X_raw['YearsCodePro'].replace('Less than 1 year', 0)
X_raw['YearsCodePro'] = X_raw['YearsCodePro'].replace('More than 50 years', 51)

# cast column to numeric

X_raw['YearsCode'] = X_raw['YearsCode'].astype('int64')
X_raw['YearsCodePro'] = X_raw['YearsCodePro'].astype('int64')

# split multi label columns (MultiHotEncoding)

X_raw = X_raw.drop('Employment', axis=1).join(X_raw['Employment'].str.get_dummies(sep=';').add_prefix('Empl_'))
X_raw = X_raw.drop('Gender', axis=1).join(X_raw['Gender'].str.get_dummies(sep=';').add_prefix('Gender_'))
#X_raw = X_raw.drop('Sexuality', axis=1).join(X_raw['Sexuality'].str.get_dummies(sep=';').add_prefix('Sexuality_'))
#X_raw = X_raw.drop('Ethnicity', axis=1).join(X_raw['Ethnicity'].str.get_dummies(sep=';').add_prefix('Ethnicity_'))
#X_raw = X_raw.drop('Accessibility', axis=1).join(X_raw['Accessibility'].str.get_dummies(sep=';').add_prefix('Accessibility_'))

# merge useless categories
# (template only: 'A_cat' and 'B_cat' are placeholder names, so no merge is applied yet)
# X_raw['merged_column'] = X_raw['A_cat'] | X_raw['B_cat']
# X_raw = X_raw.drop(columns=['A_cat', 'B_cat'])

print('Dataset shape (rows, features): ', X_raw.shape)
X_raw.dtypes

Dataset shape (rows, features):  (37627, 15)


RemoteWork                                                   object
EdLevel                                                      object
YearsCode                                                     int64
YearsCodePro                                                  int64
Country                                                      object
Age                                                          object
Empl_Employed, full-time                                      int64
Empl_Employed, part-time                                      int64
Empl_Independent contractor, freelancer, or self-employed     int64
Empl_Retired                                                  int64
Gender_Man                                                    int64
Gender_Non-binary, genderqueer, or gender non-conforming      int64
Gender_Or, in your own words:                                 int64
Gender_Prefer not to say                                      int64
Gender_Woman                                                  int64
dtype: object
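
As a small illustration of two of the cleaning steps above (mapping the textual year buckets to integers, and multi-hot encoding `;`-separated answers), the sketch below runs them on a toy frame. The rows are invented for illustration and are not taken from the survey.

```python
import pandas as pd

# Toy rows (made up for illustration; not survey data)
toy = pd.DataFrame({
    'YearsCode': ['Less than 1 year', '12', 'More than 50 years'],
    'Employment': [
        'Employed, full-time',
        'Employed, full-time;Independent contractor, freelancer, or self-employed',
        'Retired',
    ],
})

# Map the two textual buckets to numeric strings, then cast the whole column
years = (
    toy['YearsCode']
    .replace({'Less than 1 year': '0', 'More than 50 years': '51'})
    .astype('int64')
)

# One 0/1 column per distinct answer found between the ';' separators
empl = toy['Employment'].str.get_dummies(sep=';').add_prefix('Empl_')

toy_clean = pd.concat([years, empl], axis=1)
print(toy_clean)
```

A respondent who selected two employment options gets a 1 in both columns, which is why this is a multi-hot rather than a one-hot encoding.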

In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# preprocess categorical data

one_hot_transformer = Pipeline([
    ('onehot', OneHotEncoder())
])

ordinal_transformer = Pipeline([
    ('ordinal', OrdinalEncoder())
])

# 'Trans' is not in features_cols (it was dropped above), so it is not encoded here
one_hot_cols = ['Country', 'RemoteWork']
ordinal_cols = ['EdLevel', 'Age']

column_trans_preprocessor = ColumnTransformer(
    [('one_hot', one_hot_transformer, one_hot_cols),
     ('ordinal', ordinal_transformer, ordinal_cols)],
    remainder='passthrough')

X_transformed = column_trans_preprocessor.fit_transform(X_raw)

display(X_transformed.shape)
display(column_trans_preprocessor.get_feature_names_out())

(35183, 216)

array(['one_hot__Country_Afghanistan', 'one_hot__Country_Albania',
       'one_hot__Country_Algeria', 'one_hot__Country_Andorra',
       'one_hot__Country_Angola', 'one_hot__Country_Argentina',
       'one_hot__Country_Armenia', 'one_hot__Country_Australia',
       'one_hot__Country_Austria', 'one_hot__Country_Azerbaijan',
       'one_hot__Country_Bahrain', 'one_hot__Country_Bangladesh',
       'one_hot__Country_Barbados', 'one_hot__Country_Belarus',
       'one_hot__Country_Belgium', 'one_hot__Country_Benin',
       'one_hot__Country_Bhutan', 'one_hot__Country_Bolivia',
       'one_hot__Country_Bosnia and Herzegovina',
       'one_hot__Country_Botswana', 'one_hot__Country_Brazil',
       'one_hot__Country_Bulgaria', 'one_hot__Country_Cambodia',
       'one_hot__Country_Cameroon', 'one_hot__Country_Canada',
       'one_hot__Country_Cape Verde', 'one_hot__Country_Chile',
       'one_hot__Country_China', 'one_hot__Country_Colombia',
       'one_hot__Country_Congo, Republic of the...',
  

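With default settings, `OrdinalEncoder` sorts string categories lexicographically, so the integer codes for `Age` and `EdLevel` do not necessarily follow their natural order. A minimal sketch of passing an explicit order is shown below. The age bracket labels are an assumption and should be checked against `X_raw['Age'].unique()` before use.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed age brackets, youngest to oldest (verify against the real column)
age_order = [
    'Under 18 years old', '18-24 years old', '25-34 years old',
    '35-44 years old', '45-54 years old', '55-64 years old',
    '65 years or older',
]

# Toy column for illustration only
toy_age = pd.DataFrame(
    {'Age': ['25-34 years old', 'Under 18 years old', '65 years or older']}
)

# Default behaviour: categories are sorted as strings
default_codes = OrdinalEncoder().fit_transform(toy_age).ravel()

# Explicit order: codes follow the position in age_order
ordered_codes = OrdinalEncoder(categories=[age_order]).fit_transform(toy_age).ravel()

print(default_codes)
print(ordered_codes)
```

In the default case 'Under 18 years old' receives the highest code because it sorts last alphabetically, while the explicit order gives it the lowest code.
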
In [4]:
# TODO: decide whether the numeric variables need to be normalized
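
One possible answer to the open question above, sketched on toy data: standardize only the numeric year columns and pass the 0/1 indicator columns through unchanged. Plain least-squares predictions do not depend on feature scale, so this matters mainly if regularized or distance-based regressors are compared later. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame (illustrative values, not survey data)
toy_num = pd.DataFrame({
    'YearsCode': [0, 5, 10, 51],
    'YearsCodePro': [0, 2, 8, 40],
    'Empl_Retired': [0, 0, 0, 1],
})

# Scale the two numeric columns; leave the indicator column untouched
scaler = ColumnTransformer(
    [('scale', StandardScaler(), ['YearsCode', 'YearsCodePro'])],
    remainder='passthrough',
)
scaled = scaler.fit_transform(toy_num)

print(np.round(scaled.mean(axis=0), 3))
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the validation and test splits.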

In [20]:
from sklearn.model_selection import train_test_split

# Split the data into train, validation, and test sets with proportions 70:10:20
X_train, X_val_test, y_train, y_val_test = train_test_split(
    X_transformed,
    y_raw,
    test_size=0.3,
    random_state=42
)

X_val, X_test, y_val, y_test = train_test_split(
    X_val_test,
    y_val_test,
    test_size=0.67,
    random_state=42
)

# check the shapes of the resulting train, validation, and test sets
# X and y have the same number of rows in each split

print('Train set shape: ', X_train.shape)
print('Validation set shape: ', X_val.shape)
print('Test set shape: ', X_test.shape)

Train set shape:  (24628, 216)
Validation set shape:  (3483, 216)
Test set shape:  (7072, 216)
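
The two-step split above gives roughly 70:10:20 because 0.3 × 0.67 ≈ 0.20 of the rows end up in the test set and 0.3 × 0.33 ≈ 0.10 in the validation set. A sketch of a hypothetical helper (`split_train_val_test` is not part of scikit-learn) that takes the target fractions directly and converts them to row counts to avoid rounding surprises:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_train_val_test(X, y, val_frac=0.1, test_frac=0.2, seed=42):
    """Hypothetical helper: split into train/validation/test by row counts."""
    n = len(X)
    n_val = round(val_frac * n)
    n_test = round(test_frac * n)
    # First carve out the combined validation + test rows
    X_tr, X_hold, y_tr, y_hold = train_test_split(
        X, y, test_size=n_val + n_test, random_state=seed
    )
    # Then split the held-out rows into validation and test
    X_v, X_te, y_v, y_te = train_test_split(
        X_hold, y_hold, test_size=n_test, random_state=seed
    )
    return X_tr, X_v, X_te, y_tr, y_v, y_te

# Demo on synthetic arrays
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)
X_tr, X_v, X_te, y_tr, y_v, y_te = split_train_val_test(X_demo, y_demo)
print(len(X_tr), len(X_v), len(X_te))
```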


In [21]:
# baseline adapted from an external example; not yet reviewed

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

linear_regressor = LinearRegression()  # Create an instance of the regressor
linear_regressor.fit(X_train, y_train)  # Train the model

y_pred = linear_regressor.predict(X_test)  # Make predictions on the test set

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Mean Absolute Error:', mae)
print('R-squared:', r2)

Mean Squared Error: 766433681033.9508
Mean Absolute Error: 192829.44806407834
R-squared: 0.02617326966208
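
The very low R² and the large errors may indicate that a few extreme salaries dominate the squared-error fit. One option that might help, not tried in this notebook, is to fit the regressor on a log-transformed target. The sketch below uses scikit-learn's `TransformedTargetRegressor` on synthetic, right-skewed data. The data and coefficients are invented, so the printed number says nothing about the survey.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 3))
# Synthetic, strictly positive, right-skewed target (not survey data)
noise = rng.normal(scale=0.5, size=500)
y_syn = np.exp(10 + X_syn @ np.array([0.5, -0.3, 0.2]) + noise)

# Fit on log1p(y); predictions are mapped back with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_syn[:400], y_syn[:400])
pred = log_model.predict(X_syn[400:])

print('Median absolute error:', median_absolute_error(y_syn[400:], pred))
```

Since the validation split is not used by the baseline above, it would be a natural place to compare such variants before touching the test set.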
