# Stack Overflow Developer Survey 2022

The Stack Overflow Developer Survey is an annual survey run by Stack Overflow that collects information about the developer community. It covers a range of topics, such as programming languages, tools, development practices, and job satisfaction. The results are analyzed and published in a report that offers insight into developer trends and perceptions. The survey is an important source of information for technology professionals and software companies.

In this project, we analyze the 2022 survey data in order to build and compare regressors that predict software developers' salaries, using machine learning and statistical techniques.

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [92]:
# load raw data
# remember to download the data from
# https://www.kaggle.com/datasets/dheemanthbhat/stack-overflow-annual-developer-survey-2022
# https://insights.stackoverflow.com/survey

data = pd.read_csv('raw/survey_results_public.csv')

In [93]:
# select columns of interest

target_col = 'ConvertedCompYearly'
features_cols = ['Employment', 'RemoteWork', 'EdLevel', 'YearsCode', 'YearsCodePro', 'Country', 'Age']
# candidate columns to add later:
# 'Gender', 'Trans', 'Sexuality', 'Ethnicity', 'Accessibility'

# remove rows with missing data

data = data.dropna(subset=[target_col] + features_cols)

print('Total number of samples in the dataset ', data.shape[0])

Total number of samples in the dataset  37748


In [94]:
# Not only the multi-valued columns (e.g. "Gay;Queer", "Straight;Gay") need processing:
# every non-numeric column does, and there are several encoding techniques,
# each with its own advantages and disadvantages.
# Some references I found are listed below; delete them later.
# https://scikit-learn.org/stable/modules/preprocessing.html
# https://chat.openai.com/share/fc9619db-aff2-4114-ad07-df67c1cd6f7f
# https://www.reddit.com/r/learnmachinelearning/comments/qjo2b1/what_is_your_goto_encoding_for_categorical/
# https://stackoverflow.com/questions/38826221/difference-between-binary-relevance-and-one-hot-encoding
# https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
# https://www.reddit.com/r/learnmachinelearning/comments/nc3vn5/please_help_me_in_understanding_when_to_use_label/
#

# preprocess list of classes

feature_data = data[features_cols]

# Apply one-hot encoding to categorical columns

categorical_cols = ['Employment', 'RemoteWork', 'EdLevel', 'Country']
data_encoded = pd.get_dummies(feature_data, columns=categorical_cols)

# Map each age bracket to an ordinal integer
age_mapping = {
    'Under 18 years old': 1,
    '18-24 years old': 2,
    '25-34 years old': 3,
    '35-44 years old': 4,
    '45-54 years old': 5,
    '55-64 years old': 6,
    '65 years or older': 7
}

# Apply the mapping to 'Age'; any answer missing from age_mapping becomes NaN

data_encoded['Age'] = data['Age'].map(age_mapping)

# pd.get_dummies(..., columns=...) already drops the original categorical
# columns and appends the encoded ones, so no extra drop or concat is needed

data_encoded


Unnamed: 0,YearsCode,YearsCodePro,Age,"Employment_Employed, full-time","Employment_Employed, full-time;Employed, part-time","Employment_Employed, full-time;Independent contractor, freelancer, or self-employed","Employment_Employed, full-time;Independent contractor, freelancer, or self-employed;Employed, part-time","Employment_Employed, full-time;Independent contractor, freelancer, or self-employed;Retired","Employment_Employed, full-time;Retired","Employment_Employed, part-time",...,Country_United Kingdom of Great Britain and Northern Ireland,Country_United Republic of Tanzania,Country_United States of America,Country_Uruguay,Country_Uzbekistan,"Country_Venezuela, Bolivarian Republic of...",Country_Viet Nam,Country_Yemen,Country_Zambia,Country_Zimbabwe
2,14,5,25-34 years old,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,20,17,35-44 years old,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,6,6,25-34 years old,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,5,2,18-24 years old,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
11,12,10,35-44 years old,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73114,7,2,18-24 years old,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
73116,21,16,35-44 years old,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
73118,4,3,25-34 years old,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
73119,5,1,25-34 years old,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
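
The column names in the output above, such as `Employment_Employed, full-time;Employed, part-time`, show that `pd.get_dummies` treats every combination of answers as its own category. One alternative, sketched below on made-up rows, is to split multi-select answers on `;` so that each option gets a single indicator column. The toy `employment` series is an assumption for illustration and is not taken from the survey file.

```python
import pandas as pd

# Toy multi-select answers in the same ';'-separated style as the survey
# (illustrative values only)
employment = pd.Series([
    "Employed, full-time",
    "Employed, full-time;Employed, part-time",
    "Employed, part-time;Retired",
])

# Series.str.get_dummies splits each string on the separator and
# creates one 0/1 indicator column per distinct option
employment_flags = employment.str.get_dummies(sep=";").add_prefix("Employment_")

print(employment_flags)
```

With this encoding a respondent who ticked two options has two indicators set, so the number of columns grows with the number of options rather than with the number of observed combinations.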


In [90]:
# select target variable

target = data[target_col]

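# select features
# NOTE: these are the raw columns and still contain strings;
# a model needs the encoded frame (data_encoded) instead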

features = data[features_cols]

Unnamed: 0,ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,...,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
2,3,"I am not primarily a developer, but I write co...","Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Friend or family member...,Technical documentation;Blogs;Programming Game...,,14,...,,,,,,,,Appropriate in length,Neither easy nor difficult,40205.0
3,4,I am a developer by profession,"Employed, full-time",Fully remote,I don’t code outside of work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Books / Physical media;School (i.e., Universit...",,,20,...,,,,,,,,Appropriate in length,Easy,215232.0
8,9,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",I don’t code outside of work,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",On the job training;Coding Bootcamp,,,6,...,15-30 minutes a day,Over 120 minutes a day,Somewhat long,Innersource initiative;DevOps function;Microse...,Yes,Yes,Yes,Appropriate in length,Easy,49056.0
10,11,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",I don’t code outside of work,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Friend or family member...,Technical documentation;Blogs;Written Tutorial...,,5,...,,,,,,,,Appropriate in length,Easy,60307.0
11,12,"I am not primarily a developer, but I write co...","Employed, full-time;Independent contractor, fr...",Fully remote,Hobby;Contribute to open-source projects;Freel...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,,12,...,30-60 minutes a day,60-120 minutes a day,Just right,Innersource initiative;DevOps function;Microse...,Yes,Yes,No,Too short,Easy,194400.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73114,73115,I am a developer by profession,"Employed, full-time;Independent contractor, fr...","Hybrid (some remote, some in-person)",Hobby;Contribute to open-source projects;Boots...,"Associate degree (A.A., A.S., etc.)",Books / Physical media;Other online resources ...,Technical documentation;Blogs;Programming Game...,,7,...,,,,,,,,Too long,Neither easy nor difficult,41058.0
73116,73117,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Contribute to open-source projects;Freel...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Other online resources ...,Technical documentation;Written Tutorials,,21,...,30-60 minutes a day,Less than 15 minutes a day,Very short,DevOps function;Microservices,No,No,Yes,Appropriate in length,Easy,115000.0
73118,73119,I am a developer by profession,"Employed, full-time",Fully remote,Hobby;Contribute to open-source projects;Freel...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Friend or family member;Other online resources...,Technical documentation;Blogs;How-to videos,,4,...,,,,,,,,Appropriate in length,Easy,57720.0
73119,73120,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Other (please specify):,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Online Courses or Certification;Coding Bootcamp,,Coursera;Udemy;Pluralsight;Other,5,...,Over 120 minutes a day,Less than 15 minutes a day,Just right,,No,Yes,,Appropriate in length,Neither easy nor difficult,70000.0
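
`data[features_cols]` still holds the raw answers, so `YearsCode` and `YearsCodePro` are strings rather than numbers. In the public survey file these columns appear to mix digits with text answers such as `Less than 1 year` and `More than 50 years`; treat those exact labels, and the stand-in values 0.5 and 51, as assumptions to check against the data. A minimal sketch of the conversion, using a hypothetical helper `years_to_number`:

```python
import pandas as pd

# Assumed text answers and the numeric stand-ins chosen for them
# (both are illustrative choices, not part of the original notebook)
TEXT_ANSWERS = {"Less than 1 year": 0.5, "More than 50 years": 51}

def years_to_number(column: pd.Series) -> pd.Series:
    """Convert a years-of-experience column to floats; unknown strings become NaN."""
    replaced = column.map(lambda value: TEXT_ANSWERS.get(value, value))
    return pd.to_numeric(replaced, errors="coerce")

toy_years = pd.Series(["14", "Less than 1 year", "More than 50 years", "5"])
converted = years_to_number(toy_years)
print(converted.tolist())
```

In the notebook this helper could be applied to both columns before the split, for example `data_encoded['YearsCode'] = years_to_number(data['YearsCode'])`.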


In [73]:
# TODO: decide whether the numeric variables should be normalized
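
Whether to normalize depends on the model: ordinary least squares predictions are unaffected by rescaling the features, while regularized or distance-based regressors usually are sensitive to it. If scaling is used, the scaler is typically fitted on the training rows only and then applied to the other splits. A minimal sketch with scikit-learn's `StandardScaler`, on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features: the columns play the role of YearsCode and
# YearsCodePro (made-up values)
X_train_toy = np.array([[14.0, 5.0], [20.0, 17.0], [6.0, 6.0], [5.0, 2.0]])
X_test_toy = np.array([[12.0, 10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test_toy)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately zero for every column
```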

In [84]:
from sklearn.model_selection import train_test_split

# Split the data into train, validation, and test sets (approximately 70:10:20):
# first hold out 30% of the rows, then send 67% of that holdout to the test set
X_train, X_val_test, y_train, y_val_test = train_test_split(
    features,
    target,
    test_size=0.3,
    random_state=42
)

X_val, X_test, y_val, y_test = train_test_split(
    X_val_test,
    y_val_test,
    test_size=0.67,
    random_state=42
)

# check the shapes of the resulting train, validation, and test sets
# X and y have the same number of rows in each split

print('Train set shape: ', X_train.shape)
print('Validation set shape: ', X_val.shape)
print('Test set shape: ', X_test.shape)

Train set shape:  (26423, 7)
Validation set shape:  (3737, 7)
Test set shape:  (7588, 7)
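
The two `test_size` values combine as follows: the first call holds out 30% of the rows, and the second sends 67% of that holdout to the test set, which gives roughly 70:10:20. The sketch below reproduces the row counts printed above, assuming `train_test_split` rounds the test portion up with `ceil` (an assumption that matches these outputs):

```python
import math

n_samples = 37748  # rows left after dropping missing values

# First split: 30% held out for validation + test
n_holdout = math.ceil(0.3 * n_samples)
n_train = n_samples - n_holdout

# Second split: 67% of the holdout becomes the test set
n_test = math.ceil(0.67 * n_holdout)
n_val = n_holdout - n_test

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_samples:.1%})")
```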


In [85]:
# Baseline linear regression; snippet borrowed from elsewhere and not yet verified

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

regressor = LinearRegression()  # Create an instance of the regressor
regressor.fit(X_train, y_train)  # Train the model

y_pred = regressor.predict(X_test)  # Predict the target variable

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Mean Absolute Error:', mae)
print('R-squared:', r2)

ValueError: could not convert string to float: 'Employed, full-time'
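
The error arises because `LinearRegression` receives the raw string columns. One way around it, sketched here on a made-up frame rather than the survey data, is to put the encoding inside a scikit-learn `Pipeline` so that the same transformation is applied at fit and predict time. The toy values and the choice of `OneHotEncoder(handle_unknown="ignore")` are assumptions, not the notebook's final method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the survey features and target (made-up values)
X_toy = pd.DataFrame({
    "Employment": ["Employed, full-time", "Employed, part-time",
                   "Employed, full-time", "Employed, part-time"],
    "YearsCodePro": [5.0, 2.0, 17.0, 6.0],
})
y_toy = pd.Series([60000.0, 30000.0, 120000.0, 45000.0])

# One-hot encode the string column and pass numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Employment"]),
    ],
    remainder="passthrough",
)

model = Pipeline(steps=[("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X_toy, y_toy)

# A category unseen during training is encoded as all zeros instead of raising an error
X_new = pd.DataFrame({"Employment": ["Retired"], "YearsCodePro": [10.0]})
y_pred_toy = model.predict(X_new)
print(y_pred_toy)
```

Keeping the encoder inside the pipeline also means it is fitted on the training split only, which avoids leaking category information from the validation and test sets.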