### Which programming style factors affect salary?

This section examines whether programming style factors (such as the IDE a developer uses or a preference for tabs or spaces) contribute to salary. To estimate that contribution, we build machine learning models that predict salary from these factors.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import r2_score, mean_squared_error
from collections import defaultdict
import seaborn as sns
%matplotlib inline

df = pd.read_csv('./survey_results_public.csv')
df.head()

Unnamed: 0,Respondent,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,...,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
0,1,Student,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school,,,,...,Strongly disagree,Male,High school,White or of European descent,Strongly disagree,Strongly agree,Disagree,Strongly agree,,
1,2,Student,"Yes, both",United Kingdom,"Yes, full-time",Employed part-time,Some college/university study without earning ...,Computer science or software engineering,"More than half, but not all, the time",20 to 99 employees,...,Strongly disagree,Male,A master's degree,White or of European descent,Somewhat agree,Somewhat agree,Disagree,Strongly agree,,37500.0
2,3,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A professional degree,White or of European descent,Somewhat agree,Agree,Disagree,Agree,113750.0,
3,4,Professional non-developer who sometimes write...,"Yes, both",United States,No,Employed full-time,Doctoral degree,A non-computer-focused engineering discipline,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A doctoral degree,White or of European descent,Agree,Agree,Somewhat agree,Strongly agree,,
4,5,Professional developer,"Yes, I program as a hobby",Switzerland,No,Employed full-time,Master's degree,Computer science or software engineering,Never,10 to 19 employees,...,,,,,,,,,,


### Data understanding and preparation
We first extract all the columns related to programming style and, because they are categorical, create dummy columns for them.

In [2]:
sal_rm = df.dropna(subset=['Salary'], axis=0)
sal_nona = df.dropna(subset=['Salary', 'TabsSpaces', 'WorkStart', 'HaveWorkedLanguage', 
                             'HaveWorkedFramework', 'HaveWorkedDatabase', 'HaveWorkedPlatform', 'IDE', 
                             'AuditoryEnvironment', 'Methodology', 'VersionControl', 'CheckInCode'], axis=0)

In [3]:
sal_rm

Unnamed: 0,Respondent,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,...,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
2,3,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A professional degree,White or of European descent,Somewhat agree,Agree,Disagree,Agree,113750.000000,
14,15,Professional developer,"Yes, I program as a hobby",United Kingdom,No,Employed full-time,Professional degree,Computer engineering or electrical/electronics...,All or almost all the time (I'm full-time remote),"5,000 to 9,999 employees",...,Disagree,Male,High school,White or of European descent,Somewhat agree,Agree,Disagree,Agree,100000.000000,
17,18,Professional developer,"Yes, both",United States,"Yes, part-time",Employed full-time,Bachelor's degree,Computer science or software engineering,All or almost all the time (I'm full-time remote),"1,000 to 4,999 employees",...,Disagree,Male,A master's degree,"Native American, Pacific Islander, or Indigeno...",Disagree,Agree,Disagree,Agree,130000.000000,
18,19,Professional developer,"Yes, I program as a hobby",United States,No,Employed full-time,Bachelor's degree,Computer science or software engineering,A few days each month,"10,000 or more employees",...,,,,,,,,,82500.000000,
22,23,Professional developer,No,Israel,No,Employed full-time,Bachelor's degree,Computer engineering or electrical/electronics...,A few days each month,500 to 999 employees,...,Somewhat agree,Male,A bachelor's degree,White or of European descent,Strongly agree,Somewhat agree,Somewhat agree,Agree,100764.000000,
25,26,Professional developer,"Yes, I program as a hobby",United States,No,Employed full-time,Master's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A master's degree,White or of European descent,Disagree,Strongly agree,Disagree,Strongly agree,175000.000000,
34,35,Professional developer,"Yes, I program as a hobby",Croatia,"Yes, full-time",Employed full-time,Bachelor's degree,Computer engineering or electrical/electronics...,A few days each month,10 to 19 employees,...,Strongly disagree,Male,A master's degree,White or of European descent,Disagree,Agree,Strongly disagree,Agree,14838.709677,
36,37,Professional developer,"Yes, I program as a hobby",Argentina,No,Employed full-time,Some college/university study without earning ...,Computer programming or Web development,A few days each month,500 to 999 employees,...,Disagree,Male,A bachelor's degree,Hispanic or Latino/Latina,Somewhat agree,Agree,Strongly disagree,Strongly agree,28200.000000,
37,38,Professional developer,"Yes, both",Germany,No,Employed full-time,Some college/university study without earning ...,Mathematics or statistics,All or almost all the time (I'm full-time remote),100 to 499 employees,...,Disagree,Male,A master's degree,White or of European descent,Somewhat agree,Agree,Disagree,Agree,118279.569892,
52,53,Professional developer,"Yes, I program as a hobby",Brazil,No,Employed full-time,Bachelor's degree,Computer engineering or electrical/electronics...,A few days each month,"1,000 to 4,999 employees",...,Disagree,Male,A doctoral degree,Hispanic or Latino/Latina; White or of Europea...,Somewhat agree,Agree,Disagree,Strongly agree,15674.203822,


In [4]:
sal_nona

Unnamed: 0,Respondent,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,...,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
37,38,Professional developer,"Yes, both",Germany,No,Employed full-time,Some college/university study without earning ...,Mathematics or statistics,All or almost all the time (I'm full-time remote),100 to 499 employees,...,Disagree,Male,A master's degree,White or of European descent,Somewhat agree,Agree,Disagree,Agree,118279.569892,
52,53,Professional developer,"Yes, I program as a hobby",Brazil,No,Employed full-time,Bachelor's degree,Computer engineering or electrical/electronics...,A few days each month,"1,000 to 4,999 employees",...,Disagree,Male,A doctoral degree,Hispanic or Latino/Latina; White or of Europea...,Somewhat agree,Agree,Disagree,Strongly agree,15674.203822,
64,65,Professional developer,No,Netherlands,No,Employed full-time,Master's degree,Computer science or software engineering,A few days each month,"10,000 or more employees",...,Disagree,Male,A doctoral degree,White or of European descent,Strongly agree,Somewhat agree,Disagree,Agree,69892.473118,
72,73,Professional developer,"Yes, both",United States,No,Employed full-time,Some college/university study without earning ...,Computer science or software engineering,All or almost all the time (I'm full-time remote),100 to 499 employees,...,Disagree,Male,A bachelor's degree,White or of European descent,Disagree,Somewhat agree,Strongly disagree,Agree,120000.000000,
83,84,Professional developer,"Yes, both",United States,No,Employed full-time,Doctoral degree,Computer science or software engineering,A few days each month,"10,000 or more employees",...,Somewhat agree,Male,A doctoral degree,White or of European descent,Agree,Somewhat agree,Disagree,Agree,165000.000000,
133,134,Professional developer,"Yes, both",India,No,Employed full-time,Bachelor's degree,Computer science or software engineering,A few days each month,Fewer than 10 employees,...,Strongly disagree,Male,A bachelor's degree,South Asian,Somewhat agree,Agree,Strongly disagree,Strongly agree,14682.131846,
143,144,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree,Computer science or software engineering,Never,Fewer than 10 employees,...,Disagree,Male,A bachelor's degree,White or of European descent,Somewhat agree,Somewhat agree,Disagree,Strongly agree,43750.000000,
146,147,Professional developer,"Yes, I program as a hobby",Denmark,"Yes, part-time",Employed part-time,Some college/university study without earning ...,Computer science or software engineering,A few days each month,100 to 499 employees,...,Strongly disagree,Male,"Some college/university study, no bachelor's d...",White or of European descent,Strongly agree,Disagree,Strongly disagree,Somewhat agree,51282.051282,
156,157,Professional developer,No,United States,No,Employed full-time,Bachelor's degree,Computer science or software engineering,A few days each month,500 to 999 employees,...,Strongly disagree,Male,"Some college/university study, no bachelor's d...",White or of European descent,Disagree,Agree,Disagree,Strongly agree,80000.000000,
173,174,Professional developer,"Yes, both",Ukraine,No,Employed full-time,Master's degree,Computer science or software engineering,"Less than half the time, but at least one day ...",100 to 499 employees,...,Somewhat agree,Male,A master's degree,White or of European descent,Somewhat agree,Somewhat agree,Somewhat agree,Strongly agree,27000.000000,


We can see that dropping every row with a null value in any of these columns would leave only 1429 rows, compared with 5009 rows when we drop only the rows with a missing salary. Hence it is better to impute the null values of the categorical data with an additional N/A dummy column.
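The trade-off can be checked on a toy frame. The sketch below uses made-up data (the values are illustrative, not taken from the survey) to compare the two `dropna` strategies and to show how `dummy_na=True` keeps rows with missing answers by giving NaN its own indicator column.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the survey; all values are made up for illustration.
toy = pd.DataFrame({
    'Salary': [50000.0, 60000.0, np.nan, 80000.0, 90000.0],
    'TabsSpaces': ['Tabs', np.nan, 'Spaces', 'Spaces', np.nan],
    'CheckInCode': ['Once a day', 'Never', 'Never', np.nan, 'Once a day'],
})

# Strategy 1: drop only the rows with a missing salary.
salary_only = toy.dropna(subset=['Salary'])

# Strategy 2: drop rows with a missing value in any of the listed columns.
all_cols = toy.dropna(subset=['Salary', 'TabsSpaces', 'CheckInCode'])

print(len(salary_only), len(all_cols))

# Encoding NaN as its own category keeps every row from strategy 1.
encoded = pd.get_dummies(salary_only['TabsSpaces'], prefix='TabsSpaces', dummy_na=True)
print(encoded.columns.tolist())
```

The second strategy discards any respondent who skipped even one question, which is why it shrinks the data so sharply.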

In [5]:

dev_style_cols = ['TabsSpaces', 'WorkStart', 'HaveWorkedLanguage', 'HaveWorkedFramework', 'HaveWorkedDatabase',
            'HaveWorkedPlatform', 'IDE', 'AuditoryEnvironment', 'Methodology', 'VersionControl', 'CheckInCode']
prog_style_df = sal_rm[dev_style_cols]
salary = sal_rm['Salary']

def create_dummy_df(df, cat_cols, dummy_na):
    '''
    INPUT:
    df - pandas dataframe with categorical variables you want to dummy
    cat_cols - list of strings that are associated with names of the categorical columns
    dummy_na - Bool holding whether you want to dummy NA vals of categorical columns or not
    
    OUTPUT:
    df - a new dataframe that has the following characteristics:
            1. contains all columns that were not specified as categorical
            2. removes all the original columns in cat_cols
            3. dummy columns for each of the categorical columns in cat_cols
            4. if dummy_na is True - it also contains dummy columns for the NaN values
            5. Use a prefix of the column name with an underscore (_) for separating 
    '''
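    for col in cat_cols:
        try:
            # Build dummy columns for this categorical column
            # (NaN gets its own column when dummy_na is True)
            dummies = pd.get_dummies(df[col], prefix=col, prefix_sep='_',
                                     drop_first=True, dummy_na=dummy_na)
            # Replace the original column with its dummy columns
            df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
        except KeyError:
            # Skip columns that are not present in the dataframe
            continue
    return df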

def CategorizeSalary(salary):
    result = pd.qcut(salary, 10, labels=False)
    return result

prog_style_dummy = create_dummy_df(prog_style_df, dev_style_cols, True)
prog_style_dummy.head()

Unnamed: 0,TabsSpaces_Spaces,TabsSpaces_Tabs,TabsSpaces_nan,WorkStart_10:00 PM,WorkStart_11:00 AM,WorkStart_11:00 PM,WorkStart_1:00 AM,WorkStart_1:00 PM,WorkStart_2:00 AM,WorkStart_2:00 PM,...,VersionControl_Team Foundation Server,VersionControl_Visual Source Safe,VersionControl_Zip file back-ups,VersionControl_nan,CheckInCode_A few times a week,CheckInCode_Just a few times over the year,CheckInCode_Multiple times a day,CheckInCode_Never,CheckInCode_Once a day,CheckInCode_nan
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
14,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
17,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
18,0,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
22,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
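One caveat about this encoding: some survey questions allow several answers. The truncated Race values in the outputs above show such answers joined with `; `, and the sketch below assumes the `HaveWorked*` columns are stored the same way (an assumption; check the raw data). In that case `pd.get_dummies` treats every distinct combination as its own category, whereas `Series.str.get_dummies` produces one indicator per individual answer. The toy values are made up.

```python
import pandas as pd

# Hypothetical multi-select answers, joined with '; ' (assumed storage format).
langs = pd.Series(['Python; SQL', 'Java', 'Python', 'Java; Python; SQL'],
                  name='HaveWorkedLanguage')

# One dummy column per distinct combination of answers.
combo = pd.get_dummies(langs, prefix='HaveWorkedLanguage')

# One indicator column per individual answer.
per_answer = langs.str.get_dummies(sep='; ').add_prefix('HaveWorkedLanguage_')

print(combo.shape[1])
print(per_answer.columns.tolist())
```

If the assumption holds, the per-answer form yields far fewer and much denser columns, which may matter for the models fitted later.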


### Modelling
We first try a linear regression model to predict salary, binned into deciles by `CategorizeSalary`. Because linear regression may not fit well, we also fit a decision tree regressor for comparison.
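Note that the cell below does not regress on the raw salary: the target is the output of `CategorizeSalary`, which bins salaries into ten quantile groups with `pd.qcut`. A minimal sketch of that binning on made-up, evenly spaced values:

```python
import pandas as pd

# Made-up salaries: 1000, 2000, ..., 100000.
salaries = pd.Series(range(1, 101)) * 1000.0

# Same call as CategorizeSalary: ten quantile bins, returned as integer codes.
deciles = pd.qcut(salaries, 10, labels=False)

print(deciles.min(), deciles.max())
print(deciles.value_counts().sort_index().tolist())
```

With `labels=False` the result is an integer code from 0 (lowest decile) to 9 (highest), so the models predict a salary rank rather than an amount.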

In [6]:
X = prog_style_dummy
min_max_scaler = preprocessing.MinMaxScaler()
#y = min_max_scaler.fit_transform(salary.values.reshape(-1, 1))
y_cat = CategorizeSalary(salary)

X_train, X_test, y_train, y_test = train_test_split(X, y_cat , test_size=.30, random_state=42)

# Instantiate. The `normalize` argument was removed in scikit-learn 1.2, so this line requires an older version.
lm_model = LinearRegression(normalize = True)
lm_model.fit(X_train, y_train) #Fit

tree_model = DecisionTreeRegressor()
tree_model.fit(X_train, y_train)

#Predict using your model
y_test_preds = lm_model.predict(X_test)
y_train_preds = lm_model.predict(X_train)

y_test_preds_tree = tree_model.predict(X_test)
y_train_preds_tree = tree_model.predict(X_train)

#Score using your model
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)

test_score_tree = r2_score(y_test, y_test_preds_tree)
train_score_tree = r2_score(y_train, y_train_preds_tree)
print("The rsquared on the training data was {}.  The rsquared on the test data was {}.".format(train_score, test_score))
print("LOGIS: The rsquared on the training data was {}.  The rsquared on the test data was {}.".format(train_score_tree, test_score_tree))

The rsquared on the training data was 0.24269874093991195.  The rsquared on the test data was -1.8520610214013103e+29.
LOGIS: The rsquared on the training data was 0.9646497470281983.  The rsquared on the test data was -0.6903688041693936.


### Evaluation
On the training data, the linear regression model performs poorly (R² of about 0.243). The decision tree regressor (labelled `LOGIS` in the output above) reaches a training R² of 0.965, which looks good. However, the test data exposes the weakness of both models: the test R² is negative for each (about -1.85e29 for linear regression and -0.69 for the tree), meaning they predict worse than always guessing the mean. Both models overfit the training data and do not generalize to new data.
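The gap between training and test scores can be reproduced on synthetic data. The sketch below is not the survey model: it fits an unrestricted `DecisionTreeRegressor` and a depth-limited one (`max_depth=3`, an arbitrary illustrative choice) to noisy made-up data and prints R² on both splits.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one informative feature, four noise features, noisy target.
rng = np.random.RandomState(42)
X = rng.uniform(0, 1, size=(400, 5))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

models = [
    ('unrestricted', DecisionTreeRegressor(random_state=0)),
    ('max_depth=3', DecisionTreeRegressor(max_depth=3, random_state=0)),
]

# Store (train R^2, test R^2) for each model.
scores = {}
for name, model in models:
    model.fit(X_train, y_train)
    scores[name] = (r2_score(y_train, model.predict(X_train)),
                    r2_score(y_test, model.predict(X_test)))
    print(name, scores[name])
```

The unrestricted tree can memorize every training row, so its training R² is essentially 1.0 while its test R² is much lower, the same overfitting signature seen above. Limiting depth is one possible way to narrow that gap; whether it helps on the survey data would need to be checked.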