# Worldwide Developer Survey by Stack Overflow 2017 - An ML workflow

* [Overview](#Overview) 

* [Section1 - Pre-processing](#1.Preprocessing) 

* [Section2 - Exploratory data analysis](#2.Exploratory-data-analysis)
> * [Career satisfaction vs Job satisfaction](#2.1.Career-satisfaction-vs-Job-satisfaction)
> * [Career satisfaction vs Home remote](#2.2.Career-satisfaction-vs-Home-remote)
> * [Data scientists vs other developers](#2.3.Data-scientists-vs-other-developers)
> * [Plotting career satisfaction with plotly](#2.4.Plotting-career-satisfaction-with-plotly) 
> * [Salary Analysis](#2.5.Salary-Analysis) 
> * [Salary vs Career satisfaction](#2.6.Salary-vs-Career-satisfaction)
> * [Plotting Worldwide Salary with plotly](#2.7.Plotting-worldwide-salary-with-plotly)
> * [Tabs vs Spaces](#2.8.Tabs-vs-Spaces)
> * [Preferred benefit analysis](#2.9.Preferred-benefit-analysis) 
> * [Programming experience vs satisfaction](#2.10.Programming-experience-vs-satisfaction)
* [Section3 - Data wrangling/feature engineering](#3.Data-wrangling-and-feature-engineering)
> * [Data wrangling](#3.1.Data-wrangling)
> * [Feature selection](#3.2.Feature-selection)
> * [PCA analysis](#3.3.PCA-analysis)
* [Section4 - Machine Learning](#4.Machine-Learning)
> * [Clustering](#4.1.Clustering) 
> * [Classification](#4.2.Classification) 
> * [Regression](#4.3.Regression) 
* [Section5 - Summary](#5.Summary)

## Overview

This is an analysis of the 2017 developer survey conducted by Stack Overflow. 
The following steps are carried out.

1. An exploratory data analysis (EDA) is done first to discover any relationships between developers' salary and career satisfaction, and between career satisfaction/salary and the other input variables.
2. Clustering models are applied to see how the respondents group together.
3. Classification models are then applied to see whether they correctly recover the labels assigned by the clustering models in the second step.
4. Lastly, regression models are run to estimate developers' salaries from the input variables.
5. Key findings from the analysis are outlined in the Summary section at the end.

**Note**: Since salary values are provided in local currencies, they are converted to a common global scale using the purchasing power parity (PPP) index so that they can be compared. 
> This is done keeping in mind that the cost of living is taken into account when salaries are set in each country.
> Converting all local currencies to USD using the currency exchange rate alone would therefore not give a true picture. 
> For example, if a developer's salary is 100K USD in the USA, it does not literally translate into a salary of Rs.7,000,000 (at Rs.70/USD) in India. 
> Since the cost of living is lower in India, the equivalent salary would be $100,000 x 18 = Rs.1,800,000 or 18L (assuming 18 is the prevailing PPP index of India).
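
As a quick worked check of the arithmetic in the note, the sketch below converts the same local salary both ways. The function names are illustrative and not part of the notebook, and the rates 18 and 70 are the assumed values from the example above.

```python
def to_ppp_dollars(local_salary, ppp_index):
    """Convert a local-currency salary to PPP-adjusted dollars."""
    return local_salary / ppp_index


def to_exchange_rate_dollars(local_salary, exchange_rate):
    """Convert a local-currency salary to USD at the market exchange rate."""
    return local_salary / exchange_rate


salary_inr = 1_800_000  # Rs. 18L, the figure used in the note
ppp_usd = to_ppp_dollars(salary_inr, ppp_index=18)               # assumed PPP index
fx_usd = to_exchange_rate_dollars(salary_inr, exchange_rate=70)  # assumed Rs./USD rate

print(f"PPP-adjusted: {ppp_usd:,.0f}")
print(f"Exchange rate only: {fx_usd:,.0f}")
```

The two results differ by roughly a factor of four, which is why the exchange rate alone would understate the Indian salary in this example.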

Import necessary modules

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
import matplotlib.pyplot as plt
%pylab inline


from sklearn import metrics
from sklearn.preprocessing import StandardScaler

In [None]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [None]:
from IPython.display import display
pd.options.display.max_columns = None
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
sns.set_style('darkgrid')
plt.subplots(figsize=(18,8))
sns.set(rc={'figure.figsize':(18,8)})

## 1.Preprocessing

Read the metadata file. It contains descriptions of all the columns (survey questions) of the original data frame. 

In [None]:
df_schema = pd.read_csv('../input/so-survey-2017/survey_results_schema.csv', index_col=0)

In [None]:
df_schema.head()

Reading main data file.

In [None]:
df = pd.read_csv('../input/so-survey-2017/survey_results_public.csv', index_col=0)

In [None]:
df.reset_index(inplace=True)
tot_records = len(df)
tot_columns = len(df.columns)
print ('number of records = ' + str(tot_records))
print ('number of columns = ' + str(tot_columns))
df.head(3)

**Removing unimportant fields.**

1. Remove columns where more than 90% of values are blank/NA.

In [None]:
na_columns = (df.isnull().mean() > 0.9)
na_col_list = na_columns[na_columns].index.tolist()
na_col_list = [e for e in na_col_list if e not in ('NonDeveloperType', 'ExpectedSalary')]
na_col_list

2. Remove columns that are not related to career satisfaction or salary.

In [None]:
other_col_list = [
'Respondent',
'WebDeveloperType',
'MobileDeveloperType',
'AnnoyingUI',
'HoursPerWeek',
'ResumePrompted',
'SelfTaughtTypes',
'CousinEducation',
'VersionControl',
'InfluenceInternet',
'InfluenceWorkstation',
'InfluenceHardware',
'InfluenceServers',
'InfluenceTechStack',
'InfluenceDeptTech',
'InfluenceVizTools',
'InfluenceDatabase',
'InfluenceCloud',
'InfluenceConsultants',
'InfluenceRecruitment',
'InfluenceCommunication',
'StackOverflowDescribes',
'StackOverflowSatisfaction',
'StackOverflowDevices',
'StackOverflowFoundAnswer',
'StackOverflowCopiedCode',
'StackOverflowJobListing',
'StackOverflowCompanyPage',
'StackOverflowJobSearch',
'StackOverflowNewQuestion',
'StackOverflowAnswer',
'StackOverflowMetaChat',
'StackOverflowAdsRelevant',
'StackOverflowAdsDistracting',
'StackOverflowModeration',
'StackOverflowCommunity',
'StackOverflowHelpful',
'StackOverflowBetter',
'StackOverflowWhatDo',
'StackOverflowMakeMoney',
'SurveyLong',
'QuestionsInteresting',
'QuestionsConfusing',
'InterestedAnswers']

In [None]:
drop_columns = na_col_list + other_col_list
df.drop(drop_columns, axis=1, inplace=True)
print ('number of columns after removing = ' + str(len(df.columns)))
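
The two-step column removal above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's data: the column names, the exception tuple, and the hand-picked list are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' and 'd' are entirely missing, 'c' is half missing.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan] * 4,
    'c': [1.0, np.nan, 3.0, np.nan],
    'd': [np.nan] * 4,
    'e': ['x', 'y', 'z', 'w'],
})

# Step 1: columns whose share of missing values exceeds 90%.
null_fraction = toy.isnull().mean()
mostly_null = null_fraction[null_fraction > 0.9].index.tolist()

keep_anyway = ('d',)   # hypothetical exception, kept despite being sparse
unrelated = ['e']      # hypothetical hand-picked list of unrelated columns

# Step 2: combine both lists and drop them in one call.
drop_columns = [c for c in mostly_null if c not in keep_anyway] + unrelated
toy = toy.drop(columns=drop_columns)
print(toy.columns.tolist())
```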

A quick look at the distribution of career satisfaction.

In [None]:
print(df.CareerSatisfaction.describe())

In [None]:
df.CareerSatisfaction.value_counts()

In [None]:
sns.distplot(df[(df.CareerSatisfaction.notnull())].CareerSatisfaction)

Add two new columns for analysis.
> 1. CountryCode - since PPP indexes are keyed by country code, we need this column. 
> 2. SalaryConverted - since salary values are in local currency, we need to convert them to a common scale. This is done by dividing the input salaries by the respective purchasing power parity indexes.

Convert each country name into the standard country name used in 'wikipedia-iso-country-codes.csv' and add the column "CountryCode".

In [None]:
# .copy() makes the filtered frame independent of the original,
# so later column assignments do not act on a view.
df = df[df.Country != 'I prefer not to say'].copy()

# Survey country names mapped to the names used in the ISO country-code file.
country_name_map = {
    'Iran': 'Iran, Islamic Republic of',
    'Vietnam': 'Viet Nam',
    'Slovak Republic': 'Slovakia',
    'Aland Islands': 'Åland Islands',
    'Moldavia': 'Moldova, Republic of',
    'Bolivia': 'Bolivia, Plurinational State of',
    'Macedonia': 'Macedonia, the former Yugoslav Republic of',
    'Bosnia-Herzegovina': 'Bosnia and Herzegovina',
    'Virgin Islands (USA)': 'Virgin Islands, U.S.',
    'Virgin Islands (British)': 'Virgin Islands, British',
    'South Korea': 'Korea, Republic of',
    'Taiwan': 'Taiwan, Province of China',
    'North Korea': 'Korea, Democratic People\'s Republic of',
    'S. Georgia & S. Sandwich Isls.': 'South Georgia and the South Sandwich Islands',
    'Azerbaidjan': 'Azerbaijan',
    'Venezuela': 'Venezuela, Bolivarian Republic of',
    'Syria': 'Syrian Arab Republic',
    'Tanzania': 'Tanzania, United Republic of',
    'New Caledonia (French)': 'New Caledonia',
    'Laos': 'Lao People\'s Democratic Republic',
    'Reunion (French)': 'Réunion',
    'Zaire': 'Congo, the Democratic Republic of the',
    'Cote D\'Ivoire': 'Côte d\'Ivoire',
    'Ivory Coast (Cote D\'Ivoire)': 'Côte d\'Ivoire',
    'U.S. Minor Outlying Islands': 'United States Minor Outlying Islands',
    'Polynesia (French)': 'French Polynesia',
    'French Guyana': 'French Guiana',
    'Pitcairn Island': 'Pitcairn',
    'Libya': 'Libyan Arab Jamahiriya',
    'Saint Vincent & Grenadines': 'Saint Vincent and the Grenadines',
    'Martinique (French)': 'Martinique',
    'Macau': 'Macao',
    'Falkland Islands': 'Falkland Islands (Malvinas)',
    'Tadjikistan': 'Tajikistan',
    'Heard and McDonald Islands': 'Heard Island and McDonald Islands',
    'Saint Helena': 'Saint Helena, Ascension and Tristan da Cunha',
    'Vatican City State': 'Holy See (Vatican City State)',
}

# One dict-based replace, assigned back instead of modifying a column selection in place.
df['Country'] = df['Country'].replace(country_name_map)



In [None]:
import csv
dic = {}
with open("../input/iso-country-codes/wikipedia-iso-country-codes.csv", encoding='UTF-8') as f:
    file= csv.DictReader(f, delimiter=',')
    for line in file:
        dic[line['English short name lower case']] = line['Alpha-3 code']   
countries = df.Country
df['CountryCode']=[dic[x] for x in countries]
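
The list comprehension above raises a `KeyError` as soon as one country name is missing from the lookup. A minimal sketch of how unmapped names could be listed first, assuming a miniature lookup and a toy country column (both hypothetical):

```python
import pandas as pd

# Hypothetical miniature versions of the ISO lookup and the survey column.
iso_codes = {'India': 'IND', 'Viet Nam': 'VNM', 'Germany': 'DEU'}
countries = pd.Series(['India', 'Vietnam', 'Germany', 'India'])

# Series.map returns NaN for names missing from the dict instead of raising.
codes = countries.map(iso_codes)

# Names that still need an entry in the renaming step.
unmapped = sorted(countries[codes.isnull()].unique())
print(unmapped)
```

Any name that shows up in `unmapped` would need another entry in the country renaming step before the lookup can succeed.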

In [None]:
df.iloc[:5,-3:]

Convert salaries with the PPP index and add the column 'SalaryConverted'.

In [None]:
df.Currency.isnull().mean()

Thus, about 42% of the records have a currency name populated; the value computed above is the fraction that is missing.

In [None]:
df.Currency.value_counts(dropna=False)

Add CurrencyCode column and populate with respective codes for currency names

In [None]:
df['CurrencyCode']=df.Currency.apply(lambda x: 'USA' if x == 'U.S. dollars ($)' 
                         else 'EU28'if x == 'Euros (€)'
                         else 'GBR' if x == 'British pounds sterling (£)'
                         else 'IND' if x == 'Indian rupees (?)'
                        else  'CAN' if x == 'Canadian dollars (C$)'
                        else  'POL' if x == 'Polish zloty (zl)'
                        else 'AUS' if x == 'Australian dollars (A$)'
                        else 'RUS' if x == 'Russian rubles (?)'
                        else 'BRA' if x == 'Brazilian reais (R$)'
                        else 'SWE' if x == 'Swedish kroner (SEK)'
                        else 'CHE' if x ==  'Swiss francs'
                        else 'ZAF' if x == 'South African rands (R)'
                        else 'MEX' if x == 'Mexican pesos (MXN$)'
                        else 'JPN' if x == 'Japanese yen (¥)'
                        else 'CHN' if x == 'Chinese yuan renminbi (¥)'
                        else 'SGP' if x == 'Singapore dollars (S$)'
                        else 'BTC' if x == 'Bitcoin (btc)'
                                      else np.nan)
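
The same lookup can also be written as a dictionary passed to `Series.map`, which may be easier to extend than a chain of conditionals. The sketch below uses only a subset of the currency names and a toy series; it is an alternative formulation, not the notebook's own cell.

```python
import pandas as pd

# A subset of the currency names handled above; names not listed map to NaN.
currency_to_code = {
    'U.S. dollars ($)': 'USA',
    'Euros (€)': 'EU28',
    'British pounds sterling (£)': 'GBR',
}

# Toy answers, including a skipped question and an unlisted currency.
currency = pd.Series(['Euros (€)', 'U.S. dollars ($)', None, 'Some other currency'])
currency_code = currency.map(currency_to_code)
print(currency_code.tolist())
```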

In [None]:
(df.Salary.notnull() & df.Currency.isnull()).sum()

898 records have salary without currency

In [None]:
(df[(df.CurrencyCode.notnull())].CountryCode != df[(df.CurrencyCode.notnull())].CurrencyCode).sum()

7364 records have a currency code that differs from the country code. Hence, the currency code takes precedence when converting salary; if the currency is not given, the country code is used for the conversion.

Replace null currency code with country code.

In [None]:
# Assign back instead of filling in place on an attribute-accessed column.
df['CurrencyCode'] = df['CurrencyCode'].fillna(df['CountryCode'])

In [None]:
(df.Salary.notnull() & df.CurrencyCode.isnull()).sum()

In [None]:
df.loc[:,['Country','Currency','CurrencyCode']].head(5)
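
The precedence rule (currency code first, country code as the fallback) can be seen on a three-row toy frame. The values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'CountryCode':  ['DEU', 'IND', 'USA'],
    'CurrencyCode': ['EU28', np.nan, 'GBR'],  # second respondent gave no currency
})

# The currency code wins when present; the country code only fills the gaps.
toy['CurrencyCode'] = toy['CurrencyCode'].fillna(toy['CountryCode'])
print(toy)
```

The third row keeps 'GBR' even though the country is 'USA', which corresponds to a respondent who is paid in a currency other than that of their country.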

**Convert Salary and Expected Salary into worldwide standard salary scale based on the purchasing power parity index and add two new columns "SalaryConverted" and "ExpSalaryConverted".**
> The 2016 PPP rates are used since the survey was taken in 2017.

In [None]:
df_ppp = pd.read_csv('../input/purchasing-power-parity-rates/ppp_rates.csv')
df_ppp_2016 = df_ppp[ (df_ppp.TIME==2016)]
df_ppp_2016.head(3)
df_ppp_2016_index = df_ppp_2016.loc[:, ['LOCATION', 'Value']]
df_ppp_2016_index.head(3)

In [None]:
df = df.merge(df_ppp_2016_index, how='left', left_on='CurrencyCode', right_on='LOCATION')
df.drop('LOCATION', axis=1, inplace=True)
df.rename(columns={'Value':'PPPIndex'},inplace=True)
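
A left merge silently leaves the PPP value empty for any code that is absent from the rates table. One way to list those codes is the `indicator` option of `merge`, sketched here on toy data (the salaries and the miniature PPP table are hypothetical).

```python
import pandas as pd

salaries = pd.DataFrame({'CurrencyCode': ['USA', 'IND', 'SGP'],
                         'Salary': [100000.0, 1800000.0, 83000.0]})

# Hypothetical PPP table; 'SGP' is deliberately absent.
ppp = pd.DataFrame({'LOCATION': ['USA', 'IND'], 'Value': [1.0, 18.0]})

# indicator=True adds a '_merge' column recording where each row was found.
merged = salaries.merge(ppp, how='left', left_on='CurrencyCode',
                        right_on='LOCATION', indicator=True)

# Codes present in the survey data but missing from the PPP table.
missing_codes = merged.loc[merged['_merge'] == 'left_only', 'CurrencyCode'].unique().tolist()
print(missing_codes)
```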

Populate PPP Index manually for currencies that were not listed in the ppp_rates.csv file.

In [None]:
df.loc[df.CurrencyCode =='SGP','PPPIndex'] = 0.83
df.loc[df.CurrencyCode =='PAK','PPPIndex'] = 40.0
df.loc[df.CurrencyCode =='IRN','PPPIndex'] = 6200
df.loc[df.CurrencyCode =='UKR','PPPIndex'] = 4.4
df.loc[df.CurrencyCode =='PHL','PPPIndex'] = 27.0
df.loc[df.CurrencyCode =='MYS','PPPIndex'] = 1.88
df.loc[df.CurrencyCode =='NGA','PPPIndex'] = 120.0
df.loc[df.CurrencyCode =='AFG','PPPIndex'] = 27.5
df.loc[df.CurrencyCode =='BGD','PPPIndex'] = 33.0
df.loc[df.CurrencyCode =='HKG','PPPIndex'] = 5.0
df.loc[df.CurrencyCode =='PRK','PPPIndex'] = 4.4
df.loc[df.CurrencyCode =='LKA','PPPIndex'] = 11400.0
df.loc[df.CurrencyCode =='ARE','PPPIndex'] = 2.0

Add the columns SalaryConverted and ExpSalaryConverted. To convert into a standard salary, divide the original salary by the PPP index.

In [None]:
df['SalaryConverted'] = df.Salary / df.PPPIndex
df['ExpSalaryConverted'] = df.ExpectedSalary / df.PPPIndex

# Drop PPPIndex as it will not be needed for future analysis
df.drop(['PPPIndex'],axis=1,inplace=True)

Create two data frames: one where respondents rated their career satisfaction and one where they did not.
> The rest of the analysis focuses on the data frame that has career satisfaction populated.

In [None]:
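# .copy() gives independent frames, so columns can be added to df_js later without acting on a view of df
df_js = df[df.CareerSatisfaction.notnull()].copy()
df_no_js = df[df.CareerSatisfaction.isnull()].copy()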
len(df_js) + len(df_no_js) == len(df)

In [None]:
print ('number of records in df_js = ' + str(len(df_js)))
print ('number of records in df_no_js = ' + str(len(df_no_js)))

In [None]:
df_js.groupby('Country').Country.apply(lambda x: x.count() == 1).sum()

There are 32 countries in the dataset from which only one person responded. However, those rows are retained in the analysis for completeness.
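
The same count can be obtained with `value_counts`, shown here on a toy series of country names as an equivalent and arguably more direct formulation.

```python
import pandas as pd

country = pd.Series(['India', 'India', 'Fiji', 'Germany', 'Germany', 'Malta'])

# Count respondents per country, then count the countries that appear exactly once.
single_respondent_countries = int((country.value_counts() == 1).sum())
print(single_respondent_countries)
```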

## 2.Exploratory data analysis

## 2.1.Career satisfaction vs Job satisfaction

There are two columns, 'JobSatisfaction' and 'CareerSatisfaction'. Let's compare them.

In [None]:
# The ratios below are fractions between 0 and 1, not percentages
print('fraction of rows where Job satisfaction is populated in the main dataset = ' + \
      str( (len(df[df.JobSatisfaction.notnull()]))/tot_records))
print('fraction of rows where Career satisfaction is populated in the main dataset = ' + \
      str( (len(df[df.CareerSatisfaction.notnull()]))/tot_records))

From the above, career satisfaction is populated in more records than job satisfaction.

In [None]:
ax = sns.boxplot(y=df['CareerSatisfaction'])

In [None]:
ax = sns.boxplot(y=df['JobSatisfaction'])

In [None]:
(df_js.JobSatisfaction.isnull()).sum()

Thus, 2320 respondents indicated their career satisfaction but not their job satisfaction. Let's look at who those people are.

In [None]:
df_js[(df_js.JobSatisfaction.isnull())].EmploymentStatus.value_counts()

The respondents who did not indicate job satisfaction are those who are either looking for a job or retired, which is reasonable.

In [None]:
df_js.JobSatisfaction.corr(df_js.CareerSatisfaction)

In [None]:
df_js.JobSatisfaction.corr(df_js.CareerSatisfaction, method='spearman')

In [None]:
df_js.JobSatisfaction.corr(df_js.CareerSatisfaction, method='kendall')

Thus, job satisfaction and career satisfaction are correlated.
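
Since the satisfaction ratings are ordinal, the rank-based coefficients are worth reading alongside Pearson's. The sketch below uses toy data (not the survey) to show how the two differ for a monotonic but non-linear relationship; Spearman's coefficient is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3                      # monotonic but non-linear in x

# Pearson measures linear association.
pearson = x.corr(y)

# Spearman's coefficient is the Pearson correlation of the ranks.
spearman = x.rank().corr(y.rank())

print(round(pearson, 3), round(spearman, 3))
```

The rank-based value reaches 1 because the ordering of `x` and `y` agrees perfectly, while the linear coefficient stays below 1.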

In [None]:
sns.jointplot(x='JobSatisfaction', y='CareerSatisfaction', data=df_js, kind='hex')

In [None]:
df_js_ca = df_js[ (df_js.JobSatisfaction.notnull()) & (df_js.CareerSatisfaction.notnull())]
sns.distplot(df_js_ca.JobSatisfaction,hist=False, label='JobSatisfaction')
sns.distplot(df_js_ca.CareerSatisfaction,hist=False, label='CareerSatisfaction', \
             axlabel='JobSatisfaction vs CareerSatisfaction')
plt.show()

In [None]:
df_js_ca_m = df_js_ca.loc[:,'CareerSatisfaction':'JobSatisfaction']
df_js_ca_m.head(3)
df_js_ca_p = df_js_ca_m.melt()
df_js_ca_p.head(3)
sns.countplot(x='value', hue='variable', data=df_js_ca_p)

From the above graphs, we can see that people who are happy with their current job are also happy with their overall career, and vice versa.

**Since career satisfaction and job satisfaction are comparable and career satisfaction is populated in more records, let's focus our analysis on career satisfaction only.**

In [None]:
equal_js_cs = (df_js_ca.JobSatisfaction == df_js_ca.CareerSatisfaction).sum()
print( 'Number of developers whose job satisfaction level is same as their career satisfaction = '+ str(equal_js_cs) )

In [None]:
js_gt_cs = (df_js_ca.JobSatisfaction > df_js_ca.CareerSatisfaction).sum()
print( 'Number of developers whose job satisfaction level is higher than their career satisfaction = '+ str(js_gt_cs) )

In [None]:
cs_gt_js = (df_js_ca.JobSatisfaction < df_js_ca.CareerSatisfaction).sum()
print( 'Number of developers whose job satisfaction level is lower than their career satisfaction = '+ str(cs_gt_js) )
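
The three counts above can also be produced in one pass by taking the sign of the difference between the two ratings. This is a sketch on toy ratings, and the label strings are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'JobSatisfaction':    [7, 8, 5, 9],
                    'CareerSatisfaction': [7, 6, 8, 9]})

# The sign of the difference is 0 (same), 1 (job higher) or -1 (job lower).
diff_sign = np.sign(toy['JobSatisfaction'] - toy['CareerSatisfaction'])
labels = diff_sign.map({0: 'same', 1: 'job higher', -1: 'job lower'})

counts = labels.value_counts().to_dict()
print(counts)
```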

## 2.2.Career satisfaction vs Home remote

This section is an example of the single-column analysis done for each of the 100 columns in the data frame. Only the important ones are kept here for space reasons.

One example: career satisfaction vs. the HomeRemote option.

In [None]:
sns.factorplot(y='HomeRemote',x='CareerSatisfaction',data=df_js,aspect=4,kind='bar')

In [None]:
sns.countplot(x="CareerSatisfaction", hue='HomeRemote', data=df_js)

**As we can see, people who work remotely full time are the most satisfied and people who never work remotely are the least satisfied.**

## 2.3.Data scientists vs other developers

This section compares career satisfaction between data scientists and other developers. Because 'Data scientist' shows up in both the 'DeveloperType' and 'NonDeveloperType' columns, we have to check both when deciding whether a respondent is a data scientist.

In [None]:
df_js['DeveloperType'] = df_js['DeveloperType'].astype(str)
df_js['NonDeveloperType'] = df_js['NonDeveloperType'].astype(str)

In [None]:
def DeveloperType(datascientist):
    if 'Data scientist' in datascientist:
        return 1
    else:
        return 0

In [None]:
df_js['DataScientist1'] = df_js['DeveloperType'].apply(DeveloperType)
df_js['DataScientist2'] = df_js['NonDeveloperType'].apply(DeveloperType)
df_js['DataScientist3']=df_js.DataScientist1+df_js.DataScientist2

In [None]:
def DataScientist(datascientist):
    if datascientist>=1:
        return 'DataScientist'
    else:
        return 'Others'

In [None]:
df_js['DataScientist']=df_js['DataScientist3'].apply(DataScientist)

In [None]:
df_js.drop(['DataScientist1', 'DataScientist2', 'DataScientist3' ], axis=1, inplace=True)
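
The same flag can be built without helper columns by using vectorized string matching. The following is a minimal sketch on a toy frame; it assumes, as the cells above do, that the literal text 'Data scientist' appears inside the multi-valued answers.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'DeveloperType':    ['Web developer; Data scientist', 'Mobile developer', np.nan],
    'NonDeveloperType': [np.nan, np.nan, 'Data scientist'],
})

# na=False treats a missing answer as "not a data scientist".
is_ds = (toy['DeveloperType'].str.contains('Data scientist', na=False)
         | toy['NonDeveloperType'].str.contains('Data scientist', na=False))

toy['DataScientist'] = np.where(is_ds, 'DataScientist', 'Others')
print(toy['DataScientist'].tolist())
```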

In [None]:
sns.factorplot(y='DataScientist',x='CareerSatisfaction',data=df_js,aspect=4,kind='bar')

In [None]:
sns.factorplot(y='DataScientist',x='SalaryConverted',data=df_js,aspect=4,kind='bar')

In [None]:
df_js[df_js.SalaryConverted.notnull()].groupby('DataScientist').SalaryConverted.median()

In [None]:
df_js.groupby('DataScientist').CareerSatisfaction.mean()

**Finding: as we can see, data scientists are slightly happier and earn more on average than other types of developers.**

## 2.4.Plotting career satisfaction with plotly

In [None]:
df_js.groupby('Country')['CareerSatisfaction'].agg(['mean','count']).nlargest(5,'mean')

In [None]:
num_countries = len(df_js.groupby('Country')['CareerSatisfaction'].agg(['mean','count']))
print('Number of countries = '+ str(num_countries))

In [None]:
df_worldsatisfaction=df_js.groupby('CountryCode',as_index=False)['CareerSatisfaction'].agg(['mean','count'],)
df_worldsatisfaction1=df_worldsatisfaction.reset_index()
df_worldsatisfaction1.to_csv('worldsatisfaction.csv')
df_worldsatisfaction1.head()

In [None]:
df_ws = pd.read_csv('worldsatisfaction.csv')

In [None]:
df_ws.head(5)

In [None]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly as py

In [None]:
data = [ dict(
        type = 'choropleth',
        locations = df_worldsatisfaction1['CountryCode'],
        z = df_worldsatisfaction1['mean'],
        text = df_worldsatisfaction1['count'],
        colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            tickprefix = '',
            title = 'World Career Satisfaction'),
      ) ]

layout = dict(
    title = 'World Career Satisfaction',
    geo = dict(
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'Mercator'
        )
    )
)

In [None]:
fig = dict( data=data, layout=layout )
py.offline.iplot( fig, validate=False, filename='d3-world-map' )

Interesting observation: developers in Latin American countries such as Mexico and Colombia are, on average, more satisfied than those in the US/Canada and Europe. Developers in North America are more satisfied than those in Europe on average.

## 2.5.Salary Analysis

Distribution of salary

In [None]:
sns.distplot(df_js[df_js.SalaryConverted.notnull()].SalaryConverted, bins=100)

There seem to be data quality issues with the salary data: a few respondents indicated zero salaries, and some indicated salaries of less than 1. This trend can be seen in the expected salary field as well. Respondents may have misinterpreted the survey question. These low salary values are treated as noise in the data and cleaned.

In [None]:
sns.distplot(df_js[df_js.SalaryConverted >10000].SalaryConverted)

In [None]:
sns.boxplot(y=df_js.SalaryConverted)

The box plot shows some outliers in the upper section.

In [None]:
df_js[df_js.SalaryConverted.notnull()].SalaryConverted.describe()

In [None]:
df_js[df_js.SalaryConverted.notnull()].SalaryConverted.median()

In [None]:
df_js[df_js.SalaryConverted.notnull()].groupby('EmploymentStatus').SalaryConverted.describe()

Examples of salaries from countries where only one person responded; these may or may not be representative of their country.

In [None]:
df_js[df_js.SalaryConverted.notnull()].groupby('Country').SalaryConverted.agg(['mean','count']).nlargest(5,columns='mean')

Example of salaries that are less than 10 

In [None]:
df_js[df_js.SalaryConverted < 10].groupby('Country').SalaryConverted.agg(['mean','count']).nlargest(5,columns='mean')

Distribution of expected salary

In [None]:
df_no_js[df_no_js.ExpSalaryConverted.notnull()].groupby('EmploymentStatus').ExpSalaryConverted.describe()

In [None]:
sns.distplot(df_no_js[df_no_js.ExpSalaryConverted.notnull()].ExpSalaryConverted)

In [None]:
sns.distplot(df_no_js[df_no_js.ExpSalaryConverted > 10000].ExpSalaryConverted)

In [None]:
len(df_no_js[df_no_js.ExpSalaryConverted < 1000])/len(df_no_js)

## 2.6.Salary vs Career satisfaction

Exploring whether job satisfaction/career satisfaction are correlated with salary.

In [None]:
df_js.JobSatisfaction.corr(df_js.SalaryConverted)

In [None]:
df_js.CareerSatisfaction.corr(df_js.SalaryConverted)

In [None]:
sns.factorplot(y='SalaryConverted',x='CareerSatisfaction',data=df_js,aspect=4,kind='point')

From the above graph, it is evident that people with lower salaries are less satisfied. We can also see that the satisfaction level rises as the mean salary rises.

In [None]:
sns.jointplot(y='SalaryConverted', x='CareerSatisfaction', data=df_js, kind='hex')

Many people indicated salaries close to zero; this could be a data quality issue.

In [None]:
sns.scatterplot(y='SalaryConverted', x='CareerSatisfaction', data=df_js[df_js.SalaryConverted >10000])

In [None]:
ax = sns.boxplot(y=df_js[df_js.SalaryConverted >10000].SalaryConverted, x=df_js[df_js.SalaryConverted >10000].CareerSatisfaction)

We can see that there is no pattern at the lower levels of satisfaction; however, the median salary gradually increases as the career satisfaction level increases from 5 to 9 and then comes down a little for 10. Each satisfaction level nevertheless has salaries ranging from zero. There are outliers in the upper sections. 


In [None]:
sns.scatterplot(y='SalaryConverted', x='CareerSatisfaction', hue = 'PronounceGIF',data=df_js[df_js.SalaryConverted >10000])

In [None]:
sns.scatterplot(y='SalaryConverted', x='CareerSatisfaction', hue = 'HomeRemote',data=df_js[df_js.SalaryConverted >10000])

**Salary vs Career satisfaction for USA, Germany and UK**

In [None]:
df_Country= df_js[(df_js.CountryCode=='USA') | (df_js.CountryCode=='DEU') | (df_js.CountryCode=='GBR')]

In [None]:
sns.factorplot(y='SalaryConverted',x='CareerSatisfaction',data=df_Country, hue= 'CountryCode',aspect=4,kind='point')

Individual countries show a similar pattern when satisfaction is plotted against salary. We can also see the gap between salaries in the US and the other two countries: developers in the USA earn **20,000** more on average than those in Germany and the UK.

Removing the noise in the salary data

In [None]:
df_js[df_js.SalaryConverted.notnull()].describe()

Fill null salary values with median salary of that currency code

In [None]:
df_js['sal'] = df_js.groupby('CurrencyCode').SalaryConverted.transform(lambda x: x.fillna(x.median()))

In [None]:
df_js['SalaryConverted'] = df_js['sal']
df_js.drop('sal', axis=1, inplace=True)
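
Group-wise median imputation, as used above, can be checked on a small example. The figures below are toy values and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'CurrencyCode':    ['USA', 'USA', 'USA', 'IND', 'IND'],
    'SalaryConverted': [80000.0, 100000.0, np.nan, 30000.0, np.nan],
})

# Median of the known salaries within each currency group, aligned to every row.
group_median = toy.groupby('CurrencyCode')['SalaryConverted'].transform('median')

# Only the missing salaries are replaced by their group's median.
toy['SalaryConverted'] = toy['SalaryConverted'].fillna(group_median)
print(toy['SalaryConverted'].tolist())
```

A group in which every salary is missing has no median, so its rows would stay empty after this step.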

Removing rows where salaries are less than 1000

In [None]:
df_js.SalaryConverted.describe()

In [None]:
df_js_no_zero_sal = df_js[df_js.SalaryConverted >= 1000]

In [None]:
df_js_no_zero_sal.SalaryConverted.describe()

In [None]:
print(len(df_js))
print(len(df_js_no_zero_sal))
print(len(df_js[df_js.SalaryConverted < 1000]))
print(len(df_js[df_js.SalaryConverted.isnull()]))

In [None]:
ax = sns.boxplot(y=df_js_no_zero_sal.SalaryConverted, x=df_js_no_zero_sal.CareerSatisfaction)

In [None]:
sns.jointplot(y='SalaryConverted', x='CareerSatisfaction', data=df_js_no_zero_sal, kind='hex')

In [None]:
sns.distplot(df_js_no_zero_sal.SalaryConverted)

In [None]:
sns.distplot(df_js[df_js.SalaryConverted.notnull()].SalaryConverted)

## 2.7.Plotting worldwide salary with plotly

In [None]:
df_js_1=df_js_no_zero_sal
df_worldsalary=df_js_1.groupby('CountryCode',as_index=False)['SalaryConverted'].agg(['mean','count'],)
df_worldsalary1=df_worldsalary.reset_index()
df_worldsalary1.to_csv('worldsalary.csv')
df_worldsalary1.head()

In [None]:
df_world_sal = pd.read_csv('worldsalary.csv')

data = [ dict(
        type = 'choropleth',
        locations = df_worldsalary1['CountryCode'],
        z = df_worldsalary1['mean'],
        text = df_worldsalary1['count'],
        colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            tickprefix = '',
            title = 'USD'),
      ) ]

layout = dict(
    title = 'Worldwide Salary Level',
    geo = dict(
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'Mercator'
        )
    )
)

fig = dict( data=data, layout=layout )
py.offline.iplot( fig, validate=False, filename='d3-world-map' )

## 2.8.Tabs vs Spaces

In [None]:
sns.factorplot(y='TabsSpaces',x='CareerSatisfaction',data=df_js,aspect=2,kind='bar')

Developers who use spaces are slightly more satisfied than those who use tabs.

In [None]:
sns.factorplot(y='SalaryConverted',x='TabsSpaces',data=df_js,aspect=2,kind='bar')

In [None]:
df_js.groupby('TabsSpaces').SalaryConverted.median()

However, the difference is significant when it comes to salary. Developers who use spaces earn **15,000** more on average than those who use tabs.

## 2.9.Preferred benefit analysis

Put career satisfaction into three bins based on the ratings. Satisfaction levels 1 to 4 are put into the 'Not Satisfied' bucket, 5 and 6 into 'Neutral', and 7 to 10 into 'Satisfied'.

In [None]:
# Bin the career satisfaction rating, matching the column name and the description above
df_js['CareerSatisfactionBin']=pd.cut(df_js.CareerSatisfaction, bins=[0,4,6,10], labels=['Not Satisfied', 'Neutral', 'Satisfied'])
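
The bin edges deserve a quick check, because `pd.cut` uses right-closed intervals by default. The sketch below runs the same edges on toy ratings; whether a rating of 0 actually occurs depends on the survey scale, so the `include_lowest` variant is shown only as an option.

```python
import pandas as pd

ratings = pd.Series([0, 1, 4, 5, 6, 7, 10])
labels = ['Not Satisfied', 'Neutral', 'Satisfied']

# Default right-closed bins: (0, 4], (4, 6], (6, 10]; a rating of 0 falls outside.
binned = pd.cut(ratings, bins=[0, 4, 6, 10], labels=labels)

# include_lowest=True closes the first bin on the left, so a rating of 0 is kept.
binned_with_zero = pd.cut(ratings, bins=[0, 4, 6, 10], labels=labels,
                          include_lowest=True)

print(binned.tolist())
print(binned_with_zero.tolist())
```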

Some questions have ratings given as literal responses; we need to find those columns and assign them numerical values for our analysis. A 1 to 5 scale is used to convert the literal responses, e.g. 'Very satisfied' is assigned 5 and 'Not at all satisfied' 1.

In [None]:
satisfaction_responses = ['Very satisfied','Satisfied','Somewhat satisfied', 'Not very satisfied','Not at all satisfied']
agree_responses = ['Strongly agree', 'Agree','Somewhat agree', 'Strongly disagree','Disagree']
importance_responses = ['Very important','Important','Somewhat important','Not very important','Not at all important']

Identify the columns with literal responses

In [None]:
s = (df_js.isin(importance_responses).sum() != 0)
importance_col_list = s[s].index.tolist()
importance_col_list

Use 1 to 5 scale

In [None]:
for col in importance_col_list:
    # assign back instead of replacing in place on a column selection
    df_js[col] = df_js[col].replace({'Very important':5, 'Important':4, 'Somewhat important':3,\
                        'Not very important':2, 'Not at all important':1, np.nan:0})
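
Coding a missing answer as 0 affects any sum or mean computed afterwards. The sketch below, on toy answers, compares the mean score when missing answers are skipped with the mean when they are counted as 0, to make the effect of that convention visible.

```python
import numpy as np
import pandas as pd

importance_scale = {'Very important': 5, 'Important': 4, 'Somewhat important': 3,
                    'Not very important': 2, 'Not at all important': 1}
answers = pd.Series(['Very important', np.nan, 'Not very important', 'Important'])

scores = answers.map(importance_scale)        # missing answers stay NaN here
scores_zero_filled = scores.fillna(0)         # the convention used above

mean_ignoring_missing = scores.mean()         # NaN values are skipped
mean_with_zeros = scores_zero_filled.mean()   # missing answers counted as 0

print(round(mean_ignoring_missing, 3), mean_with_zeros)
```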

Split the column list into groups based on question type

In [None]:
job_features_col_list = []
hiring_criteria_col_list = []

for col in importance_col_list:
    if col.startswith('Assess'):
        job_features_col_list.append(col)
    elif col.startswith('Important'):
        hiring_criteria_col_list.append(col)

job_features_col_list
hiring_criteria_col_list

The 'Assess Job' category asks a scenario-type question: 'When you're assessing potential jobs to apply to, how important are each of the following to you?' With the analysis below, we want to explore whether any single factor stands out to developers.

In [None]:
job_features_total_scores = df_js.groupby('CareerSatisfactionBin',as_index=False)[job_features_col_list].sum()
job_features_total_scores = job_features_total_scores.melt(id_vars=['CareerSatisfactionBin'])
job_features_total_scores.head(5)

In [None]:
job_features_satisfied=job_features_total_scores.loc[job_features_total_scores.CareerSatisfactionBin=='Satisfied']
job_features_satisfied1=job_features_satisfied.sort_values('value')
g = sns.factorplot(x='value', y='variable', data=job_features_satisfied1, aspect=2, kind='bar', size=4.5)
g.set_axis_labels("score","job_features")

In the end, we can see that 'JobProfDevel' (opportunities for professional development) and compensation are the two most important factors, whereas diversity and company leaders are considered least important by job seekers.

Next, analyze developers' behavior and personality to see if we can find anything interesting, as shown below.

In [None]:
s = (df_js.isin(agree_responses).sum() != 0)
agree_col_list = s[s].index.tolist()
agree_col_list

In [None]:
personality_col_list = agree_col_list[:16]
workstyle_col_list= list(set(agree_col_list) - set(personality_col_list))
personality_col_list
workstyle_col_list

In [None]:
for col in agree_col_list:
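    # np.nan replaces the np.NaN alias (removed in NumPy 2.0); assigning back avoids chained in-place edits
    df_js[col] = df_js[col].replace({'Strongly agree':5, 'Agree':4, 'Somewhat agree':3,
                                     'Disagree':2, 'Strongly disagree':1, np.nan:0})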

In [None]:
hiring_criteria_total_scores = df_js.groupby('CareerSatisfactionBin',as_index=False)[hiring_criteria_col_list].sum()
hiring_criteria_total_scores = hiring_criteria_total_scores.melt(id_vars=['CareerSatisfactionBin'])
hiring_criteria_total_scores.head(5)
g = sns.catplot(x='value', y='variable', data=hiring_criteria_total_scores, \
                hue='CareerSatisfactionBin', aspect=2, kind='bar', height=6)
g.set_axis_labels("score","hiring_criteria")

From the graph above, we can see that hiring personnel place the most importance on communication skills and a track record of getting things done in job seekers.

In [None]:
personality_total_scores = df_js.groupby('CareerSatisfactionBin',as_index=False)[personality_col_list].sum()
personality_total_scores = personality_total_scores.melt(id_vars=['CareerSatisfactionBin'])
personality_total_scores.head(5)
g = sns.catplot(x='value', y='variable', data=personality_total_scores, \
                hue='CareerSatisfactionBin', aspect=2, kind='bar', height=6)
g.set_axis_labels("score","personality")

Developers like Problem solving the most. Building things and Learning new technologies come next.

In [None]:
workstyle_total_scores = df_js.groupby('CareerSatisfactionBin',as_index=False)[workstyle_col_list].sum()
workstyle_total_scores = workstyle_total_scores.melt(id_vars=['CareerSatisfactionBin'])
workstyle_total_scores.head(5)
g = sns.catplot(x='value', y='variable', data=workstyle_total_scores, \
                hue='CareerSatisfactionBin', aspect=2, kind='bar', height=6)
g.set_axis_labels("score","workstyle")

Developers like to get into their own zone while coding, and they also enjoy debugging.

The following code analyzes all the standard benefits to see which ones developers value the most. The technique used is a word count, because some respondents selected several options.

First, find the list of benefits given in the question.

In [None]:
import_benefits=df_js.loc[:, ['ImportantBenefits']]
import_benefits=import_benefits[import_benefits.ImportantBenefits.notnull()]
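# raw string plus regex=True so that '\s+' is treated as a pattern on current pandas versions
import_benefits.ImportantBenefits = import_benefits.ImportantBenefits.str.replace(r'\s+', '', regex=True)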
import_benefits['ImportantBenefits'] = import_benefits['ImportantBenefits'].astype(str)
import_benefits1=import_benefits['ImportantBenefits'].str.get_dummies(sep=';')
benefits=import_benefits['ImportantBenefits'].str.lower().str.split(';')
benefits.head()

In [None]:
import_benefits1.head()

In [None]:
list(import_benefits1.columns.values)

With the list of benefits found, count the number of times each one shows up in the responses.

In [None]:
Annual_bonus = import_benefits1['Annualbonus'].sum()
Charitable_match = import_benefits1['Charitablematch'].sum()
Child_elder_care = import_benefits1['Child/eldercare'].sum()
Education_sponsorship = import_benefits1['Educationsponsorship'].sum()
Equipment = import_benefits1['Equipment'].sum()
Expected_work_hours = import_benefits1['Expectedworkhours'].sum()
Health_benefits = import_benefits1['Healthbenefits'].sum()
Longterm_leave = import_benefits1['Long-termleave'].sum()
Meals = import_benefits1['Meals'].sum()
None_of_these = import_benefits1['Noneofthese'].sum()
Private_office = import_benefits1['Privateoffice'].sum()
Professional_development_sponsorship = import_benefits1['Professionaldevelopmentsponsorship'].sum()
Remote_options = import_benefits1['Remoteoptions'].sum()
Retirement = import_benefits1['Retirement'].sum()
Stock_options = import_benefits1['Stockoptions'].sum()
Vacation = import_benefits1['Vacation/daysoff'].sum()
Others = import_benefits1['Other'].sum()

print ("Annual_bonus = %d " % Annual_bonus)
print ("Charitable_match = %d " % Charitable_match)
print ("Child_elder_care = %d " % Child_elder_care)
print ("Education_sponsorship = %d " % Education_sponsorship)
print ("Equipment count = %d " % Equipment)
print ("Expected_work_hours count = %d " % Expected_work_hours)
print ("Health_benefits count = %d " % Health_benefits)
print ("Longterm_leave count = %d " % Longterm_leave)
print ("Meals count = %d " % Meals)
print ("Private_office count = %d " % Private_office)
print ("Professional_development_sponsorship = %d " % Professional_development_sponsorship)
print ("Remote_options count = %d " % Remote_options)
print ("Retirement count = %d " % Retirement)
print ("Vacation count = %d " % Vacation)
print ("Stock_options count = %d " % Stock_options)
print ("others count = %d " % Others)
print ("None of these count = %d " % None_of_these)

Then create a new dataframe with the counts for plotting.

In [None]:
Benefit1 = {'Benefit': ['Annual_bonus','Charitable_match','Child_elder_care','Education_sponsorship','Equipment','Expected_work_hours','Health_benefits','Longterm_leave','Meals','None_of_these','Private_office','Professional_development_sponsorship','Remote_options','Retirement','Stock_options','Vacation','Others'],\
           'Score': [Annual_bonus,Charitable_match,Child_elder_care,Education_sponsorship,Equipment,Expected_work_hours,Health_benefits,Longterm_leave,Meals,None_of_these,Private_office,Professional_development_sponsorship,Remote_options,Retirement,Stock_options,Vacation,Others]}
Benefit = pd.DataFrame(data=Benefit1)
Benefit=Benefit.sort_values('Score')
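
The counting steps above can be sketched more compactly: summing the indicator columns produced by `get_dummies` gives one count per benefit without naming each column by hand. The answers below are made up, and the names `answers`, `indicator` and `benefit_counts` are illustrative only.

```python
import pandas as pd

# Toy multi-select answers separated by ';' (made-up responses)
answers = pd.Series([
    'Vacation/days off; Remote options',
    'Health benefits; Vacation/days off',
    None,
    'Remote options; Vacation/days off; Meals',
])

# Drop missing answers, strip all whitespace, then one-hot encode each option
cleaned = answers.dropna().str.replace(r'\s+', '', regex=True)
indicator = cleaned.str.get_dummies(sep=';')

# One count per benefit, in a single frame ready for plotting
benefit_counts = (indicator.sum()
                           .rename_axis('Benefit')
                           .reset_index(name='Score')
                           .sort_values('Score'))
print(benefit_counts)
```

This form also picks up any benefit that appears in the data without needing a matching hard-coded variable.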

In [None]:
sns.catplot(x='Score',y='Benefit',data=Benefit,aspect=3,kind='bar')

**Vacation**, **remote option** and **health benefits** are the top three benefits developers seek.

Single-column analysis of career satisfaction against remote working:

In [None]:
sns.catplot(y='HomeRemote',x='CareerSatisfaction',data=df_js,aspect=4,kind='bar')

After vacation, which unsurprisingly ranks highest, the option to work remotely is one of the benefits developers value most.

## 2.10.Programming experience vs satisfaction

Let's group the years of programming experience into bins roughly 5 years wide.

In [None]:
def Yearsexperience(years):
    if years == 'Less than a year':
        return "Less than a year"
    elif years == '2 to 3 years':
        return '2 to 5 years'
    elif years == '3 to 4 years':
        return '2 to 5 years'
    elif years == '4 to 5 years':
        return '2 to 5 years'
    elif years == '5 to 6 years':
        return '5 to 10 years'
    elif years == '6 to 7 years':
        return '5 to 10 years'
    elif years == '7 to 8 years':
        return '5 to 10 years'
    elif years == '8 to 9 years':
        return '5 to 10 years'
    elif years == '9 to 10 years':
        return '5 to 10 years'
    elif years == '10 to 11 years':
        return '10 to 15 years'
    elif years == '11 to 12 years':
        return '10 to 15 years'
    elif years == '12 to 13 years':
        return '10 to 15 years'
    elif years == '13 to 14 years':
        return '10 to 15 years'
    elif years == '14 to 15 years':
        return '10 to 15 years'
    elif years == '15 to 16 years':
        return 'Over 15 years'
    elif years == '16 to 17 years':
        return 'Over 15 years'
    elif years == '17 to 18 years':
        return 'Over 15 years'
    elif years == '18 to 19 years':
        return 'Over 15 years'
    elif years == '20 or more years':
        return 'Over 15 years'
    else:
        return 'Unknown'
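
The function above has no branch for labels such as '1 to 2 years' or '19 to 20 years', so any such answers in the data fall through to 'Unknown'. A minimal sketch of an alternative that parses the lower bound of each label; `bin_years` is a hypothetical helper, and folding '1 to 2 years' into the '2 to 5 years' bin is an assumption, not the original author's choice.

```python
import re
import pandas as pd

def bin_years(label):
    """Map a survey label such as '7 to 8 years' to a coarse experience bin."""
    if not isinstance(label, str):
        return 'Unknown'            # NaN / missing answers
    if label == 'Less than a year':
        return 'Less than a year'
    match = re.match(r'(\d+)', label)
    if match is None:
        return 'Unknown'
    lower = int(match.group(1))     # lower bound of the interval
    if lower < 5:
        return '2 to 5 years'       # ASSUMPTION: '1 to 2 years' is folded into this bin
    if lower < 10:
        return '5 to 10 years'
    if lower < 15:
        return '10 to 15 years'
    return 'Over 15 years'

years = pd.Series(['Less than a year', '3 to 4 years', '9 to 10 years',
                   '14 to 15 years', '20 or more years', None])
print(years.apply(bin_years).tolist())
```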

In [None]:
df_js['YearsProgram_bin'] = df_js['YearsProgram'].apply(Yearsexperience)
df_js['YearsCoded_bin'] = df_js['YearsCodedJob'].apply(Yearsexperience)

Plot out Career Satisfaction vs. Years of Programming experience

In [None]:
sns.catplot(y='YearsProgram_bin',x='CareerSatisfaction',data=df_js,aspect=4,kind='bar')

In general, the more programming experience developers have, the more satisfied they are. Developers with less than a year of programming experience are less satisfied with their careers.

In [None]:
sns.catplot(y='YearsProgram_bin',x='SalaryConverted',data=df_js,aspect=4,kind='bar')

In [None]:
df_js.groupby('YearsProgram_bin').SalaryConverted.median()

Similarly, the more experienced developers are, the higher their salary.

## 3.Data wrangling and feature engineering

## 3.1.Data wrangling

Besides salary, remote working and years of programming experience, which emerged from the sections above, the feature set adds the respondent's formal education, gender, parents' highest education, tabs-versus-spaces preference and GIF pronunciation, because the single-column analysis found that these features have the most impact on job satisfaction.

Country is not included in the features: there are too many countries, and encoding them would create too many variables.

In [None]:
df_ML = df_js_no_zero_sal.loc[:, [ 'FormalEducation', 'HomeRemote','YearsProgram_bin',  'CareerSatisfaction', 'PronounceGIF',\
                 'TabsSpaces', 'Gender', 'HighestEducationParents','SalaryConverted' ]]

In [None]:
df_ML['YearsProgram_bin'] = df_js.YearsProgram_bin

**Further simplifying the data to make it ML ready**

Cleaning up Gender column

In [None]:
df_ML.Gender.fillna('Other', inplace=True)
df_ML.loc[df_ML['Gender'].str.contains('Transgender'), 'Gender'] = 'Transgender'
df_ML.loc[df_ML['Gender'].str.contains('Female'), 'Gender'] = 'Female'
df_ML.loc[df_ML['Gender'].str.contains('Male'), 'Gender'] = 'Male'
df_ML.loc[df_ML['Gender'].str.contains('Other'), 'Gender'] = 'Other'

Cleaning up Formal Education column

In [None]:
df_ML.FormalEducation = df_ML.FormalEducation.apply(lambda x: 'Bachelors' if x=='Bachelor\'s degree'
                           else 'Masters' if x=='Master\'s degree'
                           else 'HighSchool' if x=='Some college/university study without earning a bachelor\'s degree'
                           else 'HighSchool' if x=='Secondary school'
                           else 'PhD' if x=='Doctoral degree'
                           else 'Masters' if x=='Professional degree'
                           else 'Other' if x=='I prefer not to answer'
                           else 'Elementary' if x=='Primary/elementary school'
                           else 'NoEducation' if x=='I never completed any formal education'
                           else x)

Cleaning up Highest education by parents column

In [None]:
df_ML.HighestEducationParents = df_ML.HighestEducationParents.apply(lambda x: 'Bachelors' if x=='A bachelor\'s degree'
                           else 'Masters' if x=='A master\'s degree'
                           else 'HighSchool' if x=='Some college/university study, no bachelor\'s degree'
                           else 'HighSchool' if x=='High school'
                           else 'PhD' if x=='A doctoral degree'
                           else 'Masters' if x=='A professional degree'
                           else 'Other' if x=='I prefer not to answer'
                           else 'Elementary' if x=='Primary/elementary school'
                           else 'NoEducation' if x=='No education'
                           else 'NoEducation' if x=='I don\'t know/not sure'
                           else x)
df_ML.HighestEducationParents.fillna('Other', inplace=True)
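
The chained conditional expressions above can also be written as a lookup table, which keeps each answer and its short label on one line. A minimal sketch with a made-up subset of the answers; `education_map` is an illustrative name.

```python
import pandas as pd

# Recoding table: raw survey answer -> short label (subset shown for illustration)
education_map = {
    "Bachelor's degree": 'Bachelors',
    "Master's degree": 'Masters',
    'Professional degree': 'Masters',
    'Doctoral degree': 'PhD',
    'Secondary school': 'HighSchool',
    'I prefer not to answer': 'Other',
}

raw = pd.Series(["Bachelor's degree", 'Doctoral degree', 'Something unexpected', None])

# Unmapped values pass through unchanged, like the `else x` branch above
recoded = raw.map(lambda v: education_map.get(v, v)).fillna('Other')
print(recoded.tolist())
```

A table like this also makes it easy to spot a key that belongs to a different survey question.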

Categorizing HomeRemote column into few buckets

In [None]:
df_ML.HomeRemote = df_ML.HomeRemote.apply(lambda x: '>50_percent' if x=='More than half, but not all, the time'
                           else '50_percent' if x=='About half the time'
                           else '<50_percent' if x=='Less than half the time, but at least one day each week'
                           else '<50_percent' if x=='A few days each month'
                           else '100_percent' if x=='All or almost all the time (I\'m full-time remote)'
                           else 'Other' if x=='It\'s complicated'
                           else x)  # any other answer is kept as-is
df_ML.HomeRemote.fillna('Other', inplace=True)

Cleaning up Pronounce GIF column

In [None]:
df_ML.PronounceGIF = df_ML.PronounceGIF.apply(lambda x: 'hard_g' if x=='With a hard "g," like "gift"'
                           else 'soft_g' if x=='With a soft "g," like "jiff"'
                           else 'g-i-f' if x=='Enunciating each letter: "gee eye eff"'
                           else 'Other' if x=='Some other way'
                           else x)
df_ML.PronounceGIF.fillna('Other', inplace=True)

Fill nulls with Other in the TabsSpaces column

In [None]:
df_ML.TabsSpaces.fillna('Other', inplace=True)

Preview the cleaned columns before encoding

In [None]:
df_ML.head(3)

Creating dummies

In [None]:
df_ML = pd.get_dummies(df_ML,columns=[ 'FormalEducation', 'HomeRemote','YearsProgram_bin', 'PronounceGIF',\
                 'TabsSpaces', 'Gender', 'HighestEducationParents' ])

In [None]:
print(len(df_ML.columns))
df_ML.head(3)

In [None]:
df_ML.isnull().any().sum()

Identify multi-collinearity

In [None]:
df_ML.corr().unstack().sort_values().drop_duplicates()
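
In the line above, `drop_duplicates()` removes the mirrored half of the correlation matrix by value, which would also drop two different pairs that happen to share the same correlation. A minimal sketch, on synthetic data, that keeps each pair exactly once by masking the upper triangle instead:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),   # nearly a copy of 'a'
    'c': rng.normal(size=200),                   # independent noise
})

corr = toy.corr()
# Keep only the strict upper triangle so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones.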

In [None]:
#drop _other columns as they seem to be redundant
high_corr_cols = ['Gender_Other', 'HighestEducationParents_Other','TabsSpaces_Other', 'HomeRemote_Other','PronounceGIF_Other', 'FormalEducation_Other','YearsProgram_bin_Unknown', 'PronounceGIF_soft_g']
df_ML.drop(high_corr_cols,axis =1, inplace=True)

In [None]:
#correlation heat map
sns.heatmap(df_ML.corr())

No strong correlation is visible in the heat map.

In [None]:
df_ML.var().sort_values(ascending=False).head(5)

Salary and career satisfaction have high variability compared to the rest of the variables.

## 3.2.Feature selection

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
#identify feature importance using RandomForestRegressor
# treat CareerSatisfaction as target variable

df_no_cs = df_ML.drop('CareerSatisfaction', axis=1)

model = RandomForestRegressor(random_state=1, max_depth=10)
model.fit(df_no_cs,df_ML.CareerSatisfaction)

features = df_no_cs.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-15:]  # top 15 features
plt.title('Feature Importances with target variable as CareerSatisfaction')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
# treat SalaryConverted as target variable
df_no_sal = df_ML.drop('SalaryConverted', axis=1)

model = RandomForestRegressor(random_state=1, max_depth=10)
model.fit(df_no_sal,df_ML.SalaryConverted)

features = df_no_sal.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-15:]  # top 15 features
plt.title('Feature Importances with target variable as Salary')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
# drop both CareerSatisfaction and SalaryConverted from the features; SalaryConverted is the target
df_no_sal_no_cs = df_no_cs.drop('SalaryConverted', axis=1)

model = RandomForestRegressor(random_state=1, max_depth=10)
model.fit(df_no_sal_no_cs,df_ML.SalaryConverted)

features = df_no_sal_no_cs.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-15:]  # top 15 features
plt.title('Feature Importances without CareerSatisfaction and Salary')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
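
The three cells above repeat the same fit-and-rank steps with different inputs. A minimal sketch of factoring that into one helper, demonstrated on synthetic data; `top_importances` is a hypothetical name, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def top_importances(X, y, k=15, **rf_kwargs):
    """Fit a random forest and return the k largest feature importances."""
    model = RandomForestRegressor(random_state=1, max_depth=10, **rf_kwargs)
    model.fit(X, y)
    return (pd.Series(model.feature_importances_, index=X.columns)
              .sort_values(ascending=False)
              .head(k))

# Synthetic data: only 'signal' drives the target (made-up example)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=['signal', 'n1', 'n2', 'n3'])
y = 3 * X['signal'] + rng.normal(scale=0.1, size=300)

ranked = top_importances(X, y, k=3, n_estimators=50)
print(ranked)
```

The returned Series can be plotted directly, for example with `ranked.sort_values().plot.barh()`.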

## 3.3.PCA analysis

In [None]:
from sklearn.decomposition import PCA

Normalize the salary and career satisfaction features

In [None]:
df_scaled = df_ML.copy()
df_scaled.head(3)

In [None]:
scaler = StandardScaler()
df_scaled['cscaled'] = scaler.fit_transform(df_scaled[['CareerSatisfaction']])
df_scaled['sscaled'] = scaler.fit_transform(df_scaled[['SalaryConverted']])
df_scaled.drop(['CareerSatisfaction','SalaryConverted'], axis=1, inplace=True)
df_scaled.head(2)

In [None]:
df_scaled.shape

In [None]:
# number of PCA components set to 6
pca = PCA(n_components=6)
fit = pca.fit(df_scaled)
var = fit.explained_variance_ratio_
print(fit.explained_variance_ratio_)
#print(fit.components_)
#print(fit.singular_values_)

In [None]:
plt.plot(var)

In [None]:
var1=np.cumsum(np.round(fit.explained_variance_ratio_, decimals=4)*100)
print(var1)
plt.plot(var1)

The 6 components explain 60% of the variance.
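
Rather than fixing the number of components first and reading off the variance afterwards, the count can be derived from a target share of variance. A minimal sketch on synthetic data; the 0.60 target mirrors the figure above and the data is made up.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 features with steadily decreasing variance (made-up example)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) * np.arange(10, 0, -1)

pca = PCA().fit(X)                       # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target
target = 0.60
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed, cumulative.round(3))
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.60)`, which selects the count the same way.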

In [None]:
df_pca = pca.transform(df_scaled)
print(df_pca.shape)
df_pca[0:3]

In [None]:
#mat plot the components 
plt.matshow(pca.components_,cmap='viridis')
plt.yticks([0,1,2,3,4,5],['1C','2C','3C', '4C', '5C', '6C'],fontsize=10)
plt.colorbar()
plt.xticks(range(len(df_scaled.columns)),df_scaled.columns,rotation=65,ha='left')
plt.tight_layout()
plt.show()

In [None]:
# plot the vector lengths to see relative importance
xvector = pca.components_[0] 
yvector = pca.components_[1]

xs = pca.transform(df_scaled)[0:10,0] 
ys = pca.transform(df_scaled)[0:10,1]

fig=plt.figure(figsize=(30, 30), dpi= 80, facecolor='w', edgecolor='k')

for i in range(len(xvector)):
# arrows project features (ie columns from csv) as vectors onto PC axes
    plt.arrow(0, 0, xvector[i]*max(xs)*2, yvector[i]*max(ys)*2,
              color='r', width=0.001, head_width=0.001)
    plt.text(xvector[i]*max(xs)*3, yvector[i]*max(ys)*3,
             list(df_scaled.columns.values)[i], color='y')

for i in range(len(xs)):
# circles project documents (ie rows from csv) as points onto PC axes
    plt.plot(xs[i], ys[i], 'bo')
    plt.text(xs[i]*1.2, ys[i]*1.2, list(df_scaled.index)[i], color='b')

plt.show()    

## 4.Machine Learning

Remove the low-ranked columns identified by the PCA analysis and the RandomForest regressor, and rename the columns that contain special characters, since these cause some models to error out.

In [None]:
low_rank_cols = [ 'HighestEducationParents_Bachelors','HighestEducationParents_HighSchool','HighestEducationParents_Masters','HighestEducationParents_NoEducation','HighestEducationParents_PhD']

df_ML.drop(low_rank_cols,axis =1, inplace=True)

df_ML.rename({'HomeRemote_<50_percent': 'HomeRemote_lt_50_percent', 'HomeRemote_>50_percent': 'HomeRemote_gt_50_percent'}, axis='columns', inplace=True)

## 4.1.Clustering

After several iterations of clustering using KMeans, Agglomerative, DBSCAN, HDBSCAN, MeanShift algorithms and with different hyperparameters such as number of clusters, distance metric etc, KMeans with K=3 performed best in terms of logical grouping, cluster metrics and speed. 

In [None]:
from sklearn.cluster import KMeans
from sklearn import metrics
from mpl_toolkits import mplot3d

In [None]:
df_cluster = df_ML.copy()

In [None]:
# scale salary and career satisfaction variables
scaler = StandardScaler()
df_cluster['SalaryScaled'] = scaler.fit_transform(df_cluster[['SalaryConverted']])
df_cluster['cscaled'] = scaler.fit_transform(df_cluster[['CareerSatisfaction']])
df_cluster.drop('SalaryConverted',axis=1,inplace=True)
df_cluster.drop('CareerSatisfaction',axis=1,inplace=True)

In [None]:
#plot elbow curve
wsse = []
K = range(1,6)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_cluster)
    wsse.append(km.inertia_)

plt.plot(K, wsse, 'bx-')
plt.xlabel('k')
plt.ylabel('wsse')
plt.title('Elbow Method For Optimal k')
plt.show()

K=3 seems to be optimal as per the elbow curve

In [None]:
# find out which K has highest silhouette coefficient
for k in range(2,6):
  kmeans = KMeans(n_clusters=k)
  labels = kmeans.fit_predict(df_cluster)
  centers = kmeans.cluster_centers_
  sil = metrics.silhouette_score(df_cluster,labels)
  db = metrics.davies_bouldin_score(df_cluster, labels) 
  print(k,sil, db)  

K=2 has the highest silhouette score, but K=3 has the lowest Davies-Bouldin score and the elbow method also points to 3 clusters, so we go with K=3.
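
KMeans starts from random centroids, so the scores above can shift slightly between runs. A minimal sketch of the same comparison with a fixed `random_state`, on synthetic blobs rather than the survey data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic, well-separated data with three true groups (made-up example)
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 6):
    # random_state fixes the centroid initialisation; n_init restarts guard against poor starts
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

best_k = max(scores, key=lambda k: scores[k][0])   # highest silhouette
print(best_k, scores)
```

A higher silhouette score and a lower Davies-Bouldin score both indicate better-separated clusters.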

In [None]:
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(df_cluster)
centers = kmeans.cluster_centers_
sil = metrics.silhouette_score(df_cluster,labels)
db = metrics.davies_bouldin_score(df_cluster, labels) 
print(sil, db)
df_cluster['label'] = labels

In [None]:
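#plot the clusters
# centers follow the column order used when fitting, i.e. df_cluster without 'label'
feature_cols = df_cluster.columns.drop('label')
cx, cy = feature_cols.get_loc('cscaled'), feature_cols.get_loc('SalaryScaled')

fig = plt.figure()
cm = plt.get_cmap('RdYlBu')
ax = fig.add_subplot(111)
scatter = ax.scatter(df_cluster['cscaled'],df_cluster['SalaryScaled'],
                     c=df_cluster['label'],s=10, cmap=cm)
scatter = ax.scatter(centers[:,cx],centers[:,cy],
                     c='black',s=100)
ax.set_title('K-Means Clustering')
ax.set_xlabel('CareerSatisfaction')
ax.set_ylabel('SalaryScaled')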

In [None]:
#3D plotting
# look up the centroid coordinates by column name instead of hard-coded positions
feature_cols = df_cluster.columns.drop('label')
cx, cy, cz = (feature_cols.get_loc(c) for c in
              ['cscaled', 'SalaryScaled', 'YearsProgram_bin_Over 15 years'])

fig = plt.figure()
cm = plt.get_cmap('RdYlBu')
ax = plt.axes(projection='3d')
scatter = ax.scatter3D(df_cluster['cscaled'],df_cluster['SalaryScaled'],df_cluster['YearsProgram_bin_Over 15 years'],
                     c=df_cluster['label'],s=10, cmap=cm)
scatter = ax.scatter3D(centers[:,cx],centers[:,cy],centers[:,cz],
                     c='black',s=100)
ax.set_title('K-Means Clustering')
ax.set_xlabel('CareerSatisfaction')
ax.set_ylabel('SalaryScaled')
ax.set_zlabel('Years Programming')

From the plots, three clusters emerge:
> cluster 1 = high salary, high career satisfaction; cluster 2 = high career satisfaction but low salary; cluster 3 = everyone with low career satisfaction
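
A numeric profile per cluster can back up a reading like this one. A minimal sketch on made-up values, reusing the column names `SalaryScaled`, `cscaled` and `label` from above:

```python
import pandas as pd

# Toy stand-in for df_cluster after labelling (values are made up)
toy = pd.DataFrame({
    'SalaryScaled': [1.2, 1.0, -0.8, -0.9, -0.5, 0.1],
    'cscaled':      [0.9, 1.1,  0.8,  1.0, -1.5, -1.2],
    'label':        [0,   0,    1,    1,    2,    2],
})

# Mean of each feature and number of respondents per cluster
profile = toy.groupby('label').agg(
    salary_mean=('SalaryScaled', 'mean'),
    satisfaction_mean=('cscaled', 'mean'),
    size=('cscaled', 'size'),
)
print(profile)
```

Because both features are standardized, a positive mean reads as above average and a negative mean as below average.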

## 4.2.Classification

Use the labels from clustering for classification.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import scikitplot as skplt

In [None]:
df_cluster.rename({'HomeRemote_<50_percent': 'HomeRemote_lt_50_percent', 'HomeRemote_>50_percent': 'HomeRemote_gt_50_percent'}, axis='columns', inplace=True)

In [None]:
X = df_cluster.drop(['label'],axis=1)
Y = df_cluster['label']

In [None]:
#train test split
X_train, X_test, Y_train, Y_test = \
train_test_split(X,Y,test_size=0.33,random_state = 0)

In [None]:
#use 10 folds for cross validation
nfolds = 10
kf = KFold(n_splits=nfolds,random_state=0,shuffle=True)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# LDA's n_components only affects transform() and cannot exceed n_classes - 1 (2 here), so the default is used
clfs = [DecisionTreeClassifier(), RandomForestClassifier(n_jobs=-1), GaussianNB(),KNeighborsClassifier(n_neighbors = 5),
        LogisticRegression(n_jobs=-1), AdaBoostClassifier(),
        LinearDiscriminantAnalysis(),SVC(),XGBClassifier(),MLPClassifier()]

In [None]:
#find the best classifier using f1_micro score
score_cols = ['Classifier', 'f_micro']
score_df = pd.DataFrame(columns=score_cols)

maxf1 = -1
bestCL = ""
for cl in clfs:
    fmicro = sk.model_selection.cross_val_score(cl,X_train,Y_train,cv=kf,n_jobs=-1,scoring='f1_micro').mean()
    print (str(cl) + ' ' + str(fmicro))
    score_df.loc[len(score_df)] = [str(cl),fmicro]
    if fmicro > maxf1:
        bestCL = cl
        maxf1 = fmicro
print('***********************************************')
print ('Best is... ' + str(bestCL) + ' ' + str(maxf1))

In [None]:
score_df

The MLP classifier achieved the highest cross-validation score on the training data. Let's evaluate the models on the test data to see how they fare on new data.

**Evaluation of classifiers with test dataset**

In [None]:
cla_score_cols = ['classifier', 'mco','accu_score','bal_score', 'f1_micro', 'f1_macro' ]
cla_test_score_df = pd.DataFrame(columns=cla_score_cols)

In [None]:
for cl in clfs:
    cl.fit(X_train,Y_train)
    Y_pred = cl.predict(X_test)
    mco = metrics.matthews_corrcoef(Y_test,Y_pred)
    asc = metrics.accuracy_score(Y_test,Y_pred)
    ba =  metrics.balanced_accuracy_score(Y_test,Y_pred)
    fmicro = metrics.f1_score(Y_test,Y_pred, average='micro')
    fmacro = metrics.f1_score(Y_test,Y_pred,average='macro')
    
    cla_test_score_df.loc[len(cla_test_score_df)] = [str(cl),mco,asc,ba,fmicro,fmacro]

In [None]:
cla_test_score_df

The MLP classifier performed best out of all the classifiers tested, though the difference is not huge. The decision tree classifier comes next.
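
For a single-label multiclass problem like this one, micro-averaged F1 equals accuracy, so the `accu_score` and `f1_micro` columns carry the same information; the macro average is the one that weights each class equally. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels for a three-class problem
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2, 2, 2, 2])

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average='micro')   # pools TP/FP/FN over classes
f1_macro = f1_score(y_true, y_pred, average='macro')   # unweighted mean of per-class F1
print(acc, f1_micro, f1_macro)
```

Here the small class 1 is predicted poorly, which lowers the macro score but barely moves accuracy.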

In [None]:
#confusion matrix for the last classifier fitted in the loop above (MLPClassifier)
cm = metrics.confusion_matrix(Y_test, Y_pred)
sns.heatmap(cm, annot=True)

In [None]:
X_test['actual_label'] = Y_test
X_test['predicted_label'] = Y_pred

In [None]:
#plot the misclassified data points
fig=plt.figure(figsize=(18, 8), dpi= 200, facecolor='w', edgecolor='k')
sns.scatterplot(y='SalaryScaled', x='cscaled', hue = 'actual_label',data=X_test[X_test.actual_label != X_test.predicted_label])

## 4.3.Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

In [None]:
df_reg = df_ML.copy()

In [None]:
df_reg['CountryCode'] = df_js_no_zero_sal.CountryCode # country code is needed as salary will depend upon country.
df_reg['empl_status'] = df_js_no_zero_sal.EmploymentStatus # need employment status as salaries can depend upon employment status; full time vs part time etc.

In [None]:
# encode the literal values
df_reg['empl_status'] = df_reg.empl_status.apply( lambda x: 'FT' if x == 'Employed full-time' 
                                                       else 'IC' if x == 'Independent contractor, freelancer, or self-employed'
                                                       else 'PT' if x == 'Employed part-time' else x) 

# keep only full-time, part-time and independent contractors for regression on sal
df_reg = df_reg[ (df_reg.empl_status == 'FT') | (df_reg.empl_status == 'PT') | (df_reg.empl_status == 'IC') ] 

In [None]:
df_reg.empl_status.value_counts()

In [None]:
df_reg.SalaryConverted.corr(df_reg.CareerSatisfaction,method='pearson')

Even though there does not appear to be a strong correlation between salary and career satisfaction, we drop career satisfaction as a predictor of salary: realistically, career satisfaction won't affect salary, though the reverse may be true.

In [None]:
X = df_reg.drop(['SalaryConverted', 'CareerSatisfaction'],axis=1)
Y = df_reg['SalaryConverted']

In [None]:
X_with_countries = pd.get_dummies(X) # one-hot encoding of country codes and employment status
print(X_with_countries.shape)
X_with_countries.head(3)

In [None]:
# remove predictor variables with zero coefficients using Lasso
from sklearn import linear_model 
regLasso = linear_model.Lasso()
regLasso.fit(X_with_countries,Y) 

s = pd.Series({X_with_countries.columns[i] : regLasso.coef_[i]
               for i in range(0,len(X_with_countries.columns))} )
reg_cols = s[s != 0]

X_with_countries = X_with_countries.loc[:, reg_cols.index]
print(X_with_countries.shape)
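
The selection step above relies on the L1 penalty driving the coefficients of uninformative predictors to exactly zero. A minimal sketch of that behaviour on synthetic data; the `alpha` value is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only 'x0' influences the target (made-up example)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f'x{i}' for i in range(5)])
y = 5 * X['x0'] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L1 penalty; 0.5 is an arbitrary choice here
lasso = Lasso(alpha=0.5).fit(X, y)

coefs = pd.Series(lasso.coef_, index=X.columns)
selected = coefs[coefs != 0].index.tolist()
print(selected)
```

How many columns survive depends on `alpha` and on the scale of each feature, so the penalty is usually tuned rather than left at its default.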

In [None]:
X_train, X_test, Y_train, Y_test = \
train_test_split(X_with_countries,Y,test_size=0.33,random_state = 0)

n = len(X_test)
p = len(X_test.columns)
print (n , p)

In [None]:
#consider 10 folds for cross validation
nfolds=10
kf = KFold(n_splits=nfolds,random_state=0,shuffle=True)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor


regs = [LinearRegression(), ElasticNet(), DecisionTreeRegressor(), MLPRegressor(), SVR(), RandomForestRegressor(),
        GradientBoostingRegressor(), AdaBoostRegressor(),XGBRegressor(n_estimators=100, learning_rate=0.08, gamma=0, subsample=0.75,
                           colsample_bytree=1, max_depth=7) ]



In [None]:
#find out best regressor based on mean absolute error value
# scikit-learn reports MAE as a negative score, so the best model has the highest (closest to zero) value
minMAD = -1000000000
for reg in regs:
    # reuse the 10-fold splitter `kf` defined above
    mad = sk.model_selection.cross_val_score(reg,X_train,Y_train,\
             cv=kf,scoring='neg_mean_absolute_error').mean()
    print (str(reg)[:25] + ' with mad= ' + str(mad) )
    if mad > minMAD:
        minMAD = mad
        bestREG = reg

print('***********************************************')
print ('Best Regressor is... ' + str(bestREG)[:25] )
print('**********************')
print ('With MAD Score ' + str(minMAD))
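
The sign convention in the loop above is easy to misread, because `neg_mean_absolute_error` is a score where greater is better. A minimal sketch, on synthetic data with a mean-predicting baseline, that flips the sign back and picks the lowest error:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic linear data (made-up example)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
candidates = {'mean baseline': DummyRegressor(), 'linear': LinearRegression()}

# scikit-learn negates MAE so that "greater is better"; flip the sign to report it
mae = {name: -cross_val_score(reg, X, y, cv=kf,
                              scoring='neg_mean_absolute_error').mean()
       for name, reg in candidates.items()}

best = min(mae, key=mae.get)    # lowest mean absolute error wins
print(best, mae)
```

Including a trivial baseline shows how much of the error each real model actually removes.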
        

In [None]:
reg_score_cols = ['Regressor', 'mae','median_ae', 'mse', 'r2', 'adj_r2']
reg_score_df = pd.DataFrame(columns=reg_score_cols)

In [None]:

def evaluate_reg_model(reg, x_test, y_test, y_pred):
     
    mae = metrics.mean_absolute_error(y_test,y_pred)
    medae = metrics.median_absolute_error(y_test,y_pred)
    mse = metrics.mean_squared_error(y_test,y_pred)
    r2 = metrics.r2_score(y_test, y_pred)
    adj_r2 = 1-(1-r2)*(len(y_pred)-1)/(len(y_pred)-len(x_test.columns)-1)

    reg_score_df.loc[len(reg_score_df)] = [str(reg),mae,medae,mse,r2,adj_r2]
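
The adjusted R² line above penalises R² for the number of predictors: with n rows and p predictors it is 1 - (1 - R²)(n - 1)/(n - p - 1). A minimal sketch with made-up numbers; `adjusted_r2` is a hypothetical helper.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Worked example with made-up numbers: the same R^2 = 0.50 from 101 rows
few = adjusted_r2(0.50, n_samples=101, n_features=10)    # 10 predictors
many = adjusted_r2(0.50, n_samples=101, n_features=50)   # 50 predictors
print(few, many)
```

With many one-hot country columns, p is large, so the gap between `r2` and `adj_r2` in the score table is worth checking.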

In [None]:
#evaluate each regression model with test data
for reg in regs:
    reg.fit(X_train, Y_train)
    y_pred = reg.predict(X_test)
    evaluate_reg_model(reg, X_test, Y_test, y_pred)

In [None]:
# residuals vs predictions for the last regressor fitted in the loop above (XGBRegressor)
sns.scatterplot(x=y_pred, y=(Y_test-y_pred))

The residuals are positive when predicted salaries are in the lower range and negative when they are in the higher range. That means predicted salaries are low when actual salaries are high, and vice versa; the model seems to predict salaries closer to the mean. Also, the residual variance is higher in the middle.

In [None]:
# statsmodels linear regressor
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import *
from statsmodels.graphics.gofplots import ProbPlot

In [None]:
lm = sm.OLS(Y_train, X_train).fit()
#lm.summary()

In [None]:
print ("The rsquared values is " + str(lm.rsquared))
print ("The adjusted rsquared values is " + str(lm.rsquared_adj))

In [None]:
# fitted values (the OLS model above has no explicit intercept column; sm.add_constant(X_train) would add one)
model_fitted_y = lm.fittedvalues

# model residuals
model_residuals = lm.resid

# normalized residuals
model_norm_residuals = lm.get_influence().resid_studentized_internal

# square root of the absolute normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

# absolute residuals
model_abs_resid = np.abs(model_residuals)

# leverage, from statsmodels internals
model_leverage = lm.get_influence().hat_matrix_diag

# cook's distance, from statsmodels internals
model_cooks = lm.get_influence().cooks_distance[0]
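
The leverage and Cook's distance values taken from statsmodels above can be reproduced from their textbook definitions, which may help when reading the diagnostic plots. A minimal NumPy sketch on synthetic data (not the survey model):

```python
import numpy as np

# Synthetic regression problem with an intercept column (made-up example)
rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
p = X.shape[1]                                  # number of fitted parameters

beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # ordinary least squares
resid = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
hat_diag = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
s2 = resid @ resid / (n - p)
cooks = resid**2 / (p * s2) * hat_diag / (1 - hat_diag)**2

print(np.argsort(cooks)[::-1][:3])              # three most influential rows
```

The leverages always sum to the number of fitted parameters, which is a quick sanity check on the computation.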

In [None]:
model_abs_resid.sort_values(ascending=False).head(10) # top 10 deviations

In [None]:
#plot Residuals vs Fitted
plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)

plot_lm_1.axes[0] = sns.residplot(x=model_fitted_y, y=Y_train,
                          lowess=True, 
                          scatter_kws={'alpha': 0.5}, 
                          line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals')

# annotations
abs_resid = model_abs_resid.sort_values(ascending=False)
abs_resid_top_10 = abs_resid[:10]

for i in abs_resid_top_10.index:
    plot_lm_1.axes[0].annotate(i, 
                               xy=(model_fitted_y[i], 
                                   model_residuals[i]));

In [None]:
#plot Normal Q-Q
QQ = ProbPlot(model_norm_residuals)

plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)

plot_lm_2.set_figheight(8)
plot_lm_2.set_figwidth(12)

plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals')

Residuals do not seem to follow a clear normal distribution.

In [None]:
#plot Scale-Location
plot_lm_3 = plt.figure(3)
plot_lm_3.set_figheight(8)
plot_lm_3.set_figwidth(12)

plt.scatter(model_fitted_y, model_norm_residuals_abs_sqrt, alpha=0.5)
sns.regplot(x=model_fitted_y, y=model_norm_residuals_abs_sqrt,
            scatter=False, 
            ci=False, 
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_3.axes[0].set_title('Scale-Location')
plot_lm_3.axes[0].set_xlabel('Fitted values')
plot_lm_3.axes[0].set_ylabel(r'$\sqrt{|Standardized Residuals|}$');

# annotations
abs_sq_norm_resid = np.flip(np.argsort(model_norm_residuals_abs_sqrt), 0)
abs_sq_norm_resid_top_3 = abs_sq_norm_resid[:3]

# for i in abs_sq_norm_resid_top_3:
#     plot_lm_3.axes[0].annotate(i,
#                                xy=(model_fitted_y.iloc[i],
#                                    model_norm_residuals_abs_sqrt[i]))

There is no clear trend in standard deviation of residuals.
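
As a rough numeric companion to the Scale-Location plot, one can correlate the fitted values with the square root of the absolute standardized residuals: a value near zero is consistent with "no clear trend", while a clearly positive value suggests the spread grows with the fitted value. This is only a heuristic sketch on synthetic data, not a formal heteroscedasticity test; `scale_location_trend` and the generated arrays are hypothetical and do not come from the notebook.

```python
import numpy as np


def scale_location_trend(fitted, resid):
    """Pearson correlation between fitted values and sqrt(|residuals|).

    The correlation is scale invariant, so raw or standardized residuals
    give the same result.
    """
    spread = np.sqrt(np.abs(resid))
    return float(np.corrcoef(fitted, spread)[0, 1])


rng = np.random.default_rng(42)
fitted = rng.uniform(1.0, 10.0, size=2000)
constant_noise = rng.normal(size=2000)           # same spread everywhere
growing_noise = fitted * rng.normal(size=2000)   # spread grows with the fitted value

trend_constant = scale_location_trend(fitted, constant_noise)
trend_growing = scale_location_trend(fitted, growing_noise)
print(trend_constant, trend_growing)
```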

In [None]:
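# rule-of-thumb cutoff for Cook's distance, 4 / (n - p - 1);
# n (observations) and p (model parameters) must already be defined
print(4 / (n - p - 1))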

In [None]:
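# positions of the 10 observations with the largest Cook's distance
leverage_top_10 = np.flip(np.argsort(model_cooks), 0)[:10]
for i in leverage_top_10:
    # position, leverage, absolute residual, Cook's distance
    print(i, model_leverage[i], model_abs_resid.iloc[i], model_cooks[i])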

In [None]:
#plot Residuals vs Leverage
plot_lm_4 = plt.figure(4)
plot_lm_4.set_figheight(12)
plot_lm_4.set_figwidth(18)

plt.scatter(model_leverage, model_norm_residuals, alpha=0.5)
sns.regplot(x=model_leverage, y=model_norm_residuals,
            scatter=False, 
            ci=False, 
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

plot_lm_4.axes[0].set_xlim(0, 0.20)
plot_lm_4.axes[0].set_ylim(-3, 5)
plot_lm_4.axes[0].set_title('Residuals vs Leverage')
plot_lm_4.axes[0].set_xlabel('Leverage')
plot_lm_4.axes[0].set_ylabel('Standardized Residuals')

# annotations
leverage_top_5 = np.flip(np.argsort(model_cooks), 0)[:5]

#for i in leverage_top_5:
 # plot_lm_4.axes[0].annotate(i, 
  #                           xy=(model_leverage[i], 
   #                            '2'))
    
# shenanigans for cook's distance contours
def graph(formula, x_range, label=None):
    x = x_range
    y = formula(x)
    plt.plot(x, y, label=label, lw=1, ls='--', color='red')

p = len(lm.params) # number of model parameters

graph(lambda x: np.sqrt((0.001 * p * (1 - x)) / x), 
      np.linspace(0.001, 0.200, 50), 
      'Cook\'s distance') # contour for Cook's distance = 0.001
graph(lambda x: np.sqrt((0.005 * p * (1 - x)) / x), 
      np.linspace(0.001, 0.200, 50)) # contour for Cook's distance = 0.005
plt.legend(loc='upper right')

Only a few observations fall above the 0.005 Cook's distance line.
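
The dashed contours in the plot come from solving the Cook's distance formula for the standardized residual. With leverage `h`, standardized (internally studentized) residual `r`, and `p` model parameters, `D = (r**2 / p) * h / (1 - h)`, so `r = sqrt(D * p * (1 - h) / h)`, which is the expression inside the lambdas. The sketch below only illustrates that relationship; the helper names and the value `n_params = 30` are hypothetical.

```python
import math


def cooks_distance(std_resid, leverage, n_params):
    """Cook's distance from a standardized residual and its leverage."""
    return (std_resid ** 2 / n_params) * leverage / (1.0 - leverage)


def cooks_contour(leverage, distance, n_params):
    """Standardized residual at which Cook's distance equals `distance`."""
    return math.sqrt(distance * n_params * (1.0 - leverage) / leverage)


n_params = 30  # hypothetical number of model parameters
for h in (0.01, 0.05, 0.10):
    r = cooks_contour(h, 0.005, n_params)
    # plugging the contour value back in recovers the chosen distance
    print(h, round(r, 3), round(cooks_distance(r, h, n_params), 6))
```

A point lies above a contour exactly when its Cook's distance exceeds the value used to draw it.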

In [None]:
sns.distplot(model_residuals)

In [None]:
sns.boxplot(model_residuals)

In [None]:
y_pred = lm.predict(X_test)
evaluate_reg_model('OLS',X_test, Y_test, y_pred)

In [None]:
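# note: sm.GLS is generalized least squares, not a GLM; without a `sigma`
# argument it is equivalent to OLS
glm = sm.GLS(Y_train, X_train).fit()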
#glm.summary()
Y_pred_glm = glm.predict(X_test)
evaluate_reg_model('glm',X_test, Y_test, Y_pred_glm)

In [None]:
# retry OLS after removing the 10 training observations with the largest Cook's distance

In [None]:
X_train_wo = X_train.drop(X_train.index[leverage_top_10])
Y_train_wo = Y_train.drop(Y_train.index[leverage_top_10])

In [None]:
lmwo = sm.OLS(Y_train_wo, X_train_wo).fit()
Y_pred_lmwo = lmwo.predict(X_test)
#lmwo.summary()
evaluate_reg_model('OLS-wo-outliers',X_test, Y_test, Y_pred_lmwo)

In [None]:
# retry the OLS model after dropping insignificant features 

In [None]:
# features treated as insignificant, defined once and reused for train and test
insignificant_features = [
    'Gender_Female',
    'Gender_Gender non-conforming',
    'Gender_Transgender',
    'CountryCode_ARM',
    'CountryCode_BMU',
    'CountryCode_BRA',
    'CountryCode_CHN',
    'CountryCode_DOM',
    'CountryCode_ECU',
    'CountryCode_EST',
    'CountryCode_HND',
    'CountryCode_LBN',
    'CountryCode_MLT',
    'CountryCode_MNE',
    'CountryCode_PRT',
    'CountryCode_PRY',
    'CountryCode_SLV',
    'CountryCode_SVK',
    'CountryCode_SVN',
    'CountryCode_SYC',
]
X_train_wc = X_train_wo.drop(insignificant_features, axis=1)

In [None]:
# drop the same columns from the test set so it matches the training features
X_test = X_test.drop(insignificant_features, axis=1)

In [None]:
lmwc = sm.OLS(Y_train_wo, X_train_wc).fit()
Y_pred_lmwc = lmwc.predict(X_test)
#lmwc.summary()
evaluate_reg_model('OLS-wo-insignificant-features',X_test, Y_test, Y_pred_lmwc)

In [None]:
reg_score_df

From the above, the XGBoost regressor did a slightly better job than the other models, with the lowest MSE and the highest adjusted R2 score. GLM and OLS got exactly the same score; this is expected, because the model labeled GLM was fitted with `sm.GLS` without a covariance structure, which is equivalent to OLS. Even though OLS achieved an R2 score of 0.92 on the training data, it did not reach the same level of accuracy on the test data, which indicates overfitting. Interestingly, removing the 10 most influential observations (by Cook's distance) and dropping the insignificant variables did not improve the model; instead, its performance deteriorated.

The RandomForest regressor performed better when the median error is considered instead of the mean error.
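
The identical scores for the model labeled GLM and for OLS are what one would expect: that model was fitted with `sm.GLS` and no error covariance, and generalized least squares with an identity covariance matrix reduces to ordinary least squares. The sketch below checks this with plain NumPy on synthetic data; the function names and the data are hypothetical and are not part of the notebook.

```python
import numpy as np


def ols_coefficients(X, y):
    """Ordinary least squares via a least-squares solve."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def gls_coefficients(X, y, sigma=None):
    """Generalized least squares: (X' S^-1 X)^-1 X' S^-1 y.

    With sigma=None the error covariance is assumed to be the identity.
    """
    if sigma is None:
        sigma = np.eye(X.shape[0])
    sigma_inv = np.linalg.inv(sigma)
    xt_sigma_inv = X.T @ sigma_inv
    return np.linalg.solve(xt_sigma_inv @ X, xt_sigma_inv @ y)


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

beta_ols = ols_coefficients(X, y)
beta_gls = gls_coefficients(X, y)
print(np.allclose(beta_ols, beta_gls))
```

The two estimators would only differ if a non-identity covariance (for example, unequal error variances) were supplied.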

## 5.Summary

### Key observations from EDA:

> * There is some correlation between salary and career satisfaction worldwide, but it is not strong enough to support any conclusion. The Pearson correlation coefficient is very low (**0.0847**). One explanation may be that salaries are widely spread within each satisfaction level, from very low to very high. This may be a data quality issue, such as incorrect values entered by some respondents. Another reason is that there are many outliers. However, a pattern is visible: from satisfaction levels 5 to 9, the mean/median salary gradually increases.

> * People who work remotely full time are the most satisfied, and people who have never worked remotely are the least satisfied. Flexibility appears to play a vital role in career satisfaction. It is no surprise that developers rank the option to work remotely as the second most important benefit, after vacation.

> * Data scientists' median salary is a little higher than that of non-data scientists. The median salary for **Data scientists** is **55K**, whereas for non-data scientists it is **48K**. These salary values are standardized to a global scale based on the purchasing power parity index.

> * Developers who use spaces earn **15,000** more on average than those who use tabs.

> * The more programming experience developers have, the higher their average salary. For example, programmers who have more than 15 years of experience earn **47K** more on average than those who have 2 to 5 years of experience.
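
The first observation, a very low Pearson coefficient alongside gradually rising group averages, is not contradictory. The toy example below is a sketch with made-up numbers (not survey data): the mean salary rises with each satisfaction level, yet the wide spread within each level keeps the correlation close to zero. `pearson_r` is a hypothetical helper written out for clarity.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# made-up data: three respondents per satisfaction level, salaries in thousands
satisfaction = [5, 5, 5, 6, 6, 6, 7, 7, 7]
salary = [10, 50, 90, 12, 52, 92, 14, 54, 94]

# group means rise by 2K per level (50, 52, 54), but the within-level spread is 80K
r = pearson_r(satisfaction, salary)
print(round(r, 3))
```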

### Key observations from ML:

> * **Clustering**: After applying K-Means, Agglomerative, DBSCAN, and HDBSCAN clustering models, it was evident that respondents fall mainly into three clusters based on career satisfaction and salary, as these two are the features with the highest variance. K-Means did a better job than Agglomerative clustering: the silhouette score of Agglomerative clustering is 0.10, and that of K-Means with K=3 is 0.13.

> * **Classification**: Almost all the classification models did an excellent job, with evaluation scores reaching 0.99. The labels assigned by the clustering model were used as the target. The MLP classifier performed best.

> * **Regression**: Regression models (with salary as the target variable) also performed well, with an adjusted R2 score close to 0.7. The XGBoost regressor performed best out of all tested models.
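
To put the silhouette scores of 0.10 and 0.13 in context: for each point, the score compares the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`) as `(b - a) / max(a, b)`, so values near 1 mean well-separated clusters and values near 0 mean overlapping ones. The sketch below is an illustrative NumPy implementation on made-up points; `mean_silhouette` is a hypothetical name, and in practice scikit-learn's `sklearn.metrics.silhouette_score` would normally be used.

```python
import numpy as np


def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # full pairwise distance matrix
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# made-up points forming two tight, distant groups
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
well_separated = mean_silhouette(points, [0, 0, 1, 1])
badly_assigned = mean_silhouette(points, [0, 1, 0, 1])
print(well_separated, badly_assigned)
```

Scores of 0.10 to 0.13 therefore suggest that the three clusters overlap considerably, even though K-Means separates them slightly better than Agglomerative clustering.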