# Starter Notebook for Salary Survey

## Table of Contents
* [Data Cleansing](#1)
* [Explore Categorical Features](#2)
* [Explore Numerical Features](#3)
* [Salary vs. Features](#4)
* [Other Explorations](#5)
* [Data Scientist - Drill Down](#6)

#### Most of this notebook deals with data cleaning, e.g. string entries in numerical columns, nonsense levels in categorical columns, and hidden duplicates caused by misspellings or different texts meaning the same thing. The result might still not be perfect, so feel free to improve it.

In [None]:
# packages
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
# read data
df_2018 = pd.read_csv('../input/2020-it-salary-survey-for-eu-region/IT Salary Survey EU 2018.csv')
df_2019 = pd.read_csv('../input/2020-it-salary-survey-for-eu-region/T Salary Survey EU 2019.csv')
df_2020 = pd.read_csv('../input/2020-it-salary-survey-for-eu-region/IT Salary Survey EU  2020.csv')

# We will focus on 2020 data only in the following!

In [None]:
# preview
df_2020.head(5)

In [None]:
# show all columns
print(df_2020.columns.tolist())

In [None]:
# rename columns
df_2020.rename(columns = {'Position ':'Position'}, inplace = True)

In [None]:
# features we ignore in the following
features_not_used = ['Timestamp',
                     'Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country',
                     'Annual bonus+stocks one year ago. Only answer if staying in same country']

In [None]:
# categorical features
features_cat = ['Gender', 'City', 'Position',
       'Total years of experience', 'Years of experience in Germany',
       'Seniority level', 'Your main technology / programming language',
       'Other technologies/programming languages you use often',
       'Number of vacation days',
       'Employment status', 'Сontract duration',
       'Main language at work', 'Company size', 'Company type',
       'Have you lost your job due to the coronavirus outbreak?',
       'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',
       'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR']

#### A few of the features above look numerical rather than categorical, e.g. Number of vacation days. However, those columns contain several string entries, so they need more in-depth data cleaning:

In [None]:
# look at an example of a messy "numerical" column
df_2020['Number of vacation days'].value_counts()

#### We'll fix that specific column later...
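
A quick way to list the offending entries in such a column is to try a numeric conversion and keep whatever fails to parse. This is only a sketch on hypothetical toy values (the series `vacation` below is made up and not read from the survey file):

```python
import pandas as pd

# hypothetical toy values imitating a messy "numerical" column
vacation = pd.Series(['30', '28', '23+', 'unlimited', '~25', '30'])

# entries that cannot be parsed as numbers become NaN
parsed = pd.to_numeric(vacation, errors='coerce')

# distinct raw values that failed to parse and need manual attention
non_numeric = vacation[parsed.isna()].drop_duplicates().tolist()
print(non_numeric)
```

The same idea should carry over to `df_2020['Number of vacation days']`, assuming the column is stored as strings.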

In [None]:
# numeric features
features_num = ['Age',
                'Yearly brutto salary (without bonus and stocks) in EUR',
                'Yearly bonus + stocks in EUR',]

In [None]:
# check if we have captured all features
len(features_cat + features_num + features_not_used) - len(df_2020.columns)

Yep, all good!
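
Note that a length difference of 0 could in principle hide offsetting mistakes, e.g. one column listed twice and another one forgotten. A stricter check compares the names themselves. The following is a minimal sketch on hypothetical toy lists (all `*_toy` names and `all_columns` are made up for illustration):

```python
# hypothetical toy names standing in for df_2020.columns and the feature lists
all_columns = ['Timestamp', 'Age', 'Gender', 'City']
features_cat_toy = ['Gender', 'City']
features_num_toy = ['Age']
features_not_used_toy = ['Timestamp']

assigned = features_cat_toy + features_num_toy + features_not_used_toy

# columns of the data frame that were not assigned to any list
unassigned = set(all_columns) - set(assigned)
# names in the lists that do not exist in the data frame (e.g. typos)
unknown = set(assigned) - set(all_columns)
# names that were assigned more than once
duplicated = {c for c in assigned if assigned.count(c) > 1}

print(unassigned, unknown, duplicated)
```

All three sets being empty means every column is assigned exactly once.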

<a id='1'></a>
# Data Cleansing

### Missing values

In [None]:
# fill missing values

# >>> categorical features
missing_text = '_MISSING_'
df_2020[features_cat] = df_2020[features_cat].fillna(missing_text)

# >>> numerical features
missing_num = -1
df_2020[features_num] = df_2020[features_num].fillna(missing_num)

### Outliers

In [None]:
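# check for outliers
# (log10 is undefined for the missing-value marker -1, so plot positive salaries only)
salary_col = 'Yearly brutto salary (without bonus and stocks) in EUR'
plt.boxplot(np.log10(df_2020.loc[df_2020[salary_col] > 0, salary_col]))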
plt.title('Yearly brutto salary (without bonus and stocks) in EUR')
plt.ylabel('log10(Salary)')
plt.grid()
plt.show()

# remove very high values
cut_point = 500000
# .copy() makes the filtered frame independent, so later assignments do not act on a slice
df_2020 = df_2020[df_2020['Yearly brutto salary (without bonus and stocks) in EUR'] <= cut_point].copy()
plt.boxplot(df_2020['Yearly brutto salary (without bonus and stocks) in EUR'])
plt.title('After removing outliers')
plt.ylabel('Salary')
plt.grid()
plt.show()
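
Before dropping rows it can be worth looking at exactly which values a cut point removes. A small sketch on hypothetical toy salaries (the numbers below are made up, only `cut_point` matches the cell above):

```python
import pandas as pd

# hypothetical toy salaries, NOT taken from the survey
salary = pd.Series([45000, 60000, 72000, 85000, 500000000])
cut_point = 500000

# values that would be dropped vs. kept by the cut
removed = salary[salary > cut_point]
kept = salary[salary <= cut_point]

print('removed:', removed.tolist())
print('kept   :', len(kept), 'rows')
```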

### Clean strings

In [None]:
# string cleaning (remove redundant spaces and convert to upper case)
# => reduce risk of "hidden" duplicates
def clean_string(x):
    return x.strip().upper()

features_for_string_cleaning = ['City', 'Position', 'Employment status',
                                'Your main technology / programming language',
                                'Other technologies/programming languages you use often',
                                'Seniority level',
                                'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR',
                                'Number of vacation days']

for f in features_for_string_cleaning:
    df_2020[f] = df_2020[f].apply(clean_string)
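
To illustrate why this helps, here is a tiny sketch on hypothetical city spellings (the list `raw` is made up): four different raw strings collapse to two levels after cleaning.

```python
from collections import Counter

def clean_string(x):
    # same logic as above: trim surrounding spaces and convert to upper case
    return x.strip().upper()

# hypothetical raw entries that differ only by case or surrounding spaces
raw = ['Berlin', 'berlin ', ' BERLIN', 'Munich']

n_levels_before = len(set(raw))
counts = Counter(clean_string(x) for x in raw)

print(n_levels_before, 'levels before,', len(counts), 'levels after')
print(counts)
```

Real misspellings (a missing or swapped letter) are not caught this way and still need manual fixes.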

### Reduce number of levels

In [None]:
# reduce number of levels: "Position"
current_feature = 'Position'
print('ORIGINAL:', current_feature)
temp_count = df_2020[current_feature].value_counts()
print(temp_count)

# reduce to levels that occur at least freq_min times
freq_min = 3
keep_levels = list(temp_count[temp_count.values>=freq_min].index)
df_2020[current_feature] = df_2020[current_feature].where(df_2020[current_feature].isin(keep_levels), '_OTHER_')
print('\nREDUCED TO:')
print(df_2020[current_feature].value_counts())

In [None]:
# reduce number of levels: "Seniority level"
current_feature = 'Seniority level'
print('ORIGINAL:', current_feature)
temp_count = df_2020[current_feature].value_counts()
print(temp_count)

# reduce to levels that occur at least freq_min times
freq_min = 3
keep_levels = list(temp_count[temp_count.values>=freq_min].index)
df_2020[current_feature] = df_2020[current_feature].where(df_2020[current_feature].isin(keep_levels), '_OTHER_')
print('\nREDUCED TO:')
print(df_2020[current_feature].value_counts())

In [None]:
# reduce number of levels: "Main language at work"
current_feature = 'Main language at work'

# manual adjustment first
# (use .loc on the frame itself; chained indexing may assign to a temporary copy)
df_2020.loc[df_2020[current_feature]=='Русский', current_feature] = 'Russian'

print('ORIGINAL:', current_feature)
temp_count = df_2020[current_feature].value_counts()
print(temp_count)

# reduce to levels that occur at least freq_min times
freq_min = 3
keep_levels = list(temp_count[temp_count.values>=freq_min].index)
df_2020[current_feature] = df_2020[current_feature].where(df_2020[current_feature].isin(keep_levels), '_OTHER_')
print('\nREDUCED TO:')
print(df_2020[current_feature].value_counts())

In [None]:
# reduce number of levels: "Employment status"
current_feature = 'Employment status'
print('ORIGINAL:', current_feature)
temp_count = df_2020[current_feature].value_counts()
print(temp_count)

# reduce to levels that occur at least freq_min times
freq_min = 2
keep_levels = list(temp_count[temp_count.values>=freq_min].index)
df_2020[current_feature] = df_2020[current_feature].where(df_2020[current_feature].isin(keep_levels), '_OTHER_')
print('\nREDUCED TO:')
print(df_2020[current_feature].value_counts())

In [None]:
# reduce number of levels: "Company type"
current_feature = 'Company type'
print('ORIGINAL:', current_feature)
temp_count = df_2020[current_feature].value_counts()
print(temp_count)

# reduce to levels that occur at least freq_min times
freq_min = 3
keep_levels = list(temp_count[temp_count.values>=freq_min].index)
df_2020[current_feature] = df_2020[current_feature].where(df_2020[current_feature].isin(keep_levels), '_OTHER_')
print('\nREDUCED TO:')
print(df_2020[current_feature].value_counts())

In [None]:
# reduce number of levels: "Have you received additional monetary support..."
current_feature = 'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR'
print('ORIGINAL:', current_feature)
temp_count = df_2020[current_feature].value_counts()
print(temp_count)

# reduce to levels that occur at least freq_min times
freq_min = 2
keep_levels = list(temp_count[temp_count.values>=freq_min].index)
df_2020[current_feature] = df_2020[current_feature].where(df_2020[current_feature].isin(keep_levels), '_OTHER_')
print('\nREDUCED TO:')
print(df_2020[current_feature].value_counts())

In [None]:
# reduce number of levels: "Have you lost your job due to the coronavirus outbreak?"
current_feature = 'Have you lost your job due to the coronavirus outbreak?'
print('ORIGINAL:', current_feature)
temp_count = df_2020[current_feature].value_counts()
print(temp_count)

# reduce to levels that occur at least freq_min times
freq_min = 2
keep_levels = list(temp_count[temp_count.values>=freq_min].index)
df_2020[current_feature] = df_2020[current_feature].where(df_2020[current_feature].isin(keep_levels), '_OTHER_')
print('\nREDUCED TO:')
print(df_2020[current_feature].value_counts())
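
The cells above repeat the same pattern for each feature. One possible way to avoid the repetition is a small helper; this is just a sketch, and the name `reduce_levels` as well as the toy series are assumptions, not part of the notebook:

```python
import pandas as pd

def reduce_levels(series, freq_min, other='_OTHER_'):
    """Replace levels occurring fewer than freq_min times by `other`."""
    counts = series.value_counts()
    keep_levels = counts[counts >= freq_min].index
    return series.where(series.isin(keep_levels), other)

# hypothetical toy feature: 'C' occurs only once
toy = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
reduced = reduce_levels(toy, freq_min=2)
print(reduced.value_counts())
```

With such a helper each cell would shrink to something like `df_2020[f] = reduce_levels(df_2020[f], freq_min)`.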

### Fix city names

In [None]:
current_feature = 'City'
# show all levels
print(df_2020[current_feature].value_counts().index.tolist())

In [None]:
# replace levels (assign via .loc on the frame to avoid chained assignment)
city_fixes = {'BÖLINGEN': 'BOEBLINGEN',
              'DUSSELDORF': 'DUESSELDORF',
              'DÜSSELDORF': 'DUESSELDORF',
              'DUSSELDURF': 'DUESSELDORF',
              'NÜRNBERG': 'NUREMBERG',
              'WARSAW, POLAND': 'WARSAW',
              'ZÜRICH': 'ZURICH'}
for old, new in city_fixes.items():
    df_2020.loc[df_2020[current_feature] == old, current_feature] = new

### Fix vacation days

In [None]:
current_feature = 'Number of vacation days'
# show all levels 
print(df_2020[current_feature].value_counts().index.tolist())

In [None]:
# replace levels (assign via .loc on the frame to avoid chained assignment)
vacation_fixes = {'30 IN CONTRACT (BUT THEORETICALLY UNLIMITED)': 'UNLIMITED',
                  '23+': '23',
                  '(NO IDEA)': '_MISSING_',
                  '24 LABOUR DAYS': '24',
                  '~25': '25',
                  '365': 'UNLIMITED'}
for old, new in vacation_fixes.items():
    df_2020.loc[df_2020[current_feature] == old, current_feature] = new

### Fix experience(s)

In [None]:
current_feature = 'Total years of experience'
# show all levels
print(df_2020[current_feature].value_counts().index.tolist())

In [None]:
# replace levels (assign via .loc on the frame to avoid chained assignment)
experience_fixes = {'6 (not as a data scientist, but as a lab scientist)': '6',
                    'less than year': '1',
                    '15, thereof 8 as CTO': '15',
                    '1 (as QA Engineer) / 11 in total': '11',
                    '383': '_MISSING_',
                    '1,5': '1.5',
                    '2,5': '2.5'}
for old, new in experience_fixes.items():
    df_2020.loc[df_2020[current_feature] == old, current_feature] = new

In [None]:
current_feature = 'Years of experience in Germany'
# show all levels
print(df_2020[current_feature].value_counts().index.tolist())

In [None]:
# replace levels (assign via .loc on the frame to avoid chained assignment)
experience_de_fixes = {'0,3': '0.3',
                       '0,5': '0.5',
                       '1,5': '1.5',
                       '1,7': '1.7',
                       '2,5': '2.5',
                       '3,5': '3.5',
                       '4,5': '4.5',
                       '<1': '0.5',
                       '< 1': '0.5',
                       '3 months': '0.25',
                       '4 month': '0.33',
                       '4 (in Switzerland), 0 (in Germany)': '0',
                       'less than year': '0.5',
                       '⁰': '_MISSING_',
                       '-': '_MISSING_',
                       '6 (not as a data scientist, but as a lab scientist)': '6',
                       '3 (in Poland)': '0'}
for old, new in experience_de_fixes.items():
    df_2020.loc[df_2020[current_feature] == old, current_feature] = new
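
Many of the replacements above only swap a decimal comma for a decimal point. That part could also be handled generically; the following sketch uses hypothetical toy values (the series `years` is made up), and free-text entries such as `'< 1'` would still need a manual mapping as above:

```python
import pandas as pd

# hypothetical toy entries imitating the experience columns
years = pd.Series(['0,5', '1,5', '3', '_MISSING_', '< 1'])

# swap decimal comma for decimal point (plain string replace, no regex)
normalized = years.str.replace(',', '.', regex=False)

# anything that is still not a number becomes NaN
as_number = pd.to_numeric(normalized, errors='coerce')
print(as_number)
```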

### Fix bonus/stocks

In [None]:
current_feature = 'Yearly bonus + stocks in EUR'
# show all levels
print(df_2020[current_feature].value_counts().index.tolist())

In [None]:
# replace levels (-1 marks missing/unknown values)
bonus_fixes = {'bvg only': -1,
               '-': -1,
               '15000+-': 15000,
               'Not sure': -1,
               'Na': -1,
               'depends': -1,
               '1150000': -1}  # seems somewhat high?
for old, new in bonus_fixes.items():
    df_2020.loc[df_2020[current_feature] == old, current_feature] = new

# and convert to numeric
df_2020[current_feature] = df_2020[current_feature].astype(float)

# for the sake of simplicity we REPLACE THE MISSINGS/UNKNOWNS with 0!!!
df_2020.loc[df_2020[current_feature] == -1, current_feature] = 0

### Add sum of salary + bonus as new feature "Total Income...":

In [None]:
df_2020['Total Income (Salary+Bonus)'] = df_2020['Yearly brutto salary (without bonus and stocks) in EUR'] + df_2020['Yearly bonus + stocks in EUR']

In [None]:
# update list of numerical features correspondingly!
features_num.append('Total Income (Salary+Bonus)')

In [None]:
# add also binned version of age to data frame
df_2020['AgeGroup'] = pd.cut(df_2020.Age, bins=[-2,0,20,25,30,35,40,45,50,60,70])
df_2020.AgeGroup.value_counts().sort_index().plot(kind='bar')
plt.title('Age Groups')
plt.grid()
plt.show()

#### Note that (-2,0] represents the missing values!
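
The binning can be checked on a few hypothetical ages (the series `ages` is made up): the missing-value marker -1 lands in the first bin, and since the intervals are closed on the right, an age of exactly 25 still belongs to (20, 25].

```python
import pandas as pd

# hypothetical toy ages; -1 is the marker used for missing values
ages = pd.Series([-1, 24, 25, 26, 41])
groups = pd.cut(ages, bins=[-2, 0, 20, 25, 30, 35, 40, 45, 50, 60, 70])

# show each age next to the bin it was assigned to
for age, group in zip(ages, groups):
    print(age, '->', group)
```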

### Export cleansed data

In [None]:
# save prepared data to file
df_2020.to_csv('df_2020_cleaned.csv')

<a id='2'></a>
# Explore Categorical Features

In [None]:
# change plot style
plt.style.use('dark_background')

In [None]:
# plot distributions of categorical features
for f in features_cat:
    plt.figure(figsize=(14,5))
    val_c = df_2020[f].value_counts()
    if len(val_c) <= 20:
        val_c.plot(kind='bar', color='lightgreen')
        plt.title(f)
    else: # if more than 20 levels show only the most frequent 20
        val_c[0:20].plot(kind='bar', color='lightgreen')
        plt.title(f + ' - Top 20 only')
        
    plt.grid()
    plt.show()

<a id='3'></a>
# Explore Numerical Features

In [None]:
# plot distributions of numerical features
for f in features_num:
    plt.figure(figsize=(14,5))
    df_2020[f].plot(kind='hist', bins = 50, color='lightgreen')
    plt.title(f)
    plt.grid()
    plt.show()

<a id='4'></a>
# Salary vs. Features

In [None]:
# salary stats
df_2020['Yearly brutto salary (without bonus and stocks) in EUR'].describe()

In [None]:
# total income stats
df_2020['Total Income (Salary+Bonus)'].describe()

### Salary vs Features

In [None]:
# change plot style again
plt.style.use('default')

In [None]:
# violinplots show dependency of salary on the following features:
my_features = ['Gender', 'Seniority level', 'AgeGroup', 'Company size', 'Company type']

for f in my_features:
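    # 'seaborn-pastel' was renamed to 'seaborn-v0_8-pastel' in matplotlib 3.6
    plt.style.use('seaborn-v0_8-pastel' if 'seaborn-v0_8-pastel' in plt.style.available else 'seaborn-pastel')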
    plt.figure(figsize=(12,5))
    sns.violinplot(data=df_2020, x=f, y='Yearly brutto salary (without bonus and stocks) in EUR')
    plt.grid()
    plt.title('Salary by ' + f)
    plt.show()

### Other Evaluations

In [None]:
# alternative visualization: scatterplot salary vs age
sns.jointplot(x=df_2020['Age'], y=df_2020['Yearly brutto salary (without bonus and stocks) in EUR'],
              height=5, color='green',
              joint_kws={'alpha' : 0.3})
plt.show()

In [None]:
# alternative visualization: total income vs age
sns.jointplot(x=df_2020['Age'], y=df_2020['Total Income (Salary+Bonus)'],
              height=5, color='green',
              joint_kws={'alpha' : 0.3})
plt.show()

<a id='5'></a>
# Other Explorations

### Age

In [None]:
# age stats (excluding missings)
age4stats = df_2020['Age'][df_2020['Age']>0] # filter "NAs" first
age4stats.describe()

### Programming Languages and Technologies

In [None]:
# create wordcloud for main tech/programming language
f = 'Your main technology / programming language'
text_select = df_2020[f][df_2020[f] != '_MISSING_']
text = " ".join(txt for txt in text_select)

stopwords = set(STOPWORDS)

wordcloud = WordCloud(stopwords=stopwords, max_font_size=100, max_words=500,
                      width = 800, height = 600,
                      background_color='black').generate(text)

plt.figure(figsize=(11,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# create wordcloud for other tech/PLs
f = 'Other technologies/programming languages you use often'
text_select = df_2020[f][df_2020[f] != '_MISSING_']
text = " ".join(txt for txt in text_select)

stopwords = set(STOPWORDS)

wordcloud = WordCloud(stopwords=stopwords, max_font_size=100, max_words=500,
                      width = 800, height = 600,
                      background_color='black').generate(text)

plt.figure(figsize=(11,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

<a id='6'></a>
# Data Scientist Drill-Down

### Let's now specialize on the position of Data Scientist.

In [None]:
# select data scientists only
df_ds = df_2020[df_2020.Position=='DATA SCIENTIST'].copy()
df_ds.head()

In [None]:
# dimensions
df_ds.shape

In [None]:
# DS salary stats
df_ds['Yearly brutto salary (without bonus and stocks) in EUR'].describe()

In [None]:
# DS total income stats
df_ds['Total Income (Salary+Bonus)'].describe()

In [None]:
# DS age stats (excluding missings)
age4stats_DS = df_ds['Age'][df_ds['Age']>0] # filter "NAs" first
age4stats_DS.describe()

### We can see that the average salary is a bit higher for Data Scientists, whereas the average age is a little lower than in the overall "population". However, we have to be careful, as our sample of Data Scientists is rather small (110 observations).
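
One simple way to get a feeling for this uncertainty is the standard error of the mean, which shrinks with the square root of the sample size. A minimal sketch on hypothetical toy salaries (the values in `subgroup` are made up and not taken from the survey):

```python
import statistics

# hypothetical toy salaries, NOT taken from the survey
subgroup = [60000, 70000, 80000]

mean = statistics.mean(subgroup)
# standard error of the mean = sample standard deviation / sqrt(n)
sem = statistics.stdev(subgroup) / len(subgroup) ** 0.5

print('mean:', mean)
print('standard error:', round(sem, 1))
```

If the difference between two group means is small compared to their standard errors, it should not be over-interpreted.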

In [None]:
# change plot style
plt.style.use('dark_background')

In [None]:
# plot distributions of categorical features (exclude Position)
features_cat_x = features_cat.copy()
features_cat_x.remove('Position')

for f in features_cat_x:
    plt.figure(figsize=(14,5))
    val_c = df_ds[f].value_counts()
    if len(val_c) <= 20:
        val_c.plot(kind='bar', color='lightgreen')
        plt.title('Position = Data Scientist - ' + f)
    else: # if more than 20 levels show only the most frequent 20
        val_c[0:20].plot(kind='bar', color='lightgreen')
        plt.title('Position = Data Scientist - ' + f + ' - Top 20 only')
        
    plt.grid()
    plt.show()

In [None]:
# plot distributions of numerical features
for f in features_num:
    plt.figure(figsize=(14,5))
    df_ds[f].plot(kind='hist', bins = 50, color='lightgreen')
    plt.title('Position = Data Scientist - '+ f)
    plt.grid()
    plt.show()

In [None]:
# change plot style again
plt.style.use('default')

In [None]:
# violinplots show dependency of salary on the following features:
my_features = ['Gender', 'Seniority level', 'AgeGroup', 'Company size', 'Company type']

for f in my_features:
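    # 'seaborn-pastel' was renamed to 'seaborn-v0_8-pastel' in matplotlib 3.6
    plt.style.use('seaborn-v0_8-pastel' if 'seaborn-v0_8-pastel' in plt.style.available else 'seaborn-pastel')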
    plt.figure(figsize=(12,5))
    sns.violinplot(data=df_ds, x=f, y='Yearly brutto salary (without bonus and stocks) in EUR')
    plt.grid()
    plt.title('Position = Data Scientist - Salary by ' + f)
    plt.show()