# Big Macs and Rock Stars - What Separates Top Earners #

Welcome dear reader,

In this notebook we will aim to understand what drives professional salaries based on the responses to the 2021 Kaggle Data Science survey. We will do this in two parts:
1. How does salary vary with location?
2. What separates the highest income earners in each country from the rest?

But wait, you say, what about the Big Macs? Don't worry, we'll get to that shortly in our first chapter...

In [None]:
## Import useful packages
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt


## Make a few helpers then load and clean the survey data ##
def clean_survey(df):
    df = df.loc[1:].copy()
    df = df.loc[~df.Q5.eq("Student")]
    df = df.loc[~df.Q3.eq("Other")]
    df = df.loc[~df.Q25.eq("$0-999")]

    # Rename columns here

    return df

def salary_to_int(salary_range):
    # `salary_range` is a one-element row because the function is applied with axis=1
    value = salary_range.values[0]
    if pd.isna(value):
        return np.nan

    # Strip symbols and units so that only the numeric bounds remain
    text = str(value)
    for token in [",", "$", ">", "+", "years", "<"]:
        text = text.replace(token, "")
    text = text.replace("I have never written code", "0")

    # Keep the lower bound of the range, e.g. "25000-29999" -> "25000"
    return text.split("-")[0]

def add_columns(df):

    # Income columns
    df['Numeric_Income'] = df[["Q25"]].apply(salary_to_int, axis=1).astype(float)

    # Coding experience columns (lower bound of the Q6 range, in years)
    df['Years_Experience'] = df[["Q6"]].apply(salary_to_int, axis=1).astype(float)
    df.dropna(subset=['Years_Experience'], inplace=True)
    
    return df

def rename_survey_columns(df):
    df.rename(columns={'Q3' : 'Country'}, inplace=True)

    return df

def load_survey():
    df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
    df = clean_survey(df)
    df = add_columns(df)
    df = rename_survey_columns(df)

    return df

df = load_survey()

## Load and format the big mac data ##
big_mac = pd.read_csv('../input/bigmac/bigmac.csv')
big_mac.loc[big_mac.Country.eq('Britain'), 'Country'] = 'United Kingdom'
df.loc[df.Country.eq('United States of America'), 'Country'] = 'United States'
df.loc[df.Country.eq('United Kingdom of Great Britain and Northern Ireland'), 'Country'] = 'United Kingdom'

for country in ['France', 'Germany', 'Spain']:
    new_entry = big_mac.loc[big_mac.Country.eq('Euro area')].copy()
    new_entry['Country'] = country
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    big_mac = pd.concat([big_mac, new_entry], ignore_index=True)

    
## Format some of the questions for easier plotting
Programming_columns = [f'Q7_Part_{i}' for i in range(1,12)] + ['Q7_OTHER']
Programming_Langauges = ['Python', 'R', 'SQL', 'C', 'C++', 'Java', 'JavaScript', 'Julia' ,'Swift', 'Bash','Matlab','Other']
Work_role_columns = [f'Q24_Part_{i}' for i in range(1,8)]
Work_roles = ['Analyze','Data infrastructure','ML Prototyping','ML Production','ML Experimenting','ML Research', 'None']
Learning_columns = [f'Q40_Part_{i}' for i in range(1,11)]
learning_tools = ['Coursera', 'edX','Kaggle Learn Courses','DataCamp', 'Fast.ai', 'Udacity', 'Udemy','Linkedin Learning', 'Cloud Certification','University Degree']

df['Number_of_Languages'] = df[Programming_columns].count(axis=1)
df['Number_of_roles'] = df[Work_role_columns].count(axis=1)
df['Number_of_learning_platforms'] = df[Learning_columns].count(axis=1)

# Convert each multiple-choice column to a 1/0 indicator (1 = option selected)
for column in Programming_columns + Work_role_columns + Learning_columns:
    df[column] = df[column].notnull().astype(int)

# Chapter 1 - Salaries by location #

### Salary Distribution ###

Before we break down income versus location, let's look at the overall distribution of income. In order to focus on professionals, we've filtered the data down to those not identifying as a student, and those not falling into the income bucket that included an income of $0. 

<div class="alert alert-info"><b>Interactivity Time</b> <br> All the charts in this notebook are built using Plotly, so you can interact with them: hovering over a section shows additional information. Try zooming in on the highest income earners below to test it out. If things go wrong, look in the top right corner for the reset zoom option.
</div>

In [None]:
df.sort_values('Numeric_Income', inplace=True)
fig = px.histogram(df,
    x="Q25",
    title='Distribution of income',
    labels = {'Q25' : 'Income Bracket ($USD)'})
fig.update_layout(title_x = 0.5, plot_bgcolor='white')
fig.add_annotation(
    text="Median Income!", x="25,000-29,999", y=450, ay=-150, arrowhead=2, showarrow=True
)
fig.show()

Interesting! The varying bucket widths make it a bit hard to visually assess the distribution, but we can see a tail to the right with some quite high salaries. 

<div class="alert alert-success"><b>Median Income</b> <br>For those wondering how they stack up, the median income bucket is the \$25,000 to \$29,999 range.
</div>

### Salary distribution by country ###

Now let's break down income by country.

In [None]:
def create_income_boxplot(df, x_column, y_column, x_lim, y_lim):
    large_x = df.groupby(x_column).size().sort_values(ascending=False).head(x_lim).index.values
    df[f'{x_column}_{y_column}_Median'] = df.groupby(x_column)[y_column].transform('median')
    max_income = df[f'{x_column}_{y_column}_Median'].max()
    # Default band: median below 20% of the highest median
    df['Legend'] = 'Less than 20% of max median income'
    df.loc[df[f'{x_column}_{y_column}_Median'].ge(0.20*max_income),'Legend'] = 'Within 80% of max median income'
    df.loc[df[f'{x_column}_{y_column}_Median'].ge(0.5*max_income),'Legend'] = 'Within 50% of max median income'
    df.loc[df[f'{x_column}_{y_column}_Median'].ge(0.75*max_income),'Legend'] = 'Within 25% of max median income'
    df.sort_values(f'{x_column}_{y_column}_Median', inplace=True)
    fig = px.box(
        df.loc[df[x_column].isin(large_x)],
        x=x_column,
        y=y_column,
        color='Legend',
        labels={'Numeric_Income' : 'Income', 'Adjusted_Income' : 'Salary in Big Macs'})
    fig.update_yaxes(range=[0, y_lim])

    return fig

fig = create_income_boxplot(df, 'Country', 'Numeric_Income', 10, 350000)

fig.update_layout(
    title = 'Income Distribution For Countries With Most Responses',
    title_x = 0.5,
    plot_bgcolor='white',
    boxgroupgap=0,
    boxgap=0,
    legend ={'x': 0, 'y': -0.3, 'orientation': 'h'}
)

fig.show()

Here we have the distribution of salaries from each country. 

To keep the chart concise we've just used the 10 countries with the most respondents. Note that to plot these boxplots we need numerical values, so from each bin we've extracted the lower bound. That is, \$25,000-29,999 has been recorded as \$25,000.
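
As a small illustration of that lower-bound convention, here is a minimal sketch. The helper name `bucket_lower_bound` is hypothetical and is not part of this notebook's code; it only assumes bucket labels shaped like the ones in the survey.

```python
def bucket_lower_bound(label):
    """Return the lower bound of an income bucket label as a float."""
    # Strip currency symbols, thousands separators and comparison signs.
    cleaned = label.replace("$", "").replace(",", "").replace(">", "").strip()
    # The lower bound is whatever comes before the first hyphen.
    return float(cleaned.split("-")[0])


print(bucket_lower_bound("25,000-29,999"))
print(bucket_lower_bound(">$1,000,000"))
```

Because every respondent is placed at the bottom of their bucket, medians computed this way will tend to understate true incomes slightly.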

Now, as we could have expected based on previous years' surveys, the United States shows the highest median salary at \$100,000! Following along are Germany, the United Kingdom and Japan. To show how close countries are to being at the top, we've created bands based on how close they are to the highest median income (in this case belonging to the United States). Here you can see no country has a median income within 25% (i.e. over 0.75 times that) of the United States. Germany, the United Kingdom and Japan have median incomes within 50% of that in the United States.
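
The banding can be sketched on its own as follows. The medians below are made up for illustration and are not the survey results, and the `band` helper is a hypothetical name rather than the function used for the chart.

```python
import pandas as pd

# Illustrative medians only; these are not the survey results.
medians = pd.Series({"A": 100_000, "B": 70_000, "C": 30_000, "D": 10_000})

# Express each median as a share of the highest median.
share_of_top = medians / medians.max()


def band(share):
    """Map a share of the top median to a descriptive band."""
    if share >= 0.75:
        return "Within 25% of max median income"
    if share >= 0.50:
        return "Within 50% of max median income"
    if share >= 0.20:
        return "Within 80% of max median income"
    return "Less than 20% of max median income"


bands = share_of_top.apply(band)
print(bands)
```

Checking the thresholds from highest to lowest means each country lands in the tightest band it qualifies for.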

So, should everyone chasing an indulgent lifestyle try and move to the United States? Maybe, maybe not! We need to take another angle into consideration when assessing income, and that's the relative purchasing power of one's income. 



### Big Macs and Purchasing Power ###

To show the importance of purchasing power let's take a simple example: no doubt some of our friends in the United States (median \$100k) answering the survey reported large incomes, whilst living in San Francisco where the average monthly rent is about 3,000 dollars. Meanwhile, suppose some of our German respondents (median \$70k) are paying 1,000 dollars to live in Belgium. If all other expenses differ in the same proportion as rent, then it may in fact be better for one's wallet to live in Belgium despite the lower salary!

To get a view of which country offers the most "bang for your buck" we need to understand comparative prices in our respondents' areas. To do this we merge the survey data with a <a href="https://github.com/TheEconomist/big-mac-data">dataset</a> from The Economist showing the price of a Big Mac in various countries. 

Can a Big Mac really tell us something about the cost of living in an area though? Well, it turns out the Big Mac is actually a reasonably accurate barometer of the difference in prices between localities. This is due to the fact that the price of a Big Mac depends on diverse ingredients such as bread, meat and fresh produce, as well as the cost of local labour. 

Let's plot the median income reported by data scientists in each country versus the cost of a Big Mac in that country.

In [None]:
df_combined = df.merge(
    big_mac,
    on='Country',
    how='inner'
)

median_incomes = df_combined.groupby('Country', as_index=False).agg({'Numeric_Income' : 'median', 'dollar_price' : 'mean'})
fig = px.scatter(
    median_incomes,
    x='Numeric_Income',
    y='dollar_price',
    text='Country',
    trendline='ols',
    labels={'Numeric_Income' : 'Median Income Bucket Lower Bound', 'dollar_price' : 'Cost of Big Mac'}
)
fig.update_layout(
    title = 'Big Mac Price Versus Median Income',
    title_x = 0.5,
    plot_bgcolor='white',
    yaxis = {'tickformat' : '$'},
)
fig.update_traces(textposition='top center')

fig.show()

Indeed, it seems higher salaries correlate with a higher cost of goods. It thus stands to reason that absolute salary is not going to be the best measure of how comfortable one can be financially.
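
For readers who want to put a number on that relationship, here is a minimal sketch using NumPy. The income and price pairs are invented for illustration and are not taken from the survey or the Big Mac dataset.

```python
import numpy as np

# Made-up (income, price) pairs for illustration; not the survey values.
income = np.array([10_000, 30_000, 50_000, 70_000, 100_000], dtype=float)
price = np.array([2.5, 3.0, 4.0, 4.5, 5.5])

# Pearson correlation between the two variables.
r = np.corrcoef(income, price)[0, 1]

# Least-squares straight line, the same idea as an OLS trend line.
slope, intercept = np.polyfit(income, price, deg=1)

print(f"correlation: {r:.2f}")
print(f"price change per extra $10k of income: {slope * 10_000:.2f}")
```

A correlation close to 1 together with a positive slope would support the reading that prices rise with salaries, although with only a handful of countries any such estimate is rough.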

<div class="alert alert-info"><b>Burgernomics</b> <br> A Big Mac can be had for a bargain \$2.26 in Russia or a whopping \$7 in Switzerland. 
</div>


![](https://www.economist.com/sites/default/files/20120825_WOP773.jpg)

To account for price differences between countries, let us introduce a new measure: the total number of Big Macs a hungry data scientist could purchase in a year if they were to dedicate their entire salary to it! 
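
The measure is a single division, sketched below. The function name `big_macs_per_year` and the 50,000 dollar salary are hypothetical; the two prices are the Russian and Swiss figures quoted above.

```python
def big_macs_per_year(annual_income_usd, big_mac_price_usd):
    """Number of Big Macs an annual income buys at a given local price."""
    return annual_income_usd / big_mac_price_usd


# The same hypothetical salary goes much further where burgers are cheap.
salary = 50_000
print(round(big_macs_per_year(salary, 2.26)))  # price quoted for Russia
print(round(big_macs_per_year(salary, 7.00)))  # price quoted for Switzerland
```

The same nominal salary buys roughly three times as many burgers at the lower price, which is the effect the adjusted chart is meant to capture.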

In [None]:
df_combined['Adjusted_Income'] = df_combined['Numeric_Income'] / df_combined['dollar_price']

fig = create_income_boxplot(df_combined, 'Country', 'Adjusted_Income', 10, 50000)
fig.update_layout(
    title = 'Big Macs Purchasable For Countries With Most Responses',
    title_x = 0.5,
    plot_bgcolor='white',
    boxgroupgap=0,
    boxgap=0,
    legend ={'x': 0, 'y': -0.3, 'orientation': 'h'}
)

fig.show()

At first glance this is not a large shake-up versus the previous chart. However, median incomes under our new measure do converge a little between countries. India, China and Russia all had an absolute median income less than 20% of that seen in the United States, but with our new metric they have moved up to the next band, where they are at least 20% of the median income in the United States. Japan has also closed the gap and has an income over 75% of that seen in the United States. 

Now, even within a single country the cost of living can vary greatly, so this analysis could be taken a lot further. To help facilitate studies like this, future surveys could contain more precise geographic information. 

<div class="alert alert-success"><b>Hungry?</b> <br> Those chasing lunch might want to consider working in the USA and Japan, where the average data scientist earns enough for 18,000 and 14,000 Big Macs per year respectively!
</div>

# Chapter 2 - What makes a rockstar? #

Chapter 1 has shown us that there are definitely some locations with higher salaries on average, whether measured through Big Macs or dollars. However, a lot of us are happy just where we are. This section asks: what separates the high income earners in each country from the rest? **To answer this question we split out the top 20% of income earners within each country. These will be our "rock star" professionals. The remaining population will be our comparison**.
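
A within-country split of this kind can be sketched on toy data as follows. The column names `Income` and `Top_Earner` and the numbers are illustrative assumptions, not the survey's.

```python
import pandas as pd

# Toy data: two countries with five respondents each.
toy = pd.DataFrame({
    "Country": ["X"] * 5 + ["Y"] * 5,
    "Income": [10, 20, 30, 40, 50, 100, 200, 300, 400, 500],
})

# 80th percentile of income within each country, broadcast back to every row.
cutoff = toy.groupby("Country")["Income"].transform(lambda s: s.quantile(0.80))

# A respondent is a top earner if they earn strictly more than their country's cutoff.
toy["Top_Earner"] = toy["Income"] > cutoff
print(toy)
```

Because the cutoff is computed per country, a top earner in country X can earn far less than an ordinary respondent in country Y. With bucketed incomes many respondents can share the cutoff value, so a strict comparison may select fewer than 20% of a country's respondents.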

Now a fair protest here would be to say that making a lot of money doesn't necessarily make one a successful data scientist. What about contribution to the field, work/life balance, and so on? I wholeheartedly agree, and perhaps this group should simply be called "High Income Earners", but rock stars is just such a catchy name!

To understand our high income rock stars versus the rest of the population, we will look at the proportion of respondents in each group that answer the survey in various ways. The key thing to look for here is wide differences between the proportions. For example, if 50% of our rock stars are doing something that only 5% of the rest of us are doing, then this is likely an exciting insight. 
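
To make the comparison concrete, here is a minimal sketch on toy data. It uses `pd.crosstab` for brevity, which is a different route from the groupby-based helper defined in the next code cell; the column names and answers are made up.

```python
import pandas as pd

# Toy data: two top earners and four other respondents.
toy = pd.DataFrame({
    "Top_Earner": [True, True, False, False, False, False],
    "Answer": ["yes", "no", "yes", "no", "no", "no"],
})

# normalize="index" divides each row of counts by that group's total,
# so each row sums to 1 and the two groups become directly comparable.
proportions = pd.crosstab(toy["Top_Earner"], toy["Answer"], normalize="index")
print(proportions)
```

Dividing by each group's own size matters because the rest of the population is much larger than the rock star group, so raw counts would always favour it.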

Let's start by looking at coding experience.

### Years of Coding Experience ###

In [None]:
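# 80th percentile of income within each respondent's country
df['Income_Quantile'] = df.groupby('Country')['Numeric_Income'].transform('quantile', 0.80)
# Rock stars earn strictly more than their country's 80th percentile
df['Rock star'] = df.Numeric_Income > df.Income_Quantile
# Size of each group, used as the denominator when computing proportions
df['Group_Denoms'] = df.groupby('Rock star').Q1.transform('count')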


def plot_proportion(question, title, sort=True):
    plot_df = df.groupby([question,'Rock star'], as_index=False).agg({'Q1' : 'count', 'Group_Denoms' : 'mean'})
    plot_df['Proportion'] = plot_df['Q1'] / plot_df['Group_Denoms']
    if sort:
        plot_df.sort_values(by='Proportion', inplace=True)
    fig = px.bar(
            plot_df,
            x=question,
            y='Proportion',
            color='Rock star',
            barmode='group',
            title=title,
            color_discrete_sequence=px.colors.qualitative.Dark24,
            labels={'Proportion' : 'Percentage of Respondents'}
        )
    fig.update_layout(
        xaxis={'title' : None},
        title_x = 0.5,
        plot_bgcolor='white',
        yaxis = {'tickformat' : '.0%'}
    )
    return fig

fig = plot_proportion('Q6', 'Years of Experience writing code')
fig.show()

This chart shows us that 24% of our rock star group have been coding for over 20 years - amazing! This is quite different from the rest of the population, where only 10% of us have been writing code for over 20 years. In fact, a majority of those not in our rock star group have been coding for 3 years or less. It seems salaries within countries are quite fair when it comes to experience - the rock star group shows a clear over-index in years of experience. 

<div class="alert alert-success"><b>Time is on your side</b> <br> Rock stars are more than twice as likely to have been coding for 20+ years. Stick at it and your salary should follow!
</div>

### Level of Education ###

Let's now turn to level of education:

In [None]:
fig = plot_proportion('Q4', title = 'Level of Education',sort=True)
fig.show()

Wow! A lot of our survey respondents have advanced degrees - 52% of our rock star group reported having a master's degree, as did 44% of the remaining population. We can see the rock star group over-indexes in doctoral and master's degrees.

<sup>Perhaps I left school a little early...</sup>

### Field of occupation ###

Moving on to which sectors our respondents report working in:

In [None]:
fig = plot_proportion('Q20', title='Occupation Field',sort=True)
fig.show()

Now here's an interesting chart! What a large range of fields we all work in.

By far the most popular professions are computers/technology, accounting/finance and academics/education. Looking at these common answers, we can see our high earning rockstars show a preference towards technology and finance and an under-index in academia versus the rest of the population.

<div class="alert alert-success"><b>Many options</b> <br> We can all find solace in the fact that every profession has its share of rock stars, suggesting that regardless of field we can find ourselves among our country's top earners!
</div>


### Choice in programming language ###

Let's look at a few more metrics comparing our two groups of income earners. To keep things interesting we will change chart type too, from bars to radar plots. Let's now see whether our rock stars are successful because of their choice of language.


In [None]:
def plot_radar(columns, labels, group_name, title):
  # Collect one small frame per column, then concatenate them in a single call
  frames = []
  for column, label in zip(columns, labels):
      df_plot = df.groupby('Rock star',as_index=False).agg({column : 'sum', 'Group_Denoms' : 'mean'})
      df_plot['Proportion'] = df_plot[column] / df_plot['Group_Denoms']
      df_plot[group_name] = label
      frames.append(df_plot[['Rock star', group_name, 'Proportion']])
  all = pd.concat(frames)
  all['Popularity'] = all.groupby(group_name)['Proportion'].transform('max')
  all.sort_values(by='Popularity', inplace=True)
  fig = go.Figure()
  fig.add_trace(go.Scatterpolar(
        r=all.loc[all['Rock star'].eq(True)]['Proportion'].to_list(),
        theta=all.loc[all['Rock star'].eq(True)][group_name].to_list(),
        fill='toself',
        name='Rockstars',
        fillcolor=px.colors.qualitative.Dark24[0],
        opacity=0.9,
        line=dict(color='black'),
        marker_line_color="black",
        marker_line_width=2,
  ))
  fig.add_trace(go.Scatterpolar(
        r=all.loc[all['Rock star'].eq(False)]['Proportion'].to_list(),
        theta=all.loc[all['Rock star'].eq(False)][group_name].to_list(),
        fill='toself',
        name='Rest of population',
        fillcolor=px.colors.qualitative.Dark24[1],
        opacity=0.7,
        line=dict(color='black'),
        marker_line_color="black",
        marker_line_width=2,  ))

  fig.update_layout(
    template=None,
    title=title,
    legend ={'x': 0.36, 'y': -0.2, 'orientation': 'h'},
    polar=dict(
      #bgcolor='white',
      radialaxis=dict(
        visible=True,
        tickformat='.0%',
        range=[0, 1.05*all.Proportion.max()]
      )),
    showlegend=True
  )

  return fig

fig = plot_radar(Programming_columns, Programming_Langauges, 'Language','Languages Used Regularly')
fig.show()

It seems we're mostly the same when it comes to languages, in that we all love Python: 84% of our rock stars reported using it versus 82% of the remaining population. Interestingly, our rock stars are quite a bit more likely to use Bash, with 18% reporting using it versus 10% of the remaining population. For those wondering, on average our rock stars use 2.7 languages whilst the remaining population use 2.4. 

### Day to Day activities ###

Let's now see whether the day to day of our roles changes significantly between the high income earners and the rest of the population.

In [None]:
fig = plot_radar(Work_role_columns, Work_roles, 'role', 'Day to Day Responsibilities')
fig.show()

Probably my favorite chart so far! A greater proportion of Rock stars reported themselves responsible for each of the possible tasks. 

<div class="alert alert-success"><b>Jack of all trades</b> <br> The rockstar group reported doing 2.6 roles on average versus 1.6 for the remaining population. If you want to be well compensated, find more ways to create value!
</div>

![](https://cbk.bschool.cuhk.edu.hk/wp-content/uploads/shutterstock_1298321548.jpg)

### Learning tools ###

Finally, let's check whether these high income earners are doing it by spending more time learning than the rest of the population.

In [None]:
fig = plot_radar(Learning_columns, learning_tools, 'Learning', 'Learning Platforms Engaged With')
fig.show()

When it comes to learning it seems we're all alike, with clear preferences for Coursera, Kaggle courses and Udemy. On average the rock star group looks more ambitious in learning, with a larger proportion having started a course on every platform. 

My favorite piece of information about this question is on average, each rock star reported using 2.3 learning platforms, and the rest of the population reported using 1.8. This leads me to the conclusion that regardless of income band, we all love learning something new! 
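
Averages like these come from counting how many options each respondent ticked and then averaging that count per group. Here is a minimal sketch on toy data; the `Tool_*` column names are hypothetical stand-ins for the survey's multiple-choice columns, where an unticked option is left empty.

```python
import numpy as np
import pandas as pd

# Toy data: each Tool_* column holds the option name if ticked, NaN otherwise.
toy = pd.DataFrame({
    "Top_Earner": [True, True, False, False],
    "Tool_A": ["A", "A", "A", np.nan],
    "Tool_B": ["B", np.nan, np.nan, np.nan],
    "Tool_C": ["C", "C", np.nan, "C"],
})
tool_columns = ["Tool_A", "Tool_B", "Tool_C"]

# count(axis=1) counts the non-missing entries in each row.
toy["Number_of_tools"] = toy[tool_columns].count(axis=1)

# Average number of ticked options in each group.
average_tools = toy.groupby("Top_Earner")["Number_of_tools"].mean()
print(average_tools)
```

Note that the count has to be taken before the columns are converted to 1/0 indicators, since after that conversion no entries are missing.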

# Conclusion #

If you've made it this far, thanks for following along! It's been a joy to pull together this notebook and I hope you learnt something. Feel free to ask questions about anything in the comments below.

Here's a quick summary of what we've seen:
1. Salary, both absolute and adjusted for purchasing power, can vary greatly by location, with Japan and the United States standing out.
2. Don't travel to Switzerland if you want a Big Mac.
3. Every field has its share of high income earners.
4. High income earners are generally more experienced, more educated, and show a penchant for learning. Luckily for those of us chasing the greenback, **these aren't qualities you're born with, but things obtainable by all of us through effort and dedication.**

Thanks to Kaggle for collecting all these responses! 

### Further reading ###
1. The Big Mac Index: https://www.economist.com/big-mac-index
2. An interesting paper on purchasing power parity and Big Macs: https://files.stlouisfed.org/files/htdocs/publications/review/03/11/pakko.pdf
3. More information on plotly: https://plotly.com/python/

In [None]:
## Code to generate the averages quoted in the text above
df.groupby('Rock star', as_index=False).agg({
    'Years_Experience' : 'mean',
    'Number_of_Languages' : 'mean',
    'Number_of_roles' : 'mean',
    'Number_of_learning_platforms' : 'mean',
})