# How to Become a Part of the Top 6% and 1% of Data Scientists Financially

## Chapter 0: BACKSTORY 

Money, money, money. I'm sure that, as a data scientist, you have thought to yourself, "Hmm, what does it take to make the same amount of money as those data science guys I read about online or see on YouTube?" You are not alone. We have all thought about it!

A lot of people might get the wrong idea about what really makes a data scientist valuable. Like, valuable valuable. We are talking about \$150,000... \$250,000... \$350,000... \$500,000 and more!

So the goal of this notebook is to show:

* Where do the top 1% and 6% live?
* How old are they?
* For each income bracket, how much experience do they have?
* What level of education have they completed?
* For each level of education, how much experience do you need to be part of the top 1% and 6%?

## Chapter 1: LOADING AND CLEANING THE DATA

In this chapter we will simply load and filter the data so we are working with the right information. 

Steps:

1. Load the data using pandas.
2. Drop anyone who left the income question (Q24) empty. Without an income answer, we can't place a respondent in an income bracket.
3. Split the data into two dataframes, top1_data and top6_data, holding the people in the top 1% and top 6% by income (at least 300,000 and at least 150,000, respectively).

In [None]:
# imports 
%matplotlib inline
import os
import pandas as pd 
import numpy as np 
import math
import matplotlib.pyplot as plt

In [None]:
# step 1

def load_data():
    # path to the survey responses inside the Kaggle notebook environment
    csv_path = os.path.join("..", "input", "kaggle-survey-2020", "kaggle_survey_2020_responses.csv")
    return pd.read_csv(csv_path)

In [None]:
full_data = load_data()

In [None]:
# step 2 
sub = [
    "Q24"
]

full_data = full_data.dropna(axis=0, subset=sub)

In [None]:
# step 3
# income brackets (Q24) that make up the top 6% and top 1% groups
top6_brackets = [
    "150,000-199,999",
    "200,000-249,999",
    "250,000-299,999",
    "300,000-500,000",
    "> $500,000",
]
top1_brackets = ["300,000-500,000", "> $500,000"]

# keep only the respondents whose income falls in those brackets
top6_data = full_data[full_data["Q24"].isin(top6_brackets)].copy()
top1_data = full_data[full_data["Q24"].isin(top1_brackets)].copy()
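A quick way to sanity-check the "top 1%" and "top 6%" labels is to measure what share of respondents falls into each set of brackets. The sketch below runs on a small made-up dataframe (the `responses` name, the `bracket_share` helper, and the rows are assumptions for illustration, not survey results); on the real data you would pass `full_data` instead.

```python
import pandas as pd

# Toy stand-in for the survey: 100 made-up respondents (NOT real survey numbers)
responses = pd.DataFrame({
    "Q24": ["50,000-59,999"] * 94 + ["150,000-199,999"] * 5 + ["> $500,000"] * 1
})

top6_brackets = [
    "150,000-199,999",
    "200,000-249,999",
    "250,000-299,999",
    "300,000-500,000",
    "> $500,000",
]
top1_brackets = ["300,000-500,000", "> $500,000"]


def bracket_share(data, brackets):
    """Percentage of rows whose Q24 answer is one of the given brackets."""
    return data["Q24"].isin(brackets).mean() * 100


share_top6 = bracket_share(responses, top6_brackets)
share_top1 = bracket_share(responses, top1_brackets)

print(f"top 6% group: {share_top6:.1f}% of respondents")
print(f"top 1% group: {share_top1:.1f}% of respondents")
```

If the shares on the real data come out far from 6% and 1%, the group names are worth revisiting.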

## Chapter 2: WHERE DO THEY LIVE? 

So, we have our data. We gave it a little cleaning here and there. Now it's good enough for us to play around with a bit.

In this chapter we will see where the top 1% and 6% live. We will use pie charts because... who in their right mind doesn't like looking at pie charts, haha. They give a good idea of where most of the top 1% and 6% live.

Let's go!

In [None]:
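# function builds the percentage and count label for each piece of the pie
def func(pct, allvals):
    # round instead of truncating, so floating-point error can't drop a count by one
    absolute = int(round(pct / 100. * np.sum(allvals)))
    return "{:.1f}%\n({:d})".format(pct, absolute)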

In [None]:
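fig, (ax1, ax6) = plt.subplots(1, 2, figsize=(25, 25), subplot_kw=dict(aspect="equal"))

# respondents per country (Q3), largest first
country_counts_1 = top1_data["Q3"].value_counts()
country_counts_6 = top6_data["Q3"].value_counts()

# the five largest slices for each group
countries_legend_1 = country_counts_1.iloc[:5].index.tolist()
countries_data_1 = country_counts_1.iloc[:5].values.tolist()
countries_legend_6 = country_counts_6.iloc[:5].index.tolist()
countries_data_6 = country_counts_6.iloc[:5].values.tolist()

def add_other(legend, data, extra):
    # add the leftover count to the existing "Other" slice, or create one
    if "Other" in legend:
        data[legend.index("Other")] += extra
    else:
        legend.append("Other")
        data.append(extra)

# everything outside the five largest slices is folded into "Other"
add_other(countries_legend_1, countries_data_1, int(country_counts_1.iloc[5:].sum()))
add_other(countries_legend_6, countries_data_6, int(country_counts_6.iloc[5:].sum()))

wedges_1, texts_1, autotexts_1 = ax1.pie(
    countries_data_1,
    autopct=lambda pct: func(pct, countries_data_1),
    textprops=dict(color="w")
    )
wedges_6, texts_6, autotexts_6 = ax6.pie(
    countries_data_6,
    autopct=lambda pct: func(pct, countries_data_6),
    textprops=dict(color="w")
    )

# each pie gets its own legend, since the country order can differ between groups
ax1.legend(wedges_1, countries_legend_1,
          title="Countries",
          loc="upper center",
          bbox_to_anchor=(1, 0, 0.5, 1))
ax6.legend(wedges_6, countries_legend_6,
          title="Countries",
          loc="upper center",
          bbox_to_anchor=(1, 0, 0.5, 1))

# style the labels on both pies, not just the last one drawn
plt.setp(autotexts_1 + autotexts_6, size=8, weight="bold")

ax1.set_title("Countries the Top 1% Live In")
ax6.set_title("Countries the Top 6% Live In")

plt.show()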

### Takeaways from the pie charts above:

* To put into perspective how many more people in the top 1% and 6% live in the USA: the "Other" slice of the pie combines more than 35 countries.
* India and China are ranked #1 and #3, respectively, among the best countries for outsourcing software development, and both produce a significant number of STEM graduates (https://www.codeinwp.com/blog/best-countries-to-outsource-software-development/). Even so, it is interesting to see that India (orange slice) and China (purple slice) account for only about 10% of the top 1% and 6%.
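The "biggest slices plus Other" grouping behind these pie charts can be sketched as a small helper. `top_n_with_other` is a hypothetical name and the country answers are made up; this version also folds any answer already labeled "Other" into the same slice, which is one possible way to handle it. It also counts how many distinct countries ended up inside "Other".

```python
import pandas as pd


def top_n_with_other(values, n=5, other_label="Other"):
    """Counts for the n most common answers, with everything else under other_label."""
    counts = values.value_counts()
    # Set aside answers that are already labeled "Other" so they are not double counted
    existing_other = counts.get(other_label, 0)
    counts = counts.drop(labels=[other_label], errors="ignore")
    top = counts.iloc[:n]
    rest = int(counts.iloc[n:].sum() + existing_other)
    if rest > 0:
        top = pd.concat([top, pd.Series({other_label: rest})])
    return top


# Made-up answers for illustration only (NOT survey results)
answers = pd.Series(
    ["USA"] * 5 + ["India"] * 3 + ["Other"] * 2 + ["UK"] * 2 + ["Japan"] + ["Brazil"]
)

grouped = top_n_with_other(answers, n=3)
# Distinct named countries that were folded into the "Other" slice
folded_countries = answers[~answers.isin(grouped.index)].nunique()

print(grouped)
print("countries folded into Other:", folded_countries)
```

On the real data, `top_n_with_other(top6_data["Q3"], n=5)` would give the slice sizes, and `folded_countries` is the number to compare against the "more than 35 countries" figure.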

## Chapter 3: HOW OLD ARE THESE GUYS?

Age is becoming less and less of a factor in how much someone earns. Let's take a look at the age groups of the top earners in data science!

In [None]:
fig, (ax1, ax6) = plt.subplots(1, 2, figsize=(25, 25), subplot_kw=dict(aspect="equal"))

# respondents per age group (Q1), largest first
age_counts_1 = top1_data["Q1"].value_counts()
age_counts_6 = top6_data["Q1"].value_counts()

# the six largest age groups get their own slice
ages_legend_1 = age_counts_1.iloc[:6].index.tolist()
ages_data_1 = age_counts_1.iloc[:6].values.tolist()
ages_legend_6 = age_counts_6.iloc[:6].index.tolist()
ages_data_6 = age_counts_6.iloc[:6].values.tolist()

# every remaining age group is folded into an "Other" slice
ages_legend_1.append("Other")
ages_data_1.append(int(age_counts_1.iloc[6:].sum()))
ages_legend_6.append("Other")
ages_data_6.append(int(age_counts_6.iloc[6:].sum()))

wedges_1, texts_1, autotexts_1 = ax1.pie(
    ages_data_1,
    autopct=lambda pct: func(pct, ages_data_1),
    textprops=dict(color="w")
    )
wedges_6, texts_6, autotexts_6 = ax6.pie(
    ages_data_6,
    autopct=lambda pct: func(pct, ages_data_6),
    textprops=dict(color="w")
    )

# each pie gets its own legend, since the age-group order can differ between groups
ax1.legend(wedges_1, ages_legend_1,
          title="Ages",
          loc="upper left",
          bbox_to_anchor=(1, 0, 0.5, 1))
ax6.legend(wedges_6, ages_legend_6,
          title="Ages",
          loc="upper left",
          bbox_to_anchor=(1, 0, 0.5, 1))

# style the labels on both pies, not just the last one drawn
plt.setp(autotexts_1 + autotexts_6, size=8, weight="bold")

ax1.set_title("Age Groups of the Top 1%")
ax6.set_title("Age Groups of the Top 6%")

plt.show()

### Takeaways from the pie charts above:

* Proportion-wise, the two charts look very similar. In the top 1%, the 40-44 age group leads by almost 10%, while in the top 6% the 35-39 age group is almost level with it.
* The 40-44 and 25-29 age groups have the highest share of top-6% earners who are also in the top 1%: about 22% for 40-44 and about 20% for 25-29. Every other group is at about 13%.
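The "share of the top 6% that is also in the top 1%" figures come from dividing the two sets of counts, age group by age group. Here is a minimal sketch of that arithmetic on made-up ages (the series below are assumptions for illustration, not the survey numbers); on the real data the inputs would be `top6_data["Q1"]` and `top1_data["Q1"]`.

```python
import pandas as pd

# Made-up age answers for illustration only (NOT survey results)
top6_ages = pd.Series(["40-44"] * 10 + ["35-39"] * 8 + ["25-29"] * 5)
top1_ages = pd.Series(["40-44"] * 2 + ["25-29"] * 1)

top6_counts = top6_ages.value_counts()
# Line the top 1% counts up with the top 6% age groups; missing groups count as 0
top1_counts = top1_ages.value_counts().reindex(top6_counts.index, fill_value=0)

# Percentage of each age group's top-6% earners who are also in the top 1%
share_in_top1 = (top1_counts / top6_counts * 100).round(1)
print(share_in_top1)
```

This works because every top 1% respondent is also in the top 6%, so the ratio is a true share.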

## Chapter 4: FOR EACH INCOME BRACKET, WHAT IS THE LEVEL OF EXPERIENCE? 

This is something I constantly hear: "You get paid based on your experience." I agree with it, so let's see if we can show it clearly in a graph.

We are going to use a simple line graph (plt.plot()) to plot the frequency of each experience level for each income bracket. We will place all the lines on the same axes so they are easy to compare.

Steps:

1. Group each income bracket into a dataframe of its own.
2. Get the index and values (the values here are the frequencies of each experience level).
3. Put the frequencies in order of experience level by placing them into their respective lists.
4. Plot the graph.
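As an alternative to building each list by hand, steps 1 to 3 can be sketched in one go with `pd.crosstab`. This is just one way to do it, shown on a made-up dataframe (`toy` and its rows are assumptions for illustration); the column names Q24 and Q6 and the experience labels are the ones used in this notebook.

```python
import pandas as pd

# Experience levels (Q6) in chronological order
years_exp = [
    "I have never written code",
    "< 1 years",
    "1-2 years",
    "3-5 years",
    "5-10 years",
    "10-20 years",
    "20+ years",
]

# Made-up respondents for illustration only (NOT survey results)
toy = pd.DataFrame({
    "Q24": ["150,000-199,999", "150,000-199,999", "150,000-199,999",
            "300,000-500,000", "300,000-500,000"],
    "Q6": ["10-20 years", "20+ years", "3-5 years", "10-20 years", "10-20 years"],
})

# Rows are income brackets, columns are experience levels.
# reindex puts the columns in chronological order and fills unseen levels with 0.
table = pd.crosstab(toy["Q24"], toy["Q6"]).reindex(columns=years_exp, fill_value=0)
print(table)
```

Each row of `table` is then one line of the graph, and `table.T.plot()` would draw all of them on the same axes.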

In [None]:
# step 1
inc_pay_data = top6_data.copy()

group_inc = inc_pay_data.groupby("Q24")

salary1 = group_inc.get_group("150,000-199,999")
salary2 = group_inc.get_group("200,000-249,999")
salary3 = group_inc.get_group("250,000-299,999")
salary4 = group_inc.get_group("300,000-500,000")
salary5 = group_inc.get_group("> $500,000")


In [None]:
# step 2 
salary1_x = salary1["Q6"].value_counts().sort_index().index.tolist()
salary1_y = salary1["Q6"].value_counts().sort_index().values.tolist()

salary2_x = salary2["Q6"].value_counts().sort_index().index.tolist()
salary2_y = salary2["Q6"].value_counts().sort_index().values.tolist()

salary3_x = salary3["Q6"].value_counts().sort_index().index.tolist()
salary3_y = salary3["Q6"].value_counts().sort_index().values.tolist()

salary4_x = salary4["Q6"].value_counts().sort_index().index.tolist()
salary4_y = salary4["Q6"].value_counts().sort_index().values.tolist()

salary5_x = salary5["Q6"].value_counts().sort_index().index.tolist()
salary5_y = salary5["Q6"].value_counts().sort_index().values.tolist()

In [None]:
# step 3
# experience levels (Q6) in chronological order
years_exp = [
    'I have never written code',
    '< 1 years',
    '1-2 years',
    '3-5 years',
    '5-10 years',
    '10-20 years',
    '20+ years'
]

def counts_in_order(labels, counts):
    # look up each experience level; a level nobody picked counts as 0
    # (list.index() would raise a ValueError for a missing level)
    lookup = dict(zip(labels, counts))
    return [lookup.get(years, 0) for years in years_exp]

s1_y = counts_in_order(salary1_x, salary1_y)
s2_y = counts_in_order(salary2_x, salary2_y)
s3_y = counts_in_order(salary3_x, salary3_y)
s4_y = counts_in_order(salary4_x, salary4_y)
s5_y = counts_in_order(salary5_x, salary5_y)

In [None]:
# step 4
plt.rcParams["figure.figsize"] = (20,20)

labels = [
    "150,000-199,999",
    "200,000-249,999",
    "250,000-299,999",
    "300,000-500,000",
    "> $500,000"
]

# one line per income bracket, drawn in the same order as the labels
for counts, label in zip([s1_y, s2_y, s3_y, s4_y, s5_y], labels):
    plt.plot(years_exp, counts, label=label)

plt.legend(title="Income bracket")
plt.xlabel("Years of experience")
plt.ylabel("# of people")
plt.show()


### What to take away from this graph? 

* The majority of people in each income bracket have at least 10 years of experience under their belt.
* Almost nobody in the top 6% has never written code. Duh. There might be a bit of noise in the data, though, since not every job in the survey is a hands-on programming role. Not a big deal, since the graph still clearly shows that pay goes with experience.
* The slope from 10-20 years to 20+ years is steeper for the blue, orange, and green lines than for the red and purple ones. This is interesting because it raises a question: to make enough to be in the top 1%, is it really that important to have 20+ years of experience rather than 10-20? Something to think about.
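One caveat when comparing slopes: the brackets have very different numbers of people, so raw counts make the bigger brackets look steeper. A minimal sketch of one way around this is to turn each bracket's counts into shares first. The counts below are invented for illustration (they are not the survey numbers) and only cover two experience levels to keep it short.

```python
import pandas as pd

# Invented counts for illustration only (NOT survey results)
counts = pd.DataFrame(
    {"10-20 years": [120, 12], "20+ years": [150, 13]},
    index=["150,000-199,999", "300,000-500,000"],
)

# Divide each row by its own total so every bracket sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)

# Change in share from 10-20 years to 20+ years, per bracket
change = shares["20+ years"] - shares["10-20 years"]

print(shares.round(3))
print(change.round(3))
```

Plotting `shares` instead of `counts` would put all the lines on the same scale, which makes the slopes directly comparable.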

## Chapter 5: WHAT EDUCATION HAS THE TOP 1% AND TOP 6% ACCOMPLISHED? 

This is the best one. You sometimes hear students (me included) complain about wanting to drop out because education doesn't seem that important for making money in the data science field. It's all about experience, right? Well, let's put that theory to the test.

In [None]:
fig, (ax1, ax6) = plt.subplots(1, 2, figsize=(25, 25), subplot_kw=dict(aspect="equal"))

# respondents per education level (Q4), largest first
edu_counts_1 = top1_data["Q4"].value_counts()
edu_counts_6 = top6_data["Q4"].value_counts()

# the three most common education levels get their own slice
edu_legend_1 = edu_counts_1.iloc[:3].index.tolist()
edu_data_1 = edu_counts_1.iloc[:3].values.tolist()
edu_legend_6 = edu_counts_6.iloc[:3].index.tolist()
edu_data_6 = edu_counts_6.iloc[:3].values.tolist()

# every remaining education level is folded into an "Other" slice
edu_legend_1.append("Other")
edu_data_1.append(int(edu_counts_1.iloc[3:].sum()))
edu_legend_6.append("Other")
edu_data_6.append(int(edu_counts_6.iloc[3:].sum()))

wedges_1, texts_1, autotexts_1 = ax1.pie(
    edu_data_1,
    autopct=lambda pct: func(pct, edu_data_1),
    textprops=dict(color="w")
    )
wedges_6, texts_6, autotexts_6 = ax6.pie(
    edu_data_6,
    autopct=lambda pct: func(pct, edu_data_6),
    textprops=dict(color="w")
    )

# each pie gets its own legend, since the education order can differ between groups
ax1.legend(wedges_1, edu_legend_1,
          title="Education Level",
          loc="upper left",
          bbox_to_anchor=(1, 0, 0.5, 1))
ax6.legend(wedges_6, edu_legend_6,
          title="Education Level",
          loc="upper left",
          bbox_to_anchor=(1, 0, 0.5, 1))

# style the labels on both pies, not just the last one drawn
plt.setp(autotexts_1 + autotexts_6, size=8, weight="bold")

ax1.set_title("Education Level of the Top 1%")
ax6.set_title("Education Level of the Top 6%")

plt.show()

### Takeaways from the pie charts:

* Almost 50% of the top 6% have a master's degree, and almost 40% of the top 1% have one too. So it's fair to say that a master's degree is certainly a bonus.
* A doctoral degree is a fairly close second in the top 1%, whereas in the top 6% there are almost twice as many master's degrees as doctoral degrees. Put differently, the share of doctoral-degree holders in the top 6% who also make the top 1% is almost 7% higher than the same share for master's-degree holders.
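The percentages quoted above are simply each education level's share of its group, which `value_counts(normalize=True)` gives directly. A minimal sketch on made-up answers (the series below is an assumption for illustration, not the survey data); on the real data the input would be `top6_data["Q4"]` or `top1_data["Q4"]`.

```python
import pandas as pd

# Made-up education answers for illustration only (NOT survey results)
top6_edu = pd.Series(
    ["Master’s degree"] * 5 + ["Doctoral degree"] * 3 + ["Bachelor’s degree"] * 2
)

# normalize=True returns fractions of the total instead of raw counts
edu_share = (top6_edu.value_counts(normalize=True) * 100).round(1)
print(edu_share)
```

Running the same line on both groups and subtracting the results would show which education levels gain or lose share between the top 6% and the top 1%.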

## Chapter 6: FOR EACH LEVEL OF EDUCATION, HOW MANY YEARS OF EXPERIENCE DO I REQUIRE TO MAKE THE BIG BUCKS? 

If you think about it, the way it should work is this: the less education you have, the more years of experience you need to make more money, and the higher your level of education, the less experience you need.

Let's test this theory.

Steps:

1. Copy the data from the top 6% and split it into groups based on education. We will focus on 4 levels.
2. Get the index and values for the experience levels within each level of education.
3. Plot.
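One thing to watch in step 2: the experience labels are text, so sorting them alphabetically does not give chronological order, and a level nobody picked simply goes missing. A minimal sketch of the pitfall and one way around it, using `reindex` on made-up answers (the `answers` series is an assumption for illustration):

```python
import pandas as pd

# Experience levels (Q6) in chronological order
years_exp = [
    "I have never written code",
    "< 1 years",
    "1-2 years",
    "3-5 years",
    "5-10 years",
    "10-20 years",
    "20+ years",
]

# Made-up answers for illustration only (NOT survey results)
answers = pd.Series(["20+ years", "1-2 years", "10-20 years", "20+ years", "< 1 years"])

# Alphabetical order: "10-20 years" lands before "20+ years" and "< 1 years" comes last
alphabetical = answers.value_counts().sort_index().index.tolist()

# Chronological order, with 0 for every level that nobody picked
chronological = answers.value_counts().reindex(years_exp, fill_value=0)

print(alphabetical)
print(chronological)
```

The reindexed counts always have one entry per label in `years_exp`, so they line up with the x-axis ticks of a bar chart.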

In [None]:
# step 1
inc_edu_data = top6_data.copy()

group_edu = inc_edu_data.groupby("Q4")

edu1 = group_edu.get_group("No formal education past high school")
edu5 = group_edu.get_group("Bachelor’s degree")
edu6 = group_edu.get_group("Master’s degree") 
edu7 = group_edu.get_group("Doctoral degree") 


In [None]:
# step 2
# count experience levels (Q6) per education group in the chronological order of
# years_exp; sort_index() alone would order the labels alphabetically, and levels
# with no respondents get 0 so every list lines up with the x axis
edu1_x = years_exp
edu1_y = edu1["Q6"].value_counts().reindex(years_exp, fill_value=0).tolist()

edu5_x = years_exp
edu5_y = edu5["Q6"].value_counts().reindex(years_exp, fill_value=0).tolist()

edu6_x = years_exp
edu6_y = edu6["Q6"].value_counts().reindex(years_exp, fill_value=0).tolist()

edu7_x = years_exp
edu7_y = edu7["Q6"].value_counts().reindex(years_exp, fill_value=0).tolist()


In [None]:
# step 3
plt.rcParams["figure.figsize"] = (20,20)

x_values = years_exp

bar_edu1 = np.arange(len(x_values))
bar_edu5 = [i+0.2 for i in bar_edu1]
bar_edu6 = [i+0.2 for i in bar_edu5]
bar_edu7 = [i+0.2 for i in bar_edu6]


plt.bar(bar_edu1, edu1_y, width=0.2, label="No education")
plt.bar(bar_edu5, edu5_y, width=0.2, label="Bachelor's")
plt.bar(bar_edu6, edu6_y, width=0.2, label="Master's")
plt.bar(bar_edu7, edu7_y, width=0.2, label="Doctoral")

plt.xticks((bar_edu1+0.2/2)+0.2, x_values)
plt.legend()
plt.xlabel("Years of experience")
plt.ylabel("# of people")
plt.show()

### What to take away from the bar chart above? 

* Straight away we can notice that the people in the top 6% with no formal education past high school needed 20+ years of experience to make the money they are making. So the assumption from the beginning of the chapter seems to hold so far.

* For master's and doctoral degrees (the green and red bars, respectively), the chart shows more people with < 1 year and 1-2 years of experience than in any other experience category. This suggests that the higher your level of education, the less experience you need to make the top 6%. This reading only holds if the counts follow the same order as the labels in years_exp, so each bar sits above the right experience level.

# Conclusion 

So, if you are 40-44 years old, living in the USA, with a master's degree and around 1 year of experience, congrats, you are going to be rich very soon, haha.

In all seriousness, there are a few key points that this data suggests and that we kind of always knew in the back of our minds:

1. If you live in the USA, you have a higher chance of being in the top 6%. This makes perfect sense. Many of the biggest companies in the world are headquartered in the USA. Some of the best STEM universities are in the States. And so on.

2. People aged 35-44 tend to have a better chance of making more money. This makes sense because by that age, if you have been in the field long enough, you can have the right amount of experience and knowledge to excel. Plus, I'm assuming you also have more energy and time compared to older people. I don't know for sure, though, because I'm only 22 :)

3. Your level of education does play a part in the money you will make down the road. A master's or a doctoral degree is the better option. I think this is self-explanatory. Education is important, and what you do with it is even more important.

4. If you have a higher-level degree, experience is not as strong a factor in your income as it is with a lower level of education. This makes perfect sense to me; again, self-explanatory.

5. Other factors play a bigger role than years of experience once you start making more than 300,000. I think this has to do with things outside the line of work you do, haha. When you start making that much money, you are doing more than just "data analysis".
