# **HOW TO GET A +100,000 $ SALARY AS A DATA SCIENTIST**

## 1. Introduction

In this notebook I will analyze the *2020 Kaggle Machine Learning & Data Science Survey* to understand what a Data Scientist needs to do to get above $100,000 per year.

In [None]:
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 10)
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

The survey was answered by 20,036 people, as can be checked here:

In [None]:
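# Read and slightly clean the dataset
df = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")
df = df.drop(df.columns[0], axis=1)  # drop the first column (survey duration)
df = df.iloc[1:].reset_index(drop=True)  # the first row holds the question text, not an answer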

In [None]:
survey_respondents = df.shape[0]
print("2020 Kaggle Machine Learning & Data Science Survey respondents:", survey_respondents)

## 2. Data Preparation

### 2.1 Salary variable

First, it is important to properly prepare the salary column which corresponds to question number 24 (Q24). The possible answers for this question were ranges (example: $0-999):




In [None]:
df['Q24'].unique()

Since the answer was a range (minimum salary and maximum salary), I will split that information into two different columns with the **minimum salary** (MinSalary) and the **maximum salary** (MaxSalary).

Besides, I will compute the mean salary and create a dataframe with the three variables: **MinSalary**, **MaxSalary** and **MeanSalary**:

In [None]:
# Obtain the min and max salaries of the ranges
min_salaries = []
max_salaries = []
for salary in df['Q24']:
    if pd.isna(salary):
        min_salary = np.nan
        max_salary = np.nan
    else:
        remove_comma = salary.replace(',', '')
        clean_salary = re.findall(r'\d+', remove_comma)  # get the numbers from the string
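        if len(clean_salary) == 2:
            min_salary = int(clean_salary[0])
            max_salary = int(clean_salary[1])
        elif len(clean_salary) == 1:
            # Open-ended range (e.g. "> $500,000"): only one bound is available
            min_salary = int(clean_salary[0])
            max_salary = min_salary
        else:
            # Unexpected format: treat as missing instead of reusing the previous values
            min_salary = np.nan
            max_salary = np.nan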
    min_salaries.append(min_salary)
    max_salaries.append(max_salary)

# Dataframe
salary_df = pd.DataFrame({'MinSalary':min_salaries, 'MaxSalary':max_salaries})
salary_df['MeanSalary'] = salary_df.mean(axis=1)
salary_df.head()
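The same parsing can also be sketched as a small standalone function, which makes the edge cases easier to check. The helper name `parse_salary_range` and the sample strings below are assumptions made for illustration, not part of the survey code above. Note that an open-ended answer such as `> $500,000` yields a single number, so its "mean" is just that lower bound.

```python
import re

import numpy as np
import pandas as pd


def parse_salary_range(answer):
    """Return (min, max) for a salary answer; hypothetical helper for illustration."""
    if pd.isna(answer):
        return (np.nan, np.nan)
    numbers = [int(n) for n in re.findall(r"\d+", answer.replace(",", ""))]
    if len(numbers) == 2:
        return (numbers[0], numbers[1])
    if len(numbers) == 1:
        # Open-ended answer: only a lower bound is known
        return (numbers[0], numbers[0])
    return (np.nan, np.nan)


# Sample answers in the style of Q24 (assumed for this sketch)
sample = pd.Series(["$0-999", "100,000-124,999", "> $500,000", np.nan])
parsed = pd.DataFrame(sample.map(parse_salary_range).tolist(),
                      columns=["MinSalary", "MaxSalary"])
parsed["MeanSalary"] = parsed.mean(axis=1)  # midpoint of each range
print(parsed)
```

Keeping the parsing in one function means the missing-value and single-number cases are handled in one place and can be tested on a handful of strings before running on the whole column.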

Next, I will concatenate the previous dataframe (salary_df) with the main dataframe (df):

In [None]:
df = pd.concat([df, salary_df], axis=1)
df.head()

### 2.2 Employed respondents

Some of the respondents were not actually working, hence it makes sense to remove them before considering the salaries of the best paid Data Scientists:

In [None]:
working_people_df = df[(df['Q5'] != 'Currently not employed') & (df['Q5'] != 'Student') & df['Q5'].notna()]
working_people = working_people_df.shape[0]
print("Amount of people working who answered the survey:", working_people)
print("Percentage of working people who answered the survey:", working_people / survey_respondents * 100)

It is interesting to see that around 62% of the respondents were actually working.



### 2.3 Salary overview

Before choosing the best-paid Data Scientists, I will have a look at the salaries of the employed respondents:

In [None]:
plt.figure(figsize=(15,6))
plt.hist(working_people_df['MeanSalary'], bins=30)
plt.title('Salary histogram of employed respondents')
plt.xlabel('Salary ranges in $')
plt.ylabel('Amount of people');

### 2.4 Best paid professionals

Taking into account only those employed respondents, I will consider the best-paid people to be those who earn at least $100,000 per year. Here we will see who the best-paid professionals are by job title:

In [None]:
# Select best-paid professionals
best_paid_df = working_people_df[working_people_df['MeanSalary'] >= 100000]
# Rank the best-paid workers
best_paid_ranking_by_job = best_paid_df.groupby(['Q5'])['Q1'].count().sort_values(ascending=False).to_frame()
best_paid_ranking_by_job.reset_index(level=0, inplace=True)
best_paid_ranking_by_job = best_paid_ranking_by_job.rename(columns={"Q5": "Position", "Q1": "Amount"})
best_paid_ranking_by_job['Percentage'] = best_paid_ranking_by_job['Amount'] / best_paid_ranking_by_job['Amount'].sum() * 100
best_paid_ranking_by_job

### 2.5 Best paid Data Scientists

Considering the previous dataframe, I will get the best paid Data Scientists. I will consider Machine Learning Engineers in the same group since I think that they are more specialized in one topic, but they still belong to the Data Science field:

In [None]:
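# Consider Machine Learning Engineers in the same group as Data Scientists.
# The extra parentheses matter: & binds tighter than |, so without them the
# coding filter would only apply to Machine Learning Engineers.
best_paid_datascientists_df = best_paid_df[((best_paid_df['Q5'] == 'Data Scientist')
                                            | (best_paid_df['Q5'] == 'Machine Learning Engineer'))
                                           & (best_paid_df['Q6'] != 'I have never written code')]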

print("Percentage of Data Scientist and ML engineers in the best-paid professionals group:",
      best_paid_datascientists_df.shape[0] / best_paid_df.shape[0] * 100)

Considering the results from above, Data Scientists are the largest group among those paid $100,000 or more per year. In fact, 38% of the best-paid professionals are Data Scientists or Machine Learning Engineers.
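One detail worth keeping in mind when combining conditions like these: in Python, `&` binds more tightly than `|`, so the way the masks are grouped decides whether the coding filter applies to both job titles or only to one. Here is a minimal sketch on toy data (the rows below are made up for illustration):

```python
import pandas as pd

# Toy data (assumed for this sketch): Q5 is the job title, Q6 the coding experience
toy = pd.DataFrame({
    "Q5": ["Data Scientist", "Data Scientist",
           "Machine Learning Engineer", "Machine Learning Engineer",
           "Software Engineer"],
    "Q6": ["5-10 years", "I have never written code",
           "3-5 years", "I have never written code",
           "10-20 years"],
})

is_ds = toy["Q5"] == "Data Scientist"
is_mle = toy["Q5"] == "Machine Learning Engineer"
has_coded = toy["Q6"] != "I have never written code"

# Without parentheses this is evaluated as is_ds | (is_mle & has_coded)
ungrouped = toy[is_ds | is_mle & has_coded]

# Explicit grouping applies the coding filter to both job titles
grouped = toy[(is_ds | is_mle) & has_coded]

print(len(ungrouped), len(grouped))
```

Naming each mask before combining them tends to make this kind of grouping easier to read and to verify.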

Let's plot a histogram to see the salary distribution of the best paid Data Scientists and Machine Learning Engineers:

In [None]:
plt.figure(figsize=(15,6))
plt.hist(best_paid_datascientists_df['MeanSalary'], bins=30)
plt.title('Salary histogram of best-paid Data Scientists and Machine Learning Engineers')
plt.xlabel('Salary ranges in $')
plt.ylabel('Amount of people');

## 3. Analysis: what do you need to be a top-paid Data Scientist?

In this section I will show some conclusions regarding different aspects of the top-paid Data Scientists.

### 3.1 Useful functions for the analysis

Before getting some insights from the survey questions, I will prepare a couple of functions to summarize the relevant questions I will analyze. As there are two types of questions according to the answer type (unique choice and multiple choice), I will develop two different functions to speed up the analysis later.

This is the function for **multiple choice answers**:

In [None]:
def get_answers_multiple_choice(df, column, value_list, name):
    """
    Simplify the answer of multiple columns that belong to the same question.
    As they are multiple choice questions, each column states one of the options.
    The function returns a dataframe with the amount of answers and their percentage of the total.
        
    Example: for a question with 3 possible answers, the result has 3 rows
    (one per option) and 3 columns (name, Amount and Percentage)
    
    :param df: dataframe to use
    :type df: dataframe
    :param column: column to choose
    :type column: string
    :param value_list: list of the possible values, in the same order as the question's columns
    :type value_list: list
    :param name: name of the column to be used in the new dataframe
    :type name: string
    :return: dataframe with the counted answers and percentage ordered from highest
    to lowest
    :rtype: dataframe
    """

    # Get the columns
    column_list = []  # Create an empty list to store the values of the loop
    for column_i in df.columns:
        splitted = column_i.split("_")
        if column in splitted:
            column_list.append(column_i)
    
    # Dataframe with the chosen columns
    filtered_df = df[column_list]
    
    # Answers per order of the list
    amount_list = []  # Create an empty list to store the values of the loop
    for column_i in filtered_df.columns:
        counter = filtered_df[column_i].count()
        amount_list.append(counter)
    
    # Create a dataframe
    combined_dataframe = pd.DataFrame({name: value_list, 'Amount': amount_list})
    
    # Create percentage column
    total = combined_dataframe['Amount'].sum()
    combined_dataframe['Percentage'] = combined_dataframe['Amount'] / total * 100
    
    # Sort the values
    combined_dataframe = combined_dataframe.sort_values('Percentage', ascending=False)

    return combined_dataframe
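To make the counting logic easier to follow, here is a small sketch of the same idea on a toy dataframe. The data and the extra `PercentageOfRespondents` column are assumptions added for illustration. The function above reports each option as a share of all selections, whereas the extra column shows the share of respondents who picked it, which can be a useful second view for multiple choice questions.

```python
import pandas as pd

# Toy multiple-choice question (assumed data): one column per option,
# holding the option text when selected and a missing value otherwise
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", "Python", None],
    "Q7_Part_2": ["R", None, None, None],
    "Q7_Part_3": ["SQL", "SQL", None, None],
    "Q8": ["Python", "Python", "R", None],  # a different question, must be ignored
})

# Select the columns whose name contains the question id as a "_"-separated token
question_columns = [c for c in toy.columns if "Q7" in c.split("_")]

# count() ignores missing values, so it gives the number of respondents per option
counts = toy[question_columns].count()

summary = pd.DataFrame({
    "Language": ["Python", "R", "SQL"],  # same order as the columns
    "Amount": counts.to_numpy(),
})

# Share of all selections, as in the function above
summary["Percentage"] = summary["Amount"] / summary["Amount"].sum() * 100

# Alternative view (not in the function above): share of respondents who answered at all
answered = toy[question_columns].notna().any(axis=1).sum()
summary["PercentageOfRespondents"] = summary["Amount"] / answered * 100

print(summary)
```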

And the next function will be used to summarize the **unique answer questions**:

In [None]:
def get_answers_unique_choice(df, chosen_column, chosen_column_name, random_column):
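    """
    Obtain the amount of answers and their percentage for a unique choice
    question
    
    :param df: dataframe to use
    :type df: dataframe
    :param chosen_column: column to analyze
    :type chosen_column: string
    :param chosen_column_name: name given to the analyzed column in the result
    :type chosen_column_name: string
    :param random_column: any other column, used only to count rows
    (its missing values are not counted)
    :type random_column: string
    :return: dataframe with the counted answers and percentage ordered from highest
    to lowest
    :rtype: dataframe
    """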

    grouped_df = pd.DataFrame(df.groupby(chosen_column)[random_column].count().sort_values(ascending=False))
    
    grouped_df = grouped_df.rename(columns={random_column: 'Amount'})
    
    
    total = grouped_df['Amount'].sum()
    grouped_df['Percentage'] = grouped_df['Amount'] / total * 100
    grouped_df.reset_index(level=0, inplace=True)
    grouped_df.rename(columns={chosen_column:chosen_column_name}, inplace=True)
    
    return grouped_df
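For reference, roughly the same summary can be obtained with `Series.value_counts`, which does not need a second column to count on. This is only a sketch on made-up data. One difference to keep in mind is that the function above counts non-missing values of `random_column` within each group, while `value_counts` counts every non-missing answer in the analyzed column.

```python
import pandas as pd

# Toy single-choice question (assumed data)
toy = pd.DataFrame({
    "Q4": ["Master's degree", "Master's degree", "Doctoral degree",
           "Bachelor's degree", None],
})

# Missing answers are dropped and the result is sorted by frequency
counts = toy["Q4"].value_counts()

summary = pd.DataFrame({
    "Studies": counts.index,
    "Amount": counts.to_numpy(),
})
summary["Percentage"] = summary["Amount"] / summary["Amount"].sum() * 100

print(summary)
```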

### 3.2 Where should you work as a Data Scientist?

These are the top 10 countries by number of best-paid Data Scientists or Machine Learning Engineers:

In [None]:
get_answers_unique_choice(best_paid_datascientists_df, 'Q3', 'Country', 'Q1').head(10)

### 3.3 Education

Here I will consider two points:
- Achieved level of studies
- Type of taken Data Science courses

This is what these top-paid respondents have studied:

In [None]:
get_answers_unique_choice(best_paid_datascientists_df, 'Q4', 'Studies', 'Q1')

Most of the Data Scientists and Machine Learning Engineers that are in the best-paid group have finished a university degree (Bachelor's, Master's or Doctoral).

And here we have more information about the Data Science courses they have taken:

In [None]:
course_platform_list = ['Coursera', 'edX', 'Kaggle', 'DataCamp', 'fast.ai', 'Udacity', 'Udemy', 'LinkedInLearning', 'Cloud-certificationPrograms', 'UniversityCourses', 'None', 'Other']
get_answers_multiple_choice(best_paid_datascientists_df, 'Q37', course_platform_list, 'CoursePlatform')

It is worth noting that the *None* option accounts for less than 5% of the selected answers, so very few of the best-paid respondents have never completed a data science course.

### 3.4 Company type

In this section I will analyze these areas:
- Company size
- Number of people responsible for Data Science in the company

Company size:

In [None]:
get_answers_unique_choice(best_paid_datascientists_df, 'Q20', 'CompanySize', 'Q1')

Number of people responsible for Data Science in the company:

In [None]:
get_answers_unique_choice(best_paid_datascientists_df, 'Q21', 'DataSciencePeople', 'Q1')

This might mean that although big corporations tend to pay higher salaries, start-ups also invest in their Data Scientists.

**Note**: I have assumed that companies with fewer employees and fewer Data Scientists are start-ups.

### 3.5 Technical Requirements and Experience

In this section I will analyze the technical requirements you need to meet to earn a top salary as a Data Scientist:

#### 3.5.1 Years coding

For a Data Scientist or Machine Learning Engineer, coding is fundamental:

In [None]:
get_answers_unique_choice(best_paid_datascientists_df, 'Q6', 'YearsCoding', 'Q1')

After 5 years of coding experience, the chances of being better paid are greater.

Note that respondents who answered *I have never written code* should not be considered Data Scientists, since coding is part of the daily duties of the job.

#### 3.5.2 Years using Machine Learning

Let's analyze the number of years that best-paid Data Scientists have been using Machine Learning:

In [None]:
get_answers_unique_choice(best_paid_datascientists_df, 'Q15', 'YearsWithML', 'Q1')

The answers are diverse in this case and no clear conclusion can be drawn.

#### 3.5.3 Regularly used programming languages

The top 5 languages are the following ones:

In [None]:
language_list = ['Python', 'R', 'SQL', 'C', 'C++', 'Java', 'Javascript', 'Julia', 'Swift', 'Bash', 'MATLAB', 'None', 'Other']
get_answers_multiple_choice(best_paid_datascientists_df, 'Q7', language_list, 'Language').head(5)

In 2020, Python is king for Data Science and Machine Learning. However, do not forget about SQL and R, since they are relevant too.

#### 3.5.4 Regularly used Integrated Development Environments (IDEs)

The top 5 IDEs are the following ones:

In [None]:
ide_list = ['JupyterLab (or products based off of Jupyter)', 'RStudio', 'Visual Studio', 'Visual Studio Code (VSCode)', 'PyCharm', 'Spyder', 'Notepad++', 'Sublime Text', 'Vim, Emacs, or similar', 'MATLAB', 'None', 'Other']
get_answers_multiple_choice(best_paid_datascientists_df, 'Q9', ide_list, 'IDE').head(5)

In 2020, Jupyter is the most used IDE among the best-paid group. Nevertheless, other IDEs like VSCode or PyCharm are also worth considering, since they are more widely used for software development. They might be tools you will want to learn if your models need to go to production.

#### 3.5.5 Regularly used plotting libraries

The top 5 plotting libraries are the following ones:

In [None]:
plot_list = ['Matplotlib', 'Seaborn', 'Plotly / Plotly Express', 'Ggplot / ggplot2',
             'Shiny', 'D3.js', 'Altair', 'Bokeh', 'Geoplotlib', 'Leaflet / Folium',
             'None', 'Other']
visualization_best_paid_data_scientists = get_answers_multiple_choice(best_paid_datascientists_df, 'Q14', 
                                                                      plot_list, 'VisualizationLibraries')
visualization_best_paid_data_scientists.head(5)

Matplotlib and Seaborn are the plotting libraries most used by the best-paid group.

#### 3.5.6 Regularly used Machine Learning algorithms

The top 5 Machine Learning algorithms are the following ones:

In [None]:
algorithms_list = ['Linear or Logistic Reg', 'Decision Trees or Random Forests',
                   'Gradient Boosting', 'Bayesian Approaches', 'Evolutionary Approaches',
                   'Dense Neural Nets', 'CNN', 'GANS', 'RNN', 'Transformer Networks',
                   'None', 'Other']
algorithms_list_best_paid_data_scientists = get_answers_multiple_choice(best_paid_datascientists_df, 'Q17', 
                                                                        algorithms_list, 'AlgorithmTypes')
algorithms_list_best_paid_data_scientists.head()

It is interesting to see that the top 3 Machine Learning algorithms are not Deep Learning models.

#### 3.5.7 Regularly used Machine Learning frameworks

In [None]:
ml_list = ['Scikit-learn', 'TensorFlow', 'Keras', 'PyTorch', 'Fast-ai', 'MXNet',
           'Xgboost', 'LightGBM', 'CatBoost', 'Prophet', 'H2O 3', 'Caret',
           'Tidymodels', 'JAX', 'None', 'Other']
ml_best_paid_data_scientists = get_answers_multiple_choice(best_paid_datascientists_df, 'Q16', ml_list, 'MLLibraries')
ml_best_paid_data_scientists.head()

As seen in the previous question, less complex models are more used by the best-paid Data Scientists in 2020. Hence, frameworks such as Scikit-learn and Xgboost are the most common.

## 4. Summary

To sum up, this is what you need to do in order to become a best-paid (minimum salary of $100,000 per year) Data Scientist or Machine Learning Engineer:

- **Location**: work in the USA
- **Formal education**: get a university degree and if possible, go for a Master's degree
- **Data Science education**: take Data Science courses at University or through other platforms (Coursera, DataCamp, edX, Udemy, etc.)
- **Company size**: work for a start-up or a large corporation
- **Experience**: get at least 5 years of coding experience
- **Languages**: master Python and be confident with SQL. R might be useful too (combine it with RStudio IDE)
- **Integrated Development Environments (IDEs)**: Jupyter products (JupyterLab or Jupyter Notebook) are key, but do not forget about Visual Studio Code or PyCharm in case you need to integrate your work into production software
- **Machine Learning**: focus first on mastering the basic Machine Learning algorithms: Linear or Logistic Regression, Decision Trees or Random Forests, and Gradient Boosting. For those you will mainly be using Scikit-learn and Xgboost


## 5. Sources

- 2020 Kaggle Machine Learning & Data Science Survey: https://www.kaggle.com/c/kaggle-survey-2020