# **What’s in a Name?<br><br> Analysis of Data Science Job Titles and Skill Sets in the United States**

[Introduction](#intro)

[Survey Methodology](#method)

[Analysis of Survey Results](#analysis)

* [Geography](#geo)

* [Job Titles](#titles)

* [Job Responsibilities](#duties)

* [Technology on the Job](#tech)

* [Educational Background](#education)

* [Age](#age)

* [Gender](#gender)
   
* [Salary](#salary)

* [Years of Experience](#yoe)

[Conclusion](#conclusion)

[References](#refs)

In [None]:
# Import needed libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px

# Install openpyxl

!pip install openpyxl

In [None]:
# Read CSV file as dataframe

df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')

In [None]:
# Delete first row to drop questions

df = df.iloc[1: , :]

In [None]:
# Create function for bar chart

def create_horizontal_bar_chart(dataframe,
                                x_axis_data,
                                y_axis_data,
                                number_label_text,
                                x_axis_range,
                                title,
                                x_axis_title,
                                y_axis_title):
    fig = px.bar(dataframe,
                 x=x_axis_data,
                 y=y_axis_data,
                 orientation='h',
                 hover_data=[x_axis_data, number_label_text],
                 text=number_label_text,
                 height=600,
                 width=1000,
                 color_discrete_sequence=['#BF1363'],
                 title=title,
                 labels={x_axis_data: x_axis_title, y_axis_data: y_axis_title})
    fig.update_layout(font_family='Raleway',
                      font_color='#191923',
                      title_font_family='Raleway Medium',
                      title_font_color='#191923',
                      title_font_size=18,
                      legend_title_font_color='#191923',
                      plot_bgcolor='white',
                      hoverlabel=dict(bgcolor='#F39237', font_size=14, font_family='Raleway', font_color='white',
                                      bordercolor='#F39237'))
    fig.update_xaxes(type='linear',
                     range=x_axis_range,
                     tickfont=dict(family='Raleway', color='#191923', size=12),
                     title_font=dict(size=14, family='Raleway Medium', color='#191923'),
                     showgrid=False,
                     showline=True, linewidth=1, linecolor='#191923', mirror=False)
    fig.update_yaxes(tickfont=dict(family='Raleway', color='#191923', size=12),
                     title_font=dict(size=14, family='Raleway Medium', color='#191923'),
                     showgrid=False,
                     showline=True, linewidth=1, linecolor='#191923', mirror=False)
    fig = fig.update_traces(marker_line_width=0)
    fig.show()

In [None]:
# Create function for heat map

def create_heat_map(dataframe,
                    x_axis_title,
                    y_axis_title,
                    color_bar,
                    x_axis_tick_angle):
    fig = px.imshow(dataframe,
                    labels=dict(x=x_axis_title, y=y_axis_title, color=color_bar),
                    height=1000,
                    width=1000,
                    color_continuous_scale=['#0E79B2','#BF1363', '#F39237'])
    fig.update_layout(font_family='Raleway',
                      font_color='#191923',
                      title_font_family='Raleway Medium',
                      title_font_color='#191923',
                      title_font_size=18,
                      legend_title_font_color='#191923',
                      plot_bgcolor='white',
                      hoverlabel=dict(bgcolor='#191923', font_size=14, font_family='Raleway', font_color='white',
                                      bordercolor='#191923'),
                      legend_yanchor='top',
                      legend_y=1,
                      legend_xanchor='right',
                      legend_x=1)
    fig.update_xaxes(side='top',
                     tickangle=x_axis_tick_angle,
                     tickfont=dict(family='Raleway', color='#191923', size=12),
                     title_font=dict(size=14, family='Raleway Medium', color='#191923'),
                     showgrid=False,
                     showline=True, linewidth=1, linecolor='#191923', mirror=False)
    fig.update_yaxes(tickfont=dict(family='Raleway', color='#191923', size=12),
                     title_font=dict(size=14, family='Raleway Medium', color='#191923'),
                     showgrid=False,
                     showline=True, linewidth=1, linecolor='#191923', mirror=False,
                     tickangle=0)
    fig.show()

# **Introduction**<a name='intro'></a>

What makes a data scientist a data scientist? Or a data analyst a data analyst? As with the term “big data,” there are many definitions of these roles. The field of data science and machine learning continues to grow and evolve, and so do the job titles (and job responsibilities) within it.

Because the field is still relatively new, there is no general consensus on how to define its many interrelated roles. Responsibilities tend to overlap, particularly in smaller companies where there may not be a dedicated data team. The term “data scientist” is still so new that the U.S. Bureau of Labor Statistics (2021) does not list a category for this role. Data scientists and data analysts would likely fall within the category of Database Administrators and Architects; related categories include Computer Programmers and Computer and Information Research Scientists.

The National Center for O\*NET Development (2021) provides slightly better data regarding job titles in the data science field. Its current data set for alternate titles lists various data science job titles, including clinical data managers (14 alternate titles), data entry keyers (82 alternate titles), data scientists (24 alternate titles), data warehousing specialists (11 alternate titles), database administrators (29 alternate titles), and database architects (46 alternate titles).

In their recent paper “A Data Science Approach to Defining a Data Scientist,” Ho et al. (2019) concluded that “the title of ‘Data Scientist’ is still a new concept that has and will continue to evolve as the role molds to the needs of artificial intelligence and business requirements. Our current definition is the result of the current market’s view via job postings. Additional future work is required to capture a higher accuracy. . . . For future work, additional data such as surveys, in person interviews with working data scientists, and web scraping additional websites could provide a more well-rounded (and accurate) prediction on the definition and skill set.”

Perhaps the best way to determine who does what in the data science and machine learning field is to analyze job postings. In a blog post on KDnuggets, Theuwissen (2015) summarizes the job titles of data scientist, data analyst, data architect, data engineer, statistician, database administrator, business analyst, and data and analytics manager based on relevant job postings. He also lists the most popular programming languages, salary ranges, and roles within a team for each of these jobs.

Verma et al. (2019) also examined a sample of online job postings for titles such as business analyst, business intelligence analyst, data analyst, and data scientist using content analysis, focusing on four U.S. states: Arkansas, Florida, Missouri, and Kansas. They concluded that “the [business analyst] category appears to be the least technical of the four studied job categories. The [business analyst] jobs require a high degree of domain knowledge whereas the [business intelligence analyst] jobs focus strongly on structured data management skills along with some knowledge of statistics. The requirements for [data analyst] overlap with those for [data scientist] in the areas of decision-making and organization skills. Compared to the [data analyst] jobs, the [data scientist] ones strongly rely on statistical and programming skills.”

The analysis in this notebook aims to provide a comprehensive overview of job titles and job responsibilities within the U.S. data science field. It examines demographic information such as age, gender, and levels of education as well as technology use and participants’ salaries and years of experience.

In [None]:
# Read ONET Excel file to dataframe

df_onet = pd.read_excel('../input/onet-data/Alternate Titles.xlsx')

In [None]:
# Filter to show titles

df_onet_filtered = df_onet[['Title', 'Alternate Title']]

In [None]:
# Filter to find relevant titles

df_onet_filtered_data = df_onet_filtered[df_onet_filtered['Title'].str.contains('Data')]

In [None]:
# Create pivot table

onet_pivot_table = pd.pivot_table(df_onet_filtered_data,
                       #values='D',
                       index=['Title','Alternate Title'],
                       #columns=['Title'],
                       aggfunc='sum',
                       fill_value=0)

In [None]:
df_onet_filtered_data_grouped = df_onet_filtered_data.groupby('Title').count().reset_index()

# **Survey Methodology**<a name='method'></a>

The 2021 Kaggle Machine Learning & Data Science Survey received 25,973 usable responses from participants in 171 different countries and territories. The full list of questions and answer choices is provided in a supplementary file (Kaggle, 2021).

To ensure response quality, Kaggle excluded responses that were flagged by the survey system as “spam” or “duplicate.” Responses from participants who spent less than 2 minutes completing the survey, as well as responses from participants who selected fewer than 15 answer choices in total, were dropped. Follow-up questions were only asked to respondents who answered the setup question affirmatively.
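
The exclusion rules described above amount to a row filter. The sketch below is illustrative only: the column names (`duration_seconds`, `answers_selected`, `flag`) and the toy rows are assumptions made for this example, not the fields Kaggle actually used.

```python
import pandas as pd

# Toy responses; all column names and values are hypothetical
raw = pd.DataFrame({
    'duration_seconds': [95, 480, 1200, 300, 130],
    'answers_selected': [30, 12, 45, 20, 15],
    'flag': [None, None, 'spam', None, None],
})

# Keep responses that are unflagged, took at least 2 minutes,
# and selected at least 15 answer choices
usable = raw[
    raw['flag'].isna()
    & (raw['duration_seconds'] >= 120)
    & (raw['answers_selected'] >= 15)
]

print(usable.index.tolist())
```

Only the last two toy rows pass all three conditions; each of the others fails exactly one rule.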
    
To protect the respondents’ privacy, free-form text responses were not included in the public survey data set, and the order of the rows was shuffled (responses are not displayed in chronological order). Likewise, if a country or territory had fewer than 50 respondents, its responses were placed in a group named “other” for the sake of anonymity.
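
The country grouping rule can be sketched on toy data. The country names and counts below are made up, and the approach shown (`value_counts` plus `Series.where`) is one plausible way to apply a 50-respondent threshold, not necessarily how Kaggle implemented it.

```python
import pandas as pd

# Toy country column; values and counts are made up for illustration
countries = pd.Series(['India'] * 60 + ['Japan'] * 55 + ['Iceland'] * 3 + ['Malta'] * 2)

MIN_RESPONDENTS = 50  # threshold stated in the survey methodology

# Count respondents per country and find those below the threshold
counts = countries.value_counts()
small = counts[counts < MIN_RESPONDENTS].index

# Replace under-represented countries with a single "Other" label
anonymized = countries.where(~countries.isin(small), 'Other')

print(anonymized.value_counts().to_dict())
```

This also explains why “Other” can rank high in country totals: it pools many small groups into one label.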
    
An invitation to participate in the survey was sent to the entire Kaggle community (anyone opted in to the Kaggle email list). The survey was also promoted on the Kaggle website (via both banners and popups) as well as on the Kaggle Twitter channel. The survey was live from September 1, 2021, to October 4, 2021.

# **Analysis of Survey Results**<a name='analysis'></a>

## **Geography**<a name='geo'></a>

In [None]:
# Filter to see only country results

df_all = df[['Q3']]

In [None]:
# Group and aggregate data

df_all_grouped = df_all.groupby(['Q3'])['Q3'].count().reset_index(name='Count')

In [None]:
# Add columns

df_all_grouped['Total'] = df_all['Q3'].count()

df_all_grouped['Percentage'] = (df_all_grouped['Count'] / df_all_grouped['Total']) * 100

# Rename column for readability

df_all_grouped = df_all_grouped.rename(columns={'Q3': 'Country'})

# Sort columns

df_all_grouped.sort_values(by='Count', ascending=False, inplace=True)

In [None]:
# Format numbers

df_all_grouped['Percentage'] = df_all_grouped['Percentage'].map('{:,.2f}%'.format)
df_all_grouped['Count'] = df_all_grouped['Count'].map('{:,}'.format)
df_all_grouped['Total'] = df_all_grouped['Total'].map('{:,}'.format)

This analysis focuses solely on respondents in the United States, who account for 2,650 (or 10.2%) of the 25,973 total respondents to the 2021 survey. The United States had the second-highest volume of Kaggle users who responded to the survey. India occupied the top spot, and the United States was followed by “Other,” Japan, and China. The chart below shows data for the top 10 countries included in the survey.
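
As a compact alternative to building separate count, total, and percentage columns, `value_counts(normalize=True)` returns shares directly. The sketch below uses a made-up country column, then checks the U.S. share arithmetic using the 2,650 U.S. respondents and the 25,973 usable responses reported in the Survey Methodology section.

```python
import pandas as pd

# Toy country column; the rows are made up for illustration
toy_countries = pd.Series(['India'] * 5 + ['United States of America'] * 3 + ['Japan'] * 2)

# normalize=True returns each country's fraction of all rows
toy_shares = (toy_countries.value_counts(normalize=True) * 100).round(1)
print(toy_shares)

# Arithmetic check of the U.S. share using the counts reported in this notebook
us_share = round(2650 / 25973 * 100, 1)
print(f'U.S. share: {us_share}%')
```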

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_all_grouped.head(10),
                            x_axis_data='Count',
                            y_axis_data='Country',
                            number_label_text='Percentage',
                            x_axis_range=[0, 8000],
                            title='Top 10 Countries/Regions Included in the 2021 Kaggle Machine Learning & Data Science Survey',
                            x_axis_title='Total',
                            y_axis_title='Country')

## **Job Titles**<a name='titles'></a>

In [None]:
# Filter data for country and job titles

df_title = df[['Q3', 'Q5']]

In [None]:
# Filter data to see only results for United States

df_title_us = df_title[df_title['Q3'] == 'United States of America']

In [None]:
# Group and aggregate

df_title_grouped = df_title_us.groupby(['Q3', 'Q5'])['Q5'].count().reset_index(name='Count')

# Rename columns for readability

df_title_grouped = df_title_grouped.rename(columns={'Q3': 'Country',
                                                    'Q5': 'Job Title'})

# Sort data

df_title_grouped.sort_values(by='Count', ascending=False, inplace=True)

In [None]:
# Add columns

df_title_grouped['U.S. Total'] = df_title_us['Q3'].count()
df_title_grouped['Percentage'] = (df_title_grouped['Count'] / df_title_grouped['U.S. Total']) * 100

# Sort

df_title_grouped.sort_values(by='Job Title', ascending=False, inplace=True)

In [None]:
# Change number formats

df_title_grouped['Count'] = df_title_grouped['Count'].map('{:,}'.format)
df_title_grouped['U.S. Total'] = df_title_grouped['U.S. Total'].map('{:,}'.format)
df_title_grouped['Percentage'] = df_title_grouped['Percentage'].map('{:,.2f}%'.format)

Question 5 asked participants to “select the title most similar to your current role (or most recent title if retired),” with the list of titles as follows:
* Business Analyst
* Data Analyst
* Data Engineer
* Data Scientist
* DBA/Database Engineer
* Machine Learning Engineer
* Product Manager
* Program/Project Manager
* Research Scientist
* Software Engineer
* Statistician
* Student
* Currently not employed
* Other

Of the 2,650 Kaggle users from the United States who responded to the survey, the largest group identified themselves as students (452), followed by data scientists (441), “other” (372), data analysts (258), and software engineers (233). The chart below shows the full itemization of responses.
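
To put these counts in proportion, the short sketch below converts the five figures quoted above into percentages of the 2,650 U.S. respondents. Only the numbers come from the text; the variable names are chosen for this example.

```python
# Counts quoted in the paragraph above (U.S. respondents only)
us_total = 2650
top_titles = {
    'Student': 452,
    'Data Scientist': 441,
    'Other': 372,
    'Data Analyst': 258,
    'Software Engineer': 233,
}

# Convert each count to a percentage of all U.S. respondents
shares = {title: round(count / us_total * 100, 1) for title, count in top_titles.items()}

for title, pct in shares.items():
    print(f'{title}: {pct}%')
```

No single title comes close to half of the U.S. respondents, and the two largest groups are separated by less than one percentage point.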


It would be interesting to know what the “other” category includes, but because free-form text responses were not provided, it was not possible to examine what other job titles professionals may be currently using. These titles could be more specific or more general than the job titles provided. Several possibilities come to mind, such as the more general “analyst” or even “developer.” Other more specific titles could include “mathematician,” “business intelligence analyst,” “business intelligence specialist,” or “ETL developer.” Other survey participants may also be in academia and could have titles geared more toward their specialist fields rather than toward the very broad field of data science. The title “research scientist” may not quite encompass these variations.


As the abbreviation “DBA” is not spelled out completely in the title “DBA/database engineer,” this title may have been misunderstood or misread completely if participants skimmed over it while reading through the survey. “Database administrator” is the generally accepted term for “DBA,” but the “A” could also be interpreted to mean “architect” or “analyst.” Additional feedback from respondents would be needed to fully clarify and examine the participants’ responses.

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_title_grouped,
                            x_axis_data='Count',
                            y_axis_data='Job Title',
                            number_label_text='Percentage',
                            x_axis_range=[0, 500],
                            title='U.S. 2021 Kaggle Machine Learning & Data Science Survey Participants by Job Title',
                            x_axis_title='Total',
                            y_axis_title='Job Title')

## **Job Responsibilities**<a name='duties'></a>

In [None]:
# Filter data for country, job title, and relevant questions

df_duties = df[['Q3', 'Q5', 'Q24_Part_1', 'Q24_Part_2', 'Q24_Part_3',
                'Q24_Part_4', 'Q24_Part_5', 'Q24_Part_6', 'Q24_Part_7', 'Q24_OTHER']]

In [None]:
# Filter data to see only results for United States

df_duties_us = df_duties[df_duties['Q3'] == 'United States of America']

In [None]:
# Drop rows for students and those not employed

index_duties_1 = df_duties_us[df_duties_us['Q5'] == 'Student'].index
index_duties_2 = df_duties_us[df_duties_us['Q5'] == 'Currently not employed'].index

df_duties_us_dropped = df_duties_us.drop(index_duties_1)
df_duties_us_dropped = df_duties_us_dropped.drop(index_duties_2)

In [None]:
# Add total column

#df_duties_us_dropped['Total'] = df_duties_us['Q5'].count()

In [None]:
# Rename columns for readability

df_duties_us_renamed = df_duties_us_dropped.rename(columns={'Q5': 'Job Title',
                                                            'Q24_Part_1': 'Analyze and understand data',
                                                            'Q24_Part_2': 'Build and/or run data infrastructure',
                                                            'Q24_Part_3': 'Build prototypes',
                                                            'Q24_Part_4': 'Build and/or run a machine learning service',
                                                            'Q24_Part_5': 'Improve existing machine learning models',
                                                            'Q24_Part_6': 'Do research',
                                                            'Q24_Part_7': 'None',
                                                            'Q24_OTHER': 'Other'})

In [None]:
# Group and aggregate

df_duties_us_grouped = df_duties_us_renamed.groupby(['Job Title']).agg({i: 'count' for i in df_duties_us_renamed.columns[2:]})

In [None]:
# Check rows

#print(df_duties_us.shape)
#print(df_duties_us_dropped.shape)
#print(df_duties_us_renamed.shape)
#print(df_duties_us_grouped.shape)

In [None]:
# Create pivot table

df_duties_grouped_pivot = pd.pivot_table(
    df_duties_us_grouped,
    #values=
    index='Job Title',
    #columns=
    aggfunc='sum',
    fill_value=0,
    margins=False
)

For the purposes of the analysis in this section and going forward, the responses for “student” and “currently not employed” were omitted, as these groups do not have job titles to analyze.
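
The same exclusion can be written as a single boolean filter with `isin`. This is a minimal sketch on made-up rows; the names `toy` and `excluded_titles` are introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame mirroring the Q5 (job title) column; rows are made up
toy = pd.DataFrame({
    'Q5': ['Student', 'Data Scientist', 'Currently not employed', 'Data Analyst', 'Student'],
})

excluded_titles = ['Student', 'Currently not employed']

# Keep only respondents who reported a job title
employed = toy[~toy['Q5'].isin(excluded_titles)]

print(employed['Q5'].tolist())
```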

Question 24 asked respondents the following:


> Select any activities that make up an important part of your role at work: (Select all that apply)
> * Analyze and understand data to influence product or business decisions [Part 1]
> * Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data [Part 2]
> * Build prototypes to explore applying machine learning to new areas [Part 3]
> * Build and/or run a machine learning service that operationally improves my product or workflows [Part 4]
> * Experimentation and iteration to improve existing ML models [Part 5]
> * Do research that advances the state of the art of machine learning [Part 6]
> * None of these activities are an important part of my role at work [Part 7]
> * Other [Other]

For all job titles, the option of “analyze and understand data to influence product or business decisions” had the highest volume of responses. For each of the six listed activities (Parts 1 through 6), data scientists had the most responses; with 441 participants, they were also the largest group of employed respondents.
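
The per-title tallies rely on how “select all that apply” answers appear to be stored: one column per option, holding the option text when selected and a missing value otherwise. Under that assumption, `count()` works as a tally because it ignores missing values. The sketch below uses made-up rows and shortened option text.

```python
import pandas as pd

# Toy multi-select data: option text when selected, missing otherwise
toy = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist', 'Data Analyst'],
    'Analyze and understand data': ['Analyze...', 'Analyze...', 'Analyze...'],
    'Do research': ['Do research...', None, None],
})

# count() ignores missing values, so it tallies selections per title
tally = toy.groupby('Job Title').count()

print(tally)
```

Because each respondent can select several options, the column totals can add up to more than the number of respondents.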

The heat map below provides a visual representation of work activities by job title.

In [None]:
# Create heat map

create_heat_map(dataframe=df_duties_us_grouped,
                x_axis_title='Activity',
                y_axis_title='Job Title',
                color_bar='Count',
                x_axis_tick_angle=45)

In [None]:
# Question 24, Part 1

# Filter data and group

df_24q1 = df_duties_us[['Q5', 'Q24_Part_1']]
df_24q1_grouped = df_24q1.groupby('Q5')['Q24_Part_1'].value_counts().reset_index(name='Count')

# Add columns

df_24q1_grouped['Part 1 Total'] = df_24q1['Q24_Part_1'].count()
df_24q1_grouped['U.S. Total'] = df_24q1['Q5'].count()
df_24q1_grouped['Percentage'] = (df_24q1_grouped['Count'] / df_24q1_grouped['Part 1 Total']) * 100

# Sort

df_24q1_grouped.sort_values(by='Q5', ascending=False, inplace=True)

# Change number formats

df_24q1_grouped['Percentage'] = df_24q1_grouped['Percentage'].map('{:,.2f}%'.format)
df_24q1_grouped['Count'] = df_24q1_grouped['Count'].map('{:,}'.format)
df_24q1_grouped['Part 1 Total'] = df_24q1_grouped['Part 1 Total'].map('{:,}'.format)
df_24q1_grouped['U.S. Total'] = df_24q1_grouped['U.S. Total'].map('{:,}'.format)

A total of 1,248 participants responded to Question 24, Part 1, which asked whether professionals regularly “analyze and understand data to influence product or business decisions” at work. Of this total, approximately 29% (362) were data scientists, the highest volume for this question. Data analysts also accounted for a significant share of responses at 15.9%. The chart below provides a clearer depiction of the responses.
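
The same filter, group, and percentage steps are repeated for every part of Question 24, so they could be wrapped in one function. The helper below, `summarize_activity`, is hypothetical and not part of the original notebook; it is a sketch of one way to do this, shown on made-up rows.

```python
import pandas as pd

def summarize_activity(frame, title_col, activity_col):
    """Count selections of one activity per job title and add shares.

    Hypothetical helper: not part of the original notebook.
    """
    # Rows where the activity was selected (missing means not selected)
    selected = frame.dropna(subset=[activity_col])
    summary = (selected.groupby(title_col)
                       .size()
                       .reset_index(name='Count'))
    # Share of all respondents who selected this activity
    summary['Percentage'] = summary['Count'] / summary['Count'].sum() * 100
    return summary.sort_values('Count', ascending=False).reset_index(drop=True)

# Toy rows standing in for the U.S. subset
toy = pd.DataFrame({
    'Q5': ['Data Scientist', 'Data Scientist', 'Data Analyst', 'Student', 'Data Scientist'],
    'Q24_Part_1': ['Analyze', 'Analyze', 'Analyze', None, 'Analyze'],
})

result = summarize_activity(toy, 'Q5', 'Q24_Part_1')
print(result)
```

Keeping the counts numeric until the final display step also makes the result easier to reuse, for example when comparing shares across parts.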

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_24q1_grouped,
                            x_axis_data='Count',
                            y_axis_data='Q5',
                            number_label_text='Percentage',
                            x_axis_range=[0, 400],
                            title='U.S. Survey Participants Who Analyze and Understand Data to Influence Product or Business Decisions',
                            x_axis_title='Total',
                            y_axis_title='Job Title')

In [None]:
# Question 24, Part 2

# Filter data and group

df_24q2 = df_duties_us[['Q5', 'Q24_Part_2']]
df_24q2_grouped = df_24q2.groupby('Q5')['Q24_Part_2'].value_counts().reset_index(name='Count')

# Add columns

df_24q2_grouped['Part 2 Total'] = df_24q2['Q24_Part_2'].count()
df_24q2_grouped['U.S. Total'] = df_24q2['Q5'].count()
df_24q2_grouped['Percentage'] = (df_24q2_grouped['Count'] / df_24q2_grouped['Part 2 Total']) * 100

# Sort

df_24q2_grouped.sort_values(by='Q5', ascending=False, inplace=True)

# Change number formats

df_24q2_grouped['Percentage'] = df_24q2_grouped['Percentage'].map('{:,.2f}%'.format)
df_24q2_grouped['Count'] = df_24q2_grouped['Count'].map('{:,}'.format)
df_24q2_grouped['U.S. Total'] = df_24q2_grouped['U.S. Total'].map('{:,}'.format)

A total of 696 participants responded to Question 24, Part 2, which examined whether professionals “build and/or run the data infrastructure that [their] business uses for storing, analyzing, and operationalizing data.” Of this total, the largest volume again belonged to data scientists, with 190 (27.3%) of the responses. Interestingly, software engineers were ahead of data engineers in this category, with 77 responses compared with 63, even though this activity would seem most relevant to data engineers. This points again to the varied job descriptions and responsibilities of those in the data science field. The following chart shows the overall breakdown for this question.

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_24q2_grouped,
                            x_axis_data='Count',
                            y_axis_data='Q5',
                            number_label_text='Percentage',
                            x_axis_range=[0, 200],
                            title='U.S. Survey Participants Who Build and/or Run Business Data Infrastructure',
                            x_axis_title='Total',
                            y_axis_title='Job Title')

In [None]:
# Question 24, Part 3

# Filter data and group

df_24q3 = df_duties_us[['Q5', 'Q24_Part_3']]
df_24q3_grouped = df_24q3.groupby('Q5')['Q24_Part_3'].value_counts().reset_index(name='Count')

# Add columns

df_24q3_grouped['Part 3 Total'] = df_24q3['Q24_Part_3'].count()
df_24q3_grouped['U.S. Total'] = df_24q3['Q5'].count()
df_24q3_grouped['Percentage'] = (df_24q3_grouped['Count'] / df_24q3_grouped['Part 3 Total']) * 100

# Sort

df_24q3_grouped.sort_values(by='Q5', ascending=False, inplace=True)

# Change number formats

df_24q3_grouped['Percentage'] = df_24q3_grouped['Percentage'].map('{:,.2f}%'.format)
df_24q3_grouped['Count'] = df_24q3_grouped['Count'].map('{:,}'.format)
df_24q3_grouped['U.S. Total'] = df_24q3_grouped['U.S. Total'].map('{:,}'.format)

A total of 774 participants responded to Question 24, Part 3, which asked if respondents “build prototypes to explore applying machine learning to new areas” as an important part of their roles at work. Of this total, the largest volume belonged to data scientists, with 299 (38.6%) of the responses. Only approximately 10% were machine learning engineers; the number of research scientists was actually slightly higher for this question. The chart below illustrates the responses in detail.

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_24q3_grouped,
                            x_axis_data='Count',
                            y_axis_data='Q5',
                            number_label_text='Percentage',
                            x_axis_range=[0, 350],
                            title='U.S. Survey Participants Who Build Prototypes to Explore Applying Machine Learning to New Areas',
                            x_axis_title='Total',
                            y_axis_title='Job Title')

In [None]:
# Question 24, Part 4

# Filter data and group

df_24q4 = df_duties_us[['Q5', 'Q24_Part_4']]
df_24q4_grouped = df_24q4.groupby('Q5')['Q24_Part_4'].value_counts().reset_index(name='Count')

# Add columns

df_24q4_grouped['Part 4 Total'] = df_24q4['Q24_Part_4'].count()
df_24q4_grouped['U.S. Total'] = df_24q4['Q5'].count()
df_24q4_grouped['Percentage'] = (df_24q4_grouped['Count'] / df_24q4_grouped['Part 4 Total']) * 100

# Sort

df_24q4_grouped.sort_values(by='Q5', ascending=False, inplace=True)

# Change number formats

df_24q4_grouped['Percentage'] = df_24q4_grouped['Percentage'].map('{:,.2f}%'.format)
df_24q4_grouped['Count'] = df_24q4_grouped['Count'].map('{:,}'.format)
df_24q4_grouped['U.S. Total'] = df_24q4_grouped['U.S. Total'].map('{:,}'.format)

A total of 493 participants responded to Question 24, Part 4, which examined whether participants “build and/or run a machine learning service that operationally improves [their] product or workflows” in their companies. Data scientists again formed the largest group, accounting for approximately 40% of responses. The percentage for machine learning engineers was slightly higher for this question than for Part 3, at approximately 12%. Please refer to the chart below for further details.

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_24q4_grouped,
                            x_axis_data='Count',
                            y_axis_data='Q5',
                            number_label_text='Percentage',
                            x_axis_range=[0, 200],
                            title='U.S. Survey Participants Who Build and/or Run a Machine Learning Service',
                            x_axis_title='Total',
                            y_axis_title='Job Title')

In [None]:
# Question 24, Part 5

# Filter data and group

df_24q5 = df_duties_us[['Q5', 'Q24_Part_5']]
df_24q5_grouped = df_24q5.groupby('Q5')['Q24_Part_5'].value_counts().reset_index(name='Count')

# Add columns

df_24q5_grouped['Part 5 Total'] = df_24q5['Q24_Part_5'].count()
df_24q5_grouped['U.S. Total'] = df_24q5['Q5'].count()
df_24q5_grouped['Percentage'] = (df_24q5_grouped['Count'] / df_24q5_grouped['Part 5 Total']) * 100

# Sort

df_24q5_grouped.sort_values(by='Q5', ascending=False, inplace=True)

# Change number formats

df_24q5_grouped['Percentage'] = df_24q5_grouped['Percentage'].map('{:,.2f}%'.format)
df_24q5_grouped['Count'] = df_24q5_grouped['Count'].map('{:,}'.format)
df_24q5_grouped['U.S. Total'] = df_24q5_grouped['U.S. Total'].map('{:,}'.format)

A total of 590 participants responded to Question 24, Part 5, which asked if respondents perform “experimentation and iteration to improve existing ML [machine learning] models” as part of their typical job responsibilities. Approximately 40% of the responses came from data scientists, while, surprisingly, only 12.5% came from machine learning engineers. Approximately 11% of the total responses came from research scientists. The chart below illustrates the responses in greater detail.

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_24q5_grouped,
                            x_axis_data='Count',
                            y_axis_data='Q5',
                            number_label_text='Percentage',
                            x_axis_range=[0, 250],
                            title='U.S. Survey Participants Who Improve Existing Machine Learning Models',
                            x_axis_title='Total',
                            y_axis_title='Job Title')

In [None]:
# Question 24, Part 6

# Filter data and group

df_24q6 = df_duties_us[['Q5', 'Q24_Part_6']]
df_24q6_grouped = df_24q6.groupby('Q5')['Q24_Part_6'].value_counts().reset_index(name='Count')

# Add columns

df_24q6_grouped['Part 6 Total'] = df_24q6['Q24_Part_6'].count()
df_24q6_grouped['U.S. Total'] = df_24q6['Q5'].count()
df_24q6_grouped['Percentage'] = (df_24q6_grouped['Count'] / df_24q6_grouped['Part 6 Total']) * 100

# Sort

df_24q6_grouped.sort_values(by='Q5', ascending=False, inplace=True)

# Change number formats

df_24q6_grouped['Percentage'] = df_24q6_grouped['Percentage'].map('{:,.2f}%'.format)
df_24q6_grouped['Count'] = df_24q6_grouped['Count'].map('{:,}'.format)
df_24q6_grouped['U.S. Total'] = df_24q6_grouped['U.S. Total'].map('{:,}'.format)

A total of 375 participants responded to Question 24, Part 6, which asked if participants “do research that advances the state of the art of machine learning” as an important part of their roles at work. Approximately 33% of the total responses came from data scientists, who again formed the largest group for this question. A significant number of responses (17.6% of the total) came from research scientists, which is not greatly surprising, as “research” is part of the question. The chart below shows the itemization of answers.

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_24q6_grouped,
                            x_axis_data='Count',
                            y_axis_data='Q5',
                            number_label_text='Percentage',
                            x_axis_range=[0, 140],
                            title='U.S. Survey Participants Who Do Research That Advances the State of the Art of Machine Learning',
                            x_axis_title='Total',
                            y_axis_title='Job Title')

In [None]:
# Question 24, Part 7

# Filter data and group

df_24q7 = df_duties_us[['Q5', 'Q24_Part_7']]
df_24q7_grouped = df_24q7.groupby('Q5')['Q24_Part_7'].value_counts().reset_index(name='Count')

# Add columns

df_24q7_grouped['Part 7 Total'] = df_24q7['Q24_Part_7'].count()
df_24q7_grouped['U.S. Total'] = df_24q7['Q5'].count()
df_24q7_grouped['Percentage'] = (df_24q7_grouped['Count'] / df_24q7_grouped['Part 7 Total']) * 100

# Sort

df_24q7_grouped.sort_values(by='Q5', ascending=False, inplace=True)

# Change number formats

df_24q7_grouped['Percentage'] = df_24q7_grouped['Percentage'].map('{:,.2f}%'.format)
df_24q7_grouped['Count'] = df_24q7_grouped['Count'].map('{:,}'.format)
df_24q7_grouped['U.S. Total'] = df_24q7_grouped['U.S. Total'].map('{:,}'.format)

The results for Question 24, Part 7, are interesting. Participants who selected this option indicated that “none of these activities are an important part of my role at work.” This option received a total of 288 responses (approximately 11% of U.S. survey respondents). The category of “other” had the highest volume at almost half of all responses, followed by the category of software engineers at approximately 20%. Because the survey did not allow participants to fill in additional information, it is not clear what other activities these professionals do at work on a day-to-day basis. The chart below provides further detail.

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_24q7_grouped,
                            x_axis_data='Count',
                            y_axis_data='Q5',
                            number_label_text='Percentage',
                            x_axis_range=[0, 140],
                            title='U.S. Survey Participants Who Report That None of the Activities Listed Are an Important Part of Their Work',
                            x_axis_title='Total',
                            y_axis_title='Job Title')

In [None]:
# Question 24, OTHER

# Filter data and group

df_24qother = df_duties_us[['Q5', 'Q24_OTHER']]
df_24qother_grouped = df_24qother.groupby('Q5')['Q24_OTHER'].value_counts().reset_index(name='Count')

# Add columns

df_24qother_grouped['Other Total'] = df_24qother['Q24_OTHER'].count()
df_24qother_grouped['U.S. Total'] = df_24qother['Q5'].count()
df_24qother_grouped['Percentage'] = (df_24qother_grouped['Count'] / df_24qother_grouped['Other Total']) * 100

# Sort

df_24qother_grouped.sort_values(by='Q5', ascending=False, inplace=True)

# Change number formats

df_24qother_grouped['Percentage'] = df_24qother_grouped['Percentage'].map('{:,.2f}%'.format)
df_24qother_grouped['Count'] = df_24qother_grouped['Count'].map('{:,}'.format)
df_24qother_grouped['U.S. Total'] = df_24qother_grouped['U.S. Total'].map('{:,}'.format)

As with the results for Part 7 discussed in the previous section, the results for Question 24, OTHER, are interesting. This option allowed respondents to simply mark “other” to indicate that activities not listed in the survey occupy a significant portion of their work. It received a total of 60 responses. The largest share (approximately 33%) came from the “other” job category, but data analysts (at 17%) also reported that their work consists of duties that were not listed in the survey. Data scientists, software engineers, and statisticians each accounted for 8.3% of the total responses. Again, it would be fascinating to know exactly which job activities the survey omitted that these professionals perform in their everyday work. The following chart shows the complete breakdown of responses.

In [None]:
# Create chart

create_horizontal_bar_chart(dataframe=df_24qother_grouped,
                            x_axis_data='Count',
                            y_axis_data='Q5',
                            number_label_text='Percentage',
                            x_axis_range=[0, 20],
                            title='U.S. Survey Participants Who Report That Other Activities Not Listed Are an Important Part of Their Work',
                            x_axis_title='Total',
                            y_axis_title='Job Title')
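
The cells for Question 24 repeat the same steps for each part: count the answers per job title, then express each count as a share of all answers to that part. As a minimal sketch, those steps could be folded into one helper. The function name `summarize_part` and the toy dataframe are hypothetical, and the sketch assumes each part column holds either a single answer string or a missing value.

```python
import pandas as pd


def summarize_part(dataframe, title_col, part_col):
    """Count non-missing answers in part_col per job title and add each title's share."""
    counts = (dataframe.dropna(subset=[part_col])
              .groupby(title_col)
              .size()
              .reset_index(name='Count'))
    # Share of all answers to this part, as a number (format it only for display)
    counts['Percentage'] = counts['Count'] / counts['Count'].sum() * 100
    return counts.sort_values(by=title_col, ascending=False, ignore_index=True)


# Hypothetical rows, not survey data
toy = pd.DataFrame({
    'Q5': ['Data Scientist', 'Data Scientist', 'Other', 'Statistician', 'Other'],
    'Q24_OTHER': ['Other', None, 'Other', 'Other', 'Other'],
})

summary = summarize_part(toy, 'Q5', 'Q24_OTHER')
print(summary)
```

Keeping `Count` and `Percentage` numeric until the chart is drawn avoids turning them into strings, which can otherwise affect how an axis orders or scales the values.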

## **Technology on the Job**<a name='tech'></a>

### Programming Languages<a name='programming'></a>

In [None]:
# Create dataframe with relevant questions

df_pgmg = df[['Q3', 'Q5', 'Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5',
              'Q7_Part_6', 'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10', 'Q7_Part_11',
              'Q7_Part_12', 'Q7_OTHER']]

In [None]:
# Filter data to see only results for United States

df_pgmg_us = df_pgmg[df_pgmg['Q3'] == 'United States of America']

# Rename columns for readability

df_pgmg_us = df_pgmg_us.rename(columns={'Q5': 'Job Title',
                                        'Q7_Part_1': 'Python',
                                        'Q7_Part_2': 'R',
                                        'Q7_Part_3': 'SQL',
                                        'Q7_Part_4': 'C',
                                        'Q7_Part_5': 'C++',
                                        'Q7_Part_6': 'Java',
                                        'Q7_Part_7': 'JavaScript',
                                        'Q7_Part_8': 'Julia',
                                        'Q7_Part_9': 'Swift',
                                        'Q7_Part_10': 'Bash',
                                        'Q7_Part_11': 'MATLAB',
                                        'Q7_Part_12': 'None',
                                        'Q7_OTHER': 'Other'})

In [None]:
# Drop rows for students and those not employed

index_pgmg_1 = df_pgmg_us[df_pgmg_us['Job Title'] == 'Student'].index
index_pgmg_2 = df_pgmg_us[df_pgmg_us['Job Title'] == 'Currently not employed'].index

df_pgmg_us.drop(index_pgmg_1, inplace = True)
df_pgmg_us.drop(index_pgmg_2, inplace = True)

In [None]:
# Add total column

#df_pgmg_us['Total'] = df_pgmg_us['Job Title'].count()

In [None]:
# Group and aggregate

df_pgmg_us_grouped = df_pgmg_us.groupby(['Job Title']).agg({i: 'count' for i in df_pgmg_us.columns[2:]})

In [None]:
# Create pivot table

df_pgmg_us_grouped_pivot = pd.pivot_table(
    df_pgmg_us_grouped,
    #values=
    index='Job Title',
    #columns=
    aggfunc='sum',
    fill_value=0,
    margins=False,
)

Question 7 of the survey asked which programming languages participants use on a regular basis. Python, SQL, R, Java, and JavaScript were the top 5, with the majority of respondents using Python regularly. However, the data shows that most statisticians favor R in their day-to-day work.

The heat map below illustrates the most popular programming languages for Kaggle users.

In [None]:
# Create heat map

create_heat_map(dataframe=df_pgmg_us_grouped,
                x_axis_title='Language',
                y_axis_title='Job Title',
                color_bar='Count',
                x_axis_tick_angle=45)
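
The observations about the most-used languages can be cross-checked directly from a table of counts. The sketch below uses hypothetical numbers in the same shape as `df_pgmg_us_grouped` (job titles as rows, languages as columns); it is an illustration, not the survey data.

```python
import pandas as pd

# Hypothetical counts, not the survey figures
toy_counts = pd.DataFrame(
    {'Python': [40, 5, 30], 'R': [10, 12, 2], 'SQL': [25, 6, 20]},
    index=pd.Index(['Data Scientist', 'Statistician', 'Software Engineer'], name='Job Title'),
)

# Column label holding the largest count in each row
top_language = toy_counts.idxmax(axis=1)

# Languages ranked by total count across all job titles
overall_rank = toy_counts.sum().sort_values(ascending=False)

print(top_language)
print(overall_rank)
```

Applied to the real grouped dataframe, `idxmax(axis=1)` would list the single most-selected language for each job title.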

### Cloud Computing Platforms<a name='cloud'></a>

In [None]:
df_cc = df[['Q3', 'Q5', 'Q27_A_Part_1', 'Q27_A_Part_2', 'Q27_A_Part_3',
            'Q27_A_Part_4', 'Q27_A_Part_5', 'Q27_A_Part_6', 'Q27_A_Part_7', 
            'Q27_A_Part_8', 'Q27_A_Part_9', 'Q27_A_Part_10', 'Q27_A_Part_11', 'Q27_A_OTHER']]

In [None]:
# Filter data to see only results for United States

df_cc_us = df_cc[df_cc['Q3'] == 'United States of America']

# Rename columns for readability

df_cc_us = df_cc_us.rename(columns={'Q5': 'Job Title',
                                    'Q27_A_Part_1': 'Amazon Web Services (AWS)',
                                    'Q27_A_Part_2': 'Microsoft Azure',
                                    'Q27_A_Part_3': 'Google Cloud Platform (GCP)',
                                    'Q27_A_Part_4': 'IBM Cloud/Red Hat',
                                    'Q27_A_Part_5': 'Oracle Cloud',
                                    'Q27_A_Part_6': 'SAP Cloud',
                                    'Q27_A_Part_7': 'Salesforce Cloud',
                                    'Q27_A_Part_8': 'VMware Cloud',
                                    'Q27_A_Part_9': 'Alibaba Cloud',
                                    'Q27_A_Part_10': 'Tencent Cloud',
                                    'Q27_A_Part_11': 'None',
                                    'Q27_A_OTHER': 'Other'})

In [None]:
# Drop rows for students and those not employed

index_cc_1 = df_cc_us[df_cc_us['Job Title'] == 'Student'].index
index_cc_2 = df_cc_us[df_cc_us['Job Title'] == 'Currently not employed'].index

df_cc_us.drop(index_cc_1, inplace = True)
df_cc_us.drop(index_cc_2, inplace = True)

#df_cc_us.shape

In [None]:
# Add total column

#df_cc_us['Total'] = df_cc_us['Job Title'].count()

In [None]:
# Group and aggregate

df_cc_us_grouped = df_cc_us.groupby(['Job Title']).agg({i: 'count' for i in df_cc_us.columns[2:]})

In [None]:
# Create pivot table

df_cc_us_grouped_pivot = pd.pivot_table(
    df_cc_us_grouped,
    #values='Job Title',
    index='Job Title',
    #columns=
    aggfunc='sum',
    fill_value=0,
    margins=False,
    sort=False
)

# Sort columns in pivot table

df_cc_us_grouped_pivot_sorted = df_cc_us_grouped_pivot.reindex(['Amazon Web Services (AWS)',
                                                               'Microsoft Azure',
                                                               'Google Cloud Platform (GCP)',
                                                               'IBM Cloud/Red Hat',
                                                               'Oracle Cloud',
                                                               'SAP Cloud',
                                                               'Salesforce Cloud',
                                                               'VMware Cloud',
                                                               'Alibaba Cloud',
                                                               'Tencent Cloud',
                                                               'None',
                                                               'Other'],
                                                                axis=1)

Question 27-A of the survey asked which cloud computing platforms respondents use on a regular basis. The top 3 are clearly Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, although 91 participants in the “other” job category (approximately 4.5% of the 2,001 respondents in total) indicated that they do not use a cloud computing platform.


The heat map below illustrates the most popular cloud computing platforms.

In [None]:
# Create heat map

create_heat_map(dataframe=df_cc_us_grouped_pivot_sorted,
                x_axis_title='Cloud Platform',
                y_axis_title='Job Title',
                color_bar='Count',
                x_axis_tick_angle=45)

## **Educational Background**<a name='education'></a>

In [None]:
# Filter data to see only country names and educational background

df_edu = df[['Q3', 'Q4', 'Q5']]

In [None]:
# Filter data to see only results for United States

df_edu_us = df_edu[df_edu['Q3'] == 'United States of America']

# Rename columns for readability

df_edu_us = df_edu_us.rename(columns={'Q3': 'Country',
                                      'Q4': 'Education Level',
                                      'Q5': 'Job Title'})

In [None]:
# Group and aggregate

df_edu_us_grouped = df_edu_us.groupby(['Job Title', 'Education Level'])['Education Level'].count().reset_index(name='Count')
    
df_edu_us_sorted = df_edu_us_grouped.sort_values(by='Job Title', ascending=True)

In [None]:
# Drop rows for students and those not employed

index_edu_1 = df_edu_us_sorted[df_edu_us_sorted['Job Title'] == 'Student'].index
index_edu_2 = df_edu_us_sorted[df_edu_us_sorted['Job Title'] == 'Currently not employed'].index

df_edu_us_sorted.drop(index_edu_1, inplace = True)
df_edu_us_sorted.drop(index_edu_2, inplace = True)

In [None]:
# Create pivot table

df_edu_grouped_pivot = pd.pivot_table(
    df_edu_us_sorted,
    values='Count',
    index='Job Title',
    columns='Education Level',
    aggfunc='sum',
    fill_value=0,
    margins=False
)

A total of 2,001 responses to this question were analyzed (again filtering out students and those who are not currently employed). Nearly half of respondents have master’s degrees (988 [49.4%]). A total of 484 participants (24.2%) have bachelor’s degrees, followed by 367 (18.3%) with doctoral degrees.

Overall, data scientists have the highest number of doctoral degrees and master’s degrees (115 and 248, respectively). Again, it would be helpful to know what job titles are included in the “other” category, as this category shows the highest volume (101) of bachelor’s degrees out of all the job titles listed. Many individuals who identified themselves as being in the “other” category have master’s degrees as well, with a total of 166. Unsurprisingly, research scientists are highly educated, with nearly 60% having doctoral degrees.

It is not readily apparent from the data what differentiates a doctoral degree from a professional doctorate, but this distinction may only be important in academic contexts; it is not clear if this difference is relevant in a real-world setting.

The heat map below shows complete details of education levels by job title.

In [None]:
# Create heat map

create_heat_map(dataframe=df_edu_grouped_pivot,
                x_axis_title='Education Level',
                y_axis_title='Job Title',
                color_bar='Count',
                x_axis_tick_angle=45)
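
Statements such as the share of research scientists who hold doctoral degrees compare counts within a job title rather than across the whole sample. One way to obtain such within-title shares is to divide each row of the pivot table by its own total. The sketch below assumes a table shaped like `df_edu_grouped_pivot`; the counts and the shortened column labels are hypothetical.

```python
import pandas as pd

# Hypothetical counts, not the survey figures
toy_pivot = pd.DataFrame(
    {'Bachelor': [20, 5], 'Master': [50, 15], 'Doctoral': [30, 30]},
    index=pd.Index(['Data Scientist', 'Research Scientist'], name='Job Title'),
)

# Divide every row by its own total so that each row sums to 100
row_shares = toy_pivot.div(toy_pivot.sum(axis=1), axis=0) * 100

print(row_shares.round(1))
```

Passing the row-normalized table to the heat map instead of raw counts would make job titles with few respondents easier to compare with the larger groups.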

## **Age**<a name='age'></a>

In [None]:
# Filter data for country and relevant questions

df_age = df[['Q1', 'Q3', 'Q5']]

# Filter data to see only results for United States

df_age = df_age[df_age['Q3'] == 'United States of America']

In [None]:
# Group and aggregate

df_age_grouped = df_age.groupby(['Q1', 'Q5'])['Q5'].count().reset_index(name='Count')

# Rename columns for readability

df_age_grouped = df_age_grouped.rename(columns={'Q1': 'Age Range', 'Q5': 'Job Title'})

In [None]:
# Drop rows for students and those not employed

index_age_1 = df_age_grouped[df_age_grouped['Job Title'] == 'Student'].index
index_age_2 = df_age_grouped[df_age_grouped['Job Title'] == 'Currently not employed'].index

df_age_grouped.drop(index_age_1, inplace = True)
df_age_grouped.drop(index_age_2, inplace = True)

In [None]:
# Create pivot table

df_age_grouped_pivot = pd.pivot_table(
    df_age_grouped,
    values='Count',
    index='Job Title',
    columns='Age Range',
    aggfunc='sum',
    fill_value=0,
    margins=False,
)

The very first question of the survey asked participants to indicate their age (as a range in years). The heat map below shows that the most common age range for data analysts (52), data engineers (14), data scientists (88), research scientists (30), software engineers (35), and statisticians (5) is 30–34 years. This is in line with the findings reported in the Stack Overflow 2021 Developer Survey, which found that 48.42% of professional developers are 25–34 years old.

The heat map below illustrates the age ranges of respondents.

In [None]:
# Create heat map

create_heat_map(dataframe=df_age_grouped_pivot,
                x_axis_title='Age Range',
                y_axis_title='Job Title',
                color_bar='Count',
                x_axis_tick_angle=0)

## **Gender**<a name='gender'></a>

In [None]:
# Filter data for country and relevant questions

df_gender = df[['Q2', 'Q3', 'Q5']]

In [None]:
# Filter data to see only results for United States

df_gender = df_gender[df_gender['Q3'] == 'United States of America']

In [None]:
# Group and aggregate data

df_gender_grouped = df_gender.groupby(['Q2', 'Q5'])['Q5'].count().reset_index(name='Count')

# Rename columns for readability

df_gender_grouped = df_gender_grouped.rename(columns={'Q2': 'Gender', 'Q5': 'Job Title'})

# Sort data

df_gender_sorted = df_gender_grouped.sort_values(by='Job Title', ascending=True)

In [None]:
# Drop rows for students and those not employed

index_gender_1 = df_gender_sorted[df_gender_sorted['Job Title'] == 'Student'].index
index_gender_2 = df_gender_sorted[df_gender_sorted['Job Title'] == 'Currently not employed'].index

df_gender_sorted.drop(index_gender_1, inplace = True)
df_gender_sorted.drop(index_gender_2, inplace = True)

In [None]:
# Create pivot table

df_gender_grouped_pivot = pd.pivot_table(
    df_gender_sorted,
    values='Count',
    index='Job Title',
    columns='Gender',
    aggfunc='sum',
    fill_value=0,
    margins=False,
)

Another important aspect in defining who does what in the field is gender. As shown in the heat map below, the data science field is overwhelmingly male: 76% of professionals across the job titles listed are men.

Why are there so few women? Duranton et al. (2020) posit that there are a few reasons, including the following:
* The culture seems overly competitive and unappealing.
* Data science is too abstract, with no tangible purpose.
* The field is not inclusive enough.
* Information regarding career opportunities is not communicated well.

The researchers concluded that “companies must work harder to combat the negative perceptions of the field and the lack of tangible information about career paths that both male and female students feel—but that disincentivize women much more strongly.” This will continue to be important as the data science industry continues to grow and companies seek to attract valuable job candidates.

In [None]:
# Create heat map

create_heat_map(dataframe=df_gender_grouped_pivot,
                x_axis_title='Gender',
                y_axis_title='Job Title',
                color_bar='Count',
                x_axis_tick_angle=45)

In [None]:
# Create new dataframe for chart

value_list = ['Man', 'Woman', 'Nonbinary']

boolean_series = df_gender_sorted.Gender.isin(value_list)

df_gender_sorted_chart = df_gender_sorted[boolean_series]

In [None]:
# Create function for bar chart

def create_grouped_bar_chart(dataframe,
                             x_axis_data,
                             y_axis_data,
                             legend_data,
                             number_label_text,
                             title,
                             x_axis_title,
                             y_axis_title,
                             x_axis_range):
    fig = px.bar(dataframe,
                 x=x_axis_data,
                 y=y_axis_data,
                 barmode='stack',
                 color=legend_data,
                 text=number_label_text,
                 height=600,
                 width=1000,
                 color_discrete_sequence=['#BF1363', '#0E79B2', '#191923'],
                 title=title,
                 labels={x_axis_data: x_axis_title, y_axis_data: y_axis_title})
    fig.update_layout(font_family='Raleway',
                      font_color='#191923',
                      title_font_family='Raleway Medium',
                      title_font_color='#191923',
                      title_font_size=18,
                      legend_title_font_color='#191923',
                      plot_bgcolor='white',
                      hoverlabel=dict(bgcolor='#F39237', font_size=14, font_family='Raleway', font_color='white',
                                      bordercolor='#F39237'),
                      legend_yanchor='top',
                      legend_y=1,
                      legend_xanchor='right',
                      legend_x=1)
    fig.update_xaxes(type='linear',
                     range=x_axis_range,
                     tickfont=dict(family='Raleway', color='#191923', size=12),
                     title_font=dict(size=14, family='Raleway Medium', color='#191923'),
                     showgrid=False,
                     showline=True, linewidth=1, linecolor='#191923', mirror=False)
    fig.update_yaxes(tickfont=dict(family='Raleway', color='#191923', size=12),
                     title_font=dict(size=14, family='Raleway Medium', color='#191923'),
                     showgrid=False,
                     showline=True, linewidth=1, linecolor='#191923', mirror=False)
    fig = fig.update_traces(marker_line_width=0,
                            width=0.75)
    fig.show()

The chart below also shows the wide gender disparity in the data science and machine learning community. For improved clarity, the categories of “prefer not to say” and “prefer to self-describe” were removed from the data set for this chart.

In [None]:
# Create chart

create_grouped_bar_chart(df_gender_sorted_chart,
                         x_axis_data='Count',
                         y_axis_data='Job Title',
                         legend_data='Gender',
                         number_label_text='Count',
                         title='Gender Parity by Job Title for U.S. Survey Participants',
                         x_axis_title='Total',
                         y_axis_title='Job Title',
                         x_axis_range=[0, 450])
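
An overall gender share across all job titles can be derived from a pivot table by summing each gender column and dividing by the grand total. The sketch below assumes a table shaped like `df_gender_grouped_pivot`; the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical counts, not the survey figures
toy_gender = pd.DataFrame(
    {'Man': [30, 45], 'Woman': [9, 14], 'Nonbinary': [1, 1]},
    index=pd.Index(['Data Analyst', 'Data Scientist'], name='Job Title'),
)

# Sum each gender column across job titles, then divide by the grand total
gender_totals = toy_gender.sum(axis=0)
gender_share = gender_totals / gender_totals.sum() * 100

print(gender_share.round(1))
```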

## **Salary**<a name='salary'></a>

In [None]:
# Filter data for country and relevant questions

df_salary = df[['Q3', 'Q5', 'Q25']]

In [None]:
# Filter data to see only results for United States

df_salary_us = df_salary[df_salary['Q3'] == 'United States of America']

In [None]:
# Filter data and group

df_salary_new = df_salary_us[['Q5', 'Q25']]

df_salary_grouped = df_salary_new.groupby('Q5')['Q25'].value_counts(dropna=False).reset_index(name='Count')

# Add columns

df_salary_grouped['Question Total'] = df_salary_new['Q25'].count()
df_salary_grouped['U.S. Total'] = df_salary_new['Q5'].count()
df_salary_grouped['Percentage'] = (df_salary_grouped['Question Total'] / df_salary_grouped['U.S. Total']) * 100

# Rename columns for readability

df_salary_grouped = df_salary_grouped.rename(columns={'Q5': 'Job Title',
                                                      'Q25': 'Annual Compensation'})

In [None]:
# Change salary ranges

# Map each original survey bracket to a broader range. Brackets from
# 150,000-199,999 through 300,000-499,999 are not listed because they stay unchanged.
salary_range_groups = {
    '0-49,999': ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999',
                 '5,000-7,499', '7,500-9,999', '10,000-14,999', '15,000-19,999',
                 '20,000-24,999', '25,000-29,999', '30,000-39,999', '40,000-49,999'],
    '50,000-99,999': ['50,000-59,999', '60,000-69,999', '70,000-79,999', '80,000-89,999',
                      '90,000-99,999'],
    '100,000-149,999': ['100,000-124,999', '125,000-149,999'],
    '>500,000': ['$500,000-999,999', '>$1,000,000'],
}

# Invert the groups into a flat {original bracket: broader range} lookup
salary_range_map = {old_range: new_range
                    for new_range, old_ranges in salary_range_groups.items()
                    for old_range in old_ranges}

df_salary_grouped['Annual Compensation'] = df_salary_grouped['Annual Compensation'].replace(salary_range_map)

In [None]:
# Drop rows for students and those not employed

index_salary_1 = df_salary_grouped[df_salary_grouped['Job Title'] == 'Student'].index
index_salary_2 = df_salary_grouped[df_salary_grouped['Job Title'] == 'Currently not employed'].index

df_salary_grouped.drop(index_salary_1, inplace = True)
df_salary_grouped.drop(index_salary_2, inplace = True)

In [None]:
# Sort by salary (add column)

df_salary_grouped['Sort'] = df_salary_grouped['Annual Compensation'].str.extract(r'(\d+)', expand=False).fillna(0).astype(int)

df_salary_us_sorted = df_salary_grouped.sort_values('Sort', ascending=True)

In [None]:
# Create pivot table

df_salary_pivot = pd.pivot_table(
    df_salary_us_sorted,
    values=['Count'],
    index=['Job Title'],
    columns=['Sort', 'Annual Compensation'],
    aggfunc='sum',
    fill_value=0,
    margins=False,
)

In [None]:
# Drop level for heat map

df_salary_pivot.columns = df_salary_pivot.columns.droplevel(0)
df_salary_pivot.columns = df_salary_pivot.columns.droplevel(0)

Question 25 asked survey participants to report their current yearly compensation (in approximate U.S. dollars). For readability and ease of analysis, the salary ranges were regrouped so there were fewer categories to review.

Based on the heat map below, we can see that \$100,000 to \$149,999 is a common annual salary range for many survey respondents, including data engineers, data scientists, machine learning engineers, product managers, program/project managers, software engineers, and statisticians. Overall, however, the largest group (496, or approximately 27% of those who responded to this question) earns between \$50,000 and \$99,999 annually.

Data scientists appear to have the highest salaries, with the greatest totals in the \$150,000–\$199,999, \$200,000–\$249,999, \$250,000–\$299,999, and \$300,000–\$499,999 ranges. The Stack Overflow 2021 Developer Survey reported that data scientists or machine learning specialists in the United States earn a median yearly salary of \$125,000.

In [None]:
# Create heat map

create_heat_map(dataframe=df_salary_pivot,
                x_axis_title='Annual Compensation',
                y_axis_title='Job Title',
                color_bar='Count',
                x_axis_tick_angle=90)
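
The regrouping of salary ranges described above can also be expressed as a rule on each bracket's lower bound instead of a list of labels. The following is a sketch only: the function name `regroup_bracket` is hypothetical, and it assumes every label starts with its lower bound, optionally preceded by `$` or `>`.

```python
import re


def regroup_bracket(label):
    """Map a survey salary bracket such as '60,000-69,999' to a broader range."""
    # Remove thousands separators, then read the first run of digits as the lower bound
    match = re.search(r'\d+', label.replace(',', ''))
    if match is None:
        return label
    lower_bound = int(match.group())
    if lower_bound < 50_000:
        return '0-49,999'
    if lower_bound < 100_000:
        return '50,000-99,999'
    if lower_bound < 150_000:
        return '100,000-149,999'
    if lower_bound >= 500_000:
        return '>500,000'
    return label  # brackets from 150,000 to 499,999 are kept as they are


for bracket in ['$0-999', '60,000-69,999', '125,000-149,999', '150,000-199,999', '>$1,000,000']:
    print(bracket, '->', regroup_bracket(bracket))
```

In a dataframe, the function could be applied with `Series.map`; missing answers would need separate handling (for example, `na_action='ignore'`), because the function expects a string.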

## **Years of Experience**<a name='yoe'></a>

In [None]:
# Filter data for country and relevant questions

df_yoe = df[['Q3', 'Q5', 'Q6']]

In [None]:
# Filter data to see only results for United States

df_yoe_us = df_yoe[df_yoe['Q3'] == 'United States of America']

In [None]:
# Group and aggregate

df_yoe_grouped = df_yoe_us.groupby(['Q5', 'Q6'])['Q5'].count().reset_index(name='Count')

# Rename columns for readability

df_yoe_grouped = df_yoe_grouped.rename(columns={'Q5': 'Job Title', 'Q6': 'Years of Experience'})

In [None]:
# Replace data to sort

df_yoe_grouped['Years of Experience'] = df_yoe_grouped['Years of Experience'].replace('I have never written code', '0')
df_yoe_grouped['Years of Experience'] = df_yoe_grouped['Years of Experience'].replace('1-3 years', '1-3')
df_yoe_grouped['Years of Experience'] = df_yoe_grouped['Years of Experience'].replace('3-5 years', '3-5')
df_yoe_grouped['Years of Experience'] = df_yoe_grouped['Years of Experience'].replace('5-10 years', '5-10')
df_yoe_grouped['Years of Experience'] = df_yoe_grouped['Years of Experience'].replace('10-20 years', '10-20')
df_yoe_grouped['Years of Experience'] = df_yoe_grouped['Years of Experience'].replace('< 1 years', '0-1')
df_yoe_grouped['Years of Experience'] = df_yoe_grouped['Years of Experience'].replace('20+ years', '20+')

In [None]:
# Drop rows for students and those not employed

index_yoe_1 = df_yoe_grouped[df_yoe_grouped['Job Title'] == 'Student'].index
index_yoe_2 = df_yoe_grouped[df_yoe_grouped['Job Title'] == 'Currently not employed'].index

df_yoe_grouped.drop(index_yoe_1, inplace = True)
df_yoe_grouped.drop(index_yoe_2, inplace = True)

In [None]:
# Convert column to string

df_yoe_grouped['Years of Experience'] = df_yoe_grouped['Years of Experience'].astype(str)

In [None]:
# Sort data (add column) and convert to integer

df_yoe_grouped['Sort'] = df_yoe_grouped['Years of Experience'].str.extract(r'(\d+)', expand=False).astype(int)

df_yoe_grouped.sort_values('Sort', ascending=True, inplace=True)

In [None]:
# Sort data and reset index

df_yoe_sorted = df_yoe_grouped.sort_values('Years of Experience',ascending=True)

df_yoe_sorted.reset_index(drop=True, inplace=True)

In [None]:
# Create pivot table

df_yoe_pivot = pd.pivot_table(
    df_yoe_sorted,
    values='Count',
    index=['Job Title'],
    columns=['Sort', 'Years of Experience'],
    aggfunc='sum',
    fill_value=0,
    sort=True,
    margins=False,
)

In [None]:
# Drop level for heat map

df_yoe_pivot.columns = df_yoe_pivot.columns.droplevel(0)

Question 6 of the survey specifically asked “For how many years have you been writing code and/or programming?” This wording is interesting, as it seems more targeted toward software engineers, data scientists, and data engineers than toward data analysts or product managers, for example.

As shown in the heat map below, most business analysts and data analysts who responded to the survey have between 1 and 3 years of experience. The largest group of data scientists (121 [27%]) has 5–10 years of experience. Nearly 40% of software engineers reported that they have 20+ years of experience, which is also the most common answer across all job titles combined (427 [21.3%]).

In [None]:
# Create heat map

create_heat_map(dataframe=df_yoe_pivot,
                x_axis_title='Years of Experience',
                y_axis_title='Job Title',
                color_bar='Count',
                x_axis_tick_angle=0)
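
Experience labels such as `10-20` and `5-10` do not sort correctly as plain strings, which is why the cells above add a numeric `Sort` column. An alternative, shown here only as a sketch, is an ordered categorical that carries the intended order with the column itself. The label list mirrors the shortened labels used in this notebook; the toy rows are hypothetical.

```python
import pandas as pd

# Intended left-to-right order of the shortened labels
experience_order = ['0', '0-1', '1-3', '3-5', '5-10', '10-20', '20+']

# Hypothetical rows, not the survey figures
toy_yoe = pd.DataFrame({
    'Job Title': ['Data Analyst', 'Data Analyst', 'Data Scientist', 'Data Scientist'],
    'Years of Experience': ['10-20', '1-3', '5-10', '0-1'],
    'Count': [4, 9, 7, 2],
})

# An ordered categorical sorts by position in experience_order, not alphabetically
toy_yoe['Years of Experience'] = pd.Categorical(
    toy_yoe['Years of Experience'], categories=experience_order, ordered=True)

toy_sorted = toy_yoe.sort_values('Years of Experience', ignore_index=True)
print(toy_sorted)
```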

# **Conclusion**<a name='conclusion'></a>

Some of the survey results are not surprising—it is widely known, for example, that the tech industry (and therefore the field of data science and machine learning) is dominated by men, and it is clear that companies need to do a better job of trying to attract and retain female candidates.

The salary ranges reported are not all that revelatory either, as the field tends to pay well. A more detailed analysis by region would be interesting, but in general most data science positions in the United States are high-paying jobs.

Python and AWS continue to be popular tools across all job titles, which seems consistent with other data. In the Stack Overflow 2021 Developer Survey, 59% of all professional developers reported that their most widely used cloud platform is AWS.

Some of the responses to more specific questions regarding activities that make up an important part of participants’ roles at work are interesting, however. Questions that I expected to garner a higher volume of responses from data engineers or machine learning engineers—Parts 4 and 5 of Question 24, for example—in fact received more responses from data scientists. I hope that in future surveys Kaggle will allow fill-in answers, as this could shed light on exactly what job duties professionals perform at work. Answers under “none” or “other” only pique my curiosity and leave me wanting to learn more!

# **References**<a name='refs'></a>

Duranton, S., Erlebach, J., Brégé, C., Danziger, J., Gallego, A., & Pauly, M. (2020, March 6). *What’s keeping women out of data science?* BCG. https://www.bcg.com/publications/2020/what-keeps-women-out-data-science

Ho, A., Nguyen, A., Pafford, J. L., & Slater, R. (2019). A data science approach to defining a data scientist. *SMU Data Science Review, 2*(3), 1–20. https://scholar.smu.edu/datasciencereview/vol2/iss3/4

Kaggle. (2021). *2021 Kaggle machine learning & data science survey* [Data set]. https://www.kaggle.com/c/kaggle-survey-2021/data

National Center for O\*NET Development. (2021). *Alternate titles—O\*NET 26.0 data dictionary* [Data set]. O\*NET Resource Center. https://www.onetcenter.org/dictionary/26.0/excel/alternate_titles.html

Stack Overflow. (2021). *2021 developer survey* [Data set]. https://insights.stackoverflow.com/survey/2021#overview

Theuwissen, M. (2015, November). *The different data science roles in the industry.* KDnuggets. https://www.kdnuggets.com/2015/11/different-data-science-roles-industry.html

U.S. Bureau of Labor Statistics. (2021, September 8). *Computer and information technology occupations.* Occupational Outlook Handbook. https://www.bls.gov/ooh/computer-and-information-technology/home.htm

Verma, A., Yurov, K. M., Lane, P. L., & Yurova, Y. V. (2019). An investigation of skill requirements for business and data analytics positions: A content analysis of job advertisements. *Journal of Education for Business, 94*(4), 243–250. https://doi.org/10.1080/08832323.2018.1520685