### Update to [my 2020 Survey notebook](http://www.kaggle.com/kchinn9/charting-a-path-into-the-data-science-field/notebook), using 2021 Survey responses

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<img src="https://i.imgur.com/tZtwN1G.jpg" width="650">
Image source: [Data Science Vectors by Vecteezy](http://www.vecteezy.com/free-vector/data-science)

*Revisiting my questions from the [2020 Kaggle Survey](http://www.kaggle.com/c/kaggle-survey-2020/overview), using data from the [2021 Survey](http://www.kaggle.com/c/kaggle-survey-2021/overview). Specifically, I look at currently employed Kagglers in the US to find the most common path into the field of data science. (Next up will be combining the findings into one notebook, but that's for another day.)*

This notebook attempts to shed light on the path or paths to becoming a data science professional in the United States, as of the 2021 Kaggle Machine Learning and Data Science Survey.

Data science is a rapidly growing field, and the demand for data scientists is outpacing supply. In the past, most Data Scientist positions went to people with PhDs in Computer Science. I wanted to know if that is changing in light of both the increased job openings and the expanding definition of data science that has come with more companies realizing the wealth of raw data they have available for analysis, and how that can help to grow and refine their businesses.

## **Questions**
In particular, I wanted to look for answers to the following questions:

1. Do you need a formal degree?
2. What programming language(s) do data science professionals need to know?
3. What are the preferred online learning platforms to gain data science knowledge and skills?

## **Initial Data Exploration and Cleaning**

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from matplotlib.ticker import PercentFormatter

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Reading in the data (low_memory=False makes pandas infer column types from the whole file)
df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', low_memory=False)

In [None]:
# Looking at the first 5 rows
df.head()

In [None]:
# Removing the first column and the first row
df.drop(['Time from Start to Finish (seconds)'], axis=1, inplace=True)
df = df.loc[1:, :]
df.head()

In [None]:
df.shape

#### *A quick look at the overall demographics, before filtering for just the US.*

In [None]:
# Overall age distribution of survey respondents
df['Q1'].value_counts().sort_index().plot(kind = 'bar')

In [None]:
df['Q2'].value_counts()

In [None]:
df['Q3'].value_counts().head(10)

In [None]:
df['Q4'].value_counts()

In [None]:
df['Q5'].value_counts()

In [None]:
df['Q6'].value_counts()

## **Data preparation and filtering**

In [None]:
# Aggregating all gender answers other than 'Man' and 'Woman'
df.loc[df.Q2.isin(['Prefer not to say', 'Prefer to self-describe', 'Nonbinary']), 'Q2'] = 'Other Response'

# Abbreviating country name
df.loc[(df.Q3 == 'United States of America'),'Q3']='USA'

# Shortening education level descriptions
df.loc[(df.Q4 == 'Doctoral degree'),'Q4']='PhD'
df.loc[(df.Q4 == 'Master’s degree'),'Q4']='Master’s'
df.loc[(df.Q4 == 'Bachelor’s degree'),'Q4']='Bachelor’s'
df.loc[(df.Q4 == "Some college/university study without earning a bachelor’s degree"), 'Q4']='Some college/university'
df.loc[(df.Q4 == 'No formal education past high school'), 'Q4']='High school'
df.loc[(df.Q4 == 'I prefer not to answer'), 'Q4']='Prefer not to answer'

# Ordering education levels by reverse typical chronological completion
q4_order = [
    'PhD',
    'Master’s', 
    'Professional doctorate', 
    'Bachelor’s', 
    'Some college/university', 
    'High school', 
    'Prefer not to answer']

# Putting coding experience answers in order from shortest time to longest
q6_order = [
    'I have never written code', 
    '< 1 years', 
    '1-3 years', 
    '3-5 years', 
    '5-10 years', 
    '10-20 years', 
    '20+ years']

df.loc[(df.Q40_Part_9 == 'Cloud-certification programs (direct from AWS, Azure, GCP, or similar)'), 'Q40_Part_9']='Cloud-certification programs'
df.loc[(df.Q40_Part_10 == 'University Courses (resulting in a university degree)'), 'Q40_Part_10']='University Courses resulting in a degree'

In [None]:
# Filtering for just US responses
us_df = df[df['Q3'] == 'USA']

# Filtering to only include currently employed Kagglers
q5_order = [
    'Data Scientist',
    'Software Engineer',
    'Data Analyst', 
    'Research Scientist',
    'Machine Learning Engineer',
    'Business Analyst',
    'Program/Project Manager',
    'Data Engineer',
    'Product Manager',
    'Statistician',
    'DBA/Database Engineer',
    'Developer Relations/Advocacy',
    'Other']

us_df = us_df[us_df['Q5'].isin(q5_order)]

In [None]:
us_df.shape

In [None]:
# Filtering to only include specific question columns
us_df = us_df.loc[:, ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7_Part_1', 'Q7_Part_2','Q7_Part_3','Q7_Part_4','Q7_Part_5','Q7_Part_6', 'Q7_Part_7','Q7_Part_8','Q7_Part_9','Q7_Part_10','Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER','Q40_Part_1', 'Q40_Part_2', 'Q40_Part_3', 'Q40_Part_4', 'Q40_Part_5', 'Q40_Part_6', 'Q40_Part_7', 'Q40_Part_8', 'Q40_Part_9', 'Q40_Part_10', 'Q40_Part_11', 'Q40_OTHER']]

In [None]:
us_df.isna().sum()

In [None]:
us_df.shape

The US-only group is made up of 2001 people.

In [None]:
def list_and_bar(qnum, q_order, title):
    
    '''
    INPUT:
    qnum - the y-axis variable, a single-column question
    q_order - the order to display responses on the barchart
    title - the title of the barchart
    
    OUTPUT:
    1. A list of responses to the selected question, in descending order
    2. A horizontal barchart showing the values, in sorted order 
    '''

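    # creating a dataframe of values to include both raw counts and percentages
    val_list = pd.DataFrame()
    val_list['Count'] = us_df[qnum].value_counts()
    # dividing by the number of non-null answers so the percentages sum to 100
    val_list['Pct'] = round(val_list['Count'] * 100 / us_df[qnum].count(), 2)
    
    print(val_list)
    
    fig, ax = plt.subplots(1, 1, figsize=(12,6))
    # reindex (rather than [q_order]) tolerates answer options that nobody in the subset selected
    us_df[qnum].value_counts().reindex(q_order).plot(kind='barh', ax=ax)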
    
    # reversing the order of y axis  -- 
    # the horizontal barchart displays values in the reverse order of a regular barchart 
    # (i.e., where the barchart might show a - b - c left to right, the corresponding
    #  horizontal barchart would show c at the top, and a at the bottom)
    ax.invert_yaxis()
    
    plt.title(title, fontsize = 14, fontweight = 'bold')
    plt.show()
    
        

def list_and_bar_mc(mc_df, title):
    
    '''
    INPUT:
    mc_df - a dataframe consisting of answers to a specific multiple-column question
    title - the title of the barchart
    
    OUTPUT:
    1. A list of responses to the selected question, in descending order
    2. A horizontal barchart showing the values, also in descending order
    '''
    print(mc_df)
    
    fig, ax = plt.subplots(1, 1, figsize=(12,6))
    mc_df['Count'].sort_values().plot(kind='barh')
    plt.title(title, fontsize = 14, fontweight = 'bold')
    plt.show()
    
    
def heatmap(qnum_a, qnum_b, title, order_rows, columns):
    
    '''
    INPUT:
    qnum_a - the x-axis variable, a single-column question
    qnum_b - the y-axis variable, a single-column question
    title - the title of the heatmap, describing the variables in the visualization
    order_rows - sorted order for the y-axis
    columns - sorted order for the x-axis
    
    OUTPUT:
    A heatmap showing, for each qnum_b answer (row), the percentage of respondents giving each qnum_a answer
    '''
    vals = us_df[[qnum_a, qnum_b]].groupby(qnum_b)[qnum_a].value_counts().unstack()
    
    # adding an 'All' row holding the total count for each column; it is only kept in the
    # final heatmap if 'All' appears in order_rows
    vals_colsums = pd.DataFrame([vals.sum(axis=0).tolist()], columns=vals.columns, index=['All'])
    vals = pd.concat([vals_colsums, vals], axis=0)

    # converting each row to percentages of its own total (the small constant guards against division by zero)
    vals = ((vals.T / (vals.sum(axis=1) + 0.001)).T) * 100 

    vals = vals.reindex(order_rows).reindex(columns = columns)
    
    fig, ax = plt.subplots(1, 1, figsize=[12,6])
    ax = sns.heatmap(ax = ax, data = vals, cmap = 'GnBu', cbar_kws = {'format': '%.0f%%'})
    plt.title(title, fontsize = 14, fontweight = 'bold')
    ax.set_xlabel('')
    ax.set_ylabel('')
    plt.show()
    
    
def heatmap_mc(qnum, qnum_mc, title, columns, order_rows):
    
    '''
    INPUT:
    qnum - the y-axis variable, a single-column question
    qnum_mc - the x-axis variable, a question with multiple columns of answers
    title - the title of the heatmap, describing the variables in the visualization
    order_rows - sorted order for the y-axis
    columns - a list of column names, representing the multiple-column answer options, ordered
    
    OUTPUT:
    1. A heatmap showing, for each qnum answer (row), the share of respondents who selected each option
    2. avg_num - the average number of answer options chosen for the multiple column question
    '''
    # creating a dataframe with the single-column question
    df_qnum = us_df[qnum]
    df_qnum = pd.DataFrame(df_qnum)
    
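    # creating a dataframe containing all the columns for a given multiple-column question
    cols_mc = [col for col in us_df if col.startswith(qnum_mc)]
    df_mc = us_df[cols_mc].copy()
    
    # labelling each column with the answer text it holds, then arranging the columns in the
    # requested order (assigning `columns` positionally would mislabel them, because us_df
    # keeps the survey's original column order)
    df_mc.columns = [df_mc[col].dropna().iloc[0] for col in cols_mc]
    df_mc = df_mc.reindex(columns=columns)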
    
    # converting column values to binary 0 or 1 values (1 if the user chose that answer, 0 if not)
    df_mc = df_mc.notnull().astype(int)
    
    # joining the dataframes together
    df_join = df_qnum.join(df_mc)
    
    # aggregating counts for each answer option and re-ordering dataframe
    df_agg = df_join.groupby([qnum]).agg('sum')
    df_agg = df_agg.reindex(order_rows)
    
    df_agg['users'] = df_join.groupby(qnum)[qnum].count()
    df_agg =  df_agg.div(df_agg.loc[:, 'users'], axis=0)
    df_agg.drop(columns='users', inplace=True)
    
    
    fig, ax = plt.subplots(1, 1, figsize=(12, 6))
    ax = sns.heatmap(ax = ax, data = df_agg, cmap = 'GnBu')
    cbar = ax.collections[0].colorbar
    cbar.ax.yaxis.set_major_formatter(PercentFormatter(1, 0))
    plt.title(title, fontsize = 14, fontweight = 'bold')
    ax.set_xlabel('')
    ax.set_ylabel('')
    plt.show() 
    
    # finding the average number of answers chosen for the multiple column options, 
    # minus tabulations for 'None'; the single-column question is dropped too so that
    # only the 0/1 indicator columns are summed
    df_temp = df_join.drop(columns=[qnum, 'None'])
    rowsums = df_temp.sum(axis = 1)
    avg_num = round(rowsums.mean(), 2)
    
    print('Average number of options chosen by survey respondents: ' + str(avg_num) + '.')

## **Analysis and visualizations**

In [None]:
us_df['Q2'].value_counts().plot(kind = 'bar')

In [None]:
plt.figure(figsize=[12,6])
us_ages = us_df['Q1'].value_counts().sort_index()
sns.countplot(data = us_df, x = 'Q1', hue = 'Q2', hue_order = ('Man', 'Woman', 'Other Response'), order = us_ages.index)
plt.title('Age and Gender Distribution')

Again, respondents are predominantly male and skew younger.

In [None]:
list_and_bar('Q6', q6_order, 'Years of Coding Experience')

In this year's survey, 70% of US employed Kagglers report coding for longer than 3 years.
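
A share like this can be read off the printed percentages by adding up the relevant rows. As a minimal sketch of the same arithmetic, using made-up toy answers rather than the survey data (the names `toy_q6` and `over_3_years` are purely illustrative):

```python
import pandas as pd

# Toy responses (made up for illustration; not survey data)
toy_q6 = pd.Series(['< 1 years', '1-3 years', '3-5 years', '5-10 years',
                    '5-10 years', '10-20 years', '20+ years', '1-3 years',
                    '3-5 years', '10-20 years'])

# Answer buckets treated as "longer than 3 years" of coding
over_3_years = ['3-5 years', '5-10 years', '10-20 years', '20+ years']

# isin() gives True/False per respondent; the mean of booleans is the share of True values
share_over_3 = toy_q6.isin(over_3_years).mean() * 100
print(f'{share_over_3:.0f}% of toy respondents report coding for longer than 3 years')
```

Against the real data, the same pattern would presumably be applied to `us_df['Q6']`.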

### **Question 1. Do you need a formal degree to become a data science professional?**
Let's look at formal education, and how it correlates with job title.

In [None]:
list_and_bar('Q4', q4_order, 'Highest Level of Education Attained')

In [None]:
list_and_bar('Q5', q5_order, 'Current Job Title')

In [None]:
heatmap('Q4', 'Q5', 'Roles by Education Level', q5_order, q4_order)

### **Question 1 analysis**
With almost 49% of the responses, a Master's degree was again by far the most common level of education listed, more than double the next most popular answer. Other notable observations:

* Seventy percent of US Kagglers hold a Master's Degree or higher. 
* Research scientists and statisticians are most likely to hold PhDs, followed by Data Scientists.
* Relatively few survey respondents (just over 5%) indicate they do not have at least a Bachelor's degree.
* Only 22% of those responding hold the title of Data Scientist, but it is again the title with the highest count. 
    Arguably, anyone who is active on Kaggle and completes the survey considers themselves to be either in, or 
    interested in, the data science field, even if they are not actively working as a Data Scientist.
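
The role-by-education percentages behind these observations are row shares: each role's answers are divided by that role's total. A small sketch of the same idea with `pd.crosstab`, using invented toy data (this is an alternative to the `heatmap` helper, not what the notebook itself runs):

```python
import pandas as pd

# Toy data (made up for illustration; not survey responses)
toy = pd.DataFrame({
    'role': ['Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist',
             'Statistician', 'Statistician'],
    'education': ['PhD', 'Masters', 'Masters', 'Bachelors', 'PhD', 'PhD'],
})

# normalize='index' divides each row by its own total, so every role sums to 100%
role_by_edu = pd.crosstab(toy['role'], toy['education'], normalize='index') * 100
print(role_by_edu.round(1))
```

Reading across a row answers "what education do people in this role have?", which is a different question from "what roles do people with this degree hold?" (that would need `normalize='columns'`).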

### **Question 2. What programming language(s) do data science professionals need to know?**
Now we'll turn to programming languages used. 

In [None]:
# creating a dataframe of the language options and the number of times each language was selected
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
language_rows = []

for col in us_df.columns:
    if col.startswith('Q7_'):
        language = us_df[col].value_counts()
        language_rows.append({'Language': language.index[0], 'Count': language.iloc[0]})
languages = pd.DataFrame(language_rows).set_index('Language')
languages = languages.sort_values(by = 'Count', ascending = False)
languages_tot = languages['Count'].sum()
# Pct is each language's share of all language selections, not of respondents
languages['Pct'] = round((languages['Count'] * 100 / languages_tot), 2)

In [None]:
list_and_bar_mc(languages, 'Programming Languages Used')

In [None]:
heatmap_mc('Q5', 'Q7', 'Language Use by Role', languages.index, q5_order)

In [None]:
heatmap_mc('Q4', 'Q7','Language Use by Education Level', languages.index, q4_order)

In [None]:
heatmap_mc('Q6', 'Q7', 'Language Use by Years Coding', languages.index, q6_order)

### **Question 2 analysis**
Python was the most widely used language, followed by SQL and R. Python held the top spot across all education levels and coding experience, as well as almost all job roles -- but not as firmly as in the prior survey: Statisticians listed SQL higher and R has moved into the top spot for DBA/Database Engineers, and is a close second this year for Data Engineers, Data Analysts, and Data Scientists. R also enjoys widespread popularity across education levels and coding experience; SQL shows a high number of users overall, but they are more concentrated in people holding Master's or PhD degrees, working as Statisticians, Data Scientists and Data Analysts.
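
Statements about which language is on top for a given role come down to comparing usage shares within each row of the heatmap. A hedged sketch of that comparison, with an invented 0/1 indicator table standing in for the survey columns:

```python
import pandas as pd

# Toy indicator table (made up): 1 if the respondent selected the language, else 0
toy = pd.DataFrame({
    'role':   ['Data Scientist', 'Data Scientist', 'Statistician', 'Statistician'],
    'Python': [1, 1, 0, 1],
    'R':      [1, 0, 1, 1],
    'SQL':    [0, 1, 0, 0],
})

# The mean of a 0/1 column within a group is the share of that group using the language
usage_by_role = toy.groupby('role').mean()

# idxmax(axis=1) returns the column label with the highest share in each row
top_language = usage_by_role.idxmax(axis=1)
print(top_language)
```

When two languages tie, `idxmax` returns the first one in column order, so close calls are worth checking by eye.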

Kagglers reported using 2-3 languages on a regular basis.
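
For a "select all that apply" question, the average is the mean number of non-empty answer columns per respondent. A minimal sketch with toy data; the column names mirror the survey's naming, and treating `Q7_Part_12` as the 'None' option is an assumption made for this example:

```python
import numpy as np
import pandas as pd

# Toy answers (made up): one column per option, holding the option text if selected, NaN otherwise
toy_q7 = pd.DataFrame({
    'Q7_Part_1':  ['Python', 'Python', np.nan, np.nan],
    'Q7_Part_2':  ['R', np.nan, 'R', np.nan],
    'Q7_Part_3':  ['SQL', 'SQL', np.nan, np.nan],
    'Q7_Part_12': [np.nan, np.nan, np.nan, 'None'],  # assumed to be the 'None' option
})

# Dropping the 'None' column so it is not counted as a language, then flagging selections
selected = toy_q7.drop(columns='Q7_Part_12').notna()

# Number of languages chosen by each respondent, and the average across respondents
per_respondent = selected.sum(axis=1)
avg_languages = per_respondent.mean()
print(f'Average number of languages selected: {avg_languages:.2f}')
```

Respondents who chose 'None' still count in the denominator here, with zero languages each.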

### **Question 3. What are the preferred online learning platforms to gain data science knowledge and skills?**
Regarding online learning, Kaggle's survey asked, "On which platforms have you begun or completed data science courses? (Select all that apply)." 

In [None]:
# creating a dataframe of online course providers and the number of times each was selected by users
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
platform_rows = []

for col in us_df.columns:
    if col.startswith('Q40_'):
        platform = us_df[col].value_counts()
        platform_rows.append({'Platform': platform.index[0], 'Count': platform.iloc[0]})
platforms = pd.DataFrame(platform_rows).set_index('Platform')
platforms = platforms.sort_values(by = 'Count', ascending=False)
platforms_tot = platforms['Count'].sum()
# Pct is each platform's share of all platform selections, not of respondents
platforms['Pct'] = round((platforms['Count'] * 100 / platforms_tot), 2)

In [None]:
list_and_bar_mc(platforms, 'Learning Platforms Used')

In [None]:
heatmap_mc('Q5', 'Q40', 'Learning Platform Use by Role', platforms.index, q5_order)

In [None]:
heatmap_mc('Q4', 'Q40', 'Learning Platform Use by Education Level', platforms.index, q4_order)

### **Question 3 analysis**
Coursera was the most popular response, by a good margin. Kaggle Learn, Udemy, and university courses toward a degree followed, with DataCamp, edX, and LinkedIn Learning coming in a bit behind. Kaggle Learn seemed to close the gap with Coursera a bit this year.

Kagglers reported trying data science courses on two platforms, on average.

Coursera's popularity was high across almost all education levels and job titles. Kaggle Learn's usage was fairly uniform across categories. LinkedIn Learning and Udemy seem to have been popular across categories as well, noticeably more so than in the 2020 survey. Fast.ai was popular with Statisticians, Machine Learning Engineers, and Data Scientists. It will be interesting to see how this part of the survey evolves, especially in the wake of Covid and its effects on work and learning.

## **Conclusion**
The most well-travelled path into the data science field in the US, at least for those responding to the 2021 Kaggle survey, looks much the same as it did in the 2020 survey:

* Get at least a Bachelor's degree, though a Master's degree may be preferable
* Learn at least 2 coding languages -- Python and R are the top data science languages; depending on the role you want,
    you might want to get comfortable with another language, such as SQL or MATLAB.
* Take classes on online learning platforms to update your skills and learn new ones. Coursera remains the standard, while
    Kaggle Learn is a good option for short, targeted learning.

*If you've read this far, thank you! This is my second notebook attempt, and I have a lot to learn. Any advice or corrections are greatly appreciated!*