# A Heatmap Roadmap to Your Dream Job
## Exploring the machine learning & data science tools real people are using in the job of your dreams
##### 2020 Kaggle ML & DS Survey

In [None]:
from IPython.display import Image
import os
Image('../input/coverimage/tabea-damm-9-xfYKAI6ZI-unsplash.jpg')

*Image credit: Tabea Damm on Unsplash*

# Greetings, fellow job hunters

I'm preparing to enter the job market in the field of machine learning and data science (ML & DS). Taking a quick glance at the results of the 2020 Kaggle ML & DS Survey, I'd say that many of my fellow Kagglers are preparing for that too - 25% of the >20,000 respondents in the survey were students! 

That got me thinking... How can we leverage the data in this survey to create a roadmap for how to prepare for your dream job? What ML & DS tools and techniques do you need to know? How best to showcase your skillset to potential employers? What level of education do you need in order to be competitive? And before we even get to those job-specific details, what is the right dream job for your interests anyway? Sure, you could just google it... But you'll generally find people's opinions or perhaps interviews with a handful of experts. It's rare that you'll find answers from *thousands* of ML & DS professionals in one place - but that's what you'll find in this analysis, my friends.

So join me on this roadmap to your dream job, told in a series of heatmaps comparing the responses of professionals with different job titles in the 2020 Kaggle ML & DS Survey. This roadmap has two parts:
* *Part 1: What is my dream job anyway?* will help you identify the dream job that fits your ML & DS interests.
* *Part 2: How do I prepare for my dream job?* will explore the ML & DS tools and techniques used by real people who are currently working in your dream job.

# Part 1. What is my dream job anyway?

Those of us preparing to enter the ML & DS job market have all heard the fancy job titles being thrown around. Data scientist, data analyst, machine learning engineer... But what do these job titles really mean in terms of your day-to-day work? Because let's face it, doing things you actually like is what makes the difference between having your dream job and just hitting the daily grind. 

So let's start by looking at what real people in your dream job are actually doing at work, to help you refine your idea of what your dream job really is.

## Job titles covered in this survey

Let's start by looking at the job titles of the survey respondents (Question 5). Let's exclude people without jobs (*i.e.* student, currently not employed, or missing values), or people who listed their job title as 'Other' (because that's going to lump a bunch of unrelated jobs together). Let's also exclude respondents whose current yearly compensation is missing or $0-999 USD, so we can just focus on the survey responses from people who are working and earning at least a small salary \(it's possible that some of the folks with low salaries may have been entering their *monthly* instead of *yearly* salary\). Finally, we'll only include job titles with more than 100 respondents to be sure we are getting a large enough sample size to draw meaningful conclusions. Here are the job titles that leaves us with:

In [None]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(style='white', context='notebook')

import warnings
warnings.simplefilter('ignore')

# survey data
survey = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', skiprows = [1])

# clean up survey 
# exclude student, unemployed, Other (not informative), or missing value in job title, as we will not consider these
# exclude missing salary or '$0-999'
# (both conditions are combined into one boolean mask to avoid chained indexing)
survey = survey[~survey['Q5'].isin(['Student', 'Currently not employed', 'Other', np.nan]) & ~survey['Q24'].isin([np.nan, '$0-999'])]

# relevant questions: 
## 5 (job title)

# get Q5
q5 = survey[['Q5']]

# add categorical variable for > x respondents
x = 100
job_counts = pd.DataFrame(q5['Q5'].value_counts()).reset_index()
job_counts.columns = ['Q5', 'count']

q5 = q5.merge(job_counts, on='Q5', how = 'left')
q5 = q5.assign(More_than_x = q5['count'] > x)

## FIRST exclude low sample size jobs from survey data (original df) for further analyses
jobs_to_exclude = list(q5[q5['More_than_x'] == False]['Q5'].unique())
survey = survey[survey['Q5'].isin(jobs_to_exclude)==False]

## AND exclude low sample size jobs from q5 analysis
q5 = q5[q5['More_than_x'] == True]

# palette
cpal = sns.color_palette(['#330033'])

# countplot
plt.figure(figsize=(6,4))
g = sns.countplot(x='Q5', data=q5, order = q5['Q5'].value_counts(ascending=False).index, palette = cpal)
plt.xticks(rotation=45, ha='right')
plt.xlabel('')
plt.title('Job titles of 2020 survey respondents', fontdict = {'fontsize':15}, pad=24)
sns.despine();
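
If you'd like to see the filtering rules from this section in isolation, here is a minimal sketch on a made-up table. The rows, the `toy` data frame, and the lowered threshold of 1 (instead of 100) are all my own assumptions for illustration; only the column names `Q5` and `Q24` and the excluded categories come from the survey.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey (hypothetical rows): Q5 = job title, Q24 = yearly compensation
toy = pd.DataFrame({
    'Q5':  ['Data Scientist', 'Data Scientist', 'Data Scientist', 'Student',
            'Other', 'DBA/Database Engineer', 'Data Scientist', np.nan],
    'Q24': ['10,000-14,999', '$0-999', np.nan, np.nan,
            '50,000-59,999', '30,000-39,999', '100,000-124,999', '1,000-1,999'],
})

# Rule 1: keep respondents with an informative job title
excluded_titles = ['Student', 'Currently not employed', 'Other']
has_job = toy['Q5'].notna() & ~toy['Q5'].isin(excluded_titles)

# Rule 2: keep respondents who reported a salary above the lowest bracket
has_salary = toy['Q24'].notna() & (toy['Q24'] != '$0-999')

employed = toy[has_job & has_salary]

# Rule 3: keep job titles with more than `min_respondents` people
min_respondents = 1  # the notebook uses 100; lowered here so the toy data survives
title_counts = employed['Q5'].value_counts()
kept_titles = title_counts[title_counts > min_respondents].index
filtered = employed[employed['Q5'].isin(kept_titles)]

print(filtered)
```

Only the two salaried data scientists remain: the database engineer passes the first two rules but falls below the sample size threshold.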

## Activities real people are doing at work

Let's get a picture of what real people in our dream jobs are doing at work, starting with Question 23: *'Select any activities that make up an important part of your role at work.'* We'll take a look at the proportion of positive responses within each job title. 

Our respondents were allowed to select multiple answers, so in the heatmap below each cell represents the percentage of respondents within each job title (the columns) that selected each answer (the rows). For example in the heatmap below, 81% of Business Analysts and 86% of Data Analysts said that they "analyze data", and 30% of Business Analysts and 37% of Data Analysts said that they perform work involving "data infrastructure". A legend follows the heatmap, giving the full text of the survey responses corresponding to the heatmap row labels.
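
To make the arithmetic behind each heatmap cell concrete, here is a small sketch on invented data. The question name `QX`, its two answer options, and the six respondents are hypothetical; the point is only to show that the denominator is *all* respondents with that job title, including those who selected nothing.

```python
import pandas as pd

# Hypothetical multi-select question "QX": one column per answer option.
# A cell holds the option text if the respondent selected it, otherwise it is missing.
toy = pd.DataFrame({
    'Q5':        ['Data Analyst', 'Data Analyst', 'Data Analyst', 'Data Analyst',
                  'Statistician', 'Statistician'],
    'QX_Part_1': ['Analyze data', 'Analyze data', 'Analyze data', None,
                  'Analyze data', 'Analyze data'],
    'QX_Part_2': [None, 'ML research', None, None,
                  'ML research', None],
})

# Denominator: number of respondents per job title
n_per_job = toy['Q5'].value_counts()

# Long format: one row per (respondent, selected answer); unselected options are dropped
long = toy.melt(id_vars='Q5', var_name='q_part', value_name='response').dropna()

# Numerator: how many respondents in each job title selected each answer
counts = long.groupby(['Q5', 'response']).size().reset_index(name='n_responses')

# Percentage of positive responses within job title
counts['pct_pos'] = counts['n_responses'] / counts['Q5'].map(n_per_job) * 100

# Wide format: answers as rows, job titles as columns (the shape the heatmap expects)
heatmap_data = counts.pivot(index='response', columns='Q5', values='pct_pos')
print(heatmap_data)
```

Because respondents can pick several answers, the percentages within a column don't need to add up to 100.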

In [None]:
# relevant questions: 
## 5 (job title)
## 23 (Select any activities that make up an important part of your role at work)

# function to get all columns matching question of interest
def get_q(q):
    ''' Returns all columns with column name containing the string you provide (q)'''
    
    q_col_list = [name for name in survey.columns if q in name]
    survey_q = survey[q_col_list]
    return survey_q

# get Q23 answers as big-picture overview
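# note: the question-text row was already skipped when loading the csv (skiprows=[1]), so no rows need dropping here
q23 = pd.concat([get_q('Q5'), get_q('Q23')], axis=1)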
q23 = q23.melt(id_vars = 'Q5', var_name = 'q_part', value_name = 'response').dropna()

## Q23 summary stats
### number of responses per question grouped by job title
q23_summary = pd.DataFrame(q23.groupby('Q5')['response'].value_counts())
q23_summary.columns = ['n_responses']
q23_summary.reset_index(inplace=True)

### add number of respondents for each job title
q23_summary = q23_summary.merge(job_counts, how='left', on='Q5')

### calculate percent positive responses
q23_summary = q23_summary.assign(pct_pos = (q23_summary['n_responses']/q23_summary['count'])*100)

## create legend for abbreviated question responses
label_abbrev = {'Analyze and understand data to influence product or business decisions': 'Analyze data',
               'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data': 'Data infrastructure',
               'Build prototypes to explore applying machine learning to new areas': 'ML prototyping',
               'Build and/or run a machine learning service that operationally improves my product or workflows': 'ML application',
               'Experimentation and iteration to improve existing ML models': 'ML improvement',
               'Do research that advances the state of the art of machine learning': 'ML research',
               'None of these activities are an important part of my role at work': 'None',
               'Other': 'Other'}

### as a data frame for printout later (like a legend)
labels_df = pd.DataFrame.from_dict(label_abbrev, orient = 'index').reset_index()
labels_df.columns = ['Survey Response', 'Heatmap Label']


### set up a column in df for label abbrev
def abbreviate_label(x):
    '''given the survey response for Q23 ("x"), returns the abbreviated label from labels_df'''
    return label_abbrev[x]
    
q23_summary['label'] = q23_summary['response'].apply(lambda x: abbreviate_label(x))

## wide format data
q23_wide = q23_summary.pivot(index='label', columns='Q5', values='pct_pos')

## plot: heatmap
plt.figure(figsize=(6,6))
sns.heatmap(q23_wide, annot=True, linewidths=.5, square=True, cbar_kws={'shrink':0.6, 'label':'% positive responses within job title'})
plt.xlabel("Job title")
plt.ylabel("Work activities")
plt.title("Q23: Select any activities that make up an important part of your role at work", fontdict = {'fontsize':15}, pad=24);

#### Legend

In [None]:
# legend for heatmap
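# Styler.hide_index() was removed in pandas 2.0; hide(axis='index') is the replacement (pandas >= 1.4)
labels_df[labels_df.columns[::-1]].style.hide(axis='index')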

Let's break down what this tells us about each job title, to help you identify your job preference:
* *Business analyst, Data analyst, or Statistician*: Data analysis is the most common response among respondents by far, and data infrastructure is in second place.
* *Data engineer*: Data analysis and data infrastructure are the most common responses.
* *Data scientist*: Data analysis is the most common response, but ML prototyping/improvement/application and data infrastructure are also important activities.
* *Machine learning engineer*: ML prototyping/improvement/application and data infrastructure are the most common responses. There is a strong similarity with the profile of data scientists in this survey, with the exception of their responses on data analysis.
* *Product/project manager*: Data analysis is the most common response, and ML prototyping is in second place.
* *Research scientist*: ML research, ML prototyping, and data analysis are the most common responses. It's probably safe to say these are ML researchers.
* *Software engineer*: Data analysis and 'None' are tied for the most common response.

## Techniques real people are using at work

The responses on work activities gave us a big-picture overview for each job title, but it is still pretty general. And in some cases we just don't get a lot of insight beyond 'Analyzing data' (gee thanks!). So let's try to get some more helpful information for your quest to identify your perfect dream job.

Let's dig deeper using the list of questions below to see what DS & ML techniques are used by the respondents in each job title. For now we won't look at the specific algorithms and libraries - you'll see that in Part 2 of this roadmap. We will simply look at whether the respondent performs a technique or not (*i.e.* did they give a response other than 'None' to the question?). This will tell us:
* *Q14:* Does the respondent perform any kind of data visualization?
* *Q17:* Machine learning?
* *Q18:* Computer vision?
* *Q19:* Natural language processing?
* *Q29-A:* Big data (relational databases etc.)?
* *Q31-A:* Business intelligence?
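
Here is a minimal sketch of that "anything other than 'None'" scoring rule, using an invented question `QX` and made-up respondents. One assumption worth flagging loudly: under this rule a respondent who left the question completely blank is scored as performing the technique, because they never selected 'None'. As far as I can tell that mirrors how the code in this section behaves, so keep it in mind when reading the percentages.

```python
import pandas as pd

# Hypothetical multi-select question "QX" with a dedicated 'None' option
toy = pd.DataFrame({
    'Q5':           ['Data Analyst', 'Data Analyst', 'Data Analyst', 'Data Analyst',
                     'Statistician', 'Statistician'],
    'QX_Part_1':    ['Matplotlib', 'Matplotlib', None, None, None, None],
    'QX_Part_2':    ['Seaborn', None, 'Seaborn', None, None, None],
    'QX_Part_None': [None, None, None, 'None', None, 'None'],
})

# All answer columns belonging to the question
parts = toy.filter(like='QX')

# True if the respondent ticked the 'None' option in any part of the question
answered_none = parts.eq('None').any(axis=1)

# Everyone else counts as performing the technique
# (including fully blank rows, such as the first Statistician here)
toy['uses_technique'] = ~answered_none

# Percentage of respondents performing the technique within each job title
pct_using = toy.groupby('Q5')['uses_technique'].mean() * 100
print(pct_using)
```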

In [None]:
# relevant questions: 
## 5 (job title)
## 14 (data visualization), 17 (ML), 18 (computer vision), 19 (NLP), 29_A (big data), 31_A (BI)

# generate None/not none summary of general areas of work questions
retrieve_q = ['Q14', 'Q17', 'Q18', 'Q19', 'Q29_A', 'Q31_A']

## function to tell if the respondent answered 'None':
def answer_None(x):
    '''Returns True if the question response (x) is "None", otherwise returns False.'''
    return x == 'None'

## retrieve questions on areas of work and score 1 if answered "None"
ans_none = survey[['Q5']]

for q in retrieve_q:
    this_q = get_q(q).reset_index()
    this_q = this_q.melt(id_vars = 'index', var_name = 'question_part', value_name = 'response')
    this_q[q] = this_q['response'].apply(lambda x: answer_None(x))
    result = this_q.groupby('index')[[q]].sum()
    ans_none = ans_none.join(result)

## column names more informative
informative_names = ['Job title', 'Data visualization (Q14)', 'Machine learning (Q17)', 'Computer vision (Q18)', 
                    'Natural language processing (Q19)', 'Big data (Q29-A)', 'Business intelligence (Q31-A)']

ans_none.columns = informative_names

### make legend table
labels2_df = pd.DataFrame(informative_names[1:], columns = ['Heatmap Label'])
labels2_df['Survey Question'] = ['What data visualization libraries or tools do you use on a regular basis?',
                                'Which of the following ML algorithms do you use on a regular basis?',
                                'Which categories of computer vision methods do you use on a regular basis?',
                                'Which of the following natural language processing (NLP) methods do you use on a regular basis?',
                                'Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?',
                                'Which of the following business intelligence tools do you use on a regular basis?']

## invert these answers to capture who did NOT answer "None"
not_none = pd.concat([ans_none[['Job title']], 1-ans_none.drop(columns='Job title')], axis=1)

## calculate percentage of respondents performing each technique by job title
areas_summary = not_none.groupby('Job title').mean()*100
areas_summary = round(areas_summary, 1)

## plot: heatmap
plt.figure(figsize=(6,6))
sns.heatmap(areas_summary.transpose(), annot=True, linewidths=.5, square=True, cbar_kws={'shrink':0.6, 'label':'% positive responses within job title'})
plt.xlabel("Job title")
plt.ylabel("DS & ML techniques")
plt.title("Techniques performed at work", fontdict={'fontsize':15}, pad=24);

This is quite a surprising contrast to the responses we just saw above from Question 23, where relatively few of the work activities broke the 50% mark of positive responses within each job title. Many of the ML & DS tools we are considering here (except business intelligence and big data) had a positive response rate above 80% across all job titles!

Whereas Question 23 was getting at which activities are an *important* part of work, these questions highlight techniques that people are using, regardless of whether they consider it an important focus. And they are all using a *LOT* of different ML & DS techniques! I would speculate that these results may be somewhat skewed toward ML because we are looking at a sample of Kaggle users (an ML-heavy platform), but the results are interesting nonetheless.

The take-away: We job hunters have our work cut out for us when it comes to learning all these techniques. We'd better focus our efforts on learning the most widely-used tools within each category of techniques.

# Part 2. How do I prepare for my dream job?

Let's face it - we don't have the time (or the space in our brains) to learn how to use every programming language, every data visualization library, or every ML framework out there, especially not up front in our quest for our dream jobs. Seasoned ML & DS veterans have probably honed their skills in a multitude of these tools, and are able to select just the right tool for any task. But we job hunters just want to get our foot in the door at our dream job by knowing the most widely-used tools, and we can continue learning from there.

## What are the most widely-used ML & DS tools and techniques?

Let's look at the proportion of respondents in each job title using each ML & DS tool or technique. You can use this information to help inform your decision as to which tools within each category are most valuable to learn first as you prepare to enter the job market. There are a lot of plots here, but don't worry - a summary of the main trends is provided below.

In [None]:
from textwrap import wrap

# reload the survey, questions row only
survey_qs = pd.DataFrame(pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', dtype=str).iloc[0]).reset_index()
survey_qs.columns = ['q_id', 'q_text']

# questions of interest
retrieve_q = ['Q7', 'Q14', 'Q16', 'Q17', 'Q18', 'Q19', 'Q29_A', 'Q31_A']
# or if you want a longer list of questions: retrieve_q = ['Q7', 'Q9', 'Q11', 'Q14', 'Q16', 'Q17', 'Q18', 'Q19', 'Q26_A', 'Q27_A', 'Q28_A', 'Q29_A', 'Q31_A', 'Q33_A', 'Q34_A', 'Q35_A', 'Q36']

# for each question, retrieve number of answers for each job title, and calculate percentage of positive responses within job title

def heatmap_my_qs(qlist, row_order = []):
    '''Plots a heatmap of question and responses by job title.
    Input: 
    - qlist: a list of question numbers in the format "Q1"
    - row_order: a list of the question responses, for ordering the rows in your heatmap output; default: empty list
    Output: 
    -heatmap'''
    
    for q in qlist:
        this_q = pd.concat([survey[['Q5']], get_q(q)], axis=1)

        this_q = this_q.melt(id_vars = 'Q5', var_name = 'q_part', value_name = 'response').dropna()

        ## question summary stats
        ### number of responses per question grouped by job title
        this_summary = pd.DataFrame(this_q.groupby('Q5')['response'].value_counts())
        this_summary.columns = ['n_responses']
        this_summary.reset_index(inplace=True)

        ### add number of respondents for each job title
        this_summary = this_summary.merge(job_counts, how='left', on='Q5')

        ### calculate percent positive responses
        this_summary = this_summary.assign(pct_pos = (this_summary['n_responses']/this_summary['count'])*100)

        ### wide format for heatmap
        this_wide = this_summary.pivot(index='response', columns='Q5', values='pct_pos')
        
        ### option to reorder rows of heatmap if list provided
        if len(row_order) > 0:
            this_wide = this_wide.reindex(row_order)

        ### make labels less long by getting rid of brackets
        this_wide.index = [ind.split(" (")[0] for ind in this_wide.index]

        ## retrieve the question text
        q_text = survey_qs[survey_qs['q_id'].str.match(q)].reset_index()
        q_text[['main_q', 'restof_q']] = q_text['q_text'].str.split('?', expand=True)
        q_title = q + ": " + q_text.iloc[0].loc['main_q'] + "?"

        ## plot
        plt.figure(figsize = (6,12))
        p = sns.heatmap(this_wide, annot=True, linewidths=.5, square=True, cbar_kws={'shrink':0.3, 'label':'% positive responses within job title'})
        plt.xlabel("Job title")
        plt.ylabel("Question response")
        plt.title("\n".join(wrap(q_title, 70)), fontdict = {'fontsize':15}, pad=24)
    
    return p

heatmap_my_qs(retrieve_q);

These plots show that within each Question, certain tools are widely popular among most of the dream jobs. There are also some interesting differences in what is most popular when comparing one dream job to another, for example which programming language is more popular - R or Python - among Statisticians *vs.* all other job titles.

We won't delve into a description of all the specifics here - you can dig into the plots above and look at the results for your specific dream job. But here is a summary of the most striking overall trends:
* *Programming languages:* Python is the most popular across the board, unless you're hoping to be a statistician, in which case it's R. SQL is generally the second most popular choice.
* *Data visualization libraries:* Matplotlib and Seaborn are most popular within the Python universe, and ggplot is most popular in the R universe.
* *Machine learning frameworks:* Scikit-learn is an across-the-board favorite, but TensorFlow and Keras deserve an honorable mention.
* *Machine learning algorithms:* Linear/logistic regression and decision trees/random forests appear to be most popular on the whole, but several other algorithms are also quite popular, particularly among certain job titles.
* *Computer vision methods:* There is a lot of job title-specific variation here, but image classification stands out in a few job titles.
* *Natural language processing:* Again, there is job title-specific variation, but word embeddings/vectors appears to come out on top.
* *Big data products:* MySQL, Microsoft SQL Server, and PostgreSQL appear to take 1st, 2nd, and 3rd here.
* *Business intelligence tools:* Microsoft Power BI and Tableau are the favorites among those who use BI tools.

## Refining your job-specific toolkit

The insights from the section above are useful for getting started on learning the right general tools for your dream job. Remember that we were looking at aggregated data within each job title, so we might have missed specific combinations of tools that are used by subgroups within job titles.

To address this, let's look at the answers from individual respondents, and see what clusters of ML & DS tools/techniques emerge within each job title. In each clustered heatmap below, the columns represent individual survey respondents within a given job title, and the rows represent the top 25 most widely-used ML & DS tools/techniques within that job title (there are 83 possible tools/techniques represented in the heatmaps you saw in the section above). Each respondent's answers are represented as either 1 or 0, meaning that the respondent does or does not use that tool/technique, respectively. For the sake of simplicity, we will omit any respondents whose answers were '0' for all of the top 25 tools/techniques shown. 

If you look at the dendrogram on the y-axis of each clustered heatmap, you can see which tools/techniques cluster together for each job title. A summary of interesting trends is provided below the plots.
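
If the idea of tools "clustering together" feels abstract, here is a small sketch of the steps described above on invented 0/1 answers: pick the most-used tools, drop respondents with no positive answers, then group the tools hierarchically. I'm using SciPy's `linkage` with average linkage and Euclidean distance, which I believe are the defaults seaborn's `clustermap` uses, but treat that as an assumption. The respondents, the tool list, and `top_n = 4` (instead of 25) are hypothetical.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical binary answers: rows are respondents, columns are tools (1 = uses it)
answers = pd.DataFrame({
    'Python':     [1, 1, 1, 0, 0, 0],
    'Matplotlib': [1, 1, 1, 0, 0, 0],
    'R':          [0, 0, 0, 1, 1, 0],
    'ggplot':     [0, 0, 1, 1, 1, 0],
    'Julia':      [0, 0, 0, 0, 0, 0],
})

# Keep only the most widely-used tools (the notebook keeps the top 25)
top_n = 4
top_tools = answers.sum().sort_values(ascending=False).index[:top_n]
subset = answers[top_tools]

# Drop respondents who use none of the selected tools
subset = subset[subset.sum(axis=1) > 0]

# Cluster the tools: each tool is described by its column of 0/1 answers
Z = linkage(subset.T.values, method='average', metric='euclidean')

# Cut the dendrogram into two groups of tools
cluster_id = pd.Series(fcluster(Z, t=2, criterion='maxclust'), index=subset.columns)
print(cluster_id.sort_values())
```

In this toy example the Python tools land in one group and the R tools in the other, which is the same kind of pattern to look for along the y-axis dendrograms.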

In [None]:
from matplotlib.colors import LinearSegmentedColormap

# retrieve all the questions from the section above into 1 df
retrieve_q = ['Q7', 'Q14', 'Q16', 'Q17', 'Q18', 'Q19', 'Q29_A', 'Q31_A']

myqs = pd.DataFrame() # for concatenating results onto

for q in retrieve_q:
    myqs = pd.concat([myqs, get_q(q)], axis = 1)
    
# convert to binary response
## change column names to match the answer within each column
myqs = myqs.fillna(0)

new_colnames = []
for col in myqs.columns:
    x = myqs[col].unique()
    colname = x[x!=0][0].strip()
    new_colnames.append(colname)
    
### make new colnames less long by getting rid of brackets
new_colnames = [name.split(" (")[0] for name in new_colnames]

myqs.columns = new_colnames

## get rid of None and Other columns (not informative here)
myqs = myqs.drop(columns=['None', 'Other'])

## change positive responses to 1's
myqs = myqs.replace(regex=r'.+', value=1)

# add job annotation then split into separate clustermaps by job title
myqs = pd.concat([survey[['Q5']], myqs], axis = 1)

for job in sorted(myqs['Q5'].unique()):
    job_data = myqs[myqs['Q5'] == job].reset_index(drop=True)
    job_data.drop(columns = ['Q5'], inplace=True)

    # we'll just look at the top 25 tools/techniques for each job
    top = job_data.sum().sort_values(ascending=False).index[:25]
    job_data = job_data[top]

    # remove any respondents where all answers are 0 in this top 25 tool/technique list
    job_data = job_data[job_data.sum(axis=1) != 0]
    
    ## custom colors
    my_colors = ('#330033','#fff2e6')
    my_cmap = LinearSegmentedColormap.from_list('Custom', my_colors, len(my_colors))
    
    ## plot clustermap
    g = sns.clustermap(job_data.transpose(), figsize=(14,10), dendrogram_ratio = 0.2, 
                       cbar_pos=(-0.05, 0.8, 0.05, 0.18), cbar_kws = {'ticks':[0,1], 'shrink':0.1}, cmap=my_cmap,
                      xticklabels = False)
    g.fig.suptitle(job, y=1.02, fontsize = 15)
    g.cax.set_title("Response:\n0=No / 1=Yes", fontsize=12, pad=10)

The way some of these tools/techniques are clustered is quite intuitive - the fact that Python and matplotlib, or R and ggplot cluster together probably comes as no surprise. Other clusters can give you a little bit more information about the different "flavors" of toolkits within a given job title. For example, if you look at the clustered heatmap for Data Scientists (the most frequent job title), there is a cluster of tools/techniques involving neural networks, image classification, and natural language processing (word embeddings/vectors). It appears that around half of the data scientists use this cluster of techniques, and the other half don't. 

If you take a moment to examine the clustered heatmap for your dream job, you might find different "flavors" of toolkits within your profession of interest too.

## Showing off your ML & DS toolkit

So now that you're on the road to learning just the right skillset for your dream job, how should you showcase your newfound skills to potential employers?

Let's assume that the real people in this survey who are currently working in your dream job *could also be the real people interviewing you for your dream job*. Let's look at where they share their work, to get an idea of where is the best place to showcase your ML & DS portfolio. For this we'll look at Question 36: *Where do you publicly share or deploy your data analysis or machine learning applications?*

In [None]:
heatmap_my_qs(['Q36']);

It looks like GitHub followed by Kaggle are the most popular platforms for sharing ML & DS work across the board. Is Kaggle so popular because this is a Kaggle survey? I'll let you be the judge. In any case, these results indicate that if you want to showcase your portfolio then GitHub is a good choice... and of course Kaggle too!

## What about those pesky degrees?

We've looked at the ML & DS tools you'll need on your way to your dream job. Finally, let's ask about the dreaded D-word: degrees. What level of education do the real people in your dream job typically have? For this we'll look at Question 4: *What is the highest level of formal education you have attained or plan to attain within the next 2 years?*

In [None]:
# provide row order for heatmap rows:
row_order = ['I prefer not to answer',
            'No formal education past high school',
            'Some college/university study without earning a bachelor’s degree',
            'Bachelor’s degree',
            'Master’s degree',
            'Doctoral degree',
            'Professional degree']

heatmap_my_qs(['Q4'], row_order = row_order);

It looks as though Master's degree is the most common level of education achieved in most job titles. In many cases, Bachelor's degree is the second most common answer. Some notable exceptions are Research Scientist (Doctoral degree most common), and Software Engineer (Bachelor's degree won by a hair over Master's). Overall, the percentage of respondents who haven't completed post-secondary education of some kind is very low. So in addition to learning the right ML & DS tools for your dream job, it will likely be advantageous to have a degree as well.

# Parting words

Thanks for joining me on this journey through the “Heatmap Roadmap to Your Dream Job”. Hopefully you found some valuable insights for choosing your dream job, or for preparing yourself to land your dream job. If you did, please show your appreciation with an upvote. If you already have your dream job and you were reading this for its sheer entertainment value, then why not show your appreciation with a job offer or hiring recommendation for my dream job as a data analyst?
  

  
   
*Job offers and networking contacts can connect with me: jmg.crockett[at]gmail.com*