If you search LinkedIn for jobs available worldwide right now, you will find:
* 100,000+ 'Machine Learning' jobs
* 93,000+ 'Business Analyst' jobs
* 85,000+ 'Data Analyst' jobs
* 53,000+ 'Data Scientist' jobs
* 33,000+ 'AI Engineer' jobs

Each job demands experience or expertise in different algorithms, frameworks and IDEs, as well as a different level of education.
Kaggle's survey data for 2020 gives us an opportunity to find out what exactly the world out there is working on and with.

While the analysis may be long (bear with me, I am a series-fiction fan), I can guarantee you it's worth your time.
By the end of the analysis, you will get clarity on:
1. What is the community's choice across different job roles?
2. What are the opportunities you will receive if working for an SMB vs EA across different job roles?
3. What is the expected compensation for each job role as per your preferred location?


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style
import plotly.graph_objects as go

%matplotlib inline
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
questions = pd.DataFrame(data.iloc[0])
data.drop(index=0, inplace=True)
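
The first row of the responses file carries the question text rather than an answer, which is why it is split off above. A minimal sketch of the same split on a toy frame (the two columns and their values are made up for illustration, as are the names `raw`, `question_text` and `answers`):

```python
import pandas as pd

# Toy stand-in for the survey file: row 0 holds the question text
raw = pd.DataFrame({
    'Q1': ['What is your age (# years)?', '25-29', '18-21'],
    'Q2': ['What is your gender?', 'Man', 'Woman'],
})

# Keep the question text as a lookup, then drop it from the answers
question_text = raw.iloc[0]
answers = raw.drop(index=0).reset_index(drop=True)

print(question_text['Q1'])
print(len(answers))
```

Keeping the question row around makes it easy to look up what a column such as `Q1` actually asked.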

# Part A: Demographic Analysis

## Gender by Age group

**Key takeaways**
* The strength of the existing Kaggle community lies within the 18-29 age range
* There is strong diversity among survey respondents within the 18-29 age range
* Kaggle contests have an active, diverse reach among respondents aged 18-29

In [None]:
plt.figure(figsize=(10, 8))
ax = sns.countplot(x='Q1',hue='Q2',data=data, order = sorted(data['Q1'].value_counts().index))
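
The count plot shows absolute numbers; to compare the gender mix across age groups it can help to look at shares within each group. A small sketch using `pd.crosstab` on invented answers (the frame `toy` and its values are assumptions, not survey results):

```python
import pandas as pd

# Invented answers to the age (Q1) and gender (Q2) questions
toy = pd.DataFrame({
    'Q1': ['18-21', '18-21', '22-24', '22-24', '22-24', '25-29'],
    'Q2': ['Man', 'Woman', 'Man', 'Man', 'Woman', 'Man'],
})

# Row-normalised cross-tabulation: each age group sums to 100%
mix = pd.crosstab(toy['Q1'], toy['Q2'], normalize='index').mul(100).round(1)
print(mix)
```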

*Analysis of respondents'*
## Highest level of education [attained/to attain] x Gender

**Key takeaways**:
* ~70% of the population has, or is about to gain, a Master's and/or Bachelor's degree as their highest level of education

In [None]:
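# Treat missing answers and 'I prefer not to answer' as 'Others' (pd.isna is a reliable NaN test)
data['Q4'] = data['Q4'].apply(lambda x : 'Others' if pd.isna(x) or x == 'I prefer not to answer' else x)
fig_edu, ax = plt.subplots(figsize=(10,8)) 
ax = sns.countplot(x='Q4',hue='Q2',data=data)
# Short tick labels; these assume the categories appear in this order on the axis
labels = ['Doctoral','Masters','Bachelors','High school','Professional','Others','College/Univ. study']
ax.set_xticklabels(labels)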
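
A combined share such as the one quoted in the takeaway can be computed directly rather than read off the chart. A minimal sketch on invented data (the labels and the resulting percentage are illustrative only):

```python
import pandas as pd

# Invented education answers, shortened for readability
education = pd.Series(['Master', 'Bachelor', 'Doctoral', 'Master',
                       'Bachelor', 'Master', 'High school', 'Bachelor'])

# The mean of a boolean mask is the fraction of matching rows
share = education.isin(['Master', 'Bachelor']).mean() * 100
print(share)
```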

# Part B: Experience analysis

**Analysis of respondents**
## Job Role x Gender x Age Group

As shown above, the analysis is biased towards the male student population. Students enrolled in a Master's programme, followed by those enrolled in a Bachelor's programme or recent graduates, potentially dominate the current survey. To evaluate the preferences of working professionals, the following heatmaps clarify respondents' current designations.

* Assuming that age group increases in direct proportion to experience, we have very few respondents from leadership positions.
* The Product/Project Manager role has a very weak response rate. The preferences that we will see going forward will mostly come from developers.
* The gender breakdown indicates the potential outreach of Kaggle contests to different sections of the community.

In [None]:
data['Q2'] = data['Q2'].apply(lambda x : 'Others' if x not in ['Man','Woman'] else x)
fig, (ax1, ax2, ax3) = plt.subplots(3,1,figsize=(12,18))

df = pd.pivot_table(data[data.Q2=='Man'],index=['Q5'],columns=['Q1'],aggfunc='size', fill_value=0)
ax1 = sns.heatmap(df,cbar=True, cmap='BuGn', ax=ax1, annot=False, fmt='.1f')

df2 = pd.pivot_table(data[data.Q2=='Woman'],index=['Q5'],columns=['Q1'],aggfunc='size', fill_value=0)
ax2 = sns.heatmap(df2,cbar=True, cmap='Purples', ax=ax2, annot=False, fmt='.1f')

df3 = pd.pivot_table(data[data.Q2=='Others'],index=['Q5'],columns=['Q1'],aggfunc='size', fill_value=0)
ax3 = sns.heatmap(df3,cbar=True, cmap='YlOrBr', ax=ax3, annot=False, fmt='.1f')

ax1.title.set_text('Man')
ax2.title.set_text('Woman')
ax3.title.set_text('Others')


fig.subplots_adjust(hspace=0.5)
fig.text(0.3, 0.9, 'Age x Current Designation', fontweight='bold', fontfamily='serif', fontstretch= 'semi-expanded',fontsize=16)

## **Experience in Machine Learning**: 
*working professionals vs students & others*

1. Software developers have shown interest in ML and are currently in a ramp-up zone
2. Beyond the expected data-domain profiles, we have product/project engineers with some hands-on experience in Machine Learning
3. While a lot of students have shown interest in the survey, very few have hands-on experience in Machine Learning
4. Data Analysts & Data Engineers in the respondent pool also have relatively high experience in Machine Learning. 

There is an opportunity to apply Machine Learning in analyst & data engineering job roles as well, not only in data scientist or ML engineering roles.


In [None]:
data.rename(columns={'Q5':'Current Designation'}, inplace=True)
mle = pd.pivot_table(data=data, columns='Current Designation', index='Q15', aggfunc='size')

experience = ['I do not use machine learning methods','Under 1 year','1-2 years','2-3 years','3-4 years','4-5 years','5-10 years','10-20 years','20 or more years']
mapping = {exp: i for i, exp in enumerate(experience)}

mle.rename_axis(index={'Q15':'Machine Learning experience'}, inplace=True)
key = mle.index.map(mapping)
mle = mle.iloc[key.argsort()]
mle[['Student','Other','Currently not employed']].plot(kind='barh', stacked=False, cmap='icefire', figsize=(8,6), width=0.75, use_index=True)

mle.drop(['Student','Other','Currently not employed'], axis=1, inplace=True)
mle.plot(kind='barh', stacked=True, figsize=(8,6), cmap = 'tab20b')
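
The cell above orders the experience buckets through a position mapping and `argsort`. An alternative sketch, assuming an ordered categorical index is acceptable, lets pandas handle the ordering (the counts below are invented and only a subset of the buckets is used):

```python
import pandas as pd

# Subset of the experience buckets, in their natural order
order = ['Under 1 year', '1-2 years', '2-3 years', '5-10 years', '10-20 years']

# Invented counts, deliberately out of order
counts = pd.Series({'10-20 years': 4, '1-2 years': 30, 'Under 1 year': 55,
                    '5-10 years': 9, '2-3 years': 17})

# An ordered categorical index sorts by bucket order, not alphabetically
counts.index = pd.CategoricalIndex(counts.index, categories=order, ordered=True)
counts = counts.sort_index()
print(list(counts.index))
```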

#### **R&Rs of different job roles**

**Key Takeaways:**
* As a Product/Project Manager, Data Analyst or Data Scientist, your primary responsibility will be to derive insights from data to influence product or business decisions
* The primary responsibility of a Research Scientist is to do research that advances the state of the art of ML
* As a Data Engineer, you will need to excel equally across all the activities, and you will also need infrastructure knowledge for storing, analyzing and operationalizing data

In [None]:
activity = {}
jobs = ['Business Analyst','Data Analyst','DBA/Database Engineer','Data Engineer','Statistician','Data Scientist','Machine Learning Engineer','Software Engineer','Research Scientist','Product/Project Manager','Other','Currently not employed']

# Each Q23 column holds a single distinct answer; use it as the column label
for i in data.filter(like='Q23').columns:
    answers = data[i].dropna().unique()
    activity[i] = answers[0].strip() if len(answers) else i

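a = data.groupby('Current Designation').count().filter(like='Q23')
a.rename(columns=activity, inplace=True)
# groupby sorts roles alphabetically; reindex so the rows match the `jobs` order used on the x-axis
a = round(100 * a/a.sum()).reindex(jobs)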
fig = go.Figure(
    data=[
        go.Bar(
            name="Analyse data for prod/business decisions",
            x=jobs,
            y=a['Analyze and understand data to influence product or business decisions'],
            #y = a.columns,
            offsetgroup=0,
        ),
        go.Bar(
            name="Optimization of product/workflow using ML",
            x=jobs,
            y=a['Build and/or run a machine learning service that operationally improves my product or workflows'],
            #y = a.columns,
            offsetgroup=1,
        ),
        go.Bar(
            name="Development for new areas of applied ML",
            x=jobs,
            y=a['Build prototypes to explore applying machine learning to new areas'],
            #y = a.columns,
            offsetgroup=1,
        ),
        go.Bar(
            name="Research",
            x=jobs,
            y=a['Do research that advances the state of the art of machine learning'],
            #y = a.columns,
            offsetgroup=1,
        ),
        go.Bar(
            name="Optimization of existing ML models",
            x=jobs,
            y=a['Experimentation and iteration to improve existing ML models'],
            #y = a.columns,
            offsetgroup=1,
        ),
        go.Bar(
            name="Data infrastructure",
            x=jobs,
            y=a['Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data'],
            #y = a.columns,
            offsetgroup=0,
        ),
        go.Bar(
            name="Others",
            x=jobs,
            y=a['Other'],
            #y = a.columns,
            offsetgroup=2,
        ),
        go.Bar(
            name="None",
            x=jobs,
            y=a['None of these activities are an important part of my role at work'],
            offsetgroup=2,
        )
    ],
    layout=go.Layout(
        title="Roles & Responsibilities per job role",
        yaxis_title="% distribution",
        xaxis_title = "Job Roles",
        barmode='stack'
    )
)

fig.show()
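
The cell above divides each activity column by its column total, so every activity sums to 100% across job roles. If the question is instead how one role splits its time across activities, the rows can be normalised. A sketch of both variants on invented counts (whether the row view suits the takeaways better is a judgment call):

```python
import pandas as pd

# Invented response counts per role and activity
counts = pd.DataFrame(
    {'Analyse data': [30, 10], 'Research': [10, 30], 'Data infrastructure': [10, 10]},
    index=['Data Analyst', 'Research Scientist'],
)

# Column-normalised: each activity sums to 100 across roles
by_activity = 100 * counts / counts.sum()

# Row-normalised: each role sums to 100 across activities
by_role = 100 * counts.div(counts.sum(axis=1), axis=0)

print(by_activity.round(1))
print(by_role.round(1))
```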

#### Compensation distribution by job role & company's employee strength
* There is very little visible disparity in compensation between companies of different sizes
* The data shows no clear correlation between company size and compensation. This holds across all job roles
* Experience in Machine Learning can give you a competitive edge in the job roles of 'Data Engineer', 'Research Scientist', 'Software Engineer', 'Statistician', 'DBA/Database Engineer' [assuming the data is representative of actual pay scales]

In [None]:
data.rename(columns={'Q20':'Employer strength','Q24':'Salary range'}, inplace=True)

experience = ['I do not use machine learning methods','Under 1 year','1-2 years','2-3 years','3-4 years','4-5 years','5-10 years','10-20 years','20 or more years']
q20_order = ['0-49 employees','50-249 employees','250-999 employees','1000-9,999 employees','10,000 or more employees']
q24_order = ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999',
             '10,000-14,999','15,000-19,999', '20,000-24,999', '25,000-29,999', '30,000-39,999', '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999', '80,000-89,999', '90,000-99,999',
             '100,000-124,999', '125,000-149,999',  '150,000-199,999', '200,000-249,999',  '250,000-299,999', '300,000-500,000', '> $500,000']


exp = data.groupby(['Current Designation','Q15'])['Salary range'].value_counts().unstack().reset_index()

c = data.groupby(['Current Designation','Employer strength'])['Salary range'].value_counts().unstack().reset_index()

def plot_pivot_heatmap(df, col, x_values, y_values):
    """Draw one salary heatmap per job role, switchable through a dropdown."""
    roles = list(df[col].unique())
    fig = go.Figure()
    for i, role in enumerate(roles):
        # unstack() returns rows and columns sorted alphabetically, so reorder
        # them to line up with the axis labels before plotting
        z = (df[df[col] == role]
             .set_index(df.columns[1])
             .reindex(index=y_values, columns=x_values))
        fig.add_trace(
            go.Heatmap(
                x = x_values,
                y = y_values,
                z = z.values,
                colorscale = 'ylgnbu',
                showscale=True,
                visible=(i == 0)  # show only the first role until one is picked
                )
            )

    # One button per trace, built from the same role list so that labels
    # and visibility masks cannot drift apart
    buttons = [
        dict(
            label = role, method = "update",
            args = [{"visible": [r == role for r in roles]}, {"title": role}]
        )
        for role in roles
    ]

    fig.update_layout(
        updatemenus=[go.layout.Updatemenu(
            type = "dropdown", direction = "down", active = 0, x = 0.1, y = 1.2,
            buttons = buttons
        )],
        title={
          'text': roles[0],
          'y':0.9,
          'x':0.5,
          'xanchor': 'center',
          'yanchor': 'top'})
    fig.show()

plot_pivot_heatmap(c, 'Current Designation', q24_order, q20_order)
plot_pivot_heatmap(exp, 'Current Designation' , q24_order, experience)
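
The claim about company size and compensation can also be checked numerically instead of by eye. A rough sketch, assuming each salary bucket can be represented by its midpoint and company sizes by their rank; `salary_midpoint` is a hypothetical helper and the rows are invented:

```python
import re
import pandas as pd

def salary_midpoint(label):
    """Midpoint of a salary bucket; an open-ended bucket maps to its single bound."""
    numbers = [int(n.replace(',', '')) for n in re.findall(r'\d[\d,]*', label)]
    return sum(numbers) / len(numbers)

# Invented respondents: company-size rank (0 = smallest) and salary bucket
toy = pd.DataFrame({
    'size_rank': [0, 1, 2, 3, 4, 4],
    'salary': ['1,000-1,999', '$0-999', '10,000-14,999',
               '1,000-1,999', '> $500,000', '$0-999'],
})
toy['midpoint'] = toy['salary'].map(salary_midpoint)

# Spearman's rho, computed as the Pearson correlation of the ranks
rho = toy['size_rank'].rank().corr(toy['midpoint'].rank())
print(round(rho, 2))
```

A value of `rho` near zero would be consistent with the takeaway; bucket midpoints are a coarse approximation, so treat the result as indicative only.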

# Part C: Choices & Preferences

## Language preferences
*based on coding experience*

* Python has emerged as a clear favorite among developers, especially those with 1-2 and 3-5 years of coding experience.
* The next favorite is SQL, followed by R.
* As respondents' coding experience increases, the diversity of their preferred languages also increases.

In [None]:
language = ['Python','R','SQL','C','C++','Java','Javascript','Julia','Swift','Bash','MATLAB','None','Other']
group1 = ['Machine Learning Engineer', 'Data Scientist', 'Statistician']
group2 = ['Data Analyst', 'Business Analyst', 'Data Engineer']
group3 = ['Research Scientist', 'Product/Project Manager']
group4 = ['Student','Other','Currently not employed']

cols = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10', 'Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER']
data.rename(columns=dict(zip(cols, language)), inplace=True)

col_map = {'I have never written code':-1, '< 1 years':1, '1-2 years':2, '3-5 years':3, '5-10 years':4, '10-20 years':5, '20+ years':6}
col_map_r = {v: k for k, v in col_map.items()}

# Assign the result back; replace(inplace=True) on a selected column may not update `data`
data['Q6'] = data['Q6'].replace(col_map)
data.rename(columns={'Q6':'Coding Experience'}, inplace=True)

lang = pd.pivot_table(data=data, columns='Coding Experience', values=language, aggfunc='count')
lang.rename(columns=col_map_r, inplace=True)

lang.plot(kind='bar', stacked=True, rot=30, width=0.75, cmap = 'YlGnBu', figsize=(10,8))
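
One way to put a number on the "diversity increases with experience" observation is to count how many languages each respondent selected and average that per experience bucket. A sketch on an invented frame, where a non-null cell means the language was selected:

```python
import pandas as pd

# Invented multi-select answers
toy = pd.DataFrame({
    'Coding Experience': ['< 1 years', '< 1 years', '5-10 years', '5-10 years'],
    'Python': ['Python', 'Python', 'Python', 'Python'],
    'SQL': [None, None, 'SQL', 'SQL'],
    'R': [None, None, None, 'R'],
})

# Number of languages each respondent selected (non-null cells per row)
toy['n_languages'] = toy[['Python', 'SQL', 'R']].notna().sum(axis=1)

# Average number of languages per experience bucket
avg = toy.groupby('Coding Experience')['n_languages'].mean()
print(avg)
```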

## **Popularity of IDE** 
*based on job role*

In [None]:
# Map each Q9 column to the IDE it represents
ide = {'Q9_Part_1':'JupyterLab','Q9_Part_2':'RStudio','Q9_Part_3':'Visual Studio','Q9_Part_4':'VSCode', 'Q9_Part_5':'PyCharm','Q9_Part_6':'Spyder','Q9_Part_7':'Notepad++','Q9_Part_8':'Sublime text','Q9_Part_9':'Vim, Emacs, similar','Q9_Part_10':'MATLAB','Q9_Part_11':'None','Q9_OTHER':'Other'}

# Drop respondents without a designation; dropna(inplace=True) on a selected column would leave `data` unchanged
data.dropna(subset=['Current Designation'], inplace=True)

In [None]:
def ide_pivot(data):
  # Count non-null answers per IDE column for each job role
  pv = pd.pivot_table(data=data,index='Current Designation', values=list(data.filter(like='Q9').columns), aggfunc='count')
  # Rename by full column name: str.strip('Q9_Part_') removes a set of characters and would also eat the '9' of 'Q9_Part_9'
  return pv.rename(columns=ide)

def draw_ide_map(data, group, ax):
  pv = ide_pivot(data)
  pv[pv.index.isin(group)].plot(kind='bar', figsize=(15,18), ax=ax, rot=15, cmap='tab20c')

fig_nb, ((idep1, idep2), (idep3, idep4)) = plt.subplots(2,2, figsize=(50,40))
draw_ide_map(data, group1, idep1)
draw_ide_map(data, group2, idep2)
draw_ide_map(data, group3, idep3)
draw_ide_map(data, group4, idep4)

* JupyterLab is a favorite among all job roles
* JupyterLab > PyCharm > RStudio > VSCode for Data Scientists, Machine Learning Engineers and Statisticians
* RStudio, VSCode and PyCharm are the next most popular choices across the whole population
* Tough competitors to the above IDEs: Notepad++ & Spyder

As a professional in this field, you will mostly be working with JupyterLab, followed by PyCharm and VSCode. As a statistician, having hands-on experience with RStudio is recommended.

In [None]:
sns.clustermap(ide_pivot(data), cmap='icefire', col_cluster=False)
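
A note on turning column names such as `Q9_Part_9` into short labels: `str.strip` treats its argument as a set of characters, not as a prefix, so it can remove more than intended. The sketch below contrasts it with `str.removeprefix` (available from Python 3.9); the column list is a small illustrative subset:

```python
columns = ['Q9_Part_1', 'Q9_Part_9', 'Q9_Part_10', 'Q9_OTHER']

# str.strip removes any of the characters Q, 9, _, P, a, r, t from both ends
stripped = [c.strip('Q9_Part_') for c in columns]

# str.removeprefix removes the literal prefix only, and only if present
trimmed = [c.removeprefix('Q9_Part_') for c in columns]

print(stripped)
print(trimmed)
```

With `strip`, the name `Q9_Part_9` collapses to an empty string because its trailing `9` is part of the character set, which is why mapping on the full column name is the safer choice.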

## **Frameworks, Algorithms, Hardware & Notebook Product** preferences

* **Frameworks**: Scikit-learn, TensorFlow, Keras, PyTorch and XGBoost are the popular applied frameworks, irrespective of job role.
* **ML Algorithms**: Contrary to popular belief, Linear/Logistic Regression and Decision Trees, followed by CNNs, are more commonly used than other Deep Learning algorithms
* **Visualization Libraries**: Matplotlib, Seaborn, Plotly & ggplot are the most favored visualization tools, in that order
* **Notebook Products**: Kaggle & Colab notebooks are emerging favorites among users.
* **Specialised Hardware**: While the majority of the population does not use any special hardware, GPUs are more popular among Students and Data Scientists than among other job roles


In [None]:
def draw_all(col):
  # Each multi-select column holds a single distinct answer; use it as the column label
  xdict = {}
  for i in data.filter(like=col).columns:
    answers = data[i].dropna().unique()
    xdict[i] = answers[0].strip() if len(answers) else i

  pv = pd.pivot_table(data,index='Current Designation',values=data.filter(like=col), aggfunc='count')
  pv.rename(columns=xdict, inplace=True)

  return pv

for j in ['Q16','Q17','Q14','Q10', 'Q12']:
  g= sns.clustermap(data=draw_all(j), cmap='icefire', col_cluster=False)
  title = {'Q16':'Frameworks', 'Q17': 'ML Algorithms', 'Q10':'Notebook Products','Q14': 'Visualization Libraries','Q12':'Specialised Hardware'}
  g.fig.suptitle(title[j],fontweight='bold', fontfamily='serif', fontstretch= 'semi-expanded',fontsize=16)
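
Cluster maps show the overall pattern; a short ranked list per role is sometimes easier to quote. A sketch that picks the top two entries per row of a count table (the counts are invented and do not come from the survey):

```python
import pandas as pd

# Invented usage counts per role and framework
counts = pd.DataFrame(
    {'Scikit-learn': [90, 40], 'TensorFlow': [60, 55],
     'PyTorch': [35, 50], 'XGBoost': [50, 10]},
    index=['Data Scientist', 'Research Scientist'],
)

# For each role, keep the names of the two most-selected frameworks
top2 = {role: row.nlargest(2).index.tolist() for role, row in counts.iterrows()}
print(top2)
```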

## Computing platform preference & TPU usage frequency

* Irrespective of work experience, a personal computer/laptop is the go-to computing environment for over 86% of the population
* Only about 10-15% of developers have used TPUs between 1 and 5 times to date

In [None]:
df = data.filter(['Q11','Current Designation']).rename(columns={'Q11':'Computing Platform'})
df['Current Designation'] = df['Current Designation'].apply(lambda x: 'Working Professionals' if x not in ['Student','Currently not employed', 'Other'] else x)
#grouped = df.value_counts(['Computing Platform','Current Designation']).groupby('Computing Platform')

grouped = df.groupby(['Computing Platform', 'Current Designation']).size().unstack()
grouped = (grouped/np.sum(grouped, axis=0)*100).round(2)
grouped.plot(kind='barh', figsize=(8,6), title="Computing Platform preference")
#df.value_counts(['Computing Platform'], ascending=False).plot(kind='barh')

df2 = data.filter(['Q13','Current Designation']).rename(columns={'Q13':'TPU usage'})
# Group every role other than students, unemployed and 'Other' under one label
df2['Current Designation'] = df2['Current Designation'].apply(lambda x: 'Working Professionals' if x not in ['Student','Currently not employed', 'Other'] else x)
grouped2 = df2.groupby(['TPU usage', 'Current Designation']).size().unstack()
grouped2 = (grouped2/np.sum(grouped2, axis=0)*100).round(2)
grouped2.plot(kind='barh', figsize=(8,6), title="TPU usage frequency")

## Cloud platform & product preferences
Ranking of platforms and products based on the responses received
<table>
<thead>
  <tr>
    <th>Cloud Platforms</th>
    <th>Cloud Products</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td rowspan="3">AWS</td>
    <td>EC2</td>
  </tr>
  <tr>
    <td>Lambda</td>
  </tr>
  <tr>
    <td>ECS</td>
  </tr>
  <tr>
    <td rowspan="4">GCP</td>
    <td>Compute Engine</td>
  </tr>
  <tr>
    <td>Functions</td>
  </tr>
  <tr>
    <td>App Engine</td>
  </tr>
  <tr>
    <td>Cloud Run</td>
  </tr>
  <tr>
    <td rowspan="3">Azure</td>
    <td>Cloud services</td>
  </tr>
  <tr>
    <td>Container instances</td>
  </tr>
  <tr>
    <td>Functions</td>
  </tr>
</tbody>
</table>

In [None]:
prefix = ('Q26_A','Q27_A','Current')
clp = data.loc[:,list(filter(lambda x: x.startswith(prefix), data.columns))]
#clp['all'] = 'all'

def rename_columns(df, idx, col):
  # Count non-null answers per column, using the frame passed in rather than the global `clp`
  pvt = pd.pivot_table(df, index=idx, values=list(df.columns[df.columns.str.startswith(col)]), aggfunc='count')
  xdict = {}
  for i in pvt.filter(like=col).columns:
    # Each multi-select column holds a single distinct answer; use it as the label
    answers = df[i].dropna().unique()
    xdict[i] = answers[0].strip() if len(answers) else i

  pvt.rename(columns=xdict, inplace=True)

  return pvt

pvt = rename_columns(clp,'Current Designation','Q26_A')
pvt.T.plot(kind='barh',figsize=(10,8), stacked=True, xlabel='Cloud Platforms')

pvt2 = rename_columns(clp,'Current Designation','Q27_A')
pvt2.T.plot(kind='barh',figsize=(10,8), stacked=True, xlabel='Cloud Products')
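
A ranking like the one in the table above can be derived from a count table by summing over job roles and sorting. A minimal sketch with invented numbers (the real totals come from the pivots built in the previous cell):

```python
import pandas as pd

# Invented response counts per role and cloud platform
counts = pd.DataFrame(
    {'AWS': [40, 25], 'GCP': [30, 20], 'Azure': [20, 15]},
    index=['Data Scientist', 'Software Engineer'],
)

# Total responses per platform, highest first
ranking = counts.sum().sort_values(ascending=False)
print(ranking)
```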

# Executive Summary
<table>
<thead>
  <tr>
    <th>Metrics</th>
    <th colspan="5">Popular choice awards</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Language</td>
    <td>Python</td>
    <td>R</td>
    <td>SQL</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>IDE</td>
    <td>JupyterLab</td>
    <td>PyCharm</td>
    <td>RStudio</td>
    <td>VSCode</td>
    <td></td>
  </tr>
  <tr>
    <td>Frameworks</td>
    <td>Scikit-learn</td>
    <td>Tensorflow</td>
    <td>Keras</td>
    <td>PyTorch</td>
    <td>XGBoost</td>
  </tr>
  <tr>
    <td>Algorithms</td>
    <td>Linear/Logistic Regression</td>
    <td>Decision Tree</td>
    <td>CNN</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>Visualization Libraries</td>
    <td>Matplotlib</td>
    <td>Seaborn</td>
    <td>Plotly</td>
    <td>ggplot</td>
    <td></td>
  </tr>
  <tr>
    <td>Computing platform</td>
    <td colspan="5">Personal Laptop / Computer</td>
  </tr>
  <tr>
    <td>Cloud Platforms</td>
    <td>AWS</td>
    <td>GCP</td>
    <td>Azure</td>
    <td></td>
    <td></td>
  </tr>
</tbody>
</table>