This is my attempt to analyse the skills required for a specific job title/profile in the data science community.

Example: If a student is planning to become a Machine Learning Engineer, she/he can see which skills are required for that position and work on mastering them. The same goes for a data analyst who is planning to become a data engineer or a data scientist.

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display, clear_output, HTML
import ipywidgets as widgets
from ipywidgets import Button, Layout

import plotly.graph_objects as go

import os

os.chdir('/kaggle/input')

pd.options.display.max_columns = 100

In [None]:
data = pd.read_csv('kaggle-survey-2020/kaggle_survey_2020_responses.csv')

# What does the current title distribution look like?

In [None]:
# Row 0 holds the question text, so skip it before counting job titles
title_counts = data['Q5'][1:].value_counts()
labels = title_counts.index
values = title_counts.values

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.6)])
fig.show()

# What should I Learn to transition from career X to Y

This notebook answers the following two questions from the data.

1. Are there skills that are specific to a job title/profile?

2. What are the common skills across job titles?

As a result, one can use this notebook to identify the immediate skills (actions) to learn when aiming for a specific job title.

In [None]:
cols = data.columns
multi_choice_qns = [i for i in cols if 'Part' in i]

# Map each multi-select column to its answer text
# (the first non-null value found in that column)
multi_choice_map = {}
for c in multi_choice_qns:
    multi_choice_map[c] = data[c][1:].dropna().unique()[0].strip()

In [None]:
qns_map = {'Q7': 'Programming Languages', 'Q9': 'IDE', 'Q10': 'Notebooks', 'Q14': 'Visualization Libraries',
           'Q16': 'ML Frameworks', 'Q17': 'ML Algorithms', 'Q23': 'Role at work', 'Q26_A': 'Cloud Platform',
           'Q27_A': 'Cloud Products', 'Q28_A': 'ML Products', 'Q29_A': 'Big Data Products', 'Q31_A': 'BI Tool',
           'Q33_A': 'Automated ML Techniques', 'Q34_A': 'AutoML Tools', 'Q35_A': 'ML Experiment Tool',
           'Q36': 'Public share platform', 'Q37': 'Learning platform', 'Q39': 'Media Source for DS'}

multi_qns = qns_map.keys()

# Skills are categorized as ***Must have*** and ***Good to have***:

**Must have** : More than 50% of the target group has the skill

**Good to have** : Less than 50% of the target group has the skill, but it is the top skill in its category


Top skills (answers) are obtained for all multiple-choice questions that allow multiple selections

**Note**: There are also questions such as media sources for DS, regular learning platforms, etc. Though the answers to them are not skills, I am broadly treating them as skills
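
To make the must-have / good-to-have rule concrete, here is a minimal, dependency-free sketch on made-up answers. The `classify` helper, its `threshold` parameter and the sample lists are illustrative assumptions, not part of the survey data or of the notebook's own code, which applies the same idea with pandas.

```python
from collections import Counter

def classify(respondents, threshold=0.5):
    """Split one category's answers into must-have and good-to-have.

    respondents: list of sets, one set of selected answers per person
    (hypothetical input format; the survey stores one column per answer).
    """
    counts = Counter(answer for picks in respondents for answer in picks)
    total = len(respondents)
    shares = {answer: n / total for answer, n in counts.items()}
    # Must have: every answer selected by more than `threshold` of the group.
    must = sorted((a for a, s in shares.items() if s > threshold),
                  key=lambda a: -shares[a])
    if must:
        return must, None
    # Otherwise fall back to the single most popular answer.
    good_to = max(shares, key=shares.get) if shares else None
    return must, good_to

# Made-up group of four respondents, for illustration only.
languages = [{"Python", "SQL"}, {"Python"}, {"Python", "R"}, {"SQL", "R"}]
ides = [{"VS Code"}, {"PyCharm"}, {"Jupyter"}, {"VS Code", "Vim"}]

print(classify(languages))  # Python is selected by 3 of 4 respondents
print(classify(ides))       # no answer exceeds half of the group
```

In the second call no answer crosses the threshold, so the most popular one is reported as good to have instead.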

In [None]:
def get_top_skills(df):
    must = {}
    good_to = {}
    
    #Education
    top_education = df['Q4'].value_counts(normalize=True).index[0]
    top_education_pct = df['Q4'].value_counts(normalize=True).iloc[0]
    if top_education_pct >= 0.5:
        must['Highest education'] = top_education
    else:
        good_to['Highest education'] = top_education
    for q in multi_qns:
        topic = qns_map[q]
        ans_cols = [c for c in cols if c.startswith(q)]
        pct_ans = df[ans_cols].notnull().sum()/df.shape[0]
        pct_ans.sort_values(ascending=False, inplace=True)
        top_skills = pct_ans[pct_ans > 0.5].index.map(multi_choice_map)
        if len(top_skills) > 0:
            must[topic] = top_skills
        else:
            good_to[topic] = multi_choice_map[pct_ans.index[0]]
    
    #Compute platform
    top_compute_platform = df['Q11'].value_counts(normalize=True).index[0]
    top_compute_platform_pct = df['Q11'].value_counts(normalize=True).iloc[0]
    if top_compute_platform_pct >= 0.5:
        must['Computing platform'] = top_compute_platform
    else:
        good_to['Computing platform'] = top_compute_platform
    
    #Analytical Tool
    top_primary_tool = df['Q38'].value_counts(normalize=True).index[0]
    top_primary_tool_pct = df['Q38'].value_counts(normalize=True).iloc[0]
    if top_primary_tool_pct >= 0.5:
        must['Primary Analytical tool'] = top_primary_tool
    else:
        good_to['Primary Analytical tool'] = top_primary_tool
    return must, good_to

IPython widgets (ipywidgets) are used to render the output. Due to the nature of Kaggle, other tools such as Dash or Tkinter could not be used.

In [None]:
class buttons(object):
    def __init__(self, n, skill_type, value):
        self.n = n
        self.type = skill_type
        self.value = value
        #display(HTML("<h1> Heading </h1>"))
    
    def create_grid(self, val):
        idx = 0
        row = []
        keys = list(val.keys())
        values = list(val.values())
        for i in range(self.n):
            header_button = widgets.Button(description = keys[i], button_style='warning', tooltip = keys[i],layout=Layout(height= '30px', width='300px'))
            if isinstance(values[i], (pd.Index, list)):
                length = len(values[i])
                if length == 1:
                    #print(values[i][0])
                    if values[i][0] in ['None', 'No / None', 'None of these activities are an important part of my role at work']:
                        value_button = [widgets.Button(description = values[i][0], tooltip = values[i][0], layout=Layout(height= '60px', width='300px', margin='2px 2px 30px 2px'))]
                    else:    
                        value_button = [widgets.Button(description = values[i][0], button_style='info', tooltip = values[i][0], layout=Layout(height= '60px', width='300px', margin='2px 2px 30px 2px'))]
                    row.append(widgets.VBox([header_button, *value_button]))
                    
                    continue
                new_height = str(60/length)
                value_button = []
                for v in range(length):
                    if v ==0:
                        value_button.append(widgets.Button(description = values[i][v], button_style='success', tooltip = values[i][v], layout=Layout(height= new_height +'px', width='300px', margin='2px 2px 0px 2px')))
                    elif v == (length-1):
                        value_button.append(widgets.Button(description = values[i][v], button_style='success', tooltip = values[i][v], layout=Layout(height= new_height +'px', width='300px', margin='0px 2px 2px 2px')))
                    else:
                        value_button.append(widgets.Button(description = values[i][v], button_style='success', tooltip = values[i][v], layout=Layout(height= new_height +'px', width='300px', margin='0px 2px 0px 2px')))
            else:
                if values[i] in ['None', 'No / None', 'None of these activities are an important part of my role at work']:
                    value_button = [widgets.Button(description = values[i], tooltip = values[i], layout=Layout(height= '60px', width='300px', margin='2px 2px 30px 2px'))]
                else: 
                    value_button = [widgets.Button(description = values[i], button_style='info', tooltip = values[i], layout=Layout(height= '60px', width='300px', margin='2px 2px 30px 2px'))]
            
            row.append(widgets.VBox([header_button, *value_button]))
    
        elems = []
        for l in range(0,self.n,3):
            elems.append(widgets.HBox(row[l:l+3]))
        
        if self.type == 'must_have':
            self.header = widgets.HTML(description='',value='<h1> Must have skills for {}</h1>'.format(self.value))
        elif self.type == 'good_to_have':
            self.header = widgets.HTML(description='',value='<h1> Good to have skills for {}</h1>'.format(self.value))
        
        self.everything = widgets.VBox([self.header, *elems])
        display(self.everything)
    
    def show_legend(self):
        legends = widgets.HTML(description='', value='<h3> Legends </h3>')
        header_button = widgets.Button(description='', button_style='warning', layout=Layout(height= '20px', width='100px', margin='0px 0px 0px 0px'))
        header_desc = widgets.HTML(description='', value='Question')
        legend0 = widgets.HBox([header_button, header_desc])
        value_button1 = widgets.Button(description='', button_style='success', layout=Layout(height= '20px', width='100px', margin='0px 0px 0px 0px'))
        value_desc1 = widgets.HTML(description='', value='Multiple preference among the group')
        legend1 = widgets.HBox([value_button1, value_desc1])
        value_button2 = widgets.Button(description='', button_style='info', layout=Layout(height= '20px', width='100px', margin='0px 0px 0px 0px'))
        value_desc2 = widgets.HTML(description='', value='Single preference among the group')
        legend2 = widgets.HBox([value_button2, value_desc2])
        value_button3 = widgets.Button(description='', layout=Layout(height= '20px', width='100px', margin='0px 0px 0px 0px'))
        value_desc3 = widgets.HTML(description='', value='No preference')
        legend3 = widgets.HBox([value_button3, value_desc3])
        display(widgets.VBox([legends, legend0, legend1, legend2, legend3]))

Obtain the distinct roles and remove values that are not career-change targets, e.g. Student, Other and Currently not employed

In [None]:
roles = data['Q5'][1:].dropna().unique()
roles = np.delete(roles, np.argwhere((roles == 'Student') | (roles == 'Other') | (roles == 'Currently not employed')))

In [None]:
display(widgets.HTML(description='',value='<h1> Select Desired Role: </h1>'))

w = widgets.Dropdown(
    options= roles,
    value='Data Scientist',
    description='',
    style={'description_width': 'initial'}
)

def on_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        print("Selected role is %s" % change['new'])
        new_data = data[data['Q5'] == change['new']]
        must, good_to = get_top_skills(new_data)
        clear_output()
        display(widgets.HTML(description='',value='<h1> Select Desired Role: </h1>'))
        display(w)
        n1 = len(must.keys())
        bt1 = buttons(n1, 'must_have', change['new'])
        bt1.create_grid(must)
        
        n2 = len(good_to.keys())
        bt2 = buttons(n2, 'good_to_have', change['new'])
        bt2.create_grid(good_to)
        
        bt2.show_legend()
        
w.observe(on_change)

display(w)


**Note**: Kaggle notebooks don't render output interactively based on the dropdown input. Clone the notebook and change the dropdown selection to obtain detailed results.

# Compare two profiles: Data Scientist vs Statistician

In [None]:
ds = data[data['Q5'] == 'Data Scientist']
ds_must_have, ds_good_to_have = get_top_skills(ds)

stat = data[data['Q5'] == 'Statistician']
stat_must_have, stat_good_to_have = get_top_skills(stat)

In [None]:
n1 = len(ds_must_have.keys())
bt1 = buttons(n1, 'must_have', 'Data Scientist')
bt1.create_grid(ds_must_have)

In [None]:
stat_n1 = len(stat_must_have.keys())
stat_bt1 = buttons(stat_n1, 'must_have', 'Statistician')
stat_bt1.create_grid(stat_must_have)

In [None]:
n2 = len(ds_good_to_have.keys())
bt2 = buttons(n2, 'good_to_have', 'Data Scientist')
bt2.create_grid(ds_good_to_have)

In [None]:
stat_n2 = len(stat_good_to_have.keys())
stat_bt2 = buttons(stat_n2, 'good_to_have', 'Statistician')
stat_bt2.create_grid(stat_good_to_have)

**Key Outcomes:**
1. **Language**: Python is the most used programming language among Data Scientists, while it is R among Statisticians
2. **Education**: A Master's degree is the most common highest education in both groups; however, its share is more than 50% in the Data Scientist group
3. **Visualization**: Data Scientists rely heavily on matplotlib & seaborn, whereas Statisticians use ggplot/ggplot2
4. **ML Algorithms**: Data Scientists mainly use three categories of ML algorithms, i.e. Linear/Logistic regression, tree-based & boosting algorithms. Statisticians rely on Linear/Logistic regression
5. **Role**: While 'Analyze and understand data to influence product/business decision' remains a common goal, Data Scientists also take up the additional role 'Build prototypes to explore applying ML to new areas'


# Common skills across titles/profiles

This code considers only the topmost skill. E.g. if both Python & SQL are used by a majority of the group, it picks the one that is most popular within the group.

In [None]:
def get_top_skill(df):
    skills = {}
    # Education: most common answer in the group
    skills['Highest education'] = df['Q4'].value_counts().index[0]
    for q in multi_qns:
        topic = qns_map[q]
        ans_cols = [c for c in cols if c.startswith(q)]
        # Share of the group that selected each answer (non-null = selected)
        pct_ans = df[ans_cols].notnull().sum() / df.shape[0]
        pct_ans = pct_ans.sort_values(ascending=False)
        skills[topic] = multi_choice_map[pct_ans.index[0]]
    # Computing platform and primary tool: most common answer in the group
    skills['Computing platform'] = df['Q11'].value_counts().index[0]
    skills['Primary tool'] = df['Q38'].value_counts().index[0]
    return skills

Top skills for each category across job titles

In [None]:
out = pd.DataFrame()
for r in roles:
    d = data[data['Q5'] == r]
    d_skill = get_top_skill(d)
    out = pd.concat([out, pd.DataFrame(d_skill, index=[r])])

out.head()

In [None]:
common_skills = out.nunique()[out.nunique() == 1].index
out.loc[:,common_skills].drop_duplicates()
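
The filter above keeps only the categories in which every role has the same top answer (`nunique() == 1`). A plain-Python sketch of that idea follows; the `common_categories` helper and the `top_by_role` table are hypothetical stand-ins for `out`, with made-up values.

```python
def common_categories(top_by_role):
    """Return {category: answer} for categories where all roles agree."""
    roles = list(top_by_role)
    if not roles:
        return {}
    common = {}
    # Use the first role as the reference and check every role against it.
    for category, answer in top_by_role[roles[0]].items():
        if all(top_by_role[r].get(category) == answer for r in roles):
            common[category] = answer
    return common

# Hypothetical top answers per role, for illustration only.
top_by_role = {
    "Role A": {"Language": "Python", "ML Framework": "Scikit-learn"},
    "Role B": {"Language": "R", "ML Framework": "Scikit-learn"},
    "Role C": {"Language": "Python", "ML Framework": "Scikit-learn"},
}

print(common_categories(top_by_role))
```

Only the category on which all three made-up roles agree survives the filter.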

There is no clear favorite or preference in the following tool categories:
* ML Product
* Automated ML techniques
* ML Experiment Tools

Note: It is surprising that MLflow is not one of the options in the ML Experiment Tools question.

# Undisputed king of skills/platforms

The following skills are the most used irrespective of job title:

* **ML Frameworks** : Scikit-learn
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1200px-Scikit_learn_logo_small.svg.png" width="200px">

* **Learning Platform** : Coursera
<img src="https://149362230.v2.pressablecdn.com/wp-content/uploads/2018/08/coursera-social-logo.png" width="200px">

* **Media Source for DS** : Kaggle (notebooks, forums, etc)

<img src="https://www.kaggle.com/static/images/site-logo.png" width="200px">

* **ML Algorithms** : Linear or Logistic Regression

<img src="https://saedsayad.com/images/LogReg_1.png" width="300px">

* **Computing platform** : A personal computer or laptop

# Scope of the analysis:

Note: The following questions are left out of the analysis
* Demography-related questions, as the intention is to provide concrete actions to upskill
* Part B of the questions, as they indicate actions to be taken in the future
* Questions like Q18 and Q19, which depend on the response to another question. The group size changes across such questions
* Questions on salary, years of experience, company size, etc., as these are not within the immediate control of the user.
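
The point about dependent questions comes down to the denominator. In the made-up example below, a follow-up question is shown only to respondents who gave a particular earlier answer, so dividing by the whole group and dividing by those who actually saw the question give different shares. All field names and numbers are hypothetical.

```python
# One dict per respondent: whether the follow-up question was shown,
# and the answer given (None when it was not shown or left blank).
group = [
    {"saw_follow_up": True, "answer": "X"},
    {"saw_follow_up": True, "answer": None},
    {"saw_follow_up": False, "answer": None},
    {"saw_follow_up": False, "answer": None},
]

picked = sum(1 for r in group if r["answer"] == "X")
eligible = sum(1 for r in group if r["saw_follow_up"])

share_of_group = picked / len(group)   # denominator: everyone in the role
share_of_eligible = picked / eligible  # denominator: only those asked

print(share_of_group, share_of_eligible)
```

Because the eligible subgroup differs from question to question, a fixed 50% threshold over the whole role would not mean the same thing for such questions.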