## Skills categorisation

How should data science skills be classified?

LinkedIn classification:
1. Industry knowledge
2. Interpersonal skills
3. Languages
4. Tools & Technologies
5. Other skills

Proposed categories (after Fayyad and Hamutcu): https://hdsr.mitpress.mit.edu/pub/6wx0qmkl/release/3
1. Industry knowledge
2. Interpersonal skills
3. Languages
4. Science and Math
    1. **Scientific Method**: Basics of the scientific method, research methods, hypothesis formulation and problem identification
    2. **Mathematics**: Basic math, calculus and linear algebra
    3. **Computer Science**: CS essentials such as data structures and algorithms, DB, OS, parallel computing, software engineering
    4. **Statistics**: Probability basics, descriptive, inferential, and Bayesian statistics, stochastic processes and time series, causality, sampling
    5. **Operations Research & Optimization**: Linear programming, nonlinear optimization
    6. **Data Preparation and Exploration**: Practical knowledge related to ‘data analysis’, feature extraction and transformation, data cleaning, data preparation, data exploration
    7. **Machine Learning**: Unsupervised and supervised learning models and algorithms, reinforcement and deep learning, text mining and NLP
5. Programming and Technology
    1. **General Purpose Computing**: General purpose programming languages, shell basics, version control, virtualization and containerization, cloud platforms
    2. **Scientific Computing**: Statistical and numerical programming languages and libraries, ML libraries, development environments, data visualization tools
    3. **Database & Business Intelligence**: Relational DBs and SQL, data warehousing, querying and presentation
    4. **Big Data**: Big data infrastructure, processing and execution environments, big data access and integration tools
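
The proposed taxonomy above can be held in a small lookup structure when labels need to be validated or rolled up to their parent group. This is only a sketch: the names `PROPOSED_CATEGORIES` and `parent_group` are hypothetical and do not appear elsewhere in this notebook.

```python
# Hypothetical encoding of the proposed taxonomy: top-level group -> sub-categories.
# Groups without sub-categories in the list above map to an empty list.
PROPOSED_CATEGORIES = {
    "Industry knowledge": [],
    "Interpersonal skills": [],
    "Languages": [],
    "Science and Math": [
        "Scientific Method",
        "Mathematics",
        "Computer Science",
        "Statistics",
        "Operations Research & Optimization",
        "Data Preparation and Exploration",
        "Machine Learning",
    ],
    "Programming and Technology": [
        "General Purpose Computing",
        "Scientific Computing",
        "Database & Business Intelligence",
        "Big Data",
    ],
}


def parent_group(subcategory):
    """Return the top-level group containing a sub-category, or None if absent."""
    for group, subcategories in PROPOSED_CATEGORIES.items():
        if subcategory in subcategories:
            return group
    return None
```

For example, `parent_group("Statistics")` returns `"Science and Math"`, and an unknown label returns `None`.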

In [2]:
import pandas as pd
from collections import defaultdict
import json
import pickle

In [15]:
# Each line of the file is parsed later as one JSON-encoded profile
with open('data/raw/scraped_profiles.json') as json_data:
    profiles = json_data.readlines()

In [16]:
## Get the list of skills categorised in each group by LinkedIn
# Map LinkedIn's section headings, as keyed in the scraped JSON, to short category names.
# The 'Other Skills ' key is looked up with its trailing space.
linkedin_sections = {
    'Industry Knowledge': 'industry_knowledge',
    'Tools & Technologies': 'tools_tech',
    'Interpersonal Skills': 'interp_skill',
    'Languages': 'languages',
    'Other Skills ': 'other',
}
cat_to_skill = defaultdict(set)
for profile in profiles:
    p = json.loads(profile)
    # Skip profiles that have no 'top_skills' entry
    if 'top_skills' not in p['skills']:
        continue
    for section, cat in linkedin_sections.items():
        # sk[0] is taken as the skill name; normalise case and whitespace
        for sk in p['skills'].get(section, []):
            cat_to_skill[cat].add(sk[0].lower().strip())
# pickle.dump(cat_to_skill, open("data/processed/cat_to_skill.p", "wb"))

In [21]:
for key, values in cat_to_skill.items():
    print(key, len(values))

industry_knowledge 2607
tools_tech 933
interp_skill 276
other 12951
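
The grouping loop above relies on a particular record layout. The sketch below shows one record shaped the way the lookups imply. It is hypothetical: the `id` value, the skill names and the second element of each skill entry (assumed here to be a count) are made up for illustration, and only the key names come from the code.

```python
import json
from collections import defaultdict

# Hypothetical record: key names mirror the lookups in the loop above, values are invented.
line = json.dumps({
    "id": "profile-001",
    "skills": {
        "top_skills": [["Python", 12]],
        "Industry Knowledge": [["Machine Learning ", 30], ["Statistics", 8]],
        "Tools & Technologies": [["Python", 12]],
    },
})

p = json.loads(line)
cat_to_skill_demo = defaultdict(set)
for sk in p["skills"].get("Industry Knowledge", []):
    # sk[0] is the skill name; lower-casing and stripping merge spelling variants
    cat_to_skill_demo["industry_knowledge"].add(sk[0].lower().strip())

print(sorted(cat_to_skill_demo["industry_knowledge"]))
```

Normalising with `lower().strip()` is what lets `"Machine Learning "` and `"machine learning"` count as the same skill.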


In [17]:
# One row per (skill, LinkedIn category) pair, stacked across the five categories
linkedin_cats = ['industry_knowledge', 'tools_tech', 'interp_skill', 'languages', 'other']
skills_df = pd.concat([pd.DataFrame({'skill': list(cat_to_skill[cat]), 'cat': cat})
                       for cat in linkedin_cats])

In [111]:
# skills_df.to_csv('data/processed/linkedin_skills.csv',index=False)

In [3]:
# Manual classification of skills based on Fayyad and Hamutcu's categories
cat_skills = pd.read_csv('data/processed/linkedin_skills_recat.csv')

In [43]:
skill_to_cat = defaultdict(str)
for cat, skills in cat_to_skill.items():
    if cat == 'other':
        continue
    for sk in skills:
        skill_to_cat[sk] = cat

# pickle.dump(skill_to_cat, open("data/processed/skill_to_cat_lin.p", "wb"))
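
Inverting a category-to-skills mapping keeps only one category per skill, so a skill listed under two LinkedIn sections silently takes whichever category is visited last. A minimal sketch for surfacing such cases follows; `find_ambiguous_skills` and the toy input are hypothetical and not part of the notebook.

```python
from collections import defaultdict


def find_ambiguous_skills(cat_to_skill):
    """Return {skill: set of categories} for skills listed under more than one category."""
    skill_cats = defaultdict(set)
    for cat, skills in cat_to_skill.items():
        for sk in skills:
            skill_cats[sk].add(cat)
    return {sk: cats for sk, cats in skill_cats.items() if len(cats) > 1}


# Toy input, invented for illustration
toy = {
    "industry_knowledge": {"machine learning", "statistics"},
    "tools_tech": {"python", "machine learning"},
}
ambiguous = find_ambiguous_skills(toy)
print(ambiguous)
```

Running the same function on the real `cat_to_skill` would show how many skills are affected before deciding whether the overwrite matters.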

In [82]:
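# Fall back to the LinkedIn category where no manual recategorisation was given.
# Assign the result back: in-place fillna on a selected column is unreliable in recent pandas.
cat_skills['recat'] = cat_skills['recat'].fillna(cat_skills['cat'])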

In [58]:
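## Create long dataframe with profile id and skill
ids = []
skills_list = []
for prof in profiles:
    p = json.loads(prof)
    try:
        for cat, skills in p['skills'].items():
            # Normalise the names first so both lists grow by the same length
            names = [skill[0].lower().strip() for skill in skills]
            skills_list += names
            ids += [p['id']] * len(names)
    except (KeyError, AttributeError, TypeError, IndexError):
        # Skip profiles with a missing or malformed 'skills' entry
        continue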

In [84]:
skills_long = pd.DataFrame({"id": ids, "skill": skills_list})
len(skills_long)

526177

In [85]:
# Left join keeps every (id, skill) row; skills absent from cat_skills get NaN in recat
skills_long = skills_long.merge(cat_skills, on='skill', how='left')
len(skills_long)

526177
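
The row count is unchanged by the merge, which is what a left join gives when each skill appears at most once in the lookup table. A sketch of how that assumption could be checked explicitly with the `validate` and `indicator` arguments of `DataFrame.merge` is shown below; the two small frames are invented stand-ins for `skills_long` and `cat_skills`.

```python
import pandas as pd

# Invented stand-ins for the long skills table and the manual lookup table
skills_demo = pd.DataFrame({"id": [1, 1, 2], "skill": ["python", "sql", "python"]})
lookup_demo = pd.DataFrame({
    "skill": ["python", "sql"],
    "recat": ["General Purpose Computing", "Database"],
})

# validate="many_to_one" raises if a skill is duplicated in the lookup table;
# indicator=True adds a _merge column marking rows that found no match
merged = skills_demo.merge(
    lookup_demo, on="skill", how="left", validate="many_to_one", indicator=True
)
unmatched = int((merged["_merge"] == "left_only").sum())

# A lookup table with a duplicated skill is rejected instead of inflating the row count
dup_lookup = pd.concat([lookup_demo, lookup_demo.iloc[[0]]], ignore_index=True)
try:
    skills_demo.merge(dup_lookup, on="skill", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

Without `validate`, a duplicated skill in the lookup table would duplicate the matching rows and skew every count taken afterwards.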

In [86]:
skills_long.recat.value_counts()

industry_knowledge                      296984
other                                    57116
interp_skill                             43966
Computer Science                         26724
tools_tech                               24739
General Purpose Computing                17380
Database                                 13338
Data Preparation and Exploration          8816
interp_skills                             8087
ML                                        4462
Scientific Computing                      4135
Big Data                                  3034
Statistics                                1914
languages                                 1276
review                                     138
Math                                       124
Review                                      24
Operations Research and Optimization        15
Name: recat, dtype: int64

In [43]:
skills_long['recat'] = skills_long['recat'].fillna('other')

In [46]:
## Group skills
# Merge the small quantitative categories, fold the capitalised 'Review' label into 'other',
# and unify the two spellings of the interpersonal-skills label
skills_long['recat'] = skills_long['recat'].replace({
    'Statistics': 'Stats and Math',
    'Math': 'Stats and Math',
    'Operations Research and Optimization': 'Stats and Math',
    'Review': 'other',
    'interp_skill': 'interp_skills',
})

In [47]:
# Drop rows labelled 'review' (lowercase); the capitalised 'Review' label was mapped to 'other'
skills_long = skills_long[skills_long.recat != 'review']

In [49]:
skills_long.recat.value_counts()

industry_knowledge                  296984
other                                71045
interp_skills                        52053
Computer Science                     26724
tools_tech                           24739
General Purpose Computing            17380
Database                             13338
Data Preparation and Exploration      8816
ML                                    4462
Scientific Computing                  4135
Big Data                              3034
Stats and Math                        2053
languages                             1276
Name: recat, dtype: int64
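
As a sanity check, the merged totals can be recomputed from the earlier counts. The sketch below applies the same label mapping to the relevant rows of the first `value_counts()` output; the names `RECAT_MERGES` and `DROPPED` are hypothetical, and only the counts are taken from the outputs shown in this notebook.

```python
from collections import Counter

# Label merges and the dropped label, as applied in the grouping cells
RECAT_MERGES = {
    "Statistics": "Stats and Math",
    "Math": "Stats and Math",
    "Operations Research and Optimization": "Stats and Math",
    "Review": "other",
    "interp_skill": "interp_skills",
}
DROPPED = {"review"}

# Counts copied from the first value_counts() output (affected labels only)
before = {
    "Statistics": 1914,
    "Math": 124,
    "Operations Research and Optimization": 15,
    "interp_skill": 43966,
    "interp_skills": 8087,
    "review": 138,
}

after = Counter()
for label, n in before.items():
    if label in DROPPED:
        continue
    after[RECAT_MERGES.get(label, label)] += n

print(after["Stats and Math"], after["interp_skills"])
```

The two totals agree with the `Stats and Math` and `interp_skills` rows of the final counts.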

In [48]:
skills_long.to_csv('data/processed/skills_long_final.csv', index=False)