# Identify normalized skills in CVs

Extracting normalized skills from CVs **simplifies the hiring process** by allowing recruiters to quickly identify candidates with the desired skills.
While this is still an open problem, it can be tackled in specific sectors using language processors and interfaces that support applicants in writing their CVs
and applying for jobs.

Normalization provides a **consistent context** that can be passed to AI models and to humans for evaluation.



Large platforms like LinkedIn have a large amount of data and significant experience in this area.
On the other hand, a more focused approach for specific tasks can be more effective, especially when evaluating a large number of CVs (e.g. for junior positions).
Moreover, a standardized approach can be more effective in the long term, since it makes it possible to build a common knowledge base that supports the relocation of workers over time
(e.g. identifying skill gaps and training needs).

ESCO is a multilingual classification of skills, competences, qualifications and occupations maintained and updated periodically by the European Commission.



## Structure of the ESCO dataset

```mermaid
---
title: ESCO dataset
---
graph

SK[Skill]
O[Occupation]
K[Knowledge]

SK -->|broader| SK

O -->|essentialSkill 1..n| SK
O -->|optionalSkill 1..n| SK

SK -.->|type| S
SK -.->|type| K

```
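
The relations in the diagram can be sketched with plain Python data classes. This is only an illustrative model, not the ESCO schema itself: the class names, field names, `all_skills` helper and `example.org` entries are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Skill:
    uri: str
    label: str
    skill_type: str  # "skill" or "knowledge", mirroring the dashed `type` edges
    broader: List["Skill"] = field(default_factory=list)  # SK -->|broader| SK


@dataclass
class Occupation:
    uri: str
    label: str
    essential_skills: List[Skill] = field(default_factory=list)
    optional_skills: List[Skill] = field(default_factory=list)


def all_skills(occupation: Occupation) -> List[Skill]:
    """Return essential and optional skills plus their broader ancestors."""
    seen: Dict[str, Skill] = {}
    stack = occupation.essential_skills + occupation.optional_skills
    while stack:
        skill = stack.pop()
        if skill.uri not in seen:
            seen[skill.uri] = skill
            stack.extend(skill.broader)
    return sorted(seen.values(), key=lambda s: s.label)


# Hypothetical entries: URIs and labels are placeholders, not ESCO data.
programming = Skill("http://example.org/skill/1", "computer programming", "knowledge")
python = Skill("http://example.org/skill/2", "Python", "knowledge", broader=[programming])
debugging = Skill("http://example.org/skill/3", "debug software", "skill")
developer = Occupation(
    "http://example.org/occupation/1",
    "software developer",
    essential_skills=[python],
    optional_skills=[debugging],
)

labels = [s.label for s in all_skills(developer)]
print(labels)
```

Following the `broader` edges lets a match on a specific skill also count towards its more general ancestors.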



## Structure of Skills and Occupations

Here is an excerpt from the **SaaS (service-oriented modelling)** skill:

- preferred label: SaaS (service-oriented modelling)
- unique identifier as a URI: http://data.europa.eu/esco/skill/eeca3780-8049-499f-a268-95a7ad26642c
- alternative labels: SaaS model
- description: The SaaS model consists of principles and fundamentals of service-oriented modelling for business and software systems that allow the design and specification of service-oriented business systems within a variety of architectural styles, such as enterprise architecture.


While occupations change over time, skills are more stable.
Nonetheless, the ESCO Occupation taxonomy can be useful as a reference for writing job positions and their associated skills.

Here is an ESCO Occupation excerpt for **cloud engineer**. The URIs of the skills are present in the dataset but are not shown below:

- preferred label: cloud engineer
- unique identifier as a URI: http://data.europa.eu/esco/occupation/349ee6f6-c295-4c38-9b98-48765b55280e
- alternative labels: cloud-native engineer, cloud architect, cloud computing engineer, cloud developer, cloud devops engineer, cloud infrastructure engineer, cloud network engineer, cloud security engineer, cloud software engineer, cloud solution engineer, hybrid cloud engineer
- description: Cloud engineers are responsible for the design, planning, management and maintenance of cloud-based systems. They develop and implement cloud-applications, handle the migration of existing on-premise applications to the cloud, and debug cloud stacks.
- required skills: ICT system integration, ICT system programming, SaaS (service-oriented modelling), cloud monitoring and reporting, cloud security and compliance, cloud technologies, computer programming, cyber security, database development tools, implement cloud security and compliance, operating systems, systems development life-cycle, systems theory



## Extracting skills from CVs

This notebook shows how to create a spaCy NLP model, based on the ESCO skill descriptions and labels, for identifying skills in CVs.

In [None]:
import os
import logging

os.environ.update(
    dict(
        # PYTORCH_ROCM_ARCH="gfx90c",
        LD_LIBRARY_PATH="/opt/rocm/lib:/opt/rocm/libexec",
        # HSA_OVERRIDE_GFX_VERSION="90c",
        ROCM_HOME="/opt/rocm",
        ROCM_PATH="/opt/rocm",
    )
)

In [None]:
import spacy
import yaml
from pathlib import Path
import pandas as pd
import json

# Without GPU this is quite slow. Further info on using a GCP server on GPU can be found in GCP.md
# spacy.prefer_gpu()

In [None]:
# Load the ESCO data from SPARQL and create a dataframe with some useful columns.
# Alternatively, use the pre-processed esco.json file.
df = pd.read_json("../esco/esco.json.gz", orient="records")
df.index = df.uri
skills = df.groupby(df.uri).agg(
    {
        "altLabel": lambda x: list(set(x)),
        "label": lambda x: x.iloc[0],
        "description": lambda x: x.iloc[0],
        "skillType": lambda x: x.iloc[0],
    }
)
# Add a lowercase text field for semantic search.
skills["text"] = skills.apply(
    lambda x: "; ".join([x.label] + x.altLabel + [x.description]).lower(), axis=1
)
# .. and a set of all the labels for each skill.
skills["allLabel"] = skills.apply(
    lambda x: {t.lower() for t in x.altLabel} | {x.label.lower()}, axis=1
)
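
The aggregation above can be pictured without pandas. The sketch below assumes the input has one row per (uri, altLabel) pair, as the `groupby` suggests. The rows and the `aggregate` helper are hypothetical, and alternative labels are sorted here only to make the output deterministic.

```python
from collections import defaultdict

# Hypothetical rows: placeholders, not ESCO data.
rows = [
    {"uri": "http://example.org/skill/1", "label": "Example DB", "altLabel": "ExampleDB",
     "description": "A sample database.", "skillType": "knowledge"},
    {"uri": "http://example.org/skill/1", "label": "Example DB", "altLabel": "example database",
     "description": "A sample database.", "skillType": "knowledge"},
    {"uri": "http://example.org/skill/2", "label": "write reports", "altLabel": "report writing",
     "description": "Produce written reports.", "skillType": "skill"},
]


def aggregate(rows):
    """Group rows by URI and derive the `text` and `allLabel` fields."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["uri"]].append(row)

    skills = {}
    for uri, group in grouped.items():
        first = group[0]
        alt_labels = sorted({r["altLabel"] for r in group})
        skills[uri] = {
            "label": first["label"],
            "altLabel": alt_labels,
            "description": first["description"],
            "skillType": first["skillType"],
            # Lowercase text field for full-text or semantic search.
            "text": "; ".join([first["label"], *alt_labels, first["description"]]).lower(),
            # Every label, preferred and alternative, in lowercase.
            "allLabel": {a.lower() for a in alt_labels} | {first["label"].lower()},
        }
    return skills


skills_sketch = aggregate(rows)
print(skills_sketch["http://example.org/skill/1"]["text"])
```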



In [None]:

with pd.option_context("max_colwidth", None):
    display(skills.head())



In [None]:

# Smoke test some skills
list(skills[skills.label.str.contains("MySQL")].altLabel)

In [None]:
from esco import to_curie, from_curie

def make_pattern(id_: str, kn: dict):
    """Given an ESCO skill entry in the dataframe, create a pattern for the matcher.

    The entry has the following fields:
    - label: the preferred label
    - altLabel: a list of alternative labels
    - the skillType: e.g. knowledge, skill, ability

    The logic uses some heuristics to decide how to match the preferred label and the alternative labels.
    """
    label = kn["label"]
    pattern = [{"LOWER": label.lower()}] if len(label) > 3 else [{"TEXT": label}]
    patterns = [pattern]
    altLabel = [kn["altLabel"]] if isinstance(kn["altLabel"], str) else kn["altLabel"]
    for alt in altLabel:
        # If the label has at most 3 characters (e.g. an acronym), use an exact, case-sensitive match.
        if len(alt) <= 3:
            candidate = [{"TEXT": alt}]
        
        # If there are up to 3 words, use a lowercase match.
        elif 1 < len(alt.split()) <= 3:
            candidate = [{"LOWER": x} for x in alt.lower().split()]

        # Otherwise use a lowercase match with the whole string.
        # Maybe:
        # - use a lemma match
        # - skip this case, and use a full-text/semantic search
        else:
            candidate = [{"LOWER": alt.lower()}]
        if candidate not in patterns:
            patterns.append(candidate)

    # The following identifier is not used anymore, but it's kept here for reference.
    pattern_identifier = (
        f"{kn['skillType'][:2]}_{label.replace(' ', '_')}".upper().translate(
            str.maketrans("", "", "()")
        )
    )
    return to_curie(id_), patterns
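
To see how the pattern shapes behave, here is a deliberately simplified stand-in for a token matcher. It assumes whitespace tokenization and supports only the `LOWER` and `TEXT` attributes, so it is not spaCy's `Matcher`; the helper names and the sample text are hypothetical. It shows that each dictionary describes one token, so a multi-word string packed into a single `LOWER` entry does not match.

```python
def token_matches(token: str, spec: dict) -> bool:
    """Check one token against one token specification."""
    if "LOWER" in spec:
        return token.lower() == spec["LOWER"]
    if "TEXT" in spec:
        return token == spec["TEXT"]
    return False


def find_matches(text: str, patterns: list) -> list:
    """Return the matched token spans as strings."""
    tokens = text.split()  # Assumption: whitespace tokenization, unlike spaCy.
    found = []
    for pattern in patterns:
        size = len(pattern)
        for start in range(len(tokens) - size + 1):
            window = tokens[start:start + size]
            if all(token_matches(t, s) for t, s in zip(window, pattern)):
                found.append(" ".join(window))
    return found


text = "I use SQL and Relational Databases daily"

per_token = [{"LOWER": "relational"}, {"LOWER": "databases"}]
whole_string = [{"LOWER": "relational databases"}]
exact_acronym = [{"TEXT": "SQL"}]
lowercase_acronym = [{"TEXT": "sql"}]

print(find_matches(text, [per_token]))          # one token spec per word: matches
print(find_matches(text, [whole_string]))       # one spec for two words: no match
print(find_matches(text, [exact_acronym]))      # case-sensitive match
print(find_matches(text, [lowercase_acronym]))  # wrong case: no match
```

Under these assumptions, labels with more than three words, and multi-word preferred labels, end up in the whole-string branch of `make_pattern` and are unlikely to match.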



In [None]:
[(id_, kni) for id_, kni in skills.to_dict(orient="index").items() if "software archit" in str(kni) ][:1]

In [None]:
# Create the patterns for the matcher
m = dict(
    make_pattern(id_, kni)
    for id_, kni in skills.to_dict(orient="index").items()
    # if "architecture" in str(kni["altLabel"])
)
m

In [None]:
# Use spacy matcher with a blank model to validate the patterns.
# If this doesn't work, spacy will raise an error.
import spacy
from spacy.matcher import Matcher
nlp_test = spacy.blank("en")
m1 = Matcher(nlp_test.vocab, validate=True)
for pid, patterns in m.items():
    m1.add(pid, patterns) 


In [None]:
# Save the identifier -> patterns mapping.
with open("../generated/esco_matchers.json", "w") as f:
    json.dump(m, f)

# Flatten the mapping into entity-ruler patterns and save them as JSON too.
esco_p = [{"label": "ESCO", "pattern": pattern, "id": k} for k, p in m.items() for pattern in p]
with open("../generated/esco_patterns.json", "w") as f:
    json.dump(esco_p, f)

# Show the first 3 patterns
list(m.items())[:3]
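
As a small illustration of the flattening step, the sketch below turns a hypothetical identifier-to-patterns mapping into one record per pattern and checks that the records survive a JSON round trip. The identifiers and patterns are placeholders, not ESCO entries.

```python
import json

# Hypothetical mapping, shaped like `m` above.
matchers = {
    "esco:example-1": [[{"LOWER": "mysql"}], [{"TEXT": "SQL"}]],
    "esco:example-2": [[{"LOWER": "rest"}, {"LOWER": "api"}]],
}

# One record per pattern; the skill identifier is repeated in the "id" field.
ruler_patterns = [
    {"label": "ESCO", "pattern": pattern, "id": skill_id}
    for skill_id, patterns in matchers.items()
    for pattern in patterns
]

# The records are plain dicts and lists, so they serialize to JSON unchanged.
restored = json.loads(json.dumps(ruler_patterns))
print(len(ruler_patterns), ruler_patterns[0])
```

Because every record carries the same `"ESCO"` label, the `id` field is what links a match back to a specific skill.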

## Create an ESCO spacy entity recognizer model

This entity recognizer reuses the en_core_web_trf model, which is quite good at identifying PRODUCT entities.
We will add a new entity label, ESCO, that uses the altLabel patterns to identify further entities.

The ESCO entity label is added to the pipeline after the NER component, so that the NER component can identify the entities that are already in the en_core_web_trf model, and the ESCO component can add the ESCO entities.
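
The effect of this ordering can be modelled in plain Python. The sketch assumes the entity ruler keeps its default setting of not overwriting existing entities, so rule-based spans are added only where they do not overlap spans already set by the NER component. The `add_rule_entities` helper and the spans are hypothetical, and the logic is a simplification rather than spaCy's implementation.

```python
def add_rule_entities(existing, candidates):
    """Add rule-based spans that do not overlap the spans already present.

    Spans are (start_token, end_token, label) tuples with an exclusive end.
    """
    taken = set()
    for start, end, _ in existing:
        taken.update(range(start, end))

    merged = list(existing)
    # Prefer longer candidates, then earlier ones.
    ordered = sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0]))
    for start, end, label in ordered:
        if taken.isdisjoint(range(start, end)):
            merged.append((start, end, label))
            taken.update(range(start, end))
    return sorted(merged)


# Tokens: I(0) use(1) Red(2) Hat(3) OpenShift(4) and(5) mysql(6)
ner_spans = [(2, 5, "PRODUCT")]                  # found by the statistical NER
ruler_spans = [(4, 5, "ESCO"), (6, 7, "ESCO")]   # found by the ESCO patterns

merged = add_rule_entities(ner_spans, ruler_spans)
print(merged)
```

In this model the ESCO match on `OpenShift` is dropped because the token already belongs to a `PRODUCT` entity, while `mysql` is added as a new ESCO entity.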

In [None]:
# This model is quite good at recognizing ICT entities like products.
import spacy
from spacy import displacy  # Load a viewer.
nlp_e = spacy.load("en_core_web_trf")

We can use a custom tokenizer that preserves dashed words, so that the ESCO entity recognizer can identify them.

:warning: **Note**: this is disabled for now, since it should be tested further.
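
The idea behind the custom tokenizer can be shown with the standard `re` module alone: if the hyphen is left out of the infix characters, dashed words are not split. This is a simplified sketch, not the spaCy tokenizer. The character class is a cleaned-up assumption based on the cell below, and `split_on_infixes` is a hypothetical helper that drops the separators instead of keeping them as tokens.

```python
import re

# Characters that split a word internally; "-" is deliberately left out.
INFIX_RE = re.compile(r"""[.,?:;‘’`“”"'~]""")


def split_on_infixes(word: str) -> list:
    """Split a word on infix characters, discarding empty parts."""
    return [part for part in INFIX_RE.split(word) if part]


print(split_on_infixes("openshift-on-openstack"))  # stays in one piece
print(split_on_infixes("linux,mysql"))             # split on the comma
```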


In [None]:
# ... optionally add a custom tokenizer ...
import re
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

if False:
    nlp_e.tokenizer = custom_tokenizer(nlp_e)


In [None]:
# .. then the entity ruler ..
ruler = nlp_e.add_pipe("entity_ruler", after="ner")
ruler.add_patterns(esco_p)


In [None]:

nlp_e.to_disk("../generated/en_core_web_trf_esco_ner")


### Try the model

Let's try the model on some sample texts:

- a minimal text with a single skill
- a complete CV

In [None]:
import sys
from hashlib import sha256
text = """
I design rest API using the openapi specifications.

I daily use linux, courier-imap, openapi, openshift-on-openstack and mysql

"""

def get_stats(doc):
    doc_id = sha256(doc.text.encode("utf-8")).hexdigest()
    all_entities = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
    ent_count = len(all_entities)
    ent_unique = set(all_entities)
    ent_unique_count = len(ent_unique)
    ent_text_unique = set(text for text, _, _ in ent_unique)
    ent_text_unique_count = len(ent_text_unique)
    ent_skills = set((t, l, i) for t, l, i in ent_unique if l in ("ESCO", "PRODUCT"))
    ent_skills_text_unique = len(set(text for text, _, _ in ent_skills))
    return {
        "doc_id": doc_id,
        "ent_count": ent_count,
    #    "ent_unique": list(ent_unique),
        "ent_unique_count": ent_unique_count,
    #    "ent_text_unique": list(ent_text_unique),
        "ent_text_unique_count": ent_text_unique_count,
        "ent_skills": list(ent_skills),
        "ent_skills_text_unique": ent_skills_text_unique,
    }

doc = nlp_e(text.replace("\n", " "))
# Show some stats
get_stats(doc)

In [None]:
# Visualize the result.
displacy.render(doc, style="ent", jupyter=True)

In [None]:
for t in doc.ents:
    print(t.text, t.label_, t.ent_id_, sep="\t")

# Model testing

The recognizer is tested by processing a set of CVs and returning the entities found in each CV.


In [None]:
DATADIR = Path("data/")
cvs = list(DATADIR.glob("*-en.txt"))
len(cvs), DATADIR.absolute().as_posix()

In [None]:
# Prepare the data for the test.
def model_factory(model_name, patterns=None, config=None):
    model = spacy.load(model_name)
    if patterns:
        config = config or {}
        ruler = model.add_pipe("entity_ruler", **config)
        ruler.add_patterns(patterns)
    return model


testcases = {
    "base": model_factory("en_core_web_md"),
    "trf": model_factory("en_core_web_trf"),
    "trf_pre": model_factory("en_core_web_trf", esco_p, config={"before": "ner"}),
    "trf_post": model_factory("en_core_web_trf", esco_p, config={"after": "ner"}),
}


In [None]:
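# Run the test cases and write each result as soon as it is available, since the run can take a long time.
# Note: the output ends with a trailing comma before "]", so it is not strict JSON.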
results_out = Path("results.out")
with results_out.open("wb") as fh:
    fh.write(b"[")
    for model_name, nlp_model in testcases.items():
        for doc, cv in nlp_model.pipe([(cv.read_text(), cv.stem) for cv in cvs], as_tuples=True):
            stats = get_stats(doc)
            stats["model"] = model_name
            fh.write(json.dumps(stats).encode())
            fh.write(b",\n")
    fh.write(b"]")
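
An alternative way to save results incrementally is the JSON Lines layout, with one JSON object per line. The sketch below is an assumption about how that could look, not what the notebook does: the file name, helper names and sample records are hypothetical. Each line is valid on its own, so a partially written file remains readable if the run is interrupted.

```python
import json
import tempfile
from pathlib import Path


def append_result(path: Path, stats: dict) -> None:
    """Append one record as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(stats) + "\n")


def read_results(path: Path) -> list:
    """Read back every non-empty line as a JSON object."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results.jsonl"
    # Hypothetical records, shaped like the output of get_stats().
    append_result(out, {"doc_id": "a", "model": "base", "ent_count": 3})
    append_result(out, {"doc_id": "a", "model": "trf", "ent_count": 5})
    loaded = read_results(out)

print(loaded)
```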

In [None]:
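# The results file ends with a trailing comma before "]", so it is not strict JSON.
# PyYAML accepts it, hence yaml.safe_load instead of json.loads.
data = yaml.safe_load(Path("results.out").read_text())
df = pd.DataFrame(data)
# Each record already carries its model name (see stats["model"] above),
# so there is no need to derive it from the row position.
# Aggregate the dataframe by doc_id, using the pairs (stat, model) as columns
# and the doc_id as the index.
results = df.groupby(["doc_id", "model"]).agg(list).unstack()
results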

# Experiment with the saved model

Now that we have a saved model, we can experiment with it.


In [None]:
import spacy
import sys
import re
from hashlib import sha256
from pathlib import Path
from spacy import displacy

DATADIR = Path(sys.path[0]).parent / "tests" / "data"

nlp_esco = spacy.load("../generated/en_core_web_trf_esco_ner/")

In [None]:
text_raw = (DATADIR / "rpolli.txt").read_text()
text = re.sub('\n+','\n',text_raw)

In [None]:
doc = nlp_esco(text.replace("\n", " "))

In [None]:
# Start analyzing the parts of speech (POS)
from collections import Counter
most_common = lambda pos: Counter([ t.lemma_ for t in doc if t.pos_ == pos and len(t.text)> 2]).most_common(6) 
print("Most common verbs", most_common("VERB"))
print("Most common nouns", most_common("NOUN"))

In [None]:
# Now a nice display
displacy.render(doc, style="ent", jupyter=True)

In [None]:
def get_stats(doc):
    doc_id = sha256(doc.text.encode("utf-8")).hexdigest()
    all_entities = [(ent.text, ent.label_) for ent in doc.ents]
    ent_count = len(all_entities)
    ent_unique = set(all_entities)
    ent_unique_count = len(ent_unique)
    ent_text_unique = set(text for text, _ in ent_unique)
    ent_text_unique_count = len(ent_text_unique)
    ent_skills = set((t, l) for t, l in ent_unique if l in ("ESCO", "PRODUCT"))
    ent_skills_text_unique = len(set(text for text, _ in ent_skills))
    return {
        "doc_id": doc_id,
        "ent_count": ent_count,
        "ent_unique": list(ent_unique),
        "ent_unique_count": ent_unique_count,
        "ent_text_unique": list(ent_text_unique),
        "ent_text_unique_count": ent_text_unique_count,
        "ent_skills": list(ent_skills),
        "ent_skills_text_unique": ent_skills_text_unique,
    }

get_stats(doc)

## Takeaways

- The model is not perfect, but it is a good starting point for further improvements.
- Even a simple lexical analysis can provide useful insights.
- It helps identify legacy skills that are no longer relevant in the current market.
- The model is good at identifying knowledge items (e.g., programming languages, tools).
- The model is not good at identifying skills made up of multiple words (e.g., "can do X"). The `esco` package addresses this using vector search.
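
To give an intuition for the vector search mentioned in the last point, here is a toy sketch that ranks skill texts by cosine similarity over bag-of-words counts. It is only a stand-in: a real setup would use dense embeddings, the implementation in the `esco` package is not shown here, and the identifiers, texts and helper names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, skill_texts: dict, top: int = 1) -> list:
    """Return the identifiers of the most similar skill texts."""
    q = embed(query)
    ranked = sorted(skill_texts, key=lambda uri: cosine(q, embed(skill_texts[uri])), reverse=True)
    return ranked[:top]


# Hypothetical skill texts, shaped like the `text` column built earlier.
skill_texts = {
    "esco:example-1": "manage cloud security and compliance; implement cloud security policies",
    "esco:example-2": "debug software; find and fix defects in computer code",
}

best = search("I fix defects in legacy code", skill_texts)
print(best)
```

Unlike the token patterns, this kind of search does not need the CV to contain the exact label, which is why it suits skills phrased as full sentences.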

## Next steps

- Analyze the ESCO descriptions and labels and do some preprocessing to improve the patterns or the vector search.
