# Identify ESCO skills in CVs using neural search

This notebook uses neural search to identify ESCO skills where `esco:skillType=skill`.
Unlike `esco:skillType=knowledge` entries, these skills are expressed in natural language using multiple words and do not usually contain acronyms.

Acronyms can be a problem for some neural search models, which may confuse them with other words.



In [None]:
import esco
import pandas as pd
import yaml
from pathlib import Path
from sentence_transformers import SentenceTransformer
import nltk

# Load preliminary data.
db = esco.LocalDB()
skills: pd.DataFrame = db.skills
# Split the ESCO entries by type: S holds skills, K holds knowledge entries.
S = skills[skills.skillType.str.endswith("skill")]
K = skills[skills.skillType.str.endswith("knowledge")]
# A second LocalDB restricted to skills only.
sdb = esco.LocalDB()
sdb.skills = S
print("Using", len(S), "skills and", len(K), "knowledge entries")
# Ensure no knowledge entries are left: these should be matched using spaCy NER.
assert len(sdb.search_products(["Haskell"])) == 0

cv_txt = Path("rpolli.txt")
cv_doc = nltk.sent_tokenize(cv_txt.read_text())

# Create various Vector Indexes for the ESCO skills

Create embeddings with different models and store them in Qdrant for local searches.
By default Qdrant uses cosine distance.

langchain simplifies this process with data connector classes for [different vector stores](https://python.langchain.com/docs/modules/data_connection/vectorstores/),
e.g. `vector_db = VectorStore.from_documents(..)`.

A `VectorStore` supports different search operations:

- `similarity_search(query, k, filter)`
- `similarity_search_with_score(query, k=4, filter, score_threshold=None)`
- `max_marginal_relevance_search()`
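
To make the difference between plain similarity search and maximal marginal relevance (MMR) concrete, here is a minimal pure-Python sketch on toy 2-D vectors. It is not langchain's implementation: the helper names `cosine`, `top_k` and `mmr` are hypothetical, and `lambda_mult=0.5` is only assumed to mirror the relevance/diversity trade-off parameter exposed by the vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, docs, k):
    """Plain similarity search: indices of the k documents closest to the query."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]


def mmr(query, docs, k, lambda_mult=0.5):
    """Greedy MMR: reward relevance to the query, penalise similarity to picked docs."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings": documents 0 and 1 are near-duplicates, document 2 differs.
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query = [1.0, 0.1]

print("similarity:", top_k(query, docs, k=2))
print("mmr:       ", mmr(query, docs, k=2))
```

On this toy data, plain similarity search returns the two near-duplicates, while MMR replaces the second one with the less similar but more diverse document. For CV matching, this would matter when several ESCO skills have almost identical descriptions.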

In [None]:
import time
from itertools import chain
from collections import Counter

import nltk

from langchain_community.vectorstores.qdrant import Qdrant
from langchain_community.vectorstores.faiss import FAISS
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain.schema import Document


dbs = {}
models = (
    "all-MiniLM-L12-v2",
    "paraphrase-MiniLM-L3-v2",
    "paraphrase-albert-small-v2",
)


## FAISS (faster but less accurate)

Use langchain and FAISS to create different vector indexes for the ESCO skills.
[FAISS](https://faiss.ai) is a library for efficient similarity search and clustering of dense vectors; the langchain wrapper uses Euclidean ($L^2$) distance by default.
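
Since FAISS ranks by Euclidean distance while Qdrant defaults to cosine distance, it helps to recall how the two relate. The sketch below, in pure Python with made-up vectors, checks that for unit-length vectors $\|a-b\|^2 = 2 - 2\cos(a, b)$, so both metrics produce the same ranking. Whether a given SentenceTransformers model emits unit-length embeddings is an assumption that is not verified here; the helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_squared(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Made-up vectors standing in for a query and three skill embeddings.
query = [0.2, 0.9, 0.1]
docs = [[0.1, 1.0, 0.0], [0.9, 0.1, 0.3], [0.3, 0.8, 0.4]]

q = normalize(query)
unit_docs = [normalize(d) for d in docs]

# Rank by ascending L2 distance on unit vectors, and by descending cosine similarity.
by_l2 = sorted(range(len(docs)), key=lambda i: l2_squared(q, unit_docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine_similarity(query, docs[i]))

# On unit vectors the squared L2 distance equals 2 - 2 * cosine similarity.
identity_holds = all(
    math.isclose(
        l2_squared(q, unit_docs[i]),
        2 - 2 * cosine_similarity(query, docs[i]),
        abs_tol=1e-12,
    )
    for i in range(len(docs))
)

print("L2 ranking:    ", by_l2)
print("cosine ranking:", by_cosine)
print("identity holds:", identity_holds)
```

If the embeddings are not normalized, the two rankings can differ, which is one possible reason for FAISS and Qdrant returning different matches with the same model.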


On Colab with a T4 GPU, building the index takes:

- `all-MiniLM-L12-v2`: 1.90 s
- `paraphrase-albert-small-v2`: 2.57 s

Coupling the given embedding functions with FAISS, the `paraphrase-albert-small-v2` model seems to be the best performing one.
Note that, since the IT skill dataset is small (~600 skills), we could use search libraries less efficient than FAISS to get better results.


:warning: FAISS stopped working after switching to `langchain_community`. Since we don't use it, we haven't fixed it.

for m in models:
    start = time.time()
    print("Load embedding function", m)
    embedding_function = SentenceTransformerEmbeddings(model_name=m)
    print("Generate embeddings")
    dbs[f"{m}-FAISS"] = FAISS.from_texts(
        list(S.text.values),
        embedding_function,
        # Metadata must come from the same frame as the texts, or labels misalign.
        S[["label"]].to_dict(orient="records"),
    )
    print(f"Time for {m} --> ", time.time() - start)

cv_path = Path("rpolli.txt")
cv_text = cv_path.read_text()
print("CV: ", cv_path.stem, sep="\n")
for model, db in dbs.items():
    my_skills = [
        db.search(str(part), search_type="mmr") for part in nltk.sent_tokenize(cv_text)
    ]
    my_skills_c = Counter(x.metadata["label"] for x in chain(*my_skills))
    print(model, *my_skills_c.most_common(10), sep="\n")

## Qdrant

In [None]:

documents = [
    Document(page_content=t, metadata={"label": l, "uri": i})
    for t, l, i in zip(S.text.values, S.label.values, S.index.values)
]
for m in models:
    start = time.time()
    embedding_function = SentenceTransformerEmbeddings(model_name=m)
    db = dbs[f"qdrant-{m}"] = Qdrant.from_documents(
        documents,
        embedding_function,
        path=f"qdrant-{m}",
        collection_name="esco",
        force_recreate=True,
    )
    print(f"Time for {m} --> ", time.time() - start)
    db.client.close()

## Loading the ESCO embeddings

We load the ESCO embeddings from Qdrant, then compare the results of matching CV sentences against the ESCO embeddings.

We test different models.
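
One simple way to compare models is to measure how much the sets of matched skills overlap. The following is a minimal sketch using the Jaccard index; the model names and labels are made up for illustration and are not results from this notebook, and `jaccard` is a hypothetical helper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Two empty result sets are treated as identical.
    return len(a & b) / len(a | b)


# Hypothetical skill labels matched by three models on the same CV.
labels_by_model = {
    "model-a": {"manage databases", "write documentation", "debug software"},
    "model-b": {"manage databases", "debug software", "lead a team"},
    "model-c": {"lead a team"},
}

# Pairwise overlap between every two models.
overlap = {
    (m1, m2): jaccard(labels_by_model[m1], labels_by_model[m2])
    for m1, m2 in combinations(sorted(labels_by_model), 2)
}

for pair, score in overlap.items():
    print(pair, round(score, 2))
```

A low overlap between two models does not say which one is right; it only flags where a manual review of the matches is most useful.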

In [None]:
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores.qdrant import Qdrant
from langchain.schema import Document


# `assert S` would raise ValueError: the truth value of a DataFrame is ambiguous.
assert not S.empty  # Did you run the previous cell?

In [None]:
m = models[2]
embedding_function = SentenceTransformerEmbeddings(model_name=m)
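documents = [
    Document(page_content=t, metadata={"label": label})
    for t, label in zip(skills.text.values, skills.label.values)
]

# Don't modify the original data loaded from disk: turn add_texts into a no-op,
# so that from_documents only opens the existing collection.
Qdrant.add_texts = lambda *args, **kwargs: None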
qdb = Qdrant.from_documents(
    documents[:1],
    embedding_function,
    path=f"qdrant-{m}",
    collection_name="esco",
    force_recreate=False,
)

For each CV sentence, we extract the top k ESCO skills using the cosine similarity between the CV sentence and the ESCO embeddings generated via a SentenceTransformers embedding model.

The `distilbert` model returns a lot of false positives, but its similarity scores are higher.

The `all-MiniLM-L6-v2` and `paraphrase-albert-small-v2` models perform better,
but their similarity scores are lower: 0.25-0.60 for good matches.

In [None]:
from itertools import chain
results = {}
for k_, db in dbs.items():
    model_name = k_.replace("qdrant-", "")
    neural_cv = []
    for sentence in cv_doc:
        txt = str(sentence).strip()
        if not txt:
            continue
        if len(txt.split()) < 5:
            continue

        # Keep at most 10 skills per sentence, dropping matches scored below 0.25.
        hits = db.similarity_search_with_score(txt, k=10, score_threshold=0.25)
        neural_cv.append({
            "text": txt,
            "skills": [
                {"label": doc.metadata["label"], "score": score, "uri": doc.metadata["uri"]}
                for doc, score in hits
            ],
        })

    results[k_] = list(chain(*(x["skills"] for x in neural_cv)))
    Path(f"esco-neural-{k_}.yaml").write_text(
        yaml.dump(neural_cv)
    )

In [None]:
results_dict = {}
for k, items in results.items():
    # Aggregate per skill URI: keep the best score and count the occurrences.
    results_dict[k] = x = {}
    for item in items:
        id_ = item["uri"]
        if id_ not in x:
            x[id_] = {"label": item["label"], "score": item["score"], "count": 1}
        else:
            x[id_]["score"] = max(item["score"], x[id_]["score"])
            x[id_]["count"] += 1
    print(
        "Neural skills found: ",
        k,
        len(x),
        max(x.values(), key=lambda v: v["score"]),
        min(x.values(), key=lambda v: v["score"]),
        sep="\n",
    )


In [None]:
for db in dbs.values():
    db.client.close()