# Similarity-based Competency Extraction Service

* First we install the required packages.

spaCy: an open-source software library for advanced natural language processing, written in Python and Cython.

TensorFlow: a free and open-source software library for machine learning and artificial intelligence.

NumPy: a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

pandas: a software library written for the Python programming language for data manipulation and analysis.

In [1]:
!pip3 install numpy
!pip3 install pandas
!pip3 install -U pip setuptools wheel
!pip3 install -U spacy
!python3 -m spacy download de_core_news_lg
!pip3 install absl-py
!pip3 install --upgrade tensorflow
!pip3 install "tensorflow>=2.0.0"
!pip3 install --upgrade tensorflow-hub
!pip3 install tensorflow_text
!pip3 install --ignore-installed Pillow==9.0.0
!pip3 install xmltodict
!pip3 install markupsafe==2.1.1

* Now we can import these packages.

In [2]:
#@title Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf
import tensorflow_text
import tensorflow_hub as hub
import numpy as np
import pandas as pd
import os
import re  # RegEx
# import seaborn as sns


import spacy
nlp = spacy.load('de_core_news_lg')

import xmltodict
import time

* We load the TensorFlow module for text embedding.  

The main idea of the similarity calculation is to vectorize the texts and then compute the cosine similarity between the resulting vectors.
Machine learning models take vectors as input, and text embeddings give us an efficient, dense representation in which similar texts have similar encodings.  
For this, we use a pre-trained transformer model from TensorFlow Hub.
This specific module is optimized for multi-word text such as sentences, phrases, or short paragraphs.
It converts variable-length text into a 512-dimensional vector.
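To make the vector comparison concrete, here is a minimal sketch of cosine similarity on toy vectors. The helper name `cosine_similarity` and the 3-dimensional vectors are illustrative assumptions; they stand in for the 512-dimensional embeddings and are not part of the service itself.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for 512-dimensional embeddings.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # orthogonal to a

sim_ab = cosine_similarity(a, b)   # vectors pointing the same way
sim_ac = cosine_similarity(a, c)   # vectors with nothing in common
```

Vectors that point in the same direction score close to 1, unrelated ones close to 0, regardless of their length.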

In [3]:
%%time
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3" 
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
  return model(input)

2022-08-06 22:06:49.313037: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


module https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3 loaded
CPU times: user 6.3 s, sys: 458 ms, total: 6.76 s
Wall time: 7.93 s


## Data Preprocessing

We use the ESCO skills dataset as our corpus. You can download the latest version from https://esco.ec.europa.eu/en/use-esco/download

* Import the competency data from the CSV file.

In [6]:
%%time
skills = pd.read_csv('data/skills_de.csv',usecols=['conceptUri','preferredLabel'])

CPU times: user 64.2 ms, sys: 21 ms, total: 85.3 ms
Wall time: 89.8 ms


Note that, for efficiency reasons, we only use the preferred term (the contents of the column "preferredLabel") for our extraction service.

### further: previous label 

In [7]:
skills.head(10)

Unnamed: 0,conceptUri,preferredLabel
0,http://data.europa.eu/esco/skill/0005c151-5b5a...,Musikpersonal verwalten
1,http://data.europa.eu/esco/skill/00064735-8fad...,Strafvollzugsverfahren beaufsichtigen
2,http://data.europa.eu/esco/skill/000709ed-2be5...,nicht unterdrückende Praktiken anwenden
3,http://data.europa.eu/esco/skill/0007bdc2-dd15...,Einhaltung von Vorschriften von Eisenbahnfahrz...
4,http://data.europa.eu/esco/skill/00090cc1-1f27...,verfügbare Dienste ermitteln
5,http://data.europa.eu/esco/skill/000bb1e4-89f0...,toxikologische Studien durchführen
6,http://data.europa.eu/esco/skill/000c94d2-2a2e...,Einheitlichkeit von Kokillen gewährleisten
7,http://data.europa.eu/esco/skill/000f1d3d-220f...,Haskell
8,http://data.europa.eu/esco/skill/001115fb-569f...,Initiative zeigen
9,http://data.europa.eu/esco/skill/001d46db-035e...,Personal mit Blick auf die Verringerung von Le...


In [8]:
labels = skills['preferredLabel'].copy(deep=True)
labels

0                                  Musikpersonal verwalten
1                    Strafvollzugsverfahren beaufsichtigen
2                  nicht unterdrückende Praktiken anwenden
3        Einhaltung von Vorschriften von Eisenbahnfahrz...
4                             verfügbare Dienste ermitteln
                               ...                        
13886    berufliche Leistungsfähigkeit von Nutzern/Nutz...
13887             Beleuchtung in Transportgeräten einbauen
13888                     Verarbeitung natürlicher Sprache
13889                             Bauarbeiten koordinieren
13890         Absturzsicherungen und Bordbretter anbringen
Name: preferredLabel, Length: 13891, dtype: object

We use the spaCy processing pipeline to lemmatize each token. After lemmatization, all punctuation marks are represented as "--" and are then removed.

Since the data were obtained from the official documents of the European Commission, we did not perform data cleaning on them.

In [9]:
def labels_processing(text):
    processed = []
    for item in text:
        item = nlp(item)
        itemString = ''
        for word in item:
            word = word.lemma_.lower() # lemma_: Extract the lemma for each token
            if word != '--' and word != '' and word != ' ':
#                 if word == '\xa0': continue                    
                itemString += word + ' '
        processed.append(itemString[:-1])
    return processed

In [10]:
%%time
processedLabels = labels_processing(labels)

CPU times: user 1min 10s, sys: 1.15 s, total: 1min 12s
Wall time: 1min 19s


* Import the course description data from the XML file.

You can use the course_description_testset.xml file to get the results for the test dataset.

In [11]:
%%time
test = open('data/course_description_testset.xml','r', encoding='ISO-8859-15').read() # read data
test_dict = xmltodict.parse(test) # parse xml
xmlDict = test_dict

Load all course description data from the XML file:

In [12]:
# %%time
# xml_data = open('data/course_description_FOKUS.xml','r', encoding='ISO-8859-15').read() # read data
# xmlDict = xmltodict.parse(xml_data) # parse xml

CPU times: user 8.95 s, sys: 506 ms, total: 9.46 s
Wall time: 15.4 s


In [13]:
%%time
xmlcourses = xmlDict['DEFTISCAT']['COURSETRANSACTIONS']['INSERTCOURSES']['COURSE']

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 7.87 µs


Competencies are often mentioned in course titles, so we combine course_name and course_description.

In [14]:
%%time
coursesDict = []
for course in xmlcourses:
    coursesDict.append({"course_id": course['CS_ID'], 
                        "course_name": course['CS_NAME'],
                 "course_description": course['CS_DESC_LONG']})
courses_table = pd.DataFrame(coursesDict)

CPU times: user 36.6 ms, sys: 3.66 ms, total: 40.3 ms
Wall time: 76.9 ms


Before combining the text, we need to deal with the missing values.

In [15]:
col_null = courses_table.isnull().sum(axis=0)
col_null

course_id              0
course_name            0
course_description    97
dtype: int64

In [16]:
courses_table.replace([None],value='',inplace=True)
courses = courses_table['course_name'] + ' ' + courses_table['course_description']

In [17]:
courses_table

Unnamed: 0,course_id,course_name,course_description
0,4A873264-7ADD-DE47-3039-1FDA692E8164,"""Schwierige"" Klienten? - Mit Patienten, Angehö...",- Analyse strapaziöser Gesprächsmuster\n- Schw...
1,99BEDCEB-4FF3-3F8F-88B7-B30E50638F01,Aktuelles Arbeitsrecht 2022,Kurzbeschreibung\nDas Arbeitsrecht unterliegt ...
2,97FD41FD-1A91-179C-3EA5-6A23E2F436D5,Ambulante Pflege - Rechtssicher Handeln und Ha...,- Grundlagen der straf- und zivilrechtlichen H...
3,A0CC573A-79D7-79A4-5B22-F675C4F06950,Aufgaben des gesetzlichen Betreuers - Zur Refo...,Kurzbeschreibung\nNoch im Jahr 2020 plant der ...
4,7C449EF7-12A8-82FC-0E5E-BD2885FDB93C,Basisqualifikation für ungelernte Pflegekräfte...,- Alten- und Krankenpflege\n . Körperpflege\n...
...,...,...,...
16847,A1E046BE-1C4B-B977-C9B9-88D8D2860107,5 Monate Weiterbildung: Organisation & Führung...,Die aktuell vorherrschende Situation auf dem A...
16848,A1DF8EEB-D317-70BA-508F-AFC732369860,Conversion und Usability Experte,Ziel der Maßnahme ist es den Teilnehmern eine ...
16849,A1DD9D98-2BA3-3A11-0A24-72FC4C79422A,Digital Transformation Management,Ziel der Maßnahme ist es den Teilnehmern eine ...
16850,A1DD60F7-9B6E-ECEE-F1DF-53FED2DCCC5C,E-Commerce Geschäftsmodelle,Ziel der Maßnahme ist es den Teilnehmern eine ...


In [18]:
courses

0        "Schwierige" Klienten? - Mit Patienten, Angehö...
1        Aktuelles Arbeitsrecht 2022 Kurzbeschreibung\n...
2        Ambulante Pflege - Rechtssicher Handeln und Ha...
3        Aufgaben des gesetzlichen Betreuers - Zur Refo...
4        Basisqualifikation für ungelernte Pflegekräfte...
                               ...                        
16847    5 Monate Weiterbildung: Organisation & Führung...
16848    Conversion und Usability Experte Ziel der Maßn...
16849    Digital Transformation Management Ziel der Maß...
16850    E-Commerce Geschäftsmodelle Ziel der Maßnahme ...
16851    Experte im Digital Content Creation Die Teilne...
Length: 16852, dtype: object

We found some parts of the text that were not encoded correctly. Let's correct them now.

In [19]:
%%time
length = len(courses)
for i in range(length):
    courses[i] = courses[i].replace('&#8211;','-')
    courses[i] = courses[i].replace('&#8222;','„')
    courses[i] = courses[i].replace('&#8220;','“')
    courses[i] = courses[i].replace('&#8230;','...')

CPU times: user 541 ms, sys: 6.35 ms, total: 548 ms
Wall time: 591 ms
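The four replacements above target HTML numeric character references. As a hedged alternative, the standard library's `html.unescape` decodes every such reference in one call. The helper name `decode_entities` and the sample string are assumptions for illustration.

```python
import html

def decode_entities(text):
    """Decode HTML character references such as '&#8211;' into Unicode."""
    return html.unescape(text)

# Hypothetical sample containing the four references handled above.
sample = "Teil 1 &#8211; &#8222;Grundlagen&#8220; &#8230;"
decoded = decode_entities(sample)
```

Note one difference from the loop above: `html.unescape` yields the actual Unicode characters, so `&#8211;` becomes an en dash rather than a plain hyphen and `&#8230;` becomes a single ellipsis character rather than three dots. If the plain ASCII forms matter for the later dash handling, the explicit replacements remain the safer choice.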


Now we split each description into sentences.
We pass the data through the spaCy pipeline and remove special symbols.  

In [20]:
%%time
def To_sentences(courses):
    courses_sent = []
    courses_sent_lemma = []
    for course in (courses):
        with nlp.select_pipes(disable=['attribute_ruler', 'ner']):
            doc = nlp(str(course))
            sentences = []
            for sent in doc.sents:
                course_processed = ''
                sents = []
                for word in sent:
                    word = word.text
                    word = word.replace('\n', ' ')
                    word = word.strip().strip('"@#$%^&*§')
                    if word == '-':
                        if course_processed != '':
                            sents.append(course_processed[:-1])
                            course_processed = ''
                        continue
                    if word.startswith('- ') and len(word) > 1:
                        word = word.replace('- ', '')
                        if course_processed != '':
                            sents.append(course_processed[:-1])
                            course_processed = ''
                    if (not re.match(r'\s+', word, re.I)) and word != '--' and word != '':
                        if "/-" in word: 
                            word = word.split("/-")[0]
                        if re.match(r'\,|\.|\?|\!|\)', word):
                            course_processed = course_processed[:-1] + word + ' '
                        elif re.match(r'\(',word):
                            course_processed += word
                        else:
                            course_processed += word + ' '
                sentences += sents
                if course_processed != '':
                    sentences.append(course_processed[:-1])
            courses_sent_lemma.append(sentences)
    return courses_sent_lemma

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 9.06 µs


You can try and see the processing result for the first course:

In [21]:
course_in_sentences_test = To_sentences(courses.iloc[:1])
print(courses.iloc[0])
print(course_in_sentences_test)

"Schwierige" Klienten? - Mit Patienten, Angehörigen und Kollegen clever kommunizieren - Analyse strapaziöser Gesprächsmuster
- Schwierige Klienten oder schwierige Situation?
- Was heißt es schwierige Gespräche souverän zu meistern?
- Übungen mit Videofeedback werden durchgeführt!
[['Schwierige Klienten?', 'Mit Patienten, Angehörigen und Kollegen clever kommunizieren', 'Analyse strapaziöser Gesprächsmuster', 'Schwierige Klienten oder schwierige Situation?', 'Was heißt es schwierige Gespräche souverän zu meistern?', 'Übungen mit Videofeedback werden durchgeführt!']]


To increase the precision of the TensorFlow model's results, each course description is split into chunks that contain the key words of competencies.

First we add 'merge_noun_chunks' to the spaCy processing pipeline so that each noun chunk is treated as a single token.
For more information see https://spacy.io/api/pipeline-functions#merge_noun_chunks

In [22]:
nlp.add_pipe('merge_noun_chunks')

<function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

In [23]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'morphologizer', 'parser', 'lemmatizer', 'attribute_ruler', 'ner', 'merge_noun_chunks']


Then we extract the noun phrase chunks from each sentence.  
Chunking identifies groups of words that belong together and form a unit of meaning.  
It builds on part-of-speech (POS) tagging, which describes the grammatical function of each word.

Chunking is a critical step in our approach because it lets us split the course description into a collection of phrases that can be compared with the competencies.
Finally, we remove stop words, punctuation, and other special symbols such as dashes and slashes, and lemmatize the remaining words.


In [24]:
%%time
def Segment_processing(part_of_the_courses):
    
    # to sentences
    course_in_sentences = To_sentences(part_of_the_courses)
    
    # to chunks
    courses_chunks = []
    for course in course_in_sentences:
        course_chunks = []
        for sent in course:
            sent_chunks = []
            doc = nlp(sent)
            for tok in doc:
                with nlp.select_pipes(disable=['merge_noun_chunks','attribute_ruler', 'ner']):
                    if not re.match(r'[(),.!?/]', tok.text) and not tok.is_stop:
                        tok = nlp(tok.text)
                        words = ''
                        for word in tok:
                            if not word.is_stop and word.lemma_ != '--':
                                words = words + word.lemma_ + ' '  
                        sent_chunks.append(words[:-1])
            course_chunks += sent_chunks
        courses_chunks.append(course_chunks)
    return courses_chunks


CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 8.34 µs


For example, the processing result for the first course is as follows:

In [25]:
%%time
courses_chunks_test = Segment_processing(courses.iloc[:1])
print(course_in_sentences_test)
print(courses_chunks_test)

[['Schwierige Klienten?', 'Mit Patienten, Angehörigen und Kollegen clever kommunizieren', 'Analyse strapaziöser Gesprächsmuster', 'Schwierige Klienten oder schwierige Situation?', 'Was heißt es schwierige Gespräche souverän zu meistern?', 'Übungen mit Videofeedback werden durchgeführt!']]
[['schwierige Klient', 'Patient', 'angehörige', 'Kollege', 'clever', 'kommunizieren', 'Analyse', 'strapaziös Gesprächsmuster', 'schwierig Klient', 'schwierig Situation', 'schwierig Gespräch', 'souverän', 'Meister', 'Übung', 'videofeedback', 'durchführen']]
CPU times: user 637 ms, sys: 846 ms, total: 1.48 s
Wall time: 2.54 s


## Embedding Phase

After the text has been processed, we embed it as vectors.  
For efficiency, we handle the course descriptions in batches (cf. the function Save_results).

In [26]:
def Course_Embedding(courses_chunks):
    courses_chunks_embed = []
    for course in courses_chunks:
        courses_chunks_embed.append(embed(course))
    return courses_chunks_embed

In [27]:
%%time
labels_embed = embed(processedLabels)

CPU times: user 1min 41s, sys: 33.5 s, total: 2min 14s
Wall time: 59.1 s


## Similarity

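We calculate the cosine similarity between the embedded competency labels and the embedded word chunks of the course descriptions. The code uses a plain dot product, which equals the cosine similarity only if the embedding vectors have unit length.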

In [28]:
%%time
def Similarity(labels, courses):
    result = []
    for course in courses:
        result.append(np.dot(labels, np.transpose(course)))
    return result

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 8.34 µs
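The following sketch shows the same matrix product on toy data, with an explicit row normalisation so that the result is a cosine similarity whatever the vector lengths, followed by thresholding with `np.where` in the way Get_relations does. The helper name `cosine_matrix` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def cosine_matrix(label_vecs, chunk_vecs):
    """Return a (num_labels, num_chunks) matrix of cosine similarities."""
    labels = np.asarray(label_vecs, dtype=float)
    chunks = np.asarray(chunk_vecs, dtype=float)
    # Normalise rows explicitly so the dot product is a true cosine.
    labels = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    chunks = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    return labels @ chunks.T

# Hypothetical toy data: 2 label vectors and 3 chunk vectors.
toy_labels = [[1.0, 0.0], [0.0, 2.0]]
toy_chunks = [[3.0, 0.0], [1.0, 1.0], [0.0, 5.0]]

sim = cosine_matrix(toy_labels, toy_chunks)

# np.where on a 2-D array returns one index array per axis:
# row indices (labels) and column indices (chunks).
label_idx, chunk_idx = np.where(sim > 0.85)
matches = list(zip(label_idx.tolist(), chunk_idx.tolist()))
```

Each pair in `matches` is a (label index, chunk index) combination whose similarity exceeds the threshold.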


In [29]:
%%time
def Get_relations(relations,skills,courses_table,course_idx,benchmark,courses_chunks):

    idx = 0
    counter = 0

    df_relations = pd.DataFrame(columns = ['conceptUri'])
    for relationship in relations:
        print('course: ', course_idx)
        competency_idx = np.where(relationship>benchmark) 

        print('numbers of extracted competency: ', len(competency_idx[0]))
        for j in range(len(competency_idx[0])):
            print(processedLabels[competency_idx[0][j]],'--', courses_chunks[idx][competency_idx[1][j]])
            
        competency = set(skills.loc[list(competency_idx[0]),'conceptUri'])
        competency_uri = pd.DataFrame(competency,columns=['conceptUri'], index=[courses_table.iloc[course_idx]['course_id']]*len(competency)) 

        print(' ------')
        print('\n')

        df_relations = pd.concat([df_relations,competency_uri])
    
        course_idx += 1 
        
#         test:
        idx += 1
        counter += len(competency)
    print('number of extracted competency', counter)
    print('\n', df_relations)

    return df_relations

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 9.06 µs


The performance of this method depends strongly on the threshold value.
In principle, one would set different thresholds for deciding whether two texts are similar, compare the results, and pick the threshold that yields the highest F-score.
That F-score could then be used to evaluate our method.
However, as we said in the section on the evaluation method, we do not have a reliable test dataset and thus no F-score.
Instead, we tried several thresholds and made a manual selection based on the output.
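If a labelled test set were available, the threshold search described above could look roughly like this. Everything in the sketch is hypothetical: the helper names, the toy scores, and the hand-labelled gold pairs are assumptions, not data or results from this project.

```python
def f1_score(predicted, expected):
    """F1 of two sets of (course, competency) pairs."""
    true_pos = len(predicted & expected)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, expected, candidates):
    """Return (threshold, f1) with the highest F1 among the candidates."""
    results = []
    for threshold in candidates:
        predicted = {pair for pair, score in scores.items() if score > threshold}
        results.append((threshold, f1_score(predicted, expected)))
    return max(results, key=lambda item: item[1])

# Hypothetical similarity scores and hand-labelled gold pairs.
toy_scores = {
    ("c1", "s1"): 0.92,
    ("c1", "s2"): 0.81,
    ("c2", "s3"): 0.88,
    ("c2", "s4"): 0.60,
}
toy_gold = {("c1", "s1"), ("c2", "s3")}

threshold, score = best_threshold(toy_scores, toy_gold, [0.5, 0.7, 0.85, 0.9])
```

A low threshold admits false positives and lowers precision, while a high one drops correct pairs and lowers recall; the F-score balances the two.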

In [30]:
def Competence_extraction(courses,course_start_idx,labels,benchmark):
    courses_chunks = Segment_processing(courses)
    courses_embed = Course_Embedding(courses_chunks)
    calculated_results = Similarity(labels, courses_embed)
    return Get_relations(calculated_results,skills,courses_table,course_start_idx,benchmark,courses_chunks)

In [31]:
%%time
def Save_results(courses,benchmark):
    length = len(courses)
    n = length//500
    for i in range(n+1):
        start = i * 500
        if i == n:
          end = length
        else:
          end = (i+1) * 500
        courses_div = courses.iloc[start:end]
        relations = Competence_extraction(courses_div,start,labels_embed,benchmark)
        relations.to_csv(path_or_buf="output/relations_part_"+str(i)+".csv" ,columns=['conceptUri'], index_label='course_id')



CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs
Wall time: 10 µs
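The batch boundaries in Save_results can also be derived from a stepped `range`, as in the sketch below. The helper `batch_ranges` is an assumption for illustration and not part of the notebook.

```python
def batch_ranges(total, batch_size):
    """Yield (batch_number, start, end) for consecutive, non-empty batches."""
    for number, start in enumerate(range(0, total, batch_size)):
        yield number, start, min(start + batch_size, total)

# 1200 items leave a shorter final batch; 1000 items divide evenly.
ranges_uneven = list(batch_ranges(1200, 500))
ranges_exact = list(batch_ranges(1000, 500))
```

One difference worth noting: with `n = length//500` and `range(n+1)`, a length that is an exact multiple of 500 produces a final empty batch, whereas this variant does not. If it were used, the merge step would have to iterate over the same number of parts.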


In [None]:
Save_results(courses,benchmark=0.85)

Merge the CSV files:

In [None]:
length = len(courses)
n = length//500
df_relations_merge = pd.DataFrame()
for i in range(n+1):  
    df = pd.read_csv('output/relations_part_' + str(i) + '.csv',index_col="course_id")
    df_relations_merge = pd.concat([df_relations_merge,df])
df_relations_merge.to_csv(path_or_buf="output/relations.csv",columns=['conceptUri'], index_label='course_id')
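As a variation, the part files can be discovered on disk instead of being recomputed from the number of courses, which keeps the merge independent of the batch size. This is a sketch under the assumption that the files follow the `relations_part_<i>.csv` naming used above; the helper `merge_relation_parts` is hypothetical.

```python
import glob
import os
import re

import pandas as pd

def merge_relation_parts(directory):
    """Concatenate all relations_part_<i>.csv files in numeric order."""
    pattern = os.path.join(directory, "relations_part_*.csv")
    # Sort by the numeric suffix so that part 10 comes after part 2.
    paths = sorted(
        glob.glob(pattern),
        key=lambda p: int(re.search(r"_(\d+)\.csv$", p).group(1)),
    )
    frames = [pd.read_csv(path, index_col="course_id") for path in paths]
    return pd.concat(frames)
```

Calling `merge_relation_parts("output")` would return one DataFrame indexed by course_id, which can then be written with `to_csv` as above. Building the list of frames first and concatenating once also avoids repeated copying inside the loop.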