# OLAF: Microsoft Teams Pipeline

In this demo, we build a pipeline from components of the OLAF library. We use the data in the folder (../teams_doc_250/), preprocess it by filtering out stopwords, punctuation, numbers and URLs, and assemble the following pipeline components: term extraction (nouns), term enrichment (with synonyms), concept extraction and hierarchisation, and relation extraction (based on verbs).

In [1]:
import spacy

In [2]:
# Import all necessary items from the olaf package
from olaf import Pipeline
from olaf.pipeline.data_preprocessing import TokenSelectorDataPreprocessing, TokenSelectorDataPreprocessingConfig  # Delete TokenSelectorDataPreprocessingConfig after the merge
from olaf.commons.spacy_processing_tools import is_not_stopword, is_not_punct, is_not_num, is_not_url
from olaf.repository.corpus_loader import JsonCorpusLoader
from olaf.pipeline.pipeline_component.term_extraction import POSTermExtraction
from olaf.pipeline.pipeline_component.concept_relation_extraction import (
    CTsToConceptExtraction,
    CTsToRelationExtraction,
    SynonymConceptExtraction,
)
from olaf.pipeline.pipeline_component.candidate_term_enrichment import KnowledgeBasedCTermEnrichment
from olaf.pipeline.pipeline_component.concept_relation_hierarchy import SubsumptionHierarchisation
from olaf.repository.serialiser import BaseOWLSerialiser



We load the spaCy language model that matches the language of the corpus, here the small French model.

In [3]:
# Load the spaCy language model matching the corpus language (it must be downloaded in the virtual environment beforehand)
spacy_model = spacy.load("fr_core_news_sm") 

### LOADING DATA

To load the data, we use the JsonCorpusLoader, which reads the JSON files in the folder and adds the content of the chosen field to the corpus.

In [4]:
# Creating an instance of JsonCorpusLoader with the path to the JSON files and the field name wanted
corpus_loader = JsonCorpusLoader(corpus_path="../data/teams_doc_250/", json_field="description")

# Load the data using JsonCorpusLoader
text_corpus = corpus_loader._read_corpus()
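
For readers who do not have the OLAF loader at hand, the idea of reading one field from every JSON file in a folder can be sketched with the standard library alone. This is only an illustration, not the OLAF implementation: the function name `read_json_field` is hypothetical, and the sketch assumes that each file holds a single JSON object.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def read_json_field(corpus_path: str, json_field: str) -> List[str]:
    """Collect the value of `json_field` from every .json file in a folder.

    Hypothetical helper: assumes one JSON object per file.
    Files that do not contain the field are skipped.
    """
    texts = []
    for file_path in sorted(Path(corpus_path).glob("*.json")):
        with file_path.open(encoding="utf-8") as json_file:
            content = json.load(json_file)
        if isinstance(content, dict) and json_field in content:
            texts.append(content[json_field])
    return texts


# Self-contained demonstration on a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "a.json").write_text(
        json.dumps({"description": "Premier document"}), encoding="utf-8"
    )
    Path(tmp_dir, "b.json").write_text(
        json.dumps({"title": "no description field"}), encoding="utf-8"
    )
    demo_texts = read_json_field(tmp_dir, "description")

print(demo_texts)
```

Skipping files that lack the field keeps the corpus a plain list of strings, which is the shape `spacy_model.pipe` expects.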

The corpus contains 241 documents, but we only use the first 10 for this pipeline. The following cell processes them with the spaCy model:

In [5]:
print(len(text_corpus))
partial_corpus = list(spacy_model.pipe(text_corpus[0:10]))
print(len(partial_corpus))

241
10


### DATA PREPROCESSING

We start by preprocessing the data, filtering out all stopwords, punctuation, numbers and URLs.

In [6]:
token_selector_params = {
    "selectors": [is_not_num, is_not_url, is_not_punct, is_not_stopword],
    "token_sequence_doc_attribute": "selected_tokens"
}

# To be modified with the updated TokenSelectorDataPreprocessing
default_config = TokenSelectorDataPreprocessingConfig()

# Creating a list of preprocessing components
data_prep = [TokenSelectorDataPreprocessing(
    selector=lambda token: all(selector(token) for selector in token_selector_params["selectors"]), config=default_config
)]


# Creating a list of preprocessing components After the merge
#data_prep = [TokenSelectorDataPreprocessing(
#    selector=lambda token: all(selector(token) for selector in token_selector_params["selectors"])
#)]
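
The `all(...)` expression used in the selector above keeps a token only when every individual selector accepts it. The sketch below isolates that logic so it can be run without OLAF or a spaCy model. Everything in it is an assumption made for illustration: `DemoToken` is a hypothetical stand-in for a spaCy token, and the `demo_*` predicates only mimic the OLAF selectors, they are not their actual code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class DemoToken:
    """Hypothetical stand-in for a spaCy token, reduced to the flags used here."""
    text: str
    is_stop: bool = False
    is_punct: bool = False
    like_num: bool = False
    like_url: bool = False


# Illustrative predicates (not the OLAF implementations): True means "keep the token"
def demo_is_not_stopword(token: DemoToken) -> bool:
    return not token.is_stop


def demo_is_not_punct(token: DemoToken) -> bool:
    return not token.is_punct


def demo_is_not_num(token: DemoToken) -> bool:
    return not token.like_num


def demo_is_not_url(token: DemoToken) -> bool:
    return not token.like_url


def combine_selectors(
    selectors: Iterable[Callable[[DemoToken], bool]]
) -> Callable[[DemoToken], bool]:
    """Build one predicate that is True only if every selector is True."""
    selectors = list(selectors)
    return lambda token: all(selector(token) for selector in selectors)


tokens = [
    DemoToken("les", is_stop=True),
    DemoToken("réunion"),
    DemoToken(",", is_punct=True),
    DemoToken("42", like_num=True),
    DemoToken("https://example.org", like_url=True),
    DemoToken("équipe"),
]

keep_token = combine_selectors(
    [demo_is_not_num, demo_is_not_url, demo_is_not_punct, demo_is_not_stopword]
)
selected_texts = [token.text for token in tokens if keep_token(token)]
print(selected_texts)
```

Because `all` short-circuits, a token is rejected as soon as one selector fails, so cheap selectors can be listed first.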





### TERM EXTRACTION
Next, we extract candidate terms based on POS tagging, keeping only nouns.

In [7]:
term_extract_params_concepts = {
    "token_sequence_doc_attribute": "selected_tokens",
    "pos_selection": ["NOUN"]
}

concept_term_extraction = POSTermExtraction(parameters=term_extract_params_concepts)
concept_extraction = CTsToConceptExtraction(parameters={"concept_max_distance": 5})



### RELATION EXTRACTION

In [8]:
term_extract_params_relations = {
    "token_sequence_doc_attribute": "selected_tokens",
    "pos_selection": ["VERB"],
}

relation_term_extraction = POSTermExtraction(parameters=term_extract_params_relations)
relation_extraction = CTsToRelationExtraction(parameters={"concept_max_distance": 5})
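
The two POS-based extractions above differ only in the tags they select: nouns for concept candidates, verbs for relation candidates. The following minimal sketch shows that selection on hand-tagged tokens, so it runs without spaCy. The helper `extract_terms_by_pos` and the sample data are hypothetical and are not part of OLAF; the tags follow the Universal POS names that spaCy exposes.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def extract_terms_by_pos(
    tagged_tokens: Iterable[Tuple[str, str]], pos_selection: List[str]
) -> Counter:
    """Count the lower-cased tokens whose POS tag is in `pos_selection`."""
    allowed = set(pos_selection)
    return Counter(text.lower() for text, pos in tagged_tokens if pos in allowed)


# Hand-tagged sample, standing in for the output of a spaCy model
tagged = [
    ("réunion", "NOUN"),
    ("planifier", "VERB"),
    ("équipe", "NOUN"),
    ("Réunion", "NOUN"),
    ("rapidement", "ADV"),
]

noun_terms = extract_terms_by_pos(tagged, ["NOUN"])  # concept candidates
verb_terms = extract_terms_by_pos(tagged, ["VERB"])  # relation candidates
print(noun_terms)
print(verb_terms)
```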



### PIPELINE SETUP

We can now create a pipeline with the components we created.

In [9]:
# Create the Pipeline object with the components instantiated above
teams_demo_pipeline = Pipeline(
    spacy_model=spacy_model,
    preprocessing_components=data_prep,
    pipeline_components=[
        concept_term_extraction,
        concept_extraction
    ],
    corpus=partial_corpus  # already processed by spacy_model.pipe above
)

The other components created can also be added to the pipeline.

In [10]:
teams_demo_pipeline.add_pipeline_component(relation_term_extraction)
teams_demo_pipeline.add_pipeline_component(relation_extraction)

In [11]:
# The knowledge representation is empty before the pipeline runs
teams_demo_pipeline.kr

KnowledgeRepresentation(concepts=set(), relations=set(), metarelations=set())

In [12]:
# Run the pipeline created above
teams_demo_pipeline.run()

AttributeError: 'spacy.tokens.token.Token' object has no attribute 'start'

In [None]:
# Now has concepts
teams_demo_pipeline.kr

In [None]:
# No concepts found
# Print out the concepts found
for concept in teams_demo_pipeline.kr.concepts:
    print(concept.label)

In [None]:
# Print out the relations found
for relation in teams_demo_pipeline.kr.relations:
    print(relation.label)

# NOTE: the preprocessing was not taken into account

In [None]:
pos_candidate_terms = concept_term_extraction._extract_candidate_tokens(token_sequences=teams_demo_pipeline.corpus)
occurence_candidate_terms = concept_term_extraction._build_term_corpus_occ_map(pos_candidate_terms)

In [None]:
print(len(teams_demo_pipeline.candidate_terms))

In [None]:
# Remove wrong candidate terms
candidate_terms_to_remove = ["cc", "pouvoir", "bit", ">", "<", "oui"]

# Merge candidate terms
candidate_terms = [candidate_term for candidate_term in pos_candidate_terms if candidate_term in occurence_candidate_terms and candidate_term not in candidate_terms_to_remove]

# Print the number of candidate terms found
print(f"{len(candidate_terms)} candidate terms have been found")

### TERM ENRICHMENT
Knowledge-based term enrichment: candidate terms are enriched with their synonyms.

The components created in the following sections can be added to the pipeline defined above. Running the pipeline again then yields the candidate terms, concepts and relations we are looking for.

In [None]:
term_enrichment = KnowledgeBasedCTermEnrichment(teams_demo_pipeline.candidate_terms)

In [None]:
teams_demo_pipeline.add_pipeline_component(term_enrichment)

### CONCEPT EXTRACTION

In [None]:
concept_extraction = SynonymConceptExtraction(teams_demo_pipeline.candidate_terms)

In [None]:
teams_demo_pipeline.add_pipeline_component(concept_extraction)
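
To give an intuition of synonym-based concept extraction, the sketch below groups candidate terms whose synonym sets overlap, directly or through a chain of shared synonyms. It rests on an assumption, namely that a concept can be approximated by such a group of terms. It does not reproduce the OLAF `SynonymConceptExtraction` code, and the function name and the sample synonyms are hypothetical.

```python
from typing import Dict, List, Set


def group_terms_by_synonyms(term_synonyms: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group terms whose vocabularies (term plus synonyms) overlap, even transitively."""
    groups = []  # list of (terms, vocabulary) pairs with pairwise disjoint vocabularies
    for term, synonyms in term_synonyms.items():
        vocabulary = {term, *synonyms}
        merged_terms = {term}
        remaining = []
        for terms, vocab in groups:
            if vocab & vocabulary:
                # Shared word: absorb the existing group into the new one
                merged_terms |= terms
                vocabulary |= vocab
            else:
                remaining.append((terms, vocab))
        remaining.append((merged_terms, vocabulary))
        groups = remaining
    return [terms for terms, _ in groups]


# Hypothetical enriched candidate terms: term -> synonyms
enriched_terms = {
    "réunion": {"rencontre"},
    "rencontre": {"entrevue"},
    "équipe": {"groupe"},
    "fichier": set(),
}

concept_groups = group_terms_by_synonyms(enriched_terms)
for group in concept_groups:
    print(sorted(group))
```

Here "réunion" and "rencontre" end up in the same group because the first lists the second as a synonym, while "équipe" and "fichier" each remain alone.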

### CONCEPT HIERARCHY

In [None]:
concept_hierarchy = SubsumptionHierarchisation()
concept_hierarchy._is_sub_hierarchy()

In [None]:
teams_demo_pipeline.add_pipeline_component(concept_hierarchy)
teams_demo_pipeline.run()

In [None]:
# Now has concepts
teams_demo_pipeline.kr

In [None]:
# Print out the concepts found
for concept in teams_demo_pipeline.kr.concepts:
    print(concept.label)

### SERIALISER

To save the results of this pipeline, we use a serialiser to export the results in turtle (".ttl") format.

In [None]:
# Instantiating serialiser
teams_kr_serialiser = BaseOWLSerialiser("http://teams_kr.org/")

In [None]:
# Build the RDF graph from the olaf pipeline KnowledgeRepresentation
teams_kr_serialiser.build_graph(teams_demo_pipeline.kr)

In [None]:
# Export the RDF graph to the given file path in the default format (Turtle)
teams_kr_serialiser.export_graph("teams_kr.ttl")