# From an RDF KG to an entity matcher (fuzzy version)

The goal here is to automatically construct an Entity Matcher based on an RDF Knowledge Graph (KG). 
For this demo, we built a sample knowledge graph of the pizzas offered by the restaurant Le bisou in Rouen, France: <https://www.bisourouen.fr/>

The entity matcher used in this demo supports fuzzy matching. Two matchers are constructed: one is rule-based and links the KG entities in a text solely by matching the raw strings; the other uses a fuzzy matching measure to link the KG entities in a text, so it can also catch misspelled mentions.
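
To make the difference between the two matchers concrete, here is a minimal sketch that compares a raw string comparison with a fuzzy one. It uses `difflib.SequenceMatcher` from the Python standard library purely as an illustration; it is not necessarily the measure used by the entity matcher in this demo. The label `"ricotta cream"`, the helper `similarity` and the `0.85` threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


label = "ricotta cream"   # hypothetical KG label
mention = "ricota cream"  # misspelled mention, as in the first review

# raw string matching: the misspelling is missed
exact_match = label.lower() == mention.lower()

# fuzzy matching: accept the mention if it is similar enough to the label
# (0.85 is an arbitrary threshold chosen for this sketch)
fuzzy_match = similarity(label, mention) >= 0.85

print(exact_match, fuzzy_match)
```

The threshold is the main trade-off: a lower value tolerates more misspellings but also risks linking mentions to the wrong entity.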

The input KG is expressed in RDF. For simplicity, we focus only on English.

In [None]:
from pathlib import Path

import spacy

from buzz_el.graph import RDFGraphLoader
from buzz_el.entity_matcher import EntityMatcher

In [None]:
# path to the KG RDF file
pizza_kg_filepath = Path("./data/pizzas_bisou_sample.ttl")

In [None]:
# Example documents with misspellings
reviews = [
    """Absolutely delightful! The Bianca Castafiore pizza with its ricota cream base offers a perfect blend of flavors. 
    I love the options of goat chese and vegetarian toppings—especially the combination of black peper, goat cheese, 
    honey, fior di latte mozzarella, spinach, and walnuts. A true treat for the taste buds!""",
    """A flavor explosion! The BurraTadah pizza, with its tomato base and pork pizza category, is a must-try. 
    The combination of black pepper, cherry tomatoes, mozzarella di burrata, fior di late mozarela, olive oil, 
    Parma ham, Parmesan cheese, and rocket creates a symphony of deliciousness. Simply irresistible!""",
    """Fit for royalty! The God Save The King pizza, featuring a tomato base and exquisite pork toppings, is a culinary masterpiece. 
    The Parisian mushroms carpaccio, ham from Paris with herbs, fior di latte mozzarella, and olves create a regal flavor profile. 
    A royal feast for the senses!"""
]

In [None]:
# create a minimal spaCy Language (i.e., a pipeline);
# we exclude the components we do not need to keep the pipeline as small as possible for this demo
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"],
)

corpus = list(nlp.pipe(reviews))

The main goal of the graph loader is to construct an index of the entity URIs associated with their representative strings.

By default, the graph loader considers as an entity anything that is the subject of a triple containing a labelling property, and it considers the object of such a triple to be a representative string.

You can provide your own annotation properties. If a property is not a common one (i.e., one for which RDFLib has a namespace defined), you need to provide its full URI. By default, the loader uses rdfs:label.

You can also filter using a language tag.
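
The indexing behaviour described above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the actual `RDFGraphLoader` implementation: triples are modelled as hypothetical `(subject, predicate, value, language tag)` tuples, and the `http://example.org/pizza#` namespace, the example entities and the helper `build_label_index` are invented for the example.

```python
from collections import defaultdict

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
EX = "http://example.org/pizza#"  # hypothetical namespace

# hypothetical triples: (subject URI, predicate URI, value, language tag)
triples = [
    (EX + "BurraTadah", RDFS_LABEL, "BurraTadah", "en"),
    (EX + "Rocket", RDFS_LABEL, "rocket", "en"),
    (EX + "Rocket", SKOS_ALT_LABEL, "arugula", "en"),
    (EX + "Rocket", RDFS_LABEL, "roquette", "fr"),
    (EX + "Rocket", EX + "isToppingOf", EX + "BurraTadah", None),
]


def build_label_index(triples, label_properties=(RDFS_LABEL,), lang=None):
    """Map each entity URI to the set of its representative strings."""
    index = defaultdict(set)
    for subject, predicate, value, tag in triples:
        if predicate not in label_properties:
            continue  # not a labelling property: ignore the triple
        if lang is not None and tag != lang:
            continue  # label in another language: filtered out
        index[subject].add(value)
    return dict(index)


index = build_label_index(triples, {RDFS_LABEL, SKOS_ALT_LABEL}, lang="en")
for uri, labels in index.items():
    print(uri, sorted(labels))
```

With the language filter set to `"en"`, the French label is dropped, and the `isToppingOf` triple never contributes a representative string because its predicate is not a labelling property.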

In [None]:
# Build a graph loader
kg_loader = RDFGraphLoader(
    kg_file_path=pizza_kg_filepath,
    label_properties={"skos:altLabel", "rdfs:label", "skos:prefLabel"}, # annotation properties used as labels in the RDF KG
    lang_filter_tag="en" # optionally define a language to focus on
)

pizza_kg = kg_loader.build_knowledge_graph()

In [None]:
# build an entity matcher with fuzzy option
entity_matcher = EntityMatcher(
    knowledge_graph=pizza_kg,
    spacy_model=nlp,
    use_fuzzy=True
)

In [None]:
# process your documents
corpus = list(entity_matcher.pipe(corpus))

In [None]:
# The spans are stored in the Doc.spans attribute:
# under the "string" key for string matching and under the "fuzzy" key for fuzzy matching.
corpus[0].spans

In [None]:
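# namespace of the demo KG entities, shortened to "pizza:" for display
PIZZA_NS = "http://www.msesboue.org/o/pizza-data-demo/bisou#"

for doc in corpus:
    print(doc)
    for span in doc.spans["fuzzy"]:
        # print the matched text, its token offsets and the linked entity URI
        print(span.text, span.start, span.end, span.id_.replace(PIZZA_NS, "pizza:"))
    print()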