# From an RDF KG to an entity matcher

The goal here is to automatically construct an entity matcher from an RDF Knowledge Graph (KG), in this case the well-known pizza ontology (the original version is available at <https://github.com/owlcs/pizza-ontology>).

The produced entity matcher is effectively a spaCy span ruler. The matcher is therefore rule-based and focuses solely on linking KG entities to a text by matching raw strings.

The input KG is expressed in RDF, so we rely on its annotation properties to construct our entity matching rule set. For simplicity, we will also focus on English first.
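
To make the span ruler idea concrete, here is a minimal sketch written directly against spaCy (assuming spaCy 3.3 or later), independent of `buzz_el`. The two patterns, the example sentence, and the label `KG_ENTITY` are hypothetical and chosen only for illustration; the rules actually generated by the entity matcher may look different.

```python
import spacy

# Namespace of the pizza ontology, as it appears in the URIs of this notebook.
PIZZA = "http://www.co-ode.org/ontologies/pizza/pizza.owl#"

# A blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# The span ruler writes its matches to doc.spans["ruler"] by default.
ruler = nlp.add_pipe("span_ruler")

# One phrase pattern per representative string; "id" carries the entity URI.
# The label "KG_ENTITY" is a hypothetical placeholder.
ruler.add_patterns(
    [
        {"label": "KG_ENTITY", "pattern": "tomato", "id": PIZZA + "TomatoTopping"},
        {"label": "KG_ENTITY", "pattern": "mozzarella", "id": PIZZA + "MozzarellaTopping"},
    ]
)

doc = nlp("Topped with tomato sauce and fresh mozzarella cheese.")

# Collect the matched text, token offsets, and linked entity URI of each span.
matches = [(span.text, span.start, span.end, span.id_) for span in doc.spans["ruler"]]
for match in matches:
    print(match)
```

In plain spaCy, string patterns are compared against the verbatim token text unless `phrase_matcher_attr` is set, so the pattern `"tomato"` above would not match `"Tomato"`. Setting `phrase_matcher_attr` to `"LOWER"` in the ruler config makes the comparison case-insensitive.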

In [3]:
import spacy
from random import sample

from buzz_el.graph_loader import GraphLoader
from buzz_el.entity_matcher import EntityMatcher

In [4]:
# path to the KG RDF file
pizza_onto_filepath = "./data/pizza.ttl"

In [5]:
# Example documents
docs = [
    "Margherita Pizza: The classic Italian pizza, topped with tomato sauce, fresh mozzarella cheese, fresh basil leaves, and a drizzle of olive oil.",
    "Pepperoni Pizza: A beloved American favorite, topped with tomato sauce, mozzarella cheese, and slices of pepperoni, which are a type of spicy salami.",
    "Hawaiian Pizza: A controversial choice, featuring tomato sauce, mozzarella cheese, ham, and pineapple. The sweet and salty combination is either loved or loathed by pizza enthusiasts.",
    "BBQ Chicken Pizza: A unique twist, with barbecue sauce instead of tomato sauce, topped with mozzarella cheese, grilled chicken, red onions, and sometimes, cilantro.",
    "Supreme Pizza: Packed with toppings, this pizza typically includes tomato sauce, mozzarella cheese, pepperoni, sausage, bell peppers, onions, olives, and mushrooms.",
    "Vegetarian Pizza: Perfect for those who prefer a meatless option, this pizza includes tomato sauce, mozzarella cheese, and a variety of vegetables such as bell peppers, onions, tomatoes, olives, and mushrooms."
]

In [6]:
# create a minimum spaCy Language (i.e., a pipeline)
nlp = spacy.load(
    "en_core_web_sm",
    # we make the pipeline as small as possible for our little demo
    exclude=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"],
)

corpus = list(nlp.pipe(docs))

The main goal of the graph loader is to build an index that maps each entity URI to its representative strings.

By default, the graph loader treats as an entity anything that is the subject of a triple whose predicate is an annotation property, and it treats the object of such a triple as a representative string.

You can provide your own annotation properties. If a property is not a common one (i.e., one for which RDFLib has a namespace defined), you need to provide its full URI. By default, the loader uses rdfs:label.

You can also filter using a language tag.
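
The indexing logic described above can be sketched in a few lines of plain Python. This is not the `GraphLoader` implementation: the triples are hand-written toy data represented as tuples instead of being parsed with RDFLib, the function name `build_entity_index` is hypothetical, and dropping untagged literals when a language filter is set is an assumption of this sketch.

```python
from collections import defaultdict

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
PIZZA = "http://www.co-ode.org/ontologies/pizza/pizza.owl#"

# Toy triples: (subject URI, predicate URI, literal text, language tag or None).
triples = [
    (PIZZA + "CheeseTopping", RDFS_LABEL, "CheeseTopping", "en"),
    (PIZZA + "CheeseTopping", SKOS_ALT_LABEL, "Cheese", "en"),
    (PIZZA + "CheeseTopping", RDFS_LABEL, "CoberturaDeQueijo", "pt"),
    (PIZZA + "CheeseTopping", RDFS_COMMENT, "Any cheese used as a topping.", "en"),
    (PIZZA + "Country", RDFS_LABEL, "Country", "en"),
]


def build_entity_index(triples, annotation_properties, lang=None):
    """Map each entity URI to the set of its representative strings."""
    index = defaultdict(set)
    for subject, predicate, text, tag in triples:
        # Only annotation properties contribute representative strings.
        if predicate not in annotation_properties:
            continue
        # Assumption: with a language filter, literals in other languages
        # (and untagged literals) are ignored.
        if lang is not None and tag != lang:
            continue
        index[subject].add(text)
    return dict(index)


index = build_entity_index(triples, {RDFS_LABEL, SKOS_ALT_LABEL}, lang="en")
for uri, strings in index.items():
    print(uri, sorted(strings))
```

Each key of the resulting dictionary is an entity URI and each value is the set of strings that a rule-based matcher can later look for in text.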

In [7]:
# Build a graph loader
kg_loader = GraphLoader(
    kg_file_path=pizza_onto_filepath,
    annotation_properties={"skos:altLabel", "rdfs:label", "skos:prefLabel"}, # define what are the annotation properties in RDF KG
    lang_filter_tag="en" # optionally define a language to focus on
)

In [8]:
# inspect a random sample of 10 entries from the index
sample(list(kg_loader._entity_str_idx.items()), 10)

[('http://www.co-ode.org/ontologies/pizza/pizza.owl#PolloAdAstra',
  {'Pollo Ad Astra', 'Pollo Ad Astra Pizza', 'PolloAdAstra'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#InterestingPizza',
  {'Interesting Pizza', 'InterestingPizza'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#ParmesanTopping',
  {'Parmezan', 'ParmezanTopping'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#Country', {'Country'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#PepperTopping',
  {'Pepper', 'PepperTopping'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#Capricciosa',
  {'Capricciosa', 'Capricciosa Pizza'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#Veneziana',
  {'Veneziana', 'Veneziana Pizza'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#CheeseTopping',
  {'Cheese', 'CheeseTopping'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#American',
  {'American', 'American Pizza'}),
 ('http://www.co-ode.org/ontologies/pizza/pizza.owl#SpicyPizza',


In [9]:
# build an entity matcher
entity_matcher = EntityMatcher(graph_loader=kg_loader, nlp=nlp)

# you can customise your spaCy span ruler config. See: <https://spacy.io/api/spanruler#config> 
# ruler_config = {
#     "phrase_matcher_attr": "LOWER",
#     "spans_key": "kg_ents"
# }
# entity_matcher.build_entity_ruler(config=ruler_config)

In [10]:
# process your documents
corpus = list(entity_matcher.pipe(corpus))

In [11]:
# By default, the matched spans are stored in Doc.spans under the "ruler" key.
corpus[0].spans

{'ruler': [Margherita, Margherita Pizza, Pizza, pizza, tomato, sauce, mozzarella, cheese, olive]}

In [12]:
for doc in corpus:
    print(doc)
    for span in doc.spans["ruler"]:
        print(span.text, span.start, span.end, span.id_.replace("http://www.co-ode.org/ontologies/pizza/pizza.owl#", "pizza-onto:"))
    print()

Margherita Pizza: The classic Italian pizza, topped with tomato sauce, fresh mozzarella cheese, fresh basil leaves, and a drizzle of olive oil.
Margherita 0 1 pizza-onto:Margherita
Margherita Pizza 0 2 pizza-onto:Margherita
Pizza 1 2 pizza-onto:Pizza
pizza 6 7 pizza-onto:Pizza
tomato 10 11 pizza-onto:TomatoTopping
sauce 11 12 pizza-onto:SauceTopping
mozzarella 14 15 pizza-onto:MozzarellaTopping
cheese 15 16 pizza-onto:CheeseTopping
olive 25 26 pizza-onto:OliveTopping

Pepperoni Pizza: A beloved American favorite, topped with tomato sauce, mozzarella cheese, and slices of pepperoni, which are a type of spicy salami.
Pizza 1 2 pizza-onto:Pizza
American 5 6 pizza-onto:American
tomato 10 11 pizza-onto:TomatoTopping
sauce 11 12 pizza-onto:SauceTopping
mozzarella 13 14 pizza-onto:MozzarellaTopping
cheese 14 15 pizza-onto:CheeseTopping
spicy 26 27 pizza-onto:SpicyTopping

Hawaiian Pizza: A controversial choice, featuring tomato sauce, mozzarella cheese, ham, and pineapple. The sweet and salty