# Knowledge Graphs - Pizza

Here, we demonstrate a variety of tasks with knowledge graphs. We work with a small dataset with information on pizza restaurants in the US.

**Work flow**
1. Initial exploration of data - focus on anomalies and checking for uniqueness (for URIs)
2. Build a knowledge graph by writing functions to add triples
3. Connect to an external KG (Google)
4. Apply reasoning to extend our KG
5. Perform SPARQL queries on our KG
6. Check the alignment between two ontologies

**Notes**
* isub and lookup are scripts not included here

In [28]:
from owlready2 import *

In [29]:
from rdflib import Graph
from rdflib import URIRef, BNode, Literal
from rdflib import Namespace
from rdflib.namespace import OWL, RDF, RDFS, FOAF, XSD
from rdflib.util import guess_format
import pandas as pd
import re
import owlrl
from isub import isub
from lookup import GoogleKGLookup
from time import time

## Tabular Data to Knowledge Graph

### Load data and exploratory analysis

In [None]:
# load data into a dataframe
df = pd.read_csv("data/IN3067-INM713_coursework_data_pizza_500.csv")

In [31]:
# return snapshot of data
df.head()

Unnamed: 0,name,address,city,country,postcode,state,categories,menu item,item value,currency,item description
0,Little Pizza Paradise,Cascade Village Mall Across From Target,Bend,US,97701.0,OR,Pizza Place,Bianca Pizza,22.5,USD,
1,Little Pizza Paradise,Cascade Village Mall Across From Target,Bend,US,97701.0,OR,Pizza Place,Cheese Pizza,18.95,USD,
2,The Brentwood,148 S Barrington Ave,Los Angeles,US,90049.0,Brentwood,"American Restaurant,Bar,Bakery","Pizza, Margherita",12.0,USD,
3,The Brentwood,148 S Barrington Ave,Los Angeles,US,90049.0,Brentwood,"American Restaurant,Bar,Bakery","Pizza, Mushroom",13.0,USD,
4,The Brentwood,148 S Barrington Ave,Los Angeles,US,90049.0,Brentwood,"American Restaurant,Bar,Bakery","Pizza, Puttenesca",13.0,USD,"Olives, onions, capers, tomatoes"


In [32]:
# check table info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 501 entries, 0 to 500
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              501 non-null    object 
 1   address           501 non-null    object 
 2   city              501 non-null    object 
 3   country           501 non-null    object 
 4   postcode          491 non-null    float64
 5   state             501 non-null    object 
 6   categories        501 non-null    object 
 7   menu item         501 non-null    object 
 8   item value        423 non-null    float64
 9   currency          426 non-null    object 
 10  item description  176 non-null    object 
dtypes: float64(2), object(9)
memory usage: 43.2+ KB


In [33]:
# check nan values in columns
df.isna().sum()

name                  0
address               0
city                  0
country               0
postcode             10
state                 0
categories            0
menu item             0
item value           78
currency             75
item description    325
dtype: int64

In [34]:
# check uniqueness in columns - important for URIs
df.nunique()

name                151
address             153
city                132
country               1
postcode            147
state                74
categories          121
menu item           311
item value          129
currency              1
item description    169
dtype: int64

We can infer that country is always "US" and currency is always "USD".

In [35]:
# check item values make sense
df["item value"].describe()

count    423.000000
mean      12.249740
std        9.360744
min        0.590000
25%        6.990000
50%       11.950000
75%       16.950000
max      116.990000
Name: item value, dtype: float64

116.99 is very expensive! Investigate further...

In [None]:
# check for item value outliers. Return rows with item value +2sd's from mean
iv_mean = df["item value"].mean()
iv_sd = df["item value"].std()
df[df["item value"] > iv_mean + 2 * iv_sd]

Unnamed: 0,name,address,city,country,postcode,state,categories,menu item,item value,currency,item description
194,Riccardo's Pizza,522 Saddle River Rd,Saddle Brook,US,7663.0,NJ,"Italian Restaurant,Restaurant",Order 3 Large Pizzas and Get The 4th Pizza Free,37.99,USD,
226,California Pizza Kitchen,10300 Forest Hill Blvd,Wellington,US,33414.0,Village Of Wellington,"Pizza Place,Take Out Restaurants,American Rest...",Pizza,116.99,USD,
483,California Pizza Kitchen,3401 Esperanza Xing,Austin,US,78758.0,TX,"Pizza,Take Out Restaurants,Restaurants,America...",Pizza,116.99,USD,


We can see 37.99 is actually an offer, but 116.99 appears to be erroneous.

Also, US zip codes should be <5 digits> or <5 dgits>-<4 digits>. Given the dtype restriction, none could have the latter format. Investigate further...

In [None]:
# return min and max zip codes
int(df["postcode"].min()), int(df["postcode"].max())

(1566, 99709)

We can see a mix of 4 and 5 digit zip codes. These issues are important to note but do not interfere with our subsequent tasks

Also, the outlier table for item value shows restaurant name is not unique (California Pizza Kitchen in Wellington and Austin). Investigate how we can create a unique name for URIs...

In [None]:
# as a baseline, name + address should be unique, but unlinkely to look nice!
unique_name1 = df["name"] + " " + df["address"]
print(f"Example: {unique_name1[0]}")
print(f"Unique count: {len(unique_name1.unique())}")

Example: Little Pizza Paradise Cascade Village Mall Across From Target
Unique count: 153


In [None]:
# alternative, name + city
# add " in " for clarity
unique_name2 = df["name"] + " in " + df["city"]
print(f"Example: {unique_name2[0]}")
print(f"Unique count: {len(unique_name2.unique())}")

Example: Little Pizza Paradise in Bend
Unique count: 153


name + city has same unique count and looks better. Similarly, as instructed, link menu item to restaurant. But first...

#### Preimplimentation data fixes

In [None]:
# uniqueness
df["name_unique"] = df["name"] + " in " + df["city"]
df["menu_item_linked"] = df["menu item"] + " at " + df["name_unique"]

# fix state to only two letters or blank
df["state_fixed"] = df["state"].apply(lambda x: x if len(x) == 2 else "")

# address builder
df["address_builder"] = "address " + df["address"]+ " " + df["city"]

# merge menu item and item description - will be useful for checking ingredients/types
df["menu_detail"] = df["menu item"] + " " + df["item description"].apply(lambda x:str(x))

# merge item value and currency
df["item_value2"] = df["item value"].apply(lambda x:str("{:.2f}".format(x))) + df["currency"]

df.head()

Unnamed: 0,name,address,city,country,postcode,state,categories,menu item,item value,currency,item description,name_unique,menu_item_linked,state_fixed,address_builder,menu_detail,item_value2
0,Little Pizza Paradise,Cascade Village Mall Across From Target,Bend,US,97701.0,OR,Pizza Place,Bianca Pizza,22.5,USD,,Little Pizza Paradise in Bend,Bianca Pizza at Little Pizza Paradise in Bend,OR,address Cascade Village Mall Across From Targe...,Bianca Pizza nan,22.50USD
1,Little Pizza Paradise,Cascade Village Mall Across From Target,Bend,US,97701.0,OR,Pizza Place,Cheese Pizza,18.95,USD,,Little Pizza Paradise in Bend,Cheese Pizza at Little Pizza Paradise in Bend,OR,address Cascade Village Mall Across From Targe...,Cheese Pizza nan,18.95USD
2,The Brentwood,148 S Barrington Ave,Los Angeles,US,90049.0,Brentwood,"American Restaurant,Bar,Bakery","Pizza, Margherita",12.0,USD,,The Brentwood in Los Angeles,"Pizza, Margherita at The Brentwood in Los Angeles",,address 148 S Barrington Ave Los Angeles,"Pizza, Margherita nan",12.00USD
3,The Brentwood,148 S Barrington Ave,Los Angeles,US,90049.0,Brentwood,"American Restaurant,Bar,Bakery","Pizza, Mushroom",13.0,USD,,The Brentwood in Los Angeles,"Pizza, Mushroom at The Brentwood in Los Angeles",,address 148 S Barrington Ave Los Angeles,"Pizza, Mushroom nan",13.00USD
4,The Brentwood,148 S Barrington Ave,Los Angeles,US,90049.0,Brentwood,"American Restaurant,Bar,Bakery","Pizza, Puttenesca",13.0,USD,"Olives, onions, capers, tomatoes",The Brentwood in Los Angeles,"Pizza, Puttenesca at The Brentwood in Los Angeles",,address 148 S Barrington Ave Los Angeles,"Pizza, Puttenesca Olives, onions, capers, toma...",13.00USD


In [None]:
# update stored column names
cols = list(df.columns)

# and look at nan in new cols
df.isna().sum()

name                  0
address               0
city                  0
country               0
postcode             10
state                 0
categories            0
menu item             0
item value           78
currency             75
item description    325
name_unique           0
menu_item_linked      0
state_fixed           0
address_builder       0
menu_detail           0
item_value2          75
dtype: int64

#### Cautionary notes
* Data snapshot shows state column can be incorrect 
* Missing values in postcode, item value, currency, item description. Safe to assume all currencies are "USD" as country column has value "US" only with no missing values
* Some potentially erroneous postcodes and item values, but these will be treated as correct
* Restaurant names are not unique so cannot be used as a URI. We can merge with location info. Menu items will also be nonunique

### Creating the knowldge graph

#### Building the graph

In [None]:
#load graph from the model ontology
g = Graph()
g.parse("data/pizza-restaurants-ontology.ttl", format = "ttl")

# check how many triples 
print("Loaded " + str(len(g)) + " triples.")

# use cw (coursework) namespace for our URIs
cw_ns_str = "http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#"
cw = Namespace(cw_ns_str)
g.bind("cw", cw)


Loaded 963 triples.


#### Functions

Fuctions for processing

In [None]:
# we extend the code provided by Ernesto Jiménez-Ruiz from labs
# https://github.com/city-knowledge-graphs/python-2024

# dictionary to store future URIs
stringToURI = dict()

# function for fixing strings for URIs
def processLexicalName(name):
    # _ for spaces and ., "and" for &
    # removes ,()@
    return "_".join(name.split()).replace("'","").replace("&", "and").\
    replace("(", "").replace(")", "").replace(",", "").replace(".", "_").\
        replace("@", "")
     
# add fixed strings to URI dictionary
def createURIForEntity(name):
    stringToURI[name] = cw_ns_str + processLexicalName(name)
    return stringToURI[name]

# check for nan values
def is_nan(x):
    return (x != x)

# add spaces to strings (eg VeganCheese->Vegan Cheese) for better searching
def add_spaces(my_string):
    # return list of caps
    caps=re.findall('([A-Z])', my_string)
    # we only see the pattern of one or two caps in this ontology
    if len(caps)==2:
        # at the second cap, add a space before, using regular expressions
        my_string=re.sub(r"(\w)([A-Z])", r"\1 \2", my_string)
    return my_string  

Functions for adding to graph.

Here we have some brute force methods which could be improved on.

In [None]:
# we extend the code provided by Ernesto Jiménez-Ruiz from labs
# https://github.com/city-knowledge-graphs/python-2024

# add type triples to graph
def mappingToCreateTypeTriple(subject_column, class_type):
    for subject in df[subject_column]:
        
        # skip blanks
        if is_nan(subject) or subject == None or subject == "" or subject == "nan":
            pass
        
        else:
            # check if URI already exists         
            if subject.lower() in stringToURI:
                entity_uri = stringToURI[subject.lower()]
            else:
            # if not, add it
                entity_uri = createURIForEntity(subject.lower())
            # add the triple to the graph
            g.add((URIRef(entity_uri), RDF.type, class_type))

# add type triples to graph - cw:pizza
# seperate because we need a simple search in the string
def mappingToCreateTypeTriple_pizza():
    for item, item_linked in zip(df["menu item"], df["menu_item_linked"]):
        
        # skip blanks
        if is_nan(item_linked) or item_linked == None or item_linked == "" or item_linked == "nan":
            pass  
        else:      
            # see if "pizza" is in the menu item string
            if "pizza" in item.lower():
                # switch to menu_item_linked for the URI
                # check if URI already exists 
                if item_linked.lower() in stringToURI:
                    entity_uri = stringToURI[item_linked.lower()]
                # if not, add it
                else:
                    entity_uri = createURIForEntity(item_linked.lower())
                # add the triple to the graph
                g.add((URIRef(entity_uri), RDF.type, cw.Pizza))

# add businesses            
# first time around, this is a brute force method for adding business types
# by lookingh through the ontology
def mappingToCreateTypeTriple_business():
    for subject, object in zip(df['name_unique'], df['categories']):
        
        # skip blanks
        if is_nan(subject) or subject == None or subject == "" or subject == "nan":
            pass
        
        else:
            # empty list of categories for business
            business_categories = []          
            # fix strings and add to URI dict if necessary 
            if subject.lower() in stringToURI:
                subject_uri=stringToURI[subject.lower()]
            else:
                subject_uri=createURIForEntity(subject.lower())
            # work through business categories
            if "asian" in object.lower():
                business_categories.append(cw.AsianRestaurant)
            if "chinese" in object.lower():
                business_categories.append(cw.ChineseRestaurant)
            if "japanese" in object.lower():
                business_categories.append(cw.JapaneseRestaurant)
            if "indian" in object.lower():
                business_categories.append(cw.IndianRestaurant)
            if "american" in object.lower():
                business_categories.append(cw.AmericanRestaurant)
            if "mexican" in object.lower():
                business_categories.append(cw.MexicanRestaurant)
            if "bakery" in object.lower():
                business_categories.append(cw.Bakery)
            if "burger" in object.lower():
                business_categories.append(cw.BurgerPlace)
            if "coffee" in object.lower():
                business_categories.append(cw.CoffeeShop)
            if "gourmet" in object.lower():
                business_categories.append(cw.GourmetRestaurant)
            if "mediterranean" in object.lower():
                business_categories.append(cw.MediterraneanRestaurant)
            if "italian" in object.lower():
                business_categories.append(cw.ItalianRestaurant)
            if "french" in object.lower():
                business_categories.append(cw.FrenchRestaurant)
            if "pizza" in object.lower() or "pizzeria" in object.lower():
                business_categories.append(cw.PizzaPlace)
                business_categories.append(cw.Pizzeria)
            if "seafood" in object.lower():
                business_categories.append(cw.SeafoodRestaurant)
            if "cocktail" in object.lower():
                business_categories.append(cw.CocktailBars)
            if "karaoke" in object.lower():
                business_categories.append(cw.KaraokeBar)
            if "sportsbar" in object.lower() or "sports bar" in object.lower():
                business_categories.append(cw.SportsBar)
            if "bar and grill" in object.lower() or "bar & grill" in object.lower():
                business_categories.append(cw.BarAndGrill)
            if "bar" in object.lower():
                business_categories.append(cw.Bar)
            if "beer" in object.lower():
                business_categories.append(cw.BeerPlace)
            if "pub" in object.lower():
                business_categories.append(cw.Pub)
            if "club" in object.lower():
                business_categories.append(cw.Club)
            if "vegetarian" in object.lower():
                business_categories.append(cw.VegetarianRestaurant)
            if "vegan" in object.lower():
                business_categories.append(cw.VeganRestaurant)
            if "gluten-free" in object.lower():
                business_categories.append(cw.GlutenFreeRestaurant)
                
            for category in business_categories:
                # add triples to graph
                g.add((URIRef(subject_uri), RDF.type, category))
        
# add literal triples to graph
def mappingToCreateLiteralTriple(subject_column, object_column, predicate, datatype):

    for subject, lit_value in zip(df[subject_column], df[object_column]):
        if is_nan(lit_value) or lit_value == None or lit_value == "" or lit_value == "nan"\
            or is_nan(subject) or subject == None or subject == "" or subject=="nan":
            pass
        else:
            # the URI has already been created
            entity_uri = stringToURI[subject.lower()]   
            # get the datatype correct for Literal
            lit = Literal(lit_value, datatype=datatype)
            # add triple to graph
            g.add((URIRef(entity_uri), predicate, lit))

    
# add object properties to graph
def mappingToCreateObjectTriple(subject_column, object_column, predicate):
    for subject, object in zip(df[subject_column], df[object_column]): 
        if is_nan(subject) or subject == None or subject == "" or subject == "nan"\
            or is_nan(object) or object == None or object == "" or object == "nan":
            pass
        else:
            # the URI has already been created 
            subject_uri = stringToURI[subject.lower()]   
            object_uri = stringToURI[object.lower()]
            # add triple to graph
            g.add((URIRef(subject_uri), predicate, URIRef(object_uri)))

# add pizza types (NamedPizza) to graph
def mappingToCreateTypeTriple_namedPizza():        
    # query to return pizza types
    qres = g.query(
    """SELECT ?name where {
    ?name rdfs:subClassOf cw:NamedPizza . 
    }""")
    # these will be the types returned by the query
    categories_cw = []
    # these will be the types but changed for better string searches
    categories = []
    for row in qres:
        # Row is a list of matched RDF terms: URIs, literals or blank nodes
        # process the returned row for URI creation with prefix "cw."
        categories_cw.append("cw." + row.name.split("#")[-1])
        categories.append(row.name.split("#")[-1])
        # remove the word "pizza"
        # this is because Margherita Pizza and Pizza Margherita 
        # are both a Margherita Pizza!
        categories=[_.replace('Pizza', "") for _ in categories]
    # for each menu item
    for item, item_linked in zip(df['menu item'], df["menu_item_linked"]):
        # for each pizza category
        for category in categories:
            # if the pizza category text is in the menu item
            if category.lower() in item.lower():
                # find the URI - it should already have been created
                item_linked_URI = stringToURI[item_linked.lower()]
                # add to graph
                g.add((URIRef(item_linked_URI), RDF.type, eval(categories_cw[categories.index(category)])))                   

# this is a big function because if you are getting all the information
# you might as well use it!
# it adds...
# the ingredient as an entitity of the ontology's ingredients
# adds the ingredient to the pizza
# decides if the pizza is vegetarian/vegan
def mappingForIngredients():
    # query to return ingredient classes in ontology
    # there are many sub-sub-...-categories
    qres_ing = g.query(
    """SELECT ?ingredient where {{
    ?big_ingredient rdfs:subClassOf cw:Ingredient .
    ?ingredient rdfs:subClassOf ?big_ingredient .}
    UNION
    {
    ?big_ingredient rdfs:subClassOf cw:Ingredient .
    ?big_ingredient2 rdfs:subClassOf ?big_ingredient .
    ?ingredient rdfs:subClassOf ?big_ingredient2 .}
    UNION
    {
    ?big_ingredient rdfs:subClassOf cw:Ingredient .
    ?big_ingredient2 rdfs:subClassOf ?big_ingredient .
    ?big_ingredient3 rdfs:subClassOf ?big_ingredient2 .
    ?ingredient rdfs:subClassOf ?big_ingredient3 .}
    UNION
    {
    ?big_ingredient rdfs:subClassOf cw:Ingredient .
    ?big_ingredient2 rdfs:subClassOf ?big_ingredient .
    ?big_ingredient3 rdfs:subClassOf ?big_ingredient2 .
    ?big_ingredient4 rdfs:subClassOf ?big_ingredient3 .
    ?ingredient rdfs:subClassOf ?big_ingredient4 .}}
    """)
    # this will be populated with ingredients returned by the query
    ingredients_cw = []
    # as above, but with some processing for string searches
    ingredients = []
    for row in qres_ing:
        # Row is a list of matched RDF terms: URIs, literals or blank nodes
        # add new ingredient to list, with prefix "cw."
        ingredients_cw.append("cw." + row.ingredient.split("#")[-1])
        # add string part to other list
        ingredients.append(row.ingredient.split("#")[-1])
        # improve the string for better matching - by adding spaces
        ingredients = [add_spaces(_) for _ in ingredients]
    
    # get vegetarian ingredients
    qres_vegetarian = g.query("""SELECT ?ingredient where {{
    ?ingredient rdfs:subClassOf cw:VegetarianIngredient .}
    UNION
    {?ingredient rdfs:subClassOf cw:VeganIngredient .}
    UNION
    {?ingredient rdfs:subClassOf cw:Vegetable .}
    UNION
    {?ingredient rdfs:subClassOf cw:Fruit .}}
    """)
    # we turn the query into a list as above
    vegetarian_list = []
    for row in qres_vegetarian:
        # process the string
        vegetarian_list.append(row.ingredient.split("#")[-1])
        # improve the string for better matching - by adding spaces
        vegetarian_list = [add_spaces(_) for _ in vegetarian_list]
        vegetarian_list = [_.lower() for _ in vegetarian_list]
     
    # get vegan ingredients   
    qres_vegan = g.query("""SELECT ?ingredient where {
    {?ingredient rdfs:subClassOf cw:VeganIngredient .}
    UNION
    {?ingredient rdfs:subClassOf cw:Vegetable .}
    UNION
    {?ingredient rdfs:subClassOf cw:Fruit .}}
    """)
    # we turn the query into a list as above
    vegan_list = []
    for row in qres_vegan:
        # process the string
        vegan_list.append(row.ingredient.split("#")[-1])
        # improve the string for better matching - by adding spaces
        vegan_list = [add_spaces(_) for _ in vegan_list]
        vegan_list = [_.lower() for _ in vegan_list]
    
    
    # for each menu item
    # we use "menu_detail" which merges the item name and description
    # for better searching
    for description,item_linked in zip(df['menu_detail'], df["menu_item_linked"]):
        # skip blanks
        if is_nan(description) or description== None or description == "" or description == "nan"\
            or is_nan(item_linked) or item_linked == None or item_linked == "" or item_linked == "nan":
            pass
        else:
            # start an ingredient list for each menu item
            ingredient_list = []
            # for each ingredient category
            for ingredient in ingredients:
                # if the ingredient is in the menu item's description
                if ingredient.lower() in description.lower():
                    # add ingredient to ingredient list for this menu item
                    ingredient_list.append(ingredient.lower())
                    # get URIs for the menu item
                    item_linked_URI = stringToURI[item_linked.lower()]
                    # we want to use .lower() here to differentiate and 
                    # enable cw:pepper rdfs:subClassOf cw:Pepper
                    add_ingredient_uri = createURIForEntity(ingredient.lower())
                    ingredient_URI = stringToURI[ingredient.lower()]
                    # add to graph
                    # add the object property
                    g.add((URIRef(item_linked_URI), cw.hasIngredient, URIRef(ingredient_URI)))
                    # add the type property
                    g.add((URIRef(ingredient_URI), RDF.type, eval(ingredients_cw[ingredients.index(ingredient)])))
                    
                    # vegetarian/vegan check
                    # check if all ingredients are in the vegetarian list and the word "pizza"
                    if all([x in vegetarian_list for x in ingredient_list]) and ("pizza" in description.lower()):
                        # if so, it's a vegtarian pizza! add triple to graph
                        g.add((URIRef(item_linked_URI), RDF.type, cw.VegetarianPizza)) 
                    # similarly for vegan...
                    if all([x in vegan_list for x in ingredient_list]) and ("pizza" in description.lower()):
                        # add triple to graph
                        g.add((URIRef(item_linked_URI), RDF.type, cw.VeganPizza))     

#### Start adding to graph

Types - we must assign types first to create URIs, otherwise we'll get key errors. Ingredients are a special case and come later

In [None]:
# address/location
mappingToCreateTypeTriple("address_builder", cw.Address)
mappingToCreateTypeTriple("city", cw.City)
mappingToCreateTypeTriple("state_fixed", cw.State)
mappingToCreateTypeTriple("country", cw.Country)

# menu items - further work below
mappingToCreateTypeTriple("menu_item_linked", cw.MenuItem)

# currency
mappingToCreateTypeTriple('currency', cw.Currency)

# item value
mappingToCreateTypeTriple('item_value2', cw.ItemValue)


More complicated types

In [51]:
# businesses
mappingToCreateTypeTriple_business()
# named pizza type
mappingToCreateTypeTriple_namedPizza()

Literals

In [52]:
# literals
mappingToCreateLiteralTriple("address_builder", "address", cw.firstLineAddress, XSD.string)
mappingToCreateLiteralTriple("address_builder", "postcode", cw.postCode, XSD.string)
mappingToCreateLiteralTriple("menu_item_linked", "menu item", cw.itemName, XSD.string)
mappingToCreateLiteralTriple("item_value2", "currency", cw.amountCurrency, XSD.string)
mappingToCreateLiteralTriple("item_value2", "item value", cw.amount, XSD.double)
mappingToCreateLiteralTriple("name_unique", "name", cw.restaurantName, XSD.string)

Object properties

In [53]:
# Create object property triples - all URIs should already have been created
mappingToCreateObjectTriple('name_unique', 'city', cw.locatedInCity)
mappingToCreateObjectTriple('name_unique', 'country', cw.locatedInCountry)
mappingToCreateObjectTriple('state_fixed', 'country', cw.locatedInCountry)
mappingToCreateObjectTriple('name_unique', 'state_fixed', cw.locatedInState)
mappingToCreateObjectTriple('city', 'state_fixed', cw.locatedInState)
mappingToCreateObjectTriple('name_unique', 'address_builder', cw.locatedInAddress)
mappingToCreateObjectTriple('address_builder', 'city', cw.locatedInCity)
mappingToCreateObjectTriple('name_unique', 'menu_item_linked', cw.servesMenuItem)
mappingToCreateObjectTriple('menu_item_linked', 'item_value2', cw.hasValue)

Pizza as a type

In [54]:
# pizza type
mappingToCreateTypeTriple_pizza()

Ingredients - these are more complicated and the function adds both type and object property

In [55]:
# process ingredients
# adds both eg cw:pepper a cw:Pepper
# and          cw:some_specific_pizza cw:hasIngredient cw:pepper
# and classifies vegan/vegetarian pizzas
mappingForIngredients()

Save first graph

In [None]:
# in ttl format
g.serialize(destination = 'data/output_graphs/g_og.ttl', format = 'ttl')

<Graph identifier=N4dbe0e99829844308f890f5ce7e98be7 (<class 'rdflib.graph.Graph'>)>

### Link to external kgs
We link to the google kg. Only implement for city, country and state

In [None]:
# link to lookup script
ggl = GoogleKGLookup()

# new empty dictionary
stringToURI = dict()

Tweak the dictionary and add type functions

In [None]:
# we extend the code provided by Ernesto Jiménez-Ruiz from labs
# https://github.com/city-knowledge-graphs/python-2024

def createURIForEntity_external(name):
        
        #We create fresh URI (default option)
        stringToURI[name] = cw_ns_str + processLexicalName(name)       
       
        uri = getExternalKGURI(name)
        if uri != "":
            stringToURI[name.lower()] = uri
        
        return stringToURI[name.lower()]

# change to ggl kg
def getExternalKGURI(name):
        entities = ggl.getKGEntities(name, 5)
        current_sim = -1
        current_uri = ''
        
        for ent in entities:           
            isub_score = isub(name, ent.label) 
            if current_sim < isub_score:
                # for some reason, we're unable to bind prefixes 
                # later so replace with full URI
                current_uri = ent.ident.replace("kg:", "http://g.co/kg")
                current_sim = isub_score

        return current_uri
            
def mappingToCreateTypeTriple_external(subject_column, class_type):
    for subject in df[subject_column]:    
        # skip blanks
        if is_nan(subject) or subject == None or subject == "":
            pass
        else:
            if subject.lower() in stringToURI:
                entity_uri=stringToURI[subject.lower()]
            else:
                # already returns .lower() - not having it here might improve search?
                entity_uri = createURIForEntity_external(subject)
            g.add((URIRef(entity_uri), RDF.type, class_type))

Start a new graph for Google

In [None]:
#Load graph from the model ontology
g = Graph()
g.parse("data/pizza-restaurants-ontology.ttl", format = "ttl")

# check contents 
print("Loaded " + str(len(g)) + " triples.")

# use cw namespace
cw_ns_str = "http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#"
cw = Namespace(cw_ns_str)
g.bind("cw", cw)
# unable to bind ggl kg with a prefix so the whole URI is used
# as per the replace() call in getExternalKGURI() above
# note sure why this happens!

Loaded 963 triples.


Add the three types from the external KG

In [None]:
# connecting to external kg and time it
t0 = time()
mappingToCreateTypeTriple_external("city", cw.City)
mappingToCreateTypeTriple_external("state_fixed", cw.State)
mappingToCreateTypeTriple_external("country", cw.Country)
print(f"Connecting to kg for these three types took {time() - t0} seconds")

Connecting to kg for these three types took 142.1180272102356 seconds


Add the rest

In [None]:
# address/location
mappingToCreateTypeTriple("address_builder", cw.Address)

# menu items - further work below
mappingToCreateTypeTriple("menu_item_linked", cw.MenuItem)

# currency
mappingToCreateTypeTriple('currency', cw.Currency)

# item value
mappingToCreateTypeTriple('item_value2', cw.ItemValue)

# businesses
mappingToCreateTypeTriple_business()

# pizza type
mappingToCreateTypeTriple_pizza()

# named pizza type
mappingToCreateTypeTriple_namedPizza()

# literals
mappingToCreateLiteralTriple("address_builder", "address", cw.firstLineAddress, XSD.string)
mappingToCreateLiteralTriple("address_builder", "postcode", cw.postCode, XSD.string)
mappingToCreateLiteralTriple("menu_item_linked", "menu item", cw.itemName, XSD.string)
mappingToCreateLiteralTriple("item_value2", "currency", cw.amountCurrency, XSD.string)
mappingToCreateLiteralTriple("item_value2", "item value", cw.amount, XSD.double)
mappingToCreateLiteralTriple("name_unique", "name", cw.restaurantName, XSD.string)

# Create object property triples - all URIs should already have been created
mappingToCreateObjectTriple('name_unique', 'city', cw.locatedInCity)
mappingToCreateObjectTriple('name_unique', 'country', cw.locatedInCountry)
mappingToCreateObjectTriple('state_fixed', 'country', cw.locatedInCountry)
mappingToCreateObjectTriple('name_unique', 'state_fixed', cw.locatedInState)
mappingToCreateObjectTriple('city', 'state_fixed', cw.locatedInState)
mappingToCreateObjectTriple('name_unique', 'address_builder', cw.locatedInAddress)
mappingToCreateObjectTriple('address_builder', 'city', cw.locatedInCity)
mappingToCreateObjectTriple('name_unique', 'menu_item_linked', cw.servesMenuItem)
mappingToCreateObjectTriple('menu_item_linked', 'item_value2', cw.hasValue)

# add ingredients
mappingForIngredients()

In [None]:
# export
g.serialize(destination = 'data/output_graphs/g_ggl.ttl', format = 'ttl')

<Graph identifier=N6968988039204c17891415a4cdf04e61 (<class 'rdflib.graph.Graph'>)>

## Reasoning and SPARQL queries
Note, here we are working with g_og with reasoning

### Add reasoning

In [None]:
g = Graph()
# load our pre-external KG graph
g.parse("data/output_graphs/g_og.ttl", format = "ttl")
# triples before reasoning
print("Pre-reasoning " + str(len(g)) + " triples.")

#Perform reasoning and time it
t0=time()
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics, axiomatic_triples=True, datatype_axioms=False).expand(g)
print(f"Reasoning took {time()-t0} seconds")
print("After-reasoning " + str(len(g)) + " triples.")


Pre-reasoning 6631 triples.
Reasoning took 26.5491681098938 seconds
After-reasoning 35524 triples.


In [None]:
# save graph with reasoning
g.serialize(destination = 'data/output_graphs/g_reasoning.ttl', format = 'ttl')

<Graph identifier=Nfadacffaad8d40c5a3a9bab02bb7a8d2 (<class 'rdflib.graph.Graph'>)>

In [None]:
g = Graph()
g.parse("data/output_graphs/g_reasoning.ttl", format = "ttl")

<Graph identifier=N89ffaa0b7c844231bfe06c3629d11b74 (<class 'rdflib.graph.Graph'>)>

In [None]:
# query to return menu items that cost less than 2USD
query1 = open("data/queries/query1.txt", 'r').read()
qres1 = g.query(query1)

for row in qres1:
    #Row is a list of matched RDF terms: URIs, literals or blank nodes
    print(row)

# save as csv
qres1.serialize(destination = "data/queries/result_sparql1.csv", format = "csv")

(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#pizza_by_the_slice_at_antonios_pizza_in_chatham'), rdflib.term.Literal('1.79', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#pizza_dipping_sauce_cup_at_hungry_howies_pizza_in_chandler'), rdflib.term.Literal('0.59', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#pizza_pretzel_at_mi_pals_deli_in_philadelphia'), rdflib.term.Literal('1.25', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#pizza_roll_at_china_moon_in_suffolk'), rdflib.term.Literal('1.25', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067

In [None]:
# find expensive pizzas(>25USD), return the pizza and dollar amount        
query2 = open("data/queries/query2.txt", 'r').read() 
qres2 = g.query(query2)

for row in qres2:
    #Row is a list of matched RDF terms: URIs, literals or blank nodes
    print(row)

# save as csv
qres2.serialize(destination = "data/queries/result_sparql2.csv", format = "csv")

(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#order_3_large_pizzas_and_get_the_4th_pizza_free_at_riccardos_pizza_in_saddle_brook'), rdflib.term.Literal('38', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#pizza_at_california_pizza_kitchen_in_austin'), rdflib.term.Literal('117', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#pizza_at_california_pizza_kitchen_in_wellington'), rdflib.term.Literal('117', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#double')))


In [None]:
# find the union of seafood ingredients and 
# vegetarian ingredients that aren't vegan
query3 = open("data/queries/query3.txt", 'r').read() 
qres3 = g.query(query3)                


for row in qres3:
    #Row is a list of matched RDF terms: URIs, literals or blank nodes
    print(row)

# save as csv
qres3.serialize(destination = "data/queries/result_sparql3.csv", format = "csv")

(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#Anchovies'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#CrabMeat'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#Salmon'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#Scallops'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#SeaFood'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#Shrimp'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#Tuna'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#BlueCheese'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#Cheddar'),)
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#Cheese'),)
(rdflib.term.URIRef('http://www.semanti

In [None]:
# find the most common pizza ingredients
# filter to ingredients on more than 50 pizzas
# sort descending        
query4 = open("data/queries/query4.txt", 'r').read() 
qres4 = g.query(query4)

for row in qres4:
    #Row is a list of matched RDF terms: URIs, literals or blank nodes
    print(row)

# save as csv
qres4.serialize(destination = "data/queries/result_sparql4.csv", format = "csv")

(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#cheese'), rdflib.term.Literal('116', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#pepper'), rdflib.term.Literal('65', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#mozzarella'), rdflib.term.Literal('59', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#chicken'), rdflib.term.Literal('54', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#integer')))


In [None]:
# finds cities with more than 3 restaurants
# order by state, then city
query5 = open("data/queries/query5.txt", 'r').read() 
qres5 = g.query(query5)

for row in qres5:
    #Row is a list of matched RDF terms: URIs, literals or blank nodes
    print(row)

# save as csv
qres5.serialize(destination = "data/queries/result_sparql5.csv", format = "csv")

(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#ca'), rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#san_diego'))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#co'), rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#arvada'))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#co'), rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#denver'))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#de'), rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#baltimore'))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#de'), rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#findlay'))
(rdflib.term.URIRef('http://www.semanticweb.org/city/in3067-inm713/2024/restau

## 2.4 Ontology Alignment


Here, we see how well two ontologies align. This part could be optimised by different measures of string simmilarity and thresholds.
Note 1 = pizza - the provided KG, 2 = cw - our KG

#### Find similar items

In [None]:
# load ontologies
urionto1 = "data/pizza.owl"
urionto2 = "data/pizza-restaurants-ontology.owl"

# methods from owlready
onto1 = get_ontology(urionto1).load()
onto2 = get_ontology(urionto2).load()

# only these two triple patterns are in the alignment comparison file
def getClasses(onto):        
    return onto.classes()
        
def getObjectProperties(onto):        
    return onto.object_properties()

In [None]:
# function to find similar strings in two lists list, given a threshold
# uses isub function to measure similarity, could use others
def list_similar(list_a, list_b, threshold):
    similar_list=[]
    for i in list_a:
        for j in list_b:
            # if each match is above the threshold ...
            if isub(i, j) > threshold:
                # ... add to list
                similar_list.append([i, j])
    return similar_list

In [None]:
# get classes from the two ontologies
classes1 = getClasses(onto1)
classes2 = getClasses(onto2)

# turn them into lists to iterate through
class_list1 = [x.name for x in classes1]
class_list2 = [x.name for x in classes2]

# run function from above and print some metrics
similar_classes = list_similar(class_list1, class_list2, 0.9)
print(f"pizza things {len(class_list1)}")
print(f"cw things {len(class_list2)}")
print(f"matched things {len(similar_classes)}")

pizza things 100
cw things 151
matched things 25


In [None]:
# get object properties from the two ontologies
op1 = getObjectProperties(onto1)
op2 = getObjectProperties(onto2)

# turn them into lists to iterate through
op_list1 = [x.name for x in op1]
op_list2 = [x.name for x in op2]

# run function from above and print some metrics
similar_op = list_similar(op_list1, op_list2, 0.9)
print(f"pizza things {len(op_list1)}")
print(f"cw things {len(op_list2)}")
print(f"matched things {len(similar_op)}")

pizza things 8
cw things 17
matched things 2


#### Gather matches and transform to triples

In [None]:
# function that takes a list of similar things and turns them into triples
def equiv_things(namespace1, namespace2, relationship, similar_list):
    ttl_list = []
    for i in similar_list:
        # we need to flip the order to get the correct comparison later
        ttl_list.append(f"{namespace2}.{i[1]},owl.{relationship},{namespace1}.{i[0]}")
    return ttl_list

In [None]:
equiv_classes=equiv_things("pizza", "cw", "equivalentClass", similar_classes)
equiv_op=equiv_things("pizza", "cw", "equivalentProperty", similar_op)

In [None]:
# create our graph of alignments
g_align = Graph()

# define our namespaces - there are only three
pizza = Namespace("http://www.co-ode.org/ontologies/pizza/pizza.owl#")
owl = Namespace("http://www.w3.org/2002/07/owl#")
cw = Namespace("http://www.semanticweb.org/city/in3067-inm713/2024/restaurants#")

g_align.bind("pizza", pizza) 
g_align.bind("owl", owl) 
g_align.bind("cw", cw) 

In [85]:
# add our alignment triples to the graph
for i in equiv_classes:
    g_align.add((eval(i)))

for i in equiv_op:
    g_align.add((eval(i)))

#### Export graph

In [None]:
# save the alignment graph
g_align.serialize(destination =' data/output_graphs/g_align.ttl', format = 'ttl')

<Graph identifier=N26f4ab39ac8b408787a9c7f3de2204fa (<class 'rdflib.graph.Graph'>)>

#### Precision and recall
Again, we do not optimise this!

In [None]:
# we extend the code provided by Ernesto Jiménez-Ruiz from labs
# https://github.com/city-knowledge-graphs/python-2024

# calculate precision, recall and F1 from two graphs
def compareWithReference(reference_mappings_file, system_mappings_file):
    ref_mappings = Graph()
    ref_mappings.parse(reference_mappings_file, format="ttl")
    
    system_mappings = Graph()
    system_mappings.parse(system_mappings_file, format="ttl")
    
    #We calculate precision and recall via true positives, false positives and false negatives
    #https://en.wikipedia.org/wiki/Precision_and_recall        
    tp = 0
    fp = 0
    fn = 0
    
    for t in system_mappings:
        if t in ref_mappings:
            tp += 1
        else:
            fp += 1

    for t in ref_mappings:
        if not t in system_mappings:
            fn += 1
                    
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = (2 * precision * recall) / (precision + recall)

    print("Comparing '" + system_mappings_file + "' with '" + reference_mappings_file)
    print("\tPrecision: " + str(precision))
    print("\tRecall: " + str(recall))
    print("\tF-Score: " + str(f_score))

In [None]:
# function inputs
reference_mappings = "data/reference-mappings-pizza.ttl"
system_mappings = "data/output_graphs/g_align.ttl"
# run function
compareWithReference(reference_mappings, system_mappings)

Comparing 'data/output_graphs/g_align.ttl' with 'data/reference-mappings-pizza.ttl
	Precision: 0.6666666666666666
	Recall: 0.5454545454545454
	F-Score: 0.6
