# Knowledge Representation on the Web -- RDF tutorial

In this tutorial we'll learn the basics of interacting with RDF graphs in Python. We'll be using rdflib, a widely used Python library for RDF (the full documentation can be found [here](https://rdflib.readthedocs.io/en/stable/index.html)).

## Imports
These are the main classes and types we'll be using from rdflib

In [None]:
import sys

from rdflib import Graph, ConjunctiveGraph, Literal, BNode, Namespace, RDF, URIRef, RDFS
from rdflib.namespace import DC, FOAF

import pprint


## Loading data remotely and from files

rdflib can import RDF data from a variety of sources: locally from a file (with extensive support for different serializations), or remotely via a URI (a practical way of checking whether URIs return RDF, as the 3rd Linked Data principle requires).

A Graph object is always required to load triples.
**Note**: to load quads, and thus support named graphs, you'll need to use an instance of ConjunctiveGraph instead.

**Exercise 1** 

1. create two graphs using rdflib:
    - load one with triples from the site https://csarven.ca/ and/or http://www.w3.org/People/Berners-Lee/card
    - load the other with triples from ./data/ingredients.rdf

In [None]:
g = Graph()
h = Graph()
f = Graph()

result = g.parse("http://www.w3.org/People/Berners-Lee/card")

result2 = h.parse("https://csarven.ca/", format="n3")

result3 = f.parse("../data/ingredients.rdf")

print("Graph g has %s statements." % len(g))
print("Graph h has %s statements." % len(h))
print("Graph f has %s statements." % len(f))

## Serialising and saving RDF graphs

There are different formats for storing RDF triples. Semantically they mean the same; they differ only in their syntax.


Use the function Graph.serialize(format). 

**Exercise 2**

1. serialise one of the graphs to the .ttl, .xml and .nt format, and print the first n lines to compare the syntax
1. save your graph in the turtle format to the ./data/ folder

In [None]:
v = g.serialize(format="ttl")

print(v)

In [None]:
v = g.serialize(format="xml")

print(v)

In [None]:
v = g.serialize(format="nt")

print(v)

##  Merging graphs

Merging graphs can be done via sequential parsings or by the overloaded operator +

**Note:** Set-theoretic graph semantics apply

The Food knowledge graph FoodKG contains a graph of statements about ingredients, as well as a graph with statements about recipes. 

**Exercise 3**: 

1. load ./data/ingredients.rdf and ./data/ghostbusters.ttl into a single graph, either by sequential parsing or using the operator +.

2. count the number of statements in each graph, and the intersection of the two graphs. 

3. check whether the combined graph is connected (using graph.connected()) 

4. load ./data/ingredients.rdf and ./data/recipes.rdf into a single graph, either by sequential parsing or using the operator +. 

5. count the number of statements in each graph, and the intersection of the two graphs. 

6. check whether the combined graph is connected (using graph.connected()). Explain the result with respect to point 3! 

In [None]:
g1 = Graph()
g1.parse("../data/ingredients.rdf")
print("g1 has {} triples".format(len(g1)))

g2 = Graph()
g2.parse("../data/ghostbusters.ttl", format='ttl')
print("g2 has {} triples".format(len(g2)))

union = g1 + g2
print("g1 + g2 has {} triples".format(len(union)))

print("len(g1) + len(g2) equals {}".format(len(g1)+len(g2)))

intersection = g1 & g2
print("g1 & g2 has {} triples".format(len(intersection)))

#print("is the graph connected? {}".format(union.connected()))

In [None]:
g1 = Graph()
g1.parse("../data/ingredients.rdf")
print("g1 has {} triples".format(len(g1)))

g2 = Graph()
g2.parse("../data/recipes.rdf")
print("g2 has {} triples".format(len(g2)))

union = g1 + g2
print("g1 + g2 has {} triples".format(len(union)))

print("len(g1) + len(g2) equals {}".format(len(g1)+len(g2)))

intersection = g1 & g2
print("g1 & g2 has {} triples".format(len(intersection)))

print("is the union connected? {}".format(union.connected()))

Neither combined graph is fully connected: starting a traversal from a single node, not all entities can be reached. It might be interesting, however, to see how connected the ingredient + recipe graph is, for instance how many times an ingredient is used in a recipe. rdflib doesn't provide this functionality, but we can do this using [kglab and networkx](https://derwen.ai/docs/kgl/ex6_0/):

In [None]:
import kglab
import os

namespaces = {
    "wtm":  "http://purl.org/heals/food/",
    "ind":  "http://purl.org/heals/ingredient/",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on https://github.com/foodkg/foodkg.github.io.git",
    namespaces = namespaces,
    )

kg.load_rdf("../data/recipes.rdf", format="xml")
kg.load_rdf("../data/ingredients.rdf", format="xml") 

In [None]:
import networkx as nx

#here we extract the recipes and their ingredients
sparql = """
    SELECT ?subject ?object
    WHERE {
        ?subject rdf:type wtm:Recipe .
        ?subject wtm:hasIngredient ?object .
    }
    """

subgraph = kglab.SubgraphMatrix(kg, sparql)

#from these, we create a digraph
nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)
recipe_nodes, ingredient_nodes = nx.bipartite.sets(nx_graph)

In [None]:
results = nx.degree_centrality(nx_graph)
ind_rank = {}

#we calculate degree centrality: how connected is each node? 
for node_id, rank in sorted(results.items(), key=lambda item: item[1], reverse=True):
    if node_id in ingredient_nodes:
        ind_rank[node_id] = rank
        node = subgraph.inverse_transform(node_id)
        label = subgraph.n3fy(node)
        print("{:6.3f} {}".format(rank, label))
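
To make the scores above easier to interpret: networkx defines the degree centrality of a node as the fraction of the other nodes it is connected to, that is, its degree divided by n - 1. The plain-Python sketch below recomputes this on a made-up recipe/ingredient mapping; the data is hypothetical and not taken from FoodKG.

```python
# Hypothetical recipe -> ingredients mapping (not taken from FoodKG).
uses = {
    "pancakes": ["flour", "egg", "milk"],
    "omelette": ["egg", "milk"],
    "bread": ["flour"],
}

# One edge per (recipe, ingredient) pair.
edges = [(recipe, ing) for recipe, ings in uses.items() for ing in ings]
nodes = {node for edge in edges for node in edge}

# Count how many edges touch each node.
degree = {node: 0 for node in nodes}
for recipe, ing in edges:
    degree[recipe] += 1
    degree[ing] += 1

# Degree centrality: degree divided by the number of *other* nodes.
n = len(nodes)
centrality = {node: deg / (n - 1) for node, deg in degree.items()}

for node, score in sorted(centrality.items(), key=lambda item: (-item[1], item[0])):
    print("{:6.3f} {}".format(score, node))
```

An ingredient used in many recipes therefore gets a high score, which is what the ranking in the cell above shows for the real data.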

## Namespaces 

Remind yourself what namespaces are. 

In RDFLib, the namespace module defines many common namespaces such as RDF, RDFS, OWL, FOAF, SKOS, etc., but you can also easily add URIs within a different namespace:


In [None]:
TEACH = Namespace("http://linkedscience.org/teach/ns#")
TEACH.Teacher

Check out the specification to see which other terms are used within the TEACH namespace. http://linkedscience.org/teach/ns/#sec-specification. 
You can use a NamespaceManager to bind a prefix to a namespace: 

In [None]:
g = Graph()
g.namespace_manager.bind('TEACH', URIRef('http://linkedscience.org/teach/ns#'))

In [None]:
KRW = Namespace("http://krw.vu.nl/data#")

#creating individuals within your namespace
KRW.Teacher
KRW.Student

**Exercise 4:**
1. create your own namespace (can be made up) 


## Creating RDF triples

Triples are added to the graph with the function Graph.add()

The parameter is a triple given in a Python **tuple** (subject, predicate, object)

Notice the namespace convenience syntax!

**Exercise 5:** 

1. create a new graph and add triples (~10) within your made-up namespace using Graph.add(). These triples can be about anything, for instance ingredients or recipes. Make sure they include the predicates RDF.type, RDFS.label and RDFS.subClassOf

2. open yourRDF.ttl, and write your triples out by hand in a syntax of your choice (turtle is recommended, notice the file extension!). Load the triples here with rdflib. 

In [None]:
g = Graph()

#example namespace
EX = Namespace("https://example.org/")

# Add triples using store's add method.
g.add( (EX.whale, RDF.type, EX.Mammalia) )
g.add( (EX.whale, RDFS.label, Literal('whale'))) #in this example, the identifiers have human readable names, but these can also be arbitrary strings. rdfs:label makes these human-interpretable.  
g.add( (EX.crocodile, RDF.type, EX.Amphibia) )
g.add( (EX.Amphibia, RDFS.subClassOf, EX.Animalia) )

#note that there is not one way to describe your domain! When do you define something as an instance or class?  

# You can reuse matched subjects to filter further, e.g. to look up their objects
for entity in g.subjects(RDF.type, None):
    print(entity)
    for obj in g.objects(entity, RDF.type):
        print(obj)

In [None]:
#save your graph in the Turtle format (stating the format explicitly avoids relying on the default)
g.serialize(destination="myRDF.ttl", format="turtle")




In [None]:
#load it 
d = Graph()
d.parse('myRDF.ttl')
print(d.serialize(format='ttl'))

## Navigating graphs

rdflib uses iterators to navigate Graphs. The methods for navigating subjects, predicates and objects are Graph.subjects, Graph.predicates, Graph.objects

**Exercise 6:**

1. print all the triples in yourRDF.ttl
2. print all subjects in yourRDF.ttl
3. print all predicates in yourRDF.ttl
4. print all objects in yourRDF.ttl


In [None]:
for s,p,o in g.triples( (None, None, None) ):
    print(s,p,o)

In [None]:
for s in g.subjects():
    print(s)

In [None]:
for p in g.predicates():
    print(p)

In [None]:
for o in g.objects():
    print(o)

We can also filter the subjects, predicates and objects we want to retrieve, and match their values like in a database "join" operation


**Exercise 7:**

1. print all subject types in yourRDF.ttl
2. print all subject labels in yourRDF.ttl

In [None]:
for s,p,o in g.triples( (None, RDF.type, None) ):
    print(o)

In [None]:
for s,p,o in g.triples( (None, RDFS.label, None) ):
    print(o)

### Basic triple matching (almost querying!)

We use the method Graph.triples with a Python tuple that acts as a mask specifying our criteria; None in a position matches any term.

**Exercise 8:**

1. check whether a triple is in your graph -> print true or false
2. print all triples related to a certain subject in your graph
3. print all triples related to a certain object in your graph

In [None]:
print((EX.whale, RDFS.subClassOf, EX.Amphibia) in g)
    
for s,p,o in g.triples( (EX.whale, None, None) ):
    print(p,o)
    
for s,p,o in g.triples( (None, None, EX.Mammalia) ):
    print(s,p)

## Assignment part 1: your own web application

You are a chef in a restaurant, and you need to serve a guest who is gluten-intolerant.

1. load the ./data/recipes.rdf and ./data/ingredients.rdf datasets into one graph
2. query your graph (as we did in previous exercises) to retrieve all recipes without gluten
3. query your graph for all recipes that you can make for your gluten-intolerant guest
4. the guest asks you whether there are more options. Can you find the recipes in which an ingredient with gluten can be replaced, using only pattern matching? (Hint: you need to write several of these pattern-matching queries, and check the predicate __substitutesFor__)
5. another guest is allergic to pecan nuts. Which recipes could you serve them (including those in which pecan nuts can be replaced)?

In [None]:
food = Graph()

# Sequential parsings merge *new* triples

food.parse("../data/ingredients.rdf")
food.parse("../data/recipes.rdf")

print("Graph has {} triples".format(len(food)))

In [None]:
import rdflib
WTM = Namespace("http://purl.org/heals/food/")
IND = Namespace("http://purl.org/heals/ingredient/")

#something to get you started: 

#first retrieve all ingredients that contain gluten, then the recipes that use them:
gl_rec = []
print("All recipes with gluten:") 
for s1,p1,o1 in food.triples( (None, WTM.hasGluten, None)):
    if o1: #a boolean Literal is truthy only when its value is true
        for s2, p2, o2 in food.triples( (None, WTM.hasIngredient, s1)):
            print("{}, containing: ({})".format(s2,s1))
            gl_rec.append(s2)


print("Glutenfree recipes:")
for s,p,o in food.triples((None, RDF.type, WTM.Recipe)):
    if s not in gl_rec:
        print(s)
        
#note that this is a bit tedious: later on, we will be querying more complicated patterns with SPARQL!





You can also do that by matching on the boolean value of the object in the triples where hasGluten is the predicate. Note that the Python boolean has to be wrapped in a Literal, because that is how the value is stored in the graph.

In [None]:
#you can also match on the boolean directly in the triple pattern, like this
gl_rec = []
print("All recipes with gluten:") 
for s1,p1,o1 in food.triples( (None, WTM.hasGluten, Literal(True)) ): #wrap the Python bool in a Literal, because it is stored as a literal in the KG (avoid bool('true'): bool() of any non-empty string is True)
    for s2, p2, o2 in food.triples( (None, WTM.hasIngredient, s1)):
        print("{}, containing: ({})".format(s2,s1))
        gl_rec.append(s2)
            
print("Glutenfree recipes:")
for s,p,o in food.triples((None, RDF.type, WTM.Recipe)):
    if s not in gl_rec:
        print(s)

Note that you can also query for triples by specifying a Literal in the object position. In that case, the Literal has to match the label exactly as it is stored in the graph.

In [None]:
#For instance, check for recipes with almonds
almonds_rec = []
for s1,p1,o1 in food.triples( (None, None, Literal("almond"))): #the literal should be the label given
    for s2, p2, o2 in food.triples( (None, WTM.hasIngredient, s1)):
        print("{}, containing: ({})".format(s2,s1))
        print("{}, containing: ({})".format(s1,o1))
        almonds_rec.append(s2)
            
print("Almond free recipes:")
for s,p,o in food.triples((None, RDF.type, WTM.Recipe)):
    if s not in almonds_rec:
        print(s)



Checking for gluten replacements:

In [None]:
gl_sub = set()
print("All ingredients with gluten that are replaceable:")
for s1,p1,o1 in food.triples( (None, WTM.hasGluten,Literal(True)) ): #all ingredients with gluten
    for s2,p2,o2 in food.triples( (None, WTM.substitutesFor, s1)):
        if (s2, WTM.hasGluten, Literal(False)) in food:
            gl_sub.add(s1)
            print(f'{s2} replaces {s1}')

print("\nThese recipes have a gluten ingredient that can be replaced:")
for s,p,o in food.triples((None, RDF.type, WTM.Recipe)):
    if s in gl_rec:
        all_replaced = True
        for s1,p1,o1 in food.triples((s, WTM.hasIngredient, None)):
            if (o1, WTM.hasGluten,Literal(True)) in food and o1 not in gl_sub: # ingredient has gluten but cannot be replaced
                all_replaced = False
                break
        if all_replaced:
            print(s)

In [None]:
#Recipes with pecan nuts
pecan_rec = []
print("All recipes with Pecan:") 
#Recipes that directly contain pecan
for s in food.subjects(WTM.hasIngredient, IND.Pecan):
    pecan_rec.append(s)
    print(s)

print("Pecan free recipes:")
for s,p,o in food.triples((None, RDF.type, WTM.Recipe)):
    if s not in pecan_rec:
        print(s)

