# Knowledge Graphs and Semantic Technologies -- KG measures and clustering tutorial

First, make sure you have networkx installed. You can find its documentation [here](https://networkx.org/documentation/stable/reference/index.html).

In [None]:
## Uncomment if you do not have networkx installed (you should have it installed from the RDFS tutorial)
#import sys
#!{sys.executable} -m pip install networkx

import pandas as pd
import rdflib
from rdflib import Literal
from rdflib.namespace import DC, FOAF

import networkx as nx
from owlready2 import *

In this tutorial, we will focus on how to characterize an ontology or knowledge graph.

Load the ontology that you created in the OWL tutorial (the code below loads the inferred OWL file).

In [None]:
ontology = rdflib.Graph()
ontology.parse("data/my_music_ontology_inferred.owl")

## 1. Basic (ontology) measures

Let's first focus on calculating basic measures:
* number of classes
* number of properties
* number of individuals
* number of triples
* number of entities (classes, individuals, etc.: anything that can be placed in the subject position of a triple/axiom)

We start by counting the number of classes in the ontology. This can be done using a SPARQL query. We want to get the unique classes used in the ontology. 

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?s
WHERE { ?s rdf:type owl:Class. FILTER isURI(?s) }

This query gives us all the classes that have a definition in the ontology. However, this number does not have to equal the number of classes actually used by individuals. Hence, you need to be very specific about what the number you are retrieving represents.

In [None]:
answer = list(ontology.query(
    'PREFIX owl: <http://www.w3.org/2002/07/owl#> SELECT DISTINCT ?s WHERE { ?s rdf:type owl:Class. FILTER isURI(?s) }'
))
print("Number of classes: {f}".format(f=len(answer)))
for r in answer:
    print(r)


Even though we used a query, a lot of this information can also be retrieved with owlready2. For example, the classes can be retrieved with the `classes()` method of a loaded ontology, which returns all classes in the ontology. We try it below.

In [None]:
onto_file = "data/my_music_ontology_inferred.owl"
or_ontology = get_ontology(onto_file).load()
answer = list(or_ontology.classes())

print("Number of classes: {f}".format(f=len(answer)))
for r in answer:
    print(r)

In [None]:
ind = list(or_ontology.individuals())
print(len(ind))

### Exercise 1

Get the following metrics from the Hybrid Intelligence ontology, using queries and check your answer using owlready2 functions.
WARNING: loading the .ttl file might cause the parsing to break. You can pre-process the file by loading it with rdflib and serializing it into ntriples/nt.

* number of properties
* number of individuals
* number of triples
* number of entities (classes, individuals, etc.: anything that can be placed in the subject position of a triple/axiom)

In [None]:
### your code here.

## 2. Converting KGs into Gs

To make use of graph measures, we need to convert our ontology into a mathematical graph, here a networkx graph.


We first need to remove all the logic (the axioms) before we can do the conversion.
We are interested in keeping the following things:
* individuals
* classes
* relationships between individuals and classes

What we need to remove is:
* restrictions
* domain/range
* property definitions

There are two ways to do this: we can either remove the information from the existing graph, or create a new graph using only the information we are interested in. Depending on the size and complexity of your knowledge graph, one way will be preferable to the other. You also need to consider whether you want to keep the inferred information in your graph after conversion. Here, we want to keep the inferred information, but that depends on the task you will execute afterwards (for link prediction, you probably want the uninferred ontology and use the inferred information as a test set).

rdflib comes with a function that lets us convert an rdflib graph into a networkx graph.

In [None]:
from rdflib.extras.external_graph_libs import rdflib_to_networkx_digraph
nx_graph = rdflib_to_networkx_digraph(ontology)

list(nx_graph.nodes())

As we can see, some blank nodes were converted into the graph that are not very useful for us at this stage. To analyse the graph as a mathematical graph, we don't want the class restrictions or property range and domain definitions in our graph, as we are no longer doing any reasoning.

Often, it is easier to create a new graph than to remove already modeled information from the graph. Instead of continuing with the ontology, we will create a graph from the metadata provided in 'data/musicoset_metadata', but adhere to the ontology from before (using its property and class names, etc.).

In [None]:
csv_albums =  pd.read_csv('data/musicoset_metadata/albums.csv',sep='\t')
print(csv_albums.columns)
csv_artists =  pd.read_csv('data/musicoset_metadata/artists.csv',sep='\t')
print(csv_artists.columns)
csv_songs =  pd.read_csv('data/musicoset_metadata/songs.csv',sep='\t')
print(csv_songs.columns)
csv_tracks =  pd.read_csv('data/musicoset_metadata/tracks.csv',sep='\t')
print(csv_tracks.columns)

In [None]:
# We have prepared a simplified ontology to use in this tutorial
# This ontology doesn't have any restrictions or domain/range definitions
# This is to avoid blank nodes when converting to networkx
music_onto = rdflib.Graph()
music_onto.parse("data/music_onto_simple.rdf")

nx_music = rdflib_to_networkx_digraph(music_onto)
list(nx_music.nodes())

This ontology no longer produces any blank nodes. So we can now populate it with the metadata loaded from the CSV.

In [None]:
music_onto.parse("data/music_onto_simple.rdf")

import ast
from rdflib import RDF, URIRef

EX = rdflib.Namespace("http://test.org/myonto.owl#")

# Artist types from the CSV that map to the SoloArtist and Band classes
solo_artists = ["singer", "rapper", "DJ"]
band = ["band", "duo"]

for index, artist in csv_artists.iterrows():
    art = URIRef(EX + artist["artist_id"])
    # Any other artist type (e.g. '-') falls back to the generic Artist class
    if artist["artist_type"] in solo_artists:
        music_onto.add((art, RDF.type, EX.SoloArtist))
    elif artist["artist_type"] in band:
        music_onto.add((art, RDF.type, EX.Band))
    else:
        music_onto.add((art, RDF.type, EX.Artist))

    music_onto.add((art, EX.name, Literal(artist["name"])))
    music_onto.add((art, EX.followers, Literal(artist["followers"])))
    # Genre names may contain spaces, which are not allowed in a URI
    genre = URIRef(EX + artist["main_genre"].replace(" ", "_"))
    music_onto.add((art, EX.hasGenre, genre))
    music_onto.add((genre, RDF.type, EX.Genre))

for index, song in csv_songs.iterrows():
    s = URIRef(EX + song["song_id"])
    music_onto.add((s, RDF.type, EX.Song))
    music_onto.add((s, EX.name, Literal(song["song_name"])))
    # The 'artists' column holds a dict literal whose keys are artist ids;
    # ast.literal_eval parses it without executing arbitrary code (unlike eval)
    artists = ast.literal_eval(song["artists"])
    for key in artists:
        art = URIRef(EX + key)
        music_onto.add((art, EX.authorOf, s))


Feel free to extend your ontology to include albums and track information below. You might have to extend the basic ontology as well with some additional relations.

In [None]:
### place your code here

Now we convert the ontology to a networkx graph.

In [None]:
nx_music_graph = rdflib_to_networkx_digraph(music_onto)

### Exercise 2

* Convert the Hybrid Intelligence ontology into a networkx graph.
* Write a function that checks for blank nodes in your networkx graph, and apply it both to the graph created above (nx_graph) and to the one you have created just now.

In [None]:
### place your code here

## 3. Graph Measures

Now we can calculate some graph measures over the NetworkX graph. The library provides many different measures. Always check which assumptions a measure makes:
* directed or undirected graph?
* does the graph have to be connected?

We will first calculate some basic graph measures: number of nodes, number of edges and the density of the graph.
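
Why the directed/undirected question matters can be seen on a three-node toy graph (made up for illustration). For a graph with n nodes and m edges, `nx.density` uses m / (n(n-1)) for directed graphs and 2m / (n(n-1)) for undirected graphs, and connectivity also has two flavours in the directed case.

```python
import networkx as nx

# The same two edges, once as a directed and once as an undirected graph
edges = [("a", "b"), ("b", "c")]
directed = nx.DiGraph(edges)
undirected = nx.Graph(edges)

# Directed: m / (n * (n - 1)) = 2 / 6
d_dir = nx.density(directed)
# Undirected: 2m / (n * (n - 1)) = 4 / 6
d_undir = nx.density(undirected)
print("Density (directed):  ", d_dir)
print("Density (undirected):", d_undir)

# A directed graph can be weakly connected without being strongly connected
weak = nx.is_weakly_connected(directed)
strong = nx.is_strongly_connected(directed)
print("Weakly connected:", weak, "| strongly connected:", strong)
```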

In [None]:
print("Number of Nodes: {n}".format(n=nx.number_of_nodes(nx_music_graph)))
print("Number of Edges: {n}".format(n=nx.number_of_edges(nx_music_graph)))
print("Density of Graph: {n}".format(n=nx.density(nx_music_graph)))

We will now look at the degree distribution of our nodes by retrieving the degree of each node and plotting a histogram. We do this with the graph of the original ontology (`nx_graph`) to show how it works. The graph we just created is not dense enough to show much in a histogram.
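
To make the return values easier to read, here is a small sketch on a star graph (a toy example, not the tutorial data). `nx.degree_histogram` returns a list in which position d holds the number of nodes with degree d, and `nx.degree_centrality` divides each degree by n - 1. In a directed graph, the plain degree is the sum of in-degree and out-degree.

```python
import networkx as nx

# Star graph: node 0 in the centre, connected to nodes 1, 2 and 3
star = nx.star_graph(3)

# hist[d] = number of nodes with degree d
hist = nx.degree_histogram(star)
# Degree centrality = degree / (n - 1)
centrality = nx.degree_centrality(star)
print("Histogram:", hist)
print("Degree centrality:", centrality)

# Directed version: all edges point away from the centre
directed_star = nx.DiGraph([(0, 1), (0, 2), (0, 3)])
total = dict(directed_star.degree())
incoming = dict(directed_star.in_degree())
outgoing = dict(directed_star.out_degree())
print("Centre node: degree", total[0], "in", incoming[0], "out", outgoing[0])
```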

In [None]:
import matplotlib.pyplot as plt
import numpy as np

histdegree = pd.DataFrame(nx.degree_histogram(nx_graph))
degree = dict(nx.degree(nx_graph))


mean_degree = np.mean(list(degree.values()))
mean_degree_centrality = np.mean(list(nx.degree_centrality(nx_graph).values()))

fig, ax = plt.subplots(figsize=(12,6)) 
ax.bar(histdegree.index.values,histdegree[0])

plt.title("Mean Degree: {n1}\n Mean Degree Centrality: {n2}".format(n1=mean_degree,n2=mean_degree_centrality))
plt.show()

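Another very useful measure is the clustering coefficient, which measures how strongly the neighbours of a node are connected to each other, i.e. how likely the nodes are to form clusters. It can be calculated for each node, or averaged over all nodes to obtain a single value for the whole graph, which is what we do below.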
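
The relation between the per-node and the graph-level value can be checked by hand on a toy graph: a triangle with one extra node attached. This sketch uses an undirected graph for simplicity. The tutorial graphs are directed, for which networkx applies a directed variant of the coefficient.

```python
import networkx as nx

# Triangle a-b-c plus a pendant node d attached to c
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

# Per-node coefficient: fraction of a node's neighbour pairs that are connected
local = nx.clustering(G)
# a and b: 1 of 1 pair connected; c: 1 of 3 pairs; d: only one neighbour, so 0
print("Local clustering:", local)

# Average of the per-node values: (1 + 1 + 1/3 + 0) / 4
avg = nx.average_clustering(G)
# Transitivity is a different global measure: 3 * triangles / connected triples
trans = nx.transitivity(G)
print("Average clustering:", avg)
print("Transitivity:", trans)
```

Note that the average clustering coefficient and the transitivity generally differ, so state which one you report.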

In [None]:
print("Clustering coefficient: {n}".format(n=nx.average_clustering(nx_graph)))

### Exercise 3

Calculate and visualise the centrality of the music graph. Use a different measure than degree. For different measures you can refer to this online [documentation](https://networkx.org/documentation/stable/reference/algorithms/centrality.html). Choose wisely though: some measures take a long time to calculate (like betweenness or eigenvector centrality).

As a second step, take some time to explore the documentation of networkx. Is there anything else you can calculate and learn about the graph?

In [None]:
### your code here

## 4. Visualizing your graph

With NetworkX you can easily visualize your ontology/graph, whether or not it includes blank nodes. The visualisations are powered by matplotlib. Here we use the simplified ontology from Section 2, which is small enough to visualize clearly.

In [None]:
# list(nx_graph.nodes())

music_ontology = rdflib.Graph()
music_ontology.parse("data/music_onto_simple.rdf")
nx_graph = rdflib_to_networkx_digraph(music_ontology)


mapping = pd.DataFrame(nx_graph.nodes())
mapping['new_names'] = mapping[0].str.split("#",n=1,expand=False)
mapping['label'] = 'NA'
# print(mapping)
mapping_copy = mapping.copy()

for ind, m in mapping_copy.iterrows():
    l = len(m['new_names'])
    names = m['new_names']
    mapping.loc[ind,'label'] = names[l-1]
    
map_dict = dict(zip(mapping[0],mapping['label']))

# print(mapping)
nx_graph_nl = nx.relabel_nodes(nx_graph,map_dict,copy=True)


nx.draw_planar(nx_graph_nl,
               with_labels=True,
               node_size=200,
               font_size=8)
# plt.draw()

### Exercise 4

Try to visualize the Hybrid Intelligence ontology below. Don't shy away from looking for and trying other approaches to visualize an ontology, giving nodes different colors, or varying thickness/color of the edges based on the type of relation.

In [None]:
### your code goes here

## 5. Clustering

NetworkX already comes with some clustering algorithms. We will try the one introduced in the theory part of the class, the Louvain clustering algorithm.
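
Before running it on the ontology graph, it can help to see what the function returns on a graph with an obvious community structure. The sketch below uses a toy "barbell" graph (two cliques of five nodes joined by a single edge). The fixed `seed` is only there to make the run reproducible. It assumes networkx 2.7 or later, where `louvain_communities` is available.

```python
import networkx as nx
import networkx.algorithms.community as nx_comm

# Two complete graphs with 5 nodes each, joined by one edge
G = nx.barbell_graph(5, 0)

# Returns a list of sets, one set of nodes per community
communities = nx_comm.louvain_communities(G, resolution=1, seed=42)

# Every node should belong to exactly one community
is_valid = nx_comm.is_partition(G, communities)
# Modularity scores how well the partition separates the graph
quality = nx_comm.modularity(G, communities)

print("Number of communities:", len(communities))
print("Valid partition:", is_valid)
print("Modularity:", quality)
for com in communities:
    print(sorted(com))
```

A higher `resolution` favours smaller communities, a lower one favours larger communities.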

In [None]:
import networkx.algorithms.community as nx_comm


communities = nx_comm.louvain_communities(nx_graph,resolution=1)
print(1,len(communities),nx.number_of_nodes(nx_graph))


Now we can visualize the graph and the communities within it. Alternatively, you can visualize a single community as an example. (WARNING: this might take a while...)

In [None]:
pos = nx.shell_layout(nx_graph)


nx.draw(nx_graph, pos, edge_color='k',font_weight='light', 
        node_size= 100, width= 0.8)
for com in communities:
    nx.draw_networkx_nodes(nx_graph,
                           pos,
                           nodelist=com, 
                           node_color=np.random.rand(3,),
                           node_size=100)
plt.show()

Whether clustering is successful and helpful depends heavily on your graph. It is beneficial to remove the information from your graph that is not helpful for the clustering, like the T-box of your ontology, as this "pollutes" the graph. The clustering algorithm is not made for knowledge graphs but rather for mathematical graphs, hence less semantics is better.

In the case above, the graph is not dense enough to produce meaningful clusters, which is why there are 38706 clusters. 

### Exercise 5

Try to cluster your own graph. You might have to create a mathematical graph first, rather than just converting your knowledge graph, as that will lead to better and more insightful results.

In [None]:
### your code goes here.