**This is an edited "re-post" of my blog post: ["How to Create Representations of Entities in a Knowledge Graph using pyRDF2Vec"](https://towardsdatascience.com/how-to-create-representations-of-entities-in-a-knowledge-graph-using-pyrdf2vec-82e44dad1a0)**

# Representing Data With Knowledge Graphs

**Graphs** are data structures that are well suited to representing ubiquitous phenomena, such as social networks, chemical molecules and recommendation systems. One of their strengths is that they explicitly model relations (i.e. edges) between individual units (i.e. nodes), which adds an extra dimension to the data.

We can illustrate the added value of this data enrichment using the [Cora citation network](https://relational.fit.cvut.cz/dataset/CORA). This dataset contains a bag-of-words representation for a few thousand papers and the citation relations between them. If we apply dimensionality reduction (t-SNE) to create a 2D plot of the bag-of-words representations (Figure, left), clusters arise (colored according to research topic), but they overlap. If we instead produce an embedding with a graph neural network that takes the citation information into account (Figure, right), the clusters are better separated.
    
![left: A t-SNE embedding of the bag-of-words representations of each paper. right: An embedding produced by a graph network that takes into account the citations between papers. source: [“Deep Graph Infomax” by Velickovic et al.](https://arxiv.org/abs/1809.10341)](https://miro.medium.com/max/493/0*y332aTSAuQIkzz_K.png)
    
<p style="text-align: center">Source: <a href="https://arxiv.org/abs/1809.10341">“Deep Graph Infomax” by Velickovic et al.</a></p>

**Knowledge Graphs (KGs)** are a specific type of graph. They are multi-relational (i.e. there are different edges for different types of relations) and directed (i.e. each relation has a subject and an object). These properties allow us to represent information from heterogeneous sources in a uniform format.
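
To make "multi-relational and directed" concrete, here is a minimal sketch that stores a KG as a list of (subject, predicate, object) triples. The triples are illustrative and written by hand, not read from the dataset, and the `outgoing` index is just one possible in-memory layout.

```python
from collections import defaultdict

# A KG is a set of directed, labelled edges: (subject, predicate, object).
# These triples are hand-written for illustration only.
triples = [
    ("Belgium", "dbo:capital", "City of Brussels"),
    ("Belgium", "dbo:currency", "Euro"),
    ("City of Brussels", "dbo:mayor", "Yvan Mayeur"),
    ("City of Brussels", "dbo:country", "Belgium"),
]

# Index the outgoing edges per subject. The predicate is kept, so edges of
# different types between the same two nodes remain distinguishable.
outgoing = defaultdict(list)
for subject, predicate, obj in triples:
    outgoing[subject].append((predicate, obj))

print(outgoing["Belgium"])
```

Because the edges are directed, `Belgium -> dbo:capital -> City of Brussels` and `City of Brussels -> dbo:country -> Belgium` are two separate facts.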

# Countries in DBpedia

The dataset used here describes several countries, with information taken from DBpedia. Let’s take a look at how the KG looks in the neighbourhood of a specific country: 🇧🇪 Belgium 🇧🇪. This process is analogous to going to [its corresponding DBpedia page](http://dbpedia.org/page/Belgium) and then recursively clicking on all the links on that page. We depict this below in Figure 3. Expanding this neighbourhood iteratively makes things complex quickly, even though we simplified the picture by removing some of its parts. Nevertheless, we see that DBpedia contains some useful information about Belgium (e.g., its national anthem, largest city, currency, …).

![](https://miro.medium.com/max/1890/1*-fo07n-06Obzqks4hoGQyg.png)
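
The "recursively clicking on all the links" analogy corresponds to a breadth-first expansion with a hop limit. The sketch below is a toy version under that assumption: `neighbourhood` is a hypothetical helper, and the triples (including the made-up `ex:usedIn` predicate) are illustrative rather than taken from DBpedia.

```python
from collections import defaultdict, deque

def neighbourhood(triples, root, max_hops):
    """Return {node: hop distance} for every node within max_hops of root."""
    outgoing = defaultdict(list)
    for subject, _predicate, obj in triples:
        outgoing[subject].append(obj)

    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in outgoing[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance

# Hand-written toy triples; "ex:usedIn" is a made-up predicate.
toy_triples = [
    ("Belgium", "dbo:capital", "City of Brussels"),
    ("Belgium", "dbo:currency", "Euro"),
    ("City of Brussels", "dbo:mayor", "Yvan Mayeur"),
    ("Euro", "ex:usedIn", "Eurozone"),
]

for hops in range(3):
    print(hops, sorted(neighbourhood(toy_triples, "Belgium", hops)))
```

Even in this tiny graph the neighbourhood grows with every extra hop; on DBpedia, where a page can link to hundreds of others, the growth is far steeper.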

# Creating Entity Embeddings With RDF2Vec

**[RDF2Vec](http://rdf2vec.org)** stands for Resource Description Framework To Vector. It is an unsupervised, task-agnostic algorithm that represents the nodes of a KG numerically, so that they can be used in further (downstream) machine learning tasks. RDF2Vec builds on existing natural language processing techniques: it combines insights from DeepWalk and Word2Vec. Word2Vec generates an embedding for each word in a provided collection of sentences (often called a corpus). To generate a corpus for a KG, we extract walks. Extracting walks is similar to visiting the DBpedia page of an entity and clicking on links. The number of clicks you make is equivalent to the number of hops in a walk. An example of such a walk, again for Belgium, would be:

Belgium -> dbo:capital -> City of Brussels -> dbo:mayor -> Yvan Mayeur. 

Note that we make no distinction between predicates/properties (e.g., dbo:capital and dbo:mayor) and entities (e.g., Belgium, Brussels, Yvan Mayeur, …) in our walks, as explained in Figure 2. Each walk can now be seen as a sentence, and the hops in that walk correspond to the tokens (words) of a sentence. Once we have extracted a large number of walks rooted at the entities we want to create embeddings for, we can provide them as a corpus to Word2Vec. Word2Vec will then learn an embedding for each unique hop, which can then be used for ML tasks.
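
The walk extraction described above can be sketched in a few lines of plain Python. This is a simplified illustration, not pyRDF2Vec's implementation: `extract_walks` is a hypothetical helper that enumerates every walk of up to `depth` hops from a root, and the triples are hand-written.

```python
from collections import defaultdict

def extract_walks(triples, root, depth):
    """Enumerate all walks of up to `depth` hops that start at `root`."""
    outgoing = defaultdict(list)
    for subject, predicate, obj in triples:
        outgoing[subject].append((predicate, obj))

    walks = [(root,)]
    for _ in range(depth):
        extended = []
        for walk in walks:
            edges = outgoing.get(walk[-1], [])
            if not edges:
                extended.append(walk)  # dead end: keep the walk as it is
            for predicate, obj in edges:
                # Predicates and entities both become tokens of the walk.
                extended.append(walk + (predicate, obj))
        walks = extended
    return walks

toy_triples = [
    ("Belgium", "dbo:capital", "City of Brussels"),
    ("Belgium", "dbo:currency", "Euro"),
    ("City of Brussels", "dbo:mayor", "Yvan Mayeur"),
]

# Each walk is a "sentence"; the list of walks is the corpus for Word2Vec.
corpus = [list(walk) for walk in extract_walks(toy_triples, "Belgium", depth=2)]
for sentence in corpus:
    print(" -> ".join(sentence))
```

The first printed sentence is the Belgium walk from the example above. Enumerating all walks only works for small graphs or shallow depths, which is why sampling walks is a common alternative.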

pyRDF2Vec is a repository that contains a Python implementation of the RDF2Vec algorithm. On top of the original algorithm, different extensions are implemented as well.

![](https://miro.medium.com/max/700/1*dgij9Wdt9LgEo-CCltet8g.png)

In [None]:
# Something weird is going on with the latest version of pip:
# we need to do this import first, or else `import gensim` (which is done
# within pyrdf2vec) crashes after the pip install
import gensim

In [None]:
!python -m pip install --upgrade pip
!pip install pyrdf2vec --use-feature=2020-resolver

# Tutorial: creating country representations with pyRDF2Vec

In this tutorial, I will demonstrate basic usage of the pyRDF2Vec library using the country dataset previously described. Let us start with loading our data. 

# 1. Loading the data
pyRDF2Vec can easily load files in different RDF syntaxes by wrapping around rdflib. This loads the entire KG into RAM, which becomes problematic when the KG is larger than the available memory. We therefore also support interaction with endpoints: the KG can be hosted on some server, and our KG object will query that endpoint whenever necessary. This drastically reduces the required RAM, at the cost of higher latency.
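
To give an idea of what "interacting with an endpoint" involves, the sketch below builds the kind of one-hop SPARQL request that a walker could send for each entity it visits. It is an assumption about the general shape of such a request, not a copy of the queries pyRDF2Vec issues. `one_hop_query` and `request_url` are hypothetical helpers, the `format` parameter is a convention of DBpedia's endpoint software rather than part of the SPARQL protocol, and nothing is sent over the network.

```python
from urllib.parse import urlencode

def one_hop_query(entity_uri):
    """SPARQL query for all outgoing (predicate, object) pairs of an entity."""
    return f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o . }}"

def request_url(endpoint, entity_uri):
    """Build (but do not send) a GET URL for the query above."""
    params = {
        "query": one_hop_query(entity_uri),
        # Asks the endpoint for JSON results (endpoint-specific parameter).
        "format": "application/sparql-results+json",
    }
    return f"{endpoint}?{urlencode(params)}"

url = request_url("https://dbpedia.org/sparql",
                  "http://dbpedia.org/resource/Belgium")
print(one_hop_query("http://dbpedia.org/resource/Belgium"))
print(url)
```

Every hop of every walk needs such a round trip, which is where the higher latency of the remote option comes from.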

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
from pyrdf2vec.graphs import KG
import rdflib
import pandas as pd

In [None]:
# Load CSV file with country names and their labels
country_data = pd.read_csv('../input/dbpedia-country-information/countries.csv')
entities = country_data['Country']

# We manually remove a few countries for which querying went wrong...
entities = list(set(entities) - {
  'http://dbpedia.org/resource/Antigua_and_Barbuda',
 'http://dbpedia.org/resource/Belize',
 'http://dbpedia.org/resource/Bosnia_and_Herzegovina',
 'http://dbpedia.org/resource/Botswana',
 'http://dbpedia.org/resource/Chad',
 'http://dbpedia.org/resource/China',
 'http://dbpedia.org/resource/Georgia',
 'http://dbpedia.org/resource/Guinea',
 'http://dbpedia.org/resource/Ireland',
 'http://dbpedia.org/resource/Malta',
 'http://dbpedia.org/resource/Mexico',
 'http://dbpedia.org/resource/Trinidad_and_Tobago',
 'http://dbpedia.org/resource/Uganda'
 })


# We will exclude triples (s, p, o) with p in label_predicates from our KG
# as these do not carry any useful information.
label_predicates = [
     'http://dbpedia.org/ontology/abstract',
     'http://dbpedia.org/ontology/flag',
     'http://dbpedia.org/ontology/thumbnail',
     'http://dbpedia.org/ontology/wikiPageExternalLink',
     'http://dbpedia.org/ontology/wikiPageID',
     'http://dbpedia.org/ontology/wikiPageRevisionID',
     'http://dbpedia.org/ontology/wikiPageWikiLink',
     'http://dbpedia.org/property/flagCaption',
     'http://dbpedia.org/property/float',
     'http://dbpedia.org/property/footnoteA',
     'http://dbpedia.org/property/footnoteB',
     'http://dbpedia.org/property/footnoteC',
     'http://dbpedia.org/property/source',
     'http://dbpedia.org/property/width',
     'http://purl.org/dc/terms/subject',
     'http://purl.org/linguistics/gold/hypernym',
     'http://purl.org/voc/vrank#hasRank',
     'http://www.georss.org/georss/point',
     'http://www.w3.org/2000/01/rdf-schema#comment',
     'http://www.w3.org/2000/01/rdf-schema#label',
     'http://www.w3.org/2000/01/rdf-schema#seeAlso',
     'http://www.w3.org/2002/07/owl#sameAs',
     'http://www.w3.org/2003/01/geo/wgs84_pos#geometry',
     'http://dbpedia.org/ontology/wikiPageRedirects',
     'http://www.w3.org/2003/01/geo/wgs84_pos#lat',
     'http://www.w3.org/2003/01/geo/wgs84_pos#long',
     'http://www.w3.org/2004/02/skos/core#exactMatch',
     'http://www.w3.org/ns/prov#wasDerivedFrom',
     'http://xmlns.com/foaf/0.1/depiction',
     'http://xmlns.com/foaf/0.1/homepage',
     'http://xmlns.com/foaf/0.1/isPrimaryTopicOf',
     'http://xmlns.com/foaf/0.1/name',
     'http://dbpedia.org/property/website',
     'http://dbpedia.org/property/west',
     'http://dbpedia.org/property/wordnet_type',
     'http://www.w3.org/2002/07/owl#differentFrom',
]

# KG Loading Alternative 1: Loading the entire turtle file into memory
kg = KG("../input/dbpedia-country-information/countries.ttl", file_type='turtle',
        label_predicates=[rdflib.URIRef(x) for x in label_predicates])

# KG Loading Alternative 2: Using a dbpedia endpoint (nothing is loaded into memory)
# kg = KG("https://dbpedia.org/sparql", is_remote=True,
#         label_predicates=[rdflib.URIRef(x) for x in label_predicates])

In [None]:
# Make sure that every entity can be found in our KG
filtered_entities = [e for e in entities if e in kg._entities]
not_found = set(entities) - set(filtered_entities)
print(f'{len(not_found)} entities could not be found in the KG! Removing them...')
entities = filtered_entities

# 2. Extracting our first embeddings

Now that we have our KG loaded in memory, we can start creating embeddings! In order to do this, we create an `RDF2VecTransformer` and then call the `fit()` function with the freshly loaded KG and a list of entities. Once the model is fitted, we can retrieve their embeddings through the `transform()` function. One thing that is different from the regular scikit-learn flow (where we call `fit()` on the train data and `predict()` or `transform()` on the test data) is that both the train and test entities have to be provided to the `fit()` function (similar to how t-SNE works in scikit-learn). Since RDF2Vec works unsupervised, this does not introduce label leakage.

In [None]:
from pyrdf2vec import RDF2VecTransformer

In [None]:
transformer = RDF2VecTransformer()
walk_embeddings = transformer.fit(kg, entities, verbose=True).transform(entities)
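
Why do the test entities have to be passed to `fit()` as well? A walk-based embedder is essentially a lookup table over the tokens it saw in its corpus. The toy class below illustrates that idea only: `ToyLookupEmbedder` is hypothetical, and it uses one-hot vectors instead of learned Word2Vec embeddings.

```python
class ToyLookupEmbedder:
    """Toy stand-in for a walk-based embedder: it can only return vectors
    for tokens that appeared in the walks it was fitted on."""

    def fit(self, walks):
        vocabulary = sorted({token for walk in walks for token in walk})
        # One-hot vectors stand in for the learned embeddings.
        self.vectors_ = {
            token: [1.0 if i == j else 0.0 for j in range(len(vocabulary))]
            for i, token in enumerate(vocabulary)
        }
        return self

    def transform(self, entities):
        missing = [e for e in entities if e not in self.vectors_]
        if missing:
            raise ValueError(f"No embedding for entities not seen in fit: {missing}")
        return [self.vectors_[e] for e in entities]

walks = [["Belgium", "dbo:capital", "City of Brussels"]]
embedder = ToyLookupEmbedder().fit(walks)
print(embedder.transform(["Belgium"]))

try:
    embedder.transform(["France"])  # never appeared in any walk
except ValueError as error:
    print(error)
```

No labels are involved at any point, which is why fitting on train and test entities together does not leak label information.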

# 3. Creating a t-SNE plot

The code snippet from Section 2 gives us a list of lists: for each entity passed to the `transform()` method, a 100-dimensional embedding is returned. To inspect these embeddings with the human eye, we need to reduce the dimensionality further. One great technique for doing this is t-SNE.

In [None]:
!pip install adjustText
!pip install matplotlib==3.1.3

In [None]:
from sklearn.manifold import TSNE
%matplotlib inline
import matplotlib.pyplot as plt
from adjustText import adjust_text

In [None]:
walk_tsne = TSNE(random_state=42)
X_tsne = walk_tsne.fit_transform(walk_embeddings)

In [None]:
plt.figure(figsize=(15, 15))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=100)

texts = []
for x, y, lab in zip(X_tsne[:, 0], X_tsne[:, 1], entities):
    lab = lab.split('/')[-1]
    text = plt.text(x, y, lab)
    texts.append(text)
    
adjust_text(texts, lim=5, arrowprops=dict(arrowstyle="->", color='r', lw=0.5))
plt.show()

# 4. Baseline classification results

Now let’s take a look at how well these embeddings solve the three ML tasks provided in the CSV file: two binary classification tasks (high/low inflation and high/low academic output) and a multi-class classification task (predict the continent). It should be noted that, since RDF2Vec is unsupervised, this label information was never used during the creation of the embeddings! RDF2Vec is task-agnostic: the projection from our nodes to an embedding is not tailored towards a specific task, and the embeddings can be used for multiple different downstream tasks. Let’s create a utility function that takes the produced embeddings as input and then performs classification for all three tasks:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
import numpy as np

def classify(walk_embeddings):
    for task in ['Research Rating', 'Inflation Rating', 'Continent']:

        # Split our data into train and test (50/50 split)
        data = train_test_split(country_data['Country'], country_data[task], 
                                stratify=country_data[task], test_size=0.5,
                                random_state=42)
        train_ent, test_ent, y_train, y_test = data

        # Create masks that keep only the entities that are included in the KG.
        train_mask = [x in entities for x in train_ent]
        test_mask = [x in entities for x in test_ent]
        y_train = y_train[train_mask]
        y_test = y_test[test_mask]

        # Create our X_train and X_test, which consist of the created embeddings
        X_train = []
        X_test = []
        for entity in train_ent:
            if entity in entities:
                X_train.append(walk_embeddings[entities.index(entity)])
        for entity in test_ent:
            if entity in entities:
                X_test.append(walk_embeddings[entities.index(entity)])
        X_train = np.array(X_train)
        X_test = np.array(X_test)

        # Fit a Random Forest & tune some of its hyper-parameters
        rf = GridSearchCV(RandomForestClassifier(random_state=42), 
                          {'n_estimators': [10, 50, 100], 'max_depth': [3, 5, None]},
                          cv=10)
        rf.fit(X_train, y_train)
        preds = rf.predict(X_test)

        # Evaluate our model on the test data
        print(task)
        print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
        print('Accuracy =', accuracy_score(y_test, preds))
        print(confusion_matrix(y_test, preds))
        print()

In [None]:
classify(walk_embeddings)

# 5. Initial hyper-parameter tuning

Each of the three building blocks of the RDF2Vec algorithm (walking algorithm, sampling strategy and embedding technique) is configurable. For now, let’s try to extract deeper walks and generate larger embeddings:

In [None]:
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import RandomWalker

In [None]:
transformer = RDF2VecTransformer(walkers=[RandomWalker(3, None)], 
                                 embedder=Word2Vec(size=500))
walk_embeddings = transformer.fit(kg, entities, verbose=True).transform(entities)

In [None]:
classify(walk_embeddings)

# 6. Setting the walking strategies

pyRDF2Vec allows us to use different walking strategies; an overview of the different strategies is provided in Figure 5. Moreover, we can combine different strategies: pyRDF2Vec will extract walks with each strategy and concatenate the extracted walks before providing them to the embedding technique. Let’s try combining several walking strategies:

In [None]:
from pyrdf2vec.walkers import CommunityWalker, HalkWalker

In [None]:
transformer = RDF2VecTransformer(walkers=[HalkWalker(3, None, freq_thresholds=[0.01]),
                                          CommunityWalker(3, None, resolution=0.1, 
                                                          hop_prob=0.25)],
                                 embedder=Word2Vec(size=500))
walk_embeddings = transformer.fit(kg, entities, verbose=True).transform(entities)

In [None]:
classify(walk_embeddings)
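
The concatenation step mentioned at the start of this section is simple to sketch. The two strategies below are toy stand-ins written for this illustration (they are not the HALK or community strategies), and `build_corpus` is a hypothetical helper: each strategy produces its own walks, and the walks are pooled into one corpus.

```python
def forward_walks(triples, root):
    """Toy strategy 1: one-hop walks along the outgoing edges of root."""
    return [[s, p, o] for s, p, o in triples if s == root]

def backward_walks(triples, root):
    """Toy strategy 2: one-hop walks along the incoming edges of root."""
    return [[s, p, o] for s, p, o in triples if o == root]

def build_corpus(triples, root, strategies):
    """Run every strategy and concatenate the resulting walks."""
    corpus = []
    for strategy in strategies:
        corpus.extend(strategy(triples, root))
    return corpus

toy_triples = [
    ("Belgium", "dbo:capital", "City of Brussels"),
    ("Belgium", "dbo:currency", "Euro"),
    ("City of Brussels", "dbo:country", "Belgium"),
]

corpus = build_corpus(toy_triples, "Belgium", [forward_walks, backward_walks])
for sentence in corpus:
    print(sentence)
```

The embedding technique only ever sees the pooled corpus, so it does not need to know which strategy produced which walk.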

# Check out our GitHub!

[The pyRDF2Vec repository can be found on GitHub](https://github.com/IBCNServices/pyRDF2Vec). Feel free to give us a star if you like the repository; it is greatly appreciated! Moreover, we welcome all kinds of contributions.