# Extracting RDF triples from plain text

## Introduction

Working on a task to create Linked Open Data from Wikipedia pages led to the realization that the first sentence of a Wikipedia article can often be distilled to a simple statement. For example, the description of Wikipedia itself, "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.", can be read as "Wikipedia is encyclopedia" by removing the decorators (adjectives) and keeping the nouns (proper and common in this case) and the predicate (auxiliary verb). This led to the hypothesis that RDF triples can be extracted from raw text.

This notebook contains the first experiments to test this hypothesis. For this goal the experiments are structured as follows:

    - Data collection - three text documents: one fiction, one media (news articles) and one encyclopedic (Wikipedia articles), roughly the same size.
    - Data preparation - sentence splitting, tokenization, lemmatization, POS-tagging of the text documents
    - Run experiments - using two approaches
        * First approach - look for noun -> verb -> noun patterns in sentences
        * Second approach - collect all nouns in the text documents. Create a basic predicate set. Loop over the noun and predicate sets to generate noun -> verb -> noun patterns
        



### RDF

The Resource Description Framework (RDF) is a framework for expressing information about resources. It is a standard model for data interchange on the Web recommended by the World Wide Web Consortium (W3C). RDF provides a common framework for expressing information on the Web so it can be exchanged between applications without loss of meaning. This means that the information may be made available to applications other than those for which it was originally created.

RDF is used for a series of practices, including: adding machine-readable information to Web pages, enriching a dataset by linking it to third-party datasets, interlinking API feeds, putting to work the datasets currently published as Linked Data, building distributed social networks, providing a standards-compliant way of exchanging data between databases, interlinking various datasets and enabling cross-dataset queries with the use of SPARQL.

RDF uses Internationalized Resource Identifiers (IRIs) to identify resources.


### RDF Triples

RDF allows us to make statements about resources. Because they consist of three elements, these statements are called triples. The format of a triple is simple and always has the following structure:

    <subject> <predicate> <object>

For example:

    <Bob> <is a> <person>.
    <Bob> <is a friend of> <Alice>.
    <Bob> <is born on> <the 4th of July 1990>.

An RDF statement expresses a relationship between two resources. The subject and the object represent the two resources being related; the predicate represents the nature of their relationship.
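
To make the three-part structure concrete, a triple can be held as a plain three-element record. The following is a minimal sketch in Python; the `Triple` type, the `to_ntriples_line` helper and the `http://example.org/` resource IRIs are illustrative assumptions, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples_line(triple: Triple) -> str:
    """Format a triple whose three parts are all IRIs as one N-Triples-style line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# The example.org IRIs are assumptions made up for this illustration.
BASE = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

statements = [
    Triple(BASE + "Bob", RDF_TYPE, FOAF + "Person"),
    Triple(BASE + "Bob", FOAF + "knows", BASE + "Alice"),
]

lines = [to_ntriples_line(t) for t in statements]
for line in lines:
    print(line)
```

The sketch only covers triples whose object is a resource; literal objects such as a birth date would need their own quoting and datatype handling.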

### The Turtle Language

The Terse RDF Triple Language, or simply Turtle, allows for a textual representation of an RDF graph. Turtle introduces a number of syntactic shortcuts, such as support for namespace prefixes, lists and shorthands for datatyped literals. This language provides a trade-off between ease of writing, ease of parsing and readability.

Turtle example:

    @base <http://example.org/> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rel: <http://www.perceive.net/schemas/relationship/> .

    <#green-goblin>
        rel:enemyOf <#spiderman> ;
        a foaf:Person ;    # in the context of the Marvel universe
        foaf:name "Green Goblin" .

    <#spiderman>
        rel:enemyOf <#green-goblin> ;
        a foaf:Person ;
        foaf:name "Spiderman", "Человек-паук"@ru .

This example introduces many of the features of the Turtle language: @base and relative IRI references, @prefix and prefixed names, predicate lists separated by ';', object lists separated by ',', the token a, and literals. Comments may be given after a '#'.
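
As a rough illustration of the predicate-list shorthand, the sketch below groups triples by subject and joins their predicate-object pairs with ';'. It is a hand-rolled, simplified serializer written for this example (the `to_turtle` function and its input format are assumptions). It does no escaping or validation, and a real project would more likely rely on an RDF library.

```python
from collections import defaultdict


def to_turtle(triples, prefixes):
    """Serialize (subject, predicate, object) strings to simplified Turtle text.

    Terms are assumed to be valid Turtle terms already, e.g. '<#spiderman>',
    'foaf:name' or '"Spiderman"'. No escaping or validation is done.
    """
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()]
    lines.append("")

    # Group predicate-object pairs by subject, keeping first-seen order.
    grouped = defaultdict(list)
    for subject, predicate, obj in triples:
        grouped[subject].append((predicate, obj))

    for subject, pairs in grouped.items():
        lines.append(subject)
        body = [f"    {predicate} {obj}" for predicate, obj in pairs]
        # Pairs that share a subject are separated by ';', the last ends with '.'
        lines.append(" ;\n".join(body) + " .")
        lines.append("")

    return "\n".join(lines)


turtle_text = to_turtle(
    [
        ("<#spiderman>", "a", "foaf:Person"),
        ("<#spiderman>", "foaf:name", '"Spiderman"'),
    ],
    {"foaf": "http://xmlns.com/foaf/0.1/"},
)
print(turtle_text)
```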

## Data

To install spaCy, run `conda install -c conda-forge spacy` and then `python -m spacy download en_core_web_sm` in the Anaconda terminal.

In [92]:

# import numpy as np
# import pandas as pd
import spacy
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk import download as nltk_down
import re

### Input

The data consists of three documents:
- fiction text - The Phoenix on the Sword by Robert E. Howard available via [Project Gutenberg of Australia](http://www.gutenberg.net.au/ebooks06/0600811h.html)
- media text - 14 articles from Reuters and DW
- encyclopedic text - around 30 Wikipedia articles from different domains: Biology, Geography, Physics, Chemistry

Let's make a function that loads our files.

In [19]:
def load_input_file(file):
    text = ''
    with open(file, 'r') as input_file:
        text = input_file.read()
    return text
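
One caveat with this loader: if a file was saved with a UTF-8 byte order mark (BOM) and is decoded as UTF-8, the mark survives as a leading `\ufeff` character and ends up glued to the first token. A possible variant, shown here only as a sketch (the name `load_input_file_no_bom` is hypothetical and it is not the loader used for the results in this notebook), opens the file with the `utf-8-sig` codec, which drops the mark when it is present.

```python
import os
import tempfile


def load_input_file_no_bom(path):
    """Read a text file as UTF-8, dropping a leading byte order mark if present."""
    with open(path, "r", encoding="utf-8-sig") as input_file:
        return input_file.read()


# Demonstration with a temporary file that starts with a UTF-8 BOM.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfTurtles are an order of reptiles.")
    tmp_path = tmp.name

with open(tmp_path, "r", encoding="utf-8") as raw_file:
    with_bom = raw_file.read()
without_bom = load_input_file_no_bom(tmp_path)
os.remove(tmp_path)

print(repr(with_bom[:8]))
print(repr(without_bom[:8]))
```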

After loading the text documents, we can measure their size by counting characters or by splitting on a regular expression to approximate a word count. The word counts should be similar across the three documents.

In [193]:
fiction_text = load_input_file('data/conan.txt')
fiction_words = re.split(r"\W+", fiction_text)

In [21]:
media_text = load_input_file('data/media.txt')
media_words = re.split(r"\W+", media_text)

In [22]:
wiki_text = load_input_file('data/wiki.txt')
wiki_words = re.split(r"\W+", wiki_text)

In [77]:
print(f"Character counts: fiction - {len(fiction_text)}, media - {len(media_text)}, wiki - {len(wiki_text)}")
print(f"RE word counts: fiction - {len(fiction_words)}, media - {len(media_words)}, wiki - {len(wiki_words)}")

Character counts: fiction - 50253, media - 57780, wiki - 63849
RE word counts: fiction - 9186, media - 9598, wiki - 10220
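
A note on the regular-expression count: `re.split` returns an empty string at either end when the text starts or ends with a non-word character, and it cuts contractions and hyphenated words into pieces. The toy sentence below (made up for this illustration) shows both effects, with `re.findall` as one alternative that avoids the empty strings.

```python
import re

# Made-up sample that starts and ends with non-word characters.
sample = '"Don\'t panic," said the well-known turtle.'

# Splitting on runs of non-word characters leaves '' at both ends here.
split_words = re.split(r"\W+", sample)

# Collecting runs of word characters never yields empty strings.
found_words = re.findall(r"\w+", sample)

print(split_words)
print(found_words)
print(len(split_words), len(found_words))
```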


Character counts are not very informative, and matching with regular expressions is prone to errors. For text processing it is better to count sentences and words. This can be done with the help of the NLTK and spaCy libraries, which perform sentence splitting and tokenization (taking a text or sentence and splitting it into individual units called tokens).

In [32]:
fiction_tokens = word_tokenize(fiction_text)
media_tokens = word_tokenize(media_text)
wiki_tokens = word_tokenize(wiki_text)

In [36]:
nltk_counts = f"NLTK word counts: fiction - {len(fiction_tokens)}, media - {len(media_tokens)}, wiki - {len(wiki_tokens)}"
nltk_counts

'NLTK word counts: fiction - 10585, media - 10728, wiki - 11586'

OK, there is a difference. Maybe an arbiter will help: what will spaCy count?
First load the model, then create the document and look at the tokens.

In [24]:
nlp = spacy.load("en_core_web_sm")

In [34]:
fiction_doc = nlp(fiction_text)
media_doc = nlp(media_text)
wiki_doc = nlp(wiki_text)

In [37]:
spacy_counts = f"spaCy word counts: fiction - {len(fiction_doc)}, media - {len(media_doc)}, wiki - {len(wiki_doc)}"
print(spacy_counts)
print(nltk_counts)

spaCy word counts: fiction - 11041, media - 11206, wiki - 11897
NLTK word counts: fiction - 10585, media - 10728, wiki - 11586


It seems that the NLTK and spaCy tokenizers produce different token counts. Let's look at the outputs:

In [62]:
# First 20 NLTK tokens (strings)
nltk_result = list(wiki_tokens[:20])

# First 20 spaCy tokens (Token objects)
spacy_result = list(wiki_doc[:20])


In [63]:
print(nltk_result)
print(spacy_result)

['\ufeffTurtles', 'are', 'an', 'order', 'of', 'reptiles', 'known', 'as', 'Testudines', ',', 'characterized', 'by', 'a', 'shell', 'developed', 'mainly', 'from', 'their', 'ribs', '.']
[﻿Turtles, are, an, order, of, reptiles, known, as, Testudines, ,, characterized, by, a, shell, developed, mainly, from, their, ribs, .]
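
To see where two tokenizers actually disagree, their outputs can be aligned and only the differing stretches printed. The sketch below uses `difflib.SequenceMatcher` from the standard library on two small hand-made token lists; the lists are invented for illustration and are not real NLTK or spaCy output.

```python
from difflib import SequenceMatcher


def token_differences(tokens_a, tokens_b):
    """Return (tag, a_slice, b_slice) for every stretch where the lists differ."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tag, tokens_a[a_start:a_end], tokens_b[b_start:b_end])
        for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hand-made token lists (assumed for the example, not actual tokenizer output).
tokens_a = ["a", "well-known", "turtle", "."]
tokens_b = ["a", "well", "-", "known", "turtle", "."]

differences = token_differences(tokens_a, tokens_b)
for difference in differences:
    print(difference)
```

With the notebook's data, the inputs would be `wiki_tokens` and `[token.text for token in wiki_doc]`, since the comparison needs plain strings on both sides.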


The first 20 tokens look the same in both outputs. Let's compare the number of sentences.

In [80]:
fiction_sents = sent_tokenize(fiction_text)
media_sents = sent_tokenize(media_text)
wiki_sents = sent_tokenize(wiki_text)

In [81]:
nltk_sentences = f"NLTK sentence counts: fiction - {len(fiction_sents)}, media - {len(media_sents)}, wiki - {len(wiki_sents)}"

In [82]:
fiction_spacy_sent = list(fiction_doc.sents)
media_spacy_sent = list(media_doc.sents)
wiki_spacy_sent = list(wiki_doc.sents)

In [83]:
spacy_sentences = f"spaCy sentence counts: fiction - {len(fiction_spacy_sent)}, media - {len(media_spacy_sent)}, wiki - {len(wiki_spacy_sent)}"

In [84]:
print(nltk_sentences)
print(spacy_sentences)

NLTK sentence counts: fiction - 565, media - 433, wiki - 486
spaCy sentence counts: fiction - 527, media - 434, wiki - 485


Close enough. It seems that the splitters disagree most in the fiction document.

### Preprocessing

For the purposes of the experiments the text needs to be separated into sentences and tokens. Then the tokens need to be lemmatized and receive part-of-speech tags.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item. Part-of-speech (POS) tagging is the assignment of part-of-speech tags to words.

The separation into sentences and tokens was already done while measuring the size of the documents. NLTK performs lemmatization and POS tagging via separate functions, while spaCy does both when the doc is created.

Let's begin with NLTK lemmatization.


In [95]:
nltk_down('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Ivo\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [98]:
lemmatizer = WordNetLemmatizer()

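def lematize_tokens(tokens):
    """Return the WordNet lemma of every token, in the original order."""
    lemmas = []

    for token in tokens:
        # lemmatize() uses pos='n' by default, so every token is treated as a noun
        lemmatized_word = lemmatizer.lemmatize(token)
        lemmas.append(lemmatized_word)

    return lemmas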

In [113]:
fiction_lemmas = lematize_tokens(fiction_tokens)
media_lemmas = lematize_tokens(media_tokens)
wiki_lemmas = lematize_tokens(wiki_tokens)

In [114]:
fiction_pos = pos_tag(fiction_lemmas)
media_pos = pos_tag(media_lemmas)
wiki_pos = pos_tag(wiki_lemmas)

In [126]:
wiki_pos[:20]

[('\ufeffTurtles', 'NNS'),
 ('are', 'VBP'),
 ('an', 'DT'),
 ('order', 'NN'),
 ('of', 'IN'),
 ('reptile', 'NN'),
 ('known', 'VBN'),
 ('a', 'DT'),
 ('Testudines', 'NNP'),
 (',', ','),
 ('characterized', 'VBN'),
 ('by', 'IN'),
 ('a', 'DT'),
 ('shell', 'NN'),
 ('developed', 'VBN'),
 ('mainly', 'RB'),
 ('from', 'IN'),
 ('their', 'PRP$'),
 ('rib', 'NN'),
 ('.', '.')]
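
The output above also shows a side effect of lemmatizing before tagging: `WordNetLemmatizer.lemmatize` treats every token as a noun unless told otherwise, so the preposition "as" became "a" and was then tagged as a determiner. A possible alternative, sketched below and not used for the results in this notebook, is to tag the original tokens first and pass a matching WordNet part of speech to the lemmatizer. The helper names are hypothetical, and the lemmatizer is passed in as a callable so that the snippet runs without NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    if tag.startswith("NN"):
        return "n"
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("JJ"):
        return "a"
    if tag.startswith("RB"):
        return "r"
    return None


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (token, tag) pairs, leaving tokens with unmapped tags unchanged.

    `lemmatize` is any callable taking (word, pos), for example
    `WordNetLemmatizer().lemmatize` from NLTK.
    """
    result = []
    for token, tag in tagged_tokens:
        wordnet_pos = penn_to_wordnet(tag)
        lemma = lemmatize(token, wordnet_pos) if wordnet_pos else token
        result.append((lemma, tag))
    return result


# Stand-in lemmatizer for the demonstration only: strips a plural "s" from nouns.
def toy_lemmatize(word, pos):
    return word[:-1] if pos == "n" and word.endswith("s") else word


tagged = [("known", "VBN"), ("as", "IN"), ("ribs", "NNS")]
lemmatized = lemmatize_tagged(tagged, toy_lemmatize)
print(lemmatized)
```

Leaving tokens with unmapped tags untouched is a design choice of this sketch: function words such as "as" keep their surface form instead of being forced through the noun rules.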

This should do it. Now to proceed with the triple extraction.

## Extracting Triples

### Approach 1

### Approach 2

In [179]:
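def nltk_extract_triples(pos_tokens):
    """Build 'subject > predicate > object' strings from (token, tag) pairs."""
    nltk_triples = []

    for idx, tup in enumerate(pos_tokens):
        # Treat a small fixed set of verb forms as candidate predicates
        if tup[0] in ['is', 'are', 'have', 'can']:
            predicate = tup[0]
            subj = ''
            obj = ''

            # Subject: nearest noun to the left (NN, NNS, NNP or NNPS)
            for i in range(idx, -1, -1):
                if 'NN' in pos_tokens[i][1]:
                    subj = pos_tokens[i][0]
                    break
            # Object: nearest noun to the right
            for j in range(idx, len(pos_tokens)):
                if 'NN' in pos_tokens[j][1]:
                    obj = pos_tokens[j][0]
                    break

            # subj or obj stays empty when no noun is found in that direction
            triple = f"{subj} > {predicate} > {obj}"
            nltk_triples.append(triple)

    return nltk_triples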

In [180]:
fiction_triples_nltk = nltk_extract_triples(fiction_pos)
media_triples_nltk = nltk_extract_triples(media_pos)
wiki_triples_nltk = nltk_extract_triples(wiki_pos)

In [194]:
print(len(wiki_triples_nltk))
wiki_triples_nltk[:20]

391


['\ufeffTurtles > are > order',
 'turtle > are > group',
 'head > are > living',
 'terrapin > are > continent',
 'shell > are > bone',
 'part > is > carapace',
 'underside > is > flatter',
 'surface > is > scale',
 'Turtles > are > temperature',
 'environment > are > omnivore',
 'turtle > are > reptile',
 'Turtles > have > myth',
 'specie > are > pet',
 'Turtles > have > meat',
 'turtle > are > bycatch',
 'world > are > result',
 'specie > are > extinction',
 'hedgehog > is > mammal',
 'Erinaceidae > are > specie',
 'introduction > are > Australia']
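
Several of the triples above pair a predicate with a noun that merely happens to be nearby, and nothing in the function stops the search from crossing into a neighbouring sentence. One possible refinement, offered as a sketch rather than as part of the experiments, is to stop each search at sentence-final punctuation and to drop incomplete triples. The names `find_noun` and `extract_triples_bounded` and the toy input are assumptions made for this example.

```python
SENTENCE_END_TAGS = {"."}  # Penn Treebank tag for sentence-final punctuation
PREDICATES = {"is", "are", "have", "can"}


def find_noun(pos_tokens, start, step):
    """Walk from `start` in direction `step` until a noun or a sentence end."""
    i = start
    while 0 <= i < len(pos_tokens):
        token, tag = pos_tokens[i]
        if tag in SENTENCE_END_TAGS:
            return None
        if tag.startswith("NN"):
            return token
        i += step
    return None


def extract_triples_bounded(pos_tokens):
    """Return (subject, predicate, object) tuples found within one sentence."""
    triples = []
    for idx, (token, _tag) in enumerate(pos_tokens):
        if token in PREDICATES:
            subj = find_noun(pos_tokens, idx - 1, -1)
            obj = find_noun(pos_tokens, idx + 1, 1)
            # Keep only triples where both a subject and an object were found
            if subj and obj:
                triples.append((subj, token, obj))
    return triples


# Toy tagged input, written by hand for the example.
toy_pos = [
    ("hedgehog", "NN"), ("is", "VBZ"), ("a", "DT"), ("mammal", "NN"), (".", "."),
    ("It", "PRP"), ("is", "VBZ"), ("small", "JJ"), (".", "."),
]
bounded_triples = extract_triples_bounded(toy_pos)
print(bounded_triples)
```

In the toy input the second "is" has no noun on either side within its own sentence, so no triple is produced for it, whereas an unbounded backward search would reach "mammal" in the previous sentence.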

In [188]:
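def spacy_extract_triples(document):
    """Build 'subject > predicate > object' strings from a spaCy Doc."""
    spacy_triples = []

    for token in document:
        # Match any inflected form of the candidate predicates via the lemma
        if token.lemma_ in ['be', 'have', 'can']:
            idx = token.i
            predicate = token
            subj = ''
            obj = ''

            # Subject: lemma of the nearest common noun to the left (PROPN is skipped)
            for i in range(idx, -1, -1):
                if document[i].pos_ == "NOUN":
                    subj = document[i].lemma_
                    break
            # Object: lemma of the nearest common noun to the right
            for j in range(idx, len(document)):
                if document[j].pos_ == "NOUN":
                    obj = document[j].lemma_
                    break

            # The f-string renders the predicate token as its surface text
            triple = f"{subj} > {predicate} > {obj}"
            spacy_triples.append(triple)

    return spacy_triples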

In [189]:
fiction_triples_spacy = spacy_extract_triples(fiction_doc)
media_triples_spacy = spacy_extract_triples(media_doc)
wiki_triples_spacy = spacy_extract_triples(wiki_doc)

In [195]:
print(len(wiki_triples_spacy))
wiki_triples_spacy[:20]

533


['\ufeffturtle > are > order',
 'turtle > are > group',
 'head > are > living',
 'terrapin > are > continent',
 'shell > are > bone',
 'part > is > carapace',
 'underside > is > plastron',
 'surface > is > scale',
 'turtle > are > ectotherm',
 'environment > are > omnivore',
 'turtle > are > reptile',
 'turtle > have > myth',
 'specie > are > pet',
 'turtle > have > meat',
 'turtle > been > meat',
 'turtle > are > bycatch',
 'world > are > result',
 'world > being > result',
 'specie > are > extinction',
 'hedgehog > is > mammal']

In [196]:
print(len(fiction_triples_nltk))
print(fiction_triples_nltk[:10])
print(len(fiction_triples_spacy))
print(fiction_triples_spacy[:10])


99
['dupe > have > street', 'desert > have > heart', 'night > have > agent', 'agent > have > face', 'face > have > empire', 'shadow > have > downfall', 'task > is > wit', 'task > are > wit', 'deed > is > people', 'reign > is > people']
333
['rise > was > age', 'world > was > supreme', 'countenance > was > door', 'hand > was > giant', 'dupe > have > street', 'desert > have > heart', 'desert > been > heart', 'night > have > noble', 'agent > have > face', 'face > have > empire']


In [197]:
print(len(media_triples_nltk))
print(media_triples_nltk[:10])
print(len(media_triples_spacy))
print(media_triples_spacy[:10])

199
['government > have > developer', 'move > are > completion', 'sale > are > future', 'sector > is > tightening', 'project > can > developer', 'project > is > developer', 'developer > have > crisis', 'sector > have > capital', 'efficiency > can > credit', 'developer > are > project']
440
['government > have > developer', 'company > be > move', 'move > are > completion', 'sale > are > future', 'sector > is > tightening', 'government > has > help', 'sector > has > year', 'sector > been > year', "demand > 's > dilemma", 'project > can > developer']
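
To compare the two extractors more systematically than by reading the first few items, the triple strings can be split back into their parts and tallied. The sketch below counts predicates and flags triples with an empty subject or object. The `summarize_triples` helper is hypothetical, and the sample list reuses a few triples from the media output above plus one made-up incomplete triple.

```python
from collections import Counter


def summarize_triples(triple_strings):
    """Count predicates and incomplete triples in 'subj > predicate > obj' strings."""
    predicate_counts = Counter()
    incomplete = 0
    for triple in triple_strings:
        subj, predicate, obj = (part.strip() for part in triple.split(">"))
        predicate_counts[predicate] += 1
        if not subj or not obj:
            incomplete += 1
    return predicate_counts, incomplete


sample_triples = [
    "government > have > developer",
    "move > are > completion",
    "sale > are > future",
    "sector > is > tightening",
    "sector > is > ",  # made-up example of a triple with no object
]
predicate_counts, incomplete = summarize_triples(sample_triples)
print(predicate_counts.most_common())
print(incomplete)
```

The split on ">" assumes that no token itself contains that character, which holds for the sample but is not guaranteed for arbitrary text.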


## Conclusion

## References

1. [RDF](https://www.w3.org/RDF/)
2. [RDF Primer](https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/)
3. [RDF triples](https://www.w3.org/TR/rdf12-n-triples/)
4. [Turtle format](https://www.w3.org/TR/rdf12-turtle/)
5. [Entity extraction: From unstructured text to DBpedia RDF triples](https://lucris.lub.lu.se/ws/portalfiles/portal/3053000/3191702.pdf)
6. [FactCheck: Validating RDF Triples Using Textual Evidence](https://svn.aksw.org/papers/2018/CIKM_FACTCHECK/public.pdf)
7. [Harvesting RDF triples](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=05620dd4b5346e5c17f3f4da97efcb18e2fcb7e6)
8. [NLTK](https://www.nltk.org/)
9. [Install spaCy](https://spacy.io/usage)
10. [NLP with spaCy](http://spacy.pythonhumanities.com/intro.html)
