In [1]:
'''how to implement a pipeline for extracting a Knowledge Base from texts or online articles'''

#https://www.nlplanet.org/course-practical-nlp/02-practical-nlp-first-tasks/16-knowledge-graph-from-text



In [2]:
!pip install transformers wikipedia newspaper3k GoogleNews pyvis




In [3]:
#Step 1: import libraries
'''Each library is needed for the following reason:
    transformers: Load the REBEL model.
    wikipedia: Validate extracted entities by checking whether they have a corresponding Wikipedia page.
    newspaper: Parse articles from URLs.
    GoogleNews: Read the latest Google News articles about a topic.
    pyvis: Graph visualization.
'''

In [4]:
# needed to load the REBEL model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import math
import torch

# wrapper for wikipedia API
import wikipedia

# scraping of web articles
from newspaper import Article, ArticleException

# google news scraping
from GoogleNews import GoogleNews

# graph visualization
from pyvis.network import Network

# show HTML in notebook
import IPython


In [5]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

In [6]:
#Step 2: from Short Text to KB
'''
Step 2 is to write a function that parses the strings generated by REBEL and transforms them into relation triplets (e.g. the
<Fabio, lives in, Italy> triplet). This function must take into account the additional tokens (i.e. the <triplet>, <subj>, and <obj> tokens)
used while training the model.
The REBEL model card provides a complete code example for this function.
'''

In [7]:
# from https://huggingface.co/Babelscape/rebel-large
def extract_relations_from_model_output(text):
    relations = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    text_replaced = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in text_replaced.split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        relations.append({
            'head': subject.strip(),
            'type': relation.strip(),
            'tail': object_.strip()
        })
    return relations
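
To make the linearized format concrete, the sketch below parses a hand-written string laid out as `<triplet> head <subj> tail <obj> relation`. It is a simplified stand-in, not the model card function: the name parse_linearized_triplets and the sample string are made up for illustration, and no model is called.

```python
import re

def parse_linearized_triplets(decoded):
    """Simplified, hypothetical parser for REBEL-style linearized output."""
    # Drop the sequence-level special tokens, keep the triplet markers.
    cleaned = re.sub(r"<s>|</s>|<pad>", "", decoded)
    triplets = []
    # Every "<triplet>" chunk holds one head followed by one or more
    # "<subj> tail <obj> relation" groups that share that head.
    for chunk in cleaned.split("<triplet>")[1:]:
        head, *groups = chunk.split("<subj>")
        for group in groups:
            if "<obj>" not in group:
                continue  # incomplete group, skip it
            tail, relation = group.split("<obj>", 1)
            triplets.append({
                "head": head.strip(),
                "type": relation.strip(),
                "tail": tail.strip(),
            })
    return triplets

# Hand-written sample (an assumption, not real model output).
sample = ("<s><triplet> Fabio <subj> Italy <obj> lives in "
          "<subj> Rome <obj> born in "
          "<triplet> Rome <subj> Italy <obj> country</s>")

triplets = parse_linearized_triplets(sample)
for t in triplets:
    print(t)
```

Note that one head can carry several (tail, relation) pairs, which is why a single `<triplet>` chunk may yield more than one dictionary.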

In [8]:
'''
The above function outputs a list of relations, where each relation is represented as a dictionary with the following keys:
    head: The subject of the relation (e.g. “Fabio”).
    type: The relation type (e.g. “lives in”).
    tail: The object of the relation (e.g. “Italy”).
'''


In [9]:
'''
Next, let’s write the code for implementing a knowledge base class. Our KB class is made of a list of relations and has several methods to 
deal with adding new relations to the knowledge base or printing them. 
'''


In [10]:
# knowledge base class
class KB():
    def __init__(self):
        self.relations = []

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def add_relation(self, r):
        if not self.exists_relation(r):
            self.relations.append(r)

    def print(self):
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")

In [11]:
# We define a from_small_text_to_kb function that returns a KB object with relations extracted from a short text.
'''
The function performs the following steps:
    1. Initialize an empty knowledge base (KB) object.
    2. Tokenize the input text.
    3. Use REBEL to generate relations from the text.
    4. Parse the REBEL output and store the relation triplets in the knowledge base object.
    5. Return the knowledge base object.
'''

In [12]:
# build a knowledge base from text
def from_small_text_to_kb(text, verbose=False):
    kb = KB()

    # Tokenize text
    model_inputs = tokenizer(text, max_length=512, padding=True, truncation=True,
                            return_tensors='pt')
    if verbose:
        print(f"Num tokens: {len(model_inputs['input_ids'][0])}")

    # Generate
    gen_kwargs = {
        "max_length": 216,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": 3
    }
    generated_tokens = model.generate(
        **model_inputs,
        **gen_kwargs,
    )
    decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

    # create kb
    for sentence_pred in decoded_preds:
        relations = extract_relations_from_model_output(sentence_pred)
        for r in relations:
            kb.add_relation(r)

    return kb

In [13]:
# test the `from_small_text_to_kb` function

text = "Napoleon Bonaparte (born Napoleone di Buonaparte; 15 August 1769 – 5 " \
"May 1821), and later known by his regnal name Napoleon I, was a French military " \
"and political leader who rose to prominence during the French Revolution and led " \
"several successful campaigns during the Revolutionary Wars. He was the de facto " \
"leader of the French Republic as First Consul from 1799 to 1804. As Napoleon I, " \
"he was Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's " \
"political and cultural legacy has endured, and he has been one of the most " \
"celebrated and controversial leaders in world history."

kb = from_small_text_to_kb(text, verbose=True)
kb.print()

Num tokens: 133
Relations:
  {'head': 'Napoleon Bonaparte', 'type': 'date of birth', 'tail': '15 August 1769'}
  {'head': 'Napoleon Bonaparte', 'type': 'date of death', 'tail': '5 May 1821'}
  {'head': 'Napoleon Bonaparte', 'type': 'participant in', 'tail': 'French Revolution'}
  {'head': 'Napoleon Bonaparte', 'type': 'conflict', 'tail': 'Revolutionary Wars'}
  {'head': 'Revolutionary Wars', 'type': 'part of', 'tail': 'French Revolution'}
  {'head': 'French Revolution', 'type': 'participant', 'tail': 'Napoleon Bonaparte'}
  {'head': 'Revolutionary Wars', 'type': 'participant', 'tail': 'Napoleon Bonaparte'}


In [14]:
#Step 3: From Long Text to KB
'''
Transformer models like REBEL have memory requirements that grow quadratically with the length of the input. In practice, REBEL runs on
common hardware at a reasonable speed with inputs of about 512 tokens, which corresponds to about 380 English words. However, we may
need to extract relations from documents that are several thousand words long.

Moreover, in my experiments the model seems to work better with shorter inputs. Intuitively, relations in raw text are often
expressed within a single sentence or a few contiguous sentences, so it may not be necessary to consider many sentences at once
to extract a specific relation. Additionally, extracting a few relations is a simpler task than extracting many.
'''
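
As a rough illustration of the quadratic growth, the sketch below counts the entries of a single self-attention score matrix, which has one entry per pair of tokens. It ignores span overlap, the number of heads and layers, and every cost that grows linearly, so treat the ratio as an order-of-magnitude hint only. The function name attention_entries is introduced here for illustration.

```python
def attention_entries(num_tokens):
    """Entries in one self-attention score matrix (one per token pair)."""
    return num_tokens * num_tokens

# One 512-token input versus four non-overlapping 128-token spans
# that cover the same 512 tokens.
full = attention_entries(512)
chunked = 4 * attention_entries(128)

print(full, chunked, full / chunked)  # 262144 65536 4.0
```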

In [15]:
'''
So, how do we put all this together?

For example, we can divide a 1000-token input text into eight shorter overlapping spans of 128 tokens each and extract relations from
each span. While doing so, we also attach metadata to the extracted relations that records their span boundaries. With this information, we can
see which span of the text a specific relation in our knowledge base was extracted from.
'''
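
The 1000-token example can be worked through numerically. The sketch below follows the same arithmetic as the from_text_to_kb function in this notebook (equal-length spans with an evenly distributed overlap); compute_span_boundaries is a name introduced here for illustration.

```python
import math

def compute_span_boundaries(num_tokens, span_length=128):
    """Return [start, end) token boundaries of equal-length overlapping spans."""
    num_spans = math.ceil(num_tokens / span_length)
    # Spread the surplus (num_spans * span_length - num_tokens) over the gaps
    # between consecutive spans, rounding the overlap up.
    overlap = math.ceil((num_spans * span_length - num_tokens) /
                        max(num_spans - 1, 1))
    boundaries = []
    start = 0
    for i in range(num_spans):
        boundaries.append([start + span_length * i,
                           start + span_length * (i + 1)])
        start -= overlap  # shift every following span back by the overlap
    return boundaries

boundaries = compute_span_boundaries(1000)
print(len(boundaries))                  # 8
print(boundaries[:2], boundaries[-1])   # [[0, 128], [124, 252]] [868, 996]
```

Because the overlap is rounded up, the last span here ends at token 996, so the final few tokens of a 1000-token input fall outside every span. When the surplus divides evenly, as with 782 tokens and 7 spans, the last span ends exactly at the last token.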

In [16]:
'''Let’s modify the KB methods so that span boundaries are saved as well. The relation dictionary now has the keys:
    head : The subject of the relation (e.g. “Fabio”).
    type : The relation type (e.g. “lives in”).
    tail : The object of the relation (e.g. “Italy”).
    meta : A dictionary containing meta information about the relation. This dictionary has a spans key, whose value is the list of span
    boundaries (e.g. [[0, 128], [119, 247]]) where the relation has been found.
'''

In [17]:
# add `merge_relations` to KB class
class KB():
    def __init__(self):
        self.relations = []

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def print(self):
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")
            
    def merge_relations(self, r1):
        r2 = [r for r in self.relations
              if self.are_relations_equal(r1, r)][0]
        spans_to_add = [span for span in r1["meta"]["spans"]
                        if span not in r2["meta"]["spans"]]
        r2["meta"]["spans"] += spans_to_add

    def add_relation(self, r):
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)
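
The merge behaviour can be seen on two plain dictionaries, without the class. This is a minimal sketch of the same idea; merge_spans and the sample relations are names and data made up for illustration.

```python
def merge_spans(existing_relation, new_relation):
    """Record on existing_relation any span of new_relation it does not have yet."""
    known = existing_relation["meta"]["spans"]
    for span in new_relation["meta"]["spans"]:
        if span not in known:
            known.append(span)
    return existing_relation

# The same triplet found in two overlapping spans of the text.
first = {"head": "Fabio", "type": "lives in", "tail": "Italy",
         "meta": {"spans": [[0, 128]]}}
again = {"head": "Fabio", "type": "lives in", "tail": "Italy",
         "meta": {"spans": [[119, 247]]}}

merged = merge_spans(first, again)
merge_spans(merged, again)  # merging the same span again changes nothing
print(merged["meta"]["spans"])  # [[0, 128], [119, 247]]
```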

In [18]:
'''
Next, we write the from_text_to_kb function, which is similar to the from_small_text_to_kb function but can handle longer texts
by splitting them into spans. All the new code deals with the spanning logic and with attaching span boundaries to the relations.
'''

In [19]:
# extract relations for each span and put them together in a knowledge base
def from_text_to_kb(text, span_length=128, verbose=False):
    # tokenize whole text
    inputs = tokenizer([text], return_tensors="pt")

    # compute span boundaries
    num_tokens = len(inputs["input_ids"][0])
    if verbose:
        print(f"Input has {num_tokens} tokens")
    num_spans = math.ceil(num_tokens / span_length)
    if verbose:
        print(f"Input has {num_spans} spans")
    overlap = math.ceil((num_spans * span_length - num_tokens) / 
                        max(num_spans - 1, 1))
    spans_boundaries = []
    start = 0
    for i in range(num_spans):
        spans_boundaries.append([start + span_length * i,
                                 start + span_length * (i + 1)])
        start -= overlap
    if verbose:
        print(f"Span boundaries are {spans_boundaries}")

    # transform input with spans
    tensor_ids = [inputs["input_ids"][0][boundary[0]:boundary[1]]
                  for boundary in spans_boundaries]
    tensor_masks = [inputs["attention_mask"][0][boundary[0]:boundary[1]]
                    for boundary in spans_boundaries]
    inputs = {
        "input_ids": torch.stack(tensor_ids),
        "attention_mask": torch.stack(tensor_masks)
    }

    # generate relations
    num_return_sequences = 3
    gen_kwargs = {
        "max_length": 256,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": num_return_sequences
    }
    generated_tokens = model.generate(
        **inputs,
        **gen_kwargs,
    )

    # decode relations
    decoded_preds = tokenizer.batch_decode(generated_tokens,skip_special_tokens=False)

    # create kb
    kb = KB()
    i = 0
    for sentence_pred in decoded_preds:
        current_span_index = i // num_return_sequences
        relations = extract_relations_from_model_output(sentence_pred)
        for relation in relations:
            relation["meta"] = {
                "spans": [spans_boundaries[current_span_index]]
            }
            kb.add_relation(relation)
        i += 1

    return kb
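
The index arithmetic `i // num_return_sequences` is easy to check on its own. The sketch below assumes, as the function above does, that the generated sequences come back grouped by input row, with num_return_sequences consecutive rows per span. The boundary values are sample data.

```python
num_return_sequences = 3
spans_boundaries = [[0, 128], [109, 237], [218, 346]]  # sample boundaries

# One batch row per span, num_return_sequences generated rows per batch row.
num_outputs = len(spans_boundaries) * num_return_sequences

# Integer division maps each flat output index back to the span it came from.
span_for_output = [spans_boundaries[i // num_return_sequences]
                   for i in range(num_outputs)]

for i, span in enumerate(span_for_output):
    print(i, span)
```

Outputs 0 to 2 map to the first span, 3 to 5 to the second, and 6 to 8 to the third.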
    

In [20]:
# Try it with a longer text about Napoleon (782 tokens in the run below). We split the text into spans of 128 tokens.

In [21]:
text = """
Napoleon Bonaparte (born Napoleone di Buonaparte; 15 August 1769 – 5 May 1821), and later known by his regnal name Napoleon I, was a French 
military and political leader who rose to prominence during the French Revolution and led several successful campaigns during the 
Revolutionary Wars. He was the de facto leader of the French Republic as First Consul from 1799 to 1804. As Napoleon I, he was Emperor of 
the French from 1804 until 1814 and again in 1815. Napoleon's political and cultural legacy has endured, and he has been one of the most 
celebrated and controversial leaders in world history. Napoleon was born on the island of Corsica not long after its annexation by the 
Kingdom of France.[5] He supported the French Revolution in 1789 while serving in the French army, and tried to spread its ideals to his 
native Corsica. He rose rapidly in the Army after he saved the governing French Directory by firing on royalist insurgents. In 1796, 
he began a military campaign against the Austrians and their Italian allies, scoring decisive victories and becoming a national hero. 
Two years later, he led a military expedition to Egypt that served as a springboard to political power. He engineered a coup in November 1799
and became First Consul of the Republic. Differences with the British meant that the French faced the War of the Third Coalition by 1805. 
Napoleon shattered this coalition with victories in the Ulm Campaign, and at the Battle of Austerlitz, which led to the dissolving of the 
Holy Roman Empire. In 1806, the Fourth Coalition took up arms against him because Prussia became worried about growing French influence on 
the continent. Napoleon knocked out Prussia at the battles of Jena and Auerstedt, marched the Grande Armée into Eastern Europe, annihilating 
the Russians in June 1807 at Friedland, and forcing the defeated nations of the Fourth Coalition to accept the Treaties of Tilsit. Two years 
later, the Austrians challenged the French again during the War of the Fifth Coalition, but Napoleon solidified his grip over Europe after 
triumphing at the Battle of Wagram. Hoping to extend the Continental System, his embargo against Britain, Napoleon invaded the Iberian 
Peninsula and declared his brother Joseph King of Spain in 1808. The Spanish and the Portuguese revolted in the Peninsular War, culminating 
in defeat for Napoleon's marshals. Napoleon launched an invasion of Russia in the summer of 1812. The resulting campaign witnessed the 
catastrophic retreat of Napoleon's Grande Armée. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France. 
A chaotic military campaign resulted in a large coalition army defeating Napoleon at the Battle of Leipzig in October 1813. The coalition 
invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba, between Corsica and Italy.
In France, the Bourbons were restored to power. However, Napoleon escaped Elba in February 1815 and took control of France.[6][7] The Allies
responded by forming a Seventh Coalition, which defeated Napoleon at the Battle of Waterloo in June 1815. The British exiled him to the 
remote island of Saint Helena in the Atlantic, where he died in 1821 at the age of 51. Napoleon had an extensive impact on the modern world, 
bringing liberal reforms to the many countries he conquered, especially the Low Countries, Switzerland, and parts of modern Italy and Germany.
He implemented liberal policies in France and Western Europe.
"""

kb = from_text_to_kb(text, verbose=True)
kb.print()


Input has 782 tokens
Input has 7 spans
Span boundaries are [[0, 128], [109, 237], [218, 346], [327, 455], [436, 564], [545, 673], [654, 782]]
Relations:
  {'head': 'Napoleon Bonaparte', 'type': 'date of birth', 'tail': '15 August 1769', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Napoleon Bonaparte', 'type': 'date of death', 'tail': '5 May 1821', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Napoleon Bonaparte', 'type': 'participant in', 'tail': 'French Revolution', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Napoleon Bonaparte', 'type': 'conflict', 'tail': 'Revolutionary Wars', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Revolutionary Wars', 'type': 'part of', 'tail': 'French Revolution', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Revolutionary Wars', 'type': 'participant', 'tail': 'Napoleon Bonaparte', 'meta': {'spans': [[0, 128]]}}
  {'head': 'Napoleon', 'type': 'participant in', 'tail': 'French Revolution', 'meta': {'spans': [[109, 237]]}}
  {'head': 'French Directory', 'type': 'replaces

In [22]:
#Step 4: filter and normalize entities with Wikipedia
'''
Suppress redundant relations.
One way to do this is to use the wikipedia library to check whether “Napoleon Bonaparte” and “Napoleon” resolve to the same Wikipedia page. If so, they
are normalized to the title of that page. If an extracted entity doesn’t have a corresponding Wikipedia page, we ignore it for
now. This step is commonly called Entity Linking.
'''

In [23]:
'''
We modify our KB code as follows:
    The KB now stores an entities dictionary with the entities of the stored relations. The keys are the entity identifiers (i.e. the title
    of the corresponding Wikipedia page), and each value is a dictionary containing the Wikipedia page URL and its summary.
    When adding a new relation, we now check its entities with the wikipedia library.
'''
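
The normalization idea can be tried without network access by swapping the Wikipedia lookup for a plain dictionary. In this minimal sketch, normalize_relation and FAKE_TITLES are hypothetical names, and the title mapping is made up to stand in for a real lookup.

```python
def normalize_relation(relation, resolve_title):
    """Rename head and tail to canonical titles, or return None if either is unknown."""
    head = resolve_title(relation["head"])
    tail = resolve_title(relation["tail"])
    if head is None or tail is None:
        return None  # an entity could not be resolved: drop the relation
    return {**relation, "head": head, "tail": tail}

# Stand-in for a Wikipedia lookup (assumed mapping, for illustration only).
FAKE_TITLES = {
    "Napoleon Bonaparte": "Napoleon",
    "Napoleon": "Napoleon",
    "French Revolution": "French Revolution",
}

r1 = {"head": "Napoleon Bonaparte", "type": "participant in", "tail": "French Revolution"}
r2 = {"head": "Napoleon", "type": "participant in", "tail": "French Revolution"}
r3 = {"head": "Napoleon", "type": "participant in", "tail": "an unknown entity"}

n1 = normalize_relation(r1, FAKE_TITLES.get)
n2 = normalize_relation(r2, FAKE_TITLES.get)
n3 = normalize_relation(r3, FAKE_TITLES.get)

print(n1 == n2)  # True: both surface forms collapse to the same relation
print(n3)        # None: the tail has no known page
```

Once both surface forms share a title, the existing equality check on head, type and tail treats them as one relation.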

In [24]:
# filter and normalize entities before adding them to the KB
class KB():
    def __init__(self):
        self.entities = {}
        self.relations = []

    # helper methods carried over unchanged from the previous version of the class
    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r1):
        r2 = [r for r in self.relations
              if self.are_relations_equal(r1, r)][0]
        spans_to_add = [span for span in r1["meta"]["spans"]
                        if span not in r2["meta"]["spans"]]
        r2["meta"]["spans"] += spans_to_add

    def get_wikipedia_data(self, candidate_entity):
        try:
            page = wikipedia.page(candidate_entity, auto_suggest=False)
            entity_data = {
                "title": page.title,
                "url": page.url,
                "summary": page.summary
            }
            return entity_data
        except Exception:
            # any lookup failure (missing page, ambiguous title, network error)
            # is treated as "no Wikipedia page"
            return None

    def add_entity(self, e):
        self.entities[e["title"]] = {k:v for k,v in e.items() if k != "title"}

    def add_relation(self, r):
        # check on wikipedia
        candidate_entities = [r["head"], r["tail"]]
        entities = [self.get_wikipedia_data(ent) for ent in candidate_entities]

        # if one entity does not exist, stop
        if any(ent is None for ent in entities):
            return

        # manage new entities
        for e in entities:
            self.add_entity(e)

        # rename relation entities with their wikipedia titles
        r["head"] = entities[0]["title"]
        r["tail"] = entities[1]["title"]

        # manage new relation
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self):
        print("Entities:")
        for e in self.entities.items():
            print(f"  {e}")
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")

In [25]:
text = """
Napoleon Bonaparte (born Napoleone di Buonaparte; 15 August 1769 – 5 May 1821), and later known by his regnal name Napoleon I, was a French military and political leader who rose to prominence during the French Revolution and led several successful campaigns during the Revolutionary Wars. He was the de facto leader of the French Republic as First Consul from 1799 to 1804. As Napoleon I, he was Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's political and cultural legacy has endured, and he has been one of the most celebrated and controversial leaders in world history. Napoleon was born on the island of Corsica not long after its annexation by the Kingdom of France.[5] He supported the French Revolution in 1789 while serving in the French army, and tried to spread its ideals to his native Corsica. He rose rapidly in the Army after he saved the governing French Directory by firing on royalist insurgents. In 1796, he began a military campaign against the Austrians and their Italian allies, scoring decisive victories and becoming a national hero. Two years later, he led a military expedition to Egypt that served as a springboard to political power. He engineered a coup in November 1799 and became First Consul of the Republic. Differences with the British meant that the French faced the War of the Third Coalition by 1805. Napoleon shattered this coalition with victories in the Ulm Campaign, and at the Battle of Austerlitz, which led to the dissolving of the Holy Roman Empire. In 1806, the Fourth Coalition took up arms against him because Prussia became worried about growing French influence on the continent. Napoleon knocked out Prussia at the battles of Jena and Auerstedt, marched the Grande Armée into Eastern Europe, annihilating the Russians in June 1807 at Friedland, and forcing the defeated nations of the Fourth Coalition to accept the Treaties of Tilsit. 
Two years later, the Austrians challenged the French again during the War of the Fifth Coalition, but Napoleon solidified his grip over Europe after triumphing at the Battle of Wagram. Hoping to extend the Continental System, his embargo against Britain, Napoleon invaded the Iberian Peninsula and declared his brother Joseph King of Spain in 1808. The Spanish and the Portuguese revolted in the Peninsular War, culminating in defeat for Napoleon's marshals. Napoleon launched an invasion of Russia in the summer of 1812. The resulting campaign witnessed the catastrophic retreat of Napoleon's Grande Armée. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France. A chaotic military campaign resulted in a large coalition army defeating Napoleon at the Battle of Leipzig in October 1813. The coalition invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba, between Corsica and Italy. In France, the Bourbons were restored to power. However, Napoleon escaped Elba in February 1815 and took control of France.[6][7] The Allies responded by forming a Seventh Coalition, which defeated Napoleon at the Battle of Waterloo in June 1815. The British exiled him to the remote island of Saint Helena in the Atlantic, where he died in 1821 at the age of 51. Napoleon had an extensive impact on the modern world, bringing liberal reforms to the many countries he conquered, especially the Low Countries, Switzerland, and parts of modern Italy and Germany. He implemented liberal policies in France and Western Europe.
"""

kb = from_text_to_kb(text)
kb.print()