<a href="https://colab.research.google.com/github/tomasonjo/blogs/blob/master/ie_pipeline/SpaCy_informationextraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install crosslingual-coreference spacyopentapioca spacy-transformers wikipedia
!pip install --upgrade google-cloud-storage
!pip install --upgrade transformers
!python -m spacy download en_core_web_sm


Collecting crosslingual-coreference
  Downloading crosslingual_coreference-0.2.1-py3-none-any.whl (11 kB)
Collecting spacyopentapioca
  Downloading spacyopentapioca-0.1.4-py3-none-any.whl (6.8 kB)
Collecting spacy-transformers
  Downloading spacy_transformers-1.1.5-py2.py3-none-any.whl (51 kB)
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Collecting torchaudio<0.11
  Downloading torchaudio-0.10.2-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)
Collecting allennlp<3.0,>=2.8
  Downloading allennlp-2.9.3-py3-none-any.whl (719 kB)
Collecting torchvision<0.12.0
  Downloading torchvision-0.11.3-cp37-cp37m-manylinux1_x86_64.whl (23.2 MB)
Collecting allennlp-models<3.0,>=2.8
  Downloading allennlp_models-2.9.3-py3-none

Collecting transformers
  Using cached transformers-4.18.0-py3-none-any.whl (4.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.17.0
    Uninstalling transformers-4.17.0:
      Successfully uninstalled transformers-4.17.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy-transformers 1.1.5 requires transformers<4.18.0,>=3.4.0, but you have transformers 4.18.0 which is incompatible.
Successfully installed transformers-4.18.0
Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing inst
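
The resolver message above reports that spacy-transformers 1.1.5 expects `transformers<4.18.0,>=3.4.0`, while the upgrade installed 4.18.0. A minimal sketch of a check that could be run before building the pipeline is shown below. The helper names `parse_version`, `satisfies` and `installed_transformers_ok` are hypothetical, and the version parsing is deliberately simplistic (it is not a replacement for the `packaging` library).

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '4.17.0' into (4, 17, 0); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    # Pad so that '4.18' compares equal to '4.18.0'
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def satisfies(version: str, minimum: str = "3.4.0", below: str = "4.18.0") -> bool:
    """True if minimum <= version < below (defaults: the range quoted in the pip log)."""
    return parse_version(minimum) <= parse_version(version) < parse_version(below)


def installed_transformers_ok() -> bool:
    """Check the installed transformers version; False if it is not installed."""
    try:
        return satisfies(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False
```

If the check fails, one option might be to skip the `--upgrade transformers` step or pin a version inside the reported range; whether that suits the other packages in this notebook is not verified here.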

In [1]:
import spacy
import crosslingual_coreference

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
# Add rebel component https://github.com/Babelscape/rebel/blob/main/spacy_component.py
from spacy import Language
from typing import List

from spacy.tokens import Doc, Span

import re

from transformers import pipeline

def extract_triplets(text):
    """
    Function to parse the generated text and extract the triplets
    """
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})

    return triplets


@Language.factory(
    "rebel",
    requires=["doc.sents"],
    assigns=["doc._.rel"],
    default_config={
        "model_name": "Babelscape/rebel-large",
        "device": 0,
    },
)
class RebelComponent:
    def __init__(
        self,
        nlp,
        name,
        model_name: str,
        device: int,
    ):
        assert model_name is not None, "model_name must be provided"
        self.triplet_extractor = pipeline("text2text-generation", model=model_name, tokenizer=model_name, device=device)
        # Register custom extension on the Doc
        if not Doc.has_extension("rel"):
            Doc.set_extension("rel", default={})

    def _generate_triplets(self, sent: Span) -> List[dict]:
        output_ids = self.triplet_extractor(sent.text, return_tensors=True, return_text=False)[0]["generated_token_ids"]["output_ids"]
        extracted_text = self.triplet_extractor.tokenizer.batch_decode(output_ids[0])
        extracted_triplets = extract_triplets(extracted_text[0])
        return extracted_triplets

    def set_annotations(self, doc: Doc, triplets: List[dict]):
        for triplet in triplets:

            # Remove self-loops (relationships that start and end at the same entity)
            if triplet['head'] == triplet['tail']:
                continue
            print({"relation": triplet["type"], "head_span": triplet['head'], "tail_span": triplet['tail']})
            # Match head and tail to entity spans in the doc
            try:
                head_span = [span for span in doc.ents if triplet['head'] in span.text][0]
                tail_span = [span for span in doc.ents if triplet['tail'] in span.text][0]
            except IndexError:
                continue  # Skip the triplet when the head or tail does not match any entity in doc.ents
            offset = (head_span.start, tail_span.start)
            if offset not in doc._.rel:
                doc._.rel[offset] = {"relation": triplet["type"], "head_span": head_span, "tail_span": tail_span}

    def __call__(self, doc: Doc) -> Doc:
        for sent in doc.sents:
            sentence_triplets = self._generate_triplets(sent)
            self.set_annotations(doc, sentence_triplets)
        return doc
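
The parser in the cell above reads a linearized string in which `<triplet>` starts a head entity, `<subj>` starts a tail entity, and `<obj>` starts the relation label. The sketch below is a compact, standalone re-statement of that state machine so it can be tried without loading the model. The function name `parse_linearized` and the sample string are hypothetical; the string is hand-written for illustration, not captured model output.

```python
def parse_linearized(text: str) -> list:
    """Parse '<triplet> head <subj> tail <obj> relation ...' into a list of dicts."""
    triplets = []
    subject, relation, object_ = "", "", ""
    current = None  # which field the next plain token is appended to

    def emit():
        triplets.append(
            {"head": subject.strip(), "type": relation.strip(), "tail": object_.strip()}
        )

    cleaned = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in cleaned.split():
        if token == "<triplet>":
            # A new head starts; flush the pending triplet, if any
            if relation:
                emit()
                relation = ""
            subject = ""
            current = "subject"
        elif token == "<subj>":
            # A new tail for the same head starts; flush the pending triplet
            if relation:
                emit()
            object_ = ""
            current = "object"
        elif token == "<obj>":
            relation = ""
            current = "relation"
        elif current == "subject":
            subject += " " + token
        elif current == "object":
            object_ += " " + token
        elif current == "relation":
            relation += " " + token
    # Flush the last triplet once the input is exhausted
    if subject and relation and object_:
        emit()
    return triplets


# Hand-written sample with one head and two (tail, relation) pairs
sample = (
    "<s><triplet> Christian Drosten <subj> Germany <obj> country of citizenship "
    "<subj> Google <obj> employer</s>"
)
parsed = parse_linearized(sample)
```

Note that a single `<triplet>` block can carry several `<subj> ... <obj> ...` pairs, which is why the head is reused until the next `<triplet>` marker appears.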

In [5]:
DEVICE = -1  # GPU index, or -1 to run on the CPU

# Start with the English model
coref = spacy.load('en_core_web_sm')

# Add coreference resolution
coref.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": DEVICE})


nlp = spacy.load('en_core_web_sm')

# Add opentapioca entity linking
nlp.add_pipe('opentapioca')

# Add Rebel relationship extraction
nlp.add_pipe("rebel", config={
    'device': DEVICE,  # GPU index, or -1 to run on the CPU
    'model_name': 'Babelscape/rebel-large'}  # Model to use; defaults to 'Babelscape/rebel-large' if omitted
    )

input_text = "Christian Drosten works in Germany. He likes to work for Google."

coref_text = coref(input_text)._.resolved_text

doc = nlp(coref_text)

for span in doc.ents:
    print((span.text, span.kb_id_, span.label_, span._.description, span._.score))

for value, rel_dict in doc._.rel.items():
    print(f"{value}: {rel_dict}")

{'relation': 'country of citizenship', 'head_span': 'Christian Drosten', 'tail_span': 'Germany'}
{'relation': 'employer', 'head_span': 'Christian Drosten', 'tail_span': 'Google'}
('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 1.8970209111714604)
('Germany', 'Q183', 'LOC', 'sovereign state in Central Europe', 2.0062482394392687)
('Christian Drosten', 'Q1079331', 'PERSON', 'German virologist and university teacher', 2.041460252110812)
('Google', 'Q95', 'ORG', 'American multinational Internet and technology corporation', 0.4212893030607042)
(0, 4): {'relation': 'country of citizenship', 'head_span': Christian Drosten, 'tail_span': Germany}
(0, 12): {'relation': 'employer', 'head_span': Christian Drosten, 'tail_span': Google}
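
Each key in `doc._.rel` above is the pair `(head_span.start, tail_span.start)`, and each value holds the relation label together with the two entity spans. Since the entities carry Wikidata ids in `kb_id_`, the relations can be flattened into id-based triples. The sketch below uses a stand-in `FakeSpan` class instead of real spaCy spans so that it runs without any model; `FakeSpan` and `to_kb_triples` are hypothetical names, and the values are copied from the printed output.

```python
from typing import NamedTuple


class FakeSpan(NamedTuple):
    """Stand-in for a spaCy Span, exposing only the attributes used here."""
    text: str
    kb_id_: str
    start: int


def to_kb_triples(rel: dict) -> list:
    """Flatten a doc._.rel-style dict into (head id, relation, tail id) tuples."""
    return [
        (entry["head_span"].kb_id_, entry["relation"], entry["tail_span"].kb_id_)
        for entry in rel.values()
    ]


# Values taken from the output printed above
drosten = FakeSpan("Christian Drosten", "Q1079331", 0)
germany = FakeSpan("Germany", "Q183", 4)
google = FakeSpan("Google", "Q95", 12)

rel = {
    (drosten.start, germany.start): {
        "relation": "country of citizenship", "head_span": drosten, "tail_span": germany,
    },
    (drosten.start, google.start): {
        "relation": "employer", "head_span": drosten, "tail_span": google,
    },
}

triples = to_kb_triples(rel)
```

With real pipeline output, the same function could be called as `to_kb_triples(doc._.rel)`, assuming every matched entity was linked and therefore has a non-empty `kb_id_`.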


In [5]:
import wikipedia

In [17]:
text = wikipedia.page("Albert Einstein").content

In [7]:
input_text = """Albert Einstein ( EYEN-styne; German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics. Relativity and quantum mechanics are together the two pillars of modern physics. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed "the world's most famous equation". His work is also known for its influence on the philosophy of science. He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory. His intellectual achievements and originality resulted in "Einstein" becoming synonymous with "genius".In 1905, a year sometimes described as his annus mirabilis ('miracle year'), Einstein published four groundbreaking papers. These outlined the theory of the photoelectric effect, explained Brownian motion, introduced special relativity, and demonstrated mass-energy equivalence. Einstein thought that the laws of classical mechanics could no longer be reconciled with those of the electromagnetic field, which led him to develop his special theory of relativity. He then extended the theory to gravitational fields; he published a paper on general relativity in 1916, introducing his theory of gravitation. In 1917, he applied the general theory of relativity to model the structure of the universe. He continued to deal with problems of statistical mechanics and quantum theory, which led to his explanations of particle theory and the motion of molecules. 
He also investigated the thermal properties of light and the quantum theory of radiation, which laid the foundation of the photon theory of light.
However, for much of the later part of his career, he worked on two ultimately unsuccessful endeavors. First, despite his great contributions to quantum mechanics, he opposed what it evolved into, objecting that nature "does not play dice". Second, he attempted to devise a unified field theory by generalizing his geometric theory of gravitation to include electromagnetism. As a result, he became increasingly isolated from the mainstream of modern physics.
Einstein was born in the German Empire, but moved to Switzerland in 1895, forsaking his German citizenship (as a subject of the Kingdom of Württemberg) the following year. In 1897, at the age of 17, he enrolled in the mathematics and physics teaching diploma program at the Swiss Federal polytechnic school in Zürich, graduating in 1900. In 1901, he acquired Swiss citizenship, which he kept for the rest of his life, and in 1903 he secured a permanent position at the Swiss Patent Office in Bern. In 1905, he was awarded a PhD by the University of Zurich. In 1914, Einstein moved to Berlin in order to join the Prussian Academy of Sciences and the Humboldt University of Berlin. In 1917, Einstein became director of the Kaiser Wilhelm Institute for Physics; he also became a German citizen again, this time Prussian.
In 1933, while Einstein was visiting the United States, Adolf Hitler came to power in Germany. Einstein, of Jewish origin, objected to the policies of the newly elected Nazi government; he settled in the United States and became an American citizen in 1940. On the eve of World War II, he endorsed a letter to President Franklin D. Roosevelt alerting him to the potential German nuclear weapons program and recommending that the US begin similar research. Einstein supported the Allies but generally denounced the idea of nuclear weapons."""

coref_text = coref(input_text)._.resolved_text

doc = nlp(coref_text)

for span in doc.ents:
    print((span.text, span.kb_id_, span.label_, span._.description, span._.score))

for value, rel_dict in doc._.rel.items():
    print(f"{value}: {rel_dict}")

  num_effective_segments = (seq_lengths + self._max_length - 1) // self._max_length


{'relation': 'date of birth', 'head_span': 'Albert Einstein', 'tail_span': '14 March 1879'}
{'relation': 'date of death', 'head_span': 'Albert Einstein', 'tail_span': '18 April 1955'}
{'relation': 'notable work', 'head_span': 'Albert Einstein', 'tail_span': 'theory of relativity'}
{'relation': 'notable work', 'head_span': 'Albert Einstein', 'tail_span': 'quantum mechanics'}
{'relation': 'discoverer or inventor', 'head_span': 'theory of relativity', 'tail_span': 'Albert Einstein'}
{'relation': 'discoverer or inventor', 'head_span': 'theory of relativity', 'tail_span': 'Albert Einstein'}
{'relation': 'notable work', 'head_span': 'Albert Einstein', 'tail_span': 'theory of relativity'}
{'relation': 'notable work', 'head_span': 'Albert Einstein', 'tail_span': 'quantum mechanics'}
{'relation': 'discoverer or inventor', 'head_span': 'quantum mechanics', 'tail_span': 'Albert Einstein'}
{'relation': 'discoverer or inventor', 'head_span': 'quantum mechanics', 'tail_span': 'Albert Einstein'}
{'re