In [17]:
import pandas as pd
import random
import sys
from transformers import AutoTokenizer, pipeline
from collections import Counter

articles = pd.read_json('../scraping/articles/all_articles.json')
text = articles['text'][666][1:8]
print('\n'.join(text))

The US Department of Energy, he said, is studying ways to bail out nuclear (and coal) facilities, including potentially by mandating grid operators to purchase power from them.
Supporters say it’s a sensible move because the nation’s power grid cannot rely solely on natural gas, wind, and solar. Opponents of the nuclear industry say nuclear power that was once advertised as being “too cheap to meter” has become too costly for electric utilities to buy. 
The US has the largest number of nuclear plants in the world – 99 in commercial operation providing 20% of its electricity generation – but its industry leadership is declining as efforts to build a new generation of reactors have been plagued by problems, and ageing plants have been retired or closed in the face of economic, market, and financial pressures.
The situation, exacerbated by robust competition in the new-build sector from China and Russia, has seen the nuclear industry and its supporters call on the government to enact legi

In [18]:
# For entity extraction
model = "dslim/distilbert-NER"
tokenizer = AutoTokenizer.from_pretrained(model)
ner_pipeline = pipeline('ner', model=model, tokenizer=tokenizer)

# NOTE: the pipeline runs once per paragraph, so each entity's start and end
# indices are relative to its own paragraph, not to the full article text
import importlib
import _NER
importlib.reload(_NER)
from _NER import merge_result

entities = []
for paragraph in text:
    print(paragraph)

    # Entities
    entity_result = ner_pipeline(paragraph)
    merged_result = merge_result(entity_result, model)
    for eR in merged_result:
        print(eR)
        entities.append(eR)

The US Department of Energy, he said, is studying ways to bail out nuclear (and coal) facilities, including potentially by mandating grid operators to purchase power from them.
{'entity': 'B-ORG', 'score': 0.99445456, 'index': 2, 'word': 'US', 'start': 4, 'end': 6}
{'entity': 'I-ORG', 'score': 0.9905466, 'index': 3, 'word': 'Department', 'start': 7, 'end': 17}
{'entity': 'I-ORG', 'score': 0.9945914, 'index': 4, 'word': 'of', 'start': 18, 'end': 20}
{'entity': 'I-ORG', 'score': 0.9955772, 'index': 5, 'word': 'Energy', 'start': 21, 'end': 27}
Supporters say it’s a sensible move because the nation’s power grid cannot rely solely on natural gas, wind, and solar. Opponents of the nuclear industry say nuclear power that was once advertised as being “too cheap to meter” has become too costly for electric utilities to buy. 
The US has the largest number of nuclear plants in the world – 99 in commercial operation providing 20% of its electricity generation – but its industry leadership is decli
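
The printed results above are still one dict per token (B-ORG, I-ORG, ...), with offsets local to each paragraph. A minimal sketch of one way to group them into entity spans and shift the offsets to article level follows. `group_entities` and `paragraph_offsets` are hypothetical helpers, not part of `_NER`; the first paragraph is a placeholder, and the sketch assumes paragraphs are joined with a single newline. The `transformers` NER pipeline also accepts an `aggregation_strategy` argument, which may make the grouping step unnecessary.

```python
def group_entities(tokens, paragraph, offset=0):
    """Merge B-/I- tagged tokens into spans and shift offsets by `offset`."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        # An I- tag with the same label continues the previous span.
        if spans and prefix == "I" and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return [
        {
            "label": s["label"],
            "text": paragraph[s["start"]:s["end"]],
            "start": s["start"] + offset,
            "end": s["end"] + offset,
        }
        for s in spans
    ]


def paragraph_offsets(paragraphs, sep="\n"):
    """Start index of each paragraph in sep.join(paragraphs)."""
    offsets, position = [], 0
    for p in paragraphs:
        offsets.append(position)
        position += len(p) + len(sep)
    return offsets


# Placeholder first paragraph, then the start of the paragraph printed above.
paragraphs = [
    "Intro sentence.",
    "The US Department of Energy, he said, is studying ways to bail out nuclear (and coal) facilities",
]
# Token dicts copied from the output above (score and index omitted).
tokens = [
    {"entity": "B-ORG", "word": "US", "start": 4, "end": 6},
    {"entity": "I-ORG", "word": "Department", "start": 7, "end": 17},
    {"entity": "I-ORG", "word": "of", "start": 18, "end": 20},
    {"entity": "I-ORG", "word": "Energy", "start": 21, "end": 27},
]

offsets = paragraph_offsets(paragraphs)
grouped = group_entities(tokens, paragraphs[1], offset=offsets[1])
print(grouped)
```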

In [19]:
# RE: REBEL-large
from transformers import pipeline
model = 'Babelscape/rebel-large'

triplet_extractor = pipeline('text2text-generation', model=model, tokenizer=model)

# CODE FROM DOCUMENTATION
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets

relations = []
for paragraph in text:
    print(paragraph)

    # Relations
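    # Generate token ids, then decode them manually so the special tokens are kept
    generated = triplet_extractor(paragraph, return_tensors=True, return_text=False)
    extracted_text = triplet_extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])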
    relation_result = extract_triplets(extracted_text[0])
    for rR in relation_result:
        print(rR)
        relations.append(rR)

The US Department of Energy, he said, is studying ways to bail out nuclear (and coal) facilities, including potentially by mandating grid operators to purchase power from them.
{'head': 'Department of Energy', 'type': 'country', 'tail': 'US'}
Supporters say it’s a sensible move because the nation’s power grid cannot rely solely on natural gas, wind, and solar. Opponents of the nuclear industry say nuclear power that was once advertised as being “too cheap to meter” has become too costly for electric utilities to buy. 
{'head': 'nuclear industry', 'type': 'product or material produced', 'tail': 'nuclear power'}
The US has the largest number of nuclear plants in the world – 99 in commercial operation providing 20% of its electricity generation – but its industry leadership is declining as efforts to build a new generation of reactors have been plagued by problems, and ageing plants have been retired or closed in the face of economic, market, and financial pressures.
{'head': '99 in comme
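
Judging by the parser above, the decoded text appears to have the shape `<triplet> head <subj> tail <obj> relation`, with further `<subj> ... <obj> ...` pairs sharing the same head. The sketch below is an alternative, split-based parser for that shape, not the documented one. `parse_linearized` is a hypothetical name, and the sample string is hand-written to match the first printed triplet rather than copied from real model output.

```python
import re


def parse_linearized(text):
    """Split-based parser for '<triplet> head <subj> tail <obj> relation' text."""
    text = re.sub(r"<s>|</s>|<pad>", "", text)
    triplets = []
    # Each <triplet> block starts with a head, followed by one or more
    # '<subj> tail <obj> relation' pairs that share that head.
    for block in text.split("<triplet>")[1:]:
        head, *pairs = block.split("<subj>")
        for pair in pairs:
            tail, sep, relation = pair.partition("<obj>")
            if sep and head.strip() and tail.strip() and relation.strip():
                triplets.append({
                    "head": head.strip(),
                    "type": relation.strip(),
                    "tail": tail.strip(),
                })
    return triplets


# Hand-written sample string (an assumption, not captured model output).
sample = "<s><triplet> Department of Energy <subj> US <obj> country</s>"
parsed = parse_linearized(sample)
print(parsed)
```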

In [20]:
# RE: mREBEL Large
from transformers import pipeline
triplet_extractor = pipeline('translation_xx_to_yy', model='Babelscape/mrebel-large', tokenizer='Babelscape/mrebel-large')

# Function to parse the generated text and extract the triplets
def extract_triplets_typed(text):
    triplets = []
    text = text.strip()
    current = 'x'
    subject, relation, object_, object_type, subject_type = '', '', '', '', ''

    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").replace("tp_XX", "").replace("__en__", "").split():
        if token == "<triplet>" or token == "<relation>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
                relation = ''
            subject = ''
        elif token.startswith("<") and token.endswith(">"):
            if current == 't' or current == 'o':
                current = 's'
                if relation != '':
                    triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
                object_ = ''
                subject_type = token[1:-1]
            else:
                current = 'o'
                object_type = token[1:-1]
                relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '' and object_type != '' and subject_type != '':
        triplets.append({'head': subject.strip(), 'head_type': subject_type, 'type': relation.strip(),'tail': object_.strip(), 'tail_type': object_type})
    return triplets

for paragraph in text:
    print(paragraph)
    # Change src_lang from en_XX to match the language of the source text
    generated = triplet_extractor(
        paragraph,
        decoder_start_token_id=250058,
        src_lang="en_XX", tgt_lang="<triplet>",
        return_tensors=True, return_text=False)
    extracted_text = triplet_extractor.tokenizer.batch_decode([generated[0]["translation_token_ids"]])
    relation_result = extract_triplets_typed(extracted_text[0])
    for rR in relation_result:
        print(rR)
        relations.append(rR)

The US Department of Energy, he said, is studying ways to bail out nuclear (and coal) facilities, including potentially by mandating grid operators to purchase power from them.
{'head': 'US Department of Energy', 'head_type': 'media', 'type': 'field of work', 'tail': 'nuclear', 'tail_type': 'concept'}
Supporters say it’s a sensible move because the nation’s power grid cannot rely solely on natural gas, wind, and solar. Opponents of the nuclear industry say nuclear power that was once advertised as being “too cheap to meter” has become too costly for electric utilities to buy. 
{'head': 'solar', 'head_type': 'concept', 'type': 'part of', 'tail': 'power grid', 'tail_type': 'concept'}
The US has the largest number of nuclear plants in the world – 99 in commercial operation providing 20% of its electricity generation – but its industry leadership is declining as efforts to build a new generation of reactors have been plagued by problems, and ageing plants have been retired or closed in the

In [6]:
# RE: KnowGL
from transformers import pipeline

model = 'ibm/knowgl-large'
classifier = pipeline('text2text-generation', model=model, tokenizer=model)

relations = []
for paragraph in text:
    print(paragraph)

    # Relations
    relation_result = classifier(paragraph)
    for rR in relation_result:
        print(rR)
        relations.append(rR)


According to DESNZ: “GBN will drive the rapid expansion of new nuclear power plants in the UK at an unprecedented scale and pace.
{'generated_text': '[(New Zealand#New Zealand#sovereign state)|member of|(Commonwealth#Commonwealth of Nations#intergovernmental organization)]'}
“This will boost UK energy security, reduce dependence on volatile fossil fuel imports, create more affordable power and grow the economy, with the nuclear industry estimated to generate around £6bn for the UK economy.”
{'generated_text': '[economy#Economy of the United Kingdom#national economy)|location|(UK#United Kingdom#sovereign state)]'}
DESNZ added that GBN will play a key role in helping the government hit its ambition to provide up to a quarter of the UK’s electricity from homegrown nuclear energy by 2050.
{'generated_text': '[(UK#United Kingdom#sovereign state)|executive body|(government#Government of the United Kingdom#government)]'}
GBN’s official launch is supported by a competition for organisations to
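
The KnowGL cell stores the raw `generated_text` strings, unlike the REBEL cells, which store parsed triplets. Based only on the outputs printed above, each string looks like `[(mention#label#type)|relation|(mention#label#type)]`. The sketch below parses that shape into dicts. `parse_knowgl` and `split_fields` are hypothetical helpers, and the `$` separator between multiple triples is an assumption that does not appear in these outputs. It tolerates a missing parenthesis, as in the second output above.

```python
def split_fields(side):
    """Turn '(mention#label#type)' into a dict, tolerating missing parentheses."""
    pieces = [piece.strip() for piece in side.strip().strip("()").split("#")]
    pieces += [""] * (3 - len(pieces))  # pad if a field is missing
    return dict(zip(("mention", "label", "type"), pieces))


def parse_knowgl(generated_text, separator="$"):
    """Parse '[(m#l#t)|relation|(m#l#t)]' strings into head/type/tail dicts."""
    triples = []
    for chunk in generated_text.split(separator):
        parts = chunk.strip().strip("[]").split("|")
        if len(parts) != 3:
            continue  # skip anything that is not head|relation|tail
        head, relation, tail = parts
        triples.append({
            "head": split_fields(head),
            "type": relation.strip(),
            "tail": split_fields(tail),
        })
    return triples


# Strings copied from the outputs above; the second one lacks an opening "(".
well_formed = "[(UK#United Kingdom#sovereign state)|executive body|(government#Government of the United Kingdom#government)]"
malformed = "[economy#Economy of the United Kingdom#national economy)|location|(UK#United Kingdom#sovereign state)]"

parsed_ok = parse_knowgl(well_formed)
parsed_bad = parse_knowgl(malformed)
print(parsed_ok)
print(parsed_bad)
```

Labels that themselves begin or end with a parenthesis would be trimmed by `strip("()")`, so this is only a starting point.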

In [5]:
# RE: yseop/distilbert-base-financial-relation-extraction
from transformers import pipeline

model = 'yseop/distilbert-base-financial-relation-extraction'
classifier = pipeline('text-classification', model=model, tokenizer=model)

relations = []
for paragraph in text:
    print(paragraph)

    # Relations
    relation_result = classifier(paragraph)
    for rR in relation_result:
        print(rR)
        relations.append(rR)



According to DESNZ: “GBN will drive the rapid expansion of new nuclear power plants in the UK at an unprecedented scale and pace.
{'label': 'are', 'score': 0.9930928945541382}
“This will boost UK energy security, reduce dependence on volatile fossil fuel imports, create more affordable power and grow the economy, with the nuclear industry estimated to generate around £6bn for the UK economy.”
{'label': 'x', 'score': 0.9958141446113586}
DESNZ added that GBN will play a key role in helping the government hit its ambition to provide up to a quarter of the UK’s electricity from homegrown nuclear energy by 2050.
{'label': 'x', 'score': 0.46124711632728577}
GBN’s official launch is supported by a competition for organisations to bid for funding support for the development of their nuclear products, including small modular reactors (SMRs).
{'label': 'is in', 'score': 0.7382863759994507}
Current SMR projects include Rolls-Royce SMR proposals looking at constructing reactors in Oldbury and Berk
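
The classifier returns one label per paragraph, with scores ranging from about 0.46 to 0.996 in the output above. A minimal sketch of filtering out low-confidence predictions follows. `keep_confident` is a hypothetical helper, the 0.7 threshold is an arbitrary assumption, and the paragraph strings are placeholders standing in for the article text.

```python
def keep_confident(paragraphs, predictions, threshold=0.7):
    """Pair each paragraph with its prediction and drop low-scoring ones."""
    kept = []
    for paragraph, pred in zip(paragraphs, predictions):
        if pred["score"] >= threshold:
            kept.append({
                "text": paragraph,
                "label": pred["label"],
                "score": pred["score"],
            })
    return kept


# Predictions copied from the output above; paragraph text is a placeholder.
predictions = [
    {"label": "are", "score": 0.9930928945541382},
    {"label": "x", "score": 0.9958141446113586},
    {"label": "x", "score": 0.46124711632728577},
    {"label": "is in", "score": 0.7382863759994507},
]
paragraph_texts = ["paragraph 1", "paragraph 2", "paragraph 3", "paragraph 4"]

kept = keep_confident(paragraph_texts, predictions)
print(kept)
```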