# An introduction to `relatio` 
**Runtime $\sim$ 1h**

Original paper: ["Text Semantics Capture Political and Economic Narratives"](https://arxiv.org/abs/2108.01720)

----------------------------

This is a short demo of the package `relatio`. It takes a text corpus as input and outputs a list of narrative statements. The pipeline is unsupervised: the user does not need to specify narratives beforehand. Narrative statements are defined as tuples of semantic roles with an (agent, verb, patient, attribute) structure.
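
To make the tuple structure concrete, here is a minimal sketch of one narrative statement stored as a dictionary of roles. It assumes the role labels used later in this tutorial (`ARG0`, `B-V`, `ARG1`, `ARG2`) stand for agent, verb, patient, and attribute respectively. The helper `render_statement` is hypothetical and is not part of `relatio`.

```python
from typing import Dict

# Assumed mapping: ARG0 = agent, B-V = verb, ARG1 = patient, ARG2 = attribute.
ROLE_ORDER = ("ARG0", "B-V", "ARG1", "ARG2")


def render_statement(statement: Dict[str, str]) -> str:
    """Join the non-empty roles in (agent, verb, patient, attribute) order."""
    parts = [statement.get(role, "") for role in ROLE_ORDER]
    return " ".join(part for part in parts if part)


example = {"ARG0": "republicans democrats", "B-V": "create", "ARG1": "problem"}
print(render_statement(example))  # republicans democrats create problem
```

Roles that are missing or empty are simply skipped, so a statement with only a verb renders as that verb alone.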

Here, we present the main wrapper functions to quickly obtain narrative statements from a corpus.

----------------------------

We provide datasets that have already been split into sentences and annotated by our team.

The datasets are provided in three different formats:
 1. `raw` (unprocessed)
 2. `split_sentences` (as a list of sentences)
 3. `srl_res` (as a list of sentences annotated by the semantic role labeler)

In this tutorial, we work with the Trump Tweet Archive corpus.

----------------------------

In [1]:
# Catch warnings for an easy ride
from relatio._logging import FileLogger
logger = FileLogger(level = 'WARNING')



In [2]:
# Browse list of available datasets
from relatio.datasets import list_datasets
print(list_datasets())

# Load an available dataset
from relatio.datasets import load_trump_data
df = load_trump_data("raw")


    List of available datasets:

    Trump Tweet Archive
    - function call: load_trump_data()
    - format: 'raw', 'split_sentences', 'srl_res'
    - allennlp version: 0.9
    - srl model: srl-model-2018.05.25.tar.gz
    


## Step 1: Split into sentences

----------------------------

For any new corpus, the first thing you will want to do is to split the corpus into sentences.

We do this on the first 1,000 tweets.

The output is two lists: one holding the document index of each sentence and one holding the split sentences themselves.
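
The two-list output can be illustrated with a small standalone sketch. It uses a naive regular-expression splitter rather than the spaCy model that `Preprocessor` relies on, and the function name `split_documents` and the sample texts are made up for illustration.

```python
import re
from typing import List, Tuple


def split_documents(docs: List[str]) -> Tuple[List[int], List[str]]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    doc_index: List[int] = []
    sentences: List[str] = []
    for i, doc in enumerate(docs):
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence:  # skip empty documents
                doc_index.append(i)
                sentences.append(sentence)
    return doc_index, sentences


docs = ["First sentence. Second sentence!", "Only one sentence here."]
doc_index, sentences = split_documents(docs)
print(doc_index)  # [0, 0, 1]
print(sentences)
```

The two lists have the same length, so `doc_index[j]` always identifies the document that `sentences[j]` came from.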

----------------------------


In [3]:
from relatio.preprocessing import Preprocessor

p = Preprocessor(
    spacy_model = "en_core_web_md",
    remove_punctuation = True,
    remove_digits = True,
    lowercase = True,
    lemmatize = True,
    stop_words = [],
    n_process = -1,
    batch_size = 100
)

split_sentences = p.split_into_sentences(
    df.iloc[0:1000], output_path='sentences.json', progress_bar=True
)

from relatio.utils import load_sentences
doc_index, sentences = load_sentences('sentences.json')

Splitting into sentences...


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 924.53it/s]


## Step 2: Annotate semantic roles

----------------------------

Once the corpus is split into sentences, you can feed it to the semantic role labeler.

The output is a list of JSON objects that contain the semantic role annotations for each sentence in the corpus.

----------------------------


In [4]:
from relatio.semantic_role_labeling import SRL

# Use a lowercase name for the instance so the SRL class is not overwritten
srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = -1
)

srl_res = srl(split_sentences[1], progress_bar=True)

Running SRL...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 229/229 [01:18<00:00,  2.90it/s]


In [5]:
srl_res[0]

{'verbs': [{'verb': 'have',
   'description': 'Republicans and Democrats [V: have] both created our economic problems .',
   'tags': ['O', 'O', 'O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O']},
  {'verb': 'created',
   'description': '[ARG0: Republicans and Democrats] have both [V: created] [ARG1: our economic problems] .',
   'tags': ['B-ARG0',
    'I-ARG0',
    'I-ARG0',
    'O',
    'O',
    'B-V',
    'B-ARG1',
    'I-ARG1',
    'I-ARG1',
    'O']}],
 'words': ['Republicans',
  'and',
  'Democrats',
  'have',
  'both',
  'created',
  'our',
  'economic',
  'problems',
  '.']}
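
To show how the `tags` and `words` lists above relate to semantic roles, here is a minimal sketch that groups words by the role named in their BIO tag. The function `tags_to_roles` is hypothetical and much simpler than the package's own role extraction. In particular, it labels the verb `V`, whereas the tutorial's later outputs use the key `B-V`.

```python
from typing import Dict, List


def tags_to_roles(words: List[str], tags: List[str]) -> Dict[str, str]:
    """Group words by the role in their BIO tag (B-ARG0, I-ARG0 -> ARG0)."""
    roles: Dict[str, List[str]] = {}
    for word, tag in zip(words, tags):
        if tag == "O":  # token outside any role
            continue
        role = tag.split("-", 1)[1]  # drop the leading "B-" or "I-"
        roles.setdefault(role, []).append(word)
    return {role: " ".join(tokens) for role, tokens in roles.items()}


words = ['Republicans', 'and', 'Democrats', 'have', 'both',
         'created', 'our', 'economic', 'problems', '.']
tags = ['B-ARG0', 'I-ARG0', 'I-ARG0', 'O', 'O',
        'B-V', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'O']
print(tags_to_roles(words, tags))
# {'ARG0': 'Republicans and Democrats', 'V': 'created', 'ARG1': 'our economic problems'}
```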

NB: This step is faster with a GPU. Set the `cuda_device` argument to the index of the GPU to use (`-1`, as above, runs on the CPU):

```
import torch
print(torch.cuda.is_available())
print(torch.cuda.current_device())

srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = 0
)

srl_res = srl(split_sentences[1], progress_bar=True)
```

In [6]:
# To save us some time, we download the results from the datasets module.
#split_sentences = load_trump_data("split_sentences")
#srl_res = load_trump_data("srl_res")

## Step 3: Pre-process semantic roles

----------------------------

We extract the semantic roles from the annotations, clean them, and mine the corpus for named entities.


----------------------------

In [7]:
roles, sentence_index = p.extract_roles(
    srl_res, 
    used_roles = ["ARG0","B-V","B-ARGM-NEG","B-ARGM-MOD","ARG1","ARG2"],
    progress_bar = True
)

for d in roles[0:5]: print(d)

Extracting semantic roles...


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2283/2283 [00:00<00:00, 30930.73it/s]

{'B-V': 'have'}
{'ARG0': 'Republicans and Democrats', 'ARG1': 'our economic problems', 'B-V': 'created'}
{'ARG1': 'I', 'ARG2': 'thrilled to be back in the Great city of Charlotte , North Carolina with thousands of hardworking American Patriots who love our Country , cherish our values , respect our laws , and always put AMERICA FIRST', 'B-V': 'was'}
{'ARG1': 'I', 'ARG2': 'back in the Great city of Charlotte , North Carolina with , respect our laws , and always put AMERICA FIRST', 'B-V': 'be'}
{'ARG0': 'thousands of hardworking American Patriots who', 'ARG1': 'our Country', 'B-V': 'love'}





In [8]:
postproc_roles = p.process_roles(roles, 
                                 dict_of_pos_tags_to_keep = {
                                     "ARG0": ['NOUN', 'PROPN'],
                                     "B-V": ['VERB'],
                                     "ARG1": ['NOUN', 'PROPN'],
                                     "ARG2": ['NOUN', 'PROPN']
                                 }, 
                                 progress_bar = True,
                                 output_path = 'postproc_roles.json')

from relatio.utils import load_roles
postproc_roles = load_roles('postproc_roles.json')

for d in postproc_roles[0:5]: print(d)

Cleaning roles ARG0...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1873/1873 [00:01<00:00, 1798.77it/s]


Cleaning roles B-V...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4684/4684 [00:02<00:00, 2080.21it/s]


Cleaning roles B-ARGM-MOD...


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 484/484 [00:00<00:00, 1300.78it/s]


Cleaning roles ARG1...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3396/3396 [00:01<00:00, 1805.20it/s]


Cleaning roles ARG2...


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 951/951 [00:00<00:00, 1283.01it/s]

{'B-V': 'have'}
{'ARG0': 'republicans democrats', 'B-V': 'create', 'ARG1': 'problem'}
{'B-V': '', 'ARG1': '', 'ARG2': 'city charlotte north carolina thousand patriots country value law america'}
{'B-V': '', 'ARG1': '', 'ARG2': 'city charlotte north carolina law america'}
{'ARG0': 'thousand patriots', 'B-V': '', 'ARG1': 'country'}
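
The cleaning applied to each role can be approximated with the standard library alone. The sketch below covers only lowercasing, punctuation and digit removal, and stop-word filtering. Lemmatization and part-of-speech filtering need a spaCy model and are left out. The function `clean_role` is a hypothetical stand-in, and it works character by character, so it is rougher than the package's token-level cleaning.

```python
import string
from typing import Iterable


def clean_role(text: str, stop_words: Iterable[str] = ()) -> str:
    """Lowercase, replace punctuation and digits by spaces, drop stop words."""
    table = str.maketrans({ch: " " for ch in string.punctuation + string.digits})
    tokens = text.lower().translate(table).split()
    stop = set(stop_words)
    return " ".join(tok for tok in tokens if tok not in stop)


print(clean_role("Charlotte , North Carolina"))  # charlotte north carolina
print(clean_role("our economic problems", stop_words=["our"]))  # economic problems
```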





In [9]:
known_entities = p.mine_entities(
    split_sentences[1], 
    clean_entities = True, 
    progress_bar = True,
    output_path = 'entities.pkl'
)

from relatio.utils import load_entities
known_entities = load_entities('entities.pkl')

for n in known_entities.most_common(10): print(n)

Mining named entities...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2283/2283 [00:01<00:00, 1317.05it/s]

('biden', 78)
('georgia', 58)
('pennsylvania', 53)
('joe biden', 50)
('trump', 39)
('michigan', 35)
('democrats', 29)
('america', 29)
('republican', 27)
('breitbartnews', 27)





In [10]:
top_known_entities = [e[0] for e in list(known_entities.most_common(100)) if e[0] != '']
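
The selection above takes the most frequent entries first and filters out the empty string afterwards. A small sketch with made-up mention counts shows the consequence: if the empty string is among the top entries, the resulting list is shorter than the requested size.

```python
from collections import Counter

# Made-up entity mentions; the empty string stands for a mention cleaned to nothing.
mentions = ["", "", "", "biden", "biden", "georgia"]
counts = Counter(mentions)

# Same pattern as above, with a top-2 cut instead of a top-100 cut.
top_entities = [name for name, _ in counts.most_common(2) if name != ""]
print(top_entities)  # ['biden']
```

Filtering before the cut (for example, deleting the `""` key from the counter first) would keep the list at the requested length.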

## Step 4: Build the narrative model

----------------------------

We are now ready to build a narrative model.

----------------------------

In [11]:
from relatio._embeddings import Embeddings
nlp_model = Embeddings("TensorFlow_USE","https://tfhub.dev/google/universal-sentence-encoder/4")


In [12]:
from relatio.narrative_models import NarrativeModel
from relatio.utils import prettify
from collections import Counter

m = NarrativeModel(model_type = 'deterministic',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_entities = ['ARG0','ARG1','ARG2'],
                   list_of_known_entities = top_known_entities,
                   assignment_to_known_entities = 'character_matching',
                   roles_with_embeddings = [],
                   embeddings_model = None,
                   threshold = 1)    

m.train(postproc_roles)

No training required: the model is deterministic.


In [13]:
import cProfile
cProfile.run("narratives = m.predict(postproc_roles, progress_bar = True, prettify = False)")

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5118/5118 [00:00<00:00, 7216.88it/s]

         2700156 function calls (2671928 primitive calls) in 0.757 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
        1    0.000    0.000    0.757    0.757 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 _monitor.py:94(report)
        1    0.000    0.000    0.000    0.000 _weakrefset.py:106(remove)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:16(__init__)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:20(__enter__)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:26(__exit__)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:52(_commit_removals)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:58(__iter__)
        1    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
  28229/1    0.021    0.000    0.045    0.045 copy.py:132(deepcopy




In [14]:
pretty_narratives = []
for n in narratives:
    # Keep only complete statements: agent, verb, and patient are all non-empty
    if all(n.get(role) not in ["", None] for role in ('ARG0', 'B-V', 'ARG1')):
        pretty_narratives.append(prettify(n))
                
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10): print(t)

('biden want country', 2)
('republicans|house vote will ndaa|national defense authorization act', 2)
('biden lie pennsylvania', 1)
('democrats|republican|rino look d.c.', 1)
('state want state', 1)
('twitter ban pennsylvania|state', 1)
('pennsylvania give biden', 1)
('fbi make must fbi', 1)
('mike have country', 1)
('georgia|state refuse republican|kelly|david', 1)


In [15]:
m = NarrativeModel(model_type = 'dynamic',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_entities = ['ARG0','ARG1','ARG2'],
                   list_of_known_entities = top_known_entities,
                   assignment_to_known_entities = 'character_matching',
                   roles_with_embeddings = [['ARG0','ARG1','ARG2']],
                   embeddings_model = None,
                   threshold = 0.3)    

m.train(postproc_roles, progress_bar = True)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5118/5118 [00:04<00:00, 1159.98it/s]


In [16]:
cProfile.run("narratives = m.predict(postproc_roles, progress_bar = True, prettify = False)")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5118/5118 [00:05<00:00, 859.38it/s]

         5366387 function calls (5100151 primitive calls) in 5.994 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2532    0.002    0.000    0.032    0.000 <__array_function__ internals>:2(amin)
     2532    0.002    0.000    0.012    0.000 <__array_function__ internals>:2(atleast_1d)
     7596    0.007    0.000    0.044    0.000 <__array_function__ internals>:2(concatenate)
     4763    0.004    0.000    0.015    0.000 <__array_function__ internals>:2(count_nonzero)
     2532    0.002    0.000    0.010    0.000 <__array_function__ internals>:2(dot)
     2532    0.002    0.000    0.043    0.000 <__array_function__ internals>:2(hstack)
     2532    0.002    0.000    0.034    0.000 <__array_function__ internals>:2(norm)
     2532    0.002    0.000    0.008    0.000 <__array_function__ internals>:2(where)
        3    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
        1    0.000 




In [17]:
pretty_narratives = []
for n in narratives:
    # Keep only complete statements: agent, verb, and patient are all non-empty
    if all(n.get(role) not in ["", None] for role in ('ARG0', 'B-V', 'ARG1')):
        pretty_narratives.append(prettify(n))
                
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10): print(t)

('nate simington individual have senate', 2)
('jim have complete total endorsement', 2)
('judge brann not allow would case evidence', 2)
('democrats steal election', 2)
('biden|joe biden|joe sacrifice blood treasure', 2)
('biden want country', 2)
('republicans|house vote will ndaa|national defense authorization act', 2)
('democrats|republicans create problem', 1)
('thousand patriots cherish value', 1)
('election use system', 1)


## Step 5: Model validation and basic analysis

----------------------------

To validate the model, we compare each original statement with the narrative extracted from it.


----------------------------

In [18]:
for i,k in enumerate(sentence_index):
    
    n = narratives[i]
    r = roles[i]
    
    if n.get('ARG0') not in ["", None]:
        if n.get('B-V') not in ["", None]:
            if n.get('ARG1') not in ["", None]:
                print('Original statement:')
                print(prettify(r))
                print('\n')
                print('Underlying narrative:')
                print(prettify(n))
                print("\n")
                

Original statement:
Republicans and Democrats created our economic problems


Underlying narrative:
democrats|republicans create problem


Original statement:
thousands of hardworking American Patriots who cherish our values


Underlying narrative:
thousand patriots cherish value


Original statement:
Almost all recent elections using this system


Underlying narrative:
election use system


Original statement:
Sudan agreed to a peace and normalization agreement with Israel


Underlying narrative:
sudan agree peace normalization agreement israel


Original statement:
AdamLaxalt finding things that , when released , will be absolutely shocking


Underlying narrative:
adamlaxalt find thing


Original statement:
SenTomCotton Republicans have pluses & amp ; minuses


Underlying narrative:
republicans have plus amp minus


Original statement:
Biden lied Pennsylvania


Underlying narrative:
biden lie pennsylvania


Original statement:
Dominion running our Election


Underlying narrative:
dom



Original statement:
New York regain its luster


Underlying narrative:
new york regain luster


Original statement:
Governor Cuomo shown tremendously poor leadership skills in running N.Y. Bad time for him to be writing and promoting a book


Underlying narrative:
governor cuomo show leadership skill n.y. time book


Original statement:
Governor Cuomo running N.Y. Bad time


Underlying narrative:
governor cuomo run n.y. time


Original statement:
a Man that Wants to Do It All for America


Underlying narrative:
man want america


Original statement:
Darin has my Complete and Total Endorsement


Underlying narrative:
darin have complete total endorsement


Original statement:
Ken has my Complete and Total Endorsement


Underlying narrative:
ken have complete total endorsement


Original statement:
Senator CindyHydeSmith delivers for Mississippi


Underlying narrative:
senator cindyhydesmith deliver mississippi


Original statement:
a Corrupt Politician who Raise will your Taxes


Unde

georgia|breitbartnews not rely can briankempga


Original statement:
Joe Biden used the term Super Predator


Underlying narrative:
biden|joe biden|joe use term super predator


Original statement:
Joe Biden referring to young Black Men


Underlying narrative:
biden|joe biden|joe refer black men


Original statement:
Facebook stated that they made an enforcement error


Underlying narrative:
facebook state enforcement error


Original statement:
Democrats Growing More Anxious in Pennsylvania via BreitbartNews Biden kill would Fracking and our great 2nd Amendment


Underlying narrative:
biden|pennsylvania|democrats|breitbartnews kill would fracking 2nd amendment


Original statement:
Justices Alito and Thomas say they would have allowed Texas to proceed with its election lawsuit


Underlying narrative:
justices alito thomas say texas


Original statement:
Texas proceed with its election lawsuit


Underlying narrative:
texas proceed election lawsuit


Original statement:
Antifa Shouts De

## Step 6: Visualization // Plotting narrative graphs
----------------------------

A collection of narrative statements has an intuitive network structure, in which the edges are verbs and the nodes are entities.

Here, we plot Trump's narrative statements on Twitter.

----------------------------
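
Before plotting, the network structure described above can be sketched with plain dictionaries: each complete statement contributes one directed edge from agent to patient, labelled with the verb. The function `build_edges` is hypothetical and independent of the package's plotting utilities. The role assignment of the sample statements is assumed from the prettified outputs shown earlier.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_edges(narratives: List[Dict[str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """Map each (agent, patient) pair to the list of verbs connecting them."""
    edges: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for n in narratives:
        agent, verb, patient = n.get("ARG0"), n.get("B-V"), n.get("ARG1")
        if agent and verb and patient:  # skip incomplete statements
            edges[(agent, patient)].append(verb)
    return dict(edges)


sample = [
    {"ARG0": "biden", "B-V": "want", "ARG1": "country"},
    {"ARG0": "democrats", "B-V": "steal", "ARG1": "election"},
    {"B-V": "have"},  # incomplete: no agent or patient
]
edges = build_edges(sample)
for (agent, patient), verbs in edges.items():
    print(f"{agent} --{'/'.join(verbs)}--> {patient}")
```

An edge list of this shape can be handed to most graph libraries, with the number of verbs per pair serving as a natural edge weight.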