# An introduction to `relatio` 
**Runtime $\sim$ 1h**

Original paper: ["Text Semantics Capture Political and Economic Narratives"](https://arxiv.org/abs/2108.01720)

----------------------------

This is a short demo of the package `relatio`. It takes a text corpus as input and outputs a list of narrative statements. The pipeline is unsupervised: the user does not need to specify narratives beforehand. Narrative statements are defined as tuples of semantic roles with an (agent, verb, patient, attribute) structure.
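
To make the tuple structure concrete, here is a minimal sketch of one narrative statement stored as a dictionary keyed by semantic role. The role keys mirror the labels that appear later in this demo (`ARG0`, `B-V`, `ARG1`, `ARG2`); the helper `statement_to_text` is hypothetical and not part of `relatio`, and treating `ARG2` as the attribute slot is an assumption.

```python
# A narrative statement as a mapping from semantic role to phrase.
statement = {"ARG0": "republicans democrats", "B-V": "create", "ARG1": "problem"}


def statement_to_text(statement, role_order=("ARG0", "B-V", "ARG1", "ARG2")):
    """Join the roles that are present, in agent-verb-patient-attribute order.

    Hypothetical helper for illustration only; ARG2 as 'attribute' is assumed.
    """
    return " ".join(statement[role] for role in role_order if role in statement)


print(statement_to_text(statement))
```

Roles that are missing from a statement are simply skipped, which matches the partially filled dictionaries printed in the later steps.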

Here, we present the main wrapper functions to quickly obtain narrative statements from a corpus.

----------------------------

We provide datasets that have already been split into sentences and annotated by our team.

The datasets are provided in three different formats:
 1. `raw` (unprocessed)
 2. `split_sentences` (as a list of sentences)
 3. `srl_res` (as a list of sentences annotated by the semantic role labeler)

In this tutorial, we work with the Trump Tweet Archive corpus.

----------------------------

In [1]:
import cProfile

In [2]:
# Catch warnings for an easy ride
from relatio import FileLogger
logger = FileLogger(level = 'WARNING')

2022-03-05 10:04:04.068473: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-05 10:04:04.068494: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [3]:
# Browse list of available datasets
from relatio.datasets import list_datasets
print(list_datasets())

# Load an available dataset
from relatio.datasets import load_trump_data
df = load_trump_data("raw")


    List of available datasets:

    Trump Tweet Archive
    - function call: load_trump_data()
    - format: 'raw', 'split_sentences', 'srl_res'
    - allennlp version: 0.9
    - srl model: srl-model-2018.05.25.tar.gz
    


## Step 1: Split into sentences

----------------------------

For any new corpus, the first thing you will want to do is to split the corpus into sentences.

We do this on the first 1,000 tweets.

The output is two lists: one with an index for the document and one with the resulting split sentences.

----------------------------


In [4]:
from relatio.preprocessing import Preprocessor

p = Preprocessor(
    spacy_model = "en_core_web_md",
    remove_punctuation = True,
    remove_digits = True,
    lowercase = True,
    lemmatize = True,
    stop_words = [],
    n_process = -1,
    batch_size = 100
)

split_sentences = p.split_into_sentences(
    df.iloc[0:1000], output_path='sentences.json', progress_bar=True
)

from relatio.utils import load_sentences
doc_index, sentences = load_sentences('sentences.json')

Splitting into sentences...


100%|██████████████████████████████████████| 1000/1000 [00:01<00:00, 884.93it/s]
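
Since the two lists are parallel, sentences can be regrouped by document when needed. The sketch below uses toy stand-ins for `doc_index` and `sentences` and assumes the lists align element by element; `group_sentences_by_document` is a hypothetical helper, not a `relatio` function.

```python
from collections import defaultdict

# Toy stand-ins for the two parallel lists (assumed to align element-wise).
doc_index = [0, 0, 1, 2, 2, 2]
sentences = [
    "First tweet, part one.",
    "First tweet, part two.",
    "Second tweet.",
    "Third tweet, part one.",
    "Third tweet, part two.",
    "Third tweet, part three.",
]


def group_sentences_by_document(doc_index, sentences):
    """Map each document id to the list of its sentences, in original order."""
    grouped = defaultdict(list)
    for doc_id, sentence in zip(doc_index, sentences):
        grouped[doc_id].append(sentence)
    return dict(grouped)


sentences_by_doc = group_sentences_by_document(doc_index, sentences)
print(len(sentences_by_doc[2]))
```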


## Step 2: Annotate semantic roles

----------------------------

Once the corpus is split into sentences, you can feed it to the semantic role labeler.

The output is a list of json objects which contain the semantic role annotations for each sentence in the corpus.

----------------------------


In [5]:
from relatio.semantic_role_labeling import SRL

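# Use a lowercase instance name so the SRL class is not overwritten on re-runs.
srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = 0
)

# split_sentences[1] holds the sentences (split_sentences[0] is the document index).
srl_res = srl(split_sentences[1], progress_bar=True)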

Running SRL...


100%|█████████████████████████████████████████| 229/229 [01:32<00:00,  2.48it/s]


In [6]:
srl_res[0]

{'verbs': [{'verb': 'have',
   'description': 'Republicans and Democrats [V: have] both created our economic problems .',
   'tags': ['O', 'O', 'O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O']},
  {'verb': 'created',
   'description': '[ARG0: Republicans and Democrats] have both [V: created] [ARG1: our economic problems] .',
   'tags': ['B-ARG0',
    'I-ARG0',
    'I-ARG0',
    'O',
    'O',
    'B-V',
    'B-ARG1',
    'I-ARG1',
    'I-ARG1',
    'O']}],
 'words': ['Republicans',
  'and',
  'Democrats',
  'have',
  'both',
  'created',
  'our',
  'economic',
  'problems',
  '.']}
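
The `tags` list follows the BIO scheme: `B-` opens a span, `I-` continues it, and `O` marks tokens outside any span. As a rough illustration of how role phrases can be recovered from the annotation above, here is a minimal sketch; `spans_from_bio_tags` is a hypothetical helper rather than the package's own extraction code, and it strips the `B-` prefix from keys, whereas `relatio` reports the verb under `B-V`.

```python
def spans_from_bio_tags(words, tags):
    """Collect the words of each labeled span from BIO tags.

    If a role occurs more than once, the later span overwrites the earlier one.
    """
    spans = {}
    current_role = None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            current_role = tag[2:]
            spans[current_role] = [word]
        elif tag.startswith("I-") and current_role == tag[2:]:
            spans[current_role].append(word)
        else:
            current_role = None
    return {role: " ".join(tokens) for role, tokens in spans.items()}


words = ["Republicans", "and", "Democrats", "have", "both",
         "created", "our", "economic", "problems", "."]
tags = ["B-ARG0", "I-ARG0", "I-ARG0", "O", "O",
        "B-V", "B-ARG1", "I-ARG1", "I-ARG1", "O"]

print(spans_from_bio_tags(words, tags))
```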

NB: This step is faster with a GPU. The `cuda_device` argument lets users run the labeler on a GPU:

```
import torch
print(torch.cuda.is_available())
print(torch.cuda.current_device())

# Use a lowercase instance name so the SRL class is not overwritten.
srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = 0
)

srl_res = srl(split_sentences[1], progress_bar=True)
```
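
One way to avoid hard-coding the device is to choose it at runtime. The sketch below assumes that `cuda_device` is passed through to AllenNLP, where `-1` conventionally selects the CPU and `0` the first GPU; `pick_cuda_device` is a hypothetical helper, and whether `relatio` accepts `-1` should be checked against its documentation.

```python
def pick_cuda_device():
    """Return 0 (first GPU) when CUDA is usable, otherwise -1.

    Assumption: -1 means 'run on CPU', following the AllenNLP convention.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA support to query.
        return -1
    return 0 if torch.cuda.is_available() else -1


cuda_device = pick_cuda_device()
print(cuda_device)
```

The resulting value could then be passed as `cuda_device = cuda_device` when constructing the labeler.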

In [7]:
# To save us some time, we download the results from the datasets module.
# split_sentences = load_trump_data("split_sentences")
# srl_res = load_trump_data("srl_res")

## Step 3: Pre-process semantic roles

----------------------------


----------------------------

In [8]:
from relatio.semantic_role_labeling import extract_roles

roles, sentence_index = extract_roles(
    srl_res, 
    used_roles = ["ARG0","B-V","B-ARGM-NEG","B-ARGM-MOD","ARG1","ARG2"],
    progress_bar = True
)

for d in roles[0:5]: print(d)

Extracting semantic roles...


100%|████████████████████████████████████| 2283/2283 [00:00<00:00, 25242.98it/s]

{'B-V': 'have'}
{'ARG0': 'Republicans and Democrats', 'ARG1': 'our economic problems', 'B-V': 'created'}
{'ARG1': 'I', 'ARG2': 'thrilled to be back in the Great city of Charlotte , North Carolina with thousands of hardworking American Patriots who love our Country , cherish our values , respect our laws , and always put AMERICA FIRST', 'B-V': 'was'}
{'ARG1': 'I', 'ARG2': 'back in the Great city of Charlotte , North Carolina with , respect our laws , and always put AMERICA FIRST', 'B-V': 'be'}
{'ARG0': 'thousands of hardworking American Patriots who', 'ARG1': 'our Country', 'B-V': 'love'}





In [9]:
postproc_roles = p.process_roles(roles, 
                                 dict_of_pos_tags_to_keep = {
                                     "ARG0": ['NOUN', 'PROPN'],
                                     "B-V": ['VERB'],
                                     "ARG1": ['NOUN', 'PROPN'],
                                     "ARG2": ['NOUN', 'PROPN']
                                 }, 
                                 progress_bar = True,
                                 output_path = 'postproc_roles.json')

from relatio.utils import load_roles
postproc_roles = load_roles('postproc_roles.json')

for d in postproc_roles[0:5]: print(d)

Cleaning phrases for role ARG0...


100%|██████████████████████████████████████| 1873/1873 [00:02<00:00, 845.82it/s]


Cleaning phrases for role B-V...


100%|██████████████████████████████████████| 4684/4684 [00:06<00:00, 691.43it/s]


Cleaning phrases for role B-ARGM-MOD...


100%|████████████████████████████████████████| 484/484 [00:01<00:00, 406.89it/s]


Cleaning phrases for role ARG1...


100%|██████████████████████████████████████| 3396/3396 [00:04<00:00, 765.14it/s]


Cleaning phrases for role ARG2...


100%|███████████████████████████████████████| 951/951 [00:00<00:00, 1054.70it/s]

{'B-V': 'have'}
{'ARG0': 'republicans democrats', 'B-V': 'create', 'ARG1': 'problem'}
{'ARG2': 'city charlotte north carolina thousand patriots country value law america'}
{'ARG2': 'city charlotte north carolina law america'}
{'ARG0': 'thousand patriots', 'ARG1': 'country'}





In [10]:
known_entities = p.mine_entities(
    split_sentences[1], 
    clean_entities = True, 
    progress_bar = True,
    output_path = 'entities.pkl'
)

from relatio.utils import load_entities
known_entities = load_entities('entities.pkl')

for n in known_entities.most_common(10): print(n)

Mining named entities...


100%|█████████████████████████████████████| 2283/2283 [00:01<00:00, 1230.35it/s]

('biden', 78)
('georgia', 58)
('pennsylvania', 53)
('joe biden', 50)
('trump', 39)
('michigan', 35)
('democrats', 29)
('america', 29)
('republican', 27)
('breitbartnews', 27)





In [11]:
top_known_entities = [e[0] for e in list(known_entities.most_common(100)) if e[0] != '']

## Step 4: Build a narrative model

----------------------------

We are now ready to build a narrative model.

----------------------------

In [12]:
from relatio.narrative_models import NarrativeModel
from relatio.utils import prettify
from collections import Counter

m = NarrativeModel(model_type = 'deterministic',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_known_entities = ['ARG0','ARG1','ARG2'],
                   known_entities = top_known_entities,
                   assignment_to_known_entities = 'character_matching',
                   roles_with_unknown_entities = [['ARG0','ARG1','ARG2']],
                   embeddings_model = None,
                   threshold = 1)    

m.train(postproc_roles)

No training required: the model is deterministic.


In [13]:
cProfile.run("narratives = m.predict(postproc_roles, progress_bar = True)")


Predicting entities for role: ARG0...
Matching known entities (with character matching)...


100%|███████████████████████████████████████| 908/908 [00:00<00:00, 8642.93it/s]


Assigning labels to matches...

Predicting entities for role: ARG1...
Matching known entities (with character matching)...


100%|█████████████████████████████████████| 2424/2424 [00:00<00:00, 7830.32it/s]


Assigning labels to matches...

Predicting entities for role: ARG2...
Matching known entities (with character matching)...


100%|███████████████████████████████████████| 657/657 [00:00<00:00, 8192.10it/s]

Assigning labels to matches...
         1777582 function calls (1757132 primitive calls) in 0.537 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
        1    0.000    0.000    0.536    0.536 <string>:1(<module>)
        3    0.000    0.000    0.000    0.000 _monitor.py:94(report)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:106(remove)
        6    0.000    0.000    0.000    0.000 _weakrefset.py:16(__init__)
        6    0.000    0.000    0.000    0.000 _weakrefset.py:20(__enter__)
        6    0.000    0.000    0.000    0.000 _weakrefset.py:26(__exit__)
        6    0.000    0.000    0.000    0.000 _weakrefset.py:52(_commit_removals)
        9    0.000    0.000    0.000    0.000 _weakrefset.py:58(__iter__)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
  20451/1    0.014    0.000    0.03




In [14]:
pretty_narratives = []
for n in narratives: 
    if n.get('ARG0') is not None:
        if n.get('B-V') is not None:
            if n.get('ARG1') is not None:
                pretty_narratives.append(prettify(n))
                
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10): print(t)

('biden want country', 2)
('republicans|house will vote ndaa|national defense authorization act', 2)
('biden lie pennsylvania', 1)
('democrats|republican|rino look d.c.', 1)
('state want state', 1)
('twitter ban pennsylvania|state', 1)
('pennsylvania give biden', 1)
('fbi must make fbi', 1)
('mike have country', 1)
('georgia|state refuse republican|kelly|david', 1)
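
Labels such as `republicans|house` suggest that every known entity found in a phrase is kept and that the matches are joined with `|`. The following sketch illustrates that idea under stated assumptions: whole-word matching and the `|` separator are inferred from the printed output, not taken from the `relatio` source, and `match_known_entities` is a hypothetical helper.

```python
import re


def match_known_entities(phrase, known_entities):
    """Return the known entities found as whole words in `phrase`, joined by '|'.

    Matches are reported in the order of `known_entities`; an empty string
    means no known entity was found.
    """
    matches = [
        entity
        for entity in known_entities
        if re.search(r"\b" + re.escape(entity) + r"\b", phrase)
    ]
    return "|".join(matches)


known_entities = ["republicans", "house", "biden"]
print(match_known_entities("house republicans", known_entities))
```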


In [15]:
from relatio import Embeddings
#nlp_model = Embeddings("TensorFlow_USE","https://tfhub.dev/google/universal-sentence-encoder/4")
nlp_model = Embeddings("Gensim_pretrained", "glove-twitter-25")

In [16]:
m = NarrativeModel(model_type = 'static',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_known_entities = ['ARG0','ARG1','ARG2'],
                   known_entities = top_known_entities,
                   assignment_to_known_entities = 'embeddings',
                   roles_with_unknown_entities = [['ARG0','ARG1','ARG2']],
                   n_clusters = [100],
                   embeddings_model = nlp_model,
                   threshold = 0.3)    

m.train(postproc_roles, progress_bar = True)

Focus on roles: ARG0-ARG1-ARG2
Ignoring known entities...
Computing phrase embeddings...


100%|██████████████████████████████████████| 466/466 [00:00<00:00, 20398.74it/s]


Computing phrase embeddings...


100%|████████████████████████████████████| 1400/1400 [00:00<00:00, 21976.97it/s]


Computing phrase embeddings...


100%|██████████████████████████████████████| 524/524 [00:00<00:00, 21583.60it/s]


Computing phrase embeddings...


100%|████████████████████████████████████| 2052/2052 [00:00<00:00, 26911.53it/s]


Clustering phrases into 100 clusters...
Labeling the clusters by the most frequent phrases...
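
To give an intuition for matching "with embeddings distance", the sketch below assigns a phrase vector to its nearest labeled vector and rejects the match when the cosine distance exceeds a threshold. This is an illustration under assumptions, not the package's implementation: the toy two-dimensional vectors, the helper names, and the reading of `threshold` as a maximum cosine distance are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_label(phrase_vector, labeled_vectors, threshold):
    """Return the closest label if its distance is within `threshold`, else None."""
    best_label, best_distance = None, float("inf")
    for label, vector in labeled_vectors.items():
        distance = cosine_distance(phrase_vector, vector)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance <= threshold else None


# Toy 2-d embeddings standing in for real word vectors.
labeled_vectors = {"biden": [1.0, 0.0], "georgia": [0.0, 1.0]}

print(nearest_label([0.9, 0.1], labeled_vectors, threshold=0.3))
print(nearest_label([0.7, 0.7], labeled_vectors, threshold=0.1))
```

A looser threshold assigns more phrases to some label at the cost of noisier matches, which is worth keeping in mind when reading the predictions.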


In [17]:
cProfile.run("narratives = m.predict(postproc_roles, progress_bar = True)")


Predicting entities for role: ARG0...
Computing phrase embeddings...


100%|██████████████████████████████████████| 908/908 [00:00<00:00, 10286.93it/s]


Matching known entities (with embeddings distance)...
Matching unknown entities (with embeddings distance)...
Assigning labels to matches...

Predicting entities for role: ARG1...
Computing phrase embeddings...


100%|████████████████████████████████████| 2424/2424 [00:00<00:00, 12341.55it/s]


Matching known entities (with embeddings distance)...
Matching unknown entities (with embeddings distance)...
Assigning labels to matches...

Predicting entities for role: ARG2...
Computing phrase embeddings...


100%|██████████████████████████████████████| 657/657 [00:00<00:00, 12841.08it/s]


Matching known entities (with embeddings distance)...
Matching unknown entities (with embeddings distance)...
Assigning labels to matches...
         413121 function calls (388759 primitive calls) in 0.424 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        6    0.000    0.000    0.002    0.000 <__array_function__ internals>:2(amin)
        6    0.000    0.000    0.001    0.000 <__array_function__ internals>:2(argmin)
        3    0.000    0.000    0.003    0.001 <__array_function__ internals>:2(concatenate)
     3912    0.003    0.000    0.011    0.000 <__array_function__ internals>:2(count_nonzero)
     3912    0.003    0.000    0.013    0.000 <__array_function__ internals>:2(dot)
     3989    0.004    0.000    0.113    0.000 <__array_function__ internals>:2(mean)
     3912    0.003    0.000    0.046    0.000 <__array_function__ internals>:2(norm)
        6    0.000    0.000    0.000    0.000 <__array_function__ inter

In [18]:
pretty_narratives = []
for n in narratives: 
    pretty_narratives.append(prettify(n))
                
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10): print(t)

('', 1155)
('have', 151)
('do', 127)
('thank', 55)
('go', 48)
('get', 46)
('vote', 34)
('see', 28)
('win', 27)
('taxis', 22)


## Step 5: Model validation and basic analysis

----------------------------


----------------------------

In [19]:
postproc_roles[0:10]

[{'B-V': 'have'},
 {'ARG0': 'republicans democrats', 'B-V': 'create', 'ARG1': 'problem'},
 {'ARG2': 'city charlotte north carolina thousand patriots country value law america'},
 {'ARG2': 'city charlotte north carolina law america'},
 {'ARG0': 'thousand patriots', 'ARG1': 'country'},
 {'ARG0': 'thousand patriots', 'B-V': 'cherish', 'ARG1': 'value'},
 {'ARG1': 'law'},
 {'B-V': 'put', 'ARG1': 'america first'},
 {'B-V': 'thank', 'ARG2': 'evening'},
 {'ARG1': 'unsolicited mail ballot scam', 'ARG2': 'threat democracy amp'}]

In [20]:
narratives[0:10]

[{'B-V': 'have'},
 {'ARG0': 'result', 'B-V': 'create', 'ARG1': 'job'},
 {'ARG2': 'administration'},
 {'ARG2': 'republican'},
 {'ARG0': 'military', 'ARG1': 'dominion voting systems'},
 {'ARG0': 'military', 'B-V': 'cherish', 'ARG1': 'governor briankempga'},
 {'ARG1': 'fake news'},
 {'B-V': 'put', 'ARG1': 'republican poll watcher'},
 {'B-V': 'thank', 'ARG2': 'biden'},
 {'ARG1': 'signature verification', 'ARG2': 'mistake'}]

In [21]:
m._model_obj.vocab_unknown_entities[0][90].most_common()

[('job', 26),
 ('child', 6),
 ('biden corruption', 3),
 ('mail ballot', 3),
 ('stench election hoax', 3),
 ('terrorist anarchist agitator antifa', 3),
 ('term', 3),
 ('signature envelope ballot', 2),
 ('price transparency', 2),
 ('businessman jobs life', 2),
 ('going', 2),
 ('u.s. energy industry fracking energy gas price', 2),
 ('taxis brave law enforcement wall second amendment', 2),
 ('ballot loss america', 2),
 ('november 3rd', 2),
 ('department justice department homeland security u.s. supreme court', 1),
 ('democrat senate', 1),
 ('bret baier tweet', 1),
 ('way shape form', 1),
 ('order', 1),
 ('star pennsylvania', 1),
 ('alfred e. newman mayor pete amp', 1),
 ('misery', 1),
 ('man woman secretservice', 1),
 ('drug price', 1),
 ('victory country', 1),
 ('information', 1),
 ('job president', 1),
 ('covid covid covid way election', 1),
 ('law amp order', 1),
 ('stephaniebice', 1),
 ('joe bidenim wing medium big tech giant washington swamp', 1),
 ('voter fraud place election', 1)]
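
The training log reports that clusters are labeled by their most frequent phrase, which is why this cluster appears as `job` in the predictions. A minimal sketch of that labeling rule on toy data (`label_cluster` is a hypothetical helper, not a `relatio` function):

```python
from collections import Counter

# Toy phrases assigned to one cluster.
cluster_phrases = ["job", "child", "job", "mail ballot", "job", "child"]


def label_cluster(phrases):
    """Return the most frequent phrase in the cluster (ties: first seen wins)."""
    return Counter(phrases).most_common(1)[0][0]


print(label_cluster(cluster_phrases))
```

Inspecting the full counter, as done above, helps judge whether the label is a fair summary of a cluster or hides a heterogeneous mix of phrases.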

In [22]:
pretty_narratives = []
for n in narratives: 
    if n.get('ARG0') not in ["", None]:
        if n.get('B-V') not in ["", None]:
            if n.get('ARG1') not in ["", None]:
                pretty_narratives.append(prettify(n))
                
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(100): print(t)

('law have clue', 5)
('decision have clue', 5)
('sleepy joe have sleepy joe', 2)
('term super predator have clue', 2)
('law enforcement would not allow guy', 2)
('election draw vote', 2)
('sleepy joe steal troop', 2)
('state have clue', 2)
('result steal troop', 2)
('election sacrifice world', 2)
('election vow friend', 2)
('election abolish friend', 2)
('sleepy joe will vote fake news', 2)
('signature have america', 2)
('way have clue', 2)
('result create job', 1)
('military cherish governor briankempga', 1)
('troop use court', 1)
('classified information agree president united states', 1)
('result have thug lowlife', 1)
('election lie taxis', 1)
('poll run troop', 1)
('court would announce problem', 1)
('democrat cities have point', 1)
('donald trump have point', 1)
('radical left would destroy vote', 1)
('result not want news media', 1)
('result have news media', 1)
('oann have politician', 1)
('oann reject light', 1)
('donald trump allow consent decree', 1)
('election expose great 

## Step 6: Visualization // Plotting narrative graphs

----------------------------

A collection of narrative statements has an intuitive network structure, in which the edges are verbs and the nodes are entities.

Here, we plot Trump's narrative statements on Twitter.

----------------------------
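
Before plotting, the network structure can be made explicit by turning complete statements into weighted, verb-labeled edges. The sketch below uses only the standard library and toy statements; `build_edge_counts` is a hypothetical helper and not the package's plotting routine.

```python
from collections import Counter

# Toy narrative statements in the same dictionary format as above.
narratives = [
    {"ARG0": "biden", "B-V": "want", "ARG1": "country"},
    {"ARG0": "biden", "B-V": "want", "ARG1": "country"},
    {"ARG0": "pennsylvania", "B-V": "give", "ARG1": "biden"},
    {"B-V": "have"},  # incomplete statement, skipped below
]


def build_edge_counts(narratives):
    """Count (agent, verb, patient) triples; each triple is a directed, labeled edge."""
    edges = Counter()
    for n in narratives:
        agent, verb, patient = n.get("ARG0"), n.get("B-V"), n.get("ARG1")
        if agent and verb and patient:
            edges[(agent, verb, patient)] += 1
    return edges


edges = build_edge_counts(narratives)
# Nodes are the entities appearing as agent or patient of any edge.
nodes = sorted({node for agent, _, patient in edges for node in (agent, patient)})

print(nodes)
print(edges.most_common(1))
```

The edge counts can serve as edge weights, so that frequently repeated statements appear as thicker links in a plot.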