# An introduction to `relatio` 
**Runtime $\sim$ 1h**

Original paper: ["Text Semantics Capture Political and Economic Narratives"](https://arxiv.org/abs/2108.01720)

----------------------------

This is a short demo of the package `relatio`. It takes a text corpus as input and outputs a list of narrative statements. The pipeline is unsupervised: the user does not need to specify narratives beforehand. Narrative statements are defined as tuples of semantic roles with an (agent, verb, patient, attribute) structure.
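
As a rough illustration, a narrative statement can be pictured as a small mapping from role labels to text. The sketch below uses the role labels that appear later in this tutorial (`ARG0`, `B-V`, `ARG1`, `ARG2`); reading them as agent, verb, patient, and attribute is an assumption made here for illustration, and the helper `to_text` is hypothetical, not part of `relatio`.

```python
# Role labels as used later in this tutorial; the agent/verb/patient/attribute
# reading of each label is an assumption made for illustration.
ROLE_ORDER = ["ARG0", "B-V", "ARG1", "ARG2"]

statement = {
    "ARG0": "republicans democrats",  # agent
    "B-V": "create",                  # verb
    "ARG1": "problem",                # patient
}


def to_text(statement):
    """Join the non-empty roles of a statement in a fixed order (hypothetical helper)."""
    return " ".join(statement[r] for r in ROLE_ORDER if statement.get(r))


print(to_text(statement))
```

Missing or empty roles are simply skipped, so a statement without an attribute still produces a readable string.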

Here, we present the main wrapper functions to quickly obtain narrative statements from a corpus.

----------------------------

We provide datasets that have already been split into sentences and annotated by our team.

The datasets are provided in three different formats:
 1. `raw` (unprocessed)
 2. `split_sentences` (as a list of sentences)
 3. `srl_res` (as a list of sentences annotated by the semantic role labeler)

In this tutorial, we work with the Trump Tweet Archive corpus.

----------------------------

In [1]:
# Catch warnings for an easy ride
from relatio._logging import FileLogger
logger = FileLogger(level = 'WARNING')

In [2]:
# Browse list of available datasets
from relatio.datasets import list_datasets
print(list_datasets())

# Load an available dataset
from relatio.datasets import load_trump_data
df = load_trump_data("raw")


    List of available datasets:

    Trump Tweet Archive
    - function call: load_trump_data()
    - format: 'raw', 'split_sentences', 'srl_res'
    - allennlp version: 0.9
    - srl model: srl-model-2018.05.25.tar.gz
    


## Step 1: Split into sentences

----------------------------

For any new corpus, the first thing you will want to do is to split the corpus into sentences.

We do this on the first 1,000 tweets.

The output is two lists: one with an index for the document and one with the resulting split sentences.

----------------------------


In [3]:
from relatio.preprocessing import *

p = Preprocessor(
    spacy_model = "en_core_web_md",
    remove_punctuation = True,
    remove_digits = True,
    lowercase = True,
    lemmatize = True,
    stop_words = [],
    n_process = -1,
    batch_size = 100
)

split_sentences = p.split_into_sentences(
    df.iloc[0:1000], output_path='sentences.json', progress_bar=True
)

from relatio.utils import load_sentences
split_sentences = load_sentences('sentences.json')

Splitting into sentences...


100%|█████████████████████████████████████| 1000/1000 [00:00<00:00, 1113.03it/s]
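
The two-list output can be regrouped by document when needed. The sketch below is a minimal illustration on made-up data with the same shape, assuming the first list holds the document indices and the second the sentences (the later cells pass `split_sentences[1]` as the sentences). The helper `group_by_document` is hypothetical and not part of `relatio`.

```python
from collections import defaultdict

# Toy stand-in with the same two-list shape: document indices, then sentences.
doc_indices = [0, 0, 1]
sentences = [
    "Republicans and Democrats have both created our economic problems.",
    "We will fix them.",
    "Thank you, North Carolina!",
]
toy_split = (doc_indices, sentences)


def group_by_document(split):
    """Map each document index to the list of its sentences, in order."""
    grouped = defaultdict(list)
    for doc_id, sentence in zip(split[0], split[1]):
        grouped[doc_id].append(sentence)
    return dict(grouped)


grouped = group_by_document(toy_split)
print(len(grouped[0]), len(grouped[1]))
```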


## Step 2: Annotate semantic roles

----------------------------

Once the corpus is split into sentences, you can feed it to the semantic role labeler.

The output is a list of JSON objects that contain the semantic role annotations for each sentence in the corpus.

----------------------------


In [4]:
from relatio.semantic_role_labeling import *

# Use a lowercase instance name so the `SRL` class is not shadowed
srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = -1
)

srl_res = srl(split_sentences[1], progress_bar=True)

Running SRL...


100%|█████████████████████████████████████████| 229/229 [01:13<00:00,  3.12it/s]


In [5]:
srl_res[0]

{'verbs': [{'verb': 'have',
   'description': 'Republicans and Democrats [V: have] both created our economic problems .',
   'tags': ['O', 'O', 'O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O']},
  {'verb': 'created',
   'description': '[ARG0: Republicans and Democrats] have both [V: created] [ARG1: our economic problems] .',
   'tags': ['B-ARG0',
    'I-ARG0',
    'I-ARG0',
    'O',
    'O',
    'B-V',
    'B-ARG1',
    'I-ARG1',
    'I-ARG1',
    'O']}],
 'words': ['Republicans',
  'and',
  'Democrats',
  'have',
  'both',
  'created',
  'our',
  'economic',
  'problems',
  '.']}
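
The `tags` list uses a BIO scheme: `B-` opens a span, `I-` continues it, and `O` marks tokens outside any role. As a minimal sketch of how such tags map back to role text (this is an illustration, not `relatio`'s own extraction code, and `spans_from_bio` is a hypothetical helper), the annotation for `created` above can be decoded as follows:

```python
def spans_from_bio(words, tags):
    """Group BIO-tagged tokens into {role: text}; a later span of the same role overwrites an earlier one."""
    spans = []        # list of (role, [tokens]) in order of appearance
    open_role = None  # role of the span currently being extended, if any
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            open_role = tag[2:]
            spans.append((open_role, [word]))
        elif tag.startswith("I-") and tag[2:] == open_role:
            spans[-1][1].append(word)
        else:
            open_role = None
    return {role: " ".join(tokens) for role, tokens in spans}


words = ["Republicans", "and", "Democrats", "have", "both",
         "created", "our", "economic", "problems", "."]
tags = ["B-ARG0", "I-ARG0", "I-ARG0", "O", "O",
        "B-V", "B-ARG1", "I-ARG1", "I-ARG1", "O"]

print(spans_from_bio(words, tags))
```

Note that this sketch strips the `B-` prefix, so the verb is stored under `V`, whereas the role dictionaries later in this tutorial use keys such as `B-V`.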

NB: This step is faster with a GPU. The `cuda_device` argument selects the GPU to use (`-1` runs on the CPU):

```
import torch
print(torch.cuda.is_available())
print(torch.cuda.current_device())

# Use a lowercase instance name so the `SRL` class is not shadowed
srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = 0
)

srl_res = srl(split_sentences[1], progress_bar=True)
```
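
To keep the same notebook usable on machines with and without a GPU, the device index can be chosen at runtime. The helper below is a hypothetical convenience, not part of `relatio`; it assumes that `-1` means CPU, as in the CPU example above, and falls back to it when PyTorch or CUDA is unavailable.

```python
def pick_cuda_device():
    """Return the current CUDA device index, or -1 (CPU) if no GPU is usable."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only the CPU can be used.
        return -1
    if torch.cuda.is_available():
        return torch.cuda.current_device()
    return -1


cuda_device = pick_cuda_device()
print(cuda_device)
```

The resulting value can then be passed as the `cuda_device` argument shown above.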

In [6]:
# To save us some time, we download the results from the datasets module.
split_sentences = load_trump_data("split_sentences")
srl_res = load_trump_data("srl_res")

## Step 3: Pre-process semantic roles

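----------------------------

Next, we extract the semantic roles from the annotated sentences, clean them, and mine the named entities in the corpus.

----------------------------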

In [7]:
roles, sentence_index = p.extract_roles(
    srl_res, 
    used_roles = ["ARG0","B-V","B-ARGM-NEG","B-ARGM-MOD","ARG1","ARG2"],
    progress_bar = True
)

for d in roles[0:5]: print(d)

Extracting semantic roles...


100%|██████████████████████████████████| 68616/68616 [00:01<00:00, 43712.22it/s]

{'B-V': 'have'}
{'ARG0': 'Republicans and Democrats', 'ARG1': 'our economic problems', 'B-V': 'created'}
{'ARG1': 'I', 'ARG2': 'thrilled to be back in the Great city of Charlotte , North Carolina with thousands of hardworking American Patriots who love our Country , cherish our values , respect our laws , and always put AMERICA FIRST', 'B-V': 'was'}
{'ARG1': 'I', 'ARG2': 'to be back in the Great city of Charlotte , North Carolina with thousands of hardworking American Patriots who love our Country , cherish our values , respect our laws , and always put AMERICA FIRST', 'B-V': 'thrilled'}
{'ARG1': 'I', 'ARG2': 'back in the Great city of Charlotte , North Carolina', 'B-V': 'be'}





In [8]:
postproc_roles = p.process_roles(roles, 
                                 dict_of_pos_tags_to_keep = {
                                     "ARG0": ['NOUN', 'PROPN'],
                                     "B-V": ['VERB'],
                                     "ARG1": ['NOUN', 'PROPN'],
                                     "ARG2": ['NOUN', 'PROPN']
                                 }, 
                                 progress_bar = True,
                                 output_path = 'postproc_roles.json')

from relatio.utils import load_roles
postproc_roles = load_roles('postproc_roles.json')

for d in postproc_roles[0:5]: print(d)

Cleaning roles ARG0...


100%|███████████████████████████████████| 49281/49281 [00:15<00:00, 3187.87it/s]


Cleaning roles B-V...


100%|█████████████████████████████████| 135562/135562 [00:41<00:00, 3284.41it/s]


Cleaning roles B-ARGM-MOD...


100%|███████████████████████████████████| 13752/13752 [00:04<00:00, 3097.86it/s]


Cleaning roles ARG1...


100%|███████████████████████████████████| 88730/88730 [00:31<00:00, 2789.57it/s]


Cleaning roles ARG2...


100%|███████████████████████████████████| 32132/32132 [00:11<00:00, 2701.72it/s]


{'B-V': 'have'}
{'ARG0': 'republicans democrats', 'B-V': 'create', 'ARG1': 'problem'}
{'B-V': '', 'ARG1': '', 'ARG2': 'city charlotte north carolina thousand patriots country value law america'}
{'B-V': 'thrill', 'ARG1': '', 'ARG2': 'city charlotte north carolina thousand patriots country value law america'}
{'B-V': 'be', 'ARG1': '', 'ARG2': 'city charlotte north carolina'}


In [9]:
known_entities = p.mine_entities(
    split_sentences[1], 
    clean_entities = True, 
    progress_bar = True,
    output_path = 'entities.pkl'
)

from relatio.utils import load_entities
known_entities = load_entities('entities.pkl')

for n in known_entities.most_common(10): print(n)

Mining named entities...


100%|███████████████████████████████████| 68616/68616 [00:32<00:00, 2093.99it/s]

('democrats', 1149)
('obama', 1023)
('china', 932)
('u.s.', 836)
('trump', 798)
('america', 768)
('american', 589)
('barackobama', 512)
('republicans', 502)
('foxnews', 502)





In [10]:
top_known_entities = [e[0] for e in list(known_entities.most_common(100))]
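
The `most_common` call suggests that the mined entities behave like a `collections.Counter`. Assuming so, the selection step can be tried on a toy counter built from the counts printed above. The helpers `top_entities` and `entities_with_min_count` are hypothetical and only illustrate two ways of choosing a cutoff.

```python
from collections import Counter

# Counts copied from the output above; treating the mined entities as a
# plain Counter is an assumption.
toy_entities = Counter({
    "democrats": 1149,
    "obama": 1023,
    "china": 932,
    "u.s.": 836,
    "trump": 798,
})


def top_entities(counter, k):
    """Return the k most frequent entity names, most frequent first."""
    return [name for name, _ in counter.most_common(k)]


def entities_with_min_count(counter, min_count):
    """Return, in alphabetical order, the entities seen at least min_count times."""
    return sorted(name for name, count in counter.items() if count >= min_count)


print(top_entities(toy_entities, 3))
print(entities_with_min_count(toy_entities, 900))
```

A fixed `k` keeps the entity list at a predictable size, while a frequency cutoff adapts to the corpus; which one is preferable depends on the application.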

## Step 4: Build the narrative model

----------------------------

We are now ready to build a narrative model.

----------------------------

In [11]:
from relatio._embeddings import Embeddings
nlp_model = Embeddings("TensorFlow_USE","https://tfhub.dev/google/universal-sentence-encoder/4")


In [12]:
from relatio.narrative_models import *
from relatio.utils import prettify
from collections import Counter

m = NarrativeModel(model_type = 'deterministic',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_entities = ['ARG0','ARG1','ARG2'],
                   list_of_known_entities = top_known_entities,
                   assignment_to_known_entities = 'character_matching',
                   roles_with_embeddings = [],
                   embeddings_model = None,
                   threshold = 1)    

m.train(postproc_roles)

No training required: the model is deterministic.
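
To give an intuition for what a character-based assignment to known entities could look like, the sketch below searches each role for known entities as whole words and joins the matches with `|`, mirroring entries such as `republicans|house` in this tutorial's output. This is an assumption about the general idea, not `relatio`'s actual implementation, and `match_known_entities` is a hypothetical helper.

```python
import re


def match_known_entities(role_text, known_entities):
    """Return the known entities found as whole words in role_text, joined by '|'."""
    found = [
        entity
        for entity in known_entities
        # Lookarounds instead of \b so that entities ending in '.' still match.
        if re.search(r"(?<!\w)" + re.escape(entity) + r"(?!\w)", role_text)
    ]
    return "|".join(found)


known = ["democrats", "republicans", "china", "u.s."]
print(match_known_entities("republicans democrats", known))
print(match_known_entities("trade with u.s.", known))
print(repr(match_known_entities("problem", known)))
```

In this sketch the matches follow the order of the known-entity list, and a role without any known entity yields an empty string.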


In [13]:
import cProfile
cProfile.run("narratives = m.predict(postproc_roles, progress_bar = True, prettify = False)")

100%|█████████████████████████████████| 150213/150213 [00:18<00:00, 8327.75it/s]


         73924340 function calls (73121661 primitive calls) in 19.164 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   19.164   19.164 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 _monitor.py:93(report)
        1    0.000    0.000    0.000    0.000 _weakrefset.py:106(remove)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:16(__init__)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:20(__enter__)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:26(__exit__)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:52(_commit_removals)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:58(__iter__)
        1    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
 802680/1    0.531    0.000    1.122    1.122 copy.py:128(deepcopy)
   652466    0.042    0.000    0.042    0.000 copy.py:182(_deepcopy_atomic)
        1    0.048

In [14]:
# Keep only complete statements (ARG0, B-V, and ARG1 all non-empty)
pretty_narratives = []
for n in narratives:
    if all(n.get(role) not in ["", None] for role in ['ARG0', 'B-V', 'ARG1']):
        pretty_narratives.append(prettify(n))

# Count identical statements and show the ten most frequent
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10):
    print(t)

('house pass bill', 4)
('china take u.s.', 4)
('pelosi demand republicans|house', 3)
('dem want senate', 3)
('dem run senate', 3)
('north carolina make republican', 3)
('foxnews discuss barackobama', 3)
('canada look china', 3)
('foxnews discuss mittromney', 3)
('iran take iraq', 3)


In [15]:
m = NarrativeModel(model_type = 'dynamic',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_entities = ['ARG0','ARG1','ARG2'],
                   list_of_known_entities = top_known_entities,
                   assignment_to_known_entities = 'embeddings',
                   roles_with_embeddings = [['ARG0','ARG1','ARG2']],
                   embeddings_model = None,
                   threshold = 0.3)    

m.train(postproc_roles, progress_bar = True)

Computing vectors for known entities...


100%|██████████████████████████████████| 150213/150213 [17:05<00:00, 146.45it/s]
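
An embedding-based assignment can be pictured as a nearest-neighbour search with a cutoff. The sketch below is a toy version under stated assumptions: it uses made-up two-dimensional vectors, cosine distance, and treats `threshold` as the maximum distance at which a match is accepted. How `relatio` actually interprets `threshold` is not shown in this tutorial, so this reading is an assumption, and `assign_entity` is a hypothetical helper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def assign_entity(vector, entity_vectors, threshold):
    """Return the closest entity name, or None if it is farther than threshold."""
    best_name, best_dist = None, float("inf")
    for name, entity_vec in entity_vectors.items():
        dist = cosine_distance(vector, entity_vec)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


# Made-up 2-d vectors; real sentence embeddings have hundreds of dimensions.
entity_vectors = {"china": [1.0, 0.0], "senate": [0.0, 1.0]}

print(assign_entity([0.9, 0.1], entity_vectors, threshold=0.3))
print(assign_entity([-1.0, 0.5], entity_vectors, threshold=0.3))
```

Under this reading, a smaller threshold makes the assignment stricter, and phrases far from every known entity are left unassigned.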


In [16]:
cProfile.run("narratives = m.predict(postproc_roles, progress_bar = True, prettify = False)")

 46%|████████████████▋                   | 69499/150213 [14:01<16:16, 82.62it/s]


         178271273 function calls (164777428 primitive calls) in 842.214 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    50486    0.090    0.000    0.890    0.000 <__array_function__ internals>:2(amax)
    50485    0.096    0.000    1.893    0.000 <__array_function__ internals>:2(amin)
   128983    0.132    0.000    1.030    0.000 <__array_function__ internals>:2(atleast_1d)
   100972    0.091    0.000    0.657    0.000 <__array_function__ internals>:2(atleast_2d)
   583357    0.542    0.000    3.440    0.000 <__array_function__ internals>:2(concatenate)
    78497    0.104    0.000    0.343    0.000 <__array_function__ internals>:2(count_nonzero)
    50486    0.047    0.000    0.256    0.000 <__array_function__ internals>:2(dot)
   128983    0.140    0.000    2.672    0.000 <__array_function__ internals>:2(hstack)
    50486    0.071    0.000    0.995    0.000 <__array_function__ internals>:2(norm)
   112044    0.111    

KeyboardInterrupt: 

In [None]:
# Keep only complete statements (ARG0, B-V, and ARG1 all non-empty)
pretty_narratives = []
for n in narratives:
    if all(n.get(role) not in ["", None] for role in ['ARG0', 'B-V', 'ARG1']):
        pretty_narratives.append(prettify(n))

# Count identical statements and show the ten most frequent
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10):
    print(t)

## Step 5: Model validation and basic analysis

----------------------------

To check the model, we print each original statement next to the narrative it is mapped to.

----------------------------

In [None]:
# Print each complete statement next to the narrative the model assigns to it
for i, k in enumerate(sentence_index):
    n = narratives[i]
    r = roles[i]

    if all(n.get(role) not in ["", None] for role in ['ARG0', 'B-V', 'ARG1']):
        print('Original statement:')
        print(prettify(r))
        print('\n')
        print('Underlying narrative:')
        print(prettify(n))
        print('\n')
                

## Step 6: Visualization: plotting narrative graphs

----------------------------

A collection of narrative statements has an intuitive network structure, in which the edges are verbs and the nodes are entities.

Here, we plot Trump's narrative statements on Twitter.

----------------------------
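
The network structure described above can be sketched without any plotting library. The example below builds a weighted edge list from a few toy statements, treating `ARG0` as the source node, `ARG1` as the target node, and the verb as the edge label; the direction of the edges and the helpers `build_edges` and `adjacency` are assumptions made for illustration, not `relatio`'s graph API.

```python
from collections import Counter, defaultdict

# Toy statements; the role dicts mirror the (ARG0, B-V, ARG1) structure used above.
toy_narratives = [
    {"ARG0": "house", "B-V": "pass", "ARG1": "bill"},
    {"ARG0": "house", "B-V": "pass", "ARG1": "bill"},
    {"ARG0": "china", "B-V": "take", "ARG1": "u.s."},
    {"ARG0": "iran", "B-V": "take", "ARG1": "iraq"},
]


def build_edges(narratives):
    """Count (source, verb, target) triples; each triple is a directed, labelled edge."""
    return Counter((n["ARG0"], n["B-V"], n["ARG1"]) for n in narratives)


def adjacency(edges):
    """Map each source node to its outgoing (verb, target, weight) edges."""
    graph = defaultdict(list)
    for (source, verb, target), weight in edges.items():
        graph[source].append((verb, target, weight))
    return dict(graph)


edges = build_edges(toy_narratives)
graph = adjacency(edges)
print(graph["house"])
```

The edge weight is the number of times a statement occurs, which is a natural candidate for edge thickness when the graph is drawn.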