# An introduction to `relatio` 
**Runtime $\sim$ 1h**

Original paper: ["Text Semantics Capture Political and Economic Narratives"](https://arxiv.org/abs/2108.01720)

----------------------------

This is a short demo of the package `relatio`. It takes a text corpus as input and outputs a list of narrative statements. The pipeline is unsupervised: the user does not need to specify narratives beforehand. Narrative statements are defined as tuples of semantic roles with an (agent, verb, patient, attribute) structure.
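
To make the tuple structure concrete, here is a minimal sketch of how one statement could be held in plain Python. The `NarrativeStatement` class and the `from_roles` helper are hypothetical and not part of `relatio`. The mapping of role labels to slots (`ARG0` to agent, `B-V` to verb, `ARG1` to patient, `ARG2` to attribute) is an assumption based on the role names used later in this tutorial.

```python
from typing import NamedTuple, Optional


class NarrativeStatement(NamedTuple):
    """Hypothetical container for one narrative statement (not a relatio class)."""

    agent: Optional[str]
    verb: str
    patient: Optional[str]
    attribute: Optional[str] = None


def from_roles(roles: dict) -> NarrativeStatement:
    # Assumed mapping from semantic role labels to the four slots.
    return NarrativeStatement(
        agent=roles.get("ARG0"),
        verb=roles["B-V"],
        patient=roles.get("ARG1"),
        attribute=roles.get("ARG2"),
    )


statement = from_roles(
    {"ARG0": "republicans democrats", "B-V": "create", "ARG1": "problem"}
)
print(statement)
```

Slots without a matching role are left as `None`, which mirrors the fact that many sentences do not fill all four roles.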

Here, we present the main wrapper functions to quickly obtain narrative statements from a corpus.

----------------------------

We provide datasets that have already been split into sentences and annotated by our team.

The datasets are provided in three different formats:
 1. `raw` (unprocessed)
 2. `split_sentences` (as a list of sentences)
 3. `srl_res` (as a list of sentences annotated by the semantic role labeler)

In this tutorial, we work with the Trump Tweet Archive corpus.

----------------------------

In [None]:
import cProfile

In [1]:
# Catch warnings for an easy ride
from relatio import FileLogger
logger = FileLogger(level = 'WARNING')

2022-03-07 14:03:28.094156: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-07 14:03:28.094175: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
# Browse list of available datasets
from relatio.datasets import list_datasets
print(list_datasets())

# Load an available dataset
from relatio.datasets import load_trump_data
df = load_trump_data("raw")


    List of available datasets:

    Trump Tweet Archive
    - function call: load_trump_data()
    - format: 'raw', 'split_sentences', 'srl_res'
    - allennlp version: 0.9
    - srl model: srl-model-2018.05.25.tar.gz
    


## Step 1: Split into sentences

----------------------------

For any new corpus, the first thing you will want to do is to split the corpus into sentences.

We do this on the first 100 tweets. 

The output is two lists: one with an index for the document and one with the resulting split sentences.
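
As a rough illustration of how the two lists fit together, the sketch below regroups sentences by document. It assumes that the lists are aligned, so that the i-th entry of the index identifies the document of the i-th sentence. The data are toy values and `group_by_document` is a hypothetical helper, not a `relatio` function.

```python
from collections import defaultdict

# Toy stand-ins for the two lists described above (not real corpus data).
doc_index = [0, 0, 1, 2, 2, 2]
sentences = [
    "First document, sentence one.",
    "First document, sentence two.",
    "Second document, only sentence.",
    "Third document, sentence one.",
    "Third document, sentence two.",
    "Third document, sentence three.",
]


def group_by_document(doc_index, sentences):
    """Map each document identifier to the list of its sentences."""
    if len(doc_index) != len(sentences):
        raise ValueError("doc_index and sentences must have the same length")
    grouped = defaultdict(list)
    for doc_id, sentence in zip(doc_index, sentences):
        grouped[doc_id].append(sentence)
    return dict(grouped)


grouped = group_by_document(doc_index, sentences)
for doc_id, doc_sentences in grouped.items():
    print(doc_id, len(doc_sentences))
```

Keeping the index makes it possible to trace any narrative statement back to the document it came from.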

----------------------------


In [9]:
from relatio.preprocessing import Preprocessor

p = Preprocessor(
    spacy_model = "en_core_web_md",
    remove_punctuation = True,
    remove_digits = True,
    lowercase = True,
    lemmatize = True,
    remove_chars = None,
    stop_words = [],
    n_process = -1,
    batch_size = 100
)

split_sentences = p.split_into_sentences(
    df.iloc[0:100], output_path='sentences.json', progress_bar=True
)

from relatio.utils import load_sentences
doc_index, sentences = load_sentences('sentences.json')

Splitting into sentences...


100%|████████████████████████████████████████| 100/100 [00:00<00:00, 342.05it/s]


## Step 2: Annotate semantic roles

----------------------------

Once the corpus is split into sentences, you can feed it to the semantic role labeler.

The output is a list of JSON objects that contain the semantic role annotations for each sentence in the corpus.

----------------------------


In [10]:
from relatio.semantic_role_labeling import SRL

# Use a lowercase name for the instance so that it does not shadow the SRL class
srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = -1
)

srl_res = srl(split_sentences[1], progress_bar=True)

Running SRL...


100%|███████████████████████████████████████████| 21/21 [00:05<00:00,  3.89it/s]


In [11]:
srl_res[0]

{'verbs': [{'verb': 'have',
   'description': 'Republicans and Democrats [V: have] both created our economic problems .',
   'tags': ['O', 'O', 'O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O']},
  {'verb': 'created',
   'description': '[ARG0: Republicans and Democrats] have both [V: created] [ARG1: our economic problems] .',
   'tags': ['B-ARG0',
    'I-ARG0',
    'I-ARG0',
    'O',
    'O',
    'B-V',
    'B-ARG1',
    'I-ARG1',
    'I-ARG1',
    'O']}],
 'words': ['Republicans',
  'and',
  'Democrats',
  'have',
  'both',
  'created',
  'our',
  'economic',
  'problems',
  '.']}
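
The `tags` lists above follow the BIO convention: `B-` opens a span, `I-` continues it, and `O` marks tokens outside any role. The sketch below shows one way to turn such tags into role phrases. It is a simplified illustration, not the implementation used by `relatio`: `tags_to_roles` is a hypothetical helper, it labels the verb `V` rather than `B-V`, and it merges repeated spans of the same role into one phrase.

```python
def tags_to_roles(words, tags):
    """Group BIO-tagged tokens into a {role: phrase} dictionary."""
    roles = {}
    for word, tag in zip(words, tags):
        if tag == "O":
            continue  # token does not belong to any role
        # "B-ARG0" -> "ARG0", "I-ARG1" -> "ARG1", "B-ARGM-NEG" -> "ARGM-NEG"
        _, role = tag.split("-", 1)
        roles.setdefault(role, []).append(word)
    return {role: " ".join(tokens) for role, tokens in roles.items()}


# Words and tags copied from the annotation of the verb 'created' shown above.
words = ["Republicans", "and", "Democrats", "have", "both",
         "created", "our", "economic", "problems", "."]
tags = ["B-ARG0", "I-ARG0", "I-ARG0", "O", "O",
        "B-V", "B-ARG1", "I-ARG1", "I-ARG1", "O"]

extracted = tags_to_roles(words, tags)
print(extracted)
```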

NB: This step is faster with a GPU. Set the `cuda_device` argument to the index of the GPU to use (`-1`, as above, runs on the CPU):

```
import torch
print(torch.cuda.is_available())
print(torch.cuda.current_device())

# Use a lowercase name for the instance so that it does not shadow the SRL class
srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = 0
)

srl_res = srl(split_sentences[1], progress_bar=True)
```
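
If the same notebook has to run on machines with and without a GPU, the device index can be chosen at run time. The sketch below is one possible approach; `pick_cuda_device` is a hypothetical helper, not part of `relatio`. It checks availability first because `torch.cuda.current_device()` typically raises an error when no GPU is usable.

```python
def pick_cuda_device() -> int:
    """Return 0 if a CUDA GPU is usable, and -1 (CPU) otherwise."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to the CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


cuda_device = pick_cuda_device()
print(cuda_device)
```

The returned value can then be passed as the `cuda_device` argument shown above.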

In [12]:
# To save us some time, we download the results from the datasets module.
# split_sentences = load_trump_data("split_sentences")
# srl_res = load_trump_data("srl_res")

## Step 3: Pre-process semantic roles

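----------------------------

The semantic role labeler returns token-level tags. We first extract the phrase attached to each role of interest, then clean these phrases (lowercasing, lemmatizing, and keeping only selected parts of speech), and finally mine the entities mentioned in the corpus.

----------------------------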

In [13]:
from relatio.semantic_role_labeling import extract_roles

roles, sentence_index = extract_roles(
    srl_res, 
    used_roles = ["ARG0","B-V","B-ARGM-NEG","B-ARGM-MOD","ARG1","ARG2"],
    progress_bar = True
)

for d in roles[0:5]: print(d)

Extracting semantic roles...


100%|██████████████████████████████████████| 205/205 [00:00<00:00, 21833.13it/s]

{'B-V': 'have'}
{'ARG0': 'Republicans and Democrats', 'ARG1': 'our economic problems', 'B-V': 'created'}
{'ARG1': 'I', 'ARG2': 'thrilled to be back in the Great city of Charlotte , North Carolina with thousands of hardworking American Patriots who love our Country , cherish our values , respect our laws , and always put AMERICA FIRST', 'B-V': 'was'}
{'ARG1': 'I', 'ARG2': 'back in the Great city of Charlotte , North Carolina with , respect our laws , and always put AMERICA FIRST', 'B-V': 'be'}
{'ARG0': 'thousands of hardworking American Patriots who', 'ARG1': 'our Country', 'B-V': 'love'}





In [14]:
postproc_roles = p.process_roles(roles, 
                                 dict_of_pos_tags_to_keep = {
                                     "ARG0": ['NOUN', 'PROPN'],
                                     "B-V": ['VERB'],
                                     "ARG1": ['NOUN', 'PROPN'],
                                     "ARG2": ['NOUN', 'PROPN']
                                 }, 
                                 progress_bar = True,
                                 output_path = 'postproc_roles.json')

from relatio.utils import load_roles
postproc_roles = load_roles('postproc_roles.json')

for d in postproc_roles[0:5]: print(d)

Cleaning phrases for role ARG0...


100%|████████████████████████████████████████| 136/136 [00:00<00:00, 495.03it/s]


Cleaning phrases for role B-V...


100%|████████████████████████████████████████| 377/377 [00:00<00:00, 956.39it/s]


Cleaning phrases for role B-ARGM-MOD...


100%|██████████████████████████████████████████| 30/30 [00:00<00:00, 109.17it/s]


Cleaning phrases for role ARG1...


100%|████████████████████████████████████████| 258/258 [00:00<00:00, 672.76it/s]


Cleaning phrases for role ARG2...


100%|██████████████████████████████████████████| 71/71 [00:00<00:00, 258.14it/s]

{'B-V': 'have'}
{'ARG0': 'republicans democrats', 'B-V': 'create', 'ARG1': 'problem'}
{'ARG2': 'cit charlotte north carolina thousand patriots countr value law america'}
{'ARG2': 'cit charlotte north carolina law america'}
{'ARG0': 'thousand patriots', 'ARG1': 'countr'}





In [None]:
known_entities = p.mine_entities(
    split_sentences[1], 
    clean_entities = True, 
    progress_bar = True,
    output_path = 'entities.pkl'
)

from relatio.utils import load_entities
known_entities = load_entities('entities.pkl')

for n in known_entities.most_common(10): print(n)

In [None]:
# Keep the 100 most frequent entities, dropping the empty string
top_known_entities = [entity for entity, _ in known_entities.most_common(100) if entity != '']

## Step 4: Build a narrative model

----------------------------

We are now ready to build a narrative model.

----------------------------

In [None]:
from relatio.narrative_models import NarrativeModel
from relatio.utils import prettify
from collections import Counter

m = NarrativeModel(model_type = 'deterministic',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_known_entities = ['ARG0','ARG1','ARG2'],
                   known_entities = top_known_entities,
                   assignment_to_known_entities = 'character_matching',
                   roles_with_unknown_entities = [['ARG0','ARG1','ARG2']],
                   embeddings_model = None,
                   threshold = 1)    

m.train(postproc_roles)

In [None]:
cProfile.run("narratives = m.predict(postproc_roles, progress_bar = True)")

In [None]:
pretty_narratives = []
for n in narratives:
    # Keep only statements that have an agent, a verb, and a patient
    if all(n.get(role) is not None for role in ('ARG0', 'B-V', 'ARG1')):
        pretty_narratives.append(prettify(n))
                
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10): print(t)

In [None]:
from relatio import Embeddings
nlp_model = Embeddings("TensorFlow_USE","https://tfhub.dev/google/universal-sentence-encoder/4")
# nlp_model = Embeddings("Gensim_pretrained", "glove-twitter-25")

In [None]:
m = NarrativeModel(model_type = 'static',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_known_entities = ['ARG0','ARG1','ARG2'],
                   known_entities = top_known_entities,
                   assignment_to_known_entities = 'embeddings',
                   roles_with_unknown_entities = [['ARG0','ARG1','ARG2']],
                   n_clusters = [100],
                   embeddings_model = nlp_model,
                   threshold = 0.3)    

m.train(postproc_roles, progress_bar = True)

In [None]:
cProfile.run("narratives = m.predict(postproc_roles, progress_bar = True)")

In [None]:
m = NarrativeModel(model_type = 'dynamic',
                   roles_considered = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'],
                   roles_with_known_entities = ['ARG0','ARG1','ARG2'],
                   known_entities = top_known_entities,
                   assignment_to_known_entities = 'character_matching',
                   roles_with_unknown_entities = [['ARG0','ARG1','ARG2']],
                   embeddings_model = nlp_model,
                   threshold = 0.3)    

m.train(postproc_roles, progress_bar = True)

In [None]:
cProfile.run("narratives = m.predict(postproc_roles[0:100], progress_bar = True)")

In [None]:
pretty_narratives = []
for n in narratives: 
    pretty_narratives.append(prettify(n))
                
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(10): print(t)

## Step 5: Model validation and basic analysis

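----------------------------

We inspect the processed roles and the predicted narratives, check the size of the vocabulary of unknown entities and the shape of its embedding matrix, and list the most frequent complete narratives.

----------------------------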

In [None]:
postproc_roles[0:10]

In [None]:
narratives[0:10]

In [None]:
m._model_obj.vectors_unknown_entities[0].shape

In [None]:
len(m._model_obj.vocab_unknown_entities[0])

In [None]:
pretty_narratives = []
for n in narratives:
    # Keep only statements with a non-empty agent, verb, and patient
    if all(n.get(role) not in ("", None) for role in ('ARG0', 'B-V', 'ARG1')):
        pretty_narratives.append(prettify(n))
                
pretty_narratives = Counter(pretty_narratives)
for t in pretty_narratives.most_common(100): print(t)

## Step 6: Visualization (plotting narrative graphs)

----------------------------

A collection of narrative statements has an intuitive network structure, in which the edges are verbs and the nodes are entities.

Here, we plot Trump's narrative statements on Twitter.
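
The network structure described above can be sketched with the standard library alone. The example builds a weighted, directed edge list in which each edge runs from agent to patient and is labeled by the verb. The statements are toy data rather than model output, and `build_edges` is a hypothetical helper, not the plotting function provided by `relatio`.

```python
from collections import Counter, defaultdict

# Toy narrative statements (illustrative only, not real model output).
toy_narratives = [
    {"ARG0": "government", "B-V": "raise", "ARG1": "tax"},
    {"ARG0": "government", "B-V": "raise", "ARG1": "tax"},
    {"ARG0": "people", "B-V": "love", "ARG1": "country"},
    {"B-V": "have"},  # incomplete statement, skipped below
]


def build_edges(narratives):
    """Count (agent, verb, patient) triples; each triple is one labeled edge."""
    edges = Counter()
    for n in narratives:
        agent, verb, patient = n.get("ARG0"), n.get("B-V"), n.get("ARG1")
        if agent and verb and patient:
            edges[(agent, verb, patient)] += 1
    return edges


edges = build_edges(toy_narratives)

# Adjacency view: each agent node maps to its outgoing (verb, patient, weight) edges.
adjacency = defaultdict(list)
for (agent, verb, patient), weight in edges.items():
    adjacency[agent].append((verb, patient, weight))

for agent, outgoing in adjacency.items():
    print(agent, outgoing)
```

The edge weights (how often a statement occurs) are a natural choice for edge thickness when the graph is drawn.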

----------------------------