# Demo: Creating and inspecting a narrative graph

This notebook is a short tour of some of the core functionality of a `NarrativeGraph` object.

## Data setup

For this demo notebook, we will be using the _News Category Dataset_ [1, 2], available on Kaggle, because its texts are short, timestamped, and categorized.

In [1]:
from kagglehub import KaggleDatasetAdapter
import kagglehub

data = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "rmisra/news-category-dataset",
    "News_Category_Dataset_v3.json",
    pandas_kwargs=dict(lines=True),
)
data.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


We will use the following columns as input for our narrative graph:
- Documents: _headline_ + _short_description_
- IDs: _link_, with the URL prefix shared by all rows removed
- Timestamps: _date_
- Categories: _category_

There are many categories. We will create a subset with just two of them: _U.S. News_ and _Politics_.

In [2]:
# create a sample
sample = data[data["category"].isin(["U.S. NEWS", "POLITICS"])].sample(
    5000, random_state=42
)
docs = sample["headline"] + "\n\n" + sample["short_description"]
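# Series.replace only matches whole cell values, so use .str.replace to strip the prefix
ids = sample["link"].str.replace("https://www.huffpost.com/entry/", "", regex=False)  # get rid of the first part of the URL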
categories = sample["category"]
timestamps = sample["date"]
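
When stripping a shared prefix from a column of URLs, it matters which pandas method is used: `Series.replace` matches whole cell values, while `Series.str.replace` operates on substrings. The following is a minimal sketch on a toy Series; the two URLs are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the "link" column; these URLs are hypothetical.
links = pd.Series(
    [
        "https://www.huffpost.com/entry/example-story-one",
        "https://www.huffpost.com/entry/example-story-two",
    ]
)
prefix = "https://www.huffpost.com/entry/"

# Series.replace compares against entire values, so no cell matches the prefix.
unchanged = links.replace(prefix, "")

# Series.str.replace works on substrings; regex=False treats the prefix literally.
stripped = links.str.replace(prefix, "", regex=False)

print(unchanged.tolist())
print(stripped.tolist())
```

Only the second call shortens the values, which is what is needed for compact document IDs.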

## Creating the model

Once we have our list of documents, which is the only required input, and extra metadata in aligned lists, we can create a narrative graph.

In [3]:
from narrativegraphs import NarrativeGraph

model = NarrativeGraph()
model.fit(docs, doc_ids=ids, categories=categories, timestamps=timestamps)

INFO:narrativegraphs.pipeline:Adding 5000 documents to database
INFO:narrativegraphs.pipeline:Extracting triplets
Extracting triplets: 100%|██████████| 5000/5000 [00:43<00:00, 113.66it/s] 
INFO:narrativegraphs.pipeline:Resolving entities and predicates
INFO:narrativegraphs.pipeline:Mapping triplets and tuplets
INFO:narrativegraphs.pipeline:Calculating stats


<narrativegraphs.narrativegraph.NarrativeGraph at 0x148052cc0>

## Inspecting the model visually

One of the key features of the _narrativegraphs_ package is that it lets a user inspect the output interactively in a browser-based visualizer. It is hosted directly on your machine by the Python package – no extra dependencies required. This is achieved with the one line below.

Click the link in the log messages to open in your browser.

In [4]:
# start a local server to view in your own browser; this blocks execution of other cells until stopped
model.serve_visualizer()

INFO:     Started server process [57523]
INFO:     Waiting for application startup.
INFO:root:Database engine provided to state before startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [57523]
INFO:root:Server stopped by user


Stop the server by clicking the cell's stop button in Jupyter Notebook, or by pressing CTRL+C elsewhere.

## Inspecting and accessing the model programmatically

The graph consists of entities as nodes and their relations or cooccurrences as edges. These, along with the data that back them, like documents and extracted semantic triplets, can be retrieved from the model through properties or service attributes.

### Attributes

We can get the graph as a whole, as a NetworkX graph, through the properties `.relation_graph_` and `.cooccurrence_graph_`.

In [13]:
relation_graph = model.relation_graph_

In [14]:
print(type(relation_graph))

<class 'networkx.classes.digraph.DiGraph'>


In [15]:
print(*list(relation_graph.nodes(data=True))[:3], sep="\n")

(1, {'id': 1, 'label': 'metal detectors', 'frequency': 3, 'focus': False})
(2, {'id': 2, 'label': 'a seat', 'frequency': 4, 'focus': False})
(3, {'id': 3, 'label': 'Jill Stein’s', 'frequency': 1, 'focus': False})


Similarly, entities, relations, and the other underlying data can be accessed as `pandas.DataFrame`s through properties.

In [16]:
model.entities_

Unnamed: 0,id,label,frequency,doc_frequency,spread,adjusted_tf_idf,first_occurrence,last_occurrence,alt_labels,category
0,1,metal detectors,3,2,0.0004,3333.333333,2020-02-02,2021-01-13,"[""metal detectors""]","[POLITICS, U.S. NEWS, POLITICS]"
1,2,a seat,4,3,0.0006,3750.000000,2017-05-12,2018-02-28,"[""a seat""]","[POLITICS, POLITICS, POLITICS, POLITICS]"
2,3,Jill Stein’s,1,1,0.0002,0.000000,2016-12-02,2016-12-02,"[""Jill Stein’s""]",[POLITICS]
3,4,The Senate's Stealth Raid,1,1,0.0002,0.000000,2017-06-14,2017-06-14,"[""The Senate's Stealth Raid""]",[POLITICS]
4,5,Trump-Style\n\n,1,1,0.0002,0.000000,2017-09-25,2017-09-25,"[""Trump-Style\n\n""]",[POLITICS]
...,...,...,...,...,...,...,...,...,...,...
12430,12431,Blames Obama,1,1,0.0002,0.000000,2018-10-24,2018-10-24,"[""Blames Obama""]",[POLITICS]
12431,12432,all the right moves,1,1,0.0002,0.000000,2016-02-14,2016-02-14,"[""all the right moves""]",[POLITICS]
12432,12433,the visitor's gallery,2,1,0.0002,2500.000000,2015-03-05,2015-03-05,"[""the visitor's gallery""]","[POLITICS, POLITICS]"
12433,12434,the edge,1,1,0.0002,0.000000,2016-11-30,2016-11-30,"[""the edge""]",[POLITICS]
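
Because the result is an ordinary DataFrame, the usual pandas filtering and sorting apply. The sketch below uses a small hand-built frame that borrows a few column names and values from the output above; it is a stand-in for illustration, not the real `model.entities_`.

```python
import pandas as pd

# Toy frame with a subset of the columns shown above (values copied from three rows).
entities = pd.DataFrame(
    {
        "id": [1, 2, 12434],
        "label": ["metal detectors", "a seat", "the edge"],
        "frequency": [3, 4, 1],
        "doc_frequency": [2, 3, 1],
    }
)

# Keep entities mentioned in at least two documents, most frequent first.
frequent = (
    entities[entities["doc_frequency"] >= 2]
    .sort_values("frequency", ascending=False)
    .reset_index(drop=True)
)

print(frequent[["label", "frequency"]])
```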


The properties (with trailing `_`) are nice in that they give back the data in well-known formats that one can continue working with, e.g. NetworkX graphs for graph algorithms and DataFrames for statistical analyses.
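
As a rough illustration of running a graph algorithm on such a graph, the sketch below computes in-degree centrality with NetworkX. The toy `DiGraph` mimics the node attribute layout printed earlier (`label`, `frequency`), but its nodes and edges are invented for the example.

```python
import networkx as nx

# Toy directed graph; node attributes follow the shape shown earlier, values are made up.
toy = nx.DiGraph()
toy.add_node(1, label="White House", frequency=30)
toy.add_node(2, label="Congress", frequency=12)
toy.add_node(3, label="the bill", frequency=7)
toy.add_edge(2, 1)
toy.add_edge(3, 1)
toy.add_edge(2, 3)

# In-degree centrality: incoming edges divided by the number of other nodes.
centrality = nx.in_degree_centrality(toy)

# Look up the label of the node with the highest score.
top_id = max(centrality, key=centrality.get)
print(toy.nodes[top_id]["label"], centrality[top_id])
```

The same calls should work on `model.relation_graph_`, since it is a `networkx` `DiGraph`.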

### Service attributes

However, the service attributes offer more control and may be especially handy if the model is large and you do not want everything returned at once.

For instance, you can search for entities with the `entities` service.

In [17]:
white_house_matches = model.entities.search("White House")
white_house_matches[:10]

[EntityLabel(id=617, label='White House'),
 EntityLabel(id=4597, label="White House's"),
 EntityLabel(id=157, label='White House Story'),
 EntityLabel(id=389, label='White House Protest'),
 EntityLabel(id=6105, label='White House On Health Care\n\n'),
 EntityLabel(id=1251, label='Leave White House'),
 EntityLabel(id=6695, label='Obama White House'),
 EntityLabel(id=4229, label='Nominate White House'),
 EntityLabel(id=5900, label='The Trump White House'),
 EntityLabel(id=3020, label="the White House Correspondents' Dinner")]

In [18]:
white_house_id = white_house_matches[0].id

You can also create a subgraph that expands from a set of focus nodes and includes only the nodes that pass a filter.

In [19]:
from datetime import date
from narrativegraphs import GraphFilter

white_house_graph = model.graph.expand_from_focus_entities(
    [white_house_id],
    "relation",
    graph_filter=GraphFilter(
        minimum_node_frequency=20,
        categories={'category': ["POLITICS"]},
        earliest_date=date(2014, 1, 1)
    )
)

# strip labels to remove stray whitespace
print("NODES")
for node in white_house_graph.nodes:
    print(node.id, node.label.strip())

print("\nEDGES")
for edge in white_house_graph.edges:
    print(edge.subject_label.strip(), '-', edge.label, '->', edge.object_label.strip())

NODES
204 Ted Cruz
391 DONALD Trump
617 White House
1600 Hillary Clinton
2265 Rules
2606 the bill
3567 President Obama
10348 last year

EDGES
Ted Cruz - Rejected Bush -> White House
Ted Cruz - ,, to take on -> DONALD Trump
DONALD Trump - and, Calls On, On, ... -> Hillary Clinton
DONALD Trump - Accuses, and, and Texas Sen., ... -> Ted Cruz
DONALD Trump - In -> White House
DONALD Trump - has rewritten -> Rules
White House - plans to scrap -> Rules
White House - after -> President Obama
Hillary Clinton - Met For Lunch At, Win -> White House
Hillary Clinton - and, Over, : -> DONALD Trump
the bill - at -> White House
President Obama - and -> Hillary Clinton
last year - at -> White House
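
To make the idea of focus expansion with a filter concrete, here is a conceptual sketch in plain NetworkX. It is not how `expand_from_focus_entities` is implemented: the helper `expand_from_focus`, the toy graph, and the single frequency threshold are all assumptions made for illustration.

```python
import networkx as nx

# Toy graph with made-up frequencies and edges.
toy = nx.DiGraph()
toy.add_node(1, label="White House", frequency=40)
toy.add_node(2, label="Ted Cruz", frequency=25)
toy.add_node(3, label="a seat", frequency=4)
toy.add_node(4, label="Rules", frequency=22)
toy.add_edge(2, 1)
toy.add_edge(1, 4)
toy.add_edge(3, 1)


def expand_from_focus(graph, focus, min_frequency):
    """Hypothetical helper: keep focus nodes plus direct neighbours above a frequency threshold."""
    keep = set(focus)
    for node in focus:
        # Neighbours in either direction count as connected to the focus node.
        neighbours = set(graph.successors(node)) | set(graph.predecessors(node))
        keep |= {n for n in neighbours if graph.nodes[n]["frequency"] >= min_frequency}
    return graph.subgraph(keep).copy()


sub = expand_from_focus(toy, [1], min_frequency=20)
print(sorted(sub.nodes))
print(sorted(sub.edges))
```

The low-frequency node "a seat" is dropped even though it is connected to the focus node, which mirrors the effect of `minimum_node_frequency` in the filter above.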


### Saving and loading the model

We can save the model for later use, which is especially useful if we have a lot of documents that take a while to process.

In [21]:
model.save_to_file("demo")

And we can load it from that saved file.

In [22]:
model = NarrativeGraph.load("demo")

## References

[1] Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).

[2] Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

