# Load data
First we load the data. I've pulled a [generic quotations dataset](https://huggingface.co/datasets/m-ric/english_historical_quotes/) from Hugging Face and downloaded it to the repo for convenience.

In [70]:
import json
import pandas as pd

# https://huggingface.co/datasets/m-ric/english_historical_quotes/blob/main/english_historical_quotes.json

# Load in from json
with open("data/english_historical_quotes.json", "r") as f:
    data = json.load(f)

# I drop the category column as it's not required for the demo
df = pd.DataFrame(data).drop(columns=["category"])

df.head()

Unnamed: 0,quote,author
0,Almost anyone can be an author the business is...,A. A. Milne
1,"If you live to be a hundred, I want to live to...",A. A. Milne
2,Golf is so popular simply because it is the be...,A. A. Milne
3,"To the uneducated, an A is just three sticks.",A. A. Milne
4,Promise me you'll always remember: You're brav...,A. A. Milne


# Building the network
Semnet makes constructing an embedding-based network simple. Just bring your own embeddings and pass them to the `.fit()` method.

In [71]:
from semnet import SemanticNetwork
from sentence_transformers import SentenceTransformer

docs = df["quote"].tolist()

embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

## API
We pass the docs as the labels, and pass the author as additional data for each node in the network. Passing a single column via `.to_dict()` records it on each node under the key `value`.

In [115]:
# Larger values for thresh will generate sparser networks with fewer edges and more outliers
# I've found good values to be between 0.15 and 0.5. 0.3 will get a strong core, but with a large number of outliers
sem = SemanticNetwork(thresh=0.3, n_trees=100)
sem.fit(embeddings=embeddings, labels=docs, node_data=df["author"].to_dict())

<semnet.semnet.SemanticNetwork at 0x7f7e90304180>

## Under the hood
Semnet uses [annoy](https://github.com/spotify/annoy), an approximate nearest-neighbour index, to rapidly calculate pairwise distances between embeddings across the whole dataset.

The result of this process is an edgelist, which can be used to construct an undirected graph, weighted by the semantic similarity between each record.
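To make the idea of a thresholded edgelist concrete, here is a minimal brute-force sketch on toy vectors. It is not Semnet's implementation: it compares every pair directly instead of using annoy, and it assumes that `thresh` acts as a minimum cosine similarity. The function names are made up for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def brute_force_edgelist(vectors, thresh):
    """Return (i, j, similarity) for every pair at or above `thresh`."""
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if sim >= thresh:
            edges.append((i, j, sim))
    return edges


# Toy 2-D "embeddings": 0 and 1 point almost the same way, 2 is orthogonal to 0
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_edges = brute_force_edgelist(toy_vectors, thresh=0.3)
print(toy_edges)
```

Only the pair (0, 1) clears the threshold here, so record 2 would end up as a node with no edges. The brute-force loop grows quadratically with the number of records, which is why an index such as annoy matters at the scale of this dataset.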

# Network analysis of text
With our data loaded into a `networkx` object, we now have access to hundreds of graph-based algorithms that can be used to explore, analyse and clean our data.

Use cases include:
- Outlier detection
- Enriching network with non-semantic data
- Clustering
- Visualisation
- Semantic pathways
- Deduplication and more!

## Outlier detection
Any node that has no edges has no semantic relationship with any other item in the dataset at the threshold set during fitting. These records may be considered outliers.

We can use networkx to find all connected components in the graph. The demo threshold is pretty high so we'll see a fair few outliers.

In [116]:
import networkx as nx
import random

G = sem.graph_.copy()

# nx.connected_components yields one set of nodes per component, including single-node components
connected_components = list(nx.connected_components(G))
unconnected_components = [list(c)[0] for c in connected_components if len(c) == 1]

print(
    f"Number of connected components (including single-node components): {len(connected_components)}"
)
print(f"Number of unconnected components (outliers): {len(unconnected_components)}")

Number of connected components (including single-node components): 16410
Number of unconnected components (outliers): 15573
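
Note that `nx.connected_components` also yields single-node components, so the total count includes the outliers. As a small sketch on a toy graph (not the quotes data), the single-node components can be separated from the larger ones, and `nx.isolates` gives the same nodes directly provided the graph has no self-loops:

```python
import networkx as nx

# Toy graph: a-b and b-c are connected, d and e have no edges
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c")])
toy.add_nodes_from(["d", "e"])

components = list(nx.connected_components(toy))
singletons = sorted(next(iter(c)) for c in components if len(c) == 1)
multi_node = [c for c in components if len(c) > 1]

# nx.isolates yields the nodes of degree zero
isolates = sorted(nx.isolates(toy))

print(f"{len(components)} components, {len(multi_node)} with 2+ nodes")
print(f"Singletons: {singletons}, isolates: {isolates}")
```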


Outliers, in this context, represent topics or phrasing that are somewhat unique within the dataset. Exploring the outliers, we can see references to Mexico, Pearl Harbor, puppies and such.

In [117]:
def sample_nodes(subgraph, n=5, seed=12345):
    """Print up to `n` randomly sampled nodes as "name, value"."""
    random.seed(seed)
    node_candidates = list(subgraph.nodes(data=True))
    # Avoid asking for more nodes than the subgraph contains
    n = min(n, len(node_candidates))
    sampled = random.sample(node_candidates, n)
    for _, data in sampled:
        print(f"{data['name']}, {data['value']}")


largest_cc = max(nx.connected_components(G), key=len)
large_subgraph = G.subgraph(largest_cc)

print("Largest")
sample_nodes(large_subgraph, n=10)
print()
print("Outliers")
sample_nodes(G.subgraph(unconnected_components), n=10)

Largest
The first condition of education is being able to put someone to wholesome and meaningful work., John Ruskin
Death doesn't affect the living because it has not happened yet. Death doesn't concern the dead because they have ceased to exist., W. Somerset Maugham
A liar will not be believed, even when he speaks the truth., Aesop
The one permanent emotion of the inferior man is fear - fear of the unknown, the complex, the inexplicable. What he wants above everything else is safety., H. L. Mencken
You cannot do everything at once, so find people you trust to help you. And don't be afraid to say no., Jane Seymour
It is the nature of all greatness not to be exact., Edmund Burke
To be trusted is a greater compliment than being loved., George MacDonald
One should always be in love. That is the reason one should never marry., Oscar Wilde
Dream and give yourself permission to envision a You that you choose to be., Joy Page
I'm not in politics., David Attenborough

Outliers
The democratic 

How we treat outliers will depend on our use case. As a demonstration, I'm keen to look at the core of the dataset, getting themes, vibes and relationships rather than trying to classify every node. I drop the outliers and focus on the centre.

In [118]:
print(len(G.nodes()), len(unconnected_components))
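# Use a set so each membership test is O(1) rather than a scan of the list
outlier_nodes = set(unconnected_components)
non_outliers = [n for n in G.nodes() if n not in outlier_nodes]
G = G.subgraph(non_outliers)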
print(f"Graph after removing outliers has {len(G.nodes())} nodes")

24022 15573
Graph after removing outliers has 8449 nodes


# Clustering

Whilst excellent methods and libraries (e.g., BERTopic) exist for topic modelling on embeddings, the graph structure allows us to use a _relationship_-based approach.
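
As a toy illustration of what relationship-based clustering means, the sketch below runs Louvain community detection on a small synthetic graph rather than on the quotes. The graph is made of two tightly connected groups joined by a single edge, so the structure alone, with no embeddings involved, is what drives the grouping.

```python
import networkx as nx

# Two 5-node cliques joined by a single bridge edge (nodes 0-4 and 5-9)
toy = nx.barbell_graph(5, 0)

# Louvain groups nodes that are densely connected to each other
toy_communities = nx.community.louvain_communities(toy, seed=123)

for i, comm in enumerate(sorted(toy_communities, key=len, reverse=True), start=1):
    print(f"Community {i}: {sorted(comm)}")
```

Louvain is randomised, so fixing `seed` keeps the result reproducible between runs.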

In [119]:
communities = nx.community.louvain_communities(G, seed=123, resolution=1.5)
for i, comm in enumerate(sorted(communities, key=len, reverse=True)):
    print(f"Community {i+1}, size: {len(comm)}")
    subgraph = G.subgraph(comm)
    sample_nodes(subgraph, n=5)
    print()

    # Label nodes with their community; small communities (5 nodes or fewer) get -1
    label = i + 1 if len(comm) > 5 else -1
    for node in comm:
        G.nodes[node]["community"] = label

Community 1, size: 368
Argument is conclusive, but it does not remove doubt, so that the mind may rest in the sure knowledge of the truth, unless it finds it by the method of experiment., Roger Bacon
Wisdom begins in wonder., Socrates
Fear is the main source of superstition, and one of the main sources of cruelty. To conquer fear is the beginning of wisdom., Bertrand Russell
What a man knows is everywhere at war with what he wants., Joseph Wood Krutch
Discipline is wisdom and vice versa., M. Scott Peck

Community 2, size: 337
My one regret in life is that I am not someone else., Woody Allen
Death in itself is nothing but we fear to be we know not what, we know not where., John Dryden
It is the mark of a good action that it appears inevitable in retrospect., Robert Louis Stevenson
Our entire life - consists ultimately in accepting ourselves as we are., Jean Anouilh
Life's Tragedy is that we get old to soon and wise too late., Benjamin Franklin

Community 3, size: 319
I believe that bein

In [None]:
from cosmograph import cosmo

# Use the new to_pandas method to export the graph
nodes, edges = sem.to_pandas(G)

# For cosmograph, we need to prepare the data
widget = cosmo(
    points=nodes,
    links=edges,
    point_id_by="id",  # Index column
    link_source_by="source",
    link_target_by="target",
    link_strength_by="similarity",
    point_color_by="community",  # Color by community
    point_cluster_by="community",
    show_hovered_point_label=True,
    select_point_on_click=True,
    point_include_columns=["value"],  # Include author info
    point_label_by="name",
)
widget

Cosmograph(background_color=None, components_display_state_mode=None, focused_point_ring_color=None, hovered_p…

In [69]:
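# Longest simple path (up to 20 edges) between two random nodes.
# Note: sorting the output of all_simple_paths enumerates every path, which can take a very long time on a large component.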
import random


for n in range(10):
    print("\n" + "-" * 50 + "\n")
    random.seed(n)
    node_a, node_b = random.sample(list(large_subgraph.nodes(data=True)), 2)

    all_path = nx.all_simple_paths(
        large_subgraph,
        source=node_a[0],
        target=node_b[0],
        cutoff=20,
    )
    # Find the longest path
    sorted_paths = sorted(all_path, key=len, reverse=True)
    long_path = sorted_paths[0] if len(sorted_paths) > 0 else None

    if long_path is not None:
        print(f"Long path between:\n- {node_a[1]['name']}\n- {node_b[1]['name']}\n")
        for idx in long_path:
            print(f"- {large_subgraph.nodes[idx]['name']}")


--------------------------------------------------



KeyboardInterrupt: 
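
The interrupt above is likely down to the number of simple paths in a large component, which grows very quickly with the cutoff. One possible workaround, sketched here on a toy grid graph rather than the quotes network, is to cap how many paths are examined with `itertools.islice`. The cap of 50 and the cutoff of 8 are arbitrary illustrative values, and the result is only the longest path among those examined.

```python
from itertools import islice

import networkx as nx

# Toy graph: a 4x4 grid has many simple paths between opposite corners
toy = nx.grid_2d_graph(4, 4)
source, target = (0, 0), (3, 3)

# all_simple_paths is a lazy generator, so islice stops after max_paths
max_paths = 50  # hypothetical cap, tune to taste
paths = list(islice(nx.all_simple_paths(toy, source, target, cutoff=8), max_paths))

# Longest of the paths examined, not necessarily the longest overall
longest_seen = max(paths, key=len)
print(f"Examined {len(paths)} paths, longest has {len(longest_seen) - 1} edges")
```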

In [None]:
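# `points` is not defined earlier in the notebook; this assumes it is the
# node table from `sem.to_pandas(G)` with degree centrality added per node
points = nodes.copy()
points["degree_centrality"] = points["id"].map(nx.degree_centrality(G))

communities = []
for idx, community in points.groupby("community"):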
    top_nodes = community.nlargest(5, "degree_centrality")
    communities.append(
        {
            "community_id": idx,
            "representative_docs": top_nodes["name"].values,
            "size": len(community),
        }
    )

for community in sorted(communities, key=lambda x: x["size"], reverse=True):
    print(f"Community {community['community_id']} (size={community['size']}):")
    for doc in community["representative_docs"]:
        print(f" - {doc}")
    print()

In [None]:
widget

In [None]:
# Shortest path between two nodes
import random


def find_shortest_path(graph, source_idx, target_idx):
    try:
        path = nx.shortest_path(graph, source=source_idx, target=target_idx)
        return path
    except nx.NetworkXNoPath:
        return None


largest_component = max(connected_components, key=len)
# `reduced_graph` is not defined in this notebook; the outlier-free graph `G` is assumed here
largest_subgraph = G.subgraph(largest_component)

node_a, node_b = random.sample(list(largest_subgraph.nodes(data=True)), 2)
path = find_shortest_path(largest_subgraph, source_idx=node_a[0], target_idx=node_b[0])
# find_shortest_path returns None when no path exists
for idx in path or []:
    print(f"- {largest_subgraph.nodes[idx]['name']}")
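
Without a `weight` argument, `nx.shortest_path` counts hops and ignores edge weights. Since the graph is weighted by semantic similarity, a similarity-aware pathway may be more interesting. The sketch below uses a toy graph with a hypothetical `similarity` edge attribute (check which attribute name your graph actually uses) and converts it to a distance, because shortest-path routines minimise cost.

```python
import networkx as nx

# Toy graph with a hypothetical "similarity" edge attribute
toy = nx.Graph()
toy.add_edge("a", "b", similarity=0.9)
toy.add_edge("b", "c", similarity=0.9)
toy.add_edge("a", "c", similarity=0.2)

# Shortest-path routines minimise cost, so turn similarity into a distance
for u, v, attrs in toy.edges(data=True):
    attrs["distance"] = 1.0 - attrs["similarity"]

hop_path = nx.shortest_path(toy, source="a", target="c")
weighted_path = nx.shortest_path(toy, source="a", target="c", weight="distance")

print(f"Fewest hops: {hop_path}")
print(f"Most similar steps: {weighted_path}")
```

The hop-based path takes the weak direct edge from `a` to `c`, while the weighted path goes through `b`, where each step stays between closely related nodes.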