# Ontology Alignment Experiments: Graph Analysis Pipeline

This notebook demonstrates a complete graph analysis pipeline for ontology alignment:
1. **Exploratory Data Analysis (EDA)** - Understanding the graph structure
2. **Weakly Connected Components (WCC)** - Identifying connected subgraphs
3. **Node Similarity** - Finding similar nodes based on shared properties
4. **Node Embeddings** - Creating vector representations for similarity analysis

## Setup and Imports

In [2]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-6.1.0-py3-none-any.whl.metadata (5.3 kB)
Downloading neo4j-6.1.0-py3-none-any.whl (325 kB)
Installing collected packages: neo4j
Successfully installed neo4j-6.1.0


In [3]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from neo4j import GraphDatabase
from IPython.display import display, HTML
import json

# Import EDA analyzer
sys.path.append('eda')
from neo4j_analyzer import Neo4jPropertyAnalyzer, PerformanceMonitor
from neo4j_analyzer.report_generator import ReportGenerator

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

ModuleNotFoundError: No module named 'neo4j_analyzer'

## Configuration

**Neo4j Connection Settings:**

In [None]:
# Neo4j Connection
NEO4J_URI = "bolt://44.204.34.69"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "decibels-defenses-president"

# Analysis Settings
USE_FAST_MODE = True      # Use Cypher aggregations for large graphs
SAMPLE_SIZE = 50000       # Sample size for standard mode
FETCH_SIZE = 2000         # Batch size for data extraction
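
The connection settings above are hard-coded in the cell. As a minimal sketch of an alternative, the same three values could be read from environment variables. The helper name `load_neo4j_settings`, the variable names, and the `bolt://localhost:7687` default are assumptions for illustration, not part of the original notebook.

```python
import os


def load_neo4j_settings(env=None):
    """Return (uri, user, password) read from an environment-style mapping.

    Hypothetical helper: the variable names below are illustrative only.
    """
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")  # assumed default
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if not password:
        # Fail early instead of attempting a connection without credentials.
        raise RuntimeError("Set NEO4J_PASSWORD before running the notebook")
    return uri, user, password


# An explicit mapping is passed here so the sketch runs without real credentials.
print(load_neo4j_settings({"NEO4J_PASSWORD": "example-only"}))
```

Called with no argument, the helper falls back to `os.environ`, so the notebook itself never has to contain the password.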

---
# Part 1: Exploratory Data Analysis (EDA)

Understanding the structure and properties of our graph data before running algorithms.

## 1.1 Initialize Analyzer and Explore Database

In [None]:
# Initialize analyzer
analyzer = Neo4jPropertyAnalyzer(
    uri=NEO4J_URI,
    user=NEO4J_USER,
    password=NEO4J_PASSWORD,
    fetch_size=FETCH_SIZE
)

# Get all node labels in the database
labels = analyzer.get_node_labels()
print(f"Found {len(labels)} node labels in the database:")
for label in labels:
    count = analyzer.get_node_count(label)
    print(f"  - {label}: {count:,} nodes")

## 1.2 Analyze Node Properties

Analyze properties to understand:
- **Categorical properties**: Low cardinality, good for grouping
- **Unique properties**: High cardinality, good for identifiers
- **Property distributions**: Understanding data quality
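
The distinction between categorical and unique properties can be illustrated with a cardinality ratio (distinct values divided by non-null values). The sketch below is not the `neo4j_analyzer` implementation; the function name and both thresholds are assumptions chosen only to make the idea concrete.

```python
def classify_property(values, categorical_max_ratio=0.05, unique_min_ratio=0.95):
    """Classify a property by the ratio of distinct to non-null values.

    Thresholds are illustrative assumptions, not values used by the analyzer.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "EMPTY"
    ratio = len(set(non_null)) / len(non_null)
    if ratio >= unique_min_ratio:
        return "UNIQUE"        # high cardinality: behaves like an identifier
    if ratio <= categorical_max_ratio:
        return "CATEGORICAL"   # low cardinality: useful for grouping
    return "OTHER"


languages = ["en"] * 60 + ["de"] * 40      # 2 distinct values out of 100
stream_ids = list(range(100))              # 100 distinct values out of 100
print(classify_property(languages))        # CATEGORICAL
print(classify_property(stream_ids))       # UNIQUE
```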

In [None]:
# Analyze properties for each label
all_results = {}

for label in labels:
    print(f"\n{'='*60}")
    print(f"Analyzing label: {label}")
    print(f"{'='*60}")
    
    if USE_FAST_MODE:
        summary = analyzer.get_property_summary_fast(label)
    else:
        summary = analyzer.get_property_summary(label, sample_size=SAMPLE_SIZE)
    
    all_results[label] = summary
    ReportGenerator.print_summary(summary, label)

## 1.3 Visualize Property Types Distribution

In [None]:
# Aggregate property types across all labels
property_type_counts = {}

for label, summary in all_results.items():
    for prop_name, prop_info in summary.items():
        prop_type = prop_info.get('type', 'UNKNOWN')
        if prop_type not in property_type_counts:
            property_type_counts[prop_type] = 0
        property_type_counts[prop_type] += 1

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
ax1.pie(property_type_counts.values(), labels=property_type_counts.keys(), 
        autopct='%1.1f%%', startangle=90)
ax1.set_title('Property Types Distribution')

# Bar chart
ax2.bar(property_type_counts.keys(), property_type_counts.values())
ax2.set_xlabel('Property Type')
ax2.set_ylabel('Count')
ax2.set_title('Property Types Count')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

---
# Part 2: Weakly Connected Components (WCC)

**Algorithm**: WCC identifies groups of nodes that can reach each other through some path when relationship direction is ignored, so nodes linked only indirectly still land in the same component.

**Key Parameters:**
- `GRAPH_NAME`: Name of the GDS graph projection to analyze
- `USER_LABEL`: Node label to focus on (e.g., 'Stream')
- `PROPERTY_LABEL`: Related property nodes (e.g., 'Property')
- `REL_TYPE`: Relationship type connecting nodes (e.g., 'HAS')

**What it does:**
- Finds all connected components in the projected graph
- Yields a `componentId` for each node
- Keeps only components that contain more than one `Stream` node (`size > 1`)
- Writes the `component_id_2` property to those `Stream` nodes for downstream analysis
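
To make the steps above concrete, here is a minimal pure-Python sketch of weakly connected components using union-find, followed by the same "keep components with more than one Stream node" filter. It is not how GDS implements WCC; the node names and the tiny edge list are made up for illustration.

```python
from collections import defaultdict


def weakly_connected_components(nodes, edges):
    """Group nodes into components, treating every edge as undirected."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return list(groups.values())


# Hypothetical toy graph: streams s1 and s2 share property p1.
streams = {"s1", "s2", "s3", "s4"}
nodes = ["s1", "s2", "s3", "s4", "p1", "p2"]
edges = [("s1", "p1"), ("s2", "p1"), ("s3", "p2")]

components = weakly_connected_components(nodes, edges)
# Keep only the Stream nodes of each component, then drop singletons.
stream_groups = [sorted(n for n in comp if n in streams) for comp in components]
print([g for g in stream_groups if len(g) > 1])  # [['s1', 's2']]
```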

In [None]:
# WCC Configuration
GRAPH_NAME = "node-embedding-graph"
USER_LABEL = "Stream"
PROPERTY_LABEL = "Property"
REL_TYPE = "HAS"

print(f"Running WCC on graph: {GRAPH_NAME}")
print(f"Analyzing nodes: {USER_LABEL}")
print(f"Connected via: {REL_TYPE} -> {PROPERTY_LABEL}")

In [None]:
# Run WCC Algorithm
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

with driver:
    with driver.session() as session:
        # Run WCC and assign component IDs
        result = session.run(
            "CALL gds.wcc.stream($name) "
            "YIELD nodeId, componentId "
            "WITH gds.util.asNode(nodeId) AS n, componentId "
            # Filter by the configured label instead of a hard-coded 'Stream'
            "WHERE $label IN labels(n) "
            "WITH componentId, collect(n) AS nodes, count(*) AS size "
            "WHERE size > 1 "
            "UNWIND nodes AS n "
            "SET n.component_id_2 = componentId "
            # UNWIND yields one row per node; DISTINCT restores one row per component
            "RETURN DISTINCT componentId, size "
            "ORDER BY size DESC, componentId ASC",
            name=GRAPH_NAME,
            label=USER_LABEL,
        )
        
        # Collect results
        wcc_results = []
        for record in result:
            wcc_results.append({
                'component_id': record['componentId'],
                'size': record['size']
            })

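# The `with driver:` block above already closed the driver.

# Display results (explicit columns keep the frame usable when no rows are returned)
wcc_df = pd.DataFrame(wcc_results, columns=['component_id', 'size'])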
print(f"\nFound {len(wcc_df)} connected components with size > 1")
print(f"Total nodes in components: {wcc_df['size'].sum():,}")
display(wcc_df.head(10))

## 2.1 Visualize Component Size Distribution

In [None]:
# Visualize component sizes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of component sizes
ax1.hist(wcc_df['size'], bins=50, edgecolor='black')
ax1.set_xlabel('Component Size')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Component Sizes')
ax1.set_yscale('log')

# Top 10 largest components
top_10 = wcc_df.nlargest(10, 'size')
ax2.barh(range(len(top_10)), top_10['size'])
ax2.set_yticks(range(len(top_10)))
ax2.set_yticklabels([f"Comp {cid}" for cid in top_10['component_id']])
ax2.set_xlabel('Size')
ax2.set_title('Top 10 Largest Components')
ax2.invert_yaxis()

plt.tight_layout()
plt.show()

print(f"\nComponent Statistics:")
print(f"  Mean size: {wcc_df['size'].mean():.2f}")
print(f"  Median size: {wcc_df['size'].median():.2f}")
print(f"  Largest component: {wcc_df['size'].max():,} nodes")
print(f"  Smallest component: {wcc_df['size'].min():,} nodes")

---
# Part 3: Node Similarity

**Algorithm**: Node Similarity finds pairs of nodes that are similar based on their shared neighbors (Jaccard similarity).

**Key Parameters:**
- `SIMILARITY_GRAPH_NAME`: GDS graph projection to analyze (the query reads `component_id_2` from the stored nodes, so Part 2 must have run first)
- `NODE_LABELS`: Filter to specific node types (e.g., ['Stream'])
- `similarity > 0`: Only return node pairs with non-zero similarity

**What it does:**
- Compares nodes based on shared properties/neighbors
- Calculates Jaccard similarity: |A ∩ B| / |A ∪ B|
- Returns pairs of similar nodes with similarity scores
- Useful for finding potential duplicates or related entities
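
As a small worked example of the Jaccard formula above, the sketch below compares two streams by the sets of property nodes they are connected to. The property names are hypothetical.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


stream_a = {"p1", "p2", "p3"}   # hypothetical neighbors of one Stream node
stream_b = {"p2", "p3", "p4"}   # hypothetical neighbors of another
print(jaccard(stream_a, stream_b))  # 2 shared / 4 in the union = 0.5
```

Two nodes with no shared neighbors score 0, which is why the query filters on `similarity > 0`.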

In [None]:
# Node Similarity Configuration
SIMILARITY_GRAPH_NAME = "graph-with-component_ids"
NODE_LABELS = ["Stream"]

print(f"Running Node Similarity on graph: {SIMILARITY_GRAPH_NAME}")
print(f"Analyzing node labels: {NODE_LABELS}")

In [None]:
# Run Node Similarity Algorithm
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

with driver:
    with driver.session() as session:
        result = session.run(
            "CALL gds.nodeSimilarity.stream($name, { "
            "  nodeLabels: $labels "
            "}) "
            "YIELD node1, node2, similarity "
            "WHERE similarity > 0 "
            "RETURN gds.util.asNode(node1).id AS node1_id, "
            "       gds.util.asNode(node1).component_id_2 AS node1_component_id, "
            "       gds.util.asNode(node2).id AS node2_id, "
            "       gds.util.asNode(node2).component_id_2 AS node2_component_id, "
            "       similarity "
            "ORDER BY similarity DESC "
            "LIMIT 1000",  # Limit for notebook display
            name=SIMILARITY_GRAPH_NAME,
            labels=NODE_LABELS,
        )
        
        # Collect results
        similarity_results = []
        for record in result:
            similarity_results.append({
                'node1_id': record['node1_id'],
                'node1_component': record['node1_component_id'],
                'node2_id': record['node2_id'],
                'node2_component': record['node2_component_id'],
                'similarity': record['similarity']
            })

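# The `with driver:` block above already closed the driver.

# Display results (explicit columns keep the frame usable when no rows are returned)
similarity_df = pd.DataFrame(
    similarity_results,
    columns=['node1_id', 'node1_component', 'node2_id', 'node2_component', 'similarity'],
)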
print(f"\nFound {len(similarity_df)} similar node pairs (showing top 1000)")
display(similarity_df.head(20))

## 3.1 Visualize Similarity Distribution

In [None]:
# Visualize similarity scores
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of similarity scores
ax1.hist(similarity_df['similarity'], bins=50, edgecolor='black')
ax1.set_xlabel('Similarity Score')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Similarity Scores')

# Top 20 most similar pairs
top_20 = similarity_df.nlargest(20, 'similarity')
ax2.barh(range(len(top_20)), top_20['similarity'])
ax2.set_yticks(range(len(top_20)))
ax2.set_yticklabels([f"{row['node1_id']}-{row['node2_id']}" for _, row in top_20.iterrows()], fontsize=8)
ax2.set_xlabel('Similarity')
ax2.set_title('Top 20 Most Similar Node Pairs')
ax2.invert_yaxis()

plt.tight_layout()
plt.show()

print(f"\nSimilarity Statistics:")
print(f"  Mean similarity: {similarity_df['similarity'].mean():.4f}")
print(f"  Median similarity: {similarity_df['similarity'].median():.4f}")
print(f"  Max similarity: {similarity_df['similarity'].max():.4f}")
print(f"  Min similarity: {similarity_df['similarity'].min():.4f}")

---
# Part 4: Node Embeddings

**Algorithm**: FastRP (Fast Random Projection) creates vector representations of nodes, followed by KNN (K-Nearest Neighbors) to find similar nodes.

**Key Parameters:**
- `EMBEDDING_GRAPH_NAME`: Name of the GDS graph projection
- `EMBEDDING_LABELS`: Node labels to include (e.g., ['Stream', 'Game'])
- `EMBEDDING_REL_TYPES`: Relationship types to consider (e.g., ['MODERATOR', 'VIP', 'CHATTER'])
- `EMBEDDING_DIMENSION`: Size of the embedding vectors (128)
- `TOP_K`: Number of nearest neighbors to return per node (10)

**What it does:**
- Starts from a graph projection filtered by `component_id_2` (assumed to exist already)
- Uses 128-dimensional FastRP embeddings stored as the `embedding` node property (assumed to be computed already; the query cell runs only KNN)
- Finds the top 10 most similar nodes for each node using KNN
- Returns node pairs with similarity scores and component IDs
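
The projection and FastRP steps are not part of the query cell in this notebook. The sketch below shows how they might be expressed, assuming GDS 2.x procedure names (`gds.graph.project`, `gds.fastRP.mutate`). The helper `build_embedding_steps` and the graph name in the example are hypothetical, and the sketch only builds the Cypher text and parameters without contacting a database.

```python
def build_embedding_steps(graph_name, labels, rel_types, dimension):
    """Return (cypher, parameters) pairs for projecting a graph and running FastRP.

    Hypothetical helper; assumes the GDS 2.x procedures named in the strings.
    """
    project = (
        "CALL gds.graph.project($name, $labels, $relTypes) "
        "YIELD graphName, nodeCount, relationshipCount"
    )
    fastrp = (
        "CALL gds.fastRP.mutate($name, { "
        "  embeddingDimension: $dimension, "
        "  mutateProperty: 'embedding' "   # the property name KNN reads later
        "}) "
        "YIELD nodePropertiesWritten"
    )
    return [
        (project, {"name": graph_name, "labels": labels, "relTypes": rel_types}),
        (fastrp, {"name": graph_name, "dimension": dimension}),
    ]


steps = build_embedding_steps(
    "hypothetical-embedding-graph",          # assumed name, not from the notebook
    ["Stream", "Game"],
    ["MODERATOR", "VIP", "CHATTER"],
    128,
)
for cypher, params in steps:
    print(cypher)
    print(params)
```

Each pair could then be passed to `session.run(cypher, params)`. Note that `gds.graph.project` fails if a projection with the same name already exists, so the name would need to be unused or the old projection dropped first.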

In [None]:
# Node Embeddings Configuration
EMBEDDING_GRAPH_NAME = "graph-with-component_ids"
EMBEDDING_LABELS = ["Stream", "Game"]
EMBEDDING_REL_TYPES = ["MODERATOR", "VIP", "CHATTER"]
EMBEDDING_DIMENSION = 128
TOP_K = 10

print(f"Running Node Embeddings on graph: {EMBEDDING_GRAPH_NAME}")
print(f"Node labels: {EMBEDDING_LABELS}")
print(f"Relationship types: {EMBEDDING_REL_TYPES}")
print(f"Embedding dimension: {EMBEDDING_DIMENSION}")
print(f"Top K neighbors: {TOP_K}")

In [None]:
# Run KNN over precomputed FastRP embeddings
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

with driver:
    with driver.session() as session:
        # Note: This assumes the graph projection and embeddings have been created
        # In practice, you would first create the projection and run FastRP
        # For this demo, we'll query the KNN results
        
        result = session.run(
            "CALL gds.knn.stream($name, { "
            "  nodeProperties: ['embedding'], "
            "  topK: $topK "
            "}) "
            "YIELD node1, node2, similarity "
            "RETURN gds.util.asNode(node1).id AS node1_id, "
            "       gds.util.asNode(node1).component_id_2 AS component_id_node_1, "
            "       gds.util.asNode(node2).id AS node2_id, "
            "       gds.util.asNode(node2).component_id_2 AS component_id_node_2, "
            "       similarity "
            "ORDER BY similarity DESC "
            "LIMIT 1000",
            name=EMBEDDING_GRAPH_NAME,
            topK=TOP_K,
        )
        
        # Collect results
        embedding_results = []
        for record in result:
            embedding_results.append({
                'node1_id': record['node1_id'],
                'component_id_1': record['component_id_node_1'],
                'node2_id': record['node2_id'],
                'component_id_2': record['component_id_node_2'],
                'similarity': record['similarity']
            })

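# The `with driver:` block above already closed the driver.

# Display results (explicit columns keep the frame usable when no rows are returned)
embedding_df = pd.DataFrame(
    embedding_results,
    columns=['node1_id', 'component_id_1', 'node2_id', 'component_id_2', 'similarity'],
)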
print(f"\nFound {len(embedding_df)} embedding-based similar pairs (showing top 1000)")
display(embedding_df.head(20))

## 4.1 Visualize Embedding-Based Similarity

In [None]:
# Visualize embedding similarity scores
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of embedding similarity scores
ax1.hist(embedding_df['similarity'], bins=50, edgecolor='black', alpha=0.7, label='Embeddings')
if len(similarity_df) > 0:
    ax1.hist(similarity_df['similarity'], bins=50, edgecolor='black', alpha=0.5, label='Node Similarity')
ax1.set_xlabel('Similarity Score')
ax1.set_ylabel('Frequency')
ax1.set_title('Comparison: Embedding vs Node Similarity')
ax1.legend()

# Component distribution in embedding results
component_counts = embedding_df['component_id_1'].value_counts().head(10)
ax2.barh(range(len(component_counts)), component_counts.values)
ax2.set_yticks(range(len(component_counts)))
ax2.set_yticklabels([f"Comp {cid}" for cid in component_counts.index])
ax2.set_xlabel('Number of Similar Pairs')
ax2.set_title('Top 10 Components by Similar Pairs')
ax2.invert_yaxis()

plt.tight_layout()
plt.show()

print(f"\nEmbedding Similarity Statistics:")
print(f"  Mean similarity: {embedding_df['similarity'].mean():.4f}")
print(f"  Median similarity: {embedding_df['similarity'].median():.4f}")
print(f"  Max similarity: {embedding_df['similarity'].max():.4f}")
print(f"  Min similarity: {embedding_df['similarity'].min():.4f}")

---
# Summary

This notebook demonstrated a complete graph analysis pipeline:

1. **EDA**: Explored the graph structure, node labels, and property distributions
2. **WCC**: Identified connected components and assigned component_id_2 to nodes
3. **Node Similarity**: Found similar nodes based on Jaccard similarity of shared neighbors
4. **Node Embeddings**: Created vector representations and found similar nodes using KNN

## Key Findings:
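- Component IDs from WCC group `Stream` nodes that are connected, directly or indirectly, in the projected graph
- Node Similarity scores a pair by the Jaccard overlap of its direct neighbors, so it captures shared connections only
- FastRP embeddings summarize each node's wider neighborhood, so KNN over them can surface pairs that share few or no direct neighbors
- The two approaches complement each other: pairs found by both are stronger alignment candidates, while pairs found by only one are worth a closer look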
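
One way to check how far the two approaches agree is to compare the pair sets they return. The sketch below is an assumption about how such a comparison might look, not part of the original pipeline; the helper name and the sample pairs are made up. Pairs are treated as unordered because both procedures can return a pair in either direction.

```python
def pair_overlap(pairs_a, pairs_b):
    """Count unordered node pairs found by both methods or by only one."""
    set_a = {frozenset(p) for p in pairs_a}
    set_b = {frozenset(p) for p in pairs_b}
    return {
        "both": len(set_a & set_b),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
    }


# Hypothetical results: (node1_id, node2_id) tuples from each method.
jaccard_pairs = [("s1", "s2"), ("s2", "s3")]
knn_pairs = [("s2", "s1"), ("s3", "s4")]
print(pair_overlap(jaccard_pairs, knn_pairs))
# {'both': 1, 'only_a': 1, 'only_b': 1}
```

With the notebook's data frames, the inputs could be built with `zip(similarity_df['node1_id'], similarity_df['node2_id'])` and the equivalent columns of `embedding_df`.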