[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/intelligence/01_Criminal_Network_Analysis.ipynb)

# Criminal Network Analysis - Graph Analytics & Centrality

## Overview

This notebook demonstrates **criminal network analysis** using Semantica, with a focus on **network centrality**, **community detection**, and **relationship mapping**. The pipeline processes OSINT feeds, police reports, and court records to build knowledge graphs for analyzing criminal networks and identifying key players and communities.

### Key Features

- **Network Centrality**: Uses centrality measures (degree, betweenness, closeness, eigenvector) to identify key players
- **Community Detection**: Detects criminal communities and groups using Louvain and Leiden algorithms
- **Relationship Mapping**: Maps relationships between persons, organizations, and events
- **Graph Analytics**: Comprehensive graph analysis including path finding and connectivity
- **Intelligence Reporting**: Generates intelligence reports from network analysis

### Learning Objectives

- Understand how to analyze criminal networks using graph analytics
- Learn to identify key players using centrality measures
- Master community detection algorithms for criminal group identification
- Explore relationship mapping and path finding in networks
- Practice graph analytics for intelligence reporting
- Analyze network structure and connectivity patterns

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[KG Construction]
    G --> H[Embedding Generation]
    H --> I[Vector Store]
    G --> J[Centrality Analysis]
    G --> K[Community Detection]
    G --> L[Graph Analytics]
    I --> M[GraphRAG Queries]
    J --> N[Visualization]
    K --> N
    L --> N
    G --> O[Export]
```


---


In [1]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


Note: you may need to restart the kernel to use updated packages.




---

## Configuration & Setup

Configure API keys and set up constants for the criminal network analysis pipeline.


In [2]:
import os

# Read the Groq API key from the environment; never hardcode credentials in a notebook.
if not os.getenv("GROQ_API_KEY"):
    print("Warning: GROQ_API_KEY is not set; the GraphRAG section requires it.")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


---

## Data Ingestion

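Ingest intelligence data from OSINT RSS feeds and public web pages. `FileIngestor` is imported for local files but is not used in this run, and the FBI API call is left commented out because it requires authentication.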
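
The feed loop in the next cell falls back from an item's content to its description and then to its title. A minimal standard-library sketch of that fallback, assuming a plain RSS 2.0 document; `FeedItem`, `parse_rss`, and the sample feed are hypothetical illustrations, not part of Semantica:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class FeedItem:  # hypothetical stand-in for an ingested feed item
    title: str
    content: str
    metadata: dict = field(default_factory=dict)


def parse_rss(xml_text: str, source: str) -> list:
    """Parse RSS 2.0 text and keep only items that carry some usable text."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("item"):
        title = (node.findtext("title") or "").strip()
        description = (node.findtext("description") or "").strip()
        content = description or title  # fall back to the title when the body is empty
        if content:
            items.append(FeedItem(title, content, {"source": source}))
    return items


SAMPLE = """<rss version="2.0"><channel>
<item><title>Alert A</title><description>Details for A</description></item>
<item><title>Alert B</title><description></description></item>
<item><title></title></item>
</channel></rss>"""

items = parse_rss(SAMPLE, "https://example.org/feed.xml")
print([(item.title, item.content) for item in items])
```

Items with no text at all are dropped, which mirrors the `if item.content:` guard in the ingestion loop.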


In [5]:
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
from contextlib import redirect_stderr
from io import StringIO
import os

os.makedirs("data", exist_ok=True)

documents = []

# Ingest from OSINT RSS feeds
osint_feeds = [
    "https://www.us-cert.gov/ncas/alerts.xml",
    "https://www.europol.europa.eu/rss.xml",
    "https://www.treasury.gov/resource-center/sanctions/OFAC-Enforcement/Pages/rss.xml",
    "https://feeds.feedburner.com/oreilly/radar",
    "https://krebsonsecurity.com/feed/",
    "https://www.schneier.com/feed/",
    "https://www.darkreading.com/rss.xml",
    "https://threatpost.com/feed/",
    "https://www.bleepingcomputer.com/feed/",
    "https://www.securityweek.com/rss",
    "https://www.infosecurity-magazine.com/rss/news/",
    "https://www.csoonline.com/index.rss"
]

feed_ingestor = FeedIngestor()
for i, feed_url in enumerate(osint_feeds, 1):
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
            
            feed_count = 0
            for item in feed_data.items:
                if not item.content:
                    item.content = item.description or item.title or ""
                if item.content:
                    if not hasattr(item, 'metadata'):
                        item.metadata = {}
                    item.metadata['source'] = feed_url
                    documents.append(item)
                    feed_count += 1
            
            if feed_count > 0:
                print(f"  [{i}/{len(osint_feeds)}] Feed: {feed_count} documents")
    except Exception as e:
        print(f"  [{i}/{len(osint_feeds)}] Feed failed: {str(e)[:50]}")
        pass

# Web ingestion from various intelligence and security sources
web_links = [
    "https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices",
    "https://www.unodc.org/unodc/en/data-and-analysis/index.html",
    "https://www.cisa.gov/news-events/cybersecurity-advisories",
    "https://www.us-cert.gov/ncas/alerts",
    "https://www.europol.europa.eu/newsroom",
    "https://www.ncsc.gov.uk/news",
    "https://www.cyber.gov.au/news"
]

web_ingestor = WebIngestor(respect_robots=False, delay=1.0)
for i, web_url in enumerate(web_links, 1):
    try:
        with redirect_stderr(StringIO()):
            web_content = web_ingestor.ingest_url(web_url)
            if web_content and web_content.text:
                # Add content attribute for compatibility with parser
                web_content.content = web_content.text
                if not hasattr(web_content, 'metadata'):
                    web_content.metadata = {}
                web_content.metadata['source'] = web_url
                documents.append(web_content)
                print(f"  [{i}/{len(web_links)}] Web: {len(web_content.text)} characters")
    except Exception as e:
        print(f"  [{i}/{len(web_links)}] Web failed: {str(e)[:50]}")
        pass

# Example: Web ingestion from FBI API (commented - requires authentication)
# web_ingestor = WebIngestor()
# fbi_docs = web_ingestor.ingest_url("https://api.fbi.gov/wanted/v1/list")

print(f"Ingested {len(documents)} documents")


  [1/12] Feed: 10 documents
  [2/12] Feed: 10 documents
  [3/12] Feed failed: Failed to parse feed: not well-formed (invalid tok

In [6]:
from semantica.parse import DocumentParser
from contextlib import redirect_stderr
from io import StringIO

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        with redirect_stderr(StringIO()):
            parsed = parser.parse(
                doc.content if hasattr(doc, 'content') else str(doc),
                format="auto"
            )
            parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc.content if hasattr(doc, 'content') else str(doc))
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

print(f"Parsed {len(parsed_documents)} documents")


Parsing 118 documents...

---

## Text Processing

Normalize entity names and split documents using entity-aware chunking to preserve network relationships.
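
To show how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of plain fixed-size chunking with overlap. It is not the entity-aware method used in the next cell, and `chunk_text` is a hypothetical helper; it only illustrates that each chunk starts `chunk_size - chunk_overlap` characters after the previous one:

```python
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


def chunk_text(text, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
print([len(chunk) for chunk in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-CHUNK_OVERLAP:] == chunks[1][:CHUNK_OVERLAP])
```

The overlap gives a relationship that straddles a chunk boundary a chance to appear whole in at least one chunk.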


In [7]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter
from contextlib import redirect_stderr
from io import StringIO

normalizer = TextNormalizer()
print(f"Normalizing {len(parsed_documents)} documents...")
normalized_docs = []

for i, doc in enumerate(parsed_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            normalized = normalizer.normalize(
                doc if isinstance(doc, str) else str(doc),
                clean_html=True,
                normalize_entities=True,
                remove_extra_whitespace=True
            )
            normalized_docs.append(normalized)
    except Exception:
        normalized_docs.append(doc if isinstance(doc, str) else str(doc))
    if i % 50 == 0 or i == len(parsed_documents):
        print(f"  Normalized {i}/{len(parsed_documents)} documents...")

# Use entity-aware chunking to preserve network relationships
entity_splitter = TextSplitter(
    method="entity_aware",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"Chunking {len(normalized_docs)} documents...")
chunked_docs = []
for i, doc_text in enumerate(normalized_docs, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = entity_splitter.split(doc_text)
            chunked_docs.extend([chunk.content if hasattr(chunk, 'content') else str(chunk) for chunk in chunks])
    except Exception:
        chunked_docs.append(doc_text)
    if i % 50 == 0 or i == len(normalized_docs):
        print(f"  Chunked {i}/{len(normalized_docs)} documents ({len(chunked_docs)} chunks so far)")

print(f"Created {len(chunked_docs)} chunks from {len(normalized_docs)} documents")


Normalizing 118 documents...
  Normalized 50/118 documents...
  Normalized 100/118 documents...
  Normalized 118/118 documents...
Chunking 118 documents...

---

## Entity Extraction

Extract criminal network entities, including persons, organizations, locations, events, and dates. Only the first 10 chunks are processed here, which keeps the run short.
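
The recorded output later in this notebook contains entity names such as `confidence=1.0` and `href="https://learn.microsoft.com`, which are markup debris rather than real entities. A minimal sketch of a cleanup pass that could run before graph construction; the `Entity` class, the `NOISE` pattern, and `clean_entities` are hypothetical, and the heuristic is an assumption rather than a Semantica feature:

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:  # hypothetical stand-in for an extracted entity
    text: str
    label: str


# Heuristic: markup characters, key=value debris, and URLs are not entity names.
NOISE = re.compile(r'[<>="]|https?://')


def clean_entities(entities, allowed_labels):
    """Keep entities with an allowed label and a plausible, markup-free name."""
    kept = []
    for entity in entities:
        text = entity.text.strip(" '\"")  # drop stray quotes left by bad parsing
        if entity.label in allowed_labels and text and not NOISE.search(text):
            kept.append(Entity(text, entity.label))
    return kept


raw = [
    Entity("Mimikatz", "ORG"),
    Entity("confidence=1.0", "PERSON"),
    Entity('href="https://learn.microsoft.com', "ORG"),
    Entity("the United Kingdom National Cyber Security Centre'", "ORG"),
    Entity("42%", "PERCENT"),
]
cleaned = clean_entities(raw, {"PERSON", "ORG", "GPE", "LOC", "EVENT", "DATE"})
print([entity.text for entity in cleaned])
```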


In [None]:
from semantica.semantic_extract import NERExtractor

extractor = NERExtractor(method="ml", model="en_core_web_sm")
chunks_to_process = chunked_docs[:10]
entity_results = extractor.extract(chunks_to_process)

all_entities = []
relevant_types = ["PERSON", "ORG", "GPE", "LOC", "EVENT", "DATE"]
for entities in entity_results:
    all_entities.extend([e for e in entities if e.label in relevant_types])

print(f"Extracted {len(all_entities)} entities")


Extracted 289 entities


---

## Relationship Extraction

Extract network relationships between entities such as associations, connections, involvement, and location relationships.
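
One of the methods configured in the next cell is `"cooccurrence"` with `max_distance=100`. A minimal sketch of the general idea, assuming the distance is measured in characters between mention start offsets; that unit, the `Span` type, and `cooccurrence_pairs` are assumptions for illustration and may differ from Semantica's implementation:

```python
from itertools import combinations
from typing import NamedTuple


class Span(NamedTuple):  # hypothetical entity mention with a character offset
    text: str
    start: int


def cooccurrence_pairs(spans, max_distance=100):
    """Pair distinct entities whose mentions start within max_distance characters."""
    pairs = []
    ordered = sorted(spans, key=lambda span: span.start)
    for first, second in combinations(ordered, 2):
        gap = second.start - first.start
        if first.text != second.text and gap <= max_distance:
            pairs.append((first.text, "co_occurs_with", second.text, gap))
    return pairs


spans = [Span("Mimikatz", 10), Span("Microsoft", 60), Span("Australia", 400)]
pairs = cooccurrence_pairs(spans)
print(pairs)
```

Co-occurrence only says that two entities appear near each other, so such edges are usually treated as weaker evidence than dependency or pattern matches.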


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method=["dependency", "pattern", "cooccurrence"],
    model="en_core_web_sm",
    confidence_threshold=0.5,
    max_distance=100
)

relevant_types = ["PERSON", "ORG", "GPE", "LOC", "EVENT", "DATE"]
chunk_entities_list = [[e for e in entities if e.label in relevant_types] for entities in entity_results]
relation_results = relation_extractor.extract(chunks_to_process, chunk_entities_list)

all_relationships = []
seen = set()
for relationships in relation_results:
    for rel in relationships:
        key = (rel.subject.text, rel.predicate, rel.object.text)
        if key not in seen:
            seen.add(key)
            all_relationships.append(rel)

print(f"Extracted {len(all_relationships)} relationships")


---

## Conflict Detection

Detect and resolve conflicts in intelligence data from multiple sources. Because intelligence sources differ in credibility, conflicts are resolved with a credibility-weighted strategy.
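
As an illustration of what a credibility-weighted strategy can mean, the sketch below sums per-source credibility for each competing label and keeps the label with the highest total. The source names, the scores, and `resolve_label` are hypothetical; this is not Semantica's `ConflictResolver` logic:

```python
from collections import defaultdict

# Hypothetical credibility scores per source type (0.0 to 1.0).
CREDIBILITY = {"agency_advisory": 0.9, "news_feed": 0.6, "blog": 0.4}


def resolve_label(claims, credibility, default_weight=0.5):
    """Pick the label backed by the highest total source credibility.

    claims is a list of (source, label) pairs; ties go to the alphabetically first label.
    """
    scores = defaultdict(float)
    for source, label in claims:
        scores[label] += credibility.get(source, default_weight)
    return max(sorted(scores), key=lambda label: scores[label])


claims = [("blog", "PERSON"), ("agency_advisory", "ORG")]
winner = resolve_label(claims, CREDIBILITY)
print(winner)
```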


In [16]:
from semantica.conflicts import ConflictDetector, ConflictResolver

conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

entity_dicts = [
    {
        "id": getattr(e, "text", str(e)),
        "text": getattr(e, "text", str(e)),
        "label": getattr(e, "label", ""),
        "metadata": getattr(e, "metadata", {})
    }
    for e in all_entities
]

print(f"Detecting conflicts in {len(entity_dicts)} entities...")
conflicts = conflict_detector.detect_entity_conflicts(entity_dicts)

if all_relationships:
    relationship_dicts = [
        {
            "source_id": getattr(rel.subject, "text", str(rel.subject)),
            "target_id": getattr(rel.object, "text", str(rel.object)),
            "type": rel.predicate,
            "confidence": rel.confidence,
            "metadata": rel.metadata
        }
        for rel in all_relationships
    ]
    relationship_conflicts = conflict_detector.detect_relationship_conflicts(relationship_dicts)
    conflicts.extend(relationship_conflicts)

print(f"Detected {len(conflicts)} conflicts")

if conflicts:
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="credibility_weighted"
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")


Detecting conflicts in 289 entities...
Value conflict detected: PII.label has conflicting values: ['PERSON', 'ORG']
Value conflict detected: China Chopper.label has conflicting values: ['PERSON', 'ORG']
Value conflict detected: href="https://learn.microsoft.com.label has conflicting values: ['PERSON', 'ORG']
Detected 3 conflicts
Resolved 3 conflicts


---

## Knowledge Graph Construction

Build the criminal network knowledge graph from extracted entities and relationships.
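
A knowledge graph usually has fewer nodes than extracted mentions, because repeated mentions of the same name collapse into one node. A minimal sketch of that merge, assuming nodes are keyed by lowercased text; `build_graph` and the sample data are hypothetical, and `GraphBuilder` may merge differently:

```python
def build_graph(entities, relationships):
    """Merge entity mentions into nodes and keep edges between known nodes."""
    nodes = {}
    for text, label in entities:
        key = text.strip().lower()
        node = nodes.setdefault(key, {"text": text.strip(), "labels": set(), "mentions": 0})
        node["labels"].add(label)  # more than one label signals a type conflict
        node["mentions"] += 1
    edges = set()
    for subject, predicate, obj in relationships:
        source, target = subject.strip().lower(), obj.strip().lower()
        if source in nodes and target in nodes and source != target:
            edges.add((source, predicate, target))
    return nodes, edges


entities = [
    ("Microsoft", "ORG"), ("microsoft", "ORG"),
    ("Mimikatz", "ORG"), ("Mimikatz", "PERSON"),
    ("Australia", "GPE"),
]
relationships = [
    ("Mimikatz", "targets", "Microsoft"),
    ("Mimikatz", "targets", "microsoft"),
    ("Mimikatz", "located_in", "Atlantis"),  # dropped: unknown target node
]
nodes, edges = build_graph(entities, relationships)
print(len(nodes), len(edges), sorted(nodes["mimikatz"]["labels"]))
```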


In [18]:
from semantica.kg import GraphBuilder

builder = GraphBuilder()

print("Building knowledge graph...")
kg = builder.build(
    sources=all_entities,
    relationships=all_relationships
)

print(f"Built KG with {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


Building knowledge graph...
Building graph structure...
✅ Graph structure built (0.00s)
Type conflict detected: PII conflicting types: ['PERSON', 'ORG']
Type conflict detected: China Chopper conflicting types: ['PERSON', 'ORG']
Type conflict detected: href="https://learn.microsoft.com conflicting types: ['PERSON', 'ORG']
Detected 3 conflict(s) in graph
No conflicts were automatically resolved

✅ Knowledge Graph Build Complete
   Entities: 289
   Relationships: 176
   Total time: 4.39s
Built KG with 289 entities and 176 relationships


---

## Embedding Generation & Vector Store

Generate embeddings for intelligence documents and store them in a vector database for semantic search.
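
Conceptually, semantic search compares a query vector against every stored vector and returns the closest ones. A minimal brute-force sketch using cosine similarity; `TinyVectorStore` is a hypothetical teaching aid, not the Semantica `VectorStore` or the FAISS API:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    def __init__(self, dimension):
        self.dimension = dimension
        self.rows = []

    def add(self, doc_id, vector, metadata):
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} values, got {len(vector)}")
        self.rows.append((doc_id, list(vector), metadata))

    def search(self, query, k=3):
        """Return the k (score, doc_id) pairs most similar to the query."""
        scored = [(cosine(query, vector), doc_id) for doc_id, vector, _ in self.rows]
        return sorted(scored, reverse=True)[:k]


store = TinyVectorStore(dimension=3)
store.add("0", [1.0, 0.0, 0.0], {"text": "chunk 0"})
store.add("1", [0.0, 1.0, 0.0], {"text": "chunk 1"})
store.add("2", [1.0, 1.0, 0.0], {"text": "chunk 2"})
hits = store.search([1.0, 0.1, 0.0], k=2)
print([doc_id for _, doc_id in hits])
```

Checking that the number of stored rows matches the number of chunks is a cheap way to notice an embedding step that silently returned nothing.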


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore

embedding_gen = EmbeddingGenerator(model_name=EMBEDDING_MODEL, dimension=EMBEDDING_DIMENSION)
chunks_to_embed = chunked_docs[:20]

embeddings = embedding_gen.generate_embeddings(chunks_to_embed)

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)
for i, (chunk, embedding) in enumerate(zip(chunks_to_embed, embeddings)):
    vector_store.add(str(i), embedding, {"text": chunk[:100]})

print(f"Generated {len(embeddings)} embeddings and stored in vector database")


fastembed not available. Install with: pip install fastembed. Using fallback embedding method.


Generating embeddings for 20 chunks...
  Generated 5/20 embeddings...
  Generated 10/20 embeddings...
  Generated 15/20 embeddings...
  Generated 20/20 embeddings...
Storing 0 embeddings in vector store...
Generated 0 embeddings and stored in vector database


---

## Network Centrality Analysis

Calculate centrality measures to identify key players in the criminal network. Degree centrality highlights the most connected entities, while betweenness centrality highlights brokers that bridge otherwise separate groups.
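
To make the two measures concrete, here is a small pure-Python sketch on a toy graph: two triangles joined through a single intermediary `X`. Degree centrality is the node degree divided by `n - 1`, and betweenness is computed with Brandes' algorithm for unweighted, undirected graphs. The graph and helper names are hypothetical, and this is not how `CentralityCalculator` is implemented:

```python
from collections import deque


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    n = len(adj)
    return {v: (len(nb) / (n - 1) if n > 1 else 0.0) for v, nb in adj.items()}


def betweenness_centrality(adj):
    """Unnormalized betweenness (Brandes) for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                  # BFS from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: score / 2 for v, score in bc.items()}  # each pair was counted twice


edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "X"),
         ("X", "D"), ("D", "E"), ("D", "F"), ("E", "F")]
adj = build_adjacency(edges)
degree = degree_centrality(adj)
between = betweenness_centrality(adj)
print(max(degree, key=degree.get), max(between, key=between.get))
```

In this toy graph `X` has only two connections but sits on every path between the two groups, so it ranks highest on betweenness while `C` and `D` lead on degree.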


In [23]:
from semantica.kg import CentralityCalculator

calculator = CentralityCalculator()
all_centrality = calculator.calculate_all_centrality(kg)

degree = all_centrality["centrality_measures"]["degree"]
betweenness = all_centrality["centrality_measures"]["betweenness"]

print(f"Top 5 key players: {[p['node'] for p in degree['rankings'][:5]]}")
print(f"Top 5 brokers: {[b['node'] for b in betweenness['rankings'][:5]]}")

Top 5 key players: ['confidence=1.0', 'Mimikatz', 'Microsoft', '2016', 'Australia']
Top 5 brokers: ['confidence=1.0', 'Mimikatz', 'Microsoft', "the United Kingdom National Cyber Security Centre'", 'PowerShell']


---

## Community Detection

Detect criminal communities and groups in the network, using the Louvain algorithm and an overlapping-community method. Tightly connected groups can point to organized crime structures.
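
Louvain searches for a partition that maximizes modularity, a score that rewards many edges inside communities relative to what node degrees alone would predict. The sketch below only scores a given partition with Q = Σ_c [L_c / m − (d_c / 2m)²], where L_c is the number of edges inside community c, d_c the sum of its node degrees, and m the total edge count; it does not perform the Louvain search, and `modularity` is a hypothetical helper:

```python
def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    membership = {node: index for index, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)    # edges with both ends in the community
    degree_sum = [0] * len(communities)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree_sum))


# Two triangles joined by a single bridge edge.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
q_split = modularity(edges, [{"A", "B", "C"}, {"D", "E", "F"}])
q_single = modularity(edges, [{"A", "B", "C", "D", "E", "F"}])
print(round(q_split, 4), round(q_single, 4))
```

Splitting the graph at the bridge scores higher than lumping everything together, which is the kind of improvement a modularity-based method looks for.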


In [24]:
from semantica.kg import CommunityDetector

detector = CommunityDetector()
communities = detector.detect_communities(kg, "louvain")
overlapping = detector.detect_communities(kg, "overlapping")

print(f"Detected {len(communities.get('communities', []))} communities")
print(f"Detected {len(overlapping.get('communities', []))} overlapping communities")


Detected 19 communities
Detected 13 overlapping communities


---

## Graph Analytics

Perform comprehensive graph analytics including path finding and connectivity analysis to understand network structure.
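
Connectivity analysis and path finding both reduce to breadth-first search on an unweighted graph. A minimal sketch, assuming the graph is an adjacency dictionary; the helper names and toy graph are hypothetical and independent of `GraphAnalyzer`:

```python
from collections import deque


def connected_components(adj):
    """Group nodes into components using breadth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


def shortest_path(adj, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adj[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": ["E"], "E": ["D"], "F": []}
components = connected_components(adj)
print(components)
print(shortest_path(adj, "A", "C"), shortest_path(adj, "A", "E"))
```

Many small components, like the 13 reported in this run, usually mean the extracted relationships are too sparse to link the sources into one network.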


In [25]:
from semantica.kg import GraphAnalyzer

analyzer = GraphAnalyzer()
results = analyzer.analyze_graph(kg)

stats = results.get("metrics", {})
connectivity = results.get("connectivity", {})

print(f"Graph: {stats.get('num_nodes', 0)} nodes, {stats.get('num_edges', 0)} edges")
print(f"Connected components: {len(connectivity.get('components', []))}")


Graph: 125 nodes, 168 edges
Connected components: 13


---

## GraphRAG Queries

Use hybrid retrieval combining vector search and graph traversal to answer complex intelligence questions.
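
One common way to combine the two signals is a weighted sum controlled by a single parameter. The sketch below assumes that `alpha` weights the vector-similarity score and `1 - alpha` the graph score; whether `hybrid_alpha` in `AgentContext` works this way is an assumption, and `hybrid_rank` with its sample scores is hypothetical:

```python
def hybrid_rank(vector_scores, graph_scores, alpha=0.7):
    """Rank document ids by alpha * vector score + (1 - alpha) * graph score."""
    doc_ids = set(vector_scores) | set(graph_scores)
    combined = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1 - alpha) * graph_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest score first; ties are broken by id for a stable order.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


vector_scores = {"doc1": 0.9, "doc2": 0.4}  # e.g. cosine similarity to the query
graph_scores = {"doc2": 1.0, "doc3": 0.5}   # e.g. closeness to entities in the query
vector_heavy = [doc_id for doc_id, _ in hybrid_rank(vector_scores, graph_scores, alpha=0.7)]
graph_heavy = [doc_id for doc_id, _ in hybrid_rank(vector_scores, graph_scores, alpha=0.3)]
print(vector_heavy, graph_heavy)
```

Lowering `alpha` lets documents that are well connected in the graph outrank documents that merely look similar to the query text.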


In [34]:
from semantica.context import AgentContext, ContextGraph
from semantica.llms import Groq
import os

context_graph = ContextGraph()
context_graph.build_from_entities_and_relationships(
    entities=kg.get('entities', []),
    relationships=[{**r, 'source_id': r.get('source_id') or r.get('source'), 'target_id': r.get('target_id') or r.get('target')} for r in kg.get('relationships', [])]
)

graph_stats = context_graph.stats()
print(f"Intelligence Context Graph: {graph_stats['node_count']} nodes, {graph_stats['edge_count']} edges")

context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=context_graph,
    hybrid_alpha=0.7,
    use_graph_expansion=True,
    max_expansion_hops=3
)

for chunk in chunked_docs[:30]:
    if chunk and chunk.strip():
        context.store(
            content=chunk,
            metadata={'source': 'criminal_intelligence'},
            extract_entities=True,
            link_entities=True
        )

llm = Groq(model="llama-3.1-8b-instant", api_key=os.getenv("GROQ_API_KEY"))

intelligence_queries = [
    "Who are the key players and central nodes in the criminal network?",
    "What are the operational relationships between criminal organizations?"
]

print("\n" + "=" * 80)
print("Criminal Intelligence Analysis - GraphRAG with Multi-Hop Reasoning")
print("=" * 80)

for query in intelligence_queries:
    print(f"\n{'='*80}")
    print(f"Intelligence Query: {query}")
    print(f"{'='*80}\n")
    
    result = context.query_with_reasoning(
        query=query,
        llm_provider=llm,
        max_results=15,
        max_hops=3,
        min_score=0.2
    )
    
    print(f"Generated Response:\n{result.get('response', 'No response available')}\n")
    
    if result.get('reasoning_path'):
        print(f"Reasoning Path:\n{result.get('reasoning_path')}\n")
    
    print(f"Confidence: {result.get('confidence', 0):.3f}")
    print(f"Sources: {result.get('num_sources', 0)}")
    print(f"Reasoning Paths: {result.get('num_reasoning_paths', 0)}")
    print()


Intelligence Context Graph: 155 nodes, 176 edges

Criminal Intelligence Analysis - GraphRAG with Multi-Hop Reasoning

Intelligence Query: Who are the key players and central nodes in the criminal network?


Embedding generation failed: Text cannot be empty or whitespace-only
Using random fallback embedding


Generated Response:
Based on the retrieved context, I was unable to identify key players and central nodes in a criminal network. The context appears to be related to the United Kingdom and its National Cyber Security Centre, but it does not provide information about a specific criminal network.

However, I can infer that the United Kingdom National Cyber Security Centre (Context 2 and Context 3) might be involved in combating cybercrime, which could be related to a criminal network. But without more specific information, it is not possible to determine the key players and central nodes in such a network.

If I had to make an educated guess, I would say that the key players and central nodes in a criminal network might be individuals or organizations that are involved in cybercrime, such as hackers, cyberterrorists, or organized crime groups. However, this is purely speculative and not based on any information from the retrieved context.

In terms of multi-hop connections, I did not fi

Embedding generation failed: Text cannot be empty or whitespace-only
Using random fallback embedding


Generated Response:
Based on the retrieved context, I couldn't find direct information about the operational relationships between criminal organizations. However, I can provide some insights by making multi-hop connections.

One possible connection is through the concept of "Area Network" (Context 2, Score: 0.43), which can be related to organized crime groups using communication networks to coordinate their activities. For instance, a reasoning path could be:

Area Network → Communication Network → Organized Crime Group → Operational Relationship

However, this connection is indirect and requires additional information to establish a clear relationship between criminal organizations.

Another possible connection is through the concept of "the United Kingdom" (Context 4, Score: 0.42), which is related to the United Kingdom National Cyber Security Centre (Context 1, Score: 0.44). While this connection doesn't directly relate to criminal organizations, it could be used to infer th

---

## Visualization

Visualize the criminal network to explore relationships, communities, and key players.


In [35]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer(layout="force", color_scheme="vibrant")
visualizer.visualize_network(kg, output="interactive")


Visualization generated: 289 nodes, 176 edges

---

## Export

Export the knowledge graph in multiple formats for intelligence reporting and further analysis.
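
For a sense of what the JSON and CSV outputs can contain, here is a minimal standard-library sketch that serializes an edge list both ways. The field names and `export_edges` are hypothetical; the files written by the Semantica exporters may use a different schema:

```python
import csv
import io
import json


def export_edges(edges):
    """Serialize (source, type, target) triples to a JSON string and a CSV string."""
    nodes = sorted({node for source, _, target in edges for node in (source, target)})
    payload = {
        "nodes": nodes,
        "edges": [{"source": s, "type": r, "target": t} for s, r, t in edges],
    }
    as_json = json.dumps(payload, indent=2)

    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["source", "type", "target"])
    writer.writerows(edges)
    return as_json, buffer.getvalue()


edges = [("Mimikatz", "targets", "Microsoft"), ("Microsoft", "located_in", "Australia")]
as_json, as_csv = export_edges(edges)
print(as_csv)
```

Writing to in-memory strings keeps the sketch free of side effects; replacing `io.StringIO()` with an opened file would write to disk instead.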


In [36]:
from semantica.export import GraphExporter, JSONExporter, CSVExporter

GraphExporter().export_knowledge_graph(kg, "criminal_network.graphml", format="graphml")
JSONExporter().export_knowledge_graph(kg, "criminal_network.json")
CSVExporter().export_knowledge_graph(kg, "criminal_network.csv")

print("Exported knowledge graph in JSON, GraphML, and CSV formats")


Exported knowledge graph in JSON, GraphML, and CSV formats
