[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/intelligence/01_Criminal_Network_Analysis.ipynb)

# Criminal Network Analysis - Graph Analytics & Centrality

## Overview

This notebook demonstrates **criminal network analysis** using Semantica, with a focus on **network centrality**, **community detection**, and **relationship mapping**. The pipeline processes OSINT feeds, police reports, and court records to build knowledge graphs for analyzing criminal networks and identifying key players and communities.

### Key Features

- **Network Centrality**: Uses centrality measures (degree, betweenness, closeness, eigenvector) to identify key players
- **Community Detection**: Detects criminal communities and groups using the Louvain algorithm, plus a pass for overlapping communities
- **Relationship Mapping**: Maps relationships between persons, organizations, and events
- **Graph Analytics**: Comprehensive graph analysis including path finding and connectivity
- **Intelligence Reporting**: Generates intelligence reports from network analysis

### Learning Objectives

- Understand how to analyze criminal networks using graph analytics
- Learn to identify key players using centrality measures
- Master community detection algorithms for criminal group identification
- Explore relationship mapping and path finding in networks
- Practice graph analytics for intelligence reporting
- Analyze network structure and connectivity patterns

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[KG Construction]
    G --> H[Embedding Generation]
    H --> I[Vector Store]
    G --> J[Centrality Analysis]
    G --> K[Community Detection]
    G --> L[Graph Analytics]
    I --> M[GraphRAG Queries]
    J --> N[Visualization]
    K --> N
    L --> N
    G --> O[Export]
```


---


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


---

## Configuration & Setup

Configure API keys and set up constants for the criminal network analysis pipeline.


In [None]:
import os

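# Read the key from the environment; never hardcode a real key in a notebook.
# "YOUR_GROQ_API_KEY" is a placeholder and must be replaced before the GraphRAG cell runs.
os.environ.setdefault("GROQ_API_KEY", "YOUR_GROQ_API_KEY")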

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


---

## Data Ingestion

Ingest intelligence data from OSINT RSS feeds and public web pages. `FileIngestor` is imported for local files but is not used in this cell.
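
The ingestion cell falls back from an item's content to its description and then its title. The sketch below illustrates a simplified version of that fallback (description, then title) with the standard library only. It does not use Semantica: the sample feed and the `pick_content` helper are hypothetical, and `FeedIngestor` may parse feeds differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-item RSS feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>Alert A</title><description>Details about alert A</description></item>
  <item><title>Alert B</title></item>
  <item></item>
</channel></rss>"""


def pick_content(item):
    """Return the description if present, otherwise fall back to the title."""
    description = (item.findtext("description") or "").strip()
    title = (item.findtext("title") or "").strip()
    return description or title


root = ET.fromstring(SAMPLE_RSS)
docs = []
for item in root.iter("item"):
    text = pick_content(item)
    if text:  # skip items with no usable text
        docs.append({"content": text, "metadata": {"source": "sample-feed"}})

print([d["content"] for d in docs])
```

Items with neither a description nor a title are dropped, which mirrors the `if item.content:` guard in the cell that follows.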


In [None]:
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
from contextlib import redirect_stderr
from io import StringIO
import os

os.makedirs("data", exist_ok=True)

documents = []

# Ingest from OSINT RSS feeds
osint_feeds = [
    "https://www.us-cert.gov/ncas/alerts.xml",
    "https://www.europol.europa.eu/rss.xml",
    "https://www.treasury.gov/resource-center/sanctions/OFAC-Enforcement/Pages/rss.xml",
    "https://feeds.feedburner.com/oreilly/radar",
    "https://krebsonsecurity.com/feed/",
    "https://www.schneier.com/feed/",
    "https://www.darkreading.com/rss.xml",
    "https://threatpost.com/feed/",
    "https://www.bleepingcomputer.com/feed/",
    "https://www.securityweek.com/rss",
    "https://www.infosecurity-magazine.com/rss/news/",
    "https://www.csoonline.com/index.rss"
]

feed_ingestor = FeedIngestor()
for i, feed_url in enumerate(osint_feeds, 1):
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
            
            feed_count = 0
            for item in feed_data.items:
                if not item.content:
                    item.content = item.description or item.title or ""
                if item.content:
                    if not hasattr(item, 'metadata'):
                        item.metadata = {}
                    item.metadata['source'] = feed_url
                    documents.append(item)
                    feed_count += 1
            
            if feed_count > 0:
                print(f"  [{i}/{len(osint_feeds)}] Feed: {feed_count} documents")
    except Exception as e:
        print(f"  [{i}/{len(osint_feeds)}] Feed failed: {str(e)[:50]}")

# Web ingestion from various intelligence and security sources
web_links = [
    "https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices",
    "https://www.unodc.org/unodc/en/data-and-analysis/index.html",
    "https://www.cisa.gov/news-events/cybersecurity-advisories",
    "https://www.us-cert.gov/ncas/alerts",
    "https://www.europol.europa.eu/newsroom",
    "https://www.ncsc.gov.uk/news",
    "https://www.cyber.gov.au/news"
]

web_ingestor = WebIngestor(respect_robots=False, delay=1.0)
for i, web_url in enumerate(web_links, 1):
    try:
        with redirect_stderr(StringIO()):
            web_content = web_ingestor.ingest_url(web_url)
            if web_content and web_content.text:
                # Add content attribute for compatibility with parser
                web_content.content = web_content.text
                if not hasattr(web_content, 'metadata'):
                    web_content.metadata = {}
                web_content.metadata['source'] = web_url
                documents.append(web_content)
                print(f"  [{i}/{len(web_links)}] Web: {len(web_content.text)} characters")
    except Exception as e:
        print(f"  [{i}/{len(web_links)}] Web failed: {str(e)[:50]}")

# Example: Web ingestion from FBI API (commented - requires authentication)
# web_ingestor = WebIngestor()
# fbi_docs = web_ingestor.ingest_url("https://api.fbi.gov/wanted/v1/list")

print(f"Ingested {len(documents)} documents")


In [None]:
from semantica.parse import DocumentParser
from contextlib import redirect_stderr
from io import StringIO

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        with redirect_stderr(StringIO()):
            parsed = parser.parse(
                doc.content if hasattr(doc, 'content') else str(doc),
                format="auto"
            )
            parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc.content if hasattr(doc, 'content') else str(doc))
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

print(f"Parsed {len(parsed_documents)} documents")


---

## Text Processing

Normalize entity names and split documents using entity-aware chunking to preserve network relationships.
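
To show how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch of plain fixed-size chunking with overlap. It is not the entity-aware method used by `TextSplitter`, whose internals are not shown in this notebook; `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 2500
print([len(c) for c in chunk_text(sample)])  # [1000, 1000, 900]
```

With a size of 1000 and an overlap of 200, each window starts 800 characters after the previous one, so an entity mentioned near a boundary appears in both neighbouring chunks.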


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter
from contextlib import redirect_stderr
from io import StringIO

normalizer = TextNormalizer()
print(f"Normalizing {len(parsed_documents)} documents...")
normalized_docs = []

for i, doc in enumerate(parsed_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            normalized = normalizer.normalize(
                doc if isinstance(doc, str) else str(doc),
                clean_html=True,
                normalize_entities=True,
                remove_extra_whitespace=True
            )
            normalized_docs.append(normalized)
    except Exception:
        normalized_docs.append(doc if isinstance(doc, str) else str(doc))
    if i % 50 == 0 or i == len(parsed_documents):
        print(f"  Normalized {i}/{len(parsed_documents)} documents...")

# Use entity-aware chunking to preserve network relationships
entity_splitter = TextSplitter(
    method="entity_aware",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"Chunking {len(normalized_docs)} documents...")
chunked_docs = []
for i, doc_text in enumerate(normalized_docs, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = entity_splitter.split(doc_text)
            chunked_docs.extend([chunk.content if hasattr(chunk, 'content') else str(chunk) for chunk in chunks])
    except Exception:
        chunked_docs.append(doc_text)
    if i % 50 == 0 or i == len(normalized_docs):
        print(f"  Chunked {i}/{len(normalized_docs)} documents ({len(chunked_docs)} chunks so far)")

print(f"Created {len(chunked_docs)} chunks from {len(normalized_docs)} documents")


---

## Entity Extraction

Extract criminal network entities such as persons, organizations, locations, events, and dates. Only the first 10 chunks are processed to keep the demo fast.


In [None]:
from semantica.semantic_extract import NERExtractor

extractor = NERExtractor(method="ml", model="en_core_web_sm")
chunks_to_process = chunked_docs[:10]
entity_results = extractor.extract(chunks_to_process)

all_entities = []
relevant_types = ["PERSON", "ORG", "GPE", "LOC", "EVENT", "DATE"]
for entities in entity_results:
    all_entities.extend([e for e in entities if e.label in relevant_types])

print(f"Extracted {len(all_entities)} entities")


---

## Relationship Extraction

Extract network relationships between entities such as associations, connections, involvement, and location relationships.
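
The co-occurrence method can be pictured as pairing entities that appear close together in the text. The sketch below is an illustration under assumptions: it measures distance in characters between start offsets, whereas `RelationExtractor` may define `max_distance` differently, and `cooccurrence_pairs` is a hypothetical helper.

```python
from itertools import combinations


def cooccurrence_pairs(entities, max_distance=100):
    """Pair entities whose start offsets are at most max_distance characters apart.

    entities: list of (text, start_offset) tuples taken from one chunk.
    """
    ordered = sorted(entities, key=lambda e: e[1])
    pairs = []
    for (a, a_pos), (b, b_pos) in combinations(ordered, 2):
        if a != b and abs(b_pos - a_pos) <= max_distance:
            pairs.append((a, "co_occurs_with", b))
    return pairs


# Hypothetical entity mentions with character offsets.
mentions = [("Alice", 0), ("Acme Corp", 40), ("Bob", 350)]
print(cooccurrence_pairs(mentions))  # [('Alice', 'co_occurs_with', 'Acme Corp')]
```

Co-occurrence alone produces many weak links, which is one reason to combine it with dependency and pattern methods and a confidence threshold.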


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method=["dependency", "pattern", "cooccurrence"],
    model="en_core_web_sm",
    confidence_threshold=0.5,
    max_distance=100
)

relevant_types = ["PERSON", "ORG", "GPE", "LOC", "EVENT", "DATE"]
chunk_entities_list = [[e for e in entities if e.label in relevant_types] for entities in entity_results]
relation_results = relation_extractor.extract(chunks_to_process, chunk_entities_list)

all_relationships = []
seen = set()
for relationships in relation_results:
    for rel in relationships:
        key = (rel.subject.text, rel.predicate, rel.object.text)
        if key not in seen:
            seen.add(key)
            all_relationships.append(rel)

print(f"Extracted {len(all_relationships)} relationships")


---

## Conflict Detection

Detect and resolve conflicts in intelligence data from multiple sources. Intelligence sources have different credibility levels.
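
One way to read a credibility-weighted strategy is that each conflicting value is scored by the summed credibility of the sources reporting it. The sketch below illustrates that idea only; the claim schema and `resolve_by_credibility` are hypothetical, and `ConflictResolver` may weight sources differently.

```python
from collections import defaultdict


def resolve_by_credibility(claims):
    """Pick, per (entity, attribute), the value with the highest summed credibility.

    claims: list of dicts with keys entity, attribute, value, credibility.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for claim in claims:
        key = (claim["entity"], claim["attribute"])
        scores[key][claim["value"]] += claim["credibility"]
    return {key: max(values, key=values.get) for key, values in scores.items()}


# Hypothetical conflicting reports about one person's location.
claims = [
    {"entity": "J. Doe", "attribute": "location", "value": "Lisbon", "credibility": 0.9},
    {"entity": "J. Doe", "attribute": "location", "value": "Madrid", "credibility": 0.4},
    {"entity": "J. Doe", "attribute": "location", "value": "Madrid", "credibility": 0.3},
]
print(resolve_by_credibility(claims))  # {('J. Doe', 'location'): 'Lisbon'}
```

Here one highly credible source (0.9) outweighs two weaker sources that agree with each other (0.7 combined).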


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

entity_dicts = [
    {
        "id": getattr(e, "text", str(e)),
        "text": getattr(e, "text", str(e)),
        "label": getattr(e, "label", ""),
        "metadata": getattr(e, "metadata", {})
    }
    for e in all_entities
]

print(f"Detecting conflicts in {len(entity_dicts)} entities...")
conflicts = conflict_detector.detect_entity_conflicts(entity_dicts)

if all_relationships:
    relationship_dicts = [
        {
            "source_id": getattr(rel.subject, "text", str(rel.subject)),
            "target_id": getattr(rel.object, "text", str(rel.object)),
            "type": rel.predicate,
            "confidence": rel.confidence,
            "metadata": rel.metadata
        }
        for rel in all_relationships
    ]
    relationship_conflicts = conflict_detector.detect_relationship_conflicts(relationship_dicts)
    conflicts.extend(relationship_conflicts)

print(f"Detected {len(conflicts)} conflicts")

if conflicts:
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="credibility_weighted"
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")


---

## Knowledge Graph Construction

Build the criminal network knowledge graph from extracted entities and relationships.


In [None]:
from semantica.kg import GraphBuilder

builder = GraphBuilder()

print("Building knowledge graph...")
kg = builder.build(
    sources=all_entities,
    relationships=all_relationships
)

print(f"Built KG with {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


---

## Embedding Generation & Vector Store

Generate embeddings for intelligence documents and store them in a vector database for semantic search.
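
Semantic search over a vector store comes down to ranking stored vectors by similarity to a query vector. The sketch below shows that step with cosine similarity in pure Python on toy 3-dimensional vectors. It is not how FAISS is implemented, and `cosine`, `search`, and `toy_store` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, store, top_k=2):
    """Return the ids of the top_k stored vectors most similar to query_vec."""
    ranked = sorted(store.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Toy vectors standing in for 384-dimensional sentence embeddings.
toy_store = {"0": [1.0, 0.0, 0.0], "1": [0.7, 0.7, 0.0], "2": [0.0, 0.0, 1.0]}
print(search([1.0, 0.1, 0.0], toy_store))  # ['0', '1']
```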


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore

embedding_gen = EmbeddingGenerator(model_name=EMBEDDING_MODEL, dimension=EMBEDDING_DIMENSION)
chunks_to_embed = chunked_docs[:20]

embeddings = embedding_gen.generate_embeddings(chunks_to_embed)

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)
for i, (chunk, embedding) in enumerate(zip(chunks_to_embed, embeddings)):
    vector_store.add(str(i), embedding, {"text": chunk[:100]})

print(f"Generated {len(embeddings)} embeddings and stored in vector database")


---

## Network Centrality Analysis

Calculate centrality measures to identify key players in the criminal network. Degree centrality highlights the most connected actors, while betweenness centrality highlights brokers who bridge otherwise separate groups.
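
To make the two measures concrete, here is a self-contained sketch that computes degree centrality and unnormalized betweenness (using Brandes' algorithm) on a toy undirected graph. It is independent of `CentralityCalculator`, whose normalization and output format may differ; the graph and function names are hypothetical.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adj)
    return {v: (len(nbrs) / (n - 1) if n > 1 else 0.0) for v, nbrs in adj.items()}


def betweenness_centrality(adj):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Two tight triangles (A-B-C and E-F-G) joined only through node D.
adj = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D", "F", "G"], "F": ["E", "G"], "G": ["E", "F"],
}
degree = degree_centrality(adj)
betweenness = betweenness_centrality(adj)
print(sorted(degree.items(), key=lambda kv: kv[1], reverse=True)[:3])
print(sorted(betweenness.items(), key=lambda kv: kv[1], reverse=True)[:3])
```

Node D has only two links, yet every shortest path between the two triangles passes through it, so it ranks first by betweenness while C and E rank first by degree.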


In [None]:
from semantica.kg import CentralityCalculator

calculator = CentralityCalculator()
all_centrality = calculator.calculate_all_centrality(kg)

degree = all_centrality["centrality_measures"]["degree"]
betweenness = all_centrality["centrality_measures"]["betweenness"]

print(f"Top 5 key players: {[p['node'] for p in degree['rankings'][:5]]}")
print(f"Top 5 brokers: {[b['node'] for b in betweenness['rankings'][:5]]}")

---

## Community Detection

Detect criminal communities and groups in the network. The cell runs Louvain for a disjoint partition and an overlapping method for actors who may belong to more than one group, which helps surface organized crime structures.
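
Louvain searches for a partition that maximizes modularity, a score that compares the share of edges inside communities with the share expected if edges were placed at random. The sketch below computes that score for a given partition. It does not implement Louvain itself, and the toy graph and `modularity` helper are hypothetical.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    community_of = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)    # edges with both ends in community i
    degree_sum = [0] * len(communities)  # total degree of the nodes in community i
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree_sum))


# Two triangles joined by a single edge (C-D).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
two_groups = modularity(edges, [{"A", "B", "C"}, {"D", "E", "F"}])
one_group = modularity(edges, [{"A", "B", "C", "D", "E", "F"}])
print(round(two_groups, 3), round(one_group, 3))  # 0.357 0.0
```

Splitting the graph at the bridge scores higher than lumping everything together, which is the kind of improvement a modularity-based method looks for.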


In [None]:
from semantica.kg import CommunityDetector

detector = CommunityDetector()
communities = detector.detect_communities(kg, "louvain")
overlapping = detector.detect_communities(kg, "overlapping")

print(f"Detected {len(communities.get('communities', []))} communities")
print(f"Detected {len(overlapping.get('communities', []))} overlapping communities")


---

## Graph Analytics

Perform comprehensive graph analytics including path finding and connectivity analysis to understand network structure.
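
Connectivity and path finding can both be built on breadth-first search. The following sketch finds connected components and a shortest path on a toy adjacency list. It is a standalone illustration, not the `GraphAnalyzer` implementation, and the function names are hypothetical.

```python
from collections import deque


def connected_components(adj):
    """Group nodes into components using breadth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components


def shortest_path(adj, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in adj[node]:
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None


# Toy network with two disconnected cells.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "X": ["Y"], "Y": ["X"]}
print(connected_components(network))     # [['A', 'B', 'C'], ['X', 'Y']]
print(shortest_path(network, "A", "C"))  # ['A', 'B', 'C']
print(shortest_path(network, "A", "X"))  # None
```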


In [None]:
from semantica.kg import GraphAnalyzer

analyzer = GraphAnalyzer()
results = analyzer.analyze_graph(kg)

stats = results.get("metrics", {})
connectivity = results.get("connectivity", {})

print(f"Graph: {stats.get('num_nodes', 0)} nodes, {stats.get('num_edges', 0)} edges")
print(f"Connected components: {len(connectivity.get('components', []))}")


---

## GraphRAG Queries

Use hybrid retrieval combining vector search and graph traversal to answer complex intelligence questions.
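
A common way to combine the two signals is a weighted blend of a vector-similarity score and a graph-based score. The sketch below assumes that `alpha` weights the vector score and `1 - alpha` the graph score. Whether `hybrid_alpha` in `AgentContext` follows this convention is an assumption, and `hybrid_rank` and the toy scores are hypothetical.

```python
def hybrid_rank(vector_scores, graph_scores, alpha=0.7):
    """Blend two score dicts: alpha weights vector similarity, 1 - alpha the graph signal."""
    ids = set(vector_scores) | set(graph_scores)
    blended = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1 - alpha) * graph_scores.get(doc_id, 0.0)
        for doc_id in ids
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores: chunk-2 is a weaker text match but is strongly linked in the graph.
vector_scores = {"chunk-1": 0.9, "chunk-2": 0.5}
graph_scores = {"chunk-2": 1.0, "chunk-3": 0.8}
ranked = hybrid_rank(vector_scores, graph_scores)
print([doc_id for doc_id, _ in ranked])  # ['chunk-2', 'chunk-1', 'chunk-3']
```

In this toy case the graph signal lifts `chunk-2` above `chunk-1` even though its vector similarity is lower.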


In [None]:
from semantica.context import AgentContext, ContextGraph
from semantica.llms import Groq
import os

context_graph = ContextGraph()
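# Normalize relationship keys so every edge exposes source_id and target_id
normalized_relationships = [
    {
        **r,
        'source_id': r.get('source_id') or r.get('source'),
        'target_id': r.get('target_id') or r.get('target'),
    }
    for r in kg.get('relationships', [])
]
context_graph.build_from_entities_and_relationships(
    entities=kg.get('entities', []),
    relationships=normalized_relationships
)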

graph_stats = context_graph.stats()
print(f"Intelligence Context Graph: {graph_stats['node_count']} nodes, {graph_stats['edge_count']} edges")

context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=context_graph,
    hybrid_alpha=0.7,
    use_graph_expansion=True,
    max_expansion_hops=3
)

for chunk in chunked_docs[:30]:
    if chunk and chunk.strip():
        context.store(
            content=chunk,
            metadata={'source': 'criminal_intelligence'},
            extract_entities=True,
            link_entities=True
        )

llm = Groq(model="llama-3.1-8b-instant", api_key=os.getenv("GROQ_API_KEY"))

intelligence_queries = [
    "Who are the key players and central nodes in the criminal network?",
    "What are the operational relationships between criminal organizations?"
]

print("\n" + "=" * 80)
print("Criminal Intelligence Analysis - GraphRAG with Multi-Hop Reasoning")
print("=" * 80)

for query in intelligence_queries:
    print(f"\n{'='*80}")
    print(f"Intelligence Query: {query}")
    print(f"{'='*80}\n")
    
    result = context.query_with_reasoning(
        query=query,
        llm_provider=llm,
        max_results=15,
        max_hops=3,
        min_score=0.2
    )
    
    print(f"Generated Response:\n{result.get('response', 'No response available')}\n")
    
    if result.get('reasoning_path'):
        print(f"Reasoning Path:\n{result.get('reasoning_path')}\n")
    
    print(f"Confidence: {result.get('confidence', 0):.3f}")
    print(f"Sources: {result.get('num_sources', 0)}")
    print(f"Reasoning Paths: {result.get('num_reasoning_paths', 0)}")
    print()


---

## Visualization

Visualize the criminal network to explore relationships, communities, and key players.


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer(layout="force", color_scheme="vibrant")
visualizer.visualize_network(kg, output="interactive")


---

## Export

Export the knowledge graph in multiple formats for intelligence reporting and further analysis.
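
For readers who want to inspect the data without the Semantica exporters, the sketch below writes a small graph dictionary to JSON and its relationships to CSV using only the standard library. The `entities`/`relationships` layout with `source`, `target`, and `type` keys is an assumption about the graph structure, and the output of the real exporters may differ.

```python
import csv
import io
import json

# Hypothetical miniature graph using the assumed dictionary layout.
kg_sample = {
    "entities": [{"id": "A", "type": "PERSON"}, {"id": "B", "type": "ORG"}],
    "relationships": [{"source": "A", "target": "B", "type": "member_of"}],
}


def relationships_to_csv(graph):
    """Serialize the relationship list as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["source", "target", "type"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(graph["relationships"])
    return buffer.getvalue()


json_text = json.dumps(kg_sample, indent=2)
csv_text = relationships_to_csv(kg_sample)
print(csv_text)
```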


In [None]:
from semantica.export import GraphExporter, JSONExporter, CSVExporter

GraphExporter().export_knowledge_graph(kg, "criminal_network.graphml", format="graphml")
JSONExporter().export_knowledge_graph(kg, "criminal_network.json")
CSVExporter().export_knowledge_graph(kg, "criminal_network.csv")

print("Exported knowledge graph in JSON, GraphML, and CSV formats")
