[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/intelligence/01_Criminal_Network_Analysis.ipynb)

# Criminal Network Analysis - Graph Analytics & Centrality

## Overview

This notebook demonstrates **criminal network analysis** using Semantica, with a focus on **network centrality**, **community detection**, and **relationship mapping**. The pipeline processes OSINT feeds, police reports, and court records into a knowledge graph, which is then analyzed to identify key players and communities.

### Key Features

- **Network Centrality**: Uses centrality measures (degree, betweenness, closeness, eigenvector) to identify key players
- **Community Detection**: Detects criminal communities and groups; the community detection cell runs the Louvain method and an overlapping-community method
- **Relationship Mapping**: Maps relationships between persons, organizations, and events
- **Graph Analytics**: Comprehensive graph analysis including path finding and connectivity
- **Intelligence Reporting**: Generates intelligence reports from network analysis

### Learning Objectives

- Understand how to analyze criminal networks using graph analytics
- Learn to identify key players using centrality measures
- Master community detection algorithms for criminal group identification
- Explore relationship mapping and path finding in networks
- Practice graph analytics for intelligence reporting
- Analyze network structure and connectivity patterns

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[KG Construction]
    G --> H[Embedding Generation]
    H --> I[Vector Store]
    G --> J[Centrality Analysis]
    G --> K[Community Detection]
    G --> L[Graph Analytics]
    I --> M[GraphRAG Queries]
    J --> N[Visualization]
    K --> N
    L --> N
    G --> O[Export]
```


---


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


---

## Configuration & Setup

Configure API keys and set up constants for the criminal network analysis pipeline.


In [None]:
import os

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key-here")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


---

## Data Ingestion

Ingest intelligence data from multiple sources including OSINT RSS feeds, web APIs, and local files.


In [None]:
from semantica.ingest import FeedIngestor, WebIngestor, FileIngestor
from contextlib import redirect_stderr
from io import StringIO
import os

os.makedirs("data", exist_ok=True)

documents = []

# Ingest from OSINT RSS feeds
osint_feeds = [
    "https://www.us-cert.gov/ncas/alerts.xml",
    "https://www.fbi.gov/feeds/news"
]

for feed_url in osint_feeds:
    try:
        with redirect_stderr(StringIO()):
            feed_ingestor = FeedIngestor()
            feed_docs = feed_ingestor.ingest(feed_url, method="rss")
            documents.extend(feed_docs)
    except Exception as exc:
        # Report the failure instead of silently skipping the feed
        print(f"Skipping {feed_url}: {exc}")

# Example: Web ingestion from FBI API (commented - requires authentication)
# web_ingestor = WebIngestor()
# fbi_docs = web_ingestor.ingest("https://api.fbi.gov/wanted/v1/list", method="api")

# Fallback: Sample criminal network data
if not documents:
    network_data = """
    John Smith is associated with criminal organization XYZ.
    Jane Doe has connections to John Smith and organization XYZ.
    Event: Meeting on 2024-01-15 between John Smith and Jane Doe at Location A.
    Organization XYZ is linked to multiple criminal activities.
    Person: Mike Johnson connected to organization XYZ.
    Location A is a known meeting point for criminal activities.
    Relationship: John Smith and Jane Doe have a business relationship.
    """
    with open("data/criminal_network.txt", "w", encoding="utf-8") as f:
        f.write(network_data)
    file_ingestor = FileIngestor()
    documents = file_ingestor.ingest("data/criminal_network.txt")

print(f"Ingested {len(documents)} documents")


In [None]:
from semantica.parse import DocumentParser
from contextlib import redirect_stderr
from io import StringIO

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        with redirect_stderr(StringIO()):
            parsed = parser.parse(
                doc.content if hasattr(doc, 'content') else str(doc),
                format="auto"
            )
            parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc.content if hasattr(doc, 'content') else str(doc))
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

print(f"Parsed {len(parsed_documents)} documents")


---

## Text Processing

Normalize entity names and split documents using entity-aware chunking to preserve network relationships.
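
To make the role of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of plain character-window chunking with overlap. It is not the `entity_aware` method used in the cell below (whose internals are not shown in this notebook); `chunk_text` is a hypothetical helper that only illustrates how the two constants interact.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration; not part of Semantica.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "x" * 2500
chunk_lengths = [len(c) for c in chunk_text(sample)]
print(chunk_lengths)
```

With a 200-character overlap, a name or relationship that straddles a window boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.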


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter
from contextlib import redirect_stderr
from io import StringIO

normalizer = TextNormalizer()
print(f"Normalizing {len(parsed_documents)} documents...")
normalized_docs = []

for i, doc in enumerate(parsed_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            normalized = normalizer.normalize(
                doc if isinstance(doc, str) else str(doc),
                clean_html=True,
                normalize_entities=True,
                remove_extra_whitespace=True
            )
            normalized_docs.append(normalized)
    except Exception:
        normalized_docs.append(doc if isinstance(doc, str) else str(doc))
    if i % 50 == 0 or i == len(parsed_documents):
        print(f"  Normalized {i}/{len(parsed_documents)} documents...")

# Use entity-aware chunking to preserve network relationships
entity_splitter = TextSplitter(
    method="entity_aware",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

print(f"Chunking {len(normalized_docs)} documents...")
chunked_docs = []
for i, doc_text in enumerate(normalized_docs, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = entity_splitter.split(doc_text)
            chunked_docs.extend([chunk.content if hasattr(chunk, 'content') else str(chunk) for chunk in chunks])
    except Exception:
        chunked_docs.append(doc_text)
    if i % 50 == 0 or i == len(normalized_docs):
        print(f"  Chunked {i}/{len(normalized_docs)} documents ({len(chunked_docs)} chunks so far)")

print(f"Created {len(chunked_docs)} chunks from {len(normalized_docs)} documents")


---

## Entity Extraction

Extract criminal network entities including persons, organizations, events, locations, and relationships.


In [None]:
from semantica.semantic_extract import NERExtractor
from contextlib import redirect_stderr
from io import StringIO

extractor = NERExtractor(
    provider="groq",
    model="llama-3.1-8b-instant"
)

entity_types = [
    "Person", "Organization", "Event", "Location", "Relationship"
]

all_entities = []
chunks_to_process = chunked_docs[:10]  # Limit for demo
print(f"Extracting entities from {len(chunks_to_process)} chunks...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        with redirect_stderr(StringIO()):
            entities = extractor.extract(
                chunk,
                entity_types=entity_types
            )
            all_entities.extend(entities)
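    except Exception as exc:
        # Report the failure so an empty result is not mistaken for "no entities"
        print(f"  Entity extraction failed on chunk {i}: {exc}")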
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_entities)} entities found)")

print(f"Extracted {len(all_entities)} entities")


---

## Relationship Extraction

Extract network relationships between entities such as associations, connections, involvement, and location relationships.


In [None]:
from semantica.semantic_extract import RelationExtractor
from contextlib import redirect_stderr
from io import StringIO

relation_extractor = RelationExtractor(
    provider="groq",
    model="llama-3.1-8b-instant"
)

relation_types = [
    "associated_with", "connected_to", "involved_in",
    "located_at", "related_to"
]

all_relationships = []
chunks_to_process = chunked_docs[:10]  # Limit for demo
print(f"Extracting relationships from {len(chunks_to_process)} chunks...")
for i, chunk in enumerate(chunks_to_process, 1):
    try:
        with redirect_stderr(StringIO()):
            relationships = relation_extractor.extract(
                chunk,
                relation_types=relation_types
            )
            all_relationships.extend(relationships)
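    except Exception as exc:
        # Report the failure so an empty result is not mistaken for "no relationships"
        print(f"  Relationship extraction failed on chunk {i}: {exc}")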
    
    if i % 5 == 0 or i == len(chunks_to_process):
        print(f"  Processed {i}/{len(chunks_to_process)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


---

## Deduplication

Deduplicate person and organization records to ensure accurate network analysis.
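
The idea behind fuzzy resolution with a similarity threshold can be sketched with the standard library. This is an assumption-laden illustration, not `EntityResolver`'s actual algorithm: it uses `difflib.SequenceMatcher` as the similarity measure, and `similarity` and `dedupe` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe(names, threshold=0.85):
    """Keep the first spelling of each name; drop later near-duplicates."""
    canonical = []
    for name in names:
        if not any(similarity(name, kept) >= threshold for kept in canonical):
            canonical.append(name)
    return canonical


names = ["John Smith", "john smith", "Jon Smith", "Jane Doe"]
unique_names = dedupe(names)
print(unique_names)
```

A higher threshold merges fewer records (fewer false merges, more leftover duplicates); a lower one merges more aggressively, which can collapse distinct people with similar names.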


In [None]:
from semantica.kg import EntityResolver
from semantica.semantic_extract import Entity

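# Convert extracted entities (Entity objects or dicts) to dictionaries for EntityResolver
def entity_field(entity, *names, default=""):
    """Return the first available field from a dict or an Entity-like object."""
    for name in names:
        value = entity.get(name) if isinstance(entity, dict) else getattr(entity, name, None)
        if value is not None:
            return value
    return default

print(f"Converting {len(all_entities)} entities to dictionaries...")
entity_dicts = [
    {
        "name": entity_field(e, "name", "text"),
        "type": entity_field(e, "type", "label"),
        "confidence": entity_field(e, "confidence", default=1.0),
    }
    for e in all_entities
]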

# Use EntityResolver class to resolve duplicates
entity_resolver = EntityResolver(strategy="fuzzy", similarity_threshold=0.85)

print(f"Resolving duplicates in {len(entity_dicts)} entities...")
resolved_entities = entity_resolver.resolve_entities(entity_dicts)

# Convert back to Entity objects
print(f"Converting {len(resolved_entities)} resolved entities back to Entity objects...")
merged_entities = [
    Entity(text=e["name"], label=e["type"], confidence=e.get("confidence", 1.0))
    if isinstance(e, dict) else e
    for e in resolved_entities
]

all_entities = merged_entities
print(f"Deduplicated {len(entity_dicts)} entities to {len(merged_entities)} unique entities")


---

## Conflict Detection

Detect and resolve conflicts in intelligence data drawn from multiple sources. Because sources differ in credibility, conflicting values are resolved with a credibility-weighted strategy.
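
As a rough illustration of credibility weighting, the sketch below sums a per-source weight for each competing value and keeps the value with the highest total. The weights, source names, and `resolve_by_credibility` are all hypothetical; how `ConflictResolver` implements its `credibility_weighted` strategy is not shown in this notebook.

```python
from collections import defaultdict

# Hypothetical credibility weights; real deployments would calibrate these.
SOURCE_CREDIBILITY = {
    "court_record": 0.9,
    "police_report": 0.7,
    "osint_feed": 0.4,
}


def resolve_by_credibility(claims, default_weight=0.1):
    """claims: list of (source, value). Return the value with the most weight."""
    scores = defaultdict(float)
    for source, value in claims:
        scores[value] += SOURCE_CREDIBILITY.get(source, default_weight)
    return max(scores, key=scores.get)


claims = [
    ("osint_feed", "Location B"),
    ("osint_feed", "Location B"),
    ("court_record", "Location A"),
]
resolved_value = resolve_by_credibility(claims)
print(resolved_value)
```

Here two low-credibility reports (0.4 + 0.4) are outweighed by a single court record (0.9), which is the behaviour a simple majority vote would get wrong.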


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

# Use value conflict detection for property value disagreements
# credibility_weighted strategy prioritizes authoritative intelligence sources
conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

print(f"Detecting value conflicts in {len(all_entities)} entities...")
conflicts = conflict_detector.detect_conflicts(
    entities=all_entities,
    relationships=all_relationships,
    method="value"  # Detect property value conflicts
)

print(f"Detected {len(conflicts)} value conflicts")

if conflicts:
    print("Resolving conflicts using credibility_weighted strategy...")
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="credibility_weighted"  # Intelligence sources have different credibility
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")


---

## Knowledge Graph Construction

Build the criminal network knowledge graph from extracted entities and relationships.
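
Conceptually, a graph built from relationships is an adjacency structure over entity names. The sketch below assumes each relationship is a dict with `source`, `target`, and `type` keys (an assumption; the objects returned by `RelationExtractor` may be shaped differently) and uses relationships hand-derived from the fallback sample text. `build_adjacency` is a hypothetical helper.

```python
from collections import defaultdict


def build_adjacency(relationships):
    """Build an undirected adjacency map from relationship dicts."""
    adjacency = defaultdict(set)
    for rel in relationships:
        adjacency[rel["source"]].add(rel["target"])
        adjacency[rel["target"]].add(rel["source"])
    return dict(adjacency)


# Hand-written from the fallback sample text, not extractor output.
sample_relationships = [
    {"source": "John Smith", "target": "Organization XYZ", "type": "associated_with"},
    {"source": "Jane Doe", "target": "John Smith", "type": "connected_to"},
    {"source": "Jane Doe", "target": "Organization XYZ", "type": "connected_to"},
    {"source": "Mike Johnson", "target": "Organization XYZ", "type": "connected_to"},
    {"source": "John Smith", "target": "Location A", "type": "located_at"},
    {"source": "Jane Doe", "target": "Location A", "type": "located_at"},
]

adjacency = build_adjacency(sample_relationships)
for node in sorted(adjacency):
    print(node, "->", sorted(adjacency[node]))
```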


In [None]:
from semantica.kg import GraphBuilder

builder = GraphBuilder()

print("Building knowledge graph...")
kg = builder.build(
    entities=all_entities,
    relationships=all_relationships
)

print(f"Built KG with {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


---

## Embedding Generation & Vector Store

Generate embeddings for intelligence documents and store them in a vector database for semantic search.
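
Semantic search over embeddings boils down to ranking stored vectors by similarity to a query vector. The sketch below does this by brute force with cosine similarity on 3-dimensional toy vectors; the notebook itself uses 384-dimensional embeddings and a FAISS-backed store, and `cosine`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=2):
    """Return the k (id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for real chunk embeddings.
toy_index = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.7, 0.7, 0.0],
    "chunk-2": [0.0, 0.0, 1.0],
}

hits = top_k([1.0, 0.1, 0.0], toy_index)
print([doc_id for doc_id, _ in hits])
```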


In [None]:
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore
from contextlib import redirect_stderr
from io import StringIO

embedding_gen = EmbeddingGenerator(
    model_name=EMBEDDING_MODEL,
    dimension=EMBEDDING_DIMENSION
)

# Generate embeddings for chunks
chunks_to_embed = chunked_docs[:20]  # Limit for demo
print(f"Generating embeddings for {len(chunks_to_embed)} chunks...")
embeddings = []
embedded_chunks = []  # Chunks whose embedding succeeded, kept aligned with `embeddings`
for i, chunk in enumerate(chunks_to_embed, 1):
    try:
        with redirect_stderr(StringIO()):
            embedding = embedding_gen.generate(chunk)
            embeddings.append(embedding)
            embedded_chunks.append(chunk)
    except Exception as exc:
        print(f"  Skipping chunk {i}: {exc}")
    if i % 5 == 0 or i == len(chunks_to_embed):
        print(f"  Processed {i}/{len(chunks_to_embed)} chunks ({len(embeddings)} embeddings)...")

# Create vector store
vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

# Add embeddings to vector store
print(f"Storing {len(embeddings)} embeddings in vector store...")
stored = 0
for i, (chunk, embedding) in enumerate(zip(embedded_chunks, embeddings)):
    try:
        vector_store.add(
            id=str(i),
            embedding=embedding,
            metadata={"text": chunk[:100]}  # Store first 100 chars
        )
        stored += 1
    except Exception as exc:
        print(f"  Could not store embedding {i}: {exc}")

print(f"Generated {len(embeddings)} embeddings and stored {stored} in the vector database")


---

## Network Centrality Analysis

Calculate centrality measures to identify key players in the criminal network. Degree centrality highlights the most connected entities, while betweenness centrality highlights brokers who sit on the paths between other parts of the network.
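
To show what these two measures capture, here is a small pure-Python sketch on a graph hand-derived from the fallback sample text. It is independent of `CentralityCalculator`, whose internals and normalization are not shown in this notebook; the betweenness function follows Brandes' algorithm for unweighted, undirected graphs and returns unnormalized scores.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adj)
    if n <= 1:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def betweenness_centrality(adj):
    """Brandes' algorithm (unweighted, undirected, unnormalized)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: value / 2 for v, value in score.items()}


# Hand-written from the fallback sample text, not extractor output.
network = {
    "John Smith": ["Organization XYZ", "Jane Doe", "Location A"],
    "Jane Doe": ["John Smith", "Organization XYZ", "Location A"],
    "Organization XYZ": ["John Smith", "Jane Doe", "Mike Johnson"],
    "Mike Johnson": ["Organization XYZ"],
    "Location A": ["John Smith", "Jane Doe"],
}

degree = degree_centrality(network)
betweenness = betweenness_centrality(network)
for node in network:
    print(f"{node}: degree={degree[node]:.2f}, betweenness={betweenness[node]:.2f}")
```

In this toy graph three nodes tie on degree, but only Organization XYZ scores highly on betweenness, because every path to Mike Johnson runs through it. That difference is why the notebook reports both rankings.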


In [None]:
from semantica.kg import CentralityCalculator
from contextlib import redirect_stderr
from io import StringIO

centrality_calc = CentralityCalculator(kg)

try:
    with redirect_stderr(StringIO()):
        # Calculate all centrality measures
        degree_centrality = centrality_calc.degree_centrality()
        betweenness_centrality = centrality_calc.betweenness_centrality()
        closeness_centrality = centrality_calc.closeness_centrality()
        eigenvector_centrality = centrality_calc.eigenvector_centrality()
        
        # Identify key players (high degree centrality)
        if degree_centrality:
            top_players = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
            print(f"Top 5 key players by degree centrality: {[p[0] for p in top_players]}")
        
        # Identify brokers (high betweenness centrality)
        if betweenness_centrality:
            top_brokers = sorted(betweenness_centrality.items(), key=lambda x: x[1], reverse=True)[:5]
            print(f"Top 5 brokers by betweenness centrality: {[b[0] for b in top_brokers]}")
except Exception as exc:
    print(f"Centrality analysis failed: {exc}")


---

## Community Detection

Detect criminal communities and groups in the network. Densely connected clusters of entities can point to organized crime structures such as cells or factions.
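
The Louvain method searches for a partition that maximizes modularity, a score comparing the edges inside each community with what a random graph of the same degrees would give. The sketch below only evaluates that score for a given partition (it does not perform the Louvain search), on a hypothetical toy graph of two triangles joined by one edge; `modularity` is a hypothetical helper.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for members in communities:
        members = set(members)
        internal = sum(1 for u, v in edges if u in members and v in members)
        degree_sum = sum(degree.get(node, 0) for node in members)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


# Hypothetical toy graph: two triangles linked by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]

q_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(f"two communities: {q_split:.3f}, one community: {q_single:.3f}")
```

Splitting along the bridge edge scores higher than lumping everything together, which is the kind of partition a modularity-maximizing method favours.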


In [None]:
from semantica.kg import CommunityDetector
from contextlib import redirect_stderr
from io import StringIO

community_detector = CommunityDetector(kg)

try:
    with redirect_stderr(StringIO()):
        # Detect communities using Louvain algorithm
        communities = community_detector.detect_communities(method="louvain")
        print(f"Detected {len(communities)} communities using Louvain algorithm")
        
        # Detect overlapping communities
        overlapping_communities = community_detector.detect_communities(method="overlapping")
        print(f"Detected {len(overlapping_communities)} overlapping communities")
        
        # Analyze community structure
        if communities:
            largest_community = max(communities, key=len)
            print(f"Largest community has {len(largest_community)} members")
except Exception as exc:
    print(f"Community detection failed: {exc}")


---

## Graph Analytics

Perform comprehensive graph analytics including path finding and connectivity analysis to understand network structure.
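
Path finding and connectivity analysis can be sketched in a few lines of pure Python. The sketch assumes that a length limit counts edges (the semantics of `max_length` in `GraphAnalyzer.find_paths` are not documented here) and uses a graph hand-derived from the fallback sample text plus one hypothetical isolated node; `find_paths` and `connected_components` below are stand-alone helpers, not Semantica functions.

```python
from collections import deque


def find_paths(adj, source, target, max_length=3):
    """All simple paths from source to target using at most max_length edges."""
    paths = []

    def walk(node, path):
        if node == target:
            paths.append(path)
            return
        if len(path) - 1 >= max_length:  # edge budget used up
            return
        for nxt in sorted(adj.get(node, ())):
            if nxt not in path:          # keep paths simple (no revisits)
                walk(nxt, path + [nxt])

    walk(source, [source])
    return paths


def connected_components(adj):
    """Group nodes into components using breadth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


network = {
    "John Smith": ["Organization XYZ", "Jane Doe", "Location A"],
    "Jane Doe": ["John Smith", "Organization XYZ", "Location A"],
    "Organization XYZ": ["John Smith", "Jane Doe", "Mike Johnson"],
    "Mike Johnson": ["Organization XYZ"],
    "Location A": ["John Smith", "Jane Doe"],
    "Unlinked Tip": [],  # hypothetical isolated node
}

short_paths = find_paths(network, "Mike Johnson", "Location A", max_length=3)
components = connected_components(network)
for path in short_paths:
    print(" -> ".join(path))
print(f"{len(components)} connected components")
```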


In [None]:
from semantica.kg import GraphAnalyzer
from contextlib import redirect_stderr
from io import StringIO

graph_analyzer = GraphAnalyzer(kg)

try:
    with redirect_stderr(StringIO()):
        # Analyze graph structure
        stats = graph_analyzer.get_statistics()
        print(f"Graph statistics: {stats.get('num_nodes', 0)} nodes, {stats.get('num_edges', 0)} edges")
        
        # Find paths between entities
        if all_entities:
            # Entities may be dicts or Entity objects (text/label), so read both shapes
            def field(e, key, attr):
                return e.get(key, "") if isinstance(e, dict) else getattr(e, attr, "")
            person_entities = [e for e in all_entities if field(e, "type", "label") == "Person"]
            if len(person_entities) >= 2:
                source = field(person_entities[0], "name", "text")
                target = field(person_entities[1], "name", "text")
                if source and target:
                    paths = graph_analyzer.find_paths(source=source, target=target, max_length=3)
                    print(f"Found {len(paths)} paths between {source} and {target}")
        
        # Analyze connectivity
        connectivity = graph_analyzer.analyze_connectivity()
        print(f"Connectivity analysis: {len(connectivity.get('components', []))} connected components")
except Exception as exc:
    print(f"Graph analytics failed: {exc}")


---

## GraphRAG Queries

Use hybrid retrieval combining vector search and graph traversal to answer complex intelligence questions.
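
One common way to combine the two signals is to take the top vector-search hits, look up the entities those chunks mention, and then pull in their graph neighbours as extra context. The sketch below illustrates that pattern only; it makes no claim about how `AgentContext.query` works internally, and `hybrid_retrieve`, the chunk-to-entity map, and the score are hypothetical.

```python
def hybrid_retrieve(vector_hits, chunk_entities, adjacency, hops=1):
    """Expand vector-search hits with graph neighbours of mentioned entities."""
    seeds = set()
    for chunk_id, _score in vector_hits:
        seeds.update(chunk_entities.get(chunk_id, ()))
    related = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # Step one hop outward, skipping nodes already collected.
        frontier = {nbr for node in frontier for nbr in adjacency.get(node, ())} - related
        related |= frontier
    return {
        "chunks": [chunk_id for chunk_id, _ in vector_hits],
        "seed_entities": sorted(seeds),
        "graph_context": sorted(related - seeds),
    }


network = {
    "John Smith": ["Organization XYZ", "Jane Doe", "Location A"],
    "Jane Doe": ["John Smith", "Organization XYZ", "Location A"],
    "Organization XYZ": ["John Smith", "Jane Doe", "Mike Johnson"],
    "Mike Johnson": ["Organization XYZ"],
    "Location A": ["John Smith", "Jane Doe"],
}
vector_hits = [("chunk-0", 0.91)]                 # hypothetical search result
chunk_entities = {"chunk-0": ["Mike Johnson"]}    # hypothetical entity index

one_hop = hybrid_retrieve(vector_hits, chunk_entities, network, hops=1)
two_hop = hybrid_retrieve(vector_hits, chunk_entities, network, hops=2)
print(one_hop["graph_context"])
print(two_hop["graph_context"])
```

The vector hit alone mentions only Mike Johnson; graph expansion surfaces the organization he is tied to and, at two hops, its other associates.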


In [None]:
from semantica.context import AgentContext
from contextlib import redirect_stderr
from io import StringIO

agent_context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=kg
)

queries = [
    "Who are the key players in the criminal network?",
    "What organizations are connected to person X?",
    "What events occurred at location Y?",
    "What are the relationships between organization A and organization B?"
]

for query in queries:
    try:
        with redirect_stderr(StringIO()):
            results = agent_context.query(
                query=query,
                top_k=5
            )
            print(f"Query: {query}")
            print(f"Found {len(results.get('results', []))} relevant results")
    except Exception as exc:
        print(f"Query failed ({query}): {exc}")


---

## Visualization

Visualize the criminal network to explore relationships, communities, and key players.


In [None]:
from semantica.visualization import KGVisualizer
from contextlib import redirect_stderr
from io import StringIO

visualizer = KGVisualizer()

try:
    with redirect_stderr(StringIO()):
        visualizer.visualize(
            kg,
            output_path="criminal_network.html",
            layout="force_directed"
        )
        print("Knowledge graph visualization saved to criminal_network.html")
except Exception as exc:
    print(f"Visualization failed: {exc}")


---

## Export

Export the knowledge graph in multiple formats for intelligence reporting and further analysis.
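
For a sense of what the JSON and CSV outputs might contain, the sketch below serializes a small dict-shaped graph with the standard library. The `entities`/`relationships` keys mirror how the graph is read earlier in this notebook, but the per-record fields and the column layout are assumptions, and `to_json` and `to_edge_csv` are hypothetical helpers rather than `GraphExporter` behaviour.

```python
import csv
import io
import json


def to_json(kg):
    """Serialize the whole graph as pretty-printed JSON."""
    return json.dumps(kg, indent=2, sort_keys=True)


def to_edge_csv(kg):
    """Serialize relationships as a source,type,target edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "type", "target"])
    for rel in kg["relationships"]:
        writer.writerow([rel["source"], rel["type"], rel["target"]])
    return buffer.getvalue()


# Hand-written sample graph, not pipeline output.
sample_kg = {
    "entities": [
        {"name": "John Smith", "type": "Person"},
        {"name": "Organization XYZ", "type": "Organization"},
    ],
    "relationships": [
        {"source": "John Smith", "type": "associated_with", "target": "Organization XYZ"},
    ],
}

json_text = to_json(sample_kg)
csv_text = to_edge_csv(sample_kg)
print(csv_text)
```

An edge list like this loads directly into spreadsheet tools or `pandas`, which is often enough for a reporting table.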


In [None]:
from semantica.export import GraphExporter
from contextlib import redirect_stderr
from io import StringIO

exporter = GraphExporter()

try:
    with redirect_stderr(StringIO()):
        # Export as JSON
        exporter.export(kg, format="json", output_path="criminal_network.json")
        
        # Export as GraphML
        exporter.export(kg, format="graphml", output_path="criminal_network.graphml")
        
        # Export as CSV (for intelligence reporting)
        exporter.export(kg, format="csv", output_path="criminal_network.csv")
        
        print("Exported knowledge graph in JSON, GraphML, and CSV formats")
except Exception as exc:
    print(f"Export failed: {exc}")
