[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/finance/02_Fraud_Detection.ipynb)

# Fraud Detection - Temporal KGs & Pattern Detection

## Overview

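This notebook demonstrates **fraud detection** with Semantica, with a focus on **temporal knowledge graphs**, **anomaly detection**, and **pattern recognition**. The pipeline builds a temporal knowledge graph from transaction data and queries it for fraud patterns and anomalies. Kafka stream ingestion appears only as a commented example; the runnable cells try a payment API endpoint and fall back to a small sample transaction log.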

### Key Features

- **Temporal Knowledge Graphs**: Builds temporal KGs to track transaction patterns over time
- **Pattern Detection**: Uses `TemporalPatternDetector` and rule-based reasoning to detect fraud patterns
- **Anomaly Detection**: Uses graph-based pattern recognition to flag anomalous accounts and transactions
- **Conflict Detection**: Detects conflicting transaction data from multiple sources
- **Stream Processing**: Shows where a transaction stream (Kafka) plugs into ingestion; the runnable cells use a payment API with a sample-log fallback
- **Multiple Data Sources**: Transaction streams, payment APIs, and local files
- **Modular Architecture**: Uses Semantica modules directly, without the core orchestrator

### Learning Objectives

- Ingest transaction data from streams and APIs
- Extract transaction entities (Transactions, Accounts, Devices, Patterns, Anomalies)
- Build temporal transaction knowledge graphs
- Perform temporal queries and pattern detection
- Detect fraud patterns using graph reasoning
- Analyze transaction networks using graph analytics
- Store and query transaction data using vector stores

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Document Parsing]
    B --> C[Text Processing]
    C --> D[Entity Extraction]
    D --> E[Relationship Extraction]
    E --> F[Deduplication]
    F --> G[Conflict Detection]
    G --> H[Temporal Knowledge Graph]
    H --> I[Embeddings]
    I --> J[Vector Store]
    H --> K[Temporal Queries]
    K --> L[Temporal Pattern Detection]
    L --> M[Reasoning & Fraud]
    M --> N[Graph Analytics]
    J --> O[GraphRAG Queries]
    N --> O
    O --> P[Visualization]
    P --> Q[Export]
```

## Installation


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


## Configuration & Setup


In [None]:
import os

os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key-here")

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
TEMPORAL_GRANULARITY = "minute"


## Ingesting Transaction Data from Streams


In [None]:
from semantica.ingest import StreamIngestor, WebIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

# Example: ingest from a Kafka transaction stream (commented out because it needs a running broker)
# stream_ingestor = StreamIngestor()
# stream_documents = stream_ingestor.ingest("kafka://localhost:9092/transactions", method="kafka")

# Example: Ingest from payment processor API
payment_api = "https://api.example.com/transactions"  # Example API endpoint
all_documents = []

try:
    web_ingestor = WebIngestor()
    with redirect_stderr(StringIO()):
        api_documents = web_ingestor.ingest(payment_api, method="url")
    for doc in api_documents:
        if not hasattr(doc, 'metadata'):
            doc.metadata = {}
        doc.metadata['source'] = 'Payment API'
        all_documents.append(doc)
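except Exception as exc:
    # api.example.com is a placeholder endpoint, so a failure here is likely;
    # the sample log below is used instead
    print(f"Payment API ingestion failed ({type(exc).__name__}); falling back to sample data")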

if not all_documents:
    tx_data = """
    2024-01-01 10:00:00 - Transaction $1000 from Account A123 to Account B456
    2024-01-01 10:01:00 - Transaction $5000 from Account A123 to Account C789
    2024-01-01 10:02:00 - Transaction $10000 from Account A123 to Account D012 (unusual pattern)
    2024-01-01 10:03:00 - Multiple rapid transactions from Account A123 (suspicious)
    2024-01-01 10:04:00 - Transaction $2000 from Account B456 to Account E789
    2024-01-01 10:05:00 - Large transaction $50000 from Account A123 to Account F012 (fraud alert)
    2024-01-01 10:06:00 - Transaction $1500 from Account C789 to Account G345
    2024-01-01 10:07:00 - Unusual device login from Account A123 (suspicious activity)
    """
    with open("data/transactions.txt", "w") as f:
        f.write(tx_data)
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/transactions.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")
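
The fallback log is free text, so the later cells rely on the LLM extractor to recover amounts and accounts. As a rough illustration of the structure hidden in those lines, here is a minimal sketch that parses transfer lines with a regular expression. The `LINE_RE` pattern and the `parse_transaction_line` helper are assumptions made for this example; they are not part of Semantica.

```python
import re
from datetime import datetime

# Hypothetical pattern for lines such as:
# "2024-01-01 10:00:00 - Transaction $1000 from Account A123 to Account B456"
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - .*?\$(?P<amount>\d+)"
    r" from Account (?P<src>\w+) to Account (?P<dst>\w+)"
)

def parse_transaction_line(line):
    """Return a dict for a transfer line, or None for free-text alert lines."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        "amount": int(match["amount"]),
        "source": match["src"],
        "target": match["dst"],
    }

sample_lines = [
    "2024-01-01 10:00:00 - Transaction $1000 from Account A123 to Account B456",
    "2024-01-01 10:05:00 - Large transaction $50000 from Account A123 to Account F012 (fraud alert)",
    "2024-01-01 10:07:00 - Unusual device login from Account A123 (suspicious activity)",
]
parsed_sample = [parse_transaction_line(line) for line in sample_lines]
```

Lines without an amount and a source/target pair, such as the device-login alert, return `None` and would still need the LLM-based extraction.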


## Parsing Transaction Documents


In [None]:
from semantica.parse import DocumentParser

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        parsed = parser.parse(
            doc.content if hasattr(doc, 'content') else str(doc),
            content_type="text"
        )
        parsed_documents.append(parsed)
    except Exception:
        # Parsing failed: keep the original document so its content is still processed downstream
        parsed_documents.append(doc)
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

documents = parsed_documents


## Normalizing and Chunking Transaction Data


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
# Use sentence chunking for transaction logs
splitter = TextSplitter(method="sentence", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

print(f"Normalizing {len(documents)} documents...")
normalized_documents = []
for i, doc in enumerate(documents, 1):
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        normalize_numbers=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)
    if i % 50 == 0 or i == len(documents):
        print(f"  Normalized {i}/{len(documents)} documents...")

print(f"Chunking {len(normalized_documents)} documents...")
chunked_documents = []
for i, doc_text in enumerate(normalized_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    except Exception:
        # Sentence splitting failed: fall back to recursive splitting with the same size and overlap
        simple_splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        chunks = simple_splitter.split(doc_text)
        chunked_documents.extend(chunks)
    if i % 50 == 0 or i == len(normalized_documents):
        print(f"  Chunked {i}/{len(normalized_documents)} documents ({len(chunked_documents)} chunks so far)")

print(f"Created {len(chunked_documents)} chunks from {len(normalized_documents)} documents")
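
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, the sketch below splits text into plain character windows in which consecutive windows share `chunk_overlap` characters. This is an illustration under that assumption only: `sliding_chunks` is a hypothetical helper, and `TextSplitter` with `method="sentence"` presumably places boundaries differently.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# 1200 characters with size 500 and overlap 50 -> windows start at 0, 450, 900
demo_chunks = sliding_chunks("a" * 1200, chunk_size=500, chunk_overlap=50)
demo_lengths = [len(chunk) for chunk in demo_chunks]
```

With these settings each window advances by 450 characters, so a larger overlap means more chunks and more LLM calls in the extraction cells.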


## Extracting Transaction Entities


In [None]:
from semantica.semantic_extract import NERExtractor

entity_extractor = NERExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_entities = []
print(f"Extracting entities from {len(chunked_documents)} chunks...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(
            chunk_text,
            entity_types=["Transaction", "Account", "Device", "Pattern", "Anomaly"]
        )
        all_entities.extend(entities)
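    except Exception as exc:
        # Report the failure but fall through, so the progress line below still runs
        print(f"  Chunk {i}: entity extraction failed ({type(exc).__name__})")
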
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_entities)} entities found)")

# Case-insensitive substring match on the label (also covers exact "Transaction", "Account", ...)
transactions = [e for e in all_entities if "transaction" in e.label.lower()]
accounts = [e for e in all_entities if "account" in e.label.lower()]
anomalies = [e for e in all_entities if any(k in e.label.lower() for k in ("anomaly", "pattern"))]

print(f"Extracted {len(transactions)} transactions, {len(accounts)} accounts, {len(anomalies)} anomalies/patterns")


## Extracting Transaction Relationships


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_relationships = []
print(f"Extracting relationships from {len(chunked_documents)} chunks...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["from", "to", "triggers", "detects", "associated_with", "causes"]
        )
        all_relationships.extend(relationships)
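    except Exception as exc:
        # Report the failure but fall through, so the progress line below still runs
        print(f"  Chunk {i}: relation extraction failed ({type(exc).__name__})")
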
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


## Resolving Duplicate Transactions


In [None]:
from semantica.kg import EntityResolver
from semantica.semantic_extract import Entity

# Convert Entity objects to dictionaries for EntityResolver
print(f"Converting {len(all_entities)} entities to dictionaries...")
entity_dicts = [{"name": e.text, "type": e.label, "confidence": e.confidence} for e in all_entities]

# Use EntityResolver class to resolve duplicates
entity_resolver = EntityResolver(strategy="fuzzy", similarity_threshold=0.85)

print(f"Resolving duplicates in {len(entity_dicts)} entities...")
resolved_entities = entity_resolver.resolve_entities(entity_dicts)

# Convert back to Entity objects
print(f"Converting {len(resolved_entities)} resolved entities back to Entity objects...")
merged_entities = [
    Entity(text=e["name"], label=e["type"], confidence=e.get("confidence", 1.0))
    for e in resolved_entities
]

print(f"Deduplicated {len(entity_dicts)} entities to {len(merged_entities)} unique entities")
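
To show what a fuzzy threshold of 0.85 means in practice, here is a minimal sketch built on the standard library's `difflib.SequenceMatcher`. It is an assumption that a string-similarity ratio is comparable to what `EntityResolver(strategy="fuzzy")` computes; `fuzzy_dedupe` is a hypothetical helper, not Semantica's algorithm.

```python
from difflib import SequenceMatcher

def fuzzy_dedupe(names, threshold=0.85):
    """Greedy dedup: keep a name only if it is not similar to a name already kept."""
    kept = []
    for name in names:
        is_duplicate = any(
            SequenceMatcher(None, name.lower(), existing.lower()).ratio() >= threshold
            for existing in kept
        )
        if not is_duplicate:
            kept.append(name)
    return kept

# Case and whitespace variants collapse; a clearly different account survives
unique_names = fuzzy_dedupe(["Account A123", "account A123", "Account  A123", "Account B456"])

# Caveat: identifiers that differ by one character are also above the threshold
near_miss = fuzzy_dedupe(["Account A123", "Account A124"])
```

Under this measure `Account A123` and `Account A124` score about 0.92, so they would be merged at 0.85. For identifier-like entities such as account numbers, exact matching or a stricter threshold may be the safer choice.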


## Detecting Transaction Conflicts


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

# Use logical conflict detection for fraud rules
# expert_review strategy flags conflicts for manual review by fraud analysts
conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

print(f"Detecting logical conflicts in {len(merged_entities)} entities and relationships...")
conflicts = conflict_detector.detect_conflicts(
    entities=merged_entities,
    relationships=all_relationships,
    method="logical"  # Detect logical conflicts (e.g., conflicting fraud indicators)
)

print(f"Detected {len(conflicts)} logical conflicts")

if conflicts:
    print("Flagging conflicts for expert review...")
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="expert_review"  # Manual review by fraud analysts
    )
    print(f"Flagged {len(resolved)} conflicts for expert review")
else:
    print("No conflicts detected")
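
What `method="logical"` checks is not spelled out in this notebook. As an illustration of the simpler idea from the overview, conflicting transaction data from multiple sources, the sketch below groups reports by transaction and flags those whose sources disagree on the amount. The record fields (`tx_id`, `source`, `amount`) and the `Bank ledger` source are hypothetical.

```python
from collections import defaultdict

def find_amount_conflicts(records):
    """Return {tx_id: {source: amount}} for transactions whose sources disagree on the amount."""
    by_tx = defaultdict(dict)
    for rec in records:
        by_tx[rec["tx_id"]][rec["source"]] = rec["amount"]
    return {
        tx_id: amounts
        for tx_id, amounts in by_tx.items()
        if len(set(amounts.values())) > 1  # more than one distinct amount reported
    }

# Hypothetical reports of the same transactions from two sources
reports = [
    {"tx_id": "T1", "source": "Payment API", "amount": 1000},
    {"tx_id": "T1", "source": "Bank ledger", "amount": 1000},
    {"tx_id": "T2", "source": "Payment API", "amount": 5000},
    {"tx_id": "T2", "source": "Bank ledger", "amount": 500},
]
amount_conflicts = find_amount_conflicts(reports)
```

Keeping the per-source values in the result gives a fraud analyst what they need for the manual review step.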


## Building Temporal Transaction Knowledge Graph


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=True,
    resolve_conflicts=True,
    entity_resolution_strategy="fuzzy",
    enable_temporal=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

print("Building knowledge graph...")
kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence} for e in merged_entities],
    "relationships": [{"source": r.source, "target": r.target, "type": r.label, "confidence": r.confidence} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")
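
`TEMPORAL_GRANULARITY = "minute"` suggests that events are grouped by the minute in which they occur. The sketch below shows that bucketing with the standard library. It is an assumption about what the setting means, not a description of `GraphBuilder` internals, and the edge tuples are made up for the example.

```python
from collections import defaultdict
from datetime import datetime

def truncate_to_minute(ts):
    """Drop seconds and microseconds so events within the same minute share a key."""
    return ts.replace(second=0, microsecond=0)

# Hypothetical timestamped edges: (source account, target account, timestamp)
edges = [
    ("A123", "B456", datetime(2024, 1, 1, 10, 0, 5)),
    ("A123", "C789", datetime(2024, 1, 1, 10, 0, 40)),
    ("A123", "D012", datetime(2024, 1, 1, 10, 2, 0)),
]

edges_by_minute = defaultdict(list)
for source, target, ts in edges:
    edges_by_minute[truncate_to_minute(ts)].append((source, target))
```

A coarser granularity merges more events into each bucket, which hides bursts that happen within one bucket; for rapid-transfer fraud, minute-level buckets keep those bursts visible.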


## Generating Embeddings for Transactions and Accounts


In [None]:
from semantica.embeddings import EmbeddingGenerator

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

print(f"Generating embeddings for {len(transactions)} transactions and {len(accounts)} accounts...")
transaction_texts = [t.text for t in transactions]
transaction_embeddings = embedding_gen.generate_embeddings(transaction_texts)

account_texts = [a.text for a in accounts]
account_embeddings = embedding_gen.generate_embeddings(account_texts)

print(f"Generated {len(transaction_embeddings)} transaction embeddings and {len(account_embeddings)} account embeddings")


## Populating Vector Store


In [None]:
from semantica.vector_store import VectorStore

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

print(f"Storing {len(transaction_embeddings)} transaction vectors and {len(account_embeddings)} account vectors...")
transaction_ids = vector_store.store_vectors(
    vectors=transaction_embeddings,
    metadata=[{"type": "transaction", "name": t.text, "label": t.label} for t in transactions]
)

account_ids = vector_store.store_vectors(
    vectors=account_embeddings,
    metadata=[{"type": "account", "name": a.text, "label": a.label} for a in accounts]
)

print(f"Stored {len(transaction_ids)} transaction vectors and {len(account_ids)} account vectors")
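
Conceptually, a vector store answers nearest-neighbour queries over the stored embeddings. The sketch below shows the idea with an exhaustive cosine-similarity scan over toy 3-dimensional vectors (the notebook's real embeddings have 384 dimensions). It is an illustration only, not how `VectorStore` or FAISS is implemented, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, vectors, metadata, k=2):
    """Score every stored vector against the query and return the k best (score, metadata) pairs."""
    scored = [(cosine_similarity(query, v), m) for v, m in zip(vectors, metadata)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
toy_metadata = [{"name": "tx-1"}, {"name": "tx-2"}, {"name": "tx-3"}]
hits = top_k([1.0, 0.0, 0.0], toy_vectors, toy_metadata, k=2)
hit_names = [meta["name"] for _, meta in hits]
```

The metadata travels with each vector, which is why the cell above stores `type`, `name`, and `label` alongside the embeddings: search results can then be mapped back to graph entities.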


## Temporal Graph Queries


In [None]:
from semantica.kg import TemporalGraphQuery

temporal_query = TemporalGraphQuery(
    enable_temporal_reasoning=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

query_results = temporal_query.query_at_time(
    kg,
    query={"type": "Transaction"},
    at_time="2024-01-01 10:05:00"
)

evolution = temporal_query.analyze_evolution(kg)
temporal_patterns = temporal_query.detect_temporal_patterns(kg, pattern_type="sequence")

print(f"Temporal queries: {len(query_results)} transactions at query time")
print(f"Temporal patterns detected: {len(temporal_patterns)}")
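
One plausible reading of a point-in-time query is "everything recorded at or before `at_time`". Whether `query_at_time` uses this semantics or validity intervals is not stated here, so treat the sketch below as an assumption. `events_up_to` and the event dicts are hypothetical.

```python
from datetime import datetime

def events_up_to(events, at_time):
    """Return events whose timestamp is at or before at_time (a point-in-time snapshot)."""
    cutoff = datetime.strptime(at_time, "%Y-%m-%d %H:%M:%S")
    return [event for event in events if event["timestamp"] <= cutoff]

events = [
    {"id": "T1", "timestamp": datetime(2024, 1, 1, 10, 0)},
    {"id": "T2", "timestamp": datetime(2024, 1, 1, 10, 5)},
    {"id": "T3", "timestamp": datetime(2024, 1, 1, 10, 6)},
]
snapshot = events_up_to(events, "2024-01-01 10:05:00")
snapshot_ids = [event["id"] for event in snapshot]
```

The cutoff is inclusive here, so the transaction stamped exactly 10:05:00 is part of the snapshot.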


## Temporal Pattern Detection


In [None]:
from semantica.kg import TemporalPatternDetector

pattern_detector = TemporalPatternDetector()

# Detect temporal fraud patterns
fraud_patterns = pattern_detector.detect_patterns(kg, pattern_type="fraud")

# Detect sequence patterns (rapid transactions, unusual timing)
sequence_patterns = pattern_detector.detect_patterns(kg, pattern_type="sequence")

print(f"Detected {len(fraud_patterns)} fraud patterns")
print(f"Detected {len(sequence_patterns)} sequence patterns")
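
The comment above mentions rapid transactions as a sequence pattern. A common way to detect them is a sliding time window per account. The sketch below is a minimal version of that technique using the transfers from the sample log; `rapid_senders`, the 5-minute window, and the threshold of 3 are assumptions for illustration and are unrelated to how `TemporalPatternDetector` works internally.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def rapid_senders(transfers, window=timedelta(minutes=5), min_count=3):
    """Return accounts that send at least min_count transfers within any span of length window."""
    recent = defaultdict(deque)  # account -> timestamps still inside the current window
    flagged = set()
    for account, ts in sorted(transfers, key=lambda t: t[1]):
        times = recent[account]
        times.append(ts)
        # Drop timestamps that have fallen out of the window ending at ts
        while ts - times[0] > window:
            times.popleft()
        if len(times) >= min_count:
            flagged.add(account)
    return flagged

# (sending account, timestamp) pairs taken from the sample log
transfers = [
    ("A123", datetime(2024, 1, 1, 10, 0)),
    ("A123", datetime(2024, 1, 1, 10, 1)),
    ("A123", datetime(2024, 1, 1, 10, 2)),
    ("B456", datetime(2024, 1, 1, 10, 4)),
    ("A123", datetime(2024, 1, 1, 10, 5)),
    ("C789", datetime(2024, 1, 1, 10, 6)),
]
flagged_accounts = rapid_senders(transfers)
```

Only `A123` is flagged: it sends three transfers between 10:00 and 10:02, while `B456` and `C789` send one each.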


## Reasoning and Fraud Detection


In [None]:
from semantica.reasoning import Reasoner

reasoner = Reasoner()

reasoner.add_rule("IF Account from Transaction AND Transaction amount > 10000 AND Transaction count > 3 THEN Account triggers Anomaly")
reasoner.add_rule("IF Transaction from Account AND Account triggers Anomaly THEN Transaction associated_with Pattern")

inferred_facts = reasoner.infer_facts(kg)

fraud_paths = reasoner.find_paths(
    kg,
    source_type="Account",
    target_type="Anomaly",
    max_hops=2
)

print(f"Inferred {len(inferred_facts)} facts")
print(f"Found {len(fraud_paths)} fraud paths")
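
The first rule combines two conditions on an account: a high transaction count and a large amount. How `Reasoner` parses and evaluates the rule string is not shown in this notebook, so the sketch below is one possible interpretation written as plain Python: flag accounts with more than 3 outgoing transfers, at least one of which exceeds 10000. The helper name and data layout are hypothetical.

```python
from collections import defaultdict

def accounts_triggering_anomaly(transfers, amount_limit=10000, count_limit=3):
    """Flag accounts with more than count_limit transfers, at least one above amount_limit."""
    amounts_by_account = defaultdict(list)
    for account, amount in transfers:
        amounts_by_account[account].append(amount)
    return sorted(
        account
        for account, amounts in amounts_by_account.items()
        if len(amounts) > count_limit and any(a > amount_limit for a in amounts)
    )

# (sending account, amount) pairs from the sample log
sample_transfers = [
    ("A123", 1000), ("A123", 5000), ("A123", 10000), ("A123", 50000),
    ("B456", 2000), ("C789", 1500),
]
anomalous_accounts = accounts_triggering_anomaly(sample_transfers)
```

Both comparisons are strict, as in the rule text: the $10000 transfer alone would not satisfy `amount > 10000`; the $50000 transfer does.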


## Analyzing Transaction Network Structure


In [None]:
from semantica.kg import GraphAnalyzer, CommunityDetector

graph_analyzer = GraphAnalyzer()
community_detector = CommunityDetector()

analysis = graph_analyzer.analyze_graph(kg)

communities = community_detector.detect_communities(kg, method="louvain")
connectivity = graph_analyzer.analyze_connectivity(kg)

# Collect communities that contain at least one account.
# NOTE: no fraud check is applied yet, so these are candidates rather than confirmed suspicious groups.
suspicious_communities = []
for community in communities:
    community_accounts = [e for e in kg.get("entities", [])
                          if e.get("id") in community and e.get("type") == "Account"]
    if community_accounts:
        # Placeholder: filter here on fraud patterns or anomalies found inside the community
        suspicious_communities.append({
            "community_id": len(suspicious_communities),
            "account_count": len(community_accounts)
        })

print("Graph analytics:")
print(f"  - Communities: {len(communities)}")
print(f"  - Connected components: {len(connectivity.get('components', []))}")
print(f"  - Graph density: {analysis.get('density', 0):.3f}")
print(f"  - Suspicious communities: {len(suspicious_communities)}")
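
Two of the reported metrics, connected components and density, are simple enough to compute by hand. The sketch below does so for the account-to-account transfers in the sample log, treated as an undirected graph. It uses only the standard library and hypothetical helper names; it is not how `GraphAnalyzer` is implemented.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first search from start
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components

def undirected_density(node_count, edge_count):
    """Edges present divided by edges possible, for a simple undirected graph."""
    if node_count < 2:
        return 0.0
    return 2 * edge_count / (node_count * (node_count - 1))

# Transfers from the sample log as undirected account-to-account edges
account_edges = [
    ("A123", "B456"), ("A123", "C789"), ("A123", "D012"),
    ("B456", "E789"), ("A123", "F012"), ("C789", "G345"),
]
account_nodes = sorted({node for edge in account_edges for node in edge})
components = connected_components(account_nodes, account_edges)
density = undirected_density(len(account_nodes), len(account_edges))
```

All seven accounts form a single component, and `A123` touches four of the six edges, which is the hub-like shape the fraud rules are meant to surface.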


## GraphRAG: Hybrid Vector + Graph Queries


In [None]:
from semantica.context import AgentContext

context = AgentContext(vector_store=vector_store, knowledge_graph=kg)

query = "What accounts have suspicious transaction patterns?"
results = context.retrieve(
    query,
    max_results=10,
    use_graph=True,
    expand_graph=True,
    include_entities=True,
    include_relationships=True
)

print(f"GraphRAG query: '{query}'")
print(f"\nRetrieved {len(results)} results:\n")
for i, result in enumerate(results[:5], 1):
    print(f"{i}. Score: {result.get('score', 0):.3f}")
    print(f"   Content: {result.get('content', '')[:200]}...")
    if result.get('related_entities'):
        print(f"   Related entities: {len(result['related_entities'])}")
    print()


## Visualizing the Temporal Fraud Detection Knowledge Graph


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()
visualizer.visualize(
    kg,
    output_path="fraud_detection_kg.html",
    layout="temporal",
    node_size=20
)

print("Visualization saved to fraud_detection_kg.html")


## Exporting Results


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="fraud_detection_kg.json", format="json")
exporter.export(kg, output_path="fraud_detection_kg.graphml", format="graphml")
exporter.export(kg, output_path="fraud_detection_alerts.csv", format="csv")

print("Exported fraud detection knowledge graph to JSON, GraphML, and CSV formats")
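
The CSV export above writes the knowledge graph itself. If a flat list of alerts is wanted instead, the standard library's `csv` module is enough. The sketch below is a minimal example; the alert field names (`account`, `reason`, `timestamp`) and the rows are assumptions for illustration, not output produced by Semantica.

```python
import csv
from io import StringIO

def alerts_to_csv(alerts):
    """Serialize alert dicts to CSV text with a fixed column order."""
    buffer = StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["account", "reason", "timestamp"])
    writer.writeheader()
    writer.writerows(alerts)
    return buffer.getvalue()

# Hypothetical alert rows
alerts = [
    {"account": "A123", "reason": "rapid transfers", "timestamp": "2024-01-01 10:02:00"},
    {"account": "A123", "reason": "large transfer", "timestamp": "2024-01-01 10:05:00"},
]
csv_text = alerts_to_csv(alerts)
csv_rows = csv_text.splitlines()
```

Writing to a `StringIO` buffer keeps the example free of side effects; passing an open file handle to `csv.DictWriter` instead would write the same rows to disk.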
