# Lab 16: Advanced Threat Actor Profiling & Attribution

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab16_threat_actor_profiling.ipynb)

## Learning Objectives

By the end of this lab, you will be able to:

1. **Extract and encode TTPs** using MITRE ATT&CK with weighted similarity
2. **Build graph-based infrastructure analysis** for pivot detection
3. **Analyze temporal patterns** to identify operational timing signatures
4. **Implement code similarity** using fuzzy hashing (ssdeep) and YARA
5. **Detect false flag operations** and attribution deception
6. **Generate confidence-calibrated** attribution assessments
7. **Apply the Diamond Model** for structured intrusion analysis

In [None]:
# Setup
!pip install pandas numpy scikit-learn networkx matplotlib anthropic ssdeep-py python-Levenshtein --quiet

import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from typing import List, Dict, Set, Tuple, Optional
from dataclasses import dataclass, field
from enum import Enum
import networkx as nx
import json
import hashlib
import re

## Part 1: Comprehensive Campaign Data

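Real-world attribution combines several independent evidence dimensions. Each campaign below records TTPs, malware families, infrastructure, targeting, timing, code artifacts, and language indicators.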

In [None]:
# Comprehensive campaign dataset with realistic attribution indicators
@dataclass
class Campaign:
    """Represents a threat campaign with attribution indicators."""
    id: str
    name: str
    ttps: List[str]  # MITRE ATT&CK technique IDs
    malware_families: List[str]
    infrastructure: Dict  # domains, IPs, certificates
    targets: Dict  # sectors, regions, victim types
    temporal: Dict  # timestamps, operational hours
    code_artifacts: Dict  # imphash, ssdeep, strings, pdb paths
    language_indicators: List[str]  # language artifacts found
    
CAMPAIGNS = [
    Campaign(
        id="CAMP-001",
        name="Operation ShadowStrike",
        ttps=["T1566.001", "T1059.001", "T1003.001", "T1071.001", "T1041", "T1567.002"],
        malware_families=["ShadowLoader", "CobaltStrike"],
        infrastructure={
            "domains": ["update-service.net", "cdn-assets.org", "secure-login.info"],
            "ips": ["185.141.62.x", "91.219.236.x"],
            "registrar": "Namecheap",
            "hosting": "BuyVM",
            "cert_issuer": "Let's Encrypt",
            "ssl_subject": "*.update-service.net"
        },
        targets={
            "sectors": ["Defense", "Aerospace"],
            "regions": ["North America", "Western Europe"],
            "victim_size": "Enterprise"
        },
        temporal={
            "first_seen": "2024-01-15",
            "last_seen": "2024-03-20",
            "operational_hours": [8, 9, 10, 11, 12, 13, 14, 15, 16],  # UTC+3
            "operational_days": [0, 1, 2, 3, 4],  # Mon-Fri
            "activity_timestamps": [
                "2024-01-15T11:23:00Z", "2024-01-16T09:45:00Z",
                "2024-02-12T14:30:00Z", "2024-03-01T10:15:00Z"
            ]
        },
        code_artifacts={
            "imphash": "a3b4c5d6e7f8a1b2c3d4e5f6a7b8c9d0",
            "ssdeep": "3072:xNQKBsL9Ck8TJHc0xNQKBsL9Ck8TJHc:xNQKBsL",
            "pdb_paths": ["C:\\Users\\developer\\Desktop\\project\\Release\\loader.pdb"],
            "unique_strings": ["shadowstrike_v2", "exfil_module_3"],
            "compiler": "MSVC 14.0",
            "linker_version": "14.0"
        },
        language_indicators=["Russian keyboard layout", "Cyrillic error messages"]
    ),
    Campaign(
        id="CAMP-002",
        name="Operation NightOwl",
        ttps=["T1566.001", "T1059.001", "T1003.001", "T1071.001", "T1041"],  # Similar TTPs
        malware_families=["NightLoader", "CobaltStrike"],  # Same C2 framework
        infrastructure={
            "domains": ["software-update.net", "content-cdn.org"],  # Similar naming
            "ips": ["185.141.62.x", "45.33.32.x"],  # Overlapping IP range!
            "registrar": "Namecheap",  # Same registrar
            "hosting": "BuyVM",  # Same hosting
            "cert_issuer": "Let's Encrypt",
            "ssl_subject": "*.software-update.net"
        },
        targets={
            "sectors": ["Defense", "Government"],
            "regions": ["Western Europe", "Eastern Europe"],
            "victim_size": "Enterprise"
        },
        temporal={
            "first_seen": "2024-02-01",
            "last_seen": "2024-04-15",
            "operational_hours": [9, 10, 11, 12, 13, 14, 15, 16, 17],  # Similar hours
            "operational_days": [0, 1, 2, 3, 4],
            "activity_timestamps": [
                "2024-02-01T12:00:00Z", "2024-02-15T10:30:00Z",
                "2024-03-20T15:45:00Z", "2024-04-01T11:20:00Z"
            ]
        },
        code_artifacts={
            "imphash": "a3b4c5d6e7f8a1b2c3d4e5f6a7b8c9d0",  # Same imphash!
            "ssdeep": "3072:xNQKBsL9Ck8TJHc0xNQKBsL9Ck:xNQKBsL",  # Similar ssdeep
            "pdb_paths": ["C:\\Users\\developer\\projects\\nightowl\\Release\\loader.pdb"],
            "unique_strings": ["nightowl_module", "exfil_handler"],
            "compiler": "MSVC 14.0",
            "linker_version": "14.0"
        },
        language_indicators=["Russian comments in code"]
    ),
    Campaign(
        id="CAMP-003",
        name="Operation DragonFire",
        ttps=["T1190", "T1505.003", "T1059.003", "T1021.002", "T1486"],  # Different TTPs
        malware_families=["WebShell-X", "RansomDragon"],
        infrastructure={
            "domains": ["api-gateway.cloud", "secure-storage.tech"],
            "ips": ["103.224.182.x", "43.255.154.x"],  # Different IP ranges
            "registrar": "GoDaddy",
            "hosting": "Alibaba Cloud",
            "cert_issuer": "DigiCert",
            "ssl_subject": "api-gateway.cloud"
        },
        targets={
            "sectors": ["Manufacturing", "Technology"],
            "regions": ["East Asia", "Southeast Asia"],
            "victim_size": "SMB"
        },
        temporal={
            "first_seen": "2024-01-20",
            "last_seen": "2024-03-30",
            "operational_hours": [1, 2, 3, 4, 5, 6, 7, 8, 9],  # Different timezone (UTC+8)
            "operational_days": [0, 1, 2, 3, 4, 5],  # Includes Saturday
            "activity_timestamps": [
                "2024-01-20T03:15:00Z", "2024-02-05T05:30:00Z",
                "2024-03-10T02:45:00Z", "2024-03-30T06:00:00Z"
            ]
        },
        code_artifacts={
            "imphash": "f1e2d3c4b5a6f7e8d9c0b1a2f3e4d5c6",
            "ssdeep": "1536:aB7cDe9FgHiJkLmN:aB7cDe9F",
            "pdb_paths": [],
            "unique_strings": ["dragon_v3", "ransom_encrypt"],
            "compiler": "MinGW",
            "linker_version": "6.3"
        },
        language_indicators=["Simplified Chinese strings", "Chinese keyboard layout"]
    ),
    Campaign(
        id="CAMP-004",
        name="Operation FalseFlag",  # Potential false flag operation
        ttps=["T1566.001", "T1059.001", "T1003.001", "T1071.001"],  # Mimics CAMP-001
        malware_families=["FakeShadow"],  # Mimics naming
        infrastructure={
            "domains": ["update-services.net"],  # Very similar to CAMP-001
            "ips": ["198.51.100.x"],  # Different infrastructure
            "registrar": "NameSilo",
            "hosting": "DigitalOcean",
            "cert_issuer": "Let's Encrypt",
            "ssl_subject": "update-services.net"
        },
        targets={
            "sectors": ["Defense"],
            "regions": ["Eastern Europe"],  # Different target region
            "victim_size": "Enterprise"
        },
        temporal={
            "first_seen": "2024-03-01",
            "last_seen": "2024-03-15",
            "operational_hours": [14, 15, 16, 17, 18, 19, 20],  # Different hours (consistent with UTC-5, not UTC+3)
            "operational_days": [0, 1, 2, 3, 4],
            "activity_timestamps": [
                "2024-03-01T16:00:00Z", "2024-03-05T18:30:00Z",
                "2024-03-10T15:15:00Z"
            ]
        },
        code_artifacts={
            "imphash": "d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9",  # Different imphash
            "ssdeep": "768:qR5tU7vW9xY1zA3bC5dE:qR5tU7v",
            "pdb_paths": ["D:\\projects\\fake_shadow\\Release\\payload.pdb"],
            "unique_strings": ["shadowstrike_v2"],  # Copied string from CAMP-001!
            "compiler": "Clang",  # Different compiler
            "linker_version": "12.0"
        },
        language_indicators=["Intentional Cyrillic strings", "But UTF-8 BOM typical of Western tools"]
    )
]

print("Loaded campaigns:")
for c in CAMPAIGNS:
    print(f"  {c.id}: {c.name} - {len(c.ttps)} TTPs, {len(c.malware_families)} malware families")

## Part 2: Known Threat Actor Database

Reference profiles for attribution matching.

In [None]:
@dataclass
class ThreatActor:
    """Known threat actor profile."""
    name: str
    aliases: List[str]
    origin: str
    motivation: str
    active_since: str
    signature_ttps: List[str]
    preferred_malware: List[str]
    target_sectors: List[str]
    target_regions: List[str]
    infrastructure_patterns: Dict
    operational_tempo: Dict
    code_signatures: Dict
    confidence_indicators: List[str]

KNOWN_ACTORS = {
    "APT28": ThreatActor(
        name="APT28",
        aliases=["Fancy Bear", "Sofacy", "Sednit", "STRONTIUM"],
        origin="Russia",
        motivation="Espionage",
        active_since="2007",
        signature_ttps=["T1566.001", "T1566.002", "T1059.001", "T1003.001", "T1071.001", "T1041"],
        preferred_malware=["X-Agent", "Zebrocy", "CobaltStrike", "Komplex"],
        target_sectors=["Government", "Defense", "Aerospace", "Media"],
        target_regions=["North America", "Western Europe", "Eastern Europe"],
        infrastructure_patterns={
            "registrars": ["Namecheap", "PDR Ltd"],
            "hosting": ["BuyVM", "Linode"],
            "domain_themes": ["update", "service", "cdn", "software"],
            "ip_ranges": ["185.141.x.x", "91.219.x.x"]
        },
        operational_tempo={
            "timezone_offset": 3,  # UTC+3 (Moscow)
            "working_hours": [8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
            "working_days": [0, 1, 2, 3, 4]  # Mon-Fri
        },
        code_signatures={
            "compilers": ["MSVC 14.0", "MSVC 19.0"],
            "language_artifacts": ["Russian", "Cyrillic"]
        },
        confidence_indicators=[
            "Use of X-Agent malware",
            "OAuth phishing campaigns",
            "Targeting NATO members",
            "Moscow working hours"
        ]
    ),
    "APT41": ThreatActor(
        name="APT41",
        aliases=["Winnti", "Barium", "Wicked Panda"],
        origin="China",
        motivation="Espionage + Financial",
        active_since="2012",
        signature_ttps=["T1190", "T1505.003", "T1059.003", "T1021.002", "T1486"],
        preferred_malware=["ShadowPad", "Winnti", "PlugX", "CobaltStrike"],
        target_sectors=["Technology", "Gaming", "Healthcare", "Manufacturing"],
        target_regions=["East Asia", "Southeast Asia", "North America"],
        infrastructure_patterns={
            "registrars": ["GoDaddy", "Alibaba"],
            "hosting": ["Alibaba Cloud", "Tencent Cloud"],
            "domain_themes": ["api", "gateway", "cloud", "storage"],
            "ip_ranges": ["103.x.x.x", "43.x.x.x"]
        },
        operational_tempo={
            "timezone_offset": 8,  # UTC+8 (Beijing)
            "working_hours": [1, 2, 3, 4, 5, 6, 7, 8, 9],  # When it's 9-17 in Beijing
            "working_days": [0, 1, 2, 3, 4, 5]  # Mon-Sat
        },
        code_signatures={
            "compilers": ["MinGW", "Clang"],
            "language_artifacts": ["Chinese", "Simplified Chinese"]
        },
        confidence_indicators=[
            "Use of ShadowPad malware",
            "Supply chain compromises",
            "Gaming industry targeting",
            "Beijing working hours"
        ]
    )
}

print("Known Threat Actors:")
for name, actor in KNOWN_ACTORS.items():
    print(f"  {name} ({actor.origin}): {actor.motivation}")
    print(f"    Signature TTPs: {len(actor.signature_ttps)}, Target Sectors: {actor.target_sectors}")

## Part 3: Advanced TTP Analysis with Weighted Similarity

Not all TTPs carry equal attribution value: rare techniques are more distinctive than common ones, so a shared rare technique should count for more.
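
To see why weighting matters, the sketch below compares plain and weighted Jaccard similarity on three toy TTP sets. The weights and the helper names `plain_jaccard` and `weighted_jaccard` are illustrative assumptions, not values published by MITRE.

```python
from typing import Dict, Iterable

# Hypothetical weights for illustration; unlisted techniques default to 1.0.
WEIGHTS: Dict[str, float] = {
    "T1003.001": 2.0,  # treated as distinctive in this example
    "T1041": 2.0,
    "T1566.001": 1.0,  # treated as common in this example
    "T1059.001": 1.0,
}


def plain_jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Unweighted Jaccard: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def weighted_jaccard(a: Iterable[str], b: Iterable[str],
                     weights: Dict[str, float], default: float = 1.0) -> float:
    """Weighted Jaccard: sum of weights in A & B over sum of weights in A | B."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    shared = sum(weights.get(t, default) for t in set_a & set_b)
    total = sum(weights.get(t, default) for t in union)
    return shared / total


base = ["T1566.001", "T1003.001"]
shares_rare = ["T1059.001", "T1003.001"]    # overlaps on the distinctive technique
shares_common = ["T1566.001", "T1041"]      # overlaps on the common technique

for label, other in [("shares rare", shares_rare), ("shares common", shares_common)]:
    print(label,
          round(plain_jaccard(base, other), 3),
          round(weighted_jaccard(base, other, WEIGHTS), 3))
```

Both pairs share one technique out of three, so plain Jaccard cannot tell them apart. With these example weights, the pair that shares the distinctive technique scores 0.5 and the pair that shares the common one scores 0.2.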

In [None]:
class AdvancedTTPAnalyzer:
    """TTP analysis with weighted similarity based on technique rarity."""
    
    # TTP weights based on attribution value (higher = more distinctive)
    TTP_WEIGHTS = {
        # Very High - Custom/rare techniques
        "T1055.012": 3.0,  # Process Hollowing
        "T1127.001": 3.0,  # MSBuild
        "T1218.011": 3.0,  # Rundll32
        
        # High - Distinctive techniques
        "T1003.001": 2.0,  # LSASS Memory
        "T1041": 2.0,      # Exfiltration Over C2
        "T1567.002": 2.0,  # Exfil to Cloud Storage
        "T1486": 2.0,      # Data Encrypted for Impact
        "T1505.003": 2.0,  # Web Shell
        
        # Medium - Common but still useful
        "T1071.001": 1.5,  # Web Protocols
        "T1021.002": 1.5,  # SMB/Windows Admin Shares
        "T1190": 1.5,      # Exploit Public-Facing App
        
        # Low - Very common techniques
        "T1566.001": 1.0,  # Spearphishing Attachment
        "T1566.002": 1.0,  # Spearphishing Link
        "T1059.001": 1.0,  # PowerShell
        "T1059.003": 1.0,  # Windows Command Shell
    }
    
    def __init__(self):
        self.all_ttps = list(self.TTP_WEIGHTS.keys())
    
    def get_weight(self, ttp: str) -> float:
        """Get attribution weight for a TTP."""
        return self.TTP_WEIGHTS.get(ttp, 1.0)
    
    def weighted_jaccard(self, ttps_a: List[str], ttps_b: List[str]) -> float:
        """Calculate weighted Jaccard similarity."""
        set_a, set_b = set(ttps_a), set(ttps_b)
        
        if not set_a or not set_b:
            return 0.0
        
        intersection = set_a & set_b
        union = set_a | set_b
        
        # Weighted intersection / weighted union
        weighted_intersection = sum(self.get_weight(t) for t in intersection)
        weighted_union = sum(self.get_weight(t) for t in union)
        
        return weighted_intersection / weighted_union if weighted_union > 0 else 0.0
    
    def calculate_ttp_fingerprint(self, ttps: List[str]) -> Dict:
        """Generate a distinctive TTP fingerprint."""
        return {
            "total_ttps": len(ttps),
            "unique_ttps": len(set(ttps)),
            "high_value_ttps": [t for t in ttps if self.get_weight(t) >= 2.0],
            "weighted_score": sum(self.get_weight(t) for t in ttps),
            "tactic_coverage": self._calculate_tactic_coverage(ttps)
        }
    
    def _calculate_tactic_coverage(self, ttps: List[str]) -> Dict:
        """Map TTPs to tactics for kill chain analysis."""
        tactic_map = {
            "initial_access": ["T1566.001", "T1566.002", "T1190"],
            "execution": ["T1059.001", "T1059.003"],
            "persistence": ["T1505.003"],
            "credential_access": ["T1003.001"],
            "lateral_movement": ["T1021.002"],
            "command_and_control": ["T1071.001"],
            "exfiltration": ["T1041", "T1567.002"],
            "impact": ["T1486"]
        }
        
        # A tactic counts as covered if any observed TTP maps to it
        coverage = {}
        for tactic, tactic_ttps in tactic_map.items():
            coverage[tactic] = any(t in tactic_ttps for t in ttps)
        
        return coverage


# Test TTP analysis
ttp_analyzer = AdvancedTTPAnalyzer()

print("TTP Fingerprints:")
for campaign in CAMPAIGNS[:2]:
    fp = ttp_analyzer.calculate_ttp_fingerprint(campaign.ttps)
    print(f"\n{campaign.name}:")
    print(f"  High-value TTPs: {fp['high_value_ttps']}")
    print(f"  Weighted Score: {fp['weighted_score']:.1f}")

# Compare campaigns
print("\nWeighted TTP Similarity Matrix:")
for c1 in CAMPAIGNS:
    sims = []
    for c2 in CAMPAIGNS:
        sim = ttp_analyzer.weighted_jaccard(c1.ttps, c2.ttps)
        sims.append(f"{sim:.2f}")
    print(f"  {c1.id}: [{', '.join(sims)}]")

## Part 4: Graph-Based Infrastructure Analysis

Identify infrastructure overlaps and pivots between campaigns.
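
One way to score overlap between two campaigns is the overlap coefficient: shared items divided by the size of the smaller set. The sketch below applies it to the domains, IPs, registrar, and hosting values of CAMP-001 and CAMP-002 from Part 1, using plain Python sets instead of a graph. The helper names are hypothetical.

```python
from typing import Set


def overlap_coefficient(a: Set[str], b: Set[str]) -> float:
    """Shared items divided by the size of the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared items divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Infrastructure values copied from the CAMP-001 and CAMP-002 records.
camp_001 = {
    "update-service.net", "cdn-assets.org", "secure-login.info",
    "185.141.62.x", "91.219.236.x", "Namecheap", "BuyVM",
}
camp_002 = {
    "software-update.net", "content-cdn.org",
    "185.141.62.x", "45.33.32.x", "Namecheap", "BuyVM",
}

shared = camp_001 & camp_002
print("shared:", sorted(shared))
print("overlap coefficient:", overlap_coefficient(camp_001, camp_002))
print("jaccard:", jaccard(camp_001, camp_002))
```

The overlap coefficient is more forgiving than Jaccard when one campaign has far more observed infrastructure than the other. Note that a shared registrar or hosting provider is weaker evidence than a shared IP range, since popular providers serve many unrelated customers.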

In [None]:
class InfrastructureGraphAnalyzer:
    """Graph-based infrastructure analysis for attribution."""
    
    def __init__(self):
        self.graph = nx.Graph()
    
    def build_infrastructure_graph(self, campaigns: List[Campaign]) -> nx.Graph:
        """Build a graph connecting campaigns through shared infrastructure."""
        self.graph = nx.Graph()
        
        for campaign in campaigns:
            # Add campaign node
            self.graph.add_node(campaign.id, type="campaign", name=campaign.name)
            
            infra = campaign.infrastructure
            
            # Add infrastructure nodes and edges
            for domain in infra.get("domains", []):
                self.graph.add_node(domain, type="domain")
                self.graph.add_edge(campaign.id, domain, relation="uses_domain")
            
            for ip in infra.get("ips", []):
                self.graph.add_node(ip, type="ip")
                self.graph.add_edge(campaign.id, ip, relation="uses_ip")
            
            # Add registrar
            registrar = infra.get("registrar")
            if registrar:
                self.graph.add_node(registrar, type="registrar")
                self.graph.add_edge(campaign.id, registrar, relation="uses_registrar")
            
            # Add hosting provider
            hosting = infra.get("hosting")
            if hosting:
                self.graph.add_node(hosting, type="hosting")
                self.graph.add_edge(campaign.id, hosting, relation="uses_hosting")
        
        return self.graph
    
    def find_infrastructure_overlaps(self) -> List[Dict]:
        """Find campaigns with shared infrastructure."""
        overlaps = []
        campaign_nodes = [n for n, d in self.graph.nodes(data=True) if d.get('type') == 'campaign']
        
        for i, c1 in enumerate(campaign_nodes):
            for c2 in campaign_nodes[i+1:]:
                # Find common neighbors (shared infrastructure)
                c1_neighbors = set(self.graph.neighbors(c1))
                c2_neighbors = set(self.graph.neighbors(c2))
                shared = c1_neighbors & c2_neighbors
                
                if shared:
                    shared_details = []
                    for node in shared:
                        node_type = self.graph.nodes[node].get('type', 'unknown')
                        shared_details.append({"node": node, "type": node_type})
                    
                    overlaps.append({
                        "campaign_1": c1,
                        "campaign_2": c2,
                        "shared_infrastructure": shared_details,
                        "overlap_count": len(shared),
                        "overlap_score": len(shared) / min(len(c1_neighbors), len(c2_neighbors))
                    })
        
        return sorted(overlaps, key=lambda x: x['overlap_score'], reverse=True)
    
    def identify_infrastructure_clusters(self) -> List[Set]:
        """Find clusters of related campaigns based on infrastructure."""
        # Project bipartite graph to campaign-only graph
        campaign_nodes = [n for n, d in self.graph.nodes(data=True) if d.get('type') == 'campaign']
        
        # Create campaign similarity graph
        campaign_graph = nx.Graph()
        campaign_graph.add_nodes_from(campaign_nodes)
        
        overlaps = self.find_infrastructure_overlaps()
        for overlap in overlaps:
            if overlap['overlap_score'] > 0.2:  # Threshold
                campaign_graph.add_edge(
                    overlap['campaign_1'],
                    overlap['campaign_2'],
                    weight=overlap['overlap_score']
                )
        
        # Find connected components (clusters)
        clusters = list(nx.connected_components(campaign_graph))
        return clusters


# Analyze infrastructure
infra_analyzer = InfrastructureGraphAnalyzer()
infra_analyzer.build_infrastructure_graph(CAMPAIGNS)

print("Infrastructure Overlaps:")
overlaps = infra_analyzer.find_infrastructure_overlaps()
for overlap in overlaps:
    print(f"\n{overlap['campaign_1']} <-> {overlap['campaign_2']}:")
    print(f"  Overlap Score: {overlap['overlap_score']:.2%}")
    for shared in overlap['shared_infrastructure']:
        print(f"    - {shared['type']}: {shared['node']}")

print("\nInfrastructure-Based Clusters:")
clusters = infra_analyzer.identify_infrastructure_clusters()
for i, cluster in enumerate(clusters):
    print(f"  Cluster {i+1}: {cluster}")

## Part 5: Temporal Pattern Analysis

Operational timing can hint at an actor's working hours and timezone, although timestamps are easy to manipulate and should not be read in isolation.
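
Hours of the day are circular, so an arithmetic mean can mislead when activity straddles midnight. The sketch below shows a circular mean and a wrap-aware hour difference. The function names `circular_mean_hour` and `circular_hour_diff` are assumptions made for this example.

```python
import math
from typing import Sequence


def circular_mean_hour(hours: Sequence[float]) -> float:
    """Mean hour of day on a 24-hour circle, returned in the range [0, 24)."""
    if not hours:
        raise ValueError("hours must not be empty")
    # Map each hour to an angle on the unit circle, average the vectors,
    # then map the resulting angle back to an hour.
    angles = [2 * math.pi * h / 24 for h in hours]
    sin_sum = sum(math.sin(a) for a in angles)
    cos_sum = sum(math.cos(a) for a in angles)
    mean_angle = math.atan2(sin_sum, cos_sum)
    return (mean_angle * 24 / (2 * math.pi)) % 24


def circular_hour_diff(h1: float, h2: float) -> float:
    """Shortest distance between two hours on a 24-hour clock (0 to 12)."""
    diff = abs(h1 - h2) % 24
    return min(diff, 24 - diff)


late_night = [23, 1]  # activity on both sides of midnight
print("arithmetic mean:", sum(late_night) / len(late_night))
print("circular mean:", circular_mean_hour(late_night))
print("difference 23h vs 1h:", circular_hour_diff(23, 1))
```

For activity at 23:00 and 01:00 the arithmetic mean is noon, while the circular mean lands at midnight (up to floating-point error). The circular mean is undefined when the hours cancel out exactly, for example 00:00 and 12:00, so a production version would need to handle that case.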

In [None]:
class TemporalAnalyzer:
    """Analyze temporal patterns for attribution."""
    
    def extract_temporal_fingerprint(self, campaign: Campaign) -> Dict:
        """Extract temporal fingerprint from campaign activity."""
        timestamps = campaign.temporal.get("activity_timestamps", [])
        
        if not timestamps:
            return {}
        
        # Parse timestamps
        datetimes = [datetime.fromisoformat(ts.replace('Z', '+00:00')) for ts in timestamps]
        
        # Extract hours (UTC)
        hours = [dt.hour for dt in datetimes]
        days = [dt.weekday() for dt in datetimes]
        
        # Estimate timezone, assuming mean activity falls around local noon
        avg_hour = np.mean(hours)
        estimated_tz = self._estimate_timezone(avg_hour)
        
        return {
            "utc_hours": hours,
            "days_of_week": days,
            "average_hour_utc": avg_hour,
            "estimated_timezone": estimated_tz,
            "working_pattern": self._classify_working_pattern(hours, days),
            "activity_span_days": (max(datetimes) - min(datetimes)).days
        }
    
    def _estimate_timezone(self, avg_utc_hour: float) -> str:
        """Estimate timezone, assuming mean activity falls at local noon."""
        # If avg activity is at 12 UTC, and we assume noon local, offset is 0
        # If avg activity is at 6 UTC, could be noon at UTC+6
        assumed_local_noon = 12
        offset = assumed_local_noon - avg_utc_hour
        
        # Round to common timezone
        if 2 <= offset <= 4:
            return "UTC+3 (Moscow/Eastern Europe)"
        elif 7 <= offset <= 9:
            return "UTC+8 (Beijing/Singapore)"
        elif -6 <= offset <= -4:
            return "UTC-5 (US Eastern)"
        elif -9 <= offset <= -7:
            return "UTC-8 (US Pacific)"
        else:
            return f"UTC{int(offset):+d}"
    
    def _classify_working_pattern(self, hours: List[int], days: List[int]) -> str:
        """Classify working pattern."""
        weekend_activity = sum(1 for d in days if d >= 5)
        total = len(days)
        
        if weekend_activity / total > 0.3:
            return "6-day work week (common in Asia)"
        elif weekend_activity == 0:
            return "5-day work week (Western pattern)"
        else:
            return "Mixed pattern"
    
    def calculate_temporal_similarity(self, fp1: Dict, fp2: Dict) -> float:
        """Calculate similarity between temporal fingerprints."""
        if not fp1 or not fp2:
            return 0.0
        
        score = 0.0
        
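        # Hour similarity (closer hours = higher score).
        # Hours wrap at midnight, so 23:00 and 01:00 are 2 hours apart, not 22.
        hour_diff = abs(fp1['average_hour_utc'] - fp2['average_hour_utc'])
        hour_diff = min(hour_diff, 24 - hour_diff)
        hour_sim = 1.0 - hour_diff / 12  # Circular difference is at most 12 hours
        score += hour_sim * 0.5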
        
        # Working pattern match
        if fp1['working_pattern'] == fp2['working_pattern']:
            score += 0.3
        
        # Timezone match
        if fp1['estimated_timezone'] == fp2['estimated_timezone']:
            score += 0.2
        
        return score


# Analyze temporal patterns
temporal_analyzer = TemporalAnalyzer()

print("Temporal Fingerprints:")
temporal_fps = {}
for campaign in CAMPAIGNS:
    fp = temporal_analyzer.extract_temporal_fingerprint(campaign)
    temporal_fps[campaign.id] = fp
    print(f"\n{campaign.name}:")
    print(f"  Estimated Timezone: {fp.get('estimated_timezone', 'Unknown')}")
    print(f"  Working Pattern: {fp.get('working_pattern', 'Unknown')}")
    print(f"  Activity Hours (UTC): {fp.get('utc_hours', [])}")

print("\nTemporal Similarity (CAMP-001 vs others):")
fp1 = temporal_fps['CAMP-001']
for cid, fp2 in temporal_fps.items():
    sim = temporal_analyzer.calculate_temporal_similarity(fp1, fp2)
    print(f"  CAMP-001 <-> {cid}: {sim:.2f}")

## Part 6: Code Similarity & Malware Attribution

Deep code analysis for attribution indicators.
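
An ssdeep digest has the form `blocksize:chunk:double_chunk`, and two digests are only comparable when their block sizes are equal or differ by a factor of two. The sketch below respects that rule but substitutes `difflib.SequenceMatcher` for ssdeep's own scoring, so its numbers are not real ssdeep scores. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def parse_ssdeep(digest: str) -> Tuple[int, str, str]:
    """Split a digest into (block size, chunk hash, double-block chunk hash)."""
    block_size, chunk, double_chunk = digest.split(":", 2)
    return int(block_size), chunk, double_chunk


def approx_ssdeep_similarity(d1: str, d2: str) -> float:
    """Rough 0-1 similarity; a stand-in for, not a copy of, ssdeep scoring."""
    b1, c1, dc1 = parse_ssdeep(d1)
    b2, c2, dc2 = parse_ssdeep(d2)
    if b1 == b2:
        return SequenceMatcher(None, c1, c2).ratio()
    if b1 == 2 * b2:
        # d2's double chunk was computed at block size 2 * b2 == b1
        return SequenceMatcher(None, c1, dc2).ratio()
    if b2 == 2 * b1:
        return SequenceMatcher(None, dc1, c2).ratio()
    return 0.0  # block sizes too far apart to compare


# Digests copied from the campaign records in Part 1.
camp_001 = "3072:xNQKBsL9Ck8TJHc0xNQKBsL9Ck8TJHc:xNQKBsL"
camp_002 = "3072:xNQKBsL9Ck8TJHc0xNQKBsL9Ck:xNQKBsL"
camp_004 = "768:qR5tU7vW9xY1zA3bC5dE:qR5tU7v"

print("CAMP-001 vs CAMP-002:", round(approx_ssdeep_similarity(camp_001, camp_002), 2))
print("CAMP-001 vs CAMP-004:", approx_ssdeep_similarity(camp_001, camp_004))
```

With a working ssdeep binding installed, `ssdeep.compare()` would be the better choice. It returns an integer from 0 to 100, so divide by 100 to get a score on the same 0-1 scale.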

In [None]:
class CodeSimilarityAnalyzer:
    """Analyze code similarities for attribution."""
    
    def calculate_code_similarity(self, c1: Campaign, c2: Campaign) -> Dict:
        """Calculate multi-dimensional code similarity."""
        art1 = c1.code_artifacts
        art2 = c2.code_artifacts
        
        scores = {}
        
        # Imphash exact match (very strong indicator)
        if art1.get('imphash') and art2.get('imphash'):
            scores['imphash_match'] = 1.0 if art1['imphash'] == art2['imphash'] else 0.0
        
        # SSDeep fuzzy hash similarity
        if art1.get('ssdeep') and art2.get('ssdeep'):
            scores['ssdeep_similarity'] = self._ssdeep_compare(art1['ssdeep'], art2['ssdeep'])
        
        # Compiler match
        if art1.get('compiler') and art2.get('compiler'):
            scores['compiler_match'] = 1.0 if art1['compiler'] == art2['compiler'] else 0.0
        
        # Unique string overlap
        strings1 = set(art1.get('unique_strings', []))
        strings2 = set(art2.get('unique_strings', []))
        if strings1 and strings2:
            overlap = strings1 & strings2
            scores['string_overlap'] = len(overlap) / min(len(strings1), len(strings2))
            scores['shared_strings'] = list(overlap)
        
        # PDB path analysis
        pdb1 = art1.get('pdb_paths', [])
        pdb2 = art2.get('pdb_paths', [])
        if pdb1 and pdb2:
            scores['pdb_similarity'] = self._analyze_pdb_paths(pdb1, pdb2)
        
        # Calculate weighted score
        weights = {
            'imphash_match': 0.35,
            'ssdeep_similarity': 0.25,
            'string_overlap': 0.20,
            'compiler_match': 0.10,
            'pdb_similarity': 0.10
        }
        
        weighted_score = sum(
            scores.get(k, 0) * v 
            for k, v in weights.items()
            if k in scores
        )
        scores['weighted_total'] = weighted_score
        
        return scores
    
    def _ssdeep_compare(self, hash1: str, hash2: str) -> float:
        """Simplified stand-in for ssdeep comparison (0-1 score)."""
        # A real implementation would call ssdeep.compare(), which returns 0-100.
        # Here we count position-by-position character matches over the first
        # 20 characters (block-size prefix included), which is only a rough proxy.
        prefix_len = min(len(hash1), len(hash2), 20)
        matches = sum(1 for a, b in zip(hash1[:prefix_len], hash2[:prefix_len]) if a == b)
        return matches / prefix_len
    
    def _analyze_pdb_paths(self, paths1: List[str], paths2: List[str]) -> float:
        """Analyze PDB path similarity."""
        for p1 in paths1:
            for p2 in paths2:
                # Check for same username
                user1 = self._extract_username(p1)
                user2 = self._extract_username(p2)
                if user1 and user2 and user1 == user2:
                    return 1.0
                
                # Check for similar structure
                if self._similar_path_structure(p1, p2):
                    return 0.5
        return 0.0
    
    def _extract_username(self, pdb_path: str) -> Optional[str]:
        """Extract username from PDB path."""
        match = re.search(r'Users\\([^\\]+)\\', pdb_path)
        return match.group(1) if match else None
    
    def _similar_path_structure(self, p1: str, p2: str) -> bool:
        """Check if paths have similar structure."""
        parts1 = p1.lower().split('\\')
        parts2 = p2.lower().split('\\')
        
        # Check for common project structure indicators
        common = set(parts1) & set(parts2)
        return len(common) >= 3


# Analyze code similarities
code_analyzer = CodeSimilarityAnalyzer()

print("Code Similarity Analysis:")
for i, c1 in enumerate(CAMPAIGNS):
    for c2 in CAMPAIGNS[i+1:]:
        sim = code_analyzer.calculate_code_similarity(c1, c2)
        if sim['weighted_total'] > 0.2:  # Only show significant similarities
            print(f"\n{c1.id} <-> {c2.id}: Score={sim['weighted_total']:.2f}")
            if sim.get('imphash_match'):
                print(f"  ⚠️  IMPHASH MATCH (very strong indicator)")
            if sim.get('shared_strings'):
                print(f"  Shared strings: {sim['shared_strings']}")

## Part 7: False Flag Detection

Identify potential deception and attribution manipulation.
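
One deception signal in the sample data is a lookalike domain: CAMP-004 uses `update-services.net`, one character away from CAMP-001's `update-service.net`. The sketch below flags such near-duplicates with a plain-Python Levenshtein distance. The helper names and the `max_distance=2` threshold are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def find_lookalikes(candidates: Iterable[str], known: Iterable[str],
                    max_distance: int = 2) -> List[Tuple[str, str, int]]:
    """Return (candidate, known, distance) for near-identical but unequal domains."""
    known_list = list(known)
    hits = []
    for cand in candidates:
        for ref in known_list:
            dist = levenshtein(cand.lower(), ref.lower())
            if 0 < dist <= max_distance:
                hits.append((cand, ref, dist))
    return hits


camp_001_domains = ["update-service.net", "cdn-assets.org", "secure-login.info"]
camp_004_domains = ["update-services.net"]

for cand, ref, dist in find_lookalikes(camp_004_domains, camp_001_domains):
    print(f"{cand} resembles {ref} (edit distance {dist})")
```

A near-identical domain on different hosting can indicate imitation rather than shared ownership, which is why it is treated here as a prompt for closer review rather than as proof. The `python-Levenshtein` package installed in the setup cell provides `Levenshtein.distance`, which could replace the helper above.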

In [None]:
class FalseFlagDetector:
    """Detect potential false flag operations."""
    
    def __init__(self):
        self.indicators = []
    
    def analyze_for_false_flags(self, campaign: Campaign, 
                                 similar_campaigns: List[Campaign],
                                 temporal_fp: Dict,
                                 code_sim: Dict) -> Dict:
        """Analyze campaign for false flag indicators."""
        flags = []
        confidence_reduction = 0.0
        
        # 1. Check for copied but not identical code
        if code_sim.get('shared_strings') and not code_sim.get('imphash_match'):
            flags.append({
                "type": "copied_strings",
                "description": "Unique strings copied but different binary",
                "severity": "high"
            })
            confidence_reduction += 0.2
        
        # 2. Check for inconsistent temporal patterns
        if temporal_fp:
            for sim_camp in similar_campaigns:
                sim_tf = TemporalAnalyzer().extract_temporal_fingerprint(sim_camp)
                if sim_tf:
                    hour_diff = abs(temporal_fp.get('average_hour_utc', 0) - 
                                   sim_tf.get('average_hour_utc', 0))
                    hour_diff = min(hour_diff, 24 - hour_diff)  # Hours wrap at midnight
                    if hour_diff > 6:  # Likely a different timezone
                        flags.append({
                            "type": "temporal_inconsistency",
                            "description": f"Operating hours differ by {hour_diff:.0f}h from similar campaign",
                            "severity": "high"
                        })
                        confidence_reduction += 0.15
        
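        # 3. Check for language indicator inconsistencies.
        # Map keywords to a language family so that "Russian" and "Cyrillic"
        # (or "Chinese" and "Simplified Chinese") count as a single language.
        if campaign.language_indicators:
            language_keywords = {
                "russian": "Russian", "cyrillic": "Russian",
                "chinese": "Chinese", "western": "Western"
            }
            families = {
                family
                for ind in campaign.language_indicators
                for keyword, family in language_keywords.items()
                if keyword in ind.lower()
            }
            if len(families) > 1:
                flags.append({
                    "type": "mixed_language_artifacts",
                    "description": "Multiple language indicators (possible planting)",
                    "severity": "medium"
                })
                confidence_reduction += 0.1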
        
        # 4. Check for infrastructure inconsistency
        if similar_campaigns:
            sim_infra = similar_campaigns[0].infrastructure
            camp_infra = campaign.infrastructure
            
            # Similar domain naming but different hosting
            if (self._similar_domain_naming(sim_infra, camp_infra) and 
                camp_infra.get('hosting') != sim_infra.get('hosting')):
                flags.append({
                    "type": "infrastructure_inconsistency",
                    "description": "Similar domain naming but different infrastructure provider",
                    "severity": "medium"
                })
                confidence_reduction += 0.1
        
        # 5. Check for compiler mismatch
        if similar_campaigns:
            sim_compiler = similar_campaigns[0].code_artifacts.get('compiler')
            camp_compiler = campaign.code_artifacts.get('compiler')
            if sim_compiler and camp_compiler and sim_compiler != camp_compiler:
                flags.append({
                    "type": "toolchain_mismatch",
                    "description": f"Different compiler ({camp_compiler} vs {sim_compiler})",
                    "severity": "medium"
                })
                confidence_reduction += 0.1
        
        return {
            "campaign": campaign.id,
            "false_flag_indicators": flags,
            "indicator_count": len(flags),
            "confidence_reduction": min(confidence_reduction, 0.5),
            "assessment": self._assess_false_flag_likelihood(len(flags))
        }
    
    def _similar_domain_naming(self, infra1: Dict, infra2: Dict) -> bool:
        """Check if domain naming conventions are similar."""
        domains1 = infra1.get('domains', [])
        domains2 = infra2.get('domains', [])
        
        for d1 in domains1:
            for d2 in domains2:
                # Check for similar prefixes
                prefix1 = d1.split('.')[0].split('-')
                prefix2 = d2.split('.')[0].split('-')
                if set(prefix1) & set(prefix2):
                    return True
        return False
    
    def _assess_false_flag_likelihood(self, indicator_count: int) -> str:
        """Assess likelihood of false flag operation."""
        if indicator_count >= 4:
            return "HIGH - Strong indicators of deception"
        elif indicator_count >= 2:
            return "MEDIUM - Some inconsistencies warrant investigation"
        elif indicator_count >= 1:
            return "LOW - Minor inconsistencies noted"
        else:
            return "NONE - No deception indicators"


# Check for false flags
ff_detector = FalseFlagDetector()

print("False Flag Analysis:")
# CAMP-004 is designed to look like a false flag
for campaign in CAMPAIGNS:
    # Find similar campaigns
    similar = [c for c in CAMPAIGNS if c.id != campaign.id 
               and ttp_analyzer.weighted_jaccard(c.ttps, campaign.ttps) > 0.5]
    
    if similar:
        temporal_fp = temporal_analyzer.extract_temporal_fingerprint(campaign)
        code_sim = code_analyzer.calculate_code_similarity(campaign, similar[0])
        
        result = ff_detector.analyze_for_false_flags(
            campaign, similar, temporal_fp, code_sim
        )
        
        if result['indicator_count'] > 0:
            print(f"\n{campaign.name} ({campaign.id}):")
            print(f"  Assessment: {result['assessment']}")
            print(f"  Confidence Reduction: {result['confidence_reduction']:.0%}")
            for flag in result['false_flag_indicators']:
                print(f"  ⚠️  {flag['type']}: {flag['description']}")
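
The domain-naming check above treats any shared hyphen-separated token as a similar convention, so generic words such as `update` can trigger it. A minimal sketch of a stricter variant follows; `GENERIC_TOKENS`, `domain_tokens`, and `shared_naming_tokens` are hypothetical names that are not part of the lab's `FalseFlagDetector`, and the stop list is an assumption you would tune to your own data.

```python
from typing import List, Set

# Hypothetical stop list: tokens too generic to signal a shared naming convention
GENERIC_TOKENS: Set[str] = {"www", "mail", "update", "cdn", "api"}


def domain_tokens(domain: str) -> Set[str]:
    """Split the leftmost DNS label on hyphens and drop generic tokens."""
    label = domain.lower().split(".")[0]
    return {tok for tok in label.split("-") if tok and tok not in GENERIC_TOKENS}


def shared_naming_tokens(domains_a: List[str], domains_b: List[str]) -> Set[str]:
    """Return the distinctive tokens that appear in both domain lists."""
    tokens_a: Set[str] = set().union(*(domain_tokens(d) for d in domains_a))
    tokens_b: Set[str] = set().union(*(domain_tokens(d) for d in domains_b))
    return tokens_a & tokens_b


# Hypothetical domains on reserved example names
print(shared_naming_tokens(["secure-portal.example.com"],
                           ["secure-login.example.net"]))   # {'secure'}
print(shared_naming_tokens(["update-portal.example.org"],
                           ["update-files.example.org"]))   # set()
```

Returning the shared tokens instead of a boolean lets an analyst see why two campaigns were judged similar.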

## Part 8: Comprehensive Attribution Engine

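Combine the TTP, infrastructure, code, temporal, and targeting signals into one weighted score, then reduce that score when false flag indicators are present to produce a confidence-calibrated attribution.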
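
As a worked example of the arithmetic, the sketch below applies the same weights and level thresholds used by the engine in this part to a set of made-up signal scores. The helper names (`weighted_total`, `confidence_level`) and the input values are illustrative assumptions, not output from the lab's dataset.

```python
from typing import Dict

# Weights and thresholds copied from the AttributionEngine in this part
WEIGHTS: Dict[str, float] = {
    "ttp_similarity": 0.30,
    "infrastructure_similarity": 0.25,
    "code_similarity": 0.20,
    "temporal_similarity": 0.15,
    "target_similarity": 0.10,
}


def weighted_total(scores: Dict[str, float]) -> float:
    """Weighted sum of signal scores; a missing signal counts as 0."""
    return sum(scores.get(name, 0.0) * weight for name, weight in WEIGHTS.items())


def confidence_level(score: float) -> str:
    """Map a score in [0, 1] to the engine's confidence labels."""
    if score >= 0.75:
        return "HIGH"
    if score >= 0.50:
        return "MEDIUM"
    if score >= 0.30:
        return "LOW"
    return "INSUFFICIENT"


# Hypothetical per-signal scores for one campaign/actor pair
example_scores = {
    "ttp_similarity": 0.8,
    "infrastructure_similarity": 0.6,
    "code_similarity": 0.5,
    "temporal_similarity": 0.9,
    "target_similarity": 1.0,
}
base = weighted_total(example_scores)

# Hypothetical false flag reduction: one 0.2 indicator plus one 0.15 indicator
ff_reduction = 0.35
adjusted = base * (1 - ff_reduction)

print(f"base={base:.3f} ({confidence_level(base)})")
print(f"adjusted={adjusted:.3f} ({confidence_level(adjusted)})")
```

With these inputs the base score is 0.725 (MEDIUM), and the 35% reduction brings it to about 0.471 (LOW), which shows how deception indicators can change the reported level and not only the number.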

In [None]:
class AttributionEngine:
    """Comprehensive attribution with confidence calibration."""
    
    def __init__(self):
        self.ttp_analyzer = AdvancedTTPAnalyzer()
        self.infra_analyzer = InfrastructureGraphAnalyzer()
        self.temporal_analyzer = TemporalAnalyzer()
        self.code_analyzer = CodeSimilarityAnalyzer()
        self.ff_detector = FalseFlagDetector()
    
    def attribute_campaign(self, campaign: Campaign, 
                          known_actors: Dict[str, ThreatActor],
                          related_campaigns: List[Campaign]) -> Dict:
        """Generate comprehensive attribution assessment."""
        
        assessment = {
            "campaign_id": campaign.id,
            "campaign_name": campaign.name,
            "analysis_timestamp": datetime.now().isoformat(),
            "matches": [],
            "campaign_links": [],
            "confidence": {},
            "false_flag_check": {},
            "recommendation": ""
        }
        
        # 1. Match against known actors
        for actor_name, actor in known_actors.items():
            match_score = self._calculate_actor_match(campaign, actor)
            if match_score['total_score'] > 0.3:
                assessment['matches'].append({
                    "actor": actor_name,
                    "aliases": actor.aliases,
                    "scores": match_score,
                    "total_score": match_score['total_score']
                })
        
        # Sort by score
        assessment['matches'].sort(key=lambda x: x['total_score'], reverse=True)
        
        # 2. Find related campaigns
        for other in related_campaigns:
            if other.id != campaign.id:
                link_score = self._calculate_campaign_link(campaign, other)
                if link_score['total_score'] > 0.4:
                    assessment['campaign_links'].append({
                        "campaign_id": other.id,
                        "campaign_name": other.name,
                        "link_scores": link_score
                    })
        
        # 3. False flag check
        if assessment['campaign_links']:
            linked_ids = {link['campaign_id'] for link in assessment['campaign_links']}
            linked_campaigns = [c for c in related_campaigns if c.id in linked_ids]
            temporal_fp = self.temporal_analyzer.extract_temporal_fingerprint(campaign)
            code_sim = self.code_analyzer.calculate_code_similarity(
                campaign, linked_campaigns[0]
            ) if linked_campaigns else {}
            
            assessment['false_flag_check'] = self.ff_detector.analyze_for_false_flags(
                campaign, linked_campaigns, temporal_fp, code_sim
            )
        
        # 4. Calculate overall confidence
        assessment['confidence'] = self._calculate_confidence(assessment)
        
        # 5. Generate recommendation
        assessment['recommendation'] = self._generate_recommendation(assessment)
        
        return assessment
    
    def _calculate_actor_match(self, campaign: Campaign, actor: ThreatActor) -> Dict:
        """Calculate match score against known actor."""
        scores = {}
        
        # TTP similarity
        scores['ttp_similarity'] = self.ttp_analyzer.weighted_jaccard(
            campaign.ttps, actor.signature_ttps
        )
        
        # Target overlap
        sector_overlap = len(set(campaign.targets['sectors']) & set(actor.target_sectors))
        region_overlap = len(set(campaign.targets['regions']) & set(actor.target_regions))
        scores['target_similarity'] = (
            sector_overlap / max(len(campaign.targets['sectors']), 1) * 0.6 +
            region_overlap / max(len(campaign.targets['regions']), 1) * 0.4
        )
        
        # Infrastructure patterns
        infra_score = 0
        if campaign.infrastructure.get('registrar') in actor.infrastructure_patterns.get('registrars', []):
            infra_score += 0.3
        if campaign.infrastructure.get('hosting') in actor.infrastructure_patterns.get('hosting', []):
            infra_score += 0.3
        # Domain theme matching
        for domain in campaign.infrastructure.get('domains', []):
            for theme in actor.infrastructure_patterns.get('domain_themes', []):
                if theme in domain:
                    infra_score += 0.2
                    break
        scores['infrastructure_similarity'] = min(infra_score, 1.0)
        
        # Temporal pattern match
        temporal_fp = self.temporal_analyzer.extract_temporal_fingerprint(campaign)
        if temporal_fp:
            expected_tz = actor.operational_tempo.get('timezone_offset', 0)
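            # Rough estimate: assume activity is centred on local noon
            actual_tz = 12 - temporal_fp.get('average_hour_utc', 12)
            # UTC offsets wrap around the date line, so measure the gap on a 24h circle
            tz_diff = abs(expected_tz - actual_tz) % 24
            tz_diff = min(tz_diff, 24 - tz_diff)
            scores['temporal_similarity'] = 1.0 - min(tz_diff / 12, 1.0)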
        
        # Code signature match
        code_score = 0
        if campaign.code_artifacts.get('compiler') in actor.code_signatures.get('compilers', []):
            code_score += 0.5
        for lang in campaign.language_indicators:
            for actor_lang in actor.code_signatures.get('language_artifacts', []):
                if actor_lang.lower() in lang.lower():
                    code_score += 0.5
                    break
        scores['code_similarity'] = min(code_score, 1.0)
        
        # Weighted total
        weights = {
            'ttp_similarity': 0.30,
            'infrastructure_similarity': 0.25,
            'code_similarity': 0.20,
            'temporal_similarity': 0.15,
            'target_similarity': 0.10
        }
        
        scores['total_score'] = sum(
            scores.get(k, 0) * v for k, v in weights.items()
        )
        
        return scores
    
    def _calculate_campaign_link(self, c1: Campaign, c2: Campaign) -> Dict:
        """Calculate link strength between campaigns."""
        # Compute each similarity once and reuse it for the total
        ttp_sim = self.ttp_analyzer.weighted_jaccard(c1.ttps, c2.ttps)
        code_sim = self.code_analyzer.calculate_code_similarity(c1, c2).get('weighted_total', 0)
        return {
            'ttp_similarity': ttp_sim,
            'code_similarity': code_sim,
            # TTP and code evidence are weighted equally
            'total_score': ttp_sim * 0.5 + code_sim * 0.5
        }
    
    def _calculate_confidence(self, assessment: Dict) -> Dict:
        """Calculate calibrated confidence levels."""
        base_confidence = 0.0
        
        if assessment['matches']:
            best_match = assessment['matches'][0]
            base_confidence = best_match['total_score']
        
        # Apply false flag reduction
        ff_reduction = assessment.get('false_flag_check', {}).get('confidence_reduction', 0)
        adjusted_confidence = base_confidence * (1 - ff_reduction)
        
        # Determine level
        if adjusted_confidence >= 0.75:
            level = "HIGH"
        elif adjusted_confidence >= 0.50:
            level = "MEDIUM"
        elif adjusted_confidence >= 0.30:
            level = "LOW"
        else:
            level = "INSUFFICIENT"
        
        return {
            "level": level,
            "score": adjusted_confidence,
            "base_score": base_confidence,
            "adjustments": {
                "false_flag_reduction": ff_reduction
            }
        }
    
    def _generate_recommendation(self, assessment: Dict) -> str:
        """Generate analyst recommendation."""
        conf = assessment['confidence']
        
        if conf['level'] == "HIGH":
            actor = assessment['matches'][0]['actor'] if assessment['matches'] else "Unknown"
            return f"Strong attribution to {actor}. Recommend sharing with partners."
        elif conf['level'] == "MEDIUM":
            return "Moderate confidence. Additional evidence collection recommended."
        elif conf['level'] == "LOW":
            ff = assessment.get('false_flag_check', {})
            if ff.get('indicator_count', 0) > 0:
                return "Low confidence with deception indicators. Treat attribution with caution."
            return "Low confidence. Insufficient evidence for attribution."
        else:
            return "Attribution not possible. Collect additional IOCs and artifacts."


# Run attribution engine
engine = AttributionEngine()

print("=" * 70)
print("ATTRIBUTION ASSESSMENT REPORTS")
print("=" * 70)

for campaign in CAMPAIGNS:
    result = engine.attribute_campaign(campaign, KNOWN_ACTORS, CAMPAIGNS)
    
    print(f"\n{'='*70}")
    print(f"Campaign: {result['campaign_name']} ({result['campaign_id']})")
    print(f"{'='*70}")
    
    print(f"\nConfidence: {result['confidence']['level']} ({result['confidence']['score']:.2f})")
    
    if result['matches']:
        print("\nTop Actor Match:")
        top = result['matches'][0]
        print(f"  {top['actor']} (Score: {top['total_score']:.2f})")
        print(f"    TTP: {top['scores']['ttp_similarity']:.2f}")
        print(f"    Infrastructure: {top['scores']['infrastructure_similarity']:.2f}")
        print(f"    Code: {top['scores']['code_similarity']:.2f}")
    
    if result['false_flag_check'].get('indicator_count', 0) > 0:
        print("\n⚠️  FALSE FLAG INDICATORS DETECTED:")
        print(f"  {result['false_flag_check']['assessment']}")
    
    print(f"\nRecommendation: {result['recommendation']}")
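
The engine's weighted total scores a missing signal as 0, so a campaign with no temporal fingerprint is capped at 0.85 even when every other signal matches perfectly. One alternative, sketched here as an assumption and not as the lab's method, is to renormalize the weights over the signals that are actually present. `weighted_total_available` is a hypothetical helper, and the example scores are made up.

```python
from typing import Dict

# Same weights as the engine's _calculate_actor_match
WEIGHTS: Dict[str, float] = {
    "ttp_similarity": 0.30,
    "infrastructure_similarity": 0.25,
    "code_similarity": 0.20,
    "temporal_similarity": 0.15,
    "target_similarity": 0.10,
}


def weighted_total_available(scores: Dict[str, float],
                             weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average over the signals present in `scores` only."""
    present = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(present.values())
    if total_weight == 0:
        return 0.0  # no usable signals at all
    return sum(scores[name] * w for name, w in present.items()) / total_weight


# Hypothetical scores with no temporal fingerprint available
partial_scores = {
    "ttp_similarity": 0.8,
    "infrastructure_similarity": 0.6,
    "code_similarity": 0.5,
    "target_similarity": 1.0,
}
zero_filled = sum(partial_scores.get(k, 0.0) * w for k, w in WEIGHTS.items())
renormalized = weighted_total_available(partial_scores)
print(f"zero-filled={zero_filled:.3f} renormalized={renormalized:.3f}")
```

Renormalizing avoids penalizing missing data, but it also lets a score built on fewer signals look as strong as one built on all five, so it is worth reporting how many signals contributed alongside the score.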

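## Part 9: Diamond Model Report Generation

Use an LLM to turn the campaign data and the attribution assessment into a structured Diamond Model report.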

In [None]:
from anthropic import Anthropic

def generate_diamond_model_report(campaign: Campaign, 
                                  attribution: Dict) -> str:
    """Generate Diamond Model analysis with LLM."""
    
    # Anthropic() reads the API key from the ANTHROPIC_API_KEY environment variable
    client = Anthropic()
    
    prompt = f"""
    Generate a Diamond Model threat intelligence report for this campaign.

    CAMPAIGN DATA:
    - Name: {campaign.name}
    - TTPs: {campaign.ttps}
    - Malware: {campaign.malware_families}
    - Infrastructure: {json.dumps(campaign.infrastructure)}
    - Targets: {json.dumps(campaign.targets)}
    - Language Indicators: {campaign.language_indicators}

    ATTRIBUTION ANALYSIS:
    - Confidence: {attribution['confidence']['level']}
    - Top Match: {attribution['matches'][0] if attribution['matches'] else 'None'}
    - False Flag Check: {attribution.get('false_flag_check', {})}

    Create a structured Diamond Model report with:

    1. ADVERSARY
       - Assessed identity and confidence
       - Motivation analysis
       - Attribution evidence

    2. CAPABILITY
       - Technical sophistication assessment
       - Malware analysis
       - TTP analysis with MITRE mapping

    3. INFRASTRUCTURE
       - C2 infrastructure analysis
       - Hosting and registration patterns
       - Pivot opportunities

    4. VICTIM
       - Target profile
       - Selection criteria assessment
       - Predicted future targets

    5. ANALYTIC CONFIDENCE
       - Confidence assessment
       - Alternative hypotheses
       - Intelligence gaps

    Format as a professional threat intelligence report.
    """
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=3000,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text

# Uncomment to generate report (requires API key)
# sample_attribution = engine.attribute_campaign(CAMPAIGNS[0], KNOWN_ACTORS, CAMPAIGNS)
# report = generate_diamond_model_report(CAMPAIGNS[0], sample_attribution)
# print(report)
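
Before calling the LLM, it can help to arrange the same inputs under the four Diamond Model vertices named in the prompt (adversary, capability, infrastructure, victim), so the structure can be reviewed or tested without an API key. The sketch below is one possible layout; `build_diamond_vertices` and all example values are hypothetical, and the field names are assumptions modelled on the lab's `Campaign` fields and engine output.

```python
import json
from typing import Any, Dict, List


def build_diamond_vertices(
    ttps: List[str],
    malware_families: List[str],
    infrastructure: Dict[str, Any],
    targets: Dict[str, Any],
    attribution: Dict[str, Any],
) -> Dict[str, Any]:
    """Group campaign fields under the four Diamond Model vertices."""
    matches = attribution.get("matches", [])
    return {
        "adversary": {
            "assessed_actor": matches[0]["actor"] if matches else "Unknown",
            "confidence": attribution.get("confidence", {}).get("level", "INSUFFICIENT"),
        },
        "capability": {"ttps": ttps, "malware": malware_families},
        "infrastructure": infrastructure,
        "victim": targets,
    }


# Hypothetical inputs shaped like the lab's Campaign fields and engine output
vertices = build_diamond_vertices(
    ttps=["T1566.001", "T1059.001"],
    malware_families=["ExampleLoader"],
    infrastructure={"domains": ["secure-portal.example.com"], "hosting": "ExampleHost"},
    targets={"sectors": ["energy"], "regions": ["EU"]},
    attribution={"matches": [{"actor": "ExampleActor"}],
                 "confidence": {"level": "MEDIUM"}},
)
print(json.dumps(vertices, indent=2))
```

A dictionary like this could be passed to the prompt with `json.dumps` in place of the individual fields, which keeps the LLM input and any non-LLM report in the same shape.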

## Key Takeaways

1. **Weighted TTP Analysis**: Not all techniques are equally informative; rare TTPs are more distinctive than common ones
2. **Multi-dimensional Attribution**: Combine TTP, infrastructure, temporal, code, and targeting signals instead of relying on any single one
3. **Graph Analysis**: Infrastructure overlaps reveal campaign connections
4. **Temporal Fingerprinting**: Operational hours suggest an actor's timezone
5. **False Flag Detection**: Look for inconsistencies across signals (timing, language, toolchain, hosting) that indicate deception
6. **Confidence Calibration**: Lower confidence when deception indicators appear, and report a level (HIGH, MEDIUM, LOW, INSUFFICIENT) instead of a bare score
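
To make the first takeaway concrete, here is a minimal sketch of a rarity-weighted Jaccard similarity. It uses an IDF-style weight as an assumption; the lab's `AdvancedTTPAnalyzer.weighted_jaccard` may weight techniques differently. `rarity_weights`, the `default` parameter, and the small corpus are hypothetical, while the technique IDs are real MITRE ATT&CK identifiers.

```python
import math
from collections import Counter
from typing import Dict, List


def rarity_weights(campaign_ttps: List[List[str]]) -> Dict[str, float]:
    """IDF-style weights: techniques seen in fewer campaigns weigh more."""
    n = len(campaign_ttps)
    doc_freq = Counter(t for ttps in campaign_ttps for t in set(ttps))
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def weighted_jaccard(a: List[str], b: List[str],
                     weights: Dict[str, float], default: float = 1.0) -> float:
    """Sum of weights in the intersection divided by sum of weights in the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    shared = sum(weights.get(t, default) for t in set_a & set_b)
    total = sum(weights.get(t, default) for t in union)
    return shared / total


# Hypothetical corpus: T1566.001 appears everywhere, T1055 and T1027 are rare
corpus = [
    ["T1566.001", "T1059.001"],
    ["T1566.001", "T1055"],
    ["T1566.001", "T1059.001", "T1027"],
]
weights = rarity_weights(corpus)

# Both pairs share exactly one of three techniques (plain Jaccard = 1/3)
common_shared = weighted_jaccard(["T1566.001", "T1055"], ["T1566.001", "T1027"], weights)
rare_shared = weighted_jaccard(["T1055", "T1566.001"], ["T1055", "T1027"], weights)
print(f"shares common technique: {common_shared:.3f}")
print(f"shares rare technique:   {rare_shared:.3f}")
```

Both pairs overlap on one technique out of three, yet the pair sharing the rare technique scores higher, which is the behaviour the takeaway describes.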

---

## Next Steps

- **Lab 17**: Adversarial ML - Attack and defend security models
- **Lab 20**: LLM Red Teaming - Attack AI systems
- **CTF Challenges**: Test attribution skills on realistic scenarios