# Named Entity Recognition for Security Intelligence

A security-focused NER pipeline using Fenic's semantic extraction capabilities to identify and analyze threats, vulnerabilities, and indicators of compromise from unstructured security reports.

This pipeline demonstrates automated security entity extraction and risk assessment:

- Zero-shot entity extraction (CVEs, IPs, domains, hashes)
- Enhanced extraction with threat intelligence context
- Document chunking for comprehensive analysis
- Risk prioritization and actionable intelligence

## Set up the session and manage imports

In [None]:
import fenic as fc
import re
from pydantic import BaseModel, Field

config = fc.SessionConfig(
    app_name="security_vulnerability_ner",
    semantic=fc.SemanticConfig(
        language_models={
            "mini" : fc.OpenAIModelConfig(model_name="gpt-4o-mini", rpm=500, tpm=200_000)
        }
    )
)

session = fc.Session.get_or_create(config)

## Sample Vulnerability Reports Dataset

This cell defines a sample dataset of security vulnerability reports. Each report has a report ID, a source, a title, and free-text content written in the style of real-world incident reports, vulnerability advisories, and threat intelligence. The dataset is used to demonstrate entity extraction and analysis in the NER pipeline.

In [None]:
# Sample vulnerability reports data
vulnerability_reports_data = [
        {
            "report_id": "CVE-2024-001",
            "source": "CVE Database",
            "title": "Critical OpenSSL Buffer Overflow",
            "content": "CVE-2024-3094: Buffer overflow in OpenSSL 3.0.0-3.0.12. Affects Ubuntu 22.04, RHEL 8. CVSS 9.8. IOCs: evil-domain.com, 10.0.0.50:443"
        },
        {
            "report_id": "THREAT-2024-002",
            "source": "Threat Intelligence",
            "title": "APT29 Campaign Targeting Financial Sector",
            "content": "APT29 targeting banks. Exploits CVE-2024-1234, CVE-2024-5678. Malware: SUNBURST 2.0. C2: c2-server.badguys.net (185.159.158.1)"
        },
        {
            "report_id": "SEC-ADV-2024-003",
            "source": "Security Advisory",
            "title": "Zero-Day in Popular WordPress Plugin",
            "content": "CVE-2024-9999: SQL injection in WP Super Cache 1.0.0-1.7.8. CVSS 8.5. Patch in 1.7.9. Related: CVE-2024-9998, CVE-2024-9997"
        },
        {
            "report_id": "INC-2024-004",
            "source": "Incident Report",
            "title": "Ransomware Attack on Healthcare Provider",
            "content": "LockBit 3.0 ransomware via CVE-2024-4444. Used Mimikatz 2.2.0. C2: 45.142.214.99:8443. Affected Windows Server 2016, 2019"
        },
        {
            "report_id": "VULN-2024-005",
            "source": "Bug Bounty Report",
            "title": "Authentication Bypass in Enterprise SaaS Platform",
            "content": "Auth bypass in AuthProvider 2.5.1 at login.platform.com. JWT alg:none vulnerability. Fixed in 2.5.2. Affects /api/v2/admin/*"
        }
]

In [None]:
# Create DataFrame
reports_df = session.create_dataframe(vulnerability_reports_data)

print("🔒 Security Vulnerability NER Pipeline")
print("=" * 70)
print(f"Processing {reports_df.count()} vulnerability reports\n")

## Stage 1: Basic Zero-Shot Entity Extraction

This cell defines a basic Named Entity Recognition (NER) schema for extracting key security-related entities (CVE IDs, software packages, IP addresses, domains, and file hashes) from vulnerability report content using Fenic's zero-shot extraction. It applies the schema to the dataset, extracts entities, and displays sample results for review.
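
Several of these entity types have a rigid surface form, so a regex baseline can serve as a rough cross-check on the model's output. The sketch below is an assumption for illustration, not part of Fenic: the pattern names and `regex_baseline` are hypothetical, the IPv4 pattern does not range-check octets, and the domain pattern will also match file names such as `report.pdf`.

```python
import re

# Hypothetical baseline patterns (illustrative, deliberately simple)
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# MD5 (32), SHA1 (40), and SHA256 (64) hex digests
HASH_RE = re.compile(r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def regex_baseline(text: str) -> dict[str, list[str]]:
    """Return de-duplicated, sorted pattern matches keyed like the NER schema."""
    return {
        "cve_ids": sorted(set(CVE_RE.findall(text))),
        "ip_addresses": sorted(set(IPV4_RE.findall(text))),
        "domains": sorted(set(DOMAIN_RE.findall(text))),
        "file_hashes": sorted(set(HASH_RE.findall(text))),
    }


# Content of the first sample report
sample = (
    "CVE-2024-3094: Buffer overflow in OpenSSL 3.0.0-3.0.12. "
    "Affects Ubuntu 22.04, RHEL 8. CVSS 9.8. "
    "IOCs: evil-domain.com, 10.0.0.50:443"
)
baseline = regex_baseline(sample)
print(baseline)
```

A baseline like this cannot find software names or threat actors, which is where the semantic extraction earns its keep. It is mainly useful for spotting CVE IDs or indicators that the model missed or altered.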

In [None]:
# Stage 1: Basic NER with zero-shot extraction
print("🔍 Stage 1: Zero-shot entity extraction...")

# Define basic NER schema for security entities
basic_ner_schema = fc.ExtractSchema([
    fc.ExtractSchemaField(
        name="cve_ids",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="CVE identifiers in format CVE-YYYY-NNNN (sequence number of four or more digits)"
    ),
    fc.ExtractSchemaField(
        name="software_packages",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="Software names and versions mentioned"
    ),
    fc.ExtractSchemaField(
        name="ip_addresses",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="IP addresses (IPv4 or IPv6)"
    ),
    fc.ExtractSchemaField(
        name="domains",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="Domain names and URLs"
    ),
    fc.ExtractSchemaField(
        name="file_hashes",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="File hashes (MD5, SHA1, SHA256)"
    )
])

# Apply basic extraction
basic_extraction_df = reports_df.select(
    "report_id",
    "source",
    "title",
    fc.semantic.extract("content", basic_ner_schema).alias("basic_entities")
)

# Display sample results
print("Sample basic extraction results:")
basic_readable = basic_extraction_df.select(
    "report_id",
    basic_extraction_df.basic_entities.cve_ids.alias("cve_ids"),
    basic_extraction_df.basic_entities.software_packages.alias("software_packages")
)
basic_readable.show(2)

## Stage 2: Enhanced Domain-Specific Entity Extraction

This cell defines an advanced NER schema that expands on the basic extraction by including additional security-specific entities such as attack vectors, threat actors, CVSS scores, MITRE techniques, and affected systems. It preprocesses the report content for consistency, applies the enhanced schema to extract richer security intelligence, and displays sample results for key entities.
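
The preprocessing step can be checked without a session. The sketch below copies the three substitutions from the cell's `preprocess_udf` into a plain function (named `preprocess` here for illustration) and runs it on a made-up input string.

```python
import re


def preprocess(content: str) -> str:
    """Plain-Python copy of the normalization applied before extraction."""
    # Standardize CVE format: "CVE - 2024 - 1234" becomes "CVE-2024-1234"
    content = re.sub(r"CVE\s*-\s*(\d{4})\s*-\s*(\d+)", r"CVE-\1-\2", content)
    # Normalize version ranges: "1.0.0 through 1.7.8" becomes "1.0.0 to 1.7.8"
    content = re.sub(
        r"(\d+\.\d+\.\d+)\s+through\s+(\d+\.\d+\.\d+)", r"\1 to \2", content
    )
    # Collapse runs of whitespace, including newlines, into single spaces
    return " ".join(content.split())


# Hypothetical messy input, not taken from the sample dataset
raw = "CVE - 2024 - 1234 affects  versions 1.0.0 through 1.7.8\n of the plugin"
cleaned = preprocess(raw)
print(cleaned)
```

Normalizing before extraction means the model sees one canonical spelling of each identifier, which should make the extracted values easier to compare and count later.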

In [None]:
# Stage 2: Enhanced extraction with domain-specific schema
print("\n🧠 Stage 2: Enhanced domain-specific extraction...")

# Define enhanced schema with security-specific entities
enhanced_ner_schema = fc.ExtractSchema([
    # Include all basic entities
    fc.ExtractSchemaField(
        name="cve_ids",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="CVE identifiers in format CVE-YYYY-NNNN (sequence number of four or more digits)"
    ),
    fc.ExtractSchemaField(
        name="software_packages",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="Software names with specific version numbers"
    ),
    fc.ExtractSchemaField(
        name="ip_addresses",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="IP addresses (IPv4 or IPv6)"
    ),
    fc.ExtractSchemaField(
        name="domains",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="Domain names, subdomains, and URLs"
    ),
    fc.ExtractSchemaField(
        name="file_hashes",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="File hashes with hash type prefix (MD5:, SHA1:, SHA256:)"
    ),
    # Additional security-specific entities
    fc.ExtractSchemaField(
        name="attack_vectors",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="Attack methods like buffer overflow, SQL injection, phishing"
    ),
    fc.ExtractSchemaField(
        name="threat_actors",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="Threat actor names, APT groups, ransomware families"
    ),
    fc.ExtractSchemaField(
        name="cvss_scores",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="CVSS scores and severity ratings"
    ),
    fc.ExtractSchemaField(
        name="mitre_techniques",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="MITRE ATT&CK technique IDs (TXXXX format)"
    ),
    fc.ExtractSchemaField(
        name="affected_systems",
        data_type=fc.ExtractSchemaList(element_type=fc.StringType),
        description="Operating systems, platforms, or infrastructure affected"
    )
])

# Preprocess content for better extraction
@fc.udf(return_type=fc.StringType)
def preprocess_udf(content):
    # Standardize CVE format
    content = re.sub(r'CVE\s*-\s*(\d{4})\s*-\s*(\d+)', r'CVE-\1-\2', content)
    # Normalize version ranges
    content = re.sub(r'(\d+\.\d+\.\d+)\s+through\s+(\d+\.\d+\.\d+)', r'\1 to \2', content)
    # Clean up extra whitespace
    content = ' '.join(content.split())
    return content

# Apply preprocessing and enhanced extraction
enhanced_df = reports_df.select(
    "report_id",
    "source",
    "title",
    "content",
    preprocess_udf("content").alias("processed_content")
).select(
    "report_id",
    "source",
    "title",
    "content",
    fc.semantic.extract("processed_content", enhanced_ner_schema).alias("entities")
)

print("Enhanced extraction with security-specific entities:")
enhanced_readable = enhanced_df.select(
    "report_id",
    enhanced_df.entities.threat_actors.alias("threat_actors"),
    enhanced_df.entities.attack_vectors.alias("attack_vectors"),
    enhanced_df.entities.cvss_scores.alias("cvss_scores")
)
enhanced_readable.show(2)

## Stage 3: Chunking and Processing Long Documents

This cell demonstrates how to handle long vulnerability reports by splitting (chunking) their content into smaller, overlapping segments for more effective entity extraction. It identifies which reports need chunking, applies recursive word chunking to them, extracts entities from each chunk using the enhanced schema, collects the chunk-level results per report, and displays how many chunk results each report produced.
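
To make the overlap and aggregation ideas concrete, here is a simplified sketch in plain Python. It uses fixed-size word windows, which is an assumption for illustration and not how `fc.text.recursive_word_chunk` is implemented. Both `chunk_words` and `merge_entities` are hypothetical helper names.

```python
def chunk_words(text: str, chunk_size: int, overlap_pct: int) -> list[str]:
    """Split text into word windows; neighbours share overlap_pct% of chunk_size words."""
    words = text.split()
    overlap = chunk_size * overlap_pct // 100
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def merge_entities(per_chunk: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Union entity lists across chunks, keeping first-seen order and dropping repeats."""
    merged: dict[str, list[str]] = {}
    for entities in per_chunk:
        for field, values in entities.items():
            bucket = merged.setdefault(field, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return merged


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap_pct=25)
print(chunks)

# Hypothetical chunk-level results: the overlap makes one CVE appear twice
merged = merge_entities([
    {"cve_ids": ["CVE-2024-1234"], "domains": []},
    {"cve_ids": ["CVE-2024-1234", "CVE-2024-5678"], "domains": ["c2-server.badguys.net"]},
])
print(merged)
```

Overlap keeps an entity that straddles a chunk boundary intact in at least one chunk, at the cost of extracting some entities twice. That is why a de-duplicating merge is a natural follow-up to collecting the per-chunk results.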

In [None]:
# Stage 3: Process long documents with chunking
print("\n📄 Stage 3: Chunking and processing long documents...")

# Add content length for chunking decisions
reports_with_length = enhanced_df.select(
    "*",
    fc.text.length(fc.col("content")).alias("content_length")
)

# Identify documents needing chunking (>80 characters for demo)
long_reports = reports_with_length.filter(fc.col("content_length") > 80)
short_reports = reports_with_length.filter(fc.col("content_length") <= 80)

print(f"Documents requiring chunking: {long_reports.count()}")
print(f"Documents processed whole: {short_reports.count()}")

# Apply chunking to long documents
chunked_df = long_reports.select(
    "report_id",
    "content",
    fc.text.recursive_word_chunk(
        fc.col("content"),
        chunk_size=50,
        chunk_overlap_percentage=15
    ).alias("chunks")
).explode("chunks").select(
    "report_id",
    fc.col("chunks").alias("chunk")
)

# Extract entities from each chunk
chunk_entities_df = chunked_df.select(
    "report_id",
    "chunk",
    fc.semantic.extract("chunk", enhanced_ner_schema).alias("chunk_entities")
)

# Aggregate entities across chunks
aggregated_entities = chunk_entities_df.group_by("report_id").agg(
    fc.collect_list(fc.col("chunk_entities")).alias("all_chunk_entities")
)

print("\nChunked extraction completed for long documents")
print(f"Total chunks processed: {chunk_entities_df.count()}")

# Show sample of aggregated chunk results
print("\nSample aggregated entities from chunks:")
aggregated_sample = aggregated_entities.select(
    "report_id",
    fc.array_size(fc.col("all_chunk_entities")).alias("chunks_with_entities")
)
aggregated_sample.show(2)

## Stage 4: Validation and Quality Assurance

This cell builds a unified view of all reports and their extracted entities, then displays the CVE IDs identified in each report so that extraction quality can be checked by inspection.
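
Visual inspection can be complemented by simple format checks on the extracted values. The sketch below is an assumption about what such checks might look like. The helper names are hypothetical, the CVE pattern follows the `CVE-YYYY-NNNN` syntax with four or more sequence digits, and IP validation relies on the standard library's `ipaddress` module.

```python
import ipaddress
import re

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def is_valid_cve(value: str) -> bool:
    """True if the whole string is a syntactically valid CVE ID."""
    return CVE_PATTERN.fullmatch(value.strip()) is not None


def is_valid_ip(value: str) -> bool:
    """True if the string parses as a bare IPv4 or IPv6 address (no port)."""
    try:
        ipaddress.ip_address(value.strip())
        return True
    except ValueError:
        return False


def split_valid(values, check):
    """Partition values into (accepted, rejected) according to check."""
    valid = [v for v in values if check(v)]
    rejected = [v for v in values if not check(v)]
    return valid, rejected


# "CVE-2024-001" is a report_id in the sample data, not a well-formed CVE ID
valid_cves, rejected_cves = split_valid(["CVE-2024-9999", "CVE-2024-001"], is_valid_cve)
print(valid_cves, rejected_cves)
print(is_valid_ip("185.159.158.1"), is_valid_ip("10.0.0.50:443"))
```

Rejected values are worth reviewing rather than discarding outright. An address with a port attached, for example, still carries a usable indicator once the port is split off.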

In [None]:
# Stage 4: Validation and quality assurance
print("\n✅ Stage 4: Validating extracted entities...")

# Create a unified view for validation
all_entities_df = enhanced_df.select(
    "report_id",
    "source",
    "title",
    "entities"
)

# Show extracted CVEs
print("Extracted CVE IDs:")
cve_summary = all_entities_df.select(
    "report_id",
    all_entities_df.entities.cve_ids.alias("extracted_cves")
)
cve_summary.show(3)

## Stage 5: Analytics and Aggregation

This cell performs analytics on the extracted entities by flattening and aggregating CVE IDs, software packages, and threat actors across all reports. It identifies the most frequently mentioned CVEs, most affected software, and most active threat actors. Finally, it generates a summary of key statistics, including total and unique CVEs, total and unique threat actors, and the number of reports processed, providing actionable security intelligence insights.
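
The flatten-then-count pattern is easy to see on plain lists. The sketch below uses `collections.Counter` and hypothetical per-report CVE lists (not actual extraction output) to show how the total and unique figures in the summary differ.

```python
from collections import Counter

# Hypothetical extraction results: one list of CVE IDs per report
per_report_cves = [
    ["CVE-2024-1234", "CVE-2024-5678"],
    ["CVE-2024-1234"],
    [],  # a report with no CVE mentions contributes nothing
]

# Flatten: one entry per mention, the list analogue of explode()
flattened = [cve for cves in per_report_cves for cve in cves]

# Count mentions per CVE, the analogue of group_by().agg(count)
counts = Counter(flattened)

total_mentions = len(flattened)
unique_cves = len(counts)
top = counts.most_common(1)
print(f"Total CVEs extracted: {total_mentions} ({unique_cves} unique)")
print(top)
```

The total counts every mention, so a CVE cited in two reports adds two to the total but only one to the unique count.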

In [None]:
# Stage 5: Analytics and aggregation
print("\n📊 Stage 5: Entity analytics and insights...")

# Flatten entities for analysis
flattened_cves = all_entities_df.select(
    all_entities_df.entities.cve_ids.alias("cve_id")
).explode("cve_id").filter(fc.col("cve_id").is_not_null())

flattened_software = all_entities_df.select(
    all_entities_df.entities.software_packages.alias("software")
).explode("software").filter(fc.col("software").is_not_null())

flattened_threats = all_entities_df.select(
    all_entities_df.entities.threat_actors.alias("threat_actor")
).explode("threat_actor").filter(fc.col("threat_actor").is_not_null())

# Most common CVEs
print("\nTop CVEs mentioned:")
cve_counts = flattened_cves.group_by("cve_id").agg(
    fc.count("*").alias("mentions")
).order_by(fc.col("mentions").desc())
cve_counts.show(5)

# Most affected software
print("\nMost affected software:")
software_counts = flattened_software.group_by("software").agg(
    fc.count("*").alias("mentions")
).order_by(fc.col("mentions").desc())
software_counts.show(5)

# Active threat actors
print("\nActive threat actors:")
threat_counts = flattened_threats.group_by("threat_actor").agg(
    fc.count("*").alias("reports")
).order_by(fc.col("reports").desc())
threat_counts.show(5)

# Create final comprehensive report
print("\n📋 Final Security Intelligence Summary:")
print("=" * 70)

# Summary statistics
total_cves = flattened_cves.count()
unique_cves = flattened_cves.select("cve_id").drop_duplicates().count()
total_threats = flattened_threats.count()
unique_threats = flattened_threats.select("threat_actor").drop_duplicates().count()

print(f"Total CVEs extracted: {total_cves} ({unique_cves} unique)")
print(f"Total threat actors identified: {total_threats} ({unique_threats} unique)")
print(f"Reports processed: {reports_df.count()}")

## Actionable Intelligence: Automated Risk Assessment

This cell uses a Pydantic model to define the risk information to extract from each vulnerability report with semantic operations: the stated severity rating, the CVSS score, the mitigation steps, and the affected systems.

It then filters for reports rated critical or high and displays them as the vulnerabilities that require immediate attention. The session is stopped at the end to clean up resources.
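
The filter in this cell compares `severity_rating` against the lowercase strings `"critical"` and `"high"`, so a report that states only a CVSS score, or whose rating comes back capitalized, would be missed. The sketch below shows one possible post-processing step, assuming the CVSS v3.x qualitative bands. The helpers `cvss_to_severity` and `needs_immediate_action` are hypothetical names and not part of the pipeline.

```python
def cvss_to_severity(score_text: str) -> str:
    """Map a CVSS v3.x base score string to its qualitative rating ('' if unusable)."""
    try:
        score = float(score_text.strip())
    except ValueError:
        return ""
    if not 0.0 <= score <= 10.0:
        return ""
    if score == 0.0:
        return "none"
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    if score < 9.0:
        return "high"
    return "critical"


def needs_immediate_action(severity_rating: str, cvss_score: str) -> bool:
    """Prefer the stated rating (case-insensitive); fall back to the CVSS band."""
    rating = severity_rating.strip().lower() or cvss_to_severity(cvss_score)
    return rating in {"critical", "high"}


print(cvss_to_severity("9.8"), cvss_to_severity("8.5"))
print(needs_immediate_action("Critical", ""))
print(needs_immediate_action("", "8.5"))
```

An explicit rating takes precedence over the score here. Whether that is the right policy depends on how much the report authors' own ratings are trusted.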

In [None]:
# Generate actionable intelligence using semantic operations
print("\n🎯 Actionable Intelligence:")

# Define Pydantic model for risk assessment
class ExtractedRiskInfo(BaseModel):
    """Directly extracted risk information from the report text. If a value is not present in the report, use an empty string."""
    severity_rating: str = Field(..., description="Explicit severity rating or risk level as stated in the report (e.g., 'critical', 'high', 'medium', 'low')")
    cvss_score: str = Field(..., description="CVSS score as stated in the report")
    mitigation_steps: str = Field(..., description="Quoted mitigation or remediation steps as stated in the report")
    affected_systems: str = Field(..., description="Exact systems, platforms, or users mentioned as affected in the report")

# Assess risk for each report
risk_assessment_df = enhanced_df.select(
    "report_id",
    "title",
    fc.semantic.extract("content", ExtractedRiskInfo).alias("risk_assessment")
)

# Show high-risk items (exact, case-sensitive match: a rating such as
# "Critical" or "High" would not pass this filter)
high_risk_df = risk_assessment_df.select(
    "report_id",
    "title",
    risk_assessment_df.risk_assessment.severity_rating.alias("risk_level"),
    risk_assessment_df.risk_assessment.mitigation_steps.alias("immediate_action"),
    risk_assessment_df.risk_assessment.affected_systems.alias("affected_scope")
).filter(
    (fc.col("risk_level") == "critical") | (fc.col("risk_level") == "high")
)

print("\nHigh-Risk Vulnerabilities Requiring Immediate Action:")
high_risk_df.show()

# Clean up
session.stop()