# Machine Learning Models for Cyber Risk Measurement

This notebook parses Apple's 2023 10-K filing to extract and label cyber-risk sentences from Item 1A (Risk Factors) for a Cyber Risk Index (CRI). Using `sec-api`, we extract Item 1A (chosen over Item 1C, absent in 2023, for predictive signals). NLTK tokenizes sentences, keywords filter cyber-relevant ones, and DeBERTa-v3 zero-shot classification labels them with a taxonomy from Florackis et al. (2023), Richmond Fed (2022), and CMU SEI (2014). Outputs include a CSV of labeled sentences supporting CRI portfolio weighting.

In [1]:
import os
#!pip install sec-api transformers nltk pandas
import nltk
import pandas as pd
from sec_api import ExtractorApi
from transformers import pipeline
#nltk.download('punkt_tab')
#nltk.download('punkt', quiet=True)

In [None]:
# I used the SEC API (free at https://sec-api.io/) to extract Item 1A from a 10-K report. It has 100 free requests/month, which is enough for this demo.
SEC_API_KEY = "ENTER YOUR KEY HERE"
if not SEC_API_KEY:
    raise ValueError("Set SEC_API_KEY env variable (get at https://sec-api.io/)")
extractor = ExtractorApi(SEC_API_KEY)

# I used Apple's 2023 10-K HTML for demo.
# For later, a pipeline can be setup that can extract 10k filings for a list of companies using CIKs and year combinations.
filing_url = "https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm"


# Taxonomy for cyber risk classification
# Why: Hierarchical (4 classes, 9 sub-elements) for granular signals, based on
# financial and operational risk literature (see sources below)
flattened_labels = [
    "Actions of People: Inadvertent",
    "Actions of People: Deliberate",
    "Actions of People: Inaction",
    "Systems and Technology Failures: Hardware",
    "Systems and Technology Failures: Software",
    "Failed Internal Processes: Process Design/Execution",
    "Failed Internal Processes: Vendor/Third-Party",
    "External Events: Threats",
    "External Events: Impacts",
    "None"
]

taxonomy_details = {
    "Actions of People: Inadvertent": {
        'keywords': ['mistake', 'error', 'omission', 'unintentional', 'accidental'],
        'definition': 'Unintentional errors (e.g., employee oversights). Sources: Richmond Fed (2022), CMU SEI (2014).'
    },
    "Actions of People: Deliberate": {
        'keywords': ['fraud', 'sabotage', 'theft', 'vandalism', 'malicious', 'intentional', 'hack', 'attack', 'breach'],
        'definition': 'Intentional attacks (e.g., hacking). Sources: Florackis et al. (2023), Richmond Fed (2022).'
    },
    "Actions of People: Inaction": {
        'keywords': ['lack of', 'failure to', 'inaction', 'unavailable', 'ignorance'],
        'definition': 'Failure to act (e.g., unpatched systems). Source: CMU SEI (2014).'
    },
    "Systems and Technology Failures: Hardware": {
        'keywords': ['hardware', 'capacity', 'performance', 'maintenance', 'obsolescence'],
        'definition': 'Equipment failures (e.g., server crashes). Source: CMU SEI (2014).'
    },
    "Systems and Technology Failures: Software": {
        'keywords': ['software', 'compatibility', 'configuration', 'security settings', 'bug', 'vulnerability'],
        'definition': 'Software issues (e.g., bugs). Sources: CMU SEI (2014), Florackis et al. (2023).'
    },
    "Failed Internal Processes: Process Design/Execution": {
        'keywords': ['process', 'design', 'execution', 'management', 'incident response'],
        'definition': 'Flawed processes (e.g., weak incident response). Source: CMU SEI (2014).'
    },
    "Failed Internal Processes: Vendor/Third-Party": {
        'keywords': ['vendor', 'third-party', 'supply chain', 'outsourcing'],
        'definition': 'Vendor-related risks (e.g., supply chain breaches). Source: Florackis et al. (2023).'
    },
    "External Events: Threats": {
        'keywords': ['external', 'threat', 'attack', 'denial-of-service', 'ransomware', 'phishing', 'malware', 'DoS', 'DDoS'],
        'definition': 'External threats (e.g., ransomware). Sources: Richmond Fed (2022), Florackis et al. (2023).'
    },
    "External Events: Impacts": {
        'keywords': ['disruption', 'breach', 'loss', 'theft', 'funds', 'PII', 'non-PII', 'financial loss'],
        'definition': 'Consequences (e.g., data loss). Sources: Florackis et al. (2023), Richmond Fed (2022).'
    },
    "None": {'keywords': [], 'definition': 'Non-cyber sentences.'}
}

# Taxonomy sources
# 1. Florackis, C., Louca, C., Michaely, R., & Weber, M. (2023). Cybersecurity risk. The Review of Financial Studies, 36(1), 351-407.
#    - Defines cyber risks (breaches, disruptions) for CRI signals
# 2. Federal Reserve Bank of Richmond. (2022). Cyber Risk Definition and Classification
#    for Financial Risk Management (Working Paper 22-09).
#    https://www.richmondfed.org/-/media/RichmondFedOrg/banking/qsr-people/pdf/Cyber_Risk_Classification_White_Paper_2022.pdf
#    - Structures risks by intent, consequence, cause (e.g., phishing, DoS)
# 3. Cebula, J. J., Popeck, M. E., & Young, L. R. (2014). A taxonomy of operational cyber security risks version 2 (No. CMUSEI2014TN006).
#    https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=91013
#    - Covers operational risks (people, systems, processes)

In [3]:
# Extract Item 1A (Risk Factors)
# Why: Item 1A has predictive cyber risk disclosures (per Florackis et al., 2023)
def extract_risk_factors(url):
    try:
        text = extractor.get_section(url, "1A", "text")
        if not text.strip():
            print("Item 1A empty; trying Item 7.")
            text = extractor.get_section(url, "7", "text")
        if not text.strip():
            raise ValueError("No relevant section found.")
        return text
    except Exception as e:
        raise ValueError(f"Extraction failed: {str(e)}")

# Filter sentences to reduce LLM load
# Why: Skip non-cyber sentences (e.g., no 'cyber', 'breach') to save compute;
# length filter avoids noise (e.g., table fragments)
def pre_filter_sentence(sentence, details):
    cyber_keywords = set()
    for label, info in details.items():
        if label != "None":
            cyber_keywords.update(info['keywords'])
    return len(sentence) > 10 and any(kw.lower() in sentence.lower() for kw in cyber_keywords)

# Label sentences with DeBERTa-v3
# Why: DeBERTa outperforms BART on semantic tasks; multi-label handles overlaps
# (e.g., breach as 'Threats' + 'Impacts'); distilbert fallback for low compute
def label_sentences(sentences, model_name='MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli'):
    try:
        classifier = pipeline('zero-shot-classification', model=model_name)
    except Exception as e:
        print(f"DeBERTa failed: {str(e)}. Using distilbert.")
        classifier = pipeline('zero-shot-classification', model='distilbert-base-uncased')
    
    labeled = []
    batch_size = 16  # Why: Balances CPU memory and speed
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i+batch_size]
        results = classifier(batch, flattened_labels, multi_label=True)
        for sent, res in zip(batch, results if isinstance(results, list) else [results]):
            labels = [label for label, score in zip(res['labels'], res['scores']) if score > 0.5]
            labeled.append((sent.strip(), labels if labels else ['None']))
    return labeled

# Main 
def main(output_csv='labeled_sentences_CyberRisk.csv'):
    # Extract Item 1A
    try:
        section_text = extract_risk_factors(filing_url)
    except ValueError as e:
        print(f"Error: {str(e)}")
        return
    
    # Split into sentences
    sentences = nltk.sent_tokenize(section_text)
    print(f"Extracted {len(sentences)} sentences.")
    
    # Filter cyber-relevant sentences
    filtered_sentences = [s for s in sentences if pre_filter_sentence(s, taxonomy_details)]
    print(f"Filtered to {len(filtered_sentences)} cyber-relevant sentences.")
    
    # Label with LLM
    labeled = label_sentences(filtered_sentences)
    
    # Save to CSV
    df = pd.DataFrame(labeled, columns=['Sentence', 'Labels'])
    df['Labels'] = df['Labels'].apply(lambda x: ', '.join(x))
    df.to_csv(output_csv, index=False, encoding='utf-8')

if __name__ == '__main__':
    main()

Extracted 310 sentences.
Filtered to 103 cyber-relevant sentences.


Device set to use mps:0
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


# Results

- **Approach**: Extracted Item 1A from Apple's 2023 10-K using sec-api for predictive cyber risks (Item 1C absent); tokenized 310 sentences with NLTK; filtered to 103 via keywords; labeled with DeBERTa-v3 zero-shot (threshold 0.5) using taxonomy from Florackis et al. (2023), Richmond Fed (2022), and CMU SEI (2014).
- **Key Findings - Overall**: Identified 74 cyber-related sentences (71.8% of filtered), focusing on Apple's exposure to external attacks and disruptions.
- **Key Findings - Labels**: "External Events: Threats" dominated (51 occurrences), e.g., ransomware and unauthorized access; "Systems and Technology Failures: Software" (6), e.g., software vulnerabilities.
- **Key Findings - Insights**: Multi-labeling captured overlaps like threats with impacts; high threats reflect Apple's high-profile status, suitable for CRI risk scoring.
- **CRI Relevance**: Dominant threats can weight portfolios (e.g., higher risk for shorts); aligns with Florackis et al. for predictive signals.
- **Challenges - Technical**: API key setup required; 0.5 threshold caused under-labeling (e.g., missed "Impacts" in some access-related sentences).
- **Challenges - Performance**: LLM inference ~5-10 min on CPU/MPS, mitigated by filtering; MPS usage sped up but may vary by hardware.
- **Validation**: Labels match literature patterns; could correlate with post-filing breaches for CRI testing.