# Analytic Plan Generator

This notebook uses a generative AI model (Google Gemini) to generate an analytic plan for each technique in the MITRE ATT&CK Matrix for Enterprise.

## Background

As described in TC 3-12.2.4.1, "The Analytic Scheme of Maneuver is the plan to collect and analyze technical data to meet specific information requirements. It identifies what data to analyze, how to analyze it, and why it is being analyzed." The analytic scheme of maneuver, or ASOM, consists of the following components:

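* **Priority information requirement**: The information about the enemy or the terrain that the commander considers most important.
* **Indicator**: Positive or negative evidence of threat activity pertaining to one or more information requirements.
* **Evidence**: The concrete information that supports or refutes an indicator.
* **Data**: The precise data necessary to identify evidence.
* **NAI**: A named area of interest, where data that will satisfy a specific information requirement can be collected.
* **Analytic**: The high-level instructions that guide the analysts' search for evidence (called "Actions" in the prompts later in this notebook).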



## Environment Setup

This section sets up the environment. It installs packages necessary to generate the analytic plans, imports modules, initializes helper functions, and finally defines global variables. This section also mounts Google Drive to the runtime and moves into the project folder.

### Install Packages

In [None]:
!pip install -U -q "google" 1> /dev/null
!pip install -U -q "google-genai" 1> /dev/null
!pip install mitreattack-python 1> /dev/null

### Import Modules

In [None]:
import os
from google.colab import userdata
from google.colab import drive
import requests
from mitreattack.stix20 import MitreAttackData
import json
import base64
from google import genai
from google.genai import types
import datetime
import time
import sys
import multiprocessing

### Initialize Helper Functions

In [None]:
def log(message, end="\n", flush=True):
    """
    Logs a message to the console, prepended with the current timestamp
    in ISO 8601 format. Prints nothing when the global `verbose` flag is False.

    Args:
        message (str): The string message to log.
        end (str): String appended after the message (passed to print).
        flush (bool): Whether to flush the output stream (passed to print).
    """
    # Skip all work when logging is turned off (verbose = False).
    # `verbose` is a global flag set in the "Define Global Variables" section.
    if not verbose:
        return

    # Format the current date and time as an ISO 8601 timestamp
    timestamp = datetime.datetime.now().isoformat()

    # Print the timestamped message to the console
    print(f"[{timestamp}] {message}", end=end, flush=flush)

In [None]:
def extract_attack_id_from_stix(stix_obj):
    """Extracts the MITRE ATT&CK ID (e.g., T1548, TA0002) from a STIX object.

    Iterates through the 'external_references' list of a STIX object (like a technique,
    tactic, or software object) and returns the 'external_id' associated with the
    'mitre-attack' source name.

    Args:
        stix_obj (dict): A dictionary representing a STIX object, expected to have
                         an 'external_references' key.

    Returns:
        str | None: The MITRE ATT&CK external ID if found, otherwise None.
    """
    # Ensure 'external_references' exists and is iterable; default to empty list if not found.
    for ref in stix_obj.get("external_references", []):
        # Check if the reference source is specifically 'mitre-attack'.
        if ref.get("source_name") == "mitre-attack":
            # Return the associated external ID.
            return ref.get("external_id")
    # Return None if no 'mitre-attack' external reference was found.
    return None
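
As a quick check of the helper above, the sketch below runs the same lookup on a hand-written dictionary shaped like a STIX technique object. The sample object is a trimmed, illustrative stand-in (the `example-source` reference is made up), not a complete record from the ATT&CK bundle, and the function body is repeated only so the example runs on its own.

```python
def extract_attack_id_from_stix(stix_obj):
    """Return the 'mitre-attack' external ID of a STIX-like dict, or None."""
    for ref in stix_obj.get("external_references", []):
        if ref.get("source_name") == "mitre-attack":
            return ref.get("external_id")
    return None


# Illustrative, trimmed-down technique object (not a full STIX record).
# The first reference uses a made-up source to show that it is skipped.
sample_technique = {
    "type": "attack-pattern",
    "name": "External Remote Services",
    "external_references": [
        {"source_name": "example-source", "external_id": "EX-0001"},
        {"source_name": "mitre-attack", "external_id": "T1133"},
    ],
}

found_id = extract_attack_id_from_stix(sample_technique)
missing_id = extract_attack_id_from_stix({"name": "object without references"})

print(found_id)    # T1133
print(missing_id)  # None
```

The second call shows why the main loop later skips objects for which the helper returns `None`.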

### Define Global Variables

In [None]:
# Toggle logging on (verbose = True)/off (verbose = False)
verbose = True
# verbose = False

In [None]:
# Rate limits: https://ai.google.dev/gemini-api/docs/rate-limits
# Pricing: https://ai.google.dev/gemini-api/docs/pricing
# Usage: https://console.cloud.google.com/apis/api/generativelanguage.googleapis.com/metrics?project=gen-lang-client-0497172401
# Note that this notebook is designed to be run in Google Colab. The line below reads the Gemini API key for AI Studio,
# which is configured in the Secrets tab on the left side of the Colab window.
os.environ["GEMINI_API_KEY"] = userdata.get("GOOGLE_API_KEY")
log("Gemini API key loaded.")

[2025-05-09T11:48:40.506875] Gemini API key loaded.


### Mount Google Drive

In [None]:
# Mount Google Drive and move into the Google AI Studio folder
DRIVE_PATH = "/content/drive"
TECHNIQUES_PATH = "/content/drive/MyDrive/Google AI Studio/techniques"

drive.mount(DRIVE_PATH)
log(f"Google Drive mounted to {DRIVE_PATH}")

os.chdir(TECHNIQUES_PATH)
log(f"Changed directory to {TECHNIQUES_PATH}")

Mounted at /content/drive
[2025-05-09T11:49:11.705536] Google Drive mounted to /content/drive
[2025-05-09T11:49:12.373355] Changed directory to /content/drive/MyDrive/Google AI Studio/techniques


## ATT&CK Matrix Retrieval

This section first retrieves the latest MITRE ATT&CK Matrix for Enterprise from MITRE's GitHub repository. Then, it builds a dictionary keyed by technique, where each value is a nested dictionary holding the technique's parent tactic(s), its description, and its detection suggestions.

In [None]:
# Define the local filename for the MITRE ATT&CK data
attack_data_filename = "enterprise-attack.json"
# Define the source URL for the MITRE ATT&CK Enterprise data
attack_data_url = "https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json"

# Check if the MITRE ATT&CK data file already exists locally.
# This avoids redundant downloads on subsequent runs, saving time and bandwidth.
# It acts as a simple caching mechanism.
if not os.path.exists(attack_data_filename):
    log(f"Local file '{attack_data_filename}' not found.")
    log(f"Downloading MITRE ATT&CK data from {attack_data_url}")
    try:
        # Attempt to download the latest enterprise ATT&CK framework definition from the official MITRE CTI GitHub repository.
        # The timeout (in seconds) keeps the cell from hanging indefinitely on a stalled connection.
        attack_response = requests.get(attack_data_url, timeout=60)
        # Raise an exception if the download request was unsuccessful (e.g., 404 Not Found, 500 Server Error).
        attack_response.raise_for_status()
        # Get the text content of the successful response.
        attack_text = attack_response.text

        # Parse the downloaded JSON text into a Python dictionary/list structure.
        # This allows programmatic access to the ATT&CK data.
        attack_json = json.loads(attack_text)

        # Save the downloaded and parsed data to the local file.
        # 'w' mode overwrites any existing file; 'utf-8' encoding is good practice for JSON.
        # The 'indent=4' argument makes the saved JSON file human-readable (pretty-printed).
        with open(attack_data_filename, "w", encoding='utf-8') as outfile:
            json.dump(attack_json, outfile, indent=4)
        log(f"Successfully downloaded and saved data to '{attack_data_filename}'.")

    except requests.exceptions.RequestException as e:
        # Handle potential network errors during download (e.g., connection error, timeout, bad HTTP status).
        log(f"Error downloading ATT&CK data: {e}")
        # Depending on the application's needs, might want to exit, retry, or proceed without the data.
        # For now, we'll just print the error and potentially fail later when MitreAttackData is initialized.
    except json.JSONDecodeError as e:
        # Handle cases where the downloaded content is not valid JSON.
        log(f"Error parsing downloaded JSON data: {e}")
    except IOError as e:
        # Handle potential errors during file writing (e.g., permissions denied, disk full).
        log(f"Error writing data to file '{attack_data_filename}': {e}")

# Initialize the MitreAttackData object using the local JSON file.
# This object provides methods to easily query and navigate the ATT&CK framework data
# (e.g., find techniques, tactics, mitigations, relationships).
# It encapsulates the logic for interacting with the raw JSON data structure.
# Ensure the file exists before attempting to initialize, or handle potential FileNotFoundError within MitreAttackData.
if os.path.exists(attack_data_filename):
    try:
        mad = MitreAttackData(attack_data_filename)
        log(f"MitreAttackData object initialized successfully from '{attack_data_filename}'.")
        # Now the 'mad' object can be used to work with the ATT&CK data.
        # Example queries:
        # techniques = mad.get_techniques()
        # tactic = mad.get_object_by_attack_id("TA0001", "x-mitre-tactic")
    except Exception as e:
        # Catch any potential errors during MitreAttackData initialization (e.g., file format issues not caught earlier, internal errors).
        log(f"Error initializing MitreAttackData from '{attack_data_filename}': {e}")
        # Set mad to None or handle appropriately to indicate failure.
        mad = None
else:
    # This branch is reached if the download/file writing failed in the 'if not os.path.exists' block.
    log(f"Cannot initialize MitreAttackData: File '{attack_data_filename}' is missing and could not be created.")
    sys.exit(1)

[2025-05-09T11:49:26.035270] MitreAttackData object initialized successfully from 'enterprise-attack.json'.
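
Before handing a cached file to `MitreAttackData`, it can help to confirm that it really is a STIX bundle, since a truncated or unexpected download would otherwise only surface as an initialization error. The following is a minimal sketch, assuming the usual bundle layout (a top-level `"type": "bundle"` with an `"objects"` list). `summarize_bundle` is a hypothetical helper, not part of `mitreattack-python`, and the toy bundle is made up for the demonstration.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_bundle(path):
    """Count objects by STIX type in a bundle file (hypothetical helper)."""
    with open(path, encoding="utf-8") as infile:
        bundle = json.load(infile)
    # A STIX bundle is a JSON object with type "bundle" and a list of objects.
    if not isinstance(bundle, dict) or bundle.get("type") != "bundle":
        raise ValueError(f"'{path}' does not look like a STIX bundle")
    if not isinstance(bundle.get("objects"), list):
        raise ValueError(f"'{path}' has no 'objects' list")
    return Counter(obj.get("type", "unknown") for obj in bundle["objects"])


# Demonstrate on a tiny, made-up bundle written to a temporary file.
toy_bundle = {
    "type": "bundle",
    "objects": [
        {"type": "attack-pattern", "name": "Toy technique A"},
        {"type": "attack-pattern", "name": "Toy technique B"},
        {"type": "x-mitre-tactic", "name": "Toy tactic"},
    ],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy-bundle.json")
    with open(toy_path, "w", encoding="utf-8") as outfile:
        json.dump(toy_bundle, outfile)
    type_counts = summarize_bundle(toy_path)

print(dict(type_counts))
```

In the notebook, the same call could be made as `summarize_bundle(attack_data_filename)` just before the `MitreAttackData` object is created.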


Now that the script has retrieved the MITRE ATT&CK Matrix for Enterprise, transform it into a new data structure for ease of reference later.

In [None]:
# Create a dictionary mapping lowercase tactic names (e.g., "initial access")
# to their corresponding MITRE ATT&CK Tactic IDs (e.g., "TA0001").
# This provides a quick lookup for tactic IDs based on names found in technique data.
# It assumes the first external reference in a tactic object contains the desired ID.
tactic_name_to_id = {
    # Convert tactic name to lowercase for case-insensitive matching later.
    tactic.name.lower(): tactic.external_references[0]["external_id"]
    # Iterate through all tactic objects provided by the MitreAttackData instance.
    for tactic in mad.get_tactics()
}

# Initialize the dictionary that will store the processed technique/sub-technique information.
# The keys will be formatted strings like "T1548.002 - Bypass User Account Control",
# and the values will be dictionaries containing tactic, description, and detection details.
technique_dict = {}

# Retrieve both techniques and sub-techniques from the MitreAttackData instance.
# Exclude any techniques/sub-techniques marked as revoked or deprecated in the source data.
# Combine them into a single list for uniform processing.
all_techniques = mad.get_techniques(remove_revoked_deprecated=True) + \
                 mad.get_subtechniques(remove_revoked_deprecated=True)

# Iterate through each technique or sub-technique object in the combined list.
for tech in all_techniques:
    # Extract the primary ATT&CK ID (e.g., "T1548" or "T1548.002") using the helper function.
    tid = extract_attack_id_from_stix(tech)
    # Skip processing if a valid ATT&CK ID couldn't be extracted.
    if not tid:
        continue # Or add logging/error handling

    # Extract the name, remove leading/trailing whitespace, and replace slashes
    # (which might cause issues in file paths or keys) with hyphens.
    name = tech.get("name", "").strip().replace("/", "-")

    # Create a unique, human-readable key for the technique_dict.
    full_key = f"{tid} - {name}"

    # Extract the description and detection details, removing leading/trailing whitespace.
    # Use .get() with a default empty string to prevent errors if keys are missing.
    description = tech.get("description", "").strip()
    # 'x_mitre_detection' is a custom STIX property used by MITRE for detection guidance.
    detection = tech.get("x_mitre_detection", "").strip()

    tactic_names = []
    # Iterate through the kill chain phases associated with the technique/sub-technique.
    # Default to an empty list if 'kill_chain_phases' is missing.
    for phase in tech.get("kill_chain_phases", []):
        # Ensure we are looking at phases belonging to the 'mitre-attack' framework
        # (as opposed to potential other frameworks like 'mitre-pre-attack').
        if phase.get("kill_chain_name") == "mitre-attack":
            # Extract the tactic's phase name (e.g., "initial-access").
            # Convert it to lowercase and replace hyphens with spaces (e.g., "initial access")
            # to match the keys created for the 'tactic_name_to_id' dictionary.
            phase_name_lookup_key = phase.get("phase_name", "").lower().replace("-", " ")

            # Look up the Tactic ID (e.g., "TA0001") using the prepared lookup key.
            tactic_id = tactic_name_to_id.get(phase_name_lookup_key)

            # If a corresponding tactic ID was found...
            if tactic_id:
                # Format the tactic information string (e.g., "TA0001 - Initial Access").
                # Use .title() for consistent capitalization of the tactic name.
                tactic_display_name = phase.get("phase_name", "").replace("-", " ").title()
                tactic_names.append(f"{tactic_id} - {tactic_display_name}")

    # Join the collected tactic strings into a single comma-separated string.
    # Use set() to remove potential duplicates and sorted() for consistent ordering.
    tactic_str = ", ".join(sorted(set(tactic_names)))

    # Store the extracted and formatted information in the main dictionary.
    technique_dict[full_key] = {
        "tactic": tactic_str,
        "description": description,
        "detection": detection,
    }
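
To make the resulting structure concrete, the sketch below applies the same phase-name lookup to one hand-written technique. The tactic map and the technique record are toy stand-ins for the values returned by `mad.get_tactics()` and `mad.get_techniques()`, the description and detection strings are placeholders, and `build_entry` is a hypothetical helper that condenses the loop body above.

```python
def build_entry(tech, tactic_name_to_id):
    """Condensed version of the loop body: one technique in, (key, value) out."""
    # Find the ATT&CK ID among the external references.
    tid = next(
        (ref.get("external_id") for ref in tech.get("external_references", [])
         if ref.get("source_name") == "mitre-attack"),
        None,
    )
    name = tech.get("name", "").strip().replace("/", "-")

    # Map each 'mitre-attack' kill chain phase to "TAxxxx - Tactic Name".
    tactics = set()
    for phase in tech.get("kill_chain_phases", []):
        if phase.get("kill_chain_name") != "mitre-attack":
            continue
        display = phase.get("phase_name", "").replace("-", " ")
        tactic_id = tactic_name_to_id.get(display.lower())
        if tactic_id:
            tactics.add(f"{tactic_id} - {display.title()}")

    return f"{tid} - {name}", {
        "tactic": ", ".join(sorted(tactics)),
        "description": tech.get("description", "").strip(),
        "detection": tech.get("x_mitre_detection", "").strip(),
    }


# Toy inputs (placeholders, not real ATT&CK bundle content).
toy_tactics = {"initial access": "TA0001", "persistence": "TA0003"}
toy_technique = {
    "name": "External Remote Services",
    "description": " (description text) ",
    "x_mitre_detection": "(detection text)",
    "external_references": [
        {"source_name": "mitre-attack", "external_id": "T1133"},
    ],
    "kill_chain_phases": [
        {"kill_chain_name": "mitre-attack", "phase_name": "persistence"},
        {"kill_chain_name": "mitre-attack", "phase_name": "initial-access"},
    ],
}

key, entry = build_entry(toy_technique, toy_tactics)
print(key)              # T1133 - External Remote Services
print(entry["tactic"])  # TA0001 - Initial Access, TA0003 - Persistence
```

Because the tactic strings are de-duplicated and sorted before joining, the order of the kill chain phases in the source object does not affect the stored value.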

In [None]:
# Debugging block
# 'technique_dict' now contains the processed data, ready for further use
# (e.g., writing to a file, displaying in a UI, further analysis).
# print(json.dumps(technique_dict["T1133 - External Remote Services"], indent=4))

## Build Prompts to Generate Initial Analytic Plans

This section generates a prompt to build the initial analytic plan for each technique in the MITRE ATT&CK matrix. It first defines a base prompt that provides context to the model, and then generates a specific prompt crafted for each individual technique. The complete prompts are stored in `prompt_library`, where the key is the technique ID and name in the form "T1190 - Exploit Public-Facing Application".

In [None]:
# This prompt generates the initial analytic plan.
base_prompt = """\
Generate an Analytic Scheme of Maneuver (ASOM) based on the following definitions of its components. For each component, provide specific examples relevant to a scenario where we suspect a sophisticated external threat is attempting to gain unauthorized access to our organization's sensitive data or negatively impact the network.

ASOM Component Definitions:

1.  Information Requirement (IR): These identify the information about the enemy or the terrain that the commander considers most important. For example, "Has the adversary gained initial access to the network?" These should be tagged with MITRE ATT&CK tactic numbers; for example, (TA0001 - Initial Access). A complete PIR looks like this: "Has the adversary gained initial access to the network? (TA0001 - Initial Access)".

2.  Indicators: These are positive or negative evidence of threat activity pertaining to one or more information requirements. They are observable clues related to a specific information requirement. For the IR "Has the adversary gained initial access to the network? (TA0001 - Initial Access)", the indicator should be a technique within that tactic; for example, "T1190 - Exploit Public Facing Application".

3.  Evidence: This is the concrete information that supports or refutes an indicator. It provides the "proof" and can vary in complexity. For the IR "Has the adversary gained initial access to the network?" and the indicator "T1190 - Exploit Public Facing Application" beneath it, appropriate evidence could be:
    * "Anomalous login attempts from unusual geographic locations."
    * "Network traffic involving known malicious command and control (C2) infrastructure."

4.  Data: This describes the precise data necessary to identify evidence. Specificity here is key (e.g., Zeek Conn logs, Windows Event ID 4624, Active Directory security logs). For the evidence, focus your ASOM on the following data sources: network logs, specifically Zeek logs; host logs, specifically Windows Event IDs. Write only the data name. For example, Windows Event ID 4688; Zeek conn.log

5. Data Platform: This describes the type of system from which the data can be collected. It should be one of the following options: "Endpoints", "Servers", "Network devices".

6.  Named Areas of Interest (NAIs): These are areas where data that will satisfy a specific information requirement can be collected. For all NAIs, use a dummy value of "Insert site-specific NAI here".

7.  Actions: These are high-level instructions that guide the analysts' search for evidence. Focus mostly on simple detections, but also look for opportunities to incorporate basic statistical methods and data science techniques here, such as descriptive statistics, interquartile range and box plots, entropy measures, correlation analysis, linear regression, time series analysis, and other, similar methods. For the evidence above, appropriate actions could be:
    * "Identify all source IP addresses for failed and successful login events. Geolocate login sources, then identify rare source countries by low frequency of occurrence or percentiles to flag anomalous login events from unusual geographic locations."
    * "Inner join a list of source and destination IP addresses from public-facing devices with a list of known malicious command and control (C2) servers from a cyber threat intelligence provider updated within the last 30 days. Investigate all sessions where connections involved C2 servers."

Based on these definitions, please generate a detailed ASOM in the JSON format. The keys in the JSON object should correspond to the following ASOM components in this order: IR, Indicators, Evidence, Data Platform, Data, NAIs, Actions. Note that the key for IR should be replaced by the actual IR based on the description above. Also note that the key for evidence should be replaced with the actual form of evidence based on the description above. Here is an example template:

{
    "(Insert IR here)": {
        "Indicators": {
            "(Insert form of evidence here)": {
                "Data": "",
                "Data Platform": "",
                "NAI": "",
                "Action": ""
            },
            "(Insert form of evidence here)": {
                "Data": "",
                "Data Platform": "",
                "NAI": "",
                "Action": ""
            }
        }
    }
}

For each MITRE technique below, generate one PIR (a general question tagged with the parent tactic's T-code, in the format "Has the adversary gained initial access to the network? (TA0001 - Initial Access)") for each tactic. For example, if there is one parent tactic, generate one PIR; if there are two parent tactics, generate two PIRs. For each PIR, generate two indicators (the MITRE technique provided). For each indicator, generate one form of evidence. Each form of evidence should have Data, NAI, and Actions. I have also provided a helpful description of the technique labeled "Description:", and a potential idea for detecting it, labeled "Detection:"."""

In [None]:
prompt_library = {}

# Append the technique-specific details to the base prompt, one prompt per technique.
for technique, details in technique_dict.items():
    prompt_library[technique] = (
        f"{base_prompt}\n\n"
        f"Technique: {technique}\n\n"
        f"Tactic(s): {details['tactic']}\n\n"
        f"Description: {details['description']}\n\n"
        f"Detection: {details['detection']}"
    )

In [None]:
# print(prompt_library["T1190 - Exploit Public-Facing Application"])

## Generate the Initial Analytic Plans

This section generates one PIR and two indicators (with evidence, data, data platform, NAI, and action) for each technique in the MITRE ATT&CK matrix for Enterprise. Note that if a technique may be used at multiple points in an operation, as part of more than one tactic, a unique PIR is generated for each instance of that technique.
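
The nesting described above can be checked mechanically once a plan has been parsed. The sketch below is one possible check, assuming a plan follows the JSON template given in the base prompt (PIR, then `Indicators`, then one object per form of evidence with `Data`, `Data Platform`, `NAI`, and `Action`). `validate_plan` and the sample plan are hypothetical and are not part of the notebook's pipeline.

```python
import json

# Fields every evidence entry is expected to carry, per the prompt template.
REQUIRED_FIELDS = {"Data", "Data Platform", "NAI", "Action"}


def validate_plan(plan_text):
    """Return a list of problems found in a JSON analytic plan (empty if none)."""
    try:
        plan = json.loads(plan_text)
    except json.JSONDecodeError as e:
        return [f"Not valid JSON: {e}"]
    if not isinstance(plan, dict):
        return ["Top level is not a JSON object"]

    problems = []
    for pir, body in plan.items():
        indicators = body.get("Indicators") if isinstance(body, dict) else None
        if not isinstance(indicators, dict):
            problems.append(f"'{pir}': missing 'Indicators' object")
            continue
        for evidence, fields in indicators.items():
            present = set(fields) if isinstance(fields, dict) else set()
            missing = REQUIRED_FIELDS - present
            if missing:
                problems.append(f"'{evidence}': missing {sorted(missing)}")
    return problems


# Made-up plan with one complete and one incomplete evidence entry.
sample_plan = json.dumps({
    "Has the adversary gained initial access to the network? (TA0001 - Initial Access)": {
        "Indicators": {
            "Anomalous login attempts from unusual geographic locations.": {
                "Data": "Windows Event ID 4624",
                "Data Platform": "Servers",
                "NAI": "Insert site-specific NAI here",
                "Action": "(action text)",
            },
            "Incomplete evidence entry": {"Data": "Zeek conn.log"},
        }
    }
})

problems = validate_plan(sample_plan)
print(problems)
```

A check of this kind only confirms the shape of a plan; it says nothing about whether the evidence, data sources, or actions are analytically sound.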

First, define a function with embedded context for generating the analytic plans. This function is unique to the initial generation of analytic plans.

In [None]:
def generate_initial_plan(prompt, target_model):
    """Generates an analytic plan in JSON format using a generative AI model.

    This function constructs a detailed, multi-turn prompt including definitions, examples,
    and constraints to guide the target generative model in creating an analytic plan tailored
    to a specific cyber threat scenario and MITRE ATT&CK technique. It leverages
    few-shot prompting by providing a complete example (for T1133) within the prompt context.

    Args:
        prompt (str): A string containing the specific details of the MITRE ATT&CK
                      technique for which the ASOM should be generated. This typically
                      includes Technique ID, Name, Tactics, Description, and Detection guidance.
        target_model (str): The name or identifier of the generative AI model to use
                            (e.g., "models/gemini-1.5-flash").

    Returns:
        str: A string containing the generated ASOM in JSON format. Returns the raw text
             response from the model. Error handling for API calls or invalid JSON
             responses is not explicitly included here.
    """
    # Initialize the Generative AI client using the API key stored in environment variables.
    # Ensure the 'GEMINI_API_KEY' environment variable is set before running.
    client = genai.Client(
        api_key=os.environ.get("GEMINI_API_KEY"),
    )

    # Set the target model for the generation request.
    model = target_model

    # Construct the conversation history ('contents') for the generative model.
    # This uses a multi-turn approach to provide detailed instructions and examples (few-shot learning).
    contents = [
        # Turn 1: User provides detailed instructions, definitions, constraints, and a full example.
        types.Content(
            role="user",
            parts=[
                # This large text block defines the ASOM structure, components (IR, Indicators, etc.),
                # the desired JSON output format, data source constraints (Zeek, Win Events),
                # NAI placeholder requirements, action guidance (including data science methods),
                # and crucially, provides a complete example scenario based on T1133.
                # --- START OF DETAILED PROMPT TEXT (DO NOT MODIFY CONTENT) ---
                types.Part.from_text(text="""Generate an Analytic Scheme of Maneuver (ASOM) based on the following definitions of its components. For each component, provide specific examples relevant to a scenario where we suspect a sophisticated external threat is attempting to gain unauthorized access to our organization's sensitive data or negatively impact the network.

ASOM Component Definitions:

1.  Information Requirement (IR): These identify the information about the enemy or the terrain that the commander considers most important. For example, \"Has the adversary gained initial access to the network?\" These should be tagged with MITRE ATT&CK tactic numbers; for example, (TA0001 - Initial Access). A complete PIR looks like this: \"Has the adversary gained initial access to the network? (TA0001 - Initial Access)\".

2.  Indicators: These are positive or negative evidence of threat activity pertaining to one or more information requirements. They are observable clues related to a specific information requirement. For the IR \"Has the adversary gained initial access to the network? (TA0001 - Initial Access)\", the indicator should be a technique within that tactic; for example, \"T1190 - Exploit Public Facing Application\".

3.  Evidence: This is the concrete information that supports or refutes an indicator. It provides the \"proof\" and can vary in complexity. For the IR \"Has the adversary gained initial access to the network?\" and the indicator \"T1190 - Exploit Public Facing Application\" beneath it, appropriate evidence could be:
    * \"Anomalous login attempts from unusual geographic locations.\"
    * \"Network traffic involving known malicious command and control (C2) infrastructure.\"

4.  Data: This describes the precise data necessary to identify evidence. Specificity here is key (e.g., Zeek Conn logs, Sysmon event ID 4624, Active Directory security logs). For the evidence, focus your ASOM on the following data sources: network logs, specifically Zeek logs; host logs, specifically Windows Event IDs. Write only the data name. For example, Windows Event ID 4688; Zeek conn.log

5. Data Platform: This describes the type of system from which the data can be collected. It should be one of the following options: \"Endpoints\", \"Servers\", \"Network devices\".

6.  Named Areas of Interest (NAIs): These are areas where data that will satisfy a specific information requirement can be collected. For all NAIs, use a dummy value of \"Insert site-specific NAI here\".

7.  Actions: These are high-level instructions that guide the analysts' search for evidence. Focus mostly on simple detections, but also look for opportunities to incorporate basic statistical methods data science techniques here, such as descriptive statistics, Inter‑quartile range & box‑plots, entropy measures, correlation analysis, linear regression, time series analysis, and other, similar methods. For the evidence above, appropriate actions could be:
    * \"Identify all source IP addresses for failed and successful login events. Geolocate logain sources, then identify rare source countries by low frequency of occurence or percentiles to flag anomalous login events from unusual geographic locations.\"
    * \"Inner join a list of source and destination IP addresses from public-facing devices with a list of known malicious command and control (C2) servers from a cyber threat intelligence provider updated within at least 30 days. Investigate all sessions where connections involved C2 servers.\"

Based on these definitions, please generate a detailed ASOM in the JSON format. The keys in the JSON object should correspond to the following ASOM components in this order: IR, Indicators, Evidence, Data Platform, Data, NAIs, Actions. Note that the key for IR should be replaced by the actual IR based on the description above. Also note that the key for evidence should be replaced with the actual form of evidence based on the description above. Here is an example template:

{
    \"(Insert IR here)\": {
        \"Indicators\": {
            \"(Insert form of evidence here)\": {
                \"Data\": \"\",
                \"Data Platform\": \"\",
                \"NAI\": \"\",
                \"Action\": \"\"
            },
            \"(Insert form of evidence here)\": {
                \"Data\": \"\",
                \"Data Platform\": \"\",
                \"NAI\": \"\",
                \"Action\": \"\"
            }
        }
    }
}

For each MITRE technique below, generate one PIR (a general question tagged with the parent tactic's T-code, in the format \"Has the adversary gained initial access to the network? (TA0001 - Initial Access)\") for each tactic. For example, if there is one parent tactic, generate one PIR; if there are two parent tactics, generate two PIRs. For each PIR, generate two indicators (the MITRE technique provided). For each indicator, generate one form of evidence. Each form of evidence should have Data, NAI, and Actions. I have also provided a helpful description of the technique labeled \"Description:\", and a potential idea for detecting it, labeled \"Detection:\".

Technique: T1133 - External Remote Services

Tactic: TA0001 - Initial-Access, TA0003 - Persistence

Description: Adversaries may leverage external-facing remote services to initially access and/or persist within a network. Remote services such as VPNs, Citrix, and other access mechanisms allow users to connect to internal enterprise network resources from external locations. There are often remote service gateways that manage connections and credential authentication for these services. Services such as [Windows Remote Management](https://attack.mitre.org/techniques/T1021/006) and [VNC](https://attack.mitre.org/techniques/T1021/005) can also be used externally.(Citation: MacOS VNC software for Remote Desktop)

Access to [Valid Accounts](https://attack.mitre.org/techniques/T1078) to use the service is often a requirement, which could be obtained through credential pharming or by obtaining the credentials from users after compromising the enterprise network.(Citation: Volexity Virtual Private Keylogging) Access to remote services may be used as a redundant or persistent access mechanism during an operation.

Access may also be gained through an exposed service that doesn’t require authentication. In containerized environments, this may include an exposed Docker API, Kubernetes API server, kubelet, or web application such as the Kubernetes dashboard.(Citation: Trend Micro Exposed Docker Server)(Citation: Unit 42 Hildegard Malware)

Detection: Follow best practices for detecting adversary use of [Valid Accounts](https://attack.mitre.org/techniques/T1078) for authenticating to remote services. Collect authentication logs and analyze for unusual access patterns, windows of activity, and access outside of normal business hours.

When authentication is not required to access an exposed remote service, monitor for follow-on activities such as anomalous external use of the exposed API or application."""),
                # --- END OF DETAILED PROMPT TEXT ---
            ],
        ),
        # Turn 2: Model provides its internal plan for generating the example (T1133).
        # This helps guide the model's reasoning process for the actual prompt later.
        types.Content(
            role="model",
            parts=[
                # This text simulates the model outlining its step-by-step plan to generate the ASOM for T1133.
                # --- START OF MODEL PLAN TEXT (DO NOT MODIFY CONTENT) ---
                types.Part.from_text(text="""The user wants an Analytic Scheme of Maneuver (ASOM) in JSON format.
The ASOM should follow a specific structure: IR -> Indicators -> Evidence -> Data, Data Platform, NAI, Action.
The IR key should be the actual Information Requirement string.
The Evidence key should be the actual evidence string.
The scenario involves detecting a sophisticated external threat attempting unauthorized access or network impact.
I need to generate PIRs, Indicators, Evidence, Data, Data Platform, NAI, and Actions based on the provided technique T1133 - External Remote Services.

**Plan:**

1.  **Identify Tactics:** T1133 belongs to TA0001 (Initial Access) and TA0003 (Persistence).
2.  **Formulate PIRs:** Create one PIR for each tactic, using the general question format provided and the tactic T-code.
    * PIR 1 (TA0001): \"Has the adversary gained initial access to the network via external remote services? (TA0001 - Initial Access)\"
    * PIR 2 (TA0003): \"Is the adversary maintaining persistence using external remote services? (TA0003 - Persistence)\"
3.  **Define Indicators:** For each PIR, use the provided technique T1133 as the indicator. Since the user asked for *two* indicators per PIR, but only provided *one* technique (T1133), I will use T1133 twice for each PIR, but focus on different aspects or detection methods derived from the description and detection notes.
    * PIR 1 -> Indicator 1: T1133 - External Remote Services (focus on login anomalies)
    * PIR 1 -> Indicator 2: T1133 - External Remote Services (focus on exposed services)
    * PIR 2 -> Indicator 1: T1133 - External Remote Services (focus on persistent unusual access)
    * PIR 2 -> Indicator 2: T1133 - External Remote Services (focus on redundant access mechanisms)
4.  **Develop Evidence:** For each indicator, create a specific form of evidence based on the detection notes.
    * PIR1/Indicator1 (T1133 - Login Anomalies): \"Anomalous login attempts to external remote services (VPN, Citrix, RDP) from unusual geographic locations or outside normal business hours.\"
    * PIR1/Indicator2 (T1133 - Exposed Services): \"Anomalous external usage patterns targeting potentially exposed remote services (e.g., Docker API, Kubernetes API) that do not require authentication.\"
    * PIR2/Indicator1 (T1133 - Persistent Unusual Access): \"Successful logins to external remote services exhibiting unusual access patterns (e.g., multiple failed logins followed by success, logins at odd hours consistently over time) matching known Valid Accounts.\"
    * PIR2/Indicator2 (T1133 - Redundant Access): \"Connections via external remote services originating from IP addresses associated with other known suspicious activities or previously compromised accounts.\"
5.  **Specify Data Sources:** Identify the specific logs needed for each piece of evidence, focusing on Windows Event IDs and Zeek logs as requested.
    * Evidence 1 (Login Anomalies): Windows Event ID 4624, Windows Event ID 4625, VPN Logs, Citrix Logs, Zeek conn.log
    * Evidence 2 (Exposed Services): Zeek conn.log, Web Server Access Logs (if applicable for K8s dashboard etc.)
    * Evidence 3 (Persistent Unusual Access): Windows Event ID 4624, Windows Event ID 4625, VPN Logs, Citrix Logs
    * Evidence 4 (Redundant Access): Zeek conn.log, VPN Logs, Citrix Logs, Threat Intel Feeds
6.  **Define Data Platform:** Assign the appropriate platform (\"Endpoints\", \"Servers\", \"Network devices\").
    * Evidence 1: Servers (for AD/Event Logs), Network devices (VPN/Citrix/Zeek)
    * Evidence 2: Network devices (Zeek), Servers (Web server logs)
    * Evidence 3: Servers (for AD/Event Logs), Network devices (VPN/Citrix)
    * Evidence 4: Network devices (Zeek/VPN/Citrix), Servers (Threat Intel platform)
7.  **Set NAI:** Use the placeholder \"Insert site-specific NAI here\".
8.  **Create Actions:** Define actions to find the evidence, incorporating basic analysis and statistical methods where applicable.
    * Evidence 1 Action: \"Collect remote service authentication logs (Windows Event ID 4624, 4625, VPN/Citrix). Identify source IP addresses for successful and failed logins. Geolocate source IPs. Calculate frequency distribution of source countries/regions and login times. Flag logins from low-frequency locations or occurring outside typical business hours (e.g., using IQR or percentile analysis on login times).\"
    * Evidence 2 Action: \"Monitor Zeek conn.log for connections to known ports associated with exposed services (e.g., Docker API 2375/2376, K8s API 6443/443, kubelet 10250). Analyze connection volume, source IPs, and request patterns. Use time series analysis to detect sudden spikes or changes in traffic patterns from external IPs to these services.\"
    * Evidence 3 Action: \"Correlate successful logins (Event ID 4624, VPN/Citrix logs) with preceding failed attempts (Event ID 4625). Analyze login timestamps for patterns inconsistent with normal user behavior (e.g., logins consistently occurring late at night or on weekends). Use descriptive statistics to establish baseline login behavior per user/group and identify deviations.\"
    * Evidence 4 Action: \"Cross-reference source IP addresses from remote service logs (Zeek conn.log, VPN/Citrix logs) against a list of IP addresses identified in other security alerts or known compromised host lists. Inner join connection logs with Threat Intelligence feeds for known malicious IPs connecting via remote services.\"
9.  **Assemble JSON:** Construct the final JSON object following the specified structure and order. Use the PIRs as the top-level keys. Use the evidence strings as the keys under \"Indicators\"."""),
                # --- END OF MODEL PLAN TEXT ---
                # This provides the expected JSON output for the T1133 example, acting as a strong few-shot example.
                # --- START OF MODEL EXAMPLE JSON OUTPUT (DO NOT MODIFY CONTENT) ---
                types.Part.from_text(text="""[
  {
    \"Has the adversary gained initial access to the network via external remote services? (TA0001 - Initial Access)\": {
      \"Indicators\": {
        \"T1133 - External Remote Services\": {
          \"Anomalous login attempts to external remote services (VPN, Citrix, RDP) from unusual geographic locations or outside normal business hours.\": {
            \"Data\": \"Windows Event ID 4624, Windows Event ID 4625, Zeek conn.log\",
            \"Data Platform\": \"Servers, Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Collect remote service authentication logs (Windows Event ID 4624, 4625, VPN/Citrix). Identify source IP addresses for successful and failed logins. Geolocate source IPs. Calculate frequency distribution of source countries/regions and login times. Flag logins from low-frequency locations or occurring outside typical business hours (e.g., using IQR or percentile analysis on login times).\"
          },
          \"Anomalous external usage patterns targeting potentially exposed remote services (e.g., Docker API, Kubernetes API) that do not require authentication.\": {
            \"Data\": \"Zeek conn.log\",
            \"Data Platform\": \"Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Monitor Zeek conn.log for connections to known ports associated with exposed services (e.g., Docker API 2375/2376, K8s API 6443/443, kubelet 10250). Analyze connection volume, source IPs, and request patterns. Use time series analysis to detect sudden spikes or changes in traffic patterns from external IPs to these services.\"
          }
        }
      }
    }
  },
  {
    \"Is the adversary maintaining persistence using external remote services? (TA0003 - Persistence)\": {
      \"Indicators\": {
        \"T1133 - External Remote Services\": {
          \"Successful logins to external remote services exhibiting unusual access patterns (e.g., multiple failed logins followed by success, logins at odd hours consistently over time) matching known Valid Accounts.\": {
            \"Data\": \"Windows Event ID 4624, Windows Event ID 4625\",
            \"Data Platform\": \"Servers\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Correlate successful logins (Event ID 4624, VPN/Citrix logs) with preceding failed attempts (Event ID 4625). Analyze login timestamps for patterns inconsistent with normal user behavior (e.g., logins consistently occurring late at night or on weekends). Use descriptive statistics to establish baseline login behavior per user/group and identify deviations.\"
          },
          \"Connections via external remote services originating from IP addresses associated with other known suspicious activities or previously compromised accounts.\": {
            \"Data\": \"Zeek conn.log\",
            \"Data Platform\": \"Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Cross-reference source IP addresses from remote service logs (Zeek conn.log, VPN/Citrix logs) against a list of IP addresses identified in other security alerts or known compromised host lists. Inner join connection logs with Threat Intelligence feeds for known malicious IPs connecting via remote services.\"
          }
        }
      }
    }
  }
]"""),
                # --- END OF MODEL EXAMPLE JSON OUTPUT ---
            ],
        ),
        # Turn 3: User provides the actual prompt containing the target technique details.
        types.Content(
            role="user",
            parts=[
                # The 'prompt' variable passed into the function contains the technique details
                # for which the ASOM should be generated this time.
                types.Part.from_text(text=prompt),
            ],
        ),
    ]

    # Configure the generation request.
    generate_content_config = types.GenerateContentConfig(
        # Specify that the model's response should be formatted as JSON.
        response_mime_type="application/json",
        # Provide system instructions, reinforcing the desired JSON structure.
        # This repeats the template definition from the initial user message for emphasis.
        system_instruction=[
            # --- START OF SYSTEM INSTRUCTION JSON TEMPLATE (DO NOT MODIFY CONTENT) ---
            types.Part.from_text(text="""Based on these definitions, please generate a detailed ASOM in the JSON format. The keys in the JSON object should correspond to the following ASOM components in this order: IR, Indicators, Evidence, Data Platform, Data, NAIs, Actions. Note that the key for IR should be replaced by the actual IR based on the description above. Also note that the key for evidence should be replaced with the actual form of evidence based on the description above. Here is an example template:

{
    \"(Insert IR here)\": {
        \"Indicators\": {
            \"(Insert form of evidence here)\": {
                \"Data\": \"\",
                \"Data Platform\": \"\",
                \"NAI\": \"\",
                \"Action\": \"\"
            },
            \"(Insert form of evidence here)\": {
                \"Data\": \"\",
                \"Data Platform\": \"\",
                \"NAI\": \"\",
                \"Action\": \"\"
            }
        }
    }
}"""),
            # --- END OF SYSTEM INSTRUCTION JSON TEMPLATE ---
        ],
    )

    # Send the request to the generative model.
    response = client.models.generate_content(
        model=model,              # The specified target model
        contents=contents,        # The constructed multi-turn conversation history
        config=generate_content_config, # Configuration including response format and system instructions
    )

    # Return the text content of the model's response, which should be the generated ASOM JSON.
    return response.text
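
The returned text is expected to be a JSON array shaped like the few-shot example above: an IR, then `Indicators`, then the technique, then each form of evidence, with `Data`, `Data Platform`, `NAI`, and `Action` at the leaves. As a minimal sketch, assuming the model followed that shape, the nested plan can be flattened into one row per form of evidence for easier review. `flatten_plan` and the abbreviated sample plan are hypothetical and are not part of this notebook.

```python
import json

def flatten_plan(plan_text: str) -> list[dict]:
    """Flatten ASOM JSON text into one row per form of evidence.

    Assumes the nesting used in the few-shot example:
    [ { IR: { "Indicators": { technique: { evidence: {fields} } } } } ]
    """
    rows = []
    for entry in json.loads(plan_text):
        for ir, ir_body in entry.items():
            for technique, evidence_map in ir_body.get("Indicators", {}).items():
                for evidence, fields in evidence_map.items():
                    rows.append({
                        "IR": ir,
                        "Indicator": technique,
                        "Evidence": evidence,
                        "Data": fields.get("Data", ""),
                        "Data Platform": fields.get("Data Platform", ""),
                        "NAI": fields.get("NAI", ""),
                        "Action": fields.get("Action", ""),
                    })
    return rows

# Hypothetical, abbreviated plan used only to exercise the helper.
sample = json.dumps([{
    "Has the adversary gained initial access? (TA0001 - Initial Access)": {
        "Indicators": {
            "T1133 - External Remote Services": {
                "Anomalous logins from unusual locations.": {
                    "Data": "Windows Event ID 4624",
                    "Data Platform": "Servers",
                    "NAI": "Insert site-specific NAI here",
                    "Action": "Geolocate source IPs.",
                }
            }
        }
    }
}])
rows = flatten_plan(sample)
```

A plan that deviates from this nesting would raise an error or yield no rows, which is itself a useful signal that the output needs regeneration.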

Since we will go through multiple iterations of generating and refining these analytic plans, we'll create a wrapper function `prompt_model` that makes it easy to map a collection of prompts to a specific generator function.

In [None]:
def prompt_model(requests_per_minute: int,
                 max_iterations: int,
                 prompts_collection: dict,
                 generation_function,
                 output_directory: str = "./",
                 model_name: str = "gemini-2.5-flash-preview-04-17"):
    """
    Processes a collection of prompts using a specified generation function,
    saving outputs to files while adhering to rate limits and iteration caps.

    This function iterates through `prompts_collection`. For each item, it first
    checks if an output file (named based on the item's key) already exists in
    the `output_directory`. If it does, the item is skipped. Otherwise, the
    function applies rate limiting (if enabled) before calling the
    `generation_function`. The returned content is then saved to a JSON file
    (e.g., `item_key.json`). Processing stops if `max_iterations` of successful
    generations is reached or if an error occurs during any generation step.

    Args:
        requests_per_minute: The target maximum number of generation requests
                             allowed per minute. If 0 or negative, rate limiting
                             is effectively disabled (no waiting).
        max_iterations: The maximum number of *successful* generation requests
                        to perform in this run. Once this count is reached,
                        processing stops.
        prompts_collection: A dictionary where keys are unique identifiers (e.g.,
                            strings representing technique names or slugs) and
                            values are the textual prompts. Keys are used to name
                            output files (e.g., `key.json`).
        generation_function: A callable function responsible for performing the
                             actual content generation. It must accept two arguments:
                             1. `prompt_content` (str): The text prompt.
                             2. `target_model` (str): The model name/identifier.
                             It is expected to return the generated content as a string.
                             Example: `def my_generator(prompt: str, target_model: str) -> str:`
        output_directory: The path to the directory where output JSON files will be
                          saved. Defaults to the current working directory ("./").
                          The directory will be created if it does not exist.
        model_name: The name or identifier of the model to be passed to the
                    `generation_function`.
    """
    if requests_per_minute <= 0:
        min_seconds_between_requests = 0 # Disables rate limiting delays
    else:
        min_seconds_between_requests = 60 / requests_per_minute

    last_request_timestamp = 0  # Timestamp of when the last request was initiated
    successful_requests_count = 0 # Counter for successful generation operations

    # Ensure the output directory exists.
    # Consider using os.makedirs(output_directory, exist_ok=True) for a more concise way.
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
        log(f"Created output directory: {output_directory}")

    log(f"Starting prompt processing. Max iterations: {max_iterations}, Target RPM: {requests_per_minute if requests_per_minute > 0 else 'Unlimited'}.")

    for item_key, prompt_content in prompts_collection.items():
        # Construct the full path for the potential output file.
        # Note: item_key is used verbatim in the file name, so it should be filesystem-safe.
        output_filename = f"{item_key}.json"
        output_file_path = os.path.join(output_directory, output_filename)

        # Check if the output file already exists to avoid re-processing.
        if os.path.exists(output_file_path):
            log(f"Skipping '{item_key}': Output file {output_file_path} already exists.")
            continue  # Proceed to the next item in prompts_collection

        # Rate Limiting Check
        if min_seconds_between_requests > 0: # Only apply rate limiting if enabled
            current_time = time.time()
            elapsed_time_since_last_request = current_time - last_request_timestamp

            if elapsed_time_since_last_request < min_seconds_between_requests:
                wait_time = min_seconds_between_requests - elapsed_time_since_last_request
                log(f"Rate limit check: Waiting for {wait_time:.2f} seconds...")
                time.sleep(wait_time)

        log(f"Processing item #{successful_requests_count + 1}/{max_iterations}. Generating content for '{item_key}' using model '{model_name}'...")

        # Record the time *before* making the potentially time-consuming API call.
        # This ensures rate limiting is based on the start of requests.
        last_request_timestamp = time.time()
        try:
            # Call the provided generation function.
            response_content = generation_function(prompt_content, target_model=model_name)

            # Increment the successful request counter *only after* the generation function
            # completes without raising an exception.
            successful_requests_count += 1

            # Save the generated content to the file.
            # Assumes response_content is a string. For complex objects, use json.dump.
            with open(output_file_path, "w", encoding="utf-8") as f: # Added encoding for robustness
                f.write(response_content)

            log(f"Successfully generated content for '{item_key}' and saved as {output_file_path}")

        except Exception as e:
            log(f"ERROR: Failed to generate content for '{item_key}'. Details: {e}")
            # Current strategy: stop all further processing on error.
            # Alternatives: `continue` to skip to next item, or implement retries.
            log("Stopping processing due to an error.")
            break # Exit the loop

        # Check if the maximum number of successful requests for this run has been reached.
        if successful_requests_count >= max_iterations:
            log(f"Maximum generation limit of {max_iterations} successful requests reached. Exiting loop.")
            break # Exit the loop

    log(f"Prompt processing finished. Total successful generations in this run: {successful_requests_count}.")
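
The pacing rule inside `prompt_model` is simple arithmetic: at `requests_per_minute = 10`, the minimum spacing is 60 / 10 = 6 seconds between request starts. The sketch below isolates that rule as a pure function so it can be checked without sleeping. `seconds_to_wait` is a hypothetical name that does not appear in this notebook.

```python
def seconds_to_wait(requests_per_minute: int,
                    last_request_timestamp: float,
                    now: float) -> float:
    """Return how long to pause before the next request may start."""
    if requests_per_minute <= 0:
        return 0.0  # rate limiting disabled
    min_interval = 60 / requests_per_minute
    elapsed = now - last_request_timestamp
    return max(0.0, min_interval - elapsed)

# At 10 requests per minute, a request that started 2 s ago leaves 4 s to wait.
wait = seconds_to_wait(10, last_request_timestamp=100.0, now=102.0)
```

Because `last_request_timestamp` starts at 0 in `prompt_model`, the elapsed time for the first item is very large and the first request never waits.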

In [None]:
def default_skip_condition(item_key: str, output_file_path: str, prompt_content: str) -> bool:
    """
    Default condition: Skip if the output file already exists.
    """
    if os.path.exists(output_file_path):
        log(f"Skipping '{item_key}' (default condition): Output file {output_file_path} already exists.")
        return True
    return False
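
Output paths are built directly from `item_key`, so a key containing a path separator would point into a subdirectory that does not exist. A minimal sketch of a sanitizer follows. `safe_filename`, its replacement character, and the sample key are assumptions for illustration; the notebook itself writes keys unmodified.

```python
import re

def safe_filename(item_key: str, replacement: str = "_") -> str:
    """Replace characters that are unsafe in file names on common platforms."""
    # Path separators, Windows-reserved punctuation, and control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, item_key)
    # Trailing dots and spaces are stripped because Windows rejects them.
    return cleaned.rstrip(". ")

name = safe_filename("T0000 - Example /with/ separators")  # hypothetical key
```

If a helper like this were adopted, the same function would need to be applied wherever the file is later read back, such as the step that opens `{technique_name}.json`.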

In [None]:
def prompt_model(requests_per_minute: int,
                 max_iterations: int,
                 prompts_collection: dict,
                 generation_function,
                 output_directory: str = "./",
                 model_name: str = "gemini-2.5-flash-preview-04-17",
                 skip_condition_func: callable = default_skip_condition):
    """
    Processes a collection of prompts using a specified generation function,
    saving outputs to files while adhering to rate limits and iteration caps.

    This function iterates through `prompts_collection`. For each item, it first
    evaluates the `skip_condition_func`. If the condition is met (function
    returns True), the item is skipped. Otherwise, the function applies rate
    limiting (if enabled) before calling the `generation_function`. The
    returned content is then saved to a JSON file (e.g., `item_key.json`).
    Processing stops if `max_iterations` of successful generations is reached
    or if an error occurs during any generation step.

    Args:
        requests_per_minute: The target maximum number of generation requests
                             allowed per minute. If 0 or negative, rate limiting
                             is effectively disabled (no waiting).
        max_iterations: The maximum number of *successful* generation requests
                        to perform in this run. Once this count is reached,
                        processing stops.
        prompts_collection: A dictionary where keys are unique identifiers (e.g.,
                            strings representing technique names or slugs) and
                            values are the textual prompts. Keys are used to name
                            output files (e.g., `key.json`).
        generation_function: A callable function responsible for performing the
                             actual content generation. It must accept two arguments:
                             1. `prompt_content` (str): The text prompt.
                             2. `target_model` (str): The model name/identifier.
                             It is expected to return the generated content as a string.
                             Example: `def my_generator(prompt: str, target_model: str) -> str:`
        output_directory: The path to the directory where output JSON files will be
                          saved. Defaults to the current working directory ("./").
                          The directory will be created if it does not exist.
        model_name: The name or identifier of the model to be passed to the
                    `generation_function`.
        skip_condition_func: A callable function that takes `item_key` (str),
                             `output_file_path` (str), and `prompt_content` (str)
                             as arguments and returns True if the item should be
                             skipped, False otherwise. Defaults to `default_skip_condition`
                             which skips if the output file already exists.
    """
    if requests_per_minute <= 0:
        min_seconds_between_requests = 0  # Disables rate limiting delays
    else:
        min_seconds_between_requests = 60 / requests_per_minute

    last_request_timestamp = 0  # Timestamp of when the last request was initiated
    successful_requests_count = 0  # Counter for successful generation operations

    # Ensure the output directory exists.
    os.makedirs(output_directory, exist_ok=True)
    log(f"Ensured output directory exists: {output_directory}")

    log(f"Starting prompt processing. Max iterations: {max_iterations}, Target RPM: {requests_per_minute if requests_per_minute > 0 else 'Unlimited'}.")

    for item_key, prompt_content in prompts_collection.items():
        output_filename = f"{item_key}.json" # Ensure item_key is filesystem-safe
        output_file_path = os.path.join(output_directory, output_filename)

        # Use the dynamic skip condition
        if skip_condition_func(item_key, output_file_path, prompt_content):
            continue  # Proceed to the next item

        # Rate Limiting Check
        if min_seconds_between_requests > 0:
            current_time = time.time()
            elapsed_time_since_last_request = current_time - last_request_timestamp

            if elapsed_time_since_last_request < min_seconds_between_requests:
                wait_time = min_seconds_between_requests - elapsed_time_since_last_request
                log(f"Rate limit check: Waiting for {wait_time:.2f} seconds...")
                time.sleep(wait_time)

        log(f"Processing item #{successful_requests_count + 1}/{max_iterations if max_iterations > 0 else 'unlimited'}. Generating content for '{item_key}' using model '{model_name}'...")

        last_request_timestamp = time.time()
        try:
            response_content = generation_function(prompt_content, target_model=model_name)
            successful_requests_count += 1

            # Consider if response_content is already a JSON string or needs `json.dumps`
            # If generation_function returns a Python dict/list, you'd use json.dump(response_content, f)
            with open(output_file_path, "w", encoding="utf-8") as f:
                f.write(str(response_content)) # Ensure it's a string; if it's JSON, it's fine

            log(f"Successfully generated content for '{item_key}' and saved as {output_file_path}")

        except Exception as e:
            log(f"ERROR: Failed to generate content for '{item_key}'. Details: {e}")
            log("Stopping processing due to an error.")
            break

        if max_iterations > 0 and successful_requests_count >= max_iterations:
            log(f"Maximum generation limit of {max_iterations} successful requests reached. Exiting loop.")
            break

    log(f"Prompt processing finished. Total successful generations in this run: {successful_requests_count}.")
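
Because `skip_condition_func` is a parameter, the skip rule can be swapped without touching the loop. As a sketch, the hypothetical `skip_if_valid_json` below skips an item only when the existing file parses as JSON, so a truncated or malformed earlier output would be regenerated. It illustrates the hook and is not a condition this notebook defines.

```python
import json
import os

def skip_if_valid_json(item_key: str, output_file_path: str, prompt_content: str) -> bool:
    """Skip only if the output file exists and contains parseable JSON."""
    if not os.path.exists(output_file_path):
        return False
    try:
        with open(output_file_path, "r", encoding="utf-8") as f:
            json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False  # corrupt output: regenerate it
    return True
```

It would be supplied as `skip_condition_func=skip_if_valid_json` in the call to `prompt_model`.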

Next, generate the analytic plans.

In [None]:
prompt_model(
    requests_per_minute = 10,
    max_iterations = 1,
    prompts_collection = prompt_library,
    generation_function = generate_initial_plan,
    output_directory = "./",
    model_name = "gemini-2.5-flash-preview-04-17"
)

[2025-05-09T11:49:26.342762] Ensured output directory exists: ./
[2025-05-09T11:49:26.343720] Starting prompt processing. Max iterations: 1, Target RPM: 10.
[2025-05-09T11:49:26.345434] Skipping 'T1055.011 - Extra Window Memory Injection' (default condition): Output file ./T1055.011 - Extra Window Memory Injection.json already exists.
[2025-05-09T11:49:26.346649] Skipping 'T1053.005 - Scheduled Task' (default condition): Output file ./T1053.005 - Scheduled Task.json already exists.
[2025-05-09T11:49:26.347641] Skipping 'T1205.002 - Socket Filters' (default condition): Output file ./T1205.002 - Socket Filters.json already exists.
[2025-05-09T11:49:26.348682] Skipping 'T1560.001 - Archive via Utility' (default condition): Output file ./T1560.001 - Archive via Utility.json already exists.
[2025-05-09T11:49:26.349678] Skipping 'T1021.005 - VNC' (default condition): Output file ./T1021.005 - VNC.json already exists.
[2025-05-09T11:49:26.350663] Skipping 'T1047 - Windows Management Instrumen

In [None]:
# # Halt execution before the unfinished portion of the notebook is run
# sys.exit(0)

## Build Prompts to Improve the Analytic Plans

This section generates a prompt to improve the existing analytic plan for each technique in the MITRE ATT&CK matrix. It first defines a base prompt that provides context to the model, and then fills it in for each individual technique, embedding that technique's current plan. The complete prompts are stored in `refine_prompt_library`, where the key is the technique ID and name in the form "T1190 - Exploit Public-Facing Application".

In [None]:
base_prompt = """\
I’m going to show you an Analytic Plan as a JSON object with the following components.

1.  Information Requirement (IR): These identify the information about the enemy or the terrain that the commander considers most important. For example, "Has the adversary gained initial access to the network?" These should be tagged with MITRE ATT&CK tactic numbers; for example, (TA0001 - Initial Access). A complete IR looks like this: "Has the adversary gained initial access to the network? (TA0001 - Initial Access)".

2.  Indicators: These are positive or negative evidence of threat activity pertaining to one or more information requirements. They are observable clues related to a specific information requirement. For the IR "Has the adversary gained initial access to the network? (TA0001 - Initial Access)", the indicator should be a technique within that tactic; for example, "T1190 - Exploit Public-Facing Application".

3.  Evidence: This is the concrete information that supports or refutes an indicator. It provides the "proof" and can vary in complexity. For the IR "Has the adversary gained initial access to the network?" and the indicator "T1190 - Exploit Public-Facing Application" beneath it, appropriate evidence could be:
    * "Anomalous login attempts from unusual geographic locations."
    * "Network traffic involving known malicious command and control (C2) infrastructure."

4.  Data: This describes the precise data necessary to identify evidence. Specificity here is key (e.g., Zeek conn.log, Windows Event ID 4624, Active Directory security logs). For the evidence, focus your ASOM on the following data sources: network logs, specifically Zeek logs; host logs, specifically Windows Event IDs. Write only the data name. For example, Windows Event ID 4688; Zeek conn.log

5. Data Platform: This describes the type of system from which the data can be collected. It should be one of the following options: "Endpoints", "Servers", "Network devices".

6.  Named Areas of Interest (NAIs): These are areas where data that will satisfy a specific information requirement can be collected. For all NAIs, use a dummy value of "Insert site-specific NAI here".

7.  Actions: These are high-level instructions that guide the analysts' search for evidence. Focus mostly on simple detections, but also look for opportunities to incorporate basic statistical methods and data science techniques here, such as descriptive statistics, interquartile range and box plots, entropy measures, correlation analysis, linear regression, time series analysis, and other similar methods. For the evidence above, appropriate actions could be:
    * "Identify all source IP addresses for failed and successful login events. Geolocate login sources, then identify rare source countries by low frequency of occurrence or percentiles to flag anomalous login events from unusual geographic locations."
    * "Inner join a list of source and destination IP addresses from public-facing devices with a list of known malicious command and control (C2) servers from a cyber threat intelligence provider updated within the last 30 days. Investigate all sessions where connections involved C2 servers."

The keys in the JSON object correspond to the following ASOM components in this order: IR, Indicators, Evidence, Data Platform, Data, NAIs, Actions. Note that the key for IR should be replaced by the actual IR based on the description above. Also note that the key for evidence should be replaced with the actual form of evidence based on the description above. "version" and "last_updated" help track when these plans have been updated. Change "version" to "{version}", and change "last_updated" to "{last updated}". Here is an example template:

[
    {
        "(Insert IR here)": {
            "Indicators": {
                "(Insert form of evidence here)": {
                    "Data": "",
                    "Data Platform": "",
                    "NAI": "",
                    "Action": ""
                },
                "(Insert form of evidence here)": {
                    "Data": "",
                    "Data Platform": "",
                    "NAI": "",
                    "Action": ""
                }
            },
            "version": "",
            "last_updated": ""
        }
    }
]

Note that there is one IR (a general question tagged with the parent tactic's T-code, in the format "Has the adversary gained initial access to the network? (TA0001 - Initial Access)") for each tactic. Also note that for each IR, there are two indicators (the MITRE technique provided). Also note that for each indicator, there is one form of evidence. Each form of evidence has Data, NAI, and Actions.

For each indicator, I want you to add two more forms of evidence (refer to the description above) to really improve our ability to identify the specific malicious activity referred to by the parent IR and the parent indicator. Note that each new evidence should have Data, NAI, and Actions. Do not make any other changes to the analytic plan.

I have provided a helpful description of the technique labeled "Description:", and a potential idea for detecting it, labeled "Detection:".

Technique: {technique}

Tactic(s): {tactics}

Description: {description}

Detection: {detection}

Here is the current analytic plan for you to modify:

```
{original plan}
```
"""

In [None]:
# Store the new version number and last updated date
new_version = "1.1"
last_updated = "2025-05-07" # Date of this revision, in ISO 8601 format

def process_technique(technique_item):
    """
    Builds the refinement prompt for a single technique.

    Args:
        technique_item: A (technique_name, technique_data) tuple, as produced
                        by `technique_dict.items()`.

    Returns:
        A (technique_name, refined_prompt) tuple. `refined_prompt` is None if
        the technique's existing plan could not be read or processed.
    """
    technique_name, technique_data = technique_item

    json_filename = f"{technique_name}.json"
    try:
        with open(json_filename, "r", encoding="utf-8") as fd:
            original_plan_content = fd.read()

        # Perform placeholder replacements
        refined_prompt = base_prompt.replace("{version}", new_version) \
                                    .replace("{last updated}", last_updated) \
                                    .replace("{technique}", technique_name) \
                                    .replace("{tactics}", technique_data['tactic']) \
                                    .replace("{description}", technique_data['description']) \
                                    .replace("{detection}", technique_data['detection']) \
                                    .replace("{original plan}", original_plan_content)
        return technique_name, refined_prompt
    except FileNotFoundError:
        log(f"Warning: File {json_filename} not found for technique {technique_name}.")
        # Return None rather than calling sys.exit(): exiting inside a pool
        # worker kills that worker and can leave pool.map() waiting forever.
        return technique_name, None
    except Exception as e:
        log(f"Error processing technique {technique_name}: {e}")
        return technique_name, None

# Initialize an empty dictionary to store the refined prompts.
refine_prompt_library = {}

# Get the number of available CPU cores.
# os.cpu_count() may return None; Pool(processes=None) then chooses its own default.
num_cores = os.cpu_count()
log(f"Using {num_cores} cores for processing.")

# Create a pool of worker processes
# We pass technique_dict.items() to have both key and value in the worker function
# If your technique_dict is very large, consider if technique_dict itself needs to be passed
# or if its content can be accessed differently to avoid large data transfers to child processes.
# For this case, passing items should be fine.

# Create a list of items to process. Each item is a tuple (technique_name, technique_data)
# This assumes technique_dict is defined and accessible here.
tasks = list(technique_dict.items())

with multiprocessing.Pool(processes=num_cores) as pool:
    # map_async is non-blocking and returns an AsyncResult object.
    # The results will be a list of tuples: [(technique_name, refined_prompt), ...]
    results = pool.map(process_technique, tasks)

# Populate the refine_prompt_library from the results
for technique_name, refined_prompt in results:
    if refined_prompt is not None: # Only add if processing was successful
        refine_prompt_library[technique_name] = refined_prompt

print("Processing complete.")

[2025-05-09T11:49:27.157158] Using 2 cores for processing.
Processing complete.
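
The placeholder substitution performed in `process_technique` can be illustrated on a much shorter template. The sketch below is only an illustration: `TEMPLATE`, `fill_template`, and the example values are hypothetical stand-ins, not the notebook's actual `base_prompt` or data. It shows that exact-match `str.replace` calls fill the named placeholders while leaving the literal braces of an embedded JSON plan untouched.

```python
# Hypothetical, shortened stand-in for the notebook's base_prompt.
TEMPLATE = (
    "Technique: {technique}\n"
    "Tactic: {tactics}\n"
    'Change "version" to "{version}".\n'
    "Here is the current analytic plan for you to modify:\n"
    "{original plan}\n"
)


def fill_template(template, values):
    """Replace each {placeholder} token in template with its value."""
    filled = template
    for placeholder, value in values.items():
        # Only the exact token, braces included, is replaced.
        filled = filled.replace("{" + placeholder + "}", str(value))
    return filled


# Illustrative values only; the real ones come from technique_dict and the JSON files.
example_prompt = fill_template(
    TEMPLATE,
    {
        "technique": "T1133 - External Remote Services",
        "tactics": "TA0001 - Initial Access",
        "version": "1.1",
        "original plan": '[{"Indicators": {}}]',
    },
)
print(example_prompt)
```

Because the replacement matches whole tokens such as `{original plan}`, JSON braces inside the inserted plan are not treated as placeholders.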


In [None]:
# print(refine_prompt_library["T1133 - External Remote Services"])

## Improve the Analytic Plans

This section iterates over the existing analytic plans, then uses generative artificial intelligence to improve them.
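
As a rough sketch of the iteration this section builds toward, the loop below calls a generator once per refined prompt and keeps only the responses that parse as JSON. Everything here is an assumption made for illustration: `improve_all`, `fake_generate`, `max_attempts`, and the demo library are hypothetical names, and `fake_generate` returns a canned string instead of calling a model. In the notebook, the generator would wrap the improvement function defined in this section.

```python
import json


def improve_all(prompt_library, generate, max_attempts=3):
    """Call generate(prompt) per technique and keep responses that parse as JSON.

    Returns (improved, failed): a dict of parsed plans keyed by technique name,
    and a list of technique names that never produced valid JSON.
    """
    improved, failed = {}, []
    for technique_name, prompt in prompt_library.items():
        for _ in range(max_attempts):
            response_text = generate(prompt)
            try:
                improved[technique_name] = json.loads(response_text)
                break  # Valid JSON: stop retrying this technique.
            except json.JSONDecodeError:
                continue  # Invalid JSON: try again.
        else:
            # Runs only if no attempt produced valid JSON.
            failed.append(technique_name)
    return improved, failed


def fake_generate(prompt):
    """Hypothetical stand-in for a model call; returns a canned JSON string."""
    return '[{"example IR": {"Indicators": {}, "version": "1.1"}}]'


# Hypothetical one-entry library, shaped like refine_prompt_library.
demo_library = {"T1133 - External Remote Services": "prompt text"}
improved_plans, failed_techniques = improve_all(demo_library, fake_generate)
print(len(improved_plans), failed_techniques)
```

Passing the generator in as an argument keeps the loop testable without network access; a real run could pass something like `lambda p: generate_improved_plan(p, target_model)`.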

First, define a function with embedded context for improving the analytic plans. This function is specific to the improvement step and is not reused for initial plan generation.

In [None]:
def generate_improved_plan(prompt, target_model):
    """
    Generates an improved analytic plan in JSON format using a generative AI model.

    This function sends a multi-turn conversational prompt to the target
    generative AI model (e.g., Gemini). The prompt supplies component
    definitions, a JSON template, and a worked few-shot example to guide the
    model toward producing structured JSON output.

    The function aims to take an existing ASOM for a given MITRE ATT&CK technique,
    and instruct the AI to "improve" it by adding two new "Evidence" components
    to each "Indicator" within the plan, while also updating version and
    timestamp information.

    Args:
        prompt (str): A string containing the specific details of the MITRE ATT&CK
                      technique for which the plan should be generated. This includes
                      PIR, Indicator, Evidence, Data, and NAI, as well as technique
                      description, and detection guidance from MITRE.
        target_model (str): The name or identifier of the generative AI model to use
                            (e.g., "models/gemini-1.5-flash").

    Returns:
        str: A string containing the generated analytic plan in JSON format. Returns
             the raw text response from the model.
    """
    # Initialize the Google Generative AI client.
    # It retrieves the API key from the environment variable 'GEMINI_API_KEY'.
    client = genai.Client(
        api_key=os.environ.get("GEMINI_API_KEY"),
    )

    # Set the model to be used for generation from the function argument.
    model = target_model

    # Construct the multi-turn conversation history ('contents') to guide the model.
    # This includes detailed instructions, definitions, a JSON template,
    # and a comprehensive few-shot example (T1133) demonstrating the
    # task and the desired reasoning process.
    contents = [
        # Turn 1: User provides extensive context, definitions, and a full example
        # of the task to be performed on the T1133 technique.
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""I’m going to show you an Analytic Plan as a JSON object with the following components.

1.  Information Requirement (IR): These identify the information about the enemy or the terrain that the commander considers most important. For example, \"Has the adversary gained initial access to the network?\" These should be tagged with MITRE ATT&CK tactic numbers; for example, (TA0001 - Initial Access). A complete IR looks like this: \"Has the adversary gained initial access to the network? (TA0001 - Initial Access)\".

2.  Indicators: These are positive or negative evidence of threat activity pertaining to one or more information requirements. They are observable clues related to a specific information requirement. For the IR \"Has the adversary gained initial access to the network? (TA0001 - Initial Access)\", the indicator should be a technique within that tactic; for example, \"T1190 - Exploit Public Facing Application\".

3.  Evidence: This is the concrete information that supports or refutes an indicator. It provides the \"proof\" and can vary in complexity. For the IR \"Has the adversary gained initial access to the network?\" and the indicator \"T1190 - Exploit Public Facing Application\" beneath it, appropriate evidence could be:
    * \"Anomalous login attempts from unusual geographic locations.\"
    * \"Network traffic involving known malicious command and control (C2) infrastructure.\"

4.  Data: This describes the precise data necessary to identify evidence. Specificity here is key (e.g., Zeek conn.log, Windows Event ID 4624, Active Directory security logs). For the evidence, focus your ASOM on the following data sources: network logs, specifically Zeek logs; host logs, specifically Windows Event IDs. Write only the data name. For example, Windows Event ID 4688; Zeek conn.log

5. Data Platform: This describes the type of system from which the data can be collected. It should be one of the following options: \"Endpoints\", \"Servers\", \"Network devices\".

6.  Named Areas of Interest (NAIs): These are areas where data that will satisfy a specific information requirement can be collected. For all NAIs, use a dummy value of \"Insert site-specific NAI here\".

7.  Actions: These are high-level instructions that guide the analysts' search for evidence. Focus mostly on simple detections, but also look for opportunities to incorporate basic statistical methods and data science techniques here, such as descriptive statistics, interquartile range and box plots, entropy measures, correlation analysis, linear regression, time series analysis, and other similar methods. For the evidence above, appropriate actions could be:
    * \"Identify all source IP addresses for failed and successful login events. Geolocate login sources, then identify rare source countries by low frequency of occurrence or percentiles to flag anomalous login events from unusual geographic locations.\"
    * \"Inner join a list of source and destination IP addresses from public-facing devices with a list of known malicious command and control (C2) servers from a cyber threat intelligence provider updated within the last 30 days. Investigate all sessions where connections involved C2 servers.\"

The keys in the JSON object correspond to the following ASOM components in this order: IR, Indicators, Evidence, Data Platform, Data, NAIs, Actions. Note that the key for IR should be replaced by the actual IR based on the description above. Also note that the key for evidence should be replaced with the actual form of evidence based on the description above. \"version\" and \"last_updated\" help track when these plans have been updated. Change \"version\" to \"1.1\", and change \"last_updated\" to \"2025-05-06\". Here is an example template:

[
    {
        \"(Insert IR here)\": {
            \"Indicators\": {
                \"(Insert form of evidence here)\": {
                    \"Data\": \"\",
                    \"Data Platform\": \"\",
                    \"NAI\": \"\",
                    \"Action\": \"\"
                },
                \"(Insert form of evidence here)\": {
                    \"Data\": \"\",
                    \"Data Platform\": \"\",
                    \"NAI\": \"\",
                    \"Action\": \"\"
                }
            },
            \"version\": \"\",
            \"last_updated\": \"\"
        }
    }
]

Note that there is one IR (a general question tagged with the parent tactic's ID, in the format \"Has the adversary gained initial access to the network? (TA0001 - Initial Access)\") for each tactic. Also note that for each IR, there is one indicator (the MITRE technique provided). Also note that for each indicator, there are two forms of evidence. Each form of evidence has Data, Data Platform, NAI, and Action.

For each indicator, I want you to add two more forms of evidence (refer to the description above) to really improve our ability to identify the specific malicious activity referred to by the parent IR and the parent indicator. Note that each new evidence should have Data, NAI, and Actions. Do not make any other changes to the analytic plan.

I have provided a helpful description of the technique labeled \"Description:\", and a potential idea for detecting it, labeled \"Detection:\".

Technique: T1133 - External Remote Services

Tactic: TA0001 - Initial-Access, TA0003 - Persistence

Description: Adversaries may leverage external-facing remote services to initially access and/or persist within a network. Remote services such as VPNs, Citrix, and other access mechanisms allow users to connect to internal enterprise network resources from external locations. There are often remote service gateways that manage connections and credential authentication for these services. Services such as [Windows Remote Management](https://attack.mitre.org/techniques/T1021/006) and [VNC](https://attack.mitre.org/techniques/T1021/005) can also be used externally.(Citation: MacOS VNC software for Remote Desktop)

Access to [Valid Accounts](https://attack.mitre.org/techniques/T1078) to use the service is often a requirement, which could be obtained through credential pharming or by obtaining the credentials from users after compromising the enterprise network.(Citation: Volexity Virtual Private Keylogging) Access to remote services may be used as a redundant or persistent access mechanism during an operation.

Access may also be gained through an exposed service that doesn’t require authentication. In containerized environments, this may include an exposed Docker API, Kubernetes API server, kubelet, or web application such as the Kubernetes dashboard.(Citation: Trend Micro Exposed Docker Server)(Citation: Unit 42 Hildegard Malware)

Detection: Follow best practices for detecting adversary use of [Valid Accounts](https://attack.mitre.org/techniques/T1078) for authenticating to remote services. Collect authentication logs and analyze for unusual access patterns, windows of activity, and access outside of normal business hours.

When authentication is not required to access an exposed remote service, monitor for follow-on activities such as anomalous external use of the exposed API or application.

Here is the current ASOM for you to modify:

```
[
  {
    \"Has the adversary gained initial access to the network via external remote services? (TA0001 - Initial Access)\": {
      \"Indicators\": {
        \"T1133 - External Remote Services\": {
          \"Anomalous login attempts to external remote services (VPN, Citrix, RDP) from unusual geographic locations or outside normal business hours.\": {
            \"Data\": \"Windows Event ID 4624, Windows Event ID 4625, Zeek conn.log\",
            \"Data Platform\": \"Servers, Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Collect remote service authentication logs (Windows Event ID 4624, 4625, VPN/Citrix). Identify source IP addresses for successful and failed logins. Geolocate source IPs. Calculate frequency distribution of source countries/regions and login times. Flag logins from low-frequency locations or occurring outside typical business hours (e.g., using IQR or percentile analysis on login times).\"
          },
          \"Anomalous external usage patterns targeting potentially exposed remote services (e.g., Docker API, Kubernetes API) that do not require authentication.\": {
            \"Data\": \"Zeek conn.log\",
            \"Data Platform\": \"Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Monitor Zeek conn.log for connections to known ports associated with exposed services (e.g., Docker API 2375/2376, K8s API 6443/443, kubelet 10250). Analyze connection volume, source IPs, and request patterns. Use time series analysis to detect sudden spikes or changes in traffic patterns from external IPs to these services.\"
          }
        }
      },
      \"version\": \"1.0\",
      \"last_updated\": \"2025-05-04\"
    }
  },
  {
    \"Is the adversary maintaining persistence using external remote services? (TA0003 - Persistence)\": {
      \"Indicators\": {
        \"T1133 - External Remote Services\": {
          \"Successful logins to external remote services exhibiting unusual access patterns (e.g., multiple failed logins followed by success, logins at odd hours consistently over time) matching known Valid Accounts.\": {
            \"Data\": \"Windows Event ID 4624, Windows Event ID 4625\",
            \"Data Platform\": \"Servers\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Correlate successful logins (Event ID 4624, VPN/Citrix logs) with preceding failed attempts (Event ID 4625). Analyze login timestamps for patterns inconsistent with normal user behavior (e.g., logins consistently occurring late at night or on weekends). Use descriptive statistics to establish baseline login behavior per user/group and identify deviations.\"
          },
          \"Connections via external remote services originating from IP addresses associated with other known suspicious activities or previously compromised accounts.\": {
            \"Data\": \"Zeek conn.log\",
            \"Data Platform\": \"Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Cross-reference source IP addresses from remote service logs (Zeek conn.log, VPN/Citrix logs) against a list of IP addresses identified in other security alerts or known compromised host lists. Inner join connection logs with Threat Intelligence feeds for known malicious IPs connecting via remote services.\"
          }
        }
      },
      \"version\": \"1.0\",
      \"last_updated\": \"2025-05-04\"
    }
  }
]
```"""), # This is the very long initial instruction string.
            ],
        ),
        # Turn 2: An example of how the model should "think" and respond,
        # showing the modification process for the T1133 example and outputting
        # the improved T1133 ASOM.
        types.Content(
            role="model",
            parts=[
                types.Part.from_text(text="""The user wants to update an existing JSON object representing an Analytic Plan.
I need to:
1.  Parse the provided JSON.
2.  For each IR, and for each Indicator (\"T1133 - External Remote Services\"), I need to add two new \"Evidence\" entries.
3.  Each new \"Evidence\" entry must have \"Data\", \"Data Platform\", \"NAI\", and \"Action\" fields.
4.  The \"Data\" should focus on Zeek logs and Windows Event IDs.
5.  The \"Data Platform\" should be one of \"Endpoints\", \"Servers\", \"Network devices\".
6.  \"NAI\" should be \"Insert site-specific NAI here\".
7.  \"Actions\" should guide analysts, potentially using basic statistical methods.
8.  Update \"version\" to \"1.1\" and \"last_updated\" to \"2025-05-06\" for both top-level IR objects.

Let's break down the task for each IR:

**IR 1: \"Has the adversary gained initial access to the network via external remote services? (TA0001 - Initial Access)\"**
Indicator: \"T1133 - External Remote Services\"
Existing Evidence 1: \"Anomalous login attempts to external remote services (VPN, Citrix, RDP) from unusual geographic locations or outside normal business hours.\"
Existing Evidence 2: \"Anomalous external usage patterns targeting potentially exposed remote services (e.g., Docker API, Kubernetes API) that do not require authentication.\"

*   **New Evidence 1.1 (for TA0001 - Initial Access, T1133):** Based on \"Detecting Valid Accounts\" and \"unusual access patterns\".
    *   Evidence Idea: \"Unusual number of failed login attempts followed by a successful login to an external remote service from a single source IP.\" (This points to brute-forcing or credential stuffing attempts)
    *   Data: \"Windows Event ID 4625 (failed login), Windows Event ID 4624 (successful login)\"
    *   Data Platform: \"Servers\"
    *   NAI: \"Insert site-specific NAI here\"
    *   Action: \"Aggregate Windows Event IDs 4625 and 4624 by source IP address and timestamp. For each source IP, count failed login attempts (4625) within a short time window (e.g., 5 minutes) immediately preceding a successful login (4624) to an external service. Flag IPs with a high ratio of failed to successful logins or exceeding a predefined threshold of failures before success (e.g., >5 failures).\"

*   **New Evidence 1.2 (for TA0001 - Initial Access, T1133):** Based on \"exposed service that doesn’t require authentication\" and \"monitor for follow-on activities\".
    *   Evidence Idea: \"Anomalous command execution or data transfer patterns immediately following connection to an exposed, non-authenticated remote service (e.g., Kubelet API, Docker API).\"
    *   Data: \"Zeek conn.log, Windows Event ID 4688 (Process Creation), Windows Event ID 5156 (Network Connection)\"
    *   Data Platform: \"Servers, Network devices\" (Host logs from server, network logs from network devices)
    *   NAI: \"Insert site-specific NAI here\"
    *   Action: \"Correlate Zeek conn.log entries showing connections to exposed services (e.g., Docker API ports 2375/2376, Kubelet port 10250) with subsequent process creation events (Windows Event ID 4688) or network connection events (Windows Event ID 5156) on the target server originating from the established external connection. Analyze command line arguments in 4688 for suspicious commands (e.g., `docker run`, `kubectl exec`). Use descriptive statistics to identify unusual data transfer volumes in Zeek conn.log associated with these sessions.\"

**IR 2: \"Is the adversary maintaining persistence using external remote services? (TA0003 - Persistence)\"**
Indicator: \"T1133 - External Remote Services\"
Existing Evidence 1: \"Successful logins to external remote services exhibiting unusual access patterns (e.g., multiple failed logins followed by success, logins at odd hours consistently over time) matching known Valid Accounts.\"
Existing Evidence 2: \"Connections via external remote services originating from IP addresses associated with other known suspicious activities or previously compromised accounts.\"

*   **New Evidence 2.1 (for TA0003 - Persistence, T1133):** Based on \"redundant or persistent access mechanism\" and \"Valid Accounts\".
    *   Evidence Idea: \"Use of a non-standard or newly registered external remote service application/client by a valid account.\"
    *   Data: \"Windows Event ID 4688 (Process Creation), Zeek conn.log\"
    *   Data Platform: \"Endpoints, Network devices\" (Process creation on endpoint, connection logs from network)
    *   NAI: \"Insert site-specific NAI here\"
    *   Action: \"Monitor process creation logs (Windows Event ID 4688) on endpoints for execution of uncommon VPN clients or remote access tools not part of standard corporate software. Correlate with Zeek conn.log to identify external connections made by these processes. Maintain a baseline of commonly used remote service clients per user/department and flag deviations. Analyze process parent-child relationships to detect if standard remote tools are launched by unusual parent processes.\"

*   **New Evidence 2.2 (for TA0003 - Persistence, T1133):** Based on \"services such as Windows Remote Management and VNC can also be used externally\" and \"Valid Accounts\".
    *   Evidence Idea: \"Persistent external connections to internal systems using services like WinRM or VNC from a limited set of external IPs over an extended period, especially outside business hours.\"
    *   Data: \"Windows Event ID 4624 (Logon Type 3 for RDP/network, or specific WinRM event logs if available), Zeek conn.log (for VNC ports 5900+, WinRM ports 5985/5986)\"
    *   Data Platform: \"Servers, Network devices\"
    *   NAI: \"Insert site-specific NAI here\"
    *   Action: \"Filter Zeek conn.log for traffic to WinRM (TCP 5985, 5986) or VNC (TCP 5900-590X) ports originating from external IPs. Correlate with Windows Event ID 4624 on target servers to confirm successful network logons. Use time series analysis to identify long-duration connections or recurring connections from the same external IPs, especially those active outside of normal business hours. Calculate the entropy of source IPs connecting to these services; a low entropy might indicate a persistent C2 channel.\"

Now, I will integrate these new evidences into the existing JSON structure and update the version and last_updated fields.
The JSON structure is an array of IR objects. Each IR object has a key which is the IR string itself. Inside this, there's an \"Indicators\" object. The key for the indicator is the technique ID. Inside the indicator, there's an object where keys are the \"Evidence\" strings.

Let's make sure the structure for adding evidence is correct. The indicator (e.g., \"T1133 - External Remote Services\") is a key, and its value is an object. This object contains key-value pairs where the key is the evidence string, and the value is another object with \"Data\", \"Data Platform\", \"NAI\", and \"Action\". So, I'll be adding more key-value pairs at this level.

Final check on constraints:
-   \"version\": \"1.1\"
-   \"last_updated\": \"2025-05-06\"
-   Data sources: network logs (Zeek), host logs (Windows Event IDs).
-   NAI: \"Insert site-specific NAI here\".
-   Actions: simple detections, basic statistical methods.

Let's draft the new evidence parts.

For IR1 (\"Has the adversary gained initial access to the network via external remote services? (TA0001 - Initial Access)\"), Indicator \"T1133 - External Remote Services\":
Add:
1.  Evidence: \"Unusual number of failed login attempts followed by a successful login to an external remote service from a single source IP.\"
    Data: \"Windows Event ID 4625, Windows Event ID 4624\"
    Data Platform: \"Servers\"
    NAI: \"Insert site-specific NAI here\"
    Action: \"Aggregate Windows Event IDs 4625 and 4624 by source IP address and timestamp. For each source IP, count failed login attempts (4625) within a short time window (e.g., 5 minutes) immediately preceding a successful login (4624) to an external service. Flag IPs with a high ratio of failed to successful logins or exceeding a predefined threshold of failures before success (e.g., >5 failures).\"

2.  Evidence: \"Anomalous command execution or data transfer patterns immediately following connection to an exposed, non-authenticated remote service (e.g., Kubelet API, Docker API).\"
    Data: \"Zeek conn.log, Windows Event ID 4688, Windows Event ID 5156\"
    Data Platform: \"Servers, Network devices\"
    NAI: \"Insert site-specific NAI here\"
    Action: \"Correlate Zeek conn.log entries showing connections to exposed services (e.g., Docker API ports 2375/2376, Kubelet port 10250) with subsequent process creation events (Windows Event ID 4688) or network connection events (Windows Event ID 5156) on the target server originating from the established external connection. Analyze command line arguments in 4688 for suspicious commands (e.g., `docker run`, `kubectl exec`). Use descriptive statistics to identify unusual data transfer volumes in Zeek conn.log associated with these sessions.\"

For IR2 (\"Is the adversary maintaining persistence using external remote services? (TA0003 - Persistence)\"), Indicator \"T1133 - External Remote Services\":
Add:
1.  Evidence: \"Use of a non-standard or newly registered external remote service application/client by a valid account for persistent access.\"
    Data: \"Windows Event ID 4688, Zeek conn.log\"
    Data Platform: \"Endpoints, Network devices\"
    NAI: \"Insert site-specific NAI here\"
    Action: \"Monitor process creation logs (Windows Event ID 4688) on endpoints for execution of uncommon VPN clients or remote access tools not part of standard corporate software. Correlate with Zeek conn.log to identify external connections made by these processes. Maintain a baseline of commonly used remote service clients per user/department and flag deviations using frequency analysis. Analyze process parent-child relationships to detect if standard remote tools are launched by unusual parent processes.\"

2.  Evidence: \"Persistent external connections to internal systems using services like WinRM or VNC from a consistent set of external IPs over an extended period, especially outside business hours.\"
    Data: \"Windows Event ID 4624, Zeek conn.log\"
    Data Platform: \"Servers, Network devices\"
    NAI: \"Insert site-specific NAI here\"
    Action: \"Filter Zeek conn.log for traffic to WinRM (TCP 5985, 5986) or VNC (TCP 5900-590X) ports originating from external IPs. Correlate with Windows Event ID 4624 (Logon Type 3) on target servers to confirm successful network logons. Use time series analysis to identify long-duration connections or recurring connections from the same external IPs, especially those active outside of normal business hours. Calculate the entropy of source IPs connecting to these services for specific internal hosts; a persistently low entropy for a given internal host might indicate a targeted persistent C2 channel.\"

Looks good. I will now construct the full JSON.
The original JSON is an array with two objects. Each object has the IR as the key. I need to modify the \"Indicators\"->\"T1133 - External Remote Services\" object within each of these two IR objects.
And update version and last_updated at the IR object level.
The original structure for evidence under an indicator is a dictionary where keys are the evidence strings. I just need to add new keys (new evidence strings) and their corresponding value objects.
I will make sure to carefully merge these into the existing structure.
The problem asks to \"add two more forms of evidence\", so for each indicator, there will be the original 2 + 2 new = 4 evidence entries.
Double check: \"Data Platform\" options are \"Endpoints\", \"Servers\", \"Network devices\". My choices are consistent.
\"Data\" should be specific names like \"Windows Event ID 4688; Zeek conn.log\". I used commas in my thought process, but the example uses semicolons. I'll stick to commas as it's more common in lists within a string unless the example format is strict. The user's example \"Windows Event ID 4688; Zeek conn.log\" implies semicolon might be preferred. I'll use comma separated as it's more standard, if the user has a strong preference for semicolon, they can specify. Given the example for data source \"Windows Event ID 4688; Zeek conn.log\", I will use semicolons for multiple data sources in a single string.

Revisiting the data source format: The example provided by the user says \"Write only the data name. For example, Windows Event ID 4688; Zeek conn.log\". I will adhere to this, using semicolons if multiple data sources are listed for a single evidence.

My planned data entries are:
1.1: \"Windows Event ID 4625; Windows Event ID 4624\"
1.2: \"Zeek conn.log; Windows Event ID 4688; Windows Event ID 5156\"
2.1: \"Windows Event ID 4688; Zeek conn.log\"
2.2: \"Windows Event ID 4624; Zeek conn.log\"
This looks consistent with the request.

The version and last_updated are properties of the IR object, at the same level as \"Indicators\".
Example:
```json
[
    {
        \"(Insert IR here)\": {  // IR object
            \"Indicators\": { ... },
            \"version\": \"\",      // property of IR object
            \"last_updated\": \"\"  // property of IR object
        }
    }
]
```
My plan to update `version` to \"1.1\" and `last_updated` to \"2025-05-06\" for each of the two IR objects is correct.
The phrasing of the new evidence strings is important. They should be descriptive.
For 2.1, \"Use of a non-standard or newly registered external remote service application/client by a valid account for persistent access.\" - Adding \"for persistent access\" clarifies the link to TA0003.
For 2.2, \"Persistent external connections to internal systems using services like WinRM or VNC from a consistent set of external IPs over an extended period, especially outside business hours.\" - This is also good for persistence.

One final check: The prompt states \"For each indicator, I want you to add two more forms of evidence\". Since there's one indicator (\"T1133 - External Remote Services\") under each IR, this means 2 new evidences for the TA0001 IR's T1133, and 2 new evidences for the TA0003 IR's T1133. This is what I've planned."""), # This is the model's example reasoning and output for T1133.
                 # The second part of this model turn holds the improved T1133 JSON,
                 # showing the exact output format the model is expected to produce.
                types.Part.from_text(text="""[
  {
    \"Has the adversary gained initial access to the network via external remote services? (TA0001 - Initial Access)\": {
      \"Indicators\": {
        \"T1133 - External Remote Services\": {
          \"Anomalous login attempts to external remote services (VPN, Citrix, RDP) from unusual geographic locations or outside normal business hours.\": {
            \"Data\": \"Windows Event ID 4624; Windows Event ID 4625; Zeek conn.log\",
            \"Data Platform\": \"Servers, Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Collect remote service authentication logs (Windows Event ID 4624, 4625, VPN/Citrix logs, Zeek conn.log for VPN traffic). Identify source IP addresses for successful and failed logins. Geolocate source IPs. Calculate frequency distribution of source countries/regions and login times. Flag logins from low-frequency locations or occurring outside typical business hours (e.g., using IQR or percentile analysis on login times).\"
          },
          \"Anomalous external usage patterns targeting potentially exposed remote services (e.g., Docker API, Kubernetes API) that do not require authentication.\": {
            \"Data\": \"Zeek conn.log\",
            \"Data Platform\": \"Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Monitor Zeek conn.log for connections to known ports associated with exposed services (e.g., Docker API 2375/2376, K8s API 6443/443, kubelet 10250). Analyze connection volume, source IPs, and request patterns. Use time series analysis to detect sudden spikes or changes in traffic patterns from external IPs to these services.\"
          },
          \"Unusual number of failed login attempts followed by a successful login to an external remote service from a single source IP.\": {
            \"Data\": \"Windows Event ID 4625; Windows Event ID 4624\",
            \"Data Platform\": \"Servers\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Aggregate Windows Event IDs 4625 (failed login) and 4624 (successful login) by source IP address and timestamp for external remote services. For each source IP, count failed login attempts (4625) within a short time window (e.g., 5 minutes) immediately preceding a successful login (4624). Flag IPs with a high ratio of failed to successful logins or exceeding a predefined threshold of failures before success (e.g., >5 failures).\"
          },
          \"Anomalous command execution or data transfer patterns immediately following connection to an exposed, non-authenticated remote service (e.g., Kubelet API, Docker API).\": {
            \"Data\": \"Zeek conn.log; Windows Event ID 4688; Windows Event ID 5156\",
            \"Data Platform\": \"Servers, Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Correlate Zeek conn.log entries showing connections to exposed services (e.g., Docker API ports 2375/2376, Kubelet port 10250) with subsequent process creation events (Windows Event ID 4688) or network connection events (Windows Event ID 5156) on the target server originating from the established external connection. Analyze command line arguments in 4688 for suspicious commands (e.g., `docker run`, `kubectl exec`). Use descriptive statistics (e.g., sum, mean of bytes transferred) to identify unusual data transfer volumes in Zeek conn.log associated with these sessions compared to baseline.\"
          }
        }
      },
      \"version\": \"1.1\",
      \"last_updated\": \"2025-05-06\"
    }
  },
  {
    \"Is the adversary maintaining persistence using external remote services? (TA0003 - Persistence)\": {
      \"Indicators\": {
        \"T1133 - External Remote Services\": {
          \"Successful logins to external remote services exhibiting unusual access patterns (e.g., multiple failed logins followed by success, logins at odd hours consistently over time) matching known Valid Accounts.\": {
            \"Data\": \"Windows Event ID 4624; Windows Event ID 4625\",
            \"Data Platform\": \"Servers\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Correlate successful logins (Event ID 4624, VPN/Citrix logs) with preceding failed attempts (Event ID 4625) for valid accounts. Analyze login timestamps for patterns inconsistent with normal user behavior (e.g., logins consistently occurring late at night or on weekends). Use descriptive statistics (mean, median, mode of login times) to establish baseline login behavior per user/group and identify deviations using percentile analysis.\"
          },
          \"Connections via external remote services originating from IP addresses associated with other known suspicious activities or previously compromised accounts.\": {
            \"Data\": \"Zeek conn.log\",
            \"Data Platform\": \"Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Cross-reference source IP addresses from remote service logs (Zeek conn.log, VPN/Citrix logs) against a list of IP addresses identified in other security alerts or known compromised host lists. Inner join connection logs with Threat Intelligence feeds for known malicious IPs connecting via remote services. Investigate all sessions involving such IPs.\"
          },
          \"Use of a non-standard or newly registered external remote service application/client by a valid account for persistent access.\": {
            \"Data\": \"Windows Event ID 4688; Zeek conn.log\",
            \"Data Platform\": \"Endpoints, Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Monitor process creation logs (Windows Event ID 4688) on endpoints for execution of uncommon VPN clients or remote access tools not part of standard corporate software, especially if associated with valid user accounts. Correlate with Zeek conn.log to identify external connections made by these processes. Maintain a baseline of commonly used remote service clients per user/department and flag deviations using frequency analysis. Analyze process parent-child relationships to detect if standard remote tools are launched by unusual parent processes.\"
          },
          \"Persistent external connections to internal systems using services like WinRM or VNC from a consistent set of external IPs over an extended period, especially outside business hours.\": {
            \"Data\": \"Windows Event ID 4624; Zeek conn.log\",
            \"Data Platform\": \"Servers, Network devices\",
            \"NAI\": \"Insert site-specific NAI here\",
            \"Action\": \"Filter Zeek conn.log for traffic to WinRM (TCP 5985, 5986) or VNC (TCP 5900-590X) ports originating from external IPs. Correlate with Windows Event ID 4624 (Logon Type 3 for network logon) on target servers to confirm successful logons. Use time series analysis to identify long-duration connections or recurring connections from the same external IPs, especially those active outside of normal business hours. Calculate the entropy of source IPs connecting to these services for specific internal hosts; a persistently low entropy for a given internal host might indicate a targeted persistent C2 channel.\"
          }
        }
      },
      \"version\": \"1.1\",
      \"last_updated\": \"2025-05-06\"
    }
  }
]"""), # Actual improved JSON for T1133 example
            ],
        ),
        # Turn 3: The actual user prompt for the new technique, passed as an argument
        # to this Python function. This is where the specific technique details
        # and its current ASOM to be improved are provided.
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text=prompt),
            ],
        ),
    ]

    # Configuration for the content generation request.
    # Specifies the desired response MIME type as JSON and includes system instructions.
    generate_content_config = types.GenerateContentConfig(
        response_mime_type="application/json", # Instructs the model to output JSON.
        # The system instruction reiterates the JSON structure, acting as a reinforcement.
        system_instruction=[
            types.Part.from_text(text="""The keys in the JSON object correspond to the following ASOM components in this order: IR, Indicators, Evidence, Data Platform, Data, NAIs, Actions. Note that the key for IR should be replaced by the actual IR based on the description above. Also note that the key for evidence should be replaced with the actual form of evidence based on the description above. \"version\" and \"last_updated\" help track when these plans have been updated. Here is an example template:

[
    {
        \"(Insert IR here)\": {
            \"Indicators\": {
                \"(Insert form of evidence here)\": {
                    \"Data\": \"\",
                    \"Data Platform\": \"\",
                    \"NAI\": \"\",
                    \"Action\": \"\"
                },
                \"(Insert form of evidence here)\": {
                    \"Data\": \"\",
                    \"Data Platform\": \"\",
                    \"NAI\": \"\",
                    \"Action\": \"\"
                }
            },
            \"version\": \"\",
            \"last_updated\": \"\"
        }
    }
]"""),
        ],
    )

    # Send the request to the generative AI model.
    response = client.models.generate_content(
        model=model,                  # The specified target model
        contents=contents,            # The constructed multi-turn conversation history
        config=generate_content_config, # Configuration including response format and system instructions
    )

    # Return the text content of the model's response.
    # This is expected to be the generated ASOM in JSON string format.
    return response.text
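
Because the function returns the model's raw text, it may help to check that the response parses and has the expected shape before it is written to disk. The following is a minimal sketch, not part of the notebook: `validate_plan` and `REQUIRED_FIELDS` are hypothetical names, and the nesting it checks (IR, then "Indicators", then indicator, then evidence, then fields) is assumed from the T1133 example in the conversation history.

```python
import json

# Fields expected under each form of evidence (assumed from the T1133 example).
REQUIRED_FIELDS = {"Data", "Data Platform", "NAI", "Action"}


def validate_plan(response_text):
    """Return a list of problems found in a generated plan (empty if none)."""
    try:
        plans = json.loads(response_text)
    except json.JSONDecodeError as err:
        return [f"response is not valid JSON: {err}"]
    if not isinstance(plans, list):
        return ["top-level JSON value is not a list"]

    problems = []
    for position, item in enumerate(plans):
        # Each list item should map exactly one IR to its plan body.
        if not isinstance(item, dict) or len(item) != 1:
            problems.append(f"item {position}: expected exactly one IR key")
            continue
        ir, body = next(iter(item.items()))
        if not isinstance(body, dict):
            problems.append(f"{ir!r}: plan body is not an object")
            continue
        for key in ("Indicators", "version", "last_updated"):
            if key not in body:
                problems.append(f"{ir!r}: missing {key!r}")
        indicators = body.get("Indicators", {})
        if not isinstance(indicators, dict):
            problems.append(f"{ir!r}: 'Indicators' is not an object")
            continue
        # Assumed nesting: Indicators -> indicator -> evidence -> fields.
        for indicator, evidence_map in indicators.items():
            if not isinstance(evidence_map, dict):
                problems.append(f"{indicator!r}: evidence is not an object")
                continue
            for evidence, fields in evidence_map.items():
                if not isinstance(fields, dict):
                    problems.append(f"{evidence!r}: fields are not an object")
                    continue
                missing = REQUIRED_FIELDS - set(fields)
                if missing:
                    problems.append(f"{evidence!r}: missing {sorted(missing)}")
    return problems


# Illustrative input only; the IR and evidence strings are placeholders.
sample = json.dumps([{
    "Example IR?": {
        "Indicators": {
            "T1133 - External Remote Services": {
                "Example evidence": {
                    "Data": "Zeek conn.log",
                    "Data Platform": "Network devices",
                    "NAI": "Insert site-specific NAI here",
                    "Action": "Example action",
                }
            }
        },
        "version": "1.1",
        "last_updated": "2025-05-06",
    }
}])
print(validate_plan(sample))
```

Returning a list of problems rather than raising lets a caller log every issue for a technique in one pass and decide whether to retry the request.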

In [None]:
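def skip_by_version(item_key: str, output_file_path: str, prompt_content: str) -> bool:
    """
    Custom skip logic: skips a technique whose existing output file already
    contains only version 1.1 plans.

    Args:
        item_key (str): The technique name, used in the log message.
        output_file_path (str): Path to the technique's existing JSON output file.
        prompt_content (str): The prompt for the technique (unused here).

    Returns:
        bool: True if every plan in the file is version 1.1, otherwise False.
    """
    with open(output_file_path, 'r', encoding='utf-8') as fd:
        try:
            # Replace escaped quotes and double every backslash before parsing,
            # so the JSON parser reads each backslash as a literal character.
            # This alters string contents, but only "version" is inspected here.
            data = json.loads(fd.read().replace(r"\"", r"\'").replace("\\", "\\\\"))
        except Exception as e:
            log(f"Error reading {output_file_path}: {e}")
            sys.exit(1)
        all_versions_are_target = all(
            (
                isinstance(item, dict) and  # Ensure the item itself is a dictionary
                len(item) == 1 and          # Ensure the item has exactly one key (the question)
                # Safely get the inner dictionary (value of the single key) and assign to inner_dict
                # next(iter(item.values())) is safe because len(item) == 1 ensures item is not empty
                isinstance(inner_dict := next(iter(item.values())), dict) and # Ensure inner_dict is a dictionary
                inner_dict.get("version") == "1.1" # Check the version safely
            )
            for item in data
        )
        if all_versions_are_target:
            log(f"Skipping '{item_key}' (custom logic): Version 1.1 already exists.")
            return True
    return False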
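
The version test inside `skip_by_version` can be tried on in-memory data without touching any files. The following is a minimal sketch, assuming the same list-of-single-key-dictionaries layout; `all_plans_at_version` is a hypothetical helper and the IR strings are placeholders.

```python
def all_plans_at_version(data, target="1.1"):
    """True if every item is a one-key dict whose value has version == target."""
    return all(
        isinstance(item, dict)
        and len(item) == 1
        and isinstance(inner := next(iter(item.values())), dict)
        and inner.get("version") == target
        for item in data
    )


current = [{"IR A": {"version": "1.1"}}, {"IR B": {"version": "1.1"}}]
mixed = [{"IR A": {"version": "1.1"}}, {"IR B": {"version": "1.0"}}]

print(all_plans_at_version(current))  # True
print(all_plans_at_version(mixed))    # False
print(all_plans_at_version([]))       # True: all() of an empty iterable is True
```

The last case is worth noting: an output file containing an empty list would count as already up to date under this check.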

Next, generate the improved analytic plans.

In [None]:
prompt_model(
    requests_per_minute = 10,
    max_iterations = 100,
    prompts_collection = refine_prompt_library,
    generation_function = generate_improved_plan,
    output_directory = "./",
    skip_condition_func = skip_by_version,
    model_name =  "gemini-2.5-flash-preview-04-17" # "gemini-2.5-pro-exp-03-25"
)

[2025-05-09T12:08:23.153690] Ensured output directory exists: ./
[2025-05-09T12:08:23.154939] Starting prompt processing. Max iterations: 100, Target RPM: 10.
[2025-05-09T12:08:23.161735] Skipping 'T1055.011 - Extra Window Memory Injection' (custom logic): Version 1.1 already exists.
[2025-05-09T12:08:23.165610] Skipping 'T1053.005 - Scheduled Task' (custom logic): Version 1.1 already exists.
[2025-05-09T12:08:23.168060] Skipping 'T1205.002 - Socket Filters' (custom logic): Version 1.1 already exists.
[2025-05-09T12:08:23.172196] Skipping 'T1560.001 - Archive via Utility' (custom logic): Version 1.1 already exists.
[2025-05-09T12:08:23.176516] Skipping 'T1021.005 - VNC' (custom logic): Version 1.1 already exists.
[2025-05-09T12:08:23.182238] Skipping 'T1047 - Windows Management Instrumentation' (custom logic): Version 1.1 already exists.
[2025-05-09T12:08:23.186604] Skipping 'T1113 - Screen Capture' (custom logic): Version 1.1 already exists.
[2025-05-09T12:08:23.190643] Skipping 'T102

In [None]:
# Halt execution before the code below is run
sys.exit(0)

In [None]:
def update_json_files_structure(root_dir, dict_to_add_or_update):
    """
    Recursively finds JSON files in root_dir and adds or updates keys
    from dict_to_add_or_update into the dictionary value of the primary
    key within each item of the main list.

    Args:
        root_dir (str): The path to the directory to start searching from.
        dict_to_add_or_update (dict): A dictionary containing key-value pairs
                                      to add or update.
    """
    if not isinstance(dict_to_add_or_update, dict):
        log("Error: 'dict_to_add_or_update' must be a dictionary.")
        return

    if not dict_to_add_or_update:
        log("Warning: 'dict_to_add_or_update' is empty. No changes will be made.")
        # Optionally return here if desired, or proceed to scan files anyway.
        # return

    if not os.path.isdir(root_dir):
        log(f"Error: Directory '{root_dir}' not found.")
        return

    log(f"Starting scan in directory: {root_dir}")
    log(f"Data to add/update: {dict_to_add_or_update}")
    files_processed = 0
    files_modified = 0
    errors_encountered = 0
    error_files = []
    warning_files = []

    for subdir, _, files in os.walk(root_dir):
        for filename in files:
            if filename.lower().endswith('.json'):
                file_path = os.path.join(subdir, filename)
                log(f"Processing file: {file_path}")
                files_processed += 1
                modified_in_this_file = False
                try:
                    # Read the JSON file with UTF-8 encoding; the context
                    # manager closes the file even if json.load() raises
                    with open(file_path, 'r', encoding='utf-8') as read_f:
                        data = json.load(read_f)


                    # Ensure the top level is a list
                    if isinstance(data, list):
                        # Iterate through each item (dictionary) in the list
                        for item in data:
                            if isinstance(item, dict):
                                # Iterate through the primary keys in the item dictionary
                                # (Assuming one primary key per item as per example)
                                for primary_key in list(item.keys()): # Use list() for safe iteration
                                    # Check if the value associated with the primary key is a dictionary
                                    if isinstance(item[primary_key], dict):
                                        target_dict = item[primary_key]
                                        # Iterate through the keys and values to add/update
                                        for key_to_update, value_to_update in dict_to_add_or_update.items():
                                            # Check if the key needs to be added or if the value is different
                                            if key_to_update not in target_dict or target_dict[key_to_update] != value_to_update:
                                                target_dict[key_to_update] = value_to_update
                                                modified_in_this_file = True # Mark that a change occurred
                                    else:
                                        log(f"  Warning: Value for key '{primary_key}' in {file_path} is not a dictionary. Skipping update for this key.")
                                        warning_files.append(file_path)
                            else:
                                log(f"  Warning: Found an item in the list that is not a dictionary in {file_path}. Skipping this item.")
                                warning_files.append(file_path)

                    else:
                        log(f"  Warning: Root object in {file_path} is not a list. Skipping this file.")
                        warning_files.append(file_path)
                        continue # Skip to the next file

                    # Write the modified data back to the file if changes were made
                    if modified_in_this_file:
                        # Use indent for pretty printing, adjust as needed (e.g., indent=4)
                        with open(file_path, 'w', encoding='utf-8') as write_f:
                            json.dump(data, write_f, indent=2, ensure_ascii=False)
                        log(f"  Successfully updated: {file_path}")
                        files_modified += 1
                    else:
                        log(f"  No changes needed for: {file_path}")


                except json.JSONDecodeError:
                    log(f"  Error: Invalid JSON format in {file_path}. Skipping.")
                    error_files.append(file_path)
                    errors_encountered += 1
                except IOError as e:
                    log(f"  Error: Could not read/write file {file_path}. Reason: {e}. Skipping.")
                    error_files.append(file_path)
                    errors_encountered += 1
                except Exception as e:
                    # Log the type of exception for better debugging
                    log(f"  Error: An unexpected {type(e).__name__} occurred processing {file_path}. Reason: {e}. Skipping.")
                    error_files.append(file_path)
                    errors_encountered += 1

    print("\n--- Processing Summary ---")
    print(f"Total files scanned: {files_processed}")
    print(f"Files successfully modified: {files_modified}")
    # Files that were neither modified nor counted as errors
    skipped_unchanged = files_processed - files_modified - errors_encountered
    print(f"Files skipped, unchanged, or with warnings: {skipped_unchanged}")
    print(f"Errors encountered during processing: {errors_encountered}")
    print("--------------------------")
    print("Error files:\n",error_files)
    print()
    print("Warning files:\n",warning_files)

In [None]:
# target_directory = "./"

# data_to_add = {
#     "version": "1.0",
#     "last_updated": "2025-05-04"
# }

# # Call the generalized function
# update_json_files_structure(target_directory, data_to_add)