## AOAI Insider Threat Analysis
This notebook uses Azure OpenAI (AOAI) to simulate the reasoning of a cybersecurity analyst. Once users have been flagged as anomalous (e.g., by the train_isolation_forest.ipynb notebook), this notebook investigates and explains their behavior using log summaries and engineered features.

#### Approach
_Log types: device, email, file, HTTP, logon logs_

- **Step 1) Pull user logs for a dynamic time window:** The analysis window is computed dynamically from each user's most recent activity. The investigation covers a configurable time range (the "log_window_days" variable, currently set to 60 days) ending at the user's last known activity. Each log type dataset is queried to retrieve the user's activity for this time range.

- **Step 2) Summarize logs with chunking:** For each log type, the logs are broken into manageable chunks and summarized individually with AOAI. Each chunk summary highlights suspicious behavior, flags relevant entries, and assigns relevance scores. These chunks are then synthesized into a single summary for each log type.

- **Step 3) Generate structured final report:** AOAI leverages the log summaries and additional context to generate a structured report that includes: user summary and background, behavioral patterns and anomalies, timeline of suspicious events, and risk assessment and recommendations.
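
The window logic in Step 1 can be sketched with the standard library alone. This is a minimal illustration, not the notebook's Spark-backed code: `compute_window` and the sample dates are hypothetical names and values chosen for the example.

```python
from datetime import date, timedelta

def compute_window(last_dates, window_days):
    """Return (start, end) for a window ending at the latest known activity date."""
    # Log types with no activity for the user contribute None and are ignored
    known = [d for d in last_dates if d is not None]
    if not known:
        raise ValueError("no activity dates available")
    end = max(known)
    start = end - timedelta(days=window_days)
    return start, end

# Hypothetical last-activity dates, one per log type
last_seen = [date(2011, 4, 1), None, date(2011, 4, 20), date(2011, 3, 15)]
start, end = compute_window(last_seen, window_days=60)
print(start.isoformat(), "to", end.isoformat())
```

Anchoring the window to the user's own last activity, rather than to today's date, keeps the analysis meaningful for users whose logs stop before the end of the dataset.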

#### AOAI Prompts
_"You are a cybersecurity analyst...."_

* **Chunk-Level Prompts** - For each log chunk, a prompt describes the type of data contained in that chunk and instructs the model to flag suspicious entries with timestamps and relevance scores.

* **Synthesis Prompts** - A synthesis prompt combines all chunk summaries for a log type into a single summary that highlights the most important findings and flagged entries.

* **Final Report Prompt** - The final prompt integrates the synthesized summaries, the user's background, and the engineered features, and instructs AOAI to generate a structured report.
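
How the three prompt levels feed into each other can be sketched with a stub in place of the model. Everything here is illustrative: `run_chain`, the prompt builders, and `fake_llm` are hypothetical names, and the prompt wording is abbreviated compared with the real prompts used later in this notebook.

```python
def chunk_prompt(log_type, description, chunk_text, index, total):
    # Chunk-level prompt: describe the data, then ask for flagged entries
    return (
        f"Analyze the following {log_type} logs (chunk {index + 1} of {total}).\n"
        f"{description}\n"
        "Flag suspicious entries with a timestamp and a relevance score (1-5).\n"
        f"Logs:\n{chunk_text}"
    )

def synthesis_prompt(log_type, chunk_summaries):
    # Synthesis prompt: merge all chunk summaries for one log type
    joined = "\n".join(chunk_summaries)
    return f"Combine these {log_type} chunk summaries into a single summary:\n{joined}"

def final_prompt(summaries_by_type, background, features):
    # Final prompt: log-type summaries plus user background and features
    sections = "\n".join(f"{k} summary: {v}" for k, v in summaries_by_type.items())
    return f"Background: {background}\nFeatures: {features}\n{sections}\nWrite a structured report."

def run_chain(llm, chunks_by_type, descriptions, background, features):
    summaries = {}
    for log_type, chunks in chunks_by_type.items():
        partial = [
            llm(chunk_prompt(log_type, descriptions[log_type], text, i, len(chunks)))
            for i, text in enumerate(chunks)
        ]
        summaries[log_type] = llm(synthesis_prompt(log_type, partial))
    return llm(final_prompt(summaries, background, features))

# Stub model that records each prompt and returns a numbered placeholder
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"[summary {len(calls)}]"

report = run_chain(
    fake_llm,
    {"device": ["a", "b"], "logon": ["c"]},
    {"device": "USB events.", "logon": "Logon events."},
    {"role": "engineer"},
    {"recent_usb": 3},
)
print(len(calls), report)
```

Passing the model in as a callable makes the chain easy to dry-run without spending tokens.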

#### Example AOAI Output
See: example_aoai_anomaly_analysis_output.md

#### Before Running This Notebook
- Update "api_key" with your Azure OpenAI API Key
- Set the "user" variable to the username you want to investigate
- Set the "log_window_days" variable to the number of recent days you would like to analyze. Matching the time range used for the isolation forest anomaly detection is recommended.
- Ensure the cleaned log tables (device, email, file, logon, http) are accessible
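
As an alternative to pasting the key into the notebook, the connection settings could be read from environment variables. The sketch below is one possible approach, not part of the original workflow: `load_aoai_settings` is a hypothetical helper, and the variable names are an assumed naming choice that you would need to set yourself.

```python
import os

def load_aoai_settings(env=os.environ):
    """Collect AOAI connection settings from a mapping of environment variables."""
    required = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_key": env["AZURE_OPENAI_API_KEY"],
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
        # Falls back to the API version used elsewhere in this notebook
        "api_version": env.get("OPENAI_API_VERSION", "2025-01-01-preview"),
    }

# Example with an explicit mapping instead of the real environment
settings = load_aoai_settings({
    "AZURE_OPENAI_API_KEY": "placeholder",
    "AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com/",
})
print(sorted(settings))
```

The resulting dictionary has the same keyword names as the `AzureOpenAI(...)` call below, so it could be passed as `AzureOpenAI(**settings)`.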


In [None]:
!pip install openai
from openai import AzureOpenAI
from IPython.display import Markdown, display
import pandas as pd

In [None]:
# UPDATE THIS to the username you would like to investigate
user = "XXXXXXXX-ID"

In [None]:
# Connect to AOAI
client = AzureOpenAI(
    api_key= "AOAI_API_KEY", # REPLACE WITH YOUR KEY
    api_version="2025-01-01-preview",
    azure_endpoint="https://cyber-aoai.openai.azure.com/"
)

#### Step 1) Pull user logs for a dynamic time window 

In [None]:
# UPDATE THIS to your desired time window
log_window_days = 60

In [None]:
# Get the date range for the last {log_window_days} of activity for the user
# This is how far in the past the log data will be pulled / investigation analysis will be performed
def get_last_date(table_name, user, date_col="date"):
    query = f"SELECT MAX({date_col}) AS last_date FROM {table_name} WHERE user = '{user}'"
    return spark.sql(query).collect()[0]["last_date"]

device_last = get_last_date("clean_device_events", user)
email_last = get_last_date("clean_email_events", user)
file_last = get_last_date("clean_file_events", user)
logon_last = get_last_date("clean_logon_events", user)
http_last = get_last_date("clean_http_events", user)

# Get the most recent activity date across all log types
all_dates = [d for d in [device_last, email_last, file_last, logon_last, http_last] if d is not None]
if not all_dates:
    raise ValueError(f"No log activity found for user {user}")
most_recent_date = max(all_dates)

# Compute the start of the analysis window (log_window_days before the last activity)
start_window = most_recent_date - pd.Timedelta(days=log_window_days)

start_date_str = start_window.strftime("%Y-%m-%d")
end_date_str = most_recent_date.strftime("%Y-%m-%d")

print(f"User: {user}")
print(f"Most recent activity date: {end_date_str}")
print(f"{log_window_days}-day window: {start_date_str} to {end_date_str}")

#### Step 2) Summarize logs with chunking

In [None]:
# chunking set up
chunk_size = 500
max_tokens_per_chunk = 1000

In [None]:
# Log source queries with descriptions of what is contained in the dataset and what to look for in analysis
log_sources = {
    "device": {
        "query": f"""
            SELECT date, user, pc, activity
            FROM clean_device_events
            WHERE user = '{user}' AND date BETWEEN DATE('{start_date_str}') AND DATE('{end_date_str}')
            ORDER BY date ASC
        """,
        "columns": ["date", "pc", "activity"],
        "description": "Device logs capture USB thumb drive connect/disconnect events. \
        Some disconnects may be missing due to power-downs.\
        Deviations from a user's normal usage may indicate data exfiltration."
    },
    "email": {
        "query": f"""
            SELECT date, user, pc, to, cc, bcc, from, size, attachments, content
            FROM clean_email_events
            WHERE user = '{user}' AND date BETWEEN DATE('{start_date_str}') AND DATE('{end_date_str}')
            ORDER BY date ASC
        """,
        "columns": ["date", "pc", "from", "to", "cc", "bcc", "attachments", "content"],
        "description": "Email logs include sender/recipient metadata and content. \
        External recipients (non-DTAA email addresses) with large attachments may be suspicious. \
        Content is keyword-based and not tied to subject/body."
    },
    "file": {
        "query": f"""
            SELECT date, user, pc, filename, content
            FROM clean_file_events
            WHERE user = '{user}' AND date BETWEEN DATE('{start_date_str}') AND DATE('{end_date_str}')
            ORDER BY date ASC
        """,
        "columns": ["date", "pc", "filename", "content"],
        "description": "File logs represent file copies to removable media. \
        Content includes file headers and keywords. \
        Deviations from normal copy volume or sensitive filenames may be suspicious."
    },
    "logon": {
        "query": f"""
            SELECT date, user, pc, activity
            FROM clean_logon_events
            WHERE user = '{user}' AND date BETWEEN DATE('{start_date_str}') AND DATE('{end_date_str}')
            ORDER BY date ASC
        """,
        "columns": ["date", "pc", "activity"],
        "description": "Logon logs include logon/logoff events. \
        Screen unlocks are recorded as logons. Screen locks are not recorded. \
        Deviations from normal after-hours (outside of 6 AM to 6 PM) and weekend logon activity may be suspicious."
    },
    "http": {
        "query": f"""
            SELECT date, user, url, pc
            FROM clean_http_events
            WHERE user = '{user}' AND date BETWEEN DATE('{start_date_str}') AND DATE('{end_date_str}')
            ORDER BY date ASC
        """,
        "columns": ["date", "pc", "url"],
        "description": "HTTP logs capture web browsing activity. \
        Visits to risky domains or domains linked to malware/keylogging may indicate insider threat behavior."
    }
}

In [None]:
# chunk DataFrame
def chunk_dataframe(df, size):
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]

# summarize a single chunk
def summarize_chunk(log_type, chunk_df, chunk_index, total_chunks, description):
    chunk_text = "\n".join([" | ".join(str(row[col]) for col in chunk_df.columns) for _, row in chunk_df.iterrows()])
    prompt = f"""
        You are a cybersecurity analyst. Analyze the following {log_type} logs (Chunk {chunk_index + 1} of {total_chunks}).
        {description}

        Instructions:
        - Summarize key behaviors or anomalies.
        - Flag any log entries that may indicate insider threat activity.
        - For each flagged entry, include the timestamp and a relevance score (1-5).
        - Keep your response under {max_tokens_per_chunk} tokens.

        Logs:
        {chunk_text}
        """
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "You are a cybersecurity analyst assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
        max_tokens=max_tokens_per_chunk
    )
    return response.choices[0].message.content

# summarize all chunks for a log dataset
def summarize_log_type(log_type):
    log_info = log_sources[log_type]
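    df = spark.sql(log_info["query"]).toPandas().drop(columns=["user"])
    # Skip the AOAI calls when the user has no logs of this type in the window
    if df.empty:
        return f"No {log_type} logs found for this user in the analysis window."
    chunks = chunk_dataframe(df, chunk_size)
    summaries = []
    for i, chunk in enumerate(chunks):
        summary = summarize_chunk(log_type, chunk, i, len(chunks), log_info["description"])
        summaries.append(summary)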
    combined_prompt = f"""
        You are a cybersecurity analyst. Below are summaries of {log_type} log chunks for user {user}.
        Combine them into a single summary highlighting the most important findings, flagged entries, and behavioral patterns.

        Summaries:
        {chr(10).join(summaries)}
        """
    final_response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "You are a cybersecurity analyst assistant."},
            {"role": "user", "content": combined_prompt}
        ],
        temperature=0.3,
        max_tokens=1000
    )
    return final_response.choices[0].message.content
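
The slicing and row serialization used above can be checked without Spark or AOAI. The following sketch applies the same pattern to a plain list of dictionaries; `chunk_rows`, `rows_to_text`, and the sample rows are hypothetical stand-ins for the DataFrame-based functions.

```python
def chunk_rows(rows, size):
    # Same slicing pattern as chunk_dataframe, applied to a list
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def rows_to_text(rows, columns):
    # One line per row, with column values separated by " | "
    return "\n".join(" | ".join(str(row[col]) for col in columns) for row in rows)

# Five made-up device events
rows = [{"date": f"2011-01-0{i}", "pc": "PC-1", "activity": "Connect"} for i in range(1, 6)]
chunks = chunk_rows(rows, size=2)
last_chunk_text = rows_to_text(chunks[-1], ["date", "pc", "activity"])
print(len(chunks))
print(last_chunk_text)
```

The last chunk is simply shorter when the row count is not a multiple of the chunk size, so no rows are dropped.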

In [None]:
# chunk and synthesize summaries for each log type
# this may take a few minutes depending on number of logs
# try altering the chunk size to speed up summarization

device_summary = summarize_log_type("device")
email_summary = summarize_log_type("email")
file_summary = summarize_log_type("file")
logon_summary = summarize_log_type("logon")
http_summary = summarize_log_type("http")
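
To see how the chunk size affects run time, the number of AOAI requests can be estimated up front. This is a rough sketch with made-up row counts. It assumes every log type has at least one row, so each type costs one request per chunk plus one synthesis request, and the final report adds one more. `estimate_calls` is a hypothetical helper.

```python
import math

def estimate_calls(row_counts, chunk_size):
    """Estimate total AOAI requests: chunk calls + one synthesis per log type + final report."""
    per_type = sum(math.ceil(n / chunk_size) + 1 for n in row_counts.values())
    return per_type + 1

# Hypothetical row counts per log type for one user
counts = {"device": 120, "email": 1800, "file": 40, "logon": 900, "http": 5200}
for size in (500, 1000):
    print(size, estimate_calls(counts, size))
```

Larger chunks mean fewer requests but longer prompts, so the chunk size is bounded by the model's context window.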

#### Step 3) Generate structured final report

In [None]:
# Get the user's engineered features
# (created from engineer_model_features.ipynb, stored in model_features table)
features_df = spark.sql(f"""
SELECT *
FROM model_features
WHERE user = '{user}'
""").toPandas()
if not features_df.empty:
    features = features_df.iloc[0].to_dict()
else:
    features = {}

# Get the user's description
# (LDAP details from clean_user_details.ipynb, stored in clean_user_details table)
# includes employee background on the user's role, supervisor, etc.
user_details_df = spark.sql(f"""
SELECT *
FROM clean_user_details
WHERE user = '{user}'
""").toPandas()
if not user_details_df.empty:
    user_details = user_details_df.iloc[0].to_dict()
else:
    user_details = {}
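
The engineered features include "_spike_ratio" values that compare recent activity with baseline activity. The actual definitions live in engineer_model_features.ipynb and are not shown here, so the sketch below is only one plausible formulation (a ratio of daily rates). The function name, the default window lengths, and the `eps` guard are all assumptions.

```python
def spike_ratio(recent_total, baseline_total, recent_days=14, baseline_days=60, eps=1e-9):
    """Ratio of the recent daily event rate to the baseline daily event rate."""
    recent_rate = recent_total / recent_days
    baseline_rate = baseline_total / baseline_days
    # eps avoids division by zero when the user has no baseline activity
    return recent_rate / (baseline_rate + eps)

# 28 events in the recent 14 days vs. 60 events in the 60-day baseline
ratio = spike_ratio(28, 60)
print(round(ratio, 2))
```

Under this formulation, a value near 1 means recent activity matches the baseline rate, and larger values indicate a surge.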

In [None]:
# Build prompt
def build_prompt(user_id, user_details, features, device_summary, email_summary, file_summary, logon_summary, http_summary):
    return f"""
    You are a cybersecurity analyst assistant. Your task is to analyze user behavior to assess the risk of 
    insider threat activity over a {log_window_days}-day time range from {start_date_str} to {end_date_str}. 
    Use behavioral patterns, anomalies, and semantic cues to support your assessment. 
    Write in a concise, analytical tone. Focus on:
    - Patterns over time (e.g., spikes, shifts, or anomalies)
    - Deviations from baseline behavior
    - Malicious activity
    -----

    ## User Profile
    User ID: {user_id}
    User Background Information: {str(user_details)}

    ## User Activity Features: {features}
    The user activity features are derived from raw data sources such as device logs, email logs, file logs, and HTTP logs. 
    These features capture significant patterns and anomalies in user behavior that may indicate potential risks. 
    The key components include:
    - variables starting with "recent_" capture user behavior in the 14 days leading up to their last recorded event.
    This reflects short-term activity and is crucial for detecting pre-departure anomalies.
    - variables starting with "baseline_" capture typical user behavior in the 60 days prior to the recent window. 
    - variables ending with "_spike_ratio" compare recent vs. baseline activity to quantify unusual surges

    ## Log activity summaries
    These log analysis summaries were compiled by reviewing log data over the {start_date_str} to {end_date_str} time range.

    Recent Device Event Summary:
    {device_summary}

    Recent Email Events Summary:
    {email_summary}

    Recent File Events Summary:
    {file_summary}

    Recent Logon Events Summary:
    {logon_summary}

    Recent HTTP Events Summary:
    {http_summary}

    -------

    Use the structured template below to summarize your findings.

    ## Analysis Output Template

    **User Summary**
    User: [Full Name] ({user_id}) — [1-sentence overview based on background]
    Time Window Analyzed: {start_date_str} to {end_date_str}

    **Behavior Summary**
    [1-3 sentence summary of the user's recent behavior, highlighting any shifts or patterns]

    **Anomalous Activities** [sort most anomalous/malicious to least] \n
    1. [High level description of anomalous activity] : [1 sentence of events or behavior] \n
    2. [High level description of anomalous activity] : [1 sentence of events or behavior] \n
    ....[add more if necessary]

    **Anomalous Timeline of Events** [sort most recent to oldest] \n
    - [Date Range 1] : [1-3 sentences of events or behavior] \n
    - [Date Range 2] : [1-3 sentences of events or behavior] \n
    ....[add more if necessary]

    **Risk Assessment**
    - Risk Level: [Low / Medium / High]
    - Justification: [Brief explanation based on data]

    **Recommendations**
    - [Suggested next steps: e.g., escalate, monitor, interview, etc.]

    ---

    Only use the data provided. Do not fabricate or assume information not present in the log summaries or features. 
    Provide examples or strong justifications. Only include activities/analysis you are certain of.
    When providing URL links from the log summaries, make sure they are not clickable. 
    Do not include user emotions; focus on technical facts.
    """

In [None]:
# Submit prompt to AOAI - with user engineered features, employee LDAP info, and log summaries
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a cybersecurity analyst assistant."},
        {"role": "user", "content": build_prompt(user, user_details, features, device_summary, email_summary, file_summary, logon_summary, http_summary)}
    ],
    temperature=0.3,
    max_tokens=1000
)

In [None]:
# Display analysis results
aoai_output = response.choices[0].message.content
display(Markdown(aoai_output))
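
To keep the report alongside the notebook, for example as a file similar to example_aoai_anomaly_analysis_output.md, the Markdown text could be written to disk. This is a minimal sketch. `save_report`, the file naming scheme, and the sample text are hypothetical, and a temporary directory stands in for a real output location.

```python
from pathlib import Path
import tempfile

def save_report(markdown_text, user_id, out_dir):
    """Write the report to <out_dir>/aoai_anomaly_analysis_<user_id>.md and return the path."""
    out_path = Path(out_dir) / f"aoai_anomaly_analysis_{user_id}.md"
    out_path.write_text(markdown_text, encoding="utf-8")
    return out_path

# Demonstration with placeholder text; in the notebook this would be aoai_output
with tempfile.TemporaryDirectory() as tmp:
    path = save_report("**Risk Assessment**\n- Risk Level: Low", "EXAMPLE-ID", tmp)
    first_line = path.read_text(encoding="utf-8").splitlines()[0]
    print(path.name, first_line)
```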