## AOAI Insider Threat Analysis

This notebook uses Azure OpenAI (AOAI) to simulate the reasoning of a cybersecurity analyst. Once the train_isolation_forest.ipynb notebook has flagged users as anomalous, this notebook can be used to further investigate and explain each user's behavior. The goal is to analyze user behavior across multiple activity domains (device, email, file, HTTP, and logon) using both recent and baseline features.

**AOAI prompt:**
- Combines user background, engineered features, and raw event logs.
- Encourages temporal and semantic reasoning to detect anomalies.
- Produces a structured analysis summary including risk level, anomalous activities, and recommendations.

**Example AOAI output:**
- example_aoai_anomaly_analysis_output.md

**Before running the notebook:**
- Update `api_key` with your AOAI API key.
- Update the `user` variable with the username of the user you would like to analyze.
- Update the `max_logs` variable with the maximum number of log events per log type that your AOAI deployment can accept in one prompt (experiment with different values).


In [None]:
!pip install openai
from openai import AzureOpenAI
from IPython.display import Markdown, display

In [None]:
# Connect to AOAI
client = AzureOpenAI(
    api_key="AOAI_API_KEY", # REPLACE WITH YOUR KEY
    api_version="2025-01-01-preview",
    azure_endpoint="https://cyber-aoai.openai.azure.com/"
)
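
Hardcoding the key in a notebook cell makes it easy to commit by accident. A minimal sketch of reading it from the environment instead; the variable name `AOAI_API_KEY` and the helper `load_api_key` are assumptions for illustration and are not part of the original notebook.

```python
import os


def load_api_key(env_var="AOAI_API_KEY", environ=None):
    """Return the API key stored in an environment variable.

    The variable name and this helper are hypothetical; adapt them to
    however your workspace stores secrets.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before connecting.")
    return key
```

With this in place, the client cell could pass `api_key=load_api_key()` rather than a literal string.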

In [None]:
# Build prompt
def build_prompt(user_id, user_details, features, device_logs, email_logs, file_logs, http_logs):
    return f"""
    You are a cybersecurity analyst assistant. Your task is to analyze user behavior logs and features to assess the risk of 
    insider threat activity. Use behavioral patterns, anomalies, and semantic cues to support your assessment. 
    Write in a concise, analytical tone.


    ## User Profile
    User ID: {user_id}
    Background Information: {str(user_details)}

    ## User Activity Features: {features}

    Recent Device Events:
    {device_logs}

    Recent Email Events:
    {email_logs}

    Recent File Events:
    {file_logs}

    Recent HTTP Events:
    {http_logs}

    Analyze the logs and features above to assess potential insider threat behavior. Focus on:
    - Patterns over time (e.g., spikes, shifts, or anomalies)
    - Semantic meaning in the logs (e.g., risky URLs, external email activity)
    - Deviations from baseline behavior

    Use the structured template below to summarize your findings.

    ## Analysis Output Template

    **User Summary**
    User: [Full Name] ({user_id}) — [1-sentence overview based on background]

    **Behavior Summary**
    [1-3 sentence summary of the user's recent behavior, highlighting any shifts or patterns]

    **Anomalous Activities**
    1. [Most notable suspicious activity with context]
    2. [Second most notable activity]
    3. [Optional third]

    **Anomalous Timeline of Events**
    - [Date Range 1] — [Summary of events or behavior]
    - [Date Range 2] — [Summary of events or behavior]

    **Risk Assessment**
    - Risk Level: [Low / Medium / High]
    - Justification: [Brief explanation based on data]

    **Recommendations**
    - [Suggested next steps: e.g., escalate, monitor, interview, etc.]

    ---
    Only use the data provided. Do not fabricate or assume information not present in the logs or features.
    """

### Analyze Anomalous User

In [None]:
user = "USERNAME" # UPDATE WITH RELEVANT USERNAME
max_logs = 500 # UPDATE DEPENDING ON AOAI PROMPT TOKEN LIMITS
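
Tuning `max_logs` by trial and error can be supplemented with a rough size check. The sketch below assumes about 4 characters per token, a common rule of thumb for English prose that is likely to be inaccurate for log data; `estimate_tokens` and `trim_to_budget` are hypothetical helpers, not part of the original notebook.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; chars_per_token=4 is an assumed heuristic."""
    return -(-len(text) // chars_per_token)  # ceiling division


def trim_to_budget(log_lines, token_budget, chars_per_token=4):
    """Keep leading log lines (most recent first) until the budget is used up."""
    kept, used = [], 0
    for line in log_lines:
        cost = estimate_tokens(line + "\n", chars_per_token)
        if used + cost > token_budget:
            break
        kept.append(line)
        used += cost
    return kept
```

Because the queries in this notebook sort by date descending, trimming from the end of the list drops the oldest events first.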

In [None]:
# pull recent user logs (number of events: max_logs)

# device logs
device_logs_df = spark.sql(f"""
SELECT date, user, pc, activity
FROM clean_device_events
WHERE user = '{user}'
ORDER BY date DESC
LIMIT {max_logs}
""").toPandas().drop(columns=['user'])

# email logs
email_logs_df = spark.sql(f"""
SELECT date, user, pc, `to`, cc, bcc, `from`, size, attachments, content
FROM clean_email_events
WHERE user = '{user}'
ORDER BY date DESC
LIMIT {max_logs}
""").toPandas().drop(columns=['user'])

# file logs
file_logs_df = spark.sql(f"""
SELECT date, user, pc, filename, content
FROM clean_file_events
WHERE user = '{user}'
ORDER BY date DESC
LIMIT {max_logs}
""").toPandas().drop(columns=['user'])

# logon logs
logon_logs_df = spark.sql(f"""
SELECT date, user, pc, activity
FROM clean_logon_events
WHERE user = '{user}'
ORDER BY date DESC
LIMIT {max_logs}
""").toPandas().drop(columns=['user'])

# http logs
http_logs_df = spark.sql(f"""
SELECT date, user, pc, url
FROM clean_http_events
WHERE user = '{user}'
ORDER BY date DESC
LIMIT {max_logs}
""").toPandas().drop(columns=['user'])
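
The queries above interpolate `user` directly into the SQL text, so an unexpected value (for example one containing a quote) would change the query. A minimal sketch of validating the value first; the allowed character set is an assumption about how usernames look in this dataset, and `validate_username` is a hypothetical helper.

```python
import re

# Assumed username shape: letters, digits, dot, underscore, hyphen.
_USERNAME_RE = re.compile(r"[A-Za-z0-9._-]+")


def validate_username(value):
    """Return value unchanged if it matches the assumed pattern, else raise."""
    if not isinstance(value, str) or not _USERNAME_RE.fullmatch(value):
        raise ValueError(f"Unexpected characters in username: {value!r}")
    return value
```

Calling `user = validate_username(user)` before the queries run would stop the notebook early on a malformed value.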

In [None]:
# format logs for AOAI prompt
def format_logs(df, columns, max_rows=max_logs):
    df = df.head(max_rows)
    return "\n".join([" | ".join(str(row[col]) for col in columns) for _, row in df.iterrows()])

device_logs_str = format_logs(device_logs_df, ["date", "pc", "activity"])
email_logs_str = format_logs(email_logs_df, ["date", "pc", "from", "to", "cc", "bcc", "size", "attachments", "content"])
file_logs_str = format_logs(file_logs_df, ["date", "pc", "filename", "content"])
logon_logs_str = format_logs(logon_logs_df, ["date", "pc", "activity"])
http_logs_str = format_logs(http_logs_df, ["date", "pc", "url"])
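
Each formatted line is a pipe-separated record in the column order given. The standard-library sketch below reproduces that layout on a list of dicts (the shape returned by `df.to_dict("records")`) and additionally blanks out missing values instead of printing `None` or `nan`. The helper name `format_records` and the sample rows are made up for illustration.

```python
def format_records(records, columns):
    """Join the chosen columns of each record with ' | ', one record per line."""
    def cell(value):
        # Treat None and NaN (the only value not equal to itself) as empty.
        if value is None or value != value:
            return ""
        return str(value)

    return "\n".join(" | ".join(cell(rec.get(col)) for col in columns) for rec in records)


# Made-up sample rows for illustration only.
sample = [
    {"date": "2010-01-04 08:12:00", "pc": "PC-0001", "activity": "Connect"},
    {"date": "2010-01-04 08:40:00", "pc": "PC-0001", "activity": None},
]
formatted = format_records(sample, ["date", "pc", "activity"])
print(formatted)
```

Blanking missing fields such as empty `cc` or `bcc` values keeps the prompt shorter and avoids the model reading the literal word "None" as data.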

In [None]:
# Get the user's engineered features
# (created from engineer_model_features.ipynb, stored in model_features table)

features_df = spark.sql(f"""
SELECT *
FROM model_features
WHERE user = '{user}'
""").toPandas()

# Convert the row to a dictionary
if not features_df.empty:
    features_dict = features_df.iloc[0].to_dict()
else:
    features_dict = {}
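
`features_dict` reaches the prompt through its Python `repr`. A possible alternative is to serialize it as indented JSON with rounded floats, which tends to be easier to read. This is a sketch only; `features_to_json` is a hypothetical helper.

```python
import json


def features_to_json(features, ndigits=3):
    """Serialize a feature dict as indented JSON, rounding floats for brevity."""
    rounded = {
        key: round(value, ndigits) if isinstance(value, float) else value
        for key, value in features.items()
    }
    # default=str covers values json cannot encode natively, such as timestamps.
    return json.dumps(rounded, indent=2, default=str, sort_keys=True)
```

The result could be passed to `build_prompt` in place of `features_dict`, since the function only interpolates the value into the prompt text.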

In [None]:
# Get the user's description
# (LDAP details from clean_user_details.ipynb, stored in clean_user_details table)
# Includes employee background such as the user's role, supervisor, etc.

user_details_df = spark.sql(f"""
SELECT *
FROM clean_user_details
WHERE user = '{user}'
""").toPandas()

In [None]:
# Submit prompt to AOAI - with user engineered features, employee LDAP info, and event logs
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a cybersecurity analyst assistant."},
        {"role": "user", "content": build_prompt(user, user_details_df, features_dict, device_logs_str, email_logs_str, file_logs_str, http_logs_str)}
    ],
    temperature=0.3,
    max_tokens=1000
)
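
Note that `logon_logs_str` is formatted earlier but `build_prompt` has no logon parameter, so logon events never reach the model in the call above. One possible way to include them without changing the function is to append a titled section to the prompt text. This is a sketch under assumptions: `append_log_section` is a hypothetical helper, and placing the section after the output template is a choice that may affect how the model weighs it.

```python
def append_log_section(prompt, title, log_text):
    """Return prompt with an extra titled log section appended.

    Empty log text is replaced by a placeholder so the model is told
    explicitly that no events were found.
    """
    body = log_text.strip() or "(no events)"
    return f"{prompt.rstrip()}\n\n{title}:\n{body}\n"
```

For example, the user message content could be built as `append_log_section(build_prompt(...), "Recent Logon Events", logon_logs_str)`.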

In [None]:
# Display analysis results
aoai_output = response.choices[0].message.content
display(Markdown(aoai_output))
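
Because the template asks for a `Risk Level: [Low / Medium / High]` line, the rating can be pulled out of the response for triage or logging. The sketch below assumes the model follows the template, which is not guaranteed; `extract_risk_level` is a hypothetical helper.

```python
import re

# Tolerates optional markdown bold markers and brackets around the rating.
_RISK_RE = re.compile(r"Risk Level:\**\s*\[?\s*(Low|Medium|High)\b", re.IGNORECASE)


def extract_risk_level(analysis_text):
    """Return 'Low', 'Medium', or 'High' if a Risk Level line is found, else None."""
    match = _RISK_RE.search(analysis_text)
    return match.group(1).capitalize() if match else None
```

Calling `extract_risk_level(aoai_output)` returns `None` when no rating is found, which is a useful signal that the response should be reviewed manually.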