## **Introduction**

This notebook is my Capstone project submission, showing what I learned during the **5-Day Gen AI Intensive Course with Google**, which took place from Monday, March 31 to Friday, April 4, 2025.


In my project, I wanted to combine my professional cybersecurity expertise with the skills I gained during the course. The project showcases the following genAI capabilities:

- Function calling
- Embeddings
- Retrieval-augmented generation (RAG)

In this notebook, I cover tasks that are often required when working in a Security Operations Center (SOC), especially in companies such as Managed Security Service Providers (MSSPs).
Every day, MSSPs deal with a huge number of security events and alerts across multiple clients' environments. This can be a benefit: they observe different threat scenarios for various clients and, thanks to the vast amount of data, learn about possible attack surfaces.
And where there is big data, there is a good opportunity for machine learning or generative AI models. This trend is already visible publicly: ML and AI components are embedded in security tools such as EDRs and SIEMs. However, it also brings an obstacle, namely the privacy of such data. Nobody would want their security alerts to be used for training models that could then be used publicly.

For this notebook I used the ["Microsoft Security Incident Prediction" dataset](https://www.kaggle.com/datasets/Microsoft/microsoft-security-incident-prediction), which ensures privacy through a stringent anonymization process in which sensitive values are pseudo-anonymized using SHA1 hashing techniques. Moreover, those hashes were replaced by randomly generated IDs to further enhance anonymity and prevent any potential re-identification. With such curated data, we can verify generative AI capabilities while privacy is preserved.

Moreover, as part of this project, I wanted to show how genAI tools could be leveraged in another part of security operations: threat hunting. In the presented case, I show how genAI models could enhance the capabilities of security teams by combining an internal knowledge base with external OSINT resources.

**For this project I used Google's Gemini models.**

## **Creating notebook**

Since this was my first notebook on the Kaggle platform, I will also share the steps I took to begin this project.

I started by navigating to the ["Microsoft Security Incident Prediction" dataset](https://www.kaggle.com/datasets/Microsoft/microsoft-security-incident-prediction) and clicking on the "three dots" menu. I selected the "New Notebook" option, which immediately created a new notebook under my profile. I was presented with the short code snippet below:

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

After running the above code, you should see the two CSV file paths that will be used in this project. I wouldn't be a true cyber engineer if I hadn't slightly tweaked it to see whether it might reveal some interesting files under the "/kaggle/" directory.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

But no config files were found this time :) Of course, don't try to run it on the entire '/' directory, as it would break your notebook execution!

Anyway, let's begin the first part of this project, which is automated function calling!

# **Function calling**

The purpose of this part of the project is to create a notebook that supports the work of a security analyst. Let's assume that they work at an MSSP that provides services for various organizations. The prepared dataset contains events from previous security alerts that were raised for their clients. This part of the project is a proof-of-concept that shows how to equip generative AI models with the right functions to support everyday tasks when investigating new incidents. The tasks could include, but are not limited to, the examples below:

- What are the initiating process folder paths involved in incidents reported for the exfiltration tactic?
- I observed a suspicious file with a given SHA256 value - was this hash observed in the past for org id = X?

The tasks could also support requests from the MSSP's management board, e.g.:

- What is the ID of the organization that produced the most reported incidents?
- What is the fidelity of reported incidents for organization ID = Y?


To prepare the Gemini genAI model to support some of the tasks that fall into these categories, we will begin by setting up the notebook requirements.

## **Function calling requirements**

Install the required Python packages with pip:

In [1]:
!pip uninstall -qqy jupyterlab kfp  
!pip install -qU "google-genai==1.7.0"

Then, let's import the other modules required in the function calling part. Remember to set up your "GOOGLE_API_KEY" secret under the Add-ons menu above. If everything is correct, you should see the genai module version. In my case it was '1.7.0'.

In [2]:
#import numpy as np # linear algebra
import pandas as pd # will be used for data processing in CSV files

from kaggle_secrets import UserSecretsClient
from google import genai
from google.genai import types

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")



genai.__version__

'1.7.0'

To create functions that help the genAI model make good use of the given knowledge base of security events, let's first review what is inside it. Information about the dataset is also available on the dataset home page linked above.
For my project I made a couple of decisions, which you will see reflected in the code below:

- **I specified only 16 column names** - I wanted to focus on the key parts related to incidents. This is enough for data processing, especially since anonymization is applied and I would not be able to identify more interesting insights myself.
- **I use only 1000 rows of the GUIDE_Train.csv file** - I wanted to speed up the processing of the dataset and avoid resource exhaustion. After all, this dataset contains 13 million pieces of evidence in total.

Let's run the code below. If you see a summary of the first 15 rows, it means that read_csv() from pandas worked correctly.

In [3]:
train_data_path = '../input/microsoft-security-incident-prediction/GUIDE_Train.csv'
col_names = ['ID','ORGID','ALERTID','INCIDENTID','TIMESTAMP','ALERTTITLE','CATEGORY',
                 'MITRETECHNIQUES','INCIDENTGRADE', 'ENTITYTYPE',
                 'DEVICEID','ACCOUNTUPN','FOLDERPATH','SHA256','URL','IPADDRESS']

train_dataset = pd.read_csv(train_data_path,index_col=0,usecols=lambda col: col.upper() in col_names, nrows=1000)

train_dataset.head(15)

Id,OrgId,IncidentId,AlertId,Timestamp,AlertTitle,Category,MitreTechniques,IncidentGrade,EntityType,DeviceId,Sha256,IpAddress,Url,AccountUpn,FolderPath
180388628218,0,612,123247,2024-06-04T06:05:15.000Z,6,InitialAccess,,TruePositive,Ip,98799,138268,27,160396,673934,117668
455266534868,88,326,210035,2024-06-14T03:01:25.000Z,43,Exfiltration,,FalsePositive,User,98799,138268,360606,160396,23032,117668
1056561957389,809,58352,712507,2024-06-13T04:52:55.000Z,298,InitialAccess,T1189,FalsePositive,Url,98799,138268,360606,68652,673934,117668
1279900258736,92,32992,774301,2024-06-10T16:39:36.000Z,2,CommandAndControl,,BenignPositive,Url,98799,138268,360606,13,673934,117668
214748368522,148,4359,188041,2024-06-15T01:08:07.000Z,74,Execution,,TruePositive,User,98799,138268,360606,160396,592,117668
1322849927433,11,417400,825450,2024-06-10T13:30:56.000Z,0,InitialAccess,T1078;T1078.004,FalsePositive,Ip,98799,138268,30410,160396,673934,117668
163208760309,522,566,705663,2024-06-14T23:19:45.000Z,2,CommandAndControl,,BenignPositive,Url,98799,138268,360606,3306,673934,117668
1400159339557,125,38679,47423,2024-06-06T13:39:23.000Z,3919,Exfiltration,,BenignPositive,MailMessage,98799,138268,360606,160396,34744,117668
1219770713645,21,414,197969,2024-06-09T10:21:29.000Z,4,SuspiciousActivity,,BenignPositive,Process,98799,0,360606,160396,673934,1694
1073741827836,72,70,831157,2024-06-08T02:08:01.000Z,3,InitialAccess,,TruePositive,User,98799,138268,360606,160396,5532,117668
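
To make the two `read_csv()` options used above easier to see in isolation, here is a minimal sketch on an in-memory CSV. The CSV content, the `Extra` column and the shortened `wanted` list are made up for illustration and are not part of the real dataset.

```python
import io
import pandas as pd

# Toy CSV standing in for GUIDE_Train.csv (hypothetical values and columns).
csv_text = """Id,OrgId,IncidentId,Extra,Sha256
10,0,612,a,138268
11,88,326,b,138268
12,809,58352,c,4
"""

# Upper-case names, matched case-insensitively like col_names in the notebook.
wanted = ['ID', 'ORGID', 'INCIDENTID', 'SHA256']

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,                                # first selected column (Id) becomes the index
    usecols=lambda col: col.upper() in wanted,  # callable is evaluated for each column name
    nrows=2,                                    # stop after two data rows
)
print(df)
```

The `Extra` column is dropped by the `usecols` callable, and only the first two data rows are parsed, which is the same mechanism that limits the real dataset to 1000 rows.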


We could also be more specific with our initial queries to explore the dataset. For example, the line below returns the rows where MitreTechniques has a value.

In [4]:
train_dataset.loc[train_dataset["MitreTechniques"].notnull()]

Id,OrgId,IncidentId,AlertId,Timestamp,AlertTitle,Category,MitreTechniques,IncidentGrade,EntityType,DeviceId,Sha256,IpAddress,Url,AccountUpn,FolderPath
1056561957389,809,58352,712507,2024-06-13T04:52:55.000Z,298,InitialAccess,T1189,FalsePositive,Url,98799,138268,360606,68652,673934,117668
1322849927433,11,417400,825450,2024-06-10T13:30:56.000Z,0,InitialAccess,T1078;T1078.004,FalsePositive,Ip,98799,138268,30410,160396,673934,117668
781684051738,2119,6622,23284,2024-06-10T10:28:29.000Z,11,InitialAccess,T1566,BenignPositive,MailMessage,98799,138268,360606,160396,160691,117668
635655163305,261,110412,41503,2024-06-03T17:05:40.000Z,344,Collection,T1098;T1114,BenignPositive,User,98799,138268,360606,160396,268738,117668
429496732853,51,84683,134887,2024-06-05T04:17:50.000Z,26,Execution,T1559;T1106;T1059.005,BenignPositive,File,98799,4,360606,160396,673934,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
403726928569,424,580,1105620,2024-06-11T11:26:46.000Z,347,Discovery,T1047;T1082;T1497.001,FalsePositive,Process,98799,11089,360606,160396,673934,29524
841813593095,4680,65940,844286,2024-06-04T18:07:35.000Z,0,InitialAccess,T1078;T1078.004,FalsePositive,User,98799,138268,360606,160396,459628,117668
1262720388437,249,5728,4089,2024-05-29T13:07:02.000Z,1539,Exfiltration,T1030,FalsePositive,Url,98799,138268,360606,123358,673934,117668
816043788608,1236,6247,44608,2024-05-31T14:10:27.000Z,3668,Execution,T1059.001,FalsePositive,Ip,98799,138268,42849,160396,673934,117668
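
Since several techniques are stored in a single `MitreTechniques` string separated by `;`, it can be handy to split them before counting. The sketch below is only an illustration on a toy DataFrame (three rows with values copied from the output above), not a step the notebook itself performs.

```python
import pandas as pd

# Toy frame with a few MitreTechniques values taken from the output above.
toy = pd.DataFrame({
    "IncidentId": [58352, 417400, 612],
    "MitreTechniques": ["T1189", "T1078;T1078.004", None],
})

# Same filter as in the notebook cell: keep rows that have any technique.
with_ttps = toy.loc[toy["MitreTechniques"].notnull()]

# Split "T1078;T1078.004" into separate rows and count each technique.
technique_counts = (
    with_ttps["MitreTechniques"].str.split(";").explode().value_counts()
)
print(technique_counts)
```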


Now that the data is available, we can start creating functions that will be shared with the genAI model. Let's start with a function that gets all the SHA256 hash values for a given incident.

When writing this function, I focused on a good docstring description, because the model needs it to understand what the inputs of the function are and what its main purpose is. In the instructions, I will also tell the model that it should report if any function is missing and try to avoid hallucinations.

Key lines of this function are:

1. Identifying all rows with the given incidentId:

`incident_files = train_dataset[train_dataset['IncidentId'] == incidentId]`

2. Collecting all unique values of the SHA256 column from them:

`sha256_uniq_vals = incident_files['Sha256'].astype(str).drop_duplicates().tolist()`

I explain the issue I had with the second line a bit further below (it required me to call `astype(str)` explicitly).
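
The effect of `astype(str)` is easiest to see on a toy DataFrame. The two-column frame below is a made-up stand-in for `train_dataset`; only the pandas calls mirror the key lines above.

```python
import pandas as pd

# Toy stand-in for train_dataset: anonymized hashes are stored as integers.
toy = pd.DataFrame({
    "IncidentId": [38679, 38679, 612],
    "Sha256": [138268, 138268, 4],
})

incident_files = toy[toy["IncidentId"] == 38679]

# Without astype(str) the values stay numeric ...
as_ints = incident_files["Sha256"].drop_duplicates().tolist()
# ... with astype(str) each hash identifier becomes a string.
as_strs = incident_files["Sha256"].astype(str).drop_duplicates().tolist()

print(as_ints)  # [138268]
print(as_strs)  # ['138268']
```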


In [5]:
def get_all_sha256_for_incident(incidentId: int) -> list[str]:
    """
    Retrieves a list of the actual SHA256 hash values (as strings) for files involved in a specific incident.

    Even if the hash values in the underlying data look like numbers due to anonymization, 
    this function returns them as a list of unique strings, where each string is a hash identifier.

    Args:
        incidentId (int): The unique identifier for the incident you want to query.

    Returns:
        list[str]: A list containing the unique SHA256 hash values (strings) associated with the 
                   incident. Returns an empty list if the incident ID is not found or has no 
                   associated file hashes.
    """
    print(f"*Func call*: Retrieving all sha256 strings for incident id: {incidentId}")
    sha256_uniq_vals: list[str] = []
    try:
        # Ensure the column exists before trying to access it
        if 'IncidentId' in train_dataset.columns and 'Sha256' in train_dataset.columns:
            # Filter rows for the given incidentId
            incident_files = train_dataset[train_dataset['IncidentId'] == incidentId]
            
            # Get the Sha256 values, drop duplicates, convert to list
            # Important: Ensure the values are treated as strings *before* list conversion
            sha256_uniq_vals = incident_files['Sha256'].astype(str).drop_duplicates().tolist()
            
            if not sha256_uniq_vals:
                print(f"No Sha256 hashes found for IncidentId {incidentId}.")
            else:
                 print(f"Found {len(sha256_uniq_vals)} unique Sha256 hashes.")

        else:
            print("Error: Required columns ('IncidentId' or 'Sha256') not found in the dataset.")

    except KeyError:
        print(f"Error processing IncidentId {incidentId}. It might not exist in the data.")
        # Return empty list on error
        sha256_uniq_vals = []
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        sha256_uniq_vals = []

    return sha256_uniq_vals

At this point, I should mention that I had some problems when creating this function. They were related to the anonymization of the dataset. The fact that SHA256 values are presented as obfuscated IDs was not clear to the model, and it treated the returned value as the number of hashes rather than the actual hash value. To mitigate this issue, I needed to add the sentence below to the function description so that it would be clear to the model:

`Even if the hash values in the underlying data look like numbers due to anonymization, this function returns them as a list of unique strings, where each string is a hash identifier.`
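
Because the description matters this much, it can help to check what a function exposes about itself before handing it to the model. The sketch below uses only the standard `inspect` module; `get_hashes` is a hypothetical stand-in, and this does not reproduce what the google-genai SDK does internally.

```python
import inspect

# Hypothetical stand-in for a tool function (not the notebook's real one).
def get_hashes(incidentId: int) -> list[str]:
    """Return the SHA256 identifiers (as strings) recorded for an incident."""
    return []

signature = inspect.signature(get_hashes)

# Parameter names and their annotated types.
param_types = {name: p.annotation for name, p in signature.parameters.items()}
# The cleaned-up docstring, i.e. the human-readable description.
description = inspect.getdoc(get_hashes)

print(param_types)
print(signature.return_annotation)
print(description)
```

If the description printed here would confuse a colleague, it is reasonable to assume it may confuse the model too.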


Let's test this function outside of the genAI model first:

In [6]:
get_all_sha256_for_incident(38679)

*Func call*: Retrieving all sha256 strings for incident id: 38679
Found 1 unique Sha256 hashes.


['138268']

Since it works, let's test it with an actual model. I used code similar to the 'day3' notebook from the course. I changed the instructions to describe the purpose to the model as precisely as possible.

In [7]:
soc_tools = [get_all_sha256_for_incident]

instruction = """ You are supporting everyday tasks in security operation center (SOC) in a company. 
             You will be providing security analysts with details that you could obtain 
             from provided data about previous incident events raised for organizations.
             
             Be mindful that the data from the incidents had stringent anonymization process applied which means
             e.g. a hash value in events might not look like standard SHA256 value and rather like an obfuscated one 
             that could be represented by an ID.
             
             You will take users' questions and search for the available information with the usage
             of provided functions in soc_tools. 
        """

client = genai.Client(api_key=GOOGLE_API_KEY)
chat = client.chats.create(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=instruction,
        tools=soc_tools,
    ),
)

Now let's ask the model the same question as before:

In [8]:
resp = chat.send_message("What were the hash values for an incident id = 38679?")
print(f"\n{resp.text}")

*Func call*: Retrieving all sha256 strings for incident id: 38679
Found 1 unique Sha256 hashes.

The hash value for incident ID 38679 is 138268.


So we have one function working. Next, I decided to add three other functions:

- `get_all_org_ids` - This function returns the orgId values that are actually available, so I can use them in later queries.
- `get_all_incident_ids` - This function retrieves all incident ids for a given organization ID.
- `get_all_data` - This function collects all the available data for an incident.

At the bottom of the cell, I also run manual tests of these functions.

In [9]:
def get_all_org_ids() -> list[int]:
    """
        Collects all organizations' identifiers that are available in the events.

    Returns:
        list[int]: A list containing unique organization ids available to view in the logs.
        This function would only return an empty list if there is an error with a dataset.
        
    """
    print(f"*Func call*: Retrieving all organization ids")
    org_ids: list[int] = []
    try:
        if 'OrgId' in train_dataset.columns:
            org_ids = train_dataset['OrgId'].astype(int).drop_duplicates().tolist()
            if not org_ids:
                    print(f"No org ids found")
            else:
                    print(f"Found {len(org_ids)} unique organization ids.") 
        else:
            print(f"Error: Required column 'OrgId' not found in the dataset.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        org_ids = []

    return org_ids


def get_all_incident_ids(orgId: int) -> list[int]:
    """
        Collects all incident identifiers that were reported for an organization with given id as an argument to a function.
    Args:
        orgId (int): The unique identifier for the organization you want to query

    Returns:
        list[int]: A list containing unique incident ids associated with the given organization.
        Returns an empty list if there is no incidents for the organization with orgId.
        
    """
    print(f"*Func call*: Retrieving all incident ids for organization id: {orgId}")
    incident_ids: list[int] = []

    try:
        if 'OrgId' in train_dataset.columns and 'IncidentId' in train_dataset.columns:
            org_incidents = train_dataset[train_dataset['OrgId'] == orgId]
            
            incident_ids = org_incidents['IncidentId'].astype(int).drop_duplicates().tolist()

            if not incident_ids:
                print(f"No incident ids found for orgId: {orgId}.")
            else:
                 print(f"Found {len(incident_ids)} unique incident ids.")
        else:
            print("Error: Required columns ('IncidentId' or 'OrgId') not found in the dataset.")
    
    except KeyError:
        print(f"Error processing orgId {orgId}. It might not exist in the data.")
        # Return empty list on error
        incident_ids = []
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        incident_ids = []

    return incident_ids


def get_all_data(incidentId: int) -> dict | None:
    """
        Retrieves all the available data about the incident from the available events.
    Args:
        incidentId (int): The unique identifier for the incident you want to query.

    Returns:
        It returns a dictionary with all the available fields from a row with a given incident id.
        E.g. {'OrgId': [399], 'IncidentId': [389], 'AlertId': [823549], 'Timestamp': ['2024-06-04T16:48:27.000Z'], 'AlertTitle': [3],
        .... 'EntityType': ['Ip'], 'DeviceId': [98799] etc.
         
    """
    print(f"*Func call*: Retrieving all data for a given incident id: {incidentId}")
    incident_data = None
    
    if 'IncidentId' in train_dataset.columns:
        incident_data = train_dataset[train_dataset['IncidentId'] == incidentId] 
        
        if incident_data.empty:
            print("No incident data found to analyze.")
            return None
        else:
            return incident_data.to_dict(orient='list')


res_val1 = get_all_org_ids()
print(res_val1)

print("")
res_val2 = get_all_incident_ids(389)
print(res_val2)

print("")
resp_val3 =  get_all_data(21251)
print(resp_val3)


*Func call*: Retrieving all organization ids
Found 356 unique organization ids.
[0, 88, 809, 92, 148, 11, 522, 125, 21, 72, 268, 597, 23, 6, 376, 2119, 261, 51, 34, 592, 428, 789, 68, 4, 75, 12, 62, 206, 17, 16, 18, 1030, 7, 201, 28, 59, 162, 734, 293, 2, 289, 1, 133, 30, 108, 8, 35, 33, 60, 53, 184, 151, 91, 232, 25, 13, 37, 724, 10, 22, 1333, 9, 131, 246, 39, 155, 1369, 20, 192, 24, 510, 3, 394, 1199, 195, 651, 109, 225, 32, 31, 325, 298, 5, 387, 128, 36, 44, 2112, 904, 290, 976, 726, 47, 52, 773, 65, 1237, 1216, 26, 258, 255, 45, 114, 536, 43, 63, 41, 61, 98, 64, 81, 555, 70, 57, 104, 811, 170, 433, 423, 2670, 1324, 282, 83, 150, 238, 156, 183, 280, 19, 313, 395, 185, 119, 130, 113, 407, 321, 587, 142, 90, 215, 116, 129, 730, 100, 14, 174, 399, 1508, 392, 138, 67, 38, 49, 55, 332, 56, 495, 1157, 1647, 1296, 4087, 352, 629, 115, 371, 1132, 117, 618, 266, 103, 107, 602, 124, 526, 127, 237, 1894, 161, 135, 58, 573, 182, 66, 40, 50, 863, 741, 74, 77, 649, 767, 223, 676, 314, 648, 403, 6

So, if we think about it, we could obtain all data for every incident raised for an organization ID = Z. But it would require manually executing each function. Perhaps our model could help us with such a task?

Unfortunately, when I tested the model with these functions, queries like "Provide me with all data as dictionary output that you have for all incidents for org id = 12?" ended with an error. This is probably because models like 'gemini-2.0-flash' focus on fast responses; such a request might require yet another function that runs (or prepares the execution of) the above chain of functions in a more direct way.
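
One possible shape for such an extra function is sketched below. It is an untested idea rather than something I verified with the model: `get_all_data_for_org` and `toy_dataset` are hypothetical names, and the toy rows are made up so that the block runs on its own.

```python
import pandas as pd

# Toy stand-in for train_dataset (hypothetical rows).
toy_dataset = pd.DataFrame({
    "OrgId": [12, 12, 51],
    "IncidentId": [100, 101, 84683],
    "Category": ["Exfiltration", "InitialAccess", "Execution"],
})

def get_all_data_for_org(orgId: int) -> dict[str, dict]:
    """
    Hypothetical helper: returns all available data for every incident of one organization.

    Args:
        orgId (int): The unique identifier for the organization you want to query.

    Returns:
        dict[str, dict]: Maps each incident id (as a string) to a dictionary of its columns.
        Returns an empty dictionary if the organization has no incidents.
    """
    org_rows = toy_dataset[toy_dataset["OrgId"] == orgId]
    result: dict[str, dict] = {}
    # One pass over the organization's rows instead of one model call per incident.
    for incident_id, rows in org_rows.groupby("IncidentId"):
        result[str(incident_id)] = rows.to_dict(orient="list")
    return result

print(get_all_data_for_org(12))
```

The design choice here is to do the chaining in Python, so the model only has to pick one function with one argument instead of planning a loop of calls itself.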

But we can still test those functions. Let's add them to `soc_tools`:

In [10]:
soc_tools = [get_all_sha256_for_incident, get_all_org_ids, get_all_incident_ids, get_all_data]

instruction = """ You are supporting everyday tasks in security operation center (SOC) in a company. 
             You will be providing security analysts with details that you could obtain 
             from provided data about previous incident events raised by Microsoft security tools.
             Be mindful that the data from the incidents had stringent anonymization process applied which means
             e.g. a hash value in events might not look like standard SHA256 value and rather like an obfuscated one.
             
             You will take users' questions and search for the available information with the usage
             of provided functions in soc_tools. 
             
             Remember that you can use multiple functions in one ask. If you would be asked about trends between organization,
             you can start by using get_all_org_ids function to get a list of available organization ids and use them as input for other functions.

        """
#tools = types.Tool(function_declarations=[schedule_meeting_function])
#config = types.GenerateContentConfig(tools=[tools])

client = genai.Client(api_key=GOOGLE_API_KEY)
chat = client.chats.create(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=instruction,
        tools=soc_tools,
    ),
)

In [11]:
resp = chat.send_message("Please provide me with all incidents for an organization id = 51")

print(f"\n{resp.text}")

*Func call*: Retrieving all incident ids for organization id: 51
Found 5 unique incident ids.

Here is a list of incident ids for organization id 51: [84683, 328282, 328128, 156541, 28297].



We can also define functions that help the model report trends in the data. Below, you can find a function `find_org_with_most_incidents` that reports which organization has produced the most incidents.

In [12]:
def find_org_with_most_incidents() -> dict | None:
    """
    Calculates which organization ID has the highest count of unique incidents.

    This function retrieves all organizations, then counts the unique
    incidents associated with each organization to find the one with the maximum count.

    Returns:
        dict | None: A dictionary containing 'orgId' and 'incident_count' for the
                    organization with the most unique incidents.
                    Example: {'orgId': 123, 'incident_count': 45}
                    Returns None if no organizations or incidents are found, or if
                    an error occurs during processing.
    """
    print("*Func call*: Finding organization with the most unique incidents.")
    try:
        if 'OrgId' in train_dataset.columns and 'IncidentId' in train_dataset.columns:
            # Group by OrgId and count unique IncidentIds
            incident_counts = train_dataset.groupby('OrgId')['IncidentId'].nunique()
            
            if incident_counts.empty:
                print("No incident data found to analyze.")
                return None

            # Find the index (OrgId) of the maximum count
            print(incident_counts)
            top_org_id = incident_counts.idxmax()
            max_incidents = incident_counts.max()

            print(f"Organization {top_org_id} has the most incidents: {max_incidents}")
            return {"orgId": int(top_org_id), "incident_count": int(max_incidents)}
            
        else:
             print("Error: Required columns ('IncidentId' or 'OrgId') not found.")
             return None

    except Exception as e:
        print(f"An unexpected error occurred while finding the top organization: {e}")
        return None

Let's add this function to our model:

In [13]:
soc_tools = [find_org_with_most_incidents, get_all_sha256_for_incident, get_all_org_ids, get_all_incident_ids, get_all_data]

instruction = """ You are supporting everyday tasks in security operation center (SOC) in a company. 
             You will be providing security analysts with details that you could obtain 
             from provided data about previous incident events raised by Microsoft security tools.
             Be mindful that the data from the incidents had stringent anonymization process applied which means
             e.g. a hash value in events might not look like standard SHA256 value and rather like an obfuscated one.
             
             You will take users' questions and search for the available information with the usage
             of provided functions in soc_tools. 
             
             Remember that you can use multiple functions in one ask. If you would be asked about trends between 
             organization, you can start by using get_all_org_ids function to get a list of available organization ids 
             and use them as input for other functions.

        """

client = genai.Client(api_key=GOOGLE_API_KEY)
chat = client.chats.create(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=instruction,
        tools=soc_tools,
    ),
)


And see how the model responds:

In [14]:
resp = chat.send_message("what id has the organization that produced the most incidents?")

print(f"\n{resp.text}")

*Func call*: Finding organization with the most unique incidents.
OrgId
0       61
1       14
2       25
3       17
4       10
        ..
2699     1
3686     1
3874     1
4087     1
4680     1
Name: IncidentId, Length: 356, dtype: int64
Organization 0 has the most incidents: 61

The organization with ID 0 has the most incidents, with a count of 61.
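
The core of `find_org_with_most_incidents` is the `groupby(...).nunique()` plus `idxmax()` combination. A minimal sketch on a toy DataFrame (made-up rows, same column names) shows why duplicate rows for the same incident do not inflate the count:

```python
import pandas as pd

# Toy data: incident 612 and incident 326 each appear in two rows (two pieces of evidence).
toy = pd.DataFrame({
    "OrgId":      [0, 0, 0, 88, 88],
    "IncidentId": [612, 612, 700, 326, 326],
})

# Count distinct incidents per organization, not rows.
incident_counts = toy.groupby("OrgId")["IncidentId"].nunique()

# idxmax() returns the index label (OrgId) of the largest count.
top_org_id = incident_counts.idxmax()

print(int(top_org_id), int(incident_counts.max()))  # 0 2
```

Note that when several organizations tie for the maximum, `idxmax()` returns only the first one.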



# **Threat Hunting Bit**

In this next part, I create a proof-of-concept for combining our genAI model with threat hunting operations. A Threat Hunting Team focuses on identifying threats that were previously undetected. They do so by building hypotheses around potential attacks and evaluating whether there is any evidence for them in the available logs.

A Threat Hunting Team searches for specific threat actors' behaviors rather than static indicators like an IP address or a SHA256 hash. The observed behavior is classified using the MITRE ATT&CK framework, which provides a matrix of tactics and techniques (TTPs) that attackers might leverage in their operations.

So, to support threat hunting operations, we could create a function that helps identify specific techniques across the available logs. For the purpose of this project, the task is simplified: we will try to identify those technique patterns inside the alert data we already have. In the real world, a Threat Hunting Team would verify them with specific queries across all available data (event logs from endpoints, network logs, etc.).

Let's define a function `evaluate_mitre_techniques` that searches for MITRE TTPs across all raised incidents. In the real world, a function like this could help the team, for example, to identify incidents that were graded as FalsePositive but were in fact part of a real attack when looking at the bigger picture.

This function uses the previously defined functions to collect all incident IDs for an organization and get all available data for each incident. It then checks the assigned MITRE TTPs and compares them with the given pattern.

In [15]:
def evaluate_mitre_techniques(orgId: int, pattern: list[str]) -> list[int]:
    """
        This function verifies whether any incident for the organization with the given id was reported
        with the given MITRE technique pattern.
        
        A match indicates that the behavior in the incident matches the searched pattern.
        For example, for a pattern of techniques ['T1047','T1497.001','T1082'] and organization ID = 424, it verifies
        the MitreTechniques values for each incident and identifies that the incident with ID = 580 matches this pattern.

        If the pattern has only one technique, e.g. ['T1047'], it returns all incident ids that had this technique
        reported in the MitreTechniques column (regardless of others). If the pattern has more than one technique,
        all of them must occur in an incident for it to match.

    Args:
        orgId (int): The unique identifier for the organization you want to query.
        pattern (list[str]): MITRE technique ids that must all be present, e.g. ['T1078', 'T1078.004'].

    Returns:
        list[int]: The incident ids of the organization that matched the pattern. If no incident
        matched, the list is empty.
        
    """
    print(f"*Func call*: Evaluating MITRE techniques: {pattern} for orgId: {orgId}")
    
    incident_ids = get_all_incident_ids(orgId)
    matched_ids: list[int] = []
    
    if not incident_ids:
        return matched_ids
  
    for id in incident_ids:

        print(f"incident id: {id}")
        incident_data = get_all_data(id)
        incident_techniques = incident_data['MitreTechniques'][0]
        if pd.isnull(incident_techniques):
            print(f"Incident does not contain any MITRE TTPs")
            continue
        
        incident_techniques = incident_techniques.split(";") #splitting techniques in first string by ';'
    
        matched = 0
        for p in pattern:
            if p in incident_techniques:
                print(f"found technique: {p}, continue")
                matched +=1
                continue
            else:
                print(f"missing technique: {p}, ending verification")
                break
        
        if matched == len(pattern):
            matched_ids.append(id)
        else:
            print(f"pattern not found in the incident id: {id}")

    return matched_ids
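
The matching rule described in the docstring above (every technique in the pattern must occur in the incident) is a subset test. The sketch below shows that rule in isolation, without the dataset. The helper name `matches_pattern` is hypothetical and not part of the notebook, and it assumes the techniques arrive as one `;`-separated string, as in the `MitreTechniques` column.

```python
def matches_pattern(techniques_field, pattern: list[str]) -> bool:
    """Return True when every technique in `pattern` appears in the incident.

    `techniques_field` is assumed to be a ';'-separated string; any other
    value (None, NaN from pandas) is treated as "no techniques reported".
    """
    if not isinstance(techniques_field, str):
        return False
    incident_techniques = {t.strip() for t in techniques_field.split(";")}
    # Subset test: the order of techniques does not matter.
    return set(pattern).issubset(incident_techniques)


print(matches_pattern("T1078;T1078.004", ["T1078", "T1078.004"]))
print(matches_pattern("T1078;T1566", ["T1078", "T1078.004"]))
```

Like the loop above, this compares exact ids, so `T1078` in a pattern does not match an incident that only lists the sub-technique `T1078.004`.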

Let's add this function to our existing tools:

In [16]:
soc_tools = [evaluate_mitre_techniques, get_all_sha256_for_incident, get_all_org_ids, get_all_incident_ids, get_all_data, find_org_with_most_incidents]


instruction = """ You are supporting everyday tasks in security operation center (SOC) in a company. 
             You will be providing security analysts with details that you could obtain 
             from provided data about previous incident events raised by Microsoft security tools.
             Be mindful that the data from the incidents had stringent anonymization process applied which means
             e.g. a hash value in events might not look like standard SHA256 value and rather like an obfuscated one.
             
             You will take users' questions and search for the available information with the usage
             of provided functions in soc_tools. 
             
             Remember that you can use multiple functions in one ask. If you would be asked about trends between
             organization, you can start by using get_all_org_ids function to get a list of available organization ids
             and use them as input for other functions.

        """

client = genai.Client(api_key=GOOGLE_API_KEY)
chat = client.chats.create(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=instruction,
        tools=soc_tools,
    ),
)


And ask the model our question:

In [17]:
resp = chat.send_message("Please verify if there was any incident for organization id = 0 that matches MITRE pattern of [T1078,T1078.004] techniques?")

print(f"\n{resp.text}")

*Func call*: Evaluating MITRE techniques: ['T1078', 'T1078.004'] for orgId: 0
*Func call*: Retrieving all incident ids for organization id: 0
Found 61 unique incident ids.
incident id: 612
*Func call*: Retrieving all data for a given incident id: 612
Incident does not contain any MITRE TTPs
incident id: 211
*Func call*: Retrieving all data for a given incident id: 211
found technique: T1078, continue
found technique: T1078.004, continue
incident id: 260
*Func call*: Retrieving all data for a given incident id: 260
found technique: T1078, continue
found technique: T1078.004, continue
incident id: 375
*Func call*: Retrieving all data for a given incident id: 375
found technique: T1078, continue
found technique: T1078.004, continue
incident id: 136
*Func call*: Retrieving all data for a given incident id: 136
found technique: T1078, continue
found technique: T1078.004, continue
incident id: 262
*Func call*: Retrieving all data for a given incident id: 262
found technique: T1078, continue


As a threat hunter, we could then evaluate whether those incidents tell a story about malicious use of Valid Accounts (`T1078`) over a longer period of time, or whether we can see a risk of Cloud Accounts (`T1078.004`) being leveraged in the organization.

However, in threat hunting we cannot rely only on the internal knowledge base; we should enrich our investigation with external open-source intelligence (OSINT) resources that describe more recent threat actors in detail. To do so, we can build a RAG model, which was also explained during the 5-Day Gen AI Intensive Course.

The RAG model will:
- **R**etrieve the most suitable document from the given external resources for a user query,
- **A**ugment the user query with the knowledge from that resource,
- and **G**enerate an answer based on this augmented prompt.
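
Before wiring in real embeddings, the three steps can be sketched with a toy example. This is only an illustration under simplifying assumptions: retrieval here counts shared words instead of comparing embeddings, the helper names (`tokenize`, `retrieve`, `augment`) and the two toy documents are made up for this sketch, and the generation step is left as a comment because it needs a model call.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str]) -> str:
    """Retrieve: pick the document sharing the most words with the query."""
    query_words = tokenize(query)
    return max(docs, key=lambda d: len(query_words & tokenize(d)))


def augment(query: str, passage: str) -> str:
    """Augment: put the retrieved passage next to the question."""
    return f"QUESTION: {query}\nPASSAGE: {passage}\n"


toy_docs = [
    "Scattered Spider uses phishing and SIM swap attacks to bypass MFA.",
    "Chroma is a vector database for storing embeddings.",
]
toy_query = "Which attacks does Scattered Spider use?"
toy_prompt = augment(toy_query, retrieve(toy_query, toy_docs))

# Generate: this augmented prompt would then be sent to the model.
print(toy_prompt)
```

Word overlap is a crude stand-in; embeddings serve the same purpose but can also match texts that use different wording for the same idea.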

Let's add the necessary requirements for this part of the project:

## RAG model requirements

In [None]:
!pip uninstall -qqy jupyterlab kfp  
!pip install -qU "chromadb==0.6.3"

For this part of the project, we will use the `chromadb` library to store the external resource documents and their embeddings.

In [19]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

Next, we will add short documents that will act as external OSINT resources. They are excerpts from real blog posts and a security advisory about a specific threat actor group that has recently emerged in the cyber threat world.

In a real-world scenario, these short documents could be replaced with OSINT feeds or extensive research PDFs from Cyber Threat Intelligence teams.

In [20]:
#defining documents

# source: https://unit42.paloaltonetworks.com/muddled-libra/
document1 = "#Executive Summary - Muddled Libra stands at the intersection of devious social engineering and nimble technology adaptation. With an intimate knowledge of enterprise information technology, this threat group presents a significant risk even to organizations with well-developed legacy cyber defenses. Muddled Libra’s tactics can be fluid, adapting quickly to a target environment. They continue to use social engineering as their primary modus operandi, targeting a company's IT help support desk. For example, in under a few minutes, these threat actors successfully changed an account password and later reset the victim’s MFA to gain access to their networks. Muddled Libra was first noted for targeting organizations in the software automation, outsourcing and telecommunications verticals. Since then, they’ve expanded their targeting to include the technology, business process outsourcing, hospitality and more recently, financial industries. They show no signs of slowing. Unit 42 researchers and responders have investigated interrelated incidents from mid-2022 through the beginning of 2024, which we’ve attributed to the threat group Muddled Libra. Initial attacks were highly structured and favored large business process outsourcing firms serving high-value cryptocurrency holders. We believe that when the threat actors exhausted those targets, they evolved into a ransomware affiliate model with extortion as their key objective. 
In the cases we’ve been involved with, we observed Muddled Libra performing the following activities:     Using NSOCKS and TrueSocks proxy services     Creating email rules to forward emails from specific security vendors to the actors to monitor communications and those helping in the investigation     Deploying a custom virtual machine into the environment     Using an open-source rootkit, bedevil (bdvl) to target VMware vCenter servers     Gaining administrative permissions     Heavy use of anonymizing proxy services We also believe that members of Muddled Libra speak English as a first language, which provides them greater ability to conduct their social engineering attacks with other English speakers. Muddled Libra has also been observed using AI to spoof victims’ voices. Social media videos can be used by attackers to train AI models. The targets we’ve observed seem to be primarily in the U.S. Thwarting Muddled Libra requires interweaving tight security controls, diligent awareness training and vigilant monitoring. Palo Alto Networks customers are better protected from the threats described in this article through a modern security architecture built around Cortex XSIAM in concert with Cortex XDR. The Advanced URL Filtering and DNS Security Cloud-Delivered Security Services can help protect against command and control (C2) infrastructure, while App-ID can limit anonymization services allowed to connect to the network. # Threat Overview - The attack style defining Muddled Libra appeared on the cybersecurity radar in late 2022 with the release of the 0ktapus phishing kit. This malware kit offered the following features:     A prebuilt hosting framework     Easy C2 connectivity     Bundled attack templates These options allowed attackers to emulate mobile authentication pages cheaply and easily. 
With over 200 realistic fake authentication portals and some targeted smishing, attackers quickly gathered credentials and multifactor authentication (MFA) codes for over one hundred organizations. The speed and breadth of these attacks caught many defenders off-guard. While smishing is not a new tactic, the 0ktapus framework commoditized what would typically require complex infrastructure and advanced technical skills, in a way that granted even low-skilled attackers a high attack success rate. The sheer number of targets being hit with this kit created a fair amount of confusion regarding attribution in the research community. Previous reporting by Group-IB, CrowdStrike and Okta has documented and mapped many of these attacks to the following intrusion groups: 0ktapus, Scattered Spider and Scatter Swine. While these have been frequently treated as several names for one group, what these names actually define are:     An attack style using a common toolkit     A social forum-based collaboration network     An Agile-like team structure Muddled Libra is a distinct group of actors using this tradecraft. In a 2023 blog posted on ALPHV’s leak site, the attackers corroborated this view, claiming that previous researcher attribution models have been non-specific. During Unit 42 Incident Response investigations, we identified several cases we attribute to Muddled Libra. Muddled Libra has been responsible for a campaign of complex supply chain attacks, ultimately leading to high-value cryptocurrency targets. This group has only intensified their campaign. They are shifting tactics to adapt to improving cyber defenses, and they are targeting to broaden their attack scope." 

#source: https://www.splunk.com/en_us/blog/learn/scattered-spider.html 
document2 = "What is Scattered Spider? Scattered Spider is a financially motivated threat actor group founded in May 2022. The group is thought to comprise operatives based in the United States and the United Kingdom. They are believed to be primarily between the ages of 19 and 22. How they hack The group is considered expert in social engineering and uses multiple techniques — including phishing, push bombing, and subscriber identity module (SIM) swap attacks — in order to obtain credentials, install remote access tools, and/or bypass multi-factor authentication (MFA). According to the Cybersecurity & Infrastructure Security Agency (CISA), Scattered Spider uses tools like:     Fleetdeck.io and Level.io to enable remote monitoring and management of systems     Screenconnect and Splashtop to enable remote connections to network devices     Tailscale to provide virtual private networks (VPN) to secure network communications  They also use malware like AveMaria, Raccoon Stealer, and VIDAR Stealer. To begin their phishing attempts, the group creates victim-specific domains, such as victimname-sso[.]com, victimname-servicedesk[.]com, and victimname-okta[.]com.  Known aliases Scattered Spider is also referred to as Starfraud, UNC3944, Scatter Swine, and Muddled Libra. Known attacks The group, whose name was first tagged by cybersecurity researchers, gained notoriety for hacking Caesars Entertainment and MGM Resorts International, two of the largest casino and gambling companies in the United States, in September 2023. It’s possible that Scattered Spider was assisted by the group ALPHV/BlackCat. MGM shut down systems across all of its 31 resorts, while Caesars tried avoiding a shutdown by paying the group $15 million.  Reuters reported that Scattered Spider obtained six terabytes (6TB) of stolen data from the hotels and casinos, including sensitive information about the millions of guests who have stayed there.  Their way in? 
Posing as an MGM employee and calling an IT help desk to “recover their password.” Scattered Spider has also attacked other organizations by posing as company IT and/or helpdesk staff using phone calls or SMS messages to:     Obtain credentials from employees and gain access to the network     Direct employees to run commercial remote access tools enabling initial access     Convince employees to share their one-time password (OTP)     Send repeated MFA notification prompts leading to employees pressing the “Accept” button (also known as MFA fatigue)  They’ve even successfully convinced cellular carriers to transfer control of a targeted user’s phone number to a SIM card, gaining control over the phone and access to MFA prompts. Worse, they’ve monetized access to victim networks in numerous ways including extortion enabled by ransomware and data theft."

#source:  https://www.cisa.gov/news-events/cybersecurity-advisories/aa23-320a
document3 ="This advisory uses the MITRE ATT&CK for Enterprise framework, version 14. See the MITRE ATT&CK® Tactics and Techniques section for a table of the threat actors’ activity mapped to MITRE ATT&CK tactics and techniques. For assistance with mapping malicious cyber activity to the MITRE ATT&CK framework, see CISA and MITRE ATT&CK’s Best Practices for MITRE ATT&CK Mapping and CISA’s Decider Tool. Overview Scattered Spider (also known as Starfraud, UNC3944, Scatter Swine, and Muddled Libra) engages in data extortion and several other criminal activities. Scattered Spider threat actors are considered experts in social engineering and use multiple social engineering techniques, especially phishing, push bombing, and subscriber identity module (SIM) swap attacks, to obtain credentials, install remote access tools, and/or bypass multi-factor authentication (MFA). According to public reporting, Scattered Spider threat actors have: Posed as company IT and/or helpdesk staff using phone calls or SMS messages to obtain credentials from employees and gain access to the network [T1598],[T1656]. Posed as company IT and/or helpdesk staff to direct employees to run commercial remote access tools enabling initial access [T1204],[T1219],[T1566]. Posed as IT staff to convince employees to share their one-time password (OTP), an MFA authentication code. Sent repeated MFA notification prompts leading to employees pressing the “Accept” button (also known as MFA fatigue) [T1621]. Convinced cellular carriers to transfer control of a targeted user’s phone number to a SIM card they controlled, gaining control over the phone and access to MFA prompts. Monetized access to victim networks in numerous ways including extortion enabled by ransomware and data theft [T1657]. After gaining access to networks, the FBI observed Scattered Spider threat actors using publicly available, legitimate remote access tunneling tools. 
Table 1 details a list of legitimate tools Scattered Spider, repurposed and used for their criminal activity. Note: The use of these legitimate tools alone is not indicative of criminal activity. Users should review the Scattered Spider indicators of compromise (IOCs) and TTPs discussed in this CSA to determine whether they have been compromised. Table 1: Legitimate Tools Used by Scattered Spider Tool 	Intended Use Fleetdeck.io 	Enables remote monitoring and management of systems. Level.io 	Enables remote monitoring and management of systems. Mimikatz [S0002] 	Extracts credentials from a system. Ngrok [S0508] 	Enables remote access to a local web server by tunneling over the internet. Pulseway 	Enables remote monitoring and management of systems. Screenconnect 	Enables remote connections to network devices for management. Splashtop 	Enables remote connections to network devices for management. Tactical.RMM 	Enables remote monitoring and management of systems. Tailscale 	Provides virtual private networks (VPNs) to secure network communications. Teamviewer 	Enables remote connections to network devices for management. In addition to using legitimate tools, Scattered Spider also uses malware as part of its TTPs. See Table 2 for some of the malware used by Scattered Spider. Table 2: Malware Used by Scattered Spider Malware 	Use AveMaria (also known as WarZone [S0670]) 	Enables remote access to a victim’s systems. Raccoon Stealer 	Steals information including login credentials [TA0006], browser history [T1217], cookies [T1539], and other data. VIDAR Stealer 	Steals information including login credentials, browser history, cookies, and other data. Scattered Spider threat actors have historically evaded detection on target networks by using living off the land techniques and allowlisted applications to navigate victim networks, as well as frequently modifying their TTPs. 
Observably, Scattered Spider threat actors have exfiltrated data [TA0010] after gaining access and threatened to release it without deploying ransomware; this includes exfiltration to multiple sites including U.S.-based data centers and MEGA[.]NZ [T1567.002]. Recent Scattered Spider TTPs New TTP - File Encryption More recently, the FBI has identified Scattered Spider threat actors now encrypting victim files after exfiltration [T1486]. After exfiltrating and/or encrypting data, Scattered Spider threat actors communicate with victims via TOR, Tox, email, or encrypted applications. Reconnaissance, Resource Development, and Initial Access Scattered Spider intrusions often begin with broad phishing [T1566] and smishing [T1660] attempts against a target using victim-specific crafted domains, such as the domains listed in Table 3 [T1583.001]. Table 3: Domains Used by Scattered Spider Threat Actors Domains victimname-sso[.]com victimname-servicedesk[.]com victimname-okta[.]com In most instances, Scattered Spider threat actors conduct SIM swapping attacks against users that respond to the phishing/smishing attempt. The threat actors then work to identify the personally identifiable information (PII) of the most valuable users that succumbed to the phishing/smishing, obtaining answers for those users’ security questions. After identifying usernames, passwords, PII [T1589 ], and conducting SIM swaps, the threat actors then use social engineering techniques [T1656] to convince IT help desk personnel to reset passwords and/or MFA tokens [T1078.002],[T1199],[T1566.004] to perform account takeovers against the users in single sign-on (SSO) environments. Execution, Persistence, and Privilege Escalation Scattered Spider threat actors then register their own MFA tokens [T1556.006],[T1606] after compromising a user’s account to establish persistence [TA0003]. 
Further, the threat actors add a federated identity provider to the victim’s SSO tenant and activate automatic account linking [T1484.002]. The threat actors are then able to sign into any account by using a matching SSO account attribute. At this stage, the Scattered Spider threat actors already control the identity provider and then can choose an arbitrary value for this account attribute. As a result, this activity allows the threat actors to perform privileged escalation [TA0004] and continue logging in even when passwords are changed [T1078]. Additionally, they leverage common endpoint detection and response (EDR) tools installed on the victim networks to take advantage of the tools’ remote-shell capabilities and executing of commands which elevates their access. They also deploy remote monitoring and management (RMM) tools [T1219] to then maintain persistence. Discovery, Lateral Movement, and Exfiltration Once persistence is established on a target network, Scattered Spider threat actors often perform discovery, specifically searching for SharePoint sites [T1213.002], credential storage documentation [T1552.001], VMware vCenter infrastructure [T1018], backups, and instructions for setting up/logging into Virtual Private Networks (VPN) [TA0007]. The threat actors enumerate the victim’s Active Directory (AD), perform discovery and exfiltration of victim’s code repositories [T1213.003], code-signing certificates [T1552.004], and source code [T1083],[TA0010]. Threat actors activate Amazon Web Services (AWS) Systems Manager Inventory [T1538] to discover targets for lateral movement [TA0007],[TA0008], then move to both preexisting [T1021.007] and actor-created [T1578.002] Amazon Elastic Compute Cloud (EC2) instances. In instances where the ultimate goal is data exfiltration, Scattered Spider threat actors use actor-installed extract, transform, and load (ETL) tools [T1648] to bring data from multiple data sources into a centralized database [T1074],[T1530]. 
According to trusted third parties, where more recent incidents are concerned, Scattered Spider threat actors may have deployed BlackCat/ALPHV ransomware onto victim networks—thereby encrypting VMware Elastic Sky X integrated (ESXi) servers [T1486]. To determine if their activities have been uncovered and maintain persistence, Scattered Spider threat actors often search the victim’s Slack, Microsoft Teams, and Microsoft Exchange online for emails [T1114] or conversations regarding the threat actor’s intrusion and any security response. The threat actors frequently join incident remediation and response calls and teleconferences, likely to identify how security teams are hunting them and proactively develop new avenues of intrusion in response to victim defenses. This is sometimes achieved by creating new identities in the environment [T1136] and is often upheld with fake social media profiles [T1585.001] to backstop newly created identities."

documents = [document1, document2, document3]

We have the resources ready. Now we will define the embedding function, which uses the `text-embedding-004` model via the Gemini API to prepare embeddings of those OSINT resources and, later on, of the user's query.

In the function below, the `document_mode` switch lets us easily change the embedding task for the model.

In [21]:
# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]

Now we will store those OSINT documents and their embeddings in the `threat_intelligence` database. If this step works, you should see the number of entries in the database. Note that the parameter `embedding_function=embed_fn` ties the previously defined embedding function to the collection, so Chroma knows which function to use to calculate embeddings.

In [22]:
import chromadb

DB_NAME = "threat_intelligence"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()

db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)
db.add(documents=documents, ids=[str(i) for i in range(len(documents))])

db.count()

3

Once this is done, we will switch the mode to embed the user's query and verify that the most relevant document is retrieved from the database to answer the question.

In [23]:
embed_fn.document_mode = False

query = "What are the MITRE techniques associated with scattered spider?"

result = db.query(query_texts=[query], n_results=1)
[all_passages] = result["documents"]

print(all_passages[0])

This advisory uses the MITRE ATT&CK for Enterprise framework, version 14. See the MITRE ATT&CK® Tactics and Techniques section for a table of the threat actors’ activity mapped to MITRE ATT&CK tactics and techniques. For assistance with mapping malicious cyber activity to the MITRE ATT&CK framework, see CISA and MITRE ATT&CK’s Best Practices for MITRE ATT&CK Mapping and CISA’s Decider Tool. Overview Scattered Spider (also known as Starfraud, UNC3944, Scatter Swine, and Muddled Libra) engages in data extortion and several other criminal activities. Scattered Spider threat actors are considered experts in social engineering and use multiple social engineering techniques, especially phishing, push bombing, and subscriber identity module (SIM) swap attacks, to obtain credentials, install remote access tools, and/or bypass multi-factor authentication (MFA). According to public reporting, Scattered Spider threat actors have: Posed as company IT and/or helpdesk staff using phone calls or SMS 

You can see that it returned the entire document that would answer our query, without any summary. To get a summarized answer, we need to create an augmented prompt (with the collected external information) and send it to our model.

In [24]:
query_oneline = query.replace("\n", " ") #in case the question has multiple lines.

prompt = f"""You are supporting everyday tasks in security operation center (SOC) in a company. You will be providing security analysts with details that you could obtain from the reference passage included below.
Be sure to respond in a complete sentence, being comprehensive, including all relevant background
information. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: {query_oneline}

"""

# Add the retrieved documents to the prompt just like in code snippet for Day 2 of the course
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

print(prompt)

You are supporting everyday tasks in security operation center (SOC) in a company. You will be providing security analysts with details that you could obtain from the reference passage included below.
Be sure to respond in a complete sentence, being comprehensive, including all relevant background
information. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: What are the MITRE techniques associated with scattered spider?

PASSAGE: This advisory uses the MITRE ATT&CK for Enterprise framework, version 14. See the MITRE ATT&CK® Tactics and Techniques section for a table of the threat actors’ activity mapped to MITRE ATT&CK tactics and techniques. For assistance with mapping malicious cyber activity to the MITRE ATT&CK framework, see CISA and MITRE ATT&CK’s Best Practices for MITRE ATT&CK Mapping and CISA’s Decider Tool. Overview Scattered Spider (also known as Starfraud, UNC3944, Scatter Swine, and Muddled Libra) engages in data extortion and several other cr
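
If the same template is needed for other questions, the prompt assembly shown above could be wrapped in a small helper. The sketch below is one possible shape, not the notebook's code: `build_rag_prompt` and its `max_chars` parameter are hypothetical names, the instruction text is shortened, and the character cap is only an assumed safeguard against very long passages.

```python
def build_rag_prompt(question: str, passages: list[str], max_chars: int = 20000) -> str:
    """Build an augmented prompt from a question and retrieved passages."""
    header = (
        "You are supporting everyday tasks in a security operations center (SOC). "
        "Answer using the reference passages included below. "
        "If a passage is irrelevant to the answer, you may ignore it.\n\n"
    )
    # Collapse newlines and repeated whitespace so each item stays on one line.
    question_oneline = " ".join(question.split())
    parts = [header, f"QUESTION: {question_oneline}\n\n"]
    for passage in passages:
        passage_oneline = " ".join(passage.split())[:max_chars]
        parts.append(f"PASSAGE: {passage_oneline}\n")
    return "".join(parts)


example_prompt = build_rag_prompt(
    "What tools\ndoes the group use?",
    ["Tool A\nTool B", "Tool C"],
)
print(example_prompt)
```

With `n_results` greater than 1 in the query, each retrieved document would become its own `PASSAGE:` line.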

And now, we send this augmented prompt to our model:

In [25]:
rag_response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt)

rag_response_text = rag_response.text
print(rag_response_text)

Based on the provided information, the MITRE ATT&CK techniques associated with Scattered Spider include: T1598, T1656, T1204, T1219, T1566, T1621, T1657, T1486, T1660, T1583.001, T1589, T1078.002, T1199, T1566.004, T1556.006, T1606, T1484.002, T1078, T1213.002, T1552.001, T1018, T1213.003, T1552.004, T1083, T1538, T1021.007, T1578.002, T1648, T1074, T1530, and T1114, T1136, T1585.001, T1567.002, T1217, T1539.



So it worked: we were given a list of TTPs associated with the selected threat group.

As the last step of this project, we would like to combine the previously defined tools for the model with our RAG output.
We will pass the generated response from the RAG model and use it as an input for our `evaluate_mitre_techniques` function.

With this workflow, we would be able to support a threat hunting team in identifying emerging threat group activities in previously raised incidents across their client base.

Before we actually pass the RAG response `rag_response_text` to our model, we need to parse the output into a list so that the model can use our function.

To do so, we will prepare a prompt in which we pass the RAG response.

In [26]:
extraction_prompt = f"""
From the following text, extract all MITRE ATT&CK technique IDs (e.g., T1234, T1234.001).
Return the results ONLY as a raw JSON list of strings. Do not include Markdown fences like ```json or ``` or any other explanatory text before or after the list.
If no techniques are found, return an empty list [].

Text:
\"\"\"
{rag_response_text}
\"\"\"

Raw JSON List:
"""

extraction_response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=extraction_prompt
)

extracted_techniques_text = extraction_response.text.strip()
print(extracted_techniques_text)

```json
[
  "T1598",
  "T1656",
  "T1204",
  "T1219",
  "T1566",
  "T1621",
  "T1657",
  "T1486",
  "T1660",
  "T1583.001",
  "T1589",
  "T1078.002",
  "T1199",
  "T1566.004",
  "T1556.006",
  "T1606",
  "T1484.002",
  "T1078",
  "T1213.002",
  "T1552.001",
  "T1018",
  "T1213.003",
  "T1552.004",
  "T1083",
  "T1538",
  "T1021.007",
  "T1578.002",
  "T1648",
  "T1074",
  "T1530",
  "T1114",
  "T1136",
  "T1585.001",
  "T1567.002",
  "T1217",
  "T1539"
]
```
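
Because technique ids follow a fixed shape (`T` plus four digits, optionally a dot and three more digits), they can also be pulled out of the RAG answer without a second model call. The sketch below is an alternative to the extraction prompt, not what the notebook does; `extract_technique_ids` and the sample sentence are made up for illustration.

```python
import re

# Matches T#### optionally followed by a .### sub-technique suffix.
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def extract_technique_ids(text: str) -> list[str]:
    """Return unique technique ids in order of first appearance."""
    # dict.fromkeys keeps insertion order while dropping duplicates.
    return list(dict.fromkeys(TECHNIQUE_ID.findall(text)))


sample = "Observed: T1078, T1078.004 and [T1566]; T1078 again. Tactic TA0006 is skipped."
print(extract_technique_ids(sample))  # ['T1078', 'T1078.004', 'T1566']
```

Tactic ids such as `TA0006` are skipped because the pattern requires a digit directly after the `T`.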


During my tests, I found that we also need to clean up the response because of Markdown characters like "```json". I tried tweaking the instructions for the model to avoid them, but found that this could also be done with a short function:

In [27]:
import re

def clean_rag_response(extracted_techniques_text: str) -> str | None:
    """
    Strips Markdown code-fence markers (for example a leading json fence and a trailing fence) from a string.
    Running this function is important to get a clean JSON structure that can be used by other functions.

    Args:
        extracted_techniques_text: string that may contain unnecessary Markdown characters around the JSON payload

    Returns:
        The cleaned JSON text (still a string) that contains MITRE technique ids, or None if no JSON structure was found.

    """
    cleaned_pattern = None
    print(f"*Func call*: Clearing any Markdown chars from input: {extracted_techniques_text}")
    # Greedy match: from the first '{' to the last '}', or from the first '[' to the last ']'
    match = re.search(r"(\{.*\}|\[.*\])", extracted_techniques_text, re.DOTALL)

    if match:
        cleaned_pattern = match.group(0)
        print("--- Cleaned using Regex ---")

    return cleaned_pattern

resp = clean_rag_response(extracted_techniques_text)
print(resp)


*Func call*: Clearing any Markdown chars from input: ```json
[
  "T1598",
  "T1656",
  "T1204",
  "T1219",
  "T1566",
  "T1621",
  "T1657",
  "T1486",
  "T1660",
  "T1583.001",
  "T1589",
  "T1078.002",
  "T1199",
  "T1566.004",
  "T1556.006",
  "T1606",
  "T1484.002",
  "T1078",
  "T1213.002",
  "T1552.001",
  "T1018",
  "T1213.003",
  "T1552.004",
  "T1083",
  "T1538",
  "T1021.007",
  "T1578.002",
  "T1648",
  "T1074",
  "T1530",
  "T1114",
  "T1136",
  "T1585.001",
  "T1567.002",
  "T1217",
  "T1539"
]
```
--- Cleaned using Regex ---
[
  "T1598",
  "T1656",
  "T1204",
  "T1219",
  "T1566",
  "T1621",
  "T1657",
  "T1486",
  "T1660",
  "T1583.001",
  "T1589",
  "T1078.002",
  "T1199",
  "T1566.004",
  "T1556.006",
  "T1606",
  "T1484.002",
  "T1078",
  "T1213.002",
  "T1552.001",
  "T1018",
  "T1213.003",
  "T1552.004",
  "T1083",
  "T1538",
  "T1021.007",
  "T1578.002",
  "T1648",
  "T1074",
  "T1530",
  "T1114",
  "T1136",
  "T1585.001",
  "T1567.002",
  "T1217",
  "T1539"
]
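
The cleaned value printed above is still a string. If a Python list is needed directly, for example to call a function without going through the model, the string can be parsed with the standard `json` module. This is a minimal sketch; `parse_technique_list` is a hypothetical helper, and returning an empty list on malformed input is an assumption made for illustration.

```python
import json


def parse_technique_list(cleaned_text) -> list[str]:
    """Parse a JSON list string into a list of technique id strings."""
    if not cleaned_text:  # None or empty string
        return []
    try:
        parsed = json.loads(cleaned_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only string items; anything else is not a technique id.
    return [item for item in parsed if isinstance(item, str)]


print(parse_technique_list('[\n  "T1598",\n  "T1656"\n]'))  # ['T1598', 'T1656']
```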


So this function is ready. For this proof of concept, we need one more function that helps the model create smaller subpatterns of MITRE TTPs. This only adapts the test to the TTP values in our dataset, where the `MitreTechniques` column lists at most 3 techniques.

In [28]:
def chunk_list(data_list: list[str], sublist_size: int) -> list[list[str]]:
    """
    Divides a list into sublists (chunks) of a maximum size = sublist_size.

    Args:
        data_list: The list to be divided.
        sublist_size: The maximum number of items allowed in each sublist.

    Returns:
        A list of lists, where each inner list is a chunk
        of the original list.

    """
    print(f"*Func call*: Creating sublists of {data_list} with max size of {sublist_size}")
    if not isinstance(sublist_size, int) or sublist_size <= 0:
        raise ValueError("sublist_size must be a positive integer")

    if not data_list:
        return [] # Return empty list if input is empty

    sublists = []
    list_len = len(data_list)
    for i in range(0, list_len, sublist_size):
        # Slice the list from the current index 'i' up to 'i + sublist_size'
        sublist = data_list[i : i + sublist_size]
        sublists.append(sublist)
    return sublists
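
To make the chunking behavior concrete, here is a minimal standalone sketch of the same slicing idea. `split_into_sublists` is a hypothetical stand-in written for illustration, not the notebook's `chunk_list`, and the input is just the first five technique IDs from the cleaned output.

```python
def split_into_sublists(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive sublists holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    # Step through the list in strides of `size`; the last slice may be shorter.
    return [items[i : i + size] for i in range(0, len(items), size)]


techniques = ["T1598", "T1656", "T1204", "T1219", "T1566"]
pairs = split_into_sublists(techniques, 2)
print(pairs)
# [['T1598', 'T1656'], ['T1204', 'T1219'], ['T1566']]
```

The last sublist can be shorter than the requested size, so any downstream matching should not assume that every subpattern has exactly the same length.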

And now, for the grand finale, let's add our functions to `soc_tools` and ask the model to evaluate the reply from the previous step, stored in `extracted_techniques_text`.

In [29]:
soc_tools = [clean_rag_response, chunk_list, evaluate_mitre_techniques, get_all_sha256_for_incident, get_all_org_ids, get_all_incident_ids, get_all_data, find_org_with_most_incidents]


instruction = f""" You are supporting everyday tasks in a security operations center (SOC) in a company.
             You will be providing security analysts with details that you could obtain
             from the reference passage included below.
             Be sure to respond in a complete sentence, being comprehensive, including all relevant background
             information.
             Be mindful that the data from the incidents had a stringent anonymization process applied, which means
             e.g. a hash value in events might not look like a standard SHA256 value but rather like an obfuscated one.

             You will take users' questions and search for the available information using
             the functions provided in soc_tools.

             Remember that you can use multiple functions in one ask. If you are asked about trends between
             organizations, you can start by using the get_all_org_ids function to get a list of available organization ids
             and use them as input for other functions.
        """

client = genai.Client(api_key=GOOGLE_API_KEY)
chat = client.chats.create(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=instruction,
        tools=soc_tools,
    ),
)


In [31]:
resp = chat.send_message(f"""Verify if there was any incident for organization ID = 51 that matches sublist of at max two MITRE techniques given in the pattern {extracted_techniques_text}? Start by clearing the RAG input of any Markdown characters and then creating subsets of the pattern. At the end provide a short info if any incidents matched the pattern or not.""")


print(f"\n{resp.text}")

*Func call*: Clearing any Markdown chars from input: ```json
[
  "T1598",
  "T1656",
  "T1204",
  "T1219",
  "T1566",
  "T1621",
  "T1657",
  "T1486",
  "T1660",
  "T1583.001",
  "T1589",
  "T1078.002",
  "T1199",
  "T1566.004",
  "T1556.006",
  "T1606",
  "T1484.002",
  "T1078",
  "T1213.002",
  "T1552.001",
  "T1018",
  "T1213.003",
  "T1552.004",
  "T1083",
  "T1538",
  "T1021.007",
  "T1578.002",
  "T1648",
  "T1074",
  "T1530",
  "T1114",
  "T1136",
  "T1585.001",
  "T1567.002",
  "T1217",
  "T1539"
]
```
--- Cleaned using Regex ---
*Func call*: Evaluating MITRE techniques: ['T1598', 'T1656'] for orgId: 51
*Func call*: Retrieving all incident ids for organization id: 51
Found 5 unique incident ids.
incident id: 84683
*Func call*: Retrieving all data for a given incident id: 84683
missing technique: T1598, ending verification
pattern not found in the incident id: 84683
incident id: 328282
*Func call*: Retrieving all data for a given incident id: 328282
missing technique: T1598, end

# Summary

This step concludes my project. To summarize, this notebook showcases the use of the `Microsoft Security Incident Prediction` dataset in a security notebook that supports the work of security analysts in an MSSP organization. The dataset shows how security models could be trained while still maintaining the privacy of the data.

The configured model can use the defined functions to retrieve information from the dataset or to identify trends for SOC analysts.

In this project I also showed how genAI can draw on external OSINT resources and apply that knowledge in a simple RAG setup to support threat hunting tasks.

I hope this notebook will bring value to your research as well!