# Tutorial 4: Hunting at Scale

## Objectives

- Select threat hunting procedure(s)
- Select target tenants to run hunting procedure(s)
- Run multiple procedures in parallel
- Analyze cached results across multiple notebooks

![hunting-fan-out](images/hunting-fan-out.png)

## Description

This tutorial covers how to run threat hunting procedures concurrently across Taegis tenants.
These techniques allow threat hunting operations to scale up using notebooks.

## Step 1: Select Threat Hunting Procedure(s)

> Before we begin, please ensure that the notebook kernel is set to `taegis-hunting-tutorials`.

In the previous tutorial, we created a toy hunting procedure notebook that both describes and automates the steps to hunt for threats that manipulate Windows services to disable security controls.
We strongly recommend storing threat hunting procedure notebooks under version control software, such as `git`.
Version control helps ensure that hunters are running the same version of a hunting procedure.
Before running a threat hunting procedure, pull the most recent changes from version control.

As in the previous tutorial, we will list available threat hunting procedures:

In [1]:
import pandas as pd
import nbformat

from pathlib import Path

template_directory = Path("templates")
available_procedures = pd.DataFrame(
    {
        "notebook_path": notebook_path,
        **nb.metadata.hunting,
    }
    for notebook_path in template_directory.glob("*.ipynb")
    if (nb := nbformat.read(notebook_path, as_version=nbformat.NO_CONVERT))
    if "hunting" in nb.metadata
)

available_procedures.head()

| | notebook_path | attack_technique_ids | data_sources | description | id | tags | title |
|---|---|---|---|---|---|---|---|
| 0 | templates\tutorial-03-my-first-hunting-procedu... | [TXXXX.XXX] | [scwx.process, scwx.auth] | Here is my first hunting procedure. | tutorial-03-my-first-hunting-procedure | [tutorial] | My First Hunting Procedure |
| 1 | templates\tutorial-04-windows-service-manipula... | [T1562.001] | [scwx.process] | The goal of this procedure is to identify poss... | tutorial-04-example-procedure-abcd123 | [windows] | Suspicious Windows Service Manipulation |


Let's find procedures related to a [specific MITRE ATT&CK technique ID](https://attack.mitre.org/techniques/T1562/), then grab the procedure notebook paths and titles:

In [2]:
technique_of_interest = "T1562" # https://attack.mitre.org/techniques/T1562/

procedures_by_technique_id = available_procedures.explode("attack_technique_ids")

related_to_specific_technique = procedures_by_technique_id[
    procedures_by_technique_id.attack_technique_ids.str.contains(technique_of_interest)
][["notebook_path", "title"]].drop_duplicates()

selected_procedures = list(related_to_specific_technique.itertuples(index=False, name="Procedure"))
selected_procedures

[Procedure(notebook_path=WindowsPath('templates/tutorial-04-windows-service-manipulation.ipynb'), title=' Suspicious Windows Service Manipulation')]

We will run the procedure 'Suspicious Windows Service Manipulation' located in `templates/tutorial-04-windows-service-manipulation.ipynb` since it pertains to our MITRE ATT&CK technique of interest.

## Step 2: Select Target Tenants

Hunting procedures can be executed concurrently across multiple Taegis tenants.
To define the scope of the hunting procedure, we need to pick our desired Taegis environment and tenant IDs.

We can use Taegis Magic to conveniently pull down tenant information.
For MSSP partners, the `--filter-by-tenant-hierarchy` argument is particularly useful to enumerate all child tenants of a given parent tenant ID.

In [3]:
%load_ext taegis_magic

%taegis tenants search -h

usage: taegis_magic_parser [-h] [--assign NAME | --append NAME] [--display NAME] [--cache]

options:
  -h, --help      show this help message and exit
  --assign NAME   Assign results as pandas DataFrame to NAME
  --append NAME   Append results as pandas DataFrame to NAME
  --display NAME  Display NAME as markdown table
  --cache         Save output to cache / Load output from cache (if present)





For this tutorial, however, we will define a list of tenant IDs in the `foxtrot` environment as the scope for the selected procedure.

> You can populate `tenants_in_scope` with the environments and tenants that your user is authorized to access.

In [4]:
tenants_in_scope = {
    #"charlie": ["111111", "222222"],
    #"delta": ["333333", "444444"],
    #"echo": ["5555555", "6666666"]
    "foxtrot": ["145483", "145485", "145487"],
}
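
Before launching anything, it can help to see how many notebook runs a given scope implies. The following is a minimal sketch and not part of the original notebook: `expand_scope`, `example_procedure_titles`, and `example_scope` are hypothetical names, and the tenant IDs simply mirror the `foxtrot` example above.

```python
from typing import Dict, Iterator, List, Tuple


def expand_scope(
    procedure_titles: List[str],
    tenants_in_scope: Dict[str, List[str]],
) -> Iterator[Tuple[str, str, str]]:
    """Yield one (procedure, environment, tenant_id) tuple per planned run."""
    for title in procedure_titles:
        for environment, tenant_ids in tenants_in_scope.items():
            for tenant_id in tenant_ids:
                yield (title, environment, tenant_id)


# Hypothetical inputs that mirror the tutorial's example scope.
example_procedure_titles = ["Suspicious Windows Service Manipulation"]
example_scope = {"foxtrot": ["145483", "145485", "145487"]}

planned_runs = list(expand_scope(example_procedure_titles, example_scope))
print(f"{len(planned_runs)} notebook runs planned")
for run in planned_runs:
    print(run)
```

The number of runs grows as procedures × tenants, so checking it up front is a cheap way to catch an accidentally oversized scope.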

## Step 3: Run Hunting Procedure(s)

> **Regarding Authentication**
>
> For demonstration purposes, we are running the procedures as the logged-in user.
> If you are not already logged in, you can force a login with this command:
>
> `%taegis users current-user --region <environment-to-log-in-to>`
>
> When hunting at scale, we strongly suggest creating an OAuth client to perform hunting procedures. 
> Please see the Taegis SDK for Python [authentication docs](https://github.com/secureworks/taegis-sdk-python/blob/main/docs/authentication.md) for more details.

In the previous tutorial, we used `papermill` to parameterize a single instance of a hunting procedure.
This time we will use [`concurrent.futures`](https://docs.python.org/3/library/concurrent.futures.html) to parameterize multiple instances of the procedures and execute them in parallel across separate processes.
`papermill` will execute each notebook from top to bottom, as if the user clicked `Restart -> Run All` in the Jupyter Notebook web interface.

In [5]:
import papermill as pm

from concurrent.futures import as_completed, ProcessPoolExecutor
from pathlib import Path


with ProcessPoolExecutor(max_workers=3) as executor:
    executed_procedures = {}
    for procedure in selected_procedures:
        for taegis_environment, taegis_tenants in tenants_in_scope.items():
            for tenant_id in taegis_tenants:

                output_dir = Path("output") / taegis_environment / tenant_id
                output_dir.mkdir(exist_ok=True, parents=True)
                output_notebook_path = str(output_dir / procedure.notebook_path.name)

                executed_procedures[
                    executor.submit(
                        pm.execute_notebook,
                        input_path=procedure.notebook_path,
                        output_path=output_notebook_path,
                        parameters=dict(
                            TAEGIS_TENANT_ID=tenant_id,
                            TAEGIS_ENVIRONMENT=taegis_environment,
                            INVESTIGATION_TITLE=procedure.title,
                            TAEGIS_MAGIC_NOTEBOOK_PATH=output_notebook_path,
                        ),
                        request_save_on_cell_execute=True,
                        cwd=output_dir,
                    )
                ] = (procedure, taegis_environment, tenant_id)

    # Wait for the notebooks to finish running
    # and report exceptions as they occur.
    for i, executed_procedure in enumerate(as_completed(executed_procedures), start=1):
        procedure_scope = executed_procedures[executed_procedure]
        print(f"{i}/{len(executed_procedures)} {procedure_scope} Finished!")
        exc = executed_procedure.exception()
        if exc:
            print(procedure_scope, exc)

1/3 (Procedure(notebook_path=WindowsPath('templates/tutorial-04-windows-service-manipulation.ipynb'), title=' Suspicious Windows Service Manipulation'), 'foxtrot', '145483') Finished!
2/3 (Procedure(notebook_path=WindowsPath('templates/tutorial-04-windows-service-manipulation.ipynb'), title=' Suspicious Windows Service Manipulation'), 'foxtrot', '145485') Finished!
3/3 (Procedure(notebook_path=WindowsPath('templates/tutorial-04-windows-service-manipulation.ipynb'), title=' Suspicious Windows Service Manipulation'), 'foxtrot', '145487') Finished!
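
The loop above prints failures but does not keep them. Here is a minimal sketch of the same submit/`as_completed` pattern that also collects failed scopes for a later retry. To stay runnable anywhere, it swaps in `ThreadPoolExecutor` and a stand-in function (`fake_run`, hypothetical) for `ProcessPoolExecutor` and `pm.execute_notebook`; the future-to-scope bookkeeping is the same for both executors.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_run(environment: str, tenant_id: str) -> str:
    """Stand-in for pm.execute_notebook; fails for one hypothetical tenant."""
    if tenant_id == "000000":
        raise RuntimeError("simulated notebook failure")
    return f"output/{environment}/{tenant_id}"


# The tenant ID "000000" is made up to trigger the simulated failure.
scopes = [("foxtrot", "145483"), ("foxtrot", "145485"), ("foxtrot", "000000")]
succeeded, failed = [], []

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each future back to the scope it was submitted for.
    futures = {executor.submit(fake_run, env, tid): (env, tid) for env, tid in scopes}

    for future in as_completed(futures):
        scope = futures[future]
        exc = future.exception()
        if exc is None:
            succeeded.append(scope)
        else:
            failed.append((scope, repr(exc)))

print(f"succeeded: {sorted(succeeded)}")
print(f"failed:    {failed}")
```

Keeping the failed scopes in a list makes it straightforward to resubmit only those tenants instead of the whole batch.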


As `papermill` executes each notebook, it saves the result to the `output` directory, partitioned by environment and tenant ID:

```
/output
    /foxtrot
        /145483
            tutorial-04-windows-service-manipulation.ipynb
        /145485
            tutorial-04-windows-service-manipulation.ipynb
        /145487
            tutorial-04-windows-service-manipulation.ipynb
```
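
Because the directory layout encodes the environment and tenant ID, both can be recovered from an output path alone. The sketch below illustrates this under the assumption that every executed notebook sits exactly at `output/<environment>/<tenant_id>/<notebook>.ipynb`; `RunLocation` and `parse_output_path` are hypothetical helpers.

```python
from pathlib import Path
from typing import NamedTuple


class RunLocation(NamedTuple):
    environment: str
    tenant_id: str
    notebook_name: str


def parse_output_path(notebook_path: Path, output_root: Path) -> RunLocation:
    """Split <root>/<environment>/<tenant_id>/<notebook>.ipynb into its parts.

    Raises ValueError if the path is outside the root or has a different depth.
    """
    environment, tenant_id, notebook_name = notebook_path.relative_to(output_root).parts
    return RunLocation(environment, tenant_id, notebook_name)


example = parse_output_path(
    Path("output/foxtrot/145483/tutorial-04-windows-service-manipulation.ipynb"),
    Path("output"),
)
print(example.environment, example.tenant_id)
```

Notebooks stored directly under `output`, such as one left over from an earlier tutorial, do not match the layout and raise `ValueError`, which is a convenient signal to skip them.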

## Step 4: Review Results

It is helpful to review the executed notebooks for a given procedure in aggregate.
This gives the hunter an opportunity to survey the success rate of the procedure (how many completed successfully relative to the number of tenants in scope).
It also identifies tenants where the procedure may have returned the maximum number of query results, in which case refinement of the procedure might be necessary.
But most importantly, it may tell threat hunters which tenants need manual analysis and which do not.

**Performance Metrics**

`papermill` conveniently adds metadata to track execution times and whether or not an exception was raised during execution.
We can parse the metadata from the executed notebooks to see how well the procedure ran across our target tenants.

In [6]:
# Our chosen output directory
output_directory = Path("output")

executed_notebooks_df = pd.json_normalize(
    {
        "notebook_path": notebook_path,
        **nb.metadata,
    }
    for notebook_path in output_directory.rglob("*.ipynb")
    # Skip Jupyter checkpoint copies before reading them.
    if ".ipynb_checkpoints" not in str(notebook_path)
    if (nb := nbformat.read(notebook_path, as_version=nbformat.NO_CONVERT))
)

executed_notebooks_df[["notebook_path", "papermill.exception", "papermill.start_time", "papermill.end_time"]]

| | notebook_path | papermill.exception | papermill.start_time | papermill.end_time |
|---|---|---|---|---|
| 0 | output\instance-of-my-first-hunting-procedure.... | | | |
| 1 | output\foxtrot\145483\tutorial-04-windows-serv... | | 2023-09-13T14:49:20.548323 | 2023-09-13T14:49:59.498138 |
| 2 | output\foxtrot\145485\tutorial-04-windows-serv... | | 2023-09-13T14:49:20.550323 | 2023-09-13T14:50:00.277414 |
| 3 | output\foxtrot\145487\tutorial-04-windows-serv... | | 2023-09-13T14:49:20.548323 | 2023-09-13T14:50:00.895200 |


In this case, all three notebooks ran without exception across the tenants in scope.
Notebooks that raise exceptions usually require manual review.
Threat hunters can simply open the problematic notebook and view the stack trace, as if they ran into the exception interactively.
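
The start and end times can also be turned into per-notebook durations. This is a small sketch using only the standard library and the timestamps from the table above; `duration_seconds` is a hypothetical helper.

```python
from datetime import datetime

# Start and end times copied from the table above, keyed by tenant ID.
runs = {
    "145483": ("2023-09-13T14:49:20.548323", "2023-09-13T14:49:59.498138"),
    "145485": ("2023-09-13T14:49:20.550323", "2023-09-13T14:50:00.277414"),
    "145487": ("2023-09-13T14:49:20.548323", "2023-09-13T14:50:00.895200"),
}


def duration_seconds(start_time: str, end_time: str) -> float:
    """Return the elapsed seconds between two ISO 8601 timestamps."""
    elapsed = datetime.fromisoformat(end_time) - datetime.fromisoformat(start_time)
    return elapsed.total_seconds()


durations = {tenant: duration_seconds(*times) for tenant, times in runs.items()}
for tenant, seconds in durations.items():
    print(f"{tenant}: {seconds:.1f}s")
```

Each notebook took roughly 39 to 40 seconds, and all three started within a few milliseconds of each other, so the whole batch finished in about the time of the slowest notebook.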

> `papermill` also adds runtime metadata to each cell.
> Reviewing cell-level metadata is helpful for identifying problematic queries or logic across many procedure results.
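
Here is a sketch of how the cell-level metadata might be surveyed. It assumes each executed cell carries a `papermill` metadata entry with `duration` and `exception` keys; verify this against your own executed notebooks. The notebook below is a hand-written stand-in for a loaded `.ipynb` file, and `slowest_cells` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple


def slowest_cells(notebook: Dict[str, Any], top_n: int = 3) -> List[Tuple[int, float, bool]]:
    """Return (cell_index, duration_seconds, raised_exception) for the slowest cells."""
    timings = []
    for index, cell in enumerate(notebook.get("cells", [])):
        pm_meta = cell.get("metadata", {}).get("papermill", {})
        duration = pm_meta.get("duration")
        if duration is None:  # skip cells with no recorded duration
            continue
        timings.append((index, float(duration), bool(pm_meta.get("exception"))))
    return sorted(timings, key=lambda item: item[1], reverse=True)[:top_n]


# Hand-written stand-in for the JSON content of an executed notebook.
example_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}},
        {"cell_type": "code", "metadata": {"papermill": {"duration": 0.4, "exception": False}}},
        {"cell_type": "code", "metadata": {"papermill": {"duration": 31.2, "exception": False}}},
        {"cell_type": "code", "metadata": {"papermill": {"duration": 2.5, "exception": True}}},
    ]
}

print(slowest_cells(example_notebook, top_n=2))
```

Running the same helper over every executed notebook shows whether one query cell is consistently slow or failing across tenants.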

**Cached Query Results**

If you open one of the executed notebooks, it looks the same as if you had run it manually.

You can re-execute the notebook to analyze the results, but you might notice that Taegis Magic queries containing the `--cache` flag return _very_ quickly.
This is because the magics cached those results when the notebook was executed previously by `papermill`.
Re-running a cached query cell will return the cached results rather than perform the query again.
Editing a query cell that has cached results invalidates its cache: the next run discards the cached data and fetches fresh results from Taegis.

This caching behavior is the primary mechanism that enables per-tenant hunting procedures to scale across many tenants.
Each hunting procedure is executed, without human interaction, once for every tenant in scope.
All query results are cached into the respective output notebook for a given combination of procedure and tenant.

To take a closer look at the executed notebooks, we can extract all the cached query results:

In [7]:
from taegis_magic.core.cache import get_cached_objects, get_cache_list

cached_results = []

for executed_notebook_path in output_directory.rglob("*.ipynb"):

    cache_ids = [cache_id for cache_id in get_cache_list(executed_notebook_path) if cache_id[0]]
    cache_data = get_cached_objects(executed_notebook_path)
    cached_results.extend([

        {
            "notebook_path": executed_notebook_path,
            "tenant_id": cached_object.tenant_id,
            "environment": cached_object.region,
            "service": cached_object.service,
            "cache_name": cache_name,
            "cache_hash": cache_hash,
            "query": cached_object.query,
            "num_results": len(cached_object.results),

            # The memory requirements may exceed available resources
            # when running over a very large number of tenants.
            # Exclude as necessary.
            "results_as_dataframe": pd.json_normalize(cached_object.results)
        }
        for (cache_name, cache_hash), cached_object in zip(cache_ids, cache_data)

    ])

cached_results_df = pd.DataFrame(cached_results)
cached_results_df[["environment", "tenant_id", "cache_name", "service", "num_results"]]

| | environment | tenant_id | cache_name | service | num_results |
|---|---|---|---|---|---|
| 0 | foxtrot | 145483 | sc_alerts | alerts | 4 |
| 1 | foxtrot | 145483 | sc_processes | events | 172 |
| 2 | foxtrot | 145485 | sc_alerts | alerts | 4 |
| 3 | foxtrot | 145485 | sc_processes | events | 146 |
| 4 | foxtrot | 145487 | sc_alerts | alerts | 4 |
| 5 | foxtrot | 145487 | sc_processes | events | 222 |


All three tenants in scope returned roughly 150 to 225 results for the `sc_processes` events query.
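
Result counts are also where truncated queries show up: a count that equals the query limit suggests the procedure needs refinement for that tenant. The sketch below shows one way to flag such rows. The limit value, the tenant IDs, and the counts are all made up for illustration, and `possibly_truncated` is a hypothetical helper.

```python
from typing import Dict, List

# Hypothetical cap; substitute the limit your queries actually use.
ASSUMED_RESULT_LIMIT = 10_000

# Made-up summary rows shaped like cached_results_df above.
summary_rows = [
    {"tenant_id": "111111", "cache_name": "sc_processes", "num_results": 10_000},
    {"tenant_id": "111111", "cache_name": "sc_alerts", "num_results": 4},
    {"tenant_id": "222222", "cache_name": "sc_processes", "num_results": 57},
]


def possibly_truncated(rows: List[Dict], limit: int) -> List[Dict]:
    """Return rows whose result count reached the limit and may be incomplete."""
    return [row for row in rows if row["num_results"] >= limit]


for row in possibly_truncated(summary_rows, ASSUMED_RESULT_LIMIT):
    print(f"{row['tenant_id']} / {row['cache_name']}: refine the query")
```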

**Aggregate Analysis**

Lastly, we can analyze the actual query results as DataFrames.
Hunters can filter down to a specific cached query result:

In [8]:
cached_results_df[
    (cached_results_df.tenant_id == "145483") &
    (cached_results_df.cache_name == "sc_processes")
].iloc[0].results_as_dataframe[["resource_id", "event_time_usec", "image_path", "commandline"]].head()

| | resource_id | event_time_usec | image_path | commandline |
|---|---|---|---|---|
| 0 | event://priv:scwx.process:145483:1694526736000... | 1694523086252289 | \Device\HarddiskVolume2\Windows\System32\sc.exe | sc stop wuauserv |
| 1 | event://priv:scwx.process:145483:1694523904000... | 1694523086252289 | \Device\HarddiskVolume2\Windows\System32\sc.exe | sc stop wuauserv |
| 2 | event://priv:scwx.process:145483:1694524756000... | 1694523086252289 | \Device\HarddiskVolume2\Windows\System32\sc.exe | sc stop wuauserv |
| 3 | event://priv:scwx.process:145483:1694526291000... | 1694523086252289 | \Device\HarddiskVolume2\Windows\System32\sc.exe | sc stop wuauserv |
| 4 | event://priv:scwx.process:145483:1694527486000... | 1694523086252289 | \Device\HarddiskVolume2\Windows\System32\sc.exe | sc stop wuauserv |


While hunting procedures are designed and executed on a per-tenant basis, threat hunters can review the query results for a cached query in a given procedure across all tenants at once.
This is a powerful technique to get a "macro" view of the data across many tenants.

In [9]:
cached_sc_processes_df = cached_results_df[cached_results_df.cache_name == "sc_processes"]
concatenated_sc_processes = pd.concat([
    df
    for df in cached_sc_processes_df.results_as_dataframe.values
], ignore_index=True)

concatenated_sc_processes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 540 entries, 0 to 539
Data columns (total 92 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   allocations                              0 non-null      object 
 1   commandline                              540 non-null    object 
 2   commandline_decoded                      540 non-null    object 
 3   commandline_decoder                      0 non-null      object 
 4   computer_name                            540 non-null    object 
 5   enrichSummary                            540 non-null    object 
 6   enrichedHostnames                        0 non-null      object 
 7   event_time_fidelity                      540 non-null    object 
 8   event_time_usec                          540 non-null    int64  
 9   external_uris                            0 non-null      object 
 10  gid                                      540 non-n

In [10]:
concatenated_sc_processes.tenant_id.value_counts()

tenant_id
145487    222
145483    172
145485    146
Name: count, dtype: int64
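
One way to use the combined frame is to stack command lines across tenants and look at the rarest ones first. This is an illustrative sketch rather than a step from the original procedure: the rows below are made up (including `ExampleService`), and only the `tenant_id` and `commandline` column names come from the results above.

```python
import pandas as pd

# Made-up rows; only the column names come from the results above.
example_events = pd.DataFrame(
    {
        "tenant_id": ["145483", "145483", "145485", "145487", "145487"],
        "commandline": [
            "sc stop wuauserv",
            "sc stop wuauserv",
            "sc stop wuauserv",
            "sc stop wuauserv",
            "sc config ExampleService start= disabled",
        ],
    }
)

# For each command line, count events and the number of distinct tenants,
# then sort so that the rarest command lines come first.
commandline_summary = (
    example_events.groupby("commandline")
    .agg(events=("tenant_id", "count"), tenants=("tenant_id", "nunique"))
    .sort_values(["tenants", "events"])
)
print(commandline_summary)
```

Command lines seen in only one tenant float to the top, which is often a reasonable place to start manual review.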

## Step 5: Assign for Final Disposition

The previous step shows how to review the aggregate results of a hunting procedure across all tenants in scope.
The goal of that step was to identify which tenants require manual analysis due to either problems with the automated procedure or potential evidence of a threat that warrants further review.
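
That split can be sketched as a simple triage function. The rule used here (any exception or any result means manual review) is a deliberately simple assumption, and the per-tenant summaries, field names, and `triage` helper are hypothetical; a real procedure would apply its own criteria.

```python
from typing import Dict, List, Tuple

# Made-up per-tenant summaries; field names are hypothetical.
run_summaries = [
    {"tenant_id": "111111", "exception": None, "num_results": 0},
    {"tenant_id": "222222", "exception": None, "num_results": 172},
    {"tenant_id": "333333", "exception": "PapermillExecutionError", "num_results": 0},
]


def triage(summaries: List[Dict]) -> Tuple[List[str], List[str]]:
    """Split tenants into (manual_review, null_findings_candidates)."""
    manual_review, null_findings = [], []
    for summary in summaries:
        # Assumed rule: a failed run or any query result needs a human look.
        needs_review = summary["exception"] is not None or summary["num_results"] > 0
        (manual_review if needs_review else null_findings).append(summary["tenant_id"])
    return manual_review, null_findings


manual_review, null_findings = triage(run_summaries)
print("manual review:", manual_review)
print("null findings candidates:", null_findings)
```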

For tenants that require manual analysis, threat hunters are provided with copies of the executed notebook.
The hunters can then interactively run the notebooks and determine which evidence should be added to an investigation and document their findings within the notebook.
Lastly, hunters should create Taegis investigations from the executed notebook.

For tenants that do not require manual analysis, it is helpful to automate the creation of a "null findings" investigation.
Null findings investigations demonstrate to customers that the steps of the hunting procedure were performed and that no evidence of the given threat was found in the data.
Automating the null findings investigations is outside the scope of this tutorial.

## Wrap-Up

In this tutorial, we ran a hunting procedure across multiple Taegis tenants and reviewed the results in aggregate.
These techniques can be adapted to scale out even further by running the notebooks in serverless functions and saving the results to blob storage.